Problem description
Is it possible to scrape content with PHP after it has been loaded via Ajax? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for any advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the Ajax calls.
Something like Selenium should do the trick. A quick Google search turns up numerous examples of how to set it up.
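As a rough illustration of the Selenium route from PHP, here is a minimal sketch. It assumes the third-party php-webdriver/webdriver Composer package and a Selenium server listening on http://localhost:4444; neither is mentioned above, and the function name renderPage and the CSS selector argument are hypothetical.

```php
<?php
// Sketch only: assumes `composer require php-webdriver/webdriver` and that the
// caller has already loaded vendor/autoload.php.
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

/**
 * Load $url in a real browser, wait for an element that the Ajax call
 * creates, and return the rendered HTML.
 * The default server address is an assumption; adjust it to your setup.
 */
function renderPage(string $url, string $waitForCss, string $server = 'http://localhost:4444'): string
{
    $driver = RemoteWebDriver::create($server, DesiredCapabilities::chrome());
    try {
        $driver->get($url);
        // Block (up to 10 seconds) until the dynamic element is in the DOM.
        $driver->wait(10)->until(
            WebDriverExpectedCondition::presenceOfElementLocated(
                WebDriverBy::cssSelector($waitForCss)
            )
        );
        return $driver->getPageSource();
    } finally {
        // Always close the browser session, even if the wait times out.
        $driver->quit();
    }
}
```

The returned HTML string could then be handed to SIMPLE_HTML_DOM (for example through its str_get_html() function) just like a static page.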
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an Ajax call using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools console.
The dev tools panel should then open. Hit the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter for specific requests.
For example, if you are interested in Ajax calls, you can select XHR.
You can then click on any of the items listed in the table section to get more information.
Using file_get_contents on the Ajax call
Depending on how robust the APIs behind these Ajax calls are, you could do something like the following.
<?php
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the response is JSON, then add:
$data = json_decode($content);
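A slightly more defensive version of the same idea might look like the sketch below. The function names fetchAjax and decodeJson are made up for illustration, and the request headers are assumptions: some endpoints only answer when the request looks like an XHR, so copy the real header values from the Network tab.

```php
<?php
/**
 * Fetch an Ajax endpoint roughly the way a browser's XHR would.
 * The header values are assumptions; copy the real ones from the Network tab.
 */
function fetchAjax(string $url): string
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "X-Requested-With: XMLHttpRequest\r\nAccept: application/json\r\n",
            'timeout' => 10, // seconds
        ],
    ]);
    $content = file_get_contents($url, false, $context);
    if ($content === false) {
        throw new RuntimeException("Request failed: $url");
    }
    return $content;
}

/** Decode a JSON body into an associative array, throwing on malformed JSON. */
function decodeJson(string $body): array
{
    $data = json_decode($body, true, 512, JSON_THROW_ON_ERROR); // needs PHP 7.3+
    if (!is_array($data)) {
        throw new UnexpectedValueException('Expected a JSON object or array');
    }
    return $data;
}

// Example, using the placeholder URL from the snippet above:
// $data = decodeJson(fetchAjax('http://www.example.com/test.php?ajax=call'));
```

Passing true as the second argument to json_decode yields associative arrays instead of stdClass objects, which tends to be easier to loop over.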
However, you are going to have to do this for each AJAX request on a site. Beyond that you are going to have to use a solution similar to the ones presented [here].
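Repeating the fetch for each Ajax request can be wrapped in a small loop. This is only a sketch: scrapeEndpoints is a hypothetical helper, the endpoint list is invented, and it assumes every endpoint returns JSON.

```php
<?php
/**
 * Fetch and decode several Ajax endpoints.
 * $fetch is any callable that takes a URL and returns the response body
 * (or false on failure), which keeps the loop testable without a network.
 *
 * @param array<string,string> $endpoints label => URL
 * @return array<string,mixed> label => decoded JSON, or null if the call
 *                             failed or the body was not valid JSON
 */
function scrapeEndpoints(array $endpoints, callable $fetch): array
{
    $results = [];
    foreach ($endpoints as $label => $url) {
        $body = $fetch($url);
        // file_get_contents returns false on failure; json_decode returns
        // null when the body is not valid JSON.
        $results[$label] = is_string($body) ? json_decode($body, true) : null;
    }
    return $results;
}

// Example with a hypothetical endpoint:
// $data = scrapeEndpoints(
//     ['news' => 'http://www.example.com/test.php?ajax=news'],
//     'file_get_contents'
// );
```

Taking the fetcher as a parameter is a design choice for this sketch; plain file_get_contents can be passed in directly, as in the commented example.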
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific Ajax calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses Ajax to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM.