Scraping a dynamically loaded website with PHP cURL

Problem description

I am new to scraping and have scraped two websites so far. The problem appeared when I tried to scrape a dynamically loaded website: when the page content is rendered with JavaScript, I am unable to scrape it.

Is there any way to scrape the contents of such a website using PHP cURL or any other PHP client?

This is what I have done so far:

$link = "https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=android+developer&sc.keyword=android+developer&locT=N&locId=192&jobType=";

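$ch = curl_init();
// Disabling peer verification is insecure; kept from the original for local testing only.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL, $link);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13");
$data = curl_exec($ch);

// Stop early if the request failed or returned nothing to parse.
if ($data === false || $data === '') {
    die("cURL error: " . curl_error($ch));
}

// Parse the raw response. Only markup present in the initial HTML is available here;
// anything the page adds later with JavaScript is missing.
libxml_use_internal_errors(true); // suppress warnings caused by malformed HTML
$document = new DOMDocument();
$document->loadHTML($data);
$elements = $document->getElementsByTagName("div");

foreach ($elements as $element) {
    echo $element->nodeValue . "<br>";
}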

Recommended answer

You need a headless browser for this. You can use the PHP wrapper for PhantomJS, available at http://jonnnnyw.github.io/php-phantomjs/. This should solve your problem. It has the following features:

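  • Load web pages through the PhantomJS headless browser
  • View detailed response data, including page content, headers, status code, etc.
  • Handle redirects
  • View JavaScript console errors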
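
As a rough illustration of the approach, the sketch below fetches a page through the wrapper and then runs the same div extraction as in the question on the rendered HTML. It assumes the package has been installed with Composer (`jonnyw/php-phantomjs`), that the calling script has already loaded the Composer autoloader, and that the wrapper can locate the PhantomJS executable. The function names `fetchRenderedHtml` and `extractDivTexts` are hypothetical helpers introduced here, not part of the library.

```php
<?php
// Sketch only: assumes the Composer autoloader is already loaded by the caller
// and that the jonnyw/php-phantomjs package is installed.
use JonnyW\PhantomJs\Client;

/**
 * Fetch a page through PhantomJS so that JavaScript runs before the HTML is returned.
 * Returns null when the response status is not 200.
 * (Hypothetical helper name, not part of the library.)
 */
function fetchRenderedHtml(string $url): ?string
{
    $client   = Client::getInstance();
    $request  = $client->getMessageFactory()->createRequest($url, 'GET');
    $response = $client->getMessageFactory()->createResponse();

    // PhantomJS loads the page and fills in the response object.
    $client->send($request, $response);

    return $response->getStatus() === 200 ? $response->getContent() : null;
}

/**
 * Collect the trimmed, non-empty text of every <div> in an HTML string.
 * (Hypothetical helper mirroring the loop in the question.)
 */
function extractDivTexts(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    libxml_use_internal_errors(true); // suppress warnings from malformed HTML
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();

    $texts = [];
    foreach ($document->getElementsByTagName('div') as $div) {
        $text = trim($div->textContent);
        if ($text !== '') {
            $texts[] = $text;
        }
    }

    return $texts;
}
```

With the autoloader in place, calling `fetchRenderedHtml($link)` and passing a non-null result to `extractDivTexts()` would replace the cURL and DOMDocument steps from the question. Keeping the parsing in its own function means it can be checked against a static HTML string without a browser.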

Hope this helps.


09-05 12:27