Problem description
I'm using PHP and cURL to scrape a web page, but the page is poorly designed (there are no classes or ids on the tags). So I need to search for specific text, go to the tag holding it (e.g. <p>), then move to the next child (or the next <p>) and get its text.
There are various things I need to get from the page, some of them being the text inside an <a onclick="get this stuff here">. So basically I think I need to use cURL to pull the source code into a PHP variable, and then use PHP to parse it and find what I need.
Does this sound like the best way to do this? Does anyone have any pointers, or can anyone demonstrate how to put the source code from cURL into a variable?
Thanks!
EDIT (Working/Current Code)
<?php
class Scrape
{
    public $cookies = 'cookies.txt';
    private $user = null;
    private $pass = null;

    /* Data generated from cURL */
    public $content = null;
    public $response = null;

    /* Links */
    private $url = array(
        'login'  => 'https://website.com/login.jsp',
        'submit' => 'https://website.com/LoginServlet',
        'page1'  => 'https://website.com/page1',
        'page2'  => 'https://website.com/page2',
        'page3'  => 'https://website.com/page3'
    );

    /* Fields */
    public $data = array();

    public function __construct($user, $pass)
    {
        $this->user = $user;
        $this->pass = $pass;
    }

    public function login()
    {
        $this->cURL($this->url['login']);
        if ($form = $this->getFormFields($this->content, 'login'))
        {
            $form['login'] = $this->user;
            $form['password'] = $this->pass;
            // echo "<pre>".print_r($form,true);exit;
            $this->cURL($this->url['submit'], $form);
            //echo $this->content;//exit;
        }
        //echo $this->content;//exit;
    }

    // NEW TESTING
    public function loadPage($page)
    {
        $this->cURL($this->url[$page]);
        echo $this->content;//exit;
    }
    /* Scan for the form whose name attribute matches $id */
    private function getFormFields($data, $id)
    {
        // preg_quote() keeps special characters in $id from altering the pattern.
        if (preg_match('/(<form.*?name=.?'.preg_quote($id, '/').'.*?<\/form>)/is', $data, $matches)) {
            $inputs = $this->getInputs($matches[1]);
            return $inputs;
        } else {
            return false;
        }
    }

    /* Get inputs in form as a name => value array */
    private function getInputs($form)
    {
        $inputs = array();
        $elements = preg_match_all('/(<input[^>]+>)/is', $form, $matches);
        if ($elements > 0) {
            for ($i = 0; $i < $elements; $i++) {
                $el = preg_replace('/\s{2,}/', ' ', $matches[1][$i]);
                if (preg_match('/name=(?:["\'])?([^"\'\s]*)/i', $el, $m)) {
                    $name = $m[1];
                    // Default to an empty string; a separate match variable keeps
                    // preg_match() from overwriting $value with an array.
                    $value = '';
                    if (preg_match('/value=(?:["\'])?([^"\']*)/i', $el, $m)) {
                        $value = $m[1];
                    }
                    $inputs[$name] = $value;
                }
            }
        }
        return $inputs;
    }
    /* Perform a cURL request to the given URL; sends a POST when $post holds form fields */
    public function cURL($url, $post = false)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
        // "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
        // Return the response body from curl_exec() instead of printing it.
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // Certificate checks are disabled here; acceptable for testing only.
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($ch, CURLOPT_VERBOSE, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        // Store and resend cookies so the login session carries over between requests.
        curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookies);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookies);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
        curl_setopt($ch, CURLOPT_TIMEOUT, 120);
        if ($post) // if post is needed
        {
            curl_setopt($ch, CURLOPT_POST, 1);
            curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
        }
        curl_setopt($ch, CURLOPT_URL, $url);
        // The page source ends up in $this->content as a plain string.
        $this->content = curl_exec($ch);
        $this->response = curl_getinfo($ch);
        $this->url['last_url'] = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
        curl_close($ch);
    }
}
$sc = new Scrape('user','pass');
$sc->login();
$sc->loadPage('page1');
echo "<h1>TESTTESTEST</h1>";
$sc->loadPage('page2');
echo "<h1>TESTTESTEST</h1>";
$sc->loadPage('page3');
echo "<h1>TESTTESTEST</h1>";
(Note: credit to @Ramz, "scrape a website with secured login".)
Recommended answer
I suggest you use a ready-made scraper. I use Goutte (https://github.com/FriendsOfPHP/Goutte), which lets me load a site's content and traverse it the same way you would with jQuery. For example, if I want the content of <div id="content">, I use $crawler->filter('#content')->text(), where $crawler is the object returned by $client->request().
It even lets me find and 'click' links and submit forms to retrieve and process the content.
It makes life so much easier than using cURL or file_get_contents() and working your way through the HTML manually.
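If installing a library is not an option, the "find the text, then take the next <p>" step from the question can also be sketched with PHP's built-in DOMDocument and DOMXPath. This is only a minimal sketch under assumptions: the sample markup and the label 'Account balance' are made up for illustration, and in the question's class the HTML string would come from $sc->content instead.

```php
<?php
// Hypothetical sample markup standing in for the scraped page source;
// with the Scrape class above this would be the string in $sc->content.
$html = <<<'HTML'
<html><body>
<p>Account balance</p>
<p>$42.00</p>
<a onclick="get this stuff here">details</a>
</body></html>
HTML;

$doc = new DOMDocument();
// Real pages are rarely valid HTML, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Find the <p> containing the label text, then step to the next <p> sibling.
$label = 'Account balance'; // assumed label, not from the original page
$nodes = $xpath->query(
    "//p[contains(normalize-space(.), '$label')]/following-sibling::p[1]"
);
$value = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

// Read the onclick attribute of the first <a> that has one.
$links = $xpath->query('//a[@onclick]');
$onclick = $links->length > 0 ? $links->item(0)->getAttribute('onclick') : null;

echo $value, "\n";
echo $onclick, "\n";
```

The XPath expression does the navigation the question describes, so no ids or classes are needed; a label containing a single quote would need escaping before being placed in the expression.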