Question
I've written a script in PHP to scrape a title, visible as "hair fall shamboo", from a webpage. When I execute the script below, I get the following error:
Script I've tried with:
>
Although the XPath I used in the script above seems to be correct, here is the relevant portion of the HTML in which the title can be found:
    <h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>

PostScript: The title I wish to parse gets loaded dynamically. As I'm new to PHP, I don't know whether my approach is correct. If it isn't, what should I do instead?
I got success using JavaScript:

    const puppeteer = require('puppeteer');

    function run () {
        return new Promise(async (resolve, reject) => {
            try {
                const browser = await puppeteer.launch();
                const page = await browser.newPage();
                await page.goto("https://www.purplle.com/search?q=hair%20fall%20shamboo");
                let urls = await page.evaluate(() => {
                    let items = document.querySelector('h1.br-hdng span');
                    return items.innerText;
                })
                browser.close();
                return resolve(urls);
            } catch (e) {
                return reject(e);
            }
        })
    }
    run().then(console.log).catch(console.error);

Again, I got success using Python:
    import requests_html

    with requests_html.HTMLSession() as session:
        r = session.get('https://www.purplle.com/search?q=hair%20fall%20shamboo')
        r.html.render()
        item = r.html.find("h1.br-hdng span", first=True).text
        print(item)

What's wrong with PHP, then?
Recommended answer
It could very well be that there are more issues with your code than I have covered in this answer, but the most prominent issue I see is the following:
DOMDocument::loadHTML() is not a static method, but an instance method (which returns a boolean). You should first create an instance of DOMDocument and then call loadHTML() on that instance:
    $dom = new DOMDocument;
    $dom->loadHTML($xml);

However, since you have suppressed errors with the @ operator on that particular line, you are not receiving a warning about this. And although the error suppression operator @ is very commonly used to silence HTML validation errors like this, you should look into using libxml_use_internal_errors() instead, as this does not suppress general PHP errors.

    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $dom->loadHTML($xml);
    libxml_use_internal_errors($oldSetting);
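To see what libxml_use_internal_errors() actually buffers, the collected errors can be read back with libxml_get_errors() and discarded with libxml_clear_errors(). The following is only a minimal sketch: the helper name loadHtmlCollectingErrors() and the sample markup are made up for illustration, and whether any errors are reported at all depends on the libxml version PHP was built against.

```php
<?php
// Hypothetical helper (not part of the original answer): parse HTML and
// return the document together with whatever libxml complained about.
function loadHtmlCollectingErrors(string $html): array
{
    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true); // buffer libxml errors instead of raising warnings
    $dom->loadHTML($html);
    $errors = libxml_get_errors();                  // array of LibXMLError objects
    libxml_clear_errors();                          // empty the buffer so it does not keep growing
    libxml_use_internal_errors($oldSetting);        // restore the previous setting
    return [$dom, $errors];
}

// Sample markup (an assumption); HTML5 tags such as <section> are a common
// source of libxml complaints on older libxml versions.
$html = '<html><body><section><h1 class="br-hdng"><span>hair fall shamboo</span></h1></section></body></html>';
[$dom, $errors] = loadHtmlCollectingErrors($html);

foreach ($errors as $error) {
    printf("libxml (line %d): %s\n", $error->line, trim($error->message));
}
echo $dom->getElementsByTagName('span')->item(0)->textContent, "\n";
```

Because only libxml's own diagnostics are buffered, a genuine PHP error (such as calling loadHTML() statically) would still surface, which is the advantage over @.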
As a final note:

It's possible to load a DOM document from a URL directly (without the need for cURL) with DOMDocument::loadHTMLFile(), if your PHP installation is configured to allow loading of URLs via the configuration setting allow_url_fopen. Be aware, though, that this setting is often disabled for security reasons, so use it with care if you plan on using it.
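A rough sketch of that approach is shown here. The helper loadDocument() is hypothetical, and the demonstration uses a temporary local file so that it runs without network access; passing an http(s) URL instead would additionally require allow_url_fopen to be enabled.

```php
<?php
// Hypothetical helper: load a document from a local path or a URL
// using DOMDocument::loadHTMLFile(). Returns null on failure.
function loadDocument(string $source): ?DOMDocument
{
    $isUrl = (bool) preg_match('#^https?://#i', $source);

    // Remote sources can only be opened when allow_url_fopen is enabled.
    if ($isUrl && !filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return null;
    }

    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTMLFile($source);
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    return $loaded ? $dom : null;
}

// Offline demonstration: write sample markup to a temporary file and load it.
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents($path, '<html><body><h1 class="br-hdng"><span>hair fall shamboo</span></h1></body></html>');

$dom = loadDocument($path);
echo $dom->getElementsByTagName('span')->item(0)->textContent, "\n";

unlink($path);
```

Note that loadHTMLFile() only sees the markup the server sends; it does not run any JavaScript on the page.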
Here's a simple test-case which should work as expected:
    <?php
    $html = '
    <html>
    <head>
      <title>DOMDocument test-case</title>
    </head>
    <body>
      <div class="dummy-container">
        <h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>
      </div>
    </body>
    </html>';

    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $dom->loadHTML( $html );
    libxml_use_internal_errors($oldSetting);

    $xpath = new DOMXPath( $dom );
    $title = $xpath->query( '//h1[@class="br-hdng"]/span' )->item( 0 )->nodeValue;
    echo $title;

You should replace the contents of $html with the output of your get_content() call. If it doesn't work, then either:
there's something wrong with fetching the HTML with cURL (do var_dump( $html ); before loading it into DOMDocument, for instance, to see the contents you retrieved), or...
perhaps you are working inside a namespace, in which case you should prepend a backslash to DOMDocument and DOMXPath, i.e.: new \DOMDocument; and new \DOMXPath( $dom );.
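Both points can be combined into one defensive sketch. The namespace App\Scraper and the function extractTitle() are hypothetical names chosen for illustration. The function returns null instead of failing when the XPath matches nothing, which is what you would see if the fetched markup does not contain the h1 (for instance, assuming the page only fills it in via JavaScript, as the question suggests).

```php
<?php
namespace App\Scraper; // hypothetical namespace, for illustration only

// Returns the trimmed title text, or null when nothing usable was found.
function extractTitle(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing was fetched, so there is nothing to parse
    }

    $dom = new \DOMDocument;                       // leading backslash: global class
    $oldSetting = \libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    \libxml_clear_errors();
    \libxml_use_internal_errors($oldSetting);

    $xpath = new \DOMXPath($dom);
    $nodes = $xpath->query('//h1[@class="br-hdng"]/span');

    // item(0) would be null when the XPath matches no node.
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->nodeValue);
}

// Markup that contains the heading, as in the question.
var_dump(extractTitle('<h1 _ngcontent-c0="" class="br-hdng"><span class="pr dib">hair fall shamboo<!----></span></h1>'));

// Markup without the heading (made-up sample).
var_dump(extractTitle('<html><body><div id="app"></div></body></html>'));
```

If the second case is what var_dump( $html ) reveals for the real page, the problem lies in what was fetched and not in the DOM code.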