Unable to get a certain title from a webpage

Problem description

I've written a script in php to scrape a title, visible as hair fall shamboo, from a webpage. When I execute the script below, I get the following error:

Script I've tried with:


The xpath I used in the script above seems to be correct. Here is the relevant portion of the html in which the title can be found:

<h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>

PostScript: The title I wish to parse gets loaded dynamically. As I'm new to php, I don't know whether my approach is correct. If not, what should I do instead?

I got success using javascript:

const puppeteer = require('puppeteer');

// Launch a headless browser, load the search page, and return the title text.
async function run() {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto("https://www.purplle.com/search?q=hair%20fall%20shamboo");
        // The callback runs in the page context, so it sees the rendered DOM.
        return await page.evaluate(() => {
            const item = document.querySelector('h1.br-hdng span');
            return item.innerText;
        });
    } finally {
        // Close the browser whether or not the scrape succeeded.
        await browser.close();
    }
}
run().then(console.log).catch(console.error);

Again, I got success using python:

import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.purplle.com/search?q=hair%20fall%20shamboo')
    r.html.render()
    item = r.html.find("h1.br-hdng span",first=True).text
    print(item)

Then what's wrong with php?

Recommended answer

It could very well be that there are more issues with your code than I have covered in this answer, but the most prominent issue that I see is the following:

DOMDocument::loadHTML() is not a static method, but an instance method (which returns a boolean). You should first create an instance of DOMDocument and then call loadHTML() on that instance:

$dom = new DOMDocument;
$dom->loadHTML($xml);

However, since you have suppressed errors with the @ operator on that particular line, you are not receiving a warning about this. And although it's very commonly seen that the error suppressor operator @ is used to suppress HTML validation errors, like this, you should look into using libxml_use_internal_errors() instead, as this does not suppress general PHP errors.

$dom = new DOMDocument;
$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML($xml);
libxml_use_internal_errors($oldSetting);
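
With internal error handling enabled, the parser's complaints are collected rather than thrown away, so they can still be inspected if needed. The following is a minimal sketch of that idea. The sample markup in $html is made up for illustration and is not the asker's actual page; whether the parser reports anything for it depends on the installed libxml2 version.

```php
<?php
// Hypothetical sample markup (an assumption, not the real page).
$html = '<html><body><main><h1 class="br-hdng"><span>hair fall shamboo</span></h1></main></body></html>';

$dom = new DOMDocument;

$oldSetting = libxml_use_internal_errors(true); // collect libxml errors instead of emitting warnings
$loaded = $dom->loadHTML($html);                // instance method, returns a boolean
$errors = libxml_get_errors();                  // array of LibXMLError objects, possibly empty
libxml_clear_errors();                          // free the collected errors
libxml_use_internal_errors($oldSetting);        // restore the previous setting

var_dump($loaded);
foreach ($errors as $error) {
    // Each LibXMLError exposes, among others, the line and message properties.
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Unlike the @ operator, this only affects libxml's own diagnostics, so a mistake such as calling loadHTML() statically would still surface as a regular PHP error.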

As a final note:
It's possible to load a DOM document from a URL directly (without the need for cURL) with DOMDocument::loadHTMLFile(), if your PHP installation is configured to allow loading of URLs via the configuration setting allow_url_fopen. Be aware though that this setting is often disabled for security reasons, so use it with care, if you plan on using it.
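
As a rough sketch of the above, a small wrapper could check allow_url_fopen before handing a URL to loadHTMLFile(). The helper name load_dom_from() and the temporary file are assumptions made for illustration; the usage part deliberately loads a local file so that it runs without network access.

```php
<?php
// Hypothetical helper: load a DOMDocument from a local path or, if permitted, a URL.
function load_dom_from(string $source): ?DOMDocument
{
    $isUrl = preg_match('#^https?://#i', $source) === 1;
    if ($isUrl && !filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return null; // URL loading is disabled in this PHP configuration
    }

    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $ok = $dom->loadHTMLFile($source);
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    return $ok ? $dom : null;
}

// Usage with a local temporary file (made-up content).
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents($path, '<html><body><h1 class="br-hdng"><span>hair fall shamboo</span></h1></body></html>');
$dom = load_dom_from($path);
echo $dom->getElementsByTagName('span')->item(0)->nodeValue, "\n";
unlink($path);
```

Note that loadHTMLFile() only parses what the server sends back; it does not run any JavaScript on the page.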

Here's a simple test-case which should work as expected:

<?php

$html = '
<html>
<head>
  <title>DOMDocument test-case</title>
</head>
<body>
  <div class="dummy-container">
    <h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>
  </div>
</body>
</html>';

$dom = new DOMDocument;

$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML( $html );
libxml_use_internal_errors($oldSetting);

$xpath = new DOMXPath( $dom );
$title = $xpath->query( '//h1[@class="br-hdng"]/span' )->item( 0 )->nodeValue;
echo $title;

You should replace the contents of $html with the output of your get_content() call. If it doesn't work, then either:


  1. there's something wrong with fetching the HTML with cURL (do var_dump( $html ); before loading into DOMDocument, for instance, to see the contents you retrieved), or...

  2. perhaps you are working inside a namespace, in which case you should prepend a backslash before DOMDocument and DOMXPath, i.e.: new \DOMDocument; and new \DOMXPath( $dom );.
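
Both points can be combined in one sketch: namespaced code that uses the backslash-prefixed class names and checks whether the node exists before reading nodeValue. The namespace App\Scraper and the helper extract_title() are hypothetical names chosen for illustration. Because the question says the title is loaded dynamically, the HTML that cURL retrieves may not contain the h1 at all, and the null check makes that case visible instead of failing on item( 0 ).

```php
<?php
namespace App\Scraper; // hypothetical namespace, for illustration only

// Hypothetical helper: return the heading text, or null if the node is absent.
function extract_title(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing was fetched
    }

    $dom = new \DOMDocument;
    $oldSetting = \libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    \libxml_clear_errors();
    \libxml_use_internal_errors($oldSetting);

    $xpath = new \DOMXPath($dom);
    $node = $xpath->query('//h1[@class="br-hdng"]/span')->item(0);

    return $node === null ? null : trim($node->nodeValue);
}

// Markup that already contains the heading.
var_dump(extract_title('<h1 class="br-hdng"><span>hair fall shamboo</span></h1>'));
// Made-up markup standing in for a page whose heading is only added by JavaScript.
var_dump(extract_title('<html><body><app-root></app-root></body></html>'));
```

If the second case is what the real page returns, the fix lies in how the HTML is obtained (for example, a headless browser as in the javascript version), not in the xpath.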





