Problem description
I'm currently using PHP and DOMXPath to get the contents of all of the <p> elements of a web page:
<?php
...
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");
foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}
My problem is that the string returned by textContent does not respect the <br /> tags inside those <p> elements. Instead, it drops the line break and runs together words that would normally be on separate lines. For example:
Example HTML:
<p>
    Some happy talk goes here talking about our great product.<br />
    We would love for you to buy it!
</p>
<p>
    Random information and what not<br />
    Isn't that cool?
</p>
Current output from the PHP above:
Some happy talk about our great product.We would love for you to buy it! Random information and what notIsn't that cool?
I have also tried $paragraphs = $doc->getElementsByTagName("p"); and it gives me the same result.
Is there a way to make DOMXPath/DOMDocument preserve the line breaks? I need to be able to separate each of the words within a paragraph, and the current output makes that impossible.
If there is an alternative method for retrieving the string within <p> elements while preserving <br /> or '\n', that would also be great.
Edit
Upon further investigation, the HTML in question is actually a list of anchors separated by <br> tags, but with no actual line breaks:
<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>
It turns out that the original code works properly with the example HTML given above.
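The difference between the two inputs can be reproduced with a small sketch. It assumes the example HTML had a real newline after each <br />, while the anchor list had nothing but the <br> between items. The helper name firstParagraphText and both sample strings are made up for illustration.

```php
<?php
// Hypothetical helper for this sketch: returns the textContent of the first <p>.
function firstParagraphText(string $html): string
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    return $doc->getElementsByTagName('p')->item(0)->textContent;
}

// A newline follows <br /> in the markup, so it survives as part of a text node.
$withNewline = "<p>First line<br />\nSecond line</p>";

// Only the <br> element separates the two anchors; it contributes no text at all.
$withoutNewline = '<p><a href="#">First</a><br><a href="#">Second</a></p>';

var_dump(firstParagraphText($withNewline));
var_dump(firstParagraphText($withoutNewline));
```

In the first case the words stay separated by the newline that was already in the markup; in the second they run together, because textContent only concatenates text nodes and a <br> element has none.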
Update: Solved
With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this has been solved. It's not very elegant, but it will do for now.
Example:
foreach ($paragraphs as $paragraph) {
    // DOMinnerHTML() is a user-defined helper (not a PHP built-in) that
    // returns the inner HTML of a node as a string.
    $p_text = new DOMDocument();
    $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));

    // Do whatever; in this case, get all of the words in an array.
    $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
    print_r($words);
}
This makes use of DOMinnerHTML (as suggested by @netcoder) to replace the instances of <br> with "\r\n" (as suggested by @ircmaxell), so that the line breaks are still present when textContent is read afterwards.
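DOMinnerHTML is not part of PHP, and its definition is not shown in this post. A minimal sketch of what such a helper might look like, assuming it simply serializes each child node with DOMDocument::saveHTML (the actual helper used here may differ):

```php
<?php
// Assumed implementation for illustration only: concatenates the serialized
// HTML of every child node of $element.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML() accepts a node and returns just that node's markup.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML('<p>First<br />Second</p>');
libxml_clear_errors();

$p = $doc->getElementsByTagName('p')->item(0);
echo DOMinnerHTML($p), "\n";
```

Note that saveHTML writes the break as <br> regardless of how it was spelled in the input, which is why matching "<br>" in the replacement step is enough in this sketch.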
Obviously there's some room for improvement, but it has solved my current issue.
Thanks for the help, everyone,
Ben
Recommended answer
Well, what I would do is replace the <br> elements with literal line breaks:
$doc = new DOMDocument();
$doc->loadHTML($html);

// Copy the live node list into an array first: replacing nodes while
// iterating the live list can cause some <br> elements to be skipped.
$brs = iterator_to_array($doc->getElementsByTagName('br'));
foreach ($brs as $node) {
    $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
}

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");
foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}
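For the word-splitting goal stated in the question, the same idea can be wrapped into a self-contained sketch. The function name paragraphTexts and the sample markup are made up for illustration, and selecting the <br> elements through XPath rather than getElementsByTagName is a substitution on my part, not something taken from the answer itself.

```php
<?php
// Hypothetical helper: returns the text of every <p>, with each <br> turned into "\n".
function paragraphTexts(string $html): array
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // The result of an XPath query is a fixed list of nodes, so replacing
    // nodes inside the loop does not disturb the iteration.
    foreach ($xpath->query('//p//br') as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    $texts = [];
    foreach ($xpath->query('/html/body//p') as $p) {
        $texts[] = $p->textContent;
    }
    return $texts;
}

$html = '<p><a href="#">Classic Checking</a><br><a href="#">Savings Plans</a></p>';
foreach (paragraphTexts($html) as $text) {
    // Split on any run of whitespace and drop empty pieces.
    print_r(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}
```

Splitting with preg_split on whitespace handles the inserted newlines and ordinary spaces in one pass; punctuation would still need to be stripped separately if only bare words are wanted.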