I am trying to parse an HTML string with jsoup:
<div class="test">
<br>From: <b class="sendername">Divya</b>
<span dir="ltr"><<a href="mailto:[email protected]" target="_blank">[email protected]</a>></span>
<br>Date: Wed, May 27, 2015 at 11:10 AM
<br>Subject: Plan for the day 27/05/2015
<br>To: Abhishek<<a href="mailto:[email protected]" target="_blank">abhishek.sharma@abc.<wbr>com</a>>,
<a href="mailto:[email protected]" target="_blank">[email protected]</a>>
<br>Cc: Ram <<a href="mailto:[email protected]" target="_blank">[email protected]</a>>
<br>
<br>
<br>
<div dir="ltr">Hi,</div>
</div>
Document doc = Jsoup.parse( mailBody.getBodyHtml().get( 0 ) );
Elements elem = doc.getElementsByClass( "test" );
int totalElements = 0;
Elements childElements = elem.get( 0 ).children();
int brCount = 0;
for( Element childElement : childElements )
{
totalElements++;
if( childElement.tagName().equalsIgnoreCase( "br" ) )
{
brCount++;
if( brCount == 3 )
break;
}
else
brCount = 0;
}
for( int i = 1; i <= totalElements; i++ )
{
childElements.get( i ).remove();
}
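I want to get rid of everything that comes before three consecutive br tags, where there must be no text nodes between those tags.
That is, in the case above it would remove all the nodes before them (HTML tags and text nodes), and the output would look like this: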
<div class="test">
<div dir="ltr">Hi,</div>
</div>
Best answer
The structure of the HTML appears to be constant, so you can try the following CSS selector:
div.test br + br + br + div
Demo
http://try.jsoup.org/~DiBi9Q_Ye88gi6Hq29Z44ar6xus
Sample code
String html = "<div class=\"test\">\n <br>From: <b class=\"sendername\">Divya</b> \n <span dir=\"ltr\"><<a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>></span>\n <br>Date: Wed, May 27, 2015 at 11:10 AM\n <br>Subject: Plan for the day 27/05/2015\n <br>To: Abhishek<<a href=\"mailto:[email protected]\" target=\"_blank\">abhishek.sharma@abc.<wbr>com</a>>, \n <a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>>\n <br>Cc: Ram <<a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>>\n <br>\n <br>\n <br>\n <div dir=\"ltr\">Hi,</div>\n </div>";
Document doc = Jsoup.parse(html);
Element mailBody = doc.select("div.test br + br + br + div").first();
if (mailBody == null) {
throw new RuntimeException("Unable to locate mail body.");
}
System.out.println("** BEFORE:\n" + doc);
Document tmp = Jsoup.parseBodyFragment("<div class='test'>" + mailBody.outerHtml() + "</div>");
mailBody.parent().replaceWith(tmp.select("div.test").first());
System.out.println("\n** AFTER:\n" + doc);
Output
** BEFORE:
<html>
<head></head>
<body>
<div class="test">
<br>From:
<b class="sendername">Divya</b>
<span dir="ltr"><<a href="mailto:[email protected]" target="_blank">[email protected]</a>></span>
<br>Date: Wed, May 27, 2015 at 11:10 AM
<br>Subject: Plan for the day 27/05/2015
<br>To: Abhishek<
<a href="mailto:[email protected]" target="_blank">abhishek.sharma@abc.<wbr>com</a>>,
<a href="mailto:[email protected]" target="_blank">[email protected]</a>>
<br>Cc: Ram <
<a href="mailto:[email protected]" target="_blank">[email protected]</a>>
<br>
<br>
<br>
<div dir="ltr">
Hi,
</div>
</div>
</body>
</html>
** AFTER:
<html>
<head></head>
<body>
<div class="test">
<div dir="ltr">
Hi,
</div>
</div>
</body>
</html>
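One caveat about the selector: the + combinator only looks at element siblings, so br + br + br also matches three br tags that have text between them. It works here because the HTML structure is constant. If the "no text nodes in between" requirement has to be enforced, walking the child nodes is one option. The following is a minimal sketch, not part of the original answer: the class name TrimBeforeBreaks, the method trimBeforeBreakRun and the shortened sample HTML are hypothetical, and it assumes whitespace-only text nodes between the br tags are acceptable.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class TrimBeforeBreaks {

    /**
     * Removes every child node of container up to and including the first run of
     * runLength consecutive br elements that are separated only by whitespace.
     * Returns false and leaves the container untouched if no such run exists.
     */
    static boolean trimBeforeBreakRun(Element container, int runLength) {
        // Work on a copy: childNodes() is a view of the element's children,
        // and nodes are removed from the element further down.
        List<Node> nodes = new ArrayList<>(container.childNodes());
        int run = 0;
        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.get(i);
            if (node instanceof Element
                    && ((Element) node).tagName().equalsIgnoreCase("br")) {
                run++;
            } else if (node instanceof TextNode && ((TextNode) node).isBlank()) {
                continue; // whitespace between tags does not break the run
            } else {
                run = 0; // real text or any other element resets the count
            }
            if (run == runLength) {
                // Drop everything before the run, and the run itself.
                for (Node n : nodes.subList(0, i + 1)) {
                    n.remove();
                }
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the mail body from the question.
        String html = "<div class=\"test\">"
                + "<br>From: <b class=\"sendername\">Divya</b>"
                + "<br>Date: Wed, May 27, 2015 at 11:10 AM"
                + "<br>Subject: Plan for the day 27/05/2015"
                + "<br> <br> <br>"
                + "<div dir=\"ltr\">Hi,</div>"
                + "</div>";
        Document doc = Jsoup.parseBodyFragment(html);
        Element container = doc.select("div.test").first();

        if (!trimBeforeBreakRun(container, 3)) {
            throw new RuntimeException("Unable to locate mail body.");
        }
        System.out.println(container.outerHtml());
    }
}
```

Unlike the sample code above, this variant edits the existing div.test in place rather than rebuilding it from a parsed fragment, so any attributes on the container are kept as they are.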
This question and answer (java - Removing text nodes and checking for alternate text nodes in HTML: Jsoup) come from Stack Overflow: https://stackoverflow.com/questions/30484102/