I am trying to parse an HTML string with jsoup:

<div class="test">
  <br>From: <b class="sendername">Divya</b>
  <span dir="ltr">&lt;<a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt;</span>
  <br>Date: Wed, May 27, 2015 at 11:10 AM
  <br>Subject: Plan for the day 27/05/2015
  <br>To: Abhishek&lt;<a href="mailto:[email protected]" target="_blank">abhishek.sharma@abc.<wbr>com</a>&gt;,
    <a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt;
  <br>Cc: Ram &lt;<a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt;
  <br>
  <br>
  <br>
  <div dir="ltr">Hi,</div>
 </div>


Document doc = Jsoup.parse( mailBody.getBodyHtml().get( 0 ) );
Elements elem = doc.getElementsByClass( "test" );
int totalElements = 0;
Elements childElements = elem.get( 0 ).children();
int brCount = 0;
for( Element childElement : childElements )
{
    totalElements++;
    if( childElement.tagName().equalsIgnoreCase( "br" ) )
    {
        brCount++;
        if( brCount == 3 )
            break;
    }
    else
        brCount = 0;
}
for( int i = 1; i <= totalElements; i++ )
{
    childElements.get( i ).remove();
}

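I want to get rid of everything before three consecutive br tags that have no text nodes between them.
In the case above, that means removing everything before them (both HTML tags and text nodes), so the output would look like this: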

<div class="test">
  <div dir="ltr">Hi,</div>
 </div>


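  • How can I check whether a text node exists between two br tags?
  • The code above only removes the HTML tags; the text nodes are not removed. How can I remove them?

Best answer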

    The structure of the HTML appears to be constant, so you can try the following CSS selector:

    div.test br + br + br + div
    

    Demo

    http://try.jsoup.org/~DiBi9Q_Ye88gi6Hq29Z44ar6xus
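
One caveat about the selector: the `+` combinator only looks at adjacent element siblings and skips text nodes, so on its own it does not verify that nothing sits between the br tags. The following is a minimal sketch to illustrate that, assuming jsoup is on the classpath; the class name `SiblingSelectorCheck`, the helper `countMatches`, and the two HTML strings are made up for this illustration. Both inputs are expected to produce a match, including the one with text between the br tags.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SiblingSelectorCheck {

    /** Counts how many elements the answer's selector matches in the given HTML. */
    static int countMatches(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div.test br + br + br + div").size();
    }

    public static void main(String[] args) {
        // Hypothetical input: three <br> tags with only whitespace between them.
        String clean = "<div class=\"test\"><br> <br> <br><div>Hi,</div></div>";
        // Hypothetical input: three <br> tags with header text between them.
        String withText = "<div class=\"test\"><br>Date<br>Subject<br>To<div>Hi,</div></div>";

        System.out.println("clean:    " + countMatches(clean));
        System.out.println("withText: " + countMatches(withText));
    }
}
```

This is fine when the mail header layout really is constant, as assumed here, but it is not a check for text nodes.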

    Sample code
    String html = "<div class=\"test\">\n  <br>From: <b class=\"sendername\">Divya</b> \n  <span dir=\"ltr\">&lt;<a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>&gt;</span>\n  <br>Date: Wed, May 27, 2015 at 11:10 AM\n  <br>Subject: Plan for the day 27/05/2015\n  <br>To: Abhishek&lt;<a href=\"mailto:[email protected]\" target=\"_blank\">abhishek.sharma@abc.<wbr>com</a>&gt;, \n    <a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>&gt;\n  <br>Cc: Ram &lt;<a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>&gt;\n  <br>\n  <br>\n  <br>\n  <div dir=\"ltr\">Hi,</div>\n </div>";
    
    Document doc = Jsoup.parse(html);
    
    Element mailBody = doc.select("div.test br + br + br + div").first();
    if (mailBody == null) {
        throw new RuntimeException("Unable to locate mail body.");
    }
    System.out.println("** BEFORE:\n" + doc);
    
    Document tmp = Jsoup.parseBodyFragment("<div class='test'>" + mailBody.outerHtml() + "</div>");
    mailBody.parent().replaceWith(tmp.select("div.test").first());
    System.out.println("\n** AFTER:\n" + doc);
    

    Output
    ** BEFORE:
    <html>
     <head></head>
     <body>
      <div class="test">
       <br>From:
       <b class="sendername">Divya</b>
       <span dir="ltr">&lt;<a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt;</span>
       <br>Date: Wed, May 27, 2015 at 11:10 AM
       <br>Subject: Plan for the day 27/05/2015
       <br>To: Abhishek&lt;
       <a href="mailto:[email protected]" target="_blank">abhishek.sharma@abc.<wbr>com</a>&gt;,
       <a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt;
       <br>Cc: Ram &lt;
       <a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt;
       <br>
       <br>
       <br>
       <div dir="ltr">
        Hi,
       </div>
      </div>
     </body>
    </html>
    
    ** AFTER:
    <html>
     <head></head>
     <body>
      <div class="test">
       <div dir="ltr">
         Hi,
       </div>
      </div>
     </body>
    </html>
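
If the structure cannot be relied on, the two questions can also be approached at the node level. `children()` returns only elements, which is why the text survives in the question's loop, whereas `childNodes()` also includes text nodes. The sketch below is one possible approach rather than part of the original answer: it walks `childNodes()`, treats whitespace-only text as not interrupting a run of br tags, and removes every node up to and including the third br of the first such run. The class name `MailHeaderStripper`, the method `stripHeader`, and the shortened sample HTML are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class MailHeaderStripper {

    /**
     * Removes every child node (elements and text nodes) of the container up to
     * and including the first run of three br tags separated only by whitespace.
     * Returns false and leaves the container untouched if no such run exists.
     */
    static boolean stripHeader(Element container) {
        // Work on a copy: the list returned by childNodes() must not change under us.
        List<Node> nodes = new ArrayList<>(container.childNodes());
        int brRun = 0;
        int cut = -1; // index of the third br of the run, once found

        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.get(i);
            if (node instanceof Element && "br".equalsIgnoreCase(node.nodeName())) {
                brRun++;
                if (brRun == 3) {
                    cut = i;
                    break;
                }
            } else if (node instanceof TextNode && ((TextNode) node).isBlank()) {
                // Whitespace-only text (indentation, newlines) does not break the run.
            } else {
                // Any other element or non-blank text node resets the run.
                brRun = 0;
            }
        }

        if (cut < 0) {
            return false;
        }
        for (int i = 0; i <= cut; i++) {
            nodes.get(i).remove();
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the mail body from the question.
        String html = "<div class=\"test\">\n  <br>From: <b>Divya</b>\n  <br>Date: Wed\n"
                + "  <br>\n  <br>\n  <br>\n  <div dir=\"ltr\">Hi,</div>\n </div>";
        Document doc = Jsoup.parse(html);
        Element container = doc.select("div.test").first();

        System.out.println("stripped: " + stripHeader(container));
        System.out.println(container.outerHtml());
    }
}
```

Whether whitespace-only text should count as "no text node" is a design choice. The sketch ignores it because the indentation between tags is parsed into whitespace text nodes, and a strict check would never find a run in pretty-printed HTML.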
    

    Regarding "java - Removing text nodes and checking for alternate text nodes in HTML: Jsoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30484102/
