How to remove only HTML tags from text with Jsoup?

Question

I want to remove ONLY HTML tags from text with Jsoup. I used the solution from my previous question about Jsoup. After some testing, however, I found that Jsoup throws a Java heap OutOfMemoryError for some large HTML inputs, though not all of them. For example, it fails on a 2 MB HTML document with 10,000 lines. The exception is thrown on the last line of the code (NOT in Jsoup.parse):

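import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public String StripHtml(String html) {
  // Turn escaped angle brackets back into real ones before parsing.
  html = html.replace("&lt;", "<").replace("&gt;", ">");
  // getAllStandardHtmlTags() is my own helper (not shown); it returns the standard HTML tag names.
  String[] tags = getAllStandardHtmlTags();
  Document thing = Jsoup.parse(html);
  for (String tag : tags) {
      for (Element elem : thing.getElementsByTag(tag)) {
          // Move the element's children up into its parent, then drop the element itself.
          elem.parent().insertChildren(elem.siblingIndex(), elem.childNodes());
          elem.remove();
      }
  }
  // The OutOfMemoryError is thrown here, not in Jsoup.parse.
  return thing.html();
}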

Is there a way to fix it?

Recommended answer

After a lot of searching on Google and a few attempts to implement an HTML stripper myself, my solution is to use Solr's HTMLStripCharFilter class, modified so that escapedTags is replaced by a blacklist of the standard HTML tags.


  1. HTMLStripCharFilter is faster than both the Jsoup library and regexes on large files.
  2. HTMLStripCharFilter does not run out of memory on large files the way Jsoup does (OutOfMemoryError).
  3. HTMLStripCharFilter does not run into "catastrophic backtracking" the way regexes can.
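
To illustrate why a streaming, blacklist-based stripper avoids the problems listed above, here is a minimal sketch using only the Java standard library. It is NOT Solr's HTMLStripCharFilter and not the modified class described in this answer; the class name BlacklistTagStripper, the abbreviated HTML_TAGS set, and the MAX_TAG_LENGTH cap are all assumptions made for illustration. It reads the input once, character by character, never builds a DOM, and uses no regexes. It also ignores comments, script bodies, entities, and attribute values that contain '>', which a real filter has to handle.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class BlacklistTagStripper {
    // Hypothetical, abbreviated blacklist; a real one would list every standard HTML tag.
    static final Set<String> HTML_TAGS = new HashSet<>(Arrays.asList(
            "html", "head", "body", "div", "span", "p", "a", "b", "i",
            "br", "ul", "ol", "li", "table", "tr", "td"));

    // Assumed cap on buffered tag text, so memory use stays bounded.
    static final int MAX_TAG_LENGTH = 4096;

    // Copies in to out in a single pass, dropping only blacklisted tags.
    static void strip(Reader in, Writer out) throws IOException {
        StringBuilder tag = new StringBuilder();
        int c = in.read();
        while (c != -1) {
            if (c != '<') {
                out.write(c);
                c = in.read();
                continue;
            }
            // Buffer a candidate tag until '>', another '<', end of input, or the cap.
            tag.setLength(0);
            tag.append('<');
            while ((c = in.read()) != -1 && c != '<' && c != '>'
                    && tag.length() < MAX_TAG_LENGTH) {
                tag.append((char) c);
            }
            if (c == '>') {
                tag.append('>');
                if (!isBlacklisted(tag)) {
                    out.write(tag.toString()); // unknown tag: keep it verbatim
                }
                c = in.read();
            } else {
                // Not a complete tag: emit the buffered text literally.
                // The pending character c is handled by the outer loop.
                out.write(tag.toString());
            }
        }
    }

    // tag always starts with '<' and ends with '>'.
    static boolean isBlacklisted(CharSequence tag) {
        int i = 1;
        if (tag.charAt(i) == '/') {
            i++; // closing tag
        }
        int start = i;
        while (Character.isLetterOrDigit(tag.charAt(i))) {
            i++;
        }
        String name = tag.subSequence(start, i).toString().toLowerCase(Locale.ROOT);
        char next = tag.charAt(i);
        boolean nameEnds = next == '>' || next == '/' || Character.isWhitespace(next);
        return nameEnds && HTML_TAGS.contains(name);
    }

    // Convenience wrapper for in-memory strings.
    static String strip(String html) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(html), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<div>see <a href=\"#\">this</a> and <custom>that</custom></div>"));
    }
}
```

For large files, the Reader/Writer overload can be wired to a FileReader and a FileWriter so that the whole document never has to sit in memory at once. Because unknown tags such as <custom> are not on the blacklist, they pass through untouched, which is the "remove ONLY HTML tags" behavior asked for in the question.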


09-05 12:14