How do I remove only HTML tags from text with Jsoup?



Question

I want to remove ONLY HTML tags from text with Jsoup. I used the solution from here (my previous question about Jsoup). After some testing, however, I found that Jsoup throws a Java heap exception (OutOfMemoryError) for some large HTML documents, though not for all of them. For example, it fails on an HTML document of 2 MB and 10,000 lines. The exception is thrown on the last line of the code (NOT in Jsoup.parse):

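import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public String StripHtml(String html) {
  // Turn escaped angle brackets back into real ones before parsing.
  html = html.replace("&lt;", "<").replace("&gt;", ">");
  // getAllStandardHtmlTags is the asker's own list of tag names (its definition is not shown).
  String[] tags = getAllStandardHtmlTags;
  Document thing = Jsoup.parse(html);
  for (String tag : tags) {
      for (Element elem : thing.getElementsByTag(tag)) {
          // Move the element's children up into its parent, then drop the element itself.
          elem.parent().insertChildren(elem.siblingIndex(), elem.childNodes());
          elem.remove();
      }
  }
  // The OutOfMemoryError is thrown here, not in Jsoup.parse.
  return thing.html();
}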

Is there a way to solve this?

Recommended answer

After a lot of searching on Google and several attempts to write an HTML stripper myself, my solution is to use Solr's HTMLStripCharFilter class, modified so that escapedTags is replaced by a blackList of the standard HTML tags.

HTMLStripCharFilter is faster than both the Jsoup library and regular expressions on large files.
HTMLStripCharFilter does not run into the memory problem (OutOfMemoryError) that Jsoup has on large files.
HTMLStripCharFilter does not fall into "catastrophic backtracking" the way regular expressions can.
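To illustrate why a streaming filter with a tag blacklist avoids these problems, here is a minimal sketch in plain Java. It is not Solr's HTMLStripCharFilter and not the answerer's modified version. The class name BlacklistTagStripper, the sample BLACKLIST and the MAX_TAG_LENGTH cap are all hypothetical, chosen only for this example. It reads the input once, character by character, buffers at most one candidate tag at a time, and never builds a DOM or uses a regex. It deliberately ignores comments, script bodies and '>' inside quoted attribute values.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Locale;
import java.util.Set;

final class BlacklistTagStripper {
    // Hypothetical sample blacklist; a real one would list every standard HTML tag.
    static final Set<String> BLACKLIST =
            Set.of("html", "head", "body", "div", "p", "span", "b", "i", "a", "br");

    // Assumed cap on buffered tag text, so memory stays bounded on malformed input.
    private static final int MAX_TAG_LENGTH = 1024;

    // Copies 'in' to 'out' in a single pass, dropping only tags whose name is blacklisted.
    static void strip(Reader in, Writer out) throws IOException {
        StringBuilder tag = new StringBuilder();
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (!inTag) {
                if (ch == '<') {            // possible start of a tag
                    inTag = true;
                    tag.setLength(0);
                    tag.append(ch);
                } else {
                    out.write(ch);
                }
            } else if (ch == '<') {         // the earlier '<' was plain text: flush it
                out.append(tag);
                tag.setLength(0);
                tag.append(ch);
            } else {
                tag.append(ch);
                if (ch == '>') {            // candidate tag is complete
                    if (!BLACKLIST.contains(tagName(tag))) {
                        out.append(tag);    // unknown tag or stray brackets: keep as text
                    }
                    inTag = false;
                } else if (tag.length() > MAX_TAG_LENGTH) {
                    out.append(tag);        // too long to be a tag: treat as text
                    inTag = false;
                }
            }
        }
        if (inTag) {
            out.append(tag);                // unterminated '<...' at end of input
        }
    }

    // Extracts the lower-cased name from "<name ...>" or "</name>".
    private static String tagName(CharSequence tag) {
        int i = 1;
        if (i < tag.length() && tag.charAt(i) == '/') {
            i++;
        }
        int start = i;
        while (i < tag.length() && Character.isLetterOrDigit(tag.charAt(i))) {
            i++;
        }
        return tag.subSequence(start, i).toString().toLowerCase(Locale.ROOT);
    }

    // Convenience wrapper for in-memory strings.
    static String strip(String html) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(html), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<div><p>Hello <b>world</b></p></div>")); // prints "Hello world"
    }
}
```

For a large file, the Reader/Writer overload can be given a BufferedReader and a BufferedWriter, so the document never has to be held in memory as a whole. Text that only looks like markup, such as "1 < 2", and tags that are not in the blacklist pass through unchanged.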


09-05 12:14