Problem description
I want to use an HTML parser that does the following in a clean, elegant way:
- Extract text (this is most important)
- Extract links, meta keywords
- Reconstruct original doc (optional but nice feature to have)
From my investigation so far, Jericho seems to fit. Are there any other open source libraries you would recommend?
I recently experimented with HtmlCleaner and CyberNekoHtml. CyberNekoHtml is a DOM/SAX parser that produces predictable results. HtmlCleaner is a tad faster, but quite often fails to produce accurate results.
I would recommend CyberNekoHtml. It can do all of the things you mentioned. For example, it is very easy to extract a list of all elements and their attributes. If you wanted to reconstruct the page, you could traverse the DOM tree and build each element back into HTML.
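To make the traversal idea concrete, here is a minimal sketch that works on any `org.w3c.dom` tree, which is the kind of tree a DOM parser hands back. It is an illustration rather than CyberNekoHtml's own API: the class `DomRebuildSketch` and its methods are hypothetical names, and `parseXhtml` is only a stand-in that uses the JDK's XML parser, so it accepts well-formed XHTML only. With a real HTML parser you would replace `parseXhtml` and keep the traversal.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomRebuildSketch {

    // Stand-in for an HTML parser (assumption): the JDK XML parser,
    // which only accepts well-formed XHTML input.
    public static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    // Lists every element in document order as "name attr=value ...".
    public static List<String> listElements(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            StringBuilder sb = new StringBuilder(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                sb.append(' ').append(a.getNodeName()).append('=').append(a.getNodeValue());
            }
            out.add(sb.toString());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    // Rebuilds markup by walking the tree depth-first.
    public static String toHtml(Node root) {
        StringBuilder sb = new StringBuilder();
        write(root, sb);
        return sb.toString();
    }

    private static void write(Node node, StringBuilder sb) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE) {
            sb.append(escape(node.getNodeValue(), false));
            return;
        }
        boolean isElement = type == Node.ELEMENT_NODE;
        if (isElement) {
            sb.append('<').append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                sb.append(' ').append(a.getNodeName()).append("=\"")
                  .append(escape(a.getNodeValue(), true)).append('"');
            }
            sb.append('>');
        }
        // Comments, doctypes and similar nodes have no children and are skipped.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            write(c, sb);
        }
        if (isElement) {
            sb.append("</").append(node.getNodeName()).append('>');
        }
    }

    private static String escape(String s, boolean inAttribute) {
        String r = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return inAttribute ? r.replace("\"", "&quot;") : r;
    }

    public static void main(String[] args) {
        Document doc = parseXhtml(
            "<html><body><p class=\"x\">Hi <a href=\"http://example.com/\">link</a></p></body></html>");
        System.out.println(listElements(doc));
        System.out.println(toHtml(doc));
    }
}
```

Note that this sketch always writes a closing tag, so void elements such as `br` come out as `<br></br>`, and comments and the doctype are dropped. A faithful reconstruction would need to handle those cases.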
There's a list of open source Java HTML parsers here: http://java-source.net/open-source/html-parsers
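For comparison with the libraries on that list, the JDK also ships a basic callback-style HTML parser in `javax.swing.text.html`. The sketch below uses it to pull out the three things the question asks for: text, links and meta keywords. It is not any of the libraries named in this thread. The class `HtmlExtractSketch` and its `Result` holder are hypothetical names, and the Swing parser targets older HTML, so treat this as a rough baseline rather than a recommendation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlExtractSketch {

    /** Holds what was pulled out of one document. */
    public static class Result {
        public final StringBuilder text = new StringBuilder();
        public final List<String> links = new ArrayList<>();
        public String keywords; // stays null if no keywords meta tag is found
    }

    public static Result extract(String html) {
        final Result r = new Result();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Join text chunks with a single space.
                if (r.text.length() > 0) {
                    r.text.append(' ');
                }
                r.text.append(data);
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        r.links.add(href.toString());
                    }
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <meta> has no closing tag, so it arrives as a simple tag.
                if (tag == HTML.Tag.META) {
                    Object name = attrs.getAttribute(HTML.Attribute.NAME);
                    Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                    if (name != null && content != null
                            && "keywords".equalsIgnoreCase(name.toString())) {
                        r.keywords = content.toString();
                    }
                }
            }
        };
        try {
            // true = ignore any charset declared in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return r;
    }

    public static void main(String[] args) {
        Result r = extract("<html><head><meta name=\"keywords\" content=\"java,html\"></head>"
                + "<body><p>Hello <a href=\"http://example.com/a\">world</a></p></body></html>");
        System.out.println("text:     " + r.text);
        System.out.println("links:    " + r.links);
        System.out.println("keywords: " + r.keywords);
    }
}
```

A callback parser like this is enough for extraction, but it does not build a tree, so it does not help with reconstructing the original document.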