Which Java HTML parsers have the following features?
- Fast
- Thread-safe
- Reliable and bug-free
- Parses HTML and XML
- Handles erroneous HTML
- Has a DOM implementation
- Supports HTML4, JavaScript, and CSS tags
- Relatively simple, object-oriented API
Which parser do you think is best?
Apache Tika is the best choice. Apache recently split a number of sub-projects out of its existing projects and released them on their own; Tika is one of them, and was previously a component of Apache Lucene. Given Apache's support and reputation, and Tika's origin in the widely used Lucene project, it should be a very good choice. Furthermore, it is open source.
A brief introduction from the Apache Tika website, which lists these supported document formats:
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
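
For context on two of the requirements in the question, "has a DOM implementation" and "handles erroneous HTML", here is a minimal sketch that uses only the JDK's built-in XML parser, not Tika or any other HTML parser discussed here. It builds a W3C DOM tree from well-formed markup and shows that a strict XML parser rejects a typical piece of sloppy HTML, which is the gap a lenient HTML parser is expected to fill. The class name `DomDemo`, its helper methods, and the sample strings are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DomDemo {

    // Parses markup with the JDK's strict XML parser and returns a W3C DOM tree.
    static Document parseStrict(String markup) throws Exception {
        // The factory and builder are not guaranteed to be thread-safe,
        // so this sketch creates a fresh pair on every call.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Rethrow parse errors instead of letting the default handler print to stderr.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        byte[] bytes = markup.getBytes(StandardCharsets.UTF_8);
        return builder.parse(new ByteArrayInputStream(bytes));
    }

    // Returns true if the strict parser accepts the markup, false if it rejects it.
    static boolean isWellFormed(String markup) {
        try {
            parseStrict(markup);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Returns the text content of the first element with the given tag name.
    static String firstText(String markup, String tag) throws Exception {
        Document doc = parseStrict(markup);
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String wellFormed = "<html><body><p>Hello</p><br/></body></html>";
        String sloppy = "<html><body><p>Hello<br></body></html>"; // unclosed <p> and <br>

        System.out.println(firstText(wellFormed, "p"));
        System.out.println(isWellFormed(wellFormed));
        System.out.println(isWellFormed(sloppy));
    }
}
```

A browser would render the second string without complaint, but the strict parser refuses it. A parser that meets the "handles erroneous HTML" requirement has to repair such input before it can expose a DOM or a stream of parse events.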