Problem description
I'm working on an app that scrapes data from a website, and I'm wondering how best to extract the data. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking for
div class = "classname"
in each line of HTML. This works, but I can't help feeling there is a better solution out there.
Is there a nice way to hand a line of HTML to a class and get some convenient methods, such as:
boolean usesClass(String CSSClassname);
String getText();
String getLink();
Recommended answer
Several years ago I used JTidy for the same purpose:
"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.
More information on JTidy can be found on the JTidy SourceForge project page."
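Because JTidy exposes the parsed page as a standard org.w3c.dom.Document, the three methods from the question can be written against the plain DOM interfaces. The following is only a minimal sketch under some assumptions: the names DivScraper, HtmlDiv, findDivs and parse are hypothetical and not part of JTidy, and to keep the example self-contained it parses a small well-formed snippet with the JDK's own XML parser as a stand-in. With JTidy on the classpath, the Document would instead come from Tidy.parseDOM(InputStream, OutputStream), which also copes with malformed HTML. Text is collected by walking child nodes rather than calling getTextContent(), since not every DOM implementation supports that DOM Level 3 method.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class DivScraper {

    /** Hypothetical wrapper offering the three methods asked for in the question. */
    static final class HtmlDiv {
        private final Element element;

        HtmlDiv(Element element) {
            this.element = element;
        }

        /** True if the class attribute contains the name as a whole token. */
        boolean usesClass(String cssClassName) {
            String attr = element.getAttribute("class");
            if (attr == null) {
                return false;
            }
            for (String token : attr.trim().split("\\s+")) {
                if (token.equals(cssClassName)) {
                    return true;
                }
            }
            return false;
        }

        /** Concatenated text of all descendant text nodes, trimmed. */
        String getText() {
            StringBuilder sb = new StringBuilder();
            collectText(element, sb);
            return sb.toString().trim();
        }

        /** href of the first <a> descendant, or null if there is none. */
        String getLink() {
            NodeList anchors = element.getElementsByTagName("a");
            if (anchors.getLength() == 0) {
                return null;
            }
            return ((Element) anchors.item(0)).getAttribute("href");
        }

        private static void collectText(Node node, StringBuilder sb) {
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.TEXT_NODE) {
                    sb.append(c.getNodeValue());
                } else {
                    collectText(c, sb);
                }
            }
        }
    }

    /** Returns every div in the document that uses the given CSS class. */
    static List<HtmlDiv> findDivs(Document doc, String cssClassName) {
        List<HtmlDiv> result = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            HtmlDiv div = new HtmlDiv((Element) divs.item(i));
            if (div.usesClass(cssClassName)) {
                result.add(div);
            }
        }
        return result;
    }

    /** Stand-in parser for well-formed markup only; JTidy would replace this step. */
    static Document parse(String wellFormedHtml) {
        try {
            byte[] bytes = wellFormedHtml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body>"
                + "<div class=\"classname\">Some <a href=\"/page\">text</a></div>"
                + "<div class=\"other\">ignored</div>"
                + "</body></html>");
        for (HtmlDiv div : findDivs(doc, "classname")) {
            System.out.println(div.getText() + " -> " + div.getLink());
        }
    }
}
```

Compared with checking each line of HTML for a string, walking the DOM does not depend on how the markup is split across lines, and matching whole class tokens avoids false hits on classes that merely contain the name as a substring.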