Java HTML Parsing

Question

I'm working on an app that scrapes data from a website, and I was wondering how I should go about getting the data. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking for

div class = "classname"

in each line of HTML. This works, but I can't help feeling there is a better solution out there.

Is there a nice way to hand a line of HTML to a class and get some convenient methods, such as:

boolean usesClass(String CSSClassname);
String getText();
String getLink();

Recommended answer

Several years ago I used JTidy for the same purpose:

http://jtidy.sourceforge.net/

"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.

More information on JTidy can be found on the JTidy SourceForge project page."
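Since the quote says JTidy hands back a DOM for the document, the three methods from the question can be layered on top of a standard org.w3c.dom.Element. The following is only a minimal sketch under stated assumptions: the class name ScrapedElement and the helpers divsWithClass and parseWellFormed are hypothetical, and the demo parser is the JDK's own XML parser, which only accepts well-formed markup. With JTidy on the classpath, the Document would presumably come from its Tidy.parseDOM call instead; that part is not shown or tested here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical wrapper offering the methods sketched in the question. */
class ScrapedElement {
    private final Element element;

    ScrapedElement(Element element) {
        this.element = element;
    }

    /** True if the class attribute contains the name as a whole token. */
    boolean usesClass(String cssClassName) {
        for (String token : element.getAttribute("class").trim().split("\\s+")) {
            if (token.equals(cssClassName)) {
                return true;
            }
        }
        return false;
    }

    /** Concatenated text of all descendant text nodes, trimmed. */
    String getText() {
        StringBuilder out = new StringBuilder();
        collectText(element, out);
        return out.toString().trim();
    }

    /** href of the first nested <a> that has one, or null if none exists. */
    String getLink() {
        NodeList anchors = element.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                return anchor.getAttribute("href");
            }
        }
        return null;
    }

    // Walks the subtree by hand so it does not rely on DOM Level 3 support.
    private static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    /** All div elements in the document that carry the given CSS class. */
    static List<ScrapedElement> divsWithClass(Document doc, String cssClassName) {
        List<ScrapedElement> matches = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            ScrapedElement candidate = new ScrapedElement((Element) divs.item(i));
            if (candidate.usesClass(cssClassName)) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    /** Demo-only stand-in for an HTML parser: accepts well-formed markup only. */
    static Document parseWellFormed(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }
}
```

Matching the class name as a whole token avoids the false positives that a plain substring check on each line would produce, for example "item" matching "itemized". Real pages are rarely well-formed, which is why a tolerant parser such as JTidy would replace parseWellFormed in practice.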

