Problem description
I'm working on an app that scrapes data from a website, and I was wondering how I should go about getting the data. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking for
div class = "classname"
in each line of HTML. This works, but I can't help feeling there is a better solution out there.
Is there a nice way to hand a line of HTML to a class and get some convenient methods, such as:
boolean usesClass(String CSSClassname);
String getText();
String getLink();
Recommended answer
Several years ago I used JTidy for the same purpose:
http://jtidy.sourceforge.net/
"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.
More information on JTidy can be found on the JTidy SourceForge project page."
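As a rough sketch of how a DOM interface like the one described above could back the methods from the question: the class below wraps an org.w3c.dom.Element and offers usesClass, getText and getLink. The names HtmlElement, divsWithClass and parse are hypothetical and not part of JTidy. To keep the example runnable with only the JDK, parse uses the standard XML parser, which accepts well-formed markup only. With JTidy on the classpath, the Document would presumably come from Tidy.parseDOM(InputStream, OutputStream) instead; that substitution is an assumption and is not exercised here. Text is gathered with basic DOM calls rather than getTextContent, since not every DOM implementation supports the latter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical wrapper around one DOM element, mirroring the methods asked for in the question. */
class HtmlElement {
    private final Element element;

    HtmlElement(Element element) {
        this.element = element;
    }

    /** True if the class attribute contains the given name as a whole token. */
    boolean usesClass(String cssClassName) {
        String attr = element.getAttribute("class");
        if (attr == null || attr.trim().isEmpty()) {
            return false;
        }
        return Arrays.asList(attr.trim().split("\\s+")).contains(cssClassName);
    }

    /** Concatenated text of the element and all its descendants, trimmed. */
    String getText() {
        StringBuilder out = new StringBuilder();
        collectText(element, out);
        return out.toString().trim();
    }

    /** href of the first <a> inside the element, or null if there is none. */
    String getLink() {
        NodeList anchors = element.getElementsByTagName("a");
        if (anchors.getLength() == 0) {
            return null;
        }
        return ((Element) anchors.item(0)).getAttribute("href");
    }

    /** Collects every <div> in the document that carries the given CSS class. */
    static List<HtmlElement> divsWithClass(Document doc, String cssClassName) {
        List<HtmlElement> result = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            HtmlElement candidate = new HtmlElement((Element) divs.item(i));
            if (candidate.usesClass(cssClassName)) {
                result.add(candidate);
            }
        }
        return result;
    }

    /** Stand-in parser (assumption): the JDK XML parser, which needs well-formed input. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed markup", e);
        }
    }

    /** Depth-first walk that appends the value of every text node. */
    private static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }
}
```

Calling HtmlElement.divsWithClass(doc, "classname") on a parsed page returns one wrapper per matching div, which replaces the per-line string check from the question. Matching on whole class tokens avoids false hits such as "classname2" or a div that merely mentions the class name in its text.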