Problem description
I want to parse a simple web site and scrape information from it.
I have parsed XML files with DocumentBuilderFactory before, and I tried to do the same thing for the HTML file, but it always gets into an infinite loop.
// Download the page and write it to a local file
URL url = new URL("http://www.deneme.com");
URLConnection uc = url.openConnection();
InputStreamReader input = new InputStreamReader(uc.getInputStream());
BufferedReader in = new BufferedReader(input);
String inputLine;
FileWriter outFile = new FileWriter("orhancan");
PrintWriter out = new PrintWriter(outFile);

while ((inputLine = in.readLine()) != null) {
    out.println(inputLine);
}

in.close();
out.close();

// Parse the downloaded file as XML and count the <body> elements
File fXmlFile = new File("orhancan");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
NodeList prelist = doc.getElementsByTagName("body");
System.out.println(prelist.getLength());
What is the problem? Or is there an easier way to scrape data for a given HTML tag from a web site?
Recommended answer
There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Or if you want the body:
Elements body = doc.select("body");
Or if you want all links:
Elements links = doc.select("body a");
You no longer need to get connections or handle streams yourself. Simple. If you have ever used jQuery, it is very similar to that.
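Tying the snippets above together, here is a minimal, self-contained sketch of a complete program, assuming the jsoup library (org.jsoup) is on the classpath; the class name ScrapeExample and the choice to print each link's text and absolute URL are illustrative, not part of the original answer:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class ScrapeExample {
    public static void main(String[] args) throws IOException {
        // Fetch and parse the page in one call; jsoup handles the connection and the stream
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

        // CSS-style selector: every <a> element inside <body>
        Elements links = doc.select("body a");
        for (Element link : links) {
            // absUrl("href") resolves the link against the page's base URL
            System.out.println(link.text() + " -> " + link.absUrl("href"));
        }
    }
}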