Problem description
I'm writing a small crawler. What is the best way to download a web page's source HTML? I'm currently using the short piece of code below, but sometimes the result is only half of the page's source, and I don't know what the problem is. Some people suggested that I use Jsoup, but calling .get().html() from Jsoup also returns half of the page's source if the page is too long. Since I'm writing a crawler, it's very important that the method supports Unicode (UTF-8), and efficiency is also very important. I'm new to Java, so I'd like to know the best modern way to do this. Thanks.
Code:
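import java.net.URL;
import java.util.Scanner;

public class Downloader {
    // NL is not defined in the original snippet; assumed here to be the platform line separator.
    private static final String NL = System.lineSeparator();

    public static String downloadPage(String url)
    {
        try
        {
            URL pageURL = new URL(url);
            StringBuilder text = new StringBuilder();
            // Note: Scanner suppresses IOExceptions, so a failed read looks like end of input.
            Scanner scanner = new Scanner(pageURL.openStream(), "utf-8");
            try {
                // Read the page line by line, re-adding a line separator after each line.
                while (scanner.hasNextLine()){
                    text.append(scanner.nextLine()).append(NL);
                }
            }
            finally{
                scanner.close();
            }
            return text.toString();
        }
        catch(Exception ex)
        {
            // Any failure (malformed URL, connection error) is reported as null.
            return null;
        }
    }
}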
Recommended answer
Personally, I'm very pleased with the Apache HTTP library (http://hc.apache.org/httpcomponents-client-ga/). If you're writing a web crawler, as I am, you may greatly appreciate the control it gives over things like cookies and client sharing.
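For illustration, here is a minimal sketch of downloading a full page as UTF-8 text. Note that it does not use the Apache HTTP library recommended above; it swaps in the JDK's built-in java.net.http.HttpClient (Java 11 or later) so that it runs without extra dependencies. The class name PageDownloader, the helper readAll, and the timeout values are hypothetical choices made for this example, and the sketch assumes the page really is encoded as UTF-8. Unlike Scanner, which suppresses IOExceptions and can make a failed read look like the end of the page, this version lets read errors propagate to the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical helper class for this example, not part of any library.
final class PageDownloader {

    // One client shared by all requests, so connections can be reused.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10)) // assumed value
            .build();

    // Reads the stream until end of input and decodes the bytes as UTF-8.
    // An I/O failure is thrown to the caller instead of being swallowed.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        // Decode once at the end so multi-byte characters are never split.
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Downloads the page at the given URL and returns its source as a String.
    static String downloadPage(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30)) // assumed value
                .GET()
                .build();
        HttpResponse<InputStream> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofInputStream());
        try (InputStream body = response.body()) {
            if (response.statusCode() != 200) {
                throw new IOException(
                        "Unexpected HTTP status " + response.statusCode() + " for " + url);
            }
            return readAll(body);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PageDownloader.java <url>");
            return;
        }
        String html = downloadPage(args[0]);
        System.out.println(html.length() + " characters downloaded");
    }
}
```

Decoding the whole byte buffer in one step avoids corrupting multi-byte UTF-8 characters that straddle a chunk boundary. A real crawler would probably also want to honor the charset declared in the Content-Type header rather than assuming UTF-8.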