Java: the best way to download a web page's source HTML

Problem description

I'm writing a small crawler. What is the best way to download a web page's source HTML? I'm currently using the small piece of code below, but sometimes the result is only half of the page's source, and I don't know what the problem is. Some people suggested Jsoup, but Jsoup's .get().html() also returns only half of the page's source when the page is too long. Since this is for a crawler, it is very important that the method supports Unicode (UTF-8), and efficiency matters a lot as well. I'm new to Java, so I'd like to know the best modern way to do this. Thanks.

Code:

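import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

public static String downloadPage(String url) throws IOException
    {
        URL pageURL = new URL(url);
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the stream and the scanner, even on failure
        try (InputStream in = pageURL.openStream();
             Scanner scanner = new Scanner(in, "UTF-8"))
        {
            while (scanner.hasNextLine())
            {
                // NL was not defined in the original; a plain line feed is assumed
                text.append(scanner.nextLine()).append('\n');
            }
            // Scanner suppresses read errors and simply stops, so rethrow them;
            // otherwise a truncated page is returned as if it were complete
            if (scanner.ioException() != null)
            {
                throw scanner.ioException();
            }
        }
        return text.toString();
    }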

Recommended answer
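
One possible approach, offered as a minimal sketch rather than a definitive implementation: on Java 11 or later, the built-in java.net.http.HttpClient can read the whole response body as bytes and decode it as UTF-8 in one step, and it throws on a failed read instead of silently stopping early. The class name PageDownloader, the helper decodeUtf8, the User-Agent value, and the timeout values are assumptions made for illustration; none of them come from the question.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

class PageDownloader {

    // One shared client; HttpClient is designed to be reused across requests.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10))   // assumed value
            .build();

    // Hypothetical helper: decodes the raw response bytes as UTF-8.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Downloads the page at the given URL and returns its source as a String.
    static String downloadPage(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))              // assumed value
                .header("User-Agent", "example-crawler/0.1")  // assumed value
                .GET()
                .build();

        // ofByteArray() buffers the complete body before returning.
        HttpResponse<byte[]> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofByteArray());

        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode()
                    + " for " + url);
        }
        return decodeUtf8(response.body());
    }
}
```

Calling PageDownloader.downloadPage("https://example.com/") would then return the page source or throw, so a partial download is not mistaken for a complete one. This sketch assumes every page is UTF-8, as the question requires; a crawler that meets other encodings would need to read the charset from the Content-Type header. If Jsoup is preferred instead, its Connection.maxBodySize(int) setting limits how much of a response is read, so a cap there is one possible cause of the truncated output described in the question.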