Getting Web Elements with Jsoup


Question

I'm trying to use Jsoup to get stock data from the Morningstar website. I've looked at other forums but haven't been able to figure out what's wrong.

I'd like to do more advanced data scraping, but I can't even seem to get the price. I either get null back or nothing at all.

I'm aware of other languages and APIs, but I'd like to use Jsoup because it seems very capable.

Here's what I have so far:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Scrape {
    public static void main(String[] args) {
        String URL = "http://www.morningstar.com/stocks/xnas/aapl/quote.html";
        Document d = new Document(URL);
        try {
            d = Jsoup.connect(URL).get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // This is the lookup that comes back null or empty.
        Element stuff = d.select("#idPrice gr_text_bigprice").first();
        System.out.println("Price of AAPL: " + stuff);
    }
}

Any help would be appreciated.

Recommended Answer

Since the content is created dynamically using JavaScript, you could use a headless browser such as HtmlUnit (https://sourceforge.net/projects/htmlunit/).

The price and related information are embedded in an iframe, so we first grab the iframe link (which is also built dynamically) and then parse the iframe's content.

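import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ScrapeWithHtmlUnit {
    public static void main(String[] args) throws Exception {
        // Silence HtmlUnit's verbose logging.
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

        // Headless browser that runs the page's JavaScript and tolerates script and HTTP errors.
        final WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(1000);

        // Load the quote page and hand the rendered DOM to Jsoup.
        HtmlPage page = webClient.getPage("http://www.morningstar.com/stocks/xnas/aapl/quote.html");
        Document doc = Jsoup.parse(page.asXml());
        String title = doc.select(".r_title").select("h1").text();

        // The "http:" prefix assumes the iframe's src is protocol-relative (starts with "//").
        String iFramePath = "http:" + doc.select("#quote_quicktake").select("iframe").attr("src");

        // Load and parse the iframe, then read the last price from it.
        page = webClient.getPage(iFramePath);
        doc = Jsoup.parse(page.asXml());
        System.out.println(title + " | Last Price [$]: " + doc.select("#last-price-value").text());
    }
}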

Prints:

Apple Inc | Last Price [$]: 98.63
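
The code above builds the iframe URL by prepending "http:" to the src attribute, which only works while that attribute is protocol-relative. As a rough sketch of a more tolerant approach, the src could be resolved against the page URL with java.net.URI from the standard library. The class name IframeUrl and the quotes.example.com host are made up for illustration; they do not come from the original answer or from the Morningstar page.

```java
import java.net.URI;

// Hypothetical helper: turns an iframe src into an absolute URL.
final class IframeUrl {

    // Resolves src against the URL of the page that contains the iframe.
    // Handles protocol-relative ("//host/x"), absolute, and relative values.
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.morningstar.com/stocks/xnas/aapl/quote.html";
        // Protocol-relative src: inherits the page's scheme.
        System.out.println(resolve(page, "//quotes.example.com/frame.html?t=AAPL"));
        // Absolute src: returned unchanged.
        System.out.println(resolve(page, "https://quotes.example.com/frame.html"));
        // Relative src: resolved against the page's directory.
        System.out.println(resolve(page, "frame.html"));
    }
}
```

If you stay within Jsoup, a similar result should be obtainable by passing the page URL as the base URI, as in Jsoup.parse(html, baseUri), and then reading attr("abs:src") instead of attr("src").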

The JavaScript engine in HtmlUnit is rather slow (the code above takes about 18 seconds on my machine), so it might be worth looking into other JavaScript engines or headless browsers (PhantomJS, etc.; see this list of options: https://github.com/dhamaniasad/HeadlessBrowsers) to improve performance. Still, HtmlUnit gets the job done. You could also try filtering out irrelevant scripts, images, etc. with a custom WebConnectionWrapper.
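
The answer's own WebConnectionWrapper code is not included here, so what follows is only a sketch of the decision such a wrapper might make: given a request URL, should it be skipped? It uses only the standard library. The class name RequestFilter, the extension list, and the example.com hosts are all assumptions chosen for illustration, not values taken from the answer.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical filter: decides which requests a scraper could skip.
final class RequestFilter {

    // Assumed list of resource types that are not needed to read a price.
    private static final List<String> BLOCKED_EXTENSIONS =
            Arrays.asList(".png", ".jpg", ".jpeg", ".gif", ".svg", ".woff", ".woff2");

    // Assumed list of hosts serving scripts that are irrelevant to the data.
    private static final List<String> BLOCKED_HOSTS =
            Arrays.asList("ads.example.com", "analytics.example.com");

    // Returns true when the URL points at a blocked host or a blocked file type.
    static boolean shouldBlock(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // Unparseable URL: let the browser deal with it.
        }
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
        for (String blocked : BLOCKED_HOSTS) {
            if (host.equals(blocked) || host.endsWith("." + blocked)) {
                return true;
            }
        }
        for (String ext : BLOCKED_EXTENSIONS) {
            if (path.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldBlock("http://www.morningstar.com/images/logo.png"));
        System.out.println(shouldBlock("http://www.morningstar.com/stocks/xnas/aapl/quote.html"));
    }
}
```

In HtmlUnit, a predicate like this would presumably be called from an overridden getResponse(WebRequest) in a WebConnectionWrapper subclass, which either returns a stub response for blocked URLs or delegates to super.getResponse(request). Be careful about blocking scripts too aggressively, since the page and its iframe depend on JavaScript to render the price.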
