
Question

I'm trying to use Jsoup to get stock data from a website called Morningstar. I've looked at other forums but haven't been able to find out what's wrong.

I'm trying to do more advanced scraping of data, but I can't even get the price. I either get null back or nothing at all.

I am aware of other languages and APIs, but I'd like to use Jsoup as it seems to be very capable.

Here's what I have so far:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Scrape {
    public static void main(String[] args){
        String URL = "http://www.morningstar.com/stocks/xnas/aapl/quote.html";
        Document d = new Document(URL);
        try{
            d = Jsoup.connect(URL).get();
        }catch(IOException e){
            e.printStackTrace();
        }
        Element stuff = d.select("#idPrice gr_text_bigprice").first();
        System.out.println("Price of AAPL: " + stuff);
    }
}

Any help would be appreciated.

Recommended answer

Jsoup only parses the HTML the server returns and does not execute JavaScript. Since the content here is created dynamically using JavaScript, you could use a headless browser like HtmlUnit (https://sourceforge.net/projects/htmlunit/).

The price information is embedded in an iframe, so we first grab the iframe link (which is also built dynamically) and then parse the iframe's content.

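import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Scrape {
    public static void main(String[] args) throws Exception {
        // Silence HtmlUnit's verbose logging
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

        final WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(1000);

        // Load the quote page with HtmlUnit, then hand the resulting markup to Jsoup
        HtmlPage page = webClient.getPage("http://www.morningstar.com/stocks/xnas/aapl/quote.html");
        Document doc = Jsoup.parse(page.asXml());
        String title = doc.select(".r_title").select("h1").text();

        // Build the iframe URL by prepending the scheme to the iframe's src attribute
        String iFramePath = "http:" + doc.select("#quote_quicktake").select("iframe").attr("src");

        // Load and parse the iframe, which contains the price
        page = webClient.getPage(iFramePath);
        doc = Jsoup.parse(page.asXml());

        System.out.println(title + " | Last Price [$]: " + doc.select("#last-price-value").text());
    }
}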

Prints:

Apple Inc | Last Price [$]: 98.63
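
The code above builds the iframe URL by prepending "http:" to the src attribute, which only works while that attribute has one particular form. As a minimal sketch of a more tolerant alternative, the src value could be resolved against the page URL with java.net.URI. The class name IframeUrls, the method resolve, and the example.com src values are hypothetical and not part of the original answer:

```java
import java.net.URI;

// Hypothetical helper (not from the original answer): resolves an iframe
// "src" value against the URL of the page that contained it.
class IframeUrls {

    static String resolve(String pageUrl, String src) {
        // URI.resolve accepts absolute URLs, protocol-relative references
        // ("//host/path") and path-relative references alike.
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.morningstar.com/stocks/xnas/aapl/quote.html";
        // The src values below are made-up placeholders, not real Morningstar paths.
        System.out.println(resolve(page, "//quotes.example.com/frame.html"));
        System.out.println(resolve(page, "frame.html"));
        System.out.println(resolve(page, "https://quotes.example.com/frame.html"));
    }
}
```

With a helper like this, the result would be passed to webClient.getPage in place of the concatenated string.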

The JavaScript engine in HtmlUnit is rather slow (the code above takes about 18 seconds on my machine), so it may be worth looking into other JavaScript engines or headless browsers (PhantomJS, etc.; see this list of options: https://github.com/dhamaniasad/HeadlessBrowsers) to improve performance, but HtmlUnit gets the job done. You could also try to filter out non-relevant scripts, images, etc. with a custom WebConnectionWrapper.
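
The original wrapper code is not reproduced here. As a rough sketch of the filtering decision only, the helper below decides from a request URL whether a resource could be skipped. The class name ResourceFilter, the extension list, and the host list are all assumptions made for illustration. Which scripts are really irrelevant for this page would have to be found by experiment, since the price itself depends on JavaScript:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: decides whether a requested resource can be skipped.
class ResourceFilter {

    // Assumed extensions of resources that are not needed to read the price
    private static final List<String> SKIPPED_EXTENSIONS =
            Arrays.asList(".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".css");

    // Assumed example of a host that only serves tracking scripts
    private static final List<String> SKIPPED_HOSTS =
            Arrays.asList("www.google-analytics.com");

    static boolean isSkippable(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // unparsable URL: let the request through
        }
        String host = uri.getHost();
        if (host != null && SKIPPED_HOSTS.contains(host.toLowerCase(Locale.ROOT))) {
            return true;
        }
        String path = uri.getPath();
        if (path == null) {
            return false;
        }
        // Compare against the path only, so query strings do not hide the extension
        String lower = path.toLowerCase(Locale.ROOT);
        for (String ext : SKIPPED_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSkippable("http://example.com/img/logo.PNG?v=3"));
        System.out.println(isSkippable("http://example.com/app.js"));
    }
}
```

In HtmlUnit, a check like this would typically be called from an overridden getResponse(WebRequest) in a WebConnectionWrapper subclass, which returns an empty response for skippable URLs and delegates to super.getResponse(request) for everything else.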
