本文介绍了抓取谷歌财经(BeautifulSoup)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试抓取 Google 财经,并根据 Chrome 中的网页检查器获取相关股票"表,该表的 ID 为cc-table",类为gf-table".(示例链接:https://www.google.com/finance?q=tsla)

I'm trying to scrape Google Finance, and get the "Related Stocks" table, which has id "cc-table" and class "gf-table" based on the webpage inspector in Chrome. (Sample Link: https://www.google.com/finance?q=tsla)

但是当我运行 .find("table") 或 .findAll("table") 时,这个表没有出现.我可以在 Python 的 HTML 内容中找到具有表格内容的 JSON 外观对象,但不知道如何获取它.有什么想法吗?

But when I run .find("table") or .findAll("table"), this table does not come up. I can find JSON-looking objects with the table's contents in the HTML content in Python, but do not know how to get it. Any ideas?

推荐答案

页面使用 JavaScript 呈现.有几种方法可以渲染和抓取它.

The page is rendered with JavaScript. There are several ways to render and scrape it.

我可以用 Selenium 刮它.首先安装 Selenium:

I can scrape it with Selenium.First install Selenium:

sudo pip3 install selenium

然后获取驱动程序 https://sites.google.com/a/chromium.org/chromedriver/downloads

import bs4 as bs
from selenium import webdriver
browser = webdriver.Chrome()
url = ("https://www.google.com/finance?q=tsla")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())

或者 PyQt5

from PyQt5.QtGui import *
from PyQt5.QtCore import *
from PyQt5.QtWebKit import *
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
import bs4 as bs
import sys

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = "https://www.google.com/finance?q=tsla"
r = Render(url)
result = r.frame.toHtml()
soup = bs.BeautifulSoup(result,'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())

或者干刮

import bs4 as bs
import dryscrape

url = "https://www.google.com/finance?q=tsla"
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())

所有输出:

Valuation▲▼Company name▲▼Price▲▼Change▲▼Chg %▲▼d | m | y▲▼Mkt Cap▲▼TSLATesla Inc328.40-1.52-0.46%53.69BDDAIFDaimler AG72.94-1.50-2.01%76.29BFFord Motor Company11.53-0.17-1.45%45.25BGMGeneral Motors Co...36.07-0.34-0.93%53.93BRNSDFRENAULT SA EUR3.8197.000.000.00%28.69BHMCHonda Motor Co Lt...27.52-0.18-0.65%49.47BAUDVFAUDI AG NPV840.400.000.00%36.14BTMToyota Motor Corp...109.31-0.53-0.48%177.79BBAMXFBAYER MOTOREN WER...94.57-2.41-2.48%56.93BNSANYNissan Motor Co L...20.400.000.00%42.85BMMTOFMITSUBISHI MOTOR ...6.86+0.091.26%10.22B

编辑

QtWebKit 在 Qt 5.5 上游被弃用,并在 5.6 中被移除.

QtWebKit got deprecated upstream in Qt 5.5 and removed in 5.6.

你可以切换到 PyQt5.QtWebEngineWidgets

You can switch to PyQt5.QtWebEngineWidgets

这篇关于抓取谷歌财经(BeautifulSoup)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-11 13:08