Problem description
I'm trying to scrape Google Finance and get the "Related Stocks" table, which, according to the web inspector in Chrome, has id "cc-table" and class "gf-table". (Sample link: https://www.google.com/finance?q=tsla)
But when I run .find("table") or .findAll("table"), this table does not come up. I can see JSON-looking objects containing the table's contents in the HTML I fetch in Python, but I do not know how to extract them. Any ideas?
Recommended answer
The page is rendered with JavaScript, so the table is not present in the raw HTML that a plain HTTP request returns. There are several ways to render the page and then scrape it.
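To make the cause concrete, here is a minimal sketch of what a static parser sees on a page like this. The HTML string is a made-up stand-in, not the real Google Finance source, and the `TagCounter` class is a hypothetical helper. It uses the standard library's `html.parser` instead of BeautifulSoup only to stay dependency-free; the effect is the same.

```python
from html.parser import HTMLParser

# Hypothetical page source: the table is built by a script at run time,
# so the markup sent by the server contains no table element at all.
STATIC_HTML = """
<html><body>
<div id="related"></div>
<script>
  var rows = [{"s": "TSLA", "p": "328.40"}];
  // a browser would run this and insert the cc-table element into #related
</script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Count start tags by name, the way a non-rendering parser sees them."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(STATIC_HTML)
table_count = counter.counts.get("table", 0)
script_count = counter.counts.get("script", 0)
print("tables:", table_count, "scripts:", script_count)
```

The script body is treated as opaque text, so the row data is visible in the source but no table tag is ever found. That matches the symptom in the question, and it is why the options that follow all run the page in a real browser engine first.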
I can scrape it with Selenium. First install Selenium:
sudo pip3 install selenium
Then get the driver: https://sites.google.com/a/chromium.org/chromedriver/downloads
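import bs4 as bs
from selenium import webdriver

# Start a Chrome session (needs the chromedriver obtained above)
browser = webdriver.Chrome()
url = "https://www.google.com/finance?q=tsla"
browser.get(url)
# page_source is the DOM serialized by the browser, after scripts have run
html_source = browser.page_source
browser.quit()

# Parse the rendered HTML and print the text of the "Related Stocks" table
soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())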
Or PyQt5:
import sys

import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication


class Render(QWebPage):
    """Load a URL in QtWebKit and block until the page has finished loading."""

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Run the Qt event loop until _loadFinished() stops it
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the rendered frame so the caller can read its HTML
        self.frame = self.mainFrame()
        self.app.quit()


url = "https://www.google.com/finance?q=tsla"
r = Render(url)
result = r.frame.toHtml()
soup = bs.BeautifulSoup(result, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
Or dryscrape:
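import bs4 as bs
import dryscrape

url = "https://www.google.com/finance?q=tsla"
# dryscrape drives a headless WebKit session that executes the page's scripts
session = dryscrape.Session()
session.visit(url)
# body() returns the HTML of the page as rendered in that session
dsire_get = session.body()

soup = bs.BeautifulSoup(dsire_get, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())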
Output from all of them:
Valuation▲▼Company name▲▼Price▲▼Change▲▼Chg %▲▼d | m | y▲▼Mkt Cap▲▼TSLATesla Inc328.40-1.52-0.46%53.69BDDAIFDaimler AG72.94-1.50-2.01%76.29BFFord Motor Company11.53-0.17-1.45%45.25BGMGeneral Motors Co...36.07-0.34-0.93%53.93BRNSDFRENAULT SA EUR3.8197.000.000.00%28.69BHMCHonda Motor Co Lt...27.52-0.18-0.65%49.47BAUDVFAUDI AG NPV840.400.000.00%36.14BTMToyota Motor Corp...109.31-0.53-0.48%177.79BBAMXFBAYER MOTOREN WER...94.57-2.41-2.48%56.93BNSANYNissan Motor Co L...20.400.000.00%42.85BMMTOFMITSUBISHI MOTOR ...6.86+0.091.26%10.22B
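The output runs together because get_text() on the whole table concatenates every cell with no separator. Here is a minimal sketch of splitting it into rows and cells instead. The sample markup is an assumption about how the table might be laid out (only the cell values are taken from the output above), and `RowCollector` is a hypothetical helper built on the standard library's `html.parser` so the example runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical markup; the cell values come from the first rows of the output above.
SAMPLE = """
<table id="cc-table">
<tr><td>TSLA</td><td>Tesla Inc</td><td>328.40</td><td>-1.52</td><td>-0.46%</td><td>53.69B</td></tr>
<tr><td>F</td><td>Ford Motor Company</td><td>11.53</td><td>-0.17</td><td>-1.45%</td><td>45.25B</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of each td/th cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


collector = RowCollector()
collector.feed(SAMPLE)
for row in collector.rows:
    print(" | ".join(row))
```

With BeautifulSoup the same idea would be to loop over el.find_all("tr") and call get_text(strip=True) on each cell from row.find_all(["td", "th"]), rather than calling get_text() once on the table.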
Edit
QtWebKit was deprecated upstream in Qt 5.5 and removed in 5.6.
You can switch to PyQt5.QtWebEngineWidgets.
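A rough sketch of what that switch could look like, untested against the live page. The function name `render_html` is my own, and I am assuming the QtWebEngine bindings are installed (on recent PyQt5 releases they ship separately as the PyQtWebEngine package). The main difference from the QtWebKit version is that QWebEnginePage.toHtml() is asynchronous and hands the HTML to a callback.

```python
import sys


def render_html(url):
    """Load url in QtWebEngine and return the page HTML once loading finishes."""
    # Imported here so that merely defining the function does not require Qt.
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    from PyQt5.QtWidgets import QApplication

    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebEnginePage()
    captured = {}

    def store_html(html):
        # Called by Qt with the serialized page; stop the event loop afterwards
        captured["html"] = html
        app.quit()

    def on_load_finished(ok):
        # toHtml() returns immediately and delivers the result to the callback
        page.toHtml(store_html)

    page.loadFinished.connect(on_load_finished)
    page.load(QUrl(url))
    app.exec_()  # blocks until store_html() calls app.quit()
    return captured.get("html", "")
```

The string returned by render_html("https://www.google.com/finance?q=tsla") can then go through bs.BeautifulSoup(..., "lxml") and the same find_all loop as before. Note that loadFinished only signals that the document has loaded; if the table is filled in by a later script, a short delay before calling toHtml() may be needed.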