问题描述
我正在尝试抓取Google财经,并获取相关股票"表,该表基于Chrome中的网页检查器具有ID"cc-table"和类"gf-table". (示例链接: https://www.google.com/finance?q=tsla)
I'm trying to scrape Google Finance, and get the "Related Stocks" table, which has id "cc-table" and class "gf-table" based on the webpage inspector in Chrome. (Sample Link: https://www.google.com/finance?q=tsla)
但是当我运行.find("table")或.findAll("table")时,此表不会显示.我可以在Python的HTML内容中找到带有表内容的JSON外观对象,但不知道如何获取它.有什么想法吗?
But when I run .find("table") or .findAll("table"), this table does not come up. I can find JSON-looking objects with the table's contents in the HTML content in Python, but do not know how to get it. Any ideas?
推荐答案
该页面使用JavaScript呈现.有几种方法可以渲染和刮取它.
The page is rendered with JavaScript. There are several ways to render and scrape it.
我可以用硒刮它.首先安装Selenium:
I can scrape it with Selenium.First install Selenium:
sudo pip3 install selenium
然后获取驱动程序 https://sites.google.com/a/chromium.org/chromedriver/downloads
import bs4 as bs
from selenium import webdriver
browser = webdriver.Chrome()
url = ("https://www.google.com/finance?q=tsla")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
print(el.get_text())
或者 PyQt5
from PyQt5.QtGui import *
from PyQt5.QtCore import *
from PyQt5.QtWebKit import *
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
import bs4 as bs
import sys
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
url = "https://www.google.com/finance?q=tsla"
r = Render(url)
result = r.frame.toHtml()
soup = bs.BeautifulSoup(result,'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
print(el.get_text())
或者 Dryscrape
import bs4 as bs
import dryscrape
url = "https://www.google.com/finance?q=tsla"
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
print(el.get_text())
所有输出:
Valuation▲▼Company name▲▼Price▲▼Change▲▼Chg %▲▼d | m | y▲▼Mkt Cap▲▼TSLATesla Inc328.40-1.52-0.46%53.69BDDAIFDaimler AG72.94-1.50-2.01%76.29BFFord Motor Company11.53-0.17-1.45%45.25BGMGeneral Motors Co...36.07-0.34-0.93%53.93BRNSDFRENAULT SA EUR3.8197.000.000.00%28.69BHMCHonda Motor Co Lt...27.52-0.18-0.65%49.47BAUDVFAUDI AG NPV840.400.000.00%36.14BTMToyota Motor Corp...109.31-0.53-0.48%177.79BBAMXFBAYER MOTOREN WER...94.57-2.41-2.48%56.93BNSANYNissan Motor Co L...20.400.000.00%42.85BMMTOFMITSUBISHI MOTOR ...6.86+0.091.26%10.22B
编辑
QtWebKit在Qt 5.5中被上游弃用,在5.6中被删除.
QtWebKit got deprecated upstream in Qt 5.5 and removed in 5.6.
您可以切换到PyQt5.QtWebEngineWidgets
You can switch to PyQt5.QtWebEngineWidgets
这篇关于刮Google财经(BeautifulSoup)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!