Scraping Google Finance (BeautifulSoup)

Problem description

I'm trying to scrape Google Finance and get the "Related Stocks" table, which has id "cc-table" and class "gf-table" according to the web inspector in Chrome. (Sample link: https://www.google.com/finance?q=tsla)

But when I run .find("table") or .findAll("table"), this table does not come up. I can find JSON-looking objects with the table's contents in the HTML content in Python, but I do not know how to extract them. Any ideas?

Recommended answer

The page is rendered with JavaScript, so the table is not present in the raw HTML that BeautifulSoup receives. There are several ways to render the page first and then scrape it.
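
As an aside on the "JSON-looking objects" mentioned in the question: when the data is embedded in an inline script as a valid JSON literal, it can sometimes be read without rendering the page at all. The following is only a minimal sketch under that assumption. The sample HTML, the variable name `relatedRows`, and the helper `extract_embedded_json` are hypothetical and do not reflect the actual Google Finance markup, which may use JavaScript object syntax that is not valid JSON.

```python
import json
import re

# Hypothetical stand-in for the raw (unrendered) page source; the real
# Google Finance markup and variable names may differ.
SAMPLE_HTML = """
<html><body>
<div id="related"></div>
<script>
var relatedRows = [{"symbol": "TSLA", "name": "Tesla Inc", "price": "328.40"},
                   {"symbol": "F", "name": "Ford Motor Company", "price": "11.53"}];
renderTable(relatedRows);
</script>
</body></html>
"""


def extract_embedded_json(html, var_name):
    """Return the JSON value assigned to `var_name` in the HTML, or None."""
    # Find "<var_name> = " and note where the assigned value starts
    match = re.search(r"\b" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever text follows it
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return value


rows = extract_embedded_json(SAMPLE_HTML, "relatedRows")
for row in rows:
    print(row["symbol"], row["name"], row["price"])
```

If the embedded data is not strict JSON (unquoted keys, JavaScript escapes), this approach fails and rendering the page is the more dependable route.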

I can scrape it with Selenium. First install Selenium:

sudo pip3 install selenium

Then download the Chrome driver from https://sites.google.com/a/chromium.org/chromedriver/downloads and run:

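import bs4 as bs
from selenium import webdriver

# Launch Chrome through chromedriver and let it run the page's JavaScript
browser = webdriver.Chrome()
url = "https://www.google.com/finance?q=tsla"
browser.get(url)
html_source = browser.page_source  # HTML of the rendered DOM
browser.quit()

# Parse the rendered HTML and print the text of the "cc-table" table
soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())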
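
Reading page_source immediately after get() can race with scripts that build the table later. A possible variation, sketched here rather than taken from the original answer, runs Chrome headless and waits explicitly until the element exists. The function name `fetch_rendered_html` and the 10-second default timeout are illustrative assumptions.

```python
def fetch_rendered_html(url, element_id, timeout=10):
    """Load `url` in headless Chrome and return the page source once an
    element with id `element_id` is present in the DOM."""
    # Imported inside the function so this module can be loaded
    # even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # no visible browser window
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        # Raises TimeoutException if the element never appears
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.ID, element_id))
        )
        return browser.page_source
    finally:
        # Always close the browser, even if the wait times out
        browser.quit()
```

The returned string can be passed to bs.BeautifulSoup in the same way as html_source, for example after calling fetch_rendered_html("https://www.google.com/finance?q=tsla", "cc-table").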

Or with PyQt5:

import sys

import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication

class Render(QWebPage):
    """Load a URL in QtWebKit and block until the page has finished loading."""

    def __init__(self, url):
        # A QApplication must exist before any QWebPage is created
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # run the event loop until _loadFinished quits it

    def _loadFinished(self, result):
        # Keep the loaded frame so the caller can read its HTML
        self.frame = self.mainFrame()
        self.app.quit()

url = "https://www.google.com/finance?q=tsla"
r = Render(url)
result = r.frame.toHtml()
soup = bs.BeautifulSoup(result, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())

Or with Dryscrape:

import bs4 as bs
import dryscrape

url = "https://www.google.com/finance?q=tsla"
# Open a headless WebKit session and load the page, running its JavaScript
session = dryscrape.Session()
session.visit(url)
html_source = session.body()  # HTML of the rendered page
soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())

All three produce the same output:

Valuation▲▼Company name▲▼Price▲▼Change▲▼Chg %▲▼d | m | y▲▼Mkt Cap▲▼TSLATesla Inc328.40-1.52-0.46%53.69BDDAIFDaimler AG72.94-1.50-2.01%76.29BFFord Motor Company11.53-0.17-1.45%45.25BGMGeneral Motors Co...36.07-0.34-0.93%53.93BRNSDFRENAULT SA EUR3.8197.000.000.00%28.69BHMCHonda Motor Co Lt...27.52-0.18-0.65%49.47BAUDVFAUDI AG NPV840.400.000.00%36.14BTMToyota Motor Corp...109.31-0.53-0.48%177.79BBAMXFBAYER MOTOREN WER...94.57-2.41-2.48%56.93BNSANYNissan Motor Co L...20.400.000.00%42.85BMMTOFMITSUBISHI MOTOR ...6.86+0.091.26%10.22B
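
Calling get_text() on the whole table runs every cell together, as the output above shows. One way to keep the columns apart is to walk the rows and cells individually. This is a minimal sketch: the sample markup is a simplified, hypothetical stand-in for the rendered table (its values are copied from the output above, its structure and the "Symbol" header are assumed), and `table_to_rows` is an illustrative helper name. It uses the built-in "html.parser" backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the rendered "cc-table" markup;
# the values come from the output above, the structure is assumed.
SAMPLE_HTML = """
<table id="cc-table" class="gf-table">
  <tr><th>Symbol</th><th>Company name</th><th>Price</th><th>Change</th><th>Chg %</th><th>Mkt Cap</th></tr>
  <tr><td>TSLA</td><td>Tesla Inc</td><td>328.40</td><td>-1.52</td><td>-0.46%</td><td>53.69B</td></tr>
  <tr><td>F</td><td>Ford Motor Company</td><td>11.53</td><td>-0.17</td><td>-1.45%</td><td>45.25B</td></tr>
</table>
"""


def table_to_rows(html, table_id):
    """Return the table with the given id as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Collect header and data cells, trimming surrounding whitespace
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


for row in table_to_rows(SAMPLE_HTML, "cc-table"):
    print(" | ".join(row))
```

With the real page, the rendered HTML from any of the three approaches would take the place of SAMPLE_HTML.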

Edit

QtWebKit was deprecated upstream in Qt 5.5 and removed in 5.6.

You can switch to PyQt5.QtWebEngineWidgets instead.
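
A rough sketch of what that switch might look like follows. It is not from the original answer and has not been checked against the Google Finance page. The function name `render_html` is an illustrative assumption. The main difference from the QtWebKit version is that QWebEnginePage.toHtml() is asynchronous and delivers the HTML to a callback, and loadFinished may fire before late-running scripts have finished.

```python
import sys


def render_html(url):
    """Load `url` with QtWebEngine and return the page HTML as a string
    (empty if loading failed)."""
    # Imported inside the function so this module can be loaded even where
    # PyQt5 with QtWebEngine is not installed. QtWebEngineWidgets has to be
    # imported before the QApplication is created.
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    from PyQt5.QtWidgets import QApplication

    app = QApplication(sys.argv)
    page = QWebEnginePage()
    captured = {}

    def store_html(html):
        # Callback invoked by toHtml() once the HTML is available
        captured["html"] = html
        app.quit()

    def on_load_finished(ok):
        if ok:
            page.toHtml(store_html)  # asynchronous: result goes to store_html
        else:
            app.quit()

    page.loadFinished.connect(on_load_finished)
    page.load(QUrl(url))
    app.exec_()  # run the event loop until one of the callbacks quits it
    return captured.get("html", "")
```

The returned string can then be parsed with bs.BeautifulSoup(html, "lxml") and searched for the "cc-table" table as before.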


09-05 09:58