This article describes how to deal with the BeautifulSoup parser failing to find HTML elements.

Problem Description

I am trying to scrape the hrefs of all the listings. I am fairly new to BeautifulSoup and have only done a bit of scraping before, but I can't for the life of me extract them. See my code below: the container has length zero when I run this script.

I also try to select the price (soup.findAll("span", {"class": "amount"})), but it doesn't show up either. Any advice most welcome :)

import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.takealot.com/computers/laptops-10130'
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"}
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)

# decode the bytes instead of str(), which would wrap them in "b'...'"
respData = resp.read().decode('utf-8')

soup = BeautifulSoup(respData, 'html.parser')

container = soup.find_all("div", {"class": "p-data left"})  # -> [] (length zero)

Recommended Answer

The page is rendered with JavaScript. There are several ways to render and scrape it.
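A quick way to see why find_all returns nothing: BeautifulSoup only sees the HTML the server sends, and for a JavaScript-rendered page that initial shell does not yet contain the product markup. A minimal sketch (the static snippet below is invented for illustration):

```python
from bs4 import BeautifulSoup

# What the server actually sends is a shell page; the product grid is
# injected later by JavaScript, so the target divs are simply absent.
static_html = """
<html><body>
  <div id="app"><!-- listings injected by JavaScript at runtime --></div>
</body></html>
"""

soup = BeautifulSoup(static_html, 'html.parser')
containers = soup.find_all("div", {"class": "p-data left"})
print(len(containers))  # 0: the parser is fine, the markup just is not there yet
```

So the fix is not a different parser but a tool that executes the page's JavaScript first.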

I can scrape it with Selenium. First install Selenium:

sudo pip3 install selenium

Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. If you are on Windows or Mac, you can use "Chrome Canary", a build of Chrome that supports headless mode.

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.takealot.com/computers/laptops-10130'
browser.get(url)
respData = browser.page_source  # HTML after JavaScript has run
browser.quit()

soup = BeautifulSoup(respData, 'html.parser')
containers = soup.find_all("div", {"class": "p-data left"})
for container in containers:
    print(container.text)
    print(container.find("span", {"class": "amount"}).text)

Or with PyQt5:

from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage  # note: QtWebKit was removed in Qt 5.6, so this needs an older PyQt5 build
from PyQt5.QtWidgets import QApplication
from bs4 import BeautifulSoup
import sys


class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'https://www.takealot.com/computers/laptops-10130'
r = Render(url)
respData = r.frame.toHtml()
soup = BeautifulSoup(respData, 'html.parser')
containers = soup.find_all("div", {"class": "p-data left"})
for container in containers:
    print(container.text)
    print(container.find("span", {"class": "amount"}).text)

Or with dryscrape:

from bs4 import BeautifulSoup
import dryscrape

url = 'https://www.takealot.com/computers/laptops-10130'
session = dryscrape.Session()
session.visit(url)
respData = session.body()
soup = BeautifulSoup(respData, 'html.parser')
containers = soup.find_all("div", {"class": "p-data left"})
for container in containers:
    print(container.text)
    print(container.find("span", {"class": "amount"}).text)

The output in all cases:

Dell Inspiron 3162 Intel Celeron 11.6" Wifi Notebook (Various Colours)11.6 Inch Display; Wifi Only (Red; White & Blue Available)R 3,999R 4,999i20% OffeB 39,990Discovery Miles 39,990On Credit: R 372 / monthi
3,999
HP 250 G5 Celeron N3060 Notebook - Dark ash silverNBHPW4M70EAR 4,499R 4,999ieB 44,990Discovery Miles 44,990On Credit: R 419 / monthiIn StockShippingThis item is in stock in our CPT warehouse and can be shipped from there. You can also collect it yourself from our warehouse during the week or over weekends.CPT | ShippingThis item is in stock in our JHB warehouse and can be shipped from there. No collection facilities available, sorry!JHBWhen do I get it?
4,499
Asus Vivobook ...

However, when testing with your URL, I found the result was not reproducible every time; sometimes I got no content in "containers" after the page had rendered.
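One defensive pattern against that flakiness is to re-render and re-parse until listings actually appear. A sketch: fetch_html is a hypothetical caller-supplied callable standing in for whichever rendering method above you use, and the demonstration feeds it canned HTML instead of hitting the network:

```python
from bs4 import BeautifulSoup

def extract_hrefs(html):
    """Parse rendered HTML and return the href of every listing link."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.select("div.p-data.left a[href]")]

def scrape_with_retry(fetch_html, attempts=3):
    """fetch_html renders the page and returns its HTML; retry a few
    times until some listings are actually found."""
    for _ in range(attempts):
        hrefs = extract_hrefs(fetch_html())
        if hrefs:
            return hrefs
    return []

# Demonstration with canned HTML: the first render comes back empty, the
# second contains one listing, mimicking the flaky behaviour described above.
renders = iter([
    "<html><body></body></html>",
    '<html><body><div class="p-data left">'
    '<a href="/p/dell-inspiron-3162">Dell Inspiron 3162</a></div></body></html>',
])
result = scrape_with_retry(lambda: next(renders))
print(result)  # ['/p/dell-inspiron-3162']
```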
