Problem Description
I'm new to web scraping and I am trying to practice basic skills on Amazon. I want to write a script that finds the top 10 'Today's Greatest Deals' along with their prices, ratings, and other information.
Every time I try to find a specific tag using find() and specifying its class, it keeps returning None, even though the actual HTML contains that tag. On manual inspection I found that about half of the page's HTML is not shown in the output terminal: the body and html tags do close, but a huge chunk of code inside the body tag is missing.
The last line of HTML displayed is:
<!--[endif]---->
and then the body tag closes.
Here is the code I'm trying:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals')
soup = bs(source.text, 'html.parser')
print(soup.prettify())
#On printing this it misses some portion of html
article = soup.find('div', class_ = 'a-row dealContainer dealTile')
print(article)
#On printing this it shows 'None'
Ideally, this should give me the code inside the div tag so that I can go on to get the name of the product. However, the output just shows None, and printing the whole page without tags shows that a huge chunk of the HTML is missing.
And of course, the information I need is in the missing HTML.
Is Amazon blocking my request? Please help.
Recommended Answer
(Source: http://go-colly.org/articles/scraping_related_http_headers/)
The only thing you need to do is set a legitimate user agent, so add headers that emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
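You can verify what will actually be sent without making any network call by building a prepared request and inspecting its headers. By default, requests identifies itself with a python-requests user agent, which is easy for a site to flag as a script:

```python
import requests

# The default identity requests sends when no User-Agent is set:
print(requests.utils.default_user_agent())  # e.g. python-requests/2.x.y

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

# Build the request without sending it and inspect the headers it would carry.
prepared = requests.Request('GET', 'https://www.amazon.in/gp/goldbox', headers=headers).prepare()
print(prepared.headers['User-Agent'])  # the Chrome string above
```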
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
...
<your code here>
Additionally, you can send a fuller set of headers so the request looks even more like a legitimate browser. For example:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip',
'DNT' : '1', # Do Not Track Request Header
'Connection' : 'close'
}
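The find() logic itself can be checked offline against a small HTML snippet. Note that the class names below are the ones from the question and may not match Amazon's current markup, which changes frequently:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one deal tile; the class names come from the question
# and may not match what Amazon serves today.
html = '''
<div class="a-row dealContainer dealTile">
    <span class="dealTitle">Sample product</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Passing the full multi-class string to class_ matches the exact attribute value.
article = soup.find('div', class_='a-row dealContainer dealTile')
print(article.find('span').text)  # Sample product
```

If find() still returns None with the browser headers set, compare the classes in the downloaded HTML against the ones you are searching for, since the page served to a script can differ from what the browser's inspector shows.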