Problem Description
I'm new to web scraping and I am trying to practice basic skills on Amazon. I want to write a script that finds the top 10 'Today's Greatest Deals' along with their prices, ratings, and other information.
Every time I try to find a specific tag using find() and specifying its class, it keeps returning None, even though the actual HTML contains that tag. On manual inspection I found that about half of the page's HTML is not shown in the output terminal: the body and html tags do close, but a huge chunk of code inside the body tag is missing.
The last line of HTML displayed is:
<!--[endif]---->
and then the body tag closes.
Here is the code I'm trying:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals')
soup = bs(source.text, 'html.parser')
print(soup.prettify())
#On printing this it misses some portion of html
article = soup.find('div', class_ = 'a-row dealContainer dealTile')
print(article)
#On printing this it shows 'None'
Ideally, this should give me the code inside the div tag so that I can go on to get the name of the product. However, the output just shows None, and printing the whole page without tags shows that a huge chunk of the HTML is missing.
And of course, the information I need is in the missing HTML.
Is Amazon blocking my request? Please help.
Recommended Answer
(Source: http://go-colly.org/articles/scraping_related_http_headers/)
The only thing you need to do is set a legitimate user agent, so add headers that emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
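You can verify what will actually be sent without making any network call by building a prepared request and inspecting its headers. By default, requests identifies itself with a python-requests user agent, which is easy for a site to flag as a script:

```python
import requests

# The default identity requests sends when no User-Agent is set:
print(requests.utils.default_user_agent())  # e.g. python-requests/2.x.y

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

# Build the request without sending it and inspect the headers it would carry.
prepared = requests.Request('GET', 'https://www.amazon.in/gp/goldbox', headers=headers).prepare()
print(prepared.headers['User-Agent'])  # the Chrome string above
```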
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
...
<your code here>
Additionally, you can send a fuller set of headers so the request looks even more like a legitimate browser. For example:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip',
'DNT' : '1', # Do Not Track Request Header
'Connection' : 'close'
}
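The find() logic itself can be checked offline against a small HTML snippet. Note that the class names below are the ones from the question and may not match Amazon's current markup, which changes frequently:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one deal tile; the class names come from the question
# and may not match what Amazon serves today.
html = '''
<div class="a-row dealContainer dealTile">
    <span class="dealTitle">Sample product</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Passing the full multi-class string to class_ matches the exact attribute value.
article = soup.find('div', class_='a-row dealContainer dealTile')
print(article.find('span').text)  # Sample product
```

If find() still returns None with the browser headers set, compare the classes in the downloaded HTML against the ones you are searching for, since the page served to a script can differ from what the browser's inspector shows.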