This article covers what to do when you get blocked while scraping Amazon (even with headers, proxies, and delays). The question and recommended answer below may be a useful reference for anyone hitting the same problem.

Problem description

I have Python code to scrape Amazon product listings. I have set proxies and headers, and I call sleep() before each crawl. However, I still cannot get the data. The message I get back is:

To discuss automated access to Amazon data, please contact [email protected]

Part of my code is:

import random
import time

import requests

url = "https://www.amazon.com/Baby-Girls-Shoes/b/ref=sv_sl_fl_7239798011?ie=UTF8&node=7239798011"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
proxies_list = ["128.199.109.241:8080","113.53.230.195:3128","125.141.200.53:80","125.141.200.14:80","128.199.200.112:138","149.56.123.99:3128","128.199.200.112:80","125.141.200.39:80","134.213.29.202:4444"]

proxies = {'https': random.choice(proxies_list)}  # pick a random proxy for each request
time.sleep(0.5 * random.random())                 # short random delay before the crawl
r = requests.get(url, headers, proxies=proxies)
page_html = r.content
print(page_html)

This question is not a duplicate of other questions on Stack Overflow: those suggest using proxies, headers, and a delay (sleep), and I have already done all of that. Even after following those suggestions, I am still unable to scrape.

The code was working initially, but it stopped working after scraping a few pages.
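
When a scraper that used to work starts returning a block page, it helps to look at the HTTP status code and at what the request actually sent. A minimal diagnostic sketch (my own addition, not from the original post, using the same call style as the snippet above):

import requests

url = "https://www.amazon.com/Baby-Girls-Shoes/b/ref=sv_sl_fl_7239798011?ie=UTF8&node=7239798011"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

r = requests.get(url, headers)              # same call style as in the question
print(r.status_code)                        # Amazon's robot-check page is often served with 503
print(r.request.url)                        # full URL actually requested, including any query string
print(r.request.headers.get('User-Agent'))  # the User-Agent header that was actually sent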

Recommended answer

Instead of:

r = requests.get(url, headers, proxies=proxies)

do:

r = requests.get(url, headers=headers, proxies=proxies)

The second positional argument of requests.get() is params, not headers, so the headers dict was being sent as query-string parameters and the request went out with the default python-requests User-Agent, which Amazon blocks quickly. Passing headers by keyword fixes that. This resolved the issue for me for now; hopefully the fix keeps working.
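
For reference, here is a minimal, self-contained sketch of the corrected request, reusing the URL, headers, and a few of the proxy addresses from the question (those proxies may no longer be reachable):

import random
import time

import requests

url = "https://www.amazon.com/Baby-Girls-Shoes/b/ref=sv_sl_fl_7239798011?ie=UTF8&node=7239798011"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
proxies_list = ["128.199.109.241:8080", "113.53.230.195:3128", "125.141.200.53:80"]  # from the question; may be stale

proxies = {'https': random.choice(proxies_list)}         # rotate proxies between requests
time.sleep(0.5 * random.random())                        # small random delay before the crawl

r = requests.get(url, headers=headers, proxies=proxies)  # headers passed by keyword, not positionally
print(r.status_code)
print(r.request.headers.get('User-Agent'))               # should now be the Firefox UA, not python-requests/x.y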

That concludes this article on being blocked while scraping Amazon (even with headers, proxies, and delays). Hopefully the recommended answer is helpful.
