How to handle a 429 Too Many Requests response in Scrapy?

Question

I'm trying to run a scraper whose output log ends as follows:

2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.apkmirror.com/apk/instagram/instagram-instagram/instagram-instagram-9-0-0-34920-release/instagram-9-0-0-4-android-apk-download/>: HTTP status code is not handled or not allowed
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-25 20:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16048410,
 'downloader/request_count': 32902,
 'downloader/request_method_count/GET': 32902,
 'downloader/response_bytes': 117633316,
 'downloader/response_count': 32902,
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/429': 32781,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 25, 18, 22, 22, 710446),
 'log_count/DEBUG': 32903,
 'log_count/INFO': 32815,
 'request_depth_max': 2,
 'response_received_count': 32902,
 'scheduler/dequeued': 32902,
 'scheduler/dequeued/memory': 32902,
 'scheduler/enqueued': 32902,
 'scheduler/enqueued/memory': 32902,
 'start_time': datetime.datetime(2017, 4, 25, 17, 54, 36, 621481)}
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Spider closed (finished)

In short, of the 32,902 requests, only 121 are successful (response code 200), whereas the remaining 32,781 receive a 429 'Too Many Requests' response (cf. https://httpstatuses.com/429).

Are there any recommended ways to get around this? To start with, I'd like to have a look at the details of the 429 response rather than just ignoring it, as it may contain a Retry-After header indicating how long to wait before making a new request.
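
A minimal sketch of how that header could be turned into a wait time. It assumes the spider lists 429 in its `handle_httpstatus_list` attribute, so that the response reaches the callback instead of being ignored, and that the value is read with `response.headers.get("Retry-After")`. The helper name `retry_after_seconds` and its 60-second fallback are illustrative choices, not part of Scrapy.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=60.0, now=None):
    """Convert a Retry-After header value into a wait time in seconds.

    The header may carry either a number of seconds or an HTTP date.
    `default` (an arbitrary fallback) is returned when the header is
    missing or cannot be parsed.
    """
    if value is None:
        return default
    if isinstance(value, bytes):  # Scrapy exposes header values as bytes
        value = value.decode("latin-1")
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is nothing left to wait for.
    return max(0.0, (when - now).total_seconds())
```

In a spider callback this might be called as `retry_after_seconds(response.headers.get("Retry-After"))` whenever `response.status == 429`, with the result logged or used to decide when to reschedule the request.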

Also, if the requests are made using Privoxy and Tor as described in http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/, it may be possible to implement retry middleware that makes Tor change its IP address when this occurs. Are there any public examples of such code?
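
For illustration only, here is a sketch of the "change IP" step on its own, using nothing but the standard library. It assumes a local Tor daemon with its control port enabled on 9051 and password authentication configured; the function name `request_new_tor_identity` and all defaults are hypothetical, and a password containing quotes or backslashes would need escaping that this sketch does not do.

```python
import socket


def request_new_tor_identity(host="127.0.0.1", port=9051, password="", timeout=10.0):
    """Ask a Tor daemon for fresh circuits through its control port.

    Sends AUTHENTICATE followed by SIGNAL NEWNYM and returns True only
    if Tor answers both commands with a 250 status line.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with conn.makefile("rb") as reader:

            def command(line):
                # Control-port commands are CRLF-terminated text lines.
                conn.sendall(line.encode("ascii") + b"\r\n")
                reply = reader.readline().decode("ascii", "replace")
                return reply.startswith("250")

            if not command('AUTHENTICATE "%s"' % password):
                return False
            accepted = command("SIGNAL NEWNYM")
            command("QUIT")
            return accepted
```

A retry middleware could call this when it sees a 429 and then reschedule the request. Tor may rate-limit how often it honours NEWNYM, and a new circuit does not guarantee a different exit address, so this is not a substitute for slowing down.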

Recommended answer

Wow, your scraper is going really fast: over 30,000 requests in under 30 minutes. That's more than 10 requests per second.
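
The rate can be checked against the `start_time`, `finish_time` and `downloader/request_count` values in the stats dump from the question. The variable names below are just for illustration.

```python
from datetime import datetime

# Values copied from the Scrapy stats dump in the question.
start = datetime(2017, 4, 25, 17, 54, 36, 621481)
finish = datetime(2017, 4, 25, 18, 22, 22, 710446)
request_count = 32902

elapsed = (finish - start).total_seconds()
rate = request_count / elapsed
print(f"{elapsed:.0f} s elapsed, {rate:.1f} requests/s")
```

That works out to just under 28 minutes and close to 20 requests per second on average, so "more than 10 per second" is, if anything, an understatement.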

Such a high volume will trigger rate limiting on bigger sites and can bring smaller sites down completely. Don't do that.

Also, this might even be too fast for Privoxy and Tor, so they are also candidates for being the source of those 429 replies.

Solutions:

  1. Start slow. Reduce the concurrency settings and increase DOWNLOAD_DELAY so that you make at most 1 request per second. Then increase these values step by step and see what happens. It might sound paradoxical, but you may get more items and more 200 responses by going slower.

  2. If you are scraping a big site, try rotating proxies. In my experience the Tor network can be a bit heavy-handed for this, so you might try a proxy service, as Umair suggests.
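
The "start slow" advice could look roughly like the following in a project's settings.py. The specific values are assumptions chosen as a conservative starting point, and the AutoThrottle lines are an optional addition that the answer itself does not mention.

```python
# settings.py (sketch): begin conservatively, then raise the limits step by step.

# One request in flight at a time, overall and per domain.
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Base delay in seconds between requests to the same site. By default
# Scrapy randomizes the actual wait around this value.
DOWNLOAD_DELAY = 1.0

# Optional: let the AutoThrottle extension adapt the delay to the
# latency it observes; it does not go below DOWNLOAD_DELAY.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

Once the `downloader/response_status_count/429` figure in the stats stays at or near zero, the concurrency can be raised or the delay lowered a little at a time.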

08-07 05:27