Problem description
CONCURRENT_REQUESTS = 50
CONCURRENT_REQUESTS_PER_DOMAIN = 50
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0
After checking How to increase Scrapy crawling speed?, my scraper is still slow and takes about 25 hours to scrape 12,000 pages (Google, Amazon). I use Crawlera. Is there more I can do to increase speed? And when CONCURRENT_REQUESTS = 50, does that mean I have 50 thread-like requests?
Recommended answer
# How to run several instances of a spider

Your spider can take arguments in the terminal, like this: scrapy crawl spider -a arg=value.
Let's imagine you want to start 10 instances because, I guess, you start with 10 URLs (quote: "input is usually 10 urls"). The commands could look like this:
scrapy crawl spider -a arg=url1 &
scrapy crawl spider -a arg=url2 &
...
scrapy crawl spider -a arg=url10
The trailing & launches the next command without waiting for the previous one to finish. As far as I know, the syntax is the same on Windows and Ubuntu for this particular need.
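If typing ten commands by hand gets tedious, here is a minimal launcher sketch that simply spawns the same commands as above. It assumes the spider is registered as spider and that urls holds your start URLs; both names and the example URLs are illustrative.

import subprocess

# Illustrative list of start URLs; replace with your own 10 URLs.
urls = ["https://example.com/page1", "https://example.com/page2"]

# Launch one Scrapy process per URL, exactly like the & commands above.
procs = [subprocess.Popen(["scrapy", "crawl", "spider", "-a", f"arg={url}"])
         for url in urls]

# Wait for every instance to finish.
for p in procs:
    p.wait()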
## Spider source code

To be able to launch the spider as shown above, it can look like this:
import scrapy

class spiderExample(scrapy.Spider):
    name = "spider"

    def __init__(self, arg=None, *args, **kwargs):  # every -a argument from the terminal arrives here
        super().__init__(*args, **kwargs)
        self.arg = arg  # or self.start_urls = [arg], because it can answer your problem
        # ... any instructions you want, to initialize the variables you need in the process,
        # by calling them with self.correspondingVariable in any method of the spider.

    def parse(self, response):  # will start with start_urls
        ...  # any instructions you want in the current parsing method
# To avoid being banned

As far as I read, you use Crawlera. Personally, I have never used it; I never needed a paid service for this.
## One IP for each spider

The goal here is clear. As I told you in the comments, I use Tor and Polipo. Tor needs an HTTP proxy like Polipo or Privoxy to work correctly with a Scrapy spider: Tor is tunneled through the HTTP proxy, and in the end the proxy exposes a Tor IP. Where Crawlera can be interesting is that Tor's IPs are well known to some high-traffic websites (so a lot of robots go through them too...). Those websites can ban Tor's IPs when they detect robot behavior coming from the same IP.
Well, I don't know how Crawlera works, so I don't know how you can open several ports and use several IPs with it. Look into it yourself. In my case with Polipo, I can run several tunneled instances (each Polipo listening on the SOCKS port of its own Tor circuit) over several Tor circuits that I launch myself. Each Polipo instance has its own listening port. Then for each spider I can run the following:
scrapy crawl spider -a arg=url1 -s HTTP_PROXY=127.0.0.1:30001 &
scrapy crawl spider -a arg=url2 -s HTTP_PROXY=127.0.0.1:30002 &
...
scrapy crawl spider -a arg=url10 -s HTTP_PROXY=127.0.0.1:30010 &
Here, each port listens with a different IP, so to the website these look like different users. Your spider can then be more polite (look at the settings options) and your whole project gets faster. So there is no need to go through the roof by setting CONCURRENT_REQUESTS or CONCURRENT_REQUESTS_PER_DOMAIN to 300; that only makes the website spin its wheels and generates unnecessary events like DEBUG: Retrying <GET https://www.website.com/page3000> (failed 5 times): 500 Internal Server Error.
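As a side note, if the -s HTTP_PROXY setting does not take effect in your setup, Scrapy's built-in HttpProxyMiddleware also honours request.meta["proxy"], so the proxy can instead be passed to each spider instance as a regular -a argument. A minimal sketch, assuming a hypothetical proxy argument and the illustrative local ports above:

import scrapy

class ProxiedSpider(scrapy.Spider):
    name = "proxied_spider"  # illustrative name

    def __init__(self, arg=None, proxy=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [arg] if arg else []
        self.proxy = proxy  # e.g. "http://127.0.0.1:30001"

    def start_requests(self):
        for url in self.start_urls:
            meta = {"proxy": self.proxy} if self.proxy else {}
            # HttpProxyMiddleware picks up meta["proxy"] for this request.
            yield scrapy.Request(url, meta=meta)

It would then be launched as scrapy crawl proxied_spider -a arg=url1 -a proxy=http://127.0.0.1:30001 &.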
As a personal preference, I like to set a different log file for each spider. It avoids exploding the number of lines in the terminal and lets me read the events of each process in a more comfortable text file. It is easy to add to the command with -s LOG_FILE=thingy1.log, and it quickly shows you whether some URLs were not scraped as you wanted.
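Putting the pieces together, one instance could then be launched like this (the port and log file names are illustrative):

scrapy crawl spider -a arg=url1 -s HTTP_PROXY=127.0.0.1:30001 -s LOG_FILE=thingy1.log &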
## Random user agent

When I read that Crawlera is a smart solution because it uses the right user agent to avoid being banned... I was surprised, because you can actually do it yourself, like here. The most important aspect when you do it yourself is to choose popular user agents, so you stay overlooked among the large number of users of the same agent. Lists are available on some websites. Also, be careful to use desktop user agents rather than those of other devices such as mobiles, because the rendered page (I mean the source code) is not necessarily the same and you could lose the information you want to scrape.
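As a rough sketch of doing it yourself: a Scrapy downloader middleware can rewrite the User-Agent header per request. The class name, the priority number, and the truncated agent strings below are illustrative; fill the list with real, popular desktop user agents.

import random

# Illustrative placeholders; fill with real, popular desktop user-agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (X11; Linux x86_64) ...",
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Pick a different popular user agent for each request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let the request continue through the downloader chain

# Enable it in settings.py (module path and priority are illustrative):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomUserAgentMiddleware": 400}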
The main downside of my solution is that it consumes your computer's resources, so the number of instances you choose will depend on your computer's capacity (RAM, CPU...) and your router's capacity too. Personally, I am still on ADSL and, as I told you, 6000 requests were done in 20-30 minutes... but my solution does not consume more bandwidth than setting a crazy number on CONCURRENT_REQUESTS.