Problem Description
I have a spiders.py in a Scrapy project with the following spiders:
import scrapy

class OneSpider(scrapy.Spider):
    name = "s1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

class TwoSpider(scrapy.Spider):
    name = "s2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff
How do I run spiders s1 and s2, and write their scraped results to s1.json and s2.json?
Recommended Answer
The scrapy crawl command runs a single spider per invocation, so you'd simply run two processes:
scrapy crawl s1 -o s1.json
scrapy crawl s2 -o s2.json
If you want to do it in the same terminal window, you'd have to either:
- run the first spider, put it in the background (ctrl+z, then bg so it keeps running), and then run the second spider (see the sketch after this list)
- use nohup, for example:
nohup scrapy crawl s1 -o s1.json --logfile s1.log &
- use the screen command:
$ screen
$ scrapy crawl s1 -o s1.json
$ ctrl+a ctrl+d  # detach screen
$ screen
$ scrapy crawl s2 -o s2.json
$ ctrl+a ctrl+d  # detach screen
$ screen -r  # reattach to one of your sessions to see how the spider is doing
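If you go the plain job-control route from the first option, a minimal sketch (assuming a bash/zsh shell and the same spider names and output files as above) would look like this:

$ scrapy crawl s1 -o s1.json --logfile s1.log
# press ctrl+z to suspend the foreground job, then:
$ bg     # resume it in the background so it keeps crawling
$ scrapy crawl s2 -o s2.json --logfile s2.log
$ jobs   # check that the first crawl is still running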
Personally I prefer the nohup or screen options, as they are clean and don't clutter your terminal with logging and whatnot.
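Whichever background option you pick, you can still peek at a spider's progress by tailing the log file it writes to (assuming the --logfile paths used in the examples above):

$ tail -f s1.log   # follow the first spider's log
$ tail -f s2.log   # follow the second spider's log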