Problem Description
I have a spiders.py in a Scrapy project with the following spiders:
import scrapy

class OneSpider(scrapy.Spider):
    name = "s1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

class TwoSpider(scrapy.Spider):
    name = "s2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff
How do I run spiders s1 and s2, and write their scraped results to s1.json and s2.json?
Recommended Answer
The scrapy crawl command runs a single spider per invocation, so you'd simply run two processes:
scrapy crawl s1 -o s1.json
scrapy crawl s2 -o s2.json
If you want to do it in the same terminal window, you'd have to either:
- run the first spider, put it in the background (ctrl+z, then bg so it keeps running), and then run the second spider (see the sketch after this list)
- use nohup, for example:
nohup scrapy crawl s1 -o s1.json --logfile s1.log &
- use the screen command:
$ screen
$ scrapy crawl s1 -o s1.json
$ ctrl+a ctrl+d  # detach screen
$ screen
$ scrapy crawl s2 -o s2.json
$ ctrl+a ctrl+d  # detach screen
$ screen -r  # reattach to one of your sessions to see how the spider is doing
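If you go the plain job-control route from the first option, a minimal sketch (assuming a bash/zsh shell and the same spider names and output files as above) would look like this:

$ scrapy crawl s1 -o s1.json --logfile s1.log
# press ctrl+z to suspend the foreground job, then:
$ bg     # resume it in the background so it keeps crawling
$ scrapy crawl s2 -o s2.json --logfile s2.log
$ jobs   # check that the first crawl is still running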
Personally I prefer the nohup or screen options, as they are clean and don't clutter your terminal with logging and whatnot.
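Whichever background option you pick, you can still peek at a spider's progress by tailing the log file it writes to (assuming the --logfile paths used in the examples above):

$ tail -f s1.log   # follow the first spider's log
$ tail -f s2.log   # follow the second spider's log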