Problem description
I want to create a scheduler script to run the same spider multiple times in a sequence.
So far I got the following:
#!/usr/bin/python3
"""Scheduler for spiders."""
import time

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """Job to start spiders."""
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(DealsSpider)
    process.start()  # the script will block here until the end of the crawl


if __name__ == '__main__':
    while True:
        crawl_job()
        time.sleep(30)  # wait 30 seconds then crawl again
The spider runs correctly the first time. After the delay it starts up again, but right before it would begin scraping I get the following error message:
Traceback (most recent call last):
  File "scheduler.py", line 27, in <module>
    crawl_job()
  File "scheduler.py", line 17, in crawl_job
    process.start() # the script will block here until the end of the crawl
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False) # blocking call
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Unfortunately I'm not familiar with the Twisted framework and its Reactors, so any help would be appreciated!
Recommended answer
You're getting the ReactorNotRestartable error because a Twisted reactor cannot be started more than once in the same process. Each call to process.start() tries to start the reactor, and once the first crawl has finished and stopped it, the second call fails. There's plenty of information around the web about this. Here's a simple solution:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """
    Job to start spiders.
    Return a Deferred that fires once the crawl has completed.
    """
    settings = get_project_settings()
    runner = CrawlerRunner(settings)
    return runner.crawl(DealsSpider)


def schedule_next_crawl(null, sleep_time):
    """
    Schedule the next crawl.
    The first argument is the Deferred's result, which is not used here.
    """
    reactor.callLater(sleep_time, crawl)


def crawl():
    """
    A "recursive" function that schedules a crawl 30 seconds after
    each successful crawl.
    """
    # crawl_job() returns a Deferred
    d = crawl_job()
    # call schedule_next_crawl(<result of the Deferred>, 30) after the crawl job is complete
    d.addCallback(schedule_next_crawl, 30)
    d.addErrback(catch_error)


def catch_error(failure):
    print(failure.value)


if __name__ == "__main__":
    crawl()
    reactor.run()
There are a few noticeable differences from your snippet. The reactor is run directly, CrawlerProcess has been replaced with CrawlerRunner, time.sleep has been removed so that the reactor doesn't block, and the while loop has been replaced with repeated calls to the crawl function scheduled via callLater. It's short and should do what you want. If any parts confuse you, let me know and I'll elaborate.
To crawl at a specific time of day instead of after a fixed delay, replace schedule_next_crawl and crawl with:

import datetime as dt


def schedule_next_crawl(null, hour, minute):
    tomorrow = (
        dt.datetime.now() + dt.timedelta(days=1)
    ).replace(hour=hour, minute=minute, second=0, microsecond=0)
    sleep_time = (tomorrow - dt.datetime.now()).total_seconds()
    reactor.callLater(sleep_time, crawl)


def crawl():
    d = crawl_job()
    # crawl every day at 1:30 pm
    d.addCallback(schedule_next_crawl, hour=13, minute=30)
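Note that this version always targets tomorrow's slot, so if the script is started at 09:00 the crawl after the initial one waits until 13:30 the following day. If you want the next occurrence instead (today when the time is still ahead, tomorrow otherwise), something like the following could work. seconds_until is a hypothetical helper, not part of Scrapy or Twisted, and it uses naive local time, so a daylight-saving change can shift one run by an hour.

```python
import datetime as dt


def seconds_until(hour, minute, now=None):
    """Seconds from `now` until the next occurrence of hour:minute."""
    now = now or dt.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed, so aim for tomorrow.
        target += dt.timedelta(days=1)
    return (target - now).total_seconds()


morning = dt.datetime(2024, 1, 1, 9, 0)
evening = dt.datetime(2024, 1, 1, 18, 0)
print(seconds_until(13, 30, now=morning))  # 16200.0 (4.5 hours, same day)
print(seconds_until(13, 30, now=evening))  # 70200.0 (19.5 hours, next day)
```

Inside schedule_next_crawl this would be used as reactor.callLater(seconds_until(hour, minute), crawl).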