Problem description
I have a list of URLs. I want to crawl each of these. Please note:
- Adding this array as start_urls is not the behavior I'm looking for. I would like this to run one by one in separate crawl sessions.
- I want to run Scrapy multiple times in the same process.
- I want to run Scrapy as a script, as covered in Common Practices, and not from the CLI.
The following code is a full, broken, copy-pastable example. It basically tries to loop through a list of URLs and start the crawler on each of them. This is based on the Common Practices documentation.
from urllib.parse import urlparse
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.spiders import CrawlSpider
class MySpider(CrawlSpider):
    name = 'my-spider'

    def __init__(self, start_url, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc]

urls = [
    'http://testphp.vulnweb.com/',
    'http://testasp.vulnweb.com/'
]

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

for url in urls:
    runner.crawl(MySpider, url)

reactor.run()
The problem with the above is that it hangs after the first URL; the second URL is never crawled and nothing happens after this:
2018-08-13 20:28:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://testphp.vulnweb.com/> (referer: None)
[...]
2018-08-13 20:28:44 [scrapy.core.engine] INFO: Spider closed (finished)
Recommended answer
The reactor.run() call will block your loop forever from the start. The only way around this is to play by the Twisted rules. One way to do so is to replace your loop with a Twisted-specific asynchronous loop, like so:
# from twisted.internet.defer import inlineCallbacks
...

@inlineCallbacks
def loop_urls(urls):
    # Yielding the Deferred returned by crawl() pauses here until that
    # crawl session finishes, so the URLs are processed one by one.
    for url in urls:
        yield runner.crawl(MySpider, url)
    reactor.stop()

loop_urls(urls)
reactor.run()
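Putting the pieces together, a full version of the question's script with the for loop swapped out might look like this. It is a sketch assembled from the snippets above (same spider, same URL list), not a separately tested program:

from urllib.parse import urlparse

from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'my-spider'

    def __init__(self, start_url, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Each crawl session gets exactly one start URL and its domain.
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc]


urls = [
    'http://testphp.vulnweb.com/',
    'http://testasp.vulnweb.com/'
]

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()


@inlineCallbacks
def loop_urls(urls):
    # One crawl session per URL, run sequentially inside the reactor.
    for url in urls:
        yield runner.crawl(MySpider, url)
    # Stop the reactor once every URL has been crawled.
    reactor.stop()


loop_urls(urls)
reactor.run()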
The inlineCallbacks magic roughly translates to this explicit Deferred callback chaining:
def loop_urls(urls):
    url, *rest = urls
    dfd = runner.crawl(MySpider, url)
    # crawl() returns a deferred to which a callback (or errback) can be attached
    dfd.addCallback(lambda _: loop_urls(rest) if rest else reactor.stop())

loop_urls(urls)
reactor.run()
which you could also use, but it's far from pretty.
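As the comment in the snippet notes, an errback can be attached as well. With addCallback alone, a crawl that fails would skip the callback, so reactor.stop() would never run and the script would hang again. A minimal sketch of one way to guard against that, using Twisted's addBoth so the chain continues whether a crawl succeeds or fails (same names as above; note that this simply swallows the failure):

def loop_urls(urls):
    url, *rest = urls
    dfd = runner.crawl(MySpider, url)
    # addBoth() fires on success and on failure, so a single broken URL
    # cannot leave the reactor running forever. The failure itself is
    # ignored here; a real script would probably log it first.
    dfd.addBoth(lambda _: loop_urls(rest) if rest else reactor.stop())

loop_urls(urls)
reactor.run()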