Is it possible to run another spider from a Scrapy spider?

Problem description

For now I have 2 spiders; what I would like to do is:

  1. Spider 1 goes to url1 and, if url2 appears, call spider 2 with url2. It also saves the content of url1 by using a pipeline.
  2. Spider 2 goes to url2 and does something.

Due to the complexities of both spiders I would like to have them separated.

What I have tried using scrapy crawl:

def parse(self, response):
    p = multiprocessing.Process(
        target=self.testfunc())
    p.join()
    p.start()

def testfunc(self):
    settings = get_project_settings()
    crawler = CrawlerRunner(settings)
    crawler.crawl(<spidername>, <arguments>)

It does load the settings but doesn't crawl:

2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 14:13:32 [scrapy] INFO: Spider opened
2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
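As an aside, the `Process` usage in the snippet above trips over two standard-library details: `target=self.testfunc()` passes the *return value* of calling `testfunc` rather than the callable itself, and `p.join()` is called before `p.start()`. A minimal stdlib-only sketch of the usual pattern (the `worker` function and its message are illustrative stand-ins, not part of the original code):

```python
from multiprocessing import Process, Queue

def worker(q):
    # stand-in for the second crawl; report back through the queue
    q.put("crawl finished")

if __name__ == "__main__":
    q = Queue()
    # pass the callable itself: target=worker, not target=worker()
    p = Process(target=worker, args=(q,))
    p.start()          # start the child first...
    result = q.get()   # ...read its result...
    p.join()           # ...then wait for it to exit
    print(result)
```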

The documentation has an example about launching a crawl from a script, but what I'm trying to do is launch another spider while using the scrapy crawl command.

Full code

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from multiprocessing import Process
import scrapy
import os


def info(title):
    print(title)
    print('module name:', __name__)
    if hasattr(os, 'getppid'):  # only available on Unix
        print('parent process:', os.getppid())
    print('process id:', os.getpid())


class TestSpider1(scrapy.Spider):
    name = "test1"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('parse')
        a = MyClass()
        a.start_work()


class MyClass(object):

    def start_work(self):
        info('start_work')
        p = Process(target=self.do_work)
        p.start()
        p.join()

    def do_work(self):
        info('do_work')
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        runner.crawl(TestSpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return

class TestSpider2(scrapy.Spider):
    name = "test2"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('testspider2')
        return

What I hope for is something like:

  1. scrapy crawl test1 (for example, when response.status_code is 200)
  2. in test1, call scrapy crawl test2

Recommended answer

I won't go in depth since this question is really old, but I'll go ahead and drop this snippet from the official Scrapy docs.... You are very close! lol

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

https://doc.scrapy.org/en/latest/topics/practices.html

And then, using callbacks, you can pass items between your spiders and implement whatever logic you are talking about.
