If the spider gets a redirect, it should issue the request again, but with different parameters. The callback of the second request is never executed.

If different URLs are used in the start and checker methods, it works fine. I think the requests are being lazily loaded, and that's why my code doesn't work, but I'm not sure.

from scrapy.http import Request
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):

    def start(self, response):
        return Request(url = 'http://localhost/', callback=self.checker, meta={'dont_redirect': True})

    def checker(self, response):
        if response.status == 301:
            return Request(url = "http://localhost/", callback=self.results, meta={'dont_merge_cookies': True})
        else:
            return self.results(response)

    def results(self, response):
        pass  # here I work with response
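
For context on why the callback never fires: Scrapy's scheduler filters out requests to a URL it has already seen, so the second Request to the same http://localhost/ is silently dropped before its callback can run. Below is a minimal sketch of a per-request workaround (this spider is illustrative, not from the question): dont_filter is a standard Request argument that bypasses that filter, and handle_httpstatus_list is set so the 301 response is not discarded by HttpErrorMiddleware before it reaches the callback.

from scrapy.http import Request
from scrapy.spider import BaseSpider

class RetrySpider(BaseSpider):

    name = "retry"
    # Let the 301 through to the callback instead of having
    # HttpErrorMiddleware drop the non-200 response.
    handle_httpstatus_list = [301]

    def start_requests(self):
        return [Request(url='http://localhost/', callback=self.checker,
                        meta={'dont_redirect': True})]

    def checker(self, response):
        if response.status == 301:
            # dont_filter=True tells the scheduler to accept a request for a
            # URL it has already seen; without it the request is dropped and
            # self.results is never called.
            return Request(url='http://localhost/', callback=self.results,
                           meta={'dont_merge_cookies': True}, dont_filter=True)
        return self.results(response)

    def results(self, response):
        pass  # work with the response here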

Best Answer

Not sure if you still need this, but I put together an example. If you have a specific website in mind, we can definitely take a look.

from scrapy.http import Request
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):

    name = "TEST"
    allowed_domains = ["example.com", "example.iana.org"]

    def __init__(self, **kwargs):
        super(TestSpider, self).__init__(**kwargs)
        self.url      = "http://www.example.com"
        self.max_loop = 3
        self.loop     = 0  # We want it to loop 3 times so keep a class var

    def start_requests(self):
        # I'll write it out more explicitly here
        print "OPEN"
        checkRequest = Request(
            url      = self.url,
            meta     = {"test":"first"},
            callback = self.checker
        )
        return [ checkRequest ]

    def checker(self, response):
        # I wasn't sure about a specific website that gives 302
        # so I just used 200. We need the loop counter or it will keep going

        if self.loop < self.max_loop and response.status == 200:
            print "RELOOPING", response.status, self.loop, response.meta['test']
            self.loop += 1

            checkRequest = Request(
                url = self.url,
                callback = self.checker
            ).replace(meta = {"test":"not first"})
            return [checkRequest]
        else:
            print "END LOOPING"
            self.results(response) # No need to return, just call method

    def results(self, response):
        print "DONE"  # Do stuff here


In settings.py, set this option:

DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'


This is effectively what turns off the duplicate site-request filter. It's confusing, because BaseDupeFilter is not actually the default, since it doesn't really filter anything. It means we will submit 3 different requests that loop through the checker method. Also, I'm using Scrapy 0.16:

>scrapy crawl TEST
>OPEN
>RELOOPING 200 0 first
>RELOOPING 200 1 not first
>RELOOPING 200 2 not first
>END LOOPING
>DONE
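
As an aside for readers on current Scrapy versions (this sketch is mine, not part of the original answer): BaseSpider and the scrapy.dupefilter module path no longer exist, and the usual modern way to get the same behaviour is to mark the individual request with dont_filter=True instead of replacing the dupe filter globally. Roughly:

import scrapy

class LoopSpider(scrapy.Spider):

    name = "loop"
    start_urls = ["http://www.example.com"]
    max_loop = 3

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.loop = 0  # loop counter, as in the original answer

    def parse(self, response):
        if self.loop < self.max_loop and response.status == 200:
            self.loop += 1
            # dont_filter=True replaces the DUPEFILTER_CLASS trick: it lets
            # this one request through even though the URL was already seen.
            yield scrapy.Request(response.url, callback=self.parse,
                                 dont_filter=True)
        else:
            self.results(response)

    def results(self, response):
        self.logger.info("DONE")  # do stuff here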

Regarding "python - How to make a callback in two sequential requests in Scrapy", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16590110/
