I'm working on a Scrapy project in which I wrote a DOWNLOADER MIDDLEWARE to avoid sending requests for URLs that are already in the database.

DOWNLOADER_MIDDLEWARES = {
    'imobotS.utilities.RandomUserAgentMiddleware': 400,
    'imobotS.utilities.DupFilterMiddleware': 500,
    # None disables the built-in user-agent middleware
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}


The idea is to connect in __init__ and load a distinct list of all URLs currently stored in the database, then raise IgnoreRequest if the item being scraped is already in the database.

import pymongo

from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    def __init__(self):
        # connect and authenticate against the MongoDB instance
        connection = pymongo.Connection('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')

        # distinct list of every URL already stored for this (hard-coded) site
        self.url_set = self.db.ad.find({'site': 'WEBSITE_NAME'}).distinct('url')

    def process_request(self, request, spider):
        print "%s - process Request URL: %s" % (spider._site_name, request.url)
        if request.url in self.url_set:
            # drop the request: this URL has already been stored
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        else:
            return None


So, since I want to restrict the url_list loaded in __init__ to the current WEBSITE_NAME, is there any way to identify the current spider name inside the downloader middleware's __init__ method?

Best Answer

You can move fetching the URL sets into process_request and check whether the set for the current spider has already been fetched.

import pymongo

from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    def __init__(self):
        connection = pymongo.Connection('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')

        # one URL set per spider, filled lazily on the first request
        self.url_sets = {}

    def process_request(self, request, spider):
        if spider._site_name not in self.url_sets:
            # first request for this spider: fetch its URLs once; a set
            # keeps the membership test below O(1)
            self.url_sets[spider._site_name] = set(
                self.db.ad.find({'site': spider._site_name}).distinct('url'))

        print "%s - process Request URL: %s" % (spider._site_name, request.url)
        if request.url in self.url_sets[spider._site_name]:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        else:
            return None
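
Alternatively, if you want the spider available as soon as it starts rather than on the first request, Scrapy lets a middleware register for the spider_opened signal in a from_crawler classmethod. A minimal sketch, reusing the question's connection details and its _site_name spider attribute (the class name and those details are placeholders carried over from the code above):

import pymongo

from scrapy import signals
from scrapy.exceptions import IgnoreRequest


class SpiderAwareDupFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds the middleware through this hook; connecting to the
        # spider_opened signal hands us the spider before crawling begins
        mw = cls()
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        return mw

    def __init__(self):
        # same placeholder connection details as in the question
        connection = pymongo.Connection('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')
        self.url_set = set()

    def spider_opened(self, spider):
        # runs once per spider, so the query can be scoped to its site
        self.url_set = set(
            self.db.ad.find({'site': spider._site_name}).distinct('url'))

    def process_request(self, request, spider):
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None

Note this variant keeps a single URL set, which assumes one spider per crawler process (the usual Scrapy setup).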

Regarding python - Scrapy - get a spider variable in the downloader middleware __init__, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25677901/
