How to write a custom downloader middleware for Selenium and Scrapy?


Problem description

I am having an issue communicating between Selenium and Scrapy objects.

I am using Selenium to log in to a site; once I get that response, I want to use Scrapy's functionality to parse and process it. Can someone please help me write a middleware so that every request goes through the Selenium webdriver and the response is passed back to Scrapy?

Thanks!

Answer

It's pretty straightforward: create a middleware that holds a webdriver, and use process_request to intercept each request, discard it, and pass the URL it carried to your Selenium webdriver:

from scrapy.http import HtmlResponse
from selenium import webdriver


class DownloaderMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome()  # your chosen driver

    def process_request(self, request, spider):
        # Only handle requests tagged with 'selenium' in their meta;
        # delete this check if every request should go through the driver.
        if not request.meta.get('selenium'):
            return
        # Let Selenium fetch the page, then wrap the rendered source in an
        # HtmlResponse so Scrapy can hand it to the spider callback.
        self.driver.get(request.url)
        body = self.driver.page_source
        # An explicit encoding is required when body is a str; passing the
        # original request keeps its meta available in the callback.
        return HtmlResponse(
            url=self.driver.current_url,
            body=body,
            encoding='utf-8',
            request=request,
        )
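
To wire this up, enable the middleware in your project settings and tag the requests that should be fetched by Selenium. A minimal sketch, assuming the class above lives in myproject/middlewares.py (the module path, spider name, and URL are placeholders, not part of the original answer):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DownloaderMiddleware': 543,
}

# myspider.py
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        # The 'selenium' meta flag routes this request through the middleware.
        yield scrapy.Request('https://example.com/login',
                             meta={'selenium': True})

    def parse(self, response):
        # response is the HtmlResponse built from driver.page_source
        self.logger.info('Rendered page title: %s',
                         response.css('title::text').get())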

The downside of this is that you have to get rid of concurrency in your spider, since the Selenium webdriver can only handle one URL at a time. For that, see the settings documentation page.
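
One way to do that is to limit concurrency in settings.py; a minimal sketch using Scrapy's standard settings:

# settings.py
# Serialize requests so the single webdriver is never asked to
# load two pages at once.
CONCURRENT_REQUESTS = 1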
