This article looks at how to renew the IP when using Scrapy with Privoxy and Tor; the recommended answer below may be a useful reference if you are facing the same problem.

Problem Description

I am working with Scrapy, Privoxy and Tor. I have everything installed and working properly. But Tor connects with the same IP every time, so I can easily get banned. Is it possible to tell Tor to reconnect every X seconds or every X connections?

Thanks!

EDIT about the configuration: For the user agent pool I did this: http://tangww.com/2013/06/UsingRandomAgent/ (I had to add an __init__.py file, as mentioned in the comments), and for Privoxy and Tor I followed http://www.andrewwatters.com/privoxy/ (I had to create the private user and private group manually from the terminal). It worked :)
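
For reference, the Privoxy/Tor wiring from that kind of guide typically boils down to a few config lines. The snippet below is a sketch only: file locations vary by platform, and the hashed password is a placeholder you would generate yourself.

# in Privoxy's config file: hand every request over to Tor's SOCKS port
forward-socks5t / 127.0.0.1:9050 .

# in torrc: enable the control port that is used later to request a new identity
ControlPort 9051
HashedControlPassword 16:...   # output of: tor --hash-password "tor_password"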

My spider looks like this:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "spider_name"
    start_urls = [
        'https://example.com/listviews/titles.php',
    ]
    allowed_domains = ["example.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="tab7"]/article/header/h2/a/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Go back, follow the next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('ul.pagin li.presente ~ li a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # For each url in the list, go inside and, in div#main, take div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        #Parsing rules go here
        for each_book in response.css('main#main'):
            yield {
                'editor': each_book.css('header.datos1 > ul > li > h5 > a::text').extract(),
            }

In settings.py I have a user agent rotation and Privoxy:

DOWNLOADER_MIDDLEWARES = {
    # user agent rotation
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'spider_name.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
    # privoxy
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'spider_name.middlewares.ProxyMiddleware': 100,
}
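
(A note on the numbers: Scrapy calls downloader middlewares with lower values first on the way out, so the custom ProxyMiddleware at 100 sets request.meta['proxy'] before the built-in HttpProxyMiddleware at 110 applies it.)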

In middlewares.py I added:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # route every request through Privoxy, which listens on 127.0.0.1:8118 by default
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

I think that's about it…

EDIT II ---

OK, I changed my middlewares.py file as in the blog post @Tomáš Linhart mentioned, from:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

to:

from stem import Signal
from stem.control import Controller

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

    # NOTE: as written, this method is never called from process_request (and it
    # lacks a self parameter), so the NEWNYM signal is never actually sent
    def set_new_ip():
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password='tor_password')
            controller.signal(Signal.NEWNYM)

But now it is really slow, and it doesn't appear to change the IP… Did I do it right, or is something wrong?

Recommended Answer

This blog post might help you a bit as it deals with the same issue.

Based on your concrete requirement (a new IP for each request, or after N requests), put an appropriate call to set_new_ip in the process_request method of the middleware. Note, however, that a call to the set_new_ip function doesn't always guarantee a new IP (there's a link to the FAQ with an explanation).

The module with the ProxyMiddleware class would look like this:

from stem import Signal
from stem.control import Controller

def _set_new_ip():
    # connect to Tor's control port and request a new circuit,
    # which usually (but not always) means a new exit IP
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='tor_password')
        controller.signal(Signal.NEWNYM)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        _set_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
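
If you instead want a new IP only after every N requests, a minimal variant of the same middleware could look like this. This is a sketch only: the request counter and the N_REQUESTS threshold are illustrative assumptions, not part of the original answer, and _set_new_ip is the helper defined above.

N_REQUESTS = 30  # hypothetical threshold, tune to taste

class ProxyMiddleware(object):
    def __init__(self):
        self._request_count = 0

    def process_request(self, request, spider):
        self._request_count += 1
        # only ask Tor for a new circuit every N_REQUESTS requests
        if self._request_count % N_REQUESTS == 0:
            _set_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

Also keep in mind that Tor rate-limits NEWNYM signals (it ignores them if they arrive more often than roughly every ten seconds), which is one reason the per-request version above can feel slow without actually rotating the IP any faster.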
