Problem description
I'm trying to run a Scrapy spider with two 'extensions':
- Splash for rendering JavaScript,
- Tor-Privoxy to provide anonymity.
As an example, I'm using the quotes.toscrape.com scraper from https://github.com/scrapy-plugins/scrapy-splash/tree/master/example. Here is my directory structure:
.
├── docker-compose.yml
└── example
    ├── Dockerfile
    ├── scrapy.cfg
    └── scrashtest
        ├── __init__.py
        ├── settings.py
        └── spiders
            ├── __init__.py
            └── quotes.py
where the example directory is cloned from the scrapy-splash repository. I've added the following docker-compose.yml file:
version: '3'
services:
  scraper:
    build: ./example
    environment:
      - http_proxy=http://tor-privoxy:8118
    links:
      - tor-privoxy
      - splash
  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine
  splash:
    image: scrapinghub/splash
In the settings.py file I've changed the SPLASH_URL:
# SPLASH_URL = 'http://127.0.0.1:8050/'
SPLASH_URL = 'http://splash:8050'
because Splash is not running on localhost, but in a separate linked container named splash. The Dockerfile for the scraper is
FROM python:alpine
RUN apk --update add libxml2-dev libxslt-dev libffi-dev gcc musl-dev libgcc openssl-dev curl bash
RUN pip install scrapy scrapy-splash
COPY . /scraper
WORKDIR /scraper
CMD ["scrapy", "crawl", "quotes"]
The problem is that when I run this using docker-compose build and docker-compose up, I get the following logs:
Starting examplecompose_tor-privoxy_1
Starting examplecompose_splash_1
Recreating examplecompose_scraper_1
Attaching to examplecompose_splash_1, examplecompose_tor-privoxy_1, examplecompose_scraper_1
splash_1 | 2017-07-11 16:10:13+0000 [-] Log opened.
splash_1 | 2017-07-11 16:10:13.794595 [-] Splash version: 3.0
tor-privoxy_1 | 2017-07-11 16:10:13.568 7f08e999eee8 Info: Privoxy version 3.0.23
tor-privoxy_1 | 2017-07-11 16:10:13.568 7f08e999eee8 Info: Program name: privoxy
tor-privoxy_1 | Jul 11 16:10:13.578 [notice] Tor v0.2.6.10 (git-58c51dc6087b0936) running on Linux with Libevent 2.0.22-stable, OpenSSL 1.0.2d and Zlib 1.2.8.
tor-privoxy_1 | Jul 11 16:10:13.578 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
splash_1 | 2017-07-11 16:10:13.795925 [-] Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2
splash_1 | 2017-07-11 16:10:13.796204 [-] Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609]
tor-privoxy_1 | Jul 11 16:10:13.578 [notice] Configuration file "/etc/tor/torrc" not present, using reasonable defaults.
tor-privoxy_1 | Jul 11 16:10:13.581 [notice] Opening Socks listener on 127.0.0.1:9050
splash_1 | 2017-07-11 16:10:13.796541 [-] Open files limit: 1048576
tor-privoxy_1 | Jul 11 16:10:13.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
splash_1 | 2017-07-11 16:10:13.796706 [-] Can't bump open files limit
tor-privoxy_1 | Jul 11 16:10:13.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
splash_1 | 2017-07-11 16:10:13.903844 [-] Xvfb is started: ['Xvfb', ':1896918638', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
splash_1 | QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
tor-privoxy_1 | Jul 11 16:10:13.000 [warn] You are running Tor as root. You don't need to, and you probably shouldn't.
splash_1 | 2017-07-11 16:10:13.984515 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
tor-privoxy_1 | Jul 11 16:10:13.000 [notice] Bootstrapped 0%: Starting
splash_1 | 2017-07-11 16:10:14.041562 [-] verbosity=1
splash_1 | 2017-07-11 16:10:14.041732 [-] slots=50
tor-privoxy_1 | Jul 11 16:10:13.000 [notice] Bootstrapped 5%: Connecting to directory server
splash_1 | 2017-07-11 16:10:14.041806 [-] argument_cache_max_entries=500
tor-privoxy_1 | Jul 11 16:10:13.000 [notice] Bootstrapped 80%: Connecting to the Tor network
splash_1 | 2017-07-11 16:10:14.043083 [-] Web UI: enabled, Lua: enabled (sandbox: enabled)
splash_1 | 2017-07-11 16:10:14.044088 [-] Site starting on 8050
splash_1 | 2017-07-11 16:10:14.044240 [-] Starting factory <twisted.web.server.Site object at 0x7f73a4e4b3c8>
tor-privoxy_1 | Jul 11 16:10:14.000 [notice] Bootstrapped 85%: Finishing handshake with first hop
scraper_1 | 2017-07-11 16:10:15 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrashtest)
scraper_1 | 2017-07-11 16:10:15 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scrashtest', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'NEWSPIDER_MODULE': 'scrashtest.spiders', 'SPIDER_MODULES': ['scrashtest.spiders']}
scraper_1 | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled extensions:
scraper_1 | ['scrapy.extensions.corestats.CoreStats',
scraper_1 | 'scrapy.extensions.telnet.TelnetConsole',
scraper_1 | 'scrapy.extensions.memusage.MemoryUsage',
scraper_1 | 'scrapy.extensions.logstats.LogStats']
scraper_1 | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
scraper_1 | ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
scraper_1 | 'scrapy_splash.SplashCookiesMiddleware',
scraper_1 | 'scrapy_splash.SplashMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.stats.DownloaderStats']
scraper_1 | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled spider middlewares:
scraper_1 | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
scraper_1 | 'scrapy_splash.SplashDeduplicateArgsMiddleware',
scraper_1 | 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
scraper_1 | 'scrapy.spidermiddlewares.referer.RefererMiddleware',
scraper_1 | 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
scraper_1 | 'scrapy.spidermiddlewares.depth.DepthMiddleware']
scraper_1 | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled item pipelines:
scraper_1 | []
scraper_1 | 2017-07-11 16:10:15 [scrapy.core.engine] INFO: Spider opened
scraper_1 | 2017-07-11 16:10:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
scraper_1 | 2017-07-11 16:10:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
tor-privoxy_1 | Jul 11 16:10:16.000 [notice] Bootstrapped 90%: Establishing a Tor circuit
tor-privoxy_1 | Jul 11 16:10:17.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
tor-privoxy_1 | Jul 11 16:10:17.000 [notice] Bootstrapped 100%: Done
tor-privoxy_1 | Jul 11 16:10:17.000 [warn] Received http status code 404 ("Not found") from server '216.218.222.10:443' while fetching "/tor/keys/fp/585769C78764D58426B8B52B6651A5A71137189A+80550987E1D626E3EBA5E5E75A458DE0626D088C".
scraper_1 | 2017-07-11 16:10:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
scraper_1 | 2017-07-11 16:10:29 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.goodreads.com': <GET https://www.goodreads.com/quotes>
scraper_1 | 2017-07-11 16:10:29 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'scrapinghub.com': <GET https://scrapinghub.com>
tor-privoxy_1 | Jul 11 16:10:44.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
tor-privoxy_1 | Jul 11 16:10:44.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1 | 2017-07-11 16:10:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/adulthood/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
scraper_1 | 2017-07-11 16:10:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/be-yourself/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
tor-privoxy_1 | Jul 11 16:10:55.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
tor-privoxy_1 | Jul 11 16:10:55.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1 | 2017-07-11 16:10:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/success/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
scraper_1 | 2017-07-11 16:10:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/books/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
tor-privoxy_1 | Jul 11 16:10:56.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1 | 2017-07-11 16:10:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
tor-privoxy_1 | Jul 11 16:10:57.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
tor-privoxy_1 | Jul 11 16:10:57.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1 | 2017-07-11 16:10:57 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/classic/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
scraper_1 | 2017-07-11 16:10:57 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/aliteracy/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
where I've interrupted the process for brevity. It seems that the scraper and tor-privoxy services are alternately complaining about a 500 Internal Server Error and about not being able to 'resolve or connect to address', respectively.
I'm struggling to figure out why the http_proxy and Splash don't 'work together'. Can anyone point me in the right direction?
Recommended answer
Following the Aquarium template project (https://github.com/TeamHG-Memex/aquarium), I found that the trick is to make Splash use Tor, rather than having the spider use it directly.
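One plausible reading of the logs in the question, offered as an assumption rather than a confirmed diagnosis: Scrapy's HttpProxyMiddleware obeys the http_proxy and no_proxy environment variables in the same way as Python's urllib.request, so with http_proxy set on the scraper container even the requests addressed to http://splash:8050 would be sent through Privoxy and Tor, which cannot resolve the Compose-internal name splash. The sketch below uses only the standard library to show that lookup; proxy_for is a hypothetical helper, not part of Scrapy.

```python
import os
from urllib.request import getproxies, proxy_bypass


def proxy_for(host):
    """Return the proxy urllib would pick for an http request to `host`, or None.

    Hypothetical helper for illustration; it mirrors the environment lookup
    (http_proxy / no_proxy) that urllib.request performs.
    """
    proxy = getproxies().get("http")
    if proxy is None or proxy_bypass(host):
        return None
    return proxy


# Same variable as in the question's docker-compose.yml.
os.environ["http_proxy"] = "http://tor-privoxy:8118"
os.environ.pop("no_proxy", None)
os.environ.pop("NO_PROXY", None)

# With no exclusion list, traffic to the Splash container is proxied too.
splash_route = proxy_for("splash")

# Listing the host in no_proxy makes urllib skip the proxy for it.
os.environ["no_proxy"] = "splash"
splash_route_bypassed = proxy_for("splash")
```

Excluding splash through no_proxy would only stop the scraper from proxying its calls to Splash; Splash itself would still fetch pages without Tor, which is why the setup below configures the proxy on the Splash side instead.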
My adapted project has the following structure:
.
├── docker-compose.yml
├── example
│   ├── Dockerfile
│   ├── scrapy.cfg
│   └── scrashtest
│       ├── __init__.py
│       ├── settings.py
│       └── spiders
│           ├── __init__.py
│           └── quotes.py
└── splash
    └── proxy-profiles
        └── default.ini
and the docker-compose.yml is
version: '3'
services:
  scraper:
    build: ./example
    links:
      - splash
  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine
  splash:
    image: scrapinghub/splash
    volumes:
      - ./splash/proxy-profiles:/etc/splash/proxy-profiles:ro
    links:
      - tor-privoxy
where I've mounted the proxy-profiles directory as a volume into the splash container, following http://splash.readthedocs.io/en/stable/api.html#proxy-profiles. The default.ini reads
[proxy]
host=tor-privoxy
port=8118
(I also noticed that it is essential to name the file default.ini.)
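As far as I understand the Splash documentation, the profile named default is the one applied when a request passes no proxy argument, which would explain why the file name matters. If you prefer to generate the file rather than write it by hand, here is a minimal sketch using Python's configparser; write_default_profile and the temporary demo directory are hypothetical, and in the project above the target would be ./splash/proxy-profiles.

```python
import configparser
import tempfile
from pathlib import Path


def write_default_profile(profiles_dir, host="tor-privoxy", port=8118):
    """Write <profiles_dir>/default.ini containing a single [proxy] section.

    Hypothetical helper; host and port default to the values used above.
    """
    profile = configparser.ConfigParser()
    profile["proxy"] = {"host": host, "port": str(port)}

    path = Path(profiles_dir) / "default.ini"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as fh:
        # No spaces around "=" so the output matches the hand-written file.
        profile.write(fh, space_around_delimiters=False)
    return path


# Demo: write into a throwaway directory instead of the real project tree.
demo_dir = Path(tempfile.mkdtemp()) / "splash" / "proxy-profiles"
profile_path = write_default_profile(demo_dir)
profile_text = profile_path.read_text()
```

The host value has to match the Compose service name (tor-privoxy here), because Splash reaches Privoxy over the link declared in docker-compose.yml.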
With this setup, after docker-compose build and docker-compose up, the scraper runs successfully using Splash.