I am trying to write a small web parser that can also fetch JavaScript-generated content. For this I tried ScrapyJS, which extends Scrapy with JavaScript support via Splash.
I followed the installation instructions in the official repository.
Scrapy by itself works fine, but the second ScrapyJS example (fetching the HTML content plus a screenshot) does not. Hopefully my question will help others who run into the same problem ;)
My setup and code are as follows:
First, I installed ScrapyJS via sudo -H pip install scrapyjs
Then I started Splash with the following command: sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
Beforehand, I had modified my Scrapy project's settings.py by adding the following lines:

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'
The complete Python code looks like this:
import json
import base64

import scrapy


class MySpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse_result, meta={
                'splash': {
                    'args': {
                        'html': 1,
                        'png': 1,
                        'width': 600,
                        'render_all': 1,
                    }
                }
            })

    def parse_result(self, response):
        data = json.loads(response.body_as_unicode())
        body = data['html']
        png_bytes = base64.b64decode(data['png'])
        print(body)
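As a sanity check, the parse_result logic can be exercised on its own against a mocked render.json response. Splash returns a JSON object whose 'html' field holds the rendered page and whose 'png' field holds a base64-encoded screenshot; the HTML string and PNG bytes below are placeholders, not real Splash output:

```python
import json
import base64

# Build a fake render.json response body with placeholder values.
fake_png = base64.b64encode(b"\x89PNG fake bytes").decode("ascii")
fake_body = json.dumps({
    "html": "<html><body>rendered</body></html>",
    "png": fake_png,
})

# Same steps as parse_result: parse the JSON, pull out the HTML,
# and base64-decode the screenshot.
data = json.loads(fake_body)
body = data["html"]
png_bytes = base64.b64decode(data["png"])

print(body)
print(len(png_bytes))
```

This confirms the parsing side is fine; the failure has to happen before the response ever reaches parse_result.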
I get the following error:
2016-01-07 14:08:16 [scrapy] INFO: Enabled item pipelines:
2016-01-07 14:08:16 [scrapy] INFO: Spider opened
2016-01-07 14:08:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-07 14:08:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-07 14:08:16 [scrapy] DEBUG: Retrying <POST http://127.0.0.1:8050/render.json> (failed 1 times): 400 Bad Request
2016-01-07 14:08:16 [scrapy] DEBUG: Retrying <POST http://127.0.0.1:8050/render.json> (failed 2 times): 400 Bad Request
2016-01-07 14:08:16 [scrapy] DEBUG: Gave up retrying <POST http://127.0.0.1:8050/render.json> (failed 3 times): 400 Bad Request
2016-01-07 14:08:16 [scrapy] DEBUG: Crawled (400) <POST http://127.0.0.1:8050/render.json> (referer: None)
2016-01-07 14:08:16 [scrapy] DEBUG: Ignoring response <400 http://127.0.0.1:8050/render.json>: HTTP status code is not handled or not allowed
2016-01-07 14:08:16 [scrapy] INFO: Closing spider (finished)
So I really don't know where the error is. Scrapy on its own works.
If I add
SPLASH_URL = 'http://192.168.59.103:8050'
I get a timeout error instead, and nothing happens at all. localhost:8050 does not work either. Leaving SPLASH_URL unset avoids the timeout, but then I get the error above.

Best answer
You need to pass a non-zero 'wait' value for the full web page to be rendered.
So just add 'wait': 0.5 and it works:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse_result, meta={
            'splash': {
                'args': {
                    'html': 1,
                    'png': 1,
                    'width': 600,
                    'render_all': 1,
                    'wait': 0.5
                }
            }
        })
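For reference, the 'args' dict from the request meta is what ends up, together with the target URL, in the JSON body that the middleware POSTs to Splash's /render.json endpoint. The sketch below only illustrates that payload shape under that assumption; it is not the middleware's actual code:

```python
import json

# The Splash args from the request meta, including the non-zero 'wait'
# that the answer adds so the full page gets rendered.
args = {
    'html': 1,
    'png': 1,
    'width': 600,
    'render_all': 1,
    'wait': 0.5,
}

# Sketch of the JSON payload sent to /render.json: the args plus the
# page URL to render (URL taken from the spider above).
payload = dict(args, url="http://www.dmoz.org/Computers/Programming/Languages/Python/Books/")
print(json.dumps(payload, sort_keys=True))
```

With 'wait' at 0, Splash snapshots the page before rendering settles, which is why the request was rejected with 400 Bad Request when render_all was set.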