I modified this spider, but it gives this error:
Gave up retrying <GET https://lib.maplelegends.com/robots.txt> (failed 3 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://lib.maplelegends.com/robots.txt> (referer: None)
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 1 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 2 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 3 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://lib.maplelegends.com/?p=etc&id=4004003> (referer: None)
2019-01-06 23:43:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://lib.maplelegends.com/?p=etc&id=4004003>: HTTP status code is not handled or not allowed
Crawler code (it runs without a project and saves the results in output.csv):

#!/usr/bin/env python3
import scrapy
import time

start_url = 'https://lib.maplelegends.com/?p=etc&id=4004003'


class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = [start_url]

    def parse(self, response):
        products = response.xpath('.//div[@class="table-responsive"]/table/tbody')
        for product in products:
            item = {
                # extract_first() returns a single href string, which is what
                # response.follow() expects (extract() would return a list)
                'link': product.xpath('./tr/td/b[1]/a/@href').extract_first(),
            }
            yield response.follow(item['link'], callback=self.parse_product, meta={'item': item})
        time.sleep(5)  # note: this blocks the whole crawler; DOWNLOAD_DELAY is the idiomatic alternative
        # re-queue the start URL with low priority so the crawl keeps looping
        yield scrapy.Request(start_url, dont_filter=True, priority=-1)

    def parse_product(self, response):
        names = response.xpath('(//strong)[1]/text()').re(r'(\w+)')
        hps = response.xpath('//*[contains(concat(" ", @class, " "), " image ")] | //img').re(r':(\d+)')
        scrolls = response.xpath('//*[contains(concat(" ", @class, " "), " image ")] | //strong//a//img/@title').re(r'\bScroll\b')
        for name, hp, scroll in zip(names, hps, scrolls):
            yield {'name': name.strip(), 'hp': hp.strip(), 'scroll': scroll.strip()}
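For reference, a stand-alone spider file like this can be run without a project, with its items saved to output.csv, using Scrapy's runspider command (the file name MySpider.py is an assumption):

scrapy runspider MySpider.py -o output.csv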
Best answer
Robots.txt
Your crawler is trying to fetch the robots.txt file, but the website does not serve one.
To avoid this, set the ROBOTSTXT_OBEY setting to False in settings.py.
The built-in default is False, but new Scrapy projects generated with the scrapy startproject command get ROBOTSTXT_OBEY = True from the project template.
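Since the spider above runs without a project (so there is no settings.py), a per-spider override works too. A minimal sketch using Scrapy's custom_settings attribute:

class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = [start_url]
    # per-spider setting override; handy when running stand-alone via `scrapy runspider`
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }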
503 responses
In addition, the website appears to answer every first request with a 503. The site is using some kind of bot protection: the first request gets a 503, and then some javascript issues an AJAX request to generate a __shovlshield cookie.
It looks like the https://shovl.io/ DDoS protection service is being used.
To get around this, you need to reverse-engineer how the javascript generates the cookie, or use a javascript rendering technique/service such as selenium or splash.
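For illustration, a minimal sketch of the selenium route, assuming chromedriver is installed and on PATH (the CookieSpider name and the 5-second wait are placeholders, not part of the original answer): let a real browser execute the challenge javascript, then hand its cookies over to Scrapy.

import time
import scrapy
from selenium import webdriver

start_url = 'https://lib.maplelegends.com/?p=etc&id=4004003'


class CookieSpider(scrapy.Spider):
    name = 'CookieSpider'

    def start_requests(self):
        # a real browser runs the protection javascript for us
        driver = webdriver.Chrome()
        driver.get(start_url)
        time.sleep(5)  # crude wait for the JS challenge to set __shovlshield
        cookies = {c['name']: c['value'] for c in driver.get_cookies()}
        driver.quit()
        # replay the request through Scrapy with the harvested cookies
        yield scrapy.Request(start_url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        self.logger.info('status: %s', response.status)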
Regarding "python - scrapy 503 Service Unavailable on the start URL", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54069663/