I am scraping data from multiple URLs like this:
import scrapy
from pogba.items import PogbaItem

class DmozSpider(scrapy.Spider):
    name = "pogba"
    allowed_domains = ["fourfourtwo.com"]
    start_urls = [
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459525/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459571/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459585/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459614/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459635/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459644/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459662/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459674/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459686/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459694/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459705/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459710/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459737/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459744/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459765/player-stats/74208/OVERALL_02"
    ]

    def parse(self, response):
        Coords = []
        # Every element inside the pitch whose class marks it as a successful action
        for sel in response.xpath('//*[@id="pitch"]/*[contains(@class,"success")]'):
            item = PogbaItem()
            # Take whichever coordinate attribute the element carries (x or x1, y or y1)
            item['x'] = sel.xpath('(@x|@x1)').extract()
            item['y'] = sel.xpath('(@y|@y1)').extract()
            Coords.append(item)
        return Coords
The problem is that with this setup my CSV ends up with only about 200 rows, while the CSV for each individual URL has about 50 rows, so I would expect roughly 750 rows in total. Scraping one URL at a time works fine, so why do I get different results when I set multiple URLs?
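(For context, the CSV is presumably produced with Scrapy's feed export; the exact command isn't shown in the question. A minimal sketch under that assumption, which runs the crawl from a script and appends every scraped item from every response to one file; the module path and the file name "output.csv" are hypothetical:)

# Roughly equivalent to running `scrapy crawl pogba -o output.csv`
from scrapy.crawler import CrawlerProcess
from pogba.spiders.pogba_spider import DmozSpider  # hypothetical module path

process = CrawlerProcess(settings={
    # Feed export: write every item from every crawled URL into a single CSV
    "FEEDS": {"output.csv": {"format": "csv"}},
})
process.crawl(DmozSpider)
process.start()  # blocks until all start_urls have been crawled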
Accepted answer
I would try tuning the crawl speed and slowing it down a bit, by increasing the delay between requests (the DOWNLOAD_DELAY setting) and reducing the number of concurrent requests (the CONCURRENT_REQUESTS setting), for example:
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 4
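These values normally go in the project's settings.py. A minimal sketch, assuming you only want to throttle this one spider, using Scrapy's per-spider custom_settings attribute; everything apart from the two settings is taken from the question's code:

import scrapy
from pogba.items import PogbaItem

class DmozSpider(scrapy.Spider):
    name = "pogba"
    allowed_domains = ["fourfourtwo.com"]

    # Per-spider overrides of the project-wide settings
    custom_settings = {
        "DOWNLOAD_DELAY": 1,       # wait one second between consecutive requests
        "CONCURRENT_REQUESTS": 4,  # keep at most four requests in flight
    }

    start_urls = [
        # ... the same fifteen match URLs as in the question ...
    ]

    def parse(self, response):
        for sel in response.xpath('//*[@id="pitch"]/*[contains(@class,"success")]'):
            item = PogbaItem()
            item['x'] = sel.xpath('(@x|@x1)').extract()
            item['y'] = sel.xpath('(@y|@y1)').extract()
            yield item  # yielding items behaves the same as returning a list

Overriding the settings per spider keeps any other spiders in the project unaffected.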
Regarding this Python question about multiple URLs, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36986116/