This post covers the question "Why does Scrapy skip some URLs but not others?" and the recommended answer.

Problem description

I am writing a Scrapy crawler to grab info on shirts from Amazon. The crawler starts on an Amazon page for some search, "funny shirts" for example, and collects all the result item containers. It then parses each result item, collecting data on the shirts.

I use ScraperAPI and scrapy-user-agents to dodge Amazon. The code for my spider is:

import re

import scrapy

# Import path assumed; ScrapeAmazonItem lives in the project's items module.
from ..items import ScrapeAmazonItem


class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    page_number = 2

    # Build one ScraperAPI-wrapped search URL per keyword in keywords.txt.
    keyword_file = open("keywords.txt", "r+")
    all_key_words = keyword_file.readlines()
    keyword_file.close()
    all_links = []
    keyword_list = []

    for keyword in all_key_words:
        keyword_list.append(keyword)
        formatted_keyword = keyword.replace('\n', '')
        formatted_keyword = formatted_keyword.strip()
        formatted_keyword = formatted_keyword.replace(' ', '+')
        all_links.append("http://api.scraperapi.com/?api_key=mykeyd&url=https://www.amazon.com/s?k=" + formatted_keyword + "&ref=nb_sb_noss_2")

    start_urls = all_links

    def parse(self, response):
        print("========== starting parse ===========")

        # Collect every result container on the search page and follow its product link.
        all_containers = response.css(".s-result-item")
        for shirts in all_containers:
            next_page = shirts.css('.a-link-normal::attr(href)').extract_first()
            if next_page is not None:
                if "https://www.amazon.com" not in next_page:
                    next_page = "https://www.amazon.com" + next_page
                yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + next_page, callback=self.parse_dir_contents)

        # Follow the next results page while the shared page counter is below 3.
        second_page = response.css('li.a-last a::attr(href)').get()
        if second_page is not None and AmazonSpiderSpider.page_number < 3:
            AmazonSpiderSpider.page_number += 1
            yield response.follow(second_page, callback=self.parse)

    def parse_dir_contents(self, response):
        items = ScrapeAmazonItem()

        print("============= parsing page ==============")

        # Product title.
        temp = response.css('#productTitle::text').extract()
        product_name = ''.join(temp)
        product_name = product_name.replace('\n', '')
        product_name = product_name.strip()

        # Price.
        temp = response.css('#priceblock_ourprice::text').extract()
        product_price = ''.join(temp)
        product_price = product_price.replace('\n', '')
        product_price = product_price.strip()

        # Sales rank, keeping digits only.
        temp = response.css('#SalesRank::text').extract()
        product_score = ''.join(temp)
        product_score = product_score.strip()
        product_score = re.sub(r'\D', '', product_score)

        # Extract the ASIN from the product URL.
        product_ASIN = re.search(r'(?<=/)B[A-Z0-9]{9}', response.url)
        product_ASIN = product_ASIN.group(0)

        items['product_ASIN'] = product_ASIN
        items['product_name'] = product_name
        items['product_price'] = product_price
        items['product_score'] = product_score

        yield items

I'm getting a 200 returned, so I know I'm getting the data from the webpage, but sometimes it does not go into parse_dir_contents, or it only grabs info on a few shirts and then moves on to the next keyword without following pagination.

Working with two keywords: the first keyword in my file (keywords.txt) is loaded, it may find 1-3 shirts, and then it moves on to the next keyword. The second keyword is then completely successful, finding all shirts and following pagination. In a keyword file with 5+ keywords, the first 2-3 keywords are skipped, then the next keyword is loaded and only 2-3 shirts are found before it moves on to the next word, which is again completely successful. In a file with 10+ keywords I get very sporadic behavior.

I have no idea why this is happening. Can anyone explain?

Recommended answer

Try making use of dont_filter=True in your Scrapy requests. I had the same problem; it seemed like the Scrapy crawler was ignoring some URLs because it thought they were duplicates.

dont_filter=True

This makes sure that Scrapy doesn't filter any URLs with its dupefilter.
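
As a sketch, this is the parse() method from the question with dont_filter=True added to both requests; everything else (the ScraperAPI wrapping, selectors, and callbacks) is unchanged from the code above:

    def parse(self, response):
        all_containers = response.css(".s-result-item")
        for shirts in all_containers:
            next_page = shirts.css('.a-link-normal::attr(href)').extract_first()
            if next_page is not None:
                if "https://www.amazon.com" not in next_page:
                    next_page = "https://www.amazon.com" + next_page
                # dont_filter=True tells the scheduler to accept this request even
                # if the dupefilter has already seen an equivalent URL.
                yield scrapy.Request(
                    'http://api.scraperapi.com/?api_key=mykey&url=' + next_page,
                    callback=self.parse_dir_contents,
                    dont_filter=True)

        # response.follow() forwards extra keyword arguments to Request, so the
        # pagination request can be marked the same way.
        second_page = response.css('li.a-last a::attr(href)').get()
        if second_page is not None and AmazonSpiderSpider.page_number < 3:
            AmazonSpiderSpider.page_number += 1
            yield response.follow(second_page, callback=self.parse, dont_filter=True)

Note that dont_filter=True disables duplicate filtering only for the requests it is passed to; anything else the spider yields is still deduplicated as usual.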
