I'm just picking up the basics of Scrapy and website crawlers, so I would really appreciate your input. I've built a plain and simple crawler with Scrapy, guided by a tutorial.


It works fine but it won't crawl all the pages as it should.


from scrapy.spider       import BaseSpider
from scrapy.selector     import HtmlXPathSelector
from scrapy.http.request import Request
from fraist.items        import FraistItem
import re

class fraistspider(BaseSpider):
    name = "fraistspider"
    allowed_domains = ["99designs.com"]
    start_urls = ["http://99designs.com/designer-blog/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//div[@class='pagination']/a/@href").extract()

        # Links already yielded from this response are stored in this list
        crawledLinks    = []

        # Pattern to check for a proper link (raw string, so the backslashes reach the regex engine unchanged)
        linkPattern     = re.compile(r"^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if linkPattern.match(link) and link not in crawledLinks:
                crawledLinks.append(link)
                yield Request(link, self.parse)

        posts = hxs.select("//article[@class='content-summary']")
        items = []
        for post in posts:
            item = FraistItem()
            item["title"] = post.select("div[@class='summary']/h3[@class='entry-title']/a/text()").extract()
            item["link"] = post.select("div[@class='summary']/h3[@class='entry-title']/a/@href").extract()
            item["content"] = post.select("div[@class='summary']/p/text()").extract()
            # Collect the item so that the loop below actually yields it
            items.append(item)
        for item in items:
            yield item


         'title': [u'Design a poster in the style of Saul Bass']}
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Scraped from <200 http://nnbdesig
        {'content': [u'Helping a company come up with a branding strategy can be
 exciting\xa0and intimidating, all at once. It gives a designer the opportunity
to make a great visual impact with a brand, but requires skills in logo, print a
nd digital design. If you\u2019ve been hesitating to join a 99designs Brand Iden
tity Pack contest, here are a... '],
         'link': [u'http://99designs.com/designer-blog/2015/05/07/tips-brand-ide
         'title': [u'99designs\u2019 tips for a successful Brand Identity Pack d
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to <GET http://
nnbdesigner.wpengine.com/> from <GET http://99designs.com/designer-blog/page/10/
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to <GET http://
nnbdesigner.wpengine.com/> from <GET http://99designs.com/designer-blog/page/11/
2015-05-20 16:22:41+0100 [fraistspider] INFO: Closing spider (finished)
2015-05-20 16:22:41+0100 [fraistspider] INFO: Stored csv feed (100 items) in: da
2015-05-20 16:22:41+0100 [fraistspider] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 4425,
         'downloader/request_count': 16,
         'downloader/request_method_count/GET': 16,
         'downloader/response_bytes': 126915,
         'downloader/response_count': 16,
         'downloader/response_status_count/200': 11,
         'downloader/response_status_count/301': 5,
         'dupefilter/filtered': 41,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 5, 20, 15, 22, 41, 738000),
         'item_scraped_count': 100,
         'log_count/DEBUG': 119,
         'log_count/INFO': 8,
         'request_depth_max': 5,
         'response_received_count': 11,
         'scheduler/dequeued': 16,
         'scheduler/dequeued/memory': 16,
         'scheduler/enqueued': 16,
         'scheduler/enqueued/memory': 16,
         'start_time': datetime.datetime(2015, 5, 20, 15, 22, 40, 718000)}
2015-05-20 16:22:41+0100 [fraistspider] INFO: Spider closed (finished)

As you can see, 'item_scraped_count' is 100, although it should be much higher, since there are 122 pages in total with 10 articles per page.

From the output I can see that there is a 301 redirect issue, but I don't understand why this is causing problems. I've tried another approach, rewriting my spider's code, but it again breaks after a few entries, at around the same point.


Any help would be much appreciated. Thank you!


It seems you have hit the limit described at http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-items (the CONCURRENT_ITEMS setting).
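
For reference, that setting is changed in the project's settings.py. This is only a minimal sketch: it assumes the limit really is what stops the crawl (the log above does not confirm this), and 200 is purely an illustrative value. The documented default is 100.

```python
# settings.py of the Scrapy project. The location fraist/settings.py is an
# assumption based on the "fraist.items" import in the question.

# Maximum number of items processed in parallel, per response, in the
# item pipelines. Scrapy's documented default is 100; 200 is an arbitrary
# illustrative value, not a recommendation.
CONCURRENT_ITEMS = 200
```

The same value can also be tried for a single run without editing the file: scrapy crawl fraistspider -s CONCURRENT_ITEMS=200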

For this case I would go with a CrawlSpider to crawl multiple pages: define a rule that matches the pages on 99designs.com and slightly modify your parse function to process the items.

Example code copied from the Scrapy documentation:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

I just found this blog post, which contains a useful example.

