Problem Description
I want to crawl the page http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B
with Scrapy, but there seems to be a problem: I don't get any data when crawling it.
Here is my spider code:
import scrapy
from scrapy.selector import Selector

from scrapy_Data.items import CharProt


class CPSpider(scrapy.Spider):
    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = ["http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*[@id="middle_content_template"]/table/tbody/tr')

        for site in sites:
            item = CharProt()
            item['protein_name'] = site.xpath('td[1]/a/text()').extract()
            item['pn_link'] = site.xpath('td[1]/a/@href').extract()
            item['organism'] = site.xpath('td[2]/a/text()').extract()
            item['organism_link'] = site.xpath('td[2]/a/@href').extract()
            item['status'] = site.xpath('td[3]/a/text()').extract()
            item['status_link'] = site.xpath('td[3]/a/@href').extract()
            item['references'] = site.xpath('td[4]/a').extract()
            item['source'] = "CharProt"
            # collection.update({"protein_name": item['protein_name']}, dict(item), upsert=True)
            yield item
Here is the log:
2016-05-28 17:25:06 [scrapy] INFO: Spider opened
2016-05-28 17:25:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-28 17:25:06 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-28 17:25:07 [scrapy] DEBUG: Crawled (200) <GET http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B> (referer: None)
<200 http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B>
2016-05-28 17:25:08 [scrapy] INFO: Closing spider (finished)
2016-05-28 17:25:08 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 337,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 26198,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 28, 9, 25, 8, 103577),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 5, 28, 9, 25, 6, 55848)}
And when I run my other spiders, they all work fine. So can anybody tell me what's wrong with my code? Or is there something wrong with this webpage?
Recommended Answer
You are crawling it, but your XPath is wrong.
When you inspect an element with your browser, the <tbody> tag appears because the browser inserts it into the DOM automatically, but it is nowhere in the actual page source; therefore, your XPath matches nothing and there is nothing to crawl!
sites = sel.xpath('//*[@id="middle_content_template"]/table/tr')
should work instead.
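As a quick way to verify this (my own suggestion, not part of the original answer), scrapy shell queries the raw HTML exactly as the spider receives it, without the <tbody> the browser inserts into the DOM:

scrapy shell "http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"

# Inside the shell: the tbody path should come back empty,
# while the same path without tbody should return the row selectors.
response.xpath('//*[@id="middle_content_template"]/table/tbody/tr')
response.xpath('//*[@id="middle_content_template"]/table/tr')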
Edit
As a side note, extract() returns a list rather than the single element you want, so you need to use the extract_first() method or extract()[0].
For example:
item['protein_name'] = site.xpath('td[1]/a/text()').extract_first()
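Putting both fixes together, the parse method would look something like the sketch below (my assembly of the answer's two corrections, assuming the original CharProt item fields; the references field keeps plain extract() because it deliberately collects a list of <a> elements):

    def parse(self, response):
        # No tbody in the path: the browser adds that tag, the raw HTML does not
        sites = response.xpath('//*[@id="middle_content_template"]/table/tr')
        for site in sites:
            item = CharProt()
            item['protein_name'] = site.xpath('td[1]/a/text()').extract_first()
            item['pn_link'] = site.xpath('td[1]/a/@href').extract_first()
            item['organism'] = site.xpath('td[2]/a/text()').extract_first()
            item['organism_link'] = site.xpath('td[2]/a/@href').extract_first()
            item['status'] = site.xpath('td[3]/a/text()').extract_first()
            item['status_link'] = site.xpath('td[3]/a/@href').extract_first()
            item['references'] = site.xpath('td[4]/a').extract()  # a list is intended here
            item['source'] = "CharProt"
            yield item

Calling response.xpath() directly also makes the explicit Selector(response) wrapper unnecessary; Scrapy builds the selector for you.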