Problem Description
I want to crawl the page http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B
with Scrapy, but there seems to be a problem: I don't get any data when crawling it.
Here is my spider code:
import scrapy
from scrapy.selector import Selector

from scrapy_Data.items import CharProt


class CPSpider(scrapy.Spider):
    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = ["http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*[@id="middle_content_template"]/table/tbody/tr')

        for site in sites:
            item = CharProt()
            item['protein_name'] = site.xpath('td[1]/a/text()').extract()
            item['pn_link'] = site.xpath('td[1]/a/@href').extract()
            item['organism'] = site.xpath('td[2]/a/text()').extract()
            item['organism_link'] = site.xpath('td[2]/a/@href').extract()
            item['status'] = site.xpath('td[3]/a/text()').extract()
            item['status_link'] = site.xpath('td[3]/a/@href').extract()
            item['references'] = site.xpath('td[4]/a').extract()
            item['source'] = "CharProt"
            # collection.update({"protein_name": item['protein_name']}, dict(item), upsert=True)
            yield item
Here is the log:
2016-05-28 17:25:06 [scrapy] INFO: Spider opened
2016-05-28 17:25:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-28 17:25:06 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-28 17:25:07 [scrapy] DEBUG: Crawled (200) <GET http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B> (referer: None)
<200 http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B>
2016-05-28 17:25:08 [scrapy] INFO: Closing spider (finished)
2016-05-28 17:25:08 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 337,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 26198,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 28, 9, 25, 8, 103577),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 5, 28, 9, 25, 6, 55848)}
And when I run my other spiders, they all work fine. So can anybody tell me what's wrong with my code? Or is there something wrong with this webpage?
Recommended Answer
You are crawling it, but your XPath is wrong.
When you inspect an element with your browser, the <tbody> tag appears because the browser inserts it into the DOM automatically, but it is nowhere in the actual page source; therefore, your XPath matches nothing and there is nothing to crawl!
sites = sel.xpath('//*[@id="middle_content_template"]/table/tr')
should work instead.
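As a quick way to verify this (my own suggestion, not part of the original answer), scrapy shell queries the raw HTML exactly as the spider receives it, without the <tbody> the browser inserts into the DOM:

scrapy shell "http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"

# Inside the shell: the tbody path should come back empty,
# while the same path without tbody should return the row selectors.
response.xpath('//*[@id="middle_content_template"]/table/tbody/tr')
response.xpath('//*[@id="middle_content_template"]/table/tr')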
Edit
As a side note, extract() returns a list rather than the single element you want, so you need to use the extract_first() method or extract()[0].
For example:
item['protein_name'] = site.xpath('td[1]/a/text()').extract_first()
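Putting both fixes together, the parse method would look something like the sketch below (my assembly of the answer's two corrections, assuming the original CharProt item fields; the references field keeps plain extract() because it deliberately collects a list of <a> elements):

    def parse(self, response):
        # No tbody in the path: the browser adds that tag, the raw HTML does not
        sites = response.xpath('//*[@id="middle_content_template"]/table/tr')
        for site in sites:
            item = CharProt()
            item['protein_name'] = site.xpath('td[1]/a/text()').extract_first()
            item['pn_link'] = site.xpath('td[1]/a/@href').extract_first()
            item['organism'] = site.xpath('td[2]/a/text()').extract_first()
            item['organism_link'] = site.xpath('td[2]/a/@href').extract_first()
            item['status'] = site.xpath('td[3]/a/text()').extract_first()
            item['status_link'] = site.xpath('td[3]/a/@href').extract_first()
            item['references'] = site.xpath('td[4]/a').extract()  # a list is intended here
            item['source'] = "CharProt"
            yield item

Calling response.xpath() directly also makes the explicit Selector(response) wrapper unnecessary; Scrapy builds the selector for you.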