I am trying to write a spider to crawl across multiple pages, starting from this URL: http://bookshop.lawsociety.org.uk/ecom_lawsoc/public/saleproducts.jsf?catId=EBOOK. I am using Scrapy version 0.22.1 to do this. However, I get a
"cannot import name CrawlSpider" error. I have pasted the spider's code below. Can anyone spot where I am going wrong?
from scrapy.spider import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import BookpagesItem

class BookpagesSpider(CrawlSpider):
    name = "book_sample"
    allowed_domains = ["bookshop.lawsociety.org.uk"]
    start_urls = ["http://bookshop.lawsociety.org.uk/ecom_lawsoc/public/saleproducts.jsf?catId=EBOOK",
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('//*[@id="productList:scrollernext"]',)), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('//p/a[contains(@id, "productList")]',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="dataListDiv"]')
        items = []
        for site in sites:
            item = BooksItem()
            item['title'] = site.xpath('//div/a/h3[@class="saleProductsTitle"]/text()').extract()
            item['link'] = site.xpath('//p/a[contains(@id, "productList")]').extract()
            item['price'] = site.xpath('//*[@class="saleProductsPrice"]/text()').extract()
            item['category'] = site.xpath('//span[contains(@id, "category")]/text()').extract()
            item['authors'] = site.xpath('//span[contains(@id, "author")]/text()').extract()
            item['date'] = site.xpath('//span[contains(@id, "publicationDate")]/text()').extract()
            item['publisher'] = site.xpath('//span[contains(@id, "publisher")]/text()').extract()
            item['isbn'] = site.xpath('//span[contains(@id, "isbn")]/text()').extract()
            items.append(item)
        return items
The code in items.py is:
from scrapy.item import Item, Field

class BookpagesItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    link = Field()
    price = Field()
    category = Field()
    authors = Field()
    date = Field()
    publisher = Field()
    isbn = Field()
Best answer
The error is telling you that the line

from scrapy.spider import CrawlSpider, Rule

is incorrect. Looking at the Scrapy documentation, it should most likely be

from scrapy.contrib.spiders import CrawlSpider, Rule

(In Scrapy versions of that era, SgmlLinkExtractor likewise lived under scrapy.contrib.linkextractors.sgml rather than scrapy.linkextractors.sgml.)
Whenever you see an "ImportError: cannot import name foo" error, you are looking at an incorrect import, so you can narrow the problem down to the import statements alone. You can look up the correct location in the library's documentation, or in the source code itself if you have it.
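To see why this error always points at the import line itself, here is a minimal, Scrapy-free sketch using only the standard library: the module imports fine, but the requested name is not defined in it, so Python raises ImportError before any of your own code runs.

```python
# "cannot import name" means the MODULE was found, but the NAME is not in it.
# (Generic stdlib illustration; Scrapy is not required here.)
try:
    from collections import CrawlSpider  # wrong module for this name
except ImportError as exc:
    print(exc)  # e.g. cannot import name 'CrawlSpider' from 'collections'

# The same import mechanism succeeds once the name really lives in the
# module you name, e.g. a name that collections actually defines:
from collections import Counter

print(Counter("scrapy")["s"])
```

This is why the fix is always to correct the import path, never the code that uses the name afterwards.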
I searched the Scrapy documentation and found this: http://doc.scrapy.org/en/0.24/topics/spiders.html#crawlspider