Problem description
Hello, I have the following code to scan all links in a given site.
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class SampleItem(Item):
    link = Field()

class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["domain.com"]
    start_urls = ["http://domain.com"]
    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
How can I crawl only part of a global site? For example, I tried to scan only the French part of an international site whose URLs are structured as domain.com/fr/fr. So I tried:
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class SampleItem(Item):
    link = Field()

class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["domain.com/fr/fr"]
    start_urls = ["http://domain.com/fr/fr"]
    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
But the spider returns only 3 results instead of thousands. What am I doing wrong?
Recommended answer
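To crawl only part of a website, restrict which links the LinkExtractor follows by giving it an allow pattern, and keep allowed_domains as a bare domain name. allowed_domains is compared against hostnames, so an entry that includes a path, such as domain.com/fr/fr, matches none of the site's URLs and the followed links are most likely being filtered as off-site. You can generate a sample spider by issuing scrapy genspider -t crawl domain domain.com.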
# -*- coding: utf-8 -*-
import scrapy
# Since Scrapy 1.0 these classes live in scrapy.linkextractors and
# scrapy.spiders (they were formerly under scrapy.contrib).
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from test.items import testItem

class DomainSpider(CrawlSpider):
    name = 'domain'
    # Domain name only: no scheme and no path.
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/fr/fr']
    rules = (
        # Follow only links whose URL matches the allow regex.
        Rule(LinkExtractor(allow=r'fr/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = testItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
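To see why the two settings behave differently, here is a minimal standard-library sketch of the two checks. It is an approximation for illustration, not Scrapy's actual code: is_onsite and link_allowed are hypothetical helper names, the sample URLs are made up, and the tighter pattern at the end is only one possible choice. It assumes that allowed_domains is compared against the URL's hostname and that the allow regex is searched anywhere in the absolute URL.

```python
import re
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Approximate the off-site check: only the hostname is compared,
    and it must equal an allowed domain or be one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def link_allowed(url, allow_pattern):
    """Approximate LinkExtractor(allow=...): the regex is searched
    anywhere in the absolute URL, not anchored at the start."""
    return re.search(allow_pattern, url) is not None


# Hypothetical URLs on the example site.
french = "http://www.domain.com/fr/fr/products"
english = "http://www.domain.com/en/us/products"
lookalike = "http://www.domain.com/en/us/fr/about"

# A path inside allowed_domains can never equal a hostname.
print(is_onsite(french, ["domain.com/fr/fr"]))  # False
print(is_onsite(french, ["domain.com"]))        # True

# r'fr/' is loose: it also accepts URLs that merely contain "fr/".
print(link_allowed(french, r"fr/"))     # True
print(link_allowed(english, r"fr/"))    # False
print(link_allowed(lookalike, r"fr/"))  # True

# One possible tighter pattern (an assumption about the site's URL layout).
STRICT = r"^https?://(www\.)?domain\.com/fr/fr(/|$)"
print(link_allowed(french, STRICT))     # True
print(link_allowed(lookalike, STRICT))  # False
```

If the loose pattern pulls in pages outside the French section, passing an anchored pattern like STRICT as the allow argument may be a better fit, depending on how the site builds its URLs.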