问题描述
我遇到了这个问题,我已经 4 天了.我想爬取 "http://www.ledcor.com/careers/search-careers".在每个职位列表页面(即 http://www.ledcor.com/careers/search-careers?page=2) 我进入每个职位链接并获得职位名称.到目前为止我有这个工作.
I am hitting a dead end with this problem I am having for 4 days. I want to crawl "http://www.ledcor.com/careers/search-careers". On each job listing page (i.e. http://www.ledcor.com/careers/search-careers?page=2) I go inside each job link and get the job title. I have this working so far.
现在,我试图让蜘蛛进入下一个职位列表页面(来自 http://www.ledcor.com/careers/search-careers?page=2 到 http://www.ledcor.com/careers/search-careers?page=3 并抓取所有工作).我的爬网规则不起作用,我不知道出了什么问题,遗漏了什么.请帮忙.
Now, I am trying to make the spider go to next job listing page (i.g. from http://www.ledcor.com/careers/search-careers?page=2 to http://www.ledcor.com/careers/search-careers?page=3 and crawl all the jobs). My crawl rule does not work and I have no clues what is wrong, what is missing. Please help.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
class LedcorSpider(CrawlSpider):
name = "ledcor"
allowed_domains = ["www.ledcor.com"]
start_urls = ["http://www.ledcor.com/careers/search-careers"]
rules = [
Rule(SgmlLinkExtractor(allow=("http://www.ledcor.com/careers/search-careers\?page=\d",),restrict_xpaths=('//div[@class="pager"]/a',)), follow=True),
Rule(SgmlLinkExtractor(allow=("http://www.ledcor.com/job\?(.*)",)),callback="parse_items")
]
def parse_items(self, response):
hxs = HtmlXPathSelector(response)
item = CraigslistSampleItem()
item['title'] = hxs.select('//h1/text()').extract()[0].encode('utf-8')
item['link'] = response.url
return item
这里是 Items.py
here is Items.py
from scrapy.item import Item, Field
class CraigslistSampleItem(Item):
title = Field()
link = Field()
desc = Field()
这里是 Pipelines.py
Here is Pipelines.py
class CraigslistSamplePipeline(object):
def process_item(self, item, spider):
return item
更新:(@blender 建议)它不会爬行
Updated: (@blender suggestion) It doesnt crawl
rules = [
Rule(SgmlLinkExtractor(allow=(r"http://www.ledcor.com/careers/search-careers\?page=\d",),restrict_xpaths=('//div[@class="pager"]/a',)), follow=True),
Rule(SgmlLinkExtractor(allow=("http://www.ledcor.com/job\?(.*)",)),callback="parse_items")
]
推荐答案
您的 restrict_xpaths
参数是错误的.删除它,它会起作用.
Your restrict_xpaths
argument is wrong. Remove it and it will work.
$ scrapy shell http://www.ledcor.com/careers/search-careers
In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [2]: lx = SgmlLinkExtractor(allow=("http://www.ledcor.com/careers/search-careers\?page=\d",),restrict_xpaths=('//div[@class="pager"]/a',))
In [3]: lx.extract_links(response)
Out[3]: []
In [4]: lx = SgmlLinkExtractor(allow=("http://www.ledcor.com/careers/search-careers\?page=\d",))
In [5]: lx.extract_links(response)
Out[5]:
[Link(url='http://www.ledcor.com/careers/search-careers?page=1', text=u'', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=2', text=u'2', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=3', text=u'3', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=4', text=u'4', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=5', text=u'5', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=6', text=u'6', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=7', text=u'7', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=8', text=u'8', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=9', text=u'9', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=10', text=u'10', fragment='', nofollow=False)]
这篇关于Scrapy 在第一页后不爬行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!