I want to use recursion to crawl all the links in a website and parse every linked page to extract the detail links it contains. If a page link conforms to a rule, that link is an item whose details I want to parse. I use the code below:
class DmovieSpider(BaseSpider):
    name = "dmovie"
    allowed_domains = ["movie.douban.com"]
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        item = DmovieItem()
        hxl = HtmlXPathSelector(response)
        urls = hxl.select("//a/@href").extract()
        all_this_urls = []
        for url in urls:
            if re.search(r"movie.douban.com/subject/\d+/$", url):
                yield Request(url=url, cookies=cookies, callback=self.parse_detail)
            elif ("movie.douban.com" in url) and ("movie.douban.com/people" not in url) and ("movie.douban.com/celebrity" not in url) and ("comment" not in url):
                if ("update" not in url) and ("add" not in url) and ("trailer" not in url) and ("cinema" not in url) and (not redis_conn.sismember("crawledurls", url)):
                    all_this_urls.append(Request(url=url, cookies=cookies, callback=self.parse))
        for i in all_this_urls:
            yield i

    def parse_detail(self, response):
        hxl = HtmlXPathSelector(response)
        title = hxl.select("//span[@property='v:itemreviewed']/text()").extract()
        title = select_first(title)
        img = hxl.select("//div[@class='grid-16-8 clearfix']//a[@class='nbgnbg']/img/@src").extract()
        img = select_first(img)
        info = hxl.select("//div[@class='grid-16-8 clearfix']//div[@id='info']")
        director = info.select("//a[@rel='v:directedBy']/text()").extract()
        director = select_first(director)
        actors = info.select("//a[@rel='v:starring']/text()").extract()
        m_type = info.select("//span[@property='v:genre']/text()").extract()
        release_date = info.select("//span[@property='v:initialReleaseDate']/text()").extract()
        release_date = select_first(release_date)
        d_rate = info.select("//strong[@class='ll rating_num']/text()").extract()
        d_rate = select_first(d_rate)
        info = select_first(info)
        post = hxl.select("//div[@class='grid-16-8 clearfix']//div[@class='related-info']/div[@id='link-report']").extract()
        post = select_first(post)
        movie_db = Movie()
        movie_db.name = title.encode("utf-8")
        movie_db.dis_time = release_date.encode("utf-8")
        movie_db.description = post.encode("utf-8")
        movie_db.actors = "::".join(actors).encode("utf-8")
        movie_db.director = director.encode("utf-8")
        movie_db.mtype = "::".join(m_type).encode("utf-8")
        movie_db.origin = "movie.douban.com"
        movie_db.d_rate = d_rate.encode("utf-8")
        exist_item = Movie.where(origin_url=response.url).select().fetchone()
        if not exist_item:
            movie_db.origin_url = response.url
            print "successed!!!!!!!!!!!!!!!!!!!!!!!!!!!"
urls holds all the links in the page. If one of the URLs is a detail page I want to parse, I yield a Request whose callback is parse_detail; otherwise I yield a Request whose callback is parse.
This way I crawled some pages, but the result seems incomplete: some pages were apparently never visited. Could you tell me why? Is there a way to crawl all the pages correctly?
Try CrawlSpider.
Then set DEPTH_LIMIT = 0 in settings.py (0 means no depth limit) to make sure the spider crawls all pages in the website.