This article covers how to crawl every link listed in a site's sitemap.xml with Scrapy, and may be a useful reference if you have hit the same problem.
Problem Description
I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap. Now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        # parse() is the default callback for every URL found in the sitemap
        print(response.url)
Recommended Answer
You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many rules as you want. For instance, say you have a page named http://www.xyz.nl//x/ and you want to create a rule for it:
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']

    # list of (regex, callback-name) tuples - this example contains one page
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        # extract all <p> elements from pages whose URL matches /x/
        paragraphs = response.xpath('//p').extract()
        yield {'paragraphs': paragraphs}
That wraps up this article on crawling all sitemap links with Scrapy; hopefully the recommended answer above is helpful.