This article explains how to deal with links that are not parsed correctly because they have whitespace before and after the URL; hopefully it is useful to anyone hitting the same problem.
Problem description
I have a website I'm crawling which has whitespace before and after the URL:
<a href=" /c/96894 ">Test</a>
Instead of crawling this:
http://www.stores.com/c/96894/
it crawls this:
http://www.store.com/c/%0A%0A/c/96894%0A%0A
Moreover, it causes an infinite loop for links that contain the same link, like this:
http://www.store.com/cp/%0A%0A/cp/96894%0A%0A/cp/96894%0A%0A
Any whitespace (\r, \n, \t and space) before and after the URL is ignored by all browsers. How do I go about trimming the whitespace from the crawled URLs?
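To make the desired behaviour concrete, here is a minimal plain-Python sketch (not part of the original question; the page URL is hypothetical, the href value is the one quoted above): stripping the whitespace characters from the href before resolving it against the page URL gives the address a browser would actually request.

from urllib.parse import urljoin

page_url = "http://www.store.com/cp/96894"
raw_href = "\n\n/c/96894\n\n"           # href value as it appears in the markup

cleaned = raw_href.strip("\t\r\n ")     # drop \r, \n, \t and spaces, as browsers do
print(urljoin(page_url, cleaned))       # http://www.store.com/c/96894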
Here is my code:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website


class StoreSpider(CrawlSpider):
    name = "cpages"
    allowed_domains = ["www.store.com"]
    start_urls = ["http://www.store.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/c/',),
                               deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page=')),
             callback="parse_items", follow=True,
             process_links=lambda links: [link for link in links if not link.nofollow]),
        Rule(SgmlLinkExtractor(allow=(),
                               deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='))),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []
        for site in sites:
            item = Website()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['anchor'] = response.meta.get('link_text')
            item['canonical'] = site.select('//head/link[@rel="canonical"]/@href').extract()
            item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
            items.append(item)
        return items
Recommended answer
I used process_value=cleanurl in my LinkExtractor instance:
def cleanurl(link_text):
    return link_text.strip("\t\r\n ")
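For context (this paraphrases the Scrapy documentation rather than the original answer): process_value is called with each raw attribute value the link extractor finds, before the URL is made absolute, and returning None discards the link altogether. A quick sanity check of the helper:

print(repr(cleanurl("\n\n/c/96894\n\n")))   # '/c/96894', later joined against the page URL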
The full code, in case anyone runs into the same problem:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website


class storeSpider(CrawlSpider):
    name = "cppages"
    allowed_domains = ["www.store.com"]
    start_urls = ["http://www.store.com"]

    # Plain function in the class body: it is referenced below while the
    # rules tuple is being built, so it takes no self argument.
    def cleanurl(link_text):
        return link_text.strip("\t\r\n '\"")

    rules = (
        Rule(SgmlLinkExtractor(allow=('/cp/',),
                               deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='),
                               process_value=cleanurl),
             callback="parse_items", follow=True,
             process_links=lambda links: [link for link in links if not link.nofollow]),
        Rule(SgmlLinkExtractor(allow=('/cp/', '/browse/'),
                               deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='),
                               process_value=cleanurl)),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []
        for site in sites:
            item = Website()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['anchor'] = response.meta.get('link_text')
            item['canonical'] = site.select('//head/link[@rel="canonical"]/@href').extract()
            item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
            items.append(item)
        return items
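The listing above targets an old Scrapy release: the scrapy.contrib modules, SgmlLinkExtractor and HtmlXPathSelector have since been deprecated and removed. As a rough, untested adaptation to current Scrapy (the wallspider.items.Website item and its field names are taken from the original code), the same idea works with scrapy.linkextractors.LinkExtractor, which also accepts process_value; recent Scrapy versions appear to strip leading and trailing whitespace from extracted links on their own, so cleanurl mainly matters for the stray quote characters.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wallspider.items import Website


def cleanurl(value):
    # Strip surrounding whitespace and stray quote characters from the raw href value.
    return value.strip("\t\r\n '\"")


class StoreSpider(CrawlSpider):
    name = "cppages"
    allowed_domains = ["www.store.com"]
    start_urls = ["http://www.store.com"]

    rules = (
        Rule(LinkExtractor(allow=(r'/cp/',),
                           deny=(r'grid=false', r'sort=', r'stores=', r'\|\|', r'page='),
                           process_value=cleanurl),
             callback="parse_items", follow=True,
             process_links=lambda links: [link for link in links if not link.nofollow]),
        Rule(LinkExtractor(allow=(r'/cp/', r'/browse/'),
                           deny=(r'grid=false', r'sort=', r'stores=', r'\|\|', r'page='),
                           process_value=cleanurl)),
    )

    def parse_items(self, response):
        # Selectors are available directly on the response; no HtmlXPathSelector needed.
        item = Website()
        item['url'] = response.url
        item['referer'] = response.request.headers.get('Referer')
        item['anchor'] = response.meta.get('link_text')
        item['canonical'] = response.xpath('//head/link[@rel="canonical"]/@href').getall()
        item['robots'] = response.xpath('//meta[@name="robots"]/@content').getall()
        yield item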
That concludes this look at links that are not parsed correctly because of leading and trailing whitespace; hopefully the recommended answer helps anyone facing the same issue.