This post looks at how to deal with links that are not parsed correctly because of whitespace before and after the URL.

Problem Description

I'm crawling a website that has whitespace before and after the URL in its links:

<a href="   /c/96894   ">Test</a>

Instead of crawling this:

http://www.stores.com/c/96894/

it crawls this:

http://www.store.com/c/%0A%0A/c/96894%0A%0A

Moreover, it causes an infinite loop on pages whose links point back to the same path, producing URLs like this:

http://www.store.com/cp/%0A%0A/cp/96894%0A%0A/cp/96894%0A%0A

Any whitespace (\r, \n, \t and spaces) before and after the URL is ignored by all browsers. How do I go about trimming the whitespace from the crawled URLs?
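
To see where the %0A sequences come from: percent-encoding turns each newline around the raw href into %0A, so the fix is simply to strip the value before it is turned into an absolute URL. A minimal sketch with plain urllib (not Scrapy's internal code), assuming the href contains literal newlines, as the %0A in the crawled URL suggests:

from urllib.parse import quote, urljoin

raw_href = "\n\n/c/96894\n\n"   # href as it appears in the page source

# percent-encoding the raw value reproduces the mangled path
print(quote(raw_href, safe="/"))                            # %0A%0A/c/96894%0A%0A

# stripping the whitespace first gives the URL browsers actually follow
print(urljoin("http://www.store.com/", raw_href.strip()))   # http://www.store.com/c/96894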

Here is my code:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from wallspider.items import Website

class StoreSpider(CrawlSpider):
    name = "cpages"
    allowed_domains = ["www.store.com"]
    start_urls = ["http://www.sore.com",]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/c/',), deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page=')),
             callback="parse_items", follow=True,
             process_links=lambda links: [link for link in links if not link.nofollow]),
        Rule(SgmlLinkExtractor(allow=(), deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='))),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        for site in sites:
            item = Website()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['anchor'] = response.meta.get('link_text')
            item['canonical'] = site.xpath('//head/link[@rel="canonical"]/@href').extract()
            item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
            items.append(item)

        return items

Recommended Answer

I used process_value=cleanurl in my LinkExtractor instance:

def cleanurl(link_text):
    return link_text.strip("\t\r\n ")
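
SgmlLinkExtractor passes every raw attribute value it extracts through process_value before building the request URL, so stripping the whitespace here keeps the %0A sequences out of the crawled URLs and also stops the duplicated-path loop.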

The full code, in case anyone runs into the same problem:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from wallspider.items import Website


class storeSpider(CrawlSpider):
    name = "cppages"
    allowed_domains = ["www.store.com"]
    start_urls = ["http://www.store.com",]

    def cleanurl(link_text):
        return link_text.strip("\t\r\n '\"")

    rules = (
        Rule(SgmlLinkExtractor(allow=('/cp/',), deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='),
                               process_value=cleanurl),
             callback="parse_items", follow=True,
             process_links=lambda links: [link for link in links if not link.nofollow]),
        Rule(SgmlLinkExtractor(allow=('/cp/', '/browse/'), deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='),
                               process_value=cleanurl)),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        for site in sites:
            item = Website()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['anchor'] = response.meta.get('link_text')
            item['canonical'] = site.xpath('//head/link[@rel="canonical"]/@href').extract()
            item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
            items.append(item)

        return items
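
The scrapy.contrib modules and SgmlLinkExtractor used above have since been removed from Scrapy. On current versions the same idea carries over to scrapy.linkextractors.LinkExtractor, which also accepts process_value. A minimal sketch under that assumption, keeping the question's hypothetical store.com patterns:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def cleanurl(value):
    # strip surrounding whitespace and stray quotes from each extracted href
    return value.strip("\t\r\n '\"")


class StoreSpider(CrawlSpider):
    name = "cppages"
    allowed_domains = ["www.store.com"]
    start_urls = ["http://www.store.com"]

    rules = (
        Rule(LinkExtractor(allow=(r'/cp/',),
                           deny=(r'grid=false', r'sort=', r'stores=', r'\|\|', r'page='),
                           process_value=cleanurl),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # yield a plain dict instead of the Website item used above
        yield {
            'url': response.url,
            'referer': response.request.headers.get('Referer'),
            'canonical': response.xpath('//head/link[@rel="canonical"]/@href').getall(),
            'robots': response.xpath('//meta[@name="robots"]/@content').getall(),
        }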
