This article describes how to parse URLs from a sitemap with an unusual URL format using the sitemap spider in Scrapy (Python). It should be a useful reference for anyone hitting the same problem; follow along below.

Problem Description

I am using the sitemap spider in Scrapy (Python). The sitemap seems to have an unusual format, with protocol-relative URLs (a leading '//') in front of each entry:

<url>
    <loc>//www.example.com/10/20-baby-names</loc>
</url>
<url>
    <loc>//www.example.com/elizabeth/christmas</loc>
</url>

myspider.py

from scrapy.contrib.spiders import SitemapSpider
from myspider.items import *

class MySpider(SitemapSpider):
    name = "myspider"
    sitemap_urls = ["http://www.example.com/robots.txt"]

    def parse(self, response):
        item = PostItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()

        return item

I get this error:

raise ValueError('Missing scheme in request url: %s' % self._url)
    exceptions.ValueError: Missing scheme in request url: //www.example.com/10/20-baby-names
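
The error comes from Scrapy's Request constructor, which rejects any URL that does not carry a scheme, so the protocol-relative <loc> entries above fail as soon as the spider turns them into requests. A minimal way to reproduce the check in isolation (a sketch, using only the Request class from the question's Scrapy version):

from scrapy.http import Request

# Request() validates that the URL starts with a scheme such as http:// or https://;
# a protocol-relative URL like the ones in the sitemap fails the check and raises
# ValueError: Missing scheme in request url: //www.example.com/10/20-baby-names
Request('//www.example.com/10/20-baby-names')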

How can I manually parse the URLs with the sitemap spider?

Recommended Answer

I think the nicest and cleanest solution would be to add a downloader middleware which rewrites the scheme-less URLs without the spider noticing.

import re
import urlparse
from scrapy.http import XmlResponse
from scrapy.utils.gz import gunzip, is_gzipped
from scrapy.contrib.spiders import SitemapSpider

# downloader middleware
class SitemapWithoutSchemeMiddleware(object):
    def process_response(self, request, response, spider):
        if isinstance(spider, SitemapSpider):
            body = self._get_sitemap_body(response)

            if body:
                scheme = urlparse.urlsplit(response.url).scheme
                body = re.sub(r'<loc>\/\/(.+)<\/loc>', r'<loc>%s://\1</loc>' % scheme, body)
                return response.replace(body=body)

        return response

    # this is taken from scrapy's Sitemap class, but the Sitemap
    # helper is only for internal use and its API can change
    # without notice
    def _get_sitemap_body(self, response):
        """Return the sitemap body contained in the given response, or None if the
        response is not a sitemap.
        """
        if isinstance(response, XmlResponse):
            return response.body
        elif is_gzipped(response):
            return gunzip(response.body)
        elif response.url.endswith('.xml'):
            return response.body
        elif response.url.endswith('.xml.gz'):
            return gunzip(response.body)
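
For the middleware to actually run, it still has to be registered in the project settings. A minimal sketch, assuming the class above is saved in a module such as myproject.middlewares (the module path and the priority value 543 are placeholders, not part of the original answer):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SitemapWithoutSchemeMiddleware': 543,
}

With the middleware enabled, every sitemap response is rewritten to use the same scheme as the request that fetched it, so SitemapSpider generates valid absolute URLs without any change to the spider code itself.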

This concludes the article on parsing URLs from a sitemap with an unusual URL format using the sitemap spider in Scrapy (Python). We hope the recommended answer helps, and thank you for your support!
