Targeting a specific domain extension with Scrapy LinkExtractor()

Problem description

I want to use Scrapy's LinkExtractor() to only follow links in the .th domain

I see there is a deny_extensions(list) parameter, but no allow_extensions() parameter.

Given that, how do I restrict links just to allow domains in .th ?

Recommended answer

deny_extensions filters out URLs whose path ends with a file extension such as .gz or .exe. It looks at file extensions, not at domain suffixes such as .th.
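To make the distinction concrete, here is a minimal standard-library sketch (not Scrapy code) that pulls the two different "extensions" out of a URL. The helper name describe and the example URLs are made up for illustration.

```python
from posixpath import splitext
from urllib.parse import urlparse


def describe(url):
    """Return (file extension of the path, last label of the host name)."""
    parsed = urlparse(url)
    # The kind of thing deny_extensions is about: the end of the URL path.
    extension = splitext(parsed.path)[1]
    # The kind of thing the question is about: the end of the host name.
    suffix = (parsed.hostname or "").rpartition(".")[2]
    return extension, suffix


print(describe("http://example.com/files/archive.tar.gz"))
print(describe("http://www.example.co.th/index.html"))
```

The first URL has a .gz file extension on a .com host, while the second has an .html file extension on a .th host.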

You are probably looking for allow_domains:

allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links

deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
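As a rough illustration of what "domain" filtering means, the sketch below does a suffix-style check in plain Python: a host is accepted if it equals one of the listed domains or is a subdomain of one. The function host_in_domains is hypothetical and is not Scrapy's implementation. Whether passing a bare suffix such as 'th' to allow_domains behaves this way in your Scrapy version is an assumption you should verify.

```python
from urllib.parse import urlparse


def host_in_domains(url, domains):
    """Suffix-style check: host equals a domain or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == d.lower() or host.endswith("." + d.lower())
        for d in domains
    )


print(host_in_domains("http://www.example.co.th/", ["th"]))
print(host_in_domains("http://www.example.com/", ["th"]))
```

Under this kind of matching, a list such as ["th"] would accept every host that ends in .th and reject everything else.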

Another option, mentioned in my comments, is to use a custom LinkExtractor. Below is an example of such a link extractor. It does the same thing as the standard link extractor, but additionally filters out links whose domain name does not match a Unix filename pattern (it uses the fnmatch module for this):

from urllib.parse import urlparse  # Python 3; the original answer used six.moves.urllib.parse
import fnmatch
import re

from scrapy.linkextractors import LinkExtractor

class DomainPatternLinkExtractor(LinkExtractor):

    def __init__(self, domain_pattern, *args, **kwargs):
        super(DomainPatternLinkExtractor, self).__init__(*args, **kwargs)

        # take a Unix file pattern string and translate
        # it to a regular expression to match domains against
        regex = fnmatch.translate(domain_pattern)
        self.reobj = re.compile(regex)

    def extract_links(self, response):
        return list(
            filter(
                lambda link: self.reobj.search(urlparse(link.url).netloc),
                super(DomainPatternLinkExtractor, self).extract_links(response)
            )
        )

In your case you could use it like this: DomainPatternLinkExtractor('*.th').
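To see what the '*.th' pattern does before wiring it into a spider, the matching step can be tried on its own with the standard library. This is only a sketch of the filter inside extract_links, and the URLs are invented examples.

```python
import fnmatch
import re
from urllib.parse import urlparse

# Translate the Unix filename pattern into a regular expression,
# the same way the custom link extractor does in __init__.
pattern = re.compile(fnmatch.translate("*.th"))

urls = [
    "http://www.example.co.th/page",   # host ends in .th
    "http://www.example.com/th",       # "th" appears only in the path
    "http://example.th.com/",          # ".th" is not at the end of the host
]

# Keep only the URLs whose network location matches the pattern.
matching = [u for u in urls if pattern.search(urlparse(u).netloc)]
print(matching)
```

Only the first URL should be kept, because the translated pattern is anchored at the end of the host name.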

Sample scrapy shell session using this link extractor:

$ scrapy shell http://www.dmoz.org/News/Weather/
2016-11-21 17:14:51 [scrapy] INFO: Scrapy 1.2.1 started (bot: issue2401)
(...)
2016-11-21 17:14:52 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/News/Weather/> (referer: None)

>>> from six.moves.urllib.parse import urlparse
>>> import fnmatch
>>> import re
>>>
>>> from scrapy.linkextractors import LinkExtractor
>>>
>>>
>>> class DomainPatternLinkExtractor(LinkExtractor):
...
...     def __init__(self, domain_pattern, *args, **kwargs):
...         super(DomainPatternLinkExtractor, self).__init__(*args, **kwargs)
...         regex = fnmatch.translate(domain_pattern)
...         self.reobj = re.compile(regex)
...     def extract_links(self, response):
...         return list(
...             filter(
...                 lambda link: self.reobj.search(urlparse(link.url).netloc),
...                 super(DomainPatternLinkExtractor, self).extract_links(response)
...             )
...         )
...
>>> from pprint import pprint


>>> pprint([l.url for l in DomainPatternLinkExtractor('*.co.uk').extract_links(response)])
['http://news.bbc.co.uk/weather/',
 'http://freemeteo.co.uk/',
 'http://www.weatheronline.co.uk/']


>>> pprint([l.url for l in DomainPatternLinkExtractor('*.gov*').extract_links(response)])
['http://www.metoffice.gov.uk/', 'http://www.weather.gov/']


>>> pprint([l.url for l in DomainPatternLinkExtractor('*.name').extract_links(response)])
['http://www.accuweather.name/']
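One caveat worth knowing: urlparse(...).netloc also contains any port (and user info), so a URL like http://www.example.co.th:8080/ would not match '*.th'. A possible variation, sketched here with a hypothetical helper host_matches, is to match against .hostname instead, which drops the port and lowercases the host.

```python
import fnmatch
import re
from urllib.parse import urlparse

pattern = re.compile(fnmatch.translate("*.th"))


def host_matches(url):
    """Match the pattern against the bare host name, ignoring any port."""
    host = urlparse(url).hostname or ""
    return pattern.match(host) is not None


url = "http://www.example.co.th:8080/"
print(pattern.search(urlparse(url).netloc) is not None)  # netloc-based check
print(host_matches(url))                                 # hostname-based check
```

If you adopt this, the lambda in extract_links would call urlparse(link.url).hostname rather than .netloc.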
