In my spider, I only want to select <p> elements that have text content:

item['Description'] = response.xpath('//*[@id="textepresentation"]//p[string(.)]').extract()

It works, but unfortunately it also returns <p> elements that contain nothing but a non-breaking space:
u'<p>\xa0</p>',

How can I use XPath to avoid selecting <p> elements that contain only a non-breaking space?

Best answer

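You can use XPath's normalize-space() string function with a couple of predicates:

  • [normalize-space()] keeps only elements whose string value is non-empty once leading and trailing whitespace has been stripped
  • [not(contains(normalize-space(), "\u00a0"))] is needed because normalize-space() does not strip NO-BREAK SPACE (see this other answer where I checked which characters are stripped; you may want to add other characters to test)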

Example:
    >>> import scrapy
    >>> selector = scrapy.Selector(text=u'''
    ... <html>
    ...     <p>&nbsp;</p>
    ...     <p>something</p>
    ...     <p>  </p>
    ...     <p><a href="http://example.com">some link</a></p>
    ... </html>
    ... ''')
    >>> selector.xpath(u'''
    ...     //p[normalize-space()]
    ...        [not(contains(normalize-space(), "\u00a0"))]
    ... ''').extract()
    [u'<p>something</p>', u'<p><a href="http://example.com">some link</a></p>']
    >>>
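
To see why the second predicate is needed, here is a minimal pure-Python sketch that mimics XPath 1.0's normalize-space(), which treats only space, tab, line feed and carriage return as whitespace. The helper name xpath_normalize_space is hypothetical (it is not part of Scrapy or lxml), and the sketch assumes Python 3; it only illustrates the behavior rather than reproducing any library's implementation.

```python
import re

# XPath 1.0 whitespace is limited to these four characters:
# space, tab, line feed, carriage return.
_XPATH_WS = re.compile(r"[\x20\x09\x0a\x0d]+")


def xpath_normalize_space(text):
    """Hypothetical helper mimicking XPath 1.0 normalize-space()."""
    # Collapse runs of XPath whitespace to one space, then trim the ends.
    # strip(" ") is explicit on purpose: a bare str.strip() in Python 3
    # would also remove U+00A0, which XPath does not do.
    return _XPATH_WS.sub(" ", text).strip(" ")


# String values of the four sample <p> elements used above.
samples = ["\u00a0", "something", "  ", "some link"]
for s in samples:
    normalized = xpath_normalize_space(s)
    # Same logic as the two predicates: non-empty, and no NO-BREAK SPACE.
    keep = bool(normalized) and "\u00a0" not in normalized
    print(repr(s), "->", repr(normalized), "keep" if keep else "drop")
```

The paragraph holding only U+00A0 survives normalization unchanged, so [normalize-space()] alone still matches it; the contains() check is what filters it out.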
    

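Edit:

Following @Kimmy's answer, here is an alternative with a single predicate that also handles other whitespace characters:
  • take the whitespace characters that normalize-space() does not handle
  • pass them to an XPath translate() call that maps each of them to ' '
  • let normalize-space() collapse the result and trim leading and trailing spaces

It looks like this: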
    >>> chars = '''
    ... #CHARACTER TABULATION
    ... #LINE FEED
    ... #LINE TABULATION
    ... #FORM FEED
    ... #CARRIAGE RETURN
    ... #SPACE
    ... #NEXT LINE
    ... NO-BREAK SPACE
    ... OGHAM SPACE MARK
    ... MONGOLIAN VOWEL SEPARATOR
    ... EN QUAD
    ... EM QUAD
    ... EN SPACE
    ... EM SPACE
    ... THREE-PER-EM SPACE
    ... FOUR-PER-EM SPACE
    ... SIX-PER-EM SPACE
    ... FIGURE SPACE
    ... PUNCTUATION SPACE
    ... THIN SPACE
    ... HAIR SPACE
    ... ZERO WIDTH SPACE
    ... ZERO WIDTH NON-JOINER
    ... ZERO WIDTH JOINER
    ... LINE SEPARATOR
    ... PARAGRAPH SEPARATOR
    ... NARROW NO-BREAK SPACE
    ... MEDIUM MATHEMATICAL SPACE
    ... WORD JOINER
    ... IDEOGRAPHIC SPACE
    ... ZERO WIDTH NO-BREAK SPACE
    ... '''
    >>> import unicodedata
    >>> wsp = [unicodedata.lookup(c)
    ...        for c in chars.splitlines()
    ...        if c.strip() and not c.startswith('#')]
    >>>
    >>> # somehow NEXT LINE (U+0085) does not work with unicodedata
    ... wsp.append(u'\u0085')
    >>>
    >>> selector.xpath(u'''
    ...     //p[normalize-space(translate(., "%(in)s", "%(out)s"))]
    ...     ''' % {'in': ''.join(wsp),
    ...            'out': ' '*len(wsp)
    ...     }).extract()
    [u'<p>something</p>', u'<p><a href="http://example.com">some link</a></p>']
    >>>
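
As a variation, the same expression can be built without looking characters up by name. The sketch below assumes Python 3 and writes the code points as string escapes; EXTRA_WHITESPACE and non_blank_xpath are hypothetical names introduced for illustration, and the character set is simply the one listed above plus NEXT LINE (U+0085).

```python
# Characters that XPath's normalize-space() leaves untouched: the same set
# as the name list above, plus NEXT LINE (U+0085), written as escapes.
EXTRA_WHITESPACE = (
    "\u0085\u00a0\u1680\u180e"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u200b\u200c\u200d\u2028\u2029\u202f\u205f\u2060\u3000\ufeff"
)


def non_blank_xpath(tag="p"):
    """Hypothetical helper: XPath selecting <tag> elements with visible text."""
    # translate() maps every extra whitespace character to a plain space,
    # then normalize-space() collapses and trims the result; an empty
    # string makes the predicate false.
    return '//%s[normalize-space(translate(., "%s", "%s"))]' % (
        tag,
        EXTRA_WHITESPACE,
        " " * len(EXTRA_WHITESPACE),
    )


xpath = non_blank_xpath()
```

The resulting string can be passed to selector.xpath(...) in place of the formatted expression above. None of the listed characters is a double quote, so embedding them in a double-quoted XPath string literal is safe.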
    

Source: "Scrapy : Select tag with non-breaking space with xpath" on Stack Overflow: https://stackoverflow.com/questions/35364069/
