Extending CSS selectors in BeautifulSoup

Question

BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it accepts only numerical values; arguments like even or odd are not allowed.

Is it possible to extend BeautifulSoup CSS selectors, or to let it use lxml.cssselect internally as the underlying CSS selection mechanism?

Let's take a look at an example problem/use case. Locate only the even rows in the following HTML:

<table>
    <tr>
        <td>1</td>
    </tr>
    <tr>
        <td>2</td>
    </tr>
    <tr>
        <td>3</td>
    </tr>
    <tr>
        <td>4</td>
    </tr>
</table>

With lxml.html and lxml.cssselect, this is easy to do via :nth-of-type(even):

from lxml.html import fromstring
from lxml.cssselect import CSSSelector

tree = fromstring(data)

sel = CSSSelector('tr:nth-of-type(even)')

print([e.text_content().strip() for e in sel(tree)])

However, in BeautifulSoup:

print(soup.select("tr:nth-of-type(even)"))

raises an error.

Note that we can work around it using find_all():

print([row.get_text(strip=True) for index, row in enumerate(soup.find_all("tr"), start=1) if index % 2 == 0])
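
The same workaround can also be written with slicing instead of an index test. The following is only a minimal sketch: the helper name `rows_by_parity` is hypothetical (it is not part of BeautifulSoup), and plain strings stand in for the `<tr>` tags so the snippet runs on its own.

```python
def rows_by_parity(rows, parity):
    """Return the even- or odd-numbered items of `rows`, counting from 1
    the way CSS :nth-of-type() does."""
    if parity == "even":
        return list(rows[1::2])   # items 2, 4, 6, ...
    if parity == "odd":
        return list(rows[0::2])   # items 1, 3, 5, ...
    raise ValueError('parity must be "even" or "odd"')


# Plain strings stand in for the <tr> tags of the example table.
rows = ["1", "2", "3", "4"]
even_rows = rows_by_parity(rows, "even")
odd_rows = rows_by_parity(rows, "odd")
print(even_rows)
print(odd_rows)
```

With BeautifulSoup, the list passed in would be the result of `soup.find_all("tr")`. Like the one-liner above, this counts rows across the whole document rather than per parent element, so it only matches `:nth-of-type` semantics when all rows share a single parent.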


Recommended answer

After checking the source code, it seems that BeautifulSoup does not provide any convenient point in its interface to extend or monkey patch its existing functionality in this regard. Using functionality from lxml is not possible either, since BeautifulSoup only uses lxml during parsing and builds its own objects from the parsing results. The lxml objects are not preserved and cannot be accessed later.

That being said, with enough determination and with the flexibility and introspection capabilities of Python, anything is possible. You can even modify the internals of a BeautifulSoup method at run-time:

import inspect
import re
import textwrap

import bs4.element


def replace_code_lines(source, start_token, end_token,
                       replacement, escape_tokens=True):
    """Replace the source code between `start_token` and `end_token`
    in `source` with `replacement`. The `start_token` portion is included
    in the replaced code. If `escape_tokens` is True (default),
    escape the tokens to avoid them being treated as a regular expression."""

    if escape_tokens:
        start_token = re.escape(start_token)
        end_token = re.escape(end_token)

    def replace_with_indent(match):
        indent = match.group(1)
        return textwrap.indent(replacement, indent)

    return re.sub(r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(start_token, end_token),
                  replace_with_indent, source, flags=re.MULTILINE)


# Get the source code of the Tag.select() method
src = textwrap.dedent(inspect.getsource(bs4.element.Tag.select))

# Replace the relevant part of the method
start_token = "if pseudo_type == 'nth-of-type':"
end_token = "else"
replacement = """\
if pseudo_type == 'nth-of-type':
    try:
        if pseudo_value in ("even", "odd"):
            pass
        else:
            pseudo_value = int(pseudo_value)
    except:
        raise NotImplementedError(
            'Only numeric values, "even" and "odd" are currently '
            'supported for the nth-of-type pseudo-class.')
    if isinstance(pseudo_value, int) and pseudo_value < 1:
        raise ValueError(
            'nth-of-type pseudo-class value must be at least 1.')
    class Counter(object):
        def __init__(self, destination):
            self.count = 0
            self.destination = destination

        def nth_child_of_type(self, tag):
            self.count += 1
            if pseudo_value == "even":
                return not bool(self.count % 2)
            elif pseudo_value == "odd":
                return bool(self.count % 2)
            elif self.count == self.destination:
                return True
            elif self.count > self.destination:
                # Stop the generator that's sending us
                # these things.
                raise StopIteration()
            return False
    checker = Counter(pseudo_value).nth_child_of_type
"""
new_src = replace_code_lines(src, start_token, end_token, replacement)

# Compile it and execute it in the target module's namespace
exec(new_src, bs4.element.__dict__)
# Monkey patch the target method
bs4.element.Tag.select = bs4.element.select

This is the portion of code being modified.
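
To see the rewrite-and-exec technique in isolation, here is a minimal sketch that applies the same regular expression to a toy function instead of `Tag.select()`. The names `toy_src`, `classify`, and `new_branch` are hypothetical stand-ins, and the source is given as a string rather than fetched with `inspect.getsource()` so the snippet has no dependency on bs4.

```python
import re
import textwrap


def replace_code_lines(source, start_token, end_token, replacement):
    """Same idea as the helper above: swap the block that starts at
    `start_token` and ends just before `end_token` (same indentation)."""
    pattern = r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(
        re.escape(start_token), re.escape(end_token))

    def replace_with_indent(match):
        # Re-indent the replacement to the depth of the original block.
        return textwrap.indent(replacement, match.group(1))

    return re.sub(pattern, replace_with_indent, source, flags=re.MULTILINE)


# Hypothetical stand-in for the source text of Tag.select().
toy_src = '''\
def classify(n):
    if n == 0:
        return "zero"
    else:
        return "other"
'''

new_branch = '''\
if n == 0:
    return "nothing"
'''

patched_src = replace_code_lines(toy_src, "if n == 0:", "else", new_branch)

# Execute the rewritten definition in a scratch namespace.
namespace = {}
exec(patched_src, namespace)
classify = namespace["classify"]
print(classify(0))
print(classify(5))
```

The patch depends entirely on the target source containing the exact `start_token` text. If it does not, `re.sub` matches nothing and the original code is executed unchanged, which is one reason the approach is fragile across library versions.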

Of course, this is anything but elegant and reliable. I don't envision this being seriously used anywhere, ever.
