Example:

html = <a><b>Text</b>Text2</a>

BeautifulSoup code:
[x.extract() for x in html.findAll('b')]

The output is:
html = <a>Text2</a>

lxml code:
[bad.getparent().remove(bad) for bad in html.xpath(".//b")]

The output is:
html = <a></a>

because lxml treats "Text2" as the tail of <b></b>, so it is removed together with the element.
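
To make the tail behaviour concrete, here is a minimal sketch that parses the example with lxml.etree (an assumption; the question does not say which parser was used) and inspects where each piece of text is stored:

```python
from lxml import etree

root = etree.fromstring("<a><b>Text</b>Text2</a>")
b = root.find("b")

# "Text" is the text of <b>; "Text2" follows </b>, so lxml stores it as b.tail
print(b.text)     # Text
print(b.tail)     # Text2
print(root.text)  # None: nothing sits between <a> and <b>

# Removing <b> takes its tail along, so <a> ends up empty
root.remove(b)
serialized = etree.tostring(root)
print(serialized)  # b'<a/>'
```

With lxml.html the empty element would be serialized as <a></a> rather than <a/>, but the tail is lost in the same way.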

If we only need the line of text joined across the tags, we can use:
for bad in raw.xpath(xpath_search):
    bad.text = ''

But how can we remove the tag itself, without its tail, and leave the rest of the text unchanged?

Best answer

Edit:

Please see @Joshmaker's answer at https://stackoverflow.com/a/47946748/8055036; it is clearly the better one.
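
Without reproducing that answer here, one built-in that covers the question's case is etree.strip_elements with with_tail=False. This is a minimal sketch on the question's example, assuming lxml.etree; it is not necessarily the exact approach the linked answer takes:

```python
from lxml import etree

root = etree.fromstring("<a><b>Text</b>Text2</a>")

# Delete every <b> element (with its own text and children),
# but keep the tail text that follows each one
etree.strip_elements(root, "b", with_tail=False)

result = etree.tostring(root)
print(result)  # b'<a>Text2</a>'
```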

I did the following to preserve the tail text by moving it to the previous sibling or, if there is none, to the parent.

def remove_keeping_tail(self, element):
    """Save the tail text and then delete the element"""
    self._preserve_tail_before_delete(element)
    element.getparent().remove(element)

def _preserve_tail_before_delete(self, node):
    if node.tail: # preserve the tail
        previous = node.getprevious()
        if previous is not None: # if there is a previous sibling it will get the tail
            if previous.tail is None:
                previous.tail = node.tail
            else:
                previous.tail = previous.tail + node.tail
        else: # no previous sibling: the parent gets the tail as text
            parent = node.getparent()
            if parent.text is None:
                parent.text = node.tail
            else:
                parent.text = parent.text + node.tail
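
The methods above take self, so they presumably live in a class that is not shown. As a usage sketch, here is a standalone variant of the same logic (the function form and the sample markup are assumptions for illustration) applied with the XPath loop from the question:

```python
from lxml import etree

def remove_keeping_tail(element):
    """Remove element from its parent but keep its tail text in the tree."""
    tail = element.tail
    if tail:
        previous = element.getprevious()
        if previous is not None:
            # Append the tail to the previous sibling's tail
            previous.tail = (previous.tail or "") + tail
        else:
            # No previous sibling: append the tail to the parent's text
            parent = element.getparent()
            parent.text = (parent.text or "") + tail
    element.getparent().remove(element)

# Hypothetical sample: one <b> without and one <b> with a previous sibling
root = etree.fromstring("<a>Head<b>Text</b>Text2<i>x</i><b>Text3</b>Text4</a>")
for bad in root.xpath(".//b"):
    remove_keeping_tail(bad)

result = etree.tostring(root)
print(result)  # b'<a>HeadText2<i>x</i>Text4</a>'
```

The first <b> has no previous sibling, so "Text2" is appended to the text of <a>; the second one follows <i>, so "Text4" becomes the tail of <i>.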

HTH
