Extracting the text between tags, including markup, using XPath

Problem description

I have the following piece of XML:

...<span class="st">In Tim <em>Power</em>: Politieman...</span>...

I want to extract the part between the <span> tags. For this I use XPath:

/span[@class="st"]

This, however, extracts everything, including the <span> and </span> tags themselves.

/span[@class="st"]/text()

returns a list of two text nodes: one containing "In Tim " and the other ": Politieman...". The <em>...</em> element is not included and acts like a separator.

Is there a pure XPath solution which returns:

In Tim <em>Power</em>: Politieman...

EDIT: Thanks to @helderdarocha and @TextGeek. It seems non-trivial to extract the plain text, including the <em> markup, with XPath alone.

The /span[@class="st"]/node() solution produces a list of the individual nodes, from which it is trivial to build a string in Python.
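
As a rough illustration of that joining step, the sketch below rebuilds the inner markup as one string using only Python's standard library. Some assumptions: the `<div>` wrapper and the `inner_xml` helper are hypothetical (the original snippet elides its surroundings), and because `xml.etree.ElementTree` supports only a limited XPath subset without `node()`, the child nodes are walked by hand rather than selected with the XPath above.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document; the real surroundings are elided in the question.
DOC = '<div><span class="st">In Tim <em>Power</em>: Politieman...</span></div>'


def inner_xml(elem):
    """Return the text and markup between elem's start and end tags."""
    # Text that precedes the first child element (may be None).
    parts = [elem.text or ""]
    for child in elem:
        # tostring() serializes the child element followed by its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


root = ET.fromstring(DOC)
span = root.find(".//span[@class='st']")
result = inner_xml(span)
print(result)
```

With a library that implements full XPath, the same idea would apply to the list returned by the `node()` query: serialize element nodes, keep text nodes as they are, and concatenate.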

Solution

To get any child node you can use:

/span[@class="st"]/node()

This will return:

Two child text nodes
The full <em> node (element and contents).
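
To make that node list concrete, here is a minimal sketch that approximates the `node()` child axis with the standard library. It assumes a hypothetical `<div>` wrapper and a hypothetical `child_nodes` helper. ElementTree stores text on `.text` and `.tail` instead of as separate nodes, and its default parser drops comments, so this is only an approximation of what an XPath engine returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document; the real surroundings are elided in the question.
DOC = '<div><span class="st">In Tim <em>Power</em>: Politieman...</span></div>'


def child_nodes(elem):
    """List elem's children in document order: str for text, Element for tags."""
    nodes = []
    if elem.text:
        nodes.append(elem.text)       # text before the first child element
    for child in elem:
        nodes.append(child)           # the element itself, with its contents
        if child.tail:
            nodes.append(child.tail)  # text that follows the child element
    return nodes


span = ET.fromstring(DOC).find(".//span[@class='st']")
nodes = child_nodes(span)
# Describe each node so the mixed list is easy to read when printed.
kinds = [n if isinstance(n, str) else f"<{n.tag}> element" for n in nodes]
print(kinds)
```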

If you actually want all the text() nodes, including the ones inside em, then get all the text() descendants:

/span[@class="st"]//text()

or

/span[@class="st"]/descendant::text()

This will return three text nodes, including the text inside <em>, but not the <em> element itself.
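
A comparable result can be sketched with Python's standard library, assuming a hypothetical `<div>` wrapper around the snippet. `Element.itertext()` walks all descendant text in document order, which plays the role of `descendant::text()` here, since ElementTree's XPath subset does not support the `text()` node test.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document; the real surroundings are elided in the question.
DOC = '<div><span class="st">In Tim <em>Power</em>: Politieman...</span></div>'

span = ET.fromstring(DOC).find(".//span[@class='st']")

# All descendant text pieces in document order; markup is dropped.
text_nodes = list(span.itertext())
# Concatenating them gives the plain text without the <em> tags.
plain = "".join(text_nodes)

print(text_nodes)
print(plain)
```

Note that this yields the plain text only. If the `<em>` markup must survive, the nodes have to be serialized instead of reduced to their text.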
