Problem description
I have the following piece of XML:
...<span class="st">In Tim <em>Power</em>: Politieman...</span>...
I want to extract the part between the <span> tags. For this I use XPath:
/span[@class="st"]
This, however, extracts everything, including the <span> tags themselves, and
/span[@class="st"]/text()
returns a list of two text nodes: one containing "In Tim " and the other ": Politieman...". The <em>..</em> element is not included and acts like a separator.
Is there a pure XPath solution which returns:
In Tim <em>Power</em>: Politieman...
EDIT: Thanks to @helderdarocha and @TextGeek. It seems non-trivial to extract the plain text, including the <em> tags, with XPath only.
The /span[@class="st"]/node() solution returns a list of the individual nodes, from which it is trivial to build a string in Python.
To get all child nodes, you can use:
/span[@class="st"]/node()
This will return:
- Two child text nodes
- The full
<em>
node (element and contents).
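As a rough illustration of turning those child nodes back into a single string, here is a minimal Python sketch. It makes several assumptions: it uses only the standard library's xml.etree.ElementTree, whose limited XPath subset does not support node() or text(), so the child nodes are walked by hand instead; the <root> wrapper and the helper name inner_xml are hypothetical and only there to make the example self-contained.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample fragment from the question, wrapped in a hypothetical <root> element.
doc = ET.fromstring(
    '<root><span class="st">In Tim <em>Power</em>: Politieman...</span></root>'
)

# ElementTree supports simple attribute predicates such as [@class='st'].
span = doc.find(".//span[@class='st']")


def inner_xml(element):
    """Serialize an element's child nodes without the element's own tags."""
    # Leading text node (escaped so that & and < stay well-formed).
    parts = [escape(element.text or "")]
    for child in element:
        # ET.tostring() also emits the child's tail, i.e. the text node
        # that follows the child inside the parent.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


result = inner_xml(span)
print(result)
```

With a full XPath 1.0 engine the same idea applies: evaluate node(), serialize each result, and join the pieces.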
If you actually want all the text() nodes, including the ones inside <em>, then get all the text() descendants:
/span[@class="st"]//text()
or
/span[@class="st"]/descendant::text()
This will return three text nodes, including the text inside <em>, but not the <em> element itself.
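For comparison, the descendant text nodes can be collected in Python as well. This is only a sketch under the same assumptions as before: the standard library's ElementTree cannot evaluate //text() directly, so itertext() is used as a rough equivalent, and the <root> wrapper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample fragment from the question, wrapped in a hypothetical <root> element.
doc = ET.fromstring(
    '<root><span class="st">In Tim <em>Power</em>: Politieman...</span></root>'
)

span = doc.find(".//span[@class='st']")

# itertext() walks the element in document order and yields every
# descendant text node, much like descendant::text() would.
text_nodes = list(span.itertext())

# Joining the nodes gives the plain text with the <em> markup dropped.
plain_text = "".join(text_nodes)

print(text_nodes)
print(plain_text)
```

Note that this loses the markup, so it answers a different need than the node() approach.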