Question
Given an element like this:
<A>
hello
<annotation> NOT part of text </annotation>
world
</A>
How can I get just the child text nodes (like XPath text()), using ElementTree?
Both iter() and itertext() are tree walkers, which include all descendant nodes. There is no immediate-child iterator that I'm aware of. Plus, iter() only finds elements anyway (it is, after all, ElementTree), so it can't be used to collect text nodes as such.
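To make the behaviour described above concrete, here is a minimal sketch using a compact, single-line version of the sample element (the variable names are illustrative only). It shows that itertext() also returns the text inside the annotation child, and that iter() yields elements rather than text nodes.

```python
import xml.etree.ElementTree as ET

# Compact version of the sample element from the question.
xml = "<A>hello<annotation> NOT part of text </annotation>world</A>"
root = ET.fromstring(xml)

# itertext() walks the whole subtree, so the annotation's text is included.
all_text = list(root.itertext())

# iter() yields elements only: the root itself plus every descendant element.
tags = [el.tag for el in root.iter()]

print(all_text)
print(tags)
```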
I understand that there's a library called lxml which provides better XPath support, but I'm asking here before adding another dependency. (Plus, I'm very new to Python, so I might be missing something obvious.)
Recommended answer
Somewhat counter-intuitively, the text in your example is spread across three attributes:
- A.text for hello
- annotation.text for the annotation
- annotation.tail for world
(whitespace omitted). This is somewhat cumbersome. However, something along these lines should help:
import xml.etree.ElementTree as et

xml = """
<A>
hello
<annotation> NOT part of text </annotation>
world
</A>"""

doc = et.fromstring(xml)

def all_texts(root):
    # Text before the first child element is stored in root.text.
    if root.text is not None:
        yield root.text
    # Text following each child element is stored in that child's tail.
    for child in root:
        if child.tail is not None:
            yield child.tail

print(list(all_texts(doc)))
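The generator above yields the raw text chunks, so the surrounding newlines from the sample document come along with them. If that is not wanted, a variant along the following lines might help. This is only a sketch: the name child_texts is hypothetical, and it assumes that whitespace-only chunks should be dropped and the rest stripped.

```python
import xml.etree.ElementTree as ET

def child_texts(element):
    """Yield the non-blank direct text chunks of element, stripped.

    Hypothetical helper: it looks only at element.text and the tail of
    each direct child, so text inside child elements is never included.
    """
    chunks = [element.text] + [child.tail for child in element]
    for chunk in chunks:
        # Skip missing (None) and whitespace-only chunks.
        if chunk and chunk.strip():
            yield chunk.strip()

doc = ET.fromstring("""
<A>
hello
<annotation> NOT part of text </annotation>
world
</A>""")

result = list(child_texts(doc))
print(result)  # ['hello', 'world']
```

Stripping is a design choice rather than a requirement: in mixed-content documents where whitespace is significant, the raw chunks may be the better thing to keep.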