I have a simple XML document that I am trying to read with Python's DOM (see below):
XML file:
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
	<Header>
		<Reserved>2</Reserved>
		<CPU>1</CPU>
		<Flag>1</Flag>
		<VQI>12</VQI>
		<Group_ID>16</Group_ID>
		<DI>2</DI>
		<DE>1</DE>
		<ACOSS>5</ACOSS>
		<RGH>8</RGH>
	</Header>
</HeaderLookup>
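Python code:
from xml.dom import minidom

# Parse the file and keep only the root element.
with open("test.xml") as xml_file:
    xmlroot = minidom.parse(xml_file).documentElement

# childNodes holds every child node, including whitespace-only text nodes.
for item in xmlroot.getElementsByTagName("Header")[0].childNodes:
    print(item)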
Result:
<DOM Text node "u'\n\t\t'">
<DOM Element: Reserved at 0x28d2828>
<DOM Text node "u'\n\t\t'">
<DOM Element: CPU at 0x28d28c8>
<DOM Text node "u'\n\t\t'">
<DOM Element: Flag at 0x28d2968>
<DOM Text node "u'\n\t\t'">
<DOM Element: VQI at 0x28d2a08>
<DOM Text node "u'\n\t\t'">
<DOM Element: Group_ID at 0x28d2ad0>
<DOM Text node "u'\n\t\t'">
<DOM Element: DI at 0x28d2b70>
<DOM Text node "u'\n\t\t'">
<DOM Element: DE at 0x28d2c10>
<DOM Text node "u'\n\t\t'">
<DOM Element: ACOSS at 0x28d2cb0>
<DOM Text node "u'\n\t\t'">
<DOM Element: RGH at 0x28d2d50>
<DOM Text node "u'\n\t'">
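I expected 9 child nodes (Reserved, CPU, Flag, VQI, Group_ID, DI, DE, ACOSS and RGH), but for some reason it returns a list of 19 nodes, 10 of which are whitespace (why is whitespace treated as a node in the first place?!). Can anyone tell me whether there is a way to make the XML parser leave out the whitespace nodes?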
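Best answer
Whitespace is significant in XML, so the parser has to report it. Take a look at ElementTree, though, which offers a different API from the DOM for working with XML.
Example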
from xml.etree import ElementTree as et

data = '''\
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
'''

tree = et.fromstring(data)
# Iterating over an element yields only its child elements.
for n in tree.find('Header'):
    print('%s = %s' % (n.tag, n.text))
Output
Reserved = 2
CPU = 1
Flag = 1
VQI = 12
Group_ID = 16
DI = 2
DE = 1
ACOSS = 5
RGH = 8
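Since iterating over an element yields only child elements, the nine values can also be collected straight into a dictionary. The following is a minimal sketch rather than part of the original answer: the helper name header_fields is made up for illustration, and the int() conversion assumes every field holds an integer, as it does in this sample file.

```python
from xml.etree import ElementTree as ET

# Same document as in the question, inlined so the sketch is self-contained.
DATA = """\
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
"""


def header_fields(xml_text):
    """Map each child tag of <Header> to its value (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    header = root.find("Header")
    # Assumption: every field is an integer, as in the sample document.
    return {child.tag: int(child.text) for child in header}


fields = header_fields(DATA)
print(len(fields))    # number of fields found
print(fields["VQI"])  # value of one field
```

If some fields may be empty or non-numeric, dropping the int() call and keeping the raw strings would be the safer choice.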
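Example (extending the ElementTree code above)
The whitespace is still there, but it is held in the .tail attributes (and, for elements that contain other elements, in .text). tail is the text that follows an element (between its end tag and the start of the next element), while text is the text between an element's start tag and its first child or end tag.

def dump(e):
    print('<%s>' % e.tag)
    print('text = %r' % e.text)
    # Recurse into the child elements.
    for n in e:
        dump(n)
    print('</%s>' % e.tag)
    print('tail = %r' % e.tail)

dump(tree)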
Output
<HeaderLookup>
text = '\n    '
<Header>
text = '\n        '
<Reserved>
text = '2'
</Reserved>
tail = '\n        '
<CPU>
text = '1'
</CPU>
tail = '\n        '
<Flag>
text = '1'
</Flag>
tail = '\n        '
<VQI>
text = '12'
</VQI>
tail = '\n        '
<Group_ID>
text = '16'
</Group_ID>
tail = '\n        '
<DI>
text = '2'
</DI>
tail = '\n        '
<DE>
text = '1'
</DE>
tail = '\n        '
<ACOSS>
text = '5'
</ACOSS>
tail = '\n        '
<RGH>
text = '8'
</RGH>
tail = '\n    '
</Header>
tail = '\n'
</HeaderLookup>
tail = None
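If the code has to stay with minidom, one common workaround is to leave the parser alone and filter the child list by node type. The sketch below is an assumption about what would suit the question, not something from the original answer, and the helper name element_children is made up for illustration. Note that it skips every non-element node (comments and meaningful text included), not just whitespace.

```python
from xml.dom import minidom

# Same document as in the question, inlined so the sketch is self-contained.
DATA = """<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>"""


def element_children(node):
    """Return only the element children of a DOM node (hypothetical helper)."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


root = minidom.parseString(DATA).documentElement
header = root.getElementsByTagName("Header")[0]

# childNodes still contains the whitespace text nodes; the filter drops them.
print(len(header.childNodes))
for child in element_children(header):
    # firstChild is the text node inside each leaf element.
    print(child.tagName, "=", child.firstChild.data)
```

The parser still creates the whitespace nodes here; the filter only hides them from the loop, which is usually enough when the document has no mixed content.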