Problem description
I am relatively new to Python and SO. I have an XML file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having trouble getting the right output. Here is my code:
from xml.etree import ElementTree as etree  # fromstring() and parse() live in xml.etree.ElementTree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to read from a file instead of a string, so I tried this code:
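from xml.etree import ElementTree as etree  # fromstring() and parse() live in xml.etree.ElementTree
node = etree.parse('test3.xml')  # returns an ElementTree wrapping the root element
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)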
Now my result is "None None". I suspect that either I am not reading the file correctly or something is wrong with the output. Here are the contents of test3.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>
</response>
Recommended answer
Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
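As a minimal sketch of the qualified search, assuming Python 3 and the standard library's xml.etree.ElementTree: the inline XML below is a shortened stand-in for test3.xml (a closing tag is added and the unused xmlns declarations are dropped), and the names EOL_NS and xml_text are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns="..." attribute on <response>.
EOL_NS = 'http://www.eol.org/transfer/content/0.3'

# Shortened stand-in for test3.xml (assumption: the real file is closed the same way).
xml_text = (
    '<response xmlns="http://www.eol.org/transfer/content/0.3">'
    '<identifier>5e1882d822ec530069d6d29e28944369</identifier>'
    '<description>This is a paragraph about a shark.</description>'
    '</response>'
)

# For a file on disk, ET.parse('test3.xml').getroot() would give the same kind of element.
root = ET.fromstring(xml_text)

# The bare tag name does not match, because every element lives in the default namespace.
unqualified = root.findtext('identifier')

# The {namespace-uri}tag form matches the elements as ElementTree stores them.
qualified = root.findtext('{%s}identifier' % EOL_NS)
description = root.findtext('{%s}description' % EOL_NS)

print(unqualified, qualified, description)
```

The unqualified lookup reproduces the "None None" symptom from the question, while the qualified lookups return the text.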
You could also give the .find(), .findall() and .iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
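The prefix lookup described above can be sketched as follows, assuming Python 3 and the standard library's xml.etree.ElementTree. The inline XML is a shortened stand-in for test3.xml, and the prefix "anything" is deliberately arbitrary to show that only the dictionary passed in matters.

```python
import xml.etree.ElementTree as ET

EOL_URI = 'http://www.eol.org/transfer/content/0.3'

# Shortened stand-in for test3.xml; note that the document itself declares no "eol" prefix.
xml_text = (
    '<response xmlns="http://www.eol.org/transfer/content/0.3">'
    '<identifier>5e1882d822ec530069d6d29e28944369</identifier>'
    '</response>'
)
root = ET.fromstring(xml_text)

# Two dictionaries that map different, made-up prefixes to the same namespace URI.
via_eol = root.find('eol:identifier', namespaces={'eol': EOL_URI})
via_other = root.find('anything:identifier', namespaces={'anything': EOL_URI})

# The fully qualified form that both prefixed searches are expanded to.
via_clark = root.find('{%s}identifier' % EOL_URI)

for element in (via_eol, via_other, via_clark):
    print(element.tag, element.text)
```

All three searches find the same element, whose tag is stored in the {namespace-uri}identifier form.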
If you can switch to the lxml library, things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
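If lxml is not an option, a rough standard-library approximation is to record the namespace declarations while parsing. This sketch does not use lxml's .nsmap; it swaps in xml.etree.ElementTree.iterparse with "start-ns" events instead, assuming Python 3. The inline bytes are a shortened stand-in for test3.xml, and the label 'default' for the unprefixed namespace is an arbitrary choice made here.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for test3.xml with the default namespace and one prefixed one.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<response xmlns="http://www.eol.org/transfer/content/0.3"'
    b' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<identifier>5e1882d822ec530069d6d29e28944369</identifier>'
    b'</response>'
)

namespaces = {}
root = None
# "start-ns" events carry a (prefix, uri) tuple; "start" events carry the element.
for event, payload in ET.iterparse(io.BytesIO(xml_bytes), events=('start', 'start-ns')):
    if event == 'start-ns':
        prefix, uri = payload
        # The default namespace arrives with an empty prefix; give it a usable name.
        namespaces[prefix or 'default'] = uri
    elif root is None:
        root = payload  # the first "start" event is the root element

identifier = root.findtext('default:identifier', namespaces=namespaces)
print(namespaces)
print(identifier)
```

For a file on disk, passing the filename to iterparse in place of the BytesIO object should work the same way.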