Python: reading XML with related child elements

Problem description

I have an XML file with this structure:

    <?DOMParser ?>
    <logbook:LogBook xmlns:logbook="http://www/logbook/1.0" version="1.2">
      <product>
        <serialNumber value="764000606"/>
      </product>
      <visits>
        <visit>
          <general>
            <startDateTime>2014-01-10T12:22:39.166Z</startDateTime>
            <endDateTime>2014-03-11T13:51:31.480Z</endDateTime>
          </general>
          <parts>
            <part number="03081" name="WSSA" index="0016"/>
          </parts>
        </visit>
        <visit>
          <general>
            <startDateTime>2013-01-10T12:22:39.166Z</startDateTime>
            <endDateTime>2013-03-11T13:51:31.480Z</endDateTime>
          </general>
          <parts>
            <part number="02081" name="PSSF" index="0017"/>
          </parts>
        </visit>
      </visits>
    </logbook:LogBook>

I want to get two outputs from this XML:

1- Visits, including the serial number. For this I wrote:

    import pandas as pd
    import xml.etree.ElementTree as ET

    tree = ET.parse(filename)
    root = tree.getroot()

    visits = pd.DataFrame()
    for general in root.iter('general'):
        for child in root.iter('serialNumber'):
            visits = visits.append({'startDateTime': general.find('startDateTime').text,
                                    'endDateTime': general.find('endDateTime').text,
                                    'serialNumber': child.attrib['value']},
                                   ignore_index=True)

The output of this code is the following dataframe:

    serialNumber | startDateTime          | endDateTime            |
    -------------|------------------------|------------------------|
    764000606    |2014-01-10T12:22:39.166Z|2014-03-11T13:51:31.480Z|
    764000606    |2013-01-10T12:22:39.166Z|2013-03-11T13:51:31.480Z|

2- Parts. For parts, I want the following output, where visits are distinguished from each other by startDateTime and each row shows a part together with the visit it belongs to:

    serialNumber | startDateTime|number|name|index|
    -------------|--------------|------|----|-----|

What I wrote for parts:

    parts = pd.DataFrame()
    for part in root.iter('part'):
        for child in root.iter('serialNumber'):
            parts = parts.append({'index': part.attrib['index'],
                                  'znumber': part.attrib['number'],
                                  'name': part.attrib['name'],
                                  'serialNumber': child.attrib['value'],
                                  'startDateTime': general.find('startDateTime').text},
                                 ignore_index=True)

This is what I get from this code:

    index |name|serialNumber| startDateTime          |znumber|
    ------|----|------------|------------------------|-------|
    0016  |WSSA| 764000606  |2013-01-10T12:22:39.166Z| 03081 |
    0017  |PSSF| 764000606  |2013-01-10T12:22:39.166Z| 02081 |

While I want this (look at startDateTime):

    index |name|serialNumber| startDateTime          |znumber|
    ------|----|------------|------------------------|-------|
    0016  |WSSA| 764000606  |2014-01-10T12:22:39.166Z| 03081 |
    0017  |PSSF| 764000606  |2013-01-10T12:22:39.166Z| 02081 |

Any ideas? I am using XML ElementTree.

Recommended answer

Here's an example that gets the data from the xml.

code.py:

    #!/usr/bin/env python3

    import sys
    import xml.etree.ElementTree as ET
    from pprint import pprint as pp


    file_name = "a.xml"


    def get_product_sn(product_node):
        # Return the "value" attribute of the first serialNumber child, if any.
        for product_node_child in list(product_node):
            if product_node_child.tag == "serialNumber":
                return product_node_child.attrib.get("value", None)
        return None


    def get_parts_data(parts_node):
        # Build one dictionary per child (part) node of a parts node.
        ret = list()
        for parts_node_child in list(parts_node):
            attrs = parts_node_child.attrib
            ret.append({"number": attrs.get("number", None),
                        "name": attrs.get("name", None),
                        "index": attrs.get("index", None)})
        return ret


    def get_visit_node_data(visit_node):
        # Collect the dates and the parts that belong to a single visit node.
        ret = dict()
        for visit_node_child in list(visit_node):
            if visit_node_child.tag == "general":
                for general_node_child in list(visit_node_child):
                    if general_node_child.tag == "startDateTime":
                        ret["startDateTime"] = general_node_child.text
                    elif general_node_child.tag == "endDateTime":
                        ret["endDateTime"] = general_node_child.text
            elif visit_node_child.tag == "parts":
                ret["parts"] = get_parts_data(visit_node_child)
        return ret


    def get_node_data(node):
        # Expects a node that has product and visits children.
        ret = {"visits": list()}
        for node_child in list(node):
            if node_child.tag == "product":
                ret["serialNumber"] = get_product_sn(node_child)
            elif node_child.tag == "visits":
                for visits_node_child in list(node_child):
                    ret["visits"].append(get_visit_node_data(visits_node_child))
        return ret


    def main():
        tree = ET.parse(file_name)
        root_node = tree.getroot()
        data = get_node_data(root_node)
        pp(data)


    if __name__ == "__main__":
        print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
        main()

Notes:

- It treats the xml in a tree-like manner, so it maps (if you will) onto the xml. If the xml structure changes, the code should be adapted as well.
- It's designed to be general: get_node_data can be called on any node that has 2 children, product and visits. In our case that is the root node itself, but in the real world there could be a sequence of such nodes, each with the 2 children listed above.
- It's designed to be error-friendly: if the xml is incomplete, it will get as much data as it can. I chose this (greedy) approach over one that simply throws an exception when it encounters an error.
- As I didn't work with pandas, instead of populating a DataFrame I simply return a Python dictionary (json); I think converting it to a DataFrame shouldn't be hard.
- I've run it with Python 2.7 and Python 3.5.

The output is a dictionary containing 2 keys (indented for readability):

- serialNumber: the serial number (obviously)
- visits: a list of dictionaries, each containing data from a visit node (since the result is a dictionary, I had to place this data "under" a key)

Output:

    (py_064_03.05.04_test0) e:\Work\Dev\StackOverflow\q045049761>"e:\Work\Dev\VEnvs\py_064_03.05.04_test0\Scripts\python.exe" code.py
    Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

    {'serialNumber': '764000606',
     'visits': [{'endDateTime': '2014-03-11T13:51:31.480Z',
                 'parts': [{'index': '0016', 'name': 'WSSA', 'number': '03081'}],
                 'startDateTime': '2014-01-10T12:22:39.166Z'},
                {'endDateTime': '2013-03-11T13:51:31.480Z',
                 'parts': [{'index': '0017', 'name': 'PSSF', 'number': '02081'}],
                 'startDateTime': '2013-01-10T12:22:39.166Z'}]}

@EDIT0: added handling of multiple part nodes, as requested in one of the comments. That functionality has been moved to get_parts_data. Now each entry in the visits list has a parts key whose value is a list of dictionaries, one extracted from each part node (the provided xml has only one part per visit).
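The answer leaves the conversion to a DataFrame open, so here is a minimal sketch of one way to get the two flat tables the question asks for. It is not part of the original answer. It assumes the XML layout shown in the question (the sample is embedded as a string, with the leading processing instruction left out), and the names XML_TEXT and build_rows are made up for this illustration. The idea is to walk the tree one visit at a time, so that every part is paired with the startDateTime of its own visit instead of a variable left over from an earlier loop.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question (processing instruction omitted).
XML_TEXT = """<logbook:LogBook xmlns:logbook="http://www/logbook/1.0" version="1.2">
  <product>
    <serialNumber value="764000606"/>
  </product>
  <visits>
    <visit>
      <general>
        <startDateTime>2014-01-10T12:22:39.166Z</startDateTime>
        <endDateTime>2014-03-11T13:51:31.480Z</endDateTime>
      </general>
      <parts>
        <part number="03081" name="WSSA" index="0016"/>
      </parts>
    </visit>
    <visit>
      <general>
        <startDateTime>2013-01-10T12:22:39.166Z</startDateTime>
        <endDateTime>2013-03-11T13:51:31.480Z</endDateTime>
      </general>
      <parts>
        <part number="02081" name="PSSF" index="0017"/>
      </parts>
    </visit>
  </visits>
</logbook:LogBook>"""


def build_rows(root):
    """Hypothetical helper: return (visit_rows, part_rows) as lists of flat dicts."""
    serial_node = root.find("product/serialNumber")
    serial = serial_node.get("value") if serial_node is not None else None

    visit_rows = []
    part_rows = []
    # Iterate visit by visit so each part stays tied to its own visit.
    for visit in root.iter("visit"):
        start = visit.findtext("general/startDateTime")
        end = visit.findtext("general/endDateTime")
        visit_rows.append({"serialNumber": serial,
                           "startDateTime": start,
                           "endDateTime": end})
        # A visit may hold any number of part nodes, including none.
        for part in visit.iterfind("parts/part"):
            part_rows.append({"serialNumber": serial,
                              "startDateTime": start,
                              "number": part.get("number"),
                              "name": part.get("name"),
                              "index": part.get("index")})
    return visit_rows, part_rows


root = ET.fromstring(XML_TEXT)
visit_rows, part_rows = build_rows(root)
for row in part_rows:
    print(row)
```

If pandas is available, pd.DataFrame(visit_rows) and pd.DataFrame(part_rows) should turn the two lists into the frames from the question. Collecting rows in a list first also avoids DataFrame.append, which was removed in pandas 2.0.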