This article describes how to create CSV files from XML/JSON using Python pandas. It should be a useful reference for anyone solving a similar problem; read on to learn how.

Problem description

I am trying to parse an XML into multiple different files.


Sample XML

<integration-outbound:IntegrationEntity
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <integrationEntityHeader>
        <integrationTrackingNumber>281#963-4c1d-9d26-877ba40a4b4b#1583507840354</integrationTrackingNumber>
        <referenceCodeForEntity>25428</referenceCodeForEntity>
        <attachments>
            <attachment>
                <id>d6esd1d518b06019e01</id>
                <name>durance.pdf</name>
                <size>0</size>
            </attachment>
            <attachment>
                <id>182e60164ddd4236b5bd96109</id>
                <name>ssds</name>
                <size>0</size>
            </attachment>
        </attachments>
        <source>SIM</source>
        <entity>SUPPLIER</entity>
        <action>CREATE</action>
        <timestampUTC>20200306T151721</timestampUTC>
        <zDocBaseVersion>2.0</zDocBaseVersion>
        <zDocCustomVersion>0</zDocCustomVersion>
    </integrationEntityHeader>
    <integrationEntityDetails>
        <supplier>
            <requestId>2614352</requestId>
            <controlBlock>
                <dataProcessingInfo>
                    <key>MODE</key>
                    <value>Onboarding</value>
                </dataProcessingInfo>
                <dataProcessingInfo>
                    <key>Supplier_Type</key>
                    <value>Operational</value>
                </dataProcessingInfo>
            </controlBlock>
            <id>1647059</id>
            <facilityCode>0001</facilityCode>
            <systemCode>1</systemCode>
            <supplierType>Operational</supplierType>
            <systemFacilityDetails>
                <systemFacilityDetail>
                    <facilityCode>0001</facilityCode>
                    <systemCode>1</systemCode>
                    <FacilityStatus>ACTIVE</FacilityStatus>
                </systemFacilityDetail>
            </systemFacilityDetails>
            <status>ACTIVE</status>
            <companyDetails>
                <displayGSID>254232128</displayGSID>
                <legalCompanyName>asdasdsads</legalCompanyName>
                <dunsNumber>03-175-2493</dunsNumber>
                <legalStructure>1</legalStructure>
                <website>www.aaadistributor.com</website>
                <noEmp>25</noEmp>
                <companyIndicator1099>No</companyIndicator1099>
                <taxidAndWxformRequired>NO</taxidAndWxformRequired>
                <taxidFormat>Fed. Tax</taxidFormat>
                <wxForm>182e601649ade4c38cd4236b5bd96109</wxForm>
                <taxid>27-2204474</taxid>
                <companyTypeFix>SUPPLIER</companyTypeFix>
                <fields>
                    <field>
                        <id>LOW_CUURENT_SERV</id>
                        <value>1</value>
                    </field>
                    <field>
                        <id>LOW_COI</id>
                        <value>USA</value>
                    </field>
                    <field>
                        <id>LOW_STATE_INCO</id>
                        <value>US-PA</value>
                    </field>
                    <field>
                        <id>CERT_INSURANCE</id>
                        <value>d6e6e460fe8958564c1d518b06019e01</value>
                    </field>
                    <field>
                        <id>COMP_DBA</id>
                        <value>asdadas</value>
                    </field>
                    <field>
                        <id>LOW_AREUDIVE</id>
                        <value>N</value>
                    </field>
                    <field>
                        <id>LOW_BU_SIZE1</id>
                        <value>SMLBUS</value>
                    </field>
                    <field>
                        <id>EDI_CAP</id>
                        <value>Y</value>
                    </field>
                    <field>
                        <id>EDI_WEB</id>
                        <value>N</value>
                    </field>
                    <field>
                        <id>EDI_TRAD</id>
                        <value>N</value>
                    </field>
                </fields>
            </companyDetails>
            <allLocations>
                <location>
                    <addressInternalid>1704342</addressInternalid>
                    <isDelete>false</isDelete>
                    <internalSupplierid>1647059</internalSupplierid>
                    <acctGrpid>HQ</acctGrpid>
                    <address1>2501 GRANT AVE</address1>
                    <country>USA</country>
                    <state>US-PA</state>
                    <city>PHILADELPHIA</city>
                    <zip>19114</zip>
                    <phone>(215) 745-7900</phone>
                </location>
            </allLocations>
            <contactDetails>
                <contactDetail>
                    <contactInternalid>12232</contactInternalid>
                    <isDelete>false</isDelete>
                    <addressInternalid>1704312142</addressInternalid>
                    <contactType>Main</contactType>
                    <firstName>Raf</firstName>
                    <lastName>jas</lastName>
                    <title>Admin</title>
                    <email>abcd@gmail.com</email>
                    <phoneNo>123-42-23-23</phoneNo>
                    <createPortalLogin>yes</createPortalLogin>
                    <allowedPortalSideProducts>SIM,iSource,iContract</allowedPortalSideProducts>
                </contactDetail>
                <contactDetail>
                    <contactInternalid>1944938</contactInternalid>
                    <isDelete>false</isDelete>
                    <addressInternalid>1704342</addressInternalid>
                    <contactType>Rad</contactType>
                    <firstName>AVs</firstName>
                    <lastName>asd</lastName>
                    <title>Founder</title>
                    <email>as@sds.com</email>
                    <phoneNo>21521-2112-7900</phoneNo>
                    <createPortalLogin>yes</createPortalLogin>
                    <allowedPortalSideProducts>SIM,iContract,iSource</allowedPortalSideProducts>
                </contactDetail>
            </contactDetails>
            <myLocation>
                <addresses>
                    <myLocationsInternalid>1704342</myLocationsInternalid>
                    <isDelete>false</isDelete>
                    <addressInternalid>1704342</addressInternalid>
                    <usedAt>N</usedAt>
                </addresses>
            </myLocation>
            <bankDetails>
                <fields>
                    <field>
                        <id>LOW_BANK_KEY</id>
                        <value>123213</value>
                    </field>
                    <field>
                        <id>LOW_EFT</id>
                        <value>123123</value>
                    </field>
                </fields>
            </bankDetails>
            <forms>
                <form>
                    <id>CATEGORY_PRODSER</id>
                    <records>
                        <record>
                            <Internalid>24348</Internalid>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>CATEGOR_LEVEL_1</id>
                                    <value>MR</value>
                                </field>
                                <field>
                                    <id>LOW_PRODSERV</id>
                                    <value>RES</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_2</id>
                                    <value>keylevel221</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_3</id>
                                    <value>keylevel3127</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_4</id>
                                    <value>keylevel4434</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_5</id>
                                    <value>keylevel5545</value>
                                </field>
                            </fields>
                        </record>
                        <record>
                            <Internalid>24349</Internalid>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>CATEGOR_LEVEL_1</id>
                                    <value>MR</value>
                                </field>
                                <field>
                                    <id>LOW_PRODSERV</id>
                                    <value>RES</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_2</id>
                                    <value>keylevel221</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_3</id>
                                    <value>keylevel3125</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_4</id>
                                    <value>keylevel4268</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_5</id>
                                    <value>keylevel5418</value>
                                </field>
                            </fields>
                        </record>
                        <record>
                            <Internalid>24350</Internalid>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>CATEGOR_LEVEL_1</id>
                                    <value>MR</value>
                                </field>
                                <field>
                                    <id>LOW_PRODSERV</id>
                                    <value>RES</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_2</id>
                                    <value>keylevel221</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_3</id>
                                    <value>keylevel3122</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_4</id>
                                    <value>keylevel425</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_5</id>
                                    <value>keylevel5221</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
                <form>
                    <id>OTHER_INFOR</id>
                    <records>
                        <record>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>S_EAST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>W_EST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>M_WEST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>N_EAST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>LOW_AREYOU_ASSET</id>
                                    <value>-1</value>
                                </field>
                                <field>
                                    <id>LOW_SWART_PROG</id>
                                    <value>-1</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
                <form>
                    <id>ABDCEDF</id>
                    <records>
                        <record>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>LOW_COD_CONDUCT</id>
                                    <value>-1</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
                <form>
                    <id>CODDUC</id>
                    <records>
                        <record>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>LOW_SUPPLIER_TYPE</id>
                                    <value>2</value>
                                </field>
                                <field>
                                    <id>LOW_DO_INT_BOTH</id>
                                    <value>1</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
            </forms>
        </supplier>
    </integrationEntityDetails>
</integration-outbound:IntegrationEntity>

The goal is to put a generic XML-to-CSV conversion in place. Based on the input config file, the XML should be flattened, exploded into multiple CSVs and stored.


The input is the XML above and the config CSV file below. Three CSV files need to be created, with the corresponding XPATHs listed in the config file.

XPATH,ColumName,CSV_File_Name,ParentKey
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/id,id,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/name,name,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/size,size,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/source,source,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/entity,entity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/action,action,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/timestampUTC,timestampUTC,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocBaseVersion,zDocBaseVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocCustomVersion,zDocCustomVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/key,key,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/value,value,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/id,id,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/Internalid,Internalid,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/isDelete,FormId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/id,SupplierFormRecordFieldId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/value,SupplierFormRecordFieldValue,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsForms.csv,Y

I need to create 3 CSV file outputs from it.

The design is to pick each CSV file, take its XPATHs, select the corresponding values from the XML and fetch them.
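A quick sketch of the first step of that design, grouping the configured XPATHs by the CSV file they belong to (the file name config.csv is an assumption here):

import csv
from collections import defaultdict

# Group the configured XPATHs by the CSV file they should end up in.
xpaths_per_csv = defaultdict(list)
with open('config.csv', newline='') as cfg:
    for row in csv.DictReader(cfg):
        xpaths_per_csv[row['CSV_File_Name']].append((row['XPATH'], row['ColumName']))

for csv_name, columns in xpaths_per_csv.items():
    print(csv_name, '->', len(columns), 'columns')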


Step 1 - Convert the XML to JSON:

import json
import xmltodict

with open("/home/s0998hws/test.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())
    xml_file.close()

    # generate the object using json.dumps()
    # corresponding to the json data

json_data = json.dumps(data_dict)

# Write the json data to the output
# json file
with open("data.json", "w") as json_file:
    json_file.write(json_data)
    json_file.close()

with open('data.json') as f:
    d = json.load(f)

Step 2 - Normalize using the pandas json_normalize function. The XPATH is used to build the columns to fetch from the JSON: / is converted to . and [] acts as the explode delimiter. For example, the code takes /integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber, converts it to integrationEntityHeader.integrationTrackingNumber, and explodes at the first [] it encounters, and so on.

df_1=pd.json_normalize(data=d['integration-outbound:IntegrationEntity'])
df_2=df_1[['integrationEntityHeader.integrationTrackingNumber','integrationEntityDetails.supplier.requestId','integrationEntityHeader.referenceCodeForEntity','integrationEntityDetails.supplier.id','integrationEntityDetails.supplier.forms.form']]
df_3=df_2.explode('integrationEntityDetails.supplier.forms.form')
df_3['integrationEntityDetails.supplier.forms.form.id']=df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('id'))
df_3['integrationEntityDetails.supplier.forms.form.records']=df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('records'))
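For reference, the XPATH-to-column conversion described above can be written as a small helper (a sketch only; split_path is an illustrative name, not part of the final script):

# Sketch: turn a config XPATH into the dotted pandas column name plus the
# dotted prefixes that have to be exploded, using the '[]' convention from
# the config file.
def split_path(xpath):
    dotted = xpath.strip('/').replace('/', '.')
    column = dotted.replace('[]', '')            # full dotted column name
    parts = dotted.split('[]')[:-1]              # one entry per [] marker
    explode_prefixes = [''.join(parts[:i + 1]) for i in range(len(parts))]
    return column, explode_prefixes

col, levels = split_path('/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/Internalid')
# col    -> '...supplier.forms.form.records.record.Internalid'
# levels -> ['...supplier.forms.form', '...supplier.forms.form.records.record']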

I was trying to use the metadata from the CSV file and fetch the values that way, but the challenge is that

df_3['integrationEntityDetails.supplier.forms.form.records.record.Internalid']=df_3['integrationEntityDetails.supplier.forms.form.records.record'].apply(lambda x: x.get('Internalid'))

fails with the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2327, in pandas._libs.lib.map_infer
  File "<stdin>", line 1, in <lambda>
AttributeError: 'list' object has no attribute 'get'

The reason is that the pandas DataFrame cell holds a list whenever the XML element repeats (an array), and a list cannot be fetched with the .get() approach above.
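One way around that error (a sketch of a workaround, not the final solution given further below) is to force every cell into a list before exploding, so that each exploded cell is a plain dict and .get() works on every row:

# 'record' can be a single dict or a list of dicts depending on how many
# <record> elements the XML had; wrap single dicts in a one-element list so
# explode() treats every row the same way.
records_col = 'integrationEntityDetails.supplier.forms.form.records'
df_3['record'] = df_3[records_col].apply(lambda x: x.get('record'))
df_3['record'] = df_3['record'].apply(lambda x: x if isinstance(x, list) else [x])
df_4 = df_3.explode('record')
df_4['Internalid'] = df_4['record'].apply(
    lambda x: x.get('Internalid') if isinstance(x, dict) else None)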

Below is the output generated:

integrationEntityHeader.integrationTrackingNumber   integrationEntityDetails.supplier.requestId integrationEntityHeader.referenceCodeForEntity  integrationEntityDetails.supplier.id    integrationEntityDetails.supplier.forms.form    integrationEntityDetails.supplier.forms.form.id integrationEntityDetails.supplier.forms.form.records
0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'CATEGORY_PRODSER', 'records': {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}}    CATEGORY_PRODSER    {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}
0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'OTHER_INFOR', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}}   OTHER_INFOR {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}
0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'CORPORATESUSTAINABILITY', 'records': {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}}  CORPORATESUSTAINABILITY {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}
0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'PRODUCTSERVICES', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}}  PRODUCTSERVICES {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}

Expected output
integrationEntityDetailsForms.csv


integrationTrackingNumber requestId referenceCodeForEntity supplier.id integrationEntityDetails.supplier.forms.form.id InternalId isDelete SupplierFormRecordFieldId SupplierFormRecordFieldValue
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_PRODSERV RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_3 keylevel3127
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_4 keylevel4434
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_5 keylevel5545
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_PRODSERV RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_3 keylevel3122
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_4 keylevel425
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_5 keylevel5221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE S_EAST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE W_EST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE M_WEST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE N_EAST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE LOW_AREYOU_ASSET -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CORPORATESUSTAINABILITY FALSE LOW_SWART_PROG -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CORPORATESUSTAINABILITY FALSE LOW_COD_CONDUCT -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 PRODUCTSERVICES FALSE LOW_SUPPLIER_TYPE 2
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 PRODUCTSERVICES FALSE LOW_DO_INT_BOTH 1


Solution

The XML is converted to a dict and the parsing logic is written against that, because the same logic can then be reused for JSON. Stack Overflow was amazingly helpful and the solution was built from the responses in the linked threads. For simplicity I created a three-level nested XML. This works on Python 3.

<?xml version="1.0"?><Company><Employee><FirstName>Hal</FirstName><LastName>Thanos</LastName><ContactNo>122131</ContactNo><Email>hal.thanos@xyz.com</Email><Addresses><Address><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Iron</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>iron.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Captain</FirstName><LastName>America</LastName><ContactNo>13322</ContactNo><Email>captain.america@xyz.com</Email><Addresses><Address><City>Trivandrum</City><State>Kerala</State><Zip>28115</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Sword</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>sword.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Thor</FirstName><LastName>Odison</LastName><ContactNo>156565</ContactNo><Email>thor.odison@xyz.com</Email><Addresses><Address><City>Tirunelveli</City><State>TamilNadu</State><Zip>36595</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Spider</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>spider.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Black</FirstName><LastName>Widow</LastName><ContactNo>16767</ContactNo><Email>black.widow@xyz.com</Email><Addresses><Address><City>Mysore</City><State>Karnataka</State><Zip>12478</Zip><forms><form><id>ID1</id><value>LIC</value></form></forms></Address></Addresses></Employee><Employee><FirstName>White</FirstName><LastName>Man</LastName><ContactNo>5634</ContactNo><Email>white.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
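Because the parsing logic only ever sees a dict, the same routine would also work if the input were JSON instead of XML; a minimal illustration of the equivalence:

import json
import xmltodict

xml_snippet = "<Company><Employee><FirstName>Hal</FirstName></Employee></Company>"
json_snippet = '{"Company": {"Employee": {"FirstName": "Hal"}}}'

from_xml = xmltodict.parse(xml_snippet)    # OrderedDict with the same shape
from_json = json.loads(json_snippet)       # plain dict
print(json.dumps(from_xml) == json.dumps(from_json))   # True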

In the config file for this XML, every element that can repeat (i.e. should be exploded into multiple rows) must be marked with []. The header row is required, since the code refers to the column names. A short standalone sketch of how the [] markers are interpreted follows the config listing below.


Change these variables to match where you store the files:
process_config_csv = 'config.csv'
xml_file_name = 'test.xml'

XPATH,ColumName,CSV_File_Name 
/Company/Employee[]/FirstName,FirstName,Name.csv
/Company/Employee[]/LastName,LastName,Name.csv
/Company/Employee[]/ContactNo,ContactNo,Name.csv
/Company/Employee[]/Email,Email,Name.csv
/Company/Employee[]/FirstName,FirstName,Address.csv
/Company/Employee[]/LastName,LastName,Address.csv
/Company/Employee[]/ContactNo,ContactNo,Address.csv
/Company/Employee[]/Email,Email,Address.csv
/Company/Employee[]/Addresses/Address[]/City,City,Address.csv
/Company/Employee[]/Addresses/Address[]/State,State,Address.csv
/Company/Employee[]/Addresses/Address[]/Zip,Zip,Address.csv
/Company/Employee[]/Addresses/Address[]/type,type,Address.csv
/Company/Employee[]/FirstName,FirstName,Form.csv
/Company/Employee[]/LastName,LastName,Form.csv
/Company/Employee[]/ContactNo,ContactNo,Form.csv
/Company/Employee[]/Email,Email,Form.csv
/Company/Employee[]/Addresses/Address[]/type,type,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/id,id,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/value,value,Form.csv
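As promised above, here is how those [] markers translate into per-level explode prefixes; this is the same defaultdict logic used inside xml_parse_loop below, shown standalone on three of the config rows:

from collections import defaultdict

xpaths = [
    "/Company/Employee[]/Email",
    "/Company/Employee[]/Addresses/Address[]/type",
    "/Company/Employee[]/Addresses/Address[]/forms/form[]/id",
]

explode_ld = defaultdict(set)
for xp in xpaths:
    parts = xp.strip('/').replace('/', '.').split('[]')
    if len(parts) > 1:
        # level = number of [] markers; value = dotted prefix up to the last []
        explode_ld[len(parts) - 1].add(''.join(parts[:-1]))

print({k: sorted(v) for k, v in explode_ld.items()})
# {1: ['Company.Employee'],
#  2: ['Company.Employee.Addresses.Address'],
#  3: ['Company.Employee.Addresses.Address.forms.form']}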

The code that creates multiple CSVs based on the config file is:

import json
import xmltodict
import os
import csv
import sys
from collections import defaultdict
import numpy as np
import pandas as pd

def getMatches(L1, L2):
    R = set()
    for elm in L1:
        for pat in L2:
            if elm.find(pat) != -1:
                if elm.find('.', len(pat)+1) != -1:
                    R.add(elm[:elm.find('.', len(pat)+1)])
                else:
                    R.add(elm)
    return list(R)

def xml_parse(xml_file_name):
    try:
        process_xml_file = xml_file_name
        with open(process_xml_file) as xml_file:
            for xml_string in xml_file:
                """Converting the xml to Dict"""
                data_dict = xmltodict.parse(xml_string)
                """Converting the dict to Pandas DF"""
                df_processing = pd.json_normalize(data_dict)
                xml_parse_loop(df_processing)
            xml_file.close()
    except Exception as e:
        s = str(e)
        print(s)

def xml_parse_loop(df_processing_input):
    CSV_File_Name = []
    """Getting the list of csv files to be created"""
    with open(process_config_csv, newline='') as csvfile:
        DataCaptured = csv.DictReader(csvfile)
        for row in DataCaptured:
            if row['CSV_File_Name'] not in CSV_File_Name:
                CSV_File_Name.append(row['CSV_File_Name'])
    """Iterating the list of CSVs"""
    for items in CSV_File_Name:
        df_processing = df_processing_input
        df_subset_process = []
        df_subset_list_all_cols = []
        df_process_sub_explode_Level = []
        df_final_column_name = []
        print('Parsing the xml file for creating the file - ' + str(items))
        """Fetching the field list to process from the config file"""
        with open(process_config_csv, newline='') as csvfile:
            DataCaptured = csv.DictReader(csvfile)
            for row in DataCaptured:
                if row['CSV_File_Name'] in items:
                    df_final_column_name.append(row['ColumName'])
                    """Getting the columns until the first [] """
                    df_subset_process.append(row['XPATH'].strip('/').replace("/", ".").split('[]')[0])
                    """Getting all the column names"""
                    df_subset_list_all_cols.append(row['XPATH'].strip('/').replace("/", ".").replace("[]", ""))
                    """Getting all the columns to explode"""
                    df_process_sub_explode_Level.append(row['XPATH'].strip('/').replace('/', '.').split('[]'))
        explode_ld = defaultdict(set)
        """Mapping explode level to column names"""
        for x in df_process_sub_explode_Level:
            if len(x) > 1:
                explode_ld[len(x) - 1].add(''.join(x[: -1]))
        explode_ld = {k: list(v) for k, v in explode_ld.items()}
        #print('The all-column list for the file ' + items + " is " + str(df_subset_list_all_cols))
        #print('The first processing for the file ' + items + " is " + str(df_subset_process))
        #print('The explode level of attributes for the file ' + items + " is " + str(explode_ld))
        """Remove column duplicates"""
        df_subset_process = list(dict.fromkeys(df_subset_process))
        for col in df_subset_process:
            if col not in df_processing.columns:
                df_processing[col] = np.nan
        df_processing = df_processing[df_subset_process]
        df_processing_col_list = df_processing.columns.tolist()
        print('The total levels to be exploded : %d' % len(explode_ld))
        i = 0
        level = len(explode_ld)
        for i in range(level):
            print(' Exploding the Level : %d' % i)
            df_processing_col_list = df_processing.columns.tolist()
            list_of_explode = set(df_processing_col_list) & set(explode_ld[i + 1])
            #print('List to explode ' + str(list_of_explode))
            """If found in the explode list, explode; some xml need not have a list there, it could be a plain column - handling both"""
            for c in list_of_explode:
                print(' There are columns present which need to be exploded - ' + str(c))
                df_processing = pd.concat((df_processing.iloc[[type(item) == list for item in df_processing[c]]].explode(c), df_processing.iloc[[type(item) != list for item in df_processing[c]]]))
            print(' Finding the columns that need to be fetched ')
            """From the overall column list, fetching the attributes needed at this explode level"""
            next_level_pro_lst = getMatches(df_subset_list_all_cols, explode_ld[i + 1])
            #print(next_level_pro_lst)
            df_processing_col_list = df_processing.columns.tolist()
            for nex in next_level_pro_lst:
                #print("Fetching " + nex.rsplit('.', 1)[1] + ' from ' + nex.rsplit('.', 1)[0] + ' from ' + nex)
                parent_col = nex.rsplit('.', 1)[0]
                child_col = nex.rsplit('.', 1)[1]
                #print(parent_col)
                #print(df_processing_col_list)
                if parent_col not in df_processing_col_list:
                    df_processing[nex.rsplit('.', 1)[0]] = ""
                try:
                    df_processing[nex] = df_processing[parent_col].apply(lambda x: x.get(child_col))
                except AttributeError:
                    df_processing[nex] = ""
                df_processing_col_list = df_processing.columns.tolist()
            if i == level - 1:
                print('Last level, nothing to be done')
            else:
                """Extracting all columns until the next explode column list is found"""
                while len(set(df_processing_col_list) & set(explode_ld[i + 2])) == 0:
                    next_level_pro_lst = getMatches(df_subset_list_all_cols, next_level_pro_lst)
                    #print(next_level_pro_lst)
                    for nextval in next_level_pro_lst:
                        if nextval not in df_processing_col_list:
                            #print("Fetching " + nextval.rsplit('.', 1)[1] + ' from ' + nextval.rsplit('.', 1)[0] + ' from ' + nextval)
                            if nextval.rsplit('.', 1)[0] not in df_processing.columns:
                                df_processing[nextval.rsplit('.', 1)[0]] = ""
                            try:
                                df_processing[nextval] = df_processing[nextval.rsplit('.', 1)[0]].apply(lambda x: x.get(nextval.rsplit('.', 1)[1]))
                            except AttributeError:
                                df_processing[nextval] = ""

                    df_processing_col_list = df_processing.columns.tolist()

        df_processing = df_processing[df_subset_list_all_cols]
        df_processing.columns = df_final_column_name
        # if the file does not exist, write the header
        if not os.path.isfile(items):
            print("The file does not exist, so writing a new one")
            df_processing.to_csv('{}'.format(items), header='column_names', index=None)
        else:  # it exists, so append without writing the header
            print("The file already exists, so appending")
            df_processing.to_csv('{}'.format(items), mode='a', header=False, index=None)


from datetime import datetime

startTime = datetime.now().strftime("%Y%m%d_%H%M%S")
startTime = str(os.getpid()) + "_" + startTime
process_task_name = ''
process_config_csv = 'config.csv'
xml_file_name = 'test.xml'
old_print = print

def timestamped_print(*args, **kwargs):
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    printheader = now + " xml_parser " + " " + process_task_name + " - "
    old_print(printheader, *args, **kwargs)

print = timestamped_print

xml_parse(xml_file_name)
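The trickiest part above is the pd.concat/explode split inside xml_parse_loop: a repeated XML element comes out of xmltodict as a list, while a single occurrence comes out as a dict, and only the list rows need exploding. A tiny standalone illustration with made-up data:

import pandas as pd

# 'form' holds a list when the element repeats and a dict when it occurs once
df = pd.DataFrame({'form': [[{'id': 'ID1'}, {'id': 'ID2'}],   # repeated element
                            {'id': 'ID3'}]})                  # single element

is_list = [type(item) == list for item in df['form']]
df2 = pd.concat((df.iloc[is_list].explode('form'),             # unpack the lists
                 df.iloc[[not f for f in is_list]]))           # keep dicts as-is

df2['form.id'] = df2['form'].apply(lambda x: x.get('id'))
print(df2['form.id'].tolist())   # ['ID1', 'ID2', 'ID3']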

The output files created are:

$ cat Name.csv
FirstName,LastName,ContactNo,Email
Hal,Thanos,122131,hal.thanos@xyz.com
Iron,Man,12324,iron.man@xyz.com
Captain,America,13322,captain.america@xyz.com
Sword,Man,12324,sword.man@xyz.com
Thor,Odison,156565,thor.odison@xyz.com
Spider,Man,12324,spider.man@xyz.com
Black,Widow,16767,black.widow@xyz.com
White,Man,5634,white.man@xyz.com
$ cat Address.csv
FirstName,LastName,ContactNo,Email,City,State,Zip,type
Iron,Man,12324,iron.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Iron,Man,12324,iron.man@xyz.com,Concord,NC,28027,Temporary
Hal,Thanos,122131,hal.thanos@xyz.com,Bangalore,Karnataka,560212,
Sword,Man,12324,sword.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Sword,Man,12324,sword.man@xyz.com,Concord,NC,28027,Temporary
Captain,America,13322,captain.america@xyz.com,Trivandrum,Kerala,28115,
Spider,Man,12324,spider.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Spider,Man,12324,spider.man@xyz.com,Concord,NC,28027,Temporary
Thor,Odison,156565,thor.odison@xyz.com,Tirunelveli,TamilNadu,36595,
White,Man,5634,white.man@xyz.com,Bangalore,Karnataka,560212,Permanent
White,Man,5634,white.man@xyz.com,Concord,NC,28027,Temporary
Black,Widow,16767,black.widow@xyz.com,Mysore,Karnataka,12478,
$ cat Form.csv
FirstName,LastName,ContactNo,Email,type,id,value
Iron,Man,12324,iron.man@xyz.com,Temporary,ID1,LIC
Iron,Man,12324,iron.man@xyz.com,Temporary,ID2,PAS
Iron,Man,12324,iron.man@xyz.com,Temporary,ID3,SSN
Iron,Man,12324,iron.man@xyz.com,Temporary,ID2,CC
Hal,Thanos,122131,hal.thanos@xyz.com,,ID1,LIC
Hal,Thanos,122131,hal.thanos@xyz.com,,ID2,PAS
Iron,Man,12324,iron.man@xyz.com,Permanent,ID3,LIC
Sword,Man,12324,sword.man@xyz.com,Temporary,ID1,LIC
Sword,Man,12324,sword.man@xyz.com,Temporary,ID2,PAS
Sword,Man,12324,sword.man@xyz.com,Temporary,ID3,SSN
Sword,Man,12324,sword.man@xyz.com,Temporary,ID2,CC
Captain,America,13322,captain.america@xyz.com,,ID1,LIC
Captain,America,13322,captain.america@xyz.com,,ID2,PAS
Sword,Man,12324,sword.man@xyz.com,Permanent,ID3,LIC
Spider,Man,12324,spider.man@xyz.com,Temporary,ID1,LIC
Spider,Man,12324,spider.man@xyz.com,Temporary,ID2,PAS
Spider,Man,12324,spider.man@xyz.com,Temporary,ID3,SSN
Spider,Man,12324,spider.man@xyz.com,Temporary,ID2,CC
Thor,Odison,156565,thor.odison@xyz.com,,ID1,LIC
Thor,Odison,156565,thor.odison@xyz.com,,ID2,PAS
Spider,Man,12324,spider.man@xyz.com,Permanent,ID3,LIC
White,Man,5634,white.man@xyz.com,Temporary,ID1,LIC
White,Man,5634,white.man@xyz.com,Temporary,ID2,PAS
White,Man,5634,white.man@xyz.com,Temporary,ID3,SSN
White,Man,5634,white.man@xyz.com,Temporary,ID2,CC
White,Man,5634,white.man@xyz.com,Permanent,ID3,LIC
Black,Widow,16767,black.widow@xyz.com,,ID1,LIC

The pieces of this answer were extracted from different threads; thanks to @Mark Tolonen, @Mandy007 and @deadshot.





This can definitely be made shorter and more performant, and it can be enhanced further.


I am trying to parse to an xml into multiple different Files -

Sample XML

<integration-outbound:IntegrationEntity
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <integrationEntityHeader>
        <integrationTrackingNumber>281#963-4c1d-9d26-877ba40a4b4b#1583507840354</integrationTrackingNumber>
        <referenceCodeForEntity>25428</referenceCodeForEntity>
        <attachments>
            <attachment>
                <id>d6esd1d518b06019e01</id>
                <name>durance.pdf</name>
                <size>0</size>
            </attachment>
            <attachment>
                <id>182e60164ddd4236b5bd96109</id>
                <name>ssds</name>
                <size>0</size>
            </attachment>
        </attachments>
        <source>SIM</source>
        <entity>SUPPLIER</entity>
        <action>CREATE</action>
        <timestampUTC>20200306T151721</timestampUTC>
        <zDocBaseVersion>2.0</zDocBaseVersion>
        <zDocCustomVersion>0</zDocCustomVersion>
    </integrationEntityHeader>
    <integrationEntityDetails>
        <supplier>
            <requestId>2614352</requestId>
            <controlBlock>
                <dataProcessingInfo>
                    <key>MODE</key>
                    <value>Onboarding</value>
                </dataProcessingInfo>
                <dataProcessingInfo>
                    <key>Supplier_Type</key>
                    <value>Operational</value>
                </dataProcessingInfo>
            </controlBlock>
            <id>1647059</id>
            <facilityCode>0001</facilityCode>
            <systemCode>1</systemCode>
            <supplierType>Operational</supplierType>
            <systemFacilityDetails>
                <systemFacilityDetail>
                    <facilityCode>0001</facilityCode>
                    <systemCode>1</systemCode>
                    <FacilityStatus>ACTIVE</FacilityStatus>
                </systemFacilityDetail>
            </systemFacilityDetails>
            <status>ACTIVE</status>
            <companyDetails>
                <displayGSID>254232128</displayGSID>
                <legalCompanyName>asdasdsads</legalCompanyName>
                <dunsNumber>03-175-2493</dunsNumber>
                <legalStructure>1</legalStructure>
                <website>www.aaadistributor.com</website>
                <noEmp>25</noEmp>
                <companyIndicator1099>No</companyIndicator1099>
                <taxidAndWxformRequired>NO</taxidAndWxformRequired>
                <taxidFormat>Fed. Tax</taxidFormat>
                <wxForm>182e601649ade4c38cd4236b5bd96109</wxForm>
                <taxid>27-2204474</taxid>
                <companyTypeFix>SUPPLIER</companyTypeFix>
                <fields>
                    <field>
                        <id>LOW_CUURENT_SERV</id>
                        <value>1</value>
                    </field>
                    <field>
                        <id>LOW_COI</id>
                        <value>USA</value>
                    </field>
                    <field>
                        <id>LOW_STATE_INCO</id>
                        <value>US-PA</value>
                    </field>
                    <field>
                        <id>CERT_INSURANCE</id>
                        <value>d6e6e460fe8958564c1d518b06019e01</value>
                    </field>
                    <field>
                        <id>COMP_DBA</id>
                        <value>asdadas</value>
                    </field>
                    <field>
                        <id>LOW_AREUDIVE</id>
                        <value>N</value>
                    </field>
                    <field>
                        <id>LOW_BU_SIZE1</id>
                        <value>SMLBUS</value>
                    </field>
                    <field>
                        <id>EDI_CAP</id>
                        <value>Y</value>
                    </field>
                    <field>
                        <id>EDI_WEB</id>
                        <value>N</value>
                    </field>
                    <field>
                        <id>EDI_TRAD</id>
                        <value>N</value>
                    </field>
                </fields>
            </companyDetails>
            <allLocations>
                <location>
                    <addressInternalid>1704342</addressInternalid>
                    <isDelete>false</isDelete>
                    <internalSupplierid>1647059</internalSupplierid>
                    <acctGrpid>HQ</acctGrpid>
                    <address1>2501 GRANT AVE</address1>
                    <country>USA</country>
                    <state>US-PA</state>
                    <city>PHILADELPHIA</city>
                    <zip>19114</zip>
                    <phone>(215) 745-7900</phone>
                </location>
            </allLocations>
            <contactDetails>
                <contactDetail>
                    <contactInternalid>12232</contactInternalid>
                    <isDelete>false</isDelete>
                    <addressInternalid>1704312142</addressInternalid>
                    <contactType>Main</contactType>
                    <firstName>Raf</firstName>
                    <lastName>jas</lastName>
                    <title>Admin</title>
                    <email>abcd@gmail.com</email>
                    <phoneNo>123-42-23-23</phoneNo>
                    <createPortalLogin>yes</createPortalLogin>
                    <allowedPortalSideProducts>SIM,iSource,iContract</allowedPortalSideProducts>
                </contactDetail>
                <contactDetail>
                    <contactInternalid>1944938</contactInternalid>
                    <isDelete>false</isDelete>
                    <addressInternalid>1704342</addressInternalid>
                    <contactType>Rad</contactType>
                    <firstName>AVs</firstName>
                    <lastName>asd</lastName>
                    <title>Founder</title>
                    <email>as@sds.com</email>
                    <phoneNo>21521-2112-7900</phoneNo>
                    <createPortalLogin>yes</createPortalLogin>
                    <allowedPortalSideProducts>SIM,iContract,iSource</allowedPortalSideProducts>
                </contactDetail>
            </contactDetails>
            <myLocation>
                <addresses>
                    <myLocationsInternalid>1704342</myLocationsInternalid>
                    <isDelete>false</isDelete>
                    <addressInternalid>1704342</addressInternalid>
                    <usedAt>N</usedAt>
                </addresses>
            </myLocation>
            <bankDetails>
                <fields>
                    <field>
                        <id>LOW_BANK_KEY</id>
                        <value>123213</value>
                    </field>
                    <field>
                        <id>LOW_EFT</id>
                        <value>123123</value>
                    </field>
                </fields>
            </bankDetails>
            <forms>
                <form>
                    <id>CATEGORY_PRODSER</id>
                    <records>
                        <record>
                            <Internalid>24348</Internalid>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>CATEGOR_LEVEL_1</id>
                                    <value>MR</value>
                                </field>
                                <field>
                                    <id>LOW_PRODSERV</id>
                                    <value>RES</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_2</id>
                                    <value>keylevel221</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_3</id>
                                    <value>keylevel3127</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_4</id>
                                    <value>keylevel4434</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_5</id>
                                    <value>keylevel5545</value>
                                </field>
                            </fields>
                        </record>
                        <record>
                            <Internalid>24349</Internalid>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>CATEGOR_LEVEL_1</id>
                                    <value>MR</value>
                                </field>
                                <field>
                                    <id>LOW_PRODSERV</id>
                                    <value>RES</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_2</id>
                                    <value>keylevel221</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_3</id>
                                    <value>keylevel3125</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_4</id>
                                    <value>keylevel4268</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_5</id>
                                    <value>keylevel5418</value>
                                </field>
                            </fields>
                        </record>
                        <record>
                            <Internalid>24350</Internalid>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>CATEGOR_LEVEL_1</id>
                                    <value>MR</value>
                                </field>
                                <field>
                                    <id>LOW_PRODSERV</id>
                                    <value>RES</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_2</id>
                                    <value>keylevel221</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_3</id>
                                    <value>keylevel3122</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_4</id>
                                    <value>keylevel425</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_5</id>
                                    <value>keylevel5221</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
                <form>
                    <id>OTHER_INFOR</id>
                    <records>
                        <record>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>S_EAST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>W_EST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>M_WEST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>N_EAST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>LOW_AREYOU_ASSET</id>
                                    <value>-1</value>
                                </field>
                                <field>
                                    <id>LOW_SWART_PROG</id>
                                    <value>-1</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
                <form>
                    <id>ABDCEDF</id>
                    <records>
                        <record>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>LOW_COD_CONDUCT</id>
                                    <value>-1</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
                <form>
                    <id>CODDUC</id>
                    <records>
                        <record>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>LOW_SUPPLIER_TYPE</id>
                                    <value>2</value>
                                </field>
                                <field>
                                    <id>LOW_DO_INT_BOTH</id>
                                    <value>1</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
            </forms>
        </supplier>
    </integrationEntityDetails>
</integration-outbound:IntegrationEntity>

The goal is to have common xml to csv conversion to be put in place. Based on input file the xml should be flattend and exploded into multiple csv and stored.

The input is an xml which is above and config csv file below. Need to create 3 csv files with corresponding XPATH mentioned in the file

XPATH,ColumName,CSV_File_Name,ParentKey
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/id,id,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/name,name,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/size,size,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/source,source,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/entity,entity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/action,action,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/timestampUTC,timestampUTC,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocBaseVersion,zDocBaseVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocCustomVersion,zDocCustomVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/key,key,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/value,value,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/id,id,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/Internalid,Internalid,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/isDelete,FormId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/id,SupplierFormRecordFieldId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/value,SupplierFormRecordFieldValue,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsForms.csv,Y

I need to create 3 csv files output from it.

The design is to pick each csv file and get the xpath and pick the corresponding value from the xml and fetch it

Step 1 - Convert to xml to Json -

import json
import xmltodict

with open("/home/s0998hws/test.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())
    xml_file.close()

    # generate the object using json.dumps()
    # corresponding to json data

json_data = json.dumps(data_dict)

# Write the json data to output
# json file
with open("data.json", "w") as json_file:
    json_file.write(json_data)
    json_file.close()

with open('data.json') as f:
    d = json.load(f)

Step 2 - Normalize using the panda normalize function -using the xpath / converting to . and [] as other delimter and building the columns to be fecthed from the json i.e code will look for /integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber and convert to.integrationEntityHeader.integrationTrackingNumber and with the first [] it will exlode , there on

df_1=pd.json_normalize(data=d['integration-outbound:IntegrationEntity'])
df_2=df_1[['integrationEntityHeader.integrationTrackingNumber','integrationEntityDetails.supplier.requestId','integrationEntityHeader.referenceCodeForEntity','integrationEntityDetails.supplier.id','integrationEntityDetails.supplier.forms.form']]
df_3=df_2.explode('integrationEntityDetails.supplier.forms.form')
df_3['integrationEntityDetails.supplier.forms.form.id']=df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('id'))
df_3['integrationEntityDetails.supplier.forms.form.records']=df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('records'))

I was trying to use the metadata from the csv file and fecth it but the challenge is

df_3['integrationEntityDetails.supplier.forms.form.records.record.Internalid']=df_3['integrationEntityDetails.supplier.forms.form.records.record'].apply(lambda x: x.get('Internalid'))

Failed with Error -

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2327, in pandas._libs.lib.map_infer
  File "<stdin>", line 1, in <lambda>
AttributeError: 'list' object has no attribute 'get'

The reason is the data from the panda dataframe is having list when and array and it is unable be fecth using the above method.

Below is the output generated

integrationEntityHeader.integrationTrackingNumber   integrationEntityDetails.supplier.requestId integrationEntityHeader.referenceCodeForEntity  integrationEntityDetails.supplier.id    integrationEntityDetails.supplier.forms.form    integrationEntityDetails.supplier.forms.form.id integrationEntityDetails.supplier.forms.form.records
    0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'CATEGORY_PRODSER', 'records': {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}}    CATEGORY_PRODSER    {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}
    0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'OTHER_INFOR', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}}   OTHER_INFOR {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}
    0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'CORPORATESUSTAINABILITY', 'records': {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}}  CORPORATESUSTAINABILITY {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}
    0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'PRODUCTSERVICES', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}}  PRODUCTSERVICES {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}

Expected output (integrationEntityDetailsForms.csv)

integrationTrackingNumber   requestId   referenceCodeForEntity  supplier.id integrationEntityDetails.supplier.forms.form.id InternalId  isDelete    SupplierFormRecordFieldId   SupplierFormRecordFieldValue
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   LOW_PRODSERV    RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   LOW_LEVEL_3 keylevel3127
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   LOW_LEVEL_4 keylevel4434
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   LOW_LEVEL_5 keylevel5545
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   LOW_PRODSERV    RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   LOW_LEVEL_3 keylevel3122
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   LOW_LEVEL_4 keylevel425
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   LOW_LEVEL_5 keylevel5221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 OTHER_INFOR     FALSE   S_EAST  N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 OTHER_INFOR     FALSE   W_EST   N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 OTHER_INFOR     FALSE   M_WEST  N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 OTHER_INFOR     FALSE   N_EAST  N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 OTHER_INFOR     FALSE   LOW_AREYOU_ASSET    -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CORPORATESUSTAINABILITY     FALSE   LOW_SWART_PROG  -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CORPORATESUSTAINABILITY     FALSE   LOW_COD_CONDUCT -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 PRODUCTSERVICES     FALSE   LOW_SUPPLIER_TYPE   2
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 PRODUCTSERVICES     FALSE   LOW_DO_INT_BOTH 1
Solution

The XML is first converted to a dict and the parsing logic is written against that dict; the reason is that the same logic can then be reused for JSON. Stack Overflow was amazingly helpful, and the solution is built on the responses from the links listed at the end. For simplicity I created a three-level nested XML for testing; each line of the sample file below is one complete XML document, because the parser reads the input file line by line. This works on Python 3.
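The core idea as a minimal sketch: xmltodict.parse turns the XML into nested dicts, and pd.json_normalize flattens the top level into dot-separated columns (the same call would accept a dict loaded from JSON):

import xmltodict
import pandas as pd

xml_string = "<Company><Employee><FirstName>Hal</FirstName></Employee></Company>"
data_dict = xmltodict.parse(xml_string)   # {'Company': {'Employee': {'FirstName': 'Hal'}}}
df = pd.json_normalize(data_dict)         # single row with column 'Company.Employee.FirstName'
print(df.columns.tolist())                # ['Company.Employee.FirstName']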

<?xml version="1.0"?><Company><Employee><FirstName>Hal</FirstName><LastName>Thanos</LastName><ContactNo>122131</ContactNo><Email>hal.thanos@xyz.com</Email><Addresses><Address><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Iron</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>iron.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Captain</FirstName><LastName>America</LastName><ContactNo>13322</ContactNo><Email>captain.america@xyz.com</Email><Addresses><Address><City>Trivandrum</City><State>Kerala</State><Zip>28115</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Sword</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>sword.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Thor</FirstName><LastName>Odison</LastName><ContactNo>156565</ContactNo><Email>thor.odison@xyz.com</Email><Addresses><Address><City>Tirunelveli</City><State>TamilNadu</State><Zip>36595</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Spider</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>spider.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Black</FirstName><LastName>Widow</LastName><ContactNo>16767</ContactNo><Email>black.widow@xyz.com</Email><Addresses><Address><City>Mysore</City><State>Karnataka</State><Zip>12478</Zip><forms><form><id>ID1</id><value>LIC</value></form></forms></Address></Addresses></Employee><Employee><FirstName>White</FirstName><LastName>Man</LastName><ContactNo>5634</ContactNo><Email>white.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>

The config file for this XML lists, per output CSV, the XPATH of every column to extract; every element that can repeat (and therefore must be exploded) is marked with []. The header row is required, as the code refers to the column names. A small sketch after the config listing shows how the [] markers translate into explode levels.

Change the variables to match your setup: process_config_csv = 'config.csv' and xml_file_name = 'test.xml'.

XPATH,ColumName,CSV_File_Name
/Company/Employee[]/FirstName,FirstName,Name.csv
/Company/Employee[]/LastName,LastName,Name.csv
/Company/Employee[]/ContactNo,ContactNo,Name.csv
/Company/Employee[]/Email,Email,Name.csv
/Company/Employee[]/FirstName,FirstName,Address.csv
/Company/Employee[]/LastName,LastName,Address.csv
/Company/Employee[]/ContactNo,ContactNo,Address.csv
/Company/Employee[]/Email,Email,Address.csv
/Company/Employee[]/Addresses/Address[]/City,City,Address.csv
/Company/Employee[]/Addresses/Address[]/State,State,Address.csv
/Company/Employee[]/Addresses/Address[]/Zip,Zip,Address.csv
/Company/Employee[]/Addresses/Address[]/type,type,Address.csv
/Company/Employee[]/FirstName,FirstName,Form.csv
/Company/Employee[]/LastName,LastName,Form.csv
/Company/Employee[]/ContactNo,ContactNo,Form.csv
/Company/Employee[]/Email,Email,Form.csv
/Company/Employee[]/Addresses/Address[]/type,type,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/id,id,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/value,value,Form.csv
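
As a worked illustration of the [] convention (a sketch only; the full script below repeats this logic), the explode depth of each XPATH for Form.csv is derived like this:

import csv
from collections import defaultdict

explode_levels = defaultdict(set)
with open('config.csv', newline='') as csvfile:          # file name assumed as above
    for row in csv.DictReader(csvfile):
        if row['CSV_File_Name'] == 'Form.csv':
            parts = row['XPATH'].strip('/').replace('/', '.').split('[]')
            if len(parts) > 1:
                # key = number of [] in the XPATH, value = dotted path up to the last []
                explode_levels[len(parts) - 1].add(''.join(parts[:-1]))

print(dict(explode_levels))
# {1: {'Company.Employee'},
#  2: {'Company.Employee.Addresses.Address'},
#  3: {'Company.Employee.Addresses.Address.forms.form'}}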

The code to create multiple CSVs based on the config file is:

import os
import csv
import sys
import json
import xmltodict
import numpy as np
import pandas as pd
from collections import defaultdict

def getMatches(L1, L2):
    R = set()
    for elm in L1:
        for pat in L2:
            if elm.find(pat) != -1:
                if elm.find('.', len(pat)+1) != -1:
                    R.add(elm[:elm.find('.', len(pat)+1)])
                else:
                    R.add(elm)
    return list(R)
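
# Illustrative example of getMatches: given the full dotted column list and the prefixes
# just exploded, it returns the dotted path one level deeper than each matching prefix, e.g.
#   getMatches(['Company.Employee.Addresses.Address.City', 'Company.Employee.Email'],
#              ['Company.Employee.Addresses'])
#   -> ['Company.Employee.Addresses.Address']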

def xml_parse(xml_file_name):
    try:
        process_xml_file = xml_file_name
        with open(process_xml_file) as xml_file:
            # each line of the input file is expected to be one complete XML document
            for xml_string in xml_file:
                """Converting the xml to dict"""
                data_dict = xmltodict.parse(xml_string)
                """Converting the dict to a pandas DataFrame"""
                df_processing = pd.json_normalize(data_dict)
                xml_parse_loop(df_processing)
    except Exception as e:
        print(str(e))

def xml_parse_loop(df_processing_input):
    CSV_File_Name = []
    """Getting the list of csv Files to be created"""
    with open(process_config_csv, newline='') as csvfile:
        DataCaptured = csv.DictReader(csvfile)
        for row in DataCaptured:
            if row['CSV_File_Name'] not in CSV_File_Name:
                CSV_File_Name.append(row['CSV_File_Name'])
    """Iterating the list of CSV"""
    for items in CSV_File_Name:
            df_processing = df_processing_input
            df_subset_process = []
            df_subset_list_all_cols = []
            df_process_sub_explode_Level = []
            df_final_column_name = []
            print('Parsing the xml file for creating the file - ' + str(items))
            """Fetching the field list for processs from the confic File"""
            with open(process_config_csv, newline='') as csvfile:
                    DataCaptured = csv.DictReader(csvfile)
                    for row in DataCaptured:
                        if row['CSV_File_Name'] in items:
                                df_final_column_name.append(row['ColumName'])
                                """Getting the columns until the first [] """
                                df_subset_process.append(row['XPATH'].strip('/').replace("/",".").split('[]')[0])
                                """Getting the All the columnnames"""
                                df_subset_list_all_cols.append(row['XPATH'].strip('/').replace("/",".").replace("[]",""))
                                """Getting the All the Columns to explode"""
                                df_process_sub_explode_Level.append(row['XPATH'].strip('/').replace('/', '.').split('[]'))
            explode_ld = defaultdict(set)
            """Putting Level of explode and column names"""
            for x in df_process_sub_explode_Level:
                if len(x) > 1:
                    explode_ld[len(x) - 1].add(''.join(x[: -1]))
            explode_ld = {k: list(v) for k, v in explode_ld.items()}
            #print(' The All column list is for the file ' + items + " is " + str(df_subset_list_all_cols))
            #print(' The first processing for the file ' + items + " is " + str(df_subset_process))
            #print('The explode level of attributes for the file ' + items + " is " + str(explode_ld))
            """Remove column duplciates"""
            df_subset_process = list(dict.fromkeys(df_subset_process))
            for col in df_subset_process:
                if col not in df_processing.columns:
                    df_processing[col] = np.nan
            df_processing = df_processing[df_subset_process]
            df_processing_col_list = df_processing.columns.tolist()
            print ('The total levels to be exploded : %d' % len(explode_ld))
            level = len(explode_ld)
            for i in range(level):
                print (' Exploding the Level : %d' % i )
                df_processing_col_list = df_processing.columns.tolist()
                list_of_explode=set(df_processing_col_list) & set(explode_ld[i + 1])
                #print('List to expolde' + str(list_of_explode))
                """If founc in explode list exlplode some xml doesnt need to have a list it could be column handling the same"""
                for c in list_of_explode:
                    print (' There are column present which needs to be exploded - ' + str(c))
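                    # rows whose cell holds a list are exploded into one row per element;
                    # rows that already hold a single dict (or NaN) are kept and re-appended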
                    df_processing = pd.concat((df_processing.iloc[[type(item) == list for item in df_processing[c]]].explode(c),df_processing.iloc[[type(item) != list for item in df_processing[c]]]))
                    print(' Finding the columns that need to be fetched ')
                """From the overall column list fecthing the attributes needed to explode"""
                next_level_pro_lst = getMatches(df_subset_list_all_cols,explode_ld[ i + 1 ])
                #print(next_level_pro_lst)
                df_processing_col_list = df_processing.columns.tolist()
                for nex in next_level_pro_lst:
                    #print ("Fetching " + nex.rsplit('.', 1)[1] + ' from ' + nex.rsplit('.', 1)[0] + ' from ' + nex )
                    parent_col=nex.rsplit('.', 1)[0]
                    child_col=nex.rsplit('.', 1)[1]
                    #print(parent_col)
                    #print(df_processing_col_list)
                    if parent_col not in df_processing_col_list:
                        df_processing[nex.rsplit('.', 1)[0]] = ""
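                    # .get() works only when every cell is a dict; if any cell is not
                    # (e.g. the "" placeholder set above, or NaN), the apply raises
                    # AttributeError and the column falls back to empty strings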
                    try:
                        df_processing[nex] = df_processing[parent_col].apply(lambda x: x.get(child_col))
                    except AttributeError:
                        df_processing[nex] = ""
                df_processing_col_list = df_processing.columns.tolist()
                if i == level-1:
                    print('Last Level nothing to be done')
                else:
                    """Extracting All columns until the next exlode column list is found"""
                    while len(set(df_processing_col_list) & set(explode_ld[i + 2]))==0:
                        next_level_pro_lst = getMatches(df_subset_list_all_cols, next_level_pro_lst)
                        #print(next_level_pro_lst)
                        for nextval in next_level_pro_lst:
                            if nextval not in df_processing_col_list:
                                #print("Fetching " + nextval.rsplit('.', 1)[1] + ' from ' + nextval.rsplit('.', 1)[0] + ' from ' + nextval)
                                if nextval.rsplit('.', 1)[0] not in df_processing.columns:
                                    df_processing[nextval.rsplit('.', 1)[0]] = ""
                                try:
                                    df_processing[nextval] = df_processing[nextval.rsplit('.', 1)[0]].apply(lambda x: x.get(nextval.rsplit('.', 1)[1]))
                                except AttributeError:
                                    df_processing[nextval] = ""

                        df_processing_col_list = df_processing.columns.tolist()


            df_processing = df_processing[df_subset_list_all_cols]
            df_processing.columns = df_final_column_name
            # if the file does not exist, write it with a header row
            if not os.path.isfile(items):
                print("The file does not exist, so writing a new one with header")
                df_processing.to_csv('{}'.format(items), header=True, index=None)
            else:  # it exists, so append without writing the header
                print("The file already exists, so appending without header")
                df_processing.to_csv('{}'.format(items), mode='a', header=False, index=None)


from datetime import datetime
startTime = datetime.now().strftime("%Y%m%d_%H%M%S")
startTime = str(os.getpid()) + "_" + startTime
process_task_name = ''
process_config_csv = 'config.csv'
xml_file_name = 'test.xml'
old_print = print

def timestamped_print(*args, **kwargs):
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    printheader = now + " xml_parser " + " " + process_task_name + " - "
    old_print(printheader, *args, **kwargs)
print = timestamped_print

xml_parse(xml_file_name)

The output files created are:

$ cat Name.csv
FirstName,LastName,ContactNo,Email
Hal,Thanos,122131,hal.thanos@xyz.com
Iron,Man,12324,iron.man@xyz.com
Captain,America,13322,captain.america@xyz.com
Sword,Man,12324,sword.man@xyz.com
Thor,Odison,156565,thor.odison@xyz.com
Spider,Man,12324,spider.man@xyz.com
Black,Widow,16767,black.widow@xyz.com
White,Man,5634,white.man@xyz.com
$ cat Address.csv
FirstName,LastName,ContactNo,Email,City,State,Zip,type
Iron,Man,12324,iron.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Iron,Man,12324,iron.man@xyz.com,Concord,NC,28027,Temporary
Hal,Thanos,122131,hal.thanos@xyz.com,Bangalore,Karnataka,560212,
Sword,Man,12324,sword.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Sword,Man,12324,sword.man@xyz.com,Concord,NC,28027,Temporary
Captain,America,13322,captain.america@xyz.com,Trivandrum,Kerala,28115,
Spider,Man,12324,spider.man@xyz.com,Bangalore,Karnataka,560212,Permanent
Spider,Man,12324,spider.man@xyz.com,Concord,NC,28027,Temporary
Thor,Odison,156565,thor.odison@xyz.com,Tirunelveli,TamilNadu,36595,
White,Man,5634,white.man@xyz.com,Bangalore,Karnataka,560212,Permanent
White,Man,5634,white.man@xyz.com,Concord,NC,28027,Temporary
Black,Widow,16767,black.widow@xyz.com,Mysore,Karnataka,12478,
$ cat Form.csv
FirstName,LastName,ContactNo,Email,type,id,value
Iron,Man,12324,iron.man@xyz.com,Temporary,ID1,LIC
Iron,Man,12324,iron.man@xyz.com,Temporary,ID2,PAS
Iron,Man,12324,iron.man@xyz.com,Temporary,ID3,SSN
Iron,Man,12324,iron.man@xyz.com,Temporary,ID2,CC
Hal,Thanos,122131,hal.thanos@xyz.com,,ID1,LIC
Hal,Thanos,122131,hal.thanos@xyz.com,,ID2,PAS
Iron,Man,12324,iron.man@xyz.com,Permanent,ID3,LIC
Sword,Man,12324,sword.man@xyz.com,Temporary,ID1,LIC
Sword,Man,12324,sword.man@xyz.com,Temporary,ID2,PAS
Sword,Man,12324,sword.man@xyz.com,Temporary,ID3,SSN
Sword,Man,12324,sword.man@xyz.com,Temporary,ID2,CC
Captain,America,13322,captain.america@xyz.com,,ID1,LIC
Captain,America,13322,captain.america@xyz.com,,ID2,PAS
Sword,Man,12324,sword.man@xyz.com,Permanent,ID3,LIC
Spider,Man,12324,spider.man@xyz.com,Temporary,ID1,LIC
Spider,Man,12324,spider.man@xyz.com,Temporary,ID2,PAS
Spider,Man,12324,spider.man@xyz.com,Temporary,ID3,SSN
Spider,Man,12324,spider.man@xyz.com,Temporary,ID2,CC
Thor,Odison,156565,thor.odison@xyz.com,,ID1,LIC
Thor,Odison,156565,thor.odison@xyz.com,,ID2,PAS
Spider,Man,12324,spider.man@xyz.com,Permanent,ID3,LIC
White,Man,5634,white.man@xyz.com,Temporary,ID1,LIC
White,Man,5634,white.man@xyz.com,Temporary,ID2,PAS
White,Man,5634,white.man@xyz.com,Temporary,ID3,SSN
White,Man,5634,white.man@xyz.com,Temporary,ID2,CC
White,Man,5634,white.man@xyz.com,Permanent,ID3,LIC
Black,Widow,16767,black.widow@xyz.com,,ID1,LIC

The pieces and answers are extracted from different threads, with thanks to @Mark Tolonen, @Mandy007 and @deadshot:

Create a dict of list using python from csv

https://stackoverflow.com/questions/62837949/extract-a-list-from-a-list

How to explode Panda column with data having different dict and list of dict

This can definitely be made shorter and more performant, and it can be enhanced further.

This concludes the article on creating CSVs from XML/JSON with Python Pandas. We hope the answer above is helpful, and thank you for your support!
