

I have a 4 GB XML file that I want to parse and convert to a pandas DataFrame for further processing. Because the file is so large, the code below never finishes: it keeps running and produces no output. The same code gives the correct output on a similar but smaller file.


Can anyone suggest a solution? Perhaps code that speeds up the XML-to-DataFrame conversion, or a way to split the XML file into smaller subsets.

Also, should I work with such large XML files on my personal system (2 GB RAM), or should I use Google Colab? If Google Colab, is there a faster way to upload such large files to Drive, and from there to Colab?


import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("Badges.xml")
root = tree.getroot()

#Column names for DataFrame
columns = ['row Id',"UserId",'Name','Date','Class','TagBased']

#Creating DataFrame
df = pd.DataFrame(columns = columns)

#Converting XML Tree to a Pandas DataFrame

for node in root:

    row_Id = node.attrib.get("Id")
    UserId = node.attrib.get("UserId")
    Name = node.attrib.get("Name")
    Date = node.attrib.get("Date")
    Class = node.attrib.get("Class")
    TagBased = node.attrib.get("TagBased")

    df = df.append(pd.Series([row_Id,UserId,Name,Date,Class,TagBased], index = columns), ignore_index = True)


The following is a sample of my XML file:

  <row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
  <row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />


Consider iterparse, which streams the file and builds the tree incrementally instead of loading it all at once. Inside the loop, append one dictionary per element to a list, then pass that list to the pandas.DataFrame constructor once, outside the loop. Adjust the tag name below to match the repeating child nodes of the root:

from xml.etree.ElementTree import iterparse
import pandas as pd

file_path = r"/path/to/Input.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['UserId'],
                          'Name': elem.attrib['Name'],
                          'Date': elem.attrib['Date'],
                          'Class': elem.attrib['Class'],
                          'TagBased': elem.attrib['TagBased']})

        # dict_list.append(dict(elem.attrib))      # ALTERNATIVELY, PARSE ALL ATTRIBUTES

        # Release the element's contents once its attributes have been copied
        elem.clear()


df = pd.DataFrame(dict_list)
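
With only 2 GB of RAM, even the list of dictionaries for a 4 GB file may not fit in memory. One possible variation on the loop above, offered as a rough sketch rather than a tested solution for your exact file, is to yield the rows in fixed-size chunks and clear the root element as you go so the partially built tree does not keep growing. The function name iter_row_chunks, the chunk_size default, and the Badges.csv output name are my own placeholders, not anything from your code.

```python
import xml.etree.ElementTree as ET
import pandas as pd


def iter_row_chunks(source, tag="row", chunk_size=100_000):
    """Yield DataFrames of at most chunk_size rows built from <tag> elements.

    `source` may be a file path or a binary file-like object.
    """
    rows = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element

    for event, elem in context:
        if event == "end" and elem.tag == tag:
            rows.append(dict(elem.attrib))  # copy the attributes before freeing
            root.clear()  # drop already-processed children from the tree
            if len(rows) >= chunk_size:
                yield pd.DataFrame(rows)
                rows = []

    if rows:  # emit whatever is left after the last full chunk
        yield pd.DataFrame(rows)


# Hypothetical usage: append each chunk to a CSV instead of holding
# the whole dataset in memory at once.
#
# for i, chunk in enumerate(iter_row_chunks("Badges.xml")):
#     chunk.to_csv("Badges.csv", mode="a", header=(i == 0), index=False)
```

Each chunk is an ordinary DataFrame, so you can filter or aggregate it before writing it out. This assumes the row elements are direct children of the root, as in your sample; a deeper structure would need a different clearing strategy.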

