Speeding Up a Data Import Function (Pandas and Appending to a DataFrame)

This article looks at how to speed up a data import function (Pandas and appending to a DataFrame); hopefully it is a useful reference for anyone running into the same problem.

Problem Description

Our data output folders contain a variable number of .csv files, each associated with an .xml file that holds all the different recording parameters. Each .csv file represents one "sweep" of recording data, so I'm currently trying to figure out how to combine all of these files into one large multi-indexed (Sweep# and Time) dataframe for processing (since we usually look at an entire set of sweeps at once and find average values).

So far I have the following two functions. The first one just makes some minor modifications to the dataframe so that it is more manageable down the road.

import pandas as pd

def import_pcsv(filename):
    # Read the csv, ignoring any whitespace after the delimiters
    df = pd.read_csv(filename, skipinitialspace=True)
    # The first column holds the timestamps; give it a consistent name
    df.rename(columns={df.columns[0]: 'Time'}, inplace=True)
    # Rescale the time values (divide by 1000) and use Time as the index
    df.Time = df.Time / 1000
    df.set_index('Time', inplace=True)
    return df
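
For reference, calling it on a single sweep file (the file name below is made up for illustration) gives back a frame indexed by Time:

# Hypothetical file name; the real names come from the xml metadata
sweep = import_pcsv(r'C:\data\output\sweep_0001.csv')
print(sweep.index.name)   # 'Time'
print(sweep.columns)      # whatever recording channels the sweep contains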

This second one is the real workhorse for parsing the folder. It grabs all the xml files in the folder, parses them (with another function I put together in a separate module), and then imports the associated csv files into one large dataframe.

from glob import glob
import pandas as pd
import pxml  # the separate module that holds the XML parser (parse_vr)

def import_pfolder(folder):
    # Find every VoltageRecording xml file in the folder
    vr_xmls = glob(folder + r'\*VoltageRecording*.xml')
    data = pd.DataFrame()
    counter = 1

    for file in vr_xmls:
        # Parse the xml to find the name of the associated csv file
        file_vals = pxml.parse_vr(file)
        df = import_pcsv(folder + '\\' + file_vals['voltage recording'] + '.csv')
        # Label the sweep and add it as a second index level
        df['Sweep'] = 'Sweep' + str(counter)
        df.set_index('Sweep', append=True, inplace=True)
        # Appending copies the entire accumulated frame on every iteration
        data = data.append(df.reorder_levels(['Sweep', 'Time']))
        counter += 1

    return data
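
A call then looks something like this (the folder path is a made-up example); the result is a single frame whose MultiIndex pairs each sweep label with its time values:

# Hypothetical folder path for illustration
data = import_pfolder(r'C:\data\output\cell01')
print(data.index.names)   # ['Sweep', 'Time']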

The problem is that this gets really slow if there are a large number of files in the folder. The first function is essentially as fast as the normal pandas read_csv function (it's a few ms slower, but that's fine).

I ran some timing tests for different numbers of csv/xml file pairs in the folder. The %time for each is:

1 file = 339 ms
5 files = 2.61 sec
10 files = 7.53 sec
20 files = 24.7 sec
40 files = 87 sec

That last one is a real killer.

In trying to figure this out, I also got some time stats on each line of the for loop in import_pfolder() - the time in parentheses is the best time from %timeit:

1st line = 2 ms (614 µs)
2nd line = 98 ms (82.2 ms)
3rd line = 21 ms (10.8 ms)
4th line = 49 ms
5th line = 225 ms

I'm guessing the slowdown comes from having to copy the entire dataframe in the last line on every pass through the loop. I'm having trouble figuring out how to avoid this, though. The only column I know for sure will be in the .csv files is the first one (Time) - beyond that, the files can have a variable number of columns and rows. Is there a way to preallocate a dataframe beforehand that takes that variability into account? Would that even help?

Any suggestions would be greatly appreciated.

Thanks

Recommended Answer

Don't append DataFrames like that at all (nor start with an empty one); each append is a copy. Building a list of frames and concatenating them once results in a single copy and constant per-file appending cost. The pandas concat documentation covers the details.

Instead:

frames = []

for f in files:
    frames.append(process_your_file(f))

result = pd.concat(frames)
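
Applied to import_pfolder() above, the same pattern might look like the sketch below. This is an adaptation rather than the answer's exact code: the function name import_pfolder_fast is made up, and it uses the keys argument of pd.concat to build the Sweep level of the MultiIndex in one step instead of set_index/reorder_levels. (Recent pandas versions have removed DataFrame.append entirely, so the concat approach is also the one that keeps working going forward.)

from glob import glob
import pandas as pd
import pxml  # the separate module that holds the XML parser (parse_vr)

def import_pfolder_fast(folder):
    # Hypothetical rewrite of import_pfolder using the list-and-concat pattern
    vr_xmls = glob(folder + r'\*VoltageRecording*.xml')
    frames = []
    keys = []

    for counter, file in enumerate(vr_xmls, start=1):
        file_vals = pxml.parse_vr(file)
        # Each per-sweep frame keeps its own Time index; nothing is copied yet
        frames.append(import_pcsv(folder + '\\' + file_vals['voltage recording'] + '.csv'))
        keys.append('Sweep' + str(counter))

    # One concat at the end: the keys become the outer 'Sweep' index level
    return pd.concat(frames, keys=keys, names=['Sweep', 'Time'])

With the MultiIndex in place, averaging across sweeps at each time point is then a single groupby, e.g. data.groupby(level='Time').mean().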
