Problem Description
Initial Question:
I'm looping through a couple of thousand pickle files containing Python Pandas DataFrames, which vary in the number of rows (between approx. 600 and 1300) but not in the number of columns (636 to be exact). Then I transform them (applying exactly the same transformations to each) and append them to a csv file using the DataFrame.to_csv() method.
The to_csv code excerpt:
if picklefile == '0000.p':
    dftemp.to_csv(finalnormCSVFile)
else:
    dftemp.to_csv(finalnormCSVFile, mode='a', header=False)
What bothers me is that it starts off pretty fast but performance decreases exponentially. I kept a processing time log:
start: 2015-03-24 03:26:36.958058
2015-03-24 03:26:36.958058
count = 0
time: 0:00:00
2015-03-24 03:30:53.254755
count = 100
time: 0:04:16.296697
2015-03-24 03:39:16.149883
count = 200
time: 0:08:22.895128
2015-03-24 03:51:12.247342
count = 300
time: 0:11:56.097459
2015-03-24 04:06:45.099034
count = 400
time: 0:15:32.851692
2015-03-24 04:26:09.411652
count = 500
time: 0:19:24.312618
2015-03-24 04:49:14.519529
count = 600
time: 0:23:05.107877
2015-03-24 05:16:30.175175
count = 700
time: 0:27:15.655646
2015-03-24 05:47:04.792289
count = 800
time: 0:30:34.617114
2015-03-24 06:21:35.137891
count = 900
time: 0:34:30.345602
2015-03-24 06:59:53.313468
count = 1000
time: 0:38:18.175577
2015-03-24 07:39:29.805270
count = 1100
time: 0:39:36.491802
2015-03-24 08:20:30.852613
count = 1200
time: 0:41:01.047343
2015-03-24 09:04:14.613948
count = 1300
time: 0:43:43.761335
2015-03-24 09:51:45.502538
count = 1400
time: 0:47:30.888590
2015-03-24 11:09:48.366950
count = 1500
time: 1:18:02.864412
2015-03-24 13:02:33.152289
count = 1600
time: 1:52:44.785339
2015-03-24 15:30:58.534493
count = 1700
time: 2:28:25.382204
2015-03-24 18:09:40.391639
count = 1800
time: 2:38:41.857146
2015-03-24 21:03:19.204587
count = 1900
time: 2:53:38.812948
2015-03-25 00:00:05.855970
count = 2000
time: 2:56:46.651383
2015-03-25 03:53:05.020944
count = 2100
time: 3:52:59.164974
2015-03-25 05:02:16.534149
count = 2200
time: 1:09:11.513205
2015-03-25 06:07:32.446801
count = 2300
time: 1:05:15.912652
2015-03-25 07:13:45.075216
count = 2400
time: 1:06:12.628415
2015-03-25 08:20:17.927286
count = 2500
time: 1:06:32.852070
2015-03-25 09:27:20.676520
count = 2600
time: 1:07:02.749234
2015-03-25 10:35:01.657199
count = 2700
time: 1:07:40.980679
2015-03-25 11:43:20.788178
count = 2800
time: 1:08:19.130979
2015-03-25 12:53:57.734390
count = 2900
time: 1:10:36.946212
2015-03-25 14:07:20.936314
count = 3000
time: 1:13:23.201924
2015-03-25 15:22:47.076786
count = 3100
time: 1:15:26.140472
2015-03-25 19:51:10.776342
count = 3200
time: 4:28:23.699556
2015-03-26 03:06:47.372698
count = 3300
time: 7:15:36.596356
count = 3324
end of cycle: 2015-03-26 03:59:54.161842
end: 2015-03-26 03:59:54.161842
total duration: 2 days, 0:33:17.203784
Update #1:
I did as you suggested, @Alexander, but it certainly has to do with the to_csv() method:
start: 2015-03-26 05:18:25.948410
2015-03-26 05:18:25.948410
count = 0
time: 0:00:00
2015-03-26 05:20:30.425041
count = 100
time: 0:02:04.476631
2015-03-26 05:22:27.680582
count = 200
time: 0:01:57.255541
2015-03-26 05:24:26.012598
count = 300
time: 0:01:58.332016
2015-03-26 05:26:16.542835
count = 400
time: 0:01:50.530237
2015-03-26 05:27:58.063196
count = 500
time: 0:01:41.520361
2015-03-26 05:29:45.769580
count = 600
time: 0:01:47.706384
2015-03-26 05:31:44.537213
count = 700
time: 0:01:58.767633
2015-03-26 05:33:41.591837
count = 800
time: 0:01:57.054624
2015-03-26 05:35:43.963843
count = 900
time: 0:02:02.372006
2015-03-26 05:37:46.171643
count = 1000
time: 0:02:02.207800
2015-03-26 05:38:36.493399
count = 1100
time: 0:00:50.321756
2015-03-26 05:39:42.123395
count = 1200
time: 0:01:05.629996
2015-03-26 05:41:13.122048
count = 1300
time: 0:01:30.998653
2015-03-26 05:42:41.885513
count = 1400
time: 0:01:28.763465
2015-03-26 05:44:20.937519
count = 1500
time: 0:01:39.052006
2015-03-26 05:46:16.012842
count = 1600
time: 0:01:55.075323
2015-03-26 05:48:14.727444
count = 1700
time: 0:01:58.714602
2015-03-26 05:50:15.792909
count = 1800
time: 0:02:01.065465
2015-03-26 05:51:48.228601
count = 1900
time: 0:01:32.435692
2015-03-26 05:52:22.755937
count = 2000
time: 0:00:34.527336
2015-03-26 05:52:58.289474
count = 2100
time: 0:00:35.533537
2015-03-26 05:53:39.406794
count = 2200
time: 0:00:41.117320
2015-03-26 05:54:11.348939
count = 2300
time: 0:00:31.942145
2015-03-26 05:54:43.057281
count = 2400
time: 0:00:31.708342
2015-03-26 05:55:19.483600
count = 2500
time: 0:00:36.426319
2015-03-26 05:55:52.216424
count = 2600
time: 0:00:32.732824
2015-03-26 05:56:27.409991
count = 2700
time: 0:00:35.193567
2015-03-26 05:57:00.810139
count = 2800
time: 0:00:33.400148
2015-03-26 05:58:17.109425
count = 2900
time: 0:01:16.299286
2015-03-26 05:59:31.021719
count = 3000
time: 0:01:13.912294
2015-03-26 06:00:49.200303
count = 3100
time: 0:01:18.178584
2015-03-26 06:02:07.732028
count = 3200
time: 0:01:18.531725
2015-03-26 06:03:28.518541
count = 3300
time: 0:01:20.786513
count = 3324
end of cycle: 2015-03-26 06:03:47.321182
end: 2015-03-26 06:03:47.321182
total duration: 0:45:21.372772
And as requested, the source code:
import pickle
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join
from datetime import datetime
# Defining function to deep copy pandas data frame:
def very_deep_copy(self):
    return pd.DataFrame(self.values.copy(), self.index.copy(), self.columns.copy())
# Adding function to Dataframe module:
pd.DataFrame.very_deep_copy = very_deep_copy
#Define Data Frame Header:
head = [
'ConcatIndex', 'Concatenated String Index', 'FileID', ..., 'Attribute<autosave>', 'Attribute<bgcolor>'
]
exclude = [
'ConcatIndex', 'Concatenated String Index', 'FileID', ... , 'Real URL Array'
]
path = "./dataset_final/"
pickleFiles = [ f for f in listdir(path) if isfile(join(path,f)) ]
finalnormCSVFile = 'finalNormalizedDataFrame2.csv'
count = 0
start_time = datetime.now()
t1 = start_time
print("start: " + str(start_time) + "\n")
for picklefile in pickleFiles:
    if count%100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)
    df['ConcatIndex'] = 100000*df.FileID + df.ID
    for i in range(0, len(df)):
        df.loc[i, 'Concatenated String Index'] = str(df['ConcatIndex'][i]).zfill(10)
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()
    for string in head:
        if string in exclude:
            if string != 'ConcatIndex':
                dftemp.drop(string, axis=1, inplace=True)
        else:
            if 'Real ' in string:
                max = pd.DataFrame.max(df[string.strip('Real ')])
            elif 'child' in string:
                max = pd.DataFrame.max(df[string.strip('child')+'desc'])
            else:
                max = pd.DataFrame.max(df[string])
            if max != 0:
                dftemp[string] = dftemp[string]/max
    dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

    count += 1

print('count = ' + str(count))
cycle_end_time = datetime.now()
print("end of cycle: " + str(cycle_end_time) + "\n")
end_time = datetime.now()
print("end: " + str(end_time))
print('total duration: ' + str(end_time - start_time) + '\n')
Update #2:
As suggested I executed the command %prun %run "./DataSetNormalization.py"
for the first couple of hundred picklefiles, and the result is as follows:
136373640 function calls (136342619 primitive calls) in 1018.769 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
220 667.069 3.032 667.069 3.032 {method 'close' of '_io.TextIOWrapper' objects}
1540 42.046 0.027 46.341 0.030 {pandas.lib.write_csv_rows}
219 34.886 0.159 34.886 0.159 {built-in method collect}
3520 16.782 0.005 16.782 0.005 {pandas.algos.take_2d_axis1_object_object}
78323 9.948 0.000 9.948 0.000 {built-in method empty}
25336892 9.645 0.000 12.635 0.000 {built-in method isinstance}
1433941 9.344 0.000 9.363 0.000 generic.py:1845(__setattr__)
221051/220831 7.387 0.000 119.767 0.001 indexing.py:194(_setitem_with_indexer)
723540 7.312 0.000 7.312 0.000 {method 'reduce' of 'numpy.ufunc' objects}
273414 7.137 0.000 20.642 0.000 internals.py:2656(set)
604245 6.846 0.000 6.850 0.000 {method 'copy' of 'numpy.ndarray' objects}
1760 6.566 0.004 6.566 0.004 {pandas.lib.isnullobj}
276274 5.315 0.000 5.315 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1719244 5.264 0.000 5.266 0.000 {built-in method array}
1102450 5.070 0.000 29.543 0.000 internals.py:1804(make_block)
1045687 5.056 0.000 10.209 0.000 index.py:709(__getitem__)
1 4.718 4.718 1018.727 1018.727 DataSetNormalization.py:6(<module>)
602485 4.575 0.000 15.087 0.000 internals.py:2586(iget)
441662 4.562 0.000 33.386 0.000 internals.py:2129(apply)
272754 4.550 0.000 4.550 0.000 internals.py:1291(set)
220883 4.073 0.000 4.073 0.000 {built-in method charmap_encode}
4781222 3.805 0.000 4.349 0.000 {built-in method getattr}
52143 3.673 0.000 3.673 0.000 {built-in method truediv}
1920486 3.671 0.000 3.672 0.000 {method 'get_loc' of 'pandas.index.IndexEngine' objects}
1096730 3.513 0.000 8.370 0.000 internals.py:3035(__init__)
875899 3.508 0.000 14.458 0.000 series.py:134(__init__)
334357 3.420 0.000 3.439 0.000 {pandas.lib.infer_dtype}
2581268 3.419 0.000 4.774 0.000 {pandas.lib.values_from_object}
1102450 3.036 0.000 6.110 0.000 internals.py:59(__init__)
824856 2.888 0.000 45.749 0.000 generic.py:1047(_get_item_cache)
2424185 2.657 0.000 3.870 0.000 numeric.py:1910(isscalar)
273414 2.505 0.000 9.332 0.000 frame.py:2113(_sanitize_column)
1646198 2.491 0.000 2.880 0.000 index.py:698(__contains__)
879639 2.461 0.000 2.461 0.000 generic.py:87(__init__)
552988 2.385 0.000 4.451 0.000 internals.py:3565(_get_blkno_placements)
824856 2.349 0.000 51.282 0.000 frame.py:1655(__getitem__)
220831 2.224 0.000 21.670 0.000 internals.py:460(setitem)
326437 2.183 0.000 11.352 0.000 common.py:1862(_possibly_infer_to_datetimelike)
602485 2.167 0.000 16.974 0.000 frame.py:1982(_box_item_values)
602485 2.087 0.000 23.202 0.000 internals.py:2558(get)
770739 2.036 0.000 6.471 0.000 internals.py:1238(__init__)
276494 1.966 0.000 1.966 0.000 {pandas.lib.get_blkno_indexers}
10903876/10873076 1.935 0.000 1.972 0.000 {built-in method len}
220831 1.924 0.000 76.647 0.000 indexing.py:372(setter)
220 1.893 0.009 1.995 0.009 {built-in method load}
1920486 1.855 0.000 8.198 0.000 index.py:1173(get_loc)
112860 1.828 0.000 9.607 0.000 common.py:202(_isnull_ndarraylike)
602485 1.707 0.000 8.903 0.000 series.py:238(from_array)
875899 1.688 0.000 2.493 0.000 series.py:263(_set_axis)
3300 1.661 0.001 1.661 0.001 {method 'tolist' of 'numpy.ndarray' objects}
1102670 1.609 0.000 2.024 0.000 internals.py:108(mgr_locs)
4211850 1.593 0.000 1.593 0.000 {built-in method issubclass}
1335546 1.501 0.000 2.253 0.000 generic.py:297(_get_axis_name)
273414 1.411 0.000 37.866 0.000 frame.py:1994(__setitem__)
441662 1.356 0.000 7.884 0.000 indexing.py:982(_convert_to_indexer)
220831 1.349 0.000 131.331 0.001 indexing.py:95(__setitem__)
273414 1.329 0.000 23.170 0.000 generic.py:1138(_set_item)
326437 1.276 0.000 6.203 0.000 fromnumeric.py:2259(prod)
274734 1.271 0.000 2.113 0.000 shape_base.py:60(atleast_2d)
273414 1.242 0.000 34.396 0.000 frame.py:2072(_set_item)
602485 1.183 0.000 1.979 0.000 generic.py:1061(_set_as_cached)
934422 1.175 0.000 1.894 0.000 {method 'view' of 'numpy.ndarray'objects}
1540 1.144 0.001 58.217 0.038 format.py:1409(_save_chunk)
220831 1.144 0.000 9.198 0.000 indexing.py:139(_convert_tuple)
441662 1.137 0.000 3.036 0.000 indexing.py:154(_convert_scalar_indexer)
220831 1.087 0.000 1.281 0.000 arrayprint.py:343(array2string)
1332026 1.056 0.000 3.997 0.000 generic.py:310(_get_axis)
602485 1.046 0.000 9.949 0.000 frame.py:1989(_box_col_values)
220 1.029 0.005 1.644 0.007 internals.py:2429(_interleave)
824856 1.025 0.000 46.777 0.000 frame.py:1680(_getitem_column)
1491578 1.022 0.000 2.990 0.000 common.py:58(_check)
782616 1.010 0.000 3.513 0.000 numeric.py:394(asarray)
290354 0.988 0.000 1.386 0.000 internals.py:1950(shape)
220831 0.958 0.000 15.392 0.000 generic.py:2101(copy)
273414 0.940 0.000 1.796 0.000 indexing.py:1520(_convert_to_index_sliceable)
220831 0.920 0.000 1.558 0.000 common.py:1110(_possibly_downcast_to_dtype)
220611 0.914 0.000 0.914 0.000 {pandas.lib.is_bool_array}
498646 0.906 0.000 0.906 0.000 {method 'clear' of 'dict' objects}
715345 0.848 0.000 13.083 0.000 common.py:132(_isnull_new)
452882 0.824 0.000 1.653 0.000 index.py:256(__array_finalize__)
602485 0.801 0.000 0.801 0.000 internals.py:208(iget)
52583 0.748 0.000 2.038 0.000 common.py:1223(_fill_zeros)
606005 0.736 0.000 6.755 0.000 internals.py:95(make_block_same_class)
708971 0.732 0.000 2.156 0.000 internals.py:3165(values)
1760378 0.724 0.000 0.724 0.000 internals.py:2025(_get_items)
109560 0.720 0.000 6.140 0.000 nanops.py:152(_get_values)
220831 0.718 0.000 11.017 0.000 internals.py:2395(copy)
924669 0.712 0.000 1.298 0.000 common.py:2248(_get_dtype_type)
1515796 0.698 0.000 0.868 0.000 {built-in method hasattr}
220831 0.670 0.000 4.299 0.000 internals.py:435(copy)
875899 0.661 0.000 0.661 0.000 series.py:285(_set_subtyp)
220831 0.648 0.000 0.649 0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}
452882 0.640 0.000 0.640 0.000 index.py:218(_reset_identity)
715345 0.634 0.000 1.886 0.000 {pandas.lib.isscalar}
1980 0.626 0.000 1.172 0.001 internals.py:3497(_merge_blocks)
220831 0.620 0.000 2.635 0.000 common.py:1933(_is_bool_indexer)
272754 0.608 0.000 0.899 0.000 internals.py:1338(should_store)
220831 0.599 0.000 3.463 0.000 series.py:482(__getitem__)
498645 0.591 0.000 1.497 0.000 generic.py:1122(_clear_item_cache)
1119390 0.584 0.000 1.171 0.000 index.py:3936(_ensure_index)
220831 0.573 0.000 1.883 0.000 index.py:222(view)
814797 0.555 0.000 0.905 0.000 internals.py:3086(_values)
52583 0.543 0.000 15.545 0.000 ops.py:469(wrapper)
220831 0.536 0.000 3.760 0.000 internals.py:371(_try_cast_result)
228971 0.533 0.000 0.622 0.000 generic.py:1829(__getattr__)
769651 0.528 0.000 0.528 0.000 {built-in method min}
224351 0.509 0.000 2.030 0.000 generic.py:1099(_maybe_update_cacher)
...
I will rerun it for confirmation, but it looks like it certainly has something to do with pandas' to_csv() method, because most of the run time is spent on I/O and the csv writer. Why is it having this effect? Any suggestions?
Update #3:
Well, I did a full %prun test and indeed almost 90% of the time is spent in {method 'close' of '_io.TextIOWrapper' objects}. So I guess here's the problem... What do you guys think?
My questions here are:
- What is causing the decrease in performance?
- Does pandas.DataFrames.to_csv() in append mode load the whole file each time it writes to it?
- Is there a way to enhance the process?
In this kind of situation you should profile your code (to see which function calls are taking the most time); that way you can check empirically that it is indeed slow in to_csv rather than elsewhere...
From looking at your code: firstly, there's a lot of copying here and a lot of looping (not enough vectorization)... every time you see a loop, look for a way to remove it. Secondly, when you use things like zfill, I wonder if you want to_fwf (fixed width format) rather than to_csv?
Some sanity testing: are some files significantly bigger than others (which could lead to you hitting swap)? Are you sure the largest files are only 1200 rows?? Have you checked this? e.g. using wc -l.
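If you want to check that from Python rather than the shell, here is a quick sketch (assuming the same path variable as in your script):

from os import listdir
from os.path import isfile, join

import pandas as pd

path = "./dataset_final/"
# Row count of every pickled DataFrame, so unusually large files stand out.
row_counts = {f: len(pd.read_pickle(join(path, f)))
              for f in listdir(path) if isfile(join(path, f))}
biggest = max(row_counts, key=row_counts.get)
print(biggest, row_counts[biggest])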
IMO it is unlikely to be garbage collection... (as was suggested in the other answer).
Here are a few improvements on your code, which should improve the runtime.
The columns are fixed, so I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for the zfill).
columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)
real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.strip('Real ') for r in real_cols]
remaining_cols = remaining_cols - set(real_cols)
child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child') + 'desc' for r in child_cols]
remaining_cols = remaining_cols - set(child_cols)
for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)
    df['ConcatIndex'] = 100000*df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(list(columns_to_drop), axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining
    remaining = list(remaining_cols)
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important then discard the rows of m where .max() is 0
    #if max != 0:
    #    dftemp[string] = dftemp[string]/max

    # this is dropped earlier; if you need it, then subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)
As a point of style, I would probably choose to wrap each of these parts into functions; this will also mean more things can be gc'd if that really was the issue...
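A rough sketch of what that could look like, reusing the column lists computed above (the function name is just illustrative, not something from your script):

def normalize_frame(df, columns_to_drop, real_cols, real_cols_suffix,
                    child_cols, child_cols_desc, remaining_cols):
    # Normalize one file's DataFrame; every intermediate is local, so it can
    # be collected as soon as the caller has written the returned frame.
    dftemp = df.drop(list(columns_to_drop), axis=1)
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m
    remaining = list(remaining_cols)
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()
    return dftemp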
Another option which would be faster is to use PyTables (HDFStore) if you didn't need the resulting output to be csv (but I expect you do)...
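A sketch of that variant, assuming the per-file work has been wrapped into a normalize_frame()-style helper as above (the .h5 file name and the 'normalized' key are arbitrary, it needs PyTables installed, and long string columns may need min_itemsize set):

store = pd.HDFStore('finalNormalizedDataFrame2.h5', mode='w')
for picklefile in pickleFiles:
    df = pd.read_pickle(path + picklefile)
    dftemp = normalize_frame(df, columns_to_drop, real_cols, real_cols_suffix,
                             child_cols, child_cols_desc, remaining_cols)
    # append() grows a single table; no per-file reopening or seeking
    store.append('normalized', dftemp, index=False)
store.close()

# A csv can still be produced in one pass afterwards if it turns out you need it:
# pd.read_hdf('finalNormalizedDataFrame2.h5', 'normalized').to_csv(finalnormCSVFile)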
The best thing to do by far is to profile your code, e.g. with %prun in IPython; see http://pynash.org/2013/03/06/timing-and-profiling.html. Then you can see whether it really is to_csv, and specifically where (which line of your code and which lines of pandas code).
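If you'd rather not use IPython, the standard library's profiler gives the same breakdown; a minimal sketch, using the script name from your updates:

import cProfile
import pstats

# Equivalent of %prun for a standalone script: run it under the profiler and
# print the 20 most expensive calls by cumulative time.
cProfile.run('exec(open("DataSetNormalization.py").read())', 'normalization.prof')
pstats.Stats('normalization.prof').sort_stats('cumulative').print_stats(20)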
Ah ha, I'd missed that you are appending all of these to a single csv file. And your prun output shows that most of the time is spent in close, so let's keep the file open:
# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')

...

for picklefile in ...

    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)

...

f.close()
Each time the file is opened before appending, it needs to seek to the end before writing; it could be that this is the expensive part (I don't see why this should be that bad, but keeping the file open removes the need to do it).
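A with-block variant of the same idea, which also guarantees the file gets closed if something raises part-way through (just a sketch, reusing the normalize_frame() helper sketched above; to_csv accepts an open file handle here):

with open(finalnormCSVFile, 'w') as f:
    for count, picklefile in enumerate(pickleFiles):
        df = pd.read_pickle(path + picklefile)
        dftemp = normalize_frame(df, columns_to_drop, real_cols, real_cols_suffix,
                                 child_cols, child_cols_desc, remaining_cols)
        # write the header only for the first file; no per-file open/close or seek
        dftemp.to_csv(f, header=(count == 0))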