I'm working with roughly 300 MB of financial data in pandas, corresponding to limit orders from an auction. It's multi-dimensional data that looks like this:
bid ask
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity
2014-05-13 08:47:16.180000 102.298 1000000 102.297 1500000 102.296 6500000 102.295 8000000 102.294 3000000 102.293 24300000 102.292 6000000 102.291 1000000 102.290 1000000 102.289 2500000 102.288 11000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 6100000 102.304 14700000 102.305 3500000 102.307 9800000 102.308 15500000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.003000 102.298 1000000 102.297 2500000 102.296 6500000 102.295 7000000 102.294 3000000 102.293 24300000 102.292 6000000 102.291 1000000 102.290 1000000 102.289 2500000 102.288 11000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 5100000 102.304 14700000 102.305 4500000 102.307 9800000 102.308 15500000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.005000 102.298 3000000 102.297 3500000 102.296 6000000 102.295 9300000 102.294 4000000 102.293 17500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 5100000 102.304 14700000 102.305 4500000 102.307 9000000 102.308 16300000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.006000 102.299 1000000 102.298 3000000 102.297 6500000 102.296 5000000 102.295 5300000 102.294 4000000 102.293 15500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN 102.302 2000000 102.303 5100000 102.304 11700000 102.305 7500000 102.307 9000000 102.308 11300000 102.309 5000000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.007000 102.299 1000000 102.298 3000000 102.297 8500000 102.296 4000000 102.295 4300000 102.294 5000000 102.293 14500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN 102.302 2000000 102.303 4100000 102.304 13700000 102.305 7500000 102.307 8000000 102.308 12300000 102.309 5000000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN
(Note how the first level changes as you go down through the rows; there are up to 20 levels per side. Sorry for the long table format…)
To process the data I need to do a lot of pivoting. For example, instead of 0, 1, 2, 3, … (the order's relative position in the queue), I want 102.297, 102.296, … i.e. the order price as the index. Here is one such example:
x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
yielding:
quantity
side price
2014-05-13 08:47:17.003000 ask 102.300 0
102.301 0
102.302 0
102.303 -1000000
102.304 0
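For reference, here is the same chain spelled with level names instead of positions, run on a tiny two-snapshot toy frame (a sketch only; the real data has 20 levels per side, and the timestamps and quantities below are made up):

```python
import pandas as pd

# toy frame: one bid level and one ask level, two snapshots
cols = pd.MultiIndex.from_product(
    [["bid", "ask"], ["0"], ["price", "quantity"]],
    names=["side", "level", "value"],
)
x = pd.DataFrame(
    [[102.298, 1.0e6, 102.302, 2.0e6],
     [102.298, 1.5e6, 102.302, 2.0e6]],
    index=pd.to_datetime(["2014-05-13 08:47:16.180",
                          "2014-05-13 08:47:17.003"]),
    columns=cols,
)

y = (x.stack(["side", "level"])          # move side and level into the row index
      .reset_index("level", drop=True)   # drop the queue position
      .set_index("price", append=True)   # re-index the rows by order price
      .unstack(["side", "price"])        # one column per (side, price)
      .fillna(0)
      .diff()                            # change vs. the previous snapshot
      .stack(["side", "price"]))         # back to long form: (time, side, price)
```

The second snapshot then shows a +500000 quantity change for bid 102.298 and a zero change for ask 102.302, matching the shape of the output above.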
This can be achieved with a combination of stack/unstack/reset_index, but it seems really inefficient. I haven't looked at the pandas source, but I suspect each stack/unstack creates a copy of the table, which exhausts my 8 GB of system memory and starts hitting the page file. I don't think pivot can be used here either, because the required columns are inside a MultiIndex. Any suggestions to speed this up?
Here is a sample input CSV file, as requested in the comments:
side,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask
level,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19
value,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.297,1500000.0,102.296,6500000.0,102.295,8000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,6100000.0,102.304,1.47E7,102.305,3500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.003,102.298,1000000.0,102.297,2500000.0,102.296,6500000.0,102.295,7000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.005,102.298,3000000.0,102.297,3500000.0,102.296,6000000.0,102.295,9300000.0,102.294,4000000.0,102.293,1.75E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9000000.0,102.308,1.63E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.006,102.299,1000000.0,102.298,3000000.0,102.297,6500000.0,102.296,5000000.0,102.295,5300000.0,102.294,4000000.0,102.293,1.55E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.17E7,102.305,7500000.0,102.307,9000000.0,102.308,1.13E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.007,102.299,1000000.0,102.298,3000000.0,102.297,8500000.0,102.296,4000000.0,102.295,4300000.0,102.294,5000000.0,102.293,1.45E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,4100000.0,102.304,1.37E7,102.305,7500000.0,102.307,8000000.0,102.308,1.23E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
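Loading a CSV in this three-row-header format back into a frame with the (side, level, value) column MultiIndex can be sketched as follows; the inline string is a trimmed-down stand-in for the real file:

```python
import io
import pandas as pd

# trimmed-down stand-in for the CSV above: one bid level and one ask level
csv = """side,bid,bid,ask,ask
level,0,0,0,0
value,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.302,2000000.0
2014-05-13 08:47:17.003,102.298,1000000.0,102.302,2000000.0
"""

# header=[0,1,2] builds the three-level column MultiIndex; the labels in the
# first header column ("side"/"level"/"value") become the column level names,
# which is what grouping by level='level' later relies on.
df = pd.read_csv(io.StringIO(csv), header=[0, 1, 2],
                 index_col=0, parse_dates=True, na_values="N/A")
```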
Best answer
Unstack essentially creates an index × columns enumeration, so when you have many rows and columns it can create a huge memory footprint.
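As a tiny illustration of that enumeration (toy data, not the question's): three observations unstack into a full 3×3 grid, with every never-observed (row, column) pair materialized as NaN:

```python
import pandas as pd

# three (time, price) observations -> unstack enumerates all 3x3 combinations
s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.MultiIndex.from_tuples(
        [("t0", 102.298), ("t1", 102.297), ("t2", 102.296)],
        names=["time", "price"],
    ),
)
wide = s.unstack("price")
print(wide.shape)                    # (3, 3): 9 cells from 3 stored values
print(int(wide.isna().sum().sum()))  # 6 of them are NaN fill
```

With many timestamps and many distinct prices, that cross product is what eats the memory.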
Here is a solution that is slower, but whose peak memory usage should be much lower (I think). It yields a slightly smaller total result, because some entries that were originally zero may be absent (you can always reindex and fill to fix that).
Define this function; it could probably be optimized for this case (it is already grouped on level):
In [79]: def f(x):
   ....:     try:
   ....:         y = x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
   ....:         return y[y!=0].dropna()
   ....:     except:
   ....:         return None
   ....:
Group by 'level' on the columns and apply f. Rather than using apply directly, concat the results as rows (this is the 'unstacking' part). This produces dups (at the price level), however, so they need to be summed up.
In [76]: concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
Out[76]:
value quantity
side price
2014-05-13 08:47:17.003 ask 102.303 -1000000
102.305 1000000
bid 102.295 -1000000
102.297 1000000
2014-05-13 08:47:17.005 ask 102.307 -800000
102.308 800000
bid 102.288 -5000000
102.291 3000000
102.292 -4000000
102.293 -6800000
102.294 1000000
102.295 2300000
102.296 -500000
102.297 1000000
102.298 2000000
2014-05-13 08:47:17.006 ask 102.304 -3000000
102.305 3000000
102.308 -5000000
102.309 5000000
102.310 0
102.312 0
102.313 0
102.315 0
102.316 0
102.320 0
102.321 0
102.324 0
102.325 0
bid 102.265 -9500000
102.275 0
102.276 0
102.280 0
102.284 0
102.286 0
102.287 0
102.288 0
102.289 0
102.290 0
102.291 0
102.292 0
102.293 -2000000
102.294 0
102.295 -4000000
102.296 -1000000
102.297 3000000
102.298 0
102.299 1000000
2014-05-13 08:47:17.007 ask 102.303 -1000000
102.304 2000000
102.307 -1000000
102.308 1000000
bid 102.293 -1000000
102.294 1000000
102.295 -1000000
102.296 -1000000
102.297 2000000
Timings (I think optimizing f would make this even faster):
In [77]: %timeit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
1 loops, best of 3: 319 ms per loop
In [78]: %memit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
maximum of 1: 67.515625 MB per loop
The original method:
In [7]: %timeit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
10 loops, best of 3: 56.4 ms per loop
In [8]: %memit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
maximum of 1: 61.187500 MB per loop
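The session above is from 2014-era pandas; on current pandas, `groupby(..., axis=1)` and `sortlevel()` are gone and stacking by duplicated positional levels no longer works. A rough sketch of the same per-price quantity diff in long format (toy frame; names and values are illustrative, not from the question's data):

```python
import pandas as pd

# toy stand-in for the order book: one side, two levels, two snapshots
cols = pd.MultiIndex.from_product(
    [["bid"], ["0", "1"], ["price", "quantity"]],
    names=["side", "level", "value"],
)
df = pd.DataFrame(
    [[102.298, 1.0e6, 102.297, 1.5e6],
     [102.298, 1.0e6, 102.297, 2.5e6]],
    index=pd.to_datetime(["2014-05-13 08:47:16.180",
                          "2014-05-13 08:47:17.003"]),
    columns=cols,
)

# long format: one row per (time, side, level) with price/quantity columns
pieces = []
for (side, level), sub in df.T.groupby(level=["side", "level"]):
    block = sub.T
    block.columns = block.columns.get_level_values("value")
    block = block.rename_axis("time").reset_index()
    block["side"] = side
    pieces.append(block[["time", "side", "price", "quantity"]])
long = pd.concat(pieces, ignore_index=True)

# quantity per (side, price) over time, zero-filled, then differenced;
# keep only the non-zero changes, as f does above
wide = long.pivot_table(index="time", columns=["side", "price"],
                        values="quantity", aggfunc="sum").fillna(0)
delta = wide.diff().stack(["side", "price"]).dropna()
delta = delta[delta != 0]
```

For the toy frame this leaves a single +1000000 change at bid 102.297, the same shape of result as the accepted answer's `concat`/`groupby`/`sum` pipeline.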
Original question on Stack Overflow: https://stackoverflow.com/questions/23672414/