我正在处理大熊猫大约300兆的财务数据,这与拍卖中的限价单相对应。它是多维数据,如下所示:

                                bid                                                                                                                                                                                                                                                                                                                                                                                       ask
                                  0                  1                  2                  3                  4                  5                  6                  7                  8                  9                 10                 11                 12                 13                 14                 15                 16                 17                 18               19                  0                  1                  2                  3                  4                  5                  6                  7                  8                  9                 10                 11                 12                 13                 14                 15               16               17               18               19
                              price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity  price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity  price  quantity  price  quantity  price  quantity  price  quantity
2014-05-13 08:47:16.180000  102.298   1000000  102.297   1500000  102.296   6500000  102.295   8000000  102.294   3000000  102.293  24300000  102.292   6000000  102.291   1000000  102.290   1000000  102.289   2500000  102.288  11000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000      NaN       NaN    NaN       NaN  102.302   2000000  102.303   6100000  102.304  14700000  102.305   3500000  102.307   9800000  102.308  15500000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000      NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.003000  102.298   1000000  102.297   2500000  102.296   6500000  102.295   7000000  102.294   3000000  102.293  24300000  102.292   6000000  102.291   1000000  102.290   1000000  102.289   2500000  102.288  11000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000      NaN       NaN    NaN       NaN  102.302   2000000  102.303   5100000  102.304  14700000  102.305   4500000  102.307   9800000  102.308  15500000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000      NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.005000  102.298   3000000  102.297   3500000  102.296   6000000  102.295   9300000  102.294   4000000  102.293  17500000  102.292   2000000  102.291   4000000  102.290   1000000  102.289   2500000  102.288   6000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000      NaN       NaN    NaN       NaN  102.302   2000000  102.303   5100000  102.304  14700000  102.305   4500000  102.307   9000000  102.308  16300000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000      NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.006000  102.299   1000000  102.298   3000000  102.297   6500000  102.296   5000000  102.295   5300000  102.294   4000000  102.293  15500000  102.292   2000000  102.291   4000000  102.290   1000000  102.289   2500000  102.288   6000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000    NaN       NaN  102.302   2000000  102.303   5100000  102.304  11700000  102.305   7500000  102.307   9000000  102.308  11300000  102.309   5000000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.007000  102.299   1000000  102.298   3000000  102.297   8500000  102.296   4000000  102.295   4300000  102.294   5000000  102.293  14500000  102.292   2000000  102.291   4000000  102.290   1000000  102.289   2500000  102.288   6000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000    NaN       NaN  102.302   2000000  102.303   4100000  102.304  13700000  102.305   7500000  102.307   8000000  102.308  12300000  102.309   5000000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN

(当你达到20级时,注意第一级的变化。很抱歉表格格式太长…)
为了处理数据,我需要做很多透视操作。例如,不是0,1,2,3。。。(订单在队列中的相对位置),它们有102.297、102.296、。。。即以订单价格为指标。他就是这样一个例子:
x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])

顺从的:
                                         quantity
                           side price
2014-05-13 08:47:17.003000 ask  102.300         0
                                102.301         0
                                102.302         0
                                102.303  -1000000
                                102.304         0

这可以通过组合stack/unstack/reset_index来实现,但它看起来确实效率低下。我没有看代码,但我猜每个stack/unstack上都会生成一个表的副本,这会导致8GB系统内存不足,并开始访问页面文件。我认为在这种情况下也不能使用pivot,因为所需的列在多索引中
有什么建议可以加快速度吗?
下面是一个示例输入csv文件,根据注释:
side,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask
level,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19
value,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.297,1500000.0,102.296,6500000.0,102.295,8000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,6100000.0,102.304,1.47E7,102.305,3500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.003,102.298,1000000.0,102.297,2500000.0,102.296,6500000.0,102.295,7000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.005,102.298,3000000.0,102.297,3500000.0,102.296,6000000.0,102.295,9300000.0,102.294,4000000.0,102.293,1.75E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9000000.0,102.308,1.63E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.006,102.299,1000000.0,102.298,3000000.0,102.297,6500000.0,102.296,5000000.0,102.295,5300000.0,102.294,4000000.0,102.293,1.55E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.17E7,102.305,7500000.0,102.307,9000000.0,102.308,1.13E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.007,102.299,1000000.0,102.298,3000000.0,102.297,8500000.0,102.296,4000000.0,102.295,4300000.0,102.294,5000000.0,102.293,1.45E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,4100000.0,102.304,1.37E7,102.305,7500000.0,102.307,8000000.0,102.308,1.23E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A

最佳答案

Unstack本质上创建了一个索引x列的枚举,因此当您有很多列和行时,它可以创建一个巨大的内存空间。
这是一个soln,速度较慢,但峰值内存使用率应该低得多(我认为)。它提供了一个稍微小一些的总空间,因为您可能有一些原始的零条目不在这里(但您总是可以重新索引和填充来修复它)。
定义此函数,可能会针对这种情况进行优化(已在级别上分组)

In [79]: def f(x):
    try:
        y = x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
        return y[y!=0].dropna()
    except:
        return None
   ....:

按列上的“level”分组并应用f;不要直接使用apply,而是将结果作为行进行concat(这是“unstacking”部分)。
然而,这会产生dup(在价格水平上),所以需要对它们进行汇总。
In [76]: concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
Out[76]:
value                                 quantity
                        side price
2014-05-13 08:47:17.003 ask  102.303  -1000000
                             102.305   1000000
                        bid  102.295  -1000000
                             102.297   1000000
2014-05-13 08:47:17.005 ask  102.307   -800000
                             102.308    800000
                        bid  102.288  -5000000
                             102.291   3000000
                             102.292  -4000000
                             102.293  -6800000
                             102.294   1000000
                             102.295   2300000
                             102.296   -500000
                             102.297   1000000
                             102.298   2000000
2014-05-13 08:47:17.006 ask  102.304  -3000000
                             102.305   3000000
                             102.308  -5000000
                             102.309   5000000
                             102.310         0
                             102.312         0
                             102.313         0
                             102.315         0
                             102.316         0
                             102.320         0
                             102.321         0
                             102.324         0
                             102.325         0
                        bid  102.265  -9500000
                             102.275         0
                             102.276         0
                             102.280         0
                             102.284         0
                             102.286         0
                             102.287         0
                             102.288         0
                             102.289         0
                             102.290         0
                             102.291         0
                             102.292         0
                             102.293  -2000000
                             102.294         0
                             102.295  -4000000
                             102.296  -1000000
                             102.297   3000000
                             102.298         0
                             102.299   1000000
2014-05-13 08:47:17.007 ask  102.303  -1000000
                             102.304   2000000
                             102.307  -1000000
                             102.308   1000000
                        bid  102.293  -1000000
                             102.294   1000000
                             102.295  -1000000
                             102.296  -1000000
                             102.297   2000000

计时(我认为优化f将使这一过程更快)
In [77]: %timeit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
1 loops, best of 3: 319 ms per loop

In [78]: %memit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
maximum of 1: 67.515625 MB per loop

原始方法
In [7]: %timeit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
10 loops, best of 3: 56.4 ms per loop

In [8]: %memit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
maximum of 1: 61.187500 MB per loop

关于python - Pandas Pivot MultiIndex有效,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/23672414/

10-10 23:58