Problem description
I'm trying to efficiently compute a running sum, with exponential decay, of each column of a Pandas DataFrame. The DataFrame contains a daily score for each country in the world. The DataFrame looks like this:
                  AF        UK        US
2014-07-01  0.998042  0.595720  0.524698
2014-07-02  0.380649  0.838436  0.355149
2014-07-03  0.306240  0.274755  0.964524
2014-07-04  0.396721  0.836027  0.225848
2014-07-05  0.151291  0.677794  0.603548
2014-07-06  0.558846  0.050535  0.551785
2014-07-07  0.463514  0.552748  0.265537
2014-07-08  0.240282  0.278825  0.116432
2014-07-09  0.309446  0.096573  0.246021
2014-07-10  0.800977  0.583496  0.713893
I'm not sure how to calculate the rolling sum (with decay) without iterating through the dataframe, since I need to know yesterday's score to calculate today's score. But to calculate yesterday's score, I need to know the day before yesterday's score, etc. This is the code that I've been using, but I'd like a more efficient way to go about it.
for j, val in df.iteritems():
    for i, row in enumerate(val):
        df[j].iloc[i] = row + val[i-1]*np.exp(-0.05)
Recommended answer
You can use the fact that when exponentials multiply their exponents add:
For example:
N(2) = N(2) + N(1) * exp(-0.05)
N(3) = N(3) + (N(2) + N(1) * exp(-0.05))*exp(-0.05)
N(3) = N(3) + N(2)*exp(-0.05) + N(1)*exp(-0.1)
N(4) = ...and so on
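The expansion above can be checked numerically. The following is a minimal sketch, not part of the original answer: it takes the first four AF scores from the question, and the names `decay`, `scores`, `recursive` and `unrolled` are illustrative only. It compares the day-by-day recurrence with the unrolled weighted sum, in which a score from k days ago carries the weight exp(-0.05 * k).

```python
import numpy as np

decay = np.exp(-0.05)
# First four AF values from the question's sample data
scores = np.array([0.998042, 0.380649, 0.306240, 0.396721])

# Recurrence: today's total = today's score + decayed total of the previous day
recursive = np.empty_like(scores)
recursive[0] = scores[0]
for t in range(1, len(scores)):
    recursive[t] = scores[t] + recursive[t - 1] * decay

# Unrolled form: the score from k days ago is weighted by exp(-0.05 * k)
unrolled = np.array([
    sum(scores[t - k] * np.exp(-0.05 * k) for k in range(t + 1))
    for t in range(len(scores))
])

# Both forms should agree up to floating-point rounding
print(np.allclose(recursive, unrolled))
```

Since the two forms agree, each day's total can be computed directly from the raw scores without knowing the previous day's total.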
This can then be vectorized using numpy:
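import numpy as np
import pandas as pd

dataset = pd.DataFrame(np.random.rand(1000, 3), columns=["A", "B", "C"])
# Decay exponents n-1, ..., 1, 0, so the most recent row gets weight exp(0) = 1
weightspace = np.exp(np.linspace(len(dataset) - 1, 0, num=len(dataset)) * -0.05)

def rollingsum(array):
    weights = weightspace[-len(array):]
    # The dot product of the window and its weights gives the decayed sum
    return np.dot(array, weights)

# Older pandas: a = pd.expanding_apply(dataset, rollingsum)
a = dataset.expanding().apply(rollingsum, raw=True)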
pd.expanding_apply (dataset.expanding().apply in current pandas versions) applies the rollingsum function to an expanding window that ends at each row, calling it len(dataset) times. np.linspace generates an array of size len(dataset) that holds, for each row, how many times it is multiplied by exp(-0.05) relative to the current row.
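To make the weight slicing concrete, here is a small sketch on an all-ones column. The size `n`, the frame `small` and the column name "A" are hypothetical and chosen only for illustration. With every score equal to 1, row t should be the sum of exp(-0.05 * k) for k from 0 to t.

```python
import numpy as np
import pandas as pd

n = 5  # hypothetical small size, for illustration only
# Exponents n-1, ..., 1, 0: the last weight, exp(0) = 1, belongs to the newest row
weightspace = np.exp(-0.05 * np.arange(n - 1, -1, -1))

def rollingsum(window):
    # Keep the last len(window) weights so the newest value is undecayed
    weights = weightspace[-len(window):]
    return np.dot(window, weights)

small = pd.DataFrame({"A": np.ones(n)})
result = small.expanding().apply(rollingsum, raw=True)

# For all-ones input, row t equals 1 + e^-0.05 + ... + e^(-0.05 * t)
expected = np.cumsum(np.exp(-0.05 * np.arange(n)))
print(np.allclose(result["A"].to_numpy(), expected))
```

Slicing from the end of `weightspace` is what lets one precomputed array serve every window length.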
Because it is vectorized, it should be fast:
%timeit a = pd.expanding_apply(dataset, rollingsum)
10 loops, best of 3: 25.5 ms per loop
This compares with (note I'm using python 3 and had to make a change to the behaviour on the first row...):
def multipleApply(df):
    for j, val in df.items():  # iteritems() in older pandas
        for i, row in enumerate(val):
            if i == 0:
                continue  # the first row has no previous day to decay
            df[j].iloc[i] = row + val[i-1]*np.exp(-0.05)
The result:
In[68]: %timeit multipleApply(dataset)
1 loops, best of 3: 414 ms per loop