Problem description
This question differs from the similar ones I could find because I am trying to combine a lookback window and a threshold in a single rolling sum. I'm not sure that what I'm trying to do is achievable in one step:
I have a pandas dataframe with a datetime column and a value column. I have created a column that sums the value column (V) over a rolling time window. However, I would like this rolling sum to reset to 0 once it reaches a certain threshold.
I don't know whether this can be done in one column manipulation step, since two conditions are in play at each step of the sum: the lookback window and the threshold. If anyone has ideas about whether this is possible and how I might achieve it, please let me know. I know how to do it iteratively, but that is very slow (my dataframe has more than 1 million entries).
Example:
Lookback time: 3 minutes
Threshold: 3
+----+---------------------+---+---------+----------------+
|    | myDate              | V | rolling | desired_column |
+----+---------------------+---+---------+----------------+
| 1  | 2020-04-01 10:00:00 | 0 | 0       | 0              |
| 2  | 2020-04-01 10:01:00 | 1 | 1       | 1              |
| 3  | 2020-04-01 10:02:00 | 2 | 3       | 3              |
| 4  | 2020-04-01 10:03:00 | 1 | 4       | 1              |
| 5  | 2020-04-01 10:04:00 | 0 | 4       | 1              |
| 6  | 2020-04-01 10:05:00 | 4 | 7       | 5              |
| 7  | 2020-04-01 10:06:00 | 1 | 6       | 1              |
| 8  | 2020-04-01 10:07:00 | 1 | 6       | 2              |
| 9  | 2020-04-01 10:08:00 | 0 | 6       | 0              |
| 10 | 2020-04-01 10:09:00 | 3 | 5       | 5              |
+----+---------------------+---+---------+----------------+
In this example, the rolling sum does not take into account any values on or before a row where the sum reaches or exceeds the threshold of 3.
Recommended answer
The following approach is by no means memory-efficient, but it should be faster than looping. It assumes the timestamps are evenly spaced with no gaps, so that the work can be delegated to NumPy methods; otherwise, add the missing timestamps before calling it.
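import numpy as np
import pandas as pd

def rolling_window(a, window):
    # Left-pad with zeros so every row gets a full window (only for 1d input).
    b = np.concatenate((np.zeros(window - 1), a))
    return np.array([b[..., i:i + window] for i in range(a.size)])

def dynamic_window(w: np.ndarray, reset):
    # For each position, the cumulative sum of the values before it in the
    # window, floor-divided by the threshold, labels the "region" it is in.
    regions = np.hstack([
        np.zeros((w.shape[0], 1)),
        np.cumsum(w, axis=-1)[:, :-1]
    ]) // reset
    # Keep only the values in the same region as the last element of each window.
    return w * (regions == regions[:, -1][:, np.newaxis])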
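Since the answer notes that the approach is not memory-efficient, here is a minimal sketch of one way to avoid materializing the windows in a Python list comprehension. It assumes NumPy 1.20 or later, where `numpy.lib.stride_tricks.sliding_window_view` is available; the name `rolling_window_view` is made up for this illustration, and `dynamic_window` is copied from the answer so the block runs on its own. The saving is only partial, because `dynamic_window` still allocates full-size intermediate arrays.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_window_view(a, window):
    # Hypothetical drop-in for rolling_window: same zero padding, but the
    # windows are a read-only strided view instead of a copied 2-D array.
    b = np.concatenate((np.zeros(window - 1), a))
    return sliding_window_view(b, window)


def dynamic_window(w, reset):
    # Copied from the answer so this sketch is self-contained.
    regions = np.hstack([
        np.zeros((w.shape[0], 1)),
        np.cumsum(w, axis=-1)[:, :-1]
    ]) // reset
    return w * (regions == regions[:, -1][:, np.newaxis])


v = np.array([0, 1, 2, 1, 0, 4, 1, 1, 0, 3])
result = dynamic_window(rolling_window_view(v, 4), 3).sum(axis=-1)
print(result.tolist())
```

For the sample values this prints the same numbers as the `desired_column` in the answer's output.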
Use the two functions like this:
# sample df
# please always provide a callable line of code
# you could get it with `df.head(10).to_dict('split')`
df = pd.DataFrame({
    # 'min' is the minute alias; the older 'T' alias is deprecated in recent pandas
    'myDate': pd.date_range('2020-04-01 10:00', periods=10, freq='min'),
    'V': [0, 1, 2, 1, 0, 4, 1, 1, 0, 3]
})

# include all time increments
df = pd.concat([
    df,
    pd.DataFrame(pd.date_range(df['myDate'].min(),
                               df['myDate'].max(), freq='min'), columns=['myDate'])
]).drop_duplicates(subset=['myDate']).fillna(0).sort_values('myDate')

# a 3-minute lookback that includes the current row spans 4 one-minute rows
df['4min_sum'] = df.rolling('4min', on='myDate')['V'].sum()

# use the functions
df['desired_column'] = dynamic_window(
    rolling_window(df['V'].to_numpy(), 4),
    3).sum(axis=-1)
Output
myDate V 4min_sum desired_column
0 2020-04-01 10:00:00 0.0 0.0 0.0
1 2020-04-01 10:01:00 1.0 1.0 1.0
2 2020-04-01 10:02:00 2.0 3.0 3.0
3 2020-04-01 10:03:00 1.0 4.0 1.0
4 2020-04-01 10:04:00 0.0 4.0 1.0
5 2020-04-01 10:05:00 4.0 7.0 4.0
6 2020-04-01 10:06:00 1.0 6.0 1.0
7 2020-04-01 10:07:00 1.0 6.0 2.0
8 2020-04-01 10:08:00 0.0 6.0 0.0
9 2020-04-01 10:09:00 3.0 5.0 5.0
Notice that at 10:05 it outputs 4 instead of the 5 in your expected output. According to your logic it should be 4: that window contains [2, 1, 0, 4] and, since the first two numbers sum to 3, the window should reset and return 0 + 4.
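To make the 10:05 case concrete, the following sketch walks through the same arithmetic that `dynamic_window` performs, but for that single window only. The variable names (`before`, `regions`, `keep`) are chosen here for illustration and do not appear in the answer.

```python
import numpy as np

window = np.array([2, 1, 0, 4])  # V at 10:02, 10:03, 10:04, 10:05
threshold = 3

# Cumulative sum of the values that come before each position in the window.
before = np.concatenate(([0], np.cumsum(window)[:-1]))  # [0, 2, 3, 3]

# Floor division by the threshold labels each position with a "region";
# the label increases once the running total has reached the threshold.
regions = before // threshold  # [0, 0, 1, 1]

# Only values in the same region as the last element contribute to the sum.
keep = regions == regions[-1]  # [False, False, True, True]
result = int((window * keep).sum())
print(result)  # 4
```

The first two values (2 and 1) fall in region 0 and are dropped, leaving 0 + 4.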