Problem description
My question is best described by an example. Say t is the time index and x is the data, and we have the input:
t = [1,2,3, 7,9,11, 17,18,20]
x = [1,2,3, 4,5,6, 7,8,9]
s = ['P', 'P', 'N', 'N', 'N', 'N', 'P', 'P', 'P']
window = 2
Desired output:
t1 = [1, 3, 7, 17]
x1 = [3, -3, -15, 24]
That is, I want to cluster the x's so that two consecutive samples go into the same cluster if their timestamps differ by <= window and they have the same s value, and then add up all the x values in each cluster. Clusters whose s value is N are made negative. The timestamp of the first sample in each cluster is taken as the time for that cluster.
How do I do this in pandas?
Explanation of the example: the clusters, written in terms of x, are (1,2), (3), (4,5,6), (7,8,9). (3) has to be in its own cluster because, even though it is close to its predecessor, it has a different s value, and its successor is too far away (7 - 3 = 4 > window). (4,5,6) all have s value N, so the value assigned to that cluster is -(4+5+6) = -15.
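For reference, the rule can be spelled out as a plain Python loop, which pins down the expected output before any pandas is involved. This is only a sketch of the rule as stated in the question; the function name `cluster_events` is hypothetical.

```python
def cluster_events(t, x, s, window):
    """Group consecutive samples with equal s and time gaps <= window."""
    t1, x1 = [], []
    for i in range(len(t)):
        # Samples with s == 'N' count as negative.
        sign = -1 if s[i] == 'N' else 1
        same_cluster = (
            i > 0
            and s[i] == s[i - 1]
            and t[i] - t[i - 1] <= window
        )
        if same_cluster:
            # Extend the current cluster's running sum.
            x1[-1] += sign * x[i]
        else:
            # Start a new cluster, stamped with this sample's time.
            t1.append(t[i])
            x1.append(sign * x[i])
    return t1, x1


t = [1, 2, 3, 7, 9, 11, 17, 18, 20]
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
s = ['P', 'P', 'N', 'N', 'N', 'N', 'P', 'P', 'P']

t1, x1 = cluster_events(t, x, s, window=2)
print(t1)  # [1, 3, 7, 17]
print(x1)  # [3, -3, -15, 24]
```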
Recommended answer
Here's a start. Given a dataframe of your values, add three new columns that hold the next row's values (the data shifted up by one row). Also add a signed version of x.
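import numpy as np
import pandas as pd
df = pd.DataFrame({'t': t, 'x': x, 's': s})
# Select the columns by name so each shifted column lands in its matching *_1 column.
df[['t_1', 'x_1', 's_1']] = df[['t', 'x', 's']].shift(-1)
df['x_signed'] = np.where(df['s'] == 'N', -df['x'], df['x'])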
Add a boolean column that marks the cluster boundaries, based on your two conditions. A row is True when the next row has a different s or is more than window later, so the next row starts a new cluster.
df['cluster'] = (df['s'] != df['s_1']) | (df['t_1'] - df['t'] > window)
Convert this into group numbers: shift the flags down one row so that each flag sits on the first row of its cluster, fill the first value with False (group 0), convert to integers, and take a cumulative sum.
df['cluster'] = df['cluster'].shift(1, fill_value=False).astype(int).cumsum()
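To see what these two steps produce on the sample data, here is a small trace. It is a sketch that works on standalone Series rather than the helper columns, and the names `is_last` and `trace` are only illustrative.

```python
import pandas as pd

t = pd.Series([1, 2, 3, 7, 9, 11, 17, 18, 20])
s = pd.Series(['P', 'P', 'N', 'N', 'N', 'N', 'P', 'P', 'P'])
window = 2

# True where the NEXT row differs in s or is more than `window` later,
# i.e. where the current row is the last one in its cluster.
is_last = (s != s.shift(-1)) | (t.shift(-1) - t > window)

# Move each flag down one row so it marks the first row of a cluster,
# then count the flags seen so far to number the clusters.
cluster = is_last.shift(1, fill_value=False).astype(int).cumsum()

trace = pd.DataFrame({'t': t, 's': s, 'is_last': is_last, 'cluster': cluster})
print(trace)
```

On this input the cluster column comes out as 0, 0, 1, 2, 2, 2, 3, 3, 3, which matches the grouping (1,2), (3), (4,5,6), (7,8,9) from the question.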
From there, it's pretty easy to group by cluster and get your output.
In [72]: df.groupby('cluster').agg({'t':'first', 'x_signed':'sum'})
Out[72]:
          t  x_signed
cluster
0         1         3
1         3        -3
2         7       -15
3        17        24
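The steps above can also be collected into a single function. The following is a sketch rather than the answer's exact code: it compares each row with the previous one (via `shift()` and `diff()`) instead of the next one, so no helper columns are needed. The function name `bucket_events` and the output column name `x` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def bucket_events(df, window):
    """Sum signed x per cluster (hypothetical helper, not from the answer)."""
    # A row starts a new cluster when s changes or the gap to the
    # previous row exceeds `window`. Cluster ids only need to be
    # distinct per cluster, so the first row's flag does not matter.
    new_cluster = (df['s'] != df['s'].shift()) | (df['t'].diff() > window)
    cluster = new_cluster.cumsum().rename('cluster')
    # Rows with s == 'N' contribute negatively.
    x_signed = np.where(df['s'] == 'N', -df['x'], df['x'])
    return (
        df.assign(x_signed=x_signed)
          .groupby(cluster)
          .agg(t=('t', 'first'), x=('x_signed', 'sum'))
    )


df = pd.DataFrame({
    't': [1, 2, 3, 7, 9, 11, 17, 18, 20],
    'x': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    's': ['P', 'P', 'N', 'N', 'N', 'N', 'P', 'P', 'P'],
})
result = bucket_events(df, window=2)
print(result)
```

Here `result['t']` holds the first timestamp of each cluster and `result['x']` the signed sums, so for the sample data they should match t1 and x1 from the question.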