Problem description
I have the following dataframe df:
import pandas as pd

data = {'id': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
        'value': [2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]}
df = pd.DataFrame.from_dict(data)
df
Out[8]:
    id  value
0    1      2
1    1      2
2    1      3
3    1      2
4    1      2
5    1      2
6    1      3
7    1      3
8    1      3
9    1      3
10   2      1
11   2      4
12   2      1
13   2      1
14   2      1
15   2      4
16   2      4
17   2      1
18   2      1
19   2      1
20   2      1
21   2      1
What I need to do is identify, at the id level (df.groupby('id')), where the value shows the same number 3 or more times in a row.
I would like to get the following result:
df
Out[12]:
    id  value  flag
0    1      2     0
1    1      2     0
2    1      3     0
3    1      2     1
4    1      2     1
5    1      2     1
6    1      3     1
7    1      3     1
8    1      3     1
9    1      3     1
10   2      1     0
11   2      4     0
12   2      1     1
13   2      1     1
14   2      1     1
15   2      4     0
16   2      4     0
17   2      1     1
18   2      1     1
19   2      1     1
20   2      1     1
21   2      1     1
I have tried variations of groupby and lambda using pandas rolling.mean, comparing the mean of the rolling window with 'value' and setting a flag where the two are equal. This has several problems. First, different values can average to the value you are trying to flag. Second, I can't figure out how to flag all of the rows in the rolling window that produced the initial flag. My code below identifies the 'right side' of each run, but I would still need to back-fill the earlier rows covered by the window:
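import numpy as np

test = df.copy()
# Rolling mean over a 3-row window within each id, then flag rows where it equals the value
test['rma'] = test.groupby('id')['value'].transform(lambda x: x.rolling(min_periods=3, window=3).mean())
test['flag'] = np.where(test.rma == test.value, 1, 0)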
The result is here:
test
Out[61]:
    id  value       rma  flag
0    1      2       NaN     0
1    1      2       NaN     0
2    1      3  2.333333     0
3    1      2  2.333333     0
4    1      2  2.333333     0
5    1      2  2.000000     1
6    1      3  2.333333     0
7    1      3  2.666667     0
8    1      3  3.000000     1
9    1      3  3.000000     1
10   2      1       NaN     0
11   2      4       NaN     0
12   2      1  2.000000     0
13   2      1  2.000000     0
14   2      1  1.000000     1
15   2      4  2.000000     0
16   2      4  3.000000     0
17   2      1  3.000000     0
18   2      1  2.000000     0
19   2      1  1.000000     1
20   2      1  1.000000     1
21   2      1  1.000000     1
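The first problem mentioned above (different values averaging to the flagged value) does not happen to show up in this particular data, so here is a minimal sketch on made-up data. The frame `demo` and its values are hypothetical and only serve to illustrate the false positive:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a single id in which no value ever repeats
demo = pd.DataFrame({'id': [1, 1, 1], 'value': [1, 3, 2]})

# Same rolling-mean approach as in the question
demo['rma'] = demo.groupby('id')['value'].transform(
    lambda x: x.rolling(window=3, min_periods=3).mean()
)
demo['flag'] = np.where(demo.rma == demo.value, 1, 0)

# The last row gets flagged because mean(1, 3, 2) == 2,
# even though there is no run of identical values
print(demo)
```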
Can't wait to see what I am missing! Thanks
Recommended answer
You can try this: 1) create an extra grouping variable with df.value.diff().ne(0).cumsum() that increments every time the value changes; 2) use transform('size') to calculate the size of each group and compare it with three, which gives you the flag column you need:
df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int)
df
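For anyone who wants to run this end to end, the one-liner can be wrapped as below. This is only a sketch: the helper name `flag_runs` and its `min_len` parameter are made up for illustration, and the expected list is copied from the desired output in the question.

```python
import pandas as pd

def flag_runs(frame, min_len=3):
    """Hypothetical helper: 1 where 'value' repeats at least min_len times in a row within an 'id'."""
    # New run id every time the value differs from the previous row
    run_id = frame['value'].diff().ne(0).cumsum()
    # Size of each (id, run) group, broadcast back to every row
    run_len = frame['value'].groupby([frame['id'], run_id]).transform('size')
    return run_len.ge(min_len).astype(int)

df = pd.DataFrame({
    'id':    [1] * 10 + [2] * 12,
    'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3,
              1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1],
})
df['flag'] = flag_runs(df)

# Flags from the desired output in the question
expected = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
            0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
print(df['flag'].tolist() == expected)
```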
Breakdown:
1) diff not equal to zero (which is literally what df.value.diff().ne(0) means) gives True whenever the value changes. The first row is also True, because its diff is NaN, which is not equal to zero:
df.value.diff().ne(0)
#0 True
#1 False
#2 True
#3 True
#4 False
#5 False
#6 True
#7 False
#8 False
#9 False
#10 True
#11 True
#12 True
#13 False
#14 False
#15 True
#16 False
#17 True
#18 False
#19 False
#20 False
#21 False
#Name: value, dtype: bool
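Note that diff only works on numeric data. If the column held strings or other non-numeric values, one possible variation (not part of the original answer) is to compare each row with the previous one via shift. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical non-numeric column, where diff() would not work
s = pd.Series(['a', 'a', 'b', 'b', 'b', 'a'])

# True wherever the value differs from the previous row;
# the first row compares against NaN and is therefore True
changed = s.ne(s.shift())
print(changed.tolist())

# On numeric data the two approaches mark the same change points
n = pd.Series([2, 2, 3, 2])
print(n.ne(n.shift()).equals(n.diff().ne(0)))
```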
2) Then cumsum gives a non-decreasing sequence of ids, where each id denotes one consecutive chunk of identical values. Note that when summing boolean values, True counts as one and False as zero:
df.value.diff().ne(0).cumsum()
#0 1
#1 1
#2 2
#3 3
#4 3
#5 3
#6 4
#7 4
#8 4
#9 4
#10 5
#11 6
#12 7
#13 7
#14 7
#15 8
#16 8
#17 9
#18 9
#19 9
#20 9
#21 9
#Name: value, dtype: int64
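To see how these run ids turn into run lengths, here is a small sketch on a shortened, made-up series (the first six values of id 1):

```python
import pandas as pd

s = pd.Series([2, 2, 3, 2, 2, 2])

# Each change point bumps the counter, so every run gets its own id
run_id = s.diff().ne(0).cumsum()
print(run_id.tolist())

# Grouping by the run id and taking the size gives the length of each run
run_sizes = s.groupby(run_id).size()
print(run_sizes.tolist())
```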
3) Combined with the id column, you can group the data frame by both keys, calculate each group's size with transform('size'), and compare it with three to get the flag column.
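The id key matters when a run of identical values straddles two ids. A sketch on made-up data, where the same value appears twice at the end of one id and twice at the start of the next:

```python
import pandas as pd

# Hypothetical data: four identical values split evenly across two ids
demo = pd.DataFrame({'id': [1, 1, 2, 2], 'value': [5, 5, 5, 5]})

run_id = demo['value'].diff().ne(0).cumsum()

# Grouping by the run id alone sees one run of length 4 and flags every row
flag_without_id = demo['value'].groupby(run_id).transform('size').ge(3).astype(int)

# Adding id as a key splits it into two runs of length 2, so nothing is flagged
flag_with_id = demo['value'].groupby([demo['id'], run_id]).transform('size').ge(3).astype(int)

print(flag_without_id.tolist())
print(flag_with_id.tolist())
```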