Problem description
I have the following data frame in Python:
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                             'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]),
                                                    np.repeat([1, 2, 3], [2, 2, 2])]),
                             'obj': list('AB' * 6),
                             'var': [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1]})
First, within each group defined by obj, I'd like to assign an ID to each run of consecutive rows that share the same measurement_id and var values. Whenever either of those columns changes, a new run starts and should get a new ID. So the result should be:
df['rleid_output'] = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 3]
Then, for each group defined by rleid_output, I'd like to check how many minutes (the min column) the run lasted, which gives the expected_output column:
df['expected_output'] = [2, 2, 2, 2, 1, 1, 2, 3, 2, 3, 1, 3]
If it were R, I'd proceed as follows:
library(dplyr)

df <- data.frame(measurement_id = rep(1:2, each = 6),
                 min = rep(rep(1:3, each = 2), 2),
                 object = rep(LETTERS[1:2], 6),
                 var = c(1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1))
df %>%
  group_by(object) %>%
  mutate(rleid = data.table::rleid(measurement_id, var)) %>%
  group_by(object, rleid) %>%
  mutate(expected_output = last(min) - first(min) + 1)
So the main thing I need is an equivalent of R's data.table::rleid that works with pandas' pd.DataFrame.groupby. Any ideas how to solve this?
Edit: a new, updated example of the data frame:
df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                             'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]),
                                                    np.repeat([1, 2, 3], [2, 2, 2])]),
                             'obj': list('AB' * 6),
                             'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1]})
df['rleid_output'] = [1, 1, 2, 1, 3, 2, 4, 3, 4, 3, 5, 3]
df['expected_output'] = [1, 2, 1, 2, 1, 1, 2, 3, 2, 3, 1, 3]
Recommended answer
Updated answer
The catch is that the min values within each measurement_id, obj, var group have to be consecutive. We can check this by grouping on measurement_id, obj, var and looking at the difference between successive min values: a row counts towards the duration only if its min is exactly 1 more than the previous row's (the first row of each group always counts). Summing those flags per group gives expected_output:
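keys = ['measurement_id', 'obj', 'var']
# Flag rows whose min is exactly 1 more than the previous row of the same group;
# fillna(1) makes the first row of each group count as well.
# transform (rather than apply) keeps the original index, so the result
# aligns with df on assignment in recent pandas versions.
df['grouper'] = (df.groupby(keys)['min']
                   .transform(lambda x: x.diff().fillna(1).eq(1)))
# The number of flagged rows per group is the duration in minutes.
df['expected_output'] = df.groupby(keys)['grouper'].transform('sum').astype(int)
df = df.drop(columns='grouper')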
    measurement_id  min obj  var  expected_output
0                1    1   A    1                1
1                1    1   B    2                2
2                1    2   A    2                1
3                1    2   B    2                2
4                1    3   A    1                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3
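To see what the diff-based flag does on its own, here is a minimal sketch on a single hypothetical group whose min values are 1, 3, 4 (made-up data; the names s and flags are illustrative only):

```python
import pandas as pd

# Hypothetical min values for one (measurement_id, obj, var) group.
s = pd.Series([1, 3, 4])

# diff() is NaN for the first row; fillna(1) makes that row count.
# eq(1) is True only where min advanced by exactly one.
flags = s.diff().fillna(1).eq(1)

print(flags.tolist())
print(int(flags.sum()))
```

The jump from 1 to 3 is not flagged, so this group would get a duration of 2 rather than 3.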
Old answer, following the OP's logic
We can get your rleid_output with GroupBy.diff: it is basically an identifier that changes each time var changes within a measurement_id and obj group.
After that, use GroupBy.nunique to count the distinct min values in each run:
rleid_output = df.groupby(['measurement_id', 'obj'])['var'].diff().abs().bfill()
df['expected_output'] = (df.groupby(['measurement_id', 'obj', rleid_output])['min']
.transform('nunique'))
    measurement_id  min obj  var  expected_output
0                1    1   A    1                2
1                1    1   B    2                2
2                1    2   A    1                2
3                1    2   B    2                2
4                1    3   A    2                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3
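For comparison, a closer analogue of group_by() + data.table::rleid() can be sketched in pandas by comparing each row with the previous row of the same obj group and taking a grouped cumulative sum of the change flags. This is only a sketch and not part of the answers above: the column name rleid and the helper names cols, prev, changed and g are illustrative, and the data is the updated example from the question.

```python
import numpy as np
import pandas as pd

# Updated example data frame from the question.
df = pd.DataFrame({'measurement_id': np.repeat([1, 2], [6, 6]),
                   'min': np.tile(np.repeat([1, 2, 3], 2), 2),
                   'obj': list('AB' * 6),
                   'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1]})

cols = ['measurement_id', 'var']

# Previous row's values within the same obj group (NaN for the first row).
prev = df.groupby('obj')[cols].shift()

# A new run starts wherever any of the columns differs from the previous row;
# the first row of each group compares against NaN and so always starts a run.
changed = df[cols].ne(prev).any(axis=1)

# Run-length ID: cumulative count of run starts within each obj group.
df['rleid'] = changed.astype(int).groupby(df['obj']).cumsum()

# Duration of each run, mirroring last(min) - first(min) + 1 from the R code.
g = df.groupby(['obj', 'rleid'])['min']
df['expected_output'] = g.transform('last') - g.transform('first') + 1

print(df)
```

On this data the rleid and expected_output columns should agree with the rleid_output and expected_output lists given in the question's edit.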