问题描述
考虑以下示例数据框:
> df
id name time
1 1 b 10
2 1 b 12
3 1 a 0
4 2 a 5
5 2 b 11
6 2 a 9
7 2 b 7
8 1 a 15
9 2 b 1
10 1 a 3
df = structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L),
name = c("b", "b", "a", "a", "b", "a", "b", "a", "b", "a"
), time = c(10L, 12L, 0L, 5L, 11L, 9L, 7L, 15L, 1L, 3L)), .Names = c("id",
"name", "time"), row.names = c(NA, -10L), class = "data.frame")
我需要识别并记录所有序列 seq< - c(a ,b)
,其中a根据时间列在b之前为每个id。 a和b之间不允许使用其他名称。实际序列长度至少为5.
样本数据的预期结果为
I need to identify and record all sequences seq <- c("a","b")
, where "a" precedes "b" based on "time" column, for each id. No other names between "a" and "b" are permitted. Real sequence length is at least 5.The expected result for the sample data is
a b
1 3 10
2 5 7
3 9 11
有一个类似的问题。但是,对于我来说,如何处理id栏目并不清楚。是否可以使用dplyr解决问题?
There is a similar question Finding rows in R dataframe where a column value follows a sequence. However, it is not clear to me how to deal with "id" column in my case. Is it a way to solve the problem using "dplyr"?
推荐答案
library(dplyr); library(tidyr)
# sort data frame by id and time
df %>% arrange(id, time) %>% group_by(id) %>%
# get logical vector indicating rows of a followed by b and mark each pair as unique
# by cumsum
mutate(ab = name == "a" & lead(name) == "b", g = cumsum(ab)) %>%
# subset rows where conditions are met
filter(ab | lag(ab)) %>%
# reshape your data frame to wide format
select(-ab) %>% spread(name, time)
#Source: local data frame [3 x 4]
#Groups: id [2]
# id g a b
#* <int> <int> <int> <int>
#1 1 1 3 10
#2 2 1 5 7
#3 2 2 9 11
如果序列长度大于2,那么您将需要检查多个滞后,并且其中一个选项是使用 shift
从 data.table
结合减少
函数(接受向量作为滞后/领先步骤) / code>,说如果我们需要检查模式 abb
:
If length of the sequence is larger than two, then you will need to check multiple lags, and one option of this is to use shift
function(which accepts a vector as lag/lead steps) from data.table
combined with Reduce
, say if we need to check pattern abb
:
library(dplyr); library(tidyr); library(data.table)
pattern = c("a", "b", "b")
len_pattern = length(pattern)
df %>% arrange(id, time) %>% group_by(id) %>%
# same logic as before but use Reduce function to check multiple lags condition
mutate(ab = Reduce("&", Map("==", shift(name, n = 0:(len_pattern - 1), type = "lead"), pattern)),
g = cumsum(ab)) %>%
# use reduce or to subset sequence rows having the same length as the pattern
filter(Reduce("|", shift(ab, n = 0:(len_pattern - 1), type = "lag"))) %>%
# make unique names
group_by(g, add = TRUE) %>% mutate(name = paste(name, 1:n(), sep = "_")) %>%
# pivoting the table to wide format
select(-ab) %>% spread(name, time)
#Source: local data frame [1 x 5]
#Groups: id, g [1]
# id g a_1 b_2 b_3
#* <int> <int> <int> <int> <int>
#1 1 1 3 10 12
这篇关于R - 在数据帧中按组识别行元素的序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!