Problem description

I have a data frame ("df") containing multiple time series (value ~ time) whose observations are grouped by 3 factors: temp, rep, and species. These data need to be trimmed at the lower and upper ends of each time series, but the threshold values are group-conditional (e.g. remove observations below 2 and above 10 where temp = 10, rep = 2, and species = "A"). I have an accompanying data frame (df_thresholds) that contains the grouping values and the minimum and maximum I want to use for each group. Not all groups need trimming (I would like to update this file regularly, and it would guide where to trim df). Can anybody help me conditionally filter out these values by group? I have the following, which is close but not quite there. When I reverse the max and min boolean tests, I get zero observations.

df <- data.frame(species = c(rep("A", 16), rep("B", 16)),
                 temp=as.factor(c(rep(10,4),rep(20,4),rep(10,4),rep(20,4))),
                 rep=as.factor(c(rep(1,8),rep(2,8),rep(1,8),rep(2,8))),
                 time=rep(seq(1:4),4),
                 value=c(1,4,8,16,2,4,9,16,2,4,10,16,2,4,15,16,2,4,6,16,1,4,8,16,1,2,8,16,2,3,4,16))

df_thresholds <- data.frame(species=c("A", "A", "B"), 
                            temp=as.factor(c(10,20,10)),
                            rep=as.factor(c(1,1,2)), 
                            min_value=c(2,4,2),
                            max_value=c(10,10,9))

#desired outcome
df_desired <- df[c(2:3,6:7,9:24,26:27,29:nrow(df)),]


#attempt
library(dplyr)  # provides %>% and filter()
df2 <- df

for (i in 1:nrow(df_thresholds)) {  
  df2 <- df2 %>%
    filter(!(species==df_thresholds$species[i] & temp==df_thresholds$temp[i] & rep==df_thresholds$rep[i] & value>df_thresholds$min_value[i] & value<df_thresholds$max_value[i]))
}

Here's the solution I implemented, following the suggestions below.

df_test <- left_join(df, df_thresholds, by=c('species','temp','rep'))
df_test$min_value[is.na(df_test$min_value)] <- 0
df_test$max_value[is.na(df_test$max_value)] <- 999

df_test2 <- df_test %>%
  filter(value >= min_value & value <= max_value)
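
The join-based solution above relies on the sentinel values 0 and 999 being outside the range of the data. A minimal base R sketch of the same idea is shown here, using -Inf and Inf as the defaults so that groups without a threshold row are never trimmed. The helper names (key_df, key_thr, idx, lo, hi, df_trimmed) are illustrative and not part of the original post, and the "|" separator assumes that character does not occur in the grouping values.

```r
# Same example data as in the question.
df <- data.frame(species = c(rep("A", 16), rep("B", 16)),
                 temp = as.factor(c(rep(10, 4), rep(20, 4), rep(10, 4), rep(20, 4))),
                 rep = as.factor(c(rep(1, 8), rep(2, 8), rep(1, 8), rep(2, 8))),
                 time = rep(1:4, 4),
                 value = c(1, 4, 8, 16, 2, 4, 9, 16, 2, 4, 10, 16, 2, 4, 15, 16,
                           2, 4, 6, 16, 1, 4, 8, 16, 1, 2, 8, 16, 2, 3, 4, 16))

df_thresholds <- data.frame(species = c("A", "A", "B"),
                            temp = as.factor(c(10, 20, 10)),
                            rep = as.factor(c(1, 1, 2)),
                            min_value = c(2, 4, 2),
                            max_value = c(10, 10, 9))

# Build one text key per row from the three grouping columns.
key_df  <- paste(df$species, df$temp, df$rep, sep = "|")
key_thr <- paste(df_thresholds$species, df_thresholds$temp,
                 df_thresholds$rep, sep = "|")

# For each row of df, find its threshold row (NA if the group has none).
idx <- match(key_df, key_thr)
lo <- df_thresholds$min_value[idx]
hi <- df_thresholds$max_value[idx]

# Groups without thresholds get an unbounded range, so nothing is dropped.
lo[is.na(lo)] <- -Inf
hi[is.na(hi)] <- Inf

# Keep only observations inside the (inclusive) range for their group.
df_trimmed <- df[df$value >= lo & df$value <= hi, ]
```

Because match() does the lookup without reordering rows, df_trimmed keeps the original row order and row names of df.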

Recommended answer

We can use mapply:

df[-c(with(df_thresholds, 
      mapply(function(x, y, z, min_x, max_x) 
           which(df$species == x & df$temp == y & df$rep == z & 
              (df$value < min_x | df$value > max_x)),
                 species, temp, rep, min_value, max_value))), ]


#   species temp rep time value
#2        A   10   1    2     4
#3        A   10   1    3     8
#6        A   20   1    2     4
#7        A   20   1    3     9
#9        A   10   2    1     2
#10       A   10   2    2     4
#11       A   10   2    3    10
#12       A   10   2    4    16
#......

In mapply, we pass all the columns of df_thresholds, filter df accordingly, and find the indices of the values that fall outside the minimum and maximum for each threshold row. Those rows are then excluded from the original data frame.

The result of the mapply call is

#[1]  1  4  5  8 25 28

which are the rows we want to exclude from df since they fall out of range.
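
Two edge cases may be worth guarding against when this approach runs on a thresholds file that is updated regularly. If the groups have different numbers of out-of-range rows, mapply returns a list rather than a matrix, and if no row is out of range, indexing with -integer(0) returns zero rows instead of all of them. The sketch below is one possible variant, not the answerer's code: it uses SIMPLIFY = FALSE with unlist() and a logical mask. The names drop_idx, keep and df_clean are illustrative, and the as.character() calls are an added precaution so the comparison does not depend on matching factor levels.

```r
# Same example data as in the question.
df <- data.frame(species = c(rep("A", 16), rep("B", 16)),
                 temp = as.factor(c(rep(10, 4), rep(20, 4), rep(10, 4), rep(20, 4))),
                 rep = as.factor(c(rep(1, 8), rep(2, 8), rep(1, 8), rep(2, 8))),
                 time = rep(1:4, 4),
                 value = c(1, 4, 8, 16, 2, 4, 9, 16, 2, 4, 10, 16, 2, 4, 15, 16,
                           2, 4, 6, 16, 1, 4, 8, 16, 1, 2, 8, 16, 2, 3, 4, 16))

df_thresholds <- data.frame(species = c("A", "A", "B"),
                            temp = as.factor(c(10, 20, 10)),
                            rep = as.factor(c(1, 1, 2)),
                            min_value = c(2, 4, 2),
                            max_value = c(10, 10, 9))

# Collect out-of-range row indices per threshold row.
# SIMPLIFY = FALSE always yields a list, which unlist() flattens safely.
drop_idx <- unlist(with(df_thresholds,
  mapply(function(x, y, z, min_x, max_x)
           which(as.character(df$species) == as.character(x) &
                 as.character(df$temp) == as.character(y) &
                 as.character(df$rep) == as.character(z) &
                 (df$value < min_x | df$value > max_x)),
         species, temp, rep, min_value, max_value,
         SIMPLIFY = FALSE)),
  use.names = FALSE)

# A logical mask keeps every row when drop_idx is empty,
# whereas df[-drop_idx, ] would return zero rows in that case.
keep <- !(seq_len(nrow(df)) %in% drop_idx)
df_clean <- df[keep, ]
```

With the example data, drop_idx contains the same six row numbers listed above, so df_clean matches df_desired from the question.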
