Problem description
I found a solution to the problem below. It works on a small dataset but still produces wrong output on large datasets. Does anyone know why? I can't find the mistake. Here's the code:
df$continuous <-
  unlist(lapply(split(df, df$ID),
                function(x) {
                  sapply(1:nrow(x),
                         function(y) {
                           any(x$start[y] - x$end[-(y:NROW(x$end))] <= 1)
                         })
                }))
ORIGINAL PROBLEM: I'm working on a function to identify gaps in a series of start/end dates. The output should be FALSE if a start date falls more than 1 day after all of the previous end dates.
Data:
df <- data.frame('ID' = c('1','1','1','1','1','1'), 'start' = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05', '2010-01-09','2010-02-01', '2010-02-10')),
'end' = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07', '2010-01-12', '2010-02-10', '2010-02-12')))
This is my attempt to solve this with x = start and y = end:
my_fun <- function(x,y){
any(x[i] - y[1:NROW(i)-1] <= 1)
}
It works well if I specify i, but I haven't managed to wrap this into a loop. Ultimately, this function should be applied to groups in a large dataset in a dplyr manner.
The output should look like this:
ID start end continuous
1 1 2010-01-01 2010-01-03 FALSE #or TRUE
2 1 2010-01-03 2010-01-22 TRUE
3 1 2010-01-05 2010-01-07 TRUE
4 1 2010-01-09 2010-01-12 TRUE
5 1 2010-02-01 2010-02-10 FALSE
6 1 2010-02-10 2010-02-12 TRUE #according to my function or FALSE compared to start[1] would be even better
Thank you very much for your help.
Recommended answer
You can do this using dplyr and lubridate. dplyr has really useful window functions like lag() that are handy for this type of analysis.
library(tidyverse)
library(lubridate)
df %>%
mutate(start - lag(end, 1) == 0)
# ID start end start - lag(end, 1) == 0
# 1 1 2010-01-01 2010-01-03 NA
# 2 1 2010-01-03 2010-01-22 TRUE
# 3 1 2010-01-05 2010-01-07 FALSE
# 4 1 2010-01-09 2010-01-12 FALSE
# 5 1 2010-02-01 2010-02-10 FALSE
# 6 1 2010-02-10 2010-02-12 TRUE
How do you want to handle the first row of your data? Since there is no previous value, it shows NA. This is generally how situations like this should be handled, but I can edit my answer if you'd like it to have a different value.
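Note that the lag() comparison above only looks at the immediately preceding end date and tests for an exact match. If the check should instead cover any of the previous end dates with a 1-day tolerance, as the question describes, a minimal sketch could track the running maximum of earlier end dates within each ID. This is an illustration, not a confirmed fix: the column name prev_max_end is made up here, and sorting by start within each ID assumes that "previous" means "earlier start date". One possible reason the original split()/unlist() version goes wrong on larger data is that split() returns groups in factor-level order, so the unlisted result may not line up with the original row order when rows are not sorted by ID; computing the column inside a grouped mutate() avoids that.

```r
library(dplyr)

df <- data.frame(
  ID    = c("1", "1", "1", "1", "1", "1"),
  start = as.Date(c("2010-01-01", "2010-01-03", "2010-01-05",
                    "2010-01-09", "2010-02-01", "2010-02-10")),
  end   = as.Date(c("2010-01-03", "2010-01-22", "2010-01-07",
                    "2010-01-12", "2010-02-10", "2010-02-12"))
)

result <- df %>%
  group_by(ID) %>%
  # Assumption: "previous" rows are those with an earlier start date
  arrange(start, .by_group = TRUE) %>%
  mutate(
    # Latest end date among all earlier rows of the same ID
    # (NA on the first row). Dates are converted to day counts
    # because cummax() is not defined for Date objects.
    prev_max_end = dplyr::lag(cummax(as.numeric(end))),
    # TRUE if the start is at most 1 day after that latest end date
    continuous = as.numeric(start) - prev_max_end <= 1
  ) %>%
  ungroup()

print(result)
```

Comparing against the running maximum is equivalent to testing any(start - previous_ends <= 1), because the condition holds for some previous end date exactly when it holds for the latest one. The first row of each ID stays NA; wrap the condition in dplyr::coalesce(..., FALSE) or coalesce(..., TRUE) if a definite value is preferred there.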