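I have a large data set of time periods, each defined by a "start" and an "end" column. Some of the periods overlap.

I would like to combine (flatten / merge / collapse) all overlapping periods so that each merged period has a single "start" value and a single "end" value.

Some example data: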

  ID      start        end
1  A 2013-01-01 2013-01-05
2  A 2013-01-01 2013-01-05
3  A 2013-01-02 2013-01-03
4  A 2013-01-04 2013-01-06
5  A 2013-01-07 2013-01-09
6  A 2013-01-08 2013-01-11
7  A 2013-01-12 2013-01-15


Desired result:

  ID      start        end
1  A 2013-01-01 2013-01-06
2  A 2013-01-07 2013-01-11
3  A 2013-01-12 2013-01-15


What I have tried:

  require(dplyr)
  data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"),
    start = structure(c(1356998400, 1356998400, 1357084800, 1357257600,
    1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct",
    "POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200,
    1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct",
    "POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA,
-7L), class = "data.frame")

  remove.overlaps <- function(data){
    data2 <- data
    for ( i in 1:length(unique(data$start))) {
      x3 <- filter(data2, start>=data$start[i] & start<=data$end[i])
      x4 <- x3[1,]
      x4$end <- max(x3$end)
      data2 <- filter(data2, start<data$start[i] | start>data$end[i])
      data2 <- rbind(data2,x4)
    }
    data2 <- na.omit(data2)}

  data <- remove.overlaps(data)

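Best answer

Here is a possible solution. The basic idea is to compare each row's start date with the maximum end date seen "until now" (computed with the cummax function), and to build an index that splits the data into groups of overlapping periods: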

data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  # max(end) rather than last(end): the last row of a group may be a period
  # nested inside an earlier, longer one
  summarise(start = first(start), end = max(end))

# Source: local data frame [3 x 4]
# Groups: ID
#
#   ID indx      start        end
# 1  A    0 2013-01-01 2013-01-06
# 2  A    1 2013-01-07 2013-01-11
# 3  A    2 2013-01-12 2013-01-15
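
The same grouping idea can also be sketched without dplyr. The block below is a minimal base R illustration, not part of the original answer: the helper name merge_overlaps and the data frame periods are hypothetical, and the sketch assumes every row has end >= start and no missing values. As in the pipeline above, a new group starts only when a start lies strictly after every earlier end, so periods that merely touch are merged.

```r
# Hypothetical helper (an illustration, not the answer's code): merge
# overlapping periods per ID using base R only.
merge_overlaps <- function(df) {
  # Sort so that each row only has to be compared with the rows before it.
  df <- df[order(df$ID, df$start), ]
  pieces <- lapply(split(df, df$ID, drop = TRUE), function(g) {
    n <- nrow(g)
    s <- as.numeric(g$start)
    e <- as.numeric(g$end)
    # Running maximum of 'end' over all rows before the current one.
    prev_max_end <- c(-Inf, cummax(e)[-n])
    # A new group begins when 'start' lies beyond every earlier 'end'.
    grp <- cumsum(s > prev_max_end)
    first_row <- which(!duplicated(grp))
    # Row holding the latest 'end' within each group.
    max_row <- as.vector(tapply(seq_len(n), grp,
                                function(i) i[which.max(e[i])]))
    data.frame(ID = g$ID[first_row],
               start = g$start[first_row],
               end = g$end[max_row])
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# The example data from the question, rebuilt by hand.
periods <- data.frame(
  ID    = "A",
  start = as.POSIXct(c("2013-01-01", "2013-01-01", "2013-01-02", "2013-01-04",
                       "2013-01-07", "2013-01-08", "2013-01-12"), tz = "UTC"),
  end   = as.POSIXct(c("2013-01-05", "2013-01-05", "2013-01-03", "2013-01-06",
                       "2013-01-09", "2013-01-11", "2013-01-15"), tz = "UTC")
)

merged <- merge_overlaps(periods)
print(merged)
```

For the example data the group index works out to 1, 1, 1, 1, 2, 2, 3, which yields the three periods listed in the desired result. Taking the latest end within each group, rather than the end of the last row, keeps the result correct when a short period is nested inside a longer one.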

Regarding r - how to flatten/merge overlapping time periods, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28938147/
