Problem description
A few weeks ago, a generous respondent here on StackOverflow gave me some nice code. I've since identified (if not created myself) a bug in it that is proving difficult to resolve, and I could use some fresh assistance today.
Sample data (called object eh below):
ID 2013-03-20 2013-04-09 2013-04-11 2013-04-17 2013-04-25 2013-05-15 2013-05-24 2013-05-25 2013-05-26
5167f 0 0 0 0 0 0 0 0 0
1214m 0 0 0 0 0 0 0 0 0
1844f 0 0 0 0 0 0 0 0 0
2113m 0 0 0 0 0 0 0 0 0
2254m 0 0 0 0 0 0 0 0 0
2721f 0 0 0 0 0 0 0 0 0
3121f 0 0 0 0 0 0 0 0 0
3486f 0 0 0 0 0 0 0 0 0
3540f 0 0 0 0 0 0 0 0 0
4175m 0 0 0 0 0 0 0 0 0
I need to be able to group the 0s and 1s by the time period in which their respective column date falls (e.g., every 1, 2, 3, or 4 weeks). Whenever a 1 falls at least once within a specific date range (Period), a 1 is summarized for that ID in that Period (otherwise 0).
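To make the summary rule concrete, here is a minimal base R sketch on made-up data (the toy data frame and its values are hypothetical, not taken from eh). Taking the maximum of the 0/1 values within each ID and Period yields 1 exactly when at least one 1 occurs there:

```r
# Hypothetical toy data: two IDs, three observations each, already labelled by Period
toy <- data.frame(
  ID     = c("a", "a", "a", "b", "b", "b"),
  Period = c(1, 1, 2, 1, 1, 2),
  Value  = c(0, 1, 0, 0, 0, 0)
)

# max() over 0/1 values is 1 if any value in the group is 1, else 0
summary_mat <- with(toy, tapply(Value, list(ID, Period), max))
summary_mat
```

Rows of the result are IDs and columns are Periods, which is the same layout as the desired output.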
I'm starting with the 1-week summary routine as an example. My main problem is that the final output lacks some of the possible 1-week Periods in the time series from "2013-03-20" to "2015-12-31".
Notice in this example output, in which rows are unique IDs and columns are unique Periods, that Periods 2, 5, 7, and 9 are missing:
1 3 4 6 8 10 11 12 13 14
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Here is the full routine for grouping the original data frame (see the sample data shared above):
# Packages used below
library(data.table)
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

# Convert the original data frame, eh, to a data table
dt <- as.data.table(eh)

# One-week summarized encounter histories
dt_merge <- data_frame(
  # Create a column showing the beginning date of each period
  Date1 = seq(from = ymd("2013-03-20"), to = ymd("2015-12-31"), by = "1 week")) %>%
  # Create a column showing the end date of each period
  mutate(Date2 = lead(Date1)) %>%
  # Adjust Date1 so that consecutive periods do not share a day
  mutate(Date1 = if_else(Date1 == ymd("2013-03-20"), Date1, Date1 + 1)) %>%
  # Remove the last row, which has no end date
  drop_na(Date2) %>%
  # Create the list of dates in each period
  mutate(Dates = map2(Date1, Date2, function(x, y){ seq(x, y, by = "day") })) %>%
  unnest() %>%
  # Create Group ID
  mutate(RunID = group_indices_(., .dots = c("Date1", "Date2"))) %>%
  # Create Period ID
  mutate(Period = paste0(RunID)) %>%
  # Add a column showing Month
  mutate(Month = month(Dates)) %>%
  # Add a column showing Year
  mutate(Year = year(Dates)) %>%
  # Add a column showing Season
  mutate(Season = case_when(
    Month %in% 3:5 ~ "Spring",
    Month %in% 6:8 ~ "Summer",
    Month %in% 9:11 ~ "Fall",
    Month %in% c(12, 1, 2) ~ "Winter",
    TRUE ~ NA_character_
  )) %>%
  # Combine Season and Year
  mutate(SeasonYear = paste0(Season, Year)) %>%
  select(-Date1, -Date2, -RunID)

dt2 <- dt %>%
  # Reshape the data frame to long format
  gather(Date, Value, -ID) %>%
  # Convert Date to date class
  mutate(Date = ymd(Date)) %>%
  # Join dt_merge
  left_join(dt_merge, by = c("Date" = "Dates"))

one.week <- dt2 %>%
  group_by(ID, Period) %>%
  summarise(Value = max(Value)) %>%
  spread(Period, Value)

# Finished product
one.week <- as.data.frame(one.week)

# Missing weeks 2, 5, 7, and 9...
one.week
Can someone help me understand where I've gone wrong? Thanks in advance!
-AD
Recommended answer
This is happening because those weeks are missing from the eh data. For example, look at the dates that make up week 2:
dt_merge %>%
  filter(Period == 2)
#> # A tibble: 7 x 6
#> Dates Period Month Year Season SeasonYear
#> <date> <chr> <dbl> <dbl> <chr> <chr>
#> 1 2013-03-28 2 3 2013 Spring Spring2013
#> 2 2013-03-29 2 3 2013 Spring Spring2013
#> 3 2013-03-30 2 3 2013 Spring Spring2013
#> 4 2013-03-31 2 3 2013 Spring Spring2013
#> 5 2013-04-01 2 4 2013 Spring Spring2013
#> 6 2013-04-02 2 4 2013 Spring Spring2013
#> 7 2013-04-03 2 4 2013 Spring Spring2013
You can see that none of these dates are among the columns of eh, which skip from 2013-03-20 to 2013-04-09. Because you use a left_join when creating dt2, only the dates (and therefore weeks) present in eh are retained.
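As a quick check of this explanation, the sketch below assigns each column date from the sample data to a 1-week period and lists the periods that contain no date at all. It uses a closed-form shortcut rather than the dt_merge table: that shortcut is my assumption, chosen to match the question's boundaries (period 1 runs from 2013-03-20 through 2013-03-27, and each later period starts the day after the previous one ends).

```r
# Column dates taken from the sample data shown in the question
eh_dates <- as.Date(c("2013-03-20", "2013-04-09", "2013-04-11", "2013-04-17",
                      "2013-04-25", "2013-05-15", "2013-05-24", "2013-05-25",
                      "2013-05-26"))

first_day  <- as.Date("2013-03-20")
day_offset <- as.numeric(eh_dates - first_day)

# Offsets 0..7 fall in period 1, 8..14 in period 2, 15..21 in period 3, and so on
period <- pmax(1, ceiling(day_offset / 7))

# Periods up to the last observed one that contain no column date
observed_periods <- sort(unique(period))
missing_periods  <- setdiff(seq_len(max(observed_periods)), observed_periods)
missing_periods
# → 2 5 7 9
```

These are the same Periods that are absent from the question's output, which is consistent with the left_join dropping every week that has no matching date in eh.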
This can be corrected by using complete() from the tidyr package to create the missing combinations of ID and Date.
dt2 <- dt %>%
  # Reshape the data frame
  gather(Date, Value, -ID) %>%
  # Convert Date to date class
  mutate(Date = ymd(Date)) %>%
  # Create missing ID/Date combinations
  complete(ID, Date = dt_merge$Dates) %>%
  # Join dt_merge
  left_join(dt_merge, by = c("Date" = "Dates"))

one.week <- dt2 %>%
  mutate(Period = as.numeric(Period)) %>%
  group_by(ID, Period) %>%
  summarise(Value = max(Value, na.rm = TRUE)) %>%
  spread(Period, Value)

one.week
#> # A tibble: 10 x 146
#> # Groups: ID [10]
#> ID `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11`
#> * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1214m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 2 1844f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 3 2113m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 4 2254m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 5 2721f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 6 3121f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 7 3486f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 8 3540f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 9 4175m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 10 5167f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> # ... with 134 more variables: `12` <dbl>, `13` <dbl>, `14` <dbl>,
#> # `15` <dbl>, `16` <dbl>, `17` <dbl>, `18` <dbl>, `19` <dbl>,
#> # `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>, `24` <dbl>,
#> # `25` <dbl>, `26` <dbl>, `27` <dbl>, `28` <dbl>, `29` <dbl>,
#> # `30` <dbl>, `31` <dbl>, `32` <dbl>, `33` <dbl>, `34` <dbl>,
#> # `35` <dbl>, `36` <dbl>, `37` <dbl>, `38` <dbl>, `39` <dbl>,
#> # `40` <dbl>, `41` <dbl>, `42` <dbl>, `43` <dbl>, `44` <dbl>,
#> # `45` <dbl>, `46` <dbl>, `47` <dbl>, `48` <dbl>, `49` <dbl>,
#> # `50` <dbl>, `51` <dbl>, `52` <dbl>, `53` <dbl>, `54` <dbl>,
#> # `55` <dbl>, `56` <dbl>, `57` <dbl>, `58` <dbl>, `59` <dbl>,
#> # `60` <dbl>, `61` <dbl>, `62` <dbl>, `63` <dbl>, `64` <dbl>,
#> # `65` <dbl>, `66` <dbl>, `67` <dbl>, `68` <dbl>, `69` <dbl>,
#> # `70` <dbl>, `71` <dbl>, `72` <dbl>, `73` <dbl>, `74` <dbl>,
#> # `75` <dbl>, `76` <dbl>, `77` <dbl>, `78` <dbl>, `79` <dbl>,
#> # `80` <dbl>, `81` <dbl>, `82` <dbl>, `83` <dbl>, `84` <dbl>,
#> # `85` <dbl>, `86` <dbl>, `87` <dbl>, `88` <dbl>, `89` <dbl>,
#> # `90` <dbl>, `91` <dbl>, `92` <dbl>, `93` <dbl>, `94` <dbl>,
#> # `95` <dbl>, `96` <dbl>, `97` <dbl>, `98` <dbl>, `99` <dbl>,
#> # `100` <dbl>, `101` <dbl>, `102` <dbl>, `103` <dbl>, `104` <dbl>,
#> # `105` <dbl>, `106` <dbl>, `107` <dbl>, `108` <dbl>, `109` <dbl>,
#> # `110` <dbl>, `111` <dbl>, ...
Here -Inf is returned if there were no values for that ID in a given week: every Value in that week is NA, so max(Value, na.rm = TRUE) is taken over an empty set. Alternatively, instead of filling the missing values with NA, they could be filled with, for example, 0, using complete(ID, Date = dt_merge$Dates, fill = list(Value = 0)). This will make the Value variable 0 for any of the unobserved ID and Date combinations.
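The difference between the two fill behaviours can be seen on a small made-up example (the obs tibble and all_dates vector below are hypothetical stand-ins for the reshaped eh data and dt_merge$Dates):

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: two IDs, observed on two of four dates
obs <- tibble(
  ID    = c("a", "a", "b", "b"),
  Date  = as.Date(c("2013-03-20", "2013-03-23", "2013-03-20", "2013-03-23")),
  Value = c(0, 1, 0, 0)
)
all_dates <- seq(as.Date("2013-03-20"), as.Date("2013-03-23"), by = "day")

# Default: unobserved ID/Date combinations get Value = NA
with_na <- obs %>% complete(ID, Date = all_dates)

# With fill: unobserved ID/Date combinations get Value = 0 instead
with_zero <- obs %>% complete(ID, Date = all_dates, fill = list(Value = 0))
```

Both results have one row per ID and date. Whether 0 is the right fill depends on whether an unsurveyed date should count as a non-detection. If it should not, keeping NA (and handling -Inf after summarising) may be the safer choice.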