Problem description
I am working with a large dataset and need to add rows to it. This came up in another question, but I am separating this particular issue from the others raised there. I'm fairly new to SO, so please let me know if this is not the done thing.
Data example:
yr week id days rev p1 p2 p3 f1 f2 f3 f4
2016 3 1 1 5568.3 0 1 0 0 0 0 0
2016 4 1 3 8869.53 0 1 0 0 0 0 0
2016 5 1 2 12025.8 0 1 0 0 0 0 0
2016 6 1 2 9126.6 0 1 0 0 0 0 0
2016 7 1 3 4415.4 0 1 0 0 0 0 0
2016 8 1 2 11586.6 0 1 0 0 0 0 0
2016 10 1 1 2144.4 0 1 0 0 0 0 0
2016 11 1 1 2183.25 0 1 0 0 0 0 0
2016 14 1 2 4998 0 1 0 0 0 0 0
2016 15 1 3 117 0 1 0 0 0 0 0
2016 1 2 4 12743.3 0 0 1 1 1 0 0
2016 2 2 2 7473.48 0 0 1 1 1 0 0
2016 5 2 2 8885.52 0 0 1 1 1 0 0
2016 7 2 1 15330.6 0 0 1 1 1 0 0
2016 8 2 2 3763.8 0 0 1 1 1 0 0
2016 9 2 1 2274.05 0 0 1 1 1 0 0
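To make the sample easier to experiment with, it can be read into a data frame roughly as follows. The name df is an assumption, picked only because the answer's code refers to a data frame by that name.

```r
# Sample data copied from the question.
# `df` is an assumed name; the question does not say what the object is called.
df <- read.table(header = TRUE, text = "
yr week id days rev p1 p2 p3 f1 f2 f3 f4
2016 3 1 1 5568.3 0 1 0 0 0 0 0
2016 4 1 3 8869.53 0 1 0 0 0 0 0
2016 5 1 2 12025.8 0 1 0 0 0 0 0
2016 6 1 2 9126.6 0 1 0 0 0 0 0
2016 7 1 3 4415.4 0 1 0 0 0 0 0
2016 8 1 2 11586.6 0 1 0 0 0 0 0
2016 10 1 1 2144.4 0 1 0 0 0 0 0
2016 11 1 1 2183.25 0 1 0 0 0 0 0
2016 14 1 2 4998 0 1 0 0 0 0 0
2016 15 1 3 117 0 1 0 0 0 0 0
2016 1 2 4 12743.3 0 0 1 1 1 0 0
2016 2 2 2 7473.48 0 0 1 1 1 0 0
2016 5 2 2 8885.52 0 0 1 1 1 0 0
2016 7 2 1 15330.6 0 0 1 1 1 0 0
2016 8 2 2 3763.8 0 0 1 1 1 0 0
2016 9 2 1 2274.05 0 0 1 1 1 0 0
")

# Largest week per yr/id combination: the upper bound for the rows to add.
max_week <- aggregate(week ~ yr + id, data = df, FUN = max)
print(max_week)
```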
For each combination of id and yr there are several rows of data, one per week in which something was recorded. The p1:p3 and f1:f4 columns are invariant within an id/yr combination, while days and rev vary with week.
Each yr/id combination has its own maximum value of week. What I want to do is add the missing rows, starting at week = 1 and going up to the maximum week for that yr/id combination. In the added rows days should be 0, rev should be NA, and the p and f columns should carry the group's values.
I would like to end up with:
yr week id days rev p1 p2 p3 f1 f2 f3 f4
2016 1 1 0 NA 0 1 0 0 0 0 0
2016 2 1 0 NA 0 1 0 0 0 0 0
2016 3 1 1 5568.3 0 1 0 0 0 0 0
2016 4 1 3 8869.53 0 1 0 0 0 0 0
2016 5 1 2 12025.8 0 1 0 0 0 0 0
2016 6 1 2 9126.6 0 1 0 0 0 0 0
2016 7 1 3 4415.4 0 1 0 0 0 0 0
2016 8 1 2 11586.6 0 1 0 0 0 0 0
2016 9 1 0 NA 0 1 0 0 0 0 0
2016 10 1 1 2144.4 0 1 0 0 0 0 0
2016 11 1 1 2183.25 0 1 0 0 0 0 0
2016 12 1 0 NA 0 1 0 0 0 0 0
2016 13 1 0 NA 0 1 0 0 0 0 0
2016 14 1 2 4998 0 1 0 0 0 0 0
2016 15 1 3 117 0 1 0 0 0 0 0
2016 1 2 4 12743.3 0 0 1 1 1 0 0
2016 2 2 2 7473.48 0 0 1 1 1 0 0
2016 3 2 0 NA 0 0 1 1 1 0 0
2016 4 2 0 NA 0 0 1 1 1 0 0
2016 5 2 2 8885.52 0 0 1 1 1 0 0
2016 6 2 0 NA 0 0 1 1 1 0 0
2016 7 2 1 15330.6 0 0 1 1 1 0 0
2016 8 2 2 3763.8 0 0 1 1 1 0 0
2016 9 2 1 2274.05 0 0 1 1 1 0 0
I have tried using CJ from the data.table package, but the issue is that the set of weeks to join on is different for each id/yr group. Any suggestions are appreciated.
Grouping with dplyr (v0.4.3) and the complete function from tidyr (v0.4.1) should do the trick:
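library(dplyr)
library(tidyr)

df %>%
  group_by(yr, id) %>%
  # add one row per missing week, from 1 up to the group's maximum week
  complete(week = 1:max(week)) %>%
  # the added rows have no days; rev is left as NA
  replace_na(list(days = 0)) %>%
  group_by(yr, id) %>%
  # p1:f4 are constant within yr/id, so the group mean restores their value
  # (mutate_each() and funs() are deprecated in later dplyr releases)
  mutate_each(funs(replace(., is.na(.), mean(., na.rm = TRUE))), p1:f4)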
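Since the question mentions data.table, here is a rough data.table sketch of the same idea. It is not taken from the answer above. A single CJ call over yr, id and week would give every group the same range of weeks, so this version builds the grid of weeks per group and joins the original rows onto it. The names dt, grid, out and inv are illustrative assumptions.

```r
library(data.table)

# Sample data from the question; `dt` is an assumed name.
dt <- as.data.table(read.table(header = TRUE, text = "
yr week id days rev p1 p2 p3 f1 f2 f3 f4
2016 3 1 1 5568.3 0 1 0 0 0 0 0
2016 4 1 3 8869.53 0 1 0 0 0 0 0
2016 5 1 2 12025.8 0 1 0 0 0 0 0
2016 6 1 2 9126.6 0 1 0 0 0 0 0
2016 7 1 3 4415.4 0 1 0 0 0 0 0
2016 8 1 2 11586.6 0 1 0 0 0 0 0
2016 10 1 1 2144.4 0 1 0 0 0 0 0
2016 11 1 1 2183.25 0 1 0 0 0 0 0
2016 14 1 2 4998 0 1 0 0 0 0 0
2016 15 1 3 117 0 1 0 0 0 0 0
2016 1 2 4 12743.3 0 0 1 1 1 0 0
2016 2 2 2 7473.48 0 0 1 1 1 0 0
2016 5 2 2 8885.52 0 0 1 1 1 0 0
2016 7 2 1 15330.6 0 0 1 1 1 0 0
2016 8 2 2 3763.8 0 0 1 1 1 0 0
2016 9 2 1 2274.05 0 0 1 1 1 0 0
"))

# Step 1: one row per week from 1 to the maximum week of each yr/id group.
grid <- dt[, .(week = seq_len(max(week))), by = .(yr, id)]

# Step 2: join the original rows onto the grid; weeks without a match get NA.
out <- dt[grid, on = .(yr, id, week)]

# Step 3: added weeks have zero days; rev is left as NA.
out[is.na(days), days := 0L]

# Step 4: copy the group-invariant columns into the added rows,
# using the first non-missing value within each yr/id group.
inv <- c("p1", "p2", "p3", "f1", "f2", "f3", "f4")
out[, (inv) := lapply(.SD, function(x) x[!is.na(x)][1]),
    by = .(yr, id), .SDcols = inv]

print(out)
```

Taking the first non-missing value in step 4 relies on the question's statement that p1:p3 and f1:f4 do not change within a yr/id combination. If that does not hold in the real data, this step would need rethinking.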