Adding rows to a data.table based on column values

Problem description

I am working with a large dataset and I need to add rows to it. This issue was raised in another question, but I am separating it from the other issues in that original question. I'm fairly new to SO, so please let me know if this is not a 'done' thing.

Data example:

yr   week   id  days rev    p1  p2 p3   f1  f2  f3  f4
2016    3   1   1   5568.3  0   1   0   0   0   0   0
2016    4   1   3   8869.53 0   1   0   0   0   0   0
2016    5   1   2   12025.8 0   1   0   0   0   0   0
2016    6   1   2   9126.6  0   1   0   0   0   0   0
2016    7   1   3   4415.4  0   1   0   0   0   0   0
2016    8   1   2   11586.6 0   1   0   0   0   0   0
2016    10  1   1   2144.4  0   1   0   0   0   0   0
2016    11  1   1   2183.25 0   1   0   0   0   0   0
2016    14  1   2   4998    0   1   0   0   0   0   0
2016    15  1   3   117     0   1   0   0   0   0   0
2016    1   2   4   12743.3 0   0   1   1   1   0   0
2016    2   2   2   7473.48 0   0   1   1   1   0   0
2016    5   2   2   8885.52 0   0   1   1   1   0   0
2016    7   2   1   15330.6 0   0   1   1   1   0   0
2016    8   2   2   3763.8  0   0   1   1   1   0   0
2016    9   2   1   2274.05 0   0   1   1   1   0   0
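
For anyone who wants to reproduce this, here is one way the sample above could be loaded, assuming the values are whitespace-separated. The object name `df` is just an assumption, chosen to match the name used in the answer further down.

```r
# Read the sample rows into a data frame; `df` is an assumed name.
df <- read.table(header = TRUE, text = "
yr week id days rev p1 p2 p3 f1 f2 f3 f4
2016 3 1 1 5568.3 0 1 0 0 0 0 0
2016 4 1 3 8869.53 0 1 0 0 0 0 0
2016 5 1 2 12025.8 0 1 0 0 0 0 0
2016 6 1 2 9126.6 0 1 0 0 0 0 0
2016 7 1 3 4415.4 0 1 0 0 0 0 0
2016 8 1 2 11586.6 0 1 0 0 0 0 0
2016 10 1 1 2144.4 0 1 0 0 0 0 0
2016 11 1 1 2183.25 0 1 0 0 0 0 0
2016 14 1 2 4998 0 1 0 0 0 0 0
2016 15 1 3 117 0 1 0 0 0 0 0
2016 1 2 4 12743.3 0 0 1 1 1 0 0
2016 2 2 2 7473.48 0 0 1 1 1 0 0
2016 5 2 2 8885.52 0 0 1 1 1 0 0
2016 7 2 1 15330.6 0 0 1 1 1 0 0
2016 8 2 2 3763.8 0 0 1 1 1 0 0
2016 9 2 1 2274.05 0 0 1 1 1 0 0
")

# Largest observed week per yr/id combination (the upper bound for the fill).
aggregate(week ~ yr + id, data = df, FUN = max)
```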

For each combination of id and yr there are several rows of data corresponding to week. The p1:p3 and f1:f4 columns are constant within each id/yr combination, while rev varies with week.

For each combination of yr/id there is a maximum value for week. What I want to do is add the missing rows, starting from week = 1 and running up to the maximum week for that yr/id combination. In the added rows, days should be 0 and rev should be NA.

I would like to end up with:

yr  week    id  days    rev p1  p2  p3  f1  f2  f3  f4
2016    1   1   0   NA      0   1   0   0   0   0   0
2016    2   1   0   NA      0   1   0   0   0   0   0
2016    3   1   1   5568.3  0   1   0   0   0   0   0
2016    4   1   3   8869.53 0   1   0   0   0   0   0
2016    5   1   2   12025.8 0   1   0   0   0   0   0
2016    6   1   2   9126.6  0   1   0   0   0   0   0
2016    7   1   3   4415.4  0   1   0   0   0   0   0
2016    8   1   2   11586.6 0   1   0   0   0   0   0
2016    9   1   0   NA      0   1   0   0   0   0   0
2016    10  1   1   2144.4  0   1   0   0   0   0   0
2016    11  1   1   2183.25 0   1   0   0   0   0   0
2016    12  1   0   NA      0   1   0   0   0   0   0
2016    13  1   0   NA      0   1   0   0   0   0   0
2016    14  1   2   4998    0   1   0   0   0   0   0
2016    15  1   3   117     0   1   0   0   0   0   0
2016    1   2   4   12743.3 0   0   1   1   1   0   0
2016    2   2   2   7473.48 0   0   1   1   1   0   0
2016    3   2   0   NA      0   0   1   1   1   0   0
2016    4   2   0   NA      0   0   1   1   1   0   0
2016    5   2   2   8885.52 0   0   1   1   1   0   0
2016    6   2   0   NA      0   0   1   1   1   0   0
2016    7   2   1   15330.6 0   0   1   1   1   0   0
2016    8   2   2   3763.8  0   0   1   1   1   0   0
2016    9   2   1   2274.05 0   0   1   1   1   0   0

I have tried using CJ from the data.table package, but the issue is that the range of weeks to join on is different for each id/yr group. Any suggestions are appreciated.

Solution

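Grouping with dplyr (v0.4.3) and the complete function from tidyr (v0.4.1) should do the trick. Note that mutate_each() and funs() were deprecated in later dplyr releases, where mutate() with across() is the replacement: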

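library(dplyr)
library(tidyr)

df %>% 
  # one group per yr/id combination
  group_by(yr, id) %>% 
  # add a row for every week from 1 to the group's maximum week
  complete(week = 1:max(week)) %>% 
  # the added rows get days = 0; rev is left as NA
  replace_na(list(days = 0)) %>% 
  # group again before filling the id/yr-invariant columns
  group_by(yr, id) %>% 
  # the columns are constant per group, so the group mean restores their value
  mutate_each(funs(replace(., is.na(.), mean(., na.rm = TRUE))), p1:f4)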
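
Since the question asks about data.table, the same idea can also be sketched there. This is not part of the original answer, only a minimal sketch under a few assumptions: a subset of the sample rows is used for brevity, and the names `dt`, `grid`, `out` and `inv` are made up for illustration. Instead of CJ, it builds the week range separately for each yr/id group and then joins the observed rows onto that grid.

```r
library(data.table)

# A few rows of the sample data (subset chosen for brevity).
dt <- as.data.table(read.table(header = TRUE, text = "
yr week id days rev p1 p2 p3 f1 f2 f3 f4
2016 3 1 1 5568.3 0 1 0 0 0 0 0
2016 5 1 2 12025.8 0 1 0 0 0 0 0
2016 1 2 4 12743.3 0 0 1 1 1 0 0
2016 2 2 2 7473.48 0 0 1 1 1 0 0
2016 5 2 2 8885.52 0 0 1 1 1 0 0
"))

# Per-group grid: weeks 1..max(week) for each yr/id combination.
grid <- dt[, .(week = seq_len(max(week))), by = .(yr, id)]

# Join the observed rows onto the grid; weeks with no match come back as NA.
out <- dt[grid, on = .(yr, id, week)]

# Added rows get days = 0; rev is left as NA.
out[is.na(days), days := 0L]

# Fill the yr/id-invariant columns with the first non-missing value per group.
inv <- c("p1", "p2", "p3", "f1", "f2", "f3", "f4")
out[, (inv) := lapply(.SD, function(x) x[!is.na(x)][1L]),
    by = .(yr, id), .SDcols = inv]

# Restore the original column order.
setcolorder(out, names(dt))
out
```

Building the grid with `by` is what sidesteps the CJ problem: CJ produces one cross join for the whole table, whereas here each group contributes its own range of weeks.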

