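(Sorry for the odd title, but I couldn't think of a short way to phrase it.)
Since I simplified the problem in my previous question, here is the actual problem this time.
The data frame provided contains the columns "usr", "usrMsgCnt" and "isRefound", where usr is a name, usrMsgCnt is a number, and isRefound is binary.
A new column is to be added, whose value is computed as follows:
For the first row of the sample data, the new value would be:
Given the size of the original dataset, looping over the rows is not an option.
Here is the dput output for a small part of the data: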
structure(list(usr = structure(c(21L, 21L, 21L, 21L, 6L, 5L,
6L, 6L, 6L, 21L, 20L, 21L, 6L, 20L, 21L, 21L, 21L, 6L, 6L, 6L
), .Label = c("alsmith", "Amanda.Coles", "Andrew.Coles", "babsimieth",
"Bernd.Ludwig", "Bernhard.Schiemann", "bfueck", "Bram.Ridder",
"brian.tripney", "carlosgardeazabal", "christine.elsweiler",
"cmfinner", "daniel.goncalves", "david", "de56", "eko.ma", "freundlu",
"gmcphail", "ian.ferguson", "Ian.Ruthven", "Jan.Schrader", "jearmour",
"jyang", "Laura.Schnall", "Marc.Roper", "marek.maleika", "Martin.Hacker",
"martin.scholz", "maziminke", "mclanger", "Michael.Cashmore",
"morgan.harvey", "mrussell", "msherrif", "murray.wood", "Nadine.Mahrholz",
"noam.ascher", "pburns", "Peter.Gregory", "raina", "robertnm",
"ronald.teijeira", "ronaldtf", "sbenus", "starmstr", "steve.neely",
"Sven.Friedemann", "tinchen"), class = "factor"), usrMsgCnt = c(9L,
9L, 9L, 9L, 5L, 0L, 5L, 5L, 5L, 9L, 0L, 9L, 5L, 0L, 9L, 9L, 9L,
37L, 37L, 37L), isRefound = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L,
1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L)), .Names = c("usr",
"usrMsgCnt", "isRefound"), row.names = c(NA, 20L), class = "data.frame")
Best answer
Assuming isRefound is actually binary:
library(data.table)
DT <- data.table(DF,key="usr")
DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr]
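To make the grouped expression easier to check by hand, here is a minimal sketch on a toy table. The table `toy` and its values are hypothetical and are not taken from the question's data; only the `:=` with `by` pattern is the same as above.

```r
library(data.table)

# Toy data (hypothetical): two users with three rows each
toy <- data.table(
  usr       = c("a", "a", "a", "b", "b", "b"),
  usrMsgCnt = c(6L, 6L, 6L, 4L, 4L, 4L),
  isRefound = c(1L, 0L, 1L, 1L, 1L, 1L)
)

# Within each usr group, sum(isRefound) is a single number that is
# recycled against every usrMsgCnt value of that group.
toy[, newvar := usrMsgCnt / sum(isRefound), by = usr]

print(toy)
```

For user "a" the denominator is 2, so each of its rows gets 6/2; for user "b" it is 3, so each row gets 4/3.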
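Edit: If the row order is essential, you should not set a key (setting a key sorts the data.table), and you should create an index variable to be on the safe side.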
DT <- data.table(DF)
DT[,id:=.I]
DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr]
print(DT)
#                    usr usrMsgCnt isRefound id newvar
#  1:       Jan.Schrader         9         0  1    1.8
#  2:       Jan.Schrader         9         1  2    1.8
#  3:       Jan.Schrader         9         1  3    1.8
#  4:       Jan.Schrader         9         1  4    1.8
#  5: Bernhard.Schiemann         5         1  5    1.0
#  6:       Bernd.Ludwig         0         0  6    NaN
#  7: Bernhard.Schiemann         5         0  7    1.0
#  8: Bernhard.Schiemann         5         1  8    1.0
#  9: Bernhard.Schiemann         5         1  9    1.0
# 10:       Jan.Schrader         9         1 10    1.8
# 11:        Ian.Ruthven         0         0 11    NaN
# 12:       Jan.Schrader         9         0 12    1.8
# 13: Bernhard.Schiemann         5         1 13    1.0
# 14:        Ian.Ruthven         0         0 14    NaN
# 15:       Jan.Schrader         9         0 15    1.8
# 16:       Jan.Schrader         9         0 16    1.8
# 17:       Jan.Schrader         9         1 17    1.8
# 18: Bernhard.Schiemann        37         0 18    7.4
# 19: Bernhard.Schiemann        37         1 19    7.4
# 20: Bernhard.Schiemann        37         0 20    7.4
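The point about keys and row order can be seen on a small example. This is only a sketch: `DF2`, `keyed`, `plain`, `shuffled` and `restored` are hypothetical names, and the data are made up so that the users start out of alphabetical order.

```r
library(data.table)

# Hypothetical toy data, with users deliberately out of alphabetical order
DF2 <- data.frame(usr = c("b", "a", "b", "a"),
                  isRefound = c(1L, 0L, 1L, 1L),
                  stringsAsFactors = FALSE)

keyed <- data.table(DF2, key = "usr")  # setting a key sorts the rows by usr
plain <- data.table(DF2)               # no key: original row order is kept
plain[, id := .I]                      # store the original row number

print(keyed$usr)
print(plain$usr)

# If the rows are reordered later, sorting on id recovers the original order
shuffled <- plain[order(usr)]
restored <- shuffled[order(id)]
print(restored$usr)
```

The `id` column is what makes the original order recoverable after any later sort, which is why it is worth creating before doing anything else.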
The same conceptual approach can be used with base R and with plyr, as shown in the answers to your previous question:
within(DF, {
newvar <- usrMsgCnt/ave(isRefound, usr, FUN = sum)
})
library(plyr)
ddply(DF, .(usr), transform,
newvar = usrMsgCnt/sum(isRefound))
However, the data.table package will perform better on large datasets.
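One detail in the printed result: users whose isRefound values sum to zero get NaN, because 0/0 is NaN in R (and a non-zero usrMsgCnt would give Inf). If that is not wanted, a minimal base R sketch could guard the division. It assumes that NA is an acceptable placeholder for such users, which the question does not state; the toy data frame is hypothetical.

```r
# Hypothetical toy data: user "c" has no refound messages at all
toy <- data.frame(usr = c("a", "a", "c", "c"),
                  usrMsgCnt = c(4L, 4L, 3L, 3L),
                  isRefound = c(1L, 1L, 0L, 0L),
                  stringsAsFactors = FALSE)

# Per-user sum of isRefound, repeated so there is one value per row
denom <- ave(toy$isRefound, toy$usr, FUN = sum)

# Keep the quotient only where the denominator is non-zero; use NA elsewhere
toy$newvar <- ifelse(denom > 0, toy$usrMsgCnt / denom, NA_real_)

print(toy)
```

The same `ifelse` guard can be placed inside the data.table or plyr expressions if one of those approaches is preferred.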