r - Get the number of rows that share a value in one column and have a positive binary value in another column

(Sorry for the odd title, but I couldn't think of a short way to put it.)

Since I simplified the problem in my previous question, this time I am giving you the actual problem.

The data frame provided contains the columns "usr", "usrMsgCnt" and "isRefound", where usr is a name, usrMsgCnt is a number, and isRefound is binary.

A new column is to be added, with its value computed as follows:

For the first row of the sample data, the new value would be:
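The formula itself did not survive in this copy of the question. Going by the title and the answer below, the value appears to be usrMsgCnt divided by the number of rows that have the same usr and isRefound equal to 1. Under that assumption, a minimal sketch of the first-row calculation could look like this (the sample rows are retyped here as plain vectors so the snippet runs on its own):

```r
# Sample data retyped from the dput further down (usr as plain strings)
DF <- data.frame(
  usr = c("Jan.Schrader", "Jan.Schrader", "Jan.Schrader", "Jan.Schrader",
          "Bernhard.Schiemann", "Bernd.Ludwig", "Bernhard.Schiemann",
          "Bernhard.Schiemann", "Bernhard.Schiemann", "Jan.Schrader",
          "Ian.Ruthven", "Jan.Schrader", "Bernhard.Schiemann", "Ian.Ruthven",
          "Jan.Schrader", "Jan.Schrader", "Jan.Schrader",
          "Bernhard.Schiemann", "Bernhard.Schiemann", "Bernhard.Schiemann"),
  usrMsgCnt = c(9L, 9L, 9L, 9L, 5L, 0L, 5L, 5L, 5L, 9L, 0L, 9L, 5L, 0L,
                9L, 9L, 9L, 37L, 37L, 37L),
  isRefound = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L,
                0L, 0L, 1L, 0L, 1L, 0L)
)

# Assumed formula: usrMsgCnt / (rows with the same usr and isRefound == 1)
first_usr <- DF$usr[1]
n_refound <- sum(DF$usr == first_usr & DF$isRefound == 1)
first_val <- DF$usrMsgCnt[1] / n_refound

n_refound   # 5
first_val   # 1.8
```

The result, 9 / 5 = 1.8, agrees with the first row of the output shown in the answer.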

Given the size of the original dataset, looping over it is not an option.

Here is the dput output for a small part of the data:

DF <- structure(list(usr = structure(c(21L, 21L, 21L, 21L, 6L, 5L,
6L, 6L, 6L, 21L, 20L, 21L, 6L, 20L, 21L, 21L, 21L, 6L, 6L, 6L
), .Label = c("alsmith", "Amanda.Coles", "Andrew.Coles", "babsimieth",
"Bernd.Ludwig", "Bernhard.Schiemann", "bfueck", "Bram.Ridder",
"brian.tripney", "carlosgardeazabal", "christine.elsweiler",
"cmfinner", "daniel.goncalves", "david", "de56", "eko.ma", "freundlu",
"gmcphail", "ian.ferguson", "Ian.Ruthven", "Jan.Schrader", "jearmour",
"jyang", "Laura.Schnall", "Marc.Roper", "marek.maleika", "Martin.Hacker",
"martin.scholz", "maziminke", "mclanger", "Michael.Cashmore",
"morgan.harvey", "mrussell", "msherrif", "murray.wood", "Nadine.Mahrholz",
"noam.ascher", "pburns", "Peter.Gregory", "raina", "robertnm",
"ronald.teijeira", "ronaldtf", "sbenus", "starmstr", "steve.neely",
"Sven.Friedemann", "tinchen"), class = "factor"), usrMsgCnt = c(9L,
9L, 9L, 9L, 5L, 0L, 5L, 5L, 5L, 9L, 0L, 9L, 5L, 0L, 9L, 9L, 9L,
37L, 37L, 37L), isRefound = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L,
1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L)), .Names = c("usr",
"usrMsgCnt", "isRefound"), row.names = c(NA, 20L), class = "data.frame")

Best answer

Assuming isRefound really is binary:

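library(data.table)
# Convert to a data.table keyed by usr (setting the key sorts the rows by usr)
DT <- data.table(DF,key="usr")

# Per user: divide usrMsgCnt by the number of rows with isRefound == 1
DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr]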

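Edit: If the row order is essential, you should not set a key (which sorts the data.table) and should create an index variable instead (to be on the safe side).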

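DT <- data.table(DF)
# Store the original row number so the order can always be restored
DT[,id:=.I]
# Grouping with by= (no key) leaves the rows in their original order
DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr]
print(DT)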

#                    usr usrMsgCnt isRefound id newvar
#  1:       Jan.Schrader         9         0  1    1.8
#  2:       Jan.Schrader         9         1  2    1.8
#  3:       Jan.Schrader         9         1  3    1.8
#  4:       Jan.Schrader         9         1  4    1.8
#  5: Bernhard.Schiemann         5         1  5    1.0
#  6:       Bernd.Ludwig         0         0  6    NaN
#  7: Bernhard.Schiemann         5         0  7    1.0
#  8: Bernhard.Schiemann         5         1  8    1.0
#  9: Bernhard.Schiemann         5         1  9    1.0
# 10:       Jan.Schrader         9         1 10    1.8
# 11:        Ian.Ruthven         0         0 11    NaN
# 12:       Jan.Schrader         9         0 12    1.8
# 13: Bernhard.Schiemann         5         1 13    1.0
# 14:        Ian.Ruthven         0         0 14    NaN
# 15:       Jan.Schrader         9         0 15    1.8
# 16:       Jan.Schrader         9         0 16    1.8
# 17:       Jan.Schrader         9         1 17    1.8
# 18: Bernhard.Schiemann        37         0 18    7.4
# 19: Bernhard.Schiemann        37         1 19    7.4
# 20: Bernhard.Schiemann        37         0 20    7.4
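The NaN values in the output come from users whose isRefound values sum to zero, so the division is 0/0. A user with a nonzero usrMsgCnt and no refound rows would get Inf instead. If NA is preferred in both cases, one possible sketch is shown below. It is not part of the original answer; the toy data and the helper column nRefound are made up for illustration.

```r
library(data.table)

# Toy data (hypothetical): user "b" would give 0/0, user "c" would give 3/0
DT <- data.table(usr       = c("a", "a", "b", "c"),
                 usrMsgCnt = c(4L, 4L, 0L, 3L),
                 isRefound = c(1L, 1L, 0L, 0L))

# Helper column (hypothetical name): refound rows per user
DT[, nRefound := sum(isRefound), by = usr]

# Assign only where the denominator is positive; all other rows stay NA
DT[nRefound > 0, newvar := usrMsgCnt / nRefound]

print(DT)   # newvar is 2, 2, NA, NA
```

Assigning with `:=` on a subset of rows leaves the remaining rows of the new column as NA, which avoids computing the division for zero denominators at all.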

The same conceptual approach can be used with base R and with the plyr approach shown in the answers to your previous question:

within(DF, {
  newvar <- usrMsgCnt/ave(isRefound, usr, FUN = sum)
})

library(plyr)
ddply(DF, .(usr), transform,
      newvar = usrMsgCnt/sum(isRefound))
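Both variants rely on a per-group sum that is repeated for every row of the group. For readers who have not used ave before, here is a tiny illustration on made-up vectors (g and x are hypothetical and unrelated to the question's data):

```r
g <- c("a", "b", "a", "a", "b")   # grouping variable
x <- c(1L, 0L, 1L, 0L, 1L)        # binary flag

# ave() applies FUN within each group and returns a vector as long as x,
# with each group's result repeated at that group's positions
per_row_sum <- ave(x, g, FUN = sum)
per_row_sum   # 2 1 2 2 1
```

Because the result is aligned with the original rows, it can be used directly as the denominator without any merge or reordering step.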

However, the data.table package will perform better on large datasets.
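To check the performance difference on your own machine, a rough timing sketch along these lines may help. The synthetic data, its size, and the user names are assumptions chosen only for illustration, and the timings will vary between machines, so no numbers are quoted here.

```r
library(data.table)

set.seed(1)
n  <- 1e5
# Synthetic data (hypothetical): 1000 users, random counts and flags
DF <- data.frame(
  usr       = sample(sprintf("user%04d", 1:1000), n, replace = TRUE),
  usrMsgCnt = sample(0:50, n, replace = TRUE),
  isRefound = sample(0:1, n, replace = TRUE)
)

# Base R: per-row group sum via ave()
t_base <- system.time(
  res_base <- DF$usrMsgCnt / ave(DF$isRefound, DF$usr, FUN = sum)
)

# data.table: grouped assignment by reference
DT <- as.data.table(DF)
t_dt <- system.time(
  DT[, newvar := usrMsgCnt / sum(isRefound), by = usr]
)

# Both approaches should produce the same values in the same row order
same_result <- isTRUE(all.equal(res_base, DT$newvar))
same_result

# Elapsed seconds for each approach
c(base = t_base[["elapsed"]], data.table = t_dt[["elapsed"]])
```

A single system.time call is noisy, so repeating the measurement or increasing n gives a more reliable comparison.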