Problem description
I have a data.table of events recording, say, user ID, country of residence, and event. E.g.,
library(data.table)
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")
As you can see, the data is somewhat corrupted: event 5 reports user 3 as being in country 2 (or maybe he traveled, it does not matter to me here). So when I try to summarize the data:
dt[, country[.N] , by=user]
   user V1
1:    3  2
2:    4  2
I get the wrong country for user 3. Ideally, I would like to get the most common country for a user and the percentage of time he spent there:
   user country support
1:    3       1     0.8
2:    4       2     1.0
How do I do this?
The actual data has ~10^7 rows, so the solution has to scale (this is why I am using data.table and not data.frame after all).
Recommended answer
Another way:
Edited. table(.) was the culprit. Changed it to complete data.table syntax.
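# Count events per (user, country), then keep each user's most frequent
# country together with its share of that user's events.
dt.out <- dt[, .N, by = list(user, country)][, list(country[which.max(N)],
                                                     max(N) / sum(N)), by = user]
setnames(dt.out, c("V1", "V2"), c("country", "support"))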
#    user country support
# 1:    3       1     0.8
# 2:    4       2     1.0
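To see what the chained expression is doing, it can help to split it into its two steps. The following is only a sketch that rebuilds the example data from the question; the names counts and res are illustrative and not part of the original answer.

```r
library(data.table)

# Example data from the question: event 5 puts user 3 in country 2.
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")

# Step 1: count events per (user, country) pair.
# `counts` is an illustrative name, not from the original answer.
counts <- dt[, .N, by = list(user, country)]
# counts has three rows: (3, 1) with N = 4, (3, 2) with N = 1, (4, 2) with N = 5

# Step 2: per user, keep the country with the largest count and its share
# of that user's events. which.max() returns the first maximum, so a tie
# goes to whichever country appears first in `counts`.
res <- counts[, list(country = country[which.max(N)],
                     support = max(N) / sum(N)), by = user]
```

Naming the columns inside list() makes the setnames() call unnecessary. Only the grouped counts are carried into the second step, so that step works on one row per (user, country) pair rather than on the full event table.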