问题描述
我希望在R中为某个变量(INTERVIEW_DAY)创建一个频率表,但要考虑到另一个变量的权重(WEIGHT).
I'm looking to create a frequency table in R for a certain variable (INTERVIEW_DAY) , but taking into account another variable as weight (WEIGHT).
我已经尝试使用软件包data.table
来做到这一点.我很乐意使用R-Base Package来做到这一点.
I've tried to do that with the package data.table
. I would love to do it with the R-Base Package though.
在下面,您会找到我想要的但仍未加权的表类型,这是我要学习的表.
Below you'll find the type of table I want, but still not weighted, which is what I'm looking to learn.
数据(变量TUCASEID,INTERVIEW_DAY和WEIGHT):
Data(variables TUCASEID, INTERVIEW_DAY, and WEIGHT):
TUCASEID INTERVIEW_DAY WEIGHT
1 2.00301e+13 5 8155462.7
2 2.00301e+13 6 1735322.5
3 2.00301e+13 6 3830527.5
4 2.00301e+13 4 6622023.0
5 2.00301e+13 4 3068387.3
6 2.00301e+13 4 3455424.9
7 2.00301e+13 1 1637826.3
8 2.00301e+13 2 6574426.8
9 2.00301e+13 6 1528296.3
10 2.00301e+13 4 4277052.8
11 2.00301e+13 6 1961482.3
12 2.00301e+13 7 505227.2
13 2.00301e+13 6 2135476.8
14 2.00301e+13 3 5366309.3
15 2.00301e+13 6 1058351.1
使用打包的data.table创建表:
Creating table with the packaged data.table:
df <- setDT(df)
df_freq_table <- df[,.(Freq = .N), by = INTERVIEW_DAY][, Prop := Freq / sum(Freq)][, Cum := cumsum(100 * Prop / sum(Prop))]
我的输出: df_freq_table []
My output: df_freq_table[]
INTERVIEW_DAY Freq Prop Cum
1: 5 1 0.06666667 6.666667
2: 6 6 0.40000000 46.666667
3: 4 4 0.26666667 73.333333
4: 1 1 0.06666667 80.000000
5: 2 1 0.06666667 86.666667
6: 7 1 0.06666667 93.333333
7: 3 1 0.06666667 100.000000
推荐答案
在base R
中,我们可以使用xtabs/prop.table
.根据OP的代码,cumsum
是根据"INTERVIEW_DAY"中唯一值的出现顺序来计算的.因此,为避免基于整数值的sort
,将其转换为指定了levels
的factor
,并通过xtabs
通过'INTERVIEW_DAY'获得'WEIGHT'的sum
,请使用prop.table
返回比例,然后在该输出上应用cumsum
In base R
, we can make use of xtabs/prop.table
. Based on the OP's code, the cumsum
is calculated from the order of occurrence of unique valuess in 'INTERVIEW_DAY'. So, to avoid the sort
ing based on the integer value, convert to factor
with levels
specified, get the sum
of 'WEIGHT' by 'INTERVIEW_DAY' with xtabs
, use prop.table
to return the proportion, and then apply cumsum
on that output
df$INTERVIEW_DAY <- factor(df$INTERVIEW_DAY, levels = unique(df$INTERVIEW_DAY))
tbl1 <- xtabs(WEIGHT ~ INTERVIEW_DAY, df)
Prop <- prop.table(tbl1)
Cum <- cumsum(100 * Prop / sum(Prop))
Cum
# 5 6 4 1 2 7 3
# 15.71029 39.30705 72.86967 76.02470 88.68935 89.66260 100.00000
out <- data.frame(INTERVIEW_DAY = names(tbl1), Freq = as.numeric(tbl1),
Prop = as.numeric(Prop), Cum = as.numeric(Cum))
row.names(out) <- NULL
out
# INTERVIEW_DAY Freq Prop Cum
#1 5 8155462.7 0.157102906 15.71029
#2 6 12249456.5 0.235967631 39.30705
#3 4 17422888.0 0.335626124 72.86967
#4 1 1637826.3 0.031550297 76.02470
#5 2 6574426.8 0.126646592 88.68935
#6 7 505227.2 0.009732453 89.66260
#7 3 5366309.3 0.103373998 100.00000
如果需要加权频率,请使用count
library(dplyr)
df %>%
mutate(INTERVIEW_DAY = factor(INTERVIEW_DAY, levels = unique(INTERVIEW_DAY))) %>%
count(INTERVIEW_DAY, wt = WEIGHT, sort = FALSE) %>%
mutate(Prop = n / sum(n),
Cum = cumsum(100 * Prop/sum(Prop)))
# A tibble: 7 x 4
# INTERVIEW_DAY n Prop Cum
# <fct> <dbl> <dbl> <dbl>
#1 5 8155463. 0.157 15.7
#2 6 12249456. 0.236 39.3
#3 4 17422888 0.336 72.9
#4 1 1637826. 0.0316 76.0
#5 2 6574427. 0.127 88.7
#6 7 505227. 0.00973 89.7
#7 3 5366309. 0.103 100.
或使用data.table
library(data.table)
setDT(df)[, .(Freq = sum(WEIGHT)), by = INTERVIEW_DAY
][, Prop := Freq / sum(Freq)][, Cum := cumsum(100 * Prop / sum(Prop))][]
# INTERVIEW_DAY Freq Prop Cum
#1: 5 8155462.7 0.157102906 15.71029
#2: 6 12249456.5 0.235967631 39.30705
#3: 4 17422888.0 0.335626124 72.86967
#4: 1 1637826.3 0.031550297 76.02470
#5: 2 6574426.8 0.126646592 88.68935
#6: 7 505227.2 0.009732453 89.66260
#7: 3 5366309.3 0.103373998 100.00000
数据
df <- structure(list(TUCASEID = c(2.00301e+13, 2.00301e+13, 2.00301e+13,
2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13,
2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13,
2.00301e+13, 2.00301e+13), INTERVIEW_DAY = c(5L, 6L, 6L, 4L,
4L, 4L, 1L, 2L, 6L, 4L, 6L, 7L, 6L, 3L, 6L), WEIGHT = c(8155462.7,
1735322.5, 3830527.5, 6622023, 3068387.3, 3455424.9, 1637826.3,
6574426.8, 1528296.3, 4277052.8, 1961482.3, 505227.2, 2135476.8,
5366309.3, 1058351.1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
这篇关于R中的加权频率表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!