Aggregating with data.table

Problem description

I'm looking for a simpler way to aggregate and calculate percentages of a numerical variable using data.table. The following code outputs the desired result; my question is whether there is a better way to get the same result. I'm not really familiar with the package, so any tips would be useful.

I'd like to have the following columns:

       second_factor_variable third_factor_variable factor_variable       porc porcentaje
    1:                   HIGH                     C           > 200 0.04456544        4 %
    2:                    LOW                     A        51 - 100 0.31739130       32 %
    3:                    LOW                     A       101 - 200 0.68260870       68 %
    4:                    LOW                     A         26 - 50 0.00000000        0 %

where porc is the numerical percentage and porcentaje is the rounded percentage, to be used as a label in a ggplot call.

    library("ggplot2")
    library("scales")
    library("data.table")

    ### Generate some data
    set.seed(123)
    df <- data.frame(x = rnorm(10000, mean = 100, sd = 50))
    df <- subset(df, x > 0)

    df$factor_variable <- cut(df$x, right = TRUE,
                              breaks = c(0, 25, 50, 100, 200, 100000),
                              labels = c("0 - 25", "26 - 50", "51 - 100", "101 - 200", "> 200"))

    df$second_factor_variable <- cut(df$x, right = TRUE,
                                     breaks = c(0, 100, 100000),
                                     labels = c("LOW", "HIGH"))

    df$third_factor_variable <- cut(df$x, right = TRUE,
                                    breaks = c(0, 50, 100, 100000),
                                    labels = c("A", "B", "C"))

    str(df)

    ### Aggregate
    DT <- data.table(df)
    dt = DT[, list(factor_variable = unique(DT$factor_variable),
                   porc = as.numeric(table(factor_variable)/length(factor_variable)),
                   porcentaje = paste( round( as.numeric(table(factor_variable)/length(factor_variable), 0 ) * 100 ), "%")
                   ), by="second_factor_variable,third_factor_variable"]

EDIT

I've tried agstudy's solution grouping by just one variable, and I believe it didn't work for producing the labels (the porcentaje column). In the real dataset I ended up with a similar issue, and I can't spot what's wrong with this function.

    grp <- function(factor_variable) {
      porc = as.numeric(table(factor_variable)/length(factor_variable))
      list(factor_variable = factor_variable[1],
           porc = porc,
           porcentaje = paste( round( porc, 0 ) * 100 , "%"))
    }

    DT[, grp(factor_variable), by="second_factor_variable"]

The numerical values are correct:

    DT2 <- DT[DT$second_factor_variable %in% "LOW"]
    table(DT2$factor_variable)/length(DT2$factor_variable)

I believe the same problem appears if I group by two factor variables:

    DT[, grp(factor_variable), by="second_factor_variable,third_factor_variable"]

Solution

Two changes: factor the porc computation out into its own variable, and don't use DT to compute factor_variable.

    DT[, {
      porc = as.numeric(table(factor_variable)/length(factor_variable))
      list(factor_variable = factor_variable[1],
           porc = porc,
           porcentaje = paste( round( porc, 0 ) * 100 , "%"))
    }, by="second_factor_variable,third_factor_variable"]
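A possible explanation for the wrong labels, offered as a sketch rather than a confirmed diagnosis: round(porc, 0) * 100 rounds the proportion to 0 or 1 before scaling it, so 0.3174 becomes "0 %" and 0.6826 becomes "100 %". In addition, factor_variable[1] repeats a single level for every row of a group, while table() returns one entry per factor level. The example below rebuilds the question's simulated data and groups by one variable; grp_fixed is a hypothetical name used only for illustration, not part of the original post or of data.table.

```r
library(data.table)

# Same simulated data as in the question
set.seed(123)
df <- data.frame(x = rnorm(10000, mean = 100, sd = 50))
df <- subset(df, x > 0)
df$factor_variable <- cut(df$x, right = TRUE,
                          breaks = c(0, 25, 50, 100, 200, 100000),
                          labels = c("0 - 25", "26 - 50", "51 - 100", "101 - 200", "> 200"))
df$second_factor_variable <- cut(df$x, right = TRUE,
                                 breaks = c(0, 100, 100000),
                                 labels = c("LOW", "HIGH"))
DT <- data.table(df)

# Order of operations: rounding before scaling loses the percentage
p <- c(0.3174, 0.6826)
wrong <- round(p, 0) * 100   # proportion rounded to 0 or 1, then scaled
right <- round(p * 100)      # scaled to a percentage, then rounded

# Hypothetical helper (assumption, not from the original post)
grp_fixed <- function(factor_variable) {
  porc <- as.numeric(table(factor_variable) / length(factor_variable))
  list(factor_variable = levels(factor_variable),   # one label per table() entry
       porc = porc,
       porcentaje = paste(round(porc * 100), "%"))  # scale first, then round
}

res <- DT[, grp_fixed(factor_variable), by = "second_factor_variable"]
print(res)
```

Each group returns one row per level of factor_variable, including levels with a share of zero, and the porc values within a group sum to 1. The same helper can be used with by="second_factor_variable,third_factor_variable" if the third factor is added to the data.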