Collapse factor levels for all factor variables in a data frame based on count

Problem description

I would like to keep only the top 2 factor levels of each factor column, ranked by frequency, and group all other levels into "others". I tried the following, but it doesn't work.

df=data.frame(a=as.factor(c(rep('D',3),rep('B',5),rep('C',2))),
              b=as.factor(c(rep('A',5),rep('B',5))),
              c=as.factor(c(rep('A',3),rep('B',5),rep('C',2))))

myfun=function(x){
    if(is.factor(x)){
        levels(x)[!levels(x) %in% names(sort(table(x),decreasing = T)[1:2])]='Others'
    }
}

df=as.data.frame(lapply(df, myfun))
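
One likely reason the attempt above fails is that myfun never returns x. An R function returns the value of its last evaluated expression, and here that is the assignment, whose value is the string 'Others' rather than the modified factor (non-factor columns would come back as NULL). The sketch below is one possible repair, not the answer's method; the names myfun_fixed and df_fixed are hypothetical, and it uses lowercase 'others' to match the expected output.

```r
# Sample data from the question
df <- data.frame(a = as.factor(c(rep('D', 3), rep('B', 5), rep('C', 2))),
                 b = as.factor(c(rep('A', 5), rep('B', 5))),
                 c = as.factor(c(rep('A', 3), rep('B', 5), rep('C', 2))))

# Hypothetical repaired version of myfun: same idea, but it returns x
myfun_fixed <- function(x) {
  if (is.factor(x)) {
    # Names of the two most frequent levels
    top2 <- names(sort(table(x), decreasing = TRUE))[1:2]
    # Rename every other level; duplicated level names are merged into one
    levels(x)[!levels(x) %in% top2] <- 'others'
  }
  x  # return the (possibly modified) column
}

df_fixed <- as.data.frame(lapply(df, myfun_fixed))
print(df_fixed)
```

Because the levels are renamed by name rather than by position, this version does not depend on the order in which the levels are stored.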

Expected output

       a b      c
       D A      A
       D A      A
       D A      A
       B A      B
       B A      B
       B B      B
       B B      B
       B B      B
  others B others
  others B others

Recommended answer

This might get a bit messy, but here is one approach using base R:

fun1 <- function(x) {
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}
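
A quick check on column a from the question shows how this first version behaves. It is only an illustrative sketch: fun1_first is a hypothetical name for a copy of the function above. Since levels(x) <- ... assigns the new names by position, and the stored levels are in alphabetical order (B, C, D) rather than frequency order (B, D, C), the labels end up on the wrong values.

```r
# Column a from the question: D appears 3 times, B 5 times, C 2 times
a <- as.factor(c(rep('D', 3), rep('B', 5), rep('C', 2)))

# Copy of the first version of fun1 (hypothetical name)
fun1_first <- function(x) {
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}

res_first <- fun1_first(a)

# Side-by-side comparison: original D becomes 'others', original C becomes 'D'
print(data.frame(original = a, relabelled = res_first))
```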

However, the function above assigns the new level names by position, so the factor's levels first need to be re-ordered by frequency. As the OP states in a comment, the correct version is:

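fun1 <- function(x) {
  # Re-order the levels by decreasing frequency
  x <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
  # Keep the two most frequent level names and rename the rest to 'others'
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}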
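
As a usage sketch, the corrected function can be applied to every factor column of the question's data frame. The is.factor guard and the name df_out are additions for illustration and are not part of the answer; the guard simply leaves non-factor columns untouched.

```r
# Sample data from the question
df <- data.frame(a = as.factor(c(rep('D', 3), rep('B', 5), rep('C', 2))),
                 b = as.factor(c(rep('A', 5), rep('B', 5))),
                 c = as.factor(c(rep('A', 3), rep('B', 5), rep('C', 2))))

# Corrected function from the answer
fun1 <- function(x) {
  x <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}

# Apply fun1 to factor columns only; df_out[] keeps the data frame structure
df_out <- df
df_out[] <- lapply(df, function(col) if (is.factor(col)) fun1(col) else col)
print(df_out)
```

Note that fun1 assumes each factor has at least two levels: with a single level, length(levels(x)) - 2 is negative and rep() raises an error.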
