Problem description

I'm trying to collapse a nominal categorical vector by combining low-frequency categories into an 'Other' category.

The data (a column of a data frame) looks like this, and contains information for all 50 states:

California
Florida
Alabama
...

table(colname)/length(colname) correctly returns the frequencies, and what I'm trying to do is lump anything below a given threshold (say f = 0.02) together. What is the correct approach?

Recommended answer

From the sound of it, something like the following should work for you:

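condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  # Names of the categories whose relative frequency is below the threshold
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  # Relabel every element that belongs to one of those categories
  vector[vector %in% toCondense] <- newName
  vector
}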
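
To make the steps inside the function easier to follow, here is a minimal sketch that runs them one at a time. The states vector and the 0.2 threshold are hypothetical toy values chosen for illustration; they are not from the original question.

```r
# Hypothetical toy data: 10 observations across 3 states
states <- c(rep("California", 6), rep("Florida", 3), "Alabama")

# Relative frequencies; same values as table(states) / length(states)
freqs <- prop.table(table(states))

# Labels whose share falls below the (illustrative) threshold of 0.2
rare <- names(which(freqs < 0.2))
rare
# "Alabama"

# Overwrite the rare labels in place
states[states %in% rare] <- "Other"
table(states)
# California 6, Florida 3, Other 1
```

The comparison is strict, so a category whose share equals the threshold exactly is kept rather than lumped.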

Try it out:

## Sample data
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))

round(prop.table(table(a)), 2)
# a
#    a    A    b    B    c    C    d    D    e    E    f    g    h
# 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13
#    i    j
# 0.08 0.07

a
#  [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
# [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"

condenseMe(a)
#  [1] "c"     "d"     "d"     "e"     "j"     "h"     "c"     "h"
#  [9] "g"     "i"     "g"     "d"     "f"     "Other" "g"     "h"
# [17] "h"     "a"     "b"     "h"     "e"     "g"     "h"     "b"
# [25] "d"     "e"     "e"     "g"     "i"     "f"     "d"     "e"
# [33] "g"     "c"     "g"     "a"     "Other" "i"     "i"     "b"
# [41] "i"     "j"     "f"     "d"     "c"     "h"     "Other" "j"
# [49] "j"     "c"     "Other" "e"     "f"     "a"     "a"     "h"
# [57] "e"     "c"     "Other" "b"

Note, however, that if you are dealing with factors, you should convert them with as.character first. Otherwise the assignment tries to store a value that is not an existing level, which produces NA values along with a warning.
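
As one possible way to handle that case, the sketch below wraps the same logic in a function that accepts either a factor or a character vector. The name condenseFactor, the toy data frame, and the 0.2 threshold are all hypothetical and not part of the original answer.

```r
# Hypothetical variant of condenseMe() that also accepts factors
condenseFactor <- function(x, threshold = 0.02, newName = "Other") {
  wasFactor <- is.factor(x)
  x <- as.character(x)  # work on plain labels so a new value can be assigned
  rare <- names(which(prop.table(table(x)) < threshold))
  x[x %in% rare] <- newName
  # Rebuild the factor so only the surviving labels remain as levels
  if (wasFactor) factor(x) else x
}

# Toy data frame column stored as a factor
df <- data.frame(
  state = factor(c(rep("California", 6), rep("Florida", 3), "Alabama"))
)
df$state <- condenseFactor(df$state, threshold = 0.2)

levels(df$state)
# "California" "Florida" "Other"
```

Rebuilding with factor() drops the lumped levels entirely. If the original level order matters, it would have to be restored explicitly, which this sketch does not attempt.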


08-20 01:03