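New R user here. I'm trying to split a data set into deciles using cut, following the process in this question. I want to add the decile values as a new column in a data frame, but when I do, the lowest value is listed as NA for some reason. This happens whether include.lowest is TRUE or FALSE. Does anyone know why?
It also happens with this sample set, so it isn't unique to my data.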
> decile <- cut(data, quantile(data, (0:10)/10, labels=TRUE, include.lowest=FALSE))
> df <- cbind(data, decile)
> df
data decile
[1,] 1 NA
[2,] 2 1
[3,] 3 2
[4,] 4 2
[5,] 5 3
[6,] 6 3
[7,] 7 4
[8,] 8 4
[9,] 9 5
[10,] 10 5
[11,] 11 6
[12,] 12 6
[13,] 13 7
[14,] 14 7
[15,] 15 8
[16,] 16 8
[17,] 17 9
[18,] 18 9
[19,] 19 10
[20,] 20 10
Best answer
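There are two problems. First, your cut call is malformed. I think you meant
cut(data, quantile(data, (0:10)/10), include.lowest=FALSE)
##                                ^missing parenthesis
As written, labels and include.lowest end up inside the quantile call rather than the cut call, so cut most likely never sees them; that would explain why switching include.lowest made no difference. Also, labels should be FALSE, NULL, or a vector of length(breaks) - 1 containing the desired labels (one per interval). Second, and this is the main problem, include.lowest=FALSE is in effect and data[1] is 1, which coincides with the first break defined:
> quantile(data, (0:10)/10)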
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
1.0 2.9 4.8 6.7 8.6 10.5 12.4 14.3 16.2 18.1 20.0
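A quick way to check this is to count the NA values under each setting. This is only a minimal sketch: it assumes the sample set is 1:20 (inferred from the printed output, not stated in the question), and the names breaks, codes_default and codes_lowest are made up for illustration.

```r
# Assumption: the sample set is 1:20, as the printed output suggests.
data <- 1:20
breaks <- quantile(data, (0:10)/10)

# Default include.lowest = FALSE: every interval is (a, b], so a value
# equal to the smallest break is not covered by any interval.
codes_default <- cut(data, breaks, labels = FALSE)
which(is.na(codes_default))   # position of the value that got NA

# include.lowest = TRUE closes the first interval on the left: [a, b].
codes_lowest <- cut(data, breaks, labels = FALSE, include.lowest = TRUE)
anyNA(codes_lowest)
```

With labels = FALSE, cut returns plain integer codes instead of a factor, which makes the missing value easy to spot.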
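The value 1 does not fall into any category: it sits exactly on the lower limit of the intervals defined by your breaks, and that limit is excluded. I'm not sure which behavior you want, but you could try one of these two options, depending on which class you want 1 to land in:
> cut(data, quantile(data, (0:10)/10), include.lowest=TRUE)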
[1] [1,2.9] [1,2.9] (2.9,4.8] (2.9,4.8] (4.8,6.7] (4.8,6.7]
[7] (6.7,8.6] (6.7,8.6] (8.6,10.5] (8.6,10.5] (10.5,12.4] (10.5,12.4]
[13] (12.4,14.3] (12.4,14.3] (14.3,16.2] (14.3,16.2] (16.2,18.1] (16.2,18.1]
[19] (18.1,20] (18.1,20]
10 Levels: [1,2.9] (2.9,4.8] (4.8,6.7] (6.7,8.6] (8.6,10.5] ... (18.1,20]
> cut(data, c(0, quantile(data, (0:10)/10)), include.lowest=FALSE)
[1] (0,1] (1,2.9] (2.9,4.8] (2.9,4.8] (4.8,6.7] (4.8,6.7]
[7] (6.7,8.6] (6.7,8.6] (8.6,10.5] (8.6,10.5] (10.5,12.4] (10.5,12.4]
[13] (12.4,14.3] (12.4,14.3] (14.3,16.2] (14.3,16.2] (16.2,18.1] (16.2,18.1]
[19] (18.1,20] (18.1,20]
11 Levels: (0,1] (1,2.9] (2.9,4.8] (4.8,6.7] (6.7,8.6] ... (18.1,20]
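To get back to the original goal of a decile column, the first option can be combined with data.frame. The following is a sketch under the same assumption that the sample set is 1:20; the column names are taken from the question, and labels = FALSE is one possible choice for getting decile numbers rather than interval labels.

```r
# Assumption: the sample set is 1:20, as the printed output suggests.
data <- 1:20
breaks <- quantile(data, (0:10)/10)

# Build the data frame directly; labels = FALSE yields integer decile
# numbers, and include.lowest = TRUE keeps the minimum in decile 1.
df <- data.frame(
  data   = data,
  decile = cut(data, breaks, labels = FALSE, include.lowest = TRUE)
)

head(df, 3)
table(df$decile)   # how many values fall in each decile
```

Using data.frame instead of cbind also lets you keep the interval labels as a factor column if you drop labels = FALSE, whereas the cbind output in the question only shows integer codes.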
Source: a similar question about getting NA when adding a decile column with cut() was found on Stack Overflow: https://stackoverflow.com/questions/17932617/