Problem description
I want to partition a vector (length around 10^5) into five classes. With the function classIntervals from the package classInt I wanted to use style = "jenks" natural breaks, but this takes an inordinate amount of time even for a much smaller vector of only 500 values. Setting style = "kmeans" executes almost instantaneously.
library(classInt)
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)
R> system.time(classIntervals(x, n = 5, style = "jenks"))
   user  system elapsed
  13.46    0.00   13.45
R> system.time(classIntervals(x, n = 5, style = "kmeans"))
   user  system elapsed
   0.02    0.00    0.02
What makes the Jenks algorithm so slow, and is there a faster way to run it?
If need be, I will move the last two parts of the question to stats.stackexchange.com:
- Under what circumstances is kmeans a reasonable substitute for Jenks?
- Is it reasonable to define the classes by running classInt on a random 1% subset of the data points?
Recommended answer
To answer your original question:
There is now a faster way to apply the Jenks algorithm: the getJenksBreaks function in the BAMMtools package.
However, be aware that the two functions count differently: classIntervals takes the number of classes, whereas getJenksBreaks takes the number of break points, including the two end points. If you set n = 5 in the classIntervals function of the classInt package, you have to pass 6 to the getJenksBreaks function in the BAMMtools package to get the same results.
# Install and load library
install.packages("BAMMtools")
library(BAMMtools)
# Set up example data
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)
# Apply function
getJenksBreaks(x, 6)
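The difference in how the two functions count can be checked side by side. The following is a minimal sketch, assuming both packages are installed; the names n_classes, ci_breaks and bt_breaks are made up for illustration, and the data are flattened to a plain vector with as.vector for clarity.

```r
library(classInt)
library(BAMMtools)

# Same example data as above, flattened from a 100 x 5 matrix to a vector
set.seed(1)
x <- as.vector(mapply(rnorm, n = 100, mean = (1:5) * 5))

n_classes <- 5  # illustrative name: the number of classes wanted

# classIntervals takes the number of classes and stores n + 1 break points in $brks
ci_breaks <- classIntervals(x, n = n_classes, style = "jenks")$brks

# getJenksBreaks takes the number of break points, so ask for one more
bt_breaks <- getJenksBreaks(x, n_classes + 1)

# Both vectors hold six break points; print them next to each other to compare
length(ci_breaks)
length(bt_breaks)
cbind(classInt = ci_breaks, BAMMtools = bt_breaks)
```

If the two columns agree on your data, the breaks from getJenksBreaks can be used wherever the classIntervals breaks were used before.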
The speed-up is huge, i.e.
> microbenchmark(getJenksBreaks(x, 6, subset = NULL), classIntervals(x, n = 5, style = "jenks"), unit = "s", times = 10)
Unit: seconds
                                      expr         min          lq        mean      median          uq         max neval cld
       getJenksBreaks(x, 6, subset = NULL) 0.002824861 0.003038748 0.003270575 0.003145692 0.003464058 0.004263771    10   a
 classIntervals(x, n = 5, style = "jenks") 2.008109622 2.033353970 2.094278189 2.103680325 2.111840853 2.231148846    10
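For a vector of the size mentioned in the question (around 10^5 values), the subset argument seen in the benchmark call may help further. According to the package documentation it takes a number of regularly spaced samples from the input, not a random sample. The sketch below is only an illustration: the sample size of 1000 and the names big_x, approx_breaks and classes are assumptions, and whether the approximate breaks are good enough depends on your data.

```r
library(BAMMtools)

# Hypothetical large vector: 5 groups of 20000 values, 1e5 in total
set.seed(1)
big_x <- as.vector(mapply(rnorm, n = 20000, mean = (1:5) * 5))

# Compute 6 break points (5 classes) from roughly 1000 regularly spaced samples.
# The sample size is an assumption chosen for illustration, not a recommendation.
approx_breaks <- getJenksBreaks(big_x, 6, subset = 1000)

# The sample may miss the extremes of the full vector, so widen the outer
# breaks to the full data range before classifying every value.
approx_breaks[1] <- min(big_x)
approx_breaks[length(approx_breaks)] <- max(big_x)

# Assign each value to one of the five classes and count the class sizes
classes <- cut(big_x, breaks = approx_breaks, include.lowest = TRUE, labels = FALSE)
table(classes)
```

Widening the outer breaks is a design choice made here so that cut does not return NA for values outside the sampled range.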