I know this has been answered online many times, but since it depends on the dataset, I would like to know whether there is a simple way to find the optimal k value for the kNN algorithm with a relatively simple dataset.
My response variable is the behavioural class (column E: Event), and my predictors are the three axes of an activity sensor (columns B to D). Below is what my data look like.
Here is the code I wrote to run the kNN analysis. The datanet object looks just like the sample image I uploaded. I use the first 150 rows as the training set and the remaining rows [151 to 240] as the test set.
In this example I used a k value of 10, but after running the script for different k values I obviously get different outputs, so I would like to know the best way to choose the k value that best suits my dataset. In particular, I need help coding this in R.

library(data.table)

#From the file "Collar_#.txt", just select the columns ACTIVITY_X, ACTIVITY_Y, ACTIVITY_Z and Event
dataraw<-fread("Collar_41361.txt", select = c("ACTIVITY_X","ACTIVITY_Y","ACTIVITY_Z","Event"))

#Now, delete all rows containing the string "End"
datanet<-dataraw[!grepl("End", dataraw$Event),]

#Then, read only the columns ACTIVITY_X, ACTIVITY_Y and ACTIVITY_Z for a selected interval that will act as a training set
trainset <- datanet[1:150, !"Event"]
View(trainset)

#Create the behavioural classes. Note that the rows must cover the same interval as the trainset dataset
behaviour<-datanet[1:150,!1:3]
View(behaviour)

#Test file. This file contains sensor data only; behaviours will be assigned based on the trainset and behaviour datasets
testset<-datanet[151:240,!"Event"]
View(testset)

#Converting inputs into matrices
train = as.matrix(trainset)
test = as.matrix(testset)
classes = as.matrix(behaviour)

library(stats)
library(class)

#Now running the algorithm. But first we set the k value.

kk = 10

kn1 = knn(train, test, classes, k=kk, prob=TRUE)

prob = attributes(kn1)
clas1=factor(kn1)

#Write results, this is the classification of the testing set in a single column
filename = paste("results", kk, ".csv", sep="")
write.csv(clas1, filename)

#Write probs to file, this is the proportion of k nearest datapoints that contributed to the winning class
fileprobs = paste("probs", kk, ".csv", sep="")
write.csv (prob$prob, fileprobs)
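
One simple way to compare k values directly with class::knn is to loop over a range of k and score each against held-out labels. The following is only a sketch, not part of the original script, and it assumes that datanet$Event for rows 151 to 240 holds the true behavioural classes of the test rows:

library(class)

#Assumption: datanet$Event for rows 151:240 holds the true classes of the test rows
true_classes <- datanet$Event[151:240]

k_values <- seq(1, 25, by = 2)   #odd values reduce the chance of voting ties
accuracy <- sapply(k_values, function(kk) {
  pred <- knn(train, test, classes, k = kk)
  mean(as.character(pred) == as.character(true_classes))
})

#Accuracy per k; the k with the highest accuracy is a candidate choice
data.frame(k = k_values, accuracy = accuracy)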

I have also uploaded a sample of the script output. Column D shows the actual behavioural class for the values in columns A to C, and columns E, G, I, K, M and O show the classes assigned by the algorithm for different K values, based on training on rows [1:150].
Thanks for your help!!!

Best answer

Finding K in KNN is not an easy task: a small value of K means that noise will have a greater influence on the result, while a large value makes the computation more expensive.
I often see people use K = sqrt(N). But if you want to find a better K for your scenario, use knn from the caret package; an example follows the short sketch below.
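
A minimal sketch of the K = sqrt(N) rule of thumb, assuming the train matrix built in the question's script (the value is only a starting point, not a tuned choice):

#Rule of thumb: k close to the square root of the number of training rows
k_guess <- round(sqrt(nrow(train)))   #sqrt(150) is roughly 12
k_guess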

library(ISLR)
library(caret)

# Split the data:
data(iris)
indxTrain <- createDataPartition(y = iris$Sepal.Length,p = 0.75,list = FALSE)
training <- iris[indxTrain,]
testing <- iris[-indxTrain,]

# Run k-NN:
set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
knnFit <- train(Species ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"),tuneLength = 20)
knnFit

#Use plots to see the optimal number of neighbours:
#Plotting yields number of neighbours vs. accuracy (based on repeated cross-validation)
plot(knnFit)

[Plot: number of neighbours vs. accuracy from repeated cross-validation]
This shows that accuracy is highest at k = 5, so the value chosen for K is 5.
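
To pull the selected k out of the fitted object programmatically and reuse it with class::knn on the collar data, a sketch (assuming the train, test and classes objects from the question's script are still in the workspace):

library(class)

best_k <- knnFit$bestTune$k   #k chosen by repeated cross-validation in caret
best_k
#Re-run the question's classification with the tuned k
kn_best <- knn(train, test, classes, k = best_k, prob = TRUE)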
