I know this has been answered online many times, but since it depends on the dataset, I would like to know whether there is a simple way to find the optimal k value for the kNN algorithm with a relatively simple dataset.
My response variable is the behavioural class (column E: Event), and my predictors are the three axes of an activity sensor (columns B to D). Below is what my data look like.
Here is the code I wrote to run the kNN analysis. The datanet object looks just like the sample image I uploaded. I use the first 150 rows as the training set and the remaining rows [151 to 240] as the test set.
In this example I used a k value of 10, but after running the script for different k values I obviously get different outputs, so I would like to know the best way to choose the k value that best suits my dataset. In particular, I need help coding this in R.

library(data.table)

#From the file "Collar_#.txt", just select the columns ACTIVITY_X, ACTIVITY_Y, ACTIVITY_Z and Event
dataraw<-fread("Collar_41361.txt", select = c("ACTIVITY_X","ACTIVITY_Y","ACTIVITY_Z","Event"))

#Now, delete all rows containing the string "End"
datanet<-dataraw[!grepl("End", dataraw$Event),]

#Then, read only the columns ACTIVITY_X, ACTIVITY_Y and ACTIVITY_Z for a selected interval that will act as a training set
trainset <- datanet[1:150, !"Event"]
View(trainset)

#Create the behavioural classes. Note that the rows must cover the same interval as the trainset dataset
behaviour<-datanet[1:150,!1:3]
View(behaviour)

#Test file. This file contains sensor data only; behaviours will be assigned based on the trainset and behaviour datasets
testset<-datanet[151:240,!"Event"]
View(testset)

#Converting inputs into matrices
train = as.matrix(trainset)
test = as.matrix(testset)
classes = as.matrix(behaviour)

library(stats)
library(class)

#Now running the algorithm. But first we set the k value.

kk = 10

kn1 = knn(train, test, classes, k=kk, prob=TRUE)

prob = attributes(kn1)
clas1=factor(kn1)

#Write results, this is the classification of the testing set in a single column
filename = paste("results", kk, ".csv", sep="")
write.csv(clas1, filename)

#Write probs to file, this is the proportion of k nearest datapoints that contributed to the winning class
fileprobs = paste("probs", kk, ".csv", sep="")
write.csv (prob$prob, fileprobs)
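
One simple way to compare k values directly with class::knn is to loop over a range of k and score each against held-out labels. The following is only a sketch, not part of the original script, and it assumes that datanet$Event for rows 151 to 240 holds the true behavioural classes of the test rows:

library(class)

#Assumption: datanet$Event for rows 151:240 holds the true classes of the test rows
true_classes <- datanet$Event[151:240]

k_values <- seq(1, 25, by = 2)   #odd values reduce the chance of voting ties
accuracy <- sapply(k_values, function(kk) {
  pred <- knn(train, test, classes, k = kk)
  mean(as.character(pred) == as.character(true_classes))
})

#Accuracy per k; the k with the highest accuracy is a candidate choice
data.frame(k = k_values, accuracy = accuracy)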

I have also uploaded a sample of the script output. Column D shows the actual behavioural class for the values in columns A to C, and columns E, G, I, K, M and O show the classes assigned by the algorithm for different K values, based on training on rows [1:150].
Thanks for your help!!!

Best answer

Finding K in KNN is not an easy task: a small value of K means that noise will have a greater influence on the result, while a large value makes the computation more expensive.
I often see people use K = sqrt(N). But if you want to find a better K for your scenario, use knn from the caret package; an example follows the short sketch below.
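
A minimal sketch of the K = sqrt(N) rule of thumb, assuming the train matrix built in the question's script (the value is only a starting point, not a tuned choice):

#Rule of thumb: k close to the square root of the number of training rows
k_guess <- round(sqrt(nrow(train)))   #sqrt(150) is roughly 12
k_guess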

library(ISLR)
library(caret)

# Split the data:
data(iris)
indxTrain <- createDataPartition(y = iris$Sepal.Length,p = 0.75,list = FALSE)
training <- iris[indxTrain,]
testing <- iris[-indxTrain,]

# Run k-NN:
set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
knnFit <- train(Species ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"),tuneLength = 20)
knnFit

#Use plots to see the optimal number of neighbours:
#Plotting yields number of neighbours vs. accuracy (based on repeated cross-validation)
plot(knnFit)

[Plot: number of neighbours vs. accuracy from repeated cross-validation]
This shows that accuracy is highest at k = 5, so the value chosen for K is 5.
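
To pull the selected k out of the fitted object programmatically and reuse it with class::knn on the collar data, a sketch (assuming the train, test and classes objects from the question's script are still in the workspace):

library(class)

best_k <- knnFit$bestTune$k   #k chosen by repeated cross-validation in caret
best_k
#Re-run the question's classification with the tuned k
kn_best <- knn(train, test, classes, k = best_k, prob = TRUE)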
