Problem description
I want to implement Tomek links for dealing with unbalanced data. This code is used for a binary classification problem, where class 0 is the majority class and class 1 is the minority. X is the input and Y the output. I've written the following code, but I'm looking for a way to speed up the computation.
How can I improve my code?
#########################
# Remove overlapping observations using Tomek links.
# Given observations i and j belonging to different classes,
# (i,j) is a Tomek link if there is NO example z such that d(i,z) < d(i,j) or d(j,z) < d(i,j).
# Find the Tomek links and remove only those observations of each link that belong to the majority class (0 class).
#########################
tomekLink <- function(X, Y, distType = "euclidean") {
  i.1 <- which(Y == 1)
  i.0 <- which(Y == 0)
  X.1 <- X[i.1, ]
  X.0 <- X[i.0, ]
  i.tomekLink <- NULL
  j.tomekLink <- NULL
  # i and j belong to different classes
  timeTomek <- system.time({
    for (i in i.1) {
      for (j in i.0) {
        d <- dst(X, i, j, distType)
        obsleft <- setdiff(1:nrow(X), c(i, j))
        for (z in obsleft) {
          if (dst(X, i, z, distType) < d || dst(X, j, z, distType) < d) {
            break # (i,j) is not a Tomek link, get next pair (i,j)
          }
          # if z is the last obs and d(i,z) > d(i,j) and d(j,z) > d(i,j), then (i,j) is a Tomek link
          if (z == obsleft[length(obsleft)]) {
            if (dst(X, i, z, distType) > d && dst(X, j, z, distType) > d) {
              # (i,j) is a Tomek link
              # cat("\n tomeklink obs", i, "and", j)
              i.tomekLink <- c(i.tomekLink, i)
              j.tomekLink <- c(j.tomekLink, j)
              # since we want to eliminate only majority class observations,
              # remove j from i.0 to speed up the loop
              i.0 <- setdiff(i.0, j)
            }
          }
        }
      }
    }
  })
  print(paste("Time to find Tomek links:", round(timeTomek[3], digits = 2)))
  # id2keep <- setdiff(1:nrow(X), c(i.tomekLink, j.tomekLink))
  id2keep <- setdiff(1:nrow(X), j.tomekLink)
  cat("number of obs removed using Tomek links", nrow(X) - length(id2keep), "\n",
      (nrow(X) - length(id2keep)) / nrow(X) * 100, "% of training ;",
      (length(j.tomekLink)) / length(which(Y == 0)) * 100, "% of 0 class")
  X <- X[id2keep, ]
  Y <- Y[id2keep]
  cat("\n prop of 1 after TomekLink:", (length(which(Y == 1)) / length(Y)) * 100, "% \n")
  return(list(X = X, Y = Y))
}
# distance measure used in the tomekLink function
dst <- function(X, i, j, distType = "euclidean") {
  d <- dist(rbind(X[i, ], X[j, ]), method = distType)
  return(d)
}
Recommended answer
I haven't tested your code, but at first glance it seems that preallocation would help. Don't use i.tomekLink=c(i.tomekLink,i), since growing a vector with c() copies it on every append; instead, allocate the memory for storing the Tomek links up front.
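As a rough illustration of the preallocation idea (not taken from the question's code), the sketch below fills a vector that is allocated once and trimmed at the end. The bound `n`, the counter `k`, and the `j %% 3L == 0L` condition are placeholders made up for this example; in tomekLink the bound could be length(i.0), since at most every majority-class observation is removed.

```r
# Sketch: store results in a preallocated vector instead of growing it with c().
n <- 1000L              # hypothetical upper bound on the number of links
links <- integer(n)     # allocate the storage once
k <- 0L                 # number of slots filled so far

for (j in seq_len(n)) {
  if (j %% 3L == 0L) {  # placeholder for "j is part of a Tomek link"
    k <- k + 1L
    links[k] <- j       # write into an existing slot, no reallocation
  }
}

links <- links[seq_len(k)]  # drop the unused tail
```

A logical flag vector of length nrow(X), set to TRUE for each removed observation, would work just as well and avoids the counter.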
Another idea is to calculate the distance matrix between all pairs of samples and look at the nearest neighbour of each sample. If two samples from different classes are each other's nearest neighbour, they form a Tomek link.
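One possible way to sketch the distance-matrix idea in base R is shown below. The function name `tomekLinkFast`, its `majority` argument, and the `removed` element of the result are assumptions made for this illustration, not part of the original code. It treats a pair as a Tomek link when the two samples are mutual nearest neighbours with different labels, and it assumes there are no exact ties in the distances, so tied pairs may be handled differently from the loop version. The full n x n matrix also needs memory quadratic in the number of observations.

```r
# Sketch of a vectorised Tomek-link search (hypothetical helper, assumptions above).
tomekLinkFast <- function(X, Y, distType = "euclidean", majority = 0) {
  D <- as.matrix(dist(X, method = distType))  # n x n pairwise distances
  diag(D) <- Inf                              # a point is not its own neighbour
  nn <- unname(apply(D, 1, which.min))        # nearest neighbour of each row
  idx <- seq_len(nrow(D))
  mutual <- nn[nn] == idx                     # i and nn[i] are nearest to each other
  link <- mutual & (Y != Y[nn])               # ...and have different labels
  rm.idx <- idx[link & Y == majority]         # majority-class member of each link
  keep <- setdiff(idx, rm.idx)
  list(X = X[keep, , drop = FALSE], Y = Y[keep], removed = rm.idx)
}

# Toy usage: five one-dimensional points, class 0 treated as the majority.
X <- matrix(c(0, 1, 5, 6.5, 10), ncol = 1)
Y <- c(0, 1, 1, 0, 0)
res <- tomekLinkFast(X, Y)
```

This replaces the three nested loops with a single call to dist() and a row-wise which.min, so each pairwise distance is computed only once.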