I'm performing internal Monte Carlo cross-validation (MCCV) with an SVM classifier on a microarray data set, using the Bioconductor package CMA. CMA uses the e1071 R package internally for the SVM work.
The data set has 387 variables (attributes) for 45 samples (observations), each belonging to one of two classes (labels 0 or 1, in roughly a 1:1 ratio). All data are numeric, with no NAs. I'm attempting a 1000-iteration MCCV of an SVM built from 15 variables selected with limma statistics for differential gene expression. In each MCCV iteration, part of the 45-sample set is used to train the SVM classifier, which is then tested on the remaining samples, and I'm trying different values for the size of the training part. CMA also performs an inner-loop validation (by default, 3-fold cross-validation within the training set) to fine-tune the classifier that will be used in the test-set cross-validation. All of this is done from within the CMA package.
Sometimes, for the smaller training-set sizes, CMA shows an error in the console and stops before executing the rest of the classification code:
[snip]
tuning iteration 575
tuning iteration 576
tuning iteration 577
Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty!
It occurs even when I use a test other than limma's for variable selection, or use two instead of 15 variables for classifier generation. The R code I use should ensure that the training-sets always have members of both classes. I would appreciate any insight on this.
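As a quick sanity check of that assumption, the class make-up of every training set can be inspected directly from the learning-set object (a sketch, using the lsets and classes objects defined in the code below):

# count the learning sets whose training part contains only one class
# (lsets and classes as defined in the full code below)
sum(apply(lsets@learnmatrix, 1, function(idx) length(unique(classes[idx])) < 2))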
Below is the R code I use, with Mac OS X 10.6.6, R 2.12.1, Biobase 2.10.0, CMA 1.8.1, limma 3.6.9, and WilcoxCV 1.0.2. The data file hy3ExpHsaMir.txt can be downloaded from http://rapidshare.com/files/447062901/hy3ExpHsaMir.txt.
Everything goes OK until g reaches 9 in the for(g in 0:10) loop (which varies the training/test-set sizes); at that point ntest is 30, leaving only 15 samples for training.
# exp is the expression table (a matrix); 'classes' is the vector of known class labels
exp <- as.matrix(read.table(file='hy3ExpHsaMir.txt', sep='\t', row.names=1, header=T, check.names=F))
# best to use 0 and 1 as class labels (instead of 'p', 'g', etc.), with 1 for 'positive' items (positive for recurrence, disease, etc.)
classes <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
yesPredVal = 1 # class label for 'positive' items in 'classes'
library(CMA)
library(WilcoxCV)
# helper: apply a summary function y (mean, sd, min, max) to x, rounded to 4 decimals
myNumFun <- function(x, y){round(y(as.numeric(x), na.rm=T), 4)}
set.seed(631)
out = ''
out2 = '\nEffect of varying the training-set size:\nTraining-set size\tSuccessful iterations\tMean acc.\tSD acc.\tMean sens.\tSD sens.\tMean spec.\tSD spec.\tMean PPV\tSD PPV\tMean NPV\tSD NPV\tTotal genes in the classifiers\n'
niter = 1000
diffTest = 'limma'
diffGeneNum = 15
svmCost <- c(0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50)
for(g in 0:10){ # varying the training/test-set sizes
  ntest = 3+g*3 # test-set size
  result <- matrix(nrow=0, ncol=7)
  colnames(result) <- c('trainSetSize', 'iteration', 'acc', 'sens', 'spec', 'ppv', 'npv')
  diffGenes <- numeric()
  # generate training and test sets
  lsets <- GenerateLearningsets(n=ncol(exp), y=classes, method=c('MCCV'), niter=niter, ntrain=ncol(exp)-ntest)
  # actual prediction work
  svm <- classification(t(exp), factor(classes), learningsets=lsets, genesellist=list(method=diffTest), classifier=svmCMA, nbgene=diffGeneNum, tuninglist=list(grids=list(cost=svmCost)), probability=TRUE)
  svm <- join(svm)
  # genes in classifiers
  svmGenes <- GeneSelection(t(exp), classes, learningsets=lsets, method=diffTest)
  actualIters = 0
  for(h in 1:niter){
    m <- ntest*(h-1)
    # compute metrics only when the test set contains both classes
    if(1 < length(unique(classes[-lsets@learnmatrix[h,]]))){
      actualIters = actualIters+1
      tp <- tn <- fp <- fn <- 0
      for(i in 1:ntest){
        pred <- svm@yhat[m+i]
        known <- svm@y[m+i]
        if(pred == known){
          if(pred == yesPredVal){tp <- tp+1}
          else{tn <- tn+1}
        }else{
          if(pred == yesPredVal){fp <- fp+1}
          else{fn <- fn+1}
        }
      }
      result <- rbind(result, c(ncol(exp)-ntest, h, (tp+tn)/(tp+tn+fp+fn), tp/(tp+fn), tn/(tn+fp), tp/(tp+fp), tn/(tn+fn)))
      diffGenes <- c(diffGenes, toplist(svmGenes, k=diffGeneNum, iter=h, show=F)$index)
    } # end if both classes present
  } # end for h
  # output accuracy, etc.
  out = paste(out, 'SVM MCCV using ', niter, ' attempted iterations and ', actualIters, ' successful iterations, with ', ncol(exp)-ntest, ' of ', ncol(exp), ' total samples used for training:\nThe means (ranges; SDs) of prediction accuracy, sensitivity, specificity, PPV and NPV in fractions are ',
    myNumFun(result[, 'acc'], mean), ' (', myNumFun(result[, 'acc'], min), '-', myNumFun(result[, 'acc'], max), '; ', myNumFun(result[, 'acc'], sd), '), ',
    myNumFun(result[, 'sens'], mean), ' (', myNumFun(result[, 'sens'], min), '-', myNumFun(result[, 'sens'], max), '; ', myNumFun(result[, 'sens'], sd), '), ',
    myNumFun(result[, 'spec'], mean), ' (', myNumFun(result[, 'spec'], min), '-', myNumFun(result[, 'spec'], max), '; ', myNumFun(result[, 'spec'], sd), '), ',
    myNumFun(result[, 'ppv'], mean), ' (', myNumFun(result[, 'ppv'], min), '-', myNumFun(result[, 'ppv'], max), '; ', myNumFun(result[, 'ppv'], sd), '), and ',
    myNumFun(result[, 'npv'], mean), ' (', myNumFun(result[, 'npv'], min), '-', myNumFun(result[, 'npv'], max), '; ', myNumFun(result[, 'npv'], sd), '), respectively.\n', sep='')
  # output classifier genes
  diffGenesUnq <- unique(diffGenes)
  out = paste(out, 'A total of ', length(diffGenesUnq), ' genes occur in the ', actualIters, ' classifiers, with occurrence frequencies in fractions:\n', sep='')
  for(i in 1:length(diffGenesUnq)){
    out = paste(out, rownames(exp)[diffGenesUnq[i]], '\t', round(sum(diffGenes == diffGenesUnq[i])/actualIters, 3), '\n', sep='')
  }
  # output split-size effect
  out2 = paste(out2, ncol(exp)-ntest, '\t', actualIters, '\t', myNumFun(result[, 'acc'], mean), '\t', myNumFun(result[, 'acc'], sd), '\t', myNumFun(result[, 'sens'], mean), '\t', myNumFun(result[, 'sens'], sd), '\t', myNumFun(result[, 'spec'], mean), '\t', myNumFun(result[, 'spec'], sd), '\t', myNumFun(result[, 'ppv'], mean), '\t', myNumFun(result[, 'ppv'], sd),
    '\t', myNumFun(result[, 'npv'], mean), '\t', myNumFun(result[, 'npv'], sd), '\t', length(diffGenesUnq), '\n', sep='')
} # end for g
cat(out, out2, sep='')
Output of traceback():
20: stop("Model is empty!")
19: predict.svm(ret, xhold, decision.values = TRUE)
18: predict(ret, xhold, decision.values = TRUE)
17: na.action(predict(ret, xhold, decision.values = TRUE))
16: svm.default(cost = 0.1, kernel = "linear", type = "C-classification", ...
15: svm(cost = 0.1, kernel = "linear", type = "C-classification", ...
14: do.call("svm", args = ll)
13: function (X, y, f, learnind, probability, models = FALSE, ...) ...
12: function (X, y, f, learnind, probability, models = FALSE, ...) ...
11: do.call(classifier, args = c(list(X = X, y = y, learnind = learnmatrix[i, ...
10: classification(X = c(83.5832768669369, 83.146333099001, 94.253534443549, ...
9: classification(X = c(83.5832768669369, 83.146333099001, 94.253534443549, ...
8: do.call("classification", args = c(list(X = Xi, y = yi, learningsets = lsi, ...
7: tune(grids = list(cost = c(0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50 ...
6: tune(grids = list(cost = c(0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50 ...
5: do.call("tune", args = c(tuninglist, ll))
4: classification(X, y = as.numeric(y) - 1, learningsets = learningsets, ...
3: classification(X, y = as.numeric(y) - 1, learningsets = learningsets, ...
2: classification(t(exp), factor(classes), learningsets = lsets, ...
1: classification(t(exp), factor(classes), learningsets = lsets, ...
Accepted answer:
The maintainer of the CMA package replied quickly to a message I sent about this problem.
CMA tunes the classifier generated from a training set by testing different parameter values in a k-fold CV step (default k = 3) within that training set. With small training-set sizes, this inner loop can fail if it draws a subset containing observations of only one class. Two ways to make this less likely are to use 2-fold inner CV and to specify stratified sampling, both of which require calling the tuning step separately through CMA's tune() and using its output in classification().
In the code I posted, tuning is invoked from within classification(), which allows neither stratified sampling nor 2-fold CV. However, calling tune() separately with stratified sampling and 2-fold CV did not help in my case. That is not surprising, because with small training sets CMA still ends up with subsets containing members of only one class.
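For reference, this is roughly what that separate tuning attempt looked like, a sketch assembled from the same calls as in the edit below, with stratified sampling and fold=2:

# separate tuning step: stratified sampling, 2-fold inner CV
svmGenes <- GeneSelection(t(exp), classes, learningsets=lsets, method=diffTest)
svmTune <- tune(t(exp), factor(classes), learningsets=lsets, genesel=svmGenes, classifier=svmCMA, nbgene=diffGeneNum, grids=list(cost=svmCost), strat=TRUE, fold=2)
svm <- classification(t(exp), factor(classes), learningsets=lsets, genesel=svmGenes, classifier=svmCMA, nbgene=diffGeneNum, tuneres=svmTune)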
I wish that, instead of ending abruptly, CMA would move on to the remaining learning sets when it hits a problem with one of them. It would also be good if CMA tried different k values for the inner k-fold CV step when this problem occurs.
[Edit, Feb 14] The learning sets that CMA generates for the CV are not checked for having enough members of both classes in the training set. So, replacing part of the code in the original post with the following should make things work:
numInnerFold = 3 # k for the k-fold inner validation called through tune()
# generate 2*niter learning sets, since some may have to be removed
lsets <- GenerateLearningsets(n=ncol(exp), y=classes, method=c('MCCV'), niter=2*niter, ntrain=ncol(exp)-ntest)
# drop learning sets whose training part lacks both classes, or has fewer
# members of either class than the inner CV has folds
badSets <- integer()
for(i in 1:(2*niter)){
  trainClasses <- classes[lsets@learnmatrix[i, ]]
  if((2 > length(unique(trainClasses)))
    | (numInnerFold > sum(trainClasses == yesPredVal))
    | (numInnerFold > sum(trainClasses != yesPredVal))){
    # cat('removed set', i, '\n')
    badSets <- c(badSets, i)
  }
}
temp <- if(0 < length(badSets)) lsets@learnmatrix[-badSets, ] else lsets@learnmatrix
# keep the first niter usable learning sets
lsets@learnmatrix <- temp[1:niter, ]
lsets@iter <- niter
# genes in classifiers
svmGenes <- GeneSelection(t(exp), classes, learningsets=lsets, method=diffTest)
svmTune <- tune(t(exp), factor(classes), learningsets=lsets, genesel=svmGenes, classifier=svmCMA, nbgene=diffGeneNum, grids=list(cost=svmCost), strat=T, fold=numInnerFold)
# actual prediction work
svm <- classification(t(exp), factor(classes), learningsets=lsets, genesel=svmGenes, classifier=svmCMA, nbgene=diffGeneNum, tuneres=svmTune)
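As a rough user-space approximation of the "try different k values" idea above (hypothetical; not built into CMA), the tuning step can also be wrapped in tryCatch() so that a failing fold count falls back to a smaller one:

# hypothetical fallback: retry the inner CV with k=2 if k=numInnerFold fails;
# uses the objects defined in the replacement code above
svmTune <- NULL
for(k in c(numInnerFold, 2)){
  svmTune <- tryCatch(tune(t(exp), factor(classes), learningsets=lsets, genesel=svmGenes, classifier=svmCMA, nbgene=diffGeneNum, grids=list(cost=svmCost), strat=T, fold=k), error=function(e) NULL)
  if(!is.null(svmTune)) break
}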