问题描述
我正在使用支持向量机执行文档分类任务!它将我的所有文章归入训练集中,但未能归类到我的测试集中!
trainDTM是我的训练集的文档术语矩阵。 testDTM是用于测试装置的套件。
这是我的代码(不是很漂亮):
I'm using a support vector machine for my document classification task! it classifies all my Articles in the training-set, but fails to classify the ones in my test-set!trainDTM is the document term matrix of my training-set. testDTM is the one for the test-set.here's my (not so beautiful) code:
# create data.frame with labelled sentences
labeled <- as.data.frame(read.xlsx("C:\\Users\\LABELED.xlsx", 1, header=T))
# create training set and test set
traindata <- as.data.frame(labeled[1:700,c("ARTICLE","CLASS")])
testdata <- as.data.frame(labeled[701:1000, c("ARTICLE","CLASS")])
# Vector, Source Transformation
trainvector <- as.vector(traindata$"ARTICLE")
testvector <- as.vector(testdata$"ARTICLE")
trainsource <- VectorSource(trainvector)
testsource <- VectorSource(testvector)
# CREATE CORPUS FOR DATA
traincorpus <- Corpus(trainsource)
testcorpus <- Corpus(testsource)
# my own stopwords
sw <- c("i", "me", "my")
## CLEAN TEXT
# FUNCTION FOR CLEANING
cleanCorpus <- function(corpus){
corpus.tmp <- tm_map(corpus, removePunctuation)
corpus.tmp <- tm_map(corpus.tmp,stripWhitespace)
corpus.tmp <- tm_map(corpus.tmp,tolower)
corpus.tmp <- tm_map(corpus.tmp, removeWords, sw)
corpus.tmp <- tm_map(corpus.tmp, removeNumbers)
corpus.tmp <- tm_map(corpus.tmp, stemDocument, language="en")
return(corpus.tmp)}
# CLEAN CORP WITH ABOVE FUNCTION
traincorpus.cln <- cleanCorpus(traincorpus)
testcorpus.cln <- cleanCorpus(testcorpus)
## CREATE N-GRAM DOCUMENT TERM MATRIX
# CREATE N-GRAM TOKENIZER
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
# CREATE DTM
trainmatrix.cln.bi <- DocumentTermMatrix(traincorpus.cln, control = list(tokenize = BigramTokenizer))
testmatrix.cln.bi <- DocumentTermMatrix(testcorpus.cln, control = list(tokenize = BigramTokenizer))
# REMOVE SPARSE TERMS
trainDTM <- removeSparseTerms(trainmatrix.cln.bi, 0.98)
testDTM <- removeSparseTerms(testmatrix.cln.bi, 0.98)
# train the model
SVM <- svm(as.matrix(trainDTM), as.factor(traindata$CLASS))
# get classifications for training-set
results.train <- predict(SVM, as.matrix(trainDTM)) # works fine!
# get classifications for test-set
results <- predict(SVM,as.matrix(testDTM))
Error in scale.default(newdata[, object$scaled, drop = FALSE], center = object$x.scale$"scaled:center", :
length of 'center' must equal the number of columns of 'x'
我不明白此错误,什么是中心?
i don't understand this error. and what is 'center' ?
谢谢!
推荐答案
培训和测试数据必须在同一要素空间中;在其中构建两个单独的DTM
Train and test data must be in the same features space ; building two separates DTM in that way can't work.
使用RTextTools的解决方案:
A solution with using RTextTools :
DocTermMatrix <- create_matrix(labeled, language="english", removeNumbers=TRUE, stemWords=TRUE, ...)
container <- create_container(DocTermMatrix, labels, trainSize=1:700, testSize=701:1000, virgin=FALSE)
models <- train_models(container, "SVM")
results <- classify_models(container, models)
或者,回答y我们的问题(使用e1071),您可以在投影(DocumentTermMatrix)中指定词汇表(功能):
Or, to answer your question (with e1071), you can specify the vocabulary ('features') in the projection (DocumentTermMatrix) :
DocTermMatrixTrain <- DocumentTermMatrix(Corpus(VectorSource(trainDoc)));
Features <- DocTermMatrixTrain$dimnames$Terms;
DocTermMatrixTest <- DocumentTermMatrix(Corpus(VectorSource(testDoc)),control=list(dictionary=Features));
这篇关于支持向量机适用于R中的训练集,但不适用于R中的测试集(使用e1071)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!