I am new to text processing in R. I am trying the simple code below:
library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts, ngramLength=3)
It comes from one of the answers to the question "Finding 2 & 3 word Phrases Using R TM Package".

However, it fails with the error: Error in FUN(X[[2L]], ...) : non-character argument

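When I remove the ngramLength parameter, I can generate a document-term matrix, but I do need to search for phrases of a given word length. Any suggestions for an alternative or a fix?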

Best answer

ngramLength does not seem to work. Here is a workaround:

library(RTextTools)
library(tm)
library(RWeka) # needed for NGramTokenizer
texts <- c("This is the first document.",
           "Is this a text?",
           "This is the second file.",
           "This is the third text.",
           "File is not this.")
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
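# VCorpus rather than Corpus: in recent tm versions Corpus() may return a
# SimpleCorpus, for which custom tokenizers are ignored
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)),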
                         control=list(
                                      weighting = weightTf,
                                      tokenize = TrigramTokenizer))

as.matrix(dtm)
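
To get a feel for which terms end up as columns of that matrix, here is a minimal base-R sketch of trigram extraction. It does not need RWeka or Java. The helper name ngrams_base is hypothetical and not part of tm, RWeka, or RTextTools; lowercasing and splitting on non-alphanumeric characters are assumptions of this sketch, so NGramTokenizer may split some inputs differently.

```r
# Hypothetical helper for illustration only (not from tm, RWeka, or RTextTools).
# Returns every run of n consecutive words in a single string.
ngrams_base <- function(x, n = 3) {
  # Assumption: lowercase, and treat any non-alphanumeric run as a separator
  words <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]            # drop an empty leading token, if any
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

trigrams <- ngrams_base("This is the first document.")
print(trigrams)
```

A five-word sentence yields three trigrams, and a text with fewer than three words yields none, which is why short documents can end up as all-zero rows in a trigram document-term matrix.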

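The tokenizer here is RWeka's NGramTokenizer rather than the tokenizer that create_matrix calls. You can now use dtm in other RTextTools functions, for example to train classification models as below: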
isText <- c(TRUE, FALSE, TRUE, TRUE, FALSE) # spell out TRUE/FALSE; T and F can be reassigned
container <- create_container(dtm, isText, virgin=FALSE, trainSize=1:3, testSize=4:5)

models <- train_models(container, algorithms=c("SVM","BOOSTING"))
classify_models(container, models)

For more on RTextTools create_matrix returning a non-character argument error, see this similar question on Stack Overflow: https://stackoverflow.com/questions/25054617/
