RTextTools create_matrix returns a "non-character argument" error

I am new to text processing with R. I am trying the simple code below:
library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts, ngramLength=3)
The code comes from one of the answers to the question "Finding 2 & 3 word Phrases Using R TM Package".

However, it fails with the error: Error in FUN(X[[2L]], ...) : non-character argument

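When I remove the ngramLength parameter, I can generate a document-term matrix, but I really do need to search for phrases of a specific length in words. Any suggestions for an alternative or a fix?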

Best answer

ngramLength does not seem to work. Here is a workaround:

library(RTextTools)
library(tm)
library(RWeka) # needed for NGramTokenizer and Weka_control
texts <- c("This is the first document.",
           "Is this a text?",
           "This is the second file.",
           "This is the third text.",
           "File is not this.")
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
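# VCorpus is used here because, in recent tm versions, Corpus() can return a
# SimpleCorpus, which ignores custom tokenizers such as TrigramTokenizer.
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)),
                          control = list(weighting = weightTf,
                                         tokenize = TrigramTokenizer))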

as.matrix(dtm)
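
To make it easier to see what a trigram tokenizer contributes to the matrix, here is a minimal base-R sketch of the same idea. It is only an illustration: word_ngrams is a hypothetical helper, not a function from RTextTools, tm, or RWeka, and it assumes simple whitespace splitting after lowercasing and stripping punctuation, so the exact tokens that NGramTokenizer produces may differ.

```r
# Hypothetical helper for illustration only (not part of RTextTools, tm, or RWeka).
# Assumption: words are separated by whitespace; punctuation is dropped and
# text is lowercased before the n-word window is slid across the sentence.
word_ngrams <- function(text, n = 3) {
  cleaned <- tolower(gsub("[[:punct:]]", "", text))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words <- words[nzchar(words)]
  # A document shorter than n words has no n-grams.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

example_texts <- c("This is the first document.", "Is this a text?")
trigrams <- lapply(example_texts, word_ngrams, n = 3)

# Count each trigram per document: rows are documents, columns are trigrams.
vocab <- sort(unique(unlist(trigrams)))
counts <- t(vapply(trigrams,
                   function(g) as.integer(table(factor(g, levels = vocab))),
                   integer(length(vocab))))
colnames(counts) <- vocab
print(counts)
```

Each column of counts plays the role of one term in the document-term matrix above, except that the terms are three-word phrases rather than single words.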

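Here the tokenizer is RWeka's NGramTokenizer, which replaces the tokenizer that create_matrix calls. You can now use dtm with other RTextTools functions, for example to train classification models as below: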

isText <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
container <- create_container(dtm, isText, virgin=FALSE, trainSize=1:3, testSize=4:5)

models <- train_models(container, algorithm=c("SVM","BOOSTING"))
classify_models(container, models)

This question about the RTextTools create_matrix "non-character argument" error comes from Stack Overflow: https://stackoverflow.com/questions/25054617/