我正在尝试在R中做一些处理,但它似乎只适用于单个文档。我的最终目标是一个术语文档矩阵,该矩阵显示文档中每个术语的频率。
这是一个例子:
require(RWeka)
require(tm)
require(Snowball)
worder1<- c("I am taking","these are the samples",
"He speaks differently","This is distilled","It was placed")
df1 <- data.frame(id=1:5, words=worder1)
> df1
id words
1 1 I am taking
2 2 these are the samples
3 3 He speaks differently
4 4 This is distilled
5 5 It was placed
此方法适用于词干部分,但不适用于术语文档矩阵部分:
> corp1 <- Corpus(VectorSource(df1$words))
> inspect(corp1)
A corpus with 5 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
I am taking
[[2]]
these are the samples
[[3]]
He speaks differently
[[4]]
This is distilled
[[5]]
It was placed
> corp1 <- tm_map(corp1, SnowballStemmer)
> inspect(corp1)
A corpus with 5 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
[1] I am tak
[[2]]
[1] these are the sampl
[[3]]
[1] He speaks differ
[[4]]
[1] This is distil
[[5]]
[1] It was plac
> class(corp1)
[1] "VCorpus" "Corpus" "list"
> tdm1 <- TermDocumentMatrix(corp1)
Error in UseMethod("Content", x) :
no applicable method for 'Content' applied to an object of class "character"
因此,我尝试首先创建术语“文档矩阵”,但是这次并没有阻止单词:
> corp1 <- Corpus(VectorSource(df1$words))
> tdm1 <- TermDocumentMatrix(corp1, control=list(stemDocument=TRUE))
> as.matrix(tdm1)
Docs
Terms 1 2 3 4 5
are 0 1 0 0 0
differently 0 0 1 0 0
distilled 0 0 0 1 0
placed 0 0 0 0 1
samples 0 1 0 0 0
speaks 0 0 1 0 0
taking 1 0 0 0 0
the 0 1 0 0 0
these 0 1 0 0 0
this 0 0 0 1 0
was 0 0 0 0 1
这里的单词显然不是词干。
有什么建议么?
最佳答案
CRAN上的RTextTools包允许您执行此操作。
library(RTextTools)
worder1<- c("I am taking","these are the samples",
"He speaks differently","This is distilled","It was placed")
df1 <- data.frame(id=1:5, words=worder1)
matrix <- create_matrix(df1, stemWords=TRUE, removeStopwords=FALSE, minWordLength=2)
colnames(matrix) # SEE THE STEMMED TERMS
这将返回一个
DocumentTermMatrix
,可以与tm
包一起使用。您可以使用其他参数(例如,删除停用词,更改最小字长,使用其他语言的词干分析器)来获得所需的结果。当显示as.matrix
时,该示例将生成以下术语矩阵: Terms
Docs am are differ distil he is it place sampl speak take the these this was
1 I am taking 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2 these are the samples 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0
3 He speaks differently 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0
4 This is distilled 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0
5 It was placed 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1