Inspecting a specific document in a DocumentTermMatrix for specific terms

Problem description

I used the tm package in R for text mining. This is what my code looks like:

library(tm)

Loading the data in R

pathToData = "R/group_data"
newsCorpus = Corpus(DirSource(pathToData, recursive = TRUE),
                    readerControl = list(reader = readPlain))

Length of the news corpus

length(newsCorpus)

Preprocessing the corpus data

newsCorpus = tm_map(newsCorpus,removePunctuation)
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus,removeNumbers)
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, content_transformer(tolower))
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, removeWords, stopwords("english"))
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, stripWhitespace)
newsCorpus[["103806"]]

Corpus elements to plain text

newsCorpus = Corpus(VectorSource(newsCorpus))

Document-term matrix with TF-IDF weights

docTermMatrix = DocumentTermMatrix(newsCorpus,
                                   control = list(weighting = weightTfIdf,
                                                  minWordLength = 1,
                                                  minDocFreq = 1))

Dimensions of the resulting matrix

dim(docTermMatrix)

The docTermMatrix looks like this:

<<DocumentTermMatrix (documents: 1986, terms: 22213)>>
 Non-/sparse entries: 173995/43941023
 Sparsity           : 100%
 Maximal term length: 163
 Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Now I want to inspect the docTermMatrix for the document "101287" and look for the terms "textmining" and "clustering". But since the document-term matrix has changed the document names (rows) to 1, 2, 3, 4, ..., I can no longer find the document named "101287" or look up the columns "textmining" and "clustering". Is there a way to preserve the document names? Apologies if I am missing something.

> library(tm)
> pathToData = "R/group_data"
> newsCorpus = Corpus(DirSource(pathToData, recursive = TRUE),
+                     readerControl = list(reader = readPlain))

> length(newsCorpus)
[1] 1986

> newsCorpus[["103806"]]
  <<PlainTextDocument (metadata: 7)>>
  From: [email protected] (Desmond Chan)
  Subject: Re: Honda clutch chatter
  Organization: The University of Western Australia
  Lines: 8
  NNTP-Posting-Host: tartarus.uwa.edu.au
  X-Newsreader: NN version 6.4.19 #1

  I also experience this kinda problem in my 89 BMW 318. During cold
  start ups, the clutch seems to be sticky and everytime i drive out, for
  about 5km, the clutch seems to stick onto somewhere that if i depress
  the clutch, the whole chassis moves along. But after preheating, it
  becomes smooth again. I think that your suggestion of being some
  humudity is right but there should be some remedy. I also found out that
  my clutch is already thin but still alright for a couple grand more!

> newsCorpus = tm_map(newsCorpus, removePunctuation)
> newsCorpus = tm_map(newsCorpus, removeNumbers)
> newsCorpus = tm_map(newsCorpus, content_transformer(tolower))
> newsCorpus = tm_map(newsCorpus, removeWords, stopwords("english"))
> newsCorpus = tm_map(newsCorpus, stripWhitespace)

> newsCorpus = Corpus(VectorSource(newsCorpus))

> docTermMatrix = DocumentTermMatrix(newsCorpus, control = list(weighting = weightTfIdf, minWordLength = 1, minDocFreq = 1))

> dim(docTermMatrix)
[1]  1986 22213

> inspect(docTermMatrix["1", "bmw"])
<<DocumentTermMatrix (documents: 1, terms: 1)>>
Non-/sparse entries: 0/1
Sparsity           : 100%
Maximal term length: 3
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

    Terms
Docs bmw
  1   0

> inspect(docTermMatrix["103806", "bmw"])
Error in `[.simple_triplet_matrix`(docTermMatrix, "103806", "bmw") :
Subscript out of bounds.

Recommended answer
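
The text of the original answer is not part of this page. What follows is only a minimal sketch of one possible approach, under stated assumptions. In the session above, the rows become 1, 2, 3, ... right after the corpus is re-wrapped with Corpus(VectorSource(newsCorpus)), so it seems likely that this step discards the document IDs and that leaving it out would keep them. The sketch uses two made-up toy documents (the `texts` vector is hypothetical, not the asker's data) and assumes a tm version that provides DataframeSource with a `doc_id` column (about 0.7-3 or later).

```r
library(tm)

# Hypothetical toy documents; the names stand in for the file names
# that DirSource would use as document IDs.
texts <- c("101287" = "textmining and clustering of news articles",
           "103806" = "my bmw clutch sticks during cold starts")

# DataframeSource takes the doc_id column as each document's ID.
corpus <- VCorpus(DataframeSource(
  data.frame(doc_id = names(texts),
             text = unname(texts),
             stringsAsFactors = FALSE)))

# Preprocessing with tm_map keeps the document metadata, including IDs.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))

# Note: no Corpus(VectorSource(corpus)) step here, so the IDs survive
# and the matrix rows are labelled with them.
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# Row names are the document IDs, so lookup by name is possible.
rownames(dtm)
inspect(dtm["101287", c("textmining", "clustering")])
```

If the re-wrapping step cannot be dropped, another option that may work is to save names(newsCorpus) before that step and assign it to rownames(docTermMatrix) afterwards; this assumes the document order is unchanged in between.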