This article covers how to inspect a specific document in a DocumentTermMatrix for specific terms.

Problem description

I used the tm package in R for text mining. This is what my code looks like:

library(tm)

Loading the data in R

pathToData = "R/group_data"
newsCorpus = Corpus(DirSource(pathToData, recursive = TRUE),
                    readerControl = list(reader = readPlain))

Length of the news corpus

length(newsCorpus)

Preprocessing the corpus data

newsCorpus = tm_map(newsCorpus,removePunctuation)
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus,removeNumbers)
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, content_transformer(tolower))
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, removeWords, stopwords("english"))
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, stripWhitespace)
newsCorpus[["103806"]]

Converting the corpus elements to plain text

newsCorpus = Corpus(VectorSource(newsCorpus))

Document-term matrix with TF-IDF weights

docTermMatrix = DocumentTermMatrix(newsCorpus,
                               control = list(weighting = weightTfIdf,
                                              minWordLength = 1,
                                              minDocFreq = 1))

Dimensions of the resulting matrix

dim(docTermMatrix)

The docTermMatrix looks like this:

<<DocumentTermMatrix (documents: 1986, terms: 22213)>>
 Non-/sparse entries: 173995/43941023
 Sparsity           : 100%
 Maximal term length: 163
 Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Now I want to inspect the docTermMatrix for the document "101287" and look up the terms "textmining" and "clustering". But since the document-term matrix has changed the document names (rows) to 1, 2, 3, 4, ..., I can no longer find the document named "101287" or look up the columns "textmining" and "clustering". Is there a way to preserve the document names? Apologies if I am missing something.

> library(tm)
> pathToData = "R/group_data"
> newsCorpus = Corpus(DirSource(pathToData, recursive = TRUE),
                      readerControl = list(reader = readPlain))

> length(newsCorpus)
[1] 1986

 > newsCorpus[["103806"]]
  <<PlainTextDocument (metadata: 7)>>
  From: [email protected] (Desmond Chan)
  Subject: Re: Honda clutch chatter
  Organization: The University of Western Australia
  Lines: 8
  NNTP-Posting-Host: tartarus.uwa.edu.au
  X-Newsreader: NN version 6.4.19 #1

  I also experience this kinda problem in my 89 BMW 318. During cold
  start ups, the clutch seems to be sticky and everytime i drive out, for
  about 5km, the clutch seems to stick onto somewhere that if i depress
  the clutch, the whole chassis moves along. But after preheating, it
  becomes smooth again. I think that your suggestion of being some
  humudity is right but there should be some remedy. I also found out that
  my clutch is already thin but still alright for a couple grand more!

 > newsCorpus = tm_map(newsCorpus,removePunctuation)
 > newsCorpus = tm_map(newsCorpus,removeNumbers)
 > newsCorpus = tm_map(newsCorpus, content_transformer(tolower))
 > newsCorpus = tm_map(newsCorpus, removeWords, stopwords("english"))
 > newsCorpus = tm_map(newsCorpus, stripWhitespace)

 > newsCorpus = Corpus(VectorSource(newsCorpus))

 > docTermMatrix = DocumentTermMatrix(newsCorpus, control = list(weighting = weightTfIdf, minWordLength = 1, minDocFreq = 1))


 > dim(docTermMatrix)
 [1]  1986 22213



> inspect(docTermMatrix["1", "bmw"])
<<DocumentTermMatrix (documents: 1, terms: 1)>>
Non-/sparse entries: 0/1
Sparsity           : 100%
Maximal term length: 3
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

    Terms
Docs bmw
  1   0

> inspect(docTermMatrix["103806", "bmw"])
Error in `[.simple_triplet_matrix`(docTermMatrix, "103806", "bmw") :
Subscript out of bounds.

Recommended answer

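The numeric row names most likely come from the Corpus(VectorSource(newsCorpus)) step, which builds a new corpus and numbers its documents 1, 2, 3, ... instead of keeping the original file names. You essentially want to encode each document's ID in the document-term matrix. You can do that by saving it as an attribute in your text corpus. Check out this more detailed answer.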
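
A minimal sketch of that idea, not the linked answer's exact code. It assumes a tiny made-up corpus in place of the newsgroup files: the two texts and the `docIds` vector are hypothetical stand-ins, and `VCorpus` is used explicitly so that each document carries its own metadata. The original ID is stored in each document's "id" attribute with `meta()`, and `DocumentTermMatrix` then uses those IDs as row names.

```r
library(tm)

# Hypothetical stand-ins for two newsgroup posts; the ids mirror the
# file names mentioned in the question but the texts are made up.
texts  <- c("My BMW clutch sticks during cold starts.",
            "Notes on textmining and clustering of news posts.")
docIds <- c("103806", "101287")

# VectorSource numbers the documents 1, 2, ..., which is the situation
# the question ends up in after rebuilding the corpus.
newsCorpus <- VCorpus(VectorSource(texts))

# Save the original id as the "id" metadata attribute of each document.
for (i in seq_along(newsCorpus)) {
  meta(newsCorpus[[i]], "id") <- docIds[i]
}

# These transformations keep the document metadata intact.
newsCorpus <- tm_map(newsCorpus, removePunctuation)
newsCorpus <- tm_map(newsCorpus, content_transformer(tolower))

docTermMatrix <- DocumentTermMatrix(newsCorpus)

# The row names are now the stored ids, so lookups by name work.
Docs(docTermMatrix)
inspect(docTermMatrix["101287", c("textmining", "clustering")])
inspect(docTermMatrix["103806", "bmw"])
```

If the corpus is read with DirSource and never rebuilt from a VectorSource, the file names should already be the document IDs, so dropping the rebuild step may be enough on its own.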

