Finding ngrams in R and comparing ngrams across corpora

Question

I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement").

This is a two-step question, one regarding my code so far and one regarding how I should go on.

Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early on. Here is what I've been doing:

library(tm)
library(RWeka)

a <- Corpus(DirSource("/mycorpora/1965"), readerControl = list(language = "lat")) # that dir is full of txt files
summary(a)
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, stemDocument, language = "english")
# everything works fine so far, so I start playing around with what I have
adtm <-DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)

inspect(adtm)

findFreqTerms(adtm, lowfreq=10) # find terms that occur at least 10 times

findAssocs(adtm, "usa",.5) # just looking for some associations
findAssocs(adtm, "china",.5)

# ... and so on, and so forth, all of this works fine

The corpus I load into R works fine with most functions I throw at it. I haven't had any problems creating TDMs from my corpus, finding frequent words, associations, creating word clouds and so on. But when I try to identify ngrams using the approach outlined in the tm FAQ, I'm apparently making some mistake with the tdm-constructor:

# Trigram

TrigramTokenizer <- function(x) NGramTokenizer(x,
                                Weka_control(min = 3, max = 3))

tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))

inspect(tdm)

I get this error message:

Error in rep(seq_along(x), sapply(tflist, length)) :
invalid 'times' argument
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

Any ideas? Is "a" not the right class/object? I'm confused. I assume there's a fundamental mistake here, but I'm not seeing it. :(

Step 2: Then I would like to identify ngrams that are significantly overrepresented when I compare the corpus against other corpora. For example, I could compare my corpus against a large standard English corpus. Or I could create subsets that I can compare against each other (e.g. Soviet vs. Chinese Communist terminology). Do you have any suggestions on how I should go about doing this? Any scripts/functions I should look into? Just some ideas or pointers would be great.

Thank you for your patience!

Recommended answer

I could not reproduce your problem. Are you using the latest versions of R, tm, RWeka, etc.?
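If updating changes nothing, one workaround that has been reported for this particular error message is to stop tm from running the tokenizer in parallel worker processes. The snippet below is a minimal sketch under that assumption, not a confirmed diagnosis of the asker's setup; it only sets the standard `mc.cores` option and leaves the matrix construction as a comment.

```r
# Assumption: the "invalid 'times' argument" error appears when the RWeka
# tokenizer is called from forked parallel workers (mostly reported on
# macOS/Linux). Forcing a single core is a commonly cited workaround.
old_cores <- getOption("mc.cores")  # remember the previous setting (may be NULL)
options(mc.cores = 1)               # process documents on one core only

# Then rebuild the matrix exactly as in the question:
# tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))

# Separate note for tm 0.6 and later: base functions such as tolower should
# be wrapped so the corpus keeps its document class:
# a <- tm_map(a, content_transformer(tolower))

# To restore the previous setting afterwards:
# options(mc.cores = old_cores)
```

The listing that follows is the run that completed without the error.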

require(tm)
a <- Corpus(DirSource("C:\\Downloads\\Only1965\\Only1965"))
summary(a)
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
# a <- tm_map(a, stemDocument, language = "english")
# I also got it to work with stemming, but it takes so long...
adtm <-DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)

inspect(adtm)

findFreqTerms(adtm, lowfreq=10) # find terms that occur at least 10 times
findAssocs(adtm, "usa",.5) # just looking for some associations
findAssocs(adtm, "china",.5)

# Trigrams
require(RWeka)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
tdm <- removeSparseTerms(tdm, 0.75)
inspect(tdm[1:5,1:5])

This is what I get:

A term-document matrix (5 terms, 5 documents)

Non-/sparse entries: 11/14
Sparsity           : 56%
Maximal term length: 28
Weighting          : term frequency (tf)

                                   Docs
Terms                               PR1965-01.txt PR1965-02.txt PR1965-03.txt
  â€ chinese press                              0             0             0
  â€ renmin ribao                               0             1             1
  â€" renmin ribao                              2             5             2
  â€œ chinese people                            0             0             0
  â€œrenmin ribaoâ€\u009d editorial             0             1             0
  etc.
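Several terms in the output above begin with sequences such as `â€œ`. That is what UTF-8 curly quotes and dashes typically look like when the files are read under a single-byte encoding, and `removePunctuation` probably left them alone because they are not ASCII punctuation. The sketch below assumes the source files really are UTF-8; `strip_typographic` is a hypothetical helper name, not part of tm.

```r
# Hypothetical helper: replace typographic quotes and dashes with spaces so
# they cannot end up inside a term. Works on a plain character vector.
strip_typographic <- function(x) {
  # left/right single quotes, left/right double quotes, en dash, em dash
  typographic <- c("\u2018", "\u2019", "\u201c", "\u201d", "\u2013", "\u2014")
  for (ch in typographic) {
    x <- gsub(ch, " ", x, fixed = TRUE)
  }
  # collapse the runs of whitespace left behind and trim the ends
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Made-up sample string, for illustration only
sample_text <- "\u201cRenmin Ribao\u201d editorial \u2014 Chinese press"
cleaned <- strip_typographic(sample_text)
print(cleaned)

# Assumed usage with tm (not run here):
# a <- Corpus(DirSource("/mycorpora/1965", encoding = "UTF-8"))
# a <- tm_map(a, content_transformer(strip_typographic))  # tm 0.6 or later
```

Declaring the encoding when the files are read is usually the cleaner fix; the helper is only a fallback for characters that survive the standard clean-up steps.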

Regarding your step two, here are some pointers to useful starts:

http://quantifyingmemory.blogspot.com/2013/02/mapping-important-textual-differences.html
http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/
and here is his code: https://dl.dropboxusercontent.com/u/4713959/Neuchatel/NassrProgram.R
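As a concrete starting point for Step 2, one common way to score overrepresentation is the log-likelihood (G2) statistic, which compares each ngram's observed count in two corpora with the count expected if both corpora used it at the same rate. The sketch below is not taken from the linked posts; it uses base R only, the function names `count_ngrams` and `keyness` are hypothetical, and the two tiny "corpora" are made-up sentences for illustration.

```r
# Count word n-grams in a character vector of documents (base R only).
# Assumption: lowercased text, ASCII letters only; everything else splits words.
count_ngrams <- function(docs, n = 3) {
  grams <- unlist(lapply(docs, function(doc) {
    words <- strsplit(tolower(doc), "[^a-z]+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.numeric(tab), names(tab))  # named numeric vector of counts
}

# Log-likelihood (G2) keyness of every n-gram, corpus A versus corpus B.
keyness <- function(counts_a, counts_b) {
  terms <- union(names(counts_a), names(counts_b))
  a <- unname(counts_a[terms]); a[is.na(a)] <- 0  # absent n-grams count as 0
  b <- unname(counts_b[terms]); b[is.na(b)] <- 0
  total_a <- sum(a)
  total_b <- sum(b)
  # counts expected if both corpora used the n-gram at the same rate
  expected_a <- total_a * (a + b) / (total_a + total_b)
  expected_b <- total_b * (a + b) / (total_a + total_b)
  # x * log(x / e), with the convention that 0 * log(0) = 0
  xlogx <- function(x, e) ifelse(x > 0, x * log(x / e), 0)
  g2 <- 2 * (xlogx(a, expected_a) + xlogx(b, expected_b))
  out <- data.frame(ngram = terms, count_a = a, count_b = b, g2 = g2,
                    overused_in_a = a / total_a > b / total_b,
                    stringsAsFactors = FALSE)
  out[order(-out$g2), ]  # strongest differences first
}

# Made-up example documents, for illustration only
corpus_a <- c("the dictatorship of the proletariat",
              "dictatorship of the proletariat and the party")
corpus_b <- c("the struggle criticism transformation movement",
              "struggle criticism transformation in the party")

result <- keyness(count_ngrams(corpus_a), count_ngrams(corpus_b))
print(head(result, 5))
```

With real data, the counts could come from the row sums of two term-document matrices instead of `count_ngrams`. A G2 value above 3.84 corresponds to p < 0.05 for a single comparison with one degree of freedom, but with thousands of ngrams a stricter threshold or a simple ranking by G2 is usually more sensible.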
