Creating a dfm step by step with quanteda

Problem Description


I want to analyze a big (n=500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, as this would result in many useless bigrams; in another, I have to preprocess the text with language-specific procedures.

I would like to implement this sequence:
1) remove the punctuation and numbers
2) remove stopwords (i.e. before the tokenization to avoid useless tokens)
3) tokenize using unigrams and bigrams
4) create the dfm

My attempt:

> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))

> class(text.corpus)
[1] "corpus" "list"

> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") :
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"

# This is how I would theoretically continue:
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))

Bonus question: How do I remove sparse tokens in quanteda? (i.e. the equivalent of removeSparseTerms() in tm.)


UPDATE: In light of @Ken's answer, here is the code to proceed step by step with quanteda:

library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’

1) Remove custom punctuation and numbers. E.g., notice the "\n" in the ie2010 corpus:

text.corpus <- ie2010Corpus
texts(text.corpus)[1]      # Use texts() to extract the text
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is

texts(text.corpus)[1] <- gsub("\\s", " ", text.corpus[1])    # replace all whitespace (incl. \n, \t, \r) with a space
texts(text.corpus)[1]
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e

A further note on why one may prefer to preprocess. My present corpus is in Italian, a language in which articles are joined to the following word by an apostrophe. Thus, the straight dfm() can lead to inexact tokenization, e.g.:

broken.tokens <- dfm(corpus(c("L'abile presidente Renzi. Un'abile mossa di Berlusconi"), removePunct=TRUE))

will produce two separate tokens for the same word ("un'abile" and "l'abile"), hence the need for an additional step with gsub() here.
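
A sketch of that extra step (not from the original post; it assumes simply splitting on the apostrophe is acceptable): replace the apostrophe with a space before calling dfm(), so that "l'abile" and "un'abile" both yield the token "abile":

txt <- "L'abile presidente Renzi. Un'abile mossa di Berlusconi"
txt <- gsub("'", " ", txt)    # "L abile presidente Renzi. Un abile mossa di Berlusconi"
fixed.tokens <- dfm(corpus(txt), removePunct = TRUE)    # "abile" is now a single feature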

2) In quanteda it is not possible to remove stopwords directly from the text before tokenization. In my previous example, "l" and "un" have to be removed so as not to produce misleading bigrams. This can be handled in tm with tm_map(..., removeWords).
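
With more recent quanteda releases (a hedged sketch; these function names postdate the 0.9.8 API used above), the same effect is obtained after tokenization:

toks <- tokens("L abile presidente Renzi. Un abile mossa di Berlusconi", remove_punct = TRUE)
toks <- tokens_remove(toks, c(stopwords("italian"), "l", "un"))    # drop stopwords from the tokens
tokens_ngrams(toks, n = 2)    # bigrams no longer pair "l"/"un" with content words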

3) Tokenization

token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)

4) Create the dfm:

mydfm <- dfm(token)    # named mydfm rather than dfm, so the dfm() function is not masked

5) Remove sparse features

mydfm <- trim(mydfm, minCount = 5)    # keep only features occurring at least 5 times
Solution

We designed dfm() not as a "black box" but more as a Swiss army knife that combines many of the options typical users want to apply when converting their texts to a matrix of documents and features. However, all of these options are also available through lower-level processing commands, should you wish to exert a finer level of control.

However, one of the design principles of quanteda is that text only becomes "features" through the process of tokenisation. If you have a set of tokenised features that you wish to exclude, you must first tokenise your text, or you cannot exclude them. Unlike other text packages for R (e.g. tm), these steps are applied "downstream" of a corpus, so that the corpus remains an unprocessed set of texts to which manipulations will be applied (but which will not itself be a transformed set of texts). The purpose of this is to preserve generality, but also to promote reproducibility and transparency in text analysis.

In response to your questions:

  1. You can, however, override our encouraged behaviour using the texts(myCorpus) <- replacement function, where whatever is assigned to the texts will override the existing texts. So you could use regular expressions to remove punctuation and numbers -- for example, the stringi commands, using the Unicode character classes for punctuation and numerals to identify the patterns.
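
  A hedged sketch of this approach (myCorpus as above): strip everything matching the Unicode classes \p{P} (punctuation) and \p{N} (numbers), then assign the result back:

    library(stringi)
    texts(myCorpus) <- stri_replace_all_regex(texts(myCorpus), "[\\p{P}\\p{N}]+", " ")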

  2. I would recommend you tokenise before removing stopwords. Stop "words" are tokens, so there is no way to remove them from the text before you tokenise it. Even applying regular expressions to substitute them with "" involves specifying some form of word boundary in the regex -- again, this is tokenisation.
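
  A toy illustration (a made-up example, not from the answer): even a pure-regex removal has to spell out word boundaries, which is tokenisation in disguise:

    gsub("\\b(a|the|all|some)\\b\\s*", "", "the president saw all of the figures")
    # [1] "president saw of figures"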

  3. To tokenise into unigrams and bigrams:

    tokens(myCorpus, ngrams = 1:2)

  4. To create the dfm, simply call dfm(myTokens). (You could also have applied step 3, for ngrams, at this stage.) A combined sketch of steps 1-4 follows below.
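
Putting those steps together, a minimal end-to-end sketch with the current API (hedged: these function names postdate the 0.9.8 calls in the question, and depending on your version the dataset may ship in quanteda or in quanteda.textmodels):

library(quanteda)
corp <- data_corpus_irishbudget2010                 # renamed ie2010Corpus; see the 2018 update below
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("english"))   # stopwords removed after tokenising
toks <- tokens_ngrams(toks, n = 1:2)                # unigrams and bigrams
mydfm <- dfm(toks)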

Bonus 1: n=2 collocations produces the same list as bigrams, except in a different format. Did you intend something else? (Separate SO question perhaps?)

Bonus 2: See dfm_trim(x, sparsity = ). The removeSparseTerms() options are quite confusing to most people, but they were included for migrants from tm. See this post for a full explanation.
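
A short sketch of the correspondence (mydfm as built above): removeSparseTerms(dtm, 0.90) in tm drops terms absent from more than 90% of documents, and the stated quanteda equivalent is:

mydfm <- dfm_trim(mydfm, sparsity = 0.90)    # keep features appearing in at least 10% of documents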

BTW: Use texts() instead of ie2010Corpus$documents$texts -- we will rewrite the object structure of a corpus soon, so you should not access its internals directly when there is an extractor function. (Also, this step is unnecessary -- here you have simply recreated the corpus.)

Update 2018-01:

The new name for the corpus object is data_corpus_irishbudget2010, and the collocation scoring function is textstat_collocations().
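
A sketch with the renamed objects (hedged: in the most recent releases, textstat_collocations() has moved to the quanteda.textstats package and the corpus to quanteda.textmodels):

library(quanteda)
toks <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)
textstat_collocations(toks, size = 2)    # scored two-word collocations (cf. the bigrams above)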
