Problem Description
I have a data set (Facebook posts) (via netvizz) and I use the quanteda package in R. Here is my R code.
# Load quanteda and the relevant dictionary (relevant for the analysis)
library(quanteda)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")
# Read File
# Facebooks posts could be generated by FB Netvizz
# https://apps.facebook.com/netvizz
# Load FB posts as .csv-file from .zip-file
fbpost <- read.csv("D:/FB-com.csv", sep=";")
# Define the relevant column(s)
fb_test <- as.character(fbpost$comment_message)  # one column with 2700 entries
# Define as corpus
fb_corp <- corpus(fb_test)
class(fb_corp)
# LIWC application
fb_liwc <- dfm(fb_corp, dictionary = liwcdict)
View(fb_liwc)
Everything works fine until:
> fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
Creating a dfm from a corpus ...
... indexing 2,760 documents
... tokenizing texts, found 77,923 total tokens
... cleaning the tokens, 1584 removed entirely
... applying a dictionary consisting of 68 key entries
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1", :
invalid 'dimnames' given for data frame
How would you interpret the error message? Are there any suggestions to solve the problem?
Recommended Answer
There was a bug in quanteda version 0.7.2 that caused dfm() to fail when using a dictionary and one of the documents contained no features. Your example fails because in the cleaning stage, some of the Facebook post "documents" end up having all of their features removed by the cleaning steps.
This is not only fixed in 0.8.0, but we also changed the underlying implementation of dictionaries in dfm(), resulting in a significant speed improvement. (The LIWC is still a large and complicated dictionary, and its regular expressions still mean that it is much slower to use than simply indexing tokens. We will work on optimising this further.)
devtools::install_github("kbenoit/quanteda")
library(quanteda)
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
mydfm <- dfm(inaugTexts, dictionary = liwcdict)
## Creating a dfm from a character vector ...
## ... indexing 57 documents
## ... lowercasing
## ... tokenizing
## ... shaping tokens into data.table, found 134,024 total tokens
## ... applying a dictionary consisting of 68 key entries
## ... summing dictionary-matched features by document
## ... indexing 68 feature types
## ... building sparse matrix
## ... created a 57 x 68 sparse dfm
## ... complete. Elapsed time: 14.005 seconds.
topfeatures(mydfm, decreasing=FALSE)
## Fillers   Nonfl   Swear      TV  Eating   Sleep   Groom   Death  Sports  Sexual
##       0       0       0      42      47      49      53      76      81     100
It will also work if a document contains zero features after tokenization and cleaning, which is probably what is breaking the older dfm() you are using with your Facebook texts.
mytexts <- inaugTexts
mytexts[3] <- ""
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE)
which(rowSums(mydfm)==0)
## 1797-Adams
## 3
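If upgrading is not immediately possible, a practical workaround is to drop posts that contain no word characters before building the dfm, so that no document can end up empty after cleaning. This is a minimal sketch in base R, not part of the original answer, assuming `fb_test` is the character vector of comments from the question:

```r
# Keep only posts containing at least one alphanumeric character;
# posts that are empty, whitespace-only, or pure punctuation would
# otherwise lose all features during cleaning and hit the 0.7.2 bug
keep <- grepl("[[:alnum:]]", fb_test)
fb_corp <- corpus(fb_test[keep])
fb_liwc <- dfm(fb_corp, dictionary = liwcdict)
```

Note that filtering changes the document indexing: row i of the resulting dfm no longer corresponds to row i of the original CSV, so keep `which(keep)` around if you need to map results back.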