Problem description
I am trying to find a way to implement the Affective Norms for English Words (in Dutch) for a longitudinal sentiment analysis with Quanteda. What I ultimately want to have is a "mean sentiment" per year in order to show any longitudinal trends.
In the data set, all words are scored on a 7-point Likert scale by 64 coders on four categories, which provides a mean for each word. What I want to do is take one of the dimensions and use it to analyse changes in emotions over time. I realise that Quanteda has a function for implementing the LIWC dictionary, but I would prefer using the open-source ANEW data if possible.
Essentially, I need help with the implementation because I am new to coding and R.
The ANEW file looks something like this (in .csv):
WORD/SCORE: cancer: 1.01, potato: 3.56, love: 6.56
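A minimal sketch of reading such a file into a named numeric vector. It assumes a two-column CSV with the headers WORD and SCORE; those column names and the inline text are placeholders built from the sample above, and a real ANEW file will have more columns and different names.

```r
# Inline stand-in for the ANEW .csv; the column names WORD and SCORE are assumed
anew_csv <- "WORD,SCORE
cancer,1.01
potato,3.56
love,6.56"

# For a real file, replace text = anew_csv with the path to the file
df_scores <- read.csv(text = anew_csv, stringsAsFactors = FALSE)

# Named vector: names are the words, values are their scores
anew_scores <- setNames(df_scores$SCORE, df_scores$WORD)
anew_scores["love"]
```

A named vector like this is convenient because scores can be looked up by word.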
Recommended answer
Not yet, directly, but... ANEW differs from other dictionaries since it does not use a key: value pair format, but rather assigns a numerical score to each term. This means you are not counting matches of values against a key, but rather selecting features and then scoring them using weighted counts.
This could be done in quanteda by:
- Get ANEW features into a character vector.
- Use dfm(yourtext, select = ANEWfeatures) to create a dfm with just the ANEW features.
- Multiply each counted value by the valence of each ANEW value, recycled column-wise so that each feature count gets multiplied by its ANEW value.
- Use rowSums() on the weighted matrix to get document-level valence scores.
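The weighting and summing steps can be sketched in base R on a toy count matrix. This is only an illustration under assumptions: the matrix stands in for a dfm, the document names and counts are invented, and the valence values are the sample scores from the question.

```r
# Toy document-feature counts (rows = documents, columns = ANEW terms);
# the documents and counts are invented for illustration
counts <- matrix(
  c(2, 0, 1,
    0, 3, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("cancer", "potato", "love"))
)

# Valence per term, using the sample scores from the question
valence <- c(cancer = 1.01, potato = 3.56, love = 6.56)

# Multiply each column by its term's valence, matching on column names
weighted <- sweep(counts, 2, valence[colnames(counts)], "*")

# Document-level valence: sum of the weighted counts in each row
doc_valence <- rowSums(weighted)
doc_valence
```

These sums grow with document length. Dividing them by rowSums(counts) would instead give a mean valence per matched token.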
or,
- File an issue and we will add this functionality to quanteda.
Note also that tidytext uses ANEW for its sentiment scoring, if you want to convert your dfm into their object and use that approach (which is basically a version of what I've suggested above).
It turns out I already built the feature into quanteda that you need, and had simply not realised it!
This will work. First, load in the ANEW dictionary. (You have to supply the ANEW file yourself.)
# read in the ANEW data
df_anew <- read.delim("ANEW2010All.txt", stringsAsFactors = FALSE)
# construct a vector of weights with the term as the name
vector_anew <- df_anew$ValMn
names(vector_anew) <- df_anew$Word
Now that we have a named vector of weights, we can apply that using dfm_weight(). Below, I've first normalised the dfm by relative frequency, so that the document aggregate score is not dependent on the document length in tokens. If you don't want that, just remove the line indicated below.
library("quanteda")
dfm_anew <- dfm(data_corpus_inaugural, select = df_anew$Word)
# weight by the ANEW weights
dfm_anew_weighted <- dfm_anew %>%
  dfm_weight(scheme = "prop") %>% # remove if you don't want normalized scores
  dfm_weight(weights = vector_anew)
## Warning message:
## dfm_weight(): ignoring 1,427 unmatched weight features
tail(dfm_anew_weighted)[, c("life", "day", "time")]
## Document-feature matrix of: 6 documents, 3 features (5.56% sparse).
## 6 x 3 sparse Matrix of class "dfm"
## features
## docs life day time
## 1997-Clinton 0.07393220 0.06772881 0.21600000
## 2001-Bush 0.10004587 0.06110092 0.09743119
## 2005-Bush 0.09380645 0.12890323 0.11990323
## 2009-Obama 0.06669725 0.10183486 0.09743119
## 2013-Obama 0.08047970 0 0.19594096
## 2017-Trump 0.06826291 0.12507042 0.04985915
# total scores
tail(rowSums(dfm_anew_weighted))
## 1997-Clinton 2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
## 5.942169 6.071918 6.300318 5.827410 6.050216 6.223944
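To get from document scores to the "mean sentiment" per year asked about in the question, the scores can be grouped by year and averaged. The sketch below is one possible approach in base R, not part of quanteda: it copies the six scores printed above and assumes that document names begin with a four-digit year, as they do in data_corpus_inaugural.

```r
# Document scores copied from the rowSums() output above
doc_scores <- c(
  "1997-Clinton" = 5.942169, "2001-Bush" = 6.071918, "2005-Bush" = 6.300318,
  "2009-Obama" = 5.827410, "2013-Obama" = 6.050216, "2017-Trump" = 6.223944
)

# Assumes each document name starts with a four-digit year
doc_year <- as.integer(substr(names(doc_scores), 1, 4))

# Mean score per year; with several documents per year this averages them
mean_by_year <- tapply(doc_scores, doc_year, mean)
mean_by_year

# A simple trend plot could then be drawn with:
# plot(as.integer(names(mean_by_year)), mean_by_year, type = "b")
```

Here each year has a single document, so the means equal the document scores. With a corpus containing many texts per year, the same call would average them.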