Problem description
I have a dataframe in pandas which contains metrics calculated on Wikipedia articles. There are two categorical variables: nation, which nation the article is about, and lang, which language Wikipedia this was taken from. For a single metric, I would like to see how closely the nation and language variables correlate; I believe this is done using Cramér's V statistic.
index qid subj nation lang metric value
5 Q3488399 economy cdi fr informativeness 0.787117
6 Q3488399 economy cdi fr referencerate 0.000945
7 Q3488399 economy cdi fr completeness 43.200000
8 Q3488399 economy cdi fr numheadings 11.000000
9 Q3488399 economy cdi fr articlelength 3176.000000
10 Q7195441 economy cdi en informativeness 0.626570
11 Q7195441 economy cdi en referencerate 0.008610
12 Q7195441 economy cdi en completeness 6.400000
13 Q7195441 economy cdi en numheadings 7.000000
14 Q7195441 economy cdi en articlelength 2323.000000
I would like to generate a matrix that displays Cramér's coefficient between all combinations of the four nations (France, USA, Côte d'Ivoire, and Uganda) ['fra','usa','cdi','uga'] and the three languages ['fr','en','sw']. So there would be a resulting 4 by 3 matrix like:
en fr sw
usa Cramer11 Cramer12 ...
fra Cramer21 Cramer22 ...
cdi ...
uga ...
Eventually then I will do this over all the different metrics I am tracking.
for subject in list_of_subjects:
    for metric in list_of_metrics:
        cramer_matrix(metric, df)
Then I can test my hypothesis that metrics will be higher for articles whose language is the language of the Wikipedia. Thanks
Answer
Cramér's V seems pretty over-optimistic in a few tests that I did. Wikipedia recommends a corrected version.
import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.
    Uses the correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    # .sum().sum() gives the grand total for both numpy arrays and
    # pandas DataFrames (e.g. the output of pd.crosstab)
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # bias-corrected phi^2 and corrected row/column counts
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
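As a quick sanity check, here is a small standalone sketch (the function is repeated so the snippet runs on its own; the two 2x2 contingency tables are made up purely for illustration, one with a strong association and one with almost none):

```python
import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    # Bergsma-Wicher corrected Cramér's V, same as above
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

# one variable almost determines the other -> V near 1
strong = np.array([[40, 2],
                   [3, 45]])
# counts close to what independence would predict -> V near 0
weak = np.array([[20, 22],
                 [19, 21]])

print(cramers_corrected_stat(strong))  # high association
print(cramers_corrected_stat(weak))    # near zero
```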
Also note that the confusion matrix for two categorical columns can be calculated with a built-in pandas method:
import pandas as pd
confusion_matrix = pd.crosstab(df[column1], df[column2])
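For instance, with a toy dataframe shaped like the one in the question (the nation/lang rows below are made up for illustration), crosstab yields the 4-nation by 3-language count table, which can then be passed to cramers_corrected_stat:

```python
import pandas as pd

# hypothetical rows mimicking the question's nation and lang columns
df = pd.DataFrame({
    'nation': ['cdi', 'cdi', 'usa', 'usa', 'fra', 'fra', 'uga', 'uga'],
    'lang':   ['fr',  'en',  'en',  'en',  'fr',  'fr',  'sw',  'en'],
})

confusion_matrix = pd.crosstab(df['nation'], df['lang'])
print(confusion_matrix)
# rows are the 4 nations (cdi, fra, uga, usa),
# columns are the 3 languages (en, fr, sw)
```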