Problem description
I have a dataframe in pandas which contains metrics calculated on Wikipedia articles. There are two categorical variables: nation, which nation the article is about, and lang, which language Wikipedia this was taken from. For a single metric, I would like to see how closely the nation and language variables correlate; I believe this is done using Cramér's V statistic.
index qid subj nation lang metric value
5 Q3488399 economy cdi fr informativeness 0.787117
6 Q3488399 economy cdi fr referencerate 0.000945
7 Q3488399 economy cdi fr completeness 43.200000
8 Q3488399 economy cdi fr numheadings 11.000000
9 Q3488399 economy cdi fr articlelength 3176.000000
10 Q7195441 economy cdi en informativeness 0.626570
11 Q7195441 economy cdi en referencerate 0.008610
12 Q7195441 economy cdi en completeness 6.400000
13 Q7195441 economy cdi en numheadings 7.000000
14 Q7195441 economy cdi en articlelength 2323.000000
I would like to generate a matrix that displays Cramér's coefficient between all combinations of the four nations (France, USA, Côte d'Ivoire, and Uganda) ['fra','usa','cdi','uga'] and the three languages ['fr','en','sw']. So there would be a resulting 4 by 3 matrix like:
en fr sw
usa Cramer11 Cramer12 ...
fra Cramer21 Cramer22 ...
cdi ...
uga ...
Eventually then I will do this over all the different metrics I am tracking.
for subject in list_of_subjects:
    for metric in list_of_metrics:
        cramer_matrix(metric, df)
Then I can test my hypothesis that metrics will be higher for articles whose language is the language of the Wikipedia. Thanks
Answer
Cramér's V seems pretty over-optimistic in a few tests that I did. Wikipedia recommends a corrected version.
import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.
    Uses the correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    # .sum().sum() gives the grand total for both numpy arrays and
    # pandas DataFrames (e.g. the output of pd.crosstab)
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # bias-corrected phi^2 and corrected row/column counts
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
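As a quick sanity check, here is a small standalone sketch (the function is repeated so the snippet runs on its own; the two 2x2 contingency tables are made up purely for illustration, one with a strong association and one with almost none):

```python
import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    # Bergsma-Wicher corrected Cramér's V, same as above
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

# one variable almost determines the other -> V near 1
strong = np.array([[40, 2],
                   [3, 45]])
# counts close to what independence would predict -> V near 0
weak = np.array([[20, 22],
                 [19, 21]])

print(cramers_corrected_stat(strong))  # high association
print(cramers_corrected_stat(weak))    # near zero
```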
Also note that the confusion matrix for two categorical columns can be calculated with a built-in pandas method:
import pandas as pd
confusion_matrix = pd.crosstab(df[column1], df[column2])
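For instance, with a toy dataframe shaped like the one in the question (the nation/lang rows below are made up for illustration), crosstab yields the 4-nation by 3-language count table, which can then be passed to cramers_corrected_stat:

```python
import pandas as pd

# hypothetical rows mimicking the question's nation and lang columns
df = pd.DataFrame({
    'nation': ['cdi', 'cdi', 'usa', 'usa', 'fra', 'fra', 'uga', 'uga'],
    'lang':   ['fr',  'en',  'en',  'en',  'fr',  'fr',  'sw',  'en'],
})

confusion_matrix = pd.crosstab(df['nation'], df['lang'])
print(confusion_matrix)
# rows are the 4 nations (cdi, fra, uga, usa),
# columns are the 3 languages (en, fr, sw)
```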