I am quite new to Python as well as statistics. I'm trying to apply the chi-squared test to determine whether previous success affects a person's level of change (percentage-wise this does seem to be the case, but I wanted to see whether my results were statistically significant).

My question is: did I do this correctly?
My results say the p-value is 0.0, which means that there is a significant relationship between my variables (which is what I want, of course, but 0 seems a little too perfect for a p-value, so I'm wondering whether I coded it incorrectly).

Here's what I did:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

d = {'Previously Successful' : pd.Series([129.3, 182.7, 312],
                                         index=['Yes - changed strategy', 'No', 'col_totals']),
     'Previously Unsuccessful' : pd.Series([260.17, 711.83, 972],
                                           index=['Yes - changed strategy', 'No', 'col_totals']),
     'row_totals' : pd.Series([(129.3+260.17), (182.7+711.83), (312+972)],
                              index=['Yes - changed strategy', 'No', 'col_totals'])}

total_summarized = pd.DataFrame(d)

observed = total_summarized.ix[0:2,0:2]
```

Output (image): Observed

```python
expected = np.outer(total_summarized["row_totals"][0:2],
                    total_summarized.ix["col_totals"][0:2])/1000

expected = pd.DataFrame(expected)

expected.columns = ["Previously Successful","Previously Unsuccessful"]
expected.index = ["Yes - changed strategy","No"]

chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

print(chi_squared_stat)

crit = stats.chi2.ppf(q = 0.95,  # Find the critical value for 95% confidence
                      df = 8)

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                             df=8)
print("P value")
print(p_value)

stats.chi2_contingency(observed= observed)
```

Output (image): Statistics

Solution

A few corrections:

- Your `expected` array is not correct. You must divide by `observed.sum().sum()`, which is 1284, not 1000.
- For a 2x2 contingency table such as this, the number of degrees of freedom is 1, not 8.
- Your calculation of `chi_squared_stat` does not include a continuity correction.
(But it isn't necessarily wrong to leave it out; that's a judgment call for the statistician.)

All the calculations that you perform (expected matrix, statistic, degrees of freedom, p-value) are computed by `chi2_contingency`:

```
In [65]: observed
Out[65]:
                        Previously Successful  Previously Unsuccessful
Yes - changed strategy                  129.3                   260.17
No                                      182.7                   711.83

In [66]: from scipy.stats import chi2_contingency

In [67]: chi2, p, dof, expected = chi2_contingency(observed)

In [68]: chi2
Out[68]: 23.383138325890453

In [69]: p
Out[69]: 1.3273696199438626e-06

In [70]: dof
Out[70]: 1

In [71]: expected
Out[71]:
array([[  94.63757009,  294.83242991],
       [ 217.36242991,  677.16757009]])
```

By default, `chi2_contingency` uses a continuity correction when the contingency table is 2x2. If you prefer not to use the correction, you can disable it with the argument `correction=False`:

```
In [73]: chi2, p, dof, expected = chi2_contingency(observed, correction=False)

In [74]: chi2
Out[74]: 24.072616672232893

In [75]: p
Out[75]: 9.2770200776879643e-07
```
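For readers who want to see the three corrections applied to the hand calculation, here is a minimal sketch. It rebuilds the 2x2 table directly instead of slicing with `.ix`, which recent pandas releases no longer provide. The variable names (`obs`, `chi2_plain`, `chi2_yates`, and so on) are my own, and the continuity correction is written in its textbook Yates form, which I am assuming matches what `chi2_contingency` does for this table because every cell deviates from its expected value by more than 0.5.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Observed 2x2 table, with the values copied from the question.
observed = pd.DataFrame(
    {
        "Previously Successful": [129.3, 182.7],
        "Previously Unsuccessful": [260.17, 711.83],
    },
    index=["Yes - changed strategy", "No"],
)

obs = observed.to_numpy()
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)
grand_total = obs.sum()  # divide by this, not by a hard-coded 1000

# Expected counts under independence: outer product of the margins
# divided by the grand total.
expected = np.outer(row_totals, col_totals) / grand_total

# Degrees of freedom for an r x c table: (r - 1) * (c - 1), so 1 here.
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

# Pearson statistic without a continuity correction.
chi2_plain = ((obs - expected) ** 2 / expected).sum()

# Yates-style correction (assumed form): shrink each |observed - expected|
# by 0.5 before squaring.
chi2_yates = ((np.abs(obs - expected) - 0.5) ** 2 / expected).sum()

# The survival function returns the upper-tail probability directly,
# avoiding the rounding that 1 - cdf suffers when the tail is tiny.
p_plain = stats.chi2.sf(chi2_plain, dof)
p_yates = stats.chi2.sf(chi2_yates, dof)

print(chi2_plain, p_plain)
print(chi2_yates, p_yates)
```

The two printed statistics should agree with the `chi2_contingency` session in the answer (about 24.07 without the correction and about 23.38 with it). Using `stats.chi2.sf` rather than `1 - stats.chi2.cdf(...)` is a design choice on my part: when the true tail probability is below double-precision resolution, the subtraction rounds to exactly 0.0, which may be why the question's p-value printed as 0.0.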