Problem Description
How do I correlate two pandas dataframes, find a single r value for all values? I don't want to correlate columns or rows, but all scalar values. One dataframe is the x axis, and the other dataframe is the y axis.
I downloaded identically structured csv files here: https://www.gapminder.org/data/
The tables have years for columns, countries for rows, with numerical values for the indicator that each table reports.
For instance, I want to see how the Political Participation Indicator (gapminder calls it an index, but I don't want to confuse it with a dataframe index) correlates overall with the Government Functioning Indicator, by year and country.
import pandas as pd

pol_partix_idx_EIU_df = pd.read_csv('polpartix_eiu.csv',index_col=0)
govt_idx_EIU_df = pd.read_csv('gvtx_eiu.csv',index_col=0)
pol_partix_idx_EIU_df.head()
2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
country
Afghanistan 0.222 0.222 0.222 0.250 0.278 0.278 0.278 0.278 0.389 0.389 0.278 0.278 0.444
Albania 0.444 0.444 0.444 0.444 0.444 0.500 0.500 0.500 0.500 0.556 0.556 0.556 0.556
Algeria 0.222 0.194 0.167 0.223 0.278 0.278 0.389 0.389 0.389 0.389 0.389 0.389 0.389
Angola 0.111 0.250 0.389 0.416 0.444 0.444 0.500 0.500 0.500 0.500 0.556 0.556 0.556
Argentina 0.556 0.556 0.556 0.556 0.556 0.556 0.556 0.556 0.556 0.611 0.611 0.611 0.611
You can correlate by column or row:
pol_partix_idx_EIU_df.corrwith(govt_idx_EIU_df, axis=0)
2006 0.738297
2007 0.745321
2008 0.731913
...
2018 0.718520
dtype: float64
pol_partix_idx_EIU_df.corrwith(govt_idx_EIU_df, axis=1)
country
Afghanistan 6.790123e-01
Albania -5.664265e-01
...
Zimbabwe 4.456537e-01
Length: 164, dtype: float64
But I want a single r value that compares every field in one table with every corresponding field in the other table. Essentially, I want the r value of this scatterplot:
import matplotlib.pyplot as plt

plt.scatter(pol_partix_idx_EIU_df, govt_idx_EIU_df)
plt.xlabel('Political participation')
plt.ylabel('Government functioning')
(The example code won't color the plot like this, but plots the same points.)
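In other words, I imagine the value I'm after could be computed by flattening both tables into matching 1-D arrays, something like the rough sketch below (this assumes, for now, that both dataframes have exactly the same countries and years in the same order; the NaN masking is just my guess at how to skip missing cells):

import numpy as np

# Flatten both tables into 1-D arrays of corresponding cells
x = pol_partix_idx_EIU_df.values.ravel()
y = govt_idx_EIU_df.values.ravel()

# Drop any pair where either table has a missing value, then correlate
mask = ~(np.isnan(x) | np.isnan(y))
np.corrcoef(x[mask], y[mask])[0, 1]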
The second part of the question would be how to do this with tables that aren't exactly identical in structure. Every table (dataframe) I want to compare has country records and year columns, but not all of them have the same countries or years. In the example above, they do. How do I get a single r value for only the shared rows and columns of the dataframes?
I've simulated a setup that I think mimics yours--three dataframes with countries across rows and years across columns. I then concatenate the different sets of data into a single dataframe and show how to compute the correlation between them. Let me know if something about this example doesn't match your setup.
import pandas as pd
set1 = pd.DataFrame({1980:[4, 11, 0], 1981:[5, 10, 2], 1982:[0, 3, 1]},
index=pd.Index(['USA', 'UK', 'Iran'], name='country'))
set1.columns.name = 'year'
set1
year 1980 1981 1982
country
USA 4 5 0
UK 11 10 3
Iran 0 2 1
set2 = pd.DataFrame({1981:[2, 1, 10], 1982:[15, 1, 12], 1983:[10, 13, 1]},
index=pd.Index(['USA', 'UK', 'Turkey'], name='country'))
set2.columns.name = 'year'
set2
year 1981 1982 1983
country
USA 2 15 10
UK 1 1 13
Turkey 10 12 1
Notice that, like your setup, not every country or year is present in every dataset.
set3 = pd.DataFrame({1980:[12, 11, 4], 1982:[9, 8, 11]},
index=pd.Index(['USA', 'UK', 'Turkey'], name='country'))
set3.columns.name = 'year'
We can turn these into multi-indexed series by stacking along year, and then concatenate them across columns using pd.concat.
df = pd.concat([set1.stack('year'), set2.stack('year'), set3.stack('year')],
keys=['set1', 'set2', 'set3'], names=['set'], axis=1)
df
set set1 set2 set3
country year
Iran 1980 0.0 NaN NaN
1981 2.0 NaN NaN
1982 1.0 NaN NaN
Turkey 1980 NaN NaN 4.0
1981 NaN 10.0 NaN
1982 NaN 12.0 11.0
1983 NaN 1.0 NaN
UK 1980 11.0 NaN 11.0
1981 10.0 1.0 NaN
1982 3.0 1.0 8.0
1983 NaN 13.0 NaN
USA 1980 4.0 NaN 12.0
1981 5.0 2.0 NaN
1982 0.0 15.0 9.0
1983 NaN 10.0 NaN
And we can compute a 3x3 correlation matrix across the three different sets.
df.corr()
set set1 set2 set3
set
set1 1.000000 -0.723632 0.509902
set2 -0.723632 1.000000 0.606891
set3 0.509902 0.606891 1.000000
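If you want a single r value between just two of the sets rather than the full matrix, you can correlate the corresponding pair of columns directly; pandas uses pairwise-complete observations, so only the (country, year) cells present in both sets contribute. As a small extension of the example (the last line is my guess at how this would look for your two indicator tables, not something I've run against your csvs):

# Single r value between set1 and set2; NaN pairs are dropped automatically
df['set1'].corr(df['set2'])    # -0.723632, matching the matrix above

# For your real tables: stack each into a (country, year)-indexed series and
# correlate. Series.corr aligns the two series on their shared index labels,
# so only the countries and years common to both tables are used.
pol_partix_idx_EIU_df.stack().corr(govt_idx_EIU_df.stack())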