Problem description
I've computed a correlation matrix on a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'one': [0.1, .32, .2, 0.4, 0.8], 'two': [.23, .18, .56, .61, .12], 'three': [.9, .3, .6, .5, .3], 'four': [.34, .75, .91, .19, .21], 'zive': [0.1, .32, .2, 0.4, 0.8], 'six': [.9, .3, .6, .5, .3], 'drive': [.9, .3, .6, .5, .3]})
corrMatrix = df.corr()
corrMatrix
       drive   four    one    six  three    two   zive
drive   1.00  -0.04  -0.75   1.00   1.00   0.24  -0.75
four   -0.04   1.00  -0.49  -0.04  -0.04   0.16  -0.49
one    -0.75  -0.49   1.00  -0.75  -0.75  -0.35   1.00
six     1.00  -0.04  -0.75   1.00   1.00   0.24  -0.75
three   1.00  -0.04  -0.75   1.00   1.00   0.24  -0.75
two     0.24   0.16  -0.35   0.24   0.24   1.00  -0.35
zive   -0.75  -0.49   1.00  -0.75  -0.75  -0.35   1.00
Now I want to write some code that returns the perfectly correlated columns (i.e. correlation == 1), arranged in groups.
Ideally, I would want this: [['zive', 'one'], ['three', 'six', 'drive']]
I've written the code below, which gives me ['drive', 'one', 'six', 'three', 'zive']. As you can see, that is just a bag of columns that each have a perfect correlation with some other column; it does not put them into distinct groups with their perfectly correlated cousin columns.
correlatedCols = []
for col in corrMatrix:
    data = corrMatrix[col][corrMatrix[col] == 1]
    if len(data) > 1:
        correlatedCols.append(data.name)
correlatedCols
['drive', 'one', 'six', 'three', 'zive']
Using the advice given by @Karl D., I get this:
import numpy as np
cor = df.corr()
cor.loc[:, :] = np.tril(cor.values, k=-1)  # keep only the strict lower triangle
cor = cor.stack()
cor[cor == 1]
six    drive    1.00
three  drive    1.00
       six      1.00
zive   one      1.00
...which is not quite what I want, since [six, drive] is not a grouping: it's missing 'three'.
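The pairwise output above already contains enough information to build the groups: any two columns linked by a pair belong together, so the pairs can be merged transitively. The following is a minimal sketch of that idea, not part of the original post. It uses a small union-find (the `find` helper and `parent` dict are names introduced here for illustration), and it swaps the exact `== 1` test for `np.isclose`, on the assumption that floating-point rounding may leave a correlation fractionally below 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one':   [0.1, .32, .2, 0.4, 0.8],
    'two':   [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
    'four':  [.34, .75, .91, .19, .21],
    'zive':  [0.1, .32, .2, 0.4, 0.8],
    'six':   [.9, .3, .6, .5, .3],
    'drive': [.9, .3, .6, .5, .3],
})

cor = df.corr()

# Boolean matrix of "approximately 1" entries, restricted to the strict
# lower triangle so each pair shows up once and the diagonal is ignored.
near_one = np.tril(np.isclose(cor.to_numpy(), 1.0), k=-1)
rows, cols = np.where(near_one)
pairs = [(cor.index[r], cor.columns[c]) for r, c in zip(rows, cols)]

# Hypothetical union-find helper: maps each column to a group representative.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

# Merge the two groups that each perfectly correlated pair belongs to.
for a, b in pairs:
    parent[find(a)] = find(b)

# Collect columns by representative, then sort for a stable result.
groups = {}
for col in parent:
    groups.setdefault(find(col), []).append(col)
result = sorted(sorted(g) for g in groups.values())
print(result)
```

Sorting is only there to make the output independent of column order; drop it if the original order matters.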
Recommended answer
Here is a naive approach:
import numpy as np
import pandas as pd
df = pd.DataFrame({'one': [0.1, .32, .2, 0.4, 0.8], 'two': [.23, .18, .56, .61, .12], 'three': [.9, .3, .6, .5, .3], 'four': [.34, .75, .91, .19, .21], 'zive': [0.1, .32, .2, 0.4, 0.8], 'six': [.9, .3, .6, .5, .3], 'drive': [.9, .3, .6, .5, .3]})
corrMatrix = df.corr()
corrMatrix.loc[:, :] = np.tril(corrMatrix, k=-1)  # borrowed from Karl D's answer
already_in = set()  # columns already placed in some group
result = []
for col in corrMatrix:
    # rows below the diagonal that are perfectly correlated with this column
    perfect_corr = corrMatrix[col][corrMatrix[col] == 1].index.tolist()
    if perfect_corr and col not in already_in:
        already_in.update(set(perfect_corr))
        perfect_corr.append(col)
        result.append(perfect_corr)
Result:
>>> result
[['six', 'three', 'drive'], ['zive', 'one']]
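The loop above depends on exact equality with 1 and on the order in which columns are visited. As a possible variation (a sketch under stated assumptions, not the answer's method), each column can be described by the set of columns it is approximately perfectly correlated with, and identical sets can then be collapsed. This assumes that perfect correlation is transitive for the data at hand, which holds mathematically for non-constant columns, and it uses `np.isclose` with its default tolerance in place of `== 1`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one':   [0.1, .32, .2, 0.4, 0.8],
    'two':   [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
    'four':  [.34, .75, .91, .19, .21],
    'zive':  [0.1, .32, .2, 0.4, 0.8],
    'six':   [.9, .3, .6, .5, .3],
    'drive': [.9, .3, .6, .5, .3],
})

corr = df.corr()

# close[i, j] is True when columns i and j have correlation ~ 1
# (the diagonal is always True, so every column matches itself).
close = np.isclose(corr.to_numpy(), 1.0)

# Each row yields the set of columns perfectly correlated with that column;
# columns in the same group produce the same set, so a set of frozensets
# removes the repeats.
groups = {frozenset(corr.columns[row]) for row in close}

# Drop singleton groups (columns correlated only with themselves).
result = sorted(sorted(g) for g in groups if len(g) > 1)
print(result)
```

Because the groups are built from sets, the outcome does not depend on column order, at the cost of losing the original ordering within each group.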