Returning groups of correlated columns in a pandas DataFrame

Problem description

I've run a correlation matrix on a pandas DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [0.1, .32, .2, 0.4, 0.8], 'two': [.23, .18, .56, .61, .12], 'three': [.9, .3, .6, .5, .3], 'four': [.34, .75, .91, .19, .21], 'zive': [0.1, .32, .2, 0.4, 0.8], 'six': [.9, .3, .6, .5, .3], 'drive': [.9, .3, .6, .5, .3]})

corrMatrix=df.corr()
corrMatrix
           drive  four   one   six  three   two  zive
drive       1.00 -0.04 -0.75  1.00   1.00  0.24 -0.75
four       -0.04  1.00 -0.49 -0.04  -0.04  0.16 -0.49
one        -0.75 -0.49  1.00 -0.75  -0.75 -0.35  1.00
six         1.00 -0.04 -0.75  1.00   1.00  0.24 -0.75
three       1.00 -0.04 -0.75  1.00   1.00  0.24 -0.75
two         0.24  0.16 -0.35  0.24   0.24  1.00 -0.35
zive       -0.75 -0.49  1.00 -0.75  -0.75 -0.35  1.00

Now, I want to write some code that returns the perfectly correlated columns (i.e. correlation == 1) in groups.

Ideally, I would want this: [['zive', 'one'], ['three', 'six', 'drive']]

I've written the code below, which gives me ['drive', 'one', 'six', 'three', 'zive']. As you can see, this is just a bag of columns that each have a perfect correlation with some other column; it does not put them into distinct groups with their perfectly correlated cousin columns.

correlatedCols=[]
for col in corrMatrix:
    data=corrMatrix[col][corrMatrix[col]==1]
    if len(data)>1:
        correlatedCols.append(data.name)

correlatedCols  
['drive','one', 'six', 'three', 'zive']

Using the advice given by @Karl D., I get this:

cor = df.corr()
cor.loc[:,:] =  np.tril(cor.values, k=-1)
cor = cor.stack()
cor[cor ==1]
six    drive   1.00
three  drive   1.00
       six     1.00
zive   one     1.00

...which is not quite what I want, since [six, drive] is not a complete grouping: it's missing 'three'.

Recommended answer

Here is a naive approach:

import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [0.1, .32, .2, 0.4, 0.8], 'two': [.23, .18, .56, .61, .12], 'three': [.9, .3, .6, .5, .3], 'four': [.34, .75, .91, .19, .21], 'zive': [0.1, .32, .2, 0.4, 0.8], 'six': [.9, .3, .6, .5, .3], 'drive': [.9, .3, .6, .5, .3]})

corrMatrix = df.corr()

# Keep only the strictly lower triangle so each pair of columns appears once
# and the diagonal (always 1) is zeroed out. Borrowed from Karl D's answer.
corrMatrix.loc[:, :] = np.tril(corrMatrix, k=-1)

already_in = set()  # columns already assigned to a group
result = []
for col in corrMatrix:
    # Columns below `col` in the matrix that correlate perfectly with it
    perfect_corr = corrMatrix[col][corrMatrix[col] == 1].index.tolist()
    if perfect_corr and col not in already_in:
        already_in.update(set(perfect_corr))
        perfect_corr.append(col)
        result.append(perfect_corr)

Result:

>>> result
[['six', 'three', 'drive'], ['zive', 'one']]
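The approach above compares floating-point correlations with == 1, which holds for these identical columns but can miss pairs whose correlation comes out as, say, 0.9999999999999998. As a hedged variation (not part of the original answer), the sketch below tests for 1 within a tolerance using np.isclose and collects groups by walking the columns linked to each other. The function name correlated_groups and the tol parameter are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd


def correlated_groups(df, tol=1e-9):
    """Group columns whose pairwise correlation is 1 within `tol` (hypothetical helper)."""
    corr = df.corr()
    cols = list(corr.columns)
    # Boolean matrix: True where the correlation is numerically 1.
    perfect = np.isclose(corr.to_numpy(), 1.0, rtol=0.0, atol=tol)

    seen = set()
    groups = []
    for i, col in enumerate(cols):
        if col in seen:
            continue
        # Depth-first walk over the columns linked to `col` by a perfect correlation.
        stack, group = [i], []
        seen.add(col)
        while stack:
            j = stack.pop()
            group.append(cols[j])
            for k in np.flatnonzero(perfect[j]):
                if cols[k] not in seen:
                    seen.add(cols[k])
                    stack.append(k)
        # A column that only correlates with itself is not a group.
        if len(group) > 1:
            groups.append(group)
    return groups


df = pd.DataFrame({'one': [0.1, .32, .2, 0.4, 0.8], 'two': [.23, .18, .56, .61, .12],
                   'three': [.9, .3, .6, .5, .3], 'four': [.34, .75, .91, .19, .21],
                   'zive': [0.1, .32, .2, 0.4, 0.8], 'six': [.9, .3, .6, .5, .3],
                   'drive': [.9, .3, .6, .5, .3]})
groups = correlated_groups(df)
print(groups)
```

On the sample data this yields one group containing 'one' and 'zive' and another containing 'three', 'six' and 'drive'; the order of the groups and of the names inside them follows the DataFrame's column order, so it may differ from the listing above.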