Problem Description
I have a matrix which is fairly large (around 50K rows), and I want to print the correlation coefficient between each pair of rows in the matrix. I have written Python code like this:
import scipy.stats

for i in xrange(rows):              # rows is the number of rows in the matrix
    for j in xrange(i, rows):
        r = scipy.stats.pearsonr(data[i,:], data[j,:])   # (r, p-value) tuple
        print r
Please note that I am making use of the pearsonr function available from the scipy module (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).
My question is: Is there a quicker way of doing this? Is there some matrix partition technique that I can use?
Thanks!
Recommended Answer
New Solution
After looking at Joe Kington's answer, I decided to look into the corrcoef() code and was inspired by it to do the following implementation.
import numpy as np
import scipy.stats

ms = data.mean(axis=1)[:, None]   # row means, kept 2-D for broadcasting
datam = data - ms                 # mean-centered rows
# scipy.stats.ss (sum of squares) was removed in newer SciPy versions;
# np.sum(datam**2, axis=1) is the equivalent.
datass = np.sqrt(scipy.stats.ss(datam, axis=1))
for i in xrange(rows):
    temp = np.dot(datam[i:], datam[i].T)    # row i dotted with rows i..end
    rs = temp / (datass[i:] * datass[i])    # Pearson r's for row i vs. rows i..end
Each pass through the loop generates the Pearson coefficients between row i and rows i through the last row. It is very fast: at least 1.5x as fast as using corrcoef() alone, because it doesn't redundantly calculate the coefficients, among a few other things. It will also be faster and won't give you the memory problems of a 50,000-row matrix, because you can choose either to store each set of r's or to process them before generating another set. Without storing any of the r's long term, I was able to get the above code to run on a 50,000 x 10 set of randomly generated data in under a minute on my fairly new laptop.
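Purely as an illustration, here is how that streaming idea might be driven end to end, in a hypothetical modern Python 3/NumPy sketch. np.sum(datam**2, axis=1) stands in for the removed scipy.stats.ss, and the max-tracking reduction is just one example of processing each set of r's without storing them all:

import numpy as np

# For comparison, Joe Kington's corrcoef() route is the one-liner:
#   R = np.corrcoef(data)   # builds the full rows x rows matrix in memory
rows, cols = 1000, 10                 # use 50000 x 10 for the full-scale run
data = np.random.rand(rows, cols)

ms = data.mean(axis=1)[:, None]             # row means, kept 2-D for broadcasting
datam = data - ms                           # mean-centered rows
datass = np.sqrt(np.sum(datam**2, axis=1))  # root sum of squares per row

best = (-2.0, 0, 0)                   # (r, i, j) of the largest off-diagonal r
for i in range(rows - 1):
    temp = np.dot(datam[i+1:], datam[i])    # row i dotted with all later rows
    rs = temp / (datass[i+1:] * datass[i])  # Pearson r's for row i vs. rows i+1..end
    j = int(np.argmax(rs))
    if rs[j] > best[0]:
        best = (rs[j], i, i + 1 + j)

print("largest off-diagonal r = %.4f between rows %d and %d" % best)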
Old Method
First, I wouldn't recommend printing the r's to the screen. For 100 rows (10 columns), there is a difference of 19.79 seconds with printing vs. 0.301 seconds without, using your code. Just store the r's and use them later if you would like, or do some processing on them as you go, such as looking for some of the largest r's.
Second, you can get some savings by not redundantly calculating some quantities. The Pearson coefficient is calculated in scipy using some quantities that you can precalculate rather than recalculating every time a row is used. Also, you aren't using the p-value (which is also returned by pearsonr()), so let's scratch that too. Using the below code:
import numpy as np
import scipy.stats

r = np.zeros((rows, rows))
ms = data.mean(axis=1)                  # row means
datam = np.zeros_like(data)
for i in xrange(rows):
    datam[i] = data[i] - ms[i]          # mean-center each row
datass = scipy.stats.ss(datam, axis=1)  # sum of squares per row
for i in xrange(rows):
    for j in xrange(i, rows):
        r_num = np.add.reduce(datam[i] * datam[j])   # covariance numerator
        r_den = np.sqrt(datass[i] * datass[j])       # product of row norms
        r[i, j] = min((r_num / r_den), 1.0)          # clamp rounding error at 1.0
I get a speed-up of about 4.8x over the straight scipy code when I've removed the p-value stuff, and 8.8x if I leave the p-value stuff in there (I used 10 columns with hundreds of rows). I also checked that it does give the same results. This isn't a really huge improvement, but it might help.
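Once the loop above has filled in the upper triangle of r, finding the largest coefficients mentioned earlier is cheap post-processing. A minimal sketch, assuming the r and rows variables from the code above (the names k, iu, and top are illustrative, not from the original answer):

k = 5
iu = np.triu_indices(rows, k=1)     # indices above the diagonal (skips r[i,i] = 1)
vals = r[iu]
top = np.argsort(vals)[-k:][::-1]   # positions of the k largest r's
for t in top:
    print("r[%d,%d] = %.4f" % (iu[0][t], iu[1][t], vals[t]))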
Ultimately, you are stuck with the problem that you are computing (50000)*(50001)/2 = 1,250,025,000 Pearson coefficients (if I'm counting correctly). That's a lot. By the way, there's really no need to compute each row's Pearson coefficient with itself (it will equal 1), but that only saves you from computing 50,000 of them. With the above code, based on my results on smaller datasets, I expect the computation to take about 4 1/4 hours if your data has 10 columns.
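That pair count is just the size of the upper triangle, diagonal included, which is easy to sanity-check:

n = 50000
print(n * (n + 1) // 2)   # 1250025000 pairs, diagonal included
print(n * (n - 1) // 2)   # 1249975000 if you skip the 50,000 self-correlations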
You can get some improvement by porting the correlation code above to Cython or something similar. I expect that you'll maybe get up to a 10x improvement over straight SciPy if you're lucky. Also, as suggested by pyInTheSky, you can do some multiprocessing.
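The answer only mentions multiprocessing in passing; one minimal way it could look is to split the outer loop across worker processes. This is a hypothetical Python 3 sketch (the _init/_row_block helper names and the per-block max reduction are my own illustration, not from the original answer):

import numpy as np
from multiprocessing import Pool

def _init(dm, ds):
    # Runs once per worker; stashes the shared arrays as module globals.
    global datam, datass
    datam, datass = dm, ds

def _row_block(i):
    # Same math as the fast loop above: row i vs. all later rows.
    temp = np.dot(datam[i+1:], datam[i])
    rs = temp / (datass[i+1:] * datass[i])
    return rs.max()   # reduce each block instead of returning ~rows values

if __name__ == "__main__":
    rows, cols = 2000, 10
    data = np.random.rand(rows, cols)
    dm = data - data.mean(axis=1)[:, None]
    ds = np.sqrt(np.sum(dm**2, axis=1))
    with Pool(initializer=_init, initargs=(dm, ds)) as pool:
        block_maxes = pool.map(_row_block, range(rows - 1))
    print("largest off-diagonal r:", max(block_maxes))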