Scipy sparse... arrays?

Problem description

So, I'm doing some K-means classification using numpy arrays that are quite sparse: lots and lots of zeroes. I figured that I'd use scipy's 'sparse' package to reduce the storage overhead, but I'm a little confused about how to create arrays, not matrices.

I've gone through this tutorial on how to create sparse matrices: http://www.scipy.org/SciPy_Tutorial#head-c60163f2fd2bab79edd94be43682414f18b90df7

To mimic an array, I just create a 1xN matrix, but as you may guess, Asp.dot(Bsp) doesn't quite work because you can't multiply two 1xN matrices. I'd have to transpose each array to Nx1, and that's pretty lame, since I'd be doing it for every dot-product calculation.

Next up, I tried to create an NxN matrix where column 1 == row 1 (such that you can multiply two matrices and just take the top-left corner as the dot product), but that turned out to be really inefficient.

I'd love to use scipy's sparse package as a magic replacement for numpy's array(), but as yet, I'm not really sure what to do.

Any advice?

Solution

Use a scipy.sparse format that is row or column based: csc_matrix and csr_matrix.

These use efficient C implementations under the hood (including multiplication), and transposition is a no-op (especially if you call transpose(copy=False)), just like with numpy arrays.
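To make the suggestion concrete, here is a minimal sketch of a dot product between two 1xN CSR matrices using the transpose. The helper name sparse_row_dot and the small example vectors are made up for illustration, and the @ operator assumes a reasonably recent scipy and Python 3.

```python
import numpy as np
import scipy.sparse as sp


def sparse_row_dot(a, b):
    """Dot product of two 1xN sparse row vectors (hypothetical helper)."""
    # a is 1xN and b.T is Nx1, so the product is a 1x1 sparse matrix.
    product = a @ b.T
    # Pull the single entry out as a plain Python float.
    return float(product.toarray()[0, 0])


# Two small row vectors stored in CSR format.
a = sp.csr_matrix(np.array([[0.0, 2.0, 0.0, 3.0]]))
b = sp.csr_matrix(np.array([[1.0, 4.0, 0.0, 5.0]]))

result = sparse_row_dot(a, b)  # 2*4 + 3*5
```

Wrapping the transpose inside one helper at least keeps it out of the calling code, which was the annoyance raised in the question.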

EDIT: some timings via ipython:

import numpy, scipy.sparse
n = 100000
x = (numpy.random.rand(n) * 2).astype(int).astype(float) # 50% sparse vector
x_csr = scipy.sparse.csr_matrix(x)
x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))

Now x_csr and x_dok are 50% sparse:

print(repr(x_csr))
<1x100000 sparse matrix of type '<type 'numpy.float64'>'
        with 49757 stored elements in Compressed Sparse Row format>
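The stored-element count in that repr can also be read programmatically. The following is a small sketch, not part of the original timings; it uses numpy's default_rng with a fixed seed (my choice, for repeatability) instead of numpy.random.rand.

```python
import numpy as np
import scipy.sparse as sp

n = 100000
rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability

# Vector of 0.0 / 1.0 values, each with probability 1/2 (about 50% sparse).
x = rng.integers(0, 2, size=n).astype(float)
x_csr = sp.csr_matrix(x)  # a 1-D input becomes a 1xN matrix

stored = x_csr.nnz  # number of explicitly stored elements
density = stored / (x_csr.shape[0] * x_csr.shape[1])
print(x_csr.shape, stored, density)
```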

And the timings:

timeit numpy.dot(x, x)
10000 loops, best of 3: 123 us per loop

timeit x_dok * x_dok.T
1 loops, best of 3: 1.73 s per loop

timeit x_csr.multiply(x_csr).sum()
1000 loops, best of 3: 1.64 ms per loop

timeit x_csr * x_csr.T
100 loops, best of 3: 3.62 ms per loop

So it looks like I told a lie. Transposition is very cheap, but there is no efficient C implementation of csr * csc (as of scipy 0.9.0, the latest release at the time of writing). A new csr object is constructed on each call :-(

As a hack (though scipy is relatively stable these days), you can do the dot product directly on the sparse data:

timeit numpy.dot(x_csr.data, x_csr.data)
10000 loops, best of 3: 62.9 us per loop

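Note that this last approach is again a dense numpy multiplication, just over the stored nonzero values only. The sparsity is 50%, so it's actually faster than dot(x, x) by a factor of 2. It gives the right answer here only because both operands are the same vector and therefore share the same sparsity pattern; for two different vectors, the .data arrays will not line up element by element.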
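For two different vectors, one way to extend the same trick is to match up the column indices first. This is only a sketch under stated assumptions: the helper name sparse_dot_via_indices is hypothetical, both inputs are assumed to be 1xN CSR matrices without duplicate entries, and I have not timed it against the numbers above.

```python
import numpy as np
import scipy.sparse as sp


def sparse_dot_via_indices(a, b):
    """Dot product of two 1xN CSR row vectors with different nonzero patterns.

    Hypothetical helper; assumes each input has a single row and no
    duplicate column entries.
    """
    # For a single-row CSR matrix, .indices holds the column of each stored value.
    _, ia, ib = np.intersect1d(a.indices, b.indices, return_indices=True)
    # Only columns stored in both vectors contribute to the dot product.
    return float(np.dot(a.data[ia], b.data[ib]))


a = sp.csr_matrix(np.array([[0.0, 2.0, 0.0, 3.0]]))
b = sp.csr_matrix(np.array([[1.0, 4.0, 0.0, 5.0]]))

result = sparse_dot_via_indices(a, b)  # 2*4 + 3*5
```

Here a stores two values and b stores three, so numpy.dot(a.data, b.data) would fail on the shape mismatch, while the index-matched version keeps only the columns the two vectors share.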


10-14 22:19