Problem description

So, I'm doing some Kmeans classification using numpy arrays that are quite sparse: lots and lots of zeroes. I figured that I'd use scipy's 'sparse' package to reduce the storage overhead, but I'm a little confused about how to create arrays, not matrices.

I've gone through this tutorial on how to create sparse matrices: http://www.scipy.org/SciPy_Tutorial#head-c60163f2fd2bab79edd94be43682414f18b90df7

To mimic an array, I just create a 1xN matrix, but as you may guess, Asp.dot(Bsp) doesn't quite work because you can't multiply two 1xN matrices. I'd have to transpose each array to Nx1, and that's pretty lame, since I'd be doing it for every dot-product calculation.

Next up, I tried to create an NxN matrix where column 1 == row 1 (such that you can multiply two matrices and just take the top-left corner as the dot product), but that turned out to be really inefficient.

I'd love to use scipy's sparse package as a magic replacement for numpy's array(), but as yet, I'm not really sure what to do.

Any advice?

Answer

Use a scipy.sparse format that is row or column based: csc_matrix and csr_matrix.

These use efficient C implementations under the hood (including multiplication), and transposition is a no-op (esp. if you call transpose(copy=False)), just like with numpy arrays.

EDIT: some timings via ipython, on a 50% sparse vector x with 100000 elements and its csr_matrix and dok_matrix versions, x_csr and x_dok: numpy.dot(x, x) takes 123 us per loop, x_csr.multiply(x_csr).sum() 1.64 ms, x_csr * x_csr.T 3.62 ms, and x_dok * x_dok.T 1.73 s.

So it looks like I told a lie. Transposition is very cheap, but there is no efficient C implementation of csr * csc (in the latest scipy 0.9.0). A new csr object is constructed in each call :-(

As a hack (though scipy is relatively stable these days), you can do the dot product directly on the sparse data with numpy.dot(x_csr.data, x_csr.data), which takes 62.9 us per loop.

Note this last approach does a numpy dense multiplication again. The sparsity is 50%, so it's actually faster than dot(x, x) by a factor of 2.
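A minimal sketch of the transposition approach described in the answer, assuming Python 3 and a recent SciPy (the answer itself targeted scipy 0.9.0). The names `a`, `b`, `a_sp`, `b_sp`, and the helper `sparse_dot` are illustrative and not from the original post. The product of a 1xN and an Nx1 sparse matrix is a 1x1 sparse matrix, so the scalar has to be read out explicitly.

```python
import numpy as np
from scipy import sparse


def sparse_dot(u, v):
    """Dot product of two 1xN sparse row vectors (hypothetical helper)."""
    # u is 1xN and v.T is Nx1, so the product is a 1x1 sparse matrix.
    product = u.dot(v.T)
    # Convert the 1x1 result to a dense array and read out the scalar.
    return product.toarray()[0, 0]


a = np.array([0.0, 2.0, 0.0, 0.0, 3.0])
b = np.array([1.0, 4.0, 0.0, 0.0, 5.0])

# csr_matrix turns a 1-D array into a 1xN sparse matrix.
a_sp = sparse.csr_matrix(a)
b_sp = sparse.csr_matrix(b)

result = sparse_dot(a_sp, b_sp)
```

Wrapping the transpose in a helper keeps it out of the calling code, which was the asker's main complaint about the 1xN layout.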
Timing session (ipython) for the row- and column-based scipy.sparse formats, csc_matrix and csr_matrix, where transpose(copy=False) is cheap, just like with numpy arrays:

    import numpy, scipy.sparse
    n = 100000
    x = (numpy.random.rand(n) * 2).astype(int).astype(float) # 50% sparse vector
    x_csr = scipy.sparse.csr_matrix(x)
    x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))

x_csr and x_dok are 50% sparse:

    print repr(x_csr)
    <1x100000 sparse matrix of type '<type 'numpy.float64'>'
    with 49757 stored elements in Compressed Sparse Row format>

The timings:

    timeit numpy.dot(x, x)
    10000 loops, best of 3: 123 us per loop

    timeit x_dok * x_dok.T
    1 loops, best of 3: 1.73 s per loop

    timeit x_csr.multiply(x_csr).sum()
    1000 loops, best of 3: 1.64 ms per loop

    timeit x_csr * x_csr.T
    100 loops, best of 3: 3.62 ms per loop

The dot product directly on the stored data, which is faster than dot(x, x) by a factor of 2:

    timeit numpy.dot(x_csr.data, x_csr.data)
    10000 loops, best of 3: 62.9 us per loop
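One caveat on the numpy.dot(x_csr.data, x_csr.data) hack: it works in the timing above because the vector is dotted with itself, so both operands share one sparsity pattern. For two different vectors the `.data` arrays generally differ in length and in the positions they stand for. The following is a hedged sketch, assuming Python 3, a recent SciPy, and NumPy 1.15 or later; `sparse_row_dot` is a hypothetical helper, not a SciPy function. It matches the stored column indices before multiplying.

```python
import numpy as np
from scipy import sparse


def sparse_row_dot(u, v):
    """Dot product of two 1xN csr_matrix rows using only stored entries."""
    # Work on copies so the callers' matrices are left untouched.
    u = u.copy()
    v = v.copy()
    # Merge duplicate entries and sort indices; the matching below needs
    # each column index to appear at most once per vector.
    u.sum_duplicates()
    v.sum_duplicates()
    # Find the column positions that are stored in both vectors.
    _, iu, iv = np.intersect1d(
        u.indices, v.indices, assume_unique=True, return_indices=True
    )
    # Only positions stored in both vectors contribute to the dot product.
    return float(np.dot(u.data[iu], v.data[iv]))


x = sparse.csr_matrix(np.array([0.0, 2.0, 0.0, 3.0, 1.0]))
y = sparse.csr_matrix(np.array([4.0, 5.0, 0.0, 0.0, 6.0]))

xy = sparse_row_dot(x, y)  # two different sparsity patterns
xx = sparse_row_dot(x, x)  # same pattern, matches the .data hack
```

The `x_csr.multiply(y_csr).sum()` form from the timings gives the same value without touching the internal arrays, so the index matching here is only worth considering if profiling shows it helps.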