Question
I've got a large matrix stored as a scipy.sparse.csc_matrix and want to subtract a column vector from each of the columns in the large matrix. This is a pretty common task when you're doing things like normalization/standardization, but I can't seem to find the proper way to do this efficiently.
Here's a demo example:
import scipy.sparse

# mat is a 3x3 matrix
mat = scipy.sparse.csc_matrix([[1, 2, 3],
                               [2, 3, 4],
                               [3, 4, 5]])

# vec is a 3x1 matrix (or a column vector)
vec = scipy.sparse.csc_matrix([1, 2, 3]).T

"""
I want to subtract `vec` from each of the columns in `mat` yielding...
[[0, 1, 2],
 [0, 1, 2],
 [0, 1, 2]]
"""
One way to accomplish what I want is to hstack vec to itself 3 times, yielding a 3x3 matrix where each column is vec, and then subtract that from mat. But again, I'm looking for a way to do this efficiently, and the hstacked matrix takes a long time to create. I'm sure there's some magical way to do this with slicing and broadcasting, but it eludes me.
Thanks!
Removed the 'in-place' constraint, because the sparsity structure would be constantly changing in an in-place assignment scenario.
Recommended answer
For a start, what would we do with dense arrays?
mat-vec.A # taking advantage of broadcasting
mat-vec.A[:,[0]*3] # explicit broadcasting
mat-vec[:,[0,0,0]] # that also works with csr matrix
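As a rough check of the expressions above, here is a minimal self-contained sketch using the question's 3x3 example. It uses `toarray()` in place of the `.A` shorthand, and the names `result` and `explicit` are only illustrative. Note that subtracting a dense array from a sparse matrix gives a dense result, so this assumes the full matrix fits in memory.

```python
import numpy as np
import scipy.sparse

mat = scipy.sparse.csc_matrix([[1, 2, 3],
                               [2, 3, 4],
                               [3, 4, 5]])
vec = scipy.sparse.csc_matrix([1, 2, 3]).T   # 3x1 sparse column

# Densify only the vector; the (3, 1) array broadcasts across the columns.
# The result of sparse - dense is dense, so wrap it as a plain ndarray.
result = np.asarray(mat - vec.toarray())

# Explicit broadcasting while staying sparse: repeat column 0 of vec
# once per column of mat, then subtract sparse from sparse.
explicit = (mat - vec[:, [0] * mat.shape[1]]).toarray()

print(result)
print(explicit)
```

Both forms should agree on this example; the second keeps everything sparse but has to build the repeated-column matrix first.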
In https://codereview.stackexchange.com/questions/32664/numpy-scipy-optimization/33566 we found that using as_strided on the mat.indptr vector is the most efficient way of stepping through the rows of a sparse matrix in CSR format. (The x.rows, x.data of a lil_matrix are nearly as good. getrow is slow.) This function implements such an iteration.
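from numpy.lib.stride_tricks import as_strided

def subtract_rows(X, v):
    # X must be CSR: indptr[i]:indptr[i+1] delimits the stored values of row i
    rows, cols = X.shape
    row_start_stop = as_strided(X.indptr, shape=(rows, 2),
                                strides=2*X.indptr.strides)
    for row, (start, stop) in enumerate(row_start_stop):
        data = X.data[start:stop]   # a view into X.data, so the update is in place
        data -= v[row]

mat = mat.tocsr()   # the question's matrix is CSC; row iteration needs CSR
subtract_rows(mat, vec.A)
print(mat.A)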
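For comparison, the same stored-entries update can be sketched without a Python loop. This is an alternative, not the approach in the linked code review: it assumes the matrix is in CSR format with a floating-point dtype, and that `v` is a dense 1-D array with one value per row. `np.diff(X.indptr)` gives the number of stored entries in each row, so `np.repeat` lines up one copy of `v[row]` with every stored entry of that row.

```python
import numpy as np
import scipy.sparse

X = scipy.sparse.csr_matrix(np.array([[1., 2., 3.],
                                      [2., 3., 4.],
                                      [3., 4., 5.]]))
v = np.array([1., 2., 3.])

# Number of stored entries per row, then one copy of v[row] per entry.
per_entry = np.repeat(v, np.diff(X.indptr))

# Update the stored values in place; implicit zeros are not touched.
X.data -= per_entry

print(X.toarray())
```

Like the loop version, this only changes entries that are already stored in the matrix.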
I'm using vec.A for simplicity. If we kept vec sparse, we'd have to add a test for a nonzero value at row. Also, this type of iteration only modifies the nonzero elements of mat; the 0's are unchanged, so the result matches a true subtraction only where mat already stores a value.
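To make the last point concrete, here is a small sketch on a made-up matrix that does contain zeros (the values are illustrative, not from the question). It uses plain `indptr` slicing rather than `as_strided` to keep the example short, and compares the stored-entries update against a full dense subtraction.

```python
import numpy as np
import scipy.sparse

X = scipy.sparse.csr_matrix(np.array([[1., 0., 3.],
                                      [0., 0., 4.],
                                      [5., 6., 0.]]))
v = np.array([1., 2., 3.])

# Reference: a true subtraction, which also changes the zero entries.
full = X.toarray() - v[:, None]

# Stored-entries update: only values already present in X.data change.
for row in range(X.shape[0]):
    start, stop = X.indptr[row], X.indptr[row + 1]
    X.data[start:stop] -= v[row]

stored_only = X.toarray()

print(full)
print(stored_only)
```

The two results differ exactly at the positions where the original matrix had no stored value, which is why this style of update suits cases where only the existing nonzeros should shift.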
I suspect the time advantages will depend a lot on the sparsity of the matrix and the vector. If vec has lots of zeros, then it makes sense to iterate, modifying only those rows of mat where vec is nonzero. But if vec is nearly dense, as in this example, it may be hard to beat mat-vec.A.
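A minimal sketch of that idea, assuming the matrix is CSR and the vector is an n x 1 sparse column: the helper name `subtract_sparse_column` is hypothetical, and the data below is made up for illustration. Converting the vector to COO exposes the row index and value of each of its nonzeros, so rows where it is zero are skipped entirely.

```python
import numpy as np
import scipy.sparse

def subtract_sparse_column(X, vec):
    """Subtract an n x 1 sparse column from the stored entries of CSR matrix X, in place."""
    coo = vec.tocoo()
    # Visit only the rows where vec has a nonzero value.
    for row, value in zip(coo.row, coo.data):
        start, stop = X.indptr[row], X.indptr[row + 1]
        X.data[start:stop] -= value

X = scipy.sparse.csr_matrix(np.array([[1., 2., 0.],
                                      [3., 0., 4.],
                                      [0., 5., 6.]]))
vec = scipy.sparse.csc_matrix(np.array([[0.], [2.], [0.]]))

subtract_sparse_column(X, vec)
print(X.toarray())
```

Here only row 1 is touched, so the loop does work proportional to the number of nonzeros in the vector rather than the number of rows in the matrix.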