本文介绍了非常大和非常稀疏的非负矩阵分解的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个非常大且稀疏的矩阵(531K x 315K),总细胞数约为1670亿.非零值仅为1s.非零值的总数约为45K.是否有有效的NMF软件包来解决我的问题?我知道有几个软件包,它们仅适用于较小的数据矩阵.任何想法都可以.预先感谢.

I have a very large and also sparse matrix (531K x 315K), the number of total cells is ~167 Billion. The non-zero values are only 1s. Total number of non-zero values are around 45K. Is there an efficient NMF package to solve my problem? I know there are couple of packages for that and they are working well only for small size of data matrix. Any idea helps. Thanks in advance.

推荐答案

scikit学习可以轻松地

from time import perf_counter as pc
import numpy as np
import scipy.sparse as sps
from sklearn.decomposition import NMF

""" Create sparse data """
nnz_i, nnz_j, nnz_val = np.random.choice(531000, size=45000), \
                        np.random.choice(315000, size=45000), \
                        np.random.random(size=45000)
X =  sps.csr_matrix((nnz_val, (nnz_i, nnz_j)), shape=(531000, 315000))
print('X-shape: ', X.shape, ' X nnzs: ', X.nnz)
print('type(X): ', type(X))
# <class 'scipy.sparse.csr.csr_matrix'> #                          !!!!!!!!!!

""" NMF """
model = NMF(n_components=50, init='random', random_state=0, verbose=True)

start_time = pc()
W = model.fit_transform(X)
end_time = pc()

print('Used (secs): ', end_time - start_time)
print(model.reconstruction_err_)
print(model.n_iter_)

输出:

X-shape:  (531000, 315000)  X nnzs:  45000
type(X):  <class 'scipy.sparse.csr.csr_matrix'>
violation: 1.0
violation: 0.2318929397542804
violation: 0.11045394409727402
violation: 0.08104138988253409
...
violation: 9.659665625799714e-05
Converged at iteration 71
Used (secs):  247.94092973091756
122.27109041
70

备注:

  • 确保您使用稀疏矩阵作为输入,否则您将无法利用稀疏性
  • 我正在使用版本 0.19.1 ,因此使用了 multiplicative-update 求解器(> = 0.19)
    • 但是较旧的基于CD的求解器也应该处理这个问题!
    • Remarks:

      • Make sure you use sparse-matrices as input or you can't exploit sparsity
      • I'm using version 0.19.1, so the multiplicative-update solver is used (>= 0.19)
        • But the older CD-based solver should handle this too!
        • 如评论中所述,OP希望添加其他约束,同时仍未正式指定这些约束.

          As mentioned in the comments, OP wants to add additional constraints, while still not specifying these formally.

          这将需要对优化过程进行全新的实现,包括一些理论上的工作(取决于约束).

          This will need a whole new implementation of some optimization-procedure including some theory-footwork (depending on the constraints).

          作为替代方案,这可以通过通用凸编程求解器解决.例如.由cvxpy制定并由SCS解决.当然,也需要进行交替最小化过程(因为联合问题是非凸的),并且比这种专门的sklearn实现,其伸缩性更差.但这可能适用于OP数据.

          As an alternative, this can be solved by general-purpose Convex-Programming solvers. E.g. formulated by cvxpy and solved by SCS. Of course the alternating-minimization procedure needs to be done too (as the joint-problem is non-convex) and it will scale worse than this specialized sklearn-implementation. But it might work for OPs data.

          这篇关于非常大和非常稀疏的非负矩阵分解的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-18 04:43