非常大和非常稀疏的非负矩阵分解

本文介绍了非常大和非常稀疏的非负矩阵分解的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个非常大且稀疏的矩阵(531K x 315K)，总细胞数约为1670亿.非零值仅为1s.非零值的总数约为45K.是否有有效的NMF软件包来解决我的问题?我知道有几个软件包，它们仅适用于较小的数据矩阵.任何想法都可以.预先感谢.

I have a very large and also sparse matrix (531K x 315K), the number of total cells is ~167 Billion. The non-zero values are only 1s. Total number of non-zero values are around 45K. Is there an efficient NMF package to solve my problem? I know there are couple of packages for that and they are working well only for small size of data matrix. Any idea helps. Thanks in advance.

推荐答案

scikit学习可以轻松地！

from time import perf_counter as pc
import numpy as np
import scipy.sparse as sps
from sklearn.decomposition import NMF

""" Create sparse data """
nnz_i, nnz_j, nnz_val = np.random.choice(531000, size=45000), \
                        np.random.choice(315000, size=45000), \
                        np.random.random(size=45000)
X =  sps.csr_matrix((nnz_val, (nnz_i, nnz_j)), shape=(531000, 315000))
print('X-shape: ', X.shape, ' X nnzs: ', X.nnz)
print('type(X): ', type(X))
# <class 'scipy.sparse.csr.csr_matrix'> #                          !!!!!!!!!!

""" NMF """
model = NMF(n_components=50, init='random', random_state=0, verbose=True)

start_time = pc()
W = model.fit_transform(X)
end_time = pc()

print('Used (secs): ', end_time - start_time)
print(model.reconstruction_err_)
print(model.n_iter_)

输出:

X-shape:  (531000, 315000)  X nnzs:  45000
type(X):  <class 'scipy.sparse.csr.csr_matrix'>
violation: 1.0
violation: 0.2318929397542804
violation: 0.11045394409727402
violation: 0.08104138988253409
...
violation: 9.659665625799714e-05
Converged at iteration 71
Used (secs):  247.94092973091756
122.27109041
70

备注:

确保您使用稀疏矩阵作为输入，否则您将无法利用稀疏性
我正在使用版本 0.19.1 ，因此使用了 multiplicative-update 求解器(> = 0.19)
- 但是较旧的基于CD的求解器也应该处理这个问题！

非常