问题描述
我寻找一个 NMF 实现,它有一个 python 接口,并处理丢失的数据和零.
I look for a NMF implementation that has a python interface, and handles both missing data and zeros.
我不想在开始分解之前估算我的缺失值,我希望它们在最小化函数中被忽略.
I don't want to impute my missing values before starting the factorization, I want them to be ignored in the minimized function.
scikit-learn、nimfa、graphlab 和 mahout 似乎都没有提出这样的选择.
It seems that neither scikit-learn, nor nimfa, nor graphlab, nor mahout propose such an option.
谢谢!
推荐答案
使用这个Matlab 到 python 代码转换表 我能够从 Matlab 工具箱 库重写 NMF.
我必须分解一个 40k X 1k
矩阵,稀疏度为 0.7%.我的机器使用 500 个潜在特征需要 20 分钟进行 100 次迭代.
Using this Matlab to python code conversion sheet I was able to rewrite NMF from Matlab toolbox library.
I had to decompose a 40k X 1k
matrix with sparsity of 0.7%. Using 500 latent features my machine took 20 minutes for 100 iteration.
方法如下:
import numpy as np
from scipy import linalg
from numpy import dot
def nmf(X, latent_features, max_iter=100, error_limit=1e-6, fit_error_limit=1e-6):
"""
Decompose X to A*Y
"""
eps = 1e-5
print 'Starting NMF decomposition with {} latent features and {} iterations.'.format(latent_features, max_iter)
X = X.toarray() # I am passing in a scipy sparse matrix
# mask
mask = np.sign(X)
# initial matrices. A is random [0,1] and Y is AX.
rows, columns = X.shape
A = np.random.rand(rows, latent_features)
A = np.maximum(A, eps)
Y = linalg.lstsq(A, X)[0]
Y = np.maximum(Y, eps)
masked_X = mask * X
X_est_prev = dot(A, Y)
for i in range(1, max_iter + 1):
# ===== updates =====
# Matlab: A=A.*(((W.*X)*Y')./((W.*(A*Y))*Y'));
top = dot(masked_X, Y.T)
bottom = (dot((mask * dot(A, Y)), Y.T)) + eps
A *= top / bottom
A = np.maximum(A, eps)
# print 'A', np.round(A, 2)
# Matlab: Y=Y.*((A'*(W.*X))./(A'*(W.*(A*Y))));
top = dot(A.T, masked_X)
bottom = dot(A.T, mask * dot(A, Y)) + eps
Y *= top / bottom
Y = np.maximum(Y, eps)
# print 'Y', np.round(Y, 2)
# ==== evaluation ====
if i % 5 == 0 or i == 1 or i == max_iter:
print 'Iteration {}:'.format(i),
X_est = dot(A, Y)
err = mask * (X_est_prev - X_est)
fit_residual = np.sqrt(np.sum(err ** 2))
X_est_prev = X_est
curRes = linalg.norm(mask * (X - X_est), ord='fro')
print 'fit residual', np.round(fit_residual, 4),
print 'total residual', np.round(curRes, 4)
if curRes < error_limit or fit_residual < fit_error_limit:
break
return A, Y
这里我使用 Scipy 稀疏矩阵作为输入,使用 toarray()
方法将缺失值转换为 0
.因此,掩码是使用 numpy.sign()
函数创建的.但是,如果您有 nan
值,您可以使用 numpy.isnan()
函数获得相同的结果.
Here I was using Scipy sparse matrix as input and missing values were converted to 0
using toarray()
method. Therefore, the mask was created using numpy.sign()
function. However, if you have nan
values you could get same results by using numpy.isnan()
function.
这篇关于处理零和缺失数据的 Python 非负矩阵分解?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!