How to implement latent Dirichlet allocation in PyMC3?
Problem description
I am trying to implement LDA (latent Dirichlet allocation) using PyMC3.
However, when I define the last part of the model, in which words are sampled based on their topics, I keep getting the error: TypeError: list indices must be integers, not TensorVariable
How can I fix this?
The code is as follows:
## Data Preparation
K = 2 # number of topics
N = 4 # number of words
D = 3 # number of documents

import numpy as np
data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])
Wd = [len(doc) for doc in data] # length of each document

## Model Specification
from pymc3 import Model, Normal, HalfNormal, Dirichlet, Categorical, constant
lda_model = Model()
with lda_model:
    # Priors for unknown model parameters
    alpha = HalfNormal('alpha', sd=1)
    eta = HalfNormal('eta', sd=1)
    a1 = eta*np.ones(shape=N)
    a2 = alpha*np.ones(shape=K)
    beta = [Dirichlet('beta_%i' % i, a1, shape=N) for i in range(K)]
    theta = [Dirichlet('theta_%s' % i, a2, shape=K) for i in range(D)]
    z = [Categorical('z_%i' % d, p = theta[d], shape=Wd[d]) for d in range(D)]
    # That's when you get the error. It is caused by: beta[z[d][w]]
    # (beta is a Python list, and z[d][w] is a symbolic TensorVariable)
    w = [Categorical('w_%i_%i' % (d, w), p = beta[z[d][w]], observed = data[d, w]) for d in range(D) for w in range(Wd[d])]
Any help would be much appreciated!
Recommended answer
The following code is adapted from the material referenced by @Hanan. I have gotten it to work with PyMC3.
import numpy as np
import pymc3 as pm

def get_word_dict(collection):
    vocab_list = list({word for doc in collection for word in doc})
    idx_list = [i for i in range(len(vocab_list))]
    return dict(zip(vocab_list,idx_list))

def word_to_idx(dict_vocab_idx, collection):
    return [[dict_vocab_idx[word] for word in doc] for doc in collection]

docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],
        ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],
        ["sepak","bola","sepak","bola","sepak","sepak"],
        ["ekonomi","ekonomi","uang","uang"],
        ["sepak","uang","ekonomi"],
        ["komputer","komputer","teknologi","teknologi","komputer","teknologi"],
        ["teknologi","komputer","teknologi"]]

dict_vocab_idx = get_word_dict(docs)
idxed_collection = word_to_idx(dict_vocab_idx, docs)

n_topics = 3
n_vocab = len(dict_vocab_idx)
n_docs = len(idxed_collection)
length_docs = [len(doc) for doc in idxed_collection]

alpha = np.ones([n_docs, n_topics])
beta = np.ones([n_topics, n_vocab])

with pm.Model() as model:
    theta = pm.distributions.Dirichlet('theta', a=alpha, shape=(n_docs, n_topics))
    phi = pm.distributions.Dirichlet('phi', a=beta, shape=(n_topics, n_vocab))
    zs = [pm.Categorical("z_d{}".format(d), p=theta[d], shape=length_docs[d]) for d in range(n_docs)]
    ws = [pm.Categorical("w_{}_{}".format(d,i), p=phi[zs[d][i]], observed=idxed_collection[d][i])
          for d in range(n_docs) for i in range(length_docs[d])]
    trace = pm.sample(2000)

for d in range(n_docs):
    value_z=trace.get_values("z_d{}".format(d))
    print(value_z[1999])
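The main difference from the code in the question seems to be that phi is declared as one Dirichlet variable of shape (n_topics, n_vocab), so phi[zs[d][i]] indexes a tensor, whereas beta in the question is a Python list of separate variables, and a Python list cannot be indexed by a symbolic value. The sketch below illustrates the same distinction using plain NumPy as an analogy; it is not PyMC3 code. The names beta_list, beta_matrix and z, and the probability values, are made up for illustration.

```python
import numpy as np

# Hypothetical per-topic word distributions (illustrative values, not model output).
beta_list = [
    np.array([0.7, 0.1, 0.1, 0.1]),  # topic 0
    np.array([0.1, 0.1, 0.1, 0.7]),  # topic 1
]

# Stand-in for the topic assignments of three words.
# In the PyMC3 model this would be a symbolic TensorVariable, not concrete numbers.
z = np.array([0, 1, 1])

# A Python list only accepts integers (or slices) as indices,
# so indexing it with an array raises TypeError.
try:
    beta_list[z]
    list_indexing_failed = False
except TypeError:
    list_indexing_failed = True

# Stacking the rows into one array makes array-valued indexing legal:
# each entry of z selects the matching row.
beta_matrix = np.stack(beta_list)  # shape (2, 4)
word_probs = beta_matrix[z]        # shape (3, 4), one distribution per word

print(list_indexing_failed, word_probs.shape)
```

If a list of separate variables is preferred in the model itself, stacking it into a single tensor first (for example with theano.tensor.stack) should have the same effect, although that variant was not tested here.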