How to implement Latent Dirichlet Allocation in PyMC3?

Question

I am trying to implement lda using PyMC3.

However, when defining the last part of the model in which words are sampled based on their topics, I keep getting the error: TypeError: list indices must be integers, not TensorVariable

How can I solve this problem?

The code is as follows:

## Data Preparation

K = 2 # number of topics
N = 4 # number of words
D = 3 # number of documents

import numpy as np

data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])
Wd = [len(doc) for doc in data]  # length of each document

## Model Specification

from pymc3 import Model, HalfNormal, Dirichlet, Categorical

lda_model = Model()

with lda_model:

    # Priors for unknown model parameters
    alpha = HalfNormal('alpha', sd=1)
    eta = HalfNormal('eta', sd=1)

    a1 = eta*np.ones(shape=N)
    a2 = alpha*np.ones(shape=K)

    beta = [Dirichlet('beta_%i' % i, a1, shape=N) for i in range(K)]
    theta = [Dirichlet('theta_%s' % i, a2, shape=K) for i in range(D)]

    z = [Categorical('z_%i' % d, p = theta[d], shape=Wd[d]) for d in range(D)]

    # This is where the error occurs. It is caused by: beta[z[d][w]]
    w = [Categorical('w_%i_%i' % (d, w), p = beta[z[d][w]], observed = data[d, w]) for d in range(D) for w in range(Wd[d])]

Any help would be much appreciated!
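
As a side note on the error message, it can be reproduced without PyMC3. The sketch below is only an illustration: `Symbolic` and `SymbolicTable` are hypothetical stand-ins (not PyMC3 or Theano classes) for a variable whose value is unknown while the model is being built and for a single tensor that accepts such a variable as an index. A plain Python list rejects the symbolic index, while the tensor-like object returns a new symbolic expression.

```python
class Symbolic:
    """Hypothetical stand-in for a variable with no concrete value at build time."""

    def __init__(self, name):
        self.name = name


class SymbolicTable:
    """Hypothetical stand-in for one tensor that supports symbolic indexing."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, index):
        # Build a new symbolic expression instead of looking up a concrete element
        label = index.name if isinstance(index, Symbolic) else repr(index)
        return Symbolic("{}[{}]".format(self.name, label))


z = Symbolic("z")

# A plain Python list, like the list of separate Dirichlet variables in the question
beta_list = ["beta_0", "beta_1"]
try:
    beta_list[z]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# A single tensor-like object can be indexed by the symbolic variable
beta_tensor = SymbolicTable("beta")
expr = beta_tensor[z]

print(error_message)
print(expr.name)
```

This suggests keeping all topic-word distributions in one tensor-valued variable rather than in a Python list, so that indexing by a sampled topic stays symbolic.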

Recommended answer

The following code was adapted from what has been referenced by @Hanan. I've somehow made it work with pymc3.

import numpy as np
import pymc3 as pm

def get_word_dict(collection):
    # Map each distinct word to an integer index (sorted so the mapping is reproducible)
    vocab_list = sorted({word for doc in collection for word in doc})
    return {word: idx for idx, word in enumerate(vocab_list)}

def word_to_idx(dict_vocab_idx, collection):
    return [[dict_vocab_idx[word] for word in doc] for doc in collection]

docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],
         ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],
         ["sepak","bola","sepak","bola","sepak","sepak"],
         ["ekonomi","ekonomi","uang","uang"],
         ["sepak","uang","ekonomi"],
         ["komputer","komputer","teknologi","teknologi","komputer","teknologi"],
         ["teknologi","komputer","teknologi"]]

dict_vocab_idx = get_word_dict(docs)
idxed_collection = word_to_idx(dict_vocab_idx, docs)

n_topics = 3
n_vocab = len(dict_vocab_idx)
n_docs = len(idxed_collection)
length_docs = [len(doc) for doc in idxed_collection]

alpha = np.ones([n_docs, n_topics])
beta = np.ones([n_topics, n_vocab])

with pm.Model() as model:
    # Per-document topic proportions and per-topic word distributions
    theta = pm.Dirichlet('theta', a=alpha, shape=(n_docs, n_topics))
    phi = pm.Dirichlet('phi', a=beta, shape=(n_topics, n_vocab))
    # Topic assignment for every word position in every document
    zs = [pm.Categorical("z_d{}".format(d), p=theta[d], shape=length_docs[d]) for d in range(n_docs)]
    # phi is a single tensor (not a Python list), so it can be indexed by the symbolic zs[d][i]
    ws = [pm.Categorical("w_{}_{}".format(d, i), p=phi[zs[d][i]], observed=idxed_collection[d][i])
          for d in range(n_docs) for i in range(length_docs[d])]
    trace = pm.sample(2000)

for d in range(n_docs):
    value_z=trace.get_values("z_d{}".format(d))
    print(value_z[1999])
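
The loop above prints a single draw of the topic assignments, which can be noisy. One possible way to summarize many draws is to take the most frequent topic at each word position. The sketch below uses only the standard library; `most_frequent_topic_per_word` is a hypothetical helper, and `example_draws` is made-up data standing in for sampled values, not output from the model above.

```python
from collections import Counter


def most_frequent_topic_per_word(draws):
    """Return the most frequent topic index at each word position across draws."""
    n_words = len(draws[0])
    modes = []
    for position in range(n_words):
        counts = Counter(draw[position] for draw in draws)
        # most_common(1) returns [(topic, count)] for the most frequent topic
        modes.append(counts.most_common(1)[0][0])
    return modes


# Made-up draws for one 3-word document (rows: draws, columns: word positions)
example_draws = [
    [0, 2, 2],
    [0, 2, 1],
    [1, 2, 2],
    [0, 2, 2],
]

summary = most_frequent_topic_per_word(example_draws)
print(summary)
```

Assuming `trace.get_values("z_d0")` yields one row per draw, those rows could be passed to the helper in the same way. Note that topic labels may be permuted between chains, which this simple summary ignores.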
