Problem description
I am testing the perplexity measure for a language model on a text:
import nltk
train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)
train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace
from nltk.lm import Vocabulary
vocab = Vocabulary(nltk.tokenize.word_tokenize(train_text), 1)
n = 2
print(train_tokenized_text)
print(len(train_tokenized_text))
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
# print(list(vocab), "\n >>>>", list(padded_vocab))
model = MLE(n)  # train a bigram (n=2) maximum likelihood estimation model
# model.fit(train_data, padded_vocab)
model.fit(train_data, vocab)
sentences = test_sentences
print("len: ", len(sentences))
print("per all", model.perplexity(test_text))
When I use vocab in model.fit(train_data, vocab), the perplexity in print("per all", model.perplexity(test_text)) is a number (30.2), but if I use padded_vocab, which has the additional <s> and </s> tokens, it prints inf.
Recommended answer
The input to perplexity is text as ngrams, not a list of strings. You can verify this by running:
for x in test_text:  # iterating a raw string yields single characters, not word ngrams
    print([((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in x])
You should see that the tokens (ngrams) are all wrong.
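For contrast, a minimal sketch of what correctly prepared input looks like for the question's bigram setup (assuming the model and test_tokenized_text from the question): padded_everygram_pipeline turns each padded test sentence into ngram tuples, so score() receives word tokens and contexts rather than single characters.
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 2
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for sent_ngrams in test_data:
    # each ngram is a tuple of word tokens, e.g. ('<s>',), ('<s>', w1), (w1,), (w1, w2), ...
    print([((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1]))
           for ngram in sent_ngrams])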
You will still get inf in the perplexity if words in your test data are out of the vocabulary (of the train data):
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace
from nltk.lm import Vocabulary

# toy data; with real text use nltk.sent_tokenize(train_text) / nltk.sent_tokenize(test_text)
train_sentences = ['an apple', 'an orange']
test_sentences = ['an apple']

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]

n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
# fit on the padded vocab so the model knows the tokens added by padding (<s>, </s>, <UNK>, etc.)
model.fit(train_data, padded_vocab)

test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all", model.perplexity(test))

# out-of-vocab test data
test_sentences = ['an ant']
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all [oov]", model.perplexity(test))