This article looks at why perplexity is infinite (inf) for an nltk.lm bigram model when the padded vocabulary is used.

Problem description

I am testing the perplexity measure for a language model on a text:

  import nltk

  train_sentences = nltk.sent_tokenize(train_text)
  test_sentences = nltk.sent_tokenize(test_text)

  train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in train_sentences]

  test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

  from nltk.lm.preprocessing import padded_everygram_pipeline
  from nltk.lm import MLE,Laplace
  from nltk.lm import Vocabulary

  vocab = Vocabulary(nltk.tokenize.word_tokenize(train_text), 1)

  n = 2
  print(train_tokenized_text)
  print(len(train_tokenized_text))
  train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)

  # print(list(vocab),"\n >>>>",list(padded_vocab))
  model = MLE(n) # Let's train a bigram (n=2) maximum likelihood estimation model.
  # model.fit(train_data, padded_vocab)
  model.fit(train_data, vocab)

  sentences = test_sentences
  print("len: ",len(sentences))
  print("per all", model.perplexity(test_text)) 

When I use vocab in model.fit(train_data, vocab), the perplexity printed by print("per all", model.perplexity(test_text)) is a number (30.2), but if I use padded_vocab, which has the additional <s> and </s> tokens, it prints inf.

Recommended answer

The input to perplexity should be text as ngrams, not a list of strings. You can verify this by running:

for x in test_text:
    print([((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in x])

You should see that the tokens (ngrams) are all wrong.
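
As a minimal sketch of the correct usage (my own illustration, not part of the original answer), assuming the bigram model from the question was fit with padded_vocab: the test sentences also need to be padded and converted to bigrams before calling perplexity.

from nltk.lm.preprocessing import pad_both_ends
from nltk.util import bigrams

# pad each tokenized test sentence with <s>/</s> and score its bigrams
for sent in test_tokenized_text:
    padded = list(pad_both_ends(sent, n=2))
    print("per sentence", model.perplexity(list(bigrams(padded))))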

You will still get inf perplexity if the words in your test data are out of vocabulary (with respect to the training data):

import nltk

# toy training and test data
train_sentences = ['an apple', 'an orange']
test_sentences = ['an apple']

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in train_sentences]

test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE,Laplace
from nltk.lm import Vocabulary

n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
# fit on the padded vocab so the model knows the new tokens added to the vocab (<s>, </s>, <UNK>, etc.)
model.fit(train_data, padded_vocab) 

test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all", model.perplexity(test))

# out of vocab test data
test_sentences = ['an ant']
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all [oov]", model.perplexity(test))
