Problem description
I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).
import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort each group so the most strongly associated second words come first.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

prefix_keys['baseball']
This is slightly misleading, and the likelihood score does not directly model a frequency distribution of the bigram.
nltk.collocations.BigramAssocMeasures().raw_freq models raw frequency with t tests, which are not well suited to sparse data such as bigrams; hence the provision of the likelihood ratio.
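To make concrete what raw frequency measures, here is a toy sketch (not NLTK's implementation; the corpus and token counts are made up for illustration): raw frequency is simply count(w1, w2) / N, so it says nothing about whether w1 and w2 occur together more often than their individual frequencies would predict.

```python
from collections import Counter

# Toy corpus, chosen only for illustration.
tokens = ("the cat sat on the mat and the cat ate the rat "
          "new york new york is a city in new york state").split()

bigrams = list(zip(tokens, tokens[1:]))
n = len(bigrams)

# Raw frequency: count(w1 w2) / N. It ignores how frequent w1 and w2
# are on their own, which is why it over-rewards bigrams of common words.
raw_freq = {bg: c / n for bg, c in Counter(bigrams).items()}

print(raw_freq[('the', 'cat')])   # 2 / 22
print(raw_freq[('new', 'york')])  # 3 / 22
```

Under this measure, "the cat" scores close to "new york" simply because "the" is so common; a likelihood ratio penalizes that by accounting for the marginal counts of each word.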
The likelihood ratio as calculated by Manning and Schütze is not equivalent to frequency.
https://nlp.stanford.edu/fsnlp/promo/colloc.pdf
Section 5.3.4 describes in detail how the calculation is done.
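For reference, the log-likelihood ratio from that section can be sketched in plain Python (a minimal sketch of the formula under the binomial model, not NLTK's implementation; the names c1, c2, c12, n follow Manning and Schütze's notation for the counts of word 1, word 2, the bigram, and the corpus size):

```python
import math

def log_l(k, n, x):
    # Binomial log-likelihood log L(k; n, x) = k*log(x) + (n-k)*log(1-x),
    # with the convention 0*log(0) = 0 for the degenerate boundary cases.
    if x == 0:
        return 0.0 if k == 0 else float("-inf")
    if x == 1:
        return 0.0 if k == n else float("-inf")
    return k * math.log(x) + (n - k) * math.log(1 - x)

def likelihood_ratio(c1, c2, c12, n):
    """-2 log lambda for the bigram (w1, w2), comparing the independence
    hypothesis (p1 == p2 == p) against the dependence hypothesis."""
    p = c2 / n            # P(w2) regardless of what precedes it
    p1 = c12 / c1         # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)  # P(w2 | not w1)
    return -2 * (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
                 - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))
```

When the bigram count matches what independence predicts (c12 = c1*c2/n), the score is 0; the more the observed count exceeds that expectation, the larger the score.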
They take into account the frequency of word one, the frequency of word two, and the frequency of the bigram in the document, in a manner that is well suited to sparse matrices like corpus matrices.
If you are familiar with the TF-IDF vectorization method, this ratio aims for something similar as far as normalizing noisy features.
The score can be infinitely large. The relative difference between scores reflects the inputs just described (the corpus frequencies of word 1, word 2, and the bigram word1 word2).
This chart is the most intuitive piece of their explanation, unless you're a statistician:
The likelihood score is the value calculated in the leftmost column.