Problem description
My goal is to input an array of phrases as in
array = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.","At vero eos et accusam et justo duo dolores et ea rebum.","Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]
and to present a new phrase to it, like
"Felix qui potuit rerum cognoscere causas"
and I want it to tell me whether this is likely part of the group in the aforementioned array or not.
I found how to detect frequencies of words, but how do I find dissimilarity? After all, my goal is to find unusual phrases, not the frequency of certain words.
Recommended answer
You can build a simple "language model" for this purpose. It will estimate the probability of a phrase and mark phrases with a low average per-word probability as unusual.
For word probability estimation, it can use a smoothed word count.
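Concretely, "smoothing" here means add-delta (Lidstone) smoothing: every count is inflated by a small constant delta, so unseen words still receive a non-zero probability. A minimal sketch of the arithmetic, with made-up counts chosen only for illustration:

    delta = 0.01
    word_count = 3            # times a word was seen in the training corpus (illustrative)
    total_count = 40          # total number of word tokens in the corpus (illustrative)
    vocabulary_size = 25      # number of distinct words in the corpus (illustrative)

    # A seen word keeps roughly its raw relative frequency.
    seen_probability = (word_count + delta) / (total_count + vocabulary_size * delta)    # ~0.0748

    # An unseen word has count 0, but delta keeps its probability strictly positive.
    unseen_probability = (0 + delta) / (total_count + vocabulary_size * delta)           # ~0.00025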
Here is what the model looks like:
import re
import numpy as np
from collections import Counter


class LanguageModel:
    """ A simple model to measure 'unusualness' of sentences.
    delta is a smoothing parameter.
    The larger delta is, the higher is the penalty for unseen words.
    """
    def __init__(self, delta=0.01):
        self.delta = delta

    def preprocess(self, sentence):
        words = sentence.lower().split()
        return [re.sub(r"[^A-Za-z]+", '', word) for word in words]

    def fit(self, corpus):
        """ Estimate counts from an array of texts """
        self.counter_ = Counter(word
                                for sentence in corpus
                                for word in self.preprocess(sentence))
        self.total_count_ = sum(self.counter_.values())
        self.vocabulary_size_ = len(self.counter_)

    def perplexity(self, sentence):
        """ Calculate negative mean log probability of a word in a sentence.
        The higher this number, the more unusual the sentence is.
        """
        words = self.preprocess(sentence)
        mean_log_proba = 0.0
        for word in words:
            # use a smoothed version of "probability" to work with unseen words
            word_count = self.counter_.get(word, 0) + self.delta
            total_count = self.total_count_ + self.vocabulary_size_ * self.delta
            word_probability = word_count / total_count
            mean_log_proba += np.log(word_probability) / len(words)
        return -mean_log_proba

    def relative_perplexity(self, sentence):
        """ Perplexity, normalized between 0 (the most usual sentence) and 1 (the most unusual) """
        return (self.perplexity(sentence) - self.min_perplexity) / (self.max_perplexity - self.min_perplexity)

    @property
    def max_perplexity(self):
        """ Perplexity of an unseen word """
        return -np.log(self.delta / (self.total_count_ + self.vocabulary_size_ * self.delta))

    @property
    def min_perplexity(self):
        """ Perplexity of the most likely word """
        return self.perplexity(self.counter_.most_common(1)[0][0])
You can train this model and apply it to different sentences.
train = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
"At vero eos et accusam et justo duo dolores et ea rebum.",
"Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]
test = ["Felix qui potuit rerum cognoscere causas", # an "unlikely" phrase
'sed diam nonumy eirmod sanctus sit amet', # a "likely" phrase
]
lm = LanguageModel()
lm.fit(train)
for sent in test:
print(lm.perplexity(sent).round(3), sent)
which prints for you
8.525 Felix qui potuit rerum cognoscere causas
3.517 sed diam nonumy eirmod sanctus sit amet
You can see that "unusualness" is higher for the first phrase than for the second, because the second one is made from the training words.
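To turn these scores into the yes/no answer the question asks for, one option is to threshold relative_perplexity, which the class above normalizes to roughly the 0–1 range. A small sketch; the 0.5 cutoff is an arbitrary assumption you would tune on your own phrases:

    THRESHOLD = 0.5  # hypothetical cutoff, not derived from the data -- tune it on held-out phrases

    for sent in test:
        score = lm.relative_perplexity(sent)
        verdict = "unusual" if score > THRESHOLD else "likely part of the group"
        print(round(score, 3), verdict, '-', sent)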
If your corpus of "usual" phrases is large enough, you can switch from the 1-gram model I use here to N-grams (for English, a sensible N is 2 or 3). Alternatively, you can use recurrent neural nets to predict the probability of each word conditioned on all the previous words. But this requires a really huge training corpus.
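As a rough illustration of the N-gram direction (not part of the original answer), here is a minimal add-delta bigram model built on top of the lm object fitted above; the '<s>' padding token and the helper names are my own choices:

    import numpy as np
    from collections import Counter

    def bigrams(words):
        """ Yield (previous_word, word) pairs, padding the sentence start with '<s>'. """
        return zip(['<s>'] + words[:-1], words)

    # Count bigrams and their left contexts over the preprocessed training sentences.
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in train:
        words = lm.preprocess(sentence)
        for prev, word in bigrams(words):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1

    def bigram_perplexity(sentence, delta=0.01):
        """ Negative mean log probability of a sentence under the add-delta bigram model. """
        words = lm.preprocess(sentence)
        vocabulary_size = len(lm.counter_)
        mean_log_proba = 0.0
        for prev, word in bigrams(words):
            word_probability = ((bigram_counts[(prev, word)] + delta)
                                / (context_counts[prev] + vocabulary_size * delta))
            mean_log_proba += np.log(word_probability) / len(words)
        return -mean_log_proba

    for sent in test:
        print(round(bigram_perplexity(sent), 3), sent)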
If you work with a highly inflected language, like Turkish, you can use character-level N-grams instead of a word-level model, or just preprocess your texts with a lemmatization algorithm from NLTK.
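For the character-level option, only the preprocessing needs to change: treat a sentence as a sequence of overlapping character n-grams instead of words. A sketch under my own assumptions (the subclass name and n=3 are arbitrary); the lemmatization alternative would similarly just swap the preprocess method, e.g. using nltk.stem.WordNetLemmatizer.

    import re

    class CharLanguageModel(LanguageModel):
        """ The same smoothed-count model, but the 'words' are character n-grams. """
        def __init__(self, delta=0.01, n=3):
            super().__init__(delta=delta)
            self.n = n

        def preprocess(self, sentence):
            # Keep only letters and spaces, then slide a window of size n over the text.
            text = re.sub(r"[^a-z ]+", '', sentence.lower())
            return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

    char_lm = CharLanguageModel()
    char_lm.fit(train)
    for sent in test:
        print(char_lm.perplexity(sent).round(3), sent)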