


Spell-checking models already exist that suggest correct spellings based on a corpus of known-correct spellings. Can the granularity be raised from the letter to the word, so that we get phrase suggestions? That is, if an incorrect phrase is entered, the system should suggest the nearest correct phrase from a corpus of valid phrases on which it was trained.

Are there any Python libraries that already provide this functionality? If not, how should I proceed with an existing large gold-standard phrase corpus to get statistically relevant suggestions?


Note: this differs from a spell checker in that a spell checker's alphabet is finite, whereas in a phrase corrector each "letter" is itself a word, so the alphabet is theoretically infinite. In practice, we can limit it to the words that occur in a phrase bank.


What you want to build is an N-gram model, which consists in computing the probability that each word follows a given sequence of preceding words (n-1 of them for an n-gram).

You can use the NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).

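You can consider a 2-gram (first-order Markov) model, in which each word depends only on the word before it, or a 3-gram model, in which each word depends on the two words before it.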
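In the usual notation, the two approximations and their count-based estimates can be written as below. The symbols are assumptions introduced here for illustration: $w_i$ is the i-th word of a phrase of length $m$, positions before the first word stand for a start-of-phrase marker, and $C(\cdot)$ is a count over the training corpus.

```latex
% 2-gram: each word depends only on the previous word
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}

% 3-gram: each word depends on the previous two words
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1}),
\qquad
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
```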


Obviously, training an (n+1)-gram model is costlier than training an n-gram model.
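To get a rough feel for that cost, the sketch below counts the distinct n-grams in a toy corpus and compares them with the number of n-grams that are possible for the same vocabulary. The corpus and the helper name ngrams are made up for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy corpus, made up for illustration
tokens = "the cat is cute and the cat is happy and the dog is happy".split()
vocab_size = len(set(tokens))

# Distinct n-grams actually observed in the corpus
distinct = {n: len(set(ngrams(tokens, n))) for n in (1, 2, 3)}
# Upper bound on how many n-grams the table may have to hold
possible = {n: vocab_size ** n for n in (1, 2, 3)}

print(distinct)  # {1: 7, 2: 9, 3: 11}
print(possible)  # {1: 7, 2: 49, 3: 343}
```

The table to store grows with n, and so does the amount of text needed before most of the possible n-grams have been seen even once.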

Instead of considering words alone, you can consider (word, pos) pairs, where pos is the part-of-speech tag (you can get the tags with nltk.pos_tag(tokens)).
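A minimal sketch of that idea, assuming the sentences are already available as lists of (word, tag) tuples, which is the shape nltk.pos_tag returns. The tags below are written by hand for illustration and are not tagger output.

```python
from collections import Counter, defaultdict

# Hand-tagged toy sentences (the tags are illustrative, not tagger output)
tagged_sentences = [
    [("i", "PRP"), ("book", "VBP"), ("a", "DT"), ("flight", "NN")],
    [("i", "PRP"), ("read", "VBP"), ("a", "DT"), ("book", "NN")],
]

# pair_bigrams[(word, pos)][(next_word, next_pos)] = count
pair_bigrams = defaultdict(Counter)
for sentence in tagged_sentences:
    for current, following in zip(sentence, sentence[1:]):
        pair_bigrams[current][following] += 1

# "book" as a verb and "book" as a noun are now separate entries
print(dict(pair_bigrams[("book", "VBP")]))  # {('a', 'DT'): 1}
```

The model can then tell the two uses of the same word apart, at the price of a larger set of distinct tokens.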


You can also try to consider the lemmas instead of the words.

Here are some interesting lectures on N-gram modelling:

  1. Introduction to N-grams
  2. Estimating N-gram Probabilities


Here is a short, simple, unoptimized code example (2-gram):

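from collections import defaultdict
import math

import nltk  # the tokenizers need NLTK's Punkt data (see nltk.download)

# ngram[token][next_token] = number of times next_token follows token
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    # Build a real list: under Python 3, map() is lazy and cannot be sliced
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1

# Turn the raw counts into log10 conditional probabilities
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}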
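As a rough sketch of how such a model could be turned into phrase suggestions, the example below trains bigram counts on a small phrase bank and then picks the most probable phrase among the input and its one-word substitutions. Everything in it is an assumption made for illustration: the phrase bank, the "<s>" and "</s>" boundary markers, the add-one smoothing, and the helper names log_prob and suggest. It uses only the standard library and whitespace tokenization rather than NLTK.

```python
from collections import Counter, defaultdict
import math

# Hypothetical phrase bank standing in for the gold-standard corpus
phrases = ["the cat is cute", "the cat is happy",
           "the dog is happy", "he is happy"]

START, END = "<s>", "</s>"  # assumed phrase-boundary markers
bigrams = defaultdict(Counter)
vocab = set()
for phrase in phrases:
    tokens = [START] + phrase.lower().split() + [END]
    vocab.update(tokens)
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev][nxt] += 1

def log_prob(tokens):
    """Add-one smoothed bigram log10 probability of a list of words."""
    padded = [START] + tokens + [END]
    score = 0.0
    for prev, nxt in zip(padded, padded[1:]):
        following = bigrams.get(prev, Counter())
        total = sum(following.values())
        score += math.log10((following[nxt] + 1) / (total + len(vocab)))
    return score

def suggest(phrase):
    """Return the best-scoring phrase among the input and all of its
    one-word substitutions drawn from the phrase-bank vocabulary."""
    tokens = phrase.lower().split()
    words = sorted(vocab - {START, END})
    candidates = [tokens]
    for i in range(len(tokens)):
        for word in words:
            candidates.append(tokens[:i] + [word] + tokens[i + 1:])
    return " ".join(max(candidates, key=log_prob))

print(suggest("the cat are happy"))  # the cat is happy
```

Because this ranks candidates purely by model probability, it can also replace a valid but rarer phrase with a more frequent one. A fuller system would weigh the probability against how far the candidate is from the input, and would handle insertions and deletions as well as substitutions.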

