python - Creating bigrams with NLTK from a multi-line corpus

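I am trying to generate bigrams from a multi-line corpus. Bigrams are being created across line breaks, which is a problem because each line represents its own context and is unrelated to the next line. This results in semantically incorrect bigrams.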

Corpus

Reeves Acrylfarbe 75Ml Ultramarin
Acrylfarbe Deep Peach
Reeves Acrylfarbe 75Ml Grasgrün
Acrylfarbe Antique Go

Examples of problematic bigrams

'Ultramarin Acrylfarbe', 'Grasgrün Acrylfarbe'

This is the code I am using:

finder = BigramCollocationFinder.from_words(word_tokenize(corpus))
bigrams = finder.nbest(bigram_measures.likelihood_ratio, 100)
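
To see where the unwanted pairs come from, here is a minimal sketch that uses only the standard library. It assumes that plain whitespace splitting is a close enough stand-in for word_tokenize on this corpus; the names tokens and flat_bigrams are illustrative only. Once the whole corpus is tokenized as one stream, the line breaks are gone, so the last word of one line gets paired with the first word of the next.

```python
from collections import Counter

corpus = (
    "Reeves Acrylfarbe 75Ml Ultramarin\n"
    "Acrylfarbe Deep Peach\n"
    "Reeves Acrylfarbe 75Ml Grasgrün\n"
    "Acrylfarbe Antique Go"
)

# Assumption: whitespace splitting stands in for word_tokenize here.
# Newlines are treated like any other whitespace, so line boundaries are lost.
tokens = corpus.split()

# Pair every token with its successor across the whole token stream.
flat_bigrams = Counter(zip(tokens, tokens[1:]))

# Both of the unwanted cross-line pairs are present.
print(("Ultramarin", "Acrylfarbe") in flat_bigrams)
print(("Grasgrün", "Acrylfarbe") in flat_bigrams)
```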

How can I omit bigrams that span two lines?

Best answer

I believe something like this should work:

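import nltk
bigram_measures = nltk.collocations.BigramAssocMeasures()
# Tokenize each line on its own so that no bigram spans a line break.
finder = nltk.BigramCollocationFinder.from_documents([
    nltk.word_tokenize(line) for line in corpus.splitlines()])
bigrams = finder.nbest(bigram_measures.likelihood_ratio, 100)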
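
The idea behind treating every line as its own document can also be sketched without NLTK. The following is not the library's implementation, only an illustration under the assumption that whitespace splitting is an acceptable tokenizer; per_line_bigrams is a hypothetical helper name. Bigrams are built inside each line, so the last token of a line is never paired with the first token of the next one.

```python
from collections import Counter

def per_line_bigrams(text):
    """Count bigrams within each line; pairs never cross a line break."""
    counts = Counter()
    for line in text.splitlines():
        # Assumption: whitespace splitting stands in for word_tokenize.
        tokens = line.split()
        # zip stops at the end of the line, so nothing leaks into the next one.
        counts.update(zip(tokens, tokens[1:]))
    return counts

corpus = (
    "Reeves Acrylfarbe 75Ml Ultramarin\n"
    "Acrylfarbe Deep Peach\n"
    "Reeves Acrylfarbe 75Ml Grasgrün\n"
    "Acrylfarbe Antique Go"
)

bigrams = per_line_bigrams(corpus)
for pair, count in bigrams.most_common():
    print(pair, count)
```

This only produces raw counts. Ranking by likelihood ratio, as in the answer above, still needs the finder and the association measures from NLTK.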

This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/38906263/