Python - Comparing n-grams across multiple text files

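First-time poster here. I'm a new Python user with limited programming skills. Ultimately, I'm trying to identify and compare n-grams across multiple text documents found in the same directory. My analysis is somewhat similar to plagiarism detection: I want to calculate the percentage of text documents in which a particular n-gram can be found. For now, I'm trying a simpler version of the larger problem, comparing n-grams across two text documents. I have no trouble identifying the n-grams, but I'm struggling to compare the two documents. Is there a way to store the n-grams in a list so that I can efficiently check which n-grams appear in both documents? Here is what I've done so far (please excuse the code). For reference, I've provided some basic sentences below rather than the actual text documents I read in my code.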

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.split(), n)
trigrams2 = ngrams(text2.split(), n)

print(trigrams1)
for grams in trigrams1:
    print(grams)

def compare(trigrams1, trigrams2):
    for grams1 in trigrams1:
        if each_gram in trigrams2:
            print (each_gram)
    return False

Thanks, everyone, for your help!

Best answer

Use a list, say common, inside the compare function. Append each n-gram that appears in both sets of trigrams to this list, and return the list at the end:

>>> # lower() ignores sentence case; list() keeps the result reusable, since newer NLTK versions return an iterator
>>> trigrams1 = list(ngrams(text1.lower().split(), n))
>>> trigrams2 = list(ngrams(text2.lower().split(), n))
>>> trigrams1
[('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'jason')]
>>> trigrams2
[('my', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'mike')]
>>> def compare(trigrams1, trigrams2):
...    common=[]
...    for grams1 in trigrams1:
...       if grams1 in trigrams2:
...         common.append(grams1)
...    return common
...
>>> compare(trigrams1, trigrams2)
[('my', 'name', 'is')]
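
For the larger goal in the question (the share of documents that contain a given n-gram), a set-based variant may scale better than repeated list membership tests. The following is only a minimal sketch, not part of the original answer: the helper names word_ngrams and document_share are hypothetical, and the n-grams are built with plain zip rather than NLTK so that the snippet is self-contained.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams found in text (hypothetical helper)."""
    words = text.lower().split()
    # zip over n shifted copies of the word list yields consecutive n-word tuples
    return set(zip(*(words[i:] for i in range(n))))


def document_share(texts, n):
    """Map each n-gram to the fraction of texts that contain it (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # word_ngrams returns a set, so each text counts an n-gram at most once
        counts.update(word_ngrams(text, n))
    return {gram: count / len(texts) for gram, count in counts.items()}


texts = ['Hello my name is Jason', 'My name is not Mike']
grams1, grams2 = (word_ngrams(t, 3) for t in texts)

common = grams1 & grams2          # n-grams present in both documents
share = document_share(texts, 3)  # fraction of documents containing each n-gram

print(common)
print(share)
```

To run this over a directory, the texts list could be filled by reading each file, for example with pathlib.Path(...).glob('*.txt') and read_text(); the directory layout and file extension would be assumptions about your data.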

Regarding "Python - Comparing n-grams across multiple text files", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27412881/