Problem Description

First, apologies.

I'm not a mathematician, so I'm hoping there's a "dumbed down" solution to this. In short, I'm attempting to compare two bodies of text to generate recommendations. What you'll see below is a novice attempt at measuring similarity using NLP. I'm open to all feedback. But my primary question: does the method described below serve as an accurate means of finding similarities (in wording, sentiment, etc) in two bodies of text? If not, how would you generate such a recommendation engine (new methods, new data, etc)?

I currently have two dictionaries – one with personality data called personality_feature_dict that includes the personality type and associated descriptor words: {'Type 1': ['able', 'accepting', 'according', 'accountable'...]} and the other called book_feature_dict containing book titles and their own descriptor words, which were extracted using TF-IDF: {'Book Title': ['actually', 'administration', 'age', 'allow', 'anti'...]}

As it stands, I'm using the following code to calculate the percentage similarity between the dictionary values from each. First, I create a larger corpus using all dictionary items.

from gensim.corpora import Dictionary

book_values = list(book_feature_dict.values())
personality_values = list(personality_feature_dict.values())

# Combine both sets of descriptor lists into one training corpus
texts = book_values + personality_values

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

import numpy as np
np.random.seed(1)  # for reproducibility; LdaModel also accepts a random_state argument

Then I create an LDA model to identify similarities. My knowledge of LDA modeling is limited, so if you spot an error here, I appreciate you flagging it!

from gensim.models import ldamodel
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=4, minimum_probability=1e-8)

Finally, I iterate through the sets of values as bags of words and compare how the first personality type, i.e. list(personality_feature_dict.values())[personality_num], compares to the 99 book descriptions/values by finding the Hellinger distance.

from gensim.matutils import hellinger

personality_num = 0

# The personality's topic distribution is the same on every pass, so compute it once
e_0 = list(personality_feature_dict.values())[personality_num]
e_0_bow = model.id2word.doc2bow(e_0)
e_0_lda_bow = model[e_0_bow]

for i, s_0 in enumerate(book_feature_dict.values()):
    s_0_bow = model.id2word.doc2bow(s_0)
    s_0_lda_bow = model[s_0_bow]

    # Hellinger distance lies in [0, 1], so this maps it to a % similarity
    x = 100 - (hellinger(e_0_lda_bow, s_0_lda_bow) * 100)

Finally, I print all instances where the LDA model comes back with a high correlation as a percentage.

    if x > 50:
        print(list(personality_feature_dict.keys())[personality_num])
        print('similarity to ', list(book_feature_dict.keys())[i], 'is')
        print(x, '%', '\n\n')

The results look like this:

Personality Type
similarity to  Name of Book 1 is
84.6029228744518 %


Personality Type
similarity to  Name of Book 2 is
83.09513184950528 %


Personality Type
similarity to  Name of Book 3 is
85.44322295890642 %

...

Answer

Your question is very, very broad. As such, it does not necessarily even fit StackOverflow.

To me it seems that you are attempting to filter books using a specific set of vocabulary. For that you do not need to get into LDA modelling. A simple cosine similarity between binary word vectors, or a distance between embeddings, would do (e.g. using FastText, Word2Vec, or GloVe embeddings).
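A minimal sketch of the binary word-vector idea, assuming the same kind of descriptor lists as in the question (the function name and sample words here are illustrative, not from the original code): each word list becomes a 0/1 vector over the shared vocabulary, and the similarity is the cosine between the two vectors.

```python
import math

def binary_cosine(words_a, words_b):
    """Cosine similarity between two texts represented as binary (0/1) word vectors."""
    set_a, set_b = set(words_a), set(words_b)
    # Dot product of two binary vectors = size of the word overlap
    dot = len(set_a & set_b)
    norm = math.sqrt(len(set_a)) * math.sqrt(len(set_b))
    return dot / norm if norm else 0.0

# Two descriptor lists sharing 2 of 3 words each -> 2 / (sqrt(3) * sqrt(3)) = 2/3
print(binary_cosine(['able', 'accepting', 'age'], ['age', 'allow', 'able']))
```

Here the overlap is 2 words out of 3 in each list, giving 2/3 ≈ 0.667 — and unlike the LDA pipeline, the score is directly interpretable as vocabulary overlap.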

The questionable part about the way you trained the LDA model is that you are uncovering the latent topics across your corpus of books. The words for personality traits can be arbitrarily distributed across all of the topics and are unlikely to be strong clues about which topic a given book belongs to. Therefore, the similarity you are measuring in the 4-dimensional latent topic space is not a good indication for alignment with particular personality-related words (and themes).

I would recommend using embeddings and some way to aggregate them across larger volumes of text (e.g. doc2vec from gensim).
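As a sketch of what that aggregation can look like, here is mean pooling over per-word vectors followed by cosine similarity. The tiny 3-dimensional vectors below are made up for illustration; in practice you would substitute real pretrained FastText/Word2Vec/GloVe vectors, or train gensim's Doc2Vec on your corpus instead.

```python
import math

toy_embeddings = {  # hypothetical pretrained word vectors, for illustration only
    'able':      [0.9, 0.1, 0.0],
    'accepting': [0.8, 0.2, 0.1],
    'age':       [0.1, 0.9, 0.3],
    'allow':     [0.7, 0.3, 0.2],
}

def doc_vector(words, embeddings):
    """Aggregate a word list into one document vector by mean pooling."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

book = doc_vector(['able', 'allow'], toy_embeddings)
personality = doc_vector(['accepting', 'age'], toy_embeddings)
print(round(cosine(book, personality), 3))
```

Unlike binary vectors, this rewards related-but-different words whenever their embeddings point in similar directions, which is closer to the "similarity in wording, sentiment, etc." the question asks about.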
