Problem Description

First, apologies.

I'm not a mathematician, so I'm hoping there's a "dumbed down" solution to this. In short, I'm attempting to compare two bodies of text to generate recommendations. What you'll see below is a novice attempt at measuring similarity using NLP. I'm open to all feedback. But my primary question: does the method described below serve as an accurate means of finding similarities (in wording, sentiment, etc) in two bodies of text? If not, how would you generate such a recommendation engine (new methods, new data, etc)?

I currently have two dictionaries – one with personality data called personality_feature_dict that includes the personality type and associated descriptor words: {'Type 1': ['able', 'accepting', 'according', 'accountable'...]} and the other called book_feature_dict containing book titles and their own descriptor words, which were extracted using TF-IDF: {'Book Title': ['actually', 'administration', 'age', 'allow', 'anti'...]}

As it stands, I'm using the following code to calculate the percentage similarity between the dictionary values from each. First, I create a larger corpus using all dictionary items.

from gensim.corpora import Dictionary

book_values = list(book_feature_dict.values())
personality_values = list(personality_feature_dict.values())

# Combine both sets of descriptor lists into one training corpus
texts = book_values + personality_values

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

import numpy as np
np.random.seed(1)  # for reproducibility; LdaModel also accepts a random_state argument

Then I create an LDA model to identify similarities. My knowledge of LDA modeling is limited, so if you spot an error here, I appreciate you flagging it!

from gensim.models import ldamodel
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=4, minimum_probability=1e-8)

Finally, I iterate through the sets of values as bags of words and compare how the first personality type, i.e. list(personality_feature_dict.values())[personality_num], compares to the 99 book descriptions/values by finding the Hellinger distance.

from gensim.matutils import hellinger

personality_num = 0

# The personality's topic distribution is the same on every pass, so compute it once
e_0 = list(personality_feature_dict.values())[personality_num]
e_0_bow = model.id2word.doc2bow(e_0)
e_0_lda_bow = model[e_0_bow]

for i, s_0 in enumerate(book_feature_dict.values()):
    s_0_bow = model.id2word.doc2bow(s_0)
    s_0_lda_bow = model[s_0_bow]

    # Hellinger distance lies in [0, 1], so this maps it to a % similarity
    x = 100 - (hellinger(e_0_lda_bow, s_0_lda_bow) * 100)

Finally, I print all instances where the LDA model comes back with a high correlation as a percentage.

    if x > 50:
        print(list(personality_feature_dict.keys())[personality_num])
        print('similarity to ', list(book_feature_dict.keys())[i], 'is')
        print(x, '%', '\n\n')

The results look like this:

Personality Type
similarity to  Name of Book 1 is
84.6029228744518 %


Personality Type
similarity to  Name of Book 2 is
83.09513184950528 %


Personality Type
similarity to  Name of Book 3 is
85.44322295890642 %

...

Answer

Your question is very, very broad. As such, it does not necessarily even fit StackOverflow.

To me it seems that you are attempting to filter books using a specific set of vocabulary. For that you do not need to get into LDA modelling. A simple cosine similarity between binary word vectors, or a distance between embeddings, would do (e.g. using FastText, Word2Vec, or GloVe embeddings).
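A minimal sketch of the binary word-vector idea, assuming the same kind of descriptor lists as in the question (the function name and sample words here are illustrative, not from the original code): each word list becomes a 0/1 vector over the shared vocabulary, and the similarity is the cosine between the two vectors.

```python
import math

def binary_cosine(words_a, words_b):
    """Cosine similarity between two texts represented as binary (0/1) word vectors."""
    set_a, set_b = set(words_a), set(words_b)
    # Dot product of two binary vectors = size of the word overlap
    dot = len(set_a & set_b)
    norm = math.sqrt(len(set_a)) * math.sqrt(len(set_b))
    return dot / norm if norm else 0.0

# Two descriptor lists sharing 2 of 3 words each -> 2 / (sqrt(3) * sqrt(3)) = 2/3
print(binary_cosine(['able', 'accepting', 'age'], ['age', 'allow', 'able']))
```

Here the overlap is 2 words out of 3 in each list, giving 2/3 ≈ 0.667 — and unlike the LDA pipeline, the score is directly interpretable as vocabulary overlap.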

The questionable part about the way you trained the LDA model is that you are uncovering the latent topics across your corpus of books. The words for personality traits can be arbitrarily distributed across all of the topics and are unlikely to be strong clues about which topic a given book belongs to. Therefore, the similarity you are measuring in the 4-dimensional latent topic space is not a good indication for alignment with particular personality-related words (and themes).

I would recommend using embeddings and some way to aggregate them across larger volumes of text (e.g. doc2vec from gensim).
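As a sketch of what that aggregation can look like, here is mean pooling over per-word vectors followed by cosine similarity. The tiny 3-dimensional vectors below are made up for illustration; in practice you would substitute real pretrained FastText/Word2Vec/GloVe vectors, or train gensim's Doc2Vec on your corpus instead.

```python
import math

toy_embeddings = {  # hypothetical pretrained word vectors, for illustration only
    'able':      [0.9, 0.1, 0.0],
    'accepting': [0.8, 0.2, 0.1],
    'age':       [0.1, 0.9, 0.3],
    'allow':     [0.7, 0.3, 0.2],
}

def doc_vector(words, embeddings):
    """Aggregate a word list into one document vector by mean pooling."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

book = doc_vector(['able', 'allow'], toy_embeddings)
personality = doc_vector(['accepting', 'age'], toy_embeddings)
print(round(cosine(book, personality), 3))
```

Unlike binary vectors, this rewards related-but-different words whenever their embeddings point in similar directions, which is closer to the "similarity in wording, sentiment, etc." the question asks about.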
