



I'm not a mathematician, so I'm hoping there's a "dumbed down" solution to this. In short, I'm attempting to compare two bodies of text to generate recommendations. What you'll see below is a novice attempt at measuring similarity using NLP. I'm open to all feedback. But my primary question: does the method described below serve as an accurate means of finding similarities (in wording, sentiment, etc) in two bodies of text? If not, how would you generate such a recommendation engine (new methods, new data, etc)?

I currently have two dictionaries – one with personality data called personality_feature_dict that includes the personality type and associated descriptor words: {'Type 1': ['able', 'accepting', 'according', 'accountable'...]} and the other called book_feature_dict containing book titles and their own descriptor words, which were extracted using TF-IDF {'Book Title': ['actually', 'administration', 'age', 'allow', 'anti'...]}


As it stands, I'm using the following code to calculate the similarity between dictionary values from each to identify % similarity. First, I create a larger corpus using all dictionary items.

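from gensim.corpora import Dictionary

# Each dictionary value is a list of descriptor words, i.e. one tokenized "document"
book_values = list(book_feature_dict.values())
personality_values = list(personality_feature_dict.values())

# Books come first, followed by the personality types
texts = book_values + personality_values

# Map every word to an integer id, then convert each document to bag-of-words form
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]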


Then I create an LDA model to identify similarities. My knowledge of LDA modeling is limited, so if you spot an error here, I appreciate you flagging it!

from gensim.models import ldamodel
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=4, minimum_probability=1e-8)

Next, I treat each set of values as a bag of words and compare the first personality type, list(personality_feature_dict.values())[personality_num], against each of the 99 book descriptions/values by computing the Hellinger distance between their LDA topic distributions.

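from gensim.matutils import hellinger

personality_num = 0
book_values = list(book_feature_dict.values())

# Topic distribution of the chosen personality type (it does not change per book)
e_0 = list(personality_feature_dict.values())[personality_num]
e_0_bow = model.id2word.doc2bow(e_0)
e_0_lda_bow = model[e_0_bow]

# Visit every book exactly once; i stays aligned with the book's position in the dict
for i in range(len(book_values)):

    # Topic distribution of the current book
    s_0 = book_values[i]
    s_0_bow = model.id2word.doc2bow(s_0)
    s_0_lda_bow = model[s_0_bow]

    # Hellinger distance lies in [0, 1]; turn it into a similarity percentage
    x = 100 - (hellinger(e_0_lda_bow, s_0_lda_bow)*100)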


Finally, still inside the loop, I print every book whose similarity to the personality type comes back above 50%.

    if x > 50:
        print (list(personality_feature_dict.keys())[personality_num])
        print('similarity to ', (list(book_feature_dict.keys())[i]), 'is')
        print(x, '%', '\n\n')


Personality Type
similarity to  Name of Book 1 is
84.6029228744518 %

Personality Type
similarity to  Name of Book 2 is
83.09513184950528 %

Personality Type
similarity to  Name of Book 3 is
85.44322295890642 %




Your question is very, very broad. As such, it does not necessarily even fit StackOverflow.

To me it seems that you are attempting to filter books using a specific set of vocabulary. For that you do not need to get into LDA modelling. A simple cosine similarity between binary word vectors, or a distance between embeddings, would do (e.g. using FastText, Word2Vec or GloVe embeddings).
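As a rough illustration of the binary-vector option, here is a minimal sketch in plain Python. The toy dictionaries only imitate the shape of the ones in the question, and the helper names binary_cosine and rank_books are made up for this example.

```python
import math

def binary_cosine(words_a, words_b):
    """Cosine similarity between two word lists treated as 0/1 presence vectors."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a or not set_b:
        return 0.0
    # For 0/1 vectors the dot product is the number of shared words,
    # and each vector's norm is the square root of its vocabulary size.
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

# Hypothetical toy data in the same shape as the dictionaries in the question
personality_feature_dict = {"Type 1": ["able", "accepting", "accountable", "calm"]}
book_feature_dict = {
    "Book A": ["able", "accountable", "administration", "age"],
    "Book B": ["allow", "anti", "age", "actually"],
}

def rank_books(personality_words, books):
    """Return (title, score) pairs sorted from most to least similar."""
    scores = [(title, binary_cosine(personality_words, words))
              for title, words in books.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_books(personality_feature_dict["Type 1"], book_feature_dict)
for title, score in ranking:
    print(f"{title}: {score:.2f}")
```

The score here depends only on how many descriptor words two lists share, so it is easy to inspect by hand, but it cannot see that two different words mean roughly the same thing.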


The questionable part about the way you trained the LDA model is that it uncovers the latent topics across your corpus of books. The words for personality traits can be arbitrarily distributed across all of the topics and are unlikely to be strong clues about which topic a given book belongs to. Therefore, the similarity you are measuring in the 4-dimensional latent topic space is not a good indication of alignment with particular personality-related words (and themes).
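To make that concrete, here is a small sketch of what the score in the question actually compares. It uses the standard Hellinger distance formula on made-up 4-topic mixtures (these numbers are assumptions for illustration, not output of your model), and the function names are hypothetical.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def similarity_percent(p, q):
    """Same conversion as in the question: 100 - distance * 100."""
    return 100 - hellinger_distance(p, q) * 100

# Hypothetical topic mixtures over 4 topics (made-up numbers)
personality_topics = [0.70, 0.10, 0.10, 0.10]
book_same_mix = [0.70, 0.10, 0.10, 0.10]    # same mixture, whatever words produced it
book_other_mix = [0.10, 0.10, 0.10, 0.70]   # dominated by a different topic

same_score = similarity_percent(personality_topics, book_same_mix)
other_score = similarity_percent(personality_topics, book_other_mix)
print(same_score)
print(other_score)
```

The function only ever receives the topic mixtures, never the words. A book whose descriptors land in the same topics as the personality type scores as a perfect match even if the two share no vocabulary at all, and in this made-up case a book dominated by a different topic still lands only slightly below the 50% threshold.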

I would recommend using embeddings and some way to aggregate them across a larger volume of text (e.g. doc2vec from gensim).
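One simple way to aggregate, sketched below, is to average the word vectors of each text and compare the averages with cosine similarity. Note that this is plain mean pooling, not doc2vec, and the 3-dimensional vectors are invented for the example; real ones would come from a pretrained FastText, Word2Vec or GloVe model.

```python
import math

# Hypothetical toy embeddings; real vectors have hundreds of dimensions
toy_embeddings = {
    "calm": [0.9, 0.1, 0.0],
    "accepting": [0.8, 0.2, 0.1],
    "peaceful": [0.85, 0.15, 0.05],
    "administration": [0.0, 0.2, 0.9],
    "policy": [0.1, 0.1, 0.95],
}

def mean_vector(words, embeddings):
    """Average the embeddings of the known words; unknown words are skipped."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

personality_vec = mean_vector(["calm", "accepting"], toy_embeddings)
book_close = mean_vector(["peaceful", "unknownword"], toy_embeddings)
book_far = mean_vector(["administration", "policy"], toy_embeddings)

close_score = cosine(personality_vec, book_close)
far_score = cosine(personality_vec, book_far)
print(close_score, far_score)
```

In this toy setup the "close" book shares no words with the personality type, yet it scores far higher than the "far" book because its words sit nearby in the embedding space. That is the property a pure word-overlap measure lacks.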

