I am using Gensim Phrases to identify important n-grams in my text as follows:

```python
from gensim.models.phrases import Phrases

bigram = Phrases(documents, min_count=5)
trigram = Phrases(bigram[documents], min_count=5)

for sent in documents:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
```

However, this detects uninteresting n-grams such as "special issue", "important matter", and "high risk". I am particularly interested in detecting concepts in the text, such as "machine learning" and "human computer interaction".

Is there a way to stop Phrases from detecting uninteresting n-grams like the ones mentioned in my example above?

Solution

Phrases has a configurable threshold parameter, which adjusts the statistical cutoff for promoting word pairs into phrases. (Larger thresholds mean fewer pairs become phrases.) You can adjust it to make a greater proportion of the promoted phrases match your own ad hoc intuition about "interesting" phrases – but the class still uses a fairly crude method, with no awareness of grammar or domain knowledge beyond what is in the corpus. So any value that captures all or most of the phrases you want will likely also include many uninteresting ones, and vice versa.

If you have a priori knowledge that certain word groups are important, you can preprocess the corpus yourself to combine them into single tokens, before (or instead of) the collocation-statistics-based Phrases process. Both options are sketched below.
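As a minimal sketch of the first option, here is the question's pipeline with an explicit threshold. It assumes, as in the question, that `documents` is an iterable of token lists; the value 50 is purely illustrative and should be tuned against your own corpus:

```python
from gensim.models.phrases import Phrases

# threshold defaults to 10.0; raising it promotes fewer word-pairs to phrases.
bigram = Phrases(documents, min_count=5, threshold=50)
trigram = Phrases(bigram[documents], min_count=5, threshold=50)
```

Recent gensim versions also accept `scoring='npmi'`, which bounds phrase scores to [-1, 1] and can make the threshold easier to reason about than the default scorer.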
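And a sketch of the second option: merging known word groups into single tokens before (or instead of) running Phrases. The `known_phrases` mapping and the `merge_known_phrases` helper are hypothetical names invented for this illustration, not part of gensim:

```python
# Hypothetical a-priori knowledge: word groups we always want as one token.
known_phrases = {
    ("machine", "learning"): "machine_learning",
    ("human", "computer", "interaction"): "human_computer_interaction",
}

def merge_known_phrases(tokens, phrase_map):
    """Greedily replace any known word group in `tokens` with its joined form."""
    out, i = [], 0
    while i < len(tokens):
        # Check longer groups first so a trigram is not shadowed by a bigram prefix.
        for group, joined in sorted(phrase_map.items(), key=lambda kv: -len(kv[0])):
            if tuple(tokens[i:i + len(group)]) == group:
                out.append(joined)
                i += len(group)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

documents = [merge_known_phrases(sent, known_phrases) for sent in documents]
```

Since the merged tokens survive as single units, any later Phrases pass will treat them like ordinary words and leave them intact.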