Question
I am curious whether there exists an algorithm or method to generate keywords/tags from a given text, using weight calculations, occurrence ratios, or other tools.
Additionally, I would be grateful if you could point out any Python-based solution or library for this.
Thanks.
Answer
One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say that in a larger collection of documents the term 'Markov' is almost never seen, but in a particular document from the same collection 'Markov' shows up very frequently. This suggests that 'Markov' might be a good keyword or tag to associate with that document.
To identify keywords like this, you could use the point-wise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term) * P(doc)) ]. This roughly tells you how much less (or more) surprised you are to come across the term in the specific document, as opposed to coming across it in the larger collection.
To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest scores.
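Building on the sketch above, the ranking step might look like this (again illustrative, not from the original answer; the toy data is made up):

def top_keywords(doc_index, docs, n=5):
    # score every distinct term in the document, then keep the n best
    scored = [(pmi(term, doc_index, docs), term) for term in set(docs[doc_index])]
    scored.sort(reverse=True)
    return [term for _, term in scored[:n]]

docs = [["the", "markov", "chain", "markov", "model"],
        ["the", "cat", "sat", "on", "the", "mat"],
        ["a", "dog", "and", "a", "cat"]]
print(top_keywords(0, docs))  # terms unique to doc 0 score highest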
If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.
Borrowing from my answer to that question, the NLTK collocations how-to covers how to extract interesting multiword expressions using n-gram PMI in about 7 lines of code, e.g.:
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# change this to read in your data; the Genesis corpus used here
# requires a one-time nltk.download('genesis')
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only keep bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)
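Note that finder.nbest returns a plain list of bigram tuples; on the Genesis corpus this turns up proper-name pairs such as ('Allon', 'Bacuth'), which you could join into multiword tags.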