Question
I was wondering how one would calculate pointwise mutual information for text classification. To be more exact, I want to classify tweets into categories. I have a dataset of tweets (which are annotated), and for each category I have a dictionary of words that belong to that category. Given this information, how can the PMI of each category be calculated for each tweet, so that the tweet can be classified into one of these categories?
Answer
PMI is a measure of association between a feature (in your case a word) and a class (category), not between a document (tweet) and a category. The formula is available on Wikipedia:
                  P(x, y)
pmi(x, y) = log -----------
                 P(x) P(y)
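As a minimal sketch of how these probabilities could be estimated by simple counting over an annotated tweet collection (the tweets list of (text, category) pairs, the whitespace tokenisation, and the function name pmi_scores are assumptions for illustration, not part of the question):

import math
from collections import Counter

def pmi_scores(tweets):
    """tweets: iterable of (text, category) pairs; returns {(word, category): PMI}."""
    n_docs = 0
    word_df = Counter()    # number of tweets containing word x
    class_df = Counter()   # number of tweets labelled with class y
    joint_df = Counter()   # number of tweets containing x AND labelled y
    for text, category in tweets:
        n_docs += 1
        class_df[category] += 1
        for w in set(text.lower().split()):   # presence/absence per tweet
            word_df[w] += 1
            joint_df[(w, category)] += 1
    scores = {}
    for (w, c), joint in joint_df.items():
        p_x = word_df[w] / n_docs
        p_y = class_df[c] / n_docs
        p_xy = joint / n_docs
        scores[(w, c)] = math.log(p_xy / (p_x * p_y))
    return scores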
In that formula, X is the random variable that models the occurrence of a word, and Y models the occurrence of a class. For a given word x and a given class y, you can use PMI to decide whether a feature is informative or not, and you can do feature selection on that basis. Having fewer features often improves the performance of your classification algorithm and speeds it up considerably. The classification step, however, is separate: PMI only helps you select better features to feed into your learning algorithm.
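One possible sketch of that selection step, building on the hypothetical pmi_scores function above (the cut-off k and the scikit-learn usage mentioned in the comments are illustrative assumptions):

def select_vocabulary(scores, k=500):
    """Keep the words from the k (word, class) pairs with the highest PMI."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {word for (word, _cls), _pmi in ranked[:k]}

# The reduced vocabulary can then constrain whatever learner you already use, e.g.
# with scikit-learn: CountVectorizer(vocabulary=sorted(vocab)) to vectorise the
# tweets, followed by a classifier such as MultinomialNB fitted on those vectors.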
One thing I didn't mention in the original post is that PMI is sensitive to word frequencies. Let's rewrite the formula as
                  P(x, y)          P(x|y)
pmi(x, y) = log ----------- = log --------
                 P(x) P(y)          P(x)
When x and y are perfectly correlated, P(x|y) = P(y|x) = 1, so pmi(x, y) = log(1/P(x)) = -log P(x). Less frequent x-es (words) will therefore have a higher PMI score than frequent x-es, even if both are perfectly correlated with y.
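As a made-up numeric illustration: if word x1 occurs in 1 of 1,000 tweets and word x2 in 100 of 1,000 tweets, and each occurs only in tweets of class y, then pmi(x1, y) = log(1/0.001) = log 1000 while pmi(x2, y) = log(1/0.1) = log 10, even though both words are equally perfect predictors of y. A common way to soften this bias is to ignore words below a minimum frequency before ranking features by PMI.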