Question
I'm experimenting with Chi-2 feature selection for some text classification tasks. I understand that the Chi-2 test checks the dependence between two categorical variables, so if we perform Chi-2 feature selection for a binary text classification problem with a binary BOW vector representation, each Chi-2 test on each (feature, class) pair would be a very straightforward Chi-2 test with 1 degree of freedom.
It seems to me that we can also perform Chi-2 feature selection on a DF (word counts) vector representation. My first question is: how does sklearn discretize the integer-valued features into categorical ones?
My second question is similar to the first. From the demo code here: http://scikit-learn.sourceforge.net/dev/auto_examples/document_classification_20newsgroups.html
It seems that we can also perform Chi-2 feature selection on a TF*IDF vector representation. How does sklearn perform Chi-2 feature selection on real-valued features?
Thank you in advance for your kind advice!
Answer
The χ² feature selection code builds a contingency table from its inputs X (feature values) and y (class labels). Each entry (i, j) corresponds to some feature i and some class j, and holds the sum of the i'th feature's values across all samples belonging to class j. It then computes the χ² test statistic against expected frequencies arising from the empirical distribution over classes (just their relative frequencies in y) and a uniform distribution over feature values.
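The computation above can be sketched in a few lines of NumPy. The toy count matrix below is made up for illustration; the manual contingency-table computation is a sketch of what sklearn.feature_selection.chi2 does internally, not the library's exact code:

```python
import numpy as np
from scipy.stats import chisquare
from sklearn.feature_selection import chi2

# Hypothetical term-count matrix: 6 documents x 3 features
X = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [4, 1, 0],
    [0, 2, 3],
    [0, 3, 2],
    [1, 2, 4],
], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Contingency table: observed[j, i] = sum of feature i's values
# across all samples in class j (no discretization involved)
Y = np.vstack([(y == c).astype(float) for c in np.unique(y)])
observed = Y @ X

# Expected counts under independence: class priors times total
# feature counts
class_prob = Y.mean(axis=1)        # relative class frequencies in y
feature_count = X.sum(axis=0)      # total value per feature
expected = np.outer(class_prob, feature_count)

# One chi2 statistic per feature (column)
chi2_manual, _ = chisquare(observed, expected)
chi2_sklearn, _ = chi2(X, y)
print(chi2_manual)
print(np.allclose(chi2_manual, chi2_sklearn))
```

The manual statistics and sklearn's should agree, since both sum (observed − expected)²/expected per feature column.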
This works when the feature values are frequencies (of terms, for example) because the sum will be the total frequency of a feature (term) in that class. There's no discretization going on.
It also works quite well in practice when the values are tf-idf values, since those are just weighted/scaled frequencies.
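Putting it together, tf-idf vectors can be fed straight into the selector. The corpus and labels below are hypothetical, and k=4 is an arbitrary choice for the sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus with two classes
docs = ["the cat sat", "cat and dog", "stock market fell", "market rally today"]
labels = [0, 0, 1, 1]

# Real-valued (non-negative) tf-idf matrix; chi2 accepts it as-is
X = TfidfVectorizer().fit_transform(docs)

# Keep the 4 features with the highest chi2 scores
selector = SelectKBest(chi2, k=4).fit(X, labels)
X_new = selector.transform(X)
print(X_new.shape)
```

The only requirement is that the features are non-negative; chi2 raises an error otherwise.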