Problem description
I'm learning about chi2 for feature selection and came across code like this.
However, my understanding of chi2 was that a higher score means the feature is more independent of the target (and therefore less useful to the model), so we would be interested in the features with the lowest scores. Yet scikit-learn's SelectKBest returns the features with the highest chi2 scores. Is my understanding of the chi2 test incorrect? Or does the chi2 score in sklearn produce something other than a chi2 statistic?
See the code below for what I mean (mostly copied from the above link, except for the end).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd
import numpy as np
# Load iris data
iris = load_iris()
# Create features and target
X = iris.data
y = iris.target
# Convert to categorical data by converting data to integers
X = X.astype(int)
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)
# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(list(zip(iris.feature_names, chi2_selector.scores_, chi2_selector.pvalues_)), columns=['ftr', 'score', 'pval'])
print(chi2_scores)
# The k best features returned by SelectKBest
# are the two features with the _highest_ scores
kbest = np.asarray(iris.feature_names)[chi2_selector.get_support()]
print(kbest)
Recommended answer
You have it backwards.
The null hypothesis of the chi2 test is that "the two categorical variables are independent". A higher chi2 statistic is therefore stronger evidence against independence: the feature and the target are dependent, which makes the feature MORE useful for classification, not less.
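To see the direction of the score for yourself, here is a minimal sketch on synthetic data. The data and the names `dependent` and `independent` are made up purely for illustration: one count feature is generated from the class label and the other is generated without looking at it.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)  # binary class labels

# Hypothetical feature tied to the label: counts tend to be larger when y == 1
dependent = rng.poisson(lam=np.where(y == 1, 5.0, 1.0))
# Hypothetical feature drawn without any reference to y
independent = rng.poisson(lam=3.0, size=n)

X = np.column_stack([dependent, independent])
scores, pvalues = chi2(X, y)

print("scores: ", scores)
print("pvalues:", pvalues)
```

The label-dependent column should come out with a much larger score and a much smaller p-value than the independent one, which is why selecting the highest scores is the right thing to do.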
SelectKBest gives you the best two (k=2) features, meaning the ones with the highest chi2 values. So you want the features it returns, rather than the "other features" from the chi2 selector.
You are correct to get the chi2 statistics from chi2_selector.scores_ and the best features from chi2_selector.get_support(). Based on the chi2 test of independence, it will give you 'petal length (cm)' and 'petal width (cm)' as the top 2 features. Hope this clarifies the algorithm.
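As for whether sklearn's chi2 is something other than a chi2 statistic: as far as I understand it, sklearn treats each non-negative feature as a count, sums it per class to get the observed values, and compares those sums with what you would expect if the feature total were split according to the class proportions. The sketch below reproduces that calculation by hand with NumPy, under that assumption, and checks it against `chi2`. The variable names are mine, not sklearn's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

iris = load_iris()
X = iris.data.astype(int)  # same integer truncation as in the question
y = iris.target

# One-hot encode the class labels: shape (n_samples, n_classes)
classes = np.unique(y)
Y = (y[:, None] == classes[None, :]).astype(float)

observed = Y.T @ X                               # per-class sum of each feature
feature_totals = X.sum(axis=0)                   # overall sum of each feature
class_prob = Y.mean(axis=0)                      # share of samples in each class
expected = np.outer(class_prob, feature_totals)  # sums expected under independence

# Pearson chi2 statistic per feature: sum over classes of (O - E)^2 / E
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, _ = chi2(X, y)

print(dict(zip(iris.feature_names, manual_scores)))
print(np.allclose(manual_scores, sklearn_scores))
```

So it is an ordinary Pearson chi2 statistic, just computed on per-class feature sums rather than on a contingency table of category levels, and a larger value still means stronger dependence on the class.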