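Outline

3.1 Filter

3.1.1 Variance threshold

3.1.2 Correlation coefficient

3.1.3 Chi-squared test

3.1.4 Mutual information

3.2 Wrapper

3.2.1 Recursive feature elimination

3.3 Embedded

3.3.1 Penalty-based feature selection

3.3.2 Tree-based feature selection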

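Class	Method type	Description
VarianceThreshold	Filter	Variance threshold method
SelectKBest	Filter	Scores features with a selectable function, such as a correlation coefficient, the chi-squared test, or the maximal information coefficient
RFE	Wrapper	Recursively trains a base model and eliminates the features with the smallest weight coefficients
SelectFromModel	Embedded	Trains a base model and selects the features with the largest weight coefficients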

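Rationale

Features are selected with two considerations in mind:

- Whether the feature is divergent: if a feature barely varies, for example its variance is close to 0, the samples hardly differ on it and it is of little use for telling samples apart.
- Correlation between the feature and the target: features that are highly correlated with the target should be preferred. Apart from the variance method, every method covered in this article is based on correlation.

By the form the selection takes, feature selection methods fall into three groups:

- Filter: scores each feature by divergence or correlation, then selects features using a score threshold or a target number of features.
- Wrapper: selects or excludes several features at a time according to an objective function (usually a predictive performance score).
- Embedded: first trains a machine learning model to obtain a weight coefficient for each feature, then selects features in descending order of coefficient. This resembles the Filter approach, except that the quality of a feature is determined through training.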

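Feature selection

Filter

1. Variance threshold

Suppose we have a dataset with Boolean features and want to remove every feature that is 1 or 0 in more than 80% of the samples. A Boolean feature is a Bernoulli random variable with variance p(1 - p), so the threshold is 0.8 * (1 - 0.8) = 0.16.

Result: the first column is removed. It is 0 in 5 of the 6 samples (5/6 > 0.8), so its variance is 5/36 ≈ 0.139, which is below the threshold.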

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
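To see why the first column is the one that goes, the per-column variances can be checked by hand. The following is a minimal sketch on the same data; the names p and bernoulli_var are introduced here for illustration only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

# Fraction of ones in each column, and the Bernoulli variance p * (1 - p).
p = X.mean(axis=0)
bernoulli_var = p * (1 - p)

threshold = 0.8 * (1 - 0.8)
sel = VarianceThreshold(threshold=threshold).fit(X)

print(bernoulli_var)      # variance of each column computed by hand
print(sel.variances_)     # variance of each column as computed by scikit-learn
print(sel.get_support())  # boolean mask of the columns that are kept
```

Only columns whose variance is strictly greater than the threshold are kept, so a column that sits exactly on the threshold would also be dropped.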

2. Chi-squared test

Supports sparse data. Two commonly used APIs:

(1) SelectKBest removes all but the k highest-scoring features

(2) SelectPercentile removes all but a user-specified highest-scoring percentage of features

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
# Keep the 2 best features
rst = SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)
print(rst[:5])
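The transformed array alone does not show which columns survived. As a sketch, the fitted selector can be inspected through scores_ and get_support; the percentile value of 50 below is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Fit without transforming so that the scores can be inspected.
kbest = SelectKBest(chi2, k=2).fit(X, y)
print(kbest.scores_)                    # one chi-squared score per feature
print(kbest.get_support(indices=True))  # column indices of the kept features

# SelectPercentile keeps a share of the features instead of a fixed count.
pct = SelectPercentile(chi2, percentile=50).fit(X, y)
print(pct.get_support(indices=True))
```

Note that chi2 expects non-negative feature values, which holds for the iris measurements.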

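Parameter settings

Add noise columns (features) to the data to check how the scoring function behaves.

(1) For regression: f_regression

(2) For classification: chi2 or f_classif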
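The regression case can be sketched in the same spirit. The data below is synthetic and assumed purely for illustration: two informative columns that generate the target, followed by five pure-noise columns.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n_samples = 200

# Two informative columns followed by five noise columns (illustrative data).
informative = rng.normal(size=(n_samples, 2))
noise = rng.normal(size=(n_samples, 5))
X = np.hstack((informative, noise))
y = 3.0 * informative[:, 0] - 2.0 * informative[:, 1] + 0.1 * rng.normal(size=n_samples)

selector = SelectKBest(f_regression, k=2).fit(X, y)
print(selector.scores_)                    # F-statistic of each column
print(selector.get_support(indices=True))  # column indices of the kept features
```

The informative columns should receive F-scores far above those of the noise columns, which is the behavior the noise test is meant to confirm.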

#%%
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif, chi2

###############################################################################
# import some data to play with
# The iris dataset
iris = datasets.load_iris()
# Some noisy data not correlated
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))
# Add the noisy data to the informative features
X = np.hstack((iris.data, E))
y = iris.target

###############################################################################
plt.figure(1)
plt.clf()
X_indices = np.arange(X.shape[-1])

###############################################################################
# Univariate feature selection with the chi-squared test for feature scoring
# (use the commented-out line instead to score with the F-test)
# We keep the 10% most significant features
# selector = SelectPercentile(f_classif, percentile=10)
selector = SelectPercentile(chi2, percentile=10)
selector.fit(X, y)
# Turn p-values into scores and normalize them to [0, 1]
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)', color='g')
plt.legend()
plt.show()

Result with f_classif

Result with chi2

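3. Pearson correlation coefficient

4. Mutual information

Link: https://www.zhihu.com/question/28641663/answer/41653367

Compute the correlation between each feature and the response variable. Two measures commonly used in practice are the Pearson coefficient and mutual information.

The Pearson coefficient only measures linear correlation;

mutual information captures many kinds of dependence, including non-linear ones, but is somewhat more expensive to compute.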
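The difference between the two measures shows up on a non-linear relationship. The sketch below uses synthetic data chosen for illustration (a linear and a quadratic target), with scipy.stats.pearsonr and scikit-learn's mutual_info_regression as the two estimators.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)

# One target depends on x linearly, the other quadratically (illustrative data).
targets = {
    "linear": 2.0 * x + 0.1 * rng.normal(size=x.size),
    "quadratic": x ** 2 + 0.05 * rng.normal(size=x.size),
}

results = {}
for name, target in targets.items():
    r, _ = pearsonr(x, target)
    mi = mutual_info_regression(x.reshape(-1, 1), target, random_state=0)[0]
    results[name] = (r, mi)
    print(f"{name}: pearson r = {r:.3f}, mutual information = {mi:.3f}")
```

For the quadratic target the Pearson coefficient should come out close to zero while the mutual information estimate stays clearly positive, because the dependence is strong but not linear.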

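Wrapper

1. Recursive feature elimination

The idea is to assign a score to every feature:

First, a predictive model is trained on the original features, and each feature is assigned a weight.

Then the features with the smallest absolute weights are dropped from the feature set.

This is repeated recursively until the number of remaining features reaches the desired number.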
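The three steps above can be sketched as a short loop. manual_rfe is a hypothetical helper written for this illustration, not part of scikit-learn; it removes one feature per round and, for multiclass problems, sums the squared weights over all rows of coef_ to get one importance per feature. The result is compared with scikit-learn's own RFE.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC


def manual_rfe(X, y, n_features_to_select):
    """Hypothetical helper: drop the weakest feature each round until enough remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Step 1: train the model on the features that are still in play.
        model = SVC(kernel="linear", C=1).fit(X[:, remaining], y)
        # Step 2: one importance per feature, summed over the rows of coef_.
        importance = (model.coef_ ** 2).sum(axis=0)
        # Step 3: remove the feature with the smallest importance.
        del remaining[int(np.argmin(importance))]
    return remaining


X, y = load_iris(return_X_y=True)
selected = manual_rfe(X, y, n_features_to_select=2)
print(selected)

# The library implementation with step=1 follows the same procedure.
reference = RFE(SVC(kernel="linear", C=1), n_features_to_select=2, step=1).fit(X, y)
print(list(reference.get_support(indices=True)))
```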

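(1) Recursive feature elimination: an example showing the relevance of each pixel in a digit classification task.

(2) Recursive feature elimination with cross-validation: an example that automatically tunes the number of selected features with cross-validation.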
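The cross-validated variant in (2) is available as RFECV. The following is a minimal sketch on the iris data rather than the digits data, chosen only to keep it fast; the 5-fold split and accuracy scoring are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Cross-validation picks the number of features with the best mean score.
rfecv = RFECV(
    estimator=SVC(kernel="linear", C=1),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
rfecv.fit(X, y)

print(rfecv.n_features_)  # number of features chosen by cross-validation
print(rfecv.support_)     # boolean mask of the chosen features
print(rfecv.ranking_)     # rank 1 marks the chosen features
```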

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()
# Flatten each 8x8 image into a vector of 64 pixel features
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

########################################################
# Create the RFE object and rank each pixel
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)

rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

# Plot pixel ranking
plt.matshow(ranking)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()

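The code above plots the ranking of the 64 features (pixels) as an 8x8 image. print(ranking) shows the same ranks as an 8x8 integer array, in which rank 1 marks the pixel that is retained last.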

Embedded

1. Penalty-based feature selection

2. Tree-based feature selection

This topic has its own article; see: [Feature] Feature selection - Embedded topic
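As a brief illustration of the two embedded approaches, both can be driven through SelectFromModel. This is a sketch on the iris data; the value C=0.01 and the choice of ExtraTreesClassifier with 50 trees are assumptions made here, not settings taken from the article.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Penalty-based: an L1 penalty drives some coefficients to zero,
# and SelectFromModel keeps the features whose coefficients are non-zero.
l1_svc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
X_l1 = SelectFromModel(l1_svc, prefit=True).transform(X)
print(X_l1.shape)

# Tree-based: keep the features whose importance is above the mean importance.
trees = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
X_trees = SelectFromModel(trees, prefit=True).transform(X)
print(trees.feature_importances_)
print(X_trees.shape)
```

A smaller C means a stronger penalty and therefore fewer selected features.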

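Integration with a pipeline

In the snippet below,

(1) sklearn.svm.LinearSVC is combined with sklearn.feature_selection.SelectFromModel to evaluate feature importance and select the most relevant features.

(2) A sklearn.ensemble.RandomForestClassifier is then trained on the transformed output, i.e. using only the selected relevant features.

Ref: sklearn.pipeline.Pipeline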

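from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# The L1 penalty requires dual=False in LinearSVC
clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', RandomForestClassifier())
])
# X, y: feature matrix and labels, as loaded in the earlier examples
clf.fit(X, y)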

Dimensionality reduction

1. Principal component analysis (PCA)

2. Linear discriminant analysis (LDA)

Goto: [Scikit-learn] 4.4 Dimensionality reduction - PCA

Ref: [Scikit-learn] 2.5 Dimensionality reduction - Probabilistic PCA & Factor Analysis

Ref: [Scikit-learn] 2.5 Dimensionality reduction - ICA

Goto: [Scikit-learn] 1.2 Dimensionality reduction - Linear and Quadratic Discriminant Analysis
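For orientation before following the links above, here is a minimal sketch of both techniques on the iris data. Reducing to 2 components is an arbitrary choice for illustration. PCA is unsupervised and ignores the labels, while LDA uses them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: project onto the directions of largest variance (labels not used).
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# LDA: project onto the directions that best separate the classes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```

With LDA the number of components cannot exceed the number of classes minus one, which is 2 for the three iris classes.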

End.
