Problem Description

I was trying the following code and found that StandardScaler (or MinMaxScaler) and Normalizer from sklearn handle data very differently. This discrepancy makes pipeline construction more difficult. I was wondering whether this design difference is intentional.
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler
For Normalizer, the data is read "horizontally" (row by row).
Normalizer(norm = 'max').fit_transform([[ 1., 1., 2., 10],
[ 2., 0., 0., 100],
[ 0., -1., -1., 1000]])
#array([[ 0.1 , 0.1 , 0.2 , 1. ],
# [ 0.02 , 0. , 0. , 1. ],
# [ 0. , -0.001, -0.001, 1. ]])
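The row-wise behavior is easy to reproduce by hand. The sketch below, using plain NumPy, divides each row by that row's maximum absolute value, which is what norm='max' does:

```python
import numpy as np

X = np.array([[ 1.,  1.,  2.,   10.],
              [ 2.,  0.,  0.,  100.],
              [ 0., -1., -1., 1000.]])

# Normalizer(norm='max') rescales each ROW by the row's largest
# absolute value, so every row ends up with a max-norm of 1.
row_max = np.abs(X).max(axis=1, keepdims=True)
X_normalized = X / row_max
print(X_normalized)
```

This reproduces the array shown above, confirming that each row is treated independently of the others.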
For StandardScaler and MinMaxScaler, the data is read "vertically" (column by column).
StandardScaler().fit_transform([[ 1., 1., 2., 10],
[ 2., 0., 0., 100],
[ 0., -1., -1., 1000]])
#array([[ 0. , 1.22474487, 1.33630621, -0.80538727],
# [ 1.22474487, 0. , -0.26726124, -0.60404045],
# [-1.22474487, -1.22474487, -1.06904497, 1.40942772]])
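The column-wise standardization can likewise be verified manually. A minimal sketch, assuming the population standard deviation (ddof=0, which is what StandardScaler uses):

```python
import numpy as np

X = np.array([[ 1.,  1.,  2.,   10.],
              [ 2.,  0.,  0.,  100.],
              [ 0., -1., -1., 1000.]])

# StandardScaler standardizes each COLUMN: subtract the column mean,
# then divide by the column's population standard deviation.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_standardized)
```

After the transform each column has mean 0 and standard deviation 1, which is the whole point of feature-wise standardization.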
MinMaxScaler().fit_transform([[ 1., 1., 2., 10],
[ 2., 0., 0., 100],
[ 0., -1., -1., 1000]])
#array([[0.5 , 1. , 1. , 0. ],
# [1. , 0.5 , 0.33333333, 0.09090909],
# [0. , 0. , 0. , 1. ]])
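The same column-wise reading applies to MinMaxScaler: each feature is mapped to [0, 1] using that column's own minimum and maximum. A quick NumPy sketch:

```python
import numpy as np

X = np.array([[ 1.,  1.,  2.,   10.],
              [ 2.,  0.,  0.,  100.],
              [ 0., -1., -1., 1000.]])

# MinMaxScaler rescales each COLUMN to [0, 1] via
# (x - col_min) / (col_max - col_min).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)
```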
Recommended Answer

This is expected behavior, because StandardScaler and Normalizer serve different purposes. The StandardScaler works "vertically", because it...
...standardize[s] features by removing the mean and scaling to unit variance.

[...] Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.
while the Normalizer works "horizontally", because it...
...normalize[s] samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.
Please have a look at the scikit-learn docs (linked above) for more insight into which one better serves your purpose.
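Despite the different orientations, the two transformers compose without trouble in a Pipeline, since each step only sees the output of the previous one. A minimal sketch; the step order shown (column-wise scaling first, then row-wise normalization) is an assumption, and you should pick whichever order matches what your downstream estimator expects:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[ 1.,  1.,  2.,   10.],
              [ 2.,  0.,  0.,  100.],
              [ 0., -1., -1., 1000.]])

# Standardize each feature column-wise, then rescale each sample
# row-wise to unit l2 norm. NOTE: this ordering is illustrative,
# not a recommendation -- it depends on your model's needs.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("normalize", Normalizer(norm="l2")),
])

X_out = pipe.fit_transform(X)
print(X_out)
```

After the final step, every row of X_out has unit l2 norm, regardless of what the earlier column-wise step did.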