Problem Description
Trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:
{0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
{1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0},
{-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0},
{-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0},
I would expect that new Normalizer().transform(vectors) creates a JavaRDD where each vector feature is normalized as (v - mean) / stdev across all values of feature-0, feature-1, etc.
The resulting set is:
[-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,0.9999999993877552]
[1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999]
[-1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999]
[1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,0.9999999993877552]
[0.0,0.0,0.0,0.0,0.0,0.0,1.0]
Note that all of the original 70000.0 values result in different 'normalized' values. Also, how was, for example, 1.357142668768307E-5 calculated when the column values are 0.95, 1, -1, -0.95, and 0? What's more, if I remove a feature, the results are different. I could not find any documentation on this issue.
In fact, my question is: how do I correctly normalize all the vectors in an RDD?
Your expectations are simply incorrect. As clearly stated in the official documentation, Normalizer "scales individual samples to have unit L^p norm", where the default value of p is 2. Ignoring numerical precision issues:
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.parallelize(Seq(
  Vectors.dense(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
  Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0),
  Vectors.dense(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0),
  Vectors.dense(-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
  Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0)))

val normalizer = new Normalizer()  // p = 2 by default
val transformed = normalizer.transform(rdd)

transformed.map(_.toArray.sum).collect
// Array[Double] = Array(1.0009051182149054, 1.000085713673417,
//   0.9999142851020933, 1.00087797536153, 1.0)
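Two things are worth spelling out here. First, Normalizer only guarantees that each row has unit L^2 norm; the per-row sums above land near 1 only because the 70000.0 component dominates every vector. Second, this also explains the 'mysterious' values in the question. A minimal check, assuming Vectors.norm from mllib.linalg:

// Each row should have L2 norm ~ 1.0 after Normalizer.transform.
transformed.map(v => Vectors.norm(v, 2)).collect()
// All values come out as ~1.0, up to floating point error.

// 1.357142668768307E-5 from the question is just the first component of the
// first row divided by that row's L2 norm:
val first = Vectors.dense(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0)
val norm = Vectors.norm(first, 2)  // ~70000.0097
0.95 / norm                        // ~1.3571427E-5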
MLlib doesn't provide the functionality you need, but you can use StandardScaler from ML.
import org.apache.spark.ml.feature.StandardScaler
// toDF and the $ column syntax assume the SQL implicits are in scope
// (they are pre-imported in spark-shell).

val df = rdd.map(Tuple1(_)).toDF("features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(true)

val transformedDF = scaler.fit(df).transform(df)

transformedDF.select($"scaledFeatures").show(5, false)
// +--------------------------------------------------------------------------------------------------------------------------+
// |scaledFeatures |
// +--------------------------------------------------------------------------------------------------------------------------+
// |[0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0] |
// |[1.0253040317020319,1.4038947727833362,1.414213562373095,-0.6532797101459693,-0.6532797101459693,-0.6010982697825494,0.0] |
// |[-1.0253040317020319,-1.4242574689236265,-1.414213562373095,-0.805205224133404,-0.805205224133404,-0.8536605680105113,0.0]|
// |[-0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0] |
// |[0.0,-0.010181348070145075,0.0,-0.7292424671396867,-0.7292424671396867,-0.7273794188965303,0.0] |
// +--------------------------------------------------------------------------------------------------------------------------+
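The scaled values can be reproduced by hand: with setWithMean(true) and setWithStd(true), StandardScaler computes (v - mean) / stdev per column, using the corrected (n - 1) sample standard deviation. A minimal sketch for feature 0, whose column values across the five rows are 0.95, 1.0, -1.0, -0.95 and 0.0:

val col0 = Seq(0.95, 1.0, -1.0, -0.95, 0.0)
val mean = col0.sum / col0.size                                            // ~0.0
val variance = col0.map(x => math.pow(x - mean, 2)).sum / (col0.size - 1)  // 0.95125
val stdev = math.sqrt(variance)                                            // ~0.97532
(0.95 - mean) / stdev                                                      // ~0.9740388, the first entry above

The last column is a different story: it is constant (70000.0 in every row), so after subtracting the mean it is identically zero, which is why every scaled row ends in 0.0.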