This article looks at how feature normalization works in Spark. It should be a useful reference if you are facing the same problem, so let's walk through it together.

Problem Description


I am trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:

{0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
{1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0},
{-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0},
{-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0},

I would expect that new Normalizer().transform(vectors) creates a JavaRDD where each vector feature is normalized as (v - mean)/stdev across all values of feature-0, feature-1, etc.
The resulting set is:

[-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,0.9999999993877552]
[1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999]
[-1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999]
[1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,0.9999999993877552]
[0.0,0.0,0.0,0.0,0.0,0.0,1.0]

Note that all of the original values of 70000.0 result in different 'normalized' values. Also, how was, for example, 1.357142668768307E-5 calculated when the values are .95, 1, -1, -.95, 0? What's more, if I remove a feature, the results are different. I could not find any documentation on this issue.
In fact, my question is: how do I correctly normalize all of the vectors in an RDD?

Solution

Your expectations are simply incorrect. As clearly stated in the official documentation, "Normalizer scales individual samples to have unit L^p norm", where the default value of p is 2. Ignoring numerical precision issues:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.Normalizer

val rdd = sc.parallelize(Seq(
    Vectors.dense(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
    Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0),
    Vectors.dense(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0),
    Vectors.dense(-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
    Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0)))

val normalizer = new Normalizer()  // default p = 2, i.e. unit L2 norm per sample
val transformed = normalizer.transform(rdd)
transformed.map(_.toArray.sum).collect
// Array[Double] = Array(1.0009051182149054, 1.000085713673417,
//   0.9999142851020933, 1.00087797536153, 1.0)
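
In other words, each vector is scaled by its own L2 norm, not column by column. That is also where the value asked about in the question comes from: 1.357142668768307E-5 is simply 0.95 divided by the L2 norm of the first vector, which is dominated by the 70000.0 component. A minimal sketch of that hand computation (plain Scala, not part of the original answer):

// Reproduce the first "normalized" value from the question by hand:
// each component is divided by the vector's own L2 norm.
val v = Array(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0)
val l2Norm = math.sqrt(v.map(x => x * x).sum)  // ≈ 70000.0097, dominated by 70000.0
val normalized = v.map(_ / l2Norm)
println(normalized(0))  // ≈ 1.357142668768307E-5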

MLlib doesn't provide the functionality you need out of the box, but you can use StandardScaler from ML.

import org.apache.spark.ml.feature.StandardScaler

val df = rdd.map(Tuple1(_)).toDF("features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(true)

val transformedDF = scaler.fit(df).transform(df)

transformedDF.select($"scaledFeatures").show(5, false)

// +--------------------------------------------------------------------------------------------------------------------------+
// |scaledFeatures                                                                                                            |
// +--------------------------------------------------------------------------------------------------------------------------+
// |[0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0]                |
// |[1.0253040317020319,1.4038947727833362,1.414213562373095,-0.6532797101459693,-0.6532797101459693,-0.6010982697825494,0.0] |
// |[-1.0253040317020319,-1.4242574689236265,-1.414213562373095,-0.805205224133404,-0.805205224133404,-0.8536605680105113,0.0]|
// |[-0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0]               |
// |[0.0,-0.010181348070145075,0.0,-0.7292424671396867,-0.7292424671396867,-0.7273794188965303,0.0]                           |
// +--------------------------------------------------------------------------------------------------------------------------+
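
StandardScaler does what the question expected: each column is centered by its mean and divided by its sample standard deviation, which is why the constant 70000.0 column becomes 0.0 everywhere. As a quick sanity check for the first column (values 0.95, 1.0, -1.0, -0.95, 0.0), here is a hand computation in plain Scala, offered only as an illustration of that formula:

// Hand-check of (v - mean) / stdev for feature 0, using the sample standard deviation (n - 1)
val col0 = Array(0.95, 1.0, -1.0, -0.95, 0.0)
val mean = col0.sum / col0.length                                            // 0.0
val stdev = math.sqrt(col0.map(x => math.pow(x - mean, 2)).sum / (col0.length - 1))
println((0.95 - mean) / stdev)  // ≈ 0.9740388301169303, matching the first cell above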

That concludes this article on feature normalization algorithms in Spark. We hope the answer above is helpful, and thank you for your continued support!
