Linear regression in Apache Spark giving wrong intercept and weights

Problem description

Using MLlib's LinearRegressionWithSGD on a dummy data set (y, x1, x2) generated from y = (2*x1) + (3*x2) + 4 produces the wrong intercept and weights. The actual data used is:

x1  x2  y
1   0.1 6.3
2   0.2 8.6
3   0.3 10.9
4   0.6 13.8
5   0.8 16.4
6   1.2 19.6
7   1.6 22.8
8   1.9 25.7
9   2.1 28.3
10  2.4 31.2
11  2.7 34.1
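
As a quick sanity check on the table above, the following plain-Scala sketch (no Spark required) confirms that every row follows y = (2*x1) + (3*x2) + 4, so an exact fit with intercept 4 and weights (2, 3) exists. The object and method names are hypothetical and chosen only for illustration.

```scala
object DummyDataCheck {
  // Rows of (x1, x2, y), copied from the table above.
  val rows: Seq[(Double, Double, Double)] = Seq(
    (1.0, 0.1, 6.3), (2.0, 0.2, 8.6), (3.0, 0.3, 10.9), (4.0, 0.6, 13.8),
    (5.0, 0.8, 16.4), (6.0, 1.2, 19.6), (7.0, 1.6, 22.8), (8.0, 1.9, 25.7),
    (9.0, 2.1, 28.3), (10.0, 2.4, 31.2), (11.0, 2.7, 34.1))

  // Largest absolute gap between the tabulated y and 2*x1 + 3*x2 + 4.
  def maxResidual: Double =
    rows.map { case (x1, x2, y) => math.abs(y - (2 * x1 + 3 * x2 + 4)) }.max

  def main(args: Array[String]): Unit =
    println(s"max |y - (2*x1 + 3*x2 + 4)| = $maxResidual")
}
```

Any gap reported here should be at the level of floating-point rounding, which means the large numbers in the model output come from the optimizer rather than from noise in the data.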

I set the following input parameters and got the model outputs below, written as [numIterations, step, miniBatchFraction, regParam] = [intercept, [weights]]:

[5,9,0.6,5] = [2.36667135839938E13, weights:[1.708772545209758E14, 3.849548062850367E13] ]
[2,default,default,default] = [-2495.5635231554793, weights:[-19122.41357929275,-4308.224496146531]]
[5,default,default,default] = [2.875191315671051E8, weights: [2.2013802074495964E9,4.9593017130199933E8]]
[20,default,default,default] = [-8.896967235537095E29, weights: [-6.811932001659158E30,-1.5346020624812824E30]]

I need to know:

How do I get the correct intercept and weights [4, [2, 3]] for the dummy data above?
Will tuning the step size help convergence? I need to run this in an automated manner for several hundred variables, so I am not keen to tune it by hand.
Should I scale the data? How would that help?

Below is the code used to generate these results.

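import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object SciBenchTest {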

  def main(args: Array[String]): Unit = run

  def run: Unit = {

    val sparkConf = new SparkConf().setAppName("SparkBench")
    val sc = new SparkContext(sparkConf)

    // Load and parse the dummy data (y, x1, x2) for y = (2*x1) + (3*x2) + 4
    // i.e. intercept should be 4, weights (2, 3)?
    val data = sc.textFile("data/dummy.csv")

    // LabeledPoint is (label, [features])
    val parsedData = data.map { line =>
      val parts = line.split(',')
      val label = parts(2).toDouble
      val features = Array(parts(0), parts(1)) map (_.toDouble)
      LabeledPoint(label, Vectors.dense(features))
    }
    //parsedData.collect().foreach(x => println(x));

    // Scale the features
    /*val scaler = new StandardScaler(withMean = true, withStd = true)
      .fit(parsedData.map(x => x.features))
    val scaledData = parsedData
      .map(x =>
      LabeledPoint(x.label,
        scaler.transform(Vectors.dense(x.features.toArray))))

    scaledData.collect().foreach(x => println(x));*/

    // Building the model: SGD = stochastic gradient descent
    val numIterations = 20 //5
    val step = 9.0 //9.0 //0.7
    val miniBatchFraction = 0.6 //0.7 //0.65 //0.7
    val regParam = 5.0 //3.0 //10.0
    //val model = LinearRegressionWithSGD.train(parsedData, numIterations, step) //scaledData

    val algorithm = new LinearRegressionWithSGD()       //train(parsedData, numIterations)
    algorithm.setIntercept(true)
    algorithm.optimizer
      //.setMiniBatchFraction(miniBatchFraction)
      .setNumIterations(numIterations)
      //.setStepSize(step)
      //.setGradient(new LeastSquaresGradient())
      //.setUpdater(new SquaredL2Updater()) //L1Updater //SimpleUpdater //SquaredL2Updater
      //.setRegParam(regParam)

    val model = algorithm.run(parsedData)

    println(s">>>> Model intercept: ${model.intercept}, weights: ${model.weights}")

    // Evaluate model on training examples
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, point.features, prediction)
    }
    // Print out features, actual and predicted values...
    valuesAndPreds.take(10).foreach({ case (v, f, p) =>
      println(s"Features: ${f}, Predicted: ${p}, Actual: ${v}")
    })
  }
}

Solution

As described in the documentation (https://spark.apache.org/docs/1.0.2/mllib-optimization.html), selecting the best step-size for SGD methods can often be delicate.

I would try lower values, for example:

// Build linear regression model
val regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.001)
val model = regression.run(parsedData)
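
To see why a lower step size matters for this data, here is a minimal sketch in plain Scala (no Spark) that runs full-batch gradient descent on the question's rows. It is not MLlib's implementation: the object and helper names are hypothetical, the loss is plain mean squared error, and the step / sqrt(iteration) decay is an assumption about how the optimizer shrinks its step. With unscaled features as large as x1 = 11, a step of 1.0 overshoots on every iteration and the weights blow up, much like the outputs in the question, while a step of 0.01 reduces the error.

```scala
object StepSizeSketch {
  // Rows of (x1, x2, y) from the question; y = 2*x1 + 3*x2 + 4.
  val rows: Seq[(Double, Double, Double)] = Seq(
    (1.0, 0.1, 6.3), (2.0, 0.2, 8.6), (3.0, 0.3, 10.9), (4.0, 0.6, 13.8),
    (5.0, 0.8, 16.4), (6.0, 1.2, 19.6), (7.0, 1.6, 22.8), (8.0, 1.9, 25.7),
    (9.0, 2.1, 28.3), (10.0, 2.4, 31.2), (11.0, 2.7, 34.1))

  // Mean squared error of the model y ~ b + w1*x1 + w2*x2.
  def mse(data: Seq[(Double, Double, Double)], b: Double, w1: Double, w2: Double): Double =
    data.map { case (x1, x2, y) => math.pow(b + w1 * x1 + w2 * x2 - y, 2) }.sum / data.size

  // Full-batch gradient descent starting from zero, with the step decayed
  // as step / sqrt(iteration) (assumed). Returns (intercept, w1, w2).
  def fit(data: Seq[(Double, Double, Double)], step: Double, iterations: Int): (Double, Double, Double) = {
    val n = data.size.toDouble
    var b = 0.0
    var w1 = 0.0
    var w2 = 0.0
    for (i <- 1 to iterations) {
      // Accumulate the gradient of the (halved) squared error over all rows.
      var gb = 0.0
      var g1 = 0.0
      var g2 = 0.0
      for ((x1, x2, y) <- data) {
        val err = b + w1 * x1 + w2 * x2 - y
        gb += err
        g1 += err * x1
        g2 += err * x2
      }
      val s = step / math.sqrt(i.toDouble)
      b -= s * gb / n
      w1 -= s * g1 / n
      w2 -= s * g2 / n
    }
    (b, w1, w2)
  }

  def main(args: Array[String]): Unit =
    for (step <- Seq(1.0, 0.01)) {
      val (b, w1, w2) = fit(rows, step, 20)
      println(s"step=$step intercept=$b weights=[$w1, $w2] mse=${mse(rows, b, w1, w2)}")
    }
}
```

A small step only guarantees that the error goes down. Because x1 and x2 are strongly correlated in this data set, recovering the exact coefficients [4, [2, 3]] may still take many more iterations than the 20 used here.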
