Problem description
Using MLlib's LinearRegressionWithSGD on a dummy data set (y, x1, x2) generated from y = (2*x1) + (3*x2) + 4 produces the wrong intercept and weights. The actual data used is:
x1 x2 y
1 0.1 6.3
2 0.2 8.6
3 0.3 10.9
4 0.6 13.8
5 0.8 16.4
6 1.2 19.6
7 1.6 22.8
8 1.9 25.7
9 2.1 28.3
10 2.4 31.2
11 2.7 34.1
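As a quick sanity check, the rows above can be compared against the stated formula, which rules out the data itself as the cause of the wrong output. This is a minimal sketch in plain Scala with no Spark dependency; the object and method names (DummyDataCheck, expected, maxAbsError) are made up for illustration.

```scala
object DummyDataCheck {
  // Rows copied from the table above as (x1, x2, y).
  val rows: Seq[(Double, Double, Double)] = Seq(
    (1.0, 0.1, 6.3), (2.0, 0.2, 8.6), (3.0, 0.3, 10.9), (4.0, 0.6, 13.8),
    (5.0, 0.8, 16.4), (6.0, 1.2, 19.6), (7.0, 1.6, 22.8), (8.0, 1.9, 25.7),
    (9.0, 2.1, 28.3), (10.0, 2.4, 31.2), (11.0, 2.7, 34.1)
  )

  // The formula stated in the question: y = 2*x1 + 3*x2 + 4.
  def expected(x1: Double, x2: Double): Double = 2 * x1 + 3 * x2 + 4

  // Largest absolute gap between the tabulated y and the formula.
  def maxAbsError: Double =
    rows.map { case (x1, x2, y) => math.abs(expected(x1, x2) - y) }.max

  def main(args: Array[String]): Unit =
    println(f"max |y - (2*x1 + 3*x2 + 4)| = $maxAbsError%.2e")
}
```

Every row matches the formula up to floating-point rounding, so an exact fit with intercept 4 and weights (2, 3) exists for this data.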
I set the following input parameters and got the model outputs below, written as [numIterations, step, miniBatchFraction, regParam] = [intercept, weights: [weights]]:
- [5,9,0.6,5] = [2.36667135839938E13, weights:[1.708772545209758E14, 3.849548062850367E13] ]
- [2,default,default,default] = [-2495.5635231554793, weights:[-19122.41357929275,-4308.224496146531]]
- [5,default,default,default] = [2.875191315671051E8, weights: [2.2013802074495964E9,4.9593017130199933E8]]
- [20,default,default,default] = [-8.896967235537095E29, weights: [-6.811932001659158E30,-1.5346020624812824E30]]
I need to know:
- How do I get the correct intercept and weights [4, [2, 3]] for the dummy data above?
- Will tuning the step size help convergence? I need to run this in an automated manner for several hundred variables, so I am not keen to do that.
- Should I scale the data? How will it help?
Below is the code used to generate these results.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
// Needed only if the commented-out lines below are enabled:
// import org.apache.spark.mllib.feature.StandardScaler
// import org.apache.spark.mllib.optimization.{LeastSquaresGradient, SquaredL2Updater}

object SciBenchTest {
  def main(args: Array[String]): Unit = run

  def run: Unit = {
    val sparkConf = new SparkConf().setAppName("SparkBench")
    val sc = new SparkContext(sparkConf)

    // Load and parse the dummy data (y, x1, x2) for y = (2*x1) + (3*x2) + 4
    // i.e. intercept should be 4, weights (2, 3)?
    val data = sc.textFile("data/dummy.csv")

    // LabeledPoint is (label, [features]); the CSV columns are x1, x2, y
    val parsedData = data.map { line =>
      val parts = line.split(',')
      val label = parts(2).toDouble
      val features = Array(parts(0), parts(1)) map (_.toDouble)
      LabeledPoint(label, Vectors.dense(features))
    }
    //parsedData.collect().foreach(x => println(x));

    // Scale the features
    /*val scaler = new StandardScaler(withMean = true, withStd = true)
      .fit(parsedData.map(x => x.features))
    val scaledData = parsedData
      .map(x =>
        LabeledPoint(x.label,
          scaler.transform(Vectors.dense(x.features.toArray))))
    scaledData.collect().foreach(x => println(x));*/

    // Building the model: SGD = stochastic gradient descent
    val numIterations = 20 //5
    val step = 9.0 //9.0 //0.7
    val miniBatchFraction = 0.6 //0.7 //0.65 //0.7
    val regParam = 5.0 //3.0 //10.0

    //val model = LinearRegressionWithSGD.train(parsedData, numIterations, step) //scaledData
    val algorithm = new LinearRegressionWithSGD() //train(parsedData, numIterations)
    algorithm.setIntercept(true)
    // Only numIterations is set here; the other optimizer settings keep their defaults
    algorithm.optimizer
      //.setMiniBatchFraction(miniBatchFraction)
      .setNumIterations(numIterations)
      //.setStepSize(step)
      //.setGradient(new LeastSquaresGradient())
      //.setUpdater(new SquaredL2Updater()) //L1Updater //SimpleUpdater //SquaredL2Updater
      //.setRegParam(regParam)

    val model = algorithm.run(parsedData)
    println(s">>>> Model intercept: ${model.intercept}, weights: ${model.weights}")

    // Evaluate model on training examples
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, point.features, prediction)
    }

    // Print out features, actual and predicted values...
    valuesAndPreds.take(10).foreach({ case (v, f, p) =>
      println(s"Features: ${f}, Predicted: ${p}, Actual: ${v}")
    })
  }
}
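For reference, the commented-out scaling step in the code above standardizes each feature column to zero mean and unit standard deviation. The following is a rough sketch of that transformation in plain Scala, not MLlib's implementation; the names StandardizeSketch and standardize are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```scala
object StandardizeSketch {
  // Standardize one feature column: subtract the mean, then divide by the
  // sample standard deviation (n - 1 denominator; an assumption here).
  def standardize(column: Seq[Double]): Seq[Double] = {
    val n = column.size
    val mean = column.sum / n
    val variance = column.map(v => (v - mean) * (v - mean)).sum / (n - 1)
    val std = math.sqrt(variance)
    column.map(v => (v - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    // The two feature columns from the dummy data.
    val x1 = (1 to 11).map(_.toDouble)
    val x2 = Seq(0.1, 0.2, 0.3, 0.6, 0.8, 1.2, 1.6, 1.9, 2.1, 2.4, 2.7)
    println(standardize(x1).map(v => f"$v%.3f").mkString(", "))
    println(standardize(x2).map(v => f"$v%.3f").mkString(", "))
  }
}
```

In the raw data x1 runs from 1 to 11 while x2 runs from 0.1 to 2.7; after standardization both columns are on a comparable scale, which generally makes a single step size easier to choose. Note that weights learned on standardized features are expressed in the scaled units, so they would need to be converted back before comparing them with [2, 3].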
As described in the documentation (https://spark.apache.org/docs/1.0.2/mllib-optimization.html), selecting the best step size for SGD methods can often be delicate.
I would try lower values, for example:
// Build linear regression model with a much smaller step size
val regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.001)
val model = regression.run(parsedData)
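To illustrate why the step size matters so much on this data, here is a small sketch of full-batch gradient descent in plain Scala. It is not MLlib's implementation: the object name StepSizeDemo and the fit helper are made up, the loss is assumed to be the mean of 0.5 * (prediction - y)^2, and the step at iteration t is shrunk to step / sqrt(t), following the decay described in the linked documentation.

```scala
object StepSizeDemo {
  // (x1, x2, y) rows from the question's dummy data.
  val rows: Array[(Double, Double, Double)] = Array(
    (1.0, 0.1, 6.3), (2.0, 0.2, 8.6), (3.0, 0.3, 10.9), (4.0, 0.6, 13.8),
    (5.0, 0.8, 16.4), (6.0, 1.2, 19.6), (7.0, 1.6, 22.8), (8.0, 1.9, 25.7),
    (9.0, 2.1, 28.3), (10.0, 2.4, 31.2), (11.0, 2.7, 34.1)
  )

  // Full-batch gradient descent on the mean of 0.5 * (prediction - y)^2,
  // with the step at iteration t shrunk to step / sqrt(t).
  // Returns Array(w1, w2, intercept), starting from all zeros.
  def fit(step: Double, iterations: Int): Array[Double] = {
    val w = Array(0.0, 0.0, 0.0)
    for (t <- 1 to iterations) {
      val grad = Array(0.0, 0.0, 0.0)
      for ((x1, x2, y) <- rows) {
        val diff = w(0) * x1 + w(1) * x2 + w(2) - y
        grad(0) += diff * x1
        grad(1) += diff * x2
        grad(2) += diff // the intercept behaves like a constant feature of 1
      }
      val gamma = step / math.sqrt(t.toDouble)
      for (i <- w.indices) w(i) -= gamma * grad(i) / rows.length
    }
    w
  }

  def main(args: Array[String]): Unit = {
    for (step <- Seq(1.0, 0.001)) {
      val w = fit(step, iterations = 20)
      println(s"step=$step -> intercept=${w(2)}, weights=[${w(0)}, ${w(1)}]")
    }
  }
}
```

In this sketch a step of 1.0 overshoots on every iteration, so the weights grow to astronomically large values within 20 iterations, much like the outputs in the question. A step of 0.001 keeps the iterates bounded, although 20 iterations are far too few to reach [4, [2, 3]]; more iterations would be needed as well.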