Problem description

According to LinearRegressionSummary (Spark 2.1.0 JavaDoc), p-values are only available for the "normal" solver.

This value is only available when using the "normal" solver.

普通"求解器到底是什么?

What the hell is the "normal" solver?

This is what I am doing:

import org.apache.spark.ml.{Pipeline, PipelineModel} 
import org.apache.spark.ml.evaluation.RegressionEvaluator 
import org.apache.spark.ml.feature.VectorAssembler 
import org.apache.spark.ml.regression.LinearRegressionModel 
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder} 
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.{DataFrame, SparkSession}
    .
    .
    .
val (trainingData, testData): (DataFrame, DataFrame) = 
  com.acme.pta.accuracy.Util.splitData(output, testProportion)
    .
    .
    .
val lr = 
  new org.apache.spark.ml.regression.LinearRegression()
  .setSolver("normal").setMaxIter(maxIter)

val pipeline = new Pipeline()
  .setStages(Array(lr))

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.elasticNetParam, Array(0.2, 0.4, 0.8, 0.9))
  .addGrid(lr.regParam, Array(0.6, 0.3, 0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds) // Use 3+ in practice

val cvModel: CrossValidatorModel = cv.fit(trainingData)

val pipelineModel: PipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val lrModel: LinearRegressionModel = 
  pipelineModel.stages(0).asInstanceOf[LinearRegressionModel]

val modelSummary = lrModel.summary
Holder.log.info("lrModel.summary: " + modelSummary)
try {
  Holder.log.info("feature p values: ")
  // Exception occurs on line below.
  val featuresAndPValues = features.zip(lrModel.summary.pValues)
  featuresAndPValues.foreach(
    (featureAndPValue: (String, Double)) => 
    Holder.log.info(
      "feature: " + featureAndPValue._1 + ": " + featureAndPValue._2))
} catch {
  case _: java.lang.UnsupportedOperationException 
            => Holder.log.error("Cannot compute p-values")
}

I am still getting the UnsupportedOperationException.

The exception message is:

No p-value available for this LinearRegressionModel

Is there something else I need to be doing? I'm using

  "org.apache.spark" %% "spark-mllib" % "2.1.1"

Is pValues supported in that version?

Recommended answer

Update

In the stock LinearRegression, pValues and the other "normal" statistics are only present when one of the parameters elasticNetParam or regParam is zero. So you can change

.addGrid( lr.elasticNetParam, Array( 0.0 ) )

or

.addGrid( lr.regParam, Array( 0.0 ) )
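
For example, here is a grid that keeps several regularization strengths but pins elasticNetParam to 0.0 (pure ridge), so every fitted model exposes pValues. This is a sketch based on the grid from the question:

val paramGrid = new ParamGridBuilder()
  .addGrid( lr.elasticNetParam, Array( 0.0 ) )           // ridge only: keeps the "normal" statistics available
  .addGrid( lr.regParam, Array( 0.6, 0.3, 0.1, 0.01 ) )  // regularization strength may stay non-zero
  .build()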

Solution 2

Make a custom version of LinearRegression that explicitly uses:

  1. the "normal" solver for regression;
  2. the Cholesky solver for WeightedLeastSquares.

I made this class as an extension of the ml.regression package, so that it can access package-private members such as LinearRegressionParams.

package org.apache.spark.ml.regression

import scala.collection.mutable

import org.apache.spark.SparkException
import org.apache.spark.internal.Logging
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.optim.WeightedLeastSquares
import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
import org.apache.spark.ml.util._
import org.apache.spark.mllib.linalg.VectorImplicits._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions._

class CholeskyLinearRegression ( override val uid: String )
    extends Regressor[ Vector, CholeskyLinearRegression, LinearRegressionModel ]
    with LinearRegressionParams with DefaultParamsWritable with Logging {

    import CholeskyLinearRegression._

    def this() = this(Identifiable.randomUID("linReg"))

    def setRegParam(value: Double): this.type = set(regParam, value)
    setDefault(regParam -> 0.0)

    def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
    setDefault(fitIntercept -> true)

    def setStandardization(value: Boolean): this.type = set(standardization, value)
    setDefault(standardization -> true)

    def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
    setDefault(elasticNetParam -> 0.0)

    def setMaxIter(value: Int): this.type = set(maxIter, value)
    setDefault(maxIter -> 100)

    def setTol(value: Double): this.type = set(tol, value)
    setDefault(tol -> 1E-6)

    def setWeightCol(value: String): this.type = set(weightCol, value)

    def setSolver(value: String): this.type = set(solver, value)
    setDefault(solver -> Auto)

    def setAggregationDepth(value: Int): this.type = set(aggregationDepth, value)
    setDefault(aggregationDepth -> 2)

    override protected def train(dataset: Dataset[_]): LinearRegressionModel = {

        // Extract the number of features before deciding optimization solver.
        val numFeatures = dataset.select(col($(featuresCol))).first().getAs[Vector](0).size
        val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))

        val instances: RDD[Instance] = 
            dataset
            .select( col( $(labelCol) ), w, col( $(featuresCol) ) )
            .rdd.map {
                case Row(label: Double, weight: Double, features: Vector) =>
                Instance(label, weight, features)
            }

        // if (($(solver) == Auto &&
        //   numFeatures <= WeightedLeastSquares.MAX_NUM_FEATURES) || $(solver) == Normal) {
        // For low dimensional data, WeightedLeastSquares is more efficient since the
        // training algorithm only requires one pass through the data. (SPARK-10668)

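        // Unlike the stock LinearRegression, always request the Cholesky solver,
        // so that diagInvAtWA (and with it pValues) is populated for any
        // regParam / elasticNetParam combination.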
        val optimizer = new WeightedLeastSquares( 
            $(fitIntercept), 
            $(regParam),
            elasticNetParam = $(elasticNetParam), 
            $(standardization), 
            true,
            solverType = WeightedLeastSquares.Cholesky, 
            maxIter = $(maxIter), 
            tol = $(tol)
        )

        val model = optimizer.fit(instances)

        val lrModel = copyValues(new LinearRegressionModel(uid, model.coefficients, model.intercept))
        val (summaryModel, predictionColName) = lrModel.findSummaryModelAndPredictionCol()

        val trainingSummary = new LinearRegressionTrainingSummary(
            summaryModel.transform(dataset),
            predictionColName,
            $(labelCol),
            $(featuresCol),
            summaryModel,
            model.diagInvAtWA.toArray,
            model.objectiveHistory
        )

        lrModel
        .setSummary( Some( trainingSummary ) )

        lrModel
    }

    override def copy(extra: ParamMap): CholeskyLinearRegression = defaultCopy(extra)
}

object CholeskyLinearRegression 
    extends DefaultParamsReadable[CholeskyLinearRegression] {

    override def load(path: String): CholeskyLinearRegression = super.load(path)

    val MAX_FEATURES_FOR_NORMAL_SOLVER: Int = WeightedLeastSquares.MAX_NUM_FEATURES

    /** String name for "auto". */
    private[regression] val Auto = "auto"

    /** String name for "normal". */
    private[regression] val Normal = "normal"

    /** String name for "l-bfgs". */
    private[regression] val LBFGS = "l-bfgs"

    /** Set of solvers that LinearRegression supports. */
    private[regression] val supportedSolvers = Array(Auto, Normal, LBFGS)
}

All you have to do is paste it into a separate file in your project and change LinearRegression to CholeskyLinearRegression in your code.

val lr = new CholeskyLinearRegression() // new LinearRegression()
        .setSolver( "normal" )
        .setMaxIter( maxIter )

It works with non-zero params and gives pValues. Tested on the following param grid:

val paramGrid = new ParamGridBuilder()
        .addGrid( lr.elasticNetParam, Array( 0.2, 0.4, 0.8, 0.9 ) )
        .addGrid( lr.regParam, Array( 0.6, 0.3, 0.1, 0.01 ) )
        .build()
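
As a quick sanity check (a sketch, reusing the CrossValidator pipeline from the question), the best model's summary now exposes pValues even with these non-zero parameters:

val cvModel = cv.fit( trainingData )
val lrModel = cvModel
  .bestModel.asInstanceOf[ PipelineModel ]
  .stages( 0 ).asInstanceOf[ LinearRegressionModel ]
println( "pValues: " + lrModel.summary.pValues.mkString( ", " ) )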

Full investigation

I initially thought that the main issue was the model not being fully preserved: the trained model is not retained after fitting in CrossValidator, which is understandable given the memory consumption. There is an ongoing debate on how this should be resolved (issue in JIRA).

You can see in the commented section that I tried to extract the parameters from the best model in order to run it again. Then I found out that the model summary is fine; it's just that for some parameters diagInvAtWA has length 1 and contains basically a zero.

For ridge regression, i.e. Tikhonov regularization (elasticNetParam = 0), pValues and the other "normal" statistics can be computed for any regParam, but for the Lasso method and anything in between (elastic net) they cannot. The same goes for regParam = 0: pValues were computed for any elasticNetParam.
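
To see this empirically, here is a small sketch (assuming the training DataFrame from the reproduction code below) that fits one model per (regParam, elasticNetParam) combination and reports whether pValues is available:

import org.apache.spark.ml.regression.LinearRegression

for ( reg <- Seq( 0.0, 0.1 ); en <- Seq( 0.0, 0.5 ) ) {
  val model = new LinearRegression()
    .setSolver( "normal" ).setRegParam( reg ).setElasticNetParam( en )
    .fit( training )
  val status =
    try { model.summary.pValues; "available" }
    catch { case _: UnsupportedOperationException => "unavailable" }
  println( s"regParam=$reg elasticNetParam=$en -> pValues $status" )
}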

Why is that?

LinearRegression uses the WeightedLeastSquares optimizer for the "normal" solver, with solverType = WeightedLeastSquares.Auto. This optimizer has two options for the underlying solver: QuasiNewton or Cholesky. The former is selected only when both regParam and elasticNetParam are non-zero:

val solver = if (
    ( solverType == WeightedLeastSquares.Auto && 
        elasticNetParam != 0.0 && 
        regParam != 0.0 ) ||
    ( solverType == WeightedLeastSquares.QuasiNewton ) ) {

    ...
    new QuasiNewtonSolver(fitIntercept, maxIter, tol, effectiveL1RegFun)
} else {
    new CholeskySolver
}

So with your parameter grid the QuasiNewtonSolver will always be used, because there is no combination of regParam and elasticNetParam where one of them is zero.

We know that in order to get pValues and the other "normal" statistics, such as the t-statistics or the standard errors of the coefficients, the diagonal of the matrix (A^T * W * A)^-1 (diagInvAtWA) must not be a vector containing only a single zero. This condition is checked in the definition of pValues.
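
For reference, here is a minimal sketch of how those statistics follow from diagInvAtWA. It mirrors the standard OLS formulas, not Spark's exact source; TDistribution comes from commons-math3, which Spark already depends on, and all inputs are assumed given:

import org.apache.commons.math3.distribution.TDistribution

// coefficients and diagInvAtWA as in the summary; sigma2 is the estimated
// residual variance and dof the degrees of freedom (n - number of params).
def pValuesFrom( coefficients: Array[ Double ], diagInvAtWA: Array[ Double ],
                 sigma2: Double, dof: Int ): Array[ Double ] = {
  val dist = new TDistribution( dof.toDouble )
  coefficients.zip( diagInvAtWA ).map { case ( beta, d ) =>
    val stdErr = math.sqrt( d * sigma2 )  // std. error of the coefficient
    val tValue = beta / stdErr            // t-statistic
    2.0 * ( 1.0 - dist.cumulativeProbability( math.abs( tValue ) ) )  // two-sided p-value
  }
}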

diagInvAtWA is a vector of the diagonal elements of a packed upper triangular matrix (solution.aaInv):

val diagInvAtWA = solution.aaInv.map { inv => ...

For the Cholesky solver this matrix is computed, but for QuasiNewton it is not: the second parameter of NormalEquationSolution is this matrix.

So technically you could make your own version of LinearRegression that always uses the Cholesky solver, which is what the CholeskyLinearRegression class in Solution 2 above does.

In this example I used the data file sample_linear_regression_data.txt (it ships with the Spark repository under data/mllib).

Full reproduction code

import org.apache.spark._

import org.apache.spark.ml.{Pipeline, PipelineModel} 
import org.apache.spark.ml.evaluation.{RegressionEvaluator, BinaryClassificationEvaluator}
import org.apache.spark.ml.feature.VectorAssembler 
import org.apache.spark.ml.regression.{LinearRegressionModel, LinearRegression}
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder} 
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.ml.param.ParamMap

object Main {

    def main( args: Array[ String ] ): Unit = {

        val spark =
            SparkSession
            .builder()
            .appName( "SO" )
            .master( "local[*]" )
            .config( "spark.driver.host", "localhost" )
            .getOrCreate()

        import spark.implicits._

        val data = 
            spark
            .read
            .format( "libsvm" )
            .load( "./sample_linear_regression_data.txt" )

        val Array( training, test ) = 
            data
            .randomSplit( Array( 0.9, 0.1 ), seed = 12345 )

        val maxIter = 10

        val lr = new LinearRegression()
            .setSolver( "normal" )
            .setMaxIter( maxIter )

        val paramGrid = new ParamGridBuilder()
            // .addGrid( lr.elasticNetParam, Array( 0.2, 0.4, 0.8, 0.9 ) )
            .addGrid( lr.elasticNetParam, Array( 0.0 ) )
            .addGrid( lr.regParam, Array( 0.6, 0.3, 0.1, 0.01 ) )
            .build()

        val pipeline = new Pipeline()
            .setStages( Array( lr ) )

        val cv = new CrossValidator()
            .setEstimator( pipeline )
            .setEvaluator( new RegressionEvaluator )
            .setEstimatorParamMaps( paramGrid )
            .setNumFolds( 2 )  // Use 3+ in practice

        val cvModel = 
            cv
            .fit( training )

        val pipelineModel: PipelineModel = 
            cvModel
            .bestModel
            .asInstanceOf[ PipelineModel ]

        val lrModel: LinearRegressionModel = 
            pipelineModel
            .stages( 0 )
            .asInstanceOf[ LinearRegressionModel ]

        // Technically there is a way to use exact ParamMap
        // to build a new LR but for the simplicity I'll 
        // get and set them explicitly

        // lrModel.params.foreach( ( param ) => {

        //     println( param )
        // } )

        // val bestLr = new LinearRegression()
        //     .setSolver( "normal" )
        //     .setMaxIter( maxIter )
        //     .setRegParam( lrModel.getRegParam )
        //     .setElasticNetParam( lrModel.getElasticNetParam )

        // val bestLrModel = bestLr.fit( training )

        val modelSummary = 
            lrModel
            .summary

        println( "lrModel pValues: " + modelSummary.pValues.mkString( ", " ) )

        spark.stop()
    }
}

Original answer

There are three solver algorithms available (a selection sketch follows the list):

  • l-bfgs - the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm, a limited-memory quasi-Newton optimization method.
  • normal - uses the Normal Equation as an analytical solution to the linear regression problem. It is basically a weighted least squares or reweighted least squares approach.
  • auto - the solver algorithm is selected automatically. The Normal Equations solver will be used when possible, but it automatically falls back to iterative optimization methods when needed.
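
Selecting one is just a parameter on the estimator. A minimal sketch ("auto" is the default):

val lrNormal = new LinearRegression().setSolver( "normal" )  // analytic; enables the summary statistics below
val lrLbfgs  = new LinearRegression().setSolver( "l-bfgs" )  // iterative quasi-Newton
val lrAuto   = new LinearRegression().setSolver( "auto" )    // default: normal when possible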

The coefficientStandardErrors, tValues and pValues are only available when using the "normal" solver, because they are all based on diagInvAtWA - the diagonal of the matrix (A^T * W * A)^-1.
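
When they are available, all three can be read off the model summary (a sketch, assuming lrModel from the reproduction code above):

val s = lrModel.summary
println( "std. errors: " + s.coefficientStandardErrors.mkString( ", " ) )
println( "t-values:    " + s.tValues.mkString( ", " ) )
println( "p-values:    " + s.pValues.mkString( ", " ) )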
