python - Error when training a logistic regression model on Apache Spark (SPARK-5063)

I am trying to build a logistic regression model with Apache Spark.
Here is the code.

# mapper is a function that turns each record into a LabeledPoint (label plus feature vector)
parsedData = raw_data.map(mapper)
# extract the feature vectors from the parsed data
featureVectors = parsedData.map(lambda point: point.features)
# fit a standardization model (withMean=True, withStd=True) used to scale the features
scaler = StandardScaler(True, True).fit(featureVectors)
# transform the features to zero mean and unit standard deviation
scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, scaler.transform(lp.features)))
modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations = 10)

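But I get this error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

I am not sure how to solve this. Any help would be greatly appreciated.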

Best answer

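The problem you are seeing is almost the same as the one described in "How to use Java/Scala function from an action or a transformation?". To transform a vector, scaler.transform has to call a Scala function, and that call needs access to the SparkContext. Because you call it inside a map, it runs on the workers, where the SparkContext is not available, hence the error.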
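One way to picture the mechanism: PySpark pickles the function passed to map, together with every object that function captures, before shipping it to the workers. The toy model below uses only the standard library to mimic that situation. `DriverOnlyContext` and `ScalerModel` are hypothetical stand-ins invented for this sketch, not PySpark classes, and the real implementation details may differ.

```python
import pickle


class DriverOnlyContext:
    """Hypothetical stand-in for SparkContext: it refuses to be serialized."""

    def __reduce__(self):
        raise RuntimeError("context can only be used on the driver")


class ScalerModel:
    """Hypothetical stand-in for a fitted model that wraps a JVM object."""

    def __init__(self, context):
        self.context = context  # handle the model would need to call into the JVM

    def transform(self, value):
        return value


def can_ship_to_workers(obj):
    """Return True if obj can be pickled, as captured closure values must be."""
    try:
        pickle.dumps(obj)
    except Exception:  # pickling can fail with several exception types
        return False
    return True


model = ScalerModel(DriverOnlyContext())
print(can_ship_to_workers(model))       # False: the model drags the context along
print(can_ship_to_workers([1.0, 2.0]))  # True: plain data is fine
```

In this picture, a lambda that mentions scaler plays the role of `model`: serializing it pulls in the driver-only context, which is where the exception is raised.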

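The standard way to handle this is to process only the required part of the data (here, the feature vectors) with a single driver-side call, and then zip the results back together with the rest.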

labels = parsedData.map(lambda point: point.label)
featuresTransformed = scaler.transform(featureVectors)

scaledData = (labels
    .zip(featuresTransformed)
    .map(lambda p: LabeledPoint(p[0], p[1])))

modelScaledSGD = LogisticRegressionWithSGD.train(...)

If you do not plan to implement your own methods on top of MLlib components, it may be easier to use the high-level ML API.

Edit:

There are two further possible problems here.

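At this point, LogisticRegressionWithSGD supports only binomial classification (thanks to eliasah for pointing that out). If you need multiclass (multinomial) classification, you can replace it with LogisticRegressionWithLBFGS.
With centering enabled (withMean=True, as in the question), StandardScaler expects dense vectors, so its applicability to sparse data is limited.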

Source: a similar question on Stack Overflow about this error (python - error when training a logistic regression model on Apache Spark, SPARK-5063): https://stackoverflow.com/questions/32196339/