I am trying to implement logistic regression with PySpark.
Here is my code:
from time import time

from numpy import array
from pyspark import SparkContext  # import was missing in the original snippet
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

RES_DIR = "/home/shaahmed115/Pet_Projects/DA/TwitterStream_US_Elections/Features/"

sc = SparkContext('local', 'pyspark')

data_file = RES_DIR + "training.txt"
raw_data = sc.textFile(data_file)
print("Train data size is {}".format(raw_data.count()))

test_data_file = RES_DIR + "testing.txt"
test_raw_data = sc.textFile(test_data_file)
print("Test data size is {}".format(test_raw_data.count()))

def parse_interaction(line):
    # The first field is the class label; the remaining fields are the features.
    line_split = line.split(",")
    return LabeledPoint(float(line_split[0]),
                        array([float(x) for x in line_split[1:]]))

training_data = raw_data.map(parse_interaction)
logit_model = LogisticRegressionWithLBFGS.train(training_data, iterations=10, numClasses=3)
This raises an error:
Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 3 in the input dataset.
Here is a sample of my dataset:
2,1.0,1.0,1.0
0,1.0,1.0,1.0
1,0.0,0.0,0.0
The first element is the class, and the rest is the feature vector. As you can see, there are three classes.
Is there a workaround to make multinomial classification work with this?
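For reference, here is a minimal, Spark-free sketch of how each row above splits into a label and a feature vector; the helper name `parse_line` is illustrative, not part of the original code:

```python
def parse_line(line):
    """Split a CSV row into (label, features); the first field is the class."""
    fields = line.split(",")
    return float(fields[0]), [float(x) for x in fields[1:]]

# The three sample rows from the dataset above.
rows = ["2,1.0,1.0,1.0", "0,1.0,1.0,1.0", "1,0.0,0.0,0.0"]
parsed = [parse_line(r) for r in rows]

labels = sorted({label for label, _ in parsed})
print(labels)     # → [0.0, 1.0, 2.0]  (three distinct classes)
print(parsed[0])  # → (2.0, [1.0, 1.0, 1.0])
```

With three distinct labels in the data, any binary-only trainer will reject it, which is exactly what the error message reports.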
Best Answer
The error you are seeing,
LogisticRegression with ElasticNet in ML package only supports binary
classification.
is clear enough. You can use the mllib
version, which supports multinomial classification: org.apache.spark.mllib.classification.LogisticRegression
/**
* Train a classification model for Multinomial/Binary Logistic Regression using
* Limited-memory BFGS. Standard feature scaling and L2 regularization are used by default.
* NOTE: Labels used in Logistic Regression should be {0, 1, ..., k - 1}
* for k classes multi-label classification problem.
*
* Earlier implementations of LogisticRegressionWithLBFGS applies a regularization
* penalty to all elements including the intercept. If this is called with one of
* standard updaters (L1Updater, or SquaredL2Updater) this is translated
* into a call to ml.LogisticRegression, otherwise this will use the existing mllib
* GeneralizedLinearAlgorithm trainer, resulting in a regularization penalty to the
* intercept.
*/
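The NOTE in the documentation above matters in practice: for k classes, the labels must already be encoded as {0, 1, ..., k - 1}. A small Spark-free check of that precondition (the helper name `labels_are_valid` is hypothetical, introduced here only for illustration):

```python
def labels_are_valid(labels, num_classes):
    """Return True if every label is one of 0.0, 1.0, ..., num_classes - 1."""
    valid = {float(i) for i in range(num_classes)}
    return all(label in valid for label in labels)

# Labels from the three-row dataset sample: already encoded as {0, 1, 2}.
sample_labels = [2.0, 0.0, 1.0]
print(labels_are_valid(sample_labels, num_classes=3))  # → True

# A label outside {0, 1, 2} would violate the contract stated in the NOTE.
print(labels_are_valid([1.0, 3.0], num_classes=3))     # → False
```

Running a check like this before calling train() makes the "Labels used in Logistic Regression should be {0, 1, ..., k - 1}" requirement an explicit step rather than a silent assumption.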
Regarding python - LogisticRegressionWithLBFGS raises an error that multinomial classification is not supported, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38961429/