Problem description
I am trying to use the Spark ML DecisionTreeClassifier in a Pipeline without a StringIndexer, because my labels are already indexed as (0.0; 1.0). DecisionTreeClassifier requires double values for the label column, so this code should work:
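import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

def trainDecisionTreeModel(training: RDD[LabeledPoint], sqlc: SQLContext): Unit = {
  import sqlc.implicits._
  val trainingDF = training.toDF()
  // format of this dataframe: [label: double, features: vector]
  val featureIndexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(4)
    .fit(trainingDF)
  val dt = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("indexedFeatures")
  val pipeline = new Pipeline()
    .setStages(Array(featureIndexer, dt))
  pipeline.fit(trainingDF)
}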
But what I actually get is:
java.lang.IllegalArgumentException:
DecisionTreeClassifier was given input with invalid label column label,
without the number of classes specified. See StringIndexer.
Of course, I could just add a StringIndexer and let it do its work on my double "label" field, but I want to use the rawPrediction output column of DecisionTreeClassifier to get the probabilities of 0.0 and 1.0 for each row, like this:
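import org.apache.spark.mllib.linalg.Vector // org.apache.spark.ml.linalg.Vector in Spark 2.x and later
val predictions = model.transform(singletonDF)
// select returns a DataFrame, so take the first Row and read the vector out of it
val rawPrediction = predictions.select("rawPrediction").head.getAs[Vector](0)
val zeroProbability = rawPrediction(0)
val oneProbability = rawPrediction(1)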
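One caveat about the snippet above: according to the Spark ML documentation, rawPrediction for decision trees holds the counts of training labels at the predicting leaf, while the separate probability column holds those counts normalized to sum to 1. A minimal plain-Scala sketch of that normalization follows; the `normalize` helper and the sample counts are made up for illustration and are not part of the Spark API.

```scala
// Hypothetical helper: turn per-class raw counts into probabilities.
def normalize(raw: Array[Double]): Array[Double] = {
  val total = raw.sum
  // Guard against an all-zero vector to avoid dividing by zero.
  if (total == 0.0) raw.map(_ => 0.0) else raw.map(_ / total)
}

// Made-up counts for classes 0.0 and 1.0 at some leaf.
val rawCounts = Array(3.0, 1.0)
val probabilities = normalize(rawCounts)
val zeroProbability = probabilities(0)
val oneProbability = probabilities(1)
```

In practice it is probably simpler to read the probability column directly rather than normalizing rawPrediction by hand.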
If I put a StringIndexer in the Pipeline, I will not know which positions in the rawPrediction vector correspond to my input labels "0.0" and "1.0", because StringIndexer assigns indices by label frequency, which can vary between datasets.
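To make the concern concrete, here is a plain-Scala sketch of frequency-ordered indexing, where the most frequent label gets index 0. It only imitates the ordering rule and does not call Spark; `frequencyIndex` is a hypothetical helper, and the alphabetical tie-break is an assumption added to keep the sketch deterministic.

```scala
// Plain-Scala imitation of frequency-ordered label indexing (no Spark required).
// `frequencyIndex` is a hypothetical helper, not part of the Spark API.
def frequencyIndex(labels: Seq[String]): Map[String, Int] =
  labels
    .groupBy(identity)
    .map { case (label, occurrences) => (label, occurrences.size) }
    .toSeq
    // Most frequent first; ties broken by label (an assumption of this sketch).
    .sortBy { case (label, count) => (-count, label) }
    .map { case (label, _) => label }
    .zipWithIndex
    .toMap

// The same two labels receive different indices depending on which one dominates.
val mostlyZeros = frequencyIndex(Seq("0.0", "0.0", "0.0", "1.0"))
val mostlyOnes = frequencyIndex(Seq("1.0", "1.0", "1.0", "0.0"))
```

With mostly "0.0" labels, "0.0" maps to index 0; with mostly "1.0" labels, it maps to index 1, so the position of each class in the output vector depends on the training data.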
Please help me prepare the data for DecisionTreeClassifier without using StringIndexer, or suggest another way to get the probabilities of my original labels (0.0; 1.0) for each row.
You can always set the required metadata manually:
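import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute

// Declare "label" as a nominal attribute with two classes, in a fixed order
val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata

// Re-alias the existing column so that it carries the metadata
val dfWithMeta = df.withColumn("label", $"label".as("label", meta))
pipeline.fit(dfWithMeta)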
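To check what this metadata tells the classifier, the attribute can be round-tripped through a StructField and inspected. The sketch below needs only the Spark ML classes on the classpath, not a running context; `labelAttr`, `field`, `recovered`, and `numClasses` are illustrative names. The same check should work on a real DataFrame by passing `dfWithMeta.schema("label")` to `Attribute.fromStructField`.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}

// Same attribute as in the answer above, kept as an Attribute instead of Metadata.
val labelAttr = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")

// Convert to a schema field, as it would appear in a DataFrame schema.
val field = labelAttr.toStructField()

// Read the attribute back from the field and extract the number of classes.
val recovered = Attribute.fromStructField(field)
val numClasses: Option[Int] = recovered match {
  case nominal: NominalAttribute => nominal.getNumValues
  case _ => None
}
```

If `numClasses` comes back as `None` for a label column, the classifier has no class count to work with, which is the situation the error message in the question describes.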