Problem description
I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data, and LabeledPoint requires (Double, Vector), where the Vector must contain Doubles. My features are Strings.
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
// Run training algorithm to build the model
val maxDepth: Int = 3
val isMulticlassWithCategoricalFeatures: Boolean = true
val numClassesForClassification: Int = countPossibilities(labelCol)
val model = DecisionTree.train(LP, Classification, Gini, isMulticlassWithCategoricalFeatures, maxDepth, numClassesForClassification,categoricalFeaturesInfo)
The error I get:
scala> val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
<console>:32: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[String])
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
My resources thus far: tree config, decision tree, labeledpoint
Recommended answer
You can first convert the categories to numbers, then load the data as if all features were numerical.
When you build a decision tree model in Spark, you just need to tell Spark which features are categorical and what each one's arity is (the number of distinct categories of that feature). You do this by specifying a map, Map[Int, Int](), from feature indices to arities.
For example, if your data is:
1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me
You can first transform the data into numerical format:
1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
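The conversion above can be done in plain Scala before anything is handed to Spark. The following is a minimal sketch, not part of the original answer: the object name EncodeCategories and the helpers indexColumn and encode are hypothetical, and it assumes each category gets its index in order of first appearance, which reproduces the numbers shown above.

```scala
// Hypothetical helper object; the names are illustrative, not a Spark API.
object EncodeCategories {

  // Map each distinct string in a column to 0, 1, 2, ... in order of first appearance.
  def indexColumn(values: Seq[String]): Map[String, Int] =
    values.distinct.zipWithIndex.toMap

  // Replace the categorical columns with their indices and parse every
  // other column as a Double.
  def encode(rows: Seq[Array[String]], categoricalCols: Seq[Int]): Seq[Array[Double]] = {
    val indexers: Map[Int, Map[String, Int]] =
      categoricalCols.map(c => c -> indexColumn(rows.map(_(c)))).toMap
    rows.map { row =>
      row.zipWithIndex.map { case (value, col) =>
        indexers.get(col) match {
          case Some(indexer) => indexer(value).toDouble // categorical column
          case None          => value.toDouble          // already numeric
        }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Array("1", "a", "add"),
      Array("2", "b", "more"),
      Array("1", "c", "thinking"),
      Array("3", "a", "to"),
      Array("1", "c", "me")
    )
    // Columns 1 and 2 (zero-based) hold the categorical values.
    encode(rows, Seq(1, 2)).foreach(r => println(r.map(_.toInt).mkString(",")))
  }
}
```

Each encoded row is an Array[Double], which is the type that Vectors.dense accepts according to the error message in the question.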
In that format you can load the data into Spark. Then, if you want to tell Spark that the second and third columns are categorical, you should create a map:
val categoricalFeaturesInfo = Map[Int, Int](1 -> 3, 2 -> 5)
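The map tells Spark that the feature with index 1 has arity 3 and the feature with index 2 has arity 5. The keys are zero-based positions in the feature vector, so if the first column is used as the label rather than as a feature, the same two columns become indices 0 and 1. The values of a feature with arity k must lie in 0, 1, ..., k-1. These features will be treated as categorical when we build a decision tree model, passing the map as a parameter of the training function: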
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
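The arity map does not have to be written by hand; it can be derived from the encoded data. The sketch below is an illustration under stated assumptions rather than anything from the original answer: the object name ArityInfo and its helpers are hypothetical, all three columns are treated as features as in the example above, and the arity is taken as the largest category index plus one. It also picks a maxBins value, since Spark requires maxBins to be at least the largest arity among the categorical features; the baseline of 32 is an assumed starting point.

```scala
// Hypothetical helper object; the names are illustrative, not a Spark API.
object ArityInfo {

  // Arity of each categorical feature, taken as (largest category index + 1).
  // Assumes categories are already encoded as 0, 1, ..., k-1.
  def arities(features: Seq[Array[Double]], categorical: Seq[Int]): Map[Int, Int] =
    categorical.map(i => i -> (features.map(_(i)).max.toInt + 1)).toMap

  // maxBins must be at least the largest arity; `baseline` is an assumed default.
  def chooseMaxBins(info: Map[Int, Int], baseline: Int = 32): Int =
    if (info.isEmpty) baseline else math.max(baseline, info.values.max)

  def main(args: Array[String]): Unit = {
    val features = Seq(
      Array(1.0, 0.0, 0.0),
      Array(2.0, 1.0, 1.0),
      Array(1.0, 2.0, 2.0),
      Array(3.0, 0.0, 3.0),
      Array(1.0, 2.0, 4.0)
    )
    val categoricalFeaturesInfo = arities(features, Seq(1, 2))
    val maxBins = chooseMaxBins(categoricalFeaturesInfo)
    println(s"categoricalFeaturesInfo = $categoricalFeaturesInfo, maxBins = $maxBins")
  }
}
```

For the sample data this gives arity 3 for index 1 and arity 5 for index 2, matching the hand-written map; the two values can then be passed as the categoricalFeaturesInfo and maxBins arguments of the trainClassifier call above.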