Random forest algorithm demo (Python, Spark)

Key parameters

The two most important parameters, and the ones most often tuned to improve results, are numTrees and maxDepth.

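numTrees (the number of decision trees): adding more trees lowers the variance of the predictions, which usually gives higher accuracy on test data. Training time grows roughly linearly with numTrees.
maxDepth: the maximum depth that each decision tree in the forest may reach; this parameter was already introduced in the discussion of decision trees. A deeper tree is more expressive, but it takes longer to train and is more prone to overfitting. Note, however, that a random forest and a single decision tree place different demands on this parameter. Because a random forest votes over or averages the predictions of many trees, which lowers the variance of the result, it is less prone to overfitting than a single decision tree. A random forest can therefore use a larger maxDepth than a single decision tree model would.
Some references even say that every tree in a random forest should be grown as far as possible, with no pruning. Either way, it is still worth experimenting with maxDepth to see whether it improves prediction quality.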
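The variance-reduction argument for numTrees can be illustrated with a small standard-library calculation. This is only a sketch under a strong assumption that real forests do not satisfy: each tree is treated as an independent voter that is correct with the same probability p. The function name `majority_vote_accuracy` and the value p = 0.7 are hypothetical and chosen purely for illustration.

```python
from math import comb


def majority_vote_accuracy(num_trees, p):
    """Probability that a strict majority of num_trees voters is correct.

    Assumption (not true of real forests): the voters are independent
    and each one is correct with probability p.
    """
    if num_trees % 2 == 0:
        raise ValueError("use an odd number of trees so that ties cannot occur")
    needed = num_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(num_trees, k) * p ** k * (1 - p) ** (num_trees - k)
        for k in range(needed, num_trees + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 11, 51):
        print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Trees in a real forest are trained on overlapping data and are correlated, so the gain in practice is smaller than this idealized calculation suggests and levels off, while training cost keeps growing roughly linearly with numTrees.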
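Two further parameters, subsamplingRate and featureSubsetStrategy, usually do not need tuning. They can be changed to speed up training, but doing so may affect the model's predictive performance (if you do need to tune them, read the example and its comments below carefully).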
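To make featureSubsetStrategy more concrete, the sketch below shows how the strategy names translate into a number of candidate features per split. It is plain Python and not a call into Spark. The "auto" rule follows the Spark MLlib documentation ("all" for a single tree, otherwise "sqrt" for classification and "onethird" for regression); the rounding in `features_per_node` is an assumption about the implementation, and both function names are hypothetical.

```python
import math


def resolve_strategy(strategy, num_trees, is_classification):
    """Resolve "auto" to a concrete strategy name; other names pass through."""
    if strategy != "auto":
        return strategy
    if num_trees == 1:
        return "all"
    return "sqrt" if is_classification else "onethird"


def features_per_node(strategy, num_features):
    """Number of features considered at each split (rounding is assumed)."""
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == "log2":
        return max(1, int(math.ceil(math.log2(num_features))))
    if strategy == "onethird":
        return int(math.ceil(num_features / 3.0))
    raise ValueError("unsupported strategy: %r" % (strategy,))


if __name__ == "__main__":
    chosen = resolve_strategy("auto", num_trees=3, is_classification=True)
    print(chosen, features_per_node(chosen, 100))
```

Considering fewer features per split makes each tree cheaper to train and less correlated with the others, which is why changing this setting trades training speed against predictive quality.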

"""
Random Forest Classification Example.
"""
from __future__ import print_function

from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonRandomForestClassificationExample")

    # Load and parse the data file into an RDD of LabeledPoint.
    data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')

    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a RandomForest model.
    #  Empty categoricalFeaturesInfo indicates all features are continuous.
    #  Note: Use larger numTrees in practice.
    #  Setting featureSubsetStrategy="auto" lets the algorithm choose.
    model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                         numTrees=3, featureSubsetStrategy="auto",
                                         impurity='gini', maxDepth=4, maxBins=32)

    # Evaluate model on test instances and compute test error
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    # Index into the (label, prediction) pair: tuple-unpacking lambdas such as
    # `lambda (v, p): ...` are a syntax error in Python 3.
    testErr = labelsAndPredictions.filter(
        lambda lp: lp[0] != lp[1]).count() / float(testData.count())
    print('Test Error = ' + str(testErr))
    print('Learned classification forest model:')
    print(model.toDebugString())

    # Save and load model
    model.save(sc, "target/tmp/myRandomForestClassificationModel")
    sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")

What the model looks like (printed by model.toDebugString()):

TreeEnsembleModel classifier with 3 trees
  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0
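To read this output, it can help to transcribe the three printed trees by hand and combine them with a majority vote, which is how a classification forest turns per-tree predictions into one label. The sketch below is plain Python, not Spark code. The function names and the sparse-dict input format (feature index mapped to value, missing features treated as 0.0) are assumptions made for illustration.

```python
from collections import Counter


def tree_0(x):
    if x.get(511, 0.0) <= 0.0:
        return 0.0 if x.get(434, 0.0) <= 0.0 else 1.0
    return 0.0


def tree_1(x):
    return 0.0 if x.get(490, 0.0) <= 31.0 else 1.0


def tree_2(x):
    if x.get(302, 0.0) <= 0.0:
        if x.get(461, 0.0) <= 0.0:
            return 1.0 if x.get(208, 0.0) <= 107.0 else 0.0
        return 1.0
    return 0.0


def forest_predict(x):
    """Majority vote over the three transcribed trees."""
    votes = Counter(tree(x) for tree in (tree_0, tree_1, tree_2))
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical inputs: an all-zero vector and one with two active features.
    print(forest_predict({}))
    print(forest_predict({434: 5.0, 490: 100.0}))
```

With three trees and two classes a tie cannot occur, so the vote is always decided. For the all-zero vector only Tree 2 votes 1.0, so the forest predicts 0.0; for the second input all three trees vote 1.0.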