I want to use the pyspark.mllib.tree.RandomForest module to get a proximity matrix for my observations.
Until now my data was small enough to be loaded directly into memory, so I used sklearn.ensemble.RandomForestClassifier to obtain the proximity matrix as follows: suppose X is a matrix holding the features and Y is a vector holding the labels. I trained a random forest to distinguish between objects with label "0" and label "1". With the trained random forest, I want to measure the proximity between each pair of observations in my dataset by counting the number of decision trees in which the two observations land in the same terminal node (leaf). So with 100 decision trees, the proximity between two observations ranges from 0 (never in the same terminal leaf) to 100 (in the same terminal leaf in every tree). The Python implementation:
import numpy
from sklearn import ensemble
## data
print X.shape, Y.shape # X holds 8562 observations with 4281 features each; Y holds the matching 8562 labels
>> (8562, 4281) (8562,)
## train the forest
n_trees = 100
rand_tree = ensemble.RandomForestClassifier(n_estimators=n_trees)
rand_tree.fit(X, Y)
## get proximity matrix
apply_mat = rand_tree.apply(X) # leaf indices, shape (n_observations, n_trees)
obs_num = len(apply_mat)
sim_mat = numpy.eye(obs_num) * len(apply_mat[0]) # diagonal: an observation shares all n_trees leaves with itself
for i in xrange(obs_num):
    for j in xrange(i, obs_num):
        vec_i = apply_mat[i]
        vec_j = apply_mat[j]
        sim_val = len(vec_i[vec_i == vec_j]) # number of trees where i and j share a leaf
        sim_mat[i][j] = sim_val
        sim_mat[j][i] = sim_val
sim_mat_norm = sim_mat / len(apply_mat[0])
print sim_mat_norm.shape
>> (8562, 8562)
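As an aside (not part of the original post): the double Python loop above runs O(n²) iterations in the interpreter. A vectorized sketch that accumulates one boolean comparison per tree (the sim_mat_vec name is my own) should produce the same matrix while only ever materializing one n×n array at a time:
# Sketch: accumulate leaf-agreement counts one tree at a time.
# apply_mat has shape (n_observations, n_trees); broadcasting one leaf-index
# column against itself yields an (n, n) boolean agreement matrix per tree.
sim_mat_vec = numpy.zeros((obs_num, obs_num))
for t in xrange(apply_mat.shape[1]):
    leaves = apply_mat[:, t]
    sim_mat_vec += (leaves[:, None] == leaves[None, :])
sim_mat_vec_norm = sim_mat_vec / apply_mat.shape[1]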
Now the data I'm working with is too big to fit in memory, so I decided to use Spark instead. I can load the data and fit the model, but I haven't found a way to "apply" the random forest to the data in order to get the proximity matrix. Is there any way to get it?
(I use the same implementation as in the Spark documentation: https://spark.apache.org/docs/1.2.0/mllib-ensembles.html#classification):
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
I'd also be happy to hear any other ideas that could solve my problem.
Thanks!
Best Answer
PySpark MLlib models don't provide a direct way to access this information. In theory you can try to extract the trees from the underlying model and predict with each tree separately:
from pyspark.mllib.tree import DecisionTreeModel

numTrees = 3
trees = [DecisionTreeModel(model._java_model.trees()[i])
         for i in range(numTrees)]
# DecisionTreeModel.predict expects feature vectors, not LabeledPoints
predictions = [t.predict(testData.map(lambda lp: lp.features)) for t in trees]
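If you go this route, the per-tree prediction RDDs can be combined into one tuple per observation, e.g. by zipping (a sketch with an important caveat: MLlib exposes only each tree's predicted class, not its leaf index, so agreement here means "the trees predict the same class", which is weaker than the same-leaf proximity computed above):
# Sketch: zip the per-tree prediction RDDs into one tuple per observation.
# This relies on all RDDs deriving from the same parent with identical
# partitioning, which is what RDD.zip requires.
combined = predictions[0].map(lambda p: (p,))
for preds in predictions[1:]:
    combined = combined.zip(preds).map(lambda t: t[0] + (t[1],))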
But it's better to use an ML model instead:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import RandomForestClassifier
df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
indexer = StringIndexer(inputCol="label", outputCol="indexed").fit(df)
df_indexed = indexer.transform(df)
model = RandomForestClassifier(
    numTrees=3, maxDepth=2, labelCol="indexed", seed=42
).fit(df_indexed)
and use the rawPrediction or probability columns:
model.transform(df).select("rawPrediction", "probability").show(5, False)
## +---------------------------------------+-----------------------------------------+
## |rawPrediction |probability |
## +---------------------------------------+-----------------------------------------+
## |[0.0,3.0] |[0.0,1.0] |
## |[2.979591836734694,0.02040816326530612]|[0.9931972789115647,0.006802721088435374]|
## |[2.979591836734694,0.02040816326530612]|[0.9931972789115647,0.006802721088435374]|
## |[2.979591836734694,0.02040816326530612]|[0.9931972789115647,0.006802721088435374]|
## |[2.979591836734694,0.02040816326530612]|[0.9931972789115647,0.006802721088435374]|
## +---------------------------------------+-----------------------------------------+
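For reference when reading this output: rawPrediction here is the sum of the three trees' class-probability vectors (each row sums to numTrees = 3, e.g. 2.9796 + 0.0204 = 3.0), and probability is rawPrediction divided by the number of trees (2.9796 / 3 ≈ 0.9932 above), so both columns reflect how strongly the individual trees agree on an observation.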
Note: if you believe your data really requires Spark, then building a full distance/similarity matrix is unlikely to be a good idea. Just saying.