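I'm trying to implement R's feature importance scoring method for a random forest regression model in sklearn. According to R's documentation:
The first measure is computed from permuting OOB data: For each tree,
the prediction error on the out-of-bag portion of the data is recorded
(error rate for classification, MSE for regression). Then the same
is done after permuting each predictor variable. The difference between
the two are then averaged over all trees, and normalized by
the standard deviation of the differences. If the standard deviation
of the differences is equal to 0 for a variable, the division is not done
(but the average is almost always equal to 0 in that case).
So, if I understand correctly, I need to be able to permute each predictor variable (feature) for the OOB samples of each tree.
I understand that I can access each tree in a trained forest with something like this: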
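To make the normalization step in the quoted description concrete, here is a minimal sketch of it in plain Python. The function name `normalized_importance` and the sample numbers are made up for illustration, and the use of the population standard deviation is an assumption, since the quoted text does not say which estimator R uses.

```python
from statistics import mean, pstdev


def normalized_importance(per_tree_diffs):
    """Average the per-tree error increases and scale by their spread.

    per_tree_diffs holds one value per tree:
    (OOB error after permuting the feature) - (original OOB error).
    """
    avg = mean(per_tree_diffs)
    sd = pstdev(per_tree_diffs)  # assumption: population standard deviation
    # Per the quoted description, skip the division when the spread is 0.
    return avg if sd == 0 else avg / sd


# Hypothetical per-tree MSE increases for a single feature.
diffs = [0.5, 1.5, 1.0, 1.0]
print(normalized_importance(diffs))

# A feature that no tree depends on: all differences are 0, so no division.
print(normalized_importance([0.0, 0.0, 0.0]))
```

The remaining difficulty, which the rest of the question is about, is obtaining the per-tree differences in the first place.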
from sklearn.ensemble import RandomForestRegressor

numberTrees = 100
clf = RandomForestRegressor(n_estimators=numberTrees)
clf.fit(X, Y)
for tree in clf.estimators_:
    pass  # do something with each fitted tree
Is there any way to get the list of samples that are OOB for each tree? Perhaps I could use each tree's random_state to derive the list of OOB samples?

Best answer
Although R uses the OOB samples, I found that I get similar results in scikit-learn by using all of the training samples. I'm doing the following:
# permute training data and score against its own model
from collections import defaultdict
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# X_train (a DataFrame), y_train, trees, num_features and leaf are assumed to be defined already
epoch = 3
seeds = range(epoch)
scores = defaultdict(list)  # {feature: relative change in R^2}
# repeat the process several times, then average the score for each feature
for j in range(epoch):
    clf = RandomForestRegressor(n_jobs=-1, n_estimators=trees, random_state=seeds[j],
                                max_features=num_features, min_samples_leaf=leaf)
    clf = clf.fit(X_train, y_train)
    acc = clf.score(X_train, y_train)
    print('Epoch', j)
    # for each feature, permute its values and check the resulting score
    for i, col in enumerate(X_train.columns):
        if i % 200 == 0:
            print("- feature %s of %s permuted" % (i, X_train.shape[1]))
        X_train_copy = X_train.copy()
        X_train_copy[col] = np.random.permutation(X_train[col])
        shuff_acc = clf.score(X_train_copy, y_train)
        scores[col].append((acc - shuff_acc) / acc)
# get mean across epochs
scores_mean = {k: np.mean(v) for k, v in scores.items()}
# sort scores (best first)
scores_sorted = pd.DataFrame.from_dict(scores_mean, orient='index').sort_values(by=0, ascending=False)
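For comparison, here is a rough sketch of the OOB-based procedure described in the question. It does not recover the OOB samples from a fitted RandomForestRegressor. Instead it sidesteps the problem by doing the bagging by hand with DecisionTreeRegressor, so that each tree's bootstrap and OOB rows are known explicitly. The function name `oob_permutation_importance`, its parameters, and the synthetic demo data are all made up for illustration, and the result is not guaranteed to match R's numbers.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def oob_permutation_importance(X, y, n_trees=100, max_features=None, seed=0):
    """Hand-rolled bagging so that each tree's OOB rows are known explicitly."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    rng = np.random.RandomState(seed)
    diffs = []  # one row of per-feature MSE increases per tree

    for _ in range(n_trees):
        # Bootstrap sample: draw n_samples row indices with replacement.
        boot = rng.randint(0, n_samples, n_samples)
        # OOB rows are the ones that were never drawn for this tree.
        oob = np.setdiff1d(np.arange(n_samples), boot)
        if oob.size == 0:
            continue  # only plausible when n_samples is tiny
        tree = DecisionTreeRegressor(max_features=max_features,
                                     random_state=rng.randint(2**31 - 1))
        tree.fit(X[boot], y[boot])

        X_oob, y_oob = X[oob], y[oob]
        base_mse = mean_squared_error(y_oob, tree.predict(X_oob))
        row = np.empty(n_features)
        for j in range(n_features):
            # Permute one feature among the OOB rows and re-score the tree.
            X_perm = X_oob.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            row[j] = mean_squared_error(y_oob, tree.predict(X_perm)) - base_mse
        diffs.append(row)

    diffs = np.array(diffs)
    mean_diff = diffs.mean(axis=0)
    sd = diffs.std(axis=0)
    # Divide by the standard deviation only where it is nonzero.
    return np.divide(mean_diff, sd, out=mean_diff.copy(), where=sd > 0)


# Synthetic demo data (hypothetical): only the first feature drives y.
demo_rng = np.random.RandomState(0)
X_demo = demo_rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * demo_rng.normal(size=200)
importances = oob_permutation_importance(X_demo, y_demo, n_trees=50)
print(importances)
```

On data like this, one would expect the first feature to receive a clearly positive score. The trade-off against the training-set approach above is an extra loop over trees in Python, in exchange for scoring each tree only on rows it has not seen.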