This article describes how to extract the rule path of each data point from a decision tree using sklearn in Python. It should be a useful reference for anyone tackling the same problem.
Problem description
I'm using a decision tree model and I want to extract the decision path of each data point, in order to understand what caused Y rather than to predict it.
How can I do that? I couldn't find any documentation.
Solution
Here is an example using the iris dataset.
from sklearn.datasets import load_iris
from sklearn import tree
import graphviz

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
# this will create an iris.pdf file with the rule path
graph.render("iris")
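If you only need the rules as plain text rather than a rendered PDF, here is a minimal alternative sketch, assuming scikit-learn >= 0.21 (where export_text was introduced) and the clf fitted above:

from sklearn.tree import export_text

# Print the whole tree as nested if/else rules in plain text
# (assumes scikit-learn >= 0.21, which added export_text)
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)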
EDIT: the following code is from the sklearn documentation, with some small changes to address your goal.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

# The decision estimator has an attribute called tree_ which stores the entire
# tree structure and allows access to low level attributes. The binary tree
# tree_ is represented as a number of parallel arrays. The i-th element of each
# array holds information about the node `i`. Node 0 is the tree's root. NOTE:
# Some of the arrays only apply to either leaves or split nodes. In this
# case the values of nodes of the other type are arbitrary!
#
# Among those arrays, we have:
#   - left_child, id of the left child of the node
#   - right_child, id of the right child of the node
#   - feature, feature used for splitting the node
#   - threshold, threshold value at the node
n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold

# The tree structure can be traversed to compute various properties such
# as the depth of each node and whether or not it is a leaf.
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:" % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (node_depth[i] * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))
print()

# First let's retrieve the decision path of each sample. The decision_path
# method allows to retrieve the node indicator functions. A non zero element of
# indicator matrix at the position (i, j) indicates that the sample i goes
# through the node j.
node_indicator = estimator.decision_path(X_test)

# Similarly, we can also have the leaves ids reached by each sample.
leave_id = estimator.apply(X_test)

# Now, it's possible to get the tests that were used to predict a sample or
# a group of samples. First, let's make it for one sample.

# HERE IS WHAT YOU WANT
sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]

print('Rules used to predict sample %s: ' % sample_id)
for node_id in node_index:
    if leave_id[sample_id] == node_id:  # <-- changed != to ==
        # continue  # <-- commented out
        print("leaf node {} reached, no decision here".format(leave_id[sample_id]))  # <-- added
    else:  # <-- added else to iterate through decision nodes
        if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
            threshold_sign = "<="
        else:
            threshold_sign = ">"

        print("decision id node %s : (X[%s, %s] (= %s) %s %s)"
              % (node_id,
                 sample_id,
                 feature[node_id],
                 X_test[sample_id, feature[node_id]],  # <-- changed i to sample_id
                 threshold_sign,
                 threshold[node_id]))
This will print the following at the end:

Rules used to predict sample 0:
decision id node 0 : (X[0, 3] (= 2.4) > 0.800000011920929)
decision id node 2 : (X[0, 2] (= 5.1) > 4.949999809265137)
leaf node 4 reached, no decision here
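If you want the path for every data point rather than just sample 0, the loop above can be wrapped in a small helper. This is a minimal sketch, not part of sklearn or the original answer; the function name get_rule_path is hypothetical, and it assumes the fitted estimator and X_test from the script above:

def get_rule_path(estimator, X, sample_id):
    # Returns the list of decision rules (as strings) that X[sample_id]
    # follows from the root down to its leaf.
    feature = estimator.tree_.feature
    threshold = estimator.tree_.threshold
    node_indicator = estimator.decision_path(X)
    leave_id = estimator.apply(X)
    node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                        node_indicator.indptr[sample_id + 1]]
    rules = []
    for node_id in node_index:
        if node_id == leave_id[sample_id]:
            rules.append("leaf node %s reached" % node_id)
        elif X[sample_id, feature[node_id]] <= threshold[node_id]:
            rules.append("X[%s, %s] (= %s) <= %s"
                         % (sample_id, feature[node_id],
                            X[sample_id, feature[node_id]], threshold[node_id]))
        else:
            rules.append("X[%s, %s] (= %s) > %s"
                         % (sample_id, feature[node_id],
                            X[sample_id, feature[node_id]], threshold[node_id]))
    return rules

# Example: collect the rule path for every test sample
for sid in range(len(X_test)):
    print(sid, get_rule_path(estimator, X_test, sid))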
This concludes the article on extracting the rule path of each data point from a decision tree with sklearn in Python. We hope the answer above is helpful.