This article describes how to extract the rule path of each data point from a decision tree using sklearn in Python. It should be a useful reference for anyone tackling the same problem.
Problem description
I'm using a decision tree model and I want to extract the decision path of each data point, in order to understand what caused Y rather than to predict it.
How can I do that? I couldn't find any documentation.
Solution
Here is an example using the iris dataset.
from sklearn.datasets import load_iris
from sklearn import tree
import graphviz

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
# this will create an iris.pdf file with the rule path
graph.render("iris")
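If you only need the rules as plain text rather than a rendered PDF, here is a minimal alternative sketch, assuming scikit-learn >= 0.21 (where export_text was introduced) and the clf fitted above:

from sklearn.tree import export_text

# Print the whole tree as nested if/else rules in plain text
# (assumes scikit-learn >= 0.21, which added export_text)
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)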
EDIT: the following code is from the sklearn documentation, with some small changes to address your goal.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

# The decision estimator has an attribute called tree_ which stores the entire
# tree structure and allows access to low level attributes. The binary tree
# tree_ is represented as a number of parallel arrays. The i-th element of each
# array holds information about the node `i`. Node 0 is the tree's root. NOTE:
# Some of the arrays only apply to either leaves or split nodes. In this
# case the values of nodes of the other type are arbitrary!
#
# Among those arrays, we have:
#   - left_child, id of the left child of the node
#   - right_child, id of the right child of the node
#   - feature, feature used for splitting the node
#   - threshold, threshold value at the node
n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold

# The tree structure can be traversed to compute various properties such
# as the depth of each node and whether or not it is a leaf.
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:" % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (node_depth[i] * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))
print()

# First let's retrieve the decision path of each sample. The decision_path
# method allows to retrieve the node indicator functions. A non zero element of
# indicator matrix at the position (i, j) indicates that the sample i goes
# through the node j.
node_indicator = estimator.decision_path(X_test)

# Similarly, we can also have the leaves ids reached by each sample.
leave_id = estimator.apply(X_test)

# Now, it's possible to get the tests that were used to predict a sample or
# a group of samples. First, let's make it for one sample.

# HERE IS WHAT YOU WANT
sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]

print('Rules used to predict sample %s: ' % sample_id)
for node_id in node_index:
    if leave_id[sample_id] == node_id:  # <-- changed != to ==
        # continue  # <-- commented out
        print("leaf node {} reached, no decision here".format(leave_id[sample_id]))  # <-- added
    else:  # <-- added else to iterate through decision nodes
        if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
            threshold_sign = "<="
        else:
            threshold_sign = ">"

        print("decision id node %s : (X[%s, %s] (= %s) %s %s)"
              % (node_id,
                 sample_id,
                 feature[node_id],
                 X_test[sample_id, feature[node_id]],  # <-- changed i to sample_id
                 threshold_sign,
                 threshold[node_id]))
This will print the following at the end:

Rules used to predict sample 0:
decision id node 0 : (X[0, 3] (= 2.4) > 0.800000011920929)
decision id node 2 : (X[0, 2] (= 5.1) > 4.949999809265137)
leaf node 4 reached, no decision here
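If you want the path for every data point rather than just sample 0, the loop above can be wrapped in a small helper. This is a minimal sketch, not part of sklearn or the original answer; the function name get_rule_path is hypothetical, and it assumes the fitted estimator and X_test from the script above:

def get_rule_path(estimator, X, sample_id):
    # Returns the list of decision rules (as strings) that X[sample_id]
    # follows from the root down to its leaf.
    feature = estimator.tree_.feature
    threshold = estimator.tree_.threshold
    node_indicator = estimator.decision_path(X)
    leave_id = estimator.apply(X)
    node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                        node_indicator.indptr[sample_id + 1]]
    rules = []
    for node_id in node_index:
        if node_id == leave_id[sample_id]:
            rules.append("leaf node %s reached" % node_id)
        elif X[sample_id, feature[node_id]] <= threshold[node_id]:
            rules.append("X[%s, %s] (= %s) <= %s"
                         % (sample_id, feature[node_id],
                            X[sample_id, feature[node_id]], threshold[node_id]))
        else:
            rules.append("X[%s, %s] (= %s) > %s"
                         % (sample_id, feature[node_id],
                            X[sample_id, feature[node_id]], threshold[node_id]))
    return rules

# Example: collect the rule path for every test sample
for sid in range(len(X_test)):
    print(sid, get_rule_path(estimator, X_test, sid))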
This concludes the article on extracting the rule path of each data point from a decision tree with sklearn in Python. We hope the answer above is helpful.