python - 无法实现基于类的非二进制 id3 决策解决方案

编辑:我在底部添加了自己的熵/inf 增益/最佳分割方法，以防有人想帮助我调试，这样他们就不必自己编写了!

作为一个挑战，在观看了 this video 之后，我想用 Python 创建一个解决方案(基于类来练习 OOP)。由于这里的数据是分类数据，我想创建树，以便它在视觉上与视频类似(即可能 >2 个 child ，分割和值上的标签等)。

我已尽最大努力调试代码以使其运行。在目前的状态下，它有两个巨大的问题:

(1) 实例化类的子项列表显示了一些 None 值和一些 Tree 对象。由于 None 值破坏了 .predict 函数，我放置了一个临时 if 语句来忽略它们

(2)值属性显示"is"和“否”，它们是目标值而不是特征值。例如

In: dt.tree_.children[0].value
Out: "Yes"

特征分割不在标签属性中。不管节点/子节点，label字段都是None，预测也返回None。我已经浏览了几个小时的代码，但不知道为什么。我已经包含了数据(从视频中复制)和 DataFrame 设置以便于访问并展示我如何尝试运行该程序。

对于这篇很长的帖子，我提前道歉!我试图解释我的逻辑(并且不希望人们认为我只是在要求其他人编写我的代码)，即使它可能会阻止大多数人提供帮助!

注意:决策树类使用嵌套函数，因此 .fit 将始终以 Tree() 的新实例化开始，因此 .predict 将使用应在 .fit 之后填充的 dt.tree_ 属性

树()类:

class Tree():
    def __init__(self, children = [], label = None, value = None):
        self.children = children    #used in place of left and right for binary solutions
        self.label = label          #to label which feature this node's children are split on
        self.value = value          #the values of the above node's split feature. This should always be None for the head node

dt.fit 的伪代码:

def fit(self, data, target, features)
    def run_id3(data, target, features, tree):

        (base case)
        Check if target column has only one unique value.
            If so, set current tree label to target column, add one child below current tree with target value
            return (end recursion)

        find the best feature to split data on
        set current node label to feature
        for each unique value in splitting feature:
            create a node and set value equal to unique value
            append new node to the children list of the current tree
            recur with data filtered for the current unique feature value (split) and with the child tree as the head
    run_id3(data, target, features, self.tree_)

dt.fit 的代码:

class DecisionTree():
    tree_: Tree

    def __init__(self):
        self.tree_ = Tree()
        pass

    def fit(self, data, target, features):
        def run_id3(data, target, features, tree):
            unique_targets = pd.unique(data[target])
            if len(unique_targets) == 1:
                tree.label = target
                tree.children.append(Tree(value=unique_targets[0]))
                return
            best_split = find_best(data, target, features)
            tree.label = best_split
            for unique_val in np.unique(data[best_split]):
                new_tree = Tree()
                new_tree.value = unique_val
                tree.children.append(run_id3(data[data[best_split] == unique_val], target, features, new_tree))

        run_id3(data, target, features, self.tree_)

dt.predict 的伪代码:

def predict(self, row):
    def get_prediction(tree, row):
        check if current node has no children
            return node label (should be target prediction)
        set current column (feature split) to current node label
        for each child of current node
            if child is not null (THIS IS NOT GOOD, EXISTS TO STOP PROGRAM HALTING)
                if child’s value is equal to the value in that column in our test row
                    recur (go down tree), set current child tree to head in parameter

    tree = self.tree_ (so tree starts at the head of the instantiated tree, should be populated after dt.fit)
    return get_prediction(tree, row)

dt.predict 的代码:

    def predict(self, row):
        def get_prediction(tree, row):
            if len(tree.children) == 0:
                return tree.label
            column = tree.label
            for child in tree.children:
# the below conditional is purely to stop the program halting since I haven't worked out why the children attribute keeps adding NoneType objects
                if child is not None:
                    if child.value == row[column]:
                        return get_prediction(child, row)

        tree = self.tree_
        return get_prediction(tree, row)

数据设置:

outlook = ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain', 'Rain']
humidity = ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High', 'High']
wind = ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong', 'Weak']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', '?']

columns = ["Outlook", "Humidity", "Wind", "Play"]
data = pd.DataFrame([outlook, humidity, wind, play]).T
data.columns = columns
train = data.iloc[:-1, :]
test = data.iloc[-1, :3]
features = columns.copy()
features.remove("Play")
target = "Play"

dt = DecisionTree()
dt.fit(train, target, features)
pred = dt.predict(test)

信息增益方法:

import numpy as np
import pandas as pd



def entropy(column):
    elements, counts = np.unique(column, return_counts=True)
    # if statement in comprehension stops nan result since 0*log2(x) is undefined, returns 0. in this case,
    # 1*log2(1) + 0*log2(0) = 0. zero entropy result, zero uncertainty is consistent with theory
    entropy = np.sum(
        [-(counts[i] / np.sum(counts)) * np.log2(counts[i] / np.sum(counts)) if counts[i] > 0 else 0 for i in
         range(len(counts))])
    return entropy


def information_gain(data, split_name, target_name):
    target_entropy = entropy(data[target_name])
    vals, counts = np.unique(data[split_name], return_counts=True)
    weighted_entropy = np.sum(
        [counts[i] / np.sum(counts) * entropy(data.loc[data[split_name] == vals[i], target_name]) for i in
         range(len(counts))])
    return target_entropy - weighted_entropy


def find_best(data, target_name, features):
    max_gain = 0
    best_col = ""
    for col in features:
        gain = information_gain(data, col, target_name)
        if gain > max_gain:
            max_gain = gain
            best_col = col
    return best_col

最佳答案

我无法提供完整的答案，但我会提出以下意见:

在 DecisionTree.fit 中，在 run_id3 函数中，您附加到 tree.children 两次，这些附加调用之一必须是子节点中 None 值的原因。

这个看起来不错，你正在附加一棵树:

tree.children.append(Tree(value=unique_targets[0]))

这个看起来更可疑:

        for unique_val in np.unique(data[best_split]):
            new_tree = Tree()
            new_tree.value = unique_val
            tree.children.append(run_id3(data[data[best_split] == unique_val], target, features, new_tree))

您将 run_id3 的返回值附加到 tree.children ，但 run_id3 不返回值，并且在不返回值的 Python 函数中返回 None 。 run_id3 附加到传递给它的树的子列表中，所以你的代码应该是这样的:

        for unique_val in np.unique(data[best_split]):
            new_tree = Tree()
            new_tree.value = unique_val
            run_id3(data[data[best_split] == unique_val], target, features, new_tree)

几个次要的风格点:

class Tree():
    def __init__(self, children = [], label = None, value = None):
        self.children = children

Tree 之后不需要括号，除非你想从另一个类继承，在这种情况下你会有 class Tree(Ancestor):...
在 children=[] 等函数参数中提供 mutable default arguments 会产生意想不到的效果，所以最好避免这种做法。改用这个习语:

class Tree:

    def __init__(self, children=None, label=None, value=None):
        self.children = children if children is not None else []
        ...

关于python - 无法实现基于类的非二进制 id3 决策解决方案，我们在Stack Overflow上找到一个类似的问题：https://stackoverflow.com/questions/53790092/