Decision trees: choosing the threshold for splitting objects

Question

If I understand this correctly, we are given a set of objects (each an array of features) and need to split it into two subsets. To do that, we compare some feature x to a threshold t (t being the threshold at node m). We use an impurity function H() to find the best way to split the objects. But how do we choose the value of t, and which feature should be compared to the threshold? There are infinitely many ways to choose t, so we can't simply compute H() for every possibility.

Recommended answer

On page 18 of these slides, two methods are introduced for choosing the splitting threshold for a numerical attribute X.

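Method 1:

  • Sort the data according to X into {x_1, ..., x_m}
  • Consider split points of the form x_i + (x_{i+1} - x_i)/2, i.e. the midpoints between consecutive sorted values

Any threshold lying between the same two consecutive values produces the same partition of the data, so at most m - 1 candidates need to be evaluated rather than infinitely many.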
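Method 1 can be sketched in a few lines of Python. This is only an illustration: the function name `candidate_thresholds` is hypothetical, and dropping duplicate values before taking midpoints is an assumption on my part (equal neighbours would otherwise give a threshold that separates nothing).

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct sorted values."""
    xs = sorted(set(values))  # assumption: duplicates are collapsed first
    # x_i + (x_{i+1} - x_i) / 2 for each pair of neighbours
    return [a + (b - a) / 2 for a, b in zip(xs, xs[1:])]


print(candidate_thresholds([2.0, 1.0, 4.0, 2.0]))  # [1.5, 3.0]
```

With m distinct values this yields m - 1 thresholds, which is the finite set the impurity function actually has to be evaluated on.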

Method 2:

Suppose X is a real-valued variable

  • Define IG(Y|X:t) as H(Y) - H(Y|X:t)
  • Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
  • IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t
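As a rough illustration of these definitions, the sketch below computes IG(Y|X:t) for one threshold. It assumes H is Shannon entropy in bits and that the probabilities P(X < t) and P(X >= t) are estimated by the fraction of samples on each side; the names `entropy` and `info_gain` are my own, not from the slides.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Y), in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(xs, ys, t):
    """IG(Y|X:t) = H(Y) - [H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)]."""
    left = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    n = len(ys)
    # Empirical fractions stand in for P(X < t) and P(X >= t).
    h_cond = entropy(left) * len(left) / n + entropy(right) * len(right) / n
    return entropy(ys) - h_cond


xs = [1.0, 2.0, 3.0, 4.0]
ys = ["a", "a", "b", "b"]
print(info_gain(xs, ys, 2.5))  # 1.0: both sides are pure
print(info_gain(xs, ys, 1.5))  # about 0.311: the right side is still mixed
```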

Then define IG^*(Y|X) = max_t IG(Y|X:t)

For each real-valued attribute, use IG^*(Y|X) to assess its suitability as a split
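Putting the pieces together, one way to answer "which feature and which threshold" is to compute IG^* for every attribute and keep the largest. The sketch below is a minimal version under stated assumptions: entropy as the impurity, midpoints between distinct sorted values as the candidate thresholds, and hypothetical helper names (`best_threshold`, `best_split`).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits; 0.0 for an empty list."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0


def info_gain(xs, ys, t):
    """Information gain of splitting on X < t versus X >= t."""
    left = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    weighted = len(left) * entropy(left) + len(right) * entropy(right)
    return entropy(ys) - weighted / len(ys)


def best_threshold(xs, ys):
    """Return (IG*, t*) over the midpoint candidates, or (0.0, None)."""
    vals = sorted(set(xs))
    candidates = [a + (b - a) / 2 for a, b in zip(vals, vals[1:])]
    if not candidates:
        return 0.0, None
    return max((info_gain(xs, ys, t), t) for t in candidates)


def best_split(rows, ys):
    """Return (feature index, threshold, IG*) for the best attribute."""
    best = (None, None, 0.0)
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        gain, t = best_threshold(column, ys)
        if t is not None and gain > best[2]:
            best = (j, t, gain)
    return best


rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
ys = ["a", "a", "b", "b"]
print(best_split(rows, ys))  # (0, 2.5, 1.0)
```

Here feature 0 separates the classes perfectly at 2.5, while the best threshold on feature 1 only reaches a gain of about 0.311, so feature 0 is chosen.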

Note that the tree may split on the same attribute multiple times, with different thresholds
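To see why repeated splits on one attribute can be needed, consider a single feature whose class changes twice along its range. The toy recursion below is a sketch, not a full tree learner: it assumes entropy as the impurity, stops when no threshold gives a positive gain, and uses hypothetical names (`best_threshold`, `build`).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits; 0.0 for an empty list."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0


def best_threshold(xs, ys):
    """Return (gain, threshold) with the highest information gain, or (0.0, None)."""
    vals = sorted(set(xs))
    best = (0.0, None)
    for a, b in zip(vals, vals[1:]):
        t = a + (b - a) / 2
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = entropy(ys) - cond
        if gain > best[0]:
            best = (gain, t)
    return best


def build(xs, ys):
    """Recursively split on one attribute; leaves are labels, nodes are tuples."""
    gain, t = best_threshold(xs, ys)
    if t is None:  # no useful split left: return the majority label
        return Counter(ys).most_common(1)[0][0]
    left = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return (t, build(*zip(*left)), build(*zip(*right)))


tree = build([1, 2, 3, 4, 5, 6], ["a", "a", "b", "b", "a", "a"])
print(tree)  # (2.5, 'a', (4.5, 'b', 'a'))
```

Each node is a `(threshold, left subtree, right subtree)` tuple. The same attribute is used at the root with threshold 2.5 and again below it with threshold 4.5, because no single threshold can isolate the "b" values in the middle.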

