Problem description
If I understand this correctly, we are given a set of objects (each an array of features) and need to split it into two subsets. To do that, we compare some feature x to a threshold t (t is the threshold at node m). We use an impurity function H() to find the best way to split the objects. But how do we choose the value of t, and which feature should be compared to the threshold? There are infinitely many ways to choose t, so we can't just compute H() for every possibility.
Recommended answer
On page 18 of these slides, two methods are introduced for choosing the splitting threshold for a numerical attribute X.
Method 1:
- Sort the data according to X into {x_1, ..., x_m}
- Consider split points of the form x_i + (x_{i+1} - x_i)/2
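Method 1 is what makes the search finite: any threshold lying strictly between two consecutive sorted values produces the same partition of the data, so only the midpoints need to be tried. A minimal sketch of that enumeration follows; the function name `candidate_thresholds` is hypothetical, and dropping duplicate values before taking midpoints is an assumption that the slides do not spell out.

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct sorted values."""
    # Assumption: duplicates are removed so no midpoint equals a data value.
    xs = sorted(set(values))
    # x_i + (x_{i+1} - x_i) / 2 for each adjacent pair
    return [lo + (hi - lo) / 2 for lo, hi in zip(xs, xs[1:])]


print(candidate_thresholds([2.0, 1.0, 4.0, 2.0]))  # [1.5, 3.0]
```

With m distinct values there are at most m - 1 candidates, so H() only has to be evaluated that many times per feature.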
Method 2:
Suppose X is a real-valued variable
- Define IG(Y|X:t) as H(Y) - H(Y|X:t)
- Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
- IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t
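The two definitions above can be sketched directly in code. This is only an illustration under stated assumptions: H is taken to be Shannon entropy in bits, the probabilities P(X < t) and P(X >= t) are estimated as empirical fractions of the sample, and the names `entropy` and `info_gain` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Y) in bits; an empty subset contributes 0."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(xs, ys, t):
    """IG(Y|X:t) = H(Y) - [H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)]."""
    left = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    n = len(ys)
    h_cond = entropy(left) * len(left) / n + entropy(right) * len(right) / n
    return entropy(ys) - h_cond


xs = [1.0, 2.0, 3.0, 4.0]
ys = ["a", "a", "b", "b"]
print(info_gain(xs, ys, 2.5))  # 1.0: the split separates the classes perfectly
```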
Then define IG*(Y|X) = max_t IG(Y|X:t)
For each real-valued attribute, use IG*(Y|X) to assess its suitability as a split
Note: the tree may split on the same attribute multiple times, with different thresholds
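Putting the two methods together answers both parts of the question: for each attribute, take the maximum of IG(Y|X:t) over the midpoint candidates to get IG*(Y|X), then split on the attribute with the largest IG*. The sketch below assumes Shannon entropy as H and empirical fractions as probabilities; `best_threshold` and `best_split` are hypothetical helper names, not taken from the slides.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(xs, ys, t):
    left = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    n = len(ys)
    return entropy(ys) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)


def best_threshold(xs, ys):
    """Return (IG*, t*) for one attribute, searching midpoints only."""
    vals = sorted(set(xs))
    candidates = [lo + (hi - lo) / 2 for lo, hi in zip(vals, vals[1:])]
    if not candidates:
        return 0.0, None  # constant attribute: nothing to split on
    return max((info_gain(xs, ys, t), t) for t in candidates)


def best_split(rows, ys):
    """rows: list of feature vectors. Return (IG*, feature index, threshold)."""
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        gain, t = best_threshold([r[j] for r in rows], ys)
        if t is not None and gain > best[0]:
            best = (gain, j, t)
    return best


rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
ys = ["a", "a", "b", "b"]
print(best_split(rows, ys))  # (1.0, 0, 2.5)
```

When the tree is grown recursively, the same search is repeated on each child subset, which is how one attribute can be chosen again further down with a different threshold.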