Question
I'm using RandomForest from Weka 3.7.11, which in turn is bagging Weka's RandomTree. My input attributes are numerical, and the output attribute (label) is also numerical.
When training the RandomTree, K attributes are chosen at random for each node of the tree. Several splits based on those attributes are attempted and the "best" one is chosen. How does Weka determine what split is best in this (numerical) case?
For nominal attributes I believe Weka is using the information gain criterion, which is based on conditional entropy:
IG(T|a) = H(T) - H(T|a)
Is something similar used for numerical attributes? Maybe differential entropy?
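For reference, the information gain criterion from the question can be sketched in a few lines of Python. This is a toy illustration of the formula IG(T|a) = H(T) - H(T|a), not Weka's actual code; the function names are mine:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(T) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values):
    """IG(T|a) = H(T) - H(T|a) for one nominal attribute a."""
    n = len(labels)
    h_cond = 0.0
    for v in set(attr_values):
        subset = [t for t, a in zip(labels, attr_values) if a == v]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(labels) - h_cond
```

For example, a perfectly separating attribute (`['a','a','b','b']` against labels `['yes','yes','no','no']`) yields a gain of 1.0 bit.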
Answer
When a tree splits on a numerical attribute, it splits on a condition like a > 5. So the condition effectively becomes a binary variable, and the criterion (information gain) is exactly the same.
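In other words, the numeric condition is treated as a two-valued attribute and plugged into the same information-gain computation. A minimal Python sketch (toy code for illustration, not Weka's implementation; how the candidate threshold is chosen is up to the caller):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(T) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def numeric_split_gain(labels, values, threshold):
    """Information gain of the binary condition value > threshold."""
    left = [t for t, v in zip(labels, values) if v <= threshold]
    right = [t for t, v in zip(labels, values) if v > threshold]
    n = len(labels)
    h_cond = sum(len(s) / n * entropy(s) for s in (left, right) if s)
    return entropy(labels) - h_cond
```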
P.S. For regression, the commonly used criterion is the sum of squared errors (computed within each leaf, then summed over the leaves). But I do not know specifically about Weka.
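That SSE criterion for a single binary split can be sketched as follows (toy Python with hypothetical helper names, not Weka's API; a split is better the larger the reduction it achieves):

```python
def sse(ys):
    """Sum of squared errors of ys around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def sse_reduction(ys, values, threshold):
    """Drop in total SSE from splitting on value > threshold."""
    left = [y for y, v in zip(ys, values) if v <= threshold]
    right = [y for y, v in zip(ys, values) if v > threshold]
    return sse(ys) - (sse(left) + sse(right))
```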