Decision Trees
1. Training and Visualizing a Decision Tree
Decision Trees can perform both classification and regression tasks, and even multioutput tasks.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

export_graphviz(
    tree_clf,
    out_file=image_path("iris_tree.dot"),  # image_path is the book's path helper
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)
$ dot -Tpng iris_tree.dot -o iris_tree.png
2. Making Predictions
- require very little data preparation.
- don’t require feature scaling or centering at all
- algorithm
  - CART: builds binary trees (every non-leaf node has exactly two children)
  - ID3: can build trees whose nodes have more than two children
- estimating class probabilities (see the sketch after this list)
  - a leaf node's value gives the count of training instances of each class in that leaf, from which the probability $p_k$ of each class can be output
- gini: measures a node's impurity; 0 means the node is completely pure
  - $Gini_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$
  - $p_{i,k}$ is the ratio of class-$k$ instances among the training instances of node $i$
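A minimal sketch tying both ideas together, assuming the tree_clf trained above (the flower with petal length 5 cm and petal width 1.5 cm is just an arbitrary example input):

import numpy as np

proba = tree_clf.predict_proba([[5, 1.5]])[0]  # class ratios of the leaf this flower falls into
pred = tree_clf.predict([[5, 1.5]])            # the class with the highest ratio
leaf_gini = 1 - np.sum(proba ** 2)             # Gini impurity of that leaf (0 = pure)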
3. The CART Training Algorithm
The CART algorithm recursively searches, at each node, for the best split feature $k$ and split threshold $t_k$. CART cost function for classification:
$J(k, t_k) = \frac{m_{left}}{m} G_{left} + \frac{m_{right}}{m} G_{right}$  (where $G_{left/right}$ measures the impurity of the left/right subset and $m_{left/right}$ is the number of instances in it)
Besides the Gini index, Shannon entropy can also be used as the impurity measure $G$:
$H_i = -\sum_{k=1,\ p_{i,k} \neq 0}^{n} p_{i,k} \log(p_{i,k})$
The Gini index is the default because it is a bit cheaper to compute, and the two measures usually produce similar trees. Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
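To make the cost function above concrete, here is a minimal sketch of one greedy CART split step, i.e. an exhaustive search for the $(k, t_k)$ pair that minimizes $J$. This is only an illustration, not scikit-learn's actual implementation:

import numpy as np

def gini(labels):
    # Gini impurity of a label array: 1 - sum_k p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # evaluate J(k, t_k) for every feature k and every observed value as threshold t_k
    m, n_features = X.shape
    best_k, best_t, best_J = None, None, np.inf
    for k in range(n_features):
        for t in np.unique(X[:, k]):
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # skip splits that leave one side empty
            J = len(left) / m * gini(left) + len(right) / m * gini(right)
            if J < best_J:
                best_k, best_t, best_J = k, t, J
    return best_k, best_t, best_J

In scikit-learn, the impurity measure is chosen with DecisionTreeClassifier's criterion hyperparameter: "gini" (the default) or "entropy".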
- CART stands for Classification And Regression Tree
- CART is a greedy algorithm
- A greedy algorithm often produces a reasonably good solution,
- but it is not guaranteed to be the optimal solution.
- finding the optimal tree is known to be an NP-Complete problem
- it requires O(exp(m)) time
- background: the complexity classes involved
- P is the set of problems that can be solved in polynomial time
- NP is the set of problems whose solutions can be verified in polynomial time
- NP-Hard problem is a problem to which any NP problem can be reduced in polynomial time.
- An NP-Complete problem is both NP and NP-Hard
4. Regularization Hyperparameters
- a Decision Tree is a nonparametric model: the number of parameters is not determined prior to training
- a few hyperparameters restrict the shape of the Decision Tree:
- min_samples_split
- min_samples_leaf
- min_weight_fraction_leaf, same as min_samples_leaf but expressed as a fraction of the total number of weighted instances
- max_leaf_nodes
- max_features, maximum number of features that are evaluated for splitting at each node
- increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model (see the sketch after this list)
- alternatively, you can first train a tree without any constraints, then prune it afterwards as a form of regularization
- The computational complexity of training a Decision Tree is O(n × m log(m))
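As an illustration of these hyperparameters, here is a minimal sketch comparing an unrestricted tree with a regularized one on a toy dataset (make_moons and min_samples_leaf=4 are illustrative choices, not values prescribed by the text):

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

# unrestricted: the tree is free to grow until every leaf is pure, so it tends to overfit
deep_tree_clf1 = DecisionTreeClassifier(random_state=42)
# regularized: every leaf must contain at least 4 training instances
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf1.fit(Xm, ym)
deep_tree_clf2.fit(Xm, ym)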
5. Regression
For regression, the impurity measure is replaced by the mean squared error (MSE).
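Written out, the regression cost function has the same weighted form as the classification one, with MSE playing the role of the impurity $G$ (standard MSE definition shown here):

$J(k, t_k) = \frac{m_{left}}{m} MSE_{left} + \frac{m_{right}}{m} MSE_{right}$, where $MSE_{node} = \frac{1}{m_{node}} \sum_{i \in node} (\hat{y}_{node} - y^{(i)})^2$ and $\hat{y}_{node} = \frac{1}{m_{node}} \sum_{i \in node} y^{(i)}$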
from sklearn.tree import DecisionTreeRegressor

# X, y: a noisy quadratic dataset (as in the book's example)
# without any restriction the tree would obviously overfit;
# setting e.g. min_samples_leaf=10 would regularize it further
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)
The value returned by a leaf is the average target value of all the training samples that fall into that leaf's region.
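A quick usage check (assuming the tree_reg trained above on a single-feature dataset; the input 0.6 is an arbitrary example value):

tree_reg.predict([[0.6]])  # returns the mean y of the training instances in the leaf that x=0.6 falls into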
6. Instability
- advantages: Decision Trees have a lot going for them
- simple to understand and interpret
- easy to use
- versatile, and powerful
- limitations: they do have a few limitations
- orthogonal decision boundaries: every split is perpendicular to an axis, so data that is not axis-aligned (e.g. a rotated training set) is handled poorly (see the sketch after this list)
- very sensitive to small variations in the training data
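A minimal sketch of the rotation sensitivity (X and y are assumed to be a 2-feature training set such as the iris petal data used earlier; the 45-degree angle is arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

angle = np.pi / 4  # rotate the feature space by 45 degrees
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
Xr = X.dot(rotation)

clf_orig = DecisionTreeClassifier(random_state=42).fit(X, y)
clf_rot = DecisionTreeClassifier(random_state=42).fit(Xr, y)
# the two trees generally end up with different structures and decision boundaries,
# because every split must stay perpendicular to one of the (rotated) axes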