I am trying to implement AdaBoost M1 in Python from the following pseudocode:
[Pseudocode image unavailable: AdaBoost.M1 algorithm]
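(Since the image no longer renders, here is a sketch of the standard AdaBoost.M1 formulation, as given e.g. in Algorithm 10.1 of Hastie et al., whose step numbers match the 2(b)/2(c)/2(d)/3 references in the answer below:)

1. Initialize observation weights w_i = 1/N, i = 1..N.
2. For t = 1 to T:
   (a) Fit a weak classifier G_t(x) to the training data using weights w_i.
   (b) Compute err_t = sum_i w_i * I(y_i != G_t(x_i)) / sum_i w_i.
   (c) Compute alpha_t = log((1 - err_t) / err_t).
   (d) Update w_i <- w_i * exp(alpha_t * I(y_i != G_t(x_i))).
3. Output the ensemble G(x) = sign(sum_t alpha_t * G_t(x)).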

I have it partly working; however, the number of "incorrect predictions" is not decreasing.

I have checked my weight-update function, and it seems to be updating the weights correctly.

The error may be in the classifier, since the number of "incorrect predictions" is the same integer on every other iteration - I have tried up to 100 iterations. I have no idea why the error is not decreasing with each iteration.

Any tips would be greatly appreciated.
Thanks :)

from sklearn import tree
import pandas as pd
import numpy as np
import math

df = pd.read_csv("./dataset(3)/adaboost_train.csv")
X_train = df.loc[:,'x1':'x10']
Y_train = df[['y']]



def adaBoost(X_train,Y_train):
    classifiers = []
    # initializing the weights:
    N = len(Y_train)
    w_i = [1 / N] * N

    T = 20
    x_train = (X_train.apply(lambda x: x.tolist(), axis=1))
    clf_errors = []

    for t in range(T):
        print("Iteration:", t)
        # clf = clf2.fit(X_train,Y_train, sample_weight = w_i)

        clf = tree.DecisionTreeClassifier(max_depth=1)
        clf.fit(X_train, Y_train, sample_weight = w_i)

        #Predict all the values:
        y_pred = []
        for sample in x_train:
            p = clf.predict([sample])
            p = p[0]
            y_pred.append(p)
        num_of_incorrect = calculate_error_clf(y_pred, Y_train)


        clf_errors.append(num_of_incorrect)

        error_internal = calc_error(w_i,Y_train,y_pred)

        alpha = np.log((1-error_internal)/ error_internal)
        print(alpha)

        # Add the predictions, error and alpha for later use for every iteration
        classifiers.append((y_pred, error_internal, alpha))

        if t == 2 and y_pred == classifiers[0][0]:
            print("TRUE")


        w_i = update_weights(w_i,y_pred,Y_train,alpha,clf)


def calc_error(weights,Y_train,y_pred):
    err = 0
    for i in range(len(weights)):
        if y_pred[i] != Y_train['y'].iloc[i]:
            err= err + weights[i]
    # Normalizing the error:
    err = err/np.sum(weights)
    return err

# If the prediction is true, return 0. If it is not true, return 1.
def check_pred(y_p, y_t):
    if y_p == y_t:
        return 0
    else:
        return 1

def update_weights(w,y_pred,Y_train,alpha,clf):
    for j in range(len(w)):
        if y_pred[j] != Y_train['y'].iloc[j]:
            w[j] = w[j]* (np.exp( alpha * 1))
    return w

def calculate_error_clf(y_pred, y):
    # Count the number of misclassified samples (unweighted).
    sum_error = 0
    for i in range(len(y)):
        if y_pred[i] != y.iloc[i]['y']:
            sum_error += 1
    return sum_error




I expected the error to decrease, but it does not. For example:

iteration 1: num_of_incorrect 4444
iteration 2: num_of_incorrect 4762
iteration 3: num_of_incorrect 4353
iteration 4: num_of_incorrect 4762
iteration 5: num_of_incorrect 4450
iteration 6: num_of_incorrect 4762
...
does not converge



Best Answer

The number of misclassifications will not decrease with each iteration (since each classifier is a weak classifier). AdaBoost is an ensemble model that assigns higher weight to previously misclassified samples. So in the next iteration, some previously misclassified samples will be classified correctly, but this can also introduce errors on samples that were previously classified correctly (which is why the per-iteration error does not improve). Even though every individual classifier is weak, since the final output is a weighted sum of all the classifiers, the combined classification converges to a strong learner (see line 3 of the algorithm).

My implementation using numpy:

from sklearn import tree
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report

data = load_breast_cancer()
X_train = data.data
Y_train = np.where(data.target == 0, 1, -1)

def adaBoost(X_train,Y_train):
    classifiers = []
    # initializing the weights:
    N = len(Y_train)
    w_i = np.array([1 / N] * N)

    T = 20
    clf_errors = []

    for t in range(T):
        clf = tree.DecisionTreeClassifier(max_depth=1)
        clf.fit(X_train, Y_train, sample_weight = w_i)

        #Predict all the values:
        y_pred = clf.predict(X_train)
        #print (confusion_matrix(Y_train, y_pred))

        # Line 2(b) of algorithm
        error = np.sum(np.where(Y_train != y_pred, w_i, 0))/np.sum(w_i)
        print("Iteration: {0}, Missed: {1}".format(t, np.sum(np.where(Y_train != y_pred, 1, 0))))

        # Line 2(c) of algorithm
        alpha = np.log((1-error)/ error)
        classifiers.append((alpha, clf))
        # Line 2(d) of algorithm
        w_i = np.where(Y_train != y_pred, w_i*np.exp(alpha), w_i)
    return classifiers

clfs = adaBoost(X_train, Y_train)

# Line 3 of algorithm
def predict(clfs, x):
    s = np.zeros(len(x))
    for (alpha, clf) in clfs:
        s += alpha*clf.predict(x)
    return np.sign(s)

print (confusion_matrix(Y_train, predict(clfs,X_train)))
print (classification_report(Y_train, predict(clfs,X_train)))


Output:

Iteration: 0, Missed: 44
Iteration: 1, Missed: 48
Iteration: 2, Missed: 182
Iteration: 3, Missed: 73
Iteration: 4, Missed: 102
Iteration: 5, Missed: 160
Iteration: 6, Missed: 185
Iteration: 7, Missed: 69
Iteration: 8, Missed: 357
Iteration: 9, Missed: 127
Iteration: 10, Missed: 256
Iteration: 11, Missed: 160
Iteration: 12, Missed: 298
Iteration: 13, Missed: 64
Iteration: 14, Missed: 221
Iteration: 15, Missed: 113
Iteration: 16, Missed: 261
Iteration: 17, Missed: 368
Iteration: 18, Missed: 49
Iteration: 19, Missed: 171
[[354   3]
 [  3 209]]

             precision    recall  f1-score   support

         -1       0.99      0.99      0.99       357
          1       0.99      0.99      0.99       212

avg / total       0.99      0.99      0.99       569

As you can see, the number of misses does not improve per iteration, but if you check the confusion matrix (uncomment the print in the code), you will see that some previously misclassified samples become correctly classified. Finally, for prediction, since we weight the classifiers by their error, the weighted sum converges to a strong classifier (as the final predictions show).
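To see this convergence directly, here is a minimal sketch (reusing the clfs, X_train, and Y_train defined above; staged_errors is a hypothetical helper, not part of the answer's code) that tracks the error of the combined ensemble after each boosting round, which trends downward even though the per-stump misses do not:

def staged_errors(clfs, X, y):
    # Running weighted vote over the stumps accumulated so far.
    s = np.zeros(len(X))
    errors = []
    for alpha, clf in clfs:
        s += alpha * clf.predict(X)
        # Error rate of the combined ensemble up to this round.
        errors.append(np.mean(np.sign(s) != y))
    return errors

print (staged_errors(clfs, X_train, Y_train))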

Regarding "python - Why doesn't the error of my AdaBoost implementation decrease?", see the similar question on Stack Overflow: https://stackoverflow.com/questions/55318330/
