GradientBoostingClassifier

How to handle categorical variables in sklearn GradientBoostingClassifier?

Problem description

I am trying to train models with GradientBoostingClassifier using categorical variables.

The following is a primitive code sample, written only to try feeding a categorical variable into GradientBoostingClassifier.

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
import pandas

iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]

# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]

X_train = pandas.DataFrame(X_train)

# Insert fake categorical variable. 
# Just for testing in GradientBoostingClassifier.
X_train[0] = ['a']*40 + ['b']*40

# Model.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)

This raises the following error:

ValueError: could not convert string to float: 'b'
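The message suggests that the estimator tries to turn the whole feature matrix into a floating-point array. A minimal sketch of the same failure using NumPy alone, assuming the conversion inside scikit-learn behaves similarly:

```python
import numpy as np

# Strings cannot be parsed as floats, so the conversion fails.
try:
    np.asarray(["a", "b"], dtype=float)
    message = None
except ValueError as err:
    message = str(err)

print(message)
```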

From what I gather, categorical variables have to be one-hot encoded before GradientBoostingClassifier can build the model.

Can GradientBoostingClassifier build models using categorical variables without having to do one-hot encoding?

The R gbm package is capable of handling the sample data above. I'm looking for a Python library with equivalent capability.

Recommended answer

pandas.get_dummies or statsmodels.tools.tools.categorical can be used to convert categorical variables to a dummy matrix. We can then merge the dummy matrix back into the training data.
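As a small illustration of the pandas.get_dummies route, here is a sketch on made-up data (the array `numeric` and the series `cat_var` are hypothetical and not taken from the question):

```python
import numpy as np
import pandas as pd

# Made-up data: two numeric features and one string-valued column.
numeric = np.arange(12, dtype=float).reshape(6, 2)
cat_var = pd.Series(["a", "a", "b", "b", "a", "b"], name="cat")

# One 0/1 indicator column per category level.
dummies = pd.get_dummies(cat_var, dtype=float)

# Merge the dummy matrix back with the numeric features.
X = np.concatenate((numeric, dummies.to_numpy()), axis=1)

print(dummies.columns.tolist())
print(X.shape)
```

The resulting array is purely numeric, so it can be passed to an estimator's fit method without the string-conversion error.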

Below is the example code from the question with the above procedure carried out.

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve,auc
from statsmodels.tools import categorical
import numpy as np

iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]

# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]


###########################################################################
###### Convert categorical variable to matrix and merge back with training
###### data.

# Fake categorical variable.
catVar = np.array(['a']*40 + ['b']*40)
catVar = categorical(catVar, drop=True)
X_train = np.concatenate((X_train, catVar), axis = 1)

catVar = np.array(['a']*10 + ['b']*10)
catVar = categorical(catVar, drop=True)
X_test = np.concatenate((X_test, catVar), axis = 1)
###########################################################################

# Model and test.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)

prob = clf.predict_proba(X_test)[:,1]   # Only look at P(y==1).

fpr, tpr, thresholds = roc_curve(y_test, prob)
roc_auc_prob = auc(fpr, tpr)

print(prob)
print(y_test)
print(roc_auc_prob)
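The code above encodes the training and test columns separately, which only lines up when both contain the same category levels. A possible alternative, sketched here with scikit-learn's OneHotEncoder in place of the statsmodels helper (which, as far as I know, has been deprecated in newer statsmodels releases), learns the levels from the training data and reuses them for the test data. The variable names and `random_state=0` are my own choices:

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

iris = datasets.load_iris()
mask = (iris.target == 0) | (iris.target == 1)
X, y = iris.data[mask], iris.target[mask]

train_idx = list(range(40)) + list(range(50, 90))
test_idx = list(range(40, 50)) + list(range(90, 100))

# Fake categorical column, shaped (n_samples, 1) as the encoder expects.
cat_train = np.array(["a"] * 40 + ["b"] * 40).reshape(-1, 1)
cat_test = np.array(["a"] * 10 + ["b"] * 10).reshape(-1, 1)

# Learn the category levels on the training data only; levels unseen
# at prediction time are encoded as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
enc_train = encoder.fit_transform(cat_train).toarray()
enc_test = encoder.transform(cat_test).toarray()

X_train = np.hstack((X[train_idx], enc_train))
X_test = np.hstack((X[test_idx], enc_test))

clf = GradientBoostingClassifier(
    learning_rate=0.01, max_depth=8, n_estimators=50, random_state=0
).fit(X_train, y[train_idx])

accuracy = clf.score(X_test, y[test_idx])
print(accuracy)
```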

Thanks to Andreas Muller for pointing out that a pandas DataFrame should not be used with scikit-learn estimators.


10-23 17:19