Recursive feature selection may not yield higher performance?

Problem Description

I'm trying to analyze the data below. I modeled it with logistic regression, made predictions, and calculated the accuracy and AUC; then I performed recursive feature selection and calculated the accuracy and AUC again. I expected the accuracy and AUC to be higher, but they are both lower after the recursive feature selection. Is this expected, or did I miss something? Thanks!

Data: https://github.com/amandawang-dev/census-training/blob/master/census-training.csv

For logistic regression, Accuracy: 0.8111649491571692; AUC: 0.824896256487386

import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split


train = pd.read_csv('census-training.csv')
train = train.replace('?', np.nan)

# Impute missing values with each column's mode
for column in train.columns:
    train[column].fillna(train[column].mode()[0], inplace=True)

# Encode the binary columns as 0/1 integers
train['Income'] = train['Income'].str.contains('>50K').astype(int)
train['Gender'] = train['Gender'].str.contains('Male').astype(int)

# Label-encode the remaining categorical ('object') columns
obj = train.select_dtypes(include=['object'])
le = preprocessing.LabelEncoder()
for col in obj.columns:
    train[col] = le.fit_transform(train[col])

train_set, test_set = train_test_split(train, test_size=0.3, random_state=42)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score


log_rgr = LogisticRegression(random_state=0)


# The first 9 columns are the features; column 9 ('Income') is the target
X_train = train_set.iloc[:, 0:9]
y_train = train_set.iloc[:, 9]

X_test = test_set.iloc[:, 0:9]
y_test = test_set.iloc[:, 9]

log_rgr.fit(X_train, y_train)

y_pred = log_rgr.predict(X_test)

lr_acc = accuracy_score(y_test, y_pred)

# Positive-class probabilities for the ROC curve / AUC
probs = log_rgr.predict_proba(X_test)
preds = probs[:, 1]

# 'Income' is already encoded as 0/1, so no binarization is needed
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = roc_auc_score(y_test, preds)

print("Accuracy: {}".format(lr_acc))
print("AUC: {}".format(roc_auc))


from sklearn.feature_selection import RFE


# Recursively eliminate features, keeping the 5 highest-ranked ones
rfe = RFE(log_rgr, n_features_to_select=5)
fit = rfe.fit(X_train, y_train)

# Reduce both splits to the selected features
X_train_new = fit.transform(X_train)
X_test_new = fit.transform(X_test)

# Refit the model on the reduced feature set
log_rgr.fit(X_train_new, y_train)
y_pred = log_rgr.predict(X_test_new)

lr_acc = accuracy_score(y_test, y_pred)

# Score the reduced model with the same metrics as before
probs = log_rgr.predict_proba(X_test_new)
preds = probs[:, 1]

fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = roc_auc_score(y_test, preds)

print("Accuracy: {}".format(lr_acc))
print("AUC: {}".format(roc_auc))

Recommended Answer

There is simply no guarantee that any kind of feature selection (backward, forward, recursive - you name it) will actually lead to better performance in general. None at all. Such tools are there for convenience only; they may work, or they may not. The best guide and ultimate judge is always the experiment.
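If the experiment is the judge, one straightforward experiment is to let cross-validation choose how many features to keep, instead of fixing the count at 5. A minimal sketch using scikit-learn's RFECV, reusing the X_train/y_train split from the question's code (max_iter=1000 is an illustrative choice to help convergence, not part of the original):

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Score every candidate feature count by cross-validated AUC
# and keep the best one, rather than fixing n_features_to_select=5
selector = RFECV(
    LogisticRegression(random_state=0, max_iter=1000),
    step=1,             # eliminate one feature per round
    cv=5,               # 5-fold cross-validation
    scoring='roc_auc',  # compare feature counts by AUC, as in the question
)
selector.fit(X_train, y_train)

print("Optimal number of features:", selector.n_features_)
print("Selected features:", X_train.columns[selector.support_].tolist())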

Apart from some very specific cases in linear or logistic regression, most notably the Lasso (which, not coincidentally, actually comes from statistics), or somewhat extreme cases with too many features (a.k.a. the curse of dimensionality), even when it works (or doesn't), there is not necessarily much to explain as to why (or why not).
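For completeness, the Lasso-style behavior mentioned above can be approximated in scikit-learn with an L1-penalized logistic regression, which performs embedded feature selection by shrinking uninformative coefficients exactly to zero. A hedged sketch, again reusing X_train/y_train from the question (the C value here is an illustrative choice, not a tuned one):

from sklearn.linear_model import LogisticRegression

# An L1 (Lasso-style) penalty zeroes out coefficients of weak features
l1_model = LogisticRegression(
    penalty='l1',
    solver='liblinear',  # liblinear supports the L1 penalty
    C=0.1,               # illustrative; smaller C means stronger shrinkage
    random_state=0,
)
l1_model.fit(X_train, y_train)

kept = l1_model.coef_[0] != 0
print("Features kept by the L1 penalty:", X_train.columns[kept].tolist())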

