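I would like to use cross-validation to train and test on my dataset, and to evaluate the performance of a logistic regression model on the entire dataset rather than only on a held-out test set (e.g. 25%).

These concepts are completely new to me, and I am not sure I am doing this correctly. I would be grateful if someone could point out where I have gone wrong and which steps I should take instead. Part of my code is shown below.

Also, how can I plot the ROC curves for "y2" and "y3" on the same figure as the current one?

Thank you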

import pandas as pd
Data=pd.read_csv ('C:\\Dataset.csv',index_col='SNo')
feature_cols=['A','B','C','D','E']
X=Data[feature_cols]

y=Data['Status']
Y1=Data['Status1']  # predictions from elsewhere
Y2=Data['Status2'] # predictions from elsewhere

from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X_train,y_train)

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn import metrics, cross_validation
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
metrics.accuracy_score(y, predicted)

from sklearn.cross_validation import cross_val_score
accuracy = cross_val_score(logreg, X, y, cv=10,scoring='accuracy')
print (accuracy)
print (cross_val_score(logreg, X, y, cv=10,scoring='accuracy').mean())

from nltk import ConfusionMatrix
print (ConfusionMatrix(list(y), list(predicted)))
#print (ConfusionMatrix(list(y), list(yexpert)))

# sensitivity:
print (metrics.recall_score(y, predicted) )

import matplotlib.pyplot as plt
probs = logreg.predict_proba(X)[:, 1]
plt.hist(probs)
plt.show()

# use 0.5 cutoff for predicting 'default'
import numpy as np
preds = np.where(probs > 0.5, 1, 0)
print (ConfusionMatrix(list(y), list(preds)))

# check accuracy, sensitivity, specificity
print (metrics.accuracy_score(y, predicted))

#ROC CURVES and AUC
# plot ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y, probs)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

# calculate AUC
print (metrics.roc_auc_score(y, probs))

# use AUC as evaluation metric for cross-validation
from sklearn.cross_validation import cross_val_score
logreg = LogisticRegression()
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()

Best answer

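You got it almost right. cross_val_predict (imported from sklearn.cross_validation in your code; since scikit-learn 0.18 it lives in sklearn.model_selection) gives you predictions for the entire dataset. You just need to remove the logreg.fit call earlier in your code, because cross_val_predict fits the model itself. Specifically, it does the following:
It divides your dataset into n folds. In each iteration it holds one fold out as the test set and trains the model on the remaining n-1 folds. Every sample is in the held-out fold exactly once, so at the end you have a prediction for every sample in the data.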
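The fold-by-fold procedure described above can be written out by hand. The following is only a sketch under some assumptions: it uses the iris data, the modern sklearn.model_selection module, and StratifiedKFold without shuffling, which is what an integer cv selects for a classifier. The names manual and auto are illustrative, and max_iter=1000 is set only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_iris(return_X_y=True)

# Manual version: hold out one fold at a time, train on the other nine,
# and store the predictions for the held-out samples.
manual = np.empty_like(y)
splitter = StratifiedKFold(n_splits=10)
for train_idx, test_idx in splitter.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    manual[test_idx] = model.predict(X[test_idx])

# Library version: the same loop, done by cross_val_predict.
auto = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)

# Both arrays hold one out-of-fold prediction per sample.
print(np.array_equal(manual, auto))
```

Because every prediction comes from a model that never saw that sample during training, scoring these predictions against y estimates performance on unseen data while still using the whole dataset.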

Let's illustrate this with iris, one of the built-in datasets in sklearn. It contains 150 samples with 4 features each. iris['data'] is X and iris['target'] is y.

In [15]: iris['data'].shape
Out[15]: (150, 4)

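To get predictions for the entire set through cross-validation, you can do the following:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets, metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; its functions now live in sklearn.model_selection
from sklearn.model_selection import cross_val_predict
iris = datasets.load_iris()
# each sample is predicted by a model trained on the 9 folds that do not contain it
predicted = cross_val_predict(LogisticRegression(), iris['data'], iris['target'], cv=10)
print(metrics.accuracy_score(iris['target'], predicted))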

Out [1] : 0.9537

print(metrics.classification_report(iris['target'], predicted))

Out [2] :
                     precision    recall  f1-score   support

                0       1.00      1.00      1.00        50
                1       0.96      0.90      0.93        50
                2       0.91      0.96      0.93        50

      avg / total       0.95      0.95      0.95       150

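So, back to your code. All you need is this:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
logreg = LogisticRegression()
# no prior logreg.fit call is needed: cross_val_predict fits a fresh copy of the model on each fold
predicted = cross_val_predict(logreg, X, y, cv=10)
print(metrics.accuracy_score(y, predicted))
print(metrics.classification_report(y, predicted))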
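The question also prints a confusion matrix and the sensitivity. As a rough sketch, the same numbers can be derived from the cross-validated predictions using sklearn.metrics alone. The data here is synthetic (make_classification with 5 features stands in for columns A to E, which are not available), so the values themselves are meaningless; only the steps matter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic binary data standing in for the asker's X and y (an assumption).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)

# Out-of-fold class predictions for every sample.
predicted = cross_val_predict(LogisticRegression(), X, y, cv=10)

# For binary labels 0/1, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y, predicted).ravel()

sensitivity = tp / (tp + fn)  # true positive rate, same as recall_score
specificity = tn / (tn + fp)  # true negative rate

print(confusion_matrix(y, predicted))
print("sensitivity:", sensitivity)
print("specificity:", specificity)
```

With real data, replace the make_classification call with the X and y built from the DataFrame.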

To plot ROC curves for multi-class classification, you can follow this tutorial, which shows how to draw an ROC curve for each class.
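For the binary case in the question, several ROC curves can share one figure by plotting each of them on the same axes. The sketch below rests on assumptions: the data is synthetic, and y2 and y3 are made-up noisy scores standing in for the externally supplied predictions. The logistic regression curve uses out-of-fold probabilities obtained with method="predict_proba", so it is not scored on samples the model was trained on.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic binary data standing in for the asker's X and y (an assumption).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)

# Out-of-fold probability of the positive class for every sample.
probs = cross_val_predict(LogisticRegression(), X, y, cv=10,
                          method="predict_proba")[:, 1]

# Hypothetical external scores: the true label plus noise.
rng = np.random.default_rng(0)
y2 = y + rng.normal(scale=1.0, size=y.shape)
y3 = y + rng.normal(scale=2.0, size=y.shape)

fig, ax = plt.subplots()
for name, scores in [("logreg (10-fold CV)", probs), ("y2", y2), ("y3", y3)]:
    fpr, tpr, _ = roc_curve(y, scores)
    auc = roc_auc_score(y, scores)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {auc:.2f})")

# Diagonal reference line for a classifier that guesses at random.
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.0)
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.legend(loc="lower right")
```

Call plt.show() or fig.savefig(...) afterwards to view the figure. If the external predictions are hard 0/1 labels rather than scores, roc_curve still works, but each curve reduces to a single operating point joined to the corners.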

In general, sklearn has very good tutorials and documentation. I strongly recommend reading their tutorial on cross_validation.

Regarding python - Evaluating logistic regression with cross-validation, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39163354/
