How to write a confusion matrix in Python?

Problem description

I wrote some code in Python to calculate a confusion matrix:

def conf_mat(prob_arr, input_arr):
    # 2x2 confusion matrix: rows are actual classes (1, 2), columns are predicted classes (1, 2)
    conf_arr = [[0, 0], [0, 0]]

    for i in range(len(prob_arr)):
        if int(input_arr[i]) == 1:
            if float(prob_arr[i]) < 0.5:
                conf_arr[0][1] = conf_arr[0][1] + 1
            else:
                conf_arr[0][0] = conf_arr[0][0] + 1
        elif int(input_arr[i]) == 2:
            if float(prob_arr[i]) >= 0.5:
                conf_arr[1][0] = conf_arr[1][0] + 1
            else:
                conf_arr[1][1] = conf_arr[1][1] + 1

    # fraction of samples on the diagonal (correctly classified)
    accuracy = float(conf_arr[0][0] + conf_arr[1][1]) / len(input_arr)
    return conf_arr, accuracy

prob_arr is the array returned by my classification code; a sample looks like this:

[1.0, 1.0, 1.0, 0.41592955657342651, 1.0, 0.0053405015805891975, 4.5321494433440449e-299, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.70943426182688163, 1.0, 1.0, 1.0, 1.0]

input_arr holds the original class labels of the dataset and looks like this:

[2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1]

What my code tries to do is this: I take prob_arr and input_arr, and for each class (1 and 2) I check whether the samples were misclassified.

But my code only works for two classes. If I run it on data with more than two classes, it doesn't work. How can I make this work for multiple classes?

For example, for a dataset with three classes, it should return: [[21,7,3],[3,38,6],[5,4,19]]
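
The two-class code above does two things at once: it thresholds each probability at 0.5 to get a predicted label, and it counts (actual, predicted) pairs. Separating the first step makes it easier to move to label-based tools. A minimal sketch, assuming prob_arr holds the probability of class 1 (the thresholds in the code suggest this, but the question does not state it); the helper name probs_to_labels and its parameters are hypothetical:

```python
def probs_to_labels(prob_arr, threshold=0.5, positive=1, negative=2):
    """Map each probability to a class label (hypothetical helper)."""
    # Assumption: a probability at or above the threshold means class `positive`.
    return [positive if float(p) >= threshold else negative for p in prob_arr]


# A few values taken from the sample prob_arr in the question.
prob_arr = [1.0, 0.41592955657342651, 0.0053405015805891975, 0.70943426182688163]
predicted = probs_to_labels(prob_arr)
print(predicted)  # [1, 2, 2, 1]
```

For more than two classes a single threshold is not enough; the classifier would need to output one predicted label per sample.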

Solution

Scikit-Learn provides a confusion_matrix function that works for any number of classes:

from sklearn.metrics import confusion_matrix
y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
confusion_matrix(y_actu, y_pred)

which outputs a NumPy array (rows are actual classes, columns are predicted classes):

array([[3, 0, 0],
       [0, 1, 2],
       [2, 1, 3]])
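
To see what this counting amounts to, the same result can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation; the function name confusion_matrix_py and its labels parameter are hypothetical. Rows are actual classes and columns are predicted classes, as in the array above.

```python
def confusion_matrix_py(y_actual, y_predicted, labels=None):
    """Count (actual, predicted) pairs; rows = actual, columns = predicted."""
    if len(y_actual) != len(y_predicted):
        raise ValueError("y_actual and y_predicted must have the same length")
    if labels is None:
        # Assumption: use every label seen in either list, in sorted order.
        labels = sorted(set(y_actual) | set(y_predicted))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_actual, y_predicted):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
cm = confusion_matrix_py(y_actu, y_pred)
print(cm)  # [[3, 0, 0], [0, 1, 2], [2, 1, 3]]

# Accuracy is the diagonal (correct predictions) divided by the sample count.
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(y_actu)
print(accuracy)
```

Because the matrix is indexed through a label-to-position mapping, the number of classes is not hard-coded, which is what the two-class version in the question was missing.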

But you can also create a confusion matrix using Pandas:

import pandas as pd
y_actu = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2], name='Actual')
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2], name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)

You will get a (nicely labeled) Pandas DataFrame:

Predicted  0  1  2
Actual
0          3  0  0
1          0  1  2
2          2  1  3

If you add margins=True, as in

df_confusion = pd.crosstab(y_actu, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)

you also get the sum of each row and column:

Predicted  0  1  2  All
Actual
0          3  0  0    3
1          0  1  2    3
2          2  1  3    6
All        5  2  5   12

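You can also get a row-normalized confusion matrix, in which each row (actual class) sums to 1, using:

df_confusion = pd.crosstab(y_actu, y_pred)  # plain counts, without margins
df_conf_norm = df_confusion.div(df_confusion.sum(axis=1), axis=0)  # divide each row by its total

Predicted         0         1         2
Actual
0          1.000000  0.000000  0.000000
1          0.000000  0.333333  0.666667
2          0.333333  0.166667  0.500000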
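
Row normalization divides every count by the total of its actual class, so each row sums to 1 and the diagonal holds the per-class recall. A plain-Python sketch of that arithmetic, independent of pandas; the helper name normalize_rows is hypothetical, and leaving an empty row as zeros is an assumption rather than any library's behavior:

```python
def normalize_rows(matrix):
    """Divide each row by its sum so every non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Assumption: a class with no actual samples is left as all zeros.
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


cm = [[3, 0, 0], [0, 1, 2], [2, 1, 3]]
cm_norm = normalize_rows(cm)
for row in cm_norm:
    print([round(value, 6) for value in row])
# [1.0, 0.0, 0.0]
# [0.0, 0.333333, 0.666667]
# [0.333333, 0.166667, 0.5]
```

Note that a plain `/` between a DataFrame and the Series of its row sums aligns the Series with the column labels, which is why an explicit row-wise division is needed to get rows that sum to 1.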

You can plot this confusion matrix using:

import matplotlib.pyplot as plt
import numpy as np  # needed for np.arange below

def plot_confusion_matrix(df_confusion, title='Confusion matrix', cmap=plt.cm.gray_r):
    plt.matshow(df_confusion, cmap=cmap) # imshow
    #plt.title(title)
    plt.colorbar()
    # one tick per class, labelled with the DataFrame's column and index values
    tick_marks = np.arange(len(df_confusion.columns))
    plt.xticks(tick_marks, df_confusion.columns, rotation=45)
    plt.yticks(tick_marks, df_confusion.index)
    #plt.tight_layout()
    plt.ylabel(df_confusion.index.name)
    plt.xlabel(df_confusion.columns.name)

plot_confusion_matrix(df_confusion)

Or plot the normalized confusion matrix using:

plot_confusion_matrix(df_conf_norm)

You might also be interested in this project https://github.com/pandas-ml/pandas-ml and its Pip package https://pypi.python.org/pypi/pandas_ml

With this package, a confusion matrix can be pretty-printed and plotted. You can also binarize a confusion matrix and get class statistics such as TP, TN, FP, FN, ACC, TPR, FPR, FNR, TNR (SPC), LR+, LR-, DOR, PPV, FDR, FOR, NPV, as well as some overall statistics:

In [1]: from pandas_ml import ConfusionMatrix
In [2]: y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
In [3]: y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
In [4]: cm = ConfusionMatrix(y_actu, y_pred)
In [5]: cm.print_stats()
Confusion Matrix:

Predicted  0  1  2  __all__
Actual
0          3  0  0        3
1          0  1  2        3
2          2  1  3        6
__all__    5  2  5       12


Overall Statistics:

Accuracy: 0.583333333333
95% CI: (0.27666968568210581, 0.84834777019156982)
No Information Rate: ToDo
P-Value [Acc > NIR]: 0.189264302376
Kappa: 0.354838709677
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                        0          1          2
Population                                    12         12         12
P: Condition positive                          3          3          6
N: Condition negative                          9          9          6
Test outcome positive                          5          2          5
Test outcome negative                          7         10          7
TP: True Positive                              3          1          3
TN: True Negative                              7          8          4
FP: False Positive                             2          1          2
FN: False Negative                             0          2          3
TPR: (Sensitivity, hit rate, recall)           1  0.3333333        0.5
TNR=SPC: (Specificity)                 0.7777778  0.8888889  0.6666667
PPV: Pos Pred Value (Precision)              0.6        0.5        0.6
NPV: Neg Pred Value                            1        0.8  0.5714286
FPR: False-out                         0.2222222  0.1111111  0.3333333
FDR: False Discovery Rate                    0.4        0.5        0.4
FNR: Miss Rate                                 0  0.6666667        0.5
ACC: Accuracy                          0.8333333       0.75  0.5833333
F1 score                                    0.75        0.4  0.5454545
MCC: Matthews correlation coefficient  0.6831301  0.2581989  0.1690309
Informedness                           0.7777778  0.2222222  0.1666667
Markedness                                   0.6        0.3  0.1714286
Prevalence                                  0.25       0.25        0.5
LR+: Positive likelihood ratio               4.5          3        1.5
LR-: Negative likelihood ratio                 0       0.75       0.75
DOR: Diagnostic odds ratio                   inf          4          2
FOR: False omission rate                       0        0.2  0.4285714
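
The per-class rows in this report come from treating each class as "positive" and all others as "negative" (one-vs-rest). A plain-Python sketch of how TP, FP, FN, and TN follow from the matrix; this is not the pandas_ml implementation, and the helper name class_counts is hypothetical:

```python
def class_counts(matrix):
    """Return one-vs-rest TP, FP, FN, TN per class for a square matrix
    whose rows are actual classes and columns are predicted classes."""
    total = sum(sum(row) for row in matrix)
    stats = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # actual k, predicted as another class
        fp = sum(row[k] for row in matrix) - tp  # predicted k, actually another class
        tn = total - tp - fn - fp
        stats.append({"TP": tp, "FP": fp, "FN": fn, "TN": tn})
    return stats


cm = [[3, 0, 0], [0, 1, 2], [2, 1, 3]]
stats = class_counts(cm)
for k, s in enumerate(stats):
    # No zero-division guard here: every class in this example has
    # at least one actual and one predicted sample.
    recall = s["TP"] / (s["TP"] + s["FN"])     # TPR
    precision = s["TP"] / (s["TP"] + s["FP"])  # PPV
    print(k, s, round(recall, 7), round(precision, 7))
```

For the example data this reproduces the TP, TN, FP, FN, TPR, and PPV rows of the report above.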

I noticed that a new Python library for confusion matrices, PyCM, has been released; it may be worth a look.
