Problem description
In a multilabel classification setting, sklearn.metrics.accuracy_score only computes the subset accuracy (3): the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
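This way of computing the accuracy is sometimes named, perhaps less ambiguously, the exact match ratio (1):
$$ \text{ExactMatchRatio} = \frac{1}{|D|} \sum_{i=1}^{|D|} I(x_i = y_i), $$
where \(|D|\) is the number of samples, \(y_i\) is the set of true labels for sample \(i\), \(x_i\) is the set of predicted labels, and \(I\) is the indicator function.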
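As a rough illustration of the exact match ratio, the sketch below computes it by hand with NumPy. The helper name `exact_match_ratio` and the small arrays are made up for this example; for multilabel indicator arrays the result should agree with what `sklearn.metrics.accuracy_score` returns.

```python
import numpy as np

def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose predicted label set equals the true one.

    Hypothetical helper for illustration; expects 2-D binary indicator arrays.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # A sample counts only if every label in its row matches.
    return float(np.all(y_true == y_pred, axis=1).mean())

y_true = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]])
y_pred = np.array([[0, 1, 1], [0, 1, 1], [0, 1, 0], [0, 0, 0]])
print(exact_match_ratio(y_true, y_pred))  # 0.25: only the second row matches exactly
```

A prediction that gets two of three labels right in a row still scores 0 for that row, which is why this metric can look harsh on multilabel problems.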
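Is there a way in scikit-learn to get the other typical way of computing the accuracy, namely
$$ \text{Accuracy} = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|y_i \cap x_i|}{|y_i \cup x_i|}, $$
where \(y_i\) and \(x_i\) are the sets of true and predicted labels for sample \(i\) and \(|D|\) is the number of samples (as defined in (1) and (2), and less ambiguously referred to as the Hamming score (4), since it is closely related to the Hamming loss, or as label-based accuracy)?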
(1) Sorower, Mohammad S. "A literature survey on algorithms for multi-label learning." Oregon State University, Corvallis (2010).
(2) Tsoumakas, Grigorios, and Ioannis Katakis. "Multi-label classification: An overview." Dept. of Informatics, Aristotle University of Thessaloniki, Greece (2006).
(3) Ghamrawi, Nadia, and Andrew McCallum. "Collective multi-label classification." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005.
(4) Godbole, Shantanu, and Sunita Sarawagi. "Discriminative methods for multi-labeled classification." Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2004. 22-30.
You can write a version yourself. Here is an example that ignores the sample_weight and normalize parameters.
import numpy as np

y_true = np.array([[0,1,0],
                   [0,1,1],
                   [1,0,1],
                   [0,0,1]])

y_pred = np.array([[0,1,1],
                   [0,1,1],
                   [0,1,0],
                   [0,0,0]])

def hamming_score(y_true, y_pred, normalize=True, sample_weight=None):
    '''
    Compute the Hamming score (a.k.a. label-based accuracy) for the multi-label case
    http://stackoverflow.com/q/32239577/395857
    Note: normalize and sample_weight are accepted but not used.
    '''
    acc_list = []
    for i in range(y_true.shape[0]):
        # Indices of the labels that are set for this sample
        set_true = set( np.where(y_true[i])[0] )
        set_pred = set( np.where(y_pred[i])[0] )
        #print('\nset_true: {0}'.format(set_true))
        #print('set_pred: {0}'.format(set_pred))
        tmp_a = None
        if len(set_true) == 0 and len(set_pred) == 0:
            # No true and no predicted labels: count as a perfect match
            tmp_a = 1
        else:
            tmp_a = len(set_true.intersection(set_pred))/\
                    float( len(set_true.union(set_pred)) )
        #print('tmp_a: {0}'.format(tmp_a))
        acc_list.append(tmp_a)
    return np.mean(acc_list)

if __name__ == "__main__":
    print('Hamming score: {0}'.format(hamming_score(y_true, y_pred))) # 0.375 (= (0.5+1+0+0)/4)

    # For comparison's sake:
    import sklearn.metrics

    # Subset accuracy
    # 0.25 (= (0+1+0+0) / 4) --> 1 if the prediction for one sample fully matches the gold, 0 otherwise.
    print('Subset accuracy: {0}'.format(sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)))

    # Hamming loss (smaller is better)
    # $$ \text{HammingLoss}(x_i, y_i) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{xor(x_i, y_i)}{|L|}, $$
    # where
    #  - \(|D|\) is the number of samples
    #  - \(|L|\) is the number of labels
    #  - \(y_i\) is the ground truth
    #  - \(x_i\) is the prediction.
    # 0.416666666667 (= (1+0+3+1) / (3*4) )
    print('Hamming loss: {0}'.format(sklearn.metrics.hamming_loss(y_true, y_pred)))
Outputs:
Hamming score: 0.375
Subset accuracy: 0.25
Hamming loss: 0.416666666667
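Since the version above accepts normalize and sample_weight without using them, here is a minimal vectorized sketch that does honor them. The function name `hamming_score_vectorized` is made up, and the semantics chosen for the two parameters (weighted average when normalize=True, weighted sum of per-sample scores otherwise) are an assumption modeled on how accuracy_score treats them, not something the answer above defines.

```python
import numpy as np

def hamming_score_vectorized(y_true, y_pred, normalize=True, sample_weight=None):
    """Per-sample |true AND pred| / |true OR pred|, combined over samples.

    Illustrative sketch: the handling of normalize and sample_weight is an
    assumption, not part of the original answer.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    intersection = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)
    # Samples with no true and no predicted labels score 1, as in the loop version.
    scores = np.where(union == 0, 1.0, intersection / np.maximum(union, 1))
    if not normalize:
        # Assumed semantics: return the (weighted) sum of per-sample scores.
        if sample_weight is None:
            return float(scores.sum())
        return float(np.dot(scores, sample_weight))
    return float(np.average(scores, weights=sample_weight))

y_true = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]])
y_pred = np.array([[0, 1, 1], [0, 1, 1], [0, 1, 0], [0, 0, 0]])
print(hamming_score_vectorized(y_true, y_pred))                   # 0.375
print(hamming_score_vectorized(y_true, y_pred, normalize=False))  # 1.5
```

In recent scikit-learn versions, sklearn.metrics.jaccard_score(y_true, y_pred, average='samples') should give the same per-sample intersection-over-union average on this data; how it scores a sample with no true and no predicted labels depends on its zero_division argument, so check that case if it can occur in your data.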