Problem description
I want my classification algorithm to assign natural-language raw text to one of a set of categories if and only if the prediction meets a certain confidence threshold for that category (say 80%). Otherwise, I want the classifier to assign that particular raw text to an 'unclassified' category. How do I do this?
My sample dataset:
+----------------------+------------+
| Details | Category |
+----------------------+------------+
| Any raw text1 | cat1 |
+----------------------+------------+
| any raw text2 | cat1 |
+----------------------+------------+
| any raw text5 | cat2 |
+----------------------+------------+
| any raw text7 | cat1 |
+----------------------+------------+
| any raw text8 | cat2 |
+----------------------+------------+
| Any raw text4 | cat4 |
+----------------------+------------+
| any raw text5 | cat4 |
+----------------------+------------+
| any raw text6 | cat3 |
+----------------------+------------+
This would be my training data; I'll split the same data into a training set and a test set.
import time
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load the tab-separated file, keeping only the text and label columns
data = pd.read_csv('mydata.xls.gold', delimiter='\t',
                   usecols=['Details', 'Category'], encoding='utf-8')
target_one = data['Category']
target_list = data['Category'].unique()
# 'Category' is the only label column loaded above, so it is used as the target
x_train, x_test, y_train, y_test = train_test_split(
    data.Details, data.Category, random_state=42)
vect = CountVectorizer(ngram_range=(1, 2))
# convert the training text into a numeric count matrix
X_train = vect.fit_transform(x_train.values.astype('U'))
# convert the test text using the vocabulary learned from the training set
X_test = vect.transform(x_test.values.astype('U'))
start = time.perf_counter()  # time.clock() was removed in Python 3.8
mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, y_train)
result = mnb.predict(X_test)
print(time.perf_counter() - start)
# mnb.predict_proba(X_test)[0:10, 1]
accuracy_score(y_test, result)
How do I proceed? Is there a parameter that needs to be set on the classifier? Thanks in advance.
Recommended answer
You can take the predict_proba result and create a pandas DataFrame from it, with the class labels as column names. Then use max and idxmax to find the category with the highest probability for each element in the test set. Once that is done, you can use boolean masking to set the category of every row whose highest probability is below the threshold to "unclassified".
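import pandas as pd
# predict_proba orders its columns by mnb.classes_ (sorted labels),
# which may differ from the order of target_list
df = pd.DataFrame(mnb.predict_proba(X_test), columns=mnb.classes_)
res_df = pd.DataFrame()
res_df['max_prob'] = df.max(axis=1)
res_df['max_prob_cat'] = df.idxmax(axis=1)
# rows whose best probability is below the threshold become 'unclassified'
res_df.loc[res_df['max_prob'] < .8, 'max_prob_cat'] = 'unclassified'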
The first few rows of df look like this:
cat1 cat2 cat3 cat4
0 1.091685e-06 2.257549e-04 9.994661e-01 3.070665e-04
1 2.288312e-02 9.752170e-01 1.783878e-03 1.159706e-04
2 1.980685e-01 3.494765e-01 4.416871e-01 1.076788e-02
3 2.205478e-07 9.999601e-01 3.276864e-05 6.920325e-06
4 2.736805e-03 9.795997e-01 1.718200e-02 4.815429e-04
res_df will look like this:
max_prob max_prob_cat
0 0.999466 cat3
1 0.975217 cat2
2 0.441687 unclassified
3 0.999960 cat2
4 0.979600 cat2
5 0.999956 cat2
6 0.998864 cat3
7 0.996888 cat3
8 0.999422 cat1
9 0.994412 cat3
10 0.954508 cat2
11 0.999999 cat2
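The same idea can be wrapped in a small helper so it runs end to end. This is only a minimal sketch under stated assumptions: the function name predict_with_threshold, its threshold and fallback parameters, and the toy sentences and labels are all hypothetical and not part of the original question or answer. It works on the NumPy array returned by predict_proba and looks labels up through classes_, so the column order is always the classifier's own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def predict_with_threshold(model, X, threshold=0.8, fallback="unclassified"):
    """Return the most probable label per row, or `fallback` when the
    highest class probability is below `threshold` (hypothetical helper)."""
    proba = model.predict_proba(X)           # shape: (n_samples, n_classes)
    best = proba.argmax(axis=1)              # column index of the top class
    # astype(object) so a fallback longer than the class names is not truncated
    labels = model.classes_[best].astype(object)
    labels[proba.max(axis=1) < threshold] = fallback
    return labels


# Toy data, made up purely for illustration
train_texts = [
    "cheap flights to paris", "hotel booking in rome", "flight and hotel deals",
    "python pandas dataframe", "numpy array indexing", "sklearn model training",
]
train_labels = ["travel", "travel", "travel", "code", "code", "code"]

vect = CountVectorizer()
model = MultinomialNB().fit(vect.fit_transform(train_texts), train_labels)

test_texts = ["cheap hotel in paris", "pandas array indexing", "unrelated words entirely"]
predictions = predict_with_threshold(model, vect.transform(test_texts))
print(list(predictions))
```

The last test sentence shares no words with the training vocabulary, so the model falls back to the class priors (0.5 each here) and the row is marked "unclassified". Note that the threshold applies to the model's estimated probability, which is not the same thing as accuracy; Naive Bayes probabilities in particular tend to be pushed toward 0 and 1, so it may be worth checking accuracy on the rows that pass the threshold before settling on a value.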