Problem description
The following example shows how to train a classifier with the scikit-learn 20 newsgroups data.
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train',
...                                       categories=categories)
>>> vectorizer = TfidfVectorizer()
>>> vectors = vectorizer.fit_transform(newsgroups_train.data)
>>> vectors.shape
(2034, 34118)
However, I have my own labeled corpus that I would like to use.
After getting TF-IDF vectors for my own data, would I train a classifier like this?
classif_nb = nltk.NaiveBayesClassifier.train(vectorizer)
To recap: how can I use my own corpus instead of 20newsgroups, in the same way as shown here? And how can I then use my TF-IDF-vectorized corpus to train a classifier?
Thanks!
Recommended answer
To address the questions from the comments, the basic process for using a tf-idf representation in a classification task is:
- You fit the vectorizer to your training data and save it in a variable; let's call it tfidf
- You transform the training data (just the text, without labels) through data = tfidf.transform(...)
- You fit the model (classifier) using some_classifier.fit(data, labels), where labels are in the same order as the documents in data
- During testing, you call tfidf.transform(...) on the new data and check the predictions of your model
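The steps above can be sketched roughly as follows. This is only a minimal illustration, not the one correct setup: the toy texts and labels are made up to stand in for your own labeled corpus, and scikit-learn's MultinomialNB is an assumed choice of classifier (it accepts the sparse matrix that TfidfVectorizer produces, unlike the nltk call in the question).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; replace with your own texts and labels.
# Labels must be in the same order as the documents.
train_texts = [
    "the rocket launched into orbit",
    "nasa plans a new space mission",
    "the gpu renders 3d graphics quickly",
    "shaders and textures for image rendering",
]
train_labels = ["space", "space", "graphics", "graphics"]

# Step 1: fit the vectorizer on the training texts only.
tfidf = TfidfVectorizer()
tfidf.fit(train_texts)

# Step 2: transform the training texts into a sparse tf-idf matrix.
data = tfidf.transform(train_texts)

# Step 3: fit the classifier on the matrix and the labels.
classifier = MultinomialNB()
classifier.fit(data, train_labels)

# Step 4: reuse the already fitted vectorizer on new data, then predict.
new_texts = ["a new orbit for the space mission"]
predictions = classifier.predict(tfidf.transform(new_texts))
print(predictions)
```

Steps 1 and 2 can also be combined into a single call, data = tfidf.fit_transform(train_texts), as in the question. The important part is that new data only ever goes through transform, never fit, so that it is mapped onto the vocabulary learned from the training set.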