Question
I have a bunch of files with articles. For each article there should be some features, like text length and text_spam (all are ints or floats, and in most cases they should be loaded from CSV). What I want to do is combine these features with CountVectorizer and then classify those texts.
I have watched some tutorials, but I still have no idea how to implement this. I found something here, but can't actually adapt it to my needs.
Any ideas how that could be done with scikit?
Thanks.
What I have at the moment is:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
measurements = [
    {'text_length': 1000, 'text_spam': 4.3},
    {'text_length': 2000, 'text_spam': 4.1},
]

corpus = [
    'some text',
    'some text 2 hooray',
]

vectorizer = DictVectorizer()
count_vectorizer = CountVectorizer(min_df=1)

first_x = vectorizer.fit_transform(measurements)
second_x = count_vectorizer.fit_transform(corpus)

combined_features = FeatureUnion([('first', first_x), ('second', second_x)])
In this bunch of code I do not understand how to load "real" data, since the training set is already loaded above. And second: how do I load the categories (the y parameter for the fit function)?
Answer
You're misunderstanding FeatureUnion. It's supposed to take two transformers, not two batches of samples.
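For reference, a minimal sketch of what FeatureUnion does expect: (name, transformer) pairs, each fitted on the same raw input, with the resulting feature matrices stacked side by side. The TfidfVectorizer here is just a stand-in second transformer for illustration, not part of your setup:

from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['some text', 'some text 2 hooray']

# FeatureUnion calls fit_transform on each transformer with the same
# input and hstacks the resulting feature matrices.
union = FeatureUnion([
    ('counts', CountVectorizer(min_df=1)),
    ('tfidf', TfidfVectorizer(min_df=1)),
])
X = union.fit_transform(corpus)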
You can force it into dealing with the vectorizers you have, but it's much easier to just throw all your features into one big bag per sample and use a single DictVectorizer to make vectors out of those bags.
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# make a CountVectorizer-style tokenizer
tokenize = CountVectorizer().build_tokenizer()

def features(document):
    terms = tokenize(document)
    # whatever_this_means is the original placeholder for your spam score
    d = {'text_length': len(terms), 'text_spam': whatever_this_means}
    for t in terms:
        d[t] = d.get(t, 0) + 1  # bag-of-words counts alongside the extra features
    return d

vect = DictVectorizer()
X_train = vect.fit_transform(features(d) for d in documents)
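From there, training and prediction follow the usual scikit-learn pattern. A rough sketch, assuming `labels` is a list of classes parallel to `documents` and `new_documents` holds unseen texts (both loaded by you); LogisticRegression is just an example classifier:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, labels)  # labels: one category per document

# for new texts, reuse the fitted DictVectorizer: transform, never fit_transform
X_new = vect.transform(features(d) for d in new_documents)
predictions = clf.predict(X_new)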
Don't forget to normalize this with sklearn.preprocessing.Normalizer, and be aware that even after normalization, those text_length features are bound to dominate the other features in terms of scale. It might be wiser to use 1. / text_length or np.log(text_length) instead.
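A minimal sketch of both suggestions; the Normalizer call applies to the X_train built above, and the log variant assumes you adjust features() accordingly:

import numpy as np
from sklearn.preprocessing import Normalizer

# scale each sample (row) to unit norm so no single feature dominates by magnitude
X_train = Normalizer().fit_transform(X_train)

# or tame the length feature at the source, inside features():
#     d = {'text_length': np.log(len(terms)), ...}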
As for loading the data: it depends on how your data is organized. scikit-learn has a lot of helper functions and classes, but it does expect you to write code if your setup is non-standard.
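For example, if the per-article features and labels live in one CSV next to the text files, loading could look roughly like this; the metadata.csv layout and column names are hypothetical:

import csv

documents, spam_scores, labels = [], [], []
# hypothetical layout: metadata.csv with columns filename,text_spam,label
with open('metadata.csv', newline='') as f:
    for row in csv.DictReader(f):
        with open(row['filename'], encoding='utf-8') as doc:
            documents.append(doc.read())
        spam_scores.append(float(row['text_spam']))
        labels.append(row['label'])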