Problem description
I am getting started with scikit-learn and am trying to transform a set of documents into a format to which I can apply clustering and classification. I have read about the vectorization methods and the tf-idf transformations that load the files and index their vocabularies.
However, I have extra metadata for each document, such as the authors, the division that was responsible, a list of topics, etc.
How can I add features to each document vector generated by the vectorizing function?
Recommended answer
You could use DictVectorizer to encode the extra categorical data, and then use scipy.sparse.hstack to combine the resulting matrix with the text feature matrix.
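A minimal sketch of that approach follows. The documents, the metadata values, and the variable names are all made up for illustration, and TfidfVectorizer is assumed to be the text vectorizer in use. Each topic is encoded as its own key with value 1, which is one simple way to represent a list of topics.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for this example.
docs = [
    "sparse matrices store only nonzero entries",
    "clustering groups similar documents together",
    "classification assigns a label to each document",
]

# Hypothetical metadata, one dict per document, in the same order as docs.
# String values are one-hot encoded by DictVectorizer; each topic is given
# its own key with a numeric value of 1.
metadata = [
    {"author": "alice", "division": "research", "topic=storage": 1},
    {"author": "bob", "division": "engineering", "topic=clustering": 1},
    {"author": "alice", "division": "engineering",
     "topic=classification": 1, "topic=clustering": 1},
]

# Text features: one row per document, one column per vocabulary term.
text_vectorizer = TfidfVectorizer()
X_text = text_vectorizer.fit_transform(docs)

# Metadata features: one row per document, one column per key or key=value pair.
meta_vectorizer = DictVectorizer()
X_meta = meta_vectorizer.fit_transform(metadata)

# Stack the two sparse matrices side by side; rows must line up by document.
X = hstack([X_text, X_meta]).tocsr()
```

The combined matrix X has one row per document and can be passed to an estimator that accepts sparse input. For new documents, call transform (not fit_transform) on both vectorizers before stacking so the columns stay aligned with the training data.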