Adding features to a set of vectorized documents

Question

I am getting started with scikit-learn and am trying to transform a set of documents into a format to which I can apply clustering and classification. I have read about the vectorization methods and the tf-idf transformations used to load the files and index their vocabularies.

However, I have extra metadata for each document, such as the authors, the division that was responsible, a list of topics, etc.

How can I add features to each document vector generated by the vectorizing function?

Recommended answer

You could use DictVectorizer to encode the extra categorical data, and then use scipy.sparse.hstack to combine the resulting matrix with the text feature matrix.
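A minimal sketch of that approach follows. The toy documents, the metadata keys (`author`, `division`, `topic:...`), and the choice of TfidfVectorizer for the text are assumptions made for illustration; they do not come from the question. Topics are encoded here as one flag per topic, which is only one possible way to represent a list.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; in practice these would be the loaded documents.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "quarterly revenue grew in the sales division",
]

# Hypothetical per-document metadata, in the same order as `docs`.
# String values are one-hot encoded by DictVectorizer; numeric values
# (the topic flags here) are kept as-is in their own column.
metadata = [
    {"author": "alice", "division": "research", "topic:animals": 1},
    {"author": "bob", "division": "research", "topic:animals": 1},
    {"author": "alice", "division": "sales", "topic:finance": 1},
]

# Text features: one row per document, one column per vocabulary term.
text_vectorizer = TfidfVectorizer()
X_text = text_vectorizer.fit_transform(docs)

# Metadata features: one row per document, one column per key or key=value pair.
meta_vectorizer = DictVectorizer()
X_meta = meta_vectorizer.fit_transform(metadata)

# Stack the two sparse matrices side by side; rows must line up by document.
X = hstack([X_text, X_meta]).tocsr()

print(X_text.shape, X_meta.shape, X.shape)
```

The combined matrix `X` can then be passed to a clustering or classification estimator. Because tf-idf values and one-hot flags are on different scales, it may be worth weighting or scaling the metadata block depending on the model.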


08-01 20:32