python - Providing a vocabulary of words with spaces to scikit-learn CountVectorizer

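Referring to this post, I would like to know how to give the CountVectorizer model a vocabulary that contains terms with spaces, such as distributed systems or machine learning. Here is an example: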

import numpy as np
from itertools import chain

tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

vocabulary = list(map(lambda x: x.split(', '), tags))
vocabulary = list(np.unique(list(chain(*vocabulary))))

We can pass this vocabulary to the model:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(vocabulary=vocabulary)
print(vec.fit_transform(tags).toarray())

Here, I lose the count for the term distributed systems (the first column). The result is:

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [0 1 1 0 1 0]]

Do I have to change token_pattern, or something else?

Best answer

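I think that essentially you already have a predefined vocabulary to analyze, and you want to tokenize each tag string by splitting on ', '.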
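To see why the term disappears with the default settings, here is a minimal sketch that uses only the standard `re` module. It assumes the default `token_pattern` documented for CountVectorizer, `(?u)\b\w\w+\b`, and applies it directly to one tag string; the names `DEFAULT_TOKEN_PATTERN` and `tag_string` are made up for the illustration.

```python
import re

# Default token_pattern documented for CountVectorizer:
# a token is a run of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

tag_string = "distributed systems, linux, networking, tools"

# The default pattern breaks on the space, so the two-word tag is split.
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, tag_string)

# Splitting on ', ' keeps each tag intact, spaces included.
split_tokens = tag_string.split(", ")

print(default_tokens)  # ['distributed', 'systems', 'linux', 'networking', 'tools']
print(split_tokens)    # ['distributed systems', 'linux', 'networking', 'tools']
```

Neither 'distributed' nor 'systems' is in the vocabulary, so nothing is ever counted in the first column.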

You can trick CountVectorizer into doing that with:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(vocabulary=vocabulary, tokenizer=lambda x: x.split(', '))
print(vec.fit_transform(tags).toarray())

which gives:

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]
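If you want something a little more robust than the lambda, the following is a sketch of the same idea with a named tokenizer. `split_tags` is a hypothetical helper, not part of scikit-learn; it also trims stray whitespace around each tag. Passing `token_pattern=None` is an assumption that you are on a recent scikit-learn release, where it silences the warning that the pattern is ignored once a tokenizer is given.

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]


def split_tags(text):
    """Hypothetical helper: split on commas and trim whitespace around each tag."""
    return [tag.strip() for tag in text.split(",") if tag.strip()]


# Sorted unique tags; a list vocabulary maps each term to its position.
vocabulary = sorted({tag for text in tags for tag in split_tags(text)})

vec = CountVectorizer(
    vocabulary=vocabulary,
    tokenizer=split_tags,
    token_pattern=None,  # assumption: accepted by recent scikit-learn versions
)
counts = vec.fit_transform(tags).toarray()

# Print each vocabulary term next to its column of per-document counts.
for term, column in zip(vocabulary, counts.T):
    print(term, column.tolist())
```

A named function also has the side benefit that the fitted vectorizer can be pickled, which the standard `pickle` module cannot do with a lambda.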

Regarding python - Providing a vocabulary of words with spaces to scikit-learn CountVectorizer, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37873007/