python - How can I hand-pick the features used by TfidfVectorizer in scikit-learn?

I am trying to cluster documents by keyword. I build the tf-idf matrix with the following code:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=.8, max_features=1000,
                             min_df=0.07, stop_words='english',
                             use_idf=True, tokenizer=tokenize_and_stem,
                             ngram_range=(1,2))

tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print(tfidf_matrix.shape) returns (567, 209), which means there are 567 documents, each represented as a mix of the 209 feature terms that scikit-learn's TfidfVectorizer detected.

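I then use terms = tfidf_vectorizer.get_feature_names() to get the list of terms (in scikit-learn 1.0 and later the method is get_feature_names_out(); get_feature_names() was removed in 1.2). Running print(len(terms)) gives 209.
Many of these terms are unnecessary for the task and add noise to the clustering. I went through the list by hand and kept only the meaningful feature names, which produced a new terms list. Now print(len(terms)) gives 67.
However, running tfidf_vectorizer.fit_transform(documents) still gives a shape of (567, 209), which means fit_transform(documents) is still using the noisy list of 209 terms rather than the 67 hand-picked ones.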

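How can I run tfidf_vectorizer.fit_transform(documents) with the list of 67 hand-picked terms? I suspect this would require adding at least one function to the scikit-learn package on my machine. Is that right?

Any help is greatly appreciated. Thanks!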

Best answer

There are two ways:

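If you have identified a list of stop words (what you call "unnecessary for the task"), pass them to the stop_words parameter of TfidfVectorizer so they are ignored when the bag of words is built. Note, however, that the predefined English stop words are no longer used once stop_words is set to a custom list. To combine the predefined English list with your own stop words, concatenate the two lists: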

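# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text;
# the old sklearn.feature_extraction.stop_words module has been removed.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
stop_words = list(ENGLISH_STOP_WORDS) + ['your', 'additional', 'stopwords']
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words)  # add your other params here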
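
As a rough illustration of the stop-word approach, the sketch below fits a vectorizer on a tiny made-up corpus and checks that both the built-in English stop words and the extra ones stay out of the learned vocabulary. The corpus `docs` and the list `extra_stop_words` are assumptions invented for this example, not part of the original question.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat played",
]

# Hypothetical extra words judged to be noise for the task.
extra_stop_words = ["sat", "played"]

# Combine the built-in English list with the custom words.
stop_words = list(ENGLISH_STOP_WORDS) + extra_stop_words

vectorizer = TfidfVectorizer(stop_words=stop_words)
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index.
print(sorted(vectorizer.vocabulary_))
print(matrix.shape)  # (number of documents, number of kept terms)
```

With a custom tokenizer such as the stemming one in the question, the stop words presumably need to be given in the same (stemmed) form the tokenizer emits, otherwise they will not match.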

If you have a fixed vocabulary and only want to count those words (i.e. your terms list), pass it to the vocabulary parameter of TfidfVectorizer:

tfidf_vectorizer = TfidfVectorizer(vocabulary=terms) # add your other params here
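
A minimal sketch of the fixed-vocabulary approach, assuming a made-up corpus `docs` and a short hand-picked `terms` list (both hypothetical), shows that the resulting matrix has exactly one column per supplied term:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat played",
]

# Hypothetical hand-picked terms, standing in for the 67-term list.
terms = ["cat", "dog", "mat"]

# With a fixed vocabulary, only these terms become columns.
vectorizer = TfidfVectorizer(vocabulary=terms)
matrix = vectorizer.fit_transform(docs)

print(matrix.shape)            # (number of documents, len(terms))
print(vectorizer.vocabulary_)  # term -> column index, in the order given
```

Applied to the question's data, this should yield a (567, 67) matrix. Two caveats worth checking: the terms must match what the analyzer actually produces (stemmed forms when a stemming tokenizer is used, and ngram_range=(1, 2) if the list contains bigrams), and according to the scikit-learn documentation max_df, min_df and max_features are ignored when vocabulary is supplied.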

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/47917287/