This article describes how to pass pre-tokenized text (a list of tokens) to scikit-learn's CountVectorizer.

Problem Description

I have a text classification problem where I have two types of features:

  • n-gram features (extracted by CountVectorizer)
  • other textual features (e.g. the presence of words from a given lexicon). These features differ from n-grams in that they should be part of any n-gram extracted from the text.

Both types of features are extracted from the text's tokens. I want to run tokenization only once, and then pass these tokens to CountVectorizer and to the other presence-feature extractor. So, I want to pass a list of tokens to CountVectorizer, but it only accepts a string as the representation of a sample. Is there a way to pass an array of tokens?

Recommended Answer

Summarizing the answers of @user126350 and @miroli and this link:

from sklearn.feature_extraction.text import CountVectorizer

def dummy(doc):
    # Identity function: the document is already a list of tokens,
    # so just return it unchanged.
    return doc

cv = CountVectorizer(
        tokenizer=dummy,      # skip tokenization
        preprocessor=dummy,   # skip preprocessing (e.g. lowercasing)
    )

docs = [
    ['hello', 'world', '.'],
    ['hello', 'world'],
    ['again', 'hello', 'world']
]

cv.fit(docs)
list(cv.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0; removed in 1.2
# ['.', 'again', 'hello', 'world']

The one thing to keep in mind is to wrap the new tokenized document in a list before calling the transform() function, so that it is handled as a single document instead of each token being interpreted as a document:

new_doc = ['again', 'hello', 'world', '.']
v_1 = cv.transform(new_doc)    # wrong: each token treated as its own document
v_2 = cv.transform([new_doc])  # right: one document of four tokens

v_1.shape
# (4, 4)

v_2.shape
# (1, 4)
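Returning to the original motivation, both feature types can be computed from the same tokens and concatenated. The sketch below is illustrative only: `LEXICON` and `lexicon_presence` are hypothetical stand-ins for the question's "words from a given lexicon", and the matrices are joined with `scipy.sparse.hstack`:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

def dummy(doc):
    # Identity function for pre-tokenized input.
    return doc

# Hypothetical lexicon standing in for "words from a given dictionary".
LEXICON = ['again', 'hello']

def lexicon_presence(docs):
    # One binary column per lexicon word, computed from the same tokens.
    return np.array([[int(w in doc) for w in LEXICON] for doc in docs])

docs = [
    ['hello', 'world', '.'],
    ['hello', 'world'],
    ['again', 'hello', 'world'],
]

cv = CountVectorizer(tokenizer=dummy, preprocessor=dummy)
X_ngrams = cv.fit_transform(docs)      # sparse, 4 n-gram columns
X_lex = lexicon_presence(docs)         # dense, 2 lexicon columns
X = hstack([X_ngrams, X_lex])          # combined feature matrix
# X.shape == (3, 6)
```

This way, tokenization runs exactly once, and both extractors consume the same token lists.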
