Python: How do I pass a tokenizer with keyword arguments to scikit-learn's CountVectorizer?

I have a custom tokenizer function that takes some keyword arguments:

def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30):
    # do things...
    return tokens

Now, how can I pass this tokenizer, with all of its arguments, to CountVectorizer? Nothing I have tried works; this does not work either:

from sklearn.feature_extraction.text import CountVectorizer
args = {"stem": False, "lemmatize": True}
count_vect = CountVectorizer(tokenizer=tokenizer(**args), stop_words='english', strip_accents='ascii', min_df=0, max_df=1., vocabulary=None)

Any help is much appreciated. Thanks in advance.

Best answer

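tokenizer should be a callable or None. Writing tokenizer(**args) calls the function on the spot, without a text argument, instead of handing CountVectorizer the function itself.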

(Is tokenizer=tokenize(**args) a typo? The function you defined above is named tokenizer.)
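
To illustrate the point about callables, here is a minimal sketch. The tokenizer body is a hypothetical stand-in (a whitespace split plus the length limits), since the question elides the real one; only the signature comes from the question.

```python
# Hypothetical stand-in for the question's tokenizer; the real body is not shown there.
# stem and lemmatize are accepted but ignored in this sketch.
def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30):
    return [t for t in text.split() if char_lower_limit <= len(t) <= char_upper_limit]

args = {"stem": False, "lemmatize": True}

# Calling the function with only the keyword arguments fails immediately,
# because the required `text` argument is missing.
try:
    tokenizer(**args)
    call_failed = False
except TypeError:
    call_failed = True

# Wrapping the call defers it: the wrapper is a callable that takes one document.
wrapped = lambda text: tokenizer(text, **args)

print(call_failed)                   # True
print(callable(wrapped))             # True
print(wrapped("a quick brown fox"))  # ['quick', 'brown', 'fox']
```

The wrapper is what CountVectorizer needs: something it can call later, once per document.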

You could try the following:

count_vect = CountVectorizer(tokenizer=lambda text: tokenizer(text, **args), stop_words='english', strip_accents='ascii', min_df=0, max_df=1., vocabulary=None)
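
A possible alternative to the lambda, offered as a sketch rather than as part of the original answer, is functools.partial, which binds the keyword arguments ahead of time and returns a callable. The tokenizer body and the two sample documents below are assumptions made for illustration.

```python
from functools import partial

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tokenizer body: whitespace split plus a length filter.
# stem and lemmatize are accepted but ignored in this sketch.
def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30):
    return [t for t in text.split() if char_lower_limit <= len(t) <= char_upper_limit]

args = {"stem": False, "lemmatize": True}

# partial(tokenizer, **args) is a callable that still expects `text`.
count_vect = CountVectorizer(tokenizer=partial(tokenizer, **args))

# Made-up sample documents, just to exercise the vectorizer.
X = count_vect.fit_transform(["red green blue", "green blue blue"])

print(sorted(count_vect.vocabulary_))  # ['blue', 'green', 'red']
print(X.toarray().tolist())            # [[1, 1, 1], [2, 1, 0]]
```

One reason to prefer this form: a partial over a module-level function can generally be pickled with the standard pickle module, whereas a lambda cannot, which matters if the fitted vectorizer needs to be saved. Recent scikit-learn versions may also warn that token_pattern is unused when a tokenizer is supplied; the warning is harmless here.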

This question, "Python: How do I pass a tokenizer with keyword arguments to scikit-learn's CountVectorizer?", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/31843996/