I get this error when trying to return different values from my sklearn vectorizer:
>>> python features.py lemmatize_PS Gold.xlsx
Traceback (most recent call last):
  File "features.py", line 351, in <module>
    fea1, fea0, fe, fi, fo, fu, fo, fea2 = build_feature_matrix_S(sentences)
  File "features.py", line 100, in build_feature_matrix_S
    vectorizer_freq = CountVectorizer(tokenizer = tokenize_lemmatize_spacy(first_arg), binary=False, min_df=5, ngram_range=gram)
TypeError: tokenize_lemmatize_spacy() missing 1 required positional argument: 'first_arg'
The tokenize_lemmatize_spacy function looks like this:
def tokenize_lemmatize_spacy(texte, first_arg):
    texte = normalize(texte)
    mytokens = nlp(texte)
    if first_arg == 'lemmatize_only':
        # Lemmatizing each token and converting each token into lowercase
        mytokens = [word.lemma_.lower().strip() for word in mytokens if word.pos_ != "SPACE"]
    elif first_arg == 'lemmatize_PS':
        # Lemmatizing each token and converting each token into lowercase
        mytokens = [word.lemma_.lower().strip() for word in mytokens if word.pos_ != "SPACE"]
        # Removing stop words and punctuation
        mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]
    else:
        raise Exception("Wrong feature type entered. Possible values: 'lemmatize_only', 'lemmatize_PS'")
    return mytokens
I tested tokenize_lemmatize_spacy on its own and it works, but now that I am trying to use it in another script I get this error.
Best answer
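CountVectorizer expects a callable for its tokenizer argument, but you are calling the function yourself and trying to pass in its output. That call supplies only one of the two required arguments (the value lands in texte, leaving first_arg unfilled), so it raises the TypeError before CountVectorizer is even constructed.
Use partial to bind first_arg and leave texte for the vectorizer to supply: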
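from functools import partial

# Pass the callable itself, with first_arg pre-bound; CountVectorizer calls it with each document.
vectorizer_freq = CountVectorizer(tokenizer=partial(tokenize_lemmatize_spacy,
                                                    first_arg='lemmatize_PS'),
                                  binary=False, min_df=5, ngram_range=gram)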
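
To see the mechanism without spaCy or sklearn installed, here is a minimal sketch. toy_tokenize and STOPWORDS are hypothetical stand-ins invented for this illustration; only the (texte, first_arg) signature mirrors the function in the question.

```python
from functools import partial

# Illustrative stop-word set, not the asker's actual list.
STOPWORDS = {"the", "a", "is"}


def toy_tokenize(texte, first_arg):
    """Hypothetical stand-in for tokenize_lemmatize_spacy (no spaCy needed)."""
    tokens = [w.lower().strip() for w in texte.split()]
    if first_arg == "lemmatize_only":
        return tokens
    elif first_arg == "lemmatize_PS":
        return [w for w in tokens if w not in STOPWORDS]
    raise ValueError("Possible values: 'lemmatize_only', 'lemmatize_PS'")


# partial returns a new callable with first_arg already filled in,
# so whoever calls it only has to supply texte.
tokenizer = partial(toy_tokenize, first_arg="lemmatize_PS")
tokens = tokenizer("The cat is on a mat")

# A lambda gives the same behaviour.
tokenizer_lambda = lambda texte: toy_tokenize(texte, first_arg="lemmatize_PS")
```

Either form works as a tokenizer. One reason to prefer partial is that a partial object wrapping a module-level function can be pickled with the standard pickle module, whereas a lambda cannot, which matters if the fitted vectorizer is saved to disk.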