Problem description
I am trying to create a custom vectorizer by subclassing CountVectorizer. The vectorizer stems all the words in a sentence before counting word frequencies. I then use this vectorizer in a pipeline, which works fine when I call pipeline.fit(X,y).
However, when I try to set a parameter with pipeline.set_params(rf__verbose=1).fit(X,y), I get the following error:
RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'features.extraction.labels.StemmedCountVectorizer'> with constructor (self, *args, **kwargs) doesn't follow this convention.
Here is my custom vectorizer:
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, *args, **kwargs):
        self.stemmer = SnowballStemmer("english", ignore_stopwords=True)
        super(StemmedCountVectorizer, self).__init__(*args, **kwargs)

    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([' '.join([self.stemmer.stem(w) for w in word_tokenize(word)]) for word in analyzer(doc)])
I understand that I could spell out every single parameter of the CountVectorizer class in my own __init__, but that doesn't seem to follow the DRY principle.
Thanks for your help!
Recommended answer
I have no experience with vectorizers in sklearn, but I ran into a similar problem. I implemented a custom estimator, let's call it MyBaseEstimator for now, extending sklearn.base.BaseEstimator. Then I implemented a few other custom sub-estimators extending MyBaseEstimator. The MyBaseEstimator class defines multiple arguments in its __init__, and I didn't want to repeat the same arguments in the __init__ methods of each of the sub-estimators.
However, without re-defining the arguments in the subclasses, much of sklearn's functionality didn't work; specifically, setting these parameters for cross-validation. It seems that sklearn expects all the relevant parameters of an estimator to be retrievable and modifiable through the BaseEstimator.get_params() and BaseEstimator.set_params() methods. And these methods, when invoked on one of the subclasses, do not return any parameters defined in the base class.
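To make that behaviour concrete, here is a minimal standard-library sketch of how I understand the parameter discovery to work. It is a simplified model, not sklearn's actual code: the helper init_param_names and the classes Base, Child and VarArgsChild are hypothetical names made up for illustration. The assumption is that parameter names are read from the signature of the class's own __init__, that **kwargs is skipped, and that *args is rejected.

```python
import inspect


def init_param_names(cls):
    """Return the explicit parameter names of cls.__init__ (simplified model)."""
    names = []
    for name, p in inspect.signature(cls.__init__).parameters.items():
        if name == "self" or p.kind is p.VAR_KEYWORD:
            continue  # **kwargs is skipped, so forwarded parameters stay invisible
        if p.kind is p.VAR_POSITIONAL:
            # Mirrors the kind of error quoted in the question.
            raise RuntimeError(f"{cls.__name__}.__init__ must not use *args")
        names.append(name)
    return sorted(names)


class Base:
    def __init__(self, alpha=1.0, beta=2.0):
        self.alpha = alpha
        self.beta = beta


class Child(Base):
    # Hypothetical subclass: adds one parameter and forwards the rest.
    def __init__(self, gamma=3.0, **kwargs):
        super().__init__(**kwargs)
        self.gamma = gamma


class VarArgsChild(Base):
    # Same shape as the constructor in the question.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


print(init_param_names(Base))
print(init_param_names(Child))
```

Under this model only the subclass's own parameter is found for Child, which matches what I observed: the base class parameters disappear as soon as the subclass defines its own __init__.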
To work around this, I implemented an overriding get_params() in MyBaseEstimator that uses an ugly hack to merge the parameters of the dynamic type (one of its subclasses) with the parameters defined by its own __init__.
Here's the same ugly hack applied to your CountVectorizer...
import copy
from sklearn.feature_extraction.text import CountVectorizer

class SubCountVectorizer(CountVectorizer):
    def __init__(self, p1=1, p2=2, **kwargs):
        super().__init__(**kwargs)
        # Store the subclass parameters so get_params() can read them back.
        self.p1 = p1
        self.p2 = p2

    def get_params(self, deep=True):
        # Parameters declared in this class's own __init__ (p1, p2).
        params = super().get_params(deep)
        # Hack to make get_params return base class params...
        cp = copy.copy(self)
        cp.__class__ = CountVectorizer
        params.update(CountVectorizer.get_params(cp, deep))
        return params

if __name__ == '__main__':
    scv = SubCountVectorizer(p1='foo', input='bar', encoding='baz')
    scv.set_params(**{'p2': 'foo2', 'analyzer': 'bar2'})
    print(scv.get_params())
Running the above code prints the following (line breaks added for readability):
{'p1': 'foo', 'p2': 'foo2',
'analyzer': 'bar2', 'binary': False,
'decode_error': 'strict', 'dtype': <class 'numpy.int64'>,
'encoding': 'baz', 'input': 'bar',
'lowercase': True, 'max_df': 1.0, 'max_features': None,
'min_df': 1, 'ngram_range': (1, 1), 'preprocessor': None,
'stop_words': None, 'strip_accents': None,
'token_pattern': '(?u)\\b\\w\\w+\\b',
'tokenizer': None, 'vocabulary': None}
This shows that sklearn's get_params() and set_params() now both work, and that keyword arguments of both the subclass and the base class can be passed to the subclass __init__.
Not sure how robust this is and whether it solves your exact issue, but it may be of use to someone.
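Another option that might sidestep the hack, assuming the stemmer does not need to be a constructor argument, is to leave __init__ alone and only override build_analyzer(). The sketch below is untested against the original setup; toy_stem is a hypothetical stand-in for a real stemmer such as NLTK's SnowballStemmer, used only to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer


def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: strips a trailing "s".
    return token[:-1] if token.endswith("s") else token


class StemmedCountVectorizer(CountVectorizer):
    # No __init__ override: the constructor signature, and with it
    # get_params()/set_params(), is inherited unchanged from CountVectorizer.
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [toy_stem(token) for token in analyzer(doc)]


vec = StemmedCountVectorizer(lowercase=True)
vec.set_params(ngram_range=(1, 1))  # works because the signature is explicit
counts = vec.fit_transform(["cats chase dogs", "a dog chases a cat"])
print(sorted(vec.vocabulary_))
```

Since the subclass adds no constructor parameters of its own, there is nothing to repeat, and a pipeline should be able to introspect and clone it like a plain CountVectorizer.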