Problem description
I am writing custom transformers in scikit-learn to perform specific operations on an array. To do so, I inherit from the TransformerMixin class. This works fine with a single transformer. However, when I try to combine several of them with FeatureUnion (or make_union), the original columns of the array are replicated n times. What can I do to avoid that? Am I using scikit-learn the way it is supposed to be used?
import numpy as np
from sklearn.base import TransformerMixin
from sklearn.pipeline import FeatureUnion
# creation of array
s1 = np.array(['foo', 'bar', 'baz'])
s2 = np.array(['a', 'b', 'c'])
X = np.column_stack([s1, s2])
print('base array: \n', X, '\n')
# A fake example that appends a column (Could be a score, ...) calculated on specific columns from X
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value

    def fit(self, *_):
        return self

    def transform(self, X):
        # appends a column (in this case, a constant) to X
        s = np.full(X.shape[0], self.value)
        X = np.column_stack([X, s])
        return X
# as such, the transformer gives what I need first
transfo = DummyTransformer(value=1)
print('single transformer: \n', transfo.fit_transform(X), '\n')
# but when I try to chain them and create a pipeline I run into the replication of existing columns
stages = []
for i in range(2):
    transfo = DummyTransformer(value=i+1)
    stages.append(('step'+str(i+1), transfo))
pipeunion = FeatureUnion(stages)
print('Given result of the Feature union pipeline: \n', pipeunion.fit_transform(X), '\n')
# columns 1&2 from X are replicated
# I would expect:
expected = np.column_stack([X, np.full(X.shape[0], 1), np.full(X.shape[0], 2) ])
print('Expected result of the Feature Union pipeline: \n', expected, '\n')
Output:
base array:
[['foo' 'a']
['bar' 'b']
['baz' 'c']]
single transformer:
[['foo' 'a' '1']
['bar' 'b' '1']
['baz' 'c' '1']]
Given result of the Feature union pipeline:
[['foo' 'a' '1' 'foo' 'a' '2']
['bar' 'b' '1' 'bar' 'b' '2']
['baz' 'c' '1' 'baz' 'c' '2']]
Expected result of the Feature Union pipeline:
[['foo' 'a' '1' '2']
['bar' 'b' '1' '2']
['baz' 'c' '1' '2']]
Thanks a lot.
Recommended answer
FeatureUnion just concatenates whatever it gets from its internal transformers. Right now, each of your internal transformers returns the same original columns along with its new one, so those columns show up once per transformer. It is up to the transformers to pass the correct data forward.
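To make the concatenation behaviour concrete, here is a minimal sketch comparing FeatureUnion with a manual np.hstack of each transformer's output. The AppendConstant class and the numeric input are hypothetical stand-ins for the question's DummyTransformer and string array, and the class additionally inherits from BaseEstimator, which is an assumption made here for compatibility with recent scikit-learn versions.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class AppendConstant(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: returns X plus one constant column."""

    def __init__(self, value=0.0):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        const = np.full((X.shape[0], 1), self.value)
        return np.hstack([X, const])


X = np.array([[10.0, 20.0], [30.0, 40.0]])
a = AppendConstant(value=1.0)
b = AppendConstant(value=2.0)

# What FeatureUnion produces...
union_out = FeatureUnion([("a", a), ("b", b)]).fit_transform(X)
# ...matches stacking each transformer's full output side by side.
manual_out = np.hstack([a.fit_transform(X), b.fit_transform(X)])

print(union_out.shape)   # 2 rows, 3 columns per transformer
print(np.array_equal(union_out, manual_out))
```

Because each transformer returns the two original columns plus its own, the union ends up with six columns instead of four.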
I would advise you to return only the new data from the internal transformers, and then concatenate the remaining columns either outside or inside the FeatureUnion.
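As a rough sketch of the "outside" option, the transformers below return only their new column, and the original array is stacked back on after the union has run. ConstantColumn is a hypothetical name, the input is numeric for simplicity, and inheriting from BaseEstimator is an assumption rather than part of the original code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class ConstantColumn(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: returns ONLY the new column, not X."""

    def __init__(self, value=0.0):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One column with as many rows as X, filled with the constant
        return np.full((X.shape[0], 1), self.value)


X = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])

union = FeatureUnion([
    ("step1", ConstantColumn(value=1.0)),
    ("step2", ConstantColumn(value=2.0)),
])

# The union now yields just the two new columns...
new_columns = union.fit_transform(X)
# ...and the original columns are prepended once, outside the union.
result = np.hstack([X, new_columns])
print(result)
```

This keeps the transformers simple, at the cost of one extra step that lives outside the FeatureUnion object.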
Look at this example if you haven't already:
You can do it like this, for example:
# This does nothing; it just passes the data through unchanged
class DataPasser(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

# Your transformer
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value

    def fit(self, *_):
        return self

    # Changed this to return only the new column after some operation on X
    def transform(self, X):
        s = np.full(X.shape[0], self.value)
        # reshape to a 2-D column so FeatureUnion can stack it horizontally
        return s.reshape(-1, 1)
After that, further down in your code, change the construction of the stages to this:
stages = []
# Append our DataPasser here, so original data is at the beginning
stages.append(('no_change', DataPasser()))
for i in range(2):
    transfo = DummyTransformer(value=i+1)
    stages.append(('step'+str(i+1), transfo))
pipeunion = FeatureUnion(stages)
Running this new code gives the following result:
('Given result of the Feature union pipeline: \n',
array([['foo', 'a', '1', '2'],
['bar', 'b', '1', '2'],
['baz', 'c', '1', '2']], dtype='|S21'), '\n')
('Expected result of the Feature Union pipeline: \n',
array([['foo', 'a', '1', '2'],
['bar', 'b', '1', '2'],
['baz', 'c', '1', '2']], dtype='|S21'), '\n')
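As a possible variation on the DataPasser idea, scikit-learn's FunctionTransformer acts as an identity transformer when it is constructed without a function, so it could stand in for the hand-written pass-through class. The sketch below assumes numeric input and uses a hypothetical ConstantColumn transformer (with BaseEstimator added as a base class) in place of DummyTransformer.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


class ConstantColumn(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: returns only a constant column."""

    def __init__(self, value=0.0):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.full((X.shape[0], 1), self.value)


X = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])

union = FeatureUnion([
    # FunctionTransformer() with no func passes X through unchanged
    ("no_change", FunctionTransformer()),
    ("step1", ConstantColumn(value=1.0)),
    ("step2", ConstantColumn(value=2.0)),
])

result = union.fit_transform(X)
print(result)
```

Keeping the pass-through step first preserves the column order from the answer: original columns, then one new column per transformer.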