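I am building a machine learning pipeline with sklearn's Pipeline. In the preprocessing step I am trying to apply two different treatments to two different string variables: 1) one-hot encoding on BusinessType, and 2) mean encoding on AreaCode1, as follows: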
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer

preprocesses_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ("text_features1", make_pipeline(
            FunctionTransformer(getBusinessTypeCol, validate=False), CustomOHE()
        )),
        ("text_features2", make_pipeline(
            FunctionTransformer(getAreaCodeCol, validate=False), MeanEncoding()
        ))
    ])
)
preprocesses_pipeline.fit_transform(trainDF[X_cols])
The custom transformers, built on TransformerMixin, are defined as:
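import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanEncoding(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Replace each area code with the mean of isFail for that code.
        tmp = X['AreaCode1'].map(X.groupby('AreaCode1')['isFail'].mean())
        return tmp.values

class CustomOHE(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One-hot encode the incoming column.
        tmp = pd.get_dummies(X)
        return tmp.values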
And these are the functions passed to FunctionTransformer to return the required columns:
def getBusinessTypeCol(df):
    return df['BusinessType']

def getAreaCodeCol(df):
    return df[['AreaCode1', 'isFail']]
Now, when I run the pipeline above, it produces the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-146-7f3a31a39c81> in <module>()
15 )
16
---> 17 preprocesses_pipeline.fit_transform(trainDF[X_cols])
~\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
281 Xt, fit_params = self._fit(X, y, **fit_params)
282 if hasattr(last_step, 'fit_transform'):
--> 283 return last_step.fit_transform(Xt, y, **fit_params)
284 elif last_step is None:
285 return Xt
~\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
747 Xs = sparse.hstack(Xs).tocsr()
748 else:
--> 749 Xs = np.hstack(Xs)
750 return Xs
751
~\Anaconda3\lib\site-packages\numpy\core\shape_base.py in hstack(tup)
286 return _nx.concatenate(arrs, 0)
287 else:
--> 288 return _nx.concatenate(arrs, 1)
289
290
ValueError: all the input arrays must have same number of dimensions
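The error seems to come from MeanEncoding in the pipeline, because removing it makes the pipeline work. I can't figure out what is going wrong and would appreciate some help.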
Best answer
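OK, I solved the puzzle myself. Basically, MeanEncoding() returns an array of shape (n,) after transforming, whereas FeatureUnion expects an array of shape (n, 1), so that it can combine it with the (n, k) array returned by CustomOHE() in the first pipeline. FeatureUnion stacks the outputs with np.hstack, and NumPy cannot stack a 1-D (n,) array next to a 2-D (n, k) array, so the result has to be reshaped to (n, 1).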
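To see the shape mismatch in isolation, here is a minimal NumPy sketch. The arrays are placeholders that I made up to stand in for the two transformer outputs; only the shapes matter.

```python
import numpy as np

n, k = 4, 3
ohe_block = np.zeros((n, k))    # placeholder for the CustomOHE() output, shape (n, k)
mean_block = np.arange(n) / n   # placeholder for the MeanEncoding() output, shape (n,)

# Stacking a 2-D array with a 1-D array raises ValueError,
# which is the same failure that shows up inside FeatureUnion.
try:
    np.hstack([ohe_block, mean_block])
    stacked_1d_ok = True
except ValueError:
    stacked_1d_ok = False

# Reshaping the 1-D array into a single column makes the stack valid.
fixed = np.hstack([ohe_block, mean_block.reshape(-1, 1)])
print(stacked_1d_ok, fixed.shape)
```

reshape(-1, 1) lets NumPy infer the number of rows, which is equivalent to reshape(len(tmp), 1) for a 1-D array.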
So my MeanEncoding class now looks like this:
class MeanEncoding(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        tmp = X['AreaCode1'].map(X.groupby('AreaCode1')['isFail'].mean())
        # Return a column of shape (n, 1) so it can be stacked with other features.
        return tmp.values.reshape(len(tmp), 1)
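One thing to be aware of: the class above computes the means inside transform(), so it needs the isFail column every time it is called. A possible variant, sketched below under my own assumptions, learns the means in fit() and only looks them up in transform(). The class name FittedMeanEncoding, the col and target parameters, the fallback to the global mean for unseen codes, and the toy frames are all hypothetical and not part of the original code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedMeanEncoding(BaseEstimator, TransformerMixin):
    """Hypothetical variant: learn per-category means in fit(), apply them in transform()."""

    def __init__(self, col="AreaCode1", target="isFail"):
        self.col = col
        self.target = target

    def fit(self, X, y=None):
        # Learn the mean of the target for each category from the training frame.
        self.means_ = X.groupby(self.col)[self.target].mean()
        # Assumed fallback for categories that were not seen during fit.
        self.global_mean_ = X[self.target].mean()
        return self

    def transform(self, X):
        # Only the category column is needed here; unseen categories
        # receive the global training mean.
        encoded = X[self.col].map(self.means_).fillna(self.global_mean_)
        return encoded.to_numpy().reshape(-1, 1)


# Toy data, made up for illustration.
train = pd.DataFrame({"AreaCode1": ["a", "a", "b", "b"], "isFail": [1, 0, 1, 1]})
new = pd.DataFrame({"AreaCode1": ["a", "b", "c"]})

enc = FittedMeanEncoding().fit(train)
out = enc.transform(new)
print(out.shape)
```

The output is still a single column of shape (n, 1), so it can be stacked by FeatureUnion in the same way as the reshaped version above.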