I have a problem. I built my own classifier; it is finished and works perfectly, but when I try to use cross_val_score I get an error:
File "/home/webinterpret/workspace/nlp/wi-item-attribute-extraction/attr_extractor.py", line 95, in fit
print self.fitted_models[attr][len(self.fitted_models[attr]) - 1].cross_validation_score(x_train, y_train, 5, 0.2)
File "/home/webinterpret/workspace/nlp/wi-item-attribute-extraction/attr_extractor.py", line 163, in cross_validation_score
cv=self.cv).mean()
File "/home/webinterpret/.virtualenvs/nlp/local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1361, in cross_val_score
for train, test in cv)
File "/home/webinterpret/.virtualenvs/nlp/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
self.dispatch(function, args, kwargs)
File "/home/webinterpret/.virtualenvs/nlp/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/home/webinterpret/.virtualenvs/nlp/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
self.results = func(*args, **kwargs)
File "/home/webinterpret/.virtualenvs/nlp/local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1478, in _fit_and_score
test_score = _score(estimator, X_test, y_test, scorer)
File "/home/webinterpret/.virtualenvs/nlp/local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1534, in _score
score = scorer(estimator, X_test, y_test)
File "/home/webinterpret/.virtualenvs/nlp/local/lib/python2.7/site-packages/sklearn/metrics/scorer.py", line 201, in _passthrough_scorer
return estimator.score(*args, **kwargs)
File "/home/webinterpret/workspace/nlp/wi-item-attribute-extraction/attr_extractor.py", line 198, in score
return (pd.Series(self.predict(x_test)) == y_test).mean()
File "/home/webinterpret/workspace/nlp/wi-item-attribute-extraction/attr_extractor.py", line 190, in predict
result[i] = 1 if self.pattern in item else 0
File "/home/webinterpret/.virtualenvs/nlp/local/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 216, in __eq__
if np.isnan(other):
TypeError: Not implemented for this type
My predict function:
    result = np.zeros(text.shape[0])
    i = 0
    for item in text:
        result[i] = 1 if self.pattern in item else 0
        i += 1
    return result
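The error is raised on the line "result[i] = 1 if self.pattern in item else 0", but I don't know how to write it any other way.
pattern is a piece of text, e.g. "car", and text is just text as well: "this car is broken".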
Best answer
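scikit-learn really does expect your data in strict matrix form: x_train should be a numeric matrix, and y_train should be a numeric matrix or vector. The cross-validation routines convert their inputs to arrays to make sure they are in the right format for the built-in classifiers.
What happens here is that the array conversion step (effectively) creates a character matrix with as many columns as the longest text. As a result, most of the text rows have their remaining columns padded with np.nan.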
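Whatever the exact conversion, the traceback in the question ends inside scipy's sparse `__eq__`, which suggests that `item` is no longer a plain string by the time `self.pattern in item` runs. The sketch below only illustrates that mechanism: when the right-hand side of `in` has no `__contains__`, Python iterates over it and compares each element with `==`. `Row` and `Cell` are hypothetical stand-ins written for this example; they are not scipy classes, and the error message is copied from the traceback purely for illustration.

```python
class Cell(object):
    """Hypothetical element whose == only accepts numbers."""

    def __eq__(self, other):
        if not isinstance(other, (int, float)):
            raise TypeError("Not implemented for this type")
        return False


class Row(object):
    """Hypothetical non-string row: iterable, but without __contains__."""

    def __iter__(self):
        return iter([Cell()])


def safe_contains(pattern, item):
    """Return the result of `pattern in item`, or None if the test raises TypeError."""
    try:
        return pattern in item
    except TypeError:
        return None


# On a real string, `in` is a substring test.
print(safe_contains("car", "this car is broken"))
# On the stand-in row, `in` falls back to == on each element and fails.
print(safe_contains("car", Row()))
```

If this is what is happening in your case, checking `type(item)` at the top of predict should confirm it quickly.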
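If you want to use a classifier like this, you will need to avoid the built-in pipeline and cross-validation routines. You can iterate over the cross-validation folds yourself and build your own score, like this: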
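    # Old-style API, matching the sklearn.cross_validation module in the traceback
    from sklearn.cross_validation import StratifiedKFold

    # data must support index arrays (e.g. a NumPy array of texts)
    for train, test in StratifiedKFold(target_classes):
        train_data = data[train]
        test_data = data[test]
        # Train with train, predict with test, score with your favorite scorer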
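As a fuller illustration of that loop, here is a minimal, self-contained sketch that keeps the texts as a plain Python list so nothing ever converts them to an array. Everything in it is an assumption made for the example: `PatternClassifier` is a stand-in for the asker's classifier, `stratified_folds` is a small hand-rolled replacement for StratifiedKFold so the snippet runs without scikit-learn, and the sample texts and labels are made up.

```python
import random
from collections import defaultdict


class PatternClassifier(object):
    """Hypothetical stand-in: predicts 1 when the pattern occurs in the text."""

    def __init__(self, pattern):
        self.pattern = pattern

    def fit(self, texts, labels):
        return self  # a fixed pattern has nothing to learn

    def predict(self, texts):
        return [1 if self.pattern in text else 0 for text in texts]


def stratified_folds(labels, n_folds, seed=0):
    """Yield (train_indices, test_indices), spreading each class across the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


def manual_cv_score(clf, texts, labels, n_folds=3):
    """Mean accuracy over the folds, computed without cross_val_score."""
    scores = []
    for train, test in stratified_folds(labels, n_folds):
        clf.fit([texts[i] for i in train], [labels[i] for i in train])
        predicted = clf.predict([texts[i] for i in test])
        actual = [labels[i] for i in test]
        correct = sum(1 for p, a in zip(predicted, actual) if p == a)
        scores.append(correct / float(len(actual)))
    return sum(scores) / len(scores)


# Made-up sample data for illustration only.
texts = ["this car is broken", "red car for sale", "old car parts",
         "blue bicycle", "wooden table", "kitchen chair"]
labels = [1, 1, 1, 0, 0, 0]
mean_accuracy = manual_cv_score(PatternClassifier("car"), texts, labels, n_folds=3)
print(mean_accuracy)
```

Because the fold indices are applied with list comprehensions, each item handed to predict is still the original string, so the `in` test behaves as a substring check.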