python - ValueError: Found arrays with inconsistent numbers of samples [6 1786]

Here is my code:

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import datasets
import numpy as np

newsgroups = datasets.fetch_20newsgroups(
                subset='all',
                categories=['alt.atheism', 'sci.space']
         )
X = newsgroups.data
y = newsgroups.target

TD_IF = TfidfVectorizer()
y_scaled = TD_IF.fit_transform(newsgroups, y)
grid = {'C': np.power(10.0, np.arange(-5, 6))}
cv = KFold(y_scaled.size, n_folds=5, shuffle=True, random_state=241)
clf = SVC(kernel='linear', random_state=241)

gs = GridSearchCV(estimator=clf, param_grid=grid, scoring='accuracy', cv=cv)
gs.fit(X, y_scaled)

I get an error, and I don't understand why. Traceback:
Traceback (most recent call last):
  File "C:/Users/Roman/PycharmProjects/week_3/assignment_2.py", line 23, in <module>
    gs.fit(X, y_scaled) #TODO: check this line
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\grid_search.py", line 804, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\grid_search.py", line 525, in _fit
    X, y = indexable(X, y)
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 201, in indexable
    check_consistent_length(*result)
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [   6 1786]
Can someone explain why this error occurs?

Best answer

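I think you are mixing up X and y here. In your code the vectorizer is fitted on the whole newsgroups object rather than on the texts in newsgroups.data, and its output is then passed to gs.fit as y, so the two arrays have different numbers of samples (6 vs. 1786). What you want is to transform X into a tf-idf matrix and train on it against the untouched labels y. See the corrected script below.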
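
To see where a count like 6 can come from, here is a minimal sketch in plain Python. It assumes the object returned by fetch_20newsgroups behaves like a dict (scikit-learn's Bunch is a dict subclass); the dict below is a hypothetical stand-in, and its key names and values are illustrative only, not the real dataset contents.

```python
# Hypothetical stand-in for the dataset object; keys and values are
# illustrative assumptions, not the real contents of the dataset.
fake_newsgroups = {
    "data": ["first post", "second post", "third post"],
    "target": [0, 1, 0],
    "target_names": ["alt.atheism", "sci.space"],
    "filenames": ["a", "b", "c"],
    "DESCR": "description text",
    "description": "short description",
}

# Iterating a dict yields its keys, so anything that consumes it as an
# iterable of documents sees one "document" per key.
seen_as_documents = list(fake_newsgroups)

print(len(seen_as_documents))        # number of keys
print(len(fake_newsgroups["data"]))  # number of actual posts
```

A vectorizer fed the container would therefore produce one row per key instead of one row per post, which would explain the 6 in the error message. The fix is to vectorize newsgroups.data, as in the script that follows.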

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import datasets
import numpy as np

newsgroups = datasets.fetch_20newsgroups(
                subset='all',
                categories=['alt.atheism', 'sci.space']
         )
X = newsgroups.data
y = newsgroups.target

TD_IF = TfidfVectorizer()
X_scaled = TD_IF.fit_transform(X, y)
grid = {'C': np.power(10.0, np.arange(-1, 1))}
cv = KFold(y.size, n_folds=5, shuffle=True, random_state=241)  # fold over the number of samples
clf = SVC(kernel='linear', random_state=241)

gs = GridSearchCV(estimator=clf, param_grid=grid, scoring='accuracy', cv=cv)
gs.fit(X_scaled, y)
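
The script above targets the old sklearn.grid_search and sklearn.cross_validation modules that appear in the traceback. In newer scikit-learn releases these classes live in sklearn.model_selection, and KFold takes n_splits instead of the sample count and n_folds. The following is only a sketch of the same idea under that newer API. To stay runnable without a download, it swaps the newsgroups data for a tiny made-up corpus; docs, labels and the grid range are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for newsgroups.data / newsgroups.target.
docs = [
    "the rocket reached orbit", "nasa launched a new satellite",
    "the shuttle docked with the station", "astronauts repaired the telescope",
    "the probe left the solar system", "a comet passed near the moon",
    "atheism rejects belief in gods", "the debate was about religion and faith",
    "he argued that god does not exist", "belief and faith were discussed",
    "the church and religion were criticized", "an atheist wrote about faith",
]
labels = np.array([1] * 6 + [0] * 6)

# Vectorize the texts (X), not the labels; y stays as it is.
X_tfidf = TfidfVectorizer().fit_transform(docs)

# Both arrays must have one row/entry per sample.
assert X_tfidf.shape[0] == labels.shape[0]

grid = {"C": np.power(10.0, np.arange(-1, 2))}
cv = KFold(n_splits=3, shuffle=True, random_state=241)
clf = SVC(kernel="linear", random_state=241)

gs = GridSearchCV(estimator=clf, param_grid=grid, scoring="accuracy", cv=cv)
gs.fit(X_tfidf, labels)

print(gs.best_params_, gs.best_score_)
```

For the real task, replace docs and labels with newsgroups.data and newsgroups.target and widen the grid as needed.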