python - Classifier accuracy is 100% when the data rows are shuffled

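I am working on the mushroom classification dataset (found here: https://www.kaggle.com/uciml/mushroom-classification).

I have done some preprocessing on the data (removed redundant attributes, converted categorical data to numeric) and am trying to use it to train classifiers.

Whenever I shuffle the data, either manually or with train_test_split, every model I use (XGB, MLP, LinearSVC, decision tree) reaches 100% accuracy. Whenever I test the models on unshuffled data, the accuracy is around 50-85%.

Here is how I split the data: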

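from sklearn.model_selection import train_test_split
# testing: the preprocessed data; y: the class labels
x = testing.copy()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, shuffle=True)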

and manually:

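x = testing.copy()
x = x.sample(frac=1)  # shuffle the rows of x
# NOTE: y is not reordered here; to keep the labels aligned with the
# shuffled rows it would need something like y = y.loc[x.index]

testRatio = 0.3
testCount = int(len(x)*testRatio)

# the first testCount rows go to the test set, the rest to the training set
x_train = x[testCount:]
x_test = x[0:testCount]
y_train = y[testCount:]
y_test = y[0:testCount]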
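
For reference, a minimal sketch of a manual split that keeps features and labels aligned. It assumes `x` is a pandas DataFrame and `y` a pandas Series sharing the same index; the synthetic data and the column names `f1` and `f2` are hypothetical stand-ins, not the mushroom dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the features `x` and the labels `y`;
# sizes and column names are made up for illustration.
rng = np.random.default_rng(0)
x = pd.DataFrame({"f1": rng.integers(0, 5, size=100),
                  "f2": rng.integers(0, 3, size=100)})
y = pd.Series(rng.integers(0, 2, size=100), name="class")

# Shuffle the features, then reorder the labels by the shuffled index
# so that each row of x still lines up with its own label.
x_shuffled = x.sample(frac=1, random_state=0)
y_shuffled = y.loc[x_shuffled.index]

test_ratio = 0.3
test_count = int(len(x_shuffled) * test_ratio)

# Positional slicing: the first test_count rows form the test set.
x_test, x_train = x_shuffled.iloc[:test_count], x_shuffled.iloc[test_count:]
y_test, y_train = y_shuffled.iloc[:test_count], y_shuffled.iloc[test_count:]

# Both checks should hold if features and labels stayed aligned.
print(x_test.index.equals(y_test.index), x_train.index.equals(y_train.index))
```

Using `.iloc` makes it explicit that the slices are positional rather than label-based.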

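Am I doing something completely wrong and missing it?

Edit:
The only difference I can see between the two ways of splitting the data is the class distribution.

Without shuffling: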

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, shuffle=False)

print(y_test.value_counts())
print(y_train.value_counts())

Result:

0    1828
1     610
Name: class, dtype: int64
1    3598
0    2088
Name: class, dtype: int64

With shuffling:

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, shuffle=True)

print(y_test.value_counts())
print(y_train.value_counts())

Result:

0    1238
1    1200
Name: class, dtype: int64
1    3008
0    2678
Name: class, dtype: int64

I don't think this should have such a large effect on the models' accuracy.
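
To illustrate how row order alone can skew a split, here is a small sketch with made-up labels sorted by class. The data and the column name `f1` are hypothetical, not the mushroom dataset. It compares an unshuffled split with a shuffled, stratified one using the `stratify` parameter of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels sorted by class, mimicking a file whose rows are ordered.
y = pd.Series([0] * 600 + [1] * 400, name="class")
x = pd.DataFrame({"f1": np.arange(len(y))})

# Unshuffled: the test set is simply the last 30% of the rows.
_, _, y_train_ns, y_test_ns = train_test_split(
    x, y, test_size=0.3, shuffle=False)

# Shuffled and stratified: both parts keep the overall class ratio.
_, _, y_train_st, y_test_st = train_test_split(
    x, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)

print(y_test_ns.value_counts(normalize=True))
print(y_test_st.value_counts(normalize=True))
```

In this toy case the unshuffled test set contains a single class, while the stratified split preserves the 60/40 ratio in both parts.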

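Edit 2:
Following PV8's suggestion, I tried cross-validation to check my results. It seems to do the trick: the results I get this way are much more reasonable.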

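from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

model = LinearSVC()
scores = cross_val_score(model, x, y, cv=5)  # 5-fold cross-validation
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))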

Output:

[1.         1.         1.         1.         0.75246305]
Accuracy: 0.95 (+/- 0.20)

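Best answer

This could be normal behavior. How many shuffles did you try?

It indicates that your results are quite sensitive to the way the data is split. I hope you measured the test accuracy and not the training accuracy?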
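
A quick way to check which accuracy is being reported is to score the model on both parts of the split. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the mushroom dataset) and a decision tree, one of the model types mentioned in the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the preprocessed features and labels.
x, y = make_classification(n_samples=500, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, shuffle=True, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

# score() returns the mean accuracy on the data it is given.
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Only the test figure says anything about how the model handles rows it has not seen.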

I would suggest using cross validation; it will help you verify your overall results.
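
A minimal sketch of cross-validation with shuffled folds, assuming synthetic data in place of the mushroom features. Passing an integer `cv` to `cross_val_score` for a classifier uses `StratifiedKFold` without shuffling, so the row order of the file can still influence individual folds. An explicit splitter with `shuffle=True` avoids that.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic data standing in for the preprocessed features and labels.
x, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified 5-fold splitter that shuffles rows before forming the folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

model = LinearSVC()
scores = cross_val_score(model, x, y, cv=cv)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

If the per-fold scores still vary widely after shuffling, the variation likely comes from the model or the data rather than from the row order.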

Regarding "python - Classifier accuracy is 100% when the data rows are shuffled", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59939691/