As the title says, I would like to know the difference between
StratifiedKFold with the parameter shuffle=True
StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
and
StratifiedShuffleSplit
StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=0)
What is the advantage of using StratifiedShuffleSplit?
Best answer
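In KFolds, the test sets should not overlap, even with shuffle. With KFolds and shuffle, the data is shuffled once at the start and then divided into the desired number of splits. The test data is always one of the splits, and the train data is the rest.
In ShuffleSplit, the data is shuffled every time and then split. This means the test sets may overlap between the splits.
See this block for an example of the difference. Note the overlap of elements in the ShuffleSplit test sets.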
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

splits = 5
tx = range(10)            # 10 samples
ty = [0] * 5 + [1] * 5    # two balanced classes

# Shuffled once, then cut into 5 non-overlapping test folds
kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
# Reshuffled for every split; each test set holds 2 samples
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)
Output:
KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]
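As for when to use them, I tend to use KFolds for any cross-validation, and I use ShuffleSplit with a split of 2 for my train/test set splits. But I'm sure there are other use cases for both.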
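As a rough illustration of the train/test use case, here is a minimal sketch of a single stratified hold-out split. The values n_splits=1 and test_size=0.2 are assumptions chosen for the example, not settings taken from the answer, and the toy data is made up.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)   # toy features: 10 samples, 2 columns
y = np.array([0] * 5 + [1] * 5)    # two balanced classes

# Illustrative settings (assumed): one split, 20% of samples held out.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(splitter.split(X, y))

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

# Stratification keeps the class ratio the same in both parts.
print("train label counts:", np.bincount(y_train))
print("test label counts:", np.bincount(y_test))
```

Unlike StratifiedKFold, where the test fraction is fixed by n_splits, here test_size and n_splits are set independently, which is convenient when only one hold-out set of a given size is needed. For a one-off split, train_test_split with its stratify argument is a shorter route to a similar result.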
Source: a similar question on Stack Overflow, "python - difference between StratifiedKFold and StratifiedShuffleSplit in sklearn": https://stackoverflow.com/questions/45969390/