Problem description
In the GroupKFold source, random_state is set to None:
def __init__(self, n_splits=3):
    super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                     random_state=None)
Hence, when the following is run multiple times (code from here):
import numpy as np
from sklearn.model_selection import GroupKFold

# Build the same splitter ten times to see whether the splits ever change.
for i in range(0, 10):
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([1, 2, 3, 4])
    groups = np.array([0, 0, 2, 2])
    group_kfold = GroupKFold(n_splits=2)
    group_kfold.get_n_splits(X, y, groups)
    print(group_kfold)
    for train_index, test_index in group_kfold.split(X, y, groups):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test, y_train, y_test)
    # Two blank lines between runs.
    print("")
    print("")
Output:
GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
[7, 8]]), array([[1, 2],
[3, 4]]), array([3, 4]), array([1, 2]))
GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
[7, 8]]), array([[1, 2],
[3, 4]]), array([3, 4]), array([1, 2]))
and so on. The splits are identical on every run.
How do I set a random_state for GroupKFold in order to get a different (but reproducible) set of splits over a few different trials of cross-validation?
For example, I would like:
GroupKFold(n_splits=2, random_state=42)
('TRAIN:', array([0, 1]),
'TEST:', array([2, 3]))
('TRAIN:', array([2, 3]),
'TEST:', array([0, 1]))
GroupKFold(n_splits=2, random_state=13)
('TRAIN:', array([0, 2]),
'TEST:', array([1, 3]))
('TRAIN:', array([1, 3]),
'TEST:', array([0, 2]))
So far, it seems one strategy might be to apply sklearn.utils.shuffle first, as suggested in this post. However, this only rearranges the elements within each fold; it does not give us new splits.
import sys

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle

random_state = int(sys.argv[1])  # seed passed on the command line

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

def cv(X, y, groups, random_state):
    # Shuffle samples, labels and groups with the same permutation.
    X_s, y_s, groups_s = shuffle(X, y, groups, random_state=random_state)
    cv_out = GroupKFold(n_splits=2)
    for train, test in cv_out.split(X_s, y_s, groups_s):
        print("---")
        print(X_s[test])
        print(y_s[test])
        print("test groups", groups_s[test])
        print("train groups", groups_s[train])

print("***")
cv(X, y, groups, random_state)
Output:
>python sshuf.py 32
***
---
[[ 2 3]
[ 4 5]
[ 0 1]
[ 8 9]
[12 13]]
[1 2 0 4 6]
test groups [0 0 0 2 4]
train groups [7 6 1 3 5]
---
[[18 19]
[16 17]
[ 6 7]
[10 11]
[14 15]]
[9 8 3 5 7]
test groups [7 6 1 3 5]
train groups [0 0 0 2 4]
>python sshuf.py 234
***
---
[[12 13]
[ 4 5]
[ 0 1]
[ 2 3]
[ 8 9]]
[6 2 0 1 4]
test groups [4 0 0 0 2]
train groups [7 3 1 5 6]
---
[[18 19]
[10 11]
[ 6 7]
[14 15]
[16 17]]
[9 5 3 7 8]
test groups [7 3 1 5 6]
train groups [4 0 0 0 2]
Recommended answer
- KFold is only randomized if shuffle=True. Some datasets should not be shuffled.
- GroupKFold is not randomized at all. Hence the random_state=None.
- GroupShuffleSplit may be closer to what you're looking for.
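To illustrate the last point, here is a minimal sketch of GroupShuffleSplit used with a fixed random_state, reusing the data from the question. The helper name split_test_groups and the choice of test_size=0.5 are assumptions made for this example. The same seed reproduces the same splits, while a different seed will usually (not necessarily always) give different ones.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])


def split_test_groups(random_state, n_splits=3):
    """Return the sorted test groups of each split for a given seed.

    Hypothetical helper written for this example only.
    """
    splitter = GroupShuffleSplit(n_splits=n_splits, test_size=0.5,
                                 random_state=random_state)
    return [sorted(set(groups[test].tolist()))
            for _, test in splitter.split(X, y, groups)]


for seed in (42, 13):
    # Re-running with the same seed prints the same groups again.
    print(seed, split_test_groups(seed))
```

Note that the test sets of successive splits may overlap, so this is not a drop-in replacement when every sample must be tested exactly once.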
A comparison of the group-based splitters:
- In GroupKFold, the test sets form a complete partition of all the data.
- LeavePGroupsOut leaves all possible subsets of P groups out, combinatorially; test sets will overlap for P > 1. Since this means C(n_groups, P) splits altogether (the number of ways to choose P groups), often you want a small P, and most often you want LeaveOneGroupOut, which is basically the same as GroupKFold with n_splits equal to the number of groups.
- GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.
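If the partition property of GroupKFold matters (every sample tested exactly once) but seeded randomness is also wanted, one possible workaround is to run a shuffled KFold over the unique group labels and map the result back to sample indices. This is only a sketch: shuffled_group_kfold is a hypothetical helper, not a scikit-learn API, and unlike GroupKFold it balances the number of groups per fold rather than the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold


def shuffled_group_kfold(groups, n_splits, random_state):
    """Yield (train_idx, test_idx) with whole groups shuffled into folds.

    Hypothetical helper, not part of scikit-learn.
    """
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    # Shuffle and split the group labels, not the individual samples.
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for _, test_pos in kfold.split(unique_groups):
        test_mask = np.isin(groups, unique_groups[test_pos])
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
for seed in (42, 13):
    for train_idx, test_idx in shuffled_group_kfold(groups, 2, seed):
        print(seed, "test groups", np.unique(groups[test_idx]))
```

Because the seed only affects which groups land in which fold, a group is never split between train and test, and the test folds together still cover all samples.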
As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.
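The distinction between groups and samples can be seen in a small sketch. In GroupShuffleSplit a float test_size is a proportion of groups, so with uneven group sizes the fraction of samples in the test set can drift away from it. The data below is the group layout from the question; the values test_size=0.25 and n_splits=5 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# One group of three samples and seven singleton groups: 8 groups, 10 samples.
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
X = np.zeros((len(groups), 1))

splitter = GroupShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
group_counts = []
sample_fractions = []
for _, test_idx in splitter.split(X, groups=groups):
    group_counts.append(len(np.unique(groups[test_idx])))
    sample_fractions.append(len(test_idx) / len(groups))
    print("test groups:", group_counts[-1],
          "test sample fraction:", sample_fractions[-1])
```

Every split holds out two of the eight groups, but the sample fraction depends on whether the large group 0 happens to be among them.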