Scikit-learn: GroupKFold with shuffling?

Problem description

I was using StratifiedKFold from scikit-learn, but now I also need to account for "groups". There is a nice function, GroupKFold, but my data are very time-dependent. So, similar to the example in the documentation, the week number is the grouping index, and each week should appear in only one fold.

Suppose I need 10 folds. What I need is to shuffle the data first, before I can use GroupKFold.

The shuffling should happen at the group level, so whole groups are shuffled among each other.

Is there a way to do this elegantly with scikit-learn? It seems to me that GroupKFold is robust to shuffling the data first.

If there is no way to do it with scikit-learn, can anyone write some efficient code for this? I have large data sets.

Inputs: matrix, label, and groups.

Recommended answer

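EDIT: This solution does not work. GroupKFold assigns groups to folds from the group labels and group sizes, not from the row order, so shuffling the rows leaves each fold's group membership unchanged.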

I think using sklearn.utils.shuffle is an elegant solution!

For data in X, y and groups:

from sklearn.utils import shuffle
X_shuffled, y_shuffled, groups_shuffled = shuffle(X, y, groups, random_state=0)

Then use X_shuffled, y_shuffled and groups_shuffled with GroupKFold:

from sklearn.model_selection import GroupKFold
group_k_fold = GroupKFold(n_splits=10)
splits = group_k_fold.split(X_shuffled, y_shuffled, groups_shuffled)
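To see why the edit at the top of this answer says the approach does not work, here is a small check, offered as a sketch rather than a definitive test. It compares which groups end up together in a validation fold before and after shuffling the rows. The helper name `group_partition` and the toy data are my own assumptions for illustration; `GroupKFold` and `sklearn.utils.shuffle` are the real scikit-learn calls used above.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle

# Toy data (assumed for illustration): 10 rows, 8 groups.
X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])


def group_partition(X, y, groups, n_splits=3):
    """Hypothetical helper: the set of validation-group sets GroupKFold yields."""
    gkf = GroupKFold(n_splits=n_splits)
    return {
        frozenset(groups[val_idx].tolist())
        for _, val_idx in gkf.split(X, y, groups)
    }


baseline = group_partition(X, y, groups)

# Shuffle the rows with several seeds and compare the group partition each time.
unchanged = []
for seed in range(5):
    X_s, y_s, g_s = shuffle(X, y, groups, random_state=seed)
    unchanged.append(group_partition(X_s, y_s, g_s) == baseline)

print(unchanged)
```

If every entry is True, the row indices inside each fold moved around but the same groups are still held out together, which is not the group-level shuffling the question asks for.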

Of course, you probably want to shuffle multiple times and run the cross-validation with each shuffle. You could put the whole thing in a loop. Here's a complete example with 5 shuffles (and only 3 splits instead of your required 10):

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]

n_shuffles = 5
group_k_fold = GroupKFold(n_splits=3)

for i in range(n_shuffles):
    # Shuffle rows of X, y and groups consistently, with a different seed each time
    X_shuffled, y_shuffled, groups_shuffled = shuffle(X, y, groups, random_state=i)
    splits = group_k_fold.split(X_shuffled, y_shuffled, groups_shuffled)
    # do something with splits here, I'm just printing them out
    print('Shuffle', i)
    print('groups_shuffled:', groups_shuffled)
    for train_idx, val_idx in splits:
        print('Train:', train_idx)
        print('Val:', val_idx)
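Since the question asks for whole groups to be shuffled among folds, one possible workaround is to randomise the group-to-fold assignment by hand. The sketch below is only that: `shuffled_group_kfold` is a hypothetical helper of mine, not a scikit-learn function, and it assumes that balancing the number of groups per fold (rather than the number of samples, as GroupKFold does) is acceptable. It uses only NumPy and a single pass over the rows, so it should be workable for large data sets.

```python
import numpy as np


def shuffled_group_kfold(groups, n_splits, seed=None):
    """Yield (train_idx, val_idx) pairs with whole groups dealt randomly into folds.

    Hypothetical helper, not part of scikit-learn.
    """
    groups = np.asarray(groups)
    unique_groups, group_idx = np.unique(groups, return_inverse=True)
    n_groups = len(unique_groups)
    if n_splits > n_groups:
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    rng = np.random.default_rng(seed)
    # Put the groups in random order, then deal them round-robin into folds.
    fold_of_group = np.empty(n_groups, dtype=int)
    fold_of_group[rng.permutation(n_groups)] = np.arange(n_groups) % n_splits

    # Every row inherits the fold of its group, so no group is ever split.
    fold_of_row = fold_of_group[group_idx]
    for fold in range(n_splits):
        val_mask = fold_of_row == fold
        yield np.flatnonzero(~val_mask), np.flatnonzero(val_mask)


groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
for seed in range(2):
    for train_idx, val_idx in shuffled_group_kfold(groups, n_splits=3, seed=seed):
        print("seed", seed, "validation groups:", sorted(set(groups[val_idx].tolist())))
```

The index pairs can be collected into a list and passed wherever scikit-learn accepts an iterable of (train, test) splits as `cv`. As far as I know, recent scikit-learn releases (1.6 and later) also add a `shuffle` option to GroupKFold itself, so it is worth checking the documentation for your installed version before hand-rolling this.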

