Scikit-learn balanced subsampling

Question

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a simple way to do this with scikit-learn / pandas, or do I have to implement it myself? Any pointers to code that does this?

These subsamples should be random and may overlap, since I feed each one to a separate classifier in a very large ensemble of classifiers.

In Weka there is a tool called SpreadSubsample. Is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample

(I know about weighting, but that's not what I'm looking for.)

Recommended answer

Here is my first version, which seems to work fine. Feel free to copy it or to suggest how it could be made more efficient (I have long experience with programming in general, but less with Python and NumPy).

This function creates a single random balanced subsample.

Edit: subsample_size is currently applied as a fraction of the smallest class, so it samples down the minority classes as well. This should probably be changed.

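import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):
    # Split the rows of x by class label and track the smallest class size.
    # x and y are expected to be NumPy arrays of equal length.
    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    # Rows drawn per class: the smallest class size,
    # scaled down when subsample_size < 1.
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    for ci, this_xs in class_xs:
        # Boolean indexing above returned a copy, so shuffling in place
        # leaves the caller's x untouched.
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.full(use_elems, ci)  # labels keep their original dtype

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys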
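
The question asks for N subsamples that may overlap, while the function above produces one. As a rough sketch of how that could be done, the block below draws row indices per class and repeats the draw N times. It is an index-based variant, not the code from the answer above: the names balanced_subsample_indices and iter_balanced_subsamples and the n_per_class and rng parameters are made up for illustration, and it assumes x and y are (or can be converted to) NumPy arrays.

```python
import numpy as np


def balanced_subsample_indices(y, n_per_class=None, rng=None):
    """Return shuffled row indices with the same number of rows per class."""
    rng = np.random.default_rng(rng)  # accepts None, a seed, or a Generator
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    if n_per_class is None:
        n_per_class = counts.min()  # default: size of the smallest class
    if n_per_class > counts.min():
        raise ValueError("n_per_class exceeds the smallest class size")
    # Draw without replacement inside each class, then mix the classes.
    picked = [
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        for c in classes
    ]
    return rng.permutation(np.concatenate(picked))


def iter_balanced_subsamples(x, y, n_subsamples, rng=None):
    """Yield n_subsamples independent balanced subsamples (they may overlap)."""
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x), np.asarray(y)
    for _ in range(n_subsamples):
        idx = balanced_subsample_indices(y, rng=rng)
        yield x[idx], y[idx]


# Toy data: 7 rows of class 0 and 3 rows of class 1.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 7 + [1] * 3)
for xs_demo, ys_demo in iter_balanced_subsamples(x_demo, y_demo, n_subsamples=3, rng=0):
    print(np.bincount(ys_demo))  # prints [3 3] for every subsample
```

Working with indices avoids copying each class's rows before sampling, and each yielded pair can be passed to a separate classifier in the ensemble.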

For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes:

Replace the np.random.shuffle line with:

this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

Replace the np.concatenate lines with:

xs = pd.concat(xs)
ys = pd.Series(data=np.concatenate(ys),name='target')
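
If the features and the label live in a single DataFrame, a pandas-only route may be simpler than patching the NumPy function. The sketch below is an alternative, not the answer's code: it assumes pandas 1.1 or newer (for GroupBy.sample) and a label column stored in the frame. The function name balanced_subsample_df, the frac_of_smallest parameter, and the column name 'target' are chosen here for illustration.

```python
import pandas as pd


def balanced_subsample_df(df, target_col, frac_of_smallest=1.0, random_state=None):
    """Return a DataFrame with the same number of rows for each class."""
    # Size of the smallest class, optionally scaled down.
    n_min = df[target_col].value_counts().min()
    n_per_class = int(n_min * frac_of_smallest)
    # Sample n_per_class rows without replacement inside every class.
    # The original index is kept, so features and labels stay aligned.
    return df.groupby(target_col).sample(n=n_per_class, random_state=random_state)


# Toy frame: 7 rows of class 0 and 3 rows of class 1.
df_demo = pd.DataFrame({"feature": range(10), "target": [0] * 7 + [1] * 3})
balanced_demo = balanced_subsample_df(df_demo, "target", random_state=0)
print(balanced_demo["target"].value_counts().to_dict())
```

The result comes back grouped by class. Shuffling it with balanced_demo.sample(frac=1) is a common follow-up when row order matters.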
