This article covers how to randomly sample from a data set while preserving its original probability distribution.

Problem description


I have a set of more than 2000 numbers gathered from measurements. I want to sample from this data set, about 10 values per test, while preserving the probability distribution both overall and, to the extent possible, within each test. For example, in each test I want some small values, some mid-range values, and some large values, with the mean and variance approximately equal to those of the original distribution. Combining all the tests, I also want the mean and variance of all the samples to be approximately equal to those of the original distribution.

As my data set has a long-tailed probability distribution, the amount of data at each quantile is not the same:

Fig 1. Density plot of ~2k elements of data.

I am using Java. Right now I draw a uniformly distributed random index into the data set and return the data element at that position:

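import java.util.Random;

public int getRandomData() {
    // Measured values; the full list (>2000 entries) is elided here.
    int[] data = {1231, 414, 222, 4211, /* ... */ 41, 203, 123, 432 /* , ... */};
    int length = data.length;           // number of measured values
    Random r = new Random();
    int randomInt = r.nextInt(length);  // uniform index in [0, length)
    return data[randomInt];
}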

I don't know whether this works as I want, because the data is stored in the order in which it was measured, and it has a large amount of serial correlation.

Solution

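It works as you want. The order of the data is irrelevant: each call picks an index uniformly and independently of the previous calls, so every element is equally likely to be returned regardless of its position, and the samples follow the empirical distribution of the data set.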
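To make the answer concrete, here is a minimal sketch that draws by uniform random index and compares the mean and variance of the pooled samples with those of the source array. The class name `UniformIndexSampling`, the helper methods, the fixed seed, and the synthetic long-tailed data are all assumptions made for illustration; they are not part of the original question or answer.

```java
import java.util.Arrays;
import java.util.Random;

public class UniformIndexSampling {

    /** Draws n elements with replacement by picking uniform random indices. */
    public static int[] sample(int[] data, int n, Random rng) {
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = data[rng.nextInt(data.length)];
        }
        return out;
    }

    /** Arithmetic mean of the values. */
    public static double mean(int[] values) {
        double sum = 0;
        for (int v : values) sum += v;
        return sum / values.length;
    }

    /** Population variance (divides by the number of values). */
    public static double variance(int[] values) {
        double m = mean(values);
        double ss = 0;
        for (int v : values) ss += (v - m) * (v - m);
        return ss / values.length;
    }

    public static void main(String[] args) {
        // Hypothetical long-tailed data, stored in sorted order so that
        // neighbouring entries are strongly correlated.
        int[] data = new int[2000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (int) Math.round(Math.exp(i / 250.0));
        }

        Random rng = new Random(42); // fixed seed only to make runs repeatable

        // 1000 tests of 10 draws each, pooled into one array.
        int[] pooled = sample(data, 1000 * 10, rng);
        System.out.printf("original: mean=%.1f var=%.1f%n", mean(data), variance(data));
        System.out.printf("pooled:   mean=%.1f var=%.1f%n", mean(pooled), variance(pooled));

        // A single test of 10 draws.
        System.out.println("one test: " + Arrays.toString(sample(data, 10, rng)));
    }
}
```

With only 10 draws, the mean and variance of a single test can differ noticeably from those of the full data set, especially for a long-tailed distribution. The pooled statistics over many tests should move closer to the original values as the number of draws grows.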


08-23 16:07