Problem Description
I have an RDD of integers (i.e. RDD[Int]) and what I would like to do is to compute the following percentiles: [0th, 10th, 20th, ..., 90th, 100th]. What is the most efficient way to do that?
Recommended Answer
You can:
- Sort the dataset via rdd.sortBy()
- Compute the size of the dataset via rdd.count()
- Zip it with an index to make percentile retrieval easy
- Retrieve the desired percentile via rdd.lookup(), e.g. for the 10th percentile: rdd.lookup(0.1 * size)
To compute the median and the 99th percentile: getPercentiles(rdd, new double[]{0.5, 0.99}, size, numPartitions);
In Java 8:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public static double[] getPercentiles(JavaRDD<Double> rdd, double[] percentiles, long rddSize, int numPartitions) {
    double[] values = new double[percentiles.length];

    // Sort the values and pair each one with its rank in the sorted order.
    JavaRDD<Double> sorted = rdd.sortBy((Double d) -> d, true, numPartitions);
    JavaPairRDD<Long, Double> indexed = sorted.zipWithIndex().mapToPair((Tuple2<Double, Long> t) -> t.swap());

    for (int i = 0; i < percentiles.length; i++) {
        double percentile = percentiles[i];
        // Clamp so the 100th percentile (percentile == 1.0) maps to the last element
        // instead of an index just past the end of the RDD.
        long id = Math.min((long) (rddSize * percentile), rddSize - 1);
        values[i] = indexed.lookup(id).get(0);
    }
    return values;
}
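For the integer RDD from the question, a minimal usage sketch might look like the following. It is only an illustration: the variable sc (an existing JavaSparkContext) and the sample data are assumptions, not part of the original answer.

// Assumed context: sc is an existing JavaSparkContext and java.util.Arrays is imported.
JavaRDD<Integer> intRdd = sc.parallelize(Arrays.asList(5, 1, 9, 3, 7, 2, 8, 4, 6, 0));

// getPercentiles expects doubles, so convert the integers first.
JavaRDD<Double> doubleRdd = intRdd.map(Integer::doubleValue);

long size = doubleRdd.count();
double[] deciles = getPercentiles(
        doubleRdd,
        new double[]{0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
        size,
        doubleRdd.getNumPartitions());

Since each rdd.lookup() launches its own job, it may be worth caching the indexed RDD inside getPercentiles when many percentiles are requested.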
Note that this requires sorting the dataset, which is O(n log n) and can be expensive on large datasets.
The other answer, which suggests simply computing a histogram, would not compute the percentiles correctly. Here is a counterexample: a dataset composed of 100 numbers, 99 of them being 0 and one being 1. You end up with all 99 zeros in the first bin and the single 1 in the last bin, with 8 empty bins in the middle.
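To make that counterexample concrete, here is a small sketch using Spark's histogram API; sc is again an assumed JavaSparkContext, and the bucket count of 10 mirrors the ten decile bins.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.spark.api.java.JavaDoubleRDD;
import scala.Tuple2;

// 99 zeros and a single 1: every percentile up to the 99th is 0.
List<Double> data = new ArrayList<>(Collections.nCopies(99, 0.0));
data.add(1.0);
JavaDoubleRDD rdd = sc.parallelizeDoubles(data);

// histogram(10) returns (bucket boundaries, counts).
// counts comes back as [99, 0, 0, 0, 0, 0, 0, 0, 0, 1]: all the mass sits in the
// first bucket, so the bucket boundaries reveal nothing about the 10th..90th percentiles.
Tuple2<double[], long[]> hist = rdd.histogram(10);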