Problem Description
I have an RDD of integers (i.e. RDD[Int]) and what I would like to do is to compute the following percentiles: [0th, 10th, 20th, ..., 90th, 100th]. What is the most efficient way to do that?
Recommended Answer
You can:
- Sort the dataset via rdd.sortBy()
- Compute the size of the dataset via rdd.count()
- Zip it with an index to make percentile retrieval easy
- Retrieve the desired percentile via rdd.lookup(), e.g. for the 10th percentile: rdd.lookup(0.1 * size)
To compute the median and the 99th percentile: getPercentiles(rdd, new double[]{0.5, 0.99}, size, numPartitions);
In Java 8:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public static double[] getPercentiles(JavaRDD<Double> rdd, double[] percentiles, long rddSize, int numPartitions) {
    double[] values = new double[percentiles.length];

    // Sort the values and pair each one with its rank in the sorted order.
    JavaRDD<Double> sorted = rdd.sortBy((Double d) -> d, true, numPartitions);
    JavaPairRDD<Long, Double> indexed = sorted.zipWithIndex().mapToPair((Tuple2<Double, Long> t) -> t.swap());

    for (int i = 0; i < percentiles.length; i++) {
        double percentile = percentiles[i];
        // Clamp so the 100th percentile (percentile == 1.0) maps to the last element
        // instead of an index just past the end of the RDD.
        long id = Math.min((long) (rddSize * percentile), rddSize - 1);
        values[i] = indexed.lookup(id).get(0);
    }
    return values;
}
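For the integer RDD from the question, a minimal usage sketch might look like the following. It is only an illustration: the variable sc (an existing JavaSparkContext) and the sample data are assumptions, not part of the original answer.

// Assumed context: sc is an existing JavaSparkContext and java.util.Arrays is imported.
JavaRDD<Integer> intRdd = sc.parallelize(Arrays.asList(5, 1, 9, 3, 7, 2, 8, 4, 6, 0));

// getPercentiles expects doubles, so convert the integers first.
JavaRDD<Double> doubleRdd = intRdd.map(Integer::doubleValue);

long size = doubleRdd.count();
double[] deciles = getPercentiles(
        doubleRdd,
        new double[]{0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
        size,
        doubleRdd.getNumPartitions());

Since each rdd.lookup() launches its own job, it may be worth caching the indexed RDD inside getPercentiles when many percentiles are requested.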
Note that this requires sorting the dataset, which is O(n log n) and can be expensive on large datasets.
The other answer, which suggests simply computing a histogram, would not compute the percentiles correctly. Here is a counterexample: a dataset composed of 100 numbers, 99 of them being 0 and one being 1. You end up with all 99 zeros in the first bin and the single 1 in the last bin, with 8 empty bins in the middle.
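To make that counterexample concrete, here is a small sketch using Spark's histogram API; sc is again an assumed JavaSparkContext, and the bucket count of 10 mirrors the ten decile bins.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.spark.api.java.JavaDoubleRDD;
import scala.Tuple2;

// 99 zeros and a single 1: every percentile up to the 99th is 0.
List<Double> data = new ArrayList<>(Collections.nCopies(99, 0.0));
data.add(1.0);
JavaDoubleRDD rdd = sc.parallelizeDoubles(data);

// histogram(10) returns (bucket boundaries, counts).
// counts comes back as [99, 0, 0, 0, 0, 0, 0, 0, 0, 1]: all the mass sits in the
// first bucket, so the bucket boundaries reveal nothing about the 10th..90th percentiles.
Tuple2<double[], long[]> hist = rdd.histogram(10);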