c# - Fast algorithm for computing percentiles to remove outliers

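I have a program that needs to repeatedly compute approximate percentiles (order statistics) of a data set in order to remove outliers before further processing. I currently do this by sorting the array of values and picking the appropriate element. This works, but it shows up noticeably in profiling even though it is a fairly minor part of the program.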
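For reference, the sort-based baseline described above might look roughly like the following. This is only a sketch: the class and method names are made up for illustration, and the rank convention (truncating `fraction * (n - 1)`) is an assumption, since the question does not say which one is used.

```csharp
using System;

public static class SortBaseline
{
    // Returns the element at the given fraction (0..1) of the sorted order.
    // Sorts a copy so the caller's array is left untouched: O(n log n).
    public static double PercentileBySort(double[] values, double fraction)
    {
        if (values == null || values.Length == 0)
            throw new ArgumentException("values must be non-empty", nameof(values));

        double[] sorted = (double[])values.Clone();
        Array.Sort(sorted);

        // Assumed rank convention: truncate fraction * (n - 1), then clamp.
        int k = (int)(fraction * (sorted.Length - 1));
        k = Math.Max(0, Math.Min(sorted.Length - 1, k));
        return sorted[k];
    }
}
```

Calling it once for 0.05 and once for 0.95 sorts the data twice, which is part of why this approach is more expensive than it needs to be.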

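More info:

The data set contains on the order of up to 100000 floating point numbers and is assumed to be "reasonably" distributed: there are unlikely to be duplicates or huge spikes in density near particular values. If for some odd reason the distribution is odd, it's OK for the approximation to be less accurate, since the data is probably messed up anyway and further processing is dubious. However, the data isn't necessarily uniformly or normally distributed; it's just very unlikely to be degenerate.

An approximate solution would be fine, but I do need to understand how the approximation introduces error, to make sure it's valid.

Since the aim is to remove outliers, I'm always computing two percentiles over the same data: e.g. one at 95% and one at 5%.

The app is in C# with bits of heavy lifting in C++; pseudocode or a pre-existing library in either would be fine.

An entirely different way of removing outliers would be fine too, as long as it's reasonable.

Update: it seems I'm looking for an approximate selection algorithm.

Although this is all done in a loop, the data is (slightly) different every time, so it's not easy to reuse a data structure as was done in a related question.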

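Implemented solution

Using the Wikipedia selection algorithm, as suggested by Gronim, reduced this part of the run time by about a factor of 20.

Since I couldn't find a C# implementation, here's what I came up with. It's faster than Array.Sort even for small inputs; at 1000 elements it's 25 times faster.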

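// Note: these members must sit inside a static class, because swap_elems is an extension method.
// QuickSelect reorders list in place and returns the element that would be at index k if list were sorted.
public static double QuickSelect(double[] list, int k) {
    return QuickSelect(list, k, 0, list.Length);
}
// Same, restricted to the half-open range [startI, endI).
// On return, list[i] <= list[k] for startI <= i < k and list[i] >= list[k] for k < i < endI.
public static double QuickSelect(double[] list, int k, int startI, int endI) {
    while (true) {
        // Assume startI <= k < endI
        int pivotI = (startI + endI) / 2; //arbitrary, but good if sorted
        int splitI = partition(list, startI, endI, pivotI);
        if (k < splitI)
            endI = splitI;
        else if (k > splitI)
            startI = splitI + 1;
        else //if (k == splitI)
            return list[k];
    }
}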
static int partition(double[] list, int startI, int endI, int pivotI) {
    double pivotValue = list[pivotI];
    list[pivotI] = list[startI];
    list[startI] = pivotValue;

    int storeI = startI + 1;//no need to store @ pivot item, it's good already.
    //Invariant: startI < storeI <= endI
    while (storeI < endI && list[storeI] <= pivotValue) ++storeI; //fast if sorted
    //now storeI == endI || list[storeI] > pivotValue
    //so elem @storeI is either irrelevant or too large.
    for (int i = storeI + 1; i < endI; ++i)
        if (list[i] <= pivotValue) {
            list.swap_elems(i, storeI);
            ++storeI;
        }
    int newPivotI = storeI - 1;
    list[startI] = list[newPivotI];
    list[newPivotI] = pivotValue;
    //now [startI, newPivotI] are <= to pivotValue && list[newPivotI] == pivotValue.
    return newPivotI;
}
static void swap_elems(this double[] list, int i, int j) {
    double tmp = list[i];
    list[i] = list[j];
    list[j] = tmp;
}

Thanks, Gronim, for pointing me in the right direction!
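Since the question needs two percentiles of the same data, one possible refinement is to select the high cut-off first and then search only the part of the array to its left for the low one. The sketch below is self-contained and uses its own plain quickselect rather than the exact code above; `OutlierBounds`, `Find`, and the rank convention (truncating `fraction * (n - 1)`) are illustrative assumptions, not part of the original post.

```csharp
using System;

public static class OutlierBounds
{
    // Returns (low, high) cut-offs at the given fractions, e.g. 0.05 and 0.95.
    // Works on a copy, because selection reorders the array in place.
    public static (double Low, double High) Find(double[] values, double lowFraction, double highFraction)
    {
        if (values == null || values.Length == 0)
            throw new ArgumentException("values must be non-empty", nameof(values));
        if (lowFraction < 0 || highFraction > 1 || lowFraction > highFraction)
            throw new ArgumentException("need 0 <= lowFraction <= highFraction <= 1");

        double[] work = (double[])values.Clone();
        int kLow = (int)(lowFraction * (work.Length - 1));   // assumed rank convention
        int kHigh = (int)(highFraction * (work.Length - 1));

        // First selection over the whole array.
        double high = Select(work, kHigh, 0, work.Length);
        // Everything left of kHigh is now <= high, so the second selection
        // only needs to look at [0, kHigh] (the end index is exclusive).
        double low = Select(work, kLow, 0, kHigh + 1);
        return (low, high);
    }

    // Iterative quickselect on the half-open range [start, end).
    private static double Select(double[] a, int k, int start, int end)
    {
        while (true)
        {
            int split = Partition(a, start, end, start + (end - start) / 2);
            if (k < split) end = split;
            else if (k > split) start = split + 1;
            else return a[k];
        }
    }

    // Lomuto-style partition: puts the pivot at its sorted position and returns that index.
    private static int Partition(double[] a, int start, int end, int pivotIndex)
    {
        double pivot = a[pivotIndex];
        Swap(a, pivotIndex, end - 1);
        int store = start;
        for (int i = start; i < end - 1; i++)
        {
            if (a[i] < pivot)
            {
                Swap(a, i, store);
                store++;
            }
        }
        Swap(a, store, end - 1);
        return store;
    }

    private static void Swap(double[] a, int i, int j)
    {
        double tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

With the two cut-offs in hand, removing outliers is a single pass that keeps values between `Low` and `High`. Whether the narrowed second search helps measurably would need profiling on real data.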

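Best answer

The histogram solution from Henrik will work. You can also use a selection algorithm to efficiently find the k largest or smallest elements in an array of n elements in O(n). To use this for the 95th percentile, set k = 0.05n and find the k largest elements; the smallest of those k is the 95th percentile cut-off.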
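Henrik's answer is not reproduced here, so the following is only a guess at what a histogram-based approximation could look like, not his actual code. It assumes equal-width bins between the minimum and maximum, no NaN values, and a hypothetical `binCount` parameter. Because it returns the midpoint of the bin that contains the target rank, the error is at most about half a bin width, which gives a direct handle on how the approximation introduces error.

```csharp
using System;

public static class HistogramPercentile
{
    // Approximates the value at `fraction` (0..1) of the sorted order
    // using `binCount` equal-width bins. Two passes over the data: O(n + binCount).
    public static double Approximate(double[] values, double fraction, int binCount)
    {
        if (values == null || values.Length == 0)
            throw new ArgumentException("values must be non-empty", nameof(values));
        if (fraction < 0 || fraction > 1)
            throw new ArgumentOutOfRangeException(nameof(fraction));
        if (binCount < 1)
            throw new ArgumentOutOfRangeException(nameof(binCount));

        // Pass 1: find the data range.
        double min = values[0], max = values[0];
        foreach (double v in values)
        {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        if (max == min) return min; // all values are equal

        // Pass 2: count values per bin (the maximum is clamped into the last bin).
        double width = (max - min) / binCount;
        int[] counts = new int[binCount];
        foreach (double v in values)
        {
            int bin = Math.Min(binCount - 1, (int)((v - min) / width));
            counts[bin]++;
        }

        // Walk the bins until the cumulative count passes the target rank.
        int k = (int)(fraction * (values.Length - 1)); // assumed rank convention
        int seen = 0;
        for (int b = 0; b < binCount; b++)
        {
            seen += counts[b];
            if (seen > k)
                return min + (b + 0.5) * width; // midpoint of the bin holding rank k
        }
        return max; // not reached for a fraction in [0, 1]
    }
}
```

The catch is that a single extreme outlier stretches the range and widens every bin, so accuracy depends on the data being reasonably distributed, as the question assumes.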

Reference:

http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/3779763/