

I want to calculate percentiles from an ensemble of multiple large vectors in Python. Instead of trying to concatenate the vectors and then putting the resulting huge vector through numpy.percentile, is there a more efficient way?

My idea is to first count the frequencies of the distinct values in each vector (e.g. using scipy.stats.itemfreq), then combine those frequency tables across the vectors, and finally calculate the percentiles from the combined counts.


Unfortunately I haven't been able to find functions either for combining the frequency tables (it is not very simple, as different tables may cover different items), or for calculating percentiles from an item frequency table. Do I need to implement these, or can I use existing Python functions? What would those functions be?
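On the first point (combining tables that cover different items), here is a minimal sketch using only the standard library. The helper name merge_frequency_tables and the sample values are made up for illustration; the one real API relied on is collections.Counter, whose update() adds counts together and treats keys missing from either table as zero.

```python
from collections import Counter

def merge_frequency_tables(tables):
    """Sum several {value: count} tables into a single Counter.

    Hypothetical helper for illustration; values that appear in only
    some of the tables are handled without any special casing.
    """
    merged = Counter()
    for table in tables:
        merged.update(table)  # update() adds counts, it does not overwrite them
    return merged

# Two small tables that only partly overlap (sample data, not from the question)
table_a = Counter([1.5, 2.0, 2.0])
table_b = Counter([2.0, 3.5])
merged = merge_frequency_tables([table_a, table_b])
```

Here merged holds a count of 3 for 2.0 and a count of 1 each for 1.5 and 3.5.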


Following Julien Palard's suggestion, I used collections.Counter for the first problem (calculating and combining frequency tables). Below is that, together with my implementation for the second problem (calculating percentiles from a frequency table):

from collections import Counter

def calc_percentiles(cnts_dict, percentiles_to_calc=range(101)):
    """Returns [(percentile, value)] with nearest rank percentiles.
    Percentile 0: <min_value>, 100: <max_value>.
    cnts_dict: { <value>: <count> }
    percentiles_to_calc: iterable for percentiles to calculate; 0 <= ~ <= 100
    """
    assert all(0 <= p <= 100 for p in percentiles_to_calc)
    percentiles = []
    num = sum(cnts_dict.values())
    cnts = sorted(cnts_dict.items())  # [(value, count)] in ascending value order
    curr_cnts_pos = 0  # current position in cnts
    curr_pos = cnts[0][1]  # sum of freqs up to curr_cnts_pos
    for p in sorted(percentiles_to_calc):
        if p < 100:
            percentile_pos = p / 100.0 * num
            # advance until the cumulative count exceeds the target position
            while curr_pos <= percentile_pos and curr_cnts_pos < len(cnts) - 1:
                curr_cnts_pos += 1
                curr_pos += cnts[curr_cnts_pos][1]
            percentiles.append((p, cnts[curr_cnts_pos][0]))
        else:
            percentiles.append((p, cnts[-1][0]))  # we could add a small value
    return percentiles

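cnts_dict = Counter()
for segment in segment_iterator:  # segment_iterator: any iterable yielding the vectors
    cnts_dict.update(segment)  # adds the counts in place, no temporary Counter needed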

percentiles = calc_percentiles(cnts_dict)
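As a cross-check, the same selection rule can be sketched with a cumulative-count list and a binary search instead of a linear walk. This is an illustrative variant, not part of the original answer: the name percentiles_from_counts and the sample segments are assumptions, and the rule implemented is the one calc_percentiles appears to aim for (pick the first value whose cumulative count exceeds p / 100 * n, with the maximum for p = 100).

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

def percentiles_from_counts(counts, percentiles):
    """Return [(percentile, value)] from a non-empty {value: count} mapping.

    Hypothetical helper: for each p, picks the first value whose
    cumulative count exceeds p / 100 * n; p = 100 maps to the maximum.
    """
    items = sorted(counts.items())
    values = [value for value, _ in items]
    cumulative = list(accumulate(count for _, count in items))
    total = cumulative[-1]
    result = []
    for p in percentiles:
        pos = p / 100.0 * total
        # index of the first cumulative count strictly greater than pos,
        # clamped so that p = 100 falls on the last value
        idx = min(bisect_right(cumulative, pos), len(values) - 1)
        result.append((p, values[idx]))
    return result

# Sample segments standing in for the large vectors (made-up data)
segments = [[1, 2, 2, 5], [2, 3, 9], [7, 7, 1]]
counts = Counter()
for segment in segments:
    counts.update(segment)

result = percentiles_from_counts(counts, [0, 25, 50, 90, 100])
```

For these ten sample values the result is [(0, 1), (25, 2), (50, 3), (90, 9), (100, 9)], which matches indexing the sorted, concatenated data at position floor(p / 100 * n). Note that numpy.percentile interpolates linearly by default, so it can return different numbers (2.5 for the median here).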


06-26 10:58