问题描述
我在寻找一单通算法发现漂浮在流中我不知道提前总数的百分之topX ...但它在5-30万元花车的订单。它需要单通因为数据是在飞行中产生和重新创建精确流的第二时间
I'm looking for a single-pass algorithm for finding the topX percent of floats in a stream where I do not know the total number ahead of time ... but its on the order of 5-30 million floats. It needs to be single-pass since the data is generated on the fly and recreate the exact stream a second time.
这个算法我到目前为止是保持我到目前为止看到的topX项目排序列表。作为流继续予放大根据需要的列表。然后,我用 bisect_left
如果需要找到插入点。
The algorithm I have so far is to keep a sorted list of the topX items that I've seen so far. As the stream continues I enlarge the list as needed. Then I use bisect_left
to find the insertion point if needed.
下面是该算法我到目前为止有:
Below is the algorithm I have so far:
from bisect import bisect_left
from random import uniform
from itertools import islice
def data_gen(num):
for _ in xrange(num):
yield uniform(0,1)
def get_top_X_percent(iterable, percent = 0.01, min_guess = 1000):
top_nums = sorted(list(islice(iterable, int(percent*min_guess)))) #get an initial guess
for ind, val in enumerate(iterable, len(top_nums)):
if int(percent*ind) > len(top_nums):
top_nums.insert(0,None)
newind = bisect_left(top_nums, val)
if newind > 0:
top_nums.insert(newind, val)
top_nums.pop(0)
return top_nums
if __name__ == '__main__':
num = 1000000
all_data = sorted(data_gen(num))
result = get_top_X_percent(all_data)
assert result[0] == all_data[-int(num*0.01)], 'Too far off, lowest num:%f' % result[0]
print result[0]
在实际情况下,数据并非来自任何标准分配(否则我可以用一些统计数据的知识)。
In the real case the data does not come from any standard distribution (otherwise I could use some statistics knowledge).
任何建议将是AP preciated。
Any suggestions would be appreciated.
推荐答案
我不知道有什么办法能真正做到这一点可靠,并标示为最高X%的人的范围内可以生长未predictably作为你会看到更多的元素。考虑下面的输入:
I'm not sure there's any way to actually do that reliably, as the range denoted by the "top X percent" can grow unpredictably as you see more elements. Consider the following input:
101 102 103 104 105 106 107 108 109 110 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
如果你想要元素的前25%,你最终挑选101和102出前十的元素,但看到后有足够的零之后,你最终还是要结束了所有选择前十的。同样的模式可以扩展到任何足够大的数据流 - 它总是能够最终得到误导的外观和你实际应该保持丢弃的元素。因此,除非你知道时间提前流的精确长度,我不认为这是可能的(短期保持每个元素在内存中,直到你打流的结束)。
If you wanted the top 25% of elements, you'd end up picking 101 and 102 out of the first ten elements, but after seeing enough zeroes after there you'd eventually have to end up selecting all of the first ten. This same pattern can be expanded to any sufficiently large stream -- it's always possible to end up getting misled by appearances and discarding elements that you actually should have kept. As such, unless you know the exact length of the stream ahead of time, I don't think this is possible (short of keeping every element in memory until you hit the end of the stream).
这篇关于单通道算法寻找项目的topX%的的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!