algorithm - 在十亿个文件中查找最大的一百个数字

我今天去接受采访，并被问到以下问题:

我什至不知道我将从这个问题开始。给出正确结果的最有效过程是什么？我需要遍历磁盘文件一百次以获取列表中尚未出现的最高编号，还是有更好的方法？

最佳答案

这是我的初始算法:

create array of size 100 [0..99].
read first 100 numbers and put into array.
sort array in ascending order.
while more numbers in file:
    get next number N.
    if N > array[0]:
        if N > array[99]:
            shift array[1..99] to array[0..98].
            set array[99] to N.
        else
            find, using binary search, first index i where N <= array[i].
            shift array[1..i-1] to array[0..i-2].
            set array[i-1] to N.
        endif
    endif
endwhile

(非常轻微的)优点是，前100个元素没有O(n ^ 2)改组，只有O(n log n)排序，您可以很快地识别并丢弃那些太小的元素。它还使用二进制搜索(最多7个比较)来找到正确的插入点，而不是简单的线性搜索(平均而言是50个)(不是我建议其他人提供这样的解决方案，只是可能给面试官印象深刻) )。

如果您可以确定重叠不是问题，甚至可以建议使用C中的shift等优化的memcpy操作，从而获得积分。

您可能要考虑的另一种可能性是维护三个列表(每个列表最多100个整数):

read first hundred numbers into array 1 and sort them descending.
while more numbers:
    read up to next hundred numbers into array 2 and sort them descending.
    merge-sort lists 1 and 2 into list 3 (only first (largest) 100 numbers).
    if more numbers:
        read up to next hundred numbers into array 2 and sort them descending.
        merge-sort lists 3 and 2 into list 1 (only first (largest) 100 numbers).
    else
        copy list 3 to list 1.
    endif
endwhile

我不确定，但是最终可能会比连续改组更有效率。

合并排序是按照以下方式进行的简单选择(对于将列表1和2合并为3):

list3.clear()
while list3.size() < 100:
    while list1.peek() >= list2.peek():
        list3.add(list1.pop())
    endwhile
    while list2.peek() >= list1.peek():
        list3.add(list2.pop())
    endwhile
endwhile

简而言之，由于它们已经按降序排序，因此将前100个值从组合列表中拉出。我没有详细检查这是否会更有效，我只是提供了一种可能性。

我怀疑面试官会对“开箱即用”的思维能力以及您说应该对其表现进行评估的事实印象深刻。

与大多数面试一样，技术技能是他们正在考虑的内容之一。