问题描述
2 路归并作为归并排序算法的一部分被广泛研究.但我有兴趣找出执行 N 路合并的最佳方式?
A 2-way merge is widely studied as a part of Mergesort algorithm.But I am interested to find out the best way one can perform an N-way merge?
比方说,我有 N
个文件,每个文件都对 100 万个整数进行了排序.我必须将它们合并到 1 个包含 1 亿个排序整数的文件中.
Lets say, I have N
files which have sorted 1 million integers each.I have to merge them into 1 single file which will have those 100 million sorted integers.
请记住,此问题的用例实际上是基于磁盘的外部排序.因此,在实际场景中也会存在内存限制.因此,一次(99 次)合并 2 个文件的幼稚方法是行不通的.假设我们只有一个小的内存滑动窗口可用于每个数组.
Please keep in mind that use case for this problem is actually external sorting which is disk based. Therefore, in real scenarios there would be memory limitation as well. So a naive approach of merging 2 files at a time (99 times) won't work. Lets say we have only a small sliding window of memory available for each array.
我不确定是否已经有针对此 N 路合并的标准化解决方案.(谷歌搜索并没有告诉我太多).
I am not sure if there is already a standardized solution to this N-way merge. (Googling didn't tell me much).
但是如果你知道一个好的n-way合并算法,请发布algo/link.
But if you know if a good n-way merge algorithm, please post algo/link.
时间复杂度:如果我们大大增加要合并的文件数量(N
),这将如何影响算法的时间复杂度?
Time complexity: If we greatly increase the number of files (N
) to be merged, how would that affect the time complexity of your algorithm?
感谢您的回答.
我在任何地方都没有被问到这个问题,但我觉得这可能是一个有趣的面试问题.因此标记.
推荐答案
下面的想法怎么样:
- 创建优先队列
- 遍历每个文件f
- 使用第一个值作为优先键对 (nextNumberIn(f), f) 入队
- 出队头(m, f)
- 输出m
- 如果 f 没有耗尽
- 入队(nextNumberIn(f), f)
由于向优先队列添加元素可以在对数时间内完成,因此第 2 项是 O(N × log N).由于(几乎所有)while 循环迭代添加了一个元素,整个while 循环是O(M × log N) 其中M 是数字的总数排序.
Since adding elements to a priority queue can be done in logarithmic time, item 2 is O(N × log N). Since (almost all) iterations of the while loop adds an element, the whole while-loop is O(M × log N) where M is the total number of numbers to sort.
假设所有文件都有一个非空的数字序列,我们有 M > N,因此整个算法应该是 O(M × log N).
Assuming all files have a non-empty sequence of numbers, we have M > N and thus the whole algorithm should be O(M × log N).
这篇关于N路合并算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!