问题描述
在阅读之后,我开始怀疑:是否有可能有一个改组算法,它不会修改或复制原始列表?
After reading this question I started to wonder: is it possible to have a shuffling algorithm which does not modify or copy the original list?
要清楚一点:
想象您会得到一个对象列表。列表大小可以是任意的,但要假定它很大(例如10,000,000个项目)。您需要以随机顺序打印出列表中的项目,并且需要尽快进行打印。但是,您不应该:
Imagine you are given a list of objects. The list size can be arbitrary, but assume it's pretty large (say, 10,000,000 items). You need to print out the items of the list in random order, and you need to do it as fast as possible. However, you should not:
- 复制原始列表,因为它很大,复制会浪费很多内存(可能会达到极限)可用的RAM);
- 修改原始列表,因为它是以某种方式排序的,以后的其他部分则取决于它的排序方式。
- 创建索引列表,因为该列表又很大,并且复制占用了太多时间和内存。 (澄清:这表示其他任何列表,其元素数量与原始列表相同)。
这可能吗?
已添加:更多说明。
- I希望以真正随机的方式对列表进行混洗,所有排列的可能性均等(当然,假设我们有一个适当的Rand()函数开始)。
- 我列出的建议指针,索引列表或元素列表数量与原始列表相同的任何其他列表被原始问题明确视为无效。您可以根据需要创建其他列表,但它们的数量应比原始列表小几个数量级。
- 原始列表就像一个数组,您可以从中检索任何项目。根据其在O(1)中的索引。 (因此,没有双向链接的列表内容,您必须在列表中进行迭代才能找到所需的项目。)
添加2 :好的,让我们这样说:您有一个1TB的HDD,其中装有数据项,每个数据项大512字节(一个扇区)。您想在将所有项目混排的同时将所有这些数据复制到另一个1TB HDD。您想要尽快执行此操作(单次传递数据等)。您有512MB的可用内存,无需依靠交换。 (这是理论上的情况,实际上我没有这样的东西。我只是想找到理想的算法。项目。)
Added 2: OK, let's put it this way: You have a 1TB HDD filled with data items, each 512 bytes large (a single sector). You want to copy all this data to another 1TB HDD while shuffling all the items. You want to do this as fast as possible (single pass over data, etc). You have 512MB of RAM available, and don't count on swap. (This is a theoretical scenario, I don't have anything like this in practice. I just want to find the perfect algorithm.item.)
推荐答案
这是一个非常简单的证明,没有PRNG方案可以工作:
Here is a very simple proof that no PRNG scheme can work:
但是,这并不排除该算法从环境中接收新随机位的方案。例如,可能有一个PRNG读取其初始状态 lazily ,但保证不会重复。
However this does not rule out schemes where the algorithm receives new random bits from the environment as it goes. There might be, say, a PRNG that reads its initial state lazily and yet is guaranteed not to repeat. Can we prove there isn't?
假设我们确实有一个完美的改组算法。想象我们开始运行它,当它完成一半时,我们使计算机进入睡眠状态。现在,程序的完整状态已保存在某处。假设 S 为程序可能处于此中间标记的所有可能状态的集合。
Suppose we do have a perfect shuffling algorithm. Imagine we start running it, and when it's halfway done, we put the computer to sleep. Now the full state of the program has been saved somewhere. Let S be the set of all possible states the program could be in at this halfway mark.
由于算法正确且可以保证终止,有一个函数 f ,在给定保存的程序状态以及足够长的位字符串的情况下,该函数会产生有效的磁盘读取和写入序列,从而完成随机播放。计算机本身实现此功能。而是将其视为数学函数:
Since the algorithm is correct and guaranteed to terminate, there is a function f which, given the saved program state plus any long enough string of bits, produces a valid sequence of disk reads and writes completing the shuffle. The computer itself implements this function. But consider it as a mathematical function:
f :(S× bits)→读取和写入的顺序
然后,简单地,存在一个函数 g ,仅给出 已保存的程序状态,将产生一组尚待读取和写入的磁盘位置。 (只需将一些任意的字符串传递给 f ,然后查看结果。)
Then, trivially, there exists a function g which, given only the saved program state, produces the set of disk locations yet to be read and written. (Simply pass some arbitrary string of bits to f, then look at the results.)
g : S → 要读写的位置集
证明的其余部分是要显示 g 的域不管选择哪种算法,至少包含 C 个不同的集合。如果是这样,那么必须至少有 S 个元素,因此程序的状态必须至少包含 log C 位在中间,违反要求。
The remaining bit of the proof is to show that the domain of g contains at least C different sets regardless of the choice of algorithm. If that's true, there must be at least that many elements of S, and so the state of the program must contain at least log C bits at the halfway mark, in violation of the requirements.
我不确定但是,如何证明最后一点,因为 要读取的位置集或要写入的位置集可能很低,熵,取决于算法。我怀疑信息理论中有一些显而易见的原则可以打破常规。将这个社区Wiki标记为希望有人提供它。
I'm not sure how to prove that last bit, though, since either the set-of-locations-to-read or the set-of-locations-to-write can be low-entropy, depending on the algorithm. I suspect there's some obvious principle of information theory that can cut the knot. Marking this community wiki in the hopes someone will supply it.
这篇关于用于在现场并使用O(1)内存打印混洗的列表的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!