Uniform distribution of integers using a floating-point source

Question

The standard way to get a random integer in the range [0, n) in JavaScript - or any other language that only offers a random() function that returns a float in the range [0,1) - is to use Math.floor(Math.random() * n).

Now, the math behind this is trivial, assuming we're operating on the set of rational numbers. The question is: with all the complications of IEEE-754 floating-point numbers, is the resulting distribution actually uniform?

Considering that the gap between one floating-point number and the next higher one increases as they grow larger, I would think that this should introduce some kind of bias towards smaller numbers.

Recommended answer

No, the resulting distribution is not going to be perfectly uniform, for most values of n. For small values, it'll be so close to uniform that you'd have a hard time detecting any difference from a uniform distribution, but as n gets larger the bias can become noticeable.

To illustrate, here's some Python code (not JavaScript, sorry, but the principle is the same):

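from collections import Counter
from random import random

def badrand(n):
    # Naive approach: scale a float in [0, 1) by n and truncate to an integer.
    return int(random() * n)

# 6755399441055744 == 3 * 2**51, i.e. exactly 3/4 of 2**53.
print(Counter(badrand(6755399441055744) % 3 for _ in range(10000000)))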

This is producing 10 million random integers in the range [0, 6755399441055744), reducing each of those integers modulo 3, and counting the number of times the remainder is 0, 1, or 2. If we're generating those integers uniformly, we'd expect the remainders modulo 3 to be roughly evenly distributed, so we'd expect the counts to be similar.

Here's an example result from running this on my machine:

Counter({1: 3751915, 0: 3334643, 2: 2913442})

That is, a remainder of 1 is significantly more likely to occur than 0, which in turn is significantly more likely to occur than a remainder of 2. The differences here are way too big to be explained by random variation.

So what went wrong? Python's random() function is relatively high quality, based on the Mersenne Twister, so we're unlikely to be seeing statistical problems resulting from the base random number generator. What's happening is that random() generates one of 2^53 (roughly) equally likely outcomes: each outcome is a number of the form x / 2^53 for some integer x in the range [0, 2^53). In the badrand call, we're effectively mapping those outcomes to 6755399441055744 possible outputs. That value wasn't chosen at random (ha!); it's exactly 3/4 of 2^53. Since there are 4/3 as many inputs as outputs, the mapping can't be even: under the most uniform distribution possible, 2/3 of the possible badrand output values are hit by exactly one of the 2^53 possible random() output values, while the other 1/3 are hit by two. That is, some of the potential outputs are twice as likely to occur as others. So we're a long way from uniform.
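The counting argument can be checked by brute force on a scaled-down version of the problem. The sketch below is only an illustration under stated assumptions: it swaps the real 2^53-outcome source for a hypothetical toy source with 2^4 = 16 outcomes, picks n = 12 (3/4 of 16), and uses exact integer arithmetic, so it ignores the rounding that the real floating-point multiplication adds. The helper name preimage_sizes is made up for this example.

```python
from collections import Counter

def preimage_sizes(n, bits):
    """For each output k in [0, n), count the source values x in
    [0, 2**bits) with floor(x * n / 2**bits) == k (exact integer math)."""
    hits = Counter((x * n) >> bits for x in range(1 << bits))
    return [hits[k] for k in range(n)]

# Hypothetical toy source: 16 equally likely outcomes, n = 12 = 3/4 of 16.
sizes = preimage_sizes(12, 4)
print(sizes)           # how many source values land on each output
print(Counter(sizes))  # how many outputs are hit once vs. twice
```

In this toy case 8 of the 12 outputs (2/3) are hit by one source value and 4 (1/3) by two, which matches the proportions described above.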

You're going to see the same effect in JavaScript. In the case of Chrome, it appears that there are only 2^32 distinct results from Math.random(), so you should be able to find effects like the above with n smaller than (but close to) 2^32.

Of course, the same effect holds for small n, too: if n = 5, then because 5 is not a divisor of 2^32 there's no way we can perfectly evenly distribute all 2^32 possible Math.random() results between the 5 desired outcomes: the best we can hope for is that 4 of the 5 outcomes appear for 858993459 of the possible random() results each, while the fifth occurs for 858993460 of the random() results. But that distribution is going to be so close to uniform that it would be well-nigh impossible to find any statistical test to tell you differently. So for practical purposes, you should be safe with small n.
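The n = 5 figures can be reproduced without looping over all 2^32 values. This is a minimal sketch, assuming an idealized source that returns x / 2^32 for a uniformly chosen integer x and an exact multiplication; the function name preimage_counts is hypothetical. Output k is produced by every x from ceil(k * 2^32 / n) up to, but not including, ceil((k + 1) * 2^32 / n).

```python
def preimage_counts(n, bits):
    """Number of source values x in [0, 2**bits) that map to each output
    k in [0, n) under floor(x * n / 2**bits), computed in closed form."""
    total = 1 << bits

    def first_x(k):
        # Smallest x whose output is at least k: ceil(k * total / n).
        return -(-k * total // n)

    return [first_x(k + 1) - first_x(k) for k in range(n)]

counts = preimage_counts(5, 32)
print(counts)
# Relative gap between the most and least likely outputs.
print((max(counts) - min(counts)) / min(counts))
```

One output gets 858993460 source values and the other four get 858993459 each, so the relative difference is on the order of one part in a billion.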

There's a related Python bug that might be interesting at http://bugs.python.org/issue9025. That bug was solved for Python 3 by moving away from the int(random() * n) method of computing these numbers. The bug still remains in Python 2, though.
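One common way to avoid the int(random() * n) mapping altogether is rejection sampling on raw random bits. The sketch below illustrates that technique using the standard library's random.getrandbits. It is not copied from the Python 3 source, and the name randbelow is chosen for this example only.

```python
from random import getrandbits

def randbelow(n):
    """Return a uniformly distributed integer in [0, n), for n >= 1."""
    k = n.bit_length()   # smallest k with 2**k > n
    r = getrandbits(k)   # uniform over [0, 2**k)
    while r >= n:        # discard out-of-range draws and try again
        r = getrandbits(k)
    return r

print([randbelow(5) for _ in range(10)])
```

Because every accepted value comes from the same uniform draw over [0, 2^k), each of the n outputs is equally likely, at the cost of an occasional retry (fewer than two draws per call on average).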
