问题描述
序言
这个问题与 (P)RNG 和 rand()
的行为无关.这是关于使用对模均匀分布的两个值的幂.
This question is not about the behavior of (P)RNG and rand()
. It's about using power of two values uniformly distributed against modulo.
简介
我知道不应该使用模 %
将值从一个范围转换为另一个值,例如从 rand()
功能:会有偏差.这里解释了 https://bitbucket.org/haypo/hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default 和在这个答案 为什么人们说使用随机数生成器时存在模偏差?
I knew that one should not use modulo %
to convert a value from a range to another, for example to get a value between 0 and 5 from the rand()
function: there will be a bias. It's explained here https://bitbucket.org/haypo/hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default and in this answer Why do people say there is modulo bias when using a random number generator?
但是今天在调查了一些看起来有问题的代码之后,我制作了一个工具来演示模的行为:https://gitorious.org/modulo-test/modulo-test/trees/master 发现还不够清楚.
But today after investigating some code which was looking wrong, I've made a tool to demonstrate the behavor of modulo: https://gitorious.org/modulo-test/modulo-test/trees/master and found that's not clear enough.
骰子只有 3 位
我检查了 0..5 范围内的 6 个值.只需要 3 位就可以对这些值进行编码.
I checked with 6 values in range 0..5. Only 3 bits are needed to code those values.
$ ./modulo-test 10000 6 3
interations = 10000, range = 6, bits = 3 (0x00000007)
[0..7] => [0..5]
theorical occurences 1666.67 probability 0.16666667
[ 0] occurences 2446 probability 0.24460000 ( +46.76%)
[ 1] occurences 2535 probability 0.25350000 ( +52.10%)
[ 2] occurences 1275 probability 0.12750000 ( -23.50%)
[ 3] occurences 1297 probability 0.12970000 ( -22.18%)
[ 4] occurences 1216 probability 0.12160000 ( -27.04%)
[ 5] occurences 1231 probability 0.12310000 ( -26.14%)
minimum occurences 1216.00 probability 0.12160000 ( -27.04%)
maximum occurences 2535.00 probability 0.25350000 ( +52.10%)
mean occurences 1666.67 probability 0.16666667 ( +0.00%)
stddev occurences 639.43 probability 0.06394256 ( 38.37%)
使用 3 位输入,结果确实很糟糕,但表现符合预期.查看答案 https://stackoverflow.com/a/14614899/611560
With 3 bits of input, the results are indeed awful, but behave as expected. See answer https://stackoverflow.com/a/14614899/611560
增加输入位数
令我困惑的是,增加输入位数会使结果不同.您不应忘记增加迭代次数,例如样本数,否则结果可能是错误的(请参阅错误统计).
What puzzled me, was increasing the number of input bits made the results different.You should not forgot to increase the number of iterations, eg the number of sample otherwise the results are likely wrong (see Wrong Statistics).
让我们尝试 4 位:
$ ./modulo-test 20000 6 4
interations = 20000, range = 6, bits = 4 (0x0000000f)
[0..15] => [0..5]
theorical occurences 3333.33 probability 0.16666667
[ 0] occurences 3728 probability 0.18640000 ( +11.84%)
[ 1] occurences 3763 probability 0.18815000 ( +12.89%)
[ 2] occurences 3675 probability 0.18375000 ( +10.25%)
[ 3] occurences 3721 probability 0.18605000 ( +11.63%)
[ 4] occurences 2573 probability 0.12865000 ( -22.81%)
[ 5] occurences 2540 probability 0.12700000 ( -23.80%)
minimum occurences 2540.00 probability 0.12700000 ( -23.80%)
maximum occurences 3763.00 probability 0.18815000 ( +12.89%)
mean occurences 3333.33 probability 0.16666667 ( +0.00%)
stddev occurences 602.48 probability 0.03012376 ( 18.07%)
让我们尝试 5 位:
$ ./modulo-test 40000 6 5
interations = 40000, range = 6, bits = 5 (0x0000001f)
[0..31] => [0..5]
theorical occurences 6666.67 probability 0.16666667
[ 0] occurences 7462 probability 0.18655000 ( +11.93%)
[ 1] occurences 7444 probability 0.18610000 ( +11.66%)
[ 2] occurences 6318 probability 0.15795000 ( -5.23%)
[ 3] occurences 6265 probability 0.15662500 ( -6.03%)
[ 4] occurences 6334 probability 0.15835000 ( -4.99%)
[ 5] occurences 6177 probability 0.15442500 ( -7.34%)
minimum occurences 6177.00 probability 0.15442500 ( -7.34%)
maximum occurences 7462.00 probability 0.18655000 ( +11.93%)
mean occurences 6666.67 probability 0.16666667 ( +0.00%)
stddev occurences 611.58 probability 0.01528949 ( 9.17%)
让我们试试 6 位:
$ ./modulo-test 80000 6 6
interations = 80000, range = 6, bits = 6 (0x0000003f)
[0..63] => [0..5]
theorical occurences 13333.33 probability 0.16666667
[ 0] occurences 13741 probability 0.17176250 ( +3.06%)
[ 1] occurences 13610 probability 0.17012500 ( +2.08%)
[ 2] occurences 13890 probability 0.17362500 ( +4.18%)
[ 3] occurences 13702 probability 0.17127500 ( +2.77%)
[ 4] occurences 12492 probability 0.15615000 ( -6.31%)
[ 5] occurences 12565 probability 0.15706250 ( -5.76%)
minimum occurences 12492.00 probability 0.15615000 ( -6.31%)
maximum occurences 13890.00 probability 0.17362500 ( +4.18%)
mean occurences 13333.33 probability 0.16666667 ( +0.00%)
stddev occurences 630.35 probability 0.00787938 ( 4.73%)
问题
请解释为什么更改输入位(并相应地增加样本计数)时结果不同?这些背后的数学推理是什么?
Please explain me why the results are different when changing the input bits (and increasing the sample count accordingly) ? What is the mathematical reasoning behind these ?
错误的统计数据
在之前版本的问题中,我展示了一个输入为 32bits 并且只有 1000000 次迭代的测试,例如 10^6 个样本,并说我很惊讶得到正确的结果.我很惭愧这是错误的:必须有 N 倍的样本才能有信心获得生成器的所有 2^32 个值.这里 10^6 是与 2^32 相比较小的方法.能够用数学/统计语言解释这一点的人的奖励..
In the previous version of the question, I showed a test with 32bits of input and only 1000000 iterations, eg 10^6 samples, and said I was surprised to get correct results.It was so wrong I'm ashamed of: there must be N times more samples to have confidence to get all 2^32 values of the generator. Here 10^6 is way to small compaired to 2^32. Bonus for people able to explain this in mathematical/statistical language..
这里是错误的结果:
$ ./modulo-test 1000000 6 32
interations = 1000000, range = 6, bits = 32 (0xffffffff)
[0..4294967295] => [0..5]
theorical occurences 166666.67 probability 0.16666667
[ 0] occurences 166881 probability 0.16688100 ( +0.13%)
[ 1] occurences 166881 probability 0.16688100 ( +0.13%)
[ 2] occurences 166487 probability 0.16648700 ( -0.11%)
[ 3] occurences 166484 probability 0.16648400 ( -0.11%)
[ 4] occurences 166750 probability 0.16675000 ( +0.05%)
[ 5] occurences 166517 probability 0.16651700 ( -0.09%)
minimum occurences 166484.00 probability 0.16648400 ( -0.11%)
maximum occurences 166881.00 probability 0.16688100 ( +0.13%)
mean occurences 166666.67 probability 0.16666667 ( +0.00%)
stddev occurences 193.32 probability 0.00019332 ( 0.12%)
我还是要反复阅读Zed Shaw的优秀文章 "程序员需要学习统计数据,否则我将杀光他们".
推荐答案
本质上,你在做:
(rand() & 7) % 6
假设 rand()
均匀分布在 [0;RAND_MAX]
,并且 RAND_MAX+1
是 2 的幂.很明显 rand() &7
可以计算为 0
, 1
, ..., 7
,并且结果是等概率的.
Let's assume that rand()
is uniformly distributed on [0; RAND_MAX]
, and that RAND_MAX+1
is a power of two. It is clear that rand() & 7
can evaluate to 0
, 1
, ..., 7
, and that the outcomes are equiprobable.
现在让我们看看当你对结果取模 6
时会发生什么.
Now let's look at what happens when you take the result modulo 6
.
- 0 和 6 映射到 0;
- 1 和 7 映射到 1;
- 2 映射到 2;
- 3 映射到 3;
- 4 映射到 4;
- 5 映射到 5.
这就解释了为什么你得到的零和一是得到其他数字的两倍.
This explains why you get twice as many zeroes and ones as you get the other numbers.
同样的事情发生在第二种情况.然而,额外"数字的值要小得多,使得它们的贡献与噪声无法区分.
The same thing is happening in the second case. However, the value of the "extra" numbers is much smaller, making their contribution indistinguishable from noise.
总而言之,如果你有一个在 [0
上均匀分布的整数;M-1
],然后对 N
取模,结果将偏向于零,除非 M
可以被 N.
To summarize, if you have an integer uniformly distributed on [
0
; M-1
], and you take it modulo N
, the result will be biased towards zero unless M
is divisible by N
.
这篇关于模行为背后的数学的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!