问题描述
序言
此问题与(P)RNG和rand()
的行为无关.这是关于使用取反模数均匀分布的两个值的幂.
This question is not about the behavior of (P)RNG and rand()
. It's about using power of two values uniformly distributed against modulo.
简介
我知道一个人不应该使用模%
将一个范围内的值转换为另一个,例如,从rand()
函数中获取一个介于0和5之间的值:会有偏差.此处说明 https://bitbucket.org/haypo /hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default ,在此答案中
I knew that one should not use modulo %
to convert a value from a range to another, for example to get a value between 0 and 5 from the rand()
function: there will be a bias. It's explained here https://bitbucket.org/haypo/hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default and in this answer Why do people say there is modulo bias when using a random number generator?
但是今天,在调查了一些看起来错误的代码之后,我制作了一个工具来演示模的行为: https://gitorious.org/modulo-test/modulo-test/trees/master ,发现这还不够清楚.
But today after investigating some code which was looking wrong, I've made a tool to demonstrate the behavor of modulo: https://gitorious.org/modulo-test/modulo-test/trees/master and found that's not clear enough.
骰子只有3位
我检查了0..5范围内的6个值.只需3位即可对这些值进行编码.
I checked with 6 values in range 0..5. Only 3 bits are needed to code those values.
$ ./modulo-test 10000 6 3
interations = 10000, range = 6, bits = 3 (0x00000007)
[0..7] => [0..5]
theorical occurences 1666.67 probability 0.16666667
[ 0] occurences 2446 probability 0.24460000 ( +46.76%)
[ 1] occurences 2535 probability 0.25350000 ( +52.10%)
[ 2] occurences 1275 probability 0.12750000 ( -23.50%)
[ 3] occurences 1297 probability 0.12970000 ( -22.18%)
[ 4] occurences 1216 probability 0.12160000 ( -27.04%)
[ 5] occurences 1231 probability 0.12310000 ( -26.14%)
minimum occurences 1216.00 probability 0.12160000 ( -27.04%)
maximum occurences 2535.00 probability 0.25350000 ( +52.10%)
mean occurences 1666.67 probability 0.16666667 ( +0.00%)
stddev occurences 639.43 probability 0.06394256 ( 38.37%)
使用3位输入,结果确实很糟糕,但是表现出预期的效果.查看答案 https://stackoverflow.com/a/14614899/611560
With 3 bits of input, the results are indeed awful, but behave as expected. See answer https://stackoverflow.com/a/14614899/611560
增加输入位数
令我感到困惑的是,增加输入位数使结果有所不同.您不应忘记增加迭代次数,例如样本数,否则结果可能是错误的(请参阅错误的统计信息).
What puzzled me, was increasing the number of input bits made the results different.You should not forgot to increase the number of iterations, eg the number of sample otherwise the results are likely wrong (see Wrong Statistics).
让我们尝试4位:
$ ./modulo-test 20000 6 4
interations = 20000, range = 6, bits = 4 (0x0000000f)
[0..15] => [0..5]
theorical occurences 3333.33 probability 0.16666667
[ 0] occurences 3728 probability 0.18640000 ( +11.84%)
[ 1] occurences 3763 probability 0.18815000 ( +12.89%)
[ 2] occurences 3675 probability 0.18375000 ( +10.25%)
[ 3] occurences 3721 probability 0.18605000 ( +11.63%)
[ 4] occurences 2573 probability 0.12865000 ( -22.81%)
[ 5] occurences 2540 probability 0.12700000 ( -23.80%)
minimum occurences 2540.00 probability 0.12700000 ( -23.80%)
maximum occurences 3763.00 probability 0.18815000 ( +12.89%)
mean occurences 3333.33 probability 0.16666667 ( +0.00%)
stddev occurences 602.48 probability 0.03012376 ( 18.07%)
让我们尝试5位:
$ ./modulo-test 40000 6 5
interations = 40000, range = 6, bits = 5 (0x0000001f)
[0..31] => [0..5]
theorical occurences 6666.67 probability 0.16666667
[ 0] occurences 7462 probability 0.18655000 ( +11.93%)
[ 1] occurences 7444 probability 0.18610000 ( +11.66%)
[ 2] occurences 6318 probability 0.15795000 ( -5.23%)
[ 3] occurences 6265 probability 0.15662500 ( -6.03%)
[ 4] occurences 6334 probability 0.15835000 ( -4.99%)
[ 5] occurences 6177 probability 0.15442500 ( -7.34%)
minimum occurences 6177.00 probability 0.15442500 ( -7.34%)
maximum occurences 7462.00 probability 0.18655000 ( +11.93%)
mean occurences 6666.67 probability 0.16666667 ( +0.00%)
stddev occurences 611.58 probability 0.01528949 ( 9.17%)
让我们尝试使用6位:
$ ./modulo-test 80000 6 6
interations = 80000, range = 6, bits = 6 (0x0000003f)
[0..63] => [0..5]
theorical occurences 13333.33 probability 0.16666667
[ 0] occurences 13741 probability 0.17176250 ( +3.06%)
[ 1] occurences 13610 probability 0.17012500 ( +2.08%)
[ 2] occurences 13890 probability 0.17362500 ( +4.18%)
[ 3] occurences 13702 probability 0.17127500 ( +2.77%)
[ 4] occurences 12492 probability 0.15615000 ( -6.31%)
[ 5] occurences 12565 probability 0.15706250 ( -5.76%)
minimum occurences 12492.00 probability 0.15615000 ( -6.31%)
maximum occurences 13890.00 probability 0.17362500 ( +4.18%)
mean occurences 13333.33 probability 0.16666667 ( +0.00%)
stddev occurences 630.35 probability 0.00787938 ( 4.73%)
问题
请解释一下为什么更改输入位(并相应增加采样数)时结果不同?这些背后的数学原因是什么?
Please explain me why the results are different when changing the input bits (and increasing the sample count accordingly) ? What is the mathematical reasoning behind these ?
错误的统计信息
在先前版本的问题中,我展示了一个具有32位输入和1000000次迭代(例如10 ^ 6个样本)的测试,并对获得正确的结果感到惊讶.我感到羞耻的是,它错了:必须要有N倍的样本才能有信心获得生成器的所有2 ^ 32值.在这里10 ^ 6是将小数位数压缩为2 ^ 32的方法. 向能够用数学/统计语言进行解释的人提供奖励..
In the previous version of the question, I showed a test with 32bits of input and only 1000000 iterations, eg 10^6 samples, and said I was surprised to get correct results.It was so wrong I'm ashamed of: there must be N times more samples to have confidence to get all 2^32 values of the generator. Here 10^6 is way to small compaired to 2^32. Bonus for people able to explain this in mathematical/statistical language..
错误的结果:
$ ./modulo-test 1000000 6 32
interations = 1000000, range = 6, bits = 32 (0xffffffff)
[0..4294967295] => [0..5]
theorical occurences 166666.67 probability 0.16666667
[ 0] occurences 166881 probability 0.16688100 ( +0.13%)
[ 1] occurences 166881 probability 0.16688100 ( +0.13%)
[ 2] occurences 166487 probability 0.16648700 ( -0.11%)
[ 3] occurences 166484 probability 0.16648400 ( -0.11%)
[ 4] occurences 166750 probability 0.16675000 ( +0.05%)
[ 5] occurences 166517 probability 0.16651700 ( -0.09%)
minimum occurences 166484.00 probability 0.16648400 ( -0.11%)
maximum occurences 166881.00 probability 0.16688100 ( +0.13%)
mean occurences 166666.67 probability 0.16666667 ( +0.00%)
stddev occurences 193.32 probability 0.00019332 ( 0.12%)
我仍然必须阅读并重新阅读 Zed Shaw的出色文章 程序员需要学习统计信息还是我会杀死他们全部"..
I still have to read and re-read the excellent article of Zed Shaw "Programmers Need To Learn Statistics Or I Will Kill Them All".
推荐答案
从本质上讲,您正在这样做:
In essence, you're doing:
(rand() & 7) % 6
让我们假设rand()
均匀地分布在[0; RAND_MAX]
上,并且RAND_MAX+1
是2的幂.显然,rand() & 7
可以评估为0
,1
,...,7
,并且结果是等概率的.
Let's assume that rand()
is uniformly distributed on [0; RAND_MAX]
, and that RAND_MAX+1
is a power of two. It is clear that rand() & 7
can evaluate to 0
, 1
, ..., 7
, and that the outcomes are equiprobable.
现在,让我们看一下对6
取模的结果时会发生什么.
Now let's look at what happens when you take the result modulo 6
.
- 0和6映射为0;
- 1和7映射为1;
- 2映射为2;
- 3映射为3;
- 4映射到4;
- 5映射为5.
这说明了为什么零和一的数目是其他数字的两倍.
This explains why you get twice as many zeroes and ones as you get the other numbers.
在第二种情况下,同样的事情正在发生.但是,额外"数字的值要小得多,使得它们的贡献与噪声无法区分.
The same thing is happening in the second case. However, the value of the "extra" numbers is much smaller, making their contribution indistinguishable from noise.
总而言之,如果您有一个整数均匀分布在[0
; M-1
],然后将其取模为N
,除非M
可被N
整除,否则结果将偏向零.
To summarize, if you have an integer uniformly distributed on [0
; M-1
], and you take it modulo N
, the result will be biased towards zero unless M
is divisible by N
.
这篇关于模态行为背后的数学的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!