本文介绍了模态行为背后的数学的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

序言

此问题与(P)RNG和rand()的行为无关.这是关于使用取反模数均匀分布的两个值的幂.

This question is not about the behavior of (P)RNG and rand(). It's about using power of two values uniformly distributed against modulo.

简介

我知道一个人不应该使用模%将一个范围内的值转换为另一个,例如,从rand()函数中获取一个介于0和5之间的值:会有偏差.此处说明 https://bitbucket.org/haypo /hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default ,在此答案中

I knew that one should not use modulo % to convert a value from a range to another, for example to get a value between 0 and 5 from the rand() function: there will be a bias. It's explained here https://bitbucket.org/haypo/hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default and in this answer Why do people say there is modulo bias when using a random number generator?

但是今天,在调查了一些看起来错误的代码之后,我制作了一个工具来演示模的行为: https://gitorious.org/modulo-test/modulo-test/trees/master ,发现这还不够清楚.

But today after investigating some code which was looking wrong, I've made a tool to demonstrate the behavor of modulo: https://gitorious.org/modulo-test/modulo-test/trees/master and found that's not clear enough.

骰子只有3位

我检查了0..5范围内的6个值.只需3位即可对这些值进行编码.

I checked with 6 values in range 0..5. Only 3 bits are needed to code those values.

$ ./modulo-test 10000 6 3
interations = 10000, range = 6, bits = 3 (0x00000007)
  [0..7] => [0..5]

theorical occurences    1666.67 probability 0.16666667

   [   0] occurences    2446    probability 0.24460000 ( +46.76%)
   [   1] occurences    2535    probability 0.25350000 ( +52.10%)
   [   2] occurences    1275    probability 0.12750000 ( -23.50%)
   [   3] occurences    1297    probability 0.12970000 ( -22.18%)
   [   4] occurences    1216    probability 0.12160000 ( -27.04%)
   [   5] occurences    1231    probability 0.12310000 ( -26.14%)

  minimum occurences    1216.00 probability 0.12160000 ( -27.04%)
  maximum occurences    2535.00 probability 0.25350000 ( +52.10%)
     mean occurences    1666.67 probability 0.16666667 (  +0.00%)
   stddev occurences     639.43 probability 0.06394256 (  38.37%)

使用3位输入,结果确实很糟糕,但是表现出预期的效果.查看答案 https://stackoverflow.com/a/14614899/611560

With 3 bits of input, the results are indeed awful, but behave as expected. See answer https://stackoverflow.com/a/14614899/611560

增加输入位数

令我感到困惑的是,增加输入位数使结果有所不同.您不应忘记增加迭代次数,例如样本数,否则结果可能是错误的(请参阅错误的统计信息).

What puzzled me, was increasing the number of input bits made the results different.You should not forgot to increase the number of iterations, eg the number of sample otherwise the results are likely wrong (see Wrong Statistics).

让我们尝试4位:

$ ./modulo-test 20000 6 4
interations = 20000, range = 6, bits = 4 (0x0000000f)
  [0..15] => [0..5]

theorical occurences    3333.33 probability 0.16666667

   [   0] occurences    3728    probability 0.18640000 ( +11.84%)
   [   1] occurences    3763    probability 0.18815000 ( +12.89%)
   [   2] occurences    3675    probability 0.18375000 ( +10.25%)
   [   3] occurences    3721    probability 0.18605000 ( +11.63%)
   [   4] occurences    2573    probability 0.12865000 ( -22.81%)
   [   5] occurences    2540    probability 0.12700000 ( -23.80%)

  minimum occurences    2540.00 probability 0.12700000 ( -23.80%)
  maximum occurences    3763.00 probability 0.18815000 ( +12.89%)
     mean occurences    3333.33 probability 0.16666667 (  +0.00%)
   stddev occurences     602.48 probability 0.03012376 (  18.07%)

让我们尝试5位:

$ ./modulo-test 40000 6 5
interations = 40000, range = 6, bits = 5 (0x0000001f)
  [0..31] => [0..5]

theorical occurences    6666.67 probability 0.16666667

   [   0] occurences    7462    probability 0.18655000 ( +11.93%)
   [   1] occurences    7444    probability 0.18610000 ( +11.66%)
   [   2] occurences    6318    probability 0.15795000 (  -5.23%)
   [   3] occurences    6265    probability 0.15662500 (  -6.03%)
   [   4] occurences    6334    probability 0.15835000 (  -4.99%)
   [   5] occurences    6177    probability 0.15442500 (  -7.34%)

  minimum occurences    6177.00 probability 0.15442500 (  -7.34%)
  maximum occurences    7462.00 probability 0.18655000 ( +11.93%)
     mean occurences    6666.67 probability 0.16666667 (  +0.00%)
   stddev occurences     611.58 probability 0.01528949 (   9.17%)

让我们尝试使用6位:

$ ./modulo-test 80000 6 6
interations = 80000, range = 6, bits = 6 (0x0000003f)
  [0..63] => [0..5]

theorical occurences   13333.33 probability 0.16666667

   [   0] occurences   13741    probability 0.17176250 (  +3.06%)
   [   1] occurences   13610    probability 0.17012500 (  +2.08%)
   [   2] occurences   13890    probability 0.17362500 (  +4.18%)
   [   3] occurences   13702    probability 0.17127500 (  +2.77%)
   [   4] occurences   12492    probability 0.15615000 (  -6.31%)
   [   5] occurences   12565    probability 0.15706250 (  -5.76%)

  minimum occurences   12492.00 probability 0.15615000 (  -6.31%)
  maximum occurences   13890.00 probability 0.17362500 (  +4.18%)
     mean occurences   13333.33 probability 0.16666667 (  +0.00%)
   stddev occurences     630.35 probability 0.00787938 (   4.73%)

问题

请解释一下为什么更改输入位(并相应增加采样数)时结果不同?这些背后的数学原因是什么?

Please explain me why the results are different when changing the input bits (and increasing the sample count accordingly) ? What is the mathematical reasoning behind these ?

错误的统计信息

在先前版本的问题中,我展示了一个具有32位输入和1000000次迭代(例如10 ^ 6个样本)的测试,并对获得正确的结果感到惊讶.我感到羞耻的是,它错了:必须要有N倍的样本才能有信心获得生成器的所有2 ^ 32值.在这里10 ^ 6是将小数位数压缩为2 ^ 32的方法. 向能够用数学/统计语言进行解释的人提供奖励..

In the previous version of the question, I showed a test with 32bits of input and only 1000000 iterations, eg 10^6 samples, and said I was surprised to get correct results.It was so wrong I'm ashamed of: there must be N times more samples to have confidence to get all 2^32 values of the generator. Here 10^6 is way to small compaired to 2^32. Bonus for people able to explain this in mathematical/statistical language..

错误的结果:

$ ./modulo-test 1000000 6 32
interations = 1000000, range = 6, bits = 32 (0xffffffff)
  [0..4294967295] => [0..5]

theorical occurences  166666.67 probability 0.16666667

   [   0] occurences  166881    probability 0.16688100 (  +0.13%)
   [   1] occurences  166881    probability 0.16688100 (  +0.13%)
   [   2] occurences  166487    probability 0.16648700 (  -0.11%)
   [   3] occurences  166484    probability 0.16648400 (  -0.11%)
   [   4] occurences  166750    probability 0.16675000 (  +0.05%)
   [   5] occurences  166517    probability 0.16651700 (  -0.09%)

  minimum occurences  166484.00 probability 0.16648400 (  -0.11%)
  maximum occurences  166881.00 probability 0.16688100 (  +0.13%)
     mean occurences  166666.67 probability 0.16666667 (  +0.00%)
   stddev occurences     193.32 probability 0.00019332 (   0.12%)

我仍然必须阅读并重新阅读 Zed Shaw的出色文章 程序员需要学习统计信息还是我会杀死他们全部"..

I still have to read and re-read the excellent article of Zed Shaw "Programmers Need To Learn Statistics Or I Will Kill Them All".

推荐答案

从本质上讲,您正在这样做:

In essence, you're doing:

(rand() & 7) % 6

让我们假设rand()均匀地分布在[0; RAND_MAX]上,并且RAND_MAX+1是2的幂.显然,rand() & 7可以评估为01,...,7,并且结果是等概率的.

Let's assume that rand() is uniformly distributed on [0; RAND_MAX], and that RAND_MAX+1 is a power of two. It is clear that rand() & 7 can evaluate to 0, 1, ..., 7, and that the outcomes are equiprobable.

现在,让我们看一下对6取模的结果时会发生什么.

Now let's look at what happens when you take the result modulo 6.

  • 0和6映射为0;
  • 1和7映射为1;
  • 2映射为2;
  • 3映射为3;
  • 4映射到4;
  • 5映射为5.

这说明了为什么零和一的数目是其他数字的两倍.

This explains why you get twice as many zeroes and ones as you get the other numbers.

在第二种情况下,同样的事情正在发生.但是,额外"数字的值要小得多,使得它们的贡献与噪声无法区分.

The same thing is happening in the second case. However, the value of the "extra" numbers is much smaller, making their contribution indistinguishable from noise.

总而言之,如果您有一个整数均匀分布在[0; M-1],然后将其取模为N,除非M可被N整除,否则结果将偏向零.

To summarize, if you have an integer uniformly distributed on [0; M-1], and you take it modulo N, the result will be biased towards zero unless M is divisible by N.

这篇关于模态行为背后的数学的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-23 14:35