问题描述
我需要从一组 N
(其中个)中随机选择
)。 n
个记录0< n< N
I need to select n
records at random from a set of N
(where 0 < n < N
).
可能的算法是:
因此,如果您有40个项目,则第一个被选中的可能性为 5/40
。
So if you had 40 items, the first would have a 5/40
chance of being selected.
如果是,则下一个概率为 4/39
,否则,概率为 5/39 $机会。到最后
时,您将拥有5件商品,并且在此之前通常都拥有它们。
If it is, the next has a 4/39
chance, otherwise it has a 5/39
chance. By the time you get to the end you will have your 5 items, and often you'll have all of them before that.
假设一个好的伪随机数生成器,这个算法正确吗?
Assuming a good pseudo-random number generator, is this algorithm correct?
注意
关于stackoverflow的此类问题很多(很多问题被标记为的答案)和
总是受到质疑(例如,请参阅
,,, ...)。
The above algorithm is often proposed (e.g. Kyle Cronin's answer) andit's always questioned (e.g. seehere, here, here, here...).
我可以对这件事有最后的决定吗?
Can I have a final word about the matter?
推荐答案
算法绝对正确。
这不是好的海报的突然发明,而是一种称为选择抽样 / 算法S 的众所周知的技术(由Fan,Muller和Rezucha发现(1)并由Jones(2)在1962年独立提出),在-第2卷-Seminumerical算法-第3.4.2。节
It's not the sudden invention of a good poster, it's a well known technique called Selection sampling / Algorithm S (discovered by Fan, Muller and Rezucha (1) and independently by Jones (2) in 1962), well described in TAOCP - Volume 2 - Seminumerical Algorithms - § 3.4.2.
正如努斯所说:
算法对 n
N 和 t + 1
st的一组元素中选择>元素的概率为(n-m)/(N-t)
,当已经选择 m
个元素时。
The algorithm samples n
elements from a set of size N
and the t + 1
st element is chosen with probability (n - m) / (N - t)
, when already m
elements have been chosen.
很容易看出,在选择 n
个项目之前,我们从来没有跑过集的末尾(因为概率为 1
,当我们有 k
个元素要从其余的 k
个元素中选择时)。
It's easy to see that we never run off the end of the set before choosing n
items (as the probability will be 1
when we have k
elements to choose from the remaining k
elements).
我们也永远不会选择太多元素( n == m的概率将是
)。 0
Also we never pick too many elements (the probability will be 0
as soon n == m
).
要证明样本是完全无偏的要难一点,但这是
It's a bit harder to demonstrate that the sample is completely unbiased, but it's
(所以不仅仅是在Stackoverflow上)
(so not just on Stackoverflow!)
事实是我们不应该混淆条件概率和无条件概率:
The fact is we should not confuse conditional and unconditional probabilities:
选择第二个元素的总概率为(n / N)((n-1)/(N-1))+(1-n / N)(n /(N-1) )= n / N
。
The overall probability of selecting the second element is (n / N) ((n - 1) / (N - 1)) + (1 - n/N)(n / (N - 1)) = n/N
.
除了理论上的考虑之外, Algorithm S (和 algorithm R > / )已在许多知名图书馆中使用(例如,,
在P中ython ...)。
Apart from theoretical considerations, Algorithm S (and algorithm R / reservoir sampling) is used in many well known libraries (e.g. SGI's original STL implementation, std::experimental::sample
,random.sample
in Python...).
当然,算法S并非总是最佳答案:
Of course algorithm S is not always the best answer:
- 它是
O(N)
(即使我们通常不必跳过所有N
条记录:n = 2
大约为2/3 N
时考虑的平均记录数;常规公式在
TAOCP-第2卷-第3.4.2节-ex 5/6中给出; - 不能使用。
- it's
O(N)
(even if we will usually not have to pass over allN
records: the average number of records considered whenn=2
is about2/3 N
; the general formulas are given inTAOCP - Vol 2 - § 3.4.2 - ex 5/6); - it cannot be used when the value of
N
isn't known in advance.
无论如何!
- C。 T. Fan,M。E. Muller和I. Rezucha,J。Amer。统计副会长57(1962),第387-402页
- T。 G.Jones,CACM 5(1962),第343页
编辑
[CUT]
在极少数情况下,当您需要5个元素时甚至可以选择4个或6个元素
In rare cases, you might even pick 4 or 6 elements when you wanted 5
这来自(为避免通用接口/标签分发而作的小修改):
This is from N3925 (small modifications to avoid the common interface / tag dispatch):
template<class PopIter, class SampleIter, class Size, class URNG>
SampleIter sample(PopIter first, PopIter last, SampleIter out, Size n, URNG &&g)
{
using dist_t = uniform_int_distribution<Size>;
using param_t = typename dist_t::param_type;
dist_t d{};
Size unsampled_sz = distance(first, last);
for (n = min(n, unsampled_sz); n != 0; ++first)
{
param_t const p{0, --unsampled_sz};
if (d(g, p) < n) { *out++ = *first; --n; }
}
return out;
}
没有浮点数。
- 如果需要5个元素,您将获得5个元素;
- 如果
uniform_int_distribution
,没有偏见。
- If you need 5 elements you get 5 elements;
- if
uniform_int_distribution
"works as advertised" there is no bias.
这篇关于从N个集合中随机选择n个记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!