C++ random non-repeating integers with weights

Problem description

I want to efficiently generate a random sample of unique (non-repeated) integers in a (closed) range [0, rnd_max], where every number in the range can be chosen, and each number is associated with a sample weight (the larger the weight, the more likely the number should be to be chosen, with probability exactly weight[i] / sum(weight[not_taken]) of being chosen next if it is not already taken in the sample).

I see C++ has std::discrete_distribution, which can generate random weighted integers, but if I use it to generate random integers and discard repeated ones, then when the sample to take is large relative to the length of the possible range, many draws will fail because the number is already taken, resulting in a highly inefficient procedure. It's not clear to me whether Floyd's algorithm (https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin) has some extension to the case with sample weights - I personally cannot think of one.

It's also possible to, e.g., use std::discrete_distribution while dropping the weight of each chosen number to zero, or to perform a partial weighted shuffle as in this answer: C++. Weighted std::shuffle - but in that answer, std::discrete_distribution is re-generated at each iteration, so the running time becomes quadratic (it needs to cycle through all the weights passed to it every time).

I am wondering what an efficient weighted random sampler for unique integers in C++ would look like, one that works well for varying sample sizes (e.g. sampling from 1% up to 90% of the numbers in the available range).

#include <vector>
#include <random>
#include <algorithm>

int main()
{
    size_t rnd_max = 1e5;
    size_t ntake = 1e3;

    // Seeded RNG and i.i.d. Gamma(1, 1) weights, one per number in the range.
    unsigned int seed = 12345;
    std::mt19937 rng(seed);
    std::gamma_distribution<double> rgamma(1.0, 1.0);
    std::vector<double> weights(rnd_max + 1); // closed range [0, rnd_max] has rnd_max + 1 numbers
    for (double &w : weights) w = rgamma(rng);

    std::vector<int> chosen_sample(ntake);
    // sampler goes here...

    return 0;
}

Answer

There is a nice way to solve this problem using augmented binary search trees. It gives an O(k log n)-time algorithm for sampling k elements at random.

The idea goes like this. Let's imagine that you stash all your elements in an array, in sorted order, with each element tagged with its weight. You could then solve this problem (inefficiently) as follows:

  1. Generate a random number between 0 and the total weight of all the elements.
  2. Iterate over the array until you find an element whose "range" contains that random number. Here, "range" means the window of weights from the start of that element up to the start of the next element.
  3. Remove that element and repeat.

If you implement this as mentioned above, each pass of picking a random element will take time O(n): you have to iterate over all the elements of the array, then remove a single element somewhere once you've picked it. That's not great; the overall runtime is O(kn).
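
As a concrete reference point, here is a minimal sketch of that naive procedure, assuming the setup from the question; the function name sample_naive and its exact layout are mine, not part of the original answer.

#include <vector>
#include <random>
#include <numeric>

// Naive O(k*n) weighted sampling without replacement: every draw scans the
// whole array to find the element whose weight window contains r, then
// erases that element (another O(n) step).
std::vector<int> sample_naive(std::vector<double> weights, size_t ntake,
                              std::mt19937 &rng)
{
    std::vector<int> values(weights.size());
    std::iota(values.begin(), values.end(), 0);

    std::vector<int> chosen;
    chosen.reserve(ntake);
    for (size_t k = 0; k < ntake; ++k) {
        // Step 1: random number between 0 and the total remaining weight.
        double total = std::accumulate(weights.begin(), weights.end(), 0.0);
        std::uniform_real_distribution<double> unif(0.0, total);
        double r = unif(rng);

        // Step 2: linear scan for the element whose range covers r.
        size_t i = 0;
        for (; i + 1 < weights.size(); ++i) {
            if (r < weights[i]) break;
            r -= weights[i];
        }

        // Step 3: record and remove that element, then repeat.
        chosen.push_back(values[i]);
        values.erase(values.begin() + i);
        weights.erase(weights.begin() + i);
    }
    return chosen;
}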

We can slightly improve upon this idea in the following way. When storing all the elements in the array, have each element store both its actual weight and the combined weight of all elements that come before it. Now, to find which element you're going to sample, you don't need to use a linear search. You can instead use a binary search over the array to locate your element in time O(log n). However, the overall runtime of this approach is still O(n) per iteration, since that's the cost of removing the element you picked, so we're still in O(kn) territory.
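
A hedged sketch of just that lookup step, using std::partial_sum and std::upper_bound (the function name is mine): given the running totals, one draw costs O(log n).

#include <vector>
#include <random>
#include <algorithm>

// Locate one sample in O(log n), given precomputed cumulative weights
// cum[i] = weights[0] + ... + weights[i].
size_t draw_from_prefix_sums(const std::vector<double> &cum, std::mt19937 &rng)
{
    std::uniform_real_distribution<double> unif(0.0, cum.back());
    double r = unif(rng);
    // Binary search: first index whose cumulative weight exceeds r.
    return std::upper_bound(cum.begin(), cum.end(), r) - cum.begin();
}

The cumulative weights would be built once with std::partial_sum(weights.begin(), weights.end(), cum.begin()). After removing element i, however, every running total from position i onward has to be recomputed, which is exactly the O(n) step that keeps the whole procedure at O(kn).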

However, if you store the elements not in a sorted array where each element stores the weight of all elements before it, but in a balanced binary search tree where each element stores the weight of all elements in its left subtree, you can simulate the above algorithm (the binary search gets replaced with a walk over the tree). Moreover, this has the advantage that removing an element from the tree can be done in time O(log n), since it's a balanced BST.
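
To make the walk concrete, here is a sketch of the descent rule on such an augmented node; the Node layout and the name find_by_weight are illustrative, not from the answer. It assumes r was drawn uniformly from [0, total weight of the tree), so the recursion always terminates at a valid node.

struct Node {
    int value;          // the integer stored at this node
    double weight;      // this element's own weight
    double left_weight; // total weight of all elements in the left subtree
    Node *left, *right;
};

// Descend with a random value r in [0, total weight under node):
// r in [0, left_weight)                    -> recurse left
// r in [left_weight, left_weight + weight) -> pick this node
// otherwise                                -> recurse right, shifting r
Node *find_by_weight(Node *node, double r)
{
    if (r < node->left_weight)
        return find_by_weight(node->left, r);
    r -= node->left_weight;
    if (r < node->weight)
        return node;
    return find_by_weight(node->right, r - node->weight);
}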

(If you're curious how you'd do the walk to find the element that you want, do a quick search for "order statistics tree." The idea here is essentially a generalization of this idea.)

Following the advice from @dyukha, you can get O(log n) time per operation by building a perfectly-balanced tree from the items in time O(n) (the items don't actually have to be sorted for this technique to work - do you see why?), then using the standard tree deletion algorithm each time you need to remove something. This gives an overall solution runtime of O(k log n).
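
One concrete way to realize this plan - a sketch under my own assumptions, not necessarily the implementation the answer had in mind - is to store the perfectly balanced tree implicitly in an array, segment-tree style: each internal slot holds the total weight below it, the tree is built bottom-up in O(n) without sorting, a draw walks down in O(log n), and deletion is simulated by zeroing the chosen leaf's weight and updating the O(log n) sums above it. The class name WeightedSampler is mine.

#include <vector>
#include <random>

// Weighted sampling without replacement: O(n) build, O(log n) per draw,
// hence O(k log n) for k draws. Requires weights.size() >= 1 and at most
// weights.size() draws. Repeated subtraction can accumulate a little
// floating-point drift, which is acceptable for a sketch.
class WeightedSampler {
    size_t n;                  // number of elements (leaves)
    std::vector<double> tree;  // tree[i] = total weight under node i; leaves at [n, 2n)

public:
    explicit WeightedSampler(const std::vector<double> &weights)
        : n(weights.size()), tree(2 * weights.size(), 0.0)
    {
        for (size_t i = 0; i < n; ++i) tree[n + i] = weights[i];
        for (size_t i = n - 1; i > 0; --i)
            tree[i] = tree[2 * i] + tree[2 * i + 1];
    }

    // Draw one index with probability weight[i] / (total remaining weight),
    // then remove it from further consideration.
    size_t draw(std::mt19937 &rng)
    {
        std::uniform_real_distribution<double> unif(0.0, tree[1]);
        double r = unif(rng);

        // Walk down: go left if r falls inside the left child's total
        // weight, otherwise subtract it and go right.
        size_t i = 1;
        while (i < n) {
            if (r < tree[2 * i]) i = 2 * i;
            else { r -= tree[2 * i]; i = 2 * i + 1; }
        }
        size_t idx = i - n;

        // "Delete" the chosen leaf: zero its weight and fix every sum on
        // the path back up to the root.
        double w = tree[i];
        for (; i >= 1; i /= 2) tree[i] -= w;
        return idx;
    }
};

Plugged into the question's scaffold, the sampler loop would read: WeightedSampler sampler(weights); for (int &x : chosen_sample) x = (int)sampler.draw(rng);. Zeroing a leaf's weight is equivalent, for sampling purposes, to deleting the element, and the implicit array layout avoids the rotations a pointer-based balanced BST would otherwise need to stay balanced.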
