Problem description
I want to solve the following problem. I have to sample from an extremely large set, on the order of 10^20 elements, and extract a sample without repetitions of about 10%-20% of that size. Given the size of the set, I believe that an algorithm like Fisher–Yates is not feasible.
I'm thinking that something like a random path tree might do this in O(n log n), and that it can't be done faster, but I want to ask whether something like this has already been implemented.
Thank you for your time!
Recommended answer
I don't know how well the technique I describe below would do on formal tests of randomness, but it does give "random-looking" results.
You can do this with a multiplicative inverse. The idea is to multiply each integer by a constant x modulo m = N + 1; as long as x is coprime to m, this maps every integer in the range 1-N to a unique integer in the same range. This is often used to generate obfuscated keys, but you can adapt it to generate random subsets by altering the seed value and the range from which you pull items.
A while back I wrote a blog entry about how to generate obfuscated sequential keys. Here's the code:
// Requires: using System; using System.Collections.Generic;
private void DoIt()
{
    const long m = 101;         // Number of keys + 1
    const long x = 387420489;   // must be coprime to m

    // Compute the multiplicative inverse of x modulo m
    var multInv = MultiplicativeInverse(x, m);

    // HashSet is used to hold the obfuscated value so we can ensure that no duplicates occur.
    var nums = new HashSet<long>();

    // Obfuscate each number from 1 to 100.
    // Show that the process can be reversed.
    // Show that no duplicates are generated.
    for (long i = 1; i <= 100; ++i)
    {
        var obfuscated = i * x % m;
        var original = obfuscated * multInv % m;
        Console.WriteLine("{0} => {1} => {2}", i, obfuscated, original);
        if (!nums.Add(obfuscated))
        {
            Console.WriteLine("Duplicate");
        }
    }
}

private long MultiplicativeInverse(long x, long modulus)
{
    // C#'s % operator can return a negative remainder, so normalize the
    // result into the range 0..modulus-1 before using it as the inverse.
    var inv = ExtendedEuclideanDivision(x, modulus).Item1 % modulus;
    return inv < 0 ? inv + modulus : inv;
}

// Returns (s, t) such that a*s + b*t == gcd(a, b).
private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
{
    if (a < 0)
    {
        var result = ExtendedEuclideanDivision(-a, b);
        return Tuple.Create(-result.Item1, result.Item2);
    }
    if (b < 0)
    {
        var result = ExtendedEuclideanDivision(a, -b);
        return Tuple.Create(result.Item1, -result.Item2);
    }
    if (b == 0)
    {
        return Tuple.Create(1L, 0L);
    }
    var q = a / b;
    var r = a % b;
    var rslt = ExtendedEuclideanDivision(b, r);
    var s = rslt.Item1;
    var t = rslt.Item2;
    return Tuple.Create(t, s - q * t);
}
The first few lines of output for that program are:
1 => 43 => 1
2 => 86 => 2
3 => 28 => 3
4 => 71 => 4
5 => 13 => 5
6 => 56 => 6
7 => 99 => 7
8 => 41 => 8
9 => 84 => 9
10 => 26 => 10
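The mapping and its reversal can be cross-checked in a few lines. The following is a minimal sketch in Python rather than C#, chosen only because its integers are arbitrary precision; the helper name `obfuscate` is an assumption for illustration, and `pow(x, -1, m)` needs Python 3.8 or later.

```python
from math import gcd

def obfuscate(i, x, m):
    """Map i in 1..m-1 to another value in 1..m-1 (requires gcd(x, m) == 1)."""
    return i * x % m

m = 101            # number of keys + 1, as in the C# code
x = 387420489      # must be coprime to m
assert gcd(x, m) == 1

x_inv = pow(x, -1, m)  # modular multiplicative inverse of x

mapped = [obfuscate(i, x, m) for i in range(1, m)]

# Every value in 1..m-1 appears exactly once, so the mapping is a permutation.
assert sorted(mapped) == list(range(1, m))

# Multiplying by the inverse recovers each original value.
recovered = [v * x_inv % m for v in mapped]
assert recovered == list(range(1, m))

print(mapped[:3])  # [43, 86, 28]
```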
If you were to change the m and x values at the beginning of the function to reflect your range of numbers, this would work for you. And rather than always starting at 1 and grabbing the first 10 or 20%, you could start at the 50% mark and go from there. Or use some technique that grabs every fifth number, or whatever, so long as your method doesn't visit the same number twice.
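One way to read "start at the 50% mark and go from there" is sketched below in Python. The function name `sample_without_repeats` and its parameters are hypothetical, not from the original answer; the sketch assumes the indices 1..m-1 are walked cyclically from `start`, so no index (and therefore no output) is visited twice.

```python
from math import gcd

def sample_without_repeats(m, x, count, start=1):
    """Yield `count` distinct integers in 1..m-1, beginning at index `start`."""
    if gcd(x, m) != 1:
        raise ValueError("x must be coprime to m")
    if not 1 <= start < m:
        raise ValueError("start must be in 1..m-1")
    if not 0 <= count <= m - 1:
        raise ValueError("count cannot exceed the m - 1 available values")
    for k in range(count):
        # Walk the indices 1..m-1 cyclically, starting at `start`.
        i = (start - 1 + k) % (m - 1) + 1
        yield i * x % m

# Take 20 of the 100 values (a 20% sample), starting at the 50% mark.
sample = list(sample_without_repeats(m=101, x=387420489, count=20, start=50))
```

Because the function is a generator, values are produced one at a time and nothing proportional to the size of the full set is held in memory.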
And if you need more runs, just change the x value.
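Changing x between runs could look like the following Python sketch. Drawing x at random and rejecting candidates that share a factor with m is an assumption of this sketch, not something the original answer prescribes, and `random_multiplier` is a hypothetical helper name.

```python
import random
from math import gcd

def random_multiplier(m, rng):
    """Pick a random x in 2..m-1 that is coprime to m (rejection sampling)."""
    while True:
        x = rng.randrange(2, m)
        if gcd(x, m) == 1:
            return x

m = 101
rng = random.Random(12345)  # fixed seed only so the sketch is repeatable

runs = []
for _ in range(3):
    x = random_multiplier(m, rng)
    # Each run is a permutation of 1..m-1 determined entirely by x.
    runs.append([i * x % m for i in range(1, m)])
```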
Generating the multiplicative inverse (think of it as seeding the random number generator) is an O(log n) operation. After that, generating each number is O(1).
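To make the O(log n) claim concrete, here is an iterative extended Euclidean algorithm in Python that also counts its loop iterations. It is a sketch of the standard algorithm, not a port of the recursive C# version above; the names `extended_gcd` and `mod_inverse` are illustrative.

```python
def extended_gcd(a, b):
    """Return (g, s, t, steps) with a*s + b*t == g == gcd(a, b)."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    steps = 0
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
        steps += 1
    return old_r, old_s, old_t, steps

def mod_inverse(x, m):
    """Return the inverse of x modulo m, normalized into 0..m-1."""
    g, s, _, _ = extended_gcd(x, m)
    if g != 1:
        raise ValueError("x has no inverse modulo m")
    return s % m  # Python's % is non-negative for positive m

inverse = mod_inverse(387420489, 101)
_, _, _, steps = extended_gcd(3**41, 10**20 + 1)
```

The loop count `steps` grows with the number of digits of the inputs rather than with their magnitude, which is why even a modulus near 10^20 needs only a modest number of iterations.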
Of course, if you're working with numbers in the range of 10^20, you'll have to modify the code to work with a big integer class: a 64-bit long tops out at about 9.2 × 10^18, and the product i * x overflows well before that.
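As a rough illustration at that scale, the Python sketch below relies on Python's built-in arbitrary-precision integers; in C#, `System.Numerics.BigInteger` would play the same role. The multiplier `3**41` is a hypothetical choice made only because it is large and coprime to `10**20 + 1`.

```python
from math import gcd

m = 10**20 + 1   # keys are 1..10**20
x = 3**41        # hypothetical multiplier; any x coprime to m works
assert gcd(x, m) == 1

x_inv = pow(x, -1, m)  # modular inverse, Python 3.8+

# Generate the first few values lazily; nothing close to 10**20 items is stored.
first = [i * x % m for i in range(1, 6)]

# The inverse maps each value back to its original index.
restored = [v * x_inv % m for v in first]
```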