Problem description
I'm currently benchmarking some data structures in C++ and I want to test them on Zipf-distributed numbers.
I'm using the generator provided on this site: http://www.cse.usf.edu/~christen/tools/toolpage.html
I adapted the implementation to use a Mersenne Twister generator.
It works well, but it is really slow. In my case the range can be big (about a million), and several million random numbers may need to be generated.
The alpha parameter is fixed; it does not change over time.
I tried to precalculate all the sum_prob values. That is much faster, but it is still slow on big ranges.
Is there a faster way to generate Zipf-distributed numbers? Even something less precise would be welcome.
Thanks
Recommended answer
Precalculation alone does not help much. However, sum_prob is cumulative and therefore sorted in ascending order, so a binary search can be used to find zipf_value. That lowers the cost of generating one Zipf-distributed number from O(n) to O(log(n)), which is a large gain on big ranges.
To use it, replace the zipf() function in genzipf.c with the following one:
// Needs <assert.h>, <math.h> and <stdlib.h>; TRUE, FALSE and rand_val() come from genzipf.c
int zipf(double alpha, int n)
{
  static int first = TRUE;      // Static first time flag
  static double c = 0;          // Normalization constant
  static double *sum_probs;     // Pre-calculated cumulative probabilities
  double z;                     // Uniform random number (0 < z < 1)
  int zipf_value = n;           // Value to return; stays n if rounding leaves sum_probs[n] < z
  int i;                        // Loop counter
  int low, high, mid;           // Binary-search bounds

  // Compute the normalization constant and the cumulative table on first call only
  // (later calls reuse the table, so alpha and n must not change between calls)
  if (first == TRUE)
  {
    for (i=1; i<=n; i++)
      c = c + (1.0 / pow((double) i, alpha));
    c = 1.0 / c;

    sum_probs = malloc((n+1)*sizeof(*sum_probs));
    assert(sum_probs != NULL);
    sum_probs[0] = 0;
    for (i=1; i<=n; i++) {
      sum_probs[i] = sum_probs[i-1] + c / pow((double) i, alpha);
    }
    first = FALSE;
  }

  // Pull a uniform random number (0 < z < 1)
  do
  {
    z = rand_val(0);
  }
  while ((z == 0) || (z == 1));

  // Map z to the value: find mid with sum_probs[mid-1] < z <= sum_probs[mid]
  low = 1;
  high = n;
  do {
    mid = low + (high - low) / 2;   // Integer midpoint, no overflow for large n
    if (sum_probs[mid] >= z && sum_probs[mid-1] < z) {
      zipf_value = mid;
      break;
    } else if (sum_probs[mid] >= z) {
      high = mid-1;
    } else {
      low = mid+1;
    }
  } while (low <= high);

  // Assert that zipf_value is between 1 and N
  assert((zipf_value >=1) && (zipf_value <= n));

  return(zipf_value);
}