How to generate Zipf-distributed numbers efficiently?

Question

I'm currently benchmarking some data structures in C++ and I want to test them when working on Zipf-distributed numbers.

I'm using the generator provided on this site: http://www.cse.usf.edu/~christen/tools/toolpage.html

I adapted the implementation to use a Mersenne Twister generator.

It works well, but it is really slow. In my case, the range can be large (about a million) and several million random numbers may be generated.

The alpha parameter does not change over time; it is fixed.

I tried precomputing all the sum_prob values. That is much faster, but it is still slow for a large range.

Is there a faster way to generate Zipf-distributed numbers? Even something less precise would be welcome.

Thanks

Recommended answer

Precomputation alone does not help much. However, sum_prob is cumulative and therefore sorted in ascending order, so a binary search can be used to find zipf_value. That reduces the cost of generating one Zipf-distributed number from O(n) to O(log(n)), which is a substantial improvement when n is large.

Just replace the zipf() function in genzipf.c with the following one:

int zipf(double alpha, int n)
{
  static int first = TRUE;      // Static first time flag
  static double c = 0;          // Normalization constant
  static double *sum_probs;     // Pre-calculated sum of probabilities
  double z;                     // Uniform random number (0 < z < 1)
  int zipf_value;               // Computed Zipf value to be returned
  int    i;                     // Loop counter
  int low, high, mid;           // Binary-search bounds

  // Compute normalization constant on first call only
  if (first == TRUE)
  {
    for (i=1; i<=n; i++)
      c = c + (1.0 / pow((double) i, alpha));
    c = 1.0 / c;

    sum_probs = malloc((n+1)*sizeof(*sum_probs));
    sum_probs[0] = 0;
    for (i=1; i<=n; i++) {
      sum_probs[i] = sum_probs[i-1] + c / pow((double) i, alpha);
    }
    first = FALSE;
  }

  // Pull a uniform random number (0 < z < 1)
  do
  {
    z = rand_val(0);
  }
  while ((z == 0) || (z == 1));

  // Map z to the value: find the smallest index with sum_probs[index] >= z
  low = 1;
  high = n;
  zipf_value = n;               // Fallback if rounding leaves sum_probs[n] below z
  while (low <= high) {
    mid = low + (high - low) / 2;
    if (sum_probs[mid] >= z) {
      zipf_value = mid;         // Candidate found; keep looking to the left
      high = mid - 1;
    } else {
      low = mid + 1;
    }
  }

  // Assert that zipf_value is between 1 and N
  assert((zipf_value >=1) && (zipf_value <= n));

  return(zipf_value);
}
