Efficient algorithm for randomly selecting items by frequency

Question

Given an array of n word-frequency pairs:

[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]

where wi is a word, fi is an integer frequency, and the sum of the frequencies ∑fi = m, I want to use a pseudo-random number generator (pRNG) to select p words wj0, wj1, ..., wjp-1 such that the probability of selecting any word is proportional to its frequency:

P(wi = wjk) = P(i = jk) = fi / m

(Note, this is selection with replacement, so the same word could be chosen more than once.)

I've come up with three algorithms so far:

  1. Create an array of size m, and populate it so the first f0 entries are w0, the next f1 entries are w1, and so on, so the last fn-1 entries are wn-1.

[ w0, ..., w0, w1, ..., w1, ..., wn-1, ..., wn-1 ]

Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices.
This takes O(n + m + p) work, which isn't great, since m can be much much larger than n.
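As a rough Ruby sketch of this first approach (the method name `sample_by_expansion` is mine, not from the question):

```ruby
# Algorithm 1: expand each (word, frequency) pair into freq copies of the
# word, then sample p indices uniformly from the expanded array.
# Takes O(n + m + p) work and O(m) space.
def sample_by_expansion(input, p)
  expanded = input.flat_map { |word, freq| [word] * freq }
  (0...p).map { expanded[rand(expanded.size)] }
end
```

A word with frequency f occupies f of the m slots, so a uniform index lands on it with probability f/m.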

  2. Step through the input array once, computing

     mi = ∑j=0...i fj = mi-1 + fi

and after computing mi, use the pRNG to generate a number xk in the range 0...mi-1 for each k in 0...p-1, and select wi for wjk (possibly replacing the current value of wjk) if xk < fi.
This requires O(n + np) work.
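A sketch of this single pass in Ruby (again, the method name is illustrative):

```ruby
# Algorithm 2: a single pass over the input maintains a running total m_i.
# For each of the p output slots, the current word w_i replaces the slot's
# pick with probability f_i / m_i, so each slot ends up holding word i
# with probability f_i / m.  Takes O(np) work and O(p) extra space.
def sample_by_replacement(input, p)
  picks = Array.new(p)
  m_i = 0
  input.each do |word, freq|
    m_i += freq
    p.times { |k| picks[k] = word if rand(m_i) < freq }
  end
  picks
end
```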

  3. As above, compute the mi, but store them with the word-frequency pairs as triples:

     [ (w0, f0, m0), (w1, f1, m1), ..., (wn-1, fn-1, mn-1) ]

and then, for each k in 0...p-1, use the pRNG to generate a number xk in the range 0...m-1, then do a binary search on the array of triples to find the i s.t. mi - fi ≤ xk < mi, and select wi for wjk.
This requires O(n + p log n) work.
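A sketch of this in Ruby, using Array#bsearch in find-minimum mode for the binary search (the method name is mine):

```ruby
# Algorithm 3: precompute the running totals m_i once, then answer each
# draw with a binary search for the first i such that x < m_i
# (equivalently, m_i - f_i <= x < m_i).  O(n) setup, O(log n) per draw.
def sample_by_bsearch(input, p)
  total = 0
  triples = input.map do |word, freq|
    total += freq
    [word, freq, total]  # (w_i, f_i, m_i)
  end
  (0...p).map do
    x = rand(total)
    # running totals are strictly increasing, so the predicate is monotone
    word, _f, _m = triples.bsearch { |triple| x < triple[2] }
    word
  end
end
```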

My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?

Answer

Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:

  • There are n partitions, all of the same width r s.t. nr = m.
  • Each partition contains two words in some ratio (which is stored with the partition).
  • For each word wi, fi = ∑ r × ratio(t, wi), where the sum is over the partitions t s.t. wi ∈ t.

Since all the partitions are of the same size, selecting a partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used, also in constant work (compare a pRNGed number with the ratio between the two words). So the p selections can be done in O(p) work, given such a partition.

The reason that such a partitioning exists is that there exists a word wi s.t. fi < r if and only if there exists a word wi' s.t. fi' > r, since r is the average of the frequencies.

Given such a pair wi and wi', we can replace them with a pseudo-word w'i of frequency f'i = r (that represents wi with probability fi/r and wi' with probability 1 - fi/r) and a new word w'i' of adjusted frequency f'i' = fi' - (r - fi), respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r, and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words, which are the desired partition.

To construct the partition in O(n) time:

    • go through the list of the words once, constructing two lists:
      • one of words with frequency ≤ r
      • one of words with frequency > r
    • then, repeatedly pull a word off of the first list:
      • if its frequency = r, make it into a one-element partition
      • otherwise, pull a word off of the other list, and use it to fill out a two-word partition; then put the second word back into the first or second list according to its adjusted frequency

This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'i = n·fi, which updates m' = mn and sets r' = m when q = n.
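As a small worked check of this padding trick (numbers chosen for illustration): with frequencies [1, 2, 3] we have n = 3 and m = 6, so padding gives m' = 18 and r' = m'/n = m = 6, integral by construction.

```ruby
# Padding every frequency by a factor of n guarantees the partition width
# r' = m'/n is the integer m, whether or not the original m divides by n.
freqs = [1, 2, 3]
n = freqs.size
m = freqs.sum
padded = freqs.map { |f| f * n }   # f'_i = n * f_i
m_padded = padded.sum              # m' = m * n
r = m_padded / n                   # r' = m, always integral
```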

In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.

In Ruby:

def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum,(word,freq)| sum + freq }

  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word,freq|
                        # pad the frequency so we can keep it integral
                        # when subdivided
                        [ word, freq*n ]
                      end.partition do |word,adj_freq|
                        adj_freq <= m
                      end

  partitions = Array.new(n) do
    word, adj_freq = lessers.shift

    other_word = if adj_freq < m
                   # use part of another word's frequency to pad
                   # out the partition
                   other_word, other_adj_freq = greaters.shift
                   other_adj_freq -= (m - adj_freq)
                   (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
                   other_word
                 end

    [ word, other_word, adj_freq ]
  end

  (0...p).map do
    # pick a partition at random
    word, other_word, adj_freq = partitions[ rand(n) ]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
      
