问题描述
有人可以解释如何计数素描算法的工作?我仍然无法弄清楚如何哈希值的使用,例如。我有一个很难理解本文。
Can someone explain how the Count Sketch Algorithm works? I still can't figure out how hashes are used, for example. I have a hard time understanding this paper.
推荐答案
这流算法实例下面的框架。
This streaming algorithm instantiates the following framework.
-
查找随机数据流的算法,它的输出(作为随机变量)具有所需的期望,但通常是高方差(即噪声)。
Find a randomized streaming algorithm whose output (as a random variable) has the desired expectation but usually high variance (i.e., noise).
要降低方差/噪音,并行运行多个独立的副本,并结合它们的输出。
To reduce the variance/noise, run many independent copies in parallel and combine their outputs.
通常为1比2。更有趣的这个算法的2其实是有点不标准,但我要谈的1只。
Usually 1 is more interesting than 2. This algorithm's 2 actually is somewhat nonstandard, but I'm going to talk about 1 only.
假设我们正在处理输入
a b c a b a .
通过三个计数器,就没有必要凑。
With three counters, there's no need to hash.
a: 3, b: 2, c: 1
然而,让我们假设我们只有一个。有八种可能的功能 H:{A,B,C} - > {+ 1,-1}
。下面是结果的一个表
Let's suppose however that we have just one. There are eight possible functions h : {a, b, c} -> {+1, -1}
. Here is a table of the outcomes.
h |
abc | X = counter
----+--------------
+++ | +3 +2 +1 = 6
++- | +3 +2 -1 = 4
+-- | +3 -2 -1 = 0
+-+ | +3 -2 +1 = 2
--+ | -3 -2 +1 = -4
--- | -3 -2 -1 = -6
-+- | -3 +2 -1 = -2
-++ | -3 +2 +1 = 0
现在,我们可以计算出期望
Now we can calculate expectations
(6 + 4 + 0 + 2) - (-4 + -6 + -2 + 0)
E[h(a) X] = ------------------------------------ = 24/8 = 3
8
(6 + 4 + -2 + 0) - (0 + 2 + -4 + -6)
E[h(b) X] = ------------------------------------ = 16/8 = 2
8
(6 + 2 + -4 + 0) - (4 + 0 + -6 + -2)
E[h(c) X] = ------------------------------------ = 8/8 = 1 .
8
这是怎么回事呢?对于 A
,比方说,我们可以分解 X = Y + Z
,其中是
是总和的变化 A
s和以Z
的总和非 - A
秒。通过预期的线性度,我们有
What's going on here? For a
, say, we can decompose X = Y + Z
, where Y
is the change in the sum for the a
s, and Z
is the sum for the non-a
s. By the linearity of expectation, we have
E[h(a) X] = E[h(a) Y] + E[h(a) Z] .
E [H(一)Y]
是对 A
即每次出现一个学期的总和 H(一)^ 2 = 1
,所以 E [H(一)Y]
是出现的次数 A
。其他期限 E [H(一)Z]
为零;即使考虑 H(一)
,对方散列值也同样可能是加上或减去一个,所以有助于零的预期。
E[h(a) Y]
is a sum with a term for each occurrence of a
that is h(a)^2 = 1
, so E[h(a) Y]
is the number of occurrences of a
. The other term E[h(a) Z]
is zero; even given h(a)
, each other hash value is equally likely to be plus or minus one and so contributes zero in expectation.
在事实上,散列函数并不需要是均匀的随机的,和良好的事:就没有的方式来存储它。就足够了散列函数是成对独立的(任何两个特定散列值是独立的)。对于我们的简单的例子,一个随机选择的以下四个功能就足够了。
In fact, the hash function doesn't need to be uniform random, and good thing: there would be no way to store it. It suffices for the hash function to be pairwise independent (any two particular hash values are independent). For our simple example, a random choice of the following four functions suffices.
abc
+++
+--
-+-
--+
我会离开这个新的计算给你。
I'll leave the new calculations to you.
这篇关于解释伯爵素描算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!