问题描述
让 std :: vector< int> counts
是正整数的向量,并让 N:= counts [0] + ... + counts [counts.length()-1]
是向量分量的总和。设置 pi:= counts [i] / N
,我使用经典公式 H = p0 * log2(p0)+ ...计算熵。 + pn * log2(pn)
。
Let std::vector<int> counts
be a vector of positive integers and let N:=counts[0]+...+counts[counts.length()-1]
be the the sum of vector components. Setting pi:=counts[i]/N
, I compute the entropy using the classic formula H=p0*log2(p0)+...+pn*log2(pn)
.
个计数
向量正在变化- -计数增加---并且每200次更改都会重新计算熵。经过快速的Google和stackoverflow搜索之后,我找不到用于增量熵计算的任何方法。因此,问题就来了:是否有一种增量方法,例如,用于熵
The counts
vector is changing --- counts are incremented --- and every 200 changes I recompute the entropy. After a quick google and stackoverflow search I couldn't find any method for incremental entropy computation. So the question: Is there an incremental method, like the ones for variance, for entropy computation?
编辑:此问题的动机是在型学习者。
Motivation for this question was usage of such formulas for incremental information gain estimation in VFDT-like learners.
已解决::请参见。
推荐答案
我导出了熵和基尼系数的更新公式和算法,请注意。 (该注释的工作版本可在获得。)另请参见答案。
I derived update formulas and algorithms for entropy and Gini index and made the note available on arXiv. (The working version of the note is available here.) Also see this mathoverflow answer.
为方便起见我包括简单的Python代码,演示了派生公式:
For the sake of convenience I am including simple Python code, demonstrating the derived formulas:
from math import log
from random import randint
# maps x to -x*log2(x) for x>0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0
# update entropy if new example x comes in
def update(H, S, x):
new_S = S+x
return 1.0*H*S/new_S+h(1.0*x/new_S)+h(1.0*S/new_S)
# entropy of union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
S = S1+S2
return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)
# compute entropy(L) using only `update' function
def test(L):
S = 0.0 # sum of the sample elements
H = 0.0 # sample entropy
for x in L:
H = update(H, S, x)
S = S+x
return H
# compute entropy using the classic equation
def entropy(L):
n = 1.0*sum(L)
return sum([h(x/n) for x in L])
# entry point
if __name__ == "__main__":
L = [randint(1,100) for k in range(100)]
M = [randint(100,1000) for k in range(100)]
L_ent = entropy(L)
L_sum = sum(L)
M_ent = entropy(M)
M_sum = sum(M)
T = L+M
print "Full = ", entropy(T)
print "Update = ", update(L_ent, L_sum, M_ent, M_sum)
这篇关于增量熵计算的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!