java - 英语词典的熵

我有一个trie数据结构，它存储一系列英语单词。
例如，给定这些单词，字典是这样的：

aa abc aids aimed ami amo b browne brownfield brownie browser brut
butcher casa cash cicca ciccio cicelies cicero cigar ciste conv cony
crumply diarca diarchial dort eserine excursus foul gawkishly he
insomniac mehuman occluding poverty pseud rumina skip slagging
socklessness unensured waspily yes yoga z zoo

蓝色的节点是单词完成的节点。

在我保存的每个节点中：
它所代表的字符
节点所在的级别
指示该节点的“pass”字数的计数器。
节点下一个字符的熵。
我会找到树的每一层的熵和字典的总熵。
这是类TrieNode的一部分，表示单个节点：

class TrieNode {

    public char content;
    public boolean isEnd;
    public double count;
    public LinkedList<TrieNode> childList;
    public String path = "";
    public double entropyNextChar;
    public int level;

    ...
 }

这是一个类Trie的一部分，其中包含一些操作trie数据结构的方法：

 public class Trie {

    ...

    private double totWords = 0;
    public double totChar = 0;

    public double[] levelArray; //in each i position of the array there is the entropy of level i
    public ArrayList<ArrayList<Double>> level; //contains a list of entropies for each level

    private TrieNode root;

    public Trie(String filenameIn, String filenameOut) {
        root = new TrieNode('*'); //blank for root
        getRoot().level = 0;
        totWords = 0;
    }

    public double getTotWords() {
        return totWords;
    }

    /**
     * Function to insert word, setta per ogni nodo il path e il livello.
     */
    public void insert(String word) {
        if(search(word) == true) {
            return;
        }
        int lev = 0;
        totChar += word.length();
        TrieNode current = root;
        current.level = getRoot().level;
        for(char ch : word.toCharArray()) {
            TrieNode child = current.subNode(ch);
            if(child != null) {
                child.level = current.level + 1;
                current = child;
            }
            else {
                current.childList.add(new TrieNode(ch, current.path, current.level + 1));
                current = current.subNode(ch);
            }
            current.count++;
        }
        totWords++;
        getRoot().count = totWords;
        current.isEnd = true;
    }

    /**
     * Function to search for word.
     */
    public boolean search(String word) {
        ...
    }

    /**
     * Set the entropy of each node.
     */
    public void entropyNextChar(TrieNode node) {
        for(TrieNode childToCalculate : node.childList) {
            int numberChildren = node.childList.size();
            int i = 0;
            double entropy = 0.0;
            if(numberChildren > 0) {
                double[] p = new double[numberChildren];
                for(TrieNode child : node.childList) {
                    p[i] = child.count / node.count;
                    i++;
                }
                for(int j = 0; j < p.length; j++) {
                    entropy -= p[j] * log2(p[j]);
                }
                node.entropyNextChar = entropy;
                entropyNextChar(childToCalculate);
            }
        }
    }

    /**
     * Return the number of levels (root has lev = 0).
     */
    public int getLevels(TrieNode node) {
        int lev = 0;
        if(node != null) {
            TrieNode current = node;
            for(TrieNode child : node.childList) {
                lev = Math.max(lev, 1 + getLevels(child));
            }
        }
        return lev;
    }

    public static double log2(double n) {
        return (Math.log(n) / Math.log(2));
    }

    ...
}

现在我可以找到树的每一层的熵。
为此，我创建了以下方法来创建两个数据结构（level和levelArray）。
levelArray是包含结果的双精度数组，即在索引i中存在i级别的熵。
level是arraylist的arraylist。每个列表包含每个节点的加权熵。

public void entropyEachLevel() {
    int num_levels = getLevels(getRoot());
    levelArray = new double[num_levels + 1];
    level = new ArrayList<ArrayList<Double>>(num_levels + 1);
    for(int i = 0; i < num_levels + 1; i++) {
        level.add(new ArrayList<Double>());
        levelArray[i] = 0; //inizializzo l'array
    }
    entropyNextChar(getRoot());
    fillListArray(getRoot(), level);
    for(int i = 1; i < levelArray.length; i++) {
        for(Double el : level.get(i)) {
            levelArray[i] += el;
        }
    }
}

public void fillListArray(TrieNode node, ArrayList<ArrayList<Double>> level) {
    for(TrieNode child : node.childList) {
        double val = child.entropyNextChar * (node.count / totWords);
        level.get(child.level).add(val);
        fillListArray(child, level);
    }
}

如果我在示例词典上运行此方法，我将得到：

[lev 1]  10.355154029112995
[lev 2]  0.6557748405420764
[lev 3]  0.2127659574468085
[lev 4]  0.23925771271992619
[lev 5]  0.17744361708265158
[lev 6]  0.0
[lev 7]  0.0
[lev 8]  0.0
[lev 9]  0.0
[lev 10] 0.0
[lev 11] 0.0
[lev 12] 0.0

问题是我不知道结果是可能的还是完全错误的。
另一个问题：我想计算整个字典的熵。为此，我想添加levelArray中的值。
像这样的程序有问题吗？
如果我这样做，我得到整个字典的熵是11.64。
我需要一些建议。
难怪代码，但我会理解，如果我提出的解决这两个问题的解决方案是正确的。
我提出的例子很简单。事实上，这些方法必须在一本真正的英语词典中找到大约200800个单词。
如果我在这本字典中应用这些方法，我会得到这样的数字（在我看来，这是过分的）。
每个级别的熵：

[lev 1]  65.30073504641602
[lev 2]  44.49825655981045
[lev 3]  37.812193162250765
[lev 4]  18.24599038562219
[lev 5]  7.943507700803994
[lev 6]  4.076715421729149
[lev 7]  1.5934893456776191
[lev 8]  0.7510203704630074
[lev 9]  0.33204345165280974
[lev 10] 0.18290941591943546
[lev 11] 0.10260282173581108
[lev 12] 0.056284946780556455
[lev 13] 0.030038717136269627
[lev 14] 0.014766733727532396
[lev 15] 0.007198162552512713
[lev 16] 0.003420610593927708
[lev 17] 0.0013019239303215001
[lev 18] 5.352246905990619E-4
[lev 19] 2.1483959981088307E-4
[lev 20] 8.270156797847352E-5
[lev 21] 7.327868866691726E-5
[lev 22] 2.848394217759738E-6
[lev 23] 6.6648152186416716E-6
[lev 24] 0.0
[lev 25] 8.545182653279214E-6
[lev 26] 0.0
[lev 27] 0.0
[lev 28] 0.0
[lev 29] 0.0
[lev 30] 0.0
[lev 31] 0.0

我认为他们错了。
而整个字典的熵是无法计算的，我认为总和是一个过程太长了，以至于我看不到结果。
为此，如果我所写的方法是正确的，我会理解的。
提前多谢了
如果字典是：aaaab和abcd，我有：

所以node*的计数器是2，因为该节点传递两个单词。
级别1的节点a也是如此。
我这样改变了方法：

public void entropyNextChar(TrieNode node) {
    for(TrieNode childToCalculate : node.childList) {
        int numberChildren = node.childList.size();
        int i = 0;
        double entropy = 0.0;
        if(numberChildren > 1) {
            double[] p = new double[numberChildren];
            for(TrieNode child : node.childList) {
                p[i] = child.count / node.count;
                i++;
            }
            for(int j = 0; j < p.length; j++) {
                entropy -= p[j] * log2(p[j]);
            }
            node.entropyNextChar = entropy;
            entropyNextChar(childToCalculate);
        }
        else {
            node.entropyNextChar = entropy;
            entropyNextChar(childToCalculate);
        }
    }
}

所以我添加了entropyNextChar。

最佳答案

我不太清楚你想得到什么样的熵（单词？，字符？，trie节点？），所以我假设你的意思是词的熵，你提到的字典可以包含重复的词。
如果你想计算trie中单词的熵和它的水平，你的概率函数在我看来是不对的。请考虑entryNextChar函数：

public void entropyNextChar(TrieNode node) {
    for(TrieNode childToCalculate : node.childList) {
        int numberChildren = node.childList.size();
        int i = 0;
        double entropy = 0.0;
        if(numberChildren > 0) {
            double[] p = new double[numberChildren];
            for(TrieNode child : node.childList) {
                p[i] = child.count / node.count;
                i++;
            }
            for(int j = 0; j < p.length; j++) {
                entropy -= p[j] * log2(p[j]);
            }
            node.entropyNextChar = entropy;
            entropyNextChar(childToCalculate);
        }
    }
}

特别是我对下面这句话感兴趣

p[i] = child.count / node.count;

看起来您可以计算trie节点命中的概率（即，插入从最顶层字符开始的单词时命中节点的可能性）。如果这是你真正想要的，那么我会说你做得对，尽管这对我来说没有多大意义。
否则，如果你想计算每个层次的词熵，那么你的概率函数是错误的。该级别中每个单词的概率应该类似于count（word_i）/total_i，其中count（word_i）是每个单词在该级别中出现的次数，total_i是该级别包含的单词总数。你有totwords计数器，但你错过了单个单词的计数器。我建议你只在设置了isend为true时才增加count字段。在本例中，它将指示每个单词在级别内出现的次数。你的概率函数是这样的：

p[i] = child.count / totWords; // make sure totWords is not 0

另一个问题：我想计算整个字典的熵。为此，我想添加levelarray中的值。像这样的程序有问题吗？如果我这样做，我得到整个字典的熵是11.64。
我认为这样做是不对的，因为你的概率分布改变了。这次需要计算整个trie中的单词总数，并将每个单词的频率计数器除以这个数字，以获得正确的概率。