问题描述
我有一个树的集合,它们的节点被标记(但不是唯一的).具体来说,这些树来自一组已解析的句子(参见 http://en.wikipedia.org/wiki/树库).我希望从集合中提取最常见的子树 - 性能(还)不是问题.我会很感激算法(最好是 Java)或指向为树库执行此操作的工具的指针.请注意,子节点的顺序很重要.
I have a collection of trees whose nodes are labelled (but not uniquely). Specifically the trees are from a collection of parsed sentences (see http://en.wikipedia.org/wiki/Treebank). I wish to extract the most common subtrees from the collection - performance is not (yet) an issue. I'd be grateful for algorithms (ideally Java) or pointers to tools which do this for treebanks. Note that order of child nodes is important.
编辑 @mjv.我们在一个有限的领域(化学)工作,它有一种风格化的语言,所以树的种类并不多 - 可能类似于儿童的读者.猫坐在垫子上"的简单树.
EDIT @mjv. We are working in a limited domain (chemistry) which has a stylised language so the varirty of the trees is not huge - probably similar to children's readers. Simple tree for "the cat sat on the mat".
<sentence>
<nounPhrase>
<article/>
<noun/>
</nounPhrase>
<verbPhrase>
<verb/>
<prepositionPhrase>
<preposition/>
<nounPhrase>
<article/>
<noun/>
</nounPhrase>
</prepositionPhrase>
</verbPhrase>
</sentence>
这里的句子包含两个相同的词性子树(实际标记cat".mat"在匹配中并不重要).所以算法需要检测到这一点.请注意,并非所有名词短语都是相同的——大黑猫"可能是:
Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat". "mat" are not important in matching). So the algorithm would need to detect this. Note that not all nounPhrases are identical - "the big black cat" could be:
<nounPhrase>
<article/>
<adjective/>
<adjective/>
<noun/>
</nounPhrase>
句子的长度会更长——在 15 到 30 个节点之间.我希望从 1000 棵树中得到有用的结果.如果这不超过一天左右,那是可以接受的.
The length of sentences will be longer - between 15 to 30 nodes. I would expect to get useful results from 1000 trees. If this does not take more than a day or so that's acceptable.
显然树越短越频繁,所以 nounPhrase 会很常见.
Obviously the shorter the tree the more frequent, so nounPhrase will be very common.
EDIT 如果要通过展平树来解决这个问题,那么我认为它与最长公共子串有关,而不是与最长公共序列有关.但请注意,我不一定只想要最长的 - 我想要一个足够长的列表以有趣"(标准尚未确定).
EDIT If this is to be solved by flattening the tree then I think it would be related to Longest Common Substring, not Longest Common Sequence. But note that I don't necessarily just want the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).
推荐答案
找到集合中最频繁的子树,创建子树的紧凑形式,然后迭代每个子树并使用哈希集来计算它们的出现次数.30 个节点对于一个完美的哈希来说太大了 - 每个节点只有大约一位,你需要这么多来表明它是兄弟还是孩子.
Finding the most frequent subtrees in the collection, create a compact form of the subtree, then iterate every subtree and use a hashset to count their occurrences. 30 nodes is too big for a perfect hash - it's only about one bit per node, and you need that much to indicate whether it's a sibling or a child.
那个问题不是 LCS - 最常见的序列与最长的公共子序列无关.最频繁的子树是出现最多的子树.
That problem isn't LCS - the most common sequence isn't related to the longest common subsequence. The most frequent subtree is that which occurs the most.
对于长度为 L 的 N 棵树,在最坏的情况下应该是 O(N L^2)(假设测试包含 L 个节点的子树的相等性是 O(L)).
It should be at worst case O(N L^2) for N trees of length L (assuming testing equality of a subtree containing L nodes is O(L)).
这篇关于在(解析)树的集合中查找最频繁的子树的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!