我需要创建一个dawg(http://en.wikipedia.org/wiki/directed_acyclic_word_graph)结构给我的拼字游戏在一个文件的单词列表。我在用Java。我只需要做一次,然后把它存储在一个或多个文件中。到目前为止,我已经看到了两种方法:1)构建Trie并将其减少为DAWG;或者2)立即构建DAWG因为我只需要做一次,我想我只需要最简单的算法来实现它。速度和内存需求无关紧要。
另外,我想知道如何在运行时将结构存储在内存中,以及如何将其保存在文件中?DAWG基本上是一个图,它建议使用我编写的一些非常简单的类的一些节点和边/指针,但是我看到了使用数组和偏移量(在这个数组中)的实现,这看起来很复杂和难以理解这次我关心内存大小(运行时和保存的文件)和加载dawg/使用dawg的速度。
最佳答案
最简单和最有效的DAWG构造算法在this paper中定义,并要求对DAWG要表示的单词集进行排序鉴于您计划从一个预先存在的单词列表构建一个DAWG,该列表可能已经被排序,或者可以用于此目的。
我粗略地将算法的伪代码转录成了一种比论文中给出的更“程序员友好”的格式(免责声明:我可能犯了一些转录错误;您可能应该看看原始代码,以确定是否有):
Given:
startState is the state from which traversal of the DAWG is to start
register is a map of representations (hint: hashes) OF graphs which extend
from states in the DAWG TO said states
While there is newWord in wordList
Get newWord from wordList
Determine longestPrefix of newWord, starting from startState, which already exists in DAWG
Get longestPrefixEndState, the state which the sequence of transitions defined by longestPrefix leads to
Get suffix of newWord, the substring of newWord after longestPrefix
if longestPrefixEndState has children
replace_or_register(longestPrefixEndState)
endIf
Create the sequence of transitions and states defined by suffix, starting from longestPrefixEndState
endWhile
replace_or_register(startState)
function replace_or_register(argState)
Get greatestCharEndState of argState, the state which the lexicographically-greatest-char-labelled-transition in the outgoing transition set of argState leads to
if greatestCharEndState has children
replace_or_register(greatestCharEndState)
endIf
if there exists state in DAWG that is in the register and is equivalent (has an identical graph extending from it) to greatestCharEndState
Redefine the transition that extends from argState to greatestCharEndState, as one that extends from argState to state
Delete greatestCharEndState
endIf
else
add greatestCharEndState to the register
endElse
考虑到您正在使用java,您可以利用Serializable接口来处理所有序列化和反序列化需求。
如果您感兴趣的是在爪哇中实现上述算法的现有DAWG实现,请检查“AA>”,它还提供了一些其他实现不包括的(包括添加和删除字符串)的漂亮特性,并由我来维护!