Problem description
I've been tasked with creating a simple spell checker for an assignment but have been given next to no guidance, so I was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guidelines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
- One letter difference (either added or subtracted to get the word the same as a word in the dictionary). For example 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.
- One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly: I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here, so while I think I know a way to get this to work, I'm really not too sure how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then, for each correctly spelled word that's loaded in, add the correctly spelled word as a key in the map and populate the vector with all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled words. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory though, because there would surely be thousands of permutations for each word? It also seems like it'd be very, very slow if my initial dictionary of correctly spelled words was large?
I was thinking that maybe I could cut down the time a bit by only looking at the keys that are similar to the word I'm checking. But then again, if they're similar in some way then it probably means the key will be a suggestion, meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help, as I really am not sure how to estimate the speed of the different approaches (we haven't been taught this at all in class).
Answer
The simplest way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that while the removal of a letter creates few "bad words", the addition or substitution of a letter gives you many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein distance.
The solution is described in incremental steps; the search speed should improve with each idea, and I have tried to organize them with the simpler ideas (in terms of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
- Implement the Levenshtein distance algorithm
- Store the dictionary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would be better performance-wise)
Key points:
- The Levenshtein distance computation uses an array; at each step the next row is computed from the previous row only
- The minimum distance in a row is always greater than (or equal to) the minimum of the previous row
The latter property allows a short-circuit implementation: if you want to limit yourself to 2 errors (the threshold), then whenever the minimum of the current row is greater than 2 you can abandon the computation. A simple strategy is to return threshold + 1 as the distance.
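A sketch of that short-circuited distance in C++ (the function name and the threshold convention are my own, not from the original answer):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Levenshtein distance computed row by row, keeping only the previous row.
// If every entry of the current row exceeds `threshold`, the final distance
// must too, so we abandon the computation and return threshold + 1.
int levenshtein(const std::string& a, const std::string& b, int threshold) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<int>(i);
        int rowMin = curr[0];
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,           // deletion
                                curr[j - 1] + 1,       // insertion
                                prev[j - 1] + cost});  // substitution / match
            rowMin = std::min(rowMin, curr[j]);
        }
        if (rowMin > threshold)
            return threshold + 1;  // short-circuit: no point continuing
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

Note the two-row trick: since each row depends only on the row above, two vectors swapped at each step are enough, instead of a full matrix.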
1. First attempt
Let's start simple.
We'll implement a linear scan: for each word we compute the (short-circuited) distance, and we list the words that have achieved the smallest distance so far.
It works very well on smallish dictionaries.
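A minimal sketch of that linear scan (names are my own; the compact distance here omits the short-circuit to keep the example small):

```cpp
#include <algorithm>
#include <limits>
#include <string>
#include <vector>

// Compact two-row Levenshtein distance (no short-circuit, for brevity).
int editDistance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j)
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)});
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Linear scan: keep only the words that achieve the smallest distance so far.
std::vector<std::string> suggest(const std::string& word,
                                 const std::vector<std::string>& dictionary) {
    std::vector<std::string> best;
    int bestDist = std::numeric_limits<int>::max();
    for (const std::string& candidate : dictionary) {
        int d = editDistance(word, candidate);
        if (d < bestDist) { bestDist = d; best.clear(); }
        if (d == bestDist) best.push_back(candidate);
    }
    return best;
}
```

With the dictionary from the question, `suggest("godd", {"hello", "goodbye", "fantastic", "good", "god"})` yields [good, god], matching the expected output.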
2. Improving the data structure
The Levenshtein distance between two words is at least equal to the difference of their lengths.
By using the couple (length, word) as a key instead of just the word, you can restrict your search to the length range [length - edit, length + edit] and greatly reduce the search space.
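One way this could look (the container choice and names are assumptions): key a std::set by the pair (length, word), so all words of a given length are contiguous and lower_bound can jump straight to the start of the relevant range.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

using Dict = std::set<std::pair<std::size_t, std::string>>;  // (length, word)

// Visit only the words whose length lies in [len - maxEdit, len + maxEdit];
// any other word cannot be within maxEdit errors of `word`.
std::vector<std::string> candidatesByLength(const Dict& dict,
                                            const std::string& word,
                                            std::size_t maxEdit) {
    std::size_t len = word.size();
    std::size_t lo = len > maxEdit ? len - maxEdit : 0;
    std::size_t hi = len + maxEdit;
    std::vector<std::string> out;
    // lower_bound jumps directly to the first word of length lo.
    for (auto it = dict.lower_bound({lo, std::string()});
         it != dict.end() && it->first <= hi; ++it)
        out.push_back(it->second);
    return out;
}
```

The distance computation then runs only on this reduced candidate list.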
3. Prefixes and pruning
To improve on this, we can remark that when we build the distance matrix, row by row, one word is entirely scanned (the word we look for) but the other (the referent) is not: we only use one letter of it for each row.
This very important property means that for two referents that share the same initial sequence (prefix), the first rows of the matrix will be identical.
Remember that I asked you to store the dictionary sorted? It means that words sharing the same prefix are adjacent.
Suppose that you are checking your word against 'cartoon' and at 'car' you realize it does not work (the distance is already too large); then any word beginning with 'car' won't work either, and you can skip words as long as they begin with 'car'.
The skip itself can be done either linearly or with a binary search (find the first word that has a higher prefix than 'car'):
- linear works best if the prefix is long (few words to skip)
- binary search works best if the prefix is short (many words to skip)
How long "long" is depends on your dictionary, and you'll have to measure. I would start with the binary search.
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space.
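The binary-search skip can be sketched like this (assuming non-empty lowercase words, so bumping the last character of the prefix is safe; the helper name is my own):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// After ruling out `prefix`, jump to the first word that does not start
// with it. The smallest string greater than every word starting with "car"
// is "cas", obtained by incrementing the prefix's last character
// (safe for a-z words; `prefix` must be non-empty).
std::vector<std::string>::const_iterator
skipPrefix(const std::vector<std::string>& sortedDict,
           std::vector<std::string>::const_iterator from,
           const std::string& prefix) {
    std::string upper = prefix;
    ++upper.back();
    return std::lower_bound(from, sortedDict.end(), upper);
}
```

For example, in the sorted dictionary {"car", "cartoon", "carwash", "cat", "dog"}, skipping the prefix "car" lands on "cat".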
4. Prefixes and re-use
Now we'll also try to re-use the computation as much as possible (and not just the "it does not work" result).
Suppose that you have two words:
- cartoon
- carwash
You first compute the matrix, row by row, for 'cartoon'. Then, when reading 'carwash', you need to determine the length of the common prefix (here 'car'), and you can keep the first 4 rows of the matrix (corresponding to void, 'c', 'a', 'r').
Therefore, when you begin computing 'carwash', you in fact begin iterating at 'w'.
To do this, simply use an array allocated once at the beginning of your search, and make it large enough to accommodate the largest reference (you should know the largest word length in your dictionary).
5. Using a better data structure
To have an easier time working with prefixes, you could use a Trie or a Patricia tree to store the dictionary. However, neither is an STL data structure, and you would need to augment it to store in each subtree the range of word lengths it contains, so you'd have to write your own implementation. It's not as easy as it seems, because there are memory-explosion issues which can kill locality.
This is a last resort option. It's costly to implement.