问题描述
这是一道面试题:设计一个自动完成的分布式后端.
This is an interview question: design a distributed back-end for auto-complete.
我会回答如下:
自动完成是通过给定的后缀在字典中进行搜索.字典可能应该组织为 trie.字典是根据最频繁的查询构建的,但这是另一回事.
Auto-complete is a search in a dictionary by a given suffix. The dictionary should be probably organized as a trie. The dictionary is built from the most frequent queries but it's another story.
现在我假设字典不会经常更改(例如每天一次而不是每毫秒一次).因此,我们可以在多个处理自动完成查询的服务器上复制字典(例如,使用负载平衡器和循环策略).
Now I assume the dictionary is not changed frequently (e.g. once a day rather than every millisecond). Thus we can just replicate the dictionary across a number of servers that handle auto-complete queries (e.g. with a load balancer and round-robin policy).
我们也应该考虑字典,但这也是另一回事.
We should also think about dictionary but this is also another story.
有意义吗?我错过了什么吗?
Does it make sense? Am I missing something?
推荐答案
看看什么SOLR4.0(solr 有 trie 并且是分布式的).这在很大程度上取决于他们期望自动完成功能的工作方式.如果它只是一个通配符过滤器,那么像trie这样的东西对于简单来说就可以了ASCII ...否则,如果他们想要自动更正,它会变得更加复杂.话虽如此,我怀疑如果 Trie 是一个通用字段(即不是 SKU 或专用 ID),它会给您带来好的结果,否则您将拥有一个非常大且效率低下的 trie.
Take a look at what SOLR 4.0 (solr has trie's and is distributed). Its highly dependent on how they expect the autocomplete to work. If its just a wild card filter than something like a trie will be fine for simple ASCII... otherwise its gets more complicated if they want auto-correction. That being said I doubt a trie will get you good results if its a generic field (ie not a SKU or specialized ID) otherwise you will have a monstrously large and inefficient trie.
看看:
- 具体看它的Suggester:http://wiki.apache.org/solr/Suggester
- 和 Solr 的分析器:http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
这篇关于自动完成的后端的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!