How do you implement "Did you mean"?

Question

Possible duplicate:
How does the Google "Did you mean?" algorithm work?

Suppose your website already has a search system. How can you implement a "Did you mean: <spell_checked_word>" suggestion like the one Google shows for some search queries?

Recommended answer

What Google does is actually far from trivial, and at first counter-intuitive. They don't do anything like checking against a dictionary. Instead they use statistics to identify "similar" queries that returned more results than your query. The exact algorithm is, of course, not known.
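To make the idea concrete, here is a minimal sketch of "suggest a similar query that returned more results". It is not Google's algorithm, which is not public. The query log `QUERY_RESULT_COUNTS`, the function `did_you_mean`, and the thresholds `min_ratio` and `cutoff` are all hypothetical names and values chosen for illustration; similarity comes from Python's standard `difflib.get_close_matches`.

```python
from difflib import get_close_matches

# Hypothetical query log: query string -> number of results it returned.
QUERY_RESULT_COUNTS = {
    "levenshtein distance": 12400,
    "levenstein distance": 35,
    "lucene tutorial": 8900,
    "soundex algorithm": 2100,
}


def did_you_mean(query, result_counts, min_ratio=10, cutoff=0.8):
    """Return a similar logged query with far more results, or None.

    min_ratio and cutoff are illustrative thresholds, not tuned values.
    """
    own_count = result_counts.get(query, 0)
    # Logged queries whose similarity ratio to `query` is at least `cutoff`.
    candidates = get_close_matches(query, list(result_counts), n=5, cutoff=cutoff)

    best, best_count = None, own_count
    for candidate in candidates:
        if candidate == query:
            continue
        count = result_counts[candidate]
        # Only suggest a query that did much better than the user's own.
        if count > best_count and count >= min_ratio * max(own_count, 1):
            best, best_count = candidate, count
    return best


if __name__ == "__main__":
    print(did_you_mean("levenstein distance", QUERY_RESULT_COUNTS))
```

With this toy log, the misspelled query is mapped to its popular neighbour, while a query that is already the most successful of its neighbours gets no suggestion.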

There are several sub-problems to solve here. As a foundation for anything related to statistical Natural Language Processing, there is one must-have book: Foundations of Statistical Natural Language Processing.

Concretely, for the problem of word/query similarity I have had good results with edit distance, a mathematical measure of string similarity that works surprisingly well. I used Levenshtein distance, but other variants may be worth looking into.
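For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version that keeps only two rows of the table; the function name `levenshtein` is my own choice, not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the first i-1 chars of a and first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # Turning a[:i] into the empty string takes i deletions.
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
```

A small distance relative to the length of the query is what makes a candidate worth suggesting; where to put that threshold is something to tune on your own data.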

Soundex, in my experience, is crap.

Efficiently storing and searching a large dictionary of misspelled words with sub-second retrieval is again non-trivial. Your best bet is to use an existing full-text indexing and retrieval engine (i.e. not the one built into your database). Lucene is currently one of the best of these, and it happens to have been ported to many platforms.
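To show why an index helps, here is a toy inverted index over character trigrams. It narrows a large word list down to a few candidates without comparing the query against every entry; the survivors can then be ranked more carefully, for example by edit distance. This is not Lucene's API and not a substitute for a real engine. The names `trigrams` and `TrigramIndex`, and the `$` padding character, are assumptions made for this sketch.

```python
from collections import defaultdict


def trigrams(word):
    """Set of character trigrams of word, padded so prefixes and suffixes count."""
    padded = f"$${word.lower()}$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramIndex:
    """Toy inverted index: trigram -> set of words containing it."""

    def __init__(self, words=()):
        self._postings = defaultdict(set)
        for word in words:
            self.add(word)

    def add(self, word):
        for gram in trigrams(word):
            self._postings[gram].add(word)

    def candidates(self, query, limit=5):
        """Words sharing the most trigrams with query, best first."""
        overlap = defaultdict(int)
        for gram in trigrams(query):
            for word in self._postings.get(gram, ()):
                overlap[word] += 1
        # Sort by shared trigram count (descending), then alphabetically.
        ranked = sorted(overlap.items(), key=lambda item: (-item[1], item[0]))
        return [word for word, _ in ranked[:limit]]


if __name__ == "__main__":
    index = TrigramIndex(["lucene", "levenshtein", "soundex", "dictionary"])
    print(index.candidates("lucne"))
```

Only words that share at least one trigram with the query are ever touched, which is what keeps lookups fast as the dictionary grows.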

