Tagging names using Lucene/Java

Question

I have the names of all the employees of my company (5000+). I want to write an engine that can, on the fly, find those names in online articles (blogs, wikis, help documents) and wrap each one in a "mailto" link pointing to that user's email address.

So far, I am planning to remove all the stop words from the article and then search for each remaining word in a Lucene index. But even then I see a lot of queries hitting the index: for example, an article with 2000 words and only two references to people's names will most probably still trigger about 1000 Lucene queries.

Is there a way to reduce the number of queries? Or a completely different way of achieving the same result? Thanks in advance.

Recommended answer

http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
This algorithm might be of use to you. You first compile the entire list of names into one large finite state machine (which may take a while), but once that state machine is built, you can run as many documents through it as you want and detect names efficiently.
I think it looks at every character in each document only once, so it should be much more efficient than tokenizing the document and comparing each word to a list of known names.
There are a bunch of implementations available for different languages on the web. Check them out.
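
To make the idea concrete, here is a minimal Java sketch of an Aho-Corasick matcher that wraps names in mailto links. It is not taken from Lucene or from any particular library: the class name NameTagger, its tag method, and the sample names and addresses are all made up for illustration. It also assumes that the article is plain text (not HTML), that names should match case-insensitively and only on word boundaries, and that when two names overlap the leftmost, longest one wins.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical helper (not part of Lucene): tags known names with mailto links.
class NameTagger {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;     // longest proper suffix of this path that is also a trie path
        Node out;      // nearest node on the fail chain where a name ends
        String email;  // non-null when a complete name ends at this node
        int depth;     // number of characters from the root
    }

    private static final class Match {
        final int start, end;
        final String email;
        Match(int start, int end, String email) { this.start = start; this.end = end; this.email = email; }
    }

    private final Node root = new Node();

    NameTagger(Map<String, String> nameToEmail) {
        // 1. Build a trie of all names, lower-cased character by character.
        for (Map.Entry<String, String> e : nameToEmail.entrySet()) {
            if (e.getKey().isEmpty()) continue;
            Node n = root;
            for (char c : e.getKey().toCharArray()) {
                Node parent = n;
                n = parent.next.computeIfAbsent(Character.toLowerCase(c), k -> new Node());
                n.depth = parent.depth + 1;
            }
            n.email = e.getValue();
        }
        // 2. Breadth-first pass to fill in failure and output links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.remove();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != root && !f.next.containsKey(e.getKey())) f = f.fail;
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.out = (child.fail.email != null) ? child.fail : child.fail.out;
                queue.add(child);
            }
        }
    }

    private static boolean isBoundary(String text, int index) {
        return index < 0 || index >= text.length() || !Character.isLetterOrDigit(text.charAt(index));
    }

    /** Returns the text with each whole-word name occurrence wrapped in a mailto link. */
    String tag(String text) {
        List<Match> matches = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {   // single left-to-right pass over the text
            char c = Character.toLowerCase(text.charAt(i));
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            node = node.next.getOrDefault(c, root);
            // Report every name that ends at position i.
            for (Node n = (node.email != null) ? node : node.out; n != null; n = n.out) {
                int start = i + 1 - n.depth;
                if (isBoundary(text, start - 1) && isBoundary(text, i + 1)) {
                    matches.add(new Match(start, i + 1, n.email));
                }
            }
        }
        // Leftmost match first; among matches with the same start, the longest first.
        matches.sort(Comparator.comparingInt((Match m) -> m.start).thenComparingInt(m -> -m.end));
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (Match m : matches) {
            if (m.start < pos) continue;            // overlaps a match already emitted
            sb.append(text, pos, m.start)
              .append("<a href=\"mailto:").append(m.email).append("\">")
              .append(text, m.start, m.end).append("</a>");
            pos = m.end;
        }
        return sb.append(text, pos, text.length()).toString();
    }

    public static void main(String[] args) {
        Map<String, String> staff = new HashMap<>();   // sample data, invented for this sketch
        staff.put("Ann Lee", "ann.lee@example.com");
        staff.put("Bob Stone", "bob.stone@example.com");
        System.out.println(new NameTagger(staff).tag("Ask Ann Lee or bob stone about it."));
    }
}
```

The automaton is built once from the whole name list and can then be reused for every article, so no per-word index queries are needed. For real pages you would probably run it only over the text nodes of the HTML and escape the inserted values, which this sketch does not do.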


08-04 19:49