Question
While indexing my documents with Lucene's StandardAnalyzer, I ran into a problem.
For example, my document contains the word "plag-iarism". The analyzer indexes it as "plag" and "iarism", but I want it indexed as "plagiarism". What do I have to do to get the whole word?
Recommended answer
StandardAnalyzer delegates tokenization to StandardTokenizer. You can create your own tokenizer to match your exact needs (you could base it on StandardTokenizer).
Alternatively, if you prefer, you could do a dirty hack: run String.replaceAll() with a suitable regular expression on the text just before the analyzer runs. Yeah. Ugly.
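A minimal sketch of that pre-processing hack, in plain Java with no Lucene dependency. The class name HyphenJoiner, the method joinHyphens, and the choice of pattern (a hyphen with a letter on each side) are assumptions made for illustration; they are not part of Lucene or of the original answer. The cleaned string is what you would then hand to the analyzer.

```java
import java.util.regex.Pattern;

public class HyphenJoiner {
    // Hypothetical pattern: a hyphen that has a letter directly before and
    // after it, such as the "-" in "plag-iarism". Hyphens between digits or
    // surrounded by spaces are left alone.
    private static final Pattern INTRA_WORD_HYPHEN =
            Pattern.compile("(?<=\\p{L})-(?=\\p{L})");

    // Removes intra-word hyphens so that a later tokenizer sees one word.
    static String joinHyphens(String text) {
        return INTRA_WORD_HYPHEN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "a case of plag-iarism - see page 3-4";
        // Pass the cleaned text to the analyzer instead of the raw text.
        System.out.println(joinHyphens(raw));
        // prints: a case of plagiarism - see page 3-4
    }
}
```

Note that this pattern also joins genuine compounds ("well-known" becomes "wellknown"), and query text would need the same treatment for the terms to match, which is part of why the approach is ugly.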