How to search special characters in Lucene

This article describes how to search for special characters (+!\?:) in Lucene, and may be a useful reference if you run into the same problem.

Problem description

I want to search for special characters in my index.

I escaped all the special characters in the query string, but when I run a query for + against the Lucene index, it builds the query as +().

As a result, it searches on no fields.

How can I solve this problem? My index contains these special characters.
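
For reference, a minimal sketch of the situation being described, assuming the Lucene 3.x-era API that matches the code shown in the answer below (the field name "content", the version constant and the class name are assumptions for illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class EscapeDemo {
    public static void main(String[] args) throws Exception {
        // Escape the special character so the query parser does not treat it as syntax.
        String escaped = QueryParser.escape("+");   // yields "\+"

        QueryParser parser = new QueryParser(Version.LUCENE_30, "content",
                new StandardAnalyzer(Version.LUCENE_30));
        Query q = parser.parse(escaped);

        // StandardAnalyzer discards the "+" token during analysis, so the parsed
        // query ends up with no terms, which is why nothing is matched.
        System.out.println(q);
    }
}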

Recommended answer

If you are using the StandardAnalyzer, it will discard non-alphanumeric characters. Try indexing the same value with a WhitespaceAnalyzer and see whether that preserves the characters you need. It might also keep stuff you don't want: that's when you might consider writing your own Analyzer, which basically means creating a TokenStream stack that does exactly the kind of processing you need.
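
As a quick check, the following sketch (assuming the same Lucene 3.x-era analysis API as the snippets below; the field name and sample text are made up) prints the tokens each analyzer produces, so you can see whether the special characters survive analysis:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerComparison {
    public static void main(String[] args) throws Exception {
        String text = "foo+bar what? c:\\temp";

        printTokens(new StandardAnalyzer(Version.LUCENE_30), text); // punctuation is stripped
        printTokens(new WhitespaceAnalyzer(), text);                // punctuation is preserved
    }

    private static void printTokens(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(analyzer.getClass().getSimpleName() + ": " + term.term());
        }
        ts.close();
    }
}

Whichever analyzer you settle on, remember to use the same one at query time, otherwise the query terms will not match the indexed tokens.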

For example, the SimpleAnalyzer implements the following pipeline:

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
   return new LowerCaseTokenizer(reader);
}

which just lower-cases the tokens.

The StandardAnalyzer does quite a bit more:

/** Constructs a {@link StandardTokenizer} filtered by a {@link
StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(enableStopPositionIncrements, result, stopSet);
    return result;
 }

You can mix & match from these and other components in org.apache.lucene.analysis, or you can write your own specialized TokenStream instances that are wrapped into a processing pipeline by your custom Analyzer.
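
For instance, a minimal custom Analyzer along those lines might look like the following sketch (the class name is made up, and it assumes the same Lucene 3.x-era tokenStream(String, Reader) API shown above):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class SpecialCharAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // WhitespaceTokenizer splits only on whitespace, so punctuation such as + ! ? : stays inside the tokens.
        TokenStream result = new WhitespaceTokenizer(reader);
        // LowerCaseFilter normalizes case without touching non-letter characters.
        result = new LowerCaseFilter(result);
        return result;
    }
}

Because this pipeline only whitespace-splits and lower-cases, a token like foo+bar is indexed verbatim; whether that is what you want depends on how your data and queries look.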

One other thing to look at is what sort of CharTokenizer you are using. CharTokenizer is an abstract class that specifies the machinery for tokenizing text strings. It is used by some of the simpler Analyzers (but not by the StandardAnalyzer). Lucene comes with two subclasses: a LetterTokenizer and a WhitespaceTokenizer. You can create your own that keeps the characters you need and breaks on the ones you don't by implementing the boolean isTokenChar(char c) method.
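
As a sketch (the class name and the exact set of kept characters are assumptions for illustration), such a tokenizer could look like this:

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;

public class SpecialCharTokenizer extends CharTokenizer {
    public SpecialCharTokenizer(Reader input) {
        super(input);
    }

    @Override
    protected boolean isTokenChar(char c) {
        // Keep letters, digits and the special characters that should remain searchable;
        // break tokens on everything else (whitespace, other punctuation, ...).
        return Character.isLetterOrDigit(c)
                || c == '+' || c == '!' || c == '?' || c == ':' || c == '\\';
    }
}

You could then return this tokenizer (plus any filters you need) from the tokenStream method of a custom Analyzer like the one sketched above.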

That concludes this article on how to search special characters (+!\?:) in Lucene; hopefully the answer above helps.
