lucene - What is the difference between Lucene StandardAnalyzer and EnglishAnalyzer?

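I am using Lucene 4.3 to index tweets written in English, but I am not sure which analyzer to use. What is the difference between Lucene's StandardAnalyzer and EnglishAnalyzer?

I also tried testing StandardAnalyzer with this text: "XY&Z [email protected]". The output is: [xy] [z] [corporation] [xyz] [example.com], but I expected the output to be: [XY&Z] [Corporation] [[email protected]]

Am I doing something wrong?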

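Best answer

Take a look at the source. Analyzers are generally quite readable. You only need to look at the createComponents method to see which Tokenizer and Filters an analyzer uses. For EnglishAnalyzer: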

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, standardfilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopwords);
    if(!stemExclusionSet.isEmpty())
      result = new KeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
}

StandardAnalyzer, by contrast, is just StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter. EnglishAnalyzer adds EnglishPossessiveFilter, KeywordMarkerFilter, and PorterStemFilter on top of that.

In short, EnglishAnalyzer rolls in some English-specific stemming enhancements, which should work well for plain English text.
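To make the difference between the two chains concrete, here is a toy sketch in plain Java. It is not Lucene code: the class name, the tiny stop list, the regex tokenizer, and the one-rule "stemmer" are all hypothetical stand-ins, chosen only to mirror the order of the filters listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ToyAnalyzerChains {
    // Hypothetical mini stop list; Lucene's default English stop set is longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Crude stand-in for StandardTokenizer: runs of letters/digits, allowing inner apostrophes.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+(?:'\\p{L}+)*");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // StandardAnalyzer-like order: tokenize, lowercase, drop stop words.
    static List<String> standardChain(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(term)) {
                out.add(term);
            }
        }
        return out;
    }

    // EnglishAnalyzer-like order: tokenize, strip possessive, lowercase,
    // drop stop words, then stem everything not in the exclusion set.
    static List<String> englishChain(String text, Set<String> stemExclusions) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            String term = stripPossessive(token).toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(term)) {
                continue;
            }
            out.add(stemExclusions.contains(term) ? term : crudeStem(term));
        }
        return out;
    }

    // Removes a trailing 's, which is the job EnglishPossessiveFilter has in the real chain.
    static String stripPossessive(String token) {
        boolean possessive = token.endsWith("'s") || token.endsWith("'S");
        return possessive ? token.substring(0, token.length() - 2) : token;
    }

    // NOT the Porter algorithm: this assumption-laden stand-in only drops a plural "s".
    static String crudeStem(String term) {
        if (term.length() > 3 && term.endsWith("s") && !term.endsWith("ss")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        String text = "The user's tweets";
        System.out.println(standardChain(text));                        // [user's, tweets]
        System.out.println(englishChain(text, new HashSet<String>()));  // [user, tweet]
    }
}
```

Passing a non-empty exclusion set, for example one containing "tweets", keeps that term unstemmed, which is the role the stem exclusion set plays in the createComponents code above.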

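As for StandardAnalyzer, the only assumption I know of that ties it directly to English is the default stop word set, and that is only a default that can be changed. StandardAnalyzer now implements Unicode Standard Annex #29, which attempts to provide text segmentation that is not specific to any language.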
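As a rough illustration of language-neutral word segmentation, the sketch below uses the JDK's java.text.BreakIterator rather than Lucene's StandardTokenizer. That substitution is an assumption made to keep the example dependency-free: the JDK rules follow the same general idea of Unicode word boundaries, but they may differ from Lucene's UAX #29 implementation in details. The class and method names are hypothetical. It does suggest why "XY&Z" from the question comes out as two tokens: "&" is not a word character, so a boundary falls on each side of it.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBoundarySketch {
    // Returns the segments between word boundaries that contain at least one
    // letter or digit, discarding whitespace-only and punctuation-only segments.
    static List<String> words(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);
        List<String> out = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("XY&Z Corporation")); // [XY, Z, Corporation]
    }
}
```

Note that the segmenter itself does not lowercase anything; in StandardAnalyzer the lowercasing seen in the question's output comes from LowerCaseFilter later in the chain.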

Regarding "lucene - What is the difference between Lucene StandardAnalyzer and EnglishAnalyzer?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17011854/