java - Java code for testing Solr token filters?

I am trying to write Java code to see how Solr token filters work.

  public class TestFilter {

  public static void main(String[] args) throws IOException {
    StringReader inputText = new StringReader("This is a TEST string");
    Map<String, String> param = new HashMap<>();
    param.put("luceneMatchVersion", "LUCENE_44");

    TokenizerFactory stdTokenFact = new StandardTokenizerFactory(param);
    Tokenizer tokenizer = stdTokenFact.create(inputText);

    param.put("luceneMatchVersion", "LUCENE_44");
    LowerCaseFilterFactory lowerCaseFactory = new LowerCaseFilterFactory(param);
    TokenStream tokenStream = lowerCaseFactory.create(tokenizer);

    CharTermAttribute termAttrib = (CharTermAttribute) tokenStream.getAttribute(CharTermAttribute.class);
    System.out.println("CharTermAttribute Length = " + termAttrib.length());
    while (tokenStream.incrementToken()) {
      String term = termAttrib.toString();
      System.out.println(term);
    }
  }
}

I get this output and error message.

CharTermAttribute Length = 0
Exception in thread "main" java.lang.NullPointerException
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
    at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:171)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at com.utsav.solr.TestFilter.main(TestFilter.java:31)

Why does termAttrib.length() return zero?

What am I missing?

Best answer

According to the JavaDoc of TokenStream:

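  The workflow of the new TokenStream API is as follows:

  Instantiation of TokenStream/TokenFilters, which add attributes to or get attributes from the AttributeSource.
  The consumer calls TokenStream.reset().
  The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
  The consumer calls incrementToken() until it returns false, consuming the attributes after each call.
  The consumer calls end() so that any end-of-stream operations can be performed.
  The consumer calls close() to release any resources when it has finished using the TokenStream.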


You need to rewrite your method as follows:

public static void main(String[] args) throws IOException {
    StringReader inputText = new StringReader("This is a TEST string");
    Map<String, String> param = new HashMap<>();
    param.put("luceneMatchVersion", "LUCENE_44");

    // Build the tokenizer that splits the input into tokens.
    TokenizerFactory stdTokenFact = new StandardTokenizerFactory(param);
    Tokenizer tokenizer = stdTokenFact.create(inputText);

    // Set the match version again for the second factory.
    param.put("luceneMatchVersion", "LUCENE_44");
    LowerCaseFilterFactory lowerCaseFactory = new LowerCaseFilterFactory(param);
    TokenStream tokenStream = lowerCaseFactory.create(tokenizer);

    // Fetch the attribute once and keep a local reference to it.
    CharTermAttribute termAttrib = (CharTermAttribute) tokenStream.getAttribute(CharTermAttribute.class);

    // reset() must be called before the first incrementToken().
    tokenStream.reset();

    // The attribute holds a term only after incrementToken() has returned true.
    while (tokenStream.incrementToken()) {
        System.out.println("CharTermAttribute Length = " + termAttrib.length());

        System.out.println(termAttrib.toString());
    }

    // Perform end-of-stream operations, then release resources.
    tokenStream.end();
    tokenStream.close();
}

This produces the following output:

CharTermAttribute Length = 4
this
CharTermAttribute Length = 2
is
CharTermAttribute Length = 1
a
CharTermAttribute Length = 4
test
CharTermAttribute Length = 6
string

Edit: as noted in the comments, tokenStream.getAttribute does not need to be called over and over again, as the JavaDoc points out:

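Note that only one instance per AttributeImpl is created and reused for every token. This approach reduces object creation and allows local caching of references to the AttributeImpls.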
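
The reuse described above can be illustrated without Lucene on the classpath. The following is only a rough, standard-library model of the pattern, not Lucene code: ToyTermAttribute and ToyTokenStream are hypothetical stand-ins for CharTermAttribute and TokenStream, and splitting on whitespace is a simplifying assumption. It shows why the attribute is empty before the first incrementToken() call, and why a term should be copied with toString() if it has to outlive the current token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AttributeReuseSketch {

    /** Hypothetical stand-in for CharTermAttribute: one mutable buffer, reused for every token. */
    static final class ToyTermAttribute {
        private final StringBuilder buffer = new StringBuilder();

        void set(String term) {
            buffer.setLength(0);
            buffer.append(term);
        }

        int length() {
            return buffer.length();
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    /** Hypothetical stand-in for a lower-casing TokenStream over whitespace-separated text. */
    static final class ToyTokenStream {
        private final String[] words;
        private final ToyTermAttribute term = new ToyTermAttribute();
        private int next = -1; // -1 means reset() has not been called yet

        ToyTokenStream(String text) {
            String trimmed = text.trim();
            this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        /** Always returns the same attribute instance. */
        ToyTermAttribute getTermAttribute() {
            return term;
        }

        void reset() {
            next = 0;
        }

        boolean incrementToken() {
            // Toy choice: fail loudly if reset() was skipped.
            if (next < 0) {
                throw new IllegalStateException("reset() must be called before incrementToken()");
            }
            if (next >= words.length) {
                return false;
            }
            // Overwrite the shared attribute in place instead of allocating a new one.
            term.set(words[next++].toLowerCase(Locale.ROOT));
            return true;
        }
    }

    /** Copies each term with toString() while the stream is positioned on it. */
    static List<String> collectCopies(String text) {
        ToyTokenStream stream = new ToyTokenStream(text);
        ToyTermAttribute term = stream.getTermAttribute(); // fetched once, outside the loop
        stream.reset();
        List<String> terms = new ArrayList<>();
        while (stream.incrementToken()) {
            terms.add(term.toString());
        }
        return terms;
    }

    /** Returns the attribute's length before any token has been produced. */
    static int lengthBeforeFirstToken(String text) {
        return new ToyTokenStream(text).getTermAttribute().length();
    }

    public static void main(String[] args) {
        System.out.println(lengthBeforeFirstToken("This is a TEST string")); // prints 0
        System.out.println(collectCopies("This is a TEST string")); // prints [this, is, a, test, string]
    }
}
```

In this model the single attribute instance is overwritten on every incrementToken() call, so a reference obtained once stays valid for the whole loop, while any term that must be kept has to be copied out first.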

Regarding "java - Java code for testing Solr token filters?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25381564/