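I'm trying to write a simple Solr lemmatizer for use in a field type, but I can't seem to find any information on writing a TokenFilter, so I'm a bit lost. Here is the code I have so far.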

import java.io.IOException;
import java.util.List;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class FooFilter extends TokenFilter {

    private static final Logger log = LoggerFactory.getLogger(FooFilter.class);
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);

    public FooFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }

        // buffer() may be longer than the valid term, so read the term via toString()
        String term = termAtt.toString();
        List<String> allForms = Lemmatize.getAllForms(term);
        if (allForms.size() > 0) {
            for (String word : allForms) {
                // Now what?
            }
        }

        return true;
    }
}

Best answer

Next, you want to either replace the current token (termAtt) with the word, or append the word as an additional token.

Example of replacing the current token:

termAtt.setEmpty();
termAtt.copyBuffer(word.toCharArray(), 0, word.length());
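
One detail that may help here: the array returned by buffer() is reused between tokens and can be longer than the current term, which is why copyBuffer takes an explicit length and why the term should be read back using length() (or toString()). The following is a minimal, Lucene-free sketch of that behavior. TermBufferSketch and its fields are hypothetical stand-ins, not Lucene classes; only the method names mirror CharTermAttribute.

```java
/** Hypothetical, Lucene-free model of a reusable term buffer with a separate valid length. */
final class TermBufferSketch {
    private char[] buffer = new char[16]; // reused across tokens, like the attribute's internal array
    private int length = 0;               // number of valid characters in the buffer

    /** Same shape as copyBuffer(char[], int, int): overwrite the term, growing the array if needed. */
    void copyBuffer(char[] src, int offset, int len) {
        if (len > buffer.length) {
            buffer = new char[len];
        }
        System.arraycopy(src, offset, buffer, 0, len);
        length = len; // characters past this index are stale and must be ignored
    }

    char[] buffer() { return buffer; }

    int length() { return length; }

    /** Reads the term while honoring the valid length. */
    @Override
    public String toString() {
        return new String(buffer, 0, length);
    }
}
```

In this model, copying "running" and then "run" leaves stale characters after index 3, so new String(t.buffer()) would not equal "run", while t.toString() would.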

Example of adding new tokens:

For each token you want to add, you must set the CharTermAttribute, and the incrementToken routine must return true.
private List<String> extraTokens = ...

@Override
public boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    // ...
    return true;
  } else if (!extraTokens.isEmpty()) {
    // set the added token and return true
    termAtt.setEmpty().append(extraTokens.remove(0));
    return true;
  }
  return false;
}
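
Note that the snippet above only emits the extra tokens once the input is exhausted. A common alternative is to drain the queue first, so each extra form follows directly after the token it came from and shares its position. In a real Lucene filter this is typically done with captureState()/restoreState() and PositionIncrementAttribute.setPositionIncrement(0). The sketch below models only the ordering logic without Lucene on the classpath; Tok, LemmaExpansionSketch, and the forms map (standing in for Lemmatize.getAllForms) are hypothetical names introduced for illustration.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a token: the term text plus its position increment. */
final class Tok {
    final String term;
    final int posInc;

    Tok(String term, int posInc) {
        this.term = term;
        this.posInc = posInc;
    }

    @Override
    public String toString() {
        return term + "/" + posInc;
    }
}

/** Lucene-free model of a filter that emits each input token followed by its extra forms. */
final class LemmaExpansionSketch {
    private final Iterator<String> input;           // plays the role of the wrapped TokenStream
    private final Map<String, List<String>> forms;  // plays the role of Lemmatize.getAllForms
    private final Deque<String> pending = new ArrayDeque<>();

    LemmaExpansionSketch(Iterator<String> input, Map<String, List<String>> forms) {
        this.input = input;
        this.forms = forms;
    }

    /** Returns the next token, or null at end of stream (incrementToken would return false). */
    Tok next() {
        // 1. Drain queued forms first so they come right after their source token.
        if (!pending.isEmpty()) {
            return new Tok(pending.poll(), 0); // increment 0 = same position as the source token
        }
        // 2. Otherwise advance the input and queue any additional forms of the new term.
        if (!input.hasNext()) {
            return null;
        }
        String term = input.next();
        for (String form : forms.getOrDefault(term, Collections.<String>emptyList())) {
            if (!form.equals(term)) { // skip the form that is already being emitted
                pending.add(form);
            }
        }
        return new Tok(term, 1);
    }
}
```

Calling next() repeatedly until it returns null yields the original tokens with increment 1 and each extra form with increment 0 immediately after its source token. A real filter would also clear the pending queue in reset().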

Regarding "java - Custom Solr TokenFilter lemmatizer", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14936270/
