Querying part-of-speech tags with Lucene 7 and OpenNLP

Question

For fun and learning, I am trying to build a part-of-speech (POS) tagger with OpenNLP and Lucene 7.4. The goal is that, once the text is indexed, I can search for a sequence of POS tags and find all sentences that match that sequence. I already have the indexing part working, but I am stuck on the query part. I am aware that Solr might have some functionality for this, and I have already checked its code (which was not very self-explanatory after all). But my goal is to understand and implement this in Lucene 7, not in Solr, as I want to be independent of any search engine on top.

Idea

Input sentence 1: The quick brown fox jumped over the lazy dogs.
Applying the Lucene OpenNLP tokenizer results in: [The][quick][brown][fox][jumped][over][the][lazy][dogs][.]
Next, applying Lucene OpenNLP POS tagging results in: [DT][JJ][JJ][NN][VBD][IN][DT][JJ][NNS][.]

Input sentence 2: Give it to me, baby!
Applying the Lucene OpenNLP tokenizer results in: [Give][it][to][me][,][baby][!]
Next, applying Lucene OpenNLP POS tagging results in: [VB][PRP][TO][PRP][,][UH][.]

Query: JJ NN VBD matches part of sentence 1, so sentence 1 should be returned. (At this point I am only interested in exact matches, i.e. let's leave aside partial matches, wildcards, etc.)
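The matching rule described here, an exact and contiguous run of tags, can be sketched without Lucene at all. The class and method names below are hypothetical, and the tag lists are copied from the two example sentences; this only illustrates the intended semantics, not how Lucene evaluates a query.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PosSequenceMatch {

    // Returns true if `query` occurs as a contiguous run inside `tags`.
    static boolean containsSequence(List<String> tags, List<String> query) {
        return Collections.indexOfSubList(tags, query) >= 0;
    }

    public static void main(String[] args) {
        List<String> sentence1 = Arrays.asList(
                "DT", "JJ", "JJ", "NN", "VBD", "IN", "DT", "JJ", "NNS", ".");
        List<String> sentence2 = Arrays.asList(
                "VB", "PRP", "TO", "PRP", ",", "UH", ".");
        List<String> query = Arrays.asList("JJ", "NN", "VBD");

        System.out.println(containsSequence(sentence1, query)); // true
        System.out.println(containsSequence(sentence2, query)); // false
    }
}
```

A search index has to answer the same question without scanning every sentence, which is why the tags need to be searchable terms with positions.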

Indexing

First, I created my own class com.example.OpenNLPAnalyzer:

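package com.example;

import java.io.IOException;

import opennlp.tools.postag.POSModel;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerModel;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.opennlp.OpenNLPPOSFilter;
import org.apache.lucene.analysis.opennlp.OpenNLPTokenizer;
import org.apache.lucene.analysis.opennlp.tools.NLPPOSTaggerOp;
import org.apache.lucene.analysis.opennlp.tools.NLPSentenceDetectorOp;
import org.apache.lucene.analysis.opennlp.tools.NLPTokenizerOp;
import org.apache.lucene.analysis.opennlp.tools.OpenNLPOpsFactory;
import org.apache.lucene.analysis.payloads.TypeAsPayloadTokenFilter;
import org.apache.lucene.analysis.util.ClasspathResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoader;
import org.apache.lucene.util.AttributeFactory;

public class OpenNLPAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    try {
      // Load the OpenNLP model files from the classpath
      ResourceLoader resourceLoader = new ClasspathResourceLoader(ClassLoader.getSystemClassLoader());

      TokenizerModel tokenizerModel = OpenNLPOpsFactory.getTokenizerModel("en-token.bin", resourceLoader);
      NLPTokenizerOp tokenizerOp = new NLPTokenizerOp(tokenizerModel);

      SentenceModel sentenceModel = OpenNLPOpsFactory.getSentenceModel("en-sent.bin", resourceLoader);
      NLPSentenceDetectorOp sentenceDetectorOp = new NLPSentenceDetectorOp(sentenceModel);

      Tokenizer source = new OpenNLPTokenizer(
          AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, sentenceDetectorOp, tokenizerOp);

      POSModel posModel = OpenNLPOpsFactory.getPOSTaggerModel("en-pos-maxent.bin", resourceLoader);
      NLPPOSTaggerOp posTaggerOp = new NLPPOSTaggerOp(posModel);

      // Perhaps we should also use a lower-case filter here?

      // Sets each token's type attribute to its POS tag
      TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);

      // Very important: token types are not indexed, so we need to store
      // them as payloads, otherwise we cannot search on them
      TypeAsPayloadTokenFilter payloadFilter = new TypeAsPayloadTokenFilter(posFilter);

      return new TokenStreamComponents(source, payloadFilter);
    } catch (IOException e) {
      // Keep the original exception as the cause
      throw new RuntimeException(e);
    }
  }
}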

Note that we are using a TypeAsPayloadTokenFilter wrapped around OpenNLPPOSFilter. This means that our POS tags will be indexed as payloads, and our query, whatever it ends up looking like, will have to search on payloads as well.
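As a rough mental model of what indexing tags as payloads implies (a toy sketch, not Lucene's actual data structures): the term text is what gets looked up, and the payload is just bytes stored next to each position. The `Posting` class and `termWithPayload` method are hypothetical, and the sketch assumes the payload holds the tag as UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PayloadCheckSketch {

    // One indexed position: the term is searchable, the payload is opaque bytes beside it.
    static final class Posting {
        final String term;
        final byte[] payload;

        Posting(String term, String type) {
            this.term = term;
            this.payload = type.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Positions where the term matches AND the payload equals the expected bytes.
    static List<Integer> termWithPayload(List<Posting> doc, String term, String expectedPayload) {
        byte[] expected = expectedPayload.getBytes(StandardCharsets.UTF_8);
        List<Integer> hits = new ArrayList<>();
        for (int pos = 0; pos < doc.size(); pos++) {
            Posting p = doc.get(pos);
            if (p.term.equals(term) && Arrays.equals(p.payload, expected)) {
                hits.add(pos);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Posting> doc = Arrays.asList(
                new Posting("The", "DT"), new Posting("quick", "JJ"),
                new Posting("brown", "JJ"), new Posting("fox", "NN"));

        System.out.println(termWithPayload(doc, "brown", "JJ")); // [2]
        System.out.println(termWithPayload(doc, "JJ", "type"));  // []
    }
}
```

Under this model a payload check can only narrow down matches for a word that is already a term; a tag such as JJ is never a term itself, so looking it up directly finds nothing.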

Querying

This is where I am stuck. I have no clue how to query on payloads, and whatever I try does not work. Note that I am using Lucene 7, and it seems that the way to query payloads changed several times across older versions. Documentation is extremely scarce. It's not even clear what the proper field name to query is now: is it "word", or "type", or something else? For example, I tried this code, which does not return any search results:

    // Step 1: Indexing
    final String body = "The quick brown fox jumped over the lazy dogs.";
    Directory index = new RAMDirectory();
    OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    IndexWriter writer = new IndexWriter(index, indexWriterConfig);
    Document document = new Document();
    document.add(new TextField("body", body, Field.Store.YES));
    writer.addDocument(document);
    writer.close();


    // Step 2: Querying
    final int topN = 10;
    DirectoryReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);

    final String fieldName = "body"; // What is the correct field name here? "body", or "type", or "word" or anything else?
    final String queryText = "JJ";
    Term term = new Term(fieldName, queryText);
    SpanQuery match = new SpanTermQuery(term);
    BytesRef pay = new BytesRef("type"); // Don't understand what to put here as an argument
    SpanPayloadCheckQuery query = new SpanPayloadCheckQuery(match, Collections.singletonList(pay));

    System.out.println(query.toString());

    TopDocs topDocs = searcher.search(query, topN);

Any help is much appreciated.

Recommended answer

Why not use TypeAsSynonymFilter instead of TypeAsPayloadTokenFilter and just run a normal query? In your Analyzer:

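// ... tokenizer and POS tagger set up as in the question ...
TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
// Emit each token's type (its POS tag) as an extra token at the same position
TypeAsSynonymFilter typeAsSynonymFilter = new TypeAsSynonymFilter(posFilter);
return new TokenStreamComponents(source, typeAsSynonymFilter);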
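To see why this makes phrase queries on tags work, here is a toy positional index (a simplified sketch, not Lucene's implementation) in which every word and its POS tag are stored at the same position, which is the effect TypeAsSynonymFilter has on the token stream. The class name and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SynonymPositionSketch {

    // term -> positions at which it occurs (single document)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index each word and its POS tag at the SAME position.
    void add(String[] words, String[] tags) {
        for (int pos = 0; pos < words.length; pos++) {
            postings.computeIfAbsent(words[pos], k -> new HashSet<>()).add(pos);
            postings.computeIfAbsent(tags[pos], k -> new HashSet<>()).add(pos);
        }
    }

    // Exact phrase match: term i must occur at position start + i.
    boolean matchesPhrase(String... phrase) {
        if (phrase.length == 0) {
            return false;
        }
        Set<Integer> starts = postings.getOrDefault(phrase[0], Collections.emptySet());
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < phrase.length && ok; i++) {
                ok = postings.getOrDefault(phrase[i], Collections.emptySet()).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SynonymPositionSketch index = new SynonymPositionSketch();
        index.add(
                new String[] {"The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dogs", "."},
                new String[] {"DT", "JJ", "JJ", "NN", "VBD", "IN", "DT", "JJ", "NNS", "."});

        System.out.println(index.matchesPhrase("JJ", "NN", "VBD"));  // true
        System.out.println(index.matchesPhrase("brown", "fox"));     // true
        System.out.println(index.matchesPhrase("fox", "brown"));     // false
        System.out.println(index.matchesPhrase("JJ", "fox", "VBD")); // true
    }
}
```

Because words and tags share positions, a phrase mixing both (the last call) matches in this model too; with the real filter the same should hold, which may or may not be what you want.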

Index side:

// Field name used for both indexing and searching (shown as "body:" in the query output)
static final String FIELD = "body";

static Directory index() throws Exception {
  Directory index = new RAMDirectory();
  OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
  IndexWriter writer = new IndexWriter(index, indexWriterConfig);
  writer.addDocument(doc("The quick brown fox jumped over the lazy dogs."));
  writer.addDocument(doc("Give it to me, baby!"));
  writer.close();

  return index;
}

// Wraps one sentence in a document with a single stored, analyzed field
static Document doc(String body) {
  Document document = new Document();
  document.add(new TextField(FIELD, body, Field.Store.YES));
  return document;
}

Search side:

static void search(Directory index, String searchPhrase) throws Exception {
  final int topN = 10;
  DirectoryReader reader = DirectoryReader.open(index);
  IndexSearcher searcher = new IndexSearcher(reader);

  // WhitespaceAnalyzer does not lower-case, so tags such as "JJ" are searched as typed
  QueryParser parser = new QueryParser(FIELD, new WhitespaceAnalyzer());
  Query query = parser.parse(searchPhrase);
  System.out.println(query);

  TopDocs topDocs = searcher.search(query, topN);
  System.out.printf("%s => %d hits\n", searchPhrase, topDocs.totalHits);
  for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document doc = searcher.doc(scoreDoc.doc);
    System.out.printf("\t%s\n", doc.get(FIELD));
  }
  reader.close();
}

Then use them like this:

public static void main(String[] args) throws Exception {
  Directory index = index();
  search(index, "\"JJ NN VBD\"");    // search the sequence of POS tags
  search(index, "\"brown fox\"");    // search a phrase
  search(index, "\"fox brown\"");    // search a phrase (no hits)
  search(index, "baby");             // search a word
  search(index, "\"TO PRP\"");       // search the sequence of POS tags
}

The results look like this:

body:"JJ NN VBD"
"JJ NN VBD" => 1 hits
    The quick brown fox jumped over the lazy dogs.
body:"brown fox"
"brown fox" => 1 hits
    The quick brown fox jumped over the lazy dogs.
body:"fox brown"
"fox brown" => 0 hits
body:baby
baby => 1 hits
    Give it to me, baby!
body:"TO PRP"
"TO PRP" => 1 hits
    Give it to me, baby!

