Why does the Lucene algorithm not work for an exact string in Java?

Question

I am working with Lucene in Java. We have 100K stop names in a MySQL database. The stop names look like this:

NEW YORK PENN STATION, 
NEWARK PENN STATION,
NEWARK BROAD ST,
NEW PROVIDENCE
etc

When the user searches for NEW YORK, the NEW YORK PENN STATION stop appears in the results, but when the user searches for the exact name NEW YORK PENN STATION, the search returns zero results.

My code is:

public ArrayList<String> getSimilarString(ArrayList<String> source, String querystr)
  {
      ArrayList<String> arResult = new ArrayList<String>();

        try
        {
            // 0. Specify the analyzer for tokenizing text.
            //    The same analyzer should be used for indexing and searching
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

            // 1. create the index
            Directory index = new RAMDirectory();

            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);

            IndexWriter w = new IndexWriter(index, config);

            for(int i = 0; i < source.size(); i++)
            {
                addDoc(w, source.get(i), "1933988" + (i + 1) + "z");
            }

            w.close();

            // 2. query
            // the "title" arg specifies the default field to use
            // when no field is explicitly specified in the query.
            Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");

            // 3. search
            int hitsPerPage = 20;
            IndexReader reader = DirectoryReader.open(index);
            IndexSearcher searcher = new IndexSearcher(reader);
            TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
            searcher.search(q, collector);
            ScoreDoc[] hits = collector.topDocs().scoreDocs;

            // 4. Get results
            for(int i = 0; i < hits.length; ++i) 
            {
                  int docId = hits[i].doc;
                  Document d = searcher.doc(docId);
                  arResult.add(d.get("title"));
            }

            // reader can only be closed when there
            // is no need to access the documents any more.
            reader.close();

        }
        catch(Exception e)
        {
            System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
        }

        return arResult;

  }

  private static void addDoc(IndexWriter w, String title, String isbn) throws IOException 
  {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));

        // use a string field for isbn because we don't want it tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
  }

In this code, source is the list of stop names and querystr is the search input given by the user.

Does the Lucene algorithm work on large strings?

Why does the Lucene algorithm not work for an exact string?

Recommended answer

Instead of

1) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");

Ex: "new york station" will be parsed to "title:new title:york title:station". This query returns every document that contains any of those terms.

try this:

2) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse("+(" + querystr + ")");

Ex1: "new york" will be parsed to "+(title:new title:york)"

The '+' above marks a clause that must occur in the result document. The query matches documents containing "new york" as well as documents containing "new york station".

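Ex2: "new york station" will be parsed to "+(title:new title:york title:station)". The intent is to match "new york station" but not a document that contains only "new york", since "station" is missing there. Note, however, that the '+' applies to the parenthesized group as a whole, so with the default OR operator inside the group a document containing any one of the terms can still match. To require every term, prefix each term with '+' (for example "+new +york +station") or set the parser's default operator to AND.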
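
To make the difference between "any term" and "every term" matching concrete, here is a minimal plain-Java sketch. It is not Lucene code: the class TermMatchSketch and its whitespace tokenizer are hypothetical stand-ins for an analyzer and a boolean query, meant only to illustrate the two semantics on the stop names from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only, NOT Lucene: mimics matching a tokenized field when
// query terms are optional (OR) versus required (AND).
final class TermMatchSketch {

    // Hypothetical stand-in for an analyzer: lowercase, split on whitespace.
    static Set<String> tokenize(String text) {
        return new HashSet<>(Arrays.asList(
            text.trim().toLowerCase(Locale.ROOT).split("\\s+")));
    }

    // OR semantics: a document matches if it contains at least one query term.
    static List<String> matchAny(List<String> docs, String query) {
        Set<String> wanted = tokenize(query);
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            Set<String> terms = tokenize(doc);
            terms.retainAll(wanted); // keep only the terms shared with the query
            if (!terms.isEmpty()) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // AND semantics: a document matches only if it contains every query term.
    static List<String> matchAll(List<String> docs, String query) {
        Set<String> wanted = tokenize(query);
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (tokenize(doc).containsAll(wanted)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> stops = Arrays.asList(
            "NEW YORK PENN STATION", "NEWARK PENN STATION",
            "NEWARK BROAD ST", "NEW PROVIDENCE");
        System.out.println(matchAny(stops, "new york station"));
        System.out.println(matchAll(stops, "new york station"));
    }
}
```

In this sketch, matchAny returns three of the four stops for "new york station" (every stop that shares at least one term with the query), while matchAll returns only NEW YORK PENN STATION.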

Please make sure that the field name 'title' is the field you actually want to search.

As for your questions:

Does the Lucene algorithm work on large strings?

You would have to define what a large string is. Are you actually looking for phrase search? In general, yes, Lucene works for large strings.

Why does the Lucene algorithm not work for an exact string?

Because parsing (querystr + "*") generates individual term queries joined by the OR operator. Ex: "new york*" will be parsed to "title:new OR title:york*".

If you want to find "new york station", the wildcard query above is not the right tool. The StandardAnalyzer you passed in tokenizes "new york station" into three separate terms at index time.

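So the query "york*" finds "york station" only because the document contains the term "york", not because of the wildcard. The term "york" knows nothing about "station": they are separate terms, that is, separate entries in the index.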
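
The point about separate index entries can be sketched with a toy inverted index in plain Java. This is not Lucene code: PrefixTermSketch, buildIndex and prefixLookup are hypothetical names, and the whitespace-and-lowercase tokenizer is an assumed stand-in for the analyzer. It only illustrates that a prefix is compared against single terms and never spans two of them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index, NOT Lucene: each term is its own entry that maps
// to the ids of the documents containing it.
final class PrefixTermSketch {

    // Hypothetical indexing step: lowercase, split on whitespace, record doc ids.
    static TreeMap<String, TreeSet<Integer>> buildIndex(List<String> docs) {
        TreeMap<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            String[] terms = docs.get(id).trim().toLowerCase(Locale.ROOT).split("\\s+");
            for (String term : terms) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Prefix lookup: collect the doc ids of every single term that starts
    // with the prefix. Terms are sorted, so matching terms are contiguous.
    static TreeSet<Integer> prefixLookup(TreeMap<String, TreeSet<Integer>> index,
                                         String prefix) {
        TreeSet<Integer> ids = new TreeSet<>();
        for (Map.Entry<String, TreeSet<Integer>> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // past the last term sharing the prefix
            }
            ids.addAll(e.getValue());
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> stops = Arrays.asList(
            "NEW YORK PENN STATION", "NEWARK PENN STATION",
            "NEWARK BROAD ST", "NEW PROVIDENCE");
        TreeMap<String, TreeSet<Integer>> index = buildIndex(stops);
        System.out.println(index.keySet());                      // the individual terms
        System.out.println(prefixLookup(index, "york"));         // matches via the term "york"
        System.out.println(prefixLookup(index, "york station")); // no single term contains a space
    }
}
```

In this model the prefix "york" finds document 0 through the term "york" alone, while the prefix "york station" finds nothing, because no single index entry contains both words.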

What you actually need for finding an exact string is a PhraseQuery, for which the query string should be "new york" WITH the quotes.
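
As a rough illustration of what phrase matching adds over term matching, the plain-Java sketch below checks that the query terms occur next to each other and in the same order. It is a toy model, not Lucene's PhraseQuery implementation; PhraseMatchSketch and its whitespace tokenizer are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model, NOT Lucene: a phrase matches only when its terms appear
// consecutively and in order within the tokenized document.
final class PhraseMatchSketch {

    // Hypothetical stand-in for an analyzer: lowercase, split on whitespace,
    // and keep the term order (the list position plays the role of a term position).
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // True if the phrase's terms form a contiguous run inside the document.
    static boolean containsPhrase(String doc, String phrase) {
        return Collections.indexOfSubList(tokenize(doc), tokenize(phrase)) >= 0;
    }

    public static void main(String[] args) {
        String stop = "NEW YORK PENN STATION";
        System.out.println(containsPhrase(stop, "new york"));              // adjacent, in order
        System.out.println(containsPhrase(stop, "new york penn station")); // the exact name
        System.out.println(containsPhrase(stop, "york new"));              // wrong order
        System.out.println(containsPhrase(stop, "new penn"));              // not adjacent
    }
}
```

The first two checks succeed and the last two fail, which is the behavior the quoted query string is meant to request: all terms present, adjacent, and in the given order.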


09-17 01:33