

In the Lucene query syntax I'd like to combine * and ~ in a valid query, similar to: bla~* // invalid query


Meaning: match words that begin with "bla" or with something similar to "bla".

Update: What I do now, which works for small inputs, is use the following (a snippet of the Solr schema):

<fieldtype name="text_ngrams" class="solr.TextField">
  <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

In case you don't use Solr, here is what this does:


Index time: index the data by creating a field containing all prefixes of my (short) input.

Search time: use only the ~ operator, since the prefixes are explicitly present in the index.
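
For readers without Solr, the two steps above can be sketched in plain Python. This is only an illustration under assumptions: the names edge_ngrams, levenshtein and fuzzy_prefix_match are made up here, and a plain edit distance stands in for whatever similarity measure Lucene's ~ operator actually applies.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Front-anchored prefixes of token, mimicking an edge n-gram filter with side="front"."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_prefix_match(query, indexed_word, max_edits=1):
    """True if any indexed prefix of indexed_word is within max_edits of query.

    Assumption: max_edits=1 is an arbitrary threshold chosen for this sketch.
    """
    q = query.lower()
    return any(levenshtein(q, gram) <= max_edits
               for gram in edge_ngrams(indexed_word.lower()))
```

With these definitions, fuzzy_prefix_match("bla", "blanket") is true because the prefix "bla" is indexed explicitly, and a word such as "brother" is rejected because none of its prefixes is within one edit of "bla". Note that short prefixes like "bl" are also within one edit of "bla", so the approach is fairly permissive.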



I do not believe Lucene supports anything like this, nor do I believe it has a trivial solution.

"Fuzzy" searches do not operate on a fixed number of characters. bla~ may, for example, match blah, so it must consider the entire term.

What you could do is implement a query expansion algorithm that takes the query bla~* and converts it into a series of OR queries:

bla* OR blb* OR blc* OR ... etc.


But that is really only viable if the string is very short or if you can narrow the expansion based on some rules.
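
A minimal sketch of such an expansion, assuming only single-character substitutions over a lowercase ASCII alphabet (insertions, deletions and other alphabets are left out). The function name expand_fuzzy_prefix is hypothetical, and the output is just a query string, not a call into Lucene.

```python
import string


def expand_fuzzy_prefix(prefix, alphabet=string.ascii_lowercase):
    """Build 'bla* OR blb* OR ...' from every single-character substitution of prefix."""
    variants = [prefix]
    seen = {prefix}
    for position in range(len(prefix)):
        for letter in alphabet:
            candidate = prefix[:position] + letter + prefix[position + 1:]
            if candidate not in seen:  # skip the original and any repeats
                seen.add(candidate)
                variants.append(candidate)
    # Turn each variant into a prefix (wildcard) clause.
    return " OR ".join(variant + "*" for variant in variants)
```

For a three-letter prefix and 26 letters this already yields 76 clauses (the original plus 25 substitutions per position), which shows why the approach only scales to very short strings.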


Alternatively, if the length of the prefix is fixed, you could add a field with the substrings and perform the fuzzy search on that. That would give you what you want, but it will only work if your use case is sufficiently narrow.
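
As a rough sketch of the fixed-length variant: the field name word_prefix and the length of 3 are assumptions picked for illustration, and the query string is not escaped for Lucene's special characters.

```python
PREFIX_LENGTH = 3  # assumed fixed prefix length


def with_prefix_field(word, length=PREFIX_LENGTH):
    """Return a document dict holding the word plus its fixed-length prefix."""
    return {"word": word, "word_prefix": word[:length].lower()}


def fuzzy_prefix_query(user_input, length=PREFIX_LENGTH):
    """Build a fuzzy query string against the (hypothetical) prefix field."""
    return "word_prefix:" + user_input[:length].lower() + "~"
```

Indexing "Blanket" this way stores "bla" in word_prefix, and a search for "blu" becomes word_prefix:blu~, so the fuzzy comparison only ever sees three characters on each side.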


You don't specify exactly why you need this; perhaps doing so will elicit other solutions.

One scenario I can think of is dealing with different forms of a word, e.g. finding both car and cars.


This is easy in English as there are word stemmers available. In other languages it can be quite difficult to implement word stemmers, if not impossible.


In this scenario you can however (assuming you have access to a good dictionary) look up the search term and expand the search programmatically to search for all forms of the word.


E.g. a search for cars is translated into car OR cars. This has been applied successfully for my language in at least one search engine, but is obviously non-trivial to implement.
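
The dictionary lookup could look roughly like this. The tiny WORD_FORMS table is invented for the example; a real system would load a full morphological dictionary for the language in question.

```python
# Hypothetical mini-dictionary: base form -> all known forms.
WORD_FORMS = {
    "car": ["car", "cars"],
    "mouse": ["mouse", "mice"],
}

# Reverse index so that any form leads back to its base form.
BASE_FORM = {form: base for base, forms in WORD_FORMS.items() for form in forms}


def expand_term(term):
    """Expand a search term into an OR query over all known forms of the word."""
    base = BASE_FORM.get(term.lower())
    if base is None:
        return term  # unknown word: search for it unchanged
    return " OR ".join(WORD_FORMS[base])
```

Here expand_term("cars") gives "car OR cars", and a word missing from the dictionary is passed through untouched.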


