Hibernate Search: searching any part of a field without losing the field's content while indexing

Problem description

I would like to be able to find an entity based on any part of its indexed fields, and the fields must not lose any content while indexing.

Let's say I have the following sample entity class:

@Entity
public class E {
    private String f;
    // ...
}

And if the value of f in one entity is "This is a nice field!", I would like to be able to find it by any of these queries:

  • "this"
  • "a"
  • "IC"
  • "!"
  • "This is a nice field!"

The most obvious approach is to annotate the entity this way:

@Entity
@Indexed
@AnalyzerDef(name = "a",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = @TokenFilterDef(factory = LowerCaseFilterFactory.class)
)
@Analyzer(definition = "a")
public class E {
    @Field
    private String f;
    // ...
}

And then search like this:

String queryString;
// ...
org.apache.lucene.search.Query query = queryBuilder
        .keyword()
        .wildcard()
        .onField("f")
        .matching("*" + queryString.toLowerCase() + "*")
        .createQuery();

But the documentation states that, for better performance, it is recommended that queries not start with ? or *.

So, as I understand it, this method is inefficient.

The other idea is to use n-grams like this:

@Entity
@Indexed
@AnalyzerDef(name = "a",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = NGramFilterFactory.class,
                        params = {
                                @Parameter(name = "minGramSize", value = "1"),
                                @Parameter(name = "maxGramSize", value = E.MAX_LENGTH)
                        })
        }
)
@Analyzer(definition = "a")
public class E {
    static final String MAX_LENGTH = "42";
    @Field
    private String f;
    // ...
}

And create queries this way:

String queryString;
// ...
org.apache.lucene.search.Query query = queryBuilder
        .keyword()
        .onField("f")
        .ignoreAnalyzer()
        .matching(queryString.toLowerCase())
        .createQuery();

This time no wildcard queries are used, and the analyzer in the query is ignored. I'm not sure whether ignoring the analyzer is good or bad, but it works with the analyzer ignored.
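To see why the analyzer has to be ignored here, one can inspect the tokens the indexing analyzer produces for a query string. A minimal sketch, assuming a fullTextEntityManager is at hand and the code runs in a method that may throw IOException (SearchFactory.getAnalyzer(String) and the Lucene TokenStream API are standard):

org.apache.lucene.analysis.Analyzer analyzer =
        fullTextEntityManager.getSearchFactory().getAnalyzer("a");
try (org.apache.lucene.analysis.TokenStream ts = analyzer.tokenStream("f", "nice")) {
    org.apache.lucene.analysis.tokenattributes.CharTermAttribute term =
            ts.addAttribute(org.apache.lucene.analysis.tokenattributes.CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        // with the n-gram definition above this prints n, i, c, e, ni, ic, ce, nic, ...
        System.out.println(term.toString());
    }
    ts.end();
}

If the query string were analyzed with "a", it would itself be split into all these grams, and the keyword query would match far too broadly.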

Another possible solution would be to use WhitespaceTokenizerFactory instead of KeywordTokenizerFactory when using n-grams, then split queryString on spaces and combine a search for each substring using MUST. In this approach, as I understand it, far fewer n-grams would be built if the string stored in f has length E.MAX_LENGTH, which should be good for performance. I would also be able to find the previously described entity with, for example, the query "hi ield". That would be ideal.
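A hedged sketch of that MUST combination using the Hibernate Search boolean DSL (queryBuilder and queryString as above; ignoreAnalyzer() kept as in the n-gram variant, which is exactly what the P.S. below asks about):

org.hibernate.search.query.dsl.BooleanJunction<?> junction = queryBuilder.bool();
for (String part : queryString.toLowerCase().split("\\s+")) {
    // every whitespace-separated part of the query must match somewhere in f
    junction = junction.must(queryBuilder
            .keyword()
            .onField("f")
            .ignoreAnalyzer()
            .matching(part)
            .createQuery());
}
org.apache.lucene.search.Query query = junction.createQuery();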

So what would be the best way to deal with my problem? Or are all my ideas bad?

P.S. Should one ignore the analyzer in queries when using n-grams?

Recommended answer

This is more or less the ideal solution, except for one thing: you shouldn't ignore the analyzer when querying. What you should do is define another analyzer without the ngram filter, but with the tokenizer, lowercase filter, etc., and explicitly instruct Hibernate Search to use that analyzer at query time.

The other solutions are too expensive, either in I/O and CPU at query time (first solution) or in storage space (second solution). Note that this third solution may still be rather expensive in storage space, depending on the value of E.MAX_LENGTH. It's generally recommended to only have a difference of one or two between minGramSize and maxGramSize, to avoid the indexing of too many grams.

Just define another analyzer, name it something like "ngram_query", and when you need to build the query, create the query builder like this:

QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory()
        .buildQueryBuilder().forEntity(E.class)
        .overridesForField("f" /* name of the field */, "ngram_query")
        .get();
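For reference, here is a sketch of what the two analyzer definitions could look like side by side on the entity from the question (an assumption based on the earlier n-gram mapping; the query-time definition simply drops the n-gram filter):

@Entity
@Indexed
@AnalyzerDefs({
        // index-time analyzer: keyword tokenizer + lowercase + n-grams, as above
        @AnalyzerDef(name = "a",
                tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
                filters = {
                        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                        @TokenFilterDef(factory = NGramFilterFactory.class,
                                params = {
                                        @Parameter(name = "minGramSize", value = "1"),
                                        @Parameter(name = "maxGramSize", value = E.MAX_LENGTH)
                                })
                }),
        // query-time analyzer: same chain, just without the n-gram filter
        @AnalyzerDef(name = "ngram_query",
                tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
                filters = @TokenFilterDef(factory = LowerCaseFilterFactory.class))
})
@Analyzer(definition = "a")
public class E {
    static final String MAX_LENGTH = "42";
    @Field
    private String f;
    // ...
}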

Then create the query as usual.
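That is, the same keyword query as in the n-gram attempt, but without the ignoreAnalyzer() call; a minimal sketch:

org.apache.lucene.search.Query query = queryBuilder
        .keyword()
        .onField("f")
        .matching(queryString) // analyzed with "ngram_query": lowercased, but not n-grammed
        .createQuery();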

Note that, if you rely on Hibernate Search to push the index schema and analyzers to Elasticsearch, you will have to use a hack in order for the query-only analyzer to be pushed: by default only the analyzers that are actually used during indexing are pushed. See https://discourse.hibernate.org/t/cannot-find-the-overridden-analyzer-when-using-overridesforfield/1043/4
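Judging from that thread, the workaround amounts to referencing the query-only analyzer from some indexed field so that it gets pushed along with the others. A hedged sketch, where the extra field name f_query is purely illustrative:

@Fields({
        @Field, // the regular field, analyzed at index time with the "a" definition
        // referencing "ngram_query" from an indexed field forces it to be pushed
        @Field(name = "f_query", analyzer = @Analyzer(definition = "ngram_query"))
})
private String f;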
