如何在Lucene文档中定义主键字段以获得最佳查找性能

如何在Lucene文档中定义主键字段以获得最佳查找性能

本文介绍了如何在Lucene文档中定义主键字段以获得最佳查找性能?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在Lucene索引(v7.2)中创建文档时,我向其中添加一个uid字段,其中包含唯一的id/key(字符串):

When creating a document in my Lucene index (v7.2), I add a uid field to it which contains a unique id/key (string):

doc.add(new StringField("uid",uid,Field.Store.YES))

要在以后检索该文档,我为给定的唯一ID创建TermQuery并使用IndexSearcher进行搜索:

To retrieve that document later on, I create a TermQuery for the given unique id and search for it with an IndexSearcher:

searcher.search(new TermQuery(new Term("uid",uid)),1)

作为Lucene的新手",我想知道以下内容:

Being a Lucene "novice", I would like to know the following:

  1. 我应该如何改进这种方法以获得最佳的查询性能?例如,如果我将唯一ID存储为字节数组而不是字符串?还是有一些特殊的编解码器或过滤器可以使用?

  1. How should I improve this approach to get the best lookup performance?Would it, for example, make a difference if I store the unique id asa byte array instead of as a string? Or are there some special codecs or filters that can be used?

通过文档的唯一ID查找文档的时间复杂度是多少?由于索引中每个文档至少包含一个唯一术语,因此查找时间将随着数量的增加而线性增加文档(O(n)),对吗?

What is the time complexity of looking up a document by its unique id? Since the index contains at least one unique term for each document, the lookup times will increase linearly with the number of documents (O(n)), right?

推荐答案

理论

有一个博客文章有关Lucene术语索引和查找性能的信息.它清楚地揭示了通过id查找文档的复杂性的所有细节.这篇文章很老,但是从那以后没有任何改变.

Theory

There is a blog post about Lucene term index and lookup performance. It clearly reveals all the details of complexity of looking up a document by id. This post is quite old, but nothing was changed since then.

以下是与您的问题有关的重点内容:

Here is some highlights related to your question:

  • Lucene是一个搜索引擎,其中检索的最小元素是一个文本项,因此这意味着: BlockTree术语词典.
  • 通常,查找的复杂度取决于术语长度:Lucene使用内存中的前缀三叉戟索引结构来执行术语查找.由于现实世界中硬件和软件实现的限制(为了避免进行过多的磁盘读取和过多的尝试而导致内存溢出),Lucene使用BlockTree结构.这意味着它将前缀尝试存储在磁盘上的小块中,并且一次仅加载一个块.这就是为什么以易于阅读的顺序生成密钥如此重要的原因.因此,让我们根据影响因素的程度来排列它们:
    • 术语的长度-要加载的块更多
    • term的模式-避免多余的读取
    • 术语数-减少块数
    • Lucene is a search engine where the minimum element of retrieval is a text term, so this means: binary, number and string fields are represented as strings in the BlockTree terms dictionary.
    • In general, the complexity of lookup depends on the term length: Lucene uses an in-memory prefix-trie index structure to perform a term lookup. Due to restrictions of real-world hardware and software implementations (in order to avoid superfluous disk reads and memory overflow for extremely large tries), Lucene uses a BlockTree structure. This means it stores prefix-trie in small chunks on disk and loads only one chunk at time. This is why it's so important to generate keys in an easy-to-read order. So let's arrange the factors according to the degree of their influence:
      • term's length - more chunks to load
      • term's pattern - to avoid superfluous reads
      • terms count - to reduce chunks count

      让term为单个字符串,而让term dictionary为大量术语.如果我们有一个术语词典,并且我们需要知道单个术语是否在词典中,则trie(以及最小确定性非循环有限状态自动机(DAFSA)作为子类)是可以帮助我们的数据结构.关于您的问题:为什么使用尝试是否可以执行哈希查找?",这有几个原因:

      Let term be a single string and let term dictionary be a large set of terms. If we have a term dictionary, and we need to know whether a single term is inside the dictionary, the trie (and minimal deterministic acyclic finite state automaton (DAFSA) as a subclass) is the data structure that can help us. On your question: "Why use tries if a hash lookup can do the same?", here are a few reasons:

      • 尝试可以在O(L)时间中找到字符串(其中L表示单个术语的长度).与最坏情况下的哈希表(哈希表需要线性扫描,以防发生哈希冲突和诸如MurmurHash3之类的复杂哈希算法)相比,速度要快一些,或者与完美情况下的哈希表类似.
      • 哈希表只能找到与我们要查找的单个术语完全匹配的字典中的术语;而通过trie,我们可以找到具有单个不同字符,共同前缀,缺少字符等的词语.
      • trie可以按键提供条目的字母顺序,因此我们可以按字母顺序枚举所有术语.
      • 特里(尤其是DAFSA)提供了具有重复数据删除功能的术语的非常紧凑的表示形式.

      以下是DAFSA的三个术语示例:浴,蝙蝠和批处理:

      Here is an example of DAFSA for 3 terms: bath, bat and batch:

      在进行键查找的情况下,请注意,降低自动机(或trie)中的单个级别是在固定时间内完成的,并且每次算法降低自动机(trie)中的单个级别时,都会剪切单个字符从该术语中可以得出结论,可以在O(L)时间内在自动机(trie)中找到一个术语.

      In case of key lookup, notice that lowering a single level in the automata (or trie) is done in constant time, and every time that the algorithm lowers a single level in the automata (trie), a single character is cut from the term, so we can conclude that finding a term in a automata (trie) can be done in O(L) time.

      这篇关于如何在Lucene文档中定义主键字段以获得最佳查找性能?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-21 07:47