Question
I want to know the number of terms in each document in a Lucene index. I've searched the API and the internet without success. Can you help me?
Recommended answer
This is actually fairly difficult to do in Lucene if you did not store term vectors at index time. Lucene's underlying data structure is an inverted index, which stores terms as keys and lists of document IDs as values. That's why there isn't a "getNumTerms()" method in the API: the index maps terms to documents, not documents to terms, so it has no direct way to answer the question.
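To make that concrete, here is a minimal sketch of an inverted index in plain Java. It is a toy, not Lucene's actual implementation: the class name `ToyInvertedIndex`, its methods, and the whitespace tokenization are all assumptions made for illustration. It shows that looking up a term is a single map access, while counting one document's terms means scanning every postings list.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: term -> sorted set of IDs of the documents containing it.
// Illustration only; this is not how Lucene stores its index.
class ToyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Tokenize on whitespace (an assumption; Lucene would run an Analyzer).
    void add(int docId, String text) {
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // The cheap direction: one map access per term.
    Set<Integer> docsContaining(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    // The expensive direction: there is no per-document entry to look up,
    // so every postings list in the index has to be checked.
    int distinctTermCount(int docId) {
        int count = 0;
        for (Set<Integer> docIds : postings.values()) {
            if (docIds.contains(docId)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(0, "the quick brown fox");
        index.add(1, "the lazy dog");
        System.out.println("docs containing 'the': " + index.docsContaining("the"));
        System.out.println("distinct terms in doc 0: " + index.distinctTermCount(0));
    }
}
```

The cost of `distinctTermCount` grows with the size of the whole vocabulary, not with the size of the document, which is why a per-document count is awkward to get from this kind of structure.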
That being said, you can store term vectors in the index and look them up by document ID at search time. A term vector is essentially a complete list of the terms in a document's field, along with their frequencies, so you can count its entries to get the number of terms.
See:
http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/document/Field.TermVector.html
Alternatively, you can count the terms yourself before indexing and store the count as a field in your index.
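A rough sketch of that counting step in plain Java follows. The `TermCounter` class and its whitespace tokenizer are hypothetical stand-ins: in a real setup you would want to count the tokens produced by the same analyzer you index with, otherwise the numbers will not match what Lucene actually stores.

```java
import java.util.HashSet;
import java.util.Set;

// Counts terms in a piece of text before it is indexed.
// The tokenizer here is an assumption (lower-case, split on whitespace);
// it only approximates what a real analyzer would emit.
class TermCounter {

    static String[] tokenize(String text) {
        String normalized = text.trim().toLowerCase();
        return normalized.isEmpty() ? new String[0] : normalized.split("\\s+");
    }

    // Total number of term occurrences, repeats included.
    static int totalTerms(String text) {
        return tokenize(text).length;
    }

    // Number of distinct terms, each counted once.
    static int distinctTerms(String text) {
        Set<String> seen = new HashSet<>();
        for (String term : tokenize(text)) {
            seen.add(term);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String body = "to be or not to be";
        System.out.println("total terms: " + totalTerms(body));
        System.out.println("distinct terms: " + distinctTerms(body));
    }
}
```

Whichever count you need, you would then add it to the document as a stored field (the field name, say "termCount", is up to you) and read it back with the rest of the document at search time.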