Getting term frequencies in Lucene

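Problem description

Is there a fast and easy way of getting term frequencies from a Lucene index without going through the TermVectorFrequencies class, which takes an awful lot of time for large collections? What I mean is, is there something like TermEnum that exposes not just the document frequency but the term frequency as well?

Update:
Using TermDocs is way too slow.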

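Solution

Use TermDocs to get the term frequency for a given document. As with the document frequency, you obtain the term documents from an IndexReader, using the term of interest.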






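You won't find a faster method than TermDocs without losing some generality. TermDocs reads directly from the ".frq" file in an index segment, where each term's frequencies are listed in document order.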
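The access pattern described above can be sketched without Lucene on the classpath. The block below is only a toy model, not Lucene code: `PostingCursor` is a hypothetical stand-in for TermDocs that holds one term's postings in document order and mirrors the `next()`, `doc()` and `freq()` methods of the Lucene 2.x/3.x `TermDocs` interface. In a real index the cursor would come from `IndexReader.termDocs(Term)` rather than from arrays. Summing `freq()` over one sequential pass gives the term's total frequency across the collection.

```java
final class TermFrequencyDemo {

    // Hypothetical stand-in for Lucene's TermDocs: one term's postings,
    // stored in ascending document order (as in the .frq file).
    static final class PostingCursor {
        private final int[] docs;   // ascending document ids
        private final int[] freqs;  // freqs[i] = occurrences of the term in docs[i]
        private int pos = -1;       // positioned before the first entry

        PostingCursor(int[] docs, int[] freqs) {
            if (docs.length != freqs.length) {
                throw new IllegalArgumentException("docs and freqs must have the same length");
            }
            this.docs = docs;
            this.freqs = freqs;
        }

        // Moves to the next entry; returns false once the list is exhausted.
        boolean next() {
            return ++pos < docs.length;
        }

        int doc() {
            return docs[pos];
        }

        int freq() {
            return freqs[pos];
        }
    }

    // Total occurrences of the term across all documents, in one sequential pass.
    static long totalTermFreq(PostingCursor cursor) {
        long total = 0;
        while (cursor.next()) {
            total += cursor.freq();
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up postings: the term occurs in documents 0, 3 and 7.
        PostingCursor cursor = new PostingCursor(new int[] {0, 3, 7}, new int[] {2, 1, 4});
        System.out.println(totalTermFreq(cursor)); // prints 7
    }
}
```

The loop touches each posting exactly once and never seeks, which is why a sequential read of this kind is hard to beat in the general case.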



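If that is still too slow, make sure you have optimized your index so that multiple segments are merged into a single segment. Then iterate over the documents in order: skipping forward is fine, but you cannot jump back and forth in the document list efficiently.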
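To illustrate why ordered access matters, here is a minimal sketch under the same caveat as any toy model: `ForwardCursor` and `freqsFor` are hypothetical names, and the cursor only imitates the forward-only behaviour of `TermDocs.skipTo(int)`. It is simplified in that it may stay on the current entry, whereas Lucene's version always moves past it. Sorting the wanted document ids first lets a single forward pass answer every lookup.

```java
import java.util.Arrays;

final class OrderedLookupDemo {

    // Hypothetical forward-only cursor over one term's postings (ascending doc ids).
    static final class ForwardCursor {
        private final int[] docs;
        private final int[] freqs;
        private int pos = 0;

        ForwardCursor(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
        }

        // Advances to the first entry whose doc id is >= target; never moves backwards.
        boolean skipTo(int target) {
            while (pos < docs.length && docs[pos] < target) {
                pos++;
            }
            return pos < docs.length;
        }

        int doc() {
            return docs[pos];
        }

        int freq() {
            return freqs[pos];
        }
    }

    // Returns the term frequency for each wanted document, in ascending doc-id order.
    // Documents that do not contain the term get a frequency of 0.
    static int[] freqsFor(int[] docs, int[] freqs, int[] wantedDocs) {
        int[] sorted = wantedDocs.clone();
        Arrays.sort(sorted); // ascending order means the cursor never has to rewind
        ForwardCursor cursor = new ForwardCursor(docs, freqs);
        int[] result = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            boolean found = cursor.skipTo(sorted[i]) && cursor.doc() == sorted[i];
            result[i] = found ? cursor.freq() : 0;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] docs = {0, 3, 7, 12};  // made-up postings for one term
        int[] freqs = {2, 1, 4, 5};
        int[] wanted = {12, 3, 5};   // requested out of order; document 5 lacks the term
        System.out.println(Arrays.toString(freqsFor(docs, freqs, wanted))); // prints [1, 0, 5]
    }
}
```

If the ids were processed in their original order (12, 3, 5), the cursor would already be past documents 3 and 5 after the first lookup, and each backward step would need a fresh cursor.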



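Your next step might be additional processing to create an even more specialized file structure that leaves out the SkipData. Personally, I would look for a better algorithm to achieve my objective, or provide better hardware: lots of memory, either to hold a RAMDirectory or to give to the OS for its own file cache.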


