Problem description
Hi,
I have a Lucene index that is frequently updated with new records. The index holds 5,000,000 records, and I cache one of my numeric fields using FieldCache. After updating the index, however, it takes time to reload the FieldCache (I reload the cache because the documentation says DocIDs are not reliable). How can I minimize this overhead by adding only the newly added DocIDs to the FieldCache? This reload has become a bottleneck in my application.
IndexReader reader = IndexReader.Open(diskDir);
int[] dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // first call: takes about 4 seconds to load the array
dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // second call: returns immediately, as expected
// Here we add some documents to the index and must reopen the reader to see the changes
reader = reader.Reopen();
dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // takes about 4 seconds again to load the array
I want a mechanism that minimizes this time by adding only the newly added documents to our array. There is a technique like this, http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html, to improve performance, but it still loads all the documents we already have. I think there is no need to reload them all if we can find a way to add only the newly added documents to the array.
Recommended answer
The FieldCache uses weak references to index readers as the keys of its cache (obtained by calling IndexReader.GetCacheKey, which has been un-obsoleted). A standard call to IndexReader.Open with an FSDirectory will use a pool of readers, one for every segment.
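The effect of keying the cache on per-segment readers can be illustrated with a small toy model. This is only a sketch under stated assumptions: SegmentKey, PerSegmentCache and Demo are hypothetical names invented for illustration and are not Lucene.Net classes, and ConditionalWeakTable merely stands in for a weak-reference keyed cache. It shows that when a reopened reader shares its unchanged segments with the old reader, only the new segment triggers a load.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Toy model only: these types are hypothetical, not Lucene.Net classes.
public sealed class SegmentKey
{
    public readonly int[] Values; // stands in for the field data of one segment
    public SegmentKey(int[] values) { Values = values; }
}

public sealed class PerSegmentCache
{
    // An entry stays alive only while its key object is reachable,
    // which mimics a cache keyed by weak references.
    private readonly ConditionalWeakTable<SegmentKey, int[]> cache =
        new ConditionalWeakTable<SegmentKey, int[]>();

    public int Loads { get; private set; } // number of expensive loads performed

    public int[] GetInts(SegmentKey segment)
    {
        int[] values;
        if (!cache.TryGetValue(segment, out values))
        {
            Loads++;
            values = (int[])segment.Values.Clone(); // pretend this is the slow load
            cache.Add(segment, values);
        }
        return values;
    }
}

public static class Demo
{
    public static int Run()
    {
        var cache = new PerSegmentCache();
        var oldSegments = new List<SegmentKey>
        {
            new SegmentKey(new[] { 20110101, 20110102 }),
            new SegmentKey(new[] { 20110201 })
        };
        foreach (var s in oldSegments) cache.GetInts(s); // two loads

        // Simulated reopen: unchanged segments are shared, one new segment is added.
        var reopened = new List<SegmentKey>(oldSegments)
        {
            new SegmentKey(new[] { 20110403 })
        };
        foreach (var s in reopened) cache.GetInts(s); // one additional load

        return cache.Loads;
    }
}
```

Calling Demo.Run() counts three loads in total: two for the original segments and one for the segment added by the simulated reopen. Asking the cache for the top-level reader instead would correspond to a single key that changes on every reopen, so everything would be loaded again.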
You should always pass the innermost reader to the FieldCache. Check out ReaderUtil for some helpers to retrieve the individual reader that a document is contained in. Document ids won't change within a segment; what the documentation means when it describes them as unpredictable/volatile is that they can change between two index commits: deleted documents may have been pruned, segments may have been merged, and so on.
When a commit removes a segment from disk (because it was merged or optimized away), new readers no longer contain the pooled segment reader for it, and garbage collection will reclaim it, along with its cache entry, as soon as all older readers are closed.
Never, ever, call FieldCache.PurgeAllCaches(). It's meant for testing, not production use.
Added 2011-04-03; example code using subreaders.
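using System.Collections.Generic;
using System.IO;
using System.Linq;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var directory = FSDirectory.Open(new DirectoryInfo("index"));
var reader = IndexReader.Open(directory, readOnly: true);
var documentId = 1337;

// Grab all subreaders (one per segment).
var subReaders = new List<IndexReader>();
ReaderUtil.GatherSubReaders(subReaders, reader);

// Loop through all subreaders. While subReaderId is at or beyond the
// subreader's document count, subtract that count and go to the next.
// Ids within a segment run from 0 to MaxDoc() - 1, hence ">=".
var subReaderId = documentId;
var subReader = subReaders.First(sub => {
    if (subReaderId >= sub.MaxDoc()) {
        subReaderId -= sub.MaxDoc();
        return false;
    }
    return true;
});

// Ask the cache for the segment-level reader, then index with the segment-local id.
var values = FieldCache_Fields.DEFAULT.GetInts(subReader, "newsdate");
var value = values[subReaderId];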
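When many document ids have to be resolved, the linear walk above could be replaced by precomputing each segment's starting id once and binary searching it. The following is a minimal sketch of that idea, assuming only that the per-segment document counts are known; DocIdResolver, BuildStarts and Resolve are hypothetical names for illustration and are not part of Lucene.Net.

```csharp
using System;

// Hypothetical helper for illustration; not part of Lucene.Net.
public static class DocIdResolver
{
    // starts[i] is the first top-level document id of segment i;
    // the final element is the total document count.
    public static int[] BuildStarts(int[] maxDocs)
    {
        var starts = new int[maxDocs.Length + 1];
        for (int i = 0; i < maxDocs.Length; i++)
            starts[i + 1] = starts[i] + maxDocs[i];
        return starts;
    }

    // Returns the index of the segment containing docId and
    // sets localId to the id relative to that segment.
    public static int Resolve(int[] starts, int docId, out int localId)
    {
        int total = starts[starts.Length - 1];
        if (docId < 0 || docId >= total)
            throw new ArgumentOutOfRangeException("docId");

        // Binary search for the last segment whose start is <= docId.
        int lo = 0, hi = starts.Length - 2;
        while (lo < hi)
        {
            int mid = (lo + hi + 1) / 2;
            if (starts[mid] <= docId) lo = mid; else hi = mid - 1;
        }
        localId = docId - starts[lo];
        return lo;
    }
}
```

In the setting above, the counts would come from calling MaxDoc() on each gathered subreader, and the starts array would need to be rebuilt after every reopen because the set of segments can change.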