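I want to read an index back from the files written by my indexer.
The result I am after is, for every document, all of its terms together with their TF-IDF values.
Could you suggest some sample code? Thanks :)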
Best answer
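The first step is to get a list of documents. One alternative might be to iterate over the indexed terms, but the method IndexReader.terms() appears to have been removed in 4.0 (it still exists in AtomicReader, which may be worth a look). The best way I know of to reach every document is simply to loop over the document ids: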
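// where reader is your IndexReader, however you go about opening/managing it
// IndexReader.isDeleted(int) is gone in 4.0; live documents are exposed as a Bits (null when nothing is deleted)
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    // operate on the document with id = i ...
}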
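Next, you need to list all of the indexed terms. I assume we are not interested in stored fields, since the data we want has no meaning for them. To retrieve the terms you can use IndexReader.getTermVectors(int), which only returns something for documents that were indexed with term vectors enabled. Note that I never actually retrieve the document itself, because we do not need direct access to it. Continuing from where we left off:
String field;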
FieldsEnum fieldsiterator;
TermsEnum termsiterator;
// To simplify, you can rely on DefaultSimilarity to calculate tf and idf for you.
DefaultSimilarity freqcalculator = new DefaultSimilarity();
// numDocs and maxDoc are not the same thing: numDocs excludes deleted documents.
int numDocs = reader.numDocs();
int maxDoc = reader.maxDoc();
// Live documents as a Bits; null when the index has no deletions.
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < maxDoc; i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    Fields vectors = reader.getTermVectors(i);
    if (vectors == null)
        continue; // no term vectors were stored for this document
    fieldsiterator = vectors.iterator();
    while ((field = fieldsiterator.next()) != null) {
        termsiterator = fieldsiterator.terms().iterator(null);
        while (termsiterator.next() != null) {
            // i = document id, field = field name
            // String representation of the current term
            String termtext = termsiterator.term().utf8ToString();
            // Get idf, using docFreq from the term enumeration.
            // I haven't tested this, and I'm not quite 100% sure of the context of this method.
            // If it doesn't work, idfalternate below (docFreq from the reader) should.
            float idf = freqcalculator.idf(termsiterator.docFreq(), numDocs);
            float idfalternate = freqcalculator.idf(reader.docFreq(field, termsiterator.term()), numDocs);
        }
    }
}
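To make the numbers concrete, here is a minimal sketch in plain Java, with no Lucene on the classpath, of the arithmetic I understand DefaultSimilarity to perform in 4.0: tf is the square root of the raw term frequency, and idf is 1 + ln(numDocs / (docFreq + 1)). The class name TfIdfSketch, the whitespace tokenization, and the in-memory list of documents are all assumptions made for illustration. They stand in for the term vectors and the reader used above, so treat this as a way to sanity-check values rather than as the library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TfIdfSketch {
    // Assumed to mirror DefaultSimilarity.tf: square root of the raw frequency.
    public static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // Assumed to mirror DefaultSimilarity.idf: 1 + ln(numDocs / (docFreq + 1)).
    public static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // For each document (in input order), maps every term to tf * idf.
    public static List<Map<String, Float>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> perDocCounts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Hypothetical analysis step: lowercase and split on whitespace.
            Map<String, Integer> counts = new TreeMap<>();
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
            // Each distinct term counts once per document towards docFreq.
            for (String term : counts.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
            perDocCounts.add(counts);
        }
        List<Map<String, Float>> result = new ArrayList<>();
        for (Map<String, Integer> counts : perDocCounts) {
            Map<String, Float> weights = new TreeMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                float weight = tf(e.getValue()) * idf(docFreq.get(e.getKey()), docs.size());
                weights.put(e.getKey(), weight);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "lucene index reader",
                "lucene term vectors",
                "reader reader term");
        List<Map<String, Float>> weights = tfIdf(docs);
        for (int i = 0; i < weights.size(); i++) {
            System.out.println("doc " + i + ": " + weights.get(i));
        }
    }
}
```

In the Lucene loop above, the per-document frequency that plays the role of the counts here would come from the term vector itself (for example TermsEnum.totalTermFreq() on the vector's enumeration), while docFreq and numDocs come from the reader.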