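I want to use Lucene's IndexSearcher to compute the similarity between documents. Specifically, I have one input document and want to compute its similarity to every other document in the index. I have the basic functionality working, but I have run into a few questions that I could not find answers to online.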
Why does calling is.search(query, Integer.MAX_VALUE) on the IndexSearcher return only two results? I expected three.
Is there a mistake in my approach that I am not seeing?
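How do I handle multiple languages? As far as I know, the IndexWriter and the QueryParser should use the same analyzer (StandardAnalyzer in my example). If I work with three different languages, do I have to create three indexes?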
SSCCE (I am using Lucene 5.3.0):
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

public class Main {

    public static void main(String[] args) throws Exception {
        Path path = Paths.get("temp_directory");
        // create index
        createIndexAndAddDocuments(path);
        // open index reader (closed automatically) and create index searcher
        try (IndexReader ir = DirectoryReader.open(FSDirectory.open(path))) {
            IndexSearcher is = new IndexSearcher(ir);
            is.setSimilarity(new BM25Similarity());
            // stored document with internal ID 1, used to create the query
            Document doc = ir.document(1);
            // create query parser; same analyzer as the IndexWriter
            QueryParser queryParser = new QueryParser("Abstract", new StandardAnalyzer());
            // create query; parse() treats characters such as ':' and '(' as query syntax
            Query query = queryParser.parse(doc.get("Abstract"));
            // search and print "docId <TAB> score" for every hit
            for (ScoreDoc result : is.search(query, Integer.MAX_VALUE).scoreDocs) {
                System.out.println(result.doc + "\t" + result.score);
            }
        }
    }

    private static void createIndexAndAddDocuments(Path indexPath) throws IOException {
        // create documents
        Document doc1 = new Document();
        doc1.add(new TextField("Title", "Apparatus for manufacturing green bricks for the brick manufacturing industry",
                Store.YES));
        doc1.add(new TextField("Abstract",
                "The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks",
                Store.YES));

        Document doc2 = new Document();
        doc2.add(new TextField("Title",
                "Some other title, for example: Apparatus for manufacturing green bricks for the brick manufacturing industry",
                Store.YES));
        doc2.add(new TextField("Abstract",
                "Some other abstract, for example: The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks",
                Store.YES));

        Document doc3 = new Document();
        doc3.add(new TextField("Title", "A document with a completely different title", Store.YES));
        doc3.add(new TextField("Abstract",
                "This document also has a completely different abstract which is in no way similar to the abstract of the previous documents.",
                Store.YES));

        // write the three documents, replacing whatever the index held before
        try (IndexWriter iw = new IndexWriter(FSDirectory.open(indexPath),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            iw.deleteAll();
            iw.addDocument(doc1);
            iw.addDocument(doc2);
            iw.addDocument(doc3);
        }
    }
}
Best answer
I found the reason you only get 2 results. In createIndexAndAddDocuments you only created doc1 and doc2, and then overwrote doc2 without ever initializing doc3.
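As for your question about languages: it depends on whether you want to search each language separately or all of them together. If you want to keep the languages separate, you need a different index for each one.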
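If you go with one index per language, one way to keep indexing and searching consistent is to derive the index directory from a language tag in a single place. The following is only a minimal sketch using the Java standard library; the class name LanguageIndexRouter, the method indexPathFor and the "index_" directory prefix are hypothetical names chosen for illustration and are not part of Lucene. Each returned path would then be opened with FSDirectory.open(...) and paired with an analyzer suited to that language, used for both the IndexWriter and the QueryParser.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Hypothetical helper: maps a language tag to its own index directory. */
public class LanguageIndexRouter {

    private final Path root;

    public LanguageIndexRouter(Path root) {
        this.root = root;
    }

    /**
     * Returns the directory of the index for the given language tag.
     * Tags are trimmed and lower-cased, so "DE" and " de " share one index.
     */
    public Path indexPathFor(String languageTag) {
        if (languageTag == null || languageTag.trim().isEmpty()) {
            throw new IllegalArgumentException("language tag must not be empty");
        }
        String normalized = languageTag.trim().toLowerCase(Locale.ROOT);
        // assumption: one sub-directory per language below a common root
        return root.resolve("index_" + normalized);
    }

    public static void main(String[] args) {
        LanguageIndexRouter router = new LanguageIndexRouter(Paths.get("indexes"));
        // print the directory that each language would be indexed into
        for (String lang : new String[] {"en", "DE", "fr"}) {
            System.out.println(lang + " -> " + router.indexPathFor(lang));
        }
    }
}
```

Using the same helper when writing and when querying means a document and the query built from it always land in the same index, and therefore go through the same analyzer.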
Hope this helps.