Problem description
I'm new to Lucene. I have two documents and I would like to have an exact match for the document field called "keyword" (the field may occur multiple times within a document).
The first document contains the keyword "Annotation is cool". The second document contains the keyword "Annotation is cool too". How do I have to build the query such that only the first document is found, when I search for "Annotation is cool"?
I read something about "StringField" and that it is not tokenized. If I change the "keyword" field from "TextField" to "StringField" in the method "addDoc" then nothing will be found.
Here is my code:
private IndexWriter writer;

public void lucene() throws IOException, ParseException {
    // Build the index
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_42,
            analyzer);
    this.writer = new IndexWriter(index, config);

    // Add documents to the index
    addDoc("Spring", new String[] { "Java", "JSP",
            "Annotation is cool" });
    addDoc("Java", new String[] { "Oracle", "Annotation is cool too" });
    writer.close();

    // Search the index
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    BooleanQuery qry = new BooleanQuery();
    qry.add(new TermQuery(new Term("keyword", "\"Annotation is cool\"")), BooleanClause.Occur.MUST);
    System.out.println(qry.toString());
    Query q = new QueryParser(Version.LUCENE_42, "title", analyzer).parse(qry.toString());
    int hitsPerPage = 10;
    TopScoreDocCollector collector = TopScoreDocCollector.create(
            hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document doc = searcher.doc(docId);
        System.out.println((i + 1) + ". " + doc.get("title"));
    }
    reader.close();
}

private void addDoc(String title, String[] keywords) throws IOException {
    // Create new document
    Document doc = new Document();
    // Add title
    doc.add(new TextField("title", title, Field.Store.YES));
    // Add keywords
    for (int i = 0; i < keywords.length; i++) {
        doc.add(new TextField("keyword", keywords[i], Field.Store.YES));
    }
    // Add document to index
    this.writer.addDocument(doc);
}
Recommended answer
Your problem is not in how you are indexing the field. A StringField is the correct way to index the entire input as a single token. The problem is in how you are searching. I really don't know what you intend to accomplish with this logic:
BooleanQuery qry = new BooleanQuery();
qry.add(new TermQuery(new Term("keyword", "\"Annotation is cool\"")), BooleanClause.Occur.MUST);
//Great! You have a termQuery added to the parent BooleanQuery which should find your keyword just fine!
Query q = new QueryParser(Version.LUCENE_42, "title", analyzer).parse(qry.toString());
//Now all bets are off.
Query.toString() is a handy debugging aid, but it is not safe to assume that running its text output through a QueryParser will regenerate the same query. The standard query parser has little ability to express multiple words as a single term. The string version of this query will, I believe, look like:
keyword:"Annotation is cool"
This will be interpreted as a PhraseQuery. A PhraseQuery looks for three consecutive terms, Annotation, is, and cool, but the way you have indexed this, you have a single term "Annotation is cool".
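To make the mismatch concrete, here is a toy model in plain Java. It is not Lucene code: the class `ToyIndex` and all of its methods are hypothetical, and the "tokenizer" only lowercases and splits on whitespace (a real StandardAnalyzer does more, such as stop-word handling). It only illustrates why a phrase of three terms cannot match a field value that was indexed as one term.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of tokenized vs. untokenized field values. Not Lucene's API. */
class ToyIndex {

    /** Untokenized (StringField-like): the whole value becomes one term. */
    static List<String> indexAsSingleTerm(String value) {
        return Collections.singletonList(value);
    }

    /** Tokenized (TextField-like, heavily simplified): lowercase, split on whitespace. */
    static List<String> indexAsTokens(String value) {
        return Arrays.asList(value.toLowerCase().split("\\s+"));
    }

    /** TermQuery-like: matches only if the exact term is present. */
    static boolean termMatches(List<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    /** PhraseQuery-like: matches if the words appear as consecutive terms. */
    static boolean phraseMatches(List<String> indexedTerms, String... words) {
        return Collections.indexOfSubList(indexedTerms, Arrays.asList(words)) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc1 = indexAsSingleTerm("Annotation is cool");
        List<String> doc2 = indexAsSingleTerm("Annotation is cool too");

        // Exact term lookup distinguishes the two documents.
        System.out.println(termMatches(doc1, "Annotation is cool")); // true
        System.out.println(termMatches(doc2, "Annotation is cool")); // false

        // A three-term phrase finds nothing in a single-term field.
        System.out.println(phraseMatches(doc1, "annotation", "is", "cool")); // false

        // On a tokenized field the phrase matches, but it also hits the longer keyword.
        List<String> tokenized = indexAsTokens("Annotation is cool too");
        System.out.println(phraseMatches(tokenized, "annotation", "is", "cool")); // true
    }
}
```

The last line also shows why a tokenized field is a poor fit for this question: a phrase match cannot tell "Annotation is cool" apart from "Annotation is cool too".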
The solution is to never use logic like:
Query nuttyQuery = queryParser.parse(perfectlyGoodQuery.toString());
searcher.search(nuttyQuery);
Instead, just search with the BooleanQuery you already created.
searcher.search(perfectlyGoodQuery);
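Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes Lucene 4.2 (core plus analyzers-common) is on the classpath, that the "keyword" field is indexed as a StringField, and that the search term carries no embedded quote characters. The class name `ExactKeywordSearch` and the reshaped `addDoc` helper are illustrative, not part of the original code. Note that a StringField is matched verbatim, so the term is case-sensitive.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical class name; a sketch, not the asker's final code.
class ExactKeywordSearch {

    public static void main(String[] args) throws IOException {
        // Build the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));
        IndexWriter writer = new IndexWriter(index, config);
        addDoc(writer, "Spring", "Java", "JSP", "Annotation is cool");
        addDoc(writer, "Java", "Oracle", "Annotation is cool too");
        writer.close();

        // Build the query programmatically and search with it directly;
        // it is never round-tripped through toString() and a QueryParser.
        Query query = new TermQuery(new Term("keyword", "Annotation is cool"));

        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            System.out.println(searcher.doc(hit.doc).get("title"));
        }
        reader.close();
    }

    private static void addDoc(IndexWriter writer, String title, String... keywords)
            throws IOException {
        Document doc = new Document();
        // The title stays tokenized for full-text search
        doc.add(new TextField("title", title, Field.Store.YES));
        // Each keyword is indexed untokenized, as one exact term
        for (String keyword : keywords) {
            doc.add(new StringField("keyword", keyword, Field.Store.YES));
        }
        writer.addDocument(doc);
    }
}
```

Under these assumptions only the first document should match, because the second document's keyword is the different term "Annotation is cool too".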