Problem description
I would like to find "Bug reports" with Lucene using a regular expression, but whenever I try it doesn't work.
I used the code from the Lucene page to avoid a bad setup.
Here is my code:
import java.util.regex.Pattern;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.regex.JakartaRegexpCapabilities;
import org.apache.lucene.search.regex.RegexCapabilities;
import org.apache.lucene.search.regex.RegexQuery;
import org.apache.lucene.store.RAMDirectory;
public class Rege {
    private static IndexSearcher searcher;
    private static final String FN = "field";

    public static void main(String[] args) throws Exception {
        RAMDirectory directory = new RAMDirectory();
        try {
            IndexWriter writer = new IndexWriter(directory,
                    new SimpleAnalyzer(), true,
                    IndexWriter.MaxFieldLength.LIMITED);
            Document doc = new Document();
            doc.add(new Field(
                    FN,
                    "[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)",
                    Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();
            searcher = new IndexSearcher(directory, true);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.err.println(regexQueryNrHits("bug [0-9]+", null));
    }

    private static Term newTerm(String value) {
        return new Term(FN, value);
    }

    private static int regexQueryNrHits(String regex,
            RegexCapabilities capability) throws Exception {
        RegexQuery query = new RegexQuery(newTerm(regex));
        if (capability != null)
            query.setRegexImplementation(capability);
        return searcher.search(query, null, 1000).totalHits;
    }
}
I would expect bug [0-9]+ to return 1, but it doesn't. I also tested the regex with plain Java and it worked.
Recommended answer
Thanks, but this alone didn't solve the problem. The problem is the Field.Index.ANALYZED flag:
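It seems that Lucene doesn't index numbers in a way that a regex can be used with them: with SimpleAnalyzer the analyzed field is split into lower-cased, letter-only terms, so the digits are dropped, and a RegexQuery is matched against those individual terms rather than against the whole field value.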
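As a rough illustration of why the analyzed field gives the regex nothing to match, here is a sketch in plain Java, without Lucene. It assumes that SimpleAnalyzer behaves like "split at every non-letter and lower-case the pieces"; letterTokens and anyTermMatches are hypothetical helpers written for this example, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class AnalyzedTermsSketch {
    // Assumed stand-in for SimpleAnalyzer: split at non-letters, lower-case each token.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // True if the regex matches at least one term in full.
    static boolean anyTermMatches(List<String> terms, String regex) {
        Pattern p = Pattern.compile(regex);
        for (String term : terms) {
            if (p.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)";
        List<String> terms = letterTokens(text);
        System.out.println(terms);                               // no term contains digits or spaces
        System.out.println(anyTermMatches(terms, "bug [0-9]+")); // no single term can satisfy this
        System.out.println(anyTermMatches(terms, "bug"));        // a single-word pattern can match
    }
}
```

Under that assumption, "bug" and "601721" never end up in the same term (and the number is not kept at all), so a pattern that spans both cannot hit.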
I changed:
doc.add(new Field(
FN,"[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)",Field.Store.NO, Field.Index.ANALYZED));
to
doc.add(new Field(
FN,"[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)",Field.Store.NO, Field.Index.NOT_ANALYZED));
and with your improved regex:
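// needs: import org.apache.lucene.search.regex.JavaUtilRegexCapabilities;
// '#*' makes the '#' optional; the sample text has no '#' before the number
System.err.println(regexQueryNrHits("^.*bug #*[0-9]+.*$",
        new JavaUtilRegexCapabilities()));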
it finally worked! :)
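With NOT_ANALYZED the whole field value is stored as one term, which is why the pattern has to be wrapped in ^.* and .*$. The sketch below shows the difference in plain Java, assuming the regex must cover the entire term the way java.util.regex's matches() does; whether a particular RegexCapabilities implementation behaves exactly like this is an assumption here. matchesWholeTerm and foundInside are hypothetical helper names.

```java
import java.util.regex.Pattern;

class WholeTermMatchSketch {
    // The regex must cover the entire term.
    static boolean matchesWholeTerm(String regex, String term) {
        return Pattern.compile(regex).matcher(term).matches();
    }

    // The regex only has to occur somewhere inside the term.
    static boolean foundInside(String regex, String term) {
        return Pattern.compile(regex).matcher(term).find();
    }

    public static void main(String[] args) {
        // With NOT_ANALYZED the whole field value is a single term.
        String term = "[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)";
        System.out.println(foundInside("bug [0-9]+", term));            // substring search
        System.out.println(matchesWholeTerm("bug [0-9]+", term));       // whole-term match, bare pattern
        System.out.println(matchesWholeTerm("^.*bug [0-9]+.*$", term)); // whole-term match, wrapped pattern
    }
}
```

This is also consistent with the bare regex working in a quick standalone Java test (which typically uses find()) while returning no hits as a query.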