Problem description
Given a term match in a document, what's the best way to access the words around that match? I have read this article, http://searchhub.org//2009/05/26/accessing-words-around-a-positional-match-in-lucene/, but the problem is that the Lucene API has changed completely since that post (2009). Could someone point me to how to do this in a newer version of Lucene, such as Lucene 4.6.1?
Edit:
I have figured this out now. The postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum) have been removed in favor of the new flexible indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum, DocsEnum, DocsAndPositionsEnum). One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (wrapping a byte[]) per term within a single field, not a Term. Another is that when asking for a Docs/AndPositionsEnum, you now specify the skipDocs explicitly (typically this will be the deleted docs, but in general you can provide any Bits):
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TermVectorFun {
  public static String[] DOCS = {
    "The quick red fox jumped over the lazy brown dogs.",
    "Mary had a little lamb whose fleece was white as snow.",
    "Moby Dick is a story of a whale and a man obsessed.",
    "The robber wore a black fleece jacket and a baseball cap.",
    "The English Springer Spaniel is the best of all dogs.",
    "The fleece was green and red",
    "History looks fondly upon the story of the golden fleece, but most people don't agree"
  };

  public static void main(String[] args) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    // Index some made-up content
    IndexWriter writer = new IndexWriter(ramDir, config);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      // Store both position and offset information in the term vectors
      Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
    // Get a searcher
    DirectoryReader dirReader = DirectoryReader.open(ramDir);
    IndexSearcher searcher = new IndexSearcher(dirReader);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }
    IndexReader reader = searcher.getIndexReader();
    // Walk the span positions directly (this small index has a single segment, so leaf 0 covers everything)
    Spans spans = fleeceQ.getSpans(reader.leaves().get(0), null, new LinkedHashMap<Term, TermContext>());
    int window = 2; // get the words within two positions of the match
    while (spans.next()) {
      int start = spans.start() - window;
      int end = spans.end() + window;
      Map<Integer, String> entries = new TreeMap<Integer, String>();
      System.out.println("Doc: " + spans.doc() + " Start: " + start + " End: " + end);
      // Use the per-document term vectors to map positions back to terms
      Fields fields = reader.getTermVectors(spans.doc());
      Terms terms = fields.terms("content");
      TermsEnum termsEnum = terms.iterator(null);
      BytesRef text;
      while ((text = termsEnum.next()) != null) {
        // Could store the BytesRef here, but a String is easier for this example
        // (utf8ToString() decodes correctly regardless of the platform charset)
        String s = text.utf8ToString();
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          int i = 0;
          int position = -1;
          while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
            if (position >= start && position <= end) {
              entries.put(position, s);
            }
            i++;
          }
        }
      }
      System.out.println("Entries: " + entries);
    }
  }
}
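One detail worth noting about the nulls passed to getSpans and docsAndPositions above: in Lucene 4.6 that Bits argument is the set of documents to accept (typically the live, i.e. non-deleted, documents), and null means no documents are skipped. Below is a minimal sketch of the deletion-aware variant, assuming an index that actually contains deletions; this variant is my own illustration, not part of the original code (AtomicReaderContext comes from org.apache.lucene.index and Bits from org.apache.lucene.util):

AtomicReaderContext ctx = reader.leaves().get(0);
Bits liveDocs = ctx.reader().getLiveDocs(); // null when the segment has no deletions
Spans spans = fleeceQ.getSpans(ctx, liveDocs, new LinkedHashMap<Term, TermContext>());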
Recommended answer
Use Highlighter. Highlighter.getBestFragment can be used to get a portion of the text containing the best match. Something like:
TopDocs docs = searcher.search(query, maxdocs);
Document firstDoc = searcher.doc(docs.scoreDocs[0].doc);
Scorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(scorer);
highlighter.getBestFragment(myAnalyzer, fieldName, firstDoc.get(fieldName));
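For context, here is a minimal, self-contained sketch of that approach wired up against the index from the question; the class name HighlightFun, the method name, and the re-use of TermVectorFun.DOCS are my own assumptions, not from the original answer, and it needs the lucene-highlighter module on the classpath. One gotcha: the question indexes content with Field.Store.NO, so firstDoc.get(fieldName) would return null there; the raw text has to come from somewhere else, here the DOCS array via the stored id field:

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.util.Version;

public class HighlightFun {
  // Assumes an IndexSearcher over the index built by TermVectorFun above
  public static void printBestFragments(IndexSearcher searcher)
      throws IOException, InvalidTokenOffsetsException {
    Query query = new TermQuery(new Term("content", "fleece"));
    TopDocs docs = searcher.search(query, 10);
    // The default Highlighter formatter wraps matches in <B>...</B>
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    for (ScoreDoc sd : docs.scoreDocs) {
      Document doc = searcher.doc(sd.doc);
      // "content" was indexed with Store.NO, so recover the raw text
      // from the DOCS array using the stored "id" field
      int idx = Integer.parseInt(doc.get("id").substring("doc_".length()));
      String fragment = highlighter.getBestFragment(
          new StandardAnalyzer(Version.LUCENE_46), "content", TermVectorFun.DOCS[idx]);
      System.out.println(doc.get("id") + ": " + fragment);
    }
  }
}

Highlighter is the natural fit when you want a display snippet around the match; the span/term-vector walk in the question is the better fit when you need the surrounding words as discrete, position-addressed tokens.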