Problem description
I am trying to learn how Lucene works and what is inside a Lucene index. Basically, I want to see how the data is represented inside the index.
I am using lucene-core 8.6.0 as a dependency.
Below is my very basic Lucene code:
private Document create(File file) throws IOException {
    // Build the document: file contents (indexed, not stored), path and name (stored)
    Document document = new Document();
    Field field = new Field("contents", new FileReader(file), TextField.TYPE_NOT_STORED);
    Field fieldPath = new Field("path", file.getAbsolutePath(), TextField.TYPE_STORED);
    Field fieldName = new Field("name", file.getName(), TextField.TYPE_STORED);
    document.add(field);
    document.add(fieldPath);
    document.add(fieldName);
    // Create the analyzer
    Analyzer analyzer = new StandardAnalyzer();
    // Create the IndexWriter, passing it the analyzer
    Path indexPath = Files.createTempDirectory("tempIndex");
    Directory directory = FSDirectory.open(indexPath);
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, indexWriterConfig);
    iwriter.addDocument(document);
    // Closing the writer commits the index files to indexPath
    iwriter.close();
    return document;
}
Note: I understand the concept behind Lucene, the inverted index, but I do not understand how the Lucene library applies this concept, or how it lays out its files so that searching is easy and feasible.
I tried LIMO, but it was of no use. It simply did not work, even though I gave the index location in web.xml.
Recommended answer
If you would like to see a good introductory code example that uses the current version of Lucene (building an index and then using it), you can start with the basic demo. The source code for the demo can be found on GitHub.
If you would like to explore your indexed data once it has been created, you can use Luke. In case you have not used it before: to run Luke, download a binary release from the main download page. Unzip the file, navigate to the luke directory, and run the relevant script (luke.bat or luke.sh).
(The only version of the LIMO tool I could find is the one on SourceForge. Given that it is from 2007, it is almost certainly no longer compatible with the latest Lucene index files. Maybe there is a more recent version somewhere.)
If you would like an overview of the files in a typical Lucene index, you can start with the file format overview in the Lucene API documentation.
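As a rough companion to that overview, the sketch below lists the files in an index directory and labels each one by its extension. It uses only the JDK. The class name `IndexFileKinds`, the placeholder path, and the extension table are illustrative assumptions: the table is partial, and the exact set of files depends on the Lucene version and codec in use.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: labels index files by extension.
final class IndexFileKinds {

    // Partial, illustrative mapping; check the codec documentation for your version.
    private static final Map<String, String> KINDS = new LinkedHashMap<>();
    static {
        KINDS.put("si", "segment info");
        KINDS.put("cfs", "compound file (data)");
        KINDS.put("cfe", "compound file (entry table)");
        KINDS.put("fnm", "field infos");
        KINDS.put("fdt", "stored field data");
        KINDS.put("fdx", "stored field index");
        KINDS.put("tim", "term dictionary");
        KINDS.put("tip", "term index");
        KINDS.put("doc", "postings: documents and frequencies");
        KINDS.put("pos", "postings: positions");
        KINDS.put("nvd", "norms data");
        KINDS.put("nvm", "norms metadata");
        KINDS.put("dvd", "doc values data");
        KINDS.put("dvm", "doc values metadata");
        KINDS.put("scf", "SimpleTextCodec compound file");
    }

    // Returns a short description for one index file name.
    static String describe(String fileName) {
        if (fileName.startsWith("segments")) {
            return "commit point (list of segments)";
        }
        if (fileName.equals("write.lock")) {
            return "write lock";
        }
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1);
        return KINDS.getOrDefault(ext, "unknown (see the codec documentation)");
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass the real index directory as the first argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "/path/to/index_dir");
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path f : files) {
                String name = f.getFileName().toString();
                System.out.printf("%-20s %10d bytes  %s%n", name, Files.size(f), describe(name));
            }
        }
    }
}
```

Running it against a small index shows which parts of the inverted index (term dictionary, postings, stored fields) ended up in which file, which makes the format documentation easier to follow.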
Many specific questions can be answered by looking at the API documentation for the relevant packages and classes.
Personally, I have also found the Solr and ElasticSearch documentation very useful for explaining specific concepts, which are often directly relevant to Lucene.
Beyond that, I don't worry too much about how Lucene manages its internal index data structures. Instead, I focus on the different types of analyzer and query that can be used to access that data.
Update: SimpleTextCodec
It is now a few months later, but here is one more way to explore Lucene's index data: SimpleTextCodec. A codec defines how data is written to index files and read back from them. The standard Lucene codec uses a binary format, so its files are not human-readable. You can't just open an index file and see what's in there.
However, if you change the codec to SimpleTextCodec, Lucene will create plain-text index files, where you can see the structure more clearly.
This codec is intended only for informational and educational purposes and should not be used in production.
To use the codec, you first need to include the relevant dependency, for example:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-codecs</artifactId>
    <version>8.7.0</version>
</dependency>
Now you can use this new codec as follows:
iwc.setCodec(new SimpleTextCodec());
So, for example:
final String indexPath = "/path/to/index_dir";
final String docsPath = "/path/to/inputs_dir";
final Path docDir = Paths.get(docsPath);
Directory dir = FSDirectory.open(Paths.get(indexPath));
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
System.out.println(iwc.getCodec().getName());
try (IndexWriter writer = new IndexWriter(dir, iwc)) {
    // read documents, and write index data:
    indexDocs(writer, docDir);
}
You are now free to inspect the resulting index files in a text editor (e.g. Notepad++).
In my case, the index data resulted in several files, but the one I was interested in was my *.scf file: a "compound" file containing various "virtual file" sections, where the human-readable index data is stored.
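If you would rather peek at the file from code than from a text editor, something like the following sketch may help. It uses only the JDK and prints the first lines of every *.scf file in a directory. The class name `ScfPeek`, the placeholder path, and the 40-line limit are assumptions made for illustration, and the sketch assumes the files are valid UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: prints the start of plain-text index files.
final class ScfPeek {

    // Reads at most maxLines lines from the start of a text file.
    // Assumes the file is valid UTF-8; malformed input raises an IOException.
    static List<String> head(Path file, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass the real index directory as the first argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "/path/to/index_dir");
        // The glob restricts the listing to SimpleTextCodec compound files.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir, "*.scf")) {
            for (Path f : files) {
                System.out.println("== " + f.getFileName());
                head(f, 40).forEach(System.out::println);
            }
        }
    }
}
```

Raising the line limit, or searching the output for one of your own field names such as "contents", is a quick way to find the section that holds the terms for that field.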