Problem description
I am working on an information retrieval project. I have built a full inverted index using Hadoop/Python. Hadoop outputs the index as (word, documentlist) pairs, which are written to a file. For quick access, I have created a dictionary (hash table) from that file. My question is: how do I store such an index on disk so that it still has a quick access time? At present I store the dictionary with the Python pickle module and load it from there, but that brings the whole index into memory at once (or does it?). Please suggest an efficient way of storing and searching through the index.
My dictionary structure is as follows (using nested dictionaries):
{word: {doc1: [locations], doc2: [locations], ...}}
so that I can get the documents containing a word with dictionary[word].keys(), and so on.
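For illustration, the nested structure described above can be sketched as a small Python example. The words, document IDs, and positions are made up, and the two helper functions are hypothetical names, not part of the original code.

```python
# Hypothetical toy index: word -> {document ID -> list of positions}.
# All words, document IDs and positions here are invented for illustration.
index = {
    "hadoop": {"doc1": [0, 17], "doc3": [4]},
    "python": {"doc1": [5], "doc2": [2, 9, 31]},
}


def documents_containing(index, word):
    """Return the sorted IDs of documents that contain `word`."""
    return sorted(index.get(word, {}))


def positions_in(index, word, doc_id):
    """Return the positions of `word` inside `doc_id` (empty list if absent)."""
    return index.get(word, {}).get(doc_id, [])


print(documents_containing(index, "python"))
print(positions_in(index, "hadoop", "doc1"))
```

Using `.get` with an empty default avoids a `KeyError` for words that are not in the index.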
Recommended answer
Yes, it does bring it all in.
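A minimal sketch of why that is the case: `pickle.load` deserializes the complete object before returning it, so there is no way to fetch a single word's postings from the pickle file on its own. The toy index and the temporary file path below are assumptions made for the example.

```python
import os
import pickle
import tempfile

# Hypothetical toy index; the real one would come from the Hadoop output.
index = {
    "hadoop": {"doc1": [0, 17], "doc3": [4]},
    "python": {"doc1": [5], "doc2": [2, 9, 31]},
}

# Write the whole dictionary to a temporary pickle file.
path = os.path.join(tempfile.mkdtemp(), "index.pkl")
with open(path, "wb") as f:
    pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)

# Loading rebuilds the entire dictionary in memory in one step.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(type(loaded).__name__, len(loaded))
```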
Is that a problem? If it's not an actual problem, then stick with it.
If it's a problem, what kind of problem do you have? Too slow? Too fast? Too colorful? Too much memory used? What problem do you have?
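One way to answer those questions is to measure the load step before changing anything. The following is a rough sketch, assuming a synthetic index as a stand-in for the real one; it uses the standard-library `time` and `tracemalloc` modules, and the numbers it reports depend on the machine and Python version.

```python
import pickle
import time
import tracemalloc

# Hypothetical synthetic index standing in for the real one:
# 1000 words, each appearing in 5 documents with 2 positions.
index = {
    f"word{i}": {f"doc{j}": [j, j + i] for j in range(5)}
    for i in range(1000)
}
blob = pickle.dumps(index, protocol=pickle.HIGHEST_PROTOCOL)

# Measure wall-clock time and peak traced memory of the load step only.
tracemalloc.start()
start = time.perf_counter()
loaded = pickle.loads(blob)
elapsed = time.perf_counter() - start
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"pickle size: {len(blob)} bytes")
print(f"load time:   {elapsed:.4f} s")
print(f"peak memory: {peak_bytes} bytes")
```

If the measured time and memory are acceptable for the real index, the answer's advice applies: keep the pickle approach.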