LMDB is the database of choice when using Caffe with large datasets. This is a tutorial on how to create an LMDB database from Python. First, let's look at the pros and cons of using LMDB over HDF5.
Reasons to use HDF5:
- Simple format to read/write.
Reasons to use LMDB:
- LMDB uses memory-mapped files, giving much better I/O performance.
- Works well with really large datasets. HDF5 files are always read entirely into memory, so no single HDF5 file can exceed your memory capacity. You can easily split your data into several HDF5 files though (just put several paths to h5 files in your text file). Then again, compared to LMDB's page caching, the I/O performance won't be nearly as good.
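As a sketch of that HDF5-splitting approach (assuming `h5py` is installed; the file names, dataset names, and shapes here are made up for illustration — Caffe's HDF5Data layer reads a text file listing one HDF5 path per line):

```python
import h5py
import numpy as np

# Hypothetical data, split across two HDF5 files.
X = np.zeros((200, 3, 32, 32), dtype=np.float32)
y = np.zeros(200, dtype=np.float32)

for part, sl in enumerate([slice(0, 100), slice(100, 200)]):
    with h5py.File('train_part{}.h5'.format(part), 'w') as f:
        # Dataset names must match the tops of the HDF5Data layer.
        f.create_dataset('data', data=X[sl])
        f.create_dataset('label', data=y[sl])

# The text file the HDF5Data layer points at: one path per line.
with open('train_h5_list.txt', 'w') as f:
    f.write('train_part0.h5\n')
    f.write('train_part1.h5\n')
```

Each file stays comfortably below memory limits, at the cost of LMDB-style page caching.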
LMDB from Python
You will need the Python package lmdb as well as Caffe's Python package (run `make pycaffe` in Caffe). LMDB provides key-value storage, where each <key, value> pair will be a sample in our dataset. The key will simply be a string version of an ID value, and the value will be a serialized version of Caffe's Datum class (which is built using protobuf).
```python
import numpy as np
import lmdb
import caffe

N = 1000

# Let's pretend this is interesting data
X = np.zeros((N, 3, 32, 32), dtype=np.uint8)
y = np.zeros(N, dtype=np.int64)

# We need to prepare the database for the size. We'll set it 10 times
# greater than what we theoretically need. There is little drawback to
# setting this too big. If you still run into problems after raising
# this, you might want to try saving fewer entries in a single
# transaction.
map_size = X.nbytes * 10

env = lmdb.open('mylmdb', map_size=map_size)

with env.begin(write=True) as txn:
    # txn is a Transaction object
    for i in range(N):
        datum = caffe.proto.caffe_pb2.Datum()
        datum.channels = X.shape[1]
        datum.height = X.shape[2]
        datum.width = X.shape[3]
        datum.data = X[i].tobytes()  # or .tostring() if numpy < 1.9
        datum.label = int(y[i])
        str_id = '{:08}'.format(i)

        # The encode is only essential in Python 3
        txn.put(str_id.encode('ascii'), datum.SerializeToString())
```
You can also open up and inspect an existing LMDB database from Python:
```python
import numpy as np
import lmdb
import caffe

env = lmdb.open('mylmdb', readonly=True)
with env.begin() as txn:
    raw_datum = txn.get(b'00000000')

datum = caffe.proto.caffe_pb2.Datum()
datum.ParseFromString(raw_datum)

flat_x = np.frombuffer(datum.data, dtype=np.uint8)  # np.fromstring is deprecated
x = flat_x.reshape(datum.channels, datum.height, datum.width)
y = datum.label
```
Iterating <key, value> pairs is also easy:
```python
with env.begin() as txn:
    cursor = txn.cursor()
    for key, value in cursor:
        print(key, value)
```
This is the Python code for converting data to LMDB, but after processing my data with it and then running Caffe, I got a std::bad_alloc error. After a long struggle and a lot of reading, I found the problems:

1. Caffe's data format is four-dimensional by default: (n_samples, n_channels, height, width), so my data had to be reshaped into this format.
2. The last line, txn.put(str_id.encode('ascii'), datum.SerializeToString()), must be included. I initially thought it was unnecessary under Python 2 and kept getting errors; only later did I discover this line is required!
3. If you get the error mdb_put: MDB_MAP_FULL: Environment mapsize limit reached, it is because LMDB's default map_size is fairly small. I changed the map_size default inside lmdb/cffi.py to 1099511627776 (i.e. 1 TB) — I'm not sure that's the right way to do it — and I also changed map_size = X.nbytes in the program above to map_size = X.nbytes * 10, and then it worked!

On the advantages of using LMDB with Caffe:

1. Caffe supported LevelDB first and LMDB later. LMDB reads more efficiently and allows different programs to read the database simultaneously, while LevelDB only permits one program to read at a time. This matters when running different configurations against the same data.
2. On the question of keys: there are only so many distinct image labels (the default label is an integer representing the class), so using the label as the key would inevitably produce duplicates; the label therefore cannot be the key.
3. I don't know relational databases well, but training just reads batches of data sequentially, so a complex storage format shouldn't be needed; linear storage like this also reads efficiently.
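The fixes in points 1 and 3 above can be sketched together: reshape flat data into Caffe's (n_samples, n_channels, height, width) layout, and size the map from the array's byte count rather than patching lmdb's source (the shapes here are made up for illustration):

```python
import numpy as np

# Point 1: Caffe expects 4-D data. Suppose our samples arrived flattened.
n_samples, n_channels, height, width = 1000, 3, 32, 32
flat = np.zeros((n_samples, n_channels * height * width), dtype=np.uint8)
X = flat.reshape(n_samples, n_channels, height, width)

# Point 3: size the LMDB map from the data itself instead of editing
# lmdb/cffi.py; a 10x margin costs little since the map file is sparse.
map_size = X.nbytes * 10
print(X.shape, map_size)
```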