Problem description
I am working on a script that needs to process a rather large lexicon (620 000 words) on startup. The input lexicon is processed word by word into a defaultdict(list), with letter bigrams and trigrams as keys and, as values, lists of the words that contain the key n-gram, using:
for word in lexicon_file:
    word = word.strip().lower()
    for n in (2, 3):  # letter bigrams and trigrams
        for i in range(len(word) - n + 1):
            lexicon[word[i:i + n]].append(word)
For example:
> lexicon["ab"]
["abracadabra", "abbey", "abnormal"]
The resulting structure contains 25 000 keys; each key maps to a list of between 1 and 133 000 strings (average 500, median 20). All strings are in windows-1250 encoding.
This processing takes a lot of time (negligible compared with the expected real runtime of the script, but taxing when testing), and since the lexicon itself never changes, I figured it might be faster to serialize the resulting defaultdict(list) and then deserialize it on every subsequent startup.
What I found is that even with cPickle, deserialization takes roughly twice as long as simply processing the lexicon, with average values close to:
> normal lexicon creation
45 seconds
> cPickle deserialization
80 seconds
I don't have any experience with serialization, but I expected deserialization to be faster than normal processing, at least with the cPickle module.
My question is: is this result to be expected? Why? Is there any way to store and load my structure faster?
Recommended answer
The best way to figure out something like this is to write a bunch of tests and use timeit to see which is faster. I ran some tests below, but you should try this with your own lexicon dict, as your results may vary.
If you want the times to be more stable (accurate), you can increase the number argument to timeit; it will just make the test take longer. Also note that the value returned by timeit is the total execution time for all runs, not the time per run.
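To make the point about total versus per-run time concrete, here is a minimal sketch written for Python 3. The `work` function is a hypothetical stand-in for whatever you are measuring, and taking the minimum over `timeit.repeat` is one common way to reduce noise, not something the tests below do.

```python
from timeit import repeat, timeit

def work():
    # Hypothetical stand-in workload; replace with the function being measured.
    return sorted(str(i) for i in range(1000))

runs = 50
total = timeit(work, number=runs)   # total seconds for all 50 calls
per_run = total / runs              # average seconds per call

# Repeat the whole measurement a few times and keep the best total.
best_total = min(repeat(work, number=runs, repeat=3))

print('total: %.4f s, per run: %.6f s' % (total, per_run))
print('best of 3 totals: %.4f s' % best_total)
```

Dividing by `number` gives a per-call figure that stays comparable when you change how many runs you time.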
testing with 10 keys...
serialize flat: 2.97198390961
serialize eval: 4.60271120071
serialize defaultdict: 20.3057091236
serialize dict: 20.2011070251
serialize defaultdict new pickle: 14.5152060986
serialize dict new pickle: 14.7755970955
serialize json: 13.5039670467
serialize cjson: 4.0456969738
unserialize flat: 1.29577493668
unserialize eval: 25.6548647881
unserialize defaultdict: 10.2215960026
unserialize dict: 10.208122015
unserialize defaultdict new pickle: 5.70747089386
unserialize dict new pickle: 5.69750404358
unserialize json: 5.34811091423
unserialize cjson: 1.50241613388
testing with 100 keys...
serialize flat: 2.91076397896
serialize eval: 4.72978711128
serialize defaultdict: 21.331786871
serialize dict: 21.3218340874
serialize defaultdict new pickle: 15.7140991688
serialize dict new pickle: 15.6440980434
serialize json: 14.3557379246
serialize cjson: 5.00576901436
unserialize flat: 1.6677339077
unserialize eval: 22.9142649174
unserialize defaultdict: 10.7773029804
unserialize dict: 10.7524499893
unserialize defaultdict new pickle: 6.13370203972
unserialize dict new pickle: 6.18057107925
unserialize json: 5.92281794548
unserialize cjson: 1.91151690483
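The "new pickle" rows above come from passing a protocol of -1 (the highest one available) when dumping. A minimal sketch of using that for a build-once, load-afterwards cache follows. It is written for Python 3, where the C implementation is used automatically by the `pickle` module, and `build_lexicon`, `load_or_build` and the cache file name are hypothetical names chosen for illustration. Whether loading the cache beats rebuilding for a given lexicon is something to measure rather than assume.

```python
import os
import pickle
import tempfile
from collections import defaultdict

def build_lexicon(words):
    # Hypothetical stand-in for the slow build step: index words by letter bigram.
    lexicon = defaultdict(list)
    for word in words:
        for i in range(len(word) - 1):
            lexicon[word[i:i + 2]].append(word)
    return lexicon

def load_or_build(cache_path, words):
    # Reuse the cached pickle if it exists; otherwise build and cache it.
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:   # binary mode is required
            return pickle.load(f)
    lexicon = build_lexicon(words)
    with open(cache_path, 'wb') as f:
        pickle.dump(lexicon, f, -1)         # -1 selects the highest protocol
    return lexicon

cache = os.path.join(tempfile.mkdtemp(), 'lexicon.pickle')
first = load_or_build(cache, ['abbey', 'abnormal'])
second = load_or_build(cache, [])           # served from the cache file
```

The second call never touches the word list, so any difference in startup time comes down to `pickle.load` versus the build loop.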
The code:
import cPickle
import json
try:
    import cjson  # not Python standard library
except ImportError:
    cjson = False
from collections import defaultdict
from timeit import timeit

# Two test structures: 1,000,000 strings spread over 10 and over 100 keys.
dd1 = defaultdict(list)
dd2 = defaultdict(list)
for i in xrange(1000000):
    dd1[str(i % 10)].append(str(i))
    dd2[str(i % 100)].append(str(i))
dt1 = dict(dd1)
dt2 = dict(dd2)

def testdict(dd, dt):
    # Pickle files are opened in binary mode; protocol -1 is a binary format.
    def serialize_defaultdict():
        with open('defaultdict.pickle', 'wb') as f:
            cPickle.dump(dd, f)
    def serialize_p2_defaultdict():
        with open('defaultdict.pickle2', 'wb') as f:
            cPickle.dump(dd, f, -1)
    def serialize_dict():
        with open('dict.pickle', 'wb') as f:
            cPickle.dump(dt, f)
    def serialize_p2_dict():
        with open('dict.pickle2', 'wb') as f:
            cPickle.dump(dt, f, -1)
    def serialize_json():
        with open('dict.json', 'w') as f:
            json.dump(dt, f)
    if cjson:
        def serialize_cjson():
            with open('dict.cjson', 'w') as f:
                f.write(cjson.encode(dt))
    def serialize_flat():
        # One line per key: the key followed by its values, space-separated.
        with open('dict.flat', 'w') as f:
            f.write('\n'.join([' '.join([k] + v) for k, v in dt.iteritems()]))
    def serialize_eval():
        # One line per key: the key, a tab, then the repr() of the value list.
        with open('dict.eval', 'w') as f:
            f.write('\n'.join([k + '\t' + repr(v) for k, v in dt.iteritems()]))

    def unserialize_defaultdict():
        with open('defaultdict.pickle', 'rb') as f:
            assert cPickle.load(f) == dd
    def unserialize_p2_defaultdict():
        with open('defaultdict.pickle2', 'rb') as f:
            assert cPickle.load(f) == dd
    def unserialize_dict():
        with open('dict.pickle', 'rb') as f:
            assert cPickle.load(f) == dt
    def unserialize_p2_dict():
        with open('dict.pickle2', 'rb') as f:
            assert cPickle.load(f) == dt
    def unserialize_json():
        with open('dict.json') as f:
            assert json.load(f) == dt
    if cjson:
        def unserialize_cjson():
            with open('dict.cjson') as f:
                assert cjson.decode(f.read()) == dt
    def unserialize_flat():
        with open('dict.flat') as f:
            dtx = {}
            for line in f:
                vals = line.split()
                dtx[vals[0]] = vals[1:]
            assert dtx == dt
    def unserialize_eval():
        with open('dict.eval') as f:
            dtx = {}
            for line in f:
                vals = line.split('\t')
                dtx[vals[0]] = eval(vals[1])
            assert dtx == dt

    # Each figure printed is the total time for 10 runs.
    print 'serialize flat:', timeit(serialize_flat, number=10)
    print 'serialize eval:', timeit(serialize_eval, number=10)
    print 'serialize defaultdict:', timeit(serialize_defaultdict, number=10)
    print 'serialize dict:', timeit(serialize_dict, number=10)
    print 'serialize defaultdict new pickle:', timeit(serialize_p2_defaultdict, number=10)
    print 'serialize dict new pickle:', timeit(serialize_p2_dict, number=10)
    print 'serialize json:', timeit(serialize_json, number=10)
    if cjson:
        print 'serialize cjson:', timeit(serialize_cjson, number=10)
    print 'unserialize flat:', timeit(unserialize_flat, number=10)
    print 'unserialize eval:', timeit(unserialize_eval, number=10)
    print 'unserialize defaultdict:', timeit(unserialize_defaultdict, number=10)
    print 'unserialize dict:', timeit(unserialize_dict, number=10)
    print 'unserialize defaultdict new pickle:', timeit(unserialize_p2_defaultdict, number=10)
    print 'unserialize dict new pickle:', timeit(unserialize_p2_dict, number=10)
    print 'unserialize json:', timeit(unserialize_json, number=10)
    if cjson:
        print 'unserialize cjson:', timeit(unserialize_cjson, number=10)

print 'testing with 10 keys...'
testdict(dd1, dt1)
print 'testing with 100 keys...'
testdict(dd2, dt2)
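The flat-file variant, which was the fastest to load in the runs above, can be pulled out into a standalone pair of functions. The following is a sketch for Python 3, not a drop-in for the benchmark: `save_flat` and `load_flat` are hypothetical names, the `cp1250` codec is chosen to match the windows-1250 strings mentioned in the question, and the format assumes that neither keys nor words contain whitespace.

```python
import os
import tempfile

def save_flat(mapping, path, encoding='cp1250'):
    # One line per key: the key, then its words, separated by single spaces.
    with open(path, 'w', encoding=encoding) as f:
        for key, words in mapping.items():
            f.write(' '.join([key] + list(words)) + '\n')

def load_flat(path, encoding='cp1250'):
    # Split each line back into the key (first field) and its word list.
    result = {}
    with open(path, encoding=encoding) as f:
        for line in f:
            fields = line.split()
            if fields:
                result[fields[0]] = fields[1:]
    return result

path = os.path.join(tempfile.mkdtemp(), 'lexicon.flat')
sample = {'ab': ['abracadabra', 'abbey', 'abnormal'], 'bb': ['abbey']}
save_flat(sample, path)
restored = load_flat(path)
```

`load_flat` returns a plain dict; if the rest of the script relies on missing keys producing empty lists, wrapping the result as `defaultdict(list, restored)` restores that behaviour.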