What's the latest and greatest for fast YAML parsing in Python? Syck is out of date and recommends using PyYaml, yet PyYaml is pretty slow, and suffers from the GIL problem:
>>> def xit(f, x):
...     import threading
...     for i in xrange(x):
...         threading.Thread(target=f).start()
...
>>> def stressit():
...     start = time.time()
...     res = yaml.load(open(path_to_11000_byte_yaml_file))
...     print "Took %.2fs" % (time.time() - start,)
...
>>> xit(stressit, 1)
Took 0.37s
>>> xit(stressit, 2)
Took 1.40s
Took 1.41s
>>> xit(stressit, 4)
Took 2.98s
Took 2.98s
Took 2.99s
Took 3.00s
Given my use case I can cache the parsed objects, but I'd still prefer a faster solution even for that.
The linked wiki page states, after the warning: "Use libyaml (c), and PyYaml (python)". Note that the warning contains a bad wikilink (it should point to PyYAML, not PyYaml).
As for performance, depending on how you installed PyYAML you should have the CParser class available which implements a YAML parser written in optimized C. While I don't think this gets around the GIL issue, it is markedly faster. Here are a few cursory benchmarks I ran on my machine (AMD Athlon II X4 640, 3.0GHz, 8GB RAM):
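A common way to take advantage of this (a small sketch, assuming a standard PyYAML install) is to try the C-backed loader and fall back to the pure-Python one when the extension wasn't built:

```python
import yaml

# Use the libyaml-backed CLoader when PyYAML was compiled against libyaml;
# otherwise fall back to the pure-Python Loader.
try:
    from yaml import CLoader as Loader
except ImportError:
    from yaml import Loader

data = yaml.load("a: 1\nb: [2, 3]", Loader=Loader)
```

The same pattern applies to CDumper/Dumper when serializing.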
First with the default pure-Python parser:
$ /usr/bin/python2 -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
'yaml.load(y)'
10 loops, best of 3: 405 msec per loop
With the CParser:
$ /usr/bin/python2 -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
'yaml.load(y, Loader=yaml.CLoader)'
10 loops, best of 3: 59.2 msec per loop
And, for comparison, with PyPy using the pure-Python parser:
$ pypy -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
'yaml.load(y)'
10 loops, best of 3: 101 msec per loop
For large.yaml I just googled for "large yaml file" and came across this:
https://gist.github.com/nrh/667383/raw/1b3ba75c939f2886f63291528df89418621548fd/large.yaml
(I had to remove the first couple of lines to make it a single-doc YAML file otherwise yaml.load complains.)
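Rather than editing the file by hand, PyYAML can also iterate over a multi-document stream with load_all. A minimal sketch, using an inline two-document stream instead of the gist above:

```python
import yaml

# Two YAML documents separated by "---" in a single stream.
stream = "---\nname: first\n---\nname: second\n"

# safe_load_all yields one parsed object per document.
docs = list(yaml.safe_load_all(stream))
```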
EDIT:
Another thing to consider is using the multiprocessing module instead of threads. This gets around GIL problems, but does require a bit more boilerplate code to communicate between the processes. There are a number of good libraries available, though, that make multiprocessing easier; there's a pretty good list of them here.