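Problem description
Dictionaries are ordered in Python 3.6 (under the CPython implementation at least), unlike in previous incarnations. This seems like a substantial change, but it's only a short paragraph in the documentation. It is described as a CPython implementation detail rather than a language feature, but the documentation also implies this may become standard in the future.
How does the new dictionary implementation perform better than the older one while preserving element order?
Here is the text from the documentation: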
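dict() now uses a "compact" representation pioneered by PyPy. The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5. PEP 468 (Preserving the order of **kwargs in a function.) is implemented by this. The order-preserving aspect of this new implementation is considered an implementation detail and should not be relied upon (this may change in the future, but it is desired to have this new dict implementation in the language for a few releases before changing the language spec to mandate order-preserving semantics for all current and future Python implementations; this also helps preserve backwards-compatibility with older versions of the language where random iteration order is still in effect, e.g. Python 3.5). (Contributed by INADA Naoki in issue 27350. Idea originally suggested by Raymond Hettinger.)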
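Update December 2017: `dict`s retaining insertion order is guaranteed for Python 3.7.
They are insertion ordered[1]. As of Python 3.6, for the CPython implementation of Python, dictionaries remember the order of items inserted. This is considered an implementation detail in Python 3.6; you need to use `OrderedDict` if you want insertion ordering that's guaranteed across other implementations of Python (and other ordered behavior).
As of Python 3.7, this is no longer an implementation detail and instead becomes a language feature, following a python-dev message by GvR. This simply means that you can depend on it. Other implementations of Python must also offer an insertion ordered dictionary if they wish to be a conforming implementation of Python 3.7.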
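A short illustration of what "insertion ordered" means in practice, assuming Python 3.7 or later (or CPython 3.6). The variable names are only for illustration; the point is that iteration follows insertion order, while equality on plain dicts stays order-insensitive, unlike `OrderedDict`.

```python
from collections import OrderedDict

# Keys come back in the order they were inserted.
d = {}
d["timmy"] = "red"
d["barry"] = "green"
d["guido"] = "blue"
keys_in_order = list(d)

# Deleting a key and inserting it again moves it to the end.
del d["timmy"]
d["timmy"] = "red"
keys_after_reinsert = list(d)

# Plain dict equality ignores order; OrderedDict equality does not.
plain_equal = {"a": 1, "b": 2} == {"b": 2, "a": 1}
ordered_equal = OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1)

print(keys_in_order, keys_after_reinsert, plain_equal, ordered_equal)
```

Updating the value of an existing key does not change its position; only a delete followed by a fresh insert does.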
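As for how the new implementation performs better[2] than the older one while preserving element order: essentially, by keeping two arrays.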
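The first array, `dk_entries`, holds the entries (of type `PyDictKeyEntry`) for the dictionary in the order that they were inserted. Order is preserved because this is an append-only array: new items are always inserted at the end (insertion order).
The second, `dk_indices`, holds the indices for the `dk_entries` array (that is, values that indicate the position of the corresponding entry in `dk_entries`). This array acts as the hash table. When a key is hashed, it leads to one of the indices stored in `dk_indices`, and the corresponding entry is fetched by indexing `dk_entries`. Since only indices are kept, the type of this array depends on the overall size of the dictionary, ranging from `int8_t` (1 byte) to `int32_t`/`int64_t` (4/8 bytes) on 32/64 bit builds.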
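The two-array idea can be sketched in pure Python. This is a toy model, not CPython's code: the class name `CompactDict`, the fixed table size, and the linear probing are all assumptions made for brevity (CPython uses a different probing sequence and resizes the table), and deletion is left out.

```python
class CompactDict:
    """Toy model of the two-array layout; not CPython's actual code."""

    def __init__(self, size=8):
        self.size = size                  # number of slots, a power of two
        self.usable = (2 * size) // 3     # keep the table at most 2/3 full
        self.indices = [None] * size      # sparse hash table of positions
        self.entries = []                 # dense, append-only (hash, key, value)

    def _probe(self, key, h):
        """Return the slot that holds `key`, or the first empty slot."""
        mask = self.size - 1
        slot = h & mask
        while True:
            pos = self.indices[slot]
            if pos is None:
                return slot
            entry_hash, entry_key, _ = self.entries[pos]
            if entry_hash == h and entry_key == key:
                return slot
            slot = (slot + 1) & mask      # linear probing, for simplicity only

    def __setitem__(self, key, value):
        h = hash(key)
        slot = self._probe(key, h)
        pos = self.indices[slot]
        if pos is None:
            if len(self.entries) >= self.usable:
                raise RuntimeError("toy table is full; resizing is omitted")
            self.indices[slot] = len(self.entries)   # remember the position
            self.entries.append((h, key, value))     # always append at the end
        else:
            self.entries[pos] = (h, key, value)      # overwrite in place

    def __getitem__(self, key):
        pos = self.indices[self._probe(key, hash(key))]
        if pos is None:
            raise KeyError(key)
        return self.entries[pos][2]

    def __iter__(self):
        # Iteration walks the dense array, so it follows insertion order.
        return (key for _, key, _ in self.entries)


cd = CompactDict()
cd["timmy"] = "red"
cd["barry"] = "green"
cd["guido"] = "blue"
cd["barry"] = "yellow"    # overwriting keeps the original position
print(list(cd), cd["barry"])
```

Iteration never touches `indices`, which is why ordering comes for free: the dense `entries` list is already in insertion order.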
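In the previous implementation, a sparse array of type `PyDictKeyEntry` and size `dk_size` had to be allocated. Unfortunately, this also resulted in a lot of empty space, since that array was not allowed to be more than 2/3 * `dk_size` full for performance reasons (and the empty space still had `PyDictKeyEntry` size!).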
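This is not the case now, since only the required entries are stored (those that have been inserted), and the sparse array that is kept at most 2/3 * `dk_size` full is of type `intX_t` (`X` depending on dict size). The empty space changed from type `PyDictKeyEntry` to `intX_t`.
So, obviously, creating a sparse array of type `PyDictKeyEntry` is much more memory demanding than creating a sparse array for storing `int`s.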
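A back-of-the-envelope comparison makes the saving concrete. The numbers below are assumptions for a 64-bit build: each entry is taken to be 24 bytes (hash, key pointer, value pointer at 8 bytes each), the index width is picked from the table size, and the dense array is assumed to be sized for 2/3 of the slots. Object headers and allocator overhead are ignored, and all function names are hypothetical.

```python
ENTRY_BYTES = 24  # assumed: hash + key pointer + value pointer, 8 bytes each


def index_bytes(table_size):
    """Width of one index slot, chosen from the table size (assumed thresholds)."""
    if table_size <= 0xFF:
        return 1
    if table_size <= 0xFFFF:
        return 2
    if table_size <= 0xFFFFFFFF:
        return 4
    return 8


def old_layout_bytes(table_size):
    # Every slot, used or empty, is a full-size entry.
    return table_size * ENTRY_BYTES


def new_layout_bytes(table_size):
    # Sparse array of small integers plus a dense array of entries.
    usable = (2 * table_size) // 3
    return table_size * index_bytes(table_size) + usable * ENTRY_BYTES


for size in (8, 1024):
    print(size, old_layout_bytes(size), new_layout_bytes(size))
```

Under these assumptions an 8-slot table drops from 192 to 128 bytes, and a 1024-slot table from 24576 to 18416 bytes, roughly a quarter less.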
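You can see the full conversation on Python-Dev regarding this feature if interested; it is a good read.
In the original proposal made by Raymond Hettinger, a visualization of the data structures used can be seen which captures the gist of the idea.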
For example, the dictionary:
    d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
is currently stored as [keyhash, key, value]:
    entries = [['--', '--', '--'],
               [-8522787127447073495, 'barry', 'green'],
               ['--', '--', '--'],
               ['--', '--', '--'],
               ['--', '--', '--'],
               [-9092791511155847987, 'timmy', 'red'],
               ['--', '--', '--'],
               [-6480567542315338377, 'guido', 'blue']]
Instead, the data should be organized as follows:
    indices = [None, 1, None, None, None, 0, None, 2]
    entries = [[-9092791511155847987, 'timmy', 'red'],
               [-8522787127447073495, 'barry', 'green'],
               [-6480567542315338377, 'guido', 'blue']]
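To see how a lookup would go through the two arrays, here is a minimal sketch using the data from the proposal's example. The hash values are the ones quoted there (a real interpreter produces different string hashes on each run), and the helper `find` is hypothetical; collision handling is omitted.

```python
indices = [None, 1, None, None, None, 0, None, 2]
entries = [
    [-9092791511155847987, 'timmy', 'red'],
    [-8522787127447073495, 'barry', 'green'],
    [-6480567542315338377, 'guido', 'blue'],
]


def find(key_hash, key):
    """Look up a key given its stored hash (collision handling omitted)."""
    slot = key_hash & (len(indices) - 1)   # low bits of the hash pick the slot
    pos = indices[slot]
    if pos is None:
        raise KeyError(key)
    stored_hash, stored_key, value = entries[pos]
    if stored_hash != key_hash or stored_key != key:
        raise KeyError(key)                # a real table would keep probing here
    return value


# Slot chosen for each entry: the low three bits of its hash.
slots = [h & 7 for h, _, _ in entries]

# Insertion order is simply the order of the dense array.
insertion_order = [key for _, key, _ in entries]

print(slots, insertion_order, find(-8522787127447073495, 'barry'))
```

The slots work out to 5, 1 and 7, which are exactly the positions that hold 0, 1 and 2 in `indices`.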
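As you can now see visually, in the previous layout a lot of space is essentially empty in order to reduce collisions and make look-ups faster. With the new approach, you reduce the memory required by moving the sparseness to where it's really required: the indices.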
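[1]: I say "insertion ordered" and not "ordered" since, with the existence of OrderedDict, "ordered" suggests further behavior that the `dict` object *doesn't provide*. OrderedDicts are reversible, provide order-sensitive methods and, mainly, provide order-sensitive equality tests (`==`, `!=`). `dict`s currently don't offer any of those behaviors/methods.
[2]: The new dictionary implementation performs better **memory wise** by being designed more compactly; that's the main benefit here. Speed wise, the difference isn't so drastic: there are places where the new dict might introduce slight regressions (key look-ups, for example), while in others (iteration and resizing come to mind) a performance boost should be present. Overall, the performance of the dictionary, especially in real-life situations, improves due to the compactness introduced.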