本文介绍了为什么Python集不保留插入顺序?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

限时删除!!

最近我很惊讶地发现,虽然保证字典可以保留Python 3.7+中的插入顺序,但集合却没有:

I was surprised to discover recently that while dicts are guaranteed to preserve insertion order in Python 3.7+, sets are not:

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> d
{'a': 1, 'b': 2, 'c': 3}
>>> d['d'] = 4
>>> d
{'a': 1, 'b': 2, 'c': 3, 'd': 4}



>>> s = {'a', 'b', 'c'}
>>> s
{'b', 'a', 'c'}
>>> s.add('d')
>>> s
{'d', 'b', 'a', 'c'}

这种区别的理由是什么?导致Python团队更改dict实现的效率提高是否也不适用于集合?

What is the rationale for this difference? Do the same efficiency improvements that led the Python team to change the dict implementation not apply to sets as well?

我不是在寻找指向有序集合实现的指针,还是将dict用作集合的替代品的方法。我只是想知道为什么Python团队为什么没有同时为字典使用内置集来保留顺序。

I'm not looking for pointers to ordered-set implementations or ways to use dicts as stand-ins for sets. I'm just wondering why the Python team didn't make built-in sets preserve order at the same time they did so for dicts.

推荐答案

集合和字典针对不同的用例进行了优化。 集合的主要用途是快速的成员资格测试,该测试与订单无关。对于命令而言,查找成本是最关键的操作,并且更可能出现密钥。对于集合,元素的存在或不存在是事先未知的,因此集合实现需要针对发现和未发现的情况进行优化。此外,对联合集和交集之类的常见集合操作进行了一些优化,使其难以保留集合顺序而不降低性能。

Sets and dicts are optimized for different use-cases. The primary use of a set is fast membership testing, which is order agnostic. For dicts, cost of the lookup is the most critical operation, and the key is more likely to be present. With sets, the presence or absence of an element is not known in advance, and so the set implementation needs to optimize for both the found and not-found case. Also, some optimizations for common set operations such as union and intersection make it difficult to retain set ordering without degrading performance.

虽然两个数据结构均基于哈希,但这是常见的误解是,集合只是实现为具有空值的字典。甚至在CPython 3.6中的之前紧凑型dict实现中,set和dict实现也已经存在很大差异,几乎没有代码重用。例如,字典使用随机探测,而集合使用线性探测和开放式寻址的组合来改善缓存局部性。初始线性探针(在CPython中默认为)将检查一系列相邻的密钥/哈希对,通过降低哈希冲突处理的成本来提高性能-连续的内存访问比分散的探针便宜。

While both data structures are hash based, it's a common misconception that sets are just implemented as dicts with null values. Even before the compact dict implementation in CPython 3.6, the set and dict implementations already differed significantly, with little code reuse. For example, dicts use randomized probing, but sets use a combination of linear probing and open addressing, to improve cache locality. The initial linear probe (default 9 steps in CPython) will check a series of adjacent key/hash pairs, improving performance by reducing the cost of hash collision handling - consecutive memory access is cheaper than scattered probes.


  • dictobject.c -,,

  • -变更集以减少在Python 3.4中设置对象的哈希冲突的代价。

  • dictobject.c - master, v3.5.9
  • setobject.c - master, v3.5.9
  • issue18771 - changeset to reduce the cost of hash collisions for set objects in Python 3.4.

从理论上讲,可能进行更改CPython的set实现类似于紧凑型dict,但实际上存在缺陷,并且著名的核心开发人员反对进行此类更改。

It would be possible in theory to change CPython's set implementation to be similar to the compact dict, but in practice there are drawbacks, and notable core developers were opposed to making such a change.

关于是否为3.7压缩集合的详细讨论,并回答了为什么

A detailed discussion of whether to compactify sets for 3.7, and answers about why it was decided against, can be found in the python-dev mailing lists.

总而言之,主要要点是用法模式不同(插入顺序dict ** kwargs是,对于集合则少一些),节省压缩集的空间不太重要(因为与键,哈希和值相反,因为只有键和哈希数组才能进行致密化),并且上述集合中的线性探测优化与紧凑的实现不兼容。

In summary, the main points are that the usage patterns are different (insertion ordering dicts such as **kwargs is useful, less so for sets), the space savings for compacting sets are less significant (because there are only key and hash array to densify, as opposed to keys, hashes and values), and the aforementioned linear probing optimization in sets is incompatible with a compact implementation.

我将在下面重述Raymond的帖子,其中涵盖了最重要的观点。

I will reproduce Raymond's post below which covers the most important points.

除非我误解了,雷蒙德还是反对对
进行类似的更改。

Unless I've misunderstood, Raymond was opposed to making a similar change to set.

是的。在人们
开始疯狂之前,这里有一些关于该主题的想法。

That's right. Here are a few thoughts on the subject before people starting running wild.


  • 对于紧凑型词典来说,空格是节省是净赢,索引消耗的额外空间以及键/值/哈希数组的
    的超额分配被键/值/哈希数组提高的
    密度所抵消。但是,对于集合而言,净值要差
    ,因为我们仍然需要索引和超额分配的
    ,但是只能通过压缩三个
    阵列中的两个来抵消空间成本。换句话说,当您浪费
    键,值和散列的空间时,压缩更有意义。如果您丢失了这三个
    中的一个,它就不再引人注目。

  • For the compact dict, the space savings was a net win with the additional space consumed by the indices and the overallocation for the key/value/hash arrays being more than offset by the improved density of key/value/hash arrays. However for sets, the net was much less favorable because we still need the indices and overallocation but can only offset the space cost by densifying only two of the three arrays. In other words, compacting makes more sense when you have wasted space for keys, values, and hashes. If you lose one of those three, it stops being compelling.

集合的使用方式不同于字典。前者有更多的命中或未命中查询。后者往往缺少关键的
查找。同样,对设置到设置的操作
的一些优化使得很难在不影响
性能的情况下保留设置顺序。

The use pattern for sets is different from dicts. The former has more hit or miss lookups. The latter tends to have fewer missing key lookups. Also, some of the optimizations for the set-to-set operations make it difficult to retain set ordering without impacting performance.

我寻求替代的途径来改善布景表现。我没有进行压缩(这不会节省太多空间,并且不会带来
的额外间接成本),而是添加了线性探测以减少
冲突的成本并提高缓存性能。这种改进是
与我为
字典提倡的压缩方法不兼容。

I pursued alternative path to improve set performance. Instead of compacting (which wasn't much of space win and incurred the cost of an additional indirection), I added linear probing to reduce the cost of collisions and improve cache performance. This improvement is incompatible with the compacting approach I advocated for dictionaries.

目前,对字典的排序副作用是非保证的,因此现在开​​始坚持要求这些集合也变得为时过早。
文档已经链接到用于创建OrderedSet的配方(
),但似乎
的吸收几乎为零。同样,既然埃里克·斯诺(Eric Snow)给了我们
快速的OrderedDict,从
MutableSet和OrderedDict建立一个OrderedSet比以往任何时候都容易,但是我再也没有观察到任何真正的
兴趣,因为典型的按组设置数据分析实际上并不需要
或关心订购。同样,快速成员资格
测试的主要用途是不可知的。

For now, the ordering side-effect on dictionaries is non-guaranteed, so it is premature to start insisting the sets become ordered as well. The docs already link to a recipe for creating an OrderedSet ( https://code.activestate.com/recipes/576694/ ) but it seems like the uptake has been nearly zero. Also, now that Eric Snow has given us a fast OrderedDict, it is easier than ever to build an OrderedSet from MutableSet and OrderedDict, but again I haven't observed any real interest because typical set-to-set data analytics don't really need or care about ordering. Likewise, the primary use of fast membership testings is order agnostic.

也就是说,我确实认为有余地可以添加替代集实现PyPI。特别是,对于可订购数据有一些有趣的
特殊情况,其中可以通过比较整个键范围来加快
的设置操作(请参阅

作为起点)。在IIRC中,PyPI已经具有用于类似集合的绽放
过滤器和布谷鸟哈希的代码。

That said, I do think there is room to add alternative set implementations to PyPI. In particular, there are some interesting special cases for orderable data where set-to-set operations can be sped-up by comparing entire ranges of keys (see https://code.activestate.com/recipes/230113-implementation-of-sets-using-sorted-lists for a starting point). IIRC, PyPI already has code for set-like bloom filters and cuckoo hashing.

我知道拥有一个主要的块很令人兴奋代码已被Python核心接受,但除非我们确定
是有保证的,否则不应向
进行更多其他主要数据类型重写的闸门。

I understanding that it is exciting to have a major block of code accepted into the Python core but that shouldn't open to floodgates to engaging in more major rewrites of other datatypes unless we're sure that it is warranted.

– Raymond Hettinger

– Raymond Hettinger

来自,2016年9月。

From [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered, Sept 2016.

这篇关于为什么Python集不保留插入顺序?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

1403页,肝出来的..

09-06 20:58