Fastest way to get a sorted, unique list in Python?

Question

What is the fastest way to get a sorted, unique list in Python? (I have a list of hashable things and want something I can iterate over. It doesn't matter whether the list is modified in place, or I get a new list, or an iterable. In my concrete use case I'm doing this with a throwaway list, so in place would be more memory efficient.)

I've seen solutions like

input = [5, 4, 2, 8, 4, 2, 1]
sorted(set(input))

but it seems to me that first checking for uniqueness and then sorting is wasteful (when you sort the list, you basically have to determine insertion points, and so you get the uniqueness test as a side effect). Maybe there is something more along the lines of Unix's

cat list | sort | uniq

that just picks out consecutive duplicates in an already sorted list?
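
To make the idea concrete, here is a minimal sketch of such a `uniq`-style pass in Python. The helper name `uniq_sorted` is hypothetical (it is not a standard library function), and it assumes its input is already sorted, so it only has to drop consecutive repeats.

```python
def uniq_sorted(sorted_items):
    """Yield each item of an already sorted iterable once, skipping consecutive repeats."""
    sentinel = object()  # marker that cannot be equal to any real item
    last = sentinel
    for item in sorted_items:
        if last is sentinel or item != last:
            yield item
            last = item


data = [5, 4, 2, 8, 4, 2, 1]
result = list(uniq_sorted(sorted(data)))
print(result)  # [1, 2, 4, 5, 8]
```

Note that the loop here runs at Python level, so this shows the approach rather than claiming it is fast.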

Note that in the question 'Fastest way to uniqify a list in Python' the list is not sorted, and 'What is the cleanest way to do a sort plus uniq on a Python list?' asks for the cleanest / most Pythonic way; its accepted answer suggests sorted(set(input)), which is what I'm trying to improve on.

Recommended answer

I believe sorted(set(sequence)) is the fastest way of doing it. Yes, set iterates over the whole sequence, but that is a C-level loop, which is a lot faster than any loop you would write at Python level.

Note that even with groupby you still have O(n) + O(n log n) = O(n log n), and what's worse is that groupby requires a Python-level loop, which dramatically increases the constants in that O(n), so in the end you get worse results.

When speaking of CPython, the way to optimize things is to do as much as you can at C level (see this answer for another example of counter-intuitive performance). To get a faster solution you would have to reimplement the sort in a C extension. And even then, good luck obtaining something as fast as Python's Timsort!
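
For the throwaway-list case in the question, one way to stay entirely within C-level builtins while still updating the existing list object is slice assignment. This is only a sketch: the function name `sort_unique_in_place` is made up for illustration, and the approach still builds a temporary set and a temporary sorted list, so it keeps the list's identity without saving memory.

```python
def sort_unique_in_place(items):
    """Replace the contents of `items` with its sorted, de-duplicated values."""
    # set() and sorted() both loop in C; slice assignment then overwrites
    # the existing list object instead of rebinding the name.
    items[:] = sorted(set(items))


throwaway = [5, 4, 2, 8, 4, 2, 1]
alias = throwaway  # second reference to the same list object
sort_unique_in_place(throwaway)
print(throwaway)           # [1, 2, 4, 5, 8]
print(alias is throwaway)  # True
```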

A small comparison of the "canonical solution" versus the groupby solution:

>>> import timeit
>>> sequence = list(range(500)) + list(range(700)) + list(range(1000))
>>> timeit.timeit('sorted(set(sequence))', 'from __main__ import sequence', number=1000)
0.11532402038574219
>>> import itertools
>>> def my_sort(seq):
...     return list(k for k,_ in itertools.groupby(sorted(seq)))
...
>>> timeit.timeit('my_sort(sequence)', 'from __main__ import sequence, my_sort', number=1000)
0.3162040710449219

As you can see, it's almost 3 times slower.

The version provided by jdm is actually even worse:

>>> def make_unique(lst):
...     if len(lst) <= 1:
...         return lst
...     last = lst[-1]
...     for i in range(len(lst) - 2, -1, -1):
...         item = lst[i]
...         if item == last:
...             del lst[i]
...         else:
...             last = item
...     return lst
...
>>> def my_sort2(seq):
...     return make_unique(sorted(seq))
...
>>> timeit.timeit('my_sort2(sequence)', 'from __main__ import sequence, my_sort2', number=1000)
0.46814608573913574

Roughly 4 times slower. Note that calling seq.sort() followed by make_unique(seq) is effectively the same as make_unique(sorted(seq)): since Timsort uses O(n) space, some reallocation always happens, so using sorted(seq) does not change the timings much.

jdm's benchmarks give different results because the inputs he uses are far too small, so nearly all of the measured time is taken by the time.clock() calls.
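
To avoid that pitfall, a benchmark can let timeit run each candidate many times on a reasonably large input and keep the best of several repeats. The script below is a sketch along those lines; the names `with_set`, `with_groupby` and `best_time` are illustrative, and the `number` and `repeat` values are arbitrary choices. It also avoids time.clock(), which was removed in Python 3.8. No timings are shown because they depend on the machine.

```python
import itertools
import timeit

# Same input as in the comparison above: 2200 items, 1000 distinct values.
sequence = list(range(500)) + list(range(700)) + list(range(1000))


def with_set(seq):
    return sorted(set(seq))


def with_groupby(seq):
    return [k for k, _ in itertools.groupby(sorted(seq))]


# Check that both candidates agree before timing them.
assert with_set(sequence) == with_groupby(sequence)


def best_time(func, seq, number=200, repeat=5):
    """Return the fastest of `repeat` runs, each calling func(seq) `number` times."""
    timer = timeit.Timer(lambda: func(seq))
    return min(timer.repeat(repeat=repeat, number=number))


timings = {f.__name__: best_time(f, sequence) for f in (with_set, with_groupby)}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Taking the minimum of the repeats reduces the influence of other processes on the measurement.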
