问题描述
切片 Python 字符串的时间复杂度是多少?鉴于 Python 字符串是不可变的,我可以想象将它们切片为 O(1)
或 O(n)
取决于切片的实现方式.
What's the time complexity of slicing a Python string? Given that Python strings are immutable, I can imagine slicing them being either O(1)
or O(n)
depending upon how slicing is implemented.
我需要编写一个函数来遍历(可能很大)字符串的所有后缀.我可以通过将后缀表示为整个字符串的元组加上一个索引来开始读取字符来避免对字符串进行切片,但这很丑陋.如果我天真地写我的函数是这样的:
I need to write a function that iterates over all suffixes of a (potentially big) string. I could avoid slicing the string by representing a suffix as a tuple of the entire string plus an index to start reading characters from, but that's ugly. If instead I naively write my function like this:
def do_something_on_all_suffixes(big_string):
for i in range(len(big_string)):
suffix = big_string[i:]
some_constant_time_operation(suffix)
...它的时间复杂度是O(n)
还是O(n)
,其中n
是 len(big_string)
?
... will its time complexity be O(n)
or O(n)
, where n
is len(big_string)
?
推荐答案
简短回答:str
切片,一般来说,复制.这意味着为每个字符串的 n
后缀做切片的函数正在执行 O(n)
工作.也就是说,如果您可以使用 bytes 的对象,则可以避免复制="noreferrer">memoryview
s 以获取原始字节数据的零拷贝视图.请参阅下面的如何进行零拷贝切片,了解如何使其工作.
Short answer: str
slices, in general, copy. That means that your function that does a slice for each of your string's n
suffixes is doing O(n)
work. That said, you can avoid copies if you can work with bytes
-like objects using memoryview
s to get zero-copy views of the original bytes data. See How you can do zero copy slicing below for how to make it work.
长答案:(C)Python str
不要通过引用数据子集的视图来切片.str
切片正好有三种操作模式:
Long answer: (C)Python str
do not slice by referencing a view of a subset of the data. There are exactly three modes of operation for str
slicing:
- 完整切片,例如
mystr[:]
:返回对完全相同的str
的引用(不仅仅是共享数据,相同的实际对象,mystr 是 mystr[:]
> 因为str
是不可变的,所以这样做没有风险) - 零长度切片和(取决于实现)缓存长度为 1 的切片;空字符串是单例(
mystr[1:1] is mystr[2:2] is ''
),长度为 1 的低序数字符串也是缓存的单例(在 CPython 3.5.0 上),看起来所有可以用 latin-1 表示的字符,即range(256)
中的 Unicode 序数,都被缓存了) - 所有其他切片:切片的
str
在创建时被复制,此后与原始str
无关
- Complete slice, e.g.
mystr[:]
: Returns a reference to the exact samestr
(not just shared data, the same actual object,mystr is mystr[:]
sincestr
is immutable so there is no risk to doing so) - The zero length slice and (implementation dependent) cached length 1 slices; the empty string is a singleton (
mystr[1:1] is mystr[2:2] is ''
), and low ordinal strings of length one are cached singletons as well (on CPython 3.5.0, it looks like all characters representable in latin-1, that is Unicode ordinals inrange(256)
, are cached) - All other slices: The sliced
str
is copied at creation time, and thereafter unrelated to the originalstr
#3 是一般规则的原因是为了避免将大 str
保存在内存中的问题,因为它的一小部分.如果你有一个 1GB 的文件,读入它并像这样切片(是的,当你可以搜索时这是浪费,这是为了说明):
The reason why #3 is the general rule is to avoid issues with large str
being kept in memory by a view of a small portion of it. If you had a 1GB file, read it in and sliced it like so (yes, it's wasteful when you can seek, this is for illustration):
with open(myfile) as f:
data = f.read()[-1024:]
那么您将有 1 GB 的数据保存在内存中以支持显示最后 1 KB 的视图,这是一种严重的浪费.由于切片通常很小,因此在切片上复制而不是创建视图几乎总是更快.这也意味着 str
可以更简单;它需要知道它的大小,但它也不需要跟踪数据中的偏移量.
then you'd have 1 GB of data being held in memory to support a view that shows the final 1 KB, a serious waste. Since slices are usually smallish, it's almost always faster to copy on slice instead of create views. It also means str
can be simpler; it needs to know its size, but it need not track an offset into the data as well.
有的方法可以在 Python 中执行基于视图的切片,而在 Python 2 中,它将适用于 str
(因为 str
是字节- 就像在 Python 2 中一样,支持 缓冲区协议).使用 Py2 str
和 Py3 bytes
(以及许多其他数据类型,如 bytearray
、array.array
、numpy
数组、mmap.mmap
s 等),您可以创建一个 memoryview
即原对象的零拷贝视图,可以切片不拷贝数据.因此,如果您可以使用(或编码)到 Py2 str
/Py3 bytes
,并且您的函数可以使用任意 bytes
-like 对象,那么你可以这样做:
There are ways to perform view based slicing in Python, and in Python 2, it will work on str
(because str
is bytes-like in Python 2, supporting the buffer protocol). With Py2 str
and Py3 bytes
(as well as many other data types like bytearray
, array.array
, numpy
arrays, mmap.mmap
s, etc.), you can create a memoryview
that is a zero copy view of the original object, and can be sliced without copying data. So if you can use (or encode) to Py2 str
/Py3 bytes
, and your function can work with arbitrary bytes
-like objects, then you could do:
def do_something_on_all_suffixes(big_string):
# In Py3, may need to encode as latin-1 or the like
remaining_suffix = memoryview(big_string)
# Rather than explicit loop, just replace view with one shorter view
# on each loop
while remaining_suffix: # Stop when we've sliced to empty view
some_constant_time_operation(remaining_suffix)
remaining_suffix = remaining_suffix[1:]
memoryview
的切片确实创建了新的视图对象(它们只是具有固定大小的超轻量级,与它们查看的数据量无关),只是没有任何数据,所以 some_constant_time_operation
可以根据需要存储一个副本,以后我们将其切片时不会更改它.如果您需要一个正确的副本作为 Py2 str
/Py3 bytes
,您可以调用 .tobytes()
来获取原始 bytes
obj,或(在 Py3 中仅出现),将其直接解码为从缓冲区复制的 str
,例如str(remaining_suffix[10:20], 'latin-1')
.
The slices of memoryview
s do make new view objects (they're just ultra-lightweight with fixed size unrelated to the amount of data they view), just not any data, so some_constant_time_operation
can store a copy if needed and it won't be changed when we slice it down later. Should you need a proper copy as Py2 str
/Py3 bytes
, you can call .tobytes()
to get the raw bytes
obj, or (in Py3 only it appears), decode it directly to a str
that copies from the buffer, e.g. str(remaining_suffix[10:20], 'latin-1')
.
这篇关于字符串切片的时间复杂度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!