Two Python strings with the same characters, a == b, may share memory, id(a) == id(b), or may be in memory twice, id(a) != id(b). Try
ab = "ab"
print id( ab ), id( "a"+"b" )
Here Python recognizes that the newly created "a"+"b" is the same as the "ab" already in memory -- not bad.
Now consider an N-long list of state names [ "Arizona", "Alaska", "Alaska", "California" ... ] (N ~ 500000 in my case).
I see 50 different id()s ⇒ each string "Arizona" ... is stored only once, fine.
BUT write the list to disk and read it back in again: the "same" list now has N different id()s, way more memory, see below.
How come -- can anyone explain Python string memory allocation?
""" when does Python allocate new memory for identical strings ?
ab = "ab"
print id( ab ), id( "a"+"b" ) # same !
list of N names from 50 states: 50 ids, mem ~ 4N + 50S, each string once
but list > file > mem again: N ids, mem ~ N * (4 + S)
"""
from __future__ import division
from collections import defaultdict
from copy import copy
import cPickle
import random
import sys
states = dict(
    AL = "Alabama",
    AK = "Alaska",
    AZ = "Arizona",
    AR = "Arkansas",
    CA = "California",
    CO = "Colorado",
    CT = "Connecticut",
    DE = "Delaware",
    FL = "Florida",
    GA = "Georgia",
)
def nid(alist):
    """ nr distinct ids """
    return "%d ids %d pickle len" % (
        len( set( map( id, alist ))),
        len( cPickle.dumps( alist, 0 ))) # rough est ?
# cf http://stackoverflow.com/questions/2117255/python-deep-getsizeof-list-with-contents
N = 10000
exec( "\n".join( sys.argv[1:] )) # var=val ...
# big list of random names of states --
names = []
for j in xrange(N):
    name = copy( random.choice( states.values() ))
    names.append(name)
print "%d strings in mem: %s" % (N, nid(names) ) # 10 ids, even with copy()
# list to a file, back again -- each string is allocated anew
joinsplit = "\n".join(names).split() # same as > file > mem again
assert joinsplit == names
print "%d strings from a file: %s" % (N, nid(joinsplit) )
# 10000 strings in mem: 10 ids 42149 pickle len
# 10000 strings from a file: 10000 ids 188080 pickle len
# Python 2.6.4 mac ppc
Added 25jan:
There are two kinds of strings in Python memory (or any program's):
- Ustrings, in a Ucache of unique strings: these save memory, and make a == b faster if both are in the Ucache
- Ostrings, the others, which may be stored any number of times.
intern( astring ) puts astring in the Ucache (Alex +1); other than that we know nothing at all about how Python moves Ostrings to the Ucache -- how did "a"+"b" get in, after "ab"? ("Strings from files" is meaningless -- there's no way of knowing.)
In short, Ucaches (there may be several) remain murky.
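A minimal sketch of the intern() route, assuming Python 3, where the builtin moved to sys.intern (in Python 2 it is plain intern). STATE_NAMES, count_distinct_ids and simulate_file_round_trip are made-up names for illustration, and the join/split stands in for writing the list to a file and reading it back. How many distinct objects the reloaded list holds is up to the implementation, so the sketch does not claim a number for it.

```python
import sys

# Hypothetical stand-in for the state names used in the script above.
STATE_NAMES = ["Alabama", "Alaska", "Arizona", "Arkansas", "California"]


def count_distinct_ids(strings):
    """Count distinct string objects (not distinct values) in the list."""
    return len(set(map(id, strings)))


def simulate_file_round_trip(strings):
    """Join and split again, which builds the strings afresh, much as reading a file would."""
    return "\n".join(strings).split("\n")


names = [STATE_NAMES[i % len(STATE_NAMES)] for i in range(1000)]
reloaded = simulate_file_round_trip(names)

# sys.intern returns one canonical object per distinct string value.
interned = [sys.intern(s) for s in reloaded]

print(count_distinct_ids(names))     # one object per distinct name
print(count_distinct_ids(reloaded))  # implementation-dependent
print(count_distinct_ids(interned))  # one object per distinct name again
```

Interning every line as it is read costs a dictionary lookup per string, so it only pays off when duplicates are common, as with the state names here.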
A historical footnote: SPITBOL uniquified all strings ca. 1970.
Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings) -- either making a new one, or finding an existing equal one and using one more reference to it, is just fine from the language's point of view. In practice, of course, real-world implementations strike a reasonable compromise: use one more reference to a suitable existing object when locating such an object is cheap and easy, and just make a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could take a long time searching.
So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants-pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).
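The single-function case can be checked with a small sketch like the one below. It assumes CPython under Python 3, since inspecting `__code__.co_consts` is an implementation detail rather than something the language promises; same_literal_twice is a made-up name. On CPython I would expect the literal to appear once in the function's constants and both names to refer to that one object, but another implementation would be within its rights to behave differently.

```python
def same_literal_twice():
    # The same literal written twice inside one function body.
    first = "hello, world!"
    second = "hello, world!"
    return first is second


# How many times the literal was stored in this function's constants pool.
occurrences = same_literal_twice.__code__.co_consts.count("hello, world!")

print(same_literal_twice())
print(occurrences)
```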
I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
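A roll-your-own pool for non-string immutables could look roughly like this. It is a sketch, not an established recipe: ConstantPool and canonical are hypothetical names, the code assumes Python 3, and it assumes every pooled value is hashable and truly immutable.

```python
class ConstantPool:
    """Hypothetical helper: keeps one canonical object per distinct hashable value."""

    def __init__(self):
        self._pool = {}

    def canonical(self, value):
        # setdefault stores value if no equal key exists yet, then returns
        # whichever object is stored, so equal values share one object.
        # Caveat: values that compare equal across types (1, 1.0, True)
        # would be merged too.
        return self._pool.setdefault(value, value)

    def __len__(self):
        return len(self._pool)


pool = ConstantPool()

# Nine tuples built at run time, with only three distinct values among them.
rows = [tuple([i % 3, "state"]) for i in range(9)]
shared = [pool.canonical(row) for row in rows]

print(len(pool))                   # number of distinct values seen
print(len(set(map(id, shared))))   # number of distinct objects kept
```

The pool holds a strong reference to every value it has seen, so it trades a permanently growing dictionary for the saved duplicates; that is only worthwhile when the values are large or heavily repeated.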