


I was trying to figure out which integers Python only instantiates once (-5 to 256, it seems), and in the process stumbled on some string behaviour I can't see the pattern in. Sometimes equal strings created in different ways share the same id, sometimes not. This code:

A = "10000"
B = "10000"
C = "100" + "00"
D = "%i"%10000
E = str(10000)
F = str(10000)
G = str(100) + "00"
H = "0".join(("10","00"))

for obj in (A,B,C,D,E,F,G,H):
    print obj, id(obj), obj is A


10000 4959776 True
10000 4959776 True
10000 4959776 True
10000 4959776 True
10000 4959456 False
10000 4959488 False
10000 4959520 False
10000 4959680 False


I don't even see the pattern - save for the fact that the first four don't have an explicit function call - but surely that can't be it, since the "+" in C for example implies a function call to add. I especially don't understand why C and G are different, seeing as that implies that the ids of the components of the addition are more important than the outcome.


So, what is the special treatment that A-D undergo, making them come out as the same instance?



In terms of language specification, any compliant Python compiler and runtime is fully allowed, for any instance of an immutable type, to make a new instance OR find an existing instance of the same type that's equal to the required value and use a new reference to that same instance. This means it's always incorrect to rely on "is" or by-id comparison to test immutables for equality, and any minor release may tweak or change strategy in this matter to enhance optimization.
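A minimal sketch of the practical consequence, written for Python 3 (the listing in the question is Python 2). The names a, b and c are just illustrative. Equality is what the language guarantees; the identity results are whatever the implementation at hand happens to do:

```python
# Equal strings built in different ways: compare by value, not identity.
a = "10000"
b = str(10000)
c = "0".join(("10", "00"))

# Value equality holds on any conforming implementation.
print(a == b == c)

# Identity is an implementation detail: either answer is legal here,
# and it may differ between releases.
print(a is b, a is c)
```

Code that needs to know whether two strings hold the same text should therefore use ==, and keep "is" for singletons such as None.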


In terms of implementations, the tradeoffs are pretty clear: trying to reuse an existing instance may mean time spent (perhaps wasted) trying to find such an instance, but if the attempt succeeds then some memory is saved (as well as the time to allocate and later free the memory needed to hold a new instance).


How to solve those implementation tradeoffs is not entirely obvious -- if you can identify heuristics that indicate that finding a suitable existing instance is likely and the search (even if it fails) will be fast, then you may want to attempt the search-and-reuse when the heuristics suggest it, but skip it otherwise.
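To make the tradeoff concrete, here is a toy search-and-reuse cache. It is purely hypothetical: InternPool, max_len and the length-based heuristic are invented for illustration and are not how any particular Python implementation decides. The search costs one dictionary lookup, and the heuristic skips it for strings judged unlikely to repeat:

```python
class InternPool:
    """Toy search-and-reuse cache for strings (hypothetical, for illustration)."""

    def __init__(self, max_len=20):
        # Heuristic (an assumption): only short strings are worth a lookup.
        self.max_len = max_len
        self._seen = {}

    def get(self, s):
        if len(s) > self.max_len:
            # Skip the search: reuse is judged unlikely to pay off.
            return s
        # One dict lookup; on a miss, s itself becomes the canonical instance.
        return self._seen.setdefault(s, s)


pool = InternPool()
x = pool.get(str(10000))
y = pool.get("0".join(("10", "00")))
print(x == y, x is y)  # True True: the second call reuses the first instance
```

The standard library's sys.intern offers a comparable service backed by the interpreter's own table, for the cases where a program really does want one shared instance per distinct string.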


In your observations you seem to have found a particular dot-release implementation that performs a modicum of peephole optimization when that's entirely safe, fast, and simple, so the assignments A to D all boil down to exactly the same as A (but E to H don't, as they involve named functions or methods whose semantics the optimizer's authors may reasonably have considered not 100% safe to assume -- and low-ROI even if that were done -- so they're not peephole-optimized).


Thus, A to D reusing the same instance boils down to A and B doing so (as C and D get peephole-optimized to exactly the same construct).
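One way to check this on whatever interpreter is at hand is to compile the assignments without running them and look at the constants the compiler stored. This is a sketch for Python 3; what it prints depends on the implementation and version, so treat the comments as what one would expect on a CPython that folds constant expressions, not as a guarantee:

```python
import dis

# Compile, without running, two of the assignments from the question.
folded = compile('C = "100" + "00"', "<demo>", "exec")
called = compile('G = str(100) + "00"', "<demo>", "exec")

# Constants stored in each code object. If the compiler folds constant
# expressions, the first tuple already holds "10000"; the second holds
# only the pieces, because str(100) has to be called at run time.
print(folded.co_consts)
print(called.co_consts)

# The bytecode shows the same thing instruction by instruction.
dis.dis(folded)
dis.dis(called)
```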

That reuse, in turn, clearly suggests compiler tactics/optimizer heuristics whereby identical literal constants of an immutable type in the same function's local namespace are collapsed to references to just one instance in the function's .func_code.co_consts (to use current CPython's terminology for attributes of functions and code objects; Python 3 spells the first attribute .__code__) -- reasonable tactics and heuristics, as reuse of the same immutable constant literal within one function is somewhat frequent, AND the price is only paid once (at compile time) while the advantage is accrued many times (every time the function runs, maybe within loops etc etc).
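The collapsing of identical literals can be inspected directly. A small sketch for Python 3, where the function name demo is just an illustration; the counts and the identity result are implementation details, so no particular output is promised:

```python
def demo():
    a = "10000"
    b = "10000"
    return a is b


# Python 3 spells the attribute __code__; Python 2 called it func_code.
consts = demo.__code__.co_consts
print(consts)                 # the constants stored for this function
print(consts.count("10000"))  # how many copies of the literal were kept
print(demo())                 # whether both names refer to one instance
```

If the literal shows up once in co_consts, both assignments load that single object, which is the behaviour the question observed for A and B.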


(It so happens that these specific tactics and heuristics, given their clearly-positive tradeoffs, have been pervasive in all recent versions of CPython, and, I believe, IronPython, Jython, and PyPy as well;-).

This is a somewhat worthy and interesting area of study if you're planning to write compilers, runtime environments, peephole optimizers, etc etc, for Python itself or similar languages. I guess that deep study of the internals (ideally of many different correct implementations, of course, so as not to fixate on the quirks of a specific one -- good thing Python currently enjoys at least 4 separate production-worthy implementations, not to mention several versions of each!) can also help, indirectly, make one a better Python programmer. But it's particularly important to focus on what's guaranteed by the language itself, which is somewhat less than what you'll find in common among separate implementations: the parts that "just happen" to be in common right now (without being required to be so by the language specs) may perfectly well change under you at the next point release of one or another implementation and, if your production code was mistakenly relying on such details, that might cause nasty surprises;-). Plus -- it's hardly ever necessary, or even particularly helpful, to rely on such variable implementation details rather than on language-mandated behavior (unless you're coding something like an optimizer, debugger, profiler, or the like, of course;-).


07-31 16:42