<<

Python: The Full Monty                      

  

 

A Tested Semantics for the 

 

    

    

Python Programming Language 

Joe Gibbs Politz Alejandro Martinez Matthew Milano Providence, RI, USA La Plata, BA, Argentina Providence, RI, USA [email protected] [email protected] [email protected]

Sumner Warren Daniel Patterson Junsong Li Providence, RI, USA Providence, RI, USA Beijing, China [email protected] [email protected] ljs.darkfi[email protected]

Anand Chitipothu Shriram Krishnamurthi Bangalore, India Providence, RI, USA [email protected] [email protected]

Abstract ity it now has several third-party tools, including analyzers We present a small-step operational semantics for the Python that check for various potential error patterns [2, 5, 11, 13]. programming language. We present both a core language It also features interactive development environments [1, 8, for Python, suitable for tools and proofs, and a translation 14] that offer a variety of features such as variable-renaming process for converting Python source to this core. We have refactorings and code completion. Unfortunately, these tools tested the composition of translation and evaluation of the are unsound: for instance, the simple eight-line program core for conformance with the primary Python implementa- shown in the appendix uses no “dynamic” features and con- tion, thereby giving confidence in the fidelity of the seman- fuses the variable renaming feature of these environments. tics. We briefly report on the engineering of these compo- The difficulty of reasoning about Python becomes even nents. Finally, we examine subtle aspects of the language, more pressing as the language is adopted in increasingly im- identifying scope as a pervasive concern that even impacts portant domains. For instance, the US Securities and Ex- features that might be considered orthogonal. change Commission has proposed using Python as an ex- ecutable specification of financial contracts [12], and it is Categories and Subject Descriptors J.3 [Life and Medical now being used to script new network paradigms [10]. Thus, Sciences]: Biology and Genetics it is vital to have a precise semantics available for analyzing Keywords serpents programs and proving properties about them. This paper presents a semantics for much of Python (sec- 1. Motivation and Contributions tion 5). To make the semantics tractable for tools and proofs, we divide it into two parts: a core language, λ , with a small The Python programming language is currently widely used π number of constructs, and a desugaring that trans- in industry, science, and education. Because of its popular- lates source programs into the core.1 The core language is a mostly traditional stateful lambda-calculus augmented with Permission to make digital or hard copies of all or part of this work for personal or features to represent the essence of Python (such as method classroom use is granted without fee provided that copies are not made or distributed lookup order and primitive lists), and should thus be familiar for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM to its potential users. must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a 1 fee. Request permissions from [email protected]. The term desugaring is evocative but slightly misleading, because ours is OOPSLA ’13, October 29-31 2013, Indianapolis, IN, USA. really a compiler to a slightly different language. Nevertheless, it is more Copyright © 2013 ACM 978-1-4503-2374-1/13/10. . . $15.00. suggestive than a general term like “compiler”. We blame Arjun Guha for http://dx.doi.org/10.1145/2509136.2509536 the confusing terminology.

Because desugaring converts Python surface syntax to the Σ ::= ((ref v+undef) ...) core language, when it is composed with an interpreter for ref ::= natural λπ (which is easy to write), we have another implementation v, val ::= 〈val,mval,{string:ref,...}〉 of Python. We can then ask how this implementation com- | 〈x,mval,{string:ref,...}〉 pares with the traditional CPython implementation, which | @ref | (sym string) represents a form of ground truth. By carefully adjusting v+undef ::= v | ☠ the core and desugaring, we have achieved high fidelity with e+undef ::= e | ☠ CPython. Therefore, users can built tools atop λπ, confident t ::= global | local that they are conformant with the actual language. mval ::= (no-meta) | number | string | meta-none [ ...] ( ...) { ...} In the course of creating this high-fidelity semantics, we | val | val | val | (meta-classx) | (meta-code (x ...)xe) identified some peculiar corners of the language. In partic- | λ(x ...) opt-var.e ular, scope is non-trivial and interacts with perhaps unex- opt-var ::= (x) | (no-var) pected features. Our exposition focuses on these aspects. e ::= v | ref | (fetche) | (set!ee) | (alloce) In sum, this paper makes the following contributions: | e[e] | e[e :=e] if • a core semantics for Python, dubbed λπ, which is defined | eee | e e as a reduction semantics using PLT Redex [3]; | let x= e+undef ine | x | e :=e | (deletee) • an interpreter, dubbed λπ↓, implemented in 700LOC of | e(e ...) | e(e ...)*e | (framee) | (returne) Racket, that has been tested against the Redex model; | (whileeee) | (loopee) | break | continue | (builtin-prim op(e ...)) • a desugaring translation from Python programs to λπ, | fun ( ...) implemented in Racket; x opt-var e | 〈e,mval〉 | list〈e,[e ...]〉 • a demonstration of conformance of the composition of | 〈e,(e ...)〉 | set〈e,(e ...)〉 desugaring with λπ↓ to CPython; and, | (tryexceptexee) | (tryfinallyee) | (raise ) | (err ) • insights about Python gained from this process. e val | (moduleee) | (construct-modulee) Presenting the semantics in full is neither feasible, given | (in-moduleeε) space limits, nor especially enlightening. We instead focus on the parts that are important or interesting. We first give Figure 1: λπ expressions an overview of λπ’s value and object model. We then in- troduce desugaring through classes. We then discuss gen- erators, classes, and their interaction with scope. Finally, we describe the results of testing our semantics against either another λπ value or the name of a built-in class. The CPython. All of our code is available online at https: primitive content, or meta-val, position holds special kinds //www.github.com/brownplt/lambda-py. of builtin data, of which there is one per builtin type that λπ models: numbers, strings, the distinguished meta-none 2 2. Warmup: A Quick Tour of λπ value, lists, , sets, classes, and functions. The distinguished ☠ (“skull”) form represents uninitial- We provide an overview of the object model of λπ and ized heap locations whose lookup should raise an excep- Python, some of the basic operations on objects, and the tion. Within expressions, this form can only appear in let- shape of our small step semantics. This introduces notation bindings whose binding position can contain both expres- and concepts that will be used later to explain the harder sions and . The evaluation of in a -binding allocates parts of Python’s semantics. it on the heap. Thereafter, it is an error to look up a store location containing ; that location must have a value 2.1 λπ Values assigned into it for lookup to succeed. Figure 2 shows the Figure 1 shows all the values and expressions of λπ. The behavior of lookup in heaps Σ for values and for . This metavariables v and val range over the values of the lan- notion of undefined locations will come into play when we guage. All values in λπ are either objects, written as triples discuss scope in section 4.2. in 〈〉, or references to entries in the store Σ, written @ref. Python programs cannot manipulate object values di- Each λπ object is written as a triple of one of the forms: rectly; rather, they always work with references to objects. 〈v,mval,{string:ref,...}〉 Thus, many of the operations in λπ involve the heap, and 〈x,mval,{string:ref,...}〉 few are purely functional. As an example of what such an looks like, take constructing a list. This takes the These objects have their class in the first position, their primitive content in the second, and the dictionary of string- 2 We express dictionaries in terms of lists and tuples, so we do not need to indexed fields that they hold in the third. The class value is introduce a special mval form for them.

(E[ let x= v+undef ine]εΣ) [E-LetLocal]

(E[[x/ref]e]εΣ 1)

where (Σ1 ref) = alloc(Σ,v+undef)

(E[ref]εΣ) (E[val]εΣ) [E-GetVar]

where Σ = ((ref1 v+undef1) ... (ref val) (refn v+undefn) ...) (E[ref]εΣ) [E-GetVarUndef] (E[(raise〈%str,“Uninitialized local”,{}〉)]εΣ)

where Σ = ((ref1 v+undef1) ... (ref ☠)(refn v+undefn) ...)

Figure 2: Let-binding identifiers, and looking up references

(E[〈val,mval〉]εΣ) [E-Object] (E[(fetch @ref )]εΣ) [E-Fetch]

(E[@refnew ]εΣ 1) (E[Σ(ref)]εΣ)

where (Σ1 refnew) = alloc(Σ,〈val,mval,{}〉) (E[(set! @ref val)]εΣ) [E-Set!]

(E[ tuple〈valc,(val ...)〉]εΣ) [E-Tuple] (E[val]εΣ 1) ( [@ ]εΣ ) E refnew 1 where Σ1 = Σ[ref:=val]

where (Σ1 refnew) = alloc(Σ,〈valc,(val ...),{}〉) (E[(alloc val)]εΣ) [E-Alloc]

(E[ set〈valc,(val ...)〉]εΣ) [E-Set] (E[@refnew ]εΣ 1) ( [@ ]εΣ ) E refnew 1 where (Σ1 refnew) = alloc(Σ,val)

where (Σ1 refnew) = alloc(Σ,〈valc,{val ...},{}〉)

Figure 5: λπ reduction rules for references Figure 3: λπ reduction rules for creating objects

2.2 Accessing Built-in Values values that should populate the list, store them in the heap, Now that we’ve created a list, we should be able to perform and return a pointer to the newly-created reference: some operations on it, like look up its elements. λπ defines a number of builtin primitives that model Python’s internal (E[ list〈valc,[val ...]〉]εΣ) [E-List] operators for manipulating data, and these are used to access

(E[@refnew ]εΣ 1) the contents of a given object’s mval. We formalize these

where (Σ1 refnew) = alloc(Σ,〈valc,[val ...],{}〉) builtin primitives in a metafunction δ. A few selected cases of the δ function are shown in figure 4. This metafunction E-List is a good example for understanding the shape of lets us, for example, look up values on builtin lists: evaluation in λπ. The general form of the reduction is over expressions e, global environments ε, and heaps Σ: (prim “list-getitem” (〈%list,[〈%str,“first-elt”,{}〉],{}〉 〈%int,0,{}〉)) (eεΣ) ==> 〈%str,“first-elt”,{}〉 In the E-List rule, we also use evaluation contexts E to en- force an order of operations and eager calling semantics. Since δ works over object values themselves, and not over This is a standard application of Felleisen-Hieb-style small- references, we need a way to access the values in the store. step semantics [4]. Saliently, a new list value is populated λπ has the usual set of operators for accessing and updating from the list expression via the alloc metafunction, this is mutable references, shown in figure 5. Thus, the real λπ allocated in the store, and the resulting value of the expres- program corresponding to the one above would be: sion is a pointer refnew to that new list. (prim “list-getitem” Similar rules for objects in general, tuples, and sets are ((fetch list〈%list,[〈%str,“first-elt”,{}〉]〉) (fetch %int,0 ))) shown in figure 3. Lists, tuples, and sets are given their 〈 〉 own expression forms because they need to evaluate their Similarly, we can use set! and to update the values in subexpressions and have corresponding evaluation contexts. lists, and to allocate the return values of primitive operations.

δ(“list-getitem”〈any c1,[val0 ... val1 val2 ...],any1〉〈anyc2,number2,any2〉) = val1

where (equal? (length (val0 ...)) number2)

δ(“list-getitem”〈any c1,[val1 ...],any1〉〈anyc2,number2,any2〉) = 〈%none, meta-none ,{}〉

δ(“list-setitem”〈any c1,[val0 ... val1 val2 ...],any1〉〈x2,number2,any2〉 val3 val4) = 〈val 4,[val0 ... val3 val2 ...],{}〉

where (equal? (length (val0 ...)) number2)

δ(“num+”〈any cls,number1,any1〉〈anycls2,number2,any2〉) = 〈any cls,(+ number1 number2),{}〉

Figure 4: A sample of λπ primitives

We desugar to patterns like the above from Python’s actual 2.5 Loops, Exceptions, and Modules surface operators for accessing the elements of a list in We defer a full explanation of the terms in figure 1, and expressions like mylist[2]. the entire reduction relation, to the appendix. This includes 2.3 Updating and Accessing Fields a mostly-routine encoding of control operators via special evaluation contexts, and a mechanism for loading new code So far, the dictionary part of λ objects have always been π via modules. We continue here by focusing on cases in λπ empty. Python does, however, support arbitrary field assign- that are unique in Python. ments on objects. The expression

eobj[estr :=e val] 3. Classes, Methods, and Desugaring has one of two behaviors, defined in figure 6. Both behaviors Python has a large system with first-class methods, implicit work over references to objects, not over objects themselves, receiver binding, multiple inheritance, and more. In this sec- tion we discuss what parts of the class system we put in λ , in contrast to δ. If estr is a string object that is already a π and which parts we choose to eliminate by desugaring. member of eobj, that field is imperatively updated with eval. If the string is not present, then a new field is added to 3.1 Lookup in Classes the object, with a newly-allocated store position, and the object’s location in the heap is updated. In the last section, we touched on field lookup in an object’s The simplest rule for accessing fields simply checks in the local dictionary, and didn’t discuss the purpose of the class , ,d [ ] object’s dictionary for the provided name and returns the ap- position at all. When an object lookup 〈valc mval 〉 estr propriate value, shown in E-GetField in figure 6. E-GetField doesn’t find in the local dictionary d, it defers to a lookup also works over reference values, rather than objects directly. algorithm on the class value valc. More specifically, it uses the “__mro__” (short for method resolution order) field of the 2.4 First-class Functions class to determine which class dictionaries to search for the Functions in Python are objects like any other. They are field. This field is visible to the Python programmer: defined with the keyword def, which produces a callable class (object): object with a mutable set of fields, whose class is the built-in pass # a class that does nothing function class. For example a programmer is free to write: print(C.__mro__) def f(): # (, ) return f.x Field lookups on objects whose class value is C will first f.x = -1 look in the dictionary of , and then in the dictionary of f() # ==> -1 the built-in class object. We define this lookup algorithm within λπ as class-lookup, shown in figure 7 along with the We model functions as just another of object value, with reduction rule for field access that uses it. a type of mval that looks like the usual functional λ: This rule allows us to model field lookups that defer to a λ(x ...) opt-var.e superclass (or indeed, a list of them). But programmers don’t explicitly define fields; rather, they use higher- The opt-var indicates whether the function is variable- level language constructs to build up the inheritance hier- arity: if is of the form (y), then if the function is archy the instances eventually use. called with more arguments than are in its list of variables 3.2 Desugaring Classes (x ...), they are allocated in a new tuple and bound to y in the body. Since these rules are relatively unexciting and Most Python programmers use the special class form to verbose, we defer their explanation to the appendix. create classes in Python. However, class is merely syntac-

(E[@refobj [@refstr := val1]]εΣ) [E-SetFieldUpdate]

(E[val1]εΣ[ref 1:=val1])

where 〈anycls1,mval,{string2:ref2,...,string1:ref1,string3:ref3,...}〉 = Σ(refobj),

〈anycls2,string1,anydict〉 = Σ(refstr)

(E[@refobj [@refstr := val1]]εΣ) [E-SetFieldAdd]

(E[val1]εΣ 2)

where 〈anycls1,mval,{string:ref,...}〉 = Σ(refobj),

(Σ1 refnew) = alloc(Σ,val1),

〈anycls2,string1,anydict〉 = Σ(refstr),

Σ2 = Σ1[refobj:=〈anycls1,mval,{string1:refnew,string:ref,...}〉],

(not (member string1 (string ...)))

(E[@ref [@refstr ]]εΣ) [E-GetField]

(E[Σ(ref1)]εΣ)

where 〈anycls1,string1,anydict〉 = Σ(refstr),

〈anycls2,mval,{string2:ref2,...,string1:ref1,string3:ref3,...}〉 = Σ(ref)

Figure 6: Simple field access and update in λπ

(E[@refobj [@refstr ]]εΣ) [E-GetField-Class]

(E[valresult]εΣ result)

where 〈anycls,string,anydict〉 = Σ(refstr),

〈@ref,mval,{string 1:ref2,...}〉 = Σ(refobj),

(Σresult valresult) = class-lookup[[@refobj ,Σ(ref), string,Σ ]],

(not (member string (string1 ...)))

class-lookup[[@refobj ,〈any c,anymval,{string1:ref1,...,“__mro__”:ref,string2:ref2,...}〉, = (Σ valresult) string,Σ ]]

where 〈any1,(valcls ...),any3〉 = fetch-pointer[[Σ(ref),Σ ]],

valresult = class-lookup-mro[[(valcls ...), string,Σ ]]

class-lookup-mro[[(@refc valrest ...), string,Σ ]] = Σ(ref)

where 〈any1,any2,{string1:ref1,...,string:ref,string2:ref2,...}〉 = Σ(refc)

class-lookup-mro[[(@refc valrest ...), string,Σ ]] = class-lookup-mro[[(valrest ...), string,Σ ]]

where 〈any1,any2,{string1:ref1,...}〉 = Σ(refc), (not (member string (string1 ...))) fetch-pointer[[@ref,Σ ]] = Σ(ref)

Figure 7: Class lookup tic sugar for a use of the builtin Python function type.3 The new classes on the fly. Then it is a simple matter to desugar documentation states explicitly that the two following forms class forms to this function call. [sic] produce identical type objects: The implementation of type creates a new object value for the class, allocates it, sets the “__mro__” field to be the class X: computed inheritance graph,4 and sets the fields of the class a = 1 to be the bindings in the dictionary. We elide some of the verbose detail in the iteration over dict by using the for syn- X = type(’X’, (object,), dict(a=1)) tactic abbreviation, which expands into the desired iteration: This means that to implement classes, we merely need to understand the built-in function type, and how it creates 4 This uses an algorithm that is implementable in pure Python: 3 http://docs.python.org/3/library/functions.html#type http://www.python.org/download/releases/2.3/mro/.

Thus, very few object-based primitives are needed to create %type := fun (cls bases dict) static class methods and instance methods. let newcls = (alloc〈%type,(meta-class cls),{}〉) in Python has a number of other special method names that newcls[〈%str,“__mro__”〉 := can be overridden to provide specialized behavior. λπ tracks (builtin-prim “type-buildmro” (newcls bases))] Python this regard; it desugars surface expressions into calls (for (key elt) in dict[〈%str,“__items__”〉] () newcls[key := elt]) to methods with particular names, and provides built-in im- (return newcls) plementations of those methods for arithmetic, dictionary This function, along with the built-in type class, suffices access, and a number of other operations. Some examples: for bootstrapping the object system in λπ. o[p] desugars to... o[〈%str,“__getitem__”〉](p) 3.3 Python Desugaring Patterns n + m desugars to... n[〈%str,“__add__”〉] (m) fun(a) desugars to... fun[〈%str,“__call__”〉] (a) Python objects can have a number of so-called magic fields that allow for overriding the behavior of built-in syntactic With the basics of type and object lookup in place, forms. These magic fields can be set anywhere in an object’s getting these operations right is just a matter of desugaring inheritance hierarchy, and provide a lot of the flexibility for to the right method calls, and providing the right built-in which Python is well-known. versions for primitive values. This is the form of much of For example, the field accesses that Python programmers our desugaring, and though it is labor-intensive, it is also the write are not directly translated to the rules in λπ. Even straightforward part of the process. the execution of o.x depends heavily on its inheritance hierarchy. This program desugars to: 4. Python: the Hard Parts o[〈%str,“__getattribute__”〉] (o〈%str,“x”〉) Not all of Python has a semantics as straightforward as For objects that don’t override the “__getattribute__” field, that presented so far. Python has a unique notion of scope, the built-in object class’s implementation does more than with new scope operators added in Python 3 to provide simply look up the “x” property using the field access rules some features of more traditional static scoping. It also has we presented earlier. Python allows for attributes to imple- powerful control flow constructs, notably generators. ment special accessing functionality via properties,5 which 4.1 Generators can cause special functions to be called on property access. The function of object checks if the Python has a built-in notion of generators, which provide a value of the field it accesses has a special “__get__” method, control-flow construct, yield, that can implement lazy or and if it does, calls it: generative sequences and coroutines. The programmer in- terface for creating a generator in Python is straightforward: object[〈%str,“__getattribute__”〉 := fun (obj field) any function definition that uses the yield keyword in its let value = obj[field] in body is automatically converted into an object with a gener- if (builtin-prim “has-field?” (value ator interface. To illustrate the easy transition from function 〈%str,“__get__”〉)) (return value[〈%str,“__get__”〉] ()) to generator, consider this simple program: (return value) ] def f(): This pattern is used to implement a myriad of features. x = 0 For example, when accessing function values on classes, the while True: method of the function binds the self argument: x += 1 class C(object): return x def f(self): return self f() # ==> 1 f() # ==> 1 c = C() # constructs a new C instance # ... g = c.f # accessing c.f creates a When called, this function always returns 1. # method object closed over c Changing return to yield turns this into a generator. g() is c # ==> True As a result, applying f() no longer immediately evaluates the body; instead, it creates an object whose next method # We can also bind self manually: evaluates the body until the next yield statement, stores its self_is_5 = C.f.__get__(5) state for later resumption, and returns the yielded value: self_is_5() # ==> 5

5 http://docs.python.org/3/library/functions.html#property

def f(): of the above control operators, and turns uses of control op- x = 0 erators into applications of the appropriate continuation in- while True: side the function. By passing in different continuation argu- x += 1 ments, the caller of the resulting function has complete con- yield x trol over the behavior of control operators. For example, we might rewrite a try-except block from gen = f() gen.__next__() # ==> 1 try: gen.__next__() # ==> 2 raise Exception() gen.__next__() # ==> 3 except e: # ... print(e) Contrast this with the following program, which seems to to perform a simple abstraction over the process of yielding: def except_handler(e): print(e) def f(): except_handler(Exception()) def do_yield(n): yield n In the case of generators, rewriting yield with CPS x = 0 would involve creating a handler that stores a function hold- while True: ing the code for what to do next, and rewriting yield ex- x += 1 pressions to call that handler: do_yield(x) def f(): Invoking f() results in an infinite loop. That is because x = 1 Python strictly converts only the innermost function with a yield x yield into a generator, so only do_yield is a generator. x += 1 Thus, the generator stores only the execution context of yield x do_yield, not of f. g = f() Failing to Desugar Generators with (Local) CPS g.__next__() # ==> 1 The experienced linguist will immediately see what is going g.__next__() # ==> 2 on. Clearly, Python has made a design decision to store g.__next__() # throws "StopIteration" only local continuations. This design can be justified on the 6 grounds that converting a whole program to continuation- would be rewritten to something like: passing style (CPS) can be onerous, is not modular, can def f(): impact performance, and depends on the presence of tail- calls (which Python does not have). In contrast, it is natural def yielder(val, rest_of_function): to envision a translation strategy that performs only a local next.to_call_next = rest_of_function conversion to CPS (or, equivalently, stores the local stack return val frames) while still presenting a continuation-free interface to the rest of the program. def next(): From the perspective of our semantics, this is a potential return next.to_call_next() boon: we don’t need to use a CPS-style semantics for the whole language! Furthermore, perhaps generators can be def done(): raise StopIteration() handled by a strictly local rewriting process. That is, in def start(): the core language generators can be reified into first-class x = 1 functions and applications that use a little state to record def rest(): which function is the continuation of the yield point. Thus, x += 1 generators seem to fit perfectly with our desugaring strategy. return yielder(x, done) To convert programs to CPS, we take operators that can return yielder(x, rest) cause control-flow and reify each into a continuation func- tion and appropriate application. These operators include next.to_call_next = start simple sequences, loops combined with break and con- tinue, try-except and try-finally combined with return { ’next’ : next } raise, and return. Our CPS transformation turns every expression into a function that accepts an argument for each 6 This being a sketch, we have taken some liberties for simplicity.

We proceed by describing Python’s handling of scope g = f() for local variables, the extension to nonlocal, and the g[’next’]() # ==> 1 interaction of both of these features with classes. g[’next’]() # ==> 2 g[’next’]() # throws "StopIteration" 4.2.1 Declaring and Updating Local Variables In Python, the operator = performs local variable binding: We build the yielder function, which takes a value, which it returns after storing a continuation argument in the to_call_next field. The next function always returns def f(): the result of calling this stored value. Each yield statement x = ’local variable’ is rewritten to put everything after it into a new function def- return x inition, which is passed to the call to yielder. In other words, this is the canonical CPS transformation, applied in f() # ==> ’local variable’ the usual fashion. This rewriting is well-intentioned but does not work. If The syntax for updating and creating a local variable are this program is run under Python, it results in an error: identical, so subsequent = statements mutate the variable created by the first. x += 1 UnboundLocalError: local variable ’x’ def f(): x = ’local variable’ This is because Python creates a new scope for each func- x = ’reassigned’ tion definition, and assignments within that scope create new x = ’reassigned again’ variables. In the body of rest, the assignment x += 1 return x refers to a new x, not the one defined by x = 1 in start. This runs counter to traditional notions of functions that can f() # ==> ’reassigned again’ close over mutable variables. And in general, with multiple assignment statements and branching control flow, it is en- tirely unclear whether a CPS transformation to Python func- Crucially, there is no syntactic difference between a state- tion definitions can work. ment that updates a variable and one that initially binds it. The lesson from this example is that the interaction of Because bindings can also be introduced inside branches of non-traditional scope and control flow made a traditional conditionals, it isn’t statically determinable if a variable will translation not work. The straightforward CPS solution is be defined at certain program points. Note also that vari- thus incorrect in the presence of Python’s mechanics of able declarations are not scoped to all blocks—here they are variable binding. We now move on to describing how we scoped to function definitions: can express Python’s scope in a more traditional lexical model. Then, in section 4.3 we will demonstrate a working def f(y): transformation for Python’s generators. if y > .5: x = ’big’ else : pass 4.2 Scope return x Python has a rich notion of scope, with several types of variables and implicit binding semantics that depend on the f(0) # throws an exception block structure of the program. Most identifiers are local; f(1) # ==> "big" this includes function and variables defined with the = operator. There are also global and nonlocal vari- Handling simple declarations of variables and updates ables, with their own special semantics within closures, and to variables is straightforward to translate into a lexically- interaction with classes. Our core contribution to explaining scoped language. λ has a usual let form that allows for Python’s scope is to give a desugaring of the nonlocal and π lexical binding. In desugaring, we scan the body of the global keywords, along with implicit local, global function and accumulate all the variables on the left-hand and instance identifiers, into traditional lexically scoped side of assignment statements in the body. These are let- closures. Global scope is still handled specially, since it ex- bound at the top of the function to the special ☠ form, which hibits a form of dynamic scope that isn’t straightforward to evaluates to an exception in any context other than a let- capture with traditional let-bindings.7 binding context (section 2). We use x := e as the form for variable assignment, which is not a binding form in the core. 7 We actually exploit this dynamic scope in bootstrapping Python’s object system, but defer an explanation to the appendix. Thus, in λπ, the example above rewrites to:

let f= ☠ in from outside. This was the underlying problem with the f := attempted CPS translation from the last section, highlighting fun (y) (no-var) the consequences of using the same syntactic form for both let x= ☠ in variable binding and update. if (builtin-prim “num>” (y〈%float,0.5〉)) Closing over a mutable variable is, however, a common x :=〈%str,“big”〉 and useful pattern. Perhaps recognizing this, Python added none a new keyword in Python 3.0 to allow this pattern, called (returnx) nonlocal. A function definition can include a nonlocal f(〈%int,0〉) declaration at the top, which allows mutations within the f(〈%int,1〉) function’s body to refer to variables in enclosing scopes on a per-variable basis. If we add such a declaration to the In the first application (to 0) the assignment will never hap- previous example, we get a different answer: pen, and the attempt to look up the ☠-valued x in the return statement will fail with an exception. In the second applica- def g(): tion, the assignment in the then-branch will change the value x = ’not affected by h’ of in the store to a non- string value, and the string “big” def h(): will be returned. nonlocal x The algorithm for desugaring scope is so far: x = ’inner x’ • For each function body: return x return (h(), x) Collect all variables on the left-hand side of = in a set locals , stopping at other function boundaries, g() # ==> (’inner x’, ’inner x’) For each variable var in locals, wrap the function body in a let-binding of to . The nonlocal declaration allows the inner assignment to x to “see” the outer binding from g. This effect can span any This strategy works for simple assignments that may or may nesting depth of functions: not occur within a function, and maintains lexical structure for the possibly-bound variables in a given scope. Unfortu- def g(): nately, this covers only the simplest cases of Python’s scope. x = ’not affected by h’ def h(): 4.2.2 Closing Over Variables def h2(): Bindings from outer scopes can be seen by inner scopes: nonlocal x x = ’inner x’ def f(): return x x = ’closed-over’ return h2 def g(): return (h()(), x) return x return g g() # ==> (’inner x’, ’inner x’)

f()() # ==> ’closed-over’ Thus, the presence or absence of a nonlocal declara- tion can change an assignment statement from a binding oc- However, since = defines a new local variable, one cannot currence of a variable to an assigning occurrence. We aug- close over a variable and mutate it with what we’ve seen so ment our algorithm for desugaring scope to reflect this: far; = simply defines a new variable with the same name: • For each function body: def g(): Collect all variables on the left-hand side of in a set x = ’not affected’ locals, stopping at other function boundaries, def h(): x = ’inner x’ Let locals’ be locals with any variables in nonlocal return x declarations removed, return (h(), x) For each variable in locals’, wrap the function body in a -binding of to . g() # ==> (’inner x’, ’not affected’) Thus the above program would desugar to the following, This is mirrored in our desugaring: each function adds a which does not let-bind inside the body of the function new let-binding inside its own body, shadowing any bindings assigned to h.

def f(x, y): print(x); print(y); print("") class c: x = 4 print(x); print(y) print("") def g(self): print(x); print(y); print(c) return c

f("x-value", "y-value")().g()

# produces this result:

x-value y-value Figure 9: Interactions between class bodies and function scope 4 y-value variable x is not 4, the value of the most recent apparent x-value assignment to x. In fact, the body of g seems to "skip" the y-value scope in which x is bound to 4, instead closing over the outer scope in which x is bound to "x-value". At first glance this does not appear to be compatible with our previous notions of Python’s closures. We will see, however, that the Figure 8: An example of class and function scope interacting correct desugaring is capable of expressing the semantics of scope in classes within the framework we have already established for dealing with Python’s scope. let g= ☠ in Desugaring classes is substantially more complicated g := than handling simple local and nonlocal cases. Consider the fun () example from figure 8, stripped of print statements: let x= ☠ in let h= ☠ in def f(x, y): x :=〈%str,“not affected by h”,{}〉 class c: h := x = 4 fun () x :=〈%str,“inner x”,{}〉 def g(self): pass (returnx) return c (return tuple〈%tuple,(h () ()x)〉) f("x-value", "y-value")().g() g() In this example, we have three local scopes: the body of The assignment to x inside the body of h behaves as a the function f, the body of the class definition c, and the body typical assignment statement in a closure-based language of the function g. The scopes of c and g close over the same like Scheme, mutating the let-bound defined in g. scope as f, but have distinct, non-nesting scopes themselves. 4.2.3 Classes and Scope Figure 9 shows the relationship graphically. Algorithmically, we add new steps to scope desugaring: We’ve covered some subtleties of scope for local and nonlo- cal variables and their interaction with closures. What we’ve • For each function body: presented so far would be enough to recover lexical scope For each nested class body: for a CPS transformation for generators if function bodies contained only other functions. However, it turns out that we − Collect all the function definitions within the class observe a different closure behavior for variables in a class body. Let the set of their names be defnames and definition than we do for variables elsewhere in Python, and the set of the expressions be defbodies, we must address classes to wrap up the story on scope. − Generate a fresh variable deflifted for each vari- Consider the example in figure 8. Here we observe an able in . Add assignment statements to the interesting phenomenon: in the body of g, the value of the function body just before the class definition as-

signing each deflifted to the corresponding ex- let f= ☠ in pression in defbodies, f := − Change each function definition in the body of the fun (x y) let extracted-g = ☠ in class to defname := deflifted let c = ☠ in Collect all variables on the left-hand side of = in a set extracted-g := fun () none locals, stopping at other function boundaries, c := (class “c”) let g= ☠ in Let locals’ be locals with any variables in nonlocal let x= ☠ in declarations removed, c[〈%str,“x”〉 :=〈%int,4〉] :=〈%int,4〉 var locals’ x For each variable in , wrap the function c[〈%str,“g”〉 := extracted-g] body in a let-binding of to ☠. g := extracted-g (return c) Recalling that the instance variables of a class desugar roughly to assignments to the class object itself, the function desugars to the following: Figure 10: Full class scope desugaring

let f= ☠ in f := We have now covered Python classes’ scope seman- fun ( ) x y tics: function bodies do not close over the class body’s let extracted-g = ☠ in scope, class bodies create their own local scope, state- let c = ☠ in extracted-g := fun () none ments in class bodies are executed sequentially, and defi- c := (class “c”) nitions/assignments in class bodies result in the creation of c[〈%str,“x”〉 :=〈%int,4〉] class members. The nonlocal keyword does not require c[〈%str,“g”〉 := extracted-g] (return c) further special treatment, even when present in class bodies.

This achieves our desired semantics: the bodies of func- 4.3 Generators Redux tions defined in the class C will close over the x and y from With the transformation to a lexical core in hand, we can the function definition, and the statements written in c-scope return to our discussion of generators, and their implemen- can still see those bindings. We note that scope desugaring tation with a local CPS transformation. yields terms in an intermediate language with a class key- To implement generators, we first desugar Python down word. In a later desugaring step, we remove the class key- to a version of λπ with an explicit yield statement, passing word as we describe in section 3.2. yields through unchanged. As the final stage of desugar- ing, we identify functions that contain yield, and convert 4.2.4 Instance Variables them to generators via local CPS. We show the desugaring When we introduced classes we saw that there is no apparent machinery around the CPS transformation in figure 11. To difference between classes that introduce identifiers in their desugar them, in the body of the function we construct a body and classes that introduce identifiers by field assign- generator object and store the CPS-ed body as a ___re- ment. That is, either of the following forms will produce the sume attribute on the object. The __next__ method on same class C: the generator, when called, will call the ___resume clo- sure with any arguments that are passed in. To handle yield- class C: ing, we desugar the core yield expression to update the x = 3 ___resume attribute to store the current normal continua- # or ... tion, and then return the value that was yielded. class C: pass Matching Python’s operators for control flow, we have C.x = 3 five continuations, one for the normal completion of a state- ment or expression going onto the next, one for a return We do, however, have to still account for uses of the in- statement, one each for break and continue, and one stance variables inside the class body, which are referred to for the exception throwing raise. This means that each with the variable name, not with a field lookup like c.x. CPSed expression becomes an anonymous function of five We perform a final desugaring step for instance variables, arguments, and can be passed in the appropriate behavior where we let-bind them in a new scope just for evaluating for each control operator. the class body, and desugar each instance variable assign- We use this configurability to handle two special cases: ment into both a class assignment and an assignment to the variable itself. The full desugaring of the example is shown • Throwing an exception while running the generator in figure 10. • Running the generator to completion

In the latter case, the generator raises a StopItera- tion exception. We encode this by setting the initial “nor- mal” continuation to a function that will update ___re- sume to always raise StopIteration, and then to raise that exception. Thus, if we evaluate the entire body of the def f(x ...): body-with-yield generator, we will pass the result to this continuation, and the proper behavior will occur. Similarly, if an uncaught exception occurs in a generator, desugars to... the generator will raise that exception, and any subsequent fun (x ...) calls to the generator will result in StopIteration. We let done = fun () (raise StopIteration ()) in handle this by setting the initial raise continuation to be let initializer = code that updates ___resume to always raise StopIt- fun (self) eration let end-of-gen-normal = , and then we raise the exception that was passed fun (last-val) to the continuation. Since each try block in CPS installs a self[〈%str,“__next__”〉 := done] new exception continuation, if a value is passed to the top- (raise StopIteration ()) in let end-of-gen-exn = level exception handler it means that the exception was not fun (exn-val) caught, and again the expected behavior will occur. self[〈%str,“__next__”〉 := done] (raise exn-val) in let unexpected-case = fun (v) 5. Engineering & Evaluation (raise SyntaxError ()) in Our goal is to have desugaring and λ enjoy two properties: let resumer = π fun (yield-send-value) (return • Desugaring translates all Python source programs to λπ (cps-of body-with-yield) (totality). (end-of-gen-normal unexpected-case • Desugared programs evaluate to the same value as the end-of-gen-exn source would in Python (conformance). unexpected-case unexpected-case)) in self[〈%str,“___resume”〉 := resumer] in The second property, in particular, cannot be proven because %generator (initializer) there is no formal specification for what Python does. We therefore tackle both properties through testing. We discuss where... various aspects of implementing and testing below. class generator(object): 5.1 Desugaring Phases def __init__(self, init): init(self) Though we have largely presented desugaring as an atomic activity, the paper has hinted that it proceeds in phases. def __next__(self): Indeed, there are four: return self.___resume(None) • Lift definitions out of classes (section 4.2.3). def send(self, arg): • Let-bind variables (section 4.2.2). This is done sec- return self.___resume(arg) ond to correctly handle occurrences of nonlocal and global in class methods. The result of these first two def __iter__(self): steps is an intermediate language between Python and the return self core with lexical scope, but still many surface constructs. • Desugar classes, turn Python operators into method calls, def __list__(self): turn for loops into guarded while loops, etc. return [x for x in self] • Desugar generators (section 4.3).

These four steps yield a term in our core, but it isn’t ready Figure 11: The desugaring of generators to run yet because we desugar to open terms. For instance, print(5) desugars to

print (〈%int,5〉)

which relies on free variables print and %int.

5.2 Python Libraries in Python Feature # of tests LOC We implement as many libraries as possible in Python8 aug- mented with some macros recognized by desugaring. For ex- Built-in Datatypes 81 902 ample, the builtin tuple class is implemented in Python, but Scope 39 455 getting the length of a tuple defers to the δ function: Exceptions 25 247 (Multiple) Inheritance 16 303 class tuple(object): Properties 9 184 def __len__(self): Iteration 13 214 return ___delta("tuple-len", self) Generators 9 129 ... Modules 6 58

All occurrences of ___delta(str, e, ...) are desug- Total 205 2636 ared to (builtin-prim str (e ...)) directly. We only do this for library files, so normal Python programs can use ___delta as the valid identifier it is. As another example, Figure 12: Distribution of passing tests after the class definition of tuples, we have the statement

___assign("%tuple", tuple) and into verbose but straightforward desugaring, causing se- which desugars to an assignment statement %tuple := tuple. rious performance hit; running 100 tests now takes on the Since %-prefixed variables aren’t valid in Python, this order of 20 minutes, even with the optimization. gives us an private namespace of global variables that are un-tamperable by Python. Thanks to these decisions, this project produces far more readable desugaring output than a 5.4 Testing previous effort for JavaScript [6]. Python comes with an extensive test suite. Unfortunately, this suite depends on numerous advanced features, and as 5.3 Performance such was useless as we were building up the semantics. λπ may be intended as a formal semantics, but composed We therefore went through the test suite files included with with desugaring, it also yields an implementation. While CPython, April 2012,9 and ported a representative suite of the performance does not matter for semantic reasons (other 205 tests (2600 LOC). In our selection of tests, we focused than programs that depend on time or space, which would on orthogonality and subtle corner-cases. The distribution of be ill-suited by this semantics anyway), it does greatly affect those tests across features is reported in figure 12. On all how quickly we can iterate through testing! these tests we obtain the same results as CPython. The full process for running a Python program in our It would be more convincing to eventually handle all of semantics is: Python’s own unittest infrastructure to run CPython’s 1. Parse and desugar roughly 1 KLOC of libraries imple- test suite unchanged. The unittest framework of CPython mented in Python unfortunately relies on a number of reflective features on modules, and on native libraries, that we don’t yet cover. For 2. Parse and desugar the target program now, we manually move the assertions to simpler if-based 3. Build a syntax tree of several built-in libraries, coded by tests, which also run under CPython, to check conformance. building the AST directly in Racket 4. Compose items 1-3 into a single λπ expression 5.5 Correspondence with Redex 5. Evaluate the λπ expression We run our tests against λπ↓, not against the Redex-defined Parsing and desugaring for (1) takes a nontrivial amount of reduction relation for λπ. We can run tests on λπ, but per- time (40 seconds on the first author’s laptop). Because this formance is excruciatingly slow: it takes over an hour to run work is needlessly repeated for each test, we began caching complete individual tests under the Redex reduction relation. the results of the desugared library files, which reduced the Therefore, we have been able to perform only limited test- testing time into the realm of feasibility for rapid develop- ing for conformance by hand-writing portions of the envi- ment. When we first performed this optimization, it made ronment and heap (as Redex terms) that the Python code in running 100 tests drop from roughly 7 minutes to 22 sec- the test uses. Fortunately, executing against Redex should be onds. Subsequently, we moved more functionality out of λπ parallelizable, so we hope to increase confidence in the Re- dex model as well. 8 We could not initially use existing implementations of these in Python for bootstrapping reasons: they required more of the language than we supported. 9 http://www.python.org/getit/releases/3.2.3/

6. Future Work and Perspective these remaining esoteric features. It’s also useful to simply As section 5 points out, there are some more parts of Python call out the weirdness of these operators, which are liable to we must reflect in the semantics before we can run Python’s violate the soundness of otherwise-sound program tools. test cases in their native form. This is because Python is a Overall, what we have learned most from this effort is large language with extensive libraries, a foreign-function how central scope is to understanding Python. Many of its interface, and more. features are orthogonal, but they all run afoul on the shoals of Libraries aside, there are some interesting features in scope. Whether this is intentional or an accident of the con- Python left to tackle. These include special fields, such as torted history of Python’s scope is unclear (for example, see the properties of function objects that compute the content the discussion around the proposal to add nonlocal [15]), of closures, complex cases of destructuring assignments, a but also irrelevant. Those attempting to improve Python or few reflective features like the metaclass form, and others. create robust sub-languages of it—whether for teaching or More interestingly, we are not done with scope! Consider for specifying asset-backed securities—would do well to put locals, which returns a dictionary of the current variable their emphasis on scope first, because this is the feature most bindings in a given scope: likely to preclude sound analyses, correctness-preserving refactorings, and so on. def f(x): y = 3 7. Related Work return locals() We are aware of only one other formalization for Python: Smeding’s unpublished and sadly unheralded master’s the- f("val") # ==> {’x’: ’val’, ’y’: 3} sis [9]. Smeding builds an executable semantics and tests it This use of locals can be desugared to a clever combi- against 134 hand-written tests. The semantics is for Python nation of assignments into a dictionary along with variable 2.5, a language version without the complex scope we han- assignments, which we do. However, this desugaring of lo- dle. Also, instead of defining a core, it directly evaluates (a cals relies on it being a strictly local operation (for lack of subset of) Python terms. Therefore, it offers a weaker ac- a better word). But worse, locals is a value! count of the language and is also likely to be less useful for certain kinds of tools and for foundational work. def f(x, g): There are a few projects that analyze Python code. They y = 3 are either silent about the semantics or explicitly eschew return g() defining one. We therefore do not consider these related. Our work follows the path laid out by λJS [7] and its f("x-val", locals) follow-up [6], both for variants of JavaScript. # ==> {’x’: ’x-val’, ’y’: 3, # ’g’: } Acknowledgments Thus, any application could invoke locals. We would We thank the US NSF and Google for their support. We therefore need to deploy our complex desugaring every- thank Caitlin Santone for lending her graphical expertise to where we cannot statically determine that a function is not figure 9. The paper title is entirely due to Benjamin Lerner. locals, and change every application to check for it. Other We are grateful to the Artifact Evaluation Committee for built-in values like super and dir exhibit similar behavior. their excellent and detailed feedback, especially the one re- On top of this, import can splice all identifiers (*) viewer who found a bug in iteration with generators. from a module into local scope. For now, we handle only This paper was the result of an international collaboration imports that bind the module object to a single identifier. resulting from an on-line course taught by the first and last Indeed, even Python 3 advises that import * should only authors. Several other students in the class also contributed be used at module scope. Finally, we do not handle exec, to the project, including Ramakrishnan Muthukrishnan, Bo Python’s “eval” (though the code-loading we do for modules Wang, Chun-Che Wang, Hung-I Chuang, Kelvin Jackson, comes close). Related efforts on handling similar operators Victor Andrade, and Jesse Millikan. in JavaScript [6] are sure to be helpful here. We note that most traditional analyses would be seriously Bibliography challenged by programs that use functions like locals in a [1] Appcelerator. PyDev. 2013. http://pydev.org/ higher-order way, and would probably benefit from checking that it isn’t used in the first place. We don’t see the lack of [2] Johann C. Rocholl. PEP 8 1.4.5. 2013. https:// pypi.python.org/pypi/pep8 full support for such functions as a serious weakness of λπ, or an impediment to reasoning about most Python programs. [3] Matthias Felleisen, Robert Bruce Findler, and Matthew Rather, it’s an interesting future challenge to handle a few of Flatt. Semantics Engineering with PLT Redex. 2009.

[4] Matthias Felleisen and Robert Hieb. The Revised Re- port on the Syntactic Theories of Sequential Control E ::= [] and State. Theoretical Computer Science 103(2), 1992. | (fetchE) | (set!Ee) | (set! val E) [5] Phil Frost. Pyflakes 0.6.1. 2013. https://pypi. | (allocE) python.org/pypi/pyflakes | (moduleEe) [6] Joe Gibbs Politz, Matthew J. Carroll, Benjamin S. | 〈E,mval〉 Lerner, Justin Pombrio, and Shriram Krishnamurthi. | ref :=E | x :=E | [ := ] | [ := ] | [ := ] A Tested Semantics for Getters, Setters, and Eval in E e e v E e v v E | E e JavaScript. In Proc. Dynamic Languages Symposium, | ifEee 2012. | let x=E ine [7] Arjun Guha, Claudiu Saftoiu, and Shriram Krishna- | list〈E,[e ...]〉 | list〈val,Es〉 murthi. The Essence of JavaScript. In Proc. European | tuple〈E,(e ...)〉 | tuple〈val,Es〉 Conference on Object-Oriented Programming, 2010. | set〈E,(e ...)〉 | set〈val,Es〉 [8] JetBrains. PyCharm. 2013. http://www. | E[e] | val[E] jetbrains.com/pycharm/ | (builtin-prim op Es) | (raiseE) [9] Gideon Joachim Smeding. An executable operational | (returnE) semantics for Python. Universiteit Utrecht, 2009. | (tryexceptExee) | (tryfinallyEe) [10] James McCauley. About POX. 2013. http://www. | (loopEe) | (frameE) noxrepo.org/pox/about-pox/ | E(e ...) | val Es | E(e ...)*e | val Es*e [11] Neal Norwitz. PyChecker 0.8.12. 2013. https:// | val(val ...)*E pypi.python.org/pypi/PyChecker | (construct-moduleE) | (in-moduleEε) [12] Securities and Exchange Commission. Release Es ::= (val ...Ee ...) Nos. 33-9117; 34-61858; File No. S7-08-10. 2010. https://www.sec.gov/rules/proposed/ 2010/33-9117.pdf Figure 13: Evaluation contexts [13] Sylvain Thenault. PyLint 0.27.0. 2013. https:// pypi.python.org/pypi/pylint [14] Wingware. Wingware Python IDE. 2013. http:// wingware.com/ T ::= [] [15] Ka-Ping Yee. Access to Names in Outer Spaces. | (fetchT) | (set!Te) | (set! val T) 2009. http://www.python.org/dev/peps/ | (allocT) pep-3104/ | 〈T,mval〉 | ref :=T | x :=T Appendix 1: The Rest of λπ | T[e :=e] | v[T :=e] | v[v :=T] | T e The figures in this section show the rest of the λπ semantics. | ifTee We proceed to briefly present each of them in turn. | let x=T ine | list〈T,e〉 | list〈val,Ts〉 1. Contexts and Control Flow | tuple〈T,e〉 | tuple〈val,Ts〉 | set〈T,e〉 | set〈val,Ts〉 Figures 13, 14, 15, and 16 show the different contexts | T[e] | val[T] we use to capture left-to-right, eager order of operations in | (builtin-prim op Ts) (raise ) λπ. E is the usual evaluation context that enforces left-to- | T | (loopTe) | (frameT) right, eager evaluation of expressions. T is a context for the | T(e ...) | val Ts first expression of a tryexcept block, used to catch instances | T(e ...)*e | val Ts*e of raise. Similarly, H defines contexts for loops, detecting | val(val ...)*T continue and break, and R defines contexts for return state- | (construct-moduleT) ments inside functions. Each interacts with a few expression Ts ::= (val ...Te ...) forms to handle non-local control flow.

Figure 17 shows how these contexts interact with expres- Figure 14: Contexts for try-except sions. For example, in first few rules, we see how we handle and in while statements. When takes a

H ::= [] | (fetchH) | (set!He) | (set! val H) | (allocH) | 〈H,mval〉 | ref :=H | x :=H

| H[e :=e] | v[H :=e] | v[v :=H] (whilee 1 e2 e3) [E-While] | H e ife 1 (loope 2 (whilee 1 e2 e3))e 3 | ifHee | let x=H ine (loopH[continue]e) e [E-LoopContinue] | list〈H,e〉 | list〈val,Hs〉 | tuple〈H,e〉 | tuple〈val,Hs〉 (loopH[break]e) vnone [E-LoopBreak] | set〈H,e〉 | set〈val,Hs〉 | H[e] | val[H] (loop val e) e [E-LoopNext] | (builtin-prim op Hs) | (raiseH) (tryexcept val x ecatch eelse) eelse [E-TryDone] | (tryexceptHxee) | H(e ...) | val Hs (tryexceptT[(raise val)]xe catch eelse) [E-TryCatch] | H(e ...)*e | val Hs*e | val(val ...)*H let x= val ine catch | (construct-moduleE) Hs ::= (val ...He ...) (frameR[(return val)]) val [E-Return]

(frame val) val [E-FramePop] Figure 15: Contexts for loops (tryfinallyR[(return val)]e) [E-FinallyReturn] e (return val)

(tryfinallyH[break]e) [E-FinallyBreak] e break R ::= [] | (fetchR) | (set!Re) | (set! val R) (tryfinallyT[(raise val)]e) [E-FinallyRaise] | (allocR) e (raise val) | 〈R,mval〉 | ref :=R | x :=R (tryfinallyH[continue]e) [E-FinallyContinue] | R[e :=e] | v[R :=e] | v[v :=R] e continue | R e

| ifRee (E[ if val e1 e2 ]εΣ) [E-IfTrue] | let x=R ine (E[e1]εΣ) | list〈R,e〉 | list〈val,Rs〉 where (truthy? valΣ) | tuple〈R,e〉 | tuple〈val,Rs〉 set〈 , 〉 set〈 , 〉 | R e | val Rs (E[ if val e1 e2 ]εΣ) [E-IfFalse] | R[e] | val[R] (E[e2]εΣ) | (builtin-prim op Rs) | (raiseR) where (not (truthy? valΣ)) | (loopRe) val e e [E-Seq] | (tryexceptRxee) | R(e ...) | val Rs | R(e ...)*e | val Rs*e | val(val ...)*R Figure 17: Control flow reductions | (construct-moduleE) Rs ::= (val ...Re ...)

Figure 16: Contexts for return statements step, it yields a loop form that serves as the marker for where the ability to share references between object fields and internal break and continue statements should collapse to. It variables. We maintain a strict separation in our current is for this reason that H does not descend into nested semantics, with the exception of modules, and we expect forms; it would be incorrect for a in a nested loop to that we’ll continue to need it in the future to handle pat- the outer loop. terns of exec. One interesting feature of while and tryexcept in Python is that they have distinguished “else” clauses. For Finally, we show the E-Delete operators, which allow a while loops, these else clauses run when the condition Python program to revert the value in the store at a particular is , but not when the loop is broken out of. For location back to , or to remove a global binding. tryexcept, the else clause is only visited if no exception 3. Global Scope was thrown while evaluating the body. This is reflected in E-TryDone and the else branch of the if statement produced While local variables are handled directly via substitution, by E-While. we handle global scope with an explicit environment ε that We handle one feature of Python’s exception raising im- follows the computation. We do this for two main reasons. perfectly. If a programmer uses raise without providing an First, because global scope in Python is truly dynamic in explicit value to throw, the exception bound in the most re- ways that local scope is not (exec can modify global scope), cent active catch block is thrown instead. We have a limited and we want to be open to those possibilities in the future. solution that involves raising a special designated “reraise” Second, and more implementation-specific, we use global value, but this fails to capture some subtle behavior of nested scope to bootstrap some mutual dependencies in the object catch blocks. We believe a more sophisticated desugaring system, and allow ourselves a touch of dynamism in the that uses a global stack to keep track of entrances and exits semantics. to catch blocks will work, but have yet to verify it. We still For example, when computing booleans, λπ needs to pass a number of tests that use raise with no argument. yield numbers from the δ function that are real booleans (e.g., have the built-in %bool object as their class). However, 2. Mutation we need booleans to set up if-tests while we are bootstrap- ping the creation of the boolean class itself! To handle this, There are three separate mutability operators in λ , (set! ) , π ee we allow global identifiers to appear in the class position of which mutates the value stored in a reference value, e :=e , objects. If we look for the class of an object, and the class which mutates variables, and (set-fieldeee) , which up- position is an identifier, we look it up in the global environ- dates and adds fields to objects. ment. We only use identifiers with special %-prefixed names Figure 18 shows the several operators that allocate and that aren’t visible to Python in this way. It may be possible manipulate references in different ways. We briefly catego- to remove this touch of dynamic scope from our semantics, rize the purpose for each type of mutation here: but we haven’t yet found the desugaring strategy that lets us do so. Figure 19 shows the reduction rule for field lookup in • We use , (fetche) and (alloce) to handle this case. the update and creation of objects via the δ function, which reads but does not modify the store. Thus, even 4. True, False, and None the lowly + operation needs to have its result re-allocated, The keywords True, False, and None are all singleton since programmers only see references to numbers, not references in Python. In λ , we do not have a form for object values themselves. We leave the pieces of object π True, instead desugaring it to a variable reference bound values immutable and use this general strategy for up- in the environment. The same goes for None and False. dating them, rather than defining separate mutability for We bind each to an allocation of an object: each type (e.g., lists). let True = (alloc〈%bool,1,{}〉) in • We use for assignment to both local and global let False = (alloc〈%bool,0,{}〉) in variables. We discuss global variables more in the next let None = (alloc〈%none, meta-none ,{}〉) in section. Local variables are handled at binding time by eprog allocating references and substituting the new references and these bindings happen before anything else. This pattern wherever the variable appears. Local variable accesses ensures that all references to these identifiers in the desug- and assignments thus work over references directly, since ared program are truly to the same objects. Note also that the the variables have been substituted away by the time the boolean values are represented simply as number-like val- actual assignment or access is reached. Note also that E- ues, but with the built-in class, so they can be added AssignLocal can override potential ☠ store entries. and subtracted like numbers, but perform method lookup on • We use to update and add fields to the class. This reflects Python’s semantics: objects’ dictionaries. We leave the fields of objects’ dic- tionaries as references and not values to allow ourselves isinstance(True, int) # ==> True

(E[@reffun (val ...)]εΣ) [E-App]

(E[(frame subst[[(x ...), (refarg ...),e ]])]εΣ 1)

where 〈anyc,λ(x ...) (no-var).e ,anydict〉 = Σ(reffun), (equal? (length (val ...)) (length (x ...))),

(Σ1 (refarg ...1)) = extend-store/list[[Σ,(val ...) ]] (E[ let x= v+undef ine]εΣ) [E-LetLocal]

(E[[x/ref]e]εΣ 1)

where (Σ1 ref) = alloc(Σ,v+undef) (E[ let x= v+undef ine]εΣ) [E-LetGlobal]

(E[e] extend-env[[ε,x, ref ]]Σ 1)

where (Σ1 ref) = alloc(Σ,v+undef)

(E[ref]εΣ) (E[val]εΣ) [E-GetVar]

where Σ = ((ref1 v+undef1) ... (ref val) (refn v+undefn) ...)

(E[ref]εΣ) (E[val]εΣ) [E-GetVar]

where Σ = ((ref1 v+undef1) ... (ref val) (refn v+undefn) ...)

(E[ref := val]εΣ) (E[val]εΣ[ref:=val]) [E-AssignLocal]

(E[x := val]εΣ) (E[val]εΣ[ref:=val]) [E-AssignGlobal]

where ε = ((x2 ref2) ... (x ref) (x3 ref3) ...)

(E[(alloc val)]εΣ) (E[@refnew ]εΣ 1) [E-Alloc]

where (Σ1 refnew) = alloc(Σ,val)

(E[(fetch @ref )]εΣ) (E[Σ(ref)]εΣ) [E-Fetch]

(E[(set! @ref val)]εΣ) (E[val]εΣ 1) [E-Set!]

where Σ1 = Σ[ref:=val]

(E[@refobj [@refstr := val1]]εΣ) (E[val1]εΣ 2) [E-SetFieldAdd]

where 〈anycls1,mval,{string:ref,...}〉 = Σ(refobj),

(Σ1 refnew) = alloc(Σ,val1),

〈anycls2,string1,anydict〉 = Σ(refstr),

Σ2 = Σ1[refobj:=〈anycls1,mval,{string1:refnew,string:ref,...}〉],

(not (member string1 (string ...)))

(E[@refobj [@refstr := val1]]εΣ) [E-SetFieldUpdate]

(E[val1]εΣ[ref 1:=val1])

where 〈anycls1,mval,{string2:ref2,...,string1:ref1,string3:ref3,...}〉 = Σ(refobj),

〈anycls2,string1,anydict〉 = Σ(refstr)

(E[(delete ref)]ε((ref 1 v1) ... (ref v)(refn vn) ...)) [E-DeleteLocal]

(E[v]ε((ref 1 v1) ... (ref ☠)(refn vn) ...))

(E[(deletex)] ((x 1 ref1) ... (x ref) (xn refn) ...)Σ) [E-DeleteGlobal]

(E[Σ(ref)] ((x1 ref1) ... (xn refn) ...)Σ)

Figure 18: Various operations on mutable variables and values

(E[@refobj [@refstr ]]εΣ) [E-GetField-Class/Id]

(E[valresult]εΣ result)

where 〈anycls,string,anydict〉 = Σ(refstr),

〈xcls,mval,{string1:ref2,...}〉 = Σ(refobj),

(Σresult valresult) = class-lookup[[@refobj ,Σ(env-lookup[[ε,x cls ]]), string,Σ ]],

(not (member string (string1 ...)))

env-lookup[[((x1 ref1) ... (x ref) (xn refn) ...),x ]] = ref

Figure 19: Accessing fields on a class defined by an identifier

(E[@reffun (val ...)]εΣ) [E-AppArity] (E[(err〈%str,“arity-mismatch”,{}〉)]εΣ)

where 〈anyc,λ(x ...) (no-var).e ,anydict〉 = Σ(reffun), (not (equal? (length (val ...)) (length (x ...))))

(E[@reffun (val ...)]εΣ) [E-AppVarArgsArity] (E[(err〈%str,“arity-mismatch-vargs”,{}〉)]εΣ)

where 〈anyc,λ(x ...) (y).e ,anydict〉 = Σ(reffun), (< (length (val ...)) (length (x ...)))

(E[@reffun (val ...)]εΣ) [E-AppVarArgs1]

(E[(frame subst[[(x ...y varg), (refarg ... reftupleptr),e ]])]εΣ 3)

where 〈anyc,λ(x ...) (yvarg).e ,anydict〉 = Σ(reffun), (>= (length (val ...)) (length (x ...))),

(valarg ...) = (take (val ...) (length (x ...))),

(valrest ...) = (drop (val ...) (length (x ...))),

valtuple = 〈%tuple,(valrest ...),{}〉,

(Σ1 (refarg ...)) = extend-store/list[[Σ,(valarg ...) ]],

(Σ2 reftuple) = alloc(Σ1,valtuple),

(Σ3 reftupleptr) = alloc(Σ2,@reftuple )

(E[@reffun (val ...)*@refvar ]εΣ) [E-AppVarArgs2]

(E[@reffun (val ... valextra ...)]εΣ)

where 〈anyc,(valextra ...),anydict〉 = Σ(refvar)

Figure 20: Variable-arity functions

5. Variable-arity Functions 6. Modules We implement Python’s variable-arity functions directly in We model modules with the two rules in figure 21. E- our core semantics, with the reduction rules shown in fig- ConstructModule starts the evaluation of the module, which ure 20. We show first the two arity-mismatch cases in the is represented by a meta-code structure. A con- semantics, where either no vararg is supplied and the argu- tains a list of global variables to bind for the module, a name ment count is wrong, or where a vararg is supplied but the for the module, and an expression that holds the module’s count is too low. If the count is higher than the number of body. To start evaluating the module, a new location is allo- parameters and a vararg is present, a new tuple is allocated cated for each used in the module, initialized with the extra arguments, and passed as the vararg. Finally, to ☠ in the store, and a new environment is created mapping the form e(e ...)*e allows a variable-length collection of each of these new identifiers to the new locations. arguments to be passed to the function; this mimics apply Evaluation that proceeds inside the module, replacing the in languages like Racket or JavaScript. global environment  with the newly-created environment. The old environment is stored with a new in-module form that is left in the current context. This step also sets up an

(E[(construct-module @refmod )]ε old Σ) [E-ConstructModule]

(E[(in-modulee body εold)

(alloc〈$module,(no-meta),{string arg:refnew,...}〉)]

εnew Σ1)

where 〈anycls,(meta-code (xarg ...)x name ebody),anydict〉 = Σ(refmod),

(Σ1 εnew (refnew ...)) = vars->fresh-env[[Σ,(xarg ...) ]],

(stringarg ...) = (map symbol->string (xarg ...))

(E[(in-modulevε old)]ε mod Σ) (E[vnone]ε old Σ) [E-ModuleDone] vars->fresh-env[[Σ, () ]] = (Σ () ())

vars->fresh-env[[Σ,(x) ]] = (Σ1 ((x ref)) (ref))

where (Σ1 (ref)) = (allocΣ ☠)

vars->fresh-env[[Σ,(x xr ...) ]] = (Σ2 ((x ref)(xrest refrest) ...) (ref refrest ...))

where (Σ1 ref) = alloc(Σ,☠),

(Σ2 ((xrest refrest) ...) (refrest ...)) = vars->fresh-env[[Σ1,(xr ...) ]]

Figure 21: Simple modules in λπ expression that will create a new module object, whose fields For PyCharm, if we rename the x to y, the hold the global variable references of the running module. class variable x also gets changed to y, but the access at When the evaluation of the module is complete (the C.x does not. This changes the program to throw an error. in-module term sees a value), the old global environment If we instead select the x in C.x and refactor to y, The class is reinstated. variable and use of x change, but the original definition’s To desugar to these core forms, we desugar the files to parameter does not. This changes the behavior to an error be imported, and analyze their body to find all the global again, as y is an unbound identifier in the body of meth. variables they use. The desugared expression and variables PyDev has the same problem as PyCharm with renaming are combined with the filename to create the meta-code ob- the function’s parameter. If we instead try rename the x in ject. This is roughly an implementation of Python’s com- the body of C, it gets it mostly right, but also renames all the pile, and it should be straightforward to extend it to imple- instances of x in our strings (something that even a parser, ment exec, though for now we’ve focused on specifically much less a lexical analyzer should be able to detect): the module case. def f(x): Appendix 2: Confusing Rename Refactorings class C(object): y = "C’s y" This program: # we highlighed the x before = above def f(x): # and renamed to y class C(object): def meth(self): x = "C’s x" return x + ’ ’ + C.y def meth(self): return C return x + ’, ’ + C.x return C f(’input y’)().meth() # ==> ’input y, C’s y’ f(’input x’)().meth() WingWare IDE for Python is less obviously wrong: it # ==> ’input x, C’s x’ pops up a dialog with different bindings and asks the user confuses the variable rename refactoring of all the Python to check the ones they want rewritten. However, if we try IDEs we tried. We present these weaknesses to show that to refactor based on the x inside the method, it doesn’t getting a scope analysis right in Python is quite hard! We give us an option to rename the function parameter, only found these tools by following recommendations on Stack- the class variable name and the access at C.x. In other Overflow, a trusted resource. Two of the tools we tested, Py- cases it provides a list that contains a superset of the actual Charm and WingWare IDE, are products that developers ac- identifiers that should be renamed. In other words, it not only tually purchase to do Python development (we performed overapproximates (which in Python may be inevitable), it this experiment in their free trials). also (more dangerously) undershoots.