Standard ML of New Jersey

Andrew W. Appel∗ (Princeton University) and David B. MacQueen (AT&T Bell Laboratories)

CS-TR-329-91, Dept. of Computer Science, Princeton University, June 1991. This paper appeared in Third Int'l Symp. on Prog. Lang. Implementation and Logic Programming, Springer-Verlag LNCS 528, pp. 1–13, August 1991.

Abstract

The Standard ML of New Jersey compiler has been under development for five years now. We have developed a robust and complete environment for Standard ML that supports the implementation of large software systems and generates efficient code. The compiler has also served as a laboratory for developing novel implementation techniques for a sophisticated type and module system, continuation-based code generation, efficient pattern matching, and concurrent programming features.

1 Introduction

Standard ML of New Jersey is a compiler and programming environment for the Standard ML language[26] that has been continuously developed since early 1986. Our initial goal was to produce a working ML front end and interpreter for programming language research, but the scope of the project has expanded considerably. We believe that Standard ML may be the best general-purpose programming language yet developed; to demonstrate this, we must provide high-quality, robust, and efficient tools for software engineering.

Along the way we have learned many useful things about the design and implementation of "modern" programming languages. There were some unexpected interactions between the module system, type system, code generator, debugger, garbage collector, runtime data format, and hardware; and some things were much easier than expected.

We wrote an early description of the compiler in the spring of 1987[7], but almost every component of the compiler has since been redesigned and reimplemented at least once, so it is worthwhile to provide an updated overview of the system and our implementation experience.

Our compiler is structured in a rather conventional way: the input stream is broken into tokens by a lexical analyzer, parsed according to a context-free grammar, semantically analyzed into an annotated abstract syntax tree, type-checked, and translated into a lower-level intermediate language. This is the "front end" of the compiler. Then the intermediate language—Continuation-Passing Style—is "optimized," closures are introduced to implement lexical scoping, registers are allocated, target-machine instructions are generated, and (on RISC machines) instructions are scheduled to avoid pipeline delays; these together constitute the "back end."

2 Parsing

Early in the development of the compiler we used a hand-written lexical analyzer and a recursive-descent parser. In both of these components the code for semantic analysis was intermixed with the parsing code. This made error recovery difficult, and it was difficult to understand the syntax or semantics individually. We now have excellent tools[8, 32] for the automatic generation of lexical analyzers and error-correcting parsers. Syntactic error recovery is handled automatically by the parser generator, and semantic actions are only evaluated on correct (or corrected) parses. This has greatly improved both the quality of the error messages and the robustness of the compiler on incorrect inputs. We remark that it would have been helpful if the definition of Standard ML[26] had included an LR(1) grammar for the language.

There are two places in the ML grammar that appear not to be context free. One is the treatment of data constructors: according to the definition, constructor names are in a different lexical class than variable names, even though the distinction depends on the semantic analysis of previous datatype definitions. However, by putting constructors and variables into the same class of lexical tokens, and the same name space, parsing can be done correctly and the difference resolved in semantic analysis.

∗ Supported in part by NSF grant CCR-9002786.
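By way of illustration, the single-token-class approach might be sketched in a few lines of ML; the binding datatype and list-based environment here are hypothetical stand-ins for exposition, not SML/NJ's actual representations:

```sml
(* Sketch: identifiers are lexed into one token class, and the
   environment built from earlier datatype declarations decides how
   each identifier is interpreted during semantic analysis. *)
datatype binding = CONbind   (* bound as a data constructor *)
                 | VARbind   (* everything else: a variable *)

(* hypothetical environment: a list of (name, binding) pairs *)
type env = (string * binding) list

fun resolve (env: env) (id: string) : binding =
    case List.find (fn (n, _) => n = id) env of
        SOME (_, b) => b
      | NONE => VARbind   (* unbound identifiers act as variables *)

(* After "datatype t = Leaf | Node of t * t", the environment maps
   Leaf and Node to CONbind, so in a pattern like Node (a, b) the
   name Node is analyzed as a constructor and a, b as variables. *)
val e = [("Leaf", CONbind), ("Node", CONbind)]
val _ = resolve e "Node"   (* CONbind *)
val _ = resolve e "x"      (* VARbind *)
```

With this arrangement the lexer never needs to consult the environment; the constructor/variable decision is deferred until the relevant datatype bindings are in scope.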

The other context-dependent aspect of syntax is the parsing of infix identifiers. ML allows the programmer to specify any identifier as infix, with an operator precedence ranging from 0 to 9. Our solution to this problem is to completely ignore operator precedence in writing our LALR(1) grammar; the expression a+b∗c is parsed into the list [a, +, b, ∗, c] and the semantic analysis routines include a simple operator precedence parser (35 lines of ML).

Each production of our grammar is annotated by a semantic action, roughly in the style made popular by YACC[16]. Our semantic actions are written like a denotational semantics or attribute grammar, where each fragment is a function that takes inherited attributes as parameters and returns synthesized attributes as results. Within the actions there are occasional side-effects; e.g. when the type-checker performs unification by the modification of ref-cells.

A complete parse yields a function p parameterized by a static environment e (of identifiers defined in previous compilation units, etc.). No side-effects occur until p is applied to e, at which point e is distributed by further function calls to many levels of the parse tree. In essence, before p is applied to e it is a tree of closures (one pointing to the other) that is isomorphic to the concrete parse tree of the program. Yet we have not had to introduce a myriad of data constructors to describe concrete parse trees!

Delaying the semantic actions is useful to the error-correcting parser. If an error in the parse occurs, the parser might want to correct it at a point 10 tokens previous; this means discarding the last few semantic actions. Since the actions have had no side-effects, it is easy to discard them. Then, when a complete correct parse is constructed, its semantic value can be applied to the environment e and all the side-effects will go off in the right order. Finally, the treatment of mutually-recursive definitions is easier with delayed semantic actions; the newly-defined identifiers can be entered into the environment before the right-hand-sides are processed.

There is one disadvantage to this arrangement. It turns out that the closure representation of the concrete parse tree is much larger than the annotated parse tree that results from performing the semantic actions. Thus, if we had used a more conventional style in which the actions are performed as the input is parsed, the compiler would use less memory.

Our parser-generator provides, for each nonterminal in the input, the line number (and position within the line) of the beginning and end of the program fragment corresponding to that nonterminal. These are used to add accurate locality information to error messages. Furthermore, these line numbers are sprinkled into the annotated abstract syntax tree so that the type checker, match compiler, and debugger can also give good diagnostics.

3 Semantic analysis

A static environment maps each variable of the program to a binding containing its type and its runtime access information. The type is used for compile-time type checking, and is not used at runtime. The access information is (typically) the name of a low-level λ-calculus variable that will be manipulated by the code generator. Static environments also map other kinds of identifiers—data constructors, type constructors, structure names, etc.—to other kinds of bindings.

Our initial implementation treated environments imperatively: the operations on environments were to add a new binding to the global environment; to "mark" (save) the state of the environment; to revert back to a previous mark; and, for implementation of the module system, to encapsulate into a special table everything added since a particular mark. We did this even though we knew better—denotational semantics or attribute grammars would have us treat environments as pure values, to be combined to yield larger environments—because we thought that imperative environments would be faster.

We have recently changed to a pure functional style of environments, in which the operations are to create an environment with a single binding, and to layer one environment on top of another nondestructively, yielding a new environment. The implementation of this abstract data type has side effects, as sufficiently large environment-values are represented as hash tables, etc. We made this change to accommodate the new debugger, which must allow the user to be in several environments simultaneously; and to allow the implementation of "make" programs, which need explicit control over the static environments of the programs being compiled. Though we were willing to suffer a performance degradation in exchange for this flexibility, we found "pure" environments to be just as fast as imperative ones.

This illustrates a more general principle that we have noticed in ML program development. Many parts of the compiler that we initially implemented in an imperative style have been rewritten piecemeal in a cleaner functional style. This is one of the advantages of ML: programs (and programmers) can migrate gradually to "functional" programming.
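The grammar-level trick described in Section 2 leaves an expression like a+b∗c as a flat list, to be regrouped by a tiny precedence pass. A sketch in ML of such a pass follows; the item and tree datatypes and the fixity table are hypothetical stand-ins (not the compiler's 35-line pass), and real ML fixity also distinguishes infix from infixr, which this sketch ignores:

```sml
(* Precedence climbing over the flat operand/operator list produced
   by the precedence-free LALR(1) grammar. *)
datatype item = ID of string | OP of string

datatype tree = LEAF of string
              | APP of string * tree * tree  (* operator, left, right *)

(* hypothetical fixity table: precedences 0..9, as in ML *)
fun prec "*" = 7 | prec "+" = 6 | prec _ = 0

fun parse items =
    let (* fold operators of at least precedence minp into lhs;
           minp+1 below makes equal-precedence operators associate
           to the left *)
        fun climb (lhs, OP rator :: rest, minp) =
              if prec rator < minp then (lhs, OP rator :: rest)
              else
                let val (rhs, rest') = operand rest (prec rator + 1)
                in climb (APP (rator, lhs, rhs), rest', minp) end
          | climb (lhs, rest, _) = (lhs, rest)
        and operand (ID x :: rest) minp = climb (LEAF x, rest, minp)
          | operand _ _ = raise Fail "syntax error"
        val (tree, _) = operand items 0
    in tree end

(* parse [ID "a", OP "+", ID "b", OP "*", ID "c"] builds
   APP ("+", LEAF "a", APP ("*", LEAF "b", LEAF "c")),
   i.e. a + (b * c). *)
```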
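The two functional operations named above (an environment with a single binding, and nondestructive layering) can be sketched as an ML abstract data type. The signature below is a hypothetical simplification of the compiler's, using a function representation where a production version would switch large layered environments to hash tables:

```sml
signature ENV =
sig
  type 'b env
  exception Unbound
  val empty : 'b env
  val bind  : string * 'b -> 'b env       (* environment with one binding *)
  val atop  : 'b env * 'b env -> 'b env   (* layer first over second *)
  val look  : 'b env -> string -> 'b
end

structure Env : ENV =
struct
  (* simplest representation: environments as lookup functions *)
  type 'b env = string -> 'b option
  exception Unbound
  val empty = fn _ => NONE
  fun bind (name, b) = fn n => if n = name then SOME b else NONE
  fun atop (inner, outer) =
      fn n => case inner n of SOME b => SOME b | NONE => outer n
  fun look e n = case e n of SOME b => b | NONE => raise Unbound
end

(* layering is nondestructive: e1 is unchanged by building e2 *)
val e1 = Env.bind ("x", 1)
val e2 = Env.atop (Env.bind ("x", 2), e1)
val _ = Env.look e2 "x"   (* 2: inner layer shadows outer *)
val _ = Env.look e1 "x"   (* 1: original environment intact *)
```

Because layering never destroys the environment underneath, a debugger can hold several environments at once, and a "make" program can control exactly which static environment each compilation unit sees.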
representations of types, type constructors, and Research continues on this problem[17, 22, 38], but type variables have been cleaned up in various ways, there is no satisfactory solution yet. but the basic algorithm for type checking is still The interface between the type checker and the based on a straightforward unification algorithm. parser is quite simple in most respects. There is The most complex part of the type-checking al- only one entry point to the type checker, a function gorithm deals with weak polymorphism, a restricted that is called to type-check each value declaration at form of polymorphism required to handle mutable top level and within a structure. However, the inter- values (references and arrays), exception transmis- face between type checking and the parser is com- sion, and communication (in extensions like Con- plicated by the problem of determining the scope or current ML[28]). Standard ML of New Jersey im- binding point of explicit type variables that appear plements a generalization of the imperative type in a program. The rather subtle scoping rules for variables described in the Definition[26, 34]. In our these type variables[26, Section 4.6][25, Section 4.4] scheme, imperative type variables are replaced by force the parser to pass sets of type variables both weak type variables that have an associated degree upward and downward (as both synthesized and in- of weakness: a nonnegative integer. A type vari- herited attributes of phrases). Once determined, able must be weak if it is involved in the type of the set of explicit type variables to be bound at a an expression denoting a reference, and its degree definition is stored in the abstract syntax represen- of weakness roughly measures the number of func- tation of the definition to make it available to the tion applications that must take place before the typechecker. reference value is actually created. 
A weakness de- gree of zero is disallowed at top level, which insures that top-level reference values (i.e. those existing 4 Modules within values in the top level environment) have monomorphic types. The type-checking algorithm The implementation of modules in SML of NJ has uses an abstract type occ to keep track of the “ap- evolved through three different designs. The main plicative context” of expression occurrences, which innovation of the second version factored signatures is approximately the balance of function abstrac- into a symbol table shared among all instances, tions over function applications surrounding the ex- and a small instantiation environment for each pression, and the occ value at a variable occurrence instance[23]. Experience with this version revealed determines the weakness degree of generic type vari- problems that led to the third implementation de- ables introduced by that occurrence. The occ value veloped in collaboration with Georges Gonthier and at a let binding is also used to determine which type Damien Doligez. variables can be generalized. The weak typing scheme is fairly subtle and has been prone to bugs, so it is important that it be for- Representations malized and proven sound (as the Tofte scheme has At the heart of the module system are the internal been [Tofte-thesis]). There are several people cur- representations of signatures, structures, and func- rently working on formalizing the treatment used in tors. Based on these representations, the following the compiler[17, 38]. principal procedures must be implemented: The weak polymorphism scheme currently used in Standard ML of New Jersey is not regarded as the fi- 1. signature creation—static evaluation of signa- nal word on polymorphism and references. It shares ture expressions; with the imperative type variable scheme the fault that weak polymorphism propagates more widely 2. structure creation—static evaluation of struc- than necessary. 
Even purely internal and tempo- ture expressions; rary uses of references in a function definition will often “poison” the function, giving it a weak type. 3. signature matching between a signature and a An example is the definition structure, creating an instance of the signature, and a view of the structure; fun f x = !(ref x) in which f has the type 1α → 1α, but ought to have 4. definition of functors—abstraction of the func- the strong polymorphic type α → α. This inessen- tor body expression with respect to the formal tial weak polymorphism is particularly annoying parameter; 5. functor application—instantiation of the for- to refer to volatile components, we can avoid the mal parameter by matching against the ac- bound stamps by using this relativized symbol ta- tual parameter, followed by instantiation of the ble alone to represent signatures. functor body. To drop the instantiation environment part of the signature representation, leaving only the symbol It is clear that instantiation of structure tem- table part, we need to revise the details of how envi- plates (i.e. signatures and functor bodies) is a criti- ronments are represented. Formerly a substructure cal process in the module system. It is also a process specification would be represented in the symbol ta- prone to consume excessive space and time if imple- ble by a binding like mented naively. Our implementation has achieved reasonable efficiency by separating the volatile part A → INDstr i of a template, that which changes with each in- stance, from the stable part that is common to all indicating that A is the ith substructure, and the instances and whose representation may therefore rest of the specification of A (in the form of a dummy be shared by all instances. The volatile compo- structure) would be found in the ith slot of the in- nents are stored in an instantiation environment stantiation environment. 
Since we are dropping the and they are referred to indirectly in the bindings dummy instantiation environment we must have all in the shared symbol table (or static environment) the information specifying A in the binding. Thus using indices or paths into the instantiation envi- the new implementation uses ronment. The instantiation environment is repre- → { } sented as a pair of arrays, one for type constructor A FORMALstrb pos = i, spec = sig A components, the other for substructures. The static representation of a structure is essen- as the binding of A. This makes the substructure tially an environment (i.e., symbol table) contain- signature specification available immediately in the ing bindings of types, variables, etc., and an iden- symbol table without having to access it indirectly tifying stamp[26, 33, 23]. In the second implemen- through an instantiation environment. tation a signature was represented as a “dummy” Another improvement in the representation of instance that differs from an ordinary structure signatures (and their instantiations) has to do with in that its volatile components contain dummy or the scope of instantiation environments. In the old bound stamps and it carries some additional infor- implementation each substructure had its own in- mation specifying sharing constraints. The volatile stantiation environment. But one substructure may components with their bound stamps are replaced, contain relative references to a component of an- or instantiated, during signature matching by cor- other substructure, as in the following example responding components from the structure being signature S1 = matched. Similarly, a functor body is represented sig as a structure with dummy stamps that are replaced structure A : sig type t end by newly generated stamps when the functor is ap- structure B : sig val x : A.t end plied. 
end The problem with representing signatures (and Here the type of B.x refers to the first type com- functor bodies) as dummy structures with bound ponent t of A. This would be represented from the stamps is the need to do alpha-conversion at var- standpoint of B as a relative path [parent, first sub- ious points to avoid confusing bound stamps. To structure, first type constructor]. To accommodate minimize this problem the previous implementation these cross-structure references when each struc- insures that the sets of bound stamps for each signa- ture has a local instantiation environment, the first ture and functor body are disjoint. But there is still structure slot in the instantiation environment con- a problem with signatures and functors that are sep- tains a pointer to the parent signature or structure. arately compiled and then imported into a new con- Defining and maintaining these parent pointers was text; here alpha-conversion of bound stamps is re- another source of complexity, since it made the rep- quired to maintain the disjointness property. Man- resentation highly cyclical. aging bound stamps was a source of complexity and The new representation avoids this problem by bugs in the module system. having a single instantiation environment shared by The usual way of avoiding the complications of the top level signature and all its embedded signa- bound variables is to replace them with an index- tures. An embedded signature is one that is writ- ing scheme, as is done with deBruijn indices in the ten “in-line” like the signatures of A and B in the lambda calculus[13]. Since in the symbol table part example above. In the above example, the new rep- we already used indices into instantiation arrays resentation of A.t within B is [first type constructor] since A.t will occupy the first type constructor slot relativized to this instantiation environment by re- in the shared instantiation environment. placing direct references by indirect paths. 
As in the A nonembedded signature is one that is defined case of signature matching, this minimizes the effort at top level and referred to by name. The signa- required to create an instance of the body when the ture S0 in the following example is a nonembedded functor is applied, because the symbol table infor- signature. mation is inherited unchanged by the instance. signature S0 = sig type t end Defining a functor is done in three steps: (1) The signature S1 = formal parameter signature is instantiated to create sig a dummy parameter structure. (2) This dummy structure A : S0 structure is bound to the formal parameter name structure B : sig val x : A.t end in the current environment and the resulting envi- end ronment is used to parse and type-check the func- In this case the type A.t of x uses the indi- tor body expression. If a result signature is spec- rect reference [first substructure, first type construc- ified the functor body is matched against it. (3) tor] meaning the first type constructor in the local The resulting body structure is scanned for volatile instantiation environment of A,whichisthefirst components, identified by having stamps belonging structure component in the instantiation environ- to the dummy parameter or generated within the ment of S1. S1 and B share a common instantiation body, and references to these volatile components environment because B is embedded in S1.ButS0, are replaced by indirect positional references into the signature of A, is nonembedded because it was an instantiation environment. defined externally to S1. It therefore can contain The instantiation of the parameter signature no references to other components of S1 and so it must produce a structure that is free modulo the is given its own private instantiation environment sharing constraints contained in the signature. In having the configuration appropriate to S0. 
other words, it must satisfy the explicit sharing constraints in the signature and all implicit shar- ing constraints implied by them, but there must Signature Matching by no extraneous sharing. The algorithm used for The goal of the representation of signatures is to this instantiation process is mainly due to George make it easy to instantiate them via signature Gonthier and is vaguely related to linear unification matching. A signature is a template for struc- algorithms. This instantiation process is also used tures, and a structure can be obtained from the sig- to create structures declared as abstractions using nature by adding an appropriate instantiation en- the abstraction declaration of Standard ML of New vironment (and recursively instantiating any sub- Jersey (a nonstandard extension of the language). structures with nonembedded signature specifica- Given this processing of the functor definition, tions). functor application is now a fairly straightforward The signature matching process involves the fol- process. The actual parameter is matched with the lowing steps: (1) Create an empty instantiation en- formal parameter signature yielding an instantia- vironment of a size specified in the signature repre- tion environment relative to the parameter signa- sentation. (2) For each component of the signature, ture. This is combined with a new instantiation in the order they were specified, check that there is a environment generated for the functor body using corresponding component in the structure and that freshly generated stamps in new volatile compo- this component satisfies the specification. When nents. this check succeeds it may result in an instance of a volatile component (e.g. a type constructor) that is entered into the new instantiation environment. 
(3) 5 Translation to λ-language Finally, having created the instantiation structure, any sharing constraints in the signature are verified During the semantic analysis phase, all static pro- by “inspection.” gram errors are detected; the result is an abstract parse tree annotated with type information. This Functors is then translated into a strict lambda calculus aug- mented with data constructors, numeric and string The key idea is to process a functor definition to constants, n-tuples, mutually-recursive functions; isolate volatile components of the result (those de- and various primitive operators for arithmetic, ma- riving from the parameter and those arising from nipulation of refs, numeric comparisons, etc. The generative declarations in the body) in an instanti- translation into λ-language is the phase of our com- ation environment. Then the body’s symbol table is piler that has changed least over the years. Though the λ language has data constructors, it equality function also uses these tags, so even with does not have pattern-matches. Instead, there is a a sophisticated garbage collector they can’t be done very simple case statement that determines which away with. (One alternative is to pass an equality- constructor has been applied at top level in a given test function along with every value of an equality value. The pattern-matches of ML must be trans- type, but this is also quite costly[36].) lated into discriminations on individual construc- Finally, the treatment of equality types in Stan- tors. This is done as described in our previous dard ML is irregular and incomplete[15]. The paper[7], though Bruce Duba has revised the de- Definition categorizes type constructors as either tails of the algorithm. “equality” or “nonequality” type constructors; but The dynamic semantics of structures and functors a more refined classification would more accurately are represented using the same lambda-language op- specify the effects of the ref operator. 
Some types erators as for the records and functions of the core that structurally support equality are classified as language. This means that the code generator and nonequality types by the Definition. runtime system don’t need to know anything about the module system, which is a great convenience. Also in this phase we handle equality tests. ML 6 Conversion to CPS allows any hereditarily nonabstract, nonfunctional values of the same type to be tested for equality; The λ-language is converted into continuation- even if the values have polymorphic types. In most passing style (CPS) before optimization and code cases, however, the types can be determined at com- generation. CPS is used because it has clean seman- pile time. For equality on atomic types (like in- tic properties (like λ-calculus), but it also matches teger and real), we substitute an efficient, type- the execution model of a von Neumann register ma- specific primitive operator for the generic equal- chine: variables of the CPS correspond closely to ity function. When constructed datatypes are registers on the machine, which leads to very effi- tested for equality, we automatically construct a cient code generaton[18]. set of mutually-recursive functions for the specific In the λ-language (with side-effecting operators) instance of the datatype; these are compiled into we must specify a call-by-value (strict) order of eval- the code for the user’s program. Only when the uation to really pin down the meaning of the pro- type is truly polymorphic—not known at compile gram; this means that we can’t simply do arbitrary time—is the general polymorphic equality function β-reductions (etc.) to partially evaluate and opti- invoked. This function interprets the tags of ob- mize the program. 
In the conversion to CPS, all jects at runtime to recursively compare bit-patterns order-of-evaluation information is encoded in the without knowing the full types of the objects it is chaining of function calls, and it doesn’t matter testing for equality. whether we consider the CPS to be strict or non- Standard ML’s polymorphic equality seriously strict. Thus, β-reductions and other optimizations complicates the compiler. In the front end, there are become much easier to specify and implement. special “equality type variables” to indicate poly- The CPS notation[30] and our representation of morphic types that are required to admit equality, it[5] are described elsewhere, as is a detailed descrip- and signatures have an eqtype keyword to denote tion of optimization techniques and runtime repre- exported types that admit equality. The eqtype sentations for CPS[4]. We will just summarize the property must be propagated among all types and important points here. structures that share in a functor definition. We es- In continuation-passing style, each function can timate that about 7% of the code in the front end have several arguments (in contrast to ML, in which of the compiler is there to implement polymorphic functions formally have only one parameter). Each equality. of the actual parameters to a function must be The effect on the back end and runtime system is atomic—a constant or a variable. The operands just as pernicious. Because ML is a statically-typed of an arithmetic operator must also be atomic; the language, it should not be necessary to have type result of the operation is bound to a newly-defined tags and descriptors on every runtime object (as variable. There is no provision for binding the re- Lisp does). 
The only reasons to have these tags are sult of a function call to a variable; “functions never for the garbage collector (so it can understand how return.” to traverse pointers and records) and for the poly- To use CPS for compiling a programming morphic equality function. But it’s possible to give language—in which functions are usually allowed the garbage collector a map of the type system[1], to return results, and expressions can have nontriv- so that it can figure out the types of runtime objects ial sub-expressions—it is necessary to use continu- without tags and descriptors. Yet the polymorphic ations. Instead of saying that a function call f(x) returns a value a, we can make a function k(a)that This means that the compiler is free to perform expresses what “the rest of the program” would do common sub-expression elimination on record ex- with the result a, and then call fcps(x, k). Then pressions (i.e. convert the first expression above to fcps, instead of returning, will call k with its result the second); the garbage collector is free to make a. several copies of a record (possibly useful for concur- After CPS-conversion, a source-language func- rent collection), or to merge several copies into one tion call looks just like a source-language function (a kind of “delayed hash-consing”); a distributed return—they both look like calls in the CPS. This implementation is free to keep separate copies of means it is easy to β-reduce the call without reduc- a record on different machines, etc. We have not ing the return, or vice versa; this kind of flexibility really exploited most of these opportunities yet, is very useful in reasoning about (and optimizing) however. tail-recursion, etc. In a strict λ-calculus, β-reduction is problemat- ical. 
In a strict λ-calculus, β-reduction is problematical. If the actual parameters to a function have side effects, or do not terminate, then they cannot be safely substituted for the formal parameters throughout the body of the function. Any actual parameter expression could contain a call to an unknown (at compile time) function, and in this case it is impossible to tell whether it has a side effect. But in CPS, the actual parameters to a function are always atomic expressions, which have no side effects and always terminate; so it's safe and easy to perform β-reduction and other kinds of substitutions.

In our optimizer, we take great advantage of a unique property of ML: records, n-tuples, constructors, etc., are immutable. That is, except for ref cells and arrays (which are identifiable at compile time through the type system), once a record is created it cannot be modified. This means that a fetch from a record will always yield the same result, even if the compiler arranges for it to be performed earlier or later than specified in the program. This allows much greater freedom in the partial evaluation of fetches (e.g. from pattern-matches), in constant-folding, in instruction scheduling, and in common-subexpression elimination than most compilers are permitted. (One would think that in a pure functional language like Haskell this immutable-record property would be similarly useful, but such languages are usually lazy, so that fetches from a lazy cell will yield different results the first and second times.)

A similar property of ML is that immutable records are not distinguishable by address. That is, if two records contain the same values, they are "the same;" the expressions

    [(x,y), (x,y)]
    let val a = (x,y) in [a,a] end

are indistinguishable in any context. This is not the case in most programming languages, where the different pairs (x,y) in the first list would have different addresses and could be distinguished by a pointer-equality test.

This means that the compiler is free to perform common-subexpression elimination on record expressions (i.e. to convert the first expression above to the second); the garbage collector is free to make several copies of a record (possibly useful for concurrent collection), or to merge several copies into one (a kind of "delayed hash-consing"); a distributed implementation is free to keep separate copies of a record on different machines; etc. We have not really exploited most of these opportunities yet, however.

7 Closure conversion

The conversion of λ-calculus to CPS makes the control flow of the program much more explicit, which is useful when performing optimizations. The next phase of our compiler, closure conversion, makes explicit the access to nonlocal variables (using lexical scope). In ML (and Scheme, Smalltalk, and other languages), function definitions may be nested inside each other, and an inner function can have free variables that are bound in an outer function. Therefore, the representation of a function-value (at runtime) must include some way to access the values of these free variables. The closure data structure allows a function to be represented by a pointer to a record containing

1. The address of the machine-code entry-point for the body of the function.

2. The values of the free variables of the function.

The code pointer (item 1) must be kept in a standardized location in all closures, for when a function f is passed as an argument to another function g, g must be able to extract the address of f in order to jump to f. But it is not necessary to keep the free variables (item 2) in any standard order; instead, g will simply pass f's closure-pointer as an extra argument to f, which will know how to extract its own free variables.

This mechanism is quite old[19] and reasonably efficient. However, the introduction of closures is usually performed as part of machine-code generation; we have made it a separate phase that rewrites the CPS representation of the program to include closure records. Thus, the output of the closure-conversion phase is a CPS expression in which it is guaranteed that no function has free variables; this expression has explicit record-creation operators to build closures, and explicit fetch operators to extract code-pointers and free variables from them. Since closure-introduction is not bundled together with other aspects of code generation, it is easier to introduce sophisticated closure techniques without breaking the rest of the compiler.
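A flat closure record of this shape can be sketched as follows (an illustrative Python model, not the compiler's actual data layout; tuples stand in for the explicit record-creation operators, and all names are ours):

```python
# Source program: the nested function 'inner' has free variables a and b.
def make_adder(a, b):
    def inner(x):
        return a + b + x
    return inner

# After closure conversion: no function has free variables.  A function
# value is a record whose first (standard) slot is the code pointer and
# whose remaining slots hold the values of the free variables.
def inner_code(clos, x):
    a, b = clos[1], clos[2]      # explicit fetches of free variables
    return a + b + x

def make_adder_conv(a, b):
    return (inner_code, a, b)    # explicit record creation (a flat closure)

def apply_closure(clos, x):
    # A caller g receiving an unknown function value extracts the code
    # pointer from the standard slot and passes the closure itself along,
    # so the function can find its own free variables.
    return clos[0](clos, x)

adder = make_adder_conv(10, 20)
print(apply_closure(adder, 12))  # same as make_adder(10, 20)(12)
```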
In general, we have found that structuring our compiler with so many phases—each with a clean and well-defined interface—has proven very successful in allowing work to proceed independently on different parts of the compiler.

Initially, we considered variations on two different closure representations, which we call flat and linked. A flat closure for a function f is a record containing the code-pointer for f and the values of each of f's free variables. A linked closure for f contains the code pointer, the value of each free variable bound by the enclosing function, and a pointer to the enclosing function's closure. Variables free in the enclosing function can be found by traversing the linked list of closures starting from f; this is just like the method of access links used in implementing static scope in Pascal.

It would seem that linked closures are cheaper to build (because a single pointer to the enclosing scope can be used instead of all the free variables from that scope) but costlier to access (getting a free variable requires traversing a linked list). In fact, we investigated many different representational tricks on the spectrum between flat and linked closures[6], including tricks where we use the same closure record for several different functions with several different code-pointers[5, 4].

In a "traditional" compiler, these tricks make a significant difference. But in the CPS representation, it appears that the pattern of functions and variable access narrows the effective difference between these techniques, so that closure representation is not usually too important.

There are two aspects of closures that are important, however. We have recently shown that using linked or merged closures can cause a compiled program to use much more memory[4]. For example, a program compiled with flat closures might use O(N) memory (i.e. simultaneous live data) on an input of size N, while the same program compiled with linked closures might use O(N²). Though this may happen rarely, we believe it is unacceptable (especially since the programmer will have no way to understand what is going on). We are therefore re-examining our closure representations to ensure "safety" of memory usage; this essentially means sticking to flat closures.

We have also introduced the notion of "callee-save registers"[9, 4]. Normally, when an "unknown" function (e.g. one from another compilation unit) is called in a compiler using CPS, all the registers (variables) that will be needed "after the call" are free variables of the continuation. As such, they are stored into the continuation closure, and fetched back after the continuation is invoked. In a conventional compiler, the caller of a function might similarly save registers into the stack frame, and fetch them back after the call.

But some conventional compilers also have "callee-save" registers. It is the responsibility of each function to leave these registers undisturbed; if they are needed during execution of the function, they must be saved and restored by the callee.

We can represent callee-save variables in the original CPS language, without changing the code-generation interface. We represent a continuation not as one argument but as N+1 arguments k0, k1, k2, ..., kN. Then, when the continuation k0 is invoked with "return-value" a, the variables k1, ..., kN will also be passed as arguments to the continuation. Since our code generator keeps all CPS variables in registers—including function parameters—the variables k1, ..., kN are, in effect, callee-save registers. We have found that N = 3 is sufficient to obtain a significant (7%) improvement in performance.
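The multiple-register continuation can be mimicked in the CPS-style Python of the earlier sketches, here with N = 2 (all names are illustrative, not the compiler's):

```python
# A continuation is N+1 values k0,...,kN (here N = 2).  k0 is the code to
# run on "return"; k1 and k2 carry values that are live across the call,
# riding in registers instead of being stored into a closure record.

def square_cps(x, k0, k1, k2):
    # An "unknown" function must pass k1 and k2 through undisturbed when
    # it invokes the continuation -- the callee-save obligation.
    return k0(x * x, k1, k2)

def caller_cps(x, a, b, k0, k1, k2):
    # a and b are needed after the call; ride them in the callee-save
    # slots rather than storing them into a continuation closure.
    def after(sq, a_saved, b_saved):
        return k0(sq + a_saved + b_saved, k1, k2)
    return square_cps(x, after, a, b)

halt = lambda v, k1, k2: v
print(caller_cps(5, 10, 7, halt, None, None))  # 5*5 + 10 + 7
```

The point of the convention is visible in `caller_cps`: the live values `a` and `b` cross the call as extra continuation arguments, with no memory traffic.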
8 Final code generation

The operators of the CPS notation—especially after closure-conversion—are similar to the instructions of a simple register/memory von Neumann machine. The recent trend towards RISC machines with large register sets makes CPS-based code generation very attractive. It is a relatively simple matter to translate the closure-converted CPS into simple abstract-machine instructions; these are then translated into native machine code for the MIPS, Sparc, VAX, or MC68020. The latter two machines are not RISC machines, and to do a really good job in code generation for them we would have to add a final peephole-optimization or instruction-selection phase. On the RISC machines, we have a final instruction-scheduling phase to minimize delays from run-time pipeline interlocks.

One interesting aspect of the final abstract-machine code generation is the register allocation. After closure-conversion and before code generation we have a spill phase that rewrites the CPS expression to limit the number of free variables of any subexpression to less than the number of registers on the target machine[5, 4]. It turns out that very few functions require any such rewriting, especially on modern machines with 32 registers; five spills in 40,000 lines of code is typical.

Because the free variables of any expression are guaranteed to fit in registers, register allocation is a very simple matter: when each variable is bound, only K other variables are live (i.e. free in the continuation of the operation that binds the variable), where K is less than the number of registers; the new variable can simply be assigned a register not occupied by any of the K live variables. Allocating a new record is similarly cheap: a sequence of stores relative to the free-space register, followed by the addition of a constant (the size of the new record) to the register.
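The allocation rule just described is simple enough to sketch directly (an illustrative model, not the compiler's code; `NREGS` and the input format are our assumptions):

```python
# Because the spill phase guarantees fewer live variables than registers,
# a free register always exists at each binding: just pick one.

NREGS = 32

def allocate(bindings, live_at_binding):
    """bindings: variable names in the order they are bound.
    live_at_binding[v]: variables still live when v is bound (< NREGS)."""
    reg_of = {}
    for v in bindings:
        in_use = {reg_of[w] for w in live_at_binding[v] if w in reg_of}
        reg_of[v] = next(r for r in range(NREGS) if r not in in_use)
    return reg_of

regs = allocate(["a", "b", "c"],
                {"a": set(), "b": {"a"}, "c": {"b"}})
print(regs)  # 'c' reuses a's register, since a is dead by the time c is bound
```

No interference graph or graph coloring is needed: the binding-at-a-time discipline of CPS makes the greedy choice always succeed.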
Programs compiled by Standard ML of New Jersey run about five times faster than Poly/ML programs, on the average (geometric mean). SML/NJ reportedly uses about 1.5 times as much heap space for execution[10]; and on a 68020-based platform (like a Sun-3), SML/NJ may not do relatively as well (since we don't generate really good code for that machine). So on obsolete machines with tiny memories, Poly/ML may do almost as well as SML/NJ.

Figure 2 compares implementations of several programming languages on a Knuth-Bendix benchmark. Standard ML of New Jersey does quite well, especially on the RISC machine (the DECstation 5000 has a MIPS processor).

                            Sun 3/280            DEC 5000/200
                            16 Mbytes            16 Mbytes
                            Run     G.C.         Run     G.C.
    CAML V2-6.1             14.5    14.8         6.2     6.2
    CAML Light 0.2          28.3                 6.5
    SML/NJ 0.65              9.6     0.3         1.7     0.1
    SML/NJ 0.65 x            8.5     0.3         1.4     0.1
    LeLisp 15.23             4.1                 1.4
    C (SunOS 3.5, cc -O)     4.35
    C (gcc 1.37.1, gcc -O)   4.22
    C (Ultrix 4.0, cc -O2)                       0.90

Figure 2: Comparison of several different compilers. Xavier Leroy translated Gerard Huet's Knuth-Bendix program into several different languages, and ran them on two different machines[21]. This table shows non-gc run time and gc time in seconds for each version of the program. Since the program uses higher-order functions, Leroy had to do manual lambda-lifting to write the program in Lisp and C, and in some places had to use explicit closures (structures containing function-pointers). CAML is a different version of the ML language (i.e. not Standard ML) developed at INRIA[11]; CAML V2-6.1 is a native-code compiler that shares the LeLisp runtime system, and CAML Light[20] is a compiler with a byte-code interpreter written in C. "SML/NJ x" refers to Standard ML of New Jersey with all modules placed in a "super-module" to allow cross-module optimization.

11 Continuations

One of the more significant language innovations in Standard ML of New Jersey is typed first-class continuations[14]. It turned out to be possible to add a major new capability to the language merely by introducing a new primitive type constructor and two primitive functions:

    type 'a cont
    val callcc : ('a cont -> 'a) -> 'a
    val throw : 'a cont -> 'a -> 'b

The type int cont is the type of a continuation that is expecting an integer value. The callcc function is similar to call-with-current-continuation or call/cc in Scheme—it is the primitive that captures continuation values. The function throw coerces a continuation into a function that can be applied to invoke the continuation. Since the invocation of a continuation does not return like a normal function call, the return type of throw k is a generic type variable that will unify with any type.

The runtime implementation of first-class continuations was also quite easy and very efficient, because of the use of continuation-passing style in the code generation, and the representation of continuations as objects in the heap. Bundling up the current continuation into a closure is just like what is done on the call to an escaping function, and throwing a value to a continuation is like a function call. So continuations are as cheap as ordinary function calls.

Continuations are not necessarily a good tool for routine programming, since they lend themselves to tricky and contorted control constructs. However, continuations have an important "behind the scenes" role to play in implementing useful tools and abstractions. They are used in the implementation of the interactive ML system to construct a barrier between user computation and the ML system; this makes it possible to export an executable image of a user function without including the ML compiler. Another application of continuations is Andrew Tolmach's replay debugger[35], where they are used to save control states; this is the basis of the time-travel capabilities of the debugger.

It is well known that continuations are useful for implementing coroutines and for simulating parallel threads of control[37]. Using continuations in conjunction with the signal-handling mechanisms implemented by John Reppy[27] (themselves expressed in terms of continuations), one can build light-weight process libraries with preemptive process scheduling entirely within Standard ML of New Jersey. Two major concurrency systems have been implemented at this point: Concurrent ML, by John Reppy[28], is based on CCS/CSP-style primitives (synchronous communication on typed channels) but introduces the novel idea of first-class events; ML Threads is a system designed by Eric Cooper and Greg Morrisett[12] that provides mutual-exclusion primitives for synchronization. A version of ML Threads runs on shared-memory multiprocessors, where threads can be scheduled to run in parallel on separate physical processors. Both Concurrent ML and ML Threads are implemented as ordinary ML modules, requiring no enhancements of the language itself—except that ML Threads required modification of the runtime system to support multiprocessing.
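Because the compiler already works in CPS, callcc and throw need almost no special machinery: the current continuation is already an ordinary heap object. In the CPS-style Python of the earlier sketches they can be modeled as follows (the names are ours, not the real runtime's):

```python
def callcc_cps(f, k):
    # Hand the current continuation k to f as a first-class value; if f
    # finishes normally, its result goes to the same k.
    return f(k, k)

def throw_cps(cont, v, k):
    # Invoke the captured continuation; the continuation k of the throw
    # itself is discarded, which is why throw "never returns" normally.
    return cont(v)

def example(x, k):
    def body(exit, k2):
        if x > 10:
            return throw_cps(exit, 0, k2)  # escape straight out
        return k2(x + 1)                   # normal return
    return callcc_cps(body, k)

print(example(5, lambda v: v))   # normal path: x + 1
print(example(99, lambda v: v))  # escapes with 0
```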
12 Related projects

A number of very useful enhancements of the Standard ML of New Jersey system are being carried out by other groups or individuals. One such project is the SML-to-C translator done by David Tarditi, Anurag Acharya, and Peter Lee at Carnegie Mellon[31]. This provides a very portable basis for running ML programs on a variety of hardware for which we do not yet have native code generators, with very respectable performance.

Mads Tofte and Nick Rothwell implemented the first version of separate compilation for Standard ML of New Jersey. Recently, Gene Rollins at Carnegie Mellon has developed a more sophisticated and efficient system called SourceGroups for managing separate compilation. SourceGroups builds on the primitive mechanisms provided by Tofte and Rothwell, but gains efficiency by doing a global analysis of dependencies among a set of modules and minimizing redundancy when loading or recompiling the modules.

John Reppy and Emden Gansner have developed a library for interacting with the X window system. This system is based on Concurrent ML and provides a much higher level of abstraction for writing graphical interfaces than the conventional C-based libraries.

13 Future Plans

The development of Standard ML of New Jersey and its environment is proceeding at an accelerating pace. John Reppy is implementing a new multi-generation, multi-arena garbage collector that should significantly improve space efficiency. Work is in progress to improve code generation and significantly speed up the back end. Exploratory work is being done on new features like type dynamic, extensible datatypes, and higher-order functors.

14 Acknowledgments

Many people have worked on Standard ML of New Jersey. We would like to thank John H. Reppy for many improvements and rewrites of the runtime system, for designing and implementing the signal-handling mechanism[27], improving the call/cc mechanism, designing the current mechanism for calls to C functions, implementing a sophisticated new garbage collector, generally making the runtime system more robust, and implementing the SPARC code generator; and for designing and implementing the Concurrent ML system[28] and its X-windows interface[29].
Thanks to Trevor Jim for helping to design the CPS representation[5]; and for implementing the match compiler and the original closure-converter, the original library of floating-point functions, and the original assembly-language implementation of external primitive functions.

Thanks to Bruce F. Duba for improvements to the match compiler, the CPS constant-folding phase, the in-line expansion phase, the spill phase, and numerous other parts of the compiler; and for his part in the design of the call-with-current-continuation mechanism[14].

Thanks to James W. O'Toole, who implemented the NS32032 code generator, and Norman Ramsey, who implemented the MIPS code generator.

We thank Andrew P. Tolmach for the SML/NJ debugger[35] and for the new pure-functional style of static environments; and Adam T. Dingle for the debugger's Emacs interface.

Thanks to James S. Mattson for the first version of the ML lexical analyzer generator; and to David R. Tarditi for making the lexer-generator production-quality[8], for implementing a really first-class parser generator[32], for helping to implement the type-reconstruction algorithm used by the debugger[35], and for the ML-to-C translator he implemented with Anurag Acharya and Peter Lee[31].

We appreciate Lal George's teaching the code generator about floating-point registers and making floating-point performance respectable, and his fixing of several difficult bugs not of his own creation. Thanks to Zhong Shao for the common-subexpression eliminator, as well as the callee-save convention that uses multiple-register continuations for faster procedure calls[9].

We thank Nick Rothwell and Mads Tofte for the initial implementation of the separate compilation mechanism, and Gene Rollins for his recent improvements.

Finally, we thank our user community, which sends us bug reports, keeps us honest, and actually finds useful things to do with Standard ML.

References

[1] Andrew W. Appel. Runtime tags aren't necessary. Lisp and Symbolic Computation, 2:153–162, 1989.

[2] Andrew W. Appel. Simple generational garbage collection and fast allocation. Software—Practice and Experience, 19(2):171–183, 1989.

[3] Andrew W. Appel. A runtime system. Lisp and Symbolic Computation, 3:343–380, 1990.

[4] Andrew W. Appel. Compiling with Continuations. Cambridge University Press, 1992.

[5] Andrew W. Appel and Trevor Jim. Continuation-passing, closure-passing style. In Sixteenth ACM Symp. on Principles of Programming Languages, pages 293–302, 1989.

[6] Andrew W. Appel and Trevor T. Y. Jim. Optimizing closure environment representations. Technical Report 168, Dept. of Computer Science, Princeton University, 1988.

[7] Andrew W. Appel and David B. MacQueen. A Standard ML compiler. In Gilles Kahn, editor, Functional Programming Languages and Computer Architecture (LNCS 274), pages 301–324. Springer-Verlag, 1987.

[8] Andrew W. Appel, James S. Mattson, and David R. Tarditi. A lexical analyzer generator for Standard ML. Distributed with Standard ML of New Jersey, December 1989.

[9] Andrew W. Appel and Zhong Shao. Callee-save registers in continuation-passing style. Technical Report CS-TR-326-91, Dept. of Computer Science, Princeton University, Princeton, NJ, June 1991.

[10] David Berry. SML resources. Sent to the SML mailing list by [email protected], May 1991.

[11] CAML: The reference manual (version 2.3). Projet Formel, INRIA-ENS, June 1987.

[12] Eric C. Cooper and J. Gregory Morrisett. Adding threads to Standard ML. Technical Report CMU-CS-90-186, School of Computer Science, Carnegie Mellon University, December 1990.

[13] N. G. de Bruijn. Lambda calculus notation with nameless dummies, a tool for automatic formula manipulation. Indag. Math., 34:381–392, 1972.

[14] Bruce Duba, Robert Harper, and David MacQueen. Typing first-class continuations in ML. In Eighteenth Annual ACM Symp. on Principles of Prog. Languages, pages 163–173, January 1991.

[15] Carl A. Gunter, Elsa L. Gunter, and David B. MacQueen. An abstract interpretation for ML equality kinds. In Theoretical Aspects of Computer Software. Springer, September 1991.

[16] S. C. Johnson. Yacc—yet another compiler compiler. Technical Report CSTR-32, AT&T Bell Laboratories, Murray Hill, NJ, 1975.

[17] James William O'Toole Jr. Type abstraction rules for references: A comparison of four which have achieved notoriety. Technical Report 380, MIT Lab. for Computer Science, 1990.

[18] David Kranz. ORBIT: An optimizing compiler for Scheme. PhD thesis, Yale University, 1987.

[19] P. J. Landin. The mechanical evaluation of expressions. Computer J., 6(4):308–320, 1964.

[20] Xavier Leroy. The ZINC experiment: an economical implementation of the ML language. Technical Report No. 117, INRIA, February 1990.

[21] Xavier Leroy. INRIA, personal communication, 1991.

[22] Xavier Leroy and Pierre Weis. Polymorphic type inference and assignment. In Eighteenth Annual ACM Symp. on Principles of Prog. Languages, January 1991.

[23] David B. MacQueen. The implementation of Standard ML modules. In ACM Conf. on Lisp and Functional Programming, pages 212–223, 1988.

[24] David C. J. Matthews. Papers on Poly/ML. Technical Report T.R. No. 161, Computer Laboratory, University of Cambridge, February 1989.

[25] Robin Milner and Mads Tofte. Commentary on Standard ML. MIT Press, Cambridge, Massachusetts, 1991.

[26] Robin Milner, Mads Tofte, and Robert Harper. The Definition of Standard ML. MIT Press, Cambridge, Mass., 1989.

[27] John H. Reppy. Asynchronous signals in Standard ML. Technical Report TR 90-1144, Cornell University, Dept. of Computer Science, Ithaca, NY, 1990.

[28] John H. Reppy. Concurrent programming with events. Technical report, Cornell University, Dept. of Computer Science, Ithaca, NY, 1990.

[29] John H. Reppy and Emden R. Gansner. The eXene library manual. Cornell Univ. Dept. of Computer Science, March 1991.

[30] Guy L. Steele. Rabbit: a compiler for Scheme. Technical Report AI-TR-474, MIT, 1978.

[31] David R. Tarditi, Anurag Acharya, and Peter Lee. No assembly required: Compiling Standard ML to C. Technical Report CMU-CS-90-187, Carnegie Mellon Univ., November 1990.

[32] David R. Tarditi and Andrew W. Appel. ML-Yacc, version 2.0. Distributed with Standard ML of New Jersey, April 1990.

[33] Mads Tofte. Operational Semantics and Polymorphic Type Inference. PhD thesis, Edinburgh University, 1988. CST-52-88.

[34] Mads Tofte. Type inference for polymorphic references. Information and Computation, 89:1–34, November 1990.

[35] Andrew P. Tolmach and Andrew W. Appel. Debugging Standard ML without reverse engineering. In Proc. 1990 ACM Conf. on Lisp and Functional Programming, pages 1–12, June 1990.

[36] Philip Wadler and Stephen Blott. How to make ad-hoc polymorphism less ad hoc. In Sixteenth Annual ACM Symp. on Principles of Prog. Languages, pages 60–76, January 1989.

[37] Mitchell Wand. Continuation-based multiprocessing. In Conf. Record of the 1980 Lisp Conf., pages 19–28, August 1980.

[38] Andrew K. Wright and Matthias Felleisen. A syntactic approach to type soundness. Technical Report COMP TR91-160, Rice University, April 1991.