Standard ML of New Jersey
Andrew W. Appel∗ (Princeton University) and David B. MacQueen (AT&T Bell Laboratories)

CS-TR-329-91, Dept. of Computer Science, Princeton University, June 1991. This paper appeared in Third Int'l Symp. on Prog. Lang. Implementation and Logic Programming, Springer-Verlag LNCS 528, pp. 1–13, August 1991.

∗Supported in part by NSF grant CCR-9002786.

Abstract

The Standard ML of New Jersey compiler has been under development for five years now. We have developed a robust and complete environment for Standard ML that supports the implementation of large software systems and generates efficient code. The compiler has also served as a laboratory for developing novel implementation techniques for a sophisticated type and module system, continuation-based code generation, efficient pattern matching, and concurrent programming features.

1 Introduction

Standard ML of New Jersey is a compiler and programming environment for the Standard ML language[26] that has been continuously developed since early 1986. Our initial goal was to produce a working ML front end and interpreter for programming language research, but the scope of the project has expanded considerably. We believe that Standard ML may be the best general-purpose programming language yet developed; to demonstrate this, we must provide high-quality, robust, and efficient tools for software engineering.

Along the way we have learned many useful things about the design and implementation of "modern" programming languages. There were some unexpected interactions between the module system, type system, code generator, debugger, garbage collector, runtime data format, and hardware; and some things were much easier than expected. We wrote an early description of the compiler in the spring of 1987[7], but almost every component of the compiler has since been redesigned and reimplemented at least once, so it is worthwhile to provide an updated overview of the system and our implementation experience.

Our compiler is structured in a rather conventional way: the input stream is broken into tokens by a lexical analyzer, parsed according to a context-free grammar, semantically analyzed into an annotated abstract syntax tree, type-checked, and translated into a lower-level intermediate language. This is the "front end" of the compiler. Then the intermediate language—Continuation-Passing Style—is "optimized," closures are introduced to implement lexical scoping, registers are allocated, target-machine instructions are generated, and (on RISC machines) instructions are scheduled to avoid pipeline delays; these together constitute the "back end."

2 Parsing

Early in the development of the compiler we used a hand-written lexical analyzer and a recursive-descent parser. In both of these components the code for semantic analysis was intermixed with the parsing code. This made error recovery difficult, and it was difficult to understand the syntax or semantics individually. We now have excellent tools[8, 32] for the automatic generation of lexical analyzers and error-correcting parsers. Syntactic error recovery is handled automatically by the parser generator, and semantic actions are only evaluated on correct (or corrected) parses. This has greatly improved both the quality of the error messages and the robustness of the compiler on incorrect inputs. We remark that it would have been helpful if the definition of Standard ML[26] had included an LR(1) grammar for the language.

There are two places in the ML grammar that appear not to be context free. One is the treatment of data constructors: according to the definition, constructor names are in a different lexical class than variable names, even though the distinction depends on the semantic analysis of previous datatype definitions. However, by putting constructors and variables into the same class of lexical tokens, and the same name space, parsing can be done correctly and the difference resolved in semantic analysis.

The other context-dependent aspect of syntax is the parsing of infix identifiers. ML allows the programmer to specify any identifier as infix, with an operator precedence ranging from 0 to 9. Our solution to this problem is to completely ignore operator precedence in writing our LALR(1) grammar; the expression a+b*c is parsed into the list [a, +, b, *, c], and the semantic analysis routines include a simple operator precedence parser (35 lines of ML).

Each production of our grammar is annotated by a semantic action, roughly in the style made popular by YACC[16]. Our semantic actions are written like a denotational semantics or attribute grammar, where each fragment is a function that takes inherited attributes as parameters and returns synthesized attributes as results. Within the actions there are occasional side-effects; e.g. when the type-checker performs unification by the modification of ref-cells.

A complete parse yields a function p parameterized by a static environment e (of identifiers defined in previous compilation units, etc.). No side-effects occur until p is applied to e, at which point e is distributed by further function calls to many levels of the parse tree. In essence, before p is applied to e it is a tree of closures (one pointing to the other) that is isomorphic to the concrete parse tree of the program. Yet we have not had to introduce a myriad of data constructors to describe concrete parse trees!

Delaying the semantic actions is useful to the error-correcting parser. If an error in the parse occurs, the parser might want to correct it at a point 10 tokens previous; this means discarding the last few semantic actions. Since the actions have had no side-effects, it is easy to discard them. Then, when a complete correct parse is constructed, its semantic value can be applied to the environment e and all the side-effects will go off in the right order.

Finally, the treatment of mutually-recursive definitions is easier with delayed semantic actions; the newly-defined identifiers can be entered into the environment before the right-hand-sides are processed.

There is one disadvantage to this arrangement. It turns out that the closure representation of the concrete parse tree is much larger than the annotated parse tree that results from performing the semantic actions. Thus, if we had used a more conventional style in which the actions are performed as the input is parsed, the compiler would use less memory.

Our parser-generator provides, for each nonterminal in the input, the line number (and position within the line) of the beginning and end of the program fragment corresponding to that nonterminal. These are used to add accurate locality information to error messages. Furthermore, these line numbers are sprinkled into the annotated abstract syntax tree so that the type checker, match compiler, and debugger can also give good diagnostics.

3 Semantic analysis

A static environment maps each variable of the program to a binding containing its type and its runtime access information. The type is used for compile-time type checking, and is not used at runtime. The access information is (typically) the name of a low-level λ-calculus variable that will be manipulated by the code generator. Static environments also map other kinds of identifiers—data constructors, type constructors, structure names, etc.—to other kinds of bindings.

Our initial implementation treated environments imperatively: the operations on environments were to add a new binding to the global environment; to "mark" (save) the state of the environment; to revert back to a previous mark; and, for implementation of the module system, to encapsulate into a special table everything added since a particular mark. We did this even though we knew better—denotational semantics or attribute grammars would have us treat environments as pure values, to be combined to yield larger environments—because we thought that imperative environments would be faster.

We have recently changed to a pure functional style of environments, in which the operations are to create an environment with a single binding, and to layer one environment on top of another nondestructively, yielding a new environment. The implementation of this abstract data type has side effects, as sufficiently large environment-values are represented as hash tables, etc. We made this change to accommodate the new debugger, which must allow the user to be in several environments simultaneously; and to allow the implementation of "make" programs, which need explicit control over the static environments of the programs being compiled. Though we were willing to suffer a performance degradation in exchange for this flexibility, we found "pure" environments to be just as fast as imperative ones.

This illustrates a more general principle that we have noticed in ML program development. Many parts of the compiler that we initially implemented in an imperative style have been rewritten piecemeal in a cleaner functional style. This is one of the advantages of ML: programs (and programmers) can migrate gradually to "functional" programming.

Type checking

The main type-checking algorithm has changed relatively little since our earlier description[7]. The representations of types, type constructors, and type variables have been cleaned up in various ways, but the basic algorithm for type checking is still based on a straightforward unification algorithm.

when it interferes with the matching of a signature specification merely because of the use of an imperative style within a function's definition. Such implementation choices should be invisible in the type. Research continues on this problem[17, 22, 38], but there is no satisfactory solution yet.

The interface between the type checker and the
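The operator-precedence pass described in Section 2 might be sketched as follows. This is an illustrative reconstruction, not the compiler's actual 35-line routine: the datatypes and function names are invented, the precedence values are supplied by the caller (6 and 7 happen to be the default precedences of + and * in Standard ML), and right-associative (infixr) operators are ignored for brevity.

```sml
datatype exp  = Var of string
              | App of string * exp * exp       (* operator applied to operands *)
datatype item = Operand of exp
              | Operator of string * int        (* name and precedence, 0..9 *)

(* Precedence climbing over the flat operand/operator list:
   combine every operator whose precedence is at least min. *)
fun resolve min lhs ((oper as Operator (name, prec)) :: Operand e :: rest) =
      if prec < min then (lhs, oper :: Operand e :: rest)
      else
        let val (rhs, rest') = resolve (prec + 1) e rest  (* treat as left-assoc *)
        in resolve min (App (name, lhs, rhs)) rest' end
  | resolve _ lhs items = (lhs, items)

fun parseInfix (Operand e :: rest) = #1 (resolve 0 e rest)
  | parseInfix _ = raise Fail "expression expected"

(* a+b*c, i.e. [a, +, b, *, c] with prec(+)=6 and prec(*)=7, yields
   App ("+", Var "a", App ("*", Var "b", Var "c"))                  *)
val abc = parseInfix
  [Operand (Var "a"), Operator ("+", 6),
   Operand (Var "b"), Operator ("*", 7), Operand (Var "c")]
```

Because the LALR(1) grammar produces the flat list unconditionally, fixity declarations never have to feed back into the parser tables.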
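The delayed-action scheme of Section 2, in which a completed parse is a tree of closures awaiting a static environment, can be sketched like this. Every type and name below is invented for illustration; the real annotated syntax and environments are far richer.

```sml
type env = (string * string) list              (* toy static environment *)
datatype absyn = Var of string * string        (* variable, type annotation *)
               | Let of string * absyn * absyn

(* A delayed semantic action: inherited environment in, synthesized
   annotated syntax out.  Building one has no side-effects, so the
   error-correcting parser can discard recent actions freely.       *)
type action = env -> absyn

fun varAction name : action = fn env =>
  (case List.find (fn (n, _) => n = name) env of
       SOME (_, ty) => Var (name, ty)
     | NONE => raise Fail ("unbound " ^ name))

(* A binding form extends the environment before running the body's
   action; nothing actually happens until the whole closure tree is
   applied to an environment, and then in the right order.          *)
fun letAction (name, ty, rhs : action, body : action) : action =
  fn env => Let (name, rhs env, body ((name, ty) :: env))
```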
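The functional environment interface of Section 3 can be sketched as follows. The names and the naive representation are assumptions for the example; as the text notes, the real implementation switches sufficiently large environment-values to hash tables.

```sml
datatype 'b env
  = Empty
  | Bind of string * 'b          (* environment with a single binding *)
  | Layer of 'b env * 'b env     (* newer layered nondestructively over older *)

fun atop (newer, older) = Layer (newer, older)

(* Newer layers shadow older ones. *)
fun look Empty _ = NONE
  | look (Bind (n, b)) name = if n = name then SOME b else NONE
  | look (Layer (newer, older)) name =
      (case look newer name of NONE => look older name | found => found)

(* Layering never destroys the old environment, so a debugger can stand
   in several environments at once and a "make" tool can keep explicit
   control over which environment each program is compiled in.         *)
val outer = Bind ("x", "outer")
val inner = atop (Bind ("x", "inner"), outer)
(* look inner "x" = SOME "inner";  look outer "x" = SOME "outer" *)
```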
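A straightforward unification algorithm over ref-cell type variables, in the spirit of the one the text describes (Section 2 mentions that the type checker unifies by modifying ref-cells), might look like this. The datatype and names are invented, and the occurs check is omitted for brevity.

```sml
datatype ty
  = Tyvar of ty option ref        (* ref NONE = unbound variable *)
  | Tycon of string * ty list     (* e.g. Tycon ("->", [dom, rng]) *)

(* Chase instantiated variables to the representative type. *)
fun prune (Tyvar (ref (SOME t))) = prune t
  | prune t = t

(* Unification by modification of ref-cells: instantiating a variable
   is a side-effect visible through every type that shares the cell.  *)
fun unify (t1, t2) =
  case (prune t1, prune t2) of
      (Tyvar r1, Tyvar r2) => if r1 = r2 then () else r1 := SOME (Tyvar r2)
    | (Tyvar r, t) => r := SOME t
    | (t, Tyvar r) => r := SOME t
    | (Tycon (c1, args1), Tycon (c2, args2)) =>
        if c1 = c2 andalso length args1 = length args2
        then ListPair.app unify (args1, args2)
        else raise Fail ("cannot unify " ^ c1 ^ " with " ^ c2)
```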