ML Module Mania: a Type-Safe, Separately Compiled, Extensible Interpreter
Total Page:16
File Type:pdf, Size:1020Kb
ML Module Mania: A Type-Safe, Separately Compiled, Extensible Interpreter The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Ramsey, Norman. 2005. ML Module Mania: A Type-Safe, Separately Compiled, Extensible Interpreter. Harvard Computer Science Group Technical Report TR-11-05. Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:25104737 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA ¡¢ £¥¤§¦©¨ © ¥ ! " $# %& !'( *)+ %¨,.-/£102©¨ 3¤4#576¥)+ %&8+9©¨ :1)+ 3';&'( )< %' =?>A@CBEDAFHG7DABJILKM GPORQQSOUT3V N W >BYXZ*[CK\@7]_^\`aKbF!^\K/c@C>ZdX eDf@hgiDf@kjmlnF!`ogpKb@CIh`q[UM W DABsr!@k`ajdtKAu!vwDAIhIkDi^Rx_Z!ILK[h[SI ML Module Mania: A Type-Safe, Separately Compiled, Extensible Interpreter Norman Ramsey Division of Engineering and Applied Sciences Harvard University Abstract An application that uses an embedded interpreter is writ- ten in two languages: Most code is written in the original, The new embedded interpreter Lua-ML combines extensi- host language (e.g., C, C++, or ML), but key parts can be bility and separate compilation without compromising type written in the embedded language. This organization has safety. The interpreter’s types are extended by applying a several benefits: sum constructor to built-in types and to extensions, then • Complex command-line arguments aren’t needed; the tying a recursive knot using a two-level type; the sum con- embedded language can be used on the command line. structor is written using an ML functor. The interpreter’s initial basis is extended by composing initialization func- • A configuration file can be replaced by a program in the tions from individual extensions, also using ML functors. embedded language. Lua-ML adds a detailed example to the debate over how • It is easy to write an interactive loop that uses the em- much power is needed in a modules language. bedded language to control the application. The benefits above were demonstrated by Tcl (Ouster- 1 Introduction hout 1990), which was followed by embedded implemen- tations of other languages, including Python, Perl, and This paper solves a problem in the construction of inter- several forms of Scheme (Benson 1994; Laumann and Bor- preters: how to combine extensibility with separate com- mann 1994; Jenness and Cozens 2002; van Rossum 2002), as pilation in a safe language. The solution makes essential well as by another language designed expressly for embed- use of ML modules. It uses functors (parameterized mod- ding: Lua (Ierusalimschy, de Figueiredo, and Celes 1996a). ules) not only in ways that are supported in all dialects Why have two languages? Can’t the host language be of ML, but also in ways that require functors to be first- used for scripting? Sometimes it can; for example, Leijen class, as they are in the modules language of Objective and Meijer (2000) show how to encode embedded-language Caml (Leroy 2000). Powerful modules languages are ob- terms as host-language terms, even to the extent of us- jects of ongoing discussion in the functional-programming ing the host-language type checker to enforce a static type community. For example, Peyton Jones (2003b) recently discipline on the embedded language. But this technique likened the ML modules system to a Porsche, calling it requires that the host-language compiler translate phrases “powerful, but breathtakingly expensive.” To what degree in the embedded language, and in many cases it is inconve- such power is needed is a matter of debate. This paper uses nient or even impossible to make a host-language compiler the power to solve an open programming problem, adding available at run time. Aside from such considerations of im- an example to that debate. plementation, some people would argue that two languages We focus on a kind of interpreter for which extensibil- are needed because a language that is well suited for build- ity and separate compilation are especially important: the ing large applications is inherently ill suited for scripting embedded interpreter. Embedding is motivated by the prob- (Ousterhout 1998). We don’t pursue that argument; we lem of controlling a complex application, like a web server simply take for granted that there are two languages. or an optimizing compiler, that is written in a statically What are the requirements for a scripting language and typed, compiled language such as C, C++, or ML. Such its interpreter? an application will have lots of potential configurations and 1. They must be extensible: The whole point is to add behaviors. How are you to control it? If you use command- application-specific code and data to the scripting lan- line arguments, you may find yourself writing an interpreter guage, which we call extension. for an increasingly complicated language of command-line arguments. If you use a configuration file, you will find 2. The interpreter should be compiled separately from the yourself defining, parsing, and interpreting its syntax. application. In particular, it should be possible to com- A better idea, which we owe to Ousterhout (1990), is to pile an application-specific extension without using or create a reusable language designed just for configuring and changing the interpreter’s source code. In other words, controlling application programs, i.e., for scripting. Mak- the interpreter should be isolated in a library. ing a scripting language reusable means making it easy to 3. The combination of application and scripting language include its interpreter in an application, so the application should be type-safe, and this safety should be checked can invoke the interpreter and the interpreter can call code by the host-language compiler. in the application. Such an interpreter is called embedded. 1 This paper presents Lua-ML, which to my knowledge is • A pointer to the application-specific data can be asso- the first embedded interpreter to meet all three of these ciated with a Tcl command, which gets access through requirements. Lua-ML’s API makes it possible to embed its ClientData parameter. Because the association is a Lua interpreter into an application written in Objective maintained by the Tcl implementation, it is relatively Caml. Lua-ML uses Objective Caml’s modules language to easy to get right, but the implementation of the com- compose the Lua-ML interpreter with its extensions. mand still requires an unsafe cast to project a value of At present, the primary application of Lua-ML is to script type ClientData to a value of application-specific type. and control a nascent optimizing compiler for the portable • The application-specific data can be kept in a hash table assembly language C-- (Ramsey and Peyton Jones 2000). or other data structure. One or more commands may The compiler, which is roughly 25,000 lines of Objective keep a reference to the hash table and may interpret any Caml, uses about 1,000 lines of Lua to configure back ends string as a key in that table. This technique can be made and to call front ends, assemblers, linkers, and so on. type-safe, but it is more difficult to get right, because a Tcl value does not carry any sort of run-time type tag. 2 Related work It is up to the programmer of each command to make sure that strings are interpreted correctly. Prior work using C to implement embedded languages has produced interpreters that are extensible and separately Lua Like Tcl, Lua is a language that is designed ex- compiled but not type-safe. Prior work using functional pressly for embedding (Ierusalimschy, de Figueiredo, and languages to implement modular interpreters has produced Celes 1996a; Ierusalimschy, de Figueiredo, and Celes 2001). interpreters that are extensible and type-safe but not sepa- The Lua language, its library, and its API for embedding rately compiled. Lua-ML draws on both bodies of work, so have all undergone considerable evolution. Lua-ML im- we begin with a review. plements the Lua language version 2.5, which is described In designs that are not type-safe, type safety is lost by Ierusalimschy, de Figueiredo, and Celes (1996b). Ver- when values are converted between their host-language and sion 2.5 is relatively old, but it is mature and efficient, and embedded-language forms. We therefore pay special atten- it omits some complexities of later versions. The most re- tion to the two forms of conversion: converting from host cent version as of this writing is Lua 5.0, which was released value to embedded value, which is called embedding, and in Spring of 2003; I mention differences where appropriate. converting from embedded value to host value, which is Lua is a dynamically typed language with six types: nil, called projection. string, number, function, table, and userdata. Nil is a sin- gleton type containing only the value nil. A table is a muta- ble hash table in which any value except nil may be used as 2.1 Embedded languages and interpreters a key. Userdata is a catchall type, which enables an appli- Tcl The first language designed to be embedded in many cation program to add new types to the interpreter. Except applications was Tcl (Ousterhout 1990; Ousterhout 1994). for table, the built-in types are immutable; userdata is mu- Tcl is unusual in having no notion of type, not even a dy- table at the application’s discretion. Lua 5.0 adds two new namic one: Every value is a string. A string may represent types: Boolean and thread.