Modern Extensible Languages
Total Page:16
File Type:pdf, Size:1020Kb
Modern Extensible Languages Daniel Zingaro, McMaster University, Hamilton, Ontario, Canada April 11, 2007 Abstract Extensible languages are programming languages which allow a user to modify or add syntax, and associate the new syntactic forms with semantics. What are these languages good for? What kinds of features are easy to add, and which are not? Are they powerful enough to be taken seriously? In this survey we will attempt to answer such questions as we consider procedural, object-oriented, functional, and general-purpose extensible languages. We are primarily interested in expressive power (regular, context-free), associated caveats (unhy- gienic, ambiguity) and ease of use of the various mechanisms. 1 What is an Extensible Language? Before beginning, it is essential to have an operational definition of what, exactly, constitutes an extensible language; this will dictate the flow of the remainder of the paper. Standish [23] gives a rather broad definition, stating simply: an extensible language allows users to define new language features. These features may include new notation or operations, new or modified control structures, or even elements from different programming paradigms. To facilitate discussion, it is helpful to adopt Standish's nomenclature, and divide this spectrum of features into three classes: paraphrase, orthophrase and metaphrase. Paraphrase refers to adding new features by relying on features already present in the language; a prototypical example is defining macros, which of course ultimately expand into the programming language's standard syntax. Orthophase refers to adding features not expressible in terms of available language primitives; for example, adding an I/O system to a language that lacks one. Finally, metaphrase refers to changing the interpretation of existing language elements, so that expressions are parsed in new ways. In 1975, Standish painted a rather bleak picture of the success 1 of metaphrase techniques; we will see shortly that thirty years of experience have certainly improved the situation. There are several classes of programming languages (imperative, func- tional, object-oriented) for which we may attempt language extension. The remainder of the paper is thus organized as follows. We begin with a sim- ple macro definition facility for an imperative language (C), then move into more sophisticated mechanisms available in a (mostly) functional language (OCaml). We then look at two rather different extension ideas for an object- oriented language (Java), investigate the body of work related to hygienic macro expansion (Scheme), and conclude with two powerful general-purpose extension languages. Throughout, we will draw comparisons between the parsing mechanisms used and the expressiveness of the techniques. 2 Unhygienic Macros in C We shall begin where we inevitably must begin any survey of program- ming languages: with C. The C preprocessor includes a mechanism whereby macros can be defined; prior to compilation, these definitions are expanded so that the compiler can focus on C proper. (We are therefore looking at a para- phrastic mechanism.) Provisions are made for two types of macros: object- like, and function-like [16]. Object macros simply associate an identifier with a corresponding replacement (typically valid C code). The most common use for this basic text substitution is introducing mathematical constants such as π, or program-specific constants such as ARRAY SIZE. Function macros are similar to standard C functions in many ways. They can accept ar- guments, which can be used in their corresponding replacements, and can additionally be variadic (i.e. can accept a variable number of arguments). A common example of such a macro is the following, which computes the minimum of two numeric values (taken from [16]): #define min(X, Y) ((X) < (Y) ? (X) : (Y)) After this point in the source file, min is available as (what looks like) predefined C syntax, and can be invoked like any other C function. In passing, we should note the other main use of the C preprocessor: allowing inclusion of other C source files. This can be thought of as a feature in the orthophrase category because, for example, it can allow adding features (like strings and input/output) which do not belong to the base C language. 2 2.1 Discussion What we have just described is certainly not a very convincing argument for the claim that C is an extensible language. Most of the functionality allowed by the object-like macros could be directly incorporated in a typesafe way in a language supporting constant definitions. Function macros are expanded inline | an optimization that admits a certain speedup at the cost of increased code size. However, this is typically available as a compile- time option anyway, so these macros seem to offer little more than standard C functions. Even the syntax used when calling these \functions" still looks like C. Is this really extensibility? The utility of the preprocessor is further marred by its so-called un- hygienic nature. This means that innocuous-looking macros can unwit- tingly hide variables which should be acted upon, or { via subtle paren- thesization issues { change precedence rules as to get an undesired effect. Most of these problems stem from the fact that, in contrast to standard functions, direct substitution of actual arguments for formal arguments is made. As a concrete example, a macro to swap two arguments may look like #define swap(a, b) {int t; t = a; a = b; b = t;} If a variable t happens to occur in a lexical scope where this macro is expanded, it will be hidden by the new scope of the macro, and so a swap involving t will do nothing. Additionally, if a function invocation is used as an argument to this macro, the function will be executed twice, because it is expanded twice within the macro body [16]. Should the function have any side effects, the results will again be unexpected. Do seasoned C programmers really make these types of mistakes? Surely not, especially in light of well-known techniques for minimizing the afore- mentioned issues, such as parenthesizing the entire macro body and all of the individual variables therein [16]. Ernst et al. [13], however, analysed 26 C programs comprising 1.4 million lines of C code and found quite the oppo- site. Of all macro definitions, 23% of them included latent errors (of the sort we have described), waiting to strike when called with just the right identifier names, in just the right lexical environment. In summary, then, not only is this macro facility not much of an extension to C, it is a dangerous one at best. We will now consider far safer, more powerful alternatives. 3 Camlp4 Camlp4 is a preprocessor for the multi-paradigm language Ocaml; that is, it performs transformations on code, much like the C preprocessor described 3 above. Vastly different, however, is the method by which Camlp4 achieves its results. First, instead of acting at a textual (search and replace) level, it performs transformations on Ocaml abstract syntax trees (ASTs) [9]. Sec- ond, it allows for the grammar of the Ocaml language to be augmented with new rules, modified rules, or the deletion of rules. To effect a syntax transfor- mation, one first modifies the Ocaml grammar to include the desired syntax. Next, to transform this new syntax into Core Ocaml, transformations are specified in order to create a new AST from the old one. Creating ASTs node-by-node would be tedious [9], and so Camlp4 provides a mechanism called quotations to ease the task. Specifically, quotations allow one to de- scribe an AST by using Ocaml concrete syntax | since it is unambiguous, it can be directly translated to the corresponding AST. While standard Ocaml concrete syntax would suffice, the designers of Camlp4 took this opportu- nity to introduce a modified syntax for use inside quotations. The thinking was that they would clean up the rough parts of the grammar, making it easier to understand, parse and pretty-print [9]. (As with any language that tries to remain backward-compatible for many years, Ocaml's syntax is bogged down with legacy constructs, unnecessary to preserve for use in quotations.) As a case in point, standard Ocaml syntax allows the program- mer to declare what is effectively a global variable: let magicnum = 42. It also allows let-bindings to be introduced with the same let keyword: let magicnum = 42 in magicnum 5+. These two constructions result in different types of AST nodes; the one to create is not known until the word in is (or is not) seen. Using value for the first case, and continuing to use let for the second, dispenses with this parsing difficulty. Other changes include using C-like curly braces in imperative constructions, eliminating duplicate constructs for referring to boolean operations and assignments, and requiring parentheses around tuples [9]. With these pieces in place, we can now present a concrete example, taken from [9], of extending Ocaml's syntax with a C-style for-loop. As specified above, we first extend the Ocaml base grammar; the following achieves this: extend expr: LEVEL "expr1" [ [ "for"; v = LIDENT; iv = expr LEVEL "simple"; wh = expr LEVEL "simple"; nx = expr LEVEL "simple"; "do"; e = expr; "done" -> gen_for loc v iv wh nx e ] ] Notice that the new for-loop extends the expr nonterminal; that is, it is an expression, just like while, if, and the original (restricted) for. The LEVEL annotations are used by Camlp4 to disambiguate concrete syntax, and is used for imposing precedence and associativity [9]. The keywords 4 in quotation marks, such as for include their text exactly in the concrete syntax; the other parts of the definition associate symbolic names with pieces of the syntax tree, such as expressions or identifiers. The following, then, is a valid for-loop according to this new syntax: # for c 0 (c<5) (c+1) do print_int c; done The final step is to translate from the new syntax back into core Ocaml to be compiled; it should be clear that gen_for should carry out this task.