Yhc.Core – from Haskell to Core

by Dimitry Golubovsky, Neil Mitchell and Matthew Naylor

The Yhc compiler is a hotbed of new and interesting ideas. We present Yhc.Core – one of the most popular libraries from Yhc. We describe what we think makes Yhc.Core special, and how people have used it in various projects, including an evaluator and a Javascript code generator.

What is Yhc Core?

The York Haskell Compiler (Yhc) [1] is a fork of the nhc98 compiler [2], started by Tom Shackell. The initial goals included increased portability, platform independence, integrated Hat [3] support and generally being a cleaner code base to work with. Yhc has been going for a number of years, and now compiles and runs almost all Haskell 98 programs and has basic FFI support – the main thing missing is the Haskell base library.

Yhc.Core is one of our most successful libraries to date. The original nhc compiler used an intermediate core language called PosLambda – a basic lambda calculus extended with positional information. The language was neither a subset nor a superset of Haskell. In particular there were unusual constructs and all names were stored in a symbol table. There was also no defined external representation.

When one of the authors required a core Haskell language, after evaluating GHC Core [4], it was decided that PosLambda was closest to what was desired but required substantial clean-up. Rather than attempt to change the PosLambda language, a task that would have been decidedly painful, we chose instead to write a Core language from scratch. When designing our Core language, we took ideas from both PosLambda and GHC Core, aiming for something as simple as possible. Due to the similarities to PosLambda we have written a translator from our Core language to PosLambda, which is part of the Yhc compiler.

Our idealised Core language differs from GHC Core in a number of ways:

1 The Monad.Reader

• Untyped – originally this was a restriction of PosLambda, but now we see this as a feature, although not everyone agrees.
• Syntactically a subset of Haskell.
• Minimal name mangling.

All these features combine to create a Core language which resembles Haskell much more than Core languages in other Haskell compilers. As a result, most Haskell programmers can feel at home with relatively little effort.

Because the Core language is much simpler, it takes less effort to learn, and the number of projects depending on it has grown rapidly. We have tried to add facilities to the libraries for common tasks, rather than duplicating them separately in projects. As a result the Core library now has facilities for dealing with primitives, removing recursive lets, reachability analysis, strictness analysis, simplification, inlining and more.

One of the first features we added to Core was whole program linking – any Haskell program, regardless of the number of modules, can be collapsed into one single Yhc.Core module. While this breaks separate compilation, it simplifies many types of analysis and transformation. If such an analysis turns out to be successful then breaking the dependence on whole program compilation is a worthy goal – but this approach allows developers to pay that cost only when it is needed.
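Whole program linking makes an analysis such as reachability easy to state, because every call target is visible in one module. The following is a minimal sketch of the idea only. It assumes a hypothetical call-graph representation (function names paired with the names they mention) and does not use the real Yhc.Core types or functions; the names CallGraph, reachable and prune are made up for illustration.

```haskell
import Data.Maybe (fromMaybe)

-- Hypothetical stand-in for a linked Core program: each function name is
-- paired with the names of the functions its body mentions.
type CallGraph = [(String, [String])]

-- Collect every function reachable from the root, in depth-first order.
reachable :: CallGraph -> String -> [String]
reachable g root = go [] [root]
  where
    go seen [] = reverse seen
    go seen (f : fs)
        | f `elem` seen = go seen fs
        | otherwise     = go (f : seen) (fromMaybe [] (lookup f g) ++ fs)

-- Drop definitions that can never be called from the root.
prune :: CallGraph -> String -> CallGraph
prune g root = [ d | d@(f, _) <- g, f `elem` live ]
  where
    live = reachable g root

-- A small made-up program with one dead function.
sampleGraph :: CallGraph
sampleGraph =
    [ ("Main.main",   ["Main.render"])
    , ("Main.render", ["Main.helper", "Main.render"])
    , ("Main.helper", [])
    , ("Main.unused", ["Main.helper"])
    ]
```

Here prune sampleGraph "Main.main" keeps the first three definitions and discards Main.unused. With separate compilation the same question could not be answered without knowing every importing module.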

A Small Example

To give a flavour of what Core looks like, it is easiest to start with a small program:

    head2 (x:xs) = x

    map2 f [] = []
    map2 f (x:xs) = f x : map2 f xs

    test x = map2 head2 x

Compiling with yhc -showcore Sample.hs generates:

    Sample.head2 v220 =
        case v220 of
            (:) v221 v222 -> v221
            _ -> Prelude.error Sample._LAMBDA228

    Sample._LAMBDA228 =
        "Sample: Pattern match failure in function at 9:1-9:15."

    Sample.map2 v223 v224 =
        case v224 of
            [] -> []
            (:) v225 v226 -> (:) (v223 v225) (Sample.map2 v223 v226)

    Sample.test v227 = Sample.map2 Sample.head2 v227

The generated Core can be treated as a subset of Haskell, with many restrictions:

• Case statements only examine their outermost constructor.
• No type classes.
• No where statements.
• Only top-level functions.
• All names are fully qualified.
• All constructors and primitives are fully applied.

Yhc.Core.Overlay

We provide many library functions to operate on Core, but one of our most unusual features is the overlay concept. Overlays specify modifications to be made to a piece of code – which functions should be replaced, which ones inserted, which data structures modified. By combining a Core file with an overlay, modifications can be made after translation from Haskell to Core. This idea originated in the Mozilla project [5], and is used successfully to enable extensions in Firefox, and elsewhere throughout their platform.

To take a simple example, in Haskell there are two common definitions for reverse:

    reverse = foldl (flip (:)) []

    reverse [] = []
    reverse (x:xs) = reverse xs ++ [x]


The first definition uses an accumulator, and takes O(n) time. The second definition requires O(n^2) time, as each element in turn is appended to the end of the reversed remainder of the list. Clearly a Haskell compiler should pick the first variant. However, a program analysis tool may wish to use the second variant as it may present fewer analysis challenges. The overlay mechanism allows this to be done easily. The first step is to write an overlay file:

    global_Prelude'_reverse [] = []
    global_Prelude'_reverse (x:xs) = global_Prelude'_reverse xs ++ [x]

This Overlay file contains a list of functions whose definitions we would like to replace. Any function that previously called Prelude.reverse will now invoke this new copy. For a program to insert an overlay, both Haskell files need to be compiled to Core, then the overlay function is called.

But we need not stop at simply replacing the reverse function. Yhc defines an IO type as a function over the World type, but for some applications this may not be appropriate. We can redefine IO as:

    data IO a = IO a

    global_Monad'_IO'_return a = IO a
    global_Monad'_IO'_'gt'gt (IO a) b = b
    global_Monad'_IO'_'gt'gt'eq (IO a) f = f a
    global_YHC'_Internal'_unsafePerformIO (IO a) = a

The Overlay mechanism supports escape characters – 'gt is the > character – allowing us to replace the bind and return methods.

We have found that with Overlays a compiler can be customized for many different tasks, without causing conflicts. With one code base, we can allow different programs to modify the libraries to suit their needs. Taking the example of Int addition, there are at least three different implementations in use: Javascript native numbers, binary arithmetic on a Haskell data type and abstract interpretation.
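The naming convention and the replacement step can be sketched in a few lines. This is an illustration only, not the Yhc.Core.Overlay implementation. The escape table is inferred from the examples in this article ('_ for the dot in a qualified name, 'gt for >, 'eq for =) and the real table is probably larger. The Program type, decodeOverlayName and applyOverlay are hypothetical names, and function bodies are kept as plain strings for brevity.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)

-- Decode an overlay name into the Core name it stands for.
-- Assumption: only the three escapes seen in the article are handled.
decodeOverlayName :: String -> String
decodeOverlayName s = go (fromMaybe s (stripPrefix "global_" s))
  where
    go ('\'' : '_' : rest)       = '.' : go rest
    go ('\'' : 'g' : 't' : rest) = '>' : go rest
    go ('\'' : 'e' : 'q' : rest) = '=' : go rest
    go (c : rest)                = c : go rest
    go []                        = []

-- Hypothetical program representation: function names paired with bodies.
type Program = [(String, String)]

-- Replace the definitions that the overlay redefines, keeping their position,
-- and append any overlay definitions that are new to the program.
applyOverlay :: Program -> Program -> Program
applyOverlay overlay base = map replace base ++ added
  where
    decoded        = [ (decodeOverlayName n, b) | (n, b) <- overlay ]
    replace (n, b) = (n, fromMaybe b (lookup n decoded))
    added          = [ d | d@(n, _) <- decoded, n `notElem` map fst base ]
```

Under these assumptions, decodeOverlayName "global_Prelude'_reverse" gives "Prelude.reverse", so an overlay entry with that name replaces the library definition while every caller stays untouched.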

Semantics of Yhc Core

In this section an evaluator for Yhc Core programs is presented in the form of a literate Haskell program. The aim is to define the informal semantics of Core programs while demonstrating a full, albeit simple, application of the Yhc.Core library.

    module Main where

    import Yhc.Core
    import System
    import Monad

Our evaluator is based around the function whnf that takes a Core program (of type Core) along with a Core expression (of type CoreExpr) and reduces that expression until it has the form of:

• a data constructor with unevaluated arguments, or
• an unapplied lambda expression.

In general, data values in Haskell are tree-shaped. The function whnf is often said to “reduce an expression to head normal form” because it reveals the head (or root) of a value’s tree and no more. Strictly speaking, when the result of reduction could be a functional value (i.e. a lambda expression), and the body of that lambda is left unevaluated, then the result is said to be in “weak head normal form” – this explains the strange acronym. The type of whnf is:

    whnf :: Core -> CoreExpr -> CoreExpr

Defining it is a process of taking each kind of Core expression in turn, and asking “how do I reduce this to weak head normal form?” As usual, it makes sense to define the base cases first, namely constructors and lambda expressions:

    whnf p (CoreCon c) = CoreCon c
    whnf p (CoreApp (CoreCon c) as) = CoreApp (CoreCon c) as
    whnf p (CoreLam (v:vs) e) = CoreLam (v:vs) e

Notice that a constructor may take one of two forms: stand-alone with no arguments, or as a function application to a list of arguments. Also, because of the way our evaluator is designed, we may encounter lambda expressions with no arguments. Hence, only lambdas with arguments represent a base case. For the no-arguments case, we just shift the focus of reduction to the body:

    whnf p (CoreLam [] e) = whnf p e


Currently, lambda expressions do not occur in the Core output of Yhc. They are part of the Core syntax because they are useful conceptually, particularly when manipulating (and evaluating) higher-order functions.

Moving on to case-expressions, we first reduce the case subject, then match it against each pattern in turn, and finally reduce the body of the chosen alternative. In Core, we can safely assume that patterns are at most one constructor deep, so reduction of the subject to WHNF is sufficient.

    whnf p (CoreCase e as) = whnf p (match (whnf p e) as)

We defer the definition of match for the moment. To reduce a let-expression, we substitute the let-bindings in the body of the let. This is easily done using the Core function replaceFreeVars. As in Haskell, let-expressions in Core are recursive, but before evaluating a Core program we transform them all to non-recursive lets (see below). Notice that we are in no way trying to preserve the sharing implied by let-expressions, although we have done so in more complex variants of the evaluator. Strictly speaking, Haskell evaluators are not obliged to implement sharing – this is why it is more correct to term Haskell non-strict than lazy.

    whnf p (CoreLet bs e) = whnf p (replaceFreeVars bs e)

When we encounter an unapplied function we call coreFunc to look up its definition (i.e. its arguments and its right-hand side), and construct a corresponding lambda expression:

    whnf p (CoreFun f) = whnf p (CoreLam bs body)
        where CoreFunc _ bs body = coreFunc p f

This means that when reducing function applications, we know that reduction of the function part will yield a lambda:

    whnf p (CoreApp f []) = whnf p f
    whnf p (CoreApp f (a:as)) = whnf p (CoreLet [(b,a)] (CoreApp (CoreLam bs e) as))
        where CoreLam (b:bs) e = whnf p f

Core programs may contain information about where definitions originally occurred in the Haskell source. We just ignore these:

    whnf p (CorePos _ e) = whnf p e


And the final, fall-through case covers primitive literals and functions, which we are not concerned with here:

    whnf p e = e

Now, for the sake of completeness, we return to our match function. It takes the evaluated case subject and tries to match it against each case-alternative (a pattern-expression pair) in order of appearance. We use the “failure by a list of successes” technique [6] to model the fact that matching may fail.

    type Alt = (CoreExpr, CoreExpr)

    match :: CoreExpr -> [Alt] -> CoreExpr
    match e as = head (concatMap (try e) as)

Before defining try, it is useful to have a function that turns the two possible constructor forms into a single normal form. This greatly reduces the number of cases we need to consider in the definition of try.

    norm :: CoreExpr -> CoreExpr
    norm (CoreCon c) = CoreApp (CoreCon c) []
    norm x = x

Hopefully, by now the definition of try will be self-explanatory:

    try :: CoreExpr -> Alt -> [CoreExpr]
    try e (pat, rhs) =
        case (norm pat, norm e) of
            (CoreApp (CoreCon f) as, CoreApp (CoreCon g) bs)
                | f == g -> [CoreLet (zip (vars as) bs) rhs]
            (CoreVar v, e) -> [CoreLet [(v, e)] rhs]
            _ -> []
        where vars = map fromCoreVar

This completes the definition of whnf. However, we would like to be able to fully evaluate expressions – to what we simply call “normal form” – so that the resulting value’s tree is computed in its entirety. Our nf function repeatedly applies whnf at progressively deeper nodes in the growing tree:

    nf :: Core -> CoreExpr -> CoreExpr
    nf p e =
        case whnf p e of
            CoreCon c -> CoreCon c
            CoreApp (CoreCon c) es -> CoreApp (CoreCon c) (map (nf p) es)
            e -> e

All that remains is to turn our evaluator into a program by giving it a sensible main function. We first load the Core file using loadCore and then apply removeRecursiveLet, as discussed earlier, before evaluating the expression CoreFun "main" to normal form and printing it.

    main :: IO ()
    main = liftM head getArgs
       >>= liftM removeRecursiveLet . loadCore
       >>= print . flip nf (CoreFun "main")

In future we hope to use a variant of this evaluator (with sharing) in a property-based testing framework. This will let us check that various program analyses and transformations that we have developed are semantics-preserving. As part of another project, we have successfully extended the evaluator to support various functional-logic evaluation strategies.
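For readers without a Yhc installation, the same evaluation strategy can be tried on a cut-down expression type. The sketch below is not the Yhc.Core API: the Expr type, the Program type and the subst helper are simplified stand-ins invented for illustration, let-expressions and source positions are left out, and substitution assumes that every substituted expression is closed (which holds when evaluation starts from a closed term such as main). As in the evaluator above, no sharing is preserved.

```haskell
import Data.Maybe (fromMaybe)

-- Illustrative expression type; the constructor names are not Yhc.Core's.
data Expr
    = Var String
    | Con String
    | Fun String
    | App Expr [Expr]
    | Lam [String] Expr
    | Case Expr [(Expr, Expr)]
    deriving (Eq, Show)

-- Top-level functions: name, parameters and body.
type Program = [(String, ([String], Expr))]

-- Substitute closed expressions for free variables, respecting shadowing.
subst :: [(String, Expr)] -> Expr -> Expr
subst bs e = case e of
    Var v     -> fromMaybe e (lookup v bs)
    App f as  -> App (subst bs f) (map (subst bs) as)
    Lam vs b  -> Lam vs (subst (hide vs) b)
    Case s as -> Case (subst bs s)
                      [ (p, subst (hide (patVars p)) r) | (p, r) <- as ]
    _         -> e
  where
    hide vs = [ b | b@(v, _) <- bs, v `notElem` vs ]

-- Variables bound by a pattern.
patVars :: Expr -> [String]
patVars (Var v)    = [v]
patVars (App _ ps) = concatMap patVars ps
patVars _          = []

-- Give both constructor forms a single shape.
norm :: Expr -> Expr
norm (Con c) = App (Con c) []
norm x       = x

-- Reduce to weak head normal form.
whnf :: Program -> Expr -> Expr
whnf p e = case e of
    Lam [] b       -> whnf p b
    Fun f          -> case lookup f p of
                          Just (vs, b) -> whnf p (Lam vs b)
                          Nothing      -> error ("undefined function: " ++ f)
    App (Con _) _  -> e
    App f []       -> whnf p f
    App f (a : as) -> case whnf p f of
                          Lam (v : vs) b -> whnf p (App (Lam vs (subst [(v, a)] b)) as)
                          _              -> error "applied a non-function"
    Case s as      -> whnf p (match (whnf p s) as)
    _              -> e

-- List of successes: each alternative yields zero or one results.
match :: Expr -> [(Expr, Expr)] -> Expr
match s as = case concatMap (try s) as of
    r : _ -> r
    []    -> error "pattern match failure"

try :: Expr -> (Expr, Expr) -> [Expr]
try s (pat, rhs) = case (norm pat, norm s) of
    (App (Con f) ps, App (Con g) xs)
        | f == g -> [subst (zip (concatMap patVars ps) xs) rhs]
    (Var v, _)   -> [subst [(v, s)] rhs]
    _            -> []

-- Reduce to normal form by evaluating under constructors.
nf :: Program -> Expr -> Expr
nf p e = case whnf p e of
    App (Con c) xs -> App (Con c) (map (nf p) xs)
    v              -> v

-- map2 from the small example, applied to a one-element list.
sample :: Program
sample =
    [ ("map2", (["f", "xs"],
        Case (Var "xs")
            [ (Con "[]", Con "[]")
            , (App (Con ":") [Var "y", Var "ys"],
               App (Con ":") [ App (Var "f") [Var "y"]
                             , App (Fun "map2") [Var "f", Var "ys"] ])
            ]))
    , ("wrap", (["x"], App (Con "Just") [Var "x"]))
    , ("main", ([], App (Fun "map2")
                        [Fun "wrap", App (Con ":") [Con "A", Con "[]"]]))
    ]
```

Evaluating nf sample (Fun "main") rebuilds the list with each element wrapped in Just. Evaluating whnf on the same expression stops at the outermost (:) and leaves both of its arguments as unevaluated applications.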

Javascript backend

The Javascript backend is a unique feature of Yhc. The idea to write a converter from Haskell to Javascript, enabling the execution of Haskell programs in a web browser, has been floating around for some time [7, 8, 9]. Many people expressed interest in such a feature, but no practical implementation has emerged. The initial goals of this subproject were:

• To develop a program that converts Yhc Core to Javascript, thus making it possible to execute arbitrary Haskell code within a web browser.
• To develop an unsafe interface layer for quick access to Javascript objects, with the ability to wrap arbitrary Javascript code into a Haskell-callable function.
• To develop a typesafe interface layer on top of the unsafe interface layer for access to the Document Object Model (DOM) available to Javascript executed in a web browser.
• To develop, or adopt, a GUI library or toolkit working on top of the typesafe DOM layer for actual development of client-side Web applications.


General concepts

The Javascript backend converts a linked and optimized Yhc Core file into a piece of Javascript code to be embedded in an XHTML document. The Javascript code generator tries to translate Core expressions to Javascript expressions one-to-one with minor optimizations of its own, taking advantage of the Javascript capability to pass functions around as values. Three kinds of functions are present in the Javascript backend:

• Unsafe functions that embed pieces of Javascript directly into the generated code: these functions pay no respect to the types of the arguments passed, and may force evaluation of their arguments if needed.
• Typesafe wrappers that provide type signatures for unsafe functions. Such wrappers are either handwritten, or automatically generated from external interface specifications (such as the DOM interface).
• Regular library functions. These either come unmodified from the standard Yhc packages, or are substituted by the Javascript backend using the Core overlay technique. An example of such a function is the toUpper function, which is hooked up to the Javascript implementation supporting Unicode (the original library function currently works correctly only for the Latin1 range of characters).

Unsafe interfaces

The core part of the unsafe interface to Javascript (or, in other words, the Javascript FFI) is a pseudo-function unsafeJS. The function has the type signature:

    foreign import primitive unsafeJS :: String -> a

The input is a String, but the type of the return value does not matter: the function itself is never executed. Its applications are detected by the Yhc Core to Javascript conversion program and dealt with at the time of Javascript generation.

The unsafeJS function should be called with a string literal. Both explicitly coded (with (:)) lists of characters and the concatenation of two or more strings will cause the converter to report an error.

A valid example of using unsafeJS is shown below:

    global_YHC'_Primitive'_primIntSignum :: Int -> Int
    global_YHC'_Primitive'_primIntSignum a = unsafeJS
        "var ea = exprEval(a); if (ea>0) return 1; else if (ea<0) return -1; else return 0;"


This is a Javascript overlay (in the sense that it overlays the default Prelude definition of the signum function) of a function that returns the sign of an Int value. The string literal given to unsafeJS is the Javascript code to be wrapped. Below is the Javascript representation of this function found in the generated code.

    strIdx["F_hy"] = "YHC.Primitive.primIntSignum";
    ...
    var F_hy=new HSFun("F_hy", 1, function(a){
    var ea = exprEval(a); if (ea>0) return 1; else if (ea<0) return -1; else return 0;});
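The wrapping step can be pictured as plain string assembly. The helper below is hypothetical and is not taken from the converter. It only reproduces the shape of the generated definition shown above, and it assumes that the short name (such as F_hy) and the argument names have already been chosen elsewhere.

```haskell
import Data.List (intercalate)

-- Hypothetical helper: render one unsafeJS definition as an HSFun binding.
-- short: the generated short name; args: parameter names; body: the string
-- literal that was passed to unsafeJS.
wrapUnsafeJS :: String -> [String] -> String -> String
wrapUnsafeJS short args body =
    "var " ++ short ++ "=new HSFun(\"" ++ short ++ "\", "
        ++ show (length args)
        ++ ", function(" ++ intercalate "," args ++ "){ "
        ++ body ++ "});"
```

Because the body is pasted in verbatim, the converter needs the argument of unsafeJS to be a literal it can read at generation time, which is consistent with the restriction on computed strings described above.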

Typesafe wrappers

These functions add type safety on top of the unsafe interface to Javascript. Sometimes they are defined within the same module as the unsafe interfaces themselves, thus avoiding the exposure of unsafe interfaces to programmers. An example of a handwritten wrapper is a function to create a new JSRef: a mechanism similar to Haskell’s IORef, but specific to Javascript.

    data JSRef a

    newJSRef :: a -> CPS b (JSRef a)
    newJSRef a = toCPE (newJSRef' a)
    newJSRef' a = unsafeJS "return {_val:a};"

Technically, a JSRef is a Javascript object with a property named _val that holds a persistent reference to some value. On the unsafe side, invoking a constructor for such an object would be sufficient. It is, however, desired that:

• calls to functions creating such persistent references are properly sequenced with calls to functions using these references, and
• the types of the values referred to are known to the Haskell compiler.

The unsafe part is implemented by the function newJSRef' which merely calls unsafeJS with a proper Javascript constructor. The wrapper part newJSRef wraps the unsafe function into a CPS-style function, and is given a proper type signature, so more errors can be caught at compile time.

In some cases, such typesafe wrappers may be generated automatically, using external interface specifications provided by third parties for their APIs. The W3C DOM interface is one such API. For instance, this piece of OMG IDL:

    interface Text : CharacterData {
        Text splitText(in unsigned long offset)
            raises(DOMException);
    };

is converted into:

    data TText = TText
    ...
    instance CText TText
    instance CCharacterData TText
    instance CNode TText
    ...
    splitText :: (CText this, CText zz) => this -> Int -> CPS c zz
    splitText a b = toCPE (splitText' a b)
    splitText' a b = unsafeJS "return((exprEval(a)).splitText(exprEval(b)));"

These instances and signatures give the Haskell compiler better control over this function’s (initially type-agnostic) arguments.

Usage of Continuation Passing Style

Initially we attempted to build a monadic framework. The JS monad was designed to play the same role as the IO monad plays in “regular” Haskell programming. There were, however, arguments in favor of using Continuation Passing Style (CPS) [10]:

• CPS involves less overhead, as each expression passes its continuation itself, instead of going through bind, which takes the expression and invokes the continuation.
• CPS results in Javascript patterns that are easy to detect and optimize, although this is not implemented yet.
• The Fudgets [11] GUI library internals are written in CPS, so taking CPS as the general approach to programming is believed to make adoption of Fudgets easier.
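The article uses the names CPS and toCPE without showing their definitions. The sketch below assumes the usual continuation-passing encoding for both, so it should be read as a plausible reconstruction rather than the backend's actual code. The functions getLength, double and pipeline are made-up, pure stand-ins for DOM-style operations.

```haskell
-- Assumed definition: a computation that hands its result to a continuation.
type CPS c a = (a -> c) -> c

-- Assumed definition: lift a value into CPS, forcing it before continuing.
toCPE :: a -> CPS c a
toCPE x = \k -> x `seq` k x

-- Stand-ins for two wrapped operations (pure here so the sketch can run).
getLength :: String -> CPS c Int
getLength s = toCPE (length s)

double :: Int -> CPS c Int
double n = toCPE (n * 2)

-- Sequencing is ordinary function application: each step receives the rest
-- of the computation as its continuation, and no bind operator is involved.
pipeline :: String -> CPS c Int
pipeline s k = getLength s $ \n -> double n $ \m -> k m
```

Running pipeline "abc" id evaluates the length, doubles it, and passes the result to the final continuation. The nesting of lambdas is what fixes the order of the two steps.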

Integration with DOM

The Web Consortium [12] provides OMG IDL [13] files to describe the API to use with the Document Object Model (DOM) [14]. A utility was designed, based on HaskellDirect [15], to parse these files and convert them to a set of Haskell modules. The way interface inheritance is reflected differs from HaskellDirect: in HaskellDirect this was achieved by declaration of “nested” algebraic data types. The Javascript backend takes advantage of Haskell typeclasses – representing DOM types with phantom types, and declaring them instances of appropriate classes.
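A minimal sketch of this encoding follows. The class and type names mirror the generated code shown earlier (CNode, CCharacterData, CText, TText). TNode and the two describe functions are hypothetical additions for illustration, and the classes are left without methods for brevity.

```haskell
-- Interface inheritance expressed as a superclass chain.
class CNode a
class CNode a => CCharacterData a
class CCharacterData a => CText a

-- DOM types carry no data of their own; they only select instances.
data TNode = TNode
data TText = TText

instance CNode TNode
instance CNode TText
instance CCharacterData TText
instance CText TText

-- Accepts any node; a stand-in for a real DOM call.
describeNode :: CNode a => a -> String
describeNode _ = "node"

-- Accepts only character data, so passing a TNode is rejected at compile time.
describeData :: CCharacterData a => a -> String
describeData _ = "character data"
```

A TText value can be passed to both functions because it is an instance of every class in the chain, while describeData TNode fails to type-check. This is how an inherited method such as one declared on Node becomes available on Text without nested data types.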

Unicode support

Despite the fact that all modern Web browsers support Unicode, this is not the case with Javascript: no access to Unicode characters’ properties is provided. At the same time it is desirable for a Haskell application running in a browser to have access to such information. The approach used is the same as in Hugs [16] and GHC [17]: the Unicode character database file from the Unicode Consortium [18] was converted into a set of Javascript arrays, where each array entry represents a range of character code values, or a case conversion rule for a range. For this implementation, Unicode support is limited to the character category and simple case conversions. First, a range is found by looking up the character code; then the character category and case conversion distances, i.e. values to add to the character code to convert between upper and lower case, are retrieved from the range entry. The whole set of arrays adds about 70 kilobytes to the web page size, if embedded inside a