COMP3161/COMP9161 Supplementary Lecture Notes Data Types and Exceptions

Liam O’Connor-Davis May 13, 2013

A is a type of static semantics used for verifying programs and improving the reliability of software. It is, essentially, a means of annotating expressions and values in a program with a tag, called a type, which tells us something about the of runtime data the expression can represent. Programming languages are often categorised by the sort of type system they have. Popular statically typed languages (i.e languages which perform type checking as part of static semantics) include Java, C# and C, whereas dynamically typed languages include Ruby, Perl and Python. These languages have essentially uninteresting static semantics, so we concern ourselves primarily with static type systems in this course. In addition to the static-dynamic distinction, languages can be categorised by how type-safe or strongly typed they are. These terms are often used to mean various things by different authors and in different books, so we will define them formally in a moment. Nevertheless it is commonly accepted that C is an unsafe language, and that Java or Haskell have substantially more type-safety.

1 Type Safety

Type Safety. A language with small-step states Q, final states F ⊆ Q, state transition relation 7→, and typing rules is type safe if it has two properties: • Progress - If a program can be typed, it is either a final state or can progress to another state. That is, if ` e : τ then e ∈ F or ∃e0. e 7→ e0. • Preservation - If a program has type, evaluation will not change that type. That is, if ` e : τ and e 7→ e0 then ` e0 : τ.

It can be seen from the above definition that well typed programs will not reach a stuck state. If the program is a final state, then it is by definition not stuck. If not, we know from the progress property that the program must move to a new state. We know from preservation that this new state is also typed, which means (from progress) that it must either be a final state or progress to a new state. Similar reasoning applies until the program terminates (or loops). It therefore follows that languages such as C, which are unsafe, could reach a stuck state. In such a situation, the program doesn’t simply halt (or at least, it’s not obliged to). What happens is left undefined. For example, there is no telling what this C program will do without referring to platform or compiler documentation:

int main() { return *((int*)(0x0)); }

Clearly, speaking of type safety is only applicable in the context of formal treatment of programming lan- guages. Determining exactly what guarantees a type system gives you requires these techniques. In general, the more expressive the type system is, the more information can be inferred by the compiler. Therefore, for practical reasons we typically want type checking to be decidable. If it was not, our compiler may not terminate. There are languages such as C++, however, where type checking may not terminate. The remainder of this document will focus on three extensions to the MinHS semantics. One dynamic extension, to ensure progress in the face of partial functions, and two static types extensions, which make MinHS type system more expressive.

1 2 Dealing with Partiality

Suppose we have a partial operation, such as division, typed as follows:

Γ ` t1 : Int Γ ` t2 : Int Γ ` div t1, t2 : Int We’ve assigned it a type, but it does not necessarily guarantee progress. The expression div x, 0 for any x will not return a value. There are two solutions:

• Change the static semantics - That is, disallow divisions by zero statically. There are techniques to approximate this for turing-complete languages, however in general this is undecidable. For those that are interested, the proof is roughly as follows: If we had the ability to decide statically if we are dividing by zero in all cases, we could write a program that solves the halting problem. Given a program X as input, which may or may not terminate, produce a program X0 which just runs X and then returns 0. Then insert X0 into the divisor position of some division, and statically check if this division would divide by zero. If yes, the program terminates, otherwise it does not. Seeing as the halting problem is known undecidable, we therefore know that our assumption (that we can decide if we are dividing by zero statically) is false. Hence determining if we can divide by zero statically is undecidable.

• Change the dynamic semantics - This approach is used by most languages.

Seeing as MinHS is turing complete, we are unable to statically analyse if the program divides by zero. Hence, we shall extend the dynamic semantics of the language to handle the situation at runtime. The simplest fix is to make partial functions yield some new state error ∈ F for undefined cases:

Div v (Num 0) 7→ error

Furthermore, we would define error to interrupt any nested computation and produce error.

Plus error e 7→ error

Plus e error 7→ error

If error e1 e2 7→ error There are, of course, a very large number of additional error propagation rules. Here, our abstract machines actually buy us some brevity. We simply state that partial functions result in error, and completely annihilate the stack (e.g in the C Machine):

Div v . s ≺ 0 7→ error This guarantees progress - partial functions will evaluate to error where they are not defined, meaning that the evaluation will not hit a stuck state. We have yet to ensure preservation, however. Preservation says that type is preserved across evaluation. Seeing as any partial function application (of any type) could evaluate to error, the only way to make error respect preservation is to make it a member of every type:

Γ ` error : τ 2.1 The Bottom Type It could be considered that function letfun f x = error should similarly be of any α → τ - and indeed, languages which support this sort of polymorphism would type it as such. What this type signature is actually saying is that the function would return a value of any type τ. If we examine the dynamic semantics, however, this is not accurate. The function never actually returns at all. It perhaps makes more sense to say that the function’s return type is a type which contains no values, a “bottom” type, written ⊥. While this would break preservation, it is worth investigating as the bottom type has many interesting mathematical properties, discussed later.

Letfun f.x.error : α → ⊥

2 If we admitted bottom types, however, the same reasoning would lead us to an undecidability problem. Another source of the ⊥ type is non-terminating functions, as they also do not return anything:

Letfun f.x.(Apply f x): α → ⊥ Therefore, if we use this reasoning, checking or inferring that an expression has type ⊥ would require checking if the function terminates, which is known to be undecidable.

2.2 Exceptions Adding a error state seems well and good for ensuring type safety, but many real-world languages have more robust, fine-grained error handling techniques, namely exceptions. Exceptions are a means for a function to exit without returning. Instead, the function may raise an exception, which is caught by an exception handler somewhere further up the runtime stack. Most of you would have seen exceptions from languages such as Java, Python, or C++. We will extend MinHS to include exceptions by adding two pieces of abstract syntax: Try e1 x.e2 will evaluate e1, and if Raise τ v is ever encountered while evaluating e1, it will stop evaluating e1, and start evaluating e2 where x is bound to the value of v. These Try expressions can of course be nested, and exceptions can be re-Raised within an exception handler.

Exception values (Such as v in the above example), are made to be of a fixed type, τexc. It is not relevant what type this is, it could be a special Exception type, it could be an Int error code, or just a String message. We type these new expressions as follows. Try expressions take the type of their subexpressions, and raise expressions are of any type specified in the expression (for a similar reason to the typing of error):

Γ ` e : τexc Γ ` e1 : τ Γ ∪ {x : τexc} ` e2 : τ Γ ` Raise τ e : τ Γ ` Try e1 x.e2 : τ

2.2.1 Dynamic Semantics for the C Machine We introduce a new execution mode (in addition to ≺ and ), used to signify that an exception is being raised, written 4. So, when we evaluate a try expression, we simply evaluate the first subexpression, and push the handler onto the stack:

s Try e1 x.e2 7→c Try x.e2 . s e1 Then, if the evaluation returns, we simply discard the try stack frame:

Try x.e2 . s ≺ v 7→c s ≺ v If we encounter a raise expression, we first evaluate the exception value being raised:

s Raise τ e 7→c Raise τ . s e And, once it returns, we enter the new mode, 4:

Raise τ . s ≺ v 7→c s 4 v This mode continuously pops frames off the stack:

f . s 4 v 7→c s 4 v Until at last we encounter a Try expression, where the handler is evaluated.

Try x.e2 . s 4 v 7→c s {v/x}e2

3 2.2.2 Optimising Exceptions The problem with this approach is one of performance. Raising an exception is O(n) in the size of the stack, which could be a serious performance hit if the stack is very large (for example, in a big recursive function). Seeing as in our abstract machines we are concerned about performance, we will refine our machine definition to make exception handling fast. We will define a new type of stack, a Handler stack. The empty handler stack is denoted by ?, and each handler frame consists of a runtime stack, and the handler expression: s HStack e Expr r Stack ? HStack hr, x.ei . s HStack Our states will now resemble h, r e, where h is the handler stack, r is the runtime stack. The “exception handling” mode 4 is not needed and therefore is removed. When we enter a Try block, we add a handler to the handler stack, and a placeholder to the runtime stack. The handler includes the current runtime stack and the handler expression:

h, r Try e1 x.e2 7→c hr, x.e2i . h, Try . r e1 We include the placeholder so that if we return from a Try block, we can remove the handler from the handler stack as it was not used:

0 hr , x.e2i . h, Try . r ≺ v 7→c h, r ≺ v When we encounter a Raise expression, we first evaluate the exception value as normal, and then we immediately switch to the runtime stack in the top handler frame of the exception stack, saving us the trouble of manually going through the runtime stack frame by frame:

0 0 hr , x.e2i . h, Raise τ . r ≺ v 7→c h, r {v/x}e2 Note: It may seem inefficient to copy the runtime stack to the handler stack each time a Try block is reached. Note however that, in the course of evaluating the try block, the machine will never pop off the Try place- holder. Therefore a pointer to the current runtime stack could be kept in the handler stack rather than a copy. Everything above that pointer is freed when an exception is raised.

Exceptions provide a robust means for dealing with partial functions and ensuring type-safety, but MinHS is still not very useful for representing real world programs. The next section introduces new type system features that make its type system almost equivalent to that of ML’s or Haskell’s.

3 Composite Data Types

In this section we will extend MinHS to have a type system that is able to encode data types of arbitrary complexity or sophistication. In Haskell, we can easily come up with a that contains exactly four values:

data FourVals = One | Two | Three | Four

In MinHS, we only have three types: Int, Bool and function types. Of those, only Bool and some function types have a finite number of values, but not four 1. It seems as though it is impossible to make a type with four values only in MinHS.

3.1 Products It would be possible, however, to achieve this if we had some notion of a pair. If we had a pair of boolean values, there would be a total of four possible values: (True, True), (True, False), (False, True), (False, False). Many languages include ways to group types together into one composite type - a struct in C, a class in Java. Our notion of a pair (also called a tuple) is somewhat more inconvenient, but equally powerful to C structs. 1It may appear that Bool → Bool has four values, but this does not include the nonterminating function fx = fx

4 The type of pairs is called a , for reasons that will become clear later. The type of the pair of 2 τ1 and τ2 is therefore written τ1 × τ2 . We introduce a data constructor, Pair, to MinHS, which produces values of a product type:

Γ ` e1 : τ1 Γ ` e2 : τ2 Γ ` Pair e1 e2 : τ1 × τ2 Note that we can combine more than two types simply by using nested products: Int × Int × Int can be constructed with Pair 3 (Pair 4 5). Once we have a pair value, how do we get each component out of it? We introduce a more syntax for some destructors, Fst and Snd, which, given a pair, return the first or second component of the pair respectively:

Γ ` e : τ1 × τ2 Γ ` e : τ1 × τ2 Γ ` Fst e : τ1 Γ ` Snd e : τ2 Evaluation rules are omitted as they are straightforward.

3.2 Sums Sum types are less common than product types in programming languages, but they can be simulated with tagged unions in C and inheritance in Java. In essence τ1 + τ2 allows either a value of type τ1 or a value of type τ2, but not both. In Haskell, an equivalent definition would be: data Sum a b = Inl a | Inr b So, we will introduce similar constructors, inl and inr, for our sum type in MinHS:

Γ ` e : τ1 Γ ` e : τ1 Γ ` Inl e : τ1 + τ2 Γ ` Inr e : τ1 + τ2 Once again, we can combine multiple types to give n-ary sums via nesting. Instead of function-like destructors for sum types, we will introduce a new Case expression, similar to Haskell’s equivalent, which provides two alternative expressions. One for Inl, and one for Inr. As an example: SumToProd :: (Int + Int -> Bool * Int) e = case e of Inl u = pair False u Inr v = pair True v;

The abstract syntax of a case expression is as follows: Case τ1 τ2 e x.e1 y.e2, where the input sum value e is of type τ1 + τ2, and e1 is the alternative for Inl, and e2 is the alternative for Inr. Typing rules:

Γ ` e : τ1 + τ2 Γ ∪ {x : τ1} ` e1 : τ Γ ∪ {y : τ2} ` e2 : τ Γ ` Case τ1 τ2 e x.e1 y.e2 : τ Big step evaluation rules:

e ⇓ Inl v {v/x}e1 ⇓ r e ⇓ Inr v {v/y}e2 ⇓ r Case τ1 τ2 e x.e1 y.e2 ⇓ r Case τ1 τ2 e x.e1 y.e2 ⇓ r With this, we can make a four-valued type as well: Bool+Bool has values Inl False, Inr False, Inl True, Inr True.

3.3 Could we define a type with only three values? At the moment, the smallest type we’ve got is Bool, which has two values, and both products and sums of Bool give us a type too large with four values. The way we resolve this is to introduce a new type, called Unit, sometimes written >, which has exactly one value - (), sometimes written unitel in abstract syntax. It can be thought of as the “empty” tuple, and corresponds to a in C. Typing rules are obvious:

Γ ` Unitel : Unit With this type, we can construct a type of three values using a sum type: The type Unit + Unit + Unit has three values: Inl unitel, Inr (Inl Unitel), Inr (Inr Unitel).

2Product types are analogous to cartesian products in set theory, for those that are familiar with sets.

5 3.4 Type Isomorphism The above type of three values could be represented as Unit + Unit + Unit as shown, but it could also be represented as Unit + Bool (as it also has three values). We will define the notion of type isomorphism to capture the similarity between these terms. Formally, two mathematical objects A and B are considered isomorphic if there exists a structure preserving mapping (or a morphism) f from A to B, and an inverse morphism f −1 from B to A such that f ◦ f −1 is an −1 −1 identity morphism IA from A to A (i.e f(f (x)) = x) and f ◦ f is an identity morphism IB from B to B (i.e f −1(f(y)) = y). It is analogous to bijection in set theory or logic, and is often written as A ⇔ B where A is isomorphic to B. Seeing as types can be thought of as (slightly restricted) sets of values, we can construct such mappings to show types A and B isomorphic if and only if A has the same number of values as B. As an example, we could map Inl Unitel to Inl Unitel, Inr (Inl Unitel) to Inr False and Inr (Inr Unitel) to Inr True (and vice-versa) and thus show that Unit + Unit + Unit ⇔ Unit + Bool. Generally, it is suffi- cient to show that the two types have the same number of values in order to show that they are isomorphic. Some dimensionality problems arise with infinitely large types, however: Int has an infinite number of values, and so does Int × Int. Are they isomorphic? For this we can fall back to our more rigid definition of isomorphism - is there a mapping that could take a point on a plane and put it on a line, such that I could then have an inverse mapping from that point on the line to the original point on the plane? The answer to this is yes - a technique called Cantor’s Argument, or diagonalization. This provides a mapping from a plane to a line by assigning each point on a plane a unique point on the line, radiating outward from the origin in a diagonal fashion from the x axis to the y axis. As you have probably already realised, computing the size of types can be achieved easily by using this function: |⊥| = 0 |Unit| = 1 |Bool| = 2 |Int| = ∞ |τ1 × τ2| = |τ1| × |τ2| |τ1 + τ2| = |τ1| + |τ2| |τ1| |τ1 → τ2| = |τ2| for total, terminating functions only

3.5 Type Algebra It is worth noting that seeing as type equivalence is based simply on the number of values within the types, and sum and product types correspond to addition and multiplication of type size, types follow all the same algebraic laws as the natural numbers :

• Closure: The sum or product of two types is a type.

• Unit is the multiplicative identity: Unit × x ⇔ x • ⊥ is the additive identity: ⊥ + x ⇔ x • The additive identity is the multiplicative absorbing element: ⊥ × x ⇔ ⊥ • Additive and multiplicative operations are commutative: x × y ⇔ y × x and x + y ⇔ y + x.

• Additive and multiplicative operations are associative: x × (y × z) ⇔ (x × y) × z, similarly for sum types. • Multiplicative operation distributes over the additive operation: a × (b + c) ⇔ (a × b) + (a × c)

4 Recursive Types

What about if we wanted to write a common linked-list? In Haskell, this would be easily defined:

data ListOfInt = Int ListOfInt | Nil

6 In MinHS, however, we are stuck - we have no name by which we can introduce recursive types. All types in MinHS are anonymous. What we want to be able to declare is the type τ = (Int × τ) + Unit (which is isomorphic to the Haskell definition above). To solve this, we introduce a new type system feature, called rec (sometimes written µ). It would allow a linked list type to be written as follows:

rec t. (Int × t) + Unit We also need to introduce constructors and destructors for this rec construct. We call then Roll and Unroll. The typing rules for Roll are straightforward, and mostly uninteresting: Γ ` x : τ Γ ` Roll x : rec t. τ Unroll, however, gives us an insight into how these recursive types actually work: Γ ` x : rec t. τ Γ ` Unroll x : {rec t. τ/t}τ That is, when we Unroll a recursive type, the recursive name variable (t) is replaced with the entire recursive type. Each Unroll eliminates one level of recursion. We can define a value of our linked list type, written in Haskell as [3, 4], as follows:

Roll (Inl (Pair 3 (Roll (Inl (Pair 4 (Roll (Inr Unitel)))))))

Not very pretty, but isomorphic to the Haskell list.

5 Haskell Datatypes

MinHS’s type system, with these extensions, is now nearly as powerful as Haskell’s. To show this, I will demonstrate a technique which gives an isomorphic type in MinHS to any (non-polymorphic) type in Haskell. Suppose we have the following Haskell type:

data Foo = Bar Int Bool | Baz | Herp Foo First, replace all data constructors with Unit:

data Foo = Unit Int Bool | Unit | Unit Foo Then, multiply the constructors with all of their arguments:

data Foo = Unit × Int × Bool | Unit | Unit × Foo Then, replace all the alternatives with sums:

data Foo = Unit × Int × Bool + Unit + Unit × Foo Apply algebraic simplification (usually just Unit × x ⇔ x):

data Foo = Int × Bool + Unit + Foo Then, replace the recursive references (if any) with a rec construct:

rec t. Int × Bool + Unit + t As can be seen from the above example, this technique will convert non polymorphic, non parameterised Haskell types into isomorphic MinHS ones. A useful trick for your exam ;)

7