Lexical and Syntax Analysis
Abstract Syntax

What is Parsing?

String of characters → Parser → Data structure

The string of characters is easy for humans to write. The data structure is easy for programs to process.
A parser also checks that the input string is well-formed, and if not, rejects it.

Example 1

Input to the parser, in CSV (Comma Separated Value) format:

  Beckham, 17
  Lineker, 48
  Charlton, 49

Output of the parser, an array of pairs:

  ("Beckham", 17), ("Lineker", 48), ("Charlton", 49)

Concrete and Abstract Syntax
The concrete syntax is a set of rules that describe valid inputs to the parser.
The abstract syntax is a set of rules that describe valid outputs from the parser.
The data structure produced by a parser is commonly termed the abstract syntax tree.

String of characters → Parser → Data structure

The string of characters conforms to the concrete syntax of the language. The data structure conforms to the abstract syntax of the language.

Concrete syntax
The concrete syntax is usually defined by regular expressions and context-free grammars.
Example:
name = [a-zA-Z]+
goals = [0-9]+
squad = ( name ∙ , ∙ goals )*

Abstract syntax
The abstract syntax is usually specified as a data type in the programming language being used, in our case C.

Example:

  struct player {
    char* name;
    int goals;
  };

  struct squad {
    struct player* players;
    int size;
  };

An abstract syntax tree is a value of this type.

This lecture
How, in the programming language C:
. to define abstract syntax
. to construct abstract syntax trees

Also revisits some important C programming techniques.

POINTERS
Pointer: a variable that holds the address of a core storage location. [The Free Dictionary]

Pointers
Declare a variable x of type int and initialise it to the value 10.

  int x = 10;           /* x holds 10 */

Declare a variable p of type int* (read: int pointer).

  int* p;

Make p point to x (or assign the address of x to p).

  p = &x;               /* p points to x, which holds 10 */

Print the value pointed to by p (here, the value of x).

  printf("%i\n", *p);

Assign 20 to the location pointed to by p.

  *p = 20;              /* x now holds 20 */
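The pointer steps so far can be collected into one small function. This is only a sketch: the name pointerDemo is hypothetical, and assert stands in for the printf call so that the expected value is checked rather than printed.

```c
#include <assert.h>

/* Hypothetical demo: repeats the pointer steps and returns the final x. */
int pointerDemo(void) {
  int x = 10;          /* x holds 10 */
  int* p;              /* p is an int pointer, not yet pointing anywhere */
  p = &x;              /* p now points to x */
  assert(*p == 10);    /* reading through p gives the value of x */
  *p = 20;             /* writing through p changes x itself */
  return x;
}
```

Calling pointerDemo() returns 20, because the assignment through p updated x.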
The value NULL means “points to nothing”.

  p = NULL;

DYNAMIC ALLOCATION
Dynamic Allocation: the allocation of memory storage for use in a computer program. [The Free Dictionary]

Array allocation
Declare a variable p of type int*.

  int* p;

Allocate memory for an array of 4 int values and let p point to it.

  p = malloc(4 * sizeof(int));

Array indexing
Assign 10 to the location pointed to by p.

  *p = 10;              /* first element holds 10 */

Assign 20 to the first element of the array pointed to by p.

  p[0] = 20;            /* first element holds 20 */

Copy the first element of the array to the third element.

  p[2] = p[0];          /* first and third elements hold 20 */

Array deallocation
When finished with an array allocated by malloc, call free to release the space; otherwise your program may run out of memory.

  free(p);
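Allocation, indexing and deallocation can be put together as in the sketch below. The function name arrayDemo and the -1 error code are assumptions made for illustration; the check of malloc's result is an addition, since malloc returns NULL when it cannot allocate.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical demo: returns the third element after the copy,
   or -1 if the allocation failed. */
int arrayDemo(void) {
  int* p = malloc(4 * sizeof(int));   /* room for 4 int values */
  if (p == NULL) return -1;           /* malloc returns NULL on failure */
  *p = 10;                            /* first element is 10 */
  p[0] = 20;                          /* first element is now 20 */
  assert(*p == p[0]);                 /* *p and p[0] name the same location */
  p[2] = p[0];                        /* copy first element to third */
  int third = p[2];
  free(p);                            /* p must not be used after this */
  return third;
}
```

The value is copied out of the array before free is called, because the array must not be read once it has been released.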
The space is released, so it can be reused by future calls to malloc.

STRINGS
String: a series of consecutive characters. [The Free Dictionary]

Strings
Declare a variable s, initialised to point to the string "hi".

  char* s = "hi";       /* memory: h i \0, with s pointing to 'h' */

Let s point to the next character.

  s = s + 1;            /* s points to 'i' */

And let s point to the previous character again.

  s = s - 1;            /* s points to 'h' */

Exercise 1
What is printed by the following program?
  #include <stdio.h>

  int f(char* s) {
    int i = 0;
    while (s[i] != '\0') i++;
    return i;
  }

  int main(void) {
    char* x = "Hello";
    printf("%i\n", f(x));
    return 0;
  }

USER-DEFINED TYPES
Type: the general character or structure held in common by a number of things. [The Free Dictionary]

Type definitions
A typedef declaration allows a new name to be given to a type.
  typedef int Integer;
  typedef char* String;

In each declaration the existing type comes first, followed by its new name.

Example use:

  String s;    /* Declare a string s */
  Integer i;   /* and an integer i */
  i = 0;
  s = "hello";

Enumerations
An enum declaration introduces a new type whose values are members of a given set.
  enum colour {RED, GREEN, BLUE};

Here enum colour is the new type, and RED, GREEN and BLUE are its possible values.

Example use:

  enum colour c;
  c = RED;
  if (c == RED) printf("Red\n");

Give it a shorter name:

  typedef enum colour Colour;

Structures
A struct declaration introduces a new type that is a conjunction of one or more existing types.

  struct rectangle {
    float width;    /* A width and */
    float height;   /* a height */
  };

Example use:

  struct rectangle r;
  r.width = 10;
  r.height = 20;

A circle:

  struct circle {
    float radius;
  };

Unions
A union declaration introduces a new type that is a disjunction of one or more existing types.

  union shape {
    struct circle circ;      /* A circle or */
    struct rectangle rect;   /* a rectangle */
  };

Example use:

  union shape s;
  s.circ.radius = 10;

Tagged unions
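A plain union holds only one of its members at a time and does not record which one was written last. The sketch below, with the hypothetical function names membersOverlap and storeCircle, illustrates that the members of union shape share the same storage.

```c
#include <assert.h>

struct circle { float radius; };
struct rectangle { float width; float height; };
union shape { struct circle circ; struct rectangle rect; };

/* Hypothetical demo: returns 1 if both members start at the same address. */
int membersOverlap(void) {
  union shape s;
  return (void*)&s.circ == (void*)&s.rect;
}

/* Hypothetical demo: stores a circle in a union shape and reads it back. */
float storeCircle(float r) {
  union shape s;
  assert(sizeof s >= sizeof(struct rectangle));  /* room for the largest member */
  s.circ.radius = r;     /* circ is now the member in use */
  return s.circ.radius;  /* reading s.rect here would not be meaningful */
}
```

Since nothing in s says whether circ or rect was stored, code that receives a union shape has no way to tell which member to read.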
Often a tag is used to denote the active member of a union.
Another definition of shape:

  struct shape {
    enum { CIRCLE, RECTANGLE } tag;
    union {
      struct circle circ;
      struct rectangle rect;
    };
  };

  typedef struct shape Shape;
Example: s is a circle and t is a rectangle, and both are of type Shape.
  Shape s, t;
  s.tag = CIRCLE;
  s.circ.radius = 10;
  t.tag = RECTANGLE;
  t.rect.width = 5;
  t.rect.height = 15;
It is often convenient to define a constructor function for each member of the tagged union.
  Shape mkCircle(float r) {
    Shape s;
    s.tag = CIRCLE;
    s.circ.radius = r;
    return s;
  }

  Shape mkRectangle(float w, float h) {
    Shape s;
    s.tag = RECTANGLE;
    s.rect.width = w;
    s.rect.height = h;
    return s;
  }
Example revisited: s is a circle and t is a rectangle, and both are of type Shape.
  Shape s, t;
  s = mkCircle(10);
  t = mkRectangle(5, 15);
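The tagged union and its constructor functions can be gathered into one compilable sketch. The function shapeName and the non-negative preconditions checked by assert are assumptions added for illustration. The unnamed inner union relies on C11 anonymous unions (or a compiler extension).

```c
#include <assert.h>

struct circle { float radius; };
struct rectangle { float width; float height; };

struct shape {
  enum { CIRCLE, RECTANGLE } tag;
  union {                      /* anonymous union: C11 or compiler extension */
    struct circle circ;
    struct rectangle rect;
  };
};
typedef struct shape Shape;

Shape mkCircle(float r) {
  assert(r >= 0);              /* assumed precondition */
  Shape s;
  s.tag = CIRCLE;
  s.circ.radius = r;
  return s;
}

Shape mkRectangle(float w, float h) {
  assert(w >= 0 && h >= 0);    /* assumed precondition */
  Shape s;
  s.tag = RECTANGLE;
  s.rect.width = w;
  s.rect.height = h;
  return s;
}

/* Hypothetical helper: inspects the tag to name the active member. */
const char* shapeName(Shape s) {
  switch (s.tag) {
    case CIRCLE:    return "circle";
    case RECTANGLE: return "rectangle";
  }
  return "unknown";            /* not reached for a well-formed Shape */
}
```

Because each constructor sets the tag and the matching member together, a Shape built this way cannot have a tag that disagrees with its contents.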
Example: compute the area of any given shape s.
  float area(Shape s) {
    if (s.tag == CIRCLE) {
      float r = s.circ.radius;
      return (3.14 * r * r);
    }
    if (s.tag == RECTANGLE) {
      return (s.rect.width * s.rect.height);
    }
    return 0;   /* not reached for a well-formed Shape */
  }

Exercise 2
Define a function to compute the perimeter of any given shape s.

  float perim(Shape s) { ... }

Recursive structures
A value of type struct t may contain a value of type struct t*.
  struct list {
    int head;
    struct list* tail;
  };

  typedef struct list List;

Suppose p is a value of type List*. Then:

  (*p).head          ≡  p->head
  (*(*p).tail).head  ≡  p->tail->head
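The equivalence of the two notations can be checked directly. In this sketch the function name secondHead is hypothetical, and the list cells in the usage note are ordinary local variables rather than dynamically allocated ones.

```c
#include <assert.h>
#include <stddef.h>

struct list { int head; struct list* tail; };
typedef struct list List;

/* Hypothetical helper: returns the head of the second cell of the list.
   Assumes the list has at least two cells. */
int secondHead(List* p) {
  assert(p != NULL && p->tail != NULL);
  /* Both notations denote the same location. */
  assert(&(*(*p).tail).head == &p->tail->head);
  return p->tail->head;
}
```

For example, given `List second = { 2, NULL };` and `List first = { 1, &second };`, the call `secondHead(&first)` returns 2.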
Example: inserting an item onto the front of a linked list.
  List* insert(List* xs, int x) {
    List* ys = malloc(sizeof(List));
    ys->head = x;
    ys->tail = xs;
    return ys;
  }

Exercise 3
Define a function to compute the length of a given list xs.

  int length(List* xs) { ... }

CASE STUDY
A simplifier for arithmetic expressions.

Concrete syntax
Here is a concrete syntax for arithmetic expressions.

  v = [a-z]+
  n = [0-9]+
  e → v | n | e + e | e * e | ( e )

Example expression:

  x * y + (foo * 10)

Simplification
Consider the algebraic law:
∀x. x * 1 = x
This law can be used to simplify expressions by using it as a rewrite rule from left to right.
Example simplification:
x * (y * 1) → x * y

Problem
1. Define an abstract syntax, in C, for arithmetic expressions.
2. Define constructor functions so that we can build abstract syntax trees representing expressions.
3. Implement the simplification rule as a C function over abstract syntax trees.
1. Abstract syntax

  typedef enum { ADD, MUL } Op;

  struct expr {
    enum { VAR, NUM, APP } tag;
    union {
      char* var;            /* A variable or */
      int num;              /* a number or */
      struct {
        struct expr* e1;    /* a left expr and */
        Op op;              /* an op and */
        struct expr* e2;    /* a right expr */
      } app;
    };
  };

  typedef struct expr Expr;

2. Constructor functions
  Expr* mkVar(char* v) {
    Expr* e = malloc(sizeof(Expr));
    e->tag = VAR;
    e->var = v;
    return e;
  }

  Expr* mkNum(int n) {
    Expr* e = malloc(sizeof(Expr));
    e->tag = NUM;
    e->num = n;
    return e;
  }

  Expr* mkApp(Expr* e1, Op op, Expr* e2) {
    Expr* e = malloc(sizeof(Expr));
    e->tag = APP;
    e->app.op = op;
    e->app.e1 = e1;
    e->app.e2 = e2;
    return e;
  }

2. Abstract syntax trees
An abstract syntax tree that represents the expression x + (y * 2) can be constructed by the following C expression:

  mkApp(mkVar("x"), ADD, mkApp(mkVar("y"), MUL, mkNum(2)))

3. Simplification
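The rewrite rule

x * 1 → x

is implemented by

  void simplify(Expr* e) {
    /* Simplify the subexpressions first, so that nested
       cases such as (x * 1) * 1 are reduced completely. */
    if (e->tag == APP) {
      simplify(e->app.e1);
      simplify(e->app.e2);
    }
    if (e->tag == APP && e->app.op == MUL &&
        e->app.e2->tag == NUM && e->app.e2->num == 1) {
      Expr* left = e->app.e1;
      Expr* one = e->app.e2;
      *e = *left;   /* overwrite e with its left operand */
      free(left);   /* release the two nodes that are */
      free(one);    /* no longer part of the tree */
    }
  }

Homework exercises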
Implement a pretty printer that prints an abstract syntax tree in a concrete form.
void print(Expr* e) { ... }
Extend the simplifier to exploit the following algebraic law.
∀x. x * 0 = 0

Motivation for Parsing
In SYAC, we are interested in how to implement the following kind of function
Expr* parse(char* string) { ... }
It takes a string conforming to the concrete syntax and returns an abstract syntax tree.
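Until such a parse function is available, abstract syntax trees can be built by hand with the constructor functions and passed to the simplifier. The sketch below gathers the case study into one compilable unit. It is one possible arrangement, not the definitive one: newExpr and freeExpr are hypothetical helpers, allocation failure is treated as fatal, and this simplify handles subexpressions first and frees the nodes it discards. It assumes every node was created by the constructor functions and that variable names are string literals.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { ADD, MUL } Op;

struct expr {
  enum { VAR, NUM, APP } tag;
  union {                        /* anonymous union: C11 or compiler extension */
    char* var;
    int num;
    struct { struct expr* e1; Op op; struct expr* e2; } app;
  };
};
typedef struct expr Expr;

/* Hypothetical helper: allocate one node, treating failure as fatal. */
static Expr* newExpr(void) {
  Expr* e = malloc(sizeof(Expr));
  assert(e != NULL);
  return e;
}

Expr* mkVar(char* v) { Expr* e = newExpr(); e->tag = VAR; e->var = v; return e; }
Expr* mkNum(int n)   { Expr* e = newExpr(); e->tag = NUM; e->num = n; return e; }

Expr* mkApp(Expr* e1, Op op, Expr* e2) {
  Expr* e = newExpr();
  e->tag = APP;
  e->app.op = op;
  e->app.e1 = e1;
  e->app.e2 = e2;
  return e;
}

/* Applies x * 1 -> x throughout e, simplifying subexpressions first. */
void simplify(Expr* e) {
  if (e->tag != APP) return;
  simplify(e->app.e1);
  simplify(e->app.e2);
  if (e->app.op == MUL && e->app.e2->tag == NUM && e->app.e2->num == 1) {
    Expr* left = e->app.e1;
    Expr* one = e->app.e2;
    *e = *left;                  /* overwrite e with its left operand */
    free(left);                  /* both old nodes are now unreachable */
    free(one);
  }
}

/* Hypothetical helper: release a whole tree (variable names are not freed). */
void freeExpr(Expr* e) {
  if (e->tag == APP) { freeExpr(e->app.e1); freeExpr(e->app.e2); }
  free(e);
}
```

For instance, after `Expr* e = mkApp(mkVar("x"), MUL, mkApp(mkVar("y"), MUL, mkNum(1)));` and `simplify(e);`, the tree e represents x * y, and `freeExpr(e)` releases it.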