Lesson 1: Symbol Tables
1. Introduction
2. Name spaces
3. Organization
4. Block structured languages
5. Perspective
6. tabla.c and tabla.h
7. Exercises

Readings: Scott, section 3.3; Muchnick, chapter 3; Aho, section 7.6; Fischer, chapter 8; Holub, section 6.3; Bennett, section 5.1; Cooper, sections 5.7 and B.4

12048 Compilers II - J. Neira – University of Zaragoza

The Symbol Table
• Why is it important?
  – Lexical analysis?
  – Syntactic analysis?
  – Semantic analysis?
  – Code generation?
  – Code optimization?
  – Execution?
  Fragments to consider:
    int a, 1t;
    while a then ...
    c := i > 1;
• Why is it particular and complex?
  – What information does it contain?
  – How/when is information included?
  – How/when is it accessed?
  – How/when is it deleted?
• Do interpreters need one?
• Do debuggers?
• Disassemblers?

1. Introduction
Symbol table: structure used by the compiler to store information (in the form of attributes) associated to symbols declared in the program.
• Conceptually it is a set of records
• Program dictionary
• Its organization is strongly influenced by syntactic as well as semantic aspects of the language at hand.
It can additionally include:
  – Temporary symbols
  – Labels
  – Predefined symbols
Attributes and the phase that provides them:
  • name: lexical analysis
  • type: syntactic analysis
  • ...: semantic analysis
  • address: code generation

• Types available in the language: determine the CONTENTS of the table.
• Scope rules: determine the visibility of the symbols, i.e., the table's ACCESS MECHANISM.

Table contents
• Reserved words: they have a special meaning; they CANNOT be redefined.
    program begin end type var array if ...
• Predefined symbols: also have a special meaning, but can be redefined.
    sin cos get put read write integer! real!
• Literals, constants that denote a certain value
• Symbols generated by the compiler
    var a: record
             b,c : integer;
           end;
  – Generates symbol noname1 for the anonymous type corresponding to the record.

• Symbols defined by the programmer
  – Variables: type, place in memory, value? references?
  – Data types: description
  – Procedures and functions: address, parameters, result type
  – Parameters: type of variable, parameter class
  – Labels: place in the program

Query
When processing ...   the compiler ...
• declarations        queries the table to prevent illegal duplication of symbol names.
• statements          queries the table to verify that the involved symbols are accessible and used correctly.

    const c = v;
    var v : t;
  • Does identifier c exist?
  • Does identifier v exist?
  • Is type t declared?

    v := e;
  • Is v defined? Is it a variable? Which is its address?
  • Is e defined? Is it a variable?
  • Is it the same type as v (or of a compatible type)?

    v := f (a, b + 1);
  • Is f a function?
  • Does the number of arguments agree with the number of parameters?
  • Do the types of the arguments agree with those of the parameters?
  • Is the use of arguments correct?

Update
When processing ...   the compiler ...
• declarations        updates the table to include new symbols.
• scopes              updates the table to modify the visibility of symbols.

    const c = v;
    var v : t;

  – For c: include c of type v and its value. Assign a location in memory?
  – For v: include v with type t. Assign it a position in memory.

    end;
  – Delete (or hide) all symbols of the block (scope) that is closing.

    function f (i : integer) : integer;
  – Include f (as function)
  – Open a new scope
  – Include f (as assignment variable)
  – Include i (as parameter)

    procedure P (i : integer; var b : boolean);
  – Include P (procedure)
  – Open a new scope
  – Include i and b (parameters)

Requirements
• Speed: query is the most frequent operation. O(n)? O(log2(n))? O(1)?
• Efficiency in space management: a large amount of information is stored.
• Flexibility: the possibility of defining types makes the declaration of variables arbitrarily complex.
• Easy maintenance: identifier deletion must be simple
  – It is not random
• Duplicate identifiers should be allowed: most languages allow it. It must be clear which are accessible at each moment.

Requirements
    program e;
    var a, b, c : char;
      procedure f;
      var a, b, c : char;
      ...
      procedure g;
      var a, b : char;
        procedure h;
        var c, d : char;
        ...
        procedure i;
        var b, c : char;
        ...
      ...
      procedure j;
      var b, d : char;
      ...

Symbols declared in each block:
    e(): a, b, c, f, g, j
      f(): a, b, c
      g(): a, b, h, i
        h(): c, d
        i(): b, c
      j(): b, d

Program being compiled:    e()  f()  g()  h()  i()  j()  e()
Symbol table (top first):                 h()  i()
                                f()  g()  g()  g()  j()
                           e()  e()  e()  e()  e()  e()  e()

2. Names
• Remember: conceptually, we store records with the name and attributes of each symbol.
• A storage and search mechanism by name is required.
• Alternative 1: define the name field as a vector of characters.
    struct {
      char name[MAX];     /* space for the longest possible identifier */
      ...
    } table_entry;
  [Figure: table with columns name and attributes; entries base, indice, comienzo, x.]
• FORTRAN IV: number of significant characters severely limited (6). It is not so in modern languages.
• Variability in the length of names -> space is wasted.

Use of the Heap
• Alternative 2: define the name field as a pointer to char.
    struct {
      char *nombre;
      ...
    } entrada_tabla;
    ...
    e->nombre = strdup(nombre);
  [Figure: table with columns name and attributes; each name field points to a string (base, indice, comienzo, x) stored in the HEAP.]
• Obtaining space may be slow
• Space reuse depends on the heap memory recovering mechanism.
• The requirements for a symbol table are simpler than those offered by heaps.

Name space
• Alternative 3: define the name field as an index in a vector of names.
    struct {
      int nombre;
      ...
    } entrada_tabla;
    ....
    char espacio_nombres[MAX];
  [Figure: the name space holds base, indice, comienzo and x one after another, followed by the free area; each table entry (name, attributes) stores the index where its name starts.]

• Administration is local • Space can be reused • Space is not ‘unlimited’
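
The three properties above can be seen in a minimal C sketch of Alternative 3. It reuses espacio_nombres from the slide; the capacity NS_MAX and the helpers ns_guardar, ns_nombre and ns_liberar_desde are hypothetical names chosen for illustration, not part of tabla.c.

```c
#include <assert.h>
#include <string.h>

#define NS_MAX 1024                 /* hypothetical capacity of the name space */

static char espacio_nombres[NS_MAX];
static int  ns_libre = 0;           /* index of the first free position */

/* Copy a name (with its '\0') into the name space.
   Returns the index where it starts, or -1 if the space is exhausted. */
int ns_guardar(const char *nombre) {
    size_t len = strlen(nombre) + 1;
    if (len > (size_t)(NS_MAX - ns_libre)) return -1;   /* space is not 'unlimited' */
    int inicio = ns_libre;
    memcpy(&espacio_nombres[inicio], nombre, len);
    ns_libre += (int)len;
    return inicio;
}

/* Recover the name stored at a given index. */
const char *ns_nombre(int indice) {
    assert(indice >= 0 && indice < ns_libre);
    return &espacio_nombres[indice];
}

/* Reuse space: discard every name stored at or after 'marca'. */
void ns_liberar_desde(int marca) {
    assert(marca >= 0 && marca <= ns_libre);
    ns_libre = marca;
}
```

Because names are released in the reverse order in which they were stored, a single "free" index is enough; no general-purpose heap is involved.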

Name space
What is the appropriate size for the name space?
• Small: it may be insufficient
• Large: space may be wasted
• Solution: segmented name space
  – Vector of T pointers to segments (arrays) of size s
  – Segment: name div s
  – Index in segment: name mod s

    name = segment * s + index
• Overall size is limited by the size of the vector of pointers (T=50 pointers to 1024 chars = 50k)
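
The div/mod addressing can be sketched in C as follows. S and T take the values from the slide; seg_direccion and the on-demand allocation with calloc are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define S 1024                      /* segment size (s in the slide) */
#define T 50                        /* number of segment pointers */

static char *segmentos[T];          /* segments are allocated only when first used */

/* Address of position 'name' in the segmented name space.
   Returns NULL if 'name' is out of range or memory cannot be obtained. */
char *seg_direccion(int name) {
    if (name < 0 || name >= S * T) return NULL;
    int segmento = name / S;        /* name div s */
    int indice   = name % S;        /* name mod s */
    assert(segmento < T);
    if (segmentos[segmento] == NULL) {
        segmentos[segmento] = calloc(S, 1);
        if (segmentos[segmento] == NULL) return NULL;
    }
    return &segmentos[segmento][indice];
}
```

For example, name 1030 falls in segment 1 at index 6. This sketch does not deal with a name that would straddle two segments; a real implementation would have to start such a name at the beginning of the next segment.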

3. Organization
• Three basic operations:
  – search()
  – include()
  – delete()
• Alternative 1: unordered list
    search()   O(n)
    include()  O(1)
    delete()   O(1)
  [Figure: entries (name, attributes) stored in an array indexed 0..M-1 and in a linked list, each marked "Include here" at the insertion point.]

3. Organization
• Alternative 2: list ordered by name
  [Figure: the same two structures, array 0..M-1 and linked list, kept sorted by name. Insertion where?]
                 array               linked list
    search()     O(log2(n))          O(n)
    include()    O(log2(n))+O(n)     O(n)+O(1)
    delete()     ?                   ?

The order of inclusion is lost!

3. Organization
• Alternative 3: binary trees
    search()   O(log2(n))
    include()  O(log2(n))+O(1)
    delete()   ?
  Only if balanced!
• In the worst case, the cost of each operation is the same as the ordered list.
    var i, j, k, l, m : integer;        var i, m, j, l, k: integer;
  [Figure: the unbalanced tree built from each declaration, inserting the names in the order in which they appear.]
• There is no guarantee that the tree will be balanced (names are not random).

3. Organization
• Alternative 4: hash tables. Mechanism to randomly distribute an arbitrary number of items into a finite set of classes.
• Hash function h: associates a character sequence with a hash code = index in the table. Which h?
    h('base') = 0
    h('x') = i
    h('comienzo') = j
  [Figure: table indexed 0..M-1 with base at 0, x at i and comienzo at j.]
• Collisions: two different sequences may be associated with the same index. Which collision management?
• With an appropriate hash function and adequate collision management, you can search() in constant time.

The hash function h
• Desirable characteristics:
  – h(s) should depend only on s
  – Efficiency: it must be simple and easy to compute
  – Efficacy: it should produce small collision lists
    » Uniform: all indexes should be assigned with equal probability
    » Randomizing: similar names should go to different indexes
• Birthday paradox:
  – Given N names, and a table of size M
  – Let h be uniform and randomizing. The number of expected insertions before a collision is sqrt(pi*M/2):

      M          sqrt(pi*M/2)
      10         4
      365        24
      1.000      40
      10.000     125
      100.000    396

  Random numbers between 0 and 100:
      84 35 45 32 89 1 58 16 38 69 5 90 16 53 61 ...
  Collision at 13th.
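
The position of the first collision in the sample sequence can be checked mechanically. The function below is a small illustrative sketch (primera_colision is a hypothetical name); for values between 0 and 100 the formula gives sqrt(pi*100/2), about 12.5, which agrees with a collision at the 13th number.

```c
#include <assert.h>
#include <string.h>

/* 1-based position of the first value that repeats an earlier one,
   or 0 if no value repeats. Values must lie in 0..max_valor. */
int primera_colision(const int *valores, int n, int max_valor) {
    char visto[1024];
    assert(max_valor >= 0 && max_valor < 1024);
    memset(visto, 0, sizeof visto);
    for (int i = 0; i < n; i++) {
        assert(valores[i] >= 0 && valores[i] <= max_valor);
        if (visto[valores[i]]) return i + 1;   /* this slot was already taken */
        visto[valores[i]] = 1;
    }
    return 0;
}
```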

Examples
Method of division
    function hash_add(s : string; M : integer) : integer;
    var i, k : integer;
    begin
      k := 0;
      for i := 1 to length(s) do
        k := k + ord(s[i]);
      hash_add := k mod M;
    end;
[Figure: number of identifiers falling on each table index for hash_add, in three cases: size = 19 with ids A0...A199; size = 100 with ids A0...A99; size = 19 with ids A0...A99.]
Anagrams are not handled.

Examples
    int hash_horner(char *s, int M)
    {
      int h, a = 256;
      for (h=0; *s != '\0'; s++)
        h = (a*h + *s) % M;
      return h;
    }
Hash values for a sample of names (M = 101, as in the test below):
    bcba 47   ccad  1   baca 26   abad  4   bddc 43   bdac 83   dbcb 24
    cada 85   dabc 85   dabb 84   dbab 17   dabd 86   dbdb 78   dcbd 60
    dbdd 80   babb 74   bccc  2   addd 39   bcbd 50   adbc 31   bcda 55
Test: 'abcd'
    256*(256*(256*97+98)+99)+100
  mod after each multiplication:
    256*97+98  = 24930 % 101 = 84
    256*84+99  = 21603 % 101 = 90
    256*90+100 = 23140 % 101 = 11
Collision at 21st (large standard deviation).
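
The difference between the two functions on anagrams can be checked with a C transcription of both. This is only a sketch: hash_add is rewritten from the Pascal version, and both functions read characters as unsigned char, which is an assumption added here for portability.

```c
#include <assert.h>

/* C transcription of hash_add: sum of the character codes, mod M. */
int hash_add(const char *s, int M) {
    assert(M > 0);
    int k = 0;
    for (; *s != '\0'; s++)
        k += (unsigned char)*s;
    return k % M;
}

/* hash_horner: the string read as a base-256 number, reduced mod M at every step. */
int hash_horner(const char *s, int M) {
    assert(M > 0);
    int h = 0;
    for (; *s != '\0'; s++)
        h = (256 * h + (unsigned char)*s) % M;
    return h;
}
```

With M = 101, hash_add gives the same index for "abcd" and "dabc" because addition ignores character order, while hash_horner gives 11 and 85 respectively.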

More examples

    unsigned hash_pjw (unsigned char *s)
    {
      unsigned i, hash_val = 0;
      for (; *s; ++s) {
        hash_val = (hash_val << 2) + *s;
        if ((i = hash_val & ~0x3fff) != 0)              /* any bits above the low 14? */
          hash_val = (hash_val ^ (i >> 12)) & 0x3fff;   /* fold them back into the low bits */
      }
      return hash_val;
    }
• Mean length equal to hash_add
• 9% slower than hash_add [Holub 90].

    function uw_hash (s : string; M : integer) : integer;
    var l : integer;
    begin
      l := length(s);
      uw_hash := (ord(s[1]) * ord(s[l])) mod M;
    end;
• UW-PASCAL: simple, works reasonably well; but it is preferable to use all characters.

[Pearson, 90]
• Specific for sequences of characters of variable length

    h := 0;
    for i in 1..n loop
      h := T[h xor C[i]];
    end loop;
    return h

• n: string length
• C[i]: i-th character
• T: random numbers (a permutation of 0..255).
• Advantages:
  – Small changes in the strings (similar strings) become large changes in the resulting value.
  – It takes anagrams into account
• Other: add odd chars, largest ASCII code
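
The loop above translates almost directly to C. In this sketch the table T is filled by a Fisher-Yates shuffle driven by a small linear congruential generator; that generator, its constants, and the names pearson_init and pearson_hash are assumptions made for illustration, since the slide only requires T to be a permutation of 0..255.

```c
#include <assert.h>

static unsigned char T[256];

/* Fill T with a pseudo-random permutation of 0..255 (Fisher-Yates shuffle). */
void pearson_init(unsigned seed) {
    for (int i = 0; i < 256; i++)
        T[i] = (unsigned char)i;
    for (int i = 255; i > 0; i--) {
        seed = seed * 1103515245u + 12345u;          /* illustrative LCG step */
        int j = (int)((seed >> 16) % (unsigned)(i + 1));
        unsigned char tmp = T[i];
        T[i] = T[j];
        T[j] = tmp;
    }
}

/* h := T[h xor C[i]] for every character of s; the result is in 0..255. */
unsigned char pearson_hash(const char *s) {
    assert(s != 0);
    unsigned char h = 0;
    for (; *s != '\0'; s++)
        h = T[h ^ (unsigned char)*s];
    return h;
}
```

pearson_init must be called once before the first call to pearson_hash. The result is an 8-bit value, so it suits tables of at most 256 entries.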

Collision management

• Keep N < M: if h(s) mod M is occupied, try:
  » Linear resolution: (h(s) + 1) mod M, (h(s) + 2) mod M, etc.
    Left to themselves, things tend to go from bad to worse:
    • Large chains form, and they tend to grow! (primary clustering)
    • The mean cost of search() gets close to M as the table fills
    • Too slow if table is 70%-80% full
  » Quadratic resolution: (h(s) + 1^2) mod M, (h(s) + 2^2) mod M, etc.
    • Similar problem (secondary clustering)

Collision management
  » Double hashing: another function h' determines the size of the leaps
      h(s) mod M, (h(s) + h'(s)) mod M, (h(s) + 2*h'(s)) mod M, etc.

    • Very complex analysis
    • Not too slow until the table becomes 90%-95% full

Choosing a value for M
• Avoid powers of 2: k mod 2^b selects the b least significant bits of k
  – Part of the information is ignored
  – If the hash function is not uniform in the lower bits, it generates unnecessary collisions
• Experience suggests choosing primes close to powers of 2:
    Table of ~4000 elements: M = 4093, the largest prime <= 4096 = 2^12
• Double hashing: h'(s) and M must be relatively prime
  – Otherwise it may be impossible to find a place in a non-full table!

Test:
    h = 10, h' = 3, M = 9:    1 4 7 1 4 7 1 4 7 1 4 7 ...
    h = 10, h' = 3, M = 10:   0 3 6 9 2 5 8 1 4 7 0 3 ...
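
The test above can be reproduced by counting how many distinct positions the probe sequence reaches. The function is an illustrative sketch (ranuras_visitadas is a hypothetical name, and the 4096 limit is an arbitrary bound for the local array).

```c
#include <assert.h>

/* Number of distinct slots reached by the double-hashing probe sequence
   (h + i*h2) mod M for i = 0..M-1. */
int ranuras_visitadas(int h, int h2, int M) {
    char visto[4096] = {0};
    int distintas = 0;
    assert(M > 0 && M <= 4096 && h >= 0 && h2 > 0);
    for (int i = 0; i < M; i++) {
        int pos = (h + i * h2) % M;
        if (!visto[pos]) {
            visto[pos] = 1;
            distintas++;
        }
    }
    return distintas;
}
```

With h = 10 and h' = 3, only 3 of the 9 slots are reachable when M = 9 (3 and 9 share the factor 3), whereas all 10 slots are reachable when M = 10.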

Collision management
• Allow N >> M (most common alternative): collision chaining (open tables).
  – It still works on a full table
  – Low occupation of space in the table
  – Uniform hash function → access ≈ constant
  [Figure: table indexed 0..M-1; each entry is either empty (Λ) or the head of a collision list ending in Λ. The names base, indice, comienzo, x and clase are distributed among the lists, with x and clase sharing one.]

Should the collisions be organized in a binary tree?

Hash table and name space (orthogonal)

[Figure: the hash table, indexed 0..M-1 with Λ-terminated collision lists, refers to names stored in the name space: base, indice, comienzo, x.]

4. Block structured Languages

• Block Structured Languages (Algol 60):
  – Each line of a program belongs to a series of nested (open) blocks (scopes)
  – The most internal is the current scope
  – Blocks or scopes not containing the current line are closed

• Visibility rules: – The block in which an object is declared is its scope – In any line of a program, only objects declared in the current and open blocks are accessible. – New declarations are only legal in the current scope. – If an object is declared in more than one open scope, the most internal declaration is used (the most recent).

• Implications:
  – The last scope to open is the first to close.
  – Once a scope is closed, it will not be reopened.
  – Once a scope is closed, all its declarations become inaccessible.

Examples: Pascal
(the number at the end of a block is its nesting level)

    program pr;
    var a : char;

    function f (i : integer): integer;
    var d : real;
    begin
      ..
    end;                          { 1 }

    procedure p (var j: char);
    var i, f : real;
      procedure q;
      var i, c : integer;
      begin
        f := i*d;                 { ? }
      end;                        { 2 }
    begin
      ..
    end;                          { 1 }

    begin
      ...
    end.                          { 0 }

Examples: C
(the number at the end of a block is its nesting level)

    char a;

    int f (int i, int j, char* c)
    {
      double d;
      ..
    }                             /* 1 */

    void g (char j, int c)
    {
      double i, f;
      static int d;
      ..
      {
        int i, j, c;
        f = i*a;                  /* ? */
      }                           /* 2 */
    }                             /* 1 */

    main ()
    {
      ...
    }                             /* 0 */

4. Block structured languages
• We consider static scopes: during compilation, every reference to an object can be resolved.
• Dynamic scopes: this has to be done during execution.

    a : integer              -- global declaration
    procedure first
      a := 1
    procedure second
      a : integer            -- local declaration
      first ()
    a := 2
    if read_integer() > 0
      second ()
    else
      first ()
    write_integer(a)

What does this program write in each case?

4. Block structured languages

• Alternative 1: a separate table for each scope:
  [Figure: scope stack; the current scope is on top, above the other open scopes, with the most external at the bottom.]
  – Open a scope: push()
  – Close a scope: pop()
  – search(): from the top downwards.
• Advantage: operation pop() is highly efficient.
• Disadvantage 1: efficiency in search() depends on depth of nesting.
• Disadvantage 2: each scope requires a separate hash table, waste of space.

4. Block structured languages
• Alternative 2: a single table.
  – Each object is associated with the number of its scope.
  – New names go at the beginning of the collision list.
  – When closing a scope, all its names are erased.

    {               <- A
      int a,b;
      ....
      {             <- B
        int b;
        ...
      }             <- C
      ...
    }               <- D

  Collision list at each point (name and scope number):
    A:  b 1 → a 1 → Λ
    B:  b 2 → b 1 → a 1 → Λ
    C:  b 1 → a 1 → Λ
    D:  Λ

• Advantage: searching is very efficient
• Disadvantage: elimination is less efficient
  – In cases in which the symbol table is saved, this operation is not used
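
The single-table alternative can be sketched in C as a chained hash table whose entries carry a scope number. This is an illustrative sketch, not the course's tabla.c: the type SIM, the bucket count TAM and the functions dispersion, introducir, buscar and cerrar_bloque are hypothetical names, and the hash is the Horner scheme shown earlier.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TAM 7                       /* number of buckets (hypothetical) */

typedef struct sim {
    char *nombre;
    int nivel;                      /* scope number of the declaration */
    struct sim *sig;
} SIM;

static SIM *tabla[TAM];

static unsigned dispersion(const char *s) {
    unsigned h = 0;
    for (; *s != '\0'; s++)
        h = (256u * h + (unsigned char)*s) % TAM;
    return h;
}

/* New names go at the head of the collision list. Returns NULL if the
   name is already declared in the same scope or memory runs out. */
SIM *introducir(const char *nombre, int nivel) {
    unsigned i = dispersion(nombre);
    assert(nivel >= 0);
    for (SIM *p = tabla[i]; p != NULL && p->nivel == nivel; p = p->sig)
        if (strcmp(p->nombre, nombre) == 0) return NULL;
    SIM *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->nombre = malloc(strlen(nombre) + 1);
    if (s->nombre == NULL) { free(s); return NULL; }
    strcpy(s->nombre, nombre);
    s->nivel = nivel;
    s->sig = tabla[i];
    tabla[i] = s;
    return s;
}

/* The first match in the list is the innermost (most recent) declaration. */
SIM *buscar(const char *nombre) {
    for (SIM *p = tabla[dispersion(nombre)]; p != NULL; p = p->sig)
        if (strcmp(p->nombre, nombre) == 0) return p;
    return NULL;
}

/* Closing a scope: its entries sit at the head of every list, but every
   bucket has to be visited, which is why elimination is less efficient. */
void cerrar_bloque(int nivel) {
    for (int i = 0; i < TAM; i++)
        while (tabla[i] != NULL && tabla[i]->nivel == nivel) {
            SIM *p = tabla[i];
            tabla[i] = p->sig;
            free(p->nombre);
            free(p);
        }
}
```

The sketch relies on scopes being opened and closed in nesting order, so that the entries of the current scope are always contiguous at the front of each list.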

Alternative 2: a single table
• This alternative is not appropriate when collisions are stored in binary trees:
  [Figure: binary tree of (name, attributes) nodes.]

• Insertion is always at the leaves. • Searching also has to go to the leaves to find the most internal definition of an object. • Closing a scope implies traversing the whole tree.

The order of insertion is lost!

Summary
• Components of a symbol table:
  1. Name space: stores the names of the symbols introduced in the program.
  2. Access mechanism: used to initiate the search for a symbol.
     » Lists
     » Binary trees
     » Hash tables
  3. Attribute table: it is where attributes associated to a given symbol are stored.
• Scope structure defined in the program must be preserved.
  – Scope stack: more efficient update, inefficient search.
  – A single table for all scopes: highly efficient search, inefficient update.

5. Perspective
• Records and fields: In C, PASCAL and ADA the field name should be unique only within the record definition:

    type r1 = record
           a : integer;
           b : real;
         end;
         r2 = record
           a : char;
           b : boolean;
         end;

    type a = record
           a : integer;
           b : record
                 a : char;
                 b : boolean;
               end;
         end;

• This increases legibility, and ease of programming.

How can this be considered in the symbol table avoiding ambiguity?

5. Perspective
• Forward references: some languages allow the use of names before they are declared:

• Pascal: records

    type asignatura = record
           profesor : ^empleado;
           delegado : ^alumno;
           ...
         end;

         empleado = record
           ...
         end;

         alumno = record
           cursa : ^asignatura;
           ...
         end;

  – This is allowed.

• ADA: labels

    goto etiqueta;
    ......
    <<etiqueta>>

  – Basically allowed to support automatic code generators
  – In some cases this helps avoid excessive nesting of statements

5. Perspective
• Visibility: visibility rules can be different in special situations:
  – PASCAL: names of record fields are not visible without the record name.

    type r = record
           n : integer;
           c : char;
           b : boolean;
         end;
    var v : r;
    .....
    n := 0;

  – ADA: prefixes

    procedure P;
    var X : real;
      procedure Q;
      var X : real;
      begin
        ...
        X := P.X;
        ...

5. Perspective
• Altering visibility rules: these rules can be altered by the programmer:
  – PASCAL: with statement

    type r = record
           n : integer;
           c : char;
           b : boolean;
         end;
    var v : r;
    .....
    with v do
      n := 0;

• Problems:

    var v1, v2 : r;
    ......
    with v1, v2 do
      n := 0;

• Solved in Modula-3:

    WITH e = whatever, f = whatever DO
      e.f1 = f.f2;
    END;

5. Perspective
• Separate compilation: an executable is built with modules compiled separately.
  – C: no such thing as library during compilation.

    p.h:
      typedef ..... t;
      void q(.....);
      t1 f(.....);

    q.c:
      #include "p.h"
      .....

  – ADA: this is based on the concept of library

    p.ads:
      package P is
        type t is private;
        procedure q(.....);
        function f(.....) return t1;
        .....
      end P;

    q.ads:
      with P;

5. Perspective
• Overloading: the same symbol has different meaning depending on the context.
  – PASCAL: math operations.

    var i, j : integer;
        r, s : real;
    ....
    j := i * 2;
    s := r * 2.0;

  – PASCAL: functions

    function f (n : integer) : integer;
    begin
      if n > 1 then
        f := n * f (n-1)
      else
        f := 1
    end;

  – ADA: generalized concept of overloading.

    function "+" (x, y : complejo) return complejo is ...
    function "+" (u, v : polar) return polar is ...

6. tabla.c and tabla.h
• Implementation of a symbol table as an open hash table.

#include

    #define TAMANO 7

    typedef struct {
      char *nombre;
      int nivel;
      TIPO_SIMBOLO tipo;
      TIPO_VARIABLE variable;
      int dir;
      ...
    } SIMBOLO;

    typedef LISTA TABLA_SIMBOLOS[TAMANO];

    SIMBOLO *buscar_simbolo(TABLA_SIMBOLOS tabla, char *nombre)
    /* NULL if the name does not appear in the table;
       otherwise, pointer to the most recent symbol */

SIMBOLO *introducir_variable (TABLA_SIMBOLOS tabla, char *nombre, TIPO_VARIABLE variable, int nivel, int dir);

/* NULL if the name already appears in the table at the same level;
   otherwise, pointer to the symbol created with these data */

7. Exercises
1. Consider the possibility that your compiler should detect variables used prior to having a value assigned.

    programa p;
    entero i;
    principio
      escribir(i);
    fin

How can a variable be assigned a value?
  1. Assignment statement
  2. Read statement
  3. Reference parameters
When is a variable's value obtained?
  1. Expressions
  2. Write statement
  3. Value parameters

Task a) Explain what changes are required in the symbol table (both in the definition of the SIMBOLO type and in its associated functions) in order to detect these situations.

    typedef struct {
      char *nombre;
      int nivel;
      ...
      int asignada;
    } SIMBOLO;

    SIMBOLO *introducir_variable (....)
    {
      ...
      simbolo.asignada = 0;
      ...
    }

Task b) Detail required changes in production rules to implement this possibility.

    leer: tLEER '(' lista_asignables ')' ';'
    ;

    lista_asignables: tIDENTIFICADOR
        {
          ...
          $1.simbolo.asignada = 1;
          ...
        }
      | lista_asignables ',' tIDENTIFICADOR
    ;

Task b) Detail required changes in production rules to implement this possibility.

    expresion: ...

    factor: tIDENTIFICADOR
        {
          ...
          if (!$1.simbolo.asignada)
            error("Variable sin valor asignado.");
          ...
        }

Task c) Does your solution have any limitations? May it not detect SOME cases?

    programa p;
    entero i, j;
    principio
      leer (j);
      si j > 0 ent
        i := 1;
      fsi
      escribir (i);
    fin

    programa p;
    entero i;
    accion a;
    principio
      escribir(i);
    fin
    principio
      i := 1;
      a(i);
    fin

Initializers
2. Consider the possibility of using initializers for simple variables in Pascual, as in the following example:

programa p;

    var n = 1, c = "c", b = true;
    .....
    accion q (entero E i; caracter ES d);
    var j = 2*i+1,
        g = entacar(caraent(d)+1),
        f = not b;
    .....

• As you can see, the proposed syntax explicitly declares the initial value; the value type is implicit.

Task a) Detail the production rules that syntactically specify this type of variable declaration.

    inicializadores: tVAR lista
    ;

    lista: inicializador
      | lista ',' inicializador
    ;

    inicializador: tIDENTIFICADOR '=' expresion
    ;

Task b) Complete the production rules to include pertinent semantic verifications and symbol table updates. Define the required attributes for each symbol.

Functions
3. Consider the following Pascal program:

    function f (...... ) : .....;
      function g (...... ) : .....;
      begin
        .....
        f := .....;
        .....
      end
    begin
      .....
    end

• Task a) From the point of view of the compiler, is this acceptable? Justify your answer. • Task b) Does this happen in C? Why?

Scopes
4. In a single scope language, with variables of a single type, and no declarations, can the lexical analyzer manage the symbol table?

    10 REM simple BASIC program
    20 REM
    30 z = 0
    40 IF a > 0 GOTO 60
    50 z = 1
    60 PRINT z, a

    {identificador}  {
                       if (!buscar_simbolo(tabla, yytext))
                         introducir_variable(tabla, yytext,...);
                     }
