Lab Manual Contents for Compiler

LEX and YACC

Introduction

Before 1975 writing a compiler was a very time-consuming process. Then Lesk [1975] and Johnson [1975] published papers on lex and yacc. These utilities greatly simplify compiler writing.

Lex and Yacc are two very important and powerful tools on UNIX and LINUX. Lex and yacc are tools used to generate lexical analyzers and parsers. In fact, they are so powerful that building compilers for FORTRAN or C. These tools used for you to write your own language and its compiler!

This manual covers regular expressions, declarations, matching patterns, variables, Yacc grammar, and parser code. At the end, it explains how to tie Lex and Yacc together. Lex stands for Lexical Analyzer. Yacc stands for yet another Compiler Compiler.

1. Basics of Linux commands used for compiler design

1.1. Working With files and Directories

Command Summary Use cd  Use cd to change directories.  Type cd followed by the name of a directory to access that directory.  Keep in mind that you are always in a directory and can navigate to directories hierarchically above or below.  Ex: cd compiler  To move up one directory, use the shortcut command.  Ex: cd ..  Use cp -r to copy a directory and all of its contents.  Type cp -r followed by the name of an existing directory and the name of the new directory.

 Ex: cp -r compiler newdir mkdir . Use mkdir to make/create a brand new directory

. Type mkdir followed by the name of a directory.

. Ex: mkdir testdir mv o Use mv to change the name or to rename of a directory.

o Type mv followed by the current name of a directory and the new name of the directory.

o Ex: mv testdir newnamedir

 Trying to find out where on your Linux server you currently are located?

 The pwd (print working directory/path) command will show you the full path to the pwd directory you are currently in.

 This is very handy to use, especially when you design compiler, to know where you are currently! rmdir  Use rmdir to remove an existing directory (assuming you have permissions set to allow this). Type rmdir followed by a directory's name to remove it.  Ex: rmdir testdir

 You CAN'T remove a directory that contains files with this command. The rmdir command is used mostly to remove empty directories.

 A more useful command is rm -r that removes directories and files within the directories.

Table 1 Exercises Ex-1 To create a folder or compilerFolder  mkdir compilerFolder To work on the new directory compilerFolder  cd compilerFolder Then now creat two directories namely lex and yacc in side the compilerFolder  mkdir lex  mldir yacc Now work on the directory yacc  cd yacc To move up one directory, use the shortcut command.  cd .. To move to your home directories use  cd

Ex-2 Create the same copy of the directory compiler Folder and give name lexandyacc  cp -r compilerFolder lexandyacc Work on the directory /lexandyacc/lex  cd lexandyacc  cd lex Print the directory to know where you are currently  pwd Rename the folder lex to lexical  cd ..  mv lex lexical

Ex-3 Try to delete the directory compilerFolder  rmdir compiler Folder You will get error message why? And try to delete the folder lex  cd compilerFolder  rmdir lex You deleted it why? Try to delete the directory complierFolder using rm -r  cd  rm –r compilerFolder You deleted it why?

1.2. Working with text editor, vi

Command Summary Use vi  When you start vi, you're in Command mode.  To enter Input mode, type the letter a (lowercase only) to signal that you want to add text after the cursor position.  Press esc to switch back to Command mode at any time.  Here's how to create a file from scratch using vi. To start, create a new file named compiler.l by typing  vi compiler.l

ZZ Saving Your Work

 Press esc to enter Command mode;

 Then type ZZ (to put your file to sleep). You won't see any Z's on the screen, but after you've entered the second Z, your file will disappear, your Linux command prompt will return, and your file was successfully saved. Table 2

Ex-4

Congratulations--you've just survived your first encounter with vi. You know that the a command switches to Input mode, esc gets you back to Command mode, and ZZ saves the file, but you'll have to expand this to get any real work done. Common vi Commands

Have a look at this list of common vi commands (there are many more, but these will at least allow you to get some basic work done).

Positioning the Cursor

 Move cursor one space right.  Move cursor one space left. Move cursor up one line.  Move cursor down one line. ctrl-F Move forward one screen. ctrl-B Move backward one screen. $ Move cursor to end of line. ^ Move cursor to beginning of line. :1 Move to first line of file :$ Move to last line of file / Search for a character string. ? Reverse search for a character string. x Delete the character at the cursor position. dd Delete the current line. p Paste data that was cut with x or dd commands. u Undo.

Table 3

Entering Input Mode

a Add text after the cursor. i Insert text before the cursor. R Replace text starting at the cursor. o Insert a new line after the current one.

Table 4

Entering Command Mode

esc Switch from Input mode to Command mode.

Table 5

Exiting or Saving Your File

:w Write file to disk, without exiting editor. ZZ Save the file and exit. :q! Quit without saving.

Table 6

2. Basics of C programming 2.1. The stander output Function printif()

We have already seen that interactive output in C require the use of printf. Perhaps the simple from of printf is a user prompt for input, such as: printf(“This Is Your First C Program”);

Writing the source the source of our first program is below:

Code:

#include main() { printf("This is your first C program\n"); } Save the program above and call it simplelpro.c here is the breakdown of the program above. The first line #include is a preprocessor directive. This basically tells the compiler that we are using the functions that are in the stdio library. The stdio library contains all the basic functions needed for basic input and output for our program.

The second line main() is a required function for every C program. main() is the starting point for the program. Like all functions the body begins with a { (open curly brace) and ends with a } (close curly brace).

The body of our main function printf("This is your first C program\n");is a function call to the printf function. printf stands for print formatted and has complex rules for printing text, numbers to specific formats. Here we are just displaying the test “This is your first C program " to the screen. The \n at the end of string is a newline. It tells printf to end the line and start any additional text on the next line. All function calls in C must end in a;

Compiling the Source +The compile is done from the command line.

Code: $ gcc -o simplelpro simplelpro.c $ gcc is the GNU C compiler. The -o option tells it what to name the output file and the simplelpro.c is the source file.

The output from the compiler will be a binary file called simplelpro Running the executable

In order to run our sample executable we will need to apply the execute permission to the file. Then we will execute it after. Code:

$ chmod 744 simplelpro $ ./simplelpro This is your first C program

The output of our sample program produced the text " This is your first C program " to the console (or screen). Try to add some more text to the sample program, recompile and watch the output.

2.2 scanf: Reading User Values Try this program and run it:

#include int main() { int a, b, c; printf("Enter the first value:"); scanf("%d", &a); printf("Enter the second value:"); scanf("%d", &b); c = a + b; printf("%d + %d = %d\n", a, b, c); return 0; }

The %d is a placeholder that will be replaced by the value of the variable b when the printf statement is executed. You can print all of the normal C types with printf by using different placeholders:

← int (integer values) uses %d ← float (floating point values) uses %f ← char (single character values) uses %c ← character strings (arrays of characters) use %s ← The scanf function uses the same placeholders as printf: ← int uses %d ← float uses %f ← char uses %c ← character strings (arrays of characters) use %s The Lexical Analyzer, Lex

Lex is a tool for generating scanners. Scanners are programs that recognize Lexical patterns in text. These lexical patterns (or regular expressions) are defined in a particular syntax.

A matched regular expression may have an associated action. This action may also include returning a token. When Lex receives input in the form of a file or text, it attempts to match the text with the regular expression. It takes input one character at a time and continues until a pattern is matched.

If a pattern can be matched, then Lex performs the associated action (which may include returning a token). If, on the other hand, no regular expression can be matched, further processing stops and Lex displays an error message.

Lex and C are tightly coupled. A .lex file (files in Lex have the extension. lex) is passed through the lex utility, and produces output files in C. These file(s) are compiled to produce an executable version of the lexical analyzer.

Regular expressions in Lex

An expression is made up of symbols. Normal symbols are characters and numbers, but there are other symbols that have special meaning in Lex. The following two tables define some of the symbols used in Lex and give a few typical examples.

Defining regular expressions in Lex

A-Z, 0-9, a-z Characters and numbers that form part of the pattern. . Matches any character except \n. - Used to denote range. Example: A-Z implies all characters from A to Z. [ ] A character class. Matches any character in the brackets. If the first character is ^ then it indicates a negation pattern. Example: [abC] matches either of a, b, and C. * Match zero or more occurrences of the preceding pattern. + Matches one or more occurrences of the preceding pattern. ? Matches zero or one occurrences of the preceding pattern. $ Matches end of line as the last character of the pattern.

{ } Indicates how many times a pattern can be present. Example: A{1,3} implies one or three occurrences of A may be present. \ Used to escape meta characters. Also used to remove the special meaning of characters as defined in this table.

^ Negation. | Logical OR between expressions.

"" Literal meanings of characters. Meta characters hold. / Look ahead. Matches the preceding pattern only if followed by the succeeding expression. Example: A0/1 matches A0 only if A01 is the input. ( ) Groups a series of regular expressions. \n New line \t White space

Table 7 Token

A token is a language element that can be used in forming higher-level language constructors. C has six kinds of tokens:

1Reserved Words (Key Word) 2. Identifiers 3. Constants 4. String Literals 5. Punctuators 6. Operators

Examples of token declarations

Token Associated Meaning expression (chars)+ 1 ([0-9])+ number 1 or more occurrences or more of a digit occurrences of chars

chars [A-Za-z] Any character

blank A " " blank space word abc abc abc* … ab, abc, abcc, abccc, abc abc, abcc, abccc, … a(bc)+ abc, abcbc, abcbcbc, … a(bc)? a, abc [abc] a, b, c [a-z] any letter, a through z [a\-z] a, -, z [A-Za-z0-9]+ one or more alphanumeric characters [ \t\n]+ whitespace [^ab] anything except: a, b [a^b] a, ^, b [a|b] a, |, b a|b A or b

Table -8 Lex

The program Lex generates a so called `Lexer'. This is a function that takes a stream of characters as its input, and whenever it sees a group of characters that match a key, takes a certain action. A very simple example:

1) Ex-1

%{ #include %}

%%

Well printf("Well command received\n"); Compiler printf("Compiler command received\n");

%%

The first section, in between the %{ and %} pair is included directly in the output program. We need this, because we use printf later on, which is defined in stdio.h. Sections are separated using '%%', so the first line of the second section starts with the 'Well' key. Whenever the 'Complier' key is encountered in the input, the rest of the line (a printf() call) is executed.

We terminate the code section with '%%' again. To compile Example 1, do this: lex example1.l cc lex.yy.c −o example1 −ll This will generate the file 'example1'. If you run it, it waits for you to type some input. Whenever you type something that is not matched by any of the defined keys (i.e., 'Well' and 'Compiler') its output again. If you enter 'Compiler' it will output 'Compiler command received';

2) Ex-2 This example wasn't very useful in itself, and our next one won't be either. It will however show how to use regular expressions in Lex, which are massively useful later on. %{ #include %} %% [0123456789]+ printf("NUMBER\n"); [a− zA − Z][a − zA − Z0 − 9]* printf("WORD\n"); [0-9][a− zA − Z0 − 9]* printf("ERROR\n"); %%

This Lex file describes three kinds of matches (tokens): WORDs and NUMBERs and ERROR. Regular expressions can be little work and it is easy to understand them.  Let's examine the NUMBER match:

[0123456789]+ This says: a sequence of one or more characters from the group 0123456789. We could also have written it shorter as: [0−9]+

 Now, the WORD match is somewhat more involved:

[a−zA−Z][a−zA−Z0−9]* The first part matches 1 and only 1 character that is between 'a' and 'z', or between 'A' and 'Z'. In other words, a letter. This initial letter then needs to be followed by zero or more characters which are either a letter or a digit. Why use an asterisk here? The '+' signifies 1 or more matches, but a WORD might very well consist of only one character, which we've already matched. So the second part may have zero matches, so we write a '*'.

 Then, the ERROR matches if it starts by Number and then it can be any number of letters or numbers.

[0-9][a− zA − Z0 − 9]*  This way, we've mimicked the behavior of many programming languages which demand that a variable name *must* start with a letter, but can contain digits afterwards. In other words, 'idnumber1' is a valid name, but '1idnnumber' is not.

 Another important thing here is if you press space or new line it starts to tokenize another string.

3) Ex-3 Let's say we want to parse a file that looks like this: WORDs, like 'zone' and 'type' FILENAMEs, like '/etc/bind/db.root' QUOTEs, like those surrounding the filename OBRACEs, { EBRACEs, } SEMICOLONs, ;

%{ #include %}

%%

[a− zA − Z][a − zA − Z0 − 9]* printf("WORD "); [a− zA − Z0 − 9\/. − ]+ printf("FILENAME "); \" printf("QUOTE "); \{ printf("OBRACE "); \} printf("EBRACE "); ; printf("SEMICOLON ");

The Syntax Analyzer, Yacc YACC can parse input streams consisting of tokens with certain values. This clearly describes the relation YACC has with Lex, YACC has no idea what 'input streams' are, it needs preprocessed tokens. While you can write your own Tokenizer, we will leave that entirely up to Lex.

A note on grammars and parsers. When YACC saw the light of day, the tool was used to parse input files for compilers: programs. Programs written in a programming language for computers are typically *not* ambiguous − they have just one meaning. As such, YACC does not cope with ambiguity and will complain about shift/reduce or reduce/reduce conflicts.

Lex Yacc interaction

Conceptually, lex parses a file of characters and outputs a stream of tokens; yacc accepts a stream of tokens and parses it, performing actions as appropriate. In practice, they are more tightly coupled.

If your lex program is supplying a tokenizer, the yacc program will repeatedly call the yylex routine. The lex rules will probably function by calling return every time they have parsed a token. We will now see the way lex returns information in such a way that yacc can use it for parsing.

The shared header files of return codes

If lex is to return tokens that yacc will process, they have to agree on what tokens there are. This is done as follows. • The yacc file will have token definitions

%token NUMBER in the definitions section.

• When the yacc file is translated with yacc -d, a header file y.tab.h is created that has definitions like

#define NUMBER 258

This file can then be included in both the lex and yacc program. • The lex file can then call return NUMBER, and the yacc program can match on this token.

Structure of a yacc file A yacc file looks much like a lex file:

...definitions... %% ...rules... %% ...code... In the example you just saw, all three sections are present:

Definitions: All code between %{ and %} is copied to the beginning of the resulting C file. Rules: A number of combinations of pattern and action: if the action is more than a single command it needs to be in braces. Code: This can be very elaborate, but the main ingredient is the call to yylex, the lexical analyzer. If the code segment is left out, a default main is used which only calls yylex.

E.g. -1

A simple thermostat controller

Let's say we have a thermostat that we want to control using a simple language. A session with the thermostat may look like this: heat on Heater on! heat off Heater off! target temperature 22 New temperature set!

The tokens we need to recognize are: heat, on/off (STATE), target, temperature, NUMBER.

The Lex tokenizer egheat.l

%{ #include #include "y.tab.h" %} %% [0− 9]+ return NUMBER; heat return TOKHEAT; on|off return STATE; target return TOKTARGET; temperature return TOKTEMPERATURE; \n /* ignore end of line */; [ \t]+ /* ignore whitespace */; %%

 We note two important changes. First, we include the file 'y.tab.h', and secondly, we no longer print stuff, we return names of tokens. This change is because we are now feeding it all to YACC, which isn't interested in what we output to the screen. y.tab.h has definitions for these tokens.  But where does y.tab.h come from? It is generated by YACC from the Grammar File we are about to create.

 As our language is very basic, so is the grammar: commands: /* empty */ | commands command ; command: heat_switch | target_set ; heat_switch: TOKHEAT STATE { printf("\tHeat turned on or off\n"); } ; target_set: TOKTARGET TOKTEMPERATURE NUMBER { printf("\tTemperature set\n"); } ; A complete YACC file

 The yyerror() function is called by YACC if it finds an error. We simply output the message passed.

 The function yywrap() can be used to continue reading from another file. It is called at EOF and you can than open another file, and return 0. Or you can return 1, indicating that this is truly the end.

 Then there is the main() function, that does nothing but set everything in motion.

 The last line simply defines the tokens we will be using. These are output using y.tab.h if YACC is invoked with the '−d' option.

Compiling & running the thermostat controller lex egheat.l yacc − d egheat.y cc lex.yy.c y.tab.c − o egheat