BNF • Syntax Diagrams
Total Page:16
File Type:pdf, Size:1020Kb
Some Preliminaries • For the next several weeks we’ll look at how one can define a programming language • What is a language, anyway? “Language is a system of gestures, grammar, signs, 3 sounds, symbols, or words, which is used to represent and communicate concepts, ideas, meanings, and thoughts” • Human language is a way to communicate representaons from one (human) mind to another Syntax • What about a programming language? A way to communicate representaons (e.g., of data or a procedure) between human minds and/or machines CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. IntroducCon Why and How We usually break down the problem of defining Why? We want specificaons for several a programming language into two parts communies: • Other language designers • defining the PL’s syntax • Implementers • defining the PL’s semanBcs • Machines? Syntax - the form or structure of the • Programmers (the users of the language) expressions, statements, and program units How? One way is via natural language descrip- Seman-cs - the meaning of the expressions, Bons (e.g., user manuals, text books) but there statements, and program units are more formal techniques for specifying the There’s not always a clear boundary between syntax and semanBcs the two Syntax part Syntax Overview • Language preliminaries • Context-free grammars and BNF • Syntax diagrams This is an overview of the standard process of turning a text file into an executable program. IntroducCon Lexical Structure of Programming Languages A sentence is a string of characters over some alphabet (e.g., def add1(n): return n + 1) • The structure of its lexemes (words or tokens) – token is a category of lexeme A language is a set of sentences • The scanning phase (lexical analyser) collects characters into tokens A lexeme is the lowest level syntacBc unit of a • Parsing phase (syntacBc analyser) determines language (e.g., *, add1, begin) syntacBc structure A token is a category of lexemes (e.g., idenfier) Stream of tokens and Result of characters values parsing Formal approaches to describing syntax: • Recognizers - used in compilers lexical Syntactic • Generators - what we'll study analyser analyser Formal Grammar Role of Grammars in PLs • A (formal) grammar is a set of rules for • A grammar defines a (programming) strings in a formal language language, at least w.r.t. its syntax • The rules describe how to form strings • It can be used to from the language’s alphabet that are valid – Generate sentences in the language according to the language's syntax – Test whether a string is in the language • A grammar does not describe the meaning – Produce a structured representation of of the strings or what can be done with the string that can be used in them in whatever context — only their form subsequent processing Adapted from Wikipedia Grammars Context-Free Grammars • Developed by Noam Chomsky in the mid-1950s • Language generators, meant to describe the syntax of natural languages • Define a class of languages called context-free languages Backus Normal/Naur Form (1959) • Invented by John Backus to describe Algol 58 and refined by Peter Naur for Algol 60. • BNF is equivalent to context-free grammars •Chomsky & Backus independently came up with equiv- alent formalisms for specifying the syntax of a language •Backus focused on a pracBcal way of specifying an BNF (conCnued) arBficial language, like Algol •Chomsky made fundamental contribuBons to mathe- macal linguisBcs and was moBvated by the study of A metalanguage is a language used to describe human languages another language. In BNF, abstracons are used to represent classes of syntacBc structures -- they act like NOAM CHOMSKY, MIT InsBtute Professor; syntacBc variables (also called nonterminal Professor of LinguisBcs, LinguisBc Theory, Syntax, symbols), e.g. SemanBcs, Philosophy of Language <while_stmt> ::= while <logic_expr> do <stmt> This is a rule; it describes the structure of a while Six parBcipants from the 1960 Algol conference at the 1974 statement ACM conference on the history of programming languages. Top: John McCarthy, Fritz Bauer, Joe Wegstein. Bocom: John Backus, Peter Naur, Alan Perlis. BNF Non-terminals, pre-terminals & terminals • A rule has a leh-hand side (LHS) which is a single • A non-terminal symbol is any symbol in the RHS of a non-terminal symbol and a right-hand side (RHS), rule. They represent abstracBons in the language, e.g. one or more terminal or non-terminal symbols <if-then-else-statement> ::= if <test> • A grammar is a finite, nonempty set of rules then <statement> else <statement> • A non-terminal symbol is “defined” by its rules • A terminal symbol (AKA lexemes) is a symbol that’s not • MulBple rules can be combined with the verBcal-bar in any rule’s LHS. They are literal symbols that will ( | ) symbol (read as “or”) appear in a program (e.g., if, then, else in rule above) • These two rules: <if-then-else-statement> ::= if <test> <stmts> ::= <stmt> then <statement> else <statement> <stmts> ::= <stmnt> ; <stmnts> • A pre-terminal symbol is one that appears as a LHS of are equivalent to this one: rules, but in every case, the RHSs consist of single <stmts> ::= <stmt> | <stmnt> ; <stmnts> terminal symbol, e.g., <digit> in <digit> ::= 0 | 1 | 2 | 3 … 7 | 8 | 9 BNF BNF Example • RepeBBon is done with recursion A simple grammar for a subset of English • E.g., SyntacBc lists are described in BNF A sentence is noun phrase and verb phrase using recursion followed by a period ! • An <ident_list> is a sequence of one or more <sentence> ::= <nounPhrase> <verbPhrase> .! <nounPhrase> ::= <article> <noun>! <ident>s separated by commas. <article> ::= a | the! <noun> ::= man | apple | worm | penguin! <verbPhrase> ::= <verb>|<verb><nounPhrase>! <ident_list> ::= <ident> | <verb> ::= eats | throws | sees | is! <ident> , <ident_list> Derivaons Derivaon using BNF • A derivation is a repeated application of rules, beginning with the start symbol and ending with a sequence of just Here is a derivation for “the man eats the apple.” terminal symbols • It demonstrates, or proves, that the derived sentence is <sentence> -> <nounPhrase> <verbPhrase> . “generated” by the grammar and is thus in the language <article> <noun> <verbPhrase> . that the grammar defines the <noun> <verbPhrase> . • As an example, consider our baby English grammar the man <verbPhrase> . <sentence> ::= <nounPhrase><verbPhrase>.! the man <verb> <nounPhrase> . <nounPhrase> ::= <article><noun>! This is a <article> ::= a | the! leEmost the man eats <nounPhrase> . <noun> ::= man | apple | worm | penguin! derivaon the man eats <article> < noun> . <verbPhrase> ::= <verb> | <verb><nounPhrase>! <verb> ::= eats | throws | sees | is the man eats the <noun> . the man eats the apple . Derivaon Another BNF Example <program> -> <stmts> Note: There is some Every string of symbols in the derivaon is a varia-on in nota-on <stmts> -> <stmt> for BNF grammars. senten-al form (e.g., the man eats <nounPhrase> .) | <stmt> ; <stmts> Here we are using -> in the rules instead <stmt> -> <var> = <expr> of ::= . A sentence is a sentenBal form that has only <var> -> a | b | c | d terminal symbols (e.g., the man eats the apple .) <expr> -> <term> + <term> | <term> - <term> <term> -> <var> | const A leEmost deriva-on is one in which the leh- most nonterminal in each sentenBal form is the one that is expanded in the next step A derivaon may be either lehmost or rightmost or something else Another BNF Example Finite and Infinite languages <program> -> <stmts> Note: There is some varia-on in nota-on <stmts> -> <stmt> for BNF grammars. • A simple language may have a finite number | <stmt> ; <stmts> Here we are using -> in the rules instead of sentences <stmt> -> <var> = <expr> of ::= . <var> -> a | b | c | d • The set of strings represenBng integers <expr> -> <term> + <term> | <term> - <term> between -10**6 and +10**6 is a finite <term> -> <var> | const language Here is a derivaon: • A finite language can be defined by enumer- <program> => <stmts> => <stmt> ang the sentences, but using a grammar is => <var> = <expr> usually easier (e.g., small integers) => a = <expr> => a = <term> + <term> • Most interesBng languages have an infinite => a = <var> + <term> number of sentences => a = b + <term> => a = b + const! Is English a finite or infinite language? Rules with epsilons • Assume we have a finite set of words • An epsilon (ε) symbol expands to ‘nothing’ • Consider adding rules like the following to the • It is used as a compact way to make previous example something optional <sentence> ::= <sentence><conj><sentence>.! • For example: <conj> ::= and | or | because NP -> DET NP PP • Hint: When you see recursion in a BNF, the PP -> PREP NP | ε language is probably infinite PPEP -> in | on | of | under –When might it not be? • You can always rewrite a grammer with ε –The recursive rule might not be reachable. symbols to an equivalent one without There might be epsilons. Parse Tree Another Parse Tree A parse tree is a hierarchical representation of a derivation <program>! ! ! <sentence> <stmts>! ! <nounPhrase> <verbPhrase> ! <stmt>! ! <article> <noun> <verb> <nounPhrase> ! <var> = <expr>! ! the man eats ! <article> <noun> a <term> + <term>! ! the apple ! <var> const! ! ! b ! Ambiguous Grammar Ambiguous English Sentences • A grammar is ambiguous if and only if • (iff) it generates a sentenBal form that I saw the man on the hill with a has two or more disBnct parse trees telescope • Ambiguous grammars are, in general, • Time flies like an arrow very undesirable in formal languages • Fruit flies like a banana