
© CS 536 Fall 2013

Symbol Tables & Scoping

Programming languages use scopes to limit the range in which an identifier is active (and visible).

Within a scope a name may be defined only once (though overloading may be allowed).

A symbol table (or dictionary) is commonly used to collect all the definitions that appear within a scope.

At the start of a scope, the symbol table is empty. At the end of a scope, all declarations within that scope are available within the symbol table.

A language definition may or may not allow forward references to an identifier.

If forward references are allowed, you may use a name that is defined later in the scope (Java does this for field and method declarations within a class).

If forward references are not allowed, an identifier is visible only after its declaration. C++ and Java do this for variable declarations.

In CSX no forward references are allowed.

In terms of symbol tables, forward references require two passes over a scope. First all declarations are gathered. Next, all references are resolved using the complete set of declarations stored in the symbol table.

If forward references are disallowed, one pass through a scope suffices, processing declarations and uses of identifiers together.

Block Structured Languages

• Introduced by Algol 60, includes C, C++, CSX and Java.
• Identifiers may have a non-global scope. Declarations may be local to a class, subprogram or block.
• Scopes may nest, with declarations propagating to inner (contained) scopes.
• The lexically nearest declaration of an identifier is bound to uses of that identifier.
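The one-pass versus two-pass distinction described earlier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a scope is modeled as a list of hypothetical ("decl", name) and ("use", name) pairs, and the function names are invented for this sketch rather than taken from CSX or any compiler.

```python
# Hypothetical model: a scope is a list of ("decl", name) / ("use", name) items.

def resolve_one_pass(items):
    """No forward references: a name is visible only after its declaration."""
    table = set()
    unresolved = []
    for kind, name in items:
        if kind == "decl":
            table.add(name)          # declaration enters the table as it is seen
        elif name not in table:
            unresolved.append(name)  # use appears before (or without) a declaration
    return unresolved


def resolve_two_pass(items):
    """Forward references allowed: gather all declarations, then resolve uses."""
    table = {name for kind, name in items if kind == "decl"}   # pass 1
    return [name for kind, name in items
            if kind == "use" and name not in table]            # pass 2


scope = [("use", "f"), ("decl", "f"), ("decl", "g"), ("use", "g"), ("use", "h")]
print(resolve_one_pass(scope))  # ['f', 'h']
print(resolve_two_pass(scope))  # ['h']
```

In the one-pass version the early use of f is rejected because its declaration has not yet been entered; in the two-pass version only h, which is never declared, remains unresolved.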

Example (drawn from C):

    int x, z;
    void A() {
        float x, y;
        print(x, y, z);   /* x: float, y: float, z: int */
    }
    void B() {
        print(x, y, z);   /* x: int, y: undeclared, z: int */
    }

Block Structure Concepts

• Nested Visibility
  No access to identifiers outside their scope.
• Nearest Declaration Applies
  Using static nesting of scopes.
• Automatic Allocation and Deallocation of Locals
  Lifetime of data objects is bound to the scope of the identifiers that denote them.


Is Case Significant?

In some languages (C, C++, Java and many others) case is significant in identifiers. This means aa and AA are different symbols that may have entirely different definitions.

In other languages (Pascal, Ada, Scheme, CSX) case is not significant. In such languages aa and AA are two alternative spellings of the same identifier.

Data structures commonly used to implement symbol tables usually treat different cases as different symbols. This is fine when case is significant in a language. When case is insignificant, you probably will need to normalize case before entering or looking up identifiers.

This just means that identifiers are converted to a uniform case before they are entered or looked up. Thus if we choose to use lower case uniformly, the identifiers aaa, AAA, and AaA are all converted to aaa for purposes of insertion or lookup.

BUT, inside the symbol table the identifier is stored in the form in which it was declared, so that programmers see the form of the identifier they expect in listings, error messages, etc.
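A minimal sketch of this idea in Python, assuming lower case is chosen as the uniform case. The class name CaseInsensitiveTable and its methods are hypothetical, introduced only for illustration.

```python
class CaseInsensitiveTable:
    """Hypothetical sketch: keys are case-folded, declared spelling is kept."""

    def __init__(self):
        # lower-cased name -> (spelling used in the declaration, attributes)
        self._entries = {}

    def insert(self, name, info):
        key = name.lower()                    # uniform case for insertion
        if key in self._entries:
            declared = self._entries[key][0]
            raise ValueError(f"{name!r} is already declared as {declared!r}")
        self._entries[key] = (name, info)     # keep the declared spelling

    def lookup(self, name):
        # uniform case for lookup; returns None when the name is absent
        return self._entries.get(name.lower())


table = CaseInsensitiveTable()
table.insert("AaA", "int")
print(table.lookup("aaa"))  # ('AaA', 'int')
print(table.lookup("AAA"))  # ('AaA', 'int')
print(table.lookup("bbb"))  # None
```

Every spelling reaches the same entry, while error messages can still report the identifier as AaA, the form the programmer wrote.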

How are Symbol Tables Implemented?

There are a number of data structures that can reasonably be used to implement a symbol table:

• An Ordered List
  Symbols are stored in a linked list, sorted by the symbol’s name. This is simple, but may be a bit too slow if many identifiers appear in a scope.
• A Binary Search Tree
  Lookup is much faster than in linked lists, but rebalancing may be needed. (Entering identifiers in sorted order turns a search tree into a linked list.)
• Hash Tables
  The most popular choice.

Implementing Block-Structured Symbol Tables

To implement a block structured symbol table we need to be able to efficiently open and close individual scopes, and limit insertion to the innermost current scope.

This can be done using one symbol table structure if we tag individual entries with a “scope number.”

It is far easier (but more wasteful of space) to allocate one symbol table for each scope. Open scopes are stacked, pushing and popping tables as scopes are opened and closed.


Be careful, though: many preprogrammed stack implementations don’t allow you to “peek” at entries below the stack top. Peeking is necessary to look up an identifier in all open scopes.

If a suitable stack implementation (with a peek operation) isn’t available, a linked list of symbol tables will suffice.

Reading Assignment

Read Chapter 3 of Crafting a .
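The stack-of-tables approach for block-structured symbol tables can be sketched as follows. This is an illustrative Python sketch, not the course's implementation: the class BlockSymbolTable and its method names are hypothetical, and a Python list stands in for a stack that permits peeking below the top. The usage replays the C example with procedures A and B.

```python
class BlockSymbolTable:
    """Hypothetical sketch: one dict per open scope, kept on a list used as a stack."""

    def __init__(self):
        self._scopes = [{}]              # the outermost (global) scope

    def open_scope(self):
        self._scopes.append({})          # push an empty table

    def close_scope(self):
        self._scopes.pop()               # discard the innermost table

    def insert(self, name, info):
        current = self._scopes[-1]       # insertion is limited to the innermost scope
        if name in current:
            raise ValueError(f"{name!r} is already declared in this scope")
        current[name] = info

    def lookup(self, name):
        # Peek at every open scope, innermost first, so the nearest declaration wins.
        for scope in reversed(self._scopes):
            if name in scope:
                return scope[name]
        return None                      # undeclared


table = BlockSymbolTable()
table.insert("x", "int")
table.insert("z", "int")

table.open_scope()                       # body of A
table.insert("x", "float")
table.insert("y", "float")
in_a = [table.lookup(n) for n in ("x", "y", "z")]
table.close_scope()

table.open_scope()                       # body of B
in_b = [table.lookup(n) for n in ("x", "y", "z")]
table.close_scope()

print(in_a)  # ['float', 'float', 'int']
print(in_b)  # ['int', None, 'int']
```

Because a Python list can be traversed from the top down, no separate peek operation is needed; with a strict push/pop-only stack, the same lookup would require the linked list of tables mentioned above.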

Scanning

A scanner transforms a character stream into a token stream.

A scanner is sometimes called a lexical analyzer or lexer.

Scanners use a formal notation (regular expressions) to specify the precise structure of tokens.

But why bother? Aren’t tokens very simple in structure?

Token structure can be more detailed and subtle than one might expect. Consider simple quoted strings in C, C++ or Java. The body of a string can be any sequence of characters except a quote character (which must be escaped). But is this simple definition really correct?

Can a newline character appear in a string? In C it cannot, unless it is escaped with a backslash.

C, C++ and Java allow escaped newlines in strings; Pascal forbids them entirely. Ada forbids all unprintable characters.

Are null strings (zero-length) allowed? In C, C++, Java and Ada they are, but Pascal forbids them. (In Pascal a string is a packed array of characters, and zero-length arrays are disallowed.)

A precise definition of tokens can ensure that lexical rules are clearly stated and properly enforced.
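To make the string-literal questions concrete, here is one possible regular expression written with Python's re module. It is a sketch of a C-like rule under stated assumptions (quotes and newlines must be escaped with a backslash, and the empty string is allowed), not the exact definition used by any of the languages named above. The names STRING and is_string_literal are hypothetical.

```python
import re

# Body: any character except quote, backslash, or newline,
# OR a backslash followed by any single character (including a newline).
STRING = re.compile(r'"(?:[^"\\\n]|\\.)*"', re.DOTALL)


def is_string_literal(text):
    """True if the whole text is one string literal under the assumed rule."""
    return STRING.fullmatch(text) is not None


print(is_string_literal('""'))                        # True: null string allowed
print(is_string_literal(r'"say \"hi\""'))             # True: escaped quotes
print(is_string_literal('"line one\nline two"'))      # False: raw newline
print(is_string_literal('"line one\\\nline two"'))    # True: escaped newline
print(is_string_literal('"unterminated'))             # False: no closing quote
```

Changing a lexical rule, for example forbidding empty strings as Pascal does, becomes a small and visible change to the pattern (replacing * with +), which is the benefit of stating token structure precisely.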


Regular Expressions

Regular expressions specify simple (possibly infinite) sets of strings. Regular expressions routinely specify the tokens used in programming languages.

Regular expressions can drive a scanner generator.

Regular expressions are widely used in computer utilities:

• The Unix utility grep uses regular expressions to define search patterns in files.
• Unix shells allow regular expressions in file lists for a command.
• Most editors provide a “context search” command that specifies desired matches using regular expressions.
• The Windows Find utility allows some regular expressions.

Regular Sets

The sets of strings defined by regular expressions are called regular sets.

When scanning, a token class will be a regular set, whose structure is defined by a regular expression.

Particular instances of a token class are sometimes called lexemes, though we will simply call a string in a token class an instance of that token. Thus we call the string abc an identifier if it matches the regular expression that defines valid identifier tokens.

Regular expressions use a finite character set, or vocabulary (denoted Σ).

This vocabulary is normally the character set used by a computer. Today, the ASCII character set, which contains a total of 128 characters, is very widely used.

Java uses the Unicode character set, which includes all the ASCII characters as well as a wide variety of other characters.

An empty or null string is allowed (denoted λ, “lambda”). Lambda represents an empty buffer in which no characters have yet been matched. It also represents optional parts of tokens. An integer literal may begin with a plus or minus, or it may begin with λ if it is unsigned.
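The role of λ as an optional part of a token can be illustrated with Python's re module. The sketch below assumes an integer literal is an optional sign followed by one or more decimal digits; that digit rule is an assumption made for this example, and the names SIGN_OR_LAMBDA, OPTIONAL_SIGN and is_integer_literal are hypothetical.

```python
import re

# The sign is '+', '-', or the empty string (λ), written as an empty alternative.
SIGN_OR_LAMBDA = re.compile(r'(?:\+|-|)[0-9]+')
# The same regular set written with the usual "optional" shorthand.
OPTIONAL_SIGN = re.compile(r'[+-]?[0-9]+')


def is_integer_literal(text):
    """True if text belongs to the (assumed) regular set of integer literals."""
    by_lambda = SIGN_OR_LAMBDA.fullmatch(text) is not None
    by_option = OPTIONAL_SIGN.fullmatch(text) is not None
    assert by_lambda == by_option   # both patterns describe the same set
    return by_lambda


for candidate in ["42", "+42", "-7", "+", "", "4-2"]:
    print(repr(candidate), is_integer_literal(candidate))
```

The strings 42, +42 and -7 are instances of the token; a bare sign, the empty string, and 4-2 are not, since λ may stand in for the sign but not for the digits.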
