What's In A Word? Lexical Analysis For PL/SQL And SQL

Charles Wetherell
Oracle Corporation
21 April 2016

Abstract

A friend wrote a PL/SQL source code analyzer. He was surprised when his tool worked on return 'Y'; but failed on return'Y';. Is the blank-free version legal PL/SQL? Does his tool have a bug? The answer lies in the lexical structure of PL/SQL and SQL. This note explains lexical analysis and provides an answer to our friend's question.

1 Words

    Ye Highlands and ye Lowlands
    Oh, where hae ye been?
    They hae slain the Earl o'Moray
    And laid him on the green.
        The Bonnie Earl o'Moray

A friend builds PL/SQL programming tools as a business. While testing a source code analyzer, he was puzzled by an anomaly. He had written the statement return'Y'; in a PL/SQL program (probably the missing space after return was a typo). He passed the program to his source code analysis tool and the tool failed! When he rewrote the statement as return 'Y'; the tool worked exactly as expected. By contrast, the PL/SQL compiler happily accepted both forms. Our friend wondered if he had discovered a bug in the PL/SQL compiler. Or perhaps he just didn't understand the rules of PL/SQL completely.

For any language, each new utterance must be broken into meaningful pieces. Each language has its own rules and regulations. If these are unknown or unclear, the analysis can be a challenge. Mistakes are the basis of puns and ludicrous misunderstandings. The writer Sylvia Wright loved the song The Bonnie Earl o'Moray as a child but only realized when she was an adult that the last line of the verse shown above was "And laid him on the green", not "And Lady Mondegreen". Natural language errors like this are now known as mondegreens.

A PL/SQL program or a SQL statement is an utterance, commonly a text preserved in a file. For these texts, just as for English or Hindi or Mandarin, the first step is to find the words. How is this done? What are the specific rules for PL/SQL and SQL? Most programmers never worry about the rules; they write sensible programs and don't notice the odd cases. But for folks who write tools to create, analyze, or manage PL/SQL and SQL, the details do matter. Mondegreens are to be avoided. The knowledge may also help those who set coding standards, write elegant code, or are curious about everything to do with the two languages.

2 Lexical Principles

    Linguistic theories generally regard human languages as consisting of two parts: a lexicon, essentially a catalogue of a language's words (its wordstock); and a grammar, a system of rules which allow for the combination of those words into meaningful sentences.
        Wikipedia

Artificial languages have an advantage over natural languages: they are specifically designed for easy decoding. PL/SQL and SQL are no exceptions. These languages share their rules for finding words. Even better, they share several principles for the word finding rules.

Before going further, it is important to note that no meaning is ascribed to words at this stage. The job of finding words begins and ends with the words themselves. Whether they are arranged properly, whether they make sense when combined, whether they are useful in any way is outside consideration. Only the words themselves matter.

Programming languages use a term of art for their words: the token. Language tools break text expressed as a sequence of characters into a sequence of tokens so that no two tokens overlap and every character is accounted for.
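That tiling property is simple enough to state as an executable check. The sketch below is mine, not part of the original note: it is written in Python, the function name tiles_exactly is invented, and it assumes tokens are represented as (category, text) pairs.

    def tiles_exactly(text, tokens):
        """Check that a token list tiles a text: the tokens appear in
        text order, no two overlap, and every character is accounted for."""
        pos = 0
        for category, token_text in tokens:
            # The next token must begin exactly where the previous one ended.
            if not text.startswith(token_text, pos):
                return False
            pos += len(token_text)
        # Every character must have been consumed when the tokens run out.
        return pos == len(text)

Any correct lexical analysis, whatever its categories, must produce a token list that passes this check.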
Here are the lexical principles for PL/SQL and SQL; other programming languages are similar.

Category: Each token has a category. For example, a token may be an identifier, a numeric literal, a single character operator, white space, and so on. Natural languages commonly have only the categories word, punctuation, and white space, although they might include a few more like number.

Start Pattern: The lexicon defines a family of start patterns. For example, an identifier must start with an alphabetic character. A string literal may start with several different patterns, but those patterns all include the single quote character '. Each start pattern identifies a particular category; several distinct patterns may all identify the same category.

Ordering: The start patterns are ordered so that the first that might apply is always taken. This ensures that /* is seen as the token that starts a comment and not as the pair of a divide operator / followed by a multiply operator *.

Greedy: Each token category is greedy. Once a start pattern has been found, the token continues until no more characters belonging to its continuation can be found. Identifiers continue so long as letters, digits, and a few special characters are seen. A string literal that starts with a single quote ' continues until a second single quote is seen. A single character operator has, by definition, no continuation and terminates as soon as its start pattern is found.

Using identifier as a token category makes some PL/SQL and SQL purists anxious (myself included when I am in maximum purity mode) because there is another notion of identifier used in the semantic analysis of these languages. Any anxiety will be relieved by a discussion later on.

3 The Lexical Algorithm

The principles are fine, but how are they applied in practice? Two ingredients are needed: a description of each token category along with its start and continuation information, and an algorithm to do the lexical analysis. Listing 3.1 provides the algorithm in some imaginary programming language.

Listing 3.1: The Lexical Algorithm

 1  -- Input is a file with text
 2  -- Output is an array of tokens
 3
 4  function LexIt(text)
 5
 6    TheTokens = []
 7
 8    while text is not empty
 9      T := find start of token category
10      error if T is null
11      Token := find end of category T
12      error if Token is null
13      push Token on end of TheTokens
14
15    return TheTokens
16
17  end
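To make the loop concrete, here is a minimal executable sketch of the same algorithm in Python. It is my own illustration, not anything from the PL/SQL implementation: the category table is drastically simplified (the real lexicon has many more categories and start patterns), and the names CATEGORIES and lex_it are invented. Still, the sketch observes the four principles: each pattern names a category, the patterns are tried in order, and the regular expressions extend each token greedily.

    import re

    # A drastically simplified category table for illustration only.
    # Order matters: a two-character operator pattern must be tried
    # before any single-character pattern that shares its first character.
    CATEGORIES = [
        ("Whitespace",         re.compile(r"[ \t\r\n]")),
        ("Identifier",         re.compile(r"[A-Za-z][A-Za-z0-9_$#]*")),
        ("NumericLiteral",     re.compile(r"[0-9]+(\.[0-9]+)?")),
        ("StringLiteral",      re.compile(r"'[^']*'")),
        ("DoubleCharOperator", re.compile(r"\|\||:=")),
        ("SingleCharOperator", re.compile(r"[();,+\-*/=.]")),
    ]

    def lex_it(text):
        """The loop of Listing 3.1: find the first category whose start
        pattern applies, extend the token greedily, then repeat until
        the text is exhausted."""
        the_tokens = []
        pos = 0
        while pos < len(text):
            for category, pattern in CATEGORIES:
                match = pattern.match(text, pos)
                if match:
                    the_tokens.append((category, match.group()))
                    pos = match.end()
                    break
            else:                # no start pattern applied
                raise ValueError("lexical error at position %d" % pos)
        return the_tokens

Run on the block of Listing 3.2 below, this sketch reproduces the token list of Figure 1 (apart from the explicit EOF token). It also answers the opening puzzle in miniature: lex_it("return'Y';") yields an Identifier return, a StringLiteral 'Y', and a SingleCharOperator ;, because a quote character cannot continue an identifier, with or without a blank in between.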
Imagine a demon standing on the first brick of a long road, each brick inscribed with a character. The demon looks down at the brick under his feet and then may have to look ahead as far as two more bricks. Once he has seen these starting bricks, the demon determines the category of the token beginning underfoot. There may be no legal category; in that case, the demon announces an error. Now the demon walks until the next brick in front of him can no longer continue the token under construction. At that moment, he announces that the characters from where he started to where he is standing form a token. After the demon takes one step forward onto the next character, he starts the process again. It is also possible that the continuation does not end properly and, once again, the demon may announce an error. When the demon comes to the end of the road and the end of a token at the same time, the analysis is complete and the demon retires. Readers who know something about finite state machines will recognize this as an informal description of such a machine.

How might this algorithm work on a small PL/SQL program? Consider the example of Listing 3.2 and the token list in Figure 1.

Listing 3.2: Example PL/SQL And Token String

    begin
      HumptyDumpty('abc' || to_char(1.0));
    end;

    Category            Text
    ------------------  ------------
    Identifier          begin
    Whitespace          CR
    Whitespace          ␣
    Whitespace          ␣
    Identifier          HumptyDumpty
    SingleCharOperator  (
    StringLiteral       'abc'
    Whitespace          ␣
    DoubleCharOperator  ||
    Whitespace          ␣
    Identifier          to_char
    SingleCharOperator  (
    NumericLiteral      1.0
    SingleCharOperator  )
    SingleCharOperator  )
    SingleCharOperator  ;
    Whitespace          CR
    Identifier          end
    SingleCharOperator  ;
    Whitespace          EOF

Figure 1: Tokens From The PL/SQL Block

The first surprise is that the word begin is an identifier. Isn't it supposed to be a reserved word or keyword? When parsing and semantic analysis come along, they may give identifiers more detailed roles, but for the purpose of breaking text into tokens, if it looks like an identifier, it is an identifier.

Secondly, every token has an associated text. That text is what the demon found when walking along. Surprisingly, the demon found three instances of whitespace in a row: CR (that is, carriage return), a blank ␣, and another ␣. Why weren't those amalgamated into one token? Because there is no need; the analysis is simpler if each whitespace item is just a character by itself. The same reasoning applies to the end of file character EOF that ends the token list.

The token list observes the lexical principles.

• The first token starts at the beginning of the text.
• The last token ends at the end of the text.
• Every text character appears in a token.
• No character appears in more than one token.
• The tokens appear in text order.

4 Categories

What are the categories of tokens and how are they recognized? On line 9 of Listing 3.1, there is a magical command find start of token category. This command knows how all the SQL and PL/SQL tokens start. Figure 2 encapsulates this knowledge. There are some restrictions on the way the rules are applied.

• If the character under observation is not in the table, a lexical error has occurred.
• The rules must be applied in the order they are listed; the sketch after this list shows why the order matters.
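The ordering restriction is the easiest to demonstrate. The fragment below is again my own Python illustration with invented names; it tries the same two start patterns in both orders against the /* example from the Ordering principle. Only the ordering that lists /* first sees the start of a comment; the reversed ordering produces a divide operator instead.

    import re

    # The same two start patterns, in two different orders.
    COMMENT_FIRST  = [("CommentStart",       re.compile(r"/\*")),
                      ("SingleCharOperator", re.compile(r"[*/]"))]
    OPERATOR_FIRST = list(reversed(COMMENT_FIRST))

    def first_token(text, categories):
        """Return the first token found using the given category ordering."""
        for category, pattern in categories:
            match = pattern.match(text)
            if match:
                return (category, match.group())
        return None

    print(first_token("/* a comment */", COMMENT_FIRST))
    # ('CommentStart', '/*')
    print(first_token("/* a comment */", OPERATOR_FIRST))
    # ('SingleCharOperator', '/')   <- the mondegreen the ordering rule prevents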
