CS264/Assignment 1: A Running Start: Getting Over Lex and Yacc Fast

UNIVERSITY OF CALIFORNIA
Department of Electrical Engineering and Computer Sciences
Computer Science Division

CS 264
Prof. Richard Fateman
Fall 1999

Due: Sept. 7, 1999 for part 0

Reading: Your choice of material on scanning and parsing. Chapters 3, 4 and 5 of Aho/Sethi/Ullman, Compilers: Principles, Techniques and Tools (henceforth, the Red Dragon book) can be helpful. Alternatively, you may find coverage of scanning and parsing in almost any other book on compiling. We expect you are already familiar with this material. Generally we will not make much use of the Red Dragon book, but you might wish to have it handy for reference.

For this first assignment you may wish to look at your favorite parser (or one from a friend?) from a previous course, and consider reviving it. Especially if you are a Lisp fan: Chapter 19 of Norvig’s Paradigms of AI totally finesses parsing and may help solve this assignment relatively easily; however, it totally punts on scanning. (As a Lisp fan myself, and given the restrictions of LALR parser generators, I found Norvig’s approach much easier. I will also talk about and distribute a 50-line scanner written in Lisp.)

General guidelines: You should provide brief but accurate documentation for all your programs, being especially careful to state any assumptions or limitations you make on the input. (Naturally you must have some taste in not assuming away a significant part of the problem, but careful, thoughtful preliminary design can make your job much easier.) Discussions with other students are recommended, but the best approach is probably to ask me if you have any questions.

Caution: If the only tool you have is a hammer, then every problem looks like a nail. Be sure that the problem you are addressing is not just a matter of using the wrong tool.
What to hand in: For this and later assignments I will generally expect a paper version of material that should be read by a grader (me), plus some way of accessing machine-readable on-line versions of anything I might reasonably expect to execute. Normally you should include enough test runs to demonstrate that your programs work at least sometimes.

Here is a problem that, suitably refined or elaborated, will give your favorite programming tools a run for their money.

The assignment

We tend to think that the parts of programming languages that deal with arithmetic assignment statements are simply modeling the ordinary language of mathematics. In fact all major programming languages hold considerably distorted views of what you might actually see in a math text. So-called computer algebra systems like Mathematica or Maple are not much better, linguistically, than more conventional “numerical” systems.

Your assignment here is to see if you can in fact scan and parse something that looks considerably more like mathematics. We will start by oversimplifying the assignment as a base. We then suggest extensions that will add to the value of your project. It is up to you to decide which of the extensions are worth your time and energy.

The level-0 version:

0. Write a program or programs, in any language or languages you wish, that accept user input (from standard input) that looks like a sequence of infix arithmetic assignment statements in some well-known programming language (Fortran, C, Matlab, Pascal, Basic), and produce some evidence that the input is understandable. In particular we expect you to do the following:

• Show that you can separate the input stream into tokens (your choice of how to structure them).
• Come up with a (preferably simple, extensible) parser for your language.
• Write a simple read-print loop for demonstration, showing you can compute a correct parse of the input.

Here’s a sample interaction:

    <in1> a:=b*c+ 4
    <out1> (SET! A (+ (* B C) 4))
    <in2>

For convenience in checking the return value against some unambiguous standard I have printed it in a Lisp-like form that should not require explanation to you. You are free to provide an alternative tree form.

To get full credit for this level in the assignment you must also have a formal definition of the grammar you are accepting. This would, among other things, implicitly specify the precedence and associativity of operations (does a*b*c mean a*(b*c) or (a*b)*c?) and other details that you may find in the language definition document. You are not expected to create much new material, just find existing documentation and trim it down. You should cite all sources for material.

At this level, we are happy for you to selectively eliminate parts of the languages to avoid grungy programming. Here’s one acceptable simplification: let all numbers be decimal integers only. You may even find it acceptable at the token level to ignore the distinction between symbols and numbers. It used to be that Basic even insisted that all symbols were one character long!

The next levels require you to extend the previous program(s) in various ways. I think that most of them are best addressed in the order presented here, but you can in general pick and choose from them.

1. Introduce a powering operation in case your language did not have it. Parse a^b^c so it is different from a^(b*c); that is, parse it as a^(b^c), since (a^b)^c would equal a^(b*c). Add subscripting (array references) if you have not already included this feature.

2. Add a list of typical “built-in” functions, such as sin, cos, tan, log, exp as well as sinh, cosh, tanh (these are hyperbolic trig functions). The fact that their names start out as the names of other functions may require some attention later. Adding such functions may require no change unless you are using single-letter identifier tokens. You might try to set up your program so it can work either way by setting a switch.
(With one-character identifier tokens, the input “hello” would be 5 identifiers in a row but “sinx” would be two: sin, x.) You must be able to parse a:=b*c*d+4*sin(x)^bar. This would translate into Scheme as something like (set! a (+ (* b c d) (* 4 (^ (sin x) bar)))).

3. Figure out a way for the user to add new “built-in” function operators. The usefulness of such a designation will become apparent in the next item.

4. Change your implementation so that an explicit * is optional and multiplication can be represented in a way closer to mathematical convention, namely by spaces or other breaks. That is, a:=b c + 4d +k(x+y)+sin(r) is equivalent to a:=b*c+4*d+k*(x+y)+sin(r). Note that we did not change sin(r) to sin*(r) because sin is a built-in function. Contrast this to the treatment of k in the given expression.

5. Change your implementation even further to allow function application notation closer to that of conventional mathematics, allowing sin x instead of sin(x). This requires some thought as to how much to include as an argument to sin. How do you parse each of these: sin(x)*y, sin(x*y), sin x*y, sin x y, sin(x y), sin x sin y?

6. Change your implementation to allow spaces to be optional under the assumption that all user variables are single letters. Then a:=xcoshx is understood as a:=x*cosh(x), a:=xcosgx is a:=x*cos(g*x) and a:=xbosgx is a:=x*b*o*s*g*x.

7. Change your implementation to allow for ambiguous parses. (This is easy with a general CFG parser, impossible with YACC, I think.) In particular it is common in some contexts to interpret z/xy as z/(x*y) even though your rules up to this point would probably result in (z*y)/x. The safest approach for a mathematician or physicist who wishes to be understood unambiguously is to add parentheses. The safest thing for a compiler-writer is perhaps to say “there are two (or more) interpretations of your input: which do you mean?”, perhaps with a warning suggesting extra parentheses.
There are also similar ambiguities with a/b/c. While you are at it, you may wish to consider k(x) as ambiguous: if k is not known as a function, we interpret it as k*x, but maybe this is clearly a function application: how else to explain the parentheses?

8. Do something more sensible with incorrect input. This might be a message pointing out where the utterance first became irretrievably wrong (no longer a prefix of a correct expression), or an attempt to patch it up. This could be done by either discarding or adding characters. Don’t get sucked into doing too much here. There is a large literature, but it is based on a world view in which computers were slow and you had only one “run” a day to find errors. An interactive view where the computer says in effect “the previous line was erroneous at the marked location. Here is your input line for editing” may be sufficient.

9. “Real” mathematical notation is displayed on multiple lines with horizontal divide bars, raised superscripts, etc. Design extensions to the notion of 1-D (string) parsing to accommodate this 2-dimensional input. Given the open-endedness of mathematics, this cannot be solved with a fixed program. A program that does this well can easily get you an MS degree.
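As one possible starting point for level 0 and the powering operator of item 1, here is a minimal sketch in Python of a scanner plus a precedence-climbing parser. It is not the scanner or parser referred to above. The names (`TOKEN_RE`, `BINARY`, `Parser`, `parse`) and the precedence table are assumptions made for illustration, and the sketch leaves out unary minus, function calls, subscripts and implicit multiplication.

```python
import re

# Assumed token classes: decimal integers, identifiers, ":=" and one-character operators.
TOKEN_RE = re.compile(r"(\d+)|([A-Za-z_]\w*)|(:=|[-+*/^()])")

def tokenize(text):
    """Return a list of (kind, value) pairs; raise at the first bad character."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character {text[pos]!r} at column {pos}")
        num, ident, op = m.groups()
        if num is not None:
            tokens.append(("num", int(num)))
        elif ident is not None:
            tokens.append(("id", ident.upper()))  # upper-cased to match the sample output
        else:
            tokens.append(("op", op))
        pos = m.end()
    return tokens

# Assumed table: operator -> (binding power, is_right_associative).
BINARY = {"+": (10, False), "-": (10, False),
          "*": (20, False), "/": (20, False),
          "^": (30, True)}

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else ("eof", None)

    def next(self):
        tok = self.peek()
        self.i += 1
        return tok

    def expect(self, value):
        tok = self.next()
        if tok != ("op", value):
            raise SyntaxError(f"expected {value!r}, got {tok[1]!r}")

    def statement(self):
        kind, name = self.next()
        if kind != "id":
            raise SyntaxError("a statement must start with an identifier")
        self.expect(":=")
        tree = ["SET!", name, self.expression(0)]
        if self.peek()[0] != "eof":
            raise SyntaxError(f"unexpected token {self.peek()[1]!r}")
        return tree

    def expression(self, min_power):
        left = self.primary()
        while True:
            kind, op = self.peek()
            if kind != "op" or op not in BINARY:
                break
            power, right_assoc = BINARY[op]
            if power < min_power:
                break
            self.next()
            # Right-associative operators let an equal-power operator nest on the right.
            right = self.expression(power if right_assoc else power + 1)
            left = [op, left, right]
        return left

    def primary(self):
        kind, value = self.next()
        if kind in ("num", "id"):
            return value
        if (kind, value) == ("op", "("):
            inner = self.expression(0)
            self.expect(")")
            return inner
        raise SyntaxError(f"unexpected token {value!r}")

def to_lisp(tree):
    """Print a nested list in the Lisp-like form used in the sample interaction."""
    if isinstance(tree, list):
        return "(" + " ".join(to_lisp(t) for t in tree) + ")"
    return str(tree)

def parse(text):
    return to_lisp(Parser(tokenize(text)).statement())

if __name__ == "__main__":
    print(parse("a:=b*c+ 4"))
```

The table-driven loop keeps the grammar in one place: adding an operator means adding one entry to `BINARY`, which suits the "preferably simple, extensible" requirement, although a formal grammar still has to be written down separately.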
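For item 6, one way to break an unspaced run of letters into built-in function names and single-letter variables is a greedy longest-match pass. The following Python sketch assumes that strategy; the names `BUILTINS` and `split_letters` are hypothetical. It only segments the letters and does not decide how much of what follows becomes a function's argument, which is the question raised in item 5.

```python
# Assumed list of built-in function names, taken from item 2.
BUILTINS = ("sin", "cos", "tan", "log", "exp", "sinh", "cosh", "tanh")

def split_letters(run, functions=BUILTINS):
    """Split a run of letters into known function names and single letters.

    Longer names are tried first, so "cosh" wins over "cos".
    """
    by_length = sorted(functions, key=len, reverse=True)
    pieces, pos = [], 0
    while pos < len(run):
        for name in by_length:
            if run.startswith(name, pos):
                pieces.append(name)
                pos += len(name)
                break
        else:
            # No function name starts here: treat the letter as a variable.
            pieces.append(run[pos])
            pos += 1
    return pieces

if __name__ == "__main__":
    for run in ("xcoshx", "xcosgx", "xbosgx"):
        print(run, split_letters(run))
```

Greedy matching commits to the longest name at each position, so it never reports alternatives such as reading tanh as tan followed by the variable h. Handling that case belongs with the ambiguity work in item 7.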
