International Journal of Research (IJR) Vol-1, Issue-8, September 2014 ISSN 2348-6848

Parsing – A Brief Study

Tanya Sharma, Sumit Das & Vishal Bhalla

Research Scholars, Computer science and engineering department, MDU Rohtak, India

ABSTRACT

Parsing is a technique to determine how a string might be derived using productions (rewrite rules) of a given grammar. It can be used to check whether or not a string belongs to a given language. When a statement written in a programming language is input, it is parsed by a compiler to check whether or not it is syntactically correct and to extract its components if it is correct. Finding an efficient parser is a nontrivial problem, and a great deal of research has been conducted on parser design. This paper serves the purpose of elaborating the concept of parsing.

1) INTRODUCTION

Parsing or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech). The term has slightly different meanings in different branches of linguistics and computer science.

Within computational linguistics the term is used to refer to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic and other information.

The term is also used in psycholinguistics when describing language comprehension. In this context, parsing refers to the way that human beings analyze a sentence or phrase (in spoken language or text) "in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc." This term is especially common when discussing what linguistic cues help speakers to interpret garden-path sentences.

Within computer science, the term is used in the analysis of computer languages, referring to the syntactic analysis of the input code into its component parts in order to facilitate the writing of compilers and interpreters.

2) NATURAL LANGUAGE PARSING

2.1) TRADITIONAL METHODS

Traditional sentence parsing is often performed as a method of understanding the exact meaning of a sentence, sometimes with the aid of devices such as sentence diagrams. It usually emphasizes the importance of grammatical divisions such as subject and predicate.

The traditional grammatical exercise of parsing, sometimes known as clause analysis, involves breaking down a text into its component parts of speech with an explanation of the form, function, and syntactic relationship of each part. This is determined in large part from study of the language's conjugations and declensions, which can be quite intricate



for heavily inflected languages. To parse a phrase such as 'man bites dog' involves noting that the singular noun 'man' is the subject of the sentence, the verb 'bites' is the third person singular of the present tense of the verb 'to bite', and the singular noun 'dog' is the object of the sentence. Techniques such as sentence diagrams are sometimes used to indicate the relation between elements in the sentence.

Parsing was formerly central to the teaching of grammar throughout the English-speaking world, and widely regarded as basic to the use and understanding of written language. However, the teaching of such techniques is no longer current.

2.2) COMPUTATIONAL METHODS

In some machine translation and natural language processing systems, written texts in human languages are parsed by computer programs. Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language, whose usage is to convey meaning (or semantics) amongst a potentially unlimited range of possibilities, only some of which are germane to the particular case. So an utterance "Man bites dog" versus "Dog bites man" is definite on one detail but in another language might appear as "Man dog bites" with a reliance on the larger context to distinguish between those two possibilities, if indeed that difference was of concern. It is difficult to prepare formal rules to describe informal behaviour even though it is clear that some rules are being followed.

In order to parse natural language data, researchers must first agree on the grammar to be used. The choice of syntax is affected by both linguistic and computational concerns; for instance some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is known to be NP-complete. Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank. Shallow parsing aims to find only the boundaries of major constituents such as noun phrases. Another popular strategy for avoiding linguistic controversy is dependency grammar parsing.

Most modern parsers are at least partly statistical; that is, they rely on a corpus of training data which has already been annotated (parsed by hand). This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. Approaches which have been used include straightforward PCFGs (probabilistic context-free grammars), maximum entropy, and neural nets. Most of the more successful systems use lexical statistics (that is, they consider the identities of the words involved, as well as their part of speech). However such systems are vulnerable to over-fitting and require some kind of smoothing to be effective.

Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties as with manually designed grammars for programming languages. As mentioned earlier, some grammar formalisms are very difficult to parse computationally; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free



grammars often rely on some variant of the CKY algorithm, usually with some heuristic to prune away unlikely analyses to save time. However some systems trade speed for accuracy using, e.g., linear-time versions of the shift-reduce algorithm. A somewhat recent development has been parse re-ranking, in which the parser proposes some large number of analyses, and a more complex system selects the best option.

3) COMPUTER LEVEL PARSING

3.1) PARSER

A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree or other hierarchical structure – giving a structural representation of the input, checking for correct syntax in the process. The parsing may be preceded or followed by other steps, or these may be combined into a single step. The parser is often preceded by a separate lexical analyser, which creates tokens from the sequence of input characters; alternatively, these can be combined in scannerless parsing. Parsers may be programmed by hand or may be automatically or semi-automatically generated by a parser generator. Parsing is complementary to templating, which produces formatted output. These may be applied to different domains, but often appear together, such as the scanf/printf pair, or the input (front end parsing) and output (back end code generation) stages of a compiler.

The input to a parser is often text in some computer language, but may also be text in a natural language or less structured textual data, in which case generally only certain parts of the text are extracted, rather than a parse tree being constructed. Parsers range from very simple functions such as scanf, to complex programs such as the frontend of a C++ compiler or the HTML parser of a web browser. An important class of simple parsing is done using regular expressions, where a regular expression defines a regular language, and then the regular expression engine automatically generates a parser for that language, allowing pattern matching and extraction of text. In other contexts regular expressions are instead used prior to parsing, as the lexing step whose output is then used by the parser.

The use of parsers varies by input. In the case of data languages, a parser is often found as the file reading facility of a program, such as reading in HTML or XML text; these examples are markup languages. In the case of programming languages, a parser is a component of a compiler or interpreter, which parses the source code of a computer programming language to create some form of internal representation; the parser is a key step in the compiler frontend.

Programming languages tend to be specified in terms of a deterministic context-free grammar because fast and efficient parsers can be written for them. For compilers, the parsing itself can be done in one pass or multiple passes – see one-pass compiler and multi-pass compiler. The implied disadvantages of a one-pass compiler can largely be overcome by adding fix-ups, where provision is made for fix-ups during the forward pass, and the fix-ups are applied backwards when the current program segment has been recognized as having been completed. An example where such a fix-up mechanism would be useful would be a forward GOTO statement, where the target of the GOTO is unknown until the



program segment is completed. In this case, the application of the fix-up would be delayed until the target of the GOTO was recognized. Obviously, a backward GOTO does not require a fix-up.

Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out at the semantic analysis (contextual analysis) step.

Fig 1)
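The relaxed-parser strategy just described can be illustrated with a short sketch. The toy statement language and all function names below are invented for illustration: the syntactic pass checks only statement shape (a constraint a context-free grammar can express), so it accepts a superset of the real language, and a separate semantic pass then filters out programs that use undeclared names.

```python
import re

# Hypothetical toy language: "let <name> = <name-or-number>" and "print <name>".
STMT = re.compile(r"let (\w+) = (\w+)|print (\w+)")

def parse(program):
    """Syntactic pass: reject lines that do not match the statement grammar."""
    stmts = []
    for line in program.splitlines():
        m = STMT.fullmatch(line.strip())
        if not m:
            raise SyntaxError(f"bad statement: {line!r}")
        if m.group(1):
            stmts.append(("let", m.group(1), m.group(2)))
        else:
            stmts.append(("print", m.group(3)))
    return stmts

def check_declared(stmts):
    """Semantic (contextual) pass: every name must be declared before use."""
    declared = set()
    for s in stmts:
        if s[0] == "let":
            if not s[2].isdigit() and s[2] not in declared:
                raise NameError(f"undeclared name: {s[2]}")
            declared.add(s[1])
        elif s[1] not in declared:
            raise NameError(f"undeclared name: {s[1]}")
    return True
```

Here `parse("print y")` succeeds syntactically, but `check_declared` rejects it: the declare-before-use constraint lives outside the grammar, exactly as the paragraph above suggests.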

3.2) OVERVIEW OF PROCESS

The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways: top-down or bottom-up.

4) TYPES OF PARSING

• TOP-DOWN PARSING - Top-down parsing is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers are a type of parser that uses a top-down parsing strategy. Top-down parsing is a strategy of analyzing unknown data



relationships by hypothesizing general parse tree structures and then considering whether the known fundamental structures are compatible with the hypothesis. It occurs in the analysis of both natural languages and computer languages. Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules. Simple implementations of top-down parsing do not terminate for left-recursive grammars, and top-down parsing with backtracking may have exponential time complexity with respect to the length of the input for ambiguous CFGs. However, more sophisticated top-down parsers have been created by Frost, Hafiz, and Callaghan which do accommodate ambiguity and left recursion in polynomial time and which generate polynomial-sized representations of the potentially exponential number of parse trees.

• BOTTOM-UP PARSING - Bottom-up parsing reveals the grammatical structure of linear input text, as a first step in working out its meaning. Bottom-up parsing identifies and processes the text's lowest-level small details first, before its mid-level structures, leaving the highest-level overall structure to last.

4.1) TOP DOWN VERSUS BOTTOM UP PARSING

The bottom-up name comes from the concept of a parse tree, in which the most detailed parts are at the bushy bottom of the (upside-down) tree, and larger structures composed from them are in successively higher layers, until at the top or "root" of the tree a single unit describes the entire input stream. A bottom-up parse discovers and processes that tree starting from the bottom left end, and incrementally works its way upwards and rightwards. A parser may act on the structure hierarchy's low, mid, and highest levels without ever creating an actual data tree; the tree is then merely implicit in the parser's actions. Bottom-up parsing lazily waits until it has scanned and parsed all parts of some construct before committing to what the combined construct is.

The opposite of this are top-down parsing methods, in which the input's overall structure is decided (or guessed at) first, before dealing with mid-level parts, leaving the lowest-level small details to last. A top-down parser discovers and processes the hierarchical tree starting from the top, and incrementally works its way downwards and rightwards. Top-down parsing eagerly decides what a construct is much earlier, when it has only scanned the leftmost symbol of that construct and has not yet parsed any of its parts. Left corner parsing is a hybrid method which works bottom-up along the left edges of each subtree, and top-down on the rest of the parse tree.

If a language grammar has multiple rules that may start with the same leftmost symbols but have different endings, then that grammar can be efficiently handled by a deterministic bottom-up parse but cannot be handled top-down without guesswork and backtracking. So bottom-up parsers handle a somewhat larger



range of computer language grammars than do deterministic top-down parsers.
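The contrast can be made concrete with a small sketch. The grammar E -> E '+' T | T, T -> 'n' is chosen purely for illustration; neither routine is a production parser. The top-down routine commits to a construct (its trace begins at the root, E) as soon as it sees the leftmost token, while the shift-reduce routine builds the lowest-level pieces first and names the whole expression last.

```python
def top_down(tokens):
    """Recursive-descent (top-down) recognizer for E -> T ('+' T)*, T -> 'n'.
    The left-recursive rule E -> E + T is rewritten to iteration, since naive
    top-down parsing does not terminate on left recursion."""
    steps, pos = ["predict E"], 0

    def parse_T():
        nonlocal pos
        steps.append("predict T")
        if pos >= len(tokens) or tokens[pos] != "n":
            raise SyntaxError("expected 'n'")
        steps.append("match n")
        pos += 1

    parse_T()
    while pos < len(tokens) and tokens[pos] == "+":
        steps.append("match +")
        pos += 1
        parse_T()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return steps

def bottom_up(tokens):
    """Shift-reduce (bottom-up) parser for E -> E '+' T | T, T -> 'n'."""
    stack, steps, i = [], [], 0
    while not (stack == ["E"] and i == len(tokens)):
        if stack[-1:] == ["n"]:
            stack[-1:] = ["T"]; steps.append("reduce T -> n")
        elif stack[-3:] == ["E", "+", "T"]:
            stack[-3:] = ["E"]; steps.append("reduce E -> E + T")
        elif stack == ["T"]:
            stack[:] = ["E"]; steps.append("reduce E -> T")
        elif i < len(tokens):
            stack.append(tokens[i]); steps.append("shift " + tokens[i]); i += 1
        else:
            raise SyntaxError("no action applies")
    return steps
```

For the input "n + n" the top-down trace starts with "predict E" (root decided first), while the bottom-up trace starts with "shift n" and only ends with "reduce E -> E + T" (root decided last), matching the eager/lazy distinction described above.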

Fig 2)
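As a preview of the operator-precedence technique discussed under bottom-up parsers in section 5.2, the closely related precedence-climbing method can be sketched as follows. The table of binding powers is invented for illustration, and the routine evaluates as it parses rather than building a tree, to keep the sketch short.

```python
import operator

# Hypothetical precedence table: larger number = binds tighter; '^' is
# right-associative, the rest are left-associative.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOC = {"^"}
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": operator.truediv, "^": operator.pow}

def parse_expr(tokens, min_prec=1):
    """Precedence climbing over a token list such as ["2", "+", "3", "*", "4"].
    Consumes tokens from the front; atoms are plain numbers in this sketch."""
    node = float(tokens.pop(0))
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # a left-associative operator must not reabsorb itself on the right
        next_min = PREC[op] if op in RIGHT_ASSOC else PREC[op] + 1
        node = OPS[op](node, parse_expr(tokens, next_min))
    return node
```

With this table, `parse_expr(["2", "+", "3", "*", "4"])` groups the multiplication first, and a chain of `^` operators nests to the right.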

5) TYPES OF PARSERS

5.1) TOP-DOWN PARSERS

Some of the parsers that use top-down parsing include:

• Recursive descent parser - A recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

A predictive parser is a recursive descent parser that does not require backtracking. Predictive parsing is possible only for the class of LL(k) grammars, which are the context-free grammars for which there exists some positive integer k that allows a recursive descent parser to decide which production to use by examining only the next k tokens of input. (The LL(k) grammars therefore exclude all ambiguous grammars, as well as all grammars that contain left recursion. Any context-free grammar can be transformed into an equivalent grammar that has no left recursion, but removal of left recursion does not always yield an LL(k) grammar.) A predictive parser runs in linear time.

Recursive descent with backtracking is a technique that determines which production to use by trying each production in turn. Recursive descent with backtracking is not limited to LL(k) grammars, but is not guaranteed to terminate unless the grammar is LL(k). Even when they terminate, parsers that use recursive descent with backtracking may require exponential time.

Although predictive parsers are widely used, and are frequently chosen if writing a parser by hand, programmers often prefer to use a table-based parser produced by a parser generator, either for an LL(k) language or using an alternative parser, such as LALR or LR. This is



particularly the case if a grammar is not in LL(k) form, as transforming the grammar to LL to make it suitable for predictive parsing is involved. Predictive parsers can also be automatically generated, using tools like ANTLR.

• LL parser (Left-to-right, Leftmost derivation) - An LL parser is a top-down parser for a subset of the context-free grammars. It parses the input from Left to right, and constructs a Leftmost derivation of the sentence (hence LL, compared with LR parser). The class of grammars which are parsable in this way is known as the LL grammars. LL parsers may be table-based, the alternative being a recursive descent parser which is usually coded by hand (although not always; see e.g. ANTLR for an LL(*) recursive-descent parser generator).

An LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence. If such a parser exists for a certain grammar and it can parse sentences of this grammar without backtracking, then the grammar is called an LL(k) grammar. A language that has an LL(k) grammar is known as an LL(k) language. There are LL(k+n) languages that are not LL(k) languages. A corollary of this is that not all context-free languages are LL(k) languages.

LL(1) grammars are very popular because the corresponding LL parsers only need to look at the next token to make their parsing decisions. Languages based on grammars with a high value of k have traditionally been considered to be difficult to parse, although this is less true now given the availability and widespread use of parser generators supporting LL(k) grammars for arbitrary k.

An LL parser is called an LL(*) parser if it is not restricted to a finite k tokens of lookahead, but can make parsing decisions by recognizing whether the following tokens belong to a regular language (for example, by use of a Deterministic Finite Automaton).

5.2) BOTTOM-UP PARSERS

Some of the parsers that use bottom-up parsing include:

• Operator-precedence parser - While ad-hoc methods are sometimes used for parsing expressions, a more formal technique using operator precedence simplifies the task. For example, an expression such as

(x*x + y*y)^.5

is easily parsed. Operator precedence parsing is based on bottom-up parsing techniques and uses a precedence table to determine the next action. The table is easy to construct and is typically hand-coded. This method is ideal for applications that require a parser for expressions and where embedding compiler technology, such as yacc, would be overkill.

• LR PARSER (LEFT-TO-RIGHT, RIGHTMOST DERIVATION)

‹ Simple LR (SLR) parser - The most prevalent type of bottom-up



parser today is based on a concept called LR(k) parsing; the "L" is for left-to-right scanning of the input, the "R" for constructing a rightmost derivation in reverse, and the k for the number of input symbols of lookahead that are used in making parsing decisions. The cases k = 0 or k = 1 are of practical interest, and we shall only consider LR parsers with k ≤ 1 here. When (k) is omitted, k is assumed to be 1.

Some familiarity with the basic concepts is helpful even if the LR parser itself is constructed using an automatic parser generator. We begin with "items" and "parser states"; the diagnostic output from an LR parser generator typically includes parser states, which can be used to isolate the sources of parsing conflicts.

LR parsing is attractive for a variety of reasons:

1. LR parsers can be constructed to recognize virtually all programming-language constructs for which context-free grammars can be written. Non-LR context-free grammars exist, but these can generally be avoided for typical programming-language constructs.
2. The LR-parsing method is the most general nonbacktracking shift-reduce parsing method known, yet it can be implemented as efficiently as other, more primitive shift-reduce methods (see the bibliographic notes).
3. An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.
4. The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods. For a grammar to be LR(k), we must be able to recognize the occurrence of the right side of a production in a right-sentential form, with k input symbols of lookahead. This requirement is far less stringent than that for LL(k) grammars, where we must be able to recognize the use of a production seeing only the first k symbols of what its right side derives. Thus, it should not be surprising that LR grammars can describe more languages than LL grammars.

‹ LALR parser - LALR parsers are based on a finite-state-automata concept, from which they derive their speed. The data structure used by an LALR parser is a pushdown automaton (PDA). A deterministic PDA is a deterministic finite automaton (DFA) that uses a stack for a memory, indicating which states the parser has passed through to arrive at the current state. Because of the stack, a PDA can recognize grammars that would be impossible with a DFA; for example, a PDA can determine



whether an expression has any unmatched parentheses, whereas an automaton with no stack would require an infinite number of states due to unlimited nesting of parentheses.

LALR parsers are driven by a parser table in a finite-state machine (FSM) format. An FSM is very difficult for humans to construct, and therefore an LALR parser generator is used to create the parser table automatically from a grammar in Backus–Naur Form which defines the syntax of the computer language the parser will process. The parser table is often generated in source code format in a computer language (such as C++ or Java). When the parser (with parser table) is compiled and/or executed, it will recognize text files written in the language defined by the BNF grammar.

LALR parsers are generated from LALR grammars, which are capable of defining a larger class of languages than SLR grammars, but not as large a class as LR grammars. Real computer languages can often be expressed as LALR(1) grammars, and in cases where an LALR(1) grammar is insufficient, usually an LALR(2) grammar is adequate. If the parser generator handles only LALR(1) grammars, then the LALR parser will have to interface with some hand-written code when it encounters the special LALR(2) situation in the input language.

‹ Canonical LR (LR(1)) parser - A canonical LR parser or LR(1) parser is an LR parser whose parsing tables are constructed in a similar way as with LR(0) parsers except that the items in the item sets also contain a lookahead, i.e., a terminal that is expected by the parser after the right-hand side of the rule. For example, such an item for a rule A → B C might be [A → B · C, a], which would mean that the parser has read a string corresponding to B and expects next a string corresponding to C followed by the terminal 'a'. LR(1) parsers can deal with a very large class of grammars, but their parsing tables are often very big. This can often be solved by merging item sets if they are identical except for the lookahead, which results in so-called LALR parsers.

• GLR parser - A GLR parser (GLR standing for "generalized LR", where L stands for "left-to-right" and R stands for "rightmost (derivation)") is an extension of an LR parser algorithm to handle nondeterministic and ambiguous grammars. The theoretical foundation was provided in a 1974 paper by Bernard Lang (along with other general context-free parsers such as GLL). It describes a systematic way to produce such algorithms, and provides uniform results regarding correctness proofs, complexity with respect to grammar classes, and optimization techniques.

SUMMARY

'Parsing' is the term used to describe the process of automatically building syntactic analyses of a sentence in terms of a given grammar and lexicon. The resulting


syntactic analyses may be used as input to a process of semantic interpretation (or perhaps phonological interpretation, where aspects of this, like prosody, are sensitive to syntactic structure). Occasionally, 'parsing' is also used to include both syntactic and semantic analysis. We use it in the more conservative sense here, however. In most contemporary grammatical formalisms, the output of parsing is something logically equivalent to a tree, displaying dominance and precedence relations between constituents of a sentence, perhaps with further annotations in the form of attribute-value equations ('features') capturing other aspects of linguistic description. However, there are many different possible linguistic formalisms, and many ways of representing each of them, and hence many different ways of representing the results of parsing. We shall assume here a simple tree representation, and an underlying context-free grammatical (CFG) formalism. However, all of the algorithms described here can usually be used for more powerful unification-based formalisms, provided these retain a context-free 'backbone', although in these cases their complexity and termination properties may be different.

Parsing algorithms are usually designed for classes of grammar rather than tailored towards individual grammars. There are several important properties that a parsing algorithm should have if it is to be practically useful. It should be 'sound' with respect to a given grammar and lexicon; that is, it should not assign to an input sentence analyses which cannot arise from the grammar in question. It should also be 'complete'; that is, it should assign to an input sentence all the analyses it can have with respect to the current grammar and lexicon. Ideally, the algorithm should also be 'efficient', entailing the minimum of computational work consistent with fulfilling the first two requirements, and 'robust': behaving in a reasonably sensible way when presented with a sentence that it is unable to analyze successfully.

DISCLOSURE STATEMENT

There is no financial support for this research work from any funding agency.

ACKNOWLEDGMENTS

We thank our guide for his timely help, outstanding ideas, and encouragement to finish this research work successfully.

SIDE BAR

Comparison: an act of assessment or evaluation of things side by side in order to see to what extent they are similar or different. It is used to bring out similarities or differences between two things of the same type, mostly to discover essential features or meaning, either scientifically or otherwise.

Content: the amount of things contained in something; things written or spoken in a book, an article, a programme, a speech, etc.

DEFINITION

• Mutual - held in common by two or more parties.
• Hierarchical - a system in which members of an organization or society are ranked according to relative status or authority.
• Segment - each of the parts into which something is or may be divided.
• Component - a part or element of a larger whole, especially a part of a machine or vehicle.



REFERENCES

[1] "Bartleby.com homepage". Retrieved 28 November 2010.
[2] "parse". dictionary.reference.com. Retrieved 27 November 2010.
[3] "Grammar and Composition".
[4] Aho, A.V., Sethi, R. and Ullman, J.D. (1986) "Compilers: Principles, Techniques, and Tools." Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[5] Frost, R., Hafiz, R. and Callaghan, P. (2007) "Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars." 10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE, pages 109-120, June 2007, Prague.
[6] Frost, R., Hafiz, R. and Callaghan, P. (2008) "Parser Combinators for Ambiguous Left-Recursive Grammars." 10th International Symposium on Practical Aspects of Declarative Languages (PADL), ACM-SIGPLAN, Volume 4902/2008, pages 167-181, January 2008, San Francisco.
[7] Knuth, D. E. (July 1965). "On the translation of languages from left to right". Information and Control 8 (6): 607-639. doi:10.1016/S0019-9958(65)90426-2. Retrieved 29 May 2011.
[8] Language theoretic comparison of LL and LR grammars.
[9] Engineering a Compiler (2nd Edition), by Keith Cooper and Linda Torczon, Morgan Kaufmann, 2011.
[10] Crafting a Compiler, by Charles Fischer, Ron Cytron, and Richard LeBlanc, Addison-Wesley, 2009.
[11] Flex & Bison: Text Processing Tools, by John Levine, O'Reilly Media, 2009.
[12] Compilers: Principles, Techniques, and Tools (2nd Edition), by Alfred Aho, Monica Lam, Ravi Sethi, and Jeffrey Ullman, Prentice Hall, 2006.

