

Journal Of Harmonized Research (JOHR)

Journal Of Harmonized Research in Engineering 1(2), 2013, 73-79 ISSN 2347 – 7393

Original Research Article

PARSING: PROCESS OF ANALYZING WITH THE RULES OF A FORMAL GRAMMAR

Nikita Chhillar, Nisha Yadav, Neha Jaiswal

Department of Computer Science and Engineering, Dronacharya College of Engineering, Khentawas, Farukhnagar, Gurgaon, India

Abstract: Parsing is the process of structuring a linear representation in accordance with a given grammar. The "linear representation" may be a sentence, a computer program, a knitting pattern, a sequence of geological strata, a piece of music, actions in ritual behavior; in short, any linear sequence in which the preceding elements in some way restrict the next element. For some of these examples the grammar is well known, for some it is an object of research, and for some our notion of a grammar is only just beginning to take shape.

Keywords: Parsing, Grammar, Bottom-up parsing, Top-down parsing, Parser.

Introduction:
Parsing or syntactic analysis is the process of analyzing a string of symbols, either in natural language or in a computer language, according to the rules of a formal grammar. The term parsing comes from Latin pars, meaning part (of speech).
The term has slightly different meanings in different branches of linguistics and computer science. Traditional sentence parsing is often performed as a method of understanding the exact meaning of a sentence, sometimes with the aid of devices such as sentence diagrams. Within computational linguistics the term is used to refer to the formal analysis by computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic and other information.
The term is also used in psycholinguistics when describing language comprehension.

In this context, parsing refers to the way that human beings analyze a sentence or phrase (in spoken language or text) "in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc." This term is especially common when discussing what linguistic cues help speakers to interpret garden-path sentences.

For Correspondence:
nikitachhillarATyahoo.com
Received on: October 2013
Accepted after revision: December 2013
Downloaded from: www.johronline.com


Within computer science, the term is used in the analysis of computer languages, referring to the syntactic analysis of the input code into its component parts in order to facilitate the writing of compilers and interpreters.

1) Traditional methods:
The traditional grammatical exercise of parsing, sometimes known as clause analysis, involves splitting a text into its component parts of speech with an explanation of the form, purpose, and syntactic relationship of each part. This is determined in large part from study of the language's conjugations and declensions, which can be quite complex for heavily inflected languages. To parse a phrase such as 'Kittu saw monkey' involves noting that the singular noun 'Kittu' is the subject of the sentence, the verb 'saw' is the third person singular of the past tense of the verb 'to see', and the singular noun 'monkey' is the object of the sentence. Techniques such as sentence diagrams are used to indicate the relations between elements in the sentence.

2) Computational methods:
In some machine translation and natural language processing systems, written texts in human languages are parsed by computer programs. Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language, whose usage is to convey meaning (or semantics) across a potentially unlimited range of possibilities, only some of which are germane to the particular case. So an utterance "Kittu saw monkey" versus "Monkey saw Kittu" is definite on one detail, but in another language might appear as "Kittu monkey saw", with a reliance on the larger context to distinguish between those two possibilities, if indeed that difference was of concern. It is difficult to prepare formal rules to describe informal behavior, even though it is clear that some rules are being followed.
To parse natural language data, researchers must first agree on the grammar to be used. The selection is affected by both linguistic and computational concerns; for example, some parsing systems make use of lexical functional grammar, whereas in general, parsing for grammars of this type is identified as NP-complete. Head-driven phrase structure grammar is another linguistic formalism that has been accepted in the parsing community, but further research efforts have focused on simpler formalisms such as the one used in the Penn Treebank. Shallow parsing aims to locate only the boundaries of major constituents such as noun phrases. Another popular strategy for avoiding linguistic controversy is dependency grammar parsing.
Most modern parsers are at least partially statistical; that is, they rely on a body of training data which has already been annotated (parsed by hand). This approach permits the system to gather information about the frequency with which different constructions occur in specific contexts. Approaches which have been used include straightforward PCFGs (probabilistic context-free grammars), maximum entropy, and neural nets. Most of the successful systems use lexical statistics (that is, they consider the identities of the words involved, as well as their part of speech). However, such systems are vulnerable to overfitting and need some kind of smoothing to be effective.
Parsing algorithms used for natural language cannot rely on the grammar having 'good' properties as with manually designed grammars for programming languages. As mentioned before, some grammar formalisms are very difficult to parse computationally; in general, even if the desired structure is not context-free, some type of context-free approximation to the grammar is used to perform a first pass. Algorithms which make use of context-free grammars often rely on some variant of the CKY algorithm, usually with some heuristic to prune away unlikely analyses and save time. However, some systems trade speed for accuracy using, for example, linear-time versions of the shift-reduce algorithm. A recent development has been parse reranking, in which the parser proposes some large number of analyses and a more complex system picks the best option.
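As an illustration of the CKY-style chart parsing mentioned above, the following is a minimal recognizer for a toy grammar in Chomsky normal form. The grammar, the category names, and the example sentence are assumptions made for demonstration and are not taken from the paper; a probabilistic (PCFG) version would simply attach weights to the rules and keep the best-scoring analyses.

```python
# Minimal CKY recognition for a toy grammar in Chomsky normal form.
# Grammar and sentence are illustrative only.

from itertools import product

# Rules in CNF: A -> B C (binary) or A -> 'word' (lexical).
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
}
LEXICAL_RULES = {
    "Kittu": {"NP"},
    "monkey": {"NP"},
    "saw": {"V"},
}

def cky_recognize(words, start="S"):
    n = len(words)
    # chart[i][j] holds the non-terminals that can span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICAL_RULES.get(word, set()))
    for span in range(2, n + 1):                  # width of the span
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # split point
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY_RULES.get((b, c), set())
    return start in chart[0][n]

print(cky_recognize(["Kittu", "saw", "monkey"]))  # True
print(cky_recognize(["saw", "Kittu"]))            # False
```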



3) Overview of process:
The following example demonstrates the general case of parsing a computer language with two levels of grammar: lexical and syntactic.
The first step is token generation, or lexical analysis, in which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^, 2, each of which is a meaningful symbol in the context of an arithmetic expression. The lexer would contain rules telling it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens such as "12*" or "(3" will not be generated.
The next stage is parsing or syntactic analysis, which checks that the tokens form an allowable expression. This is generally done with reference to a context-free grammar which recursively defines the parts that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with attribute grammars.
The last phase is semantic parsing or analysis, which works out the implications of the expression just validated and takes the suitable action. A calculator or interpreter would evaluate the expression or program; a compiler, on the other hand, would generate some kind of code. Attribute grammars can also be used to describe these actions.

Types of parsing:
The task of the parser is basically to determine how the input can be derived from the start symbol of the grammar. This can be done in basically two ways:
• Top-down parsing - Top-down parsing can be viewed as an attempt to find the left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding every alternative right-hand side of the grammar rules.
• Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is shift-reduce parsing.
LL parsers and recursive-descent parsers are examples of top-down parsers that cannot accommodate left-recursive production rules. Although it has been assumed that simple implementations of top-down parsing cannot accommodate direct and indirect left recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars, more sophisticated algorithms for top-down parsing have been produced by Frost, Hafiz, and Callaghan which accommodate ambiguity and left recursion in polynomial time and which generate polynomial-size representations of the potentially exponential number of parse trees. Their algorithm is able to produce both left-most and right-most derivations of an input with respect to a given CFG (context-free grammar).
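A minimal sketch of the lexical-analysis step described above, assuming calculator-style token definitions given as regular expressions. The token names and patterns here are illustrative choices, not taken from the paper.

```python
# Split the calculator input from the text into tokens using regular expressions.

import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),      # integer literals such as 12
    ("OP",     r"[*+^]"),    # the operators mentioned in the text
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),      # whitespace, discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Split the input character stream into meaningful symbols (tokens)."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenize("12*(3+4)^2"))
# [('NUMBER', '12'), ('OP', '*'), ('LPAREN', '('), ('NUMBER', '3'),
#  ('OP', '+'), ('NUMBER', '4'), ('RPAREN', ')'), ('OP', '^'), ('NUMBER', '2')]
```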



An important distinction with regard to parsers is whether a parser produces a leftmost derivation or a rightmost derivation. LL parsers will produce a leftmost derivation and LR parsers will produce a rightmost derivation (although usually in reverse).

Top-down parsing:
Top-down parsing is a parsing approach where one begins with the topmost level of the parse tree and works downward using the rewriting rules of a formal grammar. LL parsers are a kind of parser that makes use of the top-down parsing approach.
• Top-down parsing is a strategy of analyzing unknown data relationships by hypothesizing general parse tree structures and then considering whether the known fundamental structures are compatible with the hypothesis. It occurs in the analysis of both natural languages and computer languages.
• Top-down parsing can be viewed as an attempt to find the left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules.
• Simple implementations of top-down parsing do not terminate for left-recursive grammars, and top-down parsing may have exponential time complexity with respect to the length of the input for ambiguous CFGs. However, more complicated top-down parsers have been created by Frost, Hafiz, and Callaghan which do accommodate ambiguity and left recursion in polynomial time and which generate polynomial-sized representations of the potentially exponential number of parse trees.

Application:
A compiler parses input from a programming language to assembly language or an internal representation by matching the incoming symbols to production rules. Production rules are defined using Backus-Naur form. An LL parser is a kind of parser that does top-down parsing by applying each production rule to the incoming symbols, working from the left-most symbol yielded by a production rule and then proceeding to the next production rule for every non-terminal symbol encountered. In this way the parsing begins on the left of the result side (right-hand side) of the production rule and evaluates non-terminals from the left first, thus moving down the parse tree for every new non-terminal before continuing to the next symbol of a production rule. For example, given a set of production rules, the parser would match the first alternative of a rule and attempt to match the following symbols in turn; if a symbol cannot be matched, the next alternative would be tried. As one may suppose, some languages are more ambiguous than others. For a non-ambiguous language in which all productions for a non-terminal produce distinct strings, the string produced by one production will not begin with the same symbol as the string produced by another production. A non-ambiguous language may be parsed by an LL(1) grammar, where the (1) signifies that the parser reads ahead one token at a time. For an ambiguous language to be parsed by an LL parser, the parser must look ahead more than one symbol, e.g. LL(3).
The common way out of this problem is to use an LR parser, which is a kind of shift-reduce parser and does bottom-up parsing.

Accommodating left recursion in top-down parsing:
A formal grammar which contains left recursion cannot be parsed by a naive recursive-descent parser unless it is converted to a weakly equivalent right-recursive form. However, recent research reveals that it is possible to accommodate left-recursive grammars (along with all other forms of general CFGs) in a more sophisticated top-down parser by use of curtailment.
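For illustration, a left-recursive rule such as expr -> expr '+' term can be rewritten into the weakly equivalent right-recursive/iterative form expr -> term ('+' term)*, which a naive top-down parser handles without looping. The following is a minimal recursive-descent sketch of that idea for an assumed toy expression grammar; the grammar and function names are illustrative, not taken from the paper.

```python
# Recursive-descent (top-down, LL(1)-style) parsing with left recursion
# eliminated: expr -> term ('+' term)* ; term -> factor ('*' factor)* ;
# factor -> NUMBER | '(' expr ')'

def parse_expr(tokens, pos=0):
    """expr -> term ('+' term)* ; returns (value, next_position)."""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_term(tokens, pos + 1)
        value += rhs
    return value, pos

def parse_term(tokens, pos):
    """term -> factor ('*' factor)*"""
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        rhs, pos = parse_factor(tokens, pos + 1)
        value *= rhs
    return value, pos

def parse_factor(tokens, pos):
    """factor -> NUMBER | '(' expr ')' ; one token of lookahead picks the rule."""
    if tokens[pos] == "(":
        value, pos = parse_expr(tokens, pos + 1)
        assert tokens[pos] == ")", "expected closing parenthesis"
        return value, pos + 1
    return int(tokens[pos]), pos + 1

print(parse_expr(["2", "*", "(", "3", "+", "4", ")"]))  # (14, 7)
```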


A detection algorithm which accommodates ambiguous grammars and curtails an ever-growing direct left-recursive parse by imposing depth limitations with respect to input length and current input position is described by Frost and Hafiz in 2006. That algorithm was extended to a complete parsing algorithm which accommodates indirect (by comparing previously computed context with the current context) as well as direct left recursion in polynomial time, and which generates compact polynomial-size representations of the potentially exponential number of parse trees for highly ambiguous grammars, by Frost, Hafiz and Callaghan in 2007. The algorithm has since been implemented as a set of parser combinators written in the Haskell programming language.

Time and space complexity of top-down parsing:
When a top-down parser tries to parse an ambiguous input with respect to an ambiguous CFG, it may require an exponential number of steps (with respect to the length of the input) to try all alternatives of the CFG in order to produce all possible parse trees, which in due course would require exponential memory space. The problem of exponential time complexity in top-down parsers constructed as sets of mutually recursive functions was addressed by Norvig in 1991. His technique is similar to the use of dynamic programming and state sets in Earley's algorithm (1970), and tables in the CYK algorithm of Cocke, Younger and Kasami. The main idea is to store the result of applying a parser p at position j in a memo table and to reuse that result whenever the same situation arises. Frost, Hafiz and Callaghan also use memoization to avoid redundant computations, accommodating any form of CFG in polynomial time (Θ(n^4) for left-recursive grammars and Θ(n^3) for non-left-recursive grammars). Their top-down parsing algorithm also requires only polynomial space for potentially exponentially ambiguous parse trees, by means of 'compressed representation' and 'local grouping'.

Bottom-up parsing:
Parsing reveals the grammatical structure of linear input text, as a primary step in working out its meaning. Bottom-up parsing discovers and processes the text's lowest-level small elements first, before its mid-level structures, leaving the highest-level overall structure to last. Bottom-up parsing works in the reverse direction from top-down parsing. A top-down parser starts with the start symbol at the top of the parse tree and works downward, applying productions in the forward direction until it reaches the terminal leaves. A bottom-up parse begins with the string of terminals itself and builds from the leaves upward, working in reverse towards the start symbol by applying the productions backwards. A bottom-up parser searches for substrings of the working string which match the right-hand side of some production. When it finds such a substring, it substitutes the left-hand-side non-terminal for the matching right-hand side. The goal is to reduce all the way to the start symbol and report a successful parse. Generally, bottom-up parsing algorithms are more powerful than top-down methods but, not surprisingly, the constructions required are also more complex. It is difficult to write a bottom-up parser by hand for anything but a trivial grammar, but fortunately there are excellent parser generator tools that build a parser from an input specification, not unlike the way lex builds a scanner from its specification. Shift-reduce parsing is the most commonly used and most powerful of the bottom-up techniques. Bottom-up parsing takes as input a stream of tokens and develops the list of productions used to build the parse tree, but the productions are discovered in the reverse order relative to a top-down parser. Like a table-driven predictive parser, it makes use of a stack to keep track of the position in the parse and a parsing table to determine what to do next.

Bottom-up Versus Top-down:
The bottom-up name originally comes from the concept of a parse tree, in which the most detailed parts are at the bushy bottom of the (upside-down) tree, and larger structures composed from them appear in successively higher layers, until at the top or "root" of the tree a single unit describes the entire input stream. A bottom-up parse processes that tree starting from the bottom left end and incrementally works its way upwards and rightwards. A parser may act on the structure hierarchy's low, mid, and highest levels without ever creating an actual data tree; the tree is then merely implicit in the parser's actions.
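The memoisation idea described above, storing the result of applying a parser at a given input position and reusing it whenever the same situation recurs, can be sketched roughly as follows. The grammar, the names, and the representation are illustrative assumptions, not the algorithm from the cited papers, and a left-recursive rule would additionally need the curtailment discussed earlier.

```python
# Memoised top-down recognition: results are cached per (symbol, position).
# NOTE: this naive sketch does not handle left-recursive rules; those need
# the curtailment technique described in the text.

GRAMMAR = {
    # Each non-terminal maps to a list of alternative right-hand sides.
    "S":  [["NP", "VP"]],
    "NP": [["Kittu"], ["monkey"]],
    "VP": [["saw", "NP"]],
}

def make_recognizer(grammar):
    memo = {}  # (symbol, position) -> set of end positions

    def parse(symbol, words, pos):
        """Return every position at which `symbol` can end, starting at `pos`."""
        key = (symbol, pos)
        if key in memo:                       # reuse a previously computed result
            return memo[key]
        if symbol not in grammar:             # terminal: must match the next word
            ends = {pos + 1} if pos < len(words) and words[pos] == symbol else set()
        else:
            ends = set()
            for rhs in grammar[symbol]:
                positions = {pos}
                for part in rhs:              # thread positions through the RHS
                    positions = {e for p in positions for e in parse(part, words, p)}
                ends |= positions
        memo[key] = ends
        return ends

    return parse

parse = make_recognizer(GRAMMAR)
words = ["Kittu", "saw", "monkey"]
print(len(words) in parse("S", words, 0))  # True: the whole input derives from S
```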



Bottom-up parsing lazily waits until it has scanned and parsed all parts of some construct before committing to what the combined construct is.
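A naive illustration of this reduce-to-the-start-symbol behaviour, under an assumed toy grammar; a practical shift-reduce parser would instead drive the reductions with a stack and a parsing table, as noted above, rather than this blind search.

```python
# Repeatedly find a substring of the working string that matches the
# right-hand side of some production and replace it with the left-hand-side
# non-terminal, until only the start symbol remains. Toy grammar only.

PRODUCTIONS = [
    ("NP", ["Kittu"]),
    ("NP", ["monkey"]),
    ("VP", ["saw", "NP"]),
    ("S",  ["NP", "VP"]),
]

def bottom_up_parse(words, start="S"):
    symbols = list(words)
    while symbols != [start]:
        for lhs, rhs in PRODUCTIONS:
            # search for an occurrence of the right-hand side
            for i in range(len(symbols) - len(rhs) + 1):
                if symbols[i:i + len(rhs)] == rhs:
                    symbols[i:i + len(rhs)] = [lhs]   # reduce
                    print("reduce:", " ".join(rhs), "->", lhs, "giving", symbols)
                    break
            else:
                continue
            break
        else:
            return False    # no reduction applies: the parse fails
    return True

print(bottom_up_parse(["Kittu", "saw", "monkey"]))  # True
```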

[Figure: typical parse tree, with bottom-up parse steps and top-down parse steps, for the input A = B + *2; D = 1]

The opposite of bottom-up parsing methods are top-down parsing methods, in which the input's overall structure is decided (or guessed at) first, before dealing with the mid-level parts, leaving the lowest-level small details to last. A top-down parse processes the hierarchical tree starting from the top and incrementally works its way downwards and rightwards. Top-down parsing eagerly decides what a construct is much earlier, when it has only scanned the leftmost symbols of that construct and has not yet parsed any of its parts. Left corner parsing is a hybrid method which works bottom-up along the left edge of each subtree, and top-down on the rest of the parse tree.
If a grammar has multiple rules which start with the same leftmost symbol but have different endings, then the grammar can be handled by a deterministic bottom-up parse but cannot be handled top-down without guesswork and backtracking. So bottom-up parsers handle a larger range of computer language grammars than deterministic top-down parsers do. Bottom-up parsing is occasionally done by backtracking. But generally, bottom-up parsing is done by a shift-reduce parser such as a LALR parser.

4) Top-down parsers:
Parsers which use top-down parsing are:
• Recursive descent parser
• LL parser (Left-to-right, Leftmost derivation)

5) Bottom-up parsers:
Parsers which use bottom-up parsing are:
• Precedence parser
  o Operator-precedence parser
  o Simple precedence parser
• BC (bounded context) parsing
• LR parser (Left-to-right, Rightmost derivation)
  o Simple LR (SLR) parser
  o LALR parser
  o Canonical LR (LR(1)) parser
  o GLR parser
• CYK parser
• Recursive ascent parser

Conclusions:
Parsing splits a sequence of characters or letters into smaller parts. Parsing is also used for recognizing characters or letters that occur in a specific order. In addition to providing a strong, readable, and maintainable approach to pattern matching, parsing techniques enable you to create your own custom languages for specific purposes.

References:



1. Aho, Alfred V.; Sethi, Ravi; Ullman, Jeffrey D. (1986). Compilers: Principles, Techniques, and Tools (Repr. with corrections ed.). Addison-Wesley Pub. Co. ISBN 978-0201100884.
2. Aho, Alfred V.; Ullman, Jeffrey D. (1972). The Theory of Parsing, Translation, and Compiling (Volume 1: Parsing) (Repr. ed.). Englewood Cliffs, NJ: Prentice-Hall. ISBN 978-0139145568.
3. Frost, R., Hafiz, R. and Callaghan, P. (2007) "Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars." 10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE, pages 109-120, June 2007, Prague.
4. Frost, R., Hafiz, R. and Callaghan, P. (2008) "Parser Combinators for Ambiguous Left-Recursive Grammars." 10th International Symposium on Practical Aspects of Declarative Languages (PADL), ACM-SIGPLAN, Volume 4902/2008, pages 167-181, January 2008, San Francisco.
5. Frost, R. and Hafiz, R. (2006) "A New Top-Down Parsing Algorithm to Accommodate Ambiguity and Left Recursion in Polynomial Time." ACM SIGPLAN Notices, Volume 41, Issue 5, pages 46-54.
6. Norvig, P. (1991) "Techniques for automatic memoisation with applications to context-free parsing." Computational Linguistics, Volume 17, Issue 1, pages 91-98.
7. Tomita, M. (1985) "Efficient Parsing for Natural Language." Kluwer, Boston, MA.
8. "Bartleby.com homepage". Retrieved 28 November 2010.
9. "Parse". Dictionary.reference.com. Retrieved 27 November 2010.
10. "Grammar and Composition".
11. Aho, A.V., Sethi, R. and Ullman, J.D. (1986) "Compilers: Principles, Techniques, and Tools." Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
12. Frost, R., Hafiz, R. and Callaghan, P. (2007) "Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars." 10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE, pages 109-120, June 2007, Prague.
13. Frost, R., Hafiz, R. and Callaghan, P. (2008) "Parser Combinators for Ambiguous Left-Recursive Grammars." 10th International Symposium on Practical Aspects of Declarative Languages (PADL), ACM-SIGPLAN, Volume 4902/2008, pages 167-181, January 2008, San Francisco.
14. shproto.org
15. Compilers: Principles, Techniques, and Tools (2nd Edition), by Alfred Aho, Monica Lam, Ravi Sethi, and Jeffrey Ullman, Prentice Hall, 2006.
