High-Performance Multi-Pass Unification Parsing

Paul Wesley Placeway
May 14, 2002
CMU-LTI-02-172

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Thesis Committee:
Eric Nyberg, Chair
Jaime Carbonell
Alon Lavie
Robert Bobrow, BBN Technologies

Copyright © 2002 Paul Wesley Placeway

This research was supported in part by Carnegie Mellon University. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of Carnegie Mellon University.

Keywords: Parsing, Unification, Ambiguity

For Mary, Mom and Dad.

Abstract

Parsing natural language is an attempt to discover some structure in a text (or textual representation) generated by a person. This structure can be put to a variety of uses, including machine translation, grammar conformance checking, and determination of prosody in text-to-speech tasks.

Recent theories of Syntax use Unification to better describe the intricacies of natural language [137]. For parsing systems, unification techniques have been either added to a context-free base system [152, 40, 4, 23], or replaced the context-free base entirely [118, 135, 45] (possibly putting it back later [136]). The seemingly small step of adding unification has opened a Pandora’s Box of computational complexity, increasing the difficulty of the problem from polynomial [48] to somewhere between NP-complete and intractable, depending on the details of the unification system and how it was added [10]. Worse, unification on a context-free base parser can break the packing technique used to address the problem of ambiguity, leading to exponential blow-ups of the parser’s performance in both space and time in practice.
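To make the terms above concrete, here is a minimal sketch in Python. It assumes feature structures are plain nested dictionaries with string atoms, and it is an illustration only, not the parser described in this thesis. The names `unify`, `pack`, `by_category`, and `by_value`, and the sample features, are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple, Union

# Assumption: a feature structure is an atom (str) or a dict of named features.
FS = Union[str, dict]
# Assumption: an edge is (category, start, end, flat feature dict).
Edge = Tuple[str, int, int, dict]


def unify(a: FS, b: FS) -> Optional[FS]:
    """Return the unification of two feature structures, or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy, so the inputs are not modified
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # a clash anywhere fails the whole unification
                result[key] = merged
            else:
                result[key] = b_val
        return result
    # Atoms (or an atom against a dict) unify only if they are equal.
    return a if a == b else None


def pack(edges: List[Edge], key: Callable[[Edge], Hashable]) -> Dict[Hashable, List[Edge]]:
    """Group edges that share the same packing key into one packed node."""
    packed: Dict[Hashable, List[Edge]] = {}
    for edge in edges:
        packed.setdefault(key(edge), []).append(edge)
    return packed


def by_category(edge: Edge) -> Hashable:
    """Context-free packing key: category and span only."""
    cat, start, end, _features = edge
    return (cat, start, end)


def by_value(edge: Edge) -> Hashable:
    """Packing key that also distinguishes edges by their feature values."""
    cat, start, end, features = edge
    return (cat, start, end, tuple(sorted(features.items())))


subject = {"cat": "NP", "agr": {"num": "sg", "per": "3"}}
print(unify(subject, {"agr": {"num": "sg"}}))  # compatible: features are merged
print(unify(subject, {"agr": {"num": "pl"}}))  # clash on "num": None

# Two analyses of the same span that differ only in a feature value.
edges = [("NP", 0, 3, {"num": "sg"}), ("NP", 0, 3, {"num": "pl"})]
print(len(pack(edges, by_category)))  # one packed node
print(len(pack(edges, by_value)))     # two separate nodes
```

Under these assumptions, packing on category and span alone keeps the two analyses in a single node, while packing that must respect feature values keeps them apart. Repeated across a sentence, that loss of sharing is the kind of growth the paragraph above refers to.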
I propose the use of a multi-pass strategy to avoid these problems in practice. I describe a parser which combines the use of shallow, simple value unification with some approximation techniques in order to find a covering packed parse-forest. This parse-forest is then searched for a single-best fully-unifying value; the scoring system which drives the heuristic search encodes linguistically-based disambiguation preferences.

The resulting two-pass parser is compared to an ordinary single-pass parser in the context of a heavy-weight knowledge-based machine translation system. The two-pass parser is shown to be competitive with the single-pass parser on average data, both in terms of time and space. It is also shown to be able to avoid a common class of ambiguity blow-up that the single-pass parser is subject to. These results indicate that the multi-pass technique, interleaving some of the unification equations in the parse, is the superior approach for heavy-weight unification parsing.

Acknowledgements

I would like to thank the many people without whom this work would not have been possible:

My advisor, Eric Nyberg, for asking the critical question “Why do these sentences take so long?”, many technical and philosophical discussions, and for help in turning my writing into English.

The members of my committee, Jaime Carbonell, Alon Lavie, and Robert “Rusty” Bobrow, for guidance in setting the technical direction of this work, many useful discussions, and their patience.

Robert Moore, for technical advice related to high-performance context-free parsing.

Kathy Baker, for supporting the KANT grammar, and for helpful technical discussions.

Krzysztof Czuba, for providing the Broadcast News grammar and test set, and many helpful technical discussions.

David Svoboda, for help in organizing the Catalyst 10,000 sentence test corpus.

Robert Igo, for organizing the FOATS regression test corpus.

The many other members of the KANT team.
Finally, my wife Mary Placeway, for her love and support throughout the adventure of graduate school.

Contents

1 Introduction
  1.1 Introduction
    1.1.1 Statement of Thesis
    1.1.2 Summary of Contributions
    1.1.3 Motivation: Why parsing is useful?
    1.1.4 As part of Knowledge-based translation
    1.1.5 Checking conformance to a restricted language
    1.1.6 Discovering prosody in text-to-speech
  1.2 Dissertation Overview

2 Background
  2.1 General Background
    2.1.1 Preliminaries
    2.1.2 Families of grammars
    2.1.3 Context-Free Parsing
    2.1.4 Unification Parsing
    2.1.5 Unification grammars are computationally powerful
    2.1.6 Parsing as Constraint Satisfaction
  2.2 Parsing Applications
    2.2.1 Machine Translation Systems

3 Unification Parsing
  3.1 Unification Parsing
    3.1.1 Pure Unification Parsing
    3.1.2 Unification Parsing on Context Free Spine
  3.2 About Pseudo-Unification
  3.3 Parsing and Ambiguity
    3.3.1 Ambiguity is a problem for context-free parsing
    3.3.2 Context free parsing with packing
    3.3.3 Context free parsing with packing and unification
    3.3.4 Ambiguity inherently causes disjunction
    3.3.5 Solution to ambiguity Packing is not Subsumption
    3.3.6 Problems with Packing in Disjunctions
    3.3.7 Pseudo-unification and Disjunction

4 Delayed Unification Parsing
  4.1 Delaying Unification Until After Parsing
    4.1.1 Interleaved unification Versus Delayed unification
  4.2 Delaying Some Unification Until After Parsing
    4.2.1 Negative Restriction
  4.3 The Two Purposes of Interleaved Unification
    4.3.1 ‘Cheating’ in the interleaved unification
    4.3.2 Our unification approach

5 Overview of the Approach
  5.1 Conceptual Design
    5.1.1 Don’t try to do the parse all in one shot.
    5.1.2 Don’t keep the unification values from the parse phase.
    5.1.3 Don’t follow the grammar precisely early in the process.
    5.1.4 Don’t try to find all possible final unification values.
    5.1.5 Don’t pick just any single unification value; pick a good one.
  5.2 System Requirements
  5.3 System Architecture
    5.3.1 Preprocessing
    5.3.2 Run-time Processing
  5.4 Evaluation during Development
    5.4.1 Development test conditions
    5.4.2 Development hardware
    5.4.3 Run-time Performance and Optimization Priorities
  5.5 Summary

6 Efficient Chart Parsing
  6.1 Motivation
    6.1.1 Chapter Outline
  6.2 Prior Context-Free Parsers
  6.3 The Tree-Structured Grammar
    6.3.1 Building a Tree-Structured Grammar
    6.3.2 Using a Tree-Structured Grammar
    6.3.3 Previous Approaches
  6.4 Left-Corner and Look-ahead Filtering
    6.4.1 Left-Corner Constraint
    6.4.2 Look-ahead Constraint
    6.4.3 Left-corner of Look-ahead
  6.5 Other Parser Features
    6.5.1 Complete algorithm
  6.6 Context-free Parsing Results
    6.6.1 Discussion of results

7 Pseudo-Unification: Implementation and Optimization
  7.1 Introduction to Pseudo-Unification
    7.1.1 On Interpreting Pseudo-Unification
  7.2 Modifications to the Pseudo-Unifier
    7.2.1 ‘Gray-Box’ Adaptation
    7.2.2 Handling of Data Disjunctions
    7.2.3 Explicit No-Value Values
    7.2.4 Wild-Carded Values
    7.2.5 Complements of Unifications
    7.2.6 Explicit over-write value equation
  7.3 Compilation and Optimization of Pseudo-Unification
    7.3.1 Unwinding of Conditional ORs
    7.3.2 Disjunction Flattening
    7.3.3 Multiple-Value Strength Reduction
  7.4 Shallow Pseudo-Unification as a First-Pass Filter
    7.4.1 Wild-carding deep structure assignments
    7.4.2 Pseudo-Optimizations for Shallow Unification
    7.4.3 Effectiveness of Shallow Approximate Unification
  7.5 Optimizations That Did Not Help
    7.5.1 Approximated unification packing in disjunctions
    7.5.2 Length limits in approximate packing
    7.5.3 Vector Unifier (is not faster)

8 Post-parse Search
  8.1 Introduction
    8.1.1 Previous Approaches
    8.1.2 Method of Attack
  8.2 The Search Component
    8.2.1 Best-First Search
    8.2.2 Searching a parse forest
    8.2.3 An All-Paths Search of a Parse Forest
    8.2.4 A Backtracking Greedy search for a best parse
    8.2.5 Full branch-and-bound search
    8.2.6 N-Best search
  8.3 Disambiguation Cost Calculator ...