Developing Language Processing Components with GATE Version 4 (a User Guide)

For GATE version 4.0 (July 2007) (built July 12, 2007)

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Cristian Ursu, Marin Dimitrov, Mike Dowman, Niraj Aswani, Ian Roberts, Yaoyong Li, Andrey Shafirin

© The University of Sheffield 2001–2007

http://gate.ac.uk/

HTML version: http://gate.ac.uk/sale/tao/

Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT) and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), and several EU-funded projects (SEKT, TAO, NeOn, MediaCampaign, MUSING, KnowledgeWeb, PrestoSpace, h-TechSight, enIRaF).

Contents

1 Introduction
  1.1 How to Use This Text
  1.2 Context
  1.3 Overview
    1.3.1 Developing and Deploying Language Processing Facilities
    1.3.2 Built-in Components
    1.3.3 Additional Facilities
    1.3.4 An Example
  1.4 Structure of the Book
  1.5 Further Reading

2 Change Log
  2.1 Version 4.0 (July 2007)
    2.1.1 Major new features
    2.1.2 Other new features and improvements
    2.1.3 Bug fixes and optimizations
  2.2 Version 3.1 (April 2006)
    2.2.1 Major new features
    2.2.2 Other new features and improvements
    2.2.3 Bug fixes
  2.3 January 2005
  2.4 December 2004
  2.5 September 2004
  2.6 Version 3 Beta 1 (August 2004)
  2.7 July 2004
  2.8 June 2004
  2.9 April 2004
  2.10 March 2004
  2.11 Version 2.2 – August 2003
  2.12 Version 2.1 – February 2003
  2.13 June 2002

3 How To...
  3.1 Download GATE
  3.2 Install and Run GATE
    3.2.1 The Easy Way
    3.2.2 The Hard Way (1)
    3.2.3 The Hard Way (2): Subversion
  3.3 [D,F] Use System Properties with GATE
  3.4 [D,F] Use (CREOLE) Plug-ins
  3.5 Troubleshooting
  3.6 [D] Get Started with the GUI
  3.7 [D,F] Configure GATE
    3.7.1 [F] Save Config Data to gate.xml
  3.8 Build GATE
  3.9 [D,F] Create a New CREOLE Resource
  3.10 [F] Instantiate CREOLE Resources
  3.11 [D] Load CREOLE Resources
    3.11.1 Loading Language Resources
    3.11.2 Loading Processing Resources
    3.11.3 Loading and Processing Large Corpora
  3.12 [D,F] Configure CREOLE Resources
  3.13 [D] Create and Run an Application
  3.14 [D] Run PRs Conditionally on Document Features
  3.15 [D] View Annotations
  3.16 [D] Do Information Extraction with ANNIE
  3.17 [D] Modify ANNIE
  3.18 [D] Create and Edit Test Data
    3.18.1 Saving the test data
  3.19 [D,F] Create a New Annotation Schema
  3.20 [D] Save and Restore LRs in Data Stores
  3.21 [D] Save Resource Parameter State to File
  3.22 [D,F] Perform Evaluation with the AnnotationDiff tool
    3.22.1 GUI
  3.23 [D] Use the Corpus Benchmark Evaluation tool
    3.23.1 GUI mode
    3.23.2 How to define the properties of the benchmark tool
  3.24 [D] Write JAPE Grammars
  3.25 [F] Embed NLE in other Applications
  3.26 [F] Use GATE within a Spring application
  3.27 [F] Use GATE within a Tomcat Web Application
    3.27.1 Recommended Directory Structure
    3.27.2 Configuration files
    3.27.3 Initialization code
  3.28 [F] Use GATE in a Multithreaded Environment
  3.29 [D,F] Add support for a new document format
  3.30 [D] Dump Results to File
  3.31 [D] Stop GUI ‘Freezing’ on Linux
  3.32 [D] Stop GUI Crashing on Linux
  3.33 [D] Stop GATE Restoring GUI Sessions/Options
  3.34 Work with Unicode
  3.35 Work with Oracle and PostgreSQL

4 CREOLE: the GATE Component Model
  4.1 The Web and CREOLE
  4.2 Java Beans: a Simple Component Architecture
  4.3 The GATE Framework
  4.4 Language Resources and Processing Resources
  4.5 The Lifecycle of a CREOLE Resource
  4.6 Processing Resources and Applications
  4.7 Language Resources and Datastores
  4.8 Built-in CREOLE Resources

5 Visual CREOLE
  5.1 Gazetteer Visual Resource - GAZE
    5.1.1 Running Modes
    5.1.2 Loading a Gazetteer
    5.1.3 Linear Definition Pane
    5.1.4 Linear Definition Toolbar
    5.1.5 Operations on Linear Definition Nodes
    5.1.6 Gazetteer List Pane
    5.1.7 Mapping Definition Pane
  5.2 Ontogazetteer
    5.2.1 Gazetteer Lists Editor and Mapper
    5.2.2 Ontogazetteer Editor
  5.3 The Co-reference Editor

6 Language Resources: Corpora, Documents and Annotations
  6.1 Features: Simple Attribute/Value Data
  6.2 Corpora: Sets of Documents plus Features
  6.3 Documents: Content plus Annotations plus Features
  6.4 Annotations: Directed Acyclic Graphs
    6.4.1 Annotation Schemas
    6.4.2 Examples of Annotated Documents
    6.4.3 Creating, Viewing and Editing Diverse Annotation Types
  6.5 Document Formats
    6.5.1 Detecting the right reader
    6.5.2 XML
    6.5.3 HTML
    6.5.4 SGML
    6.5.5 Plain text
    6.5.6 RTF
    6.5.7 Email
  6.6 XML Input/Output

7 JAPE: Regular Expressions Over Annotations
  7.1 Use of Context
  7.2 Use of Priority
  7.3 Useful tricks
  7.4 Ontology aware grammar transduction
  7.5 Using Java code in JAPE rules
    7.5.1 Adding a feature to the document
    7.5.2 Using named blocks
    7.5.3 Java RHS overview
  7.6 Optimising for speed
  7.7 Serializing JAPE Transducer
    7.7.1 How to serialize?
    7.7.2 How to use the serialized grammar file?
  7.8 The JAPE Debugger
    7.8.1 Debugger GUI
    7.8.2 Using the Debugger
    7.8.3 Known Bugs

8 ANNIE: a Nearly-New Information Extraction System
  8.1 Tokeniser
    8.1.1 Tokeniser Rules
    8.1.2 Token Types
    8.1.3 English Tokeniser
  8.2 Gazetteer
  8.3 Sentence Splitter
  8.4 Part of Speech Tagger
  8.5 Semantic Tagger
  8.6 Orthographic Coreference (OrthoMatcher)
    8.6.1 GATE Interface
    8.6.2 Resources
    8.6.3 Processing
  8.7 Pronominal Coreference
    8.7.1 Quoted Speech Submodule
    8.7.2 Pleonastic It submodule
    8.7.3 Pronominal Resolution Submodule
    8.7.4 Detailed description of the algorithm
  8.8 A Walk-Through Example
    8.8.1 Step 1 - Tokenisation
    8.8.2 Step 2 - List Lookup
    8.8.3 Step 3 - Grammar Rules

9 (More CREOLE) Plugins
  9.1 Document Reset
  9.2 Verb Group Chunker
  9.3 Noun Phrase Chunker
    9.3.1 Differences from the Original
    9.3.2 Using the Chunker
  9.4 OntoText Gazetteer
    9.4.1 Prerequisites
    9.4.2 Setup
  9.5 Flexible Gazetteer
  9.6 Gazetteer List Collector
  9.7 TreeTagger
    9.7.1 POS tags
  9.8 Stemmer
    9.8.1 Algorithms
  9.9 GATE Morphological Analyzer
    9.9.1 Rule File
  9.10 MiniPar Parser
    9.10.1 Platform Supported
    9.10.2 Resources
    9.10.3 Parameters
    9.10.4 Prerequisites
    9.10.5 Grammatical Relationships
  9.11 RASP Parser
  9.12 SUPPLE Parser (formerly BuChart)
    9.12.1 Requirements
    9.12.2 Building SUPPLE
    9.12.3 Running the parser in GATE
    9.12.4 Viewing the parse tree
    9.12.5 System properties
    9.12.6 Configuration files
    9.12.7 Parser and Grammar
    9.12.8 Mapping Named Entities
    9.12.9 Upgrading from BuChart to SUPPLE
  9.13 Montreal Transducer
    9.13.1 Main Improvements
    9.13.2 Main Bug fixes
  9.14 Language Plugins
    9.14.1 French Plugin
    9.14.2 German Plugin
    9.14.3 Romanian Plugin
    9.14.4 Arabic Plugin
    9.14.5 Chinese Plugin
  9.15 Chemistry Tagger
    9.15.1 Using the tagger
  9.16 Flexible Exporter
  9.17 Annotation Set Transfer
  9.18 Information Retrieval in GATE
    9.18.1 Using the IR functionality in GATE
    9.18.2 Using the IR API
  9.19 Crawler
    9.19.1 Using the Crawler PR
  9.20 Google Plugin
    9.20.1 Using the GooglePR
  9.21 Yahoo Plugin
    9.21.1 Using the YahooPR
  9.22 WordNet in GATE
    9.22.1 The WordNet API
  9.23 Machine Learning in GATE
    9.23.1 ML Generalities
    9.23.2 The Machine Learning PR in GATE
    9.23.3 The WEKA Wrapper
    9.23.4 Training an ML model with the ML PR and WEKA wrapper
    9.23.5 Applying a learnt model
    9.23.6 The MAXENT Wrapper
    9.23.7 The SVM Light Wrapper
  9.24 MinorThird
  9.25 MIAKT NLG Lexicon
    9.25.1 Complexity and Generality
  9.26 Kea - Automatic Keyphrase Detection
    9.26.1 Using the “KEA Keyphrase Extractor” PR
    9.26.2 Using Kea corpora
  9.27 Ontotext JapeC Compiler
  9.28 ANNIC
    9.28.1 Instantiating SSD
    9.28.2 Search GUI
    9.28.3 Using SSD from your code

10 Working with Ontologies
  10.1 Data Model for Ontologies
    10.1.1 Hierarchies of classes and restrictions
    10.1.2 Instances
    10.1.3 Hierarchies of properties
  10.2 Ontology Event Model (new in GATE 4)
    10.2.1 What happens when a resource is deleted?
  10.3 OWLIM Ontology LR
  10.4 GATE’s Ontology Editor
  10.5 Instantiating OWLIM Ontology using GATE API
  10.6 Ontology-Aware JAPE Transducer
  10.7 Annotating text with Ontological Information
  10.8 Populating Ontologies
  10.9 Ontology-based Corpus Annotation Tool
    10.9.1 Viewing Annotated Texts
    10.9.2 Editing Existing Annotations
    10.9.3 Adding New Annotations
    10.9.4 Options

11 Machine Learning API
  11.1 ML Generalities
    11.1.1 Some definitions
    11.1.2 GATE-specific interpretation of the above definitions
  11.2 The Batch Learning PR in GATE
    11.2.1 The settings not specified in the configuration file
    11.2.2 All the settings in the XML configuration file
  11.3 Examples of configuration file for the three learning types
  11.4 How to use the ML API
  11.5 The outputs of the ML API
    11.5.1 Training results
    11.5.2 Application results
    11.5.3 Evaluation results
    11.5.4 Feature files

12 Tools for Alignment Tasks
  12.1 Introduction
  12.2 Tools for Alignment Tasks
    12.2.1 Compound Document
    12.2.2 Compound Document Editor
    12.2.3 Composite Document
    12.2.4 DeleteMembersPR
    12.2.5 SwitchMembersPR
    12.2.6 Saving as XML
    12.2.7 Alignment Editor

13 Performance Evaluation of Language Analysers
  13.1 The AnnotationDiff Tool
  13.2 The six annotation relations explained
  13.3 Benchmarking tool
  13.4 Metrics for Evaluation in Information Extraction
  13.5 Metrics for Evaluation of Inter-Annotator Agreement

14 Users, Groups, and LR Access Rights
  14.1 Java serialisation and LR access rights
  14.2 Oracle Datastore and LR access rights
    14.2.1 Users, Groups, Sessions and Access Modes
    14.2.2 User/Group Administration
    14.2.3 The API

15 Developing GATE
  15.1 Creating new plugins
    15.1.1 Where to keep plugins in the GATE hierarchy
    15.1.2 Writing a new PR
    15.1.3 Writing a new VR
    15.1.4 Adding plugins to the nightly build
  15.2 Updating this User Guide
    15.2.1 Building the User Guide
    15.2.2 Making changes to the User Guide

16 Combining GATE and UIMA
  16.1 Embedding a UIMA TAE in GATE
    16.1.1 Mapping File Format
    16.1.2 The UIMA component descriptor
    16.1.3 Using the AnalysisEnginePR
    16.1.4 Current limitations
  16.2 Embedding a GATE CorpusController in UIMA
    16.2.1 Mapping file format
    16.2.2 The GATE application definition
    16.2.3 Configuring the GATEApplicationAnnotator

Appendices

A Design Notes
  A.1 Patterns
    A.1.1 Components
    A.1.2 Model, view, controller
    A.1.3 Interfaces
  A.2 Exception Handling

B JAPE: Implementation
  B.1 Formal Description of the JAPE Grammar
  B.2 Relation to CPSL
  B.3 Algorithms for JAPE Rule Application
    B.3.1 The first algorithm
    B.3.2 Algorithm 2
  B.4 Label Binding Scheme
  B.5 Classes
  B.6 Implementation
    B.6.1 A Walk-Through
    B.6.2 Example RHS code
  B.7 Compilation
  B.8 Using a Different Java Compiler

C Named-Entity State Machine Patterns
  C.1 Main.jape
  C.2 first.jape
  C.3 firstname.jape
  C.4 name.jape
    C.4.1 Person
    C.4.2 Location
    C.4.3 Organization
    C.4.4 Ambiguities
    C.4.5 Contextual information
  C.5 name_post.jape
  C.6 date_pre.jape
  C.7 date.jape
  C.8 reldate.jape
  C.9 number.jape
  C.10 address.jape
  C.11 url.jape
  C.12 identifier.jape
  C.13 jobtitle.jape
  C.14 final.jape
  C.15 unknown.jape
  C.16 name_context.jape
  C.17 org_context.jape
  C.18 loc_context.jape
  C.19 clean.jape

D Part-of-Speech Tags used in the Hepple Tagger

E Sample ML Configuration File

References

Chapter 1

Introduction

Software documentation is like sex: when it is good, it is very, very good; and when it is bad, it is better than nothing. (Anonymous.)

There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies; the other way is to make it so complicated that there are no obvious deficiencies. (C.A.R. Hoare)

A computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute. (The Structure and Interpretation of Computer Programs, H. Abelson, G. Sussman and J. Sussman, 1985.)

If you try to make something beautiful, it is often ugly. If you try to make something useful, it is often beautiful. (Oscar Wilde)[1]

[1] These were, at least, our ideals; of course we didn’t completely live up to them...

GATE is an infrastructure for developing and deploying software components that process human language. GATE helps scientists and developers in three ways:

1. by specifying an architecture, or organisational structure, for language processing software;

2. by providing a framework, or class library, that implements the architecture and can be used to embed language processing capabilities in diverse applications;

3. by providing a development environment built on top of the framework, made up of convenient graphical tools for developing components.

The architecture exploits component-based software development, object orientation and mobile code. The framework and development environment are written in Java and available as open-source free software under the GNU library licence[2]. GATE uses Unicode [Unicode Consortium 96] throughout, and has been tested on a variety of Slavic, Germanic, Romance, and Indic languages [Maynard et al. 01, Gambäck & Olsson 00, McEnery et al. 00].

[2] This is a restricted form of the GNU licence, which means that GATE can be embedded in commercial products if required.

From a scientific point-of-view, GATE’s contribution is to quantitative measurement of accuracy and repeatability of results for verification purposes.

GATE has been in development at the University of Sheffield since 1995 and has been used in a wide variety of research and development projects [Maynard et al. 00]. Version 1 of GATE was released in 1996, was licensed by several hundred organisations, and used in a wide range of language analysis contexts including Information Extraction ([Cunningham 99b, Appelt 99, Gaizauskas & Wilks 98, Cowie & Lehnert 96]) in English, Greek, Spanish, Swedish, German, Italian and French. Version 3 of the system, a complete re-implementation and extension of the original, is available from http://gate.ac.uk/download/.

This book describes how to use GATE to develop language processing components, test their performance and deploy them as parts of other applications. In the rest of this chapter:

• section 1.1 describes the best way to use this book;

• section 1.2 briefly notes that the context of GATE is applied language processing, or Language Engineering;

• section 1.3 gives an overview of developing using GATE;

• section 1.4 describes the structure of the rest of the book;

• section 1.5 lists other publications about GATE.

Note: if you don’t see the component you need in this document, or if we mention a component that you can’t see in the software, contact [email protected][3] – various components are developed by our collaborators, whom we will be happy to put you in contact with. (Often the process of getting a new component is as simple as typing the URL into GATE; the system will do the rest.)

[3] Follow the ‘support’ link from the GATE web server to subscribe to the mailing list.

1.1 How to Use This Text

It is a good idea to read all of this introduction (you can skip sections 1.2 and 1.5 if pressed); then you can either continue wading through the whole thing or just use chapter 3 as a reference and dip into other chapters for more detail as necessary. Chapter 3 gives instructions for completing common tasks with GATE, organised in a FAQ style: details, and the reasoning behind the various aspects of the system, are omitted in that chapter, so where more information is needed refer to later chapters.

The structure of the book as a whole is detailed in section 1.4 below.

1.2 Context

GATE can be thought of as a Software Architecture for Language Engineering [Cunningham 00].

‘Software Architecture’ is used rather loosely here to mean computer infrastructure for software development, including development environments and frameworks, as well as the more usual use of the term to denote a macro-level organisational structure for software systems [Shaw & Garlan 96].

Language Engineering (LE) may be defined as:

. . . the discipline or act of engineering software systems that perform tasks involving processing human language. Both the construction process and its outputs are measurable and predictable. The literature of the field relates to both application of relevant scientific results and a body of practice. [Cunningham 99a]

The relevant scientific results in this case are the outputs of Computational Linguistics, Natural Language Processing and Artificial Intelligence in general. Unlike these other disciplines, LE, as an engineering discipline, entails predictability, both of the process of constructing LE-based software and of the performance of that software after its completion and deployment in applications.

Some working definitions:

1. Computational Linguistics (CL): science of language that uses computation as an investigative tool.

2. Natural Language Processing (NLP): science of computation whose subject mat- ter is data structures and algorithms for computer processing of human language.

3. Language Engineering (LE): building NLP systems whose cost and outputs are measurable and predictable.

4. Software Architecture: macro-level organisational principles for families of systems. In this context the term is also used to mean infrastructure.

5. Software Architecture for Language Engineering (SALE): software infrastructure, architecture and development tools for applied CL, NLP and LE.

(Of course the practice of these fields is broader and more complex than these definitions.)

In the scientific endeavours of NLP and CL, GATE’s role is to support experimentation. In this context GATE’s significant features include support for automated measurement (see section 13), providing a ‘level playing field’ where results can easily be repeated across different sites and environments, and reducing research overheads in various ways.

1.3 Overview

1.3.1 Developing and Deploying Language Processing Facilities

GATE as an architecture suggests that the elements of software systems that process natural language can usefully be broken down into various types of component, known as resources[4]. Components are reusable software chunks with well-defined interfaces, and are a popular architectural form, used in Sun’s Java Beans and Microsoft’s .Net, for example. GATE components are specialised types of Java Bean, and come in three flavours:

• LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;

• ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers;

• VisualResources (VRs) represent visualisation and editing components that participate in GUIs.

These definitions can be blurred in practice as necessary.

[4] The terms ‘resource’ and ‘component’ are synonymous in this context. ‘Resource’ is used instead of just ‘component’ because it is a common term in the literature of the field: cf. the Language Resources and Evaluation conference series [LREC-1 98, LREC-2 00].

Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering. All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data. The JAR and XML files are made available to GATE by putting them on a web server, or simply placing them in the local file space. Section 1.3.2 introduces GATE’s built-in resource set.
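For illustration, the XML configuration data for a plugin directory lives in a creole.xml file along the following lines. This is a sketch only — the resource, class and JAR names here are invented for the example; section 3.9 and chapter 4 give the definitive format:

<CREOLE-DIRECTORY>
  <CREOLE>
    <RESOURCE>
      <!-- a hypothetical gazetteer PR packaged in its own JAR -->
      <NAME>Cyber Gazetteer</NAME>
      <CLASS>com.cyberdyne.gate.CyberGazetteer</CLASS>
      <JAR>cyberGazetteer.jar</JAR>
      <PARAMETER NAME="listsURL"
                 COMMENT="location of the gazetteer lists definition"
                 SUFFIXES="def">java.net.URL</PARAMETER>
    </RESOURCE>
  </CREOLE>
</CREOLE-DIRECTORY>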

When using GATE to develop language processing functionality for an application, the developer uses the development environment and the framework to construct resources of the three types. This may involve programming, or the development of Language Resources such as grammars that are used by existing Processing Resources, or a mixture of both. The development environment is used for visualisation of the data structures produced and consumed during processing, and for debugging, performance measurement and so on. For example, figure 1.1 is a screenshot of one of the visualisation tools, displaying named-entity extraction results for a Hindi sentence.

Figure 1.1: One of GATE’s visual resources

The GATE development environment is analogous to systems like Mathematica for mathematicians, or JBuilder for Java programmers: it provides a convenient graphical environment for research and development of language processing software.

When an appropriate set of resources have been developed, they can then be embedded in the target client application using the GATE framework. The framework is supplied as two JAR files.[5] To embed GATE-based language processing facilities in an application, these JAR files are all that is needed, along with JAR files and XML configuration files for the various resources that make up the new facilities.

[5] The main JAR file (gate.jar) supplies the framework, built-in resources and various 3rd-party libraries; the second file (guk.jar, the GATE Unicode Kit) contains Unicode support (e.g. additional input methods for languages not currently supported by the JDK). They are separate because the latter has to be a Java extension with a privileged security profile.

1.3.2 Built-in Components

GATE includes resources for common LE data structures and algorithms, including doc- uments, corpora and various annotation types, a set of language analysis components for Information Extraction and a range of data visualisation and editing components.

GATE supports documents in a variety of formats including XML, RTF, email, HTML, SGML and plain text. In all cases the format is analysed and converted into a single unified model of annotation. The annotation format is a modified form of the TIPSTER format [Grishman 97] which has been made largely compatible with the Atlas format [Bird & Liberman 99], and uses the now standard mechanism of ‘stand-off markup’. GATE documents, corpora and annotations are stored in databases of various sorts, visualised via the development environment, and accessed at code level via the framework. See chapter 6 for more details of corpora etc.

A family of Processing Resources for language analysis is included in the shape of ANNIE, A Nearly-New Information Extraction system. These components use finite state techniques to implement various tasks from tokenisation to semantic tagging or verb phrase chunking. All ANNIE components communicate exclusively via GATE’s document and annotation resources. See chapter 8 for more details. See chapter 5 for visual resources. See chapter 9 for other miscellaneous CREOLE resources.

1.3.3 Additional Facilities

Three other facilities in GATE deserve special mention:

• JAPE, a Java Annotation Patterns Engine, provides regular-expression based pat- tern/action rules over annotations – see chapter 7.

• The ‘annotation diff’ tool in the development environment implements performance metrics such as precision and recall for comparing annotations (the standard definitions are given after this list). Typically a language analysis component developer will mark up some documents by hand and then use these along with the diff tool to automatically measure the performance of the components. See section 13.

• GUK, the GATE Unicode Kit, fills in some of the gaps in the JDK’s[6] support for Unicode, e.g. by adding input methods for various languages from Urdu to Chinese. See section 3.34 for more details.
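For reference, the usual definitions of the metrics mentioned above, in terms of the numbers of correct, spurious (false positive) and missing (false negative) annotations found when comparing system output against the hand-annotated key (this is a simplification that ignores partially correct matches, which chapter 13 treats in full):

    Precision = correct / (correct + spurious)
    Recall    = correct / (correct + missing)
    F-measure = ((B^2 + 1) * Precision * Recall) / (B^2 * Precision + Recall)

where B (beta) weights the relative importance of precision and recall, and is usually set to 1.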

And by version 4 it will make a mean cup of tea.

[6] JDK: Java Development Kit, Sun Microsystems’ Java implementation. Unicode support is being actively improved by Sun, but at the time of writing many languages are still unsupported. In fact, Unicode itself doesn’t support all languages, e.g. Sylheti; hopefully this will change in time.

1.3.4 An Example

This section gives a very brief example of a typical use of GATE to develop and deploy language processing capabilities in an application, and to generate quantitative results for scientific publication.

Let’s imagine that a developer called Fatima is building an email client[7] for Cyberdyne Systems’ large corporate Intranet. In this application she would like to have a language processing system that automatically spots the names of people in the corporation and transforms them into mailto hyperlinks.

A little investigation shows that GATE’s existing components can be tailored to this purpose. Fatima starts up the development environment, and creates a new document containing some example emails. She then loads some processing resources that will do named-entity recognition (a tokeniser, gazetteer and semantic tagger), and creates an application to run these components on the document in sequence. Having processed the emails, she can see the results in one of several viewers for annotations.

The GATE components are a decent start, but they need to be altered to deal specially with people from Cyberdyne’s personnel database. Therefore Fatima creates new “cyber-” versions of the gazetteer and semantic tagger resources, using the “bootstrap” tool. This tool creates a directory structure on disk that has some Java stub code, a Makefile and an XML configuration file. After several hours struggling with badly written documentation, Fatima manages to compile the stubs and create a JAR file containing the new resources. She tells GATE the URL of these files[8], and the system then allows her to load them in the same way that she loaded the built-in resources earlier on.

[7] Perhaps because Outlook Express trashed her mail folder again, or because she got tired of Microsoft-specific viruses and hadn’t heard of Netscape or Emacs.
[8] While developing, she uses a file:/... URL; for deployment she can put them on a web server.

Fatima then creates a second copy of the email document, and uses the annotation editing facilities to mark up the results that she would like to see her system producing. She saves this and the version that she ran GATE on into her Oracle datastore (set up for her by the Herculean efforts of the Cyberdyne technical support team, who like GATE because it enables them to claim lots of overtime). From now on she can follow this routine:

1. Run her application on the email test corpus.

2. Check the performance of the system by running the ‘annotation diff’ tool to compare her manual results with the system’s results. This gives her both percentage accuracy figures and a graphical display of the differences between the machine and human outputs.

3. Make edits to the code, pattern grammars or gazetteer lists in her resources, and recompile where necessary.

4. Tell GATE to re-initialise the resources.

5. Go to 1.

To make the alterations that she requires, Fatima re-implements the ANNIE gazetteer so that it regenerates itself from the local personnel data. She then alters the pattern grammar in the semantic tagger to prioritise recognition of names from that source. This latter job involves learning the JAPE language (see chapter 7), but as this is based on regular expressions it isn’t too difficult.
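To give a flavour of what such a grammar looks like, here is a small sketch of the kind of rule Fatima might write. The gazetteer types and feature values are invented for this example; chapter 7 documents the real language:

// A sketch of a hypothetical JAPE rule: it matches a first name from
// the (invented) Cyberdyne personnel gazetteer followed by a word with
// an initial capital, and annotates the whole span as a Person.
Phase: CyberNames
Input: Token Lookup
Options: control = appelt

Rule: CyberPersonName
(
  {Lookup.majorType == person_first, Lookup.minorType == cyberdyne}
  {Token.orth == upperInitial}
):name
-->
:name.Person = {kind = "employee", rule = "CyberPersonName"}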

Eventually the system is running nicely, and her accuracy is 93% (there are still some problem cases, e.g. when people use nicknames, but the performance is good enough for production use). Now Fatima stops using the GATE development environment and works instead on embedding the new components in her email application. This application is written in Java, so embedding is very easy[9]: the two GATE JAR files are added to the project CLASSPATH, the new components are placed on a web server, and with a little code to do initialisation, loading of components and so on, the job is finished in half a day – the code to talk to GATE takes up only around 150 lines of the eventual application, most of which is just copied from the example in the sheffield.examples.StandAloneAnnie class.
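That initialisation and loading code follows a pattern along these lines — a minimal sketch, assuming the ANNIE plugin in its standard location under the GATE installation (the document text and exact pipeline contents here are illustrative):

import gate.*;
import gate.creole.SerialAnalyserController;
import java.io.File;

public class EmbedGateSketch {
  public static void main(String[] args) throws Exception {
    Gate.init();  // initialise the GATE library

    // load the ANNIE plugin so its PR types become available
    Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

    // build a corpus pipeline: tokeniser -> gazetteer -> semantic tagger
    SerialAnalyserController pipeline = (SerialAnalyserController)
        Factory.createResource("gate.creole.SerialAnalyserController");
    pipeline.add((ProcessingResource)
        Factory.createResource("gate.creole.tokeniser.DefaultTokeniser"));
    pipeline.add((ProcessingResource)
        Factory.createResource("gate.creole.gazetteer.DefaultGazetteer"));
    pipeline.add((ProcessingResource)
        Factory.createResource("gate.creole.ANNIETransducer"));

    // run it over a single-document corpus
    Corpus corpus = Factory.newCorpus("mail");
    corpus.add(Factory.newDocument("Dear Ms Sarah Connor, ..."));
    pipeline.setCorpus(corpus);
    pipeline.execute();
  }
}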

Because Fatima is worried about Cyberdyne’s unethical policy of developing Skynet to help the large corporates of the West strengthen their strangle-hold over the World, she wants to get a job as an academic instead (so that her conscience will only have to cope with the torture of students, as opposed to humanity). She takes the accuracy measures that she has attained for her system and writes a paper for the Journal of Nasturtium Logarithm Encitement describing the approach used and the results obtained. Because she used GATE for development, she can cite the repeatability of her experiments and offer access to example binary versions of her software by putting them on an external web server.

And everybody lived happily ever after.

1.4 Structure of the Book

The material presented in this book ranges from the conceptual (e.g. ‘what is software architecture?’) to practical instructions for programmers (e.g. how to deal with GATE exceptions) and linguists (e.g. how to write a pattern grammar). This diversity is something of an organisational challenge. Our (no doubt imperfect) solution is to collect specific instructions for ‘how to do X’ in a separate chapter (3). Other chapters give a more discursive presentation. In order to understand the whole system you must, unfortunately, read much of the book; in order to get help with a particular task, however, look first in chapter 3 and refer to other material as necessary.

[9] Languages other than Java require an additional interface layer, such as JNI, the Java Native Interface, which is in C.

The other chapters:

Chapter 4 describes the GATE architecture’s component-based model of language processing, describes the lifecycle of GATE components, and how they can be grouped into applications and stored in databases and files.

Chapter 5 describes the set of Visual Resources that are bundled with GATE.

Chapter 6 describes GATE’s model of document formats, annotated documents, annotation types, and corpora (sets of documents). It also covers GATE’s facilities for reading and writing in the XML data interchange language.

Chapter 7 describes JAPE, a pattern/action rule language based on regular expressions over annotations on documents. JAPE grammars compile into cascaded finite state transducers.

Chapter 8 describes ANNIE, a pipelined Information Extraction system which is supplied with GATE.

Chapter 9 describes CREOLE resources bundled with the system that don’t fit into the previous categories.

Chapter 10 describes processing resources and language resources for working with ontologies.

Chapter 11 describes a machine learning layer specifically targeted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning.

Chapter 13 describes how to measure the performance of language analysis components.

Chapter 14 describes the data store security model.

Appendix A discusses the design of the system.

Appendix B describes the implementation details and formal definitions of the JAPE annotation patterns language.

Appendix C describes in some detail the JAPE pattern grammars that are used in ANNIE for named-entity recognition.

1.5 Further Reading

Lots of documentation lives on the GATE web server, including:

• the concise application developer’s guide (with emphasis on using the GATE API);

• a guide to using GATE for manual annotation;

• movies of the system in operation;

• the main system documentation tree;

• JavaDoc API documentation;

• HTML of the source code;

• parts of the requirements analysis that version 3 is based on.

For more details about Sheffield University’s work in human language processing see the NLP group pages or [Cunningham 99a]. For more details about Information Extraction see IE, a User Guide or the GATE IE pages.

A list of publications on GATE and projects that use it (some of which are available on-line):

[Cunningham 05] is an overview of the field of Information Extraction for the 2nd Edition of the Encyclopaedia of Language and Linguistics.

[Cunningham & Bontcheva 05] is an overview of the field of Software Architecture for Language Engineering for the 2nd Edition of the Encyclopaedia of Language and Linguistics.

[Li et al. 04] (Machine Learning Workshop 2004) describes an SVM based learning algorithm for IE using GATE.

[Wood et al. 04] (NLDB 2004) looks at ontology-based IE from parallel texts.

[Cunningham & Scott 04b] (JNLE) is a collection of papers covering many important areas of Software Architecture for Language Engineering.

[Cunningham & Scott 04a] (JNLE) is the introduction to the above collection.

[Bontcheva 04] (LREC 2004) describes lexical and ontological resources in GATE used for Natural Language Generation.

[Bontcheva et al. 04] (JNLE) discusses developments in GATE in the early naughties.

[Maynard et al. 04a] (LREC 2004) presents algorithms for the automatic induction of gazetteer lists from multi-language data.

[Maynard et al. 04c] (AIMSA 2004) presents automatic creation and monitoring of semantic metadata in a dynamic knowledge portal.

[Maynard et al. 04b] (ESWS 2004) discusses ontology-based IE in the hTechSight project.

[Dimitrov et al. 05] (Anaphora Processing) gives a lightweight method for named entity coreference resolution.

[Kiryakov 03] (Technical Report) discusses semantic web technology in the context of multimedia indexing and search.

[Tablan et al. 03] (HLT-NAACL 2003) presents the OLLIE on-line learning for IE system.

[Wood et al. 03] (Recent Advances in Natural Language Processing 2003) discusses using parallel texts to improve IE recall.

[Maynard et al. 03a] (Recent Advances in Natural Language Processing 2003) looks at semantics and named-entity extraction.

[Maynard et al. 03b] (ACL Workshop 2003) describes NE extraction without training data on a language you don’t speak (!).

[Maynard et al. ] (EACL 2003) looks at the distinction between information and content extraction.

[Manov et al. 03] (HLT-NAACL 2003) describes experiments with geographic knowledge for IE.

[Saggion et al. 03a] (EACL 2003) discusses robust, generic and query-based summarisation.

[Saggion et al. 03c] (EACL 2003) discusses event co-reference in the MUMIS project.

[Saggion et al. 03b] (Data and Knowledge Engineering) discusses multimedia indexing and search from multisource multilingual data.

[Cunningham et al. 03] (Corpus Linguistics 2003) describes GATE as a tool for collaborative corpus annotation.

[Bontcheva et al. 03] (NLPXML-2003) looks at GATE for the semantic web.

[Dimitrov 02a, Dimitrov et al. 02] (DAARC 2002, MSc thesis) discuss lightweight coreference methods.

[Lal 02] (Master Thesis) looks at text summarisation using GATE.

[Lal & Ruger 02] (ACL 2002) looks at text summarisation using GATE.

[Cunningham et al. 02] (ACL 2002) describes the GATE framework and graphical development environment as a tool for robust NLP applications.

[Bontcheva et al. 02b] (NLIS 2002) discusses how GATE can be used to create HLT modules for use in information systems.

[Tablan et al. 02] (LREC 2002) describes GATE’s enhanced Unicode support.

[Maynard et al. 02a] (ACL 2002 Summarisation Workshop) describes using GATE to build a portable IE-based summarisation system in the domain of health and safety.

[Maynard et al. 02c] (Nordic Language Technology) describes various Named Entity recognition projects developed at Sheffield using GATE.

[Maynard et al. 02b] (AIMSA 2002) describes the adaptation of the core ANNIE modules within GATE to the ACE (Automatic Content Extraction) tasks.

[Maynard et al. 02d] (JNLE) describes robustness and predictability in LE systems, and presents GATE as an example of a system which contributes to robustness and to low overhead systems development.

[Bontcheva et al. 02c], [Dimitrov 02a] and [Dimitrov 02b] (TALN 2002, DAARC 2002, MSc thesis) describe the shallow named entity coreference modules in GATE: the orthomatcher, which resolves orthographic coreference of named entities, and the pronoun resolution module.

[Bontcheva et al. 02a] (ACL 2002 Workshop) describes how GATE can be used as an environment for teaching NLP, with examples of and ideas for future student projects developed within GATE.

[Pastra et al. 02] (LREC 2002) discusses the feasibility of grammar reuse in applications using ANNIE modules.

[Baker et al. 02] (LREC 2002) report results from the EMILLE Indic languages corpus collection and processing project.

[Saggion et al. 02b] and [Saggion et al. 02a] (LREC 2002, SPLPT 2002) describe how ANNIE modules have been adapted to extract information for indexing multimedia material.

[Maynard et al. 01] (RANLP 2001) discusses a project using ANNIE for named-entity recognition across wide varieties of text type and genre.

[Cunningham 00] (PhD thesis) defines the field of Software Architecture for Language Engineering, reviews previous work in the area, presents a requirements analysis for such systems (which was used as the basis for designing GATE versions 2 and 3), and evaluates the strengths and weaknesses of GATE version 1.

[Cunningham 02] (Computers and the Humanities) describes the philosophy and motivation behind the system, describes GATE version 1 and how well it lived up to its design brief.

[McEnery et al. 00] (Vivek) presents the EMILLE project in the context of which GATE’s Unicode support for Indic languages has been developed.

[Cunningham et al. 00d] and [Cunningham 99c] (technical reports) document early versions of JAPE (superseded by the present document).

[Cunningham et al. 00a], [Cunningham et al. 98a] and [Peters et al. 98] (OntoLex 2000, LREC 1998) present GATE’s model of Language Resources, their access and distribution.

[Maynard et al. 00] (technical report) surveys users of GATE up to mid-2000.

[Cunningham et al. 00c] and [Cunningham et al. 99] (COLING 2000, AISB 1999) summarise experiences with GATE version 1.

[Cunningham et al. 00b] (LREC 2000) taxonomises Language Engineering components and discusses the requirements analysis for GATE version 2.

[Bontcheva et al. 00] and [Brugman et al. 99] (COLING 2000, technical report) describe a prototype of GATE version 2 that integrated with the EUDICO multimedia markup tool from the Max Planck Institute.

[Gambäck & Olsson 00] (LREC 2000) discusses experiences in the Svensk project, which used GATE version 1 to develop a reusable toolbox of Swedish language processing components.

[Cunningham 99a] (JNLE) reviewed and synthesised definitions of Language Engineering.

[Stevenson et al. 98] and [Cunningham et al. 98b] (ECAI 1998, NeMLaP 1998) report work on implementing a word sense tagger in GATE version 1.

[Cunningham et al. 97b] (ANLP 1997) presents motivation for GATE and GATE-like infrastructural systems for Language Engineering.

[Gaizauskas et al. 96b, Cunningham et al. 97a, Cunningham et al. 96e] (ICTAI 1996, TIPSTER 1997, NeMLaP 1996) report work on GATE version 1.

[Cunningham et al. 96c, Cunningham et al. 96d, Cunningham et al. 95] (COLING 1996, AISB Workshop 1996, technical report) report early work on GATE version 1.

[Cunningham et al. 96b] (TIPSTER) discusses a selection of projects in Sheffield using GATE version 1 and the TIPSTER architecture it implemented.

[Cunningham et al. 96a] (manual) was the guide to developing CREOLE components for GATE version 1.

[Gaizauskas et al. 96a] (manual) was the user guide for GATE version 1.

[Humphreys et al. 96] (manual) describes the language processing components distributed with GATE version 1.

[Cunningham 94, Cunningham et al. 94] (NeMLaP 1994, technical report) argue that software engineering issues such as reuse, and framework construction, are important for language processing R&D.

[Dowman et al. 05b] (World Wide Web Conference Paper) The Web is used to assist the annotation and indexing of broadcast news.

[Dowman et al. 05a] (Euro Interactive Television Conference Paper) A system which can use material from the Internet to augment television news broadcasts.

[Dowman et al. 05c] (Second European Semantic Web Conference Paper) A system that semantically annotates television news broadcasts using news websites as a resource to aid in the annotation process.

[Li et al. 05a] (Proceedings of Sheffield Machine Learning Workshop) describe an SVM based IE system which uses the SVM with uneven margins as its learning component and GATE as its NLP processing module.

[Li et al. 05b] (Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)) uses the uneven margins versions of two popular learning algorithms, SVM and Perceptron, to deal with the imbalanced classification problems that arise in IE.

[Li et al. 05c] (Proceedings of Fourth SIGHAN Workshop on Chinese Language processing (Sighan-05)) used Perceptron learning, a simple, fast and effective learning algorithm, for Chinese word segmentation.

[Aswani et al. 05] (Proceedings of the Fifth International Conference on Recent Advances in Natural Language Processing (RANLP 2005)) describes ANNIC, a full-featured annotation indexing and search engine developed as a part of GATE. It is powered with Apache Lucene technology and indexes a variety of document formats supported by GATE.

[Wang et al. 05] (Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005)) describes extracting a domain ontology from linguistic resources based on relatedness measurements.

[Ursu et al. 05] (Proceedings of the 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (EWIMT 2005)) discusses digital media preservation and access through semantically enhanced web annotation.

[Polajnar et al. 05] (University of Sheffield Research Memorandum CS-05-10) describes user-friendly ontology authoring using a controlled language.

[Aswani et al. 06] (Proceedings of the 5th International Semantic Web Conference (ISWC 2006)) addresses the problem of disambiguating author instances in an ontology, describing a web-based approach that uses various features such as publication titles, abstracts, initials and co-authorship information.

Never in the history of the Research Assessment Exercise has so much been owed by so many to so few exercises in copy-and-paste.

Chapter 2

Change Log

This chapter lists major changes to GATE in roughly chronological order by release. Changes in the documentation are also referenced here.

2.1 Version 4.0 (July 2007)

2.1.1 Major new features

ANNIC

ANNotations In Context: a full-featured annotation indexing and retrieval system designed to support corpus querying and JAPE rule authoring. It is provided as part of an extension of the Serial Datastores, called Searchable Serial Datastore (SSD). See section 9.28 for more details.

New machine learning API

A brand new machine learning layer specifically targeted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning. See chapter 11 for more details.

Ontology API

A new ontology API, based on OWL In Memory (OWLIM), which offers a better API, a revised ontology event model and an improved ontology editor, to name but a few of the changes. See chapter 10 for more details.

OCAT

An Ontology-based Corpus Annotation Tool that helps annotators manually annotate documents using ontologies. For more details please see section 10.9.

Alignment Tools

A new set of components (e.g. CompoundDocument, AlignmentEditor etc.) that help in building alignment tools and in carrying out cross-document processing. See chapter 12 for more details.

New HTML Parser

A new HTML document format parser, based on Andy Clark’s NekoHTML. This parser is much better than the old one at handling modern HTML and XHTML constructs, JavaScript blocks, etc., though the old parser is still available for existing applications that depend on its behaviour.

Java 5.0 support

GATE now requires Java 5.0 or later to compile and run. This brings a number of benefits:

• Java 5.0 syntax is now available on the right hand side of JAPE rules with the default Eclipse compiler. See section B.8 for details.

• Enum types are now supported for resource parameters. See section 3.12 for details on defining the parameters of a resource.

• AnnotationSet and the CreoleRegister take advantage of generic types. The AnnotationSet interface is now an extension of Set<Annotation> rather than just Set, which should make for cleaner and more type-safe code when programming to the API, and the CreoleRegister now uses parameterized types, which are backwards-compatible but provide better type-safety for new code.
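As a small illustration of what the generic AnnotationSet buys, consider this sketch, which assumes doc is an already-processed gate.Document:

// AnnotationSet now extends Set<Annotation>, so the enhanced for
// loop gives typed elements with no casts:
AnnotationSet tokens = doc.getAnnotations().get("Token");
for(Annotation token : tokens) {
  System.out.println(token.getType() + ": "
      + token.getFeatures().get("string"));
}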

2.1.2 Other new features and improvements

• Hiding the view for a particular resource (by right clicking on its tab and selecting “Hide this view”) will now completely close the associated viewers and dispose of them. Re-selecting the same resource at a later time will lead to re-creating the necessary viewers and displaying them. This has two advantages: firstly it offers a mechanism for disposing of views that are not needed any more without actually closing the resource, and secondly it provides a way to refresh the view of a resource in the situations where it becomes corrupted.

• The DataStore viewer now allows multiple selections. This lets users load or delete an arbitrarily large number of resources in one operation.

• The Corpus editor has been completely overhauled. It now allows re-ordering of documents as well as sorting the document list by either index or document name.

• Support has been added for resource parameters of type gate.FeatureMap, and it is also possible to specify a default value for parameters whose type is Collection, List or Set. See section 3.12 for details.

• (Feature Request #1446642) After several requests, a mechanism has been added to allow overriding of GATE's document format detection routine. A new creation-time parameter mimeType has been added to the standard document implementation, which forces a document to be interpreted as a specific MIME type and prevents the usual detection based on file name extension and other information. See section 6.5.1 for details, and the sketch at the end of this list.

• A capability has been added to specify arbitrary sets of additional features on individual gazetteer entries. These features are passed forward into the Lookup annotations generated by the gazetteer. See section 8.2 for details.

• As an alternative to the Google plugin, a new plugin called yahoo has been added to GATE. It allows users to submit a query to the Yahoo search engine and to load the pages found as GATE documents. See section 9.21 for more details.

• It is now easier to run a corpus pipeline over a single document in the GATE GUI – documents now provide a right-click menu item to create a singleton corpus containing just this document. See section 3.11.1 for details.

• A new interface has been added that lets PRs receive notification at the start and end of execution of their containing controller. This is useful for PRs that need to do cleanup or other processing after a whole corpus has been processed. See section 4.6 for details, and the skeleton sketch at the end of this list.

• The GATE GUI no longer calls System.exit() when it is closed. Instead, an effort is made to stop all active GATE threads and to release all GUI resources, which leads to the JVM exiting gracefully. This is particularly useful when GATE is embedded in other systems, as closing the main GATE window will no longer kill the JVM process.

• The set of AnnotationSchemas that used to be included in the core gate.jar and loaded as builtins has now been moved to the ANNIE plugin. When the plugin is loaded, the default annotation schemas are instantiated automatically and are available when doing manual annotation.

• There is now support in creole.xml files for automatically creating instances of a resource that are hidden (i.e. do not show in the GUI). One example of this can be seen in the creole.xml file of the ANNIE plugin, where the default annotation schemas are defined.

• A couple of helper classes have been added to assist in using GATE within a Spring application. Section 3.26 explains the details.

• Improvements have been made to the thread-safety of some internal GATE components, which means that it is now safe to create resources in multiple threads (though it is not safe to use the same resource instance in more than one thread). This is a big advantage when using GATE in a multithreaded environment, such as a web application. See section 3.28 for details.

• Plugins can now provide custom icons for their PRs and LRs in the plugin JAR file. See section 3.12 for details.

• It is now possible to override the default location for the saved session file using a system property. See section 3.3 for details.

• The TreeTagger plugin supports a system property to specify the location of the shell interpreter used for the tagger shell script. In combination with Cygwin this makes it much easier to use the tagger on Windows. See section 9.7 for details.

• The Buchart plugin has been removed. It is superseded by SUPPLE, and instructions on how to upgrade your applications from Buchart to SUPPLE are given in section 9.12. The probability finder plugin has also been removed, as it is no longer maintained.

• The bootstrap wizard now creates a basic plugin that builds with Ant. Since a Unix-style make command is no longer required, the generated plugin will build on Windows without needing Cygwin or MinGW.

• The GATE source code has moved from CVS into Subversion. See section 3.2.3 for details of how to check out the code from the new repository.

• An optional parameter, keepOriginalMarkupsAS, has been added to the DocumentReset PR, which allows users to decide whether or not to keep the Original Markups AS while resetting the document. See section 9.1 for more details.
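Two sketches for the items above. First, the new mimeType parameter used from the framework (the URL is illustrative):

FeatureMap params = Factory.newFeatureMap();
params.put("sourceUrl", new URL("http://example.org/some/page"));
params.put("mimeType", "text/html"); // bypass extension-based detection
Document doc = (Document)
  Factory.createResource("gate.corpora.DocumentImpl", params);

Second, a skeleton of a PR using the new controller callbacks. This is a hedged sketch: the interface is gate.creole.ControllerAwarePR (section 4.6), and the method names below are our reading of that API, so check the Javadoc:

public class MyCleanupPR extends AbstractProcessingResource
                         implements ControllerAwarePR {
  public void controllerExecutionStarted(Controller c)
      throws ExecutionException {
    // e.g. open a shared output file before the first document
  }
  public void controllerExecutionFinished(Controller c)
      throws ExecutionException {
    // e.g. flush results once the whole corpus has been processed
  }
  public void controllerExecutionAborted(Controller c, Throwable t)
      throws ExecutionException {
    // clean up after a failure part-way through the corpus
  }
}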

2.1.3 Bug fixes and optimizations

• The Morphological Analyser has been optimized. A new FSM-based implementation (with minor alterations to the basic FSM algorithm) has been written for the GATE Morphological Analyser. Previous profiling figures showed that the morpher, when integrated into the ANNIE application, used to take up to 60% of the overall processing time; the optimized version takes only 7.6% of the total processing time. See section 9.9 for more details on the morpher.

• The ANNIE Sentence Splitter was optimised. The new version is about twice as fast as the previous one. The actual speed increase varies widely depending on the nature of the document.

• The implementation of the OrthoMatcher component has been improved. This resource now takes significantly less time on large documents.

• The implementation of AnnotationSets has been improved. GATE now requires up to 40% less memory to run and is also 20% faster on average. The get methods of AnnotationSet now return instances of ImmutableAnnotationSet; any attempt at modifying the content of these objects will trigger an exception. An empty ImmutableAnnotationSet is returned instead of null.

• The Chemistry tagger (section 9.15) has been updated with a number of bugfixes and improvements.

• The Document user interface has been optimised to deal better with large bursts of events which tend to occur when the document that is currently displayed gets modified. The main advantages brought by this new implementation are:

– The document UI refreshes faster than before.

– The presence of the GUI for a document induces a smaller performance penalty than it used to. Due to a better threading implementation, machines with multiple CPUs (e.g. dual CPU, dual core or hyperthreading machines) should only see a negligible increase in processing time when a document is displayed, compared to situations where the document view is not shown. In the previous version, displaying a document while it was processed used to increase execution time by an order of magnitude.

– The GUI is more responsive when a large number of annotations are displayed, hidden or deleted.

– The strange exceptions that used to occur occasionally while working with the document GUI should not happen any more.

And as always there are many smaller bugfixes too numerous to list here...

2.2 Version 3.1 (April 2006)

2.2.1 Major new features

Support for UIMA

UIMA (http://www.research.ibm.com/UIMA/) is a language processing framework developed by IBM. UIMA and GATE share some functionality but are complementary in most respects. GATE now provides an interoperability layer to allow UIMA applications to include GATE components in their processing, and vice-versa. For full information, see chapter 16.

New Ontology API

The ontology layer has been rewritten in order to provide an abstraction layer between the model representation and the tools used for input and output of the various representation formats. An implementation that uses Jena 2 (http://jena.sourceforge.net/ontology) for reading and writing OWL and RDF(S) is provided.

Ontotext Japec Compiler

Japec is a compiler for JAPE grammars developed by Ontotext Lab. It has some limitations compared to the standard JAPE transducer implementation, but can run JAPE grammars up to five times as fast. By default, GATE still uses the stable JAPE implementation, but if you want to experiment with Japec, see section 9.27.

2.2.2 Other new features and improvements

• Addition of a new JAPE matching style "all". This is similar to Brill, but once all rules from a given start point have matched, the matching will continue from the next offset to the current one, rather than from the position in the document where the longest match finishes. More details can be found in Section 7.2.

• Limited support for loading PDF and Microsoft Word document formats. Only the text is extracted from the documents, no formatting information is preserved.

• The Buchart parser has been deprecated and replaced by a new plugin called SUPPLE – the Sheffield University Prolog Parser for Language Engineering. Full details, including information on how to move your application from Buchart to SUPPLE, are in section 9.12.

• The Hepple POS Tagger is now open-source. The source code has been included in the GATE distribution, under src/hepple/postag. More information about the POS Tagger can be found in Section 8.4.

• Minipar is now supported on Windows. minipar-windows.exe, a modified version of pdemo.cpp, has been added under the gate/plugins/minipar directory to allow users to run Minipar on the Windows platform. When using Minipar on Windows, this binary should be provided as the value for the miniparBinary parameter. For full information on Minipar in GATE, see section 9.10.

• The XmlGateFormat writer ("Save As Xml" in the GATE GUI, gate.Document.toXml() in the GATE API) and reader have been modified to write and read GATE annotation IDs. For backward compatibility reasons the old reader has been kept. This change fixes a bug which manifested in the following situation: if a GATE document had annotations carrying features whose values were numbers representing other GATE annotation IDs, then after saving and reloading the document to and from XML those feature values could become invalid, pointing to other annotations. By saving and restoring the GATE annotation IDs, the consistency of the GATE document is maintained. For more information, see Section 6.5.2.

• The NP chunker and chemistry tagger plugins have been updated. Mark Greenwood has relicenced them under the LGPL, so their source code has been moved into the GATE distribution. See sections 9.3 and 9.15 for details.

• The Tree Tagger wrapper has been updated with an option to be less strict when characters that cannot be represented in the tagger’s encoding are encountered in the document. Details are in section 9.7.

• JAPE Transducers can now be serialized into binary files. A new init-time parameter, binaryGrammarURL, allows a serialized transducer to be loaded as an alternative to the grammarURL parameter. More information can be found in Section 7.7.

• On Mac OS, GATE now behaves more ‘naturally’. The application menu items and keyboard shortcuts for About and Preferences now do what you would expect, and exiting GATE with command-Q or the Quit menu item properly saves your options and current session.

• Updated versions of Weka (3.4.6) and Maxent (2.4.0).

• Optimisation in gate.creole.ml: the conversion of AnnotationSet into ML examples is now faster.

• It is now possible to create your own implementation of Annotation, and have GATE use this instead of the default implementation. See AnnotationFactory and AnnotationSetImpl in the gate.annotation package for details. Developing Language Processing Components with GATE 24

2.2.3 Bug fixes

• The Tree Tagger wrapper has been updated in order to run under Windows. See 9.7.

• The SUPPLE parser has been made more user-friendly. It now produces more helpful error messages if things go wrong. Note that you will need to update any saved applications that include SUPPLE to work with this version – see section 9.12 for details.

• Miscellaneous fixes in the Ontotext JapeC compiler.

• Optimization: the creation of a Document is much faster.

• Google plugin: the optional pagesToExclude parameter was causing a NullPointerException when left empty at run time. Full details about the plugin functionality can be found in section 9.20.

• Minipar, SUPPLE, TreeTagger: these plugins that call external processes have been fixed to cope better with path names that contain spaces. Note that some of the external tools themselves still have problems handling spaces in file names, but these are beyond our control to fix. If you want to use any of these plugins, be sure to read the documentation to see if they have any such restrictions.

• When using a non-default location for GATE configuration files, the configuration data is saved back to the correct location when GATE exits. Previously the default locations were always used.

• Jape Debugger: the JAPE debugger was generating a ConcurrentModificationException during an attempt to run ANNIE (there was no exception when running without the debugger enabled). As a result of fixing this, one unnecessary and incorrect callback to the debugger was removed from the SinglePhaseTransducer class.

• Plus many other small bugfixes...

2.3 January 2005

Release of version 3.

New plugins for processing in various languages (see 9.14). These are not full IE systems but are designed as starting points for further development (French, German, Spanish, etc.), or as sample or toy applications (Cebuano, Hindi, etc.).

Other new plugins:

• Chemistry Tagger 9.15
• Montreal Transducer 9.13
• RASP Parser 9.11
• MiniPar 9.10
• Buchart Parser 9.12
• MinorThird 9.24
• NP Chunker 9.3
• Stemmer 9.8
• TreeTagger 9.7
• Probability Finder
• Crawler 9.19
• Google PR 9.20

Support for SVM Light, a support vector machine implementation, has been added to the machine learning plugin (see section 9.23.7).

2.4 December 2004

GATE no longer depends on the Sun Java compiler to run, which means it will now work on any Java runtime environment of at least version 1.4. JAPE grammars are now compiled using the Eclipse JDT Java compiler by default.

A welcome side-effect of this change is that it is now much easier to integrate GATE-based processing into web applications in Tomcat. See section 3.27 for details.

2.5 September 2004

GATE applications are now saved in XML format using the XStream library, rather than by using native Java serialization. On loading an application, GATE will automatically detect whether it is in the old or the new format, and so applications in both formats can be loaded. However, older versions of GATE will be unable to load applications saved in the XML format (a java.io.StreamCorruptedException: invalid stream header exception will occur). It is possible to get new versions of GATE to use the old format by setting a flag in the source code (see the Gate.java file for details). This change has been made because it allows the details of an application to be viewed and edited in a text editor, which is sometimes easier than loading the application into GATE.

2.6 Version 3 Beta 1 (August 2004)

Version 3 incorporates a lot of new functionality and some reorganisation of existing components.

Note that Beta 1 is feature-complete but needs further debugging (please send us bug reports!).

Highlights include: completely rewritten document viewer/editor; extensive ontology support; a new plugin management system; separate .jar files and a Tomcat classloading fix; lots more CREOLE components (and some more to come soon).

Almost all the changes are backwards-compatible; some recent classes have been renamed (particularly the ontologies support classes) and a few events added (see below); datastores created by version 3 will probably not read properly in version 2. If you have problems, use the mailing list and we'll help you fix your code!

The gory details:

• Anonymous CVS is now available. See section 3.2.3 for details.

• CREOLE repositories and the components they contain are now managed as plugins. You can select the plugins the system knows about (and add new ones) by going to "Manage CREOLE Plugins" on the file menu.

• The gate.jar file no longer contains all the subsidiary libraries and CREOLE component resources. This makes it easier to replace library versions and/or not load them when not required (libraries used by CREOLE builtins will now not be loaded unless you ask for them from the plugins manager console).

• ANNIE and other bundled components now have their resource files (e.g. pattern files, gazetteer lists) in a separate directory in the distribution – gate/plugins.

• Some testing with Sun's JDK 1.5 pre-releases has been done and no problems reported.

• The gate:// URL system used to load CREOLE and ANNIE resources in past releases is no longer needed. This means that loading in systems like Tomcat is now much easier.

• Mac OS X is now properly supported by the installer and the runtime.

• An Ontology-based Corpus Annotation Tool (OCAT) has been implemented as a GATE plugin. Documentation of its functionality is in Section 10.9.

• The NLG Lexical tools from the MIAKT project have now been released. See documentation in Section 9.25.

• The Features viewer/editor has been completely updated – see Sections 3.15 and 3.18 for details.

• The Document editor has been completely rewritten – see Section 3.6 for more information.

• The datastore viewer is now a full-size VR – see Section 3.20 for more information.

2.7 July 2004

GATE Documents now fire events when the document content is edited. This was added in order to support the new facility of editing documents from the GUI. This change will break backwards compatibility by requiring all DocumentListener implementations to implement a new method:

public void contentEdited(DocumentEvent e);
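For example, an existing listener must now provide (at least an empty) implementation of the new callback. A minimal sketch, assuming the other gate.event.DocumentListener methods are unchanged from previous releases:

import gate.event.DocumentEvent;
import gate.event.DocumentListener;

public class MyDocumentListener implements DocumentListener {
  public void annotationSetAdded(DocumentEvent e) { }
  public void annotationSetRemoved(DocumentEvent e) { }
  // the method added in this release
  public void contentEdited(DocumentEvent e) {
    System.out.println("Content edited in " + e.getSource());
  }
}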

2.8 June 2004

A new algorithm has been implemented for the AnnotationDiff function. A new, more usable GUI is included, and an "Export to HTML" option added. More details about the AnnotationDiff tool are in Section 3.22.

A new build process, based on Ant (http://ant.apache.org/), is now available for GATE. The old build process, based on make, is now unsupported. See Section 3.8 for details of the new build process.

A Jape Debugger from Ontos AG has been integrated in GATE. You can turn integration on with the command line option "-j". If you run the GATE GUI with this option, a new menu item for the Jape Debugger GUI will appear in the Tools menu. By default, integration is off. We are currently awaiting documentation for this.

NOTE! Keep in mind that there is a ClassCastException if you try to debug a ConditionalCorpusPipeline. The Jape Debugger is designed for Corpus Pipelines only. The Ontos code needs to be changed to allow debugging of ConditionalCorpusPipeline.

2.9 April 2004

GATE now has two alternative strategies for ontology-aware grammar transduction:

• using the [ontology] feature both in grammars and annotations, with the default Transducer;

• using the ontology-aware transducer – passing an ontology LR to a new subsume method in the SimpleFeatureMapImpl. The latter strategy does not check for ontology features (this will make the writing of grammars easier – no need to specify ontology).

The changes are in:

• SinglePhaseTransducer (always call subsume with ontology – if null then the ordinary subsumption takes place)

• SimpleFeatureMapImpl (new subsume method using an ontology LR)

More information about the ontology-aware transducer can be found in Section 10.6.

A morphological analyser PR has been added to GATE. This finds the root and affix values of a token and adds them as features to that token.

A flexible gazetteer PR has been added to GATE. This performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. See 9.5 for details.

2.10 March 2004

Support was added for the MAXENT machine learning library. (See 9.23.6 for details.)

2.11 Version 2.2 – August 2003

Note that GATE 2.2 works with JDK 1.4.0 or above. Version 1.4.2 is recommended, and is the one included with the latest installers.

GATE has been adapted to work with Postgres 7.3. The compatibility with PostgreSQL 7.2 has been preserved. See 3.35 for more details.

New library version – Lucene 1.3 (rc1)

A bug in gate.util.Javac has been fixed in order to account for situations when String literals require an encoding different from the platform default.

Temporary .java files used to compile JAPE RHS actions are now saved using UTF-8, and the "-encoding UTF-8" option is passed to the javac compiler.

A custom tools.jar is no longer necessary.

Minor changes have been made to the look and feel of GATE to improve its appearance with JDK 1.4.2

Some bug fixes (087, 088, 089, 090, 091, 092, 093, 095, 096 – see http://gate.ac.uk/gate/doc/bugs.html for more details).

2.12 Version 2.1 – February 2003

Integration of Machine Learning PR and WEKA wrapper (see Section 9.23).

Addition of DAML+OIL exporter.

Integration of WordNet in GATE (see Section 9.22).

The syntax tree viewer has been updated to fix some bugs.

2.13 June 2002

Conditional versions of the controllers are now available (see Section 3.14). These allow processing resources to be run conditionally on document features.

PostgreSQL Data Stores are now supported (see Section 4.7). These store data into a PostgreSQL RDBMS.

Addition of OntoGazetteer (see Section 5.2), an interface which makes ontologies visible within GATE, and supports basic methods for hierarchy management and traversal.

Integration of Protégé (see Section ??), so that people with developed Protégé ontologies can use them within GATE.

Addition of IR facilities in GATE (see Section 9.18).

Modification of the corpus benchmark tool (see Section 3.23), which now takes an application as a parameter.

See also http://gate.ac.uk/gate/doc/bugs.html for details of other recent bug fixes.

Chapter 3

How To. . .

“The law of evolution is that the strongest survives!” “Yes; and the strongest, in the existence of any social species, are those who are most social. In human terms, most ethical. . . . There is no strength to be gained from hurting one another. Only weakness.” The Dispossessed [p.183], Ursula K. le Guin, 1974.

This chapter describes how to complete common tasks using GATE. Sections that relate to the Development Environment are flagged [D]; those that relate to the framework are flagged [F]; sections relating to both are flagged [D,F].

There are two other primary sources for this type of information:

• for the development environment, see the visual tutorials available on our 'movies' page;
• for the framework, see the example code at http://gate.ac.uk/GateExamples/doc/.

3.1 Download GATE

To download GATE point your web browser at http://gate.ac.uk/ and follow the download link. Fill in the form there, and you will be emailed an FTP address to download the system from.

3.2 Install and Run GATE


GATE 3.1 will run anywhere that supports Java version 1.4.2 or later, including Solaris, Linux and Windoze platforms. GATE 4.0 beta 1 requires Java 5.0. We don't run tests on other platforms, but have had reports of successful installs elsewhere. We are also testing released installers on MacOS X.

3.2.1 The Easy Way

The easy way to install is to use one of the platform-specific installers (created using the excellent IzPack). Download a ‘platform-specific installer’ and follow the instructions it gives you.

3.2.2 The Hard Way (1)

Download the Java-only release package or the binary build snapshot, and follow the instructions below.

Prerequisites:

• A conforming Java 2 environment,

– version 1.4.2 or above for GATE 3.1;
– version 5.0 for GATE 4.0 beta 1 or later.

available free from Sun Microsystems or from your UNIX supplier. (We test on various Sun JDKs on Solaris, Linux and Windows XP.)

• Binaries from the GATE distribution you downloaded: gate.jar, lib/ext/guk.jar (Unicode editing support) and a suitable script to start Ant, e.g. ant.sh or ant.bat. These are held in a directory called bin like this:

.../bin/
    gate.jar
    ant.sh
    ant.bat

You will also need the lib directory, containing various libraries that GATE depends on.

• An open mind and a sense of humour.

Using the binary distribution:

• Unpack the distribution, creating a directory containing jar files and scripts.

• To run the development environment: on Windows, start a Command Prompt window, change to the directory where you unpacked the GATE distribution and run "bin/ant.bat run"; on UNIX run "bin/ant run".

• To embed GATE as a library, put gate.jar and all the libraries in the lib directory in your CLASSPATH and tell Java that guk.jar is an extension (-Djava.ext.dirs=path-to-guk.jar).

The Ant scripts that start GATE (ant.bat or ant) require you to set the JAVA_HOME environment variable to point to the top level directory of your Java installation. The value of GATE_CONFIG is passed to the system by the scripts using either a -i command-line option, or the Java property gate.config.

3.2.3 The Hard Way (2): Subversion

The GATE code is maintained in a Subversion repository. You can use a Subversion client to check out the source code – the most up-to-date version of GATE is the trunk:

svn checkout https://gate.svn.sourceforge.net/svnroot/gate/gate/trunk gate

Once you have checked out the code you can build GATE using Ant (see section 3.8).

You can browse the complete Subversion repository online at http://gate.svn.sourceforge.net/gate.

3.3 [D,F] Use System Properties with GATE

During initialisation, GATE reads several Java system properties in order to decide where to find its configuration files.

Here is a list of the properties used, their default values and their meanings:

gate.home sets the location of the GATE install directory. This should point to the top level directory of your GATE installation. This is the only property that is required. If this is not set, the system will display an error message and then attempt to guess the correct value.

gate.plugins.home points to the location of the directory containing installed GATE plug-ins (a.k.a. CREOLE directories). If this is not set then the default value of {gate.home}/plugins is used.

gate.site.config points to the location of the configuration file containing the site-wide options. If not set this will default to {gate.home}/gate.xml. The site configuration file must exist!

gate.user.config points to the file containing the user's options. If not specified, or if the specified file does not exist at startup time, the default value of gate.xml (.gate.xml on Unix platforms) in the user's home directory is used.

gate.user.session points to the file containing the user's saved session. If not specified, the default value of gate.session (.gate.session on Unix) in the user's home directory is used. When starting up the GUI the session is reloaded from this file if it exists, and when exiting the GUI the session is saved to this file (unless the user has disabled "save session on exit" in the configuration dialog). The session is not used when using GATE as a library.

load.plugin.path is a path-like structure, i.e. a list of URLs separated by ';'. All directories listed here will be loaded as CREOLE plugins during initialisation. This has similar functionality to the -d command line option.

gate.builtin.creole.dir is a URL pointing to the location of GATE's built-in CREOLE directory. This is the location of the creole.xml file that defines the fundamental GATE resource types, such as documents, document format handlers, controllers and the basic visual resources that make up the GATE GUI. The default points to a location inside gate.jar and should not generally need to be overridden.

When using GATE as a library, you can set the values for these properties before you call Gate.init(). Alternatively, you can set the values programmatically using the static methods setGateHome(), setPluginsHome(), setSiteConfigFile(), etc. before calling Gate.init(). See the Javadoc documentation for details. If you want to set these values from the command line, you can use the following syntax (setting gate.home in this example):

java -Dgate.home=/my/new/gate/home/directory -cp... gate.Main
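A sketch of the programmatic route (paths are illustrative; setGateHome and setSiteConfigFile are named above, and we assume a matching setUserConfigFile for the user config):

import gate.Gate;
import java.io.File;

public class GateSetup {
  public static void main(String[] args) throws Exception {
    // equivalent to passing -Dgate.home=... on the command line
    Gate.setGateHome(new File("/opt/gate-4.0"));
    // assumed analogue for gate.user.config (see lead-in above)
    Gate.setUserConfigFile(new File("/home/me/alternative-gate.xml"));
    Gate.init(); // remaining locations fall back to their defaults
  }
}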

When running the GUI, you can set the properties by creating a file build.properties in the top level GATE directory. In this file, any system properties which are prefixed with "run." will be passed to GATE. For example, to set an alternative user config file, put the following line in build.properties (in this specific case, the alternative config file must already exist when GATE starts up, so you should copy your standard gate.xml file to the new location):

run.gate.user.config=${user.home}/alternative-gate.xml

This facility is not limited to the GATE-specific properties listed above; for example, the following line changes the default temporary directory for GATE (note the use of forward slashes, even on Windows platforms):

run.java.io.tmpdir=d:/bigtmp


3.4 [D,F] Use (CREOLE) Plug-ins

The definitions of CREOLE resources (see Chapter 4) are stored in CREOLE directories (directories containing an XML file describing the resources, the Java archive with the compiled executable code and whatever libraries are required by the resources).

Starting with version 3, CREOLE directories are called "CREOLE Plugins" or simply "Plugins". In previous versions, the CREOLE resources distributed with GATE used to be included in the monolithic gate.jar archive. Version 3 includes them as separate directories under the plugins directory of the distribution. This allows easy access to the linguistic resources used, without the requirement to unpack the gate.jar file.

Plugins can have one or more of the following states in relation to GATE:

known plugins are those plugins that the system knows about. These include all the plugins in the plugins directory of the GATE installation (the so-called installed plugins) as well as all the plugins that were manually loaded from the user interface.

loaded plugins are the plugins currently loaded in the system. All CREOLE resource types from the loaded plugins are available for use. All known plugins can easily be loaded and unloaded using the user interface.

auto-loadable plugins are the list of plugins that the system loads automatically during initialisation. By default this only includes the ANNIE plugin (see Section 3.16).

The default location for installed plugins can be modified using the gate.plugins.home system property, while the list of auto-loadable plugins can be set using the load.plugin.path property; see Section 3.3 above.

The CREOLE plugins can be managed through the graphical user interface, which can be activated by selecting "Manage CREOLE plugins" from the "File" menu. This will bring up a window listing all the known plugins. For each plugin there are two check-boxes – one labelled "Load now", which will load the plugin, and the other labelled "Load always", which will add the plugin to the list of auto-loadable plugins. A "Delete" button is also provided, which will remove the plugin from the list of known plugins. Note that the installed plugins will return to the list of known plugins the next time GATE is started. They can only be removed by physically removing (or moving) the actual directory on disk outside the GATE plugins directory.

When using GATE as a library the following API calls are relevant to working with plugins:

Class gate.Gate

public static void addKnownPlugin(URL pluginURL) adds the plugin to the list of known plugins.

public static void removeKnownPlugin(URL pluginURL) tells the system to "forget" about one previously known directory. If the specified directory was loaded, it will be unloaded as well – i.e. all the metadata relating to resources defined by this directory will be removed from memory.

public static void addAutoloadPlugin(URL pluginUrl) adds a new directory to the list of plugins that are loaded automatically at start-up.

public static void removeAutoloadPlugin(URL pluginURL) tells the system to remove a plugin URL from the list of plugins that are loaded automatically at system start-up. This will be reflected in the user's configuration data file.

Class gate.CreoleRegister

public void registerDirectories(URL directoryUrl) loads a new CREOLE directory. The new plugin is added to the list of known plugins if not already there.

public void removeDirectory(URL directory) unloads a loaded CREOLE plugin.
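Putting these calls together, loading the ANNIE plugin directory at run time might look like this (a sketch; assumes Gate.init() has already been called and the usual gate.*, java.io and java.net imports):

// locate the installed plugins directory and register ANNIE;
// registerDirectories both adds it to the known list and loads it
File annieDir = new File(Gate.getPluginsHome(), "ANNIE");
Gate.getCreoleRegister().registerDirectories(annieDir.toURI().toURL());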

3.5 Troubleshooting

On Windoze 95 and 98, you may need to increase the amount of environment space available for the gate.bat script. Right click on the script, hit the memory tab and increase the ‘initial environment’ value to maximum.

Note that the gate.bat script uses javaw.exe to run GATE which means that you will see no console for the java process. If you have problems starting GATE and you would like to be able to see the console to check for messages then you should edit the gate.bat script and replace javaw.exe with java.exe in the definition of the JAVA environment variable.

When our FTP server is overloaded you may get a blank download link in the email sent to you after you register. Please try again later.

3.6 [D] Get Started with the GUI

Probably the best way to learn how to use the GATE graphical development environment is to look at the animated demonstrations and tutorials on the ‘movies’ page. There is also a shorter manual aimed at those who just want to use GATE for annotating texts and viewing the results.

This section gives a short description of what is where in the main window of the system.

Figure 3.1: Main Window

Figure 3.1 shows the main window of the application, with a single document loaded. There are five main areas of the window:

1. the menus bar along the top, with ‘File’ etc.;

2. in the top left of the main area, a tree starting from 'GATE' and containing 'Applications', 'Language Resources' etc. – this is the resources tree;

3. in the bottom left of the main area, a black rectangle, which is the small resource viewer;

4. on the right of the main area, containing tabs with 'Messages' and 'GATE Document 0001F', the main resource viewer;

5. the messages bar along the bottom (where it says ‘Finished dumping...’).

The menu and the messages bars do the usual things. Longer messages are displayed in the messages tab in the main resource viewer area.

The resource tree and resource viewer areas work together to allow the system to display diverse resources in various ways. Visual Resources integrated with GATE can have a small view or a large view. For example, data stores have a small view; documents have a large view.

All the resources, applications and datastores currently loaded in the system appear in the resources tree; double clicking on a resource will load a viewer for the resource in one of the resource view areas.

3.7 [D,F] Configure GATE

When the GATE development environment is started, or when Gate.init() is called from the API, GATE loads various sorts of configuration data stored as XML in files generally called something like gate.xml or .gate.xml. This data holds information such as:

• whether to save settings on exit;

• what fonts the GUI should use;

• where the local Oracle database lives.

All of this type of data is stored at two levels (in order from general to specific):

• the site-wide level, which by default is located in the gate.xml file in the top level directory of the GATE installation (i.e. the GATE home). This location can be overridden by the Java system property gate.site.config;

• the user level, which lives in the user’s HOME directory on UNIX or their profile directory on Windoze (note that parts of this file are overwritten by GATE when saving user settings). The default location for this file can be overridden by the Java system property gate.user.config.

Where configuration data appears on several different levels, the more specific ones overwrite the more general. This means that you can set defaults for all GATE users on your system, for example, and allow individual users to override those defaults without interfering with others.

Configuration data can be set from the GUI via the 'Options' menu, 'Configuration' choice. The user can change the appearance of the GUI (via the Appearance submenu), which includes the options of font and the "look and feel". The "Advanced" submenu enables the user to include annotation features when saving the document and preserving its format, to save the selected Options automatically on exit, and to save the session automatically on exit. The Input Methods menu (available via the Options menu) enables the user to change the default language for input. These options are all stored in the user's .gate.xml file.

When using GATE from the framework, you can also set the site config location using Gate.setSiteConfigFile(File) prior to calling Gate.init().

3.7.1 [F] Save Config Data to gate.xml

Arbitrary feature/value data items can be saved to the user’s gate.xml file via the following API calls:

To get the config data: Map configData = Gate.getUserConfig().

To add config data simply put pairs into the map: configData.put("my new config key", "value");.

To write the config data back to the XML file: Gate.writeUserConfig();.
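Taken together (the key name is illustrative, and Gate.init() is assumed to have been called first):

Map configData = Gate.getUserConfig();        // fetch the user config map
configData.put("my new config key", "value"); // add or override a pair
Gate.writeUserConfig();                       // write back to the XML file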

Note that new config data will simply override old values, where the keys are the same. In this way defaults can be set up by putting their values in the main gate.xml file, or the site gate.xml file; they can then be overridden by the user’s gate.xml file.

3.8 Build GATE

Note that you don’t need to build GATE unless you’re doing development on the system itself.

Prerequisites:

• A conforming Java environment as above.

• A copy of the GATE sources and the build scripts – either the SRC distribution package from the nightly snapshots or a copy of the code obtained through Subversion (see Section 3.2.3).

• An appreciation of natural beauty.

GATE now includes a copy of the Ant build tool, which can be accessed through the scripts included in the bin directory (use ant.bat for Windows 98 or ME, ant.cmd for Windows NT, 2000 or XP, and ant.sh for Unix platforms).

To build GATE, cd to the gate directory and:

1. Type: bin/ant

2. [optional] To test the system: bin/ant test (Note that DB tests may fail unless you can connect to Sheffield’s Oracle server.)

3. [optional] To make the Javadoc documentation: bin/ant doc

4. You can also run GATE using Ant, by typing: bin/ant run

5. To see a full list of options type: bin/ant help

(The details of the build process are all specified by the build.xml file in the gate directory.)

You can also use a development environment like Borland JBuilder (click on the gate.jpx file), but note that it’s still advisable to use ant to generate documentation, the jar file and so on. Also note that the run configurations have the location of a gate.xml site configuration file hard-coded into them, so you may need to change these for your site.

3.9 [D,F] Create a New CREOLE Resource

CREOLE resources are Java Beans (see chapter 4). They come in three types: Language Resource, Processing Resource and Visual Resource (see chapter 1 section 1.3.1). To create a new resource you need to:

• write a Java class that implements GATE’s beans model;

• compile the class, and any others that it uses, into a Java Archive (JAR) file;

• write some XML configuration data for the new resource;

• tell GATE the URL of the new JAR and XML files.

The GATE development environment helps you with this process by creating a set of directories and files that implement a basic resource, including a Java code file and an Ant build file. This process is called 'bootstrapping'.

For example, let’s create a new component called GoldFish, which will be a Processing Resource that looks for all instances of the word ‘fish’ in a document and adds an annotation of type ‘GoldFish’.

First start the GATE development environment (see section 3.2). From the 'Tools' menu select 'BootStrap Wizard', which will pop up the dialogue in figure 3.2.

Figure 3.2: BootStrap Wizard Dialogue

The meaning of the data entry fields:

• The 'resource name' will be displayed when GATE loads the resource, and will be the name of the directory the resource lives in. For our example: GoldFish.

• 'Resource package' is the Java package that the class representing the resource will be created in. For our example: sheffield.creole.example.

• 'Resource type' must be one of Language, Processing or Visual Resource. In this case we're going to process documents (and add annotations to them), so we select ProcessingResource.

• 'Implementing class name' is the name of the Java class that represents the resource. For our example: GoldFish.

• The 'interfaces implemented' field allows you to add other interfaces (e.g. gate.creole.ControllerAwarePR – see section 4.6) that you would like your new resource to implement. In this case we just leave the default (which is to implement the gate.ProcessingResource interface).

• The last field selects the directory that you want the new resource created in. For our example: z:/tmp.

Now we need to compile the class and package it into a JAR file. The bootstrap wizard creates an Ant build file that makes this very easy – so long as you have Ant set up properly, you can simply run:

ant jar

This will compile the Java source code and package the resulting classes into GoldFish.jar. If you don’t have your own copy of Ant, you can use the one bundled with GATE - suppose your GATE is installed at /opt/gate-3.1, then you can use /opt/gate-3.1/bin/ant jar to build.

You can now load this resource into GATE; see

• section 3.10 for how to instantiate the resource from the framework;

• section 3.11 for how to load the resource in the development environment;

• section 3.12 for how to configure and further develop your resource (which will, by default, do nothing!).

The default Java code that was created for our GoldFish resource looks like this:

/*
 * GoldFish.java
 *
 * You should probably put a copyright notice here. Why not use the
 * GNU licence? (See http://www.gnu.org/.)
 *
 * hamish, 26/9/2001
 *
 * $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $
 */
package sheffield.creole.example;


import java.util.*;
import gate.*;
import gate.creole.*;
import gate.util.*;

/**
 * This class is the implementation of the resource GOLDFISH.
 */
public class GoldFish extends AbstractProcessingResource
  implements ProcessingResource {

} // class GoldFish
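The generated class does nothing yet. As a hedged sketch of what the finished GoldFish might contain, the following members would go inside the class body. None of this is wizard output: the document parameter and its accessors are our own addition, and would need a matching RUNTIME <PARAMETER> entry in creole.xml.

  private Document document;
  public Document getDocument() { return document; }
  public void setDocument(Document document) { this.document = document; }

  public void execute() throws ExecutionException {
    String text = document.getContent().toString();
    AnnotationSet outputAS = document.getAnnotations();
    int index = text.indexOf("fish");
    while(index >= 0) {
      try {
        // annotate the four characters of each 'fish' occurrence
        outputAS.add(new Long(index), new Long(index + 4),
                     "GoldFish", Factory.newFeatureMap());
      } catch(InvalidOffsetException e) {
        throw new ExecutionException(e);
      }
      index = text.indexOf("fish", index + 1);
    }
  }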

The default XML configuration for GoldFish looks like this:

<CREOLE-DIRECTORY>
  <CREOLE>
    <RESOURCE>
      <NAME>GoldFish</NAME>
      <JAR>GoldFish.jar</JAR>
      <CLASS>sheffield.creole.example.GoldFish</CLASS>
    </RESOURCE>
  </CREOLE>
</CREOLE-DIRECTORY>

The directory structure containing these files is shown in figure 3.3. GoldFish.java lives in the src/sheffield/creole/example directory. creole.xml and build.xml are in the top GoldFish directory. The lib directory is for libraries; the classes directory is where Java class files are placed; the doc directory is for documentation. These last two, plus GoldFish.jar are created by Ant.

This process has the advantage that it creates a complete source tree and build structure for the component, and the disadvantage that it creates a complete source tree and build structure for the component. If you already have a source tree, you will need to chop out the bits you need from the new tree (in this case GoldFish.java and creole.xml) and copy them into your existing one.

Figure 3.3: BootStrap directory tree

3.10 [F] Instantiate CREOLE Resources

This section describes how to create CREOLE resources as objects in a running Java virtual machine. This process involves using GATE’s Factory class, and, in the case of LRs, may also involve using a DataStore.

CREOLE resources are Java Beans; creation of a resource object involves using a default constructor, then setting parameters on the bean, then calling an init() method (this method is not part of the beans spec). The Factory takes care of all this, makes sure that the GUI is told about what is happening (when GUI components exist at runtime), and also takes care of restoring LRs from DataStores. So a programmer using GATE should never call the constructor of a resource: always use the Factory.

The valid parameters for a resource are described in the resource’s section of its creole.xml file – see section 3.12.

Creating a resource via the Factory involves passing values for any create-time parameters that require setting to the Factory's createResource method. If no parameters are passed, the defaults are used. So, for example, the following code creates a default ANNIE part-of-speech tagger:


FeatureMap params = Factory.newFeatureMap(); // empty map: default parameters
ProcessingResource tagger = (ProcessingResource)
  Factory.createResource("gate.creole.POSTagger", params);

Note that if the resource created here had any parameters that were both mandatory and had no default value, the createResource call would throw an exception. In this case, all the information needed to create a tagger is available in default values given in the tagger’s XML definition:

<RESOURCE>
  <NAME>ANNIE POS Tagger</NAME>
  <COMMENT>Mark Hepple's Brill-style POS tagger</COMMENT>
  <CLASS>gate.creole.POSTagger</CLASS>
  <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>
  <PARAMETER NAME="rulesURL" DEFAULT="....">java.net.URL</PARAMETER>
</RESOURCE>

Here the two parameters shown are either ‘runtime’ parameters, which are set before a PR is executed, or have a default value (in this case the default rules file is distributed with GATE itself).

When creating a Document, however, the URL of the source for the document must be provided (alternatively, a string giving the document source may be provided). For example:

URL u = new URL("http://gate.ac.uk/hamish/"); FeatureMap params = Factory.newFeatureMap(); params.put("sourceUrl", u); Document doc = (Document) Factory.createResource("gate.corpora.DocumentImpl", params);

The document created here is transient: when you quit the JVM the document will no longer exist. If you want the document to be persistent, you need to store it in a DataStore. Assuming that you have a DataStore already open called myDataStore, this code will ask the data store to take over persistence of your document, and to synchronise the memory representation of the document with the disk storage:

Document persistentDoc = myDataStore.adopt(doc, mySecurity);
myDataStore.sync(persistentDoc);

Security: User access to the LRs is provided by a security mechanism of users and groups, similar to those on an operating system. When users create/save LRs into Oracle, they specify reading and writing access rights for users from their group and other users. For example, LRs created by one user/group can be made read-only to others, so they can use the data, but not modify it. The access modes are:

• others: read/none;

• group: modify/read/none;

• owner: modify/read.

If needed, ownership can be transferred from one user to another. Users, groups and LR permissions are administered in a special administration tool, by a privileged user. For more details see chapter 14.

When you want to restore a document (or other LR) from a data store, you make the same createResource call to the Factory as for the creation of a transient resource, but this time you tell it the data store the resource came from, and the ID of the resource in that datastore:

URL u = ....; // URL of a serial data store directory
SerialDataStore sds = new SerialDataStore(u.toString());
sds.open();

// getLrIds returns a list of LR Ids, so we get the first one
Object lrId = sds.getLrIds("gate.corpora.DocumentImpl").get(0);

// we need to tell the factory about the LR's ID in the data
// store, and about which data store it is in - we do this
// via a feature map:
FeatureMap features = Factory.newFeatureMap();
features.put(DataStore.LR_ID_FEATURE_NAME, lrId);
features.put(DataStore.DATASTORE_FEATURE_NAME, sds);

// read the document back
Document doc = (Document)
  Factory.createResource("gate.corpora.DocumentImpl", features);

3.11 [D] Load CREOLE Resources

3.11.1 Loading Language Resources

Load a language resource by right clicking on “Language Resources” and selecting a language resource type (document, corpus or annotation schema). Choose a name for the resource, and choose any parameters as necessary.

For a document, a file or URL should be selected as the value of "sourceUrl" (double clicking in the "values" box brings up a tree structure to enable selection of documents). Other parameters can be selected or changed as necessary, such as the encoding of the document, and whether it should be markup aware.

There are three ways of adding documents to a corpus:

1. When creating the corpus, clicking on the icon under Value brings up a popup window with a list of the documents already loaded into GATE. This enables the user to add any documents to the corpus.

2. Alternatively, the corpus can be loaded first, and documents added later by double clicking on the corpus and using the + and - icons to add or remove documents to the corpus. Note that the documents must have been loaded into GATE before they can be added to the corpus.

3. Once loaded, the corpus can be populated by right clicking on the corpus and selecting "Populate". With this method, documents do not have to have been previously loaded into GATE, as they will be loaded during the population process. Select the directory containing the relevant files, choose the encoding, and check or uncheck the "recurse directories" box as appropriate. The initial value for the encoding is the platform default.

Additionally, right-clicking on a loaded document in the tree and selecting the "New corpus with this document" option creates a new transient corpus named Corpus for document name containing just this document. (A sketch of populating a corpus from the framework follows below.)

To add a new annotation schema, simply choose the name and the path or URL. For more information about schemas, see 6.4.1.
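The GUI's "Populate" operation is also available from the framework as a static helper; a hedged sketch follows (we believe the helper is gate.corpora.CorpusImpl.populate with the signature shown, but check the Javadoc; the directory URL is illustrative):

Corpus corpus = Factory.newCorpus("my corpus");
URL directory = new URL("file:/data/texts/");
// null FileFilter = all files; null encoding = platform default;
// true = recurse into subdirectories
gate.corpora.CorpusImpl.populate(corpus, directory, null, null, true);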

3.11.2 Loading Processing Resources

This section describes how to load and run CREOLE resources not present in ANNIE. To load ANNIE, see Section 3.16. For technical descriptions of these resources, see Chapter 9. First ensure that the necessary plugins have been loaded (see Section 3.4). If the resource you require does not appear in the list of Processing Resources, then you probably do not have the necessary plugin loaded. Processing resources are loaded by selecting them from the set of Processing Resources (right click on Processing Resources or select "New Processing Resource" from the File menu), adding them to the application and selecting the necessary parameters (e.g. input and output Annotation Sets).

3.11.3 Loading and Processing Large Corpora

When trying to process a larger corpus (i.e. one that would not fit in memory at one time) the use of a datastore and persistent corpora is required.

Open or create a datastore and then create a corpus. Save the so-far-empty corpus to the datastore – this will convert it to a persistent corpus.

When populating or processing the persistent corpus, the documents contained will only be loaded one by one thus reducing the amount of memory required to only that necessary for loading the largest document in the collection.
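A hedged sketch of this workflow from the framework (the datastore directory is illustrative; serial datastores need no security information, so null is passed to adopt):

// create and open a serial datastore on disk
SerialDataStore sds = new SerialDataStore(
    new File("/data/myDS").toURI().toURL().toString());
sds.create();
sds.open();

// create a transient corpus, then hand it over to the datastore
Corpus corpus = Factory.newCorpus("big corpus");
Corpus persistentCorpus = (Corpus) sds.adopt(corpus, null);
sds.sync(persistentCorpus);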

3.12 [D,F] Configure CREOLE Resources

This section describes how to write entries in the creole.xml file that is used to describe resources to GATE. This data is used to tell GATE things like what parameters a resource has, how to display it if it has a visualisation, etc.

An example file:

<CREOLE-DIRECTORY>
  <CREOLE>
    <RESOURCE>
      <NAME>GATE XML Document Format</NAME>
      <CLASS>gate.corpora.XmlDocumentFormat</CLASS>
      <AUTOINSTANCE/>
      <PRIVATE/>
      <JAR>gate.jar</JAR>
    </RESOURCE>
  </CREOLE>
</CREOLE-DIRECTORY>

These files have as a root element CREOLE-DIRECTORY, and may contain any number of CREOLE elements, which in turn contain any number of RESOURCE elements. (The purpose of the CREOLE element is to allow files to be built up from the concatenation of multiple other files.)

Each resource must give a name, a Java class and the JAR file that it can be loaded from. The above example defines GATE’s XML document format analyser resource. This resource has no parameters, is automatically loaded when the creole.xml data is loaded, is not displayed to the GUI user (it is used internally by the document creation code), and is loaded from gate.jar.

Resources may also have parameters of various types. These resources, from the GATE distribution, illustrate the various types of parameters:

<RESOURCE>
  <NAME>GATE document</NAME>
  <CLASS>gate.corpora.DocumentImpl</CLASS>
  <INTERFACE>gate.Document</INTERFACE>
  <COMMENT>GATE transient document</COMMENT>
  <PARAMETER NAME="sourceUrl">java.net.URL</PARAMETER>
  <PARAMETER NAME="encoding">java.lang.String</PARAMETER>
  <PARAMETER NAME="markupAware">java.lang.Boolean</PARAMETER>
  <PARAMETER NAME="stringContent">java.lang.String</PARAMETER>
  <PARAMETER NAME="sourceUrlStartOffset">java.lang.Long</PARAMETER>
  <PARAMETER NAME="sourceUrlEndOffset">java.lang.Long</PARAMETER>
  <PARAMETER NAME="preserveOriginalContent">java.lang.Boolean</PARAMETER>
  <PARAMETER NAME="collectRepositioningInfo">java.lang.Boolean</PARAMETER>
  <ICON>lr.gif</ICON>
</RESOURCE>

<RESOURCE>
  <NAME>Document Reset PR</NAME>
  <CLASS>gate.creole.annotdelete.AnnotationDeletePR</CLASS>
  <COMMENT>Document cleaner</COMMENT>
  <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>
  <PARAMETER NAME="annotationTypes" RUNTIME="true"
             OPTIONAL="true">java.util.ArrayList</PARAMETER>
</RESOURCE>

Parameters may be optional, and may have default values (and may have comments to describe their purpose, which is displayed by the GUI during interactive parameter setting).

Some PR parameters are set at execution time (RUNTIME), some at initialisation time. E.g. at execution time a document is supplied to a language analyser; at initialisation time a grammar may be supplied to a language analyser.

Each parameter has a type, which may be the type of another resource, an enum type or a Java built-in. Attributes of parameters:

NAME: the name of the property that the parameter refers to. If supplied, it changes the name of the get/set methods that the initialisation routines expect to find on the resource (these are normally derived from the value of the parameter, i.e. from its type). The name must be identical to the property of the resource that the parameter relates to.

DEFAULT: default value (see below).

RUNTIME: the parameter doesn't need setting at initialisation time, but must be set before calling execute(). Only meaningful for PRs.

OPTIONAL: the parameter is not required.

COMMENT: a description of the parameter, for display purposes.

The DEFAULT string is converted to the appropriate type for the parameter – java.lang.String parameters use the value directly, primitive wrapper types (e.g. java.lang.Integer) use their respective valueOf methods, and other built-in Java types can have defaults specified provided they have a constructor taking a String.

The type java.net.URL is treated specially: if the default string is not an absolute URL (e.g. http://gate.ac.uk/) then it is treated as a path relative to the location of the creole.xml file. Thus a DEFAULT of "resources/main.jape" in the file file:/opt/MyPlugin/creole.xml is treated as the absolute URL file:/opt/MyPlugin/resources/main.jape.

For Collection-valued parameters multiple values may be specified, separated by semicolons, e.g. "foo;bar;baz"; if the parameter's type is an interface – Collection or one of its sub-interfaces (e.g. List) – a suitable concrete class (e.g. ArrayList, HashSet) will be chosen automatically for the default value.

For parameters of type gate.FeatureMap multiple name=value pairs can be specified, e.g. "kind=word;orth=upperInitial". For enum-valued parameters the default string is taken as the name of the enum constant to use. Finally, if no DEFAULT attribute is specified, the default value is null. (Some illustrative PARAMETER entries follow.)
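For illustration, hypothetical PARAMETER entries exercising these default formats might look like this (the parameter names are invented for the example):

<!-- a List-valued default: semicolon-separated values -->
<PARAMETER NAME="annotationTypes" RUNTIME="true" OPTIONAL="true"
           DEFAULT="Person;Location">java.util.List</PARAMETER>

<!-- a FeatureMap-valued default: name=value pairs -->
<PARAMETER NAME="filterFeatures" OPTIONAL="true"
           DEFAULT="kind=word;orth=upperInitial">gate.FeatureMap</PARAMETER>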

Resources can optionally specify the icon to be displayed for the resource in the GATE GUI. Icons are loaded from the JAR file specified in the JAR element; the ICON element gives the path to the icon file inside the JAR, e.g. /myresource/icons/clever.png. If the ICON element does not specify a path, the icon is loaded from the GATE built-in set located in gate.jar at /gate/resources/img, and if no extension is given GATE tries .png, .gif and .jpg in that order. So an ICON value of document would use /gate/resources/img/document.png from gate.jar. If no icon is specified, a standard LR or PR icon is used.

Visual Resources also have a GUI tag, which describes the resource (PR or LR) that it displays, whether it is the main viewer for that resource or not (main viewers are the first tab displayed for a resource) and whether the VR should go in the small viewers window or the large one. For example:

<RESOURCE>
  <NAME>Features Editor</NAME>
  <CLASS>gate.gui.FeaturesEditor</CLASS>
  <GUI TYPE="SMALL">
    <MAIN_VIEWER/>
    <RESOURCES_DISPLAYED>gate.util.FeatureBearer</RESOURCES_DISPLAYED>
  </GUI>
</RESOURCE>

More information:

• To collect PRs into an application and run them, see section 3.13.

• See GATE’s internal creole.xml file (note that there are no JAR entries there, as the file is bundled with GATE itself).

3.13 [D] Create and Run an Application

Once all the resources have been loaded, an application can be created and run. Right click on “Applications” and select “New” and then either “Corpus Pipeline” or “Pipeline”. A pipeline application can only be run over a single document, while a corpus pipeline can be run over a whole corpus.

To build the pipeline, double click on it, and select the resources needed to run the application (you may not necessarily wish to use all those which have been loaded). Transfer the necessary components from the set of “loaded components” displayed on the left hand side of the main window to the set of “selected components” on the right, by selecting each component and clicking on the left and right arrows, or by double-clicking on each component. Ensure that the components selected are listed in the correct order for processing (starting from the top). If not, select a component and move it up or down the list using the up/down arrows at the left side of the pane. Ensure that any necessary parameters are set for each processing resource (by clicking on the resource in the list of selected resources and checking the relevant parameters in the pane below). For example, if you wish to use annotation sets other than the Default one, these must be defined for each processing resource. Note that if a corpus pipeline is used, the corpus needs only to be set once, using the drop-down menu beside the “corpus” box. If a pipeline is used, the document must be selected for each processing resource used. Finally, right-click on “Run” to run the application on the document or corpus.

For how to use the conditional versions of the pipelines see section 3.14.

3.14 [D] Run PRs Conditionally on Document Features

The “Conditional Pipeline” and “Conditional Corpus Pipeline” application types are conditional versions of the pipelines mentioned in section 3.13 and allow processing resources to be run or not according to the value of a feature on the document. In terms of graphical interface, the only addition brought by the conditional versions of the applications is a box situated underneath the lists of available and selected resources which allows the user to choose whether the currently selected processing resource will run always, never or only on the documents that have a particular value for a named feature.

If the Yes option is selected then the corresponding resource will be run on all the documents processed by the application, as in the case of non-conditional applications. If the No option is selected then the corresponding resource will never be run; the application will simply ignore its presence. This option can be used to temporarily and quickly disable an application component, for debugging purposes for example.

The If value of feature option permits running specific application components conditionally on document features. When selected, this option enables two text input fields that are used to enter the name of a feature and the value of that feature for which the corresponding processing resource will be run. When a conditional application is run over a document, for each component that has an associated condition, the value of the named feature is checked on the document and the component will only be used if the value entered by the user matches the one contained in the document features.

3.15 [D] View Annotations

To view a document, double click on the filename in the left hand pane. Note that it may take a few seconds for the text to be displayed if it is long.

To view the annotation sets, click on AnnotationSets on the right pane. This will bring up the annotation sets viewer, which displays the annotation sets available and their corresponding annotation types. Note that the default annotation set has no name. If no application has been run, the only annotations to be displayed will be those corresponding to the document format analysis performed automatically by GATE on loading the document (e.g. HTML or XML tags). If an application has been run, other annotation types and/or annotation sets may also be present. The fonts and colours of the annotations can be edited by double clicking on the annotation name.

Select the annotation types to be viewed by clicking on the appropriate checkbox(es). The text segments corresponding to these annotations will be highlighted in the main text window.

To view the annotations and their features, click on Annotations at the top or bottom of the main window. The annotation viewer will appear above or below the main text, respectively. It will only contain the annotations selected from the annotation sets. These lists can be sorted in ascending and descending order by any column, by clicking on the corresponding column heading. Clicking on an entry in the table will also highlight the respective matching text portion.

Hovering over some part of the text in the main window will bring up a popup box containing a list of the annotations associated with it (assuming that the relevant annotation types have been selected from the annotation set viewer).

Annotations relating to coreference (if relevant) are displayed separately in the coreference viewer. This operates in the same way as the annotation sets viewer.

At any time, the main viewer can also be used to display other information, such as Messages, by clicking on the header at the top of the main window. If an error occurs in processing, the messages tab will flash red, and an additional popup error message may also occur.

Text in a loaded document can be edited in the document viewer. The usual platform specific cut, copy and paste keyboard shortcuts should also work, depending on your operating system (e.g. CTRL-C, CTRL-V for Windows). To prevent the new annotation windows popping up when a piece of text is selected, hide the AnnotationSets view (the tree on the right) first to make it inactive. The highlighted portions of the text will still remain visible.

3.16 [D] Do Information Extraction with ANNIE

This section describes how to load and run ANNIE (see Chapter 8) from the development environment. To embed ANNIE in other software, see section 3.25.

From the File menu, select “Load ANNIE system”. To run it in its default state, choose “With Defaults”. This will automatically load all the ANNIE resources, and create a corpus pipeline called ANNIE with the correct resources selected in the right order, and the default input and output annotation sets.

If “Without Defaults” is selected, the same processing resources will be loaded, but a popup window will appear for each resource, which enables the user to specify a name and location for the resource. This is exactly the same procedure as for loading a processing resource individually, the difference being that the system automatically selects those resources contained within ANNIE. When the resources have been loaded, a corpus pipeline called ANNIE will be created as before.

The next step is to add a corpus (see Section 3.11.1), and select this corpus from the drop-down Corpus menu in the Serial Application editor. Finally click on Run (from the Serial Application editor, or by right clicking on the application name and selecting “Run”). To view the results, double click on the filename in the left hand pane. Neither Annotation Sets nor Annotations will be shown until annotations are selected in the Annotation Sets view; the Default set is indicated only by an unlabelled right-arrowhead, which must be selected in order to make the available annotations visible.

3.17 [D] Modify ANNIE

You will find the ANNIE resources in gate/plugins/ANNIE/resources. Simply locate the existing resources you want to modify, make a copy with a new name, edit them, and load the new resources into GATE as new Processing Resources (see Section 3.11.2).

3.18 [D] Create and Edit Test Data

Since many NLP algorithms require annotated corpora for training, GATE’s development environment provides easy-to-use and extendable facilities for text annotation. The annotation can be done manually by the user or semi-automatically by running some processing resources over the corpus and then correcting/adding new annotations manually. Depending on the information that needs to be annotated, some ANNIE modules can be used or adapted to bootstrap the corpus annotation task.

To create annotations manually:

• Select the text you want to annotate

• The most recent annotation type to have been used will be displayed in a popup box. If this is not the one you want, use the menu to change it. If it is correct, you need do nothing further. You can add or change features and their values using the menu in the box.

• To delete an annotation, click on the red X in the popup box.

The popup menu only contains annotation types present in the Annotation Schema and those already listed in the relevant Annotation Set. To create a new Annotation Schema, see Section 3.19. The popup menu can be edited to add a new annotation type, however.

The new annotation created will automatically be placed in the annotation set that has been selected (highlighted) by the user. To create a new annotation set, type the name of the new set to be created in the box below the list of annotation sets, and click on ”New”.

Figure 3.4 demonstrates adding the Organization annotation for the string “University of Sheffield” (highlighted in grey) to the Default Annotation set.

Figure 3.4: Adding an Organization annotation to the Default Annotation Set

To add a second annotation to a selected piece of text, or to add an overlapping annotation to an existing one, press the CTRL key to avoid the existing annotation popup appearing, and then select the text and create the new annotation. Again by default the last annotation type to have been used will be displayed; change this to the new annotation type. When a piece of text has more than one annotation associated with it, all the annotations will be displayed on mouseover. Selecting one of them will bring up the relevant annotation popup.

3.18.1 Saving the test data

The data can either be dumped out as a file (see Section 3.30) or saved in a data store (see Section 3.20).

3.19 [D,F] Create a New Annotation Schema

GUI

An annotation schema file can be loaded or unloaded in GATE just like any other language resource. Once loaded into the system, the SchemaAnnotationEditor will use this definition when creating or editing annotations.

API

Another way to bring an annotation schema inside GATE is through the creole.xml file. By using the AUTOINSTANCE element, one can create instances of resources defined in creole.xml. The gate.creole.AnnotationSchema class (the Java representation of an annotation schema file) initialises with some predefined annotation definitions (annotation schemas) as specified by the GATE team.

Example from GATE’s internal creole.xml (in src/gate/resources/creole):

<RESOURCE>
  <NAME>Annotation schema</NAME>
  <CLASS>gate.creole.AnnotationSchema</CLASS>
  <COMMENT>An annotation type and its features</COMMENT>
  <PARAMETER NAME="xmlFileUrl"
    COMMENT="The URL of the schema definition file">java.net.URL</PARAMETER>
  <AUTOINSTANCE>
    <PARAM NAME="xmlFileUrl" VALUE="schema/AddressSchema.xml"/>
  </AUTOINSTANCE>
</RESOURCE>

In order to create a gate.creole.AnnotationSchema object from a schema annotation file, one must use the gate.Factory class.

E.g.:

FeatureMap params = Factory.newFeatureMap();
params.put("xmlFileUrl", annotSchemaFile.toURL());
AnnotationSchema annotSchema = (AnnotationSchema)
    Factory.createResource("gate.creole.AnnotationSchema", params);

Note: All the elements and their values must be written in lower case, as XML is defined to be case sensitive and the parser used for XML Schemas inside GATE is case sensitive.

In order to write XML Schema definitions, the ones defined in GATE (resources/creole/schema) can be used as a model, or the user can consult http://www.w3.org/2000/10/XMLSchema for a proper description of the semantics of the elements used.

Some examples of annotation schemas are given in Section 6.4.1.

3.20 [D] Save and Restore LRs in Data Stores

To save a text in a data store, a new data store must first be created if one does not already exist. Create a data store by right clicking on Data Store in the left hand pane, and select the option ”Create Data Store”. Select the data store type you wish to use. Create a directory to be used as the data store (note that the data store is a directory and not a file).

You can either save a whole corpus to the datastore (in which case the structure of the corpus will be preserved) or you can save individual documents. The recommended method is to save the whole corpus. To save a corpus, right click on the corpus name and select the ”Save to...” option (giving the name of the datastore created earlier). To save individual documents to the data store, right click on each document name and follow the same procedure.

To load a document from a data store, do not try to load it as a language resource. Instead, open the data store by right clicking on Data Store in the left hand pane, select “Open Data Store” and choose the data store to open. The data store tree will appear in the main window. Double click on a corpus or document in this tree to open it. To save a corpus and document back to the same datastore, simply select the ”Save” option.

3.21 [D] Save Resource Parameter State to File

Resources, and applications that are made up of them, are created based on the settings of their parameters (see section 3.11). It is possible to save the data used to create an application to a file and re-load it later. To save the application to file, right click on it in the resources tree and select “Save application state”, which will give you a file creation dialogue.

To restore the application later, select “Restore application from file” from the “File” menu.

Note that the data that is saved represents how to recreate an application – not the resources that make up the application itself. So, for example, if your application has a resource that initialises itself from some file (e.g. a grammar) then that file must still exist when you restore the application.

The file resulting from saving the application state contains the values of the initialisation parameters for all the processing resources contained in the stored application. For parameters of type URL (which are typically used to select external resources such as grammars or rules files) a transformation is applied so that all the paths are relative to the location of the file used to store the state. This means that the resource files used by an application do not need to be in the same location as when the application was initially created, but rather in the same location relative to the application file. This allows the creation and deployment of portable applications by keeping the application file and the resource files used by the application together. The easiest way of deploying a portable GATE application is to store the application file and the application resources under the same top directory, which then becomes the deployment unit.
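For example, a deployment unit of this kind might be laid out as follows (an illustrative layout – the names are invented):

myApp/
  application.gapp
  resources/
    grammar.jape
    gazetteer/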

3.22 [D,F] Perform Evaluation with the AnnotationDiff tool

Section 13 describes the theory behind this tool.

3.22.1 GUI

The AnnotationDiff tool is activated by selecting it from the Tools menu at the top of the window. It will appear in a new window. Select the key and response documents to be used (note that both must have been previously loaded into the system), the annotation sets to be used for each, and the annotation type to be evaluated.

Note that the tool automatically intersects all the annotation types from the selected key annotation set with all types from the response set.

On a separate note, you can perform a diff on the same document, between two different annotation sets. One annotation set could contain the key type and another could contain the response one.

After the type has been selected, the user is required to decide how the features will be compared. It is important to know that the tool compares them by analyzing if features from the key set are contained in the response set. It checks for both the feature name and feature value to be the same.

There are three basic options to select:

• To take all the features from the key set into consideration

• To take only the user selected ones

• To ignore all the features from the key set.

If false positives are to be measured, select the annotation type (and relevant annotation set) to be used as the denominator (normally, Token or Sentence). The weight for the F-Measure can also be changed – by default it is set to 0.5 (i.e. to give precision and recall equal weight). Finally, click on “Evaluate” to display the results. Note that the window may need to be resized manually (by dragging the window edges or internal bars as appropriate).

In the main window, the key and response annotations will be displayed. They can be sorted by any category by clicking on the relevant column header. The key and response annotations will be aligned if their indices are identical, and are color coded according to the legend displayed.

Precision, recall, F-measure and false positives are also displayed below the annotation tables, each according to 3 criteria - strict, lenient and average. See sections 13.1 and 13.4 for more details about the evaluation metrics.

The results can be saved to an HTML file by pressing the ”Export to HTML” button. This creates an HTML snapshot of what the AnnotationDiff interface shows at that moment. The columns and rows in the table will be shown in the same order, and the hidden columns will not appear in the HTML file. The colours will also be the same.

3.23 [D] Use the Corpus Benchmark Evaluation tool

The Corpus Benchmark tool can be run in two ways: standalone and GUI mode. Section 13.3 describes the theory behind this tool.

3.23.1 GUI mode

To use the tool in GUI mode, first make sure the properties of the tool have been set correctly (see section 3.23.2 for how to do this). Then select “Corpus Benchmark Tool” from the Options menu. There are 3 ways in which it can be run:

• Default mode compares the stored processed set with the current processed set and the human-annotated set. This will give information about how well the system is doing compared with a previous version.

• Human marked against stored processing results compares the stored processed set with the human-annotated set.

• Human marked against current processing results compares the current processed set with the human-annotated set.

Once the mode has been selected, choose the directory where the corpus is to be found. The corpus must have a directory structure consisting of “clean” and “marked” subdirectories (note that these names are case sensitive). The clean directory should contain the raw texts; the marked directory should contain the human-annotated texts. Finally, select the application to be run on the corpus (for “default” and “human v current” modes).

If the tool is to be used in Default or Current mode, the corpus must first be processed with the current set of resources. This is done by selecting “Store corpus for future evaluation” from the Corpus Benchmark Tool. Select the corpus to be processed (from the top of the subdirectory structure, i.e. the directory containing the marked and stored subdirectories). If a “processed” subdirectory exists, the results will be placed there; if not, one will be created.

Once the corpus has been processed, the tool can be run in Default or Current mode. The resulting HTML file will be output in the main GATE messages window. This can then be pasted into a text editor, saved, and viewed in a web browser for easier viewing.

The tool can be used either in verbose or non-verbose mode, by selecting the verbose option from the menu. In verbose mode, any score below the user’s pre-defined threshold (stored in the corpus_tool.properties file) will show the relevant annotations for that entity type, thereby enabling the user to see where problems are occurring.

3.23.2 How to define the properties of the benchmark tool

The properties of the benchmark tool are defined in the file corpus_tool.properties, which should be located in the directory from which GATE is run (usually gate/build or gate/bin).

The following properties should be set:

• the threshold for the verbose mode (by default this is set to 0.5);

• the name of the annotation set containing the human-marked annotations (annotSetName);

• the name of the annotation set containing the system-generated annotations (outputSetName);

• the annotation types to be considered (annotTypes);

• the feature values to be considered, if any (annotFeatures).

The default Annotation Set has to be represented by an empty String. Note also that outputSetName and annotSetName must be different. If they are the same, then use the Annotation Set Transfer PR to change one of them.

An example file is shown below:

threshold=0.7
annotSetName=Key
outputSetName=ANNIE
annotTypes=Person;Organization;Location;Date;Address;Money
annotFeatures=type;gender

3.24 [D] Write JAPE Grammars

JAPE is a language for writing regular expressions over annotations, and for using patterns matched in this way as the basis for creating more annotations. JAPE rules compile into finite state machines. GATE’s built-in Information Extraction tools use JAPE (amongst other things). For information on JAPE see:

• chapter 7 describes how to write JAPE rules;

• chapter 8 describes the built-in IE components;

• appendix B describes how JAPE is implemented and formally defines the language’s grammar;

• appendix C describes the default Named Entity rules distributed with GATE.
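For a flavour of the language, a minimal grammar might look like the following (an illustrative sketch only – see chapter 7 for the full syntax and semantics):

Phase: Example
Input: Token
Options: control = appelt

Rule: TwoUpperTokens
(
  // match two consecutive tokens with initial capitals
  {Token.orth == "upperInitial"}
  {Token.orth == "upperInitial"}
):match
-->
// annotate the matched span, recording which rule fired
:match.PossibleName = {rule = "TwoUpperTokens"}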

3.25 [F] Embed NLE in other Applications

Embedding GATE-based language processing in other applications is straightforward:

• add gate.jar and the JAR files in gate/lib to the CLASSPATH;

• tell Java that the GATE Unicode Kit is an extension (-Djava.ext.dirs=/home/hamish/gate/bin/ext, for example);

• initialise GATE with gate.Gate.init();

• program to the framework API.

For example, this code will create the ANNIE extraction system:

public static void main(String args[])
    throws GateException, IOException {
  // initialise the GATE library
  Gate.init();

  // initialise ANNIE: create a corpus pipeline controller
  // to run ANNIE with
  annieController = (SerialAnalyserController) Factory.createResource(
      "gate.creole.SerialAnalyserController",
      Factory.newFeatureMap(), Factory.newFeatureMap(),
      "ANNIE_" + Gate.genSym());

  // load each PR as defined in ANNIEConstants
  for(int i = 0; i < ANNIEConstants.PR_NAMES.length; i++) {
    FeatureMap params = Factory.newFeatureMap(); // use default parameters
    ProcessingResource pr = (ProcessingResource)
        Factory.createResource(ANNIEConstants.PR_NAMES[i], params);

    // add the PR to the pipeline controller
    annieController.add(pr);
  } // for each ANNIE PR

  ....
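To actually run the pipeline once it is built, you would typically continue along these lines (a sketch, assuming the annieController created above and an example document URL):

  // create a corpus and add a document to process
  Corpus corpus = Factory.newCorpus("example corpus");
  Document doc = Factory.newDocument(new URL("http://gate.ac.uk/"));
  corpus.add(doc);

  // tell the controller about the corpus and run it
  annieController.setCorpus(corpus);
  annieController.execute();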

If you want to use resources from any plugins other than ANNIE, you need to load the plugins before calling createResource:

... Gate.init();

// need Tools plugin for the Morphological analyser Gate.getCreoleRegister().registerDirectories( new File(Gate.getPluginsHome(), "Tools").toURL() );

...

ProcessingResource morpher = (ProcessingResource) Factory.createResource("gate.creole.morph.Morph");

There are longer examples available at http://gate.ac.uk/GateExamples/doc/.

3.26 [F] Use GATE within a Spring application

GATE provides helper classes to allow GATE resources to be created and managed by the Spring framework. To use GATE within Spring, you must first declare a bean to initialise the GATE framework:

...

<bean id="init-gate" class="gate.util.spring.Init"
      init-method="init">
  <property name="preloadPlugins">
    <list>
      <value>file:/local/plugin/dir</value>
      <value>http://plugins.org/remote/plugin</value>
      <value>local/resource</value>
    </list>
  </property>
</bean>

Note that you must specify the init-method="init". The Init bean accepts properties gateHome, pluginsHome, siteConfigFile, userConfigFile and builtinCreoleDir, which are Spring Resources pointing to the relevant items. See section 3.3 for details. The preloadPlugins property is a list of URLs (or Spring resources) identifying GATE plugins to load after GATE has been initialised.

You can create individual GATE PRs or LRs using the GATE Factory in the usual way, but make sure to use depends-on to force GATE to initialise first:

<bean id="tokeniser" depends-on="init-gate"
      class="gate.Factory" factory-method="createResource">
  <constructor-arg value="gate.creole.tokeniser.DefaultTokeniser"/>
  <constructor-arg>
    <bean class="gate.util.spring.SpringFactory"
          factory-method="createFeatureMap">
      <constructor-arg>
        <map>
          <entry key="tokeniserRulesURL" value="dir/tokeniser.rules"/>
        </map>
      </constructor-arg>
    </bean>
  </constructor-arg>
</bean>

The SpringFactory createFeatureMap method creates a GATE feature map from any source map. It replaces Spring resources in the map with their corresponding URLs, so the GATE PR itself does not need to know about Spring.

Finally, SpringFactory also provides a method to load a saved .gapp file from a Spring resource location (remember the depends-on):

<bean id="annieApplication" depends-on="init-gate"
      class="gate.util.spring.SpringFactory"
      factory-method="loadObject">
  <constructor-arg value="/path/to/your/application.gapp"/>
</bean>

The resource is treated as a File if possible, and loaded via PersistenceManager.loadObjectFromFile. If the resource is not resolvable as a File, it is resolved as a URL and loaded via loadObjectFromUrl6.

3.27 [F] Use GATE within a Tomcat Web Application

Embedding GATE in a Tomcat web application involves several steps.

6Note that some PRs, such as the Tree tagger, do not like being loaded from non-file: URLs.

1. Put the necessary JAR files (gate.jar and all or most of the jars in gate/lib) in your webapp/WEB-INF/lib.

2. Put the plugins that your application depends on in a suitable location (e.g. webapp/WEB-INF/plugins).

3. Create suitable gate.xml configuration files for your environment.

4. Set the appropriate paths in your application before calling Gate.init().

This process is detailed in the following sections.

3.27.1 Recommended Directory Structure

You will need to create a number of other files in your web application to allow GATE to work:

• Site and user gate.xml config files - we highly recommend defining these specifically for the web application, rather than relying on the default files on your application server.

• The plugins your application requires.

In this guide, we assume the following layout:

webapp/
  WEB-INF/
    gate.xml
    user-gate.xml
    plugins/
      ANNIE/
      etc.

3.27.2 Configuration files

Your gate.xml (the “site-wide configuration file”) should be as simple as possible:

<?xml version="1.0" encoding="UTF-8" ?>
<GATE>
</GATE>

Similarly, keep the user-gate.xml (the “user config file”) simple:

<?xml version="1.0" encoding="UTF-8" ?>
<GATE>
</GATE>

This way, you can control exactly which plugins are loaded in your webapp code.

3.27.3 Initialization code

Given the directory structure shown above, you can initialize GATE in your web application like this:

// imports
...
public class MyServlet extends HttpServlet {
  private static boolean gateInited = false;

  public void init() throws ServletException {
    if(!gateInited) {
      try {
        ServletContext ctx = getServletContext();

        // use /path/to/your/webapp/WEB-INF as gate.home
        File gateHome = new File(ctx.getRealPath("/WEB-INF"));
        Gate.setGateHome(gateHome);
        // thus webapp/WEB-INF/plugins is the plugins directory, and
        // webapp/WEB-INF/gate.xml is the site config file.

        // use webapp/WEB-INF/user-gate.xml as the user config file, to
        // avoid confusion with your own user config.
        Gate.setUserConfigFile(new File(gateHome, "user-gate.xml"));

        Gate.init();

        // load plugins, for example...
        Gate.getCreoleRegister().registerDirectories(
            ctx.getResource("/WEB-INF/plugins/ANNIE"));

        gateInited = true;
      }
      catch(Exception ex) {
        throw new ServletException("Exception initialising GATE", ex);
      }
    }
  }
}

Once initialized, you can create GATE resources using the Factory in the usual way (for example, see section 3.25 for an example of how to create an ANNIE application). You should also read section 3.28 for important notes on using GATE in a multithreaded application.

3.28 [F] Use GATE in a Multithreaded Environment

GATE can be used in multithreaded applications, so long as you observe a few restrictions. First, you must initialise GATE by calling Gate.init() exactly once in your application, typically in the application startup phase before any concurrent processing threads are started.

Secondly, you must not make calls that affect the global state of GATE (e.g. loading or unloading plugins) in more than one thread at a time. Again, you would typically load all the plugins your application requires at initialisation time. It is safe to create instances of resources in multiple threads concurrently.

Thirdly, it is important to note that individual GATE processing resources, language resources and controllers are by design not thread safe – it is not possible to use a single instance of a controller/PR/LR in multiple threads at the same time – but for a well written resource it should be possible to use several different instances of the same resource at once, each in a different thread. When writing your own resource classes you should bear the following in mind, to ensure that your resource will be usable in this way.

• Avoid static data. Where possible, you should avoid using static fields in your class, and you should try to take all configuration data via the CREOLE parameters you declare in your creole.xml file. System properties may be appropriate for truly static configuration, such as the location of an external executable, but even then it is generally better to stick to CREOLE parameters – a user may wish to use two different instances of your PR, each talking to a different executable.

• Read parameters at the correct time. Init-time parameters should be read in the init() (and reInit()) method, and for processing resources runtime parameters should be read at each execute().

• Use temporary files correctly. If your resource makes use of external temporary files you should create them using File.createTempFile() at init or execute time, as appropriate. Do not use hardcoded file names for temporary files.

• If there are objects that can be shared between different instances of your resource, make sure these objects are accessed either read-only, or in a thread-safe way. In particular you must be very careful if your resource can take other resource instances as init or runtime parameters (e.g. the Flexible Gazetteer, section 9.5).

Of course, if you are writing a PR that is simply a wrapper around an external library that imposes these kinds of limitations there is only so much you can do. If your resource cannot be made safe you should document this fact clearly.

All the standard ANNIE PRs are safe when independent instances are used in different threads concurrently, as are the standard transient document, transient corpus and controller classes. A typical pattern of development for a multithreaded GATE-based application is:

• Develop your GATE processing pipeline in the GATE GUI.

• Save your pipeline as a .gapp file.

• In your application’s initialisation phase, load n copies of the pipeline using PersistenceManager.loadObjectFromFile() (see the Javadoc documentation for details) and either give one to each thread or store them in a pool (e.g. a LinkedList).

• When you need to process a text, get one copy of the pipeline from the pool, and return it to the pool when you have finished processing (a sketch of such a pool is shown below).
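A minimal pool along these lines might look like the following (an illustrative sketch – the class name is invented and error handling is kept to a minimum):

import gate.CorpusController;
import gate.util.persistence.PersistenceManager;
import java.io.File;
import java.util.LinkedList;

public class PipelinePool {
  private final LinkedList<CorpusController> pool =
      new LinkedList<CorpusController>();

  public PipelinePool(File gappFile, int n) throws Exception {
    // load n independent copies of the saved pipeline
    for(int i = 0; i < n; i++) {
      pool.add((CorpusController)
          PersistenceManager.loadObjectFromFile(gappFile));
    }
  }

  // borrow a pipeline, blocking until one is free
  public synchronized CorpusController take() throws InterruptedException {
    while(pool.isEmpty()) wait();
    return pool.removeFirst();
  }

  // return a pipeline to the pool when processing is finished
  public synchronized void release(CorpusController pipeline) {
    pool.addLast(pipeline);
    notify();
  }
}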

3.29 [D,F] Add support for a new document format

In order to add a new document format, one needs to extend the gate.DocumentFormat class and to implement an abstract method called:

public void unpackMarkup(Document doc) throws DocumentFormatException

This method is supposed to implement the functionality of each format reader and to create annotations on the document. Finally the document’s old content will be replaced with a new one containing only the text between markups (see the GATE API documentation for more details on this method’s functionality).

To add a new textual reader, one should extend gate.corpora.TextualDocumentFormat and override the unpackMarkup(doc) method.

This class needs to be implemented according to the Java bean specifications because it will be instantiated by GATE using the Factory.createResource() method.

The init() method that one needs to add and implement is very important because it is here that the reader defines the means by which it can be selected by GATE. What one needs to do is to add some specific information to certain static maps defined in the DocumentFormat class, which will be used at reader detection time.

After that, a definition of the reader is placed in one’s creole.xml file and the reader will be available to GATE.

The rest of this section presents a complete three-step example of adding such a reader. The reader we describe here is an XML reader.

Step 1

Create a new class called XmlDocumentFormat that extends gate.corpora.TextualDocumentFormat.

Step 2

Implement the unpackMarkup(Document doc) method, which performs the required functionality for the reader. Add XML detection means in the init() method:

public Resource init() throws ResourceInstantiationException {
  // Register XML mime type
  MimeType mime = new MimeType("text", "xml");
  // Register the class handler for this mime type
  mimeString2ClassHandlerMap.put(mime.getType() + "/" + mime.getSubtype(),
                                 this);
  // Register the mime type with mime string
  mimeString2mimeTypeMap.put(mime.getType() + "/" + mime.getSubtype(), mime);
  // Register file suffixes for this mime type
  suffixes2mimeTypeMap.put("xml", mime);
  suffixes2mimeTypeMap.put("xhtm", mime);
  suffixes2mimeTypeMap.put("xhtml", mime);
  // Register magic numbers for this mime type
  magic2mimeTypeMap.put("<?xml", mime);
  // Set the mime type for this language resource
  setMimeType(mime);
  return this;
} // init()

More details about the information in those maps can be found in Section 6.5.1.

Step 3

Add the following creole definition to the creole.xml document:


<RESOURCE>
  <NAME>My XML Document Format</NAME>
  <CLASS>mypackage.XmlDocumentFormat</CLASS>
  <AUTOINSTANCE/>
  <PRIVATE/>
</RESOURCE>

More information on the operation of GATE’s document format analysers may be found in section 6.5.

3.30 [D] Dump Results to File

There are three main ways to dump out the results of, for example, some language analysis or Information Extraction process running over documents:

1. preserving the original document format, with optional added annotations;

2. in GATE’s own XML serialisation format (including all the annotations on the docu- ment);

3. by writing your own dump algorithm as a ProcessingResource.

This section describes how to use the first two options.

Both types of data export are available in the popup menu triggered by right-clicking on a document in the resources tree (see section 3.6): type 1 is called ‘save preserving format’ and type 2 is called ‘save as XML’.

Selecting the save as XML option leads to a file open dialogue; give the name of the file you want to create, and the whole document and all its data will be exported to that file. If you later create a document from that file, the state will be restored. (Note: because GATE’s annotation model is richer than that of XML, and because our XML dump implementation sometimes cuts corners7, the state may not be identical after restoration. If your intention is to store the state for later use, use a DataStore instead.)

The save preserving format option also leads to a file dialogue; give a name and the data you require will be dumped into the file. The difference is that the file will preserve the original format of the source document. You can add annotations to the dump file: if there is a document viewer open in the main resource viewer area (see section 3.6), then any annotations that are selected (i.e. are visible in the table at the bottom of the viewer) will be included in the output dump. This is the best way to use the system to add markup based on some analysis process: select those annotations in the document viewer, save preserving format and you will have a file identical to the original source document with just the annotations you selected added. By default, the added annotations will contain no feature data; if you want the process to also dump features, set the ‘Include annotation features...’ option in the advanced options dialogue (see section 3.7). Note that GATE’s model of annotation allows graph structures, which are difficult to represent in XML (XML is a tree-structured representation format). During the dump process, annotations that cross each other in ways that can’t be represented straightforwardly in XML will be discarded, and a warning message printed.

7Gory details: features of annotations and documents in GATE may be virtually any Java object; serialising arbitrary binary data to XML is not simple; instead we serialise them as strings, and therefore they will be re-loaded as strings.

3.31 [D] Stop GUI ‘Freezing’ on Linux

There is a problem with some versions of Linux that causes the GUI to appear to freeze. The problem occurs when you take some action, like loading a resource or browsing for a file, that pops up a dialogue box. This box sometimes fails to appear in a visible area of the screen, at which point the rest of the GUI waits for you to do something intelligent with the dialogue box, while you wait for the GUI to do something. This is an excellent feature for those without tight deadlines to meet, and the best solution is to stop work and go home for a long while. Alternatively, you can play ‘hunt the dialogue box’.

This feature is available totally free of charge.

3.32 [D] Stop GUI Crashing on Linux

On some configurations of Red Hat 7.0 the GUI crashes on startup. The solution is to limit the initial stack size prior to launch: ulimit -s 2048.

3.33 [D] Stop GATE Restoring GUI Sessions/Options

GATE will remember GUI options and the state of the resource tree when it exits. The options are saved by default; the session state is not saved by default. This default behaviour can be changed from the “Advanced” tab of the “Configuration” choice on the “Options” menu.

If a problem occurs and the saved data prevents GATE from starting, you can fix it by deleting the configuration and session data files. These are stored in your home directory, and are called gate.xml and gate.session or .gate.xml and .gate.session depending on platform. On Windoze your home is:

95, 98, NT: Windows Directory/profiles/username

2000, XP: Windows Drive/Documents and Settings/username

3.34 Work with Unicode

GATE provides various facilities for working with Unicode beyond those that come by default with Java8:

1. a Unicode editor with input methods for many languages;

2. use of the input methods in all places where text is edited in the GUI;

3. a development kit for implementing input methods;

4. ability to read diverse character encodings.

1 using the editor: In the GUI, select ‘Unicode editor’ from the ‘Tools’ menu. This will display an editor window, and, when a language with a custom input method is selected for input (see next section), a virtual keyboard window with the characters of the language assigned to the keys on the keyboard. You can enter data either by typing as normal, or with mouse clicks on the virtual keyboard.

2 configuring input methods: In the editor and in GATE’s main window, the ‘Options’ menu has an ‘Input methods’ choice. All supported input languages (a superset of the JDK languages) are available here. Note that you need to use a font capable of displaying the language you select. By default GATE will choose a Unicode font if it can find one on the platform you’re running on. Otherwise, select a font manually from the ‘Options’ menu ‘Configuration’ choice.

3 using the development kit: GUK, the GATE Unicode Kit, is documented at http://gate.ac.uk/gate/doc/javadoc/guk/package-summary.html.

4 reading different character encodings: When you create a document from a URL pointing to textual data in GATE, you have to tell the system what character encoding the text is stored in. By default, GATE will set this parameter to be the empty string. This tells Java to use the default encoding for whatever platform it is running on at the time – e.g. on Western versions of Windoze this will be ISO-8859-1, and Eastern ones ISO-8859-9. A popular way to store Unicode documents is in UTF-8, which is a superset of ASCII (but can still store all Unicode data); if you get an error message about document I/O during reading, try setting the encoding to UTF-8, or some other locally popular encoding. (To see a list of available encodings, try opening a document in GATE’s unicode editor – you will be prompted to select an encoding.)

8Implemented by Valentin Tablan, Mark Leisher and Markus Kramer. Initial version developed by Mark Leisher.
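From the framework, the encoding can be set explicitly when the document is created – a sketch, assuming the parameter-name constants defined on gate.Document and an example URL:

FeatureMap params = Factory.newFeatureMap();
// point the document at its source and force a UTF-8 read
params.put(Document.DOCUMENT_URL_PARAMETER_NAME,
           new URL("http://gate.ac.uk/"));
params.put(Document.DOCUMENT_ENCODING_PARAMETER_NAME, "UTF-8");
Document doc = (Document)
    Factory.createResource("gate.corpora.DocumentImpl", params);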

3.35 Work with Oracle and PostgreSQL

GATE’s Oracle layer is documented separately in http://gate.ac.uk/gate/doc/persistence.pdf. Note that running an Oracle installation is not for the faint-hearted!

GATE version 2.2 has been adapted to work with Postgres 7.3. The compatibility with PostgreSQL 7.2 has been preserved. Since version 7.3 the Postgres server doesn’t downcast from int4 to int2 automatically. However, the JDBC drivers seem to have a bug and send the SMALLINT (aka INT2) parameters as INT (aka INT4). This causes some stored procedures (i.e. all that have input parameters of type INT2) not to be recognised. We have fixed this problem by modifying the stored procedures to expose the parameters as INT4 and to manually downcast them inside the stored procedure body.

Please note also the following:

PostgreSQL 7.3 refuses to index values larger than 8Kb/3 (2730 bytes). The previous versions probably did the same but without raising an exception.

The only case when such a situation can occur in GATE is when a feature has a TEXTUAL value larger than 2730 bytes. This will be signalled by an exception being raised about the value being too large for the index.

To ”solve” this, one can remove the index on the ft_character_value field of the t_feature table. This will have the usual effects of removing an index (the inability to perform efficient searches).

Chapter 4

CREOLE: the GATE Component Model

. . . Noam Chomsky’s answer in Secrets, Lies and Democracy (David Barsamian 1994; Odonian) to “What do you think about the Internet?” “I think that there are good things about it, but there are also aspects of it that concern and worry me. This is an intuitive response – I can’t prove it – but my feeling is that, since people aren’t Martians or robots, direct face-to-face contact is an extremely important part of human life. It helps develop self-understanding and the growth of a healthy personality. “You just have a different relationship to somebody when you’re looking at them than you do when you’re punching away at a keyboard and some symbols come back. I suspect that extending that form of abstract and remote relationship, instead of direct, personal contact, is going to have unpleasant effects on what people are like. It will diminish their humanity, I think.” Chomsky, quoted at http://photo.net/wtr/dead-trees/53015.htm.

The GATE architecture is based on components: reusable chunks of software with well-defined interfaces that may be deployed in a variety of contexts. The design of GATE is based on an analysis of previous work on infrastructure for LE, and of the typical types of software entities found in the fields of NLP and CL (see in particular chapters 4–6 of [Cunningham 00]). Our research suggested that a profitable way to support LE software development was an architecture that breaks down such programs into components of various types. Because LE practice varies very widely (it is, after all, predominantly a research field), the architecture must avoid restricting the sorts of components that developers can plug into the infrastructure. The GATE framework accomplishes this via an adapted version of the Java Beans component framework from Sun. Section 4.2 describes Java’s component model, Java Beans; section 4.3 describes GATE’s extended Beans model.


GATE components may be implemented by a variety of programming languages and databases, but in each case they are represented to the system as a Java class. This class may do nothing other than call the underlying program, or provide an access layer to a database; on the other hand it may implement the whole component.

GATE components are one of three types:

• LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;

• ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers;

• VisualResources (VRs) represent visualisation and editing components that participate in GUIs.

Section 4.4 discusses the distinction between Language Resources and Processing Resources. Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering.

In the rest of this chapter:

• section 4.5 describes the lifecycle of GATE components;

• section 4.6 describes how Processing Resources can be grouped into applications;

• section 4.7 describes the relationship between Language Resources and their data stores;

• section 4.8 summarises GATE’s set of built-in components.

4.1 The Web and CREOLE

GATE allows resource implementations and Language Resource persistent data to be distributed over the Web, and uses XML for configuration of resources (and GATE itself).

Resource implementations are stored at a URL (when the resources are in the local file system this can be a file:/ URL). When the URL is given to GATE the creole.xml component configuration file is sucked down the pipe and the resource information added to the CREOLE register. When a user requests an instantiation of a resource, the class files are sucked up too, and an object created in the local virtual machine.

Language resource data can be stored in binary serialised form in the local file system, or in an RDBMS like Oracle. In the latter case, communication with the database is over JDBC1, allowing the data to be located anywhere on the network (or anywhere you can get Oracle running, that is!).

1The Java DataBase Connectivity layer.

4.2 Java Beans: a Simple Component Architecture

All GATE resources are Java Beans, the Java platform’s model of software components. Beans are simply Java classes that obey certain interface conventions. These conventions allow development tools such as GATE, or Borland JBuilder, to manipulate software components without knowing very much about them. The advantage of this is that users of such systems can extend them in diverse ways without having to touch the underlying core of the development tools.

The key parts of the Java Beans specification as used in GATE are:

• accessor and mutator methods for data members are named after those members plus get and set (meaning that the tool can figure out how to use a member, or property, of a bean, from information provided by Java reflection);

• beans must have no-argument constructors (so that tools can construct instances of beans without knowing about complex initialisation parameters).
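A minimal sketch of a class following these conventions (the class and property names are invented for illustration):

// a bean-style class: a no-argument constructor plus get/set pairs
public class MyResource {
  private String encoding;

  public MyResource() {}  // no-argument constructor

  public String getEncoding() {  // accessor for the 'encoding' property
    return encoding;
  }

  public void setEncoding(String encoding) {  // mutator for 'encoding'
    this.encoding = encoding;
  }
}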

The rest of this section says a little more about the Beans specification; skip to the next if you’re only interested in how it works in GATE.

Quoting from [Campione et al. 98] at Sun’s Java website:

The JavaBeans API makes it possible to write component software in the Java programming language. Components are self-contained, reusable software units that can be visually composed into composite components, applets, applications, and servlets using visual application builder tools. JavaBean components are known as Beans.

In this context we may think of the GATE development environment as a ‘builder tool’. While the emphasis in the quoted text is on visual representation of components, note that GATE (and other) beans can also be plugged together ‘invisibly’; this is what the framework does and how GATE beans are typically deployed into other applications.

Components expose their features (for example, public methods and events) to builder tools for visual manipulation. A Bean’s features are exposed because feature names adhere to specific design patterns. A JavaBeans-enabled builder tool can then examine the Bean’s patterns, discern its features, and expose those features for visual manipulation. A builder tool maintains Beans in a palette or toolbox. You can select a Bean from the toolbox, drop it into a form, modify its appearance and behavior, define its interaction with other Beans, and compose it and other Beans into an applet, application, or new Bean. All this can be done without writing a line of code.

In GATE you develop sets of beans that do language processing tasks and then the framework wires them together without any code from you.

• Builder tools discover a Bean’s features (that is, its properties, methods, and events) by a process known as introspection. Beans support introspection in two ways:

– By adhering to specific rules, known as design patterns, when naming Bean features. The Introspector class examines Beans for these design patterns to discover Bean features. The Introspector class relies on the core reflection API. . . .

The next section describes GATE’s extended beans model.

4.3 The GATE Framework

We can think of the GATE framework as a backplane into which beans-based CREOLE components plug. The user gives the system a list of URLs to search when it starts up, and components at those locations are loaded by the system. (To be precise, only their configuration data is loaded to begin with; the actual classes are loaded when the user requests the instantiation of a resource.)

The backplane performs these functions:

• component discovery, bootstrapping, loading and reloading;

• management and visualisation of native data structures for common information types;

• generalised data storage and process execution.

A set of components plus the framework is a deployment unit which can be embedded in another application.

The key task of the development environment is to facilitate constructing components, and viewing and measuring their results.

4.4 Language Resources and Processing Resources

This section describes in more detail the Language Resource and Processing Resource terminology introduced earlier. If you’re happy with these terms you can safely skip this section.

Like other software, LE programs consist of data and algorithms. The current orthodoxy in software development is to model both data and algorithms together, as objects2. Systems that adopt the new approach are referred to as Object-Oriented (OO), and there are good reasons to believe that OO software is easier to build and maintain than other varieties [Booch 94, Yourdon 96].

In the domain of human language processing R&D, however, the terminology is a little more complex. Language data, in various forms, is of such significance in the field that it is frequently worked on independently of the algorithms that process it. For example: a treebank3 can be developed independently of the parsers that may later be trained from it; a thesaurus can be developed independently of the query expansion or sense tagging mechanisms that may later come to use it. This type of data has come to have its own term, Language Resources (LRs) [LREC-1 98], covering many data sources, from lexicons to corpora.

In recognition of this distinction, we will adopt the following terminology:

Language Resource (LR): refers to data-only resources such as lexicons, corpora, thesauri or ontologies. Some LRs come with software (e.g. Wordnet has both a user query interface and C and Prolog APIs), but where this is only a means of accessing the underlying data we will still define such resources as LRs.

Processing Resource (PR): refers to resources whose character is principally programmatic or algorithmic, such as lemmatisers, generators, translators, parsers or speech recognisers. For example, a part-of-speech tagger is best characterised by reference to the process it performs on text. PRs typically include LRs, e.g. a tagger often has a lexicon; a word sense disambiguator uses a dictionary or thesaurus.

Additional terminology worthy of note in this context: language data refers to LRs which are at their core examples of language in practice, or ‘performance data’, e.g. corpora of texts or speech recordings (possibly including added descriptive information as markup); data about language refers to LRs which are purely descriptive, such as a grammar or lexicon.

PRs can be viewed as algorithms that map between different types of LR, and which typically use LRs in the mapping process. An MT engine, for example, maps a monolingual corpus into a multilingual aligned corpus using lexicons, grammars, etc.4

2Older development methods like Jackson Structured Design [Jackson 75] or Structured Analysis [Yourdon 89] kept them largely separate.
3A corpus of texts annotated with syntactic analyses.
4This point is due to Wim Peters.

Further support for the PR/LR terminology may be gleaned from the argument in favour of declarative data structures for grammars, knowledge bases, etc. This argument was current in the late 1980s and early 1990s [Gazdar & Mellish 89], partly as a response to what has been seen as the overly procedural nature of previous techniques such as augmented transition networks. Declarative structures represent a separation between data about language and the algorithms that use the data to perform language processing tasks; a similar separation to that used in GATE.

Adopting the PR/LR distinction is a matter of conforming to established domain practice and terminology. It does not imply that we cannot model the domain (or build software to support it) in an Object-Oriented manner; indeed the models in GATE are themselves Object-Oriented.

4.5 The Lifecycle of a CREOLE Resource

CREOLE resources exhibit a variety of forms depending on the perspective they are viewed from. Their implementation is as a Java class plus an XML metadata file living at the same URL. When using the development environment, resources can be loaded and viewed via the resources tree (left pane) and the ”create resource” mechanism. When programming with the framework, they are Java objects that are obtained by making calls to GATE’s Factory class. These various incarnations are the phases of a CREOLE resource’s ‘lifecycle’. Depending on what sort of task you are using GATE for, you may use resources in any or all of these phases. For example, you may only be interested in getting a graphical view of what GATE’s ANNIE Information Extraction system (see chapter 8) does; in this case you will use the GUI to load the ANNIE resources, load a document, create an ANNIE application and run it on the document. If, on the other hand, you want to create your own resources, or modify the Java code of an existing resource (as opposed to just modifying its grammar, for example), you will need to deal with all the lifecycle phases.

The various phases may be summarised as:

Creating a new resource from scratch (bootstrapping). To create the binary image of a resource (a Java class in a JAR file), and the XML file that describes the resource to GATE, you need to create the appropriate .java file(s), compile them and package them as a .jar. The GATE development environment provides a bootstrap tool to start this process – see section 3.9. Alternatively you can simply copy code from an existing resource.

Instantiating a resource in the framework. To create a resource in your own Java code, use GATE's Factory class (this takes care of parameterising the resource, restoring it from a database where appropriate, etc.). Section 3.10 describes how to do this.

Loading a resource in the development environment. To load a resource in the development environment, use the various “New ... resource” options from the File menu and elsewhere. See section 3.11.

Resource configuration and implementation. GATE’s bootstrap tool will create an empty resource that does nothing. In order to achieve the behaviour you require, you’ll need to change the configuration of the resource (by editing the creole.xml file) and/or change the Java code that implements the resource. See section 3.12.

More details of the specifics of tasks related to these phases are available in chapter 3.
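As an illustration of the instantiation phase, here is a minimal sketch of creating a document through the Factory (the URL is illustrative; see section 3.10 for the full story):

import java.net.URL;
import gate.*;

public class FactoryExample {
  public static void main(String[] args) throws Exception {
    Gate.init();  // initialise the GATE framework (once per JVM)
    // parameters for the new resource
    FeatureMap params = Factory.newFeatureMap();
    params.put(Document.DOCUMENT_URL_PARAMETER_NAME, new URL("http://gate.ac.uk/"));
    // the Factory instantiates, parameterises and registers the resource
    Document doc = (Document)
        Factory.createResource("gate.corpora.DocumentImpl", params);
    System.out.println(doc.getName());
  }
}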

4.6 Processing Resources and Applications

PRs can be combined into applications. Applications model a control strategy for the execution of PRs. In the framework, applications are called ‘controllers’ accordingly.

Currently only sequential, or pipeline, execution is supported. There are two types of pipeline:

Simple pipelines group a set of PRs together in order and execute them in turn. The implementing class is called SerialController.

Corpus pipelines are specific to LanguageAnalysers – PRs that are applied to documents and corpora. A corpus pipeline opens each document in the corpus in turn, sets that document as a runtime parameter on each PR, runs all the PRs on that document, then closes the document. The implementing class is called SerialAnalyserController.
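In code, assembling and running a corpus pipeline looks roughly like this (a sketch, assuming a ProcessingResource tokeniser and a Corpus corpus have already been created via the Factory):

// create an empty corpus pipeline controller
SerialAnalyserController pipeline = (SerialAnalyserController)
    Factory.createResource("gate.creole.SerialAnalyserController");
pipeline.add(tokeniser);      // PRs run in the order they are added
pipeline.setCorpus(corpus);   // the corpus whose documents will be processed
pipeline.execute();           // run every PR over every document in turn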

Conditional versions of these controllers are also available. These allow processing resources to be run conditionally on document features. See Section 3.14 for how to use these.

All controllers have special handling for processing resources that implement the interface gate.creole.ControllerAwarePR. This interface provides methods that are called by the controller at the start and end of the whole application's execution – for a corpus pipeline, this means before any document has been processed and after all documents in the corpus have been processed, which is useful for PRs that need to share data structures across the whole corpus, build aggregate statistics, etc. For full details, see the JavaDoc documentation for ControllerAwarePR.
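For orientation, the interface has the following shape (method names as per the JavaDoc referenced above):

public interface ControllerAwarePR extends ProcessingResource {
  // called by the controller before the first PR of the application runs
  public void controllerExecutionStarted(Controller c) throws ExecutionException;
  // called after the last PR of the application has finished
  public void controllerExecutionFinished(Controller c) throws ExecutionException;
  // called if execution is aborted by an exception in one of the PRs
  public void controllerExecutionAborted(Controller c, Throwable t)
      throws ExecutionException;
}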

4.7 Language Resources and Datastores

Language Resources can be stored in Data Stores. Data Stores are an abstract model of disk-based persistence, which can be implemented by various types of storage mechanism. Currently two such mechanisms are implemented:

Serial Data Stores are based on Java’s serialisation system, and store data directly into files and directories.

Oracle Data Stores store data into an Oracle RDBMS. For details of how to set up an Oracle DB for GATE, see http://gate.ac.uk/gate/doc/persistence.pdf.

PostgreSQL Data Stores store data into a PostgreSQL RDBMS. For details of how to set up a PostgreSQL DB for GATE, see http://gate.ac.uk/gate/doc/persistence.pdf.
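In the framework, creating a serial datastore and persisting a document into it looks roughly like this (a sketch; the storage path is illustrative, and doc is an already-loaded Document):

// create a new serial datastore in an empty directory
DataStore ds = Factory.createDataStore(
    "gate.persist.SerialDataStore", "file:/tmp/myDataStore");
// adopt the document into the datastore (null = no security info),
// then sync to write it to disk
Document persistedDoc = (Document) ds.adopt(doc, null);
ds.sync(persistedDoc);
ds.close();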

4.8 Built-in CREOLE Resources

GATE comes with various built-in components:

• Language Resources modelling Documents and Corpora, and various types of Annotation Schema – see chapter 6.

• Processing Resources that are part of the ANNIE system – see chapter 8.

• Visual Resources for viewing and editing corpora, annotations, etc.

• Other miscellaneous resources – see chapter 9.

Contributions to further developments gratefully received (unmarked low-denomination notes preferred). Bugs to [email protected].

Chapter 5

Visual CREOLE

...neurobiologists still go on openly studying reflexes and looking under the hood, not huddling passively in the trenches. Many of them still keep wondering: how does the inner life arise? Ever puzzled, they oscillate between two major fictions: (1) The brain can be understood; (2) We will never come close. Meanwhile they keep pursuing brain mechanisms, partly from habit, partly out of faith. Their premise: The brain is the organ of the mind. Clearly, this three-pound lump of tissue is the source of our “insight information” about our very being. Somewhere in it there might be a few hidden guidelines for better ways to lead our lives. Zen and the Brain, James H. Austin, 1998 (p. 6).

This chapter details the other visual resources that can be used in GATE. While these tools were not included as part of earlier releases of GATE, as of GATE version 3.0 they are included as part of the standard release, and are now open source. GAZE, Ontogazetteer and the Protégé VR for GATE were all developed by Ontotext, who should be contacted for further information about these components.

5.1 Gazetteer Visual Resource - GAZE

Gaze is a tool for editing gazetteer lists, definitions and mappings to ontologies. It is suitable both for Plain/Linear Gazetteers (the Default and Hash Gazetteers) and for Ontology-enabled Gazetteers (the OntoGazetteer). The Gazetteer PR associated with the viewer is reinitialised every time a save operation is performed. Note that GAZE does not scale up to very large lists (we suggest not using it to view lists of over 40,000 entries, and not pasting in more than 10,000 entries at a time).


5.1.1 Running Modes

The running mode depends on the type of gazetteer loaded in the VR. The mode in which Linear/Plain Gazetteers are loaded is called Linear/Plain Mode. In this mode, the Linear Definition is displayed in the left pane, and the Gazetteer List is displayed in the right pane. The Extended/Ontology/Mapping mode is on when the displayed gazetteer is ontology-aware, which means that there exists a mapping between classes in the ontology and lists of phrases. Two more panes are displayed when in this mode. At the top of the left-most pane there is a tree view of the ontology hierarchy, and at the bottom the mapping definition is displayed.

5.1.2 Loading a Gazetteer

To load a gazetteer into the viewer it is necessary to associate the Gaze VR with the gazetteers. Afterwards, whenever a gazetteer PR is loaded, Gaze will appear on double-click over the gazetteer in the Processing Resources branch of the Resources Tree.

5.1.3 Linear Definition Pane

This pane displays the nodes of the linear definition, and allows manipulation of the whole definition as a file, as well as the single nodes. Whenever a gazetteer list is modified, its node in the linear definition is coloured in red.

5.1.4 Linear Definition Toolbar

All the functionality explained in this section (New, Load, Save, Save As) is also accessible via File / Linear Definition in the menu bar of Gaze.

New – Pressing New invokes a file dialog where the location of the new definition is specified.

Load – Pressing Load invokes a file dialog, and after locating the new definition it is loaded by pressing Open.

Save – Pressing Save saves the definition to the location from which it has been read.

Save As – Pressing Save As allows another location to be chosen, and the definition saved there.

5.1.5 Operations on Linear Definition Nodes

Double-click node – Double-clicking on a definition node forces the displaying of the gazetteer list of the node in the right-most pane of the viewer.

Insert – On right-click over a node and choosing Insert, a dialog is displayed, requesting List, Major Type, Minor Type and Languages. The mandatory fields are List and Major Type. After pressing OK, a new linear node is added to the definition.

Remove – On right-click over a node and choosing Remove, the selected linear node is removed from the definition.

Edit – On right-click over a node and choosing Edit a dialog is displayed allowing changes of the fields List, Major Type, Minor Type and Languages.

5.1.6 Gazetteer List Pane

The gazetteer list pane has a toolbar with buttons similar to the linear definition's (New, Load, Save, Save As). They work as their names suggest, as explained in the Linear Definition Pane section, and are also accessible from File / Gazetteer List in the menu bar of Gaze. The only addition is Save All, which saves all modified gazetteer lists. Editing a gazetteer list is as simple as editing a text file. One can use Ctrl+A to select the whole list, Ctrl+C to copy the selection, Ctrl+V to paste it, Del to delete the selected text or a single character, etc.

5.1.7 Mapping Definition Pane

The mapping definition is displayed one mapping node per row. Each node consists of a gazetteer list, an ontology URL, and a class id. The content of the gazetteer list in the node is accessible through double-clicking; it is displayed in the Gazetteer List Pane. The toolbar allows the creation of a new definition (New), the loading of an existing one (Load), and saving to the same or a new location (Save/Save As). The functionality of the toolbar buttons is also available via File.

5.2 Ontogazetteer

The Ontogazetteer, or Hierarchical Gazetteer, is an interface which makes ontologies “visible” in GATE, supporting basic methods for hierarchy management and traversal. In GATE, an ontology is represented at the same level as a document, and has nodes called classes (for consistency with RDF(S) and DAML+OIL, though they are really just types). The OntoGazetteer assigns classes rather than major or minor types, and is aware of mappings between lists and class IDs. There are two Visual Resources: one for editing the standard gazetteer lists (including the definition files and the mappings to the ontology), and one for editing the ontology itself.

5.2.1 Gazetteer Lists Editor and Mapper

This is a VR for editing the gazetteer lists, and mapping them to classes in an ontology. It provides load/store/edit for the lists, load/store/edit for the mapping information, loading of ontologies, load/store/edit for the linear definition file, and mapping of the lists file to the major type, minor type and language.

Left pane: A single ontology is visualised in the left pane of the VR. The mapping between a list and a class is displayed by showing the list as a subclass with a different icon. The mapping is specified by drag and drop from the linear definition pane (in the middle) and/or by the right-click menu.

Middle pane: The middle pane displays the nodes/lines in the linear definition file. By double-clicking on a node the corresponding list is opened. Editing of the line/node is done by right-clicking and choosing Edit: a dialogue appears (lower part of the scheme) allowing the modification of the members of the node.

Right pane: In the right pane a single gazetteer list is displayed. It can be edited and parts of it can be cut/copied/pasted.

5.2.2 Ontogazetteer Editor

This is a VR for editing the class hierarchy of an ontology. It provides loading from and storing to RDF/RDFS, and load/edit/store of the class hierarchy.

Left pane: The various ontologies loaded are listed here. On double-click, or on right-click and choosing Edit from the menu, the ontology is visualised in the right pane.

Right pane: Besides the visualization of the class hierarchy of the ontology the following operations are allowed:

• expanding/collapsing parts of the ontology

• adding a class to the hierarchy: by right-clicking on the intended parent of the new class and choosing add sub class

• removing a class: via right-clicking on the class and choosing remove

As a result of using this VR, the ontology definition file is altered.

5.3 The Co-reference Editor

The co-reference editor allows co-reference chains (see section 8.7) to be displayed and edited in the GATE GUI. To display the co-reference editor, first open a document in GATE, and then click on the Co-reference Editor button in the document viewer.

The combo box at the top of the co-reference editor allows you to choose which annotation set to display co-references for. If an annotation set contains no co-reference data, then the tree below the combo box will just show ‘Coreference Data’ and the name of the annotation set. However, when co-reference data does exist, a list of all the co-reference chains that are based on annotations in the currently selected set is displayed. The name of each co-reference chain in this list is the same as the text of whichever element in the chain is the longest. It is possible to highlight all the member annotations of any chain by selecting it in the list.

When a co-reference chain is selected, if the mouse is placed over one of its member annota- tions, then a pop-up box appears, giving the user the option of deleting the item from the chain. If the only item in a chain is deleted, then the chain itself will cease to exist, and it will be removed from the list of chains. If the name of the chain was derived from the item that was deleted, then the chain will be given a new name based on the next longest item in the chain.

A combo box near the top of the co-reference editor allows the user to select an annotation type from the current set. When the Show button is selected all the annotations of the selected type will be highlighted. Now when the mouse pointer is placed over one of those annotations, a pop-up box will appear giving the user the option of adding the annotation to a co-reference chain. The annotation can be added to an existing chain by typing the name of the chain (as shown in the list on the right) in the pop-up box. Alternatively, if the user presses the down cursor key, a list of all the existing chains appears, together with the option [New Chain]. Selecting the [New Chain] option will cause a new chain to be created containing the selected annotation as its only element.

Each annotation can only be added to a single chain, but annotations of different types can be added to the same chain, and the same text can appear in more than one chain if it is referenced by two or more annotations.

Chapter 6

Language Resources: Corpora, Documents and Annotations

Sometimes in life you’ve got to dance like nobody’s watching. ... I think they should introduce ’sleeping’ to the Olympics. It would be an excellent field event, in which the ’athletes’ (for want of a better word) all lay down in beds, just beyond where the javelins land, and the first one to fall asleep and not wake up for three hours would win gold. I, for one, would be interested in seeing what kind of personality would be suited to sleeping in a competitive environment. ... Life is a mystery to be lived, not a problem to be solved. Round Ireland with a Fridge, Tony Hawks, 1998 (pp. 119, 147, 179).

This chapter documents GATE's model of corpora, documents and annotations on documents. Section 6.1 describes the simple attribute/value data model that corpora, documents and annotations all share. Section 6.2, section 6.3 and section 6.4 describe corpora, documents and annotations on documents respectively. Section 6.5 describes GATE's support for diverse document formats, and section 6.6 describes facilities for XML input/output.

6.1 Features: Simple Attribute/Value Data

GATE has a single model for information that describes documents, collections of documents (corpora), and annotations on documents, based on attribute/value pairs. Attribute names are strings; values can be any Java object. The API for accessing this feature data is Java's Map interface (part of the Collections API).
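For example (the feature names here are arbitrary):

FeatureMap fm = Factory.newFeatureMap();  // implements java.util.Map
fm.put("author", "unknown");              // attribute names are Strings...
fm.put("wordCount", new Integer(42));     // ...values can be any Java object
doc.setFeatures(fm);                      // attach the features to a document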

6.2 Corpora: Sets of Documents plus Features

A Corpus in GATE is a Java Set whose members are Documents. Both Corpora and Documents are types of LanguageResource (LR); all LRs have a FeatureMap (a Java Map) associated with them that stores attribute/value information about the resource. FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via the annotation model (see below).

Documents have a DocumentContent which is a text at present (future versions may add support for audiovisual content) and one or more AnnotationSets which are Java Sets.

6.3 Documents: Content plus Annotations plus Features

Documents are modelled as content plus annotations (see section 6.4) plus features (see section 6.1). The content of a document can be any subclass of DocumentContent.

6.4 Annotations: Directed Acyclic Graphs

Annotations are organised in graphs, which are modelled as Java sets of Annotation. Annotations may be considered as the arcs in the graph; they have a start Node and an end Node, an ID, a type and a FeatureMap. Nodes have pointers into the source document, e.g. character offsets.
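A sketch of walking the default annotation graph of a document doc via the API:

java.util.Iterator it = doc.getAnnotations().iterator();
while(it.hasNext()) {
  Annotation a = (Annotation) it.next();
  Long start = a.getStartNode().getOffset();  // character offset of the start node
  Long end = a.getEndNode().getOffset();      // character offset of the end node
  System.out.println(a.getId() + " " + a.getType()
      + " [" + start + "," + end + ") " + a.getFeatures());
}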

6.4.1 Annotation Schemas

Annotation schemas provide a means to define types of annotations in GATE. GATE uses the XML Schema language supported by W3C for these definitions. When using the development environment to create/edit annotations, a component is available (gate.gui.SchemaAnnotationEditor) which is driven by an annotation schema file. This component will constrain the data entry process to ensure that only annotations that correspond to a particular schema are created. (Another component allows unrestricted annotations to be created.)

Schemas are resources just like other GATE components. Below we give some examples of such schemas. Section 3.19 describes how to create new schemas.

Date schema

Person schema

Address schema
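The XML for these schemas follows the W3C XML Schema conventions used by GATE. As an indicative sketch (the attribute names and enumeration values here are illustrative, not the exact shipped schemas), a Date schema might look like this:

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <!-- define an annotation type called Date -->
  <element name="Date">
    <complexType>
      <!-- an optional feature called kind, restricted to a set of values -->
      <attribute name="kind" use="optional">
        <simpleType>
          <restriction base="string">
            <enumeration value="date"/>
            <enumeration value="time"/>
            <enumeration value="dateTime"/>
          </restriction>
        </simpleType>
      </attribute>
    </complexType>
  </element>
</schema>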


6.4.2 Examples of Annotated Documents

This section shows some simple examples of annotated documents.

This material is adapted from [Grishman 97], the TIPSTER Architecture Design document upon which GATE version 1 was based. Version 2 has a similar model, although annotations are now graphs, and instead of multiple spans per annotation each annotation now has a single start/end node pair. The current model is largely compatible with [Bird & Liberman 99], and roughly isomorphic with “stand-off markup” as latterly adopted by the SGML/XML community.

Each example is shown in the form of a table. At the top of the table is the document being annotated; immediately below the line with the document is a ruler showing the position (byte offset) of each character. (NOTE: the ruler doesn't scale very well in HTML; for a better picture see the original TIPSTER Architecture Design document.)

Underneath this appear the annotations, one annotation per line. For each annotation we show its Id, Type, Span (start/end offsets derived from the start/end nodes) and Features. Integers are used as the annotation Ids. The features are shown in the form name = value.

The first example shows a single sentence and the result of three annotation procedures: tokenization with part-of-speech assignment, name recognition, and sentence boundary recognition. Each token has a single feature, its part of speech (pos), using the tag set from the University of Pennsylvania Tree Bank; each name also has a single feature, indicating the type of name: person, company, etc.

Annotations will typically be organized to describe a hierarchical decomposition of a text. A simple illustration would be the decomposition of a sentence into tokens. A more complex case would be a full syntactic analysis, in which a sentence is decomposed into a noun phrase and a verb phrase, a verb phrase into a verb and its complement, etc. down to the level of individual tokens. Such decompositions can be represented by annotations on nested sets of spans. Both of these are illustrated in the second example, which is an elaboration of our first example to include parse information. Each non-terminal node in the parse tree is represented by an annotation of type parse.

Text:
Cyndi savored the soup.
|0...|5...|10..|15..|20

Annotations:
Id  Type      SpanStart  SpanEnd  Features
1   token     0          5        pos=NP
2   token     6          13       pos=VBD
3   token     14         17       pos=DT
4   token     18         22       pos=NN
5   token     22         23
6   name      0          5        name_type=person
7   sentence  0          23

Table 6.1: Result of annotation on a single sentence

Text:
Cyndi savored the soup.
|0...|5...|10..|15..|20

Annotations:
Id  Type      SpanStart  SpanEnd  Features
1   token     0          5        pos=NP
2   token     6          13       pos=VBD
3   token     14         17       pos=DT
4   token     18         22       pos=NN
5   token     22         23
6   name      0          5        name_type=person
7   sentence  0          23       constituents=[1],[2],[3],[4],[5]

Table 6.2: Result of annotations including parse information Developing Language Processing Components with GATE 91

Text:
To: All Barnyard Animals
|0...|5...|10..|15..|20
From: Chicken Little
|25...|30...|35..|40..|45
Date: November 10,1194
|50...|55...|60..|65..
Subject: Descending Firmament
|70...|75...|80..|85..|90..|95..
Priority : Urgent.
|100...|105...|110..|115..
The sky is falling. The sky is falling.
|120...|125...|130..|135..|140...|145..|150.

Annotations:
Id  Type       SpanStart  SpanEnd  Features
1   Addressee  4          24
2   Source     31         45
3   Date       53         69       ddmmyy=101194
4   Subject    78         98
5   Priority   109        115
6   Body       116        155
7   Sentence   116        135
8   Sentence   136        155

Table 6.3: Annotation showing overall document structure

In most cases, the hierarchical structure could be recovered from the spans. However, it may be desirable to record this structure directly through a constituents feature whose value is a sequence of annotations representing the immediate constituents of the initial annotation. For the annotations of type parse, the constituents are either non-terminals (other annotations in the parse group) or tokens. For the sentence annotation, the constituents feature points to the constituent tokens. A reference to another annotation is represented in the table as “[Annotation Id]”; for example, “[3]” represents a reference to annotation 3. Where the value of a feature is a sequence of items, these items are separated by commas. No special operations are provided in the current architecture for manipulating constituents. At a less esoteric level, annotations can be used to record the overall structure of documents, including in particular documents which have structured headers, as is shown in the third example (Table 6.3).

If the Addressee, Source, ... annotations are recorded when the document is indexed for retrieval, it will be possible to perform retrieval selectively on information in particular fields. Our final example (Table 6.4) involves an annotation which effectively modifies the document. The current architecture does not make any specific provision for the modification

Text:
Topster tackles 2 terrorbytes.
|0...|5...|10..|15..|20..|25..

Annotations:
Id  Type   SpanStart  SpanEnd  Features
1   token  0          7        pos=NP correction=TIPSTER
2   token  8          15       pos=VBZ
3   token  16         17       pos=CD
4   token  18         29       pos=NNS correction=terabytes
5   token  29         30

Table 6.4: Annotation modifying the document

of the original text. However, some allowance must be made for processes such as spelling correction. This information will be recorded as a correction feature on token annotations and possibly on name annotations, as shown in Table 6.4.

6.4.3 Creating, Viewing and Editing Diverse Annotation Types

Note that annotation types should consist of a single word with no spaces. Otherwise they may not be recognised by other components such as JAPE transducers, and may create problems when annotations are saved as inline (save preserving format).

To view and edit annotation types, see Section 3.15. To add annotations of a new type, see Section 3.18. To add a new annotation schema, see Section 3.19.

6.5 Document Formats

The following document formats are supported by GATE:

• Plain Text

• HTML

• SGML

• XML

• RTF

• Email

• PDF (some documents)

• Microsoft Word (some documents)

By default GATE will try to identify the type of the document, then strip and convert any markup into GATE's annotation format. To disable this process, set the markupAware parameter on the document to false.
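In code, the parameter can be set at creation time; a sketch (the URL is illustrative; the parameter name is the markupAware parameter described above):

FeatureMap params = Factory.newFeatureMap();
params.put(Document.DOCUMENT_URL_PARAMETER_NAME, new URL("file:/tmp/raw.html"));
// keep the markup in the text rather than converting it to annotations
params.put("markupAware", Boolean.FALSE);
Document doc = (Document)
    Factory.createResource("gate.corpora.DocumentImpl", params);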

When reading a document of one of these types, GATE extracts the text between tags (where such exist) and creates a GATE annotation filled as follows:

The name of the tag will constitute the annotation's type, all the tag's attributes will materialise in the annotation's features, and the annotation will span the text covered by the tag. A few exceptions to this rule apply for the RTF, Email and Plain Text formats, which will be described later in the input sections of those formats.

The text between tags is extracted and appended to the GATE document’s content and all annotations created from tags will be placed into a GATE annotation set named “Original markups”.
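From code, these annotations are available as a named annotation set; a sketch:

// fetch the markup annotations produced during format analysis
AnnotationSet originalMarkups =
    doc.getAnnotations(GateConstants.ORIGINAL_MARKUPS_ANNOT_SET_NAME);
AnnotationSet paragraphs = originalMarkups.get("p");  // e.g. HTML <p> elements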

Example:

If the markup is like this:

<aTagName attrib1="value1" attrib2="value2" attrib3="value3">A piece of text</aTagName>

then the annotation created by GATE will look like:

annotation.type = "aTagName";
annotation.fm = {attrib1=value1; attrib2=value2; attrib3=value3};
annotation.start = startNode;
annotation.end = endNode;

The startNode and endNode are created from offsets referring to the beginning and the end of “A piece of text” in the document's content.

The documents supported by GATE have to be in one of the encodings accepted by Java. The most popular is the “UTF-8” encoding, which is also the most storage-efficient one for UNICODE. If, when loading a document in GATE, the encoding parameter is set to “” (the empty string), then the default encoding of the platform will be used.

6.5.1 Detecting the right reader

In order to successfully apply the document creation algorithm described above, GATE needs to detect the proper reader to use for each document format. If the user knows in advance what kind of document they are loading then they can specify the MIME type (e.g. text/html) using the init parameter mimeType, and GATE will respect this. If an explicit type is not given, GATE attempts to determine the type by other means, taking into consideration (where possible) the information provided by three sources:

• Document’s extension

• The web server’s content type

• Magic numbers detection

The first represents the extension of a file (xml, htm, html, txt, sgm, rtf, etc.), the second represents the HTTP information sent by a web server regarding the content type of the document being sent (text/html, text/xml, etc.), and the third represents certain sequences of chars which are ultimately number sequences. GATE is capable of supporting multimedia documents, if the right reader is added to the framework. Sometimes, multimedia documents are identified by a signature consisting of a sequence of numbers. Inside GATE these are called magic numbers. For textual documents, certain char sequences form such magic numbers. Examples of magic number sequences will be provided in the Input section of each format supported by GATE.

All those tests are applied to each document read, and after that a voting mechanism decides the best reader to associate with the document. There is a degree of priority among the tests. The document's extension test has the highest priority: if the system is in doubt which reader to choose, then the one associated with the document's extension will be selected. The next highest priority is given to the web server's content type, and the third to magic numbers detection. However, any two tests that identify the same MIME type will together take priority in deciding the reader that will be used. The web server test is not always successful, as documents may be loaded from a local file system, and the magic numbers test is not always applicable. In the next paragraphs we will see how those tests are performed and what the general mechanism behind reader detection is.

The method that detects the proper reader is a static one, and it belongs to the gate.DocumentFormat class. It uses the information stored in the maps filled by the init() method of each reader. This method comes with three signatures:

static public DocumentFormat getDocumentFormat(gate.Document aGateDocument, URL url)

static public DocumentFormat getDocumentFormat(gate.Document aGateDocument, String fileSuffix)

static public DocumentFormat getDocumentFormat(gate.Document aGateDocument, MimeType mimeType)

The first two methods try to detect the right MimeType for the GATE document, and after that they call the third one to return the reader associated with that MimeType. Of course, if an explicit mimeType parameter was specified, GATE calls the third form of the method directly, passing the specified type. GATE uses the implementation from “http://jigsaw.w3.org” for MIME types.
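For example, to force a particular reader regardless of file extension or server headers, set the mimeType init parameter when creating the document (a sketch; the URL is illustrative):

FeatureMap params = Factory.newFeatureMap();
params.put(Document.DOCUMENT_URL_PARAMETER_NAME, new URL("file:/tmp/report.data"));
params.put("mimeType", "text/html");  // treat the file as HTML
Document doc = (Document)
    Factory.createResource("gate.corpora.DocumentImpl", params);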

The magic numbers test is performed using the information from the magic2mimeTypeMap map. Each key from this map is searched for in the first bufferSize (default 2048) chars of text. The method that does this is called runMagicNumbers(InputStreamReader aReader) and it belongs to the DocumentFormat class. More details about it can be found in the GATE API documentation.

In order to activate a reader to perform the unpacking, the CREOLE definition of a GATE document defines a parameter called “markupAware”, initialised with a default value of true. This parameter forces GATE to detect a proper reader for the document being read. If no reader is found, the document's content is loaded and presented to the user just as in any other text editor (this applies to textual documents).

The next subsections investigate the particularities of each format and describe the file extensions registered with each document format.

6.5.2 XML

Input

GATE permits the processing of any XML document and offers support for XML namespaces. It benefits from the power of Apache's Xerces parser and also makes use of Sun's JAXP layer. Changing the XML parser in GATE can be achieved by simply replacing the value of a Java system property (“javax.xml.parsers.SAXParserFactory”).

GATE will accept any well-formed XML document as input. Although it could validate XML documents against DTDs, it does not do so because the validating procedure is time-consuming and in many cases it issues messages that are annoying for the user.

There is an open problem with the general approach of reading XML, HTML and SGML documents in GATE. As we previously said, the text covered by tags/elements is appended to the GATE document content and a GATE annotation refers to this particular span of text. When appending, in cases where one element ends and the next begins without intervening whitespace (e.g. “...end.” immediately followed by “Start...”), the ending word of the previous annotation may be concatenated with the beginning word of the annotation currently being created, resulting in garbage input for GATE processing resources that operate at the text surface.

Let's take another example in order to better understand the problem:

<title>This is a title</title>
<p>This is a paragraph</p>
<a href="...">Here is a useful link</a>

When the markup is transformed to annotations, it is likely that the text from the document’s content will be as follows:

This is a titleThis is a paragraphHere is a useful link

The annotations created will refer to the right parts of the text, but for GATE's processing resources (tokeniser, gazetteer, etc.) which work on this text, this would be a major disaster. Therefore, in order to prevent this problem from happening, GATE checks whether it is about to join two words in this way and, if so, inserts a space between them. So the text will look like this after being loaded in GATE:

This is a title This is a paragraph Here is a useful link

There are cases when these words are meant to be joined, but they are just a few. This is why it’s an open problem.

The extensions associated with the XML reader are:

• xml
• xhtm
• xhtml

The web server content type associated with XML documents is: text/xml.

The magic numbers test searches inside the document for the XML signature (“<?xml”).

Output

GATE is capable of ensuring persistence for its resources. Several layers of persistence are available, up to and including database persistence. However, for some purposes a light and simple level of persistence is desirable. The types of persistent storage used for Language Resources are:

• Databases (like Oracle);
• Java serialization;
• XML serialization.

We describe the latter case here.

XML persistence doesn't necessarily preserve all the objects belonging to the annotations, documents or corpora. Their features can be all kinds of objects, with various layers of nesting; for example, lists containing lists containing maps, etc. Serialising these arbitrary data types in XML is not a simple task; GATE does the best it can, and supports native Java types such as Integers and Booleans, but where complex data types are used, information may be lost (the types will be converted into Strings). GATE provides full serialisation of certain types of features, namely numbers, strings, and collections of numbers or strings. All other features are serialised using their string representation, and when read back they will all be strings instead of the original objects. Consequences of this may be observed when performing evaluations (see the evaluation section).

When GATE outputs an XML document it may do so in one of two ways:

• When the original document that was imported into GATE was an XML document, GATE can dump that document back into XML (possibly with additional markup added);

• For all document formats, GATE can dump its internal representation of the document into XML.

In the former case, the XML output will be close to the original document. In the latter case, the format is a GATE-specific one which can be read back by the system to recreate all the information that GATE held internally for the document.

In order to understand why there are two types of XML serialisation, one needs to understand the structure of a GATE document. GATE allows a graph of annotations that refer to parts of the text. Those annotations are grouped under annotation sets. Because of this structure, it is sometimes impossible to save a document as XML using tags that surround the text referred to by the annotations, because tag crossover situations could appear (XML is essentially a tree-based model of information, whereas GATE uses graphs). Therefore, in order to preserve all annotations in a GATE document, a custom type of XML document was developed.

The problem of crossover tags appears with GATE's second option (the preserve-format one), which is implemented at the cost of losing certain annotations. The way it is applied in GATE is that it tries to restore the original markup and, where possible, to add in the same manner annotations produced by GATE.

How to access and make use of the two ways of XML serialization

Save As XML option

This option is available in GATE's GUI in the pop-up menu associated with each language resource (document or corpus). Saving a corpus as XML is done by calling Save as XML on each document of the corpus. This option saves all the annotations of a document together with their features (applying the restrictions previously discussed), using the GateDocument.dtd.

The document is saved under a name chosen by the user and it may have any extension. However, the recommended extension would be “xml”.

Using GATE’s API, this option is available by calling gate.Document’s toXml() method. This method returns a string which is the XML representation of the document on which the method was called.

Note: it is recommended that the string representation be saved on the file system using the UTF-8 encoding, as the first line of the string is:

<?xml version="1.0" encoding="UTF-8" ?>
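Putting these pieces together, a sketch of saving a document doc in GATE-format XML with the recommended UTF-8 encoding (the file name is illustrative):

String xml = doc.toXml();  // the GATE-format XML representation
java.io.Writer out = new java.io.OutputStreamWriter(
    new java.io.FileOutputStream("example.xml"), "UTF-8");
out.write(xml);
out.close();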

Example of such a GATE format document (in outline; node IDs and the full annotation list are omitted here):

<?xml version="1.0" encoding="UTF-8" ?>
<GateDocument>
<!-- The document's features -->
<GateDocumentFeatures>
<Feature>
  <Name className="java.lang.String">MimeType</Name>
  <Value className="java.lang.String">text/plain</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">gate.SourceURL</Name>
  <Value className="java.lang.String">file:/G:/tmp/example.txt</Value>
</Feature>
</GateDocumentFeatures>
<!-- The document content, with serialised nodes -->
<TextWithNodes>A TEENAGER yesterday accused his parents of cruelty by feeding
him a daily diet of chips which sent his weight ballooning to 22st at the age
of 12.</TextWithNodes>
<!-- The default annotation set -->
<AnnotationSet>
<Annotation Type="Date" StartNode="..." EndNode="...">
<Feature>
  <Name className="java.lang.String">rule2</Name>
  <Value className="java.lang.String">DateOnlyFinal</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">rule1</Name>
  <Value className="java.lang.String">GazDateWords</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">kind</Name>
  <Value className="java.lang.String">date</Value>
</Feature>
</Annotation>
<!-- further annotations (e.g. with features kind=internal and
     majorType=date_key) follow the same pattern -->
</AnnotationSet>
</GateDocument>

Note: all features that are not numbers, strings, or collections of numbers or strings are discarded. With this option, GATE does not preserve those features it cannot restore.

The preserve format option

This option is available in the GATE GUI from the pop-up menu of the annotations table. If no annotation in this table is selected, then the option will restore the document's original markup. If certain annotations are selected, then the option will attempt to restore the original markup and insert all the selected ones. When an annotation violates the crossover condition, that annotation is discarded and a message is issued by GATE.

This option makes it possible to generate an XML document with tags surrounding the text referred to by each annotation, and with features saved as attributes. All features which are collections, strings or numbers are saved, and the others are discarded. However, when read back, only the attributes under the GATE namespace (see below) are reconstructed differently from the others. That is because GATE does not store in the XML document the information about the features' classes, nor, for collections, the class of the items. So when read back all features will become strings, except those under the GATE namespace.

One will notice that all generated tags have an attribute called “gateId” under the namespace “http://www.gate.ac.uk”. The attribute is used when the document is read back into GATE, in order to restore the annotation's old ID. This feature is needed because it works in close cooperation with another attribute under the same namespace, called “matches”. This attribute indicates annotations/tags that refer to the same entity1. They are under this namespace because GATE is sensitive to them and treats them differently than all other elements and attributes, which fall under the general reading algorithm described at the beginning of this section.

The “gateId” attribute under the GATE namespace is used to create an annotation which has as its ID the value indicated by this attribute. The “matches” attribute is used to create an ArrayList in which the items will be Integers representing the IDs of the annotations that the current one matches.

Example:

If the text being processed is as follows:

<Person gate:gateId="23">John</Person> and <Person gate:gateId="25"
gate:matches="23;25;30">John Major</Person> are the same person.

1It's not an XML entity but an information extraction named entity.

What GATE does when it parses this text is to create two annotations:

a1.type = "Person"
a1.ID = Integer(23)
a1.start = ...
a1.end = ...
a1.featureMap = {}

a2.type = "Person"
a2.ID = Integer(25)
a2.start = ...
a2.end = ...
a2.featureMap = {matches=[Integer(23); Integer(25); Integer(30)]}

Under GATE's API, this option is available by calling gate.Document's toXml(Set aSetContainingAnnotations) method. This method returns a string which is the XML representation of the document on which the method was called. If called with null as a parameter, then the method will attempt to restore only the original markup. If the parameter is a set that contains annotations, then each annotation is tested against the crossover restriction; for those found to violate it, a warning will be issued and they will be discarded.
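For instance, to restore the original markup plus all Person annotations from the default annotation set (a sketch; annotations violating the crossover restriction are dropped with a warning, as described above):

java.util.Set personAnns = doc.getAnnotations().get("Person");
String xml = doc.toXml(personAnns);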

In the next subsections we will show how this option applies to the other formats supported by GATE.

6.5.3 HTML

Input

HTML documents are parsed by GATE using the NekoHTML parser. The documents are read and created in GATE the same way as the XML documents.

The extensions associated with the HTML reader are:

• htm

• html

The web server content type associated with HTML documents is: text/html.

The magic numbers test searches inside the document for the HTML signature (“<html”).

There is a certain degree of customisation for HTML documents, in that GATE introduces new lines into the document's text content in order to obtain a readable form. The annotations will refer to the pieces of text as described in the original document, but there will be a few extra new-line characters inserted.

After reading H1, H2, H3, H4, H5, H6, TR, CENTER, LI, BR and DIV tags, GATE will introduce a new-line (NL) character into the text. After a TITLE tag it will introduce two NLs. With P tags, GATE will introduce one NL at the beginning of the paragraph and one at the end. None of the newly added NLs are considered to be part of the text contained by the tag.

Output

The Save as XML option works exactly the same for all GATE's documents, so there is no particular observation to be made for the HTML format.

When attempting to preserve the original markup formatting, GATE will generate the document in XHTML. The HTML document will look the same in any browser after being processed by GATE, but it will be in a different syntax.

6.5.4 SGML

Input

The SGML support in GATE is fairly light, as there is no freely available Java SGML parser. GATE uses a light converter that attempts to transform the input SGML file into well-formed XML. Because it does not make use of a DTD, the conversion may not always be good. It is advisable to perform an SGML-to-XML conversion outside the system (using some other specialised tool) before using an SGML document inside GATE.

The extensions associated with the SGML reader are:

• sgm

• sgml

The web server content type associated with SGML documents is: text/sgml.

There is no magic numbers test for SGML.

Output

When attempting to preserve the original markup formatting, GATE will generate the document as XML, because the real input of an SGML document inside GATE is an XML one.

6.5.5 Plain text

Input

When reading a plain text document, GATE attempts to detect its paragraphs and adds “paragraph” annotations to the document's “Original markups” annotation set. It does this by detecting two consecutive NLs. The procedure works for both UNIX-style and DOS-style text files.

Example:

If the plain text read is as follows:

Paragraph 1. This text belongs to the first paragraph.

Paragraph 2. This text belongs to the second paragraph.

then two “paragraph”-type annotations will be created in the “Original markups” annotation set (referring to the first and second paragraphs) with an empty feature map.

The extensions associated with the plain text reader are:

• txt
• text

The web server content type associated with plain text documents is: text/plain.

There is no magic numbers test for plain text.

Output

When attempting to preserve the original markup formatting, GATE will dump XML markup that surrounds the text referred to.

The procedure described above applies both for plain text and RTF documents.

6.5.6 RTF

Input

Accessing RTF documents is performed using Java's RTF editor kit. It only extracts the document's text content from the RTF document.

The extension associated with the RTF reader is “rtf”.

The web server content type associated with RTF documents is: text/rtf.

The magic numbers test searches for {\rtf1.

Output

Same as the plain text output.

6.5.7 Email

Input

GATE is able to read email messages packed in one document (UNIX mailbox format). It detects multiple messages inside such documents, and for each message it creates annotations for all the fields composing an e-mail, such as date, from, to, subject, etc. The message's body is analysed and paragraph detection is performed (just as in the plain text case). All annotations created have as their type the name of the e-mail field, and they are placed in the “Original markups” annotation set.

Example:

From [email protected] Wed Sep 6 10:35:50 2000

Date: Wed, 6 Sep 2000 10:35:49 +0100 (BST)

From: forename1 surname2

To: forename2 surname2

Subject: A subject Developing Language Processing Components with GATE 105

Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

This text belongs to the e-mail body....

This is a paragraph in the body of the e-mail

This is another paragraph.

GATE attempts to detect lines such as “From [email protected] Wed Sep 6 10:35:50 2000” in the e-mail text. Such lines separate the e-mail messages contained in one file. After that, for each field in an e-mail message, annotations are created as follows:

The annotation type will be the name of the field, the feature map will be empty, and the annotation will span from the end of the field name until the end of the line containing the e-mail field.

Example:

a1.type = "date"; a1 spans between the two ^ marks:
Date:^ Wed, 6 Sep 2000 10:35:49 +0100 (BST)^

a2.type = "from"; a2 spans between the two ^ marks:
From:^ forename1 surname2 ^

The extensions associated with the email reader are:

• eml

• email

• mail

The web server content type associated with email documents is: text/email.

The magic numbers test searches for keywords like Subject:, etc.

Output

Same as plain text output.

6.6 XML Input/Output

Support for input from and output to XML is described in section 6.5.2. In short:

• GATE will read any well-formed XML document (it does not attempt to validate XML documents). Markup will by default be converted into native GATE format.

• GATE will write back into XML in one of two ways:

1. Preserving the original format and adding selected markup (for example to add the results of some language analysis process to the document). 2. In GATE’s own XML serialisation format, which encodes all the data in a GATE Document (as far as this is possible within a tree-structured paradigm – for 100% non-lossy data storage use GATE’s RDBMS or binary serialisation facilities – see section 4.7).

When using the GATE framework, object representations of XML documents such as DOM or jDOM, or query and transformation languages such as X-Path or XSLT, may be used in parallel with GATE's own Document representation (gate.Document) without conflicts.

Chapter 7

JAPE: Regular Expressions Over Annotations

If Osama bin Laden did not exist, it would be necessary to invent him. For the past four years, his name has been invoked whenever a US president has sought to increase the defence budget or wriggle out of arms control treaties. He has been used to justify even President Bush’s missile defence programme, though neither he nor his associates are known to possess anything approaching ballistic missile technology. Now he has become the personification of evil required to launch a crusade for good: the face behind the faceless terror. The closer you look, the weaker the case against Bin Laden becomes. While the terrorists who inflicted Tuesday’s dreadful wound may have been inspired by him, there is, as yet, no evidence that they were instructed by him. Bin Laden’s presumed guilt appears to rest on the supposition that he is the sort of man who would have done it. But his culpability is irrelevant: his usefulness to western governments lies in his power to terrify. When billions of pounds of military spending are at stake, rogue states and terrorist warlords become assets precisely because they are liabilities. The need for dissent, George Monbiot, The Guardian, Tuesday September 18, 2001.

This chapter describes JAPE – a Java Annotation Patterns Engine. JAPE provides finite state transduction over annotations based on regular expressions. JAPE is a version of CPSL – Common Pattern Specification Language1.

JAPE allows you to recognise regular expressions in annotations on documents. Hang on, there's something wrong here: a regular language can only describe sets of strings, not graphs, and GATE's model of annotations is based on graphs. Hmmm. Another way of saying this: typically, regular expressions are applied to character strings, a simple linear sequence of items, but here we are applying them to a much more complex data structure. The result is that in certain cases the matching process is non-deterministic (i.e. the results are dependent on random factors like the addresses at which data is stored in the virtual machine): when there is structure in the graph being matched that requires more than the power of a regular automaton to recognise, JAPE chooses an alternative arbitrarily. However, this is not the bad news that it seems to be, as it turns out that in many useful cases the data stored in annotation graphs in GATE (and other language processing systems) can be regarded as simple sequences, and matched deterministically with regular expressions.

1A good description of the original version of this language is in Doug Appelt's TextPro manual. Doug was a great help to us in implementing JAPE. Thanks Doug!

A JAPE grammar consists of a set of phases, each of which consists of a set of pattern/action rules. The phases run sequentially and constitute a cascade of finite state transducers over annotations. The left-hand-side (LHS) of the rules consist of an annotation pattern that may contain regular expression operators (e.g. *, ?, +). The right-hand-side (RHS) consists of annotation manipulation statements. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels that are attached to pattern elements.

At the beginning of each grammar, several options can be set:

• Control - this defines the method of rule matching (see Section 7.2)

• Debug - when set to true, if the grammar is running in Appelt mode and there is more than one possible match, the conflicts will be displayed on the standard output. See also Section 7.3.

Input annotations must also be defined at the start of each grammar. If no annotations are defined, all annotations will be matched.

There are 3 main ways in which the pattern can be specified:

• specify a string of text, e.g. {Token.string == “of”}

• specify an annotation previously assigned from a gazetteer, tokeniser, or other module, e.g. {Lookup}

• specify the attributes (and values) of an annotation, e.g. {Token.kind == number}

Macros can also be used in the LHS of rules. This means that instead of expressing the information in the rule, it is specified in a macro, which can then be called in the rule. The reason for this is simply to avoid having to repeat the same information in several rules. Macros can themselves be used inside other macros.

The same operators can be used as for the tokeniser rules, i.e.

| * ? +

The pattern description is followed by a label for the annotation. A label is denoted by a preceding colon; in the example below, the label is :location.

The RHS of the rule contains information about the annotation. Information about the annotation is transferred from the LHS of the rule using the label just described, and annotated with the entity type (which follows it). Finally, attributes and their corresponding values are added to the annotation. Alternatively, the RHS of the rule can contain Java code to create or manipulate annotations.

In the simple example below, the pattern described will be awarded an annotation of type “Enamex” (because it is an entity name). This annotation will have the attribute “kind”, with value “location”, and the attribute “rule”, with value “GazLocation”. (The purpose of the “rule” attribute is simply to ease the process of manual rule validation).

Rule: GazLocation
(
 {Lookup.majorType == location}
)
:location -->
 :location.Enamex = {kind="location", rule=GazLocation}

It is also possible to have more than one pattern and corresponding action, as shown in the rule below. On the LHS, each pattern is enclosed in a set of round brackets and has a unique label; on the RHS, each label is associated with an action. In this example, the Lookup annotation is labelled “jobtitle” and is given the new annotation JobTitle; the TempPerson annotation is labelled “person” and is given the new annotation Person.

Rule: PersonJobTitle
Priority: 20

(
 {Lookup.majorType == jobtitle}
):jobtitle
(
 {TempPerson}
):person
-->
 :jobtitle.JobTitle = {rule = "PersonJobTitle"},
 :person.Person = {kind = "personName", rule = "PersonJobTitle"}

Similarly, labelled patterns can be nested, as in the example below, where the whole pattern is annotated as Person, but within the pattern, the job title is annotated as JobTitle.

Rule: PersonJobTitle2
Priority: 20

(
 (
  {Lookup.majorType == jobtitle}
 ):jobtitle
 {TempPerson}
):person
-->
 :jobtitle.JobTitle = {rule = "PersonJobTitle"},
 :person.Person = {kind = "personName", rule = "PersonJobTitle"}

JAPE provides limited support for copying annotation feature values from the left to the right hand side of a rule, for example:

Rule: LocationType

(
 {Lookup.majorType == location}
):loc
-->
 :loc.Location = {rule = "LocationType", type = :loc.Lookup.minorType}

This will set the “type” feature of the generated Location to the value of the “minorType” feature from the “Lookup” annotation bound to the loc label. If the Lookup has no minorType, the Location will have no “type” feature. The behaviour of newFeat = :bind.Type.oldFeat is:

• Find all the annotations of type Type from the left hand side binding bind.

• Find one of them that has a non-null value for its oldFeat feature (if there is more than one, which one is chosen is up to the JAPE implementation).

• If such a value exists, set the newFeat feature of our newly created annotation to this value.

• If no such non-null value exists, do not set the newFeat feature at all.

Notice that the behaviour is deliberately underspecified if there is more than one Type annotation in bind. If you need more control, or if you want to copy several feature values from the same left hand side annotation, you should consider using Java code on the right hand side of your rule (see section 7.5).

Grammar rules can essentially be of two types. The first type of rule involves no gazetteer lookup, but can be defined using a small set of possible formats. In general, these are fairly straightforward and offer little potential for ambiguity.

The second type of rule relies more heavily on the gazetteer lists, and covers a much wider range of possibilities. This not only means that many rules may be needed to describe all situations, but that there is a much greater potential for ambiguity. This leads to the necessity for rule ordering and prioritisation, as will be discussed below.

For example, a single rule is sufficient to identify an IP address, because there is only one basic format - a series of numbers, each set connected by a dot. The rule for this is given below2:

Rule: IPAddress
(
 {Token.kind == number}
 {Token.string == "."}
 {Token.kind == number}
 {Token.string == "."}
 {Token.kind == number}
 {Token.string == "."}
 {Token.kind == number}
)
:ipAddress -->
 :ipAddress.Address = {kind = "ipAddress"}

To identify a date or time, there are many possible variations, and so many rules are needed. For example, the same date information can appear in the following formats (amongst others):

Wed, 10/7/00
Wed, 10/July/00
Wed, 10 July, 2000
Wed 10th of July, 2000
Wed. July 10th, 2000
Wed 10 July 2000

2We might be more specific and state the possible lengths of the number, but within the confines of this project we currently have no need to, because there is no ambiguity with anything else.

Different types of date can also be expressed. For example, the following would also be classified as date entities:

the late ’80s
Monday
St. Andrew’s Day
99 BC
mid-November
1980-81
from March to April

This also means there is a much greater potential for ambiguity. For example, many of the months of the year can also be girls’ Christian names (e.g. May, June). This means that contextual information may be needed to disambiguate them, or we may have to guess which is more likely, based on frequency. For example, while “Friday” could be a person’s name (as in “Man Friday”), it is much more likely to be a day of the week.

Finally, macros can also be used on the RHS of rules. In this case, the label (which matches the label on the LHS of the rule) should be included in the macro. Below we give an example of using a macro on the RHS.

Macro: UNDERSCORES_OKAY // the macro name and the label
:match                  // go on separate lines
{
 gate.AnnotationSet matchedAnns = (gate.AnnotationSet)bindings.get("match");

 int begOffset = matchedAnns.firstNode().getOffset().intValue();
 int endOffset = matchedAnns.lastNode().getOffset().intValue();
 String mydocContent = doc.getContent().toString();
 String matchedString = mydocContent.substring(begOffset, endOffset);

 gate.FeatureMap newFeatures = Factory.newFeatureMap();

 if(matchedString.equals("Spanish")) {
   newFeatures.put("myrule", "Lower");
 }
 else {
   newFeatures.put("myrule", "Upper");
 }

 newFeatures.put("quality", "1");
 annotations.add(matchedAnns.firstNode(), matchedAnns.lastNode(),
                 "Spanish_mark", newFeatures);
}

Rule: Lower
(
 ({Token.string == "Spanish"}):match
)
--> UNDERSCORES_OKAY // no label here, only the macro name

Rule: Upper
(
 ({Token.string == "SPANISH"}):match
)
--> UNDERSCORES_OKAY // no label here, only the macro name

7.1 Use of Context

Context can be dealt with in the grammar rules in the following way. The pattern to be annotated is always enclosed by a set of round brackets. If preceding context is to be included in the rule, this is placed before this set of brackets. This context is described in exactly the same way as the pattern to be matched. If context following the pattern needs to be included, it is placed after the label given to the annotation. Context is used where a pattern should only be recognised if it occurs in a certain situation, but the context itself does not form part of the pattern to be annotated.

For example, the following rule for Time (assuming an appropriate macro for “year”) would mean that a year would only be recognised if it occurs preceded by the words “in” or “by”:

Rule: YearContext1

(
 {Token.string == "in"} |
 {Token.string == "by"}
)
(YEAR):date
-->
:date.Timex = {kind = "date", rule = "YearContext1"}

Similarly, the following rule (assuming an appropriate macro for “email”) would mean that an email address would only be recognised if it occurred inside angled brackets (which would not themselves form part of the entity):

Rule: Emailaddress1

(
 {Token.string == "<"}
)
(
 (EMAIL)
):email
(
 {Token.string == ">"}
)
-->
:email.Address = {kind = "email", rule = "Emailaddress1"}

7.2 Use of Priority

Each grammar has one of 5 possible control styles: “brill”, “all”, “first”, “once” and “appelt”. This is specified at the beginning of the grammar.

The Brill style means that when more than one rule matches the same region of the document, they are all fired. The result of this is that a segment of text could be allocated more than one entity type, and that no priority ordering is necessary. Brill will execute all matching rules starting from a given position and will advance and continue matching from the position in the document where the longest match finishes.

The “all” style is similar to Brill, in that it will also execute all matching rules, but the matching will continue from the next offset after the current one, rather than from the end of the longest match.

For example, where [] are annotations of type Ann

[aaa[bbb]] [ccc[ddd]]

then a rule matching {Ann} and creating {Ann-2} for the same spans will generate:

BRILL: [aaabbb] [cccddd]
ALL: [aaa[bbb]] [ccc[ddd]]

With the “first” style, a rule fires for the first match that’s found. This makes it inappropriate for rules that end in “+”, “?” or “*”. Once a match is found the rule is fired; it does not attempt to get a longer match (as the other two styles do).

With the “once” style, the whole JAPE phase exits after the first match: once any rule has fired, matching stops.

With the appelt style, only one rule can be fired for the same region of text, according to a set of priority rules. Priority operates in the following way.

1. From all the rules that match a region of the document starting at some point X, the one which matches the longest region is fired.

2. If more than one rule matches the same region, the one with the highest priority is fired.

3. If there is more than one rule with the same priority, the one defined earlier in the grammar is fired.

An optional priority declaration is associated with each rule, which should be a positive integer. The higher the number, the greater the priority. By default (if the priority declaration is missing) all rules have the priority -1 (i.e. the lowest priority).

For example, the following two rules for location could potentially match the same text.

Rule: Location1
Priority: 25

(
 ({Lookup.majorType == loc_key, Lookup.minorType == pre} {SpaceToken})?
 {Lookup.majorType == location}
 ({SpaceToken} {Lookup.majorType == loc_key, Lookup.minorType == post})?
):locName
-->
:locName.Location = {kind = "location", rule = "Location1"}

Rule: GazLocation
Priority: 20
(
 ({Lookup.majorType == location}):location
)
-->
:location.Name = {kind = "location", rule = "GazLocation"}

Assume we have the text “China sea”, that “China” is defined in the gazetteer as “location”, and that “sea” is defined as a “loc_key” of type “post”. In this case, rule Location1 would apply, because it matches a longer region of text starting at the same point (“China sea”, as opposed to just “China”). Now assume we just have the text “China”. In this case, both rules could be fired, but the priority for Location1 is higher, so it will take precedence. In this case, since both rules produce the same annotation, it is not so important which rule is fired, but this is not always the case.

One important point of which to be aware is that prioritisation only operates within a single grammar. Although we could make priority global by having all the rules in a single grammar, this is not ideal due to other considerations. Instead, we currently combine all the rules for each entity type in a single grammar. An index file (main.jape) is used to define which grammars should be used, and in which order they should be fired.
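The index file uses JAPE’s multiphase format: it names the constituent phase files (without the .jape extension) in the order in which they should run. A minimal sketch, with illustrative phase names:

MultiPhase: TestPhases
Phases:
 firstname
 name
 final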

7.3 Useful Tricks

Although the JAPE language has some limitations as to how rules and patterns can be expressed, there are some useful tricks to overcome these problems.

• Using priority to resolve ambiguity. If the Appelt style of matching is selected, rule priority operates in the following way.

1. Length of rule - a rule matching a longer pattern will fire first.

2. Explicit priority declaration. Use the optional Priority declaration to assign a ranking. The higher the number, the higher the priority. If no priority is stated, the default is -1.

3. Order of rules. In the case where the above two factors do not distinguish between two rules, the order in which the rules are stated applies. Rules stated first have higher priority.

Because priority can only operate within a single grammar, this can be a problem for dealing with ambiguity issues. One solution to this is to create a temporary set of annotations in initial grammars, and then manipulate this temporary set in one or more later phases (for example, by converting temporary annotations from different phases into permanent annotations in a single final phase; a sketch follows). See the default set of grammars for an example of this.
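For instance, a final phase might promote a temporary annotation created by an earlier phase into a permanent one (the annotation names here are illustrative):

Rule: FinalPerson
(
 {TempPerson}
):person
-->
:person.Person = {kind = "personName", rule = "FinalPerson"}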

• Negative operator. A negative operator cannot be specified as such. One solution to this is to create a “negative rule” which has higher priority than the matching “positive rule”. The style of matching must be Appelt for this to work. To create a negative rule, simply state on the LHS of the rule the pattern that should NOT be matched, and on the RHS do nothing. In this way, the positive rule cannot be fired if the negative pattern matches, and vice versa, which has the same end result as using a negative operator. A useful variation for developers is to create a dummy annotation on the RHS of the negative rule, rather than to do nothing, and to give the dummy annotation a rule feature. In this way, it is obvious that the negative rule has fired. Alternatively, use Java code on the RHS to print a message when the rule fires. An example of a matching negative and positive rule follows. Here, we want a rule which matches a surname followed by a comma and a set of initials. But we want to specify that the initials shouldn’t have the POS category PRP (personal pronoun). So we specify a negative rule that will fire if the PRP category exists, thereby preventing the positive rule from firing.

Rule: NotPersonReverse
Priority: 20
// we don't want to match "Jones, I"
(
 {Token.category == NNP}
 {Token.string == ","}
 {Token.category == PRP}
):foo
-->
{}

Rule: PersonReverse
Priority: 5
// we want to match "Jones, F.W."
(
 {Token.category == NNP}
 {Token.string == ","}
 (INITIALS)?
):person
-->
:person.Person = {rule = "PersonReverse"}

• Matching special characters. To specify a single or double quote as a string, precede it with a backslash, e.g.

{Token.string=="\""}

will match a double quote. For other special characters, such as “$”, enclose it in double quotes, e.g.

{Token.category == "PRP$"}

• Referring to previous annotations. An annotation generated in one phase can be referred to in a later phase, in exactly the same way as any other kind of annotation (by specifying the name of the annotation within curly braces). The features and values can be referred to or omitted, as with all other annotations. Make sure that if the Input specification is used in the grammar, the annotation to be referred to is included in the list.

• Using context. Specify left or right context around a pattern by enclosing it in round brackets outside the round brackets of the pattern. In the example below, the context “in” must precede the location to be annotated. Only the location will be annotated, but it is important to remember that context is consumed by the rule, so it cannot be reused in another rule within the same phase. So, for example, right context cannot be used as left context for another rule.

Rule: InLoc1
// in PARIS
(
 {Token.string == "in"}
)
(
 {Lookup.majorType == location}
):locName
-->
:locName.Location = {rule = "InLoc1"}

• Debug. Add the following to the options at the top of the grammar.

Options: control = appelt debug = true

• Avoid conflicts. If two possible ways of matching are found for the same text string, a conflict can arise. Normally this is handled by the priority mechanism (match length, rule priority and finally rule precedence). If all these are equal, JAPE will simply choose a match at random and fire it. This leads to non-deterministic behaviour, which should be avoided.

• Using Java code on the RHS. If you want to be flash, you can use any Java code you like on the RHS of the rule. This is useful for feature percolation (see below), for deleting previous annotations, measuring the length of strings, and performing alternative operations depending on particular features of the annotation. See 7.5 for more details.

• Feature percolation. To copy features from previous annotations, where the value of the feature is unknown, some simple Java code can be used. See Section 7.5 for a more detailed explanation of this.

• Adding a feature to the document. Instead of adding a feature to an annotation, a feature can be added to the document as a whole. For example, the following code on the RHS would add the feature “texttype” with value “sport” to the document.

doc.getFeatures().put("texttype", "sport");

• Overlapping annotations. Once JAPE has matched something, that part of the input is “consumed” and it moves on to the next position. This means that it does not recognise overlapping annotations in patterns, e.g. when matching overlapping lookups from the gazetteer. For example, for a string “hALCAM” with the lookups “hAL”, “ALCAM” and “CAM”, only the first lookup “hAL” will be recognised by the grammar rule that matches lookups. After the grammar has matched “hAL” it will continue with “CAM”, hence skipping “ALCAM” completely. The trick for handling this kind of situation is to delete the used-up lookups and run the same grammar again over the same input (a sketch follows). You may need to repeat this several times until you have used up all your lookups; the number of repetitions required needs to be determined experimentally.
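As a sketch of this approach (the output annotation name Mention is hypothetical), a rule can create its output and then delete the Lookup it consumed, so that re-running the phase lets the previously hidden, overlapping Lookups match:

Rule: ConsumeLookup
(
 {Lookup}
):match
-->
{
 gate.AnnotationSet matched = (gate.AnnotationSet)bindings.get("match");
 gate.FeatureMap features = Factory.newFeatureMap();
 features.put("rule", "ConsumeLookup");
 // create the desired output annotation over the matched span
 outputAS.add(matched.firstNode(), matched.lastNode(), "Mention", features);
 // delete the used-up Lookup so that the next run of this phase can
 // match the overlapping Lookups that were skipped this time
 inputAS.removeAll(matched);
}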

7.4 Ontology aware grammar transduction

GATE supports two different methods for ontology aware grammar transduction. Firstly, it is possible to use the ontology feature both in grammars and annotations, while using the default transducer. Secondly, it is possible to use an ontology aware transducer by passing an ontology language resource to one of the subsumes methods in SimpleFeatureMapImpl. This second strategy does not check for ontology features, which makes the writing of grammars easier, as there is no need to specify the ontology when writing them. More information about the ontology-aware transducer can be found in Section 10.6.

7.5 Using Java code in JAPE rules

The RHS of a JAPE rule can consist of any Java code. This is useful for removing temporary annotations and for percolating and manipulating features from previous annotations.

The first rule below shows a rule which matches a first person name, e.g. “Fred”, and adds a gender feature depending on the value of the minorType from the gazetteer list in which the name was found. We first get the bindings associated with the person label (i.e. the Lookup annotation). We then create a new annotation called “personAnn” which contains this annotation, and create a new FeatureMap to enable us to add features. Then we get the minorType feature (and its value) from the personAnn annotation (in this case, the feature will be “gender” and the value will be “male”), and add this value to a new feature called “gender”. We create another feature “rule” with value “FirstName”. Finally, we add all the features to a new annotation “FirstPerson” which attaches to the same nodes as the original “person” binding.

Note that inputAS and outputAS represent the input and output annotation sets. Normally, these would be the same (by default when using ANNIE, these will be the “Default” annotation set). Since the user is at liberty to change the input and output annotation sets in the parameters of the JAPE transducer at runtime, it cannot be guaranteed that the input and output annotation sets will be the same, and therefore we must specify the annotation set we are referring to.

Rule: FirstName

(
 {Lookup.majorType == person_first}
):person
-->
{
 gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");
 gate.Annotation personAnn = (gate.Annotation)person.iterator().next();
 gate.FeatureMap features = Factory.newFeatureMap();
 features.put("gender", personAnn.getFeatures().get("minorType"));
 features.put("rule", "FirstName");
 outputAS.add(person.firstNode(), person.lastNode(), "FirstPerson",
              features);
}

The second rule (contained in a subsequent grammar phase) makes use of annotations produced by the first rule described above. Instead of percolating the minorType from the annotation produced by the gazetteer lookup, this time it percolates the feature from the annotation produced by the previous grammar rule. So here it gets the “gender” feature value from the “FirstPerson” annotation, and adds it to a new feature (again called “gender” for convenience), which is added to the new annotation (in outputAS) “TempPerson”. At the end of this rule, the existing input annotations (from inputAS) are removed because they are no longer needed. Note that in the previous rule, the existing annotations were not removed, because it is possible they might be needed later on in another grammar phase.

Rule: GazPersonFirst
(
 {FirstPerson}
):person
-->
{
 gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");
 gate.Annotation personAnn = (gate.Annotation)person.iterator().next();
 gate.FeatureMap features = Factory.newFeatureMap();
 features.put("gender", personAnn.getFeatures().get("gender"));
 features.put("rule", "GazPersonFirst");
 outputAS.add(person.firstNode(), person.lastNode(), "TempPerson",
              features);
 inputAS.removeAll(person);
}

7.5.1 Adding a feature to the document

The following example code shows how to add the feature “genre” with value “email” to the document, using Java code on the RHS of a rule:

Rule: Email

Priority: 150

(
 {message}
)
-->
{
 doc.getFeatures().put("genre", "email");
}

7.5.2 Using named blocks

For the common case where a Java block refers just to the annotations from a single left-hand-side binding, JAPE provides a shorthand notation:

Rule: RemoveDoneFlag

(
 {Instance.flag == "done"}
):inst
-->
:inst{
 Annotation theInstance = (Annotation)instAnnots.iterator().next();
 theInstance.getFeatures().remove("flag");
}

This rule is equivalent to the following:

Rule: RemoveDoneFlag

(
 {Instance.flag == "done"}
):inst
-->
{
 AnnotationSet instAnnots = (AnnotationSet)bindings.get("inst");
 if(instAnnots != null && instAnnots.size() != 0) {
   Annotation theInstance = (Annotation)instAnnots.iterator().next();
   theInstance.getFeatures().remove("flag");
 }
}

In general, a label :<label> on a Java block creates a local variable <label>Annots within the block, holding the annotation set bound to that label (instAnnots in the example above). The Java code is then only executed if at least one annotation is bound to the label, as the null and size checks in the expanded form show.

7.5.3 Java RHS overview

When a JAPE grammar is parsed, the JAPE parser creates an action class for each Java RHS in the grammar (one action class per RHS). The RHS Java code is embedded as the body of the method doIt, and works in the context of this method. When a particular rule is fired, the method doIt is executed.

Method doIt is specified by the interface gate.jape.RhsAction. Each action class implements this interface and is generated with the following template (the generated class name is chosen by the JAPE parser):

import java.io.*;
import java.util.*;
import gate.*;
import gate.jape.*;
import gate.creole.ontology.Ontology;
import gate.annotation.*;
import gate.util.*;

class <GeneratedClassName> implements java.io.Serializable, RhsAction {
  public void doIt(Document doc, java.util.Map bindings,
                   AnnotationSet annotations, AnnotationSet inputAS,
                   AnnotationSet outputAS, Ontology ontology) {
    // your RHS Java code will be embedded here
    ...
  }
}

Method doIt has the following parameters that can be used in RHS Java code:

• Document doc - the document that is currently being processed

• java.util.Map bindings - a map of binding variables, where a key is the (String) name of a binding variable and the value is the (AnnotationSet) set of annotations corresponding to this binding variable

• AnnotationSet annotations - a default annotation set of the document

• AnnotationSet inputAS - input annotations

• AnnotationSet outputAS - output annotations

• Ontology ontology - the GATE transducer's ontology

In your Java RHS you can use short names for all Java classes that are imported by the action class (plus Java classes from the packages that are imported by default according to the JVM specification: java.lang.*, java.math.*). But you need to use fully qualified Java class names for all other classes. For example:

-->
{
 // VALID line examples
 AnnotationSet as = ...
 InputStream is = ...
 java.util.logging.Logger myLogger =
     java.util.logging.Logger.getLogger("JAPELogger");
 java.sql.Statement stmt = ...

 // INVALID line examples
 Logger myLogger = Logger.getLogger("JapePhaseLogger");
 Statement stmt = ...
}

7.6 Optimising for Speed

The way in which grammars are designed can have a huge impact on the processing speed. Some simple tricks to keep the processing as fast as possible are:

• avoid the use of the * and + operators. Replace them with ? where possible. For example, instead of

({Token})*

use

({Token})? ({Token})? ({Token})?

if you can predict that you won’t need to recognise a string of Tokens longer than 3.

• avoid specifying unnecessary elements such as SpaceTokens where you can. To do this, use the Input specification at the beginning of the grammar to stipulate the annotations that need to be considered. If no Input specification is used, all annotations will be considered (so, for example, you cannot match two tokens separated by a space unless you specify the SpaceToken in the pattern). If, however, you specify Tokens but not SpaceTokens in the Input, SpaceTokens do not have to be mentioned in the pattern to be recognised (see the sketch after this list). If, for example, there is only one rule in a phase that requires SpaceTokens to be specified, it may be judicious to move that rule to a separate phase where the SpaceToken can be specified as Input.

• avoid the shorthand syntax for copying feature values (newFeat = :bind.Type.oldFeat), particularly if you need to copy multiple features from the left to the right hand side of your rule.
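To illustrate the Input specification, the following sketch (the phase, rule and output annotation names are illustrative) matches two consecutive Tokens without mentioning the SpaceToken between them, because SpaceToken is excluded from the Input list:

Phase: TwoTokens
Input: Token
Options: control = appelt

Rule: TwoTokens
(
 {Token} {Token}
):pair
-->
:pair.TokenPair = {rule = "TwoTokens"}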

7.7 Serializing JAPE Transducer

JAPE grammars are written as files with the extension “.jape”, which are parsed and compiled at run-time to execute them over the GATE document(s). Serialization of the JAPE Transducer adds the capability to serialize such grammar files and use them later to bootstrap new JAPE transducers, without needing the original JAPE grammar file. This allows people to distribute the serialized version of their grammars without disclosing the actual contents of their jape files. This is implemented as part of the JAPE Transducer PR. The following sections describe how to serialize and deserialize them.

7.7.1 How to serialize?

Once an instance of a JAPE transducer is created, the option to serialize it appears in the option menu of that instance. The option menu can be activated by right-clicking on the respective PR. Having done so, it asks for the file name where the serialized version of the respective JAPE grammar will be stored.

7.7.2 How to use the serialized grammar file?

The JAPE Transducer now also has an init-time parameter binaryGrammarURL, which appears as an optional alternative to the grammarURL parameter. The user can use this parameter (i.e. binaryGrammarURL) to specify the serialized grammar file.

7.8 The JAPE Debugger

The JAPE debugger helps to find errors in JAPE programs by enabling the user to see in detail how a JAPE rule works when applied to a particular range of text. It was written by Ontos, who also provided the original version of this documentation. The debugger allows the user to select a particular part of the text, and then look at the detailed history of processing. This enables them to see which rules were matched and which were not, and also why particular rules were or were not matched. It is also possible to set breakpoints for particular rules, enabling the user to see how the rule was matched, and what annotations were created.

The JAPE debugger can be useful in situations where the old simple DEBUG OUTPUT method does not help. For example, when:

• A Rule LHS has not been matched.

• Text did not match the expected template of a rule.

• The rule was overridden by another conflicting rule.

• Annotations are created, but it is not possible to tell which rule created them.

7.8.1 Debugger GUI

The layout of the JAPE-debugger user interface is shown in Figure 7.1.

The debugger's main frame consists of the following primary components:

• Resources tree (appears in the left side of the main frame and contains all the resources available within the current GATE session).

• Debugging panel (located at the center of the main frame and contains three tabs providing all necessary debugging information).

• Document panel (provides you with the document on which you are currently debugging).

7.8.2 Using the Debugger

In most situations you will use the debugger in trace mode using the following steps:

• initialize the JAPE debugger from the GATE menu (Tools / JAPE Debugger);

Figure 7.1: The JAPE Debugger User Interface

• run a GATE serial controller (this can be done either from GATE or from the debugger. Note: for performance reasons, the debugger doesn't gather matching information when it's not running, so run the GATE serial controller after you open the debugger window);

• select the part of the text that is interesting for debugging purposes, and press the button at the left of the text to update the view of the debugger.

After these steps the following information becomes available. In the Resources tree some of the rules become highlighted in different colors:

• Green means that the rule has matched successfully.

• Yellow means that it matched, but was overridden by another rule.

• Red means that the rule tried to match, but failed.

Trace history is the main debugging tab in the Debugging panel. It contains the source of the JAPE rule currently selected, and the selected text in the document panel. All the inputs are shown, and matched inputs are highlighted in green. Annotations which made the rule fail are highlighted in red. If a rule tried to match more than once on the selected text interval, the buttons at the top of the panel (Previous and Next) become enabled, and allow one to observe all the matching attempts of the rule. Clicking on any of the inputs shows an annotation window, and the tool tip of the matched words gives the template in the rule.

Step by Step Example

To give an idea of how to use the debugger for fixing bugs, let's consider the following example. Suppose there is a rule named PersonFullExt, which should find person names such as A. B. Dick, J. F. Kennedy and so on, and create an annotation Person. To test the rule, we run GATE on a text fragment containing the words the J.L. Kellogg Graduate School, so we would expect the part of the text J. L. Kellogg to get an annotation Person. Unfortunately, we encounter a problem (only L. Kellogg was matched), so we decide to use the debugger to find the reason for this unexpected behaviour. With the JAPE debugger, it is possible to observe everything needed for finding and fixing the error.

The appropriate screenshot is shown in Figure 7.2.

As you can see, the rule NotPersonFull matched the text the J, so the rule PersonFullExt could start matching only after the pointer had moved to the token “.”. Without the debugger, it wouldn't be so easy to find the reason for this error, because the rule NotPersonFull doesn't create any annotations.

An additional feature of the debugger is the availability of debugging with breakpoints (Jape Rule tab). After setting a breakpoint on a given rule (in our case the rule named TheOrgXBase), the GATE transducer will be interrupted at the breakpoint, and the document panel will display the text currently being matched by the rule (highlighted in cyan). In the tab, a special table representation of the rule (with what it matches on the left side), and the history of annotations created by this rule, will be displayed, as in Figure 7.3.

7.8.3 Known Bugs

1. The debugger doesn't see processing resource reinitialization. A possible workaround is to close and open the resource again.

Figure 7.2: Finding Errors

Figure 7.3: The Interface of the JAPE Debugger while Running in Breakpoint Mode

Chapter 8

ANNIE: a Nearly-New Information Extraction System

And so the time had passed predictably and soberly enough in work and routine chores, and the events of the previous night from first to last had faded; and only now that both their days’ work was over, the child asleep and no further disturbance anticipated, did the shadowy figures from the masked ball, the melancholy stranger and the dominoes in red, revive; and those trivial encounters became magically and painfully interfused with the treacherous illusion of missed opportunities. Innocent yet ominous questions and vague ambiguous answers passed to and fro between them; and, as neither of them doubted the other’s absolute candour, both felt the need for mild revenge. They exaggerated the extent to which their masked partners had attracted them, made fun of the jealous stirrings the other revealed, and lied dismissively about their own. Yet this light banter about the trivial adventures of the previous night led to more serious discussion of those hidden, scarcely admitted desires which are apt to raise dark and perilous storms even in the purest, most transparent soul; and they talked about those secret regions for which they felt hardly any longing, yet towards which the irrational wings of fate might one day drive them, if only in their dreams. For however much they might belong to one another heart and soul, they knew last night was not the first time they had been stirred by a whiff of freedom, danger and adventure. Dream Story, Arthur Schnitzler, 1926 (pp. 4-5).

GATE was originally developed in the context of Information Extraction (IE) R&D, and IE systems in many languages and shapes and sizes have been created using GATE with the IE components that have been distributed with it (see [Maynard et al. 00] for descriptions of some of these projects).1

1The principal architects of the IE systems in GATE version 1 were Robert Gaizauskas and Kevin Humphreys. This work lives on in the LaSIE system. (A derivative of LaSIE was distributed with GATE version 1 under the name VIE, a Vanilla IE system.)

GATE is distributed with an IE system called ANNIE, A Nearly-New IE system (developed by Hamish Cunningham, Valentin Tablan, Diana Maynard, Kalina Bontcheva, Marin Dimitrov and others). ANNIE relies on finite state algorithms and the JAPE language (see chapter 7).

Figure 8.1: ANNIE and LaSIE

ANNIE components form a pipeline which appears in figure 8.1. ANNIE components are included with GATE (though the linguistic resources they rely on are generally more simple than the ones we use in-house). The rest of this chapter describes these components.

8.1 Tokeniser

The tokeniser splits the text into very simple tokens such as numbers, punctuation and words of different types. For example, we distinguish between words in uppercase and lowercase, and between certain types of punctuation. The aim is to limit the work of the tokeniser to maximise efficiency, and enable greater flexibility by placing the burden on the grammar rules, which are more adaptable.

8.1.1 Tokeniser Rules

A rule has a left hand side (LHS) and a right hand side (RHS). The LHS is a regular expression which has to be matched on the input; the RHS describes the annotations to be added to the AnnotationSet. The LHS is separated from the RHS by ’>’. The following operators can be used on the LHS:

| (or)
* (0 or more occurrences)
? (0 or 1 occurrences)
+ (1 or more occurrences)

The RHS uses ’;’ as a separator, and has the following format:

{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}

Details about the primitive constructs available are given in the tokeniser file (DefaultTokeniser.Rules).

The following tokeniser rule is for a word beginning with a single capital letter:

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;

It states that the sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.
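Following the same format, a sketch of a rule for a word in all uppercase might look like the following (the actual rule in DefaultTokeniser.Rules may differ):

"UPPERCASE_LETTER" "UPPERCASE_LETTER"* > Token;orth=allCaps;kind=word;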

8.1.2 Token Types

In the default set of rules, the following kinds of Token and SpaceToken are possible:

Word

A word is defined as any set of contiguous upper or lowercase letters, including a hyphen (but no other forms of punctuation). A word also has the attribute “orth”, for which four values are defined:

• upperInitial - initial letter is uppercase, rest are lowercase

• allCaps - all uppercase letters

• lowerCase - all lowercase letters

• mixedCaps - any mixture of upper and lowercase letters not included in the above categories

Number

A number is defined as any combination of consecutive digits. There are no subdivisions of numbers.

Symbol

Two types of symbol are defined: currency symbol (e.g. ‘$’, ‘£’) and symbol (e.g. ‘&’, ‘ˆ’). These are represented by any number of consecutive currency or other symbols (respectively).

Punctuation

Three types of punctuation are defined: start punctuation (e.g. ‘(’), end punctuation (e.g. ‘)’), and other punctuation (e.g. ‘:’). Each punctuation symbol is a separate token.

SpaceToken

White spaces are divided into two types of SpaceToken - space and control - according to whether they are pure space characters or control characters. Any contiguous (and homogeneous) set of space or control characters is defined as a SpaceToken.

The above description applies to the default tokeniser. However, alternative tokenisers can be created if necessary. The choice of tokeniser is then determined at the time of text processing.

8.1.3 English Tokeniser

The English Tokeniser is a processing resource that comprises a normal tokeniser and a JAPE transducer (see chapter 7). The transducer has the role of adapting the generic output of the tokeniser to the requirements of the English part-of-speech tagger. One such adaptation is the joining together in one token of constructs like “’30s”, “’Cause”, “’em”, “’N”, “’S”, “’s”, “’T”, “’d”, “’ll”, “’m”, “’re”, “’til”, “’ve”, etc. Another task of the JAPE transducer is to convert negative constructs like “don’t” from three tokens (“don”, “’” and “t”) into two tokens (“do” and “n’t”).

The English Tokeniser should always be used on English texts that need to be processed afterwards by the POS Tagger.

8.2 Gazetteer

The gazetteer lists used are plain text files, with one entry per line. Each list represents a set of names, such as names of cities, organisations, days of the week, etc.

Below is a small section of the list for units of currency:

Ecu
European Currency Units
FFr
Fr
German mark
German marks
New Taiwan dollar
New Taiwan dollars
NT dollar
NT dollars

An index file (lists.def) is used to access these lists; for each list, a major type is specified and, optionally, a minor type2. In the example below, the first column refers to the list name, the second column to the major type, and the third to the minor type. These lists are compiled into finite state machines. Any text tokens that are matched by these machines will be annotated with features specifying the major and minor types. Grammar rules then specify the types to be identified in particular circumstances. Each gazetteer list should reside in the same directory as the index file.

currency_prefix.lst:currency_unit:pre_amount
currency_unit.lst:currency_unit:post_amount
date.lst:date:specific
day.lst:date:day

So, for example, if a specific day needs to be identified, the minor type “day” should be specified in the grammar, in order to match only information about specific days; if any kind of date needs to be identified, the major type “date” should be specified, to enable tokens annotated with any information about dates to be identified. More information about this can be found in the following section.

2it is also possible to include a language in the same way, where lists for different languages are used, though ANNIE is only concerned with monolingual recognition

In addition, the gazetteer allows arbitrary feature values to be associated with particular entries in a single list. ANNIE does not use this capability, but to enable it for your own gazetteers, set the optional gazetteerFeatureSeparator parameter to a single character (or an escape sequence such as \t or \uNNNN) when creating a gazetteer. In this mode, each line in a .lst file can have feature values specified. For example, with the following entry in the index file:

software_company.lst:company:software

the following software_company.lst:

Red Hat&stockSymbol=RHAT
Apple Computer&abbrev=Apple&stockSymbol=AAPL
Microsoft&abbrev=MS&stockSymbol=MSFT

and gazetteerFeatureSeparator set to &, the gazetteer will annotate Red Hat as a Lookup with features majorType=company, minorType=software and stockSymbol=RHAT. Note that you do not have to provide the same features for every line in the file; in particular it is possible to provide extra features for some lines in the list but not others.

Here is a full list of the parameters used by the Default Gazetteer:

Init-time parameters

listsURL A URL pointing to the index file (usually lists.def) that contains the list of pattern lists.

encoding The character encoding to be used while reading the pattern lists.

gazetteerFeatureSeparator The character used to add arbitrary features to gazetteer entries. See above for an example.

caseSensitive Should the gazetteer be case sensitive during matching.

Run-time parameters

document The document to be processed.

annotationSetName The name of the annotation set where the resulting Lookup annotations will be created.

wholeWordsOnly Should the gazetteer only match whole words? If set to true, a string segment in the input document will only be matched if it is bordered by characters that are not letters, non spacing marks, or combining spacing marks (as identified by the Unicode standard).

longestMatchOnly Should the gazetteer only match the longest possible string starting from any position? This parameter is only relevant when the list of lookups contains proper prefixes of other entries (e.g. when both “Dell” and “Dell Europe” are in the lists). The default behaviour (when this parameter is set to true) is to only match the longest entry, “Dell Europe” in this example. This has been the default GATE gazetteer behaviour since version 2.0. Setting this parameter to false will cause the gazetteer to match all possible prefixes.
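For embedded use, the gazetteer can be created through the Factory in the usual CREOLE way. The following is a minimal sketch, assuming GATE has been initialised (Gate.init()) and the ANNIE plugin loaded; the lists.def path is a placeholder:

gate.FeatureMap params = gate.Factory.newFeatureMap();
params.put("listsURL", new java.net.URL("file:/path/to/lists.def")); // placeholder path
params.put("caseSensitive", Boolean.TRUE);
gate.ProcessingResource gaz = (gate.ProcessingResource)
    gate.Factory.createResource("gate.creole.gazetteer.DefaultGazetteer", params);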

8.3 Sentence Splitter

The sentence splitter is a cascade of finite-state transducers which segments the text into sentences. This module is required for the tagger. The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.

Each sentence is annotated with the type Sentence. Each sentence break (such as a full stop) is also given a “Split” annotation. This has several possible types: “.”, “punctuation”, “CR” (a line break) or “multi” (a series of punctuation marks such as “?!?!”).

The sentence splitter is domain and application-independent.

8.4 Part of Speech Tagger

The tagger [Hepple 00] is a modified version of the Brill tagger, which produces a part-of-speech tag as an annotation on each word or symbol. The list of tags used is given in Appendix D. The tagger uses a default lexicon and ruleset (the result of training on a large corpus taken from the Wall Street Journal). Both of these can be modified manually if necessary. Two additional lexicons exist - one for texts in all uppercase (lexicon cap), and one for texts in all lowercase (lexicon lower). To use these, the default lexicon should be replaced with the appropriate lexicon at load time. The default ruleset should still be used in this case.

The ANNIE Part-of-Speech tagger requires the following parameters.

• encoding - encoding to be used for reading rules and lexicons (init-time)

• lexiconURL - The URL for the lexicon file (init-time)

• rulesURL - The URL for the ruleset file (init-time)

• document - The document to be processed (run-time)

• inputASName - The name of the annotation set used for input (run-time)

• outputASName - The name of the annotation set used for output (run-time). This is an optional parameter. If the user does not provide any value, new annotations are created in the default annotation set.

• baseTokenAnnotationType - The name of the annotation type that refers to Tokens in a document (run-time, default = Token)

• baseSentenceAnnotationType - The name of the annotation type that refers to Sentences in a document (run-time, default = Sentences)

• outputAnnotationType - POS tags are added as category features on the annotations of type “outputAnnotationType” (run-time, default = Token)

If (inputASName == outputASName) AND (outputAnnotationType == baseTokenAnnotationType), then new features are added on the existing annotations of type “baseTokenAnnotationType”. Otherwise, the tagger searches for an annotation of type “outputAnnotationType” under the “outputASName” annotation set that has the same offsets as the annotation of type “baseTokenAnnotationType”. If it succeeds, it adds a new feature on the found annotation; otherwise, it creates a new annotation of type “outputAnnotationType” under the “outputASName” annotation set.
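As with the gazetteer, the tagger can be created through the Factory when embedding GATE. A minimal sketch, assuming GATE is initialised and the ANNIE plugin loaded (the file paths are placeholders to be replaced with locations from your installation):

gate.FeatureMap params = gate.Factory.newFeatureMap();
params.put("lexiconURL", new java.net.URL("file:/path/to/lexicon_cap")); // placeholder
params.put("rulesURL", new java.net.URL("file:/path/to/ruleset"));       // placeholder
gate.ProcessingResource tagger = (gate.ProcessingResource)
    gate.Factory.createResource("gate.creole.POSTagger", params);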

8.5 Semantic Tagger

ANNIE’s semantic tagger is based on the JAPE language – see chapter 7. It contains rules which act on annotations assigned in earlier phases, in order to produce outputs of annotated entities.

8.6 Orthographic Coreference (OrthoMatcher)

(Note: this component was previously known as a “NameMatcher”.)

The Orthomatcher module adds identity relations between named entities found by the semantic tagger, in order to perform coreference. It does not find new named entities as such, but it may assign a type to an unclassified proper name, using the type of a matching name.

The matching rules are only invoked if the names being compared are both of the same type, i.e. both already tagged as (say) organisations, or if one of them is classified as ‘unknown’. This prevents a previously classified name from being recategorised.

8.6.1 GATE Interface

Input – entity annotations, with an id attribute.

Output – matches attributes added to the existing entity annotations.

8.6.2 Resources

A lookup table of aliases is used to record non-matching strings which represent the same entity, e.g. “IBM” and “Big Blue”, “Coca-Cola” and “Coke”. There is also a table of spurious matches, i.e. matching strings which do not represent the same entity, e.g. “BT Wireless” and “BT Cellnet” (which are two different organizations). The list of tables to be used is a load time parameter of the orthomatcher: a default list is set but can be changed as necessary.

8.6.3 Processing

The wrapper builds an array of the strings, types and IDs of all name annotations, which is then passed to a string comparison function for pairwise comparisons of all entries.

8.7 Pronominal Coreference

The pronominal coreference module performs anaphora resolution using the JAPE grammar formalism. Note that this module is not automatically loaded with the other ANNIE mod- ules, but can be loaded separately as a Processing Resource. The main module consists of three submodules:

• quoted text module

• pleonastic it module

• pronominal resolution module

The first two modules are helper submodules for the pronominal one, because they do not perform anything related to coreference resolution except the location of quoted fragments and pleonastic it occurrences in text. They generate temporary annotations which are used by the pronominal submodule (such temporary annotations are removed later).

The main coreference module can operate successfully only if all ANNIE modules were already executed. The module depends on the following annotations created by the respective ANNIE modules:

• Token (English Tokenizer)

• Sentence (Sentence Splitter)

• Split (Sentence Splitter)

• Location (NE Transducer, OrthoMatcher)

• Person (NE Transducer, OrthoMatcher)

• Organization (NE Transducer, OrthoMatcher)

For each pronoun (anaphor) the coreference module generates an annotation of type “Coreference” containing two features:

• antecedent offset - this is the offset of the starting node for the annotation (entity) which is proposed as the antecedent, or null if no antecedent can be proposed.

• matches - this is a list of annotation IDs that make up the coreference chain containing this anaphor/antecedent pair.

8.7.1 Quoted Speech Submodule

The quoted speech submodule identifies quoted fragments in the text being analysed. The identified fragments are used by the pronominal coreference submodule for the proper resolution of pronouns such as I, me, my, etc. which appear in quoted speech fragments. The module produces “Quoted Text” annotations.

The submodule itself is a JAPE transducer which loads a JAPE grammar and builds an FSM over it. The FSM is intended to match the quoted fragments and generate appropriate annotations that will be used later by the pronominal module.

The JAPE grammar consists of only four rules, which create temporary annotations for all punctuation marks that may enclose quoted speech, such as ", ', “, etc. These rules then try to identify fragments enclosed by such punctuation. Finally all temporary annotations generated during the processing, except the ones of type “Quoted Text”, are removed (because no other module will need them later).

8.7.2 Pleonastic It submodule

The pleonastic it submodule matches pleonastic occurrences of “it”. Similar to the quoted speech submodule, it is a JAPE transducer operating with a grammar containing patterns that match the most commonly observed pleonastic it constructs.

8.7.3 Pronominal Resolution Submodule

The main functionality of the coreference resolution module is in the pronominal resolution submodule. This uses the result from the execution of the quoted speech and pleonastic it submodules. The module works according to the following algorithm:

• Preprocess the current document. This step locates the annotations that the submodule needs (such as Sentence, Token, Person, etc.) and prepares the appropriate data structures for them.

• For each pronoun do the following:

 - inspect the appropriate context for all candidate antecedents for this kind of pronoun;

 - choose the best antecedent (if any);

• Create the coreference chains from the individual anaphor/antecedent pairs and the coreference information supplied by the OrthoMatcher (this step is performed by the main coreference module).

8.7.4 Detailed description of the algorithm

Full details of the pronominal coreference algorithm are as follows.

Preprocessing

The preprocessing task includes the following subtasks:

• Identifying the sentences in the document being processed. The sentences are identified with the help of the Sentence annotations generated by the Sentence Splitter. For each sentence a data structure is prepared that contains three lists. The lists contain the annotations for the person/organization/location named entities appearing in the sentence. The named entities in the sentence are identified with the help of the Person, Location and Organization annotations that are already generated by the Named Entity Transducer and the OrthoMatcher.

• The gender of each person in the sentence is identified and stored in a global data structure. It is possible that the gender information is missing for some entities - for example, if only the person's family name is observed then the Named Entity transducer will be unable to deduce the gender. In such cases the list of matching entities generated by the OrthoMatcher is inspected, and if any of the orthographic matches contains gender information it is assigned to the entity being processed.

• The identified pleonastic it occurrences are stored in a separate list. The “Pleonastic It” annotations generated by the pleonastic submodule are used for the task.

• For each quoted text fragment, identified by the quoted text submodule, a special structure is created that contains the persons and the 3rd person singular pronouns such as “he” and “she” that appear in the sentence containing the quoted text, but not in the quoted text span (i.e. the ones preceding and succeeding the quote).

Pronoun resolution

This task includes the following subtasks: retrieving all the pronouns in the document. Pronouns are represented as annotations of type “Token” with the feature “category” having the value “PRP” or “PRP$”. The latter classifies possessive adjectives such as my, your, etc., and the former personal and reflexive pronouns. The two types of pronoun are combined in one list and sorted according to their offset in the text.

For each pronoun in the list the following actions are performed:

• If the pronoun is it, then a check is performed to see whether this is a pleonastic occurrence; if so, no further attempt at resolution is made.

• The proper context is determined. The context size is expressed in the number of sentences it will contain. The context always includes the current sentence (the one containing the pronoun), the preceding sentence and zero or more preceding sentences.

• Depending on the type of pronoun, a set of candidate antecedents is proposed. The candidate set includes the named entities that are compatible with this pronoun. For example, if the current pronoun is she then only the Person annotations with the “gender” feature equal to “female” or “unknown” will be considered as candidates.

• From all candidates, one is chosen according to evaluation criteria specific to the pronoun.

Coreference chain generation

This step is actually performed by the main module. After executing each of the submodules on the current document, the coreference module follows these steps:

• Retrieves the anaphor/antecedent pairs generated by them.

• For each pair, the orthographic matches (if any) of the antecedent entity are retrieved and then extended with the anaphor of the pair (i.e. the pronoun). The result is the coreference chain for the entity. The coreference chain contains the IDs of the annotations (entities) that co-refer.

• A new Coreference annotation is created for each chain. The annotation contains a single feature “matches” whose value is the coreference chain (the list of IDs). The annotations are exported in a pre-specified annotation set.

The resolution for she, her, her$, he, him, his, herself and himself is similar, because the analysis of the corpus showed that these pronouns are related to their antecedents in a similar manner. The characteristics of the resolution process are:

• The context inspected is not very big - cases where the antecedent is found more than 3 sentences further back than the anaphor are rare.

• The recency factor is heavily used - candidate antecedents that appear closer to the anaphor in the text are scored better.

• Anaphora have higher priority than cataphora. If there is an anaphoric candidate and a cataphoric one, then the anaphoric one is preferred, even if the recency factor scores the cataphoric candidate better.

The resolution process performs the following steps:

• Inspect the context of the anaphor for candidate antecedents. Every Person annotation is considered a candidate. Cases where she/her refers to an inanimate entity (a ship, for example) are not handled.

• For each candidate perform a gender compatibility check - only candidates having a “gender” feature equal to “unknown” or compatible with the pronoun are considered for further evaluation.

• Evaluate each candidate against the best candidate so far. If the two candidates are anaphoric for the pronoun then choose the one that appears closer. The same holds for the case where the two candidates are cataphoric relative to the pronoun. If one is anaphoric and the other is cataphoric then choose the former, even if the latter appears closer to the pronoun.

Resolution of it, its, itself

This set of pronouns also shares many common characteristics. The resolution process contains certain differences from the one for the previous set of pronouns. Successful resolution for it, its, itself is more difficult because of the following factors:

• There is no gender compatibility restriction. In the case when there are several candidates in the context, the gender compatibility restriction is very useful for rejecting some of the candidates. When no such restriction exists, and with the lack of any syntactic or ontological information about the entities in the context, the recency factor plays the major role in choosing the best antecedent.

• The number of nominal antecedents (i.e. entities that are referred to not by name) is much higher compared to the number of such antecedents for she, he, etc. In this case, trying to find an antecedent only amongst named entities degrades precision considerably.

Resolution of I, me, my, myself

Resolution of these pronouns is dependent on the work of the quoted speech submodule. One important difference from the resolution process of other pronouns is that the context is not measured in sentences but depends solely on the quote span. Another difference is that the context is not contiguous - the quoted fragment itself is excluded from the context, because it is unlikely that an antecedent for I, me, etc. appears there. The context itself consists of:

• the part of the sentence where the quoted fragment originates, that is not contained in the quote - i.e. the text prior to the quote;

• the part of the sentence where the quoted fragment ends, that is not contained in the quote - i.e. the text following the quote;

• the part of the sentence preceding the sentence where the quote originates, which is not included in another quote.

It is worth noting that, contrary to other pronouns, the antecedent for I, me, my and myself is most often cataphoric or, if anaphoric, it is not in the same sentence as the quoted fragment.

The resolution algorithm consists of the following steps:

• Locate the quoted fragment description that contains the pronoun. If the pronoun is not contained in any fragment then return without proposing an antecedent.

• Inspect the context of the quoted fragment (as defined above) for candidate antecedents. Annotations of type Pronoun, or annotations of type Token with features category = “PRP”, string = “she” or category = “PRP”, string = “he”, are considered as candidates.

• Try to locate a candidate in the text succeeding the quoted fragment (first pattern). If more than one candidate is present, choose the closest to the end of the quote. If a candidate is found then propose it as antecedent and exit.

• Try to locate a candidate in the text preceding the quoted fragment (third pattern). Choose the closest one to the beginning of the quote. If one is found then set it as the antecedent and exit.

• Try to locate antecedents in the unquoted part of the sentence preceding the sentence where the quote starts (second pattern). Give preference to the one closest to the end of the quote (if any) in the preceding sentence or closest to the sentence beginning.

8.8 A Walk-Through Example

Let us take an example of a 3-stage procedure using the tokeniser, gazetteer and named-entity grammar. Suppose we wish to recognise the phrase “800,000 US dollars” as an entity of type “Number”, with the feature “money”.

First of all, we give an example of a grammar rule (and corresponding macros) for money, which would recognise this type of pattern.

Macro: MILLION_BILLION
(
 {Token.string == "m"} |
 {Token.string == "million"} |
 {Token.string == "b"} |
 {Token.string == "billion"}
)

Macro: AMOUNT_NUMBER
(
 {Token.kind == number}
 (({Token.string == ","} |
   {Token.string == "."})
  {Token.kind == number})*
 (({SpaceToken.kind == space})?
  (MILLION_BILLION)?)
)

Rule: Money1 // e.g. 30 pounds
(
 (AMOUNT_NUMBER)
 ({SpaceToken.kind == space})?
 ({Lookup.majorType == currency_unit})
):money
-->
:money.Number = {kind = "money", rule = "Money1"}

8.8.1 Step 1 - Tokenisation

The tokeniser separates this phrase into the following tokens. In general, a word is comprised of any number of letters of either case, including a hyphen, but nothing else; a number is composed of any sequence of digits; punctuation is recognised individually (each character is a separate token), and any number of consecutive spaces and/or control characters are recognised as a single spacetoken.

Token, string = "800", kind = number, length = 3
Token, string = ",", kind = punctuation, length = 1
Token, string = "000", kind = number, length = 3
SpaceToken, string = " ", kind = space, length = 1
Token, string = "US", kind = word, length = 2, orth = allCaps
SpaceToken, string = " ", kind = space, length = 1
Token, string = "dollars", kind = word, length = 7, orth = lowercase

8.8.2 Step 2 - List Lookup

The gazetteer lists are then searched to find all occurrences of matching words in the text. It finds the following match for the string “US dollars”:

Lookup, minorType = post_amount, majorType = currency_unit

8.8.3 Step 3 - Grammar Rules

The grammar rule for money is then invoked. The macro MILLION_BILLION recognises any of the strings "m", "million", "b", "billion". Since none of these exist in the text, it passes on to the next macro. The AMOUNT_NUMBER macro recognises a number, optionally followed by any number of sequences of the form "dot or comma plus number", followed by an optional space and an optional MILLION_BILLION. In this case, "800,000" will be recognised. Finally, the rule Money1 is invoked. This recognises the string identified by the AMOUNT_NUMBER macro, followed by an optional space, followed by a unit of currency (as determined by the gazetteer). In this case, "US dollars" has been identified as a currency unit, so the rule Money1 recognises the entire string "800,000 US dollars". Following the rule, it will be annotated as a Number entity with the feature kind = "money":

Number, kind = money, rule = Money1

Chapter 9

(More CREOLE) Plugins

For the previous reader was none other than myself. I had already read this book long ago. The old sickness has me in its grip again: amnesia in litteris, the total loss of literary memory. I am overcome by a wave of resignation at the vanity of all striving for knowledge, all striving of any kind. Why read at all? Why read this book a second time, since I know that very soon not even a shadow of a recollection will remain of it? Why do anything at all, when all things fall apart? Why live, when one must die? And I clap the lovely book shut, stand up, and slink back, vanquished, demolished, to place it again among the mass of anonymous and forgotten volumes lined up on the shelf. ... But perhaps - I think, to console myself - perhaps reading (like life) is not a matter of being shunted on to some track or abruptly off it. Maybe reading is an act by which consciousness is changed in such an imperceptible manner that the reader is not even aware of it. The reader suffering from amnesia in litteris is most definitely changed by his reading, but without noticing it, because as he reads, those critical faculties of his brain that could tell him that change is occurring are changing as well. And for one who is himself a writer, the sickness may conceivably be a blessing, indeed a necessary precondition, since it protects him against that crippling awe which every great work of literature creates, and because it allows him to sustain a wholly uncomplicated relationship to plagiarism, without which nothing original can be created.

Three Stories and a Reflection, Patrick Suskind, 1995 (pp. 82, 86).

This chapter describes additional CREOLE resources which do not form part of ANNIE.


9.1 Document Reset

The document reset resource enables the document to be reset to its original state, by removing all the annotation sets and their contents, apart from the one containing the document format analysis (Original Markups). An optional parameter, keepOriginalMarkupsAS, allows users to decide whether or not to keep the Original Markups AS while resetting the document. This resource is normally added to the beginning of an application, so that a document is reset before an application is rerun on that document.
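A minimal sketch of placing this PR at the head of an embedded pipeline follows. The class name gate.creole.annotdelete.AnnotationDeletePR and the keepOriginalMarkupsAS parameter are taken to be those shipped with the ANNIE distribution; treat them as assumptions and check the plugin's creole.xml if they differ in your version.

import gate.Factory;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

public class ResetFirst {
    public static SerialAnalyserController buildPipeline() throws Exception {
        // Create the Document Reset PR and ask it to keep Original Markups.
        ProcessingResource reset = (ProcessingResource) Factory.createResource(
            "gate.creole.annotdelete.AnnotationDeletePR");
        reset.setParameterValue("keepOriginalMarkupsAS", Boolean.TRUE);

        // Add it first, so every rerun starts from a clean document.
        SerialAnalyserController pipeline = (SerialAnalyserController)
            Factory.createResource("gate.creole.SerialAnalyserController");
        pipeline.add(reset);
        return pipeline;
    }
}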

9.2 Verb Group Chunker

The rule-based verb chunker is based on a number of grammars of English [Cobuild 99, Azar 89]. We have developed 68 rules for the identification of non-recursive verb groups. The rules cover finite ('is investigating'), non-finite ('to investigate'), participles ('investigated'), and special verb constructs ('is going to investigate'). All the forms may include adverbials and negatives. The rules have been implemented in JAPE. The finite state analyser produces an annotation of type 'VG' with features and values that encode syntactic information ('type', 'tense', 'voice', 'neg', etc.). The rules use the output of the POS tagger as well as information about the identity of the tokens (e.g. the token 'might' is used to identify modals).

The grammar for verb group identification can be loaded as a JAPE grammar into the GATE architecture and can be used in any application: the module is domain independent.

9.3 Noun Phrase Chunker

The NP Chunker application is a Java implementation of the Ramshaw and Marcus BaseNP chunker (in fact the files in the resources directory are taken straight from their original distribution), which attempts to insert brackets marking noun phrases in text that has been marked with POS tags in the same format as the output of Eric Brill's transformational tagger. The output from this version should be identical to the output of the original C++ version released by Ramshaw and Marcus.

For more information about baseNP structures and the use of transformation-based learning to derive them, see [Ramshaw & Marcus 95].

9.3.1 Differences from the Original

The major difference is that the assumption is made that if a POS tag is not in the mapping file then it is tagged as 'I'. The original version simply failed if an unknown POS tag was encountered. When using the GATE wrapper, the chunk tag can be changed from 'I' to any other legal tag (B or O).

9.3.2 Using the Chunker

The Chunker requires the Creole plugin "NP Chunking" to be loaded. The two loadtime parameters are simply URLs pointing at the POS tag dictionary and the rules file; these should be set automatically.

The chunker requires the following PRs to have been run first: tokeniser, sentence splitter, POS tagger.

9.4 OntoText Gazetteer

The OntoText Gazetteer is a Natural Gazetteer, implemented by OntoText Lab (http://www.ontotext.com/). Its implementation is based on simple lookup in several java.util.HashMap objects, and is inspired by the strange idea of Atanas Kiryakov that searching in HashMaps will be faster than a search in a Finite State Machine (FSM).

Here follows a description of the algorithm that lies behind this implementation:

Every phrase, i.e. every list entry, is separated into several parts, determined by the whitespace between them; e.g. the phrase "form is emptiness" has three parts: form, is & emptiness. There is also a list of HashMaps, mapsList, which has as many elements as the longest phrase in the lists (in terms of "count of parts"). So the first part of a phrase is placed in the first map; the first part + space + second part is placed in the second map, etc. The full phrase is placed in the appropriate map, and a reference to a Lookup object is attached to it.

At first sight it seems that this algorithm is certainly much more memory-consuming than a finite state machine (FSM) with the parts of the phrases as transitions, but this is actually not so important since the average length of the phrases (in parts) in the lists is 1.1. On the other hand, one advantage of the algorithm is that, although unconventional, on average it takes four times less memory and works three times faster than an optimised FSM implementation.

The lookup part is implemented in execute(), so a lot of tokenisation takes place there. After defining the candidates for phrase parts, we build a candidate phrase and try to look it up in the maps (which map again depends on the count of parts in the current candidate phrase).
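To make the multi-map idea concrete, here is a schematic sketch in Java (not OntoText's actual code): mapsList.get(i) holds every phrase of exactly i+1 whitespace-separated parts, mapped to its lookup information, represented here simply as a String label.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PhraseMaps {
    final List<Map<String, String>> mapsList = new ArrayList<>();

    // Store a phrase in the map chosen by its number of parts.
    void addPhrase(String phrase, String lookupInfo) {
        String[] parts = phrase.trim().split("\\s+");
        while (mapsList.size() < parts.length) mapsList.add(new HashMap<>());
        mapsList.get(parts.length - 1).put(String.join(" ", parts), lookupInfo);
    }

    // Look a candidate phrase up in the map selected by its part count;
    // returns null when the phrase is not in the lists.
    String lookup(String candidate) {
        String[] parts = candidate.trim().split("\\s+");
        if (parts.length > mapsList.size()) return null;
        return mapsList.get(parts.length - 1).get(String.join(" ", parts));
    }
}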

9.4.1 Prerequisites

The phrases to be recognised should be listed in a set of files, one for each type of occurrence (as for the standard gazetteer).

The gazetteer is built with the information from a file that contains the set of lists (which are files as well) and the associated type for each list. The file defining the set of lists should have the following syntax: each list definition should be written on its own line and should contain:

• the file name (required)

• the major type (required)

• the minor type (optional)

• the language(s) (optional)

The elements of each definition are separated by ":". The following is an example of a valid definition:

personmale.lst:person:male:english

Each file named in the lists definition file is just a list containing one entry per line.

When this gazetteer is run over some input text (a GATE document) it will generate annotations of type Lookup having the attributes specified in the definition file.

9.4.2 Setup

In order to use this gazetteer from within GATE, an entry for it must reside in the creole setup file (creole.xml). The entry declares the resource name (OntoText Gazetteer) and its class (com.ontotext.gate.gazetteer.NaturalGazetteer), together with a comment describing it as a list lookup component and pointing to the documentation (www.ontotext.com/gate/gazetteer/documentation/index.html) and the licence information (www.ontotext.com/gate/gazetteer/documentation/licence.ontotext.html, or licence.ontotext.html in the lib folder of GATE). It also declares the resource's parameters, of types gate.Document, java.lang.String, java.net.URL, java.lang.String and java.lang.Boolean, and its icon, shefGazetteer.gif; consult the plugin's creole.xml for the exact parameter names.

9.5 Flexible Gazetteer

The Flexible Gazetteer provides users with the flexibility to choose their own customized input and an external Gazetteer. For example, the user might want to replace words in the text with their base forms (which is an output of the Morphological Analyser) or to segment a Chinese text (using the Chinese Tokeniser) before running the Gazetteer on the Chinese text.

The Flexible Gazetteer performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. It is important to use an external gazetteer as this allows the use of any type of gazetteer (e.g. an Ontological gazetteer).

Input to the Flexible Gazetteer:

Runtime parameters:

• Document – the document to be processed

• inputAnnotationSetName – the annotation set where the Flexible Gazetteer should search for the AnnotationType.feature specified in inputFeatureNames.

• outputAnnotationSetName – the annotation set where Lookup annotations should be placed.

Creation time parameters:

• inputFeatureNames – when selected, these feature values are used to replace the corresponding original text. A temporary document is created from the values of the specified features on the specified annotation types. For example: for Token.string the temporary document will have the same content as the original one but all the SpaceToken annotations will have been replaced by single spaces.

• gazetteerInst – the actual gazetteer instance, which should run over a temporary document. This generates the Lookup annotations with features. This must be an instance of gate.creole.gazetteer.Gazetteer which has already been created.

Once the external gazetteer has annotated text with Lookup annotations, Lookup annotations on the temporary document are converted to Lookup annotations on the original document. Finally the temporary document is deleted.
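A hedged sketch of creating a Flexible Gazetteer programmatically follows, performing lookup over Token.root (the morphological root). The resource class name gate.creole.gazetteer.FlexibleGazetteer is an assumption; check the plugin's creole.xml for the exact value in your GATE version.

import gate.Factory;
import gate.FeatureMap;
import gate.ProcessingResource;
import gate.creole.ResourceInstantiationException;
import gate.creole.gazetteer.Gazetteer;
import java.util.Arrays;

public class FlexGazExample {
    static ProcessingResource create(Gazetteer existingGazetteer)
            throws ResourceInstantiationException {
        FeatureMap params = Factory.newFeatureMap();
        // Replace the text with the value of Token.root before lookup.
        params.put("inputFeatureNames", Arrays.asList("Token.root"));
        // The external gazetteer instance must already have been created.
        params.put("gazetteerInst", existingGazetteer);
        return (ProcessingResource) Factory.createResource(
            "gate.creole.gazetteer.FlexibleGazetteer", params);
    }
}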

9.6 Gazetteer List Collector

The gazetteer list collector collects occurrences of entities directly from a set of annotated training texts, and populates gazetteer lists with the entities. The entity types and structure of the gazetteer lists are defined as necessary by the user. Once the lists have been collected, a semantic grammar can be used to find the same entities in new texts.

An empty list must be created first for each annotation type, if no list exists already. The set of lists must be loaded into GATE before the PR can be run. If a list already exists, the list will simply be augmented with any new entries. The list collector will only collect one occurrence of each entry: it first checks that the entry is not present already before adding a new one.

There are 4 runtime parameters:

• annotationTypes: a list of the annotation types that should be collected

• gazetteer: the gazetteer where the results will be stored (this must be already loaded in GATE)

• markupASname: the annotation set from which the annotation types should be collected

• theLanguage: sets the language feature of the gazetteer lists to be created to the appropriate language (in the case where lists are collected for different languages)

Figure 9.1 shows a screenshot of a set of lists collected automatically for the Hindi language. It contains 4 lists: Person, Organisation, Location and a list of stopwords. Each list has a majorType whose value is the type of list, a minorType ”inferred” (since the lists have been inferred from the text), and the language ”Hindi”.

Figure 9.1: Lists collected automatically for Hindi

The list collector also has a facility to split the Person names that it collects into their individual tokens, so that it adds both the entire name to the list, and adds each of the tokens to the list (i.e. each of the first names, and the surname) as a separate entry. When the grammar annotates Persons, it can require them to be at least 2 tokens or 2 consecutive Person Lookups. In this way, new Person names can be recognised by combining a known first name with a known surname, even if they were not in the training corpus. Where only a single token is found that matches, an Unknown entity is generated, which can later be matched with an existing longer name via the orthomatcher component which performs orthographic coreference between named entities. This same procedure can also be used for other entity types. For example, parts of Organisation names can be combined together in different ways. The facility for splitting Person names is hardcoded in the file gate/src/gate/creole/GazetteerListsCollector.java and is commented.

9.7 Tree Tagger

The TreeTagger is a language-independent part-of-speech tagger, which currently supports English, French, German, Spanish, Italian and Bulgarian (although the latter two are not available in GATE). It is integrated with GATE using a GATE CREOLE wrapper, originally designed by the CLaC lab (Computational Linguistics at Concordia), Concordia University, Montreal (http://www.cs.concordia.ca/CLAC).

The GATE wrapper calls the TreeTagger as an external program, passing GATE Tokens as input, and adds two new features to them, as described below:

• Features of the TreeTaggerToken:

i) category: the part-of-speech tag of the token;
ii) lemma: the lemma of the token.

• Runtime parameters:

i) document: the document to be processed.
ii) treeTaggerBinary: a URL indicating the location of a (language-specific) GATE TreeTagger wrapper shell script. Note that the scripts used by GATE are different from the original TreeTagger scripts (in cmd), since the latter perform their own tokenisation, whereas the GATE scripts rely on Token annotations as they have been computed by a Tokeniser component. The GATE scripts reside in plugins/TreeTagger/resources. Currently available are command scripts for German, French, and Spanish.
iii) encoding: the character encoding to use when passing data to and from the tagger. This must be ISO-8859-1 to work with the standard TreeTagger distribution – do not change it unless you know what you are doing.
iv) failOnUnmappableChar: what to do if a character is encountered in the document which cannot be represented in the selected encoding. If the parameter is true (the default), unmappable characters cause the wrapper to throw an exception and fail. If set to false, unmappable characters are replaced by question marks when the document is passed to the tagger. This is useful if your documents are largely OK but contain the odd character from outside the Latin-1 range.

• Requirement: The TreeTagger, which is available from http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html, must be correctly installed on the same machine as GATE. It must be installed in a directory that does not contain any spaces in its path, otherwise the scripts will fail. Once the TreeTagger is installed, the first two lines of the shell script may need to be modified to indicate the installed location of the bin and lib directories of the tagger, as shown below:

# THESE VARIABLES HAVE TO BE SET:
BIN=/usr/local/clactools/TreeTagger/bin
LIB=/usr/local/clactools/TreeTagger/lib

The TreeTagger plugin works on any platform that supports the tree tagger tool, including Linux, Mac OS X and Windows, but the GATE-specific scripts require a POSIX-style Bourne shell with the gawk, tr and grep commands, plus Perl for the Spanish tagger. For Windows this means that you will need to install the appropriate parts of the Cygwin environment from http://www.cygwin.com and set the system property treetagger.sh.path to let the plugin know where to find your sh.exe (typically C:\cygwin\bin\sh.exe). When running in the GUI you can do this via a build.properties file in the GATE home directory (see section 3.3). Without this property, the plugin will attempt to run the shell script directly, which should work on Linux and Mac OS X. However you can still use treetagger.sh.path on these platforms if you wish to use a shell other than /bin/sh, e.g. ksh.
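For embedded (non-GUI) use, the same property can be set programmatically before GATE is initialised, as in this small sketch. The property name comes from the text above; the Cygwin path is only an example.

import gate.Gate;

public class TreeTaggerSetup {
    public static void main(String[] args) throws Exception {
        // Must be set before the TreeTagger wrapper runs; the value is an
        // example Windows path to the Cygwin Bourne shell.
        System.setProperty("treetagger.sh.path", "C:\\cygwin\\bin\\sh.exe");
        Gate.init(); // initialise GATE after the property is in place
    }
}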

Figure 9.2 shows a screenshot of a French document processed with the TreeTagger.

9.7.1 POS tags

For English the POS tagset is a slightly modified version of the Penn Treebank tagset, where the second letter of the tags for verbs distinguishes between ”be” verbs (B), ”have” verbs (H) and other verbs (V).

The tagsets for French, German, Italian, Spanish and Bulgarian can be found in the original TreeTagger documentation at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html.

9.8 Stemmer

The stemmer plugin consists of a set of stemmer PRs for the following 12 European languages: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish. These take the form of wrappers for the Snowball stemmers freely available from http://snowball.tartarus.org. Each Token is annotated with a new feature "stem", with the stem for that word as its value. The stemmers should be run as other PRs, on a document that has been tokenised.
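A small sketch of consuming the result: after the tokeniser and a stemmer PR have run, each Token annotation carries a "stem" feature, which this helper prints alongside the token string. It reads from the default annotation set; adjust if your application writes elsewhere.

import gate.Annotation;
import gate.AnnotationSet;
import gate.Document;
import java.util.Iterator;

class StemDump {
    static void printStems(Document doc) {
        AnnotationSet tokens = doc.getAnnotations().get("Token");
        // Explicit iterator + cast, to stay compatible with pre-generics sets.
        for (Iterator it = tokens.iterator(); it.hasNext();) {
            Annotation tok = (Annotation) it.next();
            Object string = tok.getFeatures().get("string");
            Object stem = tok.getFeatures().get("stem");
            System.out.println(string + " -> " + stem);
        }
    }
}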

9.8.1 Algorithms

The stemmers are based on the Porter stemmer for English [Porter 80], with rules implemented in Snowball, e.g.

define Step_1a as (
    [substring] among (
        'sses' (<-'ss')
        'ies'  (<-'i')
        'ss'   ()
        's'    (delete)
    )
)

9.9 GATE Morphological Analyzer

The Morphological Analyser PR can be found in the Tools plugin. It takes as input a tokenised GATE document. Considering one token and its part-of-speech tag at a time, it identifies the token's lemma and affix. These values are then added as features on the Token annotation. Morpher is based on certain regular expression rules. These rules were originally implemented by Kevin Humphreys in GATE1 in a programming language called Flex. Morpher has the capability to interpret these rules, with an extension allowing users to add new rules or modify the existing ones based on their requirements. In order to allow these operations with as little effort as possible, we changed the way these rules are written. More information on how to write these rules is given in Section 9.9.1.

Two types of parameters, Init-time and run-time, are required to instantiate and execute the PR.

• rulesFile (Init-time) The rule file has several regular expression patterns. Each pattern has two parts, L.H.S. and R.H.S. The L.H.S. defines the regular expression and the R.H.S. the function name to be called when the pattern matches the word under consideration. Please see Section 9.9.1 for more information on the rule file.

• caseSensitive (init-time) By default, all tokens under consideration are converted into lowercase to identify their lemma and affix. If the user sets caseSensitive to true, words are no longer converted into lowercase.

• document (run-time) Here the document must be an instance of a GATE document.

• affixFeatureName Name of the feature that should hold the affix value.

• rootFeatureName Name of the feature that should hold the root value.

• annotationSetName Name of the annotationSet that contains Tokens.

• considerPOSTag Each rule in the rule file has a separate tag, which specifies which rule to consider with which part-of-speech tag. If this option is set to false, all rules are considered and matched with all words. This option is very useful. For example, the word "singing" can be used as a noun as well as a verb. In the case where it is identified as a verb, its lemma would be "sing" and its affix "ing", but otherwise there would not be any affix.

9.9.1 Rule File

GATE provides a default rule file, called default.rul, which is available under the gate/plugins/Tools/morph/resources directory. The rule file has two sections.

1. Variables
2. Rules

Variables

The user can define various types of variables under the section defineVars. These variables can be used as part of the regular expressions in rules. There are three types of variables:

1. Range
With this type of variable, the user can specify a range of characters, e.g. A ==> [-a-z0-9]

2. Set
With this type of variable, the user can specify a set of characters, where one character at a time from this set is used as a value for the given variable. When this variable is used in any regular expression, all values are tried one by one to generate the string which is compared with the contents of the document, e.g. A ==> [abcdqurs09123]

3. Strings
Whereas in the two types explained above a variable can hold only one character from the given set or range at a time, this type allows specifying strings as possibilities for the variable, e.g. A ==> "bb" OR "cc" OR "dd"

Rules

All rules are declared under the section defineRules. Every rule has two parts, LHS and RHS. The LHS specifies the regular expression and the RHS the function to be called when the LHS matches the given word. "==>" is used as the delimiter between the LHS and RHS.

The LHS has the following syntax:

<"*" | "verb" | "noun"> <regular expression>

The user can specify which rules to consider when the word is identified as a "verb" or a "noun"; "*" indicates that the rule should be considered for all part-of-speech tags. Whether the part-of-speech tag is used to decide if a rule should be considered can be enabled or disabled by setting the value of the considerPOSTags option. A combination of any string with any of the variables declared under the defineVars section, and the Kleene operators "+" and "*", can be used to build the regular expressions. Below we give a few examples of L.H.S. expressions.

• "bias"

• "canvas"{ESEDING} – "ESEDING" is a variable defined under the defineVars section. Note: variables are enclosed in "{" and "}".

• ({A}*"metre") – "A" is a variable followed by the Kleene operator "*", which means "A" can occur zero or more times.

• ({A}+"itis") – "A" is a variable followed by the Kleene operator "+", which means "A" can occur one or more times.

• <*>"aches" – "<*>" indicates that the rule should be considered for all part-of-speech tags.

On the RHS of the rule, the user has to specify one of the functions from those listed below. These rules are hard-coded in the Morph PR in GATE and are invoked if the regular expression on the LHS matches with any particular word.

• stem(n, string, affix), where:
  – n = the number of characters to be truncated from the end of the string;
  – string = the string that should be concatenated after the word to produce the root;
  – affix = the affix of the word.

• irreg_stem(root, affix), where:
  – root = the root of the word;
  – affix = the affix of the word.

• null_stem() – this means the word is itself the base form and should not be analysed.

• semi_reg_stem(n, string) – the semi_reg_stem function is used with regular expressions that end with any of the {EDING} or {ESEDING} variables defined under the variable section. If the regular expression matches the given word, this function is invoked, and returns the value of the variable (i.e. {EDING} or {ESEDING}) as the affix. To find the lemma of the word, it removes n characters from the back of the word and adds the string at the end.

9.10 MiniPar Parser

MiniPar is a shallow parser. In its shipped version, it takes one sentence as input and determines the dependency relationships between the words of the sentence. It parses the sentence and extracts information such as:

• the lemma of the word;

• the part of speech of the word;

• the head modified by this word;

• name of the dependency relationship between this word and the head;

• the lemma of the head.

In the version of MiniPar integrated in GATE, it generates annotations of type "DepTreeNode" and annotations of type "[relation]" that exist between the head and the child node. The document is required to have annotations of type "Sentence", where each annotation consists of a string of the sentence.

MiniPar takes one sentence at a time as input and generates tokens of type "DepTreeNode". Later it assigns relations between these tokens. Each DepTreeNode has a feature called "word": this is the actual text of the word.

Annotations of type "[Rel]", where 'Rel' is obj, pred, etc., carry the name of the dependency relationship between the child word and the head word (see Section 9.10.5). Every "[Rel]" annotation is assigned four features:

• child word: this is the text of the child annotation;

• child id: IDs of the annotations which modify the current word (if any).

• head word: this is the text of the head annotation;

• head id: ID of the annotation modified by the child word (if any);

Figure 9.3 shows a MiniPar annotated document in GATE.

9.10.1 Platform Supported

MiniPar in GATE is supported on the Linux and Windows operating systems. Trying to instantiate this PR on any other OS will generate a ResourceInstantiationException.

9.10.2 Resources

MiniPar in GATE is shipped with four basic resources:

• MiniparWrapper.jar: this is a Java wrapper for MiniPar;

• creole.xml: this defines the required parameters for the MiniPar wrapper;

• minipar.linux: this is a modified version of pdemo.cpp.

• minipar-windows.exe: this is a modified version of pdemo.cpp compiled to work on Windows.

9.10.3 Parameters

The MiniPar wrapper takes six parameters:

• annotationTypeName: new annotations are created with this type; the default is "DepTreeNode";

• annotationInputSetName: annotations of Sentence type are provided as an input to MiniPar and are taken from the given annotationSet;

• annotationOutputSetName: All annotations created by Minipar Wrapper are stored under the given annotationOutputSet;

• document: the GATE document to process;

• miniparBinary: location of the MiniPar binary file (i.e. either minipar.linux or minipar-windows.exe; these files are available under the gate/plugins/minipar/ directory);

• miniparDataDir: location of the "data" directory under the installation directory of MiniPar. The default is "%MINIPAR_HOME%/data".

9.10.4 Prerequisites

The MiniPar wrapper requires the MiniPar library to be available on the underlying Lin- ux/Windows machine. It can be downloaded from the MiniPar homepage.

9.10.5 Grammatical Relationships

appo      "ACME president, --appo-> P.W. Buckman"
aux       "should <-aux-- resign"
be        "is <-be-- sleeping"
c         "that <-c-- John loves Mary"
comp1     first complement
det       "the <-det-- hat"
gen       "Jane's <-gen-- uncle"
i         the relationship between a C clause and its I clause
inv-aux   inverted auxiliary: "Will <-inv-aux-- you stop it?"
inv-be    inverted be: "Is <-inv-be-- she sleeping"
inv-have  inverted have: "Have <-inv-have-- you slept"
mod       the relationship between a word and its adjunct modifier
pnmod     post nominal modifier
p-spec    specifier of prepositional phrases
pcomp-c   clausal complement of prepositions
pcomp-n   nominal complement of prepositions
post      post determiner
pre       pre determiner
pred      predicate of a clause
rel       relative clause
vrel      passive verb modifier of nouns
wha, whn, whp  wh-elements at C-spec positions
obj       object of verbs
obj2      second object of ditransitive verbs
subj      subject of verbs
s         surface subject

9.11 RASP Parser

RASP (Robust Accurate Statistical Parsing) is a robust parsing system for English. The GATE wrapper calls RASP as an external program, passing plain text as input to the shell script runall.sh, which invokes the tokeniser/tagger/morph/parser on the raw text. The output of RASP is saved as "RASPToken" annotations and grammatical relation annotations.

• Features of the RASPToken:

i) id: id of the token;
ii) POS: the part-of-speech tag of the token;
iii) Morph: the base word of the token;
iv) content: the string image of the token;
v) length: the length of the token string.

• Grammatical relation annotations:

i) type: grammatical relation type;
ii) head: head token of the relationship;
iii) dependant: dependant token of the relationship;
iv) other parameter: other parameters of the relationship.

• Runtime parameters:

i) document: the document to be processed;
ii) annotationSetName: the annotation set to be used for generated annotations;
iii) raspscript: a URL indicating the location of the RASP shell script runall.sh.

• Platform dependability: RASP is only supported on the Linux operating system. Trying to run it on any other operating system will generate an exception with the message: "The RASP cannot be run on any other operating systems except Linux".

• Requirements: RASP is available from http://www.informatics.susx.ac.uk/research/nlp/rasp/. It must be correctly installed on the same machine as GATE, and must be installed in a directory whose path does not contain any spaces (this is a requirement of the RASP scripts as well as the wrapper). Before trying to run scripts for the first time, edit them to insert the appropriate value for the shell variable RASP, which should be the file system pathname where you have installed the RASP tools.

9.12 SUPPLE Parser (formerly BuChart)

The BuChart parser has been removed and replaced by SUPPLE: The Sheffield University Prolog Parser for Language Engineering. If you have an application which uses BuChart and wish to upgrade to a later version of GATE than 3.1 you must upgrade your application to use SUPPLE.

SUPPLE is a bottom-up parser that constructs syntax trees and logical forms for English sentences. The parser is complete in the sense that every analysis licensed by the grammar is produced. In the current version only the 'best' parse is selected at the end of the parsing process. The English grammar is implemented as an attribute-value context free grammar which consists of subgrammars for noun phrases (NP), verb phrases (VP), prepositional phrases (PP), relative phrases (R) and sentences (S). The semantics associated with each grammar rule allow the parser to produce logical forms composed of unary predicates to denote entities and events (e.g., chase(e1), run(e2)) and binary predicates for properties (e.g. lsubj(e1,e2)). Constants (e.g., e1, e2) are used to represent entity and event identifiers. The GATE SUPPLE wrapper stores syntactic information produced by the parser in the GATE document in the form of parse annotations containing a bracketed representation of the parse, and semantics annotations that contain the logical forms produced by the parser. It also produces SyntaxTreeNode annotations that allow viewing of the parse tree for a sentence (see section 9.12.4).

9.12.1 Requirements

The SUPPLE parser is written in Prolog, so you will need a Prolog interpreter to run the parser. A copy of PrologCafe (http://kaminari.scitec.kobe-u.ac.jp/PrologCafe/), a pure Java Prolog implementation, is provided in the distribution. This should work on any platform but it is not particularly fast. SUPPLE also supports the open-source SWI Prolog (http://www.swi-prolog.org) and the commercially licenced SICStus prolog (http://www.sics.se/sicstus), which are available for Windows, Mac OS X, Linux and other Unix variants. For anything more than the simplest cases we recommend installing one of these instead of using PrologCafe.

9.12.2 Building SUPPLE

The SUPPLE plugin must be compiled before it can be used, so you will require a suitable Java SDK (GATE itself requires only the JRE to run). To build SUPPLE, first edit the file build.xml in the SUPPLE directory under plugins, and adjust the user-configurable options at the top of the file to match your environment. In particular, if you are using SWI or SICStus Prolog, you will need to change the swi.executable or sicstus.executable property to the correct name for your system. Once this is done, you can build the plugin by opening a command prompt or shell, going to the SUPPLE directory and running:

../../bin/ant swi

(on Windows, use ..\..\bin\ant). For PrologCafe or SICStus, replace swi with plcafe or sicstus as appropriate.

9.12.3 Running the parser in GATE

In order to parse a document you will need to construct an application that has:

• tokeniser

• splitter

• POS-tagger

• Morphology

• SUPPLE Parser with parameters:

  – mapping file (config/mapping.config)

  – feature table file (config/feature_table.config)

  – parser file (supple.plcafe or supple.sicstus or supple.swi)

  – prolog implementation (shef.nlp.supple.prolog.PrologCafe, shef.nlp.supple.prolog.SICStusProlog, shef.nlp.supple.prolog.SWIProlog or shef.nlp.supple.prolog.SWIJavaProlog).

You can take a look at build.xml to see examples of invocation for the different implementations.

Note that prior to GATE 3.1, the parser file parameter was of type java.io.File. From 3.1 it is of type java.net.URL. If you have a saved application (.gapp file) from before GATE 3.1 which includes SUPPLE it will need to be updated to work with the new version. Instructions on how to do this can be found in the README file in the SUPPLE plugin directory.

9.12.4 Viewing the parse tree

GATE provides a syntax tree viewer in the Tools plugin which can display the parse tree generated by SUPPLE for a sentence. To use the tree viewer, be sure that the Tools plugin is loaded, then open a document that has been processed with SUPPLE and view its Sentence annotations. Right-click on the relevant Sentence annotation in the annotations table and select “Edit with syntax tree viewer”.

9.12.5 System properties

The SICStusProlog and SWIProlog implementations work by calling the native prolog executable, passing data back and forth in temporary files. The location of the prolog executable is specified by a system property:

• for SICStus: supple.sicstus.executable - default is to look for sicstus.exe (Win- dows) or sicstus (other platforms) on the PATH.

• for SWI: supple.swi.executable - default is to look for plcon.exe (Windows) or swipl (other platforms) on the PATH.

If your prolog is installed under a different name, you should specify the correct name in the relevant system property. For example, when installed from the source distribution, the Unix version of SWI prolog is typically installed as pl, most binary packages install it as swipl, though some use the name swi-prolog. You can also use the properties to specify the full path to prolog (e.g. /opt/swi-prolog/bin/pl) if it is not on your default PATH.

For details of how to pass system properties to the GATE GUI, see the end of section 3.3.

9.12.6 Configuration files

Two files are used to pass information from GATE to the SUPPLE parser: the mapping file and the feature table file.

Mapping file

The mapping file specifies how annotations produced by GATE are to be passed to the parser. The file is composed of a number of pairs of lines: the first line in a pair specifies a GATE annotation we want to pass to the parser, including the AnnotationSet (or default), the AnnotationType, and a number of features and values that depend on the AnnotationType; the second line of the pair specifies how to encode the GATE annotation in a SUPPLE syntactic category, and also includes a number of features and values. As an example, consider the mapping:

Gate;AnnotationType=Token;category=DT;string=&S
SUPPLE;category=dt;m_root=&S;s_form=&S

It specifies how a determiner ('DT') will be translated into a category 'dt' for the parser. The construct '&S' is used to represent a variable that will be instantiated to the appropriate value during the mapping process. More specifically, a token like 'The', recognised as a DT by the POS tagger, will be mapped into the following category:

dt(s_form:'The',m_root:'The',m_affix:'_',text:'_').

As another example consider the mapping:

Gate;AnnotationType=Lookup;majorType=person_first;minorType=female;string=&S
SUPPLE;category=list_np;s_form=&S;ne_tag=person;ne_type=person_first;gender=female

It specifies that an annotation of type 'Lookup' in GATE is mapped into a category 'list_np' with specific features and values. More specifically, a token like 'Mary' identified in GATE as a Lookup will be mapped into the following SUPPLE category:

list_np(s_form:'Mary',m_root:'_',m_affix:'_',
        text:'_',ne_tag:'person',ne_type:'person_first',gender:'female').

Feature table

The feature table file specifies SUPPLE 'lexical' categories and their features. As an example, an entry in this file is:

n;s_form;m_root;m_affix;text;person;number

which specifies which features, and in which order, a noun category should be written with. In this case:

n(s_form:...,m_root:...,m_affix:...,text:...,person:...,number:...).

9.12.7 Parser and Grammar

The parser builds a semantic representation compositionally, and a ‘best parse’ algorithm is applied to each final chart, providing a partial parse if no complete sentence span can be constructed. The parser uses a feature valued grammar. Each Category entry has the form:

Category(Feature1:Value1,...,FeatureN:ValueN)

where the number and type of features is dependent on the category type (see Section 6.1). All categories will have the features s_form (surface form) and m_root (morphological root); nominal and verbal categories will also have person and number features; verbal categories will also have tense and vform features; and adjectival categories will have a degree feature. The list_np category has the same features as other nominal categories plus ne_tag and ne_type.

Syntactic rules are specified in Prolog with the predicate rule(LHS, RHS), where LHS is a syntactic category and RHS is a list of syntactic categories. A rule such as BNP_HEAD ⇒ N ("a basic noun phrase head is composed of a noun") is written as follows:

rule(bnp_head(sem:E^[[R,E],[number,E,N]],number:N),
     [n(m_root:R,number:N)]).

where the feature 'sem' is used to construct the semantics while the parser processes input, and E, R, and N are variables to be instantiated during parsing.

The full grammar of this distribution can be found in the prolog/grammar directory; the file load.pl specifies which grammars are used by the parser. The grammars are compiled when the system is built, and the compiled version is used for parsing.

9.12.8 Mapping Named Entities

SUPPLE has a Prolog grammar which deals with named entities; the only information required is the Lookup annotations produced by GATE, which are specified in the mapping file. However, you may want to pass named entities identified by your own JAPE grammars in GATE. This can be done using a special syntactic category provided with this distribution. The category sem_cat is used as a bridge between GATE named entities and the SUPPLE grammar. An example of how to use it (provided in the mapping file) is:

Gate;AnnotationType=Date;string=&S
SUPPLE;category=sem_cat;type=Date;text=&S;kind=date;name=&S

which maps a named entity 'Date' into a syntactic category 'sem_cat'. A grammar file called semantic_rules.pl is provided to map sem_cat into the appropriate syntactic category expected by the phrasal rules. The following rule, for example:

rule(ne_np(s_form:F,sem:X^[[name,X,NAME],[KIND,X]]),
     [sem_cat(s_form:F,text:TEXT,type:'Date',kind:KIND,name:NAME)]).

is used to parse a 'Date' into a named entity in SUPPLE, which in turn will be parsed into a noun phrase.

9.12.9 Upgrading from BuChart to SUPPLE

In theory, upgrading from BuChart to SUPPLE should be relatively straightforward. Basically, any instance of BuChart needs to be replaced by SUPPLE. Specific changes which must be made are:

• The compiled parser files are now supple.swi, supple.sicstus, or supple.plcafe.

• The GATE wrapper parameter buchartFile is now SUPPLEFile, and it is now of type java.net.URL rather than java.io.File. Details of how to compensate for this in existing saved applications are given in the SUPPLE README file.

• The Prolog wrappers now start shef.nlp.supple.prolog instead of shef.nlp.buchart.prolog.

• The mapping.conf file now has lines starting SUPPLE; instead of Buchart;.

• Most importantly, the main wrapper class is now called nlp.shef.supple.SUPPLE.

Making these changes to existing code should be trivial and will allow applications to benefit from future improvements to SUPPLE.

9.13 Montreal Transducer

The Montreal Transducer is an improved JAPE Transducer, developed by Luc Plamondon, Université de Montréal. It is intended to make grammar authoring easier by providing a more flexible version of the JAPE language, and it also fixes a few bugs. Full details of the transducer can be found at http://www.iro.umontreal.ca/~plamondl/mtltransducer/. We summarise the main features below.

9.13.1 Main Improvements

• While only == constraints were allowed on annotation attributes, the grammar now accepts constraints such as {MyAnnot.attrib != value}, {MyAnnot.attrib > value}, {MyAnnot.attrib < value}, {MyAnnot.attrib =~ value} and {MyAnnot.attrib !~ value}.

• The grammar now accepts negated constraints such as {!MyAnnot} (true if no annotation starting from the current node has the MyAnnot type) and {!MyAnnot.attrib == value} (true if {MyAnnot.attrib == value} fails), where the == constraint can be any other operator.

• Because the transducer compiles rules at run-time, the classpath must include the transducer jar file (unless the transducer is bundled in the GATE jar file). The Montreal Transducer updates the classpath automatically when it is initialised.

9.13.2 Main Bug fixes

• Constraints on more than one annotation type for the same node now work. For example, {MyAnnot1, MyAnnot2} was allowed by the JAPE Transducer but not implemented.

• The * and + Kleene operators were not greedy when they occurred inside a rule. The document region parsed by a rule is correct, but ambiguous labels inside the rule were not resolved in the expected way. In the following rule, for example, a node that would match both constraints should be part of the ":titles" label and not ":names", because the first + is expected to be greedy:

({Lookup.majorType == title})+:titles ({Token.orth == upperInitial})*:names

9.14 Language Plugins

There are plugins available for processing the following languages: French, German, Spanish, Italian, Chinese, Arabic, Romanian, Hindi and Cebuano. Some of the applications are quite basic and just contain some useful processing resources to get you started when developing a full application. Others (Cebuano and Hindi) are more like toy systems built as part of an exercise in language portability.

Note that if you wish to use individual language processing resources without loading the whole application, you will need to load the relevant plugin for that language in most cases. The plugins all follow the same kind of format. Load the plugin using the plugin manager, and the relevant resources will be available in the Processing Resources set.

Some plugins just contain a list of resources which can be added ad hoc to other applications. For example, the Italian plugin simply contains a lexicon which can be used to replace the English lexicon in the default English POS tagger: this will provide a reasonable basic POS tagger for Italian.

In most cases you will also find a directory in the relevant plugin directory called data which contains some sample texts (in some cases, these are annotated with NEs).

9.14.1 French Plugin

The French plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in French (french+tagger.gapp), and one which does not (french.gapp). Simply load the required application from the plugins/french directory. You do not need to load the plugin itself from the plugins menu. Note that the TreeTagger must first be installed and set up correctly (see Section 9.7 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. The applications both contain resources for tokenisation, sentence splitting, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that they are not intended to produce high quality results; they are simply a starting point for a developer working on French. Some sample texts are contained in the plugins/french/data directory.

9.14.2 German Plugin

The German plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in German (german+tagger.gapp), and one which does not (german.gapp). Simply load the required application from the plugins/german/resources directory. You do not need to load the plugin itself from the plugins menu. Note that the TreeTagger must first be installed and set up correctly (see Section 9.7 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. The applications both contain resources for tokenisation, sentence splitting, gazetteer lookup, compound analysis, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/german/data directory. We are grateful to Fabio Ciravegna and the Dot.KOM project for use of some of the components for the German plugin.

9.14.3 Romanian Plugin

The Romanian plugin contains an application for Romanian NE recognition (romanian.gapp). Simply load the application from the plugins/romanian/resources directory. You do not need to load the plugin itself from the plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/romanian/corpus directory.

9.14.4 Arabic Plugin

The Arabic plugin contains a simple application for Arabic NE recognition (arabic.gapp). Simply load the application from the plugins/arabic/resources directory. You do not need to load the plugin itself from the plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that there are two types of gazetteer used in this application: one which was derived automatically from training data (Arabic inferred gazetteer), and one which was created manually. Note that there are some other applications included which perform quite specific tasks (but can generally be ignored). For example, arabic-for-bbn.gapp and arabic-for-muse.gapp make use of a very specific set of training data and convert the result to a special format. There is also an application to collect new gazetteer lists from training data (arabic_lists_collector.gapp). For details of the gazetteer list collector please see Section 9.6.

9.14.5 Chinese Plugin

The Chinese plugin contains a simple application for Chinese NE recognition (chinese.gapp). Simply load the application from the plugins/chinese/resources directory. You do not need to load the plugin itself from the plugins menu. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. The application makes use of some gazetteer lists (and a grammar to process them) derived automatically from training data, as well as regular hand-crafted gazetteer lists. There are also applications (listscollector.gapp, adj_collector.gapp and nounperson_collector.gapp) to create such lists, and various other applications to perform special tasks such as coreference evaluation (coreference_eval.gapp) and converting the output to a different format (ace-to-muse.gapp).

9.15 Chemistry Tagger

This GATE module is designed to tag a number of chemistry items in running text. Currently the tagger tags compound formulas (e.g. SO2, H2O, H2SO4 ...), ions (e.g. Fe3+, Cl-) and element names and symbols (e.g. Sodium and Na). Limited support for compound names is also provided (e.g. sulphur dioxide), but only when followed by a compound formula (in parenthesis or commas).

9.15.1 Using the tagger

The Tagger requires the Creole plugin ”Chemistry Tagger” to be loaded. It requires the following PRs to have been run first: tokeniser and sentence splitter. There are four init parameters giving the locations of the two gazetteer list definitions, the element mapping file and the JAPE grammar used by the tagger (in previous versions of the tagger these files were fixed and loaded from inside the ChemTagger.jar file). Unless you know what you are doing you should accept the default values.

The annotations added to documents are "ChemicalCompound", "ChemicalIon" and "ChemicalElement" (currently they are always placed in the default annotation set).

9.16 Flexible Exporter

The Flexible Exporter enables the user to save a document (or corpus) in its original format with added annotations. The user can select the name of the annotation set from which these annotations are to be found, which annotations from this set are to be included, whether features are to be included, and various renaming options such as renaming the annotations and the file.

At load time, the following parameters can be set for the flexible exporter:

• includeFeatures - if set to true, features are included with the annotations exported; if false (the default status), they are not.

• useSuffixForDumpFiles - if set to true (the default status), the output files have the suffix defined in suffixForDumpFiles; if false, no suffix is defined, and the output file simply overwrites the existing file (but see the outputFileUrl runtime parameter for an alternative).

• suffixForDumpFiles - this defines the suffix if useSuffixForDumpFiles is set to true. By default the suffix is .gate.

The following runtime parameters can also be set (after the file has been selected for the application):

• annotationSetName - this enables the user to specify the name of the annotation set which contains the annotations to be exported. If no annotation set is defined, it will use the Default annotation set.

• annotationTypes - this contains a list of the annotations to be exported. By default it is set to Person, Location and Date.

• dumpTypes - this contains a list of names for the exported annotations. If the annota- tion name is to remain the same, this list should be identical to the list in annotation- Types. The list of annotation names must be in the same order as the corresponding annotation types in annotationTypes.

• outputDirectoryUrl - this enables the user to specify the export directory, where the file is exported with its original name and an extension (provided as a parameter) appended to the end of the filename. Note that you can also save a whole corpus in one go.

9.17 Annotation Set Transfer

The Annotation Set Transfer enables the parts of a document matching a particular annotation to be transferred into a new annotation set. For example, this can be used when a user only wants to run a processing resource over a specific part of a document, such as the Body of an HTML document. The user specifies the name of the annotation set and the annotation which covers the part of the document they wish to transfer, and the name of the new annotation set. All the other annotations corresponding to the matched text will be transferred to the new annotation set. For example, we might wish to perform named entity recognition on the body of an HTML text, but not on the headers. After tokenising and performing gazetteer lookup on the whole text, we would use the Annotation Set Transfer to transfer those annotations (created by the tokeniser and gazetteer) into a new annotation set, and then run the remaining NE resources, such as the semantic tagger and coreference modules, on them.

The Annotation Set Transfer has no loadtime parameters. It has the following runtime parameters:

• inputASName - this defines the annotation set which is to be transferred. If nothing is specified, the Default annotation set will be used.

• outputASName - this defines the new annotation set which will contain the transferred annotations. The default for this is a set called Filtered.

• tagASName - this defines the annotation set which contains the annotation covering the relevant part of the document to be transferred.

• textTagName - this defines the name of the annotation covering the relevant part of the document to be transferred.

For example, suppose we wish to perform named entity recognition only on the text covered by the BODY annotation from the Original Markups annotation set in an HTML document. We have to run the gazetteer and tokeniser on the entire document: because these resources do not depend on any other annotations, we cannot specify an input annotation set for them to use. We therefore transfer these annotations to a new annotation set (Filtered) and then perform the NE recognition over these annotations, by specifying this annotation set as the input annotation set for all the following resources. In this example, we would set the following parameters (assuming that the annotations from the tokeniser and gazetteer are initially placed in the Default annotation set); a programmatic sketch follows the parameter list.

• inputASName: Default

• outputASName: Filtered

• tagASName: Original markups

• textTagName: BODY
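The following is a hedged sketch of setting these parameters programmatically. The class name gate.creole.annotransfer.AnnotationSetTransfer is an assumption; check the plugin's creole.xml for the exact value in your GATE version. Passing null for inputASName selects the Default annotation set.

import gate.Factory;
import gate.ProcessingResource;

public class AstExample {
    static ProcessingResource createTransfer() throws Exception {
        ProcessingResource ast = (ProcessingResource) Factory.createResource(
            "gate.creole.annotransfer.AnnotationSetTransfer");
        ast.setParameterValue("inputASName", null);             // Default set
        ast.setParameterValue("outputASName", "Filtered");
        ast.setParameterValue("tagASName", "Original markups");
        ast.setParameterValue("textTagName", "BODY");
        return ast;
    }
}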

9.18 Information Retrieval in GATE

GATE comes with a full-featured Information Retrieval (IR) subsystem that allows queries to be performed against GATE corpora. This combination of IE and IR means that documents can be retrieved from the corpora not only based on their textual content but also according to their features or annotations. For example, a search over the Person annotations for ”Bush” will return documents with higher relevance, compared to a search in the content for the string ”bush”. The current implementation is based on the most popular open source full-text search engine - Lucene (available at http://jakarta.apache.org/lucene/) but other implementations may be added in the future.

An Information Retrieval system is most often considered a system that accepts as input a set of documents (a corpus) and a query (a combination of search terms), and returns as output only those documents from the corpus which are considered relevant according to the query. Usually, in addition to the documents, a relevance measure (score) is returned for each document. There exist many relevance metrics, but usually documents which are considered more relevant according to the query are scored higher.

Figure 9.4 shows the results from running a query against an indexed corpus in GATE.

        term1   term2   ...   termk
doc1    w1,1    w1,2    ...   w1,k
doc2    w2,1    w2,2    ...   w2,k
...     ...     ...     ...   ...
docn    wn,1    wn,2    ...   wn,k

Table 9.1: a document-term matrix

Information Retrieval systems usually perform some preprocessing on the input corpus in order to create the document-term matrix for the corpus. A document-term matrix is usually presented as in Table 9.1, where doci is a document from the corpus, termj is a word that is considered important and representative for the document, and wi,j is the weight assigned to the term in the document. There are many ways to define the term weight function, but most often it depends on the term frequency in the document and in the whole corpus (i.e. the local and the global frequency).
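One common weighting scheme, given here purely as an illustration (it is not necessarily the exact formula applied by Lucene), is tf-idf, which combines the local and global frequency components:

    w(i,j) = tf(i,j) x log( N / df(j) )

where tf(i,j) is the number of occurrences of termj in doci, df(j) is the number of documents in the corpus that contain termj, and N is the total number of documents. Terms that are frequent within a document but rare across the corpus thus receive the highest weights.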

Note that not all of the words appearing in the document are considered terms. There are many words (called “stop-words”) which are ignored, since they are observed too often and are not representative enough. Such words are articles, conjunctions, etc. During the preprocessing phase which identifies such words, a form of stemming is usually also performed in order to minimize the number of terms and to improve the retrieval recall. Various forms of the same word (e.g. “play”, “playing” and “played”) are considered identical, so they are counted as multiple occurrences of the same term (probably “play”).

It is recommended that the user reads the relevant Information Retrieval literature for a detailed explanation of stop words, stemming and term weighting.

IR systems, in a way similar to IE systems, are evaluated with the help of the precision and recall measures (see Section 13.4 for more details).

9.18.1 Using the IR functionality in GATE

In order to run queries against a corpus, the latter should be “indexed”. The indexing process first processes the documents in order to identify the terms and their weights (stemming is performed too) and then creates the proper structures on the local filesystem. These file structures contain indexes that will be used by Lucene (the underlying IR engine) for the retrieval.

Once the corpus is indexed, queries may be run against it. The index may subsequently be removed, in which case the structures on the local filesystem are removed too. Once the index is removed, queries cannot be run against the corpus.

Indexing the corpus

In order to index a corpus, the latter should be stored in a serial datastore. In other words, the IR functionality is unavailable for corpora that are transient or stored in an RDBMS datastore (though support for the latter may be added in the future).

To index the corpus, follow these steps:

• Select the corpus from the resource tree (top-left pane) and from the context menu (right button click) choose “Index Corpus”. A dialogue appears that allows you to specify the index properties.

• In the index properties dialogue, specify the underlying IR system to be used (only Lucene is supported at present), the directory that will contain the index structures, and the set of properties that will be indexed, such as document features, content, etc. (the same properties will be indexed for each document in the corpus).

• Once the corpus is indexed, you may start running queries against it. Note that the directory specified for the index data should exist and be empty. Otherwise an error will occur during the index creation.

Querying the corpus

To query the corpus, follow these steps:

• Create a SearchPR processing resource. All the parameters of SearchPR are runtime parameters, so they are set later.

• Create a pipeline application containing the SearchPR.

• Set the following SearchPR parameters:

– The corpus that will be queried.

– The query that will be executed.

– The maximum number of documents returned.

A query looks like the following:

{+/-}field1:term1 {+/-}field2:term2 ... {+/-}fieldN:termN

where field is the name of an index field, such as one specified at index creation (the document content field is body), and term is a term that should appear in the field. For example, the query:

+body:government +author:CNN

will inspect the document content for the term “government” (together with variations such as “governments” etc.) and the index field named “author” for the term “CNN”. The “author” field is specified at index creation time, and is either a document feature or another document property.

• After the SearchPR is initialized, running the application executes the specified query over the specified corpus.

• Finally, the results are displayed (see Figure 9.4) after a double-click on the SearchPR processing resource.

Removing the index

An index for a corpus may be removed at any time from the “Remove Index” option of the context menu for the indexed corpus (right button click).

9.18.2 Using the IR API

The IR API within GATE makes it possible for corpora to be indexed, queried and results returned from any Java application, without using the GATE GUI. The following sample indexes a corpus, runs a query against it and then removes the index.

// open a serial data store
SerialDataStore sds = (SerialDataStore)
    Factory.openDataStore("gate.persist.SerialDataStore",
                          "/tmp/datastore1");
sds.open();

// set an AUTHOR feature for the test document
Document doc0 =
    Factory.newDocument(new URL("file:/tmp/documents/doc0.html"));
doc0.getFeatures().put("author", "John Smit");

Corpus corp0 = Factory.newCorpus("TestCorpus");
corp0.add(doc0);

// store the corpus in the serial datastore
Corpus serialCorpus = (Corpus) sds.adopt(corp0, null);
sds.sync(serialCorpus);

// index the corpus - the content and the AUTHOR feature
IndexedCorpus indexedCorpus = (IndexedCorpus) serialCorpus;

DefaultIndexDefinition did = new DefaultIndexDefinition();
did.setIrEngineClassName(
    gate.creole.ir.lucene.LuceneIREngine.class.getName());
did.setIndexLocation("/tmp/index1");
did.addIndexField(new IndexField("content",
    new DocumentContentReader(), false));
did.addIndexField(new IndexField("author", null, false));
indexedCorpus.setIndexDefinition(did);
indexedCorpus.getIndexManager().createIndex();
// the corpus is now indexed

// search the corpus
Search search = new LuceneSearch();
search.setCorpus(indexedCorpus);

QueryResultList res =
    search.search("+content:government +author:John");

// get the results
Iterator it = res.getQueryResults();
while (it.hasNext()) {
  QueryResult qr = (QueryResult) it.next();
  System.out.println("DOCUMENT_ID=" + qr.getDocumentID()
      + ", score=" + qr.getScore());
}

9.19 Crawler

The crawler plugin enables GATE to be used to build a corpus from a web crawl. The crawler itself is Websphinx, a Java-based multi-threaded web crawler that can be customised for any application. In order to use this plugin, the websphinx.jar file may need to be added to the required libraries (e.g. in JBuilder).

The basic idea is to be able to specify a source URL and a depth to build the initial corpus, upon which further processing could be done. The PR itself provides a number of helpful features to set various parameters of the crawl.

9.19.1 Using the Crawler PR

In order to use the processing resource you first need to load the plugin using the plugin manager. Then load the crawler from the list of processing resources. Once the crawler is initialized, the PR automatically creates a corpus named crawl where all the documents from the web crawl will be stored. In order to use the crawler, create a simple pipeline (note: do not create a corpus pipeline) and add the crawl PR to the pipeline.

Once the crawl PR is created, there are a number of parameters that can be set (see also Figure 9.6).

• depth: the depth to which the crawl should proceed.

• dfs / bfs: depth-first search if true, breadth-first search if false.

– dfs: the crawler uses the depth-first strategy for the crawl, visiting the nodes in DFS order until the specified depth limit is reached.
– bfs: the crawler uses the breadth-first strategy for the crawl, visiting the nodes in BFS order until the specified depth limit is reached.

• domain

– SUBTREE: the crawler visits only the descendants of the page specified as the root for the crawl.
– WEB: the crawler visits all the pages on the web.
– SERVER: the crawler visits only the pages that are present on the server where the root page is located.

• max number of pages to be fetched

• root: the starting URL from which the crawl begins.

• source: a corpus containing the documents from which the crawl must begin. This is useful when documents have first been fetched by the Google PR and then need to be crawled to expand the web graph further. At any time, either source or root must be set.

Once the parameters are set, the crawl can be run and the documents fetched are added to the corpus crawl. Figure 9.7 shows the crawled pages added to the corpus.

9.20 Google Plugin

The Google API is now integrated with GATE, and can be used as a PR-based plugin. This plugin allows the user to query Google and build a document corpus that contains the search results returned by Google for the query. There is a limit of 1000 queries per day, as set by Google. For more information about the Google API please refer to http://www.google.com/apis/. In order to use the Google PR, you need to register with Google to obtain a license key.

The Google PR can be used for a number of different application scenarios. For example, one use case is where a user wants to find out which named entities can be associated with a particular individual. In this example, the user could build the collection of documents by querying Google and then running ANNIE over the collection. This would annotate the results and show which Organization, Location and other entities can be associated with the query.

9.20.1 Using the GooglePR

In order to use the PR, you first need to load the plugin using the plugin manager. Once the PR is loaded, it can be initialized by creating an instance of a new PR. Here you need to specify the Google API License key. Please use the license key assigned to you by registering with Google.

Once the Google PR is initialized, it can be placed in a pipeline or a conditional pipeline application. This pipeline would contain the instance of the Google PR just initialized as above. There are a number of parameters to be set at runtime:

• corpus: The corpus used by the plugin to add or append documents from the Web.

• corpusAppendMode: If set to true, documents will be appended to the corpus. If set to false, pre-existing documents will be removed from the corpus before adding the documents newly fetched by the PR.

• limit: A limit on the results returned by the search. Default set to 10.

• pagesToExclude: This is an optional parameter. It is a list with URLs not to be included in the search.

• query: The query sent to Google. It is in the format accepted by Google.

Once the required parameters are set we can run the pipeline. This will then download all the URLs in the results and create a document for each. These documents would be added to the corpus as shown in Figure 9.8.

9.21 Yahoo Plugin

The Yahoo API is now integrated with GATE, and can be used as a PR-based plugin. This plugin allows the user to query Yahoo and build the document corpus that contains the search results returned by Yahoo for the query. For more information about the Yahoo API please refer to http://developer.yahoo.com/search/. In order to use the Yahoo PR, you need to obtain an application ID.

The Yahoo PR can be used for a number of different application scenarios. For example, one use case is where a user wants to find out which named entities can be associated with a particular individual. In this example, the user could build the collection of documents by querying Yahoo and then running ANNIE over the collection. This would annotate the results and show which Organization, Location and other entities can be associated with the query.

9.21.1 Using the YahooPR

In order to use the PR, you first need to load the plugin using the plugin manager. Once the PR is loaded, it can be initialized by creating an instance of a new PR. Here you need to specify the Yahoo Application ID. Please use the application ID assigned to you when registering with Yahoo.

Once the Yahoo PR is initialized, it can be placed in a pipeline or a conditional pipeline application. This pipeline would contain the instance of the Yahoo PR just initialized as above. There are a number of parameters to be set at runtime:

• corpus: The corpus used by the plugin to add or append documents from the Web.

• corpusAppendMode: If set to true, documents will be appended to the corpus. If set to false, pre-existing documents will be removed from the corpus before adding the documents newly fetched by the PR.

• limit: A limit on the results returned by the search. Default set to 10.

• pagesToExclude: This is an optional parameter. It is a list with URLs not to be included in the search.

• query: The query sent to Yahoo. It is in the format accepted by Yahoo.

Once the required parameters are set we can run the pipeline. This will then download all the URLs in the results and create a document for each. These documents would be added to the corpus.

9.22 WordNet in GATE

At present GATE supports only WordNet 1.6, so in order to use WordNet in GATE, you must first install WordNet 1.6 on your computer. WordNet is available at http://wordnet.princeton.edu/. The next step is to configure GATE to work with your local WordNet installation. Since GATE relies on the Java WordNet Library (JWNL) for WordNet access, this step consists of providing a special XML file that is used internally by JWNL. This file describes the location of your local copy of the WordNet 1.6 index files.
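A minimal sketch of such a wn-config.xml file, assuming the JWNL 1.x conventions, is shown below (the class names are JWNL defaults and may differ in the copy distributed with the plugin; normally only the dictionary_path value needs editing):

<?xml version="1.0" encoding="UTF-8"?>
<jwnl_properties language="en">
  <version publisher="Princeton" number="1.6" language="en"/>
  <dictionary class="net.didion.jwnl.dictionary.FileBackedDictionary">
    <param name="dictionary_element_factory"
        value="net.didion.jwnl.princeton.data.PrincetonWN16FileDictionaryElementFactory"/>
    <param name="file_manager"
        value="net.didion.jwnl.dictionary.file_manager.FileManagerImpl">
      <param name="file_type"
          value="net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile"/>
      <!-- point this at the dict directory of your WordNet 1.6 installation -->
      <param name="dictionary_path" value="/usr/local/wordnet-1.6/dict"/>
    </param>
  </dictionary>
</jwnl_properties>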

All you have to do is to replace the value of the dictionary_path parameter to point to your local installation of WordNet 1.6.

After configuring GATE to use WordNet, you can start using the built-in WordNet browser or API. In GATE, load the WordNet plugin via the plugins menu. Then load WordNet by selecting it from the set of available language resources. Set the value of the parameter to the path of the xml properties file which describes the WordNet location (wn-config).

Once WordNet is loaded in GATE, the well-known WordNet interface will appear. You can search WordNet by typing a word in the box next to the label “SearchWord” and then pressing “Search”. All the senses of the word will be displayed in the window below. Buttons for the possible parts of speech for this word will also be activated at this point. For instance, for the word “play”, the buttons “Noun”, “Verb” and “Adjective” are activated. Pressing one of these buttons will activate a menu with hyponyms, hypernyms, meronyms for nouns or verb groups, and cause for verbs, etc. Selecting an item from the menu will display the results in the window below.

More information about WordNet can be found at http://www.cogsci.princeton.edu/wn/index.shtml

More information about the JWNL library can be found at http://sourceforge.net/projects/jwordnet

An example of using the WordNet API in GATE is available on the GATE examples page at http://gate.ac.uk/GateExamples/doc/index.html

9.22.1 The WordNet API

GATE offers a set of classes that can be used to access the WordNet 1.6 Lexical Base. The implementation of the GATE API for WordNet is based on Java WordNet Library (JWNL). There are just a few basic classes, as shown in Figure 9.11. Details about the properties and methods of the interfaces/classes comprising the API can be obtained from the JavaDoc. Below is a brief overview of the interfaces:

• WordNet: the main WordNet class. Provides methods for getting the synsets of a lemma, for accessing the unique beginners, etc.

• Word: offers access to the word’s lemma and senses

• WordSense: gives access to the synset, the word, POS and lexical relations.

• Synset: gives access to the word senses (synonyms) in the synset, the semantic relations, POS etc.

• Verb: gives access to the verb frames (not working properly at present)

• Adjective: gives access to the adj. position (attributive, predicative, etc.).

• Relation: an abstract relation; provides access to its type, symbol, inverse relation, the set of POS tags to which it is applicable, etc.

• LexicalRelation

• SemanticRelation

• VerbFrame

9.23 Machine Learning in GATE

Note: A brand new machine learning layer specifically targeted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning has been added to GATE. See chapter 11 for more details.

9.23.1 ML Generalities

This section describes the use of Machine Learning (ML) algorithms in GATE.

An ML algorithm “learns” about a phenomenon by looking at a set of occurrences of that phenomenon that are used as examples. Based on these, a model is built that can be used to predict characteristics of future (and unforeseen) examples of the phenomenon.

Classification is a particular example of machine learning in which the set of training examples is split into multiple subsets (classes) and the algorithm attempts to distribute the new examples into the existing classes.

This is the type of ML that is used in GATE and all further references to ML actually refer to classification.

Some definitions

• instance: an example of the studied phenomenon. An ML algorithm learns from a set of known instances, called a dataset.

• attribute: a characteristic of the instances. Each instance is defined by the values of its attributes. The set of possible attributes is well defined and is the same for all instances in a dataset.

• class: an attribute for which the values need to be found through the ML mechanism.

GATE-specific interpretation of the above definitions

• instance: an annotation. In order to use ML in GATE the users will need to choose the type of annotations used as instances. Token annotations are a good candidate for this, but any type of annotation could be used (e.g. things that were found by a previously run JAPE grammar).

• attribute: an attribute can be either:

– the presence (or absence) of a particular annotation type [partially] covering the instance annotation;

– the value of a named feature of a particular annotation type.

The value of the attribute can refer to the current instance or to an instance situated at a specified location relative to the current instance.

• class: any attribute can be marked as the class attribute.

An ML implementation has two modes of functioning: training and application. The training phase consists of building a model (e.g. statistical model, a decision tree, a rule set, etc.) from a dataset of already classified instances. During application, the model built while training is used to classify new instances.

There are ML algorithms which permit the incremental building of the model (e.g. the Updateable Classifiers in the WEKA library). These classifiers do not require the entire training dataset to build a model; the model improves with each new training instance that the algorithm is provided with.

9.23.2 The Machine Learning PR in GATE

Access to ML implementations is provided in GATE by the “Machine Learning PR”, which handles both the training and application of an ML model on GATE documents. This PR is a Language Analyser, so it can be used in all default types of GATE controllers.

In order to allow for more flexibility, all the configuration parameters for the ML PR are set through an external XML file and not through the normal PR parameterisation. The root element of the file needs to be called “ML-CONFIG” and it contains two elements: “DATASET” and “ENGINE”. An example XML configuration file is given in Appendix E.

The DATASET element

The DATASET element defines the type of annotation to be used as instance and the set of attributes that characterise all the instances.

An “INSTANCE-TYPE” element is used to select the annotation type to be used for instances, and the attributes are defined by a sequence of “ATTRIBUTE” elements.

For example, if an “INSTANCE-TYPE” has “Token” as its value, there will be one instance in the dataset per “Token”. This also means that the positions (see below) are defined in relation to Tokens. The “INSTANCE-TYPE” can be seen as the smallest unit to be taken into account for the Machine Learning.

An ATTRIBUTE element has the following sub-elements:

• NAME: the name of the attribute.

• TYPE: the annotation type used to extract the attribute.

• FEATURE (optional): if present, the value of the attribute will be the value of the named feature on the annotation of the specified type.

• POSITION: the position of the annotation used to extract the feature, relative to the current instance annotation.

• VALUES (optional): includes a list of VALUE elements.

• CLASS: an empty element used to mark the class attribute. There can only be one attribute marked as class in a dataset definition.

Since the VALUEs are defined as XML text, the characters <, > and & must be replaced by the entities &lt;, &gt; and &amp; respectively. It is also recommended to write the XML configuration file in UTF-8 so that uncommon characters are parsed correctly.

Semantically, there are three types of attributes:

• nominal attributes: both type and feature are defined, and a list of allowed values is provided;

• numeric: both type and feature are defined, but no list of allowed values is provided; it is assumed that the feature can be converted to a number (a double value);

• boolean: no feature or list of values is provided; the attribute will take one of the “true” or “false” values based on the presence (or absence) of the specified annotation type at the required position.

Figure 9.12 gives some examples of what the values of specified attributes would be in a situation when “Token” annotations are used as instances.

An ATTRIBUTELIST element is similar to ATTRIBUTE, except that it has no POSITION sub-element but a RANGE element instead. It will be converted into several ATTRIBUTE elements, with positions ranging from the value of the “from” attribute to the value of the “to” attribute. This can be used to avoid duplicating ATTRIBUTE elements.

The ENGINE element

The ENGINE element defines which particular ML implementation will be used, and allows the setting of options for that particular implementation.

The ENGINE element has three sub-elements:

• WRAPPER: defines the class name for the ML implementation (or implementation wrapper). The specified class needs to extend gate.creole.ml.MLEngine.

• BATCH-MODE-CLASSIFICATION: this element is optional. If present (as an empty element <BATCH-MODE-CLASSIFICATION/>), the training instances will be passed to the engine in a single batch. If absent, the instances are passed to the engine one at a time. Not every engine supports this option, but for those that do, it can greatly improve performance.

• OPTIONS: the contents of the OPTIONS element will be passed verbatim to the ML engine used.
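To make the overall structure concrete, the following is a minimal sketch of a configuration file. The element names are those defined above, while the annotation types, feature names, attribute values and wrapper class are illustrative only; a complete, authoritative example is given in Appendix E.

<?xml version="1.0" encoding="UTF-8"?>
<ML-CONFIG>
  <DATASET>
    <INSTANCE-TYPE>Token</INSTANCE-TYPE>
    <!-- a nominal attribute: type, feature and allowed values are given -->
    <ATTRIBUTE>
      <NAME>POS</NAME>
      <TYPE>Token</TYPE>
      <FEATURE>category</FEATURE>
      <POSITION>0</POSITION>
      <VALUES>
        <VALUE>NN</VALUE>
        <VALUE>NNP</VALUE>
        <VALUE>VB</VALUE>
      </VALUES>
    </ATTRIBUTE>
    <!-- a boolean attribute: presence of a Lookup annotation
         on the instance immediately before the current one -->
    <ATTRIBUTE>
      <NAME>LookupBefore</NAME>
      <TYPE>Lookup</TYPE>
      <POSITION>-1</POSITION>
    </ATTRIBUTE>
    <!-- the class attribute, marked by the empty CLASS element -->
    <ATTRIBUTE>
      <NAME>EntityType</NAME>
      <TYPE>Entity</TYPE>
      <FEATURE>type</FEATURE>
      <POSITION>0</POSITION>
      <VALUES>
        <VALUE>Person</VALUE>
        <VALUE>Organization</VALUE>
      </VALUES>
      <CLASS/>
    </ATTRIBUTE>
  </DATASET>
  <ENGINE>
    <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
    <OPTIONS>
      <!-- engine-specific options; see the wrapper sections below -->
    </OPTIONS>
  </ENGINE>
</ML-CONFIG>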

9.23.3 The WEKA Wrapper

GATE provides a wrapper for the WEKA ML Library (http://www.cs.waikato.ac.nz/ml/weka/) in the form of the gate.creole.ml.weka.Wrapper class.

Options for the WEKA wrapper

The WEKA wrapper accepts the following options:

• CLASSIFIER: the class name for the classifier to be used.

• CLASSIFIER-OPTIONS: the options string as required for the classifier.

• CONFIDENCE-THRESHOLD: a double value. If the classifier can provide a probability distribution rather than a simple classification, then all possible classifications that have a probability value larger than or equal to the confidence threshold will be considered.

• DATASET-FILE: location of the WEKA ARFF file. This item is not mandatory; it is possible to specify the file using the saving option in the GUI.
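For instance, an OPTIONS element selecting the WEKA decision tree learner J48 might look like the following (the classifier name is a standard WEKA class; the option values shown are illustrative only):

<OPTIONS>
  <CLASSIFIER>weka.classifiers.trees.J48</CLASSIFIER>
  <CLASSIFIER-OPTIONS>-C 0.25 -M 2</CLASSIFIER-OPTIONS>
  <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
  <DATASET-FILE>/tmp/ml-dataset.arff</DATASET-FILE>
</OPTIONS>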

9.23.4 Training an ML model with the ML PR and WEKA wrapper

The ML PR has a Boolean runtime parameter named “training”. When the value of this parameter is set to true, the PR will collect a dataset of instances from the documents on which it is run. If the classifier used is an updatable classifier then the ML model will be built while collecting the dataset. If the selected classifier is not updatable, then the model will be built the first time a classification is attempted.

Training a model consists of designing a definition file for the ML PR, and creating an application containing an ML PR. When the application is run over a corpus, the dataset (and the model if possible) is built.

9.23.5 Applying a learnt model

Using the same ML PR, set the “training” parameter to false and run your application.

Depending on the type of the attribute that is marked as class, different actions will be performed when a classification occurs:

• if the attribute is boolean, a new annotation of the specified type will be created with no features;

• if the attribute is nominal or numeric, a new annotation of the specified type will be created with the feature named in the attribute definition having the value predicted by the classifier.

Once a model is learnt, it can be saved and reloaded at a later time. The WEKA wrapper also provides an operation for saving only the dataset in the ARFF format, which can be used for experiments in the WEKA interface. This could be useful for determining the best algorithm to be used and the optimal options for the selected algorithm.

9.23.6 The MAXENT Wrapper

GATE also provides a wrapper for the Open NLP MAXENT library (http://maxent.sourceforge.net/about.html). The MAXENT library provides an implementation of the maximum entropy learning algorithm, and can be accessed using the gate.creole.ml.maxent.MaxentWrapper class.

The MAXENT library requires all attributes except for the class attribute to be boolean, and that the class attribute be boolean or nominal. (It should be noted that, within maximum entropy terminology, the class attribute is called the ’outcome’.) Because the MAXENT library does not provide a specific format for data sets, there is no facility to save or load data sets separately from the model, but if there should be a need to do this, the WEKA wrapper can be used to collect the data.

Training a MAXENT model follows the same general procedure as for WEKA models, but the following difference should be noted. MAXENT models are not updateable, so the model will always be created and trained the first time a classification is attempted. The training of the model might take a considerable amount of time, depending on the amount of training data and the parameters of the model.

Options for the MAXENT Wrapper

• CUT-OFF: MAXENT features will only be included in the model if they occur at least this many times. (The default value of this parameter is zero.)

• ITERATIONS: The number of times the training procedure should iterate when finding the model’s parameters (default is 10). In general no more than about 100 iterations should be needed to train a model, and it is recommended that fewer are used during development to allow for shorter training times.

• CONFIDENCE-THRESHOLD: Same as for the WEKA wrapper (see above). However, if this parameter is not set, or is set to zero, the model will not use a confidence threshold, but will simply return the most likely classification.

• SMOOTHING: Use smoothing when training the model. Smoothing can improve the accuracy of the learned models, but it will result in longer training times, and training will use more memory. The size of the learned models will also be larger. Generally smoothing will only improve performance for those models trained from small data sets with a few outcomes. With larger data sets with lots of outcomes, it may make performance worse.

• SMOOTHING-OBSERVATION: When using smoothing, this specifies the number of times the trainer will imagine that it has seen features which it did not actually see (default value is 0.1).

• VERBOSE: If selected, this will cause the classifier to output more details of its operation during execution.
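A sketch of a possible ENGINE definition for MAXENT follows. The option names are those listed above, but the exact form in which the boolean options (such as SMOOTHING) are expressed is an assumption; the example in Appendix E should be treated as authoritative.

<ENGINE>
  <WRAPPER>gate.creole.ml.maxent.MaxentWrapper</WRAPPER>
  <OPTIONS>
    <CUT-OFF>2</CUT-OFF>
    <ITERATIONS>50</ITERATIONS>
    <CONFIDENCE-THRESHOLD>0.8</CONFIDENCE-THRESHOLD>
    <SMOOTHING/>
    <SMOOTHING-OBSERVATION>0.1</SMOOTHING-OBSERVATION>
  </OPTIONS>
</ENGINE>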

9.23.7 The SVM Light Wrapper

From version 3.0, GATE provides a wrapper for the SVM Light ML system (http://svmlight.joachims.org). SVM Light is a support vector machine implementation, written in C, which is provided as a set of command line programs. The GATE wrapper takes care of the mundane work of converting the data structures between GATE and SVM Light formats, and calls the command line programs in the right sequence, passing the data back and forth in temporary files. The WRAPPER value for this engine is gate.creole.ml.svmlight.SVMLightWrapper.

The SVM Light binaries themselves are not distributed with GATE – you should download the version for your platform from http://svmlight.joachims.org and place svm_learn and svm_classify on your path.

Classifying documents using the SVMLightWrapper is a two-phase procedure. In the first phase, the SVM wrapper collects data from the pre-annotated documents; in the second phase, it builds the SVM model from the collected data and uses it to classify unseen documents. Below we briefly describe an example of classifying the start time of the seminar in a corpus of emails announcing seminars, and provide more details later in the section.

Figure 9.13 explains step by step the process of collecting training data for the SVM classifier. GATE documents which are pre-annotated with annotations of type Class and feature type='stime' are used as the training data. In order to build the SVM model, we require start and end annotations for each stime annotation. We use a pre-processor JAPE transduction script to mark the sTimeStart and sTimeEnd annotations on stime annotations. Following this step, the Machine Learning PR (SVMLightWrapper), with training mode set to true, collects the training data from all training documents. A GATE corpus pipeline, given a set of documents and PRs to execute, runs all the PRs over one document at a time. Within a single pipeline it is therefore impossible to send all the training data (i.e. collected from all documents) to the SVM wrapper at once in order to build the SVM model; as a result, the model is not built at the time the training data is collected. The state of the SVM wrapper can be saved to an external file once the training data has been collected.

Before classifying any unseen document, SVM requires the SVM model to be available. In the absence of an up-to-date SVM model, the SVM wrapper builds a new one using the command line svm_learn utility and the training data collected from the training corpus. In other words, the first SVM model is built when the user tries to classify the first document. At this point the user has the option to save the model to external storage, in order to reload the model prior to classifying other documents and to avoid rebuilding the SVM model every time a new set of documents is classified. Once the model becomes available, the SVM wrapper classifies the unseen documents, creating new sTimeStart and sTimeEnd annotations over the text. Finally, a post-processor JAPE transduction script is used to combine them into the sTime annotation. Figure 9.14 explains this process.

The wrapper allows support vector machines to be created which either do boolean classification or regression (estimation of numeric parameters), and so the class attribute can be boolean or numeric. Additionally, when learning a classifier, SVM Light supports transduction, whereby additional examples can be presented during training which do not have the value of the class attribute marked. Presenting such examples can, in some circumstances, greatly improve the performance of the classifier. To make use of this within GATE, the class attribute can be a three value nominal, in which case the first value specified for that nominal in the configuration file will be interpreted as true, the second as false and the third as unknown. Transduction will be used with any instances for which this attribute is set to the unknown value. It is also possible to use a two value nominal as the class attribute, in which case it will simply be interpreted as true or false.

The other attributes can be boolean, numeric or nominal, or any combination of these. If an attribute is nominal, each value of that attribute maps to a separate SVM Light feature. Each of these SVM Light features will be given the value 1 when the nominal attribute has the corresponding value, and will be omitted otherwise. If the value of the nominal is not specified in the configuration file or there is no value for an instance, then no feature will be added.

An extension to the basic functionality of SVM Light is that each attribute can receive a weighting. These weightings can be specified in the configuration file by adding <WEIGHTING> tags to the parts of the XML file specifying each attribute. The weighting for the attribute must be specified as a numeric value, and be placed between an opening <WEIGHTING> tag and a closing </WEIGHTING> one. Giving an attribute a greater weighting will cause it to play a greater role in learning the model and classifying data. This is achieved by multiplying the value of the attribute by the weighting before creating the training or test data that is passed to SVM Light. Any attribute left without an explicitly specified weighting is given a default weighting of one. Support for these weightings is contained in the Machine Learning PR itself, and so is available to other wrappers, though at the time of writing only the SVM Light wrapper makes use of weightings.
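For example, to give an attribute twice the default influence, the weighting would be added to its definition as shown below (the attribute itself is purely illustrative):

<ATTRIBUTE>
  <NAME>Orthography</NAME>
  <TYPE>Token</TYPE>
  <FEATURE>orth</FEATURE>
  <POSITION>0</POSITION>
  <VALUES>
    <VALUE>upperInitial</VALUE>
    <VALUE>allCaps</VALUE>
    <VALUE>lowercase</VALUE>
  </VALUES>
  <WEIGHTING>2.0</WEIGHTING>
</ATTRIBUTE>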

As with the MAXENT wrapper, SVM Light models are not updateable, so the model will be trained at the first classification attempt. The SVM Light wrapper supports <BATCH-MODE-CLASSIFICATION/>, which should be used unless you have a very good reason not to.

The SVM Light wrapper allows both data sets and models to be loaded and saved to files in the same formats as those used by SVM Light when it is run from the command line. When a model is saved, a file will be created which contains information about the state of the SVM Light Wrapper, and which is needed to restore it when the model is loaded again. This file does not, however, contain any information about the SVM Light model itself. If an SVM Light model exists at the time of saving, and that model is up to date with respect to the current state of the training data, then it will be saved as a separate file, with the same name as the file containing information about the state of the wrapper, but with .NativePart appended to the filename. These files are in the standard SVM Light model format, and can be used with SVM Light when it is run from the command line. When a model is reloaded by GATE, both of these files must be available, and in the same directory, otherwise an error will result. However, if an up to date trained model does not exist at the time the model is saved, then only one file will be created upon saving, and only that file is required when the model is reloaded. So long as at least one training instance exists, it is possible to bring the model up to date at any point simply by classifying one or more instances (i.e. running the model with the training parameter set to false).

Options for the SVM Light engine

Only one subelement is currently supported:

<CLASSIFIER-OPTIONS>: a string of options to be passed to svm_learn on the command line. The only difference from running svm_learn directly is that the user should not specify whether regression or classification is to be used; the wrapper detects this automatically, based on the type of the class attribute, and sets the option accordingly.
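For example, the following passes a linear kernel and a soft-margin parameter to svm_learn (-t 0 and -c are standard SVM Light switches; the values shown are illustrative):

<OPTIONS>
  <CLASSIFIER-OPTIONS>-t 0 -c 1.0</CLASSIFIER-OPTIONS>
</OPTIONS>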

9.24 MinorThird

MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text. It was written primarily by William W. Cohen, a professor at Carnegie Mellon University in the Center for Automated Learning and Discovery, though contributions have been made by many other colleagues and students.

Minorthird’s toolkit of learning methods is integrated tightly with the tools for manually and programmatically annotating text. In Minorthird, a collection of documents is stored in a database called a “TextBase”. Logical assertions about documents in a TextBase can be made, and stored in a special “TextLabels” object. “TextLabels” are a type of “stand-off annotation”: unlike XML markup (for instance), the annotations are completely independent of the text. This means that the text can be stored in its original form, and that many different types of (perhaps incompatible) annotations can be associated with the same TextBase.

Each TextLabels annotation asserts a category or property for a word, a document, or a subsequence of words. (In Minorthird, a sequence of adjacent words is called a “span”.) As an example, these annotations might be produced by human labelers, by a hand-written program, or by a learned program. TextLabels might encode syntactic properties (like shallow parses or part of speech tags) or semantic properties (like the functional role that entities play in a sentence). TextLabels can be nested, much like variable-binding environments can be nested in a programming language, which enables sets of hypothetical or temporary labels to be added in a local scope and then discarded.

Annotated TextBases are accessed in a single uniform way. However, they are stored in one of several schemes. A Minorthird “repository” can be configured to hold a bunch of TextLabels and their associated TextBases.

Moderately complex hand-coded annotation programs can be implemented with a special-purpose annotation language called Mixup, which is part of Minorthird. Mixup is based on the widely used notion of cascaded finite state transducers, but includes some powerful features, including a GUI debugging environment, escape to Java, and a kind of subroutine call mechanism. Mixup can also be used to generate features for learning algorithms, and all the text-based learning tools in Minorthird are closely integrated with Mixup. For instance, feature extractors used in a learned named-entity recognition package might call a Mixup program to perform initial preprocessing of text.

Minorthird contains a number of methods for learning to extract and label spans from a document, or learning to classify spans (based on their content or context within a document). A special case of classifying spans is classifying entire documents. Minorthird includes a number of state-of-the-art sequential learning methods (like conditional random fields, and discriminative training methods for training hidden Markov models).

One practical difficulty in using learning techniques to solve NLP problems is that the input to learners is the result of a complex chain of transformations, which begin with text and end with very low-level representations. Verifying the correctness of this chain of derivations can be difficult. To address this problem, Minorthird also includes a number of tools for visualizing transformed data and relating it to the text from which it was derived.

More information about MinorThird can be found at http://minorthird.sourceforge.net/.

9.25 MIAKT NLG Lexicon

In order to lower the overhead of NLG lexicon development, we have created graphical tools for editing, storage, and maintenance of NLG lexicons, combined with a model which connects lexical entries to concepts and instances in the ontology. GATE also provides access to existing general-purpose lexicons such as WordNet and thus enables their use in NLG applications.

The structure of the NLG lexicons is similar to that of WordNet. Each lexical entry has a lemma, sense number, and syntactic information associated with it (e.g., part of speech, plural form). Each lexical entry also belongs to a synonym set or synset, which groups together all word senses which are synonymous. For example, as shown in Figure 9.15, the lemma “Magnetic Resonance Imaging scan” has one sense, its part of speech is noun, and it belongs to the synset containing also the first sense of the “MRI scan” lemma. Each synset also has a definition, which is shown in order to help the user when choosing the relevant synset for new word senses.

When the user adds a new lemma to the lexicon, it needs to be assigned to an existing synset. The editor also provides functionality for creating a new synset with a part of speech and definition (see Figure 9.16).

The advantage of a synset-based lexicon is that while there can be a one-to-one mapping between concepts and instances in the ontology and synsets, the generator can still use different lexicalisations by choosing among those listed in the synset (e.g., MRI or Magnetic Resonance Imaging). In other words, synsets effectively correspond to concepts or instances in the ontology and their entries specify possible lexicalisations of these concepts/instances in natural language.

At present, the NLG lexicon encodes only synonymy, while other non-lexical relations present in WordNet, like hypernymy and hyponymy (i.e., superclass and subclass relations), are instead derived from the ontology, using the mapping between the synsets and concepts/instances. The reason behind this architectural choice is that ontology-based generators ultimately need to use the ontology as the knowledge source. In this framework, the role of the lexicon is to provide lexicalisations for the ontology classes and instances.

9.25.1 Complexity and Generality

The lexicon model was kept as generic as possible by making it incorporate only minimal lexical information. Additional, generator-specific information can be stored in a hash table, where values can be retrieved by their name. Since these are generator specific, the current lexicon user interface does not support editing of this information, although it can be accessed and modified programmatically.

On the other hand, the NLG lexicon is based on synonym sets, so generators which subscribe to a different model of synonymy might be able to access GATE-based NLG lexicons only via a wrapper mapping between the two models.

Given that the lexicon structure follows the WordNet synset model, such a lexicon can potentially be used for language analysis, if the application only requires synonymy. Our NLG lexicon model does not yet support the richer set of relations in WordNet, such as hypernymy, although it is possible to extend the current model with richer relations. Since we used the lexicon in conjunction with the ontology, such non-linguistic relations were instead taken from the ontology.

The NLG lexicon itself is also independent from the generator’s input knowledge and its format.

9.26 Kea - Automatic Keyphrase Detection

Kea is a tool for automatic detection of key phrases developed at the University of Waikato in New Zealand. The home page of the project can be found at http://www.nzdl.org/Kea/.

This user guide section only deals with the aspects relating to the integration of Kea in GATE. For the inner workings of Kea, please visit the Kea web site and/or contact its authors.

In order to use Kea in GATE, the “Kea” plugin needs to be loaded using the plugins management console. After doing that, two new resource types are available for creation: the “KEA Keyphrase Extractor” (a processing resource) and the “KEA Corpus Importer” (a visual resource associated with the PR).

9.26.1 Using the “KEA Keyphrase Extractor” PR

Kea is based on machine learning and it needs to be trained before it can be used to extract keyphrases. In order to do this, a corpus is required where the documents are annotated with keyphrases. Corpora in the Kea format (where the text and keyphrases are in separate files with the same name but different extensions) can be imported into GATE using the “KEA Corpus Importer” tool. The usage of this tool is presented in a sub-section below.

Once an annotated corpus is obtained, the “KEA Keyphrase Extractor” PR can be used to build a model:

1. load a “KEA Keyphrase Extractor”

2. create a new “Corpus Pipeline” controller.

3. set the corpus for the controller

4. set the ‘trainingMode’ parameter for the PR to ‘true’

5. run the application.

After these steps, the Kea PR contains a trained model. This can be used immediately by switching the ‘trainingMode’ parameter to ‘false’ and running the PR over the documents that need to be annotated with keyphrases. Another possibility is to save the model for later use, by right-clicking on the PR name in the right hand side tree and choosing the “Save model” option.

When a previously built model is available, the training procedure does not need to be repeated; the existing model can be loaded in memory by selecting the “Load model” option in the PR’s pop-up menu.

The Kea PR uses several parameters, as seen in Figure 9.17:

• document: the document to be processed.

• inputAS: the input annotation set. This parameter is only relevant when the PR is running in training mode, and it specifies the annotation set containing the keyphrase annotations.

• outputAS: the output annotation set. This parameter is only relevant when the PR is running in application mode (i.e. when the ‘trainingMode’ parameter is set to false), and it specifies the annotation set where the generated keyphrase annotations will be saved.

• minPhraseLength: the minimum length (in number of words) for a keyphrase.

• minNumOccur: the minimum number of occurrences of a phrase for it to be a keyphrase.

• maxPhraseLength: the maximum length of a keyphrase.

• phrasesToExtract: how many different keyphrases should be generated.

• keyphraseAnnotationType: the type of annotations used for keyphrases.

• dissallowInternalPeriods: whether internal periods should be disallowed.

• trainingMode: if ‘true’, the PR is running in training mode; otherwise it is running in application mode.

• useKFrequency: whether the K-frequency should be used.

9.26.2 Using Kea corpora

The authors of Kea provide on the project web page a few manually annotated corpora that can be used for training Kea. In order to do this from within GATE, these corpora need to be converted to the format used in GATE (i.e. GATE documents with annotations). This is possible using the “KEA Corpus Importer” tool which is available as a visual resource associated with the Kea PR. The importer tool can be made visible by double-clicking on the Kea PR’s name in the resources tree and then selecting the “KEA Corpus Importer” tab, see Figure 9.18.

The tool will read files from a given directory, converting the text ones into GATE documents and the ones containing keyphrases into annotations over the documents.

The user needs to specify a few values:

• Source Directory: the directory containing the text and key files. This can be typed in or selected by pressing the folder button next to the text field.

• Extension for text files: the extension used for text files (by default .txt).

• Extension for keyphrase files: the extension for the files listing keyphrases.

• Encoding for input files: the encoding to be used when reading the files.

• Corpus name: the name for the GATE corpus that will be created.

• Output annotation set: the name for the annotation set that will contain the keyphrases read from the input files.

• Keyphrase annotation type: the type for the generated annotations.

9.27 Ontotext JapeC Compiler

Japec is an alternative implementation of the JAPE language which works by compiling JAPE grammars into Java code. Compared to the standard implementation, these compiled grammars can be several times faster to run. At Ontotext, a modified version of the ANNIE sentence splitter using compiled grammars has been found to run up to five times as fast as the standard version. The compiler can be invoked manually from the command line, or used through the “Ontotext Japec Compiler” PR in the Jape Compiler plugin.

The “Ontotext Japec Transducer” (com.ontotext.gate.japec.JapecTransducer) is a processing resource that is designed to be an alternative to the original Jape Transducer. You can simply replace gate.creole.Transducer with com.ontotext.gate.japec.JapecTransducer in your GATE application and it should work as expected.

The Japec transducer takes the same parameters as the standard JAPE transducer:

• grammarURL: the URL from which the grammar is to be loaded. Note that the Japec Transducer will only work on file: URLs. Also, the alternative binaryGrammarURL parameter of the standard transducer is not supported.

• encoding: the character encoding used to load the grammars.

• ontology: the ontology used for ontology-aware transduction.

Its runtime parameters are likewise the same as those of the standard transducer:

• document: the document to process.

• inputASName: name of the AnnotationSet from which input annotations to the transducer are read.

• outputASName: name of the AnnotationSet to which output annotations from the transducer are written.

The Japec compiler itself is written in Haskell. Compiled binaries are provided for Windows, Linux (x86) and Mac OS X (PowerPC), so no Haskell interpreter is required to run Japec on these platforms. For other platforms, or if you make changes to the compiler source code, you can build the compiler yourself using the Ant build file in the Jape Compiler plugin directory. You will need to install the latest version of the Glasgow Haskell Compiler (GHC; version 6.4.1 was used to build the supplied binaries) and associated libraries. The japec compiler can then be built by running:

../../bin/ant japec.clean japec

from the Jape Compiler plugin directory.

9.28 ANNIC

ANNIC (ANNotations-In-Context) is a full-featured annotation indexing and retrieval system. It is provided as part of an extension of the Serial Datastores, called Searchable Serial Datastore (SSD).

ANNIC can index documents in any format supported by the GATE system (i.e., XML, HTML, RTF, e-mail, text, etc). Compared with other such query systems, it has additional features addressing issues such as extensive indexing of linguistic information associated with document content, independent of document format. It also allows indexing and extraction of information from overlapping annotations and features. Its advanced graphical user interface provides a graphical view of annotation markups over the text, along with an ability to build

new queries interactively. In addition, ANNIC can be used as a first step in rule development for NLP systems as it enables the discovery and testing of patterns in corpora.

ANNIC is built on top of Apache Lucene (http://lucene.apache.org) – a high performance full-featured search engine implemented in Java, which supports indexing and search of large document collections. Our choice of IR engine is due to the customisability of Lucene. For more details on how Lucene was modified to meet the requirements of indexing and querying annotations, please refer to [Aswani et al. 05].

As explained earlier, SSD is an extension of the serial datastore. In addition to the persist location, SSD asks the user to provide some more information (explained later) that it uses to index the documents. Once the SSD has been initiated, the user can add or remove documents and corpora to the SSD in the same way as with other datastores. When documents are added to the SSD, it automatically tries to index them. It updates the index whenever there is a change in any of the documents stored in the SSD, and removes a document from the index if it is deleted from the SSD.

SSD has an advanced graphical interface that allows users to issue queries over the SSD. Below we explain the parameters required by the SSD, how to instantiate it, how to use its graphical interface, and how to use the SSD programmatically.

9.28.1 Instantiating SSD

Steps:

1. Right click on “Data Stores” and select “Create datastore”.

2. From a drop-down list select “Lucene Based Searchable DataStore”.

3. Here, you will see an input window. Please provide these parameters:

(a) DataStore Location: select an empty folder where the DS is to be created.

(b) Index Location: select an empty folder. This is where the index will be created.

(c) Input AnnotationSet Name: the name of the annotation set to be indexed. All annotations from this annotation set are indexed.

(d) Base-Token Annotation Type: e.g. Token. These are the basic tokens of any document. Your documents must have annotations of the Base-Token Annotation Type in order to be indexed.

(e) Index Unit Annotation Type: this specifies the unit of index. In other words, annotations lying within the boundaries of these annotations are indexed (e.g. in the case of “Sentence”, no annotations that span the boundaries of two sentences are considered for indexing). If this field is left empty, the entire document is considered as a single unit.

(f) Features To Exclude: finally, the user can specify the annotations which should not be indexed (e.g. SpaceToken and Split). If the user wants to exclude only a specific feature of a specific annotation type, he/she can specify it using a ‘.’ separator between the annotation type and its feature (e.g. Person.matches).

4. Click OK. If all parameters are OK, a new empty DS will be created.

5. Create an empty corpus and save it to the SSD.

6. Populate it with some documents. Each document added to the corpus, and eventually to the SSD, is indexed automatically. If the document does not have the required annotations, that document is skipped and not indexed.

7. If you then process your documents with, for example, ANNIE (by setting the outputAS to the same value as you specified for the “Input AnnotationSet Name” parameter of the SSD), each document will be re-indexed automatically.

9.28.2 Search GUI

SSD enables users to formulate versatile queries using JAPE patterns. JAPE patterns support various query formats. Below we give a few examples of JAPE pattern clauses which can be used as SSD queries. Actual queries can also be a combination of one or more of the following pattern clauses:

1. String

2. AnnotationType

3. AnnotationType == String

4. AnnotationType.feature == featureValue

5. AnnotationType1, AnnotationType2.feature == featureValue

6. AnnotationType1.feature == featureValue, AnnotationType2.feature == featureValue

JAPE patterns also support the | (OR) operator. For instance, A (B | C) is a pattern of two annotations, where the first is an annotation of type A, followed by an annotation of type either B or C. ANNIC supports two operators, + and *, to specify the number of times a particular annotation or a sub-pattern should appear in the main query pattern. Here, (A)+n means one and up to n occurrences of annotation A, and (A)*n means zero or up to n occurrences of annotation A.

Below we explain the steps to search in the SSD.

1. Double click on SSD. You will see an extra tab “Lucene DataStore Searcher”. Click on it to activate the searcher GUI.

2. Here you can specify a query to search your SSD. The query takes the form of the left-hand side of a JAPE grammar rule. Please refer to the following example queries:

(a) Person – this will return annotations of type Person from the SSD.

(b) Token.string == “Microsoft” – this will return all occurrences of “Microsoft” from the SSD.

(c) Person (Token)*2 Organization – Person followed by zero or up to two Tokens followed by Organization.

(d) Token.orth == “upperInitial”, Organization – Token with feature orth with value set to “upperInitial” which is also annotated as Organization.

Figure 9.19 gives a snapshot of the search window. The bottom section of the window contains the patterns along with their left and right context concordances, while the top section shows a graphical visualisation of annotations. ANNIC shows each pattern in a separate row and provides a tool tip that shows the query that the selected pattern refers to. Along with its left and right context texts, it also lists the names of the documents that the patterns come from. When the focus changes from one pattern to another, the graphical visualisation of annotations (GVA, above the pattern table) changes its focus to the selected pattern. Here, users have the option of visualising annotations and their features for the selected pattern. The figure shows the highlighted spans of annotations for the selected pattern. Annotation types and features can also be selected from the drop-down combo box and their spans highlighted in the GVA. When users choose to highlight the features of annotations (e.g. Token.category), the GVA shows the highlighted spans containing the values of those features. When users choose to highlight an annotation with the feature all, ANNIC adds a blank span to the GVA and shows all its features in a popup window when the mouse enters the span region.

A new query can also be generated and executed from the ANNIC GUI. When the user clicks on any of the highlighted spans of the annotations, the respective query clause is placed in the New Query text box. Clicking on Execute issues the new query and refreshes the GUI output. ANNIC also provides an option to export results to an HTML file, either all patterns or only the selected ones.

It is possible to have more than one corpus, each containing a different set of documents, stored in a single datastore. ANNIC, by providing a drop-down box with a list of stored corpora, also allows searching within a specific (selected) corpus. A large corpus can have many hits for a given query; this can make the GUI take longer to refresh and can be inconvenient when browsing through patterns. ANNIC therefore allows users to specify the number of patterns they wish to see on one screen page and provides a way to iterate through the following pages. Due to technical complexities, the current version only allows forward iteration; in other words, it is not possible to return to a previous page.

9.28.3 Using SSD from your code

// how to instantiate a searchable datastore
// =========================================

// create an instance of the datastore
LuceneDataStoreImpl ds = (LuceneDataStoreImpl)
    Factory.createDataStore("gate.persist.LuceneDataStoreImpl", dsLocation);

// we need to set an indexer
Indexer indexer = new LuceneIndexer(new URL(indexLocation));

// set the indexing parameters
Map parameters = new HashMap();
parameters.put(Constants.INDEX_LOCATION_URL, new URL(indexLocation));
parameters.put(Constants.BASE_TOKEN_ANNOTATION_TYPE, "Token");
parameters.put(Constants.INDEX_UNIT_ANNOTATION_TYPE, "Sentence");
parameters.put(Constants.FEATURES_TO_EXCLUDE, new ArrayList());
parameters.put(Constants.ANNOTATION_SET_NAME, "Key");
ds.setIndexer(indexer, parameters);
ds.setSearcher(new LuceneSearcher());

// how to search in this datastore
// ===============================

// obtain the searcher instance
Searcher searcher = ds.getSearcher();
Map searchParameters = new HashMap();

// obtain the location of the index
String indexLocationPath = new File(((URL) ds.getIndexer().getParameters()
    .get(Constants.INDEX_LOCATION_URL)).getFile()).getAbsolutePath();
ArrayList indexLocations = new ArrayList();
indexLocations.add(indexLocationPath);

// set the search parameters
searchParameters.put(Constants.INDEX_LOCATIONS, indexLocations);
searchParameters.put(Constants.CONTEXT_WINDOW, contextWindow);
searchParameters.put(Constants.NO_OF_PATTERNS, noOfPatterns);

// search
String query = "{Person}";
Hit[] hits = searcher.search(query, searchParameters);
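The search call returns an array of Hit objects describing the matched patterns. The following fragment is a minimal sketch of iterating over the results; the accessor names used (getDocumentID(), getStartOffset(), getEndOffset()) are illustrative assumptions, so please check the Hit class in the JavaDoc for the exact API.

// iterate over the hits returned by the searcher;
// NOTE: the accessor names below are assumptions for illustration,
// consult the JavaDoc of the Hit class for the actual method names
for(int i = 0; i < hits.length; i++) {
  Hit hit = hits[i];
  System.out.println("document: " + hit.getDocumentID());
  System.out.println("span: " + hit.getStartOffset()
      + " to " + hit.getEndOffset());
}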

Figure 9.2: a TreeTagger processed document

Figure 9.3: a MiniPar annotated document

Figure 9.4: Documents with scores, returned from a search over a corpus

Figure 9.5: Indexing a corpus by specifying the index location and indexed features (and content)

Figure 9.6: Crawler parameters

Figure 9.7: Crawled pages added to the corpus

Figure 9.8: URLs added to the corpus

Figure 9.9: WordNet in GATE – results for “bank”

Figure 9.10: WordNet in GATE

Figure 9.11: The Wordnet API

Figure 9.12: Sample attributes and their values

Figure 9.13: Flow diagram explaining the SVM training data collection

Figure 9.14: Flow diagram explaining document classifying process

Figure 9.15: Example NLG Lexicon

Figure 9.16: Editing synset information

Figure 9.17: Parameters used by the Kea PR

Figure 9.18: Options for the “KEA Corpus Importer”

Figure 9.19: Searchable Serial Datastore Viewer

Chapter 10

Working with Ontologies

An increasing number of NLP projects make use of taxonomic data structures and ontologies. The use of NLP techniques for (semi-)automatically generating Semantic Web metadata is also a growing trend. Advances in Semantic Web research have led to a variety of standards for representing ontologies, and an increasing number of tools and programming libraries for managing ontologies are becoming available. All this underlines the need for NLP systems to access ontological information, and has led to the addition of support for ontologies in GATE.

The various ontology representation formalisms (such as RDF-Schema [Lassila & Swick 99], OWL and its variants [Dean et al. 04], and DAML+OIL [Horrocks & vanHarmelen 01]) have their advantages and disadvantages, as well as their idiosyncrasies. Rather than attempting to choose one of the formalisms based on what can only be subjective criteria, and running the risk of obsolescence when that particular formalism falls out of favour with the research community or is superseded by a newer one, the GATE ontology support aims to provide an abstraction layer between the actual representation mechanism and the NLP modules making use of it. It consists of an in-memory data model for ontologies, an API providing access to that representation, a visual resource displaying the information, and input/output capabilities for accessing files containing ontologies in various standards. This approach has well-proven benefits, because it enables each application to use this format-independent model when dealing with ontologies, thus making the application immune to changes in the underlying ontology formats. If a new format needs to be supported, the application can automatically start using ontologies in this format, simply by including the correct tool that converts the format into the common model. From a language engineer’s perspective, the advantage is that they only need to learn one API and model, rather than having to deal with many different ontology formats. This approach is similar to the way we deal with document formats.


10.1 Data Model for Ontologies

In order to work as an abstraction layer, the GATE ontology implementation supports only those features common to all formalisms, which are also the most widely used features. All information that is specific to a given representation model and cannot be represented in GATE is ignored. Currently, the ontology data model supports hierarchies of classes and restrictions, hierarchies of properties, and instances (also known as individuals).

10.1.1 Hierarchies of classes and restrictions

The central role in the ontology data model is played by the class hierarchy (or taxonomy). This consists of a set of classes linked by subClassOf, superClassOf and equivalentClassAs relations. Each ontology class has a name and a URI; in most cases the name is the local part of the URI, though this is not enforced. The URIs of all the resources in a given ontology must be unique.

Each class can have a set of superclasses and a set of subclasses; these are used to build the class hierarchy. The subClassOf and superClassOf relations are transitive and methods are provided by the API for calculating the transitive closure for each of these relations given a class. The transitive closure for the set of superclasses for a given class is a set containing all the superclasses of that class, as well as all the superclasses of its direct superclasses, and so on until no more are found. This calculation is finite, the upper bound being the set of all the classes in the ontology. A class that has no superclasses is called a top class. An ontology can have several top classes. Although the GATE ontology API can deal with cycles in the hierarchy graph, these can cause problems for processes using the API and probably indicate an error in the definition of the ontology. Care should be taken to avoid such situations.
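As an illustration of this calculation, the sketch below computes the set of all superclasses of a class by repeatedly following direct superclass links until no new classes are found. It assumes a getSuperClasses(OConstants.DIRECT_CLOSURE) method mirroring the getSubClasses call used in Section 10.5; the API can also return the transitive closure directly.

// sketch: manual transitive closure over direct superclass links;
// getSuperClasses(OConstants.DIRECT_CLOSURE) is assumed by analogy
// with the getSubClasses call shown in Section 10.5
Set closure = new HashSet();
List toVisit = new LinkedList();
toVisit.add(someClass); // the class whose superclass closure we want
while(!toVisit.isEmpty()) {
  OClass current = (OClass) toVisit.remove(0);
  Iterator it = current.getSuperClasses(OConstants.DIRECT_CLOSURE).iterator();
  while(it.hasNext()) {
    OClass superClass = (OClass) it.next();
    // each class enters the closure at most once, so the loop terminates
    if(closure.add(superClass)) toVisit.add(superClass);
  }
}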

A pair of classes can also have an equivalentClassAs (known as sameClassAs in GATE 3.1) relation, which indicates that the two classes are virtually the same and all their properties and instances should be shared.

A restriction is an anonymous type of class, set on an object or datatype property to restrict some instances of the specified domain of the property to having only certain values (also known as a value constraint) or a certain number of values (also known as a cardinality restriction) for the property. Thus for each restriction there exist at least three triples in the repository: one that defines the resource as a restriction, another that indicates the property on which the restriction is specified, and a third that gives the cardinality or value constraint set on the property. There are six types of restrictions:

1. Cardinality Restriction

2. MinCardinality Restriction

3. MaxCardinality Restriction

4. HasValue Restriction

5. AllValuesFrom Restriction

6. SomeValuesFrom Restriction

Please visit the OWL Reference for more detailed information on restrictions.

10.1.2 Instances

Instances are objects that belong to classes. Like classes, each instance has a name and a URI. Each instance can belong to one or more classes and can have properties with values. API methods are provided for getting all the instances of an ontology, all those that belong to a given class, and all the property values for a given instance. There is also a method to retrieve the list of classes that an instance belongs to, using either the transitive or the direct closure. Like classes, two instances can have a sameInstanceAs relation, which indicates that the property values assigned to both instances should be shared and that all the properties applicable to one instance are also valid for the other. In addition, there is a differentInstanceAs relation, which declares the instances as disjoint.
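As a rough sketch, the calls involved might look as follows; the exact method names (getOInstances, getOClasses) are assumptions here, so please check the ontology API JavaDoc.

// all instances in the ontology (assumed method name)
Set allInstances = ontology.getOInstances();

// all instances of a given class, including those of its subclasses
// (assumed signature, taking a class and a closure type)
Set cityInstances = ontology.getOInstances(cityClass,
    OConstants.TRANSITIVE_CLOSURE);

// the classes an instance belongs to, directly or transitively
// (assumed method name)
Set directTypes = someInstance.getOClasses(OConstants.DIRECT_CLOSURE);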

10.1.3 Hierarchies of properties

The last part of the data model is made up of hierarchies of properties that can be associated with objects in the ontology. Unlike in some other representation models, properties in GATE do not ‘belong’ to classes; they are instead first-class citizens of the data model. The specification of the type of objects that properties apply to is done by means of domains. Similarly, the types of values that a property can take are restricted through the definition of a range. A property whose domain is an empty set can apply to instances of any type (i.e. there are no restrictions given). Like classes, properties can have superPropertyOf, subPropertyOf and equivalentPropertyAs relations among them.

GATE defines five types of properties (three in GATE 3.1):

1. RDF Property (known simply as Property in GATE 3.1): Each property in an ontology is either an RDF property or a subtype of it. Each property has constraints on the type of values it can have for its domain and range. An RDF property can have any ontology resource as its domain and range; in other words, it is associated with an ontology resource and has an ontology resource as its value. For each property, the ontology API provides methods not only to set and get the values of its domain and range, but also to check whether a given value is valid for it. It also provides methods for specifying a particular property as a super- or subproperty of another. These methods are inherited by all subtypes of the RDF property (i.e. Annotation, Datatype and Object properties).

2. Annotation Property (new in GATE 4): An annotation property is associated with an ontology resource (i.e. a class, property or instance) and can have a Literal (new in GATE 4) as its value. A Literal is a Java object that can refer to the URI of any ontology resource, or a string (http://www.w3.org/2001/XMLSchema#string) with a specified language, or a data type (discussed below) with a compatible value. No two annotation properties can be declared as equivalent. It is also not possible to specify a domain or range for an annotation property, or a super- or subproperty relation between two annotation properties. Five annotation properties, predefined by OWL, are made available whenever a new ontology instance is created: owl:versionInfo, rdfs:label, rdfs:comment, rdfs:seeAlso and rdfs:isDefinedBy. In other words, even when a user creates an empty ontology, these annotation properties are created automatically and made available.

3. Datatype Property: A datatype property is associated with an ontology instance and can have a Literal value that is compatible with its data type (new in GATE 4; GATE 3.1 allowed users to specify a Java class as a range value, which is no longer valid in GATE 4). The data type can be one of the pre-defined data types in the GATE ontology API, listed below.

http://www.w3.org/2001/XMLSchema#boolean
http://www.w3.org/2001/XMLSchema#byte
http://www.w3.org/2001/XMLSchema#date
http://www.w3.org/2001/XMLSchema#decimal
http://www.w3.org/2001/XMLSchema#double
http://www.w3.org/2001/XMLSchema#duration
http://www.w3.org/2001/XMLSchema#float
http://www.w3.org/2001/XMLSchema#int
http://www.w3.org/2001/XMLSchema#integer
http://www.w3.org/2001/XMLSchema#long
http://www.w3.org/2001/XMLSchema#negativeInteger
http://www.w3.org/2001/XMLSchema#nonNegativeInteger
http://www.w3.org/2001/XMLSchema#nonPositiveInteger
http://www.w3.org/2001/XMLSchema#positiveInteger
http://www.w3.org/2001/XMLSchema#short
http://www.w3.org/2001/XMLSchema#string
http://www.w3.org/2001/XMLSchema#time
http://www.w3.org/2001/XMLSchema#unsignedByte
http://www.w3.org/2001/XMLSchema#unsignedInt
http://www.w3.org/2001/XMLSchema#unsignedLong
http://www.w3.org/2001/XMLSchema#unsignedShort

A set of ontology classes can be specified as a property’s domain; the property can then be associated only with instances belonging to all of the classes specified in the domain.

4. Object Property: An object property is associated with an ontology instance and has an instance as its value. A set of ontology classes can be specified as the property’s domain and range. The property can then be associated only with instances belonging to all of the classes specified in the domain, and only instances that belong to all the classes specified in the range can be set as its values. Symmetric and Transitive properties are the two subtypes of object properties. A symmetric property’s domain and range are the same (new in GATE 4).

All properties (except annotation properties) can be marked as functional properties, which means that for a given instance in their domain they can take at most one value, i.e. they define a function in the algebraic sense. Properties inverse to functional properties are marked as inverse functional. If one likens ontology properties to algebraic relations, the semantics of these become apparent.

10.2 Ontology Event Model (new in GATE 4)

An Ontology Event Model (OEM) is implemented and incorporated into the new GATE Ontology API. Under the OEM, events are fired whenever a resource is added to, modified in, or deleted from the ontology.

An interface called OntologyModificationListener is provided, with three methods (see below) that need to be implemented by listeners of ontology events.

public void resourcesRemoved(Ontology ontology, String[] resources);

This method is invoked whenever an ontology resource (a class, property or instance) is removed from the ontology. Deleting one resource can also result in the deletion of other, dependent resources; for example, deleting a class should also delete all its instances (more details on how deletion works are given later). The second parameter, an array of strings, provides a list of the URIs of the resources deleted from the ontology.

public void resourceAdded(Ontology ontology, OResource resource);

This method is invoked whenever a new resource is added to the ontology. The parameters provide references to the ontology and to the resource being added to it.

public void ontologyModified(Ontology ontology, OResource resource, int eventType);

This method is invoked whenever any resource of the ontology is modified, but not on the addition or deletion of resources (classes, properties and instances) to and from the ontology. It is invoked for each addition, removal or change of property values or relations (subclass, superclass, subproperty, superproperty, etc.). The first parameter provides a reference to the ontology in which the event took place, the second a reference to the resource affected, and the third identifies the type of event. The OConstants class defines the static constants, listed below, for the various event types.

public static final int SUB_CLASS_ADDED_EVENT;
public static final int SUPER_CLASS_ADDED_EVENT;
public static final int SUB_CLASS_REMOVED_EVENT;
public static final int SUPER_CLASS_REMOVED_EVENT;
public static final int EQUIVALENT_CLASS_EVENT;
public static final int COMMENT_CHANGED_EVENT;
public static final int LABEL_CHANGED_EVENT;
public static final int ANNOTATION_PROPERTY_VALUE_ADDED_EVENT;
public static final int DATATYPE_PROPERTY_VALUE_ADDED_EVENT;
public static final int OBJECT_PROPERTY_VALUE_ADDED_EVENT;
public static final int RDF_PROPERTY_VALUE_ADDED_EVENT;
public static final int ANNOTATION_PROPERTY_VALUE_REMOVED_EVENT;
public static final int DATATYPE_PROPERTY_VALUE_REMOVED_EVENT;
public static final int OBJECT_PROPERTY_VALUE_REMOVED_EVENT;
public static final int RDF_PROPERTY_VALUE_REMOVED_EVENT;
public static final int DIFFERENT_INSTANCE_EVENT;
public static final int SAME_INSTANCE_EVENT;
public static final int EQUIVALENT_PROPERTY_EVENT;
public static final int SUB_PROPERTY_ADDED_EVENT;
public static final int SUPER_PROPERTY_ADDED_EVENT;
public static final int SUB_PROPERTY_REMOVED_EVENT;
public static final int SUPER_PROPERTY_REMOVED_EVENT;

An ontology is responsible for firing the various ontology events. Objects wishing to listen to ontology events must implement the methods above and must be registered with the ontology using the following method:

addOntologyModificationListener(OntologyModificationListener oml);

The following method cancels the registration:

removeOntologyModificationListener(OntologyModificationListener oml);
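For illustration, here is a minimal listener that simply logs each event, together with its registration with an ontology; the method signatures are exactly those of the OntologyModificationListener interface shown above.

// a minimal listener that logs ontology events
public class LoggingOntologyListener
    implements OntologyModificationListener {

  public void resourcesRemoved(Ontology ontology, String[] resources) {
    System.out.println(resources.length + " resource(s) removed");
  }

  public void resourceAdded(Ontology ontology, OResource resource) {
    System.out.println("resource added: " + resource);
  }

  public void ontologyModified(Ontology ontology, OResource resource,
      int eventType) {
    System.out.println("resource modified: " + resource
        + ", event type: " + eventType);
  }
}

// registering the listener with an ontology instance
ontology.addOntologyModificationListener(new LoggingOntologyListener());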

10.2.1 What happens when a resource is deleted?

Resources in an ontology are connected with each other. For example, one class can be a sub- or superclass of other classes, and a resource can have multiple properties attached to it. Given these various relations, a change in one resource can affect other resources in the ontology. Below we describe what happens (in terms of what the GATE ontology API does) when a resource is deleted.

• When a class is deleted

– A list of all its superclasses is obtained. For each class in this list, the list of its subclasses is obtained and the deleted class is removed from it.
– All subclasses of the deleted class are removed from the ontology.
– A list of all its equivalent classes is obtained. For each class in this list, the list of its equivalent classes is obtained and the deleted class is removed from it.
– All instances of the deleted class are removed from the ontology.
– All properties are checked to see if they contain the deleted class as a member of their domain or range. If so, the respective property is also deleted from the ontology.

• When an instance is deleted

– A list of all its same instances is obtained. For each instance in this list, the list of its same instances is obtained and the deleted instance is removed from it.
– A list of all instances set as different from the deleted instance is obtained. For each instance in this list, the list of instances set as different from it is obtained and the deleted instance is removed from it.
– All instances in the ontology are checked to see if any of their property values have the deleted instance as a value. If so, the respective property value is altered to remove the deleted instance.

• When a property is deleted

– A list of all its superproperties is obtained. For each property in this list, the list of its subproperties is obtained and the deleted property is removed from it.
– All subproperties of the deleted property are removed from the ontology.
– A list of all its equivalent properties is obtained. For each property in this list, the list of its equivalent properties is obtained and the deleted property is removed from it.
– All instances and resources of the ontology are checked to see if they have the deleted property set on them. If so, the respective property value is deleted.

10.3 OWLIM Ontology LR

Ontologies in GATE are classified as language resources. In order to make use of the ontology implementation included in the main distribution, one needs to load the ‘Ontology Tools’ CREOLE plug-in; this makes a new language resource ‘OWLIM Ontology’ available.

The implementation is based on OWL In Memory (OWLIM), a high-performance semantic repository developed at Ontotext (in Bulgaria) as part of the SEKT project. OWLIM is packaged as a Storage and Inference Layer (SAIL) for the Sesame RDF database. OWLIM uses the TRREE engine to perform RDFS, OWL DLP and OWL Horst reasoning. The most expressive language supported is a combination of limited OWL Lite and unconstrained RDFS. OWLIM offers configurable reasoning support and performance. In the “standard” version of OWLIM (referred to as SwiftOWLIM), reasoning and query evaluation are performed in-memory, while a reliable persistence strategy assures data preservation, consistency and integrity.

OWLIM asks users to provide an XML configuration for the ontology they wish to load into the Sesame RDF database. In order to understand OWL statements, an ontology describing the relations between the OWL constructs and the RDFS schema is imported; for example, owl:Class is a subclass of rdfs:Class. This allows users to load OWL data into the Sesame RDF database.

In order to load an ontology in an OWLIM repository, the user has to provide certain configuration parameters. These include the name of the repository, the URL of the ontology, the default name space, the format of the ontology (RDF/XML, N3, NTriples and Turtle), the URLs or absolute locations of the other ontologies to be imported, their respective name spaces and so on. Ontology files, based on their format, are parsed and persisted in the NTriples format.

In order to combine the power of OWLIM with the simplicity of the GATE Ontology API, GATE provides an implementation of the OWLIM Ontology. Its basic purpose is to hide all the complexities of OWLIM and Sesame and to provide an easy-to-use API and interface to create, load, save and update ontologies. Based on certain parameters that the user provides when instantiating the ontology, a configuration file is dynamically generated to create a dummy repository in memory (unless persistence is specified).

When creating a new ontology, one can use an existing file to pre-populate it with data. If no such file is provided, an empty ontology is created. A detailed description for all the parameters that are available for new ontologies follows:

1. defaultNameSpace is the base URI to be used for all new items that are only mentioned using their local name. This can safely be left empty, in which case, when adding new resources to the ontology, users are asked to provide name spaces for each new resource.

2. As indicated earlier, OWLIM supports four different formats: RDF/XML, NTriples, Turtle and N3. According to the format of the ontology file, the user should select one of the four URL options (rdfXmlURL, ntriplesURL, turtleURL and n3URL (not supported yet)) and provide a URL pointing to the ontology data.

Once an ontology is created, additional data can be loaded that will be merged with the existing information. This can be done by right-clicking on the ontology in the resources tree and selecting ‘Load ... data’ where “...” is one of the supported formats.

Other options available are cleaning the ontology (deleting all the information from it) and saving it to a file in one of the supported formats.

10.4 GATE’s Ontology Editor

GATE’s ontology support also includes a viewer/editor that can be used to navigate an ontology and quickly inspect the information relating to any of the objects defined in it: classes and restrictions, instances and their properties. Resources can also be deleted, and new resources added, through the viewer.

Figure 10.1: The GATE Ontology Viewer

The viewer is divided into two areas. The one on the left shows separate tabs for the hierarchy of classes and instances and (new in GATE 4) for the hierarchy of properties. The view on the right-hand side shows the details pertaining to the object currently selected in the other two.

The first tab in the left view displays a tree showing all the classes and restrictions defined in the ontology. The tree can have several root nodes, one for each top class in the ontology. The same tree also shows each class’s instances. Instances that belong to several classes are shown as children of all the classes they belong to.

The second tab in the left view displays a tree of all the properties defined in the ontology. This tree can also have several root nodes, one for each top property in the ontology. The different types of properties are distinguished by different icons.

Whenever an item is selected in the tree view, the right-hand view is populated with the details appropriate for the selected object. For an ontology class, the details include brief information about the resource (such as the URI and type of the selected class), the set of direct superclasses, the set of all superclasses obtained using the transitive closure, the set of direct subclasses, the set of all subclasses, the set of equivalent classes, the set of applicable property types, the set of property values set on the selected class, and the set of instances that belong to the selected class. For a restriction, in addition to the above information, it displays the property to which the restriction applies and the type of the restriction.

For an instance, the details displayed include the brief information about the instance, set of direct types (the list of classes this instance is known to belong to), the set of all types this instance belongs to (through the transitive closure of the set of direct types), the set of same instances, the set of different instances and the values for all the properties that are set.

When a property is selected, different information is displayed in the right-hand view according to the property type. It includes brief information about the property itself, the set of direct superproperties, the set of all superproperties (obtained through the transitive closure), the set of direct subproperties, the set of all subproperties (obtained through the transitive closure), the set of equivalent properties, and domain and range information.

As mentioned in the description of the data model, properties are not directly linked to classes, but rather define their domain of applicability through a set of domain restrictions. This means that the list of properties should not really be listed as a detail for class objects but only for instances. It is, however, quite useful to have an indication of the types of properties that could apply to instances of a given class. Because of the semantics of property domains, it is not possible to calculate precisely the list of applicable properties for a given class, but only an estimate of it. If, for instance, a property requires its domain instances to belong to two different classes, then it cannot be known with certainty whether it is applicable to either of the two classes: it does not apply to all instances of either class, but only to those instances the two classes have in common. Because of this, such properties will not be listed as applicable to any class.

The information listed in the details pane is organised in sub-lists according to the type of the items. Each sub-list can be collapsed or expanded by clicking on the little triangular button next to the title. The ontology viewer is dynamic and will update the information displayed whenever the underlying ontology is changed through the API.

Double-clicking on any resource in the details table selects that resource in the class or property tree and shows its details in the details table. To change a property value, the user can double-click on the value of the property (second column); the relevant window is then shown, asking the user to provide a new value. Along with each property value, a button (with a red ‘X’ caption) is provided; if the user wants to remove a property value, he or she can click on this button and the property value is deleted.

A new toolbar has been added at the top of the ontology viewer, which contains the following buttons to add and delete ontology resources:

• Add new top class (TC)

• Add new subclass (SC)

• Add new instance (I)

• Add new restriction (R)

• Add new RDF property (R)

• Add new Annotation property (A)

• Add new Datatype property (D)

• Add new Object property (O)

• Add new Symmetric property (S)

• Add new Transitive property (T)

• Remove the selected resource(s) (X)

The tree components allow the user to select more than one node, but the details table on the right-hand side of the GUI only shows the details of the first selected node. The buttons in the toolbar are enabled and disabled based on users’ selection of nodes in the tree.

1. Creating a new top class: A window appears which asks the user to provide details of its namespace (the default name space, if specified) and the class name. If there is already a class with the same name in the ontology, the GUI shows an appropriate message.

2. Creating a new subclass: A class can have multiple superclasses. Selecting multiple classes in the ontology tree and then clicking on the “SC” button therefore automatically takes the selected classes as the superclasses. The user is then asked for details of the namespace and the class name.

3. Creating a new instance: An instance can belong to more than one class. Selecting multiple classes in the ontology tree and then clicking on the “I” button therefore automatically takes the selected classes as the types of the new instance. The user is then prompted to provide details such as the namespace and the instance name.

4. Creating a new restriction: As described above, a restriction is a type of anonymous class, specified on a property with a constraint set on either the number of values it can take or the type of value allowed for instances to have for that property. The user can click on the blue “R” square button, which shows a window for creating a new restriction; the user can select a type of restriction, a property and a value constraint for it. Please note that since restrictions are anonymous classes, the user does not have to specify a URI; restrictions are named automatically by the system.

5. Creating a new property: An RDF property can have any ontology resource as its domain and range, so selecting multiple resources and then clicking on the new RDF Property icon shows a window where the resources selected in the tree are already taken as the domain for the property. The user is then asked to provide information such as the namespace and the property name. Two buttons are provided to select resources for the domain and range; clicking on them brings up a window with a drop-down box containing a list of resources that can be selected for the domain or range, and a list of resources already selected by the user.

• Since an annotation property cannot have any domain or range constraints, clicking on the new annotation property button brings up a dialog that asks the user for information such as the namespace and the annotation property name.

• A datatype property can have one or more ontology classes as its domain and one of the pre-defined datatypes as its range, so selecting one or more classes and then clicking on the new Datatype property icon brings up a window where the classes selected in the tree are already taken as the domain. The user is then asked to provide information such as the namespace and the property name. A drop-down box allows users to select one of the data types from the list.

• Object, symmetric and transitive properties can have one or more classes as their domain and range. For a symmetric property, the domain and range are the same. Clicking on any of these options brings up a window where the user is asked to provide information such as the namespace and the property name. The user is also given two buttons to select one or more classes as values for the domain and range.

6. Removing the selected resources: All the selected nodes are removed when the user clicks on the “X” button. Please note that since ontology resources are related in various ways, deleting a resource can affect other resources in the ontology; for example, deleting a resource can cause other resources in the same ontology to be deleted too.

7. Setting properties on instances/classes: Right-clicking on an instance brings up a menu that provides a list of properties that are inherited and applicable to its classes. Selecting a specific property from the menu allows the user to provide a value for that property. For example, if the property is an Object property, a new window appears which allows the user to select one or more instances that are compatible with the range of the selected property. The selected instances are then set as property values. For classes, all the properties (e.g. annotation and RDF properties) are listed on the menu.

8. Setting relations among resources: Two or more classes, or two or more properties, can be set as equivalent; similarly, two or more instances can be marked as the same. Right-clicking on a resource brings up a menu with the appropriate option (Equivalent Class for ontology classes, Same As Instance for instances and Equivalent Property for properties), which, when clicked, brings up a window with a drop-down box containing a list of resources that the user can select to mark as equivalent or the same.

10.5 Instantiating OWLIM Ontology using GATE API

The following code demonstrates how to use the GATE API to create an instance of OWLIM Ontology.

// step 1: initialize GATE
Gate.init();

// step 2: load the Ontology Tools plugin
File ontoHome = new File(Gate.getPluginsHome(), "Ontology_Tools");
Gate.getCreoleRegister().addDirectory(ontoHome.toURL());

// step 3: set the parameters
// (urlOfTheOntology is a placeholder for the URL of the ontology file)
FeatureMap fm = Factory.newFeatureMap();
fm.put("rdfXmlURL", urlOfTheOntology);

// step 4: finally create an instance of the ontology
Ontology ontology = (Ontology)
    Factory.createResource("gate.creole.ontology.owlim.OWLIMOntologyLR", fm);

// retrieving a list of top classes
Set topClasses = ontology.getOClasses(true);

// for all top classes, printing their direct subclasses
Iterator iter = topClasses.iterator();
while(iter.hasNext()) {
  Set dcs = ((OClass) iter.next()).getSubClasses(OConstants.DIRECT_CLOSURE);
  for(Object c : dcs) {
    System.out.println(((OClass) c).getURI().toString());
  }
}

// creating a new class;
// false indicates that it is not an anonymous URI
URI aURI = new URI("http://sample.en/owlim#Organization", false);
OClass organizationClass = ontology.addOClass(aURI);

// creating a new datatype property called Name,
// with domain set to Organization and datatype set to string
URI dURI = new URI("http://sample.en/owlim#Name", false);
Set domain = new HashSet();
domain.add(organizationClass);
DatatypeProperty dp = ontology.addDatatypeProperty(dURI, domain,
    Datatype.getStringDataType());

// creating a new instance of the class Organization, called IBM
URI iURI = new URI("http://sample.en/owlim#IBM", false);
OInstance ibm = ontology.addOInstance(iURI, organizationClass);

// assigning a value for the datatype property Name to ibm
ibm.addDatatypePropertyValue(dp, new Literal("IBM Corporation",
    dp.getDataType()));

// get all the values set for all datatype properties on the instance ibm
Set dps = ontology.getDatatypeProperties();
for(Object p : dps) {
  DatatypeProperty aProperty = (DatatypeProperty) p;
  List values = ibm.getDatatypePropertyValues(aProperty);
  System.out.println("DP: " + aProperty.getURI().toString());
  for(Object v : values) {
    Literal l = (Literal) v;
    System.out.println("Value: " + l.getValue());
    System.out.println("Datatype: "
        + l.getDataType().getXmlSchemaURI().toString());
  }
}

// export the data to a file in the NTriples format
BufferedWriter writer = new BufferedWriter(new FileWriter(someFile));
String output = ontology.getOntologyData(
    OConstants.ONTOLOGY_FORMAT_NTRIPLES);
writer.write(output);
writer.flush();
writer.close();

10.6 Ontology-Aware JAPE Transducer

One of the GATE components that makes use of the ontology support is the JAPE transducer (see Chapter 7). Combining the power of ontologies with JAPE’s pattern matching mechanisms can ease the creation of applications.

In order to use ontologies with JAPE, one needs to load an ontology in GATE before loading the JAPE transducer. Once the ontology is known to the system, it can be set as the value for the optional ontology parameter of the JAPE grammar. Doing so slightly alters the way matching occurs when the grammar is executed. If a transducer is ontology-aware (i.e. it has a value set for the ‘ontology’ parameter), it will treat all occurrences of the feature named class differently from the other annotation features. The values of the class feature on any type of annotation will be considered to be the names of classes belonging to the ontology, and the matching between two values will be based not on simple equality but rather on hierarchical compatibility. For example, if the ontology contains a class named ‘Politician’, which is a subclass of the class ‘Person’, then the pattern {Entity.class == "Person"} will successfully match an annotation of type Entity with a class feature having the value “Politician”. If the JAPE transducer were not ontology-aware, such a test would fail.

This behaviour allows a larger degree of generalisation when designing a set of rules. Rules that apply to several types of entities mentioned in the text can be written using the most generic class they apply to, and need not be repeated for each subtype of entity. One could have rules applying to Locations without needing to know whether a particular location happens to be a country or a city.

If a domain ontology is available at the time of building an application, using it in conjunction with the JAPE transducers can significantly simplify the set of grammars that need to be written.
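The same facility is available to embedded applications. A hedged sketch follows, assuming the standard JAPE transducer resource (gate.creole.Transducer) and its grammarURL parameter; the ontology is passed through the optional ontology parameter described above.

// sketch: creating an ontology-aware JAPE transducer from code;
// the resource name and the grammarURL parameter are assumptions
// based on the standard JAPE transducer PR
FeatureMap params = Factory.newFeatureMap();
params.put("grammarURL", grammarUrl); // URL of the .jape grammar file
LanguageAnalyser transducer = (LanguageAnalyser)
    Factory.createResource("gate.creole.Transducer", params);

// make the transducer ontology-aware by setting the optional parameter
transducer.setParameterValue("ontology", ontology);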

The ontology does not normally affect actions on the right hand side of JAPE rules, but when Java is used on the right hand side, then the ontology becomes accessible via a local variable named ontology, which may be referenced from within the right-hand-side code.

In Java code, the class feature should be referenced using the static final variable LOOKUP_CLASS_FEATURE_NAME defined in gate.creole.ANNIEConstants.

10.7 Annotating text with Ontological Information

The ontology-aware JAPE transducer enables the text to be linked to classes in an ontology by means of annotations. Essentially this means that each annotation can have a class and an ontology feature. Adding the relevant class feature to an annotation is very easy: simply add a feature “class” with the class name as its value. To add the relevant ontology, use ontology.getURL().

Below is a sample rule which looks for a Location annotation and identifies it as a “Mention” annotation with the class “Location” and the ontology loaded with the ontology-aware JAPE transducer (via the runtime parameter of the transducer).

Rule: Location

({Location}):mention

-->
{
  // get the annotation set consisting of all the annotations for the tag
  gate.AnnotationSet mentionSet = (gate.AnnotationSet)bindings.get("mention");

  // create the ontology and class features
  FeatureMap features = Factory.newFeatureMap();
  features.put("ontology", ontology.getURL());
  features.put("class", "Location");

  // create the new annotation
  annotations.add(mentionSet.firstNode(), mentionSet.lastNode(), "Mention",
      features);
}

10.8 Populating Ontologies

Another typical application that combines the use of ontologies with NLP techniques is finding mentions of entities in text. The scenario is that one has an existing ontology and wants to use Information Extraction to populate it with instances whenever entities belonging to classes in the ontology are mentioned in the input texts.

Let us assume we have an ontology and an IE application that marks the input text with annotations of type ‘Mention’ having a feature ‘class’ that specifies the class of the entity mentioned. The task we are seeking to solve is to add an instance to the ontology for every Mention annotation.

The example presented here is based on a JAPE rule that uses Java code on the action side in order to access directly the ontology API:

 1 Rule: FindEntities
 2 ({Mention}):mention
 3 -->
 4 {
 5   // find the annotation matched by the LHS;
 6   // we know the annotation set returned
 7   // will always contain a single annotation
 8   Annotation mentionAnn = (Annotation)
 9     ((AnnotationSet)bindings.get("mention")).
10     iterator().next();
11
12   // find the class of the mention
13   String className = (String)mentionAnn.getFeatures().
14     get(gate.creole.ANNIEConstants.LOOKUP_CLASS_FEATURE_NAME);
15
16   // find the text covered by the annotation
17   String mentionName = doc.getContent().
18     getContent(
19       mentionAnn.getStartNode().getOffset(),
20       mentionAnn.getEndNode().getOffset()).
21     toString();
22
23   // add the instance to the ontology;
24   // first identify the class
25   gate.creole.ontology.OClass aClass = ontology.getClassByName(className);
26   if(aClass == null) {
27     System.err.println("Error: class \"" + className
28       + "\" does not exist!");
29   }
30   // now create the instance in the ontology
31   ontology.addInstance(mentionName, aClass);
32 }

This will match each annotation of type Mention in the input and assign it to a label ‘mention’. That label is then used in the right-hand side to find the annotation that was matched by the pattern (lines 5–10); the value of the class feature of the annotation is used to identify the ontological class name (lines 12–14); and the annotation span is used to extract the text covered in the document (lines 16–21). Once all these pieces of information are available, the addition to the ontology can be done: first the right class in the ontology is identified using the class name (lines 24–29), and then a new instance of that class is created (lines 30–31).

Besides JAPE, another tool that could play a part in this application is the Ontological Gazetteer (see Section 5.2), which can be useful in bootstrapping the IE application that finds entity mentions.

The solution presented here is purely pedagogical, as it does not address many issues that would be encountered in a real-life application solving the same problem. For instance, when an entity mention is identified in the text, the application would have to check whether the entity mentioned is already known to the ontology, and only add a new instance when it is not found. It is also naïve to assume that the name of the entity will be exactly the text found in the document. In many cases entities have several aliases; for example, the same person name can be written in a variety of forms depending on whether titles, first names or initials are used. A process of name normalisation would probably need to be employed to make sure that the same entity, regardless of the textual form in which it is mentioned, will always be linked to the same ontology instance.

For a detailed description of the ontology API, please consult the JavaDoc documentation.

Figure 10.2: Viewing Ontology-Based Annotations

10.9 Ontology-based Corpus Annotation Tool

The Ontology-based Corpus Annotation Tool (OCAT) is a GATE plugin available from the Ontology Tools plugin set, which enables a user to manually annotate a text with respect to one or more ontologies. The required ontology must be selected from a pull-down list of available ontologies.

The OCAT tool supports annotation with information about the ontology classes, instances and properties.

10.9.1 Viewing Annotated Texts

Ontology-based annotations in the text can be viewed by selecting in the ontology tree the desired classes or instances (see Figure 10.2). By default, when a class is selected, all of its sub-classes and instances are also automatically selected and their mentions are highlighted in the text. There is an option to disable this default behaviour (see Section 10.9.4).

Figure 10.2 shows the mentions of each class and instance in a different colour. These colours can be customised by the user by clicking on the class/instance names in the ontology tree. It is also possible to expand and collapse branches of the ontology.

Figure 10.3: Editing Existing Annotations

10.9.2 Editing Existing Annotations

In order to view the class/instance of a highlighted annotation in the text (e.g., United States - see Figure 10.3), hover the mouse over it and an edit dialogue will appear. It shows the current class or instance (Country in our example) and allows the user to delete it or change it. To delete an existing annotation, press the Delete button.

A class or instance can be changed by starting to type the name of the new class in the combo-box. A list is then displayed of the available classes and instances which start with the typed string. For example, if we want to change the type from Country to Location, we can type “Lo” and all classes and instances whose names start with Lo will be displayed. The more characters are typed, the fewer matching classes remain in the list. As soon as the desired class appears in the list, it can be chosen by clicking on it.

It is possible to apply the changes to all occurrences of the same string and the same previous class/instance, not just to the current one. This is useful when annotating long texts. The user needs to make sure that they still check the classes and instances of annotations further down in the text, in case the same string has a different meaning (e.g., bank as a building vs. bank as a river bank).

The edit dialogue also allows correcting annotation offset boundaries. In other words, the user can expand or shrink the annotation offsets’ boundaries by clicking on the relevant arrow buttons.

Figure 10.4: Add New Annotation

OCAT also allows users to assign property values as annotation features to existing class and instance annotations. In the case of class annotations, all annotation properties from the ontology are displayed in the table. In the case of instance annotations, all properties from the ontology applicable to the selected instance are shown in the table. The table also shows the existing features of the selected annotation. The user can then add, delete or edit any value(s) of the selected feature. In the case of a property, the user is allowed to provide an arbitrary number of values; by clicking on the editList button, the user can add, remove or edit any value of the property. In the case of object properties, users are only allowed to select values from a pre-selected list of values (i.e. instances which satisfy the selected property’s range constraints).

10.9.3 Adding New Annotations

New annotations can be added in two ways: using a dialogue (see Figure 10.4) or by selecting the text and clicking on the desired class or instance in the ontology tree.

When adding a new annotation using the dialogue, select the text and, after a very short while, if the mouse is not moved, a dialogue will appear (see Figure 10.4). Start typing the name of the desired class or instance until you see it listed in the combo-box, then select it with the mouse. This operation is the same as when changing the class/instance of an existing annotation. One has the option of applying this choice to the current selection only, or to all mentions of the selected string in the current document (the Apply To All check box).

Figure 10.5: Tool Options

The user can also create an instance from the selected text. If the user checks the “create instance” checkbox prior to selecting the class, the selected text is annotated with the selected class and a new instance of that class (with its name equivalent to the selected text) is created, provided no instance with that name already exists in the ontology.

10.9.4 Options

There are several options that control the OCAT behaviour (see Figure 10.5):

• Disable child feature: By default, when a class is selected, all of its sub-classes are also automatically selected and their mentions are highlighted in the text. This option disables that behaviour, so only mentions of the selected class are highlighted.

• Delete confirmation: By default, OCAT deletes ontological information without asking for confirmation when the delete button is pressed. However, if this leads to too many mistakes, it is possible to enable delete confirmations using this option.

• Disable Case-Sensitive Feature: When the user decides to annotate all occurrences of the selected text in the document (the “apply to all” option) and the “disable case-sensitive feature” option is selected, the tool ignores case when searching for identical strings in the document text.

• Annotation Set: GATE stores information in annotation sets and OCAT allows you to select which set to use as input and output.

• Annotation Type: By default, this is the annotation type Mention, but it can be changed to any other name. This option is required because OCAT uses GATE annotations to store and read the ontological data; to do that, it needs an annotation type (i.e. a name) so that ontology-based annotations can easily be distinguished from other annotations (e.g. tokens, gazetteer lookups).

Chapter 11

Machine Learning API

This chapter describes a new machine learning layer in GATE. The current implementation mainly addresses three types of NLP learning, namely chunk recognition (e.g. named entity recognition), text classification and relation extraction, which between them cover many NLP learning problems. The implementation for chunk recognition is based on our work using support vector machines (SVM) for information extraction [Li et al. 05a]. Text classification is based on our work on opinionated sentence classification and patent document classification (see [Li et al. 07a] and [Li et al. 07b], respectively). Relation extraction is based on our work on named entity relation extraction [Wang et al. 06].

In addition, given a set of documents, the machine learning API can produce several feature files containing, respectively, linguistic features and feature vectors, along with labels if there are any in the documents. These feature files are in text format and can be used outside of GATE. Hence users can use the features off-line for their own purposes, e.g. evaluating new learning algorithms.

The primary learning algorithm implemented is SVM, which has achieved state-of-the-art performance on many NLP learning tasks. The training of the SVM uses the Java version of the SVM package LibSVM [CC001]; the application of the trained SVM models is our own implementation. Moreover, the ML implementation provides an interface to the open-source machine learning package Weka [Witten & Frank 99], and can therefore use the machine learning algorithms implemented in Weka. Three widely used learning algorithms, the naive Bayes method, KNN and the decision tree algorithm C4.5, are available in the current implementation.

In order to use the machine learning (ML) API, the user mainly has to do three things. First, the user has to annotate some documents with the labels that s/he wants the learning system to annotate in new documents; those label annotations should be GATE annotations. Secondly, the user may need to pre-process the documents to obtain the linguistic features for the learning; again, these features should be in the form of GATE annotations. GATE’s ANNIE plug-in would be very helpful for producing the linguistic features, and other plug-ins such as the NP Chunker and parsers may also be helpful. Finally, the user has to create a configuration file for setting up the ML API, e.g. selecting the learning algorithm and defining the linguistic features used for learning. Note that the user need not create the configuration file from scratch; instead, s/he can copy one of the three example files presented below and modify it for the particular problem.

The rest of the chapter is organised as follows. Section 11.1 explains ML in general and its specification in GATE. Section 11.2 describes all the configuration settings of the ML API one by one, in particular all the elements of the configuration file for setting up the ML API (e.g. the learning algorithm to be used and the options for the learning) and for defining the NLP features for the user’s particular problem. Section 11.3 presents three example settings, one for each of the three types of NLP learning problem, to illustrate the usage of this ML plug-in. Section 11.4 lists the steps for using the ML API. Finally, Section 11.5 explains the outputs of the ML API for the four usage modes, namely training, application, evaluation and producing feature files only, and in particular the format of the feature files and the label list file produced by the ML API.

11.1 ML Generalities

There are two main types of ML, supervised learning and unsupervised learning. Supervised learning is more effective and much more widely used in NLP. Classification is a particular example of supervised learning, in which the set of training examples is split into multiple subsets (classes) and the algorithm attempts to distribute new examples into the existing classes. This is the type of ML that is used in GATE, and all further references to ML actually refer to classification.

An ML algorithm “learns” about a phenomenon by looking at a set of occurrences of that phenomenon that are used as examples. Based on these, a model is built that can be used to predict characteristics of future (and unforeseen) examples of the phenomenon.

An ML implementation has two modes of functioning: training and application. The training phase consists of building a model (e.g. a statistical model, a decision tree, a rule set) from a dataset of already classified instances. During application, the model built during training is used to classify new instances.

The ML API in GATE is designed particularly for NLP learning. It can be used for the three types of NLP learning problem, text classification, chunk recognition and relation extraction, which cover many NLP learning tasks.

• Text classification classifies text into pre-defined categories. The text units can be at different levels, such as documents, sentences or tokens. Typical examples of text classification are document classification, opinionated sentence recognition, POS tagging of tokens, and word sense disambiguation.

• Chunk recognition often consists of two steps. First it identifies the chunks of interest in the text; it then assigns a label (or labels) to the extracted chunks. In some cases only the first step is needed. Examples of chunk recognition include named entity recognition (and information extraction generally), NP chunking, and Chinese word segmentation.

• Relation extraction determines whether or not a pair of terms from the text stands in one of a set of pre-defined relations. Two examples are named entity relation extraction and co-reference resolution.

From the ML point of view, the three types of NLP learning typically use different linguistic features and feature representations. For example, it has been recognised that for text classification the so-called tf ∗ idf representation of the n-grams in the text is very effective with learning algorithms such as SVM. For chunk recognition, identifying the first token and the last token of a chunk, using the linguistic features of the token itself and of the surrounding tokens, is both effective and efficient. Relation extraction needs to consider both the linguistic features of each of the two terms involved in the relation and features combining the two terms.
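For reference, one common variant of this weighting (others exist) scores an n-gram term t in a text unit d as tf ∗ idf(t, d) = tf(t, d) × log(N/df(t)), where tf(t, d) is the number of times t occurs in d, df(t) is the number of text units in which t occurs, and N is the total number of text units.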

The ML API implements suitable feature representations for the three types of NLP learning. It also implements Java code or wrappers for the widely used ML algorithms SVM, KNN, Naive Bayes and the decision tree algorithm C4.5. In addition, given some documents, it can produce the NLP features and the feature vectors and store them in files, so users can use those features to evaluate their own learning algorithms on some NLP learning task, or for any other further processing.

The rest of this section explains some basic definitions in ML and their specification in this GATE plug-in.

11.1.1 Some definitions

• instance: an example of the studied phenomenon. An ML algorithm learns a model from a set of known instances, called a (training) dataset. It can then apply the learned model to another (application) dataset.

• attribute: a characteristic of the instances. Each instance is defined by the values of its attributes. The set of possible attributes is well defined and is the same for all instances in the training and application datasets.

• class: an attribute for which the values are available in the training dataset for learning and need to be found in the application dataset through the ML mechanism.

11.1.2 GATE-specific interpretation of the above definitions

• instance: an annotation. In order to use ML in GATE the user needs to choose the type of annotation used as instances. Token annotations are a good candidate for many NLP learning tasks such as information extraction and POS tagging, but any type of annotation could be used (e.g. annotations created by a previously run JAPE grammar, such as sentence annotations and document annotations for sentence and document classification, respectively).

• attribute: the value of a named feature of a particular annotation type. The annotation providing the attribute can either (partially) cover the instance annotation under consideration, or cover another instance annotation related to it. The value of the attribute can thus refer to the current instance, to an instance at a specified position relative to the current instance, or to an instance having a special relation with the current instance.

• class: any attribute referring to the current instance can be marked as the class attribute.

11.2 The Batch Learning PR in GATE

Access to ML implementations is provided in GATE by the “Batch Learning PR”, which handles the four usage modes: training an ML model, applying it, evaluating learning on GATE documents1, and producing feature files only. This PR is a Language Analyser, so it can be used in all default types of GATE controller.

In order to allow for more flexibility, all the configuration parameters for the PR are set through one external XML file, except the learning mode, which is selected through the normal PR parameterisation. The XML file contains both the configuration parameters of the ML API itself and the linguistic data definitions (namely the definitions of the instances and attributes) used by the ML API. The XML file must be specified when loading the ML API plug-in into GATE.

The parent directory of the XML configuration file is called the working directory. A subdirectory of the working directory named “savedFiles” is created if it does not already exist when the ML API is loaded. All the files produced by the ML API, including the NLP feature files, the label list file, the feature vector file and the learned model file, are stored in that subdirectory, as is the log file recording the information of each learning session.

In the following we first describe the few settings that are specified as parameters of the ML API plug-in, and then the settings specified in the configuration file.

1For the evaluation mode the system divides the corpus into two parts, uses one part to learn a model, applies the model to the other part, and finally outputs the results on the testing part (see Sub-section 11.2.2 for the evaluation methods and Section 11.5 for an explanation of the evaluation results).

11.2.1 The settings not specified in the configuration file

For the sake of convenience, a few settings are not specified in the configuration file. Instead, the user should specify them as initialisation or run-time parameters of the ML API plug-in, as with many other PRs.

• URL (or path and name) of the configuration file. The user is required to give the URL of the configuration file when loading the ML API into the GATE GUI. The configuration file should be in XML format, with the extension .xml. It contains most of the learning settings and is explained in detail in the next sub-section.

• Corpus. This is a run-time parameter, meaning that the user should specify it before running the ML API in a session. The corpus contains the documents used as learning or application data. The documents should include all the annotations specified in the configuration file, except the annotation for the class attribute. The annotations for the class attribute must be available in the documents used for training or evaluation, but need not be present in the documents to which the ML API applies the learned model.

• learningMode. This is a run-time parameter of an enumerated type. It can be set to one of four values, “TRAINING”, “APPLICATION”, “EVALUATION” and “ProduceFeatureFilesOnly”, corresponding to the four learning modes of the ML API. The default learning mode is “TRAINING”.

– In training mode, the ML API learns from the data provided and saves the models into a file called “learnedModels.save” under the sub-directory “savedFiles” of the working directory.
– If the user wants to apply the learned model to the data, s/he should select application mode. In application mode, the ML API reads the learned model from the file “learnedModels.save” in the subdirectory “savedFiles” and applies the model to the data.
– In evaluation mode, the ML API performs k-fold or hold-out test set evaluation on the corpus provided (the evaluation method is specified in the configuration file, see below), and outputs the evaluation results to the Messages Window of the GATE GUI and to the log file.
– If the user only wants the ML API to produce the NLP feature data and feature vector data, without training or applying a learned model, s/he should select the ProduceFeatureFilesOnly mode. The feature files that the ML API produces are explained in detail in Sub-section 11.5.4. Otherwise (namely if the user wants the whole learning procedure), select one of the other learning modes.

11.2.2 All the settings in the XML configuration file

The root element of the XML configuration file must be called “ML-CONFIG”, and it must contain two elements, “DATASET” and “ENGINE”, plus, optionally, a number of other settings. In the following we first describe the optional settings, then the “ENGINE” element, and finally the “DATASET” element. In the next section some examples of the XML configuration file are given for illustration.

The optional settings in the configuration file

The ML API provides a variety of optional settings which facilitate different tasks. Every optional setting has a default value; if an optional setting is not specified in the configuration file, the ML API adopts its default value. Each of the following optional settings can be set as an element in the XML configuration file.

• Surround mode. Set its value to “true” if you want the ML API to learn only the start token and the end token of each chunk, which for chunk learning tasks such as named entity recognition often gives better performance than learning every token of the chunk. For classification problems and relation extraction, set its value to “false”. The corresponding element in the configuration file is
<SURROUND value="X"/>
where the variable X has two possible values, “true” or “false”. The default value is “false”.

• FILTERING. In some applications the user may want to filter out some training examples from the original training data before running the learning algorithm. The filtering option allows the user to remove some examples without class labels (called negative examples) from the training set. The negative examples are selected according to their distance to an SVM classification hyper-plane learned from the original training data for separating the positive examples from the negative ones. If the item dis is set to “near”, the ML API selects and removes the negative examples closest to the SVM hyper-plane. If it is set to “far”, the negative examples furthest from the SVM hyper-plane are removed. The value of the item ratio determines what fraction of the negative examples is filtered out. The element in the configuration file is
<FILTERING ratio="X" dis="Y"/>
where X is a number between 0 and 1 and Y can be set to “near” or “far”. If the filtering element is not present in the configuration file, or the value of ratio is set to 0.0, the ML API does no filtering. The default value of ratio is 0.0. The default value of dis is “far”.

• Evaluation setting. As said above, if the learning mode parameter learningMode is set to “EVALUATION”, the ML API performs evaluation on the corpus. Basically, it splits the documents in the corpus into two parts, a training dataset and a testing dataset, learns a model from the training dataset, applies the model to the testing dataset, and finally compares the annotations assigned by the model with the true annotations of the testing data, outputting evaluation results such as the overall F-measures. The evaluation setting element specifies the method of splitting the corpus. The item method determines which method to use for evaluation; currently two commonly used methods are implemented, namely k-fold cross-validation and the hold-out test2. The value of the item runs specifies the number “k” for k-fold cross-validation, or the number of runs for the hold-out test. The value of the item ratio specifies the fraction of the data used for training in the hold-out test method. The element in the configuration file is
<EVALUATION method="X" runs="Y" ratio="Z"/>
where the variable X has two possible values, “kfold” and “holdout”, Y is a positive integer, and Z is a float number between 0 and 1. The default value of method is “holdout”, the default value of runs is “1”, and the default value of ratio is “0.66”.

• multiClassification2Binary. In many cases an NLP learning problem can be transformed into a multi-class classification problem. On the other hand, some learning algorithms, such as the SVM, are binary classifiers. The ML API implements two common methods for converting a multi-class problem into several binary classification problems, namely one against others and one against another one3. The user can select one of the two methods by specifying the value of the item method of the element, which is
<multiClassification2Binary method="X"/>
where the variable X has two values, “one-vs-others” and “one-vs-another”. The default method is the one-vs-others method. If the configuration file does not have the element, or the item method is missing, the ML API uses the one-vs-others method.

• The parameter thresholdProbabilityBoundary sets the threshold on the probability of a start (or end) token for chunk learning. It is used in post-processing the learning results: only boundary tokens whose confidence is above the threshold are selected as entity candidates. The element in the configuration file is
<PARAMETER name="thresholdProbabilityBoundary" value="X"/>
where X is between 0 and 1. The default value is 0.4.
2For the k-fold cross-validation the system first sorts the documents into a list by alphabetic order of the documents' names, then segments the list into k partitions of equal size, and finally uses each partition in turn as the testing set, with the other documents as the training set. For the hold-out test, the system randomly selects some documents as testing data and uses all the other documents as training data.
3The two methods may appear under different names in some publications, but the methods themselves are the same. Suppose we have a multi-class classification problem with n classes. For the one against others method, one binary classification problem is derived for each of the n classes, with the examples belonging to the class in question as positive examples and all other examples in the training set as negative examples. In contrast, for the one against another one method, one binary classification problem is derived for each pair (c1, c2) of the n classes, in which the training examples belonging to class c1 are the positive examples and those belonging to class c2 are the negative examples.

• The parameter thresholdProbabilityEntity sets the threshold on the probability of a chunk (namely the product of the probabilities of the start token and end token of the chunk) for chunk learning. Only entities whose confidence is above the threshold are selected as candidates for further post-processing. The element in the configuration file is
<PARAMETER name="thresholdProbabilityEntity" value="X"/>
where X is between 0 and 1. The default value is 0.2.

• The threshold parameter thresholdProbabilityClassification plays the same role for classification (e.g. text classification and relation extraction tasks), whereas the above two probabilities are for the chunk recognition task. The corresponding element in the configuration file is
<PARAMETER name="thresholdProbabilityClassification" value="X"/>
where X is between 0 and 1. The default value is 0.5.

• IS-LABEL-UPDATABLE is a Boolean parameter. If its value is set to “true”, the label list is updated from the labels in the training data. Otherwise, a pre-defined label list is used and cannot be updated from the training data. The configuration element is
<IS-LABEL-UPDATABLE value="X"/>
where X is “true” or “false”. The default value is “true”.

• IS-NLPFEATURELIST-UPDATABLE is a Boolean parameter. If its value is set to “true”, the NLP feature list is updated from the features in the training or application data. Otherwise, a pre-defined NLP feature list is used and cannot be updated. The configuration element is
<IS-NLPFEATURELIST-UPDATABLE value="X"/>
where X is “true” or “false”. The default value is “true”.

• The parameter VERBOSITY specifies the maximal verbosity level of the output of the system, both to the Messages Window of the GATE GUI and to the log file. There are currently three verbosity levels: level 0 only allows the output of warning messages, level 1 also outputs important setting information and the results in evaluation mode, and level 2 is for debugging purposes. The element in the configuration file is
<VERBOSITY level="X"/>
where X can be set to 0, 1 or 2. The default value is 1.
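Putting several of these together, the skeleton of a configuration file for a chunk learning task might look like the following sketch (the values shown are illustrative only; the ENGINE and DATASET elements described below are still required):

<ML-CONFIG>
  <SURROUND value="true"/>
  <FILTERING ratio="0.1" dis="near"/>
  <EVALUATION method="kfold" runs="10"/>
  <PARAMETER name="thresholdProbabilityBoundary" value="0.4"/>
  <VERBOSITY level="1"/>
  <!-- the ENGINE and DATASET elements described below go here -->
</ML-CONFIG>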

The ENGINE element

The ENGINE element specifies which particular ML algorithm will be used and also allows the setting of options for that algorithm.

The ENGINE element in the configuration file has the form
<ENGINE nickname="X" implementationName="Y" options="Z"/>

It has three items:

• nickname can be the normal name of the learning algorithm, or whatever the user wants it to be.

• implementationName refers to the implementation of the particular learning algorithm that the user wants to use. Its value must be exactly the same as the name defined in the ML API for the particular algorithm. Currently it can be one of the following values.

– SVMLibSvmJava, the binary classification SVM algorithm implemented in the Java version of the SVM package LibSVM.
– NaiveBayesWeka, the Naive Bayes learning algorithm implemented in Weka.
– KNNWeka, the K nearest neighbour (KNN) algorithm implemented in Weka.
– C4.5Weka, the decision tree algorithm C4.5 implemented in Weka.

• Options: the value of this item, which depends on the particular learning algorithm, is passed verbatim to the ML engine used. If an option is missing from the specification, or the options item is not present in the configuration file at all, the default settings of the learning algorithm are used.

– The options for SVMLibSvmJava are similar to those for LibSVM, with the differences described below. Moreover, SVMLibSvmJava implements the uneven margins SVM algorithm (see [Li & Shawe-Taylor 03]), and hence also takes the uneven margins parameter as an option. The LibSVM options for SVMLibSvmJava, and some other important settings, are as follows:
∗ -s svm type, the type of the SVM (default value 0). Since only the binary classification SVM algorithm is implemented, always set it to 0 (or do not specify this item, and the default value is used).
∗ -t kernel type; the kernel type should be 0 for a linear kernel, 1 for a polynomial kernel, 2 for a radial kernel, and 3 for a sigmoid function kernel. The default value is 0.
∗ -d degree, the degree of the polynomial kernel, e.g. 2 for a quadratic kernel. The default value is 3.
∗ -c cost, the cost parameter C of the SVM. The default value is 1.
∗ -m cachesize, the cache memory size in MB (default 100).
∗ -tau value, the value of the uneven margins parameter of the SVM. τ = 1 corresponds to the standard SVM. If the training data has just a small number of positive examples and a large number of negative examples, which occurs in many NLP learning problems, particularly when only a few documents are used for training, setting the parameter τ to a value less than 1 (e.g. τ = 0.4) often results in a better F-measure than the standard SVM (see [Li & Shawe-Taylor 03]).

– The KNN algorithm has one option, the number of neighbours used, set via “-k X”. The default value is 1.
– There are currently no options for the Naive Bayes and C4.5 algorithms.
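For instance, an ENGINE element selecting the uneven margins SVM, or one selecting KNN with five neighbours, could be written as follows (a sketch; a configuration file contains just one ENGINE element, and the option string is passed verbatim to the engine):

<ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
        options=" -c 0.7 -t 0 -m 100 -tau 0.4 "/>

<ENGINE nickname="KNN" implementationName="KNNWeka"
        options=" -k 5 "/>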

The DATASET element

The DATASET element defines the type of annotation to be used as instance and the set of attributes that characterise all the instances.

An “INSTANCE-TYPE” sub-element is used to select the annotation type to be used for instances, and the attributes are defined by a sequence of attribute elements.

For example, if an “INSTANCE-TYPE” has “Token” as its value, there will be one instance in the document per “Token” annotation. This also means that the positions (see below) are defined in relation to Tokens. The “INSTANCE-TYPE” can be seen as the basic unit to be taken into account for machine learning.

Different NLP learning tasks may use different instance types and different kinds of attribute elements. Chunk recognition often uses the token as the instance type, with the linguistic features of “Token” and other annotations as features. For text classification the instance type is the text unit to be classified, e.g. the whole document, a sentence, or a token; when classifying a sequence of tokens, the n-gram representation of the tokens is often a good feature representation for many statistical learning algorithms. For relation extraction, the instance type is a pair of terms in the candidate relation, and the features come not only from the linguistic features of each of the two terms but also from features relating the two terms.

Each DATASET element should define an instance type sub-element, an “ATTRIBUTE” or “ATTRIBUTE_REL” sub-element marked as class, and some sub-elements for linguistic features. Note that all the annotation types involved in the dataset definition must be in the same annotation set. In the following we explain the sub-elements one by one; please also refer to the example configuration files presented in the next section for examples of dataset definitions. Note that each sub-element requiring a name should have a unique name, different from the names of the other sub-elements, unless we explicitly state otherwise.

• The instance type sub-element is defined as
<INSTANCE-TYPE>X</INSTANCE-TYPE>
where X is the annotation type used as the instance unit for learning. For relation extraction, the user also has to specify the two arguments of the relation, as
<INSTANCE-ARG1>A</INSTANCE-ARG1>
<INSTANCE-ARG2>B</INSTANCE-ARG2>
which name two features of the instance type. The values of A and B should be features whose values identify the first and second terms of the relation, respectively.

• An ATTRIBUTE element has the following sub-elements:

– NAME, the name of the attribute.
– SEMTYPE, the type of the attribute value; it can be “NOMINAL” or “NUMERIC”. Currently only nominal is implemented.
– TYPE, the annotation type used to extract the attribute.
– FEATURE; the value of the attribute will be the value of the named feature on the annotation of the specified type.
– POSITION, the position of an instance annotation relative to the current instance annotation. The instance annotation is used for extracting the feature defined in this element4. The default value of the parameter is 0.
– CLASS, an empty element <CLASS/> used to mark the class attribute. There can only be one attribute marked as class in a dataset definition.

Note that, in the current implementation, the component value in the feature vector for one attribute feature is 1 if the attribute’s position p is 0. Otherwise its value is 1.0/|p|.
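For example, an ATTRIBUTE taking the string of the current Token, as used in the information extraction example of Section 11.3, can be written as:

<ATTRIBUTE>
  <NAME>Form</NAME>
  <SEMTYPE>NOMINAL</SEMTYPE>
  <TYPE>Token</TYPE>
  <FEATURE>string</FEATURE>
  <POSITION>0</POSITION>
</ATTRIBUTE>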

• An ATTRIBUTELIST element is similar to ATTRIBUTE, except that it has no POSITION sub-element but a RANGE sub-element instead. It is converted into several ATTRIBUTEs with positions ranging from the value of the attribute “from” to the value of the attribute “to”, and can be used to avoid duplicating ATTRIBUTE elements.
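For example, the following sketch declares token-string attributes for the current token and a window of five tokens on either side:

<ATTRIBUTELIST>
  <NAME>Form</NAME>
  <SEMTYPE>NOMINAL</SEMTYPE>
  <TYPE>Token</TYPE>
  <FEATURE>string</FEATURE>
  <RANGE from="-5" to="5"/>
</ATTRIBUTELIST>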

• An NGRAM feature is used for characterising an annotation which contains more than one instance annotation, e.g. a sentence consisting of many tokens. It has the following sub-elements.

– NAME, the name of the n-gram.
– NUMBER, the “n” of the n-gram: value 1 for a uni-gram, 2 for a bi-gram, etc.
– CONSNUM, the number of features used for each term of the n-gram. Given a value “k” of CONSNUM, the NGRAM element should have “k” CONS-X sub-elements, where X = 1, ..., k. Each CONS-X element has one TYPE sub-element and one FEATURE sub-element, which together define one feature of the term in the n-gram.

4Given the current instance annotation A, the position information is used to determine another instance annotation B. The linguistic feature for the instance A obtained by this element is the feature of the annotation C which is defined by this element and overlaps with the annotation B. Position 0 means that the instance annotation B is A itself, −1 means that B is the first annotation preceding A, and 1 means that B is the first annotation following A. Please note that, if the annotation A contains more than one annotation C (e.g. C refers to tokens and A refers to a named entity containing more than one token), then there is more than one annotation C at position 0 relative to the annotation A, in which case the current implementation just picks the first one. Therefore, if there is a possibility that the annotation A contains more than one annotation C, it is better to use e.g. an NGRAM feature to characterise the instances than an ATTRIBUTE feature.

– The WEIGHT sub-element specifies a weight for the n-gram feature. The n-gram part of the feature vector for one instance is normalised; if a user wants to adjust the contribution of the n-gram to the whole feature vector, s/he can do so by setting this WEIGHT parameter5. Every component of the n-gram part of the feature vector is then multiplied by the parameter. The default value of the parameter is 1.0.
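For example, the uni-gram over token root and category used in the sentence classification example of Section 11.3 is declared as follows:

<NGRAM>
  <NAME>Sent1gram</NAME>
  <NUMBER>1</NUMBER>
  <CONSNUM>2</CONSNUM>
  <CONS-1>
    <TYPE>Token</TYPE>
    <FEATURE>root</FEATURE>
  </CONS-1>
  <CONS-2>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
  </CONS-2>
  <WEIGHT>10.0</WEIGHT>
</NGRAM>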

• The FEATURES-ARG16 element defines the features related to the first argument of the relation, for relation learning. It should include one ARG sub-element referring to the GATE annotation of the argument (see below for a detailed explanation). It may include other sub-elements, such as ATTRIBUTE, ATTRIBUTELIST and/or NGRAM, to define the linguistic features related to the argument.

• The FEATURES-ARG2 element defines the features related to the second argument of the relation. Like FEATURES-ARG1, it should include one ARG sub-element, and it may also include other sub-elements for linguistic features. Note that the ARG sub-element in FEATURES-ARG2 must have a name different from that of the ARG sub-element in FEATURES-ARG1. The other sub-elements, however, may have the same names as the corresponding ones in FEATURES-ARG1 if they refer to the same annotation type and feature in the text.

• The ARG element is used in both FEATURES-ARG1 and FEATURES-ARG2. It specifies the annotation corresponding to one argument of the relation. It has the following four sub-elements:

– NAME, a unique name for the argument.
– SEMTYPE, the type of the argument value; it can be “NOMINAL” or “NUMERIC”. Currently only nominal is implemented.
– TYPE, the annotation type for the argument.
– FEATURE, the named feature on the annotation of the specified type whose value identifies the argument. An annotation is regarded as an argument of the relation instance under consideration only if the value of this feature is the same as the value of the corresponding feature specified in the INSTANCE-ARG1 (or INSTANCE-ARG2) sub-element of the instance type.
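For example, the ARG element for the first argument in the relation extraction example of Section 11.3 is:

<ARG>
  <NAME>ARG1</NAME>
  <SEMTYPE>NOMINAL</SEMTYPE>
  <TYPE>ACEEntity</TYPE>
  <FEATURE>MENTION_ID</FEATURE>
</ARG>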

• The ATTRIBUTE_REL element is similar to the ATTRIBUTE element, but it has no POSITION sub-element and has two additional sub-elements, ARG1 and ARG2, relating it to the two argument features of the (relation) instance type. In other words, the feature defined in an ATTRIBUTE_REL sub-element is assigned to the instance under consideration if and only if the value of its ARG1 sub-element matches the first argument feature of the instance and the value of its ARG2 sub-element matches the second argument feature of the instance. Note that for relation learning an ATTRIBUTE_REL is used as the class attribute, by giving it an empty <CLASS/> element.
5For example, if a user doing sentence classification uses two features, the uni-gram of tokens in the sentence and the length of the sentence, and thinks that the uni-gram feature is more important than the sentence length, then s/he can set the WEIGHT sub-element of the NGRAM element to a number bigger than 1.0 (such as 10.0).
6If a feature of a relation instance is determined by only one argument of the relation, it can be defined in the FEATURES-ARG1 or FEATURES-ARG2 element. On the other hand, if a feature relates to both arguments, it should be defined in an ATTRIBUTE_REL element.
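For example, the feature “t12” of the RE_INS annotation in the relation extraction example of Section 11.3 is declared as:

<ATTRIBUTE_REL>
  <NAME>EntityCom1</NAME>
  <SEMTYPE>NOMINAL</SEMTYPE>
  <TYPE>RE_INS</TYPE>
  <ARG1>arg1</ARG1>
  <ARG2>arg2</ARG2>
  <FEATURE>t12</FEATURE>
</ATTRIBUTE_REL>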

11.3 Examples of configuration files for the three learning types

The following are three illustrative examples of configuration files, for information extraction, sentence classification, and relation extraction, respectively. Note that the configuration file is in XML format and should be stored in a file with the “.xml” extension.

The first example is for information extraction. The optional settings come first. The example specifies the surround mode as “true”, as this is a kind of chunk learning. It then specifies the filtering settings: the value of ratio is “0.1” and the value of dis is “near”, meaning that the 10% of negative examples closest to the learned SVM hyper-plane are removed in the filtering stage before learning. The thresholds of the probabilities for boundary tokens and for entities are set to “0.4” and “0.2”, respectively. The threshold of the probability for classification is also set, to “0.5”; however, it is not used in this problem, since this is chunk learning with the surround mode set to “true”. multiClassification2Binary is set to “one-vs-others”, meaning that the ML API converts the multi-class classification problem into binary classification problems using the one against all others approach. For the EVALUATION setting, 2-fold cross-validation is used.

The second part is the ENGINE sub-element, specifying the learning algorithm. The ML API will use the SVM implementation from LibSVM. The option settings select the linear kernel, with the cost C set to 0.7 and the cache memory to 100MB. Additionally, the uneven margins SVM is used, with the uneven margins parameter τ set to 0.4.

The last part is the DATASET sub-element, defining the linguistic features used. It first specifies the “Token” annotation as the instance type. The first ATTRIBUTELIST uses the token's string as a feature of an instance. The range from “-5” to “5” means that the strings of the current token instance, its five preceding tokens and its five following tokens are all used as features of the current token instance. The next two attribute lists define features based on the token's capitalisation information and kind, respectively. The ATTRIBUTELIST named “Gaz” uses as features the values of the feature “majorType” of the annotation type “Lookup”. Finally, the ATTRIBUTE element defines the class attribute, because it has a <CLASS/> sub-element; the values of the feature “class” of the annotation type “Mention” are the class labels.

<?xml version="1.0"?>
<ML-CONFIG>
  <SURROUND value="true"/>
  <FILTERING ratio="0.1" dis="near"/>
  <PARAMETER name="thresholdProbabilityEntity" value="0.2"/>
  <PARAMETER name="thresholdProbabilityBoundary" value="0.4"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
  <multiClassification2Binary method="one-vs-others"/>
  <EVALUATION method="kfold" runs="2"/>
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
          options=" -c 0.7 -t 0 -m 100 -tau 0.4 "/>
  <DATASET>
    <INSTANCE-TYPE>Token</INSTANCE-TYPE>
    <ATTRIBUTELIST>
      <NAME>Form</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Token</TYPE>
      <FEATURE>string</FEATURE>
      <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
      <NAME>Orthography</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Token</TYPE>
      <FEATURE>orth</FEATURE>
      <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
      <NAME>Tokenkind</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Token</TYPE>
      <FEATURE>kind</FEATURE>
      <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
      <NAME>Gaz</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Lookup</TYPE>
      <FEATURE>majorType</FEATURE>
      <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Mention</TYPE>
      <FEATURE>class</FEATURE>
      <POSITION>0</POSITION>
      <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>

The following is a configuration file for sentence classification. It first specifies the surround mode as “false”, because this is a text classification problem. The next two options allow the label list and the NLP feature list to be updated from the training data. It also specifies the thresholds for entities and entity boundaries; these two settings are not used in text classification problems, but their presence in the configuration file does no harm. The threshold of the probability for classification is set to “0.5”, which is used when applying the learned model. The evaluation uses the hold-out test method: it randomly selects 66% of the documents in the corpus for training and uses the other 34% for testing, performing two runs of evaluation and averaging the results over the two runs. Note that it does not specify the method for converting the multi-class classification problem into several binary classification problems, so the default (one against all others) is adopted if needed.

The configuration file specifies KNN as the learning algorithm, with the number of neighbours set to 5. Of course other learning algorithms could be used as well; for example, the ENGINE element from the previous example, which specifies the SVM, could be put into this configuration file to replace the current one.

In the DATASET element, the annotation “Sentence” is used as the instance type, and two kinds of linguistic features are defined: an NGRAM and an ATTRIBUTE. The n-gram is based on the annotation “Token”. It is a uni-gram, as its NUMBER sub-element has the value 1, and it is based on the two features “root” and “category” of the annotation “Token”. In other words, two tokens are treated as the same term of the uni-gram if and only if they have the same “root” feature and the same “category” feature. The weight of the n-gram is set to 10.0, meaning that its contribution is ten times that of the other feature, the sentence length. The ATTRIBUTE uses the feature “sent_size” of the annotation “Sentence” as another NLP feature. Finally, the values of the feature “class” of the annotation “Sentence” are the class labels.


<?xml version="1.0"?>
<ML-CONFIG>
  <SURROUND value="false"/>
  <IS-LABEL-UPDATABLE value="true"/>
  <IS-NLPFEATURELIST-UPDATABLE value="true"/>
  <PARAMETER name="thresholdProbabilityEntity" value="0.2"/>
  <PARAMETER name="thresholdProbabilityBoundary" value="0.4"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
  <EVALUATION method="holdout" runs="2" ratio="0.66"/>
  <ENGINE nickname="KNN" implementationName="KNNWeka" options=" -k 5 "/>
  <DATASET>
    <INSTANCE-TYPE>Sentence</INSTANCE-TYPE>
    <NGRAM>
      <NAME>Sent1gram</NAME>
      <NUMBER>1</NUMBER>
      <CONSNUM>2</CONSNUM>
      <CONS-1>
        <TYPE>Token</TYPE>
        <FEATURE>root</FEATURE>
      </CONS-1>
      <CONS-2>
        <TYPE>Token</TYPE>
        <FEATURE>category</FEATURE>
      </CONS-2>
      <WEIGHT>10.0</WEIGHT>
    </NGRAM>
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Sentence</TYPE>
      <FEATURE>sent_size</FEATURE>
      <POSITION>0</POSITION>
    </ATTRIBUTE>
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Sentence</TYPE>
      <FEATURE>class</FEATURE>
      <POSITION>0</POSITION>
      <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>

The last configuration file is for relation extraction. It does not specify any optional settings, meaning that it uses the default values of all of them (see Section 11.2.2 for the default values of all possible settings). In particular:

• the surround mode is set to “false”;

• both the label list and the NLP feature list are updatable;

• the threshold of the probability for classification is set to 0.5;

• the one against all others method is used for converting the multi-class problem into binary class problems for SVM learning;

• for evaluation, the hold-out test is used, with a ratio of 0.66 and only one run.

The configuration file specifies the learning algorithm as the Naive Bayes method implemented in Weka; again, other learning algorithms could be used as well. The linguistic features used for relation extraction are more complicated than for the other two types of NLP learning. For relation learning, the user must specify as the instance type an annotation type covering the two arguments of the relation. The user may also need to specify the annotation type of an argument, if s/he wants to use features related only to that argument. In the DATASET sub-element, the instance type is specified as the annotation type “RE_INS”. The two arguments of the instance are specified by the two features “arg1” and “arg2” of the annotation type “RE_INS”, which should have the same argument ID values as those specified in the ARG sub-elements of the FEATURES-ARG1 and FEATURES-ARG2 sub-elements. The linguistic features related to the first argument of the relation are defined in the sub-element FEATURES-ARG1, in which the feature “string” of the annotation type “Token” for the first argument is used as one NLP feature for relation learning. Similarly, the token's form for the second argument is used as another NLP feature, as specified in the sub-element FEATURES-ARG2. The first ATTRIBUTE_REL uses the feature “t12” of the annotation type RE_INS, related to the relation instance, to define another feature for relation learning. The second ATTRIBUTE_REL defines the label: it specifies the class labels as the values of the feature “Relation_type” of the annotation type ACERelation. The feature “MENTION_ARG1” of the type ACERelation should refer to the identification feature (namely the value of the feature “MENTION_ID” of the annotation type “ACEEntity”) of the first argument annotation, and the feature “MENTION_ARG2” to that of the second argument annotation.

<?xml version="1.0"?>
<ML-CONFIG>
  <ENGINE nickname="NaiveBayes" implementationName="NaiveBayesWeka"/>
  <DATASET>
    <INSTANCE-TYPE>RE_INS</INSTANCE-TYPE>
    <INSTANCE-ARG1>arg1</INSTANCE-ARG1>
    <INSTANCE-ARG2>arg2</INSTANCE-ARG2>
    <FEATURES-ARG1>
      <ARG>
        <NAME>ARG1</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>ACEEntity</TYPE>
        <FEATURE>MENTION_ID</FEATURE>
      </ARG>
      <ATTRIBUTE>
        <NAME>Form</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
        <POSITION>0</POSITION>
      </ATTRIBUTE>
    </FEATURES-ARG1>
    <FEATURES-ARG2>
      <ARG>
        <NAME>ARG2</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>ACEEntity</TYPE>
        <FEATURE>MENTION_ID</FEATURE>
      </ARG>
      <ATTRIBUTE>
        <NAME>Form</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
        <POSITION>0</POSITION>
      </ATTRIBUTE>
    </FEATURES-ARG2>
    <ATTRIBUTE_REL>
      <NAME>EntityCom1</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>RE_INS</TYPE>
      <ARG1>arg1</ARG1>
      <ARG2>arg2</ARG2>
      <FEATURE>t12</FEATURE>
    </ATTRIBUTE_REL>
    <ATTRIBUTE_REL>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>ACERelation</TYPE>
      <ARG1>MENTION_ARG1</ARG1>
      <ARG2>MENTION_ARG2</ARG2>
      <FEATURE>Relation_type</FEATURE>
      <CLASS/>
    </ATTRIBUTE_REL>
  </DATASET>
</ML-CONFIG>

11.4 How to use the ML API

In the following we explain how to use the ML API step by step.

1. Annotate some documents with the labels you want the ML API to learn. The labels should be represented by the values of a feature of a GATE annotation type.

2. Design the NLP features that you want the ML API to use for learning.

3. Use GATE PRs to obtain those NLP features. ANNIE provides many useful features; other PRs, such as the GATE morphological analyser and the parsers, may produce useful features as well. Sometimes you may need to write JAPE scripts to produce the features you want. Note that all the features are represented as values of features of GATE annotation types, which are specified in the DATASET element of the configuration file.

4. Write the XML configuration file for your learning problem. The file should contain one DATASET element specifying the NLP features used, one ENGINE element specifying the learning algorithm, and optional settings as necessary. (Tip: it may be easier to copy one of the configuration files presented above and modify it for your problem than to write a configuration file from scratch.)

5. Load the training documents with necessary annotations and put them into a corpus.

6. Load the ML API into the GATE GUI and add it to a Corpus Pipeline application. Add the corpus containing the training documents to the application too.

7. Select the run-time parameter learningMode as “TRAINING” to learn a model from the training data, or select “EVALUATION” to do evaluation on the training data and obtain performance results. (Tip: it may save time to first try EVALUATION mode on a small number of documents, to make sure that the ML API works well on your problem and outputs reasonable results, before training on the full data.)

8. If you want to apply the learned model to new documents, load those documents into GATE and pre-process them using exactly the same PRs as were used for pre-processing the training documents. Then load the ML API with the same configuration file as for training, select the learningMode “APPLICATION”, and run the ML API on the corpus containing the new documents.

9. If you just want the feature files produced by the system, and do not want to do any learning or application, select the learning mode “ProduceFeatureFilesOnly”.
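The same steps can also be followed from code when GATE is embedded in another application. The following is a minimal sketch only: it assumes the Batch Learning PR is implemented by the class gate.learning.LearningAPIMain with parameters configFileURL and learningMode, that the plug-in lives in a directory named “learning”, and it uses a hypothetical configuration file path; check the plug-in's creole.xml for the exact resource and parameter names.

// a minimal sketch of running the ML API in TRAINING mode from Java;
// resource and parameter names are assumptions, see the plug-in's creole.xml
import java.io.File;
import java.net.URL;
import gate.Corpus;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

public class MLTrainingSketch {
  public static void main(String[] args) throws Exception {
    Gate.init();                                        // initialise GATE
    // load the learning plug-in (directory name assumed)
    Gate.getCreoleRegister().addDirectory(
        new File(Gate.getPluginsHome(), "learning").toURL());

    // create the Batch Learning PR with the XML configuration file
    FeatureMap params = Factory.newFeatureMap();
    params.put("configFileURL",
        new URL("file:///data/myproject/ml-config.xml")); // hypothetical path
    ProcessingResource learner = (ProcessingResource)
        Factory.createResource("gate.learning.LearningAPIMain", params);

    // wrap the PR in a corpus pipeline over a (pre-processed) corpus
    SerialAnalyserController pipeline = (SerialAnalyserController)
        Factory.createResource("gate.creole.SerialAnalyserController");
    pipeline.add(learner);
    Corpus corpus = Factory.newCorpus("training corpus");
    // ... populate the corpus with annotated training documents here ...
    pipeline.setCorpus(corpus);

    // select the learning mode and run; the actual implementation may
    // expect an enum value rather than a String here
    learner.setParameterValue("learningMode", "TRAINING");
    pipeline.execute();
  }
}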

11.5 The outputs of the ML API

The ML API produces several different types of output. First, it outputs information about the learning settings; this information is printed in the Messages Window of the GATE GUI and also into the log file “logFileForNLPLearning.save”, and the amount of information displayed can be set via the VERBOSITY parameter in the XML configuration file. The main outputs of the learning system differ according to the usage mode: in training mode the system produces a learned model, in application mode it annotates the documents using the learned model, in evaluation mode it displays evaluation results, and in ProduceFeatureFilesOnly mode it produces several feature files for the current corpus. In the following we explain the outputs of each learning mode.

Note that all the files produced by the ML API, including the log file, are placed in the sub-directory “savedFiles” of the working directory containing the XML configuration file.

11.5.1 Training results

When the ML API is in training mode, its main output is the learned model, stored in a file named “learnedModels.save”. For the SVM, the learned model file is a text file; for the learning algorithms implemented in Weka, it is a binary file. The output also includes the feature files described in Sub-section 11.5.4.

11.5.2 Application results

The main application result is the set of annotations added to the documents by applying the ML model. The type of these annotations is specified by the class element in the configuration file, and the class label of an annotation is the value of the annotation feature also specified in the class element. Each annotation has an additional feature named “prob”, which gives the confidence of the learned model in that annotation.

11.5.3 Evaluation results

The ML API outputs the evaluation results for each run of the evaluation, and also the results averaged over all runs. For each run, it first prints a message giving the names of the documents used for training and testing, respectively. Then it displays the evaluation results of the run: first the results for each class label, and then the micro-averaged results over all labels. For each label, it presents the name of the label, the number of instances with that label in the training data, and the F-measure results on the testing data: the numbers of correct, partially correct, spurious and missing instances in the testing data, and two sets of precision, recall and F1 figures, strict and lenient. The F-measure results are obtained using the AnnotationDiff tool, which is described in Chapter 13. Finally, the system presents the means of the results over all runs, for each label and micro-averaged, respectively.

11.5.4 Feature files

The ML API produces several feature files from documents with GATE annotations. These feature files can be used to evaluate learning algorithms not implemented in this plug-in. In the following we describe the formats of those feature files.

NLP feature file. This file, named NLPFeatureData.save, contains the NLP features of the instances, as defined in the configuration file. The first few lines of an NLP feature file for information extraction are as follows:

Class(es) Form(-1) Form(0) Form(1) Ortho(-1) Ortho(0) Ortho(1) 0 ft-airlines-27-jul-2001.xml 512 1 Number_BB _NA[-1] _Form_Seven _Form_UK[1] _NA[-1] _Ortho_upperInitial _Ortho_allCaps[1] 1 Country_BB _Form_Seven[-1] _Form_UK _Form_airlines[1] _Ortho_upperInitial[-1] _Ortho_allCaps _Ortho_lowercase[1] 0 _Form_UK[-1] _Form_airlines _Form_including[1] _Ortho_allCaps[-1] _Ortho_lowercase _Ortho_lowercase[1] 0 _Form_airlines[-1] _Form_including _Form_British[1] _Ortho_lowercase[-1] _Ortho_lowercase _Ortho_upperInitial[1] 1 Airline_BB _Form_including[-1] _Form_British _Form_Airways[1] _Ortho_lowercase[-1] _Ortho_upperInitial _Ortho_upperInitial[1] 1 Airline _Form_British[-1] _Form_Airways _Form_[1], _Ortho_upperInitial[-1] _Ortho_upperInitial _NA[1] 0 _Form_Airways[-1] _Form_, _Form_Virgin[1] _Ortho_upperInitial[-1] _NA _Ortho_upperInitial[1]

The first line of the NLP feature file lists the names of all the features used. The number in parentheses after a feature name indicates the position of the feature: for example, “Form(-1)” means the form of the token immediately before the current token, and “Form(0)” means the form of the current token. The NLP features of all instances are listed document by document. For each document, the first line shows the index of the document, the document's name and the number of instances in the document, as in the second line above. Each of the following lines corresponds to one instance, in order of occurrence in the document. In an instance line, the first item is a number n, representing the number of class labels of the instance, and the following n items are the labels. If the current instance is the first instance of an entity, its label has the suffix “_BB”. The remaining items in the line are the NLP features of the instance, in the order listed in the first line of the file. Each NLP feature contains the feature's name and value, separated by “_”. At the end of an NLP feature there may be an integer in square brackets “[]”, representing the position of the feature relative to the current instance; if there is no such integer, the feature is at position 0.

Feature vector file. This file, named featureVectorsData.save, stores the feature vector of each instance in sparse format. The first few lines of the feature vector file corresponding to the NLP feature file shown above are as follows:

0 512 ft-airlines-27-jul-2001.xml 1 2 1 2 439:1.0 761:1.0 100300:1.0 100763:1.0 2 2 3 4 300:1.0 763:1.0 50439:1.0 50761:1.0 100440:1.0 100762:1.0 3 0 440:1.0 762:1.0 50300:1.0 50763:1.0 100441:1.0 100762:1.0 4 0 441:1.0 762:1.0 50440:1.0 50762:1.0 100020:1.0 100761:1.0 5 1 5 20:1.0 761:1.0 50441:1.0 50762:1.0 100442:1.0 100761:1.0 6 1 6 442:1.0 761:1.0 50020:1.0 50761:1.0 100066:1.0 7 0 66:1.0 50442:1.0 50761:1.0 100443:1.0 100761:1.0

The feature vectors are also listed document by document. For each document, the first line shows the index of the document, the number of instances in the document, and the document's name. Each of the following lines corresponds to one instance. The first item in an instance line is the index of the instance in the document; the second item is a number n, representing the number of labels the instance has; and the following n items are the indexes of the instance's labels.

For text classification and relation learning, the label's index comes directly from the label list file described below. For chunk learning, the label index presented in the feature vector file is a bit more complicated, as explained in the following. If an instance (e.g. token) is the first one of a chunk with label k, then the instance has label index 2 ∗ k − 1, as shown for the fifth instance above. If it is the last instance of the chunk, it has label index 2 ∗ k, as shown for the sixth instance. If the instance is both the first and the last instance of the chunk (namely the chunk consists of one instance), it has the two label indexes 2 ∗ k − 1 and 2 ∗ k, as shown for the first (and second) instance.

The items following the label item(s) are the non-zero components of the feature vector. Each component is represented by two numbers separated by “:”: the first number is the dimension (index) of the component in the feature vector, and the second is the value of the component.

Label list file. This file, named LabelsList.save, stores a list of labels and their indexes. The following is part of a label list; each line shows one label name and its index in the label list.

Airline 3

Bank 13 CalendarMonth 11 CalendarYear 10 Company 6 Continent 8 Country 2 CountryCapital 15 Date 21 DayOfWeek 4

NLP feature list. This file, named NLPFeaturesList.save, contains a list of NLP features and their indexes in the list. The following are the first few lines of an NLP feature list file.

totalNumDocs=14915
_EntityType_Date 13 1731
_EntityType_Location 170 1081
_EntityType_Money 523 3774
_EntityType_Organization 12 2387
_EntityType_Person 191 421
_EntityType_Unknown 76 218
_Form_'' 112 775
_Form_$ 527 74
_Form_' 508 37
_Form_'s 63 731
_Form_( 526 111

The first line of the file shows the number of instances from which the NLP features were collected; this number is used in the computation of the so-called idf in document or sentence classification. Each of the following lines describes one unique NLP feature. The first item in the line represents the NLP feature itself, a combination of the feature's name as defined in the configuration file and the feature's value. The second item is a positive integer, the index of the feature in the list. The last item is the number of instances in which the feature occurs, which is needed for computing the idf.

Chapter 12

Tools for Alignment Tasks

12.1 Introduction

This chapter introduces a new plug-in that allows users to create new tools for text alignment and cross-document processing.

Text alignment can be carried out at document, section, paragraph, sentence or word level. Given two parallel corpora, where the first corpus contains documents in a source language and the other contains their counterparts in a target language, the first task is to identify the parallel documents and align them at the document level. Cross-document processing is where multiple documents need to be consulted in order to achieve some task, for example cross-document co-reference resolution.

For these tasks one needs to refer to more than one document at the same time; hence a need arises for Processing Resources (PRs) that can take more than one document as parameters. For example, given two documents, a source and a target, a Sentence Alignment PR would need to refer to both of them to identify which sentence of the source document aligns with which sentence of the target document. Similarly, for cross-document co-reference resolution, the respective PR would need to access both documents simultaneously.

The GATE framework deals with one document at a time. A GATE document does not depend on any of the other resources in GATE; it is an independent object which can be added to or removed from a corpus or datastore. The standard behaviour of GATE PRs conflicts with the requirements above: GATE PRs accept one document at a time, and a corpus pipeline, although it accepts a corpus as input, still considers only one document at a time. Having said this, it is not impossible to make PRs that accept more than one document, but it would require a lot of re-engineering.


12.2 Tools for Alignment Tasks

We have introduced some new resources in GATE to address these issues, including CompoundDocument, CompositeDocument and a new AlignmentEditor. Below we describe these components and how to use them.

12.2.1 Compound Document

A new Language Resource (LR) called CompoundDocument has been introduced. A compound document is a collection of documents, allowing various documents to be grouped together under a single document. Documents can be added to and removed from a compound document as required. It implements the gate.Document interface, allowing users to carry out all the operations that can be performed on a normal GATE document. A PR wishing to access multiple documents can group them under a compound document, which internally provides access to its members.

To instantiate a CompoundDocument the user needs to provide the following parameters.

• encoding - encoding of the member documents. All document members must have the same encoding (e.g. Unicode, UTF-8, UTF-16).

• collectRepositioningInfo - this parameter indicates whether the underlying documents should collect the repositioning information in case the contents of these documents change.

• preserveOriginalContent - whether the original content of the underlying documents should be preserved.

• documentIDs - users need to provide a unique ID for each document member. These ids are used to locate the appropriate documents.

• sourceUrl - given the URL of one of the member documents, the CompoundDocument instance searches for the other members in the same folder, based on the ids provided in the documentIDs parameter. The following document naming convention is used to locate the other member documents:

– FileName.id.extension (the file name, followed by the id, followed by the extension, all separated by a . (dot)).
– For example, if the user provides three document IDs (e.g. “en”, “hi” and “gu”) and selects a file with the name “File.en.xml”, the CompoundDocument will search for the rest of the documents (i.e. “File.hi.xml” and “File.gu.xml”). The file name (i.e. “File”) and the extension (i.e. “xml”) remain common to all three members of the compound document.

To summarize, the following parameters are required to instantiate the CompoundDocument.

1. encoding (default = UTF-8, java.lang.String, required = true)

2. collectRepositioningInfo (default = false, java.lang.Boolean, required = true)

3. preserveOriginalContent (default = false, java.lang.Boolean, required = true)

4. documentIDs (default = empty, java.util.ArrayList, required = true)

5. sourceUrl (default = empty, java.net.URL, required = true)

Figure 12.1 shows a snapshot for instantiating a compound document from the GATE GUI.

Figure 12.1: Compound Document

A compound document provides various methods that help in accessing its individual members.

public Document getDocument(String docid);

The following method returns a map of documents, where each key is a document ID and the value is the respective document.

public Map getDocuments();

Please note that at any given time only one member document has the focus set on it, and all the standard methods of the gate.Document interface apply to that document. For example, if there are two documents, “hi” and “en”, and the focus is set on “hi”, then the getAnnotations() method will return the default annotation set of the “hi” document. The following methods can be used to switch the focus of a compound document to a different member and to retrieve the member currently in focus.

public void setCurrentDocument(String documentID);
public Document getCurrentDocument();
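For instance, reading the default annotation sets of two members one after the other might look like the following sketch (compoundDoc is assumed to be a CompoundDocument created as shown below):

// compoundDoc is assumed to be an existing gate.compound.CompoundDocument
compoundDoc.setCurrentDocument("hi");                  // focus on the "hi" member
AnnotationSet hiAnnots = compoundDoc.getAnnotations(); // default set of "hi"
compoundDoc.setCurrentDocument("en");                  // switch focus to "en"
AnnotationSet enAnnots = compoundDoc.getAnnotations(); // default set of "en"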

As explained above, new documents can be added to, and existing documents removed from, the compound document using the following methods.

public void addDocument(String documentID, Document document);
public void removeDocument(String documentID);

The following code snippet demonstrates how to create a new compound document using the GATE API.

// step 1: initialise GATE
Gate.init();

// step 2: load the Alignment plugin
File alignmentHome = new File(Gate.getPluginsHome(), "Alignment");
Gate.getCreoleRegister().addDirectory(alignmentHome.toURL());

// step 3: set the parameters
FeatureMap fm = Factory.newFeatureMap();

// for example, to create a compound document for
// File.id1.xml and File.id2.xml
List docIDs = new ArrayList();
docIDs.add("id1");
docIDs.add("id2");
fm.put("documentIDs", docIDs);
fm.put("sourceUrl", new URL("file://z:/data/File.id1.xml"));

// step 4: finally create an instance of compound document
Document aDocument = (gate.compound.CompoundDocument)
    Factory.createResource("gate.compound.impl.CompoundDocumentImpl", fm);

12.2.2 Compound Document Editor

The compound document editor is a visual resource (VR) associated with the compound document. The VR contains several tabs, each representing a different member of the compound document. All standard functionality, such as the GATE document editor with its add-on plug-ins (AnnotationSetView, AnnotationsList, the co-reference editor etc.), is available for each individual member.

Figure 12.2 shows a compound document editor with English and Hindi documents as members of the compound document.

Figure 12.2: Compound Document Editor

12.2.3 Composite Document

A composite document allows users to merge the texts of the members of a compound document, while keeping the merged text linked to the respective member documents. In other words, if users make any change to the composite document (e.g. add new annotations or remove existing ones), the same change is made to the respective member documents.

A PR called CombineMembersPR allows creating a new composite document. It asks for the name of a class that implements the CombiningMethod interface. The CombiningMethod tells the CombineMembersPR how to combine texts and create a new composite document.

For example, a default implementation of the CombiningMethod, called DefaultCombiningMethod, takes the following parameters and puts the text of the compound document's members into a new composite document:

unitAnnotationType=Sentence
inputASName=Key
copyUnderlyingAnnotations=true

The first parameter tells the combining method that it is the text of "Sentence" annotations that needs to be merged; these annotations should be taken from the "Key" annotation set (second parameter); and finally all the annotations underlying each Sentence annotation must be copied into the composite document.

If a compound document has two members (e.g. "hi" and "en"), then, given the above parameters, the combining method finds all the annotations of type Sentence in each document, sorts them in ascending order, and puts one annotation from each document, one after another, into the composite document. This operation continues until all the annotations have been traversed:

Document en    Document hi
Sen1           Shi1
Sen2           Shi2
Sen3           Shi3

Composite document
Sen1
Shi1
Sen2
Shi2
Sen3
Shi3

The composite document also maintains a mapping of text offsets, so that if someone adds a new annotation to, or removes an annotation from, the composite document, it is also added to or removed from the respective member document. Finally, the newly created composite document becomes a member of the same compound document.

12.2.4 DeleteMembersPR

This PR allows deleting a specific member of the compound document. It takes a parameter called "documentID" and deletes the member with this name.

12.2.5 SwitchMembersPR

As described above, only one member of the compound document can have the focus set on it. A PR calling the getDocument() method gets a pointer to the compound document; however, all the other methods of the compound document give access to the information of the focused member. So if users want to process a particular member of the compound document with some PRs, they should use the SwitchMembersPR, which takes one parameter called documentID and sets the focus on the member with that specific id.

12.2.6 Saving as XML

Calling the toXml() method on a compound document returns the XML representation of the member which currently has the focus. However, the GATE GUI provides an option to save all member documents in separate files. This option appears in the options menu when the user right-clicks on the compound document. The user is asked to provide the name of a directory, in which each member of the compound document is then saved as a separate file.

12.2.7 Alignment Editor

A new Visual Resource (VR) called the AlignmentEditor has been implemented (Figure 12.3) and is attached to every compound document. As the name suggests, the purpose of the AlignmentEditor is to allow users to align the texts of different members of the compound document at the section, paragraph, sentence or word level. It provides a very user-friendly interface for manual text alignment.

The user is asked to provide certain parameters, based on which a new alignment task is set up. Once the task has been set up, the user is shown some text and asked to align it. An instance of the gate.alignment.Alignment class is created and stored as a document feature on the compound document. This object stores all the alignment information, such as which annotation is aligned with which annotations of which document, and so on.

The parameters needed for setting up a new alignment task are as follows:

• Unit of Alignment: this is the annotation type at which users want to perform the alignment. For example, if users want to align the text at the word level, they will need to process their documents with a tokeniser (e.g. the ANNIE English Tokeniser) to generate tokens, and provide Token as the unit of alignment.

• Parent of Unit of Alignment: generally, when performing a word alignment task, people consider a pair of aligned sentences, one in the source language and one in the target language. Thus, "Sentence" is the parent of the unit of alignment. In other words, users should also process their documents with a sentence splitter (e.g. the ANNIE Sentence Splitter), which identifies the boundaries of sentences and creates an annotation for each sentence in the text.

• Input Annotation Set Name: this is the name of the annotation set from which the alignment editor collects the above-specified annotations.

Figure 12.3: Alignment Editor

To give an example: if the user provides "Key" as the input annotation set name, "Sentence" as the parent of the unit of alignment and "Token" as the unit of alignment, the alignment editor, for each member of the compound document (except the composite document), looks into the Key annotation set, obtains the annotations of type Sentence and sorts them in ascending order of their offsets. It then creates a set of sets of sentences, where each set contains one sentence from each member document:

Document en    Document hi
Sen1           Shi1
Sen2           Shi2
Sen3           Shi3

Set1: Sen1, Shi1
Set2: Sen2, Shi2
Set3: Sen3, Shi3

Each of these sets is shown one at a time. If the user clicks on the next button, the next set of sentences is shown; the previous button shows the previous set. In each of these sentences, the individual tokens are highlighted in a default colour (to mark the boundary of each unit of alignment). The user can then click on one or more units in the source language and one or more units in the target language. Clicking on units highlights them in an identical colour. When the user clicks on the "align" button, the most recently highlighted units (all in the same colour) are aligned with one another. When the user then clicks on a new unit, a new colour is used, to differentiate it from the previous alignments. As described before, all the information about the alignment is saved in the alignment object. When the document is saved in a datastore, the alignment information is stored with it.

Currently there is no implementation provided to export this alignment information, but one could easily write a PR that reads this information and exports it to the desired format.

Chapter 13

Performance Evaluation of Language Analysers

When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the stage of science. (Kelvin)

Not everything that counts can be counted, and not everything that can be counted counts. (Einstein)

GATE provides two useful tools for automatic evaluation: the AnnotationDiff tool and the Benchmarking Tool. These are particularly useful not just as a final measure of performance, but as a tool to aid system development by tracking progress and evaluating the impact of changes as they are made. The evaluation tool (AnnotationDiff) enables automated performance measurement and visualisation of the results, while the benchmarking tool enables the tracking of a system’s progress and regression testing.

13.1 The AnnotationDiff Tool

The AnnotationDiff tool enables two sets of annotations on a document to be compared, in order either to compare a system-annotated text with a reference (hand-annotated) text, or to compare the output of two different versions of the system (or two different systems). For each annotation type, figures are generated for precision, recall, F-measure and false positives. Each of these can be calculated according to 3 different criteria - strict, lenient and average. The reason for this is to deal with partially correct responses in different ways.

• The Strict measure considers all partially correct responses as incorrect (spurious).

• The Lenient measure considers all partially correct responses as correct.

• The Average measure allocates a half weight to partially correct responses (i.e. it takes the average of strict and lenient).

It can be accessed both from the GUI and from the API. AnnotationDiff compares sets of annotations of the same type. When performing the diff, the annotation offsets and their features are taken into consideration, and then the diff process is triggered. Figure 13.1 shows a part of the AnnotationDiff viewer.

Figure 13.1: Part of the AnnotationDiff viewer

All annotations from the key set are compared with the ones from the response set, and those found to have the same start and end offsets are displayed on the same line in the table. Next, AnnotationDiff evaluates whether the features of each annotation from the response set subsume the features of the corresponding annotation from the key set, as specified by the keyFeatureNamesSet parameter.

To understand this in more detail, see section 3.22, which describes the Annotation Diff parameters.

13.2 The six annotation relations explained

Coextensive

Two annotations are coextensive if they hit the same span of text in a document. Basically, both their start and end offsets are equal.

Overlaps

Two annotations overlap if they share a common span of text.

Compatible

Two annotations are compatible if they are coextensive and if the features of one (usually the ones from the key) are included in the features of the other (usually the response).

Partially Compatible

Two annotations are partially compatible if they overlap and if the features of one (usually the ones from the key) are included in the features of the other (response).

Missing

This applies only to the key annotations. A key annotation is missing if it is neither coextensive nor overlapping, or if one or more features are not included in the response annotation.

Spurious

This applies only to the response annotations. A response annotation is spurious if it is neither coextensive nor overlapping, or if one or more features from the key are not included in the response annotation.

13.3 Benchmarking tool

The benchmarking tool differs from the AnnotationDiff in that it enables evaluation to be carried out over a whole corpus rather than a single document. It also enables tracking of the system’s performance over time. The tool can be run in either GUI mode or standalone mode. For more information on how to run the tool, see 3.23.

The tool requires a clean version of a corpus (with no annotations) and an annotated reference corpus. First of all, the tool is run in generation mode to produce a set of texts annotated by the system. These texts are stored for future use. The tool can then be run in three ways:

1. comparing the stored processed set with the human-annotated set;

2. comparing the current processed set with the human-annotated set;

3. (default mode) comparing the stored processed set with the current processed set and the human-annotated set.

In each case, performance statistics will be output for each text in the set, and overall statistics for the entire set. In the default mode, information is also provided about whether the figures have increased or decreased in comparison with the annotated set. The processed set can be updated at any time by rerunning the tool in generation mode with the latest version of the system resources. Furthermore, the system can be run in verbose mode, where for each P and R figure below a certain threshold (set by the user), the non-coextensive annotations (and their corresponding text) will be displayed. The output of the tool is written to an HTML file in tabular form, for easy viewing of the results (see Figure 13.2).

Figure 13.2: Fragment of results from benchmark tool

13.4 Metrics for Evaluation in Information Extraction

Much of the research in IE in the last decade has been connected with the MUC competitions, and so it is unsurprising that the MUC evaluation metrics of precision, recall and F-measure [Chinchor 92] also tend to be used, along with slight variations. These metrics have a very long-standing tradition in the field of IR [van Rijsbergen 79] (see also [Manning & Schütze 99, Frakes & Baeza-Yates 92]).

Precision measures the number of correctly identified items as a percentage of the number of items identified. In other words, it measures how many of the items that the system identified were actually correct, regardless of whether it also failed to retrieve correct items. The higher the precision, the better the system is at ensuring that what is identified is correct.

Error rate is the inverse of precision, and measures the number of incorrectly identified items as a percentage of the items identified. It is sometimes used as an alternative to precision.

Recall measures the number of correctly identified items as a percentage of the total number of correct items. In other words, it measures how many of the items that should have been identified actually were identified, regardless of how many spurious identifications were made. The higher the recall rate, the better the system is at not missing correct items.

Clearly, there must be a tradeoff between precision and recall, for a system can easily be made to achieve 100% precision by identifying nothing (and so making no mistakes in what it identifies), or 100% recall by identifying everything (and so not missing anything). The F-measure [van Rijsbergen 79] is often used in conjunction with Precision and Recall, as a weighted average of the two. False positives are a useful metric when dealing with a wide variety of text types, because it is not dependent on relative document richness in the same way that precision is. By this we mean the relative number of entities of each type to be found in a set of documents.

When comparing different systems on the same document set, relative document richness is unimportant, because it is equal for all systems. When comparing a single system’s performance on different documents, however, it is much more crucial, because if a particular document type has a significantly different number of any type of entity, the results for that entity type can become skewed. Compare the impact on precision of one error where the total number of correct entities = 1, and one error where the total = 100. Assuming the document length is the same, then the false positive score for each text, on the other hand, should be identical.

Common metrics for evaluation of IE systems are defined as follows:

Precision = (Correct + (1/2)Partial) / (Correct + Spurious + (1/2)Partial)    (13.1)

Recall = (Correct + (1/2)Partial) / (Correct + Missing + (1/2)Partial)    (13.2)

F-measure = ((β^2 + 1) · P · R) / ((β^2 · R) + P)    (13.3)

where β reflects the weighting of P vs. R. If β is set to 1, the two are weighted equally.

FalsePositive = Spurious / c    (13.4)

where c is some constant independent from document richness, e.g. the number of tokens or sentences in the document.

Note that we consider annotations to be partially correct if the entity type is correct and the spans are overlapping but not identical. Partially correct responses are normally allocated a half weight.
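To make the arithmetic concrete, here is a small self-contained sketch (the counts are invented and this is not part of the GATE API) that computes equations 13.1 to 13.4:

public class EvalSketch {
  public static void main(String[] args) {
    // invented example counts for one document
    double correct = 8, partial = 2, spurious = 1, missing = 1;
    double beta = 1.0;  // equal weighting of P and R
    double c = 250;     // e.g. the number of tokens in the document

    double precision = (correct + 0.5 * partial)
                     / (correct + spurious + 0.5 * partial);   // eq. 13.1
    double recall    = (correct + 0.5 * partial)
                     / (correct + missing + 0.5 * partial);    // eq. 13.2
    double fMeasure  = ((beta * beta + 1) * precision * recall)
                     / ((beta * beta * recall) + precision);   // eq. 13.3
    double falsePositive = spurious / c;                       // eq. 13.4

    System.out.printf("P=%.3f R=%.3f F=%.3f FP=%.4f%n",
                      precision, recall, fMeasure, falsePositive);
  }
}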

13.5 Metrics for Evaluation of Inter-Annotator Agreement

When we evaluate the performance of a processing resource such as a tokeniser or POS tagger, or of a whole application, we usually have a human-authored "gold standard" against which to compare our software. However, it is not always easy or obvious what this gold standard should be, as different people may have different opinions about what is correct. Typically, we solve this problem by using more than one human annotator, and comparing their annotations. We do this by calculating inter-annotator agreement (IAA), also known as inter-rater reliability.

IAA can be used to assess how difficult a task is. This is based on the argument that if two humans cannot come to agreement on some annotation, it is unlikely that a computer could ever do the same annotation ”correctly”. Thus, IAA can be used to find the ceiling for computer performance.

There are many possible metrics for reporting IAA, such as Cohen’s Kappa, prevalence, and bias [Eugenio & Glass 04]. Kappa is the best metric for IAA when all the annotators have identical exhaustive sets of questions on which they might agree or disagree. This could be a task like ”read over this text and mark up all telephone numbers”. However, some- times there is disagreement about the set of questions, e.g. when the annotators themselves determine which text spans they ought to annotate. That could be a task like ”read over this text and mark up all references to politics”. When annotators determine their own sets of questions, it is appropriate to use precision, recall, and F-measure to report IAA. The following demonstrates best practices for calculating IAA in this way.

Let's assume we have two annotators, Ann1 and Ann2. We want to measure how well Ann1 annotates compared with Ann2, and vice versa. Note that P(Ann1 vs Ann2) == R(Ann2 vs Ann1), and, similarly, P(Ann2 vs Ann1) == R(Ann1 vs Ann2).

This means that we can simply run an Annotation Diff with Ann1 as the key, and Ann2 as the response, and then do the reverse: Ann1 as the response, and Ann2 as the key.

We then report precision and F-measure from both runs, as well as the average of precision from both runs, i.e. [Prec(Ann1 vs Ann2) + Prec(Ann2 vs Ann1)] / 2. This latter number is the average precision of your annotators.
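For instance, with invented precision figures from the two AnnotationDiff runs:

// illustrative precision figures from the two AnnotationDiff runs
double precAnn1Key = 0.80;  // Ann1 as key, Ann2 as response
double precAnn2Key = 0.90;  // Ann2 as key, Ann1 as response

// the average precision of the two annotators
double iaa = (precAnn1Key + precAnn2Key) / 2;  // = 0.85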

Chapter 14

Users, Groups, and LR Access Rights

“Well,” he said, “it’s to do with the project which first made the software incarnation of the company profitable. It was called Reason, and in its own way it was sensational.”

“What was it?”

“Well, it was a kind of back-to-front program. It’s funny how many of the best ideas are just an old idea back-to-front. You see there have already been several programs written that help you to arrive at decisions by properly ordering and analysing all the relevant facts so that they then point naturally towards the right decision. The drawback with these is that the decision which all the properly ordered and analysed facts point to is not necessarily the one you want.”

“Yeeeess ...” said Reg’s voice from the kitchen.

“Well, Gordon’s great insight was to design a program which allowed you to specify in advance what decision you wished it to reach, and only then to give it all the facts. The program’s task, which it was able to accomplish with consummate ease, was simply to construct a plausible series of logical-sounding steps to connect the premises with the conclusion.

“And I have to say that it worked brilliantly. Gordon was able to buy himself a Porsche almost immediately despite being completely broke and a hopeless driver. Even his bank manager was unable to find fault with his reasoning. Even when Gordon wrote it off three weeks later.”

“Heavens. And did the program sell very well?”

“No. We never sold a single copy.”

“You astonish me. It sounds like a real winner to me.”

“It was,” said Richard hesitantly. “The entire project was bought up, lock, stock and barrel, by the Pentagon. The deal put WayForward on a very sound financial foundation. Its moral foundation, on the other hand, is not something I would want to trust my weight to. I’ve recently been analysing a lot of the arguments put forward in favour of the Star Wars project, and if you know what you’re looking for, the pattern of the algorithms is very clear.

“So much so, in fact, that looking at Pentagon policies over the last couple of years I think I can be fairly sure that the US Navy is using version 2.00 of the program, while the Air Force for some reason only has the beta-test version of 1.5. Odd, that.”

Dirk Gently’s Holistic Detective Agency, Douglas Adams, 1987 (pp. 55-56).

This chapter describes the LR access mechanism which is implemented for persistent LRs. At present there are two LR persistency storage methods: Java serialisation and Oracle. Here we will describe their security features in turn.

14.1 Java serialisation and LR access rights

At present the security model is not implemented for Java serialisation. One should rely on the access control offered by the operating system in order to restrict access to persistent resources.

14.2 Oracle Datastore and LR access rights

Warning: these features will not work unless you have an Oracle database pre-installed at your site1 and you, or an administrator at your site, have installed the GATE Oracle support (see http://gate.ac.uk/gate/doc/persistence.pdf).

Oracle datastores have advanced LR access rights based on users and groups, which are similar to those in an operating system such as Linux.

In order to be able to access an LR stored in an Oracle datastore, a user needs to supply a user name, password and a group. These credentials are used to determine which LRs are accessible to this user for reading and writing.

14.2.1 Users, Groups, Sessions and Access Modes

The security model provides primitives such as users, groups, permissions and sessions similar to the ones provided by the operating systems:

1 Oracle installation is not provided with GATE. You need to purchase this product separately from Oracle Corp. (see http://www.oracle.com).

• users - identified by a login name and password (each limited to 16 symbols). A user may be a member of one or more groups.

• groups - identified by a name (up to 128 symbols).

• sessions - each user must log into the datastore (by providing name, password and group) in order to use its resources. A session is opened when the user logs in. The default inactivity period after which the session expires, and the user has to log into the datastore again, is 4 hours.

• access modes - there are four access modes in the present implementation. The access (Read/Write) to a resource according to its owner and access mode is shown in Table 14.1.

Mode                       Owner (R/W)   Owner's group (R/W)   Other users (R/W)
World Read / Group Write       +/+              +/+                  +/-
Group Read / Group Write       +/+              +/+                  -/-
Group Read / Owner Write       +/+              +/-                  -/-
Owner Read / Owner Write       +/+              -/-                  -/-

Table 14.1: Access Modes

When GATE is configured for use with Oracle, a superuser and group are created:

• super user - ADMIN, password 'sesame'.

• administrative group - ADMINS.

The superuser is similar to the root user in Unix and has access to any resource regardless of its access mode. This user can also create or remove other users. We recommend that you change the password of the superuser immediately after you have installed the Oracle support for GATE.

14.2.2 User/Group Administration

Running the administration tool

When the GATE Oracle tables are first created with the database install scripts, they contain only the ADMIN user, who is the only user that can create and modify users and groups.2 We do not recommend using the ADMIN user to store or access LRs in GATE.

2 This user is similar to the root user in Unix operating systems.

Instead, immediately after installing the Oracle support for GATE datastores, some users and groups must be created by running the UserGroupEditor tool. Before running this tool, the URL of the Oracle database needs to be specified in gate.xml (either the user's own or the site-wide gate.xml). An example entry is:

<DBCONFIG url="jdbc:oracle:thin:GATEUSER/[email protected]:1521:gate101"
          url1="jdbc:oracle:thin:GATEUSER/[email protected]:1521:gate02" />

The example entry shows that there are two databases configured for this site, one at each URL. There is no limit to the number of Oracle databases one can have, but they all need to have an attribute starting with ”url”, e.g., url1, url2.

To run the tool, call the gate script with the -a parameter.

When the tool starts up, it first asks you to select which Oracle database you wish to administer. All databases defined in the <DBCONFIG> section of gate.xml will be shown in a listbox. Once the database is chosen, a login dialog is shown, asking for the user name, password and group of the ADMIN user. The initial password of the ADMIN user is sesame and the group is ADMINS. We advise that these are changed the first time this tool is run.

If all login credentials are provided correctly, the graphical tool starts up:

Figure 14.1: The User/Group Administration Tool

Viewing user and group information

As shown in Figure 14.1, the user/group administration tool (called the UG tool for the rest of this section) consists of two parallel lists. By default, the left one shows a list of all users in the database and the right one is empty.

To view the groups to which a particular user belongs, you need to select that user in the list. Then the right list displays this user’s groups. If the list remains empty, then it means that this user does not belong to any group.

In order to view all groups which are available, you need to switch the tool to a Users for groups mode, by clicking on the corresponding radio button. This will switch the tool to showing the list of all groups in the left panel. When you select a given group, then the right panel shows all users who belong to that group (see Figure 14.2).

Figure 14.2: The tool in a group administration mode

User manipulation

Users are manipulated by selecting a user in the list of users and right-clicking on it to see the user manipulation menu. This menu allows the following actions:

Create new user: shows a dialog where the user name and password of the new user must be specified.

Delete user: delete the currently selected user.

Add to group: shows a dialog displaying all available groups. Select one to add the user to it.

Remove from group: in the given dialog, choose the group from which the user is to be removed.

Change password: shows a dialog where the new password can be specified;

Rename user: choose another name for the selected user.

All changes are automatically written to the Oracle database.

Group manipulation

Groups are manipulated by selecting a group in the list of groups and right-clicking on it to see the group manipulation menu. This menu allows the following actions:

Create new group: shows a dialog where the name of the new group must be specified.

Delete group: delete the currently selected group.

Add user: shows a dialog displaying all available users. Select one to add to the group.

Remove user: in the given dialog, choose the user to be removed.

Rename group: choose another name for the selected group.

All changes are automatically written to the Oracle database.

14.2.3 The API

In order to work with users and groups3 programmatically, you need to use an access controller, which is the class that provides the connection to the Oracle database. The access controller needs to be closed before application exit. Once the connection is established, you need to create a session by providing the login details of the user (user name, password and group). Any user who can log in can use the accessor methods for users and groups, but only the ADMIN user has privileges to modify the data. The way to check whether the logged-in user has the right to modify data is to use the isPrivilegedSession() method (see below). If a mutator method is used with a non-privileged session, a SecurityException is thrown. All security-related classes and all their methods are documented in the GATE JavaDoc documentation, gate.security package.

3 See the latest API documentation online at: http://gate.ac.uk/gate/doc/javadoc/index.html. The user and group API is located in the gate.security package.

AccessController ac = new AccessControllerImpl();
ac.open("jdbc:oracle:thin:GATEUSER/[email protected]:1521:GateDB");

Session mySession = null;
try {
  mySession = ac.login("myUser", "myPass",
                       ac.findGroup("myGroup").getID());
}
catch (gate.security.SecurityException ex) {
  ac.close();
}

// first check whether we have a valid session
if (!ac.isValidSession(mySession)) {
  ac.close();
}

// then check that it is an administrative session
if (!mySession.isPrivilegedSession()) {
  ac.close();
}

User myUser = ac.findUser("myUser");
String myName = myUser.getName();
List myGroups = myUser.getGroups();
...

// we're done now, just close the access controller connection
ac.close();

If you would like to use a dialog where the user can type those details, the session can be obtained by using the login(AccessController ac, Component parent) static method in the UserGroupEditor class. The login code would then look as follows:

mySession = UserGroupDialog.login(ac, someParentWindow);

For a full example of code using the security API, see TestSecurity.java and UserGroupEditor.java.

Chapter 15

Developing GATE

This chapter describes the protocols to follow and other information for those involved in developing GATE.

15.1 Creating new plugins

GATE provides a flexible structure where new resources can be plugged in very easily. There are three types of resources: Language Resource (LR), Processing Resource (PR) and Visual Resource (VR). In the following subsections we describe the necessary steps to write new PRs and VRs, and to add plugins to the nightly build. The guide on writing new LRs will be available soon.

15.1.1 Where to keep plugins in the GATE hierarchy

Each new resource added as a plugin should have its own subfolder under the %GATEHOME%/plugins folder. A plugin can have one or more resources declared in its creole.xml file. The creole.xml file specifies one or more resources, and the required parameters for each such resource. The file should reside in the subfolder created for the plugin. More information on creole.xml, and on how to declare parameters and attributes, is given later.


15.1.2 Writing a new PR

Class Definition

Below we show a template class definition, which can be used in order to write a new Processing Resource.

package example;

/**
 * Processing Resource
 */
public class NewPlugin extends AbstractProcessingResource
                       implements ProcessingResource {

  /*
   * this method gets called whenever an object of this class is
   * created, either from the GATE GUI or when instantiated using
   * the Factory.createResource() method.
   */
  public Resource init() throws ResourceInstantiationException {
    // here initialise all required variables, and maybe
    // throw an exception if the value for any of the
    // mandatory parameters is not provided

    if(this.rulesURL == null)
      throw new ResourceInstantiationException("rules URL is null");

    return this;
  }

  /*
   * this method should provide the actual functionality of the PR
   * (from where the main execution begins). This method gets
   * called when the user clicks on the RUN button in the GATE
   * GUI's application window.
   */
  public void execute() throws ExecutionException {
    // write code here
  }

  /* this method is called to reinitialise the resource */
  public void reInit() throws ResourceInstantiationException {
    // reinitialisation code
  }

  /*
   * There are two types of parameters:
   * 1. Init-time parameters - values for these parameters need to
   *    be provided at the time of initialising a new resource, and
   *    are not supposed to be changed afterwards.
   * 2. Runtime parameters - values for these parameters are
   *    provided at the time of executing the PR, and can be changed
   *    before starting the execution (i.e. before you click on the
   *    "RUN" button in the GATE GUI).
   * Setter and getter methods must be provided for every such
   * parameter declared in the creole.xml.
   */

  /** Runtime parameter */
  String outputAnnotationSetName;

  // getter and setter methods
  public String getOutputAnnotationSetName() {
    return outputAnnotationSetName;
  }

  public void setOutputAnnotationSetName(String setName) {
    this.outputAnnotationSetName = setName;
  }

  /** Init-time parameter */
  URL rulesURL;

  // getter and setter methods
  public URL getRulesURL() {
    return rulesURL;
  }

  public void setRulesURL(URL rulesURL) {
    this.rulesURL = rulesURL;
  }
}

PR Creole Entry

When writing the entry for a new resource in the creole.xml file, the user should provide details of the class that implements the new PR and of its runtime and init-time parameters. The user can also specify other things, such as the icon to be used in the resource tree, along with the resource name and the JAR it belongs to.

<CREOLE-DIRECTORY>
  <CREOLE>
    <RESOURCE>
      <NAME>An Example Plugin</NAME>
      <JAR>newplugin.jar</JAR>
      <CLASS>example.NewPlugin</CLASS>
      <COMMENT>An example plugin that demonstrates how to write a new PR</COMMENT>
      <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>
      <PARAMETER NAME="rulesURL" RUNTIME="false">java.net.URL</PARAMETER>
      <PARAMETER NAME="outputAnnotationSetName" RUNTIME="true"
                 OPTIONAL="true">java.lang.String</PARAMETER>
    </RESOURCE>
  </CREOLE>
</CREOLE-DIRECTORY>

Option Menu

Each resource (LR, PR) has some predefined actions associated with it. These actions appear in an options menu, shown in the GATE GUI when the user right-clicks on any of the resources. For example, if the selected resource is a Processing Resource, there will be at least four actions available in its options menu: 1. Close, 2. Hide this view, 3. Rename and 4. Reinitialize. New actions, in addition to the predefined ones, can be added by implementing the gate.gui.ActionsPublisher interface and the following method:

public List getActions() {
  return actions;
}

Here the variable actions should contain a list of instances of type javax.swing.AbstractAction. The string passed in the constructor of an AbstractAction object appears in the options menu. Adding a null element adds a separator to the menu.
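A minimal sketch of such an action list is shown below; the action name and the body of actionPerformed() are invented for illustration:

// inside a PR that implements gate.gui.ActionsPublisher
private List actions = new ArrayList();

public NewPlugin() {
  actions.add(new javax.swing.AbstractAction("Dump statistics") {
    public void actionPerformed(java.awt.event.ActionEvent e) {
      // invented behaviour - react to the menu item being chosen
      System.out.println("Statistics dumped for " + getName());
    }
  });
  actions.add(null);  // a null element adds a separator to the menu
}

public List getActions() {
  return actions;
}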

Listeners

There are at least four important listeners which should be implemented in order to listen to the various relevant events happening in the background. These include:

• CreoleListener: The CREOLE register keeps information about the instances of various resources and refreshes itself on new additions and deletions. In order to listen to these events, a class should implement gate.event.CreoleListener. Implementing CreoleListener requires users to implement the following methods:

– public void resourceLoaded(CreoleEvent creoleEvent);
– public void resourceUnloaded(CreoleEvent creoleEvent);
– public void resourceRenamed(Resource resource, String oldName, String newName);
– public void datastoreOpened(CreoleEvent creoleEvent);
– public void datastoreCreated(CreoleEvent creoleEvent);
– public void datastoreClosed(CreoleEvent creoleEvent);

• DocumentListener: A traditional GATE document contains text and a set of annotation sets. To get notified about changes in any of these resources, a class should implement gate.event.DocumentListener. This requires users to implement the following methods:

– public void contentEdited(DocumentEvent event); – public void annotationSetAdded(DocumentEvent event); – public void annotationSetRemoved(DocumentEvent event);

• AnnotationSetListener: As the name suggests, an AnnotationSet is a set of annotations. To listen to the addition and deletion of annotations, a class should implement gate.event.AnnotationSetListener and therefore the following methods (a small example follows this list):

– public void annotationAdded(AnnotationSetEvent event); – public void annotationRemoved(AnnotationSetEvent event);

• AnnotationListener: Each annotation has a featureMap associated with it, which contains a set of feature names and their respective values. To listen to changes in an annotation, one needs to implement gate.event.AnnotationListener and the following method:

– public void annotationUpdated(AnnotationEvent event);
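As an illustrative sketch of the AnnotationSetListener mentioned above (the class name and printed messages are invented; we assume the listener is registered on an annotation set and that AnnotationSetEvent exposes the affected annotation via getAnnotation()):

import gate.event.AnnotationSetEvent;
import gate.event.AnnotationSetListener;

public class AnnotationLogger implements AnnotationSetListener {

  public void annotationAdded(AnnotationSetEvent e) {
    // assumed: e.getAnnotation() returns the annotation just added
    System.out.println("added: " + e.getAnnotation().getType());
  }

  public void annotationRemoved(AnnotationSetEvent e) {
    System.out.println("removed: " + e.getAnnotation().getType());
  }
}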

15.1.3 Writing a new VR

Each resource (PR and LR) can have its own associated visual resource. When the resource is double-clicked, its visual resource appears in the GATE GUI. The GATE GUI is divided into three visible parts (see Figure 15.1). One of them contains a tree that shows the loaded instances of resources. The part below it is used for various purposes, such as displaying document features and indicating that execution is in progress; this part of the GUI is referred to as "small". The third and largest part of the GUI is referred to as "large". One can specify in the creole.xml which of these two should be used for displaying a new visual resource.

Class Definition

Below we show a template class definition, which can be used in order to write a new Visual Resource.

Figure 15.1: GATE GUI

package example.gui;

/*
 * An example Visual Resource for the New Plugin.
 * Note that here we extend the AbstractVisualResource class.
 */
public class NewPluginVR extends AbstractVisualResource {

  /*
   * An init method, called when the GUI is initialised for the
   * first time
   */
  public Resource init() {
    // initialise GUI components
    return this;
  }

  /*
   * Here target is the PR class to which this visual resource
   * belongs. This method is called after the init() method.
   */
  public void setTarget(Object target) {
    // check if the target is an instance of what you expected
    // and initialise local data structures if required
  }
}

Every document has its own document viewer associated with it. It comes with a single component that shows the text of the original document. GATE provides a way to attach new GUI plugins to the document viewer: the AnnotationSet viewer, the AnnotationsList viewer and the co-reference editor are examples of document viewer plugins shipped as part of the core GATE build. These plugins can be displayed either to the right of or above the document viewer. They can also replace the text viewer in the centre (see Figure 15.1). A separate button is added at the top of the document viewer, which can be pressed to display the GUI plugin.

Below we show a template class definition, which can be used to develop a new DocumentViewer plugin.

/*
 * Note that the class needs to extend the AbstractDocumentView class
 */
public class DocumentViewerPlugin extends AbstractDocumentView {

  /* Implementers should override this method and use it for
     populating the GUI. */
  public void initGUI() {
    // write code to initialise the GUI
  }

  /* Returns the type of this view */
  public int getType() {
    // return one of the following constants from
    // gate.gui.docview.DocumentView:
    // CENTRAL, VERTICAL, HORIZONTAL
  }

  /* Returns the actual UI component this view represents. */
  public Component getGUI() {
    // return the top level GUI component
  }

  /* This method will be called whenever the view becomes active. */
  public void registerHooks() {
    // register listeners
  }

  /* This method will be called whenever this view becomes inactive. */
  public void unregisterHooks() {
    // do nothing
  }
}

VR Creole Entry

As mentioned earlier, a VR needs to be associated with some PR or LR, and therefore the creole entry for a visual resource should specify the resource it belongs to. Below we extend the creole.xml explained in the previous section by adding an entry for the NewPluginVR.

<CREOLE-DIRECTORY>
  <CREOLE>
    <RESOURCE>
      <NAME>An Example Plugin</NAME>
      <JAR>newplugin.jar</JAR>
      <CLASS>example.NewPlugin</CLASS>
      <COMMENT>An example plugin that demonstrates how to write a new PR</COMMENT>
      <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>
      <PARAMETER NAME="rulesURL" RUNTIME="false">java.net.URL</PARAMETER>
      <PARAMETER NAME="outputAnnotationSetName" RUNTIME="true"
                 OPTIONAL="true">java.lang.String</PARAMETER>
    </RESOURCE>
    <RESOURCE>
      <NAME>Visual Resource for New Plugin</NAME>
      <JAR>newplugin.jar</JAR>
      <CLASS>example.gui.NewPluginVR</CLASS>
      <GUI TYPE="LARGE">
        <RESOURCE_DISPLAYED>example.NewPlugin</RESOURCE_DISPLAYED>
      </GUI>
    </RESOURCE>
  </CREOLE>
</CREOLE-DIRECTORY>

15.1.4 Adding plugins to the nightly build

As of November 2005, GATE plugins are built every night as part of the nightly build process.

If you add a new plugin and want it to be part of the build process, you should create a build.xml file with targets ”build”, ”test”, ”distro.prepare” and ”clean”. The build target should build the JAR file, test should run any unit tests, distro.prepare should clean up any intermediate files (e.g. the classes/ directory) and leave just what’s in Subversion, plus the compiled JAR file. The clean target should clean up everything, including the compiled JAR and any generated sources, etc. You should also add your plugin to ”plugins.to.build” in the top-level build.xml to include it in the build. This is by design - not all the plugins have build files, and of the ones that do, not all are suitable for inclusion in the nightly build (viz. SUPPLE, section 9.12).

Note that if you are currently building GATE by running "ant jar", be aware that this does not build the plugins. Running just "ant" or "ant all" will do so.

There are some changes you will notice as a result of all this:

1. You may suddenly find some plugins stop working when you update to the latest svn revision, as their JAR files have been removed. Solution: update the top-level GATE build.xml file and then run "bin/ant plugins.build" in the GATE directory to rebuild the missing JARs.

2. If you have your own modified version of any of the affected plugins, you will get a conflict for the JAR file when you update, saying something like "move away MiniparWrapper.jar, it is in the way". Solution: rename the offending JAR file, then svn update again, and finally rename it back.

15.2 Updating this User Guide

The GATE User Guide is maintained in the GATE subversion repository at SourceForge. If you are a developer at Sheffield you do not need to check out the userguide explicitly, as it will appear under the tao directory when you check out sale. Others can check it out as follows:

svn checkout https://svn.sourceforge.net/svnroot/gate/userguide/trunk userguide

The user guide is written in LaTeX and translated to PDF using pdflatex and to HTML using tex4ht. The main file that ties it all together is tao_main.tex, which defines the various macros used in the rest of the guide and \inputs the other .tex files, one per chapter.

15.2.1 Building the User Guide

You will need:

• A standard POSIX shell environment including GNU Make. On Windows this generally means Cygwin, on Mac OS X the XCode developer tools and on Unix the relevant packages from your distribution.

• A copy of the userguide sources (see above).

• A LaTeX installation, including pdflatex if you want to build the PDF version, and tex4ht if you want to build the HTML. MiKTeX should work for Windows, teTeX (available in fink) for Mac OS X, or your choice of package for Unix.

• The BibTeX database big.bib. It must be located in the directory above where you have checked out the userguide, i.e. if the guide sources are in /home/bob/svn/userguide then big.bib needs to go in /home/bob/svn. Sheffield developers will find that it is already in the right place, under sale; others will need to download it from http://gate.ac.uk/sale/big.bib.

• A bit of luck.

Once these are all assembled, it should be a case of running make to perform the actual build. To build just the PDF, run make tao.pdf; for just the HTML, run make index.html.

The PDF build generally works without problems, but the HTML build is known to hang on some machines for no apparent reason. If this happens to you try again on a different machine.

15.2.2 Making changes to the User Guide

To make changes to the guide simply edit the relevant .tex files, make sure the guide still builds (at least the PDF version), and check in your changes to the source files only. Please do not check in your own built copy of the guide; the official user guide builds are produced by a nightly cron job in Sheffield running on a known-good system.

If you add a section or subsection you should use the \sect or \subsect commands rather than the normal LaTeX \section or \subsection. These shorthand commands take an optional first parameter, which is the label to use for the section and should follow the pattern of existing labels. The label is also set as an anchor in the HTML version of the guide. For example a new section for the “Fish” plugin would go in misc-creole.tex with a heading of:

\sect[sec:misc-creole:fish]{The Fish Plugin}

and would have the persistent URL http://gate.ac.uk/sale/tao/#sec:misc-creole:fish.

If your changes are to document a bug fix or a new (or removed) feature then you should also add an entry to the change log in changes.tex. You should include a reference to the full documentation for your change, in the same way as the existing changelog entries do. You should find yourself adding to the changelog every time except where you are just tidying up or rewording existing documentation.

Finally, you will have to wait until the next morning (UK time) for your changes to be reflected in the online version of the guide.

Chapter 16

Combining GATE and UIMA

UIMA (Unstructured Information Management Architecture) is a platform for natural lan- guage processing developed by IBM. It has many similarities to the GATE architecture – it represents documents as text plus annotations, and allows users to define pipelines of analysis engines that manipulate the document (or Common Analysis Structure in UIMA terminology) in much the same way as processing resources do in GATE. IBM has released an implementation of the UIMA architecture, called the UIMA SDK, that provides support for building analysis components in Java and C++ and running them either locally on one machine, or deploying them as services that can be accessed remotely. The SDK is available for download from http://alphaworks.ibm.com/tech/uima/.

Clearly, it would be useful to be able to include UIMA components in GATE applications and vice-versa, letting GATE users take advantage of UIMA’s flexible deployment options and UIMA users access JAPE and the many useful plugins already available in GATE. This chapter describes the interoperability layer provided as part of GATE to support this. The UIMA-GATE interoperability layer is based on the UIMA SDK version 1.2.3. Later versions of UIMA may also work but have not been tested.

The rest of this chapter assumes that you have at least a basic understanding of core UIMA concepts, such as type systems, primitive and aggregate text analysis engines (TAEs), feature structures, the format of AE XML descriptors, etc. It will probably be helpful to refer to the relevant sections of the UIMA SDK User’s Guide and Reference (supplied with the SDK) alongside this document.

There are two main parts to the interoperability layer:

1. A wrapper to allow a UIMA Text Analysis Engine (TAE), whether primitive or aggre- gate, to be used within GATE as a Processing Resource (PR).

2. A wrapper to allow a GATE processing pipeline (specifically a CorpusController) to be used within UIMA as a TAE.

The two components operate in very similar ways. Given a document in the source form (either a GATE Document or a UIMA CAS), a document in the target form is created with a copy of the source document’s text. Some of the annotations from the source are transferred to the target, according to a mapping defined by the user, and the target component is then run. Finally, some of the annotations on the updated target document are then transferred back to the source, according to the user-defined mapping.

The rest of this document describes this process in more detail. Section 16.1 describes the GATE TAE wrapper, and section 16.2 describes the UIMA CorpusController wrapper.

16.1 Embedding a UIMA TAE in GATE

Embedding a UIMA text analysis engine in a GATE application is a two-step process. First, you must construct a mapping descriptor XML file defining how to map annotations between the UIMA CAS and the GATE document. This mapping file, along with the analysis engine descriptor, is used to instantiate an AnalysisEnginePR, which calls the analysis engine on an appropriately initialised CAS. Examples of all the XML files discussed in this section are available in examples/conf under the plugin directory.

16.1.1 Mapping File Format

Figure 16.1 shows the structure of a mapping descriptor. The inputs section defines how annotations on the GATE document are transferred to the UIMA CAS. The outputs section defines how annotations which have been added, updated and removed by the TAE are transferred back to the GATE document.

Input definitions

Each input definition takes the following form:

<uimaAnnotation type="uima.Type" gateType="GATEType" indexed="true">
  <feature name="FeatureName" kind="string|int|float|fs">
    ... (a value element, see below) ...
  </feature>
  ...
</uimaAnnotation>

When a document is processed, this will create one UIMA annotation of type uima.Type in the CAS for each GATE annotation of type GATEType in the input annotation set, covering the same offsets in the text. If indexed is true, GATE will keep a record of which GATE annotation gave rise to which UIMA annotation. If you wish to be able to track updates to this annotation's features and transfer the updated values back into GATE, you must specify indexed="true". The indexed attribute defaults to false if omitted.

<uimaGateMapping>
  <inputs>
    <uimaAnnotation ...>
      ...
    </uimaAnnotation>
    ...
  </inputs>
  <outputs>
    <added>
      <gateAnnotation ...> ... </gateAnnotation>
    </added>
    <updated>
      ...
    </updated>
    <removed>
      ...
    </removed>
  </outputs>
</uimaGateMapping>

Figure 16.1: Structure of a mapping descriptor for a TAE in GATE

Each contained feature element will cause the corresponding feature to be set on the generated annotation. UIMA features can be string, integer or float valued, or can be a reference to another feature structure; this must be specified in the kind attribute. The feature's value is specified using a nested element, but exactly how this value is handled is determined by the kind.

There are various options for setting feature values:

<string value="fixed string" /> The simplest case - a fixed Java String.

<docFeatureValue name="featureName" /> The value of the given named feature of the current GATE document.

<gateAnnotFeatureValue name="featureName" /> The value of a given feature on the current GATE annotation (i.e. the one on which the offsets of the UIMA annotation are based).

<featureStructure type="uima.Type"> ... </featureStructure> A feature structure of the given type. The featureStructure element can itself contain feature elements recursively.

The value is assigned to the feature according to the feature's kind:

string The value object's toString() method is called, and the resulting String is set as the string value of the feature.

int If the value object is a subclass of java.lang.Number, its intValue() method is called, and the result is set as the integer value of the feature. If the value object is not a Number, it is toString()ed and the resulting String is parsed using Integer.parseInt(). If this succeeds, the integer result is used; if it fails, the feature is set to zero.

float As for int, except that Numbers are converted by calling floatValue(), and non-Numbers are parsed using Float.parseFloat().

fs The value object is assumed to be a FeatureStructure, and is used as-is. A ClassCastException will result if the value object is not a FeatureStructure.

In particular, value elements should only be used with features of kind fs. While nothing will stop you using them with string features, the result will probably not be what you expected. Developing Language Processing Components with GATE 300
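The int rule, for instance, corresponds to logic along these lines (a sketch of the described behaviour, not the actual plugin source):

// a sketch of the int conversion rule described above
static int toIntValue(Object value) {
  if (value instanceof Number) {
    return ((Number) value).intValue();         // Numbers: use intValue()
  }
  try {
    return Integer.parseInt(value.toString());  // otherwise parse the string
  } catch (NumberFormatException e) {
    return 0;                                   // parse failure: set to zero
  }
}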

Output definitions

The output definitions take a similar form. There are three groups:

added Annotations which have been added by the TAE, and for which corresponding new annotations are to be created in the GATE document.

updated Annotations that were created by an input definition (with indexed="true") whose feature values have been modified by the TAE, and these values are to be transferred back to the original GATE annotations.

removed Annotations that were created by an input definition (with indexed="true") which have been removed from the CAS1 and whose source annotations are to be removed from the GATE document.

The definition elements for these three types all take the same form:

<gateAnnotation type="GATEType" uimaType="uima.Type">
  <feature name="FeatureName">
    ... (a value element, see below) ...
  </feature>
</gateAnnotation>

For added annotations, this has the mirror-image effect to the input definition: for each UIMA annotation of the given type, a GATE annotation is created at the same offsets, and its feature values are set as specified by the feature elements. For a gateAnnotation the feature elements do not have a kind, as features in GATE can have arbitrary Objects as values. The possible feature value elements for a gateAnnotation are:

<string value="fixed string" /> A fixed string, as before.

<uimaFSFeatureValue name="uima.Type:FeatureName" kind="..." /> The value of the given feature of the current UIMA annotation. The feature name must be specified in fully-qualified form, including the type on which it is defined. The kind is used in a similar way as in input definitions:

string The Java String object returned as the string value of the feature is used.

int An Integer object is created from the integer value of the feature.

float A Float object is created from the float value of the feature.

1 Strictly speaking, removed from the annotation index, as feature structures cannot be removed from the CAS entirely.

fs The UIMA FeatureStructure object is returned. Since FeatureStructure objects are not guaranteed to be valid once the CAS has been cleared, a downstream GATE component must extract the relevant information from the feature structure before the next document is processed. You have been warned.

Feature names in uimaFSFeatureValue must be qualified with their type name, as the feature may have been defined on a supertype of the annotation's own type, rather than on the annotation type itself. For example, the begin feature is defined on the built-in type uima.tcas.Annotation, so even when reading it from an annotation of a more specific type it must be referred to as uima.tcas.Annotation:begin.

For updated annotations, there must have been an input definition with indexed="true" with the same GATE and UIMA types. In this case, for each GATE annotation of the appropriate type, the UIMA annotation that was created from it is found in the CAS. The feature definitions are then used as in the added case, except that the feature values are set on the original GATE annotation, rather than on a newly created one.

For removed annotations, the feature definitions are ignored, and the annotation is removed from GATE if the UIMA annotation which it gave rise to has been removed from the UIMA annotation index.

A complete example

Figure 16.2 shows a complete example mapping descriptor for a simple UIMA TAE that takes tokens as input and adds a feature to each token giving the number of lower case letters in the token's string.2 In this case the UIMA feature that holds the number of lower case letters is called LowerCaseLetters, but the GATE feature is called numLower. This demonstrates that the feature names do not need to agree, so long as a mapping between them can be defined.

2 The Java code implementing this AE is in the examples directory of the uima plugin. The AE descriptor and mapping file are in examples/conf.

Figure 16.2: An example mapping descriptor

16.1.2 The UIMA component descriptor

As well as the mapping file, you must provide the UIMA component descriptor that defines how to access the TAE that is to be called. This could be a primitive or aggregate analysis engine descriptor, or a URI specifier giving the location of a remote Vinci or SOAP service. It is up to the developer to ensure that the types and features used in the mapping descriptor are compatible with the type system and capabilities of the TAE; otherwise a runtime error is likely to occur.

16.1.3 Using the AnalysisEnginePR

To use a UIMA TAE in GATE, load the uima plugin and create a "UIMA Analysis Engine" processing resource. If using the GATE framework rather than the GUI, the class name is gate.uima.AnalysisEnginePR. The processing resource expects two parameters:

analysisEngineDescriptor The URL of the UIMA analysis engine descriptor (or URI specifier, for a remote TAE service). This must be a file: URL, as UIMA needs a file path against which to resolve imports.

mappingDescriptor The URL of the mapping descriptor file. This may be any kind of URL (file:, http:, Class.getResource(), ServletContext.getResource(), etc.)

Any errors processing either of the descriptor files will cause an exception to be thrown. Once instantiated, you can add the PR to a pipeline in the usual way. AnalysisEnginePR implements LanguageAnalyser, so can be used in any of the standard GATE pipeline types.
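As an illustration, here is a minimal sketch of creating the PR from the framework; the plugin and descriptor paths are hypothetical, and error handling is omitted:

import java.net.URL;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;

public class CreateUimaPR {
  public static void main(String[] args) throws Exception {
    Gate.init();
    // load the uima plugin from wherever it is installed
    Gate.getCreoleRegister().registerDirectories(
        new URL("file:/opt/gate/plugins/uima/"));

    FeatureMap params = Factory.newFeatureMap();
    // must be a file: URL, so UIMA can resolve imports
    params.put("analysisEngineDescriptor",
        new URL("file:/opt/myapp/conf/TokenCounterAE.xml"));
    // may be any kind of URL
    params.put("mappingDescriptor",
        new URL("file:/opt/myapp/conf/TokenCounterMapping.xml"));

    ProcessingResource pr = (ProcessingResource)
        Factory.createResource("gate.uima.AnalysisEnginePR", params);
    // pr can now be added to a pipeline in the usual way
  }
}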

The PR takes the following runtime parameter (in addition to the document parameter, which is set automatically by a CorpusController):

annotationSetName The annotation set to process. Any input mappings take annotations from this set, and any output mappings place their new annotations in this set (added outputs) or update the input annotations in this set (updated or removed). If not specified, the default (unnamed) annotation set is used.

The Annotator implementation must be available for GATE to load. For an annotator written in Java, this means that the JAR file containing the annotator class (and any other classes it depends on) must be present in the GATE classloader. The easiest way to achieve this is to put the JAR file or files in a new directory, and create a creole.xml file in the same directory to reference the JARs:

<CREOLE-DIRECTORY>
  <JAR>my-annotator.jar</JAR>
  <JAR>classes-it-uses.jar</JAR>
</CREOLE-DIRECTORY>

This directory should then be loaded in GATE as a CREOLE plugin. Note that, due to the complex mechanics of classloaders in Java, putting your JARs in GATE’s lib directory will not work.

For annotators written in C++ you need to ensure that the C++ enabler libraries (available separately from http://alphaworks.ibm.com/tech/uima/) and the shared library containing your annotator are in a directory which is on the PATH (Windows) or LD_LIBRARY_PATH (Linux) when GATE is run.

16.1.4 Current limitations

If you are using Java 5.0 you may get a NullPointerException or NoClassDefFoundError from the UIMA XML parser when parsing the analysis engine descriptor. This can be fixed by copying xml.jar from the lib directory into GATE_HOME/lib.

16.2 Embedding a GATE CorpusController in UIMA

The process of embedding a GATE controller in a UIMA application is more or less the mirror image of the process detailed in the previous section. Again, the developer must supply a mapping descriptor defining how to map between UIMA and GATE annotations, and pass this, plus the GATE controller definition, to a TAE which performs the translation and calls the GATE controller.

16.2.1 Mapping file format

The mapping descriptor format is virtually identical to that described in section 16.1.1, except that the input definitions are <gateAnnotation> elements and the output definitions are <uimaAnnotation> elements. The input and output definition elements support an extra attribute, annotationSetName, which allows inputs to be taken from, and outputs to be placed in, different annotation sets. For example, the following hypothetical example maps com.example.Person annotations into the default set and com.example.html.Anchor annotations to “a” tags in the “Original markups” set.
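A sketch of what those input definitions might look like (illustrative only; the element and attribute names follow the pattern of section 16.1.1, so treat the details as assumptions):

<inputs>
  <!-- creates Person annotations in the default annotation set -->
  <gateAnnotation type="Person" uimaType="com.example.Person"/>

  <!-- creates "a" annotations in the "Original markups" set -->
  <gateAnnotation type="a" uimaType="com.example.html.Anchor"
                  annotationSetName="Original markups"/>
</inputs>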

Figure 16.3 shows a mapping descriptor for an application that takes tokens and sentences produced by some UIMA component and runs the GATE part of speech tagger to tag them with Penn TreeBank POS tags.3 In the example, no features are copied from the UIMA tokens, but they are still indexed="true" as the POS feature must be copied back from GATE.

3 The .gapp file implementing this example is in the test/conf directory under the uima plugin, along with the mapping file and the TAE descriptor that will run it.

Figure 16.3: An example mapping descriptor for the GATE POS tagger

16.2.2 The GATE application definition

The GATE application to embed is given as a standard “.gapp file”, as produced by saving the state of an application in the GATE GUI. The .gapp file encodes the information necessary to load the correct plugins and create the various CREOLE components that make up the application. The .gapp file must be fully specified and able to be executed with no user intervention other than pressing the Go button. In particular, all runtime parameters must be set to their correct values before saving the application state. Also, since paths to things like CREOLE plugin directories, resource files, etc. are stored relative to the .gapp file's location, you must not move the .gapp file to a different directory unless you can keep all the CREOLE plugins it depends on at the same relative locations.

16.2.3 Configuring the GATEApplicationAnnotator

GATEApplicationAnnotator is the UIMA annotator that handles mapping the CAS into a GATE document and back again and calling the GATE controller. There is a template TAE descriptor XML file for the annotator provided in the conf directory. Most of the template file can be used unchanged, but you will need to modify the type system definition and input/output capabilities to match the types and features used in your mapping descriptor. If the mapping descriptor references a type or feature that is not defined in the type system, a runtime error will occur.

The annotator requires two external resources:

GateApplication The .gapp file containing the saved application state.

MappingDescriptor The mapping descriptor XML file.

These must be bound to suitable URLs, either by editing the resourceManagerConfiguration section of the primitive descriptor, or by supplying the binding in an aggregate descriptor that includes the GATEApplicationAnnotator as one of its delegates.

In addition, you may need to set the following Java system properties:

uima.gate.configdir The path to the GATE config directory. This defaults to gate-config in the same directory as uima-gate.jar.

uima.gate.siteconfig The location of the sitewide gate.xml configuration file. This defaults to uima.gate.configdir/site-gate.xml.

uima.gate.userconfig The location of the user-specific gate.xml configuration file. This defaults to uima.gate.configdir/user-gate.xml.

The default config files are deliberately simplified from the standard versions supplied with GATE; in particular, they do not load any plugins automatically (not even ANNIE). All the plugins used by your application are specified in the .gapp file, and will be loaded when the application is loaded, so it is best to avoid loading any others from gate.xml, to avoid problems such as two different versions of the same plugin being loaded from different locations.

Classpath notes

In addition to the usual UIMA library JAR files, GATEApplicationAnnotator requires a number of JAR files from the GATE distribution in order to function. In the first instance, you should include gate.jar from GATE's bin directory, and also all the JAR files from GATE's lib directory, on the classpath. If you use the supplied Ant build file, ant documentanalyser will run the document analyser with this classpath. Depending on exactly which GATE plugins your application uses, you may be able to exclude some of the lib JAR files (for example, you will not need Weka if you do not use the machine learning plugin), but it is safest to start with them all. GATE will load plugin JAR files through its own classloader, so these do not need to be on the classpath.

Note that the GATE lib directory includes a version of the Apache Xerces XML parser. UIMA also includes an XML parser in its xml.jar. If your program generates unexplained XML parsing exceptions, try removing one or the other of the XML parsers from the classpath to see if this solves the problem.

Appendices

Appendix A

Design Notes

Why has the pleasure of slowness disappeared? Ah, where have they gone, the amblers of yesteryear? Where have they gone, those loafing heroes of folk song, those vagabonds who roam from one mill to another and bed down under the stars? Have they vanished along with footpaths, with grasslands and clearings, with nature? There is a Czech proverb that describes their easy indolence by a metaphor: ’they are gazing at God’s windows.’ A person gazing at God’s windows is not bored; he is happy. In our world, indolence has turned into having nothing to do, which is a completely different thing: a person with nothing to do is frustrated, bored, is constantly searching for an activity he lacks. Slowness, Milan Kundera, 1995 (pp. 4-5).

GATE is a backplane into which specialised Java Beans plug. These beans are loosely coupled with respect to each other: they communicate entirely by means of the GATE framework. Inter-component communication is handled by model components (LanguageResources) and by events.

Components are defined by conformance to various interfaces (e.g. LanguageResource), ensuring separation of interface and implementation.

The reason for adding to the normal bean initialisation mechanism is that LRs, PRs and VRs all have characteristic parameterisation phases; the GATE resources/components model makes these phases explicit.

A.1 Patterns

GATE is structured around a number of what we might call principles, or patterns, or alternatively, clever ideas stolen from better minds than mine. These patterns are:

• modelling most things as extensible sets of components (cf. Section A.1.1);

• separating components into model, view, or controller types (cf. Section A.1.2);

• hiding implementation behind interfaces (cf. Section A.1.3).

Four interfaces in the top-level package describe the GATE view of components: Resource, ProcessingResource, LanguageResource and VisualResource.

A.1.1 Components

Architectural Principle

Wherever users of the architecture may wish to extend the set of a particular type of entity, those types should be expressed as components.

Another way to express this is to say that the architecture is based on agents. I’ve avoided this in the past because of an association between this term and the idea of bits of code moving around between machines of their own volition. I take this to be somewhat pointless, and probably the result of an anthropomorphic obsession with mobility as a correlate of intelligence. If we drop this connotation, however, we can say that GATE is an agent-based architecture. If we want to, that is.

Framework Expression

Many of the classes in the framework are components, by which we mean classes that conform to an interface with certain standard properties. In our case these properties are based on the Java Beans component architecture, with the addition of component metadata, automated loading and standardised storage, threading and distribution.

All components inherit from Resource, via one of:

• LanguageResource (LR) represents entities such as lexicons, corpora or ontologies;

• VisualResource (VR) represents visualisation and editing components that participate in GUIs;

• ProcessingResource (PR) represents entities that are primarily algorithmic, such as parsers, generators or ngram modellers.

A.1.2 Model, view, controller

According to Buschmann et al (Pattern-Oriented Software Architecture, 1996), the Model-View-Controller (MVC) pattern

...divides an interactive application into three components. The model contains the core functionality and data. Views display information to the user. Controllers handle user input. Views and controllers together comprise the user interface. A change-propagation mechanism ensures consistency between the user interface and the model. [p.125]

A variant of MVC, the Document-View pattern,

...relaxes the separation of view and controller... The View component of Document-View combines the responsibilities of controller and view in MVC, and implements the user interface of the system.

A benefit of both arrangements is that

...loose coupling of the document and view components enables multiple simultaneous synchronized but different views of the same document.

Geary (Graphic Java 2, 3rd Edn., 1999) gives a slightly different view:

MVC separates applications into three types of objects:

• Models: Maintain data and provide data accessor methods
• Views: Paint a visual representation of some or all of a model's data
• Controllers: Handle events

... By encapsulating what other architectures intertwine, MVC applications are much more flexible and reusable than their traditional counterparts. [pp. 71, 75]

Swing, the Java user interface framework, uses

a specialised version of the classic MVC meant to support pluggable look and feel instead of applications in general. [p. 75]

GATE may be regarded as an MVC architecture in two ways:

• directly, because we use the Swing toolkit for the GUIs;

• by analogy, where LRs are models, VRs are views and PRs are controllers. Of these, the latter sits least easily with the MVC scheme, as PRs may indeed be controllers but may also not be.

A.1.3 Interfaces

Architectural Principle

The implementation of types should generally be hidden from the clients of the architecture.

Framework Expression

With a few exceptions (such as for utility classes), clients of the framework work with the gate.* package. This package is mostly composed of interface definitions. Instantiations of these interfaces are obtained via the Factory class.

The subsidiary packages of GATE provide the implementations of the gate.* interfaces that are accessed via the factory. They themselves avoid directly constructing classes from other packages (with a few exceptions, such as JAPE’s need for unattached annotation sets). Instead they use the factory.

A.2 Exception Handling

When and how to use exceptions? Borrowing from Bill Venners, here are some guidelines (with examples):

1. Exceptions exist to refer problem conditions up the call stack to a level at which they may be dealt with. "If your method encounters an abnormal condition that it can't handle, it should throw an exception." If the method can handle the problem rationally, it should catch the exception and deal with it.

Example: If the creation of a resource such as a document requires a URL as a parameter, the method that does the creation needs to construct the URL and read from it. If there is an exception during this process, the GATE method should abort by throwing its own exception. The exception will be dealt with higher up the food chain, e.g. by asking the user to input another URL, or by aborting a batch script.

2. All GATE exceptions should inherit from gate.util.GateException (a descendant of java.lang.Exception, hence a checked exception) or gate.util.GateRuntimeException (a descendant of java.lang.RuntimeException, hence an unchecked exception). This rule means that clients of GATE code can catch all sorts of exceptions thrown by the system with only two catch statements. (This rule may be broken by methods that are not public, so long as their callers catch the non-GATE exceptions and deal with them or convert them to GateException/GateRuntimeException.) Almost all exceptions thrown by GATE should be checked exceptions: the point of an exception is that clients of your code get to know about it, so use a checked exception to make the compiler force them to deal with it. The exceptions to this rule are described in guidelines 3 and 4 below.

Example: With reference to the previous example, a problem using the URL will be signalled by something like an UnknownHostException or an IOException. These should be caught and re-thrown as descendants of GateException.
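A minimal sketch of this guideline in code (the class and method here are hypothetical, not part of the GATE API):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import gate.util.GateException;

public class DocumentContentReader {

  // reads the content behind a URL; low-level IOExceptions (including
  // UnknownHostException) are caught and re-thrown as the checked
  // GateException, so callers only deal with GATE's exception types
  public static String readContent(String urlString) throws GateException {
    try {
      URL url = new URL(urlString);
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(url.openStream()));
      StringBuffer content = new StringBuffer();
      String line;
      while ((line = reader.readLine()) != null) {
        content.append(line).append('\n');
      }
      reader.close();
      return content.toString();
    } catch (IOException e) {
      // abort by throwing our own exception; it will be dealt with
      // higher up, e.g. by asking the user for another URL
      throw new GateException("Could not read document from "
                              + urlString + ": " + e.getMessage());
    }
  }
}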

3. In a situation where an exceptional condition is an indication of a bug in the GATE library, or in the implementation of some other library, then it is permissible to throw an unchecked exception.

Example: If a method is creating annotations on a document, and before creating the annotations it checks that their start and end points are valid ranges in relation to the content of the document (i.e. they fall within the offset space of the document, and the end is after the start), then if the method receives an InvalidOffsetException from the AnnotationSet.add call, something is seriously wrong. In such cases it may be best to throw a GateRuntimeException.

4. Where you are inheriting from a non-GATE class and therefore have the exception signatures fixed for you, you may add a new exception deriving from a non-GATE class.

Example: The SAX XML parser API uses SAXException. Implementing a SAX parser for a document type involves overriding methods that throw this exception. Where you want to have a subtype for some problem which is specific to GATE processing, you could use GateSaxException, which extends SAXException.

5. Test code is different: in the JUnit test cases it is fine just to declare that each method throws Exception and leave it at that. The JUnit test runner will pick up the exceptions and report them to you. Test methods should, however, try and ensure that the exceptions thrown are meaningful. For example, avoid null pointer exceptions in the test code itself, e.g. by using assertNotNull.

Example:

public void testComments() throws Exception {
  ResourceData docRd = (ResourceData) reg.get("gate.Document");
  assertNotNull("testComments: couldn't find document res data", docRd);
  String comment = docRd.getComment();
  assertTrue(
    "testComments: incorrect or missing COMMENT on document",
    comment != null && comment.equals("GATE document")
  );
} // testComments()

See also the testing notes.

6. "Throw a different exception type for each abnormal condition." You can go too far on this one - a hundred exception types per package would certainly be too much - but in general you should create a new exception type for each different sort of problem you encounter.

Example: The gate.creole package has a ResourceInstantiationException - this deals with all problems to do with creating resources. We could have had "ResourceUrlProblem" and "ResourceParameterProblem" but that would probably have ended up with too many. On the other hand, just throwing everything as GateException is too coarse (Hamish take note!).

7. Put exceptions in the package that they’re thrown from (unless they’re used in many packages, in which case they can go in gate.util). This makes it easier to find them in the documentation and prevents name clashes.

Example: gate.jape.ParserException is correctly placed; if it were in gate.util it might clash with, for example, a gate.xml.ParserException, if there were such a class.

Appendix B

JAPE: Implementation

The annual Diagram prize for the oddest book title of the year has been awarded to Gerard Forlin's Butterworths Corporate Manslaughter Service, a hefty law tome providing guidance and analysis on corporate liability for deaths in the workplace. The book, not published until January, was up against five other shortlisted titles: Fancy Coffins to Make Yourself; The Flat-Footed Flies of Europe; Lightweight Sandwich Construction; Tea Bag Folding; and The Art and Craft of Pounding Flowers: No Paint, No Ink, Just a Hammer! The shortlist was thrown open to readers of the literary trade magazine The Bookseller, who chose the winner by voting on the magazine's website. Butterworths Corporate Manslaughter Service, a snip at £375, emerged as the overall victor with 35% of the vote. The Diagram prize has been a regular on the award circuit since 1978, when Proceedings of the Second International Workshop on Nude Mice carried off the inaugural award. Since then, titles such as American Bottom Archaeology and last year's winner, High-Performance Stiffened Structures (an engineering publication), have received unwonted publicity through the prize. This year's winner is perhaps most notable for its lack of entendre. Manslaughter Service kills off competition in battle of strange titles, Emma Yates, The Guardian, November 30, 2001.

This chapter gives implementation details and formal definitions of the JAPE annotation patterns language. Section B.1 gives a more formal definition of the JAPE grammar, and some examples of its use. Section B.2 describes JAPE’s relation to CPSL. The next 3 sections describe the algorithms used, label binding, and the classes used. Section B.6 gives an example of the implementation; and finally, section B.7 explains the compilation process.


B.1 Formal Description of the JAPE Grammar

JAPE is similar to CPSL (a Common Pattern Specification Language, developed in the TIPSTER programme by Doug Appelt and others), with a few exceptions. Figure B.1 gives a BNF (Backus-Naur Form) description of the grammar.

An example rule LHS:

Rule: KiloAmount
(
  ({Token.kind == "containsDigitAndComma"}):number
  {Token.string == "kilograms"}
):whole

A basic constraint specification appears between curly braces, and gives a conjunction of annotation/attribute/value specifiers which have to match at a particular point in the annotation graph. A complex constraint specification appears within round brackets, and may be bound to a label with the “:” operator; the label then becomes available in the RHS for access to the annotations matched by the complex constraint. Complex constraints can also have Kleene operators (*, +, ?) applied to them. A sequence of constraints represents a sequential conjunction; disjunction is represented by separating constraints with “|”.
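For instance, the following illustrative fragment (the annotation types are invented) combines a disjunction of two basic constraints with a Kleene + and a binding:

(
  {Lookup.majorType == "location"} | {Lookup.majorType == "organization"}
)+:matched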

Converted to the format accepted by the JavaCC LL parser generator, the most significant fragment of the CPSL grammar (as described by Appelt, based on an original specification from a TIPSTER working group chaired by Boyan Onyshkevych) goes like this:

constraintGroup -->
    (patternElement)+ ("|" (patternElement)+ )*

patternElement -->
    "{" constraint ("," constraint)* "}"
  | "(" constraintGroup ")" (kleeneOp)? (binding)?

Here the first line of patternElement is a basic constraint, the second a complex one.


Figure B.1: BNF of JAPE's grammar

An example of a complete rule:

Rule: NumbersAndUnit
(
  ( {Token.kind == "number"} )+:numbers
  {Token.kind == "unit"}
)
-->
:numbers.Name = { rule = "NumbersAndUnit" }

This says ‘match sequences of numbers followed by a unit; create a Name annotation across the span of the numbers, with a rule attribute whose value is “NumbersAndUnit”’.

B.2 Relation to CPSL

We differ from the CPSL spec in various ways:

1. No pre- or post-fix context is allowed on the LHS.

2. No function calls on the LHS.

3. No string shorthand on the LHS.

4. We have two rule application algorithms (one like TextPro, one like Brill/Mitre). See section B.3.

5. Expressions relating to labels unbound on the LHS are not evaluated on the RHS. (In TextPro they evaluate to “false”.) See the binding scheme description in section B.4.

6. JAPE allows arbitrary Java code on the RHS.

7. JAPE has a different macro syntax, and allows macros for both the RHS and LHS.

8. JAPE grammars are compiled and stored as serialised Java objects.

Apart from this, it is a full implementation of CPSL, and the formal power of the languages is the same (except that a JAPE RHS can delete annotations, which straight CPSL cannot). The rule LHS is a regular language over annotations; the rule RHS can perform arbitrary transformations on annotations, but the RHS is only fired after the LHS has been evaluated, and the effects of a rule application can only be referenced after the phase in which it occurs, so the recognition power is no more than regular.

B.3 Algorithms for JAPE Rule Application

JAPE rules are applied in one of two ways: Brill-style, where each rule is applied at every point in the document at which it matches; Appelt-style, where only the longest matching rule is applied at any point where more than one might apply.
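The style of application is selected per grammar phase via the control option at the top of the grammar; for example (the phase name here is invented):

Phase: Example
Input: Token Lookup
Options: control = appelt   // or: control = brill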

In the Appelt case, the rule set for a phase may be considered as a single disjunctive expression (and an efficient implementation would construct a single automaton to recognise the whole rule set). To solve this problem, we need to employ two algorithms:

• one that takes as input a CPSL representation and builds a machine capable of recognizing the situations that match the rules and makes the bindings that occur each time a rule is applied. This machine is a Finite State Machine (FSM), somewhat similar to a lexical analyser (a deterministic finite state automaton).

• another one that uses the FSM built by the above algorithm and traverses the annotation graph in order to find the situations that the FSM can recognise.

B.3.1 The first algorithm

The first step that needs to be taken in order to create the FSM is to read the CPSL description from the external file(s). This is already done in the old version of Jape.

The second step is to build a nondeterministic FSM from the Java objects resulting from the parsing process. This FSM will have one initial state and a set of final states, each of them associated with one rule (this way we know which RHS we have to execute in case of a match). The nondeterministic FSM will also have empty transitions (arcs labelled with nil). In order to build this FSM we will need to implement a version of the algorithm used to convert regular expressions into NFAs.

Finally, this nondeterministic FSM will have to be converted to a deterministic one. The deterministic FSM will have more states (in the worst case 2^s, where s is the number of states in the nondeterministic one, though this case is very improbable) but will be more efficient because it will not have to backtrack.

Let NFSM be the nondeterministic FSM and DFSM the deterministic one.

The issues that have to be addressed are:

The NFSM will basically be a big OR. This means that it will have an initial state from which empty transitions will lead to the sub-FSMs associated with each rule (see Fig. B.2). When the NFSM is converted to a DFSM, the initial state will be the set containing all the initial states of the FSMs associated with each rule. From that state we will have to compute the possible transitions. For this, the classical algorithm requires us to check, for each possible input symbol, what the set of reachable states is.

Figure B.2: A nondeterministic FSM

The problem is that our input symbols are actually sets of restrictions. This is similar to an automaton that has an infinite set of input symbols (although any given set of rules describes a finite set of constraints). This is not so bad; the real problem is that we have to check whether there are transitions that have the same restrictions. We can safely consider that there are no two transitions with the same set of restrictions. This is safe because, if this assumption is wrong, the result will be a state that has two transitions starting from it which consume the same symbol. This is not a problem, because we have to check all outgoing transitions anyway; we will only check the same transition twice.

This leads to the next issue. Imagine the following part of the transition graph of an FSM (Fig. B.3):

The restrictions associated to a transition are depicted as graphical figures (the two coloured squares). Now imagine that the two sets of restrictions have a common part (the yellow triangle).

Let us assume that at one moment the current node in the FSM graph (for one of the active FSM instances) is state 1. We get from the annotation graph the set of annotations starting from the associated current node in the annotation graph, and try to advance in the FSM transition graph. In order to do this we will have to find a subset of annotations that match the restrictions for moving to state 2 or state 3.

Figure B.3: Example of transitions

In a classical algorithm, what we would do is try to match the annotations against the restrictions “1-2” (this will return a boolean value and a set of bindings) and then try the matching against the restrictions “1-3”; this means that we will try to match the restrictions in the common part twice. Because of the probable structure of the FSM transition graph there will be a lot of transitions starting from the same node, which means that a lot of conditions may be checked more than once.

What can we do to improve this?

We need a way to combine all the restrictions associated to all outgoing arcs of a state (see Fig. B.4).

Figure B.4: A combined matching process

One way to do the (combined) matching is to pre-process the DFSM and to convert all transitions to matchers (as in Fig. B.4). This could be done using the following algorithm:

• Input: A DFSM;

• Output: A DFSM with compound restriction checks.

• for each state s of the DFSM:

1. collect all the restrictions in the labels of the outgoing arcs from s (in the DFSM transition graph). Note: these restrictions are either of the form “Type == t1” or of the form “Type == t1 && Attr_i == Value_i”;

2. group all these restrictions by type and branch, and create compound restrictions of the form “[Type == t1 && Attr_1 == Value_1 && Attr_2 == Value_2 && ... && Attr_n == Value_n]”. The grouping has to be done with care so that it doesn't mix restrictions from different branches, creating unnecessarily restrictive queries. These restrictions will be sent to the annotation graph, which will do the matching for us. Note that we can only reuse previous queries if the restrictions are identical on two branches.1

3. create the data structures necessary for linking the bindings to the results of the queries (see Fig. B.5).

Figure B.5: Building a compound matcher

When this machine is used for the actual matching, the three queries will be run and the results stored in sets of annotations (S1..S3 in the picture), and then:

• For each pair of annotations (A1, A2) s.t. A1 in S1 and A2 in S2:

1. a new DFSM instance will be created;
2. this instance will move to state 2;
3. <A1, A2> will be bound to L1;
4. the corresponding node in the annotation graph will become max(A1.endNode(), A2.endNode()).

1 By this we mean restrictions referring to the same type of annotations. If for branches 1-2 and 1-3 the restrictions for the type T1 are the same, the query for type T1 will be run only once. Each of the two branches can also have restrictions for other types of annotations.

• Similarly, for each pair of annotations (A1, A3) s.t. A1 in S1 and A3 in S3:

1. a new DFSM instance will be created;
2. this instance will move to state 3;
3. <A1, A3> will be bound to L2;
4. the corresponding node in the annotation graph will become max(A1.endNode(), A3.endNode()).

While building the compound matcher it is possible to detect queries that depend on one another (e.g. where the expected results of one query are a subset of the results of another). Such situations can be marked so that some operations can be avoided when the queries are actually run: if the less restrictive search returned no results then the more restrictive one can be skipped, and if a search returns an AnnotationSet (an object that can itself be queried) then the more restrictive query can be run against that set rather than against the whole annotation graph.

B.3.2 Algorithm 2

Consider the following figure:

Figure B.6: An annotation graph

Basically, the algorithm has to traverse this graph starting from the leftmost node to the rightmost one. Each path found is a sequence of possible matches.

Because more than one annotation (all starting at the same point) can be matched at one step, a path is not viewed as a classical path in a graph, but as a sequence of steps, each step being a set of annotations that start in the same node. For example, a path in the graph above can be: [1].[2,4].[7,8].[10]. Note that the next step continues from the rightmost node reached by the annotations in the current step.

The matchings are made by a Finite State Machine that resembles a classical lexical analyser (a.k.a. scanner). The main difference from a scanner is that there are no input symbols; the transition from one state to another is based on matching a set of objects (annotations) against a set of restrictions (the constraint group in the LHS of a CPSL rule).

The algorithm can be the following:

1. startNode = the leftmost node

2. create a first instance of the FSM and add it to the list of active instances;

3. for this FSM instance set current node as the leftmost node;

4. while (startNode != last node) do

   1. while (not over) do
      1. for each active instance Fi of the FSM do
         1. if this instance is in a final state, then save a clone of it in the set of accepting FSMs (instances of the FSM that have reached a final state);
         2. read all the annotations starting from the current node;
         3. select all sets of annotations that can be used to advance one step in the transition graph of the FSM;
         4. for each such set, create a new instance of the FSM, put it in the active list and make it consume the corresponding set of annotations, making any necessary bindings in the process (this new instance will advance in the annotation graph to the rightmost node that is an end of a matched annotation);
         5. discard Fi;
      2. end for;
      3. if the set of active instances of the FSM is empty* then over = true;
      end while;

   2. if the set of accepting FSMs is not empty:
      1. from all accepting FSMs select** the one that matched the longest path; if there are more than one for the same path length, select the one with highest priority;
      2. execute the action associated with the final state of the selected FSM instance;
      3. startNode = selectedFSMInstance.getLastNode().getNextNode();
   3. else // the matching failed: start over from the next node
      startNode = startNode.getNextNode();

5. end while;

*: the set of active FSM instances can decrease when an active instance cannot continue (there is no set of annotations starting from its current node that can be matched). In this case it will be removed from the set.

**: if we do Brill style matching, we have to process each of the accepting instances.

B.4 Label Binding Scheme

In TextPro, a “:” label binds to the last matched annotation in its scope. A “+:” label binds to all the annotations matched in the scope. In JAPE there is no “+:” label (though there is a “:+” – see below), due to the ambiguity with Kleene +. In CPSL a constraint group can be both labelled and have a Kleene operator. How can Kleene + followed by a “:” label be distinguished from a “+:” label? E.g. given (....)+:label, are the constraints within the brackets having Kleene + applied to them and then being labelled, or is it a “+:” label?

Appelt’s answer is that +: is always a label; to get the other interpretation use ((...)+):. This may be difficult for rule developers to remember; JAPE disallows the “+:” label, and makes all matched annotations available from every label.

JAPE adds a “:+” label operator, which means that all the spans of any annotations matched are assigned to new annotations created on the RHS relative to that label. (With ordinary “:” labels, only the span of the outermost corners of the annotations matched is used.) (This operator disappears in GATE version 2, with the elimination of multi-span annotations.)

Another problem regards RHS interpretation of unbound labels. If we have something like

(
  ( {Word.string == "thing"} ):1
  |
  ( {Word.string == "otherthing"} ):2
)

on the LHS, and references to :1 and :2 on the RHS, only one of these will actually be bound to anything when the rule is fired. The expression containing the other should be ignored. In TextPro, an assignment on the RHS that references an unbound label is evaluated to the value “false”. In JAPE, RHS expressions involving unbound labels are not evaluated.

B.5 Classes

The main external interfaces to JAPE are the classes gate.jape.Batch and gate.jape.Compiler. The CPSL Parser is implemented by ParseCpsl.jj, which is input to JavaCC (and JJDoc, to produce grammar documentation) and finally Java itself. There are lots of other classes produced along the way by the compiler-compiler tools:

ASCII_CharStream.java, JJTParseCpslState.java, Node.java, ParseCpsl.java, ParseCpslConstants.java, ParseCpslTokenManager.java, ParseCpslTreeConstants.java, ParseException.java, SimpleNode.java, TestJape.java, Token.java, TokenMgrError.java

These live in the parser subpackage, in the gate/jape/parser directory.

Each grammar results in an object of class Transducer, which has a set of Rules.

Constants are held in the interface JapeConstants. The test harness is in TestJape.

B.6 Implementation

B.6.1 A Walk-Through

The pattern application algorithm (which is either like Doug's, or like Brill's) makes a top-level call to something like

boolean matches(int position, Document doc, MutableInteger newPosition)
    throws PositionOutOfRange

which is a method on each Rule. This is in turn deferred to the rule's LeftHandSide, and thence to the ConstraintGroup which each LeftHandSide contains. The ConstraintGroup iterates over its set of PatternElementConjunctions; when one succeeds, the matches call returns true; if none succeed, it returns false. The Rules also have

void transduce(Document doc) throws LhsNotMatched

methods, which may be called after a successful match, and result in the application of the RightHandSide of the Rule to the document.

PatternElements also implement the matches method. Whenever it succeeds, the annotations which were consumed during the match are available from that element, as are a composite span set, and a single span that covers the whole set. In general these will only be accessed via a bindingName, which is associated with ComplexPatternElements. The LeftHandSide maintains a mapping of bindingNames to ComplexPatternElements (which are accessed by array reference in Rule RightHandSides).

Although PatternElements give access to an annotation set, these are only built when they are asked for (caching ensures that they are only built once), to avoid storing annotations against every matched element. When asked for, the construction process is an iterative traversal of the elements contained within the element being asked for the annotations. This traversal always bottoms out in BasicPatternElements, which are the only ones that need to store annotations all the time.

In a RightHandSide application, then, a call to the LeftHandSide’s binding environment will yield a ComplexPatternElement representing the bound object, from which annotations and spans can be retrieved as needed.

B.6.2 Example RHS code

Let’s imagine we are writing an RHS for a rule which binds a set of annotations representing simple numbers to the label :numbers. We want to create a new annotation spanning all the ones matched, whose value is an Integer representing the sum of the individual numbers.

The RHS consists of a comma-separated list of blocks, which are either anonymous or labelled. (We also allow the CPSL-style shorthand notation as implemented in TextPro. This is more limiting than code, though; e.g. I don't know how you could do the summing operation below in CPSL.) Anonymous blocks will be evaluated within the same scope, which encloses that of all named blocks, and all blocks are evaluated in order, so declarations can be made in anonymous blocks and then referenced in subsequent blocks. Labelled blocks will only be evaluated when they were bound during LHS matching. The symbol doc is always in scope, bound to the Document that the Transducer to which this rule belongs is processing. For example:

// match a sequence of integers, and store their sum
Rule: NumberSum
( {Token.kind == "otherNum"} )+ :numberList
-->
:numberList{
  // the running total
  int theSum = 0;

  // loop round all the annotations the LHS consumed
  for(int i = 0; i < numberListAnnots.size(); i++) {

    // get the number string for this annot
    String numberString = doc.spanStrings(numberListAnnots.nth(i));

    // parse the number string and add to running total
    try {
      theSum += Integer.parseInt(numberString);
    } catch(NumberFormatException e) {
      // ignore badly-formatted numbers
    }
  } // for each number annot

  doc.addAnnotation(
    "number",
    numberListAnnots.getLeftmostStart(),
    numberListAnnots.getRightmostEnd(),
    "sum", new Integer(theSum)
  );
} // :numberList

This stuff then gets converted into code (that is used to form the class we create for RHSs) looking like this:

package japeactionclasses;

import gate.*;
import java.io.*;
import gate.jape.*;
import gate.util.*;
import gate.creole.*;

public class Test2NumberSumActionClass
    implements java.io.Serializable, RhsAction {

  public void doit(Document doc, LeftHandSide lhs) {
    AnnotationSet numberListAnnots = lhs.getBoundAnnots("numberList");
    if(numberListAnnots.size() != 0) {
      int theSum = 0;
      for(int i = 0; i < numberListAnnots.size(); i++) {
        String numberString = doc.spanStrings(numberListAnnots.nth(i));
        try {
          theSum += Integer.parseInt(numberString);
        } catch(NumberFormatException e) { }
      }
      doc.addAnnotation(
        "number",
        numberListAnnots.getLeftmostStart(),
        numberListAnnots.getRightmostEnd(),
        "sum", new Integer(theSum)
      );
    }
  }
}

B.7 Compilation

JAPE uses a compiler that translates CPSL grammars to Java objects that target the GATE API (and a regular expression library). It uses a compiler-compiler (JavaCC) to construct the parser for CPSL. Because CPSL is a transducer based on a regular language (in effect an FST) it deploys similar techniques to those used in the lexical analysers of parser generators (e.g. lex, flex, JavaCC tokenisation rules).

In other words, the JAPE compiler is a compiler generated with the help of a compiler-compiler which uses back-end code similar to that used in compiler-compilers. Confused? If not, welcome to the domain of the nerds, which is where you belong; I'm sure you'll be happy here.

B.8 Using a Different Java Compiler

GATE allows you to choose which Java compiler is used to compile the action classes generated from JAPE rules. The preferred compiler is specified by the Compiler_type option in gate.xml. At present the supported values are:

Sun The Java compiler supplied with the JDK. Although the option is called Sun, it supports any JDK that supplies com.sun.tools.javac.Main in a standard location, including the IBM JDK (all platforms) and the Apple JDK for Mac OS X.

Eclipse The Eclipse compiler, from the Java Development Tools of the Eclipse project (http://www.eclipse.org/jdt). Currently we use the compiler from Eclipse 3.2, which supports Java 5.0.
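For instance, the option might be set in gate.xml like this — a guess at the exact element layout, so check the gate.xml shipped with your installation before relying on it:

<GATE>
  <GATECONFIG Compiler_type="Eclipse"/>
</GATE>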

By default, the Eclipse compiler is used. It compiles faster than the Sun compiler, and loads dependencies via the GATE ClassLoader, which means that Java code on the right hand side of JAPE rules can refer to classes that were loaded from a plugin JAR file. The Sun compiler can only load classes from the system classpath, so it will not work if GATE is loaded from a subsidiary classloader, e.g. a Tomcat web application. You should generally use the Eclipse compiler unless you have a compelling reason not to.

Support for other compilers can be added, but this is not documented here - if you're in a position to do this, you won't mind reading the source code...

Appendix C

Named-Entity State Machine Patterns

There are, it seems to me, two basic reasons why minds aren't computers... The first... is that human beings are organisms. Because of this we have all sorts of needs - for food, shelter, clothing, sex etc - and capacities - for locomotion, manipulation, articulate speech etc, and so on - to which there are no real analogies in computers. These needs and capacities underlie and interact with our mental activities. This is important, not simply because we can't understand how humans behave except in the light of these needs and capacities, but because any historical explanation of how human mental life developed can only do so by looking at how this process interacted with the evolution of these needs and capacities in successive species of hominids. ... The second reason... is that... brains don't work like computers. Minds, Machines and Evolution, Alex Callinicos, 1997 (ISJ 74, p.103).

This chapter describes the individual grammars used in GATE for Named Entity Recognition, and how they are combined together. It relates to the default NE grammar for ANNIE, but should also provide guidelines for those adapting or creating new grammars. For documentation about specific grammars other than this core set, use this document in combination with the comments in the relevant grammar files. Chapter 7 also provides information about designing new grammar rules and tips for ensuring maximum processing speed.

C.1 Main.jape

This file contains a list of the grammars to be used, in the correct processing order. The ordering of the grammars is crucial, because they are processed in series, and later grammars may depend on annotations produced by earlier grammars.

The default grammar consists of the following phases:

• first.jape

• firstname.jape

• name.jape

• name_post.jape

• date_pre.jape

• date.jape

• reldate.jape

• number.jape

• address.jape

• url.jape

• identifier.jape

• jobtitle.jape

• final.jape

• unknown.jape

• name_context.jape

• org_context.jape

• loc_context.jape

• clean.jape

C.2 first.jape

This grammar must always be processed first. It can contain any general macros needed for the whole grammar set. This should consist of a macro defining how space and control characters are to be processed (and may consequently be different for each grammar set, depending on the text type). Because this is defined first of all, it is not necessary to restate this in later grammars. This has a big advantage – it means that default grammars can be used for specialised grammar sets, without having to be adapted to deal with e.g. different treatment of spaces and control characters. In this way, only the first.jape file needs to be changed for each grammar set, rather than every individual grammar.

The first.jape grammar also contains a dummy rule. This is never intended to fire – it is simply added because every grammar set must contain rules, but there are no specific rules we wish to add here. Even if the rule were to match the pattern defined, it is designed not to produce any output (due to the empty RHS).
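A hedged sketch of what such a phase might look like (the macro and rule names are invented; the real first.jape differs in detail):

Phase: first
Input: Token SpaceToken
Options: control = appelt

// a general macro defining how space characters are to be treated,
// available to all later grammars in the set
Macro: SPACE
(
  {SpaceToken.kind == space}
)

// the dummy rule: present only because every grammar must contain at
// least one rule; the empty RHS produces no output
Rule: DummyRule
(
  {Token.string == "thisWillNeverMatch"}
)
-->
{}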

C.3 firstname.jape

This grammar contains rules to identify first names and titles via the gazetteer lists. It adds a gender feature where appropriate from the gazetteer list. This gender feature is used later in order to improve co-reference between names and pronouns. The grammar creates separate annotations of type FirstPerson and Title.

C.4 name.jape

This grammar contains initial rules for organization, location and person entities. These rules all create temporary annotations, some of which will be discarded later, but the majority of which will be converted into final annotations in later grammars. Rules beginning with “Not” are negative rules – this means that we detect something and give it a special annotation (or no annotation at all) in order to prevent it being recognised as a name. This is because we have no negative operator (we have “==” but not “!=”).
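An illustrative sketch of the style of negative rule meant here (the rule, heuristic and annotation names are invented): a first-name gazetteer match in an unlikely context is given a NotPerson annotation, so that later person rules will not touch it:

Rule: NotPerson
(
  {Token.category == "DT"}
  ({Lookup.majorType == "person_first"}):notperson
)
-->
:notperson.NotPerson = {rule = "NotPerson"}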

C.4.1 Person

We first define macros for initials, first names, surnames, and endings. We then use these to recognise combinations of first names from the previous phase, and surnames from their POS tags or case information. Persons get marked with the annotation ”TempPerson”. We also percolate feature information about the gender from the previous annotations if known.

C.4.2 Location

The rules for Location are fairly straightforward, but we define them in this grammar so that any ambiguity can be resolved at the top level. Locations are often combined with other entity types, such as Organisations. This is dealt with by annotating the two entity types separately, and then combining them in a later phase. Locations are recognised mainly by gazetteer lookup, using not only lists of known places, but also key words such as mountain, lake, river, city etc. Locations are annotated as TempLocation in this phase.

C.4.3 Organization

Organizations tend to be defined either by straight lookup from the gazetteer lists, or, for the majority, by a combination of POS or case information and key words such as “company”, “bank”, “Services”, “Ltd.” etc. Many organizations are also identified by contextual information in the later phase org_context.jape. In this phase, organizations are annotated as TempOrganization.

C.4.4 Ambiguities

Some ambiguities are resolved immediately in this grammar, while others are left until later phases. For example, a Christian name followed by a possible Location is resolved by default to a person rather than a Location (e.g. “Ken London”). On the other hand, a Christian name followed by a possible organisation ending is resolved to an Organisation (e.g. “Alexandra Pottery”), though this is a slightly less sure rule.

C.4.5 Contextual information

Although most of the rules involving contextual information are invoked in a much later phase, there are a few which are invoked here, such as “X joined Y” where X is annotated as a Person and Y as an Organization. This is so that both annotation types can be handled at once.

C.5 name_post.jape

This grammar runs after the name grammar to fix some erroneous annotations that may have been created. Of course, a more elegant solution would be not to create the problem in the first instance, but this is a workaround. For example, if the surname of a Person contains certain stop words, e.g. “Mary And”, then only the first name should be recognised as a Person. However, it might be that the first name is also an Organization (and has been tagged with TempOrganization already), e.g. “U.N.”. If this is the case, then the annotation is left untouched, because this is correct.

C.6 date_pre.jape

This grammar precedes the date phase, because it includes extra context to prevent dates being recognised erroneously in the middle of longer expressions. It mainly treats the case where an expression is already tagged as a Person, but could also be tagged as a date (e.g. 16th Jan).

C.7 date.jape

This grammar contains the base rules for recognising times and dates. Given the complexity of potential patterns representing such expressions, there are a large number of rules and macros.

Although times and dates can be mutually ambiguous, we try to distinguish between them as early as possible. Dates, times and years are generally tagged separately (as TempDate, TempTime and TempYear respectively) and then recombined to form a final Date annotation in a later phase. This is because dates, times and years can be combined together in many different ways, and also because there can be much ambiguity between the three. For example, 1312 could be a time or a year, while 9-10 could be a span of time or date, or a fixed time or date.

C.8 reldate.jape

This grammar handles relative rather than absolute date and time sequences, such as “yesterday morning”, “2 hours ago”, “the first 9 months of the financial year”, etc. It uses mainly explicit key words such as “ago” and items from the gazetteer lists.

C.9 number.jape

This grammar covers rules concerning money and percentages. The rules are fairly straightforward, using keywords from the gazetteer lists, and there is little ambiguity here, except for example where “Pound” can be money or weight, or where there is no explicit currency denominator.

C.10 address.jape

Rules for Address cover IP addresses, phone and fax numbers, and postal addresses. In general, these are not highly ambiguous, and can be covered with simple pattern matching, although phone numbers can require use of contextual information. Currently only UK formats are really handled, though handling of foreign zipcodes and phone number formats is envisaged in future. The annotations produced are of type Email, Phone etc. and are then replaced in a later phase with final Address annotations with “phone” etc. as features.

C.11 url.jape

Rules for email addresses and URLs are in a separate grammar from the other address types, for the simple reason that SpaceTokens need to be identified for these rules to operate, whereas this is not necessary for the other Address types. For speed of processing, we place them in separate grammars so that SpaceTokens can be eliminated from the Input when they are not required.

C.12 identifier.jape

This grammar identifies ”Identifiers” which basically means any combination of numbers and letters acting as an ID, reference number etc. not recognised as any other entity type.

C.13 jobtitle.jape

This grammar simply identifies Jobtitles from the gazetteer lists, and adds a JobTitle anno- tation, which is used in later phases to aid recognition of other entity types such as Person and Organization. It may then be discarded in the Clean phase if not required as a final annotation type.

C.14 final.jape

This grammar uses the temporary annotations previously assigned in the earlier phases, and converts them into final annotations. The reason for this is that we need to be able to resolve ambiguities between different entity types, so we need to have all the different entity types handled in a single grammar somewhere. Ambiguities can be resolved using prioritisation techniques. Also, we may need to combine previously annotated elements, such as dates and times, into a single entity.

The rules in this grammar use Java code on the RHS to remove the existing temporary annotations, and replace them with new annotations. This is because we want to retain the features associated with the temporary annotations. For example, we might need to keep track of whether a person is male or female, or whether a location is a city or country. It also enables us to keep track of which rules have been used, for debugging purposes.
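A hedged sketch of the kind of rule this describes (the rule is invented and far simpler than the real ones): the Java RHS copies the temporary annotation's features onto a new final annotation, records the rule name for debugging, and removes the temporary annotation:

Rule: PersonFinal
(
  {TempPerson}
):person
-->
{
  // get the annotation matched by the LHS
  gate.AnnotationSet personSet = (gate.AnnotationSet)bindings.get("person");
  gate.Annotation temp = (gate.Annotation)personSet.iterator().next();

  // keep features such as gender, and record which rule fired
  gate.FeatureMap features = temp.getFeatures();
  features.put("rule", "PersonFinal");

  // replace the temporary annotation with a final Person annotation
  outputAS.add(temp.getStartNode(), temp.getEndNode(), "Person", features);
  inputAS.remove(temp);
}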

For the sake of obfuscation, although this phase is called final, it is not the final phase!

C.15 unknown.jape

This short grammar finds proper nouns not previously recognised, and gives them an Unknown annotation. This is then used by the namematcher – if an Unknown annotation can be matched with a previously categorised entity, its annotation is changed to that of the matched entity. Any remaining Unknown annotations are useful for debugging purposes, and can also be used as input for additional grammars or processing resources.

C.16 name_context.jape

This grammar looks for Unknown annotations occurring in certain contexts which indicate they might belong to Person. This is a typical example of a grammar that would benefit from learning or automatic context generation, because useful contexts are (a) hard to find manually and may require large volumes of training data, and (b) often very domain-specific. In this core grammar, we confine the use of contexts to fairly general uses, since this grammar should not be domain-dependent.

C.17 org_context.jape

This grammar operates on a similar principle to name_context.jape. It is slightly oriented towards business texts, so does not quite fulfil the generality criteria of the previous grammar. It does, however, provide some insight into more detailed use of contexts.

C.18 loc_context.jape

This grammar also operates in a similar manner to the preceding two, using general context such as coordinated pairs of locations, and hyponymic types of information.

C.19 clean.jape

This grammar comes last of all, and simply aims to clean up (remove) some of the temporary annotations that may not have been deleted along the way.

Appendix D

Part-of-Speech Tags used in the Hepple Tagger

CC - coordinating conjunction: ”and”, ”but”, ”nor”, ”or”, ”yet”, plus, minus, less, times (multiplication), over (division). Also ”for” (because) and ”so” (i.e., ”so that”).
CD - cardinal number
DT - determiner: Articles including ”a”, ”an”, ”every”, ”no”, ”the”, ”another”, ”any”, ”some”, ”those”.
EX - existential there: Unstressed ”there” that triggers inversion of the inflected verb and the logical subject; ”There was a party in progress”.
FW - foreign word
IN - preposition or subordinating conjunction
JJ - adjective: Hyphenated compounds that are used as modifiers; happy-go-lucky.
JJR - adjective - comparative: Adjectives with the comparative ending ”-er” and a comparative meaning. Sometimes ”more” and ”less”.
JJS - adjective - superlative: Adjectives with the superlative ending ”-est” (and ”worst”). Sometimes ”most” and ”least”.
JJSS - -unknown-, but probably a variant of JJS
-LRB- - -unknown-
LS - list item marker: Numbers and letters used as identifiers of items in a list.
MD - modal: All verbs that don't take an ”-s” ending in the third person singular present: ”can”, ”could”, ”dare”, ”may”, ”might”, ”must”, ”ought”, ”shall”, ”should”, ”will”, ”would”.
NN - noun - singular or mass
NNP - proper noun - singular: All words in names usually are capitalized but titles might not be.
NNPS - proper noun - plural: All words in names usually are capitalized but titles might not be.
NNS - noun - plural
NP - proper noun - singular


NPS - proper noun, plural
PDT - predeterminer: determiner-like elements preceding an article or possessive pronoun; "all/PDT his marbles", "quite/PDT a mess".
POS - possessive ending: nouns ending in "'s" or "'".
PP - personal pronoun
PRPR$ - -unknown-, but probably possessive pronoun
PRP - -unknown-, but probably possessive pronoun
PRP$ - -unknown-, but probably possessive pronoun, such as "my", "your", "his", "her", "its", "one's", "our", and "their".
RB - adverb: most words ending in "-ly". Also "quite", "too", "very", "enough", "indeed", "not", "-n't", and "never".
RBR - adverb, comparative: adverbs ending in "-er" with a comparative meaning.
RBS - adverb, superlative
RP - particle: mostly monosyllabic words that also double as directional adverbs.
STAART - start state marker (used internally)
SYM - symbol: technical symbols or expressions that aren't English words.
TO - literal "to"
UH - interjection: such as "my", "oh", "please", "uh", "well", "yes".
VBD - verb, past tense: includes the conditional form of the verb "to be"; "If I were/VBD rich...".
VBG - verb, gerund or present participle
VBN - verb, past participle
VBP - verb, non-3rd person singular present
VB - verb, base form: subsumes imperatives, infinitives and subjunctives.
VBZ - verb, 3rd person singular present
WDT - wh-determiner
WP$ - possessive wh-pronoun: includes "whose".
WP - wh-pronoun: includes "what", "who", and "whom".
WRB - wh-adverb: includes "how", "where", "why". Includes "when" when used in a temporal sense.

:: - literal colon
, - literal comma
$ - literal dollar sign
-- - literal double-dash
" - literal double quotes
` - literal grave
( - literal left parenthesis
. - literal period
# - literal pound sign
) - literal right parenthesis
' - literal single quote or apostrophe

Appendix E

Sample ML Configuration File

Instance type: Token

Attribute Lookup(0): type Lookup, position 0.

Attribute Lookup_MT(-1): type Lookup, feature majorType, position -1. Values:

address cdg country_adj currency_unit date date_key date_unit facility
facility_key facility_key_ext govern_key greeting ident_key jobtitle
loc_general_key loc_key location number org_base org_ending org_key
org_pre organization organization_noun person_ending person_first
person_full phone_prefix sport spur spur_ident stop surname time
time_modifier time_unit title year

Attribute Lookup_MT(0): type Lookup, feature majorType, position 0. Values: the same list of majorType values as Lookup_MT(-1).

Attribute Lookup_MT(1): type Lookup, feature majorType, position 1. Values: the same list of majorType values as Lookup_MT(-1).

Attribute POS_category(-1): type Token, feature category, position -1. Values:

NN NNP NNPS NNS NP NPS JJ JJR JJS JJSS RB RBR RBS VB VBD VBG VBN VBP VBZ
FW CD CC DT EX IN LS MD PDT POS PP PRP PRP$ PRPR$ RP TO UH WDT WP WP$ WRB
SYM \" # $ ( ) , -- -LRB- . '' : ::

Attribute POS_category(0): type Token, feature category, position 0. Values: the same list of category values as POS_category(-1).

Attribute POS_category(1): type Token, feature category, position 1. Values: the same list of category values as POS_category(-1).


Attribute Entity(0): type Entity, position 0.

Engine: wrapper gate.creole.ml.weka.Wrapper; classifier weka.classifiers.trees.J48; confidence threshold 0.85.
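Expressed in configuration XML, the definitions above take roughly the following form. This is a minimal sketch, assuming the ML-CONFIG/DATASET/ATTRIBUTE/ENGINE element names of the gate.creole.ml configuration format and abridging the value lists; the marking of Entity(0) as the class attribute is also an assumption:

<?xml version="1.0"?>
<ML-CONFIG>
  <DATASET>
    <INSTANCE-TYPE>Token</INSTANCE-TYPE>
    <ATTRIBUTE>
      <NAME>Lookup_MT(-1)</NAME>
      <TYPE>Lookup</TYPE>
      <FEATURE>majorType</FEATURE>
      <POSITION>-1</POSITION>
      <VALUES>
        <VALUE>address</VALUE>
        <VALUE>cdg</VALUE>
        <!-- ...the remaining majorType values listed above... -->
      </VALUES>
    </ATTRIBUTE>
    <!-- ...one ATTRIBUTE element for each of the other attributes... -->
    <ATTRIBUTE>
      <NAME>Entity(0)</NAME>
      <TYPE>Entity</TYPE>
      <POSITION>0</POSITION>
      <CLASS/> <!-- assumed: Entity is the class attribute to be learned -->
    </ATTRIBUTE>
  </DATASET>
  <ENGINE>
    <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
    <OPTIONS>
      <CLASSIFIER>weka.classifiers.trees.J48</CLASSIFIER>
      <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
    </OPTIONS>
  </ENGINE>
</ML-CONFIG>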

References

[Appelt 99] D. Appelt. An Introduction to Information Extraction. Artificial Intelligence Communications, 12(3):161–172, 1999.

[Aswani et al. 05] N. Aswani, V. Tablan, K. Bontcheva, and H. Cunningham. Indexing and Querying Linguistic Metadata and Document Content. In Proceedings of Fifth International Conference on Recent Advances in Natural Language Processing (RANLP2005), Borovets, Bulgaria, 2005.

[Aswani et al. 06] N. Aswani, K. Bontcheva, and H. Cunningham. Mining information for instance unification. In 5th International Semantic Web Conference (ISWC2006), Athens, Georgia, 2006.

[Azar 89] S. Azar. Understanding and Using English Grammar. Prentice Hall Regents, 1989.

[Baker et al. 02] P. Baker, A. Hardie, T. McEnery, H. Cunningham, and R. Gaizauskas. EMILLE, A 67-Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation. In Proceedings of 3rd Language Resources and Evaluation Conference (LREC'2002), pages 819–825, 2002.

[Bird & Liberman 99] S. Bird and M. Liberman. A Formal Framework for Linguistic Annotation. Technical Report MS-CIS-99-01, Department of Computer and Information Science, University of Pennsylvania, 1999. http://xxx.lanl.gov/abs/cs.CL/9903003.

[Bontcheva 04] K. Bontcheva. Open-source Tools for Creation, Maintenance, and Storage of Lexical Resources for Language Generation from Ontologies. In Proceedings of 4th Language Resources and Evaluation Conference (LREC’04), 2004.

[Bontcheva et al. 00] K. Bontcheva, H. Brugman, A. Russel, P. Wittenburg, and H. Cunningham. An Experiment in Unifying Audio-Visual and Textual Infrastructures for Language Processing R&D. In Proceedings of the Workshop on Using Toolsets and Architectures To Build NLP Systems at COLING-2000, Luxembourg, 2000. http://gate.ac.uk/.

[Bontcheva et al. 02a] K. Bontcheva, H. Cunningham, V. Tablan, D. Maynard, and O. Hamza. Using GATE as an Environment for Teaching NLP. In Proceedings of the ACL Workshop on Effective Tools and Methodologies in Teaching NLP, 2002. http://gate.ac.uk/sale/acl02/gate4teaching.pdf.

[Bontcheva et al. 02b] K. Bontcheva, H. Cunningham, V. Tablan, D. Maynard, and H. Saggion. Developing Reusable and Robust Language Processing Components for Information Systems using GATE. In Proceedings of the 3rd International Workshop on Natural Language and Information Systems (NLIS'2002), Aix-en-Provence, France, 2002. IEEE Computer Society Press. http://gate.ac.uk/sale/nlis/nlis.ps.

[Bontcheva et al. 02c] K. Bontcheva, M. Dimitrov, D. Maynard, V. Tablan, and H. Cunningham. Shallow Methods for Named Entity Coreference Resolution. In Chaînes de références et résolveurs d'anaphores, workshop TALN 2002, Nancy, France, 2002. http://gate.ac.uk/sale/taln02/taln-ws-coref.pdf.

[Bontcheva et al. 03] K. Bontcheva, A. Kiryakov, H. Cunningham, B. Popov, and M. Dimitrov. Semantic web enabled, open source language technology. In EACL workshop on Language Technology and the Semantic Web: NLP and XML, Budapest, Hungary, 2003. http://gate.ac.uk/sale/eacl03-semweb/bontcheva-etal-final.pdf.

[Bontcheva et al. 04] K. Bontcheva, V. Tablan, D. Maynard, and H. Cunningham. Evolving GATE to Meet New Challenges in Language Engineering. Natural Language Engineering, 10(3/4):349–373, 2004.

[Booch 94] G. Booch. Object-Oriented Analysis and Design 2nd Edn. Benjamin/Cummings, 1994.

[Brugman et al. 99] H. Brugman, K. Bontcheva, P. Wittenburg, and H. Cunningham. Integrating Multimedia and Textual Software Architectures for Language Technology. Technical report MPI-TG-99-1, Max-Planck Institute for Psycholinguistics, Nijmegen, Netherlands, 1999.

[Campione et al. 98] M. Campione, K. Walrath, A. Huml, and the Tutorial Team. The Java Tutorial Continued: The Rest of the JDK. Addison-Wesley, Reading, MA, 1998.

[CC001] LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[Chinchor 92] N. Chinchor. MUC-4 evaluation metrics. In Proceedings of the Fourth Message Understanding Conference, pages 22–29, 1992.

[Cobuild 99] C. Cobuild, editor. English Grammar. Harper Collins, 1999.

[Cowie & Lehnert 96] J. Cowie and W. Lehnert. Information Extraction. Communications of the ACM, 39(1):80–91, 1996.

[Cunningham & Bontcheva 05] H. Cunningham and K. Bontcheva. Computational Language Systems, Architectures. Encyclopedia of Language and Linguistics, 2nd Edition, 2005.

[Cunningham & Scott 04a] H. Cunningham and D. Scott. Introduction to the Special Issue on Software Architecture for Language Engineering. Natural Language Engineering, 2004. http://gate.ac.uk/sale/jnle-sale/intro/intro-main.pdf.

[Cunningham & Scott 04b] H. Cunningham and D. Scott, editors. Special Issue of Natural Language Engineering on Software Architecture for Language Engineering. Cambridge University Press, 2004.

[Cunningham 94] H. Cunningham. Support Software for Language Engineering Research. Technical Report 94/05, Centre for Computational Linguistics, UMIST, Manchester, 1994.

[Cunningham 99a] H. Cunningham. A Definition and Short History of Language Engineering. Journal of Natural Language Engineering, 5(1):1–16, 1999.

[Cunningham 99b] H. Cunningham. Information Extraction: a User Guide (revised version). Research Memorandum CS–99–07, Department of Computer Science, University of Sheffield, May 1999.

[Cunningham 99c] H. Cunningham. JAPE: a Java Annotation Patterns Engine. Research Memorandum CS–99–06, Department of Computer Science, University of Sheffield, May 1999.

[Cunningham 00] H. Cunningham. Software Architecture for Language Engineering. Unpublished PhD thesis, University of Sheffield, 2000. http://gate.ac.uk/sale/thesis/.

[Cunningham 02] H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, 36:223–254, 2002.

[Cunningham 05] H. Cunningham. Information Extraction, Automatic. Encyclopedia of Language and Linguistics, 2nd Edition, 2005.

[Cunningham et al. 94] H. Cunningham, M. Freeman, and W. Black. Software Reuse, Object-Oriented Frameworks and Natural Language Processing. In New Methods in Language Processing (NeMLaP-1), September 1994, Manchester, 1994. (Re-published in book form 1997 by UCL Press).

[Cunningham et al. 95] H. Cunningham, R. Gaizauskas, and Y. Wilks. A General Architecture for Text Engineering (GATE) – a new approach to Language Engineering R&D. Technical Report CS–95–21, Department of Computer Science, University of Sheffield, 1995. http://xxx.lanl.gov/abs/cs.CL/9601009.

[Cunningham et al. 96a] H. Cunningham, K. Humphreys, R. Gaizauskas, and M. Stower. CREOLE Developer's Manual. Technical report, Department of Computer Science, University of Sheffield, 1996. http://www.dcs.shef.ac.uk/nlp/gate.

[Cunningham et al. 96b] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. TIPSTER-Compatible Projects at Sheffield. In Advances in Text Processing, TIPSTER Program Phase II. DARPA, Morgan Kaufmann, California, 1996.

[Cunningham et al. 96c] H. Cunningham, Y. Wilks, and R. Gaizauskas. GATE – a General Architecture for Text Engineering. In Proceedings of the 16th Conference on Computational Linguistics (COLING-96), Copenhagen, August 1996. ftp://ftp.dcs.shef.ac.uk/home/hamish/auto_papers/Cun96b.ps.

[Cunningham et al. 96d] H. Cunningham, Y. Wilks, and R. Gaizauskas. Software Infrastructure for Language Engineering. In Proceedings of the AISB Workshop on Language Engineering for Document Analysis and Recognition, Brighton, U.K., April 1996.

[Cunningham et al. 96e] H. Cunningham, Y. Wilks, and R. Gaizauskas. New Methods, Current Trends and Software Infrastructure for NLP. In Proceedings of the Conference on New Methods in Natural Language Processing (NeMLaP-2), Bilkent University, Turkey, September 1996. ftp://ftp.dcs.shef.ac.uk/home/hamish/auto_papers/Cun96c.ps.

[Cunningham et al. 97a] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. GATE – a TIPSTER-based General Architecture for Text Engineering. In Proceedings of the TIPSTER Text Program (Phase III) 6 Month Workshop. DARPA, Morgan Kaufmann, California, May 1997. ftp://ftp.dcs.shef.ac.uk/home/hamish/auto_papers/Cun97e.ps.

[Cunningham et al. 97b] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. Software Infrastructure for Natural Language Processing. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), March 1997. ftp://ftp.dcs.shef.ac.uk/home/hamish/auto_papers/Cun97a.ps.gz.

[Cunningham et al. 98a] H. Cunningham, W. Peters, C. McCauley, K. Bontcheva, and Y. Wilks. A Level Playing Field for Language Resource Evaluation. In Workshop on Distributing and Accessing Lexical Resources at Conference on Language Resources Evaluation, Granada, Spain, 1998. http://www.dcs.shef.ac.uk/~hamish/dalr.

[Cunningham et al. 98b] H. Cunningham, M. Stevenson, and Y. Wilks. Implementing a Sense Tagger within a General Architecture for Language Engineering. In Proceedings of the Third Conference on New Methods in Language Engineering (NeMLaP-3), pages 59–72, Sydney, Australia, 1998.

[Cunningham et al. 99] H. Cunningham, R. Gaizauskas, K. Humphreys, and Y. Wilks. Experience with a Language Engineering Architecture: Three Years of GATE. In Proceedings of the AISB'99 Workshop on Reference Architectures and Data Standards for NLP, Edinburgh, April 1999. The Society for the Study of Artificial Intelligence and Simulation of Behaviour. http://www.dcs.shef.ac.uk/~hamish/GateAisb99.html.

[Cunningham et al. 00a] H. Cunningham, K. Bontcheva, W. Peters, and Y. Wilks. Uniform language resource access and distribution in the context of a General Architecture for Text Engineering (GATE). In Proceedings of the Workshop on Ontologies and Language Resources (OntoLex'2000), Sozopol, Bulgaria, September 2000. http://gate.ac.uk/sale/ontolex/ontolex.ps.

[Cunningham et al. 00b] H. Cunningham, K. Bontcheva, V. Tablan, and Y. Wilks. Software Infrastructure for Language Resources: a Taxonomy of Previous Work and a Requirements Analysis. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-2), Athens, 2000. http://gate.ac.uk/.

[Cunningham et al. 00c] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, and Y. Wilks. Experience of using GATE for NLP R&D. In Proceedings of the Workshop on Using Toolsets and Architectures To Build NLP Systems at COLING-2000, Luxembourg, 2000. http://gate.ac.uk/.

[Cunningham et al. 00d] H. Cunningham, D. Maynard, and V. Tablan. JAPE: a Java Annotation Patterns Engine (Second Edition). Research Memorandum CS–00–10, Department of Computer Science, University of Sheffield, November 2000.

[Cunningham et al. 02] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), 2002.

[Cunningham et al. 03] H. Cunningham, V. Tablan, K. Bontcheva, and M. Dimitrov. Language Engineering Tools for Collaborative Corpus Annotation. In Proceedings of Corpus Linguistics 2003, Lancaster, UK, 2003. http://gate.ac.uk/sale/cl03/distrib-ollie-cl03.doc.

[Dean et al. 04] M. Dean, G. Schreiber, S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider, and L. A. Stein. OWL web ontology language reference. W3C recommendation, W3C, Feb 2004. http://www.w3.org/TR/owl-ref/.

[Dimitrov 02a] M. Dimitrov. A Light-weight Approach to Coreference Resolution for Named Entities in Text. MSc Thesis, University of Sofia, Bulgaria, 2002. http://www.ontotext.com/ie/thesis-m.pdf.


[Dimitrov et al. 02] M. Dimitrov, K. Bontcheva, H. Cunningham, and D. Maynard. A Light-weight Approach to Coreference Resolution for Named Entities in Text. In Proceedings of the Fourth Discourse Anaphora and Anaphor Resolution Colloquium (DAARC), Lisbon, 2002.

[Dimitrov et al. 05] M. Dimitrov, K. Bontcheva, H. Cunningham, and D. Maynard. A Light-weight Approach to Coreference Resolution for Named Entities in Text. In A. Branco, T. McEnery, and R. Mitkov, editors, Anaphora Processing: Linguistic, Cognitive and Computational Modelling. John Benjamins, 2005.

[Dowman et al. 05a] M. Dowman, V. Tablan, H. Cunningham, and B. Popov. Content augmentation for mixed-mode news broadcasts. In Proceedings of the 3rd European Conference on Interactive Television: User Centred ITV Systems, Programmes and Applications, Aalborg University, Denmark, 2005. http://gate.ac.uk/sale/euro-itv-2005/content-augmentation-for-mixed-mode-news-broadcast-consumption.pdf.

[Dowman et al. 05b] M. Dowman, V. Tablan, H. Cunningham, and B. Popov. Web-assisted annotation, semantic indexing and search of television and radio news. In Proceedings of the 14th International World Wide Web Conference, Chiba, Japan, 2005. http://gate.ac.uk/sale/www05/web-assisted-annotation.pdf.

[Dowman et al. 05c] M. Dowman, V. Tablan, H. Cunningham, C. Ursu, and B. Popov. Semantically enhanced television news through web and video integration. In Second European Semantic Web Conference (ESWC’2005), 2005.

[Eugenio & Glass 04] B. D. Eugenio and M. Glass. The kappa statistic: a second look. Computational Linguistics, 30(1), 2004. (squib).

[Frakes & Baeza-Yates 92] W. Frakes and R. Baeza-Yates, editors. Information retrieval, data structures and algorithms. Prentice Hall, New York, Englewood Cliffs, N.J., 1992.

[Gaizauskas & Wilks 98] R. Gaizauskas and Y. Wilks. Information Extraction: Beyond Document Retrieval. Journal of Documentation, 54(1):70–105, 1998.

[Gaizauskas et al. 96a] R. Gaizauskas, P. Rodgers, H. Cunningham, and K. Humphreys. GATE User Guide. http://www.dcs.shef.ac.uk/nlp/gate, 1996.

[Gaizauskas et al. 96b] R. Gaizauskas, H. Cunningham, Y. Wilks, P. Rodgers, and K. Humphreys. GATE – an Environment to Support Research and Development in Natural Language Engineering. In Proceedings of the 8th IEEE International Conference on Tools with Artificial Intelligence (ICTAI-96), Toulouse, France, October 1996. ftp://ftp.dcs.shef.ac.uk/home/robertg/ictai96.ps.

[Gambäck & Olsson 00] B. Gambäck and F. Olsson. Experiences of Language Engineering Algorithm Reuse. In Second International Conference on Language Resources and Evaluation (LREC), pages 155–160, Athens, Greece, 2000. ftp://ftp.dcs.shef.ac.uk/home/hamish/gamback-lrec2000.ps.

[Gazdar & Mellish 89] G. Gazdar and C. Mellish. Natural Language Processing in Prolog. Addison-Wesley, Reading, MA, 1989.

[Grishman 97] R. Grishman. TIPSTER Architecture Design Document Version 2.3. Technical report, DARPA, 1997. http://www.itl.nist.gov/div894/894.02/related_projects/tipster/.

[Hepple 00] M. Hepple. Independence and commitment: Assumptions for rapid training and execution of rule-based POS taggers. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), Hong Kong, October 2000.

[Horrocks & vanHarmelen 01] I. Horrocks and F. van Harmelen. Reference Description of the DAML+OIL (March 2001) Ontology Markup Language. Technical report, 2001. http://www.daml.org/2001/03/reference.html.

[Humphreys et al. 96] K. Humphreys, R. Gaizauskas, H. Cunningham, and S. Azzam. CREOLE Module Specifications. http://www.dcs.shef.ac.uk/nlp/gate/, 1996.

[Jackson 75] M. Jackson. Principles of Program Design. Academic Press, London, 1975.

[Kiryakov 03] A. Kiryakov. Ontology and Reasoning in MUMIS: Towards the Semantic Web. Tech- nical Report CS–03–03, Department of Computer Science, University of Sheffield, 2003. http://gate.ac.uk/gate/doc/papers.html.

[Lal & Ruger 02] P. Lal and S. Ruger. Extract-based summarization with simplification. In Proceedings of the ACL 2002 Automatic Summarization / DUC 2002 Workshop, 2002. http://www.doc.ic.ac.uk/~srueger/pr-p.lal-2002/duc02-final.pdf.

[Lal 02] P. Lal. Text summarisation. Unpublished M.Sc. thesis, Imperial College, London, 2002.

[Lassila & Swick 99] O. Lassila and R. Swick. Resource Description Framework (RDF) Model and Syntax Specification. Technical Report 19990222, W3C Consortium, http://www.w3.org/TR/REC-rdf-syntax/, 1999.

[Li & Shawe-Taylor 03] Y. Li and J. Shawe-Taylor. The SVM with Uneven Margins and Chinese Document Categorization. In Proceedings of The 17th Pacific Asia Conference on Language, Information and Computation (PACLIC17), Singapore, Oct. 2003.

[Li et al. 04] Y. Li, K. Bontcheva, and H. Cunningham. An SVM Based Learning Algorithm for Information Extraction. Machine Learning Workshop, Sheffield, 2004. http://gate.ac.uk/sale/ml-ws04/mlw2004.pdf.

[Li et al. 05a] Y. Li, K. Bontcheva, and H. Cunningham. SVM Based Learning System For Information Extraction. In M. N. J. Winkler and N. Lawrence, editors, Deterministic and Statistical Methods in Machine Learning, LNAI 3635, pages 319–339. Springer Verlag, 2005.

[Li et al. 05b] Y. Li, K. Bontcheva, and H. Cunningham. Using Uneven Margins SVM and Perceptron for Information Extraction. In Proceedings of Ninth Conference on Computational Natural Language Learning (CoNLL-2005), 2005.

[Li et al. 05c] Y. Li, C. Miao, K. Bontcheva, and H. Cunningham. Perceptron Learning for Chinese Word Segmentation. In Proceedings of Fourth SIGHAN Workshop on Chinese Language processing (Sighan-05), pages 154–157, Korea, 2005.

[Li et al. 07a] Y. Li, K. Bontcheva, and H. Cunningham. Experiments of Opinion Analysis on the Corpora MPQA and NTCIR-6. In Proceedings of the Sixth NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, pages 323–329, 2007.

[Li et al. 07b] Y. Li, K. Bontcheva, and H. Cunningham. SVM Based Learning System for F-term Patent Classification. In Proceedings of the Sixth NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, pages 396–402, 2007.

[LREC-1 98] Conference on Language Resources Evaluation (LREC-1), Granada, Spain, 1998.

[LREC-2 00] Second Conference on Language Resources Evaluation (LREC-2), Athens, 2000.

[Manning & Schütze 99] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT press, Cambridge, MA, 1999. Supporting materials available at http://www.sultry.arts.usyd.edu.au/fsnlp/.

[Manov et al. 03] D. Manov, A. Kiryakov, B. Popov, K. Bontcheva, and D. Maynard. Experiments with geographic knowledge for information extraction. In Workshop on Analysis of Geographic References, HLT/NAACL'03, Edmonton, Canada, 2003. http://gate.ac.uk/sale/hlt03/paper03.pdf.

[Maynard et al. ] D. Maynard, K. Bontcheva, and H. Cunningham. From information extraction to content extraction. Submitted to EACL’2003.

[Maynard et al. 00] D. Maynard, H. Cunningham, K. Bontcheva, R. Catizone, G. Demetriou, R. Gaizauskas, O. Hamza, M. Hepple, P. Herring, B. Mitchell, M. Oakes, W. Peters, A. Setzer, M. Stevenson, V. Tablan, C. Ursu, and Y. Wilks. A Survey of Uses of GATE. Technical Report CS–00–06, Department of Computer Science, University of Sheffield, 2000.

[Maynard et al. 01] D. Maynard, V. Tablan, C. Ursu, H. Cunningham, and Y. Wilks. Named Entity Recognition from Diverse Text Types. In Recent Advances in Natural Language Processing 2001 Conference, pages 257–274, Tzigov Chark, Bulgaria, 2001.

[Maynard et al. 02a] D. Maynard, K. Bontcheva, H. Saggion, H. Cunningham, and O. Hamza. Using a Text Engineering Framework to Build an Extendable and Portable IE-based Summarisation System. In Proceedings of the ACL Workshop on Text Summarisation, 2002.

[Maynard et al. 02b] D. Maynard, H. Cunningham, K. Bontcheva, and M. Dimitrov. Adapting A Robust Multi-Genre NE System for Automatic Content Extraction. In Proceedings of the Tenth International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2002), 2002.

[Maynard et al. 02c] D. Maynard, H. Cunningham, and R. Gaizauskas. Named entity recognition at Sheffield University. In H. Holmboe, editor, Nordic Language Technology – Årbog for Nordisk Sprogteknologisk Forskningsprogram 2002-2004, pages 141–145. Museum Tusculanums Forlag, 2002.

[Maynard et al. 02d] D. Maynard, V. Tablan, H. Cunningham, C. Ursu, H. Saggion, K. Bontcheva, and Y. Wilks. Architectural Elements of Language Engineering Robustness. Journal of Natural Language Engineering – Special Issue on Robust Methods in Analysis of Natural Language Data, 8(2/3):257–274, 2002. References 362

[Maynard et al. 03a] D. Maynard, K. Bontcheva, and H. Cunningham. Towards a semantic extraction of Named Entities. In Recent Advances in Natural Language Processing, Bulgaria, 2003.

[Maynard et al. 03b] D. Maynard, V. Tablan, and H. Cunningham. NE recognition without training data on a language you don't speak. In ACL Workshop on Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models, Sapporo, Japan, 2003.

[Maynard et al. 04a] D. Maynard, K. Bontcheva, and H. Cunningham. Automatic Language-Independent Induction of Gazetteer Lists. In Proceedings of 4th Language Resources and Evaluation Conference (LREC'04), 2004.

[Maynard et al. 04b] D. Maynard, H. Cunningham, A. Kourakis, and A. Kokossis. Ontology-Based Information Extraction in hTechSight. In First European Semantic Web Symposium (ESWS 2004), Heraklion, Crete, 2004.

[Maynard et al. 04c] D. Maynard, M. Yankova, N. Aswani, and H. Cunningham. Automatic Creation and Monitoring of Semantic Metadata in a Dynamic Knowledge Portal. In Proceedings of the 11th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2004), Varna, Bulgaria, 2004.

[McEnery et al. 00] A. McEnery, P. Baker, R. Gaizauskas, and H. Cunningham. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32, 2000.

[Pastra et al. 02] K. Pastra, D. Maynard, H. Cunningham, O. Hamza, and Y. Wilks. How feasible is the reuse of grammars for named entity recognition? In Proceedings of the 3rd Language Resources and Evaluation Conference, 2002. http://gate.ac.uk/sale/lrec2002/reusability.ps.

[Peters et al. 98] W. Peters, H. Cunningham, C. McCauley, K. Bontcheva, and Y. Wilks. Uniform Language Resource Access and Distribution. In Workshop on Distributing and Accessing Lexical Resources at Conference on Language Resources Evaluation, Granada, Spain, 1998.

[Polajnar et al. 05] T. Polajnar, V. Tablan, and H. Cunningham. User-friendly ontology authoring using a controlled language. Technical Report CS-05-10, University of Sheffield, Sheffield, UK, 2005.

[Porter 80] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[Ramshaw & Marcus 95] L. Ramshaw and M. Marcus. Text Chunking Using Transformation-Based Learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, 1995.

[Saggion et al. 02a] H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, C. Ursu, O. Hamza, and Y. Wilks. Access to Multimedia Information through Multisource and Multilanguage Information Extraction. In Proceedings of the 7th Workshop on Applications of Natural Language to Information Systems (NLDB 2002), Stockholm, Sweden, 2002.

[Saggion et al. 02b] H. Saggion, H. Cunningham, D. Maynard, K. Bontcheva, O. Hamza, C. Ursu, and Y. Wilks. Extracting Information for Information Indexing of Multimedia Material. In Proceedings of 3rd Language Resources and Evaluation Conference (LREC'2002), 2002. http://gate.ac.uk/sale/lrec2002/mumis_lrec2002.ps.

[Saggion et al. 03a] H. Saggion, K. Bontcheva, and H. Cunningham. Robust Generic and Query-based Summarisation. In Proceedings of the European Chapter of Computational Linguistics (EACL), Research Notes and Demos, 2003.

[Saggion et al. 03b] H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O. Hamza, and Y. Wilks. Multimedia Indexing through Multisource and Multilingual Information Extraction; the MUMIS project. Data and Knowledge Engineering, 48:247–264, 2003.

[Saggion et al. 03c] H. Saggion, J. Kuper, H. Cunningham, T. Declerck, P. Wittenburg, M. Puts, F. de Jong, and Y. Wilks. Event-coreference across Multiple, Multi-lingual Sources in the Mumis Project. In Proceedings of the European Chapter of Computational Linguistics (EACL), Research Notes and Demos, 2003.

[Shaw & Garlan 96] M. Shaw and D. Garlan. Software Architecture. Prentice Hall, New York, 1996.

[Stevenson et al. 98] M. Stevenson, H. Cunningham, and Y. Wilks. Sense tagging and language engineering. In Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98), pages 185–189, Brighton, U.K., 1998.

[Tablan et al. 02] V. Tablan, C. Ursu, K. Bontcheva, H. Cunningham, D. Maynard, O. Hamza, T. McEnery, P. Baker, and M. Leisher. A Unicode-based Environment for Creation and Use of Language Resources. In 3rd Language Resources and Evaluation Conference, 2002. http://gate.ac.uk/sale/iesl03/iesl03.pdf.

[Tablan et al. 03] V. Tablan, K. Bontcheva, D. Maynard, and H. Cunningham. OLLIE: On-Line Learning for Information Extraction. In Proceedings of the HLT-NAACL Workshop on Software Engineering and Architecture of Language Technology Systems, Edmonton, Canada, 2003. http://gate.ac.uk/sale/hlt03/ollie-sealts.pdf.

[Unicode Consortium 96] Unicode Consortium. The Unicode Standard, Version 2.0. Addison-Wesley, Reading, MA, 1996.

[Ursu et al. 05] C. Ursu, V. Tablan, H. Cunningham, and B. Popov. Digital media preservation and access through semantically enhanced web-annotation. In Proceedings of the 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (EWIMT 2005), London, UK, December 2005.

[van Rijsbergen 79] C. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.

[Wang et al. 05] T. Wang, D. Maynard, W. Peters, K. Bontcheva, and H. Cunningham. Extracting a domain ontology from linguistic resource based on relatedness measurements. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005), pages 345–351, Compiegne, France, September 2005.

[Wang et al. 06] T. Wang, Y. Li, K. Bontcheva, H. Cunningham, and J. Wang. Automatic Extraction of Hierarchical Relations from Text. In Proceedings of the Third European Semantic Web Conference (ESWC 2006), Budva, Montenegro, 2006.

[Witten & Frank 99] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.

[Wood et al. 03] M. M. Wood, S. J. Lydon, V. Tablan, D. Maynard, and H. Cunningham. Using parallel texts to improve recall in IE. In Recent Advances in Natural Language Processing, Bulgaria, 2003.

[Wood et al. 04] M. Wood, S. Lydon, V. Tablan, D. Maynard, and H. Cunningham. Populating a Database from Parallel Texts using Ontology-based Information Extraction. In Proceedings of NLDB 2004, 2004. http://gate.ac.uk/sale/nldb2004/NLDB.pdf.

[Yourdon 89] E. Yourdon. Modern Structured Analysis. Prentice Hall, New York, 1989.

[Yourdon 96] E. Yourdon. The Rise and Resurrection of the American Programmer. Prentice Hall, New York, 1996.

Colophon

Formal semantics (henceforth FS), at least as it relates to computational language understanding, is in one way rather like connectionism, though without the crucial prop Sejnowski's work (1986) is widely believed to give to the latter: both are old doctrines returned, like the Bourbons, having learned nothing and forgotten nothing. But FS has nothing to show as a showpiece of success after all the intellectual groaning and effort.

On Keeping Logic in its Place (in Theoretical Issues in Natural Language Processing, ed. Wilks), Yorick Wilks, 1989 (p.130).

We wanted to be modern, we wanted to make the XML people feel like progress is indeed happening, we wanted to update our CVs with the latest trick.... So we looked into using XML as source for this document, and using something like DocBook to translate it into the PDF and HTML versions that we wanted to provide for printing and web viewing. Nice ideas, but our conclusion was that they're not really ready right now. So in the end it was good old LaTeX and TeX4HT for the HTML production. Thank you Don Knuth, Leslie Lamport and Eitan Gurari.
