A User Guide)
Total Page:16
File Type:pdf, Size:1020Kb
Developing Language Processing Components with GATE Version 5 (a User Guide) For GATE version 5.1-beta1 (built October 30, 2009) Hamish Cunningham Diana Maynard Kalina Bontcheva Valentin Tablan Marin Dimitrov Mike Dowman Niraj Aswani Ian Roberts Yaoyong Li Adam Funk Genevieve Gorrell Johann Petrak Horacio Saggion Danica Damljanovic Angus Roberts ©The University of Sheffield 2001-2009 http://gate.ac.uk/ HTML version: http://gate.ac.uk/userguide Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT) and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), Matrixware, the Information Retrieval Facility and several EU-funded projects (SEKT, TAO, NeOn, MediaCampaign, MUSING, KnowledgeWeb, PrestoSpace, h-TechSight, enIRaF). Brief Contents I GATE Basics 2 1 Introduction 3 2 Installing and Running GATE 25 3 Using GATE Developer 31 4 CREOLE: the GATE Component Model 58 5 Language Resources: Corpora, Documents and Annotations 76 6 ANNIE: a Nearly-New Information Extraction System 97 II GATE for Advanced Users 115 7 GATE Embedded 116 8 JAPE: Regular Expressions over Annotations 147 9 ANNIC: ANNotations-In-Context 180 10 Performance Evaluation of Language Analysers 189 11 Profiling Processing Resources 210 12 Developing GATE 217 III CREOLE Plugins 228 13 Gazetteers 229 14 Working with Ontologies 247 15 Machine Learning 281 16 Tools for Alignment Tasks 329 17 Parsers and Taggers 342 18 Combining GATE and UIMA 368 i Contents ii 19 More (CREOLE) Plugins 379 Appendices 415 A Change Log 415 B Version 5.1 Plugins Name Map 435 C Design Notes 437 D JAPE: Implementation 444 E Ant Tasks for GATE 453 F Named-Entity State Machine Patterns 460 G Part-of-Speech Tags used in the Hepple Tagger 468 References 469 Contents I GATE Basics 2 1 Introduction 3 1.1 How to Use this Text . .6 1.2 Context . .6 1.3 Overview . .7 1.3.1 Developing and Deploying Language Processing Facilities . .7 1.3.2 Built-In Components . .9 1.3.3 Additional Facilities . .9 1.3.4 An Example . 10 1.4 Some Evaluations . 12 1.5 Changes in this Version . 13 1.5.1 Version 5.1 beta 1 (Autumn 2009) . 13 1.5.2 July 2009 (FIG'09 Summer School) . 14 1.6 Further Reading . 16 2 Installing and Running GATE 25 2.1 Downloading GATE . 25 2.2 Installing and Running GATE . 25 2.2.1 The Easy Way . 25 2.2.2 The Hard Way (1) . 26 2.2.3 The Hard Way (2): Subversion . 27 2.3 Using System Properties with GATE . 27 2.4 Configuring GATE . 28 2.5 Building GATE . 29 2.6 Troubleshooting . 30 3 Using GATE Developer 31 3.1 The GATE Developer Main Window . 32 3.2 Loading and Viewing Documents . 34 3.3 Creating and Viewing Corpora . 35 3.4 Working with Annotations . 38 3.4.1 The Annotation Sets View . 38 3.4.2 The Annotations List View . 39 3.4.3 The Annotations Stack View . 39 iii Contents iv 3.4.4 The Co-reference Editor . 40 3.4.5 Creating and Editing Annotations . 41 3.4.6 Schema-Driven Editing . 44 3.5 Using CREOLE Plugins . 45 3.6 Loading and Using Processing Resources . 47 3.7 Creating and Running an Application . 47 3.7.1 Running PRs Conditionally on Document Features . 48 3.7.2 Doing Information Extraction with ANNIE . 49 3.7.3 Modifying ANNIE . 50 3.8 Saving Applications and Language Resources . 50 3.8.1 Saving Documents to File . 50 3.8.2 Saving and Restoring LRs in Data Stores . 51 3.8.3 Saving Resource Parameter States to File . 52 3.8.4 Saving an Application with its Resources (e.g. GATE Teamware) . 53 3.9 Keyboard Shortcuts . 54 3.10 Miscellaneous . 56 3.10.1 Stopping GATE from Restoring Developer Sessions/Options . 56 3.10.2 Working with Unicode . 56 3.10.3 Using GATE with Maven or JPF . 57 4 CREOLE: the GATE Component Model 58 4.1 The Web and CREOLE . 59 4.2 The GATE Framework . 60 4.3 The Lifecycle of a CREOLE Resource . 60 4.4 Processing Resources and Applications . 61 4.5 Language Resources and Datastores . 62 4.6 Built-in CREOLE Resources . 63 4.7 CREOLE Resource Configuration . 63 4.7.1 Configuration with XML . 63 4.7.2 Configuring Resources using Annotations . 69 4.7.3 Mixing the Configuration Styles . 73 5 Language Resources: Corpora, Documents and Annotations 76 5.1 Features: Simple Attribute/Value Data . 76 5.2 Corpora: Sets of Documents plus Features . 77 5.3 Documents: Content plus Annotations plus Features . 77 5.4 Annotations: Directed Acyclic Graphs . 77 5.4.1 Annotation Schemas . 77 5.4.2 Examples of Annotated Documents . 79 5.4.3 Creating, Viewing and Editing Diverse Annotation Types . 82 5.5 Document Formats . 82 5.5.1 Detecting the Right Reader . 84 5.5.2 XML . 85 5.5.3 HTML . 91 Contents v 5.5.4 SGML . 92 5.5.5 Plain text . 93 5.5.6 RTF . 94 5.5.7 Email . 94 5.6 XML Input/Output . 96 6 ANNIE: a Nearly-New Information Extraction System 97 6.1 Document Reset . 98 6.2 Tokeniser . 99 6.2.1 Tokeniser Rules . 99 6.2.2 Token Types . 99 6.2.3 English Tokeniser . 101 6.3 Gazetteer . 101 6.4 Sentence Splitter . 103 6.5 RegEx Sentence Splitter . 103 6.6 Part of Speech Tagger . 105 6.7 Semantic Tagger . 106 6.8 Orthographic Coreference (OrthoMatcher) . 106 6.8.1 GATE Interface . 106 6.8.2 Resources . 106 6.8.3 Processing . 106 6.9 Pronominal Coreference . 107 6.9.1 Quoted Speech Submodule . 108 6.9.2 Pleonastic It Submodule . 108 6.9.3 Pronominal Resolution Submodule . 108 6.9.4 Detailed Description of the Algorithm . 109 6.10 A Walk-Through Example . 112 6.10.1 Step 1 - Tokenisation . 113 6.10.2 Step 2 - List Lookup . 114 6.10.3 Step 3 - Grammar Rules . 114 II GATE for Advanced Users 115 7 GATE Embedded 116 7.1 Quick Start with GATE Embedded . 116 7.2 Resource Management in GATE Embedded . 117 7.3 Using CREOLE Plugins . 120 7.4 Language Resources . 120 7.4.1 GATE Documents . 120 7.4.2 Feature Maps . 122 7.4.3 Annotation Sets . 123 7.4.4 Annotations . 126 7.4.5 GATE Corpora . ..