Interoperability and Standards
Total Page:16
File Type:pdf, Size:1020Kb
Common Language Resources and Technology Infrastructure Interoperability and Standards 2010-12-23 Version: 1 Editors: Erhard Hinrichs, Iris Vogel www.clarin.eu Common Language Resources and Technology Infrastructure The ultimate objective of CLARIN is to create a European federation of existing digital repositories that include language-based data, to provide uniform access to the data, wherever it is, and to provide existing language and speech technology tools as web services to retrieve, manipulate, enhance, explore and exploit the data. The primary target audience is researchers in the humanities and social sciences and the aim is to cover all languages relevant for the user community. The objective of the current CLARIN Preparatory Phase Project (2008-2010) is to lay the technical, linguistic and organizational foundations, to provide and validate specifications for all aspects of the infrastructure (including standards, usage, IPR) and to secure sustainable support from the funding bodies in the (now 23) participating countries for the subsequent construction and exploitation phases beyond 2010. CLARIN-D5C-3 2 Common Language Resources and Technology Infrastructure Interoperability and Standards CLARIN-D5C-3 EC FP7 project no. 212230 Deliverable: D5.C-3 - Deadline: 31.12.2010 Responsible: Erhard Hinrichs CLARIN-D5C-3 3 Common Language Resources and Technology Infrastructure Contributing Partners: UTU, UHEL, IPPBAS, KTH, IMCS, IPIPAN, ILC-CNR, ILSP, MPI Contributing Members: BBAW, U Hamburg, U Stuttgart, Vienna With contributions from: Piotr Bański, Kathrin Beck, Gerhard Budin, Tommaso Caselli, Kerstin Eckart, Kjell Elenius, Gertrud Faaß, Maria Gavrilidou, Verena Henrich, Valeria Quochi, Lothar Lemnitzer, Wolfgang Maier, Monica Monachini, Jan Odijk, Maciej Ogrodniczuk, Petya Osenova, Petr Pajas, Maciej Piasecki, Adam Przepiórkowski, Dieter Van Uytvanck, Thomas Schmidt, Ineke Schuurman, Kiril Simov, Claudia Soria, Inguna Skadina, Jan Stepanek, Pavel Stranak, Paul Trilsbeek, Thorsten Trippel, Iris Vogel © all rights reserved by MPI for Psycholinguistics on behalf of CLARIN Contents Contents.................................................................................................................................................. 4 Background ............................................................................................................................................ 7 Introduction ............................................................................................................................................ 7 1. General Standards ......................................................................................................................... 10 1.1. Unicode ................................................................................................................................. 10 1.2. XML...................................................................................................................................... 11 1.3. TEI......................................................................................................................................... 12 2. Lexica and Terminology Standards............................................................................................... 13 2.1. Terminology, Definitions...................................................................................................... 13 2.2. Lexical Markup Framework (LMF)...................................................................................... 15 2.2.1. Content Descriptors/Tags: the Data Categories ............................................................. 17 2.2.2. Sample Entries from different types of lexica................................................................ 18 2.3. TEI......................................................................................................................................... 26 2.4. Terminological Markup Framework (TMF) ......................................................................... 30 2.5. Widespread Lexicon Formats................................................................................................ 33 2.5.1. Wordnet Structure and Formats ..................................................................................... 33 2.5.2. Existing Wordnet-LMF Formats.................................................................................... 42 2.5.3. Suggested Wordnet-LMF Format .................................................................................. 46 3. Ontologies ..................................................................................................................................... 49 CLARIN-D5C-3 4 Common Language Resources and Technology Infrastructure 3.1. Terminology/Definitions....................................................................................................... 51 3.2. Ontology................................................................................................................................ 53 3.2.1. Upper Ontologies ........................................................................................................... 53 3.2.2. Domain Ontologies ........................................................................................................ 56 3.3. Ontology Languages ............................................................................................................. 57 3.3.1. Languages for definition of ontologies .......................................................................... 58 3.3.2. Rules definition languages ............................................................................................. 60 3.3.3. Queries definition languages.......................................................................................... 61 3.3.4. Mapping definition languages........................................................................................ 63 4. Written Corpora............................................................................................................................. 64 4.1. Scope, Terminology Definitions ........................................................................................... 64 4.2. The TEI encoding of the National Corpus of Polish............................................................. 65 4.2.1. Introduction .................................................................................................................... 65 4.2.2. Corpus Header................................................................................................................ 65 4.2.3. Text Header .................................................................................................................... 68 4.2.4. Text Structure................................................................................................................. 70 4.3. Written corpora in CES/XCES.............................................................................................. 72 4.3.1. Introduction .................................................................................................................... 72 4.3.2. The CES Header............................................................................................................. 73 4.3.3. Encoding of primary data............................................................................................... 73 4.3.4. Linguistic annotation...................................................................................................... 73 5. Annotation..................................................................................................................................... 74 5.1. General annotation frameworks (TEI, LAF)......................................................................... 74 5.1.1. TEI annotation of the National Corpus of Polish........................................................... 74 5.1.2. Linguistic Annotation Framework (LAF)...................................................................... 78 5.2. Standards for morphological annotation ............................................................................... 81 5.2.1. Morpho-Syntactic Аnnotation Framework (MAF)........................................................ 81 5.3. Standards for syntactic annotation ........................................................................................ 84 5.3.1. Syntactic Annotation Framework (SynAF).................................................................... 84 5.3.2. Penn Treebank (Phrase Structure Treebank).................................................................. 89 5.3.3. NeGra Format (Phrase Structure Treebank)................................................................... 91 5.3.4. Prague Markup Language (Dependency Structure Treebank)....................................... 96 5.3.5. Kyoto Annotation Format (KAF)................................................................................. 101 5.4. Standards for semantic annotation ...................................................................................... 107 5.4.1. Dialogue Acts (DiAML) .............................................................................................. 107 CLARIN-D5C-3 5 Common Language Resources and Technology Infrastructure 5.4.2. Events and Time Expression (TimeML-ISO) .............................................................. 111 5.4.3. Coreference (MATE) ..................................................................................................