OLiA – Ontologies of Linguistic Annotation


Semantic Web 0 (0) 1. IOS Press. 1570-0844/0-1900/$27.50 © IOS Press and the authors. All rights reserved.

Editor(s): Sebastian Hellmann, AKSW, University of Leipzig, Germany; Steven Moran, LMU Munich, Germany; Martin Brümmer, AKSW, University of Leipzig, Germany; John P. McCrae, CITEC, Bielefeld University, Germany

Solicited review(s): Menzo Windhouwer, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands; Steve Cassidy, Macquarie University, Sydney, Australia; one anonymous reviewer

Christian Chiarcos* and Maria Sukhareva
Applied Computational Linguistics (ACoLi), Department of Computer Science and Mathematics, Goethe-University Frankfurt am Main, Germany, http://acoli.cs.uni-frankfurt.de
*Corresponding author. E-mail: [email protected]

Abstract. This paper describes the Ontologies of Linguistic Annotation (OLiA) as one of the data sets currently available as part of the Linguistic Linked Open Data (LLOD) cloud. Within the LLOD cloud, the OLiA ontologies serve as a reference hub for annotation terminology for linguistic phenomena across a great bandwidth of languages. They have been used to facilitate interoperability and information integration of linguistic annotations in corpora, NLP pipelines and lexical-semantic resources, and to mediate their linking with multiple community-maintained terminology repositories.

Keywords: linguistic terminology, grammatical categories, annotation schemes, corpora, NLP, tag sets, interoperability

1. Background

The heterogeneity of linguistic annotations has been recognized as a key problem limiting the interoperability and reusability of NLP tools and linguistic data collections. Several repositories of linguistic annotation terminology have been developed to facilitate annotation interoperability by means of a joint level of representation, or an 'interlingua', the most prominent probably being the General Ontology of Linguistic Description [12, GOLD] and the ISO TC37/SC4 Data Category Registry [21, ISOcat].

Still, these repositories are developed by different communities, and are thus not always compatible with each other, neither with respect to definitions nor technologies (e.g., there is no commonly agreed formalism to link linguistic annotations to terminology repositories).

The Ontologies of Linguistic Annotation (OLiA) have been developed to facilitate the development of applications that benefit from a well-defined terminological backbone even before the GOLD and ISOcat repositories have converged into a generally accepted repository of reference terminology: They introduce an intermediate level of representation between ISOcat, GOLD and other repositories of linguistic reference terminology and are interconnected with these resources, and they provide not only means to formalize reference categories, but also annotation schemes, and the way that these are linked with reference categories.

2. Architecture

The Ontologies of Linguistic Annotations [3] represent a modular architecture of OWL2/DL ontologies that formalize the mapping between annotations, a 'Reference Model' and existing terminology repositories ('External Reference Models'). The OLiA ontologies were developed as part of an infrastructure for the sustainable maintenance of linguistic resources [28], and their primary fields of application include the formalization of annotation schemes and concept-based querying over heterogeneously annotated corpora [16,10].

In the OLiA architecture, four different types of ontologies are distinguished (cf. Fig. 1 for an example):

– The OLiA REFERENCE MODEL specifies the common terminology that different annotation schemes can refer to. It is derived from existing repositories of annotation terminology and extended in accordance with the annotation schemes that it was applied to.
– Multiple OLiA ANNOTATION MODELs formalize annotation schemes and tagsets. Annotation Models are based on the original documentation, so that they provide an interpretation-independent representation of the annotation scheme.
– For every Annotation Model, a LINKING MODEL defines subClassOf (⊑) relationships between concepts/properties in the respective Annotation Model and the Reference Model. Linking Models are interpretations of Annotation Model concepts and properties in terms of the Reference Model.
– Existing terminology repositories can be integrated as EXTERNAL REFERENCE MODELs, if they are represented in OWL2/DL. Then, Linking Models specify subClassOf (⊑) relationships between Reference Model concepts and External Reference Model concepts.

The OLiA Reference Model specifies classes for linguistic categories (e.g., olia:Determiner) and grammatical features (e.g., olia:Accusative), as well as properties that define relations between these (e.g., olia:hasCase).

Conceptually, Annotation Models differ from the Reference Model in that they include not only concepts and properties, but also individuals: Individuals represent concrete tags, while classes represent abstract concepts similar to those of the Reference Model. Figure 1 gives an example for the individual PDAT from the STTS Annotation Model, the corresponding STTS concepts, and their linking with Reference Model concepts. Taken together, these allow the individual (and the part-of-speech tag it represents) to be interpreted as an olia:Determiner, etc.

3. Data Set Description

The OLiA ontologies are available from http://purl.org/olia under a Creative Commons Attribution license (CC-BY).

The OLiA ontologies cover different grammatical phenomena, including inflectional morphology, word classes, phrase and edge labels of different syntax annotations, as well as prototypes for discourse annotations (coreference, discourse relations, discourse structure and information structure). Annotations for lexical semantics are only covered to the extent that they are found in syntactic and morphosyntactic annotation schemes. Other aspects of lexical semantics are beyond the scope of OLiA: Existing reference resources for lexical semantics available in RDF include WordNet, VerbNet and FrameNet; their linking with OLiA is recommended as part of the lexicon model lemon [22], and has been implemented, for example, in lemonUby [11].

Since their first presentation [3], the OLiA ontologies have been continuously extended. At the time of writing, the OLiA Reference Model distinguishes 280 MorphosyntacticCategorys (word classes), 68 SyntacticCategorys (phrase labels), 18 MorphologicalCategorys (morphemes), 7 MorphologicalProcesses, and 405 different values for 18 MorphosyntacticFeatures, 5 SyntacticFeatures and 6 SemanticFeatures (for glosses, part-of-speech annotation and for edge labels in syntax annotation).

As for morphological, morphosyntactic and syntactic annotations, the OLiA ontologies include 32 Annotation Models for about 70 different languages, including several multi-lingual annotation schemes, e.g., EAGLES [3] for 11 Western European languages, and MULTEXT/East [8] for 15 (mostly) Eastern European languages. As for non-(Indo-)European languages, the OLiA ontologies include morphosyntactic annotation schemes for languages of the Indian subcontinent, for Arabic, Basque, Chinese, Estonian, Finnish, Hausa, Hungarian and Turkish, as well as multi-lingual schemes applied to languages of Africa, the Americas, the Pacific and Australia. The OLiA ontologies also cover historical varieties, including Old High German, Old Norse and Old Tibetan. Additionally, 7 Annotation Models for different resources with discourse annotations have been developed.

External reference models currently linked to the OLiA Reference Model include GOLD [3], the OntoTag ontologies [1], an ontological remodeling of ISOcat [4], and the Typological Database System (TDS) ontologies [26]. From these, only the TDS ontologies are currently available under an open (CC-BY) license,¹ but these take a focus on typological databases rather than NLP and annotation interoperability. GOLD and ISOcat should be available under an open license,² but can currently not be redistributed because of an uncertain licensing situation (no explicit license information). The OntoTag ontologies are available to the author but have not been publicly released.

In this context, the function of the OLiA Reference Model is not to provide a novel and independent view on linguistic terminology, but rather to serve as a stable intermediate representation between (ontological models of) annotation schemes and these terminology repositories. This allows any concept that can be expressed in terms of the OLiA Reference Model also to be interpreted in the context of ISOcat, GOLD, OntoTag or TDS. OLiA serves to aggregate annotation terminology as found in linguistic resources [...] interoperability between GOLD and ISOcat (that – despite ongoing harmonization efforts [20] – are maintained by different communities and developed independently). Using the OLiA Reference Model, it is thus possible to develop applications that are interoperable in terms of GOLD and ISOcat even though both are still under development and both differ in their conceptualizations. Such applications are briefly described in the following section.

¹ http://languagelink.let.uu.nl/tds/ontology/LinguisticOntology.owl

4. Application

Initially, the OLiA ontologies have been intended to [...]
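The tag → Annotation Model → Linking Model → Reference Model chain described in Section 2 can be sketched in miniature. The following is a toy sketch, not the actual ontology content: the class names stts:AttributiveDemonstrativePronoun and penn:DeterminerPOS are hypothetical stand-ins, and the OWL subClassOf axioms of a Linking Model are reduced to a plain dictionary.

```python
# Minimal sketch of OLiA's layering (hypothetical names, not the real
# ontologies): concrete tags are individuals in an Annotation Model;
# a Linking Model maps Annotation Model classes into the common
# Reference Model via subClassOf.

# Annotation Models: tag (individual) -> scheme-specific class.
ANNOTATION_MODELS = {
    "stts": {"PDAT": "stts:AttributiveDemonstrativePronoun"},  # hypothetical
    "penn": {"DT": "penn:DeterminerPOS"},                      # hypothetical
}

# Linking Models: scheme-specific class -> Reference Model class.
LINKING_MODELS = {
    "stts:AttributiveDemonstrativePronoun": "olia:Determiner",
    "penn:DeterminerPOS": "olia:Determiner",
}

def interpret(scheme: str, tag: str) -> str:
    """Interpret a concrete tag in terms of the Reference Model."""
    annotation_concept = ANNOTATION_MODELS[scheme][tag]
    return LINKING_MODELS[annotation_concept]

print(interpret("stts", "PDAT"))  # olia:Determiner
print(interpret("penn", "DT"))    # olia:Determiner
```

Because both tagset-specific classes end up under the same Reference Model class, a tag from STTS and a tag from a different tagset receive one shared interpretation, which is the interoperability effect the architecture is built for.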
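The role of the Reference Model as a stable pivot, enabling concept-based querying over heterogeneously annotated corpora and re-interpretation in an External Reference Model, can likewise be sketched. Again a hedged toy example: the token data, the tag-to-concept table and the gold:* class names are invented for illustration and do not reproduce the actual GOLD linking.

```python
# Toy illustration of the Reference Model as a pivot: corpora annotated
# with different tagsets are queried with one Reference Model concept,
# and the hits can additionally be re-expressed in an External
# Reference Model (all mappings here are illustrative).
TAG_TO_REFERENCE = {
    ("stts", "PDAT"): "olia:Determiner",
    ("stts", "NN"): "olia:CommonNoun",
    ("penn", "DT"): "olia:Determiner",
}
OLIA_TO_GOLD = {  # linking into an External Reference Model
    "olia:Determiner": "gold:Determiner",
    "olia:CommonNoun": "gold:CommonNoun",
}

# Tokens from two corpora annotated with different tagsets.
corpus = [("das", "stts", "PDAT"), ("Haus", "stts", "NN"),
          ("this", "penn", "DT")]

# Concept-based query: all determiners, regardless of source tagset.
determiners = [tok for tok, scheme, tag in corpus
               if TAG_TO_REFERENCE[(scheme, tag)] == "olia:Determiner"]
print(determiners)  # ['das', 'this']

# The same annotations, interpreted in the External Reference Model.
print({tok: OLIA_TO_GOLD[TAG_TO_REFERENCE[(scheme, tag)]]
       for tok, scheme, tag in corpus})
```

A query phrased once against olia:Determiner thus retrieves tokens from both corpora, and each hit can be read in GOLD (or, by the same mechanism, ISOcat, OntoTag or TDS) terms without the application committing to either repository.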