Computational Approaches to the Comparison of Regional Variety Corpora – Prototyping a Semi-Automatic System for German


Dissertation approved by the Faculty of Philosophy and History (Philosophisch-Historische Fakultät) of the Universität Stuttgart for the award of the degree of Doctor of Philosophy (Dr. phil.)

Submitted by Stefanie Anstein from Rottweil

Main referee (Hauptberichter): Prof. Dr. phil. habil. Ulrich Heid
First co-referee (1. Mitberichter): Prof. Dr. phil. habil. Achim Stein
Second co-referee (2. Mitberichter): Univ.-Prof. Mag. Dr. Gerhard Budin

Date of the oral examination: 31 January 2013

Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
2013

Declaration

I hereby declare that I wrote this dissertation independently, using the sources listed and under academic supervision.

(Stefanie Anstein)

Acknowledgements

I sincerely thank everyone who has accompanied me over the past years, from and in the most varied directions, and who has supported me in the most varied ways. Special thanks go to my main supervisor Ulrich Heid, who guided me excellently with his inexhaustible wealth of knowledge and experience. I greatly appreciate his extremely valuable feedback and advice, for which he always took plenty of time. I warmly thank the co-referees Achim Stein and Gerhard Budin for their willingness to review and examine this work and for their helpful comments; I thank Rainer Bäuerle for stepping in at short notice. This work was written during my time at the Institute for Specialised Communication and Multilingualism (Institut für Fachkommunikation und Mehrsprachigkeit) of EURAC in Bozen, and I am also very grateful to its coordinator Andrea Abel, both for her suggestions on content and for her organisational flexibility.
I thank all the other professors, lecturers, colleagues and friends at the IMS, at EURAC and elsewhere who taught me a great deal, helped me and gave me much to take along the way, especially Heidi Abfalterer, the C4 group, Chris Culy, Henrik Dittmann, Grzegorz Dogil, Hans Drumbl, Stefan Evert, Peter Farbridge, Arne Fitschen, Hannah Kermes, Adam Kilgarriff, Jonas Kuhn, Anke Lüdeling, Margit Oberhammer, Sebastian Padó, Uwe Reyle, Helmut Schmid, Sabine Schulte im Walde, Marcello Soffritti, Egon Stemle, Barbara Taferner, Renata Zanin and Heike Zinsmeister. For the comforting setting and the ever refreshing balance, my thanks go to Anke & Micha, Anne & Nat, the Bayer family, Britta, Fabienne, Herr Fischl, Frank & Anna, Franzi, Goenkaji, Gotte & Katharina, Harald, Katrin, Lionel, Magdalena & Michi, Monika, Nadi & Diana, Omar & Smail, Regi & Sims, Sandra, Simone and Stef. And simply my most heartfelt thanks for everything to my parents and to Gerhard, Kati and Verena.

Publications

Aspects of the research described here also appear in the following peer-reviewed publications:

Abel, Andrea & Anstein, Stefanie (2008): ‘Approaches to Computational Lexicography for German Varieties’. In: Proceedings of the 13th EURALEX International Conference; pp. 251–260; Barcelona.

Abel, Andrea & Anstein, Stefanie (2011): ‘Korpus Südtirol - Varietätenlinguistische Untersuchungen’. In: Korpusinstrumente in Lehre und Forschung, ed. by Abel, Andrea & Zanin, Renata; pp. 29–53; Bolzano: Bolzano University Press.

Abel, Andrea; Anstein, Stefanie & Ties, Isabella (2008): ‘Ansätze einer intralingualen kontrastiven Korpuslinguistik – aufgezeigt am Beispiel administrativer Rechtstexte aus Deutschland, Österreich und Südtirol’. In: Formulierungsmuster in deutscher und italienischer Fachkommunikation. Intra- und interlinguale Perspektiven, ed. by Heller, Dorothee; Linguistic Insights; pp. 243–270; Bern: Peter Lang.
Anstein, Stefanie (2007): ‘Korpuslinguistische Fallstudien zum Südtiroler Standardschriftdeutsch - das Projekt "Korpus Südtirol"’. Linguistik online; vol. 32. http://www.linguistik-online.org/32 07/anstein.pdf, last accessed 2012-10-14.

Anstein, Stefanie (2009a): ‘Vis-À-Vis – a System for the Comparison of Linguistic Varieties on the Basis of Corpora’. In: Proceedings of the 2nd Colloquium on Lesser Used Languages and Computer Linguistics (LULCL); pp. 59–64. http://www.eurac.edu/Org/LanguageLaw/Multilingualism/Projects/LULCL II proceedings.htm, last accessed 2012-10-16.

Anstein, Stefanie (2009b): ‘Vis-À-Vis - a System to Compare Variety Corpora’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/cl2009, last accessed 2012-10-16.

Anstein, Stefanie (2012): ‘Comparing Variety Corpora with Vis-À-Vis – a Prototype System Presentation’. In: Proceedings of the 11th Conference on Natural Language Processing (KONVENS); pp. 243–247; Vienna. http://www.oegai.at/konvens2012/proceedings/35 anstein12p, last accessed 2012-10-16.

Anstein, Stefanie & Glaznieks, Aivars (2011): ‘Comparing Geographical and Learner Varieties on the Basis of Corpora’. In: Comparative Methods and Analysis in the Language Science. Proceedings of the 3rd edition of JéTou; pp. 179–188; Toulouse. http://jetou2011.free.fr/ARTICLES/S4A2.pdf, last accessed 2012-10-17.

Contents

List of abbreviations
List of figures
List of tables
German abstract
English abstract
1 Introduction and background
1.1 Objectives and research questions
1.1.1 Aims and scope of this work
1.1.2 Research questions
1.1.3 Structure of this thesis
1.2 Background in the relevant research areas
1.2.1 Linguistics and language variation
1.2.1.1 Language phenomena and their investigation
1.2.1.2 Linguistic variation
1.2.2 Language in South Tyrol
1.2.2.1 History and current situation
1.2.2.2 Standards and norms
1.2.3 Computational approaches to corpus studies
1.2.3.1 Inter-disciplinarity
1.2.3.2 Corpora and their linguistic annotation
1.2.3.3 Data extraction from corpora
1.2.3.4 Comparative corpus linguistics
1.2.3.5 Statistics for comparing corpora
1.2.3.6 Evaluation of corpus processing tools
2 Related work and research desiderata
2.1 Resources and methods for corpus comparison
2.1.1 Variety corpora and dictionaries
2.1.2 Comparative corpus studies
2.1.2.1 Studies on corpus comparability
2.1.2.2 General variety studies
2.1.3 Computational systems for corpus studies
2.1.3.1 Corpus annotation tools
2.1.3.2 Corpus analysis and comparison tools
2.2 Investigations on South Tyrolean German
2.2.1 South Tyrolean German variety linguistics
2.2.2 Linguistic characteristics of South Tyrolean German
2.3 Research desiderata derived from the state of the art
3 The system Vis-À-Vis
3.1 Requirements for a corpus comparison system
3.2 Methodology and system architecture
3.2.1 Approaches and methods
3.2.2 Workflow and modules
3.3 System functionalities and usage modes
3.3.1 Technical and functional specification
3.3.1.1 Technical system features
3.3.1.2 Input verification
3.3.1.3 Comparability check
3.3.1.4 Annotation
3.3.1.5 Analysis levels
3.3.1.6 Linguistic filtering
3.3.1.7 Statistical comparison
3.3.1.8 Output presentation
3.3.2 Coverage and limitations of the system
3.3.3 System usage scenarios
3.4 System output
3.4.1 General corpus comparison output
3.4.2 Output by analysis levels
4 Quantitative and qualitative system evaluation
4.1 Quantitative system performance
4.1.1 Evaluation procedures
4.1.2 Evaluation data and gold standard
4.1.3 Quantitative evaluation results
4.2 Qualitative case studies
4.2.1 Newspaper corpora
4.2.2 Web corpora
4.2.3 Learner corpora
4.3 Discussion of evaluation results
5 Outlook and conclusion
5.1 Potential further research
5.1.1 General resource and system enhancements
5.1.2 Refinement of analysis levels
5.2 Summary
5.2.1 Principal findings of this work
5.2.2 Contributions to the relevant research areas
A System documentation
B Gold standard list of Südtirolisms
B.1 Primary Südtirolisms
B.2 Extract of secondary Südtirolisms
C Online resources
Bibliography

List of abbreviations

ADJ: adjective
ADV: adverb
AM: association measure
AT: Austria
BNC: British National Corpus
CH: Switzerland
CQP: Corpus Query Processor
CWB: Corpus Workbench
DE: Germany
DOLO: Dolomiten newspaper corpus
DWDS: Digitales Wörterbuch der Deutschen Sprache; Digital dictionary of the German language
f: frequency
FR: Frankfurter Rundschau newspaper corpus
GUI: graphical user interface
ICE: International Corpus of English
IT: Italy
KWIC: keyword in context
LNRE: large number of rare events
LL: log-likelihood