Computational Approaches to the Comparison of Regional Variety Corpora – Prototyping a Semi-Automatic System for German

Dissertation approved by the Faculty of Philosophy and History of the Universität Stuttgart for the degree of Doctor of Philosophy (Dr. phil.)
Submitted by Stefanie Anstein from Rottweil
Main referee: Prof. Dr. phil. habil. Ulrich Heid
First co-referee: Prof. Dr. phil. habil. Achim Stein
Second co-referee: Univ.-Prof. Mag. Dr. Gerhard Budin
Date of the oral examination: 31 January 2013
Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 2013

Declaration
I hereby declare that I wrote this dissertation independently, using the sources listed and under academic supervision. (Stefanie Anstein)

Acknowledgements
I sincerely thank everyone who has accompanied me over the past years, from and in the most varied directions, and who has supported me in the most diverse ways. Special thanks go to my main supervisor Ulrich Heid, who guided me superbly with his inexhaustible wealth of knowledge and experience. I greatly appreciate his extremely valuable feedback and advice, for which he always took a great deal of time. I warmly thank the co-referees Achim Stein and Gerhard Budin for their willingness to review and examine this work and for their helpful comments; I thank Rainer Bäuerle for stepping in at short notice. This work was written during my time at the Institut für Fachkommunikation und Mehrsprachigkeit (Institute for Specialised Communication and Multilingualism) of EURAC in Bozen. I am also very grateful to its coordinator, Andrea Abel, both for her suggestions on content and for her organisational flexibility.
I thank all the other professors, lecturers, colleagues and friends at the IMS, at EURAC and elsewhere who taught me a great deal, helped me and gave me much to take along the way, especially Heidi Abfalterer, the C4 group, Chris Culy, Henrik Dittmann, Grzegorz Dogil, Hans Drumbl, Stefan Evert, Peter Farbridge, Arne Fitschen, Hannah Kermes, Adam Kilgarriff, Jonas Kuhn, Anke Lüdeling, Margit Oberhammer, Sebastian Padó, Uwe Reyle, Helmut Schmid, Sabine Schulte im Walde, Marcello Soffritti, Egon Stemle, Barbara Taferner, Renata Zanin and Heike Zinsmeister.
For the comforting environment and the ever-refreshing balance, my thanks go to Anke & Micha, Anne & Nat, the Bayer family, Britta, Fabienne, Mr Fischl, Frank & Anna, Franzi, Goenkaji, Gotte & Katharina, Harald, Katrin, Lionel, Magdalena & Michi, Monika, Nadi & Diana, Omar & Smail, Regi & Sims, Sandra, Simone and Stef.
And simply my most heartfelt thanks for everything to my parents and to Gerhard, Kati and Verena.

Publications
Aspects of the research described here also appear in the following peer-reviewed publications:
Abel, Andrea & Anstein, Stefanie (2008): ‘Approaches to Computational Lexicography for German Varieties’. In: Proceedings of the 13th EURALEX International Conference; pp. 251–260; Barcelona.
Abel, Andrea & Anstein, Stefanie (2011): ‘Korpus Südtirol - Varietätenlinguistische Untersuchungen’. In: Korpusinstrumente in Lehre und Forschung, ed. by Abel, Andrea & Zanin, Renata; pp. 29–53; Bolzano: Bolzano University Press.
Abel, Andrea; Anstein, Stefanie & Ties, Isabella (2008): ‘Ansätze einer intralingualen kontrastiven Korpuslinguistik – aufgezeigt am Beispiel administrativer Rechtstexte aus Deutschland, Österreich und Südtirol’. In: Formulierungsmuster in deutscher und italienischer Fachkommunikation. Intra- und interlinguale Perspektiven, ed. by Heller, Dorothee; Linguistic Insights; pp. 243–270; Bern: Peter Lang.
Anstein, Stefanie (2007): ‘Korpuslinguistische Fallstudien zum Südtiroler Standardschriftdeutsch - das Projekt “Korpus Südtirol”’. Linguistik online; vol. 32. http://www.linguistik-online.org/32 07/anstein.pdf, last accessed 2012-10-14.
Anstein, Stefanie (2009a): ‘Vis-À-Vis – a System for the Comparison of Linguistic Varieties on the Basis of Corpora’. In: Proceedings of the 2nd Colloquium on Lesser Used Languages and Computer Linguistics (LULCL); pp. 59–64. http://www.eurac.edu/Org/LanguageLaw/Multilingualism/Projects/ LULCL II proceedings.htm, last accessed 2012-10-16.
Anstein, Stefanie (2009b): ‘Vis-À-Vis - a System to Compare Variety Corpora’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/cl2009, last accessed 2012-10-16.
Anstein, Stefanie (2012): ‘Comparing Variety Corpora with Vis-À-Vis – a Prototype System Presentation’. In: Proceedings of the 11th Conference on Natural Language Processing (KONVENS); pp. 243–247; Vienna. http://www.oegai.at/konvens2012/proceedings/35 anstein12p, last accessed 2012-10-16.
Anstein, Stefanie & Glaznieks, Aivars (2011): ‘Comparing Geographical and Learner Varieties on the Basis of Corpora’. In: Comparative Methods and Analysis in the Language Science. Proceedings of the 3rd edition of JéTou; pp. 179–188; Toulouse. http://jetou2011.free.fr/ARTICLES/S4A2.pdf, last accessed 2012-10-17.

Contents
List of abbreviations
List of figures
List of tables
German abstract
English abstract
1 Introduction and background
1.1 Objectives and research questions
1.1.1 Aims and scope of this work
1.1.2 Research questions
1.1.3 Structure of this thesis
1.2 Background in the relevant research areas
1.2.1 Linguistics and language variation
1.2.1.1 Language phenomena and their investigation
1.2.1.2 Linguistic variation
1.2.2 Language in South Tyrol
1.2.2.1 History and current situation
1.2.2.2 Standards and norms
1.2.3 Computational approaches to corpus studies
1.2.3.1 Inter-disciplinarity
1.2.3.2 Corpora and their linguistic annotation
1.2.3.3 Data extraction from corpora
1.2.3.4 Comparative corpus linguistics
1.2.3.5 Statistics for comparing corpora
1.2.3.6 Evaluation of corpus processing tools
2 Related work and research desiderata
2.1 Resources and methods for corpus comparison
2.1.1 Variety corpora and dictionaries
2.1.2 Comparative corpus studies
2.1.2.1 Studies on corpus comparability
2.1.2.2 General variety studies
2.1.3 Computational systems for corpus studies
2.1.3.1 Corpus annotation tools
2.1.3.2 Corpus analysis and comparison tools
2.2 Investigations on South Tyrolean German
2.2.1 South Tyrolean German variety linguistics
2.2.2 Linguistic characteristics of South Tyrolean German
2.3 Research desiderata derived from the state of the art
3 The system Vis-À-Vis
3.1 Requirements for a corpus comparison system
3.2 Methodology and system architecture
3.2.1 Approaches and methods
3.2.2 Workflow and modules
3.3 System functionalities and usage modes
3.3.1 Technical and functional specification
3.3.1.1 Technical system features
3.3.1.2 Input verification
3.3.1.3 Comparability check
3.3.1.4 Annotation
3.3.1.5 Analysis levels
3.3.1.6 Linguistic filtering
3.3.1.7 Statistical comparison
3.3.1.8 Output presentation
3.3.2 Coverage and limitations of the system
3.3.3 System usage scenarios
3.4 System output
3.4.1 General corpus comparison output
3.4.2 Output by analysis levels
4 Quantitative and qualitative system evaluation
4.1 Quantitative system performance
4.1.1 Evaluation procedures
4.1.2 Evaluation data and gold standard
4.1.3 Quantitative evaluation results
4.2 Qualitative case studies
4.2.1 Newspaper corpora
4.2.2 Web corpora
4.2.3 Learner corpora
4.3 Discussion of evaluation results
5 Outlook and conclusion
5.1 Potential further research
5.1.1 General resource and system enhancements
5.1.2 Refinement of analysis levels
5.2 Summary
5.2.1 Principal findings of this work
5.2.2 Contributions to the relevant research areas
A System documentation
B Gold standard list of Südtirolisms
B.1 Primary Südtirolisms
B.2 Extract of secondary Südtirolisms
C Online resources
Bibliography

List of abbreviations
ADJ: adjective
ADV: adverb
AM: association measure
AT: Austria
BNC: British National Corpus
CH: Switzerland
CQP: Corpus Query Processor
CWB: Corpus Workbench
DE: Germany
DOLO: Dolomiten newspaper corpus
DWDS: Digitales Wörterbuch der Deutschen Sprache; Digital dictionary of the German language
f: frequency
FR: Frankfurter Rundschau newspaper corpus
GUI: graphical user interface
ICE: International Corpus of English
IT: Italy
KWIC: keyword in context
LNRE: large number of rare events
LL: log-likelihood
