<<

An information-theoretic approach to language complexity: variation in naturalistic corpora

Inaugural-Dissertation

zur

Erlangung der Doktorwürde

der Philologischen Fakultät

der Albert-Ludwigs-Universität

Freiburg i.Br.

vorgelegt von

Katharina Luisa Ehret

aus Freiburg i.Br.

WS 2015/2016

Erstgutachter/in: Prof. Dr. Benedikt Szmrecsanyi

Zweitgutachter/in: Prof. Dr. Dr. h.c. Bernd Kortmann

Vorsitzende/r des Promotionsausschusses der Gemeinsamen Kommission der Philologischen, Philosophischen und Wirtschafts- und Verhaltenswissenschaftlichen Fakultät: Prof. Dr. Joachim Grage

Datum der Disputation: 07.10.2016

To my parents

Contents

List of Figures

List of Tables

Preface and Acknowledgements

1. Introduction

2. Literature review and theoretical background
   2.1. Linguistic Complexity
        2.1.1. Defining linguistic complexity
        2.1.2. Explaining linguistic complexity
   2.2. Information theory
        2.2.1. Information theory, Shannon entropy and Kolmogorov complexity
        2.2.2. Information-theoretic complexity

3. Experimenting with the compression technique
   3.1. Parallel texts
        3.1.1. Method and data
        3.1.2. The Gospel of Mark
   3.2. Parallel, semi-parallel and non-parallel texts
        3.2.1. Method and data
        3.2.2. Alice's Adventures in Wonderland
        3.2.3. Newspaper texts
   3.3. Summary

4. Excursion: Targeted file manipulation
   4.1. Method and data
   4.2. Analysing morphological markers
   4.3. Analysing constructions
   4.4. Summary

5. Exploring compression algorithms
   5.1. Comparing and interpreting algorithmic complexity
        5.1.1. Interpreting compressed strings
        5.1.2. Comparing compressed strings
   5.2. Summary

6. Case studies
   6.1. Assessing complexity variation in British English registers
        6.1.1. Method and data
        6.1.2. Register variability
   6.2. Measuring complexity in learner English
        6.2.1. Method and data
        6.2.2. Complexity variation in learner essays
        6.2.3. Does mother tongue matter?
   6.3. Summary

7. Summary and Conclusion
   7.1. Results
   7.2. Discussion

Bibliography

Appendix A. Tables
   A.1. Targeted file manipulation: Tukey's HSD tables
   A.2. Case studies: Standard deviation in the ICLE national subsets

Appendix B. R scripts
   B.1. Basic web scraper
   B.2. Multiple distortion and compression script
   B.3. Random sampling function

Appendix C. Python script

Appendix D. Shell scripts
   D.1. Fix fullstops
   D.2. Remove corpus markup
   D.3. Remove punctuation
   D.4. Remove UTF-8 characters

Appendix E. Zusammenfassung

List of Figures

2.1. Extra-linguistic factors impacting on complexity variation
2.2. Diagram of a communication system
2.3. Entropy of a two-choice situation
3.1. Overall complexity hierarchy in Mark
3.2. Morphological by syntactic complexity in Mark
3.3. Real time drifts in English
3.4. Overall complexity hierarchy of the parallel Alice corpus
3.5. Overall complexity hierarchy of the semi-parallel Alice corpus
3.6. Morphological by syntactic complexity in the parallel Alice corpus
3.7. Morphological by syntactic complexity in the semi-parallel Alice corpus
3.8. Overall complexity hierarchy in the Euro-Congo-Tunisia news corpus
3.9. Overall complexity hierarchy in the Euro-Congo news corpus
3.10. Morphological by syntactic complexity in the Euro-Congo-Tunisia news corpus
3.11. Morphological by syntactic complexity in the Euro-Congo news corpus
4.1. Morphological by syntactic complexity of morphological markers in Alice
4.2. Morphological by syntactic complexity of morphological markers in Mark
4.3. Morphological by syntactic complexity of morphological markers in the Euro-Congo news corpus
4.4. Morphological by syntactic complexity of constructions in Alice
4.5. Morphological by syntactic complexity of constructions in Mark
4.6. Morphological by syntactic complexity of constructions in the Euro-Congo news corpus
5.1. Unique strings in the original Alice lexicon
5.2. Compressed strings by linguistic category in the three Alice lexica
6.1. Overall complexity hierarchy of written BNC registers
6.2. Morphological by syntactic complexity of written BNC registers
6.3. Overall complexity hierarchy of ICLE learner groups
6.4. Morphological by syntactic complexity of ICLE learner groups
6.5. Overall complexity hierarchy of ICLE learner groups in national varieties
6.6. Overall complexity hierarchy of German learner groups in ICLE
6.7. Morphological by syntactic complexity of ICLE learner groups by national variety
6.8. Morphological by syntactic complexity of German learner groups in ICLE
7.1. Overall complexity ranking of newspapers
7.2. Analytical pipeline
E.1. Globale Komplexitätshierarchie der Zeitungsgenres

List of Tables

2.1. Overview of relevant empirical studies on complexity
3.1. Number of words and sentences in the Gospel of Mark
3.2. Spelling variance in the Gospel of Mark
3.3. Segmentable inflected word tokens in Mark
3.4. Word order patterns in Mark
3.5. Number of words and sentences in the Alice database
3.6. Number of sentences in the newspaper corpora
3.7. Standard deviations of complexity scores in the parallel Alice corpus
3.8. Standard deviations of complexity scores in the Euro-Congo corpus
3.9. Standard deviations of complexity scores in the Euro-Congo-Tunisia corpus
3.10. Standard deviations of complexity scores in the semi-parallel Alice corpus
3.11. Standard deviations of file sizes in the semi-parallel Alice corpus
3.12. Syntactic complexity rankings in parallel and semi-parallel Alice corpora
3.13. Syntactic complexity rankings in parallel and semi-parallel Alice corpora
4.1. Text frequency of morphological markers and constructions per text type
4.2. Number of sentences and words per text genre
4.3. Standard deviations of complexity scores by text and construction
4.4. Standard deviations of complexity scores by text and construction
4.5. Syntactic correlation of morphological markers
4.6. Morphological correlation of morphological markers
4.7. Morphological ranking of morphological markers
4.8. Syntactic ranking of morphological markers
4.9. Syntactic correlation of constructions
4.10. Morphological correlation of constructions
4.11. Morphological ranking of constructions
4.12. Syntactic ranking of constructions
5.1. Distribution of unique strings in the original Alice lexicon
5.2. String length in the original Alice lexicon
5.3. Compressed strings by linguistic category in the original Alice lexicon
5.4. Distorted passages from Alice
5.5. Distribution of unique strings in the morphologically distorted Alice lexicon
5.6. String length in the morphologically distorted Alice lexicon
5.7. Distribution of unique strings in the syntactically distorted Alice lexicon
5.8. String length in the syntactically distorted Alice lexicon
5.9. Compressed strings by linguistic category in the three Alice lexica
6.1. Overview of the written macro-registers and the newspaper micro-registers in the BNC
6.2. Standard deviations of file sizes in the BNC
6.3. Standard deviations of complexity scores in the BNC
6.4. Number of argumentative essays according to years of studying English at school and university
6.5. Learner groups by years of instruction in English at school and university
6.6. Standard deviations of file sizes in ICLE
6.7. Standard deviations of complexity scores in ICLE
6.8. SLA measures versus Kolmogorov measures
6.9. ICLE learner groups by national background
A.1. Tukey's HSD for morphs in Alice
A.2. Tukey's HSD for morphs in Mark
A.3. Tukey's HSD for morphs in the Euro-Congo corpus
A.4. Tukey's HSD for constructions in Alice
A.5. Tukey's HSD for constructions in Mark
A.6. Tukey's HSD for constructions in the Euro-Congo corpus
A.7. Standard deviations of complexity scores in the ICLE national subset
A.8. Standard deviations of complexity scores in the German ICLE subset

Preface and Acknowledgements

Partial summaries of the research discussed in this study have appeared as Ehret & Szmrecsanyi (2016a), Ehret & Szmrecsanyi (2016b), Ehret (2014). I am grateful for the generous funding provided by the Cusanuswerk (Bonn) without which none of this work would have been possible. I would also like to acknowledge and thank the following individuals:

– Benedikt Szmrecsanyi, for helpful discussion, valuable feedback, advice, support, and for being a really good supervisor.

– Christoph Wolk, for discussion, extensive advice on statistics, and for help with advanced R magic.

– Jens Stimpfle, for practical support with Python and shell programming as well as the retrieval of the gzip lexicon.

– Annemarie Verkerk and the Max Planck Institute for Psycholinguistics (Nijmegen) for providing me with some of the Alice’s Adventures in Won- derland texts.

– Bernd Kortmann and Christian Mair, for helpful advice and feedback.

– Alex Housen, Lourdes Ortega, Amir Zeldes, and Kimmo Kettunen, for helpful comments and feedback.

– The audiences of the following conferences and workshops, for their critical questions, feedback and comments:
  • IClaVE 8, Leipzig, May 2015.
  • Colloquium on “Cross-linguistic aspects of complexity in SLA”, VUB Brussels, December 2014.
  • ISLE 3, Zurich, August 2014.
  • ICAME 35, Nottingham, May 2014.
  • Workshop on “Theoretical and Computational : New Trends and Synergies”, ICL 19, Geneva, July 2013.

– Martin J. Smith, for proofreading, language discussions, and his friend- ship.


– Lorna Syme, for proofreading.

– Liane Neubert and Martin Böke, for their support, especially during the final stages of my PhD.

– My parents, family and friends, in Germany and Cameroon, for their unfailing support, their love, and for believing in me.

The usual disclaimers apply. All errors are solely mine.

1. Introduction

Marrying quantitative corpus linguistics to information theory, this work contributes to the ongoing linguistic complexity debate by exploring a hitherto underresearched methodology which uses compression algorithms to assess linguistic complexity in corpora. The methodology has the potential to be a radically objective and powerful tool in linguistic complexity research that can serve both as a complementary diagnostic in traditional research and as a full-blown, independent analysis tool. The central aims are, firstly, to advance the development and applicability of the method and, secondly, to gain an understanding of the underlying compression algorithm and, hence, of information-theoretic complexity. This work is therefore primarily methodological in nature.

The point of departure is the current typological complexity debate and the quest for complexity metrics, which was set in motion by the provocative claim that some languages are simpler than others (McWhorter 2001b). This claim challenged the assumption that, on the whole, all languages are equally complex (e.g. Crystal 1987; Hockett 1958). Numerous volumes on the topic have since been published (e.g. Dahl 2004; Kortmann & Szmrecsanyi 2012; Miestamo et al. 2008; Sampson et al. 2009), and linguistic complexity continues to be one of the most hotly debated notions in the contemporary linguistic community. Complexity research pivots around three questions:

(i) How can linguistic complexity be defined?

(ii) How can linguistic complexity be measured?

(iii) How can linguistic complexity variation be explained?

Despite the extensive study of linguistic complexity from various angles, no unanimous answer to these questions has been found. Rather, an abundance of definitions has been put forward, each of which is equally valid within its context of research but fails to be universally accepted or applicable. Generally, however, a distinction between absolute complexity and relative complexity is made (Miestamo 2006; Miestamo et al. 2008). Absolute complexity is a theory-oriented notion and is concerned with the complexity inherent in a linguistic system, while relative complexity notions define complexity in relation to a language user. The latter, therefore, tend to be more

applied and usage-oriented than the former. In terms of complexity measures, the status quo is similar. Various measures have been proposed, but they either draw on empirically expensive evidence or are highly selective and subjective in nature.

In addressing these issues, I explore and extend an unsupervised, algorithmic measure—in the following also referred to as the compression technique—that has its roots in information theory and was first proposed by Juola (1998). Essentially, this measure boils down to the notion of Kolmogorov complexity, which defines the complexity of a given text sample as the length of the shortest possible description of this text sample. Imagine a sort of I spy game in which two different objects have to be described. The goal of this particular game is to use as few words as possible to fully describe the two objects. On the basis of these descriptions, the complexity of the objects can be determined. The more words are needed to describe an object—while being as concise as possible—the more complex this object is. In plain English, the longer the shortest description of an object is, the more complex the object is.

In this spirit, the linguistic complexity of text samples is measured by approximating their information content, or Kolmogorov complexity, with compression algorithms. The basic idea is that text samples which can be compressed comparatively better, i.e. more efficiently, are linguistically comparatively less complex. Kolmogorov complexity is an absolute notion of complexity and is based on the form of structures, not on their function or meaning. In other words, Kolmogorov complexity is agnostic about deep linguistic form-function pairings. In the realm of linguistics, then, Kolmogorov-based information-theoretic complexity is a measure of structural surface redundancy. This is another way of saying, in very simplified terms, that it measures the recurrence and repetition of orthographic character sequences (structures) in a text. (A minimal code sketch of this compression-based approximation is given after the lists below.) Kolmogorov complexity conflates, to some extent, the following notions of complexity:

• Quantitative complexity: the number of grammatical contrasts, markers or rules in a linguistic system. More rules are equated with more complexity (Dahl 2004; McWhorter 2001b; Shosted 2006).

• Irregularity-based complexity: the number of irregular grammatical markers in a linguistic system. Irregular markers are regarded as more complex than regular markers (Kusters 2003; McWhorter 2001b; Trudgill 2004).

Importantly, it does not encompass:

• Redundancy-based complexity: linguistic markers, forms or categories without grammatical and communicative function (McWhorter 2001b; Seuren & Wekker 1986; Trudgill 1999).


• L2 acquisition difficulty: linguistic features which are difficult to acquire for adult language learners are complex (Kusters 2003; Szmrecsanyi & Kortmann 2009; Trudgill 2001).
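To make the measurement concrete, the following minimal sketch (in Python, using the standard gzip module rather than the exact pipeline employed later in this work; the example strings are purely illustrative) approximates the Kolmogorov complexity of a text sample by the size of its gzip-compressed representation and relates it to the original size. The smaller the ratio, the more surface redundancy the compressor could exploit, and the less complex the sample is taken to be.

import gzip

def compressed_size(text: str) -> int:
    """Approximate Kolmogorov complexity by the length, in bytes,
    of the gzip-compressed representation of a text sample."""
    return len(gzip.compress(text.encode("utf-8"), compresslevel=9))

def complexity_ratio(text: str) -> float:
    """Compressed size relative to original size: lower ratios indicate more
    structural surface redundancy, i.e. comparatively less complexity."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw, compresslevel=9)) / len(raw)

# Toy comparison: a highly repetitive sample compresses far better
# (lower ratio) than a sample with little surface redundancy.
repetitive = "the cat sat on the mat and the dog sat on the rug " * 20
varied = ("colourless green ideas sleep furiously while quiet rivers "
          "wander past abandoned lighthouses under violet skies")
print(complexity_ratio(repetitive))  # small ratio: highly compressible
print(complexity_ratio(varied))      # larger ratio: less compressible

In the analyses reported in later chapters, the text samples that are compared are additionally kept at the same size, so that such measurements remain comparable across samples.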

On the methodological plane, the central characteristics and advantages of this measure can be summarised as follows.

Objective. One of the inherent features and major assets of the compression technique is its unparalleled objectivity. The compression technique requires neither the a priori categorisation of linguistic features as complex or simple nor the subjective selection of some features over others that is common to most traditional complexity metrics. In fact, the compression technique is agnostic about form-meaning relationships and possesses no linguistic knowledge of the texts it is applied to, i.e. its measurements are unsupervised and radically objective.

Economical. The compression technique is an economical means of measuring linguistic complexity because it is easily implemented and can in principle be applied to any orthographically transcribed text database. Thus, the measure does not rely on empirical evidence which is labour-intensive to obtain and expensive to reproduce.

Usage-based. The compression technique is a usage-based methodology and is based on naturalistic, authentic language samples rather than, for instance, on paradigmatic analyses. Usage-based approaches to linguistics have become increasingly popular in recent years and are, essentially, concerned with the interaction between grammar and the mind. More specifically, they are interested in the effect of language use on the cognitive representation of language (Bybee 2006: 712). The underlying principle of all of these theories is that grammar emerges directly from, and is influenced by, actual language usage (e.g. Bybee 2006, 2010; Ellis 1998; Langacker 1987, 1988; Tomasello 2003). Language is seen as a symbolic dimension consisting of form-function units which are connected to the meaning they transmit and the communicative situation they are used in. As such, grammar emerges through the use of symbolic units and their grammaticalisation (Behrens 2009: 384–385; Tomasello 2003). Bybee (2006), for instance, defines grammar as the cognitive representation of a speaker's experience with language (Bybee 2006: 711), i.e. language usage determines, shapes and changes a speaker's linguistic system. Usage-based theories are particularly prominent and have produced


invaluable insights in the field of language acquisition research. In a usage-based framework, language is acquired solely through exposure and the help of general human cognitive processes and capabilities such as categorisation, generalisation and analogy, which permit the human mind to construct a grammar of the input language (Bybee 2006: 711; see also Tomasello 2003: 3–4). The usage-based approach can be summarised with the following quote: “language structure emerges from language use” (Tomasello 2003: 5, 327). Language phenomena, their patterning and usage can be studied in naturalistic text corpora which sample written or spoken language produced by language users. This means that compression algorithms—which work on orthographically transcribed texts and return measurements that are directly based on naturalistic language—constitute an inherently usage-based means for studying linguistic complexity.

Holistic. As mentioned above, the compression technique does not require manually selected features as input but works directly on the data. Since algorithmically measured complexity is not restricted to specific linguistic features, the compression technique constitutes a holistic means of measuring linguistic complexity.

It goes without saying that the compression technique is not flawless; the following characteristics should be borne in mind (for a more detailed discussion of drawbacks and advantages refer to Section 7.2).

Agnostic. Compression algorithms are completely agnostic about deep linguistic structures such as form-function pairings and possess no knowledge of the compressed texts.

Text-dependent. The methodology is strictly text-based and its measurements are to some extent text-dependent, i.e. they depend on orthographic transcription conventions and, in the case of non-parallel corpora, on the propositional content of the texts. Orthography is important because variant spellings of the same word (e.g. neighbourhood and neighborhood) affect the compressibility of texts and increase their complexity. Content control is a crucial factor that needs to be considered when working with non-parallel samples in order to ensure that the measurements are comparable and reliable. Thus, the compression technique cannot be used with randomly chosen texts (see also Chapter 3, Section 3.2).

Input-dependent.


Furthermore, the performance of the compression technique relies on the quality and quantity of the input data. Generally, the compression technique produces more reliable measurements with larger datasets, and text samples that are compared should be of the same size. In terms of data quality, the input texts need to be carefully prepared, i.e. any non-textual information should be removed and orthography has to be normalised.

Morphology-sensitive. In this work, Kolmogorov-based information-theoretic complexity is defined as a measure of structural surface redundancy and refers to the recurrence of orthographically transcribed character sequences in a text. As structural redundancy and morphological complexity are somewhat correlated, the compression technique has a slight tendency to favour morphological complexity. Therefore, large amounts of structural redundancy in a text can affect the measurements of overall complexity.

Going beyond the mere application of the compression technique, the current work provides a first in-depth analysis of Kolmogorov complexity in linguistic terms and assesses the metric from a linguistically responsible perspective by exploring the workings of compression algorithms. It furthermore aims at the development and advancement of the method. Specifically, the analysis is guided by the following research questions and objectives, which will be discussed in turn below.

(i) Can compression algorithms be applied to data other than parallel corpora?

(ii) Can compression algorithms measure the complexity of specific lin- guistic features?

(iii) What do compression algorithms, linguistically speaking, actually measure?

(iv) How well do compression algorithms capture intra-linguistic, i.e. within language, complexity variation in naturalistic corpora?

The first question regards the applicability of the compression technique to different data types. Previous research using compression algorithms for measuring linguistic complexity restricted the analysis to parallel text databases—basically translational equivalents of one text in different languages—in order to rule out differences in the propositional content of the texts. While such parallel text databases are ideal for the study of cross-linguistic complexity variation, the restriction of the method to parallel corpora poses a severe limitation to algorithmic complexity research of other

areas (e.g. intra-linguistic complexity research). It is therefore imperative to test the applicability of the compression technique to data other than parallel corpora. Thus, I assess to what extent the propositional content of the data analysed influences the results and apply the compression technique to different data types, i.e. parallel, semi-parallel and non-parallel databases, as well as naturalistic large-scale corpora of English. While compression algorithms work well on various data types, content control and data sparsity are issues to be considered when working with non-parallel and naturalistic corpora. This is another way of saying that the compression technique cannot reliably measure samples of randomly chosen texts, and generally returns more robust results when applied to larger datasets. The Lord's Prayer, counting 52 words, is for instance too small, but the Gospel of Mark, counting about 15,000 words in the English Standard Version, is sufficiently large to return reliable measurements.

Second, can compression algorithms be used to measure the complexity of specific linguistic features? Hitherto, algorithmic complexity research focused on measuring complexity from the bird's eye perspective, i.e. the complexity of morphology and syntax were measured in their entirety as subdomains of a language. The present work introduces a new, modified version of the classic compression technique which allows the measurement of morphological and syntactic complexity from the jeweller's eye perspective (see also Ehret 2014). Put differently, the compression technique can be used to measure specific morphosyntactic features in English.

The third question is concerned with what I dub the black box conundrum. Although it is generally good news that the compression technique returns results which are linguistically meaningful and interpretable, the responsible linguist should dig deeper and try to understand why the method works and what the method actually does. It is therefore one of the major objectives of this work to determine what exactly compression algorithms like gzip measure and, figuratively speaking, to take a look into the black box by analysing the workings of the algorithm. On the basis of gzip's lexicon output—a collection of compressed text sequences—I provide a detailed analysis of algorithmically recognised strings and define information-theoretic, Kolmogorov-based complexity in linguistic terms.

Finally, the fourth question, while being of a methodological nature, is at the same time the most “linguistic” research question in this set, because it relates to the measurement of intra-linguistic complexity variation. This question has so far not been addressed in the literature, which focuses on the cross-linguistic measurement of Kolmogorov complexity. The present work, pertaining to the Freiburg school of complexity research (cf. Kortmann & Szmrecsanyi 2009, 2012; Szmrecsanyi & Kortmann 2009), seeks to fill this gap and shed light on information-theoretic, intra-linguistic, as opposed to cross-linguistic, complexity variation in English. In two case studies (see Chapter 6) I show that complexity variability in British English

registers as well as learner Englishes can be successfully measured with the compression technique. In essence, this work demonstrates, first, how the compression technique can be applied to non-parallel corpora and, second, how it can be applied to different text types and varieties of English.

This work is structured as follows.

Chapter 2 gives a detailed overview of the literature and theoretical background on linguistic complexity research as well as information theory. Firstly, the origin of the current linguistic complexity debate will be traced from the Middle Ages to the present day, and major concepts and notions of linguistic complexity will be discussed. Furthermore, different metrics of complexity and factors held responsible for complexity variation will be presented. It is worth noting that I am motivated by and focus on typological and sociolinguistic complexity research. Second language acquisition (e.g. Larsen-Freeman 2006; Ortega 2003) or generative approaches (e.g. Newmeyer & Preston 2014) will therefore not be reviewed in detail. Secondly, concepts of information theory which are indispensable for the understanding of the methodology will be introduced. Starting with Shannon's mathematical theory of communication, this section will describe how information can be quantified and measured. This is followed by a review of the Kolmogorov-based information-theoretic metrics so far explored in the literature, and a concise definition of information-theoretic complexity in the context of this work.

Chapter 3 is devoted to the validation and extension of the compression technique in three steps. Up to now, the application of the compression technique has been restricted to parallel text corpora, i.e. translational equivalents of the same text in different languages. This chapter demonstrates that the compression technique yields linguistically meaningful results and can also be used with semi-parallel and non-parallel texts. In the first step, the compression technique as proposed by Juola (2008) is applied to a parallel text database comprising the Gospel of Mark1 in a handful of historical varieties of English and six other languages (Esperanto, Finnish, French, German, Hungarian, and Latin). It is demonstrated that compression algorithms can be utilised to assess the linguistic complexity of parallel texts on an overall, morphological and syntactic level. For instance, according to my measurements, Hungarian and Finnish are overall rather complex languages, while all the English Bible versions, with the exception of the West Saxon version, are overall comparatively simple.

1Throughout this work the Gospel of Mark will also be referred to as “Mark”.


The comparison of the morphological and syntactic complexity of the English Bible versions furthermore depicts the development of English from a morphologically complex language to a syntactically complex language over time. In fact, the algorithm captures both cross-linguistic and intra-linguistic complexity variation and provides results which dovetail with the findings of previous research (Bakker 1998; Nichols 1992).

In the second step, an extended, statistically more robust version of the compression technique is introduced and applied to a parallel and—after permutation—semi-parallel database of Alice's Adventures in Wonderland by Lewis Carroll2 which spans nine European languages (Dutch, English, Finnish, French, German, Hungarian, Italian, Romanian, and Spanish). The overall, morphological and syntactic complexity of the nine languages in both datasets is measured, and complexity rankings are established on the basis of these measurements. Subsequently, the rankings of the parallel and semi-parallel datasets are compared. The results show that the compression technique yields linguistically meaningful results with both parallel and semi-parallel data.

In the third step, the compression technique is utilised with two non-parallel corpora of newspaper texts which sample the same nine languages as the Alice database. After measuring the overall, morphological and syntactic complexity of the nine languages, the rankings of the newspaper measurements are compared to the ranking of the parallel Alice database, which serves as reference. I find that the rankings of morphological and syntactic complexity in the newspaper corpora are largely congruent with the Alice database. Yet the overall complexity measures correspond only moderately well. All in all, the algorithmic measurement of complexity in non-parallel texts is possible but requires a certain amount of content control. This is another way of saying that the propositional content of the components of a non-parallel corpus should be similar, as random texts cannot be reliably measured with the compression technique.
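Before moving on, the resampling logic behind this extended, statistically more robust version of the technique can be sketched as follows. The sketch is a plausible illustration only: the sample size, the number of iterations and the token-based sampling unit are assumptions for expository purposes, not the exact parameters used in Chapter 3. The idea is that equal-sized random samples are repeatedly drawn from each text and compressed, so that each text receives a distribution of compression measurements (and hence standard deviations, cf. Tables 3.7–3.11) rather than a single score.

import gzip
import random
import statistics

def mean_compressed_size(text: str, sample_size: int = 1000,
                         iterations: int = 100, seed: int = 42) -> float:
    """Draw equal-sized random token samples from a text, compress each
    sample with gzip, and return the mean compressed size in bytes."""
    rng = random.Random(seed)
    tokens = text.split()
    sizes = []
    for _ in range(iterations):
        sample = rng.sample(tokens, sample_size)  # sampling without replacement
        blob = " ".join(sample).encode("utf-8")
        sizes.append(len(gzip.compress(blob, compresslevel=9)))
    return statistics.mean(sizes)

# Note: each text compared in this way must contain at least sample_size
# tokens, which mirrors the data-sparsity caveat discussed in the introduction.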

Chapter 4 is an excursion into previously unexplored territory and presents a new flavour of the compression technique: targeted file manipulation. Essentially, targeted file manipulation combines the systematic removal of target structures from a text with the compression technique, i.e. distortion and subsequent compression. Expanding on previous work by the author (Ehret 2014), the contribution of a handful of morphological markers and functional constructions to the complexity of three different English texts is analysed. On an interpretational level, the textual complexity of these individual features

2Throughout this work Alice’s Adventures in Wonderland will also be referred to as “Alice”.


on the morphological and syntactic level is derived from their complexity contribution to the text. The focus is thereby put on assessing the extent of intertextual variation with regard to the features' textual complexity.

In general terms, I show that the presence of more morphological marker types leads to an increase in the morphological complexity of the texts while, at the same time, it facilitates the algorithmic prediction of (morpho)syntactic patterns. Furthermore, the results imply that invariant grammatical markers, such as the future marker will, increase simplicity. These findings hold across different texts, i.e. intertextual variation is negligible.

In methodological terms, targeted file manipulation is shown to be an effective method for measuring complexity trends of specific linguistic features in English texts. Expressly, algorithms can be utilised for the detailed analysis of morphological and syntactic complexity. Thus, I fill a hitherto unaddressed gap in information-theoretic complexity research, which surveys algorithmic complexity from a bird's eye perspective (e.g. Ehret & Szmrecsanyi 2016b; Juola 2008; Sadeniemi et al. 2008).
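The mechanics of distortion and subsequent compression can be illustrated with the minimal sketch below. The whole-word regular expression, the choice of the future marker will as the target, and the simple before/after comparison are illustrative assumptions; the analyses reported in Chapter 4 involve further controls and cover a whole set of morphological markers and functional constructions.

import gzip
import re

def compressed_size(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8"), compresslevel=9))

def remove_marker(text: str, marker: str = "will") -> str:
    """Systematically delete every whole-word occurrence of a target marker."""
    pattern = re.compile(r"\b" + re.escape(marker) + r"\b ?", flags=re.IGNORECASE)
    return pattern.sub("", text)

def before_and_after(text: str, marker: str = "will"):
    """Compress the original and the manipulated text; a feature's
    contribution to textual complexity is derived from such paired
    measurements."""
    return compressed_size(text), compressed_size(remove_marker(text, marker))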

Chapter 5 explores the actual algorithm of the open source compression program gzip, which is used as the compressor in the studies presented in this work, and aims at defining information-theoretic complexity in linguistic terms. To this end, gzip's lexicon—a line-by-line output of compressed text sequences—is retrieved for Alice's Adventures in Wonderland. The first section of this chapter gives a detailed description of the distribution of compressed strings in the lexicon of the original Alice text. Furthermore, every string is manually analysed and annotated for linguistic category (e.g. lexical words or phrasal constructions such as to see, looked anxiously) and non-linguistic category (e.g. random strings such as gree). Finally, the lexica of a syntactically and a morphologically distorted version of Alice are described, annotated and compared to the original Alice lexicon.

The results reveal that compression algorithms such as gzip do capture recurring linguistic (surface) structures such as suffixes, verbs or whole phrases. Needless to say, the algorithm does not systematically select or prefer linguistically meaningful units over random strings. In the light of these findings, Kolmogorov-based information-theoretic complexity as measured with lexicon-based algorithms is defined as a measure of structural surface redundancy. This chapter furthermore shows that the process of distortion affects the compressibility of texts as intended.
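The kind of recurring surface material that such a dictionary-based compressor records can be illustrated with a deliberately naive, LZ77-style sketch: at each position in a text it looks for the longest character sequence that has already occurred earlier, which is exactly the sort of material (suffixes, words, whole phrases, but also arbitrary fragments) that ends up in a compressor's inventory of strings. The toy scan below illustrates the principle only; it is neither gzip's actual DEFLATE implementation nor the lexicon retrieval procedure used in Chapter 5.

def repeated_strings(text: str, min_len: int = 3, window: int = 2000):
    """Naive LZ77-style scan: at each position, record the longest substring
    that already occurred in the preceding window. The returned strings are
    the recurring character sequences a dictionary-based compressor could
    encode as back-references."""
    matches = []
    i = 0
    while i < len(text):
        start = max(0, i - window)
        best = ""
        length = min_len
        # greedily extend the candidate match one character at a time
        while i + length <= len(text) and text[i:i + length] in text[start:i]:
            best = text[i:i + length]
            length += 1
        if best:
            matches.append(best)
            i += len(best)
        else:
            i += 1
    return matches

sample = ("alice was beginning to get very tired of sitting by her sister, "
          "and of having nothing to do: once or twice she had peeped into "
          "the book her sister was reading")
print(repeated_strings(sample))  # recurring strings such as ' of ', 'ing ', ' her sister'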

Chapter 6 presents two case studies in which the compression technique is


used to assess intra-linguistic complexity variation in the British National Corpus (BNC) and the International Corpus of Learner English (ICLE). On a methodological plane, this chapter is concerned with the applicability of the compression technique and the extent to which algorithmic measurements can approximate complexity in large-scale naturalistic corpora.

In the first case study, the complexity variation on the overall, morphological and syntactic tier in twenty different written registers of British English, as sampled in the BNC, is measured. I find that the registers analysed vary in their complexity such that less formal registers like email or letters are generally less complex than more formal registers such as newspapers. This ties in with the register variation along the involved-abstract dimension reported in Biber (1988) and, to some extent, in Szmrecsanyi (2009).

In the second case study, I assess the overall, morphological and syntactic complexity of essays written by students with different levels of instructional exposure to English and evaluate the influence of the learners' national background / mother tongue on the complexity of their text production. All other things being equal, the amount of instruction received in English is taken as a proxy for the proficiency of the learners. The results indicate that higher amounts of instructional exposure lead to increased overall and morphological complexity. The production of less advanced learners, in contrast, is marked by increased syntactic complexity. Furthermore, Kolmogorov measures of learner language systematically correlate with SLA measures of complexity. Some evidence is found which suggests that the complexity of learner essays in ICLE is influenced by the learners' mother tongue background but that, all in all, the relationship between learner essay complexity and instructional exposure is rather stable across different backgrounds. Further research is needed to clarify the relationship between national background and the complexity of learner language.

On the whole, this chapter provides empirical evidence for the applicability of the compression technique to naturalistic corpus resources, since the results are in line with what more traditional research reports.

Chapter 7 provides a short summary and discussion of the results considering the research questions introduced above. Specifically, I will evaluate the applicability of the compression technique and its significance for linguistic complexity research. I will conclude by pointing out advantages and drawbacks of the method.

2. Literature review and theoretical background

2.1. Linguistic Complexity

2.1.1. Defining linguistic complexity

Linguistic complexity is one of the most hotly debated notions in present-day linguistics. Some ingredients of the present debate can already be found in philosophical approaches to language which emerged among eighteenth- and nineteenth-century philosophers and scholars, including prominent figures like Herder and Humboldt. Even though a sense of the differing values of languages was already present during the Middle Ages—some languages, in particular Latin and Greek, were considered more appropriate than others—the concept of the superiority of some languages over others arose with the emergence of European nation states and the concomitant growing nationalism (Leavitt 2011: 16–19). By the eighteenth century, philosophers in France, Britain and Germany roughly distinguished languages along a scale from savage and crude to refined and sophisticated: “[...] one important tendency in the second half of the century was to define poles of language types. At one extreme were languages that were wilder, closer to savagery and nature [...]. At the other extreme were languages that were highly, even too refined [...]” (Leavitt 2011: 70). While these classifications of languages were rather a by-product of nationalist ideas and philosophies, they foreshadowed the evaluative judgements inherent in later classifications of languages.

A first scientific categorisation of languages was put forward by the brothers Friedrich and August von Schlegel. In his treatise ‘Über die Sprache und Weisheit der Indier’ (1808), Friedrich von Schlegel proposes a two-fold classification dividing languages into flective and affixive types (Schlegel 1808). Inherent in this classification is the evaluative judgement that inflectional (Indo-European) languages such as Greek are superior and are therefore to be preferred, whereas languages such as Chinese are inferior (Schlegel 1808: 44–59). In this spirit, von Schlegel writes about the Chinese language: “Die Sprache dieser sonst so verfeinerten Nation stuende also grade auf der untersten Stufe” (1808: 49). August von Schlegel later added a third category, “languages without grammatical structure” (Schlegel 1818: 14), i.e. isolating languages. Like his brother, he judges inflectional languages as superior and claims that isolating languages, due to their lack

of grammar, must pose a great obstacle to the intellectual development of peoples and their cultures.

Les langues [...] se divisent en trois classes: les langues sans aucune structure grammaticale, les langues qui emploient des affixes, et les langues à inflexions. Les langues de la première classe n’ont qu’une seule espèce de mots, incapables de recevoir aucun développement ni aucune modification. [...] Il n’y a dans ces langues ni déclinaisons, ni conjugaisons, ni mots dérivés, ni mots composés autrement que par simple juxta-position, et toute la syntaxe consiste à placer les élémens inflexibles du langage les uns à côté des autres. De telles langues doivent présenter de grands obstacles au développement des facultés intellectuelles [...] Je pense, cependant, qu’il faut assigner le premier rang aux langues à inflexions. (Schlegel 1818: 14–15)

In a similar vein, Wilhelm von Humboldt (1994, 1836) categorises languages on a scale from primitive to perfected, and claims that the language of a nation and its thought or ‘ideas’ are invariably connected, such that “[...] alles durch Rede Gewirkte aber immer ein zusammengesetztes Erzeugniss des Geistes und der Sprache ist. Jede Sprache muss in dem Sinne aufgefasst werden, in dem sie durch die Nation gebildet ist [...]” (von Humboldt 1994: 55). He thus ascribes differences between languages to the differing mental capacities of the nations who speak them.

Ueberall ist in den Sprachen das Wirken der Zeit mit dem Wirken der Nationaleigenthümlichkeit gepaart [...]. Auf diese Weise nun ist eine fortschreitende Entwicklung des Sprachvermögens, und zwar an sichren Zeichen, erkennbar, und in diesem Sinn kann man mit Fug und Recht von stufenartiger Verschiedenheit unter Sprachen reden. (von Humboldt 1994: 52–53)

Against this backdrop of (d)evaluative judgements about the complexity of languages, the hypothesis that all languages are of equal complexity sprang up and was commonly agreed on throughout the twentieth century (Akmajian et al. 1997; Bickerton 1995; Crystal 1987; Edwards 1994; Fortson 2004; Hockett 1958; O’Grady et al. 1997; Wells 1954). One of the first chapters in the Cambridge Encyclopedia of Language (Crystal 1987) is dedicated to the topic of language complexity, stating that all languages are overall considered to be of equal complexity. According to Crystal, there is no such thing as a ‘primitive’ language; all natural languages have equally complex grammars which have no limitations as regards their expressiveness.

It comes near to stating the obvious that all languages have developed to express the needs of their users, and that in a sense all languages


are equal. [. . . ] All languages have a complex grammar: there may be relative simplicity in one respect (e.g. no word-endings), but there seems always to be relative complexity in another (e.g. word-position).

(Crystal 1987: 6)

This statement further hints at another assumption going hand in hand with the equal-complexity hypothesis, namely that even though some languages appear to be simpler in one linguistic domain, this simplicity is inevitably compensated for in another domain. This trade-off relationship between sub-domains of a language, specifically morphology and syntax, is more explicitly formulated in the oft-times quoted passage by structuralist linguist Charles Hockett (1958).

Objective measurement is difficult, but impressionistically it would seem that the total grammatical complexity of any language, counting both morphology and syntax, is about the same as that of any other. This is not surprising, since all languages have about equally complex jobs to do, and what is not done morphologically has to be done syntactically. Fox, with a more complex morphology than English, thus ought to have somewhat simpler syntax; and this is the case. Thus one scale for the comparison of the grammatical systems of different languages is that of average degree of morphological complexity—carrying with it an inverse implication as to degree of syntactical complexity.

(Hockett 1958: 180–181)

While lacking empirical back-up, the long-standing assumption that all natural languages are of equal linguistic complexity—in current publications referred to as the equi-complexity dogma (Kusters 2003: 5), the principle of invariance of language complexity (Sampson 2009: 1) or, in short, ALEC (All Languages are Equally Complex) (Deutscher 2009: 234)—had remained unchallenged for much of the twentieth century. Whatever the reasons for the tenacity of this assertion may have been, the alleged truism has very recently been questioned and scrutinised (Kusters 2003; McWhorter 2001b; Shosted 2006). Most notably, it was criticised by McWhorter in a somewhat provocative article suggesting that creoles have the world’s simplest grammars (McWhorter 2001b), thereby kicking off a heated debate between the defenders of the equal-complexity hypothesis and its challengers. The major objections against the finding that, after all, some languages are simpler than others seem to be motivated by devaluative judgements about language complexity originally postulated by nineteenth-century scholars. These scholars adhered, as Szmrecsanyi & Kortmann (2012) put it succinctly, to the idea that “complex is beautiful, simple is retarded” (Szmrecsanyi & Kortmann 2012: 8). However, these are antiquated attitudes, and present-day linguistics is not interested in proving the value of one language over another. Kusters (2003) writes about his definition of complexity that


“no evaluative judgement is assumed: neither complexity nor simplicity are taken to be of a higher value” (Kusters 2003: 8). Thus both simple and complex languages “[...] fulfill all of the functional needs of human language” (McWhorter 2002: 36). Particularly since McWhorter’s (2001b) article and the ensuing debate in the special issue of Linguistic Typology 5(2/3), the notion of linguistic complexity has been investigated with renewed vigour, both in the typological and sociolinguistic camp (e.g. Dahl 2004; Kortmann & Szmrecsanyi 2012; Miestamo et al. 2008; Sampson et al. 2009) and in second language acquisition research (e.g. Larsen-Freeman 1978, 2006; Ortega 2003, 2012; Pallotti 2015). Most recently, linguistic complexity has also been investigated from a generative and psycholinguistic perspective (e.g. Culicover 2013; Newmeyer & Preston 2014; Järvikivi et al. 2014).

The current work contributes to the ongoing typological complexity debate and will therefore only review the relevant typological and sociolinguistic literature. This line of research is primarily motivated by the question “Are all languages equally complex?”, and is interested in linguistic complexity per se. A detailed review of second language acquisition research on complexity as well as generative and purely psycholinguistic studies is therefore outside the scope of this work. Suffice it to say that second language acquisition research is interested in linguistic complexity as a means for assessing aspects of second language acquisition, i.e. it aims to describe and assess second language performance, production and proficiency (Ortega 2012: 128). Consequently, the measures of complexity commonly used in second language acquisition studies differ considerably from the metrics of typologically motivated complexity research. The former mainly focus on syntactic measures, which are often considered indicators of overall language proficiency (Ortega 2003: 492). Commonly used metrics are, for example, the length of T-units, clauses and sentences, or the degree of clausal embedding and coordination, whereby a higher degree on any of these measures usually indicates a higher degree of (inter)language complexity (Ortega 2012: 127, 139). While recent formal and psycholinguistic approaches also aim at measuring linguistic complexity as such—rather than, say, benchmarking proficiency—they are predominantly concerned with cognitive aspects of linguistic complexity, i.e. they favour experimental set-ups and focus on the mental representation of, or the capacity to process, language (Newmeyer & Preston 2014; Järvikivi et al. 2014).1

The central issues in the typological complexity debate are (i) finding a generally applicable definition of what exactly linguistic complexity is, (ii) measuring this complexity, and (iii) explaining variation in linguistic complexity. Before tackling the issue of defining linguistic complexity, a few lines will be dedicated to the distinction between global and local complexity

1The structure of this section is based on Kortmann & Szmrecsanyi (2012) and provides an extension and critique of their review.


(Miestamo 2008: 29–32). As a heritage of the equal-complexity hypothesis, some studies (Juola 2008; McWhorter 2001b, 2008) aim at establishing metrics in order to measure the global / overall linguistic complexity of languages. Addressing overall complexity, however, is a somewhat complex task, and studies set to work on it are faced with two related problems: firstly, the matter of representativity, and secondly, the issue of comparability (Miestamo 2008). In order to measure overall linguistic complexity, a metric encompassing all linguistic levels of a language would need to be defined, and within each of these levels all grammatical aspects would need to be quantified. The scope of such a study would most likely surpass the workload manageable by any team of linguists. Therefore, assessing the overall complexity of a language, let alone a set of several languages, is a virtually impossible task. If we assume, for the sake of the argument, that a study of such an extent was manageable, we are still left with the problem of comparability across different languages. Comparability requires this overall metric to implement a categorisation of all aspects of grammar cross-linguistically and then subsume each aspect under one measure. However, even if all categories could be attested for in all languages in the sample, the question remains how the individual grammatical aspects should be weighted. As Deutscher puts it:

Since it is not possible to collapse the list of complexity measures [...] into one overall figure, no non-arbitrary single measure of overall complexity can be defined. The overall complexity of a language A can only be viewed as a vector (one-dimensional matrix) of separate values (A1...An), each representing the measure for one of the n subdomains. In set-theoretic terms, the n different subdomains will give n distinct total orders on the set of languages, and as these orders do not necessarily coincide, the result will only be a partial order on the set of languages. This means that it will not be possible to compare any two given languages in the set for overall complexity. (Deutscher 2009: 294–250)

All in all, measuring overall complexity is at best a can of worms or a “chase after a non-existent wild goose” (Deutscher 2009: 251). Therefore, most research focuses on measuring complexity in different sub-domains of language. So far the following levels of language have been subject to complexity analyses:

(i) Phonology. The size of phoneme inventories, the number of phonetic contrasts and vowel / consonant distinctions as well as the presence / absence of a tonal system are indicators for phonological complexity (e.g. Nichols 2009; Shosted 2006; Trudgill 2004).

(ii) Syntax. Syntactic complexity is often measured by counting the number of word order rules, or the degree of clausal embedding / subordination (e.g. Karlsson 2009; Sinnemäki 2008).


(iii) Morphology. Inflectional morphology, i.e. the number of inflectional markers / bound morphemes, but also the presence of homonymy, fusion and allomorphy are taken as indicators for complexity on the morphological level (e.g. Gil 2008; Kusters 2003, 2008; Szmrecsanyi & Kortmann 2009).

(iv) Semantics and lexicon. The number of syllables or monosyllabic words, but also the number of verbal derivations, root distinctions and complex lexical patterns such as compounding, reduplication or verb serializations are measures of semantic and lexical complexity (e.g. Fenk-Oczlon & Fenk 2008; Nichols 2009; Riddle 2008).

(v) Pragmatics. Pragmatic complexity is relatively rarely explored. Bisang (2009) also calls it ‘hidden complexity’ as it refers to, for instance, the lack of explicit marking or multi-functional structures which are considered as complex due to the fact that one structure expresses a wide range of different meanings (Bisang 2009).

Despite the differing terminology and definitions of complexity across individual studies, two major concepts of complexity can generally be distinguished: absolute complexity and relative complexity (Miestamo 2006, 2008, 2009). Absolute complexity is a theory-oriented notion of complexity and is understood as independent of and unrelated to a language user. It refers to the number of features / parts in a linguistic system or, in information-theoretic terms, to the length of the description of a given linguistic phenomenon (Dahl 2004). As such, absolute complexity can be considered ‘objective’ in the sense of its being concerned with the complexity inherent in the linguistic system of a given language and therefore autonomous of any agent. Relative complexity, on the other hand, is a ‘subjective’, agent-related kind of complexity. This approach defines complexity in terms of cost, processing or acquisition difficulty as experienced by and relative to a language user (i.e. speaker, hearer or learner) (Miestamo 2006, 2008). Mostly, this language user is a second language learner and, in fact, relative complexity is often equated with second language acquisition difficulty (Kusters 2003, 2008; Szmrecsanyi & Kortmann 2009; Trudgill 2001).

In the following paragraphs some popular notions of absolute and relative complexity will be presented. Although complexity metrics are as numerous as the publications on this topic, most metrics fall into one of the categories introduced below. I will further review some of the previous research on complexity which, essentially, established these metrics. A tabular survey of relevant typological and sociolinguistic complexity research as well as some second language acquisition approaches to complexity published since 2000 is given in Table 2.1 at the end of this chapter.2

2Note that, for reasons of relevance, formal approaches to complexity will not be reviewed.


Quantitative complexity. The motto of this metric is “more is more complex” (Arends 2001: 180). In other words, quantitative complexity is interested in the number of grammatical contrasts, markers or rules in a linguistic system. The more grammatical contrasts, markers or rules a given language possesses, the more complex this language is: “The guiding intuition is that an area of grammar is more complex than the same area in another language to the extent that it encompasses more overt distinctions and / or rules than another grammar [...]” (McWhorter 2001b: 135). Quantitative complexity is an absolute notion of complexity. Studies using quantitative complexity metrics are, for instance, McWhorter (2001b, 2012), where more marked elements in the phonemic inventory, more syntactic rules and more semantic distinctions are taken as indicators of more complexity. Specifically, McWhorter (2001b) established this metric to demonstrate that creoles, which are relatively ‘young’ languages, are grammatically less complex than older languages. In his typological approach to complexity, Shosted (2006) counts structural units of two linguistic domains, i.e. morphology and phonology, and quantifies their combinatory possibilities in order to measure the correlation between the two categories. While these studies refer to and quantify the actual number of features in a linguistic system, Dahl (2004) suggests measuring complexity in information-theoretic terms. The complexity of a given linguistic (sub)system or phenomenon would then be “the length of the shortest possible specification or description of it” (Dahl 2004: 21).

Information-theoretic complexity. This notion of complexity, which is by its very nature absolute, is a relative newcomer to the linguistic complexity debate. Although Dahl (2004) urged linguists to ground and define linguistic complexity in information-theoretic terms, this metric has so far received little attention from linguists, and the literature is not as extensive as for the more traditional complexity metrics. Information-theoretic complexity, i.e. Kolmogorov complexity, was first implemented by Patrick Juola (1998) and further explored by computer scientists and a handful of linguists (Bane 2008; Ehret & Szmrecsanyi 2016b; Juola 2008; Sadeniemi et al. 2008). It is an unsupervised, algorithmic measure of complexity which is largely based on the (ir)regularity and redundancy of linguistic surface structures. An in-depth discussion of this metric will be given in section 2.2.3

Irregularity-based complexity. This metric could be subsumed under the motto 'more irregular is more complex' and considers, for instance, irregularities in a language's phoneme inventory or the presence of more

3N.b. there are other information-theoretic notions of complexity, which are, however, not relevant to the current work as they are not Kolmogorov-based.

irregular markers such as morphological and / or derivational inflections as more complex:

Grammars differ in the degree to which they exhibit irregularity and suppletion. English’s small set of irregular plurals like children and people is exceeded by the vast amount of irregularity in German plural marking: a masculine noun may take -e for the plural, but almost equally will also take an umlaut [. . . ]. Then grammars differ in the degree to which they exhibit suppletion. Suppletion is moderate in English, especially evident in the verb ‘to be’ which distributes various Old English roots across person, number, and tense: am, are, is, was, were, been, be. But the Caucasian language Lezgian has no fewer than sixteen verbs that occur in suppletive forms. (McWhorter 2012: 246) Sometimes, the notion of (non-)transparency (Nichols 2013; Trudgill 2004) or opaqueness (M¨uhlh¨ausler1974) is mentioned in the same breath as irregularity-based complexity. Nichols (2013), for example, assessing the impact of geographical isolation on language complexity refers to non-transparency between surface structures and underlying forms, i.e. broadly speaking irregularities. Similarly, Trudgill (2004) analyses the relation between society type and linguistic structures. Specifically, he investigates how contact or isolation as well as com- munity size and structure impact on phoneme inventories in Polyne- sian languages (Trudgill 2004). He finds that simplicity in a language is increased if irregularities are regularised, i.e. are reduced (Trudgill 2004: 307) and argues that regularisation leads to an increase in trans- parency. It is beyond argument that more irregularities increase the complexity of a linguistic system and, thus might ultimately increase its non-transparency. However, the terms (non-)transparency and opaqueness in this context are problematic. While irregularities of a linguistic system are quantified, the literature implies that these irregularities are non-transparent / opaque structures and therefore difficult to process or learn. This means that irregularities are often evaluated in regard to a language user: “Imperfect learning, that is, leads to the removal of irregular and non-transparent forms which naturally cause problems of memory load for adult learners” (Trudgill 2004: 307). On the basis of this observed ‘confusion’ in the liter- ature, Szmrecsanyi & Kortmann (2012) argue that—partly being an absolute measure of complexity—irregularity-based complexity is a ‘hybrid notion’ due to the fact that irregularities as defined by the- ory are ultimately posing difficulty to a language user (Szmrecsanyi & Kortmann 2012: 12). In spite of this fact, I recommend to consider irregularity-based complexity as being primarily absolute as the irreg- ularities observed are inherent in and a property of a linguistic system


and thus independent of a language user. Needless to say, one should, of course, investigate whether and to which extent this notion as well as all the other absolute notions of complexity are related to language users (Dahl 2004: 40). Furthermore, I argue that transparency-based complexity should be regarded as a separate notion of complexity (see below).

Redundancy-based complexity. Redundancy-based or ornamental complex- ity is concerned with linguistic markers, forms or categories that have no apparent grammatical function and are dispensable to communi- cation and, therefore, redundant (McWhorter 2001b, 2012; Trudgill 1999): “There are many features commonly found in grammars which are a product of a gradual evolution of a sort which proceeded quite independently of communicative necessity, and must be adjudged hap- penstance accretion. [. . . ] Crucially, this added complexity emerges via chance, not necessity” (McWhorter 2001b: 129). Trudgill (1999), in a study on the functionality of , writes: We know that languages drag along with them a certain amount of, as it were, unnecessary historical baggage. [. . . ] at least in some languages, there is much more of this afunctional historical baggage than has sometimes been thought. For example, the presence of dif- ferent declensions for nominal forms and different conjugations for verbal forms in inflecting languages would appear to provide good evidence that languages can demonstrate large amounts of complex and non-functional differentiation which provide afunctionally large amounts of redundancy [. . . ]. (Trudgill 1999: 148) Apart from different conjugations or declensions, redundant features may also include, for instance, “evidential marking, ergativity, inalien- able possessive marking, and inherent reflexive marking” (McWhorter 2001b: 126). The presence of such redundant grammatical elements seems to lack functional motivation as they are not a necessary pre- requisite for successful communication and their presence can “pre- sumably, only be explained satisfactorily in historical terms” (Trudgill 1999: 148). Redundancy-based complexity is defined in relation to the communicative necessity and needs of a language user and should therefore be strictly categorised as an agent-related, relative notion of complexity.

Second language acquisition difficulty. This notion of complexity is an agent- related, ‘subjective’ and thus relative measure which defines complex- ity in terms of processing and acquisition difficulty as experienced by a (second) language learner: “My thinking was, and is, that ‘linguistic complexity’, although this [. . . ] is very hard to define or quantify, equates with ‘difficulty of learning for adults”’ (Trudgill 2001: 371).


Generally, grammatical structures and elements of a language or vari- ety which pose a difficulty to adult learners are considered complex (Kusters 2003, 2008; Szmrecsanyi & Kortmann 2009; Trudgill 2001). Kusters (2003, 2008) defines complexity in relation to an idealised lan- guage user, a ‘generalized outsider’ (Kusters 2008: 9). This outsider is an adult second language learner without any knowledge of the cul- tural or linguistic background of the target language who is primarily interested in the transmission of information:

This person speaks a first language, and is not familiar with the second language in question, nor with the customs and background knowledge of the speech community. He or she is primarily interested in using language for communicative purposes [. . . ]. We model this person as a generalized outsider, in order to prevent either facilitating or hampering influences of the first language on acquiring the second language. (Kusters 2008: 9) Thus, from the perspective of this outsider all linguistic phenomena are considered complex which (i) are more difficult to acquire for a second language learner than for a first language learner, (ii) are functionally highly language-specific, and (iii) are difficult to perceive and process (Kusters 2003: 6–7, Kusters 2008: 9–10).

Efficiency-based complexity. Another, less well-explored concept of relative complexity is efficiency-based complexity (Hawkins 2004, 2009) or processing difficulty (Seuren & Wekker 1986). Hawkins (2004) in- troduces a theory based on both language performance and grammar which measures complexity in relation to the communicative efficiency between a speaker and a hearer. He postulates that the most efficient communication, i.e. the communication which requires the least pro- cessing effort, is the least complex.

[. . . ] complexity is a function of the number of formal units and conventionally associated properties that need to be processed in domains relevant for their processing. Efficiency may therefore in- volve more or less complexity, depending on the proposition to be expressed and the minimum number of properties that must be sig- nalled in order to express it. Crucially, efficiency is an inherently relative notion that compares alternative form-property pairings for expressing the same proposition, and the (most) efficient one is the one that has the lowest overall complexity in on-line processing. (Hawkins 2004: 25) According to this metric, complexity is understood as the processing difficulty experienced by a language user. In a nutshell, more pro- cessing is more complex.


Transparency-based complexity. Transparency-based complexity is a relative notion of complexity and is also referred to as iconicity (Dammel & K¨urschner 2008; Steger & Schneider 2012). Essentially, it boils down to the one-form-one-meaning principle which states that a language is maximally transparent for a language user if one (structural) sur- face form corresponds to one underlying semantic form (Langacker 1977: 110; Slobin 1977: 176; Seuren & Wekker 1986). This notion to some extent conflates irregularity-based and efficiency-based complex- ity. In particular, the borders between irregularity-based complexity and transparency are not clear-cut in the literature, and mostly no distinction is made (Trudgill 2004; Szmrecsanyi & Kortmann 2012; Nichols 2013). Inherent linguistic irregularities are often considered as non-transparent while (non-)transparency should merely be viewed as the experience of a language user who is faced with irregularities in a linguistic system. In fact, the two notions could basically be regarded as two sides of the same coin, i.e. one side seen from a theoretical perspective and one side seen from a user perspective. Nonetheless, I propose to view transparency-based complexity as a separate notion of complexity because a substantial amount of research explicitly uses a transparency-based metric.

Steger & Schneider (2012), for instance, use a metric based on various types of iconicity whereby iconic / transparent structures are con- sidered less complex because they are easier to process than non- iconic structures. Specifically, they analyse complement clause con- structions in English L2 varieties and find that these varieties—due to language contact and adult second language acquisition—are com- paratively more iconic than British English (Steger & Schneider 2012: 187–188). A similar approach to complexity has been proposed by Seuren & Wekker (1986) who investigate the impact of semantic trans- parency on creole genesis and in relation to second language acquisi- tion. They identify three grammatical strategies which maximize the transparency between surface structures and semantic structures. One of these is the ‘simplicity’ strategy which “implies that the amount of processing needed to get from semantic analysis to surface structures, and vice versa, is kept to a minimum” (Seuren & Wekker 1986: 66).

From this abundance of complexity metrics and the controversy involved in the definitions of complexity, it can be seen that addressing linguistic complexity is not a simple task. To date, no commonly agreed-upon metric or methodology for assessing either overall or local complexity has been established.


2.1.2. Explaining linguistic complexity

Very much unlike scholarly attempts of the eighteenth and nineteenth centuries, modern linguistics no longer attributes linguistic complexity variation to the differing mental capacities of language users or the 'developmental stage' of their cultures. On the contrary, linguistic complexity variation is generally seen as the result of (an interplay of) culturally-historically and geographically conditioned extra-linguistic factors such as language contact or isolation and accompanying second language acquisition, as well as the maturity / age of a language, dialect or variety. These factors, on a more user-oriented level, are associated with a number of cognitive and communicative constraints such as (online-)processing or transfer. As a matter of fact, all of these factors are interrelated and could be seen as being nested within the three extra-linguistic core dimensions age, geography and history / culture. Figure 2.1 illustrates the layering of (some of the) main factors impacting on and explaining linguistic complexity variation.

Figure 2.1.: Layering of main interrelated factors impacting on and explaining complexity variation.

The major factor impacting on linguistic complexity is language contact and the resultant (adult) second language acquisition. Situations of lan- guage contact, specifically high-contact situations, in which a given language or variety is acquired by a large number of adult language learners encour- age and lead to simplification of this language (variety) (Szmrecsanyi & Kortmann 2009; Trudgill 2004, 2009b). Thus, in situations with a high-rate of second language acquisition, languages tend to shed a certain amount

of complexity: "[. . . ] communities involved in large amounts of language contact, to the extent that this contact between adolescents and adults who are beyond the critical threshold for language acquisition, are likely to demonstrate linguistic pidginisation, including simplification, as a result of imperfect language learning" (Trudgill 2004: 306). This point is nicely illustrated by pidgin and creole languages which, originating from situations of extreme language contact, exhibit far less complexity than their source languages (McWhorter 2001b; Trudgill 2004). Language contact, however, does not always lead to simplification but can, under certain circumstances, have quite the opposite effect, namely complexification. Heavy and continuous language contact involving a high rate of child bilingualism often leads to the complexification of the languages in contact (Nichols 1992, 2009; Trudgill 2004):

It can be concluded that contact among languages fosters complexity, or, put differently, diversity among neighboring languages fosters com- plexity in each of the languages. Residual zones [, i.e. zones where bi-or multilingualism are common,] are naturally areas of diversity, so we can expect languages in residual zones to exceed the averages of their continents in complexity.

(Nichols 1992: 193)

Trudgill sums the effects of the two types of contact up as follows: “So, long-term contact involving child bilingualism may produce large invent- ories through borrowing, and adult language contact may produce smaller inventories through imperfect learning, pidginisation, and simplification” (Trudgill 2004: 314). Although Trudgill (2004) explores the size of phon- eme inventories in Austronesia, this conclusion can be generalised to other areas of grammar. A closely related factor, so to speak the twin of language contact, is isolation. Languages in isolated, low-contact situations in which second language acquisition is rare, tend to retain and / or develop complexities (Trudgill 2004). Another factor related to isolation is the social structure of such isolated speech communities. The social structures within these isolated communities are often very tightly-knit, i.e. such communities share a lot of common (cultural) background, which allows them to leave certain structures unmarked on the one hand and to maintain a certain amount of linguistic conservatism on the other hand.

Small, isolated low-contact communities with tight social network struc- tures are more likely to be able to maintain linguistic norms and ensure the transmission of linguistic complexity from one generation to another. [. . . ] [These communities] will have large amounts of shared informa- tion in common and will therefore be able to tolerate lower degrees of linguistic redundancy of certain types.


(Trudgill 2004: 307) Nichols (2013) successfully approximates the degree of isolation by geo- graphical altitude and shows that while altitude per se does not predict lin- guistic complexity it determines sociolinguistic aspects such as community size, social network structure and contact which are important factors in- fluencing linguistic complexity. Kusters (2003, 2008) offers a more culturally-oriented model for the ex- planation of complexity variation which, however, essentially boils down to the factors contact and isolation. He defines two prototypes of speech communities: the first type are small communities with tight network struc- tures in which language functions as a cultural vehicle, i.e. language does not merely serve as a means of communication but is a means of creating and / or maintaining the identity of the community. Language in these com- munities is used to express and transmit identities, cultural and religious values, and therefore assumes a large shared background (Kusters 2008: 15). This type of community fits the low-contact, isolated type introduced by Trudgill (2004) and furthers or retains complexities. The second type of community can be large, and often the shared language is not the mother tongue of the speakers. In this type of community, language is primarily a tool of communication. Thus, there is no common cultural ground shared between the members of this community and no values or identities are attached to the language. This type is characterised by high language con- tact and therefore less likely to maintain or develop linguistic complexity (Kusters 2008: 14–15). In a similar vein, Lupyan & Dale (2010) (see also Dale & Lupyan 2012) distinguish between exoteric and esoteric languages; languages with a large number of speakers and languages with a comparatively smaller number of speakers. In exoteric communities language serves primarily as a com- municative tool, mostly between non-native speakers. They furthermore formulate this in a theoretical framework, the Linguistic Niche Hypothesis, to evaluate the interaction between language and social structure (Lupyan & Dale 2010: 2–3). Specfically, Lupyan & Dale (2010) predict the morpho- logical complexity of 2,236 languages together with demographic variables such as speaker population, geographic spread and the number of neighbour- ing languages using regression models. Languages with a larger number of speakers and higher degrees of language contact were found to use less mor- phological marking than languages spoken by smaller communities. Instead exoteric languages tend to encode semantic distinctions by lexical means (Lupyan & Dale 2010: 3,6). Be that as it may, however, their Linguistic Niche Hypothesis is merely a re-formulation of Trudgill’s (2004) distinction between high-contact and low-contact communities. Bentz & Winter (2013) adapt the Linguistic Niche Hypothesis to explore the relation between the number of second language learners / speakers, and nominal case marking in a set of 66 typologically stratified languages

by means of regression analysis. Their results confirm that adult second language acquisition leads to a reduction of morphosyntactic complexity, and they argue that languages not only adapt to the sociolinguistic situation of their speech communities but also to the cognitive constraints of their users (Bentz & Winter 2013: 5–7, 19). This is in accordance with a number of studies which explain linguistic complexity variation by cognitive mechanisms such as transfer or substrate influence resulting from second language acquisition (e.g. Huber 2012; Odlin 2012; Siegel 2012).

Finally, the age of a given language or variety has also been suggested to be a predictor of linguistic complexity (McWhorter 2001b). All other things being equal, the complexity of a language increases with increasing age. Even though both processes, simplification and complexification, occur in languages, complex, mature patterns habitually evolve (Dahl 2004) or accumulate (McWhorter 2001b; Trudgill 1999) over long periods of time: "The general conclusion was that in older grammars, millennia of grammaticalization and reanalysis have given overt expression to often quite arbitrary slices of semantic space, the results being a great deal of baroque accretion [. . . ]" (McWhorter 2001b: 126).

Table 2.1.: Overview of relevant empirical studies on complexity published since 2000, ordered by year of publication. For each study the linguistic domain, the type of complexity metric, the complexity explanans and the languages or sample size are given. Note that commentaries, reviews and purely theoretical papers not containing original research are not listed. Columns: Study, Domain, Metric, Explanans, Language / Sample (the studies surveyed range from Deutscher 2000 to Ehret & Szmrecsanyi 2016a, 2016b).

4Various languages are quoted but none are studied in detail and the list would be too long to present here. Therefore, suffice it to say that Dahl (2004) refers to a wide variety of languages such as English and Swedish but also Guarani or Mandarin Chinese.



2.2. Information theory

2.2.1. Information theory, Shannon entropy and Kolmogorov complexity

Information theory is "the science which deals with the concept 'information', its measurement and its applications" (van der Lubbe 1997: 1). In his landmark paper "A Mathematical Theory of Communication", Shannon (1948) defines communication in mathematical terms, thereby founding the field which has become known world-wide as information theory. More precisely, he analyses the information content between a message source and a listener and establishes the upper bounds for the efficiency with which messages can be transmitted along a channel (Shannon 1948). The information content of a message is measured in Shannon entropy, which quantifies the amount of uncertainty or choice, i.e. entropy, involved in the selection of a message. In order to shed light on the relation between information content and entropy, I will introduce the protagonists of Shannon's theory and outline, bit by bit, how he derives this measure of information. Thereafter, a related metric put forward by and named after the Russian mathematician Kolmogorov (1963, 1965) will be presented and its relevance to the methodology explored in the current work will be illustrated.

In the context of Shannon's theory, "communication" is defined as the transmission of a message from one point A to a point B through a communication system. In general, a communication system (see Figure 2.2 below) consists of an information source, point A, which produces a message to be sent to a point B, the destination. The message is sent through a medium—the channel—such as a telephone line. In order to send a message along a channel, the message needs to be transformed into some sort of signal. This is achieved by passing the message to a transmitter which, after having transformed the message into a suitable signal, sends it along a channel to a receiver. Subsequently, the receiver converts the received signal back to its original form and passes the message on to the intended destination (Shannon 1948: 2).

Shannon discusses three different systems of communication—discrete, continuous and mixed systems. For reasons of relevance, however, I shall only focus on one specific variation of the first type of communication system, namely, the discrete noiseless system. A discrete communication system is characterised by messages which are a sequence of discrete, i.e. one after the other, separately produced symbols (Shannon 1948: 3–5). For example, natural (written) languages could be considered such a discrete communication system as each message (word / sentence) consists of a sequence of individually produced symbols (letters). So far, this may sound simple; yet, the process of sending a message from one point to another is not a trivial issue.


This is particularly true considering that a given sender would like to send not just one specific message but would like to choose and send any one message (from a set of possible messages).

Figure 2.2.: Simplified schematic diagram of a communication system (information source → transmitter → signal → receiver → destination, with the message entering at the source and delivered at the destination). Adapted from Shannon (1948: 2).

The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. [. . . ] The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design [of the communication system].

(Shannon 1948: 1) In selecting and sending a message via a discrete communication system as described above, information is produced and transmitted along the channel. It is important to note that, in the framework of this theory, the term "information" does not refer to the meaning of an individual message. Rather, information refers to the communicative situation as a whole, i.e. it refers to the amount of freedom in the selection / choice of one message over another from a possible set of messages.

In fact, two messages, one of which is heavily loaded with meaning and the other of which is pure nonsense, can be exactly equivalent, from the present viewpoint as regards information. [. . . ]To be sure, this word information in communication theory relates not so much as to what you do say, as to what you could say. That is, information is a measure of one’s freedom of choice when one selects a message.

(Weaver 1959: 99–100) In order to quantify the amount of information which is produced in selecting and sending a message, Shannon derives a measure of information or entropy based on the probabilities of each possible message. A necessary postulate for the development of this quantitative measure of information

is that the set of messages / symbols transmitted in the outlined communication system is finite. In other words, the set of possible messages is restricted to a certain number and this number is known to both the sender and receiver. In terms of language or, as a concrete example, English, a writer (the sender) and a reader (the receiver) both know that the possible set of messages in their communication consists of the 26 letters of the Latin alphabet and not, say, of dots and dashes. The set of messages needs to be finite because "if the number of messages in the set is finite then this number or any monotonic function of this number can be regarded as a measure of information produced when one message is chosen from the set, all choices being equally likely" (Shannon 1948: 1). Shannon chooses a logarithmic base for this measure as the logarithm is—for a variety of mathematical reasons—the most suitable function and many parameters in the physical and engineering sciences such as bandwidth, for instance, are based on a logarithmic scale (Shannon 1948: 1).

Furthermore, each of these messages occurs with a certain probability. In the most basic case, the occurrence of each possible message is equally likely. Thus, "we have a set of possible events [or messages or symbols] whose probabilities of occurrence are p1, p2, . . . , pn. These probabilities are known but that is all we know concerning which event will occur" (Shannon 1948: 10). The entropy H is the negative sum of these probabilities weighted by their logarithms:

H = − ∑_{i=1}^{n} p_i log p_i

H expresses how likely or predictable a message is. If it is certain which message will be chosen, i.e. the probability of all messages but one is zero, the entropy equals zero (H = 0). This means that, as the outcome is known, no information is transmitted. In all other cases the entropy has a positive value (H > 0). If all messages are equally likely to be chosen, the entropy H is maximal. This means any outcome is informative as it is not certain (Shannon 1948: 10–11). Thus, the information content of a message is directly related to its probability of occurrence, or its unpredictability: a message is informative if it is not known in advance, predictable or expected but conveys something surprising / new. In a nutshell, Shannon entropy calculates the information contained in a message in relation to its unpredictability.

As an illustration of how the entropy / information is calculated, consider an event x with two possibilities of choice, i.e. if there are, for instance, two possible messages in a set. Each of these possibilities has a probability which is known: if the first choice has a probability of p, then the second has the probability q which is 1 − p.5 The entropy or information of this event x

5The probability of q must be 1 − p as in probability theory, the sum of the probability of all events / possibilities, or in this case messages, is always defined as 1.

is then the sum of the logarithms of p and q: H(x) = −(p log p + q log q). In Figure 2.3, which illustrates this example event, we see that the entropy is zero if the probability p is either zero or one; if p = 0, q occurs with certainty and thus there is zero entropy. Likewise, there is no uncertainty / entropy if p = 1 as we know that choice p occurs. The entropy reaches its maximum value if both choices p and q are equally likely, i.e. both choices have a probability of 0.5.

Figure 2.3.: Entropy / information of a two-choice situation. Abscissa indexes the probability of the event p, ordinate the amount of entropy in bits.
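To make the calculation behind Figure 2.3 concrete, the following minimal Python sketch computes the entropy of such a two-choice situation; the function name and the probability values chosen for the printout are my own illustrative assumptions and not part of Shannon's exposition.

import math

def binary_entropy(p):
    # Shannon entropy H(x) = -(p log2 p + q log2 q) of a two-choice event, in bits.
    q = 1 - p
    # By convention 0 * log 0 = 0: a choice that cannot occur adds no uncertainty.
    return -sum(x * math.log2(x) for x in (p, q) if x > 0)

# Entropy is zero at the extremes and maximal (1 bit) at p = 0.5,
# mirroring the curve in Figure 2.3.
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    print(f"p = {p:.2f} -> H = {binary_entropy(p):.3f} bits")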

Even though Shannon's theory has found application in modern data encoding and compression—the idea is that events with high probabilities have shorter encodings than events with low probabilities (Shannon 1948: 17; see also MacKay 2003; Li & Vitányi 1997)—Shannon entropy is an "ensemble notion" (Li & Vitányi 1997: 65), i.e. information is always measured in relation to a set of possibilities and their probabilities. As such, however, it cannot measure the information content of an individual message

independent of the probabilities in the set. Thus, in order to measure the information content of individual objects, "information" needs to be defined differently. Precisely, information needs to be defined in absolute terms, i.e. it must refer to the information inherent in the object alone and not depend on external factors.

The most natural approach to defining the quantity of information is clearly to define it in relation to the individual object (be it Homer's Odyssey or a particular type of dodo) rather than in relation to a set of objects from which the individual object may be selected. To do so, one could define the quantity of information in an object in terms of the number of bits required to describe it. A description of an object is evidently only useful if we can reconstruct the object from this description. We aim at something different from C.E. Shannon's theory of communication [. . . ]. Our task is to widen the limited set of alternatives until it is universal. We aim at a notion of 'absolute' information of individual objects, that is the information that by itself describes the object completely. (Li & Vitányi 1997: 93)

In short, the information content of an object is defined as the description of the object from which it can be reconstructed. Yet an object can have several descriptions and not every complete description of an object can be considered to be a measure of its information content or complexity. While all of these descriptions may be complete, they may vary in their length such that some descriptions may be longer and some shorter. Intuition dictates that longer descriptions are considered more complex than shorter ones. The idea is therefore that "from all descriptions of an object we can take the length of the shortest description as a measure of the object's complexity. It is natural to call an object 'simple' if it has at least one short description, and to call it 'complex' if all of its descriptions are long" (Li & Vitányi 1997: 1).

Kolmogorov complexity, also known as descriptive complexity, algorithmic complexity or algorithmic information content, is based on exactly this assumption: it measures the information content or complexity of an object as the length of the shortest possible description of this object. This quantity is absolute, i.e. it is a property of the object alone and does not depend on, for instance, the probabilities in a set of messages (Li & Vitányi 1997: 48). Technically speaking, the complexity K of a given object x is measured by the length of the shortest description |d| of x which is required to (re)generate x:

K(x) = |d(x)|

For illustrative purposes, assume that we have two strings of symbols (see example (1) below) which are the objects whose complexity we want to

measure. Both strings consist of the same number of symbols, yet string (1-a) can be compressed to the expression 5×cd counting four symbols whereas the shortest description of string (1-b) is the string itself. Measuring the complexity of string (1-a) and string (1-b) according to the length of their shortest possible description, string (1-a) is obviously less complex than string (1-b).

(1) a. cdcdcdcdcd (10 symbols) → 5×cd (4 symbols)
b. c4gh39aby7 (10 symbols) → c4gh39aby7 (10 symbols)

In what has become known as the Invariance Theorem, Kolmogorov (1965) has shown that his measure of complexity is independent of the description method used. The difference between the length of two descriptions d1 and d2 which are produced by two different specification methods / description languages D1 and D2 is bounded independently of the input. This is another way of saying that the difference in length between descriptions of different specification methods is negligible and the complexity of an object x under any specification method is therefore invariant (Li & Vitányi 1997: 96–97; 100–101). Moreover, Kolmogorov complexity can be shown to be a suitable measure of information as it is quantitatively related to Shannon's classic notion of information. In fact, Kolmogorov complexity is asymptotically equal to Shannon entropy (Li & Vitányi 1997: 522–525).

For mathematically non-trivial reasons which are related to the Halting Problem, Kolmogorov complexity cannot be effectively calculated (Kolmogorov 1965; Li & Vitányi 1997). However, it can be approximated and its upper bounds computed by adaptive entropy estimation methods which are employed by modern file compression programs. In fact, "the Kolmogorov complexity of a file is essentially the length of the ultimate compressed version of the file" (Li et al. 2004: 3252; see also Ziv & Lempel 1977).

To sum up, Shannon's groundbreaking theory of communication has established the field of information theory and introduced a first quantitative measure of information, Shannon entropy. A related measure of information is Kolmogorov complexity which permits, in contrast to Shannon entropy, the independent measurement of the complexity of individual objects. Kolmogorov complexity, while itself uncomputable, can be approximated by compression algorithms as implemented in modern file compression programs.
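By way of illustration only (this sketch is not drawn from the studies cited above), the following Python lines use the Lempel-Ziv-based zlib library to obtain such an upper-bound estimate of Kolmogorov complexity; the scaled-up analogues of the strings in example (1) are my own assumption, chosen so that the compressor's constant overhead becomes negligible.

import random
import string
import zlib

def approx_kolmogorov(text):
    # Upper-bound estimate of K(text): the length in bytes of its zlib-compressed form.
    return len(zlib.compress(text.encode("utf-8"), 9))

# Scaled-up analogues of example (1): a highly regular string versus a
# pseudo-random one, both 10,000 symbols long.
regular = "cd" * 5000
random_like = "".join(random.choices(string.ascii_lowercase + string.digits, k=10000))

print(approx_kolmogorov(regular))      # a few dozen bytes: the repetition compresses away
print(approx_kolmogorov(random_like))  # several thousand bytes: little redundancy to exploit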

2.2.2. Information-theoretic complexity

In Section 2.1 various linguistic complexity metrics have been introduced, among them the largely unexplored notion of information-theoretic complexity. Information-theoretic complexity has its roots, as the name suggests, in information theory. More precisely, it is based on and approximates the non-computable notion of Kolmogorov complexity. Yet, there is

no uniform expression of information-theoretic complexity. In fact, there is a variety of different, Kolmogorov-based metrics and methodologies. In the following I will present several of these measures and conclude with giving a concise definition of what—in the context of this work—information-theoretic complexity is. For the sake of completeness I note that there are a few non-Kolmogorov-based, information-theoretic measures in the complexity literature (e.g. Moscoso del Prado Martin et al. 2004; Moscoso del Prado Martin 2011). These measures are, however, not relevant to the current work, which focuses exclusively on Kolmogorov-based metrics of complexity, and will be only briefly addressed for illustrative purposes.

Moscoso del Prado Martin (2011) adapts an information-theoretic metric known as effective complexity to assess the relation between functional syntactic / semantic information and the morphological, viz. inflectional, complexity of six European languages, drawing on data from the Europarl corpus, a collection of transcripts from sessions of the European Parliament. Effective complexity is based on Shannon entropy and, in simplified terms, measures the complexity of an object as the length of a short description of the object's regularities (Gell-Mann & Lloyd 1996: 49). Furthermore, effective complexity, as it is based on Shannon entropy, is a probabilistic notion (for details see Section 2.2.1 above). In other words, it considers the probability and distribution of these regularities in relation to a set of objects, and not just in the individual object itself (Gell-Mann & Lloyd 1996: 48–49). Be that as it may, Moscoso del Prado Martin (2011) finds that, firstly, the presence of inflections reduces the overall complexity of a language. Secondly, the presence (or absence) of word order information impacts on the morphological complexity in the analysed language samples: when word order was left intact the difference in morphological complexity between the six samples was negligible. When word order was randomized, on the other hand, the samples differed in their morphological complexity such that English ranked low in morphological complexity while the Romance languages (French, Italian, Spanish) ranked high (Moscoso del Prado Martin 2011: 3528).

A second notion of information-theoretic complexity which is also based on Shannon entropy is explored in Moscoso del Prado Martin et al. (2004). In order to predict response latencies in the morphological processing of Dutch words as sampled in the CELEX Lexical Database, the informational complexity of each word in the dataset is measured in terms of Shannon entropy. Specifically, the measure assumes that the surface frequency of a given word is more or less equal to the amount of information necessary to encode this word in a lexicon (Moscoso del Prado Martin et al. 2004: 2, 7). Multiple regression models are used to predict the correlation between response latencies and the information residual of a word vis-à-vis more traditional type-token counts (Moscoso del Prado Martin et al. 2004). The results show that the information residual of a word is a reliable predictor

for the morphological processing cost involved in the recognition of Dutch words (Moscoso del Prado Martin et al. 2004: 22).

Let us now turn to information-theoretic measures which are based on Kolmogorov complexity. Kolmogorov-based measurements of linguistic complexity were first proposed by Juola (1998) who successfully demonstrated how algorithms which were originally designed for data compression can be used to measure linguistic complexity. These programs work on the assumption that (text) strings always exhibit—at least to some extent—structural regularities as well as redundancies which can be reduced, i.e. compressed. Algorithms of the Lempel-Ziv family such as, for instance, gzip compress new text strings on the basis of previously seen and "memorised" strings, making use of this structural redundancy. Technically speaking, they "[. . . ] employ the concept of encoding future segments of the source-output [for example a given text string] via maximum-length copying from a buffer containing the recent past output" (Ziv & Lempel 1977: 337). Simply put, the program "loads" a certain amount of text and "stores" it in a temporary "lexicon". On the basis of this lexicon of memorised strings it can "recognise" newly encountered text (sub)strings and compress them by eliminating redundancy.6 The amount of information thus measured in a given text string is essentially a measure of the (structural) redundancy in this string.

Juola (1998) measures morphological complexity—defined as the information in a text which is expressed through inflectional endings—by using the open-source compression program gzip. The idea is to measure linguistic complexity via the information content in text samples where a higher amount of information is taken to equal higher linguistic complexity of the respective language sample (Juola 1998). For this purpose he uses translations of the same text—here the entire Bible in Maori, Russian, English, Dutch, French and Finnish—assuming that the informativeness of all texts across these different languages should be roughly the same if all languages are equally complex (Juola 1998: 209). Morphological complexity in these text samples is addressed by altering the information at the morphological level, i.e. the regularity of inflectional endings, prior to measuring. The suffix –ing in English, for example, is considered a morphological regularity as it mostly signals a present participle and is likely to be preceded by strings such as we are, he is etc. By changing these regularities through numeric type-substitution, the prediction of particular word forms on the basis of other word forms is rendered more difficult, i.e. the text becomes less compressible. Simply put, each token of a word type is replaced by a random number so that morphological structures are no longer recognisable. For instance, if singing and eating are replaced by 14 and 3258 the

6Strictly speaking, compression is achieved by back-referencing redundant (sub)strings with length-distance pairs, i.e. the length of the copied sequence and the distance (in the buffer) to the previous identical sequence (Ziv & Lempel 1977: 337).

tokens are no longer identifiable as present participles and hence the tokens no longer exhibit any morphological regularity. This means that the morphological information and hence the morphological complexity is increased (Juola 1998: 209–210). Juola (1998) argues that such a degradation process and resulting increase in complexity should have a stronger effect on the compressibility of less morphologically complex languages than on languages with an already complex morphological system due to the fact that the prediction of word forms in the latter is already difficult. The comparison of the file sizes of the original unaltered samples with the degraded samples yields compression ratios which approximate the morphological complexity of the given sample (Juola 1998: 209–210). Returning to the Bible texts mentioned above, the complexity hierarchy for the morphological level according to their compression ratios is (in increasing order of morphological complexity) Maori, English, Dutch, French, Russian, Finnish (Juola 1998: 211). Part of this ranking can be confirmed and is in line with previous studies—Nichols (1992), for example, also finds that English is less morphologically complex than French and Russian—lending more credibility to the compression method.

Instead of substituting morphological regularities, Juola (2008) achieves distortion by random deletion. In this experiment 24 Bible versions, among them nine English translations, are analysed. At the morphological level distortion is achieved by random deletion of 10% of the characters. At the syntactic level 10% of all word tokens are deleted, thus destroying syntactic relations (Juola 2008: 101). The rationale is that the compressibility of syntactically simple languages, i.e. languages with free word order, will not be as badly affected as the compressibility of syntactically complex languages, i.e. languages with relatively fixed word order, as the former lack complex inter-word dependencies. The results of this experiment tie in neatly with Juola's previous findings (1998) and indicate a trade-off between complexity at the morphological and syntactic level, i.e. morphologically complex languages exhibit low syntactic complexity whereas syntactically complex languages exhibit low morphological complexity (Juola 2008: 104).

The compression technique, i.e. measuring linguistic complexity by using file compression programs, as introduced by Juola (1998), was further explored in two papers which both analyse the 21 official languages of the European Union (Kettunen et al. 2006; Sadeniemi et al. 2008). Both papers draw on a corpus consisting of translations / texts of the European Union Constitution in 21 languages: Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Slovak, Slovenian, Spanish and Swedish (Kettunen et al. 2006; Sadeniemi et al. 2008). The data is modified at the morphological level, similar to Juola (1998), by numeric type-substitution. The syntactic level is manipulated by random shuffling of the word order within each sentence, thereby maintaining the order of the

sentences as a whole (Kettunen et al. 2006: 101; Sadeniemi et al. 2008: 193–194). In contrast to Juola (1998), the open source compression program bzip2 is used to compress the text samples. Subsequently, the complexity of the samples is analysed on the basis of the compressed file sizes of the original and manipulated samples. On the whole, the results are as expected and dovetail with previous findings. Minor differences in the complexity order are observable, mainly among members of the Baltic and Slavic languages (Kettunen et al. 2006: 105–107; Sadeniemi et al. 2008: 194–200).

Ehret (2014) adapts the Juola-style compression technique (Juola 1998) to measure the complexity of linguistic features in a detailed fashion. Specifically, the paper explores the contribution of inflectional morphs (for example –ing or –ed) and functional constructions (such as progressive or passive) to the morphological and syntactic complexity of a mixed-genre corpus consisting of Alice's Adventures in Wonderland, the Gospel of Mark and newspaper texts. The results show that the presence of more marker types (morphs) increases the morphological complexity in the corpus while the syntactic complexity is reduced. Furthermore it is demonstrated that the morphological ranking of the morphs analysed coincides with the order of morpheme acquisition reported in second language acquisition research (Ehret 2014: 55–58). More generally, the paper fills a gap in algorithmic complexity research by showing how compression algorithms can be utilised to address morphology and syntax in a very detailed manner—previous research in this line has mainly focused on measuring complexity in a more global manner (Ehret 2014: 63–64).

Further explorations of Kolmogorov measurements include a paper by Li et al. (2004) who propose a universal, unsupervised distance metric measuring the (dis)similarity between two objects using general standard algorithms for compression. Their measure is based on the pairwise comparison of Kolmogorov complexity measurements. Specifically, the distance between two objects x and y is calculated by taking the ratio of the compressed length of y if x serves as auxiliary input (i.e. x comes first in a joined file consisting of x and y) and the compressed length of y (Li et al. 2004: 3254). It is important to note that the total length of the objects (files) compared does not exceed the size of the compression algorithm's buffer. In order to illustrate the universality of the metric, Li et al. (2004) present two case studies from two unrelated fields of research: firstly, they compute a mitochondrial phylogeny tree tracing the evolutionary history of mitochondria and, secondly, they construct a language family tree for 52 Euro-Asian languages. For the linguistic case study they draw on a parallel corpus of the Universal Declaration of Human Rights and use gzip as compressor. On the basis of their similarity metric a distance matrix is calculated which, in turn, serves as input for the language tree construction. The resulting language tree successfully distinguishes between the

main language families (Romance, Celtic, Germanic, Finno-Ugric, Slavic, Baltic and Altaic). Minor misclassifications on the micro level—English, for instance, is classified as a Romance language—can be explained by the latinised vocabulary of the English version of the Universal Declaration of Human Rights on the one hand and by the fact that contemporary corpora / languages exhibit linguistic traits which are not inherited but stem from language contact (Li et al. 2004: 3260). Still, the similarity metric yields convincing results. Cilibrasi & Vitányi (2005) further develop this metric using a novel hierarchical clustering method for the calculation of their phylogenetic trees and deliver convincing results demonstrating the universality of their metric in two ways. On a methodological level, they expand their study by using not only dictionary compressors (like gzip) but also statistical and block-sorting compressors (like PPMZ and bzip2), and by applying the metric to numerous diverse areas such as virology, music, literature or astronomy (Cilibrasi & Vitányi 2005: 1).

Kettunen et al. (2006) and Sadeniemi et al. (2008), inspired by Cilibrasi & Vitányi (2005), measure the difference / similarity in complexity between two languages L1 and L2. The assumption is relatively straightforward: if two languages are similar the compression algorithm should manage the transition in a concatenated text file from L1 to L2 well and the sample should compress accordingly. If, on the other hand, two languages are very different, compressibility should suffer, i.e. be comparatively worse. However, this approach seems to be problematic and the results are not particularly intuitive: most of the Romance languages cluster together but Greek and Estonian should not cluster with Slavic languages. Neither should Hungarian and Maltese be next to French (Kettunen et al. 2006: 107–108).

A third, rather different, approach to measure linguistic complexity by approximating Kolmogorov complexity was proposed by Bane (2008). He is interested in the grammar of a language and uses a software package (Linguistica) that constructs models to best predict the data. The idea is to measure the complexity of a grammar, in this case on the morphological level, by finding the shortest possible description of the grammatical information that allows the construction of a language's morphology (Bane 2008: 71–72). In detail, the software describes morphological models on the basis of stems, affixes and "signatures" which give a specification of the possible affixes a given stem can take. To illustrate, the signature for the English verb wait would contain the possible affixes that can attach to the stem wait, i.e. Ø, ing, s and ed. The simplest model which accurately describes the input corpus is taken as the smallest total description length of the grammar of the language to approximate the complexity of this grammar. Grammars with a simple morphology should have many stems but few affixes and signatures, while morphologically complex grammars have more affixes and signatures than stems. The morphological complexity of a grammar is therefore calculated by dividing the description length

41 Literature review and theoretical background of affixes and stems by the total description length of the model (Bane 2008: 72–73). This method approaches Kolmogorov complexity rather in- directly and relies on a—even if machine-generated—categorisation of gram- matical features, namely stems and affixes as well as their distributional details. For this reason, the method is, unlike the Kolmogorov measure- ments presented above, not entirely objective. In a case study, the Bible in twenty languages, among them creoles and pidgins (Danish, Dutch, Bislama, English, French, German, Haitian Creole, Hungarian, Icelandic, Italian, Kituba, Latin, Maori, Nigerian Pidgin, Papiementu, Solomon Pijin, Span- ish, Swedish, Tok Pisin, and Vietnamese) is analysed. Creoles and pidgins, together with Vietnamese exhibit low morphological complexity while Latin and Hungarian exhibit the highest morphological complexity (Bane 2008: 73). Compared to the type-token ratio, morphological complexity in the sample languages increases with an increasing number of types and decreases with a decreasing number of tokens. All in all, Bane’s (2008) hierarchy is in line with what one would expect and ties in with other research (e.g. Juola (1998)). In the context of this work, information-theoretic complexity is defined as a Kolmogorov-based, unsupervised, algorithmic measure of complexity which is approximated by utilising compression algorithms of the Lempel- Ziv family. Due to the nature of these compression programs, information- theoretic complexity is largely based on structural (ir)regularity and redun- dancy. In the following chapters information-theoretic complexity will be further explored and defined in linguistic terms by applying and expand- ing the Juola-style compression methodology. Furthermore, the workings of compression algorithms will be subjected to an in-depth analysis.

3. Experimenting with the compression technique1

This chapter is dedicated to the validation and extension of the Juola-style compression technique and presents several experiments in which the reach and limits of the methodology are explored. Applying the compression technique to different data types, I demonstrate that Kolmogorov-based complexity measurements yield linguistically interpretable results, because they provide complexity rankings that match our intuitions—e.g. West Saxon should be morphologically more complex than Present-day English—and are in line with what more orthodox complexity notions would lead one to expect. For instance, Bakker (1998) measures syntactic complexity in terms of word order flexibility and finds that English is less flexible than Finnish. Furthermore, I will show that the method is not restricted to parallel text databases but also works on semi-parallel and non-parallel corpus data.

3.1. Parallel texts

3.1.1. Method and data

The use of file compression programs for measuring linguistic complexity has to date been limited to parallel text corpora, i.e. translational equivalents of the same text in different languages. Such parallel text databases—originally used in historical linguistic studies and more recently in computational approaches to machine translation—have become quite popular in typological research (Auwera et al. 2005; Cysouw & Wälchli 2007; Dahl 2004). Parallel text databases, while still being usage-based, facilitate comparability across different languages and language varieties due to the fact that differences in propositional content can be ruled out: "Direct comparability of concrete examples across languages is a strong point of the parallel text method. In the ideal case the same domains, instantiated in the same examples, are represented in the same textual environment with the same degree of emphasis in the same register" (Wälchli 2007: 131–132). Furthermore, analyses based on parallel texts permit the generalisation of the results beyond the individual texts or documents studied.

1A partial summary of this chapter has appeared as Ehret & Szmrecsanyi (2016b).

This is another way of saying that the complexity of languages as a whole can be inferred from the complexity of a parallel sample of these languages. The reason for this is that the complexity measured in any given text sample or document, be it Harry Potter or Hamlet, should be the same if all languages were equally complex, i.e. all languages should use the same amount of complexity to convey the same meaning (propositional content). Any observed variation in the complexity of parallel text samples, then, should be attributable to the differing levels of complexity of the languages as the propositional content is the same.

The classic database for parallel text studies is, due to its availability in a vast number of languages, the Bible (Cysouw et al. 2007; Dahl 2007; de Vries 2007). In this vein, I set the stage by applying the compression technique to the Gospel of Mark in several historical varieties of English and six other languages listed below.

English varieties:

– West Saxon (approx. 10th century [from Bright 1905])

– Wycliffe’s Bible (14th century [1395])

– The Douay-Rheims Bible (16th century [1582])

– The King James Version (17th century [1611])

– Webster’s Revision (19th century [1833])

– Young’s Literal Translation (19th century [1862])

– The Darby Bible (19th century [1867])

– The American Standard Version (20th century [1901])

– The Bible in Basic English (20th century [1941]), using mostly 850 Basic English words and simplified grammar (Ogden 1934, 1942)

– The English Standard Version (21st century [2001])

Other languages:

– Esperanto (Esperanto Londona Biblio, 20th century [1926])

– Finnish (Pyhä Raamattu, 20th century [1992])

– French (Ostervald, 20th century [1996 revision])

– German (Schlachter, “Miniaturbibel”, 20th century [1951 revision])

– Hungarian (Vizsoly Bible [a.k.a. Károli Bible], 16th century)


– Latin (Vulgata Clementina, 4th century)

Table 3.1 lists the number of words and sentences for each version of the Gospel of Mark. Note that, generally, corpus size is not a crucial factor when working with parallel text databases because, as mentioned above, the propositional content of the components of such corpora is identical. Yet, even with parallel text databases, a minimum corpus size is required: while the Gospel of Mark (counting 14,370 words, English Standard Version) is sufficiently large, the Lord’s Prayer (counting 52 words, Matthew 6: 5–14, English Standard Version), for instance, is not.2

Table 3.1.: Number of words and sentences in the Gospel of Mark.

Bible version         Words      Sentences

English varieties:
  American Standard   15,043     874
  Basic English       16,461     868
  Darby               15,185     898
  Douay Rheims        15,036     909
  English Standard    14,402     826
  King James          15,186     703
  Websters            15,232     853
  West Saxon          16,014     928
  Wycliffe            19,172     843
  Young's Literal     15,524     947

Other varieties:
  Esperanto           13,045     876
  Finnish             10,507     1,040
  French              15,712     867
  German              14,114     867
  Hungarian           11,779     844
  Latin               10,545     877

Total                 232,957    14,020

Each of these Bible texts was saved in a separate text file and all formatting and punctuation including (verse) numbers were removed in R.3

2Precursory experiments with the Lord’s Prayer and the compression technique had to be aborted due to data sparsity. 3R 2.14.0 (R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0).

The early versions of the Gospel of Mark were furthermore manually screened for orthographic variance and, where applicable, variants of the same form were removed by replacing them with the most frequent variant. In the West Saxon text, for instance, broDer spelled with an "e" occurs once whereas broDor with an "o" occurs 16 times in the original text. Thus, broDer, being the less frequent form, was replaced by the more frequent broDor. In total 101 pairs or, in some cases, triplets were corrected in the West Saxon text. In the King James version, Webster's and Wycliffe, only one pair was replaced, whereas two pairs were replaced in Young's Literal translation (for a complete list of the variants see Table 3.2). This procedure ensures that the presence of more than one spelling version for the same form does not cause higher incompressibility and thus increased morphological complexity in the subsequent measurement.
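To make the preprocessing concrete, a minimal R sketch of the cleaning and variant normalisation described above is given below. The function name, file names and the variant list are illustrative placeholders; the scripts actually used in this work are reproduced in Appendices B and D.

# Minimal preprocessing sketch (illustrative; the scripts actually used
# are reproduced in Appendices B and D).
normalise_text <- function(infile, outfile, variants = NULL) {
  txt <- paste(readLines(infile, encoding = "UTF-8", warn = FALSE), collapse = " ")
  txt <- gsub("[[:digit:]]+", " ", txt)    # remove verse and chapter numbers
  txt <- gsub("[[:punct:]]+", " ", txt)    # remove punctuation
  txt <- gsub("[[:space:]]+", " ", txt)    # collapse whitespace
  if (!is.null(variants)) {                # replace minority spelling variants
    for (v in names(variants)) {
      txt <- gsub(paste0("\\b", v, "\\b"), variants[[v]], txt)
    }
  }
  writeLines(txt, outfile)
}

# hypothetical call: replace the less frequent broDer by the majority form broDor
# normalise_text("mark_westsaxon_raw.txt", "mark_westsaxon.txt",
#                variants = c(broDer = "broDor"))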

Table 3.2.: Spelling variance in the Gospel of Mark.

Bible version Spelling variants King James oft — often

Webster’s shouldest — shouldst

Wycliffe borne — born

Young’s Literal honor — honour sware — swear

West Saxon adrifD — adrifþ afyrD — afyrþ alyfD — alyfþ anddetende — andettende angin — angyn anwald — anweald aset — asett belimpD — belimpþ beoD — beoþ bescofen — besceofen bethsaida — bethzaida bigspel — bigspell biD — biþ bringaD — bringaþ broDer — broDor bysmeriaD — bysmeriaþ cumaD — cumaþ cwæDon — cwædon cymD — cymþm deofolseocnyssa — deofulseocnessa — deofolseocnessa deD — deþ Continued on next page


Bible version Spelling variants West Saxon doD – doþ drincD — drincþ dweligaþ — dweligeaþ dysegaD — dysegaþ ecnesse — ecnysse eorþam — eorþan feo — feoh flæsce — flæse forgeafe — forgefe forgyfaþ — forgifaþ forwurþaD — forwurDaþ fullaD — fullaþ furDon — furþon fylidge — fyligde fæstaD — fæstaþ gaæþ — gaþ gebidaþ — gebiddaþ gebrodDra — gebroDra gebroþru — gebroDru gebundene — gebundenne gecyrede — gecyrrede gegearwiaD — gegearwiaþ gehyraD — gehyraþ gelyfaD — gelyfaþ gemette — gemete gemunde — gemynde genealæcD — genelæcþ gesceasfte — gesceafte geswutelaþ – geswutelod gesylþ — gesylt gewearD — gewearþ gewurDaþ — gewurþaD godspel — godspell godspelles — godspellys habbaD — habbaþ hreofnes — hreofnys hriDigende — hriþigende hweDer — hweþer hwæþer — hwæDer hyrligum — hyrlingum hyrsumiaD — hyrsumiaþ hæfD — hæfþ hæland — hælend hælyndes — hælendes iacobun — iacobum menegu — menigu metaD — metaþ middre — midre minne — mine moDor — modor Continued on next page


Bible version Spelling variants West Saxon muDan — muþan nabbaD — nabbaþ nane — nanne net — nett næfD — næfþ ongean — ongen — ongan sceall — sceal scep — scip — scyp scype — scipe secD — secþ seocnyssa — seocnessa soplice — soDlice — soþliice — soþlice spreæc — spræc spycD — spycþ sunnu — sunu swyDram — swyDran symbel — symble syþþan — syDDan sæwD — sæwþ templ — tempel tobrycD — tobrycþ towearde — toweardre towurpe — towyrpþ unclæne — unclænne uncnytte — uncytte ungeleaffulnesse — ungeleaffulnysse warniaD — warniaþ wearp — wearD — wearþ wife — wif winD — winþ winter — wintra — wintre witlodlice — witodice — witodlice wundrodon — wundredon yrmDe — yrmþe yrnþ — yrmþe Das — DaD — Dat Donne — Done Dænne — Dæne þone — þonne þrysmiaD — þrysmiaþ

Methodologically, I utilise the open source compression program gzip 4 to approximate Kolmogorov complexity and to assess linguistic complexity on the overall, syntactic and morphological level. A word on the understand- ing and definition of these complexity notions: In the context of this work, overall complexity coincides with Miestamo’s concept of global complexity

4gzip (GNU zip), Version 1.2.4. URL http://www.gzip.org/


(2008: 29–32) which comprises the complexity of all levels of a language and thus refers to the complexity of a language as a whole. Morpholo- gical complexity corresponds to the degree of structural (ir)regularity of word forms a given language exhibits. More structural (ir)regularity is con- sidered more morphologically complex; the morphological complexity axis is therefore negatively poled. Syntactic complexity, on the other hand, is specified in terms of word order (rules) such that maximally simple syntax is essentially defined as maximally free word order. Hence, the polarity of the syntactic complexity axis is set so that fixed word order, i.e. more word order rules and less variation of syntactic patterns, counts as complex (see also Section 7.2). In this spirit, the overall complexity of the text samples is assessed by obtaining two measurements for each text file analysed: the file size in bytes before compression, and the file size in bytes after compression. These file sizes are directly associated, i.e. the bigger an uncompressed text file is to start with, the bigger is the resulting compressed text file. In order to eliminate the trivial correlation between the original uncompressed file sizes and the compressed file sizes of the samples, the two values are subjected to linear regression. The resulting adjusted overall complexity scores (re- gression residuals, in bytes) are a measure of the left-over variance between the language samples and are taken as indicators of the overall complexity of a given sample. Bigger adjusted complexity scores can be equated with higher informativeness of a given text sample and thus indicate higher levels of Kolmogorov complexity. Complexity at the morphological and syntactic level can be addressed by manipulating the information at the respective linguistic level in each text file prior to compression. Largely following Juola (2008), morphological dis- tortion is achieved by random deletion of 10% of the orthographic characters in each text file. Through this procedure new word forms are created and, at the same time, morphological regularity is compromised. Subsequently, the distorted samples are compressed in order to determine how well or badly the compression program deals with the distortion. Morphologically com- plex languages exhibit overall a relatively large amount of word forms in any case. Hence, distortion should not compromise them as much as morpho- logically simple languages, in which distortion creates proportionally more random noise and thus entropy / complexity. Comparatively worse com- pression ratios thus signify low morphological complexity. Distortion at the syntactic level is accomplished by randomly deleting 10% of all orthograph- ically transcribed word tokens in each sample. This procedure is assumed to have little impact on languages with relatively simple syntax—defined as free word order—as they lack between-word interdependencies that could be compromised. Syntactically complex languages with many word order rules, however, should be greatly affected as word order regularities are dis- torted and compromised. In the Basic English text sample, for example,

the auxiliary sequence would have been (1-a) occurs twice. This sequence could presumably be altered to would Ø been (1-b) through distortion. In this case the compression algorithm would encounter two hapax legomenon patterns—instead of encountering one pattern twice. This leads to uncompressible entropy and compromises compression efficiency. To make a long story short, comparatively bad compression ratios after syntactic distortion indicate high syntactic complexity.

(1) a. It would have been well for that man if he had never been given birth. b. It would Ø been well for that man if he had never been given birth. [Mark 14: 21 [Basic English]]

On a more technical note, the size in bytes of the original undistorted file and the size in bytes of the syntactically / morphologically distorted file are taken. On the basis of these values I calculate two complexity quotients: the morphological complexity score, defined as −m/c, where m is the compressed file size after morphological distortion and c is the compressed file size before distortion; and the syntactic complexity score, defined as s/c, where s is the compressed file size after syntactic distortion and c the file size before distortion.
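For concreteness, the following R sketch shows how these quantities can be obtained with gzip compression. It is a simplified stand-in for the scripts used in this work (cf. Appendix B.2); all function names are illustrative, and the overall scores are the regression residuals described above.

# Compressed size (in bytes) of a character string, using gzip
gz_size <- function(txt) length(memCompress(charToRaw(txt), type = "gzip"))

# Morphological distortion: randomly delete 10% of all orthographic characters
distort_morph <- function(txt, p = 0.10) {
  chars <- strsplit(txt, "")[[1]]
  keep  <- sort(sample(seq_along(chars), round((1 - p) * length(chars))))
  paste(chars[keep], collapse = "")
}

# Syntactic distortion: randomly delete 10% of all word tokens
distort_synt <- function(txt, p = 0.10) {
  tokens <- strsplit(txt, "[[:space:]]+")[[1]]
  keep   <- sort(sample(seq_along(tokens), round((1 - p) * length(tokens))))
  paste(tokens[keep], collapse = " ")
}

# Morphological score -m/c and syntactic score s/c for one sample
complexity_scores <- function(txt) {
  c_orig <- gz_size(txt)                  # c: compressed size before distortion
  m      <- gz_size(distort_morph(txt))   # m: compressed size after morphological distortion
  s      <- gz_size(distort_synt(txt))    # s: compressed size after syntactic distortion
  c(morphological = -m / c_orig, syntactic = s / c_orig)
}

# Adjusted overall complexity: residuals of compressed on uncompressed file sizes
overall_scores <- function(uncompressed, compressed) residuals(lm(compressed ~ uncompressed))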

3.1.2. The Gospel of Mark

Proceeding as described above, the file sizes in bytes before and after compression are established for each text file. I then calculate adjusted overall complexity scores for all language samples and obtain a hierarchy of overall complexity (Figure 3.1). West Saxon, Hungarian, Finnish, Latin, German and French are (in decreasing order) rather complex whereas Esperanto and all English texts after 1066 are rather simple. Note that there is an interaction between the overall and morphological complexity measures such that exceptionally high morphological complexity (e.g. in Hungarian) tends to be reflected in high overall complexity scores. This is due to the algorithmic nature of the measure which is based on structural surface redundancy (for more details see Chapter 5).

The analysis of morphological and syntactic complexity yields equally intuitive results. Thus in Figure 3.2, languages which are morphologically complex but syntactically simple cluster in the top left quadrant: West Saxon, Finnish, Latin and Hungarian exhibit the most complex morphology. All the English varieties—apart from West Saxon—as well as French are morphologically simple but syntactically complex and are scattered across the bottom right quadrant. Basic English, in the very bottom right part of the plot, is the morphologically least complex but syntactically most complex sample. German and Esperanto cover the middle ground and seem to be balanced in regard to morphological versus syntactic complexity.

In this Bible sample, morphological complexity trades off against syntactic complexity and vice versa.



Figure 3.1.: Overall complexity hierarchy in the Gospel of Mark database. Negative residuals indicate below-average complexity; positive residuals indicate above-average complexity.

A negative correlation between morphological complexity and syntactic complexity is particularly prominent when focusing on the English varieties; with a Pearson's correlation coefficient of r = −0.92, p = 0.0000 the correlation between the complexity scores indicates a textbook-style trade-off.

The workings of the compression technique will be illustrated with an example passage from Mark 1: 8–9 in West Saxon (classified as a morphologically complex but syntactically simple language) and Basic English (classified as a morphologically simple but syntactically complex language). In terms of morphology (see Table 3.3), nine different segmentable inflected word tokens (fullige, wæter-e, full-aþ, Halg-um, Gast-e, dag-um, ge-full-od, Iordan-e, Iohann-e) can be counted in the West Saxon version whereas only three tokens (2 x giv-en, day-s) and two types (giv-en, day-s) can be counted


Table 3.3.: Segmentable inflected word tokens in Mark 1: 8–9.

West Saxon:
[8] Ic fullig-e eow on wæter-e; he eow full-aþ on Halg-um Gast-e.
[9] And on Dam dag-um, come se Hælend fram Nazareth Galilee, and wæs ge-full-od on Iordan-e fram Iohann-e.

Basic English:
[8] I have giv-en you baptism with water, but he will give you baptism with the Holy Spirit.
[9] And it came about in those day-s, that Jesus came from Nazareth of Galilee, and was giv-en baptism by John in the Jordan.

Table 3.4.: Word order patterns in Mark 1: 8–9.

West Saxon:
[8] [Ic]subject [fullige]verb [eow]object [on wætere]adverbial; [he]subject [eow]object [fullaþ]verb [on Halgum Gaste]adverbial.
[9] And [on Dam dagum]adverbial, [come]verb [se Hælend]subject [fram Nazareth Galilee]adverbial, and [wæs gefullod]verb [on Iordane]adverbial [fram Iohanne]adverbial.

Basic English:
[8] [I]subject [have given]verb [you]object [baptism]object [with water]adverbial, but [he]subject [will give]verb [you]object [baptism]object [with the Holy Spirit]adverbial.
[9] And [it]subject [came about]verb [in those days]adverbial, that [Jesus]subject [came]verb [from Nazareth of Galilee]adverbial, and [was given]verb [baptism]object [by John in the Jordan]adverbial.

in the Basic English version. In short, the compression algorithm encounters more patterns in the West Saxon than in the Basic English version. For this reason, the West Saxon text is less compressible on the morphological plane than the Basic English text. Let us now turn to syntax (Table 3.4): the West Saxon version features four different word order patterns and thus possesses a rather flexible syntax. In the Basic English version, on the other hand, word order is relatively rigid (i.e. complex) as the pattern subject-verb dominates throughout the passage. Therefore, Basic English is classified as a syntactically complex language—in contrast to West Saxon—as it has many word order rules to break.

I will now briefly focus on historical drifts as measured by the compression technique in the English Bible translations.

It is a well-known fact that in the course of its history the English language has changed from a rather synthetic language which uses many inflections to encode grammatical information into a rather analytic language which relies on word order and function words to convey grammatical information instead. In Figure 3.3, which plots real time drifts in the history of English, this textbook story is nicely depicted: we can see that the Kolmogorov complexity measurements of the Bible samples clearly suggest a morphological simplification and syntactic complexification over time, some outliers notwithstanding.

In this experiment with parallel texts, I have demonstrated that compression algorithms such as the one implemented in gzip can be utilised to measure linguistic complexity on an overall, morphological and syntactic level. Furthermore, it could be shown that the compression technique captures both intra-linguistic and cross-linguistic complexity variation fairly well and the obtained results dovetail with intuitive complexity assessments as well as previous complexity research (Bakker 1998; Nichols 1992). For instance, my ranking of syntactic complexity is largely congruent with a ranking reported by Bakker (1998) who measures syntactic complexity in terms of flexibility (e.g. Finnish and German are less complex than French and Present-day English).


Figure 3.2.: Morphological complexity by syntactic complexity in the Gospel of Mark database. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity. English versions are represented by a red circle, non-English versions by a blue triangle.



Figure 3.3.: Real time drifts in English: morphological (upper plot) and syntactic complexity (lower plot) in the Gospel of Mark database. Abscissa arranges Bible translations chronologically. The mean morphological and syntactic complexity are given where multiple Bible samples per period are available.
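For readers who wish to retrace a plot of this kind, the underlying aggregation step can be sketched as follows; the data frame layout and the values shown are placeholders, not the measured scores.

# Mean complexity per period where several translations are available
# (placeholder values; the measured scores are those discussed above)
scores <- data.frame(
  century       = c("17c", "19c", "19c", "19c"),
  morphological = c(-1.12, -1.11, -1.12, -1.13),
  syntactic     = c(0.93, 0.93, 0.94, 0.93)
)
aggregate(cbind(morphological, syntactic) ~ century, data = scores, FUN = mean)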


3.2. Parallel, semi-parallel and non-parallel texts

In this section I will further explore the compression technique introduced above and demonstrate that it need not be limited to parallel text corpora but can also be applied to non-parallel text samples: there is no theoretical reason why only parallel corpora should be used (Juola 1998: 211), and the compression ratios yielded should be (con)text-independent. To furnish a case study, I draw on two datasets:

1. a parallel—and, after permutation wizardry, semi-parallel—corpus of Alice's Adventures in Wonderland in nine languages (Dutch, English, Finnish, French, German, Hungarian, Italian, Romanian, and Spanish) and,

2. a non-parallel sample of newspaper texts covering the same nine European languages (Dutch, English, Finnish, French, German, Hungarian, Italian, Romanian, and Spanish).

This experiment is set up in two steps. Firstly, I measure and sub- sequently compare linguistic complexity in the parallel corpus of Alice’s Ad- ventures in Wonderland and a re-sampled semi-parallel version of the same corpus. Secondly, linguistic complexity in non-parallel newspaper texts will be measured and the results compared to the complexity hierarchy obtained from the parallel Alice corpus.

3.2.1. Method and data In a first step, a parallel corpus of a literary text, i.e. Alice’s Adventures in Wonderland by Lewis Carroll, is sampled in nine European languages chosen from Germanic, Romance and Finno-Ugric languages which use the Latin alphabet and are frequently utilised as test cases in the complexity literature (Bakker 1998; Juola 1998; Kettunen et al. 2006; Sadeniemi et al. 2008): Dutch, English, Finnish, French, German, Hungarian, Italian, Ro- manian and Spanish. Table 3.5 lists the number of words and sentences in the Alice database. Next, I compile a non-parallel corpus of newspaper texts on several contem- porary topics5 in the same nine languages as the Alice corpus (i.e. Dutch, English, Finnish, French, German, Hungarian, Italian, Romanian and Span- ish). The topics were chosen according to their availability across the nine languages and cover the European currency crisis, the political situation in Tunisia and the Democratic Republic of Congo, the death of Kim il Jong, the nuclear crisis in Iran and the pending elections in Russia. Thus, all art- icles dealing with these topics were automatically identified by their HTML

5The texts were all retrieved between December 2011 and February 2012.


Table 3.5.: Number of words and sentences in the Alice database.

Alice version    Words      Sentences
Dutch            28,897     1,866
English          26,446     1,625
Finnish          18,572     1,996
French           25,327     2,064
German           26,309     1,659
Hungarian        19,517     2,081
Italian          24,709     1,878
Romanian         23,870     1,952
Spanish          27,128     2,115

topic tag and retrieved from the following online newspapers using custom-made R web scrapers (see Appendix B.1 for the basic scraper code).

– Dutch: Volkskrant (http://www.volkskrant.nl/)

– English: The Guardian (http://www.guardian.co.uk/)

– Finnish: Iltasanomat (http://www.iltasanomat.fi), Helsinki Sanomat (http://www.hs.fi/)

– French: Le Figaro (http://www.lefigaro.fr/)

– German: Die Welt (www.welt.de)

– Hungarian: HvG (http://hvg.hu/), Nepszava (http://www.nepszava. hu)

– Italian: La repubblica (http://www.repubblica.it/)

– Romanian: Adevarul (http://www.adevarul.ro/)

– Spanish: ABC (http://www.abc.es)
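For illustration, a topic-tag scraper of the kind just described could be sketched as follows. The rvest package, the placeholder URL and the CSS selector are my assumptions for this sketch; the scraper actually used is the one reproduced in Appendix B.1.

# Illustrative topic scraper (assumptions: rvest, placeholder URL and selector;
# the scraper actually used is given in Appendix B.1)
library(rvest)

scrape_topic <- function(topic_url, article_selector = "div.article-body p") {
  page  <- read_html(topic_url)
  links <- html_attr(html_elements(page, "a"), "href")
  links <- unique(links[!is.na(links) & grepl("^http", links)])
  texts <- vapply(links, function(u) {
    art <- tryCatch(read_html(u), error = function(e) NULL)
    if (is.null(art)) return("")
    paste(html_text2(html_elements(art, article_selector)), collapse = " ")
  }, character(1))
  paste(texts[texts != ""], collapse = "\n")
}

# hypothetical call:
# euro_crisis_en <- scrape_topic("https://www.example-news-site.com/topic/euro-crisis")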

Note, however, that the subsequent analysis will only focus on two sample corpora of news articles: one corpus containing articles dealing with the 'Euro crisis' and 'Congo', and a second corpus containing articles dealing with the 'Euro crisis', 'Congo' and 'Tunisia'. I chose these two- and three-topic corpora respectively, because not all of the topics retrieved yielded satisfying results. Due to the vast number of articles and the span of languages covered, a manual control of each article's topic was not feasible. For this reason, some of the news sources might substantially differ—probably depending also on the political relations / interests among the respective countries—in the topics published under the same topic tag.

On a more technical note, all texts of the Alice corpus as well as the newspaper corpus were saved as text files from which all punctuation, numbers or any other non-alphabetical characters such as, for example, bullet points were removed in R and subsequently subjected to manual screening. The newspaper corpus was also screened for multiply occurring identical articles or passages which were, where necessary, deleted as they would inflate compressibility. In the case of the newspaper corpus, the constant variable "number of sentences" was introduced to determine equally sized samples across languages and topics. By choosing the same number of sentences instead of, for instance, words or characters, syntactic interdependencies remain intact. Thus, for each topic the same number of sentences in each of the languages is sampled. The number of sentences, in turn, is determined by the language sample with the smallest amount of sentences available for a given topic—not all newspapers sample the same amount of text / articles on a given topic.6 Table 3.6 shows the number of sentences per newspaper corpus.

Table 3.6.: Number of sentences by topic for the newspaper corpora "Euro-Congo" and "Euro-Congo-Tunisia".

Corpus               Topic         Sentences   Total
Euro-Congo           Euro-Crisis   417         782
                     Congo         311
Euro-Congo-Tunisia   Euro-Crisis   417         1,248
                     Congo         311
                     Tunisia       466
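A minimal sketch of the equal-size sampling is given below. Whether the first n sentences are kept or n sentences are drawn at random is not specified above, so the helper simply truncates; the function name and data layout are illustrative.

# Equal-size sampling by sentences: the smallest language sample per topic
# determines the number of sentences kept for every language (illustrative)
sample_equal_sentences <- function(sentences_by_lang) {
  n <- min(vapply(sentences_by_lang, length, integer(1)))
  lapply(sentences_by_lang, function(s) head(s, n))
}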

The methodology used in this experiment is essentially identical to the method described in Section 3.1.1. In order to measure overall complexity in the parallel and non-parallel corpus, the file sizes in bytes of each text file before and after compression are established. Subsequently, the adjusted complexity scores which indicate the overall complexity of each language sample are calculated by subjecting the file sizes to linear regression. Next, we address syntactic and morphological complexity. Let us briefly rehearse how this is achieved: syntactic distortion is performed by deleting 10% of all word tokens in each text file prior to compression. This procedure leads to the disruption of word order regularities and should greatly affect syntactically complex languages, i.e. languages with relatively strict word order rules. The morphological information is manipulated by deletion of

6It is for this reason that I had to tap into two online newspapers for Finnish and Hun- garian while one source delivered enough data for each topic in the other languages.


10% of all orthographic characters in each text file thereby creating new word forms, i.e. random noise / entropy. This noise compromises com- pressibility of morphologically simple languages which, overall, have fewer word forms than morphologically complex languages. Concretely, I apply a multiple distortion and compression script to the complete corpora which implements the methodology as outlined above but allows for multiple it- erations (the full script is provided in Appendix B.2). Even though I have shown in the Bible experiment that simple distortion and compression yields linguistically meaningful results, the critical observer might claim that the results achieved by simple random deletion are not statistically sound but a product of coincidence. Therefore, I will take multiple measuring points in order to ensure that my findings are statistically robust. For further il- lustration, in the process of random deletion, any character or word token of a given text file could be modified. This means, however, that depending on which character or word precisely was subject to deletion, the impact of the deletion on complexity might vary. Consider example (2), which illus- trates the procedure of random syntactic distortion. (2-a) is the unaltered sentence that will be subjected to random deletion. In this example the distortion script is supposed to delete two words at random. While both in (2-b) and (2-c) two words were deleted, the impact of the deletion differs greatly: (2-b) is still syntactically intact, whereas (2-c) has been rendered incomprehensible because syntax is compromised badly.

(2) a. There was a table set out under a tree in front of the house and the march hare and the hatter were having tea at it [. . . ]. b. There was a table set Ø under a tree in front of the house and the Ø hare and the hatter were having tea at it [. . . ]. c. There was a table set out under a tree in front Ø the house and the march hare and the hatter were Ø tea at it [. . . ]. [Alice]

In terms of complexity, compression of neither (2-a) nor (2-b) in isolation would reflect the actual complexity of the sentence. However, taking the average of several measuring points, the actual complexity of the string can be approximated. Turning back to the actual analysis, I thus apply mul- tiple distortion and compression with N = 1, 000 iterations to each file in the parallel and non-parallel corpus. Every iteration of the script returns the compressed file sizes for each language sample before and after syn- tactic/morphological distortion. On the basis of these file sizes the average morphological complexity score and the average syntactic complexity score are calculated. More precisely, the average complexity scores are obtained by taking the mean of the total number of ratios from each of the measuring points (N = 1, 000) for morphological and syntactic complexity respectively. In short, the average complexity score is the mean of N = 1, 000 morpholo- gical / syntactic complexity ratios. In order to take stock of the dispersion

59 Experimenting with the compression technique across the individual data points the standard deviation is calculated. The values are given in Table 3.7 for the parallel Alice corpus and Tables 3.8 and 3.9 for the non-parallel newspaper corpora. What, then, does the stan- dard deviation tell us? The standard deviation of the average syntactic complexity score for English in the parallel Alice corpus, for example, is 0.0016 and the average syntactic complexity score is 0.9276. This means that most values in this dataset fall between 0.9276 − 0.0016 = 0.926 and 0.9276 + 0.0016 = 0.9292. In other words, 68% of all measuring points (assuming a normal distribution) fall within a range of 0.926 and 0.9292. The mean thus seems to be a good representation of the actual complexity across the different measuring points. Finally, a semi-parallel corpus of Alice’s Adventures in Wonderland is created by means of permutation. Before every iteration of the multiple dis- tortion and compression script, I randomly sample 10% of the total number of sentences from the parallel Alice database (by incorporating a random sampling function into the distortion script, see Appendix B.3). These samples vary in terms of their propositional content due to the process of multiple random permutation, i.e. the new database is no longer parallel but semi-parallel. Applying the script with N = 1, 000 iterations, I obtain 1, 000 measuring points for the compressed and uncompressed file sizes be- fore and after syntactic / morphological distortion of each permutated lan- guage sample. On the basis of these values, the average morphological com- plexity score and the average syntactic complexity score are subsequently computed and their standard deviations (see Table 3.10) are calculated as described above. Variation between the individual measuring points is neg- ligible and the average morphological / syntactic complexity scores are a fit proxy for the actual complexity in my text samples. Subsequently, the average overall complexity score is obtained by calcu- lating regression residuals of the mean compressed file sizes (dependent vari- able) and the mean uncompressed file sizes (independent variable). Intra- sample dispersion is accounted for by calculating the variation coefficient as described above. Table 3.11 shows the dispersion of the compressed and uncompressed file sizes.
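A simplified sketch of this multiple distortion and compression loop is given below; it reuses the complexity_scores() helper sketched in Section 3.1.1 and stands in for the scripts in Appendices B.2 and B.3, with all names being illustrative.

# Average morphological and syntactic scores over N iterations, with optional
# permutation sampling of 10% of all sentences (semi-parallel setting)
average_scores <- function(sentences, n_iter = 1000, resample = FALSE) {
  runs <- replicate(n_iter, {
    s <- sentences
    if (resample) s <- sample(s, round(0.10 * length(s)))
    complexity_scores(paste(s, collapse = " "))
  })
  list(mean = rowMeans(runs),        # average complexity scores
       sd   = apply(runs, 1, sd))    # dispersion across measuring points
}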

60 Experimenting with the compression technique Language Morphological scoreDutch Standard deviationEnglish Syntactic scoreFinnish StandardFrench deviation GermanHungarianItalianRomanianSpanish -1.0827 -1.1068Swedish -1.0587 -1.1049 -1.0930 -1.0816 -1.0710 -1.0947 0.0013 0.0015 -1.1040 0.0014 -1.1002 0.0015 0.0014 0.0015 0.9190 0.9276 0.0015 0.0013 0.9112 0.9253 0.0014 0.9180 0.9158 0.0015 0.9177 0.9213 0.0014 0.0016 0.9239 0.0015 0.9234 0.0015 0.0014 0.0016 0.0015 0.0016 0.0015 0.0015 Average morphological and syntactic complexity scores and their standard deviations in the parallel Alice corpus by language. Table 3.7.:

61 Experimenting with the compression technique al 3.8.: Table vrg opooia n ytci opeiysoe n hi tnaddvain nteEr-og escru by corpus news Euro-Congo the in deviations standard their and scores complexity language. syntactic and morphological Average pns 114 .09092 0.0021 0.0016 0.0019 0.0018 0.0022 0.0018 0.0019 0.9320 0.0020 0.0020 0.9168 0.9281 0.9154 0.9212 0.0019 0.9318 0.9127 0.0015 0.0018 0.9347 0.9208 0.0016 0.0019 0.0018 0.0018 -1.1248 0.0018 0.0019 -1.0786 -1.1157 -1.0904 -1.1026 -1.1432 -1.0813 -1.1290 -1.0932 Spanish Romanian Italian Hungarian German deviation French Standard Finnish score Syntactic English deviation Standard Dutch score Morphological Language

62 Experimenting with the compression technique Language Morphological scoreDutch Standard deviationEnglish Syntactic scoreFinnish StandardFrench deviation GermanHungarianItalianRomanianSpanish -1.0999 -1.1196 -1.0869 -1.1559 -1.0951 -1.0983 -1.1153 -1.0805 0.0016 0.0014 -1.1209 0.0015 0.0014 0.0014 0.0012 0.9205 0.9284 0.0015 0.0012 0.9119 0.9313 0.0014 0.9173 0.9153 0.9256 0.9162 0.0017 0.0015 0.9271 0.0015 0.0016 0.0016 0.0014 0.0016 0.0013 0.0015 Average morphological and syntactic complexity scoresby and language. their standard deviations in the Euro-Congo-Tunisia news corpus Table 3.9.:

63 Experimenting with the compression technique al 3.10.: Table wds 106 .03096 0.0048 0.0048 0.0045 0.0044 0.0054 0.0041 0.0047 0.9163 0.0048 0.9171 0.0043 0.0044 0.9156 0.9135 0.0063 0.9135 0.9153 0.0062 0.9172 0.9112 0.0061 0.0056 0.9191 0.9149 0.0057 0.0055 0.0060 -1.0065 0.0053 -1.0017 0.0060 0.0057 -1.0034 -0.9876 -0.9894 -1.0177 -1.0028 -0.9814 Swedish -1.0180 -1.0051 Spanish Romanian Italian Hungarian German deviation French Standard Finnish score Syntactic English deviation Standard Dutch score Morphological Language vrg opooia n ytci opeiysoe n hi tnaddvain ntesm-aallAiecru by corpus Alice semi-parallel the in deviations standard their and scores complexity language. syntactic and morphological Average

64 Experimenting with the compression technique Language UncompressedDutch Standard deviationEnglish CompressedFinnish Standard deviation FrenchGermanHungarianItalian 13,559 13,605Romanian 11,071Spanish 11,651Swedish 15,406 10,575 11,952 12,457 1,055 914 11,628 10,951 657 1,131 718 580 5,551 5,662 4,752 868 904 6,280 5,015 4,857 766 575 5,216 5,310 398 350 4,983 4,532 265 427 281 247 347 352 299 220 Mean uncompressed and compressedlanguage. file sizes (in bytes) and their standard deviations in the semi-parallel Alice corpus by Table 3.11.:


3.2.2. Alice’s Adventures in Wonderland This section focuses on the parallel and semi-parallel Alice corpora. Spe- cifically, I will compare the complexity rankings obtained from the parallel corpus to the rankings obtained from the semi-parallel corpus and estab- lish whether the results correlate positively, i.e. whether the compression technique yields consistent results for both parallel and semi-parallel texts. First, I will compare the complexity of the two corpora on the overall level. In order to obtain overall complexity hierarchies, the adjusted complexity scores for the parallel Alice corpus and the average overall complexity scores for the semi-parallel corpus are calculated as described above. The overall complexity ranking in the parallel corpus (Figure 3.4) is in decreasing or- der of complexity, Hungarian, Romanian, Dutch, Finnish, German, Italian, Spanish, French and English. On the whole, these results are as I would expect them to be: Hungarian exhibits the highest overall complexity in the sample whereas English ex- hibits the lowest complexity. Most of the Romance languages, i.e. Italian, Spanish and French, cluster together in the less complex left area of the plot. Romanian, while being a Romance language, is ranking next to Hungarian. Finnish in the right, complex area and German, in the left less complex area of the plot, cover the middle ground. In fact, only Dutch, which is third on the complexity hierarchy in this sample, exhibits surprisingly high over- all complexity. Comparing this ranking to the results of the semi-parallel corpus (Figure 3.5) I find that Hungarian still exhibits the highest overall complexity. English can still be found on the less complex left area of the plot and counts among the less complex languages, yet Spanish is the least complex language in the semi-parallel sample. In the ranking of overall complexity of the other languages I observe minor shifts: Italian, German and Dutch as well as French are (in decreasing order) complex whereas Ro- manian and Finnish have become less complex and can now be found in the left area of the plot. Thus, the complete complexity hierarchy in de- creasing order is: Hungarian, Italian, German, Dutch, French, Romanian, Finnish, English and Spanish. In order to assess the similarity and the ex- tent to which the two overall complexity hierarchies correspond, I perform the Spearman Rank Correlation Test for ordinal ranked data. The overall correlation of the two complexity hierarchies is with r = 0.5 (p = 0.09) mod- erate but significant. The reason for the only moderate correlation between the two overall complexity rankings is the loss of content alignment which was achieved through permutation and the resulting shifts in complexity. Strictly speaking, the parallel and semi-parallel Alice corpora are in terms of content two different databases.
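For concreteness, the rank comparison just reported can be restated in R as follows; the two rank vectors simply encode the hierarchies given above (1 = most complex), and the call reproduces the Spearman coefficient.

# Overall complexity ranks as reported above (1 = most complex)
parallel_rank      <- c(Dutch = 3, English = 9, Finnish = 4, French = 8, German = 5,
                        Hungarian = 1, Italian = 6, Romanian = 2, Spanish = 7)
semi_parallel_rank <- c(Dutch = 4, English = 8, Finnish = 7, French = 5, German = 3,
                        Hungarian = 1, Italian = 2, Romanian = 6, Spanish = 9)
cor.test(parallel_rank, semi_parallel_rank, method = "spearman")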



Figure 3.4.: Overall complexity hierarchy of the parallel Alice corpus. Negative residuals indicate below-average complexity; positive residuals indicate above-average complexity.



Figure 3.5.: Overall complexity hierarchy of the semi-parallel Alice corpus. Negative residuals indicate below-average complexity; positive residuals indicate above-average complexity.


Turning to morphological and syntactic complexity, I compute the aver- age morphological complexity score and the average syntactic complexity score for both the parallel and semi-parallel corpus. The analysis of the parallel Alice corpus (Figure 3.6) dovetails with intuitions: Finnish, which according to the overall complexity score is the morphologically most com- plex but syntactically most simple language in the sample, is positioned in the extreme top left quadrant while morphologically simple but syntactically complex languages—in this sample Spanish, French and English—cluster in the bottom right quadrant of the plot. The more balanced languages in the sample are scattered across the middle field of the plot: Romanian, Hungarian and Dutch are morphologically relatively more complex than syntactically while German and Italian seem to be well balanced.

Figure 3.6.: Morphological by syntactic complexity in the parallel Alice corpus. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

In the semi-parallel corpus (Figure 3.7) the nine languages seem to be sim-

69 Experimenting with the compression technique ilarly distributed: Finnish and Hungarian, both located in the (extreme) top left quadrant, are clearly the morphologically most complex and syn- tactically most simple languages in the sample. English, in the bottom right quadrant, is the syntactically most complex but morphologically most simple language. Between these poles of either extreme morphological or syntactic complexity, the comparatively “balanced” languages, Dutch, Italian, Spanish, French and Romanian, are grouped in the centre of the plot. German, which exhibits medium syntactic complexity and low mor- phological complexity, is positioned at the bottom centre. In both datasets, a very significant trade-off between morphological and syntactic complexity can be observed: the negative correlation coefficient for the parallel corpus (Pearson’s r = −0.93, p = 0.0002) and the semi-parallel corpus (Pearson’s r = −0.79, p = 0.006) indicate that most languages in this sample trade off morphological for syntactic complexity. As a standalone fact, this find- ing is interesting insofar as it seems to strengthen the claim of the equal complexity hypothesis which postulates that complexity in one linguistic sub-domain is counterbalanced by simplicity in another sub-domain. Yet, this is just one piece in a bigger puzzle: the results of the overall complexity measurement obtained by the compression technique clearly suggest that some languages are more complex than others. Furthermore, in order to measure an overall complexity trade-off, all levels of a language (i.e. prag- matics, phonology, etc.) would need to be compared, not only morphology and syntax (Deutscher 2009). Be that as it may, despite very minor shifts among the balanced languages, the results of the morphosyntactic complex- ity analysis in the parallel and semi-parallel Alice corpus seem to be rather congruent. In other words, the compression technique can be effectively used with both parallel and semi-parallel texts. These findings are backed up statistically by conducting a Spearman’s correlation test of the average syntactic and morphological complexity scores, respectively, of the two cor- pora. Overall, I observe high correlation between the two datasets: at the syntactic level Spearman’s rho with r = 0.82 (p = 0.005) indicates very high correlation whereas correlation at the morphological level is slightly lower but still high (r = 0.73, p = 0.02).


Figure 3.7.: Morphological by syntactic complexity in the semi-parallel Alice corpus. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

In order to further validate these findings, I compare the results obtained by the compression technique to results from more traditional research. Bakker (1998) investigates syntactic complexity on the basis of flexibility in word order patterns. He defines flexibility in terms of the variability and number of word order patterns in a language and assigns values between 0 and 1 for syntactic flexibility. The more word-order patterns a given language exhibits, the more flexible is a language. Values close to zero in- dicate less flexibility and thus increased syntactic complexity (Bakker 1998: 387). This is another way of saying that flexible languages are syntactically simple while inflexible languages are syntactically complex. In Table 3.12 below, the syntactic complexity ranking based on Bakker’s flexibility val- ues is compared to the syntactic complexity rankings established by the Kolmogorov-based syntactic complexity scores.


On a statistical level, I compare Bakker’s ranking with the rankings in the parallel and semi-parallel corpus by means of Spearman’s correlation test. The syntactic complexity order of the languages in the parallel Alice corpus very highly correlates with Bakker’s ranking (r = 0.75, p = 0.02). Spearman’s rho of correlation for the syntactic complexity ranking in the semi-parallel Alice corpus indicates with r = 0.43 (p = 0.1) a moderate but significant correlation. To sum up, the overall rankings of complexity of the parallel and semi- parallel Alice corpus are fairly similar and statistically correlate to a signif- icant extent. On a morphological and syntactic plane, the results from the two corpora are highly congruent and are in line with complexity rankings reported in traditional research. Interestingly, dislocations in the complex- ity hierarchies between the parallel and semi-parallel corpus are particularly prominent among the morphosyntactically “balanced” languages such as German or Italian. These languages are, as measured by the compression technique, neither extremely syntactically complex nor extremely morpho- logically complex but seem to express grammatical information equally both with syntax and morphology. My results therefore suggest that the propos- itional content—which might decisively influence the choice between mor- phologically versus syntactically encoded information—should be a more important factor for successful complexity measurement of balanced lan- guages than for “extreme” languages such as English or Finnish.

72 Experimenting with the compression technique Syntactic complexity score Semi-parallel Alice complexity score Parallel Alice Syntactic index Comparison of syntacticcomplexity complexity scores ranking in of the parallel the and nine semi-parallel languages Alice according corpora. to Note that Bakker’s Bakker’s flexibility analysis index does and not include syntactic Hungarian. Bakker’s ranking Flexibility 1. French2. Spanish3. Italian4. English5. German6. 0.1 Dutch 0.37. Romanian8. 0.3 Finnish9. 0.4 Hungarian 0.4 1. 0.5 English 0.4 3. French NA 0.6 3. Spanish 4. Italian 0.928 5. Dutch 0.925 7. Romanian 6. 0.924 German 9. 0.921 8. Finnish Hungarian 0.918 1. 0.919 English 2. 0.918 French 0.916 3. Romanian 0.911 4. Spanish 7. 0.919 5. Dutch Italian 6. 0.917 0.917 German 8. Hungarian 9. Finnish 0.916 0.915 0.916 0.914 0.915 0.911 Table 3.12.:


3.2.3. Newspaper texts

After having explored Kolmogorov complexity measurements in parallel and semi-parallel texts in Section 3.2.2, I will now apply the compression technique to genuinely non-parallel texts. Non-parallel texts are neither translational equivalents nor re-sampled parts of parallel texts but, in the context of this chapter, independently composed texts on a given identical topic. Thus, complexity rankings of two non-parallel newspaper corpora will be discussed and compared to the rankings of the parallel Alice corpus as well as to the complexity hierarchy established by Bakker (1998).

First, the three-topic newspaper corpus on the European currency crisis and the political situation in Congo and Tunisia will be discussed. The overall complexity is assessed by obtaining two measurements for each language file: the file size before compression and the file size after compression. Subsequently, I calculate adjusted overall complexity scores by applying linear regression to the two measurements. The resulting overall complexity hierarchy (Figure 3.8) in this corpus is as follows: Italian is overall the most complex language in this sample, followed closely by Hungarian and—at a considerable distance—by German. Finnish, English, Romanian, Spanish, Dutch and French are, in this order, simple. These results are rather different from the overall hierarchy which I obtained for the parallel corpus. The ranking in the parallel Alice corpus is, in decreasing order of complexity, Hungarian, Romanian, Dutch, Finnish, German, Italian, Spanish, French and English. Particularly prominent is the difference in complexity observed for the Italian and English language samples. In the non-parallel corpus Italian ranks highest in complexity while it is rather simple in the parallel corpus. English, formerly the simplest language in the parallel corpus data, now exhibits even more complexity than Romanian. All in all, I observe considerable dislocations regarding overall complexity in the non-parallel Euro-Congo-Tunisia dataset. Performing Spearman's correlation test in order to statistically back up these observations, the correlation between the parallel and non-parallel ranking is low (r = 0.26, p = 0.25).

Next, I analyse overall complexity in the two-topic newspaper corpus which samples texts on the topics 'Euro crisis' and 'Congo' only. The adjusted overall complexity scores are calculated as described above and yield the following hierarchy of overall complexity (Figure 3.9): Hungarian is rated the most complex language in this dataset, closely followed by Italian and, in decreasing order of complexity, Finnish, German, Romanian, Dutch, Spanish, English and French. Although Italian is still surprisingly and disproportionately complex, the other languages in this dataset behave roughly as one would expect. Hungarian and Finnish are relatively complex whereas French, Spanish and English are relatively simple.



Figure 3.8.: Overall complexity hierarchy in the Euro-Congo-Tunisia newspaper corpus. Negative residuals indicate below-average complexity; positive residuals indicate above-average complexity.

Complexity dislocations can mainly be observed among languages which are, according to Kolmogorov measurements, "balanced" between morphological and syntactic complexity, such as German, Dutch or Italian. Despite the fact that such dislocations among balanced languages seem to be common (cf. Section 3.2.2), this does not sufficiently explain the extremely high morphological complexity exhibited by the Italian outlier. Considering the fact that propositional content plays an important role in cross-linguistic complexity analyses, a lack of homogeneity in the composition of this particular subcorpus is likely at fault. Be that as it may, comparing this ranking to the hierarchy of the parallel Alice corpus, I find that the correlation is moderate (Spearman's r = 0.63, p = 0.04).



Figure 3.9.: Overall complexity hierarchy in the Euro-Congo newspaper corpus. Negative residuals indicate below-average complexity; positive residuals indicate above-average complexity.

I will now turn to the analysis of morphological and syntactic complexity in the two newspaper corpora. In the three-topic dataset, Finnish and, surprisingly, Italian, located in the extreme left quadrant of the plot (Figure 3.10), are the morphologically most complex and syntactically most simple languages. Hungarian, German and Dutch cluster in the upper middle field and are morphologically more complex than syntactically whereas Romanian, Spanish and English cluster in the lower middle field and are syntactically more complex than morphologically. French, which is positioned in the bottom right quadrant of the plot, is syntactically complex but morphologically simple. Despite the Italian outlier, the results for syntactic complexity in the Euro-Congo-Tunisia corpus correlate very highly with the results of the parallel Alice corpus (Spearman's r = 0.83, p = 0.004). Complexity on the morphological level correlates only moderately (Spearman's r = 0.55, p = 0.067).

Figure 3.10.: Morphological by syntactic complexity in the non-parallel Euro-Congo-Tunisia newspaper corpus. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

The analysis of morphological and syntactic complexity in the two-topic Euro-Congo dataset is shown in Figure 3.11. The distribution is similar to the three-topic dataset. Morphologically complex but syntactically simple languages are grouped in the top left quadrant: Finnish, Italian, and Hungarian. Dutch, German and Romanian are scattered across the whole middle area of the plot. The syntactically complex but morphologically simple languages Spanish, English and French are positioned in the bottom right quadrant. Apart from the Italian data point, which, again, is an outlier, these results tie in neatly with my previous analysis of syntactic and morphological complexity in the parallel Alice corpus. These findings are statistically backed up by conducting a Spearman's correlation test between the non-parallel Euro-Congo corpus and the parallel Alice corpus. At the syntactic level the correlation between the two datasets is very high (r = 0.82, p = 0.005), yet at the morphological level the two datasets correlate only moderately but still significantly (r = 0.63, p = 0.038). As in the other datasets, I observe a significant trade-off between morphological and syntactic complexity (Pearson's correlation coefficient: Euro-Congo r = −0.94, p = 0.00007846; Euro-Congo-Tunisia r = −0.90, p = 0.0004) in both newspaper corpora.
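The trade-off tests reported in this chapter are Pearson correlations of the per-language average morphological and syntactic scores; a self-contained sketch with placeholder vectors is given below.

# Morphology-syntax trade-off (placeholder score vectors; the measured values
# are those reported in Tables 3.8 and 3.9)
morphological <- c(-1.09, -1.13, -1.16, -1.08, -1.10, -1.14, -1.11, -1.12, -1.10)
syntactic     <- c(0.928, 0.917, 0.912, 0.932, 0.921, 0.913, 0.926, 0.915, 0.920)
cor.test(morphological, syntactic, method = "pearson")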

Figure 3.11.: Morphological by syntactic complexity in the non-parallel Euro-Congo newspaper corpus. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

In order to further validate the results of the non-parallel newspaper corpora, I compare the syntactic complexity scores of each dataset to the complexity ranking established by Bakker (1998). Table 3.13 shows the ranking of the sample languages according to Bakker and according to the Kolmogorov complexity metric for both newspaper corpora. Statistically, the correlation between the Euro-Congo-Tunisia dataset and Bakker’s flexibility ranking is moderate (Spearman’s r = 0.39, p = 0.17). The ranking in the Euro-Congo dataset likewise correlates moderately with Bakker’s ranking (Spearman’s r = 0.36, p = 0.19).

In summary, the compression technique works fairly well for measuring syntactic and morphological complexity in non-parallel texts. The comparison of the Euro-Congo-Tunisia and Euro-Congo newspaper datasets with the parallel Alice corpus shows that syntactic complexity measurements correlate very highly while morphological complexity measurements correlate only moderately. Measuring overall complexity in non-parallel texts is possible, yet the technique works only moderately well, as the correlation with the parallel Alice corpus has shown. On the whole, the differences between the two news datasets—the two-topic corpus yields overall slightly better results than the three-topic dataset—and the Italian outlier suggest that the propositional content of the texts analysed plays an important role. This means that, while the texts need not be translational equivalents, a certain amount of content control needs to be applied in the composition of the corpus to ensure homogeneity within and comparability across the corpus components. Randomly chosen texts cannot be successfully assessed with the compression technique and compared across languages.

Table 3.13.: Comparison of the syntactic complexity ranking of the nine sample languages according to Bakker’s flexibility index and the syntactic complexity scores in the two newspaper datasets (Euro-Congo-Tunisia and Euro-Congo). Note that Bakker’s analysis does not include Hungarian.


3.3. Summary

This chapter explored and substantially extended the Juola-style compression technique (Juola 2008, 1998) by conducting several experiments. First, I corroborated prior applications of the compression technique by measuring overall, syntactic and morphological complexity in a parallel corpus of the Gospel of Mark covering six languages and a handful of historical varieties of English. Breaking new ground, a statistically more robust version of the compression technique was then extended to the analysis of semi-parallel and non-parallel corpora of literary and newspaper writing in nine European languages.

The complexity measurements obtained with the compression technique yield linguistically meaningful results which are in line with previous complexity rankings reported in more traditional research (Bakker 1998; Juola 2008; Nichols 2009). All in all, the technique captures both intra-linguistic and cross-linguistic complexity variation quite well. Section 3.1 verified that compression algorithms of the Lempel-Ziv family accurately assess the linguistic complexity of different languages on the overall, morphological and syntactic level. Even intra-linguistic complexity developments can be algorithmically captured, as I trace the diachronic change of English from early West-Saxon, a morphologically complex language rich in inflections, to present-day English, a syntactically complex language heavily relying on syntax to convey meaning. A follow-up experiment, which uses a refined, statistically more robust version of the Juola-style compression technique, has shown that Kolmogorov measurements need not be limited to parallel-text databases but can be successfully applied to semi-parallel and, to some extent, non-parallel texts. Morphological and syntactic complexity can be measured well in semi- and non-parallel texts, outliers notwithstanding. Particularly the measurement of balanced languages, i.e. languages which encode grammatical information to roughly equal parts through syntax and morphology, seems to be sensitive to the propositional content and composition of semi- and non-parallel corpora. Put differently, in balanced languages the propositional content seems to influence the choice between encoding information syntactically or morphologically. In short, the measurement of overall complexity in non-parallel texts is, in principle, possible, but its success depends to a large extent on the propositional content of the texts and is thus subject to textual variation. Content control of non-parallel text databases is therefore crucial, and randomly chosen texts cannot be used for cross-linguistic complexity analyses with the compression technique.

To conclude, the compression technique was shown to be a radically objective and economical shortcut to measure linguistic complexity on the overall, morphological and syntactic tier.


4. Excursion: Targeted file manipulation1

So far the compression technique has been shown to yield linguistically meaningful results when measuring the overall, syntactic and morphological complexity of parallel, semi-parallel and non-parallel corpora. This chapter builds on and extends the work presented in Ehret (2014) in order to demonstrate how algorithms can be utilised to measure morphological and syntactic complexity in a detailed fashion. In this spirit, a new flavour of the compression technique to measure the Kolmogorov complexity of specific linguistic features—targeted file manipulation—is introduced. To be more precise, the degree to which individual morphological markers such as –ing, and functional constructions such as the progressive (be + verb–ing), contribute to the syntactic and morphological complexity of three different texts will be analysed. In contrast to Ehret (2014), who applies targeted file manipulation to a mixed-genre corpus of literary writing, scripture and newspaper texts (for a brief summary of the paper see Section 2.2.2), this chapter measures the features’ intertextual variation in terms of Kolmogorov complexity by separately analysing the three text types. On an interpretational plane, their textual complexity on the syntactic and morphological level will be inferred.

4.1. Method and data

In order to measure the contribution of morphological markers and constructions to complexity in English texts, and to further explore the possibility of assessing detailed morphological and syntactic complexity with compression algorithms, I draw up a set of N = 10 high-frequency morphosyntactic features comprising:

(i) morphological markers: –ing, –ed, genitive ’s, plural –s and third person singular –s;

(ii) and a handful of functional constructions: progressive aspect (be + verb–ing), perfect aspect (have + verb past participle), passive voice (be + verb past participle) and the future markers will and going to.

1A partial summary of this chapter has appeared as Ehret (2014).


The feature set is restricted to the most frequent (common) markers in the English verb and noun phrase (Biber et al. 1999; Johansson & Hofland 1989) as well as functional constructions encoding tense, aspect and voice. The analysis of other feature areas such as modals, adjectives or pronouns is outside the scope of this pilot study and is reserved for future research. Table 4.1 provides the text frequency of each feature per text type. Note that for reasons of operationalisation, no distinction is made between inflectional and non-inflectional occurrences of morphological markers (he is singing vs. he hates singing) in this study.

Table 4.1.: Text frequency of morphological markers and constructions per text type.

Feature           Text frequency
                  Alice    Mark    Euro-Congo    Total
–ing                336     244           316      896
–ed                 390     424           550    1,364
Genitive ’s          25      23           124      172
Plural –s           326     535           974    1,835
3rd ps sg –s        188     290           321      799
going to              8       1             4       13
Passive              59     123           229      411
Perfect             121     110           127      358
Progressive          96      87            71      254
will                 68     109            89      266

The effect of these morphological markers and constructions on the complexity of texts is analysed in samples of Carroll’s Alice’s Adventures in Wonderland, the Gospel of Mark in the English Standard Version (ESV) and the Euro-Congo newspaper corpus, thus covering three distinct text types: literary writing, religious writing / scripture and newspaper texts. Each text type is analysed separately in order to gauge the variation of the features in terms of Kolmogorov complexity across the different text types. The comparability of the measurements across the three texts is ensured by sampling an equal amount of data from each text, roughly 14,000 words, in full sentences so that syntactic structures remain intact. The number of sentences and words for each text type is provided in Table 4.2.


Table 4.2.: Number of sentences and words per text genre.

Text          Text type    Sentences     Words
Alice         literary           792    14,010
Mark          religious          807    14,009
Euro-Congo    newspaper          604    14,007
Total                          2,203    42,026
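The following minimal Python sketch illustrates this kind of full-sentence sampling; it is a simplified stand-in for the R sampling function listed in Appendix B, and whether sentences are drawn in their original order or at random is a design choice left open here.

# Minimal sketch: draw whole sentences until a target word count is reached, so
# that syntactic structures remain intact. Simplified stand-in for the R
# sampling function in Appendix B; the shuffling step is an assumption.
import random

def sample_full_sentences(sentences, target_words=14000, shuffle=True):
    pool = list(sentences)
    if shuffle:
        random.shuffle(pool)
    sample, word_count = [], 0
    for sentence in pool:
        n = len(sentence.split())
        if word_count + n > target_words:
            break
        sample.append(sentence)
        word_count += n
    return sample, word_count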

Methodologically, targeted file manipulation is used to measure the contribution of these morphological markers and constructions to the morphological and syntactic complexity in the three texts. Essentially, targeted file manipulation removes specific target structures from a text. The idea is to assess the contribution of these target structures to the morphological and syntactic complexity in text samples by comparing the complexity of the manipulated samples and the unmanipulated samples. In other words, I compare the morphological / syntactic complexity of a given text sample from which a certain feature was removed to the morphological / syntactic complexity of the original text sample, including the specific feature. To this end, targeted file manipulation, specifically systematic removal, is combined with the compression technique, i.e. multiple random distortion and subsequent compression. On a more technical note, I remove one feature at a time from each of the three texts and obtain a set of feature-manipulated text samples. These manipulated texts, and the original intact version of each text, are then subjected to random distortion and compression. For each text sample the average morphological and the average syntactic complexity scores are calculated. Based on these scores, the morphological and syntactic complexity of the feature-manipulated texts, and the complexity of the original texts, can be compared. The difference in complexity between the manipulated and original texts indicates the amount of morphological / syntactic complexity that an individual feature contributes to the original text. Note that targeted file manipulation is a text-based method, that is to say the precise amount of a feature’s contribution to the morphological and syntactic complexity of a given text varies according to the morphological and syntactic complexity of the original text, respectively. The extent of this text-dependent variation will be assessed in the following section. On an interpretational level, the morphological and syntactic complexity of each feature is inferred from the amount of complexity it contributes to the original text. Since this complexity is, as already mentioned, to some extent text-dependent, I will refer to it as textual complexity. A feature that increases the complexity of the original text is considered complex while a feature that decreases the complexity of the original text is considered simple (less complex).
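To make the distortion-and-compression step more concrete, the following Python sketch outlines a single iteration. It is purely illustrative: the distortion rates follow the description in this chapter, but the file name and the exact ratio shown are stand-ins, and the actual implementation used in this study is the R script in Appendix B.

# Sketch of one distortion-and-compression iteration (illustrative only).
import gzip
import random

def compressed_size(text):
    # size of the gzip-compressed text in bytes
    return len(gzip.compress(text.encode("utf-8")))

def morphological_distortion(text, rate=0.10):
    # delete 10% of all characters at random, creating new word forms
    chars = list(text)
    keep = set(random.sample(range(len(chars)), int(len(chars) * (1 - rate))))
    return "".join(c for i, c in enumerate(chars) if i in keep)

def syntactic_distortion(text, rate=0.10):
    # delete 10% of all word tokens at random, disrupting word-order regularities
    tokens = text.split()
    keep = set(random.sample(range(len(tokens)), int(len(tokens) * (1 - rate))))
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

with open("alice_sample.txt", encoding="utf-8") as f:   # illustrative file name
    original = f.read()

# one plausible per-iteration ratio: compressed size after distortion relative to
# the compressed size of the undistorted sample (the exact score definition
# follows the formulas introduced earlier in this study)
morph_ratio = compressed_size(morphological_distortion(original)) / compressed_size(original)
synt_ratio = compressed_size(syntactic_distortion(original)) / compressed_size(original)
print(morph_ratio, synt_ratio)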


The targeted manipulation of the above-listed morphological markers and constructions is implemented as follows. First, the texts are annotated with part-of-speech tags using the Stanford CoreNLP tool (Toutanova et al. 2003). The part-of-speech tags permit the automatic manipulation of the morphological markers and facilitate the manual coding of the functional constructions. Generally, each feature is identified and manipulated in such a way that the texts are altered as little as possible. Technically speaking, the markers were automatically deleted, i.e. conjugated verbs and inflected nouns were replaced by their lemma with the help of a Python script2 which identifies the endings on the basis of their part-of-speech tags (see Appendix C for the full script). For instance, the corresponding part-of-speech tag for –ing is VBG. Examples (1)–(5) illustrate how the manipulation of morphological markers was implemented; a simplified code sketch of this lemma-substitution step follows the examples.

(1) a. Alice was [beginning]ing to get very tired of [sitting]ing by her sister on the bank and of having nothing to do: [. . . ]. b. Alice was begin to get very tired of sit by her sister on the bank and of having nothing to do: [. . . ]. [Alice]

(2) a. [. . . ] John [appeared]ed baptizing in the wilderness and proclaiming a baptism of repentance for the forgiveness of sins. b. [. . . ] John appear baptizing in the wilderness and proclaiming a baptism of repentance for the forgiveness of sins. [Mark] (3) a. [. . . ] and was surprised to see that she had put on one of the [Rabbit’s]genitive s little white kid gloves while she was talking. b. [. . . ] and was surprised to see that she had put on one of the Rabbit little white kid gloves while she was talking. [Alice]

(4) a. [. . . ] which sparked [clashes]plural s between angry [demonstrators]plural s and police, according to witnesses. b. [. . . ] which sparked clash between angry demonstrator and police, according to witnesses. [Euro-Congo] (5) a. If the Lisbon treaty is reopened, Cameron has to tread carefully between Tory backbenchers [. . . ] and most of the rest of the EU, who are wary of getting bogged down in a row about what Britain [wants]3rd person s. b. If the Lisbon treaty is reopened, Cameron has to tread carefully between Tory backbenchers [. . . ] and most of the rest of the EU, who are wary of getting bogged down in a row about what Britain want. [Euro-Congo]

Adapted from Ehret (2014)

2Python Software Foundation. Python Language Reference, version 3. URL http://www.python.org
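By way of illustration, the following much-simplified Python sketch shows the kind of tag-driven lemma substitution involved; the actual script in Appendix C operates on Stanford CoreNLP output and covers many more cases, and the tagged triples below are invented for the example.

# Much-simplified sketch of tag-driven marker removal. Input: (token, POS tag,
# lemma) triples; the lemma is assumed to be supplied by the tagger.
def remove_ing(tagged_tokens):
    # replace present participles / gerunds (tagged VBG) with their lemma,
    # e.g. sitting -> sit, leaving all other tokens untouched
    return " ".join(lemma if tag == "VBG" else token
                    for token, tag, lemma in tagged_tokens)

sentence = [
    ("Alice", "NNP", "Alice"), ("was", "VBD", "be"), ("beginning", "VBG", "begin"),
    ("to", "TO", "to"), ("get", "VB", "get"), ("very", "RB", "very"),
    ("tired", "JJ", "tired"), ("of", "IN", "of"), ("sitting", "VBG", "sit"),
]
print(remove_ing(sentence))
# -> Alice was begin to get very tired of sit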


The future markers going to and will were deleted as illustrated in examples (6)–(7). The manipulation of the other functional constructions was conducted manually, as not every present participle ending in –ing and annotated as VBG is part of a progressive construction. The progressive, passive and perfect were thus manually identified and manipulated by deleting the auxiliary be / have and replacing the main verb with its lemma (see examples (8)–(10)). Construction manipulation is implemented as lemma substitution because it does not introduce new irregularity and complexity. For example, replacing verbal constructions such as was singing with irregular past tense forms (sang) instead of the lemma (sing) would add irregularity and thus disproportionately increase the complexity of the text.

(6) a. And, as you might like to try the thing yourself, some winter day, I [will]will tell you how the Dodo managed it. b. And, as you might like to try the thing yourself, some winter day, I tell you how the Dodo managed it. [Alice] (7) a. And he did not want anyone to know for he was teaching his disciples, saying to them, the son of man [is going to]going to be delivered into the hands of men and they kill him. b. And he did not want anyone to know for he was teaching his disciples, saying to them, the son of man be delivered into the hands of men and they kill him. [Mark]

(8) a. Alice [was beginning]progressive to get very tired of sitting by her sister on the bank and of having nothing to do: [. . . ]. b. Alice begin to get very tired of sit by her sister on the bank and of having nothing to do: [. . . ]. [Alice]

(9) a. A further 110 people [were arrested]passive on suspicion of affray. b. A further 110 people arrest on suspicion of affray. [Euro-Congo]

(10) a. More than 140 people [have been]perfect arrested at a protest in central London over the bitterly contested elections in the Democratic Republic of Congo. b. More than 140 people be arrested at a protest in central london over the bitterly contested elections in the Democratic Republic of Congo. [Euro-Congo] Adapted from Ehret (2014)

Constructions with negative contractions such as won’t, hasn’t or isn’t were manipulated by deleting the constructions and replacing n’t with the negative particle not (11).

(11) a. Oh! [won’t]will negative contraction she be savage if I’ve kept her waiting!


b. Oh! not she be savage if I’ve kept her waiting! [Alice]

Adapted from Ehret (2014)

Subsequently, all part-of-speech tags were removed and the plain texts, feature-manipulated and originals, were treated with the compression technique described in the previous chapters. Each text is morphologically and syntactically distorted by removing 10% of all orthographically transcribed characters / word tokens prior to compression. Through morphological distortion new word forms are created, i.e. morphological complexity is increased, which affects the compression performance of languages with an overall simpler morphology. Syntactic distortion leads to the collapsing of word-order regularities and highly affects syntactically complex languages. The distortion and compression script is implemented with N = 1000 iterations in order to obtain statistically robust results which are not due to mere coincidence. The script returns the compressed file sizes of each text before and after syntactic / morphological distortion for every iteration. Based on these file sizes, the average syntactic complexity score and the average morphological complexity score are calculated; specifically, the mean of N = 1000 morphological / syntactic complexity ratios is taken. The intra-sample variation, i.e. the dispersion between the measurements taken in the different iterations, in the three texts (Table 4.3 and Table 4.4) is very small, as indicated by the low standard deviations. This means that, statistically, the mean syntactic and mean morphological complexity scores approximate and reflect the average complexity of the texts well.

On a statistical sidenote, the measurements which are presented in the following sections are calculated from compressed file sizes in bytes (as described above). The differences between the values of the average morphological and syntactic complexity scores across manipulated texts and their originals often seem to be of statistical insignificance. However, this is not the case, as differences between compressed file sizes are very small to start with (Juola 1998). I calculate Tukey’s honestly significant difference test (Tukey’s HSD) for all pairs of the morphological and syntactic complexity scores in the three texts. Tukey’s HSD is a statistical significance test that allows for multiple comparisons in a single step and is based on an ANOVA (Analysis of Variance) table. It calculates whether differences between two means are statistically significant, returning the difference between the two means, a p-value and the upper and lower bounds of the confidence intervals (Baayen 2008: 106–107). Nonetheless, the statistical significance of the measurements is not of primary interest to this analysis. Suffice it to say that Tukey’s HSD demonstrates that even minimal differences between morphological / syntactic complexity scores can be of statistical significance; this means that measurements obtained by compression are statistically robust and replicable. The tables with the full statistics of Tukey’s HSD are provided in Appendix A.
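For readers working in Python rather than R, an equivalent pairwise comparison can be run with statsmodels; the sketch below uses randomly generated placeholder scores solely to illustrate the call, not the values reported in the appendix tables.

# Illustrative Tukey's HSD comparison of mean complexity scores for two versions
# of a text (original vs. one feature-manipulated version). The scores are
# invented placeholders; the study's own statistics were run in R (Appendix B).
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
scores = np.concatenate([
    rng.normal(0.915, 0.002, 1000),   # e.g. original text, N = 1000 iterations
    rng.normal(0.917, 0.002, 1000),   # e.g. text with one marker removed
])
groups = ["original"] * 1000 + ["manipulated"] * 1000

print(pairwise_tukeyhsd(scores, groups, alpha=0.05).summary())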

Table 4.3.: Average morphological and syntactic complexity scores and their standard deviations by text and morph. (Rows: original, –ing, –ed, genitive ’s, plural –s, 3rd ps sg –s; columns: morphological and syntactic scores for Alice, Mark and Euro-Congo.)

Table 4.4.: Average morphological and syntactic complexity scores and their standard deviations by text and construction. (Rows: original, going to, passive, perfect, progressive, will; columns: morphological and syntactic scores for Alice, Mark and Euro-Congo.)


4.2. Analysing morphological markers

Having set the stage by outlining the methodology, I proceed to analyse the effect of the morphological markers –ing, –ed, genitive ’s, plural –s and third person singular –s on the complexity of the three texts at the syntactic and morphological level. First, the variation of textual marker complexity across the three texts is statistically assessed. This is followed by a detailed discussion and comparison of their textual complexity in the three texts. The extent to which the effect of morphological markers on the complexity in the three different texts varies is assessed by pairwise correlation of the complexity scores obtained as described in Section 4.1. Specifically, Spearman’s rank correlation rho is used to measure the similarity between the rankings of the average morphological and syntactic complexity scores of the marker-manipulated texts in Alice, Mark and the Euro-Congo newspaper corpus. The correlation of the syntactic complexity scores of the marker-manipulated texts (Table 4.5) is moderate to negative. While the syntactic ranking of the marker-manipulated texts in Alice and Mark correlates moderately well, the correlation between the other pairs, Alice and Euro-Congo and Mark and Euro-Congo, is negative. The morphological complexity scores (Table 4.6) exhibit moderate correlation for the Alice–Euro-Congo pair but very low correlation for Alice–Mark and Mark–Euro-Congo. In short, the statistical similarity between the textual complexity of morphological markers across the three different texts is moderate.

Figures 4.1–4.3 present the original Alice, Mark and Euro-Congo news texts as well as their marker-manipulated texts according to morphological and syntactic complexity. In all three plots, the originals are located in the top left quadrant and are the morphologically most complex but syntactically most simple text. The marker-manipulated texts are spread across the right middle to lower right half of the plots, i.e. they exhibit lower morphological but higher syntactic complexity than the originals. Thus, in the big-picture perspective, all morphological markers increase the morphological complexity of the texts analysed but decrease their syntactic complexity. The former observation is not unexpected and is in line with the “more is more complex” motto (Arends 2001: 180): the more morphological marker types / distinctions a text contains, the more morphologically complex it is overall (see, for instance, Arends 2001; McWhorter 2001a, 2012; Shosted 2006).

Table 4.5.: Correlation of syntactic complexity scores for marker-manipulated texts across the three texts.

Syntactic correlation    Mark                      Euro-Congo
Alice                    rho = 0.5, p = 0.225      rho = −0.9, p = 0.99
Mark                                               rho = −0.08, p = 0.96


Table 4.6.: Correlation of morphological complexity scores for marker-manipulated texts across the three texts.

Morphological correlation    Mark                      Euro-Congo
Alice                        rho = −0.1, p = 0.61      rho = 0.6, p = 0.18
Mark                                                   rho = 0.3, p = 0.34

The fact that morphological markers decrease syntactic complexity seems surprising. Yet, some markers like –ing are part of morphosyntactic patterns such as the progressive aspect. They are therefore likely to be preceded by a form of the verb be. In very simplified terms, the ing-marker could be said to facilitate the prediction of progressive patterns and thus decrease syntactic complexity.

Let us turn to a more detailed comparison of textual marker complexity in the three texts. On the morphological level, the textual complexity of each morphological marker is inferred from its contribution to the morphological complexity in the original. Technically speaking, the difference between the average morphological complexity score of the marker-manipulated texts and the original text is taken. In this context, markers that increase the morphological complexity of the original are considered information-theoretically more complex than markers that decrease the morphological complexity in the original. In Mark, for instance, the text without third person singular –s exhibits the lowest morphological complexity, i.e. the marker increases the complexity of the original text. The ing-marker, in comparison, increases the morphological complexity of the original to a much smaller degree. Therefore, third person singular –s is information-theoretically more complex on the morphological level than –ing. Table 4.7 gives an overview of the morphological markers ranked according to their textual complexity on the morphological level. In Alice, the ranking of the markers is in increasing order of morphological complexity: –ed, genitive ’s, plural –s, third person singular –s and –ing. Comparing the morphological complexity of the markers in Alice to their complexity in the other two texts, –ing appears to be most affected by variation. In Alice the text without –ing is the most morphologically complex, in the Euro-Congo news text it is positioned in the middle field and in Mark it is the least complex text. Plural –s is similarly affected by intertextual variation, though to a lesser extent: it comes third in the ranking in Alice and Mark but first in the Euro-Congo corpus, and is thus less complex in the news text than in the other two texts.

Apart from these deviations, a “core complexity ranking” of the morphological markers which is identical across all three texts can be established: –ed, genitive ’s and third person singular –s are, in increasing order, morphologically complex independent of the text.

Figure 4.1.: Morphological by syntactic complexity of marker-manipulated texts and original Alice. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

The textual complexity of the morphological markers on the syntactic level is likewise inferred from their contribution to the complexity of the original. In syntactic terms, textual complexity is the amount of complexity a given morph reduces in the original. Take for example the ed-marker: the original Alice text (with ed) is syntactically less complex than the text without ed. Its presence therefore decreases the syntactic complexity of the original. It follows that markers which decrease the syntactic complexity of the original are considered information-theoretically simple.


Table 4.7.: Morphological ranking of morphological markers in Alice, Mark and the Euro-Congo news corpus. Ranking is given in increasing order of morphological complexity.

Alice             Mark              Euro-Congo
–ed               –ing              Plural –s
Genitive ’s       –ed               –ed
Plural –s         Plural –s         Genitive ’s
3rd ps sg –s      Genitive ’s       –ing
–ing              3rd ps sg –s      3rd ps sg –s

The ranking of the markers in Alice according to their textual complexity on the syntactic level is in increasing order of simplicity: –ing, –ed, genitive ’s, plural –s and third person singular –s. Thus, the three s-markers reduce the syntactic complexity of the original more than –ed and –ing. The ranking in the Gospel of Mark comes closest to Alice in that the s-markers tend to be syntactically simpler than the other markers. In the Euro-Congo news corpus, however, –ed and –ing decrease the syntactic complexity of the original to a greater degree than the s-markers.

Table 4.8.: Syntactic ranking of morphological markers in Alice, Mark and the Euro-Congo news corpus. Ranking is given in increasing order of syntactic simplicity.

Alice             Mark              Euro-Congo
–ing              –ing              3rd ps sg –s
–ed               Plural –s         Genitive ’s
Genitive ’s       –ed               Plural –s
Plural –s         3rd ps sg –s      –ed
3rd ps sg –s      Genitive ’s       –ing

Finally, Pearson’s correlation coefficient is used to test whether the complexity of the marker-manipulated texts is related to the token frequency of the morphological markers. On the morphological level, marker frequency and the complexity of the marker-manipulated texts essentially do not correlate. Consequently, more frequent markers do not contribute more complexity to the texts than infrequent markers. Pearson’s correlation coefficient for marker frequency and complexity in Alice and Mark is very low (Alice r = 0.14, p = 0.41, Mark r = 0.27, p = 0.33). In the Euro-Congo news texts, the correlation is moderately high (Pearson’s r = 0.63, p = 0.13). This means that in the newspaper genre, there is a slight, but not significant, trend for morphological complexity to increase with increasing marker frequency. The correlation between the occurrence frequency of the markers and the syntactic complexity of the marker-manipulated texts is negative (Pearson’s correlation coefficient: Alice r = −0.49, p = 0.8, Mark r = −0.61, p = 0.87, Euro-Congo r = −0.02, p = 0.51).


Figure 4.2.: Morphological by syntactic complexity of marker-manipulated texts and original Mark. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

In short, despite the low statistical similarity between the complexity scores of the marker-manipulated texts, the general complexity trend of the morphological markers is very similar. This is another way of saying that the exact measurements vary depending on the original text but that, on the whole, the textual complexity of the markers relative to the complexity of the original text is similar. Apart from one exception, there is little to no positive correlation between the complexity of marker-manipulated texts and their occurrence frequency.


Figure 4.3.: Morphological by syntactic complexity of marker-manipulated texts and original Euro-Congo news corpus. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

4.3. Analysing constructions

This section analyses the five functional constructions progressive aspect (be + verb–ing), perfect aspect (have + verb past participle), passive voice (be + verb past participle), and the future markers will and going to, and their effect on the morphological and syntactic complexity in Alice, Mark and the Euro-Congo news corpus.

The effect of constructions on the complexity of the texts is statistically compared, and the degree of variation measured by pairwise calculation of Spearman’s rank correlation rho for the average syntactic and morphological complexity scores of the construction-manipulated texts in Alice, Mark and the Euro-Congo newspaper corpus. On the syntactic level, construction-manipulated texts correlate moderately to very highly (Table 4.9).

The correlation of syntactic complexity scores between Alice and Mark is very high, so that the effect of the constructions on the syntactic complexity in these two texts is virtually identical. The morphological complexity scores generally correlate highly, with the exception of the only moderate correlation between Alice and the Euro-Congo corpus (Table 4.10).

Table 4.9.: Correlation of syntactic complexity scores for construction-manipulated texts across the three texts.

Syntactic correlation    Mark                      Euro-Congo
Alice                    rho = 0.9, p = 0.042      rho = 0.7, p = 0.12
Mark                                               rho = 0.6, p = 0.18

Figures 4.4–4.6 plot the construction-manipulated texts and the originals by syntactic and morphological complexity. The original texts, positioned in the top right quadrant of the plots, are (almost) the morphologically most complex and syntactically most simple texts. The texts without perfect, passive and progressive constructions exhibit substantially less morphological and syntactic complexity than the originals. Therefore, the presence of these constructions increases both the morphological and syntactic complexity in the texts. This is expected since all functional constructions—future markers excluded—affect both word order regularities and word forms. For instance, the perfect construction consists of two discrete components which are morphologically marked, namely a form of the auxiliary verb have and a verb marked as past participle. The syntactic sequence ‘auxiliary have + past participle form’, simply put, signals a perfect construction. Thus, the construction have laugh-ed cuts across syntax and morphology. The texts without the future markers orbit around the originals across all three texts.

Table 4.10.: Correlation of morphological complexity scores for construction-manipulated texts across the three texts.

Morphological correlation    Mark                     Euro-Congo
Alice                        rho = 0.8, p = 0.67      rho = 0.5, p = 0.23
Mark                                                  rho = 0.7, p = 0.012

The complexity of the texts without going to is almost identical to the complexity of Alice, Mark and the Euro-Congo news texts, and thus its presence or absence hardly affects the complexity of the originals. The future marker will, on the other hand, is the only construction which slightly decreases the morphological complexity in the originals but hardly affects their syntactic complexity at all. Therefore, I conclude that future markers in general hardly affect the complexity of the texts as morphological and syntactic structures remain largely intact even without the markers.

Figure 4.4.: Morphological by syntactic complexity of construction-manipulated texts and original Alice. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

In order to compare the effect of the constructions on the complexity in the three different texts more closely, their textual complexity on the morphological and syntactic level is inferred by taking the difference in morphological / syntactic complexity between each construction-manipulated text and its original.

Since all constructions increase the morphological and syntactic complexity in the original, textual complexity indicates the amount of morphological / syntactic complexity a given construction adds to the original text. Constructions that contribute more morphological and syntactic complexity, respectively, are considered comparatively more information-theoretically complex. In Table 4.11 and Table 4.12 the ranking of the constructions according to their textual complexity is given. In regard to morphological complexity, the rankings of the constructions progressive, perfect and passive are very similar in Alice and Mark. Progressive seems to be less complex than the other two constructions. This dovetails with intuitions, as the ing-participle is more regular than past participles, which come in different forms (e.g. sung, tak-en, call-ed). Irregularity, then, is known to increase complexity (see e.g. McWhorter (2001b), McWhorter (2012)). Only in the Euro-Congo news text is progressive more complex than the other two constructions. The two future markers exhibit the least textual complexity but their order varies: in Alice going to is less complex than will, while their order is reversed in Mark and the Euro-Congo corpus. The differences in the ranking of the constructions according to their textual complexity on the syntactic level are less pronounced, such that only the order of the passive, perfect and progressive constructions varies. In Alice and Mark, the progressive construction exhibits the highest textual complexity while the most syntactically complex construction in the Euro-Congo news text is perfect.

Table 4.11.: Morphological ranking of constructions in Alice, Mark and the Euro-Congo news corpus. Ranking is given in increasing order of morphological complexity.

Alice          Mark           Euro-Congo
going to       will           will
will           going to       going to
Progressive    Progressive    Passive
Perfect        Passive        Perfect
Passive        Perfect        Progressive

Pearson’s correlation coefficient reveals that the effect of constructions on text complexity is not sensitive to frequency effects. In all three texts the correlation between frequency of occurrence and the contribution of the constructions to text complexity is negative both for syntactic (Alice r = −0.84, p = 0.96, Mark r = −0.59, p = 0.85, Euro-Congo r = −0.74, p = 0.92) and morphological complexity (Alice r = −0.59, p = 0.85, Mark r = −0.4, p = 0.75, Euro-Congo r = −0.46, p = 0.78). In other words, the syntactic and morphological complexity of the texts analysed in this section does not depend on, and is not influenced by, the constructions’ frequency in the text.


Table 4.12.: Syntactic ranking of constructions in Alice, Mark and the Euro-Congo news corpus. Ranking is given in increasing order of syntactic complexity.

Alice          Mark           Euro-Congo
going to       going to       going to
will           will           will
Passive        Perfect        Progressive
Perfect        Passive        Passive
Progressive    Progressive    Perfect

In a nutshell, the overall contribution of constructions to text complexity does not vary substantially across genre, and general tendencies regarding morphological and syntactic complexity are largely congruent. Progressive, passive and perfect constructions increase morphological and syntactic complexity in all three texts. The two future markers affect complexity to a much lesser extent than the other constructions. This suggests that analytical invariant markers are less complex than inflectional markers due to their regularity and transparency (Szmrecsanyi & Kortmann 2012; Nichols 2009; Trudgill 2004). Lastly, the complexity contributions of the constructions are not influenced by their frequency in the texts.


Figure 4.5.: Morphological by syntactic complexity of construction-manipulated texts and original Mark. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.


Figure 4.6.: Morphological by syntactic complexity of construction-manipulated texts and original Euro-Congo news corpus. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.


4.4. Summary

In this chapter, targeted file manipulation, a method which permits the detailed analysis of morphological and syntactic complexity in texts, was introduced. Extending previous work in this field (Ehret 2014), the contribution of specific morphological markers and constructions to the morphological and syntactic complexity in three different texts was analysed with a special focus on intertextual variation. On an interpretational level, the textual complexity of these features was derived from their complexity contribution to the text.

Although there is intertextual variation regarding the exact amount of complexity a given feature contributes to a text, general trends can be reliably assessed and hold across different texts. In plain English, the textual complexity of the features assessed is, relative to the complexity of the original text, very similar. Generally, the presence of more morphological marker types increases the morphological complexity of the texts. On the other hand, the presence of more marker types facilitates the algorithmic prediction of syntactic patterns. All the constructions analysed—with the exception of the future markers going to and will—increase morphological but decrease syntactic complexity in literary, religious and newspaper writing. This indicates that higher amounts of morphological markers / inflections generate higher amounts of morphological complexity. Invariant grammatical markers, on the other hand, increase simplicity.

These findings are of threefold importance:

(i) The measurements provide evidence for the effectiveness of targeted file manipulation and bolster the somewhat unorthodox approach of the compression technique because they correspond to previous measurements and metrics of complexity (Arends 2001; McWhorter 2001a, 2012; Kusters 2008; Szmrecsanyi & Kortmann 2009; Trudgill 2004).

(ii) I demonstrate that the complexity measured with targeted file manipulation is largely text-independent. This is of relevance for and validates the results reported in Ehret (2014), who measures information-theoretic complexity of morphological markers and constructions in a mixed-genre corpus.

(iii) The findings throw light upon algorithmically measured complexity because they establish that the algorithm is sensitive to and capable of capturing the (ir)regularity of morphosyntactic structures. The following chapter will pursue this topic in more depth.

To conclude, targeted manipulation can serve as a powerful diagnostic for identifying general complexity trends of specific linguistic structures in written texts.

5. Exploring compression algorithms

5.1. Comparing and interpreting algorithmic complexity

This section describes and interprets algorithmically recognised strings on the basis of gzip’s lexicon output and aims at defining information-theoretic complexity in linguistic terms. To this end, the lexicon output of Alice’s Adventures in Wonderland will be subjected to an in-depth analysis. Moreover, the impact of syntactic and morphological distortion on compressed strings will be discussed by comparing the original Alice lexicon output to the lexica of a syntactically and a morphologically distorted version of Alice.

5.1.1. Interpreting compressed strings

The gzip lexicon output is obtained by configuring the gzip source code and constructing a version of the algorithm which is usually used for debugging, i.e. finding errors in the code of the algorithm. Code listing 5.1 provides the commands used to build the debug version—henceforth referred to as dgzip.1

apt-get source gzip
cd gzip-*
./configure
make CFLAGS+=-DDEBUG

Code listing 5.1: How to build the gzip debug version.

In the next step, the relevant text is compressed by invoking the original gzip algorithm. To retrieve the lexicon of compressed strings, the compressed text is then piped to the configured dgzip with a call for verbose decompression (see Code listing 5.2).

1All commands listed in this chapter were implemented on Debian GNU/Linux, Version 7. URL http://www.debian.org


cat inputtext | gzip -f | dgzip -d -v -v -f

Code listing 5.2: How to retrieve the lexicon output for a given input text.

For ease of use, dgzip is saved as a separate program and incorporated in a shell script (Code listing 5.3) which takes a given input text file, removes all punctuation (using the shell script listed in Appendix D) and subsequently retrieves the lexicon output as described above.

if test $# -ne 1
then
    echo >&2 "Syntax: $0 <inputfile>"
    exit 1
fi

cat "$1" | rmpunc | dgzip -f 2>/dev/null | dgzip -d -f -v -v 2>&1 1>/dev/null | sed '1,/^$/d' | head -n 1 | sed 's/\\/\n\\/g'

Code listing 5.3: Shell script to generate a line-by-line lexicon output.

The shell script returns a line-by-line output consisting of back-referenced sequences, their length and distance to the preceding identical sequence, as well as literal (text) sequences. The minimum length for referenced sequences is three characters including spaces, so that the lexicon does not contain any back-referenced sequences shorter than 3 symbols (Salomon 2007: 230–240). Code listing 5.4 provides the first twenty lines of the lexicon output from the original Alice text. The first line of the output contains no compressed sequences as the algorithm has yet to encounter strings on whose basis the text of the first line can be compressed. The subsequent entries start with a backslash and the distance–length pair in square brackets, which is immediately followed by the compressed sequence of the specified length. For example, the second entry \[29,4]ing by her is to be interpreted as follows: the first integer enclosed in square brackets indicates the distance in characters (including spaces) to the previously encountered identical string (in the search buffer) on whose basis the referenced string (in the look-ahead buffer) is compressed. The second integer in square brackets indicates the length of the compressed string. The referenced compressed string in line two was thus first encountered 29 characters (including spaces) before, and counts 4 characters (including spaces).

Thus, the compressed sequence in this example is ing . Note that spaces are part of compressed sequences.2

alice was beginning to get very tired of sitt
\[29,4]ing by her
\[15,3] sist
\[7,3]er on the bank an
\[41,5]d of hav
\[40,4]ing noth
\[77,7]ing to do
\[40,3] on
\[102,3]ce or tw
\[111,4]ice s
\[51,3]he had peep
\[94,3]ed in
\[37,3] to
\[71,5]the book
\[94,12] her sister
\[151,4]was read
\[120,5]ing but it
\[55,5] had no pictures
\[84,4] or con
\[171,3]versations

Code listing 5.4: First twenty lines of the lexicon output of the original Alice’s Adventures in Wonderland.
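For further processing, output of this form can easily be parsed; the following Python sketch simply mirrors the format described above and splits each entry into distance, length, the back-referenced string and the remaining literal text. Function and variable names are my own and merely illustrative.

# Sketch: parsing the line-by-line dgzip lexicon output into
# (distance, length, compressed string, literal remainder) records.
import re

ENTRY = re.compile(r"^\\\[(\d+),(\d+)\](.*)$")   # \[distance,length]compressed...

def parse_lexicon(lines):
    records = []
    for line in lines:
        m = ENTRY.match(line)
        if m is None:
            continue   # e.g. the first line, which carries no back-reference
        distance, length = int(m.group(1)), int(m.group(2))
        rest = m.group(3)
        compressed, literal = rest[:length], rest[length:]
        records.append((distance, length, compressed, literal))
    return records

sample = ["alice was beginning to get very tired of sitt", r"\[29,4]ing by her"]
print(parse_lexicon(sample)[0])
# -> (29, 4, 'ing ', 'by her')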

The lexicon of compressed strings of the original Alice text contains a total of 16,991 entries including the first line, yet only 11,683 unique strings. In the parlance of linguistics, one could thus say that the lexicon counts 16,991 tokens but only 11,683 types.3 Table 5.1 lists the number of unique strings, and their frequency of compression, i.e. how often a given string was recognised and compressed. In the original Alice lexicon, compression frequencies range from sixteen to one. The strings can be grouped according to their compression frequency: high-frequency strings, frequent strings, rare strings and very rare strings. Among the fifteen high-frequency strings are sequences such as very (compression frequency = 16) and ing (compression frequency = 14).

2In this chapter and throughout this work, spaces at the beginning and end of compressed strings are represented by an open box ‘ ’. Spaces within compressed strings are represented by themselves. 3Note that unless otherwise stated the terms ‘(raw) frequency’ and ‘string frequency’ always refer to the token frequency of strings. ‘Compression frequency’ on the other hand is the number of times a given type of string was compressed.

The group of frequent strings consists of 171 unique entries and contains, for instance, the sequences was (compression frequency = 12) and uch (as in s-uch) (compression frequency = 6). In contrast to the former two categories, strings with a compression frequency smaller than or equal to five, i.e. rare strings, count roughly twice as many unique entries as the category ‘frequent’. Examples of rare strings are king (compression frequency = 5) and uddenly (compression frequency = 4). Strings with a compression frequency smaller than or equal to two constitute the largest group in the lexicon with a total of 10,586 strings. Example sequences of the category ‘very rare’ are herself (compression frequency = 2) and down down down (compression frequency = 1). In other words, over 90 percent of all compressed strings fall into the category ‘very rare’ and occur only once or twice, while only about two percent of all strings fall into the categories highly frequent and frequent. This means that the number of strings decreases with increasing compression frequency. In fact, the distribution of strings strongly resembles a Zipfian distribution (see Figure 5.1 for visualisation). This is interesting but not unexpected as word frequencies in natural human languages are known to follow Zipf’s law and the compressed strings were created from a natural English text. Zipf’s law states that the frequency of a word decreases exponentially with its frequency rank. Thus, the probability of word occurrences in a language sets out high but gradually decreases. This is another way of saying that, in human languages, only a small number of words occur very frequently, while the majority of words occur rarely (Zipf 1935, 1949; Cancho & Solé 2003). This is also true for the frequency distribution of compressed strings. The length of compressed sequences ranges from 3 characters (the minimum length required by the algorithm for back-referencing strings) to 148 characters (including spaces) (Table 5.2). Yet, 85 percent of all strings have a length of three to ten characters, 13 percent of the strings count between eleven and eighteen characters and only a very small percentage of strings exceeds this count. In fact, strings longer than 37 characters occur only once or twice. Thus, a significant trade-off between the length and number of compressed strings can be observed (Pearson’s correlation coefficient r = −0.47, p = 0.0005). This is another way of saying that the number of compressed strings decreases with increasing string length, so that, for example, for the highest character count (length = 148) only one string exists (example (1)).

(1) ome and join the dance will you won’t you will you won’t you will you join the dance will you won’t you will you won’t you won’t you join the dance
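Such frequency and length profiles are straightforward to tabulate once the lexicon has been parsed; the short Python sketch below illustrates the idea on a handful of made-up strings rather than the 16,991 entries of the actual Alice lexicon.

# Sketch: tabulating compression frequency per unique string and the number of
# strings per length; the input list is an invented stand-in for the real lexicon.
from collections import Counter

compressed_strings = ["ing ", " was ", "ing ", " very ", "herself", " the ", " was "]

by_type = Counter(compressed_strings)                      # compression frequency per type
by_length = Counter(len(s) for s in compressed_strings)    # number of strings per length

print(by_type.most_common(3))
print(sorted(by_length.items()))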


Table 5.1.: Distribution of unique strings in the original Alice lexicon according to frequency of compression. The first column categorises the strings according to their frequency.

Category          Unique strings   Compression frequency
highly frequent                4                      16
                               4                      15
                               5                      14
                               2                      13
frequent                      10                      12
                              21                      11
                              18                      10
                               9                       9
                              23                       8
                              34                       7
                              56                       6
rare                         127                       5
                             248                       4
                             536                       3
very rare                  1,580                       2
                           9,006                       1



Figure 5.1.: Distribution of unique strings in the lexicon of the original Alice text. Abscissa indexes number of unique strings, ordinate indexes increased compression frequency.


Table 5.2.: Length of compressed strings by number of compressed strings in the original Alice lexicon. Length is given in characters including white space.

Length    Strings
3             826
4           1,961
5           2,700
6           2,465
7           2,228
8           1,940
9           1,408
10            985
11            582
12            467
13            316
14            279
15            175
16            147
17            112
18            100
19             57
20             59
21             28
22             44
23             20
24             14
25             12
26              9
27             16
28              4
29              4
30              2
31              3
32              2
33              5
34              3
37              2
39              2
40              1
41              1
46              1
47              1
50              2
51              1
53              1
54              2
55              1
57              1
85              1
148             1


In order to determine the nature of compressed strings and thus gain a better understanding of information-theoretic complexity, every string in the original Alice lexicon output was manually analysed and subsequently annotated for linguistic, and non-linguistic, category. For this purpose a six-fold coding scheme was developed, consulting The Longman Grammar of Spoken and Written English (Biber et al. 1999) on word class categories. The coding scheme gives a detailed (linguistic) description of compressed sequences while at the same time allowing for subsequent semi-automatic annotation of the distorted lexica (for more detail on distorted lexica refer to Section 5.1.2). Therefore, words of multiple word class membership were either subsumed under a macro-category or, in cases where this is not possible, assigned a default membership. For example, nouns and verbs were subsumed under the macro-category ‘lexical’. Thus, the resulting coding scheme essentially distinguishes between two major word classes (lexical and functional), other linguistically meaningful units, mixed strings and random chunks:

(i) Lexical. Lexical words include nouns, verbs, adjectives and adverbs (Biber et al. 1999: 62–66) as well as to-infinitives (e.g. to see) and established phrasal verbs (e.g. make out). Examples from the original Alice lexicon are hedgehogs, considering and dreadfully. For practical reasons, auxiliary forms of the verbs have, be and do and the borderline cases ought to, used to and have to were included in this category.

(ii) Functional. This category comprises prepositions, determiners, pronouns, coordinators, subordinators, numerals, the negator not, adverbial particles and wh-words as well as modal verbs (Biber et al. 1999: 69–91). Inserts were also subsumed under this category—despite the fact that they constitute an independent, if somewhat ambiguous, class of words (for a discussion see Biber et al. (1999: 56–57))—because the greetings and response words occurring in the lexicon (e.g. yes, please) are more or less a closed word class (Biber et al. 1999: 56). Furthermore, semi-determiners (e.g. same, such) as well as quantifiers (e.g. every) and subordinators (e.g. yet) with multiple word class membership were by default coded as ‘functional’.

(iii) Other. This category includes word segments, endings and linguistically meaningful “chunks”. Specifically, it contains noun suffixes such as -ment or -ity (for a complete list refer to Biber et al. (1999: 321)), genitive ’s, verb endings such as -ing or -ed, and adjectival / adverbial endings such as -ly or -est. Parts of contractions such as ’ll or ’ve and the endings herSELF and forWARD(S), as well as conclusION and greenISH, were also assigned to this category. Furthermore, any of the above forms plus one or more intact patterns from category (i) and (ii) were also coded as ‘other’ (e.g. ’s no use ).


(iv) Phrasal. Phrasal patterns are defined as multi-word strings. They include combinations and phrases of two or more intact words (e.g. do cats eat bats , there was nothing , her sister ) as well as contractions (e.g. that’s , can’t ). Note that ‘phrases’ as defined here are not identical with prosodic or grammatical phrases as the algorithm lacks knowledge of sentence boundaries or other syntactic units. Phrasal patterns may therefore be combinations of words that, in the original text, belong to different sentences / syntactic units and were formerly separated by punctuation marks such as child said the and gryphon we, or cut-off word sequences such as you ever or the best .

(v) Mixed. Mixed strings contain at least one intact pattern from categories (i), (ii) or (iii) which are mixed with random chunks such as, for instance, the b or abbit was .

(vi) Random. This category consists of random chunks and nonsensical phrases such as cks or ich w.

On a binary scale, I can thus distinguish between linguistically meaningful strings (Example (2)), i.e. strings which belong to categories (i) to (iv), and random, linguistically not meaningful strings (Example (3)), i.e. strings from category (v) and (vi).

(2) a. their b. looked anxiously c. opportunity d. ’d better (3) a. s to f b. dance t c. gree d. omet
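As a rough illustration of how such a binary pre-classification could be automated, the following Python sketch checks whether every whitespace-separated chunk of a compressed string is an intact word. The word list is a tiny invented stand-in for a real lexicon, and the study’s actual annotation was carried out manually with the six-fold scheme above.

# Rough sketch of a binary pre-classification of compressed strings; the word
# list is an invented stand-in, and the real annotation used the manual scheme.
WORDLIST = {"their", "looked", "anxiously", "opportunity", "dance", "the", "to"}

def is_linguistically_meaningful(string):
    # True only if every whitespace-separated chunk is an intact word
    chunks = string.strip().split()
    return bool(chunks) and all(chunk.lower() in WORDLIST for chunk in chunks)

for s in ["their", "looked anxiously", "s to f", "dance t"]:
    print(repr(s), "->", is_linguistically_meaningful(s))
# 'their' -> True, 'looked anxiously' -> True, 's to f' -> False, 'dance t' -> False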

Most strings in the original Alice lexicon belong to the mixed category and twenty percent are random chunks (Table 5.3). The categories ‘phrasal’ and ‘lexical’ make up twenty and fifteen percent of the compressed strings, respectively, while function words and suffixes (category ‘other’) are the two smallest categories. In other words, roughly half of the compressed strings in the original Alice lexicon are linguistically meaningful units while the other half consists of more or less random strings. It is important to note that the number of strings per category does, of course, depend on the lin- guistic composition of the text which serves as input for the algorithm. A lexicon based on a different text can, for example, contain more strings of the category ‘lexical’ or less strings of the category ‘phrasal’ than the Alice lexicon. It goes without saying that the exact matches and compressed se- quences are also strictly text-bound and determined by the input text. The Alice lexicon contains phrases like beautiful soup and lexical words such

as uglification which will not occur in a lexicon created from the Gospel of Mark or a text on the crisis in Iraq. However, the lexicon analysis of Alice demonstrates that lexicon-based compression algorithms such as gzip do capture recurring linguistic structures. Needless to say, this does not mean that compression algorithms possess any kind of linguistic knowledge about the structures they encounter.

Table 5.3.: Raw frequency and percentage of compressed strings by linguistic category in the original Alice lexicon. Note that the percentages were rounded down to the nearest integer and therefore sum up to 99 percent in total instead of 100.

Linguistic category    Raw frequency    Percentage
Functional                       913             5
Lexical                        2,558            15
Mixed                          6,170            36
Other                            175             1
Phrasal                        3,730            22
Random                         3,445            20
Total                         16,991            99

Having established that compression algorithms do indeed capture linguistic structures, the question remains why not every instance of a linguistic structure is compressed. Put differently, what motivates gzip to sometimes ‘recognise’ linguistic structures and sometimes not? For example, why is a given linguistic structure such as ing compressed only once if it occurs 965 times in total in the text? The answer is very simple and related to the workings of the compression algorithm. Lexicon-based compression algorithms of the Lempel-Ziv family—to which gzip belongs—achieve compression by back-referencing redundant strings with the length of the copied string and the distance to the previous, identical string in the search buffer (which serves as referent). In this process, the algorithm tries to find a matching string of maximum length (Ziv & Lempel 1977: 377; Salomon 2007), i.e. the algorithm is greedy. This is another way of saying that the algorithm—agnostic about form-meaning relationships—will choose longer sequences over shorter sequences no matter whether these sequences, from a linguistic point of view, are meaningful or not. Example (4) illustrates maximum length compression of compressed strings containing ing. The text in bold represents the text stored in the search buffer of the algorithm. The unmarked rest represents the text in the look-ahead buffer, which is to be compressed on the basis of the search buffer content. The actual referent strings and their matches are enclosed in square brackets. In the process of compression, then, the algorithm is looking for the longest match to any sequence of letters and spaces stored in the search buffer. In example (4-a) the

search buffer is alice was beginning to get very tired. The longest possible match in the look-ahead buffer is the sequence ing counting four characters. In example (4-b) the first possible match is [ing ] of length four. Yet, the next possible match [ing to the ] counts eleven characters. Hence, the algorithm compresses the second match because it is the sequence of maximum length.

(4) a. alice was beginn[ing ]referent to get very tired of sitt[ing ]match by her sister on the bank [. . . ]
b. and began bow[ing to the ]referent king the queen [. . . ] and then turn[ing to the ]match rose-tree

In the example above, both matched (compressed) strings are linguistically meaningful patterns and in both strings the sequence ing is a verbal suffix. However, the algorithm distinguishes neither between different functions / usages of a given pattern—for example gerund vs. present participle—nor does it distinguish between linguistically meaningful patterns and structurally identical, non-meaningful patterns such as in example (5) where the algorithm matches beginnING with nothING.

(5) alice was beginn[ing to ]referent get very tired [. . . ] and of having noth[ing to ]match do

Furthermore, the greedy behaviour of the algorithm leads to the compression of linguistically nonsensical strings such as in example (6).

(6) she found herself fa[lling ]referent down a very deep well [. . . ] to drop the jar for fear of ki[lling ]match somebody

In summary, gzip compresses any sequence of characters and white space that is a match of maximum length for a given sequence of characters and white space in the search buffer. In many cases, these sequences coincide with linguistically interpretable (surface) structures such as, for instance, suffixes, verbs, nouns or whole phrases, but the algorithm does not systematically select structures, nor does it possess any knowledge of the structures it compresses.4 This is reflected in the fact that the algorithm also compresses nonsensical strings or strings which, on the surface, resemble linguistically meaningful structures. In short, algorithmic compression is based on the form of structures, not on their function and meaning. Information-theoretic complexity must therefore be defined as a measure of structural surface redundancy. This definition comes with an important implication: information-theoretic Kolmogorov-based complexity tends to favour morphological complexity

4The algorithm could theoretically be rewritten to be less agnostic and able to recognise linguistic units. This would presumably affect the objectivity of the method because the algorithm would be told a priori which strings to compress and which strings to ignore.

because algorithmic compression as described here is based on structural redundancy. Structural redundancy, on the other hand, is closely linked to morphological complexity which, in this work, is defined as the complexity related to the structural (ir)regularity of word forms. As a consequence, compression algorithms are not ideal for measuring the overall complexity of languages. Thus, algorithms like gzip should be used with caution and users should bear in mind that algorithmically measured overall complexity is largely a function of morphological complexity.
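To make the greedy matching behaviour described above more concrete, the following minimal Python sketch emulates an LZ77-style longest-match pass over a string. It is not gzip’s actual DEFLATE implementation (which additionally uses hash chains, a 32 KB window and Huffman coding), and all function and variable names are my own; it merely illustrates that matches are selected by surface length alone.

def greedy_matches(text, window=32 * 1024, min_len=3):
    """Toy LZ77-style pass: at every position the longest string that already
    occurs in the preceding window (the search buffer) is back-referenced;
    whether the match is a linguistic unit or not plays no role."""
    matches = []                        # (distance, length, matched string)
    i = 0
    while i < len(text):
        start = max(0, i - window)
        best_len, best_dist = 0, 0
        for j in range(start, i):       # brute force; gzip uses hash chains instead
            length = 0
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:         # DEFLATE's minimum match length is 3
            matches.append((best_dist, best_len, text[i:i + best_len]))
            i += best_len               # the whole match is replaced by one back-reference
        else:
            i += 1                      # emitted as a literal character
    return matches

sample = ("alice was beginning to get very tired of sitting by her sister "
          "on the bank and of having nothing to do")
for dist, length, s in greedy_matches(sample):
    print(f"[{dist},{length}] {s!r}")
# e.g. 'ing ' in sitting and 'ing to ' in nothing to do are back-referenced
# purely because these character sequences recur, not because -ing is a suffix

Running the sketch on a longer text simply yields more and longer back-references; the point is that the selection criterion is surface length, never linguistic status.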

5.1.2. Comparing compressed strings

After having described gzip’s lexicon output on the basis of Alice’s Adventures in Wonderland, I will now describe the impact of morphological and syntactic distortion on the distribution, frequency and nature of compressed strings. For this purpose, gzip’s lexicon output of a morphologically distorted and a syntactically distorted version of Alice will be extracted and analysed. Before discussing the distribution and frequency of compressed strings in the distorted lexica, let us briefly recapitulate how distortion is implemented: morphological distortion is achieved by random deletion of 10% of all orthographically transcribed characters in the text. This is assumed to increase the amount of word forms in the text and thus increase morphological complexity. Syntactic distortion is implemented as random deletion of 10% of all orthographically transcribed words. This leads to the disruption of word-order interdependencies and compromises syntax. The morphologically distorted lexicon is therefore expected to contain more unique strings than the original Alice lexicon while the syntactically distorted lexicon should contain fewer unique strings. Table 5.4 illustrates how distortion affects morphology and syntax in Alice by providing an example passage from a morphologically distorted Alice text and a syntactically distorted Alice text.


Table 5.4.: Distorted passages from Alice’s Adventures in Wonderland. Morphologically distorted tokens are marked in bold. The zero symbol ‘Ø’ indicates where a token was deleted through syntactic distortion.

Distortion      Text
Morphological   alice was egining to get very tired of sitting by her ist on the bank an of havig nothing to do once or wice she had pped into the book her sister was rading but it had n picures or conversatons in it and what is the se of a boo thought ali without pictures or conversation
Syntactic       alice was beginning to get very Ø of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into Ø book her sister was Ø but it had no pictures or conversations Ø it and what is Ø use of a book thought alice without pictures or conversation
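Passages like those in Table 5.4 can be produced with two very simple operations. The author’s actual implementation is the R distortion-and-compression script in Appendix B.2; the Python sketch below is only illustrative, and it treats “orthographically transcribed characters” as alphabetic characters, which is an assumption on my part.

import random

def distort_morphologically(text, rate=0.10, seed=None):
    """Delete a random 10% of all alphabetic characters, creating new
    'word forms' of the kind shown in Table 5.4 (egining, havig, ...)."""
    rng = random.Random(seed)
    positions = [i for i, ch in enumerate(text) if ch.isalpha()]
    drop = set(rng.sample(positions, int(len(positions) * rate)))
    return "".join(ch for i, ch in enumerate(text) if i not in drop)

def distort_syntactically(text, rate=0.10, seed=None):
    """Delete a random 10% of all word tokens, disrupting word-order
    interdependencies (the slots marked 'Ø' in Table 5.4)."""
    rng = random.Random(seed)
    tokens = text.split()
    drop = set(rng.sample(range(len(tokens)), int(len(tokens) * rate)))
    return " ".join(t for i, t in enumerate(tokens) if i not in drop)

passage = ("alice was beginning to get very tired of sitting by her sister "
           "on the bank and of having nothing to do")
print(distort_morphologically(passage, seed=1))
print(distort_syntactically(passage, seed=1))

Because the deletions are random, each run (each seed) yields a different distorted version; the compression experiments therefore average over many iterations.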

Let us first turn to the description of the morphologically distorted lexicon. Code listing 5.5 shows the first twenty lines of the morphologically distorted Alice lexicon, illustrating how morphological distortion impacts on the structure of compressed strings in the lexicon. In line two, for example, the original string her sister was transformed to her ist. Thus, the subsequent occurrence of the original string is not compressed because the pattern is not contained in the search buffer. For this reason, the algorithm—unlike in the original lexicon where the complete sequence her sister (length = 12) is back-referenced—compresses three shorter sequences her (length = 5), ist (length = 3) and er (length = 3). In general terms, three conclusions can be drawn from this example. Firstly, the morphologically distorted lexicon contains more unique strings. Secondly, if it contains more unique strings, compression frequencies are lower (because unique strings are compressed only once), and thirdly, it contains more short strings than the original lexicon. On a linguistic level, the morphologically distorted lexicon should also contain more random strings than the original lexicon.

alice was egining to get very tired of sitt
\[29,4]ing by her ist on the bank an
\[37,4] of havig noth
\[72,7]ing to do
\[38,3] on


\[95,3]ce or w
\[103,4]ice s
\[48,3]he had pp
\[86,3]ed in
\[34,3] to
\[66,5]the book
\[86,5] her s
\[87,3] ist
\[7,3] er
\[141,4]was rad
\[111,5]ing but it
\[52,5] had n picures
\[78,4] or con
\[160,3]versatons
\[73,3] in

Code listing 5.5: First twenty lines of the lexicon output of a morphologically distorted version of Alice.
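Judging from Code listing 5.5, each lexicon line appears to consist of an optional back-reference [distance,length] followed by the text segment whose first length characters are the string the reference resolves to. Under that assumption (which is my reading of the output format, not a documented specification), the compressed strings and their compression frequencies discussed below can be recovered with a short Python replay of the dump:

import re
from collections import Counter

REF = re.compile(r"^\\?\[(\d+)\s*,\s*(\d+)\]")      # matches "\[29,4]" or "[38 ,3]"

def replay_lexicon(lines):
    """Rebuild the running text from a lexicon dump and tally how often
    each back-referenced (compressed) string occurs."""
    text, freq = "", Counter()
    for line in lines:
        m = REF.match(line)
        if m:
            distance, length = int(m.group(1)), int(m.group(2))
            segment = line[m.end():]
            matched = text[len(text) - distance:][:length]   # string the reference points back to
            freq[matched] += 1
            text += segment        # segment = resolved match plus following literal text
        else:
            text += line           # opening line(s): pure literals, nothing to reference yet
    return text, freq

# toy check against the first four lines of Code listing 5.6 below
lines = ["alice was beginning to get very of sitt",
         "\\[23,4]ing by her",
         "\\[15,3] sist",
         "\\[7,3]er on the bank and"]
_, freq = replay_lexicon(lines)
print(freq)        # Counter({'ing ': 1, ' si': 1, 'er ': 1})

Counting unique keys of freq and summing its values reproduces, in principle, the unique-string and entry counts reported in the following paragraphs.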

The statistics of the morphologically distorted lexicon back these observations. The morphologically distorted lexicon consists of 18,452 entries and 13,346 unique strings, i.e. it does indeed contain more unique strings than the original Alice lexicon with 11,683 unique strings and a total of 16,991 entries. This confirms the assumption that morphological distortion leads to the creation of new word forms such as picures or aisy-cha and, ultimately, more random noise which is difficult to compress. In terms of Kolmogorov complexity this means that the lexicon of the morphologically distorted text is longer than the lexicon of the original text. An overview of the compression frequency of unique strings (Table 5.5) shows that the range of compression frequencies in the morphologically distorted lexicon is similar to the original Alice lexicon—ranging from fifteen to one—and the number of strings gradually declines with increasing compression frequency, i.e. string distribution in the morphologically distorted lexicon generally follows Zipf’s law. However, the distorted lexicon contains fewer high-frequency and frequent strings than the original lexicon despite the fact that it contains more strings overall. Most notably, it contains only one high-frequency string (very, compression frequency = 15) while the original Alice lexicon counts fifteen high-frequency strings. The category ‘frequent’ counts 150 strings as opposed to 171 in the original lexicon, while both lexica count roughly the same number of rare strings (morphologically distorted = 913, original = 911). As a consequence, the category ‘very rare’ counts 12,281 strings, considerably more than in the original Alice lexicon (very rare = 10,586). In other words, morphological distortion, which consumes recurrent patterns, leads to lower compression frequencies and an increased number of unique strings.


Table 5.5.: Distribution of unique strings in the morphologically distorted Alice lexicon according to frequency of compression. The first column categorises the strings according to their frequency.

Category           Unique strings    Compression frequency
highly frequent                 1                       15
frequent                        2                       12
                               10                       11
                               12                       10
                               15                        9
                               17                        8
                               35                        7
                               59                        6
rare                          101                        5
                              228                        4
                              584                        3
very rare                   1,862                        2
                           10,419                        1

String lengths in the morphologically distorted lexicon range from 3 to 24 characters including spaces (Table 5.6). In comparison to the original Alice lexicon, where the maximum string length counts 148 characters, the upper limit of 24 characters is much lower. Overall, 95 percent of all strings have a length between three and ten characters. Thus, the morphologically distorted lexicon contains 10 percent more short sequences than the original Alice lexicon. This is statistically reflected in a very high negative correlation between string length and string frequency: Pearson’s correlation coefficient r = −0.78 is very highly significant (p = 0.0000081).


Table 5.6.: Length of compressed strings by number of compressed strings in the morphologically distorted Alice lexicon. Length is given in characters including white space.

Length    Strings
     3      1,260
     4      3,386
     5      3,869
     6      3,250
     7      2,460
     8      1,638
     9        982
    10        595
    11        333
    12        238
    13        152
    14         94
    15         48
    16         46
    17         32
    18         25
    19         12
    20         15
    21          6
    22          3
    23          3
    24          4

As mentioned in the beginning of this chapter, the syntactically distorted lexicon should be shorter, i.e. it should contain fewer entries and unique strings than the original lexicon due to the deletion of 10% of all word tokens in the text. Furthermore, word-order interdependencies should be compromised through syntactic distortion and, as a result, uncompressible noise should be created. To exemplify syntactic distortion and its effect on the lexicon, the first twenty lines of the syntactically distorted Alice lexicon output are provided in Code listing 5.6. In the first line the word tired was deleted from the text. As a consequence, the pattern ed in peepED is not in the buffer and is thus not back-referenced. In contrast to the morphologically distorted lexicon, where the algorithm compressed several shorter sequences instead of the original pattern, the algorithm omits the sequence altogether. Thus, overall fewer lexicon entries are created and the syntactically distorted lexicon consists of only 15,827 entries and 11,041 unique strings.


alice was beginning to get very of sitt
\[23,4]ing by her
\[15,3] sist
\[7,3]er on the bank and
\[41,4] of hav
\[40,4]ing noth
\[71,7]ing to do
\[40,3] on
\[96,3]ce or tw
\[105,4]ice s
\[51,3]he had peeped in
\[37,3]to book
\[90,12] her sister
\[141,5]was but it
\[43,5] had no pictures
\[72,4] or con
\[153,3]versations
\[36,4] it
\[125,4]and wha
\[48,3]t is use

Code listing 5.6: First twenty lines of the lexicon output of a syntactically distorted version of Alice.

The compression frequency, ranging from eighteen to one, as well as the distribution of unique strings in the syntactically distorted lexicon (Table 5.7) is very similar to the distribution and frequency of strings in the original Alice lexicon and also follows Zipf’s law. The syntactically distorted lexicon counts twelve high-frequency strings (original = 15) and 173 frequent strings (original = 173). As the syntactically distorted lexicon contains fewer strings overall, the number of strings in the categories ‘rare’ and ‘very rare’ is, at 784 and 10,071 strings, respectively, lower than in the original lexicon (rare = 911, very rare = 10,586). String lengths in the syntactically distorted lexicon range from 3 to 47 characters including spaces (Table 5.8). Although the maximum string length is larger than in the morphologically distorted lexicon, it is still considerably lower than in the original Alice lexicon (maximum length = 148). However, 87 percent of the strings in the syntactically distorted lexicon count between three and ten characters, 12 percent count between eleven and eighteen characters and only about 1 percent is equal to or longer than 19 characters. In other words, the distribution of strings according to their length is virtually identical to the distribution of strings in the original lexicon. Furthermore, I observe the typical negative trade-off between string length and frequency (Pearson’s r = −0.7, p = 0.0000032).


Table 5.7.: Distribution of unique strings in the syntactically distorted Alice lexicon according to frequency of compression. The first column categorises the strings according to their frequencies.

Category           Unique strings    Compression frequency
highly frequent                 1                       20
                                1                       18
                                1                       16
                                2                       15
                                3                       14
                                4                       13
frequent                        9                       12
                               10                       11
                               15                       10
                               15                        9
                               27                        8
                               42                        7
                               55                        6
rare                          108                        5
                              199                        4
                              477                        3
very rare                   1,466                        2
                            8,605                        1

All in all, the syntactically distorted lexicon is almost identical to the original Alice lexicon in regard to compression frequency and distribution of compressed strings. The only major difference seems to lie in the total number of lexicon entries and unique strings which is, as expected, lower in the syntactically distorted lexicon. Thus, while syntactic distortion does lead to the disruption of word-order interdependencies (see Table 5.4 above), the creation of random noise in the text is not as overtly reflected in the syntactically distorted lexicon as it is in the morphologically distorted lexicon.


Table 5.8.: Length of compressed strings by number of compressed strings in the syntactically distorted Alice lexicon. Length is given in characters including white space.

Length    Strings
     3        776
     4      1,902
     5      2,626
     6      2,382
     7      2,159
     8      1,797
     9      1,286
    10        876
    11        539
    12        399
    13        282
    14        221
    15        132
    16        112
    17         87
    18         59
    19         43
    20         42
    21         34
    22         24
    23         10
    24          8
    25          4
    27         11
    28          2
    29          3
    31          1
    33          1
    34          2
    35          1
    39          2
    40          1
    47          1


Let us now focus on the linguistic interpretation of compressed strings in the morphologically and syntactically distorted lexica. On the basis of the coding scheme for linguistic categories described in Section 5.1.1, all compressed strings in the distorted lexica were semi-automatically annotated using the programming language and statistics package R.5 Specifically, the annotated lexicon of the original Alice text was used as reference dictionary for the coding of identical sequences occurring in the distorted lexica. Sequences which did not occur in the original lexicon were manually annotated according to this categorisation scheme. Ambiguous sequences which could potentially belong to more than one category were manually disambiguated. Sequences such as ing or thin, for instance, can either be linguistically meaningful units (e.g. havING, thin) or nonsensical strings such as in nothING or THINk. In the case of the syntactically distorted lexicon, the category ‘phrasal’ was subjected to manual verification in order to eliminate syntactically distorted phrases from which words were deleted in the process of distortion. These “junk” phrases were subsumed under the category ‘mixed’ because their components (words) are intact and linguistically meaningful despite the fact that syntax is corrupted. Example (7) gives two such corrupted phrases (in bold) along with their context in the original text. The zero symbol ‘Ø’ marks where a word was deleted in the process of syntactic distortion.

(7) a. Then she looked at the Ø of the well and noticed that they were filled with cupboards and book-shelves. b. [. . . ]yet you finished the goose with the bones and the beak, pray how Ø you manage to do it?
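The reference-dictionary step lends itself to a few lines of code. The author performed this semi-automatic annotation in R (see footnote 5); the Python sketch below merely illustrates the logic: labels a sequence already carries in the annotated original Alice lexicon are reused, and everything else, including ambiguous cases, is queued for manual coding. Function names and the toy data are mine.

def annotate(strings, reference, manual_decisions=None):
    """Assign linguistic categories to compressed strings: labels known from
    the annotated original lexicon are reused, everything else is queued for
    manual annotation or disambiguation."""
    manual_decisions = manual_decisions or {}
    labels, to_check = {}, []
    for s in dict.fromkeys(strings):            # deduplicate, keep order
        if s in reference:                      # identical sequence in the original lexicon
            labels[s] = reference[s]
        elif s in manual_decisions:             # e.g. 'ing' coded by hand as suffix vs. chunk
            labels[s] = manual_decisions[s]
        else:
            to_check.append(s)                  # new sequence, left for manual coding
    return labels, to_check

reference = {"their": "lexical", "looked anxiously": "phrasal", "cks": "random"}
labels, to_check = annotate(["their", "cks", "lizad"], reference)
print(labels)      # {'their': 'lexical', 'cks': 'random'}
print(to_check)    # ['lizad'] -> a distortion-created form, annotated by hand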

Figure 5.2 displays the percentage of compressed strings for each of the six linguistic categories in the distorted lexica and, for comparison, in the original Alice lexicon. The raw frequencies of strings per linguistic category are provided in Table 5.9. The proportion of random strings in the morphologically distorted lexicon is roughly twice as high as in the original and the syntactically distorted lexica. The morphologically distorted lexicon contains about half the amount of phrasal strings, and also fewer lexical and functional strings than the other two lexica. The percentage of strings per category in the syntactically distorted lexicon is virtually identical to the percentage of strings in the original lexicon. The effect of morphological distortion on the nature of compressed strings is obvious and as expected: through morphological distortion, new “word forms” such as lizad and shoed are created which can no longer be considered linguistically meaningful. Consequently, the lexicon contains more random strings than the original lexicon. Interestingly, the percentage of

5R 2.15.1. R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

mixed strings remains stable and is about the same size as in the other two lexica. This is probably due to the fact that previously linguistically meaningful strings are either distorted to mixed strings or to random strings, while strings that are linguistically non-meaningful to begin with will remain in these two categories. Thus, overall, we observe a shift from linguistically meaningful strings to mixed strings to random strings, resulting in an increased number of random strings in total.


Figure 5.2.: Percentage of compressed strings according to linguistic category in the three Alice lexica.

Again, the effect of syntactic distortion on the lexicon output is more subtle than the impact of morphological distortion. In fact, the percentages of strings per category are, as noted above, almost identical to the original lexicon. The syntactically distorted lexicon does not contain more random strings than the original lexicon but, while containing fewer strings in total, the percentage of linguistically interpretable strings remains near-identical. As a matter of fact, the number of strings per category varies only between

two and three percent across the two lexica. The question, then, is to what extent syntactic distortion affects the nature of compressed strings. Or, in other words, what happens to inter-word dependencies which should be compromised through distortion? In order to shed light on this matter, the number of syntactically distorted phrases is retrieved from the mixed category. In total the lexicon contains only 101 junk phrases, which represents merely about 1% of the total strings in the syntactically distorted lexicon. Thus, while syntactic interdependencies are corrupted through distortion, the distribution of compressed strings on the linguistic level is not influenced in a major way. Rather, syntactic distortion, as has been discussed in Section 5.1.1, leads to the omission and reduction of compressed strings. To sum up, morphological and syntactic distortion affect gzip’s lexicon output very differently. While morphological distortion leads, as expected, to an increased number of random / unique strings in the lexicon, the impact of syntactic distortion is very subtle so that the syntactically distorted lexicon is near-identical to the original lexicon. Nevertheless, the analysis of the distorted lexica illustrated the process of distortion and demonstrated that it affects the compressibility of texts as intended.

Table 5.9.: Raw frequencies of compressed strings by linguistic category in the three Alice lexica.

Linguistic category    Original    Syntactically distorted    Morphologically distorted
Functional                  913                        835                          749
Lexical                   2,558                      2,495                        1,826
Mixed                     6,170                      5,970                        6,759
Other                       175                        162                          165
Phrasal                   3,730                      3,364                        1,722
Random                    3,445                      2,999                        7,230


5.2. Summary

This chapter took a peek inside the black box of gzip’s algorithm in order to gain understanding of the compression technique and define Kolmogorov-based information-theoretic complexity in linguistic terms. This was accomplished by extracting and analysing gzip’s lexicon, a line-by-line output of compressed text sequences, for Alice’s Adventures in Wonderland. The first section of this chapter gave a detailed description of the distribution of compressed strings in the lexicon of the original Alice text. Every entry in the lexicon was manually analysed and annotated according to linguistic and non-linguistic categories such as lexical or functional words, other linguistically interpretable sequences, phrasal sequences, random non-linguistically meaningful sequences or mixed sequences (containing both meaningful and random sequences). The lexicon analysis revealed that the algorithm compresses both linguistically meaningful and random strings because, as expected, it does not possess any linguistic knowledge. Rather, gzip works on the form and structure of (ir)regularities in a text. Thus, Kolmogorov-based information-theoretic complexity is essentially a measure of structural surface redundancy. This implies that the methodology is morphology-sensitive and slightly tends to favour morphological complexity as very high structural redundancy can result in low overall complexity and vice versa. The compression technique is therefore not an ideal tool for measuring overall complexity and should be applied with caution. In the second section of this chapter, the impact of distortion on compressed strings was assessed by analysing two distorted lexica of Alice’s Adventures in Wonderland: a syntactically and a morphologically distorted version of Alice were described, annotated and compared to the original Alice lexicon. The survey of string distribution in the distorted lexica confirmed that morphological distortion does indeed lead to an increased amount of “word forms” in the morphologically distorted text and results in an increased number of unique strings which are difficult to compress. In the syntactically distorted text, syntactic inter-dependencies and word order regularities were compromised through distortion as intended. Strings which were previously (i.e. in the undistorted Alice version) added to the lexicon were no longer recognised by the algorithm and were thus omitted in the syntactically distorted lexicon. On a linguistic level, the morphologically distorted lexicon contains a higher percentage of linguistically non-meaningful strings than the original Alice lexicon while the syntactically distorted lexicon contains fewer strings altogether. To sum things up, compression algorithms do not intentionally measure or count linguistic features because they do not possess any knowledge of form-function pairings. Instead it was shown that Kolmogorov-based information-theoretic complexity is a measure of structural surface redundancy, based on the recurrence of orthographic character sequences.


Moreover, it was established that the processes of morphological and syntactic distortion affect text compressibility as intended.


6. Case studies

This chapter is, ultimately, concerned with the applicability of algorithmic measurements to naturalistic corpus resources and the extent to which intra-linguistic complexity variation can be algorithmically approximated. Thus, two case studies are presented in which the compression technique is applied to naturalistic corpus data. Specifically, the compression technique will be used to assess intra-linguistic variability in English in terms of overall, syntactic and morphological complexity. Empirically, I study two text corpora, the British National Corpus (BNC) and the International Corpus of Learner English (ICLE). The first case study assesses the complexity variation across different written registers of British English. The results show that more formal registers are overall, and morphologically, more complex than less formal registers, which tend to be syntactically more complex. The second case study takes an interest in the complexity of learner essays produced by students with different levels of instructional exposure in English and, in a small subset of the data, measures the complexity of the learner essays across national varieties. All other things being equal, the amount of instruction in English is shown to be an indicator of the writing proficiency of the learners. In fact, the amount of instructional exposure positively correlates with the complexity of the learners’ essays, i.e. texts produced by more advanced learners are overall more complex than texts produced by less advanced learners. Although mother tongue seems to influence the complexity of learner essays, the observed relationship between Kolmogorov complexity measurements and the amount of instruction received in English is robust across different mother tongue backgrounds.

6.1. Assessing complexity variation in British English registers

6.1.1. Method and data

This section draws on data from the British National Corpus (BNC World Edition). The BNC is a general, synchronic corpus of standard British English and samples a variety of different spoken and written registers amounting to 100 million words in total (Aston & Burnard 1998). The corpus is fully part-of-speech annotated and comes in SGML (Standard Generalized Markup Language) format. This study is restricted to the written component

of the BNC because, in contrast to the spoken component, sentence boundaries are clearly marked.1 In this case study the presence of marked sentence boundaries is important for the generation of equally sized samples and the implementation of the compression technique (for more details see below). The written part counts approximately 90 million words and comprises 46 different registers such as academic writing from the social sciences (W ac soc science), school essays (W essay sch) or poetry (W fict poetry). These registers fall into eighteen macro-registers (Aston & Burnard 1998). Table 6.1 lists the macro-registers in the BNC World Edition which were used in this analysis. Note that the three different newspaper types (broadsheet, tabloid and other) are listed and analysed as separate registers.

Table 6.1.: Overview of the number of texts per written macro-register and the newspaper micro-registers in the BNC World Edition. The first column gives the class code by which the individual registers can be identified in the corpus.

Class code         Register                    Texts
W ac               academic writing              501
W essay            essay                          11
W fict             fiction                       464
W letters          letters                        17
W newsp brdsht     broadsheet newspapers         340
W newsp tabloid    tabloid newspapers              6
W newsp other      other newspapers              140
W non ac           non-academic writing          534
W pop lore         popular lore                  211
W religion         religion                       35
W admin            administrative writing         12
W advert           advertisements                 60
W biography        biography                     100
W commerce         commerce                      112
W email            email                           7
W misc             miscellaneous writing         500
W news script      news scripts                   32
W hansard          hansard                         4
W institut doc     institutional documents        43
W instructional    instructional writing          15

1This does not mean that spoken data cannot be assessed with the compression technique. Any kind of language, be it written or spoken, can be used as input for the compression technique as long as it comes in machine-readable form and an appropriate format—in this case study, the technique requires clearly marked (e.g. by a fullstop) sentence boundaries.

All annotation and mark-up was removed from the data and the files were converted to plain text format using several shell scripts (see Appendix D). On the basis of their class code—an SGML tag which identifies the individual registers—all text files per register were retrieved and subsequently merged. The current dataset thus consists of one text file per (macro-)register, amounting to twenty text files. Methodologically, the open source compression program gzip2 is used to measure linguistic complexity at the overall, syntactic and morphological tier. Largely following Juola (2008), the relative informativeness of the text samples is approximated with a compression algorithm and taken as indicator of a given sample’s complexity. The compression technique is implemented with N = 1,000 iterations using random sampling. In plain English, the script randomly samples 10% of the sentences per text / macro-register for each iteration. By sampling sentences rather than words, syntactic interdependencies remain intact. This serves two purposes: first, the samples analysed per register are of the same, constant size, which ensures the comparability across the different text samples. Second, random samples of a constant size—especially if they are taken from differently sized populations—should be more representative of the entire population than a fixed sample covering only a certain amount of the text. Keeping these parameters constant, any observed variability in complexity should be attributable to the factor ‘register’. Overall complexity is measured through multiple compression. For each register and iteration two measurements are obtained: the file size in bytes before compression, and the file size in bytes after compression. The mean compressed file size and the mean uncompressed file size is then calculated for each register. Subsequently, the correlation between these two file sizes is eliminated through linear regression analysis with the compressed file sizes as dependent variable, and the uncompressed file sizes as independent variable. This yields the average adjusted overall complexity score (regression residuals) which serves as indicator of the overall complexity of a given text sample. Larger scores can be equated with higher informativeness of a given text sample and thus higher overall complexity. The mean uncompressed and compressed file sizes from which the average adjusted overall complexity scores were calculated, as well as their standard deviations, are presented in Table 6.2. Morphological and syntactic complexity are addressed by distorting the respective level of information in a given text sample and subsequently measuring the impact on the compressibility of the sample. Let us briefly rehearse how distortion is implemented. Morphological distortion is achieved by random deletion of 10% of all orthographically transcribed characters in a text file, leading to the creation of new word forms, and thus random noise. Texts which have a large number of different word forms to start with should be less affected by distortion than texts which have a comparatively smaller number of word forms.

2gzip (GNU zip), Version 1.2.4. URL http://www.gzip.org/

In the latter, morphological distortion should create comparatively more random noise, i.e. it should increase the morphological complexity in the sample. Comparatively worse compression ratios thus indicate lower morphological complexity. Syntactic distortion is accomplished by random deletion of 10% of all orthographically transcribed word tokens in a text sample. This should greatly affect syntactically complex texts—syntactic complexity is defined in terms of word order whereas maximum flexibility is equated with low complexity—because it disrupts word order regularities. Comparatively bad compression ratios after syntactic distortion indicate high syntactic complexity. As described above, the distortion and compression loop (see Appendix B.2) is applied with random sampling and N = 1,000 iterations. For each iteration of the loop, the script returns the morphological and syntactic complexity score of each text sample. The morphological complexity score is defined as −m/c, where m is the compressed file size after morphological distortion and c the original compressed file size. The syntactic complexity score is defined as s/c, where s is the compressed file size after syntactic distortion and c the file size before distortion. Finally, I obtain the average morphological and average syntactic complexity score for each sample by taking the mean of the total number of measuring points (Table 6.3).
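For readers who prefer code to prose, the following Python sketch restates the measuring procedure just described. The author’s actual implementation is the R distortion-and-compression script in Appendix B.2; this version, with my own function names and simplifications (e.g. treating “orthographically transcribed characters” as alphabetic characters), only illustrates the sentence sampling, the gzip compression step, the −m/c and s/c scores, and the regression residuals that serve as adjusted overall complexity scores.

import gzip
import random
import statistics

def compressed_size(text):
    return len(gzip.compress(text.encode("utf-8")))

def delete_chars(text, rng, rate=0.10):           # morphological distortion
    return "".join(ch for ch in text if not (ch.isalpha() and rng.random() < rate))

def delete_tokens(text, rng, rate=0.10):          # syntactic distortion
    return " ".join(t for t in text.split() if rng.random() >= rate)

def complexity_scores(sentences, n_iter=1000, sample_rate=0.10, seed=0):
    """Average morphological (-m/c) and syntactic (s/c) scores for one register,
    plus the mean file sizes that feed the overall-complexity regression."""
    rng = random.Random(seed)
    morph, synt, raw, comp = [], [], [], []
    for _ in range(n_iter):
        sample = " ".join(rng.sample(sentences, int(len(sentences) * sample_rate)))
        c = compressed_size(sample)
        raw.append(len(sample.encode("utf-8")))
        comp.append(c)
        morph.append(-compressed_size(delete_chars(sample, rng)) / c)
        synt.append(compressed_size(delete_tokens(sample, rng)) / c)
    return (statistics.mean(morph), statistics.mean(synt),
            statistics.mean(raw), statistics.mean(comp))

def adjusted_overall(uncompressed_means, compressed_means):
    """Residuals of an ordinary least-squares fit of mean compressed on mean
    uncompressed file size, one data point per register."""
    n = len(uncompressed_means)
    mx = sum(uncompressed_means) / n
    my = sum(compressed_means) / n
    slope = (sum((x - mx) * (y - my)
                 for x, y in zip(uncompressed_means, compressed_means))
             / sum((x - mx) ** 2 for x in uncompressed_means))
    intercept = my - slope * mx
    return [y - (intercept + slope * x)
            for x, y in zip(uncompressed_means, compressed_means)]

In this sketch, complexity_scores would be called once per register file and adjusted_overall once over the resulting twenty pairs of mean file sizes; positive residuals then correspond to above-average overall complexity.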

Table 6.2.: Mean uncompressed and compressed file sizes (in bytes) and their standard deviations by register.

135 Case studies eiin-.70007 .090.0065 0.0083 0.0061 0.0055 0.0103 0.0058 0.0059 0.0058 0.9069 0.9088 0.0070 0.0099 0.9063 0.9058 0.0069 0.0069 0.0081 0.9110 0.0579 0.0635 0.9060 0.9061 0.0074 0.9059 0.0068 0.0476 0.0469 0.9074 0.9106 0.0060 0.9069 0.0758 0.9072 0.0568 0.0519 0.9086 0.0529 -0.9710 0.0063 0.0066 -0.9762 0.0057 0.0103 0.9078 0.0606 0.0774 0.9072 -0.9572 -0.9577 0.0459 0.0579 0.9063 0.0611 -0.9911 -0.9694 -0.9636 0.0792 -0.9648 0.0710 0.9066 0.9069 0.9059 0.9112 -0.9743 -0.9910 0.0497 -0.9505 -0.9712 newspapers Tabloid -0.9719 scripts News 0.0636 0.0589 0.0519 Religion 0.0943 -0.9943 lore -0.9852 Popular newspapers Other -0.9612 Non-academic Miscellaneous Letters -0.9774 -0.9695 Instructional -0.9640 -1.0104 documents Institutional Hansard deviation Fiction Standard score Essays Syntactic Email deviation Standard Commerce score Morphological newspapers Broadsheet Biography Advertisements Administrative Academic Register al 6.3.: Table vrg opooia n ytci opeiysoe n hi tnaddvain yregister. by deviations standard their and scores complexity syntactic and morphological Average


6.1.2. Register variability

This section discusses register variability in terms of overall, morphological and syntactic complexity comparing the twenty written BNC registers introduced above. An overall complexity hierarchy of the registers is presented in Figure 6.1. The extreme cases on the complex side of the spectrum are broadsheet newspapers and popular lore. Other above-average complex registers are, in decreasing order of overall complexity: miscellaneous, non-academic, other newspapers, academic, biography, instructional, advertisement, tabloid newspapers, religion. The registers essays and news scripts are of average overall complexity as indicated by the average adjusted overall complexity score close to zero. The extreme cases on the “simple” side of the spectrum are administrative and hansard. Below-average complex registers are, in decreasing order of complexity: commerce, institutional documents, letters, email and fiction. Generally, less formal registers, i.e. registers which are relatively close to spoken language such as email or fiction, are less complex than more formal registers such as newspapers which are known to be subject to strict editorial and economy-driven constraints. In assessing grammatical variation in written and spoken texts Biber (1988) establishes several dimensions along which texts typically vary. One of these dimensions is labelled ‘informational versus involved production’ and refers to a highly informational, abstract style that is the result of planned and edited production on one end of the dimension and, on the other end, to an involved emotional style that is produced in interaction (Biber 1988: 104–108). Using the parlance of Biber (1988) then, I find that more involved registers are less complex than informational-abstract registers. In particular the three newspaper registers lend themselves well for illustrating complexity variation along the formality / register dimension: the most formal broadsheet newspapers are the most complex register, other regional newspapers are less complex and tabloid newspapers, which are known for their involved style, are the least complex newspaper register. In this vein, the registers fiction, letters and email are below-average complex while the newspaper registers, academic writing and biography are above-average complex. However, the ranking is not flawless as it is unexpected that administrative writing, hansards and institutional documents—all rather formal registers—are less complex than, for instance, popular lore. Figure 6.2 shows the BNC registers according to morphological and syntactic complexity. For the sake of readability, the registers were categorised into “more informational” and “more involved” registers according to the scale presented in Biber (1988: 128). Administrative writing is the syntactically most complex but morphologically least complex register. As noted before, morphological complexity seems to interact with overall complexity in that it is more strongly reflected in the overall measure than syntactic complexity.



Figure 6.1.: Overall complexity hierarchy of written BNC registers. Negative residuals indicate below-average complexity; positive residuals indicate above-average complexity.

This is due to the nature of the algorithm which essentially relies on structural (surface) redundancy for compression and thus somewhat favours morphological complexity. Hence, administrative writing, which loads low on the morphological dimension, is the overall least complex sample. Fiction is the morphologically most complex register and popular lore is the syntactically least complex register. Tabloid newspapers, biography, other newspapers, broadsheet newspapers, miscellaneous and non-academic writing cluster together in the top left quadrant; they are roughly of the same morphological and syntactic complexity with a tendency to above-average complexity on the morphological tier and below-average complexity on the syntactic tier. Most of the other registers are scattered across the left upper centre of the plot and exhibit average morphological and syntactic complexity. The registers commerce, and notably institutional

documents are of below-average morphological complexity. Both hansard and letters are very low in morphological complexity but high in syntactic complexity. Furthermore, in this sample of registers, morphological complexity significantly trades off against syntactic complexity and vice versa (Pearson’s correlation coefficient r = −0.78, p = 0.0001). Szmrecsanyi (2009) investigates intra-linguistic variability in terms of analyticity and syntheticity, two measures which have a long tradition in linguistic typology and are closely linked to the current complexity debate. Analyticity basically refers to the number of unbound grammatical markers in a text sample while syntheticity refers to the number of bound grammatical markers in a text sample (Szmrecsanyi 2009: 319–320, 322). Szmrecsanyi (2009) analyses, inter alia, the variability of the written and spoken macro-registers in the BNC and reports that high analyticity indices correlate with involved production while abstract informational registers tend to be more synthetic. Although analyticity and syntheticity are not directly related to Kolmogorov complexity measurements, and are in fact two very different things, I will nevertheless assess to what extent the current results correspond to Szmrecsanyi’s findings. Analyticity refers to free (unbound) structures and will therefore be compared to syntactic complexity, which is basically a measure of word order flexibility and could be said to refer to unbound structures (words). Syntheticity could be said to roughly correspond to morphological complexity in that both measures refer to bound grammatical markers. Thus, testing this against the algorithmic register analysis, I correlate my results with the indices reported in Szmrecsanyi (2009: 333). Note that the separately analysed newspaper registers are excluded. Pearson’s correlation coefficient for analyticity and the average syntactic complexity scores is positive but low (r = 0.34, p = 0.09). The correlation between syntheticity and the average morphological complexity scores is likewise positive but very low (r = 0.28, p = 0.14). While the correlation between the two measures is very low—which is not surprising since they are not directly related and hence comparable—it is positive. In regard to the relationship between the formality of the register and complexity, my findings therefore tie in with Szmrecsanyi’s (2009) findings and suggest that more formal / abstract-informational registers tend to be morphologically more complex than informal / involved registers which tend to be syntactically more complex (e.g. newspapers vs. emails, biography vs. letters). In sum, written British English registers vary in overall and morphological complexity such that less formal / involved registers are less complex than formal / abstract-informational registers. On the other hand, less formal registers have a tendency to be more syntactically complex than formal registers. On the whole, these Kolmogorov-based results are in line with previous findings (e.g. Szmrecsanyi (2009)), outliers notwithstanding.


Figure 6.2.: Morphological complexity by syntactic complexity of written BNC registers. More informational registers are represented by blue triangles, more involved registers by red filled circles. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.


6.2. Measuring complexity in learner English3

6.2.1. Method and data

This section investigates data from the International Corpus of Learner English (ICLE, version 1). ICLE, first published in 2002, is a corpus of written learner English (English as a foreign language) and contains both argumentative and literary essays composed by higher intermediate to advanced learners from 11 different mother tongue backgrounds: Bulgarian, Czech, Dutch, Finnish, French, German, Polish, Russian, Spanish and Swedish. The corpus comprises approximately 200,000 words per national variety and counts 2.5 million words in total. All national components of the corpus were designed according to the same guidelines and provide extensive data on learner and task variables as listed below (Granger et al. 2002).

Learner variables:

– age

– gender

– mother tongue

– regional provenance (e.g. Belgian Dutch vs. Netherlandic Dutch)

– knowledge of other foreign languages

– time spent in an English speaking country

– time spent studying English at school

– time spent studying English at university

Task variables:

– topic

– length

– argumentative vs. literary essay

– timed vs. untimed essay

– exam conditions vs. use of reference tools

3A partial summary of this chapter has appeared as Ehret & Szmrecsanyi (2016a).


This study takes an interest in the complexity of learner language as represented by the essays sampled in ICLE. In particular, the relationship between the complexity of the learner essays and the amount of previous instruction in English is explored. For this reason, the data sampled in ICLE is categorised into six different groups according to the amount of instructional exposure, i.e. the number of years of instruction in English at school and university. Furthermore, the dataset is restricted to argumentative essays, which constitute the largest part of the data, because content control is a crucial factor for the successful application of the compression technique. In the following a brief account of the categorisation procedure and guidelines is given. First, the number of texts for every combination of years spent studying English at school and university was surveyed. Table 6.4 shows the number of argumentative essays according to the number of years the learners have spent studying English at school and university. In some cases, very little data exists or the existing data stems from learners who have not attended university. For example, there are only four texts from learners who have studied English for one year at school but have not studied English at university at all (Table 6.4, first row). In order to obtain a representative sample for each learner group, such borderline cases were excluded; the range of years at school was restricted to 4–9 years and the range of years at university to 1–5 years.

Table 6.4.: Number of argumentative essays according to years of studying English at school and university in ICLE.

School    Texts    University    Texts
1             4    —                —
2            30    —                —
3            70    —                —
4           503    1              169
5           328    2              686
6           375    3              650
7           365    4              543
8           399    5              195
9           420    6               33
10          351    7                4
11           48    —                —
12+          82    —                —

Second, on the basis of the number of years spent studying English at school and university, six groups of learners / instructional exposure were determined in such a way that the groups overlap as little as possible. Table 6.5 lists the six learner groups which are used in this analysis.


Table 6.5.: Learner groups by years of instruction in English at school and university. Total number of years, number of texts, words and sentences are provided for each group.

Group    School    University    Total    Texts      Words    Sentences
I           4-6           1-2      5-8      340    230,054       12,531
II          4-6             3      7-9      345    238,590       13,644
III         4-6           4-5     8-11      464    303,233       16,792
IV          7-9           1-2     8-11      533    335,091       17,285
V           7-9             3    10-12      262    171,762        9,328
VI          7-9           4-5    11-14      253    169,237        8,765

The most advanced groups—the groups sampling essays from learners with the highest amount of instructional exposure in English—are groups VI and V, while groups I and II are the groups with the least amount of instructional exposure in English. Groups III and IV both represent intermediate levels of instructional exposure and have received the same total amount of instruction in English, albeit distributed differently between school and university. The current dataset thus consists of six text files which can be taken to represent learner groups of different proficiency levels (see also Bestgen & Granger 2014: 30).
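As a minimal sketch of this grouping step, and assuming two metadata fields for years of English at school and at university (placeholder names rather than the actual ICLE variable labels), each essay can be mapped onto the groups of Table 6.5 as follows:

def learner_group(school_years, university_years):
    """Map years of English instruction onto the six groups of Table 6.5;
    essays outside the retained ranges return None (cf. Table 6.4)."""
    if school_years not in range(4, 10) or university_years not in range(1, 6):
        return None                                   # borderline cases excluded
    school_band = "4-6" if school_years <= 6 else "7-9"
    uni_band = {1: "1-2", 2: "1-2", 3: "3", 4: "4-5", 5: "4-5"}[university_years]
    return {("4-6", "1-2"): "I",  ("4-6", "3"): "II", ("4-6", "4-5"): "III",
            ("7-9", "1-2"): "IV", ("7-9", "3"): "V",  ("7-9", "4-5"): "VI"}[(school_band, uni_band)]

print(learner_group(5, 2))    # I
print(learner_group(8, 4))    # VI
print(learner_group(10, 3))   # None (excluded)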

On a methodological plane, the open source gzip algorithm is used to approximate the relative informativeness of the text samples and thereby measure their overall, morphological and syntactic complexity. More precisely, the Juola-style compression technique is applied with N = 1,000 iterations and implemented with random sampling. In other words, in each iteration of the distortion and compression script 10% of the sentences per text / learner group are randomly sampled. I sample the same number of sentences rather than, for instance, the same number of words, because in this manner syntactic interdependencies remain intact. Random sampling ensures the comparability of the different texts because it keeps sample size constant. Furthermore, random samples of a constant size that are taken from differently sized texts are more representative than fixed-size samples which cover only a certain part of a text. Thus, all other things being equal, any variability in the complexity between the different texts should be due to the different level of instructional exposure of the learners. In order to avoid a confounding of variables, the influence of the variable ‘national background’ on the complexity of the learner groups will be addressed in Section 6.2.3. The technical details of the method used in this case study are essentially identical to the method described in Section 6.1.1 above. In order to measure the overall complexity of learner groups in ICLE, the file sizes in bytes of each text file before and after compression are established for each iteration. Thereafter, the adjusted complexity scores, which indicate the overall complexity of each text sample, are calculated by subjecting the file sizes to linear regression. Finally, the mean of the total number of iterations (N = 1,000) yields the average adjusted overall complexity score for a given text sample. Comparatively larger scores indicate higher overall complexity of a text sample. Table 6.6 presents the mean uncompressed and compressed file sizes on whose basis the average adjusted overall complexity scores were calculated, and their standard deviations by learner group. Next, I address syntactic and morphological complexity. Syntactic distortion is performed by deletion of 10% of all word tokens in each text file prior to compression. This procedure leads to the disruption of word order regularities and should greatly affect syntactically complex texts, i.e. texts with a comparatively fixed word order. The morphological information is manipulated by deletion of 10% of all orthographic characters in each text file, thereby creating new word forms. This compromises the compressibility of morphologically simple texts which, on the whole, have fewer word forms than morphologically complex texts. Applying the distortion and compression loop with N = 1,000 iterations, the morphological and syntactic complexity score for each text is calculated (for details refer to Section 6.1.1 above) and returned for each iteration of the script. The mean of the total number of measuring points for morphological and syntactic complexity yields the average morphological and syntactic complexity score, respectively (Table 6.7).

Table 6.6.: Mean uncompressed and compressed file sizes (in bytes) and their standard deviations by learner group.

Group    Uncompressed    Standard deviation    Compressed    Standard deviation
I               93212                  2485         34407                  1236
II              89416                  2410         33182                  1202
III             92971                  2507         34435                  1248
IV              96803                  2568         35837                  1294
V               90997                  2412         34289                  1237
VI              95769                  2587         36189                  1329

145 Case studies al 6.7.: Table I-.42002 .150.0019 0.0019 0.0019 0.0019 0.0020 0.0019 0.9135 0.9139 0.9141 0.9145 0.9149 0.9143 0.0025 0.0027 0.0026 0.0026 0.0027 0.0026 -1.0452 -1.0463 -1.0523 -1.0536 -1.0529 -1.0539 VI V deviation IV Standard III score Syntactic II deviation Standard I score Morphological Group vrg opooia n ytci opeiysoe n hi tnaddvain ylanrgroup. learner by deviations standard their and scores complexity syntactic and morphological Average


6.2.2. Complexity variation in learner essays

In this section, the complexity of the six groups of learners with different amounts of instructional exposure—I, II, III, IV, V, and VI—on the overall, syntactic and morphological tier is discussed. In terms of overall complexity, the results obtained through compression match with our expectations in that essays of more advanced learners are more complex than essays of less advanced learners (Figure 6.3). In other words, overall complexity correlates positively with the amount of instructional exposure in English, and by inference, with proficiency (Pearson’s correlation coefficient r = 0.85, p = 0.034). To be more specific, texts V and VI are overall the most complex texts. These texts stem from learners who studied English for ten to fourteen years in total and therefore represent the group with the highest amount of instructional exposure in this dataset. Text I, on the other hand, is overall the least complex text and was produced by learners who have studied English for about five to eight years in total. The texts II, III and IV are, in decreasing order of overall complexity, below average complex. This ranking is somewhat unexpected because learners of the levels III and IV should be more advanced, and hence the texts more complex, than learners of group II who have spent less time studying English. Yet, the overall complexity hierarchy of learner groups generally conforms to the natural progression from less complex production to more complex production. Let us turn to morphological and syntactic complexity (Figure 6.4). The texts produced by the learners with the highest amount of instructional exposure, texts V and VI, are by far the most complex texts in regard to morphological complexity. Yet, in regard to syntactic complexity they are the least complex texts. All other texts are morphologically considerably less complex but syntactically more complex: text II is the syntactically most complex text followed, in decreasing order of syntactic complexity, by the texts III, I and IV. The morphologically most simple text is text I, which represents the group with the least instructional exposure in this study. In fact, morphological complexity highly correlates with the amount of instructional exposure of the learners: the two-sided Pearson’s correlation coefficient for morphological complexity and learner group is positive at r = 0.89 (p = 0.018).4 Put in other words, increasing amounts of instructional exposure in English lead to more morphological complexity but less syntactic complexity. This trade-off is statistically confirmed by a negative correlation between morphological and syntactic complexity (Pearson’s correlation coefficient r = −0.82, p = 0.02). All other things being equal, the amount of instructional exposure received by the ICLE learners is taken as a proxy for their (writing) proficiency in English.

4The correlation between syntactic complexity and instructional exposure is, on the other hand, negative (Pearson’s correlation coefficient r = −0.83, p = 0.042), indicating the decrease of syntactic complexity with increasing amounts of instructional exposure.


[Figure 6.3 plots average adjusted overall complexity scores (x-axis, −200 to 400) for the learner groups, listed from top to bottom: VI, V, II, III, IV, I.]

Figure 6.3.: Overall complexity hierarchy of learner groups in ICLE. Negative residuals indicate below-average complexity; positive residuals indicate above-average complexity.

The decrease of syntactic complexity in more advanced writing may seem surprising. However, it can be explained: firstly, by the fact that more proficient learners are more likely to use different word order patterns, such as inversion of the type never have we been more surprised versus we have never been more surprised, than less advanced learners, who should prefer basic SVO syntax. The use of more variable—and thus comparatively freer—word order leads to lower syntactic complexity because this work defines maximally simple syntax as maximally free word order (see Section 7). Secondly, my findings can be seen as further evidence along the lines of Biber et al. (2011), who show that the measure of complexity commonly used in writing development studies, namely the degree of clausal embedding, does not at all capture the complexity of advanced writing proficiency (Biber et al. 2011: 10–12; see also Biber & Gray 2010).


On the contrary, Biber et al. (2011) find that clausal embedding is a feature of conversational language which is acquired in early stages of language development. Later stages of proficiency, instead, are marked by a higher degree of phrasal complexity and a greater range of lexico-grammatical combinations such as finite complement clauses (e.g. I think that [. . . ]) (Biber et al. 2011: 29–32). How, then, is this related to morphological and syntactic complexity as measured with the compression technique? Clausal embedding is concerned with the degree of subordination and thus with syntactic complexity, while it could be argued that an increasing use of different lexico-grammatical patterns increases morphological complexity. For instance, according to Biber et al. the majority of that-clauses in spoken language occur with only four different verbs (Biber et al. 2011: 31). In the context of Kolmogorov complexity, this means that a text with few verbal patterns is easily compressible and hence morphologically simple. A text with many different verbal patterns, on the other hand, should be more difficult to compress and thus morphologically complex. In short, the current study supports the finding that higher complexity in writing is not necessarily accompanied by higher syntactic complexity. As an aside, this tallies with the BNC results reported in Section 6.1 above, according to which more informal (i.e. oral) registers are syntactically more complex than formal (i.e. written) registers.

Finally, the complexity measurements presented in this section seem to systematically correlate with measures more commonly used in second language acquisition research such as lexical diversity or noun phrase density (Ehret & Szmrecsanyi 2016a). Ehret & Szmrecsanyi (2016a) calculate seven different SLA measures for the six learner groups and correlate them with the Kolmogorov measurements for overall, morphological and syntactic complexity. These calculations were conducted using two online tools (http://corpora.lancs.ac.uk/vocab/analyse/morph.php and http://cohmetrix.com). Both tools come with a text size limit so that the calculations were based on random samples of sentences from the six texts analysed in this section. Table 6.8 shows an example of the correlation coefficients for each of the SLA measures calculated. Overall and morphological complexity best correspond to morphological verb complexity (MCIverbs; in simplified terms a measure of morphological variation based on word types (Pallotti 2015: 121–122)), noun phrase density (DRNP) and number of noun phrase modifiers (SYNNP). This observation yet again shows that overall and morphological Kolmogorov complexity are somewhat related (see Chapter 5 and Chapter 7). The syntactic complexity scores are negatively correlated with the number of noun phrase modifiers (SYNNP)—fewer noun phrase modifiers predict more rigid word order—and morphological verb complexity (MCIverbs) (for more details refer to Ehret & Szmrecsanyi (2016a)).
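For illustration, the kind of computation behind the correlations reported in Table 6.8 can be sketched in R as follows. The data frame and all of its values are entirely hypothetical; only the procedure—pairwise Pearson correlations with a Bonferroni correction—is shown.

    ## Hypothetical data: one row per learner group (I-VI); the column names
    ## echo the measure labels used in Table 6.8, the values are invented.
    scores <- data.frame(
      overall  = c(-310, -75, -60, -160, 205, 400),
      MCIverbs = c(2.1, 2.4, 2.3, 2.5, 2.8, 2.9),
      DRNP     = c(310, 325, 330, 340, 338, 352)
    )

    sla_measures <- c("MCIverbs", "DRNP")
    pvals <- sapply(sla_measures, function(m)
      cor.test(scores$overall, scores[[m]])$p.value)

    p.adjust(pvals, method = "bonferroni")   # Bonferroni-corrected p-values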

Table 6.8.: Pearson correlation coefficients between SLA measures and Kolmogorov measures (overall, morphological and syntactic complexity scores). Significance codes: * significant at p < 0.05 under Bonferroni correction. MCIverbs / MCInouns: morphological complexity indices (Brezina & Pallotti 2015; Pallotti 2015); LDTTRc: lexical diversity, type-token ratio (content words); SYNLE: left embeddedness, words before main verb (mean); SYNNP: number of modifiers per noun phrase (mean); DRNP: noun phrase density, incidence; DRPVAL: agentless passive voice density, incidence. (Adapted from Ehret & Szmrecsanyi (2016a))


Figure 6.4.: Morphological complexity by syntactic complexity of learner groups in ICLE. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

6.2.3. Does mother tongue matter?

The previous section analysed the complexity of six different learner groups, i.e. groups with different levels of instructional exposure, drawing on essays written by learners with different national backgrounds. This section seeks to elucidate the influence of the factor ‘national variety’ or ‘mother tongue background’ on the overall, syntactic and morphological complexity of these essays. Specifically, I address the question of whether the complexity measured in these essays significantly varies between learners of different mother tongue backgrounds or whether their complexity is invariable across the national varieties sampled in ICLE. To this end, I analyse two subsets of the current dataset: a subset of the six groups across four different national varieties (German, French, Italian and Spanish) and a subset of the six groups in one national variety (German).

Table 6.9 lists the number of texts, words and sentences per learner group for the national components German, French, Italian and Spanish. The national components of ICLE are relatively small to start with—they only comprise 200,000 words—so that the amount of data which is available for each national variety per group is very small. For some of the groups merely two texts (roughly 1300 words) exist, and for the French group IV no data at all is available. In particular the French, Italian and Spanish subsets are extremely unbalanced in regard to the distribution of data across the six groups of learners. Due to this distributional imbalance, I first analyse the complete subset consisting of four national varieties, and afterwards, as a control, the complete German subset, which is the distributionally most balanced sample in the set. The German subset is also the largest national component and the amount of data per learner group ranges from approximately 11,000 to 50,000 words.

Methodologically, the compression technique is separately applied to each subset with N = 1,000 iterations and randomly samples 10% of the sentences in each text per iteration. Thereby, the maximum size of the sample is determined by the number of sentences contained in the smallest text. To illustrate, in the four nationality subset the smallest text counts 54 sentences. Thus the script randomly samples 54 sentences from each text in the subset and keeps a random 10% of these sentences, in this case 5 sentences per iteration (as the script rounds non-integer values to the next smaller integer).5 As the implementation of the compression technique and the calculation of the complexity scores are identical to the procedure described in Sections 6.1.1 and 6.2.1, the details will not be repeated here. Suffice it to say that the average overall complexity scores are calculated based on the compressed and uncompressed file sizes of each text sample. Furthermore, the average morphological and average syntactic complexity scores are calculated on the basis of the morphologically / syntactically distorted file sizes and undistorted file sizes respectively. The tables with the full statistics are provided in Appendix A.2.

Let us first turn to the overall complexity hierarchies. If mother tongue does not impact on the overall complexity of the six groups, the texts should be ranked according to the amount of instructional exposure received in English, i.e. the group with the highest amount of instructional exposure should be more complex than the group with the least amount of instructional exposure. However, in the four nationality subset (Figure 6.5), the texts seem to cluster according to the national background of the learners rather than according to learner group. The overall picture is rather noisy and there is no clear hierarchy of the six learner groups across the different mother tongue backgrounds.

5This results in a very small amount of data per learner group and the representativity of these samples is questionable. This is a major caveat and the results of the four nationality study should therefore be considered tentative.
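The sampling step described above can be sketched in a few lines of R. This is a simplified stand-in, not the original script; the function and object names are mine, and ‘texts’ is assumed to be a named list holding one character vector of sentences per text.

    ## One iteration of the random sentence-sampling step (simplified sketch).
    sample_sentences <- function(texts, proportion = 0.1) {
      n_min  <- min(sapply(texts, length))      # sentences in the smallest text
      n_keep <- floor(proportion * n_min)       # rounded down to the next smaller integer
      lapply(texts, function(sentences) {
        pool <- sample(sentences, n_min)        # draw n_min sentences from each text
        sample(pool, n_keep)                    # keep a random 10% of these
      })
    }
    ## With 54 sentences in the smallest text, floor(0.1 * 54) = 5 sentences
    ## are kept per text and per iteration; this is repeated N = 1,000 times.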


Table 6.9.: Learner groups according to national background. Number of texts, words and sentences are provided for each group. Note that for French learners no texts are available for group IV.

Nationality   Group   Texts   Words    Sentences
German        I       25      11,490   595
German        II      22      11,922   635
German        III     24      13,259   697
German        IV      107     51,589   2,753
German        V       80      37,010   1,981
German        VI      85      50,868   2,471
French        I       2       1,344    100
French        II      77      46,651   2,742
French        III     134     80,505   4,423
French        IV      —       —        —
French        V       4       2,244    138
French        VI      3       1,674    114
Italian       I       3       1,503    54
Italian       II      6       3,093    117
Italian       III     27      14,543   634
Italian       IV      2       980      66
Italian       V       9       5,166    232
Italian       VI      34      20,064   906
Spanish       I       14      8,869    437
Spanish       II      2       1,341    60
Spanish       III     6       3,885    188
Spanish       IV      86      55,494   2,806
Spanish       V       6       3,642    181
Spanish       VI      28      16,744   858

The German and Italian texts are overall the most complex texts while the Spanish and French texts range from slightly above-average complexity to below-average complexity; French and Spanish learners seem to produce overall less complex texts than German and Italian learners. To be more specific, the German groups VI and V—representing the most advanced groups—are the two most complex texts, followed by Italian III, German III, German IV and Italian VI and V. On the “simple” end of the spectrum, Italian I, French I and Spanish II and V are the least complex texts. Apart from the extremely low complexity of the Spanish group V, essays written by more advanced learners tend to be more complex than essays written by less advanced learners. Nevertheless, no clear ranking according to the amount of instructional exposure is discernible in the mid-range, either within or across national varieties. In particular the ranking of the French and Spanish groups does not seem to follow any pattern. This is not surprising, however, as these two datasets are particularly unbalanced in regard to data quantity and distribution across the different learner groups. Therefore, these results are merely provisional in nature.

In the German national subset (Figure 6.6), the learner groups are ordered in textbook-style fashion in decreasing order of complexity from the most advanced group to the least advanced group. As in the complete ICLE study, increasing overall complexity highly correlates with instructional exposure and, by inference, with increasing proficiency (Pearson’s correlation coefficient r = 0.96, p = 0.001). Thus, the texts VI and V, which were produced by the most advanced learners, are the most complex texts, while text I, which stems from the least advanced learners, is by far the least complex text. Moreover, the complexity of all six groups (I, II, III, IV, V, VI) strictly increases with increasing amounts of instructional exposure. On the other hand, in the complete ICLE study presented in the previous section, the less advanced group II is more complex than the more advanced groups III and IV. This could well be an effect of the mixing of various national backgrounds. In order to statistically compare these results, I correlate the overall complexity scores of the German national subset and the complete ICLE study.6 Pearson’s correlation coefficient (r = 0.87, p = 0.012) indicates that the rankings of the six groups of instructional exposure in terms of overall complexity across the two studies are highly correlated.

In regard to morphological and syntactic complexity, the results of the two national subsets seem to confirm the findings of the complete ICLE study. In the four nationality subset (Figure 6.7), national clusters seem to emerge which suggest that mother tongue background and nationality matter. However, within the national clusters, I observe that texts produced by more advanced learners tend to be morphologically more complex than texts produced by less advanced learners.

6The scores of the four nationality subset are not correlated because, firstly, the results are somewhat problematic due to data sparsity and, secondly, the differing number of observations does not permit a direct comparison between the studies.


[Figure 6.5 plots average adjusted overall complexity scores (x-axis, −8 to 4) for the nationality-by-group texts, listed from top to bottom: German VI, German V, Italian III, German III, German IV, Italian VI, Italian V, German II, Spanish III, German I, French III, French II, Spanish IV, Spanish VI, Italian II, Italian IV, French IV, Spanish I, French VI, Spanish V, Spanish II, French I, Italian I.]

Figure 6.5.: Overall complexity hierarchy of ICLE learner groups by four different mother tongue backgrounds. Negative residuals indicate below-average complexity; positive residuals indicate above-average complexity.

Syntactically, there is a tendency for texts produced by less advanced learners to be more complex than texts produced by more advanced learners. Yet, these findings have to be taken with a grain of salt because, in terms of morphological complexity, the results are not as clear-cut and seem to be flawed by the distributional imbalance of the data. The German cluster, for example, is the largest dataset and is morphologically the most complex cluster, while the Italian I and II and Spanish II and V texts are the morphologically least complex texts and also happen to be some of the smallest samples. Further research with a much larger dataset is needed to investigate the cross-linguistic (in)variability of morphological and syntactic learner language complexity.


[Figure 6.6 plots average adjusted overall complexity scores (x-axis, −100 to 50) for the German learner groups, listed from top to bottom: VI, V, IV, III, II, I.]

Figure 6.6.: Overall complexity hierarchy of German learner groups in ICLE. Negative residuals indicate below-average complexity; positive residuals indicate above-average complexity.

In the German national subset, the results in regard to the morphological and syntactic complexity of the six groups of instructional exposure dovetail with the findings of the complete ICLE study (Figure 6.8). The morphological complexity of the texts increases with higher amounts of instructional exposure while syntactic complexity decreases. Taking instructional exposure as a proxy for (writing) proficiency, this is another way of saying that advanced proficiency leads to more morphological complexity but less syntactic complexity in written learner production. Specifically, text I, which was produced by learners of the lowest level of proficiency, is the syntactically most complex but morphologically least complex text. The texts produced by learners with a medium amount of instructional exposure in English are scattered across the centre of the plot and are of average morphological and syntactic complexity. Texts VI and V, the texts which were produced by learners who have received the highest amount of instruction in English and can therefore be considered the most proficient learners in this set, are the morphologically most complex but syntactically least complex texts.


Figure 6.7.: Morphological complexity by syntactic complexity of ICLE learner groups by four different mother tongue backgrounds. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

Thus, morphological complexity trades off against syntactic complexity (Pearson’s correlation coefficient r = −0.97, p = 0.0007). The statistical comparison of the morphological and syntactic complexity scores of the German national subset and the complete ICLE study indicates a very high correlation on the morphological level (Pearson’s correlation coefficient r = 0.81, p = 0.024) and a high correlation on the syntactic level (Pearson’s correlation coefficient r = 0.72, p = 0.054).

To sum up, the analysis of the German national subset demonstrates that the overall complexity of written learner production increases with increasing proficiency.


Figure 6.8.: Morphological complexity by syntactic complexity of German learner groups in ICLE. Abscissa indexes increased syntactic complexity, ordinate indexes increased morphological complexity.

The results of the four nationality subset seem to exhibit the same tendency within each national subgroup but are clearly flawed by data sparsity. Needless to say, further exploration with a larger dataset and learners from a wider range of mother tongue backgrounds is required. Furthermore, in both subsets higher amounts of instructional exposure lead to an increase in morphological complexity and a decrease in syntactic complexity. These observations hold across different national varieties and confirm the measurements of the complete ICLE dataset, which does not distinguish between the mother tongue backgrounds of the learners. In short, while mother tongue background does to some extent matter, the relationship between instructional exposure and complexity is quite robust across different mother tongue backgrounds.


6.3. Summary

In this chapter, intra-linguistic variability in regard to overall, morphological and syntactic complexity was analysed in two different datasets drawn from the British National Corpus and the International Corpus of Learner English. In more general terms, this chapter was concerned with the applicability of the compression technique to naturalistic corpus resources.

In a first case study, the complexity variation across different written registers of British English as sampled in the BNC was analysed. I find that more formal registers are overall more complex than less formal registers, i.e. I observe variation along Biber’s (1988) informational vs. involved dimension. Furthermore, morphological complexity positively correlates with the formality of the register such that more formal / abstract-informational registers are morphologically more complex than less formal registers. On the other hand, more informal / involved registers tend towards more syntactic complexity.

In a second case study, the relationship between complexity and the amount of instructional exposure in ICLE was assessed. The measurements indicate that higher amounts of instructional exposure in English—which is taken as a proxy for writing proficiency in this study—lead to higher overall and morphological complexity in the written production of learners. Writing produced by less advanced learners, on the other hand, is marked by higher syntactic complexity and lower morphological complexity. Furthermore, Kolmogorov complexity measures of learner language are systematically correlated with more traditional SLA metrics (cf. Ehret & Szmrecsanyi (2016a)).

The analysis of two small subsets of essays split up according to national variety suggests that mother tongue background influences production complexity, but that the relationship between instructional exposure and complexity, reported for ICLE as a whole, is quite stable across different mother tongue backgrounds. For reasons of data availability and distribution, these results should be treated with caution. Further research with a much larger dataset is needed to shed more light on the (in)variance of cross-linguistic learner language complexity. The ICLE study moreover emphasizes that the success of the compression technique—as with most quantitative methodologies—relies on the quality of the input. In other words, the technique requires not only a large quantity of data, but these data should also be equally distributed across the analysed samples to ensure the reliability, comparability and representativity of the measurements.

All in all, this chapter demonstrated that the compression technique can be successfully applied to naturalistic corpus resources, and yields results which tie in with findings obtained by more traditional means, while at the same time being more economical and objective.


7. Summary and Conclusion

This chapter starts with a short overview of the research context. After this, a summary of the results, both linguistic and methodological, will be provided and the research questions posed in the introduction addressed. I will conclude with a discussion of the advantages and drawbacks of the compression technique, and give an outlook on future research.

Before diving into the subject matter, let us briefly recount the theoretical background of, and motivation for, this work. Situated at the intersection of information theory and research on linguistic complexity (in)variation, the major focus of this work was the exploration, development, and advancement of the compression technique. My research was motivated by the recent interest in the definition and measurement of linguistic complexity, and the lack of a metric which is both objective and economical. The majority of previous studies on linguistic complexity is either based on empirically expensive evidence, i.e. evidence which is labour-intensive to obtain, and therefore difficult to replicate; or it is selective and hence subjective in nature because the methodologies used require the a priori definition, categorisation and selection of features serving as input for analyses.

In contrast to these “traditional” methods, the compression technique combines radical objectivity and economy with a usage-based holistic approach to measuring linguistic complexity in corpora. The basic idea of the compression technique is to measure linguistic complexity with compression algorithms. In technical terms, compression algorithms use adaptive entropy estimation methods to approximate Kolmogorov complexity, and thus measure the complexity of a given text sample by the length of the shortest possible description of this text sample. Consequently, string (1-a) is less complex than string (1-b) (see Example (1) below). Although both strings count the same number of orthographic characters, the length of their shortest possible description differs, such that string (1-a) can be compressed to the expression 5×cd counting four symbols, whereas the shortest description of string (1-b) is the string itself.

(1) a. cdcdcdcdcd (10 symbols) → 5×cd (4 symbols)
    b. c4gh39aby7 (10 symbols) → c4gh39aby7 (10 symbols)
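To make this idea concrete, the following R sketch—a toy illustration, not one of the analysis scripts used in this work—compares the gzip-compressed size of a highly redundant string with that of a pseudo-random string of the same length.

    ## The compressed size of a string approximates its Kolmogorov complexity.
    set.seed(1)
    redundant <- paste(rep("cd", 500), collapse = "")            # "cdcdcd..."
    random    <- paste(sample(c(letters, 0:9), 1000, replace = TRUE),
                       collapse = "")

    compressed_size <- function(x) length(memCompress(charToRaw(x), type = "gzip"))

    compressed_size(redundant)   # small output: high redundancy, low complexity
    compressed_size(random)      # much larger output: little redundancy, high complexity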

Generally, better compression rates of a given text sample indicate lower Kolmogorov complexity and thus lower linguistic complexity. Kolmogorov-based information-theoretic complexity is an absolute metric of complexity and is essentially a measure of structural surface redundancy, i.e. the repetition of orthographic character sequences.


7.1. Results

This section provides summaries of the empirical chapters and presents the results in light of the research questions posed in the introduction.

(i) Can compression algorithms be applied to data other than parallel corpora?
In previous algorithmic complexity research, the compression technique was only used with parallel text databases, which are basically translational equivalents of one text in different languages. Such parallel corpora have become popular in typological research (e.g. Auwera et al. 2005; Cysouw & Wälchli 2007), because they facilitate comparability across different languages, as observed differences due to the propositional content of the texts can be ruled out (Wälchli 2007). I explored this technical limitation of the compression technique by applying it to several different data types: parallel, semi-parallel, and genuinely non-parallel texts. Furthermore, in two case studies the compression technique was applied to naturalistic corpora (see research question (iv) below).
Chapter 3 first established the validity of the Juola-style compression technique by measuring overall, morphological, and syntactic complexity in a parallel database of the Gospel of Mark in some historical varieties of English and a few other languages (Esperanto, Finnish, French, German, Hungarian, and Latin). The rankings dovetail with intuitions and rankings reported in the literature (Bakker 1998; Nichols 2009). In terms of overall complexity, for instance, Hungarian and Finnish are comparatively complex while Esperanto and the English varieties (excluding West Saxon) are comparatively simple. In terms of morphological and syntactic complexity, West Saxon, Finnish, and Latin are the morphologically most complex and syntactically least complex texts. In contrast, Basic English is the morphologically most simple and syntactically most complex text. Moreover, the measurements depict the historical drift of English from a rather morphologically complex language to a language that relies heavily on syntax to convey grammatical information. For example, the West Saxon text exhibits more morphological but less syntactic complexity than the English texts after 1066, such as the sample from the English Standard Version.
Having thus validated the methodology, its applicability to new data types was tested. Specifically, I utilised the compression technique with a parallel and semi-parallel corpus of Alice’s Adventures in Wonderland spanning nine European languages (Dutch, English, Finnish, French, German, Hungarian, Italian, Romanian, and Spanish), and two non-parallel datasets of newspaper articles in the same nine languages.

In order to assess the degree to which algorithmic measurements are influenced by the propositional content of the samples, the complexity rankings of the semi-parallel Alice corpus and the newspaper corpora were statistically compared to the ranking obtained for the parallel Alice corpus, which served as control group. The syntactic rankings of all corpora were furthermore correlated with a hierarchy of syntactic complexity reported in Bakker (1998).
The similarity between the parallel and semi-parallel Alice dataset was moderate in terms of overall complexity, but very high in terms of morphological and syntactic complexity. Dislocations in the complexity hierarchies were particularly notable among languages which tend to neither extreme morphological (such as Hungarian) nor extreme syntactic (such as English) complexity. This suggests that the propositional content might play a role in the measurement of balanced languages (such as German), in which content seems to influence the choice between morphologically and syntactically encoded information. The statistical comparison between the parallel Alice corpus and the two newspaper datasets ranged from low to moderate in terms of overall complexity, and from moderate to high in terms of morpho-syntactic complexity. The performance differences of the compression technique on the two newspaper datasets—which sample articles on two and three different topics respectively—confirm that the propositional content of texts has an effect on the measurements.
Thus, the findings showed that the compression technique can in principle be applied to non-parallel data. Yet, the propositional content of the texts can have an influence on the results, whereby the degree of parallelity impacts more on the measurement of overall complexity than on morphological and syntactic complexity. Content control of non-parallel corpora is therefore crucial for obtaining reliable measurements with the compression technique, especially in comparative studies. In a word, randomly chosen texts cannot be used as input. This section also introduced a statistically more robust version of the compression technique which takes multiple measurements over several iterations of a compression and distortion loop, instead of relying on a single measure (as does the Juola-style compression technique (Juola 2008)).
In summary, I find that the algorithmically obtained complexity rankings are in tandem with the literature (Bakker 1998; Nichols 2009). Therefore, compression algorithms can be used with different data types, i.e. semi-parallel and non-parallel texts, if content is controlled for. The applicability of the technique to large-scale naturalistic corpora will be discussed below.



(ii) Can compression algorithms measure the complexity of specific linguistic features?
The compression technique is an inherently holistic methodology which assesses the overall, morphological and syntactic complexity of texts as a whole, i.e. from a bird’s eye perspective. In Chapter 4 the classic compression technique was combined with the systematic removal of specific target structures—a method I refer to as targeted file manipulation—in order to measure morphological and syntactic complexity in a detailed fashion. In this spirit, the contribution of a handful of morphological markers (e.g. –ing) and functional constructions (e.g. progressive aspect be + verb–ing) to the complexity of three different texts was measured, and their textual complexity was inferred. The chapter extended and complemented a first exploration of targeted file manipulation in a mixed-genre corpus of the same texts (Ehret 2014) by analysing to what extent the measurements are text-dependent.
I found that, generally, the presence of more morphological marker types generates more morphological complexity in the texts, while at the same time enhancing the algorithmic prediction of syntactic patterns. The constructions which were analysed (progressive, passive and perfect) increase information-theoretically measured morphological complexity in the texts but decrease syntactic complexity. Invariant markers such as the future markers will and going to, in contrast, hardly affected the complexity of the texts. In fact, they were found to increase simplicity. From a linguistic perspective, these results were unsurprising as they are in accordance with the literature (Arends 2001; McWhorter 2001a, 2012; Kusters 2008; Szmrecsanyi & Kortmann 2009; Trudgill 2004). From a methodological perspective, however, this was taken as evidence for the effectiveness of the compression technique, and demonstrated that compression algorithms can be used to measure morphological and syntactic complexity in a detailed fashion. Furthermore, I demonstrated that the measurements obtained through targeted file manipulation are largely text-independent. Although the precise amount of algorithmic complexity measured depends to some extent on the complexity of a given text, the general complexity trends hold across different texts.
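As a rough illustration of targeted file manipulation—not the original pipeline; the input file name and the regular expression are simplifying assumptions—one target structure is removed from a text before compression and the compressed sizes are then compared:

    ## Remove the marker -ing from a text and compare compressed file sizes.
    ## "alice.txt" is a hypothetical input file.
    text        <- tolower(readLines("alice.txt", warn = FALSE))
    manipulated <- gsub("ing\\b", "", text, perl = TRUE)   # strip word-final -ing

    gz_size <- function(x) length(memCompress(charToRaw(paste(x, collapse = " ")),
                                               type = "gzip"))

    gz_size(text)          # compressed size of the unmanipulated text
    gz_size(manipulated)   # compressed size after removing the target structure

The difference between the two sizes indicates how much the target structure contributes to the compressibility, and hence to the measured complexity, of the text.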

(iii) What do compression algorithms, linguistically speaking, actually measure?
One of the major objectives of this work was to understand the workings of the compression technique, and thus provide a linguistic definition of Kolmogorov-based information-theoretic complexity.

Chapter 5 therefore looked into the black box of the algorithm by analysing gzip’s lexicon output for Alice’s Adventures in Wonderland, a line-by-line output of text sequences which the algorithm recognised, and on whose basis the input text was compressed. The strings in the lexicon output were manually annotated according to linguistic and non-linguistic categories such as lexical or functional words, other linguistically interpretable sequences, phrasal sequences, random non-linguistically meaningful sequences or mixed sequences (containing both meaningful and random sequences). Example (2) provides example sequences for each category. Spaces are part of compressed sequences and are represented by “ ” at the end or beginning of a string.

(2) a. Lexical opportunity
    b. Functional her
    c. Other ing
    d. Phrasal do cats eat bats
    e. Mixed dance t
    f. Random omet

The in-depth analysis of the annotated lexicon revealed that about half of the compressed strings are linguistically meaningful units while the other half is made up of more or less random strings which cannot be interpreted in a linguistically meaningful way. Thus, I find that compression algorithms do indeed capture recurring linguistic structures, and compressed strings often coincide with linguistically meaningful units such as verbs or suffixes. However, the algorithm does not possess any linguistic knowledge and does not prefer linguistic structures over random patterns. For this reason, the algorithm also compresses random strings, or strings which on the surface resemble linguistic units. This means that the compression algorithm works on the form of structures, not on their function or meaning. Consequently, compression algorithms measure structural surface redundancy.
This characteristic of Kolmogorov-based information-theoretic complexity implies that the methodology has a slight tendency to favour morphological complexity, and should be used with caution when measuring overall complexity in text samples. In plain English, the method heavily relies on, and is sensitive to, structural (ir)regularities and redundancy. Therefore, very high structural redundancy can result in low overall complexity and vice versa.

In order to assess the impact of distortion on the nature of compressed strings, the lexicon outputs of a morphologically and a syntactically distorted version of Alice’s Adventures in Wonderland were semi-automatically annotated for the (non-)linguistic categories introduced above, and compared to the lexicon output of the original Alice text. This analysis showed that the compression technique and the morphological and syntactic distortion of texts, which is necessary for the measurement of morphological and syntactic complexity respectively, work as intended.
Let us rehearse how distortion was implemented. The idea was to address morphological and syntactic complexity by distorting the information at the respective level in the texts prior to compression, and then measuring the impact of distortion on the compressibility of the texts. To be more precise, morphological distortion was implemented as random deletion of orthographically transcribed characters. This is supposed to create random noise and thus increase the morphological complexity in the distorted texts. Text samples with a large variety of word forms should not be as badly affected as text samples with a relatively small number of word forms, in which distortion creates comparatively more random noise. Therefore, the compression algorithm should perform comparatively worse on text samples with low morphological complexity. Syntactic distortion, on the other hand, was implemented as random deletion of word forms, thus disrupting word order regularities. The impact of syntactic distortion on text samples with a relatively simple syntax—defined here as free word order—should not be as great as on text samples with complex syntax, i.e. fixed word order. It follows that the performance of the algorithm should be comparatively worse on text samples with high syntactic complexity.
The survey of string distribution in the distorted lexica confirmed, first, that morphological distortion does indeed lead to an increased amount of “word forms” in the distorted text, resulting in an increased number of unique strings in the lexicon which are difficult to compress. Second, syntactic inter-dependencies and word order regularities in the text are compromised through distortion as intended. This is reflected in the fact that strings which the algorithm added to the lexicon of the original Alice text were not recognised, and thus omitted, in the syntactically distorted lexicon. On a linguistic level, the morphologically distorted lexicon contains a higher percentage of linguistically non-meaningful strings than the original Alice lexicon, while the syntactically distorted lexicon contains fewer strings altogether.
In short, compression algorithms do not intentionally measure or count linguistic features because they do not possess any knowledge of form-function pairings. Rather, Kolmogorov-based information-theoretic complexity is a measure of structural surface redundancy.
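The two distortion procedures rehearsed above can be sketched in a few lines of R. This is a simplification, not the original distortion script; the function names and the deletion rate of 10% are illustrative assumptions.

    ## Morphological distortion: random deletion of characters; syntactic
    ## distortion: random deletion of word forms (rate chosen for illustration).
    distort_morphology <- function(text, rate = 0.1) {
      chars <- strsplit(text, "")[[1]]
      keep  <- sort(sample(seq_along(chars), round((1 - rate) * length(chars))))
      paste(chars[keep], collapse = "")
    }
    distort_syntax <- function(text, rate = 0.1) {
      words <- strsplit(text, " ", fixed = TRUE)[[1]]
      keep  <- sort(sample(seq_along(words), round((1 - rate) * length(words))))
      paste(words[keep], collapse = " ")
    }

    gz_size <- function(x) length(memCompress(charToRaw(x), type = "gzip"))

    ## The compressed size of the distorted sample is compared with that of the
    ## undistorted sample; the worse the algorithm copes with distortion, the
    ## higher the morphological (resp. syntactic) complexity of the text.
    sample_text <- "the cat sat on the mat and the dog sat on the rug"
    gz_size(sample_text)
    gz_size(distort_morphology(sample_text))
    gz_size(distort_syntax(sample_text))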



(iv) How well do compression algorithms capture intra-linguistic, i.e. within-language, complexity variation in naturalistic corpora?
The red thread running through all studies presented in this work is the question of the compression technique’s applicability to various data types, among others its applicability to naturalistic corpus resources. In the spirit of the Freiburg School of complexity research (Kortmann & Szmrecsanyi 2009; Szmrecsanyi & Kortmann 2009), which focuses on analysing variation within varieties or dialects of a language (Kortmann & Szmrecsanyi 2012: 14), its applicability was tested by measuring intra-linguistic complexity variation in two large-scale corpora of English. Specifically, a random sampling technique was used to approximate the overall, morphological and syntactic complexity in written British English registers and learner essays produced by students who received different amounts of instruction in English.
The study on register variability drew on data from the British National Corpus (BNC) and analysed twenty written registers ranging from newspapers and biography to emails and administrative writing. I found that in terms of overall complexity less formal registers are less complex than more formal registers, outliers notwithstanding. Particularly the overall complexity ranking of the three newspaper registers (broadsheet, regional and tabloid) depicted this trend in textbook-style fashion (see Figure 7.1 for illustration).

[Figure 7.1 schematic: + formality / + abstract-informational at one pole, − formality / + involved at the other; broadsheet – regional – tabloid, ordered from + complexity to − complexity.]

Figure 7.1.: Overall complexity ranking of newspaper registers in the BNC.

In terms of morphological and syntactic complexity, less formal registers tend to be more syntactically complex but less morphologically complex than formal registers. Altogether, these findings are in line with the register variation described by Biber (1988) along the involved and abstract-informational dimensions, i.e. more involved registers should be less complex while abstract-informational registers should be more complex, and roughly with Szmrecsanyi (2009).


On a methodological plane, the results of the BNC study furthermore indicated that the reliance of the compression technique on surface redundancy influences the overall complexity scores of some registers. For instance, the low overall complexity score for administrative writing is most likely a result of the low morphological complexity score of this register.
The second analysis is based on the International Corpus of Learner English (ICLE), which samples essays produced by students with different levels of instructional exposure in English. The complexity variation of these texts at the overall, morphological and syntactic level was measured and the influence of the learners’ national background, i.e. mother tongue, was evaluated for two subsets of the corpus. Keeping all other things equal, the amount of instructional exposure in English is considered a proxy for the learners’ level of proficiency in English. Higher amounts of instructional exposure in English lead learners to produce texts with higher overall and morphological complexity. On the other hand, texts produced by less advanced learners are comparatively more complex syntactically. These findings seemed to be largely independent of the learners’ national background: although mother tongue does lead to differences in complexity between essays of learners with different national backgrounds, the relationship between the amount of instruction in English and essay complexity is overall quite robust. Kolmogorov complexity in learner essays generally increases with increased instructional exposure. Furthermore, Kolmogorov measurements of learner language correlate with commonly used SLA complexity measures in a systematic way. Nevertheless, further research with larger datasets is required to back up my findings, because, as with most quantitative methodologies, the reliability, comparability and representativity of compression measurements depend on the quality of the input data. In other words, data sparsity and distribution are crucial factors impacting on the quality of the results obtained by compression.
In conclusion, I can thus say that the compression technique can be successfully applied to naturalistic corpora, and, considering data availability and distribution, is a reliable means for measuring intra-linguistic complexity. Moreover, the compression technique could potentially serve as a diagnostic tool for language proficiency in Second Language Acquisition testing (e.g. TOEFL or IELTS testing).

7.2. Discussion

After having presented the results of my research, this section offers a more general discussion of the methodological aspects and features of the compression technique, and considers its advantages and drawbacks. I conclude by outlining future applications of, and research with, the compression technique.


Objectivity and economy. In contrast to previous measures and methodologies used in complexity research, the use of compression algorithms allows for unparalleled objectivity and economy. Firstly, the compression technique is radically objective and unsupervised because compression algorithms are totally agnostic about form-function pairings, i.e. they do not possess any linguistic knowledge or understanding of the text they compress. This property is often quoted as a major flaw of the methodology. However, it is this “flaw” which makes the compression technique a radically objective tool for measuring linguistic complexity in texts. Secondly, the compression technique is an economical means for measuring linguistic complexity because it can be easily implemented and applied to any orthographically transcribed text database—assuming a certain amount of content control is exercised. In fact, the compression technique does not rely on empirically expensive evidence (e.g. elaborately coded data, manually selected features etc.) and results can therefore be replicated with comparatively little effort.

Text-dependence. This brings us to the next point: the compression technique is a strictly text-based methodology. Referring to Cysouw & Good (2013: 338), who have recently coined the term doculect to refer to a variety represented by, and documented in, a given text source, the compression technique could be said to measure complexity in doculects. This is another way of saying that the compression technique and the resulting measurements are to some extent text-dependent. Again, this characteristic can be considered both an advantage and a drawback. On the one hand, text-dependence is a drawback because the technique depends on orthographic transcription conventions—variant spellings of the same word inflate complexity because the algorithm encounters two different strings (e.g. neighbourhood and neighborhood) which cannot be compressed, instead of one string twice (2 × neighbourhood). On the other hand, the technique depends on the content of the texts. While orthography is an issue that can be easily addressed, for instance by normalising spelling variation in historical texts1, content can only be controlled for to a limited degree, particularly in large corpora, as the content needs to be manually screened or selected. Content control, i.e. the control or screening of the propositional content of texts, is an important factor for the successful application of the compression technique to non-parallel and, to a lesser extent, to naturalistic corpora. This is because content control ensures that the measurements obtained are robust, representative and reliable.

1Spelling is normalised by replacing two variant forms (e.g. broðor and broðer) with one form (e.g. broðer) throughout a text.


Therefore, the compression technique cannot be used with randomly chosen texts. In other words, the components of non-parallel (cross-linguistic) corpora should be comparable in terms of their propositional content or topic. Additionally, extra-linguistic variables such as sample size / text length should be held constant to ensure comparability (more details on sample size are given under the topic ‘Data preparation’ below). Furthermore, text-dependence plays a role in targeted manipulation, the measurement of specific features with compression algorithms. Measurements obtained through targeted manipulation depend on the complexity of the texts in which a certain feature is measured. Still, it was shown that the trends of the measurements hold across different texts.
The fact that the compression technique is text-based is also one of its major advantages and inherent features. For one, it is usage-based because the measurements are based on written or spoken texts which were produced by language users, and thus permit the study of naturally occurring language phenomena in their usage context. Another advantage is that the compression technique is holistic because it works directly on the data (texts) and its input is not restricted to, for instance, a selection of certain features.

Data availability. It is a well-known fact that the output quality of quantitative methodologies largely depends on the quality of the input. This also holds for the compression technique. The complexity measurements obtained through compression are more robust and representative if they are based on larger datasets (i.e. at least 10,000 words of running text). Furthermore, the data should be equally distributed across the samples measured and compared, i.e. the samples should be of the same size. Data availability therefore needs to be considered when using the compression technique, especially when applying it to other than parallel data.2

Data preparation. Another factor which goes hand in hand with data quality is data preparation. Raw texts cannot serve as input for the compression technique but need to be carefully prepared lest the compressibility of the texts is compromised, and their complexity inflated: non-textual information of any kind such as corpus metadata, markup or XML codes as well as punctuation and non-alphabetical characters (e.g. numbers, special UTF-8 characters or byte order marks which can be found in data retrieved from the web) and superfluous (double or multiple) whitespaces need to be removed.

2In parallel corpora content differences can be ruled out. Therefore, it can be safely assumed that any observed differences in the data do not stem from data distribution or size but from the samples’ complexity. However, a minimum sample size of a few thousand words is still required.


Depending on the database, orthography has to be normalised, i.e. in historical texts, for instance, spelling variants of the same form (e.g. broðor versus broðer) need to be replaced by one form (e.g. broðer) as they affect the compressibility of the texts.
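For an English text, the clean-up steps listed above can be sketched as follows in R. This is a simplified stand-in, not the original clean-up scripts, and the file names are hypothetical.

    ## Basic clean-up of a raw text file before compression (simplified sketch).
    raw <- readLines("raw_text.txt", warn = FALSE, encoding = "UTF-8")

    clean <- gsub("<[^>]*>", " ", raw)          # strip corpus markup / XML tags
    clean <- gsub("[^a-zA-Z ]", " ", clean)     # drop punctuation, numbers, special characters
    clean <- gsub("[[:space:]]+", " ", clean)   # collapse superfluous whitespace
    clean <- trimws(clean)

    writeLines(clean, "clean_text.txt")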

Figure 7.2.: Analytical pipeline for the creation of clean input data for the com- pression technique. Steps required with every data type are marked in red.

In addition to these factors, sample size and content control have to be considered when working on non-parallel data. As mentioned above, non-parallel text samples can only be compared if they are of the same size—because the propositional content of such texts is not identical and hence not constant. If non-parallel samples are not of the same size, observed differences cannot be reliably attributed to a difference in complexity. Recall that the complexity measurements obtained with the compression technique are related to the text size and the compressibility of the text samples. For this reason, I have introduced the constant variable “number of sentences”. By sampling the same number of sentences per text sample, as opposed to words or characters, the sample size is held constant and syntactic interdependencies are left intact at the same time. Last but not least, content control is a crucial factor for the successful application of the compression technique to non-parallel text databases. As this point has already been stressed elsewhere (see ‘Text-dependence’ above), suffice it to say that random texts cannot be used as input for the compression technique; rather, the propositional content should be similar. Figure 7.2 depicts the analytical pipeline for turning raw texts into appropriate input data for the compression technique.

Structural redundancy. This work determined Kolmogorov-based information-theoretic complexity as a measure of structural surface redundancy. In this context, structural surface redundancy refers to the recurrence of orthographically transcribed character sequences in a text.


This definition implies that the complexity measured with the compression technique has a slight tendency of favouring morphological complexity, i.e. it is morphology-sensitive, as morphological complexity and structural redundancy are somewhat correlated. This was reflected in some of the overall complexity measurements of the Gospel of Mark (see Chapter 3, Section 3.1.2) and the BNC register analysis (see Chapter 6, Section 6.1). In other words, overall complexity measurements can be affected by large amounts of structural redundancy in the texts such that texts with high structural redundancy exhibit high overall complexity. While this is not a major caveat, researchers need to be aware of this characteristic when using the compression technique for measuring overall complexity in corpora.

Complexity axes. Last but not least, a word on the polarity of the axes of morphological and syntactic complexity is in order. In this work, syntactic complexity was defined in terms of word order (rules), that is to say, maximally free word order (more variation of word order patterns and fewer word order rules) is defined as maximally simple. This might seem counterintuitive because Kolmogorov complexity is related to the predictability of sequences in a text, and fixed word order (patterns) should be more predictable than free word order. Simply put, fixed word order should be Kolmogorov-simple. However, Kolmogorov-based syntactic complexity is an “inverted” or “indirect” measure of syntactic complexity as it is measured through distortion. Syntactic distortion is basically the deletion of words in a text and permits the subsequent measurement of its impact on the predictability in this text. If the text is comparatively less predictable after distortion, the text is considered syntactically complex. Hence, fixed word order must be defined as Kolmogorov-complex. Consequently, the axis of syntactic complexity in the graphs is oriented such that fixed word order counts as complex and free word order counts as simple. The polarity of the morphological complexity axis, on the contrary, is arbitrarily set such that more morphological (structural) (ir)regularity counts as more morphologically complex. This definition is in accordance with definitions of morphological complexity customarily used in typological research (e.g. McWhorter 2001b; Szmrecsanyi & Kortmann 2009).

Finally, what do we gain by using compression algorithms? The compression technique is an objective and economical tool for measuring linguistic complexity in texts. It thus constitutes an independent analysis tool in complexity research that forgoes the labour-intensive, manual feature selection of more traditional metrics. Yet, it could also be employed as a short-cut or complementary diagnostic tool to gauge and / or confirm general complexity trends before applying more orthodox methods.

Moreover, the compression technique could be used to measure language performance: in second language acquisition contexts where proficiency needs to be tested and assessed, such as IELTS (International English Language Testing System) or TOEFL (Test Of English as a Foreign Language), the compression technique could potentially serve as a diagnostic tool for assessing language proficiency.
Yet, the full potential of the compression technique introduced in this work is not yet fully explored and awaits further study, both in terms of the range of its applicability and in terms of the implementation of the methodology itself. The current work has left plenty of loose ends to be tied up: Future research should take a closer look at intra-linguistic complexity variation in learner varieties in order to establish whether the findings based on ICLE (see Section 6.2) can be confirmed when tapping other (larger) databases of learner English. Specifically, the question of whether the mother tongue background of learners influences the complexity of learner English is in need of further examination. The possibility of measuring historical complexity variation and change was very briefly discussed in Section 3.1, where different historical English Bible texts were compared and the shift of English from a morphologically rich language to a language that encodes grammatical information predominantly through syntax was depicted. This area of study deserves much more attention in future algorithmic complexity research, especially as large-scale historical corpora of English such as ARCHER (A Representative Corpus of Historical English Registers) or COHA (Corpus of Historical American English) are available and lend themselves well to analyses with the compression technique. Researchers could, for example, analyse complexity changes in historical English registers comparing British and American English, or might even be able to trace colloquialisation, the approximation of written language to spoken language (for a detailed discussion see Hundt & Mair 1999), by means of algorithmic complexity measurements. It has been reported that spoken language is characterised by high degrees of clausal embedding, which could be said to very roughly correspond to syntactic complexity, while written language is rather characterised by higher degrees of phrasal complexity and lexical density, which could both be said to correspond to morphological complexity (Biber et al. 2011: 29–32; see also Biber 1988: 113–114). In the context of algorithmic complexity research, it should be possible to prove a colloquialisation of written language by measuring a shift from morphological complexity to syntactic complexity, or at least by measuring an increase in syntactic complexity in written texts over time. Furthermore, the compression technique might also find application in measuring complexity in child language acquisition data and spoken language corpora. Research into geographical complexity variation in English is currently being prepared by the author. More generally, algorithmic complexity variation in languages and varieties other than English is completely underresearched and should be investigated.

173 Summary and Conclusion ed. Apart from extending algorithmic complexity research to different fields of linguistics, the method itself can also be extended. This work, and much of the previous research in this field, has focused on measuring overall, morphological and syntactic complexity—apart from Juola (2008) who also addressed pragmatic complexity. However, other sub domains of language might also be measurable with compression algorithms. Phonological and phonetic complexity, for example, could be addressed by compressing or- thographically transcribed phonological and phonetic transcripts. Syntactic complexity might also be measured in more elaborate ways by compressing syntactic tree structures or syntactic annotations. It could also be inter- esting to make compression algorithms less “ignorant”, i.e. gzip , for instance, could be equipped with linguistic knowledge so that the algorithm would only compress linguistically meaningful units such as words. This could be achieved by restricting the compressible sequences to words listed in a dictionary that is fed to the algorithm during compression. Such an algorithm would of course no longer be universally applicable and objective, as it would be tuned to a certain language and possess a priori knowledge of linguistically meaningful sequences. To conclude, this work has provided a first in-depth linguistic analysis of compression algorithms and their application in corpus-based complexity research. All in all, compression algorithms provide an easy, efficient and economic way for measuring linguistic complexity.
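The following minimal sketch illustrates, under simplifying assumptions, one way of approximating the dictionary-restricted idea in R: rather than modifying gzip itself, the input text is reduced to tokens found in a word list before compression, so that only dictionary words contribute compressible material. The word list, file names and function names are purely illustrative placeholders and are not part of the scripts in Appendix B.

#hypothetical sketch of a dictionary-restricted compression measure
#"wordlist.txt" is a placeholder for any word list of the target language
compressed.size <- function(text) {
  length(memCompress(charToRaw(text), type = "gzip"))  #compressed size in bytes
}

dictionary.restricted.size <- function(text, dictionary) {
  tokens <- unlist(strsplit(text, "\\s+"))
  kept <- tokens[tolower(tokens) %in% dictionary]       #keep dictionary words only
  compressed.size(paste(kept, collapse = " "))
}

#dict <- tolower(readLines("wordlist.txt"))
#dictionary.restricted.size(paste(readLines("corpusfile.txt"), collapse = " "), dict)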

Bibliography

Adrian Akmajian, Richard A. Demers, Ann K. Farmer, and Robert M. Harnish. Linguistics: An Introduction to Language and Communication. MIT Press, Cambridge, 4th edition, 1997.

Jacques Arends. Simple grammars, complex languages. Linguistic Typology, 5(2/3): 180–182, 2001.

Guy Aston and Lou Burnard. The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh University Press, Edinburgh, 1998.

Johan van der Auwera, Eva Schalley, and Jan Nuyts. Epistemic possibility in a Slavonic parallel corpus – a pilot study. In Björn Hansen and Petr Karlik, editors, Modality in Slavonic languages. New perspectives, pages 201–217. Otto Sagner, München, 2005.

R. Harald Baayen. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge University Press, Cambridge, 2008.

Dik Bakker. Flexibility and consistency in word order patterns in the languages of Europe. In Anna Siewierska, editor, Constituent order in the languages of Europe, pages 384–419. Mouton de Gruyter, Berlin, 1998.

Max Bane. Quantifying and measuring morphological complexity. In Charles B. Chang and Hannah J. Haynie, editors, Proceedings of the 26th West Coast Conference on Formal Linguistics, pages 69–76. Cascadilla Proceedings Project, Somerville, MA, 2008.

Heike Behrens. Usage-based and emergentist approaches to language acquisition. Linguistics, 47(2):383–411, 2009.

Christian Bentz and Bodo Winter. Languages with more second language learners tend to lose nominal case. Language Dynamics and Change, 3:1–27, 2013.

Yves Bestgen and Sylviane Granger. Quantifying the development of phraseological competence in L2 English writing: An automated approach. Journal of Second Language Writing, 26:28–41, 2014.

Douglas Biber. Variation across Speech and Writing. Cambridge University Press, Cambridge, 1988.

Douglas Biber and Bethany Gray. Challenging stereotypes about academic writing: Complexity, elaboration, explicitness. Journal of English for Academic Purposes, 9: 2–20, 2010.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. Longman grammar of spoken and written English. Longman, Harlow, 1999.


Douglas Biber, Bethany Gray, and Kornwipa Poonpon. Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly, 45(1):5–35, 2011.

Derek Bickerton. Language and Human Behaviour. University of Washington Press, Seattle, 1995.

Walter Bisang. On the evolution of complexity: sometimes less is more in East and mainland Southeast Asia. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 34–49. Oxford University Press, Oxford, 2009.

Vaclav Brezina and Gabriele Pallotti. Morphological complexity tool, available from http://corpora.lancs.ac.uk/vocab/analyse morph.php. 2015.

Joan Bybee. From usage to grammar: the mind’s response to repetition. Language, 82: 711–733, 2006.

Joan Bybee. Language, usage and cognition. Cambridge University Press, Cambridge, 2010.

Ramon Ferrer i Cancho and Ricard V. Solé. Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences of the United States of America, 100(3):788–791, 2003.

Rudi Cilibrasi and Paul M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, April 2005.

David Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, Cambridge, 1987.

Peter W. Culicover. Grammar and Complexity: Language at the Intersection of Competence and Performance. 1st ed. Oxford Linguistics. Oxford University Press, Oxford, 2013.

Michael Cysouw and Jeff Good. Languoid, doculect and glossonym: Formalizing the notion ‘language’. Language Documentation and Conservation, 7:331–359, 2013.

Michael Cysouw and Bernhard Wälchli. Parallel texts: using translational equivalents in linguistic typology. Language Typology and Universals, 60(2):95–99, 2007.

Michael Cysouw, Chris Biemann, and Matthias Ongyerth. Using Strong’s Numbers in the Bible to test an automatic alignment of parallel texts. Language Typology and Universals, 60(2):158–171, 2007.

Östen Dahl. The growth and maintenance of linguistic complexity. John Benjamins, Amsterdam/Philadelphia, 2004.

Östen Dahl. From questionnaires to parallel corpora in typology. Language Typology and Universals, 60(2):172–181, 2007.

Östen Dahl. Grammatical resources and linguistic complexity: Sirionó as a language without NP coordination. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 153–164. John Benjamins, Amsterdam/Philadelphia, 2008.


Östen Dahl. Testing the assumption of complexity invariance: the case of Elfdalian and Swedish. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 50–63. Oxford University Press, Oxford, 2009.

Rick Dale and Gary Lupyan. Understanding the origins of morphological diversity: The linguistic niche hypothesis. Advances in Complex Systems, 15(03n04):1150017/1–1150017/16, 2012.

Antje Dammel and Sebastian Kürschner. Complexity in nominal plural allomorphy: A contrastive survey of ten Germanic languages. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 243–262. John Benjamins, Amsterdam/Philadelphia, 2008.

Casper de Groot. Morphological complexity as a parameter of linguistic typology: Hungarian as a contact language. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 191–215. John Benjamins, Amsterdam/Philadelphia, 2008.

Lourens de Vries. Some remarks on the use of Bible translations as parallel texts in linguistic research. Language Typology and Universals, 60(2):148–157, 2007.

Guy Deutscher. Syntactic Change in Akkadian. Oxford University Press, Oxford/New York, 2000.

Guy Deutscher. “Overall complexity”: a wild goose chase? In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 243–251. Oxford University Press, Oxford, 2009.

John Edwards. Multilingualism. Penguin, London, 1994.

Katharina Ehret. Kolmogorov complexity of morphs and constructions in English. Linguistic Issues in Language Technology, 2(11):43–71, 2014.

Katharina Ehret and Benedikt Szmrecsanyi. Compressing learner language: an information-theoretic measure of complexity in SLA production data. Second Language Research (Sage OnlineFirst), 2016a.

Katharina Ehret and Benedikt Szmrecsanyi. An information-theoretic approach to assess linguistic complexity. In Raffaela Baechler and Guido Seiler, editors, Complexity, Isolation, and Variation, pages 71–94. Walter de Gruyter, Berlin/Boston, 2016b.

Nick C. Ellis. Emergentism, connectionism and language learning. Language Learning, 48:631–664, 1998.

Gertraud Fenk-Oczlon and August Fenk. Complexity trade-offs between subsystems of language. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 43–65. John Benjamins, Amsterdam/Philadelphia, 2008.

Benjamin W. Fortson. Indo-European language and culture: an introduction. Blackwell, Oxford, 2004.

Murray Gell-Mann and Seth Lloyd. Information measures, effective complexity, and total information. Complexity, 2(1):44–52, 1996.


David Gil. How complex are isolating languages? In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 109–132. John Benjamins, Amsterdam/Philadelphia, 2008.

David Gil. How much grammar does it take to sail a boat? In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 19–33. Oxford University Press, Oxford, 2009.

Sylviane Granger, Estelle Dagneaux, and Fanny Meunier, editors. The International Corpus of Learner English: Handbook and CD-ROM. Presses universitaires de Louvain, Louvain-la-Neuve, 2002.

Zhao Hong Han and Wai Man Lew. Acquisitional complexity: What defies complete acquisition in Second Language Acquisition. In Bernd Kortmann and Benedikt Szmrecsanyi, editors, Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, Linguae & Litterae, pages 192–217. Walter de Gruyter, Berlin/Boston, 2012.

John A. Hawkins. Efficiency and Complexity in Grammars. Oxford University Press, Oxford/New York, 2004.

John A. Hawkins. An efficiency theory of complexity and related phenomena. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 252–268. Oxford University Press, Oxford, 2009.

Charles Francis Hockett. A course in modern linguistics. Macmillan, New York, 1958.

Magnus Huber. Syntactic and variational complexity in British and Ghanaian English. Relative clause formation in the written parts of the International Corpus of English. In Bernd Kortmann and Benedikt Szmrecsanyi, editors, Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, Linguae & Litterae, pages 218–242. Walter de Gruyter, Berlin/Boston, 2012.

Marianne Hundt and Christian Mair. ‘Agile’ and ‘uptight’ genres: the corpus-based approach to language change in progress. International Journal of Corpus Linguistics, 4:221–242, 1999.

Stig Johansson and Knut Hofland. Frequency analysis of English vocabulary and grammar: based on the LOB corpus. Clarendon, Oxford, 1989.

Patrick Juola. Measuring linguistic complexity: the morphological tier. Journal of Quantitative Linguistics, 5(3):206–213, 1998.

Patrick Juola. Assessing linguistic complexity. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 89–107. John Benjamins, Amsterdam/Philadelphia, 2008.

Juhani Järvikivi, Pirita Pyykkönen-Klauck, and Matti Laine. Words & constructions: Language complexity in linguistics and psychology. Special Issue of The Mental Lexicon, 9(2), 2014.

Fred Karlsson. Origin and maintenance of clausal embedding complexity. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 193–202. Oxford University Press, Oxford, 2009.


Kimmo Kettunen, Markus Sadeniemi, Tiina Lindh-Knuutila, and Timo Honkela. Analysis of EU languages through text compression. In Tapio Salakoski, Filip Ginter, Sampo Pyysalo, and Tapio Pahikkala, editors, Advances in Natural Language Processing, number 4139 in Lecture Notes in Artificial Intelligence, pages 99–109. Springer-Verlag, Berlin/Heidelberg, 2006.

Andrej Kolmogorov. On tables of random numbers. Sankhya, 25:369–375, 1963.

Andrej N. Kolmogorov. Three approaches to the quantitative definition of information. Problemy Peredachi Informatsii, 1(1):3–11, 1965.

Bernd Kortmann and Benedikt Szmrecsanyi. World Englishes between simplification and complexification. In Lucia Siebers and Thomas Hoffmann, editors, World Englishes - Problems, Properties and Prospects: selected papers from the 13th IAWE conference, pages 265–285. John Benjamins, Amsterdam/Philadelphia, 2009.

Bernd Kortmann and Benedikt Szmrecsanyi, editors. Linguistic Complexity: Second Language Acquisition, Indigenization, Contact. Lingua & Litterae. Walter de Gruyter, Berlin/Boston, 2012.

Wouter Kusters. Linguistic Complexity: The Influence of Social Change on Verbal Inflection. LOT, Utrecht, 2003.

Wouter Kusters. Complexity in linguistic theory, language learning and language change. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 3–21. John Benjamins, Amsterdam/Philadelphia, 2008.

Ronald Langacker. Syntactic re-analysis. In Charles Li, editor, Mechanisms of Syntactic Change, pages 57–139. University of Texas Press, Austin, 1977.

Ronald Langacker. Foundations of cognitive grammar, vol. 1: Theoretical prerequisites. Stanford University Press, Stanford, CA, 1987.

Ronald Langacker. A usage-based model. In Brygida Rudzka-Ostyn, editor, Topics in cognitive linguistics, pages 127–161. John Benjamins, Amsterdam/Philadelphia, 1988.

Diane Larsen-Freeman. An ESL index of development. TESOL Quarterly, 12(4):439–448, 1978.

Diane Larsen-Freeman. The emergence of complexity, fluency, and accuracy in the oral and written production of five Chinese learners of English. Applied Linguistics, 27(4): 590–619, 2006.

John Leavitt. Linguistic Relativities: Language diversity and modern thought. Cambridge University Press, Cambridge, 2011.

Ming Li and Paul M. B. Vitányi. An introduction to Kolmogorov complexity and its applications. Springer-Verlag, New York, 1997.

Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul M. B. Vitányi. The similarity metric. IEEE Transactions on Information Theory, 50(12):3250–3264, 2004.

Eva Lindström. Language complexity and interlinguistic difficulty. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 439–448. John Benjamins, Amsterdam/Philadelphia, 2008.


Gary Lupyan and Rick Dale. Language structure is partly determined by social structure. PLoS ONE, 5(1):1–10, 2010.

Utz Maas. Orality versus literacy as a dimension of complexity. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable. Oxford University Press, Oxford, 2009.

David J. C. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, Cambridge, 2003.

John McWhorter. What people ask David Gil and why: Rejoinder to the replies. Linguistic Typology, 5(2/3), 2001a.

John McWhorter. The world’s simplest grammars are creole grammars. Linguistic Typology, 6:125–166, 2001b.

John McWhorter. The rest of the story: Restoring pidginization to creole genesis theory. Journal of Pidgin and Creole Languages, 17(1):1–48, 2002.

John McWhorter. Why does a language undress? Strange cases in Indonesia. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 167–190. John Benjamins, Amsterdam/Philadelphia, 2008.

John McWhorter. Oh noo!: A bewilderingly multifunctional Saramaccan word teaches us how a creole language develops complexity. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable. Oxford University Press, Oxford, 2009.

John McWhorter. Complexity hotspot: The copula in Saramaccan and its implications. In Bernd Kortmann and Benedikt Szmrecsanyi, editors, Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, Linguae & Litterae, pages 243–246. Walter de Gruyter, Berlin/Boston, 2012.

Rajend Mesthrie. Deletions, antideletions and complexity theory, with special reference to Black South African and Singaporean English. In Bernd Kortmann and Benedikt Szmrecsanyi, editors, Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, Linguae & Litterae, pages 90–100. Walter de Gruyter, Berlin/Boston, 2012.

Matti Miestamo. On the feasibility of complexity metrics. In Krista Kerge and Maria-Maren Sepper, editors, FinEst Linguistics. Proceedings of the Annual Finnish and Estonian Conference of Linguistics [Publications of the Department of Estonian of Tallinn University 8], pages 11–26. Tallinna Ülikooli Kirjastus, Tallinn, 2006.

Matti Miestamo. Grammatical complexity in a cross-linguistic perspective. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 23–41. John Benjamins, Amsterdam/Philadelphia, 2008.

Matti Miestamo. Implicational hierarchies and grammatical complexity. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 80–97. Oxford University Press, Oxford, 2009.

Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors. Language Complexity: Typology, Contact, Change. John Benjamins, Amsterdam/Philadelphia, 2008.


Fermin Moscoso del Prado Martin. The mirage of morphological complexity. In Proc. of the 33rd Annual Conference of the Cognitive Science Society, pages 3524–3529, 2011.

Fermin Moscoso del Prado Martin, Aleksandar Kostic, and R. Harald Baayen. Putting the bits together: an information theoretical perspective on morphological processing. Cognition, 94(1):1–18, 2004.

Peter Mühlhäusler. Pidginization and simplification of language. Pacific Linguistics, Series B No. 26. Australian National University, Canberra, 1974.

Peter Mühlhäusler. The complexity of the personal and possessive pronoun system of Norf’k. In Bernd Kortmann and Benedikt Szmrecsanyi, editors, Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, Linguae & Litterae, pages 101–126. Walter de Gruyter, Berlin/Boston, 2012.

Frederick J. Newmeyer and Laurel B. Preston, editors. Measuring Grammatical Complexity. Oxford University Press, New York, 2014.

Johanna Nichols. Linguistic Diversity in Space and Time. University of Chicago Press, Chicago, IL, 1992.

Johanna Nichols. Linguistic complexity: a comprehensive definition and survey. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 64–79. Oxford University Press, Oxford, 2009.

Johanna Nichols. The vertical archipelago: Adding the third dimension to linguistic geography. In Peter Auer, Martin Hilpert, Anja Stukenbrock, and Benedikt Szmrecsanyi, editors, Space in language and linguistics: geographical, interactional, and cognitive perspectives. Walter de Gruyter, Berlin/Boston, 2013.

Terence Odlin. Nothing will come of nothing. In Bernd Kortmann and Benedikt Szmrecsanyi, editors, Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, Linguae & Litterae, pages 62–89. Walter de Gruyter, Berlin/Boston, 2012.

William O’Grady, Michael Dobrovolsky, and Mark Aronoff. Contemporary Linguistics: An Introduction. St. Martin’s Press, New York, 3rd edition, 1997.

Lourdes Ortega. Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics, 24:492–518, 2003.

Lourdes Ortega. Interlanguage complexity: A construct in search of theoretical renewal. In Bernd Kortmann and Benedikt Szmrecsanyi, editors, Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, Linguae & Litterae, pages 127–155. Walter de Gruyter, Berlin/Boston, 2012.

Gabriele Pallotti. A simple view of linguistic complexity. Second Language Research, 31 (1):117–134, 2015.

Mikael Parkvall. The simplicity of creoles in a cross-linguistic perspective. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 265–285. John Benjamins, Amsterdam/Philadelphia, 2008.

Ljiljana Progovac. Layering of grammar: Vestiges of protosyntax in present-day languages. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable. Oxford University Press, Oxford, 2009.


Elizabeth M. Riddle. Complexity in isolating languages: Lexical elaboration versus grammatical economy. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 134–151. John Benjamins, Amsterdam/Philadelphia, 2008.

Markus Sadeniemi, Kimmo Kettunen, Tiina Lindh-Knuutila, and Timo Honkela. Complexity of European Union languages: A comparative approach. Journal of Quantitative Linguistics, 15(2):185–211, 2008.

David Salomon. Data Compression. The Complete Reference. Springer-Verlag, London, 4th edition, 2007.

Geoffrey Sampson. A linguistic axiom challenged. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 1–18. Oxford University Press, Oxford, 2009.

Geoffrey Sampson, David Gil, and Peter Trudgill, editors. Language Complexity as an Evolving Variable. Oxford University Press, Oxford, 2009.

August Wilhelm von Schlegel. Observations sur la langue et littérature provençale. Librairie Grecque-Latine-Allemande, Paris, 1818.

Friedrich von Schlegel. Über die Weisheit und Sprache der Indier. Mohr und Zimmer, Heidelberg, 1808.

Pieter Seuren and Herman Wekker. Semantic transparency as a factor in creole genesis. In Pieter Muysken and Norval Smith, editors, Substrata versus Universals in Creole Genesis, pages 57–70. John Benjamins, Amsterdam/Philadelphia, 1986.

Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 1948.

Naomi Lapidus Shin. Grammatical complexification in Spanish in New York: 3sg pronoun expression and verbal ambiguity. Language Variation and Change, 26:1–28, 2014.

Ryan K. Shosted. Correlating complexity: A typological approach. Journal of Linguistic Typology, 10:1–40, 2006.

Jeff Siegel. Accounting for analyticity in creoles. In Bernd Kortmann and Benedikt Szmrecsanyi, editors, Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, Linguae & Litterae, pages 35–61. Walter de Gruyter, Berlin/Boston, 2012.

Jeff Siegel, Benedikt Szmrecsanyi, and Bernd Kortmann. Measuring analyticity and syntheticity in creoles. Journal of Pidgin and Creole Languages, 29(1):49–85, 2014.

Kaius Sinnemäki. Complexity trade-offs in core argument marking. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language Complexity: Typology, Contact, Change, pages 67–88. John Benjamins, Amsterdam/Philadelphia, 2008.

Kaius Sinnemäki. Complexity in core argument marking and population size. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable. Oxford University Press, Oxford, 2009.

Dan Slobin. Language change in childhood and in history. In John Macnamara, editor, Language Learning and Thought, pages 185–221. Academic Press, London, 1977.


Eugénie Stapert. Universals in language or cognition? Evidence from English language acquisition and from Pirahã. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable. Oxford University Press, Oxford, 2009.

Maria Steger and Edgar W. Schneider. Complexity as a function of iconicity: The case of complement clause constructions in New Englishes. In Bernd Kortmann and Benedikt Szmrecsanyi, editors, Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, Linguae & Litterae, pages 156–191. Walter de Gruyter, Berlin/Boston, 2012.

Benedikt Szmrecsanyi. Typological parameters of intralingual variability: Grammatical analyticity versus syntheticity in varieties of English. Language Variation and Change, 21(3):319–353, 2009.

Benedikt Szmrecsanyi and Bernd Kortmann. Between simplification and complexification: non-standard varieties of English around the world. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 64–79. Oxford University Press, Oxford, 2009.

Benedikt Szmrecsanyi and Bernd Kortmann. Introduction: Linguistic complexity, second language acquisition, indigenization, contact. In Benedikt Szmrecsanyi and Bernd Kortmann, editors, Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, Linguae & Litterae, pages 6–34. Walter de Gruyter, Berlin/Boston, 2012.

Michael Tomasello. Constructing a language: A Usage-Based Theory of Language Acquisition. Harvard University Press, Cambridge, MA/London, UK, 2003.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 252–259, 2003.

Peter Trudgill. Language contact and the function of linguistic gender. Poznan Studies in Contemporary Linguistics, 35:133–152, 1999.

Peter Trudgill. Contact and simplification: Historical baggage and directionality in linguistic change. Linguistic Typology, 5(2/3):371–374, 2001.

Peter Trudgill. Linguistic and social typology: The Austronesian migrations and phoneme inventories. Linguistic Typology, 8:305–320, 2004.

Peter Trudgill. Sociolinguistic typology and complexification. In Geoffrey Sampson, David Gil, and Peter Trudgill, editors, Language Complexity as an Evolving Variable, pages 98–109. Oxford University Press, Oxford, 2009a.

Peter Trudgill. Vernacular universals and the sociolinguistic typology of English dialects. In Markku Filppula, Juhani Klemola, and Heli Paulasto, editors, Vernacular universals and language contacts : evidence from varieties of English and beyond, pages 304–322. Routledge, New York, 2009b.

J. C. A. van der Lubbe. Information theory. Cambridge University Press, Cambridge [England]; New York, 1997.


Wilhelm von Humboldt. Über die Verschiedenheit des menschlichen Sprachbaues und ihren Einfluß auf die geistige Entwickelung des Menschengeschlechts. Dümmler, Berlin, 1836.

Wilhelm von Humboldt. Ueber das Entstehen der grammatischen Formen, und ihren Einfluss auf die Ideenentwicklung. In Jürgen Trabant, editor, Über die Sprache, pages 52–81. Francke, Tübingen/Basel, 1994.

Warren Weaver. Recent contributions to the Mathematical Theory of Communication. In Claude E. Shannon and Warren Weaver, editors, The Mathematical Theory of Communication, pages 94–117. The University of Illinois Press, Urbana, 8th edition, 1959.

Rulon Wells. Meaning and use. Word, 10:235–250, 1954.

Bernhard Wälchli. Advantages and disadvantages of using parallel texts in typological investigations. Language Typology and Universals, 60(2):118–134, 2007.

George Kingsley Zipf. The psycho-biology of language: An introduction to dynamic philology. Houghton-Mifflin, Boston, 1935.

George Kingsley Zipf. Human behavior and the principle of least effort: An introduction to human ecology. Addison-Wesley Press, Cambridge, MA, 1949.

Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337–343, 1977.

Appendices


A. Tables

A.1. Targeted file manipulation: Tukey’s HSD tables

Table A.1.: Tukey's HSD for average syntactic and morphological complexity scores of the inflectional morphs in Alice. The statistics include the difference between the means, the lower and upper bounds of the 95% confidence interval and the adjusted p-value per pair (columns: Pair, Difference, Lower bound, Upper bound, p-value, reported for syntactic and morphological complexity).

Table A.2.: Tukey's HSD for average syntactic and morphological complexity scores of the inflectional morphs in Mark. The statistics include the difference between the means, the lower and upper bounds of the 95% confidence interval and the adjusted p-value per pair.

Table A.3.: Tukey's HSD for average syntactic and morphological complexity scores of the inflectional morphs in the Euro-Congo news corpus. The statistics include the difference between the means, the lower and upper bounds of the 95% confidence interval and the adjusted p-value per pair.

Table A.4.: Tukey's HSD for average syntactic and morphological complexity scores of functional constructions in Alice. The statistics include the difference between the means, the lower and upper bounds of the 95% confidence interval and the adjusted p-value per pair.

Table A.5.: Tukey's HSD for average syntactic and morphological complexity scores of the functional constructions in Mark. The statistics include the difference between the means, the lower and upper bounds of the 95% confidence interval and the adjusted p-value per pair.

Table A.6.: Tukey's HSD for average syntactic and morphological complexity scores of functional constructions in the Euro-Congo news corpus. The statistics include the difference between the means, the lower and upper bounds of the 95% confidence interval and the adjusted p-value per pair.


A.2. Case studies: Standard deviation in the ICLE national subsets

Table A.7.: Average morphological and syntactic complexity scores and their standard deviations by national background and learner group (columns: Nationality, Group, Morphological score, Standard deviation, Syntactic score, Standard deviation; national subsets: German, French, Italian, Spanish; learner groups I–VI).

Table A.8.: Average morphological and syntactic complexity scores and their standard deviations by learner group in the German ICLE subset (columns: Group, Morphological score, Standard deviation, Syntactic score, Standard deviation; learner groups I–VI).

B. R scripts

B.1. Basic web scraper

library(RCurl)
library(XML)
library(stringr)

#write to file
writeout <- function(contents, filename = "PATH_TO_FILE") {
  output <- paste(sapply(contents, paste, collapse="\n"), collapse="\n\n")
  write(output, file=filename, append = TRUE, sep = "\n\n")
}

#clean up whitespace
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

#repair incomplete links
fixlinks = function(links) {
  links = paste("RELEVANT_LINK", links, sep="")
}

#extract text content from a given webpage
getcontent <- function(url) {
  Sys.sleep(5.0)
  webpage <- getURL(url)
  webpage <- readLines(tc <- textConnection(webpage)); close(tc)
  pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding='UTF-8')
  body <- xpathSApply(pagetree, "//div[@id='article-body-blocks']/p", xmlValue)
  return(body)
}

#get follow-up links from a given webpage
getlinks <- function(url) {
  webpage <- getURL(url)
  webpage <- readLines(tc <- textConnection(webpage)); close(tc)
  pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding='UTF-8')
  relcon <- xpathSApply(pagetree, "//ul[@id='auto-trail-block']")
  links <- unlist(sapply(relcon, function(x)
    as.character(xpathSApply(x, "li/div/div/h3/a/@href")), simplify = F))
  return(links)
}

#scraper function which takes a startseed (url) and the number of pages (pagemax) to be visited
saveData <- function(startseed, pagemax = 2) {
  done = c()
  oldlinks = c()
  newlinks <- c()
  currentseed <- startseed
  i = 0
  stopCond = FALSE
  while (!stopCond) {
    oldlinks = c(oldlinks, newlinks)
    newlinks = getlinks(paste(currentseed, "?page=", i, sep=""))
    if (length(newlinks) < 1) stopCond = TRUE
    newlinks <- newlinks[!newlinks %in% oldlinks]
    contents <- sapply(newlinks, getcontent)
    names(contents) <- NULL
    writeout(contents, filename="english.txt")
    i <- i + 1
    if (i >= pagemax) {
      stopCond = T
    }
  }
}
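For illustration, the scraper might be invoked as follows; the start URL is merely a placeholder for a results page of the targeted news site (it is not part of the original script), and pagemax caps the number of follow-up pages visited:

#saveData("http://EXAMPLE_NEWS_SITE/section/world", pagemax = 5)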

B.2. Multiple distortion and compression script

#compress with GZIP
compress <- function(original.dir, target.dir) {
  lapply(list.files(original.dir), function(x) {
    filename <- paste(original.dir, x, sep="")
    new.filename <- paste(target.dir, x, sep="")
    file.copy(filename, new.filename)
    system(paste("gzip", new.filename, sep=" "))
  })
}

#morphological distortion
#delete characters function
drop.chars <- function(wordlist, proportion.keep=0.9) {  #proportion.keep=0.9 deletes 10%
  splitvec <- unlist(mapply(function(x, y) rep(x, y),
                            1:length(wordlist), nchar(wordlist)))
  characters <- unlist(sapply(wordlist, strsplit, ""))
  drop <- sample.int(length(characters), floor(length(characters) *
                                                 (1-proportion.keep)))
  characters[drop] <- "|"
  new.words <- sapply(split(characters, splitvec), paste, collapse="")
  result <- sapply(new.words, gsub, pattern="|", replace="", fixed=TRUE)
  names(result) <- NULL
  return(result)
}

#function to apply drop.chars to whole directory and print to target directory
morphdistort <- function(original.dir, target.dir) {
  lapply(list.files(original.dir, full.names = TRUE), function(x) {
    old.filename <- paste(x, sep="")
    from <- original.dir
    new.filename <- paste(target.dir, gsub(pattern = from, "", old.filename), sep="")
    corpusfile <- readLines(x, n = -1L, encoding = "UTF-8")
    wordcorpfile <- strsplit(corpusfile, " ")
    distcorpfile <- paste(drop.chars(unlist(wordcorpfile)), collapse = " ")
    write <- function(distcorp, filename = "text.txt") {
      output = paste(sapply(distcorp, paste, collapse="\n"), collapse="\n\n")
      writeLines(output, con = filename, sep = "\n", useBytes = TRUE)
    }
    write(distcorpfile, new.filename)
  })
}

#syntactic distortion function
drop.words = function(cont) {
  cont2 <- unlist(strsplit(cont, " "))  #split into words
  N <- length(cont2)
  sample.vec <- sample.int(N, floor(0.9*N), replace=FALSE)  #sample.int(N, floor(0.9*N)) deletes 10%
  sample.vec <- sample.vec[order(sample.vec)]
  paste(cont2[sample.vec], collapse=" ")
}

#function to apply drop.words to whole directory and print to target directory
syndistort <- function(original.dir, target.dir) {
  lapply(list.files(original.dir, full.names = TRUE), function(x) {
    old.filename <- paste(x, sep="")
    from <- original.dir
    new.filename <- paste(target.dir, gsub(pattern = from, "", old.filename), sep="")
    corpusfile <- readLines(x, n = -1L, encoding = "UTF-8")
    distcorpfile <- paste(drop.words(unlist(corpusfile)), collapse = " ")
    write <- function(distcorp, filename = "text.txt") {
      output = paste(sapply(distcorp, paste, collapse="\n"), collapse="\n\n")
      writeLines(output, con = filename, sep = "\n", useBytes = TRUE)
    }
    write(distcorpfile, new.filename)
  })
}

#writeout
writeout <- function(cont3, filename = "C:/Users/Kat/Desktop/news.words/text.txt") {
  output <- paste(sapply(cont3, paste, collapse="\n"), collapse="\n\n")
  writeLines(output, con = filename, sep = "\n", useBytes = TRUE)
}

#get filesizes
filesizes <- function(directory) {
  sapply(list.files(directory, full.names=T),
         function(x) file.info(x)$size)
}

#set up directories
getDataDir <- function(basedir) {
  file.path(basedir, "data")
}
getTempDir <- function(basedir) {
  file.path(basedir, "temp")
}
setupTempDir <- function(basedir) {
  dirname <- getTempDir(basedir)
  dir.create(file.path(dirname, "corpus/"))
  dir.create(file.path(dirname, "corpus.compressed/"))
  dir.create(file.path(dirname, "morphdistort/"))
  dir.create(file.path(dirname, "morphdistort.compressed/"))
  dir.create(file.path(dirname, "syndistort/"))
  dir.create(file.path(dirname, "syndistort.compressed/"))
}
cleanTempDir <- function(basedir) {
  lapply(list.files(getTempDir(basedir), full.names=TRUE), unlink, recursive=TRUE)
}

#distortion and compression loop
measure.complexity = function(basedir, repetitions) {
  lapply(1:repetitions, function(x) {
    #randomSubset <- getRandomSample(getDataDir(basedir)) #only active for resampled datasets
    #get input data
    randomSubset = sapply(list.files("working/data/", full.names=TRUE), function(x)
      readLines(x, n = -1L, encoding = "UTF-8"), simplify = FALSE)
    ##name files and writeout whole corpus
    setupTempDir(basedir)
    lapply(names(randomSubset), function(x) {
      cleanname = gsub(".txt", "", basename(x))
      newname = file.path(getTempDir(basedir), "corpus",
                          paste(cleanname, "_random.txt", sep=""))
      writeout(paste(randomSubset[[x]], collapse = ""), newname)
    })
    ##distort
    morphdistort(file.path(getTempDir(basedir), "corpus/"),
                 file.path(getTempDir(basedir), "morphdistort/"))
    syndistort(file.path(getTempDir(basedir), "corpus/"),
               file.path(getTempDir(basedir), "syndistort/"))
    ##compress
    compress(file.path(getTempDir(basedir), "corpus/"),
             file.path(getTempDir(basedir), "corpus.compressed/"))
    compress(file.path(getTempDir(basedir), "morphdistort/"),
             file.path(getTempDir(basedir), "morphdistort.compressed/"))
    compress(file.path(getTempDir(basedir), "syndistort/"),
             file.path(getTempDir(basedir), "syndistort.compressed/"))
    ##create dataframe
    df = data.frame(orig.uncomp = filesizes(
      file.path(getTempDir(basedir), "corpus/")))
    ##take file sizes and stick in df
    df$orig.comp = filesizes(
      file.path(getTempDir(basedir), "corpus.compressed/"))
    df$morphdist.uncomp = filesizes(
      file.path(getTempDir(basedir), "morphdistort/"))
    df$morphdist.comp = filesizes(
      file.path(getTempDir(basedir), "morphdistort.compressed/"))
    df$syndist.uncomp = filesizes(
      file.path(getTempDir(basedir), "syndistort/"))
    df$syndist.comp = filesizes(
      file.path(getTempDir(basedir), "syndistort.compressed/"))
    rownames(df) <- gsub("_.*", "", basename(rownames(df)))
    ##add ratios to df
    df$morphratio <- df$morphdist.comp / df$orig.comp * -1
    df$synratio <- df$syndist.comp / df$orig.comp
    cleanTempDir(basedir)
    return(df)
  })
}

#result <- measure.complexity("working/", 1)

B.3. Random sampling function

#sample random number of sentences
getRandomSample <- function(datadir) {
  randomSubset = sapply(list.files(datadir, full.names=TRUE), function(x)
    readLines(x, n = -1L, encoding = "UTF-8"), simplify = FALSE)
  ##split into sentences
  sentences <- sapply(randomSubset, strsplit, "\\.")
  ##unlist files, get smallest number of sentences
  min.sent <- min(sapply(sentences, function(x) length(unlist(x))))
  sample.size = min.sent*0.1  #keep percentage of data
  ##list of 10 files containing the same number of random sentences
  randomSubset <- lapply(sentences, function(x) {
    corp <- unlist(x)
    sample.corp <- sample.int(length(corp), size = sample.size, replace=F)
    paste(corp[sample.corp], collapse=" ")
  })
  return(randomSubset)
}

C. Python script

#!/usr/bin/env python3

#POS-tag data and substitute morphs with word lemmas for morph complexity
#measurement; possible POS-tags: NNS VBZ POS VBG VBD/VBN

import sys

if len(sys.argv) != 2:
    print("Usage: cat stanford-lemmatized-data | readtags.py", file=sys.stderr)
    sys.exit(1)

postag = sys.argv[1]
num_pos_replaced = 0
num_sentences = 0

# Eliminate irregular VBD and NNS
def hacky_conditions(part_of_speech, word):
    if part_of_speech in ['VBD', 'VBN'] and not word.endswith('ed'):
        return False
    elif part_of_speech == 'NNS' and not word.endswith('s'):
        return False
    return True

def convert(bracketline):
    global num_pos_replaced
    bracketline = bracketline.strip()
    assert(bracketline[0] == '[' and bracketline[-1] == ']')
    lem_sen_words = bracketline[1:-1].split('] [')
    for lw in lem_sen_words:
        lw_els = lw.split()
        d = dict((el.split('=') for el in lw_els))
        assert('Text' in d)
        assert('Lemma' in d)
        assert('PartOfSpeech' in d)
        if (d['PartOfSpeech'] == postag and
                hacky_conditions(d['PartOfSpeech'], d['Text'])):
            num_pos_replaced += 1
            yield d['Lemma']
        else:
            yield d['Text']

lines = sys.stdin.read().splitlines()
for threelines in zip(lines[::3], lines[1::3], lines[2::3]):
    assert(threelines[0].startswith("Sentence "))
    if not threelines[2].startswith('[Text'):
        print(threelines[0] + "\n" + threelines[1] + "\n" +
              "Strange line: \"" + threelines[2] + "\"", file=sys.stderr)
        assert(False)
    num_sentences += 1
    for w in convert(threelines[2]):
        print(w)

print("Replaced {} POS in {} sentences".format(num_pos_replaced, num_sentences),
      file=sys.stderr)

D. Shell scripts

D.1. Fix fullstops

#!/bin/sed -f

s/’ / /g;
s/- -/ /g;   # dashes (emdash / LaTeX --)
s/--/ /g;    # other dashes
s/- /-/g;    # fix hyphens
s/ -/ /g;
s/’ / /g;
s/\([] "#$%&()*+,/:0123456789<=>@[\\^_‘{|}~\t\r]\)\+/ /g;
s/[?!;]/\./g;
s/\. \./\. /g;
s/ \./\. /g;
s/\.\.\+/\./g;
s/\.\s\+\.\+/\./g;
s/ $/ /;

#!/bin/sh

echo
fixfullstops.sed | tr -d '\n' | tr '[:upper:]' '[:lower:]'
echo

D.2. Remove corpus markup

#!/bin/sed -f

/^\s\+<&>side[^<]*<\/&>/d;
/^\s\+<&>[0-9]\+:[0-9]\+<\/&>/d;
s/<[^>]*>//g;
s/\[[^]]*\]//g;
s/{[^}]*}//g;
s/([^)]*)//g;
s/#//g;
s,\/\/, ,g;

#!/bin/sh

echo
iconv -f latin1 -t utf8 | fromdos | rmcorpusmarkup.sed
echo

D.3. Remove punctuation

#!/bin/sed -f

s/’ / /g;
s/- -/ /g;   # dashes (emdash / LaTeX --)
s/--/ /g;    # other dashes
s/- /-/g;    # fix hyphens
s/’ / /g;
s/\([] "#$%&()*+,/:;?!0123456789<=>@[\\^_‘{|}~\t\r]\)\+/ /g;
s/\./ /g;
s/ $/ /;

#!/bin/sh

echo
rmpunc.sed | tr -d '\n' | tr '[:upper:]' '[:lower:]'
echo

D.4. Remove UTF-8 characters

#!/bin/sed -f

s/\xe2\x80\xa9//g;   # U+2029
s/\xe2\x80\x90//g;   # U+2010
s/\xef\xbb\xbf//g;   # U+65279
s/\xe2\x80\x98//g;
s/\xe2\x80\x99//g;   # weird unicode apostrophes
s/\xef\xbb\xbf//g;   # BOM or zero width no-break space

E. Zusammenfassung

This thesis is situated at the interface of quantitative corpus linguistics and information theory and contributes to the current debate on linguistic complexity. It investigates a hitherto little explored method which uses compression algorithms to measure linguistic complexity in text corpora. The method has the potential to become a radically objective tool in linguistic complexity research, both as a complementary diagnostic tool and as an independent analysis tool in its own right. The main goals of this work are, first, to advance the development and applicability of this method and, second, to gain insight into the workings of the compression algorithm underlying it, and thus to define information-theoretic complexity in linguistic terms. The backdrop of this work is the current typological complexity debate and the search for complexity metrics, which was triggered by the provocative claim, made in an issue of Linguistic Typology, that some languages are simpler than others (McWhorter 2001b). This claim challenges the assumption that all languages are, overall, equally complex (e.g. Crystal 1987; Hockett 1958). Since then, numerous books on the topic have been published (e.g. Dahl 2004; Kortmann & Szmrecsanyi 2012; Miestamo et al. 2008; Sampson et al. 2009). Linguistic complexity is one of the most hotly debated notions in linguistics. The following three questions are at the centre of complexity research:

(i) How can linguistic complexity be defined?

(ii) How can linguistic complexity be measured?

(iii) How can linguistic complexity variation be explained?

Despite intensive research on linguistic complexity, no unanimous answer to these questions has been found so far. Quite the contrary: a large number of definitions have been proposed which hold within their respective research contexts but are neither universally applicable nor universally accepted. Generally, however, a distinction is drawn between absolute complexity and relative complexity (Miestamo 2006; Miestamo et al. 2008). Absolute complexity is a theory-oriented notion and is concerned with the complexity inherent in a linguistic system. Relative notions of complexity, by contrast, define complexity with reference to a language user. The latter therefore tend to be more application- and usage-oriented than the former. The status quo is similar with regard to complexity metrics: numerous metrics have been proposed, but they either rely on empirically costly methods or are of a selective and subjective nature. Against this background, I explore and extend an unsupervised, algorithmic metric, henceforth referred to as the compression technique, whose roots lie in information theory and which was first proposed by Juola (1998). This measure is based on the notion of Kolmogorov complexity, which defines the complexity of a text as the length of the shortest description of that text. To illustrate the method, assume that two different objects are to be described. These objects are to be described completely, but in as few words as possible. The complexity of the objects can then be determined from these descriptions: the more words are needed for a complete description of an object (while still using as few words as possible), the more complex that object is. The longer the shortest description of an object, the more complex the object. Example (1) shows two sequences of the same length of ten characters. Although the sequences are equally long, they differ in complexity: the shortest description of sequence (1-a) is 5×cd (4 characters), while the shortest description of sequence (1-b) is the sequence itself, with a length of 10 characters.

(1) a. cdcdcdcdcd (10 characters) → 5×cd (4 characters)
    b. c4gh39aby7 (10 characters) → c4gh39aby7 (10 characters)
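This intuition can be illustrated directly with the gzip compression used throughout this work; the following minimal R sketch (string lengths and byte counts are illustrative assumptions only) shows that a highly repetitive sequence compresses to a fraction of the size of a comparably long non-repetitive one:

x <- strrep("cd", 1000)                                                   #highly repetitive string of 2,000 characters
y <- paste(sample(c(letters, 0:9), 2000, replace = TRUE), collapse = "")  #2,000 largely non-redundant characters
length(memCompress(charToRaw(x), type = "gzip"))                          #small: the repetition is compressed away
length(memCompress(charToRaw(y), type = "gzip"))                          #far larger: little redundancy to exploit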

Linguistic complexity in texts is thus measured by approximating their information content, i.e. their Kolmogorov complexity, with compression algorithms. The idea underlying this method is that texts which can be compressed comparatively better, i.e. more efficiently, are, linguistically speaking, comparatively less complex. Kolmogorov complexity is an absolute complexity concept based on the form of structures rather than their function or meaning. In other words, Kolmogorov complexity is agnostic and has no knowledge of deeper linguistic form-function pairings. Information-theoretic, Kolmogorov-based complexity is, linguistically speaking, a measure of structural surface redundancy. Put simply, Kolmogorov complexity measures repetitions of orthographic character sequences (structures) in a text. To a certain degree, it combines the following complexity concepts:


• Quantitative complexity: the number of grammatical contrasts, markers or rules in a linguistic system. More rules are equated with more complexity (Dahl 2004; McWhorter 2001b; Shosted 2006).

• Irregularity-based complexity: the number of irregular grammatical markers in a linguistic system. Irregular markers are considered more complex than regular markers (Kusters 2003; McWhorter 2001b; Trudgill 2004).

It consequently does not cover the following notions of complexity:

• Redundancy-based complexity: linguistic markers, forms or categories without a grammatical or communicative function count as complex (McWhorter 2001b; Seuren & Wekker 1986; Trudgill 1999).

• Second language acquisition difficulty: linguistic features that are hard for adults to acquire are complex (Kusters 2003; Szmrecsanyi & Kortmann 2009; Trudgill 2001).

The central characteristics of the compression technique, which are at the same time its advantages, can be summarised as follows.

Objective. One of the main characteristics and greatest advantages of the compression technique is its unique objectivity. The compression technique relies neither on linguistic features sorted a priori into complexity categories nor on the subjective selection of features that is indispensable for many traditional metrics. On the contrary, the compression technique has no knowledge whatsoever of form-meaning relations and no linguistic knowledge of the texts to which it is applied. Its measurements are therefore radically objective.

Economical. The compression technique is, moreover, an economical tool for measuring complexity, since it is easy to implement and can in principle be applied to any orthographically transcribed text database. Furthermore, it does not depend on empirically demanding data whose creation is usually very labour-intensive and hard to replicate.
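As an illustration of how little machinery is required, the following sketch computes a raw gzip compression ratio for an arbitrary plain-text file. The file name is a hypothetical placeholder, and the raw ratio merely illustrates the principle; the actual measurements in this thesis involve additional steps such as distortion and repeated sampling.

```python
import gzip
from pathlib import Path

def compression_ratio(path: str) -> float:
    """Compressed size divided by original size: lower values indicate more
    surface redundancy and hence, on this view, less complexity."""
    raw = Path(path).read_bytes()
    return len(gzip.compress(raw, compresslevel=9)) / len(raw)

# Hypothetical file; any orthographically transcribed text can be used.
print(compression_ratio("alice_en.txt"))
```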

Usage-based. The compression technique is a usage-based method. Usage-based linguistic approaches rest on the assumption that grammar, or the cognitive representation of grammatical structures, develops directly from language use and is shaped by it (e.g. Bybee 2006, 2010; Langacker 1988; Tomasello 2003). Bybee (2006), for instance, defines grammar as the cognitive representation of a language user's experience with language (Bybee 2006: 711). This means that language use determines, shapes and changes a speaker's linguistic system. Linguistic phenomena, their patterns and their use can be analysed in naturalistic text corpora which contain authentic written or spoken language. Compression algorithms are applied directly to orthographically transcribed texts and do not rely on paradigm tables or reference grammars (as many "traditional" methods do). Since the measurements of the compression technique are based on authentic language, it constitutes an inherently usage-based method for measuring linguistic complexity.

Holistic. As mentioned above, the compression technique does not depend on manually selected features as input but is applied directly to the data, i.e. the texts. Algorithmically measured complexity is therefore not restricted to specific linguistic features; it is a holistic method and metric.

The present thesis goes beyond the mere application of the compression technique: it investigates how the compression algorithm works and thereby presents a first linguistic analysis of Kolmogorov complexity. In addition, the method is further developed and advanced. The analysis is guided in particular by the following research questions and objectives, which are discussed below and briefly presented together with the results.

(i) Can compression algorithms be applied to non-parallel data sets?

(ii) Can compression algorithms measure linguistic complexity in detail?

(iii) What do compression algorithms measure from a linguistic point of view?

(iv) How well can compression algorithms measure intra-linguistic (as opposed to cross-linguistic) complexity variation?


The first question concerns the applicability of the compression technique to different types of data. Previous research applying compression algorithms to the measurement of linguistic complexity is restricted to the analysis of parallel text data. Parallel text data are, in essence, translation-equivalent texts in different languages, so that differences in the complexity measurements cannot be due to the content of the texts. Although parallel text data are ideal for the analysis of cross-linguistic complexity variation, they are at the same time a limitation for algorithmic complexity research in other areas (such as intra-linguistic complexity research). It is therefore important to test the application of the compression technique to other types of data. With this motivation, I explore to what extent the content of a text affects the measurements by applying the compression technique to parallel, semi-parallel and non-parallel as well as naturalistic text corpora.

In Chapter 3, the compression technique is first applied to parallel texts, namely the Gospel of Mark in six different languages (Esperanto, Finnish, French, German, Latin and Hungarian) and in several historical varieties of English, in order to demonstrate how exactly linguistic complexity can be measured with compression programs such as gzip. It is shown that the measurements of overall, morphological and syntactic complexity largely agree with the results of more traditional methods (Bakker 1998; Nichols 1992). According to my measurements, Hungarian and Finnish, for example, are globally very complex languages, while all English Bible versions, with the exception of the West Saxon version, are globally comparatively simple. The comparison of the morphological and syntactic complexity of the English Bible versions, moreover, reflects the historical development of English from a morphologically complex to a syntactically complex language.

In a second step, a statistically more robust version of the compression technique is introduced and extended to parallel and, after the application of permutation, semi-parallel data from Alice's Adventures in Wonderland in nine European languages (English, Finnish, French, German, Italian, Dutch, Romanian, Spanish, Hungarian). The overall, morphological and syntactic complexity of the nine languages is measured in both data sets, and complexity rankings are then established. The comparison of the two rankings shows that the technique works with parallel as well as semi-parallel data, since the measurements agree to a statistically high degree.

In a third step, the compression technique is applied to two genuinely non-parallel newspaper corpora in the same nine languages. The rankings resulting from these measurements are compared with the ranking obtained from the parallel Alice data, which serves as the baseline. My results show that the morphological and syntactic rankings of the newspaper data are largely congruent with the Alice data, whereas the overall complexity measurements agree only moderately. In principle, the algorithmic measurement of complexity in non-parallel texts is possible, but it requires a certain degree of "content supervision". This means that the content of the components of non-parallel corpora has to be similar, since randomly selected texts with different content cannot be measured reliably with the compression technique.
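The permutation step mentioned above, which turns the parallel Alice texts into semi-parallel data, can be read as a random re-sampling of text material. The sketch below is only one possible rendering of that step, assuming sentence-level sampling; the crude sentence splitter, sample size and file name are illustrative assumptions (the R scripts actually used are listed in Appendix B).

```python
import random

def semi_parallel_sample(text: str, n_sentences: int = 500, seed: int = 1) -> str:
    """Draw a random sample of sentences, destroying the parallel alignment
    while keeping the linguistic material itself intact."""
    # Crude split on full stops; real data would require proper sentence tokenisation.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)
    sample = rng.sample(sentences, min(n_sentences, len(sentences)))
    return ". ".join(sample) + "."

# Hypothetical usage with one translation of Alice:
# print(semi_parallel_sample(open("alice_de.txt", encoding="utf-8").read()))
```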

The second question, whether compression algorithms can be used to measure linguistic complexity in detail, is addressed in Chapter 4. Previous research on algorithmic complexity measurement has mainly aimed at measuring linguistic complexity from a bird's-eye perspective, i.e. morphological and syntactic complexity were measured as sub-domains of a language as a whole. The present thesis introduces a new, modified version of the classic compression technique which also allows morphological and syntactic complexity to be measured in detail (see also Ehret 2014). The classic compression technique is combined with the systematic deletion of specific target structures and is therefore referred to as targeted file manipulation. This chapter complements previous research with the targeted compression technique (Ehret 2014) and investigates to what extent the measurements obtained so far are text-independent: in three texts belonging to three different genres, I measure how much complexity morphological endings such as -ing and functional constructions such as the progressive aspect be + verb-ing contribute to the morphological and syntactic complexity of a text.

From a linguistic point of view, I show that the presence of morphological endings in a text generally creates more morphological complexity but at the same time makes syntactic patterns in the text more predictable for the algorithm. The functional constructions investigated increase the information-theoretically measured morphological complexity of the texts but reduce their syntactic complexity. These results are relatively unspectacular in that they agree with and merely confirm findings in the literature (Arends 2001; McWhorter 2001a, 2012; Kusters 2008; Szmrecsanyi & Kortmann 2009; Trudgill 2004). From a methodological point of view, however, these results demonstrate the effectiveness of the compression technique and show that compression algorithms can indeed be used for detailed complexity measurement. It is also demonstrated that the measurements obtained with targeted file manipulation are largely text-independent: although the exact values depend to some degree on the complexity of the text analysed, the general complexity trends are stable across different texts.
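The targeted deletion step can be sketched as follows. The regular expression, the file name and the direct comparison of compressed sizes are illustrative simplifications: the deletion below strips every word-final ing (including, say, nothing), and a full analysis would additionally have to control for the change in text length, as the distortion-based scores in the thesis do.

```python
import gzip
import re

def gzip_size(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8"), compresslevel=9))

def delete_ing(text: str) -> str:
    """Remove word-final -ing, a crude stand-in for deleting one morphological marker."""
    return re.sub(r"ing\b", "", text)

# Hypothetical input file; in the thesis, whole texts from three genres are manipulated.
original = open("alice_en.txt", encoding="utf-8").read()
manipulated = delete_ing(original)

# Comparing the compression behaviour of the original and the manipulated
# version indicates how much the marker contributes to compressible structure.
print(gzip_size(original), gzip_size(manipulated))
```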

The third question revolves around the black box conundrum: despite the fact that the results of the compression technique make linguistic sense and are interpretable, it is essential to find out how exactly the algorithm works and what exactly the method actually does. It is therefore one of the main goals of this thesis to determine what compression algorithms such as gzip measure from a linguistic point of view, and to analyse how the algorithm operates. To this end, the dictionary output of gzip, a collection of text sequences which the algorithm has recognised and on which the compression of the input text is based, is examined. The dictionary of Alice's Adventures in Wonderland as well as the dictionaries of a morphologically and a syntactically manipulated version of Alice were manually annotated for linguistic categories (e.g. nouns, verbs, phrases) and non-linguistic categories (randomly compressed sequences) and subsequently analysed. Example (2) lists one sample sequence for each category. Spaces are part of compressed sequences and are represented by " " at the beginning and end of sequences.

(2) a. Lexical: opportunity
    b. Functional: her
    c. Other: ing
    d. Phrase: do cats eat bats
    e. Mixed: dance t
    f. Random: omet
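The kind of sequence list illustrated in (2) can be approximated with a naive LZ77-style matcher that records which substrings are re-used as back-references. This stand-in is purely illustrative: it is not gzip's actual DEFLATE implementation, which adds hash chains, lazy matching and Huffman coding, but it produces the same type of "lexicon" of recurring character sequences.

```python
def lexicon(text, window=32 * 1024, min_len=3):
    """Collect substrings that a naive LZ77-style matcher would encode as
    back-references into the preceding window."""
    sequences = []
    i = 0
    while i < len(text):
        best_len, start = 0, max(0, i - window)
        for j in range(start, i):                      # search the sliding window
            length = 0
            while (i + length < len(text)
                   and j + length < i
                   and text[j + length] == text[i + length]):
                length += 1
            best_len = max(best_len, length)
        if best_len >= min_len:
            sequences.append(text[i:i + best_len])     # a re-used sequence
            i += best_len
        else:
            i += 1                                     # emit a literal and move on
    return sequences

print(lexicon("do cats eat bats? do bats eat cats?"))
```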

The analysis of the annotated dictionaries shows that compressed sequences consist, in roughly equal halves, of linguistically meaningful and linguistically meaningless character sequences. Compression algorithms are thus quite capable of recognising recurring linguistic structures. However, the algorithm has no linguistic knowledge whatsoever and does not prefer linguistically meaningful structures over random sequences. As a consequence, it also compresses sequences which merely resemble linguistic structures on the surface, i.e. in their surface form (e.g. the ending -ing in beginnING vs. nothING). The algorithm operates on the basis of the form of structures, not on the basis of their function or meaning. Algorithmically measured linguistic complexity must therefore be defined as a measure of structural surface redundancy.

Furthermore, this chapter shows that the manipulation of morphological and syntactic information in a text works as intended: morphological manipulation, i.e. the random deletion of orthographically transcribed characters in a text, does indeed lead to an increase in morphological forms and thus in morphological complexity. Syntactic manipulation, i.e. the random deletion of words in a text, affects the dependencies between words and clause constituents as intended and increases the syntactic complexity of a text.
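The two manipulation types can be sketched as below. The deletion rate of ten per cent and the whitespace tokenisation are illustrative assumptions standing in for the distortion step of the compression technique (the R implementation actually used is listed in Appendix B.2).

```python
import random

def distort_morphology(text: str, rate: float = 0.10, seed: int = 1) -> str:
    """Randomly delete single characters; the deletions create new word forms
    and thereby increase morphological surface variation."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if rng.random() > rate)

def distort_syntax(text: str, rate: float = 0.10, seed: int = 1) -> str:
    """Randomly delete whole word tokens; the deletions disrupt word-order
    dependencies and make syntactic patterns harder to predict."""
    rng = random.Random(seed)
    return " ".join(w for w in text.split() if rng.random() > rate)
```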

The fourth and final research question concerns the applicability of the compression technique to naturalistic text corpora and the measurement of intra-linguistic complexity variation in English. This is demonstrated in two case studies. First, complexity variation at the overall, morphological and syntactic level is examined in various written text genres of British English, represented by data from the British National Corpus (BNC). It turns out that the complexity of the genres investigated is related to their formality. Less formal genres, such as email or letters, are generally less complex than more formal genres such as newspaper articles. These results are in line with the register variation discussed in Biber (1988). The overall complexity hierarchy of the different newspaper genres (quality, regional and tabloid newspapers) in particular displays this trend in textbook fashion (Figure E.1).

[Figure: the newspaper genres arranged along a formality and a complexity axis, from quality newspapers (more formal, more complex) via regional newspapers to tabloids (less formal, less complex).]

Figure E.1.: Overall complexity hierarchy of the newspaper genres in the BNC.

Second, overall, morphological and syntactic complexity is analysed in texts from the International Corpus of Learner English (ICLE). These texts were written by learners with different mother tongues and nationalities whose instructional background, that is, the length of time for which English was studied at school and/or university, varies. Assuming that all other factors (such as text length) are equal, the period of previous English instruction is taken as a proxy for the learners' language proficiency. The results suggest that more instruction leads to more overall and morphological complexity, whereas the texts of less advanced learners are characterised by increased syntactic complexity.

Although the complexity of the texts appears to be influenced by the learners' respective mother tongue or nationality, the relation between instruction and textual complexity remains largely unaffected by it: longer periods of instruction generally lead to more complexity, irrespective of the learners' mother tongue. Nevertheless, further research with larger data sets is required in this area to clarify the relationship between learners' mother tongue and the complexity of their text production.

The compression technique can thus be applied successfully to naturalistic corpus data. However, the amount and quality of the data sets must be taken into account since, as with all quantitative methods, the output depends on the data input. The compression technique could furthermore be used as a tool for measuring language proficiency in the context of language testing (e.g. TOEFL).

Although the present thesis analyses in depth how the compression technique works from a linguistic point of view and extends its implementation and range of application, many questions remain open. Future research should certainly examine intra-linguistic complexity in learner varieties more closely in order to determine whether the ICLE-based results (see Chapter 6.2) can be confirmed in larger data sets. The possibility of measuring historical complexity variation with compression algorithms was touched upon in Chapter 3.1, where several historical Bible texts in English were compared and the shift of English from a morphologically rich language to a language that conveys grammatical information predominantly through syntax was traced. This area of research deserves to be explored more thoroughly, especially since extensive historical corpora such as ARCHER (A Representative Corpus of Historical English Registers) or COHA (Corpus of Historical American English) are already available and lend themselves excellently to analysis with the compression technique. For example, complexity change in historical English genres could be examined by comparing British and American English; it might even be possible to document colloquialisation, i.e. the approximation of written language to spoken language (for a detailed discussion see Hundt & Mair (1999)), by means of algorithmic complexity measurements. It is well known that spoken language tends to be characterised by syntactic embedding, whereas written language tends to be characterised by increased phrasal complexity and the use of a wider range of lexico-grammatical combinations (such as I think that [...]) (Biber et al. 2011: 29–32). In the context of algorithmic complexity research, a colloquialisation of written language would consequently have to show up as a shift from morphological to syntactic complexity, or as an increase in syntactic complexity. Moreover, the compression technique could find applications in the field of first language acquisition and could be applied to digitised spoken language data.

More generally, algorithmic complexity research on languages and varieties other than English has so far been neglected and should by all means be pursued.

Apart from the points just mentioned, the method itself can also be extended. This thesis, like previous research in the field, has concentrated mainly on measuring overall, morphological and syntactic complexity; only Juola (2008) has additionally analysed pragmatic complexity. Other linguistic sub-domains could, however, also be measured algorithmically: phonological complexity might be measurable by compressing phonological transcripts. It would also be interesting to equip gzip with linguistic knowledge so that the algorithm recognises and compresses linguistically meaningful structures. This could be achieved by feeding the algorithm, during compression, a dictionary listing all compressible sequences, i.e. the words of a particular language. A compression technique modified in this way would, of course, no longer be universally applicable and would lose objectivity, since it would be restricted to a particular language and the structures to be compressed would be known before the method is applied.

The compression technique is a time-saving and easily applicable method for measuring complexity in corpora which, compared to conventional methods, is not only holistic but at the same time usage-based. Despite these advantages, the method also has some weaknesses: for instance, it can only be applied to orthographically transcribed texts, and the quality of the results depends on the quality of the input data.
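The preset-dictionary idea sketched above is technically feasible: Python's zlib module, which implements the DEFLATE algorithm used by gzip, allows the compressor to be primed with a dictionary of expected byte sequences. The word list below is a hypothetical stand-in for a language-specific lexicon; whether such priming would yield linguistically more meaningful complexity scores is precisely the open question raised here.

```python
import zlib

# Hypothetical language-specific word list used to prime the compressor.
wordlist = b" the and was said she alice rabbit queen "
text = b"alice said the rabbit was late and the queen was very angry"

def deflate_size(data: bytes, zdict: bytes = b"") -> int:
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return len(comp.compress(data) + comp.flush())

print(deflate_size(text))             # plain DEFLATE
print(deflate_size(text, wordlist))   # primed with the word list, typically smaller
```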
