Introduction About complexity Profiles Kolmogorov Variational Conclusion

Recent advances in the corpus-based study of linguistic complexity

Benedikt Szmrecsanyi

KU Leuven Quantitative Lexicology and Variational Linguistics

Download these slides at http://www.benszm.net/AACL.pdf

Prologue: Complexity is beautiful. Or is it?

“Bavarian Baroque”: Kloster Roggenburg.

Introduction

• linguistic complexity: currently one of the most hotly debated issues in linguistics (e.g. Sampson et al. 2009; Trudgill 2011; Pallotti 2014, among many others)
• theoretical linguistics: are all languages, or language varieties, equally complex? If not, what are the sociolinguistic factors that condition language complexity?
• applied linguistics: how can we use complexity measures as proxies for tracking learners’ proficiency, and/or for benchmarking development?

this presentation: outline three complexity measures that are usage/corpus-based and holistic

Problems with popular complexity measures

• complexity notions in the theoretical literature (e.g. distinctions beyond communicative necessity): → holistic but typically not usage-based
• complexity measures in the applied literature (e.g. mean length of T-unit, extent of clausal subordination): → usage-based but selective & reductionist

this presentation: outline three complexity measures that are usage/corpus-based and holistic

Road map

1. Introduction
2. Why linguists are concerned about complexity
3. Three new measures
   3.1 Typological profiles
   3.2 Kolmogorov complexity
   3.3 Some thoughts on variational complexity
4. Conclusion

Why linguists are concerned about complexity

20th century structuralism: Hockett (1958)

“it would seem that the total grammatical complexity of any language, counting both morphology and syntax, is about the same as that of any other [...]”

(Hockett 1958: 180–181)

Charles F. Hockett (1916–2000)

Challenging the equicomplexity dogma: McWhorter (2001)

“a subset of creole languages display less overall grammatical complexity than older languages, by virtue of the fact that they were born as pidgins, and thus stripped of almost all features unnecessary to communication [...]”

(McWhorter 2001: 125)

John McWhorter

Sociolinguistic theory: Trudgill (2011), Sociolinguistic Typology

contact, social instability, adult SLA, population growth → change → simplification, i.e.:
• increase in morphological transparency
• loss of redundancy
• loss of “historical baggage”

Peter Trudgill

Some popular complexity measures in the theoretical literature

• quantitative complexity: more contrasts, rules, markers, etc. → complexity
• opacity: for example, allomorphies → complexity
• ornamental complexity: communicatively dispensable contrasts, rules, markers etc. → complexity
• L2 acquisition difficulty: contrasts, rules, etc. that are hard to learn for adults → complexity

(see Szmrecsanyi and Kortmann 2012 for a review)

Complexity in applied linguistics: Larsen-Freeman (2012)

“Second Language Acquisition (SLA) researchers have long harbored aspirations of developing developmental indices, which at least in part would be based on the growing complexity of interlanguage”

(Larsen-Freeman 2012: 1)

Diane Larsen-Freeman

Why we measure interlanguage complexity: Ortega (2012)

“Second language acquisition researchers use interlanguage complexity measures [...] with at least three main purposes in mind: (a) to gauge proficiency, (b) to describe performance, and (c) to benchmark development.” (Ortega 2012: 128)

Lourdes Ortega

Some popular complexity measures in the applied linguistics literature

• length of units (e.g. mean length of T-unit): long units → complexity
• density of subordination: subordination → complexity
• frequency of occurrence of ‘complex’ forms (e.g. passive voice): many complicated forms → complexity

(see Ortega 2003, 2012 for reviews)

The quest for new complexity measures

• “theoretical” measures: nicely holistic but not easily amenable to operationalization in usage data (how do you measure, e.g., “historical baggage”?)
• “applied” measures: nicely amenable to operationalization in usage data but suffering from “concept reductionism” (Ortega 2012: 128)

Typological profiles

Drawing inspiration from quantitative typology

• time-honored holistic-typology notions:
  • analyticity: free markers (+ order)
  • syntheticity: bound markers (e.g. inflections)
• Greenberg (1960): seemingly abstract typological notions are amenable to precise measurements
• calculate indices to profile the way grammatical information is coded in corpus texts

Joseph H. Greenberg (1915–2001)

The link to language complexity

• analyticity increases explicitness and transparency and thus eases comprehension (e.g. Humboldt 1836: 284–285)
• syntheticity is difficult because of the allomorphies it creates (e.g. Braunmüller 1990: 627)
• known interlanguage universal: learners avoid inflectional marking & prefer analyticity (e.g. Klein and Perdue 1997)
• rule of thumb: analyticity → simple, syntheticity → complex

Coding

use POS tags to group word tokens in corpus texts into three broad categories:

1. analytic word tokens, i.e. function words that are members of synchronically closed word classes (conjunctions, determiners, pronouns, prepositions, infinitive markers, modal & auxiliary verbs, negators)

2. synthetic word tokens, which carry bound grammatical markers (e.g. inflectional affixes)

3. a small number of word tokens that are both analytic & synthetic (e.g. inflected auxiliary verbs)

Calculating Greenberg-inspired indices

• the analyticity index (AI): calculated as the ratio of the number of free grammatical markers (i.e. function words) in a text to the total number of words in the text, normalized to a sample size of 1,000 words of running text
• the syntheticity index (SI): calculated as the ratio of the number of words in a text that bear a bound grammatical marker to the total number of words in the sample text, normalized to a sample size of 1,000 words of running text

(see Greenberg 1960: Tab. 1)

The cross-linguistic perspective
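The AI and SI values compared in the plots that follow could be computed roughly as sketched below. This is a minimal illustration, not the implementation used in the studies cited: the tag names and the three tag sets are hypothetical stand-ins for whatever POS tagset a corpus actually uses, and tokens that are both analytic and synthetic are assumed to count towards both indices.

```python
# Hypothetical tag sets for illustration only; a real study would map the
# tagger's full tagset onto the three coding categories.
ANALYTIC_TAGS = {"CONJ", "DET", "PRON", "PREP", "TO", "MODAL", "AUX", "NEG"}
SYNTHETIC_TAGS = {"NOUN_PL", "VERB_PAST", "VERB_3SG", "VERB_ING"}
BOTH_TAGS = {"AUX_INFL"}  # e.g. inflected auxiliaries: analytic and synthetic


def typological_profile(tagged_tokens):
    """Return (AI, SI) per 1,000 tokens for a list of (word, tag) pairs."""
    total = len(tagged_tokens)
    if total == 0:
        raise ValueError("empty text")
    analytic = sum(1 for _, tag in tagged_tokens
                   if tag in ANALYTIC_TAGS or tag in BOTH_TAGS)
    synthetic = sum(1 for _, tag in tagged_tokens
                    if tag in SYNTHETIC_TAGS or tag in BOTH_TAGS)
    # Normalize both ratios to a sample size of 1,000 words of running text.
    return 1000 * analytic / total, 1000 * synthetic / total


# Toy input: 2 of 5 tokens are analytic, 3 of 5 are synthetic.
sample = [("the", "DET"), ("dogs", "NOUN_PL"), ("were", "AUX_INFL"),
          ("barking", "VERB_ING"), ("loudly", "ADV")]
ai, si = typological_profile(sample)
print(f"AI = {ai:.0f}, SI = {si:.0f}")  # AI = 400, SI = 600
```

Because the two indices are computed independently, a text can score high on both, which is why the plots show AI and SI on separate axes.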

[Figure: AI versus SI, medium: written. European languages: contemporary newspaper prose. Creole languages: texts obtained & annotated by Geoff Smith and Aya Inoue (see Siegel, Szmrecsanyi & Kortmann 2014). Data points: Tok Pisin, English, Italian, German, Hawai'i Creole, Russian. x-axis: syntheticity index; y-axis: analyticity index.]

The British National Corpus (BNC)

• 90 million words of written standard British English
• 10 million words of spoken British English
• 34 major registers (16 spoken, 18 written; for instance, S conv and W fict)

(see Aston and Burnard 1998 for details)

BNC registers

[Figure: AI versus SI in the British National Corpus (BNC). Spoken registers versus written registers (adapted from Szmrecsanyi 2009: Fig 2). Legend: medium (spoken, written). x-axis: syntheticity index; y-axis: analyticity index.]

The International Corpus of Learner English (ICLE)

• Version 1.1
• essays by advanced learners of English with different mother tongue backgrounds
• 11 subcorpora: Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish
• typological profiling study: targeted medium-advanced learners (5–6 years of English at school, 2–3 years of English at university); dataset subject to analysis: 266,000 words of running text

(see Granger et al. 2002 for details)

ICLE Learner Englishes

[Figure: AI versus SI, International Corpus of Learner English (ICLE) versus British National Corpus (BNC). Text type: essays. Adapted from Szmrecsanyi and Kortmann (2011: Fig 1). Legend: type (learner, native). Data points: the 11 ICLE subcorpora plus BNC-S_conversation, BNC-W_essay_school, and BNC-W_essay_univ. x-axis: syntheticity index; y-axis: analyticity index.]

Typological profiling: interim summary

• using syntheticity as a proxy for complexity, we obtain the following complexity rankings:
  • Russian > German > Italian > English > creoles
  • written text types > spoken text types
  • native essays > learner essays (though there are interesting differences between learner groups!)

(the inverse holds for analyticity)

Research topics awaiting exploration

• more detailed investigation of learner data (taking into account e.g. amount of instruction received as a proxy for proficiency)
• lexical analyticity and syntheticity? (e.g. eye doctor vs. ophthalmologist)

Kolmogorov complexity

Work by and joint work with Katharina Ehret (University of Freiburg)

Kolmogorov complexity

• information theory (Shannon 1948)
• Kolmogorov complexity:
  • unsupervised
  • holistic
  • text-based
  • can be approximated using file compression programs

→ text samples that can be compressed efficiently are linguistically simple

Andrei Kolmogorov (1903–1987)

Examples:

• ababababab (10 characters) → 5×ab (4 characters)

• kl!f7S23y0 (10 characters) → kl!f7S23y0 (10 characters)
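The contrast between the two example strings can be reproduced with a general-purpose compressor. The sketch below uses Python's standard `zlib` module (DEFLATE, the algorithm family that gzip also relies on). Since real compressors add a few bytes of fixed overhead, the 10-character strings above are too short to show the effect, so the sketch assumes longer strings of 1,000 characters each; the alphabet and the seed are arbitrary choices made for illustration.

```python
import random
import string
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


# 1,000 highly predictable characters, in the spirit of "ababababab".
repetitive = "ab" * 500

# 1,000 characters drawn at random, in the spirit of "kl!f7S23y0".
rng = random.Random(0)  # fixed seed so the run is reproducible
alphabet = string.ascii_letters + string.digits + "!?"
unpredictable = "".join(rng.choice(alphabet) for _ in range(1000))

for label, text in [("repetitive", repetitive),
                    ("unpredictable", unpredictable)]:
    print(f"{label}: {len(text)} characters -> {compressed_size(text)} bytes")
```

The repetitive string shrinks to a small fraction of its original size, while the random string barely compresses at all.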

Defining Kolmogorov

“for any sequence of symbols, the Kolmogorov complexity of the sequence is the length of the shortest algorithm that will exactly generate the sequence [...] the more predictable the sequence, the shorter the algorithm needed and thus the Kolmogorov complexity of the sequence is also lower” (Sadeniemi et al. 2008: 191; see also Li and Vitányi 1997; Li et al. 2004)

Kolmogorov complexity in linguistics

• pioneered by Patrick Juola, based on parallel corpora (Juola 1998, 2008)
• increased Kolmogorov complexity → higher complexity mandated by the language used to encode (constant) propositional content
• interpretation: entirely agnostic about form-meaning relationships etc.; what is measured is text-based linguistic surface complexity/redundancy

How to measure Kolmogorov complexity

• modern file compression programs use adaptive entropy estimation, which approximates Kolmogorov complexity (Ziv and Lempel 1977; Juola 1998)
• feed in corpus texts, note down file sizes before & after compression
• better compression rates → lower Kolmogorov complexity
• gzip (GNU zip) version 1.2.4

What exactly does gzip do?

gzip compresses new text strings on the basis of previously encountered strings:
• the algorithm loads a certain amount of text
• the algorithm creates a temporary lexicon
• the algorithm recognises new text (sub)strings on the basis of the lexicon
• the text is compressed by eliminating redundancy
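The idea of recognising new strings on the basis of previously seen text can be illustrated with a toy LZ77-style parser. This is a simplified sketch and not gzip's actual implementation: gzip's DEFLATE format combines LZ77-style back-references with Huffman coding and uses a much faster search. The function names are hypothetical; the window size (32,768) and match lengths (3 to 258) mirror DEFLATE's limits.

```python
def lz77_tokens(text, window=32768, min_len=3, max_len=258):
    """Greedy LZ77-style parse into literals and (distance, length) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # Search the already-seen text (the "temporary lexicon") for the
        # longest string that matches what comes next.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(text)
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))  # back-reference
            i += best_len
        else:
            tokens.append(text[i])                # literal character
            i += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the text; copies character by character so overlaps work."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(tok)
    return "".join(out)


tokens = lz77_tokens("ababababab")
print(tokens)  # ['a', 'b', (2, 8)]
```

Here `(2, 8)` means "go back 2 characters and copy 8", which is how the redundancy in the repeated `ab` is eliminated.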

(see Ehret in preparation)

Dataset 1: the Gospel of Mark

• parallel texts: rule out differences due to propositional content; increasingly popular in cross-linguistic research (see, e.g., Cysouw and Wälchli 2007)
• Gospel of Mark in a number of (historical) varieties of English and in a handful of other languages

7 languages

• Esperanto (Londona Biblio, 20th century [1926])
• Finnish (Pyhä Raamattu, 20th century [1992])
• French (Ostervald, 20th century [1996 revision])
• German (Schlachter, “Miniaturbibel”, 20th century [1951 revision])
• Hungarian (Vizsoly Bible [a.k.a. Károli Bible], 16th century)
• Jamaican Patois (via J. Farquharson & B. Wälchli, 20th c.)
• Latin (Vulgata Clementina, 4th century)

10 varieties of English

• West Saxon (approx. 10th c. [from Bright 1905])
• Wycliffe’s Bible (14th c. [1395])
• The Douay-Rheims Bible (16th c. [1582])
• The King James Version (17th c. [1611])
• Webster’s Revision (19th c. [1833])
• The Darby Bible (19th c. [1867])
• Young’s Literal Translation (19th c. [1862])
• The American Standard Version (20th c. [1901])
• The Bible In Basic English (20th c. [1941]), using mostly 850 Basic English words & simplified grammar (Ogden 1934, 1942)
• The English Standard Version (21st c. [2001])

Technicalities

• 2 measurements per text:
  1. file size before compression (in bytes)
  2. file size after compression (in bytes)
• regress out the trivial correlation between the two measurements → adjusted complexity scores (regression residuals, in bytes)
• bigger adjusted complexity scores → more Kolmogorov complexity

An overall complexity hierarchy of Mark
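The adjusted scores are regression residuals: larger files trivially yield larger compressed files, so compressed size is regressed on uncompressed size and only the remainder is interpreted. A minimal sketch follows, assuming a simple linear (ordinary least-squares) fit, which the slides do not spell out. The function names and the toy texts are hypothetical placeholders, not the Gospel data.

```python
import gzip


def sizes(text: str):
    """Return (bytes before compression, bytes after gzip compression)."""
    raw = text.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


def adjusted_scores(texts):
    """Residuals (in bytes) from a least-squares fit of compressed on raw size.

    Assumes at least two texts with different uncompressed sizes.
    """
    pairs = [sizes(t) for t in texts.values()]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(pairs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    # Positive residual: the text compresses worse than its size predicts.
    return {name: y - (intercept + slope * x)
            for name, x, y in zip(texts, xs, ys)}


# Toy stand-ins for corpus texts (illustrative only).
toy_texts = {
    "A": "the cat sat on the mat " * 40,
    "B": "a quick brown fox jumps over lazy dogs " * 30,
    "C": "colourless green ideas sleep furiously " * 50,
}
scores = adjusted_scores(toy_texts)
for name, score in scores.items():
    print(f"{name}: {score:+.1f} bytes")
```

By construction the residuals sum to zero, so negative values mean below-average and positive values above-average complexity within the sample.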

[Figure: Adjusted overall complexity scores. Negative residuals indicate below-average complexity, positive residuals indicate above-average complexity (adapted from Ehret and Szmrecsanyi in press: Fig 1). Texts shown: Latin, Jamaican Patois, Hungarian, German, French, Finnish, Esperanto, and the ten English versions. x-axis: avg adjusted overall complexity score.]

Dataset 2: The International Corpus of Learner English (ICLE)

work in progress (Ehret in preparation)

• focus on argumentative essays by German learners of English (one of the biggest ICLE components)

• grouping variable: time spent studying English at school/uni → distinguish between 6 groups:
  Group 5 (most instruction): 7-9 yrs @ school, 4-5 yrs @ uni
  Group 4: 7-9 yrs @ school, 3 yrs @ uni
  Group 3b: 7-9 yrs @ school, 1-2 yrs @ uni
  Group 3a: 4-6 yrs @ school, 4-5 yrs @ uni
  Group 2: 4-6 yrs @ school, 3 yrs @ uni
  Group 1 (least instruction): 4-6 yrs @ school, 1-2 yrs @ uni
• amount of instruction received: proxy for proficiency

ICLE-German: overall complexity hierarchy

[Figure: Average adjusted overall complexity scores for groups 1, 2, 3a, 3b, 4, and 5. Negative residuals indicate below-average complexity, positive residuals indicate above-average complexity. N = 1000 iterations sampling 10% of sentences in sample (adapted from Ehret in preparation). Legend: Group 5: most instruction ... Group 1: least instruction. x-axis: avg adjusted overall complexity score.]

Kolmogorov complexity: interim summary

• exploring the frontiers of linguistically responsible complexity research
• measure yields interpretable results:
  • crosslinguistic complexity variation in line with expectations
  • learner language: longitudinal development towards more complexity

Extensions & research topics awaiting exploration

• distorting texts prior to compression to address syntactic & morphological complexity (see Juola 2008, Ehret and Szmrecsanyi to appear, Ehret in preparation for pilot studies)
• how to measure phonetic & phonological complexity?

Some thoughts on variational complexity

Variation analysis meets complexity research

• variationists/variation analysts are interested in the probabilistic factors that constrain choices between “alternate ways of saying ‘the same’ thing” (Labov 1972: 188)
• probe constraint systems & variation patterns through multivariate analysis of meticulously annotated datasets
• recent interest in how variation analysis is relevant to theorizing about complexity (e.g. Huber 2012; Meyerhoff and Schleef 2013; Shin 2014)

William Labov

Exploring constraint systems: how it’s done

• Bresnan et al. (2007) explore the dative alternation (he gave him toys vs. he gave toys to him)
• extract dative occurrences from Switchboard, rich annotation & regression modeling to predict/explain choices (predictive accuracy: > 90%)
• alternation is constrained by at least 10 constraints in Switchboard (e.g. recipient, definiteness theme, priming, ...)

[Figure: Bresnan et al. (2007: Fig. 4)]

Operationalizing complexity in variationist terms

• axiomatic assumption: language or language variety A is more complex than language or language variety B to the extent that linguistic variation in A is more constrained than variation in B
• Shin (2014: 3): “The loss of a linguistic factor that constrains linguistic choice is a type of simplification, while the emergence of a new factor is a type of complexification.”
• rationale: more constrained variational patterns (1) need more description, (2) are harder to acquire than less constrained variational patterns

A case study: Future marker choice in BrE and GhE

• Schneider (in preparation) studies future marker choice:

(1) a. I will sit down quietly.
    b. I’m gonna sit down quietly.

• compares the constraint system of Ghanaian English (an indigenized L2 variety presumably subject to simplification pressures) with that of British English
• regression analysis: 5 significant constraints + interactions in BrE; only 3 significant constraints in GhE (clause type, sentence type, temporal adverbials) → BrE future marker choice more complex than GhE

Variational complexity: interim summary

• making variation analysis matter to complexity research
• theoretical allure: at once “deep” (not “surfacy”), usage-based (not competence-oriented), and somewhat holistic (multivariate)
• drawbacks:
  • must restrict attention to well-researched variation phenomena
  • hard work (→ rich contextual annotation)

Future directions & research topics awaiting exploration

• scaling it up & making the measure more holistic
• beyond counting constraints

Conclusion

Concluding remarks

• three corpus-based & holistic complexity measures:
  • typological profiles about the way in which grammatical information is coded: analytic → simple, synthetic → complex
  • Kolmogorov complexity about text-based surface complexity/redundancy: how difficult is it for algorithms to compress corpus texts?
  • variational complexity about variation patterns: how many constraints does it take to choose between ways of saying the same thing?
• can be profitably applied to all kinds of corpus data including materials relevant in work on SLA and L2 writing development

Thank you!

[email protected] http://www.benszm.net/

This presentation is based upon work supported by an Odysseus grant of the Research Foundation Flanders (FWO) (grant no. G.0C59.13N).

References

Aston, G. and L. Burnard (1998). The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.

Braunmüller, K. (1990). Komplexe Flexionssysteme – (k)ein Problem für die Natürlichkeitstheorie? Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung 43, 625–635.

Bresnan, J., A. Cueni, T. Nikitina, and R. H. Baayen (2007). Predicting the dative alternation. In G. Bouma, I. Kraemer, and J. Zwarts (Eds.), Cognitive Foundations of Interpretation, pp. 69–94. Amsterdam: Royal Netherlands Academy of Science.

Bright, J. W. (1905). Euangelium secundum Marcum. The Gospel of Saint Mark in West-Saxon. New York: AMS Press.

Cysouw, M. and B. Wälchli (2007). Parallel texts: using translational equivalents in linguistic typology. Language Typology and Universals 60(2), 95–99.

Ehret, K. (in preparation). A corpus-based study of information theoretic complexity in World Englishes. PhD dissertation, University of Freiburg.

Ehret, K. and B. Szmrecsanyi (in press). An information-theoretic approach to assess linguistic complexity. In R. Baechler and G. Seiler (Eds.), Complexity and Isolation. Berlin: de Gruyter.

Granger, S., E. Dagneaux, and F. Meunier (Eds.) (2002). The International Corpus of Learner English: Handbook and CD-ROM. Louvain-la-Neuve: Presses universitaires de Louvain.

Greenberg, J. H. (1960). A quantitative approach to the morphological typology of language. International Journal of American Linguistics 26(3), 178–194.

Hockett, C. F. (1954). Two models of grammatical description. Word 10, 210–231.

Hockett, C. F. (1958). A Course in Modern Linguistics. New York: Macmillan.

Huber, M. (2012). Syntactic and variational complexity in British and Ghanaian English relative clause formation in the written parts of the International Corpus of English. In B. Kortmann and B. Szmrecsanyi (Eds.), Linguistic Complexity: Second Language Acquisition, Indigenization, Contact. Berlin: De Gruyter.

Humboldt, W. v. (1836). Über die Verschiedenheit des menschlichen Sprachbaues und ihren Einfluss auf die geistige Entwicklung des Menschengeschlechts. Berlin: Dümmler.

Juola, P. (1998). Measuring linguistic complexity: the morphological tier. Journal of Quantitative Linguistics 5(3), 206–213.

Juola, P. (2008). Assessing linguistic complexity. In M. Miestamo, K. Sinnemäki, and F. Karlsson (Eds.), Language Complexity: Typology, Contact, Change, pp. 89–108. Amsterdam, Philadelphia: Benjamins.

Klein, W. and C. Perdue (1997). The basic variety (or: Couldn’t natural languages be much simpler?). Second Language Research 13, 301–347.

Labov, W. (1972). Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.

Larsen-Freeman, D. (2012). Preface: A closer look. In B. Kortmann and B. Szmrecsanyi (Eds.), Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, pp. 1–5. Berlin: De Gruyter.

Li, M., X. Chen, X. Li, B. Ma, and P. M. B. Vitányi (2004). The similarity metric. IEEE Transactions on Information Theory 50(12), 3250–3264.

Li, M. and P. M. B. Vitányi (1997). An introduction to Kolmogorov complexity and its applications. New York: Springer.

McWhorter, J. (2001). The world’s simplest grammars are creole grammars. Linguistic Typology 5, 125–166.

Meyerhoff, M. and E. Schleef (2013). Hitting an Edinburgh target: immigrant adolescents’ acquisition of variation in Edinburgh English. In R. Lawson (Ed.), Sociolinguistic perspectives on Scotland, pp. 103–128. Basingstoke: Palgrave Macmillan.

Ortega, L. (2003). Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics 24, 492–518.

Ortega, L. (2012). Interlanguage complexity: A construct in search of theoretical renewal. In B. Kortmann and B. Szmrecsanyi (Eds.), Linguistic Complexity: Second Language Acquisition, Indigenization, Contact. Berlin: De Gruyter.

Pallotti, G. (2014). A simple view of linguistic complexity. Second Language Research.

Sadeniemi, M., K. Kettunen, T. Lindh-Knuutila, and T. Honkela (2008). Complexity of European Union languages: A comparative approach. Journal of Quantitative Linguistics 15(2), 185–211.

Sampson, G., D. Gil, and P. Trudgill (2009). Language complexity as an evolving variable. Oxford, New York: Oxford University Press.

Schneider, A. (in preparation). Aspect and Modality in Ghanaian English: A Corpus-based Study of the Progressive and the Modal WILL. PhD dissertation, University of Freiburg, Freiburg.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal 27, 379–423.

Shin, N. L. (2014). Grammatical complexification in Spanish in New York: 3sg pronoun expression and verbal ambiguity. Language Variation and Change 26, 1–28.

Siegel, J., B. Szmrecsanyi, and B. Kortmann (2014). Measuring analyticity and syntheticity in creoles. Journal of Pidgin and Creole Languages 29(1), 49–85.

Szmrecsanyi, B. (2009). Typological parameters of intralingual variability: grammatical analyticity versus syntheticity in varieties of English. Language Variation and Change 21(3), 319–353.

Szmrecsanyi, B. and B. Kortmann (2011). Typological profiling: learner Englishes versus indigenized L2 varieties of English. In J. Mukherjee and M. Hundt (Eds.), Exploring Second-Language Varieties of English and Learner Englishes: Bridging a Paradigm Gap, pp. 167–187. Amsterdam, Philadelphia: Benjamins.

Szmrecsanyi, B. and B. Kortmann (2012). Introduction: Linguistic complexity – second language acquisition, indigenization, contact. In B. Szmrecsanyi and B. Kortmann (Eds.), Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, pp. 6–34. Berlin: De Gruyter.

Trudgill, P. (2011). Sociolinguistic typology: social determinants of linguistic complexity. Oxford, New York: Oxford University Press.

Ziv, J. and A. Lempel (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23(3), 337–343.

Bonus material

How compression algorithms see the world

• Ehret (in preparation) re-programs gzip to retrieve an inspectable lexicon
• input text: Alice’s Adventures in Wonderland
• length of compressed sequences ranges from 3 characters to 148
• 85% of all strings have a length of three to ten characters
• captures linguistic structure

Sample of the compressed text (back-references shown in square brackets):

alice was beginning to
get very tired of sitt
[29,4]ing by her
[15,3] sist
[7,3]er on the bank an
[41,5]d of hav
[40,4]ing noth
[77,7]ing to do
[40,3] on
[102,3]ce or tw
[111,4]ice s
[51,3]he had peep
[94,3]ed in
[37,3]to
[71,5]the book
[94,12]her sister
[151,4]was read
[120,5]ing but it
[55,5]had no pictures
...

Modular complexity: distortion

• Juola (2008): prior to compression, we may distort (i.e. manipulate) text files to address
  • morphological complexity (random deletion of 10% of all characters)
  • syntactic complexity, specifically: rigidity (random deletion of 10% of all word tokens)

Calculating modular complexity scores

• a standardized morphological complexity score, which is defined as $-\frac{m}{c}$ (where m is the compressed file size after morphological distortion, and c is the compressed file size before distortion)
• a standardized syntactic complexity score, which is defined as $\frac{s}{c}$ (where s is the compressed file size after syntactic distortion, and c is the compressed file size before distortion)

The modular complexity space of Mark
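The two standardized scores (morphological: −m/c, syntactic: s/c) could be computed along the following lines. This is a sketch under stated assumptions rather than the procedure of the cited studies: the helper names are hypothetical, tokens are assumed to be whitespace-separated, and a single seeded run is shown where a real study would average over many random distortions.

```python
import gzip
import random


def gz_size(text: str) -> int:
    """Compressed size in bytes."""
    return len(gzip.compress(text.encode("utf-8")))


def drop_fraction(items, rate, rng):
    """Return items with a randomly chosen fraction `rate` of them removed."""
    doomed = set(rng.sample(range(len(items)), k=round(rate * len(items))))
    return [item for i, item in enumerate(items) if i not in doomed]


def modular_scores(text, rate=0.10, seed=0):
    """Return (morphological score -m/c, syntactic score s/c)."""
    rng = random.Random(seed)
    c = gz_size(text)  # compressed size before distortion
    # Morphological distortion: delete 10% of all characters.
    m = gz_size("".join(drop_fraction(list(text), rate, rng)))
    # Syntactic distortion: delete 10% of all word tokens.
    # Note: rejoining with single spaces also normalizes whitespace.
    s = gz_size(" ".join(drop_fraction(text.split(), rate, rng)))
    return -m / c, s / c


toy_text = "the cat sat on the mat and the dog sat on the log " * 20
morph, synt = modular_scores(toy_text)
print(f"morphological: {morph:.3f}, syntactic: {synt:.3f}")
```

Both scores are ratios against the same undistorted baseline c, which is what makes them comparable across texts of different lengths.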

[Figure: Morphological complexity by syntactic complexity. Abscissa indexes morphological complexity, ordinate indexes syntactic complexity (fixed word order) (adapted from Ehret and Szmrecsanyi in press: Fig 2). Legend: English (no, yes). Texts shown: the ten English versions, Jamaican Patois, Esperanto, French, German, Hungarian, Latin, Finnish. x-axis: standardized morphological complexity score; y-axis: standardized syntactic complexity score.]

ICLE-German: modular complexity space

[Figure: Morphological complexity by syntactic complexity for groups 1, 2, 3a, 3b, 4, and 5. Abscissa indexes morphological complexity, ordinate indexes syntactic complexity (fixed word order). N = 1000 iterations sampling 10% of sentences in sample (adapted from Ehret in preparation). Legend: Group 5: most instruction ... Group 1: least instruction. x-axis: standardized avg morphological complexity score; y-axis: standardized avg syntactic complexity score.]