Recent Advances in the Corpus-Based Study of Linguistic Complexity
Introduction About complexity Profiles Kolmogorov Variational Conclusion

Benedikt Szmrecsanyi
KU Leuven
Quantitative Lexicology and Variational Linguistics
Download these slides at http://www.benszm.net/AACL.pdf

Prologue: Complexity is beautiful. Or is it?
"Bavarian Baroque": Kloster Roggenburg.

Introduction

• linguistic complexity: one of the currently most hotly debated issues in linguistics (e.g. Sampson et al. 2009; Trudgill 2011; Pallotti 2014, among many others)
• theoretical linguistics: are all languages, or language varieties, equally complex? If not, what are the sociolinguistic factors that condition language complexity?
• applied linguistics: how can we use complexity measures as proxies for tracking learners' proficiency, and/or for benchmarking development?
this presentation: outline three complexity measures that are usage/corpus-based and holistic

Problems with popular complexity measures

• complexity notions in the theoretical literature (e.g. distinctions beyond communicative necessity):
  → holistic but typically not usage-based
• complexity measures in the applied literature (e.g. mean length of T-unit, extent of clausal subordination):
  → usage-based but selective & reductionist
this presentation: outline three complexity measures that are usage/corpus-based and holistic

Road map

1. Introduction
2. Why linguists are concerned about complexity
3. Three new measures
   3.1 Typological profiles
   3.2 Kolmogorov complexity
   3.3 Some thoughts on variational complexity
4. Conclusion

Why linguists are concerned about complexity

20th century structuralism: Hockett (1958)

"it would seem that the total grammatical complexity of any language, counting both morphology and syntax, is about the same as that of any other [...]" (Hockett 1958: 180–181)
Charles F. Hockett (1916–2000)

Challenging the equicomplexity dogma: McWhorter (2001)

"a subset of creole languages display less overall grammatical complexity than older languages, by virtue of the fact that they were born as pidgins, and thus stripped of almost all features unnecessary to communication [...]" (McWhorter 2001: 125)
John McWhorter

Sociolinguistic theory: Trudgill (2011), Sociolinguistic Typology

contact, social instability, adult SLA, population growth → change → simplification, i.e.:
  → increase in morphological transparency
  → loss of redundancy
  → loss of "historical baggage"
Peter Trudgill

Some popular complexity measures in the theoretical literature

• quantitative complexity: more contrasts, rules, markers, etc. → complexity
• opacity: for example, allomorphies → complexity
• ornamental complexity: communicatively dispensable contrasts, rules, markers etc. → complexity
• L2 acquisition difficulty: contrasts, rules, etc. that are hard to learn for adults → complexity
(see Szmrecsanyi and Kortmann 2012 for a review)

Complexity in applied linguistics: Larsen-Freeman (2012)

"Second Language Acquisition (SLA) researchers have long harbored aspirations of developing developmental indices, which at least in part would be based on the growing complexity of interlanguage" (Larsen-Freeman 2012: 1)
Diane Larsen-Freeman

Why we measure interlanguage complexity: Ortega (2012)

"Second language acquisition researchers use interlanguage complexity measures [...] with at least three main purposes in mind: (a) to gauge proficiency, (b) to describe performance, and (c) to benchmark development." (Ortega 2012: 128)
Lourdes Ortega

Some popular complexity measures in the applied linguistics literature

• length of units (e.g. mean length of T-unit): long units → complexity
• density of subordination: subordination → complexity
• frequency of occurrence of 'complex' forms (e.g. passive voice): many complicated forms → complexity
(see Ortega 2003, 2012 for reviews)

The quest for new complexity measures

• "theoretical" measures: nicely holistic but not easily amenable to operationalization in usage data (how do you measure, e.g., "historical baggage"?)
• "applied" measures: nicely amenable to operationalization in usage data but suffering from "concept reductionism" (Ortega 2012: 128)

Typological profiles

Drawing inspiration from quantitative typology

• time-honored holistic-typology notions:
  • analyticity: free markers (+ word order)
  • syntheticity: bound markers (e.g. inflections)
• Greenberg (1960): seemingly abstract typological notions are amenable to precise measurements
• calculate indices to profile the way grammatical information is coded in corpus texts
Joseph H. Greenberg (1915–2011)

The link to language complexity

• analyticity increases explicitness and transparency while easing comprehension difficulty (e.g. Humboldt 1836: 284-285)
• syntheticity difficult because of the allomorphies it creates (e.g. Braunmüller 1990: 627)
• known interlanguage universal: learners avoid inflectional marking & prefer analyticity (e.g. Klein and Perdue 1997)
• rule of thumb: analyticity → simple, syntheticity → complex

Coding

use POS tags to group word tokens in corpus texts into three broad categories:
1. analytic word tokens, i.e. function words that are members of synchronically closed word classes (conjunctions, determiners, pronouns, prepositions, infinitive markers, modal & auxiliary verbs, negators)
2. synthetic word tokens, which carry bound grammatical markers (e.g. inflectional affixes)
3. a small number of analytic & synthetic word tokens (e.g. inflected auxiliary verbs)

Calculating Greenberg-inspired indices

• the analyticity index (AI): calculated as the ratio of the number of free grammatical markers (i.e. function words) in a text to the total number of words in the text, normalized to a sample size of 1,000 words of running text
• the syntheticity index (SI): calculated as the ratio of the number of words in a text that bear a bound grammatical marker to the total number of words in the sample text, normalized to a sample size of 1,000 words of running text
Greenberg (1960: Tab. 1)

The cross-linguistic perspective

[Figure: scatterplot of analyticity index (y-axis) against syntheticity index (x-axis). Labeled data points: Tok Pisin, English, Italian, German, Hawai'i Creole, Russian.]
AI versus SI, medium: written. European languages: contemporary newspaper prose. Creole languages: texts obtained & annotated by Geoff Smith and Aya Inoue (see Siegel, Szmrecsanyi & Kortmann 2014).

The British National Corpus (BNC)

• 90 million words of written standard British English
• 10 million words of spoken British English
• 34 major registers (16 spoken, 18 written; for instance, S conv and W fict)
(see Aston and Burnard 1998 for details)

BNC registers

[Figure: scatterplot of analyticity index (y-axis) against syntheticity index (x-axis) for the individual BNC registers; legend: medium (spoken, written).]
AI versus SI in the British National Corpus (BNC). Spoken registers versus written registers (adapted from Szmrecsanyi 2009: Fig. 2).
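
The coding scheme and the two index definitions from the "Coding" and "Calculating Greenberg-inspired indices" slides can be illustrated with a minimal sketch. It is not the author's implementation: the slides do not list the tag sets, so the Penn-Treebank-style tags, the auxiliary and negator word lists, and the function names classify and indices are all assumptions made for illustration. The sketch also assumes that tokens in the third category (analytic & synthetic) count toward both indices.

```python
# Hypothetical tag and word lists (assumption: Penn-Treebank-style POS tags;
# the slides do not specify which tagset or lists were actually used).
FUNCTION_TAGS = {"CC", "DT", "IN", "PRP", "PRP$", "TO", "MD"}
INFLECTED_TAGS = {"NNS", "VBD", "VBZ", "VBG", "VBN", "JJR", "JJS"}
# Simplification: lexical uses of be/have/do are not told apart from auxiliary uses.
AUXILIARY_FORMS = {
    "be", "am", "is", "are", "was", "were", "been", "being",
    "have", "has", "had", "having", "do", "does", "did",
}
NEGATORS = {"not", "n't"}


def classify(word, tag):
    """Return (is_analytic, is_synthetic) for one POS-tagged token."""
    w = word.lower()
    analytic = tag in FUNCTION_TAGS or w in AUXILIARY_FORMS or w in NEGATORS
    synthetic = tag in INFLECTED_TAGS
    return analytic, synthetic


def indices(tagged_tokens, per=1000):
    """Return (AI, SI): analytic and synthetic tokens per `per` running words."""
    total = len(tagged_tokens)
    if total == 0:
        raise ValueError("cannot compute indices for an empty text")
    n_analytic = 0
    n_synthetic = 0
    for word, tag in tagged_tokens:
        analytic, synthetic = classify(word, tag)
        # A token that is both (e.g. an inflected auxiliary) adds to both counts.
        n_analytic += analytic
        n_synthetic += synthetic
    return per * n_analytic / total, per * n_synthetic / total


sample = [
    ("The", "DT"), ("dogs", "NNS"), ("were", "VBD"), ("barking", "VBG"),
    ("at", "IN"), ("the", "DT"), ("cats", "NNS"),
]
ai, si = indices(sample)
print(f"AI = {ai:.1f}, SI = {si:.1f}")
```

In the seven-token sample, "The", "were", "at" and "the" count as analytic and "dogs", "were", "barking" and "cats" as synthetic, so both indices come out at 4/7 × 1,000. A real study would run this over full corpus texts with the tag lists appropriate to the corpus annotation.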