Why Are Some Languages Confused for Others? Investigating Data from the Great Language Game
RESEARCH ARTICLE

Hedvig Skirgård1☯, Seán G. Roberts2☯*, Lars Yencken3☯

1 The Wellsprings of Linguistic Diversity Laureate project, Department of Linguistics, School of Culture, History and Language, College of Asia and the Pacific, Australian National University, Acton, ACT, Australia
2 Language & Cognition Department, Max Planck Institute for Psycholinguistics, Nijmegen, Gelderland, Netherlands
3 Independent researcher, Melbourne, Victoria, Australia

☯ These authors contributed equally to this work.
* [email protected]

OPEN ACCESS

Citation: Skirgård H, Roberts SG, Yencken L (2017) Why are some languages confused for others? Investigating data from the Great Language Game. PLoS ONE 12(4): e0165934. https://doi.org/10.1371/journal.pone.0165934
Editor: Niels O. Schiller, Leiden University, NETHERLANDS
Received: December 22, 2015
Accepted: October 20, 2016
Published: April 5, 2017
Copyright: © 2017 Skirgård et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: All relevant data are within the paper and its Supporting Information files.
Funding: SGR is supported by the Max Planck society and an ERC Advanced Grant No. 269484 INTERACT to Stephen Levinson. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.

Abstract
In this paper we explore the results of a large-scale online game called 'the Great Language Game', in which people listen to an audio speech sample and make a forced-choice guess about the identity of the language from 2 or more alternatives. The data include 15 million guesses from 400 audio recordings of 78 languages. We investigate which languages are confused for which in the game, and if this correlates with the similarities that linguists identify between languages. This includes shared lexical items, similar sound inventories and established historical relationships. Our findings are, as expected, that players are more likely to confuse two languages that are objectively more similar. We also investigate factors that may affect players' ability to accurately select the target language, such as how many people speak the language, how often the language is mentioned in written materials and the economic power of the target language community. We see that non-linguistic factors affect players' ability to accurately identify the target. For example, languages with wider 'global reach' are more often identified correctly. This suggests that both linguistic and cultural knowledge influence the perception and recognition of languages and their similarity.

Introduction
Objective studies of similarities and differences of languages allow researchers to reconstruct patterns of historical change in linguistic structures [1], or social changes such as migrations [2, 3]. However, knowledge of linguistic variation is also important in the day-to-day life of non-linguists. Being confronted with, and often confounded by, a different accent or a foreign language is perhaps a universal human experience; we are often in situations where we hear snippets of unfamiliar conversation in a public space or on the radio, and we try to identify what it is we hear. People are sensitive to differences in pronunciation and vocabulary and are quick to link these differences to their knowledge of cultural differences. Our attitudes and the
way we change our behaviour when trying to communicate past these differences form part of the basis for language change, but are very hard to study. In some cases, identifying non-group members by linguistic means may be a matter of life or death, as in the Biblical episode involving the pronunciation of Hebrew s(h)ibboleth 'ear of wheat' (Judges 12:6). More recent examples include the 20th-century civil war in Sri Lanka, where allophonic differences were used to differentiate Tamil and Sinhalese speakers (see [4]). Indeed, linguistic diversity can be driven by the need to differentiate between cultural groups [5] (see also Thurston's concept of 'esoterogeny' [6] and Larsen's 'nabo-opposisjon' [7]). In extreme cases, a biased attitude towards a language can severely affect perception, even to the extent of reducing comprehension [8, 9].

In spite of the importance of perceiving linguistic differences in daily life, there have been few studies exploring the factors which influence people's ability to recognise the languages of the world and their similarities. The Great Language Game presents us with a new kind of data that has not previously been available, and offers the chance to investigate the perception of language similarity.

In this paper, we analyse the results of this online game, which was played millions of times by a global sample of people and in which the goal was to accurately identify languages based on a 20-second audio clip of speech. Based on the players' behaviour, we can extract information about which languages they can easily identify, and where they get confused. We use the results of this game to investigate what factors make a language easy to identify accurately, including non-linguistic factors such as economic power and the quality of the audio recordings. We also assess whether languages which are often confused also have some objective linguistic similarity, for example being closely related, geographically close, or having similar sound systems or lexicon.
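As an illustration of how information about easily identified and easily confused languages might be extracted from forced-choice guesses, the following is a minimal sketch in Python. It is not the analysis code used in this paper: the record layout (a pair of target language and guessed language), the function name summarise_guesses and the toy data are all assumptions made for the example.

```python
from collections import Counter


def summarise_guesses(guesses):
    """Tally per-language accuracy and confusion counts.

    `guesses` is an iterable of (target, guess) pairs. This layout is a
    hypothetical simplification, not the game's actual data format.
    """
    totals = Counter()      # times each language was the target
    correct = Counter()     # times each target was identified correctly
    confusions = Counter()  # (target, wrong guess) -> number of times
    for target, guess in guesses:
        totals[target] += 1
        if guess == target:
            correct[target] += 1
        else:
            confusions[(target, guess)] += 1
    accuracy = {lang: correct[lang] / totals[lang] for lang in totals}
    return accuracy, confusions


# Invented toy data, for illustration only.
toy_guesses = [
    ("Swedish", "Swedish"),
    ("Swedish", "Norwegian"),
    ("Swedish", "Norwegian"),
    ("Norwegian", "Swedish"),
    ("Norwegian", "Norwegian"),
    ("Thai", "Thai"),
]

accuracy, confusions = summarise_guesses(toy_guesses)
for (target, guess), n in confusions.most_common():
    print(f"{target} mistaken for {guess}: {n}")
```

Raw counts like these are only a starting point. Because each round offers a limited set of alternatives, a fuller treatment would presumably normalise each confusion count by how often the wrongly chosen language was actually offered alongside the target.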
More specifically, the research questions we address in this paper are:
· Q1: Which languages are confused for which others?
· Q2: Are there any asymmetries of confusion?
· Q3: What factors can predict whether players confuse two languages for each other? Candidates include:
  - geographical closeness;
  - genealogy;
  - similarity of phoneme inventories;
  - lexical similarity.
· Q4: What factors can predict players' accuracy? Candidates include:
  - acoustic quality of the speech samples;
  - proportion of non-native speakers (L2 speakers);
  - total native (L1) speaker population;
  - linguistic diversity of the main country in which the language is spoken;
  - number of countries the language is spoken in;
  - "language name transparency": how clear the link is between the language name and the name of the main country;
  - economic power of the main country as measured by Gross Domestic Product (GDP);
  - familiarity with the language, as measured by the frequency of occurrence of the language name in written materials such as the English and Chinese corpora from Google Books.
· Q5: Is players' performance more accurately predicted by linguistic or non-linguistic factors?
· Q6: Are some phonological cues more important than others in predicting players' accuracy?

Identifying languages and relationships between them
Dividing a continuum of related linguistic varieties into discrete languages and families is a difficult task. As Evans & Levinson note, "we are the only species with a communication system which is fundamentally variable at all levels [of linguistic organization]" [10]. The most common methods of separating varieties into languages are based on tests of mutual intelligibility, shared literary tradition and/or shared lexicons [11, 12].
However, speaker/signer communities may have their own perceptions and labelling of group identity (cf. [13]). The editors of the Ethnologue recognise that "a language is more often comprised of continua of features that extend across time, geography, and social space. There is growing attention being given to the roles or functions that language varieties play within the linguistic ecology of a region or a speech community" [14]. Language users' own perception of what are salient contrasts and important categories can often differ significantly from what linguists focus on. Often, categorizations by language users are politically motivated. For example, Norwegian and Swedish are considered different languages by their respective language communities, even though large groups of speakers from the two communities can understand each other and the two languages share a large lexicon (cf. [15]).

There are at least three types of relationship between languages: temporal, spatial and social. For example, the gradual differences leading from Middle English to Modern English form a temporal continuum of variation within the same community. Linguistic borrowing due to contact can result in a spatial continuum, as found in the areal patterns of East Asia (e.g. Cantonese, Mandarin, Japanese and Korean in our sample) or Mainland Southeast Asia (Khmer, Burmese, Lao and Thai). Often, languages are related both in time (i.e. genealogy) and in space (i.e. geographical proximity), as in the case of the closely-related Germanic dialects of Belgium, the Netherlands, Germany, Switzerland, Luxembourg and Austria. Languages can also vary on a social scale; this is reflected, for example, in different language varieties within the same language depending on socio-economic