Mapping Lexical Dialect Variation in British English Using Twitter Grieve, Jack; Montgomery, Chris; Nini, Andrea; Murakami, Akira; Guo, Diansheng
Total Page:16
File Type:pdf, Size:1020Kb
University of Birmingham Mapping lexical dialect variation in British English using Twitter Grieve, Jack; Montgomery, Chris; Nini, Andrea; Murakami, Akira; Guo, Diansheng DOI: 10.3389/frai.2019.00011 License: Creative Commons: Attribution (CC BY) Document Version Publisher's PDF, also known as Version of record Citation for published version (Harvard): Grieve, J, Montgomery, C, Nini, A, Murakami, A & Guo, D 2019, 'Mapping lexical dialect variation in British English using Twitter', Frontiers in Artificial Intelligence, vol. 2, 11. https://doi.org/10.3389/frai.2019.00011 Link to publication on Research at Birmingham portal Publisher Rights Statement: Checked for eligibility: 02/08/2019 General rights Unless a licence is specified above, all rights (including copyright and moral rights) in this document are retained by the authors and/or the copyright holders. The express permission of the copyright holder must be obtained for any use of this material other than for purposes permitted by law. •Users may freely distribute the URL that is used to identify this publication. •Users may download and/or print one copy of the publication from the University of Birmingham research portal for the purpose of private study or non-commercial research. •User may use extracts from the document in line with the concept of ‘fair dealing’ under the Copyright, Designs and Patents Act 1988 (?) •Users may not further distribute the material nor use it for the purposes of commercial gain. Where a licence is displayed above, please note the terms and conditions of the licence govern your use of this document. When citing, please reference the published version. Take down policy While the University of Birmingham exercises care and attention in making items available there are rare occasions when an item has been uploaded in error or has been deemed to be commercially or otherwise sensitive. If you believe that this is the case for this document, please contact [email protected] providing details and we will remove access to the work immediately and investigate. Download date: 01. Mar. 2020 ORIGINAL RESEARCH published: 12 July 2019 doi: 10.3389/frai.2019.00011 Mapping Lexical Dialect Variation in British English Using Twitter Jack Grieve 1*, Chris Montgomery 2, Andrea Nini 3, Akira Murakami 1 and Diansheng Guo 4 1 Department of English Language and Linguistics, University of Birmingham, Birmingham, United Kingdom, 2 School of English, University of Sheffield, Sheffield, United Kingdom, 3 Department of Linguistics and English Language, University of Manchester, Manchester, United Kingdom, 4 Department of Geography, University of South Carolina, Columbia, SC, United States There is a growing trend in regional dialectology to analyse large corpora of social media data, but it is unclear if the results of these studies can be generalized to language as a whole. To assess the generalizability of Twitter dialect maps, this paper presents the first systematic comparison of regional lexical variation in Twitter corpora and traditional survey data. We compare the regional patterns found in 139 lexical dialect maps based on a 1.8 billion word corpus of geolocated UK Twitter data and the BBC Voices dialect survey. A spatial analysis of these 139 map pairs finds a broad alignment between these two data sources, offering evidence that both approaches to data collection allow for the same basic underlying regional patterns to be identified. We argue that these Edited by: results license the use of Twitter corpora for general inquiries into regional lexical variation Jonathan Dunn, University of Canterbury, New Zealand and change. Reviewed by: Keywords: dialectology, social media, Twitter, British English, big data, lexical variation, spatial analysis, Benedikt Szmrecsanyi, sociolinguistics KU Leuven, Belgium Gabriel Doyle, San Diego State University, United States INTRODUCTION John Nerbonne, University of Freiburg, Germany Regional dialectology has traditionally been based on data elicited through surveys and interviews, *Correspondence: but in recent years there has been growing interest in mapping linguistic variation through Jack Grieve the analysis of very large corpora of natural language collected online. Such corpus-based [email protected] approaches to the study of language variation and change are becoming increasingly common across sociolinguistics (Nguyen et al., 2016), but have been adopted most enthusiastically in Specialty section: dialectology, where traditional forms of data collection are so onerous. Dialect surveys typically This article was submitted to require fieldworkers to interview many informants from across a region and are thus some of the Language and Computation, most expensive and complex endeavors in linguistics. As a result, there have only been a handful a section of the journal of surveys completed in the UK and the US in over a century of research. These studies have been Frontiers in Artificial Intelligence immensely informative and influential, shaping our understanding of the mechanisms of language Received: 02 April 2019 variation and change and giving rise to the modern field of sociolinguistics, but they have not Accepted: 12 June 2019 allowed regional dialect variation to be fully understood, especially above the levels of phonetics Published: 12 July 2019 and phonology. As was recently lamented in the popular press (Sheidlower, 2018), this shift from Citation: dialectology as a social science to a data science has led to a less personal form of scholarship, but it Grieve J, Montgomery C, Nini A, has nevertheless reinvigorated the field, democratizing dialectology by allowing anyone to analyse Murakami A and Guo D (2019) Mapping Lexical Dialect Variation in regional linguistic variation on a large scale. British English Using Twitter. The main challenge associated with corpus-based dialectology is sampling natural language in Front. Artif. Intell. 2:11. sufficient quantities from across a region of interest to permit meaningful analyses to be conducted. doi: 10.3389/frai.2019.00011 The rise of corpus-based dialectology has only become possible with the rise of computer-mediated Frontiers in Artificial Intelligence | www.frontiersin.org 1 July 2019 | Volume 2 | Article 11 Grieve et al. Mapping Lexical Variation Using Twitter communication, which deposits massive amounts of regionalized is therefore to compare lexical dialect maps based on Twitter language data online every day. Aside from early studies based corpora and survey data so as to assess the degree to which these on corpora of letters to the editor downloaded from newspaper two approaches to data collection yield comparable results. We websites (e.g., Grieve, 2009), this research has been almost do not assume that the results of surveys generalize; rather, we entirely based on Twitter, which facilitates the collection of believe that alignment between these two very different sources large amounts of geolocated data. Research on regional lexical of dialect data would be strong evidence that both approaches variation on American Twitter has been especially active (e.g., to data collection allow for more general patterns of regional Eisenstein et al., 2012, 2014; Cook et al., 2014; Doyle, 2014; dialect variation to be mapped. A secondary goal of this study Jones, 2015; Huang et al., 2016; Kulkarni et al., 2016; Grieve is to test how consistent dialect patterns are across different et al., 2018). For example, Huang et al. (2016) found that communicative contexts. Corpus-based dialectology has shown regional dialect patterns on American Twitter largely align that regional variation pervades language, even in the written with traditional dialect regions, based on an analysis of lexical standard (Grieve, 2016), but we do not know how stable regional alternations, while Grieve et al. (2018) identified five main variation is on the level of individual linguistic features. To regional patterns of lexical innovation through an analysis of the address these gaps in our understanding of regional linguistic relative frequencies of emerging words. Twitter has also been variation, this paper presents the first systematic comparison used to study more specific varieties of American English. For of lexical dialect maps based on surveys and Twitter corpora. example, Jones (2015) analyzed regional variation in African Specifically, we report the results of a spatial comparison of American Twitter, finding that African American dialect regions the maps for 139 lexical variants based on a multi-billion-word reflect the pathways taken by African Americans as they migrated corpus of geocoded British Twitter data and the BBC Voices north during the Great Migration. There has been considerably dialect survey. less Twitter-based dialectology for British English. Most notably, Bailey (2015, 2016) compiled a corpus of UK Twitter and mapped a selection of lexical and phonetic variables, while Shoemark BRITISH DIALECTOLOGY et al. (2017) looked at a Twitter corpus to see if users were more likely to use Scottish forms when tweeting on Scottish topics. In Interest in regional dialect variation in Great Britain is addition, Durham (2016) used a corpus of Welsh English Twitter longstanding, with the earliest recorded comments on accent to examine attitudes toward accents in Wales, and Willis et al. dating back to the fifteenth and sixteenth centuries (Trevisa, (2018) have begun to map grammatical variation in the UK. 1495). The study of regional variation in lexis grew in popularity Research in corpus-based