SOUND COMPARISONS: A NEW ONLINE DATABASE AND RESOURCE FOR RESEARCH IN PHONETIC DIVERSITY

Paul Heggarty et al. — A full list of authors, affiliations and emails appears at end of the paper

Max Planck Institute for the Science of Human History et al.

ABSTRACT

Sound Comparisons hosts over 90,000 individual recordings and 50,000 narrow phonetic transcriptions from 600 varieties from eleven language families around the world. This resource is designed to serve researchers in phonetics, [...] and related fields. Transcriptions follow new initiatives for standardisation in usage of the IPA and Unicode. At soundcomparisons.com, users can explore the transcription datasets by phonetically-informed search and filtering, customise selections of languages and words, download any targeted data subset (sound files and transcriptions) and cite it through a custom URL. We present sample research applications based on our extensive coverage of regional and sociolinguistic variation within major languages, and also of endangered languages, for which Sound Comparisons provides a rapid first documentation of their diversity in phonetics. The multilingual interface and user-friendly, ‘hover-to-hear’ maps likewise constitute an outreach tool, where speakers can instantaneously hear and compare the phonetic diversity and relationships of their native languages.

Keywords: database, endangered languages, sound change, comparative/historical linguistics, diversity.

1. PURPOSE: A RESOURCE FOR PHONETIC RESEARCH

Sound Comparisons is a major collaborative project that since 2002 has collected over 90,000 individual word recordings from 600 language varieties of eleven language families around the world. Of these recordings, over 50,000 are accompanied by the corresponding narrow phonetic transcriptions, conforming to the latest approaches to standardising usage of IPA and Unicode (see §5 below), to ensure their utility also for the growing use of computational analyses in phonetics. The fundamental phonetic objective entails a focus on collecting wordforms that are cognate within each language family covered, to ensure the most direct and meaningful comparisons between different realisations of the same original form.

This paper focuses on the potential of the Sound Comparisons data-set, from its wide range of linguistic contexts, to support research in phonetics; it can also be applied to other fields (comparative/historical linguistics, dialectology, sociolinguistics). To make the most of this potential, Sound Comparisons is founded equally on a powerful online interface to explore its data-set. This offers the functionality needed to turn Sound Comparisons into a resource and research tool for phoneticians. It includes powerful functions to search and filter the phonetic diversity represented in the database, to customise the selection and display of data, and to download and cite any such targeted selections (§6). Having presented the database, its coverage and functionality, this paper closes with brief illustrations of some sample research applications.

2. DATABASE STRUCTURE AND COVERAGE

The Sound Comparisons data-set is structured as a series of ‘studies’ (currently ten), by language family and/or by region. As a framework for the collection, storage and exploration of data on phonetic diversity, Sound Comparisons can in principle host data on any language family or region worldwide. Actual coverage so far is a function of the specialisms and research objectives of the main researchers involved, and the availability of time and funding for fieldwork to make the recordings.

• In Europe (6 studies): four on Romance, Germanic, Balto-Slavic and Celtic (and parts of their dialectal [...] worldwide); a Europe study on their overlap in deep Indo-European word [...]; and a highly focused study on dialectal variation within English.
• In South America (3): on the Central Andes (the Quechua, Aymara and Uru families); on Mapudungun in Patagonia; and a pilot study on the indigenous languages of Brazil.
• In Vanuatu (1): covering 40+ languages and 80 dialectal variants on the linguistically hyper-diverse island of Malakula, and its neighbours.

Studies also include, where known, assumed and reconstructed transcriptions for earlier historical stages of the modern languages recorded, such as Shakespearean English, [...], Classical Latin or Proto-Quechua. These are of particular use for comparative and historical purposes (see §6).

3. DIVERSITY AND DOCUMENTATION

Much of the Sound Comparisons coverage thus far has prioritised languages that are highly endangered, indeed moribund. This reflects an urgent effort in language documentation, of a new form that can be considered ‘shallow but broad’. For although the data collected for any language are but a limited word-list, those lists are devised to provide at least a representative sample of the phonetics of that variety. Moreover, they can be collected relatively quickly, in the space of a few hours from a suitable native-speaker informant. Before linguists can document them in richer detail, many of the varieties recorded will have vanished; some already have.

A further major objective of Sound Comparisons is outreach, to return something to the informants’ speaker communities, including contributing to efforts towards awareness, native-language [...] and revitalisation. The explorer website is specifically designed to be as user-friendly as possible to the general public, not least speakers of the languages themselves. It includes a multilingual user-interface, so that the [...] can be explored through the entire website in a Bislama-language version (the English-based creole that is the [...] of Vanuatu).

For researchers in phonetics, Sound Comparisons thus constitutes a mass of detailed phonetic data on hundreds of languages and dialects that are otherwise very little documented, and for which few or no other recordings and resources are easily available, let alone in a form immediately and directly comparable with cognate forms in scores of other related language varieties.

4. DATA SELECTION CRITERIA

Within each study, the data are structured on two axes: language varieties, and words. (Both can be searched and customised, independently: see §6.) ‘Language’ refers to any lect: e.g. 49 modern and 10 historical varieties of English, 37 of Mapudungun.

Sampling of language varieties has been determined by two main criteria: to be representative of linguistic and dialectal diversity, and urgency in the face of the imminent extinction that hangs over much of that diversity. The complex balance between these criteria often overrides the default of sampling evenly through geographical space. Fuller explanations of the sampling policies followed are given in [1], and in a series of subsequent publications in preparation on each individual study.

Examples of urgently targeted languages include the [...] originally native to Pomerania, but now spoken only by a last generation particularly in Wisconsin, [...] and Brazil (descendants of emigrants in the 1850s). Similar motivations have driven the Sound Comparisons coverage of Volga German, Transylvanian ‘Saxon’ and [...] ‘Dutch’, Patagonian Welsh, the [...] of [...], Chabacano Spanish ‘creole’ in the Philippines, both varieties of Sorbian, and so on.

The ten studies in Sound Comparisons fall into two different types. Linguists might instinctively assume that the words axis is based on a Swadesh-type reference list of basic meanings such as HEAD, LEG or DOG, for which Sound Comparisons would collect the word used in each language. But this actually applies only to the latest two studies: the special case of Vanuatu, and the pilot study on Brazil, which do work to modified regional versions of the Swadesh 200-meaning list.

All other studies follow a different rationale, however, in line with the fundamentally phonetic and comparative motivation behind “sound comparisons”. To that end, it is not meanings but cognates that allow for the most valuable, direct and meaningful comparisons of sounds between different languages. Cognates constitute extensive sets of phonetic realisations that may differ slightly or very significantly, but which all correspond to each other as reflexes of the same original underlying form. Such cognate sets can span the diversity across the dialects and languages of an entire language family, as well as numerous varieties (geolectal and sociolectal) of a single language.

This is why each family constitutes a separate study, with its own list of cognates as its ‘word’ axis. The Germanic study, for instance, is based on c. 100 cognates that have survived in (ideally) all languages and dialects across that family. This is not a list of meanings HEAD, LEG or DOG, then, but of sets of cognates in Germanic, which may have those meanings in some languages, but not in all. So English head is covered alongside its cognate Haupt in German, rather than its meaning equivalent Kopf. (Haupt now means ‘main’, but like English head, it goes back to the same Proto-Germanic *xaubadan.) Similarly, Bein may mean LEG in standard German, but is set alongside English bone, its true cognate (from *bainan). In some dialects, too, Bein means ‘bone’.

In the Romance family, though, no single cognate survives well across the family for any of the Latin source terms for HEAD, LEG or DOG. Necessarily, for Romance the Sound Comparisons cognate list is different, and based on Latin as the reference for common origin and cognacy. The meanings in which cognates happen to survive widely vary significantly from one family to the next. The Europe ‘overlap’ study has just twenty deep Indo-European cognates across the four European studies.

As for how many and which cognates are covered in each study, even within an individual branch like Germanic, Romance or Celtic, diversity is enough to make it a challenge even to find many more than 100 or so cognates that survive (and can still be straightforwardly elicited) in all of the dialectal variation within each branch. Details on this, and on other criteria to ensure that the list makes for as balanced as possible a sample of the phonetics of each language family, are given in [1].

5. TRANSCRIPTION POLICIES AND NORMALISATION TO CLTS

For any database of phonetic transcriptions, key concerns are consistency and standardisation. All Sound Comparisons transcriptions follow IPA usage, in Unicode. Even so, in practice those still leave much scope for inconsistency. The need for data to be consistent, standardised and unambiguous is particularly acute for the growing trend towards analysing phonetic transcription data by computational methods.

Sound Comparisons seeks to limit transcription inconsistencies through specific conventions and guidelines for its transcribers, and by ensuring that wherever possible, all transcriptions within each family, or at least major sub-branch, are by the same phonetician. Transcriptions are generally narrowest, though, for English, Mapudungun, Romance and Balto-Slavic; and broader for Germanic, the Andean families, and particularly Vanuatu.

All Sound Comparisons transcriptions have also been run through computer-assisted correction and normalisation. The Cross-Linguistic Transcription System (CLTS, see [2]) provides software that can identify many forms of error or departures from IPA and Unicode usage, and can correct transcriptions to a proposed ‘normalised’ IPA. CLTS tools identify and correct Unicode “confusables” (see [3]), e.g. [ə] U+0259 “Latin small letter schwa” vs. [ǝ] U+01DD “Latin small letter turned e”, and common transcription ‘shorthand’ characters, e.g. the ASCII [:] instead of the correct Unicode length marker [ː] U+02D0. CLTS also normalises the order of diacritics, e.g. [kʲʰ] rather than [kʰʲ], and their best placement relative to the base symbol (e.g. a devoiced ring either above or below it). In return, the narrow phonetic transcriptions of Sound Comparisons served as an extensive test that contributed to the final stage of development, refinement and extension of the CLTS software itself.

6. WEBSITE RESEARCH FUNCTIONALITY: SEARCH, FILTER, DOWNLOAD, CITE

The two axes of the database, languages and words, also structure the explorer website. The main central panel is flanked by a language selector panel on the left, and a word selector panel to the right. The data thus selected appear in the main central panel, in map or table views, showing: the pronunciations of a single selected word in all languages; of all words in a single language; or of any selected combination of words and languages. The corresponding sound files are played and compared instantaneously by touching, clicking or just hovering the cursor over any transcription or play icon.

Among many search and filter options, most useful for phonetic research are two entry boxes in the word selector column: to filter/search by either the orthography or the phonetic transcriptions of any language variety in that study. So a user can set the orthography filter language to English, and type [...] to return all words that contain the grapheme [...] in their English orthography; or set the transcriptions filter language to Doric Scots, and type f to return all words that contain IPA [f] in their transcriptions in Doric Scots. All cognate reflexes in other languages can then be shown by just clicking an icon to reset the main display table to this newly filtered set of words (in the multi-language table views).

Both filter boxes allow full ‘regular expressions’. So typing ^f or f$, for instance, returns all words that begin or end with f, respectively. The transcription filter box has pop-up IPA charts so that the user can enter any IPA character or diacritic directly, without needing complex key combinations. It also recognises phonological shorthand characters in upper case: i.e. V for any vowel, N for any nasal, and so on. Entering VːN$, for example, returns all words that end in a long vowel followed by a nasal. It is possible to filter for any Unicode character, even if only a modifier or diacritic, to find all instances, irrespective of the main symbol it appears with. So typing only ʷ returns all words with any labialised segment. With the search language set to an earlier, ancestral language, the user can filter to a given sound in that language, and then immediately create a side-by-side table to compare all of that (proto-)sound’s reflexes across all descendant varieties, e.g. all modern Romance reflexes of Classical Latin /kʷ/ or written [...], or all modern reflexes of Early [...].

For any combination of languages and words customised using these search and filter operations, clicking on the link icon gives a durable shortcut URL so that the user can also cite that specific combination of data. Clicking on the download icons exports to the user the phonetic transcriptions and sound files for that precise selection of languages and words. Exports are in Unicode plain csv and tsv formats, and comply with the proposed Cross-Linguistic Data Format (CLDF) standards (see [4]), to facilitate integration with the suite of databases that already follow this format (clld.org/datasets.html). This progressive integration aims also to ensure the sustainability and long-term data curation that CLDF and CLLD are specifically devised to provide.

It is also possible to cite any studies and views with custom URLs after the soundcomparisons.com/ (replaced by ../ in the links below), as follows.

• Any specific study: e.g. ../Germanic.
• Any specific view type and word: e.g. ../Germanic/map/night
• Any current user interface language for the website (English, Bislama, Croatian, German, Italian, Polish, Portuguese, Russian, Spanish), using its two-letter ISO 639-1 code: e.g. ../bi/Vanuatu for the Bislama version.
• All data for any variety, using its three-letter ISO 639-3 code, or its aaaa0000 format GlottoCode: e.g. either ../cap or ../chip1262 for the Chipaya language.
• Often a single ISO or GlottoCode has several sub-varieties in Sound Comparisons, so those codes return a table of all of those sub-varieties: e.g. ../vls shows the three varieties that fall under ISO code vls for Flemish.
• Any individual variety is specified by its Sound Comparisons URL-name: e.g. ../Gmc_W_Dut_ZandF_FlW_FrenchFlemish_Dl for the ‘Flemish’ specific to northern France.

7. RANGE OF RESEARCH APPLICATIONS

Some of the earliest publications founded on the data now built into Sound Comparisons focused on English dialectology (see [5], [6]). Those were soon extended to Germanic historical linguistics, to the general methodological challenge of quantifying [...], and to whether such measures can validly contribute to diachronic study, as in [7].

On a vexed question of language classification, meanwhile, [8] asks whether Huilliche qualifies as either a fully-fledged language and ‘sister’ to Mapudungun, or just a slightly divergent dialect within it. Sound Comparisons data were employed as evidence to clarify the [...] and degree of phonetic divergence that underlies the classificatory confusion. Mapudungun is also characterised by a “typologically famous series of interdental-alveolar oppositions”, i.e. not just /θ/ vs. /s/, but also /t̪/ vs. /t/, /n̪/ vs. /n/ and /l̪/ vs. /l/. Despite claims of its demise, in [9] this opposition is shown to survive in various of the 37 regiolects of Mapudungun covered by the Sound Comparisons, as at ../sl/2G (where /sl/ = a short link). Other data such as at ../sl/n5, meanwhile, are employed in [8] to show that one Huilliche variety may even have developed a “bisyllabic [...] with phonemic status /l̩̪d̪/”.

Sound Comparisons is also particularly suited to research on dialect continua, including where migration and contact bring further complexities. As one illustration, the western Erzgebirge, the ‘Ore Mountains’ that form a natural frontier between Germany and the Czech Republic, were from the 12th century onwards settled primarily by speakers of High Franconian and Thuringian dialects, who thus also brought their speech into contact with new, more easterly neighbours. Even a small customised sample at ../sl/Mm can reveal this mix of origins and contact. Western Erzgebirgisch deletes word-final *n (as in ‘stone’ in Table 1) and intervocalic *g (as in ‘nail’), both traits probably inherited from Franconian. But it also shows the typically (East) Thuringian hardening of word-initial *j- (as in ‘year’), while its vowel system has been re-shaped by retracting *a (as in ‘name’) and by fronting back rounded vowels (as in ‘dog’), traits typical of Upper Saxon, the main contact variety since the migrations.

Table 1: Western Erzgebirgisch compared to its source and contact dialects (data at ../sl/Mm).
Localities: Altersbach, Altenburg, Aue.

dialect group   East Franconian   Thuringian   Upper Saxon   West Erzgebirge
‘stone’         ʃtaɪ              ʃteːn        ʃtɛɪn         ʃtæː
‘nail’          nɛːl              nɔ̝ːʁəl       nʌːʁ̥əl        nʌː.əl
‘year’          juwɚ              ɡɔˑʌ         jɔˤː          ɡɔ̟ˤː
‘name’          nuːmə             nɤːmə        nʌːmə         nʌːmə
‘dog’           hɔʏnd̥             hʌntʰ        hɵnt          hɵnt

Other language families and regions covered within Sound Comparisons offer many further data-sets open to researchers to exploit: on, for instance, debated aspects of English dialectology such as ‘pre-glottalisation’ in Tyneside English; on geminate reflexes across Italian dialects; on prenasalisation phenomena in many near-undocumented Oceanic languages of Vanuatu; and on the striking consonant clusters in the little-known Chipaya language of highland Bolivia, at ../cap.

In sum, the scale, diversity and precision of the Sound Comparisons database, coupled with its powerful research functionality and full open access online, make for a rich resource. Researchers in phonetics are invited to explore and make use of it.

AUTHORS, AFFILIATIONS AND EMAILS

Paul Heggarty¹, Aviva Shimelman², Giovanni Abete³, Cormac Anderson¹, Scott Sadowsky⁴,¹, Ludger Paschen⁵, Warren Maguire⁶, Lechoslaw Jocz⁷, María José Aninao A.⁸, Laura Wägerle², Darja Appelganz¹, Ariel Pheula do Couto e Silva⁹, [...] C. Lawyer², Ana Suelly Arruda Câmara Cabral⁹, Mary Walworth¹, Jan Michalsky¹⁰, Ezequiel Koile¹, Jakob Runge² & [...]-Jörg Bibiko¹

1 Dept of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany
2 Independent scholar. (Sound Comparisons work performed during employment at 1 and/or at the Max Planck Institute for Evolutionary Anthropology, Leipzig.)
3 Dipartimento di Studi Umanistici, University of Naples Federico II
4 Dpto. de Ciencias del Lenguaje, Pontificia Universidad Católica de Chile, Santiago
5 Leibniz-Zentrum Allgemeine Sprachwissenschaft, Berlin
6 Linguistics and English Language, University of Edinburgh
7 Akademia im. Jakuba Paradyża, Gorzów Wlkp., Poland
8 Pontificia Universidad Católica del Perú, Lima
9 Laboratório de Línguas e Literatura Indígenas, University of Brasilia
10 Technology Management, FAU Erlangen-Nuremberg

ACKNOWLEDGEMENTS

We thank the many hundreds of native speakers who generously gave of their time and allowed us to record their voices, as well as all other contributors to the Sound Comparisons project. This project has relied on extensive funding from the British Academy, Leverhulme Trust and Max Planck Society. Full details are given at: soundcomparisons.com/#/about/Funding

REFERENCES

[1] Heggarty et al. in prep. Sound Comparisons: A new resource for exploring phonetic diversity across language families of the world.
[2] Anderson, C. et al. 2019. A cross-linguistic database of phonetic transcription systems. Yearbook of the Poznan Linguistic Meeting 4.
[3] Moran, S., & Cysouw, M. eds. 2018. The Unicode Cookbook for Linguists. Berlin: Language Science Press. http://doi.org/10.5281/zenodo.1296780
[4] Forkel, R. et al. 2018. Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5: p.180205. http://doi.org/10.1038/sdata.2018.205
[5] McMahon, A.M.S. et al. 2007. The sound patterns of Englishes: representing phonetic similarity. English Language and Linguistics 11 (01): p.113-42. http://doi.org/10.1017/S1360674306002139
[6] Maguire, W. et al. 2010. The past, present and future of English dialects: Quantifying convergence, divergence and dynamic equilibrium. Language Variation and Change 22 (1): p.69-104. http://doi.org/10.1017/S0954394510000013
[7] Heggarty, P., Maguire, W., & McMahon, A.M.S. 2010. Splits or waves? Trees or webs? How divergence measures and network analysis can unravel language histories. Philosophical Transactions of the Royal Society B: Biological Sciences (365): p.3829-43. http://doi.org/10.1098/rstb.2010.0099
[8] Sadowsky, S. et al. 2015. Huilliche: ¿geolecto del mapudungun o lengua propia? Una mirada desde la fonética y la fonología de las consonantes. In: A. Fernández Garay & M. A. Regúnaga (eds) Lingüística indígena sudamericana: aspectos descriptivos, comparativos y areales. Buenos Aires: Editorial de la Facultad de Filosofía y Letras, Universidad de Buenos Aires.
[9] Sadowsky, S. et al. 2013. Mapudungun. Journal of the International Phonetic Association 43 (01): p.87-96. http://doi.org/10.1017/S0025100312000369
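
As a rough illustration of the transcription filter described in §6 (regular expressions plus upper-case shorthand such as V and N), combined with two of the character repairs mentioned in §5, the following minimal Python sketch may help readers who want to reproduce similar searches on exported data. It is not the website's own code and does not use the CLTS software: the function names, the small vowel and nasal inventories, and the sample word-list are all assumptions made for illustration only.

```python
from __future__ import annotations

import re

# Hypothetical, deliberately small inventories: the paper does not list
# which symbols the site's V and N shorthand classes actually cover.
VOWELS = "aeiouyæøɑɔɛɪʊʌɤɵ\u0259"
NASALS = "mnɲŋɳɴɱ"
SHORTHAND = {"V": f"[{VOWELS}]", "N": f"[{NASALS}]"}

# Two repairs named in section 5: turned e (U+01DD) -> schwa (U+0259),
# and ASCII colon -> IPA length mark (U+02D0).
REPAIRS = {"\u01dd": "\u0259", ":": "\u02d0"}


def normalise(transcription: str) -> str:
    """Apply the two character repairs above (a stand-in, not CLTS)."""
    return "".join(REPAIRS.get(ch, ch) for ch in transcription)


def expand_shorthand(pattern: str) -> str:
    """Replace the shorthand letters V and N with regex character classes.

    Naive by design: every upper-case V or N in the pattern is expanded,
    so patterns that need a literal V or N are not supported.
    """
    return "".join(SHORTHAND.get(ch, ch) for ch in pattern)


def filter_words(pattern: str, words: dict[str, str]) -> list[str]:
    """Return the glosses whose normalised transcription matches pattern."""
    regex = re.compile(expand_shorthand(pattern))
    return [gloss for gloss, ipa in words.items() if regex.search(normalise(ipa))]


if __name__ == "__main__":
    # Sample forms loosely based on one column of Table 1; the ASCII colon
    # in 'name' is left in on purpose to exercise normalise().
    sample = {
        "stone": "ʃteːn",
        "nail": "nɔ̝ːʁəl",
        "year": "ɡɔˑʌ",
        "name": "nɤ:mə",
        "dog": "hʌntʰ",
    }
    print(filter_words("VːN$", sample))  # long vowel + nasal at word end
    print(filter_words("^n", sample))    # words beginning with n
```

Normalising before matching means that a search for the length mark also finds forms typed with an ASCII colon, which is the kind of inconsistency §5 is concerned with.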