From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings

Total Page:16

File Type:pdf, Size:1020Kb

From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings Johannes Bjerva Isabelle Augenstein Department of Computer Science Department of Computer Science University of Copenhagen University of Copenhagen Denmark Denmark [email protected] [email protected] Abstract popularity (Dunn et al., 2011;W alchli¨ , 2014; ¨ A core part of linguistic typology is the clas- Ostling, 2015; Cotterell and Eisner, 2017; Asgari sification of languages according to linguistic and Schutze¨ , 2017; Malaviya et al., 2017; Bjerva properties, such as those detailed in the World and Augenstein, 2018). One part of traditional ty- Atlas of Language Structure (WALS). Doing pological research can be seen as assigning sparse this manually is prohibitively time-consuming, explicit feature vectors to languages, for instance which is in part evidenced by the fact that only manually encoded in databases such as the World 100 out of over 7,000 languages spoken in the Atlas of Language Structures (WALS, Dryer and world are fully covered in WALS. Haspelmath, 2013). A recent development which We learn distributed language representations, can be seen as analogous to this is the process which can be used to predict typological prop- erties on a massively multilingual scale. Ad- of learning distributed language representations in ditionally, quantitative and qualitative analy- the form of dense real-valued vectors, often re- ses of these language embeddings can tell us ferred to as language embeddings (Tsvetkov et al., how language similarities are encoded in NLP 2016; Ostling¨ and Tiedemann, 2017; Malaviya models for tasks at different typological levels. et al., 2017). We hypothesise that these language The representations are learned in an unsuper- embeddings encode typological properties of lan- vised manner alongside tasks at three typolog- guage, reminiscent of the features in WALS, or ical levels: phonology (grapheme-to-phoneme even of parameters in Chomsky’s Principles and prediction, and phoneme reconstruction), mor- phology (morphological inflection), and syn- Parameters framework (Chomsky, 1993). tax (part-of-speech tagging). In this paper, we model languages in deep neu- We consider more than 800 languages and find ral networks using language embeddings, consid- significant differences in the language rep- ering three typological levels: phonology, mor- resentations encoded, depending on the tar- phology and syntax. We consider four NLP tasks get task. For instance, although Norwegian to be representative of these levels: grapheme-to- Bokmal˚ and Danish are typologically close to phoneme prediction and phoneme reconstruction, one another, they are phonologically distant, morphological inflection, and part-of-speech tag- which is reflected in their language embed- ging. We pose three research questions (RQs): dings growing relatively distant in a phonolog- ical task. We are also able to predict typolog- ical features in WALS with high accuracies, RQ 1 Which typological properties are encoded even for unseen language families. in task-specific distributed language repre- sentations, and can we predict phonologi- 1 Introduction cal, morphological and syntactic properties For more than two and a half centuries, linguistic of languages using such representations? typologists have studied languages with respect to their structural and functional properties (Haspel- RQ 2 To what extent do the encoded properties math, 2001; Velupillai, 2012). Although typol- change as the representations are fine tuned ogy has a long history (Herder, 1772; Gabelentz, for tasks at different linguistic levels? 1891; Greenberg, 1960, 1974; Dahl, 1985; Com- rie, 1989; Haspelmath, 2001; Croft, 2002), com- RQ 3 How are language similarities encoded in putational approaches have only recently gained fine-tuned language embeddings? 907 Proceedings of NAACL-HLT 2018, pages 907–916 New Orleans, Louisiana, June 1 - 6, 2018. c 2018 Association for Computational Linguistics One of our key findings is that language represen- several hundred languages; and (iv) grounding the tations differ considerably depending on the tar- language representations in linguistic theory. get task. For instance, for grapheme-to-phoneme mapping, the differences between the representa- 3 Background tions for Norwegian Bokmal˚ and Danish increase 3.1 Distributed language representations rapidly during training. This is due to the fact that, although the languages are typologically close to There are several methods for obtaining dis- one another, they are phonologically distant. tributed language representations by training a re- current neural language model (Mikolov et al., 2 Related work 2010) simultaneously for different languages (Tsvetkov et al., 2016; Ostling¨ and Tiedemann, Computational linguistics approaches to typology 2017). In these recurrent multilingual lan- are now possible on a larger scale than ever before guage models with long short-term memory cells due to advances in neural computational models. (LSTM, Hochreiter and Schmidhuber, 1997), lan- Even so, recent work only deals with fragments of guages are embedded into an n-dimensional typology compared to what we consider here. space. In order for multilingual parameter shar- ing to be successful in this setting, the neural net- Computational typology has to a large extent work is encouraged to use the language embed- focused on exploiting word or morpheme align- dings to encode features of language. In this pa- ments on the massively parallel New Testament, per, we explore the embeddings trained by Ostling¨ in approximately 1,000 languages, in order to in- and Tiedemann(2017), both in their original state, fer word order (Ostling¨ , 2015) or assign linguis- and by further tuning them for our four tasks. tic categories (Asgari and Schutze¨ , 2017).W alchli¨ These are trained by training a multilingual lan- (2014) similarly extracts lexical and grammatical guage model with language representations on a markers using New Testament data. Other work collection of texts from the New Testament, cov- has taken a computational perspective on language ering 975 languages. While other work has looked evolution (Dunn et al., 2011), and phonology (Cot- at the types of representations encoded in different terell and Eisner, 2017; Alishahi et al., 2017). layers of deep neural models (Kad´ ar´ et al., 2017), Language embeddings In this paper, we follow we choose to look at the representations only in an approach which has seen some attention in the the bottom-most embedding layer. This is moti- past year, namely the use of distributed language vated by the fact that we look at several tasks using representations, or language embeddings. Some different neural architectures, and want to ensure typological experiments are carried out by Ostling¨ comparability between these. and Tiedemann(2017), who learn language em- 3.1.1 Language embeddings as continuous beddings via multilingual language modelling and Chomskyan parameter vectors show that they can be used to reconstruct ge- nealogical trees. Malaviya et al.(2017) learn lan- We now turn to the theoretical motivation of the guage embeddings via neural machine translation, language representations. The field of NLP is and predict syntactic, morphological, and phonetic littered with distributional word representations, features. which are theoretically justified by distributional semantics (Harris, 1954; Firth, 1957), summarised Contributions Our work bears the most resem- in the catchy phrase You shall know a word by the blance to Bjerva and Augenstein(2018), who fine- company it keeps (Firth, 1957). We argue that lan- tune language embeddings on the task of PoS tag- guage embeddings, or distributed representations ging, and investigate how a handful of typological of language, can also be theoretically motivated by properties are coded in these for four Uralic lan- Chomsky’s Principles and Parameters framework guages. We expand on this and thus contribute (Chomsky and Lasnik, 1993; Chomsky, 1993, to previous work by: (i) introducing novel qual- 2014). Language embeddings encode languages itative investigations of language embeddings, in as dense real-valued vectors, in which the dimen- addition to thorough quantitative evaluations; (ii) sions are reminiscent of the parameters found in considering four tasks at three different typologi- this framework. Briefly put, Chomsky argues that cal levels; (iii) considering a far larger sample of languages can be described in terms of principles 908 (abstract rules) and parameters (switches) which are typically found in several African languages can be turned either on or off for a given lan- as well as languages in South-East Asia. guage (Chomsky and Lasnik, 1993). An example Morphological features cover a total of 41 fea- of such a switch might represent the positioning tures. We consider the features included in the of the head of a clause (i.e. either head-initial or Morphology chapter as well as those included in head-final). For English, this switch would be set the Nominal Categories chapter to be morpholog- to the ‘initial’ state, whereas for Japanese it would ical in nature.3 This includes features such as the be set to the ‘final’ state. Each dimension in an n- number of genders, usage of definite and indefinite dimensional language embedding might also de- articles and reduplication. As an example, con- scribe such a switch, albeit in a more continuous sider WALS feature 37A (Definite Articles).4 This fashion. The number of dimensions
Recommended publications
  • Using 'North Wind and the Sun' Texts to Sample Phoneme Inventories
    Blowing in the wind: Using ‘North Wind and the Sun’ texts to sample phoneme inventories Louise Baird ARC Centre of Excellence for the Dynamics of Language, The Australian National University [email protected] Nicholas Evans ARC Centre of Excellence for the Dynamics of Language, The Australian National University [email protected] Simon J. Greenhill ARC Centre of Excellence for the Dynamics of Language, The Australian National University & Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History [email protected] Language documentation faces a persistent and pervasive problem: How much material is enough to represent a language fully? How much text would we need to sample the full phoneme inventory of a language? In the phonetic/phonemic domain, what proportion of the phoneme inventory can we expect to sample in a text of a given length? Answering these questions in a quantifiable way is tricky, but asking them is necessary. The cumulative col- lection of Illustrative Texts published in the Illustration series in this journal over more than four decades (mostly renditions of the ‘North Wind and the Sun’) gives us an ideal dataset for pursuing these questions. Here we investigate a tractable subset of the above questions, namely: What proportion of a language’s phoneme inventory do these texts enable us to recover, in the minimal sense of having at least one allophone of each phoneme? We find that, even with this low bar, only three languages (Modern Greek, Shipibo and the Treger dialect of Breton) attest all phonemes in these texts.
    [Show full text]
  • Applying Phonology in Lexicography: Variant-Synonym Classification In
    Applying phonology in lexicography: variant-synonym classification in Czech Sign Language Hana Strachoňová, Masaryk University, Brno, Czech Republic, [email protected] Lucia Vlášková, Masaryk University, Brno, Czech Republic, [email protected] Sign language (SL) lexicography, as a young field of study within SL linguistics, faces a lot of challenges that have already been answered for the audio-oral language material. In this talk we present a method that is being applied in the ongoing process of data classification for the first Czech SL online dictionary (part of the platform Dictio). Problem: For audio-oral languages, a dictionary entry standardly contains the citation form of a lexeme and all the variants (Čermák 1995). See for example the gender variants in Czech: brambor (potato- masculine) / brambor-a (potato-feminine), hadr (cloth-masculine) / hadr-a (cloth-feminine). However, two (or more) expressions of a different word-forming nature are not considered variants but synonyms (Filipec 1995). See the example pairs in Czech - the first expression comes from the traditional Czech lexicon and the latter originates in English/Latin: jazykověda (linguistics; Czech origin) / lingvistika (linguistics; foreign origin), poradce (consultant; Czech origin) / konzultant (consultant; foreign origin). The common ground of the variant- and synonym-pairs is their shared meaning (brambor has the same meaning as brambora; jazykověda has the same meaning as lingvistika) and in the lexicographic work it is essential to assign each of them the right place in the dictionary entry. What seems as a simple task for spoken languages (basically - common root for variants, different roots for synonyms) becomes a challenge for SLs (still ongoing discussion about the definition of morphemes and lexical roots; see e.g.
    [Show full text]
  • Language Development Language Development
    Language Development rom their very first cries, human beings communicate with the world around them. Infants communicate through sounds (crying and cooing) and through body lan- guage (pointing and other gestures). However, sometime between 8 and 18 months Fof age, a major developmental milestone occurs when infants begin to use words to speak. Words are symbolic representations; that is, when a child says “table,” we understand that the word represents the object. Language can be defined as a system of symbols that is used to communicate. Although language is used to communicate with others, we may also talk to ourselves and use words in our thinking. The words we use can influence the way we think about and understand our experiences. After defining some basic aspects of language that we use throughout the chapter, we describe some of the theories that are used to explain the amazing process by which we Language9 A system of understand and produce language. We then look at the brain’s role in processing and pro- symbols that is used to ducing language. After a description of the stages of language development—from a baby’s communicate with others or first cries through the slang used by teenagers—we look at the topic of bilingualism. We in our thinking. examine how learning to speak more than one language affects a child’s language develop- ment and how our educational system is trying to accommodate the increasing number of bilingual children in the classroom. Finally, we end the chapter with information about disorders that can interfere with children’s language development.
    [Show full text]
  • Phonetic and Phonological Research Sharing Methods
    The Kabod Volume 3 Issue 3 Summer Article 1 January 2017 Phonetic and Phonological Research Sharing Methods Cory C. Coogan Liberty University, [email protected] Follow this and additional works at: https://digitalcommons.liberty.edu/kabod Part of the Modern Languages Commons, and the Reading and Language Commons Recommended Citations MLA: Coogan, Cory C. "Phonetic and Phonological Research Sharing Methods," The Kabod 3. 3 (2017) Article 1. Liberty University Digital Commons. Web. [xx Month xxxx]. APA: Coogan, Cory C. (2017) "Phonetic and Phonological Research Sharing Methods" The Kabod 3( 3 (2017)), Article 1. Retrieved from https://digitalcommons.liberty.edu/kabod/vol3/iss3/1 Turabian: Coogan, Cory C. "Phonetic and Phonological Research Sharing Methods" The Kabod 3 , no. 3 2017 (2017) Accessed [Month x, xxxx]. Liberty University Digital Commons. This Individual Article is brought to you for free and open access by Scholars Crossing. It has been accepted for inclusion in The Kabod by an authorized editor of Scholars Crossing. For more information, please contact [email protected]. Coogan: Phonetic and Phonological Research Sharing Methods Running Head: PHONETIC AND PHONOLOGICAL RESEARCH SHARING METHODS 1 Phonetic and Phonological Research Sharing Methods Cory Coogan Liberty University Published by Scholars Crossing, 2017 1 The Kabod, Vol. 3, Iss. 3 [2017], Art. 1 PHONETIC AND PHONOLOGICAL RESEARCH SHARING METHODS 2 Phonetic and Phonological Research Sharing Methods Most linguists affirm the observation that human language is innate; the human mind has a capacity for grammar that is inherent from birth. This notion implies that a singular grammar produces all human languages; therefore, to appropriately understand the scope of the human capacity for grammar, a single model must cohesively describe the various processes of all human languages.
    [Show full text]
  • 24.900 Intro to Linguistics Lecture Notes: Phonology Summary
    Fall 2012 Phonology Summary • Speakers of English learned something from the data they were presented with as (contains all examples from class slides, and more!) babies that caused them to internalize (learn) the rule exemplified in (1) — just as Tojolabal speakers learned from the data that they heard as babies, and ended up with 1. Phonology vs. phonetics the rule exemplified in (2). • The path from memory (lexical access) to speech is mediated by phonology. • The English rule is real and active "on-line", governing creative linguistic behavior. A • Phonology = system of rules that apply when speech sounds are put together to form native speaker of English will apply it to new words they have never heard. The /t/ in morphemes and words. tib will be aspirated, and the /t/ in stib (both nonsense words, I hope) will not be. Probably Tojolabal speakers will show similar behavior with respect to their rule. (1) stop consonant aspiration in English: initial within a stressed syllable ASPIRATED UNASPIRATED 2. What phonetic distinctions are made in lexical entries? initial within a stressed syllable after s or initial within an word-final (therefore syllable- Part 1: phonological rules that eliminate distinctions from the lexicon unstressed syllable final) pan span nap tone stone note • English: lexicon does not need to distinguish aspirated from unaspirated stops. kin skin nick There is no reason to suppose that information about aspiration forms part of the sound field of lexical entries of English words, since it is entirely predictable. Though pan is upon supping pronounced /pʰæn/ and span /spæn/, there is no reason to distinguish the aspirated and unaspirated bilabial stops in the lexical entries of the two words.
    [Show full text]
  • Phonological Typology, Rhythm Types and the Phonetics-Phonology Interface
    Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch Year: 2012 Phonological typology, rhythm types and the phonetics-phonology interface. A methodological overview and three case studies on Italo-Romance dialects Schmid, Stephan Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-73782 Book Section Published Version Originally published at: Schmid, Stephan (2012). Phonological typology, rhythm types and the phonetics-phonology interface. A methodological overview and three case studies on Italo-Romance dialects. In: Ender, Andrea; Leemann, Adrian; Wälchli, Bernhard. Methods in contemporary linguistics. Berlin: de Gruyter Mouton, 45-68. Phonological typology, rhythm types and the phonetics-phonology interface. A methodological overview and three case studies on Italo- Romance dialects Stephan Schmid 1. Introduction Phonological typology has mainly concentrated on phoneme inventories and on implicational universals, whereas the notion of ‘language type’ ap- pears to be less appealing from a phonological perspective. An interesting candidate for establishing language types on the grounds of phonological or phonetic criteria would have come from the dichotomy of ‘stress-timing’ vs. ‘syllable-timing’, if instrumental research carried out by a number of phoneticians had not invalidated the fundamental claim of the so-called ‘isochrony hypothesis’. Nevertheless, the idea of classifying languages according to their rhythmic properties has continued to inspire linguists and phoneticians, giving rise to two diverging methodological perspectives. The focus of the first framework mainly lies on how phonological processes relate to prosodic domains, in particular to the syllable and to the phonolog- ical word.
    [Show full text]
  • Introduction to Phonology École D’Automne De Linguistique, ENS
    Introduction to Phonology École d’automne de linguistique, ENS Class coordinates Time : 14:30-15:50 (Session 4), Sept. 24, 25, 26, 27 (Monday to Thursday) Place : Salle des Résistants (45 rue d’Ulm, 1er étage, couloir A-B) Instructor coordinates Name : Kie Zuraw [ ka z ] Affiliation : UCLA (University of California, Los Angeles) Department of Linguistics E-mail : [email protected] Web page : www.linguistics.ucla.edu/people/zuraw Course description What do we know about our language’s sound pattern, and how do we know it? This course will begin with a quick overview of characteristics of sound patterns that linguists have noticed (alternations and phonotactics), and of the approach to explanatory adequacy that will be adopted here. We will then look at research that has sought to determine what phonological generalizations speakers extract from the learning data, and follow the consequences of these findings for achieving a descriptively adequate grammatical framework (that is, a framework that can express speakers’ implicit phonological knowledge): basic rule notation, features, and constraint interaction. Next we will consider why determining what speakers know is so difficult, and review a range of methods that have been tried. Finally, we will examine some recent work that moves towards explanatory adequacy—what kind of learner can, on exposure to typical learning data, choose a grammar similar to the one that human learners choose? Prerequisites : None! Course outline Day 1: 24 September sound patterns conceptual framework Day 2: 25 September descriptive adequacy: methods and consequences Day 3: 26 September explanatory adequacy: methods Day 4: 27 September explanatory adequacy: theoretical developments Suggestions for further reading are included at the ends of the first two handouts Language : In accordance with EALing policy, I’ll lecture in English, but feel free to make comments or pose questions in French, to ask me to try to express something into French if it’s not clear in English, to talk to me after class in French..
    [Show full text]
  • The Meaning of Language
    The Meaning of Language The Meaning of Language Edited by Hans Götzsche The Meaning of Language Edited by Hans Götzsche This book first published 2018 Cambridge Scholars Publishing Lady Stephenson Library, Newcastle upon Tyne, NE6 2PA, UK British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library Copyright © 2018 by Hans Götzsche and contributors All rights for this book reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner. ISBN (10): 1-5275-0542-1 ISBN (13): 978-1-5275-0542-1 TABLE OF CONTENTS Preface ...................................................................................................... vii Chapter One ................................................................................................ 1 The Meanings of Life Hans Götzsche Chapter Two ............................................................................................. 17 The Discovery of Danish Phonology and Prosodic Morphology: From the Third University Caretaker Jens P. Høysgaard (1743) to the 19th Century Hans Basbøll Chapter Three ........................................................................................... 46 What Do Languages Refrain from Copying Morphology? On Structural Obstacles to Morphological Borrowing Stig Eliasson Chapter Four ............................................................................................
    [Show full text]
  • DOCTORAL DISSERTATION Suprasegmental Sound Changes In
    DOCTORAL DISSERTATION Suprasegmental sound changes in the Scandinavian languages Áron Tési 2017 Eötvös Loránd University of Sciences Faculty of Humanities DOCTORAL DISSERTATION Áron Tési SUPRASEGMENTAL SOUND CHANGES IN THE SCANDINAVIAN LANGUAGES SZUPRASZEGMENTÁLIS HANGVÁLTOZÁSOK A SKANDINÁV NYELVEKBEN Doctoral School of Linguistics Head: Dr. Gábor Tolcsvai Nagy MHAS Doctoral programme in Germanic Linguistics Head: Dr. Károly Manherz CSc Members of the thesis committee Dr. Károly Manherz CSc (chairman) Dr. Roland Nagy PhD (secretary) Dr. Valéria Molnár PhD (officially appointed opponent) Dr. Ildikó Vaskó PhD (officially appointed opponent) Dr. László Komlósi CSc (member) Further members Dr. Péter Siptár DSc Dr. Miklós Törkenczy DSc Supervisor Dr. Péter Ács CSc Budapest, 2017 Table of contents List of abbreviations ................................................................................................................... 0 Foreword .................................................................................................................................... 1 1. Theoretical considerations .................................................................................................. 2 1.1. Some notes on sound change ....................................................................................... 3 1.2. The problem of teleology ............................................................................................ 5 1.2.1. A philosophical overview ....................................................................................
    [Show full text]
  • Fieldwork for Studies of Phonological Variation
    Fieldwork for studies of phonological variation Paul Foulkes University of York, UK E-mail: [email protected] ABSTRACT 3. SOCIAL PARAMETERS This session discusses fieldwork strategies developed It is not usually possible to collect data from all members mainly for dialectological and sociolinguistic research on of a community. A subset must therefore be drawn, and varieties of English. Issues in sampling the community must include representatives of those categories of speaker and the language are explored. Adaptations of traditional relevant to the research questions. In dialectology a main methods for work with children are also outlined. Finally, research goal has been the recording of conservative a brief assessment is made of the implications of these forms. Fieldwork has therefore sought to recruit subjects types of fieldwork for linguistic phonetics. who might be the best guardians of older variants. In the case of the SED, the majority of such subjects were 1. INTRODUCTION NORMs (non-mobile older rural males). Other speaker groups were underrepresented [18]. By contrast, Linguistic fieldwork is a means to an end. It is undertaken sociolinguists have primarily been interested in variation after prior definition of the aims of linguistic research. within urban communities. Fieldwork designs have thus These aims may be purely descriptive, as in the case of included speakers of both sexes, and a range of ages, recording endangered languages. More often, however, social backgrounds, and/or ethnic groups. descriptions of linguistic phenomena also serve a wider All social parameters require careful definition and purpose: they provide evidence for theoretical debate (e.g. interpretation, often with respect to the specific in phonology or sociolinguistics), and/or yield reference community of interest [17].
    [Show full text]
  • Phonetics in Phonology
    Phonetics in Phonology John J. Ohala University of California, Berkeley At least since Trubetzkoy (1933, 1939) many have thought of phonology and phonetics as separate, largely autonomous, disciplines with distinct goals and distinct methodologies. Some linguists even seem to doubt whether phonetics is properly part of linguistics at all (Sommerstein 1977:1). The commonly encountered expression ‘the interface between phonology and phonetics’ implies that the two domains are largely separate and interact only at specific, proscribed points (Ohala 1990a). In this paper I will attempt to make the case that phonetics is one of the essential areas of study for phonology. Without phonetics, I would maintain, (and allied empirical disciplines such as psycholinguistics and sociolinguistics) phonology runs the risk of being a sterile, purely descriptive and taxonomic, discipline; with phonetics it can achieve a high level of explanation and prediction as well as finding applications in areas such as language teaching, communication disorders, and speech technology (Ohala 1991). 1. Introduction The central task within phonology (as well as in speech technology, etc.) is to explain the variability and the patterning -- the “behavior” -- of speech sounds. What are regarded as functionally the ‘same’ units, whether word, syllable, or phoneme, show considerable physical variation depending on context and style of speaking, not to mention speaker-specific factors. Documenting and explaining this variation constitutes a major challenge. Variability is evident in several domains: in everyday speech where the same word shows different phonetic shapes in different contexts, e.g., the release of the /t/ in tea has more noise than that in toe when spoken in isolation.
    [Show full text]
  • Phonetic Documentation in Three Collections: Topics and Evolution
    Phonetic documentation in three collections: Topics and evolution D. H. Whalen City University of New York (also Haskins Laboratories and Yale University) [email protected] Christian DiCanio University at Buffalo [email protected] Rikker Dockum Swarthmore College [email protected] Phonetic aspects of many languages have been documented, though the breadth and focus of such documentation varies substantially. In this survey, phonetic aspects (here called ‘categories’) that are typically reported were assessed in three English-language collections – the Illustrations of the IPA from the Journal of the International Phonetic Association, articles from the Journal of Phonetics, and papers from the Ladefoged/Maddieson Sounds of the World’s Languages (SOWL) documentation project. Categories were defined for consonants (e.g. Voice Onset Time (VOT) and frication spec- trum; 10 in total), vowels (e.g. formants and duration; 7 in total) and suprasegmentals (e.g. stress and distinctive vowel length, 6 in total). The Illustrations, due to their brevity, had, on average, limited coverage of the selected categories (12% of the 23 categories). Journal of Phonetics articles were typically theoretically motivated, but 64 had sufficient measurements to count as phonetic documentation; these also covered 12% of the cate- gories. The SOWL studies, designed to cover as much of the phonetic structure as feasible in an article-length treatment, achieved 41% coverage on average. Four book-length stud- ies were also examined, with an average of 49% coverage. Phonetic properties of many language families have been studied, though Indo-European is still disproportionately rep- resented. Physiological measures were excluded as being less common, and perceptual measures were excluded as being typically more theoretical.
    [Show full text]