C⃝copyright 2012 Steven Paul Moran

⃝c Copyright 2012 Steven Paul Moran Phonetics Information Base and Lexicon Steven Paul Moran A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2012 Reading Committee: Emily M. Bender, Chair Richard Wright, Chair Scott Farrar Sharon Hargus Program Authorized to Offer Degree: Department of Linguistics University of Washington Abstract Phonetics Information Base and Lexicon Steven Paul Moran Co-Chairs of the Supervisory Committee: Associate Professor Emily M. Bender Department of Linguistics Associate Professor Richard Wright Department of Linguistics In this dissertation, I investigate the linguistic and technological challenges involved in cre- ating a cross-linguistic data set to undertake phonological typology. I then address the question of whether more sophisticated, knowledge-based approaches to data modeling, coupled with a broad cross-linguistic data set, can extend previous typological observations and provide new ways of querying segment inventories. The model that I implement facili- tates testing typological observations by aligning data models to questions that typologists wish to ask. The technological infrastructure that I create is conducive to data sharing, extensibility and reproducibility of results. I use the data set and data models in this work to validate and extend previous typological observations. In doing so, I revisit the typological facts proposed in the linguistics literature about the size, shape and composition of segment inventories in the world’s languages and find that they remain similar even with a much larger sample of languages. I also show that as the number of segment inventories increases, the number of distinct segments also continues to increase. And when vowel systems grow beyond the basic cardinal vowels, they do so first by length and nasalization, and then diphthongization. Moving beyond segments, I show that distinctive feature sets in general lack the typological representation needed to straightforwardly map sets of features to the segment types found in a broad set of language descriptions. Therefore, I extend a distinctive feature set, devise a method to computationally encode features by combining feature vectors and assigning them to segment types, and create a system in which users can query by feature, by sets of features that define natural classes, or by omitting features in queries to utilize the underspecification of segments. I use this system and reinvestigate proposed descriptive universals about phonological systems and find that some, but not all universals hold up to the more rigorous testing made possible with this larger data set and a graph data model. Lastly, I reevaluate one of the many purported correlations between a non-linguistic factor and language: the claim that there exists a relationship between population size and phoneme inventory size. I show that this finding is actually an artifact of a small data set, which constrains the use of more nuanced statistical approaches that can control for the genealogical relatedness of languages. Thus, in this work I illustrate how researchers can leverage the data set and data models that I have implemented to investigate different aspects of languages’ phonological systems, including the possible impact of non-linguistic factors on phonology. TABLE OF CONTENTS Page List of Figures . ii List of Tables . iii Chapter 1: Introduction . 1 Chapter 2: Background . 7 2.1 Conventions and terminology . 7 2.1.1 Conventions . 7 2.1.2 Linguistic terminology for phonology . 7 2.1.3 Linguistic terminology for writing systems . 10 2.1.4 Technological terminology . 12 2.1.5 Abbreviations . 23 2.2 Linguistic theories . 23 2.2.1 Segmental phonology . 24 2.2.2 Distinctive feature theory . 27 2.2.3 Summary . 31 2.3 Challenges . 32 2.3.1 Typological databases . 33 2.3.2 Sampling . 43 2.3.3 Data and analysis . 49 2.3.4 Segments . 54 2.3.5 Standardization . 60 2.3.6 Data provenance . 70 2.3.7 Summary . 74 Chapter 3: Data Modeling . 76 3.1 Data modeling basics . 76 3.1.1 Table . 76 i 3.1.2 Database . 80 3.1.3 Graph . 84 3.2 PHOIBLE data models . 94 3.2.1 Relational database . 97 3.2.2 Data warehouse flat file tables . 117 3.2.3 RDF graph . 137 3.2.4 Summary . 152 3.3 Knowledge representation . 152 3.3.1 Representing knowledge . 153 3.3.2 Description Logics . 155 3.3.3 Ontology . 160 3.3.4 Knowledge base . 164 3.3.5 Summary . 167 3.4 Conclusion . 167 Chapter 4: PHOIBLE . 168 4.1 Introduction . 168 4.2 Motivation . 169 4.3 Extract, transform, load . 171 4.3.1 SPA . 173 4.3.2 UPSID . 183 4.3.3 Alphabets of Africa . 190 4.3.4 PHOIBLE inventories . 195 4.3.5 Summary . 197 4.4 Genealogical coverage . 198 4.5 Summary . 201 Chapter 5: Segments and Inventories . 202 5.1 Introduction . 202 5.2 Background . 203 5.3 Distribution of segments . 205 5.4 Segment inventories . 219 5.4.1 Inventory size . 220 5.4.2 Consonants . 226 5.4.3 Vowels . 229 5.4.4 Tone . 234 ii 5.4.5 Summary . 235 5.5 Consonants and vowels . 235 5.6 Implications in vowels systems . 239 5.7 Conclusion . 250 Chapter 6: Distinctive Features . 252 6.1 Introduction . 252 6.2 Background . 253 6.3 Typological coverage . 256 6.4 Challenges and implementation . 264 6.5 Investigating universals in phonological systems . 278 6.6 Conclusion . 285 Chapter 7: Case Study: Phoneme Inventory Size and Population Size . 287 7.1 Introduction . 287 7.2 Previous studies . 289 7.3 Hay & Bauer . 294 7.4 Materials and analysis . 300 7.5 Discussion . 313 7.6 Conclusion . 317 Chapter 8: Conclusion . 320 8.1 Summary . 320 8.2 Contributions . 322 8.3 Where are the lexicons? . 326 8.4 Future work . 333 8.4.1 Information theoretic approaches to phonology . 333 8.4.2 Complexity . 334 8.4.3 Feature-based principles in phonological inventories . 335 8.4.4 Correlation studies . 336 8.4.5 Tackling provenance . 337 8.4.6 Linked Open Data . 337 Appendix A: Genealogical coverage of segment inventories in PHOIBLE . 475 Appendix B: List of languages, ISO 639-3 codes and their sources in PHOIBLE . 480 iii Appendix C: PHOIBLE segment conventions . ..

C⃝copyright 2012 Steven Paul Moran

A First Comparison of Pronominal and Demonstrative Systems in the Cariban Language Family*

Plural Words in Austronesian Languages: Typology and History

LCSH Section K

Replacive Grammatical Tone in the Dogon Languages

Dagbani Tongue-Root Harmony: a Formal Account with Ultrasound Investigation

John P. Harrington Papers 1907-1959

Fieldwork and Linguistic Analysis in Indigenous Languages of the Americas

5 Phonology Florian Lionnet and Larry M

Synchronic and Diachronic Aspects of Valency-Reducing Devices in Oceanic Languages

PDF Hosted at the Radboud Repository of the Radboud University Nijmegen

Phonetic Characters, Fonts, and Editors Data and Other Documents Here

Language Typology and Syntactic Description