Two Statistical Approaches to Finding Vowel Harmony Adam C. Baker University of Chicago
[email protected] July 27, 2009 Abstract The present study examines two methods for learning and modeling vowel harmony from text corpora. The first uses Expectation Maximization with Hidden Markov Mod- els to find the most probable HMM for a training corpus. The second uses pointwise Mutual Information between distant vowels in a Boltzmann distribution, along with the Minimal Description Length principle to find and model vowel harmony. Both methods correctly detect vowel harmony in Finnish and Turkish, and correctly recognize that English and Italian have no vowel harmony. HMMs easily model the transparent neu- tral vowels in Finnish vowel harmony, but have difficulty modeling secondary rounding harmony in Turkish. The Boltzmann model correctly captures secondary roundness harmony and the opacity of low vowels in roundness harmony in Turkish, but has more trouble capturing the transparency of neutral vowels in Finnish. 1 Introduction: Vowel Harmony and Unsupervised Learning The problem of child language acquisition has been a driving concern of modern theoret- ical linguistics since at least the 1950s. But traditional linguistic research has focused on constructing models like the Optimality Theory model of phonology (Prince and Smolensky, 2004). Learning models for those theories like the OT learning theory in Riggle (2004) are 1 sought later to explain how a child could learn a grammar. A lot of recent research in computational linguistics has turned this trend around. Break- throughs in artificial intelligence and speech processing in the '80s has lead many phonologists starting from models with well-known learning algorithms and using those models to extract phonological analyses from naturally occurring speech (for an overview and implications, see Seidenberg, 1997).