
Unsupervised Language Acquisition: Theory and Practice

Alexander Simon Clark

Submitted for the degree of D.Phil.

University of Sussex

September, 2001

Declaration

I hereby declare that this thesis has not been submitted, either in the same or different form, to this or any other university for a degree.

Signature:

Acknowledgements

First, I would like to thank Bill Keller for his supervision over the past three years. I would like to thank my colleagues at ISSCO for making me welcome, and especially Susan Armstrong for giving me the job in the first place, and various people for helpful comments and discussions, including Chris Manning, Dan Klein, Eric Gaussier, Nicola Cancedda, Franck Thollard, Andrei Popescu-Beilis, Menno van Zaanen and numerous others, including Sonia Halimi for checking my Arabic. I would also like to thank Gerald Gazdar and the other members of my thesis committee for their helpful comments. I would also like to thank all of the people that have worked on the various software packages and operating systems that I have used, such as LaTeX, Gnu/Linux, and the gcc compiler, as well as the people who have worked on the preparation of the various corpora I have used: in particular I am grateful to John McCarthy and Ramin Nakisa for allowing me to use their painstakingly collected data sets for Arabic. Finally I would like to thank my wife, Dr Clare Hornsby, for her unfailing support and encouragement (in spite of her complete lack of understanding of the subject) and my children, Lily and Francis, for putting up with my absences both physical and mental, and showing me how it is really done – over-regularisation of past tense verbs included.

Unsupervised Language Acquisition: Theory and Practice

Alexander Simon Clark

Abstract

In this thesis I present various algorithms for the unsupervised machine learning of aspects of natural languages using a variety of statistical models. The scientific object of the work is to examine the validity of the so-called Argument from the Poverty of the Stimulus, advanced in favour of the proposition that humans have language-specific innate knowledge. I start by examining an a priori argument based on Gold's theorem, that purports to prove that natural languages cannot be learned, and some formal issues related to the choice of statistical rather than symbolic grammars. I present three novel algorithms for learning various parts of natural languages: first, an algorithm for the induction of syntactic categories from unlabelled text using distributional information, that can deal with ambiguous and rare words; secondly, a set of algorithms for learning morphological processes in a variety of languages, including languages such as Arabic with non-concatenative morphology; thirdly, an algorithm for the unsupervised induction of a context-free grammar from tagged text. I carefully examine the interaction between the various components, and show how these algorithms can form the basis for an empiricist model of language acquisition. I therefore conclude that the Argument from the Poverty of the Stimulus is unsupported by the evidence.

Submitted for the degree of D.Phil.

University of Sussex

September, 2001

Contents

1 Introduction
1.1 The Scientific Question
1.2 Basic assumptions
1.3 Outline of the thesis
1.4 Prerequisites
1.5 Literature review
1.6 Bibliographic note
1.7 Acknowledgments

2 Nativism and the Argument from the Poverty of the Stimulus
2.1 Introduction
2.2 Nativism
2.3 The Argument from the Poverty of the Stimulus
2.4 A learning algorithm as a refutation of the APS
2.4.1 Data Limitations
2.4.2 Absence of Domain-Specific information
2.4.3 Absence of Language Specific Assumptions
2.4.4 Plausibility of output
2.4.5 Excluded sources of information
2.5 Possible Counter-arguments
2.5.1 Inadequate data
2.5.2 Incomplete learning
2.5.3 Undesirable Properties of Algorithm
2.5.4 Incompatible with psychological evidence
2.5.5 Incompatible with Developmental Evidence
2.5.6 Storage of entire data set
2.5.7 Argument from truth of the Innateness Hypothesis
2.5.8 Argument from the inadequacy of empiricist models of language acquisition
2.5.9 Use of tree-structured representations
2.5.10 Not a general purpose learning algorithm
2.5.11 Fails to learn deep structures
2.5.12 Fails to guarantee convergence
2.5.13 Doesn't use semantic evidence
2.5.14 Fails to show that children actually use these techniques

3 Language Learning Algorithms
3.1 Introduction
3.2 Machine Learning of Natural Languages
3.3 Supervised and Unsupervised Learning
3.4 Statistical Learning
3.5 Neural Networks
3.6 Maximum Likelihood estimation
3.7 The Expectation-Maximisation Theorem
3.8 Bayesian techniques
3.9 Evaluation
3.10 Overall architecture of the language learner
3.10.1 Categorisation
3.10.2 Morphology
3.10.3 Syntax
3.10.4 Interactions between the models
3.11 Areas of language not discussed in this thesis
3.11.1 Acoustic processing
3.11.2 Phonology
3.11.3 Segmentation
3.11.4 Semantics and Discourse

4 Formal Issues
4.1 Introduction
4.2 The A Priori APS
4.3 Learnability in Formal language theory
4.3.1 Gold
4.4 Measure-one Learnability
4.4.1 Horning
4.5 Extension to Ergodic Processes
4.6 Arguments from Gold's Theorem
4.7 Statistical Methods in
4.8 Formal Unity of Statistical Grammars and Algebraic grammars

5 Syntactic Category Induction
5.1 Introduction
5.2 Syntactic Categories
5.2.1 Necessity of Syntactic Categories
5.2.2 Learning categories
5.3 Previous Work
5.3.1 Harris and Lamb
5.3.2 Finch and Chater
5.3.3 Schütze
5.3.4 Brown et al.

5.3.5 Pereira, Tishby and Lee
5.3.6 Dagan et al.
5.3.7 Li and Abe
5.3.8 Brent
5.4 Context Distributions
5.4.1 Clustering
5.4.2 Algorithm Description
5.5 Ambiguous Words
5.6 Rare words
5.7 Results
5.8 Discussion
5.8.1 Limitations
5.8.2 Independence assumptions
5.8.3 Similarity with Hidden Markov Models
5.8.4 Hierarchical clusters
5.8.5 Use of and morphology
5.8.6 Multi-dimensional classes

6 Morphology Acquisition
6.1 Introduction
6.2 Computational Morphology
6.3 Computational Models of Learning Morphology
6.3.1 Supervised
6.3.2 Unsupervised Learning
6.3.3 Partially supervised
6.3.4 Desiderata
6.4 Pair Hidden Markov Models
6.4.1 Definition
6.4.2 Examples
6.4.3 Algorithms
6.4.4 Finding the most likely output
6.4.5 Sampling from the Conditional Distribution
6.4.6 Complexity of the algorithms
6.4.7 Initial Experiments
6.5 Discussion of Pair Hidden Markov Models
6.5.1 Previous Work
6.5.2 Joint versus Conditional
6.5.3 Extension to Context-Free Transductions
6.5.4 Other applications
6.6 Mixture Models
6.6.1 Modelling Morphological classes
6.6.2 Splitting the data into various classes
6.6.3 Smoothing Models

6.7 Experiments on Supervised Learning
6.7.1 English Past Tense
6.7.2 German Plural
6.7.3 Arabic Plural
6.8 Partially Supervised Learning
6.8.1 Perfect Partially Supervised Learning
6.8.2 Imperfect Partially Supervised Learning
6.9 Experiments on Partially Supervised Learning
6.9.1 Ling data set
6.9.2 Partially Supervised Arabic
6.9.3 Experiments on automatically induced classes
6.10 Applications and Extensions
6.10.1 Number of Channels
6.10.2 Deep stems
6.10.3 Complete paradigms
6.11 Discussion
6.11.1 Comparison with other learning approaches
6.11.2 Single versus Dual-route
6.11.3 Conclusion

7 Syntax Acquisition
7.1 Introduction
7.2 Previous Work
7.2.1 Likelihood-based
7.2.2 Compression based
7.2.3 Distribution based
7.3 Inside-Outside algorithm
7.3.1 Objective Function
7.3.2 Search Algorithm
7.4 Distributional Clustering
7.5 Mutual Information Criterion for Constituents
7.6 Mathematical Justification for the Criterion
7.7 Empirical Justification for the Criterion
7.8 Comparison to other criteria
7.8.1 Palindrome language
7.9 Complete System
7.10 Evaluation
7.11 Experiments with automatically derived tags
7.12 Conclusion
7.12.1 Discussion
7.12.2 Future Work
7.12.3 Conclusion

8 Conclusion
8.1 Summary
8.2 Previous work
8.2.1 Empiricist work
8.2.2 Nativist work
8.3 A Refutation of the Argument from the Poverty of the Stimulus?
8.3.1 Criticisms
8.3.2 Excuses
8.4 Cognitive Plausibility
8.5 Conclusion

Bibliography

Index

A CLAWS-5 tag set

B Syntactic rules

List of Figures

5.1 Three prior cluster distributions
5.2 Some ambiguous words
5.3 Some ambiguous words
5.4 A rare word, showing how the cluster membership probabilities change after each context
5.5 Another rare word, showing how the cluster membership probabilities change after each context
5.6 Log-log plots of word frequencies for various languages

6.1 Example of a PHMM that adds an s to the end of any non-empty string
6.2 Example of a PHMM that changes ring to rang
6.3 Example of a PHMM that changes a word with template CaCC to CuCuuC
6.4 Dynamic programming trellis showing how the pair cat, cats is generated
6.5 A simple PHMM where the most likely state sequence does not give the most likely output
6.6 Log probability of 10 runs on the non-deterministic data set
6.7 Diagram of a triple that concatenates two strings together to make a third

7.1 Sentence where the inside-outside algorithm performs badly
7.2 Sentence where the inside-outside algorithm performs well
7.3 Graph of expected MI against distance
7.4 Graph of expected MI against actual MI

List of Tables

2.1 Comparison of Nativist and Empiricist Theories of Language Acquisition

3.1 Comparative evaluation figures from Pereira and Schabes (1992)

5.1 Various clusters of orthographic marks
5.2 Closed class words
5.3 Open class words
5.4 Erroneous classes
5.5 Ambiguous words. For each word, the clusters that have the highest α are shown, if α > 0.01
5.6 Accuracy of classification of rare words with tags NN1 (common noun) and AJ0 (adjective)
5.7 Perplexities of class tri-gram models on 4 test sets of 100,000 words, together with geometric mean

6.1 Training data for Experiment 1
6.2 Results for Experiment 1 with 4 state model
6.3 Training data for Experiment 2: non-locally testable data set
6.4 p(v|u) of correct answer for varying sizes of model, for the non-locally testable data set
6.5 The data set for testing whether a non-deterministic transduction is learnable
6.6 Summary of the Ling English past tense data set, and the additional BNC test set
6.7 Classes derived from the English past tense
6.8 Mild productivity of a strong verb class
6.9 Results on the English past tense data set
6.10 Probabilities of the different suffixes for aid / ed
6.11 p(u) for the three regular sub-models, evaluated for four test words
6.12 First ten and last ten classes induced from German plural data set
6.13 10 pairs of strings randomly generated from model 7 of the German plural model
6.14 Classes with five or more words produced by the algorithm on Plunkett and Nakisa data set
6.15 10 pairs of words randomly generated from model 7 from the Arabic plural model
6.16 Results of alignment on unaligned Ling data set
6.17 Performance on regular verbs
6.18 Incorrectly aligned pairs on English data set
6.19 Result of first iteration of alignment on Arabic data set. Precision was 70.8%.
6.20 First ten incorrectly aligned singular-plural pairs from Arabic data set
6.21 Randomly selected portion of the data
6.22 Results of the alignment on the two induced classes. Exponent of 0.7, and cosine threshold of 0.9. This gives a precision of 89% and a recall of 95%.

7.1 Results of evaluation of inside-outside algorithm on tagged ATIS corpus
7.2 20 most frequent rules with two terminals
7.3 Some of the more frequent sequences in two good clusters
7.4 Some of the sequences in two bad clusters
7.5 Four valid clusters where the actual MI is greater than the expected MI, and four invalid clusters which fail the test
7.6 Comparisons of different criteria, when applied to a 10,000 string sample from the odd palindrome language
7.7 Outline of the algorithm
7.8 Non-terminals produced during first 20 iterations of the algorithm
7.9 Ten most frequent rules expanding NP. Note that three of them are recursive
7.10 Ten most frequent rules expanding S
7.11 Results of evaluation on ATIS corpus
7.12 First 20 non-terminals produced by the algorithm operating on CDC tags
7.13 Ten most frequent rules expanding NP, together with the number of times each was applied
7.14 All rules expanding VP, together with the number of times each was applied

It is an established opinion amongst some men, that there are in the understanding certain innate principles; some primary notions, . . . characters, as it were stamped upon the mind of man; which the soul receives in its very first being, and brings into the world with it. It would be sufficient to convince unprejudiced readers of the falseness of this supposition, if I should only show (as I hope I shall in the following parts of this Discourse) how men, barely by the use of their natural faculties, may attain to all the knowledge they have, without the help of any innate impressions; and may arrive at certainty, without any such original notions or principles.

Locke (1690, Book 1, Chapter i)

Rule 1: No more causes of natural things should be admitted than are both true and sufficient to explain their phenomena. As the philosophers say: Nature does nothing in vain, and more causes are in vain when fewer suffice. For nature is simple and does not indulge in the luxury of superfluous causes.

Newton (1687, Rules for the Study of Natural Philosophy)

Chapter 1

Introduction

1.1 The Scientific Question

Linguistic nativism or the innateness hypothesis is the claim, advanced by Chomsky (1986) and Pinker (1994) amongst others, that human beings are endowed with an innately, presumably genetically, specified domain specific knowledge of language. This knowledge is tacit, that is to say not accessible to conscious thought, and it specifies in some detail the nature of possible human languages, including a set of syntactic categories, a set of possible phrase structure rules, constraints on admissible transformations and so on. The primary argument for this bold hypothesis is the so-called Argument from the Poverty of the Stimulus (APS), that the linguistic input or evidence available to the infant child is so impoverished and degenerate that no general, domain-independent learning algorithm could possibly learn a plausible grammar without assistance. An obvious refutation of this argument is to demonstrate the existence of an algorithm that can learn a reasonable grammar, from that amount of data. It is that issue that this thesis is intended to study. Nonetheless the algorithms presented here are I hope of general interest as pieces of computational linguistics or machine learning research. This question is in one sense thoroughly Chomskyan: I fully accept his characterisation of linguistics as, ultimately, a branch of psychology, though for the moment it relies on very different sorts of evidence; I fully accept his argument for complete formality in linguistics, a formality that computer modelling both requires and enforces; I fully accept the idea that one of the central problems of linguistics is how to explain the fact that children manage to learn language in the circumstances that they do. On the other hand, there are many areas in which this work is not so congenial to followers of the Chomskyan paradigm. First, the work here is fully empirical; it is concerned with authentic language, rather than artificial examples. Secondly, it eschews the use of unnecessary hidden entities; far from considering this as the hallmark of a good scientific theory, the unnecessary proliferation of unobservable variables renders the link between theory and surface tenuous and unstable. Thirdly, the question I am examining is one that has been considered established beyond all reasonable doubt; and no proud home-owner likes having his foundations examined suspiciously by an outsider.

1.2 Basic assumptions

Throughout the thesis I make several basic assumptions. First, I concern myself as much as possible with authentic language data. Though in the field of Machine Learning much work is done on small controlled data sets, which clearly facilitate the empirical evaluation of different learning algorithms, in the context of the purpose of this thesis I feel this would be misguided. One of the few lessons that has been learnt from the first 30 years of Artificial Intelligence is that techniques that may work well on small toy domains in general do not scale up well to realistically sized tasks. Secondly, and as a result of this first assumption, I shall use statistical models. There will inevitably be noise in the data I use: errors, misspellings, quotations in foreign languages, jokes, fragments of poems and all sorts of other deviant or ill-formed uses of language. It is essential to use techniques that will be robust in the presence of this noise; this effectively rules out a large range of different non-statistical techniques. Thirdly, I have also decided to focus on a complete language learning system: not just building a set of programs that can learn aspects of natural languages but also considering how the various modules can be stuck together, to ensure that nothing falls through the gaps. Pinker (1996, p. xv) puts it well:

Anyone who has ordered a computer system à la carte from a mail-order catalogue knows the perils. Mysterious incompatibilities among the components appear and multiply, and calls to the various “help” lines just bring to mind a circle of technicians, each pointing clockwise. The same danger can arise from piecemeal research on a scientific topic as complex as language acquisition in children. What looks irresistible in a single component . . . can crash the system when plugged in with the others . . . .

Fourthly, every module I propose has a probabilistically sound interpretation: this allows the principled composition of the various parts together, and the possible global optimisation of the different parameter sets.
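To make this last point concrete, the following fragment is a purely illustrative sketch, not one of the models developed in later chapters: the probability tables, class names and factorisation are invented for the example. The point is only that if each module is a properly normalised probabilistic model, the log-probabilities of the separate modules add, giving a single joint objective against which all of the parameter sets could in principle be optimised.

```python
import math

# Illustrative toy "modules": a class-sequence model and a lexicon model,
# each a proper conditional distribution. The tables are invented; the
# point is that log-probabilities add, so the modules compose into one
# joint model with a single objective.
P_CLASS_BIGRAM = {("#", "DET"): 0.6, ("DET", "NOUN"): 0.9}
P_WORD_GIVEN_CLASS = {("the", "DET"): 0.5, ("cat", "NOUN"): 0.01}

def log_p_classes(classes):
    """Toy bigram model over syntactic classes: log P(classes)."""
    total, prev = 0.0, "#"
    for c in classes:
        total += math.log(P_CLASS_BIGRAM[(prev, c)])
        prev = c
    return total

def log_p_words(words, classes):
    """Toy lexical model: log P(words | classes)."""
    return sum(math.log(P_WORD_GIVEN_CLASS[(w, c)])
               for w, c in zip(words, classes))

def joint_log_prob(words, classes):
    """log P(words, classes) = log P(classes) + log P(words | classes)."""
    return log_p_classes(classes) + log_p_words(words, classes)

print(joint_log_prob(["the", "cat"], ["DET", "NOUN"]))  # about -5.91
```

In the thesis itself the modules composed in this way are, of course, the category, morphology and syntax models of Chapters 5 to 7 rather than toy tables.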

1.3 Outline of the thesis.

I will now give an outline of the rest of the thesis. Where possible I have tried to make the chapters fairly self-contained. In Chapter 2 I discuss the innateness hypothesis and the APS in more detail, and discuss the possibility of refuting it by demonstrating the existence of a suitable algorithm. I discuss various constraints on the algorithm and on the data it operates on, and various arguments relating to this. In Chapter 3 I examine the use of machine learning techniques in natural language processing, and introduce the particular techniques I will be using in this thesis. I discuss the various levels of language such as morphology, syntax and so on, and machine learning of those areas that are not covered elsewhere in the thesis. I will also discuss the overall architecture of a language learner. In Chapter 4 I discuss two formal issues. First, I discuss the relevance of the formal theory of learnability, as introduced by Gold (1967). In the process I shall show that if we assume that language is generated by an ergodic stochastic process (quite a weak assumption) then more or less all of the classes of languages that are linguistically interesting are identifiable in the limit with probability one. Secondly I consider the algebraic relationship between statistical language models, i.e. probability distributions, and formal languages, in an attempt to show that they are more similar than is generally thought.

There then follow three chapters that constitute the three principal contributions of this thesis. These are largely self-contained as pieces of statistical natural language processing, though some of the design choices and self-imposed limitations may seem eccentric if not considered within the context of the overall goal of the thesis. The first, Chapter 5, presents an algorithm for inducing a set of syntactic categories from a completely raw corpus. Though other such algorithms have been presented before, this is the first that handles ambiguity and rare words adequately. Then in Chapter 6 I present an algorithm for learning morphology in a variety of languages, English, German and Arabic, and in a variety of different situations, supervised and partially supervised. This algorithm is based on using an Expectation-Maximisation based training algorithm for stochastic finite-state transducers. Then in Chapter 7 I present an unsupervised algorithm for learning syntax from tagged corpora. I present a novel criterion for determining when a sequence of tags is a constituent, based on the mutual information between the symbol that occurs before it and the symbol that occurs after it. I show how this relates to the entropy of a random variable associated with a probabilistic context-free grammar, and discuss the relationship between this criterion and various other criteria that have been proposed. In Chapter 8 I examine the overall result of these techniques and discuss the plausibility of the argument from the poverty of the stimulus, in the light of the techniques presented here. I include an index, primarily of the various terms and acronyms I introduce, and a few supplementary appendices.
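As a preview of the constituent criterion just mentioned, the following is a minimal, purely illustrative sketch of the basic quantity involved: the empirical mutual information between the tag immediately preceding a candidate sequence and the tag immediately following it. The toy corpus and tag names are invented, and the sketch deliberately omits the comparison with the expected value of this quantity on which Chapter 7 actually relies.

```python
from collections import Counter
from math import log2

def context_pairs(tagged_corpus, candidate):
    """Collect (preceding tag, following tag) pairs around each occurrence of
    `candidate` (a tuple of tags), using '#' as a sentence-boundary marker."""
    k, pairs = len(candidate), []
    for sent in tagged_corpus:
        padded = ["#"] + list(sent) + ["#"]
        for i in range(1, len(padded) - k):
            if tuple(padded[i:i + k]) == candidate:
                pairs.append((padded[i - 1], padded[i + k]))
    return pairs

def mutual_information(pairs):
    """Empirical mutual information (in bits) between left and right contexts."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(p[0] for p in pairs)
    right = Counter(p[1] for p in pairs)
    return sum((c / n) * log2((c / n) / ((left[l] / n) * (right[r] / n)))
               for (l, r), c in joint.items())

# Invented toy corpus of tag sequences:
corpus = [["DET", "NOUN", "VERB", "DET", "NOUN"],
          ["DET", "NOUN", "VERB"],
          ["PRON", "VERB", "DET", "NOUN"]]
print(mutual_information(context_pairs(corpus, ("DET", "NOUN"))))  # 1.0 bit here
```

In Chapter 7 this quantity is compared with the value that would be expected given the distance between the two context symbols; sequences whose actual MI exceeds the expected MI are treated as candidate constituents.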

1.4 Prerequisites

The intended audience for this thesis is researchers in Natural Language Processing who are interested in machine learning of natural language, or in the implications for cognitive science of some recent research. I also hope that it will be accessible to cognitive scientists interested in empiricist theories of language acquisition, though some of the technical detail may be opaque. I assume a familiarity with standard techniques of statistical estimation and machine learning. In particular Chapter 6 assumes an acquaintance with the forward-backward algorithm for Hidden Markov Models. Where I have used mathematics outside the normal range used in statistical NLP, I explain it in more detail: for example, I use some elementary measure theory in Chapter 4.

1.5 Literature review

Since I am trying to cover several different areas of natural language, and applying techniques from machine learning, bioinformatics and computational linguistics, my survey of previous work has necessarily been slightly cursory. I have tried to include references to basic works in each field; there are probably some grievous omissions and errors of attribution that I apologise for in advance.

1.6 Bibliographic note

Parts of the work presented here have been previously published, in much shorter versions. A version of Chapter 5 appeared as Clark (2000), and part of Chapter 6 appeared as Clark (2001a). A version of Chapter 7 appeared as Clark (2001c).

1.7 Acknowledgments

Part of this research was done while I was employed at ISSCO at the University of Geneva, working on the European TMR Network Learning Computational Grammars.

Chapter 2

Nativism and the Argument from the Poverty of the Stimulus

2.1 Introduction

In this chapter, I shall discuss the scientific question that the work presented in this thesis is intended to address. I shall start by discussing, in Section 2.2, the Innateness Hypothesis in its various forms, and then in Section 2.3 I shall discuss the various forms of the Argument from the Poverty of the Stimulus (APS). Section 2.4 discusses how an unsupervised learning algorithm of a particular type could be a refutation of the APS. Section 2.5 discusses various objections to this approach, and how in some cases they suggest avenues for future research.

2.2 Nativism

The Innateness Hypothesis (IH) is the hypothesis that human beings have innately specified, domain-specific knowledge in several areas, in particular language. This hypothesis, which until quite recently was seen as a philosophical question, is now firmly within the field of science, in particular cognitive science, and must be settled empirically. Ultimately this question will be settled, one would expect, by neurological evidence of some kind, but at the moment we must be content with more indirect evidence from behavioural psychology and so on. I will concern myself exclusively with the issue of language, and neglect other hypothesised areas of innateness. The key issue here is that of the domain-specificity of this endowment; clearly the ability to learn language must have some genetic component since humans learn languages and other animals cannot, though dolphins (Tyack, 1986), apes (Brakke & Savage-Rumbaugh, 1996), and perhaps honey bees (see Wenner and Wells (1990) for a dissenting view) have some rudimentary linguistic abilities. The innateness hypothesis is that these innate abilities that humans have are domain-specific, that is in this case specific to language, and that they include highly detailed linguistic knowledge. Note that this is logically separate from the issue of localisation, the claim that “our ability to process language is localized to specific regions of the brain” (Bates, 1994), a claim that appears to be uncontroversial.

A first comment to make is that many species do in fact have domain-specific, innately specified abilities or behaviours: common examples (Elman, Bates, Johnson, Karmiloff-Smith, Parisi, & Plunkett, 1996) are spiders weaving complex webs on their first attempts and ungulates, such as deer, running within a few minutes of birth. In these cases nobody denies that these abilities are innate. The compelling argument is that in the absence of available experience these abilities must pre-exist in the brain of the animal in question. The situation in humans is quite different – newborn children exhibit few complex behaviours immediately after birth. Nevertheless it is claimed that in at least one field, language, particularly syntax, children acquire an ability without being exposed to sufficient stimulus, and that therefore they must have a pre-existing domain-specific innate structure that partially specifies the structure of their knowledge of language (Chomsky, 1986; Pinker, 1994). Some have related this debate to the philosophies of Descartes and Locke: I can only follow, with perhaps more sincerity, Quine (1969) when he says:

The connection between this indisputable point about language, on the one hand, and the disagreements of seventeenth-century philosophers on the other, is a scholarly matter on which I have no interesting opinion. Chapter 2. Nativism and the Argument from the Poverty of the Stimulus 8

No-one denies that the brain has some innately specified structure: As Quine (1975, p.200) says,

For, whatever we may make of Locke, the behaviorist is knowingly and cheerfully up to his neck in innate mechanisms of learning-readiness. The very reinforcement and extinction of responses, so central to behaviorism, depends on prior inequalities in the subject’s qualitative spacing, so to speak, of stimulations.

The argument is between those who feel that the human brain has many individual, innately specified, domain-specific modules and those who maintain that the higher cognitive faculties develop in response to experience, and that in particular the language faculty is merely one aspect of ‘general human intelligence’ (Piattelli-Palmarini, 1980). Chomsky (1986, p. 4) puts the central question very clearly:

Is this a distinct “language faculty” with specific structure and properties, or as some believe, is it the case that humans acquire language merely by applying generalised learning mechanisms of some sort, perhaps with greater efficiency or scope than other organisms?

He continues

We try to determine what is the system of knowledge that has been attained and what properties must be attributed to the initial state of the mind/brain to account for its attainment. Insofar as these properties are language-specific, either individually or in the way they are organised and composed, there is a distinct language faculty.

Cowie (1999) tries to maintain a distinction between the innateness hypothesis and domain-specificity, claiming that they are logically independent. This is slightly unusual, and not particularly relevant. As mentioned before, I do not think that the innateness of general cognitive capacities is controversial; Cowie claims that Skinnerian behaviorists do not accept this but her bibliography does not list Skinner (1957) – only Chomsky’s review of it (Chomsky, 1959). In fact the only person who seems to reject domain-general innateness is Cowie herself. Chomsky (2000, p.66) says:

. . . people who are supposed to be defenders of “the innateness hypothesis” do not defend the hypothesis or even use the phrase, because there is no such general hypothesis; rather, only specific hypotheses about the innate resources of the mind, in particular, its language faculty.

I am in full agreement with this; I use IH to refer to the specific claim about the existence of a highly domain-specific innate language faculty. There are a number of philosophical issues associated with this debate that I shall not address at all: I shall assume that the notion of innateness is theoretically clear; I shall not discuss the difference between innate mechanisms and innate ideas; and I shall use the word knowledge to refer to whatever mental machinery1 the language user has that allows him to use a language, without allowing that anything substantive follows from the choice of that word.

1. Geoffrey Sampson, p.c.

2.3 The Argument from the Poverty of the Stimulus

The principal argument for the IH, both historically and logically, is the Argument from the Poverty of the Stimulus (APS). In spite of its fame, it is very difficult to find a clear statement of it, as Pullum and Scholz (2001) observe:

The one thing that is clear about the argument from the poverty of the stimulus is what its conclusion is supposed to be: it is supposed to show that human infants are equipped with innate mental mechanisms specifically for assisting in the language acquisition process – in short that the facts about human language acquisition support ’nativist’ rather than ’empiricist’ epistemological views. What is not clear at all is the structure of the reasoning that is supposed to support this conclusion. Instead of clarifying the reasoning, each successive writer on this topic shakes together an idiosyncratic cocktail of claims about children’s learning of language and claims that nativism is thereby supported.

In brief, the APS is the argument that the linguistic experience of the child is not sufficiently rich to allow a child, or anyone else, to learn the grammar of the language. The data that the child has access to, the primary linguistic data (PLD), just does not contain enough information to allow the learning to proceed without other sources of information. Since children do in fact learn the language, the conclusion is drawn that the child must have access to some other source of information that helps or constrains the search for the correct grammar, and that this must be innate. This argument is clear and valid. It rests, however, on the premise that there are no general learning algorithms that can learn grammars from the sort of linguistic evidence that children are exposed to. The inference from this seems quite solid to me. The question that must be examined, therefore, is whether there are general-purpose learning strategies that can be applied to the sort of data that is available that will produce a plausible grammar. If there are such algorithms, then the APS is false, since one of its premises is false; if on the other hand there are not, then the APS is a very strong argument for the Innateness Hypothesis. Some are quite vehement on this point. Chomsky (1965) says

It seems to have been demonstrated beyond all reasonable doubt that, quite apart from any question of feasibility, methods of the sort that have been studied in taxonomic linguistics are intrinsically incapable of yielding the systems of grammatical knowledge that must be attributed to the speaker of a language.

and in Chomsky (1988b, p.110):

The common assumption that a general learning theory does exist, seems to me dubious, unargued, and without any empirical support or plausibility at the moment.

This is quite a radical proposal: most cognitive scientists accept that there is some general-purpose learning ability, whether or not there are also domain-specific modules. For example, people can learn to play chess: without hypothesizing a chess-specific learning module, or more generally a board-game specific module, it seems difficult to account for this without some sort of general learning ability. Nonetheless, Chomsky is right in insisting that it must be argued for. As Pinker (1996, p.359) states:

                          Nativist                  Empiricist
Initial State             Richly structured         Unstructured
Primary Linguistic Data   Poor                      Rich
Learning Algorithms       Weak, domain-specific     Powerful, general-purpose
Final State               Deep, Richly Structured   Shallow

Table 2.1: Comparison of Nativist and Empiricist Theories of Language Acquisition

It is now incumbent on skeptics of nativist learning theories to propose explicit theories that do full justice to the process whereby complex systems of linguistic rules are acquired.

This thesis can be thought of as a response to that challenge.

I have presented the general version of the APS: many specific claims have been made about particular constructions in English that are supposedly unlearnable. These are discussed in some depth in Sampson (1989, 1997), Pullum (1996) and Pullum and Scholz (2001). Proponents of the APS have claimed that particular constructions, such as the alleged non-occurrence of regular plurals in compounds (Gordon, 1985), are unlearnable because they do not occur at all in the child’s linguistic environment. I will not discuss these claims, as their empirical foundation is very weak: the constructions in question do in fact occur quite frequently (Pullum & Scholz, 2001). In any event, the occurrence of particular sorts of sentences in the primary linguistic data (PLD) is neither necessary nor sufficient for learning to take place. Even though a particular sentence type appears in the input, this does not mean that a learning algorithm can correctly identify the type that it is a token of, or the rule that generated it; conversely, the mere fact that a particular sentence type does not appear in the PLD does not mean that it cannot be learned: it could be learned from other types that do occur, or from the interaction of various other rules that are well attested. So the empirical question as to whether a particular sort of sentence occurs frequently in the PLD is of minimal importance, except perhaps to expose the gulf between rhetoric and reality in some writers.

Cowie (1999) distinguishes two versions of the APS: first we have the argument I have outlined above, which she calls the a posteriori APS, and secondly we have an argument based on the absence of negative information that is related to Gold’s theorem. I shall discuss this second, a priori argument in Chapter 4. I would like to stress two things: first, that the status of the IH is highly controversial (Pinker, 1994; Sampson, 1997) but it is widely accepted inside the theoretical linguistics community, and secondly, that the APS is considered one of the strongest arguments for it. Pullum and Scholz (2001) quote Fodor (1981) as describing the APS as

the existence proof for the possibility of cognitive science

Table 2.1 presents in caricature a comparison of these two competing models of language acquisition, nativist and empiricist, which will I hope clarify the various arguments that will follow. We can compare them in four contrasting areas: the initial state of the brain/mind, or the part of it that is concerned with language learning; the input to the brain, the primary linguistic data or

PLD, namely all of the language that the child is exposed to during his formative years of language learning; the learning algorithms that the child uses; and the final state of the mind/brain, the hypothesised mental grammar, or whatever mental machinery the child ends up with that allows it to speak and understand the language. The nativist claims that the PLD is very weak, that there are not powerful, general-purpose learning algorithms, and that the final state is very complex and highly structured with very deep principles: he therefore concludes that the initial state must be very complex, with highly structured domain-specific knowledge that is presumably innate. The empiricist, on the other hand, claims that the PLD is very rich, and full of statistical regularities that give clues to its structure; that the brain is provided with a rich set of powerful general-purpose learning algorithms, and that the final state, though complex, is not very highly structured. He therefore concludes that it is unnecessary to have a very rich initial state. Thus polemically, the nativist is keen to emphasise the poverty of the PLD, and the richness of the final state, while denying the existence of general-purpose learning algorithms; and conversely, the empiricist claims that the final state is not as rich as is claimed, that much of the supposed complexity of the final state is an artifact of the notoriously inadequate data collection habits of theoretical linguists, and that there are any number of good learning algorithms.

2.4 A learning algorithm as a refutation of the APS

The APS rests on the premise that there are no general purpose learning algorithms that could learn a plausible grammar for any natural language based on the small sample of data available to the infant child. It can therefore be refuted by demonstrating that there are “generalised learning mechanisms” that can do just that. The most convincing way of doing this would be to exhibit a fully implemented computer program that can perform this task, and it is this that I attempt to do later on. Note that we are not concerned here with whether the human child actually uses these mechanisms or not.

There are a number of requirements for a learning algorithm to constitute a refutation of the APS. I shall mention the main criteria here, and then discuss them further below. The data that the algorithm learns from should be as close as possible to the data that is available to the child, both in quantity and type. The algorithm should not have access to any linguistic or domain-specific information. It must produce a “plausible” grammar. It must be a general purpose learning algorithm. It should work on all natural languages, and not make language-specific assumptions. However, there are some criteria that it need not satisfy: in particular it need not be a cognitive model for language. This is an important point: we are not trying to find a cognitive model of language learning. All we are trying to do is demonstrate the existence of algorithms that can learn from the data available to the child. We could do this by demonstrating a cognitively plausible model, but we need not. Any cognitive plausibility that the model might have is merely icing on the cake. This can be thought of as a methodological decision: to approach a full cognitive model by looking for algorithms that work, that actually can learn from a realistic amount of raw data, rather than to approach it by looking for models that, say, produce the right sort of errors on highly simplified data sets. The end point is the same, the route is different. One wants to choose enough constraints that will guide the process of theory construction helpfully. Pinker (1979, p. 219) has an illuminating discussion of the requirements for a formal model of language acquisition.

It is instructive to spell out these conditions one by one and examine the progress that has been made in meeting them. First, since all normal children learn the language of their community, a viable theory will have to posit mechanisms powerful enough to acquire a natural language. This criterion is doubly stringent: though the rules of language are beyond doubt highly intricate and abstract, children uniformly succeed at learning them nonetheless, unlike chess, calculus and other complex cognitive skills. Let us say that a theory that can account for the fact that languages can be learned in the first place has met the Learnability condition. Second, the theory should not account for the child’s success by positing mechanisms narrowly adapted to the acquisition of a particular language. For example, a theory positing an innate grammar for English would fail to meet this criterion, which can be called the Equipotentiality Condition. Third, the mechanisms of a viable theory must allow the child to learn his language within the time span normally taken by children, which is in the order of three years for the basic components of language skill. Fourth, the mechanisms must not require as input types of information or amounts of information that are unavailable to the child. Let us call these the Time and Input Conditions, respectively. Fifth, the theory should make predictions about the intermediate stages of acquisition that agree with empirical findings in the study of child language. Sixth, the mechanisms described by the theory should not be wildly inconsistent with what is known about the cognitive faculties of the child, such as the perceptual discriminations he can make, his conceptual abilities, his memory, attention, and so forth. These can be called the Developmental and Cognitive Conditions, respectively.

In terms of these criteria, I suggest that the APS claims that there are no algorithms that satisfy the Learnability, Equipotentiality and Input conditions. Any implemented program will satisfy the Time condition, since the computational resources of the sort of computers that we use today are limited compared to the human brain. The remaining two conditions are required for it to be a cognitive model. Of course, even a model that does fully satisfy these requirements might not turn out to be true since the brain might turn out to do things in some completely unexpected way: as previously stated, the matter will ultimately be settled by some sort of neurological evidence. In particular it might be possible to produce both empiricist and nativist models that satisfy all of these criteria.

2.4.1 Data Limitations

Pullum (1996, p. 505) describes the ideal data set, the primary linguistic data (PLD):

Ideally, what we need to settle the question is a large machine-readable corpus – some tens of millions of words – containing a transcription of most of the utterances used in the presence of some specific infant (less desirably, a number of infants) over a period of years, including particularly the period from about one year (i.e. several months earlier than the age at which two-word utterances start to appear in children’s speech) to about 4 years.

To my knowledge there is no corpus that fits this description. Failing this we can use a corpus of written language of the appropriate size. This raises several questions:

The size of the corpus

The use of written rather than spoken language, in particular the use of sequences of letters rather than phonemes.

The problem of the level of the language used in the corpora in terms of genre, degree of formality, syntactic complexity, size of vocabulary, and so on.

The proportion of grammatical errors and misspellings.

The size of the corpus used should not be excessively large. Pullum says “some tens of millions of words”; back-of-an-envelope estimates seem to bear this out – a rate of 100 words per minute, for 120 minutes a day, for a period of about 1000 days gives a pessimistic lower bound of 12 million words. Other estimates can be much higher: Kelly and Martin (1992) claim that

Given conservative assumptions . . . a typical person will be exposed to a million word sample in about two weeks.

which would give an estimate of about 100 million words for the period in question. Hart and Risley (1995) provide some interesting evidence on this point that would tend to confirm the lower estimate. The child is not limited only to speech directed at or towards himself: as Chomsky (1959) points out

A child may pick up a large part of his vocabulary and “feel” for sentence structure from television, from reading, from listening to adults, etc.
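For concreteness, the two back-of-the-envelope estimates above work out as follows; the rates are simply the assumptions already stated, not new data:

\[
  100~\tfrac{\text{words}}{\text{minute}} \times 120~\tfrac{\text{minutes}}{\text{day}} \times 1000~\text{days} \;=\; 1.2 \times 10^{7}~\text{words}
\]
\[
  \frac{10^{6}~\text{words}}{2~\text{weeks}} \;\approx\; 2.6 \times 10^{7}~\tfrac{\text{words}}{\text{year}} \;\Longrightarrow\; \text{roughly } 10^{8}~\text{words over three to four years.}
\]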

In any event, on a practical level, this is a good match both with the size of corpus that is widely available and the size of data set that is easily manipulable on standard workstations. Most of the work done in this thesis has used the lower estimate of 12 million words.

The use of written rather than spoken language is more fundamental. My view is that at the level of syntax the differences are not that significant, but at the morphological and phonological level they are clearly crucial. In particular in English, which is written predominantly in an alphabetic script, where the orthographic word boundaries are in general quite close to the phonological word boundaries, this assumption seems innocuous. In other languages, with very different writing systems, it might not be. Therefore, I have used a written corpus for the syntactic work presented here, and for the morphological chapter I have worked exclusively with phonetic transcriptions. A decision must also be taken about how to treat aspects of language, such as punctuation, that are specific to the written modality. When working with sequences of phonemes, another decision must be taken as to whether to use the atomic phoneme symbols, or whether to decompose them into a more structured representation. So for example we could either represent the first phoneme of the word “cat” as an atomic symbol, perhaps the character ’k’, or as a feature structure whose partial representation might specify, for example, that it is a voiceless velar stop.

With regard to the level of language, it appears that this may not be significant. General statistical properties of texts appear to be quite invariant across text genres; van Everbroeck (1999) presents an interesting comparison between the CHILDES corpus (MacWhinney, 1995) and the Wall Street Journal (WSJ) corpus, which shows that they are very similar and that in some respects the corpus of child-directed speech is more complex than the WSJ corpus. Moreover, Finch, Chater, and Redington (1995) report experiments using techniques that are quite similar to those used here on two very different corpora, the 2.5 million word CHILDES corpus (MacWhinney, 1995) and a 10 million word corpus of USENET news. The results they reported showed that the techniques worked comparably on both corpora.
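Returning briefly to the representational choice raised above, the atomic versus featural encodings of a phoneme can be illustrated with the following toy fragment; the feature names and values are illustrative only and are not the feature system used anywhere in this thesis.

```python
# Two ways of representing the first phoneme of "cat". The atomic encoding
# treats /k/ as an unanalysable symbol; the featural encoding decomposes it,
# which gives a graded notion of similarity between phonemes for free.
atomic = "k"

featural = {"consonantal": True, "voiced": False, "place": "velar", "manner": "stop"}
g_sound = {"consonantal": True, "voiced": True, "place": "velar", "manner": "stop"}

def shared_features(a, b):
    """Number of feature values two phoneme representations share."""
    return sum(1 for f in a if f in b and a[f] == b[f])

print(shared_features(featural, g_sound))  # 3: /k/ and /g/ differ only in voicing
```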

The proportion of errors is also a problem. Most publicly available corpora of the right size are made of material intended for wide publication: newspaper articles, novels, and so on. These, by their nature, have been heavily post-edited. As a result the proportion of errors is probably low compared to child-directed speech, though Newport, Gleitman, and Gleitman (1977, p.121) claim:

And finally, the speech of mothers to children is unswervingly well-formed. Only one utterance out of 1500 spoken to the children was a disfluency.

Finally, in a way the most obvious point: the data should be raw, that is, not marked up with parts of speech or constituent structure. Thus, in Machine Learning terms, we must use unsupervised learning.

2.4.2 Absence of Domain-Specific information

The algorithm may not use any linguistic resources, such as lists of parts of speech, or primitive phrase structure rules. As Kapur and Clark (1996) argue:

We feel that the most logical approach is to assume that the child has no access to any information unless it can be argued that without some information, learning would be impossible or at least infeasible.

Though their assumptions are very different from mine, we can see that this approach has some independent merit on completely separate methodological grounds.

2.4.3 Absence of Language Specific Assumptions

Another requirement is that the algorithm should not make assumptions about the language input that might be true of the language under study but not true in general of all human languages. For example, an algorithm that learns morphology by looking at the suffixes of words is making an assumption that the inflectional processes in the language are primarily suffixing; an assumption that is true in the case of English but false in very many other languages.

2.4.4 Plausibility of output

We require the algorithm to produce a linguistically plausible grammar. This is a rather vague criterion. Clearly it is possible to write a grammar induction algorithm if it doesn’t matter what the output is. We require that the grammar conform in some sense to what we know about the structure of the language. There is, however, substantial disagreement about many aspects of linguistic structure, and it clearly is not necessary that the output conform to the particular scheme of annotation of a particular linguistic theory. We can distinguish evaluation at the various different levels. So for example, at the level of morphology the set of correct forms is quite well-defined in many languages, and therefore fairly standard evaluation techniques are appropriate. At the level of syntax the answers are not so well-defined, and it is difficult to define a completely satisfactory evaluation metric. I shall discuss evaluation of each part of the model in the relevant chapters.

2.4.5 Excluded sources of information

These requirements may seem overly onerous. In particular, the infant child learner has access to other sources of information, notably the situational context of each utterance, if you like, its semantic context. Some researchers have claimed that this could be a useful source of information and have thus written algorithms which learn from phrases which have been annotated with some sort of semantic parse (Wexler & Culicover, 1980; Pinker, 1989). This is sometimes called semantic bootstrapping. This assumption seems to me to be too generous. Anyone who has spent time in a foreign country where he does not speak the language will surely agree. Consider the following situation: you are sitting at a restaurant half way through a plate of food, and the waiter approaches, and asks you something with an interrogative intonation. Is he asking you whether you are enjoying your meal? Whether you have finished? Offering you some mustard? Asking you if you mind if someone else sits at your table? The possibilities are endless, even with a very clear situational context. It thus seems clear that this sort of evidence is unreliable, as Pinker (1979) admits. There has also been some psychological research that verifies this intuition (Gleitman, 1990; Fisher, Hall, Rakowitz, & Gleitman, 1994; Gleitman & Gillette, 1995). Moreover, as Gleitman (1994, p.184) points out:

In fact, positive imperatives pose one of the most devastating challenges to any scheme that works by constructing word-to-world pairings, for the mother will utter “Eat your peas!” if and only if the child is not then eating the peas. Thus a whole class of constructions is reserved for saying things that mismatch the current situation.

It is thus clear that semantic information, however useful it may be, can certainly not be relied on as the exclusive source of evidence for the learning of syntax, though it would of course be essential for the learning of word meaning. Prosodic information has also been put forward as another source of information. While this clearly is relevant at a large scale, for example segmenting discourses into sentences, it appears not to be reliable for syntactic information, i.e. for dividing sentences into syntactic constituents (Gerken, Jusczyk, & Mandel, 1994), because (Steedman, 1990, p.457)

. . . the prosodic phrase boundaries established by intonation and timing are often orthogonal to the syntactic phrase boundaries that linguists usually recognize.

In addition, I assume that we do not have access to any negative evidence, that is to say evidence about what sentences do not occur in the language. There has been a long debate about whether children do or do not have access to negative evidence (Bohannon, MacWhinney, & Snow, 1990). To a large extent, this debate has been motivated by concern about the negative results on the learnability of formal languages due to Gold (1967), a concern that I consider misplaced, as I will argue at length in Chapter 4. Moreover, I consider the traditional definition of a language as a set of grammatical utterances to be impossible to apply in practice, given the enormous variability in language, and thus the definition of negative evidence to be extremely unclear. The view put forward here is that there are merely sentences that may appear as utterances in a language with greater or lesser frequency. The role of negative evidence is then taken by the fact that particular types of sentences (ill-formed/unacceptable/ungrammatical) occur less frequently than other types (well-formed/acceptable), as has been noted by Horning (1969) and Chomsky (1981). This is sometimes called indirect negative evidence: this is highly misleading since it is in fact positive evidence. In summary, given the polemical situation, it is clearly better to make rather pessimistic assumptions about the information available.

2.5 Possible Counter-arguments

Suppose there was a program that satisfied these requirements. Let us examine some arguments that could be used to deny that this was a refutation of the APS. I will first remark that the program I am going to present will not in fact satisfy all of these requirements; I shall defer discussion of the consequences of that failure until Chapter 8. Here I discuss some general objections that could be made even if it was completely successful.

2.5.1 Inadequate data

The data used was not sufficiently similar to the data available to the child.

This objection is in principle valid, but it needs to be argued. What precisely are the differences that are significant? A blanket refusal to accept results of this kind until precisely the correct data is available to permit the testing is clearly unjustifiable. Moreover, it seems reasonable to say that the corpus I am using here, the BNC, is in fact more difficult in many respects than the ideal data set, since the vocabulary and range of language are much wider. If I were using a corpus from a small domain, like the ATIS corpus, or even from a single source, like the Wall Street Journal corpus, this argument might have more force. Crain and Pietroski (2001) and Fodor (2001) both claim that there is a serious objection to Pullum's (1996) paper on the APS, but from the other side: they argue that the Wall Street Journal corpus is much more complex than the language the child is exposed to, and thus that the fact that certain constructions occur in the Wall Street Journal corpus is not relevant to the question of whether or not they appear in the PLD. So clearly, if the data is too simple, this approach can be criticised for dealing with an artificially simplified situation, and if the data is too complex, then it will be criticised for having too many examples of the sorts of complex constructions that the APS is based on. So until we have a very good PLD corpus, we shall have to make do with what we can get: a large mixed corpus. As better corpora become available, we can experiment further. Since we are not interested in modelling the actual language acquisition process and the different stages of development, it does not matter so much that the corpus is adult-to-adult rather than child-directed speech. In any event, stipulating that certain criteria must be followed, when no corpus satisfies these requirements, amounts to stipulating that the APS is irrefutable at the moment.

2.5.2 Incomplete learning

Though the algorithm successfully learns a range of constructions, it doesn’t learn a partic- ular construction, such as for example parasitic gaps in English.

Though it works for languages W, X and Y, it won't work for Z.

The form of this argument is bad: until we have a perfect model, there will always be languages and constructions that it cannot learn correctly. Allowing this argument would mean that no refutation would be valid until there was a perfect empiricist learning algorithm for every single language; clearly an untenable requirement. Simply pointing out the flaw does not constitute a rebuttal of the argument. What is needed is an argument that explains why a particular construction or language is so different that it cannot be learned. Failing that, this objection is without force. Of course, algorithms that only learn a small fraction of the language, or algorithms that demonstrably cannot learn more than a very limited range of languages, will not be very convincing as refutations of the APS. As long as the algorithm does not make language-specific assumptions, the fact that it has only been demonstrated to work on a limited subset of languages should not be a concern. Note that, even if it is shown to work on all existing human languages, in some sense this objection could still in theory be made, since the existing human languages are just a small subset of all the possible human languages that the algorithm should also work on. This argument is not completely frivolous. Consider a learning algorithm that was provided with full lexica and grammars for ten major languages: English, Japanese, and so on. When confronted with a bit of text, it could just identify the language, which is trivial to do with high accuracy, and output the predetermined grammar for that language. A system of this type, loaded with all actual human languages, would not be subject to this objection, though it is of course making a large number of unjustified language-specific assumptions. Note that Principles and Parameters models of language acquisition are rather different from this, since they do not have the lexica provided for them but must acquire them.

2.5.3 Undesirable Properties of Algorithm

The particular algorithm exhibited here has an undesirable property. Therefore the argument against the APS fails.

Let us suppose the model presented has some serious flaw. This doesn't require much imagination since, as will be seen, it has several. This does not mean that all algorithms will have this flaw. Remember, our goal is merely to establish that there are algorithms that satisfy a set of criteria. One way, the most conclusive way, as is always the case with existence proofs, is to actually construct or exhibit an algorithm that satisfies all of the criteria. Another, less conclusive but still valid way, is to demonstrate that there are algorithms that satisfy a less stringent set of criteria, and then hope that this offers a convincing argument for the existence of algorithms that also satisfy the further criteria. Thus criticisms of particular aspects of the preliminary models presented here will not invalidate this as a critique of the APS in the absence of some principled argument that shows that all similar models must suffer from the same flaw. Presenting a completely specific and fully implemented model as I do here provides ample opportunity for trivial criticisms. To my knowledge there are no implemented nativist learning programs that can learn a language from raw text. Therefore specific criticisms of this model will have little force until a direct comparison can be performed.

2.5.4 Incompatible with psychological evidence

It fails to account for speakers' intuitions.

It is contradicted by psycholinguistic evidence.

We are not trying to construct a cognitive model. We are trying to demonstrate that there are algorithms that can learn from the data available to the child. Moreover, the use of speakers' intuitions is fraught with methodological problems of great severity (Schütze, 1996). I shall talk briefly about the cognitive plausibility of this model in Chapter 8. Here I shall merely say that it is not necessary for the model to be psychologically plausible for the argument to go through, though it is desirable.

2.5.5 Incompatible with Developmental Evidence

The model fails to account for developmental evidence.

Statistical models of language acquisition must predict that child speech will match adult speech, but there are significant differences such as subject dropping in English.

Yang (1999) says, discussing various differences between the language that the child produces and the language that the child hears:

. . . an inductive learner purely driven by corpus data has no explanation for these disparities between child and language data.

In addition to the response above, that I am not trying to produce a cognitive model of language acquisition, I will make a further comment, since this is quite a general criticism of this sort of learning algorithm and raises some interesting points about what the language model actually represents. My argument is that this rests on an overly simplistic view of the relationship between production and comprehension. It is uncontroversial that there is a large gap with language learners, whether infant or adult second language learners, between the range of language that can be understood and the range of language that is produced. In terms of statistical models, the difference can be thought of as a change in the boundary conditions. In comprehension, we are looking for structures that are likely given the observed utterance; in production we are looking for structures that are likely given an input semantic representation. Even a simple statistical model could produce very different sets for these, if the mapping from semantics to syntax is rather limited. Thus statistically we can consider the language not just as a probability distribution over the set of strings, but as a joint probability distribution over pairs of strings and semantic representations. The set of strings produced by the child will correspond to the most likely strings with particular semantic representations. This is discussed in Optimality Theory terms in Smolensky (1996). In more traditional linguistic terms, this could be thought of as a competence/performance distinction. A less attractive alternative would be to posit separate distributions for production and comprehension.

2.5.6 Storage of entire data set

The algorithm as it stands requires operations such as re-estimation, that operate over the whole data set. The infant language-learner however clearly cannot and does not memorise every utterance he or she is exposed to. 2

2 I am grateful to John Carroll for making this point.

In general, however, with the sort of statistical algorithms that are used here, there are versions of the algorithms which operate in what is called on-line mode, that is to say those which discard each data point after it has been processed, using for example the Robbins-Monro procedure (Robbins & Monro, 1951). This is sometimes called sequential parameter estimation (Bishop, 1995, p. 46). These are normally less efficient in terms of the amount of data required than the corresponding batch mode algorithms. Since we are operating on a fairly pessimistic amount of data, this seems not to be a serious objection. Moreover, frequently by using sufficient statistics, we can perform calculations on-line that might at first glance appear to require the storage of the whole data set. A simple analogy might clarify this: if we want to calculate the variance of a set of numbers, the average squared distance from the mean, it might seem necessary to store the whole data set so that we can operate on it twice, first to calculate the mean, and then in the second pass to calculate the average squared distance from the mean. But it is clearly sufficient merely to calculate a running total of the sum of the values, and of the sum of the squares. This allows the calculation without having to store the data set (a small sketch of this idea appears at the end of this subsection). Chomsky calls this instantaneous learning and says (Chomsky, 1975c, p.245):

at the present stage of our understanding, I think we can still continue profitably to accept it as a basis for investigation

I shall discuss this further in Section 4.3.
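To make the running-totals analogy above concrete, here is a minimal sketch (in Python, and not part of the software described in this thesis) of how the mean and variance of a stream of numbers can be maintained on-line from a handful of sufficient statistics, with each data point discarded as soon as it has been processed. The class name and the toy data are illustrative assumptions.

```python
# Minimal sketch, not part of the thesis software: on-line mean and variance
# from running sums (sufficient statistics), discarding each point after use.

class RunningVariance:
    def __init__(self):
        self.n = 0           # number of values seen so far
        self.total = 0.0     # running sum of the values
        self.total_sq = 0.0  # running sum of the squared values

    def update(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def variance(self):
        # ML (biased) estimate: average squared distance from the mean.
        m = self.mean()
        return self.total_sq / self.n - m * m


if __name__ == "__main__":
    rv = RunningVariance()
    for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # illustrative data
        rv.update(x)  # the data point itself is not stored anywhere
    print(rv.mean(), rv.variance())  # 5.0 4.0
```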

2.5.7 Argument from truth of the Innateness Hypothesis

There is a great deal of other evidence for the innateness hypothesis.

Garfield (1994) identifies five other arguments apart from the APS:

(1) the argument from the existence of linguistic universals;
(2) the argument from patterns of errors in early language learners;
(3) the poverty of the stimulus argument;
(4) the argument from the ease of first language learning;
(5) the argument from the relative independence of language learning and general intelligence;
(6) the argument from the modularity of language processing.

to which we could add some recent neurobiological evidence related to Specific Language Impairment (Gopnik, 1990; Gopnik & Crago, 1991; Crago & Gopnik, 1994), though this has been questioned in turn by Vargha-Khadem and Passingham (1990) and Bishop (1996). This objection is however simply a logical fallacy. I am arguing that an argument for the IH is invalid, I am not arguing that the IH itself is invalid. The existence of other arguments for the IH,

or even the truth of the IH, can have no logical relation to the validity of my argument. Logically, I accept APS ⇒ IH, and am trying to prove ¬APS. If we have another argument for the IH, such as for example a neuro-biological argument (NB), such that NB ⇒ IH, both NB and ¬NB are compatible with ¬APS. I am not qualified to discuss any of these other arguments, so I shall leave the point there. However there is a variant of this argument that is reasonable: given that the IH is so well-supported by all available arguments and evidence, isn't it rather pointless to quibble about the validity of just one of these many watertight arguments? As Piattelli-Palmarini (1994, p.335) puts it:

The extreme specificity of the language system, indeed, is a fact, not just a working hypothesis, even less a heuristically convenient postulation. Doubting that there are language-specific, innate computational capacities today is a bit like being still dubious about the very existence of molecules, in spite of the awesome progress of molecular biology.

Certainly if I shared this view of the merits of IH, I would probably not have examined the APS so closely; but it is certainly worth examining the merits of the APS even if the IH is true. There may well be domain-general parts of cognition that are applied to the task of language-acquisition even though the core of it is domain-specific. This sort of research could fruitfully focus the attention of researchers on particular aspects of language where the domain-specificity is more essential; moreover, I think it is clear that at some points in the language acquisition process, even nativists must propose some sort of statistical learning, albeit just for low-level tasks such as word segmentation.

2.5.8 Argument from the inadequacy of empiricist models of language acquisition

Empiricist models are so inadequate that it doesn't matter how bad the APS is: the IH is still true.

Neural networks don’t work well, therefore all empiricist models don’t work well.

Fodor (2001) presents a radical variant of this argument in his article about Cowie (1999):

My point is that attacking this claim the way Cowie does —by attempting to undermine the experiments one by one— is simply not appropriate to the polemical situation. . . . If, in short, you wish seriously to evaluate the available data about the poverty of the child’s stimulus, the pertinent question is not ‘which of them can I perhaps impugn?’; rather it’s whether, if they aren’t entirely misleading, a move in the direction of empiricism seems plausibly the way to account for them. Or put it like this: We know what facts about the PLD are alleged to argue for the face plausibility of the nativist picture; well, suppose all of those were to disappear. The question remains: What are the facts about the PLD that are supposed to argue for the face plausibility of the empiricist picture? Answer: As far as I know (and, certainly, as far as Cowie tells us) there are none.

This argument appears to be that it doesn't matter if the APS is a bad argument based on inadequate or no data; until there is a better theory, it stands. In effect: it doesn't matter how bad my arguments are, yours are worse. This is a perfectly valid point; until good empiricist theories of language acquisition are available, nativist theories will be the best on offer, but nobody is going to bother constructing empiricist theories if they think it is impossible to produce a good one. The APS claims to establish this impossibility; removing it is therefore a necessary preliminary to an active program of research in this area. Crain and Pietroski (2001, p.177) claim that the limitations of Neural Networks apply to all statistical learning algorithms:

There is a second inadequacy with experience-dependent learning mechanisms that rely on localist error-correction algorithms such as the back-propagation algorithm. In extracting information based on local connections, these mechanisms do not generalize beyond the training set.

They then proceed to a discussion of various flaws in Rumelhart and McClelland (1986a). I share their scepticism about the generalisation ability of neural networks, but neural networks are merely one possibility among many. As we shall see, generalisation is an important aspect of learning, but it is one that is now well understood, and the limitations of one sort of algorithm do not necessarily apply to other learning techniques.

2.5.9 Use of tree-structured representations

The algorithm uses tree-structured representations for sentences – this is domain-specific.

Tree-structured representations are generally considered to be used in a number of other cognitive domains (Bloom, 1994) that have been studied extensively, such as chess (Charness, 1992; Freyhof, Gruber, & Ziegler, 1992), music (Lerdahl & Jackendoff, 1983) and vision (Marr, 1982; Pylyshyn & Burkell, 1997). As Sampson (1997) points out, citing (Simon, 1962), on general evolutionary arguments we would expect tree-structured representations to be widespread. Miller and Chomsky (1963, p.483) recognise this when they say: 3

Let us accept as an instance of complicated behavior any performance in which the behavioral sequence must be internally organized and guided by some hierarchical structure that plays the same role, more or less, as a P-marker [phrase marker] plays in the organization of a grammatical sentence.

It is important to note that not all linguistic principles are structure dependent: there appear to be a number of phonological phenomena that are not (the a/an alternation in English being a simple example). Moreover, prosody (Steedman, 2000) and quantifier scoping ambiguities are famously related to surface order, and have been a perennial problem for structure-dependent theories (Hobbs & Shieber, 1987) (i.e. pretty much all theories); and at this point we could also point to numerous phenomena that are normally dealt with as performance matters that are also not structure dependent. Crain and Pietroski (2001, p. 163) display this presupposition when they ask:

Is there any guarantee that there is abundant evidence that lets all children learn that all linguistic principles are structure dependent?

Now I am sure they would respond by making various points about the autonomy of syntax, the performance/competence distinction and so on, but this merely establishes the point that while it may be possible to identify some areas of language that only use structural rules, many other areas use other sorts of rules. So the question then becomes, how can the child learn which rules are structure dependent and which are not? The simple answer is that the structure-dependent rules work better to describe structure-dependent phenomena, and the surface rules work better to describe surface phenomena. Until quite recently there would have been a huge flaw in this argument, namely that the state of the art in language models was unstructured models such as smoothed trigrams (Jelinek, 1997), but this is no longer the case, and the best performing structured language models (Charniak, 2001) perform substantially better than trigram models on the same amount of data.

3 It is not clear that this reflects Chomsky's current views.

In any event, leaving aside the issue of whether tree-structured representations are domain-specific psychologically, a question that current cognitive research is in any case not able to answer, it is abundantly clear that they are not domain-specific in general: they are very widely used in many areas of mathematics and computer science. 4 Therefore I am quite willing to admit that humans have an innate structure-dependent modelling capability; I deny that it is domain-specific in any meaningful sense.

2.5.10 Not a general purpose learning algorithm

It is not enough to exhibit an algorithm that can learn language. For it to constitute a refutation of the APS, it must be a general-purpose learning algorithm, and that means it must also be able to learn in one or several other domains as well.

This is an objection put forward by Ramsey and Stich (1991, p. 308) in a discussion of connectionist models and the APS:

If the only successful connectionist language acquisition devices are of a sort that require language specific architectures and/or language specific tuning, then even the rationalist version of nativism will have nothing to fear from connectionism.

This seems slightly too strict a criterion. Purely for technical reasons, one is likely to design algorithms that function well in a particular domain. Thus, the most successful language acquisition devices are likely to be those that function in the narrowest domain. If the various learning components are quite general purpose, although their organisation and configuration may have been optimised for the task at hand, then I think this objection is without force. Ramsey and Stich are quite right in my view to question the claims of the connectionist models, which invariably need a large amount of representational engineering to perform well in any domain. Since I accept the general thrust of this argument, I have taken pains to use general-purpose machine learning techniques, or variants thereof. Thus the techniques of distributional clustering used in Chapters 5 and 7 are variants of standard clustering techniques. The Pair Hidden Markov Models used in Chapter 6 were as a matter of historical fact first used for aligning DNA sequences, and are quite promising for learning other transductions in other areas. Stochastic Context-Free Grammars might appear to be language specific, but have been used in bioinformatics as well (Sakakibara, Brown, Hughey, Mian, Sjolander, Underwood, & Haussler, 1994). Moreover, considered as a certain sort of stochastic branching process, they can be seen to have links with many areas of mathematics and computer science, from multi-type Galton-Watson processes (Miller & O'Sullivan, 1992) to Pattern theory (Grenander, 1996).

2.5.11 Fails to learn deep structures

The algorithm may learn the surface structure of the language, but it doesn’t learn the ‘deep structures’ that are necessary for an adequate grammar.

More generally, the grammar produced fails to have some particular property that is essential.

4 At the risk of sounding facetious, one can also point out that tree-structured objects are also prevalent in the natural world, canonical examples being river deltas, corals, and, um, trees.

This point is put by Chomsky (1967, p.129) (reprinted as Chomsky (1975b)):

In the case of language acquisition, there has been much empiricist speculation about what these mechanisms may be, but the only relatively clear attempt to work out some specific account of them is in modern structural linguistics, which has attempted to elaborate a system of inductive analytic procedures of segmentation and classification that can be applied to data to determine a grammar. It is conceivable that these methods might be somehow refined to the point where they can provide the surface structures of many utterances. It is quite inconceivable that they can be developed to the point where they can provide deep structures or the abstract principles that generate deep structures and relate them to surface structures.

I think I am in agreement with Chomsky that these types of methods could not learn these highly abstract structures. The question of course is whether they are necessary. The arguments Chomsky presents for the existence of these deep structures are very inconclusive, and it is widely accepted that so-called monostratal theories derived from GPSG (Gazdar, Klein, Pullum, & Sag, 1985), such as HPSG (Pollard & Sag, 1994), are adequate to represent natural languages, and in many respects are perhaps superior to current Chomskyan models, even according to Chomskyan criteria (Johnson & Lappin, 1997). Moreover, this argument, if pursued, could become circular. The theories of generative syntax pursued by Chomsky and co-workers have been motivated to a large extent by nativist assumptions, justified largely by the APS: clearly then, arguing from these theories to support the APS is questionable at best. This argument is taken up again by Crain and Pietroski (2001, p.183), who conclude:

Until empiricists show how specific principles – like the Head-Movement Con- straint and the Binding Theory – can be learned on the basis of the primary linguistic data, innateness hypotheses will continue to be the best available explanation for the gap between normal human experience and the linguistic knowledge we all attain.

This is incorrect: in effect, the argument demands that to convince the rationalist that he is wrong, the empiricist must produce a rationalist theory. Of course he need not: what the empiricist must do is account for the data, not the theory, and the data can be accounted for in a number of different ways, either by positing deep, obscure and hard-to-learn principles as Crain and Pietroski (2001) suggest, or by positing shallow, easily learnable empiricist explanations. Crain and Pietroski (2001, p.159) recognise this earlier on:

Replies [to nativist arguments] must either challenge these descriptive claims about human grammars, by providing alternative explanations of the relevant judgments, or address the learning problems (including those concerning uncorrectable overgeneration) associated with specific constraints

Of course, alternative explanations abound for many of the descriptive claims they consider: for example, their first example of wanna-contraction has an alternative morpho-lexical explanation that is perfectly learnable, as discussed by Pullum (1997).

2.5.12 Fails to guarantee convergence

All you have demonstrated is that the algorithm learns from this single corpus. You need to show that the language will be learned from every possible corpus that a child might be exposed to.

You must also show that your algorithm will converge to the same grammar for every corpus of the same language.

That is true: however, there are similar approaches, such as Finch and Chater (1992a), Redington, Chater, Huang, Chang, Finch, and Chen (1995), and Klein and Manning (2001), that have produced similar results on different corpora. There is no reason to suspect that the corpora I use have a radically idiosyncratic structure. Secondly, it is not the case that all children learn the language they are exposed to. A small proportion of children who appear to be neurologically normal do not, exhibiting a wide range of deficits ranging from mild delays in the progress of their learning to a complete failure to acquire any part of the language (Shames, Wiig, & Secord, 1998). I am not suggesting that this is caused by an unusual or misleading PLD, but merely pointing out that the empirical situation is not quite as clean-cut as it is claimed to be. Thirdly, it seems quite clear that people exposed to more or less the same linguistic environment do in fact differ from one another in terms of the grammar they acquire, that is to say they do not converge to the same grammar, a fact that causes methodological problems for traditional generative grammar data collection techniques (Schütze, 1996). The assumption that all speakers converge to the same grammar may be a "methodologically expedient counterfactual idealization", to use Newmeyer's phrase, but it is certainly not an empirical finding that must be explained.

2.5.13 Doesn’t use semantic evidence

Children have access to the situational context of the utterances, and learn language as a way of expressing meanings, but this algorithm completely neglects that source of evidence.

I agree. What I am trying to do may in fact be impossible without using semantic evidence, and it is certainly the case that semantics drives the production of language by children. However this objection does not reduce the effectiveness of this critique of the APS, but rather strengthens it: if I can demonstrate that the grammar is learnable from this very limited evidence, then a fortiori it is learnable from a larger, richer source of information. The same argument applies also to other sources of information that might be available such as prosodic information. This argument does of course weaken the validity of this model as a cognitive model, but as previously stated I am not trying to do that.

2.5.14 Fails to show that children actually use these techniques

You have not provided any evidence that children actually use these techniques.

As Crain and Pietroski (2001, p.174) put it:

But we would like to see some reason for thinking that any of this is true.

I shall discuss the psychological plausibility of these techniques, briefly, in Chapter 8. But in short, this is a mistake about the form of the APS. The APS claims, as one of its premises, that no algorithm can learn from the PLD without innate domain-specific knowledge. I am refuting the

APS by showing that one can learn from the PLD without such knowledge; I am not trying to show that children actually do learn using such knowledge, nor do I have to show that to refute the APS. Brent (1997) has an interesting discussion of the role of computational models in the study of language acquisition. He concludes that even models that do not implement a psychologically plausible algorithm can be useful at an early stage of research.

Chapter 3

Language Learning Algorithms

3.1 Introduction

This chapter outlines the techniques I will use to implement a program that is intended to satisfy as many of the desiderata discussed in Chapter 2 as possible. I discuss the theoretical background to some of the methods I will use, and try to situate the algorithms I am going to use in their intellectual context. I also discuss various technical aspects of these methods. The layout of this chapter is as follows: in Section 3.2, I briefly discuss the role of machine learning in natural language processing, and the different motivations for research in this field. In Section 3.3 I discuss the distinction between supervised and unsupervised learning techniques. In Section 3.4 I discuss statistical learning techniques, and the difference between parametric and non-parametric techniques, and then in Section 3.5 I shall briefly talk about a particular form of statistical model, Neural Networks. Section 3.6 discusses the basic technique of Maximum Likelihood Estimation, and Section 3.7 discusses the Expectation-Maximisation algorithm for Maximum Likelihood Estimation with incomplete data. Another form of estimation is Bayesian estimation, which is dealt with, together with various related techniques, in Section 3.8. I then address the issue of evaluation in Section 3.9. Finally I discuss the overall architecture of the learner in Section 3.10, and Section 3.11 deals with various areas or levels of language that will not be addressed in this thesis.

3.2 Machine Learning of Natural Languages

The study of Machine Learning is a wide field of academic study in its own right, with a large variety of different techniques being employed (Carbonell, 1990); I shall not attempt a survey here. There are two main motivations for applying Machine Learning techniques to natural languages. First, there is a pure engineering motivation. The effort involved in hand coding grammars, dictionaries and other linguistic resources is prohibitive: it is natural that researchers should explore ways of extracting them automatically from appropriate data sets (Church & Mercer, 1993; Grefenstette, 1994). Secondly, there is a cognitive science motivation: interest in modelling the acquisition of human language (Brent, 1997). There are a variety of methods used in this area, which we can divide roughly into two classes – symbolic and statistical. Many people consider connectionist models to be a third class, but in my view they are better seen as a special type of statistical model. Symbolic methods currently used include decision trees (Quinlan, 1993), Inductive Logic Programming (ILP) (Muggleton, 1997, 1999; Muggleton & Bain, 1999) and transformation-based learning (Brill, 1992, 1993). Statistical techniques have become very widely used in NLP recently, and in this thesis I shall be using only such methods. They have a number of desirable attributes, such as being resistant to noise and applicable in a wide range of situations. Bishop (1995) provides an excellent introduction to the use of statistical pattern recognition techniques in Machine Learning. I shall use a number of different such techniques in this thesis, including Hidden Markov Models (HMM), Distributional Clustering and Stochastic Context-Free Grammars (SCFG). Though statistical techniques are often thought of as recent arrivals, some of the earliest work in natural language processing was statistical in nature (Weaver, 1949; Locke & Booth, 1955).

3.3 Supervised and Unsupervised Learning

A key distinction to be made is that between supervised and unsupervised learning techniques. In supervised learning, the learning algorithm is provided with evidence about the classification of the data points. In unsupervised learning, on the other hand, the algorithm must determine the classification and structure of the data by itself. Supervised learning encompasses most of the traditional types of machine learning, and has been studied in great detail. There are however a number of motivations for using unsupervised methods. First, labelled data is hard to come by and limited in quantity, and may contain errors. Secondly, it commits you to a particular scheme of analysis that may or may not be suitable for the task in hand. Thirdly, and most importantly, with natural language data, there is very limited agreement about what the labels should be: Sampson (1995, p.4) discusses the results of a workshop in 1991 at which researchers were given sample sentences to annotate:

Only a small fraction of the full range of grammatical structuring found in the examples was agreed on by all participants.

Though it may be the case that there is in some sense an objectively correct, psychologically real, constituent structure for natural language utterances, we just don't know what it is. Of course, in this case we use unsupervised learning because we are interested in exploring how much the infant learner can learn without supervision. Thus, we have to rule out the use of supervised learning techniques except in special situations, where the supervision is provided by another part of the learning algorithm. A standard technique of unsupervised learning is clustering; data points are grouped together based on their closeness in some suitable feature space. This can take the form of a simple k-means clustering, where the number of clusters is decided in advance and which is closely related to mixture models, or a hierarchical clustering algorithm, or even a neural network-based method such as a Kohonen map (Bishop, 1995). In NLP, some of these features may themselves be the results of the analysis; that is to say the features that are used are also a product of the clustering process, such as syntactic, morphological or semantic features. This leads to some difficult problems discussed in Hofmann and Puzicha (1998).
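As a purely illustrative aside, and not the algorithm of Chapter 5, the following sketch shows the general shape of the distributional clustering just described: each word in a toy corpus is represented by counts of its left and right neighbours, and the resulting vectors are grouped with a few iterations of k-means. The corpus, the number of clusters and the Euclidean distance are all assumptions made for the example.

```python
# Toy sketch of distributional clustering with k-means; illustrative only.
import numpy as np

corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "a cat ate the fish . a dog ate the bone .").split()

vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Context vectors: counts of left neighbours followed by counts of right neighbours.
vectors = np.zeros((V, 2 * V))
for i, w in enumerate(corpus):
    if i > 0:
        vectors[index[w], index[corpus[i - 1]]] += 1
    if i < len(corpus) - 1:
        vectors[index[w], V + index[corpus[i + 1]]] += 1

def kmeans(x, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centres = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centre (squared Euclidean distance).
        dists = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre as the mean of its members.
        for j in range(k):
            if (labels == j).any():
                centres[j] = x[labels == j].mean(axis=0)
    return labels

labels = kmeans(vectors, k=4)
for j in range(4):
    print(j, [w for w in vocab if labels[index[w]] == j])
```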

3.4 Statistical Learning

There are a number of different types of statistical model that can be used in machine learning. We can divide these roughly into two classes, those of parametric and non-parametric models, with perhaps a third class of semi-parametric models that would include for example the various flavours of Neural Networks, which I shall discuss in Section 3.5. A parametric model is a model where we assume that the data is generated by, or can be approximated by, a model with a particular parametric form. So for example, if we wanted to model a data set consisting of the heights of a large number of people, one way of doing this would be to choose a parametric model, probably a normal distribution, which would have two parameters, the mean and the standard deviation. We would then choose some values for the two parameters that would make the normal distribution a good fit to the data. There are often different ways of choosing these parameters, as we shall see below. A problem with this approach is that in

some cases it is not obvious on a priori grounds what the appropriate parameterised set of models should be. In machine learning of syntax, for example, one might propose as a suitable set of models a set of Probabilistic Context Free Grammars (PCFG), and try to find suitable parameters for them. A distinction is often made between learning the structure of a model and learning the parameters of the model. So for example, with a PCFG, one might first learn a set of rules, and then estimate the parameters of that set of rules using the EM algorithm. However it is clearly possible to do it in a slightly different way. First of all, for any PCFG we can write a stochastically equivalent one in Chomsky Normal Form. 1 Therefore the structure of the model reduces to two things: first the number of non-terminals, and secondly which of the rules have non-zero probabilities. But the EM algorithm tends to converge to solutions where quite a lot of the rules have zero probability, so ultimately all we have to do is select the number of non-terminals, and let the EM algorithm do the rest. The size of the model can then be derived using some sort of model selection criterion, as will be discussed below in Section 3.8. Of course in many cases it will not be possible to run the EM algorithm with a fully connected model, and in other cases if we have prior information we can use this to constrain the search space, which often improves the performance. In the work in this thesis, I feel the distinction is of minimal importance since the algorithms are meant to operate without any external information. Non-parametric models on the other hand do not assume that the data is modelled by some antecedently specified class of models. The standard example is that of a histogram (Scott, 1992, Ch.3). Formal definitions of non-parametric estimators are hard to come by; as a working definition we can say that if the space of parameters is infinite-dimensional, or more formally if the number of parameters tends to infinity as the sample size tends to infinity, then it is a non-parametric estimator. In Machine Learning a common algorithm derived from non-parametric estimation is the k-nearest neighbour algorithm, which in many cases can give surprisingly accurate results. Mathematically, though, the arguments for these models are rather poor, as data is often sparsely distributed in high-dimensional spaces. In circumstances like this, these techniques are not guaranteed to work, because of the so-called curse of dimensionality (Scott, 1992), though in practice they perform well. In NLP, two frequently used machine learning techniques that can usefully be thought of as non-parametric are Memory-Based Learning (MBL) (Lin & Vitter, 1994; Zavrel & Daelemans, 1999; van den Bosch & Daelemans, 1999), and Data-Oriented Parsing (DOP) (Bod, 1993, 1998). DOP can also be thought of as a parametric model, as a particular type of very large stochastic tree substitution grammar that is equivalent to a PCFG (Goodman, 1996, 1998).
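Returning to the PCFG example above, the following sketch (illustrative only, and not the procedure used later in this thesis) shows how, in Chomsky Normal Form, the "structure" of the model reduces to a choice of the number of non-terminals: every possible binary and lexical rule is enumerated with random initial probabilities, and it would then be left to the EM algorithm, which is not shown here, to drive unused rules towards zero probability. The terminal alphabet and the non-terminal names are assumptions for the example.

```python
# Sketch only: a fully connected CNF grammar over a chosen number of
# non-terminals; EM (not shown) would prune rules by driving them to zero.
import itertools
import random

def fully_connected_cnf(num_nonterminals, terminals, seed=0):
    random.seed(seed)
    nts = [f"N{i}" for i in range(num_nonterminals)]
    rules = {}
    for lhs in nts:
        # Binary rules lhs -> B C and lexical rules lhs -> terminal.
        rhss = list(itertools.product(nts, nts)) + [(t,) for t in terminals]
        weights = [random.random() for _ in rhss]
        total = sum(weights)
        # The probabilities of all expansions of lhs sum to one.
        rules[lhs] = {rhs: w / total for rhs, w in zip(rhss, weights)}
    return rules

grammar = fully_connected_cnf(3, ["det", "noun", "verb", "prep"])
print(len(grammar["N0"]))  # 3*3 binary rules + 4 lexical rules = 13
for rhs, p in sorted(grammar["N0"].items())[:5]:
    print("N0 ->", " ".join(rhs), round(p, 3))
```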

3.5 Neural Networks

A particular sort of statistical model that has been used with limited success in NLP is the so-called Neural Network or, more formally, the multi-layer perceptron (MLP), also known as Parallel Distributed Processing (PDP) (Rumelhart & McClelland, 1986b). Though they are sometimes seen as being in some sense qualitatively different from other statistical models, as Bishop (1995) is at pains to point out, they are "an extension of the many conventional techniques which have

1 If the PCFG generates the empty string this cannot be done.

been developed over several decades." Although the use of neural networks has been quite widespread in the cognitive science and cognitive neuroscience fields, the results have in general been poor when compared to more conventional models in general NLP tasks (Reilly & Sharkey, 1989; Wermter, Riloff, & Scheler, 1996). Though they appear to be good at memorising arbitrary patterns, their generalisation is often quite poor. I shall discuss some models that have been successful in Chapter 6, namely some models of the acquisition of morphology including Rumelhart and McClelland (1986a) and Plunkett and Nakisa (1997). Though their neurological plausibility is tantalising, it appears that there are overwhelming technical difficulties with making them work effectively in practical, large-scale tasks, though there are signs that this is being overcome (Bengio & Vincent, 2000). There is an illuminating discussion of this in Steedman (1999). They are however extremely interesting as demonstrations that traditional assumptions about the sorts of representations and processes that operated in the human brain have perhaps been too narrow, and that the range of possibilities is much wider than has previously been thought.

3.6 Maximum Likelihood estimation

One of the most fundamental techniques in statistical modelling, which I shall use extensively, is

Maximum Likelihood (ML) estimation. If we have a model with a set of parameters Θ, and some data D, we can consider the probability of the data given the model, which we can write as $p(D; \Theta)$. We can also consider this quantity as a function of the parameters rather than the data, in which case it is called the likelihood. We then speak of the likelihood of the parameters with respect to the data, which is just the probability of the data given the model. Though ML estimation is intuitively very appealing and natural, ML estimators are not in general unbiased. 2 For example, if we estimate a set of data with a normal distribution, the ML estimate of the variance is the variance of the sample, while as every schoolboy knows, the unbiased estimator is the sample variance, which divides by $n-1$ rather than $n$. However the ML estimator does have some nice formal properties to go with its intuitive appeal. In particular, for large sample sizes, and subject to various regularity conditions (Silvey, 1975), ML estimators are nearly unbiased and nearly minimum variance, i.e. they attain the Cramér-Rao lower bound. In many natural language applications, however, the sample sizes are small, and thus attention must be paid to the limitations of this approach. In particular, if we have a large set of models, which includes an infinite sequence of models with increasing complexity, the ML estimate will often select the model that exactly memorises the data. In this case, we clearly have a poor estimate. I shall discuss this in more detail in Section 3.8.
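To make the point about bias concrete, the following standard textbook result (not specific to this thesis) spells out the ML estimates for a normal model and the division by $n-1$ in the unbiased variance estimator:

\[
\hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu}_{ML})^2,
\qquad
\mathbb{E}\!\left[\hat{\sigma}^2_{ML}\right] = \frac{n-1}{n}\,\sigma^2,
\]

whereas the unbiased estimator divides by $n-1$:

\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu}_{ML})^2 .
\]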

3.7 The Expectation-Maximisation Theorem

The Expectation-Maximisation (EM) algorithm is a general algorithm for ML estimation. The normal citation for this is Dempster, Laird, and Rubin (1977), but the algorithm had been independently

2 An estimator is unbiased if the expected value of the estimator is equal to the quantity to be estimated. Thus the mean of a sample is an unbiased estimator of the mean of the distribution, since the expectation of the sample mean is equal to the mean of the distribution.

discovered several times before. The EM algorithm is used when one has incomplete information. I use this algorithm at several points in this thesis; I will therefore give a full description of it here. Suppose we have some observations which we consider as the random variable O, and these are generated by some other random variable X. We may not know what the values of X are for each observation; they are hidden variables. We want to perform ML estimation of the parameters of the model Θ. The likelihood of the model is

\[ p(O \mid \Theta) = \sum_X p(O, X \mid \Theta) = \sum_X p(O \mid X, \Theta)\, p(X \mid \Theta) \tag{3.1} \]

We will normally be considering a sequence of such observations and the associated hidden random variables, which we can denote $O_i$ and $X_i$ respectively; the likelihood will be the product of the likelihoods for each observation. Here each observation might be a sentence or longer language unit, so that we can consider each observation to be independent. Then the total likelihood for all the data is

\[ p(O \mid \Theta) = \prod_i \sum_{X_i} p(O_i \mid X_i, \Theta)\, p(X_i \mid \Theta) \tag{3.2} \]

Because of the product it is more convenient to work with the logarithm of this, giving the log

likelihood

\[ L(O \mid \Theta) = \sum_i \log \sum_{X_i} p(O_i \mid X_i, \Theta)\, p(X_i \mid \Theta) \tag{3.3} \]

The straightforward approach to optimising this would be to differentiate this objective function with respect to each parameter, set the derivative to zero, and solve to find the optimal value of the parameter. Unfortunately this is not possible because of the presence of the logarithm of a sum. If we have an equation of the form

\[ f(x_1, \ldots, x_n) = \log \sum_i \alpha_i x_i \tag{3.4} \]

then the equations to maximise this are horrible, since

\[ \frac{\partial f}{\partial x_i} = \frac{\alpha_i}{\sum_i \alpha_i x_i} \tag{3.5} \]

so when we try to solve these subject to various constraints we get a large set of intractable simultaneous equations. Note also that the hidden variables can take many values, since they correspond, for example in an HMM, to a particular state transition sequence, of which there are exponentially many. The EM theorem shows how this can be circumvented. Suppose we have a set of parameters $\Theta^{old}$ and we want to change them so as to improve, i.e. increase, the log likelihood. We then want

to maximise the increase in log likelihood between the old and the new parameters

\[ \Delta L(\Theta^{new}, \Theta^{old}) = L(O \mid \Theta^{new}) - L(O \mid \Theta^{old}) \tag{3.6} \]

\[ \Delta L(\Theta^{new}, \Theta^{old}) = \sum_i \log p(O_i \mid \Theta^{new}) - \sum_i \log p(O_i \mid \Theta^{old}) \tag{3.7} \]

The EM theorem says we can increase this if we maximise

\[ \sum_i \sum_{X_i} p(X_i \mid O_i, \Theta^{old}) \log p(O_i, X_i \mid \Theta^{new}) \tag{3.8} \]

This is a lot easier because we have a sum of a log, rather than the log of a sum, which is much

more straightforward to optimise, since if we have an equation of the form

\[ f(x_1, \ldots, x_n) = \sum_i \log (\alpha_i x_i) \tag{3.9} \]

then

\[ \frac{\partial f}{\partial x_i} = \frac{1}{x_i} \tag{3.10} \]

Equation 3.8 has the form of an expectation: the expectation with respect to the distribution of the hidden variables given the observed variables. So intuitively, we work out what would have happened at each observation if the parameters were $\Theta^{old}$, and maximise the log of the new probability with respect to that. We therefore have an iterative algorithm where we repeatedly apply this procedure, getting a sequence of models that is guaranteed to have monotonically increasing log-likelihood. This will therefore converge, and we can stop when the log likelihood stops increasing by a significant amount. There are a number of disadvantages with the EM algorithm. First, it is notorious that the EM algorithm only converges to a local maximum; unless the likelihood surface is convex, the local maximum won't necessarily be the global maximum. Whether this is a serious problem or not depends on the particular model and data set. Secondly, the EM algorithm has a tendency to favour solutions that are combinations of many different elements, as has been observed by inter alia Abney and Light (1999):

. . . the EM algorithm estimates a mixture model and (intuitively speaking) strongly prefers mixtures containing small amounts of many solutions over mixtures that are dominated by any one solution.

While in some circumstances this might be an advantage, in many cases we want algorithms to commit to a particular analysis of the situation. For instance, when we divide words into syntactic classes we would like the majority of the words to be unambiguously assigned to a single class.
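As a self-contained illustration of the E-step and M-step just described (this toy model is not one of those used in this thesis), the following sketch runs EM on a mixture of two biased coins: each observation is the number of heads in a block of tosses of one hidden coin, the E-step computes the posterior over the hidden coin under the old parameters, and the M-step re-estimates the biases and mixing weights from the responsibility-weighted counts. The data and the starting values are arbitrary assumptions.

```python
# Toy EM for a mixture of two biased coins; illustrative only.
from math import comb

def em_two_coins(counts, tosses, iters=50, theta=(0.6, 0.4), weights=(0.5, 0.5)):
    theta, weights = list(theta), list(weights)
    for _ in range(iters):
        # E-step: posterior probability of each hidden coin for each block,
        # computed with the *old* parameters, p(X | O, theta_old).
        resp = []
        for h in counts:
            lik = [w * comb(tosses, h) * t**h * (1 - t)**(tosses - h)
                   for w, t in zip(weights, theta)]
            z = sum(lik)
            resp.append([l / z for l in lik])
        # M-step: maximise the expected complete-data log likelihood, which
        # here reduces to responsibility-weighted relative frequencies.
        for k in range(2):
            total = sum(r[k] for r in resp)
            heads = sum(r[k] * h for r, h in zip(resp, counts))
            theta[k] = heads / (total * tosses)
            weights[k] = total / len(counts)
    return theta, weights

# Blocks of 10 tosses: some blocks look fair, some look biased towards heads.
counts = [5, 9, 8, 4, 7, 5, 9, 6, 3, 8]
print(em_two_coins(counts, tosses=10))
```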

3.8 Bayesian techniques

The term Bayesian is used widely to mean a variety of different things in Machine Learning. Bernardo and Smith (1994) provide a detailed examination of the foundations of Bayesian theory. Here, I use it to refer to a range of techniques that trade off the complexity of the model against its fit to the data, in an attempt to reduce some of the problems of overtraining associated with ML estimation. There are a number of closely related techniques:

Bayesian model prior.

Minimum Description Length / Minimum Message Length.

Structural Risk Minimisation.

I shall briefly discuss the relationships between them. Bayesian techniques derive their theoretical

base from Bayes’ rule (Bayes, 1763). When we have a hypothesis H and some data D,

\[ p(H \mid D) = \frac{p(D \mid H)\, p(H)}{p(D)} \tag{3.11} \]

The term $p(H)$ is called the prior probability and represents the degree of belief in a hypothesis before any data has been examined. The term $p(H \mid D)$ represents the degree of belief in the hypothesis after seeing the data D. Thus this relationship expresses the mathematical process of learning from data in its most fundamental form. In Machine Learning we are often interested in

choosing a single model $\hat{H}$; the obvious choice is then the model that is most probable, i.e. the H that maximises $p(H \mid D)$. Looking at Equation 3.11, we see that this is

\[ \hat{H} = \arg\max_H \frac{p(D \mid H)\, p(H)}{p(D)} = \arg\max_H p(D \mid H)\, p(H) \tag{3.12} \]

since $p(D)$, the probability of the data, is constant. If we assume that $p(H)$ is constant, we can see that this reduces to Maximum Likelihood estimation. Having a non-trivial prior probability will correspond to a bias for particular models. Regardless of the theoretical motivation, in practice people choose a prior that allows the mathematics to work out cleanly: this normally takes the form of a so-called conjugate prior. This has no justification, other than the fact that it allows you to solve the equations. At this point, all of the Bayesian justification has been thrown out of the window. Moreover at times, one may wish to approximate a distribution by a set of models, for instance Hidden Markov Models (HMM), that one knows beforehand are inadequate to model the task (Stolcke, 1994). In this case, the prior probability should logically be zero, since we know beforehand that they cannot be the correct model. Minimum Description Length (MDL) techniques (Rissanen, 1978), and the closely related Minimum Message Length (MML) (Wallace & Boulton, 1968) formalism 3 are closely related to this. If we make standard assumptions about efficient codes, using the Kraft inequality, we can show that the length of the optimal code for a hypothesis or for some data is approximately equal

to the negative logarithm of the probability. If we rewrite Equation 3.12 as

\[ \hat{H} = \arg\max_H p(D \mid H)\, p(H) = \arg\min_H \left[ -\log p(D \mid H) - \log p(H) \right] = \arg\min_H \left[ L(D \mid H) + L(H) \right] \tag{3.13} \]

Hence the name Minimum Description Length. The Structural Risk Minimization (SRM) principle was introduced by Vapnik and co-workers in a series of papers (Vapnik (1998) provides a useful historical note and comparison to other techniques). It provides a more rigorous mathematical treatment of many of these issues, with accurate proofs of the convergence criteria. The Empirical Risk Minimization (ERM) principle says that the objective of statistical estimation is to minimize the empirical risk functional on some data: this is equivalent to the Maximum Likelihood approach with an appropriate risk functional, or to the least-squares method in regression where the functional is the squared error. The SRM principle then incorporates, based on a worst case analysis, a term that relates to the probability

3 The major difference between these is that MDL techniques first select a class of models, and then select the best model using a ML criterion, whereas MML techniques perform the two operations at the same time (Figueiredo & Jain, 2001).

that the model will be wrong, that is fail to generalise, which derives from the amount of training data compared to the power of the model. If the model is very powerful, and able to memorise the data exactly, then we cannot guarantee that it will generalise well. This term has the same effect of limiting the model size as the MDL principles. This summary is inevitably rather a caricature; I merely want to point out the similarities in goals and methods between these various techniques.
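A toy illustration of the two-part code in Equation 3.13, and not a method used in this thesis: two hypotheses about a string, a character unigram model and a character bigram model, are compared by the sum of a data cost L(D|H), the negative log likelihood in bits, and a crude BIC-style model cost L(H) of (1/2) log2 n bits per free parameter. The string and the costing scheme are assumptions made purely for the example.

```python
# Toy two-part code comparison; the model cost is a rough BIC-style stand-in.
from math import log2
from collections import Counter

def description_length(data, order):
    n = len(data)
    alphabet = sorted(set(data))
    if order == 0:
        counts = Counter(data)
        data_bits = -sum(c * log2(c / n) for c in counts.values())
        free_params = len(alphabet) - 1
    else:
        pairs = Counter(zip(data, data[1:]))
        context = Counter(data[:-1])
        data_bits = -sum(c * log2(c / context[a]) for (a, b), c in pairs.items())
        free_params = len(alphabet) * (len(alphabet) - 1)
    model_bits = free_params * 0.5 * log2(n)
    return data_bits + model_bits

data = "abababababababababababababab"
for order in (0, 1):
    print(order, round(description_length(data, order), 1))
# On this highly repetitive string the bigram model wins despite its extra
# parameters, because it compresses the data far better.
```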

3.9 Evaluation

Evaluation with unsupervised methods is difficult. With supervised methods, the notion of a correct answer is coherent; in unsupervised learning of natural language, since, as we saw above, there is substantial disagreement between linguists about what the correct answer is, it is not possible to choose "a right answer" to compare it against. In short, there is no gold standard that we can agree on. Accordingly, in this thesis I evaluate the results in two ways: one objective and one subjective. The objective way to measure these sorts of stochastic models is to calculate how well the model predicts unseen data, in particular what probability it assigns to it. The cross entropy

between a distribution p and another distribution q is

\[ H(p, q) = -\sum_x p(x) \log q(x) \tag{3.14} \]

This will always be greater than the entropy of p, $H(p)$, since

\[ H(p, q) = -\sum_x p(x) \log q(x) = H(p) + D(p \,\|\, q) \tag{3.15} \]

with equality only when the distributions are equal. So if we are comparing two models we can say

that the one with the lowest cross entropy is better, since it will be closer to the actual distribution.

We can estimate the cross entropy by using some unseen data $x_1, \ldots, x_n$ drawn from p:

\[ \hat{H}(p, q) = -\frac{1}{n} \sum_i \log q(x_i) \tag{3.16} \]

which is just the average negative log probability of the test data. The related measure of perplexity is just the exponential of this, which is equal to the reciprocal of the nth root of the probability of the test data. Jelinek (1997) and Manning and Schütze (1999) discuss this at some length. However this measure has some disadvantages too: notably, if a single data point is assigned zero probability, then the perplexity will be infinite, or less radically, a single highly improbable outlier can cause a rather misleading result. The subjective way I will use is merely to present for scrutiny some of the linguistic structures induced by the algorithms presented. Though rather informal, this is I feel a very good way of understanding the extent to which the algorithm agrees or disagrees with traditional linguistic analyses. In addition, where possible, I have used some more traditional styles of formal evaluation. In particular, some of the work in Chapter 6 can be evaluated by looking at its accuracy on unseen data, and I have accordingly done that. Pereira and Schabes (1992) contains some interesting data that is not discussed in the paper, which bears directly on this point. They present an algorithm for inferring context free grammars from quite small bracketed and unbracketed corpora (700

Data        Cross-Entropy   Bracketing Accuracy
Bracketed   2.97            90.36%
Raw         2.95            37.35%

Table 3.1: Comparative evaluation figures from Pereira and Schabes (1992). Note that the algorithm trained on raw data outperforms the supervised algorithm according to the cross-entropy, though it is vastly inferior according to the bracketing accuracy measure.

sentences in the training set). They presented two measures; first, the cross-entropy estimate, and secondly the bracketing accuracy. Table 3.1 summarises the results. Though the difference in cross-entropy is probably not significant, this result clearly establishes that these two modes of evaluation do not rank models similarly. This has been confirmed by my own experiments as can be seen in Table 7.1.
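For concreteness, the following sketch (with an invented unigram model and tiny corpora, not data used in this thesis) computes the estimated cross entropy of Equation 3.16 and the corresponding perplexity for some held-out text, using add-one smoothing so that no test word receives zero probability.

```python
# Toy cross-entropy and perplexity calculation on held-out data; illustrative only.
from math import log2
from collections import Counter

train = "the cat sat on the mat the dog sat on the mat".split()
test = "the cat sat on the rug".split()

# Unigram model with add-one smoothing, so that no test word gets probability 0,
# which would make the perplexity infinite (as noted above).
vocab = set(train) | set(test)
counts = Counter(train)
def prob(w):
    return (counts[w] + 1) / (len(train) + len(vocab))

cross_entropy = -sum(log2(prob(w)) for w in test) / len(test)
perplexity = 2 ** cross_entropy
print(round(cross_entropy, 3), round(perplexity, 3))
```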

3.10 Overall architecture of the language learner

In this section I will discuss the overall layout of the language learning algorithm. I have decomposed the task into particular modules, for software engineering reasons rather than because of a commitment to the modularity of mind (Fodor, 1983). I assume that the input of the algorithm is a corpus of language that has been tokenised and divided into sentences. The validity of this assumption is discussed below. The program proceeds as follows:

Categorisation using distributional information.

Learning of Morphological processes.

Learning of syntax.

These are discussed in Chapters 5, 6 and 7 respectively. Here I will just discuss the relationship between them.

3.10.1 Categorisation

Initially, the algorithm identifies a set of lexical categories based on distributional analysis. This part of the algorithm is discussed in Chapter 5. In morphologically very rich languages, such as Hungarian or Finnish, this might not be possible, because of the much lower frequency of occurrence of individual words (Kornai, 1992; Bertram, Laine, Baayen, Schreuder, & Hyönä, 2000). In these languages it would be necessary to start with a morphological learning phase, before proceeding to a categorisation phase. Note that languages with complex inflection tend to signal the morphological class of a word in the phonology, whereas in English, for example, the word stout could be a noun or an adjective or a verb from its phonology. 4

3.10.2 Morphology In English, given a set of automatically induced classes of distributionally similar words, that we hope will be close to syntactic categories, we can then try to learn a set of morphological

4It is an adjective and a noun, and there are verbs such as pout which have the same ending. Chapter 3. Language Learning Algorithms 36

relationships between the words. Note that at this point we will not know which word is the inflected form of which other word – all we will have is, for example, a set of singular nouns and a set of plural nouns. In Chapter 6 I will present algorithms for learning morphology in a supervised framework, that is to say when we know the mapping or alignment between inflected and uninflected forms, and then show how they can be extended to work in the more difficult unsupervised framework.

3.10.3 Syntax The syntax algorithm takes as input a set of sentences that are just sequences of tags. This is clearly inadequate in general since a lot of syntax depends on the idiosyncratic properties of particular words. The output of this will a simple . I will discuss this at length in Chapter 7. Marcus, Santorini, and Marcinkiewicz (1993) say

By contrast, since one of the main roles of the tagged version of the Penn Treebank is to serve as the basis for a bracketed version of the corpus, we encode a word’s syntactic function in its POS tag whenever possible.

This quote illustrates the necessity of not using manually processed tags for the input for the syntax induction algorithm. Thus I have looked at how the algorithm works when provided with the automatically derived tags from Chapter 5. I also present results of the algorithm using manual tags for comparison.

3.10.4 Interactions between the models We would hope to have synergies between the models, so that, for example, the morphology com- ponent can allow re-processing of the syntactic categories based on its analysis. At the moment, the errors accumulate because I have a very linear information flow. This is not an intrinsic prop- erty of this sort of model, but just a computational convenience. It is perfectly possible to combine all of these models together and perform the estimation of all of the different levels of language simultaneously. This would mean that information from one level could be used to resolve ambi- guities at another level. This is one of the advantages of trying to use principled statistical models: they can be combined cleanly as Miller, Stallard, Bobrow, and Schwartz (1996) discuss. Moreover the use of the EM algorithm permits this in quite a principled way. Given a product of two stochastic processes, one can train two models using the EM algorithm on both of them simply by combining the hidden variables of each process together: in effect by taking a Cartesian product of the dynamic programming trellises. Though sometimes this is not feasible, I have used this in Chapter 6 to show how one can go from supervised to unsupervised learning of morphology.

3.11 Areas of language not discussed in this thesis

In this section I shall briefly review some of the areas of language learning that I am not going to cover in this thesis, and discuss some relevant research. Chapter 3. Language Learning Algorithms 37

3.11.1 Acoustic processing The infant learner is presented not with a string of phonemes, but with a set of sounds. de Marcken (1995) presents a program for the unsupervised acquisition of a lexicon from a speech signal. For our purposes, this is not a relevant task. As is well known, animals can be trained to recognise phonemes in speech (see the references in Holt, Lotto, and Kluender (1998) for example). Since animals do not have language, we can safely assume that any abilities they do have will not be part of a language-specific innate ability. Since they do have these abilities, we can conclude that they are not a language-specific ability but rather part of a more general low-level perceptual analysis ability. Moreover, there are forms of language such as sign languages, that do not require this ability, and yet these sign languages exhibit all of the complexity of spoken natural languages (Neidle, Kegl, MacLaughlin, Bahan, & Lee, 1999). There is also some interesting empirical work on how children learn to distinguish phonemes (Saffran, Aslin, & Newport, 1996; Maye & Gerken, 2000) that suggests that some sort of statistical learning algorithm, similar to those used in this thesis, accounts very well for how children actually acquire the phonemes of a language.

3.11.2 Phonology There are important areas of phonology to be learned but they appear to be in situations where there is plenty of data and are easily learnable using similar techniques to those I employ in Chapter 6. For example the sort of learning discussed in Gildea and Jurafsky (1996) fits naturally into that framework. Moreover this is not an area of language which is commonly used in the APS.

3.11.3 Segmentation This task involves segmenting the stream of phonemes into a sequence of words. This is an area that has been studied extensively in the cognitive science community. There have been sev- eral techniques suggested. Brent and Cartwright (1997) presents a technique using contemporary methods of statistical modelling based on distributional and phonotactic constraints, following work done by Harris (1955). In languages other than English, this could be much more difficult. For example, in French, there is the phenomenon of liaison and enchainement where phonemes at the end of one word, end up being attached to the beginning of the next word in the syllable structure. However, Christophe, Dupoux, Bertoncini, and Mehler (1994) show that there appears to be phonological evidence available to resolve this problem.

3.11.4 Semantics and Discourse Here we are dealing largely with the interface between language and the real world. This clearly falls outside the aims of this thesis since we are concerned with learning purely from a set of strings of the language, not from a set of (string, meaning) pairs. That said there are two rele- vant areas of research. First, there is a lot of work on identifying topics, and other semantically flavoured aspects of text. Brown, Della Pietra, de Souza, Lai, and Mercer (1992) show that it is possible to cluster words based on their long-distance distributions, which produces seman- tic groupings. Latent Semantic Analysis (Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990) can induce semantic structure from a set of diverse documents. Similar techniques could Chapter 3. Language Learning Algorithms 38

certainly classify words into semantic classes. These techniques could be useful for identifying semantic relationships between, for example, verbs and their arguments. Of course, these ap- proaches are not really learning semantics in the traditional sense, since they are not linking the words with any references or meanings. There has, however, been some very interesting work in the area of Miniature Language Acquisition (Feldman, Lakoff, Stolcke, & Weber, 1990; Feldman, Lakoff, Bailey, Narayanan, Regier, & Stolcke, 1996). This task involves learning from a set of pairs of simple pictures, and true sentences about those pictures in a fragment of English. Chapter 4

Formal Issues Chapter 4. Formal Issues 40

4.1 Introduction

The point of this chapter is to rebut two possible arguments: the first is an a priori version of the APS that claims that in the absence of negative evidence (being given some examples of ungram- matical sentences, and being told that they are ungrammatical) it is not possible to learn which sentences are not in the language. The second argument is that the algorithms I shall present do not produce the right sort of grammar; in particular that since they produce a distribution over the set of all strings of words, a Language Model, rather than a set of strings that are grammatical, a Formal Language, that these algorithms fail to explain anything, they just summarise the data in some meta-theoretically uninteresting way. This second argument is rather vague, and I shall answer it in a rather vague way, by showing that Language Models and Formal Languages are in fact rather more similar that is generally thought. The structure of this chapter is as follows. In Section 4.2 I discuss this a priori version of the APS; and point out some of its flaws. In Section 4.3 I discuss the formal issue of learnability, in particular the concept of identification in the limit introduced by Gold (1967) and how it relates to the feasibility of the algorithms I use here. Then in Section 4.4 I introduce a related notion that of measure-one learnability, that I claim is much more relevant to the issue of learnability of natural languages. Section 4.5 presents a slight extension of a result of Horning (1969), that shows that under quite mild assumptions all interesting classes of languages are measure-one learnable. In Section 4.6 I attempt to rebut the arguments using Gold’s theorem that purport to establish that the unsupervised learning of natural language is impossible. Then, in Section 4.7 I discuss the use of statistical models in linguistics, and in Section 4.8 I discuss a formal issue relating to the choice of distributions over strings as the output, rather than sets of strings. I argue that these are formally very similar, at an appropriate level of abstraction.

4.2 The A Priori APS

The reason that the issue of negative evidence in language acquisition has been so heated is because of a widespread belief that Gold’s theorem established that without negative evidence, language learning is impossible. Though Cowie (1999) rightly rejects Gold’s theorem, she fails to reject this second argument that follows from it. I shall discuss the applicability of Gold’s theorem below; here I shall discuss some arguments, that seem to derive from the same insight, but that do not actually call upon Gold’s theorem for support. For example, Crain and Thornton (1998, p.20) say

It is conceivable that constraints could be learned by children, assuming the usual mechanisms of induction, only if the relevant kind of evidence is available. This evi- dence is called negative evidence (or negative data). Negative evidence is the presen- tation of ungrammatical sentences, marked as such. . . . If Hornstein and Lightfoot are correct in asserting that children lack access to negative evidence, then it follows that children’s knowledge about the ungrammaticality of sentences (i.e. constraints) is not learned. Hence, this knowledge is known independently of experience; presumably, it is innately specified.

or in Pinker (1995, p.153) Chapter 4. Formal Issues 41

. . . it is very important for us to know whether children get and need negative evidence, because in the absence of negative evidence, children who hypothesize a rule that generates a superset of the language will have no way of knowing that they are wrong (Gold, 1967; Pinker, 1979, 1989).

Geurts (2000) points out the flaw with great clarity:

It is obvious, I take it, that the following argument is no good: (A) Most people know that there are no three-legged animals. (B) The knowledge that there are no three-legged animals is acquired in the absence of negative evidence. (Surely none of us have ever observed that there are no three- legged animals, and most of us haven’t been told about this, either) so: (C) The knowledge that there are no three-legged animals is innate. This is a patent howler, but the remarkable thing is that the argument seems to improve if it refers to linguistic knowledge instead of knowledge of the world;

Crain and Pietroski (2001, p.166) state, with complete vacuity,

Only negative evidence, or some substitute for it, can inform learners that they have overshot the target language.

A simple example will demonstrate the flaw in this argument when applied to learning. Sup-

pose we have an alphabet of a single letter a, and the language L consists of all the strings of even

¡ ¡ ¡ ¡ length, aa ¡ aaaa . The argument is that no learning algorithm could learn this without being shown some negative examples, i.e. being told that some odd-length strings are not in L, because there is no way it could recover from an overly general hypothesis such as for example that L is

a . In fact, all learning algorithms that I am aware of would be able to do this. Many learning algorithms start from overly general models anyway and gradually shrink them; others start from very specific models and gradually expand them. I won’t discuss this sort of argument any further since it is clear that this sort of a priori argumentation has very little to do with the real behaviour of actual machine learning algorithms.

4.3 Learnability in Formal language theory

Though we are interested in the performance of these algorithms on a finite, strictly limited set of data, it is nonetheless useful to examine the limiting behaviour, as the amount of data tends to infinity. However it is important to remember that we are only interested in this as a side-issue: our primary concern remains the behaviour on finite data sets. Moreover, as I shall argue below the formal theory of learnability (Osherson, Stob, & Weinstein, 1986) has very little to do with learnability in the normal sense. One reason for this is that the motivation for the work is very different. Learnability theorists want to find interesting, formally definable classes of languages; they are not interested in criteria that are too easy or too hard. The holy grail would be to find a definition of learnability that cut the classes of languages at a linguistically interesting point: say the class of linear indexed grammars. At the moment however, the criteria are either much too strong (allowing no interesting classes to be learned) or much too weak (allowing the class of recursively enumerable languages to be learned). Chapter 4. Formal Issues 42

4.3.1 Gold The starting point for formal theories of learnability is Gold’s 1967 paper (Gold, 1967). In this paper, Gold laid the foundation for much of modern learnability theory, and proved some key negative results. Gold introduced the paradigm of identification in the limit (IIL). In this paradigm, the learning algorithm is presented with a sequence of strings, and after each string must make a guess. A language is identified in the limit for all presentations of the language, there is an n such that after n strings the language is guessed correctly. The problem here is the definition of a presentation of a language. A sequence of strings is a presentation of a language if it is composed only of strings of the language, and all of the strings in the language occur at least once in the sequence. Note that these are two separate, complementary criteria – the first says the presentation must not contain too wide a range of strings, the second says that it must not contain too narrow a range of strings. These restrictions are clearly absolutely necessary for any sort of learning to take place, but the requirement that each string only occur once in an infinite sequence is clearly very weak, and the criterion requires that the language is identified in the limit for every single one of these presentations, no matter how bizarre or misleading. Gold was then able to show that in this framework, any class of languages that included all finite languages and at least one infinite language (a suprafinite class of languages) was unlearnable. In particular, the class of regular languages and a fortiori the class of context-free languages are unlearnable. Gold also showed that other sources of information would allow more interesting classes to be identified. In particular he discussed the use of negative evidence, that is to say examples of sentences that are not in the language, as well as the use of membership queries, i.e. allowing the algorithm to inquire as to whether a string is in the language or not. Unsurprisingly, these additional sources of information allow learnability of a wider range of classes.

4.4 Measure-one Learnability

4.4.1 Horning Horning (1969), published only two years later, presented some techniques that have since been extended (Kapur, 1991; Kapur & Bilardi, 1992), that showed that a slightly different definition of presentation allows all recursively enumerable classes of recursive languages to be learnable, under very mild assumptions. I will outline the proof here, since it appears not to be widely known, and where known it seems to have been slightly misinterpreted. The presentation here is taken from Osherson et al. (1986), with a few changes to simplify the notation. I shall use a particular result from measure theory (Halmos, 1950) that may not be familiar: one of the Borel-Cantelli lemmas. This says that if the sum of the measures of a set of events is finite then the measure of the set of events that happen infinitely often is zero. In terms of probability this means that if we have a set of random variables, then the probability that we have something happen infinitely often is zero if we can show that the sum of the probabilities is finite. The other Borel-Cantelli lemma says that if the sum is infinite then the probability that it will happen infinitely often is one. So suppose we toss a fair coin infinitely often. What is the ∞ probability that we will get infinitely many heads? The sum of the probabilities is ∑ 0 ¡ 5 which i 1 is clearly infinite, so the probability of infinitely many heads is one. Suppose we toss a sequence of increasingly unfair coins, so that the probability of a head with the nth coin is 2 ¡ n. Now the Chapter 4. Formal Issues 43

∞ ¡ sum of the probabilities is ∑ 2 n 1 which is finite, so the probability of infinitely many heads i 1 is zero. A third, less obvious, example is where the probability of a head with the nth coin is 1 n. Here the sum of the probabilities is again unbounded, so we will have infinitely many heads with probability one. Let L be a recursively enumerable set of recursive languages over some finite alphabet A. So in particular, the classes of context free grammars, regular grammars and so on are suitable classes. Then if we can prove this is learnable, then any suitable subset will also be learnable. If we have a learning function φ, a function from finite sequences of strings to languages. We assume that each language L has a measure ML associated with it – i.e. a probability distribution over it. We need to have some mild constraints on this measure (see Kapur and Bilardi (1992) for details); for our purposes we can just assume that it must be computable, which is an unnecessarily strong assumption. We assume that the support of the probability distribution and the language coincide exactly. The support of the distribution is just the set of strings that have non-zero probability. We then extend this probability distribution to a probability distribution over all (sets of) se- quences over the strings. We do this by assuming that each one is independent and identically distributed (iid), so the probability of a sequence of sentences is the product of the probabilities of each sentence with respect to the measure. We start by defining the probability distribution over

the cylinder sets. Given a language L ¡ L, which has an associated probability distribution func-

¡ ¡

tion (pdf), ML, and given a sequence, sn of strings from A , of length n, x1x2 ¡ xn we can define ¢

the cylinder set of sn, which we can call C sn ¤ , as the set of all sequences that begin with sn. We can then define the probability of this cylinder set as being

n

¢ ¢ ¢

¤ ¤ ML C Sn ¤ ∏ML xi (4.1)

i 1 So we now have a measure on the set of cylinder sets. If we consider the closure of the set of cylinder sets under the operations of countable union and complement we get what is called a

σ-algebra, and we can extend this measure to the σ-algebra in the natural way (Halmos, 1950). I shall use ML rather than ML to refer to this measure without I hope causing any confusion. This is a standard technique for measure theory, similar to how one defines the measurable sets of the real line: you start by defining the measure of intervals on the line in the natural way, and then you extend it to a larger set of sets. Given this measure on the set of all sequences, which will be a pdf since the measure of the set of all sequences is 1, we define, using T to mean the set of all infinite

sequences of strings from A :

¢£¢

£ φ ¤ ¤ Definition 1 φ measure one identifies L iff ML t ¡ T identifies t 1

Definition 2 φ measure one identifies L iff φ measure one identifies all L ¡ L

Definition 3 L is measure one identifiable if there is some φ that measure one identifies L

So here we don’t require the algorithm to identify the correct language every time, because there may be pathological sequences, such as the ones used in Gold’s proof, that can mislead it. All we require is that the probability of choosing one of these weird sequences is zero. Note also that we have dispensed with the requirement that these sequences are presentations of the Chapter 4. Formal Issues 44 language – by the Borel-Cantelli lemmas, with probability one all of the strings in the language appear infinitely often, and all the strings not in the language do not appear. Thus we have replaced the two rather arbitrary constraints on presentations in Gold’s theorem with a single rather simpler

requirement: that the support of the distribution is the same as the language.

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ I will now prove that L is measure-one identifiable. Let L1 ¡ L2 Ll be a listing of all the

languages in L, and let us further suppose that we have an ordering of all of the strings in A ,

shortest first. If we have a finite sequence of strings S, and a language L we say that S agrees with

¡ L through n iff for the first n strings s, s ¡ L s S. Intuitively this means that the sequence S correctly defines all the strings at the ’beginning’ of the language.. Now we define a sort of error

set: ¡ Definition 4 AL ¡ n m is the set of all sequences such that the first m elements of the sequence do not agree with L through n.

So intuitively this is a set of sequences that will in some sense mislead you about L; the first m elements do not tell you what the elements of the language L in the first n strings are. Consider

the measure of this set, as we let m the length of the sequence we are looking at, go to infinity.

¢

¤ ¡ lim M A ¡ 0 (4.2) ∞ L L n m

m ¢ ¡ Let us prove this carefully. The set AL ¡ n m could be misleading in two ways: it could contain strings in the first n strings that are not in L, but in this case they will have probability zero, and so the set of all of the sequences that contain at least one string not in the language will have measure zero. So we just need to show that the probability it will not contain sentences that it should contain, tends to zero as the number of samples tends to infinity. Intuitively this is obvious: there are at most n strings we have to consider; as we consider longer and longer sequences, sooner or later all of these n must turn up. Let the rarest of the strings in L, that are less than n have probability p. So the probability that

a particular string s does not occur is

£¢ ¢

¢ m

¤ ¤¤£ ¤ ML t £ s does not occur in the first m of t 1 p (4.3)

¢ m 1 ¤

The probability that one of the first n strings does not occur must be less than n 1 p , so ¢

¢ m

¤¤£ ¤ ¡ ML AL ¡ n m n 1 p (4.4)

which establishes Equation 4.3. Note that we used the i.i.d assumption here.

So for a given L and n we define

¢ ¢ ¡

n

¤ ¤¦¥ ¡ dL n least m such that ML AL ¡ n m 2 (4.5)

So for any L and n we just choose an m big enough that the probability is small – intuitively ¢

dL n ¤ is the amount of data we need to see so that we are fairly sure about the membership of

1

¨ © § © § © P § A B P A P B Chapter 4. Formal Issues 45

the first n strings with respect to L. We choose the limit 2 ¡ n so when we sum them up they are

bounded, and we can apply the Borel-Cantelli lemma. Then we define

¢ ¢

¤

d n ¤ max dL n (4.6) ¡£¢ h L n given the enumeration h of the languages; so this has a diagonal feel to the argument. What we

have done here is to remove the L from the picture: we look at all of the first n languages, and

¢ ¢ ¤ choose the dL n ¤ of the “most difficult” language L. So we know that if we have d n samples we are fairly sure about the first n strings for all of the first n languages. Now for all languages L we can sum up over all n as follows

∞ ∞

¢ ¡

n

¤¤¥

¡ ¡

∑ ML AL ¡ n d n ∑ 2 1 (4.7)

n 1 n 1 Now we use the Borel-Cantelli lemma which states that if the sum of the measures of a set of events is finite then the measure of the set of events that happen infinitely often is zero. We define the set XL which intuitively is the set of all sequences where you will make infinitely many

mistakes:

¢

£ ¡ ¤

¡ ¡ ¡ ¡ ¡ XL t t AL ¡ n d n for infinitely many n limsup AL n d n (4.8)

n

¢

and by the aforementioned lemma ML XL ¤ 0 for all L.

So we can now define the learning function φ. Given the first m elements of a text, choose the

¢ £ largest n we can such that d n ¤ m. This means that we are fairly sure that we have the right set for the first n strings. Given this n, we choose the first language that agrees with the text through n. We can now show that this learning procedure does in fact give the right answer with proba- bility one. If the language we are learning is L, we will not identify it in the limit if and only if we make infinitely many mistakes; i.e. if and only if the sequence we are presented with is in the set XL; we have established that the measure of this set with respect to ML is zero; therefore with probability one we identify L in the limit. We have therefore demonstrated that L is measure one identifiable. Horning’s result is normally cited as proving that stochastic context-free grammars can be learned, and this was in fact what he proved in his thesis. As we have seen, this rather understates the generality of the result.

4.5 Extension to Ergodic Processes

The requirement that the sequence of strings is i.i.d is too strong, and somewhat unsatisfying, since in general sentences occur in sequences of related utterances. Osherson says (p. 186)

It should be noted that children’s linguistic environments do not typically exhibit stochastic independence in the foregoing sense.

Horning mentions the possibility of relaxing this as well (Horning, 1969, p.80)

Strictly speaking we do not need the existence of a stochastic presentation. A vir- tually identical proof of Theorem V.7 can be based on convergence of the information Chapter 4. Formal Issues 46

sequence rather than on stochastic presentation. In the limit, therefore, the succes- sive strings of the sample need not be independent, so long as the relative frequencies converge properly,

We can make it weaker by requiring that it is merely a stationary ergodic stochastic process.

Intuitively this is the weakest criterion we can apply such that the strong law of large numbers ¡ applies. Suppose we have a stochastic process, i.e. a set of random variables Xi where i ¡ . The

requirement that it is ergodic means that the time average will equal the space average (Rosenblatt,

¡ ¡ ¡ ¡ ¡ ¡ ¢

1974). Formally, a random process is strictly stationary if the random variables Xt Xt ¢ 1 Xt l

¡ ¡ ¡ ¡ ¡

¢ ¢ have the same joint probability distribution as the random variables Xt ¢ d Xt d l for all values

of d and l. We will suppose now we have a real-valued process. The particular process we will

be looking at uses indicator variables, so for each string s ¡ A we have a real-valued random ¢

s s

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¤ process I ¡ It which takes the value 1 if the process produces s at time t and takes the value 0 otherwise. The requirement that it is stationary merely means that though there maybe time dependence between the variables, there is not dependence on the absolute value of the time. For example a process that has new words entering and leaving the language at particular times is not a stationary process. We clearly need to exclude this possibility, because there will be a non- zero probability that a word in the language may not appear at all. A strictly stationary process is ergodic if

n ¡ 1

¤ ¢

1 ¢

£ £

k a e

¤¦¥ lim ∑ f T w ¤ E f w (4.9) ∞ w n ¢ n k 0 where T k shifts the sequence forwards by k. This means that for almost all sequences w, (i.e except on a set of measure 0), the time average equals the average over all sequences. In this case this means that the average number of times a particular process generates a particular string will tend to the probability of that string, except for a set of weird sequences that has measure zero. For example in a process generating English, we might have a sequence that just generates the word spam at every time frame. Clearly in this case, the time average of Ispam is constantly one, which is not equal to the actual probability of spam in English 2. The measure of the set of all these unusual sequences is zero.

Given a particular language with a particular distribution, an example of a non-ergodic process ¢

would be one which for every string s with probability p s ¤ , generates an infinite sequence of s. ¢

Clearly the expectation of s over all sequences is p s ¤ but the time average of each sequence will be 1 or 0. In this case, it is clear that no learning algorithm could learn the language, since with probability 1, the learner would only ever see a single string. Ergodicity is thus quite a natural requirement. In terms of the infant child’s linguistic environment, while the requirement for i.i.d. was clearly too strong, and in fact is demonstrably false, the requirement for ergodicity seems much more plausible. Though language is not stationary in any meaningful sense over the time scales that we are concerned with we can assume that the linguistic environment is stable.

All we need to do is to prove Equation 4.2, reproduced here

¢

¤ ¡ lim ML AL ¡ n m 0 (4.10)

m ¢ ∞ 265 times in the 100 million word British National Corpus. Many of these occur in a single article about spam. Note that spam is a good example of the non-stationarity of English, since it is currently used quite frequently to refer to bulk unsolicited commercial email. Chapter 4. Formal Issues 47

s Define Am to be the set of all infinite sequences that do not have the string s in the first m

places. Clearly As As so they form a decreasing sequence, and therefore by the continuity

m ¢ 1 m

theorem for decreasing sequences

¡ ¢

¢ ¢

s s

¤¤£ lim M A ¤ M M A (4.11) ∞ L m L L m

m ¢

¡ ¡ £ £ £ m 1 2 The set on the right hand side here is just the set of sequences where s never occurs. Now we appeal to the ergodicity to show that this set has measure zero, if s is in the language. Ergodicity says that the time average must be equal to the total expectation almost every-

where i.e. except on a set of measure zero. But the time average of Is of all the sequences in ¢

¥ s

¤

¡ ¡ £ £ £ i 1 2 ML Am is zero, since it never occurs. But since s is in the language it has non zero prob- ability, so this set violates the ergodicity condition, and must therefore be contained in the set of

measure zero that is the exception. Therefore it has measure zero. Therefore for all s ¡ L ¢

s lim ML Am ¤ 0 (4.12)

m ¢ ∞

s For completeness, we should also define Bm to be the set of all sequences such that s does

occur in the first m. If s is not in L then the measure of this set is zero, since it is stationary.

¡ ¦ ¡ ¦

si si

£©¨ £ ¡

AL ¡ m n Am Bm (4.13)

¡ ¢

¡ ¢

§ § si L i n si L i n and this gives us a finite sum of sequences that tend to zero, so it tends to zero, which estab- lishes this slight extension of the result. Note that all of these results are concerned with exact identification at some finite point in time. If all we are concerned with is that we tend to it in the limit, we can just estimate the language as being the empirical distribution of the sample, which will tend to the correct probability by the law of large numbers.

4.6 Arguments from Gold’s Theorem

It has sometimes been argued that Gold’s theorem has implications for the learnability of natural languages by children. This argument has been put most forcefully by Matthews (1989), and also in Pinker (1979), Wexler and Culicover (1980), and is still being cited as a proof that context-free grammars, and natural languages are not learnable (Juola, 1998; Villavicencio, 2000; van Zaanen & Adriaans, 2001). A typical statement is this (Clark, Gleitman, & Kroch, 1997):

In this form, the ”poverty of the stimulus” argument is equivalent to Gold’s fa- mous mathematical result (Gold, 1967), which Bates and Elman (Bates & Elman, 1996) misinterpret in their Perspective. Gold showed that, for even simple classes of languages, no procedure (statistical or other) exists that could learn a language without nontrivial a priori assumptions.

However, as we have seen above, the results of Gold’s theorem arise only because of an overly restrictive definition, namely requiring that they correctly identify it on all texts. The more serious Chapter 4. Formal Issues 48 objection to my mind is that we are not interested in the limiting behaviour but in the behaviour on small finite data sets. Clearly any algorithm that learns a grammar on the basis of a finite amount of data will make mistakes. There are an infinite number of possible grammars compatible with any finite data set: any algorithm that chooses one rather than another has an inductive bias, and will make an error if that choice is incorrect. Clearly infants learning language fall into this situation; therefore they have an inductive bias. Given this inductive bias, we can then define a set of possible human languages to be precisely that set which can be learnt by humans with this inductive bias. This set is learnable. A further problem concerns the notion of “the set of sentences in a language”. This is the natural way to define a formal language, but there are insuperable difficulties in defining this with a natural language. First of all, whether or not a sentence is in a natural language is an empirical matter, that can even in principle be verified for only a finite number of sentences of limited length. Secondly, even for an individual sentence determining whether it is in the language, i.e. grammatical, is fraught with methodological complexities. Even Chomsky admits (Chomsky, 1988a, p. 106)

In fact the notion language might turn out just to be a useless notion. For example, if we fix a certain level of acceptability then this internally represented system of grammar generates one set, and we say OK that is the language. If we fix the level a bit differently, the same grammar generates a different set, and we can say that is the language. There is no meaningful answer to the question: which is the real language.

It is not in general possible to make inferences about natural languages based on the properties of formal languages without a keen awareness of the differences between the two (Kornai, 1998). I think it is worth spending some time to dispose of this argument. It is difficult to find a very precise definition of the argument but I can sketch it roughly here:

Gold showed that the class of context free grammars is unlearnable.

Natural languages are in general context free. 3

Therefore natural languages are unlearnable.

Among the errors in this argument are: first, ignoring the distinction between formal lan- guages, which are well-defined mathematical objects, and natural languages which are as their name implies natural objects. Secondly, ignoring the distinction between Gold’s concept of iden- tifiability in the limit, a well-defined mathematical property, and the learnability of languages; and thirdly, ignoring the fact that Gold-style learnability is a property of classes of language, while being context-free is a property of individual languages. Bearing these errors in mind, we can try to provide a more precise characterisation of the argument. First we need to define what formal object corresponds to a particular natural language such as English. The set of grammatical sentences is a traditional definition (Chomsky, 1975a). But as Chomsky (1975a, p.129) points out:

Thus we must project the class of observed sentences to a larger, in fact, infinite class of grammatical sentences.

3Leaving aside a few cases such as Bambara and Swiss German (Savitch, 1987). Chapter 4. Formal Issues 49

But this projection is fraught with complexity as Kornai (1998) and numerous other people have pointed out. In fact even in English, there is still very substantial disagreement, not just about what the form of grammar for English is, nor even as to what the set of grammatical sentences should be, but as to what the criteria for deciding which sentences are in the language are. Second we need to define the class of natural languages. There are a few possible candidates arranged in order from smallest to largest:

Actual natural languages

Historically possible languages

Biologically possible languages

Logically possible languages

First of all we have the smallest class of languages: the class of actual natural human lan- guages, of which there are a small number, perhaps 4,000 on a conservative estimate. The exact number depends of course on the granularity one uses, the degree of closeness that particular idi- olects or dialects must be in to count as variants of the same language. As a maximally fine-grained upper bound we could say that each person that has ever lived has spoken a different language, which would give us an upper bound of some billions, but in any event a finite number. We can next consider the class of historically possible languages: these can be thought of as all languages that might have come about given the constraints of normal human life, and language evolution. The larger class of biologically possible languages includes languages that children could learn in the normal way, but that could never have arisen, perhaps because it does not satisfy some nec- essary communicative goal, or has some complex property that while learnable tends to die out very quickly. Finally we have the class of logically possible languages; we could identify this with some fairly large family of formal languages, say the class of recursively enumerable languages. We shall be concerned here with the class of biologically possible languages (BPL). The form of the argument is this: we take some natural objects which we map into formal objects, we then prove something about the formal objects and then we show that the property of the formal objects relates to some property of the natural objects. So we take the class of biologically possible languages, we map this onto a set of formal languages, we establish that the formal languages have some property (in this case the property of not being IIL), and then we attempt to show that this formal property tells us something about the natural objects. The first step in the argument from Gold’s theorem is to show that BPL contains a set of lan- guages that is unlearnable in Gold’s sense. Immediately we have a problem. The most obvious way to do this is to show that BPL is suprafinite; but it clearly does not contain all finite lan- guages. For example, there are finite languages whose smallest grammatical string is longer than the number of atoms in the universe: this is not a member of BPL. One might think that this is a fixable problem: but it is not; all variants of Gold’s proof require an infinite sequence of languages of increasing complexity, and BPL though an infinite class, is of bounded complexity in a sense that can be made precise. The issue is to reconcile the essentially finite nature of BPL, related to the finite computational capacities of the human brain, with the infinite requirements of the Gold paradigm. Note that here we are not making any assumptions about BPL other than the truism that Chapter 4. Formal Issues 50 if they can be generated by a human mind they must be of bounded complexity. In particular I am not assuming that they must form some algebraically definable class, say context-free languages with a bounded number of non-terminal symbols. Let us suppose that we have overcome this problem in some way. We then can show that this class of languages is not IIL. We now have to show that this implies something about the learnability in the natural sense. In particular we need to show that if a class of languages is not IIL then it cannot be learned. The problem is that IIL is just one criterion for learnability, and that it is very strong. 
It requires that the class is learnable from every single presentation for the language. As we have seen, under minimal assumptions we can show that it is learnable with probability one, i.e. except on a vanishingly small set of pathologically strange texts, Moreover the criterion of identifiability in the limit itself is extremely difficult to justify, whether over all texts or just with probability one. Surely it is enough to establish that the learner in some sense comes close to the target language. After all, it is not the case that all children converge to the same grammar. There is plenty of evidence that they do not (Chomsky, 1979) and that even adults disagree substantially about the acceptability of unusual and complex sentences (Schutze,¨ 1996, pp.99-107). There is a further point I would like to make here. Matthews (1989, pp 60–61) states

Context-sensitive (or context-free) languages of infinite cardinality could how- ever, be acquired on the basis of a text sample drawn from the language to be ac- quired if there were further constraints known to the learner, on either the class of context-sensitive (or context-free) languages or the ordering of the data composing the text sample. . . . rather what is important is that a learner would be able to acquire a language of infinite cardinality on the basis of a text sample only if he had (and could use) information about these constraints. In the case of human learners such information would presumably be innate. (Emphasis is the author’s)

The claim here is that constraints must be known. This is such an absurd idea that it is dif- ficult to argue against it. Any learning program I write will have limitations; some I will know beforehand some I will not. None of them will be ‘known‘ by the program. In fact some of these constraints might in fact be unknowable in quite a precise sense, just as the constraints on Tur- ing machines computational ability are unknowable: the halting problem is not decidable. The characterisation of the limits of a learning program or learning organism as knowledge is quite un- justified, but this idea is echoed by numerous other writers. The reason for this is of course that it raises the question as to where this knowledge comes from. Seen as limitations however, it is clear they do not need much explanation. All organisms have limits, and operate within those limits. Indeed the recognition of this allows a simple characterisation of the set of natural languages as the set of languages that is within the capabilities of the human learning ability. No innate knowledge required. We do not need to claim as Chomsky (1965, p.27) does that:

the child approaches the data with the presumption that they are drawn from a language of a certain antecedently well-defined type

The child merely approaches the data with a particular limited learning device, albeit one of great power and flexibility. We can thus see that there are several major problems with this argument from Gold’s theorem. Chapter 4. Formal Issues 51

Failure to identify a class of languages that satisfies the requirements of Gold’s theorem

Measure-one learnability seems a much more natural notion of learnability, but it is ignored.

We are concerned with the behaviour of the algorithms after a finite amount of time, not the limiting behaviour.

Confusion of formal with natural languages.

The “No Free Lunch” Theorems (NFL) (Wolpert & Macready, 1997) provide another set of limitations on the theoretical performance of learning algorithms, but I shall not discuss them here, as they appear to be general limits on learnability not related to languages, and thus apply equally to all cognitive domains. The arguments presented here are not intended to rule out all arguments from learnability theory, merely those from Gold’s theorem. I think learnability results can provide interesting guidance to computational linguistics, but perhaps more in the PAC-learnability paradigm, rather than the IIL paradigm.

4.7 Statistical Methods in Linguistics

As is suggested by the result above, the use of frequency information in a learning algorithm for- mally allows a larger class of grammars to be learned. In practice this appears also to be the case, as we shall see in Chapter 7. However the use of statistical methods in linguistics, in particular the use of probabilistic models rather than set based models (Sampson, 1987; Atwell, 1987), raises some foundational problems. This subject has been dealt with very thoroughly in (Abney, 1996; Gazdar, 1996; Goldsmith, 1998; Pereira, 2000); I shall merely discuss some additional matters after a brief summary of the arguments. Historically, linguists have been rather antagonistic to the use of statistics. For example, Chomsky (1966) says:

Dixon speaks freely throughout about the ‘probability of a sentence’ as though this were an empirically meaningful notion. . . . We might take ‘probability’ to be an estimate of relative frequency, . . . . This has . . . the disadvantage that almost no ‘normal’ sentence can be shown empirically to have a probability distinct from zero. That is, as the size of a real corpus . . . grows, the relative frequency of any given sentence diminishes, presumably without limit.

and Newmeyer (1983, p. 79)

However there are two specific criticisms of variable rules [i.e., rules with a probabil- ity attached] that, in my opinion, discredit them as motivated additions to grammatical theory. First, and most fundamentally, there is no sense in which such rules could be said to explain anything.

While at the time these arguments might have appeared plausible, to the modern ear they are clearly ridiculous. I think there are several important events that have taken place since then. First, what has been called the Bayesian revolution: a set of technical and philosophical changes in statistics incorporating a change in the view of probability as defined in terms of the frequency of Chapter 4. Formal Issues 52 events, to a view defined as a subjective degree of uncertainty. Secondly, the widespread availabil- ity of large corpora has made many people realise that useful information can in fact be gleaned from the observed distribution of sentences in a corpus, and that the problems of data sparseness to which Chomsky alludes in the quote above can be overcome with the appropriate techniques. Thirdly, the realisation that classical, symbolic methods of Artificial Intelligence (AI), that are sometimes called GOFAI, for Good Old Fashioned AI, are brittle and intractably complex on large-scale real world problems (Boden, 1990), and that more robust methods are necessary, that can work with noisy and occasionally incorrect data – criteria which statistical methods satisfy.

4.8 Formal Unity of Statistical Grammars and Algebraic grammars

Abney (1996) says:

Gradations of acceptability are not accommodated in algebraic grammars: a struc- ture is either grammatical or not.

This is not quite correct. It is quite true that a simple binary distinction between grammat- ical and ungrammatical sentences is too blunt an instrument to separate out varying degrees of acceptability, but algebraic models need not make such a simple distinction. The technical apparatus has been in place for some time and is discussed in depth in Kuich and Salomaa (1986), and involves an extension to the normal algebraic techniques of formal language theory (Moll, Arbib, & Kfoury, 1988, Ch.6). Broadly speaking, if we have a language, i.e. a subset

of A for some finite alphabet A, we can consider it instead as its characteristic function, a function

¢ ¢

¤ ¡ ¤ from the set of strings to the set 0 ¡ 1 . The set 0 1 can be considered a semi-ring, which I shall

define formally below, the Boolean semi-ring, , under operations of logical-and, and logical-or. If we then replace this semi-ring by a larger, more complex, semi-ring, and consider the functions

from A into this new semi-ring, we then have the base for an algebraic grammar that can be used to model these subtler distinctions. Any ring or field is also a semi-ring, so we can also consider the semi-ring of non-negative real numbers, in which case we get, with a bit of manipulation, the familiar distributions of statistical grammars.

A semi-ring is a simple algebraic object that is roughly a ring without subtraction. Formally it

¢

¡ ¡ ¡ ¡ ¡ is a tuple ¥ R 0 1 , a set R with two binary operations (addition and multiplication) , a zero and a unit, which are identity elements for addition and multiplication respectively. We require

the two operations to be associative, addition must be commutative (but multiplication need not

¢ ¢

¢ ¢ ¢ ¢

¤

be), and the two operations must be distributive a b c ¤ ab ac and a b c ab bc, and

finally multiplication by zero must be zero a0 0a 0. The equivalent of languages now are the functions from the set of strings over A (the free monoid of A) into this semi-ring R. In the literature these are discussed in terms of formal power-series, but I shall use a simpler notation. The important thing now is that we can define a semi-ring over these languages, i.e. we can define

some operations of “addition” and “multiplication” that satisfy these semi-ring axioms. Given two

¡ ¡

functions f1 ¡ f2 from A to R, we define their sum to be the function that maps each x A to

¢ ¢ ¢ ¢

¢ ¢

¤ ¤ ¤ f1 f2 ¤ x f1 x f2 x (4.14) Chapter 4. Formal Issues 53

which is pretty trivial. So in the boolean semi-ring, the sum of two languages is the union of

the two languages. For multiplication we define the Cauchy product to be the function

¢ ¢ ¢ ¢

¤ ¤ ¤

f1 f2 ¤ x ∑ f1 x1 f2 x2 (4.15)

x1 ¡ x2:x1x2 x Here for every string x we sum over all substrings that concatenate to S, and add up their products. In the Boolean semi-ring this is just the normal product of languages; a string is in the product is it is the concatenation of some string in the first language with some string in the second language. Kuich and Salomaa (1986, p.302) point out:

In considerations dealing with context-free grammars, the choice of the semi-ring

of the corresponding algebraic system reflects the point of view we want to emphasize.

¢

If we are interested only in the language L G ¤ we choose the semi-ring . The semi-

ring is chosen if we want to discuss ambiguity. Also modifications of context-free

grammars, such as weighted and probabilistic grammars, can be taken into account. ¡

For probabilistic grammars, the natural choice of semi-ring is ¢ the semi-ring of nonnegative reals.

Similar points are also made by Goodman (1998).

Consider a context-free grammar that has two productions expanding a non-terminal N, say

¢

¢ N ¢ AB and N CD. Algebraically we can write this as N AB CD. This means that the set

of strings that can be derived from N is the union of the products of A and B and of C and D. In ¡

a stochastic context free grammar, where the semi-ring is ¢ , we attach a weight to each rule, to ¢

α α ¡

form the equation N N ¡ ABAB N CDCD. If we want to count numbers of violations of a finite

¢ ¢

∞ ∞

£ ¨ ¤ ¡ ¡ ¡ ¡ set of principles we would use the semi-ring ¥ min 0 , since we want to have the minimum number of violations amongst possible derivations, and as is pointed out above we want

to use with the normal operations for counting the number of ambiguities. It is worth noting that if the semi-ring has non-zero elements that sum to zero, then the gen- erative power of the models can be much greater than expected, because selection operations can be encoded in the algebra of the ring. For example (Kuich & Salomaa, 1986, p.124), if we operate with the semiring of the integers, we can find rational power series i.e. the equivalent of regular

languages, whose support is the language ¢

i j 2

¡ ¤ ¥ ¤ a b £ i j 1and j i (4.16) which is non context-free. Typically, Chomsky had noted the possibility of degrees of acceptability quite early (Chomsky, 1964). His proposed solution is to consider a hierarchy of categories, which gives rise to a hierar- chy of utterances depending on the degree to which constraints are violated. He says (Chomsky, 1964, p.387)

By adding a refinement to the hierarchy of categories, we simply subdivide the same utterances into more degrees of grammaticalness, thus increasing the power of the grammar to mark distinctions among utterances.

Though lacking in formal detail, a hierarchy of the sort he describes is a poset, which if it sat- isfies certain closure properties is a semi-ring. Thus his approach could be naturally incorporated in this framework. Chapter 4. Formal Issues 54

However the use of ¡ for the semi-ring has a large number of advantages, including a large technical apparatus of estimation techniques, and a well-defined theoretical interpretation in terms of information theory and probability. Moreover, it has a direct empirical interpretation, as the probability, which allows the use of statistical tests to evaluate comparative theories. As noted by Alshawi (1996, p.30)

It is not, of course, necessary for the quantities of a quantitative model to be probabilities. For example, we may wish to define real-valued functions on parse trees that reflect the extent to which the trees conform to, say minimal attachment and parallelism between conjuncts. . . . Nevertheless, probability theory does offer a coherent and relatively well understood framework for selecting between uncertain alternatives, making it a natural choice for quantitative language processing. The case for probability theory is strengthened by a well-developed empirical methodology in the form of statistical parameter estimation.

The reason for going into the issue at such length is to show that though there may appear to be a great difference between the formal devices used in statistics and in formal linguistics, at an appropriate level of abstraction they are very similar. Chapter 5

Syntactic Category Induction Chapter 5. Syntactic Category Induction 56

5.1 Introduction

In this chapter I present an algorithm that induces a set of syntactic categories in English. A version of this chapter appeared as Clark (2000). There have been a number of previous approaches to this problem, some concerned with cognitive science (Finch & Chater, 1992a, 1992b), and some applied to smoothing statistical models (Brown et al., 1992; Ney, Essen, & Kneser, 1994). The advantages of the approach presented here in comparison with previous work is that it can cope with words that are ambiguous or rare, and that the techniques used here can be extended to the domain of syntax acquisition, as we shall see in Chapter 7. The rest of this chapter is organised as follows. I start by briefly discussing in Section 5.2 what syntactic or lexical categories are and whether it is necessary to use them. Then, in Section 5.3 I discuss previous work in the area of learning of syntactic categories. Section 5.4 defines the context distributions that I use in this chapter, and presents the algorithm. Then in Section 5.5, I show how this model can be applied to learning the syntactic class of ambiguous words, and Section 5.6 deals with the treatment of rare words. In Section 5.7 I present some results on an experiment with data from the British National Corpus. Section 5.8 concludes the chapter with a discussion of various limitations of the algorithm, and proposals for future work. The set of categories produced in this chapter will then be used both in Chapter 6, where they will be used as the input for a system of learning morphology, and in Chapter 7 where they will be used in the induction of a context-free grammar.

5.2 Syntactic Categories

5.2.1 Necessity of Syntactic Categories The notion of syntactic or lexical categories, such as Noun, Verb, Adjective and so on, is as old as linguistics itself, and appears first in the work of the Greek linguist Dionysius Thrax (c. 170 - c. 90 BC). All the words of a language can be divided into certain classes, called lexical categories or parts of speech such that words with the same or similar syntactic functions are in the same class. It is sometimes claimed that there is a well-defined set that is the same for all languages. This breaks down into two claims; first that the notion of the set of syntactic categories of a particular language is well-defined, and secondly that this set is identical for all human languages. Neither of these assumptions seem to be supported by the evidence. First of all, even in English, there is little agreement about the exact set of syntactic categories (Jespersen, 1924, Ch. 4–6). There is a general problem with choosing the level of granularity to describe them at: for example, the class of adverbs can be further sub-divided into sub-classes depending on the category of the word or phrase it modifies. In addition there are idiosyncratic lexical items – like the infinitive particle “to” in English – which can be shoe-horned into a class of auxiliary verbs, but which really have a unique syntactic function. Secondly, many languages have radically different systems: in many polysynthetic languages it is even difficult to find a good way of defining a word let alone a syntactic class that corresponds to the classes in European languages. It is of course possible to force all of these into the same Procrustean bed, but only at the cost of removing all empirical content from the proposition. Nonetheless, it is in my opinion necessary to use some similar idea to alleviate the problems Chapter 5. Syntactic Category Induction 57 of data sparseness. The number of different types in a corpus is too large with respect to the number of tokens to allow one to express significant generalisations except over very frequent words. Since many words are infrequent, it is necessary to group them together into classes of similar words. The work of van Zaanen (2000), which I shall discuss in Chapter 7, seems to be an exception to this, in that he tries to learn syntax without using parts of speech. He uses a very small corpus with a limited vocabulary, the ATIS corpus, which may account for his success.

5.2.2 Learning categories We are therefore looking for an algorithm that will take as input a sequence of phonemes or letters, segmented into words, and will divide these words into various classes. Since words are frequently ambiguous, we must allow words to be in more than one class. Within the context of this thesis we must ask what sort of information is available for the learning algorithm to work with. This takes place at a very early phase of language learning so we have to use superficial properties of the language stream.

Local Distributional Information We can look at the patterns of words within which each word occurs.

The form of the word itself We can look at the sequence of letters or phonemes that make up the word.

Morphological information We can look at the existence of other words that are related to it.

Frequency information and Burstiness We can look at how frequently it occurs, and how its occurrences are distributed in different discourses.

Morphological information is one of the defining characteristics of the notion of syntactic class, as McCawley (1988, p. 183) points out:

Lexical category is as much a notion of morphology as one of syntax, in that what an item is inflected for provides sufficient grounds for assigning it to a lexical category, e.g. an English word that is inflected for tense and for agreement with its subject in person and number is a verb regardless of any other facts about its behavior.

The techniques used in this chapter use just a single source of information: local distributional information. As has been noted before, this very limited source of information is sufficient to allow the algorithm to identify the various syntactic classes with high accuracy, at least in English. It has been suggested that semantics can account for learning syntactic categories – the so-called semantic bootstrapping theory (Pinker, 1996; Braine, 1988, 1992). This seems implausible for various reasons. First, there are numerous theoretical difficulties with defining syntactic categories in terms of semantics (Jespersen, 1924). Words with similar or identical semantics can be in different syntactic classes (execute, execution, executing). Secondly, the sort of semantic information available to the child must be by assumption cross-linguistically valid, but the syntactic information we want to acquire is specific to a particular language. Finally, identifying the role of closed class words is clearly impossible using this method, and must be performed using some other technique. Even if we hypothesise that these closed class categories are innate, a difficult assumption given the high cross-linguistic variability in the set of lexical categories, the infant learner is still faced with the difficulty of working out which words correspond to which classes – the so-called linkage problem. It is important to note that the fact that local distributional information is relevant can be learnt by a simple principal component analysis on the sequences of words. Thus this does not constitute a piece of domain specific knowledge – it is a directly observable property of the input signal, that words have an effect on which words follow them.

5.3 Previous Work

There has been a certain amount of previous work on using local distributional information to identify syntactic categories. Previous work can be divided into two broad categories. A number of researchers have obtained good results using pattern recognition techniques. Finch and Chater (1992a, 1992b) and Schütze (1993, 1997) use a set of features derived from the co-occurrence statistics of common words together with standard clustering and information extraction techniques. For sufficiently frequent words this method produces satisfactory results. On the other hand, Brown et al. (1992) use a very large amount of data, and a well-founded information theoretic model to induce large numbers of plausible semantic and syntactic clusters. Both approaches have two flaws: they cannot deal well with ambiguity, though Schütze addresses this issue partially, and they do not cope well with rare words. Since rare and ambiguous words are very common in natural language, these limitations are serious. I will now discuss this work in more detail. Brill (1991) presented some similar work, but with a limited evaluation.

5.3.1 Harris and Lamb Harris (1954) introduced the idea of using distributional analysis to identify syntactic classes. The first concrete application of this work that I have found is a remarkably prescient paper, Lamb (1961, p. 679), which shows this approach in full detail.

In the course of the analysis, groupings of two kinds will be made. These may be referred to as horizontal or vertical groupings, or H-groups and V-groups for short. A vertical grouping or V-group is a grouping of items (and/or sequences of items) into a distribution class or an approximation to a distribution class. An H-group or horizontal grouping is a grouping of constituents of a construction (or tentative construction) into a constitute.

But how is the machine going to make these V-groups and H-groups? Zellig Harris, in his procedure-oriented Methods in Structural Linguistics set up distribution classes of morphemes before considering horizontal groupings. To do so in a meaningful way requires that items grouped together be found in identical environments, extending several items on either side. It would be futile to attempt such an approach even with a machine because a corpus of truly collossal[sic] size would be required, and even the computer has limits with regard to the volume of data that can be processed at high speed.

This paper shows all of the key elements of this approach. Obviously at that time (1961), neither the computational power nor the linguistic data were available, and the relevant mathematical apparatus, though well-defined, was not widely known.

5.3.2 Finch and Chater Finch and Chater in a series of papers (Finch & Chater, 1992a, 1992b) showed that it was possible to induce a set of syntactic categories from unlabelled data. Their approach was to define a set of features derived from the co-occurrence statistics of frequent words. They considered the context of a word to be the two words before and after each word. They defined a set of features corresponding to the number of times various frequent context words occurred in particular positions relative to the target word that the features are being calculated for. They then calculated the similarity between words by using Spearman’s rank correlation, and clustered the results using a standard hierarchical clustering algorithm. They only performed this for the one or two thousand most frequent words, for which they get good results – their algorithm successfully identified various clusters corresponding to traditional notions of syntactic category. This approach has two weaknesses: first, it only works for words that occur very frequently, as the estimates of the features require many data points, and secondly, this model cannot cope with ambiguous words. However these are important papers because they established that local distributional evidence alone is sufficient to identify syntactic categories in English. In later work, they showed (Redington et al., 1995) that the same techniques were also applicable to the learning of syntactic categories in Chinese. Similar work has also been presented on Japanese by other researchers (Mori, Nishimura, and Ito (1997) cited in Mori and Nagao (1998)). A similar approach was also discussed in Brill (1991).
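To make the shape of this family of approaches concrete, the following sketch builds position-specific co-occurrence feature vectors for frequent words and clusters them hierarchically, using Spearman's rank correlation as the similarity measure. It is only an illustration of the general recipe, not a reconstruction of Finch and Chater's actual system; all parameter values and function names here are my own.

```python
# Illustrative sketch of a Finch & Chater style clustering: feature vectors of
# position-specific co-occurrence counts with frequent context words, compared
# by Spearman rank correlation and clustered hierarchically.  The corpus is
# assumed to be a list of sentences, each a list of tokens.
from collections import Counter
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def cooccurrence_features(corpus, n_targets=500, n_features=100):
    freq = Counter(w for sent in corpus for w in sent)
    targets = [w for w, _ in freq.most_common(n_targets)]
    features = {w: i for i, (w, _) in enumerate(freq.most_common(n_features))}
    t_index = {w: i for i, w in enumerate(targets)}
    offsets = (-2, -1, 1, 2)                     # two words on either side
    counts = np.zeros((len(targets), len(offsets) * len(features)))
    for sent in corpus:
        for i, w in enumerate(sent):
            if w not in t_index:
                continue
            for k, off in enumerate(offsets):
                j = i + off
                if 0 <= j < len(sent) and sent[j] in features:
                    counts[t_index[w], k * len(features) + features[sent[j]]] += 1
    return targets, counts

def cluster_targets(targets, counts, n_clusters=30):
    corr, _ = spearmanr(counts, axis=1)          # each row is one target word
    dist = 1.0 - np.nan_to_num(corr)             # guard against constant rows
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, n_clusters, criterion="maxclust")
    return dict(zip(targets, labels))
```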

5.3.3 Schütze Schütze (1993, 1997) is an extension of this work that is technically rather more sophisticated. Schütze’s principal innovation was the use of a Singular Value Decomposition (SVD) to alleviate the sparseness problems. The SVD is a linear algebra technique for approximating a high-dimensional space with a lower-dimensional one. This has the effect of allowing better performance with rarer words. He also evaluated the approach more carefully.

5.3.4 Brown et al. Brown et al. (1992) present a different set of techniques. They consider the problem of finding a partition of the words into classes that will maximise the likelihood of a class-based Markov model. They show that under certain assumptions this is equivalent to finding the partition that has the maximum mutual information amongst clusters. They use this technique to identify 1000 classes of words that are syntactically similar. They also show how a similar technique can be used to identify semantically similar words. Ney et al. (1994) contains, amongst much else, a very similar approach. These two approaches are designed to maximise the likelihood of a model: the plausibility of the classes produced is irrelevant for their work. There has been quite a lot of similar work in language modelling (Niesler, 1997). Grünwald (1996) presents a very similar algorithm, but using an MDL approach to control the number of clusters.

5.3.5 Pereira, Tishby and Lee Pereira, Tishby, and Lee (1993) and Pereira and Lee (1999) present a technically interesting algorithm for clustering nouns and verbs into subclasses based on their distributional contexts. They

use the Kullback-Leibler divergence (KLD) to measure the similarity between clusters, defined as

\[
D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \tag{5.1}
\]

Of course what they are actually using is the KLD between the empirical distributions, which is not at all the same thing. Estimating the KLD between two distributions by calculating the KLD between the empirical distributions (the ML estimator) is rather misguided – an error that is repeated at greater length in Lee (1999). They discuss smoothing before calculating the KLD, but this again produces a poor estimator. Wolf and Wolpert (1992, 1993) discuss the various more well-founded algorithms for performing these calculations. That said, I will now proceed to make exactly the same errors below, when I use the KLD on the smoothed empirical distributions.
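To fix notation for what follows, here is the plug-in computation of (5.1) over distributions represented as dictionaries; as just discussed, applying it to (even smoothed) empirical distributions is a poor estimator of the true divergence, but it is the quantity actually used later in this chapter. A minimal sketch:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)).

    p and q are dicts mapping outcomes to probabilities; q is assumed to
    have been smoothed so that q(x) > 0 wherever p(x) > 0, otherwise the
    divergence is infinite.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue                 # 0 log 0 is taken to be 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf          # undefined / infinite without smoothing
        total += px * math.log(px / qx)
    return total
```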

5.3.6 Dagan et al. Dagan, Marcus, and Markovitch (1993) do not classify words into classes; rather, they define a notion of similarity between words and between word-pairs, and use this to smooth information about particular word pairs by averaging them over similar pairs. This can be thought of as a sort of non-parametric clustering, which they use to perform various tasks including word-sense disambiguation.

5.3.7 Li and Abe Li and Abe (1996, 1998) use an MDL approach to cluster nouns and verbs into separate classes, to model their co-occurrence, and use the resulting model to perform structural disambiguation on data extracted from the Wall Street Journal corpus.

5.3.8 Brent Brent (1997) is in a slightly different vein. He uses a small amount of lightly-edited child-directed speech, together with a template grammar to induce some very accurate syntactic clusters. In some respects this is similar to the approach of van Zaanen (2000) in syntax. However it is not clear how well this approach will work with noisy, more complex data.

5.4 Context Distributions

All of these methods share the same basic intuition, i.e. that similar words occur in similar contexts. I formalise this in a slightly different way: each word defines a probability distribution over all contexts, namely the probability of the context given the word. Thus if we restrict ourselves to sentences, contexts can be considered to be sentences with a single gap. For a given word w we can look at the probability of a context given that the gap is filled by w. Possible contexts are, for example:

The cat □ on the mat.

The □ sat on the mat.

where we use the symbol □ to designate the gap. Intuitively the probability of the first sentence is much greater than the second if the gap is filled by a finite verb such as sat, and vice-versa if the word is cat or some other noun.

We can formalise the context as an ordered pair of strings from the vocabulary V: the string before the gap and the string after. The set of contexts C is then V* × V*. For each word w we then have a probability distribution over C which we can denote by p_w. We can then measure the similarity between the words by measuring the similarity between these distributions. The difficulty is to find a measure of similarity that can be calculated based on the meagre evidence available. A first step is to divide the set of contexts into equivalence classes so that we can consider only local context. This places some constraints on the form of the distance function we should use. So we consider two contexts to be equivalent if the words immediately to the left of the gap are equal and the words immediately to the right are equal. Here I will consider a narrow context of just one word to the left and one word to the right. So the contexts “The cat □ on the mat” and “The dog and a cat □ on the table over there.” are equal, and we can consider each equivalence class to be an ordered pair of words, namely the word before and the word after. So now we can define the context distribution to be a distribution over all ordered pairs of words. Thus the context distribution of the word cat will have a higher probability for contexts such as ⟨the, is⟩, and lower for contexts like ⟨dog, the⟩. The context distribution of a word can be estimated from the observed contexts in a corpus. We can then measure the similarity of words by the similarity of their empirical context distributions. There are a number of possible choices for a distance function. The Kullback-Leibler divergence (KLD) is a natural choice, and as we shall see below, given some simplifying assumptions, the divergence between the distributions is equal to the divergence between the distributions over the equivalence classes. Informal experiments established that other distance measures produced similar results. A drawback of using the KLD is that it is not a metric, as it is not symmetric and does not satisfy the triangle inequality: this means that some clustering algorithms will not necessarily converge (Lance & Williams, 1967). Another potential drawback is that it is undefined, or rather produces an infinite value, if there is some x such that p(x) > 0 while q(x) = 0. This is only a problem if we confuse the empirical with the true distributions, and fail to smooth. To reduce this problem, we will generally be estimating the KLD between the distribution of a particular sequence, and the distribution of some cluster, that will be estimated from rather more data. If we have n counts to estimate p, we can write p(x) = c(x)/n. In this case the closest cluster will be the maximum likelihood cluster, as we can see from rewriting the KLD thus,

\[
D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = -H(p) - \frac{1}{n} \log \prod_x q(x)^{c(x)} \tag{5.2}
\]

The first term involves only the entropy of p, and the second term is the average negative log-likelihood of the data under q. Thus, since the entropy of p is constant, we would merely select the most likely model. That noted, the solution I use here is far from correct. In general, if we use a smoothing technique based on Bayesian estimation, and we are interested in calculating a non-linear function of the distribution, such as the entropy or the mutual information, then it is incorrect to estimate this by smoothing the empirical distribution and then calculating the function of that. We have in effect two operators: one which smooths, and one which calculates the function. What we want is a smoothed estimate of the function, not the function of the smoothed distribution. The two operators do not commute; if the function is linear, such as the mean, then they do commute, and the Bayesian estimate of the mean is the same as the mean of the Bayesian estimate of the distribution. See Wolf and Wolpert (1992, 1993) for a detailed analysis and discussion.
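As a concrete illustration of the context distributions just defined, the following sketch collects, for each word, relative frequencies over ⟨left word, right word⟩ pairs from a tokenised corpus, padding sentences with a distinguished boundary marker; the Good-Turing smoothing used later in the chapter is omitted, and all names are illustrative.

```python
from collections import Counter, defaultdict

BOUNDARY = "<s>"   # illustrative sentence-boundary marker

def context_distributions(corpus):
    """Map each word to a distribution over (left, right) word pairs.

    corpus is a list of sentences, each a list of tokens.  The returned
    distributions are raw relative frequencies; in the chapter these are
    Good-Turing smoothed before any KL divergence is computed.
    """
    counts = defaultdict(Counter)
    for sent in corpus:
        padded = [BOUNDARY] + list(sent) + [BOUNDARY]
        for i in range(1, len(padded) - 1):
            counts[padded[i]][(padded[i - 1], padded[i + 1])] += 1
    dists = {}
    for w, ctx in counts.items():
        n = sum(ctx.values())
        dists[w] = {pair: c / n for pair, c in ctx.items()}
    return dists
```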

5.4.1 Clustering Unfortunately it is not possible to cluster based directly on the context distributions, for two reasons: first, the data is too sparse to estimate the context distributions adequately for any but the most frequent words, and secondly, some words which intuitively are very similar (Schütze’s example is ‘a’ and ‘an’) have radically different context distributions. Both of these problems can be overcome in the normal way by using clusters: approximate the context distribution as a probability distribution over ordered pairs of clusters multiplied by the conditional distributions of the words given the clusters:

\[
p(\langle w_1, w_2 \rangle) = p(\langle c(w_1), c(w_2) \rangle)\, p(w_1 \mid c(w_1))\, p(w_2 \mid c(w_2))
\]

where we write c(w) for the cluster that word w is in, or for brevity c_1 for c(w_1), and so on. Note that under this model, when we have conditional distributions that are the same for p and q, we can simplify the calculation of the KLD to

\[
\begin{aligned}
D(p_1 \,\|\, p_2) &= \sum_{w_1, w_2} p_1(\langle w_1, w_2 \rangle) \log \frac{p_1(\langle w_1, w_2 \rangle)}{p_2(\langle w_1, w_2 \rangle)}\\
&= \sum_{w_1, w_2} p_1(\langle c_1, c_2 \rangle)\, p(w_1 \mid c_1)\, p(w_2 \mid c_2) \log \frac{p_1(\langle c_1, c_2 \rangle)\, p(w_1 \mid c_1)\, p(w_2 \mid c_2)}{p_2(\langle c_1, c_2 \rangle)\, p(w_1 \mid c_1)\, p(w_2 \mid c_2)}\\
&= \sum_{c_1, c_2} \sum_{w_1 \in c_1} \sum_{w_2 \in c_2} p_1(\langle c_1, c_2 \rangle)\, p(w_1 \mid c_1)\, p(w_2 \mid c_2) \log \frac{p_1(\langle c_1, c_2 \rangle)}{p_2(\langle c_1, c_2 \rangle)}\\
&= \sum_{c_1, c_2} p_1(\langle c_1, c_2 \rangle) \log \frac{p_1(\langle c_1, c_2 \rangle)}{p_2(\langle c_1, c_2 \rangle)}
\end{aligned}
\tag{5.3}
\]

that is, it is just the divergence between the context distributions over the clusters.

5.4.2 Algorithm Description I use an iterative algorithm, starting with a trivial partial clustering, with each of the K clusters filled with the kth most frequent word in the corpus. At each iteration, I calculate the context distribution of each cluster, which is the weighted average of the context distributions of each word

in the cluster. The distribution is calculated with respect to the K current clusters and a further “ground” cluster of all unclassified words: each distribution therefore has $(K+1)^2$ parameters. For every word that occurs more than 50 times in the corpus, I calculate the context distribution, and then find the cluster with the lowest KL divergence from that distribution. I then sort the words by the divergence from the cluster that is closest to them, and select the best as being the members of the cluster for the next iteration. This is repeated, gradually increasing the number of words included at each iteration, until a high enough proportion has been clustered, for example 80%.

After each iteration, if the distance between two clusters falls below a threshold value, the clusters are merged, and a new cluster is formed from the most frequent unclustered word. Since there will be zeroes in the context distributions, they are smoothed using Good-Turing smoothing (Good, 1953) to avoid singularities in the KL divergence. At this point we have a preliminary clustering – no very rare words will be included, and some common words will also not be assigned, because they are ambiguous or have idiosyncratic distributional properties.
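The iteration just described can be outlined as follows, in terms of the kl_divergence and context-distribution helpers sketched earlier. The merging of near-identical clusters, the gradual increase in the clustered proportion, and the Good-Turing smoothing are elided, so this is a sketch of the control flow under those simplifications rather than the algorithm as actually run.

```python
from collections import defaultdict

def cdc_iteration(word_dists, clusters, frequencies, min_count=50, proportion=0.8):
    """One pass of the context distribution clustering loop (simplified).

    word_dists  : word -> context distribution
    clusters    : cluster id -> set of words currently assigned to it
    frequencies : word -> corpus frequency
    """
    # Cluster distribution = frequency-weighted average of member distributions.
    cluster_dists = {}
    for k, members in clusters.items():
        total = sum(frequencies[w] for w in members)
        dist = defaultdict(float)
        for w in members:
            for ctx, p in word_dists[w].items():
                dist[ctx] += frequencies[w] / total * p
        cluster_dists[k] = dict(dist)

    # Assign each sufficiently frequent word to its nearest cluster.
    scored = []
    for w, d in word_dists.items():
        if frequencies[w] < min_count:
            continue
        divs = {k: kl_divergence(d, cluster_dists[k]) for k in cluster_dists}
        best = min(divs, key=divs.get)
        scored.append((divs[best], w, best))

    # Keep only the best-scoring proportion of words for the next iteration.
    scored.sort()
    keep = scored[: int(proportion * len(scored))]
    new_clusters = {k: set() for k in clusters}
    for _, w, k in keep:
        new_clusters[k].add(w)
    return new_clusters
```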

5.5 Ambiguous Words

Ambiguity can be handled naturally within this framework. The context distribution p_w of a particular ambiguous word w can be modelled as a linear combination of the context distributions of the various clusters. We can find the mixing coefficients by minimising D(p_w ∥ ∑_i α_i^w q_i), where the α_i^w are some coefficients that sum to unity and the q_i are the context distributions of the clusters. We want to minimise

\[
D = D\Bigl(p \,\Big\|\, \sum_i \alpha_i q_i\Bigr) \tag{5.4}
\]

subject to the constraint ∑_i α_i = 1. If we proceed in the normal way using the method of Lagrange multipliers (see e.g. Bishop, 1995, Appendix C) we minimise

\[
L = D\Bigl(p \,\Big\|\, \sum_i \alpha_i q_i\Bigr) + \lambda\Bigl(\sum_i \alpha_i - 1\Bigr) \tag{5.5}
\]

Differentiating we get

\[
\frac{\partial L}{\partial \alpha_i} = -\sum_x p(x) \frac{q_i(x)}{\sum_j \alpha_j q_j(x)} + \lambda = 0 \tag{5.6}
\]

which is a polynomial of order the number of non-zero elements of p(x), which it is not practical to solve in closed form. Here, x is a variable which ranges over all $K^2$ contexts. There are a number of methods in non-linear optimisation that one can use to find a minimum – I chose the EM algorithm (Dempster et al., 1977), which gives good results, and has a simple form since this reduces to a mixture model estimation problem. We use an iterative algorithm and the update equation is:

\[
\alpha_i^{\text{new}} = \sum_x p(x)\, p(i \mid x) = \sum_x p(x)\, \frac{\alpha_i^{\text{old}} q_i(x)}{\sum_j \alpha_j^{\text{old}} q_j(x)} \tag{5.7}
\]

So in summary we weight each cluster by the posterior probability that each context came from that cluster. When the p distribution is unsmoothed, this is the same as maximising the probability of the contexts with respect to a mixture of the cluster distributions. There are often several local minima that the EM algorithm will get stuck in, since it is a hill-climbing algorithm – in practice this does not seem to be a major problem. I chose to initialise the models with a uniform distribution for the αs, which seemed to give the best results.
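The update (5.7) translates directly into code; a minimal sketch, with distributions as dictionaries over contexts and the uniform initialisation for the α used in the text (the cluster distributions q_i are assumed to have been smoothed already):

```python
def estimate_mixture(p, cluster_dists, n_iter=50):
    """EM estimate of mixing coefficients alpha minimising D(p || sum_i alpha_i q_i).

    p             : context distribution of the ambiguous word (dict)
    cluster_dists : list of cluster context distributions q_i (dicts)
    """
    k = len(cluster_dists)
    alpha = [1.0 / k] * k                       # uniform initialisation
    for _ in range(n_iter):
        new = [0.0] * k
        for x, px in p.items():
            mix = sum(alpha[i] * cluster_dists[i].get(x, 0.0) for i in range(k))
            if mix == 0.0:
                continue                        # the q_i should be smoothed to avoid this
            for i in range(k):
                # p(x) times the posterior p(i | x), as in equation (5.7)
                new[i] += px * alpha[i] * cluster_dists[i].get(x, 0.0) / mix
        total = sum(new)
        if total == 0.0:
            break
        alpha = [v / total for v in new]        # renormalise against any dropped contexts
    return alpha
```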

5.6 Rare words

The observed context distributions of rare words may be insufficient to make a definite determination of their cluster membership. In this case, under the assumption that the word is unambiguous, which is only valid for comparatively rare words, we can use Bayes's rule to calculate the posterior probability that it is in each class, using as a prior probability the distribution of rare words in each class:

\[
P(w \text{ is in cluster } i) = \frac{1}{C}\, p(\text{contexts of } w \mid w \text{ is in cluster } i)\, P(w \text{ is in cluster } i \mid w \text{ is a rare word}) \tag{5.8}
\]

where C is a normalising constant. This incorporates the fact that rare words are much more likely to be adjectives or nouns than, for example, pronouns. Dermatas and Kokkinakis (1995) and Baayen and Sproat (1996) discuss this issue. Here I use a slightly different approach based on the frequency of types rather than tokens. Figure 5.1 shows for comparison three possible priors:

\[
p_{\text{token prior}}(c_i) = \frac{\sum_{w \in c_i} n_w}{\sum_w n_w} \tag{5.9}
\]

\[
p_{\text{type prior}}(c_i) = \frac{\sum_{w \in c_i} 1}{\sum_w 1} \tag{5.10}
\]

\[
p_{k\text{-sparse prior}}(c_i) = \frac{\sum_{w \in c_i,\, n_w \le k} n_w}{\sum_{w:\, n_w \le k} n_w} \tag{5.11}
\]

where n_w is the number of times word w appears in the corpus. In this work I used the type prior. The only slight practical difference is that this prior assigns non-zero probability to all of the clusters, whereas some of the closed-class clusters have zero probability with respect to the sparse prior. Surprisingly, not all closed class words appear in the 12 million word subcorpus that these figures were calculated from: in particular the very rare pronoun “ourself” does not. Moreover I feel that the type prior is somehow more natural.
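A sketch of the rare-word calculation of (5.8) with the type prior of (5.10): the likelihood of the word's few observed contexts under each cluster's context distribution is combined with the prior and renormalised. The floor value standing in for proper smoothing, and all names, are illustrative.

```python
def rare_word_posterior(observed_contexts, cluster_dists, cluster_words):
    """Posterior over clusters for a rare word from its few observed contexts.

    observed_contexts : list of (left, right) pairs seen with the word
    cluster_dists     : cluster id -> smoothed context distribution
    cluster_words     : cluster id -> set of word types already assigned to it
    """
    n_types = sum(len(ws) for ws in cluster_words.values())
    posterior = {}
    for k, q in cluster_dists.items():
        prior = len(cluster_words[k]) / n_types          # type prior, eq. (5.10)
        likelihood = 1.0
        for ctx in observed_contexts:
            likelihood *= q.get(ctx, 1e-9)               # crude floor instead of Good-Turing
        posterior[k] = prior * likelihood
    z = sum(posterior.values())
    return {k: v / z for k, v in posterior.items()}
```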

5.7 Results

I used 12 million words of the British National Corpus as training data, and ran this algorithm with various numbers of clusters (77, 100 and 150). All of the results in this chapter are produced with 77 clusters, corresponding to the number of tags in the CLAWS tagset used to tag the BNC, plus a distinguished sentence boundary token. In each case, the clusters induced contained accurate classes corresponding to the major syntactic categories, and various subgroups of them such as prepositional verbs, first names, last names and so on. The precise clusters produced obviously depend on the number of clusters. Tables 5.1, 5.2, 5.3 and 5.4 show various sets of classes. Each line shows the five most frequent words in a particular class. I have separated the classes manually into various categories for clarity. In general, as can be seen, the clusters correspond to traditional syntactic classes. In Table 5.1 we see various classes of orthographic marks. One of these tokens is the distinguished sentence boundary token I inserted between each sentence during preprocessing; another is the ellipsis mark.

Figure 5.1: Three prior cluster probability distributions (token prior, type prior and sparse prior), plotted as probability against cluster index. The sparse prior is for k = 1. Note that the type prior and the sparse prior are almost identical.


Table 5.1: Various clusters of orthographic marks

Note that the separator marks comma, semi-colon and so on are in a different class from the terminator marks. This is because the terminator marks have a strong tendency to occur before the sentence boundary marker. Table 5.2 shows the various clusters of closed class words. There are a few errors – notably, the right bracket is classified with adverbial particles like “UP”, but it successfully separates the pronouns into nominative and accusative classes. Note that “YOU” appears in the accusative class. “WHICH” and “WHO” appear in a separate class, which is quite encouraging, since their syntactic roles are rather long-distance in effect. Table 5.3 shows various open class clusters. There are a couple of comments to make: first that each syntactic category as traditionally conceived is here split into several classes. As can be seen, the splits in the classes correspond largely to differences in the subcategorisation of the words: thus the various classes of verbs in their base form are divided into a class of verbs such as “THINK”, which often are followed by “THAT”, verbs such as “GO” that are often followed by a preposition, and so on. Secondly, some ambiguous words are at this stage assigned unambiguously to a particular class: so the word “US” is here assigned to the class of proper nouns. In addition, the words in these classes often have some morphological properties in common, even though this was not used at all in the algorithm. So we have a class of comparative adjectives “LONGER”, “BIGGER”, a class of verbs in the present participle form and so on. Table 5.4 shows two classes that are a little unusual. The first class is just a class of various rather rare words that occur in unusual distributions, and the second class is a class of words that form the first word of a very fixed pair of words: “the Nez Perce”, a tribe of Native Americans, and “the Khmer Rouge”.

To examine the behaviour of the algorithm with regard to ambiguous words, for each word w I then calculated the optimal coefficients α_i^w. Table 5.5 shows some sample ambiguous words, together with the clusters with the largest values of α_i. Each cluster is represented by the most frequent member of the cluster. Note that “US” is a proper noun cluster. As there is more than one common noun cluster, for many unambiguous nouns the optimum is a mixture of the various classes. Figures 5.2 and 5.3 each show some graphs for four ambiguous words, that show the values of α for each cluster. In Figure 5.2 the words “MAY”, “THIS”, “ROSE” and “VAN” are shown. “MAY” is ambiguous primarily between the modal auxiliary and the month of the year, “THIS” is ambiguous between the determiner (this dog . . . ) and the demonstrative pronoun (this is . . . ), and “VAN” is ambiguous between the common noun, and the honorific in surnames in



Table 5.2: Closed class words



Table 5.3: Open class words



Table 5.4: Errors. The second class here is the class of words that appear invariably as the first word in a two word compound – Nez Perce and Khmer Rouge, respectively.

Word    Clusters
ROSE    CAME CHARLES GROUP
VAN     JOHN TIME GROUP
MAY     WILL US JOHN
US      YOU US NEW
HER     THE YOU
THIS    THE IT LAST

Table 5.5: Ambiguous words. For each word, the clusters that have the highest α are shown, if α > 0.01.

Dutch. “ROSE” I have specially selected as it is an example of the worst case scenario in English: it is massively ambiguous between a variety of open-class words. It is (at least) a common noun, a colour word, a surname, a first name, and the past tense of the verb “to rise”. Nonetheless as the graph shows, this algorithm does a good job of separating out the various forms. Figures 5.4 and 5.5 show the behaviour of the algorithm for two rare words that occur ten times each in the corpus, “PRE-EMINENTLY” and “BUSINESSWOMAN”. The graphs show the posterior probability of the word being in each cluster, varying with respect to the number of occurrences of the word, from zero, when it is purely the prior, up to ten, the full number of appearances of each word in the corpus. Table 5.6 shows the accuracy of cluster assignment for rare words. For two CLAWS tags, AJ0 (adjective) and NN1 (singular common noun) that occur frequently among rare words in the corpus, I selected all of the words that occurred n times in the corpus, and at least half the time had that CLAWS tag. I then tested the accuracy of my assignment algorithm which I call CDC for Context Distribution Clustering, by marking it as correct if it assigned the word to a ‘plausible’

cluster – for AJ0, either of the clusters “NEW” or “IMPORTANT”, and for NN1, one of the clusters “TIME”, “PEOPLE”, “WORLD”, “GROUP” or “FACT”. I did this for n in {1, 2, 3, 5, 10, 20}. For comparison, I proceeded similarly for the Brown clustering algorithm (Brown et al., 1992), selecting two clusters for NN1 and four for AJ0. This can only be approximate, since the choice of acceptable clusters is rather arbitrary, and the BNC tags are not perfectly accurate, but the results are quite clear; for words that occur 5 times or less the CDC algorithm is more accurate. Note that the Brown algorithm outperforms the CDC algorithm on the AJ0 class for higher frequency words. This is probably because my choice of four Brown clusters for AJ0 was overly generous; conversely, my choice of five CDC clusters for NN1 is probably also too generous, and overstates the merits of the CDC algorithm on that class. Evaluation is in general difficult with unsupervised learning algorithms. Previous authors have relied on both informal evaluations of the plausibility of the classes produced, and more formal statistical methods. Comparison against existing tag-sets is not meaningful – one set of tags chosen by linguists would score very badly against another without this implying any fault, as there is no ‘gold standard’.


Figure 5.2: Some ambiguous words

Freq   CDC NN1   Brown NN1   CDC AJ0   Brown AJ0
1        0.66      0.21        0.77      0.41
2        0.64      0.27        0.77      0.58
3        0.68      0.36        0.82      0.73
5        0.69      0.40        0.83      0.81
10       0.72      0.50        0.92      0.94
20       0.73      0.61        0.91      0.94

Table 5.6: Accuracy of classification of rare words with tags NN1 (common noun) and AJ0 (adjective).


Figure 5.3: Some ambiguous words

Figure 5.4: A rare word, showing how the cluster membership probabilities change after each context

Figure 5.5: Another rare word, showing how the cluster membership probabilities change after each context

Test set       1     2     3     4    Mean
CLAWS        411   301   478   413     395
Brown et al. 380   252   444   369     354
CDC          372   255   427   354     346

Table 5.7: Perplexities of class tri-gram models on 4 test sets of 100,000 words, together with geometric mean.

I therefore chose to use an objective statistical measure, the perplexity of a very simple finite state model, to compare the tags generated with this clustering technique against the BNC tags, which use the CLAWS-4 tag set (Leech, Garside, & Bryant, 1994), which had 76 tags. This is by no means an ideal measure, since the perplexity does not directly relate to what I am trying to achieve here. I tagged 12 million words of BNC text with the 77 tags, assigning each word to the cluster with the highest a posteriori probability given its prior cluster distribution and its context. I then trained 2nd-order Markov models (equivalently class trigram models) on the original BNC tags, on the outputs from my algorithm (CDC), and for comparison on the output from the Brown algorithm. These models have the form of a Hidden Markov Model (HMM), where each state corresponds to a pair of classes.

We decompose the probability

\[
P(w_n \mid c_{n-1}, c_{n-2}) = P(c_n \mid c_{n-1}, c_{n-2})\, P(w_n \mid c_n) \tag{5.12}
\]

The perplexities on held-out data are shown in Table 5.7. As can be seen, the perplexity is lower with the model trained on data tagged with the new algorithm. This does not imply that the new tagset is better; it merely shows that it is capturing statistically important generalisations (I have not performed an analysis of the statistical significance of the results in Table 5.7). In absolute terms the perplexities are rather high; I deliberately chose a rather crude model without backing off and only the minimum amount of smoothing, which I felt might sharpen the contrast. The Brown algorithm is a maximum likelihood model, so its performance compares well.
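For concreteness, the evaluation model of (5.12) can be sketched as follows: a class trigram model with add-one smoothing standing in for whatever minimal smoothing was actually used, measuring perplexity per token on a held-out tagged sequence. This is an illustration of the measure, not the exact configuration reported in Table 5.7.

```python
import math
from collections import Counter

def train_class_trigram(tagged):
    """tagged is a list of (word, class) pairs for the training text."""
    trans, trans_ctx = Counter(), Counter()
    emit, emit_ctx = Counter(), Counter()
    classes, words = set(), set()
    seq = [("<s>", "<s>"), ("<s>", "<s>")] + tagged
    for i in range(2, len(seq)):
        w, c = seq[i]
        c1, c2 = seq[i - 1][1], seq[i - 2][1]
        trans[(c, c1, c2)] += 1
        trans_ctx[(c1, c2)] += 1
        emit[(w, c)] += 1
        emit_ctx[c] += 1
        classes.add(c)
        words.add(w)
    return trans, trans_ctx, emit, emit_ctx, len(classes), len(words)

def perplexity(model, tagged):
    trans, trans_ctx, emit, emit_ctx, n_c, n_w = model
    seq = [("<s>", "<s>"), ("<s>", "<s>")] + tagged
    log_prob = 0.0
    for i in range(2, len(seq)):
        w, c = seq[i]
        c1, c2 = seq[i - 1][1], seq[i - 2][1]
        # P(c | c1, c2) and P(w | c), each with add-one smoothing.
        p_trans = (trans[(c, c1, c2)] + 1) / (trans_ctx[(c1, c2)] + n_c)
        p_emit = (emit[(w, c)] + 1) / (emit_ctx[c] + n_w)
        log_prob += math.log2(p_trans * p_emit)
    return 2 ** (-log_prob / len(tagged))
```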

5.8 Discussion

I have presented an algorithm that, when given a text in English, can identify some classes that correspond well to different parts of speech, and can also identify when ambiguous words fall into more than one class. In addition this algorithm can correctly classify very rare words with quite high accuracy. I will now discuss some of the problems with this approach, and how I hope to address them in future work.

5.8.1 Limitations It is worth considering the limitations of this approach. Given arbitrarily large amounts of data, what is the degree of precision we could expect, and how fine-grained could the divisions be? We can see some of the limitations already: the algorithm produces a class that consists of nouns and verbs (the class whose most frequent members are USE, HELP, FORM, CHANGE, and SUPPORT). Though this does not correspond exactly to a distinction that is made in traditional linguistics, this is not a problem that need concern us. At the very worst it will multiply the number of phrase structure rules required – in addition to rules like NP → Det N we will also need a rule like NP → Det (N | V). This is quite representative of the errors that this approach will make. For example, if we look at the subclassifications it induces in verbs, we see that it does not identify individual subcategorisation frames, but rather separates them into sets of verbs that participate in the same diathetic alternations (Levin, 1993). This will obviously affect the way that the phrase structure rules are defined. So given a particular alternation in English, say the dative alternation,

John gave the cake to Mary

and

John gave Mary the cake

rather than saying “gave” is in some sense ambiguous between these two, and expressing this using some sort of lexical redundancy rule or other mechanism in the lexicon, we would have a single entry in the lexicon, expressing the fact that “gave” is a member of a particular class, and two phrase-structure rules that introduce the two alternations. In this case this seems not to be a serious problem; with the Noun-Verb alternation something more serious is happening. But there is a simple technique that could remove this if desired. If we look at the distribution of the Noun-Verb class, it will be a combination of the distributions of the noun class and the verb class. We can thus use exactly the same technique for ambiguous words on the ambiguous classes.

5.8.2 Independence assumptions The work of Chater and Finch can be seen as similar to the work presented here given an independence assumption. We can model the context distribution as being the product of independent distributions for each relative position; in this case the KL divergence is the sum of the divergences for each independent distribution. If p_0(u, v) = a_0(u) b_0(v), and similarly p_1(u, v) = a_1(u) b_1(v), then

\[
\begin{aligned}
D(p_0 \,\|\, p_1) &= \sum_{u,v} p_0 \log \frac{p_0}{p_1}\\
&= \sum_{u,v} a_0(u)\, b_0(v) \bigl(\log a_0(u) + \log b_0(v) - \log a_1(u) - \log b_1(v)\bigr)\\
&= \sum_u a_0(u) \bigl(\log a_0(u) - \log a_1(u)\bigr) \sum_v b_0(v) + \sum_v b_0(v) \bigl(\log b_0(v) - \log b_1(v)\bigr) \sum_u a_0(u)\\
&= D(a_0 \,\|\, a_1) + D(b_0 \,\|\, b_1)
\end{aligned}
\tag{5.13}
\]

This independence assumption is most clearly false when the word is ambiguous; this perhaps explains the poor performance of these algorithms with ambiguous words. I will discuss this in depth in the chapter dealing with syntactic induction.
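The decomposition (5.13) is easy to confirm numerically for a small example, using the kl_divergence helper sketched earlier in this chapter; the distributions below are arbitrary.

```python
# Numerical check of equation (5.13) for small hand-picked distributions.
a0, a1 = {"u1": 0.7, "u2": 0.3}, {"u1": 0.4, "u2": 0.6}
b0, b1 = {"v1": 0.2, "v2": 0.8}, {"v1": 0.5, "v2": 0.5}

# Product distributions over ordered pairs (u, v).
p0 = {(u, v): pu * pv for u, pu in a0.items() for v, pv in b0.items()}
p1 = {(u, v): pu * pv for u, pu in a1.items() for v, pv in b1.items()}

lhs = kl_divergence(p0, p1)
rhs = kl_divergence(a0, a1) + kl_divergence(b0, b1)
assert abs(lhs - rhs) < 1e-12
```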

5.8.3 Similarity with Hidden Markov Models A problem with the current approach is that I assume the words are unambiguous during the early part of the algorithm, and then at the end calculate the ambiguity. A more principled way of dealing with this would be to use the EM algorithm, treating the ambiguity as a hidden variable. This would mean effectively treating the model as a Hidden Markov Model (Murakami, Yamatomo, & Sagayama, 1993). This raises the possibility of identifying the set of syntactic categories by simply training a HMM on the data, and identifying the hidden states with the lexical categories. The problem with this is that there are two weaknesses in the training algorithm for HMMs: first, as has been widely noted, the EM algorithm used to train HMMs converges to a local, not necessarily a global, optimum, and secondly, as previously noted, the EM algorithm has a tendency to favour solutions that are combinations of many different elements. This was confirmed by some informal experiments using this approach. However there are possibilities of slightly more complex models that could alleviate this problem.

5.8.4 Hierarchical clusters It is clearly inadequate to fix a number of classes in advance on an a priori basis. As mentioned previously, several researchers (Grünwald, 1996; Li & Abe, 1998) have used MDL approaches to determine the “right” number of clusters, by trading off the size of the model against its predictive power. Others produce hierarchical clusterings or dendrograms, which together with a measure of the cohesion of the various clusters might provide a way of determining the “correct” number of categories in each language.

5.8.5 Use of orthography and morphology The new algorithm currently does not use information about the orthography of the word, an important source of information. In English, with its trivial morphology, this is not a problem, but the situation would be very different with a language like Hungarian or Finnish. As Bertram et al. (2000, p. 5) point out:

Of the 1,022,944 distinct noun types on our database [of Finnish text] . . . only 2.6% is accounted for by monomorphemic nominative singulars. More than 95% of the morphologically complex nouns appear with no more than 1 occurrence per million. Thus the bulk of Finnish words in running text is polymorphemic (fairly often even tri- or quadro-morphemic) and of very low surface frequency.

Unfortunately I do not have access to a database of Finnish at the moment, but Figure 5.6 shows some graphs for various languages taken from the European Corpus Initiative (ECI) CD-ROM. The graphs show Zipfian plots of the rank of words against frequency, for 100,000 token samples taken from a variety of languages. As can be seen, morphologically complex languages such as Czech have a much lower frequency across a wide range of words. These graphs were prepared with minimal pre-processing so they are very sensitive to the particularities of the different corpora, so the details should not be examined very closely, but it is clear that English is in some sense very ‘easy’. Clearly, a learning algorithm would have to learn morphology at the same time as, or prior to, learning the set of syntactic categories. It is this subject that the next chapter will address.

Figure 5.6: Log-log plots of word frequencies (log frequency against log rank) for various languages: Czech, English, German, Spanish and Chinese.
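Plots in the style of Figure 5.6 are straightforward to reproduce from token samples; a sketch using matplotlib, in which the sample construction, the tokenisation, and the exact logarithm base of the original figure are all left as assumptions.

```python
import math
from collections import Counter
import matplotlib.pyplot as plt

def zipf_curve(tokens):
    """Return (log10 rank, log10 frequency) for a list of tokens."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    x = [math.log10(r) for r in range(1, len(freqs) + 1)]
    y = [math.log10(f) for f in freqs]
    return x, y

def plot_samples(samples):
    # samples is assumed to map a language name to a list of ~100,000 tokens.
    for language, tokens in samples.items():
        x, y = zipf_curve(tokens)
        plt.plot(x, y, label=language)
    plt.xlabel("log rank")
    plt.ylabel("log frequency")
    plt.legend()
    plt.show()
```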

Languages with comparatively rich morphology tend to have rather free word order, which might cause problems with distributional induction techniques. However these languages tend to signal the part of speech in the surface form of words, so it would be possible to use that information to learn. What would create serious problems is a very free word order language with very limited morphology: fortunately such languages seem not to exist.

5.8.6 Multi-dimensional classes The assumption I have been making is that we can partition the set of words into a single set of classes. While more or less adequate for English, in other languages this is clearly not suitable: many languages also divide words into subclasses according to other grammatical features such as gender or number or case. There are two ways to handle this. Either we could just divide the set of words into very fine-grained classes (NOUN-SING-NEUT, for example), or we could split them into various cross-cutting partitions, one of which might be according to the category, another according to gender, and so on. If a word is a noun then that will have a definite effect on its local context; if it is plural, then in many languages this will have a slightly less local effect on the agreement features on, or choice of, nearby adjectives, verbs or determiners depending on the agreement rules. This could provide the basis for this approach.

Chapter 6

Morphology Acquisition

6.1 Introduction

In this chapter I shall discuss some algorithms for learning morphology. The main innovation here is the use of an algorithm for learning stochastic finite-state transducers based on a fairly straightforward application of the Expectation-Maximisation algorithm. This is the first time this technique has been used to learn morphology. The algorithm works well in supervised learning settings, and because of its principled probabilistic interpretation it can be adapted to learning in more difficult, less supervised frameworks, that are in my opinion more plausible as models of how children learn morphology. I shall also discuss their application to learning morphological relations between some of the part of speech clusters derived in Chapter 5. The structure of this chapter is as follows. I first discuss very briefly the current state of computational morphology in Section 6.2. I then discuss, in Section 6.3, various frameworks for learning morphology, namely supervised, partially supervised and unsupervised, and previous work in each of those frameworks, together with the particular requirements that I have for a morphology learning algorithm. I then present, in Section 6.4, the technique I will use to do this, based on the use of Pair Hidden Markov Models (PHMMs hereafter), a model that was introduced in bioinformatics, together with associated algorithms for training them and so on. Section 6.5 briefly discusses previous work relating to PHMMs as well as other applications of PHMMs, since they are a general algorithm for learning string transductions. Section 6.6 introduces the use of mixtures of these models to model morphological processes. I then present various experimental results on the supervised learning of morphological processes in three languages, English, German and Arabic, in Section 6.7. Section 6.8 deals with how PHMMs can be used to learn in partially supervised situations. I present a rather artificial situation which has a very clean mathematical solution, and then show how it can be modified to deal with a more realistic situation, and I then give in Section 6.9 some experimental results, in English and Arabic, together with some preliminary results based on the syntactic categories induced in Chapter 5. I then discuss some possibilities for future work in Section 6.10 and conclude with some general discussion in Section 6.11 about the place of this work in the overall argument of the thesis.

6.2 Computational Morphology

The progress of morphology in computational linguistics has in the past been influenced heavily by the dominance of English. The fact that English has a very simple morphological system has had two consequences: first, many researchers have been able to ignore morphology completely, and work with a fully expanded lexicon. Secondly, when researchers have worked with English, they have been able to use very trivial techniques that are not sufficiently general to work with a wide range of languages. The situation has changed greatly and there is now concern with working with systems of morphological analysis that can work with a wider range of languages, though there are still some languages whose morphology is poorly understood. Languages have traditionally been grouped into four classes according to their morphological systems (Spencer, 1991, p. 38): isolating, agglutinating, inflectional and polysynthetic. Isolating languages are those such as Vietnamese and Chinese which effectively have no morphology at all. Agglutinating languages like Hungarian have polymorphemic words, a sequence of more or less unchanging morphemes in sequence. Inflectional languages, of which Latin and Russian are good examples, have morphemes which combine several different functions such as number, person and tense. Polysynthetic languages, like many North American languages, can combine many different words together so that the words correspond almost to entire sentences. If we consider the range of morphological phenomena that occur in these various languages we can again provide various classifications. Most trivially, some processes involve adding prefixes or suffixes. Thus we have a simple concatenation, which may be combined with various slight changes to the stem, such as voicing or de-voicing, lengthening or shortening of vowels, and so on. A slightly more complicated situation is where an infix is inserted within the stem, or a circumfix, perhaps best thought of as a combination of a prefix and a suffix, placed around the stem. Next we have various changes that might be made, rather than material that is added. So for example, we have the way that some forms of the plural in German are formed by changing a vowel in the stem, the umlaut process. More radically we can have a complete change of the stem, or suppletion, as with the English go/went, though this is comparatively rare. Combinations of any of these are quite common. A much more complex process is that of reduplication (Raimy, 2000) where some phonologically specified material is duplicated. The most extreme case of this is unbounded full-stem reduplication which occurs in Malay and Indonesian where the plural, for instance, is formed by complete reduplication of the singular, thus orang (man), orangorang (men). A further class of processes, that I shall discuss more later, occur in the Semitic languages, such as Arabic and Hebrew, where morphological processes operate on a tri-consonantal root, changing the vowels in quite a radical way. Theories of morphology can themselves be grouped into various types (Hockett, 1958). I shall provide a very brief, and unavoidably rather inaccurate, description of each of these so I can place the techniques presented here into some sort of context. Item and Arrangement theories model morphology as essentially a process of joining together words out of their component parts.
Item and Process theories, on the other hand, consider the process of word formation to start from a root, that is then changed by various processes, that add elements or change the stem in some sense. Word and Paradigm models focus on the contrast between different forms of the same word, and do not try to analyse the word into constituent parts. This relates to the contrast between morphemes as things and morphemes as functions. As Spencer (1991, p.12) says:

There are two persistent metaphors which are used by linguists to conceptualise this mapping. One is to regard morphemes as things which combine with each other to produce words. In this metaphor, a morpheme is a bit like a word, only smaller, and the morphology component of a grammar is a bit like syntax in that its primary function is to stick the morphemes together. The other metaphor regards morphemes as the end product of a process or rule or operation. Here it is not the existence of the morphemes that counts but rather the system of relations or contrasts that morphemes create.

In line with general principles of parsimony, I shall not use the notion of morpheme at all. This is not devoid of empirical content. It amounts to assuming that morphemes must be intimately tied to a particular process. If there were a language, and there may well be, such that a particular morpheme, with a stable morphosyntactic function, could be used either as a prefix or a suffix, then this would be evidence for the morpheme having an existence independent of the operation of prefixation or suffixation. I shall now discuss Computational Morphology. Morphology has not been a central concern of Computational Linguistics. Partly this is to do with the excessive emphasis on English, as mentioned before, so that for many applications the morphology can simply be ignored. Space does not permit a survey of all of computational morphology – I shall merely mention a few of the key concepts and techniques. The first thing to observe is that in general finite-state techniques are adequate to model morphological processes in a very wide range of languages. Kaplan and Kay (1994) discuss this in some detail (the date on this citation is misleading, as versions of this manuscript had circulated widely beforehand). Notwithstanding certain theoretical debates about the adequacy of finite-state models (Carden, 1983; Langendoen, 1981), it appears that with the exception of unbounded reduplication in some languages (Raimy, 2000), of which Malay/Indonesian is perhaps the most widely-spoken example, regular transductions are adequate for this task. However, partial reduplication, while not formally requiring non-finite-state models, may nonetheless be more naturally handled in a slightly more powerful framework. By partial reduplication, I mean everything from gemination of consonants and lengthening of vowels to the reduplication of syllables or segmental sequences such as, to use an example from an Indo-European language, the formation of the perfect in classical Greek (grapho, gegrapha). Two-level morphology, as developed by Koskenniemi (1983b, 1983a, 1984), uses, as the name suggests, a lexical level and a surface level, without any intermediate layers. Phonological rules are expressed as constraints that operate in parallel, and can refer to both the lexical and surface level. Though this is very widely used, I have little to say about it, largely because, though powerful, it is difficult to see how it might be learnable, since it is necessary to learn not just the transductions from lexical to surface forms, but also the lexical forms themselves.

6.3 Computational Models of Learning Morphology

There has been quite a lot of interest in learning morphology, perhaps because unlike many natural language tasks, it is possible to get reasonably good results with quite simple techniques. For learning of morphology we can distinguish three learning frameworks, according to the amount of supervision used, namely supervised, unsupervised and partially supervised learning.

6.3.1 Supervised The most common form of learning, and the easiest, is supervised learning, where the learning algorithm is presented with pairs of strings of symbols. Most of the time this has involved surface to surface transductions, i.e. between inflected and uninflected forms. The earliest work I have come across of this type is Golding and Thompson (1985), who present an algorithm, LEXICRUNCH, that takes pairs of inflected and uninflected words for particular morphological processes in English, Finnish and French, and learns a set of simple re-writing rules. Much of the other work focusses on the formation of the English past tense, starting with the seminal paper of Rumelhart and McClelland (1986a), who present a neural network that can learn

from suitably encoded representations of the base and past form. This gave rise to a very active debate about the relative merits of symbolic and connectionist models for language processing (Pinker & Prince, 1988). Various further models have been proposed that improve on the original connectionist model (Plunkett & Marchman, 1990; MacWhinney & Leinbach, 1991), and various models have been proposed that learn in a symbolic way using decision trees (Ling, 1994) or variants of Inductive Logic Programming (Mooney & Califf, 1995). However the triviality of the English past tense makes it an unsuitable test bed for the comparison of different models. There have been a few models that learn other languages, notably German (Westermann & Goebel, 1995) and Arabic (Plunkett & Nakisa, 1997), and Manning (1998) presents a symbolic learning model that he tests on Anmajere, an Australian language, rightly criticizing the reliance on English of previous work. He also notes the problem that Semitic languages pose for these approaches. van den Bosch and Daelemans (1999) present a slightly unusual morphological segmentation algorithm, using Memory-Based Learning (MBL) to segment words into morphemes. Theron and Cloete (1997) present an algorithm to learn sets of two-level rules in a variety of languages: English, Xhosa and Afrikaans. They provide an algorithm for segmenting surface forms into morphemes, and aligning them with lexical forms using Levenshtein edit distance, which has some relation to the work presented here. Daelemans, Berck, and Gillis (1997) present an algorithm for learning classifications of Dutch diminutives, and there have also been various other techniques for applying other non-parametric techniques such as analogical reasoning (AML) (Derwing & Skousen, 1994).

6.3.2 Unsupervised Learning

Unsupervised learning is when the algorithm is presented merely with a single set of words, and must work out what the morphological relationships are. There have been a number of different approaches to this task, many of them sharing the same limitation, namely an assumption that the morphological processes are exclusively suffixational, and limited also to words with regular inflections. As Goldsmith (2001) says:

It is not difficult to construct algorithms to produce an initial morphological analysis of a corpus of languages whose morphology is as simple as those of Indo-European languages.

Déjean (1998) is a good example of this work: he essentially generates various stemming rules in a variety of languages. Gaussier (1999) presents an algorithm that learns derivational morphology when provided with an inflectional lexicon, assuming that suffixation is the relevant process. A less arbitrary solution is presented by Goldsmith (2000, 2001), who uses an MDL framework to induce a set of endings in various languages. Schone and Jurafsky (2000) present an interesting innovation: rather than looking solely at the forms of the words, they also filter the results based on an analysis of the semantics of the words, using an SVD in the form of latent semantic analysis to limit the results. Snover and Brent (2001) is another attempt in the MDL framework, with results in English and French. All of these techniques assume that the problem is to segment words into sequences of morphemes. This is, however, not a cross-linguistically valid assumption, and I therefore decided not to pursue this line of reasoning.

6.3.3 Partially supervised

In addition to these two classes, which are generally recognised, I will add a third one, of intermediate difficulty: that of partially supervised algorithms. Here the algorithm is presented with a pair of sets of words, and must find a partial matching between the two sets, as well as a transduction that models that alignment. My reasons for introducing this class are several. First, I think learning in a completely unsupervised setting is not necessary for a complete unsupervised language learning algorithm. As we saw in Chapter 5, it is possible to induce a set of syntactic classes from a text without using information about morphology. Thus, we can always induce a set of syntactic classes, and then use a partially supervised algorithm to learn the morphological relationships between them, as I shall demonstrate below in Section 6.9. Secondly, I think completely unsupervised learning is perhaps too difficult to do in general, i.e. with languages that have a much more complex morphological system than English. Partial supervision seems to give enough information for the algorithm to learn, while still being cognitively plausible in a way that completely supervised learning is not. The justification for using supervised learning is that the infant child will already have worked out the alignment between the inflected and uninflected forms using semantic information. Though semantic information is a useful aid, as Schone and Jurafsky (2000) demonstrate, it is not, I think, sufficient to generate an alignment by itself, even in English. In more complex languages, the token frequencies of the individual words may be so low that semantic information cannot be extracted until after a certain amount of morphological analysis has taken place. Minimally supervised learning (Yarowsky & Wicentowski, 2000), which is quite similar, uses a variety of different sources of information to learn both regular and irregular inflections. As they say:

But for many languages, and to a quite practical degree, inflectional morphological analysis and generation can be viewed primarily as an alignment task on a broad coverage word list.

In addition they use a weighted edit distance to help in the alignment. This is a topic I will return to at greater length later on.

6.3.4 Desiderata

Given the overall aim of this thesis, we can sketch out the desirable properties of our morphology learning algorithm.

It must work at least in partially supervised settings.

It must be robust in the presence of noise.

It must have a principled probabilistic interpretation, so we can combine it with other parts of a system.

It must work on a wide range of languages, and must work without making assumptions about the form of the morphology.

My approach is thus to start with an algorithm for learning stochastic finite state transducers in a supervised setting, and then show how it can be extended to work in a partially supervised setting. An important point to note at the outset is that in many languages, even in English, the morphological processes that a word undergoes are not specified purely phonologically, but in general are specified lexically. Thus in the formation of the English past tense, just because we know that the phonology of the stem of a verb is, for example, ring, or more strictly sounds like ring, this does not tell us what the past tense is. It could be rang, wrung, or ringed, depending on which particular word it is, and there are numerous other examples in many other languages that confirm this point. Thus modelling the morphological processes simply as a single surface string to surface string transduction is inadequate. This was one of the criticisms levelled against the Rumelhart and McClelland (1986a) model by Pinker and Prince (1988).

6.4 Pair Hidden Markov Models

The formal device that I shall use to learn string transductions is the Pair Hidden Markov Model. As its name suggests, it is a variant or extension of the Hidden Markov Model (HMM). HMMs have been used widely in the speech recognition community for many years (Rabiner, 1989), and as a result both their formal properties and their behaviour in practical applications are well understood. Pair Hidden Markov Models were first introduced by Durbin, Eddy, Krogh, and Mitchison (1998) in the field of bioinformatics, who present them as a generalisation of an edit distance. With the growth of biotechnology and the ability to sequence DNA, which culminated in the completion of the Human Genome Project (IHGMC, 2001), there has been a vast amount of data, primarily in the form of DNA sequences, to be analysed. This has driven the very rapid development of the discipline called bioinformatics or computational biology, which has adopted wholesale many of the techniques of statistical modelling originally used for speech recognition and computational linguistics. Some of these techniques seem to be better suited to their new role than they were to the task they were originally designed for (Gusfield, 1997). This work perhaps returns the favour – stealing some ideas back from bioinformatics to apply to computational linguistics.

6.4.1 Definition

I shall start by defining a normal HMM, albeit not fully general, and then make the changes needed to convert it to a PHMM.

Let us consider a stochastic process, i.e. a sequence of random variables X_t, that at any given moment is in one of a finite number of k states s_0, ..., s_{k-1}. At each time step, it can change state according to a transition function that depends only on the state. The process has a defined start state, s_0, and a defined end state, s_1. At time step 0, the process is in state s_0 (p(X_0 = s_0) = 1), and when it moves to state s_1 the process terminates. This is a point of difference with the traditional definition of HMMs, which tend to be used without a terminating state to model indefinitely long sequences of symbols; when they have a defined end state, they are sometimes called anchored HMMs. This process satisfies a Markov independence property, namely that the probability that the process changes from state i to state j depends only on the state i and not on any previous states, so we can write p(X_{t+1} = s_j | X_t = s_i). At each state the process outputs a symbol from a finite alphabet A. Again we assume that the symbol depends only on the current state. We can denote the symbol output at time t by another random variable O_t, and we can write p(O_t = a | X_t = s_i). We can consider these two probabilities, the transition probabilities and the output probabilities, as the parameters of the model, which we write as p(s_j | s_i), the probability of going from state i to state j, and q(a | s_i), the probability of outputting a from state i. This will define a probability distribution over A*, the set of strings from A. (As defined, for some values of the parameters there may be a non-zero probability of the process never terminating; in practice, given the initialisation I use and the training regime, this never happens.)

With PHMMs, the situation is exactly the same except that instead of outputting a single stream of symbols, the model outputs two streams of symbols. At each state it can emit symbols on one or both of two streams, or tapes in the Turing machine idiom. The more general way to formalise this is to suppose we have two alphabets, A and B, one associated with each tape. Each state can then output either a symbol from A on the left tape only, or a symbol from B on the right tape only, or simultaneously output a symbol from A on the left tape and a symbol from B on the right tape. These three sorts of output I call q10, q01 and q11 outputs respectively, and we denote the probability of each output by q10(a | s) and so on. We could also allow null outputs, where nothing is emitted on either tape, a q00 output, but this causes mathematical complications, and I shall ignore this possibility. Clearly for each state s_i these output probabilities must sum to one, so we require

\sum_{a \in A} q_{10}(a \mid s_i) + \sum_{b \in B} q_{01}(b \mid s_i) + \sum_{a \in A} \sum_{b \in B} q_{11}(a, b \mid s_i) = 1     (6.1)

I will assume that we have a distinguished start and end symbol, and that the initial and final states output this symbol with probability 1.

For the applications presented in this chapter, I will assume that we have only one alphabet, that is A = B, which I shall denote by A. In addition I will limit the q11 outputs so that they are zero unless the two symbols are the same; that is to say, the q11 outputs always produce the same symbol on the left and right streams. This reduces the number of parameters of the model, and gives the algorithm a bias towards producing similar strings on the left and right. I will then write q11(a | s) for q11(a, a | s). This is probably not necessary for the supervised algorithm, but becomes necessary for the unsupervised algorithms. As van Noord and Gerdemann (2001) say:

Expressing identity between input and output is crucial: This notion of identity can be seen as a consequence of the linguistic principle of Faithfulness: corresponding input and output segments tend to be identical.

This model will then define a probability distribution over pairs of strings. Where u, v ∈ A*, we will write p_m(u, v) to denote the probability of the pair of strings being produced by a model m. This probability will be the sum of the probabilities of all transition sequences that produce that pair of strings. Since we are interested in stochastic string transductions, we are in the end interested in the conditional probability, not the joint probability. I will write p_m^L(u) for the probability that the model produces the string u on the left tape, and p_m^R(v) for the probability that it produces v on the right tape. Clearly,

p_m^L(u) = \sum_{v \in A^*} p_m(u, v)     (6.2)

The conditional probability p_m(v | u) is thus

p_m(v \mid u) = \frac{p_m(u, v)}{p_m^L(u)}     (6.3)

Figure 6.1: Example of a PHMM that adds an s to the end of any non-empty string.

Figure 6.2: Example of a PHMM that changes ring to rang.

The decision to have a model for the joint rather than the conditional probability has a number of implications. Firstly, the joint distribution is clearly more informative: it is easy to recover the conditional probability from the joint but not vice versa. Secondly, it is easy to combine models with joint probability but very difficult, in a principled way, with conditional probabilities. On the other hand, we are ultimately interested in the conditional probability and this can cause some mild computational difficulties, as discussed further below.
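To make the parameterisation concrete, here is a minimal Python sketch of a container for the parameters of a PHMM of the restricted kind used here (single alphabet, identity-only q11 outputs). It is purely illustrative: the class name, data layout and check are my own and not part of the implementation described in this thesis; later sketches in this chapter assume the same layout.

    from collections import defaultdict

    class PHMM:
        """Parameters of a PHMM over a single alphabet, with q11 outputs
        restricted to emit the same symbol on both tapes.  States are
        integers; by convention 0 is the start state and 1 is the end state."""

        def __init__(self, n_states, alphabet):
            self.n = n_states
            self.alphabet = list(alphabet)
            self.trans = defaultdict(dict)   # trans[s][t] = p(t | s)
            self.q10 = defaultdict(dict)     # q10[s][a]: left-tape-only output at s
            self.q01 = defaultdict(dict)     # q01[s][a]: right-tape-only output at s
            self.q11 = defaultdict(dict)     # q11[s][a]: same symbol a on both tapes

        def check(self, tol=1e-9):
            """Check that outgoing transitions sum to one, and that the three
            kinds of output at each state sum to one (equation 6.1)."""
            for s in list(self.trans):
                t = sum(self.trans[s].values())
                assert abs(t - 1.0) < tol, f"transitions at state {s} sum to {t}"
            for s in set(self.q10) | set(self.q01) | set(self.q11):
                out = (sum(self.q10[s].values()) + sum(self.q01[s].values())
                       + sum(self.q11[s].values()))
                assert abs(out - 1.0) < tol, f"outputs at state {s} sum to {out}"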

6.4.2 Examples

I will now give some simple examples to illustrate the functioning and power of these models. Figure 6.1 shows a small PHMM with four states that appends an s to any non-empty string. More formally, we can say that the joint probability distribution is non-zero if and only if the right string is the concatenation of the left string with s. In this case, the conditional probability will be 1. Figure 6.2 is a much more specific model: it generates a particular pair of strings, namely (ring, rang), with probability 1. This illustrates the use of the joint probability: we can model either very general transductions that take an arbitrary string as input, or alternatively a very specific transduction that takes only a single string, and obviously we can also have transductions that select a phonologically specified set of strings as input. This will be useful for modelling morphology, as we shall model a particular morphological process as a mixture of general transductions that model the regular words and more specific transductions that model the irregular words. Figure 6.3 illustrates a highly non-concatenative transduction, which occurs as one of the forms of the Arabic broken plural. This maps strings of the form CaCC, where C stands for any consonant, to CuCuuC, where the corresponding consonants are the same; thus bank will be mapped to bunuuk.

6.4.3 Algorithms

Just as with the normal HMM, PHMMs have efficient polynomial algorithms for the “three fundamental questions”:

Figure 6.3: Example of a PHMM that changes a word with template CaCC to CuCuuC. The variable c in the diagram ranges over the set of consonants C.

Calculating the probability of an observation, i.e. a pair of strings

Finding the most likely state sequence given an observation

Choosing parameters that maximise the likelihood of some training data

In addition we will want to be able to

Find the most likely output given an input.

Find the probability of a string being produced on the left.

In general, an observation (u, v) could be generated by exponentially many state sequences. Thus to calculate the probability of that observation one needs to add up the probabilities of all of the sequences of transitions that might have produced it. Just as with simple HMMs, we can perform this computation in polynomial time by using dynamic programming techniques to avoid repetitive calculation. With simple HMMs, the dynamic programming data structure is called a trellis, which stores certain probabilities that are called the forward and backward probabilities. I refer the reader to the discussion in Manning and Schütze (1999, pp. 325–339).

With PHMMs the process is more complicated. We define the forward and backward probabilities as follows. Given two strings u_1 ... u_l and v_1 ... v_m, we define the forward probability α_s(i, j) as the probability that the model starts from s_0, outputs u_1 ... u_i on the left stream and v_1 ... v_j on the right stream, and is in state s; and the backward probability β_s(i, j) as the probability that, starting from state s, it outputs u_{i+1} ... u_l on the left and v_{j+1} ... v_m on the right and then terminates, i.e. ends in state s_1.

We can calculate these using the following recurrence relations:

\alpha_s(i, j) = \sum_{s'} \alpha_{s'}(i, j-1)\, p(s \mid s')\, q_{01}(v_j \mid s)
              + \sum_{s'} \alpha_{s'}(i-1, j)\, p(s \mid s')\, q_{10}(u_i \mid s)
              + \sum_{s'} \alpha_{s'}(i-1, j-1)\, p(s \mid s')\, q_{11}(u_i, v_j \mid s)     (6.4)

We start the recursion at (0, 0) by noting that \alpha_{s_0}(0, 0) = 1 and \alpha_s(0, 0) = 0 otherwise.
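As an illustration of the recurrence in equation 6.4, here is a small Python sketch that fills in the forward table and returns the joint probability p(u, v) of a pair of strings. It assumes the dictionary layout of the earlier sketch, attaches outputs to the destination state, and treats the final move into the end state as emitting nothing, a simplification of the distinguished end symbol described above. It is a sketch for exposition, not the implementation used for the experiments.

    def joint_probability(u, v, n_states, trans, q10, q01, q11, end_state=1):
        """Forward algorithm for a PHMM (equation 6.4).
        trans[s][t]: p(t | s); q10[t][a], q01[t][a], q11[t][a]: output
        probabilities at the destination state t.  Returns p(u, v)."""
        l, m = len(u), len(v)
        # alpha[i][j][s]: probability of emitting u[:i] on the left, v[:j] on
        # the right, and being in state s
        alpha = [[[0.0] * n_states for _ in range(m + 1)] for _ in range(l + 1)]
        alpha[0][0][0] = 1.0
        for i in range(l + 1):
            for j in range(m + 1):
                if i == 0 and j == 0:
                    continue
                for s in range(n_states):
                    total = 0.0
                    for sp in range(n_states):
                        p = trans.get(sp, {}).get(s, 0.0)
                        if p == 0.0:
                            continue
                        if j > 0:                    # q01: right-only emission
                            total += alpha[i][j-1][sp] * p * q01.get(s, {}).get(v[j-1], 0.0)
                        if i > 0:                    # q10: left-only emission
                            total += alpha[i-1][j][sp] * p * q10.get(s, {}).get(u[i-1], 0.0)
                        if i > 0 and j > 0 and u[i-1] == v[j-1]:   # q11: both tapes
                            total += alpha[i-1][j-1][sp] * p * q11.get(s, {}).get(u[i-1], 0.0)
                    alpha[i][j][s] = total
        # finish by moving into the end state once both strings are exhausted
        return sum(alpha[l][m][s] * trans.get(s, {}).get(end_state, 0.0)
                   for s in range(n_states))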

Figure 6.4: Dynamic programming trellis showing how the pair cat, cats is generated. Each node represents a vector of states. q10 transitions are horizontal, q01 transitions are vertical, and q11 transitions are diagonal. Note the resemblance to the methods of calculating the Levenshtein edit distance.

The backward probabilities satisfy the corresponding recurrence:

\beta_s(i, j) = \sum_{s'} \beta_{s'}(i, j+1)\, p(s' \mid s)\, q_{01}(v_{j+1} \mid s')
             + \sum_{s'} \beta_{s'}(i+1, j)\, p(s' \mid s)\, q_{10}(u_{i+1} \mid s')
             + \sum_{s'} \beta_{s'}(i+1, j+1)\, p(s' \mid s)\, q_{11}(u_{i+1}, v_{j+1} \mid s')     (6.5)

Here we start the recursion at the end, (l, m), with \beta_{s_1}(l, m) = 1 and \beta_s(l, m) = 0 otherwise.

Figure 6.4 outlines the dynamic programming trellis for the pair of words cat and cats. The trellis has one more dimension than the traditional HMM trellis. The most likely path, which gives the alignment between the two strings, is also shown.

To train the model, we use the EM algorithm. We therefore want to calculate the expected number of times each transition or output function is taken. This is just a sum over the expected number of times each transition is taken between each link in the trellis, with respect to the probability of the hidden variables given the observations. I will sketch the derivation from the EM theorem. It may be recalled that if we maximise the quantity

\sum_z P^{old}(z \mid o) \log P^{new}(z, o)     (6.6)

we will improve the log likelihood of our model. Here z is the hidden variable and o is the observation, and P^old and P^new are the probabilities with respect to the old and new parameter values respectively: i.e. we maximise with respect to the new parameters while the old parameters are held constant.

In this case z will be the sequence of transitions and outputs through the PHMM, and o will be the pair of strings (u, v). Let z_11(s, s', a) denote the event of going from state s to state s' and emitting an a on both channels, and similarly for z_10(s, s', a) and z_01(s, s', a), and use p_11(s, s', a) etc. to denote the probabilities of each of these. If we use c_11(s, s', a), etc., as counting variables to denote the number of times each of these is used in any given state sequence, then the new values of the probabilities will be

p_{11}(s, s', a) = \frac{1}{H_s} \sum_Z P(Z \mid u, v)\, c_{11}(s, s', a) = \frac{1}{H_s\, P(u, v)} \sum_Z P(u, v, Z)\, c_{11}(s, s', a)     (6.7)

and so on, where H_s is a normalisation constant for the state s.

But clearly the sum over all state sequences Z is exponential. How can we calculate the sum \sum_Z P(u, v, Z)\, c_{11}(s, s', a) efficiently? We consider the subset of these state sequences in which the model makes a transition from s to s', emitting a on both sides, for the ith element of u and the jth element of v, i.e. between (i-1, j-1) and (i, j) in the trellis. The sum over this subset, which we can call Z_{ij}, is easy to calculate – note we have removed c_11:

\sum_{Z \in Z_{ij}} P(u, v, Z) = \alpha_s(i-1, j-1)\, p(s' \mid s)\, q_{11}(u_i, v_j \mid s')\, \beta_{s'}(i, j)     (6.8)

The sum of the probability of all these transition sequences is just the probability that the model generates the first parts of the two strings, namely the forward probabilities, times the probability that it makes the appropriate transition, times the probability that it generates the remaining parts of the two strings.

So if we sum over all i and j, the number of times each sequence turns up will be, obviously, equal to the number of these particular transitions it contains, i.e. c_11(s, s', a), so

\sum_Z P(u, v, Z)\, c_{11}(s, s', a) = \sum_{i=1}^{l} \sum_{j=1}^{m} \sum_{Z \in Z_{ij}} P(u, v, Z)     (6.9)

So using these equations we can compute the new values as follows. We define various accumulators corresponding to the expected number of times each transition is used: C_11(s, s', a) is the expected number of times we make a transition from s to s' and output a symbol a, and so on. Then we can calculate these as the sums:

C_{11}(s, s', a) = \sum_{(u, v)} \frac{1}{p(u, v)} \sum_{i=0}^{|u|-1} \sum_{j=0}^{|v|-1} \alpha_s(i, j)\, p(s' \mid s)\, q_{11}(u_{i+1}, v_{j+1} \mid s')\, \beta_{s'}(i+1, j+1)\, \delta(a, u_{i+1})     (6.10)

C_{10}(s, s', a) = \sum_{(u, v)} \frac{1}{p(u, v)} \sum_{i=0}^{|u|-1} \sum_{j=0}^{|v|} \alpha_s(i, j)\, p(s' \mid s)\, q_{10}(u_{i+1} \mid s')\, \beta_{s'}(i+1, j)\, \delta(a, u_{i+1})     (6.11)

C_{01}(s, s', a) = \sum_{(u, v)} \frac{1}{p(u, v)} \sum_{i=0}^{|u|} \sum_{j=0}^{|v|-1} \alpha_s(i, j)\, p(s' \mid s)\, q_{01}(v_{j+1} \mid s')\, \beta_{s'}(i, j+1)\, \delta(a, v_{j+1})     (6.12)

Here the delta function δ(a, b) is 1 if a = b and zero otherwise: it merely restricts the sums to those parts of the trellis which have the right outputs. The complexity of the formulae above hides their simplicity. For example, the first of the three is just a sum of terms α_s(i, j) p(s' | s) q11(u_{i+1}, v_{j+1} | s') β_{s'}(i+1, j+1) over those positions where a is output. Each such term just represents the probability of taking the transition from stage (i, j) to stage (i+1, j+1), from state s to state s', and of course outputting u_{i+1} on the left and v_{j+1} on the right. Note the presence of the factor 1/p(u, v). This is because we are calculating the expectation with respect to the conditional probability. In traditional applications of HMMs, where all of the data is in a single observation sequence, this factor is a constant and can be neglected. Here, however, we are dealing with multiple observation sequences, one for each pair of words, so the factor varies and must be included.

The new parameters will then be the appropriately normalised sums.

\hat{p}(s' \mid s) = \frac{\sum_a \left[ C_{11}(s, s', a) + C_{10}(s, s', a) + C_{01}(s, s', a) \right]}{\sum_{s''} \sum_a \left[ C_{11}(s, s'', a) + C_{10}(s, s'', a) + C_{01}(s, s'', a) \right]}     (6.14)

\hat{q}_{11}(a \mid s) = \frac{\sum_{s'} C_{11}(s', s, a)}{\sum_{s'} \sum_{a'} \left[ C_{11}(s', s, a') + C_{10}(s', s, a') + C_{01}(s', s, a') \right]}     (6.15)

\hat{q}_{10}(a \mid s) = \frac{\sum_{s'} C_{10}(s', s, a)}{\sum_{s'} \sum_{a'} \left[ C_{11}(s', s, a') + C_{10}(s', s, a') + C_{01}(s', s, a') \right]}     (6.16)

\hat{q}_{01}(a \mid s) = \frac{\sum_{s'} C_{01}(s', s, a)}{\sum_{s'} \sum_{a'} \left[ C_{11}(s', s, a') + C_{10}(s', s, a') + C_{01}(s', s, a') \right]}     (6.17)
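To show how the re-estimation works in practice, here is a sketch of the accumulation of the expected counts of equations 6.10–6.12 for a single training pair, assuming the forward and backward tables alpha[i][j][s] and beta[i][j][s] have already been filled in using equations 6.4 and 6.5. The helper names and data layout are mine, not the thesis's; the new parameters are then obtained by the normalisations of equations 6.14–6.17.

    def accumulate_counts(u, v, alpha, beta, trans, q10, q01, q11, p_uv,
                          C11, C10, C01):
        """Add the expected transition/output counts for one pair (u, v) into
        the accumulators C11[(s, t, a)], C10[(s, t, a)], C01[(s, t, a)]
        (equations 6.10-6.12).  p_uv is the joint probability p(u, v); the
        accumulators should support += on missing keys, e.g. defaultdict(float)."""
        l, m = len(u), len(v)
        for i in range(l + 1):
            for j in range(m + 1):
                for s, out in trans.items():
                    a_s = alpha[i][j][s]
                    if a_s == 0.0:
                        continue
                    for t, p_st in out.items():
                        # both tapes: consumes u[i] and v[j], which must be equal
                        if i < l and j < m and u[i] == v[j]:
                            q = q11.get(t, {}).get(u[i], 0.0)
                            if q:
                                C11[(s, t, u[i])] += a_s * p_st * q * beta[i+1][j+1][t] / p_uv
                        # left tape only: consumes u[i]
                        if i < l:
                            q = q10.get(t, {}).get(u[i], 0.0)
                            if q:
                                C10[(s, t, u[i])] += a_s * p_st * q * beta[i+1][j][t] / p_uv
                        # right tape only: consumes v[j]
                        if j < m:
                            q = q01.get(t, {}).get(v[j], 0.0)
                            if q:
                                C01[(s, t, v[j])] += a_s * p_st * q * beta[i][j+1][t] / p_uv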

6.4.4 Finding the most likely output

As mentioned previously, given this model of the joint distribution p(u, v), for a particular input string u we will want to find the most likely output, i.e.

\arg\max_v\, p(u, v)     (6.18)

The obvious way to find this is to use a Viterbi-style algorithm to find the most likely state sequence given u, and then to select the most likely output given that state sequence (Ristad, 1997). Unfortunately this does not in general give the right answer, and in practice I have found that it does in certain circumstances give incorrect answers. We can see this by considering the simple example shown in Figure 6.5.

Figure 6.5: A simple PHMM where the most likely state sequence does not give the most likely output. The most likely output given the input of a is clearly c with probability 0.6, but the most likely state transition outputs b with probability 0.4.

In general the problem is that the most likely output can be the output of exponentially many state sequences. This is equivalent to finding the most likely output of an HMM, which Goodman (1998) and Casacuberta and de la Higuera (2000) showed is NP-hard. There will not therefore be an efficient algorithm for performing the computation exactly. However, in this specific application we are concerned with distributions where, ideally, there is a string v where p(u, v) is close to p^L(u), i.e. where p(v | u) is close to 1. So if we can sample from the distribution p(v | u), we can then find such a v rapidly.

More specifically, we sample from the conditional distribution, keeping track of two quantities: first, the highest value of p(v | u) we have so far found, say p_max, and of course the corresponding string v_max; and secondly the sum of p(v | u) over all the distinct strings we have so far examined, p_sum. Since p_sum will always be less than or equal to 1, when 1 − p_sum < p_max we will know that v_max is the correct answer, since any new string must have p(v | u) ≤ 1 − p_sum < p_max. In particular, if we find a string that has p(v | u) > 0.5 we immediately know that it is the most likely output.
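This search procedure fits in a few lines. The sketch below assumes a function sample_output(u) that draws a string from p(v | u) (as in the next subsection) and a function conditional_prob(u, v) that evaluates p(v | u); both are placeholders for illustration, not names from the thesis.

    def most_likely_output(u, sample_output, conditional_prob, max_samples=10000):
        """Find argmax_v p(v | u) by sampling from the conditional distribution.
        Terminates early once the best string found so far cannot be beaten by
        any string not yet seen, i.e. when 1 - p_sum < p_max."""
        seen = {}          # distinct strings examined so far and their p(v | u)
        v_max, p_max = None, 0.0
        for _ in range(max_samples):
            v = sample_output(u)
            if v not in seen:
                p = conditional_prob(u, v)
                seen[v] = p
                if p > p_max:
                    v_max, p_max = v, p
                p_sum = sum(seen.values())
                # any unseen string has probability at most 1 - p_sum
                if 1.0 - p_sum < p_max:
                    break
        return v_max, p_max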

6.4.5 Sampling from the Conditional Distribution

For a given PHMM m, we can define two simple HMMs corresponding to the left and right outputs of the PHMM as follows. Considering the left outputs, we can define an HMM with exactly the same states and transition probabilities as the original PHMM, but with the following output probabilities:

q(\epsilon \mid s) = \sum_a q_{01}(a \mid s)     (6.19)

where q(ε | s) is the probability of not outputting anything, and

q(a \mid s) = q_{10}(a \mid s) + q_{11}(a \mid s)     (6.20)

This gives an HMM h which will output exactly the same strings, with the same probabilities, as the original PHMM does on the left. That is,

\forall u \in A^*: \quad p_m^L(u) = p_h(u)     (6.21)

This HMM will have various null transitions which will complicate the computation, but it is possible to remove these using a standard technique (Jelinek, 1997). We define a new HMM, without any null transitions, that will generate the same strings with the same probabilities. To do this we proceed as follows. Define z^n(s, t) to be the probability that, starting from state s, the HMM makes n transitions ending in state t and outputs nothing; that is, at each state it produces nothing. We can calculate this recursively as follows. Clearly when n = 1 this is just the probability that it will go from s to t and output nothing at state t:

z^1(s, t) = p(t \mid s)\, q(\epsilon \mid t)     (6.22)

For generality we can also define

z^0(s, t) = \delta(s, t)     (6.23)

We can then define a recurrence relation:

z^n(s, t) = \sum_r z^{n-1}(s, r)\, p(t \mid r)\, q(\epsilon \mid t)     (6.24)

If we define a matrix Z whose elements are z^1(s, t), we can see that z^n is just Z^n, the matrix Z raised to the nth power. Define y(s) to be the probability that at state s it outputs something:

y(s) = \sum_a q(a \mid s)     (6.25)

Then we can define the transition probabilities of the new model to be

p^{new}(t \mid s) = \sum_{n \geq 1} z^n(s, t)\, y(t)     (6.26)

The interpretation of this is that we are collapsing all sequences of null transitions that end in a non-null transition into a single transition. The new output probabilities are

q^{new}(a \mid s) = \frac{1}{y(s)}\, q(a \mid s)     (6.27)

It is clear that this model produces exactly the same strings. The problem is to compute the infinite sum in 6.26. This is possible by using the matrix representation Z; writing Y for the diagonal matrix with entries y(t),

p^{new}(t \mid s) = \sum_{n \geq 1} z^n(s, t)\, y(t) = \left[ \sum_{n \geq 1} Z^n Y \right]_{st} = \left[ (I - Z)^{-1} Z\, Y \right]_{st}     (6.28)

So at the cost of a matrix inversion we can calculate this, and we then have the left HMM of the PHMM. We can use this HMM to sample from the conditional distribution of the PHMM.
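In matrix form, the null-transition removal of equations 6.26–6.28 above is a one-line computation; the following numpy sketch assumes the left HMM's parameters have been packed into arrays, with row s and column t giving p(t | s). The variable names are illustrative, not the thesis's.

    import numpy as np

    def remove_null_transitions(P, q_eps, y):
        """Collapse null transitions in the left HMM (equations 6.26-6.28).
        P      : n x n matrix, P[s, t] = p(t | s)
        q_eps  : length-n vector, q_eps[t] = probability of emitting nothing at t
        y      : length-n vector, y[t] = probability of emitting something at t
        Returns the new transition matrix with entry [s, t] = p_new(t | s).
        The new output probabilities are simply q_new(a | s) = q(a | s) / y(s)."""
        n = P.shape[0]
        Z = P * q_eps[np.newaxis, :]          # Z[s, t] = p(t | s) q(eps | t)
        # sum_{k >= 1} Z^k = (I - Z)^{-1} Z, assuming the spectral radius of Z < 1
        S = np.linalg.solve(np.eye(n) - Z, Z)
        return S * y[np.newaxis, :]           # scale column t by y(t)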

For a given u of length l we calculate the backward probabilities of this left HMM:

\beta_s^L(i) = P(\text{starting from } s, \text{ the model generates } u_{i+1} \dots u_l)     (6.29)

There are now two ways to explain the sampling algorithm. I will start with the more comprehensible but less precise approach, and then sketch the more precise method. Consider the set of all sequences of transitions and outputs, each of which has a probability. If we want to sample from the conditional distribution p(v | u), we want to sample at random from the subset of sequences that generates u on the left. The left backward probabilities tell us what proportion of sequences are in that subset; as we generate, we weight the selection process by that proportion to ensure that we remain in that subset.

So suppose we are in state s and have output the first i characters of u, and by hypothesis β^L_s(i) > 0; we want to select from the various possibilities. We can take one of the three different transition types. If we go to state s' and make a q11 output, we must then output u_{i+1} on both channels; we accordingly weight this transition by β^L_{s'}(i+1), since this represents the probability of outputting the rest of the string starting from state s'. If we make a q01 transition to state s', we weight it by β^L_{s'}(i), and if we make a q10 transition we weight it by β^L_{s'}(i+1). The sum of all these probabilities must be equal to β^L_s(i):

\beta_s^L(i) = \sum_{s'} p(s' \mid s)\, q_{11}(u_{i+1} \mid s')\, \beta_{s'}^L(i+1)     (6.30)
             + \sum_{s'} p(s' \mid s)\, q_{10}(u_{i+1} \mid s')\, \beta_{s'}^L(i+1)     (6.31)
             + \sum_{s'} p(s' \mid s) \left( \sum_a q_{01}(a \mid s') \right) \beta_{s'}^L(i)     (6.32)

This follows from the definition of the left backward probabilities. Note that β^L_s(i) can also appear on the right hand side of this equation, when s' = s; this is why we need to use the matrix inversion to find its value.

So to sample from the conditional distribution, at state s, having output the first i characters of u, we do one of the following three things.

With probability

\frac{p(s' \mid s)\, q_{11}(u_{i+1} \mid s')\, \beta_{s'}^L(i+1)}{\beta_s^L(i)}     (6.33)

output u_{i+1} on both streams and move to state s'.

With probability

\frac{p(s' \mid s)\, q_{10}(u_{i+1} \mid s')\, \beta_{s'}^L(i+1)}{\beta_s^L(i)}     (6.34)

output u_{i+1} on the left stream only and move to state s'.

With probability

\frac{p(s' \mid s)\, q_{01}(a \mid s')\, \beta_{s'}^L(i)}{\beta_s^L(i)}     (6.35)

output a on the right stream only and move to state s'.

It may not be clear that this actually samples correctly from the conditional distribution. Intuitively, we restrict the sampling algorithm to the paths that generate the correct string on the left. Note that a particular path will have a probability that is the product of a sequence of terms, each of which has the form of one of the three possibilities above. The ratios by which we adjust the normal probabilities are of the form

\frac{\beta_{s_{t_{k+1}}}(j_{t_{k+1}})}{\beta_{s_{t_k}}(j_{t_k})}     (6.36)

where we go from state s_{t_k} to state s_{t_{k+1}}. Therefore for an entire path we will multiply the joint probability of the path by a factor

\frac{\beta_{s_{t_1}}(j_{t_1})}{\beta_{s_0}(j_0)} \cdot \frac{\beta_{s_{t_2}}(j_{t_2})}{\beta_{s_{t_1}}(j_{t_1})} \cdots \frac{\beta_{s_1}(l)}{\beta_{s_{t_k}}(j_{t_k})} = \frac{\beta_{s_1}(l)}{\beta_{s_0}(0)} = \frac{1}{p^L(u)}     (6.37)

So the probability of the path is the correct conditional probability. A more formal way of defining the same thing is to treat this as sampling from a more complex HMM. We can consider an HMM where each state is a pair <s, i>, where s is a state of the PHMM and i is an index into the first string u. We can then define the HMM transitions exactly according to the equations above. This is closely related to a proposal by Eisner (2001) to compose a string and a transducer.
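One step of this sampler can be sketched as follows, assuming the left backward probabilities betaL[t][i] of equation 6.29 have been computed (e.g. with the null-free left HMM above) and that the parameters are stored as in the earlier sketches; handling of termination once the whole of u has been generated is omitted. The names are illustrative only.

    import random

    def sample_step(s, i, u, trans, q10, q01, q11, betaL):
        """Sample one move of the conditional sampler at state s, having already
        generated u[:i] on the left.  Returns (new_state, left_sym, right_sym),
        where None means nothing was emitted on that tape.  The three cases are
        weighted as in equations 6.33-6.35; betaL[s][i] > 0 is assumed."""
        choices, weights = [], []
        nxt = u[i] if i < len(u) else None
        for t, p_st in trans.get(s, {}).items():
            if nxt is not None:
                w = p_st * q11.get(t, {}).get(nxt, 0.0) * betaL[t][i + 1]
                if w > 0:
                    choices.append((t, nxt, nxt)); weights.append(w)
                w = p_st * q10.get(t, {}).get(nxt, 0.0) * betaL[t][i + 1]
                if w > 0:
                    choices.append((t, nxt, None)); weights.append(w)
            for a, q in q01.get(t, {}).items():
                w = p_st * q * betaL[t][i]
                if w > 0:
                    choices.append((t, None, a)); weights.append(w)
        # the weights sum to betaL[s][i] (equations 6.30-6.32), so normalising
        # them gives exactly the conditional transition probabilities
        return random.choices(choices, weights=weights, k=1)[0]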

6.4.6 Complexity of the algorithms

If we have two strings u and v of lengths |u| and |v| respectively, n states and an alphabet of size |A|, let us consider the space and time requirements of these algorithms. Each model has n(n + 3|A|) parameters. The trellis will have n(|u| + 1)(|v| + 1) nodes in it. The training process requires, for each iteration, a calculation for each transition in the trellis, and so the overall complexity of each iteration is O(n^2 |u| |v|). To calculate the conditional probabilities we need a matrix inversion at a cost of O(n^3), but we only need to do this once, so this is not a significant factor for these models.
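The two expressions above are easy to evaluate directly; a trivial illustrative helper (not from the thesis):

    def phmm_cost(n_states, alphabet_size, len_u, len_v):
        """Parameter count and per-pair trellis size for a PHMM, following the
        expressions in Section 6.4.6 (illustrative only)."""
        n_parameters = n_states * (n_states + 3 * alphabet_size)
        trellis_nodes = n_states * (len_u + 1) * (len_v + 1)
        return n_parameters, trellis_nodes

    # e.g. a 10-state model over a 26-letter alphabet, on a pair of 6-letter words
    print(phmm_cost(10, 26, 6, 6))   # -> (880, 490)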

6.4.7 Initial Experiments

We can use this training algorithm to learn transductions by taking randomly initialised models of a given size and training them. The question here is how powerful the training algorithm is. It is a variant of the EM algorithm and thus suffers from the problem of local optima. Will this algorithm suffice, or does one need to use more sophisticated techniques, such as MAP estimation, conditional maximum likelihood estimation or forms of stochastic optimisation? I experimented with various artificial data sets with small alphabets to determine how the algorithm performs in outline. The data set for the first experiment is shown in Table 6.1. The results, shown in Table 6.2, demonstrate that this simple suffixation could be rapidly learned. Table 6.3 shows the data used for the second experiment. This transduction accepts strings that consist of a string of as optionally preceded by a b. If the string is preceded by a b then a b is appended; otherwise an a is appended. Again, on these trivial, clean examples the model learns the transduction accurately and generalises correctly. An interesting issue is whether this learning algorithm is sufficient to learn non-deterministic transductions, that is to say transductions that cannot be performed by a deterministic transducer.

U         V
a         ab
aa        aab
aaa       aaab
aaaa      aaaab
aaaaa     aaaaab
aaaaaa    aaaaaab
aaaaaaa   aaaaaaab

Table 6.1: Training data for Experiment 1.

U           V            p(u, v)   p^L(u)   p(v | u)
a           ab           0.22      0.22     1
aaa         aaab         0.13      0.13     1
aaaaa       aaaaab       0.08      0.08     1
aaaaaaaaa   aaaaaaaaab   0.029     0.029    1

Table 6.2: Results for Experiment 1 with 4 state model.

U       V
ba      bab
baa     baab
baaa    baaab
baaaa   baaaab
aa      aaa
aaa     aaaa
aaaa    aaaaa
aaaaa   aaaaaa

Table 6.3: Training data for Experiment 2: non-locally testable data set.

u         5 states   10 states   15 states
aa        0.625      1.0         1.0
ba        1.0        1.0         1.0
aaaaaaa   0.625      1.0         1.0
baaaaaa   0.375      1.0         1.0

Table 6.4: p(v | u) of the correct answer for varying sizes of model, for the non-locally testable data set.

U          V
a          c
aa         bb
aaa        ccc
aaaa       bbbb
aaaaa      ccccc
aaaaaa     bbbbbb
aaaaaaa    ccccccc
aaaaaaaa   bbbbbbbb

Table 6.5: The data set for testing whether a non-deterministic transduction is learnable.

[Figure: negative log probability (y-axis) against training iterations (x-axis) for the 10 runs.]

Figure 6.6: Log probability of 10 runs on the non-deterministic data set. Note that there are three distinct levels of log probability that the models make quite rapid transitions between.

A standard example (van Noord & Gerdemann, 2001) is a transduction that maps (aa)* to (bb)* and a(aa)* to c(cc)*. This maps all even-length strings of as to a sequence of bs and all odd-length strings to a sequence of cs. This is non-deterministic since you cannot decide which rule to use until you have reached the end of the input string, and you cannot delay the output until then because that would require storing an unbounded number of symbols. Table 6.5 shows the data set I used to test whether this was possible. A PHMM would require at least 12 states to model this correctly; I accordingly experimented with 20-state models. I trained 10 of these models: Figure 6.6 shows a graph of the log probability for these 10 models over the course of the training. This was rather unusual, since instead of converging in comparatively few iterations, with this data the models often settled into long “plateaux” for a hundred iterations before suddenly dropping down to a better place in the parameter space. We can see three plateaux at levels of log probability of roughly -23, -20 and -18. Of these 10 models, 8 dropped down to the bottom plateau, which corresponds to the correct transduction. These models performed the transduction perfectly, performing it correctly on strings of arbitrary length with conditional probability 1. The remaining 2, which got stuck at the middle plateau, performed less well but still produced the correct answers with a conditional probability of greater than a half, mapping for example a^10 to b^10 with conditional probability 1, but a^11 to c^11 with conditional probability 0.58. In fact these results got slightly better as the strings got longer (I do not have a good explanation for this; it is possible that the model had not converged fully). It is nonetheless clear that this training algorithm can learn simple non-deterministic transductions, which I personally find rather surprising.

6.5 Discussion of Pair Hidden Markov Models

In this section I will discuss PHMMs in general and previous work relating to them. In particular I shall discuss various other statistical models that are related in some sense to PHMMs.

6.5.1 Previous Work

The literature on this topic is, quite frankly, a mess. PHMMs clearly fall into the class of regular syntax-directed translation schemes (Aho & Ullman, 1969a, 1969b). As previously mentioned, PHMMs were introduced in the context of bioinformatics; there they generally have very few states, with manually selected meanings. The earliest work here I have found is by Allison, who uses them with 1, 3 or 5 states (Allison, Wallace, & Yee, 1990, 1992; Allison, 1993; Allison & Wallace, 1994; Allison, Powell, & Dix, 1999). They are viewed as an extension of the Levenshtein edit distance (Gusfield, 1997); these authors present the training algorithm given here, for a single alphabet, with slightly different boundary conditions. The relationship to the edit distance can be seen more clearly if we take the negative logarithm of all of the parameters; then the cost of an alignment is the sum of the various costs associated with the elementary operations – insertion (q01), deletion (q10) and copying (q11). Indeed one of the best ways of understanding PHMMs is to think of them as a mixture of the edit distance and an HMM. They are also equivalent to a finite-state version of Ristad's memory-less stochastic transducers (Ristad & Yianilos, 1998); a similar model is discussed in (Ristad, 1997), which presents the EM algorithm for an identical model. Bengio and co-workers (Bengio & Frasconi, 1996; Bengio & Bengio, 1996; Bengio, 1999) present Asynchronous Input/Output Hidden Markov Models (IOHMMs), which have many points in common. The principal difference is that they model the conditional probability rather than the joint probability. Eisner (2001) has a much more general algorithm for doing EM over finite-state transducers that can handle the full power of the semi-ring calculus used by Mohri, Pereira, and Riley (2000); the complexity of the algorithm appears to be higher, though. Casacuberta (1995, 1996) and Pico and Casacuberta (2001) present a very similar algorithm, together with a variety of different estimation techniques, but without citing any other work. Their algorithm is slightly more general, and can handle states that output nothing on both streams – q00 transitions in my terminology.

The PHMM is formally identical to certain parameterisations of weighted finite state transducers; the choice of this non-standard terminology deserves explanation. Consider the normal Hidden Markov Model; these can be described alternatively as non-deterministic stochastic finite-state automata, or as stochastic regular grammars. The choice of terminology depends primarily on the techniques that are being used. The term ‘HMM’ tends to be used when one emphasises the training and smoothing aspects, such as in (Tjong Kim Sang, 1998). I have accordingly used a similar terminology here. In addition this emphasises the fact that there has been a lot of prior work on this in bioinformatics. The basic idea of sticking together the Levenshtein edit distance algorithm and the forward-backward algorithm has been discovered independently several times (by Allison, Ristad, Bengio and colleagues, Casacuberta, and the current author), but the earliest work is definitely in bioinformatics (Allison et al., 1990, 1992; Allison, 1993).

6.5.2 Joint versus Conditional probabilities

The choice to model the joint rather than the conditional probability is one of the key points of this model. The advantages are:

The joint probability is more informative than the conditional probability.

It is possible to combine models of the joint probability to make mixture models.

The two streams are treated symmetrically, though the transductions are asymmetrical.

The disadvantages are the additional complexity in converting to the conditional probability, and the fact that sometimes the models do not model the transduction accurately, since they are modelling the joint probability; this problem could be alleviated by using a more discriminative training regime (Normandin, 1996; Krogh & Riis, 1999). In the NLP community, Knight and Graehl (1997) present, as part of a system for transliterating Japanese versions of English words back into English, a simple EM training algorithm for finite state transducers, but without a version of the correct dynamic programming algorithm – i.e. they actually sum over the exponentially many alignments directly, in an application where the growth rate is not too high.

6.5.3 Extension to Context-Free Transductions

Just as Hidden Markov Models can be extended from finite state to context free, giving rise to SCFGs and requiring a change in the training algorithm from the forward-backward algorithm to the inside-outside algorithm (Baker, 1979), so PHMMs can be extended to context-free transductions, in which case they are very close to the Stochastic Inversion Transduction Grammars (SITGs) of Wu (1995, 1997). Naturally this gives rise to an EM-based training algorithm that can be modified to train SITGs as well. I will sketch the algorithm here. Instead of the extensions of the forward and backward probabilities we have extensions of the inside and outside probabilities. Given a string u_1 ... u_l and a string v_1 ... v_{l'}:

Definition 5 The outside probabilities α_s(i, j, i', j') are defined as the probability of starting from the start symbol and generating, on the left, the non-terminal N_s spanning positions i to j and, on the right, the same non-terminal spanning i' to j', together with everything outside these spans on both sides.

Definition 6 The inside probabilities β_s(i, j, i', j') are defined as the probability that, starting from N_s, one generates the left string from i to j and the right string from i' to j'.

Given these definitions the algorithm proceeds in the normal way.


6.5.4 Other applications

This is a general algorithm for learning finite state string transductions. PHMMs are clearly unsuitable for machine translation, since they are monotonic: there is a monotonic alignment between the sequence of state transitions and the sequence of symbols in the two streams. This was part of the motivation for Wu's move to context-free models for bilingual alignment (Wu, 1997). However, there are two interesting applications of these models. First, if we consider a slightly richer set of output functions, we could use them as an error model for spelling correction. Thus we could add sets of parameters corresponding to particular sorts of spelling errors. We already have the output functions that correspond to inserting extra letters and to omissions, but we could also add ones for duplication, such as an output that produces aa on one stream and a on the other, or for reversals, such as an output that produces ab on one stream and ba on the other. This could be integrated into the morphological transducers themselves to produce a robust morphological processor (Oflazer, 1996). It seems likely that conditioning the probabilities of errors on a hidden state would improve the performance of the error correction. Another interesting application is grapheme-phoneme conversion (Divay & Vitale, 1997; Rentzepopoulos & Kokkinakis, 1996). In English, for example, the mapping between graphemes and phonemes is not completely straightforward (Torkkola, 1993; Daelemans & van den Bosch, 1997), as sequences of more than one grapheme may be realised as a single phoneme, such as th, and a single letter such as j may correspond to more than one phone. In other languages, notably Japanese, which has several different scripts that interact in non-trivial ways, the task can be even more difficult (Baldwin & Tanaka, 1999, 2000). I have done some preliminary experiments on this which are very promising.

6.6 Mixture Models

I have so far shown an algorithm that can learn a single surface-to-surface transduction. As previously noted, since morphological processes are in general lexically specified as well as phonologically specified, we need an extension to handle morphology. I will use the traditional idea of morphological classes (Cahill & Gazdar, 1997, 1999). Just as we can consider the words of a language to be divided into syntactic classes according to the range of syntactic constructions they can participate in, so too we can consider them to be further subdivided into morphological classes depending on the particular morphological transductions that can operate on them.

Let us now consider slightly more complex models, namely mixtures of PHMMs. A mixture model, as its name implies, is a combination of simple models. Given k PHMMs m_1, ..., m_k, and k mixing parameters p_1, ..., p_k that sum to 1, we can define a mixture model M which has the distribution

p_M(u, v) = \sum_{i=1}^{k} p_i\, p_{m_i}(u, v)     (6.38)

It is important to realise that this sort of simple mixture is exactly equivalent to a larger PHMM with a particular structure. If we remove the initial and final states from each sub-model, combine all of them together, and add initial transitions weighted by the parameters p_i, we will have a PHMM with a block-diagonal transition matrix that generates exactly the same strings as the mixture, with exactly the same probabilities.

6.6.1 Modelling Morphological classes

If we divide the data set into various classes according to which inflection each word takes, we can then model each morphological class separately. I use morphological class in a rather broad sense here, to refer to everything from large, quite productive sets of words right down to very small, even singleton, classes that model highly idiosyncratic words. This raises two questions. First, how can we separate the set of all words into various classes automatically, and secondly, how do we choose which class to assign a new word to?

The overall model will then be a mixture of submodels, and for each submodel we have a set of lexical items that are part of that submodel. Let us define some parameters: let p(u | i) denote the probability that the uninflected form u is generated by class i. It is perhaps more natural to think of the converse parameter p(i | u), which tells you which class the word is in, but it is less convenient mathematically. The probability distribution of the combined model is then

p_M(u, v) = \sum_i p(i)\, p(u \mid i)\, p_{m_i}(v \mid u)     (6.39)

To re-iterate, p(i) and p(u | i) are parameters of the model, and p_{m_i}(v | u) is the conditional probability with respect to the ith PHMM. Since the parameters p(u | i) sum to 1 over the words that appear in the corpus, this model allows no probability mass for new words: i.e. it is overtrained. We will solve this problem by smoothing later on.
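To make the combined model concrete, here is a minimal Python sketch of equation 6.39 (before the smoothing introduced below); the data layout and names are mine, for illustration only.

    def class_model_prob(u, v, classes):
        """Equation 6.39: p_M(u, v) = sum_i p(i) p(u | i) p_{m_i}(v | u).
        Each element of `classes` is assumed to be a dict with keys:
          'prior'   : p(i), the prior probability of the class
          'lexical' : dict mapping seen uninflected forms u to p(u | i)
          'cond'    : function (u, v) -> p_{m_i}(v | u) under the class's PHMM."""
        return sum(c['prior'] * c['lexical'].get(u, 0.0) * c['cond'](u, v)
                   for c in classes)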

6.6.2 Splitting the data into various classes

A principled way of splitting the data into separate classes is to use standard techniques of mixture modelling, using the EM algorithm and treating the class membership as a hidden variable. This has one highly undesirable property though: it requires the number of classes to be fixed beforehand. So we proceed in a slightly more ad hoc way. Rather than training on all of the data, we allow the algorithm to select a subset of the data that it will model. We do this by weighting each training pair (u, v) by its conditional probability p(v | u). Thus a pair that is correctly modelled by the model will receive full weight, and a pair that is part of a different class will receive a weight of zero. This allows the algorithm to select a subset of the data, gradually closing in on that part of the data that it can model correctly. Unfortunately, the use of this weighting means that the algorithm loses some of its attractive properties – guaranteed convergence and so on – but in practice it does seem to converge quite rapidly. There are a number of alternative machine learning techniques that might be applicable here.

The algorithm keeps track of which pairs it has already modelled. At each iteration, it creates a new model, training it on the unmodelled pairs, weighting by the conditional probability at each iteration of the EM algorithm. When it has converged it will cover a subset of the data. If that subset is empty, then this means that the model is insufficiently complex; we then increment the number of states by one and start again. If the subset is non-empty, then we train a larger model on this subset. We repeat this process until we have modelled all of the data. At this point we will have a number of PHMMs of different sizes, starting with small models and going up to larger models. Each word in the training set will be assigned to one of these models. In my opinion the ad hoc nature of this algorithm is not a cause for concern. A more natural algorithm is one which is less supervised; such an algorithm will have to have its own way of identifying subsets. A slight artificiality in the solution to this problem of supervised learning can be tolerated, since this is in my view a rather artificial problem.

Determining the morphological class of new words

Given a model of this type, we are then faced with the second problem: how to determine which class a new word goes in, and thus which inflection it takes. The first thing to say is that according to equation 6.39 there is no probability mass left for any new words; in other words, the model is over-trained. We need to allow some space to account for new words. We can do this by smoothing the parameters p(u | i) with the distribution p^L_{m_i}(u). The parameters p(u | i) are effectively a maximum likelihood model for exactly those words in a morphological class that have appeared in the corpus; p^L_{m_i}(u) is a marginal distribution that represents what the input to the class i is expected to be. We interpolate the two of them to get good generalisation. If for each model we define a parameter λ_i, we have

p_M(u, v) = \sum_i p(i) \left[ (1 - \lambda_i)\, p(u \mid i) + \lambda_i\, p^L_{m_i}(u) \right] p_{m_i}(v \mid u)     (6.40)

Intuitively the λ_i parameters represent how productive the paradigm is. If λ_i is zero, then no new words will be assigned to that paradigm; if on the other hand it is greater than zero, then new words may be assigned to it, depending on their phonology. We can find an optimal value for these parameters using the EM algorithm on held-out data. We take a guess for the values of λ_i for each i; an obvious starting point is the ratio of unseen words in the held-out data, say 0.01. Using these values as a starting point, we estimate the expected number of times we will use either the specific p(u | i) or the general p^L_{m_i}(u) distribution.

In some cases the solution will be degenerate: any value of λ_i will give the same result. If we have a PHMM that produces a particular word pair with probability 1 – say go/went – then the specific and the general distributions will be the same, namely p(go) = 1; in this case we are trying to maximise λ + (1 − λ), which is constant. This can also happen with more than one word at a time, if the PHMM is large enough to memorise the data exactly. This is quite close to the proposal of Baayen (1991), who proposes a quantitative measure of the productivity of a particular morphological paradigm.
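The productivity smoothing of equation 6.40 changes only the p(u | i) term of the earlier sketch; a minimal illustration (names are mine, not the thesis's):

    def smoothed_lexical_prob(u, lexical, marginal, lam):
        """The bracketed term of equation 6.40:
           (1 - lambda_i) * p(u | i) + lambda_i * p^L_{m_i}(u).
        `lexical` is the dict of seen forms, `marginal` the left marginal of the
        class's PHMM, and `lam` the productivity parameter lambda_i."""
        return (1.0 - lam) * lexical.get(u, 0.0) + lam * marginal(u)

Replacing c['lexical'].get(u, 0.0) in the earlier class-model sketch by this quantity gives the smoothed model.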

6.6.3 Smoothing Models

These models are maximum likelihood models, and are thus susceptible to over-training. Accordingly we need to smooth them. We can use standard techniques of linear smoothing (Jelinek, 1997) to do this. In this work I have only attempted to smooth the output functions, and not the transition functions. The idea is to combine the very specific per-state output distributions, which may have been based on rather few pieces of data, with less specific output distributions that we calculate by tying the outputs from many different states.

The smoothing distributions are calculated by tying all of the outputs of each of the three types – namely the q11, q10 and q01 outputs. This results in three distributions q̄_11, q̄_10 and q̄_01 that have very good estimates. For each state s_i we then define four mixing parameters λ^i, λ^i_11, λ^i_10, λ^i_01, which sum to 1. λ^i represents the original component of the output distribution, i.e. before smoothing, and the other three parameters represent the amounts of the smoothing distributions that we mix in. For each state we then define the interpolated output distributions

\hat{q}_i = \lambda^i q_i + \lambda^i_{11} \bar{q}_{11} + \lambda^i_{10} \bar{q}_{10} + \lambda^i_{01} \bar{q}_{01}     (6.41)

or, in the notation I have used previously,

\hat{q}_{11}(a \mid s_i) = \lambda^i q_{11}(a \mid s_i) + \lambda^i_{11} \bar{q}_{11}(a)     (6.42)

\hat{q}_{10}(a \mid s_i) = \lambda^i q_{10}(a \mid s_i) + \lambda^i_{10} \bar{q}_{10}(a)     (6.43)

\hat{q}_{01}(a \mid s_i) = \lambda^i q_{01}(a \mid s_i) + \lambda^i_{01} \bar{q}_{01}(a)     (6.44)

We can find the optimal values of these λ coefficients using the EM algorithm on held-out data. I will give a concrete illustration. In the portion of training data I used for the experiments on the English past tense below, there are no occurrences of the segment z in the portion that takes the suffix t (in UNIBET notation). Therefore the PHMM that models this transduction will assign zero probability to any string that contains the phoneme z. Thus, if one tries to calculate the past tense of zap, that submodel will not generate any answer, and the answer the model as a whole gives will be wrong. With smoothing this problem is obviated: the q̄_11 distribution will contain a non-zero value for outputting z, and mixing a small proportion of this into the appropriate states will allow the model to generalise correctly.
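The interpolation of equations 6.42–6.44 is a one-line mixture per output type; a minimal sketch with illustrative names:

    def smooth_outputs(q_state, q_tied, lam_state, lam_tied, alphabet):
        """Interpolate one per-state output distribution q(. | s_i) with the tied
        distribution of the same output type (equations 6.42-6.44):
            q_hat(a | s_i) = lam_state * q(a | s_i) + lam_tied * q_tied(a)."""
        return {a: lam_state * q_state.get(a, 0.0) + lam_tied * q_tied.get(a, 0.0)
                for a in alphabet}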

6.7 Experiments on Supervised Learning

I shall now present the results of three experiments in supervised learning on the English past tense, the German plural and the Arabic broken plural.

6.7.1 English Past Tense

The English past tense, notwithstanding its triviality, has become something of a test case for models of language acquisition. As a result, I decided to start my experiments with this algorithm there, since there are standard data sets, and thus it is possible to compare results with other approaches. For this experiment, I used a standard data set (Ling, 1994), available on-line, which consisted after processing of 1394 pairs of uninflected and inflected verb forms in the UNIBET phonemic transcription. I use normal orthography for the examples in this chapter to improve readability. I used the associated Kucera-Francis frequency information to define a simple distribution, and sampled 20,000 tokens each for the training and held-out data. To test the models, I used the full data set, and a selection of 120 infrequent words from the BNC, which I manually transcribed into UNIBET notation. This evaluation corresponds to the situation the language learner is in – the training data is likely to contain nearly all the irregular verbs, and thus the generalisation ability of the algorithm is measured mostly on regulars. It is appropriate to use this evaluation method in English, because the system is very regular; in other languages this is not the case, as I discuss below. This amount of data is very small compared to the linguistic experience of even a very young child, but it is adequate in this case.

Data            Number of Tokens   Number of Types
Complete        24,810             1396
Training        20,000             1289
Held out        20,000             1287
BNC Test Data   120                120

Table 6.6: Summary of the Ling English past tense data set, and the additional BNC test set.

After training on the training set, the algorithm produced 27 classes, as shown in Table 6.7. The first class produced was the irregular class of words with invariant pasts, such as beat, shed and so on. Next came the three regular classes that take the three suffixes d, t and Id. Note that these appear as three separate classes, rather than a single class. This is to some extent an artifact of the selection procedure. Since I model the productivity of each class separately, it is important that the granularity of the division into classes is quite fine: if a regular and an irregular class are joined together, then this mixed class will be productive and there will be errors of over-application of the irregular rule. I therefore used a method that errs on the side of dividing the data into too many classes. The values of λ presented in the table are rather misleading, since for many of them the solution is degenerate. The only classes which are truly productive are the three regular classes 1, 2 and 3 and the irregular classes 7 (slightly) and 13. Then we have a sequence of classes that correspond to the various irregular verbs; in general the irregular verbs are divided into a number of classes. Sometimes these correspond nicely to how a linguist would define them, but sometimes they are mixtures of different sorts of irregular verb. Cluster 8 is fairly typical of the sort of confusion – it covers eleven verbs, which we can group into three classes:

creep sleep sweep weep

come become

bleed feed lead mislead read

Note that overcome and keep do not form part of this class. The data sets used here are quite small, so we find that there are still irregular words occurring in the held-out data. In particular class 13, as we can see, is mildly productive, which is an excellent match with psychological data (Bybee & Slobin, 1982; Bybee & Moder, 1983; Prasada & Pinker, 1993). Table 6.8 shows the output of the whole model on two nonce words, sping and spling. We can see from the probabilities that the model produces spung and splinged as the respective past tenses. Though I am not trying to create a cognitive model, this is nonetheless extremely interesting, since for some speakers these irregulars are mildly productive. That spling is given the regular form, whereas sping gets the irregular, highlights that this is to a certain extent a result of rather random properties of the data set and the way I selected it.

i    States   Words   First word    λ             Entropy
0    8        13      beat          2.10856e-08   3.80351
1    9        607     abandon       0.0709904     14.8788
2    9        248     accomplish    0.0608113     12.4003
3    11       309     abet          0.0752991     14.7088
4    11       16      befall        1.83297e-20   8.369
5    11       20      arouse        2.52683e-14   6.21204
6    11       1       eat           0.000992045   0.00353328
7    11       13      drink         0.0874433     6.66736
8    11       11      become        1.18605e-09   5.12286
9    11       4       bend          0.000985733   1.38629
10   11       4       bind          2.53593e-07   2.50897
11   11       2       hide          9.65926e-07   1.38988
12   11       4       catch         9.67531e-07   2.21603
13   11       8       cling         0.298878      4.78478
14   11       5       deal          2.68994e-07   2.56084
15   11       2       sell          0.000987128   0.699749
16   11       3       buy           1.67762e-08   2.37265
17   11       1       grind         0.000983235   0.00343301
18   11       2       give          0.00099005    0.693147
19   11       1       seek          0.000987171   1.52531e-10
20   13       2       flee          9.61631e-07   1.39126
21   11       1       run           0.00098919    0.00686603
22   12       1       overcome      0.00367263    2.77359
23   13       2       stand         3.80346e-10   3.30199
24   13       3       bring         2.89545e-10   2.66922
25   13       3       go            5.49077e-11   4.37597
26   13       2       build         9.64597e-07   1.39489
27   15       1       understand    8.81937e-16   2.8662

Table 6.7: Classes derived from the English past tense.

u = sping                u = spling
v         p(v | u)       v          p(v | u)
spung     0.930          splung     1.92e-10
spang     0.070          splang     3.4e-8
spinged   1.6e-4         splinged   1.0

Table 6.8: Mild productivity of a strong verb class.

Data                     Token accuracy   Type accuracy   Type Errors
Complete Ling Data set   99.96%           99.57%          6
BNC test data            99%              99%             1

Table 6.9: Results on the English past tense data set. The single error in the BNC test set was the pair forbid/forbad, which was over-regularised.

Suffix   Joint probability
+Id      9.4e-13
+d       3.4e-6
+t       1.0e-14

Table 6.10: Probabilities of the different suffixes for aid (UNIBET ed).

Word   +d        +t        +Id
sem    2.0e-4    4.7e-15   1.6e-16
sep    1.3e-12   3.6e-4    1.6e-8
sed    2.7e-6    7.3e-15   5.9e-4
sek    1.7e-12   4.4e-4    6.9e-8

Table 6.11: p(u) for the three regular sub-models, evaluated for four test words.

Table 6.9 summarises the results of the evaluation. On a separate test set of 120 words selected at random from comparatively rare words (frequency of 10) from the BNC and manually transcribed, the algorithm made a single error: forbid was given the regular past instead of the correct irregular past forbad. On the full Ling data set, there were 6 errors out of 1388 pairs (note that most of the data had already been seen during the training process): withhold, wet, overtake, overhear, bid and aid. It over-regularised all of the irregular ones. For the (regular) verb aid (UNIBET ed), it produces the completely incorrect form aidd, in the UNIBET notation edd. It is worth examining this error. On close inspection (Table 6.10), it appears that the PHMM responsible for the +Id suffix (model 3 above) generates it with a much lower probability than the default +d inflection. The problem here is caused by the fact that the +d model applies to almost every string. It is not blocked by the more specific +Id model because in the training data there are no words that start with e and take the +Id ending – the only other words in the data that start with e are aim and ache, which take d and t respectively. Thus in spite of the smoothing it is still over-trained. In other runs with the same data set, normally the +d model learns that it only applies to words with a particular ending. In this case, even though it has not, it is still generally blocked by the other more specific rules.

Table 6.11 shows how the various regular classes compete to produce the right answer. The table shows the values of p(u) for the three regular sub-models, calculated for four test words. Note that here, the values of p(u) allow the overall model to select the correct form.
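To make this competition concrete, here is a hedged sketch of one way the selection could work: each class assigns a marginal probability to the base form, and weighting this by the class's productivity λ picks the class whose transduction is then applied. The interface (marginal, transduce) is assumed purely for illustration and is not the thesis's actual implementation.

```python
def choose_inflection(u, submodels, lambdas):
    """Pick the class whose productivity-weighted marginal probability of the
    base form u is highest, and let that class's PHMM produce the inflected
    form. submodels[i].marginal(u) stands in for p_i(u) and
    submodels[i].transduce(u) for the most probable output of class i; both
    are illustrative assumptions."""
    best = max(range(len(submodels)),
               key=lambda i: lambdas[i] * submodels[i].marginal(u))
    return submodels[best].transduce(u)
```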

6.7.2 German Plural

German has a much more interesting and complex system of inflectional morphology (Clahsen, Rothweiler, Wöst, & Marcus, 1992; Marcus, Brinkmann, Clahsen, Wiese, & Pinker, 1995; Cahill & Gazdar, 1999). I will concern myself with the formation of the noun plural. This has a number of interesting properties, two of which present special challenges for learning algorithms. The first is the presence of productive non-concatenative processes such as the umlaut, which changes a vowel in the root (back vowels become front vowels), and the affected vowel may be followed by, for example, a consonant cluster. Thus this cannot be modelled naturally by the sorts of techniques that have been used in symbolic learning paradigms. The second is the troubled status of the default +s inflection. As Clahsen et al. (1992, p. 225) say:

The noun plural system in German is particularly interesting, because most nouns have irregular plurals in German, and the regular (default) plural is less frequent than several of the irregular plurals. Thus it is unclear how a language learner determines whether German even has a regular plural and if so what form it takes.

This will require a little bit of background to explain properly. As previously mentioned, there is a great deal of controversy over whether people use a single method for processing morphology, the single-route hypothesis, or whether they use two methods, the first a symbolic rule-based approach and the second a pattern associator, being used for regular and irregular morphology respectively, the dual-route hypothesis (Pinker, 1991). Rumelhart and McClelland (1986a) proposed a model where both regular and irregular words were processed by a single model. Dual-route advocates have therefore spent a great deal of time showing that certain types of models cannot cope with certain constructions, in particular the German +s plural inflection. This is considered to cause particular problems because it is claimed that it is the default inflection, but is comparatively rare. Cahill and Gazdar (1999) agree, stating

This latter [the +s inflection] may seem a curious choice for the default plural suffix for German since it occurs much less frequently than the other plural suffixes. However, we are persuaded by the extensive linguistic evidence given by Clahsen and his collaborators that +s is indeed the default plural suffix for German: it is the suffix that standardly occurs with surnames, product names, acronyms, truncated nouns, unassimilated borrowings, foreign words, derived forms, neologisms, onomatopoeic nouns and nouns formed from VPs and APs.

However, while default inflections must be regular, non-default inflections need not be irregular. In fact there are several other German noun inflections that are also perfectly regular. There appears to be a deliberate attempt by proponents of the dual-route hypothesis to blur the distinction between default processes and regular processes, which we can see in the quote by Clahsen et al. above. Empirically it is not the case that new words are put into this default class. Köpcke (1988) (discussed also by Bybee (1995)) presents a fascinating study on this issue. He selected a group of German native speakers, without dialect variation, and asked them to produce the plurals of 50 nonce words. The results were surprisingly divergent. Only for some types of words, nouns ending in a full vowel, was the +s inflection used frequently (69%). For other types of words, there was almost unanimous agreement on the plural form, but for many types of nonce word there was substantial disagreement among the speakers as to what the plural should be. For example, neuter nouns with the suffix -lein, for example das Poftlein, produced the following distribution of plural forms:

+en         6%
+e          19%
unchanged   51%
+s          20%
+r          3%

i    states   words   example           λ
0    8        1389    A                 0.259059
1    9        1557    Abbitte           0.261867
2    9        1415    Aal               0.226159
3    9        384     Abonnement        0.231288
4    12       2663    Abberufung        0.221994
5    12       4       Angst             2.39063e-10
6    12       252     Apt               0.216562
7    12       139     Abbruch           0.251388
8    14       127     Abstieg           0.246547
9    13       1       Hammer            0.000985667
10   12       92      Abend             0.263093

83   24       1       Atlas             0.000978075
84   24       2       Blumenstraus      3.22659e-09
85   24       3       Adverb            0.354987
86   24       1       Moos              0.000980299
87   25       2       Omen              0.000962894
88   27       1       Wasserglas        1
89   26       1       Strandbad         0.000963431
90   27       1       Staatsoberhaupt   2.60003e-09
91   26       2       Arthritis         1.47099e-18
92   26       1       Bergmann          0.000969702
93   28       1       Werbefachmann     1
94   27       1       Index             0.000970578

Table 6.12: First ten and last ten classes induced from the German plural data set.

So as far as I am concerned, if a model trained on the German plural produces the +s form for new words uniformly, this would be a point against it rather than a point in favour.

The data set here was generated from the CELEX lexical database on CDROM (Baayen, Piepenbrock, & van Rijn, 1993). I generated all pairs of singular and plural nouns from the database, which gave a data set of 17076 pairs. Using the associated frequency information, I defined a simple distribution. I then sampled 20,000 tokens for a training set, which has 4709 distinct pairs, and then sampled 100,000 tokens for a held-out data set, which gave 9346 distinct pairs. The algorithm produced 94 classes, with many classes corresponding to individual irregular words, such as class 83, which produces the pair Atlas and Atlanten. Table 6.12 shows the first and last ten classes produced. A noticeable difference from the English results is the larger number of classes with high values of λ, i.e. productive classes. Table 6.13 shows 10 pairs of strings randomly generated from the model for class 7. Note that this model produces only a single vowel change; since the phonemes are treated as atomic symbols, the algorithm produces a separate model for each vowel change. It would certainly be possible to alter the model to accommodate structured symbols. Note also that the model changes all occurrences of u to ü, not just the terminal ones, and adds a schwa with certain endings.


u    v    p(u, v)

Table 6.13: 10 pairs of strings randomly generated from model 7 of the German plural model.

In spite of the reservations expressed above, I did evaluate it in the straightforward way. On the complete data set, it scored 14,222 types right and 2,854 wrong, an accuracy of 83.2%; in terms of tokens, 944,135 right and 19,438 wrong, an accuracy of 98.0%. Many of these had in fact been seen before, either in the original data set, or in the held-out data. These results are objectively rather poor; this is partly due to the fact that the algorithm does not have access to the gender of the words, which is an important factor in determining the correct plural form. In addition, the prevalence of compound nouns complicates matters. This could be resolved in a number of different ways: most simply, the data could be restricted to monomorphemic nouns. Alternatively, one could insert some segmentation symbols into the sequence of phonemes to indicate the boundaries.

6.7.3 Arabic Plural

Arabic has a complex system of morphology based on the system of triconsonantal roots that is common in Semitic languages. It has three numbers, the singular, the dual and the plural. The plural comes in two forms, the sound plural or pluralis sanus, and the broken plural or pluralis fractus, so called because (Wright, 1967):

. . . it is more or less altered from the singular by the addition or elision of conso- nants, or the change of vowels.

The Arabic plural is an extremely complex system. McCarthy and Prince (1990) analyse it in the framework of Prosodic Morphology, capturing significant generalisations and greatly reducing the apparent complexity. Here our task is not so much to reduce the complexity of the system, but rather to model the process using as much complexity as we need. The only motivation for reducing complexity would be to improve the generalisation of the model. There are a couple of preliminary points I want to make before I proceed to a description of the plural itself; here I am paraphrasing some standard reference works on Arabic (Comrie, 1987; Versteegh, 1997). First, Arabic does not really exist as a single language: in Arab-speaking countries there is a comparatively rare system of diglossia where two variants of the language are spoken. Each country or geographical region has a distinctive dialect, which is often incomprehensible to the speakers of another dialect. The dialects are normally grouped into five different families (Versteegh, 1997, p. 145): Arabian, Mesopotamian, Syro-Lebanese, Egyptian and Maghreb. These are the dialects that are spoken in the shops, at home and in other informal situations. In addition there is a standard, more formal dialect, called Modern Standard Arabic, based on the classical form of the language, that is more or less the same throughout the Arab-speaking world, which is learnt later on through formal instruction and is heavily constrained by religious and political factors. Thus in general Arab speakers are not native speakers of the standard dialect; they are native speakers of the local dialect. It is traditionally thought that the Bedouin speak the purest form of Arabic (Versteegh, 1997, pp. 63–64), though the Cairo dialect of Egyptian Arabic has become widely understood because it is the centre of the film industry. The results I am presenting here are based on dictionaries, primarily the Hans Wehr (Wehr, 1979) dictionary of Modern Standard Arabic. Secondly, Arabic is normally written without most of its vowels. Since the roots of words are based on consonants, this is possible in a way it wouldn't be in English. For obvious reasons, I will be working with fully vocalised transcriptions. Thirdly, the broken plural is also productive: recent loan words that satisfy various phonological criteria will take the broken plural (McCarthy & Prince, 1990). As previously mentioned, the plural comes in two forms. The first, the sound plural, is primarily formed by suffixation, with minimal changes to the stem. The second type, the broken plural, is formed by a variety of different, sometimes quite radical, changes to the vowels and consonants in the stem. Moreover the selection of which modification is used is dependent, among other things, on overall prosodic properties or patterns of the whole stem. I will give some concrete examples now to illustrate this. I will use C to denote a consonant and V to denote a vowel. There are three vowels in Arabic, a, i and u; I denote the long vowels with capital letters. One process takes stems of the pattern CaCC to CuCUC, where the three consonants map to each other in sequence; so nafs becomes nufUs, and bank becomes bunUk. Another process takes stems of the pattern CVCCVC to CaCACiC, so jundub becomes janAdib and zalzal becomes zalAzil.
There are a number of other patterns like this, but the situation is complicated by the fact that many nouns have several possible broken plurals as well as a sound plural. The sound plural, on the other hand, is a suffix of Un or At depending on the gender. I used a data set prepared by Ramin Nakisa (Plunkett & Nakisa, 1997), to whom I am very grateful. It does not include frequency information, and is quite small, 859 singular–plural pairs, so I was not able to examine the productivity of the various classes. It contains a mixture of sound and broken plurals taken from the Wehr dictionary, with some errors either in the transcription process or, according to a native informant, in the dictionary. The data is in a non-standard ASCII transcription. Capital vowels (A, I and U) correspond to long vowels. On this data set the algorithm produced 56 classes, the most frequent of which are shown in Table 6.14. As can be seen, the algorithm separated the data into various classes with reasonable accuracy. Table 6.15 shows ten randomly generated pairs of words from the model for class 7. This submodel deals with a particular non-concatenative transduction, mapping singulars of one CV-pattern to the corresponding plural pattern. Some of the other models model single transductions. Others, unfortunately, model mixtures of different transductions. This is perhaps to do with the size of the data set.

i    states   words   description
0    9        31
1    10       72
2    11       73
4    11       9
5    12       56
7    12       47
10   12       10
11   12       17
13   12       9
14   14       36
15   14       15      mixture
16   14       21
17   14       14      mixture
22   14       17
24   14       17
26   14       9
27   14       21
28   14       12
29   14       5
30   14       11
34   15       10
39   15       6
40   16       5       mixture
44   15       7       mixture

Table 6.14: Classes with five or more words produced by the algorithm on the Plunkett and Nakisa data set. A, I and U are long vowels; a, i and u are short vowels. All other symbols are consonants. There were 56 classes in all. Where possible, I have tried to describe the transduction using the CV-template. Here C refers to any consonant, and V refers to any vowel, though there are often restrictions on the range of vowels or consonants that can be used in each position. Where this is not possible I have just put 'mixture'.


u    v    p(u, v)

Table 6.15: 10 pairs of words randomly generated from model 7 from the Arabic plural model.

6.8 Partially Supervised Learning

Partially supervised learning refers to the learning paradigm where the learning algorithm is presented with two lists of words, rather than with a list of pairs of words. The algorithm must then work out which word is aligned with which other word. In a realistic learning situation, the two lists will be of different lengths, with various forms of noise, errors and omissions. I will first present an algorithm which will work in an artificial situation where there is an exact match between the two sets. This is mathematically quite clean, but hopelessly unrealistic. I will then show how it can be extended to deal with a more imperfect and realistic situation.

6.8.1 Perfect Partially Supervised Learning

In this situation we are concerned with the artificial situation where we have a list of words and their inflected forms, but we do not know the one-to-one mapping between them. For example, we might have two lists (cat, dog, cow) and (dogs, cows, cats), and we would have to learn the alignment between the two sets, namely the one-to-one mapping (cat, cats), (dog, dogs), and (cow, cows). If we already know the morphology, this is straightforward, under very reasonable assumptions. If we know the form the morphology takes, e.g. suffixation or prefixation, then it is again trivial – we can merely sort the strings alphabetically, or sort the reversed strings alphabetically, and this will give us a quick result. But in general it is not trivial to learn the alignment and simultaneously the morphological process that gives rise to that alignment. In this section I will present an algorithm based on the EM algorithm that can do this.

Let us suppose we have two sets, each of which has n words, U = (U_1, ..., U_n) and V = (V_1, ..., V_n). We want to find a PHMM, M, and an alignment π that maximises the probability of the observed data. The alignment π is an element of the set of permutations of n elements, which we can consider as n × n permutation matrices. Recall that a permutation matrix is a square matrix all of whose elements are one or zero, such that each row and each column has exactly one one, the rest being zeros. Alternatively we will write π(i) for the index of the element in V that U_i is mapped to – so the mapping will be U_1 → V_{π(1)}, ..., U_n → V_{π(n)}. We can consider this permutation as a hidden variable; we can train a model using the EM algorithm and then find the value with the highest posterior probability given the data.

If we denote the hidden variable, the permutation, as X, as is usual we will have

p(U, V) = \sum_X p(X) \, p(U, V | X)    (6.45)

There are n! possible values for X, so we can set p(X) = 1/n!. The probability of the data for a particular permutation will just be the product of the joint probabilities of the n pairs:

p(U, V | X = \pi) = \prod_i p_M(U_i, V_{\pi(i)})    (6.46)

The overall joint probability will then be

P(U, V) = \sum_\pi p(\pi) \prod_{i=1}^{n} p_M(U_i, V_{\pi(i)})    (6.47)

Now this sum of the n! products of permutations of a matrix is called the permanent of a matrix (Bhatia, 1996). It is similar to the determinant of the matrix, but without the alternating minus signs. I will use Q(M) to denote the permanent of an arbitrary square matrix. Using this notation we can write the probability more succinctly:

P(U, V) = \frac{1}{n!} Q(P_M(U, V))    (6.48)

where P_M(U, V) is the matrix of joint probabilities of U and V with respect to the model, i.e. the matrix whose (i, j) element is p_M(U_i, V_j).

When we train our model using the EM algorithm, we weight each transition by the posterior probability of the hidden variable given the data, p(X | U, V). Thus in this case this reduces to weighting each pair (U_i, V_j) by the posterior probability that U_i is aligned with V_j. Intuitively, as the PHMM more closely models the transduction, the posterior probability matrix will become closer to the correct permutation matrix, and thus the model will be trained on a more correct alignment. This is guaranteed to converge by the EM theorem.

The (i, j) element of the posterior matrix will be

P(\pi_{ij} | U, V) = \frac{\sum_{\pi : \pi(i) = j} \frac{1}{n!} P(U, V | \pi)}{P(U, V)}    (6.49)

So on the top we have the sum of the (n - 1)! permutations that match i with j, and on the bottom we have the sum of all n! permutations. This will just be

P(\pi_{ij} | U, V) = \frac{p_M(U_i, V_j) \, Q(P_M^{ij}(U, V))}{Q(P_M(U, V))}    (6.50)

where we use P_M^{ij} to mean the (i, j)th minor of the matrix, i.e. the (n - 1) by (n - 1) square matrix formed by removing the ith row and the jth column from the probability matrix. So this has quite a nice interpretation as the product of two terms: p_M(U_i, V_j), which is the effect of how likely the matching is between U_i and V_j, and the remaining term, which represents how likely it is to match the remaining n - 1. This is an important factor. For example, suppose we had two sets of strings (a, aa, aaa) and (aa, aaa, aaaa). Clearly the correct matching is the order they are in, together with a transduction of appending a. It is however also possible to match the aa with aa and aaa with aaa, but then one has to match a with aaaa; the constraint that it is a one-to-one matching would penalise this erroneous mapping. Note that the matrix of posterior probabilities will be doubly stochastic – that is to say, all of its rows and columns will sum to 1, just like the permutation matrices. This matrix is the posterior estimate of the permutation matrix. However there is a large problem that I have so far neglected to mention: equation 6.50 does not help, since it is (almost certainly) not possible to calculate the permanent efficiently (in polynomial time), as the calculation is #P-complete (Valiant, 1979). The determinant can be calculated in O(n^3) time, which gives rise to various techniques for approximating the permanent (Jerrum & Sinclair, 1989; Barvinok, 1999). This is a very active area of research, with a number of open questions with deep relations to other areas of mathematics. One straightforward way of estimating the permanent, the Godsil–Gutman estimator, is to randomly flip the signs of all the elements in the matrix, after having taken the square root of all of the elements first. Then the expectation of the square of the determinant of this randomised matrix is the permanent of the original matrix. More sophisticated techniques involve lifting the matrix to a more complex algebra, to the complex numbers, or quaternions, or even higher dimensional Clifford algebras. I decided to use a much simpler method. Since all of the elements of the matrix are non-negative, the permanent will clearly be non-negative, and it will be less than or equal to the product of the sums of the rows. If we expand out the product of the sum of each row, we will clearly see every permutation in there. Similarly, the permanent will be less than the product of the column sums. I therefore estimate the permanent of a matrix M by

Q(M) \approx \min\left( \prod_i \sum_j M_{ij}, \; \prod_j \sum_i M_{ij} \right)    (6.51)

We can use the same technique to calculate the permanents of all the minors of the matrix. Naively this might take O(n^4) time (n^2 minors, times O(n^2) for each minor), but if we cache the total row and column sums, and then subtract for each minor, we can do it in O(n^3), which makes it quite practical. This is the technique I have used here, but I have since identified a much better technique. The matrix map that takes each element

p_{ij} \mapsto \frac{p_{ij} \, Q(P^{ij})}{Q(P)}    (6.52)

is known as the Bregman map (Bregman, 1967; Beichl & Sullivan, 1999), and though it is intractable to compute exactly, the technique of Sinkhorn scaling (Sinkhorn, 1964) produces a doubly stochastic matrix that is, under reasonable conditions, an approximation to the Bregman map. The method of Sinkhorn balancing is intuitively quite straightforward; we want to scale a positive matrix so it is doubly stochastic. If we normalise the row sums, we will have a matrix that is row-stochastic, i.e. has its row sums equal to unity, but not necessarily column-stochastic. If we then normalise the column sums, we will have a matrix that is column-stochastic but probably not row-stochastic. If we continue in this way, alternating normalising the rows and normalising the columns, we converge to a doubly stochastic matrix. Soules (1991) established that it converges, under reasonable assumptions, linearly, which gives an overall complexity of O(n^3). This has relations to some other techniques for finding graph matchings, used to solve the travelling salesman problem (Gold & Rangarajan, 1996). In future work I will use this technique (Clark, 2001b).
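As a minimal sketch of the estimator just described, the following NumPy code computes the row/column-sum bound on the permanent, the resulting approximate posterior alignment matrix (using cached row and column sums for the minors), and the Sinkhorn balancing alternative; the function names and interfaces are assumptions for illustration only, not the actual implementation.

```python
import numpy as np

def permanent_bound(M):
    """Estimate the permanent of a non-negative matrix by the smaller of the
    product of its row sums and the product of its column sums (equation 6.51)."""
    return min(np.prod(M.sum(axis=1)), np.prod(M.sum(axis=0)))

def posterior_alignment(P):
    """Approximate posterior alignment matrix (equation 6.50): entry (i, j) is
    P[i, j] * Q(minor_ij) / Q(P), where P holds the joint probabilities
    p_M(U_i, V_j) and Q is the bound above. Row and column sums are cached so
    each minor costs O(n), giving O(n^3) overall."""
    n = P.shape[0]
    row_sums, col_sums = P.sum(axis=1), P.sum(axis=0)
    Q_full = min(np.prod(row_sums), np.prod(col_sums))
    post = np.zeros_like(P)
    for i in range(n):
        for j in range(n):
            r = np.delete(row_sums - P[:, j], i)   # row sums of the (i, j) minor
            c = np.delete(col_sums - P[i, :], j)   # column sums of the (i, j) minor
            post[i, j] = P[i, j] * min(np.prod(r), np.prod(c)) / Q_full
    return post

def sinkhorn_balance(M, iterations=50):
    """Sinkhorn balancing: alternately normalise rows and columns of a positive
    matrix until it is approximately doubly stochastic."""
    M = np.array(M, dtype=float)
    for _ in range(iterations):
        M /= M.sum(axis=1, keepdims=True)
        M /= M.sum(axis=0, keepdims=True)
    return M
```

The matrix returned by posterior_alignment is only approximately doubly stochastic; Sinkhorn balancing it would give the Bregman-map-style estimate mentioned above.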

So, given a model, we can compute the posterior probabilities, and then use those to weight the contributions of the models. The algorithm then proceeds as follows. Start with a random model. Calculate the n by n matrix of joint probabilities p_M(U_i, V_j). Calculate the n by n matrix of posterior probabilities. Train the model M using these as weights. Repeat until the matrix of posterior probabilities has converged to a permutation matrix. This algorithm is rather slow, since it requires a computation of the training algorithm for each of the n^2 pairs. I tested it on the BNC test data set for the English past tense, which had 120 pairs, and it converged in 14 iterations to the correct alignment. As mentioned previously, this is an artificial situation; I shall not investigate it further.
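Purely for concreteness, the loop just described might be sketched as follows; model.joint_prob and model.train are hypothetical stand-ins for the PHMM routines, not an actual API, and posterior_alignment is the approximation sketched earlier.

```python
import numpy as np

def align_em(us, vs, model, max_iters=20):
    """Alternate between estimating the alignment posterior and retraining the
    PHMM on all pairs, weighted by that posterior, until the posterior matrix
    is (close to) a permutation matrix."""
    post = None
    for _ in range(max_iters):
        P = np.array([[model.joint_prob(u, v) for v in vs] for u in us])
        post = posterior_alignment(P)
        model.train([(u, v) for u in us for v in vs], weights=post.ravel())
        if np.allclose(post, np.rint(post)):   # entries have converged to 0/1
            break
    return post
```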

6.8.2 Imperfect Partially Supervised Learning

In a more realistic situation, there will be numerous errors and omissions. We thus want to align a subset of one set with a subset of the other set. We can adapt the previous technique to this more complex situation.

Given two sets of words, U and V, and models for each set, p_U and p_V, we could model the set of data, assuming that there is no relationship between them, with probability function

p(U, V) = \prod_{u \in U} p_U(u) \prod_{v \in V} p_V(v)    (6.53)

If we assume there is a relationship π between two subsets U' ⊆ U and V' ⊆ V which have the same size, that are generated by a joint model M, we could write this as

p(U, V | X) = \prod_{u \in U \setminus U'} p_U(u) \prod_{v \in V \setminus V'} p_V(v) \prod_{u \in U'} p_M(u, \pi(u))    (6.54)

As long as p_M(u, v) > p_U(u) p_V(v) this will be an improvement. This is similar to the approach of Allison et al. (1999). It is important to take into account the regularities in both sets, by explicitly modelling them, in order to avoid spurious correlations.

Proceeding as before, we can define an indicator function π(i, j) which indicates whether u_i ∈ U', v_j ∈ V' and π(u_i) = v_j. We also define φ(i) and ψ(j) to represent not being members of U' and V' respectively. Therefore φ(i) + Σ_j π(i, j) = 1 and ψ(j) + Σ_i π(i, j) = 1.

Using these indicator functions we can write

\log p(U, V | X) = \sum_i \phi(i) \log p_U(u_i) + \sum_j \psi(j) \log p_V(v_j) + \sum_{i,j} \pi(i, j) \log p_M(u_i, v_j)    (6.55)

We can rewrite this as

\log p(U, V | X) = \sum_i \log p_U(u_i) + \sum_j \log p_V(v_j) + \sum_{i,j} \pi(i, j) \log \frac{p_M(u_i, v_j)}{p_U(u_i) p_V(v_j)}    (6.56)

We will want to weight the training algorithm by the expectation of π(i, j) with respect to the posterior distribution. That is, we will want to train the model with the pair (u_i, v_j) with the weight equal to E[π(i, j)], where the expectation is taken with respect to the distribution p_old(X | O), using the notation of the previous section.

Since this is either 1 or 0, this expectation is just p(π(i, j) = 1 | O). We can estimate this:

p(\pi(i, j) = 1 | O) = \frac{\sum_{X : \pi(i, j) = 1} p(X) \, p(U, V | X)}{\sum_X p(X) \, p(U, V | X)}    (6.57)

I neglect p(X), assuming it is constant. If we define a matrix M such that the (i, j)th element is

M_{ij} = \frac{p_M(u_i, v_j)}{p_U(u_i) p_V(v_j)}    (6.58)

we can estimate the top and bottom as sums of products of elements of M. In this case they do not correspond to permanents, but we can nonetheless approximate them in a similar way. The sum is over permutations of subsets; so as long as we pick at most one element from each row and each column, we can take any product. However we do not have to pick one from each row or column, so instead of multiplying the row sums or column sums, we take the product of each row sum plus one. If we denote this function by R, we can say

\sum_X \prod_{i,j} M_{ij}^{\pi_X(i, j)} \le R(M) = \min\left( \prod_i \Big(1 + \sum_j M_{ij}\Big), \; \prod_j \Big(1 + \sum_i M_{ij}\Big) \right)    (6.59)

where X now ranges over all 0–1 matrices with at most one 1 in each row and column, and therefore

p(\pi(i, j) = 1 | O) \approx \frac{M_{ij} \, R(M^{ij})}{R(M)}    (6.60)

where M^{ij} is the (i, j)-minor of the matrix M.
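A hedged sketch of this variant of the bound, analogous to the earlier permanent code; it is an illustration under the same assumptions, not the thesis's implementation.

```python
import numpy as np

def r_bound(M):
    """R(M) from equation 6.59: the smaller of prod_i (1 + row sum) and
    prod_j (1 + column sum), used instead of the permanent bound when words
    may also be left unaligned."""
    return min(np.prod(1.0 + M.sum(axis=1)), np.prod(1.0 + M.sum(axis=0)))

def partial_posterior(M):
    """Approximate p(pi(i, j) = 1 | O) = M[i, j] * R(minor_ij) / R(M),
    computed naively here for clarity (the minors could be cached as before)."""
    m, n = M.shape
    R_full = r_bound(M)
    post = np.zeros_like(M)
    for i in range(m):
        for j in range(n):
            minor = np.delete(np.delete(M, i, axis=0), j, axis=1)
            post[i, j] = M[i, j] * r_bound(minor) / R_full
    return post
```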

This still requires us to calculate p_M(u_i, v_j) for each pair of words, which is very computationally expensive. It is also unnecessary; if two words have radically different sequences of letters we don't need to compare them. For example, if we have some singular and plural nouns, we don't need to consider the possibility that the plural of catapult is oxen (or that the past tense of go is went?). Accordingly, for each string I calculate the letter distribution, a simple vector of occurrence counts of each letter, and use a cosine metric with a manually chosen cutoff to accelerate the algorithm. This is a pure optimisation that should have no effect on the algorithm other than speed. There is a slightly cleaner way of formalising this which maintains the connection with the permanents. I shall only sketch this here, as I do not use it to produce the results later on.

Suppose U has m elements and V has n elements; then we have a matrix of the joint probabilities p_M(u_i, v_j), which we can call M_J. We can also define an m × m diagonal matrix, corresponding to the probabilities according to the model of U, which we can call M_U, and an n × n diagonal matrix for the probabilities of V, M_V. If we also create a matrix with every element 1, M_1, then we can form an (m + n) × (m + n) square matrix thus:

M = \begin{pmatrix} M_U & M_1 \\ M_J & M_V \end{pmatrix}    (6.61)

Then every permutation of this matrix corresponds to a particular choice of the alignment, and we can use the same permanent techniques on this matrix as we did before. There is one substantive difference, which is that we end up with an additional factor of k! in the joint probability, where k is the number of words aligned. This is caused because there are k! paths through the all-ones submatrix M_1 in the upper right of the composite matrix. Formally, we can accommodate this by adjusting the prior probability of the alignment p(X), and informally it is fine, because it will give a bonus to alignments that align large numbers of words, which is something that we want to encourage.

Total                               2163/1394
Total to be aligned                 1394
Total aligned                       1286
Correctly aligned                   1276
Incorrectly aligned                 10
Unaligned                           108
p(u | v) accuracy on BNC test set   119/120

Table 6.16: Results of alignment on the unaligned Ling data set. This gives a precision of 99.2% and a recall of 91.5%. The results on the BNC test set are evaluated with p(u | v), the probability of getting the right base form given the inflected form.

Suffix   Aligned   Total   Recall
d        657       661     99.3
t        257       263     97.7
Id       324       335     96.7

Table 6.17: Performance on regular verbs. Precision was 100% – no regular past was aligned with an incorrect base.

6.9 Experiments on Partially Supervised Learning

6.9.1 Ling data set

Before processing, the Ling data set consists of an alphabetical list of English verbs with their morphology and phonology marked. Extracting all base forms and all past-tense forms produced two sets of words with 2163 and 1394 members respectively. I trained two 7-state HMMs, one on each set of words. I then ran this algorithm for 10 iterations, at the end of which the vast majority of the posteriors were either zero or very close to one. I considered a word in the set of base verbs to be aligned with another word if the posterior probability was greater than 0.1. No word in the base set was aligned with more than one word in the past tense set. Table 6.16 summarises the results. As can be seen, the alignment proceeded with very high precision, and good recall. To further evaluate the resulting model, I tested how accurately the transduction modelled the morphological process. On the BNC test set of 120 infrequent words, it scored 119 correct, with the single error on an irregular word, evaluated according to the less stringent criterion of p(u | v) > 0.5. On regular verbs, shown in Table 6.17, it performed with 100% precision; no regular past form was aligned incorrectly, and the recall was very high as well. It aligned a respectable number, 38, of irregulars correctly.

The 10 errors it made are shown in Table 6.18. Three of these errors are attributable to errors in the data set: tore was incorrectly transcribed to have the same vowel as toe, did was not included in the data set, and fort is not a verb in my dialect of English. All of the other errors are of regular verbs, whose past tense does not occur in the data set, aligned with an irregular past tense that sounds similar. Note that if we use a stricter measure of alignment based on the conditional probability we can eliminate these errors, at the cost of not aligning irregular verbs. Note this model only has seven states, five excluding the initial and final states.

Base     Past
brand    ran
do       drew
fort     fought
group    grew
lend     led
ready    read
send     said
soak     spoke
sponge   spun
toe      tore

Table 6.18: Incorrectly aligned pairs on English data set.

Total                 2450/3101
Total to be aligned   2450
Aligned               277
Correctly aligned     196
Incorrectly aligned   81
Unaligned             2173

Table 6.19: Result of first iteration of alignment on Arabic data set. Precision was 70.8%.

6.9.2 Partially Supervised Arabic

I then performed some experiments with Arabic, using a data set kindly provided by John McCarthy for McCarthy and Prince (1990). This was a much more difficult test than the previous experiment. First of all, the data set is much more noisy, with various errors and omissions, as well as numerous multiple plurals for particular singular forms. Secondly, the morphological system is very much more complex, as previously discussed. This is a larger and more complex data set than the one prepared by Ramin Nakisa that I experimented on above. Table 6.19 summarises the results of the algorithm on this data set. These are rather disappointing. First of all, the very low recall is not an issue: we can run the algorithm repeatedly, excluding at each step the words previously aligned, in a similar manner to the technique used for supervised learning earlier. The poor precision is a more serious concern. Examination of the 81 erroneously aligned pairs (see Table 6.20) showed that many of them clearly have the same consonantal root. My expertise in Arabic is insufficient to identify the problem, but it appears to be related to the structure of the data set.

Singular    Plural

Table 6.20: First ten incorrectly aligned singular-plural pairs from Arabic data set.

6.9.3 Experiments on automatically induced classes

In this section I will discuss experiments made with the automatically induced classes from Chapter 5. I selected two classes as the basis for my experiment, which corresponded to singular and plural nouns. After selecting only those types that occurred at least 5 times in each cluster, I had two sets of words with 2931 and 1946 words in them respectively. A crude measure suggested that there were only 304 pairs of words to be aligned, where a word W occurred in the singular set and the word W with the suffix S or ES occurred in the plural set. Note that these words are all in normal orthography rather than in phonemic transcription. I excluded all other possible plural pairs, such as irregulars (MAN, MEN) or words ending in Y where it is changed to IES, from the calculation of the possible alignments. Thus the data set is extremely noisy; there are numerous errors and omissions. Table 6.21 shows a sample from the data set. Table 6.22 summarises the results. As can be seen, even in this extremely noisy situation, the algorithm performed reasonably well. The incorrectly aligned ones included SQUID, QUID (my favourite), BEDSIDE, BESIDES and CHANCEL, CHANCE, as well as various pairs where the singular form was in the plural class and vice versa. With the value of the exponent at 1, the recall improved but the precision was much lower: the output included many pairs where one word was roughly an anagram of the other, for example YELLOW and SLOWLY. Effectively, the only thing making it work at all was the prefiltering with the letter distributions. Nonetheless, this validates the approach, and demonstrates that it can be used in very noisy situations. In a complete system, we would take all of the clusters, and try this algorithm on each pair of them. This should suffice to identify the major inflectional paradigms, at least in as simple a language as English.
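Since the letter-distribution prefilter carries so much of the weight here, a minimal sketch of it may be helpful; the function names are assumptions, and the 0.9 cutoff is the manually chosen cosine threshold reported in Table 6.22.

```python
from collections import Counter
import math

def letter_vector(word):
    """Simple vector of letter occurrence counts."""
    return Counter(word)

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def candidate_pairs(singulars, plurals, threshold=0.9):
    """Only pairs whose letter distributions are similar enough are passed on
    to the (expensive) PHMM scoring; everything else is filtered out."""
    svecs = {w: letter_vector(w) for w in singulars}
    pvecs = {w: letter_vector(w) for w in plurals}
    return [(s, p) for s in singulars for p in plurals
            if cosine(svecs[s], pvecs[p]) >= threshold]

# e.g. candidate_pairs(["PARENT", "PARK"], ["PARENTS", "PARTIES"]) keeps
# (PARENT, PARENTS) and discards the dissimilar pairs.
```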

6.10 Applications and Extensions

In this section I discuss various further extensions of this approach to learning morphology using PHMMs.

Cluster 70 (plural)   Cluster 71 (singular)
PARAGRAPHS            PARABASIS
PARENT                PARACHUTE
PARENTS               PARALLEL
PARALLEL              PARASITE
PARISHES              PARENT
PARTICIPANTS          PARISH
PARTIES               PARK
PARTISAN              PARLIAMENTARIAN
PARTNERS              PARMA
PARTS                 PARQUET
PARTICIPANT           PARTICLE
PARTISAN              PARTITION
PARTNERSHIP           PARTY

Table 6.21: Randomly selected portion of the data.

Aligned     323
Errors      35
Correct     288
Unaligned   16
Possible    304

Table 6.22: Results of the alignment on the two induced classes. Exponent of 0.7, and cosine threshold of 0.9. This gives a precision of 89% and a recall of 95%.

Figure 6.7: Diagram of a Triple Hidden Markov Model that concatenates two strings together to make a third.

6.10.1 Number of Channels

Up till now I have considered models that have only two streams of data. It is of course straightforward to extend these models to any arbitrary number of streams of data, replacing the three sets of output parameters with 2^n - 1 sets, where we have n streams. This might be very useful to deal with reduplication. Reduplication in morphology is quite widespread; see the discussion in Raimy (2000). We could handle the Malay data with a Triple Hidden Markov Model, using three streams, if we stipulated that two of the strings were identical, and functioned as input strings. Then, extending the notation in the obvious way, Figure 6.7 shows a Triple Hidden Markov Model that concatenates two strings together to make a third. q_101 is an output that emits the same symbol on the first and third streams, and q_011 outputs the same symbol on the second and third streams. Thus this model will only generate triples of strings where the third string is the concatenation of the first with the second.

6.10.2 Deep stems

As noted above, all of the algorithms here are concerned with learning surface-to-surface transductions. Given a set of lexical strings and the corresponding surface forms, it would certainly be possible to apply the same algorithms. A more interesting alternative, however, presents itself. Given a different alphabet of deep symbols B (phonetic symbols augmented with various additional symbols), we could define the transductions from some unknown string of symbols thus:

p(u, v) = \sum_{w \in B^*} p_1(w, u) \, p_2(w, v)    (6.62)

By using this technique we could learn the lexical strings in an unsupervised framework. We could then dispense with the morphological categories used here, since words with identical surface strings could have different lexical strings. There is a technical difficulty here – how to sum over all the exponentially many strings in B^*. This is quite close to some of the models used in bioinformatics, and some of the techniques used there might be suitable.

6.10.3 Complete paradigms

One of the areas of complexity in morphology is the study of complete inflectional paradigms. In many languages, the full range of morphosyntactic possibilities is patchily covered by available inflections. If one considers the full range of possible combinations of number, gender, person, and case features, possibly including agreement features with several other words or phrases, the number of different morphosyntactic possibilities can grow exponentially. It is often the case that not every possible combination has a distinct phonological realisation. The interaction between the various possibilities is complex, and is completely ignored in the approach I use here, where I take each pair in isolation. Moreover, in morphologically rich languages, the number of examples for each individual pair may be extremely low. It is thus necessary to model the entire inflectional paradigm of a word class at the same time. Kornai (1992), discussing Hungarian, notes that in a corpus of 500,000 words, the most frequent noun occurs in only 51 of the 714 possible forms for nouns. He calculates that to saturate the paradigms of the first 30 nouns, i.e. have every different possibility attested, would require a corpus of 33 billion words. Here, however, the various forms are produced by the composition of rather unchanging morphemes. We can model this in a straightforward way, since our models can perform modifications at specified parts of the stem. Schematically, suppose we have a word that is built up from a root plus four morphemes that express, say, definite/indefinite, number, gender and case, where there are, let us suppose, two possibilities for each, giving 2^4 possible inflected forms. The input to the transduction would then look something like a root followed by a marker for each of the four features. We can then define four transductions, each of which would change one of these morphemes. So we would have a singular-to-plural transduction, a masculine-to-feminine transduction and so on. The singular-to-plural transduction would map any such string with a singular marker to the corresponding string with a plural marker, whatever the values of the other features, and so on. Each of the 2^4 possibilities could be reached by a composition of the appropriate transductions, so instead of requiring 2^4 transductions, we could get by with just 4. The German noun system, on the other hand, has eight possible combinations of syntactic features: four cases, and two numbers. These eight possibilities are realised by at most four distinct forms. Each of the various morphological classes realises the eight possibilities with various transductions, many of which are the same. A straightforward implementation of this would have for each class a set of eight transducers, or perhaps seven, since we can assume one of them will be the input to all of the others. This might be too complex to be learnable. In that case we could have a smaller set of transducers that could be shared between each of these class-specific transducers. Modelling complete inflectional systems would also allow a more thorough evaluation. We can take words where the training algorithm has only had access to a subset of the eight possibilities, an unsaturated paradigm in Kornai's terminology, and ask it to produce the remaining ones. In many cases knowing only a few of the eight will enable a determination of the inflectional class.

6.11 Discussion

In summary, I have demonstrated a novel algorithm for learning finite-state transductions that appears well suited to the task of learning morphology.

6.11.1 Comparison with other learning approaches

The algorithm presented here has a number of advantages compared to other techniques of learning morphology in a supervised setting. First, compared to Neural Network MLP techniques (Rumelhart & McClelland, 1986a; Plunkett & Nakisa, 1997), it does not require the extensive representational engineering and tweaking that these models need, nor does it have the fixed length limits that these models require. In addition MLP models often show poor generalisation, whereas with appropriate smoothing PHMMs generalise very well. Furthermore, PHMMs can only learn a limited range of transductions, whereas the MLP models can learn implausible transductions such as reversals. Secondly, compared to ILP techniques (Mooney & Califf, 1995; Manandhar, Dzeroski, & Erjavec, 1998; Muggleton, 1999), PHMMs do not require pre-specification of the range of possible transductions. In addition PHMMs can be used in unsupervised learning environments, and are robust in the presence of noise. Moreover it is not clear whether these ILP techniques can learn complex non-concatenative transductions such as the Arabic plural system. Thus the techniques presented here appear to have substantial advantages compared to other models. However, strictly as an engineering solution to the problem of morphology, the models that this system produces have a number of disadvantages: the models for highly irregular words effectively memorise the input and output forms in an extremely inefficient form. It would probably be preferable to have a separate system for these, and use the PHMM for the regular and sub-regular processes.

6.11.2 Single versus Dual-route

With respect to the single-route versus dual-route controversy, the current work is of marginal relevance. Which of these two hypotheses is correct is an empirical question about human psychology, which will be settled by psychological evidence. Clearly this approach incorporates both irregular and regular inflections in a single system, which are learned successfully; this refutes the claim that single-route learning models cannot learn morphology correctly. However it in no way establishes that humans use a similar system, or even if they did, that the two sorts of transduction, regular and irregular, are neurologically instantiated in a single system.

6.11.3 Conclusion

With regard to the overall argument and goals of this thesis, I feel these models present a satisfactory solution to the problem of learning morphology. Though they obviously have a strong innate bias to learn particular transductions, they are a general purpose learning algorithm for transductions; as a matter of historical fact, they are not domain specific, since they were originally introduced in the different domain of bio-informatics.

Chapter 7

Syntax Acquisition

7.1 Introduction

In this chapter I present an unsupervised algorithm for the induction of a grammar from tagged text. The unsupervised learning of syntax is one of the most difficult areas of natural language learning, and many researchers believe it to be impossible on a priori grounds. As discussed in Chapter 4, these arguments are of limited validity. The algorithm presented here learns from a sequence of tags. This is clearly inadequate; part of speech tags are too coarse to allow a completely effective syntax induction algorithm, since many syntactic constructions depend on quite specific properties of words. Nonetheless, as we shall see, using part of speech tags is both necessary, in terms of efficiency and data sparseness, and sufficient, in terms of allowing learning of the basic phrase structure of English. I shall present results with two sets of tags; first, I shall use the tagged data from the British National Corpus (BNC) (Burnard, 1995; Aston & Burnard, 1998). Using a set of tags that have been assigned by linguists raises a number of issues that I shall discuss below, but allows a cleaner exposition of the principles and techniques used. I shall then present some more informal results using the set of automatically assigned tags from Chapter 5; this will then be a completely unsupervised algorithm. The rest of this chapter is organised as follows: I discuss in Section 7.2 previous work on the unsupervised learning of syntax, which I divide rather arbitrarily into three overlapping categories: likelihood-based, compression-based and distribution-based. The next section, Section 7.3, discusses the Inside-Outside algorithm for training PCFGs, and presents the results of some simple experiments using this on the small ATIS corpus, together with an analysis of why it doesn't work very well. In Section 7.4 I discuss how the technique of distributional clustering could be used as part of an algorithm. In Section 7.5 I present a mutual information (MI) criterion for identifying constituents, which I justify in a number of ways. Section 7.6 provides a mathematical justification for this criterion. Section 7.7 shows that empirically this criterion does in fact select constituents, and then in Section 7.8 I compare it to various other criteria that have been proposed. I show how this criterion can be incorporated in an MDL algorithm in Section 7.9 and evaluate it in Section 7.10. I then present some more informal results based on automatically derived tags in Section 7.11, and conclude with a discussion and some proposals for future work in Section 7.12.

7.2 Previous Work

There has been quite a lot of work on the unsupervised learning of syntax – too much for me to produce an exhaustive survey. I shall just choose some representative papers, except with the distributional techniques that are closest to my own, where I have tried to be more thorough. I shall ignore the large literature on the inference of particular classes of context-free languages, and restrict myself to the learning of natural languages or artificial approximations. I will discuss this not chronologically but grouped into various types:

Likelihood-based.

Compression-based.

Distribution-based.

7.2.1 Likelihood-based

The first class of systems are likelihood-based: they select the maximum likelihood model using a PCFG. There have been a number of experiments using the EM algorithm – often called the inside-outside algorithm (IO) (Baker, 1979). Early experiments were performed by Lari and Young (1990), Briscoe and Waegner (1992) and a number of other researchers. Pereira and Schabes (1992), Carroll and Charniak (1992) and Charniak (1993) produced some rather discouraging research that seemed to indicate that the fact that the IO algorithm converged to a local optimum meant that it would almost always converge to a linguistically implausible grammar. Other researchers have tried to use more advanced search techniques such as genetic algorithms (Keller & Lutz, 1997) to avoid this problem. I will discuss this approach in more depth in Section 7.3.

7.2.2 Compression-based

This line of work, rather than using the ML criterion, uses an MDL criterion or some equivalent. This is an idea that has been around for some time, but has not really produced satisfactory results. Gerald Wolff, in a series of papers over a number of years (Wolff, 1977, 1980, 1988, 1991), has argued for a notion of learning based on compression. Stolcke (1994) advocates the use of a Bayesian model selection criterion for HMMs and PCFGs but restricts his consideration to artificial languages. Chen (1995) presents an MDL-based algorithm that for technical reasons is limited to regular rather than context-free languages. He presents results on a number of artificial languages that approximate English. He starts from a simple grammar and expands it; Kit (1998) advocates an MDL-based measure for guiding a model that looks for frequent strings in a corpus, i.e. starts with a very large grammar and gradually compresses it. The problem with these techniques is that hypothesising as constituents sequences of tags that occur together more frequently than would be expected (i.e. have high mutual information) does not work: in particular, sequences such as verb preposition, or preposition article, have high mutual information but are clearly not constituents. For example, in the Penn tree-bank, the sequence IN DT has pointwise mutual information 1.3675, and the sequence DT NN has MI 1.266. Thus using this technique to divide a sequence IN DT NN will give the intuitively wrong answer. This MDL gain criterion is in some cases very closely related to the mutual information of the sequence itself, under standard assumptions about optimal codes (Cover & Thomas, 1991).

Suppose we have two symbols x and y that occur n_x and n_y times in a corpus of length N, and that the sequence xy occurs n_xy times. We could instead create a new symbol that represents xy, and rewrite the corpus using this abbreviation. Since we would use it n_xy times, each occurrence would require log(N/n_xy) nats (a nat is log2 e bits). The symbols x and y have codelengths of log(N/n_x) and log(N/n_y), so for each pair xy that we rewrite, under reasonable approximations, we have a reduction in code length of

\Delta L = \log\frac{N}{n_x} + \log\frac{N}{n_y} - \log\frac{N}{n_{xy}} = \log\frac{p(xy)}{p(x)\,p(y)}

which is the pointwise mutual information between x and y.
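As a concrete illustration of this relationship, the following Python sketch (mine, not code from the thesis; the tag stream, counts and function name are invented) computes the code-length reduction ΔL for each adjacent tag pair directly from unigram and bigram counts, using the corpus length N for both, as in the approximation above:

import math
from collections import Counter

def pmi_gain(tags):
    """Illustrative sketch: code-length reduction (in nats) from merging each
    adjacent tag pair, i.e. the pointwise MI log p(xy) / (p(x) p(y))."""
    N = len(tags)
    unigrams = Counter(tags)
    bigrams = Counter(zip(tags, tags[1:]))
    gains = {}
    for (x, y), n_xy in bigrams.items():
        # Delta L = log(N/n_x) + log(N/n_y) - log(N/n_xy)
        gains[(x, y)] = (math.log(N / unigrams[x]) + math.log(N / unigrams[y])
                         - math.log(N / n_xy))
    return gains

# Toy tag stream: pairs such as ('IN', 'DT') score highly without being constituents.
tags = "DT NN VBD IN DT NN IN DT JJ NN DT NN VBZ IN DT NN".split()
for pair, gain in sorted(pmi_gain(tags).items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(gain, 3))

On a real tagged corpus the same calculation reproduces the effect described above: frequent but dependent pairs such as IN DT receive high scores even though they are not constituents.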

I therefore include here the work of Magerman and Marcus (1990): rather than proposing that constituents are the sequences with high mutual information, their algorithm proposes that constituent boundaries are the places where a (generalised) mutual information measure dips, together with a specific distituent grammar that stipulates that a few crucial sequences cannot be constituents. The MDL objective function itself is fine – the problem is that a greedy bottom-up implementation of it will make errors by failing to note the long-distance dependencies that characterise natural language.

7.2.3 Distribution based

The third class of algorithm uses distributional evidence to identify constituent structure. The idea here is that sequences of words or tags that are generated by the same non-terminal will appear in similar contexts. As previously discussed, one of the first attempts to learn syntax used distributional clustering (Lamb, 1961). Lamb works with a small data set of 5000 tokens, and uses an ad hoc measure, following a suggestion of Harris (1955), related to the conditional entropy, to identify constituents. Clearly with these extreme limitations his results are poor, but the basic method is sound. Brill and Marcus (1992) use a measure based on the KL divergence applied to distributional contexts to identify possible syntactic rules. A peculiarity of their approach is that they compare only sequences of two tags to sequences of a single tag. Since single tags are by definition constituents, this has the effect of biasing the algorithm towards producing sequences that are in fact constituents. A weakness of this is that it means the algorithm cannot hypothesise non-terminals for sequences that do not correspond distributionally to a single non-terminal. This is not, I think, cross-linguistically valid, and even in English there are certainly non-terminals, such as non-finite clauses, that we would want but that are not distributionally equivalent to a single word. In addition they hypothesise a particular entropy criterion to filter out some common errors the algorithm produces; this is related to the criterion of Lamb (1961). Finch et al. (1995) present some intriguing preliminary results showing how distributional clustering algorithms can be used to find sets of tag sequences that occur in similar contexts. Their techniques produce some linguistically plausible clusters, but many implausible ones, and they do not demonstrate a grammar induction algorithm. Nevertheless, they show that distributional clustering can work with syntactic constituents. The work of Mori and Nagao (1995) is in a similar vein. They propose three hypotheses:

1. Part-of-speech sequences on the right-hand side of a rewriting rule are less constrained as to what part-of-speech precedes and follows them than non-constituent sequences.

2. Part-of-speech sequences directly derived from the same non-terminal symbol have similar environments.

3. The most suitable set of rewriting rules makes the greatest reduction of the corpus size.

They use these three principles in an algorithm that derives a context free grammar from sequences of tags from the Penn Tree-Bank. The algorithm clusters part-of-speech sequences together based

on their similarity using Euclidean distance. In addition they use a criterion based on the ratio of delimiters that appear on either side of the POS sequence in question. Theeramunkong and Okumura (1997) use distributional clustering together with a bracketed corpus along the lines of Pereira and Schabes (1992). Klein and Manning (2001) present a pair of distributional learning algorithms, one of which is very similar to the technique I use. I shall compare their work more fully at the end of this chapter. All of these techniques use local distributional context. There are also two techniques that use whole-sentence contexts. Adriaans (1999) presents EMILE, which initially used a form of supervision (it could query an oracle about the grammaticality of example sentences), but in later work (Adriaans, Trautwein, & Vervoort, 2000; van Zaanen & Adriaans, 2001) is modified to be completely unsupervised. A similar approach, a very interesting algorithm called Alignment-Based Learning, is presented by van Zaanen (2000). Both of these techniques look for minimal pairs, a specific form of distributional learning where the contexts are the rest of the sentence. That is to say, the algorithm looks for pairs of sentences that are identical except for a particular phrase being changed. Unfortunately, this requires a large number of examples of each sentence type to guarantee that there will be suitable minimal pairs. Though both of these algorithms produce reasonable results on the ATIS corpus, which is a very small and repetitive corpus with a limited lexicon and a very restricted set of syntactic structures, they do not seem to work so well on more realistic data. Adriaans et al. (2000) present some very limited results on the Bible, in which only a few phrases are learned, and they estimate (van Zaanen & Adriaans, 2001) that it would require 50 million sentences to learn a grammar for English. However, they work exclusively with words, not tags, and it is possible that suitable modifications might make their algorithms more practical.

7.3 Inside-Outside algorithm

The inside-outside algorithm is the name for the application of the EM algorithm to the estimation of the parameters of a PCFG. It was first presented by Baker (1979) and later applied by a number of researchers (Lari & Young, 1990; Pereira & Schabes, 1992). It can usefully be considered a generalisation of the forward-backward (Baum-Welch) algorithm for HMMs.

I won't give a full exposition of the inside-outside algorithm; the reader is referred to Manning and Schütze (1999) for a more detailed explanation. Just as with the forward-backward algorithm we define a dynamic programming trellis with two related sets of probabilities, the forward and backward probabilities, so for the inside-outside algorithm we define the inside and outside probabilities. If we have a sentence of length l, with word w_0 in the span (0, 1) and the final word w_{l-1} in the span (l-1, l), we define:

Definition 7 β_s(i, j), the inside probability, is the probability that starting from the non-terminal symbol s we will generate all of the words in the span (i, j).

Definition 8 α_s(i, j), the outside probability, is the probability that starting from the root symbol we will generate the symbol s over the span (i, j), and all of the words in the spans (0, i) and (j, l).

We can calculate these recursively, first computing the inside probabilities and then the outside probabilities. Then, in the EM step, we accumulate the expected number of times each rule r → s t is used by summing expressions of the form given in Equation 7.1 below.

Iterations    NT + T    UR      UP      F-score   CB     0 CB    ≤2 CB   -LP
20            5 + 35    26.60   31.61   28.89     3.55   18.25   40.43   8159
100           5 + 35    25.27   30.03   27.45     3.85   17.17   37.39   7853
20            15 + 35   34.93   41.50   37.93     2.64   24.51   55.64   7152
100           15 + 35   35.38   42.03   38.42     2.53   22.36   59.03   6480
500           15 + 35   35.51   42.19   38.56     2.48   22.90   60.47   6427
20            35 + 35   36.22   43.03   39.33     2.57   19.50   58.32   6471
100 Left      15 + 35   16.74   19.89   18.18     4.70   7.16    33.45   6550
100 Right     15 + 35   37.31   44.33   40.52     2.39   24.87   62.61   6703

Table 7.1: Results of evaluation of the inside-outside algorithm on the tagged ATIS corpus. NT + T is the number of non-terminals plus the number of terminals. The rules were of the form NT → (NT | T) (NT | T). UR is unlabelled recall, UP is unlabelled precision, CB is the average number of crossing brackets, ≤ 2 CB is the percentage with two or fewer crossing brackets, and -LP is the negative log probability of the training data at the final iteration. Since the IO algorithm produces a binary tree, the crossing bracket scores are rather poor.

\alpha_r(i,j)\; p(r \to s\,t)\; \beta_s(i,k)\; \beta_t(k,j) \qquad (7.1)

So the expectation arises out of the interaction between the inside and outside probabilities. The algorithm then consists of selecting a number of non-terminals and the structure of the grammar (normally all rules are allowed), randomly initialising the parameters and running the algorithm to convergence.

I first experimented with the tagged version of the ATIS corpus. This consists of 559 sentences, generally rather short and predominantly orders and questions. I used various parameter settings, with random initialisation of the models, produced the Viterbi parse of each sentence in the corpus and evaluated the results using the standard (unlabelled) PARSEVAL structural consistency metrics, computed with the EVALB program. Table 7.1 summarises the results. Note first that since I use Chomsky normal form grammars, the model produces purely binary-branching trees and thus the crossing bracket measures are poor compared to techniques that can produce flatter trees. The lower two rows of the table show the scores for two baseline models which are restricted to left and right branching. Note that the right-branching model scores highly on the recall and precision metrics, but that the negative log likelihood of the left-branching model is lower (better) than that of the right-branching model, and that the right-branching model has worse perplexity than several of the other models, in spite of its good score on the structural consistency metrics.

Regardless of these scores, what we are primarily interested in is the plausibility of the grammars. They aren't that bad – certainly they don't conform to the particular scheme used in the ATIS corpus, but that isn't necessarily bad. To give an idea of how well they perform, I selected two sentences, shown in Figures 7.1 and 7.2, where the IO algorithm performs poorly and well respectively. Table 7.2 shows the 20 most frequent rules that rewrite two terminal symbols that are used in the Viterbi parse of the whole corpus.
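To make Definitions 7 and 8 and Equation 7.1 concrete, here is a minimal Python sketch of the inside pass for a toy grammar in Chomsky normal form; the grammar, probabilities and sentence are invented, and this is not the implementation used to produce Table 7.1. The outside pass and the accumulation of expected rule counts follow the same dynamic-programming pattern:

from collections import defaultdict

# Illustrative toy CNF PCFG: binary rules (A -> B C) and lexical rules (A -> word).
binary = {('S', ('NP', 'VP')): 1.0,
          ('NP', ('DT', 'NN')): 0.6,
          ('VP', ('VB', 'NP')): 1.0}
lexical = {('NP', 'you'): 0.4, ('DT', 'the'): 1.0,
           ('NN', 'meal'): 1.0, ('VB', 'show'): 1.0}

def inside(words):
    """beta[(i, j, A)] = P(A =>* words[i:j]), computed bottom-up, CYK-style."""
    n = len(words)
    beta = defaultdict(float)
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                beta[(i, i + 1, A)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for (A, (B, C)), p in binary.items():
                for k in range(i + 1, j):
                    beta[(i, j, A)] += p * beta[(i, k, B)] * beta[(k, j, C)]
    return beta

words = ['you', 'show', 'the', 'meal']
beta = inside(words)
print(beta[(0, len(words), 'S')])   # inside probability of the whole sentence under S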

Figure 7.1: Sentence where the inside outside algorithm performs badly (below) compared to gold standard parse (above). Sentence is “What flights do you have from Milwaukee to Tampa”.

Figure 7.2: Sentence where the inside outside algorithm performs well (below) compared to gold standard parse (above). Sentence is “I need to arrive at Charlotte at around five p.m.”.

Count   Left Hand Side   Right Hand Side
416     NT44             IN NNP
100     NT45             VB PRP
59      NT47             DT NNS
45      NT40             DT NN
40      NT47             DT NN
38      NT40             CD RB
32      NT42             NNS VBP
31      NT41             IN DT
30      NT38             MD VB
29      NT46             JJ NN
27      NT47             TO VB
26      NT45             WP VBZ
24      NT46             NN NN
24      NT45             PRP VBP
22      NT35             VBP EX
19      NT43             CD CD
17      NT41             DT JJ
15      NT41             DT JJS
15      NT39             DT NN
14      NT43             NNP NNP

Table 7.2: The 20 most frequent rules with two terminals on the right hand side.

As can be seen, while several of these rules are perfectly sensible, there are quite a few that are not, such as IN DT and DT JJ (preposition determiner and determiner adjective respectively), although these do occasionally form constituents. This is not disastrous, but it is not good enough, and we can identify two possible causes: either we are using the wrong objective function, or we are using the wrong search algorithm.

7.3.1 Objective Function

I have assumed up to now that the most likely model will be linguistically plausible, and that therefore trying to find the maximum likelihood model is a good plan. As others have noted (de Marcken, 1999; Klein & Manning, 2001), this is not necessarily the case. Klein and Manning (2001) note that traditional arguments for phrase structure have nothing to do with the independence assumptions of the PCFG, and that

it could be that the ML and linguistic criteria align, but in practice they do not always seem to, and one should not expect that, by maximizing the former, one will also maximise the latter.

I agree up to a point; but if the data were generated by a context-free grammar, then asymptotically, as the amount of data increases, the Maximum Likelihood model is guaranteed to give you the right answer. There are a number of questions this argument raises: first, is the size of the ATIS corpus large enough to guarantee that the ML model is in some sense right? Secondly,

though the set of grammatical sentences of English, however this is defined, is probably weakly context-free, is the distribution over actual English sentences stochastically context-free? That is to say, are there statistical dependencies in English that cannot be captured, even in principle, by a PCFG? Thirdly, even if English is not stochastically context-free, does that mean that the ML model is wrong? Fourthly, does the structure of this hypothetical optimal PCFG correspond to a linguistic grammar? I have few answers to these questions, except to say that grammars written by linguists disagree frequently. The experiments here are inconclusive, so there is still an interesting open question: with the ATIS corpus, is the most likely model of a given size (say 15 non-terminals) linguistically plausible? There is an indication in Pereira and Schabes (1992) that this is not the case. They trained two models using the IO algorithm, one on a partially bracketed version of the ATIS corpus, and one on the raw corpus. The model trained on the raw corpus had lower perplexity (was more likely) than the one trained on the bracketed corpus, though its linguistic plausibility, as measured by structural consistency measures with the bracketed corpus, was much worse.

7.3.2 Search Algorithm

The alternative explanation is that the problem lies with the search algorithm. As a number of researchers have pointed out (Carroll & Charniak, 1992; Charniak, 1993; de Marcken, 1999), the inside-outside algorithm is highly sensitive to initial conditions, as it only converges to a local optimum that may globally be very sub-optimal. Though there are a huge number of different techniques of non-convex optimisation that could be applied in this case, the computational burdens appear to make them impractical at present. Klein and Manning (p.c.) claim that part of the problem is that the outside probabilities are very diffuse at the beginning of the algorithm and thus the IO algorithm is driven predominantly by the inside probabilities: i.e. it tends just to build sequences with high mutual information, like the other algorithms we saw earlier. So there are some possible solutions:

Try a richer formalism that takes account of lexical dependencies.

Try a flatter formalism where the outside probabilities give more help.

Don’t try and maximise the likelihood directly.

Try to find some substitute for the outside probabilities.

There are other problems with the algorithm. The ATIS corpus is an unreasonably easy corpus; it is very small and very repetitive, and consists largely of short simple sentences. This hides the fact that the IO algorithm is very slow: the algorithm is cubic in the length of the sentences, and linear in the number of rules of the grammar, which is cubic in the number of non-terminals. My solution here has therefore been to use distributional techniques.

7.4 Distributional Clustering

My approach then is not to directly maximise the log-likelihood. Distributional clustering has been used widely in a number of NLP areas – see for example (Brown et al., 1992; Pereira et al.,

1993) as discussed in Chapter 5. In these approaches, one considers the set of contexts that a word occurs in; words that are similar will occur in similar contexts. This can form the basis for an algorithm that forms syntactic or semantic categories, by clustering based on the similarity or distance between the contexts. In this work we are interested in the behaviour of strings of words or tags. If two sequences of tags mostly occur forming the same non-terminal, then we would expect the contexts that those strings occur in to be similar. So, for example, we would often expect to find the string “the big cat” either at the beginning of a sentence followed by a finite verb, or immediately after a finite verb and before a preposition, among many other possibilities, and we would expect other “noun phrase” sequences to have similar distributions. If we clustered sequences according to their distributions, we would thus expect to find clusters corresponding to various syntactic constituents. However, we do not know in advance which sequences are constituents and which are not, so we will have to cluster all frequent sequences. We will thus also have clusters corresponding to frequent sequences that are not constituents, such as “to the”, or “ran up the high”. Moreover, sequences of words and even sequences of tags are quite sparse, so we may not have enough counts to estimate their distributions accurately. In this work I will use as the distributional context the terminal symbols occurring immediately before and immediately after. The distribution is thus a distribution over ordered pairs of terminal symbols – words, or in this case syntactic tags. This is the most local distributional context; we could also define the global distributional context, which is the distribution over all sentences with a hole in them. Clearly in the infinite data limit, all unambiguous sequences derived from the same non-terminal will have exactly the same distribution; it is not necessarily the case that different non-terminals will have different local distributions. Clearly if we had two non-terminals whose global distributions were the same, we could merge the two together without causing any change to the probability distribution function (pdf); also, if two non-terminals produce the same sequences of symbols with the same probabilities, we could merge them without any change in the predictions of the model. So we will have problems with this algorithm if we have two non-terminals with the same local distributions but different global distributions.

A simple context-free language that has this property would be something like

L = \{\, a\,c\,W\,a\,W^{R}\,c \,\} \cup \{\, b\,c\,W\,b\,W^{R}\,c \,\}, \quad W \in \{d, e\}^* \qquad (7.2)

where we have a palindrome language with a symbol embedded in the middle (a or b) that must agree with a symbol outside. I assume we have an additional terminal symbol that represents a sentence boundary. If there are k symbols then each distribution has k² parameters. In the data sets used here there are 77 tags, giving 5929 parameters for each distribution. Note that I am not making the independence assumption that the distribution of words before is independent of the distribution of words after, which would reduce the number of parameters to 2k. In fact, it turns out that the divergence from this assumption is a key quantity.

The data set for these preliminary results consisted of 12 million words of the British National Corpus, tagged according to the CLAWS-5 tag set, with punctuation removed. There are 76 tags; I introduced an additional tag to mark sentence boundaries. I operate exclusively with tags, ignoring the actual words.

Table 7.3: Some of the more frequent sequences in two good clusters.

Table 7.4: Some of the sequences in two bad clusters.

My initial experiment clustered all of the tag sequences in the corpus that occurred more than 5000 times, of which there were 753, using the k-means algorithm with the L1-norm (city-block) metric applied to the context distributions. Thus sequences of tags will end up in the same cluster if their context distributions are similar; that is to say, if they appear predominantly in similar contexts. I chose the cutoff of 5000 counts to be of the same order as the number of parameters of the distribution, and chose the number of clusters to be 100. To identify the frequent sequences and to calculate their distributions I used the standard technique of suffix arrays (Gusfield, 1997), which allows rapid location of all occurrences of a desired substring. As expected, the results of the clustering showed clear clusters corresponding to syntactic constituents, two of which are shown in Table 7.3. Of course, since we are clustering all of the frequent sequences in the corpus, we will also have clusters corresponding to parts of constituents, as can be seen in Table 7.4. We obviously would not want to hypothesise these as constituents: we therefore need some criterion for filtering out these spurious candidates. The problem is how to identify which ones are constituents, and how to incorporate this into a complete system. In the next section I discuss an approach to this problem.
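The core of this step can be sketched as follows in Python; this is a toy illustration under simplifying assumptions (a tiny invented tag stream, a low frequency cutoff, and a hand-rolled k-means with the city-block metric rather than the suffix-array machinery, 5000-count threshold and 100 clusters described above):

import random
from collections import Counter, defaultdict

def context_distributions(tags, max_len=3, min_count=3):
    """Map each frequent tag sequence to its distribution over
    (previous tag, next tag) contexts, with '#' for sentence boundaries."""
    padded = ['#'] + tags + ['#']
    counts = defaultdict(Counter)
    for length in range(1, max_len + 1):
        for i in range(1, len(padded) - length):
            seq = tuple(padded[i:i + length])
            counts[seq][(padded[i - 1], padded[i + length])] += 1
    dists = {}
    for seq, ctx in counts.items():
        total = sum(ctx.values())
        if total >= min_count:
            dists[seq] = {c: n / total for c, n in ctx.items()}
    return dists

def l1(p, q):
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def kmeans_l1(dists, k=4, iters=10, seed=0):
    """Plain k-means with the city-block metric on context distributions."""
    random.seed(seed)
    items = list(dists.items())
    centres = [d for _, d in random.sample(items, k)]
    clusters = defaultdict(list)
    for _ in range(iters):
        clusters = defaultdict(list)
        for seq, d in items:
            best = min(range(k), key=lambda c: l1(d, centres[c]))
            clusters[best].append((seq, d))
        for c, members in clusters.items():
            keys = set().union(*(d for _, d in members))
            centres[c] = {key: sum(d.get(key, 0.0) for _, d in members) / len(members)
                          for key in keys}
    return clusters

tags = ("AT0 NN1 VVD PRP AT0 NN1 . AT0 AJ0 NN1 VVZ PRP AT0 NN1 . "
        "PNP VVD AT0 NN1 PRP AT0 AJ0 NN1 .").split()
for c, members in kmeans_l1(context_distributions(tags)).items():
    print(c, [" ".join(s) for s, _ in members][:5])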

7.5 Mutual Information Criterion for Constituents

The problem with this approach is that there will be many clusters that do not correspond to constituents in the traditional sense. Consider, for example, the common string of tags “PREP DET”. This has quite a characteristic distribution, and will form large coherent clusters, consisting of prepositional phrases missing the final noun. Moreover there is, as pointed out before, high mutual information between the two symbols. Clearly, we do not want to hypothesise this as a constituent in the early phases of the algorithm. We need a criterion which will filter out these

non-constituents. In this section I will present such a criterion and motivate it in three ways: intuitively, mathematically and empirically. The criterion is this: a constituent will exhibit high mutual information between the symbol that occurs before it and the symbol that occurs after it. So it is precisely the deviation from the independence assumption mentioned in the previous section that is the criterion, since the mutual information is equal to the Kullback-Leibler divergence between the joint distribution and the product of the two marginal distributions, i.e. the joint distribution modelled as if the two variables were independent. Consider a true constituent such as a noun phrase. If a noun phrase occurs at the beginning of a sentence, it is likely to have a sentence boundary before it, and a finite verb after it. If it occurs in a prepositional phrase it is likely to have a preposition before it and perhaps a sentence boundary after it. It is much less likely to have a sentence boundary both before and after it. Thus the symbol that occurs before it is highly correlated with the symbol that occurs after it; i.e. it has high MI. On the other hand, if we have a spurious constituent such as “PREP DET” there will be essentially no correlation between what occurs before and after. On the right we must have a completion of the noun phrase. This is completely independent of the context that the prepositional phrase appears in, and thus has no relation to the symbols that appear before the phrase. Intuitively, we are looking for sequences that allow us to capture long range dependencies. More formally, we can derive limits on this mutual information from the parameters of a PCFG generating the data.

7.6 Mathematical Justification for the Criterion

We can gain some insight into the significance of the MI criterion by analysing it within the framework of SCFGs. We are interested in the properties of the two-dimensional context distributions of each non-terminal. The terminals are the part-of-speech tags, of which there are T. For each terminal or non-terminal symbol X we define four distributions, L(X), P(X), S(X) and R(X), over the T tags, or equivalently T-dimensional vectors. Two of these, P(X) and S(X), are just the prefix and suffix probability distributions for the symbol (Stolcke, 1995): the probabilities that the string derived from X begins (or ends) with a particular tag. The other two, L(X) and R(X), the left distribution and right distribution, are the distributions of the symbols before and after the non-terminal. Clearly if X is a terminal symbol, the strings derived from it are all of length 1, and thus begin and end with X, giving P(X) and S(X) a very simple form.

If we consider each non-terminal N in a SCFG, we can associate with it two random variables, which we can call the internal and external variables. The internal random variable is the more familiar and ranges over the set of rules expanding that non-terminal. The external random variable, Z_N, is defined as the context in which the non-terminal appears. Every non-root occurrence of a non-terminal in a tree will be generated by some rule r that it appears on the right hand side of. We can represent this as (r, i), where r is the rule and i is the index saying where in the right hand side it occurs. The index is necessary since the same non-terminal symbol might occur more than once on the right hand side of the same rule. So for each N, Z_N can take only those values (r, i) where N is the ith symbol on the right hand side of r.

The independence assumptions of the SCFG imply that the internal and external variables are independent, i.e. have zero mutual information. This enables us to decompose the context distribution into a linear combination of the marginal distributions we defined earlier. Let us examine the context distribution of all occurrences of a non-terminal N with a particular value of Z_N. We can distinguish three situations: the non-terminal could appear at the beginning, middle or end of the right hand side. If it occurs at the beginning of a rule r with left hand side X, and the rule is X → N Y ..., then the terminal symbol that appears before N will be distributed exactly according to the symbol that occurs before X, i.e. L(N) = L(X). The symbol that occurs after N will be distributed according to the symbol that occurs at the beginning of the symbol following N on the right hand side of the rule, so R(N) = P(Y). By the independence assumption, the joint distribution is just the product of the two marginals:

D(N \mid Z_N = (r, 1)) = L(X)\, P(Y) \qquad (7.3)

Similarly, if it occurs at the end of a rule X → ... W N we can write it as

D(N \mid Z_N = (r, |r|)) = S(W)\, R(X) \qquad (7.4)

and if it occurs in the middle of a rule X → ... W N Y ... we can write it as

D(N \mid Z_N = (r, i)) = S(W)\, P(Y) \qquad (7.5)

The total distribution of N will be the normalised expectation of these three types with respect to P(Z_N). Each of these distributions has zero mutual information, and the mutual information of their linear combination will be less than or equal to the entropy of the variable combining them, H(Z_N). To simplify the notation, suppose we have a linear combination of independent distributions

P(X = x, Y = y) = \sum_i \alpha_i\, p_i(x)\, q_i(y) \qquad (7.6)

where each of the individual distributions is of the form p_i(x) q_i(y) and thus has zero mutual information, and the α_i are the mixing parameters. Then, using Jensen's inequality, we can prove that

I(X;Y) \le -\sum_i \alpha_i \log \alpha_i \qquad (7.7)

with equality when the p_i's and the q_i's are sufficiently distinct. For example, suppose that each of the p_i distributions is completely different, and similarly for the q_i distributions. Then knowing the value of the symbol on the left will completely identify which of the distributions you are in, and this will reduce the entropy of the right distributions. Recall that I(X;Y) = H(Y) − H(Y|X), so the MI is equal to the reduction in entropy of Y when you know X. Therefore

MI(N) \le H(Z_N) \qquad (7.8)

Thus a non-terminal that always appears in the same position on the right hand side of a particular rule will have zero MI, whereas a non-terminal that appears on the right hand side of a variety of different rules will, or rather may, have high MI. This is of limited direct utility, since we do not know which are the non-terminals and which are other strings, but it establishes some circumstances under which the approach won't work.
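The step from the mixture in (7.6) to the bounds (7.7) and (7.8) can also be made explicit using the chain rule for mutual information rather than a direct appeal to Jensen's inequality; the short derivation below is my own restatement, for a mixing variable Z with P(Z = i) = α_i and with X and Y conditionally independent given Z:

% Z is the mixing variable: P(Z = i) = \alpha_i and
% P(X = x, Y = y | Z = i) = p_i(x) q_i(y), as in (7.6).
\begin{align*}
I(X;Y) &\le I(X;Y) + I(X;Z \mid Y) = I(X;\, Y, Z) \\
       &= I(X;Z) + I(X;Y \mid Z) \\
       &= I(X;Z) && \text{(conditional independence given } Z\text{)} \\
       &\le H(Z) = -\sum_i \alpha_i \log \alpha_i .
\end{align*}

Taking Z to be the external variable Z_N of the non-terminal N recovers MI(N) ≤ H(Z_N), i.e. (7.8).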

Figure 7.3: Graph of expected MI against distance.

Some of these circumstances are constraints on the form of the grammar, namely that no non-terminal can appear in just a single place on the right hand side of a single rule. Others are more substantive constraints on the sort of languages that can be learned.

7.7 Empirical Justification for the Criterion

To implement this, we need some way of deciding a threshold which will divide the sheep from the goats. A simple fixed threshold is undesirable for a number of reasons. One problem with the current approach is that the maximum likelihood estimator of the mutual information is biased, and tends to over-estimate the mutual information with sparse data (Li, 1990). A second problem is that there is a “natural” amount of mutual information present between any two symbols that are close to each other, which decreases as the symbols get further apart. Figure 7.3 shows a graph of how the distance between two symbols affects the MI between them. Thus if we have a sequence of length 2, the symbols before and after it will have a distance of 3, and we would expect an MI of about 0.05. If it has more than this, we might hypothesise it as a constituent; if it has less, we discard it. In terms of the graph, if a sequence falls above the line we accept it, and if it falls below the line we reject it. In practice we want to measure the MI of the clusters, since we will have many more counts, and that will make the MI estimate more accurate. We therefore compute the weighted average of this expected MI, according to the lengths of all the sequences in the clusters, and use that as the criterion. Table 7.5 shows how this criterion separates valid from invalid clusters; it eliminated 55 out of the 100 clusters, and we can verify empirically that it does in fact filter out the undesirable sequences. Clearly this is a powerful technique for identifying constituents. Figure 7.4 shows a graph of actual versus expected MI. The dividing line shows the criterion I apply; I have plotted the four good clusters and the four bad clusters on it. These clusters are from a different run of the algorithm: I selected the clusters that contained the eight examples in Table 7.5. Their classification as good or bad remained the same in both runs of the algorithm, and in general the classifications appeared quite robust.
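The filtering step can be sketched as follows, assuming we have already measured the corpus MI between tags at each separation (the baseline values below are placeholders shaped like Figure 7.3, and the cluster data are invented); this is an illustration, not the thesis's code:

def expected_mi(distance, baseline):
    """Baseline MI between two tags at a given separation, read off a curve
    like the one in Figure 7.3 (the values supplied here are placeholders)."""
    return baseline.get(distance, 0.0)

def passes_mi_criterion(sequences, actual_mi, baseline):
    """sequences: list of (length, count) for the members of a cluster.
    The symbols before and after a sequence of length n are n + 1 apart, so the
    threshold is the count-weighted average of the baseline at those distances."""
    total = sum(count for _, count in sequences)
    threshold = sum(count * expected_mi(length + 1, baseline)
                    for length, count in sequences) / total
    return actual_mi > threshold

# Placeholder baseline curve (MI decays with distance) and two invented clusters.
baseline = {2: 0.09, 3: 0.05, 4: 0.03, 5: 0.02}
print(passes_mi_criterion([(2, 6000), (3, 2500)], 0.11, baseline))   # accepted
print(passes_mi_criterion([(2, 9000)], 0.008, baseline))             # rejected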

Cluster    Actual MI   Exp. MI   Valid
           0.11        0.04      Yes
           0.13        0.02      Yes
           0.06        0.02      Yes
           0.27        0.1       Yes
           0.008       0.02      No
           0.02        0.03      No
           0.01        0.02      No
           0.01        0.03      No

Table 7.5: Four valid clusters where the actual MI is greater than the expected MI, and four invalid clusters which fail the test. The four invalid clusters clearly are not constituents according to traditional criteria.

Figure 7.4: Graph of expected MI against actual MI.

7.8 Comparison to other criteria

As previously discussed, purely internal (inside) properties of the tag sequence are insufficient to determine whether that sequence is likely to be a constituent. I will now compare various other criteria, based on external properties of the sequence, that have been proposed by other researchers. Lamb (1961) proposes a criterion he calls the token/neighbour ratio. This is the ratio between the number of tokens in the corpus and the number of distinct types that appear on the right or left of the item in question; the reciprocal of this quantity can be thought of as an approximation to the conditional entropy of the symbols before or after the item. As he puts it (p. 680)

The highest T/N ratios [i.e. lowest conditional entropy] identify the points of maximum restriction on freedom of combination, insofar as such identification can be made without prior information about the structure of the language.

This is related to a proposal of Harris (1955).

Brill (1992) suggests that given a sequence of tags tag_x tag_y we should only consider it as a constituent if the conditional entropy of the tag that follows the sequence tag_x tag_y is greater than the conditional entropy of the tag that follows tag_x. The idea here is that at a constituent boundary the uncertainty goes up. Similarly, Mori and Nagao (1995) require that the conditional entropy of the tag before and the tag after are above a certain threshold, which they manually select to get a good result. The problem with both of these criteria is that they depend very much on the particular tags that are used, and on their entropic properties. Note that both of these criteria change if we split tags into a finer-grained set of tags, or group tags together. Thus it is unlikely that they will work with another, very different set of tags, and still less likely that they will work cross-linguistically. Klein and Manning (2001) propose using a measure based on the entropy of the context distributions: they argue that the contexts that constituents occur in can vary more widely than those that non-constituents can occur in. They recognise that this measure is highly dependent on the tag set used, and propose two rather ad hoc scaling algorithms to compensate for this problem and for a sparse data problem. The technique I propose shares some of the same insights, namely that the distributions of non-constituents are constrained in some way, but I characterise this in a way that abstracts away from the details of the entropy and is more or less independent of the details of the tag set. In particular, if we take a particular tag and split it into two, assigning the occurrences of the original tag at random to the two new tags, then this will change the raw entropy scores, but won't change the mutual information measure. For example, suppose that the tag DT is always followed by NN. Then the right conditional entropy of any tag sequence that ends in DT will be zero, and the conditional entropy criterion will say that none of these tag sequences are constituents. If we split the NN tag into a large number of separate classes based on some syntactic or morphological criteria, then the conditional entropy will go up, and these tag sequences will no longer be ruled out. The mutual information will remain unchanged, except for data sparseness effects. In fact, the MI between the tags before and after is an estimate of the MI between the words before and after, just as we saw in Equation 5.3 earlier. This won't always be the case though: if we have a sequence Determiner Adjective, then it will often appear with a verb before it and a noun after it.

If we split these tags into semantic classes then we will see that the MI will increase; for example, if a verb that means ‘eat’ appears before this sequence, then a noun that means ‘food’ is more likely to appear after it. If we derive the classes based on local distributional evidence as in Chapter 5, or if they are derived based on traditional syntactic criteria, this should not happen.

I will now try to describe the relationship more formally. We can define the joint distribution of the tags before and after as p_XY, and the distributions of the symbols before and after, i.e. the marginal distributions, as p_X and p_Y. I will use H for the entropy. So Mori and Nagao (1995)'s criterion is that H(p_X) and H(p_Y) must both be high, and Klein and Manning (2001)'s criterion is that H(p_XY) must be high. The MI criterion I use can be written as

I(X;Y) = H(p_X) + H(p_Y) - H(p_{XY}) \qquad (7.9)

So we can see that high mutual information means that the entropy of the joint distribution will be substantially less than it would be under an independence assumption; so in effect the MI criterion directly measures how constrained it is.
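A short numerical check of this identity, on invented context counts for a single candidate sequence; both sides of (7.9) are computed independently and agree up to floating-point error:

import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(n / total * math.log(n / total) for n in counts.values())

def mutual_information(joint):
    """I(X;Y) for the (tag before, tag after) contexts of a candidate sequence,
    computed two ways: directly, and via the entropies as in Equation 7.9."""
    total = sum(joint.values())
    left, right = Counter(), Counter()
    for (l, r), n in joint.items():
        left[l] += n
        right[r] += n
    direct = sum(n / total * math.log((n * total) / (left[l] * right[r]))
                 for (l, r), n in joint.items())
    via_entropies = entropy(left) + entropy(right) - entropy(joint)
    return direct, via_entropies

# Invented counts: the symbol before is highly predictive of the symbol after.
joint = Counter({('#', 'VVD'): 50, ('PRP', '#'): 40, ('PRP', 'VVD'): 6, ('#', '#'): 4})
print(mutual_information(joint))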

7.8.1 Palindrome language

It can also help to see the relationship if one considers an artificial example – the palindrome language. Consider the language of all odd-length palindromes over the two terminal symbols a and b, generated by the following PCFG:

0.4 : S → a S a
0.4 : S → b S b
0.1 : S → a
0.1 : S → b

The constituents will all be themselves palindromes, but not all occurrences of substrings that are palindromes will in fact be constituents – consider the string aaaaa, where the first occurrence of aaa is not a constituent. Nonetheless, most of the occurrences of a particular palindrome will be constituents, and with the constituents there will be very high mutual information between the symbol before and the symbol after – they must be the same, either a, b or a sentence boundary. This simple language, which has been studied before in the context of unsupervised language acquisition (Pereira & Schabes, 1992; Keller & Lutz, 1997), thus provides a good set of examples on which to compare these criteria. I generated 10000 strings from this language, and calculated the various quantities discussed above using ML estimates on various strings that I chose rather arbitrarily; the results are summarised in Table 7.6. I divided the table into three sections: first the palindromes of odd length, secondly odd-length strings that are not palindromes, and thirdly some even-length strings. Since the probabilities of the PCFG are symmetric with regard to swapping a and b and reversing the string, these strings cover most of the possibilities of length 3 and 5.

String    H(p_X)     H(p_Y)     H(p_XY)    I(X;Y)
aaa       1.081820   1.081820   1.981995   0.024931
aba       1.084536   1.084536   1.959177   0.025950
aaaaa     1.092265   1.092265   1.941428   0.098320
aabaa     1.094171   1.094171   1.898402   0.143882
abbba     1.090961   1.090961   1.889518   0.154027
ababa     1.082768   1.082768   1.939730   0.079296
aababaa   1.088019   1.088019   1.642225   0.343747
aab       1.078490   1.083498   1.895160   0.014412
aaaab     1.043055   1.089997   1.867315   0.017028
aaabb     1.057471   1.068674   1.900353   0.016466
baabb     1.078258   1.051931   1.871374   0.014028
babbb     1.063206   1.079690   1.869075   0.013831
abbabbb   1.084602   1.063020   1.828712   0.020685
aa        1.082876   1.082876   1.905240   0.045166
ab        1.076238   1.082951   1.907760   0.015266
aaaa      1.086664   1.086664   1.886326   0.071863
aabb      1.079312   1.056195   1.874459   0.012771

Table 7.6: Comparisons of different criteria, when applied to a 10000-string sample from the odd palindrome language. All criteria claim that higher values are related to constituenthood.

Note first of all that the differences in the entropy H(p_XY) are rather small in absolute terms, and misleading with respect to the two length-7 strings – the palindrome has lower entropy than the non-palindrome. The marginal entropy fares slightly better, though the differences are again small. However, the mutual information criterion shows sizable differences when comparing strings of the same length. Note finally that with the even-length strings, if one uses a Chomsky normal form grammar, the two strings aa and aaaa will also often be constituents, whereas the string aabb will never be one. This is a rather informal test, but it gives an idea of the sort of circumstances under which the criterion I propose performs well. First of all, since it is tag-set invariant, it has a better chance of being cross-linguistically valid and valid across different tag sets. Secondly, since it is related to the entropy of the external syntax, there is more possibility of integrating it directly into an induction algorithm rather than using it rather crudely as at present. Thirdly, since it is related to a deep property of the stochastic process, it is maybe not too domain-specific – rather than just being something tuned to produce the desired effect, it is actually quite principled. Regardless of how these various techniques compare in formal elegance, it appears that there are a number of criteria that suffice to separate constituents from non-constituents.
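Numbers of the kind shown in Table 7.6 can be reproduced in outline by sampling from the PCFG above and measuring the four quantities for chosen substrings; the Python sketch below is mine, computes entropies in nats as in the table, and will give values that differ somewhat from the table (different sample, simple boundary handling):

import math
import random
from collections import Counter

def sample_palindrome():
    """Sample one string from the PCFG  S -> aSa (0.4) | bSb (0.4) | a (0.1) | b (0.1)."""
    outer = []
    while True:
        r = random.random()
        if r < 0.4:
            outer.append('a')
        elif r < 0.8:
            outer.append('b')
        else:
            middle = 'a' if r < 0.9 else 'b'
            break
    return ''.join(outer) + middle + ''.join(reversed(outer))

def entropy(counts):
    total = sum(counts.values())
    return -sum(n / total * math.log(n / total) for n in counts.values())

def context_stats(strings, substring):
    """H(p_X), H(p_Y), H(p_XY) and I(X;Y) for the symbols before (X) and after (Y)
    each occurrence of `substring`, with '#' marking the sentence boundaries."""
    joint = Counter()
    for s in strings:
        padded = '#' + s + '#'
        start = padded.find(substring)
        while start != -1:
            joint[(padded[start - 1], padded[start + len(substring)])] += 1
            start = padded.find(substring, start + 1)
    left, right = Counter(), Counter()
    for (l, r), n in joint.items():
        left[l] += n
        right[r] += n
    h_x, h_y, h_xy = entropy(left), entropy(right), entropy(joint)
    return h_x, h_y, h_xy, h_x + h_y - h_xy

random.seed(1)
sample = [sample_palindrome() for _ in range(10000)]
for sub in ['aaa', 'aba', 'aab', 'aabb']:
    print(sub, [round(v, 3) for v in context_stats(sample, sub)])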

7.9 Complete System

This technique can be incorporated into a grammar induction algorithm. We use the clustering algorithm to identify sets of sequences that can be derived from a single non-terminal. The MI criterion allows us to find the right places to cut the sentences up; we look for sequences where there are interesting long-range dependencies. Given these potential sequences, we can then hypothesise sets of rules with the same right hand side. This naturally suggests a minimum description length (MDL) or Bayesian approach (Stolcke, 1994; Chen, 1995).

Starting with the maximum likelihood grammar, which has one rule for each sentence type in the corpus and a single non-terminal, at each iteration we cluster all frequent strings, and filter according to the MI criterion discussed above. Clearly with a corpus of this size, reparsing the whole corpus at each iteration is not possible. However, starting from the maximum likelihood grammar means that the Viterbi approximation is very accurate, and it remains reliable for some time. At each iteration, we greedily select the cluster that will give the best immediate reduction in description length, calculated according to a theoretically optimal code. We add a new non-terminal with rules for each sequence in the cluster. If the cluster contains a sequence of length 1 consisting of an existing non-terminal, then instead of adding a new non-terminal, we add rules expanding that old non-terminal. Thus, if we have a cluster which consists of three sequences, one of which is just an existing non-terminal, we would merely add two rules expanding that non-terminal, rather than three rules with a new non-terminal on the left hand side. This allows the algorithm to learn recursive rules, and thus context-free grammars. We then perform a partial parse of all the sentences in the corpus, and for each sentence select the path through the chart that provides the shortest description length, using standard dynamic programming techniques. This greedy algorithm is not ideal, but appears to be unavoidable given the computational complexity. Following this, we aggregate rules with the same right hand sides and repeat the operation.

Since the algorithm only considers strings whose frequency is above a fixed threshold, the application of a rule in rewriting the corpus will often result in a large number of strings being rewritten so that they are the same, thus creating a new sequence that will be above the threshold. Then at the next iteration, this sequence will be examined by the algorithm. Thus the algorithm progressively probes deeper into the structure of the corpus as syntactic variation is removed by the partial parse of low-level constituents. Singleton rules require special treatment; I have experimented with various different options, without finding an ideal solution. The results presented here use singleton rules, but they are only applied when the result is necessary for the application of a further rule. This is a natural consequence of the shortest description length choice for the partial parse: using a singleton rule will in general increase the description length of a path using it.

I ran the algorithm for 40 iterations. Beyond this point the algorithm appeared to stop producing plausible constituents. Part of the problem is to do with sparseness: it requires a large number of samples of each string to estimate the distributions reliably. Table 7.7 shows an outline of the algorithm. There are a number of technical problems that must be solved before this algorithm can be implemented. Estimating the MI of the symbols before and after from the sparse counts available is again difficult. The estimate made by calculating the MI of the ML estimate tends to overstate the MI (Li, 1990); there are more sophisticated techniques available (Wolf & Wolpert, 1992, 1993).

Initialise grammar to ML
repeat
    Gather all frequent strings
    Calculate distributions
    Cluster using k-means
    for all c in the set of clusters do
        if c satisfies MI criterion then
            Calculate MDL gain
        end if
    end for
    Select cluster that gives greatest reduction in description length
    Add rules corresponding to all the sequences in the cluster
    Parse
    Remove duplicates
until the grammar is sufficiently small

Table 7.7: Outline of the grammar induction algorithm

Sometimes there are clusters that are a combination of two or more rather different subclusters. In this case, each of the sequences may have very low MI, but the average of the distribution may have high MI because of the different subclusters; the average MI will then be greater than the actual MI. Normally, because of the sparseness, the MI calculation overestimates the MI, and thus the MI of each of the sequences will tend to be higher than the MI of the cluster, which is of course less sparse.
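For illustration only, here is one standard first-order correction for this bias, the Miller-Madow adjustment, applied to the plug-in MI estimate on invented sparse counts; this is not the technique of Li (1990) or of Wolf and Wolpert (1992, 1993) referred to above:

import math
from collections import Counter

def mi_plugin(joint):
    """Plug-in (maximum likelihood) estimate of I(X;Y) in nats."""
    total = sum(joint.values())
    left, right = Counter(), Counter()
    for (l, r), n in joint.items():
        left[l] += n
        right[r] += n
    return sum(n / total * math.log((n * total) / (left[l] * right[r]))
               for (l, r), n in joint.items())

def mi_miller_madow(joint):
    """Plug-in MI minus the first-order bias term (R-1)(C-1)/(2N), in nats,
    where R and C are the numbers of observed left and right symbols."""
    total = sum(joint.values())
    rows = len({l for l, _ in joint})
    cols = len({r for _, r in joint})
    return mi_plugin(joint) - (rows - 1) * (cols - 1) / (2 * total)

# Invented, deliberately sparse context counts.
joint = Counter({('AT0', 'NN1'): 5, ('AT0', 'AJ0'): 3,
                 ('PRP', 'NN1'): 4, ('PRP', 'AJ0'): 2, ('#', 'NN1'): 1})
print(round(mi_plugin(joint), 4), round(mi_miller_madow(joint), 4))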

7.10 Evaluation

Evaluation of unsupervised algorithms is difficult. One evaluation scheme that has been used is to compare the constituent structures produced by the grammar induction algorithm against a treebank, and to use the PARSEVAL scoring metrics, as advocated by van Zaanen and Adriaans (2001): that is, to use exactly the same evaluation as is used for supervised learning schemes. This proposal fails to take account of the fact that the annotation scheme used in any corpus does not reflect some theory-independent reality, but is the product of various more or less arbitrary decisions by the annotators (Carroll, Briscoe, & Sanfilippo, 1998). Given a particular annotation scheme, the structures in the corpus are not arbitrary, but the choice of annotation scheme inevitably is. Thus expecting an unsupervised algorithm to converge on one particular annotation scheme out of many possible ones seems overly onerous. It is at this point that one must question what the point of syntactic structure is: it is not an end in itself, but a precursor to semantics. We need to have syntactic structure so we can abstract over it when we learn the semantic relationships between words. Seen in this context, the suggestion of evaluation based on dependency relationships amongst words (Carroll et al., 1998) seems eminently sensible. With unsupervised algorithms, there are two aspects to the evaluation: first, how good the annotation scheme is, and secondly, how good the parsing algorithm is – i.e. how accurately the algorithm assigns the structures. Since we have a very trivial, non-lexicalised parser here, I shall focus on evaluating the sort of structures that are produced, rather than trying to evaluate how well the parser works.

Symbol         Description              Number of rules   Most Frequent
NP             Noun Phrase              107
AVP            Adverb Phrase            6
PP             Prep. Phrase             47
S              Clause                   19
XPCONJ         Phrase and Conj.         5
N-BAR                                   121
S-SUB          Subordinate Clause ?     58
NT-NP0AV0                               3
NT-VHBVBN      Finite copula phrase     12
NT-AV0AJ0      Adjective Phrase         11
NT-AJ0CJC                               10
NT-PNPVBBVVN   Subject + copula         21

Table 7.8: Non-terminals produced during the first 20 iterations of the algorithm.

To facilitate comparison with other techniques, I shall also present an evaluation on the ATIS corpus. Pereira and Schabes (1992) establish that evaluation according to bracketing accuracy and evaluation according to perplexity or cross-entropy are very different. In fact, the model trained on the bracketed corpus, although scoring much better on bracketing accuracy, had a higher (worse) perplexity than the one trained on the raw data. This means that optimising the likelihood of the model may not lead you to a linguistically plausible grammar.

In Table 7.8 I show the non-terminals produced during the first 20 iterations of the algorithm. Note that there are fewer than 20 of them, since, as mentioned above, sometimes we add more rules to an existing non-terminal rather than creating a new one. I have taken the liberty of attaching labels such as NP to the non-terminals where this is well justified. Where it is not, I leave the symbol produced by the program, which starts with NT. Table 7.9 shows the most frequent rules expanding the NP non-terminal, and Table 7.10 the S rules. Note that there is a good match between these rules and traditional phrase structure rules.

To facilitate comparison with other unsupervised approaches, I performed an evaluation against the ATIS corpus, the results of which are summarised in Table 7.11. To perform this evaluation, I tagged the ATIS corpus with the CLAWS tags used here, using the freely available CLAWS demo tagger on the web, removed empty constituents, and adjusted a few tokenisation differences ("at least" is one token in the BNC). I then corrected a few systematic tagging errors. This might be slightly controversial. For example, "Washington D C", which is three tokens, was tagged with the tag for alphabetic symbols on the D and the C, which does not match the way such names are tagged in the BNC that I trained the model on, and in the ATIS corpus it is marked up with the D C as a separate noun phrase. In this case I altered the tags to match the BNC convention, and I performed some similar corrections in a couple of other instances. I did not alter the mark-up of flight codes and so on that occur frequently in this corpus and very infrequently in the BNC.

Count    Right Hand Side
255793
104314
103727
73151
72686
52202
51575
35473
34523
34140

Table 7.9: Ten most frequent rules expanding NP. Note that three of them are recursive.

Count   Right Hand Side
14727
10677
7410
6429
6061
5885
5336
5334
5297
4590

Table 7.10: Ten most frequent rules expanding S. Note that since NPs include sequences of NPs and PPs, these rules parse a wide range of sets of complements. Chapter 7. Syntax Acquisition 145

Algorithm    Iterations   UR     UP     F-score   CB     0 CB   ≤2 CB
EMILE                     16.8   51.6   25.4      0.84   47.4   93.4
ABL                       35.6   43.6   39.2      2.12   29.1   65.0
CDC          10           23.7   57.2   33.5      0.82   57.3   90.9
CDC          20           27.9   54.2   36.8      1.10   54.9   85.0
CDC          30           33.3   54.9   41.4      1.31   48.3   80.5
CDC          40           34.6   53.4   42.0      1.46   45.3   78.2
Optimistic:
CDC          10           33.3   67.3   44.6      0.82   57.3   90.9
CDC          20           37.1   63.5   46.8      1.10   54.9   85.0
CDC          30           40.7   61.1   48.9      1.31   48.3   80.5
CDC          40           41.8   59.5   49.1      1.46   45.3   78.2

Table 7.11: Results of evaluation on the ATIS corpus. UR is unlabelled recall, UP is unlabelled precision, CB is the average number of crossing brackets, and ≤ 2 CB is the percentage with two or fewer crossing brackets. The results for EMILE and ABL are taken from van Zaanen and Adriaans (2001). The optimistic results do not remove the top level of brackets: thus the algorithm gets credit for the sentence bracket, which is standard in supervised frameworks with labelled precision and recall measures.

It is worth pointing out that the ATIS corpus is a very simple corpus, of radically different structure and markup to the BNC. It consists primarily of short questions and imperatives, and many sequences of letters and numbers such as T W A, A P 5 7 and so on.

A simple sentence like “Show me the meal” illustrates this: the parse produced by this algorithm identifies essentially the same bracketing as the gold standard parse, but the two differ in their non-branching constituents.

According to this evaluation scheme its recall is only 33%, because of the presence of the non-branching rules, though intuitively it has correctly identified the bracketing. However, the crossing brackets measure overvalues these algorithms, since they produce only a partial parse, and for some sentences produce a completely flat parse tree which of course has no crossing brackets. I then performed a partial parse of this data using the SCFG trained on the BNC, and evaluated the results against the gold-standard ATIS parse using the PARSEVAL metrics calculated by the EVALB program. Table 7.11 presents the results of the evaluation on the ATIS corpus, with the results of this algorithm (CDC) compared against two other algorithms, EMILE (Adriaans et al., 2000) and ABL (van Zaanen, 2000). I also include an evaluation with a slightly more optimistic criterion, where I do not remove the top-level brackets. Thus the example discussed above would have a recall of 2 out of 5 (40%) rather than 1 out of 3 (33%) according to this more optimistic criterion. The comparison presented here only allows tentative conclusions. First, there are minor differences in the test sets used. Secondly, the CDC algorithm is not completely unsupervised at the moment as it runs on tagged text, not raw text, though the ATIS corpus has very little lexical ambiguity so the problem is probably quite minor. Thirdly, it is worth reiterating that the CDC algorithm was trained on a radically different and much more complex data set, so these results greatly understate its accuracy. However, we can conclude that the CDC algorithm compares favourably with other unsupervised algorithms. In particular, the CDC algorithm beats the right-branching baseline model, which has an F-score of 40.52, with an F-score of 42.0 after 40 iterations.

7.11 Experiments with automatically derived tags

I now present the results of some experiments with the automatically derived tags from Chapter 5. These were produced from the same amount of data, 12 million words, with the punctuation included, and without using any morphological information. I will use as the label for each class the most frequent word in the class, with one exception: one class is predominantly composed of adverbial particles, and since its most frequent word is not representative of the class I use its second most frequent member as the label instead.

Table 7.12 shows the first twenty non-terminals produced by the algorithm. A good example of the sort of erroneous rules that this algorithm produces is a cluster of four sequences that mixes a non-terminal generating adjective phrases followed by a conjunction or a comma with closing (right) quotation marks. Clearly part of the problem is a difficulty with the analysis of punctuation. Tables 7.13 and 7.14 show the most frequent rules that expand NP and VP respectively, or rather the symbols that I have labelled as NP and VP. As can be seen, the most frequent rules correspond well to the sorts of strings that are generated by these categories. I include in an appendix all the rules generated during the first 30 iterations of the algorithm.

7.12 Conclusion

7.12.1 Discussion

There are a number of weaknesses of this approach. First, we need a lot of counts to get reliable estimates of the mutual information; as previously stated, I have not used sophisticated methods for estimating the mutual information, so it is possible that quite small amounts of data might suffice.

Symbol   Description                       Number of rules   Most Frequent
NP       Noun Phrase                       79                THE TIME
         Proper Noun Phrase                6                 JOHN PHILIP
         Adjective followed by conj.       8                 NEW ,
         Prepositional Phrase plus conj.   12                OF NP ,
         Clause or NP                      7                 BUT NP
         Complex Adjective Phrase          3                 MORE NEW
         Unusual Mixture                   4                 , NP ,
         Adjective Phrase                  5                 NOT POSSIBLE
         NP followed by conj.              7                 NP ,
         Prepositional Phrase              3                 OF NP
         ?                                 3                 GOING TO BE
VP       Verb Phrase                       8                 'S NP
         Subject plus verb                 7                 NOT MUCH
         Adverb phrase                     2                 NOT NOT
         ?                                 3                 THE NEED TO
         Clause                            4                 IT VP
         conjuncts                         4                 , HOWEVER ,
         ?                                 5                 NOT NP
         ?                                 2                 TO SEE
         N bar                             7                 PARTY GROUPS

Table 7.12: First 20 non-terminals produced by the algorithm operating on CDC tags.

Count    Right Hand Side
316191   THE TIME
156893   LONDON
108564   NP OF NP
86400    NEW PEOPLE
75279    THE GROUP
73967    THE PEOPLE
72857    YOU
71879    THE NEW TIME
67493    THE NEW GROUP
58561    THE NEW PEOPLE

Table 7.13: Ten most frequent rules expanding NP, together with the number of times each was applied.

Count    Right Hand Side
13784    'S NP
13463    IS POSSIBLE
13405    CAME PP
11329    WILL BE NP
9832     IS AP
9441     ARE NP
8001     IS ONE PP
6108     IS NOT NP

Table 7.14: All rules expanding VP, together with the number of times each was applied.

But in any event there are always rare sequences, and this model does not capture them well. Perhaps a tag set that spreads more evenly over the set of words might alleviate this problem. Secondly, the way that the MI criterion is integrated into the MDL framework is very ad hoc. The problem is that I have two criteria: MDL gain and MI. If you have one criterion, you simply choose the best candidate; if you have two criteria, you need some way of balancing their relative merits, and I do not have one. I have therefore concealed the problem by turning one of them, the MI criterion, into a binary criterion, and selecting the best candidate according to the other criterion. This is clearly inadequate. Thirdly, I have presented results on only one language; though this is very standard (van Zaanen and Adriaans (2001) stand as a very honourable exception), it is not acceptable. There are now sufficiently large corpora available in many languages, and in several languages there are also small treebanks that would be suitable for evaluation - at least Czech, Chinese, German and Dutch. Morphologically rich languages with free word order are the difficult case, which makes Czech a particular challenge. Fourthly, the clustering algorithm is very crude and assumes that each sequence is unambiguously a member of a particular cluster. Of course this is not the case: the sequence Article Noun is often a constituent, but often it is the beginning of a larger constituent such as Article Noun Noun. Fifthly, the greediness of the algorithm causes many problems: at the moment greedy application of the rules is necessary to make the algorithm feasible on the large amounts of data I am currently using, but it should be possible to avoid this, or at least to apply the rules only selectively.
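To make the point about the two criteria concrete, the following is a minimal sketch, under illustrative assumptions rather than the actual implementation, of the selection scheme just described: the MI criterion is collapsed into a binary filter, and the surviving candidate with the largest MDL gain is chosen greedily. The names Candidate, mi_score, mdl_gain and the threshold value are hypothetical.

    # A minimal sketch of the two-criterion selection described above, not the
    # thesis implementation. Each candidate sequence is assumed to carry an
    # MI-based score and an estimated MDL gain; the field names and the
    # threshold value are illustrative.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Candidate:
        sequence: tuple      # e.g. a tag sequence such as ('ART', 'N')
        mi_score: float      # how constituent-like its context distribution looks
        mdl_gain: float      # estimated reduction in description length

    def select_candidate(candidates: List[Candidate],
                         mi_threshold: float = 0.0) -> Optional[Candidate]:
        # Step 1: the MI criterion is reduced to a binary filter.
        plausible = [c for c in candidates if c.mi_score > mi_threshold]
        if not plausible:
            return None
        # Step 2: among the survivors, greedily take the largest MDL gain.
        return max(plausible, key=lambda c: c.mdl_gain)

Balancing the two criteria properly would mean replacing the threshold in the first step with a principled combination of the two scores, which is exactly what is missing here.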

7.12.2 Future Work

There are a number of possible avenues for future research. The most important, in my opinion, is to experiment with other languages, particularly languages with very free word order. There is some evidence that these techniques will work with Chinese (Redington et al., 1995), which has quite fixed word order, but no work has been done on highly inflected languages with free word order. In terms of modifications to the existing algorithms, the obvious step is to explore more sophisticated versions of the various techniques I have used here. First, I will experiment with the use of other metrics, such as information-theoretic metrics like the KL divergence. Secondly, I will try to use more sophisticated clustering algorithms, that either induce

a hierarchical clustering, or can adjust the number of clusters to the data, using techniques such as those of Figueiredo and Jain (2001). In addition, given that we know that the distributions are made up of linear combinations of distributions with zero mutual information, we can use this fact to derive a more sophisticated soft EM-based clustering algorithm that implicitly divides the clusters into valid and invalid constituents. More long-term projects include looking at modifications to the IO algorithm that will allow the direct incorporation of the MI criterion into the algorithm, and exploring the use of lexicalised grammars. In addition, none of the phenomena I have been modelling here really requires the full power of context-free grammars: it is possible that a slightly shallower approach might give good results.
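The following is a purely speculative sketch of what such a soft EM-based clustering might look like, not an algorithm implemented in this thesis: the observed joint distribution over (left, right) context pairs is modelled as a mixture of components, in each of which the two contexts are independent, i.e. have zero mutual information. The array shapes, the number of components K, and the function name are assumptions made only for the illustration.

    # A speculative sketch, not an implementation from this thesis, of a soft
    # EM clustering in which each component has zero mutual information
    # between left and right contexts (its joint distribution factorises).
    import numpy as np

    def em_zero_mi_mixture(counts, K, iters=50, seed=0):
        """counts[s, l, r] is how often sequence type s occurred with left
        context l and right context r (shape (S, L, R))."""
        rng = np.random.default_rng(seed)
        S, L, R = counts.shape
        pi = np.full(K, 1.0 / K)                     # component priors
        left = rng.dirichlet(np.ones(L), size=K)     # p_k(l), shape (K, L)
        right = rng.dirichlet(np.ones(R), size=K)    # q_k(r), shape (K, R)
        for _ in range(iters):
            # E-step: log-likelihood of each sequence type under each component,
            # using the factorised (zero-MI) form p_k(l) * q_k(r).
            ll = (np.einsum('slr,kl->sk', counts, np.log(left + 1e-12))
                  + np.einsum('slr,kr->sk', counts, np.log(right + 1e-12))
                  + np.log(pi + 1e-12))
            ll -= ll.max(axis=1, keepdims=True)
            resp = np.exp(ll)
            resp /= resp.sum(axis=1, keepdims=True)  # soft cluster memberships
            # M-step: re-estimate priors and the marginal context distributions.
            pi = resp.mean(axis=0)
            wl = np.einsum('sk,slr->kl', resp, counts) + 1e-12
            wr = np.einsum('sk,slr->kr', resp, counts) + 1e-12
            left = wl / wl.sum(axis=1, keepdims=True)
            right = wr / wr.sum(axis=1, keepdims=True)
        return pi, left, right, resp

Sequence types whose context counts cannot be explained well by any factorised component would receive diffuse responsibilities, which is one way the distinction between valid and invalid constituents mentioned above might emerge.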

7.12.3 Conclusion

These results are clearly very preliminary. Though the work here has demonstrated the validity of this technique, it is possible that developing it entirely in English means I have subtly encoded my understanding of the nature of English syntax into the search space of the program. It is therefore essential to experiment extensively with other languages. An advantage of unsupervised methods is of course that one is not limited to languages with extant treebanks, which are rather hard to come by. This does raise issues of evaluation, though given the crudity of the current analyses this should not be a major problem, since an evaluation on only a few tens or hundreds of sentences should suffice to establish whether or not the algorithm is choosing plausible analyses.

Chapter 8

Conclusion

8.1 Summary

In this thesis I have addressed a number of areas in the unsupervised learning of language. I have discussed various theoretical problems related to this issue, and presented various algorithms that, when combined, can learn a certain amount of language. To recapitulate, the aim of this thesis was to examine the validity of the Argument from the Poverty of the Stimulus (APS) by examining the use of unsupervised machine learning algorithms on a corpus that approximates the primary linguistic data (PLD) available to an infant child. I first examined the requirements and restrictions on such algorithms and discussed some possible counter-arguments. I then discussed some formal and theoretical arguments against the enterprise, and concluded that these were without force. Then in Chapters 5, 6 and 7, I presented three sets of algorithms that could learn the syntactic categories of a language, the morphological relationships between the syntactic categories, and the basic phrase structure of the language, respectively. I also showed how the output from the syntactic category algorithm could serve as the input for the other two algorithms.

8.2 Previous work

There is very little work to which this thesis can be compared directly. In the previous chapters I have tried to compare the individual parts and algorithms, but to my knowledge there has been no attempt to show how these parts might be combined into a complete system. I shall briefly discuss some work that seems relevant, or that spans more than one area of language acquisition.

8.2.1 Empiricist work

Several researchers have produced series of papers that cover more than one aspect of unsupervised language acquisition. Brill produced in the early 1990s a sequence of papers (Brill, 1991; Brill & Marcus, 1992) that explored how distributional evidence could be used to learn parts of speech and syntax. de Marcken (1995, 1996b, 1996a) discussed how to acquire a lexicon from a raw speech signal and how to learn a segmentation of a corpus using an MDL technique. Brent and coworkers have produced work on segmentation (Brent & Cartwright, 1997), acquisition of syntactic categories (Brent, 1997), unsupervised learning of morphology (Snover & Brent, 2001) and learning of lexical syntax (Brent, 1991, 1993). Together these papers span an impressive range of language phenomena, though they are limited in the range of languages covered.

8.2.2 Nativist work

Nativist work has largely been limited in recent years to the problem of parameter setting, in the belief that this provides a solution to the problem of language acquisition. There have been a large number of papers discussing how certain parameters might be learned, many of them merely providing formal models, with no attempt to test them on real data. I shall not attempt a survey. Niyogi and Berwick (2000) say

The problem of language acquisition is reduced to learning to set the parameter values for the target language on the basis of sentences from that target.

This is of course not quite the case: the learner must also learn a lexicon, which is a highly non-trivial task. Indeed, given that most modern syntactic theories are highly lexicalised, learning a lexicon is more or less all that one has to do. In the absence of a theory as to how the lexicon is learned, it is difficult to see that learning parameters answers anything. A recent paper that presents a computational model of learning in a nativist framework is Villavicencio (2000). This paper discusses the learning of word-order parameters, using a unification-based framework. Here the algorithm is provided with sentences from the CHILDES database that have been fully semantically annotated. Thus the algorithm has been provided with a full lexicon, together with an unambiguous semantic representation for all of the sentences, even if they are ambiguous.

8.3 A Refutation of the Argument from the Poverty of the Stimulus?

I have presented various algorithms for learning various aspects of natural language. Now we must ask to what extent these results provide an answer to the scientific question examined in this thesis: do they constitute a refutation of the APS? I will discuss this in two steps: first I examine how much of language these models have 'learned', and second I examine the extent to which flaws in my techniques reduce the significance of the results I have obtained. First of all, the syntactic category induction algorithm relies too much on the statistical properties of English; I hope to redress this in the near future. Nonetheless, I think it is pretty clear that similar techniques will be able to work cross-linguistically with reasonable accuracy, particularly when the algorithms are given access to other sources of information. The morphology component seems the most convincing to me: in spite of the fact that it does not handle the full range of morphological transductions that are seen in the world's languages, it can learn, with high accuracy, in a variety of languages. It does not take account of the interactions that exist with other components of the grammar, but this is not too serious at this stage of the enquiry. The syntax algorithm is less satisfactory: though it performs well, and the criterion I use has some independent plausibility, the actual algorithm is rather ad hoc, and the results I obtain are not very satisfactory. In particular the algorithm has not learnt any of the sorts of complex constructions that have been used in the APS. On the other hand, the results I do have show that the stimulus is not as poor as all that – a large amount of data is a rich source of information, and the statistical patterns in the input are sufficiently prominent to give substantial clues as to its structure.

8.3.1 Criticisms

I have made numerous domain-specific assumptions, of various types, but at all steps I have tried to use standard algorithms. In fact, I find myself in a slight presentational dilemma: on the one hand, I want my work to appear original and complex, but on the other hand I want the algorithms I am using to appear simple and natural. I hope I have managed to get the balance right. For example, I have used very standard techniques of statistical estimation, primarily Maximum Likelihood estimation together with various applications of the EM algorithm. I have re-used techniques from one chapter to the next, such as distributional clustering, and at each step I have tried to indicate how these algorithms are not domain-dependent, and have pointed out applications in other areas. In addition I have tried to limit the use of ad hoc filters and tunable parameters, though in some cases they have proved to be necessary.

The most significant decision I made was to split the program up into separate components, and to model the different levels of language, morphology and syntax, with separate programs. I do not think this should be considered to be anything other than an optimisation; it would certainly be possible to do some of these simultaneously, computing over the ambiguities, and I intend to do just this in future work. A lot of the other assumptions I have made can also be seen as optimisations: for example, I have used a finite-state model for morphology, but a context-free model for syntax. Of course, I could have used a context-free model for the morphology as well, and it would probably have worked just as well, but the computational burden would have gone up sharply.

8.3.2 Excuses

There are a number of reasons why these results are less than completely convincing. First, the limited power of current computers has been a constant problem: all of the techniques presented here have been computationally expensive. Basically, all of the algorithms have fairly low polynomial complexity, cubic or better, but the amounts of data have not allowed me to test all of the algorithms as thoroughly as I would like, or to present a proper statistical analysis of the significance of some of the results. Secondly, though the growth of publicly available corpora has been enormous in the past few years, we are still a long way from having suitable corpora in a wide range of typologically distinct languages. Thirdly, the study of Machine Learning is still in its infancy, and many of the more advanced techniques are still poorly understood: as a result I have restricted myself to fairly basic techniques. Fourthly, I have perhaps been overly strict in rejecting all other sources of information: clearly, prosody and situational context do provide some information that might well be used by children, but it is difficult to incorporate this without trivialising the problem. Finally, this is a very new field of inquiry – only recently did it become possible to do this sort of computational modelling, and it is thus unrealistic to expect a complete solution to the problem of language acquisition. In spite of all of the weaknesses of the models presented here, they still compare well to the best nativist models of language acquisition – vacuously, since to the best of my knowledge there are none that accept cognitively reasonable inputs. Although the algorithm presented here may not be completely adequate, we can see the outlines of an algorithm that is adequate, though it may not be possible to construct it at the moment. To repeat a point made earlier, we are trying to demonstrate the existence of a plausible algorithm; it is not necessary that the one we present is perfect for the argument to be convincing.

8.4 Cognitive Plausibility

Though, as mentioned before, the cognitive plausibility of these results is not necessary for the argument to go through, I will make a few brief comments on the subject. It would be absurd to write the whole thesis and not discuss whether this relates at all to the much more interesting question of how infants actually do learn language. Saffran et al. (1996) and Maye and Gerken (2000) are good examples of the sort of empirical work that is showing that children actually do use these sorts of statistical learning algorithms to learn language. Are these techniques within the computational capabilities of the human brain?

Clearly the human brain doesn't do parameter estimation: this is just a technical device that we use to model what it does. I think the best way of thinking about how the brain actually learns is through the statistical mechanics analysis of learning that has been advanced in the context of neural networks (Hertz, Krogh, & Palmer, 1991; Engel & Van den Broeck, 2001). Statistical mechanics studies the properties of large systems of weakly interacting particles; the ability to learn can be seen as an emergent property of the whole system – in the case of humans, the set of neurons in the human brain. But in any event, it seems that the sorts of computation I propose are clearly within the computational grasp of the human brain, and are closely related to models that have been proposed in other areas, notably vision, one of the most productive areas of computational neuroscience. At this point we can stop and see to what extent these models address the remaining criteria that Pinker (1979, p. 219) identified, which I quoted in Section 2.4:

It is instructive to spell out these conditions one by one and examine the progress that has been made in meeting them. First, since all normal children learn the language of their community, a viable theory will have to posit mechanisms powerful enough to acquire a natural language. This criterion is doubly stringent: though the rules of language are beyond doubt highly intricate and abstract, children uniformly succeed at learning them nonetheless, unlike chess, calculus and other complex cognitive skills. Let us say that a theory that can account for the fact that languages can be learned in the first place has met the Learnability condition. Second, the theory should not account for the child's success by positing mechanisms narrowly adapted to the acquisition of a particular language. For example, a theory positing an innate grammar for English would fail to meet this criterion, which can be called the Equipotentiality Condition. Third, the mechanisms of a viable theory must allow the child to learn his language within the time span normally taken by children, which is in the order of three years for the basic components of language skill. Fourth, the mechanisms must not require as input types of information or amounts of information that are unavailable to the child. Let us call these the Time and Input Conditions, respectively. Fifth, the theory should make predictions about the intermediate stages of acquisition that agree with empirical findings in the study of child language. Sixth, the mechanisms described by the theory should not be wildly inconsistent with what is known about the cognitive faculties of the child, such as the perceptual discriminations he can make, his conceptual abilities, his memory, attention, and so forth. These can be called the Developmental and Cognitive Conditions, respectively.

As mentioned previously, I have been trying to construct models that satisfy the first four conditions: the Learnability condition, the Equipotentiality condition, the Time condition and the Input condition. I have just discussed the Cognitive condition; the remaining condition, the Developmental condition, requires that the model make the correct predictions about the intermediate stages of language development, and about the sorts of errors that the model predicts the child would make. I will not address this fully, as I have previously discussed some of this in Section 2.5. I will just say that while the morphology component seems to be very suitable for this sort of model, the category induction algorithm has a serious problem as it stands. The first categories that the algorithm acquires are the closed-class categories – they really leap out of the data – and the open-class categories such as noun and so on are acquired later. This is in direct contrast to the order in which children actually produce the words: first, they go through a phase of so-called telegraphic speech, where their speech is composed almost entirely of open-class content words, and only later do they fill in the closed-class function words. This is not fatal for this theory, since production of speech is presumably restricted by limitations on the semantic content in the very early years of life, so it is possible that the closed-class words have been learned but are not being used because children haven't worked out what the point of them is, or are limited in production in some other way. But it is still a bit uncomfortable.

8.5 Conclusion

We can see here the outlines of an empiricist theory of acquisition, but regardless of one's theoretical bias I hope the arguments presented here will have clarified some issues. In particular, people should mistrust their intuitions about what can and cannot be learned from raw data. One of the arguments for semantic bootstrapping is that without semantic information children would not be able to learn the syntactic categories; so even though it is implausible we must assume that they do use it. This sort of argument should clearly not be used. For those who are convinced by other arguments for innate knowledge, this thesis can be thought of as adjusting one of the bounds on innate knowledge. An upper bound on the innate knowledge we have is set by the wide range of languages that there are in the world: we cannot hypothesize too much language-specific innate knowledge or we will rule out some of these languages. Conversely the APS puts a lower bound on the amount of innate knowledge: if we do not have at least this much innate knowledge, we will not be able to learn any language. Seen in this light, all I am doing is lowering this lower bound, perhaps to zero.

Bibliography

Abney, S. (1996). Statistical methods and linguistics. In Klavans, & Resnik (Klavans & Resnik, 1996), pp. 1–26.

Abney, S., & Light, M. (1999). Hiding a semantic hierarchy in a Markov model. In Proceedings of the Workshop in Unsupervised Learning in Natural Language Processing at ACL 99 University of Maryland.

Adriaans, P. (1999). Learning shallow context-free languages under simple distributions. Tech. rep. ILLC Report PP-1999-13, Institute for Logic, Language and Computation, Amsterdam.

Adriaans, P., Trautwein, M., & Vervoort, M. (2000). Towards high speed grammar induction on large text corpora. In Hlavac, V., Jeffery, K. G., & Wiedermann, J. (Eds.), SOFSEM 2000: Theory and Practice of Informatics, pp. 173–186. Springer Verlag.

Aho, A. V., & Ullman, J. D. (1969a). Properties of syntax-directed translations. Journal of Computer and System Sciences, 3(3), 319–334.

Aho, A. V., & Ullman, J. D. (1969b). Syntax-directed translations and the pushdown assembler. Journal of Computer and System Sciences, 3(1), 37–56.

Allison, L. (1993). Normalization of affine gap costs in optimal sequence alignment. Journal of Theoretical Biology, 161, 263–269.

Allison, L., Powell, D., & Dix, T. I. (1999). Compression and approximate matching. The Computer Journal, 42(1), 1–10.

Allison, L., Wallace, C. S., & Yee, C. N. (1990). Inductive inference over macro-molecules. Tech. rep. 90/148, Department of Computer Science, Monash University.

Allison, L., Wallace, C. S., & Yee, C. N. (1992). Finite-state models in the alignment of macro-molecules. Journal of Molecular Evolution, 35, 77–89.

Allison, L., & Wallace, C. (1994). The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimisation of multiple alignments. Journal of Molecular Evolution, 39, 418–430.

Alshawi, H. (1996). Qualitative and quantitative models of speech translation. In Klavans, & Resnik (Klavans & Resnik, 1996), pp. 27–48.

Aston, G., & Burnard, L. (1998). The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh University Press.

Atwell, E. (1987). Constituent-likelihood grammar. In Garside et al. (Garside, Leech, & Sampson, 1987), pp. 57–65.

Baayen, H., Piepenbrock, R., & van Rijn, H. (1993). The CELEX lexical database. CD-ROM, Linguistic Data Consortium, Philadelphia, PA.

Baayen, H. (1991). Quantitative aspects of morphological productivity. In Booij, G., & van Marle, J. (Eds.), Yearbook of Morphology 1991, pp. 109–150. Kluwer.

Baayen, H., & Sproat, R. (1996). Estimating lexical priors for low-frequency morphologically ambiguous forms. Computational Linguistics, 22(2), 155–166.

Baker, J. K. (1979). Trainable grammars for speech recognition. In Klatt, D. H., & Wolf, J. J. (Eds.), Speech communication papers for the 97th meeting of the Acoustical Society of America, pp. 547–550.

Baldwin, T., & Tanaka, H. (1999). The applications of unsupervised learning to Japanese grapheme-phoneme alignment. In Proceedings of ACL Workshop on Unsupervised Learning in Natural Language Processing, pp. 9–16.

Baldwin, T., & Tanaka, H. (2000). A comparative study of unsupervised grapheme-phoneme alignment methods. In Proceedings of the 14th Pacific Asia Conference on Language, Information and Computation (PACLIC 14), pp. 3–14.

Barvinok, A. I. (1999). Polynomial time algorithms to approximate permanents and mixed discriminants within a simple exponential factor. Random Structures and Algorithms, 14, 29–61.

Bates, E., & Elman, J. (1996). Learning rediscovered. Science, 274(5294), 1849–50.

Bates, E. (1994). Modularity, domain specificity and the development of language. Discussions in Neuroscience, 1/2, 136–149.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society London, 53, 370–418.

Beichl, I., & Sullivan, F. (1999). Approximating the permanent via importance sampling with application to the dimer covering problem. Journal of Computational Physics, 149(1), 128–147.

Bengio, Y., Ducharme, R., & Vincent, P. (2000). A neural probabilistic language model. Tech. rep. 1178, Université de Montréal.

Bengio, S., & Bengio, Y. (1996). An EM algorithm for Asynchronous Input/Output Hidden Markov Models. In Xu, L. (Ed.), International Conference on Neural Information Processing, pp. 328–334 Hong Kong.

Bengio, Y. (1999). Markovian models for sequential data. Neural Computing Surveys, 2, 129–162.

Bengio, Y., & Frasconi, P. (1996). Input/output HMMs for sequence processing. IEEE Transactions on Neural Networks, 7(5).

Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian Theory. Wiley.

Bertram, R., Laine, M., Baayen, R. H., Schreuder, R., & Hyönä, J. (2000). Affixal homonymy triggers full-form storage even with inflected words, even in a morphologically rich language. Cognition, 972, 1–13.

Bhatia, R. (1996). Matrix analysis. Springer.

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, D. V. M. (1996). Editorial: A gene for grammar. Semiotic Review of Books, 7(2), 1–2.

Bloom, P. (Ed.). (1994). Language Acquisition: Core Readings. MIT Press.

Bod, R. (1993). Using an annotated corpus as a stochastic grammar. In Proceedings of EACL-93.

Bod, R. (1998). Beyond Grammar. CSLI Publications, Stanford University, California.

Boden, M. A. (1990). Introduction. In Boden, M. A. (Ed.), The Philosophy of Artificial Intelligence, pp. 1–21. Oxford University Press.

Bohannon, J. N., MacWhinney, B., & Snow, C. E. (1990). No negative evidence revisited: beyond learnability or who has to prove what to whom. Developmental Psychology, 26(2), 221–226.

Braine, M. D. S. (1988). Review of S. Pinker's "Language Learnability and Language Development". Journal of Child Language, 15, 189–219.

Braine, M. D. S. (1992). What sort of innate structure is needed to bootstrap into syntax? Cognition, 45, 77–100.

Brakke, K. E., & Savage-Rumbaugh, E. S. (1996). The development of language skills in Pan II production. Language and Communication, 16(4), 361–380.

Bregman, L. M. (1967). Proof of convergence of Sheleikhovskii’s method for a problem with transportation constraints. Zh. vychsl. Mat. mat. Fiz., 147(7).

Brent, M. R. (Ed.). (1997). Computational Approaches to Language Acquisition. Cognition. MIT Press.

Brent, M. R. (1991). Automatic acquisition of subcategorisation frames from untagged text. In Proceedings of the 29th Annual Meeting of the ACL, pp. 209–214.

Brent, M. R. (1993). From grammar to lexicon: unsupervised learning of lexical syntax. Computational Linguistics, 19, 243–262.

Brent, M. R. (1997). Syntactic categorisation in early language acquisition: formalizing the role of distributional analysis. Cognition, 63(2), 121–170.

Brent, M. R., & Cartwright, T. A. (1997). Distributional regularity and phonotactic constraints are useful for segmentation. In Brent (Brent, 1997), pp. 93–125.

Brill, E. (1991). Discovering the lexical features of a language. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 339–340.

Brill, E. (1992). A simple rule-based part-of-speech tagger. In Third Conference on Applied Natural Language Processing.

Brill, E. (1993). Automatic grammar induction and parsing free text: A transformation-based approach. In Proceedings of the 31st Annual Meeting of the ACL, pp. 259–265.

Brill, E., & Marcus, M. (1992). Automatically acquiring phrase structure using distributional analysis. In Proceedings of DARPA workshop on speech and natural language.

Briscoe, T., & Waegner, N. (1992). Robust stochastic parsing using the inside-outside algorithm. In AAAI-92 Workshop on Statistically-Based NLP Techniques.

Broeder, P., & Murre, J. (Eds.). (2000). Models of Language Acquisition. Oxford University Press.

Brown, P. F., Della Pietra, V. J., de Souza, P. V., Lai, J. C., & Mercer, R. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18, 467–479.

Burnard, L. (1995). Users Reference Guide for the British National Corpus.

Bybee, J. L. (1995). Regular morphology and the lexicon. Language and Cognitive Processes, 10, 425–455.

Bybee, J. L., & Moder, C. L. (1983). Morphological classes as natural categories. Language, 59, 251–270.

Bybee, J. L., & Slobin, D. I. (1982). Rules and schemes in the development and use of the English past tense. Language, 58, 265–289.

Cahill, L. J., & Gazdar, G. (1997). The inflectional phonology of German adjectives, determiners and pronouns. Linguistics, 35(2), 211–245.

Cahill, L. J., & Gazdar, G. (1999). German noun inflection. Journal of Linguistics, 35(1), 1–42.

Carbonell, J. (Ed.). (1990). Machine Learning: Paradigms and methods. MIT Press.

Carden, G. (1983). The non-finite-stateness of the word formation component. Linguistic Inquiry, 14, 537–541.

Carroll, G., & Charniak, E. (1992). Two experiments on learning probabilistic dependency grammars from corpora. Tech. rep. CS-92-16, Department of Computer Science, Brown University.

Carroll, J., Briscoe, T., & Sanfilippo, A. (1998). Parser evaluation: a survey and a new proposal. In Proceedings of the 1st International Conference on Language Resources and Evaluation, pp. 447–454 Granada, Spain.

Casacuberta, F. (1995). Probabilistic estimation of stochastic regular syntax-directed translation schemes. In Proceedings of the VIth Spanish Symposium on Pattern Recognition and Image Analysis, pp. 201–207.

Casacuberta, F. (1996). Maximum mutual information and conditional maximum likelihood estimations of stochastic regular syntax-directed translation schemes. In Miclet, L., & de la Higuera, C. (Eds.), ICGI-96, pp. 282–291.

Casacuberta, F., & de la Higuera, C. (2000). Computational complexity of problems on probabilistic grammars and transducers. In Oliveira, A. L. (Ed.), Grammatical Inference: Algorithms and Applications, No. 1891 in Lecture Notes in Artificial Intelligence, pp. 15–24. Springer Verlag.

Charness, N. (1992). The impact of chess research on cognitive science. Psychological Research, 54, 4–9.

Charniak, E. (1993). Statistical Language Learning. MIT Press.

Charniak, E. (2001). Immediate head parsing for language models. In Proceedings of the 39th annual meeting of the ACL, pp. 116–123 Toulouse, France.

Chen, S. (1995). Bayesian grammar induction for language modelling. In Proceedings of the 33rd Annual Meeting of the ACL, pp. 228–235.

Chomsky, C. (1979). The acquisition of syntax in children from 5 to 10. MIT Press.

Chomsky, N. (1959). A review of B. F. Skinner’s ”Verbal Behavior”. Language, 35(1), 26–58.

Chomsky, N. (1964). Degrees of grammaticalness. In Fodor, & Katz (Fodor & Katz, 1964), pp. 384–389.

Chomsky, N. (1965). Aspects of the Theory of Syntax. MIT.

Chomsky, N. (1966). Topics in the Theory of Generative Grammar. Mouton.

Chomsky, N. (1967). Recent contributions to the theory of innate ideas. Synthese, 17, 2–11.

Chomsky, N. (1975a). The Logical Structure of Linguistic Theory. University of Chicago Press.

Chomsky, N. (1975b). Recent contributions to the theory of innate ideas. In Stich (Stich, 1975), pp. 121–132.

Chomsky, N. (1975c). Reflections on Language. Pantheon Books.

Chomsky, N. (1981). Lectures on Government and Binding. Foris Publications.

Chomsky, N. (1986). Knowledge of Language : Its Nature, Origin, and Use. Praeger.

Chomsky, N. (1988a). The Generative Enterprise. A discussion with Riny Huybregts and Henk van Riemsdijk. Foris, Dordrecht.

Chomsky, N. (1988b). Language and Problems of Knowledge. MIT Press.

Chomsky, N. (2000). New Horizons in the Study of Language and Mind. Cambridge University Press.

Christophe, A., Dupoux, E., Bertoncini, J., & Mehler, J. (1994). Do infants perceive word boundaries? An empirical study of the bootstrapping of lexical acquisition. Journal of the Acoustical Society of America, 95, 1570–1580.

Church, K., & Mercer, R. (1993). Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1), 1–24.

Clahsen, H., Rothweiler, M., Wöst, A., & Marcus, G. M. (1992). Regular and irregular inflection in the acquisition of German noun plurals. Cognition, 45, 225–255.

Clark, A. (2000). Inducing syntactic categories by context distribution clustering. In Proceedings of CoNLL-2000 and LLL-2000, pp. 91–94 Lisbon, Portugal.

Clark, A. (2001a). Learning morphology with Pair Hidden Markov Models. In Proceedings of the Student Workshop at the 39th Annual Meeting of the Association for Computational Linguistics, pp. 55–60 Toulouse, France.

Clark, A. (2001b). Partially supervised learning of morphology with stochastic transducers. In Proceedings of NLPRS 2001 Tokyo, Japan. to appear.

Clark, A. (2001c). Unsupervised induction of stochastic context free grammars with distributional clustering. In Proceedings of CoNLL 2001, pp. 105–112 Toulouse, France.

Clark, R., Gleitman, L., & Kroch, A. (1997). Acquiring language. Science, 276(5316), 1177–1181.

Comrie, B. (1987). The World’s Major Languages. Routledge.

Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley Series in Telecommunications. John Wiley & Sons.

Cowie, F. (1999). What’s Within? Nativism Reconsidered. Oxford University Press.

Crago, M., & Gopnik, M. (1994). From families to phenotypes: theoretical and clinical implications of research into the genetic basis of specific language impairment. In Watkins, R., & Rice, M. (Eds.), Specific Language Impairments in Children, pp. 35–51. Paul H. Brookes, Baltimore.

Crain, S., & Pietroski, P. (2001). Nature, nurture and universal grammar. Linguistics and Philosophy, 24, 139–186.

Crain, S., & Thornton, R. (1998). Investigations in Universal Grammar: A guide to Experiments on the Acquisition of Syntax and Semantics. Bradford.

Daelemans, W., Berck, P., & Gillis, S. (1997). Data mining as a method for linguistic analysis: Dutch diminutives. Folia Linguistica, XXXI, 1–2.

Daelemans, W. M. P., & van den Bosch, A. P. J. (1997). Language-independent data-oriented grapheme-to-phoneme conversion. In van Santen, J. P. H., Sproat, R. W., Olive, J. P., & Hirschberg, J. (Eds.), Progress in Speech Synthesis, pp. 77–89. Springer, New York.

Dagan, I., Marcus, S., & Markovitch, S. (1993). Contextual word similarity and estimation from sparse data. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 164–171.

de Marcken, C. G. (1995). The unsupervised acquisition of a lexicon from continuous speech. AI memo 1558, MIT AI Lab.

de Marcken, C. G. (1996a). Linguistic structure as composition and perturbation. In Joshi, A., & Palmer, M. (Eds.), Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics, pp. 335–341 San Francisco. Morgan Kaufmann Publishers.

de Marcken, C. G. (1996b). Unsupervised Language Acquisition. Ph.D. thesis, MIT.

de Marcken, C. G. (1999). On the unsupervised induction of phrase-structure grammars. In Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., & Yarowsky, D. (Eds.), Natural Language Processing Using Very Large Corpora, pp. 191–208. Kluwer.

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6), 391–407.

Déjean, H. (1998). Morphemes as necessary concept for structures discovery from untagged corpora. In Workshop on Paradigms and Grounding in Natural Language Learning, pp. 295–299 Adelaide.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39, 1–38.

Dermatas, E., & Kokkinakis, G. (1995). Automatic stochastic tagging of natural language texts. Computational Linguistics, 21(2), 137–164.

Derwing, B. L., & Skousen, R. (1994). The reality of linguistic rules. In Lima, S. D., Corrigan, R. L., & Iverson, G. K. (Eds.), The Reality of Linguistic Rules, No. 26 in Studies in Language Companion Series, chap. Productivity and the English Past Tense, pp. 193–218. John Benjamins Publishing Company, Amsterdam.

Divay, M., & Vitale, A. J. (1997). Algorithms for grapheme-phoneme translation for English and French: Applications. Computational Linguistics, 23(4), 495–524.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of proteins and nucleic acids. Cambridge University Press.

Eisner, J. (2001). Expectation semi-rings: Flexible EM for learning finite-state transducers. In Proceedings of the ESSLLI Workshop on Finite-State Methods in NLP Helsinki.

Elman, J., Bates, E., Johnson, M., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking Innateness: A Connectionist Perspective on Development. MIT.

Engel, A., & Van den Broeck, C. (2001). Statistical Mechanics of Learning. Cambridge.

Feldman, J. A., Lakoff, G., Bailey, D., Narayanan, S., Regier, T., & Stolcke, A. (1996). l0 - the first five years of an automated language acquisition project. Artificial Intelligence Review, 10(1-2), 103–129.

Feldman, J. A., Lakoff, G., Stolcke, A., & Weber, S. H. (1990). Miniature language acquisition: A touchstone for cognitive science. In Proceedings of the Cognitive Science Society, pp. 686–693 Cambridge, MA.

Figueiredo, M., & Jain, A. K. (2001). Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence. to appear.

Finch, S., & Chater, N. (1992a). Bootstrapping syntactic categories. In Proceedings of the 14th Annual Meeting of the Cognitive Science Society, pp. 820–825.

Finch, S., & Chater, N. (1992b). Bootstrapping syntactic categories using statistical methods. In Daelemans, W., & Powers, D. (Eds.), Background and Experiments in Machine Learning of Natural Language, pp. 229–235. Tilburg University: Institute for Language Technology and AI.

Finch, S., Chater, N., & Redington, M. (1995). Acquiring syntactic information from distributional statistics. In Levy, J. P., Bairaktaris, D., Bullinaria, J. A., & Cairns, P. (Eds.), Connectionist Models of Memory and Language. UCL Press.

Fisher, C., Hall, D. G., Rakowitz, S., & Gleitman, L. R. (1994). When it is better to give than to receive: Syntactic support for verb learning. Lingua, 92(1), 333–375.

Fodor, J. A. (1981). Representations. MIT Press.

Fodor, J. A. (1983). Modularity of Mind. MIT Press.

Fodor, J. A. (2001). Doing without what’s within: Fiona Cowie’s critique of nativism. Mind, 110(437), 99–148.

Fodor, J. A., & Katz, J. J. (Eds.). (1964). The structure of language: Readings in the philosophy of language. Prentice-Hall.

Freyhof, H., Gruber, H., & Ziegler, A. (1992). Expertise and hierarchical knowledge representation in chess. Psychological Research, 54, 32–37.

Garfield, J. L. (1994). Innateness. In Guttenplan, S. (Ed.), A companion to the philosophy of mind, pp. 366–374. Blackwell.

Garside, R., Leech, G., & Sampson, G. (Eds.). (1987). The Computational Analysis of English. Longman.

Gaussier, E. (1999). Unsupervised learning of derivational morphology from inflectional lexicons. In Unsupervised Learning in Natural Language Processing. ACL.

Gazdar, G. (1996). Paradigm merger in natural language processing. In Milner, R., & Wand, I. (Eds.), Computing Tomorrow: Future Research Directions in Computer Science. Cambridge University Press.

Gazdar, G., Klein, E., Pullum, G., & Sag, I. (1985). Generalised Phrase Structure Grammar. Basil Blackwell.

Gerken, L., Jusczyk, P. W., & Mandel, D. R. (1994). When prosody fails to cue syntactic structure: 9-month-olds’ sensitivity to phonological versus syntactic phrases. Cognition, 51, 237–265.

Geurts, B. (2000). Review of Stephen Crain & Rosalind Thornton, Investigations in Universal Grammar. Linguistics and Philosophy, 23, 523–532.

Gildea, D., & Jurafsky, D. (1996). Learning bias and phonological-rule induction. Computational Linguistics, 22(4), 497–530.

Gleitman, L. (1990). The structural sources of verb meaning. Language Acquisition, 1(1), 3–55.

Gleitman, L. (1994). The structural sources of verb meaning. In Bloom (Bloom, 1994), pp. 174–221.

Gleitman, L. R., & Gillette, J. (1995). The role of syntax in verb learning. In Fletcher, P., & MacWhinney, B. (Eds.), The Handbook of Child Language, pp. 413–427. Blackwell.

Gold, E. M. (1967). Language identification in the limit. Information and Control, 10, 447–474.

Gold, S., & Rangarajan, A. (1996). Softmax to softassign: Neural network algorithms for combinatorial optimization. Journal of Artificial Neural Networks, 2(4), 381–399.

Golding, A. R., & Thompson, H. S. (1985). A morphology component for language programs. Linguistics, 23, 263–284.

Goldsmith, J. A. (1998). On information theory, entropy, and phonology in the 20th century. draft of a paper read at the Royaumont CTIP II Round table on phonology in the 20th Century.

Goldsmith, J. A. (2000). Unsupervised learning of the morphology of a natural language. Available at http://humanities.uchicago.edu/faculty/goldsmith.

Goldsmith, J. A. (2001). Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2), 153–198.

Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40, 237–264.

Goodman, J. (1996). Efficient algorithms for parsing the DOP model. In Brill, E., & Church, K. (Eds.), Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 143–152. Association for Computational Linguistics, Somerset, New Jersey.

Goodman, J. (1998). Parsing Inside-Out. Ph.D. thesis, Harvard University.

Gopnik, M. (1990). Feature blindness: a case study. Language Acquisition, 1, 139–164.

Gopnik, M., & Crago, M. (1991). Familial aggregation of a developmental language disorder. Cognition, 39, 1–50.

Gordon, P. (1985). Level-ordering in lexical development. Cognition, 21, 73–93.

Grefenstette, G. (1994). Explorations in Automatic Thesaurus Discovery. Kluwer.

Grenander, U. (1996). Elements of Pattern Theory. Johns Hopkins.

Grünwald, P. (1996). A minimum description length approach to grammar inference. In Wermter et al. (Wermter et al., 1996).

Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.

Halmos, P. R. (1950). Measure Theory. Springer-Verlag.

Harris, Z. (1954). Distributional structure. In Fodor, & Katz (Fodor & Katz, 1964), pp. 33–49.

Harris, Z. (1955). From phonemes to morphemes. Language, 31, 190–222.

Hart, B., & Risley, T. R. (1995). Meaningful differences in the everyday experiences of young children. Paul H. Brookes, Baltimore.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Perseus Books, Reading, MA.

Hobbs, J. R., & Shieber, S. M. (1987). An algorithm for generating quantifier scopings. Computational Linguistics, 13(1-2), 47–63.

Hockett, C. (1958). Two models of grammatical description. In Joos, M. (Ed.), Readings in Linguistics (2nd edition). University of Chicago Press.

Hofmann, T., & Puzicha, J. (1998). Statistical models for co-occurrence data. Tech. rep. AI Memo 1625, CBCL Memo 159, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, MIT.

Holt, L. L., Lotto, A. J., & Kluender, K. R. (1998). Incorporating principles of general learning in theories of language acquisition. In Gruber, M., Higgins, C. D., Olson, K. S., & Wysocki, T. (Eds.), Chicago Linguistic Society, Volume 34, pp. 253–268. Chicago Linguistic Society.

Horning, J. J. (1969). A study of grammatical inference. Ph.D. thesis, Computer Science Department, Stanford University.

IHGMC (2001). A physical map of the human genome. Nature, 409, 934–941. The International Human Genome Mapping Consortium.

Jelinek, F. (1997). Statistical methods for speech recognition. Language, Speech and Communication. MIT Press.

Jerrum, M., & Sinclair, A. (1989). Approximating the permanent. SIAM Journal on Computing, 18, 1149–1178.

Jespersen, O. (1924). The Philosophy of Grammar. Allen & Unwin. reprinted (1992) by University of Chicago Press.

Johnson, D., & Lappin, S. (1997). A critique of the Minimalist Program. Linguistics and Philosophy, 20, 273–333.

Juola, P. (1998). On psycholinguistic grammars. Grammars, 1(1), 15–31.

Kaplan, R. M., & Kay, M. (1994). Regular models of phonological rule systems. Computational Linguistics, 20(3), 331–378.

Kapur, S., & Bilardi, G. (1992). Language learning from stochastic input. In Proceedings of the fifth conference on Computational Learning Theory, pp. 303–310 Pittsburgh.

Kapur, S. (1991). Computational Learning of Languages. Ph.D. thesis, Cornell University. Computer Science Department Technical Report 91-1234.

Kapur, S., & Clark, R. (1996). The automatic construction of a symbolic parser via statistical techniques. In Klavans, & Resnik (Klavans & Resnik, 1996), pp. 95–117.

Keller, B., & Lutz, R. (1997). Evolving stochastic context-free grammars from examples using a minimum description length principle. In Workshop on Automatic Induction, Grammatical Inference and Language Acquisition.

Kelly, M. H., & Martin, S. (1992). Domain-general abilities applied to domain-specific tasks: Sensitivity to probabilities in perception, cognition and language. Lingua, 92, 105–140.

Kit, C. (1998). A goodness measure for phrase learning via compression with the mdl principle. In Proceedings of the ESSLLI Student session, pp. 175–187.

Klavans, J., & Resnik, P. (Eds.). (1996). The Balancing Act. MIT Press.

Klein, D., & Manning, C. (2001). Distributional phrase structure induction. In Proceedings of CoNLL 2001, pp. 113–121.

Knight, K., & Graehl, J. (1997). Machine transliteration. In Proceedings of ACL-97.

Köpcke, K.-M. (1988). Schemas in German plural formation. Lingua, 74, 303–335.

Kornai, A. (1992). Frequency in morphology. In Kenesei, I. (Ed.), Approaches to Hungarian, Vol. 4, pp. 246–268. Szeged: JATE.

Kornai, A. (1998). Quantitative comparison of languages. Grammars, 2, 155–165.

Koskenniemi, K. (1983a). Two-level model for morphological analysis. In Bundy, A. (Ed.), Proceedings of the Eighth International Conference on Artificial Intelligence, pp. 683–685.

Koskenniemi, K. (1983b). A Two-level Morphological Processor. Ph.D. thesis, University of Helsinki.

Koskenniemi, K. (1984). A general computational model for word-form recognition and production. In Proceedings of COLING-84, pp. 178–181.

Krogh, A., & Riis, S. K. (1999). Hidden neural networks. Neural Computation, 11(2), 541–563.

Kuich, W., & Salomaa, A. (1986). Semirings, Automata and Languages. No. 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag.

Lamb, S. M. (1961). On the mechanisation of syntactic analysis. In 1961 Conference on Machine Translation of Languages and Applied Language Analysis, Vol. 2 of National Physical Laboratory Symposium No. 13, pp. 674–685. Her Majesty's Stationery Office, London.

Lance, G. N., & Williams, W. T. (1967). A general theory of classificatory sorting strategies. Computer Journal, 9, 373–380.

Langendoen, O. T. (1981). The generative capacity of word-formation components. Linguistic Inquiry, 12, 320–322.

Lari, K., & Young, S. J. (1990). The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4, 35–56.

Lee, L. (1999). Measures of distributional similarity. In 37th Annual Meeting of the Association for Computational Linguistics, pp. 25–32.

Leech, G., Garside, R., & Bryant, M. (1994). CLAWS4: the tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics, pp. 622–628.

Lerdahl, F., & Jackendoff, R. (1983). A Generative theory of tonal music. The MIT Press.

Levin, B. (1993). English verb classes and alternations: a preliminary investigation. The University of Chicago Press.

Li, H., & Abe, N. (1996). Clustering words with the mdl principle. In Proceedings of COLING ’96, pp. 4–9.

Li, H., & Abe, N. (1998). Word clustering and disambiguation based on co-occurrence data. In Proceedings of COLING ACL ’98, pp. 749–755.

Li, W. (1990). Mutual information functions versus correlation functions. Journal of Statistical Physics, 60, 823–837.

Lin, J.-h., & Vitter, J. S. (1994). A theory for memory-based learning. Machine Learning, 17(1- 26).

Ling, C. X. (1994). Learning the past tense of English verbs: The symbolic pattern associator vs. connectionist models. Journal of Artificial Intelligence Research, 1, 209–229.

Locke, J. (1690). An Essay Concerning Human Understanding.

Locke, W. N., & Booth, A. D. (Eds.). (1955). Machine Translation of Languages. MIT Press.

MacWhinney, B., & Leinbach, J. (1991). Implementations are not conceptualizations: Revising the verb model. Cognition, 40, 121–157.

MacWhinney, B. (1995). The CHILDES Project: Tools for analyzing talk. Lawrence Erlbaum, Hillsdale, New Jersey.

Magerman, D. M., & Marcus, M. P. (1990). Parsing a natural language using mutual information statistics. In Proceedings of the Eighth National Conference on Artificial Intelligence.

Manandhar, S., Dzeroski, S., & Erjavec, T. (1998). Learning multi-lingual morphology with clog. In Page, C. D. (Ed.), Proc. of the 8th International Workshop on Inductive Logic Programming (ILP-98). Springer Verlag.

Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.

Manning, C. D. (1998). The segmentation problem in morphology learning. In Powers, D. M. W. (Ed.), NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Language Learning, pp. 299–305. ACL.

Marcus, G. F., Brinkmann, U., Clahsen, H., Wiese, R., & Pinker, S. (1995). German inflection: The exception that proves the rule. Cognitive Psychology, 29, 189–256.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19.

Marr, D. (1982). Vision. W. H. Freeman.

Matthews, R. J. (1989). The plausibility of rationalism. In Matthews, R. J., & Demopoulos, W. (Eds.), Learnability and Linguistic Theory, pp. 51–76. Dordrecht.

Maye, J., & Gerken, L. (2000). Learning phonemes without minimal pairs. In Proceedings of the 24th Annual Boston University Conference on Language Development.

McCarthy, J., & Prince, A. (1990). Foot and word in prosodic morphology: The Arabic broken plural. Natural Language and Linguistic Theory, 8, 209–284.

McCawley, J. D. (1988). The Syntactic Phenomena of English. The University of Chicago Press.

Miller, G. A., & Chomsky, N. (1963). Finitary models of language users. In Luce, R. D., Bush, R., & Galanter, E. (Eds.), Handbook of Mathematical Psychology, Vol. 2, pp. 419–491. Wiley, New York.

Miller, M. I., & O'Sullivan, J. A. (1992). Entropies and combinatorics of random branching processes and context-free languages. IEEE Transactions on Information Theory, 38(4), 1292–1310.

Miller, S., Stallard, D., Bobrow, R., & Schwartz, R. (1996). A fully statistical approach to natural language interfaces. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 55–61 Santa Cruz, California.

Mohri, M., Pereira, F. C. N., & Riley, M. (2000). The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231(1), 17–32.

Moll, R. N., Arbib, M. A., & Kfoury, A. J. (1988). An introduction to formal language theory. Springer-Verlag.

Mooney, R. J., & Califf, M. E. (1995). Induction of first-order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, 3, 1–24.

Mori, A., & Nagao, M. (1995). Parsing without grammar. In Proceedings of the 4th International Workshop on Parsing Technologies, pp. 174–185.

Mori, S., & Nagao, M. (1998). A stochastic language model using dependency and its improvement by word clustering. In Proceedings of COLING ACL '98, pp. 898–904.

Mori, S., Nishimura, M., & Ito, N. (1997). Word clustering for class-based language models. Transactions of Information Processing Society of Japan, 38(11), 2200–2208. In Japanese.

Muggleton, S. (1999). Inductive Logic Programming: issues, results and the LLL challenge. Artificial Intelligence, 114(1-2), 283–296.

Muggleton, S., & Bain, M. (1999). Analogical prediction. In Proceedings of the 9th International Workshop on Inductive Logic Programming (ILP-99) Berlin. Springer-Verlag.

Muggleton, S. (Ed.). (1997). Inductive Logic Programming. Springer-Verlag.

Murakami, J., Yamatomo, H., & Sagayama, S. (1993). The possibility for acquisition of statistical network grammar using ergodic HMM. In Proceedings of Eurospeech 93, pp. 1327–1330.

Neidle, C., Kegl, J., MacLaughlin, D., Bahan, B., & Lee, R. G. (1999). The Syntax of American Sign Language. MIT Press.

Newmeyer, Frederick, J. (1983). Grammatical Theory: Its Limits and Its Possibilities. The University of Chicago Press.

Newport, E. H., Gleitman, H., & Gleitman, E. (1977). Mother I'd rather do it myself: some effects and non-effects of maternal speech style. In Snow, C. E., & Ferguson, C. A. (Eds.), Talking to Children: language input and acquisition, pp. 109–149. Cambridge University Press.

Newton, I. (1687). Philosophiae naturalis principia mathematica. In Cohen, I. B., & Whitman, A. (Eds.), Isaac Newton: The Principia: A new translation. University of California Press. (1999).

Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependencies in stochastic language modelling. Computer Speech and Language, 8, 1–38.

Niesler, T. (1997). Category-based statistical language models. Ph.D. thesis, University of Cambridge.

Niyogi, P., & Berwick, R. C. (2000). Formal models for learning in the principle and parameters framework. In Broeder, & Murre (Broeder & Murre, 2000), pp. 225–243.

Normandin, Y. (1996). Maximum mutual information estimation of Hidden Markov Models. In Lee, C.-H., Soong, F. K., & Paliwal, K. K. (Eds.), Automatic Speech and Speaker Recognition, pp. 58–81 Norwell, MA. Kluwer Academic Publishers.

Oflazer, K. (1996). Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, 22(1), 73–89.

Osherson, D. N., Stob, M., & Weinstein, S. (1986). Systems that Learn: An introduction to Learning Theory for Cognitive and Computer Scientists (First edition). MIT Press.

Pereira, F. (2000). Formal grammar and information theory: together again?. Philosophical Transactions of the Royal Society Series A, 358, 1239–1253.

Pereira, F., & Lee, L. (1999). Distributional similarity models: Clustering vs. nearest neighbours. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics, pp. 33–40.

Pereira, F., & Schabes, Y. (1992). Inside-outside reestimation from partially bracketed corpora. In Proceedings of the 30th annual meeting of the Association for Computational Linguistics, pp. 128–135.

Pereira, F., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. In Proceedings of the 31st annual meeting of the Association for Computational Linguistics.

Piattelli-Palmarini, M. (Ed.). (1980). Language and Learning: The Debate between Jean Piaget and Noam Chomsky. Routledge and Kegan Paul.

Piattelli-Palmarini, M. (1994). Ever since language and learning: Afterthoughts on the Piaget-Chomsky debate. Cognition, 50, 315–346.

Pico, D., & Casacuberta, F. (2001). Some statistical-estimation methods for Stochastic Finite-State Transducers. Machine Learning, 44, 121–141.

Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. In Pinker, S., & Mehler, J. (Eds.), Connections and Symbols, pp. 73–193. MIT Press, Cambridge, MA.

Pinker, S. (1979). Formal models of language learning. Cognition, 7, 217–282.

Pinker, S. (1989). Learnability and Cognition. MIT Press.

Pinker, S. (1991). Rules of language. Science, 253, 530–535.

Pinker, S. (1994). The Language Instinct. Allen Lane.

Pinker, S. (1995). Language acquisition. In Osherson, D., Gleitman, L. R., & Liberman, M. (Eds.), An invitation to cognitive science: Volume 1 Language (2nd edition)., chap. 6, pp. 135–182. MIT Press.

Pinker, S. (1996). Language Learnability and Language Development (Second edition). Harvard University Press.

Plunkett, K., & Marchman, V. (1990). From rote learning to system building. Tech. rep. 9020, Center for Research in Language, UCSD.

Plunkett, K., & Nakisa, R. C. (1997). A connectionist model of the Arabic plural system. Lan- guage and Cognitive Processes, 12(5/6), 807–836.

Pollard, C., & Sag, I. (1994). Head Driven Phrase Structure Grammar. University of Chicago Press.

Prasada, S., & Pinker, S. (1993). Generalisation of regular and irregular morphological patterns. Language and Cognitive Processes, 8(1), 1–56.

Pullum, G. K. (1996). Learnability, hyperlearning and the argument from the poverty of the stimulus. In Parasession on Learnability, 22nd Annual Meeting of the Berkeley Linguistics Society Berkeley, California.

Pullum, G. K. (1997). The morpho-lexical nature of to-contraction. Language, 73, 79–102.

Pullum, G. K., & Scholz, B. C. (2001). Empirical assessment of stimulus poverty arguments. The Linguistic Review, to appear.

Pylyshyn, Z., & Burkell, J. A. (1997). Searching through subsets: a test of the visual indexing hypothesis. Spatial Vision, 11(2), 225–258.

Quine, W. V. (1969). Linguistics and philosophy. In Hook, S. (Ed.), Language and Philosophy, pp. 95–98. New York University Press.

Quine, W. V. O. (1975). Linguistics and philosophy. In Stich (Stich, 1975), pp. 200–202.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Rabiner, L. R. (1989). A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–285.

Raimy, E. (2000). The Phonology and Morphology of reduplication. No. 52 in Studies in Generative Grammar. Mouton de Gruyter.

Ramsey, W., & Stich, S. P. (1991). Connectionism and three levels of nativism. In Ramsey, W., Stich, S. P., & Rumelhart, D. E. (Eds.), Philosophy and Connectionist Theory, pp. 287–310. Lawrence Erlbaum Associates.

Redington, M., Chater, N., Huang, C., Chang, L., Finch, S., & Chen, K. (1995). The universality of simple distributional methods: Identifying syntactic categories in Chinese. In Proceedings of the Cognitive Science of Natural Language Processing Dublin. Dublin City University.

Reilly, R., & Sharkey, N. (Eds.). (1989). Connectionism and Language. Lawrence Erlbaum.

Rentzepopoulos, P. A., & Kokkinakis, G. K. (1996). Efficient multi-lingual phoneme-to-grapheme conversion based on HMM. Computational Linguistics, 22(3), 351–376.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Ristad, E. S., & Yianilos, P. N. (1998). Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), 522–532.

Ristad, E. S. (1997). Finite growth models. Tech. rep. CS-TR-533-96, Department of Computer Science, Princeton University. revised in 1997.

Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.

Rosenblatt, M. (1974). Random Processes (2nd edition). Springer-Verlag.

Rumelhart, D. E., & McClelland, J. L. (1986a). On learning past tenses of English verbs. In Rumelhart, & McClelland (Rumelhart & McClelland, 1986b), pp. 216–271.

Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986b). Parallel Distributed Processing, Vol. 2. MIT Press, Cambridge, MA.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by eight month old infants. Science, 274, 1926–1928.

Sakakibara, Y., Brown, M., Hughey, R., Mian, I., Sjolander, D., Underwood, R. D., & Haussler, D. (1994). Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 22(23), 5112–5120.

Sampson, G. (1995). English for the Computer: The SUSANNE Corpus and Analytic Scheme. Clarendon Press.

Sampson, G. (1987). Probabilistic models of analysis. In Garside et al. (Garside et al., 1987), pp. 16–29.

Sampson, G. (1989). Language acquisition: growth or learning?. Philosophical papers, 18, 203–240.

Sampson, G. (1997). Educating Eve. Cassell.

Savitch, W. J. (Ed.). (1987). The Formal Complexity of Natural Language. D. Reidel.

Schone, P., & Jurafsky, D. (2000). Knowledge-free induction of morphology using latent semantic analysis. In Proceedings of CoNLL-2000 and LLL-2000, pp. 67–72 Lisbon, Portugal.

Schütze, C. T. (1996). The Empirical Basis of Linguistics. University of Chicago Press.

Schütze, H. (1993). Part of speech induction from scratch. In Proceedings of the 31st annual meeting of the Association for Computational Linguistics, pp. 251–258.

Schütze, H. (1997). Ambiguity Resolution in Language Learning. CSLI Publications.

Scott, D. W. (1992). Multivariate density estimation: Theory, Practice and Visualization. Wiley Interscience.

Shames, G. H., Wiig, E. H., & Secord, W. A. (Eds.). (1998). Human Communication Disorders: An introduction (5th edition). Allyn and Bacon.

Silvey, S. D. (1975). Statistical Inference. Chapman and Hall.

Simon, H. (1962). The architecture of complexity. Proceedings of the American Philosophical Society, 106, 476–482.

Sinkhorn, R. (1964). A relation between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35(2).

Skinner, B. F. (1957). Verbal Behavior. Appleton Century Crofts.

Smolensky, P. (1996). On the comprehension/production dilemma in child language. Tech. rep. JHU-COGSCI-96-1, Department of Cognitive Science, Johns Hopkins University.

Snover, M. G., & Brent, M. R. (2001). A Bayesian model for morpheme and paradigm identification. In Proceedings of ACL 2001, pp. 482–490.

Soules, G. W. (1991). The rate of convergence of Sinkhorn balancing. Linear Algebra and Its Applications, 150(3).

Spencer, A. (1991). Morphological Theory. Basil Blackwell.

Steedman, M. (1990). Syntax and intonational structure in a combinatory grammar. In Altmann, G. T. M. (Ed.), Cognitive Models of Speech Processing: Psycholinguistic and Computational Perspectives, pp. 457–482. MIT Press.

Steedman, M. (1999). Connectionist sentence processing in perspective. Cognitive Science, 23, 615–634.

Steedman, M. (2000). Information structure and the syntax-phonology interface. Linguistic Inquiry, 31(4), 649–689.

Stich, S. P. (Ed.). (1975). Innate Ideas. University of California Press.

Stolcke, A. (1994). Bayesian Learning of Probabilistic Language Models. Ph.D. thesis, Dept. of Electrical Engineering and Computer Science, University of California at Berkeley.

Stolcke, A. (1995). An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2), 165–202.

Theeramunkong, T., & Okumura, M. (1997). Grammar acquisition based on clustering analysis and its application to statistical parsing. In Proceedings of the 5th Workshop on Very Large Corpora, pp. 31–40.

Theron, P., & Cloete, I. (1997). Automatic acquisition of two-level morphological rules. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 103–110 Washington.

Tjong Kim Sang, E. (1998). Machine learning of Phonotactics. Ph.D. thesis, University of Groningen.

Torkkola, K. (1993). An efficient way to learn English grapheme-to-phoneme rules automatically. In ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No. 92CH3252-4), Vol. 2, pp. 199–202 New York, NY, USA. IEEE.

Tyack, P. L. (1986). Population biology, social behavior and communication in whales and dolphins. Trends in Ecology and Evolution, 144–150.

Valiant, L. G. (1979). The complexity of computing the permanent. Theoretical Computer Science, 8(2), 189–201.

van den Bosch, A., & Daelemans, W. (1999). Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 285–292.

van Everbroeck, E. (1999). Could Sarah read the Wall Street Journal?. The Newsletter of the Center for Research in Language, 11(7), 3–14.

van Noord, G., & Gerdemann, D. (2001). Finite state transducers with predicates and identities. Grammars.

van Zaanen, M. (2000). ABL: Alignment-based learning. In COLING 2000 - Proceedings of the 18th International Conference on Computational Linguistics.

van Zaanen, M., & Adriaans, P. (2001). Comparing two unsupervised grammar induction systems: Alignment-based learning vs. Emile. Research report series 2001.05, School of Computing, University of Leeds.

Vapnik, V. N. (1998). Statistical Learning Theory. John Wiley.

Vargha-Khadem, F., & Passingham, R. E. (1990). Speech and language deficits. Nature, 346, 226.

Versteegh, K. (1997). The Arabic Language. Edinburgh University Press.

Villavicencio, A. (2000). The acquisition of word order by a computational learning system. In Proceedings of CoNLL-2000 and LLL-2000, pp. 209–218 Lisbon, Portugal.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. The Computer Journal, 11(2), 185–194.

Weaver, W. (1949). Translation. In Locke, & Booth (Locke & Booth, 1955), pp. 15–23.

Wehr, H. (1979). Arabic-English Dictionary (Fourth edition). Spoken Language Services Inc. Edited by J. Milton Cowan.

Wenner, A. M., & Wells, P. H. (1990). Anatomy of a Controversy: The Question of a “language” among Bees. Columbia University Press.

Wermter, S., Riloff, E., & Scheler, G. (Eds.). (1996). Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. Springer-Verlag.

Westermann, G., & Goebel, R. (1995). Connectionist rules of language. In Proceedings of the 17th annual conference of the Cognitive Science Society, pp. 236–241.

Wexler, K., & Culicover, P. W. (1980). Formal Principles of Language Acquisition. MIT Press.

Wolf, D. R., & Wolpert, D. H. (1992). Estimating functions from a finite set of samples, part 1: Bayes estimators and the Shannon entropy. Tech. rep. LA-UR-92-4369, Los Alamos National Laboratory.

Wolf, D. R., & Wolpert, D. H. (1993). Estimating functions from a finite set of samples, part 2: Bayes estimators for mutual information, chi-squared etc.. Tech. rep. LA-UR-93-833, Los Alamos National Laboratory.

Wolff, J. G. (1977). The discovery of segments in natural language. British Journal of Psychology, 68, 97–106.

Wolff, J. G. (1980). Language acquisition and the discovery of phrase structure. Language and Speech, 23(3), 255–269.

Wolff, J. G. (1988). Learning syntax and meanings through optimization and distributional analysis. In Schlesinger, I. M., Levy, Y., & Braine, M. D. S. (Eds.), Categories and Processes in Language Acquisition, pp. 85–98. Lawrence Erlbaum.

Wolff, J. G. (1991). Towards a theory of cognition and computing. Ellis Horwood.

Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.

Wright, W. (1967). A Grammar of the Arabic Language (Third edition). Cambridge University Press.

Wu, D. (1995). Stochastic inversion transduction grammars, with application to segmentation, bracketing, and alignment of parallel corpora. In IJCAI-95, pp. 1328–1335 Montreal.

Wu, D. (1997). Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3), 377–403.

Yang, C. D. (1999). A selectionist theory of language acquisition. In Proceedings of ACL 1999, pp. 429–430.

Yarowsky, D., & Wicentowski, R. (2000). Minimally supervised morphological analysis by multimodal alignment. In Proceedings of ACL 2000, pp. 207–216 Hong Kong.

Zavrel, J., & Daelemans, W. (1999). Recent advances in memory-based part-of-speech tagging. Tech. rep. 9903, ILK, Tilburg University.

Index

AI, 52
AML, 81
anchored HMMs, 83
APS, 7, 9, 151
Arabic, 107
BNC, 123
Borel-Cantelli lemma, 42
BPL, 49
Bregman map, 112
British National Corpus, 123
broken plural, 107
Cauchy product, 53
child directed speech, 13
CHILDES, 13
cylinder sets, 43
Data-Oriented Parsing, 29
diglossia, 107
DOP, 29
dual route, 121
EM, 30, 75
EM algorithm, 87, 110
empirical distribution, 60
ergodic, 46
formal language, 48
hidden variables, 31
HMM, 27, 33, 73, 75
identification in the limit, 42
IH, 7
iid, 43
IIL, 42
ILP, 27, 121
IO, 124
IOHMM, 96
Item and Arrangement, 79
Item and Process, 79
KLD, 60
Kullback-Leibler divergence, 134
Language Model, 40
Latent Semantic Analysis, 37
lexical categories, 56
likelihood, 30
log likelihood, 31
Machine Learning, 27
Maximum Likelihood, 27
MBL, 29, 81
MDL, 33, 59, 81
Memory-Based Learning, 29
MI, 123
ML, 30
MLP, 29, 120
MML, 33
monostratal, 23
mutual information, 123
negative evidence, 42
Neural Networks, 27
NFL, 51
Non-parametric models, 29
nonce words, 105
Pair Hidden Markov Models, 78
parametric model, 28
parts of speech, 56
PCFG, 29
pdf, 43, 132
PDP, 29
permanent, 111
PHMM, 78
PLD, 9–12, 151
prior probability, 33
Q, 111
SCFG, 27
semantic bootstrapping, 15
semi-ring, 52
single route, 121
Sinkhorn scaling, 112
SITG, 97
sound plural, 107
SRM, 33
Stochastic Inversion Transduction Grammars, 97
stochastic tree substitution grammar, 29
strictly stationary, 46
suppletion, 79
suprafinite, 42
SVD, 59, 81
Triple Hidden Markov Model, 119
unsupervised learning, 14
weighted finite state transducers, 96
Word and Paradigm, 79
WSJ, 13

Appendix A

CLAWS-5 tag set

AJ0 Adjective

AJC Comparative adjective

AJS Superlative adjective

AT0 Article

AV0 Adverb

AVP Adverb particle

AVQ wh-adverb

CJC Coordinating Conjunction

CJS subordinating conjunction

CJT that

CRD cardinal number

DPS possessive determiner

DT0 determiner e.g. this

DTQ wh-determiner

EX0 Expletive there

ITJ Interjection

NN0 noun without number marking

NN1 singular noun

NN2 plural noun

NP0 proper noun

ORD ordinal e.g. first

PNI indefinite pronoun

PNP personal pronoun

PNQ wh-pronoun

PNX reflexive pronoun

POS possessive particle ’s

PRF of

PRP preposition

TO0 to as infinitive particle

UNC unclassified

VBB VBD VBG VBI VBN VBZ forms of verb be

VDB VDD VDG VDI VDN VDZ forms of verb do

VHB VHD VHG VHI VHN VHZ forms of verb have

VM0 modal auxiliary

VVB VVD VVG VVI VVN VVZ forms of other verbs

XX0 negative particle
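As a purely illustrative aid, and not code from the thesis, the following minimal Python sketch shows one way the tag inventory above might be represented and coarsened when processing tagged text; the partial CLAWS5 dictionary and the coarse_class helper are names invented for this example, and the grouping of verb tags into a single class is an assumption made here.

# Illustrative sketch only (not from the thesis): a partial encoding of the
# CLAWS-5 tag set listed above, with an invented helper that collapses the
# six-way verb form distinctions into a single coarse class.

CLAWS5 = {
    "AJ0": "adjective",
    "AJC": "comparative adjective",
    "AJS": "superlative adjective",
    "AT0": "article",
    "AV0": "adverb",
    "NN1": "singular noun",
    "NN2": "plural noun",
    "NP0": "proper noun",
    "PRP": "preposition",
    "VM0": "modal auxiliary",
    # ... remaining tags as in the table above
}

VERB_PREFIXES = ("VB", "VD", "VH", "VV")  # be, do, have, other verbs

def coarse_class(tag: str) -> str:
    """Map a CLAWS-5 tag to a coarse category (grouping assumed for this example)."""
    if tag[:2] in VERB_PREFIXES:
        return "VERB"
    return CLAWS5.get(tag, "unclassified")

if __name__ == "__main__":
    for t in ["NN1", "VVD", "AJC", "ZZZ"]:
        print(t, "->", coarse_class(t))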

Appendix B

Syntactic rules

This is the list of all syntactic rules produced by the unsupervised induction algorithm presented in Chapter 7, together with the number of times each rule was applied in the corpus.

[The rule table itself could not be recovered from the extracted text: the category symbols of the induced rules were rendered as unreadable glyphs, and only the application counts survive, ranging from 316191 for the most frequent rule down to 1665 for the last rule listed.]
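Although the rule symbols are lost above, it may still be worth illustrating how a table of this kind can be used. The following short Python sketch, which is not code from the thesis, shows how a list of context-free rules with application counts can be normalised into a probabilistic context-free grammar by relative-frequency estimation; the rules named in it are invented placeholders, and only the count values echo figures that survive in the original table.

# Generic sketch (not from the thesis): turning (rule, count) pairs, of the
# kind this appendix originally tabulated, into a PCFG by relative-frequency
# estimation: P(A -> alpha) = count(A -> alpha) / sum over beta of count(A -> beta).
# The rules below are invented placeholders; only the count magnitudes echo
# figures from the original table.
from collections import defaultdict

rule_counts = {
    ("NP", ("AT0", "NN1")): 316191,
    ("NP", ("NP0",)): 196481,
    ("PP", ("PRP", "NP")): 156893,
    ("NP", ("NP", "PP")): 108564,
}

def estimate_pcfg(counts):
    """Relative-frequency estimates of rule probabilities, keyed by (lhs, rhs)."""
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), c in counts.items():
        lhs_totals[lhs] += c
    return {(lhs, rhs): c / lhs_totals[lhs] for (lhs, rhs), c in counts.items()}

if __name__ == "__main__":
    for (lhs, rhs), p in estimate_pcfg(rule_counts).items():
        print(f"{lhs} -> {' '.join(rhs)}    {p:.3f}")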