Functional Distributional Semantics: Learning Linguistically Informed Representations from a Precisely Annotated Corpus
Guy Edward Toh Emerson
Trinity College

This dissertation is submitted on 20 August 2018 for the degree of Doctor of Philosophy.

Abstract

The aim of distributional semantics is to design computational techniques that can automatically learn the meanings of words from a body of text. The twin challenges are: how do we represent meaning, and how do we learn these representations? The current state of the art is to represent meanings as vectors – but vectors do not correspond to any traditional notion of meaning. In particular, there is no way to talk about truth, a crucial concept in logic and formal semantics.

In this thesis, I develop a framework for distributional semantics which answers this challenge. The meaning of a word is not represented as a vector, but as a function, mapping entities (objects in the world) to probabilities of truth (the probability that the word is true of the entity). Such a function can be interpreted both in the machine learning sense of a classifier, and in the formal semantic sense of a truth-conditional function. This simultaneously allows both the use of machine learning techniques to exploit large datasets, and also the use of formal semantic techniques to manipulate the learnt representations.

I define a probabilistic graphical model, which incorporates a probabilistic generalisation of model theory (allowing a strong connection with formal semantics), and which generates semantic dependency graphs (allowing it to be trained on a corpus). This graphical model provides a natural way to model logical inference, semantic composition, and context-dependent meanings, where Bayesian inference plays a crucial role. I demonstrate the feasibility of this approach by training a model on WikiWoods, a parsed version of the English Wikipedia, and evaluating it on three tasks. The results indicate that the model can learn information not captured by vector space models.
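The abstract's central idea – a word's meaning as a classifier over entities – can be pictured concretely. The following is a minimal illustrative sketch, not the model defined in the thesis: it assumes entities are represented as real-valued feature vectors and uses a logistic link, and the class name, weights, and numbers are all hypothetical.

```python
import numpy as np


def sigmoid(z: float) -> float:
    """Logistic function, squashing a real-valued score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


class SemanticFunction:
    """Truth-conditional meaning of one word: maps an entity representation
    to the probability that the word is true of that entity.

    This doubles as a machine-learning classifier (probabilistic binary
    decision) and a formal-semantic truth-conditional function.
    """

    def __init__(self, weights: np.ndarray, bias: float):
        self.weights = weights
        self.bias = bias

    def prob_true(self, entity: np.ndarray) -> float:
        """P(word is true of entity), for an entity feature vector."""
        return sigmoid(self.weights @ entity + self.bias)


# Hypothetical usage in a 3-dimensional entity space.
red = SemanticFunction(weights=np.array([2.0, -1.0, 0.5]), bias=-0.5)
some_entity = np.array([1.2, 0.3, 0.8])
print(f"P('red' is true of this entity) = {red.prob_true(some_entity):.3f}")
```

Unlike a word vector, such a function supports talk of truth directly: composing or comparing words becomes an operation on probabilities of truth rather than on points in a vector space.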
Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It is not substantially the same as any that I have submitted, or am concurrently submitting, for a degree or diploma or other qualification at the University of Cambridge or any other University or similar institution. I further state that no substantial part of my dissertation has already been submitted, or is being concurrently submitted, for any such degree, diploma or other qualification at the University of Cambridge or any other University or similar institution. This dissertation does not exceed the prescribed limit of 60 000 words.

Guy Edward Toh Emerson
20 August 2018

Acknowledgements

Towards the end of my PhD studies, many people asked me if I was glad to be done. The truth is, regardless of the effort of writing up this thesis, I have really enjoyed my PhD! Above all, I have to thank my PhD supervisor, Ann Copestake. You have been supportive throughout my studies (indeed, beginning with my master's studies), urging me to try the simple things first, helping me to keep sight of the bigger picture, and lending me many books from your bookshelf. Without your guidance, this thesis would certainly be less coherent.

The Computer Lab has been a welcoming environment – in particular, I should thank Lise Gough, for making sure things run smoothly, the Schiff Foundation, for funding my PhD, and of course all the people in the NLIP research group. While it is tempting to list everyone here, I will spare the reader from that,[1] and single out a few people: fellow supervisees Matic, Ewa, Alex, and Paula, postdocs Laura, Tamara, and Marek, and fellow PhD students Kris and Amandla. I have had many stimulating discussions with each of you, which have really enriched my time here. In particular, I should thank Kris for the many times we've gone to the Food Park (and above all the Wandering Yak), and for the many times you've run into my office eager to tell me about some new problem.

Beyond my department, there are also many people across the university who have contributed to my time here. I should thank everyone involved in the burgeoning Cambridge Language Sciences initiative,[2] for trying to create bridges between departments, and in particular Jane Walsh, for keeping the initiative running.

Beyond academia, there are also many people in Cambridge who have made the past four years so enjoyable. I should thank the Cambridge University Dancesport Team[3] for being such a wonderful distraction. In particular, I should thank my life partner and dance partner Mary, for your love and support – and for all the time you've spent listening to me ramble, complain, and get excited about my work.

Finally, I should acknowledge my own privilege in being in a position to happily pursue a PhD. I hope the world continues to develop so that knowledge is open, and so that anyone who wants to pursue a PhD feels able to do so. For my own place in the world, I want to thank my parents. You have supported me, encouraged me, and pushed me to be the best I can be.

Now that my viva is over (for which I should thank my examiners Paula Buttery and Katrin Erk, for the challenging questions and compelling suggestions), and now that this thesis is submitted, I look forward to the next steps beyond my PhD.

[1] https://www.cl.cam.ac.uk/research/nl/people/
[2] https://www.languagesciences.cam.ac.uk/
[3] http://cudt.org/

Contents

1 Between Linguistics and Machine Learning
  1.1 Synopsis
    1.1.1 Core of the Thesis
    1.1.2 Outline of the Thesis
  1.2 Distributional Semantics
    1.2.1 Vector Space Models
    1.2.2 A Note on Terminology: Numerical Vectors and Algebraic Vectors
  1.3 Model-Theoretic Semantics
    1.3.1 Neo-Davidsonian Event Semantics
    1.3.2 Situation Semantics
    1.3.3 Dependency Minimal Recursion Semantics
2 Modelling Meaning in Distributional Semantics and Model-Theoretic Semantics
  2.1 Meaning and the World
    2.1.1 Grounding
    2.1.2 Concepts and Referents
  2.2 Lexical Meaning
    2.2.1 Vagueness
    2.2.2 Polysemy
    2.2.3 Hyponymy
  2.3 Sentence Meaning
    2.3.1 Compositionality
    2.3.2 Logic
    2.3.3 Context Dependence
  2.4 Learning Meaning
  2.5 Existing Frameworks
    2.5.1 Extensions of Vector Space Models
    2.5.2 Hybrid Approaches
    2.5.3 The Type-Driven Tensorial Framework
    2.5.4 Probabilistic Semantics
3 Formal Framework of Functional Distributional Semantics
  3.1 Summary of Classical Model Theory
  3.2 Individuals and Pixies
  3.3 Probabilistic Model Structures
  3.4 Semantic Functions
    3.4.1 Regions of Semantic Space
    3.4.2 Uncertainty
  3.5 A Probabilistic Graphical Model for Probabilistic Model Structures
  3.6 Functional Distributional Semantics
  3.7 Assessment against Top-Down Goals
    3.7.1 Language and the World
    3.7.2 Lexical Meaning
    3.7.3 Sentence Meaning
    3.7.4 Learning Meaning
    3.7.5 Comparison with Existing Frameworks
4 From Bayesian Inference to Logical Inference
  4.1 Context Dependence as Bayesian Inference
  4.2 Context Dependence in Functional Distributional Semantics
  4.3 Disambiguation
  4.4 Semantic Composition
  4.5 Logical Inference
    4.5.1 Proof of Equivalence
5 Implementation and Inference Algorithms
  5.1 Network Architecture
    5.1.1 Summary of Architecture
    5.1.2 Soft Constraints
  5.2 Gradient Descent
    5.2.1 Derivation of Gradient
  5.3 Markov Chain Monte Carlo
  5.4 Variational Inference
    5.4.1 Variational Inference for Context Dependence
    5.4.2 Variational Inference for Logical Inference
    5.4.3 Derivation of Update Rule
6 Experiments
  6.1 Training
    6.1.1 Training Data
    6.1.2 Training Algorithm
    6.1.3 Parameter Initialisation
  6.2 Experimental Results