Linear transformations of semantic spaces for word-sense discrimination and collocation compositionality grading

Alfredo Maldonado Guerra

Doctor of Philosophy
The University of Dublin, Trinity College
2015

Declaration

I declare that this thesis has not been submitted as an exercise for a degree at this or any other university and that it is entirely my own work. I agree to deposit this thesis in the University's open access institutional repository, or to allow the library to do so on my behalf, subject to Irish Copyright Legislation and Trinity College Library conditions of use and acknowledgement.

Alfredo Maldonado Guerra

Abstract

Latent Semantic Analysis (LSA) and Word Space are two semantic models derived from the vector space model of distributional semantics that have been used successfully in word-sense disambiguation and discrimination. LSA can represent both word types and word tokens in context by means of a single matrix factorised by Singular Value Decomposition (SVD). Word Space represents types via word vectors and tokens through two separate kinds of context vectors: direct vectors, which count first-order word co-occurrence, and indirect vectors, which capture second-order co-occurrence. Word Space objects are optionally reduced by SVD. Whilst the two models are regarded as related, little has been said about their specific relationship or about the benefits of one model over the other, especially with regard to their capability of representing word tokens. This thesis addresses that gap both theoretically and empirically.

Within the theoretical focus, the definitions of Word Space and LSA as presented in the literature are studied. A formalisation of the two semantic models is presented, and their theoretical properties and relationships are discussed. A fundamental insight from this theoretical analysis is that indirect (second-order) vectors can be computed from direct (first-order) vectors through a linear transformation involving a matrix of word vectors (a word matrix), an operation that can itself be seen as a method of dimensionality reduction alternative to SVD. Another finding is that, in their unreduced form, LSA vectors and Word Space direct (first-order) context vectors define approximately the same objects, and their difference can be exactly calculated. The SVD spaces produced by LSA and the Word Space word vectors are likewise similar, and their difference, which can also be precisely calculated, ultimately stems from the original difference between unreduced LSA vectors and Word Space direct vectors. It is further observed that the indirect "second-order" method of token representation from Word Space is also available to LSA, in a version of the representation that has remained largely unexplored. Given the analysis of the SVD spaces produced by both models, it is hypothesised that, when exploited in comparable ways, Word Space and LSA should perform similarly in actual word-sense disambiguation and discrimination experiments.
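To make the linear-transformation insight concrete (the thesis's own formalisation is given in Chapter 4), here is a minimal numpy sketch; the matrix sizes, values and variable names are illustrative only, not taken from the thesis:

```python
import numpy as np

# Toy word matrix W: one word vector per vocabulary word (illustrative values).
rng = np.random.default_rng(0)
V, D = 6, 4                        # toy vocabulary size and word-vector width
W = rng.random((V, D))

# Direct (first-order) context vector for one token: counts of the vocabulary
# words observed around that token in its context.
d = np.array([2.0, 0.0, 1.0, 0.0, 0.0, 1.0])

# Indirect (second-order) vector as usually described: the count-weighted sum
# of the word vectors of the token's context words ...
i_summed = np.zeros(D)
for j in range(V):
    i_summed += d[j] * W[j]

# ... which is exactly the linear map d -> W^T d applied to the direct vector.
i_linear = W.T @ d
assert np.allclose(i_summed, i_linear)
```

Note that the result has D components rather than V, which is the sense in which the word matrix itself can act as a dimensionality reduction operator alternative to SVD.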
In the empirical focus, performance comparisons between different configurations of LSA and Word Space are conducted in actual word-sense disambiguation and discrimination experiments. Some indirect configurations of LSA and Word Space are indeed found to perform similarly, but other indirect configurations, as well as the direct representations, perform more differently. So, whilst the two models define approximately the same spaces, their differences are large enough to impact performance. Word Space's simpler, unreduced direct (first-order) context vectors are found to offer the best overall trade-off between accuracy and computational expense. Another empirical exercise involves comparisons of the geometric properties of Word Space's two token vector representations, aimed at testing their similarity and at predicting their performance in means-based word-sense disambiguation and discrimination experiments. The two representations are found not to be geometrically similar, and sense vectors computed from direct vectors are more spread out than those computed from indirect vectors. Word-sense disambiguation and discrimination experiments performed on these vectors largely reflect the geometric comparisons: the more spread-out direct vectors perform better than indirect vectors in supervised disambiguation experiments, although in unsupervised discrimination experiments no clear winner emerges.

The role of the Word Space word matrix as a dimensionality reduction operator is also explored. Instead of simply truncating the word matrix, a method called word matrix consolidation is proposed, in which dimensions representing statistically associated word pairs are summed and merged. The method achieves modest but promising results comparable to SVD.

Finally, the word vectors from Word Space are tested empirically in a task designed to grade (measure) the compositionality (or degree of "literalness") of multi-word expressions (MWEs). Cosine similarity is measured between a word vector representing the full MWE and word vectors representing each of its individual member words, in order to quantify the deviation in co-occurrence distribution between the MWE and its individual members. This deviation in co-occurrence distributions is found to correlate with human compositionality judgements of MWEs.
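The compositionality measure just described can be sketched in a few lines. The vectors below are made up for illustration, and reporting per-member cosines separately is one plausible reading, not necessarily the thesis's exact aggregation:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word vectors: one for the MWE treated as a single unit,
# and one for each member word on its own (illustrative values only).
v_mwe     = np.array([0.1, 2.0, 0.3, 1.8])   # the expression as a whole
v_member1 = np.array([1.7, 0.2, 1.5, 0.1])   # first member word
v_member2 = np.array([1.9, 0.4, 1.2, 0.2])   # second member word

# A low cosine between the MWE vector and a member's vector indicates a
# large deviation in co-occurrence distribution, i.e. lower compositionality.
scores = [cosine(v_mwe, m) for m in (v_member1, v_member2)]
print(scores)
```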
Acknowledgements

The research presented in this thesis was supported by Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at Trinity College Dublin. Some calculations were performed on the Lonsdale cluster maintained by the Trinity Centre for High Performance Computing. This cluster was also funded through grants from Science Foundation Ireland.

I would like to express my sincere thanks to my supervisor Dr Martin Emms, not just for his hands-on involvement in this research and for his support throughout the course of my studies, but for teaching me a whole new way of deep analytical thinking, without which this thesis would not have taken shape. In fact, some of the most important contributions included herein, such as the linear transformation formulation in Chapter 4 and the difference between the R1 and R2 SVD projections in Chapter 3, are based on original ideas by him. I only followed up on his big good ideas with lots of little good ideas.

Credit and gratitude are also owed to the examiners of this thesis, Dr Saturnino Luz and Dr Anna Korhonen, whose feedback and advice strengthened this work significantly.

I would also like to thank Dr Carl Vogel for his continued help and support during my studies, as well as for initially having me admitted to the Ph.D. programme in Trinity. Similarly, I would like to thank Elda Quiroga from Tecnológico de Monterrey and Masaki Itagaki from Microsoft for their support in the preceding stages of my Ph.D. This thesis is in many ways the product of their guidance and support.

Many thanks also go to the other Ph.D. students and post-docs for the deep technical discussions and their spirit of camaraderie: Liliana, Héctor, Gerard, Martin, Erwan, Derek, Anne, Roman, Stephan, Francesca, Oscar, Baoli, Nikiforos and Ielka, as well as to the "new generation": Akira, Grace, Kevin, Carmen, Arun and Shane. I would also like to thank the DU Archaeological Society for providing me with a space on campus for intellectual discussions that involved not computers but dusty old bones, and in particular I wish to thank Mary, Ciarán, Jenny, Deirdre, Pablo, Karl, Aoife, Sean, Alice, Michael, Alex, Victoria and John Tighe for their friendship. Thank you guys, I had a blast!

Very many thanks to Wynzen de Vries, for his patience, encouragement and support during my studies and for his understanding when the writing of this thesis soaked up most of my time.

Finally, I would like to thank my parents, Beatriz Guerra Treviño and Alfredo Maldonado Osorno, for all their care, education and support during the first 20-something years of my life.

Contents

Declaration
Abstract
Acknowledgements
Typographical conventions

1 Introduction
  1.1 Motivation
  1.2 Operationalising context computationally
  1.3 Research questions and thesis structure

2 Linguistic Background
  2.1 What is a word?
    2.1.1 Word tokens and word types
    2.1.2 Multi-word expressions and collocations
    2.1.3 Ngrams
  2.2 What is a word sense?
    2.2.1 Structuralist lexical semantics
    2.2.2 Word senses and the role of context
    2.2.3 Characterising context
    2.2.4 The distributional hypothesis of lexical semantics
    2.2.5 Meaning beyond context

3 Computational Background
  3.1 Natural language processing tasks
    3.1.1 WSX: Word-sense disambiguation, discrimination and induction
      3.1.1.1 Word-sense disambiguation
      3.1.1.2 Word-sense discrimination
    3.1.2 Measuring the compositionality of multi-word expressions
  3.2 The vector space model of information retrieval
  3.3 The VSM as a distributional lexical semantics model
  3.4 Latent Semantic Analysis
    3.4.1 SVD: the mathematical foundation of LSA