A Relation-Centric View of Semantic Representation Learning

Sujay Kumar Jauhar

CMU-LTI-17-006

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Eduard Hovy, Chair
Chris Dyer
Lori Levin
Peter Turney

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies

Copyright © 2017 Sujay Kumar Jauhar

Keywords: Semantics, Representation Learning, Relations, Structure

To the women in my life. My mother and my wife.

Abstract

Much of NLP can be described as the mapping of a message from one sequence of symbols to another. Examples include word surface forms, POS-tags, parse trees, vocabularies of different languages, etc. Machine learning has been applied successfully to many NLP problems by adopting this symbol mapping view. No knowledge of the sets of symbols is required: only a rich feature representation to map between them. The task of coming up with expressive features is typically a knowledge-intensive and time-consuming endeavor of human creativity. However, the representation learning paradigm (Bengio et al., 2013) offers an alternative solution, where an optimal set of features is automatically learned from data for a target application. It has been successfully used in many applied fields of machine learning including speech recognition (Dahl et al., 2010), vision and object recognition (Krizhevsky et al., 2012), and NLP (Socher et al., 2010; Schwenk et al., 2012).

One particular area of NLP that has received a lot of interest, from the perspective of representation learning, is lexical semantics. Many efforts have focussed on attempting to learn high-dimensional numerical vector representations for units of meaning (most often words) in a vocabulary. While the current slew of research uses neural network techniques (Collobert and Weston, 2008; Mikolov et al., 2013a), word embeddings have been around much longer in the form of distributional vector space models (Turney and Pantel, 2010). We use the term semantic representation learning to encompass all these views and techniques. Semantic representations learned automatically from data have proven to be useful for many downstream applications such as question answering (Weston et al., 2015), word-sense discrimination and disambiguation (McCarthy et al., 2004; Schütze, 1998), and selectional preference modeling (Erk, 2007). The successes of representation learning, along with its unsupervised and data-driven nature, make it a compelling modelling solution for semantics.

However, unlike manual feature engineering, it remains difficult to inject linguistic or world knowledge into the process of learning semantic representations automatically. This has led many models to still fail to account for many basic properties of human languages, such as antonymy, polysemy, semantic composition, and negation, among others. We attempt to outline solutions to some of these shortcomings in this thesis.

When considering semantic representation learning models, two questions may be asked: firstly, how is the model learned; and secondly, what does the model represent? The research community has largely focussed on the first of the two questions – the how – proposing many different machine learning solutions that yield semantic vector representations. In this thesis, we instead focus on the second of the two questions – the what – namely gaining insight into the underlying semantic information that is actually captured by models.

To explain this information we introduce the two connected notions of semantic relations and contextual relations. Semantic relations are properties of language that we generally wish to capture with semantic models, such as similarity and relatedness. However, given some data, unsupervised representation learning techniques have no way of directly optimizing and learning these semantic relations. At the same time, a crucial component of every representation learning model is the notion of a context.
This is the basis upon which models extract count statistics to process and produce representations. It is, effectively, a model’s view of the world and one of the few ways of biasing the unsupervised learner’s perception of semantics with data. A contextual relation is the operationalization that yields the contexts upon which a semantic model can be trained.

Therefore, to understand what information a model represents one must understand and articulate the key contextual relation that defines its learning process. In other words, being able to specify what is being counted leads to insights into what is being represented. Conversely, if we wish to learn a model that captures a specific semantic relation it is important to define a contextual relation that captures the intuition of the semantic relation through contextual views of data. The connection between a semantic relation and its corresponding contextual relation may not always be easy or evident, but in this thesis we present some examples where we can make such a connection and also provide a framework for thinking about future solutions.

To clarify the idea of semantic and contextual relations let us consider the example of learning a model of semantic similarity. In this case the semantic relation that we wish to learn is similarity, but how do we bias the learner to capture a specific relation in an unsupervised way from unannotated data? We need to start with an intuition about semantic similarity, and make it concrete with a contextual relation. Harris (1954) and Firth (1957) intuited that words that appear in similar contexts tend to have similar meaning; this is known as the distributional hypothesis. This hypothesis is often operationalized to yield contexts over word window neighborhoods, and many existing word vector learning techniques (Collobert and Weston, 2008; Mnih and Teh, 2012; Mikolov et al., 2013a) use just such contexts. The word-neighborhood-word contextual relation thus produces models that capture similarity.

From this point of view the distributional hypothesis is first and foremost a contextual relation, only one among many possible contextual relations. It becomes clear why models that use this relation to extract counts and process information are incapable of capturing all of semantics. The contextual relation based on the distributional hypothesis is simply not the right or most expressive one for many linguistic phenomena or semantic problems – for example composition, negation, antonymy, etc. Turney and Pantel (2010) describe other contextual relations that define and yield contexts such as entire documents or patterns. Models that use these contextual relations capture different information (i.e. semantic relations) because they define views of the data via contexts differently.

This thesis hypothesizes that relations and structure play a primary role in creating meaning in language, and that each linguistic phenomenon has its own set of relations and characteristic structure. We further hypothesize that by understanding and articulating the principal contextual relations relevant to a particular problem, one can leverage them to build more expressive and empirically superior models of semantic representations.

We begin by introducing the problem of semantic representation learning, and by defining semantic and contextual relations. We then outline a framework within which to think about and implement solutions that bias learners of semantic representations towards capturing specific relational information. The framework is example-based, in the sense that we draw from concrete instances in this thesis to illustrate five sequential themes that need to be considered. The rest of the thesis provides concrete instantiations of these sequential themes. The problems we tackle in the four subsequent chapters (Chapters 3–6) are broad, both from the perspective of the target learning problem as well as the nature of the relations and structure that we use. We thus show that by focussing on relations, we can learn models of word meaning that are semantically and linguistically more expressive while also being empirically superior.

In the first of these chapters, we outline a general methodology to learn sense-specific representations for polysemous words by using the relational links in an ontological graph to tease apart the different senses of words. In the next chapter, we show how the relational structure of tables can be used to produce high-quality annotated data for question answering, as well as to build models that leverage this data and structure to perform question answering. The third of the four chapters deals with the problem of inducing an embedded semantic lexicon using minimal annotation. In this chapter we jointly leverage local token-level and global type-level relations between predicate-argument pairs to yield representations for words in the vocabulary, as well as a codebook for a set of latent relational argument slots. Finally, in the last content chapter we show how tensor-based structured distributional semantic models can be learned for tackling the problems of contextual variability and facetted comparison; we give examples of both syntactic and latent semantic relational integration into the model.

This thesis provides a step towards better understanding computational semantic modelling with representation learning. It presents a way of connecting what knowledge or semantic information a model captures with its contextual view of the data. Moreover, it provides a framework and examples of how this connection can be made to learn better and more expressive models of semantics, which can potentially benefit many downstream NLP applications that rely on features of semantic units.

Acknowledgments

I have been immensely fortunate to have had the opportunity of working with some of the most brilliant, generous, humble and down-to-earth people during my time as a graduate student at CMU. I have learned so much, and from so many people, that I am afraid I will undoubtedly forget to mention a few in this note of acknowledgement. If you are one of these people, forgive my limited memory and know that I am grateful for your impact on my life as a PhD student.

First and foremost, I would like to thank my advisor Eduard Hovy. This thesis would literally not have been possible without his constant guidance and support. He has given me vast latitude to grow and develop as a researcher over the last five years, while his deep intuition and understanding of problems that were truly worth tackling have helped me steer back whenever I drifted off course. His has been a voice of sanity and encouragement throughout, and I am very grateful to have had such a fantastic advisor.

I would also like to give my sincere thanks to the other members of my thesis committee. To Chris Dyer, who is a fount of creativity and brilliance, and whose excitement for research is infectious. To Lori Levin, who taught me to appreciate the richness, nuance, complexity and diversity of language, and that NLP is all the more exciting and interesting for it. And to Peter Turney, who remains a role model scientist for me: someone who is deeply knowledgeable in many different areas, and who is thoughtful, meticulous and impactful in all the work that he does. This thesis has only been made better by my committee’s insight and comments. Any errors therein are solely a reflection of my own foibles.

My friends and collaborators through grad school have probably had the biggest impact on my ability to actually do research. They have taught me the tools of the trade and given me the wherewithal to deal with the daily grind of being a researcher. I have had the honor of collaborating with some very brilliant people: Dirk Hovy, Kartik Goyal, Huiying Li, Mrinmaya Sachan, Shashank Srivastava, Yun-Nung Chen, Florian Metze, Manaal Faruqui, Jesse Dodge, Noah Smith, Raul Guerra, Edgar Gonzàlez, Marta Recasens and of course members of my committee Ed, Chris and Peter.

I’d also like to give a shout-out to folks with whom I had the pleasure of sharing an office. Kartik, Mrinmaya, Shashank, Huiying, Whitney Sanders and I were crammed into GHC 5510 (aka the storeroom) like sardines. Those were early days for me at CMU as a Masters student and you guys made working difficult, at times. But I wouldn’t swap it for the world. As a PhD student I often worked from home, but when I did come in to the office I had the pleasure of sharing space with Subhodeep Moitra, Keerthiram Murugesan and Zhilin Yang. I hope you guys enjoyed the freedom of an empty desk my absence often afforded you.

I have had the fortune of being taught by some stellar faculty at CMU. Thank you to Professors Eduard Hovy, Roni Rosenfeld, Noah Smith, Florian Metze, Chris Dyer and Lori Levin for making classes so much fun. The LTI would not function if it weren’t for the often thankless contributions of the wonderful staff. A special thank you to Stacey Young for being such a great help to the student body in so many different capacities.

A major formative influence on my years as a PhD student were the internships I did at Google and AI2.
I would like to thank my mentors, Marta Recasens, Edgar Gonzàlez and Peter Turney, who made the two summers I spent away from CMU seem like a well-deserved break – even though I was still doing research. I’d also like to thank Oren Etzioni, whose blessing permitted AI2 to fund part of the research in Chapter 4 of this thesis. I had the opportunity to learn from, collaborate, interact and generally hang out with some other seriously cool people at Google and AI2, both fellow interns and full-timers. Raul Guerra, Dani Yogatama, Daniel Khashabi, Ashish Sabharwal, Chen-Tse Tsai, Rachel Rudinger, Kavya Srinet, Chandra Bhagavatula, Jayant Krishnamurthy and Peter Clark – thank you!

I am deeply grateful to the people who made me feel a part of their family when I was half a world away from home. Arpita and Manish, you took me into your home when I arrived in the US with nothing more than a couple of suitcases, and a head full of dreams and uncertainty. You treated me like one of your own during the first few difficult months of my life in Pittsburgh. You will always be family and I do not know how to thank you enough. Kalpana aunty, Mohana aunty, Ramesh uncle and Sruthi: your presence in California made my time at Google all the more memorable and special. Thank you for your warmth, hospitality and thoughtfulness. And of course Sanjeev chacha and Nishtha bhua in Seattle: I did not know that we were related until two years ago. But now that I do, I am very grateful for knowing such wonderful relatives in the US.

My friends in Pittsburgh have been my support system, without whom I would have almost certainly gone insane, or worse – quit my PhD. Whether it was pot luck, board games, so-bad-they’re-actually-good movies, pizza night or just hanging out, I almost always had something to look forward to on Saturday nights thanks to you guys. KD, Marimaya, Avimama, Shashank dood, Snigdha dood – you were my oldest and continue to be some of the best friends I have made in Pittsburgh. Masterji, Akanksha, Sudhi, Pradeep, Modhurima, Keerthi – it has been great getting to know and hang out with all of you. Pallavi, Subhodeep, Prasanna, Manaal, Swabha, Jesse, Volkan, Nico, Lin, Kavya, Sai – thank you all for your friendship. If I’ve forgotten anybody, I sincerely apologize.

Last, but certainly not the least, I would like to thank my large, loud, Indian family. They have put up with my moodiness, my unique brand of crazy and the many months spent away from home with unconditional love and support. A special thanks to my wife, whom I finally married after knowing for ten years – six of which were spent doing long-distance across continents. Her steadfast love and constancy have helped me through troubled times more than I can count and I am proud to say that she is my best friend with benefits and my partner in crime. Finally, I owe my identity, my values, half my genes and literally my life to my mother. She raised me almost single-handedly to be the person I am today and nothing I do in this life will repay the debt of love I owe her.

Contents

1 Introduction
  1.1 The Two Basic Questions of Representation Learning
  1.2 Semantic and Contextual Relations
  1.3 Contributions of this Thesis

2 Relational Representation Learning

3 Ontologies and Polysemy
  3.1 Related Work
  3.2 Unified Symbolic and Distributional Semantics
    3.2.1 Retrofitting Vectors to an Ontology
    3.2.2 Adapting Predictive Models with Latent Variables and Structured Regularizers
  3.3 Resources, Data and Training
  3.4 Evaluation
    3.4.1 Experimental Results
    3.4.2 Generalization to Another Language
    3.4.3 Antonym Selection
  3.5 Discussion
    3.5.1 Qualitative Analysis
  3.6 Conclusion
    3.6.1 Five Point Thematic Summary

4 Tables for Question Answering
  4.1 Related Work
  4.2 Tables as Semi-structured Knowledge Representation
    4.2.1 Table Semantics and Relations
    4.2.2 Table Data
  4.3 Answering Multiple-choice Questions using Tables
  4.4 Crowd-sourcing Multiple-choice Questions and Table Alignments
    4.4.1 MCQ Annotation Task
  4.5 Feature Rich Table Embedding Solver
    4.5.1 Model and Training Objective
    4.5.2 Features

    4.5.3 Experimental Results
  4.6 TabNN: QA over Tables with Neural Networks
    4.6.1 The Model
    4.6.2 Training
    4.6.3 Experiments and Evaluation
  4.7 Conclusion
    4.7.1 Five Point Thematic Summary

5 Embedded Frame Lexicons
  5.1 Related Work
  5.2 Joint Local and Global Frame Lexicon Induction
    5.2.1 Local Predicate-Argument Likelihood
    5.2.2 Global Latent Slot Regularization
    5.2.3 Optimization
    5.2.4 Relational Variant
  5.3 Experiments and Evaluation
    5.3.1 Implementational Details
    5.3.2 Pseudo Disambiguation of Selection Preference
    5.3.3 Frame Lexicon Overlap
    5.3.4 Qualitative Analysis of Latent Slots
  5.4 Conclusion and Future Work
    5.4.1 Five Point Thematic Summary

6 Structured Distributional Semantics
  6.1 Distributional Versus Structured Distributional Semantics
  6.2 Syntactic Structured Distributional Semantics
    6.2.1 Related Work
    6.2.2 The Model
    6.2.3 Mimicking Compositionality
    6.2.4 Single Word Evaluation
    6.2.5 Event Coreference Judgment
  6.3 Latent Structured Distributional Semantics
    6.3.1 Related Work
    6.3.2 Latent Semantic Relation Induction
    6.3.3 Evaluation
    6.3.4 Semantic Relation Classification and Analysis of the Latent Structure of Dimensions
  6.4 Conclusion
    6.4.1 Five Point Thematic Summary

7 Conclusion
  7.1 Summary of Contributions
  7.2 Future Directions

Bibliography

List of Figures

1.1 A summary of our approach to relation-centric representation learning and how it differs from models that simply use the distributional hypothesis as a base for operationalizing contexts.
1.2 Examples illustrating various sources of information exhibiting the dual nature of structure and relations. Note how Definitions 2 and 3, as well as Corollary 1, apply to each of these examples. All these sources of relations and structure are used in this thesis.

3.1 A factor graph depicting the retrofitting model in the neighborhood of the word “bank”. Observed variables corresponding to word types are shaded in grey, while latent variables for word senses are in white.
3.2 The generative process associated with the skip-gram model, modified to account for latent senses. Here, the context of the ambiguous word “bank” is generated from the selection of a specific latent sense.
3.3 Performance of sense specific vectors (wnsense) compared with single sense vectors (original) on similarity scoring in French. Higher bars are better. wnsense outperforms original across 2 different VSMs on this task.
3.4 Performance of sense specific vectors (wnsense) compared with single sense vectors (original) on closest-to-opposite selection of target verbs. Higher bars are better. wnsense outperforms original across 3 different VSMs on this task.

4.1 Example table from MTurk annotation task illustrating constraints. We ask Turkers to construct questions from blue cells, such that the red cell is the correct answer, and distracters must be selected from yellow cells.
4.2 The component of the TabNN model that embeds question and row in a joint space and predicts an alignment score between the two. Embedding layers are omitted for visual compactness.
4.3 The component of the TabNN model that embeds choices and column in a joint space and predicts an alignment score between the two. Embedding layers are omitted for visual compactness.
4.4 The component of the TabNN model that embeds a single choice and table cell in a joint space and predicts an alignment score between the two. Embedding layers are omitted for visual compactness.

5.1 The generative story depicting the realization of an argument from a predicate. Argument words are generated from latent argument slots. Observed variables are shaded in grey, while latent variables are in white.
5.2 Illustration of a potential binary feature matrix (1s colored, 0s blank) learned by IBP. Rows represent verbs from a fixed vocabulary, while columns are of a fixed but arbitrary number and represent latent arguments.

6.1 A sample parse fragment and some examples of triples that we extract from it. Triples are generated from each dependency arc by taking the cross-product of the annotations that the nodes connected to the arc contain.
6.2 A depiction of the composition of two words. After composition the joint unit is treated as a single node, and the expectations of that combined node are then computed from the PropStore.
6.3 Visual representation of the intuition behind the Latent Relational SDSM. The idea revolves around inducing a relation space from which a row vector is transformed to a column vector in SDSM space.
6.4 Evaluation results on WS-353 and ESL with varying number of latent dimensions. Generally high scores are obtained in the range of 100-150 latent dimensions, with optimal results on both datasets at 130 latent dimensions.
6.5 Correlation distances between semantic relations’ classifier weights. The plot shows how our latent relations seem to perceive humanly interpretable semantic relations. Most points are fairly well spaced out, with opposites such as “Attribute” and “Non-Attribute” as well as “Similar” and “Contrast” being relatively further apart.

List of Tables

3.1 Similarity scoring and synonym selection in English across several datasets involving different VSMs. Higher scores are better; best scores within each category are in bold. In most cases our models consistently and significantly outperform the other VSMs.
3.2 Contextual word similarity in English. Higher scores are better.
3.3 Training time associated with different methods of generating sense-specific VSMs.
3.4 The top 3 most similar words for two polysemous types. Single sense VSMs capture the most frequent sense. Our techniques effectively separate out the different senses of words, and are grounded in WordNet.

4.1 Example part of table concerning phase state changes.
4.2 Comparison of different ways of generating MCQs with MTurk.
4.3 Examples of MCQs generated by MTurk. Correct answer choices are in bold.
4.4 Summary of features. For a question (q) and answer (a) we compute scores for elements of tables: whole tables (t), rows (r), header rows (h), columns (c), column headers (ch) and cells (s). Answer-focus (af) and question-focus (qf) terms added where appropriate. Features marked with ♦ denote soft-matching variants, while certain other marked features are described in further detail in Section 4.5.2. Finally, features marked with † denote those that received high weights during training with all features, and were subsequently selected to form a compact FRETS model.
4.5 Evaluation results on three benchmark datasets using different sets of tables as knowledge bases. Best results on a dataset are highlighted in bold.
4.6 Ablation study on FRETS, removing groups of features based on level of granularity. ♦ refers to the soft matching features from Table 4.4. Best results on a dataset are highlighted in bold.
4.7 A summary of the different TabNN model variants we train along with the number of parameters learned with each model.
4.8 Evaluation of the TabNN model. Training results are provided to gain insight into the models and should not be taken to represent goodness of the models in any way. The best FRETS model’s test results are given as a comparison.
4.9 Evaluation of the TabNN model, trained on only 90% of the TabMCQ dataset. Test results on the 10% held-out portion of TabMCQ show that the model itself is capable of learning, and that its architecture is sound. There is, however, a severe domain mismatch between TabMCQ and the ESSQ test set.

5.1 Results on pseudo disambiguation of selectional preference. Numbers are in % accuracy of distinguishing true arguments from false ones. Our models all outperform the skip-gram baseline.
5.2 Results on the lexicon overlap task. Our models outperform the syntactic baseline on all the metrics.
5.3 Examples for several predicates with mappings of latent slots to the majority class of the closest argument vector in the shared embedded space.

6.1 Single word evaluations on similarity scoring and synonym selection. SDSM does not clearly outperform DSM, likely due to issues of sparsity.
6.2 Effects of SVD methods on SDSM representations. With higher order tensor SVD, SDSM outperforms the DSM baseline.
6.3 Cross-validation performance of Event Coreference Judgement on IC and ECB datasets. SDSM outperforms the other models on F-1 and accuracy on both datasets.
6.4 Results on the WS-353 similarity scoring task and the ESL synonym selection task. LRA-SDSM significantly outperforms other structured and non-structured distributional semantic models.
6.5 Results on the Relation Classification Task. LR-SDSM scores competitively, outperforming all but the SENNA-AVC model.

Chapter 1

Introduction

Most of NLP can be viewed as a mapping from one sequence of symbols to another. For example, translation is mapping between sequences of symbols in two different languages. Similarly, part-of-speech tagging is mapping between words’ surface forms and their hypothesized underlying part-of-speech tags. Even complex structured prediction tasks such as parsing, or simple labelling tasks like text categorization, are mappings: only between sequences of symbols of different granularities. These sequences may or may not interact with one another in complex ways.

From a computational perspective, this is a powerful and appealing framework. There is no need to know anything about linguistics or language: only that it can be represented as sequences of symbols. Using this symbol mapping perspective, machine learning has been applied with great success to practically every NLP problem. Driven by better learning techniques, more capable machines and vastly larger amounts of data, usable natural language applications are finally in the hands of millions of users. People can now translate text between many languages via services like Google Translate1, or interact with virtual personal assistants such as Siri2 from Apple, Cortana3 from Microsoft or Alexa4 from Amazon.

However, while these applications work well for certain common use cases, such as translating between linguistically and structurally close language pairs or asking about the weather, they continue to be embarrassingly poor for processing and understanding the vast majority of human language. For example, asking Siri “I think I have alcohol poisoning. What do I do?” led one user to receive the darkly humorous failed response “I found 7 liquor stores fairly close to you.” with a list of map locations to the liquor stores. Many instances of NLP systems failing occur because language is complex, nuanced and very often highly ambiguous (although the example above is not). And despite recent efforts, such as Google’s Knowledge Graph project5, to build repositories of world and linguistic knowledge, models for semantics continue to be incomplete, rudimentary, and extremely knowledge-intensive to build.

1. https://translate.google.com/
2. https://en.wikipedia.org/wiki/Siri
3. https://en.wikipedia.org/wiki/Cortana_(software)
4. https://en.wikipedia.org/wiki/Amazon_Alexa
5. https://en.wikipedia.org/wiki/Knowledge_Graph

From the symbol transformation perspective, a typical NLP pipeline involves taking an input A in symbol vocabulary $\mathcal{A}$, featurizing this input by some function F to express the important dimensions of distinction, learning or applying a model M to the featurized input, and finally performing a transformation to a sequence B in symbol vocabulary $\mathcal{B}$. To concretize this notional pipeline, consider the example of part-of-speech tagging:
• Sequence A is typically text in a source language – say English – which contains a vocabulary of English tokens $\mathcal{A}$.
• The function F may consist of features like lexical-to-tag mapping probabilities that are estimated from some training data, or induced in an unsupervised fashion. In machine learning terms this process is known as featurization.
• The model M concerns the definition and learning of a mathematical function that is able to map featurized input to output. In the case of part-of-speech tagging this could be, for example, a Hidden Markov Model (HMM) (Rabiner and Juang, 1986) or a Conditional Random Field (CRF) (Lafferty et al., 2001).
• Finally, the transformation via model M to output is known as decoding or inference. The output symbols obtained in the case of part-of-speech tagging would be a sequence of tags B from a vocabulary consisting of some tagset $\mathcal{B}$, such as the Penn Treebank tagset6.

A similar sort of abstract pipeline can be envisaged for almost any NLP problem. A symbol mapping pipeline of this kind systematizes and separates the different steps involved in the transformation. Thus, the machine learning model M does not really need to know anything beyond the denotation of the sets of symbols $\mathcal{A}$ and $\mathcal{B}$. Rather, a much more important requirement is an expressive featurization F that characterizes the two symbol vocabularies, the example sequences A and B, and their relationships to one another. These features provide the basis upon which a symbol mapping model can be learned to generalize from known to unseen examples. This thesis is centrally concerned with the featurization component of the pipeline, namely F.

Until recently, the work of thinking up and engineering expressive features for machine learning models was still, for the most part, the responsibility of the researcher or engineer. It was up to the researcher or engineer to think up clever ways of featurizing the input, and in the case of NLP often applying her insight into linguistics and knowledge about the world to do this effectively. However, the representation learning paradigm (Bengio et al., 2013) was proposed as an alternative that eliminated the need for manual feature engineering of data. In general, representation learning allows a system to automatically learn a set of optimal features given a target application and training data. In the context of the pipeline described above, the featurization step F is no longer a separate stage in the pipeline but rather a sub-component of the model M. In the running example of part-of-speech tagging above, the model M might now be a sequential neural network such as a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) that only requires examples of aligned inputs A and B.

6. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

The features or representations of the input A are no longer required to be coded manually via F. Rather, F might now be an embedding matrix that maps words in the input vocabulary to some high dimensional vector representation. This embedding matrix is induced automatically through the process of training the network M.

An important point to note is that, while representation learning integrates features with the model, it does not mean that featurization as a research practice is no longer required. Rather, the process of featurization is abstracted away from the raw data in two important ways. Specifically, manual engineering of features on data is replaced with engineering of model architecture and learning, and engineering of the data itself.

The first abstraction deals with feature engineering of model architecture. For example, in the context of neural networks – which integrate representation learning models – feature engineering goes into defining network structure for particular tasks and problems. Recurrent Neural Networks (RNNs) are featurized to learn and represent sequences of data, while Long Short-Term Memory networks (LSTMs) are featurized to capture long-range dependencies in sequences. Similarly, Convolutional Neural Networks (CNNs) are featurized to represent local neighborhood structure. In this way, featurization is abstracted away from functions of the data, and the responsibility of the researcher or engineer is to define features over model components and architecture.

In conjunction with this first abstraction, featurization has also moved away from manual engineering of features on data, to featurization of the data itself. In terms of representation learning, the model no longer requires features on the data as input, but rather views of the data to bias its learning of features. For example, an RNN that is fed with words and another that is fed with characters will not only learn different units of representation but will have a different model of language altogether. Thus featurization is also abstracted over data, where the responsibility of the researcher or the engineer is to define views of the data that the representation learning model can consume. In this thesis, we will be primarily dealing with the second kind of abstraction in the practice of featurization. Namely, we will be thinking about ways to present novel, structured views of data in order to bias existing representation learning techniques to learn representations.

In any case, representation learning has been successfully used in several applied research areas of machine learning including speech recognition (Dahl et al., 2010; Deng et al., 2010), vision and object recognition (Ciresan et al., 2012; Krizhevsky et al., 2012), and NLP (Socher et al., 2010; Schwenk et al., 2012; Glorot et al., 2011). In particular for NLP, Collobert et al. (2011) showed how a single architecture could be applied in an unsupervised fashion to many NLP problems, while achieving state-of-the-art results (at the time) for many of them. There have since been a number of efforts to extend and improve the use of representation learning for NLP. One particular area that has received a lot of recent interest in the research community is semantics. In semantics, representation learning is commonly referred to in the literature as the so-called word-embedding problem.
Typically, the aim is to learn features (or embeddings) for words in a vocabulary – which are often the units of lexical semantics – automatically from a corpus of text. However, in this thesis we adopt a broader view of lexical semantics where the lexicon can consist of other semantic units for which we may wish to learn representations. Therefore, where appropriate we will talk about word embeddings, but otherwise deal with the general problem of representation learning for semantics.

There are two principal reasons that representation learning for semantics is an attractive proposition: it is unsupervised (thus requiring no human effort to build), and data-driven (thus letting the data effectively bias the learning process). However, in contrast to manual feature engineering it is significantly more difficult to introduce coded linguistic or world knowledge. This is where featurization over views of the data comes into play. We hypothesize that, for certain kinds of problems, a richer or different view of data can be used to bias learners of representations to capture deeper linguistic or structural knowledge. In what follows we shall motivate and outline a possible solution.
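To summarize the two flavors of featurization F discussed above in code, here is a minimal sketch in Python. It is our own illustration under assumed toy values (the vocabulary, the feature names and the embedding dimension are all made up), not anything taken from this thesis: a hand-engineered feature function on the one hand, and an embedding matrix that stands in for the learned featurization on the other.

```python
import numpy as np

# Toy vocabulary and embedding size, chosen purely for illustration.
VOCAB = {"the": 0, "cat": 1, "sat": 2}
EMBED_DIM = 4

def manual_featurize(token):
    """Hand-engineered featurization F: explicit, human-designed indicator features."""
    return {
        "lower=" + token.lower(): 1.0,
        "suffix2=" + token[-2:]: 1.0,
        "is_capitalized": float(token[0].isupper()),
    }

# Learned featurization F: an embedding matrix with one row per vocabulary entry.
# Here it is randomly initialized; in a representation learning model these rows
# are parameters induced automatically by training the model M on aligned (A, B) examples.
E = np.random.randn(len(VOCAB), EMBED_DIM)

def embed(token):
    """Look up the dense vector representation of a token."""
    return E[VOCAB[token]]

print(manual_featurize("Cat"))  # sparse, human-designed features
print(embed("cat"))             # dense vector whose values come from training
```

The only design point the sketch is meant to convey is where the human effort goes: in the first function it goes into the features themselves, while in the second it goes into the architecture and into the view of the data fed to the learner.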

1.1 The Two Basic Questions of Representation Learning

To understand the variety of different approaches to representation learning for semantics, one can examine the following two questions:
1. How is a model learned?
2. What does the model represent?
Each of these questions is closely linked to one of the two abstractions of featurization that we previously mentioned: one with featurization of model components and architecture, and the other with featurization of views of data.

The first question attempts to classify the general class of methods or models used to learn the representations in question. The currently dominant approach is the neural embedding class of models, which use neural networks to maximize some objective function over the data and learn word vectors in the process. The LSTM, mentioned in the context of the part-of-speech tagging example above, is one such model. Some notable papers that learn semantic representations on the basis of neural networks are Collobert et al. (2011), Mikolov et al. (2010), Mikolov et al. (2013a) and Pennington et al. (2014). Earlier work on count statistics over windows and transformations thereof are other ways to automatically learn representations of some semantic units from data. These include heuristic methods such as Term Frequency Inverse Document Frequency (TF-IDF) and Point-wise Mutual Information (PMI) weighting, and mathematically motivated matrix factorization techniques such as Latent Semantic Analysis (Deerwester et al., 1990). Turney and Pantel (2010) provide an overview of these diverse approaches. More recently, Levy and Goldberg (2014) show an inherent mathematical link between older matrix factorization work and current neural embedding techniques.

Addressing the second – and, for this thesis, the more interesting – question specifies the underlying view of the data that is given as input to the model. In other words, how is the data presented to the model so that raw statistics can be converted into meaningful information and used to bias semantic representations? Moreover, what semantic properties are captured and maintained by this transformation?

Every semantic representation learning model operates on the notion of a context. The context specifically defines what information is available during training, and consequently what statistics can be computed and converted into representations.

Different kinds of contexts naturally lead to models that capture different information. This is – crucially – the operational view of a semantic representation learning model, and is central to the rest of this thesis.

Consider the example of perhaps the most commonly used semantic hypothesis: the distributional hypothesis (Harris, 1954). Articulated by Firth (1957) in the popular dictum “You shall know a word by the company it keeps”, the hypothesis claims that words that have similar meaning tend to occur in similar contexts. The example below illustrates how the distributional hypothesis works. For example, the words “stadium” and “arena” may have been found to occur in a corpus with the following contexts:
• People crowded the stadium to watch the game.
• The arena was crowded with people eager to see the match.
• The stadium was host to the biggest match of the year.
• The most important game of the season was held at the arena.
From these sentences it may be reasonably inferred (without knowing anything about them a priori) that the words “stadium” and “arena” in fact have similar meanings. Notice that the context, in this case, is simply the neighboring words – perhaps parametrized by a window size. This is, in fact, the view of the data: it is the input to the representation learning model, which has access to nothing else but this information. Extending the reasoning above to extract neighborhood statistics from a corpus of millions of words, which contains many hundreds of occurrences of the words “stadium” and “arena”, the learner will build reasonably good and robust representations of the two words. That is, the resulting feature representations of the two words will be close together in semantic space.

This is a very simple yet extremely powerful insight, and the contributions of the distributional hypothesis to representation learning in semantics remain profound. Operationalized in practice, most models use the insight of the distributional hypothesis to essentially count statistics over a word’s neighborhood contexts across a corpus and represent all those statistics in a compact form as a vector representation for that word. The result is a vector space model (VSM), where every word is represented by a point in some high-dimensional space. We refer to any model that automatically produces such a VSM as a semantic representation learning model, regardless of what underlying machine learning technique is used to learn it.

While much effort in the research community has been devoted towards improving how models learn word vectors, most work gives little consideration to what the model represents. It simply uses the standard distributional hypothesis, because of its simplicity and wide empirical success, to learn representations that are useful for a host of downstream NLP tasks. Such models essentially use the same kind of context to extract and process count statistics, therefore biasing their learners with the same kind of semantic knowledge. So while word-embedding models may be learned differently, many of them continue to represent the same underlying information.

It should be noted that the distributional hypothesis is not the only way of converting raw text statistics into semantic representations. Turney and Pantel (2010) list several other examples of hypotheses.
These include the bag-of-words hypothesis (Salton et al., 1975), which conjectures that documents can be characterized by the frequency of the words in them; the latent relation hypothesis (Turney, 2008a), which proposes that pairs of words that occur in similar patterns tend to have similar relations between them; and the extended distributional hypothesis, which specifies that patterns that occur in similar contexts have similar meaning. This is by no means a comprehensive list, though, and there are others who have experimented with contexts in different ways as well (Baroni and Lenci, 2010; Xu et al., 2010).

In the terminology of Turney and Pantel (2010), each of these hypotheses yields a different representation matrix. These are, for example, a Term-Document matrix in the case of the bag-of-words hypothesis, or a Pair-Pattern matrix in the case of the latent relation hypothesis. The columns of these matrices are tied to our idea of contexts: they define the view of the data that can be leveraged to extract and process count statistics, and the values in these columns are the count statistics themselves.

An interesting question, in relation to this thesis, is whether there is any underlying connection between the different semantic hypotheses and their related contexts. We do not claim that this will necessarily be true for every different kind of semantic problem or context. However, we will show in what follows that such a connection does exist, and that it can provide useful insight into a number of interesting semantic problems.
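As a toy illustration of how different hypotheses yield different representation matrices, the sketch below — our own example with a made-up two-sentence corpus, not code from this thesis — builds both a word-neighborhood co-occurrence matrix and a Term-Document matrix from the same data. The raw counts are computed the same way in both cases; what differs is the definition of context, and therefore the view of the data that would bias a learner.

```python
from collections import Counter

docs = [
    "people crowded the stadium to watch the game".split(),
    "the arena was crowded with people eager to see the match".split(),
]

def word_window_counts(docs, k=2):
    """Word-neighborhood view: count (target, context) pairs for context
    words within k tokens of the target."""
    counts = Counter()
    for doc in docs:
        for i, target in enumerate(doc):
            for j in range(max(0, i - k), min(len(doc), i + k + 1)):
                if j != i:
                    counts[(target, doc[j])] += 1
    return counts

def term_document_counts(docs):
    """Bag-of-words view: count (term, document) pairs, i.e. the entries
    of a Term-Document matrix."""
    counts = Counter()
    for doc_id, doc in enumerate(docs):
        for word in doc:
            counts[(word, doc_id)] += 1
    return counts

print(word_window_counts(docs)[("stadium", "crowded")])  # neighborhood context
print(term_document_counts(docs)[("crowded", 1)])        # document context
```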

Semantic vectors in continuous space representations can be used for meaningful semantic operations such as computing word similarity (Turney, 2006), performing analogical reasoning (Turney, 2013) and discovering lexical relationships (Mikolov et al., 2013b). These models have also been used for a wide variety of NLP applications including question answering (Tellex et al., 2003), semantic similarity computation (Wong and Raghavan, 1984; McCarthy and Carroll, 2003), automated dictionary building (Curran, 2004), automated essay grading (Landauer and Dumais, 1997), word-sense discrimination and disambiguation (McCarthy et al., 2004; Schütze, 1998), selectional preference modeling (Erk, 2007) and identification of translation equivalents (Hjelm, 2007).

Nevertheless, while they have been widely successful, representation learning models for semantics often fail at capturing many common linguistic phenomena. Semantic compositionality and antonymy are just two examples of phenomena that are stumbling blocks to many common vector learning approaches. This failure stems not from how the representations are learned, but rather from what semantic information is captured by them. In other words, the problem lies with the contexts that are used to bias the representation learner and tell it what semantic units should have similar representations to one another. Most often the distributional hypothesis (i.e. the word-neighborhood context) is used. And while it continues to be very influential for its simplicity and effectiveness, it remains an incomplete explanation for the diversity of possible expressions and linguistic phenomena in all of natural language.

For example, compositionality is a basic property of language where unit segments of meaning combine to form more complex meaning. The words “big”, “brown” and “cow” all have unique meanings, and when put together produce a precise semantic meaning – a “big brown cow” – that is more than just big-ness, brown-ness or cow-ness. But in a VSM, where the unit segments are words, it remains unclear – and according to previous work non-trivial (Mitchell and Lapata, 2008) – how the vectors for words should be combined to form their compositional embedding. From the perspective of the distributional hypothesis, there is no signal in the neighborhood context of “big” or “brown” that says precisely how these words should modify the representation of “cow”.

Another example of a language phenomenon that is difficult to capture is antonymy. Words such as “good” and “bad” are semantic inverses, but are often contextually interchangeable. You can have a “good day” or a “bad day”, eat a “good meal” or a “bad meal” and watch a “good TV show” or a “bad TV show”. Thus, based on the distributional hypothesis, which relies on the contextual neighborhood of words, the vectors assigned to “good” and “bad” would be very close in the resulting semantic space. Again, the neighborhood context of words is not ideal for differentiating between synonyms and antonyms.
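The antonymy problem can be made concrete with a small toy calculation. This is our own illustrative sketch over the example phrases just mentioned, not an experiment from this thesis: under a word-neighborhood view, “good” and “bad” end up with essentially identical context sets, leaving a distributional learner with little signal to separate them.

```python
# Toy illustration: neighborhood contexts of antonyms overlap heavily.
phrases = [
    "a good day", "a bad day",
    "a good meal", "a bad meal",
    "a good TV show", "a bad TV show",
]

def contexts(word, window=1):
    """Collect neighborhood contexts of `word` within +/- `window` tokens."""
    ctx = set()
    for phrase in phrases:
        toks = phrase.split()
        for i, tok in enumerate(toks):
            if tok == word:
                lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                ctx.update(t for j, t in enumerate(toks[lo:hi], lo) if j != i)
    return ctx

good, bad = contexts("good"), contexts("bad")
overlap = len(good & bad) / len(good | bad)  # Jaccard overlap of context sets
print(good, bad, overlap)                    # identical context sets here: overlap = 1.0
```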

Yet another property of language that standard representation learning techniques are incapable of handling is polysemy, where an identical surface form has more than one meaning. For example, the word “bank” may refer to a financial institution or the shore of a river. Because representation learning techniques are unsupervised and operate on large unannotated corpora, the string “bank” is treated as a single semantic unit, despite the fact that it has more than one sense. From the perspective of context, the neighborhoods of the different senses are conflated and only a single representation is learned.

Of course, there are ways of making the standard word-neighborhood template work for these problematic linguistic phenomena. For example, Turney (2008b) and Turney (2011) make modifications to the distributional hypothesis in order to be able to distinguish synonyms from antonyms. Meanwhile, Pantel and Lin (2002), Reisinger and Mooney (2010) and Neelakantan et al. (2014) also make changes to the same template to get it to work for representation learning of word senses. Other simplistic ways may exist too: for example, pre-annotating a corpus with senses before learning sense embeddings directly from the data7. In any case, with these modifications, the word-neighborhood context is no longer the same. We may be representing word-windows, but over neighborhood contexts of different kinds. Thus the relation between a word and its context is, from the perspective of representation learning, different.

To summarize, the word-neighborhood context is useful, but does not capture the varied semantic phenomena in natural language. While modifications to this template can be made in order to account for some kinds of problems, other richer contexts may provide deeper insight into the problems we deal with. In this thesis, we will attempt to provide precisely these kinds of richer, more structured contexts in order to help bias representation learners to capture problematic semantic phenomena. But this naturally begs the question: what is the right (or at least better) context for any given specific semantic phenomenon? In the next section we formalize and outline some ideas on how to tackle this problem.

Note that defining the right context is a non-trivial problem. This thesis only hopes to provide examples of how to use better contexts for learning better representations for specific use cases, and a framework within which to think of possible future solutions. It does not claim to present a solution to every possible natural language semantic problem. Nor does it hypothesize an over-arching view of all of semantics.

7. Although we show that this is not the best approach, and leads to poor empirical results. See Chapter 3.4 for details.

1.2 Semantic and Contextual Relations

Consider a semantic property such as similarity, and assume that we are interested in training learners of semantic representations that capture the notion of similarity. What is similarity? We propose that similarity is an instance of a semantic relation. Semantic relations are, broadly speaking, properties of language or our data that we care about and wish to represent. So, for example, other properties like relatedness, antonymy and polysemy are also semantic relations. Semantic relations can also be properties that do not have a specific name – such as the connection between cells in a table – but that we still want to capture with our representations.

In short, we refer to semantic relations in the context of representation learning as any property that we expect to compute from a comparison of representation units. They are, essentially, the desired outputs of our representation learning models. For example, suppose that we have a semantic vector space model that we think has captured the similarity relation. We can test our assumption by comparing the representations of “cat” and “feline”, and “cat” and “violin”. A good model of similarity will assign a higher score to the first pair than the second, while a bad model might do the opposite or be unable to clearly differentiate between the two. Let us thus define a semantic relation as the following.

Definition 1. Consider one set of target object types $\Omega^A = \{o^A_1, ..., o^A_N\}$ and another set of target object types $\Omega^B = \{o^B_1, ..., o^B_M\}$. A semantic relation $\mathcal{R}_s$ capturing a property $\mathcal{P}_s$ is a mapping of the ordered cross-product of the two sets into the set of real numbers: $\Omega^A \times \Omega^B \Rightarrow \mathbb{R}$.

Here, and in what follows, we will use the word “object” in its broadest sense to mean any cognitive atom. This could be a word, a word sense, a phrase, a sentence, or any textual or even non-textual phenomenon such as a visual thing, etc. Moreover, note that we qualify the object types in the definition above by specifying them as targets. This distinction will become important when we introduce our next definition, but in general the target objects are those for which we wish to learn representations.

For the sake of simplicity, we shall only consider binary relations in this thesis – and in fact, binary relations are adequate to capture many semantic phenomena that NLP researchers care about. Besides, in representation space, comparison functions (such as dot-products or cosines) are typically defined between two objects (typically vectors) anyway. It should be noted, however, that certain semantic phenomena (e.g. buying and selling) are more naturally captured by higher-order relations. The definition above is not limited to binary relations and can be extended to encompass semantic relations with more than two objects.

Notice that, by this definition, the only difference between models that have identical object sets, but whose goal is to capture different semantic relations, is the mapping to the set of reals. This intuitively makes sense because we expect, for example, a model of synonymy to score comparisons between words differently than a model of relatedness. Broadly speaking, the goal of semantic representation learning then is to find representations whose comparison accurately reflects a semantic relation we care about. But how do we train our representation learners to capture specific semantic relations of interest? In other words, what should we give our models as input, in order for them to learn and compute the specific outputs we desire?
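To ground Definition 1, here is a minimal sketch — our own, using made-up three-dimensional vectors purely for illustration — of a semantic relation $\mathcal{R}_s$ for similarity: a function that maps an ordered pair of target objects, here words with vector representations, to a real number, probed with the “cat”/“feline” versus “cat”/“violin” comparison from above.

```python
import numpy as np

# Made-up 3-dimensional representations, purely for illustration.
vectors = {
    "cat":    np.array([0.90, 0.80, 0.10]),
    "feline": np.array([0.85, 0.75, 0.15]),
    "violin": np.array([0.10, 0.20, 0.95]),
}

def similarity_relation(obj_a, obj_b):
    """A semantic relation R_s: maps an ordered pair of target objects to a real
    number. Here the mapping is cosine similarity of the objects' vectors."""
    u, v = vectors[obj_a], vectors[obj_b]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A model that captures similarity should score the first pair higher.
print(similarity_relation("cat", "feline"))  # high
print(similarity_relation("cat", "violin"))  # low
```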

Semantic relations, such as similarity and relatedness, aren’t self-evident in natural language text, and obtaining annotated data that encodes these relations defeats the purpose of the representation learning paradigm in the first place. Thus we can’t simply learn a semantic relation directly from data.

Recall from the previous section that with the generally unsupervised paradigm of representation learning, there are only two ways of controlling the end result. Namely, they are defining how our models are learnt, and what semantic property they capture. These correspond to specifying our model, learning process or architecture, and articulating the contexts that represent the model’s view of the data. While both these factors control the mapping of pairs of objects in semantic space to the set of real numbers, in this thesis we shall focus on the latter in order to control the inputs to our models, and consequently bias them. It may be argued that, in fact, the input view may be even more important than the model in order to gain insight into a representation learner’s ability to learn distinctly different semantic relations. This is because contexts are our means of biasing learners of representations to capture specific semantic relations. Given that contexts are the inputs to our models, we propose that articulating different or richer contexts is the only way of learning models that fundamentally encode different semantic relations.

But how do we operationalize our intuitions about semantic relations into specific types of context on which models can count, process and learn representations? Namely, how do we define and make counts over our inputs (i.e. contexts) so that they produce the outputs (i.e. semantic relations) we care about? We propose that contextual relations are the key to bridging this gap. They are a compact way of specifying the inputs to our model. Very generally, contextual relations are functions of the data that define inputs, which have some common shared property. The inputs are of the form “(target, context)”, where the target is what we want to learn representations for (see Definition 1), and the context is its corresponding view of the data. Moreover, contextual relations map these target-context pairs to some statistical measure of association. We will clarify this with a few examples before providing a formal definition.

Consider the similarity relation, which has been conveniently operationalized by Firth (1957) through the intuition that words that tend to occur in similar neighborhood contexts tend to have similar meaning. This operationalization has been encoded in previous semantic representation learning literature as the word-neighbor-word contextual relation. We can summarize this relation by the following:
• Targets are words in a corpus.
• Contexts are other words in the neighborhood of targets.
• Measure of association, or weight, is some function of frequency – for example, co-occurrence probabilities, PPMI values (Bullinaria and Levy, 2007) or vector dot-products, as is the case with a model like Skip-gram (Mikolov et al., 2013a).

Contextual relations can also be defined over static lexicons, such as WordNet, where types rather than tokens are the units of representation. Consider the synonym relation from WordNet. We can summarize this relation by the following:
• Targets are word senses in WordNet.
• Contexts are other word senses that belong to the same synsets as targets.
• Measure of association, or weight, is – in the simplest case – 1.0, which indicates the existence of a synonym relationship between target and context, and 0.0 otherwise.

Other examples from Turney and Pantel (2010) can be similarly encoded as contextual relations: the bag-of-words hypothesis is in fact a belong-to-same-document contextual relation, and the latent relation hypothesis is in fact a word-neighbor-pattern contextual relation, with various models (e.g. TF-IDF, PMI) designed to assign the numerical measure of association to target-context pairs. Note that these are just informal relation names that we have assigned to illustrate as examples. We will now formalize this. We define a contextual relation in the following way:

Definition 2. Consider one set of target object types Ω^A = {o_1^A, ..., o_N^A} and another set of context object types Ω^C = {o_1^C, ..., o_M^C}. A contextual relation R_c that associates targets and contexts with a shared property P_c is a mapping of the ordered cross-product of the two sets into the set of real numbers: Ω^A × Ω^C ⇒ R.

Again, we use "object" in its broadest sense here, as in Definition 1. Note that it is important to explicitly tie target-context pairs to measures of association, or weights (i.e. the mapping to the set of reals). Turney and Pantel (2010) explicitly tie frequency to all the different hypotheses (i.e. contextual relations) and the corresponding representation matrices they describe. And this is certainly true: frequency is inalienable from meaning when attempting to learn representations from a corpus. Every semantic representation learning model is learned by performing some counting or processing over contexts. Even in the case of a model that simply mimics a static lexicon such as WordNet, the counts are binary values indicating the presence or absence of a relational edge.

Definitions 1 and 2 – dealing with semantic and contextual relations respectively – seem very similar, but they are subtly different in three important ways. First, semantic relations specify the desiderata of our models, while contextual relations specify how these desiderata are fulfilled. Therefore, the property P_s of Definition 1 is a desired output, while the property P_c of Definition 2 is an engineered input. Second, semantic relations are mappings of pairs of target objects to reals, while contextual relations are mappings of target and context object pairs to reals. Finally, the mapping to the reals is directly computed from the data (or specified heuristically) in the case of contextual relations, while it is a result of the model in the case of semantic relations.

Let us return to models that learn representations of words based on the distributional hypothesis. The following summarizes the inputs to this class of models and their mapping to the reals, in terms of a contextual relation:
• Ω^A are word types in a corpus.
• Ω^C are also word types in a corpus.
• P_c is defined by |i(o_n^A) − i(o_m^C)| ≤ k, where i(·) is a function that returns the index of a word token, and k is the neighborhood parameter.
• The mapping between target-context pairs and the reals – f(o_n^A, o_m^C) ⇒ R – is specified by the function f. In the case of distributional models, f is typically some function of frequency obtained by sampling from a corpus.
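The word-neighbor-word contextual relation summarized above can be made concrete with a short sketch (a toy illustration under the assumptions of this section, not code from the thesis): it generates (target, context) pairs whose shared property P_c is co-occurrence within a window of k tokens, and it maps each pair to a PPMI weight as one possible choice of f.

import math
from collections import Counter

corpus = "the cat sat on the mat while the dog sat on the rug".split()
k = 2  # neighborhood parameter

# Generate (target, context) pairs with the shared property |i(target) - i(context)| <= k.
pairs = Counter()
for n, target in enumerate(corpus):
    for m in range(max(0, n - k), min(len(corpus), n + k + 1)):
        if m != n:
            pairs[(target, corpus[m])] += 1

# Map each target-context pair to the reals with PPMI.
total = sum(pairs.values())
target_counts, context_counts = Counter(), Counter()
for (t, w), c in pairs.items():
    target_counts[t] += c
    context_counts[w] += c

def ppmi(t, w):
    p_tw = pairs[(t, w)] / total
    p_t = target_counts[t] / total
    p_w = context_counts[w] / total
    return max(0.0, math.log(p_tw / (p_t * p_w))) if p_tw > 0 else 0.0

print(ppmi("cat", "sat"), ppmi("cat", "rug"))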

This contextual relation view of semantic representation learning also allows us to explain the failures of existing models. A model that is learned with the distributional hypothesis does not accurately capture antonymy simply because antonymy requires a different relation than the word-neighbor-word relation. In other words, the property P_c of Definition 2 is (or should be) different for synonyms and antonyms. More generally, some structural or relational information has been ignored or misrepresented by using the word-neighbor-word relation, and therefore the resulting model has not been properly biased to recognize antonymy. Expanding on the definition of a contextual relation, we define structure over relations as follows.

Definition 3. Consider the set of contextual relations R_c1, ..., R_cK defined over the object set pairs (Ω^A1, Ω^C1), ..., (Ω^AK, Ω^CK), with corresponding properties P_c1, ..., P_cK and mappings to the reals (Ω^A1, Ω^C1) ⇒ R, ..., (Ω^AK, Ω^CK) ⇒ R. A structure over contextual relations S_c is a mapping (Ω^A1, Ω^C1), ..., (Ω^AK, Ω^CK) ⇒ R^K.

This is a very natural extension to contextual relations that allows us to conceive of and use more complex structure. Consider the example of an ontology such as WordNet. We can summarize its structure in terms of Definition 3 by the following:
• Ω^A1, ..., Ω^AK are all word sense types in WordNet.
• Ω^C1, ..., Ω^CK are also all word sense types in WordNet.
• P_c1, ..., P_cK are the different edge types in WordNet, such as synonym, hypernym, hyponym, etc.
• Finally, the mapping (Ω^A1, Ω^C1), ..., (Ω^AK, Ω^CK) ⇒ R^K is simply the vectorized form of the different individual mappings of the relations (Ω^A1, Ω^C1) ⇒ R, ..., (Ω^AK, Ω^CK) ⇒ R. Each of these individual mappings is, in the simplest case, a binary value indicating the existence of a specific edge type (e.g. synonym, hypernym, hyponym) between word sense pairs.

By considering the two definitions of relation and structure together, we obtain the following corollary.

Corollary 1. Structure is decomposable into a set of objects and relations.
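As a toy illustration of Definition 3 and Corollary 1 (the ontology fragment below is invented for the example and is not an excerpt of WordNet), a structure over contextual relations can be represented as a mapping from word-sense pairs to a vector in R^K, with one coordinate per edge type:

# K edge types, each a separate contextual relation over word sense pairs.
EDGE_TYPES = ["synonym", "hypernym", "hyponym"]

# A hand-made ontology fragment: each entry maps a (target sense, context sense)
# pair to the set of edge types that hold between them.
edges = {
    ("cat(1)", "feline(1)"): {"synonym"},
    ("cat(1)", "animal(1)"): {"hypernym"},
    ("animal(1)", "cat(1)"): {"hyponym"},
}

def structure_mapping(target, context):
    # The mapping to R^K of Definition 3: a vector of binary values, one per
    # contextual relation, indicating whether that edge type holds.
    held = edges.get((target, context), set())
    return [1.0 if r in held else 0.0 for r in EDGE_TYPES]

print(structure_mapping("cat(1)", "feline(1)"))   # [1.0, 0.0, 0.0]
print(structure_mapping("cat(1)", "animal(1)"))   # [0.0, 1.0, 0.0]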

With these definitions and corollary, we note that contextual relations and structure are in fact two inherently interconnected phenomena. In this thesis we will often deal with some linguistic structure and attempt to break it down into contextual relations upon which our models can operate and learn representations.

Figure 1.1 visually encapsulates the cycle from defining semantic relations to learning representations that capture these relations. It also presents a convenient summary of our approach to relation-centric semantic representation learning. The cycle begins with articulating a semantic relation of interest, that is to say, the goal output of our model. The next step is to intuit what contexts may be useful for capturing and encoding this semantic relation, and then to operationalize this intuition by defining contexts in terms of contextual relations with a common shared property.

Figure 1.1: A summary of our approach to relation-centric representation learning and how it differs from models that simply use the distributional hypothesis as a base for operationalizing contexts.

Finally, some definition, counting or processing associates weights to the contextual relations and thus biases the learner to (ideally) capture the semantic relation we care about. The figure also illustrates, in contrast, a very commonly used pipeline (the dotted grey area) that simply operationalizes the distributional hypothesis and learns representations from word-neighbor-word contexts.

Of course, this view does not represent a comprehensive solution to all of representation learning. What we present in this thesis is therefore not a recommendation of how all semantic representation learning research should progress in the future, nor is it a panacea for every learning problem in semantics.

One of the major challenges with our relation-centric approach is to think creatively about contexts that can capture a desired semantic relation, and to operationalize these via some structure or contextual relations. This may not always be easy, or even possible. Consider the example where we set ourselves the goal of learning the specific set of PropBank relations8; these are effectively the semantic relations we wish to learn. Short of furnishing the model with a few million examples of predicates and arguments, annotated PropBank style, there is no way of ensuring that the model will learn exactly what we want it to learn. This, of course, defeats the purpose of using the representation learning paradigm in the first place, since we would require vast amounts of annotated data. In general, the more specific or restricted the requirements of a desired semantic relation, the harder it is to operationalize via a contextual relation.

Nevertheless, our relation-centric approach brings a way of encoding knowledge and featurization into the representation learning paradigm. We therefore believe that this is a useful framework for thinking about the kinds of information and knowledge that are captured by semantic models, and for providing solutions to certain challenging semantic learning problems. The applied examples of this framework that we provide in the rest of this thesis are broad-ranging and diverse enough to substantiate this belief.

Relations and structure in language are ubiquitous. Turney (2007a) makes the point that relations are primary to producing meaning in language. He argues that even an attributive adjective like the color "red" is in fact a relation, a property that is "redder" than something else. Relations can even be defined over other relations. Such meta-relations can capture complex linguistic phenomena, like describing the inherently linked acts of buying and selling, and the repercussions on ownership of goods and money. Figure 1.2 illustrates just some of the sources of information in which we discover and use structure, and over which we operationalize contextual relations to learn representations of the semantic relations we care about in this thesis.

In summary, putting relations at the center of our investigation allows us to come up with a common theme that bridges the gap between what semantic property we wish to learn (i.e. the semantic relation) and how we learn it from data (i.e. the contextual relation). Answers to the question "What does the model represent?" are simply instances of different semantic relations. At the same time, different problems require access to different linguistic information, or views of the data, to be properly encoded.

8 We learn a set of semantic argument slots in Chapter 5, but we do not impose any restrictions on the slots a priori – like making them map exactly to PropBank slots.

Figure 1.2: Examples illustrating various sources of information exhibiting the dual nature of structure and relations. Note how Definitions 2 and 3, as well as Corollary 1, apply to each of these examples. All these sources of relations and structure are used in this thesis.

Thus, understanding, defining and leveraging the correct (or at least a more expressive) contextual relation and structure are crucial in learning better representations of semantic relations.

1.3 Contributions of this Thesis

Formally, we propose the central conjectures of this thesis:
1. Relations are central to producing and understanding meaning in language.
2. Semantic representation learning models essentially capture semantic relations.
3. Context is one of the few ways of biasing an unsupervised representation learning model to capture a specific semantic relation. Correctly operationalizing context is crucial in learning good semantic representations.
4. Different linguistic phenomena require different contextual relations and structure in order to be optimally represented.
5. Articulating the right (or at least a better) contextual relation and structure for a particular problem often leads to more expressive and empirically superior models of semantic representation learning.

This thesis substantiates these conjectures with evidence for several different semantic problems. The problems are selected to represent a broad range of linguistic phenomena and sources of contextual relations and structural knowledge. So while our framework may not capture all of semantics, our examples still demonstrate the applicability of our hypotheses to fairly broad-ranging and diverse problems. The problems also provide an example-based narrative which we study to suggest a framework for relation-centric representation learning. This framework suggests a way to think about and implement solutions for integrating relations and structure in semantic representation learning.

In Chapter 2 we introduce our framework as a series of five research questions, each tackling a sequential set of themes related to relational modelling in representation learning. We suggest possible answers to these research questions by citing instances from this thesis as well as by analyzing the different aspects related to each theme. In summary, the five themes deal with:
1. Formulating a representation learning problem centered around the question "What should be represented?", and defining a semantic relation of interest.
2. Finding a source of information that encodes knowledge about the learning problem.
3. Decomposing the information into a set of manageable structural units and defining contextual relations over these units.
4. Formulating an objective or model for the learning problem that biases the learner with the contextual relations and associates weights to target-context pairs.
5. Characterizing the nature of the learned model and evaluating it to see if it actually captures the semantic relation we set out to learn. The goal of the evaluation is to justify the central claims of this thesis – namely that using relations and structure in representation learning leads to more expressive and empirically superior models of semantics.

The rest of the thesis is organized as instantiations of this high-level framework for several different semantic problems. While all these chapters fall into the broader theme of this thesis, we can divide them organizationally into two distinct parts. The first part, consisting of Chapters 3 and 4, deals with leveraging existing structural and relational data to improve representation learning models for different problems. In the second part, which contains Chapters 5 and 6, we investigate problems where there is no existing structure or relational data. Rather, we attempt to induce structure and relations and embed representation learning models with them. Nevertheless, all four chapters can be formulated in terms of a target semantic relation and its operationalization in learning via some contextual relations. The contributions of each of the main chapters are summarized in what follows.

Chapter 3 introduces work published in Jauhar et al. (2015). It deals with the problem of polysemy in semantic representation learning. In particular, it introduces two models that learn vectors for polysemous words grounded in an ontology. The structure of the ontology is the source of contextual relations that we use to guide our techniques. The first model acts as a simple post-processing step that teases apart the different sense vectors from a single sense-agnostic word vector. It is computationally efficient and can be applied to any existing technique for learning word vectors from unannotated corpora. The second builds on the first and introduces a way of modifying maximum likelihood representation learning models to also account for senses grounded in an ontology. We show results on a battery of semantic tasks, including two additional sets of results not published in Jauhar et al. (2015) – one of which shows that our model can be applied to deal with antonymy as well.

In Chapter 4 we deal with the semi-structured representation of knowledge tables for the task of question answering. Our paper, Jauhar et al. (2016a), places relations at the center of our investigation. First, we leverage the contextual relations between cells in tables to guide non-expert annotators to create a large dataset of high-quality Multiple-Choice Questions (MCQs) with minimal effort (Jauhar et al., 2016b). Because of the set-up of our task, we obtain fine-grained alignments between MCQs and table elements as a by-product. We also introduce two different models that solve multiple-choice questions by using tables as a source of background knowledge. Crucially, both solutions leverage the structure and contextual relations present within tables to solve MCQs. The first solution is a manually engineered feature-rich model, whose features are based on the structural properties of tables. The second solution extends the intuition of the feature-rich model with a neural network that automatically learns representations over cells.

We look at the problem of relation induction in Chapter 5, within the context of embedded semantic lexicons. This work is published in Jauhar and Hovy (2017). It represents a novel integration between a predictive maximum likelihood model and an Indian Buffet Process to induce an embedded semantic lexicon that contains a set of latent, automatically generated relation slots. A key contribution of this model is that it requires no high-level annotations – such as dependency information – and can learn slots by only accessing a corpus of predicate-argument pairs.
Our proposed solution jointly leverages two types of contextual relations: local token-level affinities as well as global type-level affinities between predicates and arguments, in order to arrive at an understanding of the latent slot structure. Quantitatively as well as qualitatively, we show that the induced latent slots produce an embedded lexicon that is semantically expressive and that captures several interesting properties of known, manually defined semantic relations.

Chapter 6 deals with the problems of contextual variability of word meaning and faceted comparison in semantics. We introduce two solutions for these problems, both bound by a common modelling paradigm: tensor-based representation. In a tensor representation, the semantic signature of a standard word vector is decomposed into distinct relation dimensions. We instantiate this modelling paradigm with two kinds of relations, syntactic and latent semantic relations, each leading to a solution for one of the two central problems of this chapter. The models we develop are published in Goyal et al. (2013a), Goyal et al. (2013b) and Jauhar and Hovy (2014).

Finally, in Chapter 7, we conclude with a look back at the contributions of this thesis and suggest avenues for future research that merit investigation.

Chapter 2

Relational Representation Learning

A Framework for Using Semantic Relations in Lexical Representation Learning

Given our definitions of relations and structure in Definitions 2 and 3, many linguistic properties, knowledge resources and computational phenomena can be said to exhibit relational and structural attributes. Moreover, there are many different ways that these phenomena can be used to bias existing methods of representation learning.

Hence, within the scope of our goal of proving a relation-centric view of semantic representation learning, it is difficult to come up with a single mathematical framework through which all semantic representation learning techniques may be viewed. Even if we were to propose such a framework, it isn't clear whether it would simplify an understanding of the models we deal with, or of the contexts that they rely on.

Instead, we propose an example-based scheme to illustrate our use of relations in learning semantic models. The scheme sequentially highlights a number of recurring themes that we encountered in our research, and we structure our illustration as a series of questions that address these themes. The question-theme structure allows us to focus on observationally important information while also proposing an outline for how to apply our example-based framework to a new domain. The examples we cite from this thesis for each theme will illustrate the diverse range and flexibility of using contextual relations in representation learning.

In summary, the five research questions we address, along with the themes that they expose, are:
1. What is the semantic representation learning problem we are trying to solve?
Identify a target representation learning problem that focusses on what (linguistic) property needs to be modelled. That is, specify the semantic relation to be learned.
2. What linguistic information or resource can we use to solve this problem?
Find a corpus-based or knowledge-based resource that contains information relevant to solving the problem in item 1.
3. How do we operationalize the desired semantic relation with contextual relations in the resource?
Discover useful structure that is present in the linguistic annotation or knowledge resource. Furthermore, operationalize any intuitions about the semantic relation from item 1 via contextual relations. Define and generate actual contexts with the common shared property of the contextual relations.
4. How do we use the contextual relations in representation learning?
Think about an objective or model that maps target-context pairs into the set of reals, in order to fully specify the contextual relations from item 3.
5. What is the result of biasing the learner with contextual relations in learning the semantic model?
Evaluate the learned model to see if the influence of contextual relations has measurably affected the learning of the semantic relation in item 1. Possibly characterize the nature of the learned model to gain insight into what evaluation might be appropriate.

We should stress that the set of questions and themes we describe is by no means complete and absolute. While useful as guidelines for addressing a new problem, they should be complemented by in-domain knowledge of the problem in question. This chapter should be viewed as a summary of the thesis as well as a framework for applying our ideas to future work. The framework is outlined by the research agenda listed above.

In the rest of this chapter we delve further into each of the five questions of our research agenda. In the section detailing each of the five research questions, the final paragraph – formatted as a quotation block for visual clarity – summarizes the high-level findings of this thesis as they relate to the respective question. A reader interested only in these high-level findings should feel free to skip over the examples from the thesis that we list in order to flesh out the findings. Skipping over those parts of this chapter will not affect the comprehensibility of the rest of the thesis.

Additionally, at the end of each chapter in the rest of the thesis, we will return to this five-point theme. This will not only serve as a summary of each chapter, but will also tie the contents of that chapter back to our central goal of relation-centric representation learning.

What is the semantic representation learning problem we are trying to solve?

As detailed in our introduction, this is a thesis about representation learning that focusses more on the question of what is being learned, rather than how it is being learned. Hence, the first and most fundamental question needs to address the issue of what semantic representation learning problem is being tackled. We have named this desideratum the semantic relation to be learned (see Section 1.2).

Many existing methods for semantic representation learning operate on the intuition of the distributional hypothesis (Firth, 1957): namely, that words that occur in similar contexts tend to have similar meanings. But as we have argued in Chapter 1, this hypothesis – while extremely useful – does not cover the entire spectrum of natural language semantics. In fact, every different semantic understanding problem requires a specific relation and corresponding context to properly express and represent it.

In this thesis, for example, we tackle several different problems that do not fit into the mould of the word-neighborhood context that is engendered by the distributional hypothesis. We address the problem of contextual variability (Blank, 2003) in Chapter 6.2, where the meaning of a word varies and stretches based on the neighborhood context in which it occurs1. Another problem with semantic vector space models is facetedness in semantic comparison (Jauhar and Hovy, 2014), whereby concepts cannot be compared along different facets of meaning (e.g. color, material, shape, size, etc.). We tackle this problem in Chapter 6.3. Yet another issue with distributional semantic models is their incapacity to learn polysemy from un-annotated data, or to learn abstract types that are grounded in known word senses. We propose a joint solution to both problems in Chapter 3.

Not all problems are necessarily linguistic shortcomings. For example, our work in Chapter 5 is motivated by a shortcoming of data and linguistic resources in learning a semantic lexicon, not of the representation problems themselves. Another example is our work in Chapter 4, which tackles a representational issue, not a linguistic one: namely, the trade-off between the expressive power of a representation and the ease of its creation or annotation. Hence our research focusses on semi-structured representations (Soderland, 1999) and the ability to learn models over, and perform structured reasoning with, these representations. In similar fashion, the models we propose in Chapter 6 – while tackling specific linguistic issues – are also concerned with schemas: moving from vectors to tensors as the unit of representation (Baroni and Lenci, 2010).

In summary, the first step in our sequence of research questions is to address the central issue of what is being represented. We need to articulate the semantic relation that we want our model to capture. Any problem is legitimate as long as it focusses on this central research question. In our experience, problems will generally stem from issues or shortcomings in the ability of existing models to represent complex linguistic phenomena, but they could also concern (sometimes by consequence) the form of the representation itself.

What linguistic information or resource can we use to solve this problem?

Once a representation learning problem has been identified, the next item on the research agenda is to find a source of knowledge that can be applied to the problem. Naturally this source of knowledge should introduce information that helps to solve the central problem.

In our experience, there seem to be two broad categories of such sources of knowledge. First, there is corpus-based information that is easily and abundantly available as naturally occurring free text, or as some level of linguistic annotation thereof. For example, this category would cover free text from Wikipedia, its part-of-speech tags, parse trees, and coreference chains, to name only a few. Second are knowledge-based resources that are compiled in some form to encapsulate some information of interest. For example, these could be ontologies, tables, gazetteer lists, lexicons, etc. The two categories are, of course, not mutually exclusive: a treebank, for example, could reasonably be assumed to belong to both categories.

1 Not to be confused with polysemy, which Blank (2003) contrasts with contextual variability in terms of a lexicon. The former is characterized by different lexicon entries (e.g. "human arm" vs "arm of a sofa"), while the latter is the extension of a single lexicon entry (e.g. "human arm" vs "robot arm"). Naturally, the distinction is a continuum that depends on the granularity of the lexicon.

Moreover, there is no restriction on using a single source of knowledge: any number from each category can be combined to bring richer information to the model.

Examples of the first category of sources of information in this thesis are dependency information (Nivre, 2005) (Chapter 6.2 and Chapter 5), textual patterns (Ravichandran and Hovy, 2002) (Chapter 6.3), and predicate-argument pairs (Marcus et al., 1994) (Chapter 5). Similarly, examples of the second category of sources of knowledge include ontologies (Miller, 1995) (Chapter 3) and tables (Chapter 4). In each of these cases we selected these sources of knowledge to bring novel and useful information to the problem we were trying to resolve. For example, ontologies contain word sense inventories that are useful for solving the problem of polysemy in representation learning. Similarly, dependency arcs define relational neighborhoods that are useful for creating structured distributional semantic models, or for yielding features that can be used to find the principal relational arguments of predicates. Predicate-argument structures provide weak, latent cues that can be aggregated to arrive at an understanding of semantic frames.

To sum up, the goal of the second research question is to identify a resource or resources that can be used to resolve the central representation learning problem proposed in response to the first research question. In our experience, resources fall into two broad (possibly overlapping) categories: corpus-based and knowledge-based.

How do we operationalize the desired semantic relation with contextual relations in the resource?

This item in the research agenda is perhaps the most important and often the most difficult. It deals with making the connection between the desired semantic relation to be learned (i.e. item 1) and the best way of consuming the data or resource (i.e. item 2). In other words, it requires specifying some contextual relations, with contexts that share some common property over the resource.

There is no systematic way of engineering contextual relations. Rather, they require creativity and domain knowledge, much like machine learning models that require manually coded and engineered features. This should not be surprising, since we propose that it is contextual relations that provide a way of injecting linguistic, domain or world knowledge into the class of unsupervised representation learning techniques.

In our experience, defining a set of useful contextual relations stems from an intuition about the original semantic relation to be learned. But in addition to this vague guideline, there is often something concrete that can be done to make the connection between semantic and contextual relations easier. This is done by recognizing the existing structure in the knowledge resource and articulating the relations within this structure. This systematic decomposition of the resource can lead to insights about the semantic relation and suggest ways of defining contextual relations over the data.

At face value, identifying the structure and relations within a resource may seem like a simple exercise of listing items. And this is, in fact, the simplest use case for well-defined and carefully crafted resources. In this thesis, the most direct use of structure and relations involves our work on ontologies (Chapter 3).

Here the ontology is a carefully conceived graph structure, with typed edges that relate word senses to one another. Our work on tables (Chapter 4) also largely falls into this category, where the structure is constant and well-defined. However, the structure itself is not typed or even explicit. Rather, it stems from the inherent relations between table cells, which interact with one another in a number of interesting ways (see Chapter 4.2.1). Analogies (Turney, 2013), for example, are one type of semantic property that is naturally embedded in tables, and they come directly from the relations between table cells.

However, in the absence of such well-defined structure, there are several challenges to identifying and using the relations in a resource. When the relations are latent rather than explicit, the goal of the model may be to discover or characterize these contextual relations in some fashion. This is not the case with our work on tables, where relational properties between table cells are only leveraged, and are not required to be directly encoded in the model. In contrast, our work on latent tensor models (Chapter 6.3) and event argument induction (Chapter 5) both require the latent relations to be explicitly learned as parts of the resulting representations.

There is also the problem of resources that do not specify a closed class of definite relations at all. This is the case with our work on Structured Distributional Semantics (Chapter 6.3), for example, where we attempt to extract relational patterns from free text. There is simply no way of enumerating all possible patterns beforehand, and even the patterns that are harvested number in the many millions and are consequently very difficult to manage. Similarly, in our work on Lexicon Induction (Chapter 5), the set of relations between predicate-argument types cannot be enumerated beforehand. However, identifying properties of the relations and structure can lead to insight that makes the space of un-enumerable relations more manageable. For example, the pattern relations or predicate-argument relations are likely to follow a Zipfian distribution. This insight can be used to select and use only the most frequent relations that occur in the data. Similarly, when one is considering relations in a dependency parse tree (Chapter 6.2), it may be useful, for example, to note that the tree is projective (Nivre, 2003). In general, insights into these facets can lead to computationally desirable models or algorithms, as we will note in our next research question.

There is, of course, an inherent trade-off between the linguistic complexity or abstractness of a contextual relation and its ability to be harvested or leveraged for learning word vectors. The distributional hypothesis, for example, encodes a simple relation that can be abundantly harvested from unannotated text. More complex relations, such as those defined over ontologies (Chapter 3) or tables (Chapter 4), are fewer and harder to directly leverage. In this thesis we sometimes supplement the distributional hypothesis with other forms of relational information (such as ontological relations), while at other times we completely replace it with some other form of more structured data (such as tables). We show that the inherent tension between complexity and availability in the data is resolved because even small amounts of structured data provide a great amount of valuable information.

In conclusion, making the connection between a semantic relation and expressive contexts that can bias learners to capture this semantic relation is hard.
Defining contextual relations often stems from operationalizing some intuition about the desired semantic relation. However, a more concrete step can also be taken to aid in this creative process.

This involves analyzing and articulating the structure and relations that can be extracted from the resources. Specific kinds of structure, such as graphs or trees, have well-defined relational structure. These can lead to computational models or algorithms that leverage the structural properties through principled approaches specifically designed for them. More generally, the wealth of research in structured prediction (Smith, 2011) is applicable to the scope of our problems, since it deals with learning and inference over complex structure that is often decomposable into more manageable constituents – which is exactly how we define the link between contextual relations and structure in Definitions 2 and 3, and Corollary 1. Finally, when the relations and their resulting structure are latent or un-enumerable, it may be necessary to induce representations that encode and consolidate the latent or un-enumerable relations into the model. In these cases, too, insights into the properties of the relations and underlying structure can help in developing models or learning solutions that are more computationally attractive and practical.

How do we use the contextual relations in representation learning?

The next research question on the agenda deals with the actual modelling problem of assigning a real-valued score to pairs of target and context objects. This is an important step in order to convert raw contexts into some summary statistic and thereby fully specify the contextual relation defined in item 3.

Here there is considerable flexibility, since there is no single correct way of integrating intuitions and observations about the target domain into a model – and no guarantee, once one does, that it will function as expected. Nevertheless, we draw a distinction here between defining an objective and defining a model. The former attempts to encode some mathematical loss function or transformation (such as maximum likelihood, or singular value decomposition) that captures the desiderata of a problem, while the latter specifies a model that will numerically compute and optimize this loss function (for example a recursive neural network or a log-linear model). The two are sometimes connected, and the definition of one may lead to the characterization of the other. We note that our purpose for drawing this distinction is merely for the sake of illustrating our example-based framework. Hence, we will discuss the objective or the model, as appropriate, when either one or the other more intuitively captures the use of relations or structure in the representation learning problem.

In this thesis we work with several kinds of transformations and objectives. An example of transformation-based representation learning is singular value decomposition (Golub and Reinsch, 1970), which we use to reduce sparsity and noise in learning latent relations for our tensor model (Chapter 6.3).

Many existing semantic representation learning models are formulated as maximum likelihood estimation problems over corpora (Collobert et al., 2011; Mikolov et al., 2010). In our work with ontologies we propose a way to modify these objectives minimally with structured regularizers, thereby integrating knowledge from the ontological graph (Chapter 3).

A very similar formulation is also applied to our work on semantic lexicon induction (Chapter 5), where we maximize the likelihood of a corpus of predicate-argument pairs while accounting for a structured regularizer that is defined over predicate embedding representations.

Also in our work with ontologies, post-processing is another method that we use to modify word vectors and derive sense representations. In this case, a maximum a-posteriori estimation over smoothness properties of the graph (Corduneanu and Jaakkola, 2002; Zhu et al., 2003; Subramanya and Bilmes, 2009) is more appropriate. Specifically, we have a prior (the existing word vectors) which we would like to modify with data (the ontological graph) to produce a posterior (the final sense vectors).

In the case of tables, we are dealing with a countable set of predictable relations over table cells. Featurizing these relations is the most direct way of encoding them into learning our model. Consequently we use a log-linear model (Chapter 4.5), which conveniently captures features (Smith, 2004). We can also leverage these relations to automatically induce features, which we do with a neural network model over table cells (Chapter 4.6). For both kinds of models, we use a cross-entropy loss.

In characterizing the argument structure of predicates in a semantic lexicon (Chapter 5), we wish to allow for a potentially infinite set of relations, while letting the data decide what a really meaningful set of relations should be. Hence we use the Indian Buffet Process (Griffiths and Ghahramani, 2005), a Bayesian non-parametric model (Hjort et al., 2010), to precisely model feature learning in such a setting.

Summarizing the theme of this part of the research agenda, we seek to concretize a specific model that efficiently and meaningfully assigns weights to target-context pairs in terms of contextual relations. The goal remains to solve the representation learning problem we originally defined, and to capture the desired semantic relation. While there is a lot of flexibility in how this may be done, it can be useful to think about two aspects of the problem: the objective function and the actual model. Often one or the other presents an intuitive way of characterizing how relations or structure should be integrated, and the other follows directly or is left to empirical validation.
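As a schematic illustration of the objective/model distinction discussed above (a toy sketch only; the function and variable names are hypothetical, and the loss is not the exact objective used in any chapter of this thesis), a maximum likelihood objective over target-context pairs can be minimally modified with a graph-based structured regularizer:

import numpy as np

def regularized_objective(W, C, corpus, edges, lam=1.0):
    # W: (V, d) target embeddings; C: (V, d) context embeddings.
    # corpus: list of (target_id, context_id) pairs (the contextual relation's inputs).
    # edges: list of (i, j) index pairs that are related in some structured resource.
    nll = 0.0
    for t, c in corpus:
        scores = C @ W[t]                                  # score every context word
        log_probs = scores - np.logaddexp.reduce(scores)   # log-softmax
        nll -= log_probs[c]                                # corpus negative log-likelihood
    # Structured regularizer: related targets should have nearby embeddings.
    smoothness = sum(np.sum((W[i] - W[j]) ** 2) for i, j in edges)
    return nll + lam * smoothness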

What is the result of biasing the learner with contextual relations in learning the semantic model?

Our final research item deals with the model after it has been learned. In short, we wish to characterize the quality of the learned model with respect to the semantic relation that it set out to capture. Moreover, we wish to measure the influence of the contextual relations on this quality. In this context, we discuss some differences between the models we learn in this thesis in an attempt to characterize them, and we also list how we evaluate them.

With regard to the learned model, there are several distinctions. The most salient distinction, however, is between the case where contextual relations have simply been leveraged as an external influence on the learning model, and the case where relations become part of the resulting model. In this thesis, examples of the first case can be found in Chapters 3 and 4, while examples of the second case are Chapters 5 and 6.

In terms of evaluation, when specifically dealing with the latter of the two cases it is useful to evaluate the relations in the model in addition to the representations for vocabulary elements (which should be evaluated anyway).

We do this in this thesis with a relation classification task (Zhang, 2004) in Chapter 6.3, and with two different evaluations, on local token-level and global type-level relations between predicates and arguments, in Chapter 5. Additionally, we qualitatively evaluate the relations in both relevant chapters by mapping them to known, manually defined semantic relations to better understand how they represent information.

For the sake of completeness of our summary of the work in this thesis, we also list some less important distinctions about the models we learn. These need not be considered, in general, when characterizing a learned model or thinking about how it will be evaluated. Most often semantic representation learning yields vectors, which is the case for our work on ontologies, tables and semantic lexicons (Chapters 3, 4, and 5 respectively). But we also introduce tensor-based semantic models (Chapter 6). In both kinds of objects, the numerical values that comprise them are most often abstract real numbers, as is the case with our models from Chapters 6.3, 3 and 4. But we also have models that are probabilistically normalized (Chapter 6.2) and binary (Chapter 5).

On the evaluation side, depending on the nature of the model and its connection to the problem that it is attempting to solve, it is also useful to consider an intrinsic or an extrinsic evaluation. For general-purpose semantic word vector (or tensor) models, such as the ones learned in Chapters 3 and 6, popular intrinsic evaluation tasks such as word similarity judgement (Finkelstein et al., 2002) and most appropriate synonym selection (Rubenstein and Goodenough, 1965) are adequate. For targeted models, such as the one we learn over tables specifically created for question answering (Chapter 4), an extrinsic task evaluation is more appropriate. This is also the case for the evaluations that we conduct for our work on predicate-argument structure (Chapter 5).

In summary – as with any good empirical task – the evaluation should, most importantly, directly gauge whether there was a measurable influence of using targeted contextual relations towards improving the quality of the representations in capturing the desired semantic relation. Sometimes, characterizing the nature of the learned model can lead to insights about what might be an appropriate evaluation.

In the rest of this thesis we detail instances of the application of this research agenda to several research problems and proposed future work. At the end of each chapter we will return to this agenda to summarize our findings and tie them back to our theme of relation-centric semantic representation learning.

Chapter 3

Ontologies and Polysemy

Using Ontological Relation Links to Learn Sense-specific Word Representations of Polysemy

Most approaches to representation learning for lexical semantics assign a single vector to every surface word type. This is a problem we encounter in Chapter 6.3 with respect to the issue of the multiple facets or relational arguments of concepts.

In this chapter we tackle a different problem, which however has its roots in the same representational issue with the vector space model: namely, that words – across all human languages – are polysemous. Models that assign a single representation to a unique surface form (such as vector space models) ignore the possibility that words may have more than one meaning. For example, the word "bank" can either denote a financial institution or the shore of a river. The ability to model multiple meanings is an important component of any NLP system, given how common polysemy is in language. For example, knowing which of the senses of a word needs to be used in a Machine Translation system (Carpuat and Wu, 2007) to translate into another language can make the difference between a correct translation and a nonsensical one.

Much of the attractiveness of vector space models of semantics is their ability to be learned in an unsupervised fashion from vast amounts of unannotated data. Thus it is difficult for them to capture the different senses of words, when these senses are not annotated to begin with. The lack of sense-annotated corpora large enough to robustly train VSMs, and the absence of fast, high-quality word sense disambiguation (WSD) systems, make handling polysemy difficult in the representation learning of semantics.

Meanwhile, lexical ontologies such as WordNet (Miller, 1995) specifically catalog sense inventories and provide typologies that link these senses to one another. These ontologies typically encode structure and connectedness between concepts through relations, such as synonymy, hypernymy and hyponymy. Data in this form provides a complementary source of information to distributional statistics. Its relation-induced semantic structure also falls into the over-arching theme of this thesis as a potential source of structured knowledge that can be leveraged to learn more expressive models of semantics.

Recent research tries to leverage ontological information to train better VSMs (Yu and Dredze, 2014; Faruqui et al., 2014), but does not tackle the problem of polysemy. In parallel, work on polysemy for VSMs revolves primarily around techniques that cluster contexts to distinguish between different word senses (Reisinger and Mooney, 2010; Huang et al., 2012), but does not integrate ontologies in any way.

In this chapter we present two novel approaches to integrating ontological and distributional sources of information. Our focus is on allowing already existing, proven techniques to be adapted to produce ontologically grounded word sense embeddings. Our first technique is applicable to any sense-agnostic VSM as a post-processing step that performs graph propagation on the structure of the ontology. The second is applicable to the wide range of current techniques that learn word embeddings from predictive models that maximize the likelihood of a corpus (Collobert and Weston, 2008; Mnih and Teh, 2012; Mikolov et al., 2013a). Our technique adds a latent variable representing the word sense to each token in the corpus, and uses EM to find parameters. Using a structured regularizer based on the ontological graph, we learn grounded sense-specific vectors. Both techniques crucially rely on the structure of the underlying ontology, and on the relations that bind related word senses to one another.

There are several reasons to prefer ontologies as distant sources of supervision for learning sense-aware VSMs over previously proposed unsupervised context clustering techniques. Clustering approaches must often parametrize the number of clusters (senses), which is neither known a priori nor constant across words (Kilgarriff, 1997). Also, the resulting vectors remain abstract and uninterpretable. With ontologies, interpretable sense vectors can be used in downstream applications such as WSD, or for better human error analysis. Moreover, clustering techniques operate on distributional similarity only, whereas ontologies support other kinds of relationships between senses. Finally, the existence of cross-lingual ontologies would permit learning multilingual vectors, without compounded errors from word alignment and context clustering.

We evaluate our methods on 4 lexical semantic tasks across 9 datasets and 2 languages, and show that our sense-specific VSMs effectively integrate knowledge from the ontology with distributional statistics. Empirically, this results in consistently and significantly better performance over baselines in most cases. We discuss and compare our two different approaches to learning sense representations from the perspectives of performance, generalizability, flexibility and computational efficiency. Finally, we qualitatively analyze the vectors and show that they indeed capture sense-specific semantics.

3.1 Related Work

Since Reisinger and Mooney (2010) first proposed a simple context clustering technique to generate multi-prototype VSMs, a number of related efforts have worked on adaptations and improvements relying on the same clustering principle. Huang et al. (2012) train their vectors with a neural network and additionally take global context into account. Neelakantan et al. (2014) extend the popular skip-gram model (Mikolov et al., 2013a) in a non-parametric fashion to allow for different numbers of senses for words. Guo et al. (2014) exploit bilingual alignments to perform better context clustering during training. Tian et al. (2014) propose a probabilistic extension to skip-gram that treats the different prototypes as latent variables. This is similar to our second EM training framework, and turns out to be a special case of our general model. In all these papers, however, the multiple senses remain abstract and are not grounded in an ontology.

Conceptually, our work is also similar to Yu and Dredze (2014) and Faruqui et al. (2014), who treat lexicons such as the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) or WordNet (Miller, 1995) as an auxiliary thesaurus to improve VSMs. However, they do not model senses in any way. Pilehvar et al. (2013) do model senses from an ontology by performing random walks on the WordNet graph; however, their approach does not take distributional information from VSMs into account. Thus, to the best of our knowledge, our work presents the first attempt at producing sense-grounded VSMs that are symbolically tied to lexical ontologies. From a modelling point of view, it is also the first to outline a unified, principled and extensible framework that effectively combines the symbolic and distributional paradigms of semantics by specifically paying attention to relational structure.

Both our models leverage the graph structure of, and the relational links in, ontologies to effectively ground the senses of a VSM. This ties into previous research (Das and Smith, 2011; Das and Petrov, 2011) that propagates information through a factor graph to perform tasks such as frame-semantic parsing and POS-tagging across languages. More generally, this approach can be viewed from the perspective of semi-supervised learning, with an optimization over a graph loss function defined on smoothness properties (Corduneanu and Jaakkola, 2002; Zhu et al., 2003; Subramanya and Bilmes, 2009).

Related to the problem of polysemy is the issue of the different shades of meaning a word assumes based on context. The space of research on this topic can be divided into three broad categories: models for computing contextual lexical semantics based on composition (Mitchell and Lapata, 2008; Erk and Padó, 2008; Thater et al., 2011), models that use fuzzy exemplar-based contexts without composing them (Erk and Padó, 2010; Reddy et al., 2011), and models that propose latent variable techniques (Dinu and Lapata, 2010; Séaghdha and Korhonen, 2011; Van de Cruys et al., 2011). Chapter 6 addresses the first of the three categories with a Structured Distributional Semantic Model that takes relational context into account. The current chapter, however, tackles the stronger form of lexical ambiguity in polysemy and falls into the latter two of the three categories.

3.2 Unified Symbolic and Distributional Semantics

In this section, we present our two techniques for inferring sense-specific vectors grounded in an ontology. We begin with notation, and define the following symbols for our data:

• W = {w_1, ..., w_n} is a set of word types of interest. These are the surface forms that one would typically encounter in an unannotated corpus; for example, words like "cat", "dog", "feline", "canine", etc.

• W_s = {s_ij | ∀ w_i ∈ W, 1 ≤ j ≤ k_i} is a set of word sense types, where k_i is the number of senses of the word type w_i. This set contains elements such as "cat(1)" (the animal) and "cat(2)" (the Linux command).

• Ω = (T_Ω, E_Ω) is an ontology, with:

  – T_Ω = {t_ij | ∀ s_ij ∈ W_s}, a set of nodes, one for each of the word sense types in W_s.

  – E_Ω = {e^r_{ij−i'j'} | ∀ r ∈ R}, a set of undirected edges connecting some subset of word sense pairs (s_ij, s_i'j') by a semantic relation r. So, for example, "cat(1)" might be connected to "feline(1)" by a "synonym" edge.

Our goal, in both of the models that we present, is to learn some vector representation for the word sense types in W_s. Furthermore, we want to enforce that these representations are influenced by, and grounded in, the ontology Ω.

3.2.1 Retrofitting Vectors to an Ontology

Our first technique assumes that we already have a vector space embedding representation of a vocabulary of surface forms Û = {û_i | ∀ w_i ∈ W}. We wish to infer vectors V = {v_ij | ∀ s_ij ∈ W_s} for word senses that are maximally consistent with both Û and Ω, by some notion of consistency. We formalize this notion as MAP inference in a Markov network (MN).

The MN we propose contains variables for every vector in Û and V. These variables are connected to one another by dependencies as follows. Variables for vectors v_ij and v_i'j' are connected iff there exists an edge e^r_{ij−i'j'} ∈ E_Ω connecting their respective word senses in the ontology. That is, we create edges between the vectors of word senses like "cat(1)" and "feline(1)", for example. Furthermore, the vector û_i for the word type w_i is connected to all the vectors v_ij of the different senses s_ij of w_i. So, for example, the vector for "cat" is connected to both "cat(1)" and "cat(2)". If w_i is not contained in the ontology, we assume it has a single unconnected sense and set its only sense vector v_i1 to its empirical estimate û_i. Let us define this new set of edges, between nodes in the ontology and their sense-agnostic word types, as e_{i−ij} ∈ E_ω. These are edges between the word types w_i and the sense types s_ij.

The structure of this MN is illustrated in Figure 3.1. Here, the neighborhood of the ambiguous word "bank" is presented as a factor graph. Note that the neighborhood is fabricated for purposes of illustration, and is not the actual neighborhood of "bank" from WordNet or some other ontology.

The energy of a Markov network is defined by a set of clique potentials, or functions defined over cliques in the resulting graphical model. We set each pairwise clique potential between neighboring nodes to be of the form exp(−a‖u − v‖²), where u and v are the vectors corresponding to these nodes, and a is a weight controlling the strength of the relation between them. We use the Euclidean norm instead of a distance based on cosine similarity because it is more convenient from an optimization perspective.1 We can take the log of these pairwise clique potentials, and the minimization problem over the Markov network remains the same.

1 It can be shown, in fact, that using the Euclidean norm turns out to be a special case of the constrained problem with cosine log factors. Moreover, vectors that are spatially close – in the Euclidean sense – also have small cosine distances (and high cosine similarity) between them anyway.

Figure 3.1: A factor graph depicting the retrofitting model in the neighborhood of the word "bank". Observed variables corresponding to word types are shaded in grey, while latent variables for word senses are in white.

X 2 X 2 C(V ) = arg min α uˆi vij + βr vij vi0j0 V k − k k − k (3.1) e ∈E er ∈E i−ij ω ij−i0j0 Ω

Here α is the sense-agnostic weight (that is, the weight we apply to neighbors like “cat” and “cat(1)”) and β_r are relation-specific weights for different semantic relations (that is, the weights we apply to neighbors like “cat(1)” and “feline(1)”, based on their “synonym” relation). This objective encourages vectors of neighboring nodes in the MN to pull closer together, leveraging the tension between sense-agnostic neighbors (the first summation term) and ontological neighbors (the second summation term). This allows the different neighborhoods of each sense-specific vector to tease it apart from its sense-agnostic vector.

Taking the partial derivative of the objective in equation 3.1 with respect to vector v_ij and setting it to zero gives the following solution:

v_{ij} = \frac{\alpha \hat{u}_i + \sum_{e^r_{ij-i'j'} \in \mathcal{N}_{ij}} \beta_r v_{i'j'}}{\alpha + \sum_{e^r_{ij-i'j'} \in \mathcal{N}_{ij}} \beta_r} \qquad (3.2)

where \mathcal{N}_{ij} denotes the set of neighbors of s_ij based solely on the edge set E_Ω of the ontology. Thus, the MAP sense-specific vector is an α-weighted combination of its sense-agnostic vector and the β_r-weighted sense-specific vectors in its ontological neighborhood.

Algorithm 1 Outputs a sense-specific VSM, given a sense-agnostic VSM and ontology
1: function RETROFIT(Û, Ω)
2:   V^(0) ← {v_ij^(0) = û_i | ∀ s_ij ∈ W_s}
3:   while ‖v_ij^(t) − v_ij^(t−1)‖ ≥ ε ∀ s_ij do
4:     for t_ij ∈ T_Ω do
5:       v_ij^(t+1) ← update using equation 3.2
6:     end for
7:   end while
8:   return V^(t)
9: end function

We use coordinate descent to iteratively update the variables V using equation 3.2. The optimization problem in equation 3.1 is convex, and we normally converge to a numerically satisfactory stationary point within 10 to 15 iterations. This procedure is summarized in Algorithm 1. The generality of this algorithm allows it to be applicable to any VSM as a computationally attractive post-processing step. An implementation of this technique is available at https://github.com/sjauhar/SenseRetrofit.
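To make the procedure concrete, the following is a minimal sketch of Algorithm 1 in Python. It assumes the ontology has already been flattened into per-sense adjacency lists with relation labels, and that every relation appearing in those lists has a weight; the container names (word_vecs, senses, neighbors) are illustrative and do not correspond to the released SenseRetrofit code.

```python
import numpy as np

def retrofit(word_vecs, senses, neighbors, alpha=1.0, beta=None,
             eps=0.01, max_iters=10):
    """Coordinate-descent retrofitting (equation 3.2).

    word_vecs : dict word -> np.ndarray, the sense-agnostic vectors u_i
    senses    : dict word -> list of sense ids s_ij
    neighbors : dict sense id -> list of (neighbor sense id, relation) edges
    beta      : dict relation -> weight (the beta_r of equation 3.1)
    """
    beta = beta or {"synonym": 1.0, "hypernym": 0.5, "hyponym": 0.5}
    # Initialize every sense vector with its word's empirical vector.
    sense_vecs = {s: word_vecs[w].copy() for w, ss in senses.items() for s in ss}
    word_of = {s: w for w, ss in senses.items() for s in ss}

    for _ in range(max_iters):
        max_shift = 0.0
        for s, vec in sense_vecs.items():
            num = alpha * word_vecs[word_of[s]]
            den = alpha
            for s2, rel in neighbors.get(s, []):
                num += beta[rel] * sense_vecs[s2]
                den += beta[rel]
            new_vec = num / den                      # closed-form update
            max_shift = max(max_shift, np.linalg.norm(new_vec - vec))
            sense_vecs[s] = new_vec
        if max_shift < eps:                          # convergence check
            break
    return sense_vecs
```

Because each update only touches a vector and its immediate neighborhood, the sweep scales linearly with the number of sense nodes, which is what makes retrofitting attractive as a post-processing step.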

3.2.2 Adapting Predictive Models with Latent Variables and Structured Regularizers

Many successful techniques for semantic representation learning are formulated as models where the desired embeddings are parameters that are learned to maximize the likelihood of a corpus (Collobert and Weston, 2008; Mnih and Teh, 2012; Mikolov et al., 2013a). In our second approach we formulate a method to extend existing maximum likelihood models to also accommodate word senses grounded in an ontology. Broadly, the intuition of our approach relies on adding latent variables representing the senses and then marginalizing over them. Furthermore, we use a structured prior based on the structure of the ontology to ground the sense embeddings.

Formally, we assume input of the following form:
• D = {(w_1, c_1), ..., (w_N, c_N)} is a corpus of pairs of target and context words. While these pairs can represent any relationship, for the purposes of this chapter we assume a standard distributional neighborhood. So, for example, we expect to see word pairs such as “(eat, pasta)”, “(spaghetti, fork)”, etc.
• Ω = (T_Ω, E_Ω) is an ontology, defined as in the beginning of Section 3.2.

Our goal is to infer sense-specific vectors V = {v_ij | ∀ s_ij ∈ W_s}, as before, but now additionally using the data of word-context pairs in D.

Consider a model with parameters θ that factorizes the probability over the corpus as \prod_{(w_i, c_i) \in D} p(w_i, c_i; θ). We propose to extend such a model to learn ontologically grounded sense vectors by presenting a general class of objectives of the following form:


Figure 3.2: The generative process associated with the skip-gram model, modified to account for latent senses. Here, the context of the ambiguous word “bank” is generated from the selection of a specific latent sense.

C(\theta) = \arg\max_{\theta} \sum_{(w_i, c_i) \in D} \log \Big( \sum_{s_{ij}} p(w_i, c_i, s_{ij}; \theta) \Big) + \log p_\Omega(\theta) \qquad (3.3)

This objective introduces latent variables s_ij for senses and adds a structured regularizer p_Ω(θ) that grounds the vectors V in an ontology. This form permits flexibility in the definition of both p(w_i, c_i, s_ij; θ) and p_Ω(θ), allowing for a general yet powerful framework for adapting MLE models. In what follows we show that the popular skip-gram model (Mikolov et al., 2013a) can be adapted to generate ontologically grounded sense vectors.

The classic skip-gram model uses a set of parameters θ = (U, V), with U = {u_i | ∀ c_i ∈ W} and V = {v_i | ∀ w_i ∈ W} being sets of vectors for context and target words respectively. The generative story of the skip-gram model involves generating the context word c_i conditioned on an observed word w_i. The conditional probability is defined to be

p(c_i \mid w_i; \theta) = \frac{\exp(u_i \cdot v_i)}{\sum_{c_{i'} \in W} \exp(u_{i'} \cdot v_i)}.

We modify the generative story of the skip-gram model to account for latent sense variables by first selecting a latent word sense s_ij conditional on the observed word w_i, then generating the context word c_i from the sense-distinguished word s_ij. This process is illustrated in Figure 3.2. The factorization p(c_i | w_i; θ) = \sum_{s_{ij}} p(c_i | s_{ij}; θ) × p(s_{ij} | w_i; θ) follows from the chain rule, since senses are word-specific.

To parameterize this distribution, we define a new set of model parameters θ = (U, V, Π), where U remains identical to the original skip-gram as vectors for context words (i.e. sense-agnostic types), V = {v_ij | ∀ s_ij ∈ W_s} are a set of vectors for word senses (rather than word surface forms), and Π are the context-independent sense proportions π_ij = p(s_ij | w_i). We use a Dirichlet prior over the multinomial distributions π_i for every w_i, with a shared concentration parameter λ.

This is a standard maximum likelihood learning problem with latent variables, so the natural solution is to use Expectation Maximization (Dempster et al., 1977), or EM. However, there is nothing to identify the representations that we learn with the senses in the ontology Ω. We address this by formulating priors. We define the ontological prior on vectors as p_Ω(θ) ∝ \exp\big(-\gamma \sum_{e^r_{ij-i'j'} \in E_\Omega} \beta_r \lVert v_{ij} - v_{i'j'} \rVert^2 \big), where γ controls the strength of the prior. We note the similarity to the retrofitting objective in equation 3.1, except with α = 0. This leads to the following realization of the objective in equation 3.3:

C(\theta) = \arg\max_{\theta} \sum_{(w_i, c_i) \in D} \log \Big( \sum_{s_{ij}} p(c_i \mid s_{ij}; \theta) \times p(s_{ij} \mid w_i; \theta) \Big) \;-\; \gamma \sum_{e^r_{ij-i'j'} \in E_\Omega} \beta_r \, \lVert v_{ij} - v_{i'j'} \rVert^2 \qquad (3.4)

This objective can be optimized using EM for the latent variables, with lazy updates (Carpenter, 2008) every k words to account for the prior regularizer. However, since we are primarily interested in learning good vector representations for word senses, and we want to learn efficiently from large datasets, we make the following simplifications. First, we perform “hard” EM, selecting the most likely sense at each position rather than using the full posterior over senses. Also, given that the structured regularizer p_Ω(θ) is essentially the retrofitting objective in equation 3.1, we run retrofitting periodically every k words (with α = 0 in equation 3.2) instead of lazy updates.2 The following decision rule is used in the “hard” E-step:

s_{ij} = \arg\max_{s_{ij}} \; p(c_i \mid s_{ij}; \theta^{(t)}) \, \pi_{ij}^{(t)} \qquad (3.5)

In the M-step we use Variational Bayes to update Π with:

\pi_{ij}^{(t+1)} \propto \frac{\exp\big(\psi\big(\tilde{c}(w_i, s_{ij}) + \lambda \pi_{ij}^{(0)}\big)\big)}{\exp\big(\psi\big(\tilde{c}(w_i) + \lambda\big)\big)} \qquad (3.6)

where \tilde{c}(\cdot) is the online expected count and ψ(·) is the digamma function. This approach is motivated by Johnson (2007), who found that naive EM leads to poor results, while Variational Bayes is consistently better and promotes faster convergence of the likelihood function.

To update the parameters U and V, we use negative sampling (Mikolov et al., 2013a), which is an efficient approximation to the original skip-gram objective. Negative sampling attempts to distinguish between true word pairs in the data and noise. Stochastic gradient descent on the following equation is used to update the model parameters U and V:

\mathcal{L} = \log \sigma(u_i \cdot v_{ij}) \;+\; \sum_{j' \neq j} \log \sigma(-u_i \cdot v_{ij'}) \;+\; \sum_{m} \mathbb{E}_{c_{i'} \sim P_n(c)} \big[ \log \sigma(-u_{i'} \cdot v_{ij}) \big] \qquad (3.7)

Here σ(·) is the sigmoid function, P_n(c) is a noise distribution computed over unigrams, and m is the negative sampling parameter. This is almost exactly the same as negative sampling

2We find this gives slightly better performance.

proposed for the original skip-gram model (Mikolov et al., 2013a). The only change is that we additionally take a negative gradient step with respect to all the senses that were not selected in the hard E-step (the second term in the loss function above). We summarize the training procedure for the adapted skip-gram model in Algorithm 2.

Algorithm 2 Outputs a sense-specific VSM, given a corpus and an ontology
1: function SENSEEM(D, Ω)
2:   θ^(0) ← initialize
3:   for (w_i, c_i) ∈ D do
4:     if period > k then
5:       RETROFIT(θ^(t), Ω)
6:     end if
7:     (Hard) E-step:
8:     s_ij ← find argmax using equation 3.5
9:     M-step:
10:    Π^(t+1) ← update using equation 3.6
11:    U^(t+1), V^(t+1) ← update using equation 3.7
12:   end for
13:   return θ^(t)
14: end function
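The sketch below illustrates the shape of one iteration of Algorithm 2 for a single (word, context) pair: the hard E-step of equation 3.5, the Variational Bayes update of equation 3.6, and the negative-sampling step of equation 3.7. It is schematic rather than faithful to our implementation: p(c | s) is approximated with a sigmoid score, the periodic retrofitting call and the noise-side update of the sense vector are omitted, and noise_sampler is a placeholder for sampling m contexts from P_n(c).

```python
import numpy as np
from scipy.special import digamma, expit as sigmoid

def hard_em_step(wi, ci, U, V, pi, counts, noise_sampler,
                 lr=0.025, lam=1000.0, m=5):
    """One schematic training step for the observed pair (wi, ci).

    U      : dict mapping a context word to its vector u
    V      : dict mapping (word, sense index) to a sense vector v
    pi     : dict mapping a word to its array of sense proportions
    counts : dict of online expected counts c~(w, j) and c~(w)
    """
    senses = [key for key in V if key[0] == wi]

    # Hard E-step (equation 3.5): pick the most likely sense for this context.
    scores = [sigmoid(U[ci] @ V[s]) * pi[wi][s[1]] for s in senses]
    best = senses[int(np.argmax(scores))]

    # M-step for the sense proportions (equation 3.6), Variational Bayes style.
    counts[(wi, best[1])] = counts.get((wi, best[1]), 0.0) + 1.0
    counts[wi] = counts.get(wi, 0.0) + 1.0
    prior = np.full(len(senses), 1.0 / len(senses))          # uniform pi^(0)
    numer = np.exp(digamma(np.array([counts.get((wi, j), 0.0)
                                     for j in range(len(senses))]) + lam * prior))
    pi[wi] = numer / np.exp(digamma(counts[wi] + lam))

    # M-step for U and V via negative sampling (equation 3.7).
    u, v = U[ci].copy(), V[best].copy()
    g = 1.0 - sigmoid(u @ v)                                  # positive pair
    U[ci] += lr * g * v
    V[best] += lr * g * u
    for s in senses:                                          # non-selected senses
        if s != best:
            V[s] -= lr * sigmoid(u @ V[s]) * u
    for cneg in noise_sampler(m):                             # m noise contexts
        U[cneg] -= lr * sigmoid(U[cneg] @ v) * v
```

A full training run would simply stream (w_i, c_i) pairs from the corpus through this step, interleaving retrofitting sweeps every k words as in Algorithm 2.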

3.3 Resources, Data and Training

We detail the training and setup for our experiments in this section. We use WordNet (Miller, 1995) as the sense repository and ontology in all our experiments. WordNet is a large, hand-annotated ontology of English composed of 117,000 clusters of senses, or “synsets” that are related to one another through semantic relations such as hypernymy and hyponymy. Each synset additionally comprises a list of sense specific lemmas which we use to form the nodes in our graph. There are 206,949 such sense specific lemmas, which we connect with synonym, hypernym and hyponym3 relations for a total of 488,432 edges. To show the applicability of our techniques to different VSMs we experiment with two dif- ferent kinds of base vectors. Global Context Vectors (GC) (Huang et al., 2012): These word vectors were trained using a neural network which not only uses local context but also defines global features at the docu- ment level to further enhance the VSM. We distinguish three variants: the original single-sense vectors (SINGLE), a multi-prototype variant (MULTI), – both are available as pre-trained vec- tors for download4 – and a sense-based version obtained by running retrofitting on the original vectors (RETRO). Skip-gram Vectors (SG) (Mikolov et al., 2013a): We use the word vector tool Word2Vec5

3We treat edges as undirected, so hypernymy and hyponymy are collapsed and unified in our representation schema. 4http://nlp.stanford.edu/˜socherr/ACL2012_wordVectorsTextFile.zip 5https://code.google.com/p/word2vec/

35 to train skip-gram vectors. We define 6 variants: a single-sense version (SINGLE), two multi- sense variants that were trained by first sense disambiguating the entire corpus using WSD tools, – one unsupervised (Pedersen and Kolhatkar, 2009) (WSD) and the other supervised (Zhong and Ng, 2010) (IMS) – a retrofitted version obtained from the single-sense vectors (RETRO), an EM implementation of the skip-gram model with the structured regularizer as described in section 3.2.2 (EM+RETRO), and the same EM technique but ignoring the ontology (EM). All models were trained on publicly available WMT-20116 English monolingual data. This corpus of 355 million words, although adequate in size, is smaller than typically used billion word corpora. We use this corpus because the WSD baseline involves preprocessing the corpus with sense disambiguation, which is slow enough that running it on corpora orders of magnitude larger was infeasible. Retrofitted variants of vectors (RETRO) are trained using the procedure described in algo- rithm 1. We set the convergence criteria to  = 0.01 with a maximum number of iterations of 10. The weights in the update equation 3.2 are set heuristically: the sense agnostic weight α is 1.0, and relations-specific weights βr are 1.0 for synonyms and 0.5 for hypernyms and hyponyms. EM+RETRO vectors are the exception where we use a weight of α = 0.0 instead, as required by the derivation in section 3.2.2. We did not tune the sense agnostic and relation-specific weights of our model to our eval- uation suite. We did experiment with tweaking the weights heuristically but this did not pro- vide consistent and statistically significant improvements across our benchmark datasets. This is likely because each target problem or dataset requires a different set of weights – and per- haps even a different set of ontological relations to begin with. This is certainly in keeping with the overall theme of this thesis which stresses that different problems in fact require different contextual relations to better learn and represent the underlying semantic relations. This suggests that a more integral optimization in future work will likely benefit our tech- nique, given different problem desiderata. A completely unsupervised approach might consider factors such as sense frequency (which is available in WordNet), or the in- and out-degrees of sense nodes as indicators on setting the weights. These factors might be used to compute the dis- tribution and polysemy of word surface forms and senses in order to make a more informed de- cision on when to rely more on corpus statistics (possibly for more frequent and popular senses) and when to rely more on the knowledge derived from the ontological graph (possibly for infre- quent, uncommon senses). Using a more supervised approach, when a target problem has some development data, this data can of course be used to perform grid search on the relation weights that yield the highest evaluation scores on the development data. For skip-gram vectors (SG) we use the following standard settings, and do not tune any of the values. We filter all words with frequency < 5, and pre-normalize the corpus to replace all numeric tokens with a placeholder. We set the dimensionality of the vectors to 80, and the window size to 10 (5 context words to either side of a target). The learning rate is set to an initial value of 0.025 and diminished linearly throughout training. The negative sampling parameter is set to 5. 
Additionally for the EM variants (section 3.2.2) we set the Dirichlet concentration parameter λ to 1000. We use 5 abstract senses for the EM vectors, and initialize the priors uniformly.

6http://www.statmt.org/wmt11/

                      Word Similarity (ρ)                  Synonym Selection (%)
Model  Variant     WS-353   RG-65   MC-30   MEN-3K      ESL-50   RD-300   TOEFL-80
GC     Single      0.623    0.629   0.657   0.314       47.73    45.07    60.87
GC     Multi       0.535    0.510   0.309   0.359       27.27    47.89    52.17
GC     Retro       0.543    0.661   0.714   0.528       63.64    66.20    71.01
SG     Single      0.639    0.546   0.627   0.646       52.08    55.66    66.67
SG     EM          0.194    0.278   0.167   0.228       27.08    33.96    40.00
SG     WSD         0.481    0.298   0.396   0.175       16.67    49.06    42.67
SG     IMS         0.549    0.579   0.606   0.591       41.67    53.77    66.67
SG     Retro       0.552    0.673   0.705   0.560       56.25    65.09    73.33
SG     Retro+EM    0.321    0.734   0.758   0.428       62.22    66.67    68.63

Table 3.1: Similarity scoring and synonym selection in English across several datasets involving different VSMs. Higher scores are better; best scores within each category are in bold. In most cases our models consistently and significantly outperform the other VSMs.

For EM+RETRO, WordNet dictates the number of senses; also when available WordNet lemma counts are used to initialize the priors. Finally, we set the retrofitting period k to 50 million words.

3.4 Evaluation

In this section we detail experimental results on 4 lexical semantics tasks across 9 different datasets and 2 languages.

3.4.1 Experimental Results

We evaluate our models on 3 kinds of lexical semantic tasks: similarity scoring, synonym selection, and similarity scoring in context.

Similarity Scoring: This task involves using a semantic model to assign a score to pairs of words. We use the following 4 standard datasets in this evaluation: WS-353 (Finkelstein et al., 2002), RG-65 (Rubenstein and Goodenough, 1965), MC-30 (Miller and Charles, 1991) and MEN-3k (Bruni et al., 2014). Each dataset consists of pairs of words along with an averaged similarity score obtained from several human annotators. For example, an item in the WS-353 dataset is “book, paper → 7.46”. We use standard cosine similarity to assign a score to word pairs in single-sense VSMs, and the following average similarity score for multi-sense variants, as proposed by Reisinger and Mooney (2010):

\text{avgSim}(w_i, w_{i'}) = \frac{1}{k_i k_{i'}} \sum_{j, j'} \cos(v_{ij}, v_{i'j'}) \qquad (3.8)

where k_i and k_{i'} are the number of senses of w_i and w_{i'} respectively. The output of systems is evaluated against the gold standard using Spearman’s rank correlation coefficient.
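For instance, scoring a similarity dataset with a multi-sense VSM under equation 3.8 can be done as follows; sense_vecs is assumed to map a word to the list of its sense vectors, and the helper names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_sim(senses_a, senses_b):
    """Equation 3.8: mean cosine similarity over all pairs of senses."""
    return float(np.mean([cosine(va, vb) for va in senses_a for vb in senses_b]))

def score_dataset(pairs, gold_scores, sense_vecs):
    """Spearman correlation between avgSim predictions and human judgments."""
    predictions = [avg_sim(sense_vecs[w1], sense_vecs[w2]) for w1, w2 in pairs]
    return spearmanr(predictions, gold_scores).correlation
```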

Synonym Selection: In this task, VSMs are used to select the semantically closest word to a target from a list of candidates. We use the following 3 standard datasets in this evaluation: ESL-50 (Turney, 2002), RD-300 (Jarmasz and Szpakowicz, 2004) and TOEFL-80 (Landauer and Dumais, 1997). These datasets consist of a list of target words that appear with several candidate lexical items. An example from the TOEFL dataset is “rug → sofa, ottoman, carpet, hallway”, with “carpet” being the most synonym-like candidate to the target. We begin by scoring all pairs composed of the target and one of the candidates. We use cosine similarity for single-sense VSMs, and max similarity for multi-sense models7:

\text{maxSim}(w_i, w_{i'}) = \max_{j, j'} \cos(v_{ij}, v_{i'j'}) \qquad (3.9)

These scores are then sorted in descending order, with the top-ranking score yielding the semantically closest candidate to the target. Systems are evaluated on the basis of their accuracy at discriminating the top-ranked candidate.

The results for similarity scoring and synonym selection are presented in table 3.1. On both tasks and on all datasets, with the partial exception of WS-353 and MEN-3k, our vectors (RETRO & EM+RETRO) consistently yield better results than other VSMs. Notably, both our techniques perform better than preprocessing a corpus with WSD information in unsupervised or supervised fashion (SG-WSD & SG-IMS). Simple EM without an ontological prior to ground the vectors (SG-EM) also performs poorly.

We investigated the observed drop in performance on WS-353 and found that this dataset consists of two parts: a set of similar word pairs (e.g. “tiger” and “cat”) and another set of related word pairs (e.g. “weather” and “forecast”). The synonym, hypernym and hyponym relations we use tend to encourage similarity to the detriment of relatedness. We ran an auxiliary experiment to show this. SG-EM+RETRO training also learns vectors for context words – which can be thought of as a proxy for relatedness. Using this VSM we scored a word pair by the average similarity of all the sense vectors of one word to the context vector of the other word, averaged over both words. For example, given the pair “bank” and “money” (which are related, not similar), we would take the different sense vectors of “bank” from the sense vector space and compare them to the single vector of “money” from the context space. Using this scoring strategy the correlation ρ jumped from 0.321 to 0.493. While still not as good as some of the other VSMs, it should be noted that this scoring function, while improving related words, negatively influences the similar word pairs in the dataset.

A possible solution to this problem could be to learn representations from contexts derived from corpus-based statistics and ontological relations separately, and to maintain disparate corpus and ontology vector spaces. The results from the experiment above indicate, in fact, that there is some orthogonality in the information derived from the two sources of knowledge, and that maintaining such disparate spaces might be beneficial. This idea is similar, in spirit, to the domain and function vector spaces of Turney (2012). The vectors from the two different spaces can be used jointly to compute a measure of comparison between terms, which should ideally reflect a balance between their similarity as well as their relatedness. Moreover, the reliance on one space versus another when computing this quantity can be data driven – preferring corpus statistics for problems that require a measure of relatedness, while preferring ontological knowledge for problems that need a measure of similarity instead.

7Here we are specifically looking for synonyms, so the max makes more sense than taking an average.

Model  Variant     SCWS (ρ)
GC     Multi       0.657
GC     Retro       0.420
SG     EM          0.613
SG     WSD         0.343
SG     IMS         0.528
SG     Retro       0.417
SG     Retro+EM    0.587

Table 3.2: Contextual word similarity in English. Higher scores are better.

We also sometimes have lower scores on the MEN-3k dataset, which is crowd-sourced and contains much diversity, with word pairs evidencing similarity as well as relatedness. However, we are unsure why the performance of GC-RETRO improves greatly over GC-SINGLE on this dataset, while that of SG-RETRO and SG-RETRO+EM drops in relation to SG-SINGLE. The idea of maintaining disparate vector spaces for corpus and ontology is likely to produce better results on this dataset as well.

Similarity Scoring in Context: As outlined by Reisinger and Mooney (2010), multi-sense VSMs can be used to consider context when computing similarity between words. We use the SCWS dataset (Huang et al., 2012) in these experiments. This dataset is similar to the similarity scoring datasets, except that the word pairs are additionally presented in context. For example, an item involving the words “bank” and “money” gives the words in their respective contexts, “along the east bank of the Des Moines River” and “the basis of all money laundering”, with a low averaged similarity score of 2.5 (on a scale of 1.0 to 10.0). Following Reisinger and Mooney (2010) we use the following function to assign a score to word pairs in their respective contexts, given a multi-sense VSM:

\text{avgSimC}(w_i, c_i, w_{i'}, c_{i'}) = \sum_{j, j'} p(s_{ij} \mid c_i, w_i) \, p(s_{i'j'} \mid c_{i'}, w_{i'}) \, \cos(v_{ij}, v_{i'j'}) \qquad (3.10)
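A direct rendering of equation 3.10, assuming the sense posteriors p(s | c, w) have already been computed (for our EM models these come from the E-step; hard posteriors are simply one-hot), might look as follows.

```python
import numpy as np

def avg_sim_c(senses_a, post_a, senses_b, post_b):
    """Equation 3.10: similarity of two words in context, each weighted by its
    context-conditioned sense posterior."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sum(pa * pb * cosine(va, vb)
               for pa, va in zip(post_a, senses_a)
               for pb, vb in zip(post_b, senses_b))
```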

As with similarity scoring, the output of systems is evaluated against gold standard using Spearman’s rank correlation coefficient. The results are presented in table 3.2. Pre-processing a corpus with WSD information in an unsupervised fashion (SG-WSD) yields poor results. In comparison, the retrofitted vectors (SG-RETRO & GC-RETRO) already perform better, even though they do not have access to context vectors, and thus do not take contextual information into account. Supervised sense vectors (SG-IMS) are also competent, scoring better than both retrofitting techniques. Our EM vectors (SG-EM & SG-EM+RETRO) yield even better results and are able to capitalize on con- textual information, however they still fall short of the pre-trained GC-MULTI vectors. We were

surprised that SG-EM+RETRO actually performed worse than SG-EM, given how poorly SG-EM performed in the other evaluations. However, an analysis again revealed that this was due to the kind of similarity encouraged by WordNet rather than an inability of the model to learn useful vectors. The SCWS dataset, in addition to containing related words – which, we showed, hurt our performance on WS-353 – also contains word pairs with different POS tags. WordNet synonymy, hypernymy and hyponymy relations are exclusively defined between lemmas of the same POS tag, which adversely affects performance further.

3.4.2 Generalization to Another Language

Figure 3.3: Performance of sense specific vectors (wnsense) compared with single sense vectors (original) on similarity scoring in French. Higher bars are better. wnsense outperforms original across 2 different VSMs on this task.

Our method only depends on a sense inventory to improve and disambiguate VSMs. We thus experiment with extending the technique to another language. In particular, we run the retrofitting procedure in Algorithm 1 on VSMs trained on French monolingual data from the WMT-2011 corpus. The ontological base for our algorithm is obtained from a semi-automatically created version of WordNet in French (Pradet et al., 2013). We compare the resulting VSMs against the “original” vectors on a French translation of the RG-65 word similarity dataset (Joubarne and Inkpen, 2011). GC vectors cannot be retrained and are thus not used in this evaluation. We do not include WSD or IMS results for this task since we were unable to find a reasonable WSD system in French to sense tag the training corpus. As with previous experiments involving similarity scoring, we use the avgSim function to assign similarity scores to word pairs, and system outputs are evaluated against the gold standard using Spearman’s rank correlation coefficient. Results of the experiment are presented in figure 3.3, and again show that our sense disambiguated VSMs outperform their single-sense counterparts on the evaluation task. We conclude

that the technique we propose not only generalizes to any VSM but to any language, as long as a reasonable sense inventory for that language exists.

3.4.3 Antonym Selection

Figure 3.4: Performance of sense specific vectors (wnsense) compared with single sense vectors (original) on closest-to-opposite selection of target verbs. Higher bars are better. wnsense outperforms original across 3 different VSMs on this task.

The general formulation of our model allows us to incorporate knowledge from more than just the similarity neighborhood (i.e. the synonyms, the hypernyms and the hyponyms) of word senses in an ontology. The power of the model lies in leveraging the relational structure of the ontology, and the flexibility to use different kinds of relations in different ways. For example, it may be useful to also know the antonyms of a word sense to tease its vector apart from the empirical surface form embedding more effectively.

Corpus-based methods struggle to learn good representations for antonyms, because antonym pairs often occur in very similar contexts. For example, the words “hot” and “cold” are equally appropriate to describe a number of things – food, the weather, or someone’s mood. Thus, based on the distributional hypothesis, their associated vectors tend to be very similar. However, antonym pairs should ideally have low similarity (i.e. large distance) between them. In our model, it is thus appropriate to set a negative relation weight for these terms in equation 3.2 to force the respective vectors further apart.

Mathematically, in certain cases this causes the objective to diverge, and thus produces unpredictable behavior. This happens, in particular, when some variables in the model form an “antonym clique”: that is, when the nodes associated with each of the concepts in this “clique” are all connected to one another (and only to one another) via an antonym relation. Thus we test our model on the subset of WordNet consisting only of verb lemmas, which contain very few antonym cliques.8

8For comparison, 45% of all adjective lemmas appear in such cliques, while only 0.1% of verbs do.

Method     CPU Time
Retro      ~20 secs
Retro+EM   ~4 hours
IMS        ~3 days
WSD        ~1 year

Table 3.3: Training time associated with different methods of generating sense-specific VSMs.

We run exactly the same procedure of Algorithm 1 to create sense-disambiguated vectors as before, except that we additionally expand word sense neighborhoods with antonyms, and set the antonym-specific relation weight β_r to −1.0 in equation 3.2. In the few cases that do evidence divergent behavior, we heuristically overcome the problem by ceasing to update the problematic vectors if their norm increases beyond some threshold (in our experiments we set this to be the norm of the longest vector before any update iterations).

We evaluate the resulting VSMs on the task of closest-to-opposite verb selection. This is very similar in format to the synonym selection task, except that the candidate with the most antonym-like meaning with respect to a target needs to be selected. The dataset we use in our experiments is from Mohammad et al. (2008). The data contains 226 closest-to-opposite verb selection questions such as the following: “inflate → overstating, understating, deflate, misrepresenting, exaggerating”, where “deflate” is the correct answer.

We use the minSim function – defined analogously to maxSim in equation 3.9, but with a minimum over sense pairs – to score similarity between words, since we are interested in discovering the least synonymous word sense pairs. The candidate with the lowest score is then selected as the closest-to-opposite of the target. Systems are evaluated on the basis of their accuracy at discriminating the correct candidate.

Results on this task are shown in figure 3.4. The baseline vectors show the typical behavior of models that are based on the distributional hypothesis: antonyms are treated much like synonyms, and thus the baselines perform poorly on this task. In contrast, our sense specific vectors are forced to be further apart if they are antonyms, and thus perform better on the evaluation task consistently across all three VSMs.
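The selection rule can be written down in a few lines; minSim mirrors maxSim from equation 3.9 with a minimum in place of the maximum, and sense_vecs again maps a word to its list of sense vectors (the names are illustrative).

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def min_sim(senses_a, senses_b):
    """The least similar pairing of senses between two words."""
    return min(cosine(va, vb) for va in senses_a for vb in senses_b)

def closest_to_opposite(target, candidates, sense_vecs):
    """Pick the candidate whose senses are least similar to the target's."""
    return min(candidates,
               key=lambda c: min_sim(sense_vecs[target], sense_vecs[c]))
```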

3.5 Discussion

While both our approaches are capable of integrating ontological information into VSMs, an important question is which one should be preferred. From an empirical point of view, the EM+RETRO framework yields better performance than RETRO across most of our semantic evaluations. Additionally, EM+RETRO is more powerful, allowing it to adapt more expressive models that can jointly learn other useful parameters – such as context vectors in the case of skip-gram. However, RETRO is far more generalizable, allowing it to be used with any VSM, not just predictive MLE models, and it is also empirically competitive. Another consideration is computational efficiency, which is summarized in table 3.3. Not only is RETRO much faster, but it scales linearly with respect to the vocabulary size, unlike EM+RETRO, WSD, and IMS, which are dependent on the size of the input training corpus. Nevertheless, both our techniques are empirically superior as well as computationally more efficient than both unsupervised and supervised word-sense disambiguation paradigms.

Word or Sense            Top 3 Most Similar
hanging                  hung, dangled, hangs
hanging (suspending)     shoring, support, suspension
hanging (decoration)     tapestry, braid, smock
climber                  climbers, skier, Loretan
climber (sportsman)      lifter, swinger, sharpshooter
climber (vine)           woodbine, brier, kiwi

Table 3.4: The top 3 most similar words for two polysemous types. Single sense VSMs capture the most frequent sense. Our techniques effectively separate out the different senses of words, and are grounded in WordNet.

Both our approaches are sensitive to the structure of the ontology. Therefore, an important consideration is the relations we use and the weights we associate with them. In our experiments we selected the simplest set of relations and assigned weights heuristically, showing that our methods can effectively integrate ontological information into VSMs. A more exhaustive selection procedure with weight tuning on held-out data would almost certainly lead to better performance on our evaluation suite.

3.5.1 Qualitative Analysis

We qualitatively attempt to address the question of whether the vectors are truly sense specific. In table 3.4 we present the three most similar words of an ambiguous lexical item in a standard VSM (SG-SINGLE), in comparison with the three most similar words of different lemma senses of the same lexical item in grounded sense VSMs (SG-RETRO & SG-EM+RETRO).

The sense-agnostic VSMs tend to capture only the most frequent sense of a lexical item. On the other hand, the disambiguated vectors capture the sense specificity of even less frequent senses successfully. This is probably due to the nature of WordNet, where the nearest neighbors of the words in question are in fact these rare words. A careful tuning of weights will likely optimize the trade-off between ontologically rare neighbors and distributionally common words.

In our analyses, we noticed that lemma senses that had many neighbors (i.e. synonyms, hypernyms and hyponyms) tended to have more clearly sense specific vectors. This is expected, since it is these neighborhoods that disambiguate and help to distinguish the vectors from their single sense embeddings.

3.6 Conclusion

We have presented two general and flexible approaches to producing sense-specific VSMs grounded in an ontology. The first technique is applicable to any VSM as an efficient post-processing step, which we call retrofitting. The second provides a framework to integrate ontological information by adapting existing MLE-based predictive models. We evaluated our models on a battery of

tasks and the results show that our proposed methods are effectively able to capture the different senses in an ontology. In most cases this results in significant improvements over baselines. We thus re-affirm the central claim of this thesis: namely, that integrating relational structure into learning semantic models results in representations that are more expressive and empirically superior to their relation-agnostic counterparts.

We also discussed the trade-offs between the two techniques from several different perspectives. In summary, while the adaptive MLE technique seems to be empirically superior, retrofitting is much more computationally friendly. Crucially, however, both models significantly surpass word-sense disambiguating a training corpus, from both computational and empirical perspectives. Finally, we also presented a qualitative analysis investigating the nature of the sense-specific vectors, and showed that they do indeed capture the semantics of different senses.

3.6.1 Five Point Thematic Summary

What was the semantic representation learning problem that we tried to solve?

We attempted to learn a model of similarity over word senses instead of word tokens. We were seeking to resolve the problem of polysemy. Based on Definition 1 of Chapter 1.2 our goal was to learn a semantic relation with a mapping from pairs of word senses to the set of real numbers. Our expectation was that a good model would reflect the property that word sense pairs that have similar meaning would have higher scores than word sense pairs that are unrelated.

What linguistic information or resource did we use to solve this problem?

We set up our solution to use ontologies as the external resource to help solve the problem. In particular, we envisaged that WordNet (Miller, 1995) would be particularly beneficial, since it not only contains an inventory of word senses, but also a structure that connects these senses to one another via several typed relations.

How did we operationalize the desired semantic relation with contextual relations in the resource?

We began with the intuition that the ontology provides information additional to the knowledge that can be obtained from a corpus. Furthermore, with the goal of learning semantic similarity between word senses, we intuited that ontological neighbors connected by certain kinds of relations (synonym, hypernym and hyponym edges) would be similar.

We operationalized these intuitions by defining contextual relations in terms of ontological neighborhoods. In particular, we decomposed the structure of WordNet (see Definition 3) into contextual relations – one for each type of relation we cared about (see Definition 2) – and we assigned weights to these relations based on our intuition about their importance in capturing similarity. The contexts we thus obtained were word senses and all their ontological neighbors, as well as their surface form representation obtained from corpus statistics.

How did we use the contextual relations in representation learning?

We defined a Markov Network over the graph structure of the ontology, which was optimized with a MAP objective (see Equation 3.1). In the objective, we tried to leverage the tension between the similarity of ontological neighbors and the similarity of a word sense representation with its sense agnostic representation obtained from corpus statistics.

What was the result of biasing the learner with contextual relations in learning the semantic model?

By biasing the learner with contextual relations defined over the structure of an ontology, we were able to learn representations that not only outperformed representations trained purely on data (see Section 3.4), but did so efficiently, and produced representations that were grounded in the ontology (see Section 3.5). We concluded that the contextual relations we defined were a successful operationalization of the semantic relation we set out to learn.

Chapter 4

Tables for Question Answering

Leveraging Structural Relations in Tables for Question Answering

Question answering (QA) has emerged as a practical research problem for pushing the boundaries of artificial intelligence (AI). Dedicated projects and open challenges to the research community include examples such as Facebook AI Research’s challenge problems for AI-complete QA (We- ston et al., 2015) and the Allen Institute for AI’s (AI2) Aristo project (Clark, 2015) along with its recently completed Kaggle competition1. The reason for this emergence is the diversity of core language and reasoning problems that a complex, integrated task like QA exposes: information extraction (Srihari and Li, 1999), semantic modelling (Shen and Lapata, 2007; Narayanan and Harabagiu, 2004), logic and reasoning (Moldovan et al., 2003), and inference (Lin and Pantel, 2001). Complex tasks such as QA require some form of knowledge base to store facts about the world and reason over them. Consider the example question: “In which country is Texas?”. A system truly capable of understanding and then answering this question must know – or infer – that Texas is the name of a place and that it is contained within a geo-political entity called the United States of America. But even the simplest system must be able to search for a relevant passage in a background corpus to answer this question. Thus by knowledge base, we mean any form of knowledge: structured (e.g., tables, ontolo- gies, rules) or unstructured (e.g., natural language text). For QA, knowledge has been harvested and used in a number of different modes and formalisms: large-scale extracted and curated knowledge bases (Fader et al., 2014), structured models such as Markov Logic Networks (Khot et al., 2015), and simple text corpora in information retrieval approaches (Tellex et al., 2003). When considered in the context of this thesis, an important question concerns the semantic relations that structure the source of background knowledge for a problem like QA. For example, the relations structuring a large-scale ontology such as Yago (Suchanek et al., 2007) or DBpedia (Lehmann et al., 2015) are very different from the relations in a triple store such as the PropStore (see Chapter 6.2). These in turn are very different from the set of re- lations in a collection of tables, which are for example, implicit membership relations such as “in same row” or “in same column”. 1https://www.kaggle.com/c/the-allen-ai-science-challenge

The schema of the relations in the underlying knowledge base has direct implications on the QA system that is built on top of it. What relations are usable? Which ones capture important properties relating questions to answers? There is a fundamental trade-off in the structure and regularity of the knowledge schema and its ability to be curated, modelled or reasoned with easily. The degree of structure can be thought of as falling on a spectrum. At one end of this spectrum is, for example, a simple text corpus. It contains no structure, and is therefore hard to reason with in a principled manner. Nevertheless, vast amounts of unstructured text data are easily and abundantly available. At the other end of the spectrum is, for example, a Markov Logic Network. This sort of knowledge formalism comes with a wealth of theoretical knowledge connected with its usage in principled inference. However, it is difficult to induce automatically from text or to build manually.

In this chapter we explore a knowledge formalism that falls somewhere in the middle of this spectrum of structured knowledge. Specifically we investigate tables as semi-structured knowledge for multiple-choice question (MCQ) answering. In particular, we focus on tables that represent general knowledge facts, with cells that contain free-form text (Section 4.2.1 details the nature and semantics of these tables). The structural properties of tables, along with their free-form text content, represent a balanced, semi-structured compromise in the trade-off between degree of structure and ubiquity.

Crucially, in the context of this thesis, the role of semantic relations in these tables is threefold. First, the way the tables are structured leads to a formalism that is easy enough to build by non-expert annotators, while being flexible enough to be used for the complex reasoning required in QA. Second, the relations governing tables define a greater connection with QA; specifically, they can be used as constraints to generate high quality QA data via crowd-sourcing for our target domain. Third, the same structure in tables and its connection to the QA data we generate can be leveraged to build QA solvers that are capable of performing reasoning over tables.

The two main contributions of this chapter are centered around the role that relations play in the structure of tables. First, we crowd-source a collection of over 9000 MCQs with alignment annotations to table elements, using the structural properties of tables as guidelines in efficient data harvesting. Second, we develop feature-driven models that use these MCQs to perform QA, while fact-checking and reasoning over tables. Additionally, in the latter contribution we explore two model paradigms: in the first, the structure and relations of tables are used to inspire manual feature engineering, while in the second those same relations are used in a neural network to automatically induce features for the task.

The use of tables in the context of QA is not new; others have explored this avenue of research as well. Question bank creation for tables has been investigated (Pasupat and Liang, 2015), but without the structural guidelines or the alignment information that we propose. Similarly, tables have been used in QA reasoning (Yin et al., 2015b; Neelakantan et al., 2015; Sun et al., 2016), but these efforts have not explicitly attempted to encode all the semantics of table structure (see Section 4.2.1).
To the best of our knowledge, no previous work uses tables for both creation and reasoning in a connected framework. We evaluate our manually feature-engineered model on MCQ answering for three benchmark datasets. Our results consistently and significantly outperform a strong retrieval baseline as well

as a Markov Logic network model (Khot et al., 2015). We thus show the benefits of semi-structured data and models over unstructured or highly-structured counterparts. We also validate our curated MCQ dataset and its annotations as an effective tool for training QA models. Finally, we find that our model learns generalizations that permit inference when exact answers may not even be contained in the knowledge base.

Our evaluations of the neural network model show that the architecture we define is able to capture the relational and structural properties of tables. It shows promising results when we evaluate on a held-out portion of our training set. The model is, however, much more sensitive to the content in training than the manually feature-engineered model. The representations over MCQ and table elements we learn do not port to the largest of our benchmark test sets, which has a very different profile to our crowd-sourced training set. Our analyses indicate the need for a larger and more diverse training set to accommodate the power of the neural network model.

But both sets of results re-affirm the central claims of this thesis: that structure and relations are key to producing meaning in language, and that leveraging the correct set of relations to yield feature representations (in this chapter, both manually and automatically) leads to richer and more capable models.

4.1 Related Work

Our work with tables, semi-structured knowledge bases and QA relates to several parallel lines of research. In terms of dataset creation via crowd-sourcing, Aydin et al. (2014) harvest MCQs via a gamified app, although their work does not involve tables. Pasupat and Liang (2015) use tables from Wikipedia to construct a set of QA pairs. However their annotation setup does not impose structural constraints from tables, and does not collect fine-grained alignment to table elements. The construction of knowledge bases in structured format relates more broadly to work on open information extraction (Etzioni et al., 2008; Wu and Weld, 2010; Fader et al., 2011) and knowledge base population (Ji and Grishman, 2011). While the set of tables we currently work with are hand-built, there are efforts to at least partially automate this process (Dalvi et al., 2016). On the inference side Pasupat and Liang (2015) also reason over tables to answer questions. Unlike our approach, they do not require alignments to table cells. However, they assume knowl- edge of the table that contains the answer, a priori – which we do not. Yin et al. (2015b) and Neelakantan et al. (2015) also use tables in the context of QA, but deal with synthetically gen- erated query data. Sun et al. (2016) perform cell search over web tables via relational chains, but are more generally interested in web queries. Clark et al. (2016) combine different levels of knowledge for QA, including an integer-linear program for searching over table cells. None of these other efforts leverage tables for generation of data. Our research more generally pertains to natural language interfaces for databases. Cafarella et al. (2008) parse the web to create a large-scale collection of HTML tables. They compute statistics about co-occurring attributes, inferring schemas and reasoning about likely table re- lational structure. Venetis et al. (2011) are more specifically concerned with recovering table schema, and their insight into column semantics is related to our own structural decomposition and analysis of tables. Other research connected with relational schema induction for database

49 Phase Change Initial State Final State Form of Energy Transfer Melting causes a solid to change into a liquid by adding heat Vaporization causes a liquid to change into a gas by adding heat Condensation causes a gas to change into a liquid by removing heat Sublimation causes a solid to change into a gas by adding heat

Table 4.1: Example part of table concerning phase state changes. tables is by Syed et al. (2010) and Limaye et al. (2010). A closely related line of study is connected with executing queries over tables represented as relational databases. This applies more directly to the problem of QA with tables. Cafarella et al. (2008), besides building tables and inferring schemas, also perform queries against these tables. But unlike the natural language questions normally associated with QA, these queries are more IR style lists of keywords. The work of Pimplikar and Sarawagi (2012) is another effort that relies on column semantics to reason better about search requests. Yin et al. (2015a) consider databases where information is stored in n-tuples, which are essentially tables.

4.2 Tables as Semi-structured Knowledge Representation

Tables can be found on the web containing a wide range of heterogenous data. To focus and facilitate our work on QA we select a collection of tables that were specifically designed for the task. Specifically we use AI2’s Aristo Tablestore2. However, it should be noted that the contributions of this work are not tied to specific tables, as we provide a general methodology that could equally be applied to a different set of tables. In this section we present the set of tables of curated natural language facts. In the context of this thesis, we discuss their semi-structured form and the semantic relations that make up their structural components. We also discuss the utility of these relations in knowledge representation for the domain. Finally we give details of the data we deal with.

4.2.1 Table Semantics and Relations Part of a table from the Aristo Tablestore is given as an example in Table 4.1. The format is semi-structured: the rows of the table (with the exception of the header) are a list of sentences, but with well-defined recurring filler patterns. Together with the header, these patterns divide the rows into meaningful columns. This semi-structured data format is flexible. Since facts are presented as sentences, the tables can act as a text corpus for information retrieval. At the same time the structure can be used – as we do – to focus on specific nuggets of information. The flexibility of these tables allows us to compare our table-based system to an information retrieval baseline. Such tables have some interesting structural semantics, which we will leverage throughout the paper. A row in a table corresponds to a fact3. For example, the sentence “Melting causes a

2http://allenai.org/content/data/AristoTablestore-Nov2015Snapshot.zip 3Also predicates, or more generally frames with typed arguments.

50 solid to change into a liquid by adding heat” is ostensibly a fact. The cells in a row correspond to concepts, entities, or processes that participate in this fact. In the example sentence above, “melting” “solid”, “liquid” etc. are all related processes or concepts that participate in the corresponding fact. They are bound to one another by the prototype relation defined by types and effects of energy addition. A content column – which we differentiate from filler columns that only contain a recurring pattern, and no information in their header cells – corresponds to a group of concepts, entities, or processes that are the same type. For example, the column containing the processes “melt- ing”, “vaporization”, “condensation” etc. is a content column, while the column containing the identical pattern “to change into” in every row is a filler column. The header cell of the column is an abstract description of the type. We may view the head as a hypernym and the cells in the column below as co-hyponyms of the head. The header row defines a generalization of which the rows in the table are specific instances. For example, “phase change” is an abstract descriptor, or hypernym of the processes “melting”, “vaporization”, “condensation” etc. These tables also encode implicit analogical relations. Let row1 and row2 be any two rows in a table. Let col1 and col2 be any two content columns in a table. The four cells where the rows and columns intersect form an analogy: cell1,1 is to cell1,2 as cell2,1 is to cell2,2. That is, the relation between cell1,1 and cell1,2 is highly similar to the relation between cell2,1 and cell2,2. As an example, one might be able to derive the following analogy from the intersection of rows and columns in the tables: “melting is to solid” as “freezing is to liquid”. The table is thus essentially a compact representation of a large number of scientific analogies. The semi-structured nature of this data is flexible. Since facts are presented as sentences, the tables can simply act as a text corpus for information retrieval. Depending on the end application, more complex models can rely on the inherent structure in tables. They can be used as is for information extraction tasks that seek to zero in on specific nuggets of information, or they can be converted to logical expressions using rules defined over neighboring cells. Regarding construction, the recurring filler patterns can be used as templates to extend the tables semi- automatically by searching over large corpora for similar facts (Ravichandran and Hovy, 2002; Dalvi et al., 2016). Finally, this structure is directly relevant to multiple-choice QA. Facts (rows) form the basis for creating or answering questions, while instances of a type (columns) act as the choices of an MCQ. We use these observations both for crowd-sourcing MCQ creation as well as for designing models to answer MCQs with tables.

4.2.2 Table Data The Aristo Tablestore consists of 65 hand-crafted tables organized by topic. Some of the topics are bounded, containing only a fixed number of facts, such as the possible phase changes of matter (see Table 4.1). Other topics are unbounded, containing a very large or even infinite number of facts, such as the kind of energy used in performing an action (the corresponding tables can only contain a sample subset of these facts). A total of 3851 facts (one fact per row) are present in the manually constructed tables. An individual table has between 2 and 5 content columns.

51 The target domain for these tables is two 4th grade science exam datasets. The majority of the tables were constructed to contain topics and facts from the publicly available Regents dataset4. The rest were targeted at an unreleased dataset called Monarch. In both cases only the training partition of each dataset was used to formulate and hand-craft tables. However, for unbounded topics, additional facts were added to each table, using science education text books and web searches.

4.3 Answering Multiple-choice Questions using Tables

In this section we detail how tables can be used to answer MCQs. Importantly, the method we describe relies of fine grained alignments between MCQs and the relational components of tables. Consider the table give in Table 4.1, and the MCQ “What is the process by which water is changed from a liquid to a gas?” with choices “melting, sublimation, vaporization, condensa- tion”. The problem of finding the correct answer amounts to finding a cell (or cells) in the table that is most relevant to a candidate Question-Answer pair. In other words, a relevant cell should confirm the assertion made by a particular Question-Answer pair. It turns out that relevant cells can be defined (as shown in the example below) as a function of row and column intersections. Fortunately, there are clues in the table and the MCQ that help in locating such intersections and relevant cells. If one can parse the question correctly one might learn that the question is seeking a “pro- cess”. Parallely one might learn that the choices “melting, sublimation, vaporization, condensa- tion” are all phase changes, and therefore processes. Then, by the hypernym-hyponym relation- ship between a header-cell and its corresponding column, one might be able to say – with some degree of certainty – that the relevant cell is contained in the column headed by the cell “Phase Change”. Solving the MCQ also requires locating a correct row. This can be done by matching the words “liquid” and “gas” to entries of a single row, and hypothetically using the “from” and “to” indicators to further distinguish between initial and final states of a phase change (thus potentially avoiding ambiguity between evaporation and condensation). If we follow this line of reasoning, we find that the intersection of the correct row and column is the cell “vaporization”, which correctly answers the MCQ. Figure 4.1 illustrates this with color for row and color alignments to the example MCQ pre- sented above. The intersection of the blue row and the yellow column is the red cell, which contains the correct answer. Of course, the ability to find these fine grained alignments is hypothetical. Lexical variation, word ordering, syntactic diversity all contribute to making the discovery of these alignments a non-trivial problem. We require training data in order to train a model to align table elements to parts of MCQs properly. And in the next section we will see exactly how these alignments can be obtained cheaply by non-expert annotators through a Mechanical Turk task that leverages the structural and relational properties of tables.

4http://allenai.org/content/data/Regents.zip

52 Figure 4.1: Example table from MTurk annotation task illustrating constraints. We ask Turkers to construct questions from blue cells, such that the red cell is the correct answer, and distracters must be selected from yellow cells.

4.4 Crowd-sourcing Multiple-choice Questions and Table Align- ments

Since this thesis is focussed on feature representation learning, we are interested in developing models that leverage features that rely on the structural and relational properties of tables, MCQs and the alignments between them. In this section we outline the MCQs generated from tables. We first describe the annotation task and argue for constraints imposed by table structure and relations. We then describe the data generated from this annotation task.

4.4.1 MCQ Annotation Task We use Amazon’s Mechanical Turk (MTurk) service to generate MCQs by imposing constraints derived from the structure of the tables. These constraints help annotators create questions with scaffolding information, and lead to consistent quality in the generated output. An additional benefit of this format is the alignment information, linking cells in the tables to the MCQs gen- erated by the Turkers. The alignment information is generated as a by-product of making the MCQs and is crucial in developing our models that answer MCQs (see Section 4.3). We present Turkers with a table such as the one in Figure 4.1. Given this table, we choose a target cell to be the correct answer for a new MCQ; for example, the red cell in Figure 4.1. First, Turkers create a question by using information from the rest of the row containing the target (i.e., the blue cells in Figure 4.1), such that the target is its correct answer. Then they select the cells in the row that they used to construct the question. Following this, they construct four succinct choices for the question, one of which is the correct answer and the other three are distractors. Distractors are formed from other cells in the column containing the target (i.e. yellow cells in Figure 4.1). If there are insufficient unique cells in the column Turkers create their own. Annotators can rephrase and shuffle the contents of cells as required to create the question, answer or any of the distractors. In addition to an MCQ, we obtain alignment information with no extra effort from annotators.

Task | Avg. Time (s) | $/hour | % Reject
Rewrite | 345 | 2.61 | 48
Paraphrase | 662 | 1.36 | 49
Add choice | 291 | 2.47 | 24
Write new | 187 | 5.78 | 38
TabMCQ | 72 | 5.00 | 2

Table 4.2: Comparison of different ways of generating MCQs with MTurk.

What is the orbital event with the longest day and the shortest night?
A) Summer solstice  B) Winter solstice  C) Spring equinox  D) Fall equinox

Steel is a/an ___ of electricity.
A) Separator  B) Isolator  C) Insulator  D) Conductor

Table 4.3: Examples of MCQs generated by MTurk. Correct answer choices are in bold.

We know which table, row, and column contain the answer, and thus we know which header cells might be relevant to the question. We also know the cells of a row that were used to construct a question. This is precisely the alignment information that we need in order to train a model to replicate these alignments, and thus answer MCQs by finding intersections between maximally aligned rows and columns.

It is worth noting, in the context of this thesis, that both the constraints of the annotation task and the alignment information we obtain as a by-product are centered around the relational properties of tables. The row and column structure plays an important role in framing the annotation task for Turkers, while the alignment information links into these same structural components.

4.4.1.1 The TabMCQ Dataset

We created a HIT (the MTurk acronym for Human Intelligence Task) for every content cell (i.e. non-filler cell; see Section 4.2) from each one of the 65 manually constructed tables of the Aristo Tablestore. We paid annotators 10 cents per MCQ, and asked for 1 annotation per HIT for most tables. For an initial set of four tables, which we used in a pilot study, we asked for three annotations per HIT5. We required Turkers to have a HIT approval rating of 95% or higher, with a minimum of at least 500 HITs approved. We restricted the demographics of our workers to the US.

Table 4.2 compares our method with other studies conducted at AI2 to generate MCQs.

5The goal was to obtain diversity in the MCQs created for a target cell. The results were not sufficiently conclusive to warrant a threefold increase in the cost of creation.

These methods attempt to generate new MCQs from existing ones, or write them from scratch, but do not involve tables in any way. Our annotation procedure leads to faster data creation, with consistent output quality that resulted in the lowest percentage of rejected HITs. Manual inspection of the generated output also revealed that questions are of consistently good quality. They are good enough for training machine learning models and many are good enough as evaluation data for QA. A sample of generated MCQs is presented in Table 4.3.

We implemented some simple checks to evaluate the data before approving HITs. These included checking whether an MCQ has at least three choices and whether choices are repeated. We had to further prune our data to discard some MCQs due to corrupted data or badly constructed MCQs. A total of 159 MCQs were lost through this cleanup. In the end our complete data consists of 9092 MCQs, which is – to the best of our knowledge – orders of magnitude larger than any existing collection of science exam style MCQs available for research. These MCQs also come with alignment information to tables, rows, columns and cells. The dataset, bundled together with the Aristo Tablestore, can be freely downloaded at http://ai2-website.s3.amazonaws.com/data/TabMCQ_v_1.0.zip.

4.5 Feature Rich Table Embedding Solver

In this section we detail the first of our two systems that leverage MCQs, tables and the alignment information between them. Specifically, we develop a feature-rich solver of multiple-choice questions, where the features are manually engineered. Crucially, these features draw their inspiration from the relational information in tables.

We begin by outlining the model and the objective against which it is trained. We then describe our feature set, and show evaluation results in a number of different settings on three different benchmark datasets. Our results consistently and significantly outperform two baselines: a strong retrieval model and a highly structured Markov Logic Network model (Khot et al., 2015). We demonstrate the advantages of semi-structured data and models over unstructured or highly-structured counterparts. We also find that our manually feature-engineered model learns generalizations that permit inference even when exact answers are not contained in the knowledge base – thus effectively learning a model that abstracts over the content and structure of tables.

4.5.1 Model and Training Objective

Recall from Section 4.3 that a table may be used to answer a question by finding and focussing on a small set of relevant cells. Moreover, these cells are found by creating an intersection between rows and columns of interest, through a combination of leveraging the structure of a table as well as its contents. By applying the reasoning used to create MCQs (see Section 4.4) in the inverse direction, answering a question thus becomes the task of finding this intersection.

Consider the MCQ “What is the process by which water is changed from a liquid to a gas?” with choices “melting, sublimation, vaporization, condensation”, and the table given in Figure 4.1.

Assuming we have some way of aligning a question to a row (blue cells) and choices to a column (yellow cells), then the relevant cell is at the intersection of the two (the red cell). This alignment is precisely what we get as a by-product of the annotation task we set up in Section 4.4.1 to harvest MCQs.

We can thus featurize connections between MCQs and elements of tables and use the alignment data to train a model over the features. This is exactly the intuition that our Feature Rich Table Embedding Solver (FRETS) seeks to model. We begin by defining symbols for our data:

• $\mathcal{Q} = \{q_1, ..., q_N\}$ denotes a set of MCQs. Note that each question $q_n$ is typically a sequence of tokens.
• $\mathcal{A}_n = \{a_n^1, ..., a_n^K\}$ denotes the set of candidate answer choices for a given question $q_n$. Again, note that while an answer choice $a_n^k$ may sometimes be a single token, more generally it will consist of a sequence of tokens.
• $\mathcal{T} = \{T_1, ..., T_M\}$ is a set of tables with text entries. Given a table $T_m$, let $t_m^{ij}$ be the cell in that table corresponding to the $i$th row and $j$th column. Note that each cell $t_m^{ij}$ is a sequence of tokens.

We then define a log-linear model that scores every cell $t_m^{ij}$ of every table in our collection, according to a set of discrete weighted features, for a given QA pair. We have the following:

$$\log p(t_m^{ij} \mid q_n, a_n^k; \mathcal{A}_n, \mathcal{T}) = \sum_d \lambda_d f_d(q_n, a_n^k, t_m^{ij}; \mathcal{A}_n, \mathcal{T}) - \log Z \quad (4.1)$$

Here $\lambda_d$ are weights and $f_d(q_n, a_n^k, t_m^{ij}; \mathcal{A}_n, \mathcal{T})$ are features. These features should ideally leverage both the structure and content of tables to assign high scores to relevant cells, while assigning low scores to irrelevant cells. $Z$ is the partition function, defined as follows:

$$Z = \sum_{m,i,j} \exp\left( \sum_d \lambda_d f_d(q_n, a_n^k, t_m^{ij}; \mathcal{A}_n, \mathcal{T}) \right) \quad (4.2)$$

$Z$ normalizes the scores associated with every cell over all the cells in all the tables to yield a probability distribution. During inference the partition term $\log Z$ can be ignored, making scoring the cells of every table for a given QA pair efficient. However, $Z$ must be taken into account when training.

These scores can be used to find a solution for an MCQ. Every QA pair produces a hypothetical fact, and as noted in Section 4.2.1, the row of a table is in essence a fact. Relevant cells (if they exist) should confirm the hypothetical fact asserted by a given QA pair. During inference, we assign the score of the highest scoring row (or the most likely fact) to a hypothetical QA pair. Then the correct solution to the MCQ is simply the answer choice associated with the QA pair that was assigned the highest score. Mathematically, this is expressed as follows:

$$a_n^* = \arg\max_{a_n^k} \; \max_{m,i} \sum_j \sum_d \lambda_d f_d(q_n, a_n^k, t_m^{ij}; \mathcal{A}_n, \mathcal{T}) \quad (4.3)$$
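To make this scoring and inference procedure concrete, the sketch below implements Equations 4.1 and 4.3 at inference time, where the partition term can be dropped. The feature functions, weights and the list-of-lists table format are hypothetical stand-ins rather than the actual FRETS implementation.

```python
# A minimal sketch of FRETS-style inference (Eqs. 4.1 and 4.3), assuming
# hypothetical feature functions and a simple list-of-lists table format.
# The partition term log Z is dropped, since it is constant for a given QA pair.

def score_cell(question, answer, cell, row, table, weights, feature_functions):
    """Weighted sum of feature values for one (QA pair, cell) combination."""
    return sum(w * f(question, answer, cell, row, table)
               for w, f in zip(weights, feature_functions))

def answer_mcq(question, choices, tables, weights, feature_functions):
    """Pick the choice whose best-scoring row (summed over its cells) is highest."""
    best_choice, best_score = None, float("-inf")
    for choice in choices:
        # Max over all tables and rows of the total cell score in that row (Eq. 4.3).
        choice_score = max(
            sum(score_cell(question, choice, cell, row, table,
                           weights, feature_functions)
                for cell in row)
            for table in tables   # a table is a list of rows
            for row in table      # a row is a list of cell strings
        )
        if choice_score > best_score:
            best_choice, best_score = choice, choice_score
    return best_choice
```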

4.5.1.1 Training

Since FRETS is a log-linear model, training involves optimizing the set of weights $\lambda_d$. As training data, we use the alignment information between MCQs and table elements (see Section 4.2.2). The target that we try to match with our model is an alignment score that is as close as possible to the true alignments in the training data. True alignments to table cells for a given QA pair are essentially indicator values, but we convert them to numerical scores as follows6.

If the answer in a QA pair is the right answer, we assign:
• a score of 1.0 to cells whose row and column values are both relevant (i.e. the cell contains exactly the information needed to answer the question);
• a score of 0.5 to cells whose row but not column values are relevant (i.e. the cell is related to the fact asserted by the QA pair, but is not exactly the relevant information);
• a score of 0.0 otherwise (i.e. the cell is not related to the QA pair).

If the answer in a QA pair is not the right answer, we assign:
• a score of 0.1 to random cells from irrelevant tables, with a probability of 1%;
• a score of 0.0 otherwise.

The intuition behind this scoring scheme is to guide the model to pick relevant cells for correct answers, while encouraging it to pick faulty evidence with low scores for incorrect answers. Given these scores assigned to all cells of all tables for all QA pairs in the training set, suitably normalized to a probability distribution over table cells for a given QA pair, we can then proceed to train our model. We use the cross-entropy objective, which minimizes the following loss:

$$L(\vec{\lambda}) = -\sum_{q_n} \sum_{a_n^k \in \mathcal{A}_n} \sum_{m,i,j} p^*(t_m^{ij} \mid q_n, a_n^k; \mathcal{T}) \cdot \log p(t_m^{ij} \mid q_n, a_n^k; \mathcal{A}_n, \mathcal{T}) \quad (4.4)$$

Here $p^*(t_m^{ij} \mid q_n, a_n^k; \mathcal{T})$ is the normalized probability of the true alignment scores. While this is an indirect way to train our model to pick the best answer, in our pilot experiments it worked better than direct maximum likelihood or ranking with a hinge loss, achieving a training accuracy of almost 85%. Our experimental results on the test suite, presented in the next section, also support the empirical effectiveness of this approach.
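As an illustration of this training signal, the sketch below converts the heuristic alignment scores described above into a target distribution and computes the cross-entropy of Equation 4.4 for a single QA pair. The array layout is an assumption made for the example: raw_scores holds the heuristic score of every cell across all tables, and model_logits holds the corresponding weighted feature sums of Equation 4.1.

```python
import numpy as np

def target_distribution(raw_scores):
    """Normalize heuristic alignment scores (1.0 / 0.5 / 0.1 / 0.0) into p*."""
    total = raw_scores.sum()
    if total == 0:
        return np.full_like(raw_scores, 1.0 / len(raw_scores))
    return raw_scores / total

def cross_entropy_loss(model_logits, raw_scores):
    """Cross-entropy between p* and the model's distribution over all cells (Eq. 4.4)."""
    p_true = target_distribution(raw_scores)
    log_z = np.log(np.exp(model_logits).sum())   # partition function of Eq. 4.2
    log_p_model = model_logits - log_z           # log p(cell | q, a) of Eq. 4.1
    return -np.sum(p_true * log_p_model)
```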

4.5.2 Features

The features we implement are summarized in Table 4.4. In the context of this thesis, it should be noted that almost all of these features leverage some structural or relational property of the tables as described in Section 4.2. The features in our model compute statistics between question-answer pairs and different structural components of tables. Even though the features are manually engineered, thinking about semantic relations as a basis for knowledge representation can facilitate the process of feature engineering. It can also lead to a model that replicates this process by automating feature generation (see Section 4.6).

6On training data, we experimented with a few different scoring heuristics and found that these ones worked well.

Level | Feature | Description | Intuition
Table | Table score ♦ | Ratio of words in t to q+a | Topical consistency
Table | TF-IDF table score † ♦ | Same, but with TF-IDF weights | Topical consistency
Row | Row-question score ♦ | Ratio of words in r to q | Question align
Row | Row-question w/o focus score ♦ | Ratio of words in r to q-(af+qf) | Question align
Row | Header-question score ♦ | Ratio of words in h to q | Prototype align
Column | Column overlap ♦ | Ratio of elements in c and A | Choices align
Column | Header answer-type match ♦ | Ratio of words in ch to af | Choices hypernym align
Column | Header question-type match ♦ | Ratio of words in ch to qf | Question hypernym align
Cell | Cell salience † ♦ | Salience of s to q+a | QA hypothesis assert
Cell | Cell answer-type entailment † | Entailment score between s and af | Hypernym-hyponym align
Cell | Cell answer-type similarity | Avg. vector sim. between s and af | Hypernym-hyponym sim.

Table 4.4: Summary of features. For a question (q) and answer (a) we compute scores for elements of tables: whole tables (t), rows (r), header rows (h), columns (c), column headers (ch) and cells (s). Answer-focus (af) and question-focus (qf) terms are added where appropriate. Features marked with ♦ have soft-matching variants, while those marked with † are described in further detail in Section 4.5.2. Finally, the features that received high weights during training with all features were subsequently selected to form a compact FRETS model.

While the features in Table 4.4 are weighted and summed for each cell individually, they can capture more global properties, such as scores associated with the tables, rows or columns in which a specific cell is contained. Features are divided into four broad categories based on the level of granularity at which they operate. In what follows we give some details of Table 4.4 that require further elaboration.

4.5.2.1 Soft matching

Many of the features that we implement are based on string overlap between bags of words. However, since the tables are defined statically in terms of a fixed vocabulary (which may not necessarily match the words contained in an MCQ), these overlap features will often fail. We therefore soften the constraint imposed by hard word overlap with a more forgiving soft variant. More specifically, we introduce a word-embedding based soft matching overlap variant for every feature in the table marked with ♦. The soft variant targets high recall while the hard variant aims at providing high precision. We thus effectively have almost twice the number of features listed.

Mathematically, let a hard overlap feature define a score $|S_1 \cap S_2| / |S_1|$ between two bags of words $S_1$ and $S_2$; we choose the denominator $|S_1|$ here without loss of generality. Then, a corresponding word-embedding soft overlap feature is given by the following formula:

$$\frac{1}{|S_1|} \sum_{w_i \in S_1} \max_{w_j \in S_2} \text{sim}(\vec{w}_i, \vec{w}_j) \quad (4.5)$$

Intuitively, rather than matching a word to its exact string match in another set, we instead match it to its most similar word, discounted by the score of that similarity.
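A direct implementation of this soft overlap score might look as follows. The embeddings dictionary and the fallback to exact string matching for out-of-vocabulary words are assumptions made for this sketch, not details prescribed by the model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def soft_overlap(s1, s2, embeddings):
    """Eq. 4.5: average over words in s1 of their best similarity to any word in s2."""
    if not s1 or not s2:
        return 0.0
    total = 0.0
    for w1 in s1:
        sims = []
        for w2 in s2:
            if w1 in embeddings and w2 in embeddings:
                sims.append(cosine(embeddings[w1], embeddings[w2]))
            else:
                # Fall back to hard matching when a word has no embedding.
                sims.append(1.0 if w1 == w2 else 0.0)
        total += max(sims)
    return total / len(s1)
```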

4.5.2.2 Question parsing

We parse questions to find the desired answer-type and, in rarer cases, question-type words. For example, in the question “What form of energy is required to convert water from a liquid to a gas?”, the type of the answer we are expecting is a “form of energy”. Generally, this answer-type corresponds to a hypernym of the answer choices, and can help find relevant information in the table, specifically related to column headers.

By carefully studying the kinds of question patterns in our data, we implemented a rule-based parser that finds answer-types from queries. This parser uses a set of hand-coded regular expressions over phrasal chunks. The parser is designed to have high accuracy, so that we only produce an output for answer-types in high-confidence situations. In addition to producing answer-types, in some rarer cases we also detect hypernyms for parts of the questions. We call this set of words question-type words. Together, the question-type and answer-type words are denoted as focus words in the question.
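The sketch below gives a flavor of such rule-based extraction. The regular expressions here operate over raw question strings and are purely illustrative; the actual parser uses hand-coded patterns over phrasal chunks.

```python
import re

# Illustrative answer-type patterns; these are stand-ins, not the real rule set.
ANSWER_TYPE_PATTERNS = [
    r"^what (?:is the|is a|are the) (?P<atype>[\w\s]+?) (?:by which|that|of)\b",
    r"^what (?P<atype>[\w\s]+?) is required\b",
    r"^which (?P<atype>[\w\s]+?) (?:is|are|do|does)\b",
]

def extract_answer_type(question):
    """Return an answer-type phrase, or None when no high-confidence pattern fires."""
    q = question.lower().strip()
    for pattern in ANSWER_TYPE_PATTERNS:
        match = re.search(pattern, q)
        if match:
            return match.group("atype").strip()
    return None

# e.g. extract_answer_type("What is the process by which water is changed from a "
#                          "liquid to a gas?") returns "process"
```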

4.5.2.3 TF-IDF weighting

TF-IDF scores for weighting terms are pre-computed for all words in all the tables. We do this by treating every table as a unique document. At run-time we discount scores by table length as well as the length of the QA pair under consideration, to avoid disproportionately assigning high scores to large tables or long MCQs.
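A sketch of this pre-computation, treating every table as one document, is shown below. The dict-of-token-lists input format is assumed for illustration, and the run-time length discounting described above is omitted.

```python
import math
from collections import Counter

def tfidf_per_table(tables):
    """Compute TF-IDF scores per word, treating each table as a single document.

    `tables` is assumed to map a table id to a flat list of its tokens.
    """
    doc_freq = Counter()
    for tokens in tables.values():
        doc_freq.update(set(tokens))
    n_tables = len(tables)

    scores = {}
    for table_id, tokens in tables.items():
        term_freq = Counter(tokens)
        scores[table_id] = {
            word: (count / len(tokens)) * math.log(n_tables / doc_freq[word])
            for word, count in term_freq.items()
        }
    return scores
```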

4.5.2.4 Salience

The salience of a string for a particular QA pair is an estimate of how relevant it is to the hypothesis formed from that QA pair. It is computed by taking words in the question, pairing them with words in an answer choice and then computing PMI statistics between these pairs from a large corpus. A high salience score indicates words that are particularly relevant for a given QA pair hypothesis.
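One way to realize such a score is sketched below. The corpus statistics (pair_count, word_count, total_pairs) are assumed to be pre-computed from a large corpus, and restricting the PMI average to pairs involving cell words is one plausible reading of how the cell string enters the computation; the exact combination is not spelled out above.

```python
import math

def pmi(w1, w2, pair_count, word_count, total_pairs):
    """Pointwise mutual information of a word pair under pre-computed corpus counts."""
    joint = pair_count.get((w1, w2), 0)
    if joint == 0 or word_count.get(w1, 0) == 0 or word_count.get(w2, 0) == 0:
        return 0.0
    p_joint = joint / total_pairs
    return math.log(p_joint / ((word_count[w1] / total_pairs) * (word_count[w2] / total_pairs)))

def salience(cell_words, question_words, answer_words,
             pair_count, word_count, total_pairs):
    """Average PMI of question-answer word pairs, restricted to pairs touching the cell."""
    pairs = [(q, a) for q in question_words for a in answer_words
             if q in cell_words or a in cell_words]
    if not pairs:
        return 0.0
    return sum(pmi(q, a, pair_count, word_count, total_pairs)
               for q, a in pairs) / len(pairs)
```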

4.5.2.5 Entailment

To calculate the entailment score between two strings, we use several features, such as overlap, paraphrase probability, lexical entailment likelihood, and ontological relatedness, computed with n-grams of varying lengths.

4.5.2.6 Normalization

All the features in Table 4.4 produce numerical scores, but the ranges of these scores vary to some extent. To make our final model more robust, we normalize all feature scores to lie between 0.0 and 1.0. We do this by finding the maximum and minimum values for any given feature on a training set. Subsequently, instead of using the raw value of a feature $f_d$, we replace it with $(f_d - \min f_d) / (\max f_d - \min f_d)$.

4.5.3 Experimental Results

We train FRETS (Section 4.5.1.1) on the TabMCQ dataset (Section 4.4.1.1) using adaptive gradient descent with an L2 penalty of 1.0 and a mini-batch size of 500 training instances. We train two variants: one consisting of all the features from Table 4.4, and the other – a compact model – consisting of the features from the first model that received the highest weights (above a threshold).

We run experiments on three 4th grade science exam MCQ datasets: the publicly available Regents dataset, the larger but unreleased dataset called Monarch, and a third, even larger public dataset of Elementary School Science Questions (ESSQ)7. For the first two datasets we use the test splits only, since the training sets were directly studied to construct the Aristo Tablestore, which was in turn used to generate our TabMCQ training data. On ESSQ we use all the questions since they are independent of the tables. The Regents test set consists of 129 MCQs, the Monarch test set of 250 MCQs, and ESSQ of 855 MCQs.

7http://aristo-public-data.s3.amazonaws.com/AI2-Elementary-NDMC-Feb2016.zip

Model | Data | Regents Test | Monarch Test | ESSQ
Lucene | Regents Tables | 37.5 | 32.6 | 36.9
Lucene | Monarch Tables | 28.4 | 27.3 | 27.7
Lucene | Regents+Monarch Tables | 34.8 | 35.3 | 37.3
Lucene | Waterloo Corpus | 55.4 | 51.8 | 54.4
MLN (Khot et al., 2015) | - | 47.5 | - | -
FRETS (Compact) | Regents Tables | 60.7 | 47.2 | 51.0
FRETS (Compact) | Monarch Tables | 56.0 | 45.6 | 48.4
FRETS (Compact) | Regents+Monarch Tables | 59.9 | 47.6 | 50.7
FRETS | Regents Tables | 59.1 | 52.8 | 54.4
FRETS | Monarch Tables | 52.9 | 49.8 | 49.5
FRETS | Regents+Monarch Tables | 59.1 | 52.4 | 54.9

Table 4.5: Evaluation results on three benchmark datasets using different sets of tables as knowledge bases. Best results on a dataset are highlighted in bold.

Since we are investigating semi-structured models, we compare against two baselines. The first is an unstructured information retrieval method, which uses the Lucene search engine. To apply Lucene to the tables, we ignore their structure and simply use rows as plain-text sentences. The scores for the top retrieved hits are used to rank the different choices of MCQs. The second baseline is the highly-structured Markov Logic Network (MLN) model from Khot et al. (2015) as reported in Clark et al. (2016), who use the model as a baseline8. Note that Clark et al. (2016) achieve a score of 71.3 on Regents Test, which is higher than FRETS’ scores (see Table 4.5), but their results are not comparable to ours because they use an ensemble of algorithms. In contrast, we use a single algorithm with a much smaller collection of knowledge. FRETS rivals the best individual algorithm from their work.

We primarily use the tables from the Aristo Tablestore as knowledge base data in three different settings: with only tables constructed for Regents (40 tables), with only supplementary tables constructed for Monarch (25 tables), and with all tables together (all 65 tables; see Section 4.2.2). For the Lucene baseline we also experiment with several orders of magnitude more data by indexing the $5 \times 10^{10}$ word Waterloo corpus compiled by Charles Clarke at the University of Waterloo. Data is not a variable for MLN, since we directly cite results from Clark et al. (2016).

The word vectors we used in the soft matching feature variants (i.e., ♦ features from Table 4.4) for all our experiments were trained on 300 million words of Newswire English from the monolingual section of the WMT-2011 shared task data. These vectors were improved post-training by retrofitting (Faruqui et al., 2014) them to PPDB (Ganitkevitch et al., 2013).

The results of these experiments are presented in Table 4.5. All numbers are reported as percentage accuracy. We perform statistical significance testing on these results using Fisher’s exact test with a p-value of 0.05 and report them in our discussions.

First, FRETS – in both full and compact form – consistently outperforms the baselines, often by large margins. For Lucene, the improvements over all but the Waterloo corpus baseline are statistically significant.

8We do not re-implement the MLN, and therefore only cite results from previous work on part of our test suite.

Thus FRETS is able to capitalize on data more effectively and rival an unstructured model with access to orders of magnitude more data. For MLN, the improvements are statistically significant in the case of the Regents and Regents+Monarch tables. FRETS is thus performing better than a highly structured model while making use of a much simpler data formalism.

Our models are able to effectively generalize. With Monarch tables, the Lucene baseline is little better than random (25%). But with the same knowledge base data, FRETS is competitive and sometimes scores higher than the best Lucene or MLN models (although this difference is statistically insignificant). These results indicate that our models are able to effectively capture both content and structure, reasoning approximately (and effectively) when the knowledge base may not even contain the relevant information to answer a question. The Monarch tables themselves seem to add little value, since results for Regents tables by themselves are just as good or better than Regents+Monarch tables. This is not a problem with FRETS, since the same phenomenon is witnessed with the Lucene baseline. It is noteworthy, however, that our models do not suffer from the addition of more tables, showing that our search procedure over table cells is robust.

Finally, dropping some features in the compact model does not always hurt performance in comparison with the full model. This indicates that potentially higher scores are possible with a principled and detailed feature selection process. In these experiments the difference between the two FRETS models on equivalent data is statistically insignificant.

Model | REG | MON | ESSQ
FRETS | 59.1 | 52.4 | 54.9
w/o tab features | 59.1 | 47.6 | 52.8
w/o row features | 49.0 | 40.4 | 44.3
w/o col features | 59.9 | 47.2 | 53.1
w/o cell features | 25.7 | 25.0 | 24.9
w/o ♦ features | 62.2 | 47.5 | 53.3

Table 4.6: Ablation study on FRETS, removing groups of features based on level of granularity. ♦ refers to the soft matching features from Table 4.4. Best results on a dataset are highlighted in bold.

4.5.3.1 Ablation Study

To evaluate the contribution of different features we perform an ablation study, individually removing groups of features from the full FRETS model and re-training. The evaluation of these partial models is given in Table 4.6. In this experiment we use all tables as knowledge base data.

Judging by relative score differential, cell features are by far the most important group, followed by row features. In both cases the drops in score are statistically significant. Intuitively, these results make sense, since row features are crucial for alignment to questions, while cell features capture the most fine-grained properties. It is less clear which among the other three feature groups is dominant, since the differences are not statistically significant. It is possible that cell features replicate information from other feature groups.

For example, the cell answer-type entailment feature indirectly captures the same information as the header answer-type match feature (a column feature). Similarly, salience captures weighted statistics that are roughly equivalent to the coarse-grained table features. Interestingly, the success of these fine-grained features would explain our improvements over the Lucene baseline in Table 4.5, which is incapable of such fine-grained search.

The results of this ablation study also provide insight into some of our assumptions about the learning model with respect to specific features. For example, the column-based features make some fairly strong assumptions about the MCQs. They assume, in general, that the different choices of an MCQ are semantically connected by a single hypernym type – in the current context, that they belong to a single table column. This is of course true of our TabMCQ dataset, where the choices are in fact generated from a single column. However, this may not be true of our test sets, where there is no restriction placed on the distractors for an MCQ. Moreover, the tables are constructed by manually inspecting MCQ training data; there is no reason to imagine that the columns of tables constructed from training will accurately reflect the classification of choices into types on the test set.

In general, a more detailed feature analysis in future work will likely highlight other assumptions of our model and indicate ways of improving current features or suggesting new ones altogether. For example, our note above on the column-based features suggests that a reliance on the Wh-word in the question, instead of column-header matching, might be a better indicator of the expected answer type of a question. This would also enable our system to work on QA without multiple choices, where column matching might not be possible to begin with.

4.6 TabNN: QA over Tables with Neural Networks

We have shown that tables can be effectively used as a source of semi-structured background knowledge in Question Answering. In particular, we demonstrated that the implicit relations in tables can be leveraged as insight into developing a feature rich model for QA. However, the FRETS MCQ solver that we implemented and evaluated in Section 4.5 uses a set of manually engineered features to find relevant cells in tables for answering questions.

In keeping with the over-arching goal of this thesis, we would also like to explore the replacement of manual feature engineering with automatic representation learning, while still leveraging the relational properties of tables. Some important questions must be asked in this context. Does such a replacement in fact result in a model that is as effective or better at solving MCQs? What are the trade-offs between manually engineering features and inducing them automatically via representation learning, while continuing to leverage structural and relational information? What are the repercussions in terms of performance, generality and dependence on data?

To answer these questions we develop a neural network model that mimics some of the relational features we developed for FRETS. This model requires no feature engineering. Instead it leverages the structural components of tables (i.e. rows and columns) and their relations to MCQ elements to learn representations that reproduce the alignment signal from our TabMCQ dataset.

4.6.1 The Model

Recall the discussion about the relational structure in tables and its semantics from Section 4.2.1, and the translation of these semantics into manually engineered features in Section 4.5.2. At a high level, one can think of a question as being mapped to a specific row, its choices as being mapped to a specific column, and the correct answer as being mapped to the cell at the intersection of this row and column. In fact, this is the reasoning process through which we guide our annotators in crowd-sourcing MCQs, as described in Section 4.4.1.

Hence, from a representation learning point of view, it makes sense to develop a neural network model that learns each of these mappings in a joint framework. This is precisely what we do with TabNN. We compute affinity scores between questions and rows, choices and columns, and answers and cells with three distinct loss functions. Each of these affinities is much like the feature score computed by FRETS for a particular cell given an MCQ. Jointly, they capture how likely the cell at the intersection of a row and column is to contain information relevant to asserting a QA pair. In what follows, we describe the parametrization of each of these three mappings, but we begin by describing shared components of the model.

4.6.1.1 Shared Model Components

Shared components of the model describe elements that are reused across the three mappings that we attempt to learn. They represent smaller units of representation, across which we attempt to share statistical strength, reduce the parameter space of our model and increase overall robustness. The shared representations are also what tie the three different objectives together.

Cell Embedding

These embedding parameters take the input words of cell elements in tables and map them into dense vectorial representations. Assuming a sequence of words $w_1, ..., w_n$ in a specific cell, this layer converts it into a sequence of vectors $v_1, ..., v_n$, each with dimension $d_1$. The vector space of embeddings $V$ can be learnable, or may be pre-trained and fixed.

Cell LSTM

These parameters convert the sequence of vector representations $v_1, ..., v_n$ that we generated from the embeddings into a single vector $V_c$ that captures the meaning of the entire sequence. The LSTM that we use to perform this encoding makes this conversion with the following sequence of gated operations. A hidden state and a cell state (not to be confused with the cell of a table), $h_t$ and $C_t$, are maintained throughout the sequence at every time step $t$.

The first step is to compute a forget gate to decide how much of the information from the previous state must actually be carried forward. This is done by:

$$f_t = \sigma(W_f \cdot [h_{t-1}, v_t] + b_f) \quad (4.6)$$

In parallel, we can also compute what new information must be stored in the current cell state. This involves two related parts. The first is the “input gate layer”, which decides what values must be updated.

The second is a new candidate for the current state. Equations for these two parts are given below:

$$i_t = \sigma(W_i \cdot [h_{t-1}, v_t] + b_i) \quad (4.7)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, v_t] + b_C) \quad (4.8)$$

The forget and input vectors are then used to weight the previous cell state and the current candidate state to produce the new cell state:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \quad (4.9)$$

where $*$ is the element-wise multiplication operation between vectors.

Finally, we compute the new hidden state. This is done by an intermediate “output gate layer” that is then combined with the new current cell state:

$$o_t = \sigma(W_o \cdot [h_{t-1}, v_t] + b_o) \quad (4.10)$$

$$h_t = o_t * \tanh(C_t) \quad (4.11)$$

By repeatedly applying this series of gated operations to each input vector at every time step $t$, we obtain a final hidden state $h_n$. This hidden state is the single vector $V_c$ that we are seeking, and is a representation of the entire cell. The set of parameters $W_f$, $b_f$, $W_i$, $b_i$, $W_C$, $b_C$, $W_o$ and $b_o$ must all be trained.
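For concreteness, the gated updates of Equations 4.6–4.11 can be written out directly, as in the numpy sketch below. Parameter shapes and initialization are illustrative only; any standard LSTM implementation realizes the same computation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(vectors, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o, hidden_size):
    """Run the gated updates of Eqs. 4.6-4.11 over a cell's word vectors
    and return the final hidden state, i.e. the cell representation V_c."""
    h = np.zeros(hidden_size)
    C = np.zeros(hidden_size)
    for v in vectors:
        x = np.concatenate([h, v])          # [h_{t-1}, v_t]
        f = sigmoid(W_f @ x + b_f)          # forget gate       (Eq. 4.6)
        i = sigmoid(W_i @ x + b_i)          # input gate        (Eq. 4.7)
        C_tilde = np.tanh(W_C @ x + b_C)    # candidate state   (Eq. 4.8)
        C = f * C + i * C_tilde             # new cell state    (Eq. 4.9)
        o = sigmoid(W_o @ x + b_o)          # output gate       (Eq. 4.10)
        h = o * np.tanh(C)                  # new hidden state  (Eq. 4.11)
    return h
```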

Choice Embedding

These embeddings are very similar to the cell embeddings, except that they convert words in the different choices of the MCQ into sequences of vectors. As with the Cell Embeddings, this vector space of embeddings can be learned or pre-trained and fixed.

Choice LSTM

Again, these parameters convert the sequence of choice embeddings into a single vector representing the entire choice. The architecture of the LSTM network used to do this is identical to the one described above. Also as before, the set of parameters of the LSTM must all be trained, and are different from the parameters of the Cell LSTM.

4.6.1.2 Question-Row Objective

The question-row objective attempts to learn a set of discriminative representations that can distinguish when a question is in fact aligned to a row, and when it is not. The architecture of this model component is depicted in Figure 4.2. It comprises several parts that we describe below.

Figure 4.2: The component of the TabNN model that embeds question and row in a joint space and predicts an alignment score between the two. Embedding layers are omitted for visual compactness.

Question Embedding & LSTM

This part of the model performs the same function for questions as the shared components in Section 4.6.1.1 do for table cells and answer choices. We essentially take a sequence of words in the question as input and produce a single vector that encodes the question as an output. This is done with an LSTM on top of an embedding layer.

Row Encoding

The input to this component of the model is different from the others, because it is already a sequence of vectors and not words. We take the sequence of vectors produced by the Cell LSTM for each cell in the row and feed it into another LSTM that combines them. This is effectively a stacked LSTM, which combines a sequence of sequences to produce a single vector that is a representation of the entire row.

Question-Row Prediction

We concatenate the question and row vectors and feed the resulting vector through a sigmoid activation to produce a response variable. The signal we seek to imitate with this response is the row-alignment score as described in Section 4.5.1.1.

4.6.1.3 Choices-Column Objective

The choices-column objective computes the affinity between the column of a table and the set of choices of the MCQ under consideration. The architecture of this model component is depicted in Figure 4.3. It comprises several parts that we describe below.

Choices Encoding

This model component uses the shared Choice Embedding and LSTM parameters to convert the four different choices into four distinct vectors. It then computes the centroid of these four choices by averaging their four corresponding vectors.

Column Encoding

This model component performs a similar operation on columns. It uses the shared Cell Embedding and LSTM parameters to generate individual vectors for each cell in a column and then computes a centroid vector that is the average of all these cells.

Choices-Column Prediction

We concatenate the two centroid vectors of the choices and column and pass the resulting vector through a sigmoid activation. The response variable seeks to imitate the signal obtained from the alignment of columns to choices as described in Section 4.5.1.1.

Figure 4.3: The component of the TabNN model that embeds choices and column in a joint space and predicts an alignment score between the two. Embedding layers are omitted for visual compactness.

4.6.1.4 Answer-Cell Objective

Finally, the answer-cell objective predicts whether a cell contains the answer to a question or not. This is done by concatenating the cell and answer vectors obtained through the shared embedding and LSTM components of the model and passing the resulting vector through a sigmoid activation. The response variable here imitates the signal obtained from the alignment of a cell to a particular answer as described in Section 4.5.1.1. This model objective is pictorially represented in Figure 4.4.

Intuitively, our model thus captures three distinct relational affinities: between questions and rows, between the multiple choices and a column, and between a specific answer and a particular cell in the tables. These affinities are defined over the structural properties of tables and are thus inspired by the same properties we use to engineer our features manually in the FRETS model (Section 4.5). At test time, it should be clear that for our model to predict a high score for a particular QA pair based on some evidence from the tables, the score associated with each of the three different objectives must also be high.
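Schematically, the three prediction heads amount to sigmoid scores over concatenated (or averaged) encoder outputs, as in the sketch below. The encoder outputs and the weight vectors w_qr, w_cc, w_ac are hypothetical parameters standing in for the shared embedding and LSTM components described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Question-row affinity: concatenate the two encodings and apply a sigmoid.
def question_row_score(question_vec, row_vec, w_qr, b_qr):
    return sigmoid(w_qr @ np.concatenate([question_vec, row_vec]) + b_qr)

# Choices-column affinity: compare the centroid of the choice vectors with
# the centroid of the column's cell vectors.
def choices_column_score(choice_vecs, column_cell_vecs, w_cc, b_cc):
    choices_centroid = np.mean(choice_vecs, axis=0)
    column_centroid = np.mean(column_cell_vecs, axis=0)
    return sigmoid(w_cc @ np.concatenate([choices_centroid, column_centroid]) + b_cc)

# Answer-cell affinity: concatenate a single answer encoding with a cell encoding.
def answer_cell_score(answer_vec, cell_vec, w_ac, b_ac):
    return sigmoid(w_ac @ np.concatenate([answer_vec, cell_vec]) + b_ac)
```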

4.6.2 Training

Since our proposed TabNN model follows the same design template as FRETS, namely scoring individual rows, columns and cells for relevance to QA pairs, we can use the same training schema as before (see Section 4.5.1.1).

Figure 4.4: The component of the TabNN model that embeds a single choice and table cell in a joint space and predicts an alignment score between the two. Embedding layers are omitted for visual compactness.

In summary, this objective is the cross-entropy loss between a hypothesized distribution and the true one in training. The hypothesized distributions are obtained from the scores our model assigns to the three objectives of Section 4.6.1. The true distributions are obtained in the same way as with FRETS: by converting alignment annotations between MCQs and table entries into numerical scores. By considering the discrepancy between the hypothesized and true distributions thus defined, we back-propagate the error to update all the model parameters.

An important distinction from FRETS is that we do not consider representations over entire tables. Neither do we search over every row and column of all the tables. These choices are based on the computational complexities introduced by a model that either directly, or through some combination of smaller components, represents an entire table. Instead, we use the following criteria to decide which tables we search over and featurize for every QA pair.

At training time:
• 1 correct table
• 2 randomly chosen tables

At test time:
• the 3 top tables from TF-IDF ranking

The TF-IDF ranking we refer to is the same as the TF-IDF table feature defined in our FRETS model (see Table 4.4 in Section 4.5.2).

On a section of the training set that we used to develop our FRETS features, this ranker selected the correct table for a particular MCQ with over 90% accuracy.
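A corresponding table pre-ranking step might look like the following, reusing per-table TF-IDF scores like those sketched earlier. The scoring heuristic here, summing the TF-IDF weights of QA tokens found in each table, is an assumption made for illustration.

```python
def rank_tables(qa_tokens, table_tfidf, k=3):
    """Return the ids of the k tables whose TF-IDF mass best covers the QA tokens.

    `table_tfidf` is assumed to map a table id to a dict of word -> TF-IDF score.
    """
    def coverage(table_id):
        scores = table_tfidf[table_id]
        return sum(scores.get(tok, 0.0) for tok in qa_tokens)
    return sorted(table_tfidf, key=coverage, reverse=True)[:k]
```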

4.6.3 Experiments and Evaluation

We now turn to the task of evaluating the TabNN model. We experiment on the publicly available and largest of the three datasets that we used to evaluate FRETS (the ESSQ dataset) and compare the results to the manually feature-engineered model. We also discuss differences between the two models and attempt to explain the discrepancy between the results in the context of these differences. We begin by providing some implementational details.

4.6.3.1 Implementational Details

The training data is obtained by converting the TabMCQ dataset and its corresponding alignment data into response variables as described in Section 4.5.1.1. To normalize the data for input to our neural network architecture we use padding, truncation and masking. Padding and truncation ensure that the size of each input is constant: if a question is too short it is padded to the constant size, and if it is too long it is truncated to that size. We set a maximum size of 20 tokens for answer choices and table cells, a maximum of 8 cells in a column, a maximum of 5 cells in a row, and a maximum of 30 tokens in a question. We also apply masking to the fixed-length input sequences. This process indicates time steps in a sequence that need to be skipped. It works in combination with padding, where the additional time steps introduced for maintaining constant size are to be skipped when performing LSTM computations.

We implement a total of four distinct models, which are formed from the cross-product of two dimensions of distinction. The first dimension consists of two variants in the types of embeddings we use: learned and pre-trained. The learned embeddings variant – as the name suggests – randomly initializes the embedding matrices and updates the parameters through training. In this variant the embedding layers used in different parts of the model are correspondingly different. However, the vocabulary sets in training and testing might differ, in which case the learned embeddings, which are obtained from training, will not cover the words encountered at test time. To account for this discrepancy we also experiment with a set of pre-trained embeddings that cover a large vocabulary. The pre-trained embeddings are obtained by training regular Skip-gram word vectors (Mikolov et al., 2013a) over Wikipedia, then retrofitting (Faruqui et al., 2014) them for enhanced quality. We use this single set of embeddings for all components of the model that require an embedding layer, and we do not update the parameters during training. This significantly reduces the size of the model in terms of the number of parameters. In both cases we maintain a constant embedding size of 80 for all embedding layers.
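The padding, truncation and masking step can be sketched as below; the PAD token and the helper itself are illustrative, and only the maximum lengths quoted above are taken from the text.

```python
PAD = "<pad>"
MAX_LEN = {"question": 30, "choice": 20, "cell": 20}  # maximum lengths quoted above

def pad_or_truncate(tokens, max_len):
    """Fix a token sequence to max_len and return a mask marking real tokens.

    Positions with mask value 0 are padding and are skipped in LSTM computations.
    """
    tokens = tokens[:max_len]
    mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
    tokens = tokens + [PAD] * (max_len - len(tokens))
    return tokens, mask

# e.g. pad_or_truncate("what is the boiling point of water".split(), MAX_LEN["question"])
```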

Embeddings | LSTM Size | Number of Parameters
Learned | 160 | 4,377,603
Learned | 320 | 7,147,843
Pre-trained | 160 | 1,131,843
Pre-trained | 320 | 3,902,083

Table 4.7: A summary of the different TabNN model variants we train along with the number of parameters learned with each model.

Model | Embeddings | LSTM Size | Training Accuracy | ESSQ Test Accuracy
TabNN | Learned | 160 | 66.5% | 25.3%
TabNN | Learned | 320 | 68.2% | 24.3%
TabNN | Pre-trained | 160 | 42.2% | 26.2%
TabNN | Pre-trained | 320 | 42.6% | 26.1%
FRETS | - | - | - | 54.9%

Table 4.8: Evaluation of the TabNN model. Training results are provided to gain insight into the models and should not be taken to represent goodness of the models in any way. The best FRETS model’s test results are given as a comparison.

The second dimension of distinction, which deals with the size of the LSTM encoding, also yields two variants: we experiment with sizes of 160 and 320 for compact and complex LSTM models respectively. These four models are summarized in Table 4.7 along with the number of parameters that they contain.

To avoid overfitting, we apply regularization throughout our model architecture. While training we use a dropout factor of 0.5, which is the recommended standard value (Srivastava et al., 2014), to regularize the different LSTMs. We also use L2 regularization, with a parameter value of 0.5, applied to all the embedding matrices. To optimize the joint objective of the model, we perform stochastic gradient descent with batch sizes of 512 and train for a total of 10 epochs over the entire training set.

4.6.3.2 Results

The results of the TabNN model are presented in Table 4.8. We present training values to gain insights into the model, but it should be stressed that high training values are by no means an indication of a good model. In fact we see this in the results, since evaluation on the ESSQ test set shows that all variants of the TabNN model perform no better than random guessing (at 25%).

The high training values, coupled with the poor test results, seem to indicate that the TabNN architecture is massively overfitting to the training set and is incapable of generalizing to test. But is this true, considering that we use regularization throughout our model, both with dropout and with L2 penalties? To test whether this is indeed the case, we run a second batch of experiments where we hold out a randomly selected 10% of the training questions from TabMCQ as a secondary test set.

Model | Embeddings | LSTM Size | Training Acc. | 10% TabMCQ Test Acc. | ESSQ Test Acc.
TabNN | Learned | 160 | 68.0% | 56.8% | 24.4%
TabNN | Learned | 320 | 71.1% | 50.8% | 24.8%
TabNN | Pre-trained | 160 | 42.2% | 31.9% | 27.1%
TabNN | Pre-trained | 320 | 43.7% | 30.9% | 28.4%

Table 4.9: Evaluation of the TabNN model, trained on only 90% of the TabMCQ dataset. Test results on the 10% held-out portion of TabMCQ show that the model itself is capable of learning, and that its architecture is sound. There is, however, a severe domain mismatch between TabMCQ and the ESSQ test set.

We then train TabNN only on the remaining 90% of TabMCQ and evaluate the results. These results are presented in Table 4.9. While the test results on ESSQ are still no different from random, the test results on the 10% held-out portion of TabMCQ are much better, with the best model achieving a score of 56.8%. This demonstrates that our model itself is capable of learning effectively, but there is a severe mismatch between the TabMCQ and ESSQ datasets that the TabNN model is incapable of overcoming – and the FRETS model is.

There are some other things to note across both Tables 4.8 and 4.9. Firstly, as expected, having a more complex model, with an LSTM dimension of 320, leads to better training results. But this does not transfer to test, where the simpler models perform better (on the 10% TabMCQ test) or at least no worse (on the ESSQ test). Secondly, using pre-trained embeddings does not seem to be justified. While the test results for the corresponding models on ESSQ are marginally better than learning embeddings from scratch, they are a lot worse when tested on the 10% TabMCQ data. Thus not much power is gained by having access to pre-trained embeddings with a larger vocabulary. This leads to the third insight: more training data, in this case, would not help. Instead, different and more diverse training data is required. This insight is also confirmed by the fact that TabNN, whether trained on the full TabMCQ dataset or only 90% of it, has similar accuracies on training and on the ESSQ test.

4.6.3.3 Discussion on Differences with FRETS

The results from Section 4.6.3.2 demonstrate a significant difference between the FRETS and TabNN models. In summary, while they are both capable of leveraging the relations and structure within tables to solve MCQs, FRETS does so by abstracting over both content and structure, while TabNN can only abstract over structure and is much more dependent on content.

This finding is in keeping with general expectations of neural network models. While they have potentially infinite capacity to approximate arbitrarily complex functions, they require vast amounts of diverse training data in order to be able to generalize from training to test.

In our case, the TabMCQ (training) dataset is not sufficiently diverse to accommodate the very different ESSQ (test) dataset. For example, the average length of questions in ESSQ is 24.4 words, with some questions containing lengthy explanatory paragraphs before the actual question.

In contrast, the crowd-sourced TabMCQ dataset has an average of only 11 words per question, with Turkers only creating direct MCQs. The TabNN model must deal with these discrepancies by automatic means such as truncation and padding. There is no guarantee that truncation maintains all the information in the question relevant to finding an answer. In contrast, the feature-rich FRETS model relies on sentence splitting and manually crafted regular expression patterns over phrase chunks to extract relevant parts of the question.

Another issue is the content of tables. While FRETS demonstrated itself capable of answering questions whose answers may not even be contained in tables, TabNN did not. In fact, TabNN only performed well on the 10% held-out portion of the TabMCQ dataset, where the answers were guaranteed to be found in tables. Unfortunately, our proposed solution of using pre-trained embeddings to counter the mismatch in tables between training and test did not work, because learning power is gained by learning the embeddings on in-domain data. Again, this indicates the need for TabNN to have access to more diverse training data – this time in the form of tables.

In conclusion, the TabNN model is architecturally sound and capable of leveraging the relations within tables to learn representations over table elements for solving MCQs. We thus re-iterate the central claim of this thesis: that representation learning can benefit from a relation-centric view of semantics. Finding and leveraging the correct relations for a particular domain, specifically, can lead to better models. However, in comparison with FRETS, TabNN is much more reliant on content and less able to adapt to significantly different test sets. This is in keeping with the general understanding of neural network models, which require very large and diverse training sets to properly generalize to unseen examples.

4.7 Conclusion

Tables represent a unique and interesting domain for the study of relational structure in representation learning, with useful semantic properties emerging from the relationships between row and column elements. These properties led to two separate contributions in this chapter.

Firstly, we leveraged the semi-structured formalism of tables with text entries to create a dataset of MCQs. The semantics of this structure were used as scaffolding information to guide non-expert crowd-sourcing annotators to easily create a high-quality MCQ dataset. The relation-centric annotation setup also let us harvest rich alignment information between these MCQs and table elements, with no additional effort from annotators. We have released our dataset of more than 9000 MCQs and their alignment information to the research community. To the best of our knowledge, it is the largest elementary school science MCQ dataset of its kind, containing a unique set of alignments to tables. We believe it offers interesting challenges that go beyond the scope of this chapter – such as question parsing, or textual entailment – and are exciting avenues for future research.

Secondly, we developed two kinds of feature-driven models that predicted the previously generated alignment information to answer new unseen MCQs. The first used manually engineered features, while the second automatically induced features within a neural network model.

Crucially, however, both kinds of models leveraged the relational and structural properties of tables and thus fit into the broader scope of this thesis. We evaluated our models on the task of answering MCQs.

On three benchmark sets the manually engineered model, which relied on semi-structured tables, consistently and significantly outperformed an unstructured and a highly-structured baseline. We also showed that this model was able to generalize over content and structure, thus reasoning about questions whose answers were not even contained in the tables that we used as a knowledge repository. This demonstrated that the rich relational structure inherent in tables can in fact be usefully featurized to perform QA.

The neural network model also demonstrated the ability to leverage the relations within tables for learning representations jointly over table elements and MCQs. It showed that, with no feature engineering whatsoever, we were able to obtain promising results on a held-out portion of the training data that we used as a test set. However, it was much more dependent on content than the FRETS model, performing poorly on a test set that had a distinctly different profile to the training data. These findings were in keeping with the general understanding of neural network models, which require large and diverse training sets to learn optimally. Our training set was simply not diverse enough.

The mixed results from the neural network model do not, however, counter the central claims of this thesis. This is because we still demonstrated that we benefitted from a relation-centric view of our domain – solving MCQs with tables. The only difference, in this case, was that in our test setup a manually feature-engineered model (which, it should be noted, centers around the relations and structure of tables) was better than a model that learned feature representations automatically. Our analyses indicate, though, that with a larger and (especially) more diverse training set, the neural network model is likely to perform much better.

4.7.1 Five Point Thematic Summary

What was the semantic representation learning problem that we tried to solve?

We were interested in exploring tables as a medium for storing background knowledge for the task of question answering. Specifically, we were hoping to validate the semi-structured nature of text tables as a balanced compromise that offers the convenience of fuzzy matching over text with some degree of structure for principled reasoning. From the perspective of representation learning, we were interested in learning features over elements of tables that maximized our ability to answer questions correctly.

Phrased in terms of Definition 1, we were attempting to learn several different semantic relations over pairs that consist of table elements and parts of an MCQ. These relations corresponded to the semantic properties of the tables (see Section 4.2.1) and their connections to MCQs. They were defined over table elements such as rows and columns, as well as questions and the different answer choices.

Since we were interested in learning representations for an extrinsic task – question answering, specifically – we expected the mapping from paired table and MCQ elements to the set of real numbers (according to Definition 1 for a semantic relation) to reflect the requirements of the task. We expected that a good model would score table elements that accurately contained information to answer an MCQ higher than irrelevant or spurious information.


What linguistic information or resource did we use to solve this problem?

Our task was premised on exploring tables as a source of semi-structured knowledge for question answering. Therefore we used a set of knowledge tables that were designed and constructed to contain an inventory of facts for our domain of choice: fourth-grade science exam MCQs. Section 4.2.2 outlines the resource.

How did we operationalize the desired semantic relation with contextual relations in the resource?

Our operationalization of the desired semantic relation stemmed from the intuition and observation that answers to MCQs can be found by aligning questions and answer choices to table rows and columns respectively. Specifically, we observed that with a good alignment, the answer is often found at the intersection of an aligned row and column (see Section 4.3).

Thus, we reversed this intuition to reduce our task to finding an alignment between MCQs and table elements. We created a large dataset of MCQs generated from the tables (see Section 4.4), so that we obtained alignments that would concretize the operationalization of the semantic relations we set out to learn.

Then, we defined contextual relations that had the shared property of explicitly tying questions to rows of a table, and columns to a set of answer choices, over the training set. In terms of Definition 2, our object sets consisted of structured objects such as questions, entire rows, etc., and the corresponding properties were defined in terms of existing alignments between these structured objects in the training data.

How did we use the contextual relations in representation learning?

We experimented with two separate learning paradigms for solving the MCQs: a log-linear model consisting of manually engineered features, and a neural network model that automatically induced a set of features for the task. However, in both instances, the contextual relations we defined in the previous item of the five point agenda were used to bias the learners along the same intuitions and principles.

In particular, we used the alignment scores between questions and rows, and between answer choices and cells, as response variables to assign the weights to contextual relations. In the log-linear model we attempted to code features that captured the intuition of these alignments (see Section 4.5.2). Meanwhile, we tried to replicate these intuitions in the neural network model with a network architecture that also captured these alignments (see Section 4.6.1).

What was the result of biasing the learner with contextual relations in learning the semantic model?

We concluded from our experiments with both model paradigms that our contextual relations were effectively able to bias the learners with features that were useful for solving multiple-choice question answering. In particular, we found that the semi-structured nature of the tables afforded a compromise that allowed us to leverage both their content and structure to answer unseen questions.

However, we found that while our log-linear model was capable of abstracting over both content and structure, our neural network model was only able to abstract over structure while being more tied to the content. Our analyses revealed the need for a larger and more diverse training set (including tables) to properly train the neural network model.

Our overall results indicated that the contextual relations we defined over table elements and MCQs did, in fact, at least partially operationalize our intuitions about the semantic relations we cared about.

Chapter 5

Embedded Frame Lexicons

Learning Event Structure through Unsupervised Lexicon Induction

This thesis is concerned with using semantic relations to learn more expressive models of semantics. Relations encompass a very broad range of phenomena in language and data formalisms, and in Chapter 2 we have attempted to provide a taxonomy of these occurrences of relations as it pertains to our work. This taxonomy serves both as a technical summary of this thesis and as a high-level guideline on how to think about using relations in representation learning for a new domain or problem.

However, an important research issue when dealing with relations for representation learning is the relations themselves. In particular, for a given language phenomenon or data formalism, what happens when the relations are ill-defined, or no annotated data exists? More generally, what happens when we must induce rather than use a pre-existing set of relations? This is an important question in the context of this thesis, because a comprehensive treatment of semantic relations requires dealing with such scenarios as well. For example, a low-resource language may simply not have annotated data containing rich structure – such as an ontology – or even partial structure – such as tables. Or consider the example of semantic roles: it is a disputed matter what the exact set of relations to encode all possible semantic roles should be (Kingsbury and Palmer, 2002; Baker et al., 1998).

This chapter and a major part of Chapter 6 will thus be concerned with developing representation learning models that do not leverage or embed a pre-defined annotated set of relations. Our models will rather work on inducing relations as part of the learning process. Because we are inducing an unknown set of relations, we will also attempt to tackle some general questions about the nature of these relations. How many are there? What are they? How do they relate the elements in the domain to one another? It may be noted, though, that finding answers to these questions does not preclude the usage of the induced relations for practical and effective representation learning: even if the relations are not interpretable they may still produce good results in downstream applications.

In this chapter we turn to the problem of providing a representation learning solution to the relations in a frame lexicon. This is an apt domain, since verbs undoubtedly have relational arguments, as supported by the body of linguistic literature devoted to studying argument realization (Jackendoff, 1987; Moens and Steedman, 1988; Levin and Rappaport Hovav, 2005).

However, the specific number and nature of these arguments remain contested (Fillmore, 1967; Sowa, 1991), with solutions ranging from a few prototypical PropBank-style relations (Kingsbury and Palmer, 2002) to an effectively open-ended set of highly specific and fine-grained roles in FrameNet (Baker et al., 1998). These semantic lexicons contain information about predicate-argument frame structure. The frames encode relations that capture knowledge about the affinity of predicates for certain types of arguments. They also specify the number and semantic nature of these arguments, regardless of syntactic realization. For example, PropBank specifies frames in the following manner:

• eat → [agent]_0, [patient]_1
• give → [agent]_0, [theme]_1, [recipient]_2

These frames provide semantic information such as the fact that "eat" is transitive while "give" is di-transitive, or that the beneficiary of one action is a "patient" while that of the other is a "recipient". This structural knowledge is crucial for a number of NLP applications. Information about frames has been successfully used to drive and improve diverse tasks such as information extraction (Surdeanu et al., 2003), semantic parsing (Das et al., 2010) and question answering (Shen and Lapata, 2007), among others.
However, building these frame lexicons is very expensive and time consuming. Thus, it remains difficult to port applications from resource-rich languages or domains to data-impoverished ones. The NLP community has tackled this issue along two different lines of unsupervised work.
At the local level, researchers have attempted to model frame structure by the selectional preference of predicates for certain arguments (Resnik, 1997; Séaghdha, 2010). This can be categorized as the token-level affinity of predicates for their relational arguments. For example, on this problem a good model might assign a high probability to the word "pasta" occurring as an argument of the word "eat".
Contrastingly, at the global level, work has focussed on inducing frames by clustering predicates and arguments in a joint framework (Lang and Lapata, 2011a; Titov and Klementiev, 2012b). This can be categorized as the type-level affinity between predicates and their semantically linked arguments. So, for example, one might be interested in associating predicates such as "eat", "consume", "devour" with a joint clustering of arguments such as "pasta", "chicken", "burger".
While these two classes of solutions have been useful for several problems, they also have shortcomings. Selectional preference modelling only captures local predicate-argument affinities, but does not aggregate these associations to arrive at a structural understanding of frames. Meanwhile, frame induction performs clustering at a global level. But most approaches tend to be algorithmic methods (or some extension thereof) that focus on semantic role labelling. Their lack of portable features or model parameters unfortunately means they cannot be used to solve other applications or problems that require lexicon-level information – such as information extraction or machine translation. Another limitation is that they always depend on high-level linguistic annotation, such as syntactic dependencies, which may not exist in resource-poor settings.

Thus, in this chapter we propose to combine the two approaches to induce a frame semantic lexicon in a minimally supervised fashion, using nothing more than unlabeled predicate-argument word pairs as input data. Additionally, we will learn an embedded lexicon that jointly produces embeddings for predicates, arguments and an automatically induced collection of latent slots. These slots are the induced relations that we learn and investigate in this chapter. Because of the productivity of argument realization, the embedded lexicon we propose provides flexibility for usage in downstream applications, where predicate-argument affinities can be computed at will.
To jointly capture the local and global streams of knowledge we propose a novel integration between a predictive embedding model and the posterior of an Indian Buffet Process. The embedding model maximizes the predictive accuracy of predicate-argument selectional preference at the local token level, while the posterior of the Indian Buffet Process induces an optimal set of latent slots at the global type level that capture the regularities in the learned predicate embeddings. In the context of this thesis, it should be noted that both components of our models are centered on the relational properties between predicates and arguments and attempt to capture these properties in a latent fashion.
We evaluate our approach and show that our models are able to outperform baselines on both the local token and global type levels of frame knowledge. At the local level we score higher than a standard predictive embedding model on selectional preference, while at the global level we outperform a syntactic baseline on lexicon overlap with PropBank. Finally, our analysis of the induced latent slots yields insight into some interesting generalities that we are able to capture from unlabeled predicate-argument pairs.

5.1 Related Work

The work in this chapter relates to research on identifying predicate-argument structure in both local and global contexts. These related areas of research correspond to the NLP community's work on selectional preference modelling and semantic frame induction (which is also known variously as unsupervised semantic role labelling or role induction), respectively.
Selectional preference modelling seeks to capture the semantic preference of predicates for certain arguments in local contexts. These preferences are useful for many tasks, including unsupervised semantic role labelling (Gildea and Jurafsky, 2002) among others.
Previous work has sought to acquire these preferences using various means, including ontological resources such as WordNet (Resnik, 1997; Ciaramita and Johnson, 2000), latent variable models (Rooth et al., 1999; Séaghdha, 2010; Ritter et al., 2010) and distributional similarity metrics (Erk, 2007). Most closely related to our contribution is the work by Van de Cruys (2014), who uses a predictive neural network to capture predicate-argument associations. To the best of our knowledge, our research is the first to attempt using selectional preference as a basis for directly inducing semantic frames.
At the global level, frame induction subsumes selectional preference by attempting to group arguments of predicates into coherent and cohesive clusters. While work in this area has included diverse approaches, such as leveraging example-based representations (Kawahara et al., 2014) and cross-lingual resources (Fung and Chen, 2004; Titov and Klementiev, 2012b), most attempts

have focussed on two broad categories. These are latent-variable-driven models (Grenager and Manning, 2006; Cheung et al., 2013) and similarity-driven clustering models (Lang and Lapata, 2011a,b). Our work includes elements of both major categories, since we use latent slots to represent arguments, but an Indian Buffet process induces these latent slots in the first place.
The work of Titov and Klementiev (2012a) and of Woodsend and Lapata (2015) is particularly relevant to our research. The former use another non-parametric Bayesian model (a Chinese Restaurant process) in their work, while the latter embed predicate-argument structures before performing clustering.
Crucially, however, all these previous efforts induce frames that are not easily portable to applications other than semantic role labelling (for which they are devised). Moreover, they rely on syntactic cues to featurize and help cluster argument instances. To the best of our knowledge, ours is the first attempt to go from unlabeled bags-of-arguments to induced frame embeddings without any reliance on annotated data.

5.2 Joint Local and Global Frame Lexicon Induction

In this section we present our approach to induce a frame lexicon with latent slots. The intuition is to jointly capture local token-level and global type-level associations between predicates and arguments to arrive at a representation that best explains the input data.
Following prior work on frame induction (Lang and Lapata, 2011a; Titov and Klementiev, 2012a), the procedural pipeline can be split into two distinct phases: argument identification and argument clustering. As with previous work, we focus on the latter stage, and assume that we have unlabeled predicate-argument structure pairs. These could be obtained from gold standard annotation or through heuristic means (Lang and Lapata, 2014), although the model we describe is agnostic to this choice.
Notably, we do not require a full dependency parser to find and annotate predicate-argument relations, which previous work does. It should be noted that predicate-argument pairs – linguistically speaking – do represent part of a dependency parse. However, from a computational perspective our requirements are simpler (and therefore easier to satisfy) because we do not require access to phrasal heads, nor do we require labelled dependency arcs of any kind. Our only requirement is a list of unlabelled predicate-argument pairs (or predicate-argument-preposition triples, in an extension that we describe in what follows) given to us as input.
We now begin with preliminary notation, by defining the input data as:

• P = {p_1, ..., p_n} is a vocabulary of predicate types.
• A = {a_1, ..., a_m} is a vocabulary of argument types.
• C = {(p_1, a_1), ..., (p_N, a_N)} is a corpus of predicate-argument word token pairs.

In this work, we will only consider verbal predicates; so, for example, "eat", "drink", "work", "play", etc. Additionally, we consider entire argument chunks rather than only the argument head. It should be noted, though, that in this work we assume argument chunks are broken down into individual words – to increase training data size – but the model remains agnostic to this decision.

For example, consider sentence fragments such as "eat pasta with a fork" or "take bath in a tub". Since we split argument chunks to increase training data size, our corpus would essentially consist of pairs of words such as "eat, pasta", "eat, fork", "take, bath", "take, tub", etc. Given a large corpus of this form, we will attempt to learn an optimal set of model parameters θ that maximizes a regularized likelihood over the corpus. That is, we will learn a model that best explains the data, subject to some regularization constraints.
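To make the data format concrete, the following is a minimal sketch – with a hypothetical input format rather than the actual preprocessing used in this chapter – of how such a corpus of predicate-argument word token pairs could be assembled from argument chunks:

```python
# A sketch of building the corpus C of predicate-argument word token pairs.
# The (predicate, argument chunks) input format is assumed for illustration.
def build_pair_corpus(instances):
    """instances: iterable of (predicate, [argument_chunk, ...]) tuples,
    where each argument_chunk is a list of word tokens."""
    corpus = []
    for predicate, arg_chunks in instances:
        for chunk in arg_chunks:
            # Argument chunks are split into individual words to increase
            # the amount of training data, as described above.
            for word in chunk:
                corpus.append((predicate, word))
    return corpus

instances = [("eat", [["pasta"], ["fork"]]),
             ("take", [["bath"], ["tub"]])]
print(build_pair_corpus(instances))
# [('eat', 'pasta'), ('eat', 'fork'), ('take', 'bath'), ('take', 'tub')]
```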

The model parameters include V = {v_i | ∀ p_i ∈ P}, an n × d embedding matrix for the predicates, and U = {u_i | ∀ a_i ∈ A}, an m × d embedding matrix for the arguments. V and U are basically the word vectors, in two distinct vector spaces, of the predicates and arguments respectively.

Additionally, assuming K latent frame slots, we define Z = {z_ik} as an n × K binary matrix that represents the presence or absence of the slot k for the predicate i. Intuitively, K defines the number of relational arguments we are going to induce, and Z is the template of the lexicon that says which predicates take on which relational arguments.

We also define a latent K × d weight matrix S = {s_k | 1 ≤ k ≤ K} that associates a weight vector to each latent slot. This matrix S, intuitively, captures the semantic signature of each of the latent relational slots.

Given the corpus C = {(p_1, a_1), ..., (p_N, a_N)}, a standard maximum likelihood model factorizes the probability of the corpus as \prod_{(p_i, a_i) \in C} \Pr(p_i, a_i; \theta). Given our set of parameters, we extend

such a model to obtain the following generalized objective.

\hat{\theta} = \arg\max_{\theta} \left( \sum_{(p_i, a_i) \in C} \sum_{z_{ik}, s_k} \log \Pr(a_i \mid p_i, z_{ik}, s_k; \theta) \right) + \log \Pr_{\theta}(Z \mid V) \qquad (5.1)

This objective has two parts: a likelihood term and a posterior regularizer. Thus, Equation 5.1 does not define a single model, but rather a family of models that can be parametrized by the choice of the likelihood and regularizer components. Notice the similarity to the form of Equation 3.3 in Chapter 3.2.2. In that case, just as in the present one, we are attempting to reconcile two streams of information jointly.
In this chapter, we instantiate this objective in a way that accounts for both token- and type-level associations between predicates and arguments. The first part of the objective (the likelihood term) will be responsible for modelling the predictive accuracy of selectional preference at a local level, while the second part of the objective (the posterior regularizer) will capture global consistencies for an optimal set of latent relational slots. We detail the parametrization of each of these components separately in what follows.

5.2.1 Local Predicate-Argument Likelihood

The likelihood term of our model is based on the popular Skip-gram model of Mikolov et al. (2013a), suitably extended to incorporate the latent frame slots and their associated weights. Recall that the original Skip-gram model would treat our data as the conditional probability of an argument given a predicate, as follows:

Figure 5.1: The generative story depicting the realization of an argument from a predicate. Argument words are generated from latent argument slots. Observed variables are shaded in grey, while latent variables are in white.

\Pr(a_i \mid p_i; \theta) = \frac{\exp(v_i \cdot u_i)}{\sum_{a_{i'} \in W} \exp(v_i \cdot u_{i'})}

We modify this to also account for the presence or absence of an argument slot, and the weight associated with that slot. Specifically, we marginalize over the latent slot variables:

\Pr(a_i \mid p_i; \theta) = \sum_{z_{ik}, s_k} \Pr(a_i \mid p_i, z_{ik}, s_k; \theta) \propto \sum_{k} z_{ik} \frac{\exp((v_i \odot s_k) \cdot u_i)}{\sum_{a_{i'}} \exp((v_i \odot s_k) \cdot u_{i'})} \qquad (5.2)

where ⊙ represents the element-wise multiplication operator. Intuitively, in the likelihood term we weight a general predicate embedding v_i with a slot-specific representation s_k, which then predicts a specific argument u_i. The generative story behind this intuition is graphically represented in Figure 5.1.
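As a concrete illustration, the following is a minimal numpy sketch – with toy, randomly initialized matrices and hypothetical sizes – of the slot-weighted scoring inside Equation 5.2: the predicate vector is modulated element-wise by each active slot's weight vector before being compared against argument vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, K = 4, 5, 3                      # embedding size, #argument types, #slots
v_i = rng.normal(size=d)               # predicate embedding v_i
U = rng.normal(size=(m, d))            # argument embeddings u_a (rows)
S = rng.normal(size=(K, d))            # slot weight vectors s_k (rows)
z_i = np.array([1, 0, 1])              # active slots for predicate i

def slot_argument_probs(v_i, U, S, z_i):
    """For each active slot k, a softmax over arguments of (v_i * s_k) . u_a,
    i.e. the per-slot term of Equation 5.2 (element-wise product, then dot)."""
    probs = {}
    for k in np.flatnonzero(z_i):
        scores = U @ (v_i * S[k])
        e = np.exp(scores - scores.max())   # numerically stable softmax
        probs[k] = e / e.sum()
    return probs

print(slot_argument_probs(v_i, U, S, z_i))
```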

5.2.2 Global Latent Slot Regularization

The posterior regularization term in Equation 5.1 seeks to balance the likelihood term by yielding an optimal set of latent slots, given the embedding matrix of predicates.
The choice of the number and specific nature of the slots is a contentious issue. In the NLP community, proposed solutions have dealt with this problem under different assumptions about the relationships between predicates and arguments. For example, the PropBank lexicon (Kingsbury and Palmer, 2002) categorizes arguments into a finite number of broad, prototypical slots. Meanwhile, the FrameNet lexicon (Baker et al., 1998) takes an altogether different approach by allowing different predicates to take on an arbitrary number of highly specialized slots. The debate extends to Linguistics as well. For example, Dowty (1991) proposes proto-roles which subsume commonly used thematic roles based on a set of properties exhibited by arguments, while Levin (1993) attempts to organize predicates into classes according to the argument syntax and alternations they manifest.

Rather than commit to a particular representational schema, we instead choose to let the data guide the process of learning the number and nature of the argument slots. A natural way to allow this to happen is to subscribe to a class of methods known as Bayesian nonparametric models. These models assume infinite priors, which, combined with a suitable likelihood term, permit the data to dictate the complexity of the learned model. Applied to our problem, this modelling assumption means that we accept a potentially arbitrary number of argument slots, but our data specifies a selection of only a finite subset of these slots. The advantage of letting the data specify our model does come with a caveat, though: the learned slots are latent and are not easily interpretable as manually defined thematic roles.
Nevertheless, in this work we choose the posterior of an Indian Buffet Process (IBP) (Griffiths and Ghahramani, 2005) to induce an optimal latent binary matrix Z. Recall that, in Bayesian learning, the posterior is a product of a prior and a likelihood function. In the present case, the IBP itself places a prior on equivalence classes of infinite-dimensional sparse binary matrices. It defines a culinary analogy for how matrices Z can be incrementally built, such that they are legitimate draws from the underlying distribution.
P customers (predicates) enter a restaurant one after another. The restaurant has a buffet consisting of an infinite number of dishes (relational arguments). When the first customer p_1 enters, she takes a serving from the first Poisson(α) dishes until her plate is full. Every subsequent customer p_i samples previously tasted dishes (previously selected relational arguments) with probability m_k / i, which is proportional to their popularity (m_k is the number of predicates that already selected the kth relational argument). The new customer also serves herself Poisson(α / i) new dishes. The elements z_ik of the matrix are 1 if customer p_i sampled dish k, and 0 otherwise – that is, if predicate i selected argument k or not. In the limit of the number of features K → ∞, it can be shown that the probability of a matrix being produced by this process is:

P(Z) = \frac{\alpha^{K_+}}{\prod_{i=1}^{N} K_i!} \exp(-\alpha H_N) \prod_{k=1}^{K_+} \frac{(N - m_k)! \, (m_k - 1)!}{N!} \qquad (5.3)

Here K_+ denotes the number of non-zero columns of the infinite feature matrix, i.e. those for which m_k > 0. Also, K_i is the number of new dishes sampled by the ith customer to enter the restaurant. Finally, H_N is the Nth harmonic number, which can be computed as H_N = \sum_{j=1}^{N} \frac{1}{j}.
This is the form of the prior distribution that we will use to compute our actual feature matrices for verb argument structure induction. Given a suitable likelihood function and some data, inference in an IBP computes a posterior that yields an optimal finite binary matrix with respect to regularities in the data.
Setting the data, in our case, to be the embedding matrix of predicates V, this gives us precisely what we are seeking. It allows us to find regularities in the embeddings, while factorizing them according to these consistencies. The model also automatically optimizes the number of and relationship between latent slots, rather than setting these a priori. For example, Figure 5.2 illustrates the kind of binary matrix Z that might be produced as a result of running IBP inference over our data.

Figure 5.2: Illustration of a potential binary feature matrix (1s colored, 0s blank) learned by IBP. Rows represent verbs from a fixed vocabulary, while columns are of a fixed but arbitrary number and represent latent arguments.

The IBP encodes other desiderata as well. These include the fact that the matrix Z remains sparse, while the frequency of slots follows a power-law distribution proportional to Poisson(α). In practice, this captures the power-law distribution of relational slots in real-world semantic lexicons such as PropBank (Palmer et al., 2005). All of these properties stem directly from the choice of prior, and are a natural consequence of using an IBP.
As mentioned above, the second part of computing the posterior is the choice of the likelihood function. In this chapter, we use a linear-Gaussian model as the likelihood function. This is a popular model that has been applied to several problems, and for which different approximate inference strategies have been developed (Doshi-Velez et al., 2009a; Doshi-Velez and Ghahramani, 2009). According to this model, the predicate embeddings are distributed as:

v_i \sim \text{Gaussian}(z_i W, \sigma_V^2 I) \qquad (5.4)

where W is a K × d matrix of weights and σ_V is a hyperparameter. A detailed derivation of the posterior of an IBP prior with a linear-Gaussian likelihood is beyond the scope of this work. We point the reader to Griffiths and Ghahramani (2011), who provide a meticulous summary.
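For intuition, the following is a minimal sketch of the IBP as a generative procedure – the culinary construction described above. It only draws a binary matrix from the prior; it is not the posterior inference actually used in our model, and the α value and matrix sizes are purely illustrative.

```python
import numpy as np

def sample_ibp(num_predicates, alpha, seed=0):
    """Draw a binary matrix Z from the IBP prior via the culinary construction:
    customer i re-selects dish k with probability m_k / i, then serves herself
    Poisson(alpha / i) new dishes."""
    rng = np.random.default_rng(seed)
    dish_counts = []                         # m_k for each dish seen so far
    rows = []
    for i in range(1, num_predicates + 1):
        row = [1 if rng.random() < m_k / i else 0 for m_k in dish_counts]
        new_dishes = rng.poisson(alpha / i)  # brand-new latent slots
        row.extend([1] * new_dishes)
        dish_counts = [m + r for m, r in zip(dish_counts, row)] + [1] * new_dishes
        rows.append(row)
    Z = np.zeros((num_predicates, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(sample_ibp(num_predicates=6, alpha=2.0))
```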

5.2.3 Optimization

Since our objective in Equation 5.1 contains two distinct components, we can optimize using alternating maximization. Although convergence guarantees for this technique only exist for convex functions, it has proven successful even for non-convex problems (Jain et al., 2013; Netrapalli et al., 2013). We thus alternate between keeping Z fixed and optimizing the parameters V, U, S in the likelihood component of Section 5.2.1, and keeping V fixed and optimizing the parameter Z in the posterior regularization component of Section 5.2.2.

Intuitively, what we are doing is alternating between optimizing the local token-level and global type-level relations between predicates and their arguments. In particular, in one stage we are given a fixed template for our frame lexicon and we attempt to learn the best representations for predicates and their arguments. Then, in the next stage, we are given fixed representations for predicates and we attempt to learn the best template for the frame lexicon that captures regularities in these embeddings. We continue to alternate between these stages until some criterion is met.
In practice, the likelihood component is optimized using negative sampling with EM for the latent slots. In particular, we use hard EM to select a single slot before taking gradient steps with respect to the model parameters. This was shown to work well for Skip-gram style models with latent variables by Jauhar et al. (2015) – see Chapter 3. In the E-Step we find the best latent slot for a particular predicate-argument pair:

\hat{k} = \arg\max_{k} \Pr(a_i \mid p_i, z_{ik}, s_k; \theta) \qquad (5.5)

We follow this by making stochastic gradient updates to the model parameters U, V, S in the M-Step, using the negative sampling objective:

\mathcal{L} = \log z_{i\hat{k}} \, \sigma\bigl((v_i \odot s_{\hat{k}}) \cdot u_i\bigr) + \sum_{l} \mathbb{E}_{a_{i'} \sim \Pr_n(a)} \bigl[ \log z_{i\hat{k}} \, \sigma\bigl(-(v_i \odot s_{\hat{k}}) \cdot u_{i'}\bigr) \bigr] \qquad (5.6)

where σ(·) is the sigmoid function, Pr_n(a) is a unigram noise distribution over argument types, and l is the negative sampling parameter. The resulting objective is thus very similar to the original negative sampling objective of Skip-gram, as detailed in Mikolov et al. (2013a). The only difference is that we modify the target word representations (the predicates) with a weight vector (the slot-specific weights).
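The following is a minimal numpy sketch of one such hard-EM update for a single predicate-argument pair – the E-Step of Equation 5.5 followed by an M-Step in the spirit of Equation 5.6. The learning rate, the number of noise samples and the toy matrices are illustrative assumptions, not values from this chapter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sg_update(v, s, u, label, lr):
    """One negative-sampling gradient step on (v, s, u), updated in place;
    label is 1 for the true argument and 0 for a noise sample."""
    g = lr * (label - sigmoid((v * s) @ u))       # shared gradient scale
    dv, ds, du = g * s * u, g * v * u, g * (v * s)
    v += dv; s += ds; u += du

def train_step(i, a, V, U, S, Z, noise_args, lr=0.025):
    """Hard-EM step for the pair (predicate i, argument a); assumes predicate
    i has at least one active slot in Z."""
    active = np.flatnonzero(Z[i])
    # E-Step (Equation 5.5): pick the best-scoring active slot.
    k = active[np.argmax([(V[i] * S[kk]) @ U[a] for kk in active])]
    # M-Step (Equation 5.6): positive update, then updates on noise samples.
    sg_update(V[i], S[k], U[a], label=1, lr=lr)
    for a_neg in noise_args:                      # drawn from Pr_n(a)
        sg_update(V[i], S[k], U[a_neg], label=0, lr=lr)

rng = np.random.default_rng(0)
V, U, S = rng.normal(size=(3, 4)), rng.normal(size=(6, 4)), rng.normal(size=(2, 4))
Z = np.array([[1, 0], [1, 1], [0, 1]])
train_step(i=0, a=2, V=V, U=U, S=S, Z=Z, noise_args=[4, 5])
```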

As for optimizing the posterior regularization component, it is computationally intractable to perform IBP inference exactly. Therefore, an approximate inference technique such as Gibbs sampling must be used. In Gibbs sampling we iteratively sample individual z_ik terms from the posterior:

\Pr(z_{ik} \mid X, Z_{-ik}) \propto \Pr(X \mid Z) \cdot \Pr(z_{ik} \mid Z_{-ik}) \qquad (5.7)

where Z_{-ik} is the Markov blanket of z_ik in Z. The prior and likelihood terms are respectively those of Equations 5.3 and 5.4. Doshi-Velez and Ghahramani (2009) present an accelerated version of Gibbs sampling for this model that computes the likelihood and prior terms efficiently. We use this approach in our work, since it has the benefit of mixing like a collapsed sampler while maintaining the running time of an uncollapsed sampler.

In conclusion, the optimization steps iteratively refine the parameters V, U, S to be better predictors of the corpus, while Z is updated to best factorize the regularities in the predicate embeddings V , thereby capturing better relational slots. We thus jointly capture both the local and global levels of relational association between predicates and arguments.
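At a high level, the training procedure can be pictured as the following loop. This is a sketch only: the two inner routines stand in for the hard-EM negative-sampling pass of Section 5.2.1 and the accelerated Gibbs sampler of Section 5.2.2, and must be supplied by the caller.

```python
def alternating_train(corpus, V, U, S, Z, likelihood_step, ibp_step,
                      log_likelihood, num_iters=20):
    """Alternating maximization over the two components of Equation 5.1.
    likelihood_step, ibp_step and log_likelihood are caller-supplied; this
    sketch only shows the alternation and keeps the best model seen."""
    best = None
    for _ in range(num_iters):
        V, U, S = likelihood_step(corpus, V, U, S, Z)   # Z held fixed
        Z = ibp_step(V, Z)                              # V held fixed
        ll = log_likelihood(corpus, V, U, S, Z)
        if best is None or ll > best[0]:
            best = (ll, V.copy(), U.copy(), S.copy(), Z.copy())
    return best
```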

5.2.4 Relational Variant

In addition to the standard model introduced above, we also experiment with an extension where the input corpus consists of predicate-argument-relation triples instead of just predicate-argument pairs. These relations are observed relations, and should not be confused with the latent slots of the model. So, for example, instead of having a simple predicate-argument pair such as "eat, fork" we might have the predicate-argument-relation triple "eat, fork, R-with", indicating that the argument "fork" appears to the right of the predicate, connected by the preposition "with".
To accommodate this change we modify the argument embedding matrix U to be of dimensions m × d/2 and introduce a new q × d/2 embedding matrix R = {r_i | 1 ≤ i ≤ q} for the q observed relation types.

Then, wherever the original model calls for an argument vector u_i (which had dimensionality d), we instead replace it with a concatenated argument-relation vector [u_i; r_j] (which also has dimensionality d). During training, we must make gradient updates to R in addition to all the other model parameters as usual.
While this relation indicator can be used to capture arbitrary relational information, in this chapter we set it to a combination of the directionality of the argument with respect to the predicate (L or R), and the preposition immediately preceding the argument phrase (or None if there isn't one). Thus, for example, we have relational indicators such as "L-on", "R-before", "L-because", "R-None", etc. We obtain a total of 146 such relations. Note that, in keeping with the goals of this work, these relation indicators still require no tools to annotate (prepositions are closed-class words that can be enumerated).
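The sketch below illustrates how such a relation indicator could be read off a tokenized sentence. The tiny preposition list and the input format are assumptions for illustration; they are not the exact closed-class inventory used to obtain the 146 relations.

```python
# Construct a direction-plus-preposition relation indicator for one argument.
PREPOSITIONS = {"in", "on", "with", "before", "at", "to", "of", "because"}

def relation_indicator(tokens, pred_idx, arg_start):
    """tokens: sentence as a list of words; pred_idx: predicate position;
    arg_start: index of the first word of the argument phrase."""
    side = "L" if arg_start < pred_idx else "R"
    prev = tokens[arg_start - 1].lower() if arg_start > 0 else None
    prep = prev if prev in PREPOSITIONS else "None"
    return f"{side}-{prep}"

tokens = ["eat", "pasta", "with", "a", "fork"]
print(relation_indicator(tokens, pred_idx=0, arg_start=3))  # 'R-with'
print(relation_indicator(tokens, pred_idx=0, arg_start=1))  # 'R-None'
```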

5.3 Experiments and Evaluation

The task of evaluating the learned binary representations Z of predicates over latent relational arguments (see Figure 5.2), as well as the vector embeddings for predicates and arguments is a complicated one. The difficulty arises from the fact that there is no single gold standard of what the true set of relations should be. In what follows, we detail experimental results on two quantitative evaluation tasks: at the local token and global type levels of predicate-argument structure. In particular we evaluate on pseudo disambiguation of selectional preference, and semantic frame lexicon overlap. We also qualitatively inspect the learned latent relations against hand-annotated roles. We first specify the implementational details.

5.3.1 Implementational Details

We begin by pre-training standard skip-gram vectors (Mikolov et al., 2013a) on the NY-Times section of the Gigaword corpus, which consists of approximately 1.67 billion word tokens. These vectors are used as initialization for the embedding matrices V and U, before our iterative optimization. Specifically, we use the target word space of the skip-gram model to initialize the

predicate embeddings and the context word space of the model to initialize the argument embeddings. While this step is not strictly required, we found that it leads to generally better results than random initialization, given the relatively small size of our predicate-argument training corpus. Initialization helps because the two embedding matrices V and U are already somewhat semantically consistent with one another before training begins. Note that, in keeping with the aims of this chapter, this pre-training step requires no annotation or access to higher-level linguistic information whatsoever.
For training our models, we use a combination of the training data released for the CoNLL 2008 shared task (Surdeanu et al., 2008) and the extended PropBank release, which covers annotations of the Ontonotes (Hovy et al., 2006) and English Web Treebank (Bies et al., 2012) corpora. We reserve the test portion of the CoNLL 2008 shared task data for one of our evaluations. In this work, we only focus on verbal predicates. Our training data gives us a vocabulary of 4449 predicates, after pruning verbs that occur fewer than 5 times.
Then, from the training data we extract all predicate-argument pairs using gold standard argument annotations, for the sake of simplicity. Note that previous unsupervised frame induction work also uses gold argument mentions (Lang and Lapata, 2011a; Titov and Klementiev, 2012b). Our method, however, does not depend on this, or any other annotation, and we could as easily use the output from an automated system such as that of Abend et al. (2009) instead. However, to test the efficacy of our approach in a more controlled environment, we use gold argument mentions as done in previous work. In this manner, we obtain a total of approximately 3.35 million predicate-argument word pairs on which to train.
Using this data we train a total of 4 distinct models: a base model and a relational variant (see Section 5.2.4), both of which are trained with two different IBP hyperparameters of α = 0.35 and α = 0.7. The hyperparameter controls the avidity of the model for latent slots (a higher α implies a greater number of induced slots). We set our embedding size to d = 100. Training with these parameters results in a learned number of slots ranging from 17 to 30, with the conservative model averaging about 4 latent slots per word and the permissive model averaging about 6 latent slots per word.
Since our objective is non-convex, we record the training likelihood at each power iteration (which includes a single optimization step that alternates over both the predictive and IBP components of our objective), and save the model with the highest training likelihood.

5.3.2 Pseudo Disambiguation of Selectional Preference

The pseudo disambiguation task aims to evaluate our models' ability to capture predicate-argument knowledge at the local level. In this task, systems are presented with a set of triples: a predicate, a true argument and a fake argument. The systems are evaluated on the percentage of true arguments they are able to select. For example, given the triple (resign, post, liquidation), a successful model should rate the pair "resign-post" higher than "resign-liquidation".

Model       α      Variant      k slots   % Acc
Skip-gram   -      -            -         0.77
pa2IBPVec   0.35   Standard     17        0.81
pa2IBPVec   0.35   Relational   15        0.84
pa2IBPVec   0.7    Standard     27        0.81
pa2IBPVec   0.7    Relational   30        0.81

Table 5.1: Results on pseudo disambiguation of selectional preference. Numbers are in % accuracy of distinguishing true arguments from false ones. Our models all outperform the skip-gram baseline.

This task has often been used in the selectional preference modelling literature as a benchmark task (Rooth et al., 1999; Van de Cruys, 2014). To obtain the triples for this task we use the test set of the CoNLL 2008 shared task data. In particular, for every verbal predicate mention in the data we select a random nominal word from each of its argument phrase chunks to obtain a true predicate-argument word pair. Then, to introduce distractors, we sample a random nominal from a unigram noise distribution. In this way we obtain 9859 pseudo disambiguation triples as our test set.
Note that while the term "selectional preference" is used to describe this task, it is different from what is traditionally understood in linguistics as the selectional preference of a predicate for the head of an argument. Rather, it is the selectional preference of a predicate for any noun in the governed argument. Linguistically speaking this may seem astonishing, even bizarre. However, from the computational perspective of the models we evaluate, this is not only natural, but is actually based on the way they are trained. For example, the local predicate-argument affinity component of our model (see Section 5.2.1) breaks argument chunks into individual words, and attempts to maximize the likelihood of a predicate predicting any of these words. The standard skip-gram model (which is our baseline), on which our local likelihood component is based, does much the same thing.
From a linguistic perspective of selectional preference, the predictive capability of our models is simply their ability to distinguish predicate-argument word pairs that should be associated from randomly generated pairs that should not. Thus our task setup still works for the purposes of our evaluation because, given the way our models and the baseline (see below) are trained, they should be capable of distinguishing the affinity between the predicate and any governed word – not simply the head word.
We use our models to score a word pair by taking the probability of the pair under our model, using the best latent slot:

\max_{k} \; z_{ik} \, \sigma\bigl((v_i \odot s_k) \cdot u_i\bigr) \qquad (5.8)

where v_i and u_i are the predicate and argument embeddings respectively, z_ik is the binary indicator of the kth slot for the ith predicate, and s_k is the slot-specific weight vector. The argument in the higher scoring pair is selected as the correct one. In the relational variant, instead of using a single argument vector u_i we also take a max over the relation indicators, since the exact indicator is not observed at test time.
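Concretely, the decision rule can be sketched as follows; the parameters here are toy and randomly initialized, and the helper names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_score(i, a, V, U, S, Z):
    """Score of a predicate-argument pair under the best active latent slot,
    as in Equation 5.8."""
    active = np.flatnonzero(Z[i])
    if active.size == 0:
        return 0.0
    return max(sigmoid((V[i] * S[k]) @ U[a]) for k in active)

def disambiguate(i, a_true, a_fake, V, U, S, Z):
    """True if the model prefers the true argument over the distractor."""
    return pair_score(i, a_true, V, U, S, Z) > pair_score(i, a_fake, V, U, S, Z)

rng = np.random.default_rng(1)
d, n, m, K = 8, 3, 10, 4
V, U, S = rng.normal(size=(n, d)), rng.normal(size=(m, d)), rng.normal(size=(K, d))
Z = rng.integers(0, 2, size=(n, K))
print(disambiguate(0, a_true=2, a_fake=7, V=V, U=U, S=S, Z=Z))
```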

                                  Coarse              Fine
Model       α      Variant      PU    CO    F1      PU    CO    F1
Syntax      -      -            0.71  0.87  0.78    0.70  0.91  0.79
pa2IBPVec   0.35   Standard     0.76  0.89  0.82    0.76  0.97  0.85
pa2IBPVec   0.35   Relational   0.73  0.90  0.81    0.73  0.97  0.83
pa2IBPVec   0.7    Standard     0.79  0.91  0.85    0.79  0.98  0.87
pa2IBPVec   0.7    Relational   0.80  0.92  0.85    0.80  0.98  0.88

Table 5.2: Results on the lexicon overlap task. Our models outperform the syntactic baseline on all the metrics.

We compare our models against a standard skip-gram model (Mikolov et al., 2013a) trained on the same data. Word pairs in this model are scored using the dot product between their associated skip-gram vectors. This is a fair comparison since our models as well as the skip-gram model have access to the same data – namely predicates and their neighboring argument words – and are trained on their ability to discriminate true argument words from randomly sampled noise. The evaluation, then, is whether the additionally learned slot structure helps in differentiating true arguments from noise; that is, whether the set of induced semantic relations that give our frame lexicon its template structure actually helps in defining the local token level of frame knowledge.
The results of this evaluation are presented in Table 5.1. They show that all our models outperform the skip-gram baseline. This demonstrates that the added structural information gained from latent slots does in fact help our models to better capture predicate-argument affinities in local contexts.
The choice of latent slots or additional relation information does not seem to impact basic performance much, however. Neither the addition of the relational indicator nor the increase in the model hyperparameter α produces any significant difference in the overall score. This could be because of the trade-off that occurs when a more complex model is learned from the same amount of limited data.

5.3.3 Frame Lexicon Overlap

Next, we evaluate our models on their ability to capture global type-level predicate-argument structure. Previous work on frame induction has focussed on evaluating instance-based argument overlap with gold standard annotations in the context of semantic role labelling (SRL). Unfortunately, because our models operate on individual predicate-argument words rather than argument spans, a fair comparison becomes problematic.
But unlike previous work, which clusters argument instances, our approach produces a model as a result of training. Specifically, we produce the latent binary matrix Z (see Figure 5.2), which captures the presence or absence of individual relational arguments for each predicate in our vocabulary.
A hand-crafted lexicon such as PropBank gives us very similar information. It defines a set of (hand-crafted) relational arguments and then tells us, for every predicate in the vocabulary, which

of the relational arguments are accepted by that predicate. We can thus directly evaluate our models' latent slot factors against a gold standard frame lexicon.
Our evaluation framework is, in many ways, based on the metrics used in unsupervised SRL, except applied at the "type" lexicon level rather than the corpus-based "token" cluster level. In particular, given a gold frame lexicon Ω with K* real argument slots (i.e. the total number of possible humanly assigned arguments in the lexicon), we evaluate our models' latent slot matrix Z in terms of its overlap with the gold lexicon.
Intuitively, consider Ω to be a different binary matrix than the one pictured for Z in Figure 5.2. This different matrix has the same number of rows (in practice, we consider the intersection of the vocabularies of our model and the gold standard), but may have a different number of columns. The columns in each of the matrices can be thought of as different, possibly overlapping clusters over predicates. They effectively define which relations (gold or induced) are accepted by which predicates.
The idea, then, is to measure the degree of overlap between the gold clusters and the IBP-induced ones. To do this, we use the cluster purity and collocation metrics often used to evaluate clustering quality. We define purity as the average proportion of overlap between predicted latent slots and their maximally similar gold lexicon slots:

PU = \frac{1}{K} \sum_{k} \max_{k'} \frac{1}{n} \sum_{i} \delta(\omega_{ik'}, z_{ik}) \qquad (5.9)

where δ(·) is an indicator function. Given that the ω's and z's we compare are binary values, this indicator function is effectively an "XNOR" gate. Intuitively, the purity metric takes every column in the induced matrix Z, looks for its most similar column in the gold matrix Ω, and measures the overlap between the two. Similarity is itself defined as the sum of identical terms (essentially overlap) between the two column vectors under consideration. These overlap scores are then averaged over all the columns in Z.
Similarly, we define collocation as the average proportion of overlap between gold standard slots and their maximally similar predicted latent slots:

CO = \frac{1}{K^*} \sum_{k'} \max_{k} \frac{1}{n} \sum_{i} \delta(\omega_{ik'}, z_{ik}) \qquad (5.10)

Intuitively, the collocation metric is the same as purity, except applied in the reverse direction. That is, we take every column in the gold matrix Ω, look for its most similar column in the induced matrix Z, and measure the overlap between the two. The overlap scores are then averaged over all the columns in Ω. Note that the purity and collocation metrics are different because the number of columns in Z and Ω are (likely) different, and because the mapping between maximally similar columns is not one-to-one.
Given the purity and collocation metrics, we can also define the F1 score as the harmonic mean of the two:

F1 = \frac{2 \cdot CO \cdot PU}{CO + PU} \qquad (5.11)
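A minimal sketch of these three metrics over binary matrices follows; the toy matrices at the end are only for illustration.

```python
import numpy as np

def column_overlap(z_col, w_col):
    """Fraction of predicates on which two binary columns agree (the averaged
    XNOR indicator of Equations 5.9-5.10)."""
    return float(np.mean(z_col == w_col))

def purity(Z, Omega):
    return float(np.mean([max(column_overlap(Z[:, k], Omega[:, g])
                              for g in range(Omega.shape[1]))
                          for k in range(Z.shape[1])]))

def collocation(Z, Omega):
    return float(np.mean([max(column_overlap(Z[:, k], Omega[:, g])
                              for k in range(Z.shape[1]))
                          for g in range(Omega.shape[1])]))

def f1(Z, Omega):
    pu, co = purity(Z, Omega), collocation(Z, Omega)
    return 2 * pu * co / (pu + co)

Z = np.array([[1, 0], [1, 1], [0, 1]])               # induced slots (toy)
Omega = np.array([[1, 0, 0], [1, 0, 1], [0, 1, 1]])  # gold slots (toy)
print(purity(Z, Omega), collocation(Z, Omega), round(f1(Z, Omega), 3))
```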

In our experiments we use the frame files provided with the PropBank corpus (Palmer et al., 2005) as the gold standard. We derive two variants from the frame files. The first is a coarse-grained lexicon. In this case, we extract only the functional arguments of verbs in our vocabulary as gold standard slots. There are 16 such gold slots, including types such as "prototypical agent", "prototypical patient", "instrument", "benefactive", etc. Each of these 16 gold slots is mapped to an index. For every verb, the corresponding binary ω vector marks the existence or not of the different functional arguments according to the gold frame files.
The second variant is a fine-grained lexicon. Here, in addition to functional arguments we also consider the numerical argument with which each is associated, such as "ARG0", "ARG1", etc. So, for example, we might have slots such as "prototypical agent ARG0", "prototypical patient ARG1", etc. Note that a single functional argument may appear with more than one numerical slot with different verbs over the entire lexicon. So, for example, we may additionally also have the "prototypical agent ARG1" slot type. The fine-grained lexicon yields 72 gold slots.
We compare our models against a baseline inspired by the syntactic baseline often used for evaluating unsupervised SRL models. For unsupervised SRL, syntax has proven to be a difficult-to-outperform baseline (Lang and Lapata, 2014). This baseline is constructed by taking the 21 most frequent syntactic labels in the training data and associating each of them with a slot. All other syntactic labels are associated with a 22nd generic slot. Given these slots, we associate a verbal predicate with a specific slot if it takes on the corresponding syntactic argument in the training data. This process also yields a binary matrix with rows that correspond to predicates and columns that correspond to the 22 syntactic slots. We can then evaluate this binary matrix against the gold standard Ω in the same way as our own matrices, on the purity and collocation metrics described above.
The results on the lexicon overlap task are presented in Table 5.2. They show that our models consistently outperform the syntactic baseline on all metrics in both the coarse-grained and fine-grained settings. We conclude that our models are better able to capture predicate-argument structure at a global type level. Specifically, the semantic relations we induce to govern the template of our frame lexicon yield more semantically meaningful slots than those obtained by using syntax as a surrogate for semantics.
Inspecting and comparing the results of our different models seems to indicate that we perform better when our IBP posterior allows for a greater number of latent slots. This happens when the hyperparameter α = 0.7. Additionally, our models consistently perform better on the fine-grained lexicon than on the coarse-grained one. The former does not necessarily represent an easier benchmark, since there is hardly any difference in the F1 score of the syntactic baseline on the two lexicons.
Overall it would seem that allowing for a greater number of latent slots does help capture global predicate-argument structure better. This makes sense if we consider the fact that we are effectively trying to factorize a dense representation (the predicate embeddings) with IBP inference.

Latent slots (columns): 1, 2, 3, 5, 6, 8, 10, 12

Predicate   Majority roles in its active latent slots
provide     A0, A1, A2, A2
enter       A0, A1, AM-ADV
praise      A0, A1, A2
travel      A0, A0, AM-PNC, AM-TMP
distract    A0, A1, A2
overcome    AM-TMP, A0, A0

Table 5.3: Examples for several predicates with mappings of latent slots to the majority class of the closest argument vector in the shared embedded space.

Thus, allowing for a greater number of latent factors permits the discovery of greater structural consistency within these embeddings.
This finding does have some problematic implications, however. Increasing the IBP hyperparameter α arbitrarily represents a computational bottleneck, since inference scales quadratically with the number of latent slots K. There is also the problem of splitting argument slots too finely, which may result in optimizing purity at the expense of collocation. A solution to this trade-off between performance and inference time remains for future work.

5.3.4 Qualitative Analysis of Latent Slots

To better understand the nature of the latent slots induced by our model, we conduct an additional qualitative analysis. The goal of this analysis is to inspect the kinds of generalities about semantic roles that our model is able to capture from completely unannotated data.
Table 5.3 lists some examples of predicates and their associated latent slots. The latent slots are sorted according to their frequency (i.e. column sum in the binary slot matrix Z). We map each latent slot to the majority semantic role type – from training data – of the closest argument word to the predicate vector in the shared embedding space. The model for which we perform this qualitative analysis is the standard variant with the IBP hyperparameter set to α = 0.35; this model has 17 latent slots. Note that slots that do not feature for any of the verbs in the table are omitted for visual compactness.
There are several interesting trends to notice here. Firstly, the basic argument structure of predicates is often correctly identified, when matched against gold PropBank frame files. For example, the core roles of "enter" identify it as a transitive verb, while "praise", "provide" and "distract" are correctly shown as di-transitive verbs. Obviously the structure isn't always perfectly identified, as with the verb "travel", where we are missing both an "ARG1" and an "ARG2".
In certain cases a single argument type spans multiple slots – as with "A2" for "provide" and "A0" for "travel". This is not surprising, since there is no binding factor on the model to produce one-to-one mappings with hand-crafted semantic roles. Generally speaking, the slots represent distributions over hand-crafted roles rather than strict mappings. In fact, to expect a one-to-one mapping is unreasonable considering we use no annotations whatsoever.
Nevertheless, there is still some consistency in the mappings. The core arguments of verbs

– such as "ARG0" and "ARG1" – are typically mapped to the most frequent latent slots. This can be explained by the fact that the more frequent arguments tend to be the ones that are core to a predicate's frame. This is quite a surprising outcome of the model, considering that it is given no annotation about argument types. Of course, we do not always get this right, as can be seen with the case of "overcome", where a non-core argument occurs in the most frequent slot.
Since this is a data-driven approach, we identify non-core roles as well, if they occur with predicates often enough in the data. For example, we have the general-purpose "AM-ADV" argument of "enter", and the "ARG-PNC" and "ARG-TMP" (purpose and time arguments) of the verb "travel". In future work we hope to explore methods that might be able to automatically distinguish core slots from non-core ones.
In conclusion, our models show promise in that they are able to capture some interesting generalities with respect to predicates and their hand-crafted roles, without the need for any annotated data.

5.4 Conclusion and Future Work

In this chapter, we have focussed on a general problem of semantic representation learning where we induced a new set of relations rather than simply leveraging an existing one.
Specifically, we have presented a first attempt at learning an embedded frame lexicon from data, using minimal annotated information. Our approach revolved around jointly capturing local predicate-argument affinities with global slot-level consistencies. We modeled this approach with a novel integration between a predictive embedding model and the posterior of an Indian Buffet Process.
The result of this process is a fully embedded semantic lexicon that is able to compute expected affinities between predicates and arguments. Crucially, the template structure of this lexicon – that is, the set of relations that define the frames therein – is induced automatically from data.
We experimented with our model on two quantitative tasks, each designed to evaluate performance on capturing local token-level and global type-level predicate-argument structure respectively. On both tasks we demonstrated that our models are able to outperform baselines, thus indicating our ability to jointly model the local and global levels of information in predicate-argument structure. Additionally, we qualitatively inspected our induced latent slots and showed that we are able to capture some interesting generalities with respect to hand-crafted semantic role labels.
In the context of this thesis, this is an important contribution because we demonstrated that we can learn rich models of semantic representation not only when we have a set of well-defined relations to work with, but also in the more ambiguous case when relational information does not exist or annotated data with relations is not easily accessible.
There are several avenues of future work that are open to exploration. Rather than depend on gold argument mentions in training, an important future step would be to fully automate the pipeline to leverage much larger amounts of data. With this greater data size, one would likely no longer need to break down argument spans into individual words. Instead, entire argument spans could be modeled as chunks using an LSTM.
Currently the Indian Buffet Process inference remains a bottleneck of our current effort's

ability to explore the model space. Parallelizing the IBP inference and speeding up this process will allow us to explore more complex (and potentially better) models. For example, Doshi-Velez et al. (2009b), Campbell et al. (2015) and Zhang et al. (2017) all suggest approaches to perform distributed computations in IBP inference.
These are all natural extensions of the current preliminary work to improve the overall robustness and expressive power of the model. With this additional modeling power we hope to evaluate on downstream applications such as semantic role labelling and semantic parsing.
An important test for our work is its application to learning embedded lexicons in low-resource languages. There are two potential settings to explore. The first deals with a language or domain where there is sufficient data, but few or rudimentary tools to process this data (for example, the lack of a high-quality dependency parser). In this situation we expect the results we found in our English evaluations to transfer to the low-resource language, since our model is equipped to deal with precisely such a scenario – namely, one where a large amount of data with minimal annotations can be processed to learn an embedded lexicon.
If, however, there is insufficient data to begin with, our approach will be less directly applicable. While the results from this chapter indicate that our models do capture some interesting semantic trends beyond syntax, an embedded lexicon trained on a high-resource language is unlikely to directly port to a new domain or language. In this case domain adaptation or transfer learning will become an interesting avenue of future work to explore in order to make our proposed model feasible. In particular, pre-training our embeddings on a high-resource language and then tweaking them to fit the relations and structure of the low-resource language might be a first logical step.
Another potentially interesting, complex and challenging application for future exploration is the Winograd Schema (Levesque et al., 2011). This challenge presents a system with a pair of sentences containing some referential ambiguity, with a very minimal difference representing the means for disambiguation. For example:

1. The city council refused the demonstrators a permit because they feared violence.
2. The city council refused the demonstrators a permit because they advocated violence.

Here the change of the word feared to advocated changes the likely referent of the ambiguous pronoun they from "the city council" to "the demonstrators". While our embedded frame lexicon is currently incapable of handling such a complex task – being unable to represent argument chunks as well as being fairly noisy – one can imagine that a more robust future frame lexicon would help in a task like Winograd Schemas. In particular, the affinity between predicate and argument (span) embeddings might be used to compute preference scores of feared and city council versus feared and demonstrators (viz. advocated and city council versus advocated and demonstrators) to disambiguate the most likely referent of the pronominal agent of the predicate.
In any case, embedded frame lexicons represent a useful tool for many downstream applications and merit future exploration to improve their expressive power.

5.4.1 Five Point Thematic Summary

What was the semantic representation learning problem that we tried to solve?

We attempted to induce an embedded representation of a semantic lexicon. Rather than restrict our lexicon to a pre-defined set of manually crafted argument slots, we allowed these slots to be induced and represented latently. From the point of view of Definition 1 of Chapter 1.2, we set ourselves the goal of learning multiple semantic relations, one corresponding to each of the latent argument slots of the model. Specifically, the mapping from predicate-argument pairs to the set of reals was modulated differently for each of the argument slots. Our expectation was that these mappings would capture the relationships between predicates and arguments based on the different semantic roles they fulfill.
Notice that in this case the semantic relations we defined as the goals of our model are not explicit or humanly interpretable. But that's perfectly acceptable from the point of view of our relation-centric representation learning framework. What's important is that the relations can still be defined in terms of Definition 1 as mappings between object pairs (predicates and arguments) and the set of real numbers, and the assumption of a semantic property – in this case, the role-specific relations between predicates and arguments.

What linguistic information or resource did we use to solve this problem?

Since our goal was to learn the embedded semantic lexicon from minimally annotated data, we only assumed a corpus of predicate-argument pairs as input. In a variant of our model we also experiment with predicate-argument-preposition triples, but still did not require additional annotated data. In this work, we used gold-standard SRL parses to extract the predicate-argument pairs (or predicate-argument-preposition triples). However, our approach did not depend on gold-standard data at all; instead, predicate-argument pairs extracted by some automatic means would be just as suitable for our models.

How did we operationalize the desired semantic relation with contextual relations in the resource?

Our operationalization of the semantic slot relations was based on two assumptions. Firstly, we intuited that local predicate-argument structure can be captured by instance-based (i.e. token-level) affinity of predicates for their arguments. Meanwhile, global predicate-argument structure can be captured by corpus-aggregated (i.e. type-level) affinities of predicates for their arguments. The training data consisting of predicate-argument pairs already captures these intuitions.
Recall that Definition 2 defines contextual relations as mappings to real numbers, over sets of target and context objects. For the local token-level context, our sets of target and context objects were predicates and arguments respectively. The shared property of this contextual relation was simply existence in the training data. In other words, our contexts were simply predicate-argument pairs.

Meanwhile, the global predicate-argument structure was an aggregation of these local contexts, and therefore did not require the definition of a separate contextual relation.

How did we use the contextual relations in representation learning?

We modeled our intuitions about the local token-level and global type-level predicate-argument affinities by a regularized maximum-likelihood model. We used this model to assign weights to target-context pairs generated by the contextual relation defined above. This model consisted of two parts: a maximum-likelihood component that we designated to capture the local token-level affinities, and a regularizer that we designated to capture the global type-level affinities. Section 5.2 detailed the two parts of our model.
The contextual relation previously defined was used to bias the local maximum-likelihood component. In particular, we learned representations of predicates and arguments that were likely to predict one another if they appeared together in training, while being unlikely to do the same if they didn't. Our regularizer aggregated the contexts from the local component to induce the type-level semantic argument slots. However, the aggregation was indirect in the sense that the regularizer only dealt with the predicate embeddings – which we assume have aggregated and encoded the affinities from all the local contexts.

What was the result of biasing the learner with contextual relations in learning the semantic model?

We ran experiments to test our embedded lexicon's ability to evaluate predicate-argument affinities at both the local token level and the global type level. In particular, we evaluated on the tasks of pseudo-disambiguation of selectional preference and lexicon overlap. We showed the results of these experiments in Sections 5.3.2 and 5.3.3.
Our findings revealed that in both cases, the semantic relations consisting of the latent role slots in the lexicon measurably improved over baselines that did not contain this sort of information. Thus, we concluded that biasing our learner with contextual relations defined over predicate-argument pairs does in fact lead to representations that are able to capture useful latent role slots. Moreover, in an auxiliary qualitative analysis (see Section 5.3.4) we found that the slots do capture some interesting generalities when compared with manually defined, humanly interpretable slots.

Chapter 6

Structured Distributional Semantics

Syntactic and Latent Semantic Tensor Models for Non-constant and Non-unique

Representation Learning

Most current methods for lexical representation learning yield a vector space model (Turney and Pantel, 2010) in which words are represented as distinct, discrete and static points in some high dimensional space. The most fundamental operation in such semantic vector spaces is comparison of meaning, in which the distance of two points is measured according to a distance metric. However, the meaning of a word is neither constant nor unique. Meaning depends on the context in which it occurs; in other words it depends on the relations that bind it to its compositional neighborhood. Consider the following fragment containing the predicate “killed”:

Example: [Ag Booth] [Pred killed] [Pat Lincoln]

In this fragment, relational attachment matters: it specifies that Booth was the one doing the killing while Lincoln was the one suffering the action. It is not equivalent to the fragment "Lincoln killed Booth". However, in a vector space model the word "killed" is learned beforehand. It is fixed to a single point in vector space, typically capturing the bag-of-words statistics of "killed". It does not have the power to change or accommodate an Agent or Patient argument, for example.

Meaning is also not unique. It depends on what a word is compared to for proximity of meaning, and more specifically along what relational facet the comparison is being made. Take the following disambiguation triple:

Example: shoe → tuxedo, football

Is the word "shoe" more similar to "tuxedo" or "football"? The answer really depends on the facet along which we are making the comparison. Specifically, if we wish to know similarity along the article-of-clothing facet, we would say "shoe" and "tuxedo" are more similar.

Similarly, if what we cared about instead was the sports facet, we would say "shoe" and "football" were more similar. Again, a single point in vector space cannot typically be decomposed into its relational facets to capture finer-grained semantics along specific dimensions of interest.

These two problems, context-sensitivity and facetedness in semantics, are the representation problems that we tackle in this chapter. We introduce two different models, each of which tackles one of the two problems. However, both problems are defined by semantic relations: the one to neighborhood structure, the other to comparative concepts. Thus the solutions that we present are also closely related. Their approaches fall under an umbrella paradigm we call Structured Distributional Semantics. This paradigm is similar to the work on Distributional Memory proposed by Baroni and Lenci (2010). Structured Distributional Semantics aims to improve upon simple vector space models of semantics by hypothesizing that the meaning of a word is captured more effectively through its relational — rather than its raw distributional — signature. Accordingly, it extends the vector space paradigm by structuring elements with relational information that decomposes distributional signatures over discrete relation dimensions. The resulting representation for a word is a semantic tensor (or matrix) rather than a semantic vector. The easiest way to think of these representations is to associate them with semantic frames: each word is represented by a frame tensor, with each frame element being captured by a specific row in the learned semantic matrix.

The move from simple Vector Space Models (VSMs) to Structured Distributional Semantic Models (SDSM) represents a perspective on semantic representation learning that aligns with the theme of this thesis, namely that relations drive and give meaning to concepts. This view is not new and has gained much popularity with the field of frame semantics (Baker et al., 1998). In this chapter we will show that decomposing a single word vector into distinct relation dimensions results in more expressive and empirically superior models for several tasks.

An important choice with an SDSM is the nature of the discrete relation dimensions, or the frame elements of the semantic tensors – in practice the rows of the matrix. This choice not only determines the process by which an SDSM is learned, but also what information is represented and the resulting strengths and weaknesses of the model. Our two models differ in this choice.

The first model is a Syntactic SDSM that represents meaning as distributions over relations in syntactic neighborhoods. This model tackles the problem of contextual sensitivity with a dynamic SDSM that constructs semantic tensors for words and their relational arguments. We argue that our model approximates meaning in compositional configurations more effectively than standard distributional vectors or bag-of-words models. We test our hypothesis on the problem of judging event coreferentiality, which involves compositional interactions in the predicate-argument structure of sentences, and demonstrate that our model outperforms both state-of-the-art window-based word embeddings as well as simple approaches to compositional semantics previously shown in the literature.

The second model is a Latent Semantic SDSM whose relational dimensions are learned over neighborhood patterns using Latent Relational Analysis (LRA).
In comparison to the dynamic Syntactic SDSM, this model is static due to its latent nature and therefore does not model compositional structures. However, it frees the model from its reliance on syntax (or any other set of pre-defined relations, for that matter), and allows the data to dictate an optimal, albeit latent, set of relations. The result is a model that is much more effective at representing single-word units and at tackling the problem of facetedness in the comparison of relational facets. Evaluation of our model yields results that significantly outperform several other distributional approaches (including the Syntactic SDSM) on two semantic tasks, and that are competitive on a third relation classification task.

Through both models we re-affirm the central claim of this thesis, namely that by integrating structured relational information we can learn more expressive and empirically superior models of semantics.

6.1 Distributional Versus Structured Distributional Semantics

We begin by formalizing the notion of a Distributional Semantic Model (DSM) as a vector space, and its connection and extension to a tensor space in a Structured Distributional Semantic Model (SDSM). Consider a vocabulary $\Sigma$ that contains $k$ distinct word types $\Sigma = \{w_1, w_2, ..., w_k\}$. The goal of both DSM and SDSM is to represent the words of this vocabulary so that element pairs that are semantically similar – by some notion of similarity – are close together. The only difference between the two model paradigms is that while the former represents elements as vectors, the latter extends the representation space to tensors.

Formally, a DSM is a vector space $V$ that contains $|\Sigma|$ elements in $\mathbb{R}^n$. Every vocabulary word $w_i$ has an associated semantic vector $\vec{v}_i$ representing its distributional signature. Each of the $n$ elements of $\vec{v}_i$ is associated with a single dimension of its distribution. This dimension may correspond to another word — which may or may not belong to $\Sigma$ — or a latent dimension as might be obtained from an SVD projection or an embedding learned via a deep neural network. Additionally, each element in $\vec{v}_i$ is typically a normalized co-occurrence frequency count, a PMI score, or a number obtained from an SVD or RNN transformation. The semantic similarity between two words $w_i$ and $w_j$ in a DSM is the vector distance defined in terms of their associated distributional vectors:

$$sim(\vec{v}_i, \vec{v}_j) = \vec{v}_i \cdot \vec{v}_j \qquad (6.1)$$

Often, a cosine distance is used as well (which is essentially a normalized dot-product).

An SDSM is an extension of a DSM. Formally, it is a space $U$ that contains $|\Sigma|$ elements in $\mathbb{R}^{d \times n}$. Every vocabulary word $w_i$ has an associated semantic tensor $\vec{\vec{u}}_i$, which is itself composed of $d$ vectors $\vec{u}_i^1, \vec{u}_i^2, ..., \vec{u}_i^d$, each having $n$ dimensions. Every vector $\vec{u}_i^l \in \vec{\vec{u}}_i$ represents the distributional signature of the word $w_i$ in a relation (or along a facet) $r_l$. The $d$ relations of the SDSM may be syntactic, semantic, or latent. The $n$-dimensional relational vector $\vec{u}_i^l$ is configurationally the same as a vector $\vec{v}_i$ of a DSM. This definition of an SDSM closely relates to an alternate view of Distributional Memories (DMs) (Baroni and Lenci, 2010), where the semantic space is a third-order tensor whose modes are Word $\times$ Link $\times$ Word.

The semantic similarity between two words $w_i$ and $w_j$ in an SDSM is the similarity function $sim(\vec{\vec{u}}_i, \vec{\vec{u}}_j)$ defined on their associated semantic tensors. We use the following decomposition of the similarity function:

$$sim(\vec{\vec{u}}_i, \vec{\vec{u}}_j) = \frac{1}{d} \sum_{l=1}^{d} \vec{u}_i^l \cdot \vec{u}_j^l \qquad (6.2)$$

Mathematically, this corresponds to the ratio of the normalized Frobenius product of the two matrices representing $\vec{\vec{u}}_i$ and $\vec{\vec{u}}_j$ to the number of rows in both matrices. Intuitively, it is simply the average relation-wise similarity between the two words $w_i$ and $w_j$.
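A minimal numerical sketch of this tensor similarity, assuming each word is stored as a d × n NumPy array with one row per relation dimension (the optional row-wise cosine normalization is an added assumption, not part of Equation 6.2):

```python
import numpy as np

def sdsm_similarity(U_i, U_j, cosine=False):
    """Average relation-wise similarity between two SDSM tensors (Eq. 6.2).

    U_i, U_j: arrays of shape (d, n), one row per relation dimension.
    If cosine=True, each row is length-normalized before the dot product.
    """
    if cosine:
        U_i = U_i / np.maximum(np.linalg.norm(U_i, axis=1, keepdims=True), 1e-12)
        U_j = U_j / np.maximum(np.linalg.norm(U_j, axis=1, keepdims=True), 1e-12)
    # Row-wise dot products, averaged over the d relation dimensions.
    return float(np.mean(np.sum(U_i * U_j, axis=1)))
```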

6.2 Syntactic Structured Distributional Semantics

We begin by describing our first SDSM, which tackles the problem of contextual variability in semantics with a representation over syntactic relational dimensions.

Systems that use window-based DSMs implicitly make a bag-of-words assumption: that the meaning of a phrase can be reasonably estimated from the meaning of its constituents, agnostic of ordering. However, semantics in natural language is a compositional phenomenon, encompassing interactions between syntactic structures and the meaning of lexical constituents. It follows that the DSM formalism lends itself poorly to composition, since it implicitly disregards syntactic structure. For instance, returning to the example presented in the introduction to this chapter, the representations for "Lincoln", "Booth", and "killed" when merged produce the same result regardless of whether the input is "Booth killed Lincoln" or "Lincoln killed Booth". As suggested by Pantel and Lin (2000) and others, modeling the distribution over preferential attachments for each syntactic relation separately can yield greater expressive power. Attempts have been made to model linguistic composition of individual word vectors (Mitchell and Lapata, 2009), as well as to remedy the inherent failings of the standard distributional approach (Erk and Padó, 2008). The results show varying degrees of efficacy, and have been only partially successful in modelling deeper lexical semantics or compositional expectations of words and word combinations.

The Syntactic SDSM we propose is an extension of traditional DSMs. This extension explicitly preserves structural information and permits the approximation of distributional expectations over distinct dependency relations. It is also a dynamic model, meaning that statistics are stored in decomposed form as dependency triples, only to be composed into representations for arbitrary compositional structures as and when desired. This is in contrast to a standard VSM, which learns its representations once and is then used in fixed form for all downstream semantic computations. In the new model the tensor representing a word is a distribution over relation-specific syntactic neighborhoods. In other words, the SDSM representation of a word or phrase is several vectors defined over the same vocabulary, each vector representing the word or phrase's selectional preferences for a different syntactic argument.

We show that this representation is comparable to a single-vector window-based DSM at capturing the semantics of individual words, but significantly better at representing the semantics of composed units. Our experiments on individual word semantics are conducted on the two tasks of similarity scoring and synonym selection. For evaluating compositional units, we turn to the problem of event coreference and the representation of predicate-argument structures. We use two different event coreference corpora and show that our Syntactic SDSM achieves greater predictive accuracy than simplistic compositional approaches, as well as a strong baseline that uses window-based word embeddings trained via deep neural networks.

6.2.1 Related Work

Recently, a number of studies have tried to model a stronger form of semantics by phrasing the problem of DSM compositionality as one of vector composition. These techniques derive the meaning of the combination of two words a and b by a single vector c = f(a, b). Mitchell and Lapata (2008) propose a framework to define the composition c = f(a, b, r, K), where r is the relation between a and b, and K is some additional knowledge used to define composition. While the framework is quite general, most models in the literature tend to disregard K and r, and are generally restricted to component-wise addition and multiplication on the vectors to be composed, with slight variations. Dinu and Lapata (2010) and Séaghdha and Korhonen (2011) introduced probabilistic models that represent word meanings with latent variables. Subsequently, other high-dimensional extensions by Rudolph and Giesbrecht (2010), Baroni and Zamparelli (2010) and Grefenstette et al. (2011), regression models by Guevara (2010), and recursive neural network based solutions by Socher et al. (2012) and Collobert et al. (2011) have been proposed.

Pantel and Lin (2000) and Erk and Padó (2008) attempted to include syntactic context in distributional models. However, their approaches do not explicitly construct phrase-level meaning from words, which limits their applicability to modelling compositional structures. A quasi-compositional approach was also attempted in Thater et al. (2010) by a systematic combination of first and second order context vectors. Perhaps the most similar work to our own is Baroni and Lenci (2010), who propose a Distributional Memory that stores syntactic contexts to dynamically construct semantic representations. However, to the best of our knowledge the formulation of composition we propose is the first to account for K and r within the general framework of composition c = f(a, b, r, K), as proposed by Mitchell and Lapata (2008).

6.2.2 The Model

In this section, we describe our Structured Distributional Semantic framework in detail. We first build a large knowledge base from sample English texts and use it to represent basic lexical units. Next, we describe a technique for obtaining the representation of larger units by composing their constituents.

6.2.2.1 The PropStore

To build a lexicon of SDSM representations for a given vocabulary, we construct a proposition knowledge base (the PropStore) by processing the text of Simple English Wikipedia through a dependency parser. Dependency arcs are stored as 3-tuples of the form $\langle w_1, r, w_2 \rangle$, denoting occurrences of words $w_1$ and $w_2$ related by the syntactic dependency $r$. We also store sentence identifiers for each triple, for reasons described later. In addition to the words' surface forms, the PropStore also stores their POS tags, lemmas, and WordNet super-senses.

Figure 6.1: A sample parse fragment and some examples of triples that we extract from it. Triples are generated from each dependency arc by taking the cross-product of the annotations that the nodes connected to the arc contain.

The PropStore can be used to query for preferred expectations of words, super-senses, relations, etc., around a given word. In the example in Figure 6.1, the query (SST(w1) = verb.consumption, ?, dobj), i.e. "what is consumed", might return expectations [pasta:1, spaghetti:1, mice:1 ...]. In our implementation, the relations and POS tags are obtained using the Fanseparser (Tratz and Hovy, 2011), super-sense tags using sst-light (Ciaramita and Altun, 2006), and lemmas are obtained from WordNet (Miller, 1995).
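The following is a minimal sketch of how such a triple store and its expectation queries could be organized; the class and method names are hypothetical and only illustrate the bookkeeping described above (triples keyed by dependency relation, with sentence identifiers retained for later composition).

```python
from collections import defaultdict

class PropStore:
    """Toy proposition store: (w1, rel, w2) -> set of sentence ids."""

    def __init__(self):
        self.triples = defaultdict(set)

    def add(self, w1, rel, w2, sent_id):
        # One entry per annotation layer (surface form, lemma, POS, SST)
        # would be added here in the full cross-product scheme.
        self.triples[(w1, rel, w2)].add(sent_id)

    def expectations(self, w1, rel):
        """Counts of words w2 observed as (w1, rel, w2), e.g. 'what is consumed'."""
        counts = defaultdict(int)
        for (a, r, b), sents in self.triples.items():
            if a == w1 and r == rel:
                counts[b] += len(sents)
        return dict(counts)

store = PropStore()
store.add("eat", "dobj", "pasta", sent_id=1)
store.add("eat", "dobj", "spaghetti", sent_id=2)
print(store.expectations("eat", "dobj"))  # {'pasta': 1, 'spaghetti': 1}
```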

6.2.2.2 Building the Representation

Next, we describe a method to represent lexical entries as structured distributional matrices using the PropStore. Recall from Section 6.1 that the representation of a concept $w_i$ (word, phrase, etc.) in the SDSM framework is a matrix $\vec{\vec{u}}_i$. In the Syntactic SDSM, we differentiate between two forms of representation: the canonical form and the representational form.

In the canonical form, the elements $u_i^{lj}$ of the SDSM matrix are lists of sentence identifiers obtained by querying the PropStore. Specifically, these identifiers are obtained by looking for syntactic contexts in which the word $w_i$ appears with the word $w_j$, linked by a dependency relation $l$. The canonical form, as we shall see in what follows, allows us to mimic compositionality by intersecting sentence identifier sets and finding expected counts of composed units.

In the representational form, the elements $u_i^{lj}$ of the SDSM matrix are counts.

Specifically, the counts are obtained from the cardinality of the corresponding cells of the canonical form of the SDSM. This cardinality is the content of every cell of the representational matrix and denotes the frequency of co-occurrence of a concept and a word under the given relational constraint.

This representational matrix form can be interpreted in several different ways, each interpretation based on a different normalization scheme.

1. Row Norm: each row of the matrix is interpreted as a distribution over words that attach to the target concept with the given dependency relation. The elements of the matrix become:

$$u_i^{lj} = \frac{u_i^{lj}}{\sum_{j' \in \vec{u}_i^l} u_i^{lj'}} \quad \forall i, j, l$$

2. Full Norm: the entire matrix is interpreted as a distribution over the word-relation pairs which can attach to the target concept. The elements of the matrix become:

$$u_i^{lj} = \frac{u_i^{lj}}{\sum_{l', j' \in \vec{\vec{u}}_i} u_i^{l'j'}} \quad \forall i, j, l$$

3. Collapsed Vector Norm: the columns of the matrix are collapsed to form a standard normalized distributional vector trained on dependency relations rather than sliding windows. The elements of the matrix become:

$$u_i^{j} = \frac{\sum_{l', j' \in \vec{\vec{u}}_i} u_i^{l'j'} \, \delta(j = j')}{\sum_{l', j' \in \vec{\vec{u}}_i} u_i^{l'j'}} \quad \forall i, j, l$$

where $\delta$ is the indicator function. Note that these normalization schemes only apply to representational SDSMs and cannot be applied to their corresponding canonical forms (which contain sentence identifiers instead of counts).
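As an illustration, here is a small NumPy sketch of the three normalization schemes applied to a (d × n) representational matrix of counts; it is a direct transcription of the formulas above rather than code from the thesis.

```python
import numpy as np

def row_norm(U):
    """Each row becomes a distribution over words for one dependency relation."""
    row_sums = U.sum(axis=1, keepdims=True)
    return np.divide(U, row_sums, out=np.zeros_like(U, dtype=float), where=row_sums > 0)

def full_norm(U):
    """The whole matrix becomes a distribution over word-relation pairs."""
    total = U.sum()
    return U / total if total > 0 else U.astype(float)

def collapsed_vector_norm(U):
    """Columns are summed into a single dependency-based distributional vector."""
    total = U.sum()
    col_sums = U.sum(axis=0)
    return col_sums / total if total > 0 else col_sums.astype(float)

U = np.array([[2, 0, 1],   # e.g. nsubj row
              [0, 3, 0]])  # e.g. dobj row
print(row_norm(U))
print(full_norm(U))
print(collapsed_vector_norm(U))
```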

6.2.3 Mimicking Compositionality

For representing intermediate multi-word phrases, we extend the above word-relation matrix symbolism in a bottom-up fashion. The combination hinges on the intuition that when lexical units combine to form a larger syntactically connected phrase, the representation of the phrase is given by its own distributional neighborhood within the embedded parse tree. The distributional neighborhood of the resulting phrase can be computed using the PropStore, given syntactic relations anchored on its parts. For the example in Figure 6.1, we can compose SST(w1) = Noun.person and Lemma(w2) = eat with the relation 'nsubj' to obtain expectations around "people eat", yielding [pasta:1, spaghetti:1 ...] for the object relation ([dining room:2, restaurant:1 ...] for the location relation, etc.); see Figure 6.2. Larger phrasal queries can be built to answer questions like "What do people in China eat with?", "What do cows do?", etc. All of this helps us account for both the relation r and the knowledge K obtained from the PropStore within the compositional framework c = f(a, b, r, K).

The general outline for obtaining the composition of two words is given in Algorithm 3. Here, we first determine the sentence indices where the two words w1 and w2 occur with relation rl. Then, we return the expectations around the two words within these sentences. Note that the entire algorithm can conveniently be written in the form of database queries to our PropStore.

Figure 6.2: A depiction of the composition of two words. After composition the joint unit is treated as a single node, and the expectations of that combined node are then computed from the PropStore.

Algorithm 3 ComposePair(w1, rl, w2)
  Canonical ū1 ← queryMatrix(w1)
  Canonical ū2 ← queryMatrix(w2)
  SentIDs ← ū1^l ∩ ū2^l
  return Representational((ū1 ∩ SentIDs) ∪ (ū2 ∩ SentIDs))

Similar to the two-word composition process, given a parse subtree T of a phrase, we can obtain its matrix representation of empirical counts over word-relation contexts. Let $E = \{e_1, \ldots, e_n\}$ be the set of edges in T, and specifically $e_i = (w_{i,1}, r_l, w_{i,2})\ \forall i = 1 \ldots n$. The procedure for obtaining the compositional representation of T is given in Algorithm 4.

Algorithm 4 ComposePhrase(T)
  SentIDs ← {i | ∀ sentence i ∈ Corpus}
  for ei ∈ E do
    Canonical ūi,1 ← queryMatrix(wi,1)
    Canonical ūi,2 ← queryMatrix(wi,2)
    SentIDs ← SentIDs ∩ (ūi,1^l ∩ ūi,2^l)
  end for
  return Representational((ū1,1 ∩ SentIDs) ∪ (ū1,2 ∩ SentIDs) ∪ · · · ∪ (ūn,1 ∩ SentIDs) ∪ (ūn,2 ∩ SentIDs))
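A simplified Python sketch of the pair-composition step (Algorithm 3), assuming the PropStore is available as a plain dictionary from (w1, rel, w2) triples to sets of sentence identifiers; the "_inv" relation label and the way the sentence restriction is computed are simplifying assumptions, and the representational output is returned as a dictionary of counts.

```python
from collections import defaultdict

def query_matrix(triples, w):
    """Canonical SDSM of w: (relation, neighbor) -> sentence ids of co-occurrence.
    Inverse attachments get a hypothetical '_inv' suffix on the relation label."""
    canon = defaultdict(set)
    for (head, rel, dep), sents in triples.items():
        if head == w:
            canon[(rel, dep)] |= sents
        if dep == w:
            canon[(rel + "_inv", head)] |= sents
    return canon

def compose_pair(triples, w1, rel, w2):
    """Simplified ComposePair: keep only sentences in which w1 and w2 are directly
    linked by rel, then count the restricted contexts of both words (representational form)."""
    sent_ids = set(triples.get((w1, rel, w2), set()))
    composed = defaultdict(int)
    for word in (w1, w2):
        for key, sents in query_matrix(triples, word).items():
            overlap = len(sents & sent_ids)
            if overlap:
                composed[key] += overlap
    return dict(composed)

triples = {
    ("eat", "nsubj", "people"): {1, 2},
    ("eat", "dobj", "pasta"): {1},
    ("eat", "dobj", "spaghetti"): {2},
}
# Expectations around the composed unit "people eat".
print(compose_pair(triples, "eat", "nsubj", "people"))
```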

6.2.3.1 Tackling Sparsity

The SDSM model reflects syntactic properties of language through preferential filler constraints. But by distributing counts over a set of relations, the resultant SDSM representation is comparatively much sparser than the DSM representation of the same word. In this section we present some ways to address this problem.

Sparse Back-off

The first technique for tackling sparsity is to back off to progressively more general levels of linguistic granularity when sparse matrix representations for words or compositional units are encountered, or when the word or unit is not in the lexicon. For example, the composition "Balthazar eats" cannot be directly computed if the named entity "Balthazar" does not occur in the PropStore's knowledge base. In this case, a query for a super-sense substitute – "Noun.person eat" – can be issued instead. When super-senses themselves fail to provide numerically significant counts for words or word combinations, a second back-off step involves querying for POS tags. With coarser levels of linguistic representation, the expressive power of the distributions becomes diluted, but this is often necessary to handle rare words. Note that this is an issue with DSMs too.

Densification

In addition to the back-off method, we also propose a secondary method for "densifying" distributions. A concept's distribution is modified by using words encountered in its syntactic neighborhood to infer counts for other semantically similar words. In other terms, given the matrix representation of a concept, densification seeks to populate its null columns (each of which represents a word-dimension in the structured distributional context) with values weighted by their scaled similarities to words (or effectively word-dimensions) that actually occur in the syntactic neighborhood.

For example, suppose the word "play" had an "nsubj" preferential vector that contained the following counts: [cat:4 ; Jane:2]. One might then populate the column for "dog" in this vector with a count proportional to its similarity to the word cat (say 0.8), thus resulting in the vector [cat:4 ; Jane:2 ; dog:3.2]. These counts could just as well be probability values or PMI associations (suitably normalized). In this manner, the k most similar word-dimensions can be densified for each word that actually occurs in a syntactic context. As with sparse back-off, there is an inherent trade-off between the degree of densification k and the expressive power of the resulting representation.
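A small sketch of this densification step, assuming a `most_similar(word, k)` helper (hypothetical here) that returns the k nearest words with similarity scores from some auxiliary thesaurus or embedding space:

```python
def densify(row, most_similar, k=5):
    """Fill null columns of one relation row with counts inferred from similar words.

    row: dict mapping context word -> count, e.g. {"cat": 4, "Jane": 2}
    most_similar: assumed helper, most_similar(word, k) -> [(word, similarity), ...]
    """
    dense = dict(row)
    for word, count in row.items():
        for neighbor, sim in most_similar(word, k):
            if neighbor not in dense:
                # Scaled count for a word-dimension that never occurred.
                dense[neighbor] = count * sim
    return dense

# Toy similarity list reproducing the "play" example from the text.
def toy_most_similar(word, k):
    return [("dog", 0.8)] if word == "cat" else []

print(densify({"cat": 4, "Jane": 2}, toy_most_similar))  # {'cat': 4, 'Jane': 2, 'dog': 3.2}
```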

Dimensionality Reduction

The final method tackles the problem of sparsity by reducing the representation to a dense low-dimensional word embedding using singular value decomposition (SVD). In a typical term-document matrix, SVD finds a low-dimensional approximation of the original matrix in which columns become latent concepts while the similarity structure between rows is preserved. The PropStore, as described in Section 6.2.2.1, is an order-3 tensor with w1, w2 and rel as its three axes. We explore the following two possibilities for performing dimensionality reduction with SVD.

Word-word matrix SVD: In this method, we preserve the axes w1 and w2 and ignore the relational information. We follow the standard SVD regime $W = U \Sigma V^T$, where $\Sigma$ is a square diagonal matrix of the k largest singular values, and U and V are matrices with dimensionality $m \times k$ and $n \times k$ respectively. We adopt the matrix U as the compacted concept representation.

Tensor SVD: To remedy the relation-agnostic nature of the word-word SVD representation, we use tensor SVD (Vasilescu and Terzopoulos, 2002) to preserve relational structure information. The mode-n vectors of an order-N tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times ... \times I_N}$ are the $I_n$-dimensional vectors obtained from $\mathcal{A}$ by varying the index $i_n$ while keeping the other indices fixed. The matrix formed by all the mode-n vectors is a mode-n flattening of the tensor. To obtain compact representations of concepts we thus first apply mode-w1 flattening and then perform SVD on the resulting matrix.
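The sketch below illustrates both options with NumPy, assuming the PropStore counts have been materialized as a dense array `T` of shape (|V|, |R|, |V|) (words × relations × words); a real implementation would use sparse matrices, and the variable names and toy data are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.poisson(0.1, size=(50, 8, 50)).astype(float)  # toy (w1, rel, w2) count tensor
k = 10

# Option 1: word-word matrix SVD, ignoring the relation axis.
W = T.sum(axis=1)                       # collapse relations -> (|V|, |V|)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
word_embeddings = U[:, :k]              # compacted concept representations

# Option 2: mode-w1 flattening followed by SVD, preserving relational structure.
M = T.reshape(T.shape[0], -1)           # (|V|, |R| * |V|) mode-w1 flattening
U2, S2, Vt2 = np.linalg.svd(M, full_matrices=False)
structured_embeddings = U2[:, :k]

print(word_embeddings.shape, structured_embeddings.shape)  # (50, 10) (50, 10)
```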

6.2.4 Single Word Evaluation

In this section we describe experiments and results for judging the expressive power of the structured distributional representation for individual words. We use a similarity scoring task and a synonym selection task for the purpose of this evaluation. We compare the SDSM representation to standard window-based distributional vectors trained on the same corpus (Simple English Wikipedia). We also experiment with the different normalization techniques outlined in Section 6.2.2.2, which effectively lead to structured distributional representations with distinct interpretations. We experimented with various similarity metrics and found that the normalized city-block distance metric provides the most stable results.

$$\text{City-block}(X,Y) = \frac{\arctan(d(X,Y))}{d(X,Y)} \qquad (6.3)$$

$$d(X,Y) = \frac{1}{|R|} \sum_{r \in R} d(X_r, Y_r) \qquad (6.4)$$

Results in the rest of this section are thus reported using the normalized city-block metric. We also report experimental results for the two methods of alleviating sparsity discussed in Section 6.2.3.1, namely densification and SVD.
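Read as a similarity, the metric averages a per-relation distance over the |R| rows of the two SDSM matrices and then squashes it with arctan. A small NumPy sketch, assuming the per-relation distance d(X_r, Y_r) is the L1 (city-block) distance between corresponding rows:

```python
import numpy as np

def normalized_cityblock(X, Y):
    """Normalized city-block similarity between two (|R|, n) SDSM matrices
    (Equations 6.3-6.4), assuming an L1 distance between corresponding rows."""
    d = np.abs(X - Y).sum(axis=1).mean()   # Eq. 6.4: average relation-wise distance
    if d == 0.0:
        return 1.0                          # limit of arctan(d)/d as d -> 0
    return float(np.arctan(d) / d)          # Eq. 6.3

X = np.array([[1.0, 0.0], [0.5, 0.5]])
Y = np.array([[0.8, 0.1], [0.5, 0.4]])
print(normalized_cityblock(X, Y))
```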

Model                  Finkelstein (ρ Correlation)    ESL (% Accuracy)
DSM                    0.283                          24.7
Collapsed              0.260                          17.8
Full Norm              0.282                          19.2
Row Norm               0.236                          26.4
Densified Row Norm     0.259                          26.7

Table 6.1: Single word evaluations on similarity scoring and synonym selection. SDSM does not clearly outperform DSM, likely due to issues of sparsity.

Model               Finkelstein (ρ Correlation)
DSM                 0.283
Matrix SVD 100      0.207
Matrix SVD 500      0.221
Tensor SVD 100      0.267
Tensor SVD 500      0.315

Table 6.2: Effects of SVD methods on SDSM representations. With higher order tensor SVD, SDSM outperforms DSM baseline.

6.2.4.1 Similarity Scoring

On this task, the different semantic representations are used to compute similarity scores between two (out of context) words. We used the dataset from Finkelstein et al. (2002) for our experiments. It consists of 353 pairs of words along with an averaged similarity score on a scale of 1.0 to 10.0 obtained from 13–16 human judges. For example, an item in the dataset is "book, paper → 7.46". Systems are evaluated on the degree of correlation between the similarity scores they assign to pairs of words and those obtained from the gold standard reference scores.

6.2.4.2 Synonym Selection

In the second task, the same set of semantic representations is used to produce a similarity ranking on the Turney (2002) ESL dataset. This dataset comprises 50 words that appear in a context (we discarded the context in this experiment), along with 4 candidate lexical substitutions. Only one of the four candidates is an apt substitute. For example, the dataset contains items of the form "rug → sofa, ottoman, carpet, hallway" (with carpet being the correct synonym). We evaluate the semantic representations on the basis of their ability to discriminate the top-ranked candidate.

6.2.4.3 Results and Discussion

Table 6.1 summarizes the results for the window-based baseline and each of the structured distributional representations on both tasks. It shows that our representations for single words are competitive with window-based distributional vectors. Densification improves our results in certain conditions, but no consistent pattern is discernible. This can be attributed to the trade-off between the gain from generalization and the noise introduced by surrogate neighboring words.

We also explored dimensionality reduction as an additional means of reducing sparsity. Table 6.2 gives correlation scores on the Finkelstein et al. (2002) dataset when SVD is performed on the representations, as described in Section 6.2.3.1. We give results when 100 and 500 principal components are preserved, for both matrix and tensor SVD techniques.

These experiments suggest that, though afflicted by sparsity, the proposed Structured Distributional paradigm is competitive with window-based distributional vectors. In the following sections we show that the framework provides considerably greater power for modeling composition when dealing with units consisting of more than one word.

6.2.5 Event Coreference Judgment

Given the SDSM formulation, and assuming no sparsity constraints, it is possible to calculate SDSM matrices for composed concepts. However, are these correct? Intuitively, if they truly capture semantics, the two SDSM matrix representations for "Booth assassinated Lincoln" and "Booth shot Lincoln with a gun" should be (almost) the same. To test this hypothesis we turn to the task of predicting whether two event mentions are coreferent or not, even if their surface forms differ.

While automated resolution of entity coreference has been an actively researched area (Haghighi and Klein, 2009; Stoyanov et al., 2009; Raghunathan et al., 2010), there has been relatively little work on event coreference resolution. Lee et al. (2012) perform joint cross-document entity and event coreference resolution using the two-way feedback between events and their arguments. In this work, however, we only consider coreferentiality between pairs of events. Formally, two event mentions generally refer to the same event when their respective actions, agents, patients, locations, and times are (almost) the same. Given the non-compositional nature of determining equality of locations and times, we represent each event mention by a triple E = (e, a, p) for the event, agent, and patient.

While the linguistic theory of argument realization is a debated research area (Levin and Rappaport Hovav, 2005; Goldberg, 2005), it is commonly believed that event structure (Moens and Steedman, 1988) centralizes on the predicate, which governs and selects its role arguments (Jackendoff, 1987). In the corpora we use for our experiments, most event mentions are verbs. However, when nominalized events are encountered, we replace them by their verbal forms. We use Semantic Role Labelling (SRL) (Collobert et al., 2011) to determine the agent and patient arguments of an event mention. When SRL fails to determine either role, empirical substitutes are obtained by querying the PropStore for the most likely word expectations for the role. The triple (e, a, p) is the composition of the triples (a, rel_agent, e) and (p, rel_patient, e), and hence a complex object. To determine equality of this complex composed representation we generate three levels of progressively simplified event constituents for comparison:

• Level 1: Full Composition:
  M_full = ComposePhrase(e, a, p).

• Level 2: Partial Composition:
  M_part:EA = ComposePair(e, rel_agent, a)
  M_part:EP = ComposePair(e, rel_patient, p).

• Level 3: No Composition:
  M_E = queryMatrix(e)
  M_A = queryMatrix(a)
  M_P = queryMatrix(p).

where each M is a Structured Distributional Semantic matrix representing a single word or a multiword unit.

                    IC Corpus                          ECB Corpus
Model          Prec    Rec     F-1     Acc        Prec    Rec     F-1     Acc
DSM            0.743   0.843   0.790   74.0       0.854   0.378   0.524   83.0
Senna          0.850   0.881   0.865   83.5       0.616   0.408   0.505   79.1
Senna+AVC      0.753   0.941   0.837   77.7       0.901   0.373   0.528   83.4
Senna+MVC      0.756   0.961   0.846   78.7       0.914   0.353   0.510   83.1
SDSM           0.916   0.929   0.922   90.6       0.901   0.401   0.564   84.3

Table 6.3: Cross-validation performance of Event Coreference Judgement on the IC and ECB datasets. SDSM outperforms the other models on F-1 and accuracy on both datasets.

To judge coreference between events E1 and E2, we compute pairwise similarities such as Sim(M1_full, M2_full), Sim(M1_part:EA, M2_part:EA), etc. We do this for each level of the composed triple representation. Furthermore, we vary the computation of similarity by considering different levels of granularity (lemma, SST), various choices of distance metric (Euclidean, city-block, cosine), and score normalization techniques (row-wise, full, column collapsed). This results in 159 similarity-based features for every pair of events, which are used to train a classifier to make a binary decision for coreferentiality.
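To make the feature-generation step concrete, here is a schematic sketch of how such a feature vector could be assembled; the level names, metric handling, and cross-product shown below are illustrative stand-ins rather than the exact 159-feature configuration used in the experiments.

```python
from itertools import product

def coreference_features(event1, event2, similarity_fns):
    """Build a similarity feature vector for a pair of event representations.

    event1, event2: dicts mapping a representation name (e.g. "full", "part_EA",
                    "E", "A", "P") to an SDSM matrix for that constituent.
    similarity_fns: dict mapping a metric name to a function over two matrices.
    """
    features = []
    for level, metric in product(sorted(event1), sorted(similarity_fns)):
        features.append(similarity_fns[metric](event1[level], event2[level]))
    return features

# The resulting vectors for labeled event pairs can then be fed to any
# off-the-shelf binary classifier (the experiments below report J48 decision trees).
```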

6.2.5.1 Datasets

We evaluate our method on two datasets and compare it against four baselines, two of which use window-based distributional vectors and two that use simple forms of vector composition previously suggested in the literature.

IC Event Coreference Corpus: This dataset, drawn from 100 news articles about violent events, contains manually created annotations for 2214 pairs of coreferent and non-coreferent events each. Where available, events' semantic role-fillers for agent and patient are annotated as well. When missing, empirical substitutes were obtained by querying the PropStore for the preferred word attachments.

EventCorefBank (ECB) Corpus: This corpus (Bejan and Harabagiu, 2010) of 482 documents from Google News is clustered into 45 topics, with event coreference chains annotated over each topic. The event mentions are enriched with semantic roles to obtain the canonical event structure described above. Positive instances are obtained by taking pairwise event mentions within each chain, and negative instances are generated from pairwise event mentions across chains, but within the same topic. This results in 11039 positive instances and 33459 negative instances.

6.2.5.2 Baselines

To establish the efficacy of our model, we compare SDSM against a purely window-based baseline (DSM) trained on the same corpus. In our experiments we set a window size of three words to either side of the target.

We also compare SDSM against the window-based embeddings trained using a neural network (SENNA) (Collobert et al., 2011) on both datasets. The SENNA pre-trained embeddings are state-of-the-art for many NLP tasks. This baseline uses SENNA to generate level 3 similarity features for events' individual words (agent, patient and action), since no compositionality between words in a fixed VSM is possible.

As our final set of baselines, we extend two simple techniques proposed by Mitchell and Lapata (2008) that use element-wise addition and multiplication operators to perform composition. The two baselines thus obtained are Senna+AVC (element-wise addition) and Senna+MVC (element-wise multiplication).

6.2.5.3 Results and Discussion

We experimented with a number of common classifiers, and selected decision trees (J48) as they gave the best classification accuracy across models. Table 6.3 summarizes our results on both datasets.

The results reveal that the SDSM model consistently outperforms DSM, SENNA embeddings, and the MVC and AVC models, both in terms of F-1 score and accuracy. The IC corpus comprises domain-specific texts, resulting in high lexical overlap between event mentions; hence, the scores on the IC corpus are consistently higher than those on the ECB corpus.

The improvements over DSM and SENNA embeddings support our hypothesis that syntax lends greater expressive power to distributional semantics in compositional configurations. Furthermore, the increase in predictive accuracy over MVC and AVC shows that our formulation of composition of two words based on the relation binding them yields a stronger form of composition than simple additive and multiplicative models.

We also performed an ablation study to determine the most predictive features for the task of determining event coreferentiality. The forward selection procedure revealed that the most informative attributes are the level 2 compositional features involving the agent and the action, as well as their individual level 3 features. This corresponds to the intuition that the agent and the action are the principal determiners for identifying events. Features involving the patient and level 1 representations are least useful. The latter involve full composition, resulting in very sparse representations, and hence have low predictive power.

6.3 Latent Structured Distributional Semantics

We now turn to our second SDSM, which represents its relational dimensions as latent embeddings obtained from matrix factorization. This SDSM tackles the problem of relational facets in meaning comparison by moving away from syntax to a data-driven approach for choosing a set of relational dimensions.

One shortcoming of DSMs derives from the fundamental nature of the vector space model, which characterizes the semantics of a word by a single vector in a high-dimensional space (or some lower-dimensional embedding thereof). However, the semantics of a word is not unique. Rather, it is composed of many facets of meaning, and similarity (or dissimilarity) to other words is an outcome of the aggregate harmony (or dissonance) between the individual facets under consideration. For example, a shirt may be similar along one facet to a balloon in that they are both colored blue, at the same time being similar to a shoe along another facet for both being articles of clothing, while being dissimilar along yet another facet to a t-shirt because one is stitched from linen while the other is made from polyester.

Of course, in this example the facets of comparison are sensitive to a particular context instance. But at an aggregate level one would be inclined to say that a shirt is most synonymous to a t-shirt because it evidences a high similarity along the principal facet of relevance, or relation, between the two words. It is our belief, as a result, that the semantics of a word may be captured more effectively by considering its relations to other words and their composite distributional signature, rather than straightforwardly by co-occurrence statistics.

SDSMs in general provide a natural way of capturing the multiple facets of meaning of a concept by decomposing distributional signatures over discrete relation dimensions, or facets. This leads to a representation that characterizes the semantics of a word by a distributional tensor, whose constituent vectors each represent the semantic signature of a word along one particular relational facet of meaning. In Section 6.2 we introduced one such SDSM. However, it uses syntactic relational facets from a dependency parser as surrogates for semantic ones. Other similar previous approaches include work by Padó and Lapata (2007) and Baroni and Lenci (2010).

There are limiting factors to this approximation. Most importantly, the set of syntactic relations, while relatively uncontroversial, is unable to capture the full extent of semantic nuance encountered in natural language text. Often, syntax is ambiguous and leads to multiple semantic interpretations. Conversely, passivization and dative shift are common examples of semantic invariance in which multiple syntactic realizations are manifested. Additionally, syntax falls short in explaining more complex phenomena – such as the description of buying and selling – in which complex interactions between multiple participants must be represented.

Section 6.2 also revealed another very practical problem with a dynamic syntax-driven SDSM: that of sparsity. While it is useful to consider relations that draw their origins from semantic roles such as Agent, Patient and Recipient, it remains unclear what the complete set of such semantic roles should be. This problem is one that has long troubled linguists (Fillmore, 1967; Sowa, 1991), and has been previously noted by researchers in NLP as well (Màrquez et al., 2008). Proposed solutions range from a small set of generic Agent-like or Patient-like roles in PropBank (Kingsbury and Palmer, 2002) to an effectively open-ended set of highly specific and fine-grained roles in FrameNet (Baker et al., 1998). In addition to the theoretical uncertainty about the set of semantic relations, there is the very real problem of the lack of high-performance, robust semantic parsers to annotate corpora. These issues effectively make the use of pre-defined, linguistically ordained semantic relations intractable for SDSMs.

In this section we introduce our second novel approach to structuring distributional semantic models, this time with latent relations that are automatically discovered from corpora. This approach solves the conceptual dilemma of selecting the most expressive set of semantic relations. It also solves the problem of sparsity plaguing the Syntactic SDSM, albeit at the expense of a dynamic model of composition. To the best of our knowledge this is the first work to propose latent relation dimensions for SDSM-type models. The intuition for generating these latent relation dimensions leads to a generic framework, which — in this thesis — is instantiated with embeddings obtained from Latent Relational Analysis (Turney, 2005).

We conduct experiments on three different semantic tasks to evaluate our model. On a similarity scoring task and a synonym ranking task the model significantly outperforms other distributional semantic models, including a standard window-based model, the previously described Syntactic SDSM, and a state-of-the-art semantic model trained using recursive neural networks. On a relation classification task, our model performs competitively, outperforming all but one of the models it is compared against. Additionally, our analyses reveal that the learned SDSM captures semantic relations that broadly conform to intuitions of humanly interpretable relations. Optimal results for the first two semantic tasks are reached when the number of latent dimensions varies within the proposed range of known fine-grained semantic relations (O'Hara and Wiebe, 2009). Moreover, a correlation analysis of the models trained on the relation classification task shows that their respective weights distinguish fairly well between different semantic relations.

6.3.1 Related Work

Insensitivity to the multi-faceted nature of semantics has been one of the focal points of several papers. Earlier work in this regard is a paper by Turney (2012), who proposes that the semantics of a word is not obtained along a single distributional axis but simultaneously in two different spaces. He proposes a DSM in which co-occurrence statistics are computed for neighboring nouns and verbs separately to yield independent domain and function spaces of semantics.

This intuition is taken further by a stance which proposes that a word's semantics is distributionally decomposed over many independent spaces – each of which is a unique relation dimension. Authors who have endorsed this perspective are Erk and Padó (2008), Goyal et al. (2013a), Reisinger and Mooney (2010) and Baroni and Lenci (2010). Our second model relates to these papers in that we subscribe to the multiple-space view of semantics. However, this work differs from them by structuring the semantic space with information obtained from latent semantic relations rather than from a syntactic parser. In this section the instantiation of the SDSM with latent relation dimensions is obtained using LRA (Turney, 2005), which is an extension of LSA (Deerwester et al., 1990) that induces relational embeddings for pairs of words.

From a modelling perspective, SDSMs characterize the semantics of a word by a distributional tensor. Other notable papers on tensor-based semantics or the semantics of compositional structures are the simple additive and multiplicative models of Mitchell and Lapata (2009), the matrix-vector neural network approach of Socher et al. (2012), the physics-inspired quantum view of semantic composition of Grefenstette and Sadrzadeh (2011) and the tensor-factorization model of Van de Cruys et al. (2013). A different, partially overlapping line of research attempts to induce word embeddings using methods from deep learning, yielding state-of-the-art results on a number of different tasks. Notable research papers on this topic are the ones by Collobert et al. (2011), Turian et al. (2010) and Socher et al. (2010).

Other related work to note is the body of research concerned with semantic relation classification, which is one of our evaluation tasks. Research-community-wide efforts in the SemEval-2007 task 4 (Girju et al., 2007), the SemEval-2010 task 8 (Hendrickx et al., 2009) and the SemEval-2012 task 2 (Jurgens et al., 2012) are notable examples. However, different from our work, most previous attempts at semantic relation classification operate on the basis of feature engineering and contextual cues (Bethard and Martin, 2007). In this section the features for the task are induced automatically from word representations.

6.3.2 Latent Semantic Relation Induction

Recall from Section 6.1 the definition and notation of DSMs and SDSMs, and the relationship between the two.

6.3.2.1 Latent Relation Induction for SDSM

The intuition behind our approach for inducing latent relation dimensions revolves around the simple observation that SDSMs, while representing semantics as distributional signatures over relation dimensions, also effectively encode relational vectors between pairs of words. Our method thus works backwards from this observation — beginning with relational embeddings for pairs of words, which are subsequently transformed to yield an SDSM.

Concretely, given a vocabulary $\Gamma = \{w_1, w_2, ..., w_k\}$ and a list of word pairs of interest from the vocabulary $\Sigma_V \subseteq \Gamma \times \Gamma$, we assume that we have some method for inducing a DSM $V'$ that has a vector representation $\vec{v}'_{ij}$ of length $d$ for every word pair $w_i, w_j \in \Sigma_V$, which intuitively embeds the distributional signature of the relation binding the two words in $d$ latent dimensions. So, for example, if we have some words such as "eat", "pasta", "table", "river", "fish", etc., we will want to learn some representation for word pairs such as "(eat, pasta)", "(table, fish)" and "(river, fish)". Notice that a vector representing "(table, fish)" can effectively be used as one of the columns of the SDSM representing "table" – specifically, the column corresponding to the vocabulary word "fish". Similarly, every relational vector can be transformed to become a column vector in the SDSM representation.

Therefore, formally we construct an SDSM $U$ where $\Sigma_U = \Gamma$ is the vocabulary of the model. For every word $w_i \in \Gamma$ a tensor $\vec{\vec{u}}_i \in \mathbb{R}^{d \times k}$ is generated. The tensor $\vec{\vec{u}}_i$ has $d$ unique $k$-dimensional vectors $\vec{u}_i^1, \vec{u}_i^2, ..., \vec{u}_i^d$. For a given relational vector $\vec{u}_i^l$, the value of the $j$th element is taken from the $l$th element of the vector $\vec{v}'_{ij}$ belonging to the DSM $V'$. Intuitively, what we are doing is taking the transpose of the relational row vector (e.g. "(table, fish)") and converting it into a column vector in SDSM space (the "fish" column of the SDSM representation of "table"). If the vector $\vec{v}'_{ij}$ does not exist in $V'$ – as is the case when the pair $w_i, w_j \notin \Sigma_V$ – the value of the $j$th element of $\vec{u}_i^l$ is set to 0. By applying this mapping to generate semantic tensors for every word in $\Gamma$, we are left with an SDSM $U$ that effectively embeds latent relation dimensions. From the perspective of DMs, we convert the third-order tensor into a two-dimensional matrix and perform truncated SVD, before restoring the resulting matrix to a third-order tensor.

This intuition is visually represented in Figure 6.3. The grey highlighted row vector is transposed from the "relation space" into a column vector of the representation for the word "pasta" in SDSM space.

Figure 6.3: Visual representation of the intuition behind the Latent Relational SDSM. The idea revolves around inducing a relation space from which a row vector is transformed to a column vector in SDSM space.
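A compact sketch of this transformation, assuming the pair-relation embeddings (for instance from LRA) are given as a dictionary from word pairs to length-d vectors; unseen pairs simply leave zero columns, as described above.

```python
import numpy as np

def build_latent_sdsm(pair_vectors, vocab):
    """Turn pair-relation embeddings into per-word SDSM tensors.

    pair_vectors: dict mapping (w_i, w_j) -> length-d relational embedding.
    vocab: list of words Gamma; returns dict word -> (d x |vocab|) matrix.
    """
    d = len(next(iter(pair_vectors.values())))
    index = {w: j for j, w in enumerate(vocab)}
    sdsm = {w: np.zeros((d, len(vocab))) for w in vocab}
    for (w_i, w_j), vec in pair_vectors.items():
        if w_i in sdsm and w_j in index:
            # Transpose the relational row vector into the w_j column of w_i's tensor.
            sdsm[w_i][:, index[w_j]] = vec
    return sdsm

vocab = ["eat", "pasta", "table", "fish"]
pairs = {("eat", "pasta"): np.array([0.9, 0.1]), ("table", "fish"): np.array([0.2, 0.7])}
U = build_latent_sdsm(pairs, vocab)
print(U["table"])  # the "fish" column holds the (table, fish) relation vector
```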

Latent Relational Analysis

In what follows, we present our instantiation of this model with an implementation based on Latent Relational Analysis (LRA) (Turney, 2005) to generate the DSM $V'$. While other methods (such as RNNs) are equally applicable in this scenario, we use LRA for its operational simplicity as well as its proven efficacy on semantic tasks such as analogy detection (which also relies on relational information between word pairs). The parameter values we chose in our experiments are not fine-tuned and are guided by recommended values from Turney (2005), or scaled suitably to accommodate the size of $\Sigma_V$.

The input to LRA is a vocabulary $\Gamma = \{w_1, w_2, ..., w_k\}$ and a list of word pairs of interest from the vocabulary $\Sigma_V \subseteq \Gamma \times \Gamma$. While one might theoretically consider a large vocabulary with all possible pairs, for computational reasons we restrict our vocabulary to approximately 4500 frequent English words, and only consider about 2.5% of the word pairs in $\Gamma \times \Gamma$ – those with high PMI (as computed on the whole of English Wikipedia). For each of the word pairs $w_i, w_j \in \Sigma_V$ we extract a list of contexts by querying a search engine indexed over the combined texts of the whole of the English Wikipedia and Gigaword corpora (approximately $5.8 \times 10^9$ tokens). Suitable query expansion is performed by taking the top 4 synonyms of $w_i$ and $w_j$ using Lin's thesaurus (Lin, 1998). Each of these contexts must contain both $w_i$ and $w_j$ (or appropriate synonyms), optionally some intervening words, and some words to either side. Given such contexts, patterns for every word pair are generated by replacing the two target words $w_i$ and $w_j$ with placeholder characters X and Y, and replacing none, some or all of the other words by their associated part-of-speech tag or a wildcard symbol "*".

For example, consider the case when $w_i$ and $w_j$ are "eat" and "pasta" respectively, and the queried context is "I eat a bowl of pasta with a fork". One would then generate all possible patterns, with the words "eat" and "pasta" substituted by the symbols X and Y, and the other words either replaced by their part-of-speech tags, replaced by the wildcard symbol "*", or not replaced at all. Here are some examples of patterns one might generate:

• * X * NN * Y IN a *
• * X DT bowl IN Y with DT *
• I X a NN of Y * * fork

For every word pair, only the 5000 most frequent patterns are stored.
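A self-contained sketch of this pattern generation step; the function below enumerates the full cross-product of {surface form, POS tag, *} choices for the non-target words, which is faithful to the description above but would, in practice, be restricted to a short window around the pair to keep the pattern set manageable.

```python
from itertools import product

def generate_patterns(tokens, pos_tags, i, j):
    """LRA-style patterns for a context whose target pair sits at positions i and j."""
    slots = []
    for k, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        if k == i:
            slots.append(["X"])
        elif k == j:
            slots.append(["Y"])
        else:
            # Keep the word, back off to its POS tag, or use a wildcard.
            slots.append([tok, pos, "*"])
    return {" ".join(choice) for choice in product(*slots)}

tokens = ["I", "eat", "a", "bowl", "of", "pasta", "with", "a", "fork"]
pos = ["PRP", "VBP", "DT", "NN", "IN", "NN", "IN", "DT", "NN"]
patterns = generate_patterns(tokens, pos, i=1, j=5)
print(len(patterns))                         # 3**7 = 2187 candidate patterns
print("I X a NN of Y * * fork" in patterns)  # True
```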

Once the set of all relevant patterns $P = \{p_1, p_2, ..., p_n\}$ has been computed, a DSM $V$ is constructed. In particular, the DSM is built over $\Sigma_V$, the list of word pairs of interest, and every word pair $w_i, w_j$ of interest has an associated vector $\vec{v}_{ij}$. Each element $m$ of the vector $\vec{v}_{ij}$ is a count of the number of times that the pattern $p_m$ was generated by the word pair $w_i, w_j$.

SVD Transformation

The resulting DSM $V$ is noisy and very sparse, so two transformations are applied to it. First, all co-occurrence counts between word pairs and patterns are transformed to PPMI scores (Bullinaria and Levy, 2007). Then, given the matrix representation of $V$ — where rows correspond to word pairs and columns correspond to patterns — SVD is applied to yield $V = M \Delta N^{T}$. Here $M$ and $N$ are matrices that have unit-length orthogonal columns and $\Delta$ is a matrix of singular values. By selecting the $d$ top singular values, we approximate $V$ with a lower-dimensional projection matrix that reduces noise and compensates for sparseness: $V' = M_d \Delta_d$. This DSM $V'$ in $d$ latent dimensions is precisely the one we then use to construct an SDSM, using the transformation described above.

Since the large number of patterns renders it effectively impossible to store the entire matrix $V$ in memory, we use a memory-friendly implementation1 of a multi-pass stochastic algorithm to directly approximate the projection matrix (Halko et al., 2011; Rehurek, 2010). A detailed analysis of how changes in the parameter $d$ affect the quality of the model is presented in Section 6.3.3. The optimal SDSM embeddings we trained and used in the experiments detailed below are available for download at http://www.cs.cmu.edu/~sjauhar/Software_files/LR-SDSM.tar.gz. This SDSM contains a vocabulary of 4546 frequent English words with 130 latent relation dimensions.

1http://radimrehurek.com/gensim/
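For illustration, a dense NumPy sketch of the PPMI weighting followed by the truncated-SVD projection; real pair-pattern matrices are far too large for this and would instead use the streaming stochastic SVD mentioned above, so the shapes and toy data here are assumptions for the example only.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI transform of a (pairs x patterns) co-occurrence matrix."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0          # cells with zero counts
    return np.maximum(pmi, 0.0)

rng = np.random.default_rng(0)
V = rng.poisson(0.05, size=(200, 1000)).astype(float)  # toy pair-by-pattern counts

P = ppmi(V)
M, S, Nt = np.linalg.svd(P, full_matrices=False)
d = 130
V_prime = M[:, :d] * S[:d]   # V' = M_d * Delta_d: one d-dimensional vector per word pair
print(V_prime.shape)          # (200, 130)
```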

Note on 3-Mode Tensor Extension

We use Latent Relational Analysis as prescribed by Turney (2005) because it has proven to be successful for learning representations of word pairs. However, a natural extension is to altogether avoid the transformation from the SDSM representation to a pair-pattern matrix and directly operate on a 3-mode tensor. This 3-mode tensor would essentially be a Word × Word × Pattern tensor, rather than the slices of the same tensor that we stack to obtain a 2-mode pair-pattern matrix. Once this 3-mode tensor is populated with pair-pattern counts in much the same manner as we populate its 2-dimensional counterpart, we can directly use a tensor decomposition algorithm (Vasilescu and Terzopoulos, 2002; Kolda and Bader, 2009) to find a dense, lower-dimensional embedding of the original semantic space. This is the approach that Baroni and Lenci (2010) adopt in their treatment of distributional memory, which is similar to our idea of SDSM. The motivation for maintaining a 3-mode tensor is to preserve the structure of the representation space, which is lost when decomposing the same elements into a 2-mode matrix. Turney (2007b) compares different tensor decomposition algorithms and concludes that 3-mode decomposition works better than SVD on a corresponding 2-mode matrix.

The difference can also be clearly interpreted by visualizing the resultant representations. With 2-mode SVD applied to a pair-pattern matrix, we obtain an SDSM representation that contains some non-zero columns and many all-zero columns. This is because the zero columns represent pairs of words that were filtered out and hence were never captured in the original pair-pattern matrix. On the other hand, a 3-mode tensor captures a strictly greater amount of structure. Here decomposition yields a fully dense representation, thus allowing us to extend the pattern counts of some pairs of words to capture a dense representation between all pairs of words. 3-mode decomposition thus represents a natural extension to our work on the Latent SDSM, and we leave it to future work.

6.3.3 Evaluation

Section 6.3.2 described a method for embedding latent relation dimensions in SDSMs. We now turn to the problem of evaluating these relations within the scope of the distributional paradigm in order to address two research questions:

1. Are latent relation dimensions a viable and empirically competitive solution for SDSM?
2. Does structuring lead to a semantically more expressive model than a non-structured DSM?

In order to answer these questions we evaluate our model on two generic semantic tasks and present comparative results against other structured and non-structured distributional models. We show that we outperform all of them significantly, thus answering both research questions affirmatively.

While other research efforts have produced better results on these tasks (Jarmasz and Szpakowicz, 2004; Gabrilovich and Markovitch, 2007; Hassan and Mihalcea, 2011), they are either lexicon or knowledge based, or are driven by corpus statistics that tie into auxiliary resources such as multi-lingual information and structured ontologies like Wikipedia. Hence they are not relevant to our experimental validation, and are consequently ignored in our comparative evaluation.

Model               WS-353 (Spearman's rho)    ESL (% Accuracy)
Random              0.000                       25.0
DSM                 0.179                       28.0
Syntactic SDSM      0.315                       26.7
Senna               0.510                       38.0
LR SDSM (300)       0.567                       46.9
LR SDSM (130)       0.586                       51.0

Table 6.4: Results on the WS-353 similarity scoring task and the ESL synonym selection task. LRA-SDSM significantly outperforms other structured and non-structured distributional semantic models.

6.3.3.1 Word-Pair Similarity Scoring Task

The first task consists of using a semantic model to assign similarity scores to pairs of words. The dataset used in this evaluation setting is the WS-353 dataset from Finkelstein et al. (2002). It consists of 353 pairs of words along with an averaged similarity score on a scale of 1.0 to 10.0 obtained from 13–16 human judges. Word pairs are presented as-is, without any context. For example, an item in this dataset might be "book, paper → 7.46".

System scores are obtained by using the standard cosine similarity measure between distributional vectors in a non-structured DSM. In the case of a variant of SDSM, these scores can be found by using the cosine-based similarity function in Equation 6.2 of Section 6.1. System-generated output scores are evaluated against the gold standard using Spearman's rank correlation coefficient.

6.3.3.2 Synonym Selection Task

In the second task, the same set of semantic space representations is used to select the semantically closest word to a target from a list of candidates. The ESL dataset from Turney (2002) is used for this task, and was selected over the slightly larger TOEFL dataset (Landauer and Dumais, 1997). The reason for this choice was that the latter contains more complex vocabulary words — several of which were not present in our simple vocabulary model. The ESL dataset consists of 50 target words that appear with 4 candidate lexical substitutes each. While disambiguating context is also given in this dataset, we discard it in our experiments. An example item in this dataset might be "rug → sofa, ottoman, carpet, hallway", with "carpet" being the most synonym-like candidate to the target.

Similarity scores — which are obtained in the same manner as for the previous evaluation task — are extracted between the target and each of the candidates in turn. These scores are then sorted in descending order, with the top-ranking score yielding the semantically closest candidate to the target. Systems are evaluated on the basis of their accuracy at discriminating the top-ranked candidate.

6.3.3.3 Results

We compare our model (LR-SDSM) to several other distributional models in these experiments. These include a standard distributional vector space model (DSM) trained on the combined text of English Wikipedia and Gigaword with a window-size of 3 words to either side of a target, the Syntactic SDSM from Chapter 6.2 (synSDSM) trained on the same corpus parsed with a dependency parser (Tratz and Hovy, 2011), and the state-of-the-art neural network embeddings from Collobert et al. (2011) (SENNA). We also give the expected evaluation scores from a random baseline, for comparison. An important factor to consider when constructing an SDSM using LRA is the number of latent dimensions selected in the SVD projection. In Figure 6.4 we investigate the effect of selecting different numbers of latent relation dimensions on both semantic evaluation tasks, starting with 10 dimensions up to a maximum of 800 (which was the maximum that was computationally feasible), in increments of 10. We note that optimal results on both datasets are obtained at 130 latent dimensions. In addition to the SDSM obtained in this setting we also give results for an SDSM with 300 latent dimensions (which has been a recommended value for SVD projections in the literature (Landauer and Dumais, 1997)) in our comparisons against other models. Comparative results on the Finkelstein WS-353 similarity scoring task and ESL synonym selection task are given in Table 6.4.

6.3.3.4 Discussion

The results in Table 6.4 show that LR-SDSM outperforms the other distributional models by a considerable and statistically significant margin (p-value < 0.05) on both types of semantic evaluation tasks. It should be noted that we do not tune to the test sets. While the 130 latent dimension SDSM yields the best results, 300 latent dimensions also gives comparable performance and moreover outperforms all the other baselines. In fact, it is worth noting that the evaluation results in Figure 6.4 are almost all better than the results of the other models on either dataset. We conclude that structuring a semantic model with latent relational information in fact leads to performance gains over non-structured variants. Also, the latent relation dimensions we propose offer a viable and empirically competitive alternative to syntactic relations for SDSMs. Figure 6.4 shows the evaluation results on both semantic tasks as a function of the number of latent dimensions. The general trend of both curves in the figure indicates that the expressive power of the model quickly increases with the number of dimensions until it peaks in the range of 100–150, and then decreases or evens out after that. Interestingly, this falls roughly in the range of the 166 frequent (those that appear 50 times or more) frame elements, or fine-grained relations, from FrameNet that O'Hara and Wiebe (2009) find in their taxonomization and mapping of a number of lexical resources that contain semantic relations.

6.3.4 Semantic Relation Classification and Analysis of the Latent Structure of Dimensions

In this section we conduct experiments on the task of semantic relation classification. We also perform a more detailed analysis of the induced latent relation dimensions in order to gain insight into our model's perception of semantic relations.

Figure 6.4: Evaluation results on WS-353 and ESL with a varying number of latent dimensions. Generally high scores are obtained in the range of 100–150 latent dimensions, with optimal results on both datasets at 130 latent dimensions.

Relation Classification
Model         Precision   Recall   F-1     % Accuracy
Random        0.111       0.110    0.110   11.03
Senna (Mik)   0.273       0.343    0.288   34.30
DSM+AVC       0.419       0.449    0.426   44.91
DSM+MVC       0.382       0.443    0.383   44.26
Senna+AVC     0.489       0.516    0.499   51.55
Senna+MVC     0.416       0.457    0.429   45.65
LR SDSM       0.431       0.475    0.444   47.48

Table 6.5: Results on the relation classification task. LR-SDSM scores competitively, outperforming all but the SENNA-AVC model.

6.3.4.1 Semantic Relation Classification

In this task, a relational embedding is used as a feature vector to train a classifier for predicting the semantic relation between previously unseen word pairs. The dataset used in this experiment is from the SemEval-2012 task 2 on measuring the degree of relational similarity (Jurgens et al., 2012), since it characterizes a number of very distinct and interesting semantic relations. In particular it consists of an aggregated set of 3464 word pairs evidencing 10 kinds of semantic relations. We prune this set to discard pairs that don't contain words in the vocabularies of the models we consider in our experiments. This leaves us with a dataset containing 933 word pairs in 9 classes (1 class was discarded altogether because it contained too few instances). The 9 semantic relation classes are: "Class Inclusion", "Part-Whole", "Similar", "Contrast", "Attribute", "Non-Attribute", "Case Relation", "Cause-Purpose" and "Space-Time". For example, an instance of a word pair that exemplifies the "Part-Whole" relationship is "engine:car". Note that, as with previous experiments, word pairs are given without any context.

6.3.4.2 Results

We compare LR-SDSM on the semantic relation classification task to several different models. These include the additive vector composition (AVC) and multiplicative vector composition (MVC) methods proposed by Mitchell and Lapata (2009); we present both DSM and SENNA based variants of these models. We also compare against the vector difference method of Mikolov et al. (2013b) (SENNA-Mik), which sees semantic relations as a meaning-preserving vector translation in an RNN embedded vector space. Finally, we note the performance of random classification as a baseline, for reference. We attempted to produce results for a syntactic SDSM on the task; however, the hard constraint imposed by syntactic adjacency meant that effectively all the word pairs in the dataset yielded zero feature vectors. To avoid overfitting on all 130 original dimensions in our optimal SDSM, and also to render results comparable, we reduce the number of latent relation dimensions of LR-SDSM to 50. We similarly reduce the feature vector dimension of DSM-AVC and DSM-MVC to 50 by feature selection.

Figure 6.5: Correlation distances between semantic relations' classifier weights. The plot shows how our latent relations seem to perceive humanly interpretable semantic relations. Most points are fairly well spaced out, with opposites such as "Attribute" and "Non-Attribute" as well as "Similar" and "Contrast" being relatively further apart.

The dimensions of SENNA-AVC, SENNA-MVC and SENNA-Mik are already 50, and are not reduced further.

For each of the methods we train a logistic regression classifier. We don’t perform any tuning of parameters and set a constant ridge regression value of 0.2, which seemed to yield roughly the best results for all models. The performance on the semantic relation classification task in terms of averaged precision, recall, F-measure and percentage accuracy using 10-fold cross-validation is given in Table 6.5.
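As a rough sketch of this evaluation setup, the following uses scikit-learn to run 10-fold cross-validated logistic regression over the relational feature vectors. The mapping of the ridge value 0.2 onto scikit-learn's inverse regularization strength C, and all function names, are assumptions for illustration rather than a description of our actual implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_relation_classifier(X, y, ridge=0.2, folds=10):
    """10-fold cross-validated relation classification.

    X: (n_pairs, n_features) array of relational feature vectors.
    y: array of the 9 semantic relation labels.
    The ridge penalty is mapped to scikit-learn's C = 1 / ridge (an assumption).
    """
    clf = LogisticRegression(C=1.0 / ridge, max_iter=1000)
    preds = cross_val_predict(clf, X, y, cv=folds)
    prec, rec, f1, _ = precision_recall_fscore_support(y, preds, average="macro")
    return prec, rec, f1, accuracy_score(y, preds)
```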

Additionally, to gain further insight into the LR-SDSM's understanding of semantic relations, we conduct a secondary analysis. We begin by training 9 one-vs-all logistic regression classifiers, one for each of the 9 semantic relations under consideration. Then pairwise correlation distances are measured between all pairs of weight vectors of the 9 models. Finally, the distance adjacency matrix is projected into 2-d space using multidimensional scaling. The result of this analysis is presented in Figure 6.5.
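The analysis pipeline can be sketched as follows, again assuming a scikit-learn/SciPy implementation with illustrative names; the actual classifiers and distance computation in our experiments may differ in their details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

def project_relation_weights(X, y, relation_names):
    """Train one one-vs-all classifier per relation, compute pairwise
    correlation distances between their weight vectors, and project the
    distance matrix into 2-d with multidimensional scaling."""
    weights = []
    for rel in relation_names:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, (np.asarray(y) == rel).astype(int))  # one-vs-all labels
        weights.append(clf.coef_.ravel())
    distances = squareform(pdist(np.vstack(weights), metric="correlation"))
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(distances)
    return dict(zip(relation_names, coords))
```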

6.3.4.3 Discussion

Table 6.5 shows that LR-SDSM performs competitively on the relation classification task and outperforms all but one of the other models. The performance differences are statistically significant with a p-value < 0.05. We believe that some of the expressive power of the model is lost by compressing to 50 latent relation dimensions, and that a greater number of dimensions might improve performance. However, testing a model with a 130-length dense feature vector on a dataset containing 933 instances would likely lead to overfitting, and would also not be comparable to the SENNA-based models that operate on 50-length feature vectors. Other points to note from Table 6.5 are that the AVC variants of the DSM and SENNA composition models tend to perform better than their MVC counterparts. Also, SENNA-Mik performs surprisingly poorly. It is worth noting, however, that Mikolov et al. (2013b) report results on fairly simple lexico-syntactic relations between words – such as plural forms, possessives and gender – while the semantic relations under consideration in the SemEval-2012 dataset are relatively more complex.

In the analysis of the latent structure of dimensions presented in Figure 6.5, there are a few interesting points to note. To begin with, all the points (with the exception of one pair) are fairly well spaced out. At the weight vector level, this implies that different latent dimensions need to fire in different combinations to characterize distinct semantic relations, thus resulting in low correlation between their corresponding weight vectors. This suggests that the latent relation dimensions capture the intuition that each of the classes encodes a distinctly different semantic relation. The notable exception is "Space-Time", which is very close to "Contrast". This is probably due to the fact that distributional models are ineffective at capturing spatio-temporal semantics. Moreover, it is interesting to note that "Attribute" and "Non-Attribute" as well as "Similar" and "Contrast", which are intuitively semantic inverses of each other, are also (relatively) distant from each other in the plot.

These general findings indicate an interesting avenue for future research, which involves mapping the empirically learned latent relations to hand-built semantic lexicons or frameworks. This could help to validate the empirical models at various levels of linguistic granularity, as well as establish correspondences between different views of semantic representation.

6.4 Conclusion

In this chapter we outlined a general framework for representing semantics with a more expressive Structured Distributional formalism. This formalism tackled the problems of context-sensitivity and facetedness in semantics with a model that decomposes a vector representation into several discrete relation dimensions. We proposed two novel models that each instantiated these relation dimensions in a different way.

The first used dependency syntax and introduced a dynamic model that was capable of composing units into complex structures by taking relational context into account. By doing so it effectively tackled the problem of context-sensitivity in semantics. The Syntactic SDSM we introduced showed promising results on event coreference judgment, a task that specifically brings the problem of contextual variability into focus.

Our second instantiation of relation dimensions was a model that learned a set of latent relations from contextual pattern occurrences. This approach was better suited to resolve the issue of facetedness in semantics, or the facets along which a word may be represented or compared. Its data-driven approach allowed the SDSM to move away from syntax (and any set schema of relations for that matter), while also tackling the problem of sparsity that plagued the Syntactic SDSM. Experimental results supported our claim that the Latent SDSM captures the semantics of words more effectively than a number of other semantic models, and presented a viable — and empirically competitive — alternative to Syntactic SDSMs. Finally, our analyses investigating the structure of, and interactions between, the latent relation dimensions revealed some interesting tendencies in how they represent and differentiate between known, humanly-interpretable relations.

6.4.1 Five Point Thematic Summary

What was the semantic representation learning problem that we tried to solve?

We tackled two different but related semantic representation learning problems in this chapter. The first problem was that of context-sensitivity (see Section 6.2), which is the property of a representation to be modified by its contextual neighbors so as to alter its meaning. The second problem we dealt with was that of semantic facetedness (see Section 6.3), whose modelling goal was to represent the different attributes of objects as separate dimensions of representation. We envisaged a modelling paradigm that was capable of solving both kinds of problems. Namely, we moved from representing units of semantics as vectors to representing them as two-dimensional tensors, and called these representations Structured Distributional Semantic Models (SDSMs). Specifically, we hypothesized that a semantic tensor was effectively a decomposed representation of a vector — and that the rows of the tensor were essentially the semantic signatures of the object along a particular semantic dimension (or facet, in our terminology).

In terms of Definition 1 in Chapter 1.2, the two-dimensional tensor modelling paradigm defined a different semantic relation for each of the rows of the tensor. These relations were framed as mappings from pairs of co-occurring words in the corpus to the set of real numbers. Each of the mappings had the goal of capturing the affinity for co-occurrence of pairs of words along a particular relation dimension. Based on this interpretation, an intuitive way of describing SDSMs is that they are nothing more than simultaneously co-existing vector space models, represented together in compact form. We have one vector space model for each of the different semantic relations we are interested in capturing. The relations themselves, and therefore their corresponding mappings into the set of reals, depend – of course – on the target problem we care about.
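To make this "co-existing vector space models" picture concrete, here is a toy illustration (not our actual implementation) in which a word's tensor is stored as a mapping from relation dimensions to distributional vectors, and two words are compared row by row. The simple averaging of per-relation cosines stands in for the aggregation performed by the similarity functions of Equation 6.2.

```python
import numpy as np

def rowwise_similarity(tensor_a, tensor_b):
    """Compare two word tensors relation by relation: each shared row
    (relation) is scored with cosine similarity in its own vector space,
    and the per-relation scores are averaged into one similarity value."""
    sims = []
    for rel in tensor_a.keys() & tensor_b.keys():
        u, v = tensor_a[rel], tensor_b[rel]
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        sims.append(np.dot(u, v) / denom if denom > 0 else 0.0)
    return float(np.mean(sims)) if sims else 0.0

# Toy SDSM tensors: one small vector space model per relation dimension.
cat = {"rel_1": np.array([0.9, 0.1, 0.0]), "rel_2": np.array([0.2, 0.7, 0.1])}
dog = {"rel_1": np.array([0.8, 0.2, 0.0]), "rel_2": np.array([0.3, 0.6, 0.1])}
print(rowwise_similarity(cat, dog))
```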

What linguistic information or resource did we use to solve this problem?

We used different linguistic information for each of the two problems we tackled. For the problem of context-sensitivity we used vast amounts of syntactically parsed text that we then decomposed and stored in a database called the PropStore. Our intuition was that the decomposed dependency triples (consisting of pairs of words connected by typed arcs) would provide us with the building blocks to compute the expected representations of composed units.

Meanwhile, for the problem of semantic facetedness, we set ourselves the goal of moving from syntax as the rows of the tensor representation to having latent semantic rows. Accordingly, we only relied on raw un-annotated text, under the assumption that the semantic relations we cared about were latent in the text and could be uncovered by our model.

How did we operationalize the desired semantic relation with contextual relations in the resource?

Recall that we made the analogy that an SDSM is essentially a compact representation of multiple co-existing vector space models. Each of these vector space models captures a single semantic relation, and this relation corresponds to a single row in the two-dimensional SDSM tensor. We operationalized these semantic relations differently according to the two different target problems we were attempting to solve. For the SDSM designed to tackle context-sensitivity, we defined contextual relations over dependency trees. The different relations (and thereby different rows of the tensor) were the types of dependency arcs produced by the syntactic parser. Thus, from the perspective of Definition 2, we defined the shared property of a particular contextual relation as the set of target-context word pairs in the vocabulary that were found to have been connected by a dependency arc of that type in the data. Meanwhile, for the SDSM that had the goal of tackling facetedness, we defined contextual relations over patterns in raw text. Our operationalization of the semantic relations in the latent tensor model was based on the intuition that pairs of words that occur with similar patterns around them have a similar relation between them (Turney, 2008a).
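As a rough illustration of the pattern-based contextual relation, the sketch below counts the in-between word pattern for every pair of vocabulary words that co-occur within a short window of a sentence. The exact pattern definition used in our model may differ; the code and its names are illustrative assumptions.

```python
import re
from collections import Counter

def pair_pattern_counts(sentences, vocab, max_gap=3):
    """Count, for each ordered pair of vocabulary words co-occurring in a
    sentence, the lexical pattern that appears between them."""
    counts = Counter()
    for sent in sentences:
        toks = re.findall(r"\w+", sent.lower())
        for i, w1 in enumerate(toks):
            if w1 not in vocab:
                continue
            # allow up to max_gap intervening words between the pair
            for j in range(i + 1, min(i + max_gap + 2, len(toks))):
                w2 = toks[j]
                if w2 in vocab:
                    pattern = " ".join(toks[i + 1:j]) or "<adjacent>"
                    counts[((w1, w2), pattern)] += 1
    return counts
```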

How did we use the contextual relations in representation learning?

Unlike the rest of the thesis, neither of our two instantiations of SDSM was a learning model. However, as with all representation learning models, we still had to make counts over contexts. That is, we still assigned weights to pairs of target-context objects by some summary of count statistics over contexts. In the case of the first SDSM, we developed a dynamic model that counted co-occurrence statistics on the fly for composed representations. These counts were made from queries to the PropStore, but were essentially counts over the syntactic neighborhood of words. The contextual relations specified by the different syntactic relations, as a result, biased the representations of the rows of the two-dimensional SDSM tensor. We explored different ways of normalizing these counts, but all our normalization schemes had probabilistic interpretations (see Section 6.2.2.2).

The second SDSM operationalized the semantic relations as contextual relations over neighborhood patterns. Thus we computed co-occurrence statistics between pairs of words and their neighboring patterns. In this case, to account for the vast number of patterns and their relative sparsity, we normalized the counts in two ways: we first transformed the counts with positive point-wise mutual information, and then performed singular-value decomposition to reduce the pair-pattern count matrix (see Section 6.3.2.1). The columns of the resulting matrix became the latent semantic relations of the SDSM.
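The PPMI-plus-SVD step can be sketched as follows, under the assumption of a dense word-pair-by-pattern count matrix and SciPy's sparse SVD; this is an illustrative reconstruction, not the exact code or hyperparameters used in Section 6.3.2.1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def ppmi(counts):
    """Positive pointwise mutual information over a dense
    pair-by-pattern count matrix."""
    total = counts.sum()
    row_p = counts.sum(axis=1, keepdims=True) / total
    col_p = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (row_p * col_p))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def latent_relation_space(pair_pattern_counts, k=130):
    """PPMI-transform the counts, then apply a truncated SVD; the k
    resulting columns play the role of latent relation dimensions."""
    weighted = ppmi(pair_pattern_counts.astype(float))
    U, S, Vt = svds(csr_matrix(weighted), k=k)
    return U * S  # one k-dimensional latent-relation vector per word pair
```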

What was the result of biasing the learner with contextual relations in learning the semantic model?

We evaluated the context-sensitive SDSM on the task of semantic similarity of words (which really does not require modelling of context-sensitivity), but found the results to be fairly mediocre (see Section 6.2.4). However, on the task of event coreference judgment (which, in contrast, does require modelling of context-sensitivity), we found our Syntactic SDSM to be capable, indeed outperforming several baselines (see Section 6.2.5). We concluded that biasing our model with syntactic contexts, and embedding those syntactic relations as the rows of the model, enabled us to effectively capture how words are modified by their context. At the same time, these factors did not benefit the model's ability to capture the semantics of individual words.

The latent semantic SDSM, in contrast, was much better at capturing the semantics of individual words. It not only outperformed the syntactic SDSM, but also other competent baselines – including a neural network model. Section 6.3.3 contains all the details of our evaluation. Additionally, we also evaluated the second SDSM instantiation on the task of relation classification and attempted to visualize the model's perception of known, humanly interpretable semantic relations. We found that it was able to differentiate between these relations fairly well. Thus we were able to capture latent semantic relations that not only benefitted the representation of individual words, but were also good discriminators of known relations. We thus concluded that the bias obtained from our contextual relations over word pairs and their neighboring patterns did, in fact, help us capture latent semantic relations that effectively represented the relational dimensions of our SDSM.

Chapter 7

Conclusion

We now conclude this thesis by summarizing our contributions and looking forward to avenues for future exploration.

7.1 Summary of Contributions

This thesis has presented a relation-centric view of the problem of representation learning in semantics.

• We have argued for this relation-centric view by positing that relations and structure are a core component of producing meaning in language. This led us to attempt to characterize the nature of the information captured by semantic representation learning models. We argued that the goal of every model is to capture some semantic relation, but that these relations are almost never annotated in data, and therefore cannot be directly learned from data as such. Instead, the only way to bias an unsupervised learner of semantic representations is to rework its input – its context, or view of the data. We introduced the idea of a contextual relation as the operationalization of a semantic relation that we wish to learn from data. We argued that this connection between semantic and contextual relations provides a way of injecting knowledge into the unsupervised paradigm of representation learning. Moreover, relations and structure represented a way of consolidating and understanding semantic models and their associated contexts. They also suggested that different problems require different semantic and contextual relations, and that articulating and then leveraging the right (or at least more expressive) relations for these problems can lead to better models.

• We then developed a set of five key research questions that presented a framework in which to think about and implement solutions that adopt the relation-centric view of semantic representation learning. The rest of our thesis is organized as an example-based instantiation of this high-level framework. Each of the questions dealt with one of five sequential themes. The first begins by identifying a representation learning problem that can be articulated in terms of a linguistic or structural phenomenon in natural language data – that is, specifying the desired semantic relation to be learned.

Second, the framework suggests identifying a corpus or knowledge resource that contains information relevant to solving the problem. The third then seeks to articulate the principal contextual relations in the resource, by drawing inspiration from intuitions of the desired semantic relation. After this, the fourth deals with specifying a mathematical model that processes contexts and subsequently assigns weights, or association strengths, to pairs of target and context objects. Fifth and finally, the framework requires an evaluation of the learned model in order to see if the inclusion of contextual relations measurably improved its ability to represent the desired semantic relation.

• In Chapter 3 we looked at the problem of polysemy from the perspective of representation learning models. We suggested that ontologies contain a complementary source of knowledge to distributional statistics, and that their structure can be used to tease apart representations for different senses of words that share a single surface form. To this end we developed two general and flexible solutions that integrate this structure into the process of learning sense representations that are grounded in an ontology. The first was an easy and computationally attractive post-processing step, called retrofitting, that split representations of word vectors into different sense vectors. The second built on the first method by also introducing a corpus into the learning process. Crucially, both methods leveraged the structure and relations within an ontology to define smoothness properties over the underlying graph, and used the neighborhoods of senses to tease them apart from other senses with which they shared a word surface form.

• Chapter 4 tackled the problem of QA in the context of the question, "What is the right level of structure for background knowledge?" In particular, we looked at background knowledge in the form of semi-structured tables with text entries. These tables represented interesting semantics over relations defined in terms of rows and columns, which we leveraged in two important ways. First, we used table relations to guide non-expert annotators to create a large bank of multiple-choice questions. As a benefit of the annotation setup we also obtained fine-grained alignments from these multiple-choice questions to table elements, with no added cost or effort to annotators. Second, we used this newly created dataset of questions and table alignments to implement two question answering solutions. Crucially, both solutions leveraged the inherent structure of tables to find and use information relevant to answering questions. The first solution modelled table structure with a rich set of hand-crafted features, while the second mimicked these features with a neural network model that automatically generated representations for table elements. The manually engineered model proved to be very successful, capable of abstracting over both the content and structure of tables to find answers in unseen test cases. The neural network model was also successful, but only at abstracting over structure, and was more tied to content – understandably, since its representations were learned over the content of tables.

• We explored the task of learning a semantic lexicon in Chapter 5. Specifically, we attempted to induce embeddings for a set of predicates and arguments, as well as a set of latent relational argument slots associated with predicates – all while using minimal annotation. In fact, to the best of our knowledge, our work was the first to induce latent semantic role information without requiring high-level annotations such as dependency parses.

Our solution revolved around the insight that the relations between predicates and arguments occur both at the token and the type level. To put this insight into practice we developed a novel integration between a predictive embedding model and the posterior of an Indian Buffet Process. By jointly leveraging the local (token) and global (type) levels of information over a corpus of predicate-argument pairs, we were able to learn lexicon embeddings and latent relation slots that were maximally able to account for both types of relational information. We evaluated our model at both the local (token) level and the global (type) level and demonstrated that the relational structure we learn from latent slots in fact benefits the expressive power of the semantic lexicon. Finally, we also qualitatively showed that our solution was capable of capturing some interesting generalities of known, humanly-interpretable relations while having no supervision to guide it towards this understanding.

• Finally, in Chapter 6 we tackled the two related problems of context-sensitivity and facetedness in learning distributional semantic models. We outlined a framework that we called Structured Distributional Semantics to address these problems. The core idea behind Structured Distributional Semantics echoes the overall theme of this thesis, namely that relations play a key role in giving rise to semantics. Consequently the model decomposes the representation of a concept from a single vector into a relational tensor. The rows of this tensor define relations, and these relations are the principal dimensions of representation in the resulting semantic space. We instantiated the relations of the tensor in two different ways that were devised to tackle the problems of context-sensitivity and facetedness. The first instantiation used syntactic relations and relied on a large store of syntactic triples in order to build compositional and context-sensitive representations on the fly. The second instantiation directly induced a set of latent semantic relations and used them to approach the problem of semantic facetedness. Both kinds of models produced empirically competitive results on relevant evaluations. And in the case of the latent relational model, we again discovered (as in Chapter 5) some interesting tendencies of the latent relations to capture and differentiate between known, humanly-interpretable relations.

7.2 Future Directions

Each of the problems we have summarized above, and their respective solutions, opens up future avenues of research, such as performance improvements, extensions, or scaling up. The papers published from each of the corresponding chapters detail these future directions. However, to conclude this thesis we mention some of the bigger trends that merit exploration. In particular we list three major themes.

Digging Deeper

From the five research questions and their corresponding topics, introduced in Chapter 2, we can deepen and generalize our approach to at least three of them. The first and the last of the themes concern defining and evaluating a problem and are necessarily task dependent, but the others are high-level generalizations that could encompass more than one problem.

The first and the second of the remaining three themes concern the selection of the data source and the articulation of the principal relations within it. These two themes are deeply connected, both in terms of this thesis as well as our envisaged outline of their future exploration. In particular, the examples in this thesis have been split roughly into two major categories. The first used existing, at least partially manually created structured data that influenced the way we learn relations (such as ontologies in Chapter 3 and tables in Chapter 4), while the second attempted to use latent relations in data to induce a set of relations for our models (such as predicate-argument pairs in Chapter 5 and text patterns in Chapter 6). In other words, we explored research use cases where we selected a structured resource that had a pre-defined set of well articulated contextual relations, and other use cases where we were only given plain text (or some variant thereof) and were required to define the set of contextual relations in terms of the raw text.

The latter use case is, of course, a more general approach, but using latent relations is often more difficult and produces models that are less interpretable and not as empirically effective. Exploring these themes further would require characterizing and numerically evaluating the differences between models that leverage structured and unstructured resources, or that learn from manually defined and latent contextual relations. For example, our approach in Chapter 3 uses an ontology for sense representation and grounding, but there are approaches (Reisinger and Mooney, 2010; Neelakantan et al., 2014) that induce the senses in a latent way directly from plain text data. How do the senses learned by each of these methods differ from one another? Which ones are "better"? In general, can we do away with structured resources altogether and do all our learning from plain text corpora? This would provide us with a great deal of modelling power and flexibility to tackle the problems we choose.

The third research theme we wanted to broach, in terms of deeper and more general future work, concerns the models we define to learn representations. Most of the models detailed during the course of this thesis have been (perhaps necessarily) task specific. However, is it possible to frame relation-centric representation learning problems in terms of a unified mathematical framework? We believe probably not. But it would still be helpful if several different problems were capable of being captured by a single framework. This would not only indicate a deeper connection between different semantic relation learning problems, but also provide a very concrete and definite first indication of how to apply the ideas in this thesis to new domains. The general idea behind the model introduced in Equation 3.3 of Chapter 3.2 represents one possible unification for several semantic relation learning problems. This is an idea that we return to in Equation 5.1 of Chapter 5.2. The general form of this idea is a regularized maximum-likelihood model, in which the likelihood and regularization components can be instantiated in different task-specific ways. This represents a powerful framework that fits into the over-arching goals of this thesis: learning representations from data with the influence of relations and structure.

In the two examples we showed in this thesis, we tackled two very different problems, showing that this framework is in fact quite flexible.
The first example dealt with maximizing the likelihood of a corpus with latent senses, while accounting for a posterior regularizer over an ontology. The second dealt with the very different problem of maximizing the likelihood of local token predicate-argument affinities with latent relation slots, while accounting for a posterior regularizer that captured global type-level affinities between them. In both cases, the Skip-gram model of Mikolov et al. (2013a) – which learns representations by maximizing the likelihood of a corpus – was modified minimally to account for the generative story of the likelihood component of the framework.

Meanwhile, the regularizer component was very different in each case – necessarily accounting for the contextual relations that were relevant to the two very different problems: in the one case ontological structure, in the other aggregated predicate-argument structure. One can imagine that several problems which attempt to integrate relations or structure into the learning process would want to accommodate both a component that learns from data and a component that accounts for the contextual relations or structure. The regularized maximum-likelihood framework presents a theme that is capable of assimilating these two components, and could be an interesting avenue to explore as a general model for tackling several different relation-centric semantic representation learning problems.
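Schematically (and only schematically – this is not a restatement of Equations 3.3 or 5.1), the shared form of such a regularized maximum-likelihood model can be written as:

```latex
\hat{\theta} \;=\; \arg\max_{\theta}\;
  \underbrace{\sum_{(w,\,c)\,\in\,\mathcal{D}} \log p(c \mid w;\, \theta)}_{\text{data likelihood (e.g., Skip-gram)}}
  \;-\;
  \lambda\, \underbrace{R(\theta)}_{\text{relational/structural regularizer}}
```

Here the likelihood term would be instantiated by the corpus-driven embedding model, R(θ) would encode the contextual relations or structure (an ontology graph in one case, aggregated type-level predicate-argument structure in the other), and λ would trade off the two components.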

Putting it All Together

The holy grail of semantics is a model that is capable of processing and understanding all of human language, with all its richness, variation and ambiguity. How do we move the research introduced in this thesis towards that holy grail? Currently our models represent disparate attempts at tackling different nugget problems within the larger semantic space. While we are not arguing for or against an approach that attempts to solve a vast problem like semantics in bits and pieces, an important question arises when you do have solutions to these bits and pieces: how do you put them all together? One could envisage a joint approach that pools all the different model outputs into a single semantic representation. Something similar is done by Yao et al. (2012) and Riedel et al. (2013), who use a so-called Universal Schema that is simply the union of different knowledge base schemas and representations in one gigantic whole. But putting different semantic representations together, learned with different target problems in mind, raises the natural follow-up question: how do you use the representations together, or how do you decide when to use which relation? We believe that resolving this problem requires an important future step in relation-centric learning.

We have introduced the relation-centric view of semantics as one of different kinds of contexts. Specifically, every relation gives rise to a different kind of context. We have, up to this point, only defined relations and used their corresponding contexts to learn representations. But what if we were to reverse the direction of influence? Namely, can we look at a context and identify the specific relation that we need to focus on for a particular problem? This would enable us to choose from the potential candidates in an integrated representation model precisely the one we require for a particular problem. An exploration of relation identification would also require a deeper understanding of the connection between different problems and their corresponding relations. Up until this point, we have defined and used these relations in a fairly ad-hoc fashion, intuiting what might or might not be useful for a particular problem. In the future, a deeper understanding of this connection might lead to a more informed understanding and choice of relations for different problems.

What is a Relation?

This thesis was concerned with using relations to learn more expressive models of semantics. Relations encompass a very broad range of phenomena in language and data formalisms, and in Chapter 2 we attempted to provide a taxonomy of these occurrences of relations as they pertain to our work. This taxonomy served both as a technical summary of this thesis and as a high-level guideline on how to think about using relations in representation learning for a new domain or problem.

However, an important research issue when dealing with relations for representation learning is the relations themselves. In particular, for a particular language phenomenon or data formalism, how many relations are there? What are they? How do they organize the elements in the domain with respect to one another? Is the existing structure ideal? In Definition 1 of Chapter 1.2, we stated that relations are characterized by their mapping from pairs of vocabulary objects to the set of real numbers. This means that there are a potentially infinite number of semantic relations that can be learned by models. But which ones are meaningful? More importantly, which ones are useful?

It may be noted that finding answers to these questions does not necessarily affect the problem of using these relations for learning representations. However, these questions are crucial to providing a comprehensive account of relation-centric semantic representation learning. More importantly, when existing structured data in the form of annotated language resources is non-existent (say for a new domain or low-resource language), these questions are no longer theoretical. They become practical precursors to using relations to learn more expressive models of semantics. These questions become especially relevant, for example, when dealing with latent relations and relation induction, as we did in the second part of this thesis (Chapters 5 and 6). And finding answers to them becomes all the harder because we are dealing with latent relations. Except for the simplest count-and-normalize word vector models that represent contexts as vocabulary counts or probabilities in an explicit manner, most representation learning models are dense and opaque. Naturally, understanding and articulating relations that are learned from latent contexts and represented in dense vectors or tensors is challenging. We attempted to map induced relations to known, humanly-interpretable ones to gain insight into how our models are organized, both in Chapter 5 and Chapter 6. But we believe this understanding is only scratching the surface. Hence, an important avenue for future research involves attempting to understand and characterize learned relations in dense embedded spaces as they pertain to a particular domain of interest. For example, Li et al. (2015) propose a way of visualizing representations in the context of semantic compositionality to better understand what properties are being captured by the un-interpretable representations of neural models.

Bibliography

Abend, Omri, Reichart, Roi and Rappoport, Ari (2009). Unsupervised argument identification for semantic role labeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pp. 28–36. Association for Computational Linguistics. Aydin, Bahadir Ismail, Yilmaz, Yavuz Selim, Li, Yaliang, Li, Qi, Gao, Jing and Demirbas, Murat (2014). Crowdsourcing for multiple-choice question answering. In Twenty-Sixth IAAI Confer- ence. Baker, Collin F, Fillmore, Charles J and Lowe, John B (1998). The Berkeley Framenet project. In Proceedings of the 17th international conference on Computational linguistics-Volume 1, pp. 86–90. Association for Computational Linguistics. Baroni, Marco and Lenci, Alessandro (2010). Distributional memory: A general framework for corpus-based semantics. Comput. Linguist., volume 36 (4), pp. 673–721. URL: http://dx.doi.org/10.1162/colia00016 Baroni, Marco and Zamparelli, Roberto (2010). Nouns are vectors, adjectives are matrices: rep- resenting adjective-noun constructions in semantic space. In Proceedings of the 2010 Con- ference on Empirical Methods in Natural Language Processing, EMNLP ’10, pp. 1183–1193. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=1870658.1870773 Bejan, Cosmin Adrian and Harabagiu, Sanda (2010). Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pp. 1412–1422. Association for Computational Linguis- tics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=1858681.1858824 Bengio, Yoshua, Courville, Aaron and Vincent, Pierre (2013). Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, volume 35 (8), pp. 1798–1828. Bethard, Steven and Martin, James H (2007). CU-TMP: temporal relation classification using syntactic and semantic features. In Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 129–132. Association for Computational Linguistics. Bies, Ann, Mott, Justin, Warner, Colin and Kulick, Seth (2012). English web treebank. Linguistic Data Consortium, Philadelphia, PA. Blank, Andreas (2003). Polysemy in the lexicon and in discourse. TRENDS IN LINGUISTICS

133 STUDIES AND MONOGRAPHS, volume 142, pp. 267–296. Bruni, Elia, Tran, Nam-Khanh and Baroni, Marco (2014). Multimodal Distributional Semantics. Journal of Artificial Intelligence Research (JAIR), volume 49, pp. 1–47. Bullinaria, John A and Levy, Joseph P (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, volume 39 (3), pp. 510–526. Cafarella, Michael J, Halevy, Alon, Wang, Daisy Zhe, Wu, Eugene and Zhang, Yang (2008). Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment, volume 1 (1), pp. 538–549. Campbell, Trevor, Straub, Julian, Fisher III, John W and How, Jonathan P (2015). Streaming, distributed variational inference for Bayesian nonparametrics. In Advances in Neural Infor- mation Processing Systems, pp. 280–288. Carpenter, Bob (2008). Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. Technical report, Alias-i. Available at http://lingpipe-blog.com/ lingpipe- white-papers. Carpuat, Marine and Wu, Dekai (2007). Improving Statistical Machine Translation Using Word Sense Disambiguation. In EMNLP-CoNLL, volume 7, pp. 61–72. Cheung, Jackie Chi Kit, Poon, Hoifung and Vanderwende, Lucy (2013). Probabilistic frame induction. arXiv preprint arXiv:1302.4813. Ciaramita, Massimiliano and Altun, Yasemin (2006). Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, pp. 594– 602. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=1610075.1610158 Ciaramita, Massimiliano and Johnson, Mark (2000). Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th conference on Computational linguistics-Volume 1, pp. 187–193. Association for Computational Linguistics. Ciresan, Dan, Meier, Ueli and Schmidhuber, Jurgen¨ (2012). Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3642–3649. IEEE. Clark, Peter (2015). Elementary school science and math tests as a driver for AI: Take the Aristo Challenge. Proceedings of IAAI, 2015. Clark, Peter, Etzioni, Oren, Khot, Tushar, Sabharwal, Ashish, Tafjord, Oyvind, Turney, Peter and Khashabi, Daniel (2016). Combining Retrieval, Statistics, and Inference to Answer Elemen- tary Science Questions. Proceedings of the 30th AAAI Conference on Artificial Intelligence, AAAI-2016. Collobert, Ronan and Weston, Jason (2008). A unified architecture for natural language process- ing: deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, ICML ’08, pp. 160–167. ACM, New York, NY, USA. URL: http://doi.acm.org/10.1145/1390156.1390177

134 Collobert, Ronan, Weston, Jason, Bottou, Leon,´ Karlen, Michael, Kavukcuoglu, Koray and Kuksa, Pavel (2011). Natural Language Processing (Almost) from Scratch. J. Mach. Learn. Res., volume 999888, pp. 2493–2537. URL: http://dl.acm.org/citation.cfm?id=2078183.2078186 Corduneanu, Adrian and Jaakkola, Tommi (2002). On information regularization. In Proceed- ings of the Nineteenth conference on Uncertainty in Artificial Intelligence, pp. 151–158. Mor- gan Kaufmann Publishers Inc. Curran, James Richard (2004). From Distributional to Semantic Similarity. Technical report, University of Edinburgh. College of Science and Engineering. School of Informatics. Dahl, George, Mohamed, Abdel-rahman, Hinton, Geoffrey E et al. (2010). Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in neural information processing systems, pp. 469–477. Dalvi, Bhavana, Bhakthavatsalam, Sumithra, Clark, Chris, Clark, Peter, Etzioni, Oren, Fader, Anthony and Groeneveld, Dirk (2016). IKE-An Interactive Tool for Knowledge Extraction. In AKBC@ NAACL-HLT, pp. 12–17. Das, D. and Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. Proceedings of the. Das, Dipanjan, Schneider, Nathan, Chen, Desai and Smith, Noah A (2010). Probabilistic frame- semantic parsing. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics, pp. 948–956. Association for Computational Linguistics. Das, Dipanjan and Smith, Noah A. (2011). Semi-Supervised Frame-Semantic Parsing for Un- known Predicates. In Proc. of ACL. Deerwester, Scott C., Dumais, Susan T, Landauer, Thomas K., Furnas, George W. and Harshman, Richard A. (1990). Indexing by latent semantic analysis. JASIS, volume 41 (6), pp. 391–407. Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. JOURNAL OF THE ROYAL STATISTICAL SOCIETY, SERIES B, volume 39 (1), pp. 1–38. Deng, Li, Seltzer, Michael L, Yu, Dong, Acero, Alex, Mohamed, Abdel-rahman and Hinton, Geoffrey E (2010). Binary coding of speech spectrograms using a deep auto-encoder. In Interspeech, pp. 1692–1695. Citeseer. Dinu, Georgiana and Lapata, Mirella (2010). Measuring distributional similarity in context. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1162–1172. Association for Computational Linguistics. Doshi-Velez, Finale and Ghahramani, Zoubin (2009). Accelerated sampling for the Indian buffet process. In Proceedings of the 26th annual international conference on machine learning, pp. 273–280. ACM. Doshi-Velez, Finale, Miller, Kurt T, Van Gael, Jurgen, Teh, Yee Whye and Unit, Gatsby (2009a). Variational inference for the Indian buffet process. In Proceedings of the Intl. Conf. on Artifi- cial Intelligence and Statistics, volume 12, pp. 137–144.

135 Doshi-Velez, Finale, Mohamed, Shakir, Ghahramani, Zoubin and Knowles, David A (2009b). Large scale nonparametric Bayesian inference: Data parallelisation in the Indian buffet pro- cess. In Advances in Neural Information Processing Systems, pp. 1294–1302. Dowty, David (1991). Thematic proto-roles and argument selection. language, pp. 547–619. Erk, Katrin (2007). A simple, similarity-based model for selectional preferences. In Annual Meeting - Association For Computational Linguistics, volume 45, p. 216. Erk, Katrin and Pado,´ Sebastian (2008). A structured vector space model for word meaning in context. In Proceedings of the Conference on Empirical Methods in Natural Language Pro- cessing, EMNLP ’08, pp. 897–906. Association for Computational Linguistics, Stroudsburg, PA, USA. Erk, Katrin and Pado,´ Sebastian (2010). Exemplar-based models for word meaning in context. In Proceedings of the ACL 2010 Conference Short Papers, pp. 92–97. Association for Com- putational Linguistics. Etzioni, Oren, Banko, Michele, Soderland, Stephen and Weld, Daniel S (2008). Open informa- tion extraction from the web. Communications of the ACM, volume 51 (12), pp. 68–74. Fader, Anthony, Soderland, Stephen and Etzioni, Oren (2011). Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1535–1545. Association for Computational Linguistics. Fader, Anthony, Zettlemoyer, Luke and Etzioni, Oren (2014). Open question answering over cu- rated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1156–1165. ACM. Faruqui, Manaal, Dodge, Jesse, Jauhar, Sujay K, Dyer, Chris, Hovy, Eduard and Smith, Noah A (2014). Retrofitting Word Vectors to Semantic Lexicons. arXiv preprint arXiv:1411.4166. Fillmore, Charles J (1967). The Case for Case. ERIC. Finkelstein, Lev, Gabrilovich, Evgeniy, Matias, Yossi, Rivlin, Ehud, Solan, Zach, Wolfman, Gadi and Ruppin, Eytan (2002). Placing Search in Context: The Concept Revisited. In ACM Transactions on Information Systems, volume 20, pp. 116–131. Firth, John R. (1957). A Synopsis of Linguistic Theory, 1930-1955. Studies in Linguistic Analy- sis, pp. 1–32. Fung, Pascale and Chen, Benfeng (2004). BiFrameNet: bilingual frame semantics resource construction by cross-lingual induction. In Proceedings of the 20th international conference on Computational Linguistics, p. 931. Association for Computational Linguistics. Gabrilovich, Evgeniy and Markovitch, Shaul (2007). Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In IJCAI, volume 7, pp. 1606–1611. Ganitkevitch, Juri, Van Durme, Benjamin and Callison-Burch, Chris (2013). PPDB: The Para- phrase Database. In Proceedings of NAACL-HLT, pp. 758–764. Association for Computa- tional Linguistics, Atlanta, Georgia. URL: http://cs.jhu.edu/ ccb/publications/ppdb.pdf Gildea, Daniel and Jurafsky, Daniel (2002). Automatic labeling of semantic roles. Computational

136 linguistics, volume 28 (3), pp. 245–288. Girju, Roxana, Nakov, Preslav, Nastase, Vivi, Szpakowicz, Stan, Turney, Peter and Yuret, Deniz (2007). SemEval-2007 task 04: Classification of semantic relations between nominals. In Pro- ceedings of the 4th International Workshop on Semantic Evaluations, pp. 13–18. Association for Computational Linguistics. Glorot, Xavier, Bordes, Antoine and Bengio, Yoshua (2011). Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 513–520. Goldberg, Adele E (2005). Argument Realization. J.-O. Ostman,¨ & M. Fried, Construction grammars: Cognitive grounding and theoretical extensions, pp. 17–43. Golub, Gene H and Reinsch, Christian (1970). Singular value decomposition and least squares solutions. Numerische mathematik, volume 14 (5), pp. 403–420. Goyal, Kartik, Jauhar, Sujay Kumar, Li, Huiying, Sachan, Mrinmaya, Srivastava, Shashank and Hovy, Eduard (2013a). A Structured Distributional Semantic Model for Event Co-reference. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL – 2013). Goyal, Kartik, Jauhar, Sujay Kumar, Li, Huiying, Sachan, Mrinmaya, Srivastava, Shashank and Hovy, Eduard (2013b). A Structured Distributional Semantic Model: Integrating Structure with Semantics. In Proceedings of the 51st Annual Meeting of the Association for Computa- tional Linguistics (ACL – 2013), CVSC Workshop. Grefenstette, Edward and Sadrzadeh, Mehrnoosh (2011). Experimental Support for a Categor- ical Compositional Distributional Model of Meaning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pp. 1394–1404. Associa- tion for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=2145432.2145580 Grefenstette, Edward, Sadrzadeh, Mehrnoosh, Clark, Stephen, Coecke, Bob and Pulman, Stephen (2011). Concrete sentence spaces for compositional distributional models of mean- ing. In Proceedings of the Ninth International Conference on Computational Semantics, IWCS ’11, pp. 125–134. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=2002669.2002683 Grenager, Trond and Manning, Christopher D (2006). Unsupervised discovery of a statistical verb lexicon. In Proceedings of the 2006 Conference on Empirical Methods in Natural Lan- guage Processing, pp. 1–8. Association for Computational Linguistics. Griffiths, Thomas L and Ghahramani, Zoubin (2005). Infinite latent feature models and the Indian buffet process. In NIPS, volume 18, pp. 475–482. Griffiths, Thomas L and Ghahramani, Zoubin (2011). The indian buffet process: An introduction and review. The Journal of Machine Learning Research, volume 12, pp. 1185–1224. Guevara, Emiliano (2010). A regression model of adjective-noun compositionality in distribu- tional semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natu- ral Language Semantics, GEMS ’10, pp. 33–37. Association for Computational Linguistics,

137 Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=1870516.1870521 Guo, Jiang, Che, Wanxiang, Wang, Haifeng and Liu, Ting (2014). Learning sense-specific word embeddings by exploiting bilingual resources. In Proceedings of COLING, pp. 497–507. Haghighi, Aria and Klein, Dan (2009). Simple coreference resolution with rich syntactic and semantic features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 - Volume 3, EMNLP ’09, pp. 1152–1161. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=1699648.1699661 Halko, Nathan, Martinsson, Per-Gunnar and Tropp, Joel A (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, volume 53 (2), pp. 217–288. Harris, Zellig (1954). Distributional structure. Word, volume 10 (23), pp. 146–162. Hassan, Samer and Mihalcea, Rada (2011). Semantic Relatedness Using Salient Semantic Anal- ysis. In AAAI. Hendrickx, Iris, Kim, Su Nam, Kozareva, Zornitsa, Nakov, Preslav, OS´ eaghdha,´ Diarmuid, Pado,´ Sebastian, Pennacchiotti, Marco, Romano, Lorenza and Szpakowicz, Stan (2009). SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nom- inals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pp. 94–99. Association for Computational Linguistics. Hjelm, Hans (2007). Identifying cross language term equivalents using statistical machine trans- lation and distributional association measures. In Proceedings of NODALIDA, pp. 97–104. Citeseer. Hjort, Nils Lid, Holmes, Chris, Muller,¨ Peter and Walker, Stephen G (2010). Bayesian nonpara- metrics, volume 28. Cambridge University Press. Hochreiter, Sepp and Schmidhuber, Jurgen¨ (1997). Long short-term memory. Neural computa- tion, volume 9 (8), pp. 1735–1780. Hovy, Eduard, Marcus, Mitchell, Palmer, Martha, Ramshaw, Lance and Weischedel, Ralph (2006). OntoNotes: the 90% solution. In Proceedings of the human language technology conference of the NAACL, Companion Volume: Short Papers, pp. 57–60. Association for Com- putational Linguistics. Huang, Eric H, Socher, Richard, Manning, Christopher D and Ng, Andrew Y (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th ACL: Long Papers-Volume 1, pp. 873–882. Jackendoff, Ray (1987). The Status of Thematic Roles in Linguistic Theory. Linguistic Inquiry, volume 18 (3), pp. 369–411. Jain, Prateek, Netrapalli, Praneeth and Sanghavi, Sujay (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pp. 665–674. ACM. Jarmasz, Mario and Szpakowicz, Stan (2004). Roget’s Thesaurus and Semantic Similarity. Re-

Jauhar, Sujay Kumar, Dyer, Chris and Hovy, Eduard (2015). Ontologically Grounded Multi-sense Representation Learning for Semantic Vector Space Models. In Proc. NAACL.
Jauhar, Sujay Kumar and Hovy, Eduard (2014). Inducing Latent Semantic Relations for Structured Distributional Semantics. In Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014, pp. 698–708.
Jauhar, Sujay Kumar and Hovy, Eduard (2017). Embedded Semantic Lexicon Induction with Joint Global and Local Optimization. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pp. 209–219. Association for Computational Linguistics, Vancouver, Canada. URL: http://www.aclweb.org/anthology/S17-1025
Jauhar, Sujay Kumar, Turney, Peter D and Hovy, Eduard (2016a). Tables as Semi-structured Knowledge for Question Answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL-2016, pp. 474–483.
Jauhar, Sujay Kumar, Turney, Peter D. and Hovy, Eduard (2016b). TabMCQ: A Dataset of General Knowledge Tables and Multiple-choice Questions. Submitted to LREC-2015; under review.
Ji, Heng and Grishman, Ralph (2011). Knowledge base population: Successful approaches and challenges. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 1148–1158. Association for Computational Linguistics.
Johnson, Mark (2007). Why doesn't EM find good HMM POS-taggers? In EMNLP-CoNLL, pp. 296–305. Citeseer.
Joubarne, Colette and Inkpen, Diana (2011). Comparison of semantic similarity for different languages using the Google N-gram corpus and second-order co-occurrence measures. In Advances in Artificial Intelligence, pp. 216–221. Springer.
Jurgens, David A, Turney, Peter D, Mohammad, Saif M and Holyoak, Keith J (2012). SemEval-2012 task 2: Measuring degrees of relational similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pp. 356–364. Association for Computational Linguistics.
Kawahara, Daisuke, Peterson, Daniel, Popescu, Octavian, Palmer, Martha and Kessler, Fondazione Bruno (2014). Inducing Example-based Semantic Frames from a Massive Amount of Verb Uses. In EACL, pp. 58–67.
Khot, Tushar, Balasubramanian, Niranjan, Gribkoff, Eric, Sabharwal, Ashish, Clark, Peter and Etzioni, Oren (2015). Exploring Markov Logic Networks for Question Answering. Proceedings of EMNLP, 2015.
Kilgarriff, Adam (1997). I don't believe in word senses. Computers and the Humanities, volume 31 (2), pp. 91–113.
Kingsbury, Paul and Palmer, Martha (2002). From TreeBank to PropBank. In LREC. Citeseer.
Kolda, Tamara G and Bader, Brett W (2009). Tensor decompositions and applications. SIAM review, volume 51 (3), pp. 455–500.
Krizhevsky, Alex, Sutskever, Ilya and Hinton, Geoffrey E (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
Lafferty, John, McCallum, Andrew and Pereira, Fernando CN (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289.
Landauer, Thomas K and Dumais, Susan T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, pp. 211–240.
Lang, Joel and Lapata, Mirella (2011a). Unsupervised semantic role induction via split-merge clustering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 1117–1126. Association for Computational Linguistics.
Lang, Joel and Lapata, Mirella (2011b). Unsupervised semantic role induction with graph partitioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1320–1331. Association for Computational Linguistics.
Lang, Joel and Lapata, Mirella (2014). Similarity-driven semantic role induction via graph partitioning. Computational Linguistics, volume 40 (3), pp. 633–669.
Lee, Heeyoung, Recasens, Marta, Chang, Angel, Surdeanu, Mihai and Jurafsky, Dan (2012). Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pp. 489–500. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=2390948.2391006
Lehmann, Jens, Isele, Robert, Jakob, Max, Jentzsch, Anja, Kontokostas, Dimitris, Mendes, Pablo N, Hellmann, Sebastian, Morsey, Mohamed, Van Kleef, Patrick, Auer, Sören et al. (2015). DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, volume 6 (2), pp. 167–195.
Levesque, Hector J, Davis, Ernest and Morgenstern, Leora (2011). The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, p. 47.
Levin, Beth (1993). English verb classes and alternations: A preliminary investigation. University of Chicago Press.
Levin, Beth and Rappaport Hovav, Malka (2005). Argument Realization. Cambridge University Press.
Levy, Omer and Goldberg, Yoav (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pp. 2177–2185.
Li, Jiwei, Chen, Xinlei, Hovy, Eduard and Jurafsky, Dan (2015). Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066.
Limaye, Girija, Sarawagi, Sunita and Chakrabarti, Soumen (2010). Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment, volume 3 (1-2), pp. 1338–1347.
Lin, Dekang (1998). Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics-Volume 2, pp. 768–774. Association for Computational Linguistics.
Lin, Dekang and Pantel, Patrick (2001). Discovery of inference rules for question-answering. Natural Language Engineering, volume 7 (04), pp. 343–360.
Marcus, Mitchell, Kim, Grace, Marcinkiewicz, Mary Ann, MacIntyre, Robert, Bies, Ann, Ferguson, Mark, Katz, Karen and Schasberger, Britta (1994). The Penn Treebank: annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, pp. 114–119. Association for Computational Linguistics.
Màrquez, Lluís, Carreras, Xavier, Litkowski, Kenneth C and Stevenson, Suzanne (2008). Semantic role labeling: an introduction to the special issue. Computational Linguistics, volume 34 (2), pp. 145–159.
McCarthy, Diana and Carroll, John (2003). Disambiguating Nouns, Verbs, and Adjectives Using Automatically Acquired Selectional Preferences. Comput. Linguist., volume 29 (4), pp. 639–654.
McCarthy, Diana, Koeling, Rob, Weeds, Julie and Carroll, John (2004). Finding predominant word senses in untagged text. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL '04. Association for Computational Linguistics, Stroudsburg, PA, USA.
Mikolov, Tomas, Chen, Kai, Corrado, Greg and Dean, Jeffrey (2013a). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
Mikolov, Tomas, Karafiát, Martin, Burget, Lukáš, Černocký, Jan and Khudanpur, Sanjeev (2010). Recurrent neural network based language model. Proceedings of Interspeech.
Mikolov, Tomas, Yih, Wen-tau and Zweig, Geoffrey (2013b). Linguistic regularities in continuous space word representations. Proceedings of NAACL-HLT, pp. 746–751.
Miller, George and Charles, Walter (1991). Contextual correlates of semantic similarity. In Language and Cognitive Processes, pp. 1–28.
Miller, George A (1995). WordNet: a lexical database for English. Communications of the ACM, volume 38 (11), pp. 39–41.
Mitchell, Jeff and Lapata, Mirella (2008). Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pp. 236–244.
Mitchell, Jeff and Lapata, Mirella (2009). Language models based on semantic composition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, EMNLP '09, pp. 430–439. Association for Computational Linguistics, Stroudsburg, PA, USA.
Mnih, Andriy and Teh, Yee Whye (2012). A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pp. 1751–1758.
Moens, Marc and Steedman, Mark (1988). Temporal ontology and temporal reference. Computational Linguistics, volume 14 (2), pp. 15–28.
Mohammad, Saif, Dorr, Bonnie and Hirst, Graeme (2008). Computing word-pair antonymy. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 982–991. Association for Computational Linguistics.
Moldovan, Dan, Clark, Christine, Harabagiu, Sanda and Maiorano, Steve (2003). Cogex: A logic prover for question answering. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pp. 87–93. Association for Computational Linguistics.
Narayanan, Srini and Harabagiu, Sanda (2004). Question answering based on semantic structures. In Proceedings of the 20th International Conference on Computational Linguistics, p. 693. Association for Computational Linguistics.
Neelakantan, Arvind, Le, Quoc V and Sutskever, Ilya (2015). Neural Programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834.
Neelakantan, Arvind, Shankar, Jeevan, Passos, Alexandre and McCallum, Andrew (2014). Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of EMNLP.
Netrapalli, Praneeth, Jain, Prateek and Sanghavi, Sujay (2013). Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pp. 2796–2804.
Nivre, Joakim (2003). An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT). Citeseer.
Nivre, Joakim (2005). Dependency grammar and dependency parsing. MSI report, volume 5133 (1959), pp. 1–32.
O'Hara, Tom and Wiebe, Janyce (2009). Exploiting semantic role resources for preposition disambiguation. Computational Linguistics, volume 35 (2), pp. 151–184.
Padó, Sebastian and Lapata, Mirella (2007). Dependency-based construction of semantic space models. Computational Linguistics, volume 33 (2), pp. 161–199.
Palmer, Martha, Gildea, Daniel and Kingsbury, Paul (2005). The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, volume 31 (1), pp. 71–106.
Pantel, Patrick and Lin, Dekang (2000). Word-for-word glossing with contextually similar words. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, pp. 78–85. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=974305.974316
Pantel, Patrick and Lin, Dekang (2002). Discovering word senses from text. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 613–619. ACM.
Pasupat, Panupong and Liang, Percy (2015). Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305.
Pedersen, Ted and Kolhatkar, Varada (2009). WordNet::SenseRelate::AllWords: a broad coverage word sense tagger that maximizes semantic relatedness. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Demonstration Session, pp. 17–20. Association for Computational Linguistics.
Pennington, Jeffrey, Socher, Richard and Manning, Christopher D (2014). Glove: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), volume 12, pp. 1532–1543.
Pilehvar, Mohammad Taher, Jurgens, David and Navigli, Roberto (2013). Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity. In ACL (1), pp. 1341–1351.
Pimplikar, Rakesh and Sarawagi, Sunita (2012). Answering table queries on the web using column keywords. Proceedings of the VLDB Endowment, volume 5 (10), pp. 908–919.
Pradet, Quentin, Baguenier-Desormeaux, Jeanne, De Chalendar, Gaël and Danlos, Laurence (2013). WoNeF: amélioration, extension et évaluation d'une traduction française automatique de WordNet. Actes de TALN 2013, pp. 76–89.
Rabiner, Lawrence and Juang, B (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, volume 3 (1), pp. 4–16.
Raghunathan, Karthik, Lee, Heeyoung, Rangarajan, Sudarshan, Chambers, Nathanael, Surdeanu, Mihai, Jurafsky, Dan and Manning, Christopher (2010). A multi-pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pp. 492–501. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=1870658.1870706
Ravichandran, Deepak and Hovy, Eduard (2002). Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 41–47. Association for Computational Linguistics.
Reddy, Siva, Klapaftis, Ioannis P, McCarthy, Diana and Manandhar, Suresh (2011). Dynamic and Static Prototype Vectors for Semantic Composition. In IJCNLP, pp. 705–713.
Rehurek, Radim (2010). Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms. NIPS 2010 Workshop on Low-rank Methods for Large-scale Machine Learning.
Reisinger, Joseph and Mooney, Raymond J (2010). Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 109–117. Association for Computational Linguistics.
Resnik, Philip (1997). Selectional preference and sense disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How, pp. 52–57. Washington, DC.
Riedel, Sebastian, Yao, Limin, McCallum, Andrew and Marlin, Benjamin M (2013). Relation Extraction with Matrix Factorization and Universal Schemas. In HLT-NAACL, pp. 74–84.
Ritter, Alan, Mausam and Etzioni, Oren (2010). A latent Dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 424–434. Association for Computational Linguistics.
Rooth, Mats, Riezler, Stefan, Prescher, Detlef, Carroll, Glenn and Beil, Franz (1999). Inducing a semantically annotated lexicon via EM-based clustering. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 104–111. Association for Computational Linguistics.
Rubenstein, Herbert and Goodenough, John B. (1965). Contextual Correlates of Synonymy. Commun. ACM, volume 8 (10), pp. 627–633.
Rudolph, Sebastian and Giesbrecht, Eugenie (2010). Compositional matrix-space models of language. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pp. 907–916. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=1858681.1858774
Salton, Gerard, Wong, Anita and Yang, Chung-Shu (1975). A vector space model for automatic indexing. Communications of the ACM, volume 18 (11), pp. 613–620.
Schütze, Hinrich (1998). Automatic word sense discrimination. Comput. Linguist., volume 24 (1), pp. 97–123.
Schwenk, Holger, Rousseau, Anthony and Attik, Mohammed (2012). Large, pruned or continuous space language models on a GPU for statistical machine translation. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pp. 11–19. Association for Computational Linguistics.
Séaghdha, Diarmuid Ó (2010). Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 435–444. Association for Computational Linguistics.
Séaghdha, Diarmuid Ó and Korhonen, Anna (2011). Probabilistic models of similarity in syntactic context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pp. 1047–1057. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=2145432.2145545
Shen, Dan and Lapata, Mirella (2007). Using Semantic Roles to Improve Question Answering. In EMNLP-CoNLL, pp. 12–21.
Smith, Noah A (2004). Log-linear models.
Smith, Noah A (2011). Linguistic structure prediction. Synthesis Lectures on Human Language Technologies, volume 4 (2), pp. 1–274.
Socher, Richard, Huval, Brody, Manning, Christopher D. and Ng, Andrew Y. (2012). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pp. 1201–1211. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=2390948.2391084
Socher, Richard, Manning, Christopher D and Ng, Andrew Y (2010). Learning continuous phrase representations and syntactic parsing with recursive neural networks. Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.
Soderland, Stephen (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, volume 34 (1-3), pp. 233–272.
Sowa, John F. (1991). Principles of Semantic Networks. Morgan Kaufmann.
Srihari, Rohini and Li, Wei (1999). Information extraction supported question answering. Technical report, DTIC Document.
Srivastava, Nitish, Hinton, Geoffrey E, Krizhevsky, Alex, Sutskever, Ilya and Salakhutdinov, Ruslan (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, volume 15 (1), pp. 1929–1958.
Stoyanov, Veselin, Gilbert, Nathan, Cardie, Claire and Riloff, Ellen (2009). Conundrums in noun phrase coreference resolution: making sense of the state-of-the-art. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, ACL '09, pp. 656–664. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=1690219.1690238
Subramanya, Amarnag and Bilmes, Jeff A (2009). Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification. In NIPS, pp. 1803–1811.
Suchanek, Fabian M, Kasneci, Gjergji and Weikum, Gerhard (2007). Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pp. 697–706. ACM.
Sun, Huan, He, Xiaodong, Yih, Wen-tau, Su, Yu and Yan, Xifeng (2016). Table Cell Search for Question Answering. In Proceedings of the 25th International Conference on World Wide Web (to appear).
Surdeanu, Mihai, Harabagiu, Sanda, Williams, John and Aarseth, Paul (2003). Using predicate-argument structures for information extraction. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pp. 8–15. Association for Computational Linguistics.
Surdeanu, Mihai, Johansson, Richard, Meyers, Adam, Màrquez, Lluís and Nivre, Joakim (2008). The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, pp. 159–177. Association for Computational Linguistics.
Syed, Zareen, Finin, Tim, Mulwad, Varish and Joshi, Anupam (2010). Exploiting a web of semantic data for interpreting tables. In Proceedings of the Second Web Science Conference.
Tellex, Stefanie, Katz, Boris, Lin, Jimmy, Fernandes, Aaron and Marton, Gregory (2003). Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 41–47. ACM.
Thater, Stefan, Fürstenau, Hagen and Pinkal, Manfred (2010). Contextualizing semantic representations using syntactically enriched vector models. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pp. 948–957. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=1858681.1858778
Thater, Stefan, Fürstenau, Hagen and Pinkal, Manfred (2011). Word Meaning in Context: A Simple and Effective Vector Model. In IJCNLP, pp. 1134–1143.
Tian, Fei, Dai, Hanjun, Bian, Jiang, Gao, Bin, Zhang, Rui, Chen, Enhong and Liu, Tie-Yan (2014). A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING, pp. 151–160.
Titov, Ivan and Klementiev, Alexandre (2012a). A Bayesian approach to unsupervised semantic role induction. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 12–22. Association for Computational Linguistics.
Titov, Ivan and Klementiev, Alexandre (2012b). Crosslingual induction of semantic roles. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 647–656. Association for Computational Linguistics.
Tratz, Stephen and Hovy, Eduard (2011). A fast, accurate, non-projective, semantically-enriched parser. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pp. 1257–1268. Association for Computational Linguistics, Stroudsburg, PA, USA. URL: http://dl.acm.org/citation.cfm?id=2145432.2145564
Turian, Joseph, Ratinov, Lev and Bengio, Yoshua (2010). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394. Association for Computational Linguistics.
Turney, Peter (2005). Measuring semantic similarity by latent relational analysis. In Proceedings of the 19th International Conference on Artificial Intelligence, pp. 1136–1141.
Turney, Peter (2007a). Attributes and Relations: Redder than Red. URL: http://blog.apperceptual.com/attributes-and-relations-redder-than-red
Turney, Peter (2007b). Empirical evaluation of four tensor decomposition algorithms. NRC Publications Archive – Tech Report.
Turney, Peter D. (2002). Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. CoRR.
Turney, Peter D. (2006). Similarity of Semantic Relations. Comput. Linguist., volume 32 (3), pp. 379–416. URL: http://dx.doi.org/10.1162/coli.2006.32.3.379
Turney, Peter D (2008a). The Latent Relation Mapping Engine: Algorithm and Experiments. Journal of Artificial Intelligence Research (JAIR), volume 33, pp. 615–655.
Turney, Peter D (2008b). A uniform approach to analogies, synonyms, antonyms, and associations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 905–912.
Turney, Peter D (2011). Analogy perception applied to seven tests of word comprehension. Journal of Experimental & Theoretical Artificial Intelligence, volume 23 (3), pp. 343–362.
Turney, Peter D (2012). Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, volume 44, pp. 533–585.
Turney, Peter D (2013). Distributional semantics beyond words: Supervised learning of analogy and paraphrase. Transactions of the Association for Computational Linguistics, volume 1, pp. 353–366.
Turney, Peter D. and Pantel, Patrick (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, pp. 141–188.
Van de Cruys, Tim (2014). A Neural Network Approach to Selectional Preference Acquisition. In EMNLP, pp. 26–35.
Van de Cruys, Tim, Poibeau, Thierry and Korhonen, Anna (2011). Latent vector weighting for word meaning in context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1012–1022. Association for Computational Linguistics.
Van de Cruys, Tim, Poibeau, Thierry and Korhonen, Anna (2013). A tensor-based factorization model of semantic compositionality. In Proceedings of NAACL-HLT, pp. 1142–1151.
Vasilescu, M. Alex O. and Terzopoulos, Demetri (2002). Multilinear Analysis of Image Ensembles: TensorFaces. In Proceedings of the European Conference on Computer Vision, pp. 447–460.
Venetis, Petros, Halevy, Alon, Madhavan, Jayant, Paşca, Marius, Shen, Warren, Wu, Fei, Miao, Gengxin and Wu, Chung (2011). Recovering Semantics of Tables on the Web. Proc. VLDB Endow., volume 4 (9), pp. 528–538. URL: http://dx.doi.org/10.14778/2002938.2002939
Weston, Jason, Bordes, Antoine, Chopra, Sumit and Mikolov, Tomas (2015). Towards AI-complete question answering: a set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.
Wong, S. K. M. and Raghavan, Vijay V. (1984). Vector space model of information retrieval: a reevaluation. In Proceedings of the 7th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '84, pp. 167–185. British Computer Society, Swinton, UK.
Woodsend, Kristian and Lapata, Mirella (2015). Distributed Representations for Unsupervised Semantic Role Labeling. In EMNLP, pp. 2482–2491. Citeseer.
Wu, Fei and Weld, Daniel S (2010). Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 118–127. Association for Computational Linguistics.
Xu, Feiyu, Uszkoreit, Hans, Krause, Sebastian and Li, Hong (2010). Boosting relation extraction with limited closed-world knowledge. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 1354–1362. Association for Computational Linguistics.
Yao, Limin, Riedel, Sebastian and McCallum, Andrew (2012). Probabilistic databases of universal schema. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pp. 116–121. Association for Computational Linguistics.
Yin, Pengcheng, Duan, Nan, Kao, Ben, Bao, Junwei and Zhou, Ming (2015a). Answering Questions with Complex Semantic Constraints on Open Knowledge Bases. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1301–1310. ACM.
Yin, Pengcheng, Lu, Zhengdong, Li, Hang and Kao, Ben (2015b). Neural Enquirer: Learning to Query Tables. arXiv preprint arXiv:1512.00965.
Yu, Mo and Dredze, Mark (2014). Improving Lexical Embeddings with Semantic Knowledge. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
Zhang, Michael M, Dubey, Avinava and Williamson, Sinead A (2017). Parallel Markov Chain Monte Carlo for the Indian Buffet Process. arXiv preprint arXiv:1703.03457.
Zhang, Zhu (2004). Weakly-supervised relation classification for information extraction. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pp. 581–588. ACM.
Zhong, Zhi and Ng, Hwee Tou (2010). It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pp. 78–83. Association for Computational Linguistics.
Zhu, Xiaojin, Ghahramani, Zoubin, Lafferty, John et al. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pp. 912–919.