A Type-Driven Tensor-Based Semantics for CCG

A Type-Driven Tensor-Based Semantics for CCG Jean Maillard Stephen Clark Edward Grefenstette University of Cambridge University of Cambridge University of Oxford Computer Laboratory Computer Laboratory Department of Computer Science [email protected] [email protected] [email protected] Abstract One approach is to assume that the meanings of all words are represented by context vectors, and This paper shows how the tensor-based se- then combine those vectors using some operation, mantic framework of Coecke et al. can such as vector addition, element-wise multiplica- be seamlessly integrated with Combina- tion, or tensor product (Clark and Pulman, 2007; tory Categorial Grammar (CCG). The inte- Mitchell and Lapata, 2008). A more sophisticated gration follows from the observation that approach, which is the subject of this paper, is to tensors are linear maps, and hence can adapt the compositional process from formal se- be manipulated using the combinators of mantics (Dowty et al., 1981) and attempt to build CCG, including type-raising and compo- a distributional representation in step with the syn- sition. Given the existence of robust, tactic derivation (Coecke et al., 2010; Baroni et al., wide-coverage CCG parsers, this opens up 2013). Finally, there is a third approach using neu- the possibility of a practical, type-driven ral networks, which perhaps lies in between the compositional semantics based on distri- two described above (Socher et al., 2010; Socher butional representations. et al., 2012). Here compositional distributed representations are built using matrices operating on 1 Intoduction vectors, with all parameters learnt through a su- In this paper we show how tensor-based distribu- pervised learning procedure intended to optimise tional semantics can be seamlessly integrated with performance on some NLP task, such as syntac- Combinatory Categorial Grammar (CCG, Steed- tic parsing or sentiment analysis. The approach man (2000)), building on the theoretical discus- of Hermann and Blunsom (2013) conditions the sion in Grefenstette (2013). Tensor-based distribu- vector combination operation on the syntactic type tional semantics represents the meanings of words of the combinands, moving it a little closer to the with particular syntactic types as tensors whose se- more formal semantics-inspired approaches. mantic type matches that of the syntactic type (Co- The remainder of the Introduction gives a short ecke et al., 2010). For example, the meaning of a summary of distributional semantics. The rest of transitive verb with syntactic type (S NP)/NP is the paper introduces some mathematical notation \ a 3rd-order tensor from the tensor product space from multi-linear algebra, including Einstein nota- N S N. The seamless integration with CCG tion, and then shows how the combinatory rules of ⊗ ⊗ arises from the (somewhat trivial) observation that CCG, including type-raising and composition, can tensors are linear maps — a particular kind of be applied directly to tensor-based semantic rep- function — and hence can be manipulated using resentations. As well as describing a tensor-based CCG’s combinatory rules. semantics for CCG, a further goal of this paper is to Tensor-based semantics arises from the desire to present the compositional framework of Coecke et enhance distributional semantics with some com- al. (2010), which is based on category theory, to a positional structure, in order to make distribu- computational linguistics audience using only the tional semantics more of a complete semantic the- mathematics of multi-linear algebra. ory, and to increase its utility in NLP applica- tions. There are a number of suggestions for how 1.1 Distributional Semantics to add compositionality to a distributional seman- We assume a basic knowledge of distributional se- tics (Clarke, 2012; Pulman, 2013; Erk, 2012). mantics (Grefenstette, 1994; Schutze,¨ 1998). Re- 46 Proceedings of the EACL 2014 Workshop on Type Theory and Natural Language Semantics (TTNLS), pages 46–54, Gothenburg, Sweden, April 26-30 2014. c 2014 Association for Computational Linguistics cent inroductions to the topic include Turney and ing recent techniques from “recursive” neural net- Pantel (2010) and Clark (2014). works (Socher et al., 2010). Another possibility A potentially useful distinction for this paper, is suggested by Grefenstette et al. (2013), extend- and one not commonly made, is between distri- ing the learning technique based on linear regres- butional and distributed representations. Distri- sion from Baroni and Zamparelli (2010) in which butional representations are inherently contextual, “gold-standard” distributional representations are and rely on the frequently quoted dictum from assumed to be available for some phrases and Firth that “you shall know a word from the com- larger units. pany it keeps” (Firth, 1957; Pulman, 2013). This 2 Mathematical Preliminaries leads to the so-called distributional hypothesis that words that occur in similar contexts tend to have The tensor-based compositional process relies on similar meanings, and to various proposals for taking dot (or inner) products between vectors and how to implement this hypothesis (Curran, 2004), higher-order tensors. Dot products, and a number including alternative definitions of context; alter- of other operations on vectors and tensors, can be native weighting schemes which emphasize the conveniently written using Einstein notation (also importance of some contexts over others; alterna- referred to as the Einstein summation convention). tive similarity measures; and various dimension- In the rest of the paper we assume that the vector ality reduction schemes such as the well-known spaces are over the field of real numbers. LSA technique (Landauer and Dumais, 1997). An interesting conceptual question is whether a sim- 2.1 Einstein Notation ilar distributional hypothesis can be applied to The squared amplitude of a vector v Rn is given ∈ phrases and larger units: is it the case that sen- by: tences, for example, have similar meanings if they n v 2 = v v occur in similar contexts? Work which does ex- | | i i i=1 tend the distributional hypothesis to larger units X includes Baroni and Zamparelli (2010), Clarke Similarly, the dot product of two vectors v, w n ∈ (2012), and Baroni et al. (2013). R is given by: Distributed representations, on the other hand, n can be thought of simply as vectors (or possibly v w = viwi · higher-order tensors) of real numbers, where there Xi=1 is no a priori interpretation of the basis vectors. Denote the components of an m n real matrix Neural networks can perhaps be categorised in this × A by Aij for 1 i m and 1 j n. Then way, since the resulting vector representations are ≤ ≤ ≤ ≤ n the matrix-vector product of A and v R gives simply sequences of real numbers resulting from m ∈ a vector Av R with components: the optimisation of some training criterion on a ∈ training set (Collobert and Weston, 2008; Socher n et al., 2010). Whether these distributed represen- (Av)i = Aijvj tations can be given a contextual interpretation de- Xj=1 pends on how they are trained. We can also multiply an n m matrix A and an × One important point for this paper is that the m o matrix B to produce an n o matrix AB × × tensor-based compositional process makes no as- with components: sumptions about the interpretation of the tensors. m Hence in the remainder of the paper we make no (AB) = A B reference to how noun vectors or verb tensors, ij ik kj k=1 for example, can be acquired (which, for the case X of the higher-order tensors, is a wide open re- The previous examples are some of the most search question). However, in order to help the common operations in linear algebra, and they all reader who would prefer a more grounded dis- involve sums over repeated indices. They can be cussion, one possibility is to obtain the noun vec- simplified by introducing the Einstein summation tors using standard distributional techniques (Cur- convention: summation over the relevant range ran, 2004), and learn the higher-order tensors us- is implied on every component index that occurs 47 twice. Pairs of indices that are summed over are Thus every finite-dimensional vector is a linear known as contracted, while the remaining indices functional, and vice versa. Row and column vec- are known as free. Using this convention, the tors are examples of first-order tensors. above operations can be written as: Definition 1 (First-order tensor). Given a vector 2 space V over the field , a first-order tensor T v = vivi R | | can be defined as: v w = viwi · an element of the vector space V , • (Av)i = Aijvj, i.e. the contraction of v with a linear map T : V R, the second index of A • → a V -dimensional array of numbers Ti, for (AB)ij = AikBkj, i.e. the contraction of the • | | 1 i V . second index of A with the first of B ≤ ≤ | | These three definitions are all equivalent. Given Note how the number of free indices is always a first-order tensor described using one of these conserved between the left- and right-hand sides in definitions, it is trivial to find the two other de- these examples. For instance, while the last equa- scriptions. tion has two indices on the left and four on the right, the two extra indices on the right are con- Matrices An n m matrix A over R can be rep- × tracted. Hence counting the number of free indices resented by a two-dimensional array of real num- can be a quick way of determining what type of bers A , for 1 i n and 1 j m. ij ≤ ≤ ≤ ≤ object is given by a certain mathematical expres- Via matrix-vector multiplication, the matrix A sion in Einstein notation: no free indices means can be seen as a linear map A : Rm Rn.

Load more