A Type-Driven Tensor-Based Semantics for CCG
Jean Maillard, University of Cambridge, Computer Laboratory
Stephen Clark, University of Cambridge, Computer Laboratory
Edward Grefenstette, University of Oxford, Department of Computer Science
Proceedings of the EACL 2014 Workshop on Type Theory and Natural Language Semantics (TTNLS), pages 46–54, Gothenburg, Sweden, April 26-30 2014. © 2014 Association for Computational Linguistics

Abstract

This paper shows how the tensor-based semantic framework of Coecke et al. can be seamlessly integrated with Combinatory Categorial Grammar (CCG). The integration follows from the observation that tensors are linear maps, and hence can be manipulated using the combinators of CCG, including type-raising and composition. Given the existence of robust, wide-coverage CCG parsers, this opens up the possibility of a practical, type-driven compositional semantics based on distributional representations.

1 Introduction

In this paper we show how tensor-based distributional semantics can be seamlessly integrated with Combinatory Categorial Grammar (CCG, Steedman (2000)), building on the theoretical discussion in Grefenstette (2013). Tensor-based distributional semantics represents the meanings of words with particular syntactic types as tensors whose semantic type matches that of the syntactic type (Coecke et al., 2010). For example, the meaning of a transitive verb with syntactic type (S\NP)/NP is a 3rd-order tensor from the tensor product space $N \otimes S \otimes N$. The seamless integration with CCG arises from the (somewhat trivial) observation that tensors are linear maps (a particular kind of function) and hence can be manipulated using CCG's combinatory rules.

Tensor-based semantics arises from the desire to enhance distributional semantics with some compositional structure, in order to make distributional semantics more of a complete semantic theory, and to increase its utility in NLP applications. There are a number of suggestions for how to add compositionality to a distributional semantics (Clarke, 2012; Pulman, 2013; Erk, 2012). One approach is to assume that the meanings of all words are represented by context vectors, and then combine those vectors using some operation, such as vector addition, element-wise multiplication, or tensor product (Clark and Pulman, 2007; Mitchell and Lapata, 2008). A more sophisticated approach, which is the subject of this paper, is to adapt the compositional process from formal semantics (Dowty et al., 1981) and attempt to build a distributional representation in step with the syntactic derivation (Coecke et al., 2010; Baroni et al., 2013). Finally, there is a third approach using neural networks, which perhaps lies in between the two described above (Socher et al., 2010; Socher et al., 2012). Here compositional distributed representations are built using matrices operating on vectors, with all parameters learnt through a supervised learning procedure intended to optimise performance on some NLP task, such as syntactic parsing or sentiment analysis. The approach of Hermann and Blunsom (2013) conditions the vector combination operation on the syntactic type of the combinands, moving it a little closer to the more formal semantics-inspired approaches.

The remainder of the Introduction gives a short summary of distributional semantics. The rest of the paper introduces some mathematical notation from multi-linear algebra, including Einstein notation, and then shows how the combinatory rules of CCG, including type-raising and composition, can be applied directly to tensor-based semantic representations. As well as describing a tensor-based semantics for CCG, a further goal of this paper is to present the compositional framework of Coecke et al. (2010), which is based on category theory, to a computational linguistics audience using only the mathematics of multi-linear algebra.

1.1 Distributional Semantics

We assume a basic knowledge of distributional semantics (Grefenstette, 1994; Schütze, 1998). Recent introductions to the topic include Turney and Pantel (2010) and Clark (2014).

A potentially useful distinction for this paper, and one not commonly made, is between distributional and distributed representations. Distributional representations are inherently contextual, and rely on the frequently quoted dictum from Firth that "you shall know a word by the company it keeps" (Firth, 1957; Pulman, 2013). This leads to the so-called distributional hypothesis that words that occur in similar contexts tend to have similar meanings, and to various proposals for how to implement this hypothesis (Curran, 2004), including alternative definitions of context; alternative weighting schemes which emphasize the importance of some contexts over others; alternative similarity measures; and various dimensionality reduction schemes such as the well-known LSA technique (Landauer and Dumais, 1997). An interesting conceptual question is whether a similar distributional hypothesis can be applied to phrases and larger units: is it the case that sentences, for example, have similar meanings if they occur in similar contexts? Work which does extend the distributional hypothesis to larger units includes Baroni and Zamparelli (2010), Clarke (2012), and Baroni et al. (2013).

Distributed representations, on the other hand, can be thought of simply as vectors (or possibly higher-order tensors) of real numbers, where there is no a priori interpretation of the basis vectors. Neural networks can perhaps be categorised in this way, since the resulting vector representations are simply sequences of real numbers resulting from the optimisation of some training criterion on a training set (Collobert and Weston, 2008; Socher et al., 2010). Whether these distributed representations can be given a contextual interpretation depends on how they are trained.

One important point for this paper is that the tensor-based compositional process makes no assumptions about the interpretation of the tensors. Hence in the remainder of the paper we make no reference to how noun vectors or verb tensors, for example, can be acquired (which, for the case of the higher-order tensors, is a wide open research question). However, in order to help the reader who would prefer a more grounded discussion, one possibility is to obtain the noun vectors using standard distributional techniques (Curran, 2004), and learn the higher-order tensors using recent techniques from "recursive" neural networks (Socher et al., 2010). Another possibility is suggested by Grefenstette et al. (2013), extending the learning technique based on linear regression from Baroni and Zamparelli (2010) in which "gold-standard" distributional representations are assumed to be available for some phrases and larger units.

2 Mathematical Preliminaries

The tensor-based compositional process relies on taking dot (or inner) products between vectors and higher-order tensors. Dot products, and a number of other operations on vectors and tensors, can be conveniently written using Einstein notation (also referred to as the Einstein summation convention). In the rest of the paper we assume that the vector spaces are over the field of real numbers.

2.1 Einstein Notation

The squared amplitude of a vector $v \in \mathbb{R}^n$ is given by:

\[ |v|^2 = \sum_{i=1}^{n} v_i v_i \]

Similarly, the dot product of two vectors $v, w \in \mathbb{R}^n$ is given by:

\[ v \cdot w = \sum_{i=1}^{n} v_i w_i \]

Denote the components of an $m \times n$ real matrix $A$ by $A_{ij}$ for $1 \le i \le m$ and $1 \le j \le n$. Then the matrix-vector product of $A$ and $v \in \mathbb{R}^n$ gives a vector $Av \in \mathbb{R}^m$ with components:

\[ (Av)_i = \sum_{j=1}^{n} A_{ij} v_j \]

We can also multiply an $n \times m$ matrix $A$ and an $m \times o$ matrix $B$ to produce an $n \times o$ matrix $AB$ with components:

\[ (AB)_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj} \]

The previous examples are some of the most common operations in linear algebra, and they all involve sums over repeated indices. They can be simplified by introducing the Einstein summation convention: summation over the relevant range is implied on every component index that occurs
twice. Pairs of indices that are summed over are known as contracted, while the remaining indices are known as free. Using this convention, the above operations can be written as:

• $|v|^2 = v_i v_i$
• $v \cdot w = v_i w_i$
• $(Av)_i = A_{ij} v_j$, i.e. the contraction of $v$ with the second index of $A$
• $(AB)_{ij} = A_{ik} B_{kj}$, i.e. the contraction of the second index of $A$ with the first of $B$

Note how the number of free indices is always conserved between the left- and right-hand sides in these examples. For instance, while the last equation has two indices on the left and four on the right, the two extra indices on the right are contracted. Hence counting the number of free indices can be a quick way of determining what type of object is given by a certain mathematical expression in Einstein notation: no free indices means that an operation yields a scalar number, one free index means a vector, two a matrix, and so on.

2.2 Tensors

Linear Functionals. Given a finite-dimensional vector space $\mathbb{R}^n$ over $\mathbb{R}$, a linear functional is a linear map $a : \mathbb{R}^n \to \mathbb{R}$.

Let a vector $v$ have components $v_i$ in a fixed basis. Then the result of applying a linear functional $a$ to $v$ can be written as:

\[ a(v) = a_1 v_1 + \cdots + a_n v_n = \begin{pmatrix} a_1 & \cdots & a_n \end{pmatrix} \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} \]

The numbers $a_i$ are the components of the linear functional, which can also be pictured as a row vector. Since there is a one-to-one correspondence between row and column vectors, the above equation is equivalent to:

\[ v(a) = a_1 v_1 + \cdots + a_n v_n = \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix} \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \]

Using Einstein convention, the equations above can be written as:

\[ a(v) = v_i a_i = v(a) \]

Thus every finite-dimensional vector is a linear functional, and vice versa. Row and column vectors are examples of first-order tensors.

Definition 1 (First-order tensor). Given a vector space $V$ over the field $\mathbb{R}$, a first-order tensor $T$ can be defined as:

• an element of the vector space $V$,
• a linear map $T : V \to \mathbb{R}$,
• a $|V|$-dimensional array of numbers $T_i$, for $1 \le i \le |V|$.

These three definitions are all equivalent. Given a first-order tensor described using one of these definitions, it is trivial to find the two other descriptions.

Matrices. An $n \times m$ matrix $A$ over $\mathbb{R}$ can be represented by a two-dimensional array of real numbers $A_{ij}$, for $1 \le i \le n$ and $1 \le j \le m$.

Via matrix-vector multiplication, the matrix $A$ can be seen as a linear map $A : \mathbb{R}^m \to \mathbb{R}^n$. It maps a vector $v \in \mathbb{R}^m$ to a vector

\[ \begin{pmatrix} A_{11} & \cdots & A_{1m} \\ \vdots & \ddots & \vdots \\ A_{n1} & \cdots & A_{nm} \end{pmatrix} \begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix}, \]

with components

\[ A(v)_i = A_{ij} v_j. \]

We can also contract a vector with the first index of the matrix, which gives us a map $A : \mathbb{R}^n \to \mathbb{R}^m$. This corresponds to the operation

\[ \begin{pmatrix} w_1 & \cdots & w_n \end{pmatrix} \begin{pmatrix} A_{11} & \cdots & A_{1m} \\ \vdots & \ddots & \vdots \\ A_{n1} & \cdots & A_{nm} \end{pmatrix}, \]

resulting in a vector with components

\[ (w^T A)_i = A_{ji} w_j. \]

We can combine the two operations and see a matrix as a map $A : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$, defined by:

\[ w^T A v = \begin{pmatrix} w_1 & \cdots & w_n \end{pmatrix} \begin{pmatrix} A_{11} & \cdots & A_{1m} \\ \vdots & \ddots & \vdots \\ A_{n1} & \cdots & A_{nm} \end{pmatrix} \begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix} \]

In Einstein notation, this operation can be written as

\[ w_i A_{ij} v_j, \]
which yields a scalar (constant) value, consistent with the fact that all the indices are contracted.

Finally, matrices can also be characterised in terms of Kronecker products. Given two vectors $v \in \mathbb{R}^n$ and $w \in \mathbb{R}^m$, their Kronecker product $v \otimes w$ is a matrix

\[ v \otimes w = \begin{pmatrix} v_1 w_1 & \cdots & v_1 w_m \\ \vdots & \ddots & \vdots \\ v_n w_1 & \cdots & v_n w_m \end{pmatrix}, \]

with components

\[ (v \otimes w)_{ij} = v_i w_j. \]

It is a general result in linear algebra that any $n \times m$ matrix can be written as a finite sum of Kronecker products $\sum_k x^{(k)} \otimes y^{(k)}$ of a set of vectors $x^{(k)}$ and $y^{(k)}$. Note that the sum over $k$ is written explicitly as it would not be implied by Einstein notation: this is because the index $k$ does not range over vector/matrix/tensor components, but over a set of vectors, and hence that index appears in brackets.

An $n \times m$ matrix is an element of the tensor space $\mathbb{R}^n \otimes \mathbb{R}^m$, and it can also be seen as a linear map $A : \mathbb{R}^n \otimes \mathbb{R}^m \to \mathbb{R}$. This is because, given a matrix $B$ with decomposition $\sum_k x^{(k)} \otimes y^{(k)}$, the matrix $A$ can act as follows:

\[ A(B) = A_{ij} \sum_k x_i^{(k)} y_j^{(k)} = \sum_k \begin{pmatrix} x_1^{(k)} & \cdots & x_n^{(k)} \end{pmatrix} \begin{pmatrix} A_{11} & \cdots & A_{1m} \\ \vdots & \ddots & \vdots \\ A_{n1} & \cdots & A_{nm} \end{pmatrix} \begin{pmatrix} y_1^{(k)} \\ \vdots \\ y_m^{(k)} \end{pmatrix} = A_{ij} B_{ij}. \]

Again, counting the number of free indices in the last expression tells us that this operation yields a scalar.

Matrices are examples of second-order tensors.

Definition 2 (Second-order tensor). Given vector spaces $V, W$ over the field $\mathbb{R}$, a second-order tensor $T$ can be defined as:

• an element of the vector space $V \otimes W$,
• a $|V| \times |W|$-dimensional array of numbers $T_{ij}$, for $1 \le i \le |V|$ and $1 \le j \le |W|$,
• a (multi-)linear map:
  – $T : V \to W$,
  – $T : W \to V$,
  – $T : V \times W \to \mathbb{R}$ or $T : V \otimes W \to \mathbb{R}$.

Again, these definitions are all equivalent. Most importantly, the four types of maps given in the definition are isomorphic. Therefore specifying one map is enough to specify all the others.

Tensors. We can generalise these definitions to the more general concept of tensor.

Definition 3 (Tensor). Given vector spaces $V_1, \ldots, V_k$ over the field $\mathbb{R}$, a $k$th-order tensor $T$ is defined as:

• an element of the vector space $V_1 \otimes \cdots \otimes V_k$,
• a $|V_1| \times \cdots \times |V_k|$, $k$-dimensional array of numbers $T_{i_1 \cdots i_k}$, for $1 \le i_j \le |V_j|$,
• a multi-linear map $T : V_1 \times \cdots \times V_k \to \mathbb{R}$.

3 Tensor-Based CCG Semantics

In this section we show how CCG's syntactic types can be given tensor-based meaning spaces, and how the combinators employed by CCG to combine syntactic categories carry over to those meaning spaces, maintaining what is often described as CCG's "transparent interface" between syntax and semantics. Here are some example syntactic types, and the corresponding tensor spaces containing the meanings of the words with those types (using the notation syntactic type : semantic type).

We first assume that all atomic types have meanings living in distinct vector spaces:

• noun phrases, NP : $N$
• sentences, S : $S$

The recipe for determining the meaning space of a complex syntactic type is to replace each atomic type with its corresponding vector space and the slashes with tensor product operators:

• Intransitive verb, S\NP : $S \otimes N$
• Transitive verb, (S\NP)/NP : $S \otimes N \otimes N$
• Ditransitive verb, ((S\NP)/NP)/NP : $S \otimes N \otimes N \otimes N$
• Adverbial modifier, (S\NP)\(S\NP) : $S \otimes N \otimes S \otimes N$
• Preposition modifying NP, (NP\NP)/NP : $N \otimes N \otimes N$
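The recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the framework itself: the helper names `meaning_space` and `tensor_shape` are hypothetical, categories are written as plain strings, and the dimensions chosen for $N$ and $S$ are arbitrary. Because the recipe only replaces atoms and slashes, reading the atomic types from left to right is enough.

```python
import re

# Atomic syntactic types and their meaning spaces (NP : N, S : S).
ATOMIC_SPACE = {"NP": "N", "S": "S"}


def meaning_space(category):
    """Return the factors of the tensor product space for a CCG category.

    Each atomic type is replaced by its vector space; slashes and brackets
    only separate the factors, so they are skipped.
    """
    atoms = re.findall(r"NP|S", category)
    return [ATOMIC_SPACE[atom] for atom in atoms]


def tensor_shape(category, dims):
    """Shape of the array holding a meaning, given hypothetical dimensions."""
    return tuple(dims[space] for space in meaning_space(category))


if __name__ == "__main__":
    dims = {"N": 100, "S": 20}  # illustrative sizes only
    examples = [
        r"S\NP",              # intransitive verb
        r"(S\NP)/NP",         # transitive verb
        r"((S\NP)/NP)/NP",    # ditransitive verb
        r"(S\NP)\(S\NP)",     # adverbial modifier
        r"(NP\NP)/NP",        # preposition modifying NP
    ]
    for cat in examples:
        print(cat, ":", " x ".join(meaning_space(cat)), tensor_shape(cat, dims))
```

The order of a word's tensor is simply the number of atomic types in its category, which is why the number of parameters grows quickly for complex categories.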
Hence the meaning of an intransitive verb, for example, is a matrix in the tensor product space $S \otimes N$. The meaning of a transitive verb is a "cuboid", or 3rd-order tensor, in the tensor product space $S \otimes N \otimes N$. In the same way that the syntactic type of an intransitive verb can be thought of as a function, taking an NP and returning an S, the meaning of an intransitive verb is also a function (linear map), taking a vector in $N$ and returning a vector in $S$. Another way to think of this function is that each element of the matrix specifies, for a pair of basis vectors (one from $N$ and one from $S$), what the result is on the $S$ basis vector given a value on the $N$ basis vector.

Now we describe how the combinatory rules carry over to the meaning spaces.

3.1 Application

The function application rules of CCG are forward (>) and backward (<) application:

    X/Y  Y  ⇒  X    (>)
    Y  X\Y  ⇒  X    (<)

In a traditional semantics for CCG, if function application is applied in the syntax, then function application applies also in the semantics (Steedman, 2000). This is also true of the tensor-based semantics. For example, the meaning of a subject NP combines with the meaning of an intransitive verb via matrix multiplication, which is equivalent to applying the linear map corresponding to the matrix to the vector representing the meaning of the NP. Applying (multi-)linear maps in (multi-)linear algebra is equivalent to applying tensor contraction to the combining tensors. Here is the case for an intransitive verb:

    Pat    walks
    NP     S\NP
    $N$    $S \otimes N$

Let the meaning of Pat be given by a vector $P \in N$, and let walks be assigned a second-order tensor $W \in S \otimes N$. Using the backward application combinator corresponds to feeding $P$, an element of $N$, into $W$, seen as a function $N \to S$. In terms of tensor contraction, this is the following operation:

\[ W_{ij} P_j. \]

Here we use the convention that the indices maintain the same order as the syntactic type. Therefore, in the tensor of an object of type X/Y, the first index corresponds to the type X and the second to the type Y. That is why, when performing the contraction corresponding to Pat walks, $P \in N$ is contracted with the second index of $W \in S \otimes N$, and not the first.1 The first index of $W$ is then the only free index, telling us that the above operation yields a first-order tensor (vector). Since this index corresponds to $S$, we know that applying backward application to Pat walks yields a meaning vector in $S$.

1 The particular order of the indices is not important, as long as a convention such as this one is decided upon and consistently applied to all types (so that tensor contraction contracts the relevant tensors from each side when a combinator is used).

Forward application is performed in the same manner. Consider the following example:

    Pat    kisses                    Sandy
    NP     (S\NP)/NP                 NP
    $N$    $S \otimes N \otimes N$   $N$

with corresponding tensors $P \in N$ for Pat, $K \in S \otimes N \otimes N$ for kisses and $Y \in N$ for Sandy.

The forward application deriving the type of kisses Sandy corresponds to

\[ K_{ijk} Y_k, \]

where $Y$ is contracted with the third index of $K$ because we have maintained the order defined by the type (S\NP)/NP: the third index then corresponds to an argument NP coming from the right.

Counting the number of free indices in the above expression tells us that it yields a second-order tensor. Looking at the types corresponding to the free indices tells us that this second-order tensor is of type $S \otimes N$, which is the semantic type of a verb phrase (or intransitive verb), as we have already seen in the walks example.

3.2 Composition

The forward (>B) and backward (<B) composition rules are:

    X/Y  Y/Z  ⇒  X/Z    (>B)
    Y\Z  X\Y  ⇒  X\Z    (<B)

In the semantics, composition likewise corresponds to tensor contraction. Consider the following example, in which might can combine with kiss using forward composition:

    Pat    might                               kiss                      Sandy
    NP     (S\NP)/(S\NP)                       (S\NP)/NP                 NP
    $N$    $S \otimes N \otimes S \otimes N$   $S \otimes N \otimes N$   $N$
with tensors $M \in S \otimes N \otimes S \otimes N$ for might and $K \in S \otimes N \otimes N$ for kiss. Combining the meanings of might and kiss corresponds to the following operation:

\[ M_{ijkl} K_{klm}, \]

yielding a tensor in $S \otimes N \otimes N$, which is the correct semantic type for a phrase with syntactic type (S\NP)/NP. Backward composition is performed analogously.

3.3 Backward-Crossed Composition

English also requires the use of backward-crossed composition (Steedman, 2000):

    X/Y  Z\X  ⇒  Z/Y    (<B×)

[…]

    (S\NP)/NP    (S\NP)\(S\NP)

[…] (corresponding to the indices $i$, $j$ and $m$). Note that we have reversed the order of tensors in the contraction to make the matching of the indices more transparent; however, tensor contraction is commutative (since it corresponds to a sum over products) so the order of the tensors does not affect the result.

3.4 Type-raising

The forward (>T) and backward (<T) type-raising rules are:

    X  ⇒  T/(T\X)    (>T)
    X  ⇒  T\(T/X)    (<T)

[…]

• Y  X\Y  ⇒  X, by backward application;
• Y  X\Y  ⇒T  X/(X\Y)  X\Y, by forward type-raising, and X/(X\Y)  X\Y  ⇒  X, by forward application.

Both ways of parsing this sentence yield an item of type X, and crucially the meaning of the resulting item should be the same in both cases. This property of type-raising provides an avenue into determining what the tensor representation for the type-raised category should be, since the tensor representations must also be the same:

\[ A_j B_{ij} = A'_{ijk} B_{jk}. \]

[…] Kronecker delta ($\delta_{ij} = 1$ if $i = j$ and 0 otherwise):

\[ A_k B_{jk} \delta_{ij} = A'_{ijk} B_{jk}. \]

Since the equation holds for all $B$, we are left with

\[ A'_{ijk} = \delta_{ij} A_k, \]

which gives us a recipe for performing type-raising […]
a cuboid in which the noun vector is repeated a number of times (once for each sentence index), resulting in a series of "steps" progressing diagonally from the bottom of the cuboid to the top (assuming a particular orientation).

The discussion so far has been somewhat abstract, so to finish this section we include some more examples with CCG categories, and show that the tensor contraction operation has an intuitive similarity with the "cancellation law" of categorial grammar which applies in the syntax.

First consider the example of a subject NP with meaning $A$, combining with a verb phrase S\NP with meaning $B$, resulting in a sentence with meaning $C$. In the syntax, the two NPs cancel. In the semantics, for each basis vector of the sentence space $S$ we perform an inner product between two vectors in $N$:

\[ C_i = A_j B_{ij} \]

Hence, inner products in the tensor space correspond to cancellation in the syntax.

This correspondence extends to complex arguments, and also to composition. Consider the subject type-raising case, in which a subject NP with meaning $A$ in $S \otimes S \otimes N$ combines with a verb phrase S\NP with meaning $B$, resulting in a sentence with meaning $C$. Again we perform inner product operations, but this time the inner product is between two matrices:3

\[ C_i = A_{ijk} B_{jk} \]

Note that two matrices are "cancelled" for each basis vector of the sentence space (i.e. for each index $i$ in $C_i$).

3 To be more precise, the two matrices can be thought of as vectors in the tensor space $S \otimes N$ and the inner product is between these vectors. Another way to think of this operation is to "linearize" the two matrices into vectors and then perform the inner product on these vectors.

As a final example, consider the forward composition from earlier, in which a modal verb with meaning $A$ in $S \otimes N \otimes S \otimes N$ combines with a transitive verb with meaning $B$ in $S \otimes N \otimes N$ to give a transitive verb with meaning $C$ in $S \otimes N \otimes N$. Again the cancellation in the syntax corresponds to inner products between matrices, but this time we need an inner product for each combination of 3 indices:

\[ C_{ijk} = A_{ijlm} B_{lmk} \]

For each $i, j, k$, two matrices (corresponding to the $l, m$ indices above) are "cancelled".

This intuitive explanation extends to arguments with any number of slashes. For example, a composition where the cancelling categories are (N/N)/(N/N) would require inner products between 4th-order tensors in $N \otimes N \otimes N \otimes N$.

4 Related Work

The tensor-based semantics presented in this paper is effectively an extension of the Coecke et al. (2010) framework to CCG, re-expressing in Einstein notation the existing categorical CCG extension in Grefenstette (2013), which itself builds on an earlier Lambek Grammar extension to the framework by Coecke et al. (2013).

This work also bears some similarity to the treatment of categorial grammars presented by Baroni et al. (2013), which it effectively encompasses by expressing the tensor contractions described by Baroni et al. as Einstein summations. However, this paper also covers CCG-specific operations not discussed by Baroni et al., such as type-raising and composition.

One difference between this paper and the original work by Coecke et al. (2010) is that they use pregroups as the syntactic formalism (Lambek, 2008), a context-free variant of categorial grammar. In pregroups, cancellation in the syntax is always between two atomic categories (or more precisely, between an atomic category and its "adjoint"), whereas in CCG the arguments in complex categories can be complex categories themselves. To what extent this difference is significant remains to be seen. For example, one area where this may have an impact is when non-linearities are added after contractions. Since the CCG contractions with complex arguments happen "in one go", whereas the corresponding pregroup cancellation in the semantics would be a series of contractions, many more non-linearities would be added in the pregroup case.

Krishnamurthy and Mitchell (2013) is based on a similar insight to this paper: that CCG provides combinators which can manipulate functions operating over vectors. Krishnamurthy and Mitchell consider the function application case, whereas we have shown how the type-raising and composition operators apply naturally in this setting also.
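The contractions of Section 3 can be checked numerically with a small sketch. Everything below is illustrative rather than part of the paper: the `einsum` helper is a hypothetical, deliberately naive pure-Python evaluator of Einstein-notation expressions over nested lists, the toy tensors hold arbitrary integers, and the dimensions of $S$ and $N$ are made up. The sketch verifies that type-raising a subject with $A'_{ijk} = \delta_{ij} A_k$ gives the same sentence vector as backward application, and that forward composition agrees with repeated application.

```python
from itertools import product


def shape(t):
    """Dimensions of a nested-list tensor."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return dims


def einsum(spec, *tensors):
    """Evaluate e.g. 'ij,j->i'; indices absent from the output are summed over."""
    lhs, out = spec.split("->")
    inputs = lhs.split(",")
    size = {}
    for labels, t in zip(inputs, tensors):
        size.update(zip(labels, shape(t)))
    summed = sorted(set(lhs.replace(",", "")) - set(out))

    def entry(out_idx):
        env = dict(zip(out, out_idx))
        total = 0
        for idx in product(*(range(size[l]) for l in summed)):
            env.update(zip(summed, idx))
            term = 1
            for labels, t in zip(inputs, tensors):
                for l in labels:
                    t = t[env[l]]  # walk down to the selected component
                term *= t
            total += term
        return total

    def build(prefix):
        if len(prefix) == len(out):
            return entry(prefix)
        return [build(prefix + (i,)) for i in range(size[out[len(prefix)]])]

    return build(())


def toy(*dims):
    """Nested list of the given shape filled with 1, 2, 3, ... (arbitrary values)."""
    counter = iter(range(1, 10**6))

    def fill(d):
        return next(counter) if not d else [fill(d[1:]) for _ in range(d[0])]

    return fill(dims)


S_DIM, N_DIM = 2, 3                # hypothetical sizes of the spaces S and N
P, Y = [1, 0, 2], [0, 1, 1]        # toy noun vectors for Pat and Sandy
W = toy(S_DIM, N_DIM)              # walks, in S x N
K = toy(S_DIM, N_DIM, N_DIM)       # kiss, in S x N x N
M = toy(S_DIM, N_DIM, S_DIM, N_DIM)  # might, in S x N x S x N

# Backward application: W_ij P_j.
walks = einsum("ij,j->i", W, P)

# Type-raising: A'_ijk = delta_ij P_k, then forward application A'_ijk W_jk.
raised = [[[int(i == j) * P[k] for k in range(N_DIM)]
           for j in range(S_DIM)] for i in range(S_DIM)]
raised_result = einsum("ijk,jk->i", raised, W)

# Forward composition M_ijkl K_klm, then application to Sandy and Pat.
might_kiss = einsum("ijkl,klm->ijm", M, K)
via_composition = einsum("ijm,m,j->i", might_kiss, Y, P)

# The same sentence using application only.
kiss_sandy = einsum("klm,m->kl", K, Y)
via_application = einsum("ij,j->i", einsum("ijkl,kl->ij", M, kiss_sandy), P)

print(walks, raised_result)
print(via_composition, via_application)
```

Integer entries are used so that the two derivations can be compared exactly; with real-valued tensors one would compare up to floating-point tolerance, and a library routine such as NumPy's `einsum` would be the practical choice.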
5 Conclusion

This paper provides a theoretical framework for the development of a compositional distributional semantics for CCG. Given the existence of robust, wide-coverage CCG parsers (Clark and Curran, 2007; Hockenmaier and Steedman, 2002), together with various techniques for learning the tensors, the opportunity exists for a practical implementation. However, there are significant engineering difficulties which need to be overcome.

Consider adapting the neural-network learning techniques of Socher et al. (2012) to this problem.4 In terms of the number of tensors, the lexicon would need to contain a tensor for every word-category pair; this is at least an order of magnitude more tensors than the number of matrices learnt in existing work (Socher et al., 2012; Hermann and Blunsom, 2013). Furthermore, the order of the tensors is now higher. Syntactic categories such as ((N/N)/(N/N))/((N/N)/(N/N)) are not uncommon in the wide-coverage grammar of Hockenmaier and Steedman (2007), which in this case would require an 8th-order tensor. This combination of many word-category pairs and higher-order tensors results in a huge number of parameters.

4 Non-linear transformations are inherent to neural networks, whereas the framework in this paper is entirely linear. However, as hinted at earlier in the paper, non-linear transformations can be applied to the output of each tensor, turning the linear networks in this paper into extensions of those in Socher et al. (2012) (extensions in the sense that the tensors in Socher et al. (2012) do not extend beyond matrices).

As a solution to this problem, we are investigating ways to reduce the number of parameters, for example using tensor decomposition techniques (Kolda and Bader, 2009). It may also be possible to reduce the size of some of the complex categories in the grammar. Many challenges remain before a type-driven compositional distributional semantics can be realised, similar to the work of Bos for the model-theoretic case (Bos et al., 2004; Bos, 2005), but in this paper we have set out the theoretical framework for such an implementation.

Finally, we repeat a comment made earlier that the compositional framework makes no assumptions about the underlying vector spaces, or how they are to be interpreted. On the one hand, this flexibility is welcome, since it means the framework can encompass many techniques for building word vectors (and tensors). On the other hand, it means that a description of the framework is necessarily abstract, and it leaves open the question of what the meaning spaces represent. The latter question is particularly pressing in the case of the sentence space, and providing an interpretation of such spaces remains a challenge for the distributional semantics community, as well as relating distributional semantics to more traditional topics in semantics such as quantification and inference.

Acknowledgments

Jean Maillard is supported by an EPSRC MPhil studentship. Stephen Clark is supported by ERC Starting Grant DisCoTex (306920) and EPSRC grant EP/I037512/1. Edward Grefenstette is supported by EPSRC grant EP/I037512/1. We would like to thank Tamara Polajnar, Laura Rimell, Nal Kalchbrenner and Karl Moritz Hermann for useful discussion.

References

M. Baroni and R. Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Conference on Empirical Methods in Natural Language Processing (EMNLP-10), Cambridge, MA.

M. Baroni, R. Bernardi, and R. Zamparelli. 2013. Frege in space: A program for compositional distributional semantics (to appear). Linguistic Issues in Language Technologies.

Johan Bos, Stephen Clark, Mark Steedman, James R. Curran, and Julia Hockenmaier. 2004. Wide-coverage semantic representations from a CCG parser. In Proceedings of COLING-04, pages 1240–1246, Geneva, Switzerland.

Johan Bos. 2005. Towards wide-coverage semantic interpretation. In Proceedings of the Sixth International Workshop on Computational Semantics (IWCS-6), pages 42–53, Tilburg, The Netherlands.

Stephen Clark and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493–552.

Stephen Clark and Stephen Pulman. 2007. Combining symbolic and distributional models of meaning. In Proceedings of AAAI Spring Symposium on Quantum Interaction, Stanford, CA. AAAI Press.

Stephen Clark. 2014. Vector space models of lexical meaning (to appear). In Shalom Lappin and Chris Fox, editors, Handbook of Contemporary Semantics second edition. Wiley-Blackwell.

Daoud Clarke. 2012. A context-theoretic framework for compositionality in distributional semantics. Computational Linguistics, 38(1):41–71.
B. Coecke, M. Sadrzadeh, and S. Clark. 2010. Mathematical foundations for a compositional distributional model of meaning. In J. van Benthem, M. Moortgat, and W. Buszkowski, editors, Linguistic Analysis (Lambek Festschrift), volume 36, pages 345–384.

Bob Coecke, Edward Grefenstette, and Mehrnoosh Sadrzadeh. 2013. Lambek vs. Lambek: Functorial vector space semantics and string diagrams for Lambek calculus. Annals of Pure and Applied Logic.

R. Collobert and J. Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, ICML, Helsinki, Finland.

James R. Curran. 2004. From Distributional to Semantic Similarity. Ph.D. thesis, University of Edinburgh.

D.R. Dowty, R.E. Wall, and S. Peters. 1981. Introduction to Montague Semantics. Dordrecht.

Katrin Erk. 2012. Vector space models of word meaning and phrase meaning: a survey. Language and Linguistics Compass, 6(10):635–653.

J. R. Firth. 1957. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pages 1–32. Oxford: Philological Society.

Edward Grefenstette, Georgiana Dinu, YaoZhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. Multistep regression learning for compositional distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS-13), Potsdam, Germany.

Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer.

Edward Grefenstette. 2013. Category-Theoretic Quantitative Compositional Distributional Models of Natural Language Semantics. Ph.D. thesis, University of Oxford.

Karl Moritz Hermann and Phil Blunsom. 2013. The role of syntax in vector space models of compositional semantics. Proceedings of ACL, Sofia, Bulgaria, August. Association for Computational Linguistics.

Julia Hockenmaier and Mark Steedman. 2002. Generative models for statistical parsing with Combinatory Categorial Grammar. In Proceedings of the 40th Meeting of the ACL, pages 335–342, Philadelphia, PA.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.

T. G. Kolda and B. W. Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Jayant Krishnamurthy and Tom M. Mitchell. 2013. Vector space semantic parsing: A framework for compositional vector space models. In Proceedings of the 2013 ACL Workshop on Continuous Vector Space Models and their Compositionality, Sofia, Bulgaria.

Joachim Lambek. 2008. From Word to Sentence. A Computational Algebraic Approach to Grammar. Polimetrica.

T. K. Landauer and S. T. Dumais. 1997. A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211–240.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08, pages 236–244, Columbus, OH.

Stephen Pulman. 2013. Distributional semantic models. In Sadrzadeh Heunen and Grefenstette, editors, Compositional Methods in Physics and Linguistics. Oxford University Press.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–124.

Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS Deep Learning and Unsupervised Feature Learning Workshop, Vancouver, Canada.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1201–1211, Jeju, Korea.

Mark Steedman. 2000. The Syntactic Process. The MIT Press, Cambridge, MA.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.