
Linguists Who Use Probabilistic Models Love Them: Quantification in Functional Distributional Semantics

Guy Emerson
Department of Computer Science and Technology
University of Cambridge
[email protected]

Abstract

Functional Distributional Semantics provides a computationally tractable framework for learning truth-conditional semantics from a corpus. Previous work in this framework has provided a probabilistic version of first-order logic, recasting quantification as Bayesian inference. In this paper, I show how the previous formulation gives trivial truth values when a precise quantifier is used with vague predicates. I propose an improved account, avoiding this problem by treating a vague predicate as a distribution over precise predicates. I connect this account to recent work in the Rational Speech Acts framework on modelling generic quantification, and I extend this to modelling donkey sentences. Finally, I explain how the generic quantifier can be both pragmatically complex and yet computationally simpler than precise quantifiers.

1 Introduction

Model-theoretic semantics defines meaning in terms of truth, relative to model structures. In the simplest case, a model structure consists of a set of individuals (also called entities). The meaning of a content word is a predicate, formalised as a truth-conditional function which maps individuals to truth values (either truth or falsehood). Because of this precisely defined notion of truth, model theory naturally supports logic, and has become a prominent approach to formal semantics. For detailed expositions, see: Cann (1993); Allan (2001); Kamp and Reyle (2013).

Mainstream approaches to distributional semantics represent the meaning of a word as a vector (for example: Turney and Pantel, 2010; Mikolov et al., 2013; for an overview, see: Emerson, 2020b).

In contrast, Functional Distributional Semantics represents the meaning of a word as a truth-conditional function (Emerson and Copestake, 2016; Emerson, 2018). It is therefore a promising framework for automatically learning truth-conditional semantics from large datasets. In previous work (Emerson and Copestake, 2017b, §3.5, henceforth E&C), I sketched how this approach can be extended with a probabilistic version of first-order logic, where quantifiers are interpreted in terms of conditional probabilities. I summarise this approach in §2 and §3.

There are four main contributions of this paper. In §4.1, I first point out a problem with my previous approach. Quantifiers like every and some are treated as precise, but predicates are vague. This leads to trivial truth values, with every trivially false, and some trivially true.

Secondly, I show in §4.2–4.4 how this problem can be fixed by treating a vague predicate as a distribution over precise predicates.

Thirdly, in §5 I look at vague quantifiers and generic sentences, which present a challenge for classical (non-probabilistic) theories. I build on Tessler and Goodman (2019)'s account of generics using Rational Speech Acts, a Bayesian approach to pragmatics (Frank and Goodman, 2012). I show how generic quantification is computationally simpler than classical quantification, consistent with evidence that generics are a "default" mode of processing (for example: Leslie, 2008; Gelman et al., 2015).

Finally, I show in §6 how this probabilistic approach can provide an account of donkey sentences, another challenge for classical theories. In particular, I consider generic donkey sentences, which are doubly challenging, and which provide counter-examples to the claim that donkey pronouns are associated with universal quantifiers.

Taking the above together, in this paper I show how a probabilistic first-order logic can be associated with a neural network model for distributional semantics, in a way that sheds light on long-standing problems in formal semantics.
2 Generalised Quantifiers

Partee (2012) recounts how quantifiers have played an important role in the development of model-theoretic semantics, seeing a major breakthrough with Montague (1973)'s work, and culminating in the theory of generalised quantifiers (Barwise and Cooper, 1981; Van Benthem, 1984).

Ultimately, model theory requires quantifiers to give truth values to propositions. An example of a logical proposition is given in Fig. 1, with a quantifier for each logical variable. This also assumes a neo-Davidsonian approach to event semantics (Davidson, 1967; Parsons, 1990).

∀x picture(x) → ∃z ∃y tell(y) ∧ story(z) ∧ ARG1(y, x) ∧ ARG2(y, z)

Figure 1: A first-order logical proposition, representing the most likely reading of Every picture tells a story. Scope ambiguity is not discussed in this paper.

Equivalently, we can represent a logical proposition as a scope tree, as in Fig. 2. The truth of the scope tree can be calculated by working bottom-up through the tree. The leaves of the tree are logical expressions with free variables. They can be assigned truth values if each variable is fixed as an individual in the model structure. To assign a truth value to the whole proposition, we work up through the tree, quantifying the variables one at a time. Once we reach the root, all variables have been quantified, and we are left with a truth value.

Figure 2: A scope tree, equivalent to Fig. 1 above. Each non-terminal node is a quantifier, with its bound variable in brackets. Its left child is its restriction, and its right child its body. The root every(x) has restriction picture(x) and body a(z); a(z) has restriction story(z) and body ∃(y); ∃(y) has a trivially true restriction and body tell(y) ∧ ARG1(y, x) ∧ ARG2(y, z).

Each quantifier is a non-terminal node with two children – its restriction (on the left) and its body (on the right). It quantifies exactly one variable, called its bound variable. Each node also has free variables. For each leaf, its free variables are exactly the variables appearing in the logical expression. For each quantifier, its free variables are the union of the free variables of its restriction and body, minus its own bound variable. For a well-formed scope tree, the root has no free variables. Each node in the tree defines a truth value, given a fixed value for each free variable.

The truth value for a quantifier node is defined based on its restriction and body. Given values for the quantifier's free variables, the restriction and body only depend on the quantifier's bound variable. The restriction and body therefore each define a set of individuals in the model structure – the individuals for which the restriction is true, and the individuals for which the body is true. We can write these as R(v) and B(v), respectively, where v denotes the values of all free variables.

Generalised quantifier theory says that a quantifier's truth value only depends on two quantities: the cardinality of the restriction |R(v)|, and the cardinality of the intersection of the restriction and body |R(v) ∩ B(v)|. Table 1 gives examples.

Table 1: Classical truth conditions for precise quantifiers, in generalised quantifier theory.

  Quantifier   Condition
  some         |R(v) ∩ B(v)| > 0
  every        |R(v) ∩ B(v)| = |R(v)|
  no           |R(v) ∩ B(v)| = 0
  most         |R(v) ∩ B(v)| > ½ |R(v)|
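Table 1 can be read directly as a recipe for evaluating a precise quantifier over a finite model structure. The following minimal sketch (not from the paper; the individuals and predicates are invented for illustration) computes each quantifier's classical truth value from the restriction set R and body set B.

```python
# Classical generalised quantifiers over a finite set of individuals.
# The toy model structure below is invented, not taken from the paper.

def quantifier_truth(quantifier, restriction, body, individuals):
    """Evaluate a quantifier, given restriction and body as truth-conditional functions."""
    r_set = {x for x in individuals if restriction(x)}
    rb_set = {x for x in r_set if body(x)}
    if quantifier == "some":
        return len(rb_set) > 0
    if quantifier == "every":
        return len(rb_set) == len(r_set)
    if quantifier == "no":
        return len(rb_set) == 0
    if quantifier == "most":
        return len(rb_set) > 0.5 * len(r_set)
    raise ValueError(quantifier)

# Toy model structure: three pictures, two of which tell a story.
individuals = {"p1", "p2", "p3"}
is_picture = lambda x: True
tells_a_story = lambda x: x in {"p1", "p2"}

for q in ["some", "every", "no", "most"]:
    print(q, quantifier_truth(q, is_picture, tells_a_story, individuals))
```

Only the two cardinalities |R(v)| and |R(v) ∩ B(v)| are used, which is the point exploited in the probabilistic reformulation in §3.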
3 Generalised Quantifiers in Functional Distributional Semantics

Functional Distributional Semantics defines a probabilistic graphical model for distributional semantics. Importantly (from the point of view of formal semantics), this graphical model incorporates a probabilistic version of model theory.

This is illustrated in Fig. 3. The top row defines a distribution over situations, each situation being an event with two participants.¹ This generalises a model structure comprising a set of situations, as in classical situation semantics (Barwise and Perry, 1983). Each individual is represented by a pixie, a point in a high-dimensional space, which represents the features of the individual. Two individuals could be represented by the same pixie, and the space of pixies can be seen as a conceptual space in the sense of Gärdenfors (2000, 2014).

Figure 3: Probabilistic model theory, as formalised in Functional Distributional Semantics. Each node is a random variable. The plate (box in bottom row) denotes repetition of nodes. Top row: pixie-valued random variables X, Y, Z together represent a situation composed of three individuals. They are jointly distributed according to the semantic roles ARG1 and ARG2. Their joint distribution can be seen as a probabilistic model structure. Bottom row: each predicate r in the vocabulary V has a probabilistic truth-conditional function, which can be applied to each individual. This gives a truth-valued random variable for each individual for each predicate.

¹ For situations with different structures (multiple events or different numbers of participants), we can define a family of such graphical models. Structuring the graphical model in terms of semantic roles makes the simplifying assumption that situation structure is isomorphic to a semantic dependency graph such as DMRS (Copestake et al., 2005; Copestake, 2009). In the general case, the assumption fails. For example, the ARG3 of sell corresponds to the ARG1 of buy.

The bottom row of the graphical model defines a distribution over truth values, so that each predicate has some probability of being true of each individual. Each predicate can therefore be seen as a probabilistic truth-conditional function.

In this paper, I will not discuss learning such a model (for an up-to-date approach, see: Emerson, 2020a). Instead, the focus is on how we can manipulate a trained model, to move from single predicates to complex propositions.

In previous work (E&C), I sketched an account of quantification. The idea is to follow generalised quantifier theory, but with a truth-valued random variable for each node in the scope tree. Similarly to the classical case, the distributions for these nodes are defined bottom-up through the tree.

In the classical theory, we only need to know the cardinalities |R(v)| and |R(v) ∩ B(v)|. In fact, all the conditions in Table 1 can be expressed in terms of the ratio |R(v) ∩ B(v)| / |R(v)|. It therefore makes sense to consider the conditional probability P(b | r, v),² because this uses the same ratio, as shown in (1).

P(b | r, v) = P(r, b | v) / P(r | v)    (1)

² I use uppercase for random variables, lowercase for values. I abbreviate P(X = x) as P(x), and P(T = ⊤) as P(t). For example, P(b | r, v) means P(B = ⊤ | R = ⊤, V = v).

More precisely, B and R are truth-valued random variables for the body and restriction, and V is a tuple-of-pixies-valued random variable, with one pixie for each free variable. Intuitively, the truth of a quantified expression depends on how likely B is to be true, given that R is true.³

³ This would not seem to cover so-called cardinal quantifiers like one and two. Under Link (1983)'s lattice-theoretic approach, a model structure contains plural individuals, so numbers can be treated as normal predicates like adjectives.

Truth conditions for quantifiers can be defined in terms of P(b | r, v), as shown in Table 2. For these precise quantifiers, the truth value is deterministic – if the condition in Table 2 holds, the quantifier's random variable Q has probability 1 of being true, otherwise it has probability 0. However, taking a probabilistic approach means that we can naturally model vague quantifiers like few and many. I did not give further details on this point in E&C, but I will expand on this in §5.

Table 2: Truth conditions for precise quantifiers, in terms of the conditional probability of the body given the restriction (and given all free variables). These conditions mirror Table 1.

  Quantifier   Condition
  some         P(b | r, v) > 0
  every        P(b | r, v) = 1
  no           P(b | r, v) = 0
  most         P(b | r, v) > ½
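To make equation (1) and Table 2 concrete, the sketch below (my own illustration, with an invented two-pixie distribution and invented predicate probabilities) computes P(b | r) by marginalising over the pixie for the bound variable, and then checks the Table 2 conditions.

```python
# Probability that the body is true given that the restriction is true,
# following equation (1): P(b | r) = P(r, b) / P(r).
# The pixie distribution and predicate probabilities below are invented toys.

pixie_prob = {"x1": 0.5, "x2": 0.5}          # distribution over pixies for the bound variable
prob_restriction = {"x1": 0.9, "x2": 0.8}    # P(r | x): probabilistic restriction predicate
prob_body = {"x1": 0.7, "x2": 0.1}           # P(b | x): probabilistic body predicate

# Assuming the restriction and body truth values are conditionally independent
# given the pixie, as in the graphical model of Fig. 3.
p_r_and_b = sum(pixie_prob[x] * prob_restriction[x] * prob_body[x] for x in pixie_prob)
p_r = sum(pixie_prob[x] * prob_restriction[x] for x in pixie_prob)
p_b_given_r = p_r_and_b / p_r

# Deterministic truth conditions for precise quantifiers, mirroring Table 2.
print("P(b | r) =", round(p_b_given_r, 3))
print("some:", p_b_given_r > 0)
print("every:", p_b_given_r == 1)
print("no:", p_b_given_r == 0)
print("most:", p_b_given_r > 0.5)
```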
4 Quantification with Vague Predicates

Truth-conditional functions that give probabilities strictly between 0 and 1 are motivated for both practical and theoretical reasons. Practically, such a function can be implemented as a feedforward neural network with a final sigmoid unit (as used by E&C), whose output is never exactly 0 or 1. Theoretically, using intermediate probabilities of truth allows a natural account of vagueness (Goodman and Lassiter, 2015; Sutton, 2015, 2017).

However, as we will see in the following subsection, intermediate probabilities pose a problem for E&C's account of quantification.

4.1 Trivial Truth Values

Combining the conditions in Table 2 with vague predicates causes a problem, which can be illustrated with a simple example. Consider a model structure containing only a single individual, and consider only the single predicate red, which is true of this individual with probability p. Now consider the sentences (1) and (2).

(1) Everything is red.
(2) Something is red.

The body of each quantifier is simply the predicate red. For simplicity, we can assume that everything and something put no constraints on their restrictions. We need to calculate P(b | r, v). There are no free variables, and R is always true, so this is simply P(b). Because there is only one individual, this is simply the probability p.

This means that (1) is true iff p = 1, and (2) is false iff p = 0. However, we have seen above how predicates will never be true with probability exactly 0 or exactly 1. This means (1) is always false, and (2) is always true, even though we have assumed nothing about the individual!
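The problem can be seen in a couple of lines. Applying the Table 2 conditions directly to a vague predicate (the values of p below are arbitrary), every comes out false and some comes out true for any p strictly between 0 and 1.

```python
# Single individual, single vague predicate "red", true with probability p.
# Applying Table 2 directly to sentences (1) "Everything is red" and (2) "Something is red".
for p in [0.01, 0.5, 0.99]:    # any value strictly between 0 and 1 behaves the same way
    everything_red = (p == 1)  # "every" requires P(b | r) = 1
    something_red = (p > 0)    # "some" requires P(b | r) > 0
    print(p, "everything:", everything_red, "something:", something_red)
```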
4.2 Distributions over Precise Predicates

To avoid the problem in §4.1, we must only combine precise quantifiers with precise predicates (i.e. classical truth-conditional functions). To do this, we can view a vague predicate not as defining a probability of truth for each individual, but as defining a distribution over precise predicates. This induces a distribution for the quantifier.

Consider the example in §4.1. With probability p, red is a precise predicate that is true of the individual. In this case, both (1) and (2) are true. With probability 1−p, red is a precise predicate that is false of the individual. In this case, both (1) and (2) are false. Combining these cases, both (1) and (2) are true with probability p, which has avoided trivial truth values.

Formalising a vague predicate as a distribution over precise predicates was also argued for by Lassiter (2011). It can be seen as an improved version of supervaluationism (Fine, 1975; Kamp, 1975; Keefe, 2000, chapter 7), which avoids the problem of higher-order vagueness, as shown by Lassiter.

4.3 Probabilistic Scope Trees

To generalise the account in §4.2 to arbitrary scope trees (see §4.4) and vague quantifiers (see §5), it is helpful to introduce a graphical notation for probabilistic scope trees, illustrated in Fig. 4. This makes the E&C account easier to visualise. The improved proposal in this paper modifies how the distribution for each truth value node is defined.

Figure 4: A probabilistic scope tree. T1, T2, T3 correspond to non-terminal nodes in Fig. 2, going up through the tree. One random variable is marginalised out at a time, until T3 is no longer dependent on any variables.

For a classical scope tree, the truth of a quantifier node depends on its free variables, and is defined in terms of the extensions of its restriction and body, in a way that removes the bound variable. For a probabilistic scope tree, the distribution for a quantifier node is conditionally dependent on its free variables, and is defined in terms of the distributions for its restriction and body, marginalising out the bound variable. The distributions at the leaves of the tree are defined by predicates, inducing a distribution for each quantifier node as we work up through the tree.

Fig. 4 corresponds to Fig. 2, if we set α, β, γ to be picture, tell, story. The distributions for Tα,X, Tβ,Y, Tγ,Z are determined by the predicates. We have three quantifier nodes in the classical scope tree, and hence three additional truth value nodes in the probabilistic scope tree. We first define a distribution for T1, which represents the ∃(y) quantifier, and which depends on its free variables X and Z. It is true if, for situations involving the fixed pixies x and z, there is nonzero probability that they are the ARG1 and ARG2 of a telling-event pixie. Next, we define a distribution for T2, which represents the a(z) quantifier, and depends on the free variable X. It is true if, for situations involving the fixed pixie x and a story pixie z, there is nonzero probability that T1 is true. Finally, we define a distribution for T3, which represents the every(x) quantifier, and has no free variables. It is true if, for situations involving a picture pixie X, we are certain that T2 is true.
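The §4.2 idea can be sketched in a few lines. The example below (my own illustration of the single-individual case, with an arbitrary value of p) enumerates the two precise predicates that the vague predicate distributes over, and evaluates the quantifiers against each precise predicate separately.

```python
# Section 4.2: treat the vague predicate "red" (true with probability p) as a
# distribution over two precise predicates: one true of the single individual
# (with probability p) and one false of it (with probability 1 - p).
# The quantifiers are evaluated against each precise predicate separately.

p = 0.7  # arbitrary vague probability of truth

precise_predicates = [
    (p, lambda x: True),        # precise predicate on which red holds of the individual
    (1 - p, lambda x: False),   # precise predicate on which red does not hold
]

individuals = ["the_individual"]

prob_everything_red = 0.0
prob_something_red = 0.0
for weight, red in precise_predicates:
    if all(red(x) for x in individuals):
        prob_everything_red += weight
    if any(red(x) for x in individuals):
        prob_something_red += weight

# Both (1) and (2) are now true with probability p, rather than trivially false/true.
print(prob_everything_red, prob_something_red)
```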


Figure 5: The probabilistic scope tree in Fig. 4, explicitly showing random variables over precise functions.

4.4 Probabilistic Scope Trees with Vague Predicates as Distributions

In this section I show how to define the quantifier nodes in §4.3 so that they are nontrivial. To explicitly represent each vague truth-conditional function as a random variable over precise functions, we need to add a function node for each truth value node in the graphical model. For example, this transforms Fig. 4 into Fig. 5.

For a truth value node T that is a leaf of the scope tree (the second row of Fig. 5), the distribution P(t) over truth values follows the description in §4.2. A precise predicate π : X → {⊤, ⊥} maps pixies to truth values. Given π and a pixie x, the distribution for T is deterministic: T = π(x) with probability 1. A distribution Π over precise predicates π defines a vague predicate ψ, by marginalising out this distribution:⁴

P(t | x) = ψ(x) = E_π[π(x)]    (2)

⁴ I write expectations with a subscript to indicate the random variable being marginalised out. To write the expectation in (2) explicitly as a sum: E_π[π(x)] = Σ_π π(x) P(π).

More generally, a truth value node Q is dependent on its free variables V. We can represent this in terms of a precise function π : Xⁿ → {⊤, ⊥}, where n is the number of free variables. Given values v for the free variables, a distribution Π over precise predicates π defines a vague predicate ψ, by marginalising out this distribution:

P(q | v) = ψ(v) = E_π[π(v)]    (3)

What remains to be shown is that the E&C account of quantification (in §3) can be adapted so that a quantifier's distribution Π_Q over precise functions π_Q can be defined in terms of its restriction function π_R and body function π_B. This can be seen as probabilistic semantic composition: the aim is to combine two truth-conditional functions to produce a distribution over truth-conditional functions. This is illustrated by the nodes Π1, Π2, Π3 in Fig. 5, which are conditionally dependent on other function nodes (indicated by the purple edges), forming a probabilistic scope tree.

Expanding (3) so it is dependent on the restriction and body functions, we have (4). The aim is now to re-write the distribution for Q, using an adapted version of E&C, in order to derive π_Q in terms of π_R and π_B. As explained in §3, the E&C account defines Q using the conditional probability P(b | r, v). More precisely, P(q | v) = f_Q(P(b | r, v)) for some f_Q, such as those defined by Table 2. With vague functions now considered as distributions over precise functions, the conditional probability must be amended to P(b | r, v, π_R, π_B), as in (5), given precise functions π_R and π_B for the restriction and body. This can be re-written as a ratio of probabilities (corresponding to the classical sets), summing over possible values for the bound variable(s) U, as in (6). We can factorise out the distribution for U, according to the conditional dependence structure (illustrated in Fig. 5), as in (7). Finally, we can express R and B in terms of the functions π_R and π_B, and write the sum as an expectation, as in (8). Note that π_R and π_B take u ∪ v as an argument – by definition of a scope tree, if we combine a quantifier's bound and free variables, we get the free variables of its restriction and body. I have written u ∪ v rather than {u} ∪ v, to leave open the possibility that the quantifier has more than one bound variable, which will be relevant in §5.

P(q | v, π_R, π_B) = E_{π_Q | π_R, π_B}[π_Q(v)]    (4)
  = f_Q( P(b | r, v, π_R, π_B) )    (5)
  = f_Q( Σ_u P(b, r, u | v, π_R, π_B) / Σ_u P(r, u | v, π_R, π_B) )    (6)
  = f_Q( Σ_u P(u | v) P(r, b | u, v, π_R, π_B) / Σ_u P(u | v) P(r | u, v, π_R) )    (7)
  = f_Q( E_{u|v}[π_R(u ∪ v) π_B(u ∪ v)] / E_{u|v}[π_R(u ∪ v)] )    (8)

(8) gives a probability of truth, hence a vague function. Viewing it as a distribution over precise functions (as in §4.2), we finally have a definition of π_Q in terms of π_R and π_B. Concretely, π_Q returns truth iff (8) is above a threshold. A uniform distribution over thresholds in [0, 1] gives a distribution over such functions.

Abbreviating the notation, we can write (9). A quantifier's truth-conditional function depends on the restriction and body functions, marginalising out the bound variable. The ratio of expectations mirrors the classical ratio of cardinalities.

π_Q ∼ f_Q( E_u[π_R π_B] / E_u[π_R] )    (9)

We can now recursively define functions for quantifier nodes, given functions in the leaves. We can therefore see Fig. 4 as an abbreviated notation for Fig. 5. The dotted edges do not indicate conditional dependence of truth values, but conditional dependence of truth-conditional functions.
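The sketch below illustrates equations (8)–(9) for the case with no free variables: given precise restriction and body functions, the quantifier's probability of truth is f_Q applied to a ratio of expectations over the bound variable, and sampling a uniform threshold turns that probability back into a precise truth value. The pixie distribution and predicates are invented toys, and the Monte Carlo estimate is my own simplification, not the paper's implementation.

```python
import random

# Sketch of equations (8)-(9): the probability of truth for a quantifier node, given
# precise restriction and body functions, is f_Q applied to a ratio of expectations
# over the bound variable.  All distributions and predicates below are invented.

def quantifier_prob(f_Q, pi_R, pi_B, sample_u, n_samples=10000):
    """Monte Carlo estimate of f_Q( E_u[pi_R * pi_B] / E_u[pi_R] )."""
    num = 0.0
    den = 0.0
    for _ in range(n_samples):
        u = sample_u()   # sample the bound variable from P(u | v)
        r = pi_R(u)      # precise restriction: 0 or 1
        num += r * pi_B(u)
        den += r
    return f_Q(num / den) if den > 0 else f_Q(0.0)

def sample_precise_quantifier(prob_of_truth):
    """Turn a probability of truth into a precise truth value by sampling a uniform
    threshold in [0, 1], as described after equation (8)."""
    return prob_of_truth > random.random()

# f_Q for some precise quantifiers (step functions mirroring Table 2).
f_some = lambda ratio: 1.0 if ratio > 0.0 else 0.0
f_every = lambda ratio: 1.0 if ratio == 1.0 else 0.0
f_most = lambda ratio: 1.0 if ratio > 0.5 else 0.0

# Toy example: pixies are points in [0, 1]; the restriction holds below 0.8,
# and the body holds below 0.6, so about three quarters of the restriction satisfy the body.
sample_u = lambda: random.random()
pi_R = lambda u: 1 if u < 0.8 else 0
pi_B = lambda u: 1 if u < 0.6 else 0

for name, f_Q in [("some", f_some), ("every", f_every), ("most", f_most)]:
    print(name, quantifier_prob(f_Q, pi_R, pi_B, sample_u))
print("sampled precise 'most':", sample_precise_quantifier(quantifier_prob(f_most, pi_R, pi_B, sample_u)))
```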
5 Vague Quantifiers and Generics

While some, every, no, and most can be given precise truth conditions, other natural language quantifiers are vague. In particular, we can consider the terms few and many.⁵

⁵ Partee (1988) surveys work suggesting that few and many are ambiguous between a vague cardinal reading and a vague proportional reading. As mentioned in Footnote 3, we can treat cardinals as predicates rather than quantifiers.

Under a classical account (for example: Barwise and Cooper, 1981), many means that R(v) ∩ B(v) is large compared to R(v), but how large is underspecified; similarly, few means this ratio is small. The underspecification of a proportion can naturally be represented as a distribution. So, we can define the meaning of a vague generalised quantifier to be a function from P(b | r, v) to a probability of truth, as illustrated in Fig. 6.

Figure 6: Probabilities of truth for various quantifiers (some, every, no, and most in the top row; many and few in the bottom row). Each x-axis is P(b | r, v, π_R, π_B), and each y-axis is P(q | v, π_R, π_B), plotting the function f_Q in orange. All axes range from 0 to 1. Quantifiers in the bottom row are vague, requiring intermediate probabilities.

A particularly challenging case of natural language quantification involves generic sentences, such as: dogs bark, ducks lay eggs, and mosquitoes carry malaria. Generics are ubiquitous in natural language, but they are challenging for classical models, because the truth conditions seem to depend heavily on world knowledge and on the context of use (for discussion, see: Carlson, 1977; Carlson and Pelletier, 1995; Leslie, 2008).

While it is tempting to treat generic quantification as underspecification of a precise quantifier (for example: Herbelot, 2010; Herbelot and Copestake, 2011), this is at odds with evidence that generics are easier for children to acquire than precise quantifiers (Hollander et al., 2002; Leslie, 2008; Gelman et al., 2015), and also easier for adults to process (Khemlani et al., 2007).

In contrast, Tessler and Goodman (2019) analyse generic sentences as being semantically simple, with the complexity coming down to pragmatic inference. They use Rational Speech Acts (RSA), a Bayesian approach to pragmatics (Frank and Goodman, 2012; Goodman and Frank, 2016). In this framework, literal truth is separated from pragmatic meaning. Communication is viewed as a game where a listener has a prior belief about a situation, and a speaker wants to update the listener's belief. Given a truth-conditional function, a literal listener updates their belief by conditioning on truth, ruling out situations for which the function returns false. A pragmatic speaker who observes a situation can choose an utterance which is informative for a literal listener – in particular, the utterance which maximises a literal listener's posterior probability for the observed situation. A pragmatic listener can update their belief by conditioning on a pragmatic speaker's utterance.
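A minimal RSA chain can be written in a few lines. The discrete situations, prior, and literal meanings below are all invented for illustration, and the speaker here chooses utterances in proportion to informativeness (a soft version of the maximisation described above), so this is a sketch of the framework rather than the specific model of Tessler and Goodman (2019).

```python
# A minimal Rational Speech Acts sketch (Frank and Goodman, 2012), with an invented
# discrete prior and invented literal meanings; not the paper's own model.

situations = ["low", "medium", "high"]   # e.g. proportion of mosquitoes carrying malaria
prior = {"low": 0.90, "medium": 0.09, "high": 0.01}
utterances = ["generic", "silence"]

# Literal meaning: probability that the utterance is true of the situation.
literal_truth = {
    ("generic", "low"): 0.1, ("generic", "medium"): 0.7, ("generic", "high"): 1.0,
    ("silence", "low"): 1.0, ("silence", "medium"): 1.0, ("silence", "high"): 1.0,
}

def normalise(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

def literal_listener(utterance):
    # Condition the prior on the utterance being true.
    return normalise({s: prior[s] * literal_truth[(utterance, s)] for s in situations})

def pragmatic_speaker(situation):
    # Choose utterances in proportion to how strongly they favour the observed situation.
    return normalise({u: literal_listener(u)[situation] for u in utterances})

def pragmatic_listener(utterance):
    # Condition on the pragmatic speaker having chosen this utterance.
    return normalise({s: prior[s] * pragmatic_speaker(s)[utterance] for s in situations})

print(pragmatic_listener("generic"))  # belief shifts away from "low", despite the strong prior
```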
Tessler and Goodman's insight is that this inference of pragmatic meanings can account for the behaviour of generic sentences. The literal meaning of a generic can be simple (it is more likely to be true as the proportion increases), but the pragmatic meaning can have a rich dependence on the world knowledge encoded in the prior over situations. For example, Mosquitoes carry malaria does not mean that all mosquitoes do (in fact, many do not) but it can be informative for the listener: as most animals never carry malaria, even a small proportion is pragmatically relevant.

Building on this, we could model the generic quantifier by setting f_Q as the identity function (the same as many in Fig. 6). From (8), the probability of truth is then as shown in (10). However, marginalising out Π_R and Π_B is computationally expensive, as it requires summing over all possible functions. We can approximate this by reversing the order of the expectations, and so marginalising out Π_R and Π_B before U, as shown in (11), where ψ_R and ψ_B are vague functions. Evaluating a vague function is computationally simple.

E_{π_R, π_B}[ E_{u|v}[π_R(u ∪ v) π_B(u ∪ v)] / E_{u|v}[π_R(u ∪ v)] ]    (10)
  ≈ E_{u|v}[ψ_R(u ∪ v) ψ_B(u ∪ v)] / E_{u|v}[ψ_R(u ∪ v)]    (11)

Abbreviating this, similarly to (9), we can write:

ψ_Q = E_u[ψ_R ψ_B] / E_u[ψ_R]    (12)

For precise quantifiers, using vague functions gives trivial truth values (discussed in §4.1), but for generics, (10) and (11) give similar probabilities of truth. To put it another way, a vague quantifier doesn't need precise functions. Modelling generics with (10) was driven by the intuition that generics are vague but semantically simple. The alternative in (11) is even simpler, because we only need to calculate E_{u|v} once in total, rather than once for each possible π_R and π_B. This would make generics computationally simpler than other quantifiers, consistent with the evidence that they are easier to acquire and to process.

In fact, (11) takes us back to E&C's conditional probability, as shown in (13).

ψ_Q(v) = Σ_u P(u | v) P(r | u, v) P(b | u, v) / Σ_u P(u | v) P(r | u, v) = P(b | r, v)    (13)

This means the logical inference proposed by Emerson and Copestake (2017a) can in fact be seen as generic quantification. This is illustrated in Fig. 7, which corresponds to a sentence like Rooms that have stoves are kitchens, if α, β, γ, δ are set to room, have, stove, kitchen.⁶

⁶ An example from RELPRON (Rimell et al., 2016).

Figure 7: Emerson and Copestake (2017a)'s logical inference, re-analysed as generic quantification. R is the restriction, the logical conjunction of Tα,X, Tβ,Y, and Tγ,Z, while Tδ,X is the body. Generic quantification gives P(q) = P(t_δ,X | t_α,X, t_β,Y, t_γ,Z), marginalising out all three bound variables (X, Y, and Z).

Not only does this approach to quantification deal with both precise and vague quantifiers in a uniform way, it can also explain why generics are easier to process than precise quantifiers.
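The computational simplification in (11)–(13) can be made concrete as follows: with f_Q as the identity and only the vague functions ψ_R and ψ_B needed, the generic's probability of truth is a single ratio of expectations over the bound variable, i.e. P(b | r). The vague predicates and pixie distribution below are invented toys, and the Monte Carlo estimate is my own simplification.

```python
import random

# Sketch of equations (11)-(13): the generic quantifier's probability of truth is
# E_u[psi_R * psi_B] / E_u[psi_R] = P(b | r), computed directly from vague functions,
# with no outer sum over precise functions.  All numbers below are invented.

def generic_prob(psi_R, psi_B, sample_u, n_samples=10000):
    num = 0.0
    den = 0.0
    for _ in range(n_samples):
        u = sample_u()
        r = psi_R(u)         # probability that the restriction is true of pixie u
        num += r * psi_B(u)  # joint probability, assuming independence given u
        den += r
    return num / den

# Toy vague predicates over pixies in [0, 1], loosely standing in for "duck" and "lays eggs".
psi_duck = lambda u: 0.9 if u < 0.3 else 0.05
psi_lays_eggs = lambda u: 0.6 if u < 0.3 else 0.2

print("probability of truth for the generic:", round(generic_prob(psi_duck, psi_lays_eggs, random.random), 2))
```

Compared with the sketch after (9), no sampling over precise functions is required here, which is the sense in which the generic is computationally simpler.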
6 Donkey Sentences

An example of a donkey sentence is shown in (3). They are challenging for classical semantic theories, because naive composition, shown in (4), leaves a variable (y) outside the scope of its quantifier (Geach, 1962). The tempting solution in (5) requires a universal quantifier for an indefinite (a donkey), which would be non-compositional.⁷

(3) Every farmer who owns a donkey feeds it.
(4) ∀x [farmer(x) ∧ ∃y [donkey(y) ∧ own(x, y)]] → feed(x, y)
(5) ∀x ∀y [farmer(x) ∧ donkey(y) ∧ own(x, y)] → feed(x, y)

⁷ For simplicity, (4) and (5) suppress event variables.

Kanazawa (1994), Brasoveanu (2008), and King and Lewis (2016) discuss how donkey sentences seem to admit multiple readings, which vary in the strength of their truth conditions, and which depend on both lexical semantics and the context of use.

This kind of dependence is exactly what Tessler and Goodman (2019) explained using RSA, so I will apply the same tools here.

As discussed in §5, generics are more basic than classical quantifiers, so I first consider generic donkey sentences, as illustrated in (6)–(8). An analysis of (3) is given in Appendix A.

(6) Farmers who own donkeys feed them.
(7) Linguists who use probabilistic models love them.
(8) Mosquitoes which bite birds infect them with malaria.

Example (8) shows it is inappropriate to use a universal quantifier: not all mosquitoes carry malaria, and not all bitten birds are infected (even if bitten by a malaria-carrying mosquito). However, this sentence still communicates that malaria is spread between birds by mosquitoes. This relies on pragmatic inference, from prior knowledge that most animals cannot spread malaria.

Despite the challenge for classical theories, generic donkey sentences can be straightforwardly handled by my proposed probabilistic approach. An example is shown in Fig. 8, which corresponds to (6), if α, β, γ, δ are set to farmer, own, donkey, feed. Intuitively, the more likely it is that a farmer owning a donkey implies the farmer feeding the donkey, the more likely it is for the sentence to be true. Given world knowledge and a discourse context, this can lead to a sharp threshold for being uttered, using RSA's pragmatic inference.

Figure 8: Analysis of a generic donkey sentence, using generic quantification. The quantifier node Q has restriction R (a logical conjunction) and body Tδ,W.

7 Related Work

Functional Distributional Semantics is related to other probabilistic semantic approaches. Goodman and Lassiter (2015) and Bernardy et al. (2018, 2019) represent meaning as a probabilistic program. This paper brings Functional Distributional Semantics closer to their work, because a probabilistic scope tree can be seen as a probabilistic program. An important practical difference is that Functional Distributional Semantics represents all predicates in the same way (as functions of pixies), allowing a model to be trained on corpus data.

Probabilistic TTR (Cooper, 2005; Cooper et al., 2015) also represents meaning as a probabilistic truth-conditional function. However, in this paper I have provided an alternative compositional semantics, in order to deal with vague quantifiers and generics. In principle, my proposal could be incorporated into a probabilistic TTR approach. Furthermore, although Cooper et al. (2015) discuss learning, they assume a richer input than available in distributional semantics.

Some hybrid distributional-logical systems exist (for example: Lewis and Steedman, 2013; Grefenstette, 2013; Herbelot and Vecchi, 2015; Beltagy et al., 2016), but these do not discuss challenging cases like generics and donkey sentences.

Explaining the multiple readings of donkey sentences using pragmatic inference has been proposed using non-probabilistic tools (for example: Champollion, 2016; Champollion et al., 2019). I have provided a concrete computational method to calculate such readings, in the same way that Tessler and Goodman (2019) have provided a concrete account of generics.

8 Conclusion

In this paper, I have presented a compositional semantics for both precise and vague quantifiers, in the probabilistic framework of Functional Distributional Semantics. I have re-interpreted previous work in this framework as performing generic quantification, building on the approach of Tessler and Goodman (2019). I have shown how generic quantification is computationally simpler than classical quantification, consistent with evidence that generics are a "default" mode of processing. Finally, I have presented examples of generic donkey sentences, which are doubly challenging for classical theories, but straightforward under my proposed approach.
Acknowledgements

This paper builds on chapter 7 of my PhD thesis (Emerson, 2018), and I would like to thank my PhD supervisor Ann Copestake, for her support, advice, and suggestions. I would also like to thank the anonymous reviewers, for pointing out areas that were unclear, and suggesting additional areas for discussion.

I am supported by a Research Fellowship at Gonville & Caius College, Cambridge.

References

Keith Allan. 2001. Natural language semantics. Blackwell.

Jon Barwise and Robin Cooper. 1981. Generalized quantifiers and natural language. Linguistics and Philosophy, 4(2):159–219.

Jon Barwise and John Perry. 1983. Situations and Attitudes. Massachusetts Institute of Technology (MIT) Press.

Islam Beltagy, Stephen Roller, Pengxiang Cheng, Katrin Erk, and Raymond J. Mooney. 2016. Representing meaning with a combination of logical and distributional models. Computational Linguistics, 42(4):763–808.

Jean-Philippe Bernardy, Rasmus Blanck, Stergios Chatzikyriakidis, and Shalom Lappin. 2018. A compositional Bayesian semantics for natural language. In Proceedings of the 1st International Workshop on Language, Cognition and Computational Models, pages 1–10.

Jean-Philippe Bernardy, Rasmus Blanck, Stergios Chatzikyriakidis, Shalom Lappin, and Aleksandre Maskharashvili. 2019. Bayesian Inference Semantics: A modelling system and a test suite. In Proceedings of the 8th Joint Conference on Lexical and Computational Semantics (*SEM), pages 263–272.

Adrian Brasoveanu. 2008. Donkey pluralities: plural information states versus non-atomic individuals. Linguistics and Philosophy, 31(2):129–209.

Ronnie Cann. 1993. Formal semantics: an introduction. Cambridge University Press.

Gregory N. Carlson. 1977. Reference to kinds in English. Ph.D. thesis, University of Massachusetts at Amherst.

Gregory N. Carlson and Francis Jeffry Pelletier, editors. 1995. The generic book. University of Chicago Press.

Lucas Champollion. 2016. Homogeneity in donkey sentences. In Proceedings of the 26th Semantics and Linguistic Theory Conference (SALT 26), pages 684–704.

Lucas Champollion, Dylan Bumford, and Robert Henderson. 2019. Donkeys under discussion. Semantics and Pragmatics, 12(1):1–45.

Robin Cooper. 2005. Austinian truth, attitudes and type theory. Research on Language and Computation, 3(2-3):333–362.

Robin Cooper, Simon Dobnik, Staffan Larsson, and Shalom Lappin. 2015. Probabilistic type theory and natural language semantics. Linguistic Issues in Language Technology (LiLT), 10.

Ann Copestake. 2009. Slacker semantics: Why superficiality, dependency and avoidance of commitment can be the right way to go. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 1–9.

Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan A. Sag. 2005. Minimal Recursion Semantics: An introduction. Research on Language and Computation, 3(2-3):281–332.

Donald Davidson. 1967. The logical form of action sentences. In Nicholas Rescher, editor, The Logic of Decision and Action, chapter 3, pages 81–95. University of Pittsburgh Press. Reprinted in: Davidson (1980/2001), Essays on Actions and Events, Oxford University Press.

Guy Emerson. 2018. Functional Distributional Semantics: Learning Linguistically Informed Representations from a Precisely Annotated Corpus. Ph.D. thesis, University of Cambridge.

Guy Emerson. 2020a. Autoencoding pixies: Amortised variational inference with graph convolutions for Functional Distributional Semantics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL).

Guy Emerson. 2020b. What are the goals of distributional semantics? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL).

Guy Emerson and Ann Copestake. 2016. Functional Distributional Semantics. In Proceedings of the 1st Workshop on Representation Learning for NLP (RepL4NLP), pages 40–52. Association for Computational Linguistics.
Guy Emerson and Ann Copestake. 2017a. Variational inference for logical inference. In Proceedings of the Conference on Logic and Machine Learning in Natural Language (LaML), pages 53–62. Centre for Linguistic Theory and Studies in Probability (CLASP).

Guy Emerson and Ann Copestake. 2017b. Semantic composition via probabilistic model theory. In Proceedings of the 12th International Conference on Computational Semantics (IWCS), pages 62–77. Association for Computational Linguistics.

Kit Fine. 1975. Vagueness, truth and logic. Synthese, 30(3-4):265–300.

Michael C. Frank and Noah D. Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998.

Peter Gärdenfors. 2000. Conceptual spaces: The geometry of thought. Massachusetts Institute of Technology (MIT) Press.

Peter Gärdenfors. 2014. Geometry of meaning: Semantics based on conceptual spaces. Massachusetts Institute of Technology (MIT) Press.

Peter Thomas Geach. 1962. Reference and generality: An examination of some medieval and modern theories. Cornell University Press.

Susan A. Gelman, Sarah-Jane Leslie, Alexandra M. Was, and Christina M. Koch. 2015. Children's interpretations of general quantifiers, specific quantifiers and generics. Language, Cognition and Neuroscience, 30(4):448–461.

Noah D. Goodman and Michael C. Frank. 2016. Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11):818–829.

Noah D. Goodman and Daniel Lassiter. 2015. Probabilistic semantics and pragmatics: Uncertainty in language and thought. In Shalom Lappin and Chris Fox, editors, The Handbook of Contemporary Semantic Theory, 2nd edition, chapter 21, pages 655–686. Wiley.

Edward Grefenstette. 2013. Towards a formal distributional semantics: Simulating logical calculi with tensors. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics (*SEM), pages 1–10. Association for Computational Linguistics.

Aurélie Herbelot. 2010. Underspecified quantification. Ph.D. thesis, University of Cambridge.

Aurélie Herbelot and Ann Copestake. 2011. Formalising and specifying underquantification. In Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 165–174. Association for Computational Linguistics.

Aurélie Herbelot and Eva Maria Vecchi. 2015. Building a shared world: mapping distributional to model-theoretic semantic spaces. In Proceedings of the 20th Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 22–32. Association for Computational Linguistics.

Michelle A. Hollander, Susan A. Gelman, and Jon Star. 2002. Children's interpretation of generic noun phrases. Developmental Psychology, 38(6):883–894.

Hans A. W. Kamp. 1975. Two theories about adjectives. In Edward L. Keenan, editor, Formal Semantics of Natural Language, chapter 9, pages 123–155. Reprinted in: Stephen Davis and Brendan S. Gillon, editors (2004), Semantics: A Reader, chapter 26, pages 541–562; Kamp (2013), Meaning and the Dynamics of Interpretation: Selected Papers of Hans Kamp, pages 225–261.

Hans A. W. Kamp and Uwe Reyle. 2013. From Discourse to Logic: Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory, volume 42 of Studies in Linguistics and Philosophy. Springer.

Makoto Kanazawa. 1994. Weak vs. strong readings of donkey sentences and monotonicity inference in a dynamic setting. Linguistics and Philosophy, 17(2):109–158.

Rosanna Keefe. 2000. Theories of Vagueness. Cambridge Studies in Philosophy. Cambridge University Press.

Sangeet Khemlani, Sarah-Jane Leslie, Sam Glucksberg, and Paula Rubio Fernandez. 2007. Do ducks lay eggs? How people interpret generic assertions. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 29, pages 395–400.

Jeffrey C. King and Karen S. Lewis. 2016. Anaphora. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, fall 2016 edition. Metaphysics Research Lab, Stanford University.

Daniel Lassiter. 2011. Vagueness as probabilistic linguistic knowledge. In Rick Nouwen, Robert van Rooij, Uli Sauerland, and Hans-Christian Schmitz, editors, Vagueness in Communication: Revised Selected Papers from the 2009 International Workshop on Vagueness in Communication, chapter 8, pages 127–150. Springer.

Sarah-Jane Leslie. 2008. Generics: Cognition and acquisition. The Philosophical Review, 117(1):1–47.

Mike Lewis and Mark Steedman. 2013. Combined distributional and logical semantics. Transactions of the Association for Computational Linguistics (TACL), 1:179–192.

Godehard Link. 1983. The logical analysis of plurals and mass terms: A lattice-theoretical approach. In Rainer Bäuerle, Christoph Schwarze, and Arnim von Stechow, editors, Meaning, Use and the Interpretation of Language, chapter 18, pages 303–323. Walter de Gruyter. Reprinted in: Paul Portner and Barbara H. Partee, editors (2002), Formal Semantics: The Essential Readings, chapter 4, pages 127–146.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations (ICLR), Workshop Track.

Richard Montague. 1973. The proper treatment of quantification in ordinary English. In K. Jaakko J. Hintikka, Julius M. E. Moravcsik, and Patrick Suppes, editors, Approaches to Natural Language, number 49 in Synthese Library, chapter 10, pages 221–242. Kluwer Academic Publishers. Reprinted in: Paul Portner and Barbara H. Partee, editors (2002), Formal Semantics: The Essential Readings, chapter 1, pages 17–34.

Terence Parsons. 1990. Events in the Semantics of English: A Study in Subatomic Semantics. Current Studies in Linguistics. Massachusetts Institute of Technology (MIT) Press.

Barbara H. Partee. 1988. Many quantifiers. In Proceedings of the Eastern States Conference on Linguistics (ESCOL), pages 383–402. Ohio State University. Reprinted in: Partee (2004), Compositionality in Formal Semantics, pages 241–258, Blackwell.

Barbara H. Partee. 2012. The starring role of quantifiers in the history of formal semantics. In The Logica Yearbook 2012, pages 113–136. College Publications.

Laura Rimell, Jean Maillard, Tamara Polajnar, and Stephen Clark. 2016. RELPRON: A relative clause evaluation dataset for compositional distributional semantics. Computational Linguistics, 42(4):661–701.

Peter R. Sutton. 2015. Towards a probabilistic semantics for vague adjectives. In Henk Zeevat and Hans-Christian Schmitz, editors, Bayesian Natural Language Semantics and Pragmatics, chapter 10, pages 221–246. Springer.

Peter R. Sutton. 2017. Probabilistic approaches to vagueness and semantic competency. Erkenntnis.

Michael Henry Tessler. 2018. Communicating generalizations: Probability, vagueness, and context. Ph.D. thesis, Stanford University.

Michael Henry Tessler and Noah D. Goodman. 2019. The language of generalization. Psychological Review, 126(3):395–436.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Johan Van Benthem. 1984. Questions about quantifiers. The Journal of Symbolic Logic, 49(2):443–466.

[Figure: scope DAG for the classical donkey sentence analysis. Pixie-valued random variables W, X, Y, Z are linked by ARG1 and ARG2; truth value nodes Tα,X, Tβ,Y, Tγ,Z, Tδ,W are determined by the predicates; TRC and TDP are logical conjunctions; the quantifier nodes are QE1, QE2, Q∃, QGEN, and Q∀.]
A Classical Donkey Sentences

In this analysis of a classical donkey sentence, the donkey pronoun is associated with a generic quantifier, while all other quantifiers are precise. The generic quantifier allows the range of readings associated with donkey sentences.

The above figure corresponds to example (3), if α, β, γ, δ are set to farmer, own, donkey, feed. Intuitively, this analysis says that, if all farmers who own at least one donkey feed at least a proportion p of their donkeys, then this sentence is true with probability p.

The probability of truth gradually increases with the proportion p. Given world knowledge and a discourse context, this can lead to a sharp threshold proportion above which it is uttered, using pragmatic inference in the RSA framework. If distinguishing small proportions is pragmatically relevant, the weak reading becomes preferred. If distinguishing large proportions is pragmatically relevant, the strong reading becomes preferred.

I will now go over all nodes in the graph. Firstly, the distributions for Tα,X, Tβ,Y, Tγ,Z, Tδ,W are determined by the predicates.

The remaining truth value nodes are labelled for convenience. TRC and TDP are logical conjunctions (for the relative clause and donkey pronoun, respectively). The remaining five nodes are quantifier nodes, each quantifying one variable.

Note that Z is quantified twice (by Q∃ and QGEN). This would be surprising in a classical scope tree, but is not a problem here – marginalising out a random variable means that the quantifier node is not dependent on that variable, but the random variable is still part of the joint distribution, so it can be referred to by other nodes. Because of this double quantification, the scope "tree" is actually a scope DAG (directed acyclic graph).

QE1 and QE2 marginalise out the event variables, respectively Y and W, with trivially true restrictions and bodies Tβ,Y and Tδ,W, leaving free variables X and Z. They can be treated like some in Fig. 6. For given pixies x and z, QE1 is true if x owns z; QE2 is true if x feeds z.

Q∃ marginalises out Z, with TRC as restriction and QE1 as body, leaving the free variable X. It can be treated like some in Fig. 6. For a given x, it is true if x is a farmer and there is a donkey z such that QE1 is true.

QGEN also marginalises out Z, with TDP as restriction and QE2 as body, leaving the free variable X. It uses the generic quantifier, as in (11). For a given x, it considers donkeys z for which TRC is true; the probability of truth is the proportion of such z for which QE2 is true (out of donkeys owned by farmer x, the proportion fed by x).

Finally, Q∀ marginalises out X, with Q∃ as restriction and QGEN as body, leaving no free variables. It is treated as in Fig. 6. It is true if, whenever Q∃ is true of x, QGEN is true of x, considering QGEN as a distribution over precise functions.
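One way to read this node-by-node analysis is sketched below, for a toy finite model with invented ownership and feeding facts. For each farmer who owns at least one donkey, QGEN gives the proportion of their donkeys that they feed; Q∀ then requires QGEN to hold of every such farmer. Assuming the generic's uniform threshold is sampled once and shared across farmers (my reading of "considering QGEN as a distribution over precise functions", not an explicit statement in the appendix), the sentence's probability of truth is the minimum such proportion, matching the intuition that the sentence is true with probability p when every relevant farmer feeds at least a proportion p of their donkeys.

```python
# A toy reading of the appendix analysis of "Every farmer who owns a donkey feeds it".
# The individuals and the ownership/feeding facts are invented for illustration.

owns = {("f1", "d1"), ("f1", "d2"), ("f2", "d3")}
feeds = {("f1", "d1"), ("f2", "d3")}
farmers = {"f1", "f2"}
donkeys = {"d1", "d2", "d3"}

proportions = []
for x in farmers:
    owned = {z for z in donkeys if (x, z) in owns}
    if owned:  # restriction of Q_forall: farmers who own at least one donkey
        fed = {z for z in owned if (x, z) in feeds}
        proportions.append(len(fed) / len(owned))  # Q_GEN with the identity f_Q

# With a single shared threshold, "every farmer exceeds the threshold" holds with
# probability equal to the smallest proportion.
prob_true = min(proportions) if proportions else 1.0
print("probability of truth:", prob_true)  # 0.5 here: f1 feeds only one of two donkeys
```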