
Generating and Resolving Vague References

Timothy Meo¹, Brian McMahan², and Matthew Stone²,³
¹Linguistics, ²Computer Science, ³Center for Cognitive Science
Rutgers University
¹New Brunswick, NJ 08901-1184; ²Piscataway, NJ 08854-8019
[email protected]

Abstract

We describe a method for distinguishing colors in context using English color terms. Our approach uses linguistic theories of vagueness to build a cognitive model via Bayesian rational analysis. In particular, we formalize the likelihood that a speaker would use a color term to describe one color but not another as a function of the background frequency of the color term, along with the likelihood of selecting standards in context that fit one color and not the other. Our approach exhibits the qualitative flexibility of human color judgments and reaches ceiling performance on a small evaluation corpus.

1 Introduction

A range of research across cognitive science, summarized in Section 2, suggests that people negotiate meanings interactively to draw useful distinctions in context. This ability depends on using words creatively, interpreting them flexibly, and tracking the effects of utterances on the evolving context of the conversation. We adopt a computational approach to these fundamental skills. Our goal is to quantify them, scale them up, and evaluate their possible contribution to coordination of meaning in practical dialogue systems.

Our work extends three traditions in computational linguistics. Our approach to semantic representation builds on previous research that emphasizes the context dependence and interactive dynamics of meaning (Barker, 2002; Larsson, 2013; Ludlow, 2014). Our approach to pragmatic reasoning builds on work on referring expressions and its characterization of the problem solving involved in using vague language to identify entities uniquely in context (Kyburg and Morreau, 2000; van Deemter, 2006). Finally, we take a perceptually grounded approach to meaning, which allows us to use empirical methods to induce semantic representations on a wide scale from multimodal corpus data (Roy and Reiter, 2005; Steels and Belpaeme, 2005; McMahan and Stone, 2014).

We present our ideas through a case study of the color vocabulary of English. In particular, we study the problem solving involved in using color descriptors creatively to distinguish one color swatch from another, similar color. In our model, these descriptions inevitably refine the interpretation of language in context. We assume that speakers make choices to fulfill their communicative goals while reproducing common patterns of description. Using corpus data, we are able to quantify how representative a particular context-dependent semantic interpretation is of typical English speakers' behavior.

Our model naturally exhibits many of the preferences of previous work on vague descriptions. For example, the system avoids placing thresholds in small gaps (van Deemter, 2006), that is, in regions of conceptual space that account for little of the probability mass of possible interpretations. In such circumstances, the system prefers more specific vocabulary, where interlocutors are more likely to draw fine distinctions (Baumgaertner et al., 2012). Our approach realizes these effects by simple and uniform decision making that extends to multidimensional spaces and arbitrary collections of vocabulary.

We begin the paper by describing the semantic representation of vagueness in dialogue. Vagueness, we assume, is uncertainty about where to set the threshold in context for the concept evoked by a term. Speakers have the option to triangulate more precise thresholds by interactive strategies such as accommodation, and this helps explain how vague descriptions can be used to refer to objects precisely (van Deemter, 2006).

In Section 3, we describe our model of speakers' decisions in conversation. We focus on speakers that aim to distinguish one thing from another; in these cases, we assume speakers aim to choose a term that's interpreted so that it fits the target and excludes the distractor, while matching broader patterns of language use.

We show how to combine these ideas in Section 4. We formalize the likelihood that a speaker would use a color term to describe one color but not another as a function of the likelihood of selecting standards to justify its application in this context, along with the background frequency of the color term. We describe an implementation of the formalism and report its qualitative and quantitative behavior in Section 5. It works with a generic lexicon of more than 800 color terms and reaches ceiling performance in interpreting user color descriptions in the data set of Baumgaertner et al. (2012). While substantial additional research is required to explore the dynamics of vagueness in conversation, our results already suggest new ways to apply generic models of the use of vague language in support of sophisticated, open-ended construction of meaning in situated dialogue.

2 The Linguistics of Vagueness

Figure 1 shows an image from a public data set developed to study how people label images with captions (Young et al., 2014). One user chose to distinguish the dogs by calling one brown and the other tan. Another distinguished the dogs by calling one tan and the other white. Each used the tan dog to refer to a different dog—yet the way each described the other dog left no doubt about the correct interpretation. This variability and context dependence is characteristic of vagueness in language. The dogs in Figure 1 are borderline cases; there's no clear answer about whether they are tan or not, and speakers are free to talk of either, both, or neither of them as tan, depending on their purposes in the conversation.

Figure 1: A brown dog and a tan one—or a tan dog and a white one (Young et al., 2014).

In this paper, we explore the descriptive variability seen in Figure 1. How is it that speakers can settle borderline cases in useful ways to move a dialogue forward, and how can hearers recognize those decisions? We won't consider the interactive strategies that interlocutors can use to confirm, negotiate or contest potentially problematic descriptions, although that's obviously crucial for successful reference (Clark and Wilkes-Gibbs, 1986), for coordinated meaning (Steels and Belpaeme, 2005), and perhaps even for meaning itself (Ludlow, 2014). And we won't consider the way multiple descriptions constrain one another, as in Figure 1, although we expect to explain it as a side-effect of holistic interpretive processing (Stone and Webber, 1998). We see our work as a prerequisite for the model building and data collection required to address such issues.

In our view, the users of Young et al. (2014) are using tan to name color categories. Colors are visual sensations that vary continuously across a space of possibilities. Color categories are classifiers that group regions in color space together (Gärdenfors, 2000; Larsson, 2013). Color terms in English also have another sense, not at issue in this paper, where they refer to an underlying property that correlates with color, as in red pen (writes in red ink) (Kennedy and McNally, 2010).

Empirically, color categories seem to be convex regions (Gärdenfors, 2000; Jäger, 2010)—in fact, we model them as rectangular box-shaped regions in hue–saturation–value (HSV) space. Thus, color categories involve boundaries, thresholds or standards that delimit the regions in color space where they apply; context sensitivity can be modeled as variability in the location of these boundaries (Kennedy, 2007). For example, when we categorize the lighter dog of Figure 1 as being distinctive in its color, we must have a color category that fits this dog but not the darker one. This category will group together colors with a suitable interval of hue, suitably low levels of saturation, and suitably high values on the white–black continuum. We can think of this category as one possible interpretation for the word tan. By contrast, categorizing the darker dog of Figure 1 as distinctively tan involves choosing a category with different thresholds for hue, saturation and value—thresholds that fit the color of the darker dog but exclude that of the lighter one.

When interlocutors use vague terms in conversation, they constrain the way others can use those terms in the future (Lewis, 1979; Kyburg and Morreau, 2000; Barker, 2002). For example, if we hear one or the other dog of Figure 1 described as tan, it constrains how we will interpret subsequent uses of the word tan. Concretely, we might update the perceptual classifier we associate with tan in this context so that it fits the target dog and excludes its alternative (Larsson, 2013). We see this as a case of accommodation, in the sense of Lewis (1979).

As speakers, we often count on our interlocutors to accommodate us (Thomason, 1990). We can use vague terms confidently as long as the distinction we aim to draw with them is clear in context and as long as our choice is sufficiently in line with the normal variation in the use of the word, and therefore uncontroversial (Thomason, 1990; van Deemter, 2006). Such criteria seem to support the speaker's choice in Figure 1 to describe either dog as tan, provided the speaker offers a complementary description of the other dog. At the same time, if we use language in very unusual ways, we can expect that our interlocutor may have difficulty understanding and may be reluctant to accommodate us. In other words, to use vague language effectively, speakers must be sensitive to whether their utterances update the dialogue context in a natural way.

A common idea in linguistics and philosophy is that knowledge of language associates terms with a probability distribution over categories. This distribution characterizes speakers' information about the likelihood of different possible interpretations for the term that could make sense in context (Williamson, 1996; Barker, 2002; Lassiter, 2009). In other words, vagueness amounts to uncertainty about where to draw boundaries to settle borderline cases.

Thus, when we need to settle borderline cases to generate or understand utterances like the tan dog, knowledge of meaning lets us quantify how likely the different resolutions are. In Figure 1, for example, knowledge of language says that tan can be interpreted, with a suitable probability, through categories that pick out just the lighter dog, but that tan can also be interpreted, with a suitable probability, through categories that pick out just the darker dog. The next section explains how to formalize the reasoning involved in assessing these probabilities, reviews one instantiation of this reasoning for learning semantics, and develops another instantiation for distinguishing colors in context.

3 Rational Analysis of Descriptions

Speakers can use language for a variety of purposes. Their decisions of what to say thus depend on knowledge of language, their communicative situation, and their communicative goals. Following Anderson (1991), rational analysis invites us to explain an agent's action as a good way to advance the agent's goals given the agent's information. When applied to communication, this approach allows us not only to derive utterances for systems but also to infer linguistic representations from utterances when we know the agent's communicative situation and communicative goals.

We apply this methodology to color descriptions in McMahan and Stone (2014). We infer linguistic representations from Randall Munroe's color corpus¹ by assuming that subjects' goals were to say true things and match a target distribution of utterances. These results are available as our Lexicon of Uncertain Color Standards (LUX). We describe this experiment in Section 3.1. We continue in Section 3.2 by creating a new model of the task of creating a distinguishing description. Here the goal is to describe one color, exclude another, and match a target distribution.

3.1 Lexicon of Uncertain Color Standards

Munroe's corpus was gathered by presenting subjects with a color patch and allowing them to freely describe it. It's not interactive language use, but we use it just to model knowledge of meaning. Like all crowdsourced data, Munroe's methodology sacrifices control over presentation of stimuli and curation of subjects' responses for sheer scale of data collection. We work with a subset of the data involving 829 color terms elicited over 2.18M trials. Each description is paired with the multiset of color values on which subjects used it. We model the data in HSV space, because color categories generally differ in the Hue dimension.

¹blog.xkcd.com/2010/05/03/color-survey-results/

LUX links color descriptions with context-sensitive regions in HSV color space. An example is shown in Figure 2 for the Hue dimension. The plot shows a scaled histogram of subjects' responses. There is a region on the Hue dimension which subjects frequently described as yellowish green, with borderline cases on either side of it.

To capture the patterns of human responses, the rational analysis approach directly models the uncertainty described in Section 2. For each color term, speakers have possible standards which can be used to partition color space; they are unsure which are at work at any point. For example, the term yellowish green only fits those Hue values which are above a minimum threshold, τ^{Lower,H} (or τ^{L,H} for short), and below a maximum threshold, τ^{Upper,H} (or τ^{U,H} for short). We estimate the distribution of possible thresholds; they are shown as the solid black lines in Figure 2.

Figure 3: A Bayes Rational Observer sees a color patch. The subjective likelihood P(k_true(c) | c = x) describes the likelihood that descriptor k is true of the current color c given that it is located at HSV point x. The descriptor k is actually said in proportion to this subjective likelihood and a weight representing how often a label is said when it is true: P(k_said | k_true(c)). In Munroe's data, the shaded nodes are observed.

In choosing to use a color description to fit a point x in HSV space, speakers make a semantic judgment which constrains the possible standards. The naturalness of this judgment is measured in part by the probability mass of possible standards which allow the description to be used. For example, the applicability of yellowish green is the probability of the color value x being between the minimum and maximum thresholds in each dimension. For a color description k, this is defined fully in Equation 1 and more compactly in Equation 2:

  P(τ_k^{L,H} < x^H < τ_k^{U,H}) × P(τ_k^{L,S} < x^S < τ_k^{U,S}) × P(τ_k^{L,V} < x^V < τ_k^{U,V})   (1)

  = ∏_{d ∈ {H,S,V}} P(τ_k^{L,d} < x^d < τ_k^{U,d})   (2)

The other factor in subjects' choices is the saliency of the color term. The saliency of color description k, also called its availability and written α(k), is a background measure of how often the term is used when it is true. Thus, to pick a term that fits a color swatch and use language in a natural way, subjects can pick a color term according to the product of availability and subjective likelihood. Figure 3 summarizes this process in a graphical model.

In Equation 3, we introduce a simpler notation for Equation 2 that we build on in what follows. We abbreviate P(τ_k^{L,d} < x^d < τ_k^{U,d}) as φ_k^d(x^d) and show how φ_k^d(x^d) can be defined by cases as a function of how x^d is situated with respect to the lower limit μ_k^{L,d} and upper limit μ_k^{U,d} of the threshold distributions:

  φ_k^d(x^d) = P(x^d > τ_k^{L,d})   if x^d ≤ μ_k^{L,d}
             = P(x^d < τ_k^{U,d})   if x^d ≥ μ_k^{U,d}
             = 1                    otherwise          (3)

LUX was learned from Munroe's data by fitting the parameters of the φ function for each description on each dimension independently to the frequency histogram. For example, the parameters for the φ function for yellowish green in Figure 2 were fit by maximizing the probability that the bins in the data histogram were sampled from the φ curve with standard Gaussian noise.
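To make Equations 2 and 3 concrete, here is a minimal sketch in Python of the per-dimension applicability function φ and its product over H, S, and V. The Gaussian threshold distributions, the spread σ, and the parameter layout are our illustrative assumptions for this sketch; LUX estimates its own threshold distributions from the corpus.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(T < x) for a threshold T ~ Normal(mu, sigma)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def phi(x, mu_lower, mu_upper, sigma=0.05):
    """Equation 3: applicability of a term on one dimension, assuming
    (for illustration) Gaussian threshold distributions centered at
    the lower and upper limits mu_lower and mu_upper."""
    if x <= mu_lower:
        return normal_cdf(x, mu_lower, sigma)        # P(x > tau_L)
    if x >= mu_upper:
        return 1.0 - normal_cdf(x, mu_upper, sigma)  # P(x < tau_U)
    return 1.0  # x lies inside the typical region: always applicable

def applicability(hsv, params):
    """Equation 2: product of phi over H, S, V. `params` maps each
    dimension to its (mu_lower, mu_upper) pair."""
    result = 1.0
    for d in ("H", "S", "V"):
        lo, hi = params[d]
        result *= phi(hsv[d], lo, hi)
    return result
```

A value well inside a term's typical region gets applicability 1; values beyond the limits fall off with the probability that a threshold happens to be set far enough out to include them.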

Figure 2: The LUX model for "yellowish green" on the Hue axis plotted against a scaled histogram of responses. The φ curve, the likelihood of a color counting as "yellowish green", is derived from the τ curves representing possible boundaries.

3.2 Distinguishing Descriptions

Munroe's elicitation task is simple; in other settings, people have more complex communicative goals, such as unique reference. These goals modulate the link between internal semantic representation and observed speaker choice. In Munroe's task, we assume, the speaker sampled from possible descriptive terms based on terms' availability and how likely terms were to fit the target color value. We now consider how this changes when speakers aim to differentiate between two objects.

The literature offers a key insight to get us started: referential expressions are marked as such, and the scalar structure of vague meanings gives strong constraints on how vague terms can be interpreted. For example, the fat pig can only refer to the fatter of two pigs in the context, a calculation that is easy to add to algorithms for referring expression generation (Kyburg and Morreau, 2000; van Deemter, 2006). However, things become substantially more complicated in the case of color, because color is multidimensional and color categories can be approximated in competing ways, as with tan in Figure 1.

We approach the problem probabilistically. To generate likely unique references, the speaker must sample from possible descriptive terms in proportion to terms' availability, how likely terms are to fit the target, and how likely terms are to exclude a distractor. This involves integrating over all possible thresholds, to measure the probability that a description should be interpreted to include one color and exclude another. In the ordinary case where two colors are far enough apart, most thresholds work, and the approach defaults to the kinds of natural descriptions seen for colors on their own. However, when the colors become increasingly close, general color descriptions (such as green) are no longer likely to signal the distinction we need, while more specific color descriptions (such as yellowish green and pale green) are. This qualitative behavior is an important part of vague language, as observed by Baumgaertner et al. (2012). (They also suggest that accurate models of color vagueness would be necessary for good performance in difficult cases.)

The same model can inform the resolution of vague descriptions as well as generation. Resolving reference requires reasoning about how well each description applies to each of the candidate referents. We explore this reasoning for generation and understanding in the next section.
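The choice rule we assume for Munroe's simple task, picking a term in proportion to availability times applicability, can be sketched as follows. The `candidates` structure, with availability and applicability precomputed per term, is our illustrative assumption, not part of LUX.

```python
import random

def sample_description(candidates, rng=random):
    """Sample a description for a lone color swatch.

    `candidates` maps each color term to a pair (alpha, applicability):
    its availability and the Equation-2 probability that the term fits
    the swatch. Terms are drawn with probability proportional to the
    product of the two, which is the choice rule we assume above for
    Munroe's elicitation task.
    """
    weights = [(k, a * app) for k, (a, app) in candidates.items()]
    total = sum(w for _, w in weights)
    r = rng.random() * total
    for k, w in weights:
        r -= w
        if r <= 0:
            return k
    return weights[-1][0]  # guard against floating-point leftovers
```

A term that is either rarely used or a poor fit for the swatch gets correspondingly little probability of being produced.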

4 Algorithm and Implementation

The heart of our method is a measure of the confidence with which we can use a color term to describe a color Y and to exclude a second color Z. We will call this number the Y-but-not-Z confidence rating. This is the probability that the thresholds in context are chosen in such a way that color term k fits color Y but does not fit distractor Z. (That's P(k_true(c) | c = Y) × P(¬k_true(c) | c = Z) in the notation of Figure 3.) To generate a term in context, we might consider each possible color label, calculate its Y-but-not-Z confidence, and finally pick a term in proportion to its confidence.

We motivate our mathematical model by considering a single perceptual dimension, most easily visualized as Hue. In this case, the Y-but-not-Z confidence is equal to the probability that the upper and lower thresholds of that term can be set such that Y falls inside them and Z falls outside them. Thus each confidence rating will involve the multiplication of two values: the probabilities associated with the upper and lower boundaries.

In Figure 4, Y and Z are borderline Hue values; both are greener than the typical yellowish green. In this case, there's no constraint on the lower threshold; the lower threshold fits the description with probability 1. On the other hand, only the upper shaded region of Figure 4 supports a categorization of Y but not Z as yellowish green. This area is equal to φ(Y) − φ(Z). This is the probability that the Hue boundaries for this color term will include Y and exclude Z. Symmetrical reasoning applies in the mirror-image case when the colors are borderline yellow.

Figure 4: The thresholds that separate two nearby borderline cases cover probability φ(Y) − φ(Z), here 0.74 − 0.05 = 0.69.

Another case is shown in Figure 5, in which Y and Z fall on opposite sides: Y is borderline green, while Z is borderline yellow. In these contrasting borderline cases, it's up to the speaker whether to count Y in and Z out or vice versa, as in Figure 1. The choices can be good or bad, however, because they constrain the context. The probability that the upper threshold includes Y is φ(Y). The shaded area above Z represents the probability that the lower threshold is placed such that Z is excluded; its area is equal to 1 − φ(Z). Thus, the Y-but-not-Z confidence rating for this case is φ(Y) × (1 − φ(Z)). Again, there is a symmetrical case with the colors reversed.

Figure 5: The thresholds that separate two contrasting borderline cases cover probability φ(Y) × (1 − φ(Z)), here 0.74 × (1 − 0.38) = 0.46.

Finally, if Y is not a borderline case, as in Figure 6, then Y does not constrain the thresholds at all. Thus, the Y-but-not-Z confidence rating for this case is 1 − φ(Z). All three cases can be generalized to a common form, however. Let φ₁(Y) be φ(Y) if Y is a borderline case opposite Z, and 1 otherwise. And let φ₂(Y) be φ(Y) if Y is a borderline case next to Z, and 1 otherwise. Then all the formulas we have exhibited fit the scheme φ₁(Y) × (φ₂(Y) − φ(Z)).

Figure 6: If Y is a clear case, we simply exclude Z, for probability 1 − φ(Z), here 1 − 0.38 = 0.62.

With this insight, we can extend our comparison to the three-dimensional case, shown in Figure 7 for a color description k. To calculate this probability mass we generalize the Y-but-not-Z calculation to a case analysis in three dimensions, as shown in Equation 4:

  φ₁(Y) × (φ₂(Y) − φ₃(Z))   (4)

In this equation, we generalize our notation to the multidimensional case as follows:

• φ₁(Y) is ∏ φ(Y^d) over dimensions d where Y and Z are contrasting borderline cases
• φ₂(Y) is ∏ φ(Y^d) over all other dimensions
• φ₃(Z) is ∏ φ(Z^d) over all dimensions d

This expression is what we use in our implementation to calculate each color term's Y-but-not-Z confidence rating.

Figure 7: In the multidimensional case, solutions respect constraints from Y that are independent of Z, with probability φ₁(Y); they also select appropriate standards that affect both Y and Z, with probability φ₂(Y) − φ₃(Z).

Given a confidence score, the evaluation is balanced by the availability of the color description. For example, a common color term like green has a high availability, whereas a less frequent term, British racing green, has a much lower one. By weighting a term's score by its availability, we ensure that the program is less likely to generate rare color labels unless they clearly target a difficult distinction that the program needs to make.

With this score function complete, we arrive at the basic outline of our algorithm. The scoring function is shown in Algorithm 1. The distinguish function cycles through the dictionary, calculates the Y-but-not-Z confidence for each term k, and returns the results in sorted order.

Algorithm 1 The scoring function to compare two HSV tuples Y and Z for a single color term k
  function SCORE(k, Y, Z)
      TermA ← 1
      TermB ← 1
      TermC ← φ_k^H(Z^H) × φ_k^S(Z^S) × φ_k^V(Z^V)
      for each dimension d in (H, S, V) do
          if Y^d is on the opposite side from Z^d then
              TermA ← TermA × φ_k^d(Y^d)
          else
              TermB ← TermB × φ_k^d(Y^d)
          end if
      end for
      score ← TermA × (TermB − TermC)
      score ← score × α(k)
      return score
  end function
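Algorithm 1 translates almost line for line into executable form. The sketch below is ours, not the paper's released code; it reuses the Gaussian-CDF version of φ from the earlier sketch, assumes a hypothetical `term` record holding α(k) and per-dimension threshold limits, and decides "opposite sides" from the sign of each value's position relative to the term's typical region.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def phi(x, mu_lower, mu_upper, sigma=0.05):
    """Equation 3, with Gaussian threshold distributions assumed
    for illustration (LUX fits its own parameterization)."""
    if x <= mu_lower:
        return normal_cdf(x, mu_lower, sigma)        # P(x > tau_L)
    if x >= mu_upper:
        return 1.0 - normal_cdf(x, mu_upper, sigma)  # P(x < tau_U)
    return 1.0

def side(x, mu_lower, mu_upper):
    """-1 below the term's typical region, +1 above it, 0 inside."""
    return -1 if x <= mu_lower else (1 if x >= mu_upper else 0)

def score(term, Y, Z):
    """Y-but-not-Z confidence for one color term (Algorithm 1).

    `term` is a hypothetical record with availability term["alpha"]
    and, per dimension d, term["params"][d] = (mu_lower, mu_upper).
    Y and Z map each of "H", "S", "V" to a value.
    """
    term_a = 1.0  # phi_1(Y): contrasting-borderline dimensions
    term_b = 1.0  # phi_2(Y): all other dimensions
    term_c = 1.0  # phi_3(Z): all dimensions
    for d in ("H", "S", "V"):
        lo, hi = term["params"][d]
        term_c *= phi(Z[d], lo, hi)
        if side(Y[d], lo, hi) * side(Z[d], lo, hi) == -1:
            term_a *= phi(Y[d], lo, hi)  # Y and Z on opposite sides
        else:
            term_b *= phi(Y[d], lo, hi)
    return term["alpha"] * term_a * (term_b - term_c)
```

As in the text, a negative score flags a term that describes Z at least as well as it describes Y; a generator would reject such terms and rank the rest.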
In the cases in which k describes Z better than it describes Y, the function will evaluate to a negative number. Such cases are rejected—given our model, the terms cannot describe Y without also describing Z.

5 Results

We have created an interactive visualization that allows viewers to confirm the qualitative properties of our model for themselves. Figure 8 shows a screenshot of the visualization.

Users click on either of the two color swatches on the left to select colors, which are passed to the program as two HSV triplets. The middle column then displays a list of color terms associated with those swatches; this is context-independent data pulled directly from LUX. Terms are displayed in two colors: terms that are generally good descriptions of the target color but are bad at distinguishing it from its alternative are grayed out. For example, green is grayed out at the top in Figure 8, because it's such a good description of the lower swatch. The column on the right then displays the results of the generation model for the two colors. Typically, no term appears in both lists—as is true in Figure 8—because it's rare to find cases like Figure 1 where there are two plausible, competing ways to refine the meaning of a color term so as to fit one color but not the other.² Results are ranked by normalized confidence values; colors move up in the rankings when they more precisely distinguish the target color from its alternative. For example, pale green and yellow-green overtake the more general term as descriptions of the lower color in Figure 8.

²Our model does recognize a surprising difference between lime and lime green in Figure 8. This isn't a fluke: the same difference shows up in CSS color definitions, for example. We suspect that lime green evokes the peel of the fruit but lime is named for the juice.

Because the colors in Figure 8 are so close, context has a strong effect in selecting differentiating descriptions. As the two colors get further apart, there's less probability mass assigned to interpretations that categorize them the same way. Under these circumstances, the differentiating color terms converge to the color terms predicted by the generic model. This recalls the heuristic of Baumgaertner et al. (2012) that unmarked terms are used unless marked ones are needed to distinguish. In other words, our model produces marked descriptions only when coarser terms are less reliable in distinguishing the two colors, so they are necessary to achieve the communicative goal of distinguishing the two colors. This recalls the "small gaps" constraint of van Deemter (2006).

Figure 8: A screenshot of our interactive visualization, contrasting two similar colors. The system's descriptions emphasize the greater saturation and greener hue of the top color, and the lower saturation and yellower hue of the bottom color.

As a first step towards quantifying the performance of the model, we obtained the data collected by Baumgaertner et al. (2012). They showed subjects color swatches in arrays of four, and asked subjects either to identify a particular target swatch in words (as director) or to pick the swatch that best fit a verbal description (as matcher). At issue was the ability of human matchers or various algorithms to find the original target swatch (the correct swatch) given directors' descriptions. People's success in these tasks depends on how difficult it is to distinguish the alternatives. Because problems are so variable and task dependent, there can be no universal benchmark of performance in identifying colors, but the results are helpful in understanding what we have accomplished and where further research is necessary.

Baumgaertner et al. (2012) report an analysis of 29 judgments about the interpretation of color descriptions in context across a range of difficulty levels. Their baseline algorithm, which interprets colors based on the nearest focal value in RGB space, links 23 of them to the swatch the director was instructed to describe. Of the remainder, three represent clear problems with their system. Our system, by contrast, gets all these 26 correct. The remaining three cases raise the same problems for both approaches. There seems to be one case of human error: the director is signaled to describe a brown swatch but produces blueberry, apparently describing the adjacent purplish swatch. And two are cases of sparse data: the items deep blue and dull fall outside the frequent vocabulary of Munroe's data set. The two out-of-vocabulary cases arise in the most difficult setting, where directors must use low frequency terms to describe closely related colors; we get 71% right while human matchers recover the swatch signaled to the human director only 78% of the time.³ Thus, we conclude that we need larger and more targeted data sets to distinguish the performance of our new algorithm from that of people.

³Interestingly, our system correctly resolves the alternative items dark grey blue and salmon pink in these cases. If we can deal with the productivity of low frequency descriptions, we see no obstacle to matching or even exceeding human performance.

Baumgaertner et al.'s (2012) 29 key examples are drawn from a larger elicitation experiment that produced 196 different tokens, again across a range of conditions. Our system resolves 152 correctly as written. Another 28 are out of vocabulary but closely related to terms the system would resolve correctly (differing in spelling, comparative or superlative morphology, hedges, paraphrases or other lightweight modifiers). The system gets 8 wrong as written (again, this seems to include several cases of human error); 6 are out of vocabulary and closely related to terms that the system would get wrong; and 2 are completely different from any of our vocabulary items. All the system errors are on low frequency items in situations with close distractor colors, where we've seen people also have difficulty. We were unable to find patterns of systematic error in our system.

6 Conclusion

We have explored a problem solving approach to the use of vague language. We have presented the theoretical rationale for our approach, described a broad-scale implementation, and offered a preliminary empirical evaluation.

Our work is pervasively informed by previous work on the semantics and pragmatics of dialogue. But we have not deployed or evaluated our work with interactive language use. That's an obvious and important next step.

We're excited by the opportunities our work brings to assess the role of linguistic knowledge and rational problem solving in conversation. If successful, these efforts will lead to better interactive systems. But even if not, we think they will help to characterize speakers' interactive strategies, and thus to pinpoint the distinctive mechanisms that support meaning making in dialogue.

Acknowledgments

This work was supported in part by NSF DGE-0549115. Thanks to the SEMDIAL reviewers for helpful comments.

References

[Anderson1991] John R. Anderson. 1991. The adaptive nature of human categorization. Psychological Review, 98(3):409.

[Barker2002] Chris Barker. 2002. The dynamics of vagueness. Linguistics and Philosophy, 25(1):1–36.

[Baumgaertner et al.2012] Bert Baumgaertner, Raquel Fernández, and Matthew Stone. 2012. Towards a flexible semantics: colour terms in collaborative reference tasks. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), pages 80–84. Association for Computational Linguistics.

[Clark and Wilkes-Gibbs1986] H. H. Clark and D. Wilkes-Gibbs. 1986. Referring as a collaborative process. Cognition, 22:1–39.

[Gärdenfors2000] Peter Gärdenfors. 2000. Conceptual Spaces. MIT Press.

[Jäger2010] Gerhard Jäger. 2010. Natural color categories are convex sets. In Maria Aloni, Harald Bastiaanse, Tikitu de Jager, and Katrin Schulz, editors, Logic, Language and Meaning: 17th Amsterdam Colloquium, Revised Selected Papers, volume 6042 of Lecture Notes in Computer Science, pages 11–20. Springer.

[Kennedy and McNally2010] Chris Kennedy and Louise McNally. 2010. Color, context and compositionality. Synthese, 174(1):79–98.

[Kennedy2007] Christopher Kennedy. 2007. Vagueness and grammar: the semantics of relative and absolute gradable adjectives. Linguistics and Philosophy, 30(1):1–45.

[Kyburg and Morreau2000] Alice Kyburg and Michael Morreau. 2000. Fitting words: Vague words in context. Linguistics and Philosophy, 23(6):577–597.

[Ludlow2014] Peter Ludlow. 2014. Living Words: Meaning Underdetermination and the Dynamic Lexicon. Oxford University Press, Oxford.

[McMahan and Stone2014] Brian McMahan and Matthew Stone. 2014. A Bayesian approach to grounded color semantics. Manuscript, Rutgers University.

[Roy and Reiter2005] Deb Roy and Ehud Reiter. 2005. Connecting language to the world. Artificial Intelligence, 167(1–2):1–12.

[Steels and Belpaeme2005] Luc Steels and Tony Belpaeme. 2005. Coordinating perceptually grounded categories through language: A case study for colour. Behavioral and Brain Sciences, 28(4):469–529.

[Stone and Webber1998] Matthew Stone and Bonnie Webber. 1998. Textual economy through close coupling of syntax and semantics. In Proceedings of the International Natural Language Generation Workshop, pages 178–187.

[Thomason1990] Richmond H. Thomason. 1990. Accommodation, meaning and implicature. In Philip R. Cohen, Jerry Morgan, and Martha E. Pollack, editors, Intentions in Communication, pages 325–363. MIT Press, Cambridge, MA.

[van Deemter2006] Kees van Deemter. 2006. Generating referring expressions that involve gradable properties. Computational Linguistics, 32(2):195–222.

[Williamson1996] Timothy Williamson. 1996. Vagueness. Routledge, London.

[Young et al.2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

[Larsson2013] Staffan Larsson. 2013. Formal semantics for perceptual classification. Journal of Logic and Computation.

[Lassiter2009] Daniel Lassiter. 2009. Vagueness as probabilistic linguistic knowledge. In Rick Nouwen, Robert van Rooij, Uli Sauerland, and Hans-Christian Schmitz, editors, Vagueness in Communication: International Workshop, ViC 2009, Revised Selected Papers, volume 6517 of Lecture Notes in Computer Science, pages 127–150. Springer.

[Lewis1979] David Lewis. 1979. Scorekeeping in a language game. Journal of Philosophical Logic, 8(3):339–359.