Monte Carlo Simulation Generation Through Operationalization of Spatial Primitives

A Dissertation

Presented to
The Faculty of the Graduate School of Arts and Sciences
Brandeis University
Department of Computer Science
Dr. James Pustejovsky, Advisor

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

by
Nikhil Krishnaswamy
August, 2017

The signed version of this form is on file in the Graduate School of Arts and Sciences.

This dissertation, directed and approved by Nikhil Krishnaswamy’s committee, has been accepted and approved by the Graduate Faculty of Brandeis University in partial fulfillment of the requirements for the degree of:

DOCTOR OF PHILOSOPHY

Eric Chasalow, Dean of Arts and Sciences

Dissertation Committee:
Dr. James Pustejovsky, Chair
Dr. Kenneth D. Forbus, Dept. of Electrical Eng. and Comp. Sci., Northwestern University
Dr. Timothy J. Hickey, Dept. of Computer Science, Brandeis University
Dr. Marc Verhagen, Dept. of Computer Science, Brandeis University

© Copyright by

Nikhil Krishnaswamy

2017

For my father

Acknowledgments

Like a wizard, a thesis never arrives late, or early, but precisely when it means to; but it would never arrive at all without the people who helped it along the way. First and foremost, I would like to thank Prof. James Pustejovsky for taking a chance on a crazy idea and tirelessly pursuing opportunities for the topic and for me in particular, for always taking time to discuss ideas and applications any time, any place (during the week, on the weekend, with beer, without beer, on three continents). And for, when I requested to stay on at Brandeis after completing my Master’s degree, writing a letter of recommendation on my behalf, to himself. That story usually kills. I would like to thank my committee members: Dr. Marc Verhagen, for hours of stimulating discussions; Prof. Tim Hickey, for letting me poach his animation students to help me develop the simulation software; and Prof. Ken Forbus, with whom it is an honor to have the opportunity to share my research. Additionally, thank you all for your perceptive, thorough, and insightful feedback in molding the draft copy of this thesis into its final form. To my friends and family, thank you; particularly to my wife, Heather, for letting this shady roommate of an idea move in with us; to my mother and stepfather, Uma Krishnaswami and Satish Shrikhande, for their unwavering faith and support—didn’t I say not to worry? I promised I could handle it. To the student workers who contributed many enthusiastic hours developing VoxSim, thank

you; particularly to Jessica Huynh, Paul Kang, Subahu Rayamajhi, Amy Wu, Beverly Lum, and Victoria Tran. I’d probably have quit long ago without you. To the community of Unity developers, whose collective knowledge I spent hours reading online. Special thanks to Neil Mehta of LaunchPoint Games for prompt response and service helping me debug his plugin. To the faculty and staff of the Brandeis Master’s program in Computational Linguistics, particularly to Prof. Lotus Goldberg for her consistent good wishes, cheer, advice, and ongoing interest in this thesis, and to all the students I’ve had the pleasure of getting to know along the way. Clearly there’s something special going on here because I’ve done everything I can to avoid leaving. Without Dr. Paul Cohen and the Communicating with Computers project at DARPA, this research would likely never have gotten off the ground. Working on the program I have enjoyed, and continue to enjoy, many fruitful collaborations that drove, and sometimes forced, development on VoxSim and have really put the software through a stress test. In this regard, I’d like to give particular thanks to Prof. Bruce Draper and his group at Colorado State University. Additional thanks also goes to Eric Burns at Lockheed Martin, for his advice and encouragement during the early part of my Ph.D., where I learned to spot opportunities as they arose, and seize them before time ran out. This work was supported by Contract W911NF-15-C-0238 with the U.S. Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO). Approved for Public Release, Distribution Unlimited. The views expressed herein are mine and do not reflect the official policy or position of the Department of Defense or the U.S. Government. All errors and mistakes are, of course, my own. I would like to dedicate this research and its culmination, this dissertation, to the memories of those family who passed on during my time in graduate school: my grandfather, Mr. V. Krishna Swami Iyengar; my grandmother, Mrs. Hema Krishnaswamy; my stepbrother, Kedar Shrikhande;

and my father, Dr. Sumant Krishnaswamy.

Abstract

Monte Carlo Simulation Generation Through Operationalization of Spatial Primitives

A dissertation presented to the Faculty of the Graduate School of Arts and Sciences of Brandeis University, Waltham, Massachusetts

by Nikhil Krishnaswamy

Much existing work in text to scene generation focuses on generating static scenes, which leaves aside entire word classes such as motion verbs. This thesis introduces a system for generating animated visualizations of motion events by integrating dynamic semantics into a formal model of events, resulting in a simulation of an event described in natural language. Visualization, herein defined as a dynamic three-dimensional simulation and rendering that satisfies the constraints of an associated minimal model, provides a framework for evaluating the properties of spatial predicates in real-time, but requires the specification of values and parameters that can be left underspecified in the model. Thus, there remains the matter of determining what, if any, the “best” values of those parameters are. This research explores a method of using a three-dimensional simulation and visualization interface to determine prototypical values for underspecified parameters of motion predicates, built on a game engine-based platform that allows the development of semantically-grounded reasoning components in areas in the intersection of theoretical reasoning and AI.

Contents

Abstract

1 Introduction
  1.1 Background
  1.2 Information-Theoretic Foundations
  1.3 Linguistic Underspecification in Motion Events
  1.4 Related Prior Work

2 Framework
  2.1 VoxML: Visual Object Concept Modeling Language
  2.2 VoxSim
  2.3 Spatial Reasoning
  2.4 Object Model
  2.5 Action Model
  2.6 Event Model
  2.7 Event Composition
  2.8 Specification Methods

3 Methodology and Experimentation
  3.1 Preprocessing
  3.2 Operationalization
  3.3 Monte Carlo Simulation
  3.4 Evaluation

4 Results and Discussion
  4.1 Human Evaluation Task 1 Results
  4.2 Human Evaluation Task 2 Results
  4.3 Automatic Evaluation Task Results
  4.4 Mechanical Turk Worker Response
  4.5 Summary


5 Future Directions
  5.1 Extensions to Methodology
  5.2 VoxML and Robotics
  5.3 Information-Theoretic Implications

A VoxML Structures
  A.1 Objects
  A.2 Programs
  A.3 Relations
  A.4 Functions

B Underspecifications

C [[TURN]]: Complete Operationalization

D Sentence Test Set

E Data Tables
  E.1 DNN with Unweighted Features
  E.2 DNN with Weighted Features
  E.3 DNN with Weighted Discrete Features
  E.4 DNN with Feature Weights Only
  E.5 Combined Linear-DNN with Unweighted Features
  E.6 Combined Linear-DNN with Weighted Features
  E.7 Combined Linear-DNN with Weighted Discrete Features
  E.8 Combined Linear-DNN with Feature Weights Only

F Publication History

List of Tables

2.1 Example voxeme properties
2.2 VoxML OBJECT attributes
2.3 VoxML OBJECT HEAD types
2.4 VoxML PROGRAM attributes
2.5 VoxML PROGRAM HEAD types
2.6 Example VoxML ATTRIBUTE scalar types

3.1 Test set of verbal programs and objects
3.2 Program test set with underspecified parameters
3.3 Number of videos captured per motion predicate

4.1 Acceptability judgments and statistical metrics for “move x” visualizations, conditioned on respecification predicate
4.2 Acceptability judgments and statistical metrics for “turn x” visualizations, conditioned on respecification predicate
4.3 Acceptability judgments and statistical metrics for unrespecified “turn x” visualizations, conditioned on rotation angle
4.4 Acceptability judgments and statistical metrics for “roll x” visualizations, conditioned on path length
4.5 Acceptability judgments and statistical metrics for “slide x” visualizations, conditioned on translocation speed
4.6 Acceptability judgments and statistical metrics for “spin x” visualizations respecified as “roll x,” conditioned on path length
4.7 Acceptability judgments for unrespecified “spin x” visualizations, conditioned on rotation axis
4.8 Acceptability judgments and statistical metrics for “lift x” visualizations, conditioned on translocation speed and distance traversed
4.9 Acceptability judgments and statistical metrics for “put x touching y” visualizations, conditioned on relations between x and y at event start and completion
4.10 Acceptability judgments and statistical metrics for “put x touching y” visualizations, conditioned on x movement relative to y


4.11 Acceptability judgments and statistical metrics for “put x near y” visualizations, conditioned on distance between x and y at event start and completion
4.12 Acceptability judgments and statistical metrics for “put x near y” visualizations, conditioned on start and end distance intervals between x and y
4.13 Acceptability judgments and statistical metrics for “put x near y” visualizations, conditioned on distance between x and y and POV-relative orientation at event completion
4.14 Acceptability judgments and statistical metrics for “lean x” visualizations, conditioned on rotation angle
4.15 Acceptability judgments and statistical metrics for “flip x” visualizations, conditioned on rotation axis and symmetry axis
4.16 Acceptability judgments for “close x” visualizations, conditioned on motion manner
4.17 Acceptability judgments for “open x” visualizations, conditioned on motion manner
4.18 Probabilities and statistical metrics for selection of “move” predicate for “move x” event, conditioned on respecification predicate
4.19 Probabilities and statistical metrics for selection of “turn” predicate for “turn x” visualizations, conditioned on respecification predicate
4.20 Probabilities and statistical metrics for selection of “turn” predicate for unrespecified “turn x” visualizations, conditioned on rotation angle
4.21 Probabilities and statistical metrics for selection of “roll” predicate for “roll x” visualizations, conditioned on path length
4.22 Top 3 most likely predicate choices for “roll x” visualizations, conditioned on path length
4.23 Probabilities and statistical metrics for selection of “slide” predicate for “slide x” visualizations, conditioned on path length and translocation speed
4.24 Probabilities and statistical metrics for selection of “spin” predicate for “spin x” visualizations respecified as “roll x,” conditioned on path length
4.25 Probabilities for selection of “spin” predicate for unrespecified “spin x” visualizations, conditioned on rotation axis
4.26 Probabilities and statistical metrics for selection of “lift” predicate for “lift x” visualizations, conditioned on translocation speed and distance traversed
4.27 Probabilities and statistical metrics for selection of “put on/in” predicate for “put x on/in y” visualizations, conditioned on translocation speed
4.28 Probabilities and statistical metrics for selection of “put touching” predicate for “put x touching y” visualizations, conditioned on relative orientation between x and y at event completion
4.29 Probabilities and statistical metrics for selection of “put near” predicate for “put x near y” visualizations, conditioned on distance traveled
4.30 Probabilities and statistical metrics for selection of “lean on” predicate for “lean x on y” visualizations, conditioned on rotation angle


4.31 Probabilities and statistical metrics for selection of “lean against” predicate for “lean x against y” visualizations, conditioned on rotation angle
4.32 Probabilities and statistical metrics for selection of “flip on edge” predicate for “flip x on edge” visualizations, conditioned on rotation axis and symmetry axis
4.33 Probabilities and statistical metrics for selection of “flip at center” predicate for “flip x at center” visualizations, conditioned on rotation axis and symmetry axis
4.34 Top 3 most likely predicate choices for “flip x {on edge, at center}” visualizations, conditioned on rotation axis and symmetry axis
4.35 Probabilities for selection of “close” predicate for “close x” visualizations, conditioned on motion manner
4.36 Top 3 most likely predicate choices for “close x” visualizations, conditioned on motion manner
4.37 Probabilities for selection of “open” predicate for “open x” visualizations, conditioned on motion manner
4.38 Top 3 most likely predicate choices for “open x” visualizations, conditioned on motion manner
4.39 Accuracy tables for baseline automatic evaluation

B.1 Underspecified parameters and satisfaction conditions

E.1 Accuracy tables for “vanilla” DNN automatic evaluation
E.2 Accuracy tables for DNN automatic evaluation with weighted features
E.3 Accuracy tables for DNN automatic evaluation with weighted discrete features
E.4 Accuracy tables for DNN automatic evaluation with feature weights alone
E.5 Accuracy tables for linear-DNN automatic evaluation
E.6 Accuracy tables for linear-DNN automatic evaluation with weighted features
E.7 Accuracy tables for linear-DNN automatic evaluation with weighted discrete features
E.8 Accuracy tables for linear-DNN automatic evaluation with feature weights alone

List of Figures

2.1 3D model of a bowl
2.2 Bowl with box collider shown in green
2.3 [[PLATE]], an OBJECT
2.4 [[PUT]], a PROGRAM
2.5 [[SMALL]], an ATTRIBUTE
2.6 [[TOUCHING]], a RELATION
2.7 [[TOP]], a FUNCTION
2.8 VoxSim architecture schematic
2.9 Dependency parse for Put the apple on the plate and transformation to predicate-logic form
2.10 VoxML structure for [[CUP]] with associated geometry
2.11 VoxML structures for [[ON]] and [[IN]]
2.12 Execution of “put the spoon in the mug”
2.13 Orientation-dependent visualizations of “put the lid on the cup”
2.14 Context-dependent visualizations of “put the paper on the TV”
2.15 Object model of lifting and dropping an object
2.16 Action model of lifting and dropping an object
2.17 Event model of lifting and dropping an object
2.18 Execution of “put the yellow block on the red block” using embodied agent
2.19 End state of “lean the cup on the block”
2.20 Unsatisfied vs. satisfied “lean”
2.21 VoxML structure for [[LEAN]] (compositional)
2.22 Visualization of “switch the blocks”
2.23 VoxML structure for [[SWITCH]]
2.24 VoxML structure for [[SWITCH]] (unconstrained)
2.25 Abbreviated VoxML type structure for [[ROLL]]

3.1 VoxML and DITL for put(y,z)
3.2 VoxML and DITL for slide(y)
3.3 VoxML and DITL for roll(y)
3.4 VoxML and DITL for turn(y)


3.5 VoxML and DITL for move(y)
3.6 C# operationalization of [[TURN]] (abridged)
3.7 Test environment with all objects shown
3.8 Snapshot from video capture in progress
3.9 Automatic capture process diagram
3.10 “Densified” feature vector for “open the book”
3.11 Sparse feature vector for “move the grape”
3.12 Sparse feature vector for “put the block in the plate”
3.13 HET1 task interface
3.14 HET2 task interface

4.1 Baseline accuracy on restricted choice set
4.2 Baseline accuracy on unrestricted choice set
4.3 “Vanilla” DNN accuracy on restricted choice set
4.4 “Vanilla” DNN accuracy on unrestricted choice set
4.5 DNN with weighted features accuracy on restricted choice set
4.6 DNN with weighted features accuracy on unrestricted choice set
4.7 DNN with weighted discrete features accuracy on restricted choice set
4.8 DNN with weighted discrete features accuracy on unrestricted choice set
4.9 DNN with feature weights only accuracy on restricted choice set
4.10 DNN with feature weights only accuracy on unrestricted choice set
4.11 Linear-DNN accuracy on restricted choice set
4.12 Linear-DNN accuracy on unrestricted choice set
4.13 Linear-DNN with weighted features accuracy on restricted choice set
4.14 Linear-DNN with weighted features accuracy on unrestricted choice set
4.15 Linear-DNN with weighted discrete features accuracy on restricted choice set
4.16 Linear-DNN with weighted discrete features accuracy on unrestricted choice set
4.17 Linear-DNN with feature weights only accuracy on restricted choice set
4.18 Linear-DNN with feature weights only accuracy on unrestricted choice set
4.19 Word clouds depicting worker response to HET1
4.20 Word clouds depicting worker response to HET2

C.1 C# operationalization of [[TURN]] (unabridged)

Chapter 1

Introduction

The expressiveness of natural language is difficult to translate into visuals, and much existing work in text to scene generation has focused on creating static images, such as WordsEye (Coyne and Sproat, 2001), LEONARD (Siskind, 2001), and work by Chang et al. (2015). This research describes an approach centered on motion verbs that uses a rich formal model of events and maps from a natural language expression, through Dynamic Interval Temporal Logic (Pustejovsky and Moszkowicz, 2011), into a 3D animated visualization. Building on a method for modeling natural language predicates in a 3D environment (Pustejovsky and Krishnaswamy, 2014), a modeling language to encode semantic knowledge about entities described in natural language in a composable way (Pustejovsky and Krishnaswamy, 2016a), and a spatial reasoner to generate visual simulations involving novel objects and events, this thesis presents a system, VoxSim, that uses the real-world semantics of objects and events to generate animated scenes in real time, without the need for a prohibitively complex animation interface (Krishnaswamy and Pustejovsky, 2016a,b). Semantic interpretation requires access to both knowledge about words and how they compose. As the linguistic phenomena associated with lexical semantics have become better understood, several assumptions have emerged across most models of word meaning. These include the following:

• Lexical meaning can be analyzed componentially, either through predicative primitives or a system of types;

• The selectional properties of predicators can be explained in terms of these components;

• An understanding of event semantics and the different roles of event participants seems crucial for modeling linguistic utterances.


Lexical semantic analysis, in both theoretical and computational linguistics, typically involves identifying features in a corpus that differentiate the data points in meaningful ways.¹ Combining these strategies, we might, for instance, posit a theoretical constraint that we hope to justify through behavioral distinctions in the data. An example of this is the theoretical claim that motion verbs can be meaningfully divided into two classes: manner- and path-oriented predicates (Jackendoff, 1983; Talmy, 1985, 2000). These constructions can be viewed as encoding two aspects of meaning: how the movement is happening and where it is happening. The former strategy is illustrated in (a) and the latter in (b) (where m indicates a manner verb, and p indicates a path verb).

(a) The ball rolled_m.

(b) The ball crossed_p the room.

With both verb types, adjunction can make reference to the missing aspect of motion, by introducing a path (as in (c)) or the manner of movement (as in (d)).

(c) The ball rolled_m across the room.

(d) The ball crossed_p the room rolling.

Differences in syntactic distribution and grammatical behavior in large datasets, in fact, correlate fairly closely with the theoretical claims made by linguists using small introspective datasets (Harris, 1954; Durand, 2009). The path-manner distinction is a case where data-derived classifications should correlate nicely with theoretically-inspired predictions. However, it is often the case that lexical semantic distinctions are formal stipulations in a linguistic model based on unclear correlations between predicted classes and underlying corpus data, leaving the possibility that these class groupings are either arbitrary or derived from inappropriate data. As an example, the manner of movement class from Levin (1993) gives drive, walk, run, crawl, fly, swim, drag, slide, hop, roll as manner of motion verbs, although it is unclear what underlying data is used to make this grouping. Assuming the two-way distinction between path and manner predication of motion mentioned above, these verbs do, in fact, tend to pattern according to the latter class in the corpus. Given that they are all manner of motion verbs, however, any data-derived distinctions that emerge within this class will have to be made in terms of additional syntactic or semantic dimensions.

While it is most likely possible to differentiate, for example, the verbs slide from roll, or walk from hop in a corpus, given enough data, it is important to realize that conceptual and theoretical modeling is often necessary to reveal the factors that semantically distinguish such linguistic expressions in the first place. This problem can be approached with the use of minimal model generation. As Blackburn and Bos (2008) point out, theorem proving (essentially type satisfaction of a verb in one class as opposed to another) provides a “negative handle” on the problem of determining consistency and informativeness for an utterance, while model building provides a “positive handle” on both. In this research, simulation construction provides a positive handle on whether two manner of motion processes are distinguished in the underlying model. Further, the simulation must specify how they are distinguished, the analogue to informativeness. Factors in these event distinctions can be either temporal, such as rhythmic distinctions in run vs. walk, or spatial, such as limb motion in hop, for instance.

¹ Meaningful in terms of prior theoretical assumptions or observably differentiated behaviors.

1.1 Background

Notions of simulation figure in some work in the cognitive linguistics literature of the past decade (Bergen, 2012; Lakoff, 2009), but have been largely unaddressed by the computational linguistic community, due in part to arguments against the efficacy of simulation in explaining natural language understanding (Davis and Marcus, 2016), particularly regarding linguistic phenomena involving continuous ranges or underspecified values. Nonetheless, existing computational approaches to semantic processing, when taken together, provide a framework on which to implement a simulator as an extension of a model builder. This thesis endeavors to demonstrate that simulation, when modeled within a dynamic qualitative spatial and temporal semantics, can provide a robust environment for examining the interpretation of linguistic behaviors, including those described qualitatively. The result is a research instrument within which to test these interpretations, one that can be used to inform qualitative and quantitative models of motion events.

As mentioned, dynamic interpretations of event structures divide movement verbs into “path” and “manner of motion” verbs. In both cases, the location of the moving argument is reassigned at each step or frame, but where path verbs conduct the reassignment relative to a specified location, manner verbs do not, and the location is specified through an optional prepositional adjunct. “The spoon falls” and “The spoon falls into the cup” are both equally grammatical, but result in different “mental instantiations” of the event described.

In order to render a visualization of these instantiations, or “simulations,” a computational system must be able to infer some path or manner information that is missing from the predicate, from the objects involved or from their composition with the event.² This requires, at some level, real-world knowledge about the distinctions between minimal pairs of motion predicates, as well as how the visual instantiations of lexical objects interact with their environment to enable reasoning. How these are modeled will be addressed in Section 2.3.

Visual instantiations of lexemes require an encoding of their situational context, or a habitat (Pustejovsky, 2013; McDonald and Pustejovsky, 2014), as well as afforded behaviors that the object can participate in, that are either Gibsonian or telic in nature (Gibson, 1977, 1979; Pustejovsky, 1995). For instance, a cup may afford containing another object, or being drunk from. Many event descriptions presuppose such conditions that rarely appear in linguistic data, but a visualization lacking them will make little sense to the observer. This linguistic “dark matter,” conspicuous by its absence, is thus easily exposable through simulation.

² This assumes that there is no prior database of missing information accessible, such as one compiled from previously-enacted events or “remembered experiences” as is discussed by, e.g., Bod (1998). The semantic encoding of such information in a simulation context is discussed in Section 2.1.
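To make the reassignment contrast concrete, here is a minimal sketch of how a manner verb and a path verb might update an object's location on each frame in a Unity-style loop; the class and member names are my own illustration, not code from VoxSim.

    // Illustrative sketch only: per-frame location reassignment for a manner verb
    // versus a path verb. Names are hypothetical, not taken from the VoxSim codebase.
    using UnityEngine;

    public class MotionSketch : MonoBehaviour {
        public Transform movingObject;              // e.g. the ball
        public Vector3 heading = Vector3.right;     // manner verbs move along some heading
        public Transform goal;                      // path verbs move relative to a specified location
        public float speed = 1.0f;

        // Manner verb (e.g. "roll"): the location is reassigned along a heading; no goal is consulted.
        void UpdateManner() {
            movingObject.position += heading.normalized * speed * Time.deltaTime;
            // (rolling would additionally rotate the object about the axis orthogonal to the heading)
        }

        // Path verb (e.g. "cross"): the location is reassigned relative to the specified location.
        void UpdatePath() {
            movingObject.position = Vector3.MoveTowards(
                movingObject.position, goal.position, speed * Time.deltaTime);
        }
    }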

1.2 Information-Theoretic Foundations

Questions on the meaning of sentences or concepts date to the beginning of philosophy, with philosophers from Plato to Kant offering their own takes on what constitutes a distinct concept. Gottlob Frege in Über Sinn und Bedeutung (1892) offers the distinction between “sense” (Sinn) and “reference” (Bedeutung), where the sense is the thought expressed by a sentence and the reference is the truth value in some world (w). Where true synonyms would be indistinguishable in w, most subtle differences that arise out of differing lexical choices (e.g., event class, Aktionsart) do in fact create minimal contrasts that differentiate meaning.

Rudolf Carnap (1947) provides an interpretation of Frege, based on Tarski’s technique for model-theoretic semantics (1936), that proposes a distinction of intensional meaning: the individual concept; and extensional meaning: what makes it true in some model M. Thus it can be argued that Frege’s reference (Bedeutung) and Carnap’s extensional meaning can be unified in the set of parameter values that make an action, property, or proposition true under M for the thought being expressed. That is, for a sentence describing an event (e.g., “the ball rolls”), there exists a set of ranges for parameters (speed, rotation, etc.) that make that sentence true under M and a contrasting set that make it false. Until these parameters have distinct values in M, the truth of the sentence under M is not possible to ascertain. An information theoretician like Shannon (1948) might say that the entropy of a predicate in a minimal model may, depending on the predicate, exist in a non-minimal state until values are assigned to the required parameters.
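One compact way to write down this reading of extensional meaning, using my own notation rather than a formula from the text, is as the set of parameter assignments under which the described event holds in the model:

    \[
    [\![ s ]\!]^{M} \;=\; \{\, \theta \in \Theta \;\mid\; M \models p_{s[\theta]} \,\}
    \]

where Θ is the space of assignments to the parameters that s leaves open (speed, rotation axis, path length, and so on); the Shannon-style remark above then corresponds to how much of Θ remains undetermined.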

1.3 Linguistic Underspecification in Motion Events

As minimal models allow components to be underspecified, it is permissible to simply model that “the ball rolled” without providing information such as direction, speed, size of the ball, friction between the ball and the supporting surface, etc.; this information can be specified, but the model is still considered complete without it. When a ball rolls in the real world, these model-unspecified components all have values assigned to them, even if those values are not specifically measured. Following Carnap in his interpretation of Frege, the meaning of a sentence can be said to be determined by testing what would make it true (Soames, 2015). Therefore, given a visualization of an event generated from a known input sentence, and a pair (or set) of sentences potentially describing what appears in the visualization (and blind to the original input), the appropriateness of the visualization to a description can be assessed by a pairwise similarity judgment (Rumshisky et al., 2012) (i.e., asking an annotator to judge a video showing a ball rolling as matching the sentence “The ball rolls” or the sentence “The ball slides”). If a value assigned in the simulation to an underspecified parameter results in a visualization that is judged to not match the input sentence s, that visualization cannot be said to represent a model in which the proposition p denoted by s is true. Thus, there may be a set of values [a] for a parameter underspecified in s for which the resulting visualization represents a proposition that, in a Kripke semantics (Kripke, 1965) with a model M, is M ⊨ p_s[a], and another set of values [b] which result in a proposition that is actually M ⊭ p_s[b]. The task is then to separate the two sets, and trying to solve this problem computationally entails two further tasks:

• Building a computationally coherent model of a world that can be evaluated from an embodied human perspective;


• Determining values for all salient components that create a simulation that satisfies a human judge’s notion of a given event.

Additionally, event simulations where values for the aforementioned components overlap significantly suggest the existence of value ranges that define a prototypical notion of the event, à la Rosch (1973, 1983).
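Computationally, separating the two sets amounts to bookkeeping over sampled parameter values and the judgments they receive. The following is a hypothetical helper for that bookkeeping, not code from VoxSim; overlap among the satisfying sets across many simulations is what would suggest a prototypical range.

    // Sketch (invented helper): partition sampled values of one underspecified parameter
    // into the set judged to satisfy the input sentence and the set judged not to.
    using System.Collections.Generic;

    public class ParameterPartition {
        // e.g. sampled rotation angles for "turn x", keyed by annotator judgment
        public List<float> Satisfying = new List<float>();     // the set [a]: M ⊨ p_s[a]
        public List<float> NonSatisfying = new List<float>();  // the set [b]: M ⊭ p_s[b]

        public void Add(float sampledValue, bool judgedToMatch) {
            if (judgedToMatch) Satisfying.Add(sampledValue);
            else NonSatisfying.Add(sampledValue);
        }
    }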

1.4 Related Prior Work

This work is related to frameworks and implementations for object placement and orientation in static scenes (Coyne and Sproat, 2001; Siskind, 2001; Chang et al., 2015). The focus on motion verbs introduced in early prototypes of this simulation work and related studies (Pustejovsky and Krishnaswamy, 2014; Pustejovsky, 2013) led to two additional lines of research: an explicit encoding for how an object is itself situated relative to its environment; and an operational characterization of how an object changes its location or how an agent acts on an object over time. Pustejovsky and others have developed the former into a semantic notion of situational context, called a habitat (Pustejovsky, 2013; McDonald and Pustejovsky, 2014), while the latter is addressed by dynamic interpretations of event structure, including Dynamic Interval Temporal Logic, or DITL (Pustejovsky and Moszkowicz, 2011; Pustejovsky, 2013). I would of course also be remiss not to mention the canonical example of a language parser hooked up to a “blocks world” environment, SHRDLU (Winograd, 1971), which served me, as it served so many others, as a model of both project aim and design mechanics.

The interdisciplinary nature of this work is influenced by cognitive linguistic work regarding the role of “embodiment” in interpreting mental simulations (Bergen, 2012; Narayanan, 1997; Feldman and Narayanan, 2004; Feldman, 2006), interpreting the spatial aspects of cognitive reasoning in computational and algebraic frameworks (Randell et al., 1992; Bhatt and Loke, 2008; Mark and Egenhofer, 1995; Kurata and Egenhofer, 2007; Albath et al., 2010), along with temporal analogues (Allen, 1983). The notion of a verb enacted as a program over its arguments (Naumann, 1999) is foundational to the implementation, resulting in testable satisfaction conditions that are calculated compositionally with the affected objects, in terms of the aforementioned qualitative spatial reasoning approaches.

In implementing a platform to allow experimenting with the underspecification question mentioned in Section 1.3, this work has leveraged the Unity game engine by Unity Technologies (Goldstone, 2009). Game engines have the advantage of providing relatively user-friendly tools for developers to implement a variety of subsystems “out of the box,” from graphics and rendering to UI to physics. This has allowed work to proceed beyond the scope of standard game engine components, into areas in the intersection between theoretical reasoning and AI and real-time game or “game-like” environments, in the vein of work presented by Forbus et al. (2002), Dill (2011), and Ma and McKevitt (2006), and the mapping of spatial constraints to animation by Bindiganavale and Badler (1998).

Chapter 2

Framework

The framework for this research rests on the definition of three terms:

1. Minimal model — the universe containing a set of arguments and a set of predicates, interpretations of those arguments, and subsets each defining the interpretation of a predicate (Gelfond and Lifschitz, 1988). For this research, each predicate is assumed to be a logic program and each argument is assumed to be a constant.

2. Simulation — the minimal model with values assigned to all unspecified variables. A minimal model can therefore be considered an underspecified simulation according to this definition, to which variable values can be assigned arbitrarily or by some rule or heuristic. This thesis is primarily concerned with using a visual simulation system to determine a set of “best practices” for assigning these values.

3. Visualization — the process by which each linguistic/semantic object in the simulation is linked to a “visual object concept” which is enacted within the virtual world with the variable values assigned by the simulation evaluated and reassigned at every frame according to the program encoded by the predicate in question. The final step is rendering, in which the computer draws the “finished product” at the frame rate specified by the visualization system.

In order to take the simulation from a fleshed-out model to a rendered visualization, requirements include, but are not limited to, the following components:

1. A minimal embedding space (MES) for the simulation must be determined. This is the 3D region within which the state is configured or the event unfolds;


2. Object-based attributes for participants in a situation or event need to be specified; e.g., orientation, relative size, default position or pose, etc.;

3. An epistemic condition on the object and event rendering, imposing an implicit point of view (POV);

4. Agent-dependent embodiment; this determines the relative scalar factors of an agent and its event participants to their surroundings, as the entity engages in the environment.

In order to construct a robust simulation from linguistic input, an event and its participants must be embedded within an appropriate minimal embedding space. This must sufficiently enclose the event localization, while optionally including room enough for a frame of reference visualization of the event (the viewer’s perspective). The above list enumerates the need for semantic-adjacent or “epi-semantic” information in simulation generation. This is the type of information that influences behavior, interpretation, and entailed consequences of events, but is not directly involved in representing the predicative force of a particular lexeme, à la qualia structure (Pustejovsky, 1995). The modeling language VoxML (Visual Object Concept Modeling Language) (Pustejovsky and Krishnaswamy, 2016a) forms the scaffold used to link lexemes to their visual instantiations, termed the “visual object concept” or voxeme. In parallel to a lexicon, a collection of voxemes is termed a voxicon. There is no requirement on a voxicon to have a one-to-one correspondence between its voxemes and the lexemes in the associated lexicon, which often results in a many-to-many correspondence. That is, the lexeme plate may be visualized as a [[SQUARE PLATE]], a [[ROUND PLATE]]¹, or other voxemes, and those voxemes in turn may be linked to other lexemes such as dish or saucer. Each voxeme is linked to an object geometry (if a noun—OBJECT in VoxML), a DITL program (if a verb or VoxML PROGRAM), an attribute set (VoxML ATTRIBUTEs), or a transformation algorithm (VoxML RELATIONs or FUNCTIONs). VoxML is used to specify the “epi-semantic” information beyond that which can be directly inferred from the geometry, DITL, or attribute properties. VoxSim does not rely on manually-specified categories of objects with identifying language, and instead procedurally composes the properties of voxemes in parallel with the lexemes they are linked with. These properties may be specified in the voxeme’s VoxML markup or calculated from properties natively accessible by the Unity framework. A non-exhaustive list of voxeme properties and their accessibility is shown below in Table 2.1.

¹ Note on notation: discussion of voxemes in prose will be denoted in the style [[VOXEME]] and should be taken to refer to a visualization of the bracketed concept. Where relevant, images of actual visualizations will be provided as well.

VoxML-specified    Concavity; Symmetry; Semantic head; Event typing; NL predicate
Unity-calculated   Physical object size; Location/orientation; Dimensionality; Non-nominal scalar value; Event satisfaction condition

Table 2.1: Example voxeme properties

VoxML augments engine-accessible data structures such as geometries. As an illustrative example, let us consider a bowl. Common linguistic knowledge links the lexeme “bowl,” when referring to a physical object (PHYSOBJ according to Pustejovsky’s Generative Lexicon (GL) (1995)), with an object that, among other properties, typically has some concavity in its physical structure. It is a simple matter for a 3D artist to create a model of such an object, as shown below:

Figure 2.1: 3D model of a bowl


Unity (or any game engine)² has native access to object parameters such as size, position, and orientation that allows it to calculate certain additional information about the object, of the type enumerated in the Unity-calculated row of Table 2.1. Other properties of the object represented by the 3D model, such as those in the VoxML-specified row of Table 2.1, are difficult to calculate from the object geometry alone without a complex geometrical analysis algorithm, or impossible entirely without sophisticated AI:

• Whether or not the object is concave, flat, or convex;

• Symmetry of the object about a certain axis or around a certain plane;

• What, if any, component of the entity (object, event, etc.) is most semantically salient;

• Multiple predicates that may denote the entity represented by the geometry.

Many of these questions border on computer vision or discourse modeling problems, well outside the scope of this work. Thus, the software cannot procedurally add these parameters to its knowledge base at runtime. It is not computationally feasible, for example, to calculate collision volumes for every object in the voxicon that always closely map to the geometries. Instead, due to the time and resource constraints of creating a platform that is quickly deployable on the average user’s (and developer’s) hardware, automatically computed collision boxes must be used, such as that shown below.

Figure 2.2: Bowl with box collider shown in green

² Subsequent references to Unity functionality should be taken to refer to capabilities provided by most game engines in some shape or form. Unity is simply the platform of choice that has been used to implement the VoxSim software.


With this information alone, the simulator has no way of knowing that there are points on the exterior of the bowl’s geometry (including the bottom of what we would call the bowl’s “interior,” the approximate location of which is indicated in Figure 2.2 with a white line) that are lower on the Y-axis than the collider volume at that same X- and Z-value. Human beings as language interpreters know this and can describe the difference as above, but the computer does not without a world-knowledge data bank, and furthermore cannot do anything with this information without the resources to access all relevant parameters. VoxML provides a compact way of representing a minimally-required set of parameters to enable compositional reasoning.
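As a sketch of the division in Table 2.1, the engine-calculated properties can be read straight off Unity's data structures, while a property like concavity has to come from the voxeme's markup; the class below is illustrative only and stands in for however VoxSim actually stores VoxML attributes.

    using UnityEngine;

    public class VoxemePropertiesSketch : MonoBehaviour {
        // VoxML-specified: cannot be recovered from the mesh or its box collider alone,
        // so it is read from the voxeme's markup (represented here as a plain field).
        public string concavity = "concave";    // e.g. for a bowl-like object

        // Unity-calculated: available directly from engine data structures at runtime.
        public Vector3 WorldSize()      { return GetComponent<Renderer>().bounds.size; }
        public Vector3 WorldPosition()  { return transform.position; }
        public Quaternion Orientation() { return transform.rotation; }
    }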

2.1 VoxML: Visual Object Concept Modeling Language

The VoxML specification is laid out in greater detail in Pustejovsky and Krishnaswamy (2016a), but an abbreviated overview follows here. Following GL, VoxML entities are given a feature structure enumerating:

(a) Atomic Structure (GL FORMAL): objects expressed as basic nominal types.

(b) Subatomic Structure (GL CONST): mereotopological structure of objects.

(c) Event Structure (GL TELIC and GL AGENTIVE): origin and functions associated with an object.

(d) Macro Object Structure: how objects fit together in space and through coordinated activities.

VoxML covers five entity types: OBJECT, PROGRAM, ATTRIBUTE, RELATION, and FUNCTION, which are closely correlated to nouns, verbs, adjectives and adverbs, adpositions, and functional constructions (e.g. “top of x”), respectively. These entity types represent semantic knowledge of the associated real-world concepts as represented as three-dimensional models, and of events and attributes related to and enacted over these objects. VoxML is intended to overcome the limitations of existing 3D visual markup languages by allowing for the encoding of a wealth of semantic knowledge that can be exploited by a variety of systems and platforms, leading to multimodal simulations of real-world scenarios using conceptual objects that represent real-world semantic qualities. It shares many of the goals pursued in Dobnik et al. (2013) and Dobnik and Cooper (2013), for specifying a rigidly-defined type system for spatial representations associated with linguistic expressions, and is extensible for new needs and additions to the semantic model.


2.1.1 Objects

The VoxML OBJECT is used for modeling nouns. The set of OBJECT attributes is shown below:

LEX          OBJECT’s lexical information
TYPE         OBJECT’s geometrical typing
HABITAT      OBJECT’s habitat for actions
AFFORD STR   OBJECT’s affordance structure
EMBODIMENT   OBJECT’s agent-relative embodiment

Table 2.2: VoxML OBJECT attributes

The LEX attribute contains the subcomponents PRED, the lexical predicate denoting the object, and TYPE, the object’s type according to Generative Lexicon. The TYPE attribute (distinct from LEX’s TYPE subcomponent) contains information to define the object geometry in terms of primitives. HEAD is a primitive 3D shape that roughly describes the object’s form (e.g. calling an apple an “ellipsoid”), or the form of the object’s most semantically salient subpart. For completeness, possible HEAD values are grounded in mathematical formalisms defining families of polyhedra (Grünbaum, 2003), and, for annotators’ ease, common primitives found across the “corpus” of 3D artwork and 3D modeling software³ (Giambruno, 2002). Using common 3D modeling primitives as convenience definitions provides some built-in redundancy to VoxML, as is found in NL description of structural forms. For example, a rectangular prism is the same as a parallelepiped that has at least two defined planes of reflectional symmetry, meaning that an object whose HEAD is rectangular prism could be defined two ways, an association which a reasoner can unify axiomatically. Possible values for HEAD are given below:

HEAD   prismatoid, pyramid, wedge, parallelepiped, cupola, frustum, cylindroid, ellipsoid, hemiellipsoid, bipyramid, rectangular prism, toroid, sheet

Table 2.3: VoxML OBJECT HEAD types

These values are not intended to reflect the exact structure of a particular geometry, but rather a cognitive approximation of its shape, as is used in some image-recognition work (e.g. Goebel and Vincze (2007)). Object subparts are enumerated in COMPONENTS. CONCAVITY can be concave, flat, or convex, and refers to any concavity that deforms the HEAD shape. ROTATSYM (rotational symmetry) defines any of the world’s three orthogonal axes around which the object’s geometry may be rotated for an interval of less than 360 degrees and retain identical form as the unrotated geometry. REFLECTSYM (reflectional symmetry) is defined similarly—if an object may be bisected by a plane defined by two of the world’s three orthogonal axes and then reflected across that plane to obtain the same geometric form as the original object, it is considered to have reflectional symmetry across that plane.

The values of ROTATSYM and REFLECTSYM are intended to be world-relative, because objects are always situated in a minimal embedding space defined by Cartesian coordinates, and the axes/planes of symmetry are those denoted in the world, not of the object. Thus, a tetrahedron—which in isolation has seven axes of rotational symmetry, no two of which are orthogonal—when placed in the MES such that it cognitively satisfies all “real-world” constraints, must be situated with one base downward (as a tetrahedron placed any other way will fall over). This reduces the salient in-world axes of rotational symmetry to one: the world’s Y-axis. When the orientation of the object is ambiguous relative to the world, the world should be assumed to provide the grounding value.

The HABITAT element defines habitats, which per Pustejovsky (2013) and McDonald and Pustejovsky (2014) are conditioning environments in which an object exists that enable or disable certain actions being taken with the object. These habitats may be INTRINSIC to the object, regardless of what action it participates in, such as intrinsic orientations or surfaces. An example would be a computer monitor with an intrinsic front, and a geometry in which that intrinsic front faces along the positive Z-axis. We adopted the terminology of “alignment” of an object dimension, d ∈ {X, Y, Z}, with the dimension, d′, of its embedding space, E_d′, as follows: align(d, E_d′). EXTRINSIC habitats must be satisfied for particular actions to take place, such as a bottle that must be placed on its side in order to be rolled across a surface. In VoxML encoding, each habitat is given a label and a numerical index for future reference.

AFFORD STR describes the set of specific actions that may be taken with objects, subject to their current conditioning habitats. The habitats supply the requisite conditions, and the affordance structure encodes the actions that may be taken under those conditions, and the states that result from those actions being taken. There are low-level GIBSONIAN affordances, which involve manipulation or maneuver-based actions (grasping, holding, lifting, touching, etc.); there are also TELIC affordances (Pustejovsky, 1995), which link directly to what goal-directed activity can be accomplished, by means of the GIBSONIAN affordances. EMBODIMENT qualitatively describes the SCALE of the object compared to an in-world agent (typically assumed to be a human) as well as whether the object is typically MOVABLE by that agent.

³ Mathematically curved surfaces such as spheres and cylinders are in fact represented, computed, and rendered as polyhedra by most modern 3D software.

    plate
      LEX = [ PRED = plate
              TYPE = physobj ]
      TYPE = [ HEAD = sheet[1]
               COMPONENTS = surface[1], base
               CONCAVITY = concave
               ROTATSYM = {Y}
               REFLECTSYM = {XY, YZ} ]
      HABITAT = [ INTR = [2][ UP = align(Y, E_Y)
                              TOP = top(+Y) ]
                  EXTR = ... ]
      AFFORD STR = [ A1 = H[2] → [put(x, y)]contain(y, x)
                     A2 = H[2] → [grasp(x, [1])] ]
      EMBODIMENT = [ SCALE = < agent
                     MOVABLE = true ]

Figure 2.3: [[PLATE]], an OBJECT

A complete OBJECT voxeme is linked to a geometry, but unlinked markup can be used to specify typical or prototypical visualization parameters.

Numerals in brackets denote references to and reentrancies from other parameters of the voxeme. In the bottle example of an EXTRINSIC habitat discussed above, the habitat of a bottle on its side might be denoted as [3]UP = align(Y, E_⊥Y) (that is, the bottle “upward” vector is created by aligning its object-space Y axis with a vector perpendicular to the world-space Y axis). An associated affordance might be H[3] → [roll(x, [1])], where that habitat (habitat-3) affords the rolling of the object ([1]) by some agent x. In Figure 2.3, [1] denotes both the semantic HEAD and the “surface” subcomponent, indicating that they refer to the same part of the geometry (as HEAD is always linked to a geometric form). The HABITAT structure [2] illustrates the role the habitat plays in activating the entity’s affordance structure. Namely, if the appropriate conditions are satisfied (defined by habitat-2), then the telic affordance associated with a plate is activated; every putting of x on y results in y containing x. Thus an affordance is notated as HABITAT → [EVENT]RESULT.
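The HABITAT → [EVENT]RESULT pattern can be pictured as a gated lookup: an affordance is only available while its conditioning habitat holds. The sketch below uses invented type names, not the VoxSim API, to show that gating.

    using System.Collections.Generic;

    // An affordance in the HABITAT -> [EVENT]RESULT pattern.
    public class Affordance {
        public string HabitatIndex;   // e.g. "2" for habitat-2 of [[PLATE]]
        public string Event;          // e.g. "put(x, y)"
        public string Result;         // e.g. "contain(y, x)"
    }

    public class ObjectVoxemeSketch {
        // Habitats currently satisfied by the object's placement in the world.
        public HashSet<string> ActiveHabitats = new HashSet<string>();
        public List<Affordance> AffordanceStructure = new List<Affordance>();

        // An affordance is only offered when its conditioning habitat is active,
        // e.g. a bottle affords rolling only in its on-its-side habitat.
        public IEnumerable<Affordance> AvailableAffordances() {
            foreach (var a in AffordanceStructure)
                if (ActiveHabitats.Contains(a.HabitatIndex))
                    yield return a;
        }
    }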

2.1.2 Programs

PROGRAM is used for modeling verbs. The current set of PROGRAM attributes is shown below:

LEX               PROGRAM’s lexical information
TYPE              PROGRAM’s event typing
EMBEDDING SPACE   If different from the MES, PROGRAM’s embedding space as a function of the participants and their changes over time

Table 2.4: VoxML PROGRAM attributes

Like OBJECTs, a PROGRAM’s LEX attribute contains the subcomponents PRED, the predicate lexeme denoting the program, and TYPE, the program’s type as given in a lexical semantic resource, e.g., its GL type. Top-level component TYPE contains the HEAD, its base form; ARGS, references to the participants; and BODY, subevents that are executed in the course of the program’s operation. Top-level values for a PROGRAM’s HEAD are given below:

HEAD   state, process, transition, assignment, test

Table 2.5: VoxML PROGRAM HEAD types

The HEAD of a program as shown above is given in terms of how the visualization of the action is realized. Basic program distinctions, such as test versus assignment, are included within this typology and further distinguished through subtyping.


    put
      LEX = [ PRED = put
              TYPE = transition event ]
      TYPE = [ HEAD = transition
               ARGS = [ A1 = x:agent
                        A2 = y:physobj
                        A3 = z:location ]
               BODY = [ E1 = grasp(x, y)
                        E2 = [while(hold(x, y), move(y))]
                        E3 = [at(y, z) → ungrasp(x, y)] ] ]

Figure 2.4: [[PUT]], a PROGRAM

No specified EMBEDDING SPACE indicates that the embedding space for the event is the same as the MES. This PROGRAM is agent driven. Should no agent exist, the agent may be excluded from the argument structure (see Section 2.4: Object Model). When beginning the execution of a PROGRAM, any subevents that are already satisfied may be skipped; thus, if during the execution of [[PUT]] the intended path is blocked and the simulation system must replan, grasp(x, y) may be omitted when [[PUT]] resumes—there is no need to ungrasp and re-grasp the object. PROGRAM voxemes may be linked with an abstract visual representation, such as an image schema (Johnson, 1987; Lakoff, 1987).
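The remark about resuming [[PUT]] without re-grasping suggests treating a PROGRAM's BODY as a queue of subevents, each carrying a satisfaction test that allows it to be skipped. A sketch under that reading, with hypothetical types rather than VoxSim's event manager, follows.

    using System;
    using System.Collections.Generic;

    public class Subevent {
        public string Name;               // e.g. "grasp(x, y)"
        public Func<bool> IsSatisfied;    // test against the current world state
        public Action Execute;            // advance the subevent this step
    }

    public class ProgramExecutionSketch {
        public Queue<Subevent> Body = new Queue<Subevent>();

        // On each (re)start, subevents whose conditions already hold are skipped,
        // so a resumed [[PUT]] does not ungrasp and re-grasp the object.
        public void Step() {
            while (Body.Count > 0 && Body.Peek().IsSatisfied()) Body.Dequeue();
            if (Body.Count > 0) Body.Peek().Execute();
        }
    }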

2.1.3 Attributes

ATTRIBUTEs fall into families structured according to some SCALE (cf. Kennedy and McNally (1999)). The least constrained scale is a conventional sortal classification, and its associated attribute family is the set of pairwise disjoint and non-overlapping sortal descriptions (non-super types). VoxML terms this a nominal scale, following Stevens (1946) and Luce et al. (1990). A two-state subset of this domain is a binary classification. By introducing a partial ordering over values, we can have transitive closure, assuming all orderings are defined; this defines an ordinal scale. When fixed units of distance are imposed between the elements on the ordering, we arrive at an interval scale. When a zero value is introduced that can be defined in terms of a quantitative value, we have a rational scale.⁴

In reality there are many more attribute categories than those listed, but the goal in VoxML is to use these types as the basis for an underlying cognitive classification for creating measurements from different attribute types. In other words, these scale types denote how the attribute is to be reasoned with, not what its precise qualitative features are. As VoxML is intended to model visualizations of physical objects and programs, it is intended to model “the image is red” but not “the image is depressing”. Examples of different SCALE types follow:

Scale      Example domain   Example values
ordinal    DIMENSION        big, little, large, small, long, short
binary     HARDNESS         hard, soft
nominal    COLOR            red, green, blue
rational   MASS             1 kg, 2 kg, etc.
interval   TEMPERATURE      0°C, 100°C, etc.

Table 2.6: Example VoxML ATTRIBUTE scalar types
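Because the scale type only says how an attribute may be reasoned with, it can be carried as a simple tag that licenses certain comparisons. The enumeration below is my own shorthand, not part of the VoxML specification.

    public enum ScaleType { Nominal, Binary, Ordinal, Interval, Rational }

    public static class ScaleOps {
        // Ordering comparisons are meaningful from ordinal scales upward; ratios
        // ("twice as much") only make sense on a rational scale, where the zero
        // value really does mean "none of" the quantity.
        public static bool SupportsOrdering(ScaleType s) { return s >= ScaleType.Ordinal; }
        public static bool SupportsRatios(ScaleType s)   { return s == ScaleType.Rational; }
    }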

VoxML also denotes an attribute’s ARITY. transitive attributes are considered to describe object qualities that require comparison to object prototypes (e.g. the small cup vs. the big cup), whereas intransitive attributes do not require that comparison (a red cup is not red compared to other cups; it is red in and of itself). Finally, every attribute must be applied to an object, so attributes’ ARG represents said object and its typing, denoted identically to the individual ARGS of VoxML PROGRAMs.

    small
      LEX = [ PRED = small ]
      TYPE = [ SCALE = ordinal
               ARITY = transitive
               ARG = x:physobj ]

Figure 2.5: [[SMALL]], an ATTRIBUTE

⁴ The difference between an interval scale and rational scale is subtle, but they can be distinguished by a question of whether the zero value on the scale indicates “nothing” or “something.” 0 kg indicates “no mass,” while 0°C does not indicate “no temperature.”


[[SMALL]] is transitive, meaning that a “small” x is small relative to a prototypical instantiation of x.
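That relativity can be made concrete with a comparison against a stored prototype for the object class; the helper and the 0.75 threshold below are invented for illustration, not values used in this work.

    using UnityEngine;

    public static class AttributeEvalSketch {
        // "small" is judged relative to a prototypical instantiation of the same
        // object class, not against an absolute size threshold.
        public static bool IsSmall(Vector3 objectSize, Vector3 prototypeSize,
                                   float ratio = 0.75f) {
            return objectSize.magnitude < prototypeSize.magnitude * ratio;
        }
    }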

2.1.4 Relations

A RELATION’s type structure specifies a binary CLASS of the relation: configuration or force dynamic, describing the nature of the relation that exists between the objects under its scope. These classes themselves have subvalues—for configurational relations these are values enumerated in a qualitative spatial relation calculus such as the Region Connection Calculus (Randell et al., 1992). For force dynamic relations, subvalues are relations defined by forces between objects, such as “support” or “suspend,” many of which are defined as resultant states in an affordance structure. Also specified are the arguments participating in the relations. These, as above, are represented as typed variables. CONSTR denotes an optional constraint on the relation, such as y→HABITAT→INTR[align], which denotes that the INTRINSIC habitat of y denoted by an align function must be satisfied by the current placement of y in order for the relation in question to be in effect.

touching
  LEX  = [ PRED = is touching ]
  TYPE = [ CLASS  = config
           VALUE  = EC
           ARGS   = [ A1 = x:3D
                      A2 = y:3D ]
           CONSTR = nil ]

Figure 2.6: [[TOUCHING]], a RELATION

2.1.5 Functions

FUNCTIONs’ typing structures take as ARG the OBJECT voxeme being computed over. REFERENT takes any subparameters of the ARG that are semantically salient to the function, such as the voxeme’s HEAD. If unspecified, the entire voxeme should be assumed as the referent. MAPPING denotes the type of transformation the function performs over the object, such as dimensionality reduction (notated as dimension(n):n-1 for a function that takes in an object of n dimensions and returns a region of n-1 dimensions). Finally, ORIENTATION provides three values: SPACE, which notes if the function is performed in world space, object space, or pov (camera-relative) space; AXIS, which notes the primary axis and direction the function exploits relative to that space; and ARITY, which returns transitive or intransitive based on the boolean value of a specified input variable (x[y]:intransitive denotes a function that returns intransitive if the value of y in x is true). Definitions of transitive and intransitive follow those for ATTRIBUTEs, so in the example below, the ARITY of [[TOP]] would be intransitive if the INTRINSIC habitat top(+Y) of x is satisfied by the current placement of the object in question in the virtual world.

top
  LEX  = [ PRED = top ]
  TYPE = [ ARG         = x:physobj
           REFERENT    = x→HEAD
           MAPPING     = dimension(n):n-1
           ORIENTATION = [ SPACE = world
                           AXIS  = +Y
                           ARITY = x→HABITAT→INTR[top(axis)]:intransitive ] ]

Figure 2.7: [[TOP]], a FUNCTION

Like PROGRAMs, ATTRIBUTE,RELATION, and FUNCTION voxemes may be linked with an abstract visual representation where relevant.

2.2 VoxSim

VoxSim is the real-time language-driven event simulator implemented on top of the VoxML plat- form. A build of VoxSim can be downloaded at http://www.voxicon.net/download. The Unity project and latest source is at https://github.com/VoxML/VoxSim.


2.2.1 Software Architecture

VoxSim uses the Unity game engine (Goldstone, 2009) for graphics and I/O processing. Input is a simple natural language sentence, which is part-of-speech tagged, dependency-parsed, and transformed into a simple predicate-logic format. These NLP tasks may be handled with a variety of third-party tools, such as the ClearNLP parser (Choi and McCallum, 2013), SyntaxNet (Andor et al., 2016), or TRIPS (Ferguson et al., 1998), which interface with the simulation software using a C++ communications bridge and wrapper. 3D assets and VoxML-modeled entities (created with other Unity-based tools) are loaded externally, either locally or from a web server. Commands to the simulator may be input directly to the software UI, or may be sent over a generic network connection or using VoxSim Commander, a companion app for iOS.

[Schematic components: Parser, VoxML Resources, Voxeme Geometries, Simulator (Unity), Communications Bridge, VoxSim Commander (iOS)]

Figure 2.8: VoxSim architecture schematic

The simulator forms the core of VoxSim, and performs operations over geometries supplied to its resource library and informed by VoxML semantic markup. The communications bridge facilitates generic socket-level communication to third-party packages and exposes the simulator to commands given over a network connection so it can be easily hooked up to remote software. Arrows indicate the directionality of communication to or from each component. Given a tagged and dependency-parsed sentence, we can then transform it into predicate-logic format using the root of the parse as the VoxML PROGRAM, which accepts as many arguments as are specified in its type structure, and subsequently enqueuing arguments that are either constants (i.e. VoxML OBJECTs) or evaluate to constants at runtime (all other VoxML entity types). Other non-constant VoxML entity types are treated similarly, though usually accept only one argument. Thus, the dependency arc CASE(plate, on) becomes on(plate). The resulting predicate-logic is evaluated from the innermost first-order predicates outward until a single first-order representation is reached.


put/VB the/DT apple/NN on/IN the/DT plate/NN
with dependency arcs ROOT(put), DOBJ(put, apple), DET(apple, the), NMOD(put, plate), CASE(plate, on), DET(plate, the)

1. p := put(a[])
2. dobj := the(b)
3. b := (apple)
4. a.push(dobj)
5. nmod := on(iobj)
6. iobj := the(c)
7. c := plate
8. a.push(nmod)

put(the(apple),on(the(plate)))

Figure 2.9: Dependency parse for Put the apple on the plate and transformation to predicate-logic form
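A minimal sketch of this transformation is given below, assuming a simplified arc representation; the Arc type and its field names are hypothetical and do not reflect VoxSim's actual interface to the parser output.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only: the Arc record is a stand-in for parser output.
public record Arc(string Label, string Head, string Dependent);

public static class ParseToLogic {
    // Transform the root predicate and its arguments into nested predicate-logic,
    // mirroring the steps in Figure 2.9.
    public static string Root(string verb, List<Arc> arcs) {
        var args = arcs.Where(a => a.Head == verb &&
                                   (a.Label == "DOBJ" || a.Label == "NMOD"))
                       .Select(a => Nominal(a.Dependent, arcs));
        return $"{verb}({string.Join(",", args)})";
    }

    // A nominal is wrapped by its determiner (the(apple)) and, if present,
    // by its case marker: CASE(plate, on) becomes on(the(plate)).
    static string Nominal(string noun, List<Arc> arcs) {
        string result = noun;
        var det = arcs.FirstOrDefault(a => a.Head == noun && a.Label == "DET");
        if (det != null) result = $"{det.Dependent}({result})";
        var kase = arcs.FirstOrDefault(a => a.Head == noun && a.Label == "CASE");
        if (kase != null) result = $"{kase.Dependent}({result})";
        return result;
    }
}

// Usage: Root("put", arcs) over the parse in Figure 2.9 yields
// "put(the(apple),on(the(plate)))".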

2.2.2 LTSs to Manage Events

An event sequence can be viewed as a type of a labeled transition system (LTS) (van Benthem, 1991; van Benthem et al., 1994), in which the usage of distinct event classes comports with an LTS’s notion of “process equivalence” where the equivalence relations respect but are not defined by the observational properties in two LTSs under comparison, as enumerated by van Benthem and Bergstra (1994). An event is distinguished by a label and its argument set as enumerated by its VoxML structure. It can be indicated and selected by name, with its argument structure being filled in from the linguistic parse or by various specification methods discussed subsequently, most abstractly in Section 2.8. VoxSim manages its event queue by way of an “event manager” that maintains a list of the events currently requested of the system, under the assumption that the event at the front of a non- empty queue is the event currently being executed. Thus, the frontmost event always exists in a

first-order state (Lωω), while other events may be of higher order forms (Lω1ω) until they come to the front of the queue, at which point they are “first-orderized” (Keisler, 1971; Chang and Keisler, 1973; van Benthem et al., 1993) by the process depicted in Figure 2.9, where the result of the parse is what is inserted into the queue. Should an event require precondition satisfaction, such as a [[PUT]] event requiring that the object be grasped before being moved, an event that will satisfy this precondition can be inserted into the event manager system at appropriate point in the transition graph such that when the precondition is satisfied, the remainder of the initially-prompted event’s subevent structure can

22 CHAPTER 2. FRAMEWORK

resume execution from that point. Insertion into the event manager/transition graph can be nested, so if [[PUT]] requires that the object be grasped and [[GRASP]] requires that the object be touched, the call to [[GRASP]] from [[PUT]] can handle the precondition insertion of [[REACH FOR]] (the predicate label for touching + hand in grasp pose) before returning control back to [[GRASP]], and subsequently [[PUT]]. As events proceed, VoxSim maintains the current set of relations that exist between every pair of objects in the scene. Thus, should a precondition inserted into the event manager already be satisfied, the moment it comes to the front of the queue and is evaluated to first-order form, it will be removed from the transition graph as a satisfied event, keeping the system moving. Event sequences, as broadly implemented in the VoxSim event manager, are not necessarily

deterministic (¬2∀xy((Raxy ∧ Raxz → y = z))), as information required to either initiate or resolve the event (sometimes both) must be computed from the composition of object and event properties as encoded in VoxML, or given value at runtime (discussed in Section 2.8).
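The queue behavior described above can be sketched schematically as follows; the class and member names are hypothetical stand-ins, not VoxSim's actual event manager API, and the satisfaction and precondition checks are abstracted into delegates.

using System;
using System.Collections.Generic;

// Schematic sketch of the queue/precondition behavior; names are hypothetical.
public class QueuedEvent {
    public string Predicate;                      // e.g. "put", "grasp", "reach_for"
    public Func<bool> Satisfied;                  // satisfaction test against current relations
    public Func<QueuedEvent> MissingPrecondition; // returns an event to insert, or null
}

public class EventQueueSketch {
    readonly LinkedList<QueuedEvent> queue = new LinkedList<QueuedEvent>();

    public void Enqueue(QueuedEvent e) => queue.AddLast(e);

    // Called once per state update.
    public void Tick() {
        if (queue.Count == 0) return;
        var front = queue.First.Value;

        // An already-satisfied event (e.g. a precondition met by earlier context)
        // is removed immediately, keeping the system moving.
        if (front.Satisfied()) { queue.RemoveFirst(); return; }

        // If a precondition is unmet, insert the satisfying event ahead of the
        // current one; execution resumes from this point once it completes.
        var pre = front.MissingPrecondition?.Invoke();
        if (pre != null) queue.AddFirst(pre);
        // Otherwise the frontmost event continues executing toward satisfaction.
    }
}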

2.3 Spatial Reasoning

VoxML is not intended to encode every piece of information needed to reason about objects and events. Rather, the VoxML-encoded information about the entity is interpreted at runtime when composed with current situational context.


cup
  LEX  = [ PRED = cup
           TYPE = physobj ]
  TYPE = [ HEAD       = cylindroid[1]
           COMPONENTS = surface, interior
           CONCAVITY  = concave
           ROTATSYM   = {Y}
           REFLECTSYM = {XY, YZ} ]
  HABITAT = [ INTR = [2] [ UP  = align(Y, E_Y)
                           TOP = top(+Y) ]
              EXTR = ... ]
  AFFORD_STR = [ A1 = H[2] → [put(x, on([1]))]support([1], x)
                 A2 = H[2] → [put(x, in([1]))]contain([1], x)
                 A3 = H[2] → [grasp(x, [1])] ]
  SCALE = ...

Figure 2.10: VoxML structure for [[CUP]] with associated geometry

For instance, a cup of the type shown in Figure 2.10, on a surface with its opening upward, may afford containing another object, or being drunk from, but in other habitats or configurations these affordances may be deactivated. [[CUP]]’s VoxML markup shows a concave object with rotational symmetry around the Y axis and reflectional symmetry across the XY and YZ planes, meaning that it opens along the Y axis. [[CUP]]’s HABITAT information further situates the opening along its positive Y axis, meaning that if the cup’s opening along its +Y axis is unobstructed, then it affords containment in its current habitat. Habitats established by preceding context, such as “The cup is flipped over,” may deactivate or activate these affordances or others. Following from the convention that agents of a VoxML PROGRAM must be explicitly singled out in the associated implementation by belonging to certain entity classes (e.g., humans), affordances describe what can be done to the object, and not what actions it itself can perform. As mentioned, an affordance is notated as HABITAT → [EVENT]RESULT, and an instance such as

H[2] → [put(x, on([1]))]support([1], x) can be paraphrased as “In habitat-2, an object x can be put on component-1, which results in component-1 supporting x.” This procedural reasoning from habitats and affordances, executed in real time, allows VoxSim to infer the complete set of spatial relations between objects at each state and track changes in the shared context between human and computer. This allows the simulator to become a way to trace the entailments of spatial cues through a narrative. In order to place an object in(cup), the system must first determine if the intended containing object (i.e., the cup) affords containment by default by examining its affordance structure. Affordance-2 of [[CUP]] (Figure 2.10) demonstrates this. It requires the cup to be currently situated in a habitat which allows other objects to be placed partially or completely inside it (represented by RCC relations PO, TPP, or NTPP, as shown in Figure 2.11, where ProperPart ≡ TPP ∪ NTPP). [[IN]] is shown in contrast to [[ON]], showing the distinction in configurational value and constraints.

in
  LEX  = [ PRED = in ]
  TYPE = [ CLASS  = config
           VALUE  = PO ∥ TPP ∥ NTPP
           ARGS   = [ A1 = x:3D
                      A2 = y:3D ]
           CONSTR = y→HABITAT→INTR[align]? ]

on
  LEX  = [ PRED = on ]
  TYPE = [ CLASS  = config
           VALUE  = EC
           ARGS   = [ A1 = x:3D
                      A2 = y:3D ]
           CONSTR = y→HABITAT→INTR[align] ]

Figure 2.11: VoxML structures for [[ON]] and [[IN]]

Finally, the system must test to see if the object to be placed inside the containing object can fit there in its current orientation. If so, it is moved into position through the execution of [[PUT]] (see Figure 2.4). If not, the system checks if any rotation around the three orthogonal world axes will result in the placed object having local dimensions that will fit inside the containing object. If so, the object is rotated into that orientation and then moved. Figure 2.12 shows a successful execution of this process. If the placed object cannot fit inside the intended containing object in any configuration, the system returns a message stating that the requested action is impossible to perform.


Figure 2.12: Execution of “put the spoon in the mug”
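A simplified version of the fit test might look like the following Unity-flavored sketch, which approximates the minimal embedding space with axis-aligned bounds and only tries 90° rotations about the world axes; the class and method names are illustrative, not the actual VoxSim implementation.

using UnityEngine;

// Illustrative sketch of the "will it fit after some world-axis rotation" check;
// a bounds-based approximation with hypothetical helper names.
public static class FitCheck {
    static readonly Vector3[] axes = { Vector3.right, Vector3.up, Vector3.forward };

    // Returns a rotation (relative to the object's current orientation) that lets
    // objBounds fit inside containerInterior, or null if no axis rotation works.
    public static Quaternion? FindFittingRotation(Bounds objBounds, Bounds containerInterior) {
        if (Fits(objBounds.size, containerInterior.size)) return Quaternion.identity;

        foreach (var axis in axes) {
            var rot = Quaternion.AngleAxis(90f, axis);
            // Rotating by 90° about a world axis permutes the object's extents.
            var rotatedSize = Abs(rot * objBounds.size);
            if (Fits(rotatedSize, containerInterior.size)) return rot;
        }
        return null;   // caller reports that the requested action is impossible
    }

    static bool Fits(Vector3 inner, Vector3 outer) =>
        inner.x <= outer.x && inner.y <= outer.y && inner.z <= outer.z;

    static Vector3 Abs(Vector3 v) =>
        new Vector3(Mathf.Abs(v.x), Mathf.Abs(v.y), Mathf.Abs(v.z));
}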

The orientation requirement is not explicitly encoded in the [[PUT]] markup, but is implicitly enforced by the constraints on the minimal embedding space, such that if the cup is upside down with its opening EC to the table surface, it would have to be flipped over before putting an object inside it. This form of “backchaining” is used to satisfy preconditions that are automatically infer- able from the configuration space. For example, if an agent needs to be touching an object but that object is out of reach, it should move itself into the appropriate position to touch the object without an explicit command. In Figure 2.11, the question mark in the typing of [[IN]] denotes the test on the habitat, where [[ON]] requires no test to be conducted, as transformations on any object being placed “on” another object can be mapped directly from the habitat currently occupied without testing over a set of conditional transformations before executing. For instance, given a cup sitting upright (in the proper orientation), “put the lid on the cup” clearly describes an end state where the lid closes the cup’s opening. However, if the cup is on its side, we find that, if the end location is chosen at random, such as by a Monte Carlo method, from possible configurations left available by the currently operating set of constraints and configurations, both a lid closing the opening of the cup, and one that is touching the cup on the positive Y-axis (i.e., explicitly “on top of” the cup independent of orientation) can be computed as satisfying the command “put the lid on the cup” (Figure 2.13).


Figure 2.13: Orientation-dependent visualizations of “put the lid on the cup”

Ambiguity of resultant configuration occurs even in default habitats. To say that something is “on the TV” can usually be interpreted specifically relative to the type of the object (e.g. using a type system such as Generative Lexicon (Pustejovsky, 1995)), where a physical object on the TV is on top of it while an image is on the screen, the TV’s semantic head per VoxML.5 However, in the case where the object on the TV is something, such as a sheet of paper, that could be placed on either surface, the ambiguity remains. When VoxSim must make a simulation assignment of values through random choice, either result may be generated as the end state of the event (as shown in Fig. 2.14), and a human is required to judge the appropriateness of either choice.

Figure 2.14: Context-dependent visualizations of “put the paper on the TV”

Underspecification can thus arise both in post-conditions and in the action sequence of the event itself, such as when parameters like speed and direction of motion are left unknown. Additionally, depending on the motion predicate being simulated, the manner of motion itself may be left underspecified. “Move the cup to the center of the table” implies a path by the specification of the destination, but the manner by which the cup should be moved is not specified. The kind of forward composition illustrated thus far may narrow the search space—in this example, the table provides a surface to be moved over, which may make sliding or rolling (if the cup lacks a handle) preferable to other forms of motion—but ultimately the predicate itself contains unspecified parameters and so may be replaced in a simulation with another motion predicate that satisfies the same basic constraints and imposes others on top of them.

5As VoxSim was developed with the intention of simulating compositional, non-idiomatic, physically-grounded spatial events, we ignore here the idiomatic non-spatial reading of on TV, denoting “the information content available through the medium ‘TV’.”

2.4 Object Model

The example illustrated in Figure 2.12, where the moving objects simply translocate and rotate in empty space without any visible manipulating force, shows the object model of the spatial reasoning approach used by the VoxSim software. The object model of motion enacts the verbal program on the objects involved only, and “factors out” any agent that might be encoded in the program. Thus the subevent structure of [[PUT]] (Figure 2.4) is reduced from grasp(x, y), [while(hold(x, y), move(y))], [at(y, z) → ungrasp(x, y)] to [while(¬at(y, z), move(y))]. The absence of an agent in the scene obviates the need for grasping and holding preconditions to be satisfied to move the object. All that remains is the object movement until the enacted event is satisfied, as shown in Figure 2.15.

Figure 2.15: Object model of lifting and dropping an object
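Under these assumptions, the object model's reduced program amounts to a per-frame location update, sketched below; the component and field names (ObjectModelMove, targetLocation, moveSpeed) are illustrative rather than the actual Voxeme component.

using UnityEngine;

// Minimal sketch of the object model's [while(¬at(y,z), move(y))] loop as a
// per-frame update; field names are illustrative.
public class ObjectModelMove : MonoBehaviour {
    public Vector3 targetLocation;   // z in the reduced program
    public float moveSpeed = 1.0f;   // an underspecified parameter, assigned elsewhere

    void Update() {
        // ¬at(y, z): keep iterating the location update until the test fails.
        if (transform.position != targetLocation) {
            transform.position = Vector3.MoveTowards(
                transform.position, targetLocation, moveSpeed * Time.deltaTime);
        }
        // at(y, z): with no agent there is no ungrasp; the event is simply satisfied.
    }
}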

2.5 Action Model

This approach can be augmented with an embodied agent that simultaneously enacts the same program as is being enacted over the manipulated object. When done without moving the objects themselves, this results in the inverse transformation on the program as mentioned above, factoring out the objects and leaving a “pantomime” version of the event. The action model alone allows VoxSim to be used to make agents in the scene gesture by acting out programs without rig-attaching objects. This facilitates symmetric communication between the human user and the computer using linguistic and gestural modalities. While not the focus of this particular line of research, we discuss the uses of VoxSim in gestural communication in Pustejovsky et al. (2017).

Figure 2.16: Action model of lifting and dropping an object

2.6 Event Model

The composition of the object and action models is facilitated by a process of “rig-attachment.” Since agents in the 3D world must comprise a mesh model (as those that make up ordinary objects) with a rig to facilitate skeleton animation6, this allows the creation of temporary parent-child relationships between nodes on the rig (joints) and objects in the scene. While these temporary relations are in effect, the two objects move in concord, maintaining their offset as it was at the point the rig-attachment was created. Any program enacted over the parent also affects the child, thus composing the object and action models into a single “event model.” Agents in VoxSim manipulate objects by way of “graspers” (e.g. hands), so a grasp(x, y) event, when triggered explicitly or as a subevent of another event, creates the rig-attachment, and ungrasp(x, y) severs it. In Figure 2.17 below, rig-attachment is created by grasp in the first frame and severed by ungrasp after the second frame.

6A human body rig is always a directed, rooted tree whose nodes and edges form roughly the shape of a human body. The foundations of skeletal animation are laid out by Magnenat-Thalmann et al. (1988).

Figure 2.17: Event model of lifting and dropping an object

When forces applied by rig-attachment are ceased, other present forces continue to be applied without the previous resistance. One such force is gravity, always pulling downward, the conse- quence of which is seen in the third frame of Figure 2.17, where the apple is released by the agent and falls back toward the table. Figure 2.18 shows the composed event model visualization of “put the yellow block on the red block”. Frame 1 shows the grasp(x, y) subevent of [[PUT]], creating the rig-attachment of the yellow block to the right wrist node (the most immediate parent node that exerts control over every vertex in the agent’s hand or grasper). Frame 2 shows a snapshot of the while loop, where the grasper and grasped object move toward the target location, maintaining rig-attachment. At frame 3, having reached the target location computed by on(red(block)) (computed using the same reasoning described in Section 2.3), rig-attachment is severed, allowing the yellow block to rest at that location.


Figure 2.18: Execution of “put the yellow block on the red block” using embodied agent
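A schematic of rig-attachment as temporary transform parenting follows; the method names and the suspension of physics while the object is held are illustrative choices, not necessarily how VoxSim's grasp implementation handles them.

using UnityEngine;

// Sketch of rig-attachment as temporary transform parenting; illustrative only.
public static class RigAttachment {
    // grasp(x, y): parent the grasped object to the grasper joint (e.g. a wrist
    // node), preserving the current world-space offset so the pair moves in concord.
    public static void Grasp(Transform grasperJoint, Transform grasped) {
        grasped.SetParent(grasperJoint, worldPositionStays: true);
        var body = grasped.GetComponent<Rigidbody>();
        if (body != null) body.isKinematic = true;   // suspend physics while held
    }

    // ungrasp(x, y): sever the rig-attachment and let other forces (gravity)
    // act on the object again, as in the third frame of Figure 2.17.
    public static void Ungrasp(Transform grasped) {
        grasped.SetParent(null, worldPositionStays: true);
        var body = grasped.GetComponent<Rigidbody>();
        if (body != null) body.isKinematic = false;
    }
}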

The object/action model distinction allows the semantics of a program to be abstracted into the general motion to execute the program (e.g., the trajectory to move an arbitrary object from source to destination) and the specifics applicable to the arguments (e.g., how to grasp a given object to move it, how to calculate the destination location given object properties and current habitat, etc.). Primitive behaviors like “grasp,” “move (translate),” “move (rotate),” etc. can be composed into more complex behaviors like “put” or “lean.”

2.7 Event Composition

Broadly, all motion in a three-dimensional space can be broken down into sequences of translations and rotations, making these obvious choices for primitives in a dynamic motion semantics. For example, a motion like “lean,” in the sense of “lean x on y” can be conceptually broken down into these steps:

The desired goal is to have x supported by y while rotated at some angle θ offset from its normal resting position.

1. Turn x such that its major axis is offset from the +Y axis of the world by θ°

2. Move x so it touches a side of y

The end state of a visualization of “lean the cup on the block” might look like Figure 2.19, with variation in the angle specified.


Figure 2.19: End state of “lean the cup on the block”

Implementing this behavior requires accounting for a few considerations imposed by constraints on the 3D engine or by situations that may arise in the simulated world, namely:

• the starting position and orientation of x are arbitrary;

• x may not be lying flat;

• x may not be axis-aligned;

• 3D transformations by default take the shortest path from start to destination;

• a single rotation that satisfies the condition above may result in a configuration that is physically unstable and does not satisfy “lean.”

A real-world example of an unsatisfied “lean” vs. a satisfied “lean” is shown in Figure 2.20 below. While in both pictures the leaned object’s major axis is at the same angle relative to the ground, in the image on the left the object must be held up, while in the image on the right it is fully supported and situated in the physical world. The compositional semantics of motion events in simulation must reflect and account for these types of distinctions in motion verbs.


Figure 2.20: Unsatisfied vs. satisfied “lean”

The semantics of “lean” require not only that the object’s major axis be at an angle relative to the ground object and supporting surface, but that the object also end the motion resting on an edge that can support its shape against the supporting surface. Thus we should assume that the object’s minor axis should never end the event parallel to the ground plane. We can encode automatic satisfaction of this implicit constraint in the VoxML by specifying not one type of rotation, but two:

1. Turn x such that its minor axis is offset from the +Y axis of the world by (90−θ)°

2. Turn x such that its major axis is offset from the +Y axis of the world by θ°, constraining motion around the object’s minor axis in its orientation after transformation 1

3. Move x so it touches a side of y

This exercises three types of primitive motions and forces the distinction of two types of rotation in our primitive set:

1. TURN-1: turn(x:obj, V1:axis, E_V2:axis) — turn object x so that object axis V1 is aligned with world axis V2

2. TURN-2: turn(x:obj, V1:axis, E_V2:axis, E_V3:axis) — turn object x so that object axis V1 is aligned with world axis V2, constraining motion to around world axis V3

3. PUT: put(x:obj, y:loc) — put object x at location y


This gives us a complete VoxML representation for [[LEAN]] in terms of primitive motions.

lean
  LEX  = [ PRED = lean
           TYPE = transition event ]
  TYPE = [ HEAD = transition
           ARGS = [ A1 = x:agent
                    A2 = y:physobj
                    A3 = z:location ]
           BODY = [ E1 = grasp(x, y)
                    E2 = [while(hold(x, y), turn(x, y,
                          align(minor(y), E_Y × (90 − θ, about(E_⊥Y)))))]
                    E3 = [while(hold(x, y), turn(x, y,
                          align(major(y), E_Y × (θ, about(E_⊥Y))), about(minor(y))))]
                    E4 = [while(hold(x, y), put(x, y))]
                    E5 = [at(y, z) → ungrasp(x, y)] ] ]

Figure 2.21: VoxML structure for [[LEAN]] (compositional)

Events can be composed in a fashion that leaves components underspecified, as well. Figure 2.22 shows two blocks switching positions on a table. The sequence can be glossed as roughly:

“Move the brown block in front of the blue block. Move the blue block to the location originally occupied by the brown block. Move the brown block to the location originally occupied by the blue block.”


Figure 2.22: Visualization of “switch the blocks”

This can be modeled in VoxML as:

switch
  LEX  = [ PRED = switch
           TYPE = transition event ]
  TYPE = [ HEAD = transition
           ARGS = [ A1 = y[]:physobj ]
           BODY = [ E1 = def(w, as(loc(y[0]))), def(v, as(loc(y[1])))
                    E2 = slide(y[0], in front(v))
                    E3 = slide(y[1], w)
                    E4 = slide(y[0], v) ] ]

Figure 2.23: VoxML structure for [[SWITCH]]

The program takes a list (pair) of objects, stores their starting locations in E1 (in the variables w and v), and in E2–E4 slides them in sequence across the supporting surface to the locations calculated. If we do not require that the “switch” action keeps the objects constrained to the surface, as slide does, the manner primitives in the program can be changed:


switch
  LEX  = [ PRED = switch
           TYPE = transition event ]
  TYPE = [ HEAD = transition
           ARGS = [ A1 = y[]:physobj ]
           BODY = [ E1 = def(w, as(loc(y[0]))), def(v, as(loc(y[1])))
                    E2 = put(y[0], in front(v))
                    E3 = put(y[1], w)
                    E4 = put(y[0], v) ] ]

Figure 2.24: VoxML structure for [[SWITCH]] (unconstrained)
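The sequencing of E1-E4 can be sketched as follows, treating each primitive as if it completes before the next is issued; Switch, the motion delegate, and InFrontOf are hypothetical stand-ins for the operationalized primitives and the in front(v) relation.

using UnityEngine;

// Schematic sketch of the [[SWITCH]] body; the motion delegate may stand in for
// either slide(x, loc) or put(x, loc), matching the two versions above.
public static class SwitchSketch {
    public static void Switch(GameObject[] y,
                              System.Action<GameObject, Vector3> motion,
                              System.Func<Vector3, Vector3> InFrontOf) {
        // E1: def(w, as(loc(y[0]))), def(v, as(loc(y[1])))
        Vector3 w = y[0].transform.position;
        Vector3 v = y[1].transform.position;

        motion(y[0], InFrontOf(v));   // E2: clear the way
        motion(y[1], w);              // E3: move y[1] into y[0]'s original location
        motion(y[0], v);              // E4: move y[0] into y[1]'s original location
    }
}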

2.8 Specification Methods

While the distinctions discussed above can be specified according to situational needs in VoxML, in the absence of complete situational knowledge there is often no way to decide on the best specification value. A minimal model is agnostic to these types of manner distinctions in complex and underspecified predicates. There still exist bits to be set before the simulation can be executed, requiring us to fall back on other methods to fill in that information. From the composition of VoxML and dynamic event semantics, we can achieve a model of an event that is “filled out” with information to an extent far greater than that provided by a minimal model. Forward composition can take a minimal model of an event and augment it with general-domain lexical world knowledge about the event predicate and its participants, which allows VoxSim to create a more extensive informational context than a minimal model provides, which it can then share with its human user. Visualization, an intuitively accessible modality for most humans, becomes a medium through which to share that context. From the minimal model, VoxSim’s interpretation of encoded VoxML and dynamic semantic knowledge allows it to define the interpretation of the event predicate as a logic program with further specification than the logic program provides in isolation, by drawing on and composing knowledge about the arguments, relations, and preconditions involved in the event. For example, in the simple utterance “roll the ball,” humans as language interpreters know that there must be a surface to roll the ball over, even though no such surface is mentioned in the utterance given. The VoxML type encoding for roll (Figure 2.25) makes this explicit, representing the surface as A3.

roll
  LEX  = ...
  TYPE = [ HEAD = process
           ARGS = [ A1 = x:agent
                    A2 = y:physobj
                    A3 = z:surface ]
           BODY = ... ]

Figure 2.25: Abbreviated VoxML type structure for [[ROLL]]

As mentioned above, in a minimal model, the nature of the surface (shape, consistency, etc.) can be left out completely. With forward composition and VoxML, VoxSim fills in some of these parameters by composing the event with the surface object. Object habitats and configurations therefore become conditioning environments that enforce constraints on the minimal model. While the integration of context, spatially-conditioning environments, and real-world knowledge may provide sufficient information to determine the nature of a spatial constraint or set of constraints, the precise instantiated value, down to the 3D coordinates and rotation of an object at each step of program execution, may still be left imprecise. For example, through forward composition, VoxSim may be able to determine the region within which an object must be placed to satisfy the completion of an event it is commanded to simulate, but the precise location within that region where the object should be by the end of the event is left underspecified, and there is no method from the lexical semantics to determine exactly what value that location should take. Nonetheless, in order for any platform such as VoxSim to execute the fully specified program at runtime, all required parameters must have a value assigned, including those that are potentially never mentioned in the linguistic utterance linked to the event and never raised in the additional encoding used by forward composition, and it is in this search space that this research conducts its experimentation.

Chapter 3

Methodology and Experimentation

VoxSim provides a method not only for generating 3D visualizations using an intuitive natural language interface instead of specialized skill sets (a primary goal of programs such as WordsEye), but also a platform on which researchers may conduct experiments on the discrete observables of motion events while evaluating semantic theories, thus providing data to back up theoretical intuitions. It provides a robust and extensible platform for using scene visualization and simulation to expose the presuppositions underlying complex events by testing automatically generated simulations for their concordance with the event description they are generated from. Using the VoxSim platform and a test set of objects and events (shown in Table 3.1), sets of visualizations were automatically generated with randomly-assigned parameter values for the linguistically underspecified components, and evaluated with the aim of determining two things:

1. If, for a given event, any value(s) for an underspecified parameter can be considered qualita- tively “better” than others;

2. If such values emerge from the data, their componential nature (that is, if there is a single “best” value or a range of values, the extent of such a range, whether the values are fixed over the dataset or relative to other event parameters, etc.).


Programs:  move x, turn x, roll x, slide x, spin x, lift x, stack x, put x near y,
           put x touching y, put x on y, put x in y, lean x on y, lean x against y,
           flip x on edge, flip x at center, close x, open x

Objects:   block, ball, plate, cup, disc, book, blackboard, bottle, apple, grape,
           banana, bowl, knife, pencil, paper sheet

Table 3.1: Test set of verbal programs and objects

Scene visualization work is not well-reflected in current evaluation, due to sparsity of datasets and lack of a general-domain gold standard (Johansson et al., 2005). As automatic evaluation runs the risk of testing against an overfitted model (Cawley and Talbot, 2010), evaluation was conducted using two complementary human-driven methods, augmented by an automatic method. Thus the result of this line of research provides:

1. A robust and extensible framework for mapping natural language expressions of motion events into a dynamic logic and from there into a minimal model, simulation, and visualization, as defined at the beginning of Chapter 2;

2. A software platform for implementing these simulations and rendering them visually, with recoverable mappings to the corresponding dynamic logic program, NL expression, and parse;

3. A demonstration of how to use the resulting simulation and visualization platform to test for semantic presuppositions regarding linguistically underspecified parameters in motion events;

4. Human- and automatically-judged data regarding the nature of prototypical values for the aforementioned parameters in these events.

3.1 Preprocessing

3.1.1 Test Set Creation

To generate the test sentences, a full list was first created of all possible combinations of program and object predicates, effectively generating the list P × O for every p([]) ∈ Programs and o ∈ Objects. Since this inevitably generated some semantically or syntactically invalid sentences, such as ?Close the block1, a small annotation task was constructed to mark each entry in the list as valid, grammatically invalid, semantically invalid (i.e., grammatical but meaningless in the vein of “colorless green ideas sleep furiously”), or uncertain. A total of 3,381 unvetted sentences were submitted to 3 annotators:

• Annotator 1: Asian-American female, fluent or native English speaker, 20-29 years old, 3+ years post-secondary education

• Annotator 2: Asian-American female, fluent or native English speaker, 20-29 years old, 4+ years post-secondary education, bachelor’s degree

• Annotator 3: European-American female, fluent or native English speaker, 20-29 years old, 6+ years post-secondary education, bachelor’s and master’s degree

Annotators were given the following instructions by which to judge each sentence:

• fine: Nothing wrong with this sentence. It’s properly formed and makes sense.

• ungrammatical: The sentence is incorrectly structured, and not grammatical English.

• nonsense: The sentence is well-formed but doesn’t make sense. Given a sentence of the form “VERB the NOUN,” if the answer to the question “Can the NOUN be VERBed?” is a firm negative, the sentence should be judged as “nonsense.” An example would be “close the block,” as a block typically cannot be closed.

• awkward: Catch-all for everything else. The sentence is well-formed and not obviously nonsense, but there’s still something off about it. For example, maybe the result of executing

1The question mark here denotes questionable semantic acceptability, in line with commonly-accepted linguistic acceptability/grammaticality notation.


the command given in the sentence would result in a structure that’s physically unstable, like trying to balance a block on top of a grape. There may be other cases, but if a sentence is not obviously “fine,” obviously “ungrammatical,” or obviously “nonsense,” it may belong here.

Once annotations were collected, each judgment option was given a corresponding score (ungrammatical = 0, nonsense = 1, awkward = 2, fine = 3). Annotators’ scores for each sentence were totaled, for a raw score of 0-9. Accepted sentences were those that scored 5 or greater. This allowed for the inclusion of some marginal sentences that were evaluated as “awkward” by two annotators and “nonsense” by one. However, since a raw score of 5 would also admit sentences judged as “nonsense” by two annotators and “fine” by one, which would violate the principle of majority rule, an additional constraint was enforced that eliminated sentences scoring 5 that received more than one evaluation as “nonsense” or “ungrammatical.” Thus, in the 0-9 range, the 5 score was divided into two partitions based on the mode of the sentences’ individual raw scores. This resulted in 388 sentences being rejected. From the remaining sentence list, sentences were eliminated if the subject and indirect object arguments were the same (e.g., “put the block on the block”), as only one of each nominal object was being tested in any visualization. Finally, we removed entries that contained references to objects that were not in the test set but were included in the unvetted sentence list due to human error. While initially included in the test set, and included in the set of sentences used for the sentence simulability annotation task, the relations left(y), right(y), behind(y), and in front(y) were removed from the final test set of sentences and instead realized as possible value specifications of touching(y) (see Table 3.2). The only other underspecified value in commands of the form put(x,{left,right,behind,in front}(y)) is the speed of translocation, and it was expected that the distribution of these values would fall very close to the resulting distribution in put(x,{on,in,touching,near}(y)), so these relations were removed as inputs to the experiment in order to shrink the search space of the overall task as well as to prevent the evaluated data from being too biased towards instances of put(x,y) events. The resulting final test set contained 1,119 sentences. These are all listed in Appendix D.
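The acceptance rule just described can be expressed compactly as follows; the enum and method names are illustrative, not part of the actual annotation pipeline.

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of the sentence acceptance rule; names are illustrative.
public enum Judgment { Ungrammatical = 0, Nonsense = 1, Awkward = 2, Fine = 3 }

public static class SentenceFilter {
    // A sentence is accepted if its raw score (sum over three annotators) is at
    // least 5, except that a score-5 sentence with more than one "nonsense" or
    // "ungrammatical" judgment is rejected (the majority-rule carve-out).
    public static bool Accept(IList<Judgment> judgments) {
        int raw = judgments.Sum(j => (int)j);
        if (raw < 5) return false;
        if (raw == 5) {
            int lowJudgments = judgments.Count(j => j == Judgment.Nonsense ||
                                                    j == Judgment.Ungrammatical);
            if (lowJudgments > 1) return false;
        }
        return true;
    }
}

// Example: {Awkward, Awkward, Nonsense} scores 5 and is accepted;
// {Nonsense, Nonsense, Fine} also scores 5 but is rejected.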

3.1.2 Identifying Underspecified Parameters

In order to generate a full set of visualizations, the following had to be determined for each verbal predicate:


1. Parameters that are left underspecified or unconstrained;

2. Testable satisfaction conditions in the program.

This examination was performed on both a linguistic and programmatic level, using DITL and VoxML representations of the program.2 Examples follow.

VoxML:
  put
    LEX  = [ PRED = put
             TYPE = transition event ]
    TYPE = [ HEAD = transition
             ARGS = [ A1 = x:agent
                      A2 = y:physobj
                      A3 = z:location ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), move(y))]
                      E3 = [at(y, z) → ungrasp(x, y)] ] ]

DITL:
  put(y,z)
    w_0 := loc(y); (w_t ≠ z?; w_t ≠ w_{t−1}; d(z,w_{t−1}) > d(z,w_t))+

Figure 3.1: VoxML and DITL for put(y,z)

(a) put requires a prepositional adjunct which typically includes the destination location (z in Figure 3.1). The prepositional adjunct specifies the event’s ending location, and at each time step, the location of the moving object along the path is expected to approach the destination by the measure of the distance function (various distance heuristics can be used to allow for convoluted or non-direct paths, such as those requiring object avoidance; this research uses A* path planning with the Manhattan distance heuristic). No value is given for the speed of motion.

The DITL program encodes a test for whether or not the object has reached its destination (w_t ≠ z?), and continuous, Kleene-iterated location change toward the destination (subject to the distance heuristic in use) if the test is not satisfied (w_t ≠ w_{t−1}; d(z,w_{t−1}) > d(z,w_t)). This maps to a satisfaction test d(z, loc(y)_n) = 0, (loc(y)_t ≠ loc(y)_{t−1})[0,n]?.
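As a sketch of how such a satisfaction test might be evaluated over a recorded trajectory of object locations, consider the following; the method name and the floating-point tolerance are illustrative assumptions.

using System.Collections.Generic;
using UnityEngine;

// Sketch of the put(y,z) satisfaction test over a recorded trajectory; names
// and tolerance are illustrative.
public static class PutSatisfaction {
    // d(z, loc(y)_n) = 0 and loc(y)_t != loc(y)_{t-1} for every step in [0, n].
    public static bool Satisfied(IList<Vector3> trajectory, Vector3 z, float epsilon = 1e-3f) {
        if (trajectory.Count < 2) return false;

        // The object must have kept moving at every time step...
        for (int t = 1; t < trajectory.Count; t++) {
            if (trajectory[t] == trajectory[t - 1]) return false;
        }

        // ...and must end at the destination (within tolerance, since the engine
        // works with floating-point world coordinates).
        return Vector3.Distance(trajectory[trajectory.Count - 1], z) <= epsilon;
    }
}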

2In these DITL programs, subscripted letters refer to object parameter values at time step t, and d refers to a distance function between two vectors.


VoxML:
  slide
    LEX  = [ PRED = slide
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj
                      A3 = z:surface ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), while(EC(y, z), move(y)))] ] ]

DITL:
  slide(y)
    w_0 := loc(y); (w_t ≠ w_{t−1}; EC(y,z))+

Figure 3.2: VoxML and DITL for slide(y)

(b) slide enforces a constraint on the motion of the object, requiring it to remain EC with the surface being slid across (z in Figure 3.2; only implicit in the NL utterance unless explicated in an adjunct). In the bare predicate, no reference is made to what speed or direction the object should be moving as long as the EC constraint is maintained.

The DITL program requires a step-wise location change (w_t ≠ w_{t−1}). The object may rotate, but the rotation should not be proportional to the path length, path shape, and speed of the object (that could make the motion a rolling). It must maintain the EC constraint with the surface (EC(y,z)). These conditions map to a satisfaction test: EC(y,z)[0,n]?, loc(y)_n ≠ loc(y)_0, ¬(rot(y) ∝ Σ_{t=0..n} (loc(y)_t − loc(y)_{t−1})). The final parameter of the test, ¬(rot(y) ∝ Σ_{t=0..n} (loc(y)_t − loc(y)_{t−1})), is not expressed explicitly in the DITL program, but must be expressed in the satisfaction test because while rotation is optional and may be arbitrary, there are conditions under which the nature of the rotation change would render the motion no longer a “slide.”
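A corresponding sketch of the slide satisfaction test over sampled object states follows; the sample record, the circumference-based proportionality check, and the tolerance value are illustrative simplifications of the test above.

using System.Collections.Generic;
using UnityEngine;

// Sketch of the slide(y) satisfaction test over a sampled trajectory; illustrative.
public struct SlideSample {
    public Vector3 Location;
    public float   RotationAngle;   // accumulated rotation magnitude, degrees
    public bool    EC;              // externally connected to the surface at this step
}

public static class SlideSatisfaction {
    public static bool Satisfied(IList<SlideSample> samples, float circumference,
                                 float tolerance = 0.25f) {
        if (samples.Count < 2) return false;

        float pathLength = 0f;
        for (int t = 0; t < samples.Count; t++) {
            if (!samples[t].EC) return false;                      // EC(y,z)[0,n]?
            if (t > 0) pathLength += Vector3.Distance(samples[t].Location,
                                                      samples[t - 1].Location);
        }

        // loc(y)_n != loc(y)_0: there must be a net change of location.
        if (samples[samples.Count - 1].Location == samples[0].Location) return false;

        // rot(y) must NOT be proportional to the path: compare the accumulated
        // rotation with the rotation a roll over this path would have produced.
        float rollRotation = (pathLength / circumference) * 360f;
        float observed = samples[samples.Count - 1].RotationAngle - samples[0].RotationAngle;
        return Mathf.Abs(observed - rollRotation) > tolerance * rollRotation;
    }
}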


VoxML:
  roll
    LEX  = [ PRED = roll
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj
                      A3 = z:surface ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), while(EC(y, z), move(y), rotate(y)))] ] ]

DITL:
  roll(y)
    w_0 := loc(y), v_0 := rot(y);
    (w_t ≠ w_{t−1}; abs(v_t − v_0) > abs(v_{t−1} − v_0); EC(y,z))+

Figure 3.3: VoxML and DITL for roll(y)

(c) roll enforces the same EC constraint as slide, and also makes no mention of the speed or direction of object translocation. The moving object’s speed and direction of rotation are also not specified, but these are further constrained by the speed and direction of translocative motion when composed with the size of the object, so if a value is established for those parameters, the necessary direction and speed of rotation can be computed.

The DITL program requires a location change (w_t ≠ w_{t−1}), a rotation change in a consistent direction (abs(v_t − v_0) > abs(v_{t−1} − v_0)), and maintenance of the EC constraint (EC(y,z)), and so maps to the satisfaction test EC(y,z)[0,n]?, loc(y)_n ≠ loc(y)_0, rot(y) ∝ Σ_{t=0..n} (loc(y)_t − loc(y)_{t−1}). Unlike slide, the total rotation change must be proportional to the path shape, path length, and movement speed to properly be considered a “roll.”


VoxML:
  turn
    LEX  = [ PRED = turn
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), rotate(y))] ] ]

DITL:
  turn(y)
    w_0 := rot(y); (w_t ≠ w_{t−1})+

Figure 3.4: VoxML and DITL for turn(y)

(d) turn lexically singles out only the rotation of its argument. No reference is made to speed or direction of rotation. No reference is made to speed or direction of translocation either, but as the verb turn focuses only on the rotation, parameters of translocation theoretically have no bearing on the correctness of the event operationalization.

However, the DITL program (w_t ≠ w_{t−1})+ has only one parameter, the object’s rotation, and so maps to the satisfaction test rot(y)_n ≠ rot(y)_0.

VoxML:
  move
    LEX  = [ PRED = move
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), move(x))] ] ]

DITL:
  move(y)
    w_0 := loc(y), v_0 := rot(y); ((w_t ≠ w_{t−1}) ∨ (v_t ≠ v_{t−1}))+

Figure 3.5: VoxML and DITL for move(y)

(e) move is one of the most (perhaps the most) underspecified motion verbs, as all path and manner of motion verbs are special cases of move. Tautologically, any kind of motion can be a move;


this is reflected in the VoxML structure as well as the fact that move shows up in a number of Levin verb classes (viz. “slide”, “roll”, “object alternation”) (Levin, 1993), and maps to a wide variety of motion subtypes in VerbNet (Kipper et al., 2006). In bare move, speed and direction values are left unspecified, and a further open question is whether some other motion event may be enacted instead of the operationalization of move specifically and still satisfy the observer’s definition of move.

The DITL program ((w_t ≠ w_{t−1}) ∨ (v_t ≠ v_{t−1}))+ maps to the satisfaction test loc(y)_n ≠ loc(y)_0 ∨ rot(y)_n ≠ rot(y)_0. The location or rotation of the object (or both) may have changed, and it will have moved. As this test otherwise leaves the search space too wide open for a Monte Carlo method to provide much insight without generating a very large set of test simulations, we instead respecify “move” as a different event randomly selected from the test set, leaving “motion manner” as the underspecified parameter, as listed in Table 3.2 below.

The full list of programs and their satisfaction tests is given in Appendix B. Having determined the underspecified parameters for each predicate, the search space can further be constrained by filtering out parameters that, while not explicitly singled out in the linguistic utterance, can have their values inferred or calculated from known parameters. Examples of this type would be direction of motion in a [[PUT]] event or rotation speed in a [[ROLL]] event, which, as discussed in the sections above, are constrained by other parameters and must fall at a certain value in order for the event to execute. The full table of underspecified parameters tested on for each predicate is given below, along with their types. These parameters have been manually selected for saliency based on the criteria and rationale previously outlined in this section.

Program              Underspecified parameter  Type       Possible values
move(x)              motion manner             predicate  {turn(x), roll(x), slide(x), spin(x), lift(x), stack(x[]),
                                                           put(x,on(y)), put(x,in(y)), put(x,near(y)), lean(x,on(y)),
                                                           lean(x,against(y)), flip(x,edge(x)), flip(x,center(x))}
turn(x)              rot speed                 float      {(0..12.5]}
                     rot axis                  axis       {X, Y, Z}
                     rot angle                 float      {(0°..180°]}
                     rot dir                   sign       {+, -}
                     motion manner             predicate  {roll(x), spin(x), lean(x,on(y)), lean(x,against(y)),
                                                           flip(x,edge(x)), flip(x,center(x))}
roll(x)              transloc dir              3-vector   V ∈ {⟨[0..1], 0, [0..1]⟩ | C(x)/2 < mag(V) ≤ 1}
slide(x)             transloc speed            float      {(0..5]}
                     transloc dir              3-vector   V ∈ {⟨[0..1], 0, [0..1]⟩ | mag(V) ≤ 1}
spin(x)              rot angle                 float      {(180°..540°]}
                     rot speed                 float      {(0..12.5]}
                     rot axis                  axis       {X, Y, Z}
                     rot dir                   sign       {+, -}
                     motion manner             predicate  {roll(x)}
lift(x)              transloc speed            float      {(0..5]}
                     transloc dir              3-vector   {⟨0, y-y(x), 0⟩}
stack(x[])           placement order           list       {[1,2], [2,1]}
put(x,touching(y))   transloc speed            float      {(0..5]}
                     rel orientation           predicate  {left(y), right(y), behind(y), in front(y), on(y)}
put(x,on(y))         transloc speed            float      {(0..5]}
put(x,in(y))         transloc speed            float      {(0..5]}
put(x,near(y))       transloc speed            float      {(0..5]}
                     transloc dir              3-vector   V ∈ {⟨y-x(x), y-y(x), y-z(x)⟩ | d(x,y) < d(edge(s(y),y)),
                                                           IN(s(y)), ¬IN(y)}3
lean(x,on(y))        rot angle                 float      {[25°..65°]}
lean(x,against(y))   rot angle                 float      {[25°..65°]}
flip(x,edge(x))      rot axis                  axis       {X, Y, Z}
                     symmetry axis             axis       {X, Y, Z}
flip(x,center(x))    rot axis                  axis       {X, Y, Z}
                     symmetry axis             axis       {X, Y, Z}
close(x)             motion speed              float      manner = put: {(0..5]}; manner = turn: {(0..12.5]}
open(x)              motion speed              float      manner = put: {(0..5]}; manner = turn: {(0..12.5]}
                     transloc dir              3-vector   manner = put: {⟨y-x(x), y-y(x), y-z(x)⟩}
                     rot angle                 float      manner = turn: {(0°..180°)}

3s(y) represents the surface of the object currently supporting y.

Table 3.2: Program test set with underspecified parameters

In the case where the parameter is of type predicate, this requires the event to be respecified before executing (e.g., instantiating a [[MOVE]] event as a [[SPIN]]), and the respecified predicate may itself contain underspecified values, which must then be specified before execution. However, in the case where an underspecified event is respecified to another predicate that can itself be respecified to one of a set of events (e.g., respecifying [[MOVE]] to [[TURN]] which could itself be optionally realized as [[ROLL]], [[SPIN]], [[LEAN]], or [[FLIP]]), the respecified predicate is required to give value assignment to its underspecified parameters rather than attempting to respecify to a different predicate again. That is, a predicate cannot be respecified more than once, so if a [[MOVE]] is respecified to a [[TURN]], that [[TURN]] event must be given value assignment for its speed, axis, angle, and direction of rotation at that point. This is also why put(x,touching(y)) is not permitted as a respecification of move(x), as touching(y) itself requires respecification to another relation.

3.2 Operationalization

Each VoxML/DITL primitive program maps to a method executed by VoxSim. Each of these methods, written in C# (although Unity also supports JavaScript and Boo scripting), operationalizes the parameters and tests of the program in real time over the specified arguments. Underspecified variables (discussed in Section 3.1 and shown in Tables 3.2 and B.1) must be assigned values for the software to run.

Operationalization follows a two-pass process, which is initiated once an unexecuted event reaches the front of the event queue. First, the satisfaction conditions under which the event will be considered complete must be calculated, with regard to the particulars of the objects involved, and informed by forward composition from the object and event VoxML, and the parameters must be given value through Monte Carlo simulation value assignment. This results in the event being fully evaluated down to its first-order form at the front of the event queue (as discussed in Section 2.2.2). In the second pass, the now-fully evaluated first-order event assigns to its VoxML [[OBJECT]] arguments the transformations required to satisfy the event as calculated during the first pass.

In both passes, all predicates encountered are invoked in order to evaluate them. This allows for compactness and completeness of code and allows any changes to the verbal program to apply in both “evaluation” mode and “execution” mode without discrepancy. The only difference is that predicates invoked in “evaluation” mode are not allowed to make changes to the event manager/transition graph, with the exception of inserting events required to satisfy preconditions. This reserves dequeuing events for the “execution” mode.

An abridged, schematic C# operationalization of [[TURN]] follows on the next pages. The entire method is printed in Appendix C. Many calls and references are made to the VoxSim, Unity, and .NET APIs in the course of operationalizing a predicate, for calculating geometric values and parameterizing inputs to various subsystems, but the code demonstrates the two types of “turn” event discussed in Section 2.7 and the process of assigning random values to underspecified variables.

public void TURN(object[] args) {
    ...
    // look for agent
    ...
    // add preconditions
    ...
    // add postconditions
    ...
    // override physics rigging
    ...

    if (args [0] is GameObject) {
        GameObject obj = (args [0] as GameObject);
        Voxeme voxComponent = obj.GetComponent<Voxeme> ();
        if (voxComponent != null) {
            if (!voxComponent.enabled) {
                voxComponent.gameObject.transform.parent = null;
                voxComponent.enabled = true;
            }

            if (args [1] is Vector3 && args [2] is Vector3) {
                // args[1] is local space axis
                // args[2] is world space axis
                if (args [3] is Vector3) {
                    // args[3] is world space axis
                    sign = Mathf.Sign (Vector3.Dot (Vector3.Cross (
                        obj.transform.rotation * (Vector3)args [1],
                        (Vector3)args [2]), (Vector3)args [3]));
                    angle = Vector3.Angle (
                        obj.transform.rotation * (Vector3)args [1],
                        (Vector3)args [2]);
                    // rotation from object axis [1]
                    // to world axis [2]
                    // around world axis [3]

                    if (voxComponent.turnSpeed == 0.0f) {
                        voxComponent.turnSpeed = RandomHelper.RandomFloat (0.0f, 12.5f,
                            (int)RandomHelper.RangeFlags.MaxInclusive);
                    }

                    targetRotation = (Quaternion.AngleAxis (sign * angle,
                        (Vector3)args [3]) * obj.transform.rotation).eulerAngles;
                    rotAxis = Constants.Axes.FirstOrDefault (
                        a => a.Value == (Vector3)args [3]).Key;
                }
                else {
                    // rotation from object axis [1] to world axis [2]

                    if (voxComponent.turnSpeed == 0.0f) {
                        voxComponent.turnSpeed = RandomHelper.RandomFloat (0.0f, 12.5f,
                            (int)RandomHelper.RangeFlags.MaxInclusive);
                    }

                    targetRotation = Quaternion.FromToRotation (
                        (Vector3)args [1], (Vector3)args [2]).eulerAngles;
                    angle = Vector3.Angle ((Vector3)args [1], (Vector3)args [2]);
                }
            }
            else {
                if (voxComponent.turnSpeed == 0.0f) {
                    voxComponent.turnSpeed = RandomHelper.RandomFloat (0.0f, 12.5f,
                        (int)RandomHelper.RangeFlags.MaxInclusive);
                }

                targetRotation = (obj.transform.rotation *
                    UnityEngine.Random.rotation).eulerAngles;
                angle = Quaternion.Angle (transform.rotation,
                    Quaternion.Euler (targetRotation));
            }

            voxComponent.targetRotation = targetRotation;
        }
    }

    // add to events manager
    ...
    // record parameter values
    ...

    return;
}

Figure 3.6: C# operationalization of [[TURN]] (abridged)

Individual segments of the VoxML program map to individual segments of the linked DITL program (see Section 3.1 for examples), and segments of the DITL program map directly to the C# code.


• GameObject obj = (args [0] as GameObject) accesses the DITL variable y, the object;

• targetRotation = Quaternion.[...] specifies the nature of the DITL rotation update w_t ≠ w_{t−1};

• If three vector arguments are specified, then it is assumed that the third is the axis about which to constrain rotation. If only two are specified, rotation is calculated relative to the current orientation and takes the shortest path.

Due to the object-oriented nature of the VoxSim architecture, some generic operations are abstracted out to other classes rather than operationalized directly in the predicate. For example, the variable targetRotation in the above example sets the goal rotation of the object for the execution of the predicate (here [[TURN]]). The Voxeme “component” (Unity terminology for a certain type of member class instance that is updated every frame, providing a clean way of handling DITL and VoxML Kleene iterations) on the Unity GameObject that contains the VoxML [[OBJECT]] geometry is what actually handles the frame-to-frame update of the object geometry’s location and orientation. Since all geometries in the simulation are part of voxemes, which contain the Voxeme component, there is no need to handle the state-to-state update in the predicate itself, and the operationalization only needs to specify the nature of the update between the start and end states. Predicates for all VoxML entity types can be operationalized, subject to their typing constraints. For example, the declaration for ON, public Vector3 ON(object[] args), shows that the operationalization takes an object or list of objects and returns a location, which is in line with the configurational CLASS of the VoxML relation [[ON]].
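To give a concrete sense of what such a typed operationalization might look like, the following is a minimal, hypothetical sketch (not the actual VoxSim implementation; the use of renderer bounds here is an assumption made purely for illustration) of an ON-style method that returns a location on the top surface of its destination argument. It assumes using UnityEngine; and elides the handling of object lists.

public Vector3 ON (object[] args) {
    // assume the last argument is the destination object
    GameObject dest = args [args.Length - 1] as GameObject;
    if (dest == null) {
        return Vector3.zero;
    }

    // use the rendered bounds of the destination to find its top surface
    Bounds destBounds = dest.GetComponent<Renderer> ().bounds;

    // the configurational relation resolves to a point on top of the destination,
    // which a calling predicate (e.g., [[PUT]]) can use as a goal location
    return new Vector3 (destBounds.center.x, destBounds.max.y, destBounds.center.z);
}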

3.3 Monte Carlo Simulation

Using the list of test sentences, the list of underspecified parameters and satisfaction conditions for each predicate, and their operationalizations in code, sets of visualized simulations were created for each sentence with random values assigned for each underspecified parameter in the event predicate. These simulations were evaluated for the best match between visualization and test sentence, should one or a set thereof exist.


Parameters were randomized in two ways:

(a) If a numerical/vector value is needed for a parameter such as speed or direction, it is assigned by a random number generator. The range of possible random values was constrained to values that allow the motion from event start to event end to be completed in under 15 seconds (this value was chosen arbitrarily to allow evaluators to complete their tasks in a reasonable amount of time);

(b) If the predicate denotes an underspecified manner of motion, a different predicate that satisfies the remainder of the original predicate’s satisfaction test was chosen at random from the available specification set and then executed instead of the predicate from the input.

The value ranges available are shown in Table 3.2. Randomization was performed using the Unity engine’s built-in randomizer, which uses a uniform distribution, in line with standard Monte Carlo methods (Sawilowsky, 2003). Resampling was allowed in cases where the randomly generated value violated some constraint on the predicate (e.g., generating a location inside another object or off the table). All simulations were run in the same environment, with only the motion predicate and object participants changing. Objects were initialized in an X- and Z-axis-aligned grid pattern, as shown in Figure 3.7. During capture of an event, all objects not involved in the event were removed from the scene, as shown in Figure 3.8. Most events finished executing in 1-3 seconds. Video capture was automatically stopped after 15 seconds if the event had not yet completed.
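The resampling strategy can be pictured as the following schematic sketch (the class, parameter, and constraint-test names here are illustrative assumptions, not VoxSim’s actual code): a candidate value is drawn uniformly from the allowed range and redrawn until it violates no constraint on the predicate.

// assumes: using UnityEngine;
public static class PlacementSampler {
    // Draw a uniformly random target location on the table surface, resampling
    // while the candidate violates a constraint (e.g., it falls inside another
    // object or off the table). Assumes the constraint is satisfiable.
    public static Vector3 SampleLocation (Bounds tableRegion, float surfaceY,
        System.Func<Vector3, bool> violatesConstraint) {
        Vector3 candidate;
        do {
            candidate = new Vector3 (
                Random.Range (tableRegion.min.x, tableRegion.max.x),
                surfaceY,
                Random.Range (tableRegion.min.z, tableRegion.max.z));
        } while (violatesConstraint (candidate));
        return candidate;
    }
}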


Figure 3.7: Test environment with all objects shown

Figure 3.8: Snapshot from video capture in progress

3.3.1 Automated Capture

Video capture was performed using the Flashback Video Recorder, an FFmpeg-based Unity package from LaunchPoint Games (http://launchpointgames.com/unity/flashback.html). Three videos were captured for each input sentence in the test set, each with values assigned anew to its underspecified parameters. Unity generates the parameters as discussed above and logs them to a SQL database, along with the path of the video file, the input string, the parsed predicate-logic form of the string, the parsed form with objects resolved to their internal unique names, the event predicate alone, the objects involved in the current simulation, and a feature vector of all underspecified parameters and their value assignments.

As each event is completed (or in rare cases, times out before completing), VoxSim writes all changes to the database, requests a new event from the event spawner script over the communications bridge (which the event spawner retrieves from the input sentence lists), waits two seconds to allow Flashback to prepare to capture the event, and repeats the process until the list of input sentences is exhausted. Figure 3.9 shows a diagram of this process. The label at the bottom of each green box shows the language or framework that the component contained in that box runs on.

Figure 3.9: Automatic capture process diagram (components: Input Lists [Bash], Event Spawner [Python], Communications Bridge [C++], VoxSim [Unity/C#], database [SQL])
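The capture cycle itself can be sketched as a Unity coroutine along the following lines (the class and method names below are assumptions standing in for the actual VoxSim and event spawner scripts, which communicate over the bridge shown in Figure 3.9):

// assumes: using System.Collections; using UnityEngine;
public class CaptureCycle : MonoBehaviour {
    IEnumerator CaptureLoop () {
        while (HasPendingInput ()) {                               // hypothetical: sentences remain
            string sentence = RequestNextEvent ();                 // hypothetical: ask the event spawner
            yield return new WaitForSeconds (2.0f);                // let Flashback prepare to record
            yield return StartCoroutine (RunAndRecord (sentence)); // hypothetical: enact event, capture video
            LogToDatabase (sentence);                              // hypothetical: write parameters and video path
        }
    }

    // stubs standing in for the real components
    bool HasPendingInput () { return false; }
    string RequestNextEvent () { return ""; }
    IEnumerator RunAndRecord (string sentence) { yield break; }
    void LogToDatabase (string sentence) { }
}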

The final dataset consisted of 3,357 videos of simulated motion events. The distribution per predicate is given in Table 3.3.

Program                # videos captured
move(x)                45
turn(x)                45
roll(x)                45
slide(x)               45
spin(x)                45
lift(x)                45
put(x,touching(y))     630
put(x,on(y))           582
put(x,in(y))           174
put(x,near(y))         580
lean(x,on(y))          552
lean(x,against(y))     503
flip(x,edge(x))        9
flip(x,center(x))      36
close(x)               9
open(x)                12


Table 3.3: Number of videos captured per motion predicate

3.3.2 Feature Vectors

For each video, the SQL database stores a sparse feature vector saved as a JSON string. Parameters that are fully specified by the predicate itself are left blank in the vector and only those features requiring value assignment are stored for that event instance. For uniformity, vectors are saved as “densified” vectors where the non-valued features are left as empty strings.

{
  "MotionSpeed":"12.21398",
  "MotionManner":"turn(front cover)",
  "TranslocSpeed":"",
  "TranslocDir":"",
  "RotSpeed":"",
  "RotAngle":"104.7686",
  "RotAxis":"",
  "RotDir":"",
  "SymmetryAxis":"",
  "PlacementOrder":"",
  "RelOrientation":"",
  "RelOffset":""
}

Figure 3.10: “Densified” feature vector for “open the book”

The densified vectors are made properly sparse for evaluation, meaning that some show many specified features where others may display only one.

{
  "MotionManner":"put(grape,near(apple))",
  "TranslocSpeed":"3.62548",
  "TranslocDir":"<0.433338; 8.106232E-05; -0.1310182>",
  "RelOffset":"<-1.066662; -0.06591892; -0.6310182>"
}

Figure 3.11: Sparse feature vector for “move the grape”


{ "TranslocSpeed":"1.944506" }

Figure 3.12: Sparse feature vector for “put the block in the plate”
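Making a densified vector properly sparse amounts to dropping the empty-valued keys. A minimal sketch of that step follows (using a plain dictionary for illustration rather than whatever JSON handling the evaluation code actually performs):

// assumes: using System.Collections.Generic; using System.Linq;
public static class FeatureVectors {
    // Drop features whose values are empty strings, leaving only the
    // underspecified parameters that actually received value assignments.
    public static Dictionary<string, string> Sparsify (Dictionary<string, string> densified) {
        return densified
            .Where (kv => !string.IsNullOrEmpty (kv.Value))
            .ToDictionary (kv => kv.Key, kv => kv.Value);
    }
}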

3.3.3 Alternate Descriptions

After the event test set visualizations were captured for evaluation with respect to their original inputs, two alternate “captions” for each event were procedurally generated, for a total of three description options per video. Heuristics for generating the candidate descriptions were as follows:

1. One candidate is always the true original input sentence.

2. If the simulation required a respecification of the event to another predicate, one candidate sentence is constructed out of that respecification. For example, the “move the grape” event reflected in Figure 3.11 would include “put the grape near the apple” as a candidate sentence.

3. If the original input contains a prepositional adjunct, one candidate sentence is constructed by alternating that preposition with another that co-occurs with the same event predicate in the test set (i.e., “put on” vs. “put in/touching/near” or “lean on” vs. “lean against”).

4. If the number of candidate sentences is less than 3, choose at random another predicate from the test set and apply it to the theme object of the original input.

5. Repeat steps 3 and 4 until the number of candidate sentences reaches 3.

The three sentences were then put in a randomized order before being logged to a new table in the same SQL database as the feature vectors and video file path information.
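The heuristics above can be summarized in the following schematic sketch (the helper delegates and the shuffle step are illustrative assumptions; the sketch also assumes the helpers can always supply a novel sentence, mirroring steps 3-5):

// assumes: using System.Collections.Generic; using UnityEngine;
public static class CandidateDescriptions {
    // Build exactly three candidate captions for one captured event.
    public static List<string> Generate (string original, string respecification,
        System.Func<string, string> swapPreposition,      // heuristic 3
        System.Func<string, string> randomPredicateFor) { // heuristic 4
        List<string> candidates = new List<string> () { original };      // heuristic 1
        if (!string.IsNullOrEmpty (respecification)) {
            candidates.Add (respecification);                             // heuristic 2
        }
        while (candidates.Count < 3) {                                    // heuristic 5
            string alternate = swapPreposition (original);
            if (string.IsNullOrEmpty (alternate) || candidates.Contains (alternate)) {
                alternate = randomPredicateFor (original);
            }
            if (!candidates.Contains (alternate)) {
                candidates.Add (alternate);
            }
        }
        // put the three sentences in a randomized order (Fisher-Yates shuffle)
        for (int i = candidates.Count - 1; i > 0; i--) {
            int j = Random.Range (0, i + 1);
            string tmp = candidates [i];
            candidates [i] = candidates [j];
            candidates [j] = tmp;
        }
        return candidates;
    }
}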

3.4 Evaluation

All code written for preprocessing and the human- and machine-driven evaluation tasks is available at https://github.com/nkrishnaswamy/thesis-docs-utils. The actual videos captured are stored on Amazon S3 (Simple Storage Service) from Amazon Web Services, at https://s3.amazonaws.com/voxsim-videos/. The link will lead to the document tree in which individual videos are sorted by the Key field.

3.4.1 Human-Driven Evaluation

Human-driven evaluation was conducted using the Amazon Mechanical Turk (MTurk) platform, which is widely considered to be a good source of high-quality, inexpensive data (Buhrmester et al., 2011). A series of Amazon MTurk HITs (Human Intelligence Tasks) were used to assess the correctness of each visualization relative to a description, or of each description in a set relative to a visualization. Human judgments of a visualization are given as “acceptable” or “unacceptable” relative to the event’s linguistic description, whereas human judgments of a sentence are given as “acceptable” or “unacceptable” relative to a provided visualization of an event.

Human Evaluation Task #1 (HET1)

In the first evaluation task, “Turkers” (judges, or workers) were asked to choose which of A, B, and C, the three animated visualizations generated from a single input sentence, best depicted the sentence. The instructions given were:

Below is a sentence and three videos generated by an automatic natural language pro- cessing and simulation system using that sentence as input. Read the sentence, then watch each of the videos. Choose which of the videos, A, B or C, you think best shows what the sentence is saying.

• There are options if you think multiple videos satisfy the description equally well.

• “None” is also a valid choice if you think none of the videos accurately depict what is described.

• Use your intuition. It’s better to go with your first instinct than to overthink the answer.

The multiple choice options offered were, where $DESCRIPTION refers to the input sentence for the individual HIT:


• Video A best represents $DESCRIPTION

• Video B best represents $DESCRIPTION

• Video C best represents $DESCRIPTION

• Videos A and B represent $DESCRIPTION equally well

• Videos A and C represent $DESCRIPTION equally well

• Videos B and C represent $DESCRIPTION equally well

• All videos represent $DESCRIPTION equally well

• None of these videos represent $DESCRIPTION well

Workers could also optionally briefly explain their answers.


Figure 3.13: HET1 task interface


Results from this experiment are presented in Section 4.1.

Human Evaluation Task #2 (HET2)

The second evaluation task inverts HET1, giving workers a single video and the three associated candidate captions generated as described in Section 3.3.3. Workers were asked to choose the best description(s) for the video, if any. The instructions given were:

Below is a video generated by an automatic natural language processing and simula- tion system. Watch the video, then choose the sentence, 1, 2 or 3, that best describes what’s being done in the video.

• There are options if you think multiple sentences describe the video equally well.

• “None” is also a valid choice if you think none of the sentences accurately describe what is depicted.

• Use your intuition. It’s better to go with your first instinct than to overthink the answer.

The multiple choice options offered were, where $CANDIDATE{1, 2, 3} refer to the candidate sentences:

• $CANDIDATE1 best describes the events in the video

• $CANDIDATE2 best describes the events in the video

• $CANDIDATE3 best describes the events in the video

• $CANDIDATE1 and $CANDIDATE2 describe the events in the video equally well

• $CANDIDATE1 and $CANDIDATE3 describe the events in the video equally well

• $CANDIDATE2 and $CANDIDATE3 describe the events in the video equally well

• All sentences describe the events in the video equally well

• None of these sentences describe the events in the video well


As in HET1, workers could also optionally briefly explain their answers. Results from this experiment are presented in Section 4.2.

Figure 3.14: HET2 task interface

For these experiments, the number of possible visualizations per description or possible descriptions per visualization in each HIT was set to 3 in order to keep the number of choices per HIT as low as possible while still allowing for more than a pairwise comparison. Each HIT was completed by 8 individual workers, for a total of 35,808 individual evaluations: 8,952 for HET1 and 26,856 for HET2. Workers were paid $0.01 per individual task. Results underwent a QA process based on methods similar to those developed by Ipeirotis et al. (2010).


3.4.2 Automatic Evaluation

Automatic Evaluation Task (AET)

HET2 effectively requires annotators to predict which sentence was used to generate the visualization in question. Annotator agreement on a “most correct” description that correctly selects the original sentence indicates an event visualized (with variable values for underspecified parameters) that conforms to human notions, while disagreement, or agreement on an incorrect selection for the input sentence, should indicate a high likelihood that one or more variable values fall outside of prototypical ranges, resulting in confusion on the judges’ part as to what the original predicate was. As this closely resembles a classification task with a discrete label set, it is possible to create an analogous task using machine learning. Automatic evaluation allows us to quickly assess more visualizations per input sentence than human evaluation methods, but, as mentioned, runs the risk of evaluating according to an overfitted model. However, HET2 provides a human-evaluated dataset, and with it a unique opportunity to compare the machine learning results with a well-suited gold standard.

For this machine evaluation, a baseline maximum entropy logistic regression classifier was constructed using the Python Natural Language Toolkit (NLTK) (Loper and Bird, 2002) that took in the feature vectors collected during the automatic event capture process and, for each vector, predicted the most likely original input sentence from the three candidates provided to evaluators, as well as choosing the most likely input from all 1,119 input sentences in the test set. 10-fold cross-validation with a convergence threshold of .0001 and a cutoff of 1,000 training steps was conducted on three levels of granularity.

1. Predicting the verb only

2. Predicting the verb plus prepositional adjunct, if one exists

3. Predicting the entire input sentence

Once a baseline accuracy was captured, a multilayer neural network was constructed, consisting of four layers of 10, 20, 20, and 10 nodes, respectively, using the TensorFlow framework (Abadi et al., 2016). Using the aforementioned feature vectors as input, several variations of this network were run:


1. A “vanilla” four-layer DNN

2. DNN with features weighted by IDF metric (see below)

3. DNN with IDF weights on the discrete features only4

4. DNN excluding feature values and including IDF-weighted binary presence or absence only

5. A combined linear-DNN classifier, using linear estimation for continuous features and DNN classification for discrete features

6. Combined linear-DNN classifier with features weighted by IDF metric

7. Combined linear-DNN classifier with IDF weights on the discrete features only

8. Combined linear-DNN classifier excluding feature values and including IDF-weighted binary presence or absence only

IDF as a feature weight Following the intuition that the presence or absence of an underspecified parameter feature can be a strong predictor of the type of motion class, and the fact that certain classes of underspecified parameters occur across multiple motion classes while others are more specific, it becomes clear that this is effectively a term frequency-inverse document frequency metric of informativity across motion classes, where the feature vector is the “document” and the feature is the “term.” Moreover, since each feature occurs a maximum of one time in each feature vector, tf for any feature and any vector is 1, leaving tf × idf = idf = log(|D| / |{d ∈ D : t ∈ d}|), where D is the “corpus” of feature vectors, d is an individual feature vector, and t is an individual feature, as a coarse-grained informativity metric of a given feature across this dataset.
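As a worked illustration with an invented count (the 500 below is hypothetical and not drawn from the dataset), a feature such as RotAngle appearing in 500 of the 3,357 captured feature vectors would receive, using a natural logarithm,

\[ \mathit{idf} = \log\frac{|D|}{|\{d \in D : t \in d\}|} = \log\frac{3357}{500} \approx 1.9, \]

while a feature present in nearly every vector would receive a weight near 0, reflecting its low informativity about the motion class.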

10-fold cross-validation was run on all these neural net classifier variations at 1,000, 2,000, 3,000, 4,000, and 5,000 training steps, for a total of 1,200 iterations. Results from this experiment are presented in Section 4.3.

4“Discrete” features are considered to be those features that are maximally specified by making a value assignment choice out of a set of categories. These features are motion manner, rotation axis, symmetry axis, placement order, and relative orientation. All others are considered to be “continuous” features.

Chapter 4

Results and Discussion

Raw and collated data from all tasks is available at https://github.com/nkrishnaswamy/thesis-docs-utils. This chapter presents a selection of the most interesting and informative results for each of the motion predicates in the study. Complete data is available as evaluation logs and SQL databases, containing data conditioned on every combination of recorded features, in the /docs/UnderspecificationAnalysis/ directory of the GitHub repository above.

As discussed in the footnote in Section 3.4.2, the features logged may be divided into “discrete” features, those features that are maximally specified by making a value assignment choice out of a set of categories, and “continuous” features, those features that take values in a continuous range. For analysis, continuous feature values were all plotted as a probability density over the relevant continuous random variable and partitioned into subsets. This evaluation uses quintiles (q = 5), although other quantiles and partitions can be easily generated by passing alternate parameters to the evaluation scripts het1-generic-eval.py and het2-generic-eval.py (source code available on GitHub in the /utils/analysis/human-eval/ directory of the repository listed above). As the Automatic Evaluation Task is a neural net classifier well-equipped to handle continuous features, this partitioning was not required for that task.
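The quintile partition itself is straightforward to compute; the following is an illustrative C# sketch (not the het1-generic-eval.py/het2-generic-eval.py scripts themselves, which are written in Python) of deriving the four interior cut points QU1-QU4 from a logged sample of feature values, using a simple nearest-rank convention:

// assumes: using System.Collections.Generic; using System.Linq;
public static class Quantiles {
    // Return the q-1 interior cut points (q = 5 gives QU1..QU4) for a
    // non-empty sample, using the nearest-rank convention on the sorted values.
    public static List<float> CutPoints (IEnumerable<float> values, int q = 5) {
        List<float> sorted = values.OrderBy (v => v).ToList ();
        List<float> cuts = new List<float> ();
        for (int k = 1; k < q; k++) {
            int index = (int)System.Math.Ceiling ((double)k * sorted.Count / q) - 1;
            cuts.Add (sorted [index]);
        }
        return cuts;
    }
}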

4.1 Human Evaluation Task 1 Results

The tables below show the probability that an arbitrary judge would, given a description and a Monte Carlo visualization generated from that description, judge that visualization as best depicting the description, conditioned on various relevant parameters in the visualization. As multiple choices were allowed and eight evaluators judged each task, probabilities in each table will likely not sum to 1.
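The estimator behind these probabilities is not spelled out explicitly, but the reading most consistent with the setup above is the observed relative frequency of acceptable judgments under each condition; that is, for a conditioning parameter value c, presumably

\[ P(\mathrm{acc} \mid c) \approx \frac{\#\{\text{judgments rating a visualization with condition } c \text{ as acceptable}\}}{\#\{\text{judgments of visualizations with condition } c\}}. \]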

4.1.1 “Move”

Pred     P(acc|Pred)
stack    0.4375
flip     0.4539
spin     0.4875
roll     0.5066
lean     0.5769
slide    0.6563
turn     0.6818
lift     0.7500
put      0.7857

µ ≈ 0.5929, σ ≈ 0.1304

Table 4.1: Acceptability judgments and statistical metrics for “move x” visualizations, conditioned on respecification predicate

As “move” is almost fully unspecified, the simulator was always required to enact the verb as an instance of a different predicate. Results show that an arbitrarily chosen evaluator was significantly more likely to judge a visualization as acceptable for “move” if it was respecified as a “put,” “lift,” “turn,” or “slide” than as a “stack,” “flip,” “spin,” or “roll,” with “lean” falling closest to the mean.

• P(acc|put) ≈ 0.7857 ≈ µ + 1.48σ

• P(acc|stack) ≈ 0.4375 ≈ µ - 1.19σ
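These offsets are simply the z-scores of the table values against the population statistics; with µ ≈ 0.5929 and σ ≈ 0.1304 from Table 4.1,

\[ \frac{0.7857 - 0.5929}{0.1304} \approx 1.48, \qquad \frac{0.4375 - 0.5929}{0.1304} \approx -1.19. \]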

No clear or obvious trend based on the features of individual respecifications emerges, although events involving a translocation or translocation sequence (“put,” “lift,” “slide”) do appear to be preferred over events involving a rotation or rotation sequence (“flip,” “spin,” “roll”), with “lean,” which as demonstrated in Section 2.7 involves both a rotation sequence and a translocation, falling somewhere in the middle. This may suggest that humans have a preference for translocations over rotations as prototypical instantiations or mental simulations of “move.” The exceptions, “stack” and “turn,” may then potentially be explained as follows:

• “Stack” randomizes the placement order of the objects. If, for example, “move the block” is respecified as “stack the block and the x,” where x is some randomly chosen other object, and the randomly-assigned placement order feature places the block on the bottom of the stack, then x, not the block, becomes the moving object, and judges are unlikely to evaluate such a visualization as acceptable for “move the block.” Conditioning on placement order would reveal such circumstances.

• When “move” is respecified as “turn,” “turn” is required not to be respecified into another of its own subclasses (see Section 3.1.2), meaning that “move” in this case is respecified to a random rotation, or minimally specified turning. That this type of “turn” is considered more acceptable than, say, “spin,” “roll,” or “flip” may suggest an aversion to overspecifying an underspecified predicate such as “move.” Interpreting a “move” as a “spin,” “roll,” or “flip” is somewhat less perspicacious about the intent of labeling something a “move” than a more generic interpretation of “turn” is.

4.1.2 “Turn”

Pred    P(acc|Pred)
spin    0.4875
roll    0.5066
lean    0.5769
flip    0.6195

µ ≈ 0.5476, σ ≈ 0.0614

Table 4.2: Acceptability judgments and statistical metrics for “turn x” visualizations, conditioned on respecification predicate

We see very similar results for “turn” as for “move” (identical, in fact, for all predicate respecification options except “flip”), due most likely to the fact that the same number of total visualizations (45) were generated for “turn x” as for “move x” over all x. “Lean” once again is the respecification that falls closest to the mean probability of acceptability for all respecified turns, although “flip” is preferred, falling approximately 1.17 standard deviations above the mean.

• P(acc|flip) ≈ 0.6195 ≈ µ + 1.17σ

Assessing this qualitatively, it can be conjectured that “flip” is a very obvious or dramatic kind of turn. “Spin” could be argued to be this, too, but is in fact the least preferred, falling 0.98σ below the mean probability of acceptability for all respecified turns.

• P(acc|spin) ≈ 0.4875 ≈ µ - 0.98σ

This may be because “spin” is overspecified/less perspicacious relative to a generic “turn,” especially if it rotates past 360° away from its starting orientation, passing out of the range of “turn” into a definitive “spin.”

Rot          P(acc|Rot)
(0,QU1)      0.5179
(QU1,QU2)    0.5500
(QU2,QU3)    0.6637
(QU3,QU4)    0.6198
(QU4,∞)      0.4338

µ ≈ 0.5570, σ ≈ 0.0896

Table 4.3: Acceptability judgments and statistical metrics for unrespecified “turn x” visualizations, conditioned on rotation angle

For the instances of “turn” that were not respecified to a different predicate, no clear trend appears based on the length of the rotation enacted, although evaluators did seem more likely to judge as acceptable turns that ended between QU2 and QU4, or moderate-to-long turns, while leaning against turns in the highest interval, perhaps because, like “spin” above, the rotation went on long enough to change the judge’s qualitative labeling of the event. Evaluators seem to prefer events like this that are moderate in duration. Continuing continuously iterated events like “turn” for too long may in fact change the event class in most people’s minds.

• P(acc|(QU2,QU3)) ≈ 0.6637 ≈ µ + 1.19σ

• P(acc|(QU3,QU4)) ≈ 0.6198 ≈ µ + 0.70σ

• P(acc|(QU4,∞)) ≈ 0.4338 ≈ µ - 1.38σ


4.1.3 “Roll”

Dist         P(acc|QU)
(0,QU1)      0.4539
(QU1,QU2)    0.5319
(QU2,QU3)    0.4830
(QU3,QU4)    0.4800
(QU4,∞)      0.6208

µ ≈ 0.5139, σ ≈ 0.0661

Table 4.4: Acceptability judgments and statistical metrics for “roll x” visualizations, conditioned on path length

Since all parameters of “roll” can be calculated from object parameters given the length of the path traveled, we can condition acceptability judgments on this parameter; however, there does not appear to be a clear trend. Rolls in the longest interval appear to be quite strongly preferred by judges, as P(acc|(QU4,∞)) ≈ 0.6208 ≈ µ + 1.62σ, perhaps due to the very evident and obvious nature of a rolling motion along a long path. This observation, coupled with the preference for “flip” as a “turn,” may suggest obviousness as a criterion for prototypicality of a motion event.

4.1.4 “Slide”

Speed        P(acc|QU)
(0,QU1)      0.5166
(QU1,QU2)    0.6056
(QU2,QU3)    0.6110
(QU3,QU4)    0.6107
(QU4,∞)      0.5715

µ ≈ 0.5831, σ ≈ 0.0406

Table 4.5: Acceptability judgments and statistical metrics for “slide x” visualizations, conditioned on translocation speed

For “slide,” no clear pattern appeared when conditioning on path length. When conditioning on translocation speed, we again see a preference for the middle three intervals. The lowest (here slowest) interval is the least preferred (P(acc|(0,QU1)) ≈ 0.5166 ≈ µ - 1.64σ). If the speed value generated is close enough to 0, it may be hard to see the object moving at all. The highest (fastest) interval falls closest to the mean probability for acceptability. This may exhibit the balancing act between preference for a moderate speed of motion and the “obvious” sliding that a fast motion speed would also demonstrate.

4.1.5 “Spin”

Pred    Dist         P(acc|“roll”,QU)
roll    (0,QU1)      0.2500
roll    (QU1,QU2)    0.5625
roll    (QU2,QU3)    0.5313
roll    (QU3,QU4)    0.4063
roll    (QU4,∞)      0.6250

µ ≈ 0.4750, σ ≈ 0.1489

Table 4.6: Acceptability judgments and statistical metrics for “spin x” visualizations respecified as “roll x,” conditioned on path length

Axis    P(acc|Axis)
X       0.6466
Y       0.5263
Z       0.5947

Table 4.7: Acceptability judgments for unrespecified “spin x” visualizations, conditioned on rotation axis

“Spin” may optionally be respecified as a “roll,” as shown in Table 4.6; in all other cases the object is spun around a random axis (Table 4.7). Spins respecified as “roll” show a strong dispreference for rolls that travel the shortest distances and, with the exception of a drop in the (QU3,QU4) interval, an overall increase in probability of acceptability as path length increases.

• P(acc|(0,QU1)) ≈ 0.2500 ≈ µ - 1.51σ

• P(acc|(QU4,∞)) ≈ 0.6250 ≈ µ + 1.01σ

In cases (which were the majority of cases) where “spin” was enacted as a rotation around a randomly-chosen axis, evaluators preferred rotation around the X axis by over 5% when compared to the Z axis, and by over 12% when compared to rotation around the Y axis. Some of this may be due to the positioning of the camera, looking down the Z axis at the intersection of the X and Y axes, where asymmetrical objects that rotate toward the camera do so very evidently, invoking the obviousness criterion discussed above with a factor of perspective independence. For objects that look the same from all sides, such as a plate, it may have been difficult for judges to see the rotation around the Y axis.

4.1.6 “Lift”

Speed        Dist         P(acc|QUs,QUd)
(0,QU1)      (0,QU1)      0.4511
(0,QU1)      (QU2,QU3)    0.3750
(0,QU1)      (QU3,QU4)    0.4537
(0,QU1)      (QU4,∞)      0.5625
(QU1,QU2)    (0,QU1)      0.5161
(QU1,QU2)    (QU1,QU2)    0.3750
(QU1,QU2)    (QU2,QU3)    0.4063
(QU1,QU2)    (QU3,QU4)    0.4712
(QU1,QU2)    (QU4,∞)      0.6898
(QU2,QU3)    (0,QU1)      0.4271
(QU2,QU3)    (QU1,QU2)    0.5187
(QU2,QU3)    (QU2,QU3)    0.5648
(QU2,QU3)    (QU3,QU4)    0.4658
(QU3,QU4)    (0,QU1)      0.4063
(QU3,QU4)    (QU1,QU2)    0.6193
(QU3,QU4)    (QU2,QU3)    0.4756
(QU3,QU4)    (QU3,QU4)    0.5242
(QU3,QU4)    (QU4,∞)      0.5789
(QU4,∞)      (QU2,QU3)    0.5000
(QU4,∞)      (QU3,QU4)    0.3728
(QU4,∞)      (QU4,∞)      0.6000

µs,d ≈ 0.5017, σs,d ≈ 0.0839
µs=(0,QU1) ≈ 0.4606, σs=(0,QU1) ≈ 0.0771
µs=(QU1,QU2) ≈ 0.5279, σs=(QU1,QU2) ≈ 0.1063
µs=(QU2,QU3) ≈ 0.4941, σs=(QU2,QU3) ≈ 0.0603
µs=(QU3,QU4) ≈ 0.5209, σs=(QU3,QU4) ≈ 0.0840
µs=(QU4,∞) ≈ 0.4909, σs=(QU4,∞) ≈ 0.1139
µd=(0,QU1) ≈ 0.4502, σd=(0,QU1) ≈ 0.0476
µd=(QU1,QU2) ≈ 0.5043, σd=(QU1,QU2) ≈ 0.1228
µd=(QU2,QU3) ≈ 0.4643, σd=(QU2,QU3) ≈ 0.0756
µd=(QU3,QU4) ≈ 0.4575, σd=(QU3,QU4) ≈ 0.0545
µd=(QU4,∞) ≈ 0.6078, σd=(QU4,∞) ≈ 0.0568

Table 4.8: Acceptability judgments and statistical metrics for “lift x” visualizations, conditioned on translocation speed and distance traversed

Aside from a (fairly strong) preference for “longer” lifts (µd=(QU4,∞) ≈ 0.6078 ≈ µs,d + 1.26σs,d), no clear patterns emerge when conditioning “lift” instances on translocation speed or distance traveled alone. When conditioning on the joint probability of both parameters, we find that evaluators were less likely to rate a “lift” event as acceptable when speed fell in (0,QU1) and distance fell in (QU1,QU2), when speed fell in (QU1,QU2) and distance fell in (QU1,QU2), or when speed fell in (QU4,∞) and distance fell in (QU3,QU4). The overall distribution is fairly random, with relatively high standard deviations across all speed and distance intervals, but these figures seem to suggest a slight preference for “moderate” or “average” speed and distance values for “lift.”

4.1.7 “Put”

QSR (start)    P(acc|QSR)    QSR (end)      P(acc|QSR)
behind(y)      0.5497        behind(y)      0.5474
in front(y)    0.5692        in front(y)    0.5816
left(y)        0.5753        left(y)        0.4995
right(y)       0.5725        right(y)       0.5560
on(y)          N/A           on(y)          0.6683

µstart ≈ 0.5667, σstart ≈ 0.0116; µend ≈ 0.5725, σend ≈ 0.0628

Table 4.9: Acceptability judgments and statistical metrics for “put x touching y” visualizations, conditioned on relations between x and y at event start and completion


Movement (M)           P(acc|M)
behind→behind(y)       0.5347
behind→in front(y)     0.4758
behind→left(y)         0.5014
behind→right(y)        0.4888
behind→on(y)           0.7453
in front→behind(y)     0.4523
in front→in front(y)   0.6447
in front→left(y)       0.4601
in front→right(y)      0.5756
in front→on(y)         0.6234
left→behind(y)         0.5732
left→in front(y)       0.5853
left→left(y)           0.5266
left→right(y)          0.5211
left→on(y)             0.6492
right→behind(y)        0.5406
right→in front(y)      0.5786
right→left(y)          0.4777
right→right(y)         0.5847
right→on(y)            0.7081

µM ≈ 0.5624, σM ≈ 0.0811
µ→beh ≈ 0.5252, σ→beh ≈ 0.0515
µ→fr ≈ 0.5711, σ→fr ≈ 0.0701
µ→l ≈ 0.4911, σ→l ≈ 0.0289
µ→r ≈ 0.5426, σ→r ≈ 0.0455
µ→on ≈ 0.6815, σ→on ≈ 0.0554

Table 4.10: Acceptability judgments and statistical metrics for “put x touching y” visualizations, conditioned on x movement relative to y

Table 4.10: Acceptability judgments and statistical metrics for “put x touching y” visualizations, conditioned on x movement relative to y

Dist (start)    P(acc|QU)    Dist (end)     P(acc|QU)
(0,QU1)         N/A          (0,QU1)        0.7523
(QU1,QU2)       0.3542       (QU1,QU2)      0.6207
(QU2,QU3)       0.3829       (QU2,QU3)      0.3890
(QU3,QU4)       0.4444       (QU3,QU4)      0.3655
(QU4,∞)         0.4470       (QU4,∞)        0.1295

µstart ≈ 0.4071, σstart ≈ 0.0461; µend ≈ 0.4514, σend ≈ 0.2419

Table 4.11: Acceptability judgments and statistical metrics for “put x near y” visualizations, con- ditioned on distance between x and y at event start and completion


Movement (M)             P(acc|M)
(QU1,QU2)→(0,QU1)        0.7625
(QU1,QU2)→(QU1,QU2)      0.4044
(QU1,QU2)→(QU2,QU3)      0.2232
(QU1,QU2)→(QU3,QU4)      0.1667
(QU1,QU2)→(QU4,∞)        0.0682
(QU2,QU3)→(0,QU1)        0.6848
(QU2,QU3)→(QU1,QU2)      0.5703
(QU2,QU3)→(QU2,QU3)      0.3750
(QU2,QU3)→(QU3,QU4)      0.2788
(QU2,QU3)→(QU4,∞)        0.1488
(QU3,QU4)→(0,QU1)        1.000
(QU3,QU4)→(QU1,QU2)      0.3750
(QU3,QU4)→(QU2,QU3)      0.3750
(QU3,QU4)→(QU3,QU4)      0.5417
(QU3,QU4)→(QU4,∞)        0.2083
(QU4,∞)→(0,QU1)          0.7698
(QU4,∞)→(QU1,QU2)        0.6863
(QU4,∞)→(QU2,QU3)        0.4217
(QU4,∞)→(QU3,QU4)        0.4162
(QU4,∞)→(QU4,∞)          0.1300

µM ≈ 0.4303, σM ≈ 0.2521
µ→(0,QU1) ≈ 0.8043, σ→(0,QU1) ≈ 0.1360
µ→(QU1,QU2) ≈ 0.5090, σ→(QU1,QU2) ≈ 0.1462
µ→(QU2,QU3) ≈ 0.3487, σ→(QU2,QU3) ≈ 0.0865
µ→(QU3,QU4) ≈ 0.3509, σ→(QU3,QU4) ≈ 0.1631
µ→(QU4,∞) ≈ 0.1388, σ→(QU4,∞) ≈ 0.0577

Table 4.12: Acceptability judgments and statistical metrics for “put x near y” visualizations, conditioned on start and end distance intervals between x and y

Table 4.12: Acceptability judgments and statistical metrics for “put x near y” visualizations, con- ditioned on start and end distance intervals between x and y


Dist (end)   QSR           P(acc|QU,QSR)
(0,QU1)      behind(y)     0.7730
(0,QU1)      in front(y)   0.7349
(0,QU1)      left(y)       0.7338
(0,QU1)      right(y)      0.7712
(QU1,QU2)    behind(y)     0.6701
(QU1,QU2)    in front(y)   0.5797
(QU1,QU2)    left(y)       0.6675
(QU1,QU2)    right(y)      0.5819
(QU2,QU3)    behind(y)     0.4151
(QU2,QU3)    in front(y)   0.3644
(QU2,QU3)    left(y)       0.3945
(QU2,QU3)    right(y)      0.3825
(QU3,QU4)    behind(y)     0.1713
(QU3,QU4)    in front(y)   0.4308
(QU3,QU4)    left(y)       0.2093
(QU3,QU4)    right(y)      0.4699
(QU4,∞)      behind(y)     0.0972
(QU4,∞)      in front(y)   0.1401
(QU4,∞)      left(y)       0.1250
(QU4,∞)      right(y)      0.1348

µend,qsr ≈ 0.4424, σend,qsr ≈ 0.2380
µend=(0,QU1) ≈ 0.7532, σend=(0,QU1) ≈ 0.0218
µend=(QU1,QU2) ≈ 0.6248, σend=(QU1,QU2) ≈ 0.0508
µend=(QU2,QU3) ≈ 0.3891, σend=(QU2,QU3) ≈ 0.0213
µend=(QU3,QU4) ≈ 0.3203, σend=(QU3,QU4) ≈ 0.1518
µend=(QU4,∞) ≈ 0.1243, σend=(QU4,∞) ≈ 0.0191
µqsr=beh ≈ 0.4253, σqsr=beh ≈ 0.2971
µqsr=fr ≈ 0.4500, σqsr=fr ≈ 0.2246
µqsr=l ≈ 0.4260, σqsr=l ≈ 0.2700
µqsr=r ≈ 0.4681, σqsr=r ≈ 0.2362

Table 4.13: Acceptability judgments and statistical metrics for “put x near y” visualizations, con- ditioned on distance between x and y and POV-relative orientation at event completion

While “put on” and “put in” judgments do not show significant variation in acceptability based on their one underspecified parameter, translocation speed, some very interesting results appear in the examination of “put touching” and “put near” visualizations.

We observe a lower likelihood for visualizations to be judged acceptable when the moving object moves from behind the stationary object to in front of it, and vice versa. P(accept|behind→in front(y)) is approximately 0.4758, which is approximately 1.07 standard deviations below the mean of the population for all starting/ending QSR relation pairs. This may be explained as an effect of the point of view imposed by the camera position, which may make it difficult to see if an object behind another object is actually making contact and satisfying the EC relation required by “touching,” especially if a larger object is occluding a smaller object.

Visualizations where the moving object ends to the left of the stationary object were also less likely to be judged acceptable. P(accept|left(y)) is approximately 1.16 standard deviations below the mean likelihood of acceptance over the population for all event-end QSR relations. This is apparently independent of the moving object’s starting location relative to the stationary object, but the dispreference is more significant for objects that start in front of, or to the right of, their destination.

• P(acc|in front→left(y)) ≈ 0.4601 ≈ µM - 1.26σM

• P(acc|right→left(y)) ≈ 0.4777 ≈ µM - 1.04σM

This could also be explained as an effect of the POV, in particular the distortion it causes in cases where larger objects closer to the camera (including laterally) may occlude objects further away, making it difficult to assess the satisfaction of the EC relation. Therefore, some objects that move from the right of another object to the left of it also move away from the camera, meaning that this effect is analogous to that seen in the behind(y) relations, and it explains the similar result seen for in front→left(y) motions. However, this hypothesis would not explain the absence of a symmetric inclination against right(y) relations, so more experimentation or analysis is needed. Some of this may be related to features of the objects themselves, which are not strongly controlled for (discussed further in Section 4.5).

There is a strong preference for the on(y) specification of touching(y) over all others, which matches linguistic intuition. “On” necessarily implies an EC relation, which is expressed in the VoxML (Figure 2.11). P(accept|on(y)) falls approximately 1.52 standard deviations above the mean probability of acceptability of the population for all event-end relations. The strongest preference is for motion from behind(y) to on(y), where P(accept|behind→on(y)) is approximately 2.25 standard deviations above the mean likelihood for acceptability over the whole population conditioned on start-to-end motion. In terms of point of view effects, this may be due to an occluded object being brought into view and very obviously made to touch its destination in a visualization with no obstructed view. Where “touching” is an underspecified predicate, the relations entailed by “on,” while arguably somewhat overspecified as an interpretation of “touching” alone, seem to most clearly satisfy the qualitative specification of “touching” out of the options available. Notably, it is the only one not dependent on the relative point of view, suggesting that the relative point of view introduces some noise or confusion into the human judgments, potentially for the reasons discussed above, among others.

For “put near,” evaluators unsurprisingly preferred visualizations where the two objects ended up close to each other to those where the objects ended further apart.

• P(acc|(0, QU1)) ≈ µend + 1.24σend

• P(acc|(QU1, QU2)) ≈ µend + 0.70σend

While this seems like an obvious result, the fact that quantitative data comports with intuition lends credence to the soundness of this simulation method of determining the presuppositions underlying motion and relation predicates. In the first three distance intervals, we observe a slight preference for events where the moving object finishes the event behind the stationary object.

• P(acc|(0, QU1),behind(y)) ≈ µend=(0,QU1),qsr + 0.90σend=(0,QU1),qsr

• P(acc|(QU1, QU2),behind(y)) ≈ µend=(QU1,QU2),qsr + 0.89σend=(QU1,QU2),qsr

• P(acc|(QU2, QU3),behind(y)) ≈ µend=(QU2,QU3),qsr + 1.22σend=(QU2,QU3),qsr

This may be an effect of foreshortening caused by the point of view, as with some of the “touching” specifications, which causes an object x which is behind(y) to appear closer to y than it actually is. When conditioning on the joint distribution of the distance interval and the QSR relation, as shown in Table 4.13, there is some apparent confusion in judgments of events in the fourth distance interval, where σ for the population of P(accept|QSR) is greater than .15, whereas in all other intervals σ for P(accept|QSR) falls between .019 and .051. This is possibly a factor of workers being unable to judge purely from the visuals whether an object that began its movement from a position in the fourth distance interval relative to the stationary object actually ended the event nearer than it began, whereas in preceding intervals, the resulting location was more likely to be unambiguously “near” regardless of starting location.


Table 4.12 shows the judges’ preferences for objects that moved between the different distance intervals, independent of direction or orientation. The quintiles were calculated based on the distributions of distances between objects at the end of the “put near” event, which is why Tables 4.11 and 4.12 show no objects beginning the event in the lowest distance interval. There is a clear preference for objects that move from a far interval to a near one, and the inverse is also true, with very low proportions of “acceptable” judgments for visualizations where the object moved from a near distance interval to a farther one. This reinforces the intuition that a qualitative term like “near” is understood to be inherently relative (Peters, 2007).

4.1.8 “Lean”

Angle        P(acc|QU)
(0,QU1)      0.6117
(QU1,QU2)    0.6403
(QU2,QU3)    0.6694
(QU3,QU4)    0.6443
(QU4,∞)      0.6502

µ ≈ 0.6432, σ ≈ 0.0208

Table 4.14: Acceptability judgments and statistical metrics for “lean x” visualizations, conditioned on rotation angle

For “lean,” the only parameter value left underspecified is the angle of the lean.1 The data gathered from the evaluators suggest near-equal preference for all angle intervals, with a low standard deviation (0.0208). There is a dispreference for the lowest interval, those angles closest to 25° (P(acc|(0, QU1)) ≈ 0.6117 ≈ µ - 1.51σ). This may be because the Unity engine physics takes over once the object has completed its motion, and on occasion with certain objects (e.g., a book), leaning it at a low angle causes the force of gravity to overwhelm and break the support relation between the leaning object and the supporting object, causing it to fall. This suggests that prototypical events may be required to remain satisfied after the application of any postconditions or effects imposed by world physics, independently of the event satisfaction itself.

1The speed of rotation and translocation are also underspecified, but these are properties of the [[TURN]] and [[PUT]] subevents of “lean,” not the lean itself.


4.1.9 “Flip”

Rot Axis    Symmetry Axis    P(acc|Axisrot,Axissym)
X           Y                0.6193
X           Z                0.6667
Y           X                0.5417
Y           Z                0.7500
Z           X                0.3137
Z           Y                0.6645

µ ≈ 0.5927, σ ≈ 0.1527

Table 4.15: Acceptability judgments and statistical metrics for “flip x” visualizations, conditioned on rotation axis and symmetry axis

On “flip” instances, the outliers are objects with symmetry around the X axis rotating around the Z axis and objects with symmetry around the Z axis rotating around the Y axis.

1. P(acc|Z,X) ≈ 0.3137 ≈ µ - 1.83σ

2. P(acc|Y,Z) ≈ 0.7500 ≈ µ + 1.03σ

Axisrot = Z and Axissym = X is very strongly dispreferred, while Axisrot = Y and Axissym = Z is somewhat strongly preferred. As with “spin” (Table 4.7), this may be a factor of the object symmetry relative to the camera placement, where a rotation around the Y axis shows all sides of the object, making the “flip” motion obvious (as long as the object is not symmetric around the Y axis). A kind of conditioning to control for relative point of view, similar to that calculated for the “put touching” events, could clarify whether or not this is in fact the case.

4.1.10 “Close”

Pred    P(acc|Pred)
turn    0.6818
put     0.7857

Table 4.16: Acceptability judgments for “close x” visualizations, conditioned on motion manner

For “close,” we see two types of motions: those where a subcomponent of the object is turned to close it (e.g., “close the book”), and those where another object is put on top of the object to be closed (e.g., “close the cup” → “put the lid (or disc) on the cup”). Evaluators seem to prefer the “put” type to the “turn” type. Although no instances of “close the book” were realized by placing another object on top of an open book, it seems unlikely that that would be considered an acceptable visualization or interpretation of “close the book.” It is likely, then, that acceptability of a “close” event, since it requires predicate respecification, is strongly conditioned on the features of the object rather than the motion event itself.

4.1.11 “Open”

Pred    P(acc|Pred)
turn    0.6818
move    0.8750

Table 4.17: Acceptability judgments for “open x” visualizations, conditioned on motion manner

The same is probably true for “open.” Here we have two cases: the reverse of the “turn” event from “close,” where some subcomponent is turned to an arbitrary angle to open the object (such as a book), and instances of “open” where some object closing another object is moved to a different location or positioning (e.g., “move the lid” where the lid is currently sealing a cup). Evaluators strongly preferred the “move” event to the “turn” event.

4.2 Human Evaluation Task 2 Results

The tables below show the probability that an arbitrary judge would, given a Monte Carlo-generated visualization of the given predicate, identify that predicate as best describing the visualization, conditioned on various relevant parameters in the visualization. As multiple choices were allowed and eight judges evaluated each task, probabilities in each table will likely not sum to 1. The probabilities shown here are generally lower than the probabilities that fall out of HET1, due primarily to the fact that for each single visualization, workers were given three separate labels to choose from instead of a single label for three visualizations, spreading the distribution out over three potential captions and rendering the overall task three times as large.


4.2.1 “Move”

Pred     P(select=“move”|Pred)
spin     0.0617
turn     0.0909
lift     0.0938
flip     0.1491
put      0.2500
stack    0.2500
roll     0.2961
lean     0.3714
slide    0.5938

µ ≈ 0.2396, σ ≈ 0.1694

Table 4.18: Probabilities and statistical metrics for selection of “move” predicate for “move x” event, conditioned on respecification predicate

The probability of an arbitrary evaluator choosing the label “move” for a given visualization of a “move” is highly variable, with a very high standard deviation relative to the probability values for each individual candidate respecification predicate.

• P(select=“move”|spin) ≈ 0.0617 ≈ µ - 1.05σ

• P(select=“move”|slide) ≈ 0.5938 ≈ µ + 2.09σ

We also see a very different order of increasing preference for the label choices compared to the generated respecifications seen in Table 4.1. Respecifications that were preferred by evaluators in HET1 are often infrequent choices to be labeled a “move” here. The one point of convergence in the two tasks is “slide,” which is a frequently accepted respecification in HET1 (P(acc|slide) ≈ 0.6563) and the most likely event type to be labeled a “move” in this task. We might explain this by introducing a reflexivity qualification on prototypical motion events, such that events instantiated with a certain label by an event generator should be labeled the same way by an evaluator, in which case a sliding motion might be a good candidate for a prototypical “move” event.


4.2.2 “Turn”

Pred    P(select=“turn”|Pred)
lean    0.1048
roll    0.2171
spin    0.2346
flip    0.4561

µ ≈ 0.2532, σ ≈ 0.1470

Table 4.19: Probabilities and statistical metrics for selection of “turn” predicate for “turn x” visualizations, conditioned on respecification predicate

These results also show a very different order than the acceptability of the same predicates as visualizations in HET1. As with “move,” there is a point of consistency between the two sets of results. Here, that is “flip,” which is the most likely respecification to be labeled as a “turn” and the most likely type of “turn” visualization to be judged acceptable. This satisfies both reflexivity and obviousness with respect to prototypicality, and so “flip” might be considered for a prototypical realization of a “turn” event.

Rot          P(select=“turn”|Rot)
(0,QU1)      0.2618
(QU1,QU2)    0.2647
(QU2,QU3)    0.2355
(QU3,QU4)    0.2560
(QU4,∞)      0.2426

µ ≈ 0.2521, σ ≈ 0.0126

Table 4.20: Probabilities and statistical metrics for selection of “turn” predicate for unrespecified “turn x” visualizations, conditioned on rotation angle

For visualizations where “turn” was not respecified as a different predicate, the results can be conditioned on the angle of the rotation. No clear patterns emerge here, with roughly equal distribution over all intervals. It appears that for an arbitrary rotation not part of any other distinct motion class, “turn” is the best overall label from the available choices.


4.2.3 “Roll”

Dist         P(select=“roll”|QU)
(0,QU1)      0.0300
(QU1,QU2)    0.0568
(QU2,QU3)    0.0531
(QU3,QU4)    0.0483
(QU4,∞)      0.0407

µ ≈ 0.0458, σ ≈ 0.0107

Table 4.21: Probabilities and statistical metrics for selection of “roll” predicate for “roll x” visualizations, conditioned on path length

Dist         Pred        P(select=Pred|QU)
(0,QU1)      move        0.6816
(0,QU1)      put near    0.2912
(0,QU1)      lift        0.0412
(QU1,QU2)    move        0.6639
(QU1,QU2)    put near    0.2912
(QU1,QU2)    roll        0.0568
(QU2,QU3)    move        0.6434
(QU2,QU3)    put near    0.3080
(QU2,QU3)    lift        0.0624
(QU3,QU4)    move        0.6007
(QU3,QU4)    put near    0.3719
(QU3,QU4)    lift        0.0611
(QU4,∞)      move        0.5717
(QU4,∞)      put near    0.4646
(QU4,∞)      lift        0.0664

Table 4.22: Top 3 most likely predicate choices for “roll x” visualizations, conditioned on path length

There is a very low probability overall that evaluators would choose “roll” as the best label for a rolling event, regardless of path length. The most likely label choice for instances of “roll” overall was actually “move,” followed by “put near,” and then “lift” in most cases, although the overall probabilities for choosing “lift” are also very low. The occurrence of “lift” is hard to explain; some of the “roll” visualizations bounce a little due to physics effects, and the low probabilities may also be attributable to evaluator error.


4.2.4 “Slide”

Speed        Dist         P(select=“slide”|QUs,QUd)
(0,QU1)      (0,QU1)      0.0300
(0,QU1)      (QU2,QU3)    0.0214
(0,QU1)      (QU3,QU4)    0.0178
(0,QU1)      (QU4,∞)      0.0521
(QU1,QU2)    (0,QU1)      0.0379
(QU1,QU2)    (QU1,QU2)    0.0311
(QU1,QU2)    (QU2,QU3)    0.0300
(QU1,QU2)    (QU3,QU4)    0.0150
(QU1,QU2)    (QU4,∞)      0.0498
(QU2,QU3)    (0,QU1)      0.0311
(QU2,QU3)    (QU1,QU2)    0.0310
(QU2,QU3)    (QU2,QU3)    0.0385
(QU2,QU3)    (QU3,QU4)    0.0563
(QU3,QU4)    (0,QU1)      0.0217
(QU3,QU4)    (QU1,QU2)    0.0978
(QU3,QU4)    (QU2,QU3)    0.0179
(QU3,QU4)    (QU4,∞)      0.0263
(QU4,∞)      (0,QU1)      0.0323
(QU4,∞)      (QU2,QU3)    0.0651
(QU4,∞)      (QU3,QU4)    0.0655
(QU4,∞)      (QU4,∞)      0.0558

µs,d ≈ 0.0392, σs,d ≈ 0.0204
µs=(0,QU1) ≈ 0.0303, σs=(0,QU1) ≈ 0.0154
µs=(QU1,QU2) ≈ 0.0328, σs=(QU1,QU2) ≈ 0.0127
µs=(QU2,QU3) ≈ 0.0392, σs=(QU2,QU3) ≈ 0.0119
µs=(QU3,QU4) ≈ 0.0409, σs=(QU3,QU4) ≈ 0.0381
µs=(QU4,∞) ≈ 0.0547, σs=(QU4,∞) ≈ 0.0156
µd=(0,QU1) ≈ 0.0306, σd=(0,QU1) ≈ 0.0058
µd=(QU1,QU2) ≈ 0.0533, σd=(QU1,QU2) ≈ 0.0385
µd=(QU2,QU3) ≈ 0.0346, σd=(QU2,QU3) ≈ 0.0188
µd=(QU3,QU4) ≈ 0.0387, σd=(QU3,QU4) ≈ 0.0260
µd=(QU4,∞) ≈ 0.0460, σd=(QU4,∞) ≈ 0.0134

Table 4.23: Probabilities and statistical metrics for selection of “slide” predicate for “slide x” visualizations, conditioned on path length and translocation speed

As with “roll,” the probability of choosing “slide” as the best label for visualizations generated from the “slide” predicate is low, although we can see an increasing trend in favor of the “slide” label as motion speed rises.


• µs=(0,QU1) ≈ 0.0303 ≈ µs,d - 0.44σs,d

• µs=(QU4,∞) ≈ 0.0547 ≈ µs,d + 0.76σs,d

Overall, results from neither “roll” nor “slide” seem to be very informative about prototypicality in this case, possibly because both are fairly fully-specified events, not easily confused for anything else beyond their own hypernyms (e.g., “move”).

4.2.5 “Spin”

Pred    Dist         P(select=“spin”|“roll”,QU)
roll    (0,QU1)      0.2500
roll    (QU1,QU2)    0.1563
roll    (QU2,QU3)    0.1250
roll    (QU3,QU4)    0.2500
roll    (QU4,∞)      0.1750

µ ≈ 0.1913, σ ≈ 0.0565

Table 4.24: Probabilities and statistical metrics for selection of “spin” predicate for “spin x” visualizations respecified as “roll x,” conditioned on path length

Axis    P(select=“spin”|Axis)
X       0.0643
Y       0.4137
Z       0.0625

Table 4.25: Probabilities for selection of “spin” predicate for unrespecified “spin x” visualizations, conditioned on rotation axis

For instances of “spin” respecified as “roll,” no clear trend emerges that would identify a particular path length as making such an event more identifiable as a “spin.” Paths in the shortest interval and in (QU3,QU4) show some preference, but not a clearly explainable one relative to the total test set for respecified “spin,” and it is unclear how much of this is statistical noise due to the small size of this specific segment of the dataset. For unrespecified “spin,” there is a very strong preference for spin motions around the Y axis, suggesting that the prototypical notion of a “spin” is a rotation around that axis.


4.2.6 “Lift”

Speed        Dist         P(select=“lift”|QUs,QUd)
(0,QU1)      (QU1,QU2)    0.0523
(0,QU1)      (QU2,QU3)    0.0256
(0,QU1)      (QU3,QU4)    0.0800
(0,QU1)      (QU4,∞)      0.1146
(QU1,QU2)    (0,QU1)      0.0909
(QU1,QU2)    (QU1,QU2)    0.0932
(QU1,QU2)    (QU3,QU4)    0.0300
(QU1,QU2)    (QU4,∞)      0.0746
(QU2,QU3)    (0,QU1)      0.0104
(QU2,QU3)    (QU1,QU2)    0.0354
(QU2,QU3)    (QU2,QU3)    0.1106
(QU2,QU3)    (QU3,QU4)    0.0938
(QU2,QU3)    (QU4,∞)      0.0667
(QU3,QU4)    (0,QU1)      0.0217
(QU3,QU4)    (QU1,QU2)    0.0761
(QU3,QU4)    (QU2,QU3)    0.1071
(QU3,QU4)    (QU3,QU4)    0.0806
(QU4,∞)      (0,QU1)      0.0753
(QU4,∞)      (QU1,QU2)    0.0257
(QU4,∞)      (QU2,QU3)    0.0888
(QU4,∞)      (QU3,QU4)    0.0476
(QU4,∞)      (QU4,∞)      0.0944

µs,d ≈ 0.0680, σs,d ≈ 0.0317
µs=(0,QU1) ≈ 0.0681, σs=(0,QU1) ≈ 0.0381
µs=(QU1,QU2) ≈ 0.0722, σs=(QU1,QU2) ≈ 0.0293
µs=(QU2,QU3) ≈ 0.0634, σs=(QU2,QU3) ≈ 0.0411
µs=(QU3,QU4) ≈ 0.0714, σs=(QU3,QU4) ≈ 0.0358
µs=(QU4,∞) ≈ 0.0664, σs=(QU4,∞) ≈ 0.0291
µd=(0,QU1) ≈ 0.0496, σd=(0,QU1) ≈ 0.0395
µd=(QU1,QU2) ≈ 0.0565, σd=(QU1,QU2) ≈ 0.0280
µd=(QU2,QU3) ≈ 0.0830, σd=(QU2,QU3) ≈ 0.0395
µd=(QU3,QU4) ≈ 0.0664, σd=(QU3,QU4) ≈ 0.0265
µd=(QU4,∞) ≈ 0.0876, σd=(QU4,∞) ≈ 0.0215

Table 4.26: Probabilities and statistical metrics for selection of “lift” predicate for “lift x” visualizations, conditioned on translocation speed and distance traversed

Like “roll” and “slide,” “lift” receives overall low probabilities of being judged the best label for visualizations generated from “lift” events, but there seems to be a rising trend of probability for the “lift” label as the distance traveled rises (µd=(QU4,∞) ≈ 0.0876 ≈ µs,d + 0.62σs,d). Since in HET1 (Table 4.8) evaluators also preferred longer instances of “lift,” this comports with the reflexivity qualification on prototypicality introduced earlier.

4.2.7 “Put”

Speed        P(select=“put on/in”|Speed)
(0,QU1)      0.2016
(QU1,QU2)    0.2182
(QU2,QU3)    0.2372
(QU3,QU4)    0.2334
(QU4,∞)      0.2349

µ ≈ 0.2251, σ ≈ 0.0151

Table 4.27: Probabilities and statistical metrics for selection of “put on/in” predicate for “put x on/in y” visualizations, conditioned on translocation speed

QSR (end)     P(select=“put touching”|QSR)
behind(y)     0.1982
in front(y)   0.2706
left(y)       0.3250
right(y)      0.3333
on(y)         0.3654

µ ≈ 0.2985, σ ≈ 0.0656

Table 4.28: Probabilities and statistical metrics for selection of “put touching” predicate for “put x touching y” visualizations, conditioned on relative orientation between x and y at event completion

Dist         P(select=“put near”|Dist)
(0,QU1)      0.2912
(QU1,QU2)    0.2912
(QU2,QU3)    0.3080
(QU3,QU4)    0.3719
(QU4,∞)      0.4646

µ ≈ 0.3454, σ ≈ 0.0745

Table 4.29: Probabilities and statistical metrics for selection of “put near” predicate for “put x near y” visualizations, conditioned on distance traveled

“Put on” and “put in” results show little variation when conditioned on translocation speed, with perhaps a slight preference for faster motions. With the exception of this parameter, these are already well-specified motion predicates, so, combined with the results from HET1, there does not appear to be any particularly distinct set of “best values” for translocation speed in a prototypical “put on” or “put in” event. The results for “put touching” also reflect the results in HET1, with a preference for the on(y) specification of touching(y) and a dispreference for behind(y), possibly due to the effects of the point of view on the perceptibility of the EC constraint of a touching(y) relation.

• P(select=“put touching”|behind(y)) ≈ 0.1982 ≈ µ - 1.53σ

• P(select=“put touching”|on(y)) ≈ 0.3654 ≈ µ + 1.02σ

on(y), meanwhile, remains a very obvious interpretation of touching(y) and is clearly preferred as a specification in both visualization and labeling. The results for “put near” display something interesting: rather than the clear pattern of preference for objects ending in close proximity that emerged when conditioning on relative offset in HET1 (Table 4.11), the trend in this task emerges when conditioning on the overall distance the moving object traveled, with a preference for longer paths (P(select=“put near”|(QU4,∞)) ≈ 0.4646 ≈ µ + 1.6σ). Longer paths are obvious motions, but in the case of moving an object “near” another object they are also demonstrative, in that they demonstrate the contrast between the beginning and ending states, and the clear typing of the event as might be encoded in VoxML (for a [[PUT]], the typing is a “transition event,” a minimal distinction which a long path taken over the course of the event clearly demonstrates).

4.2.8 “Lean”

Angle        P(select=“lean on”|Angle)
(0,QU1)      0.4263
(QU1,QU2)    0.4287
(QU2,QU3)    0.4291
(QU3,QU4)    0.4245
(QU4,∞)      0.4110
µ ≈ 0.4239, σ ≈ 0.0075

Table 4.30: Probabilities and statistical metrics for selection of “lean on” predicate for “lean x on y” visualizations, conditioned on rotation angle


Angle        P(select=“lean against”|Angle)
(0,QU1)      0.3319
(QU1,QU2)    0.3495
(QU2,QU3)    0.3555
(QU3,QU4)    0.3217
(QU4,∞)      0.3512
µ ≈ 0.3420, σ ≈ 0.0145

Table 4.31: Probabilities and statistical metrics for selection of “lean against” predicate for “lean x against y” visualizations, conditioned on rotation angle

For both “lean on” and “lean against,” results show roughly equal probabilities for the respective labels across angle intervals. The same type of results appears as for unrespecified “turn,” although here there is an overall and universal preference for labeling a “lean” event with “on” as opposed to “against,” even though the two event programs are enacted identically.

4.2.9 “Flip”

Rot Axis   Symmetry Axis   P(select=“flip on edge”|Axis_rot,Axis_sym)
X          Y               0.1080
X          Z               0.1579
Y          Z               0.3125
Z          X               0.0833
Z          Y               0.0789
µ ≈ 0.1481, σ ≈ 0.0971

Table 4.32: Probabilities and statistical metrics for selection of “flip on edge” predicate for “flip x on edge” visualizations, conditioned on rotation axis and symmetry axis

Rot Axis   Symmetry Axis   P(select=“flip at center”|Axis_rot,Axis_sym)
X          Y               0.3352
X          Z               0.3333
Y          X               0.2400
Z          X               0.1875
Z          Y               0.3750
µ ≈ 0.2942, σ ≈ 0.0776

Table 4.33: Probabilities and statistical metrics for selection of “flip at center” predicate for “flip x at center” visualizations, conditioned on rotation axis and symmetry axis


Rot Axis   Symmetry Axis   Pred   P(select=Pred|Axis_rot,Axis_sym)
X          Y               turn   0.4659
X          Y               flip   0.3352
X          Y               move   0.1818
X          Z               turn   0.4211
X          Z               flip   0.3333
X          Z               move   0.2280
Y          X               turn   0.4000
Y          X               flip   0.2400
Y          X               move   0.1200
Y          Z               turn   0.4375
Y          Z               flip   0.3125
Y          Z               move   0.1250
Z          X               move   0.2500
Z          X               turn   0.2291
Z          X               flip   0.1875
Z          Y               turn   0.5197
Z          Y               flip   0.3750
Z          Y               move   0.2039

Table 4.34: Top 3 most likely predicate choices for “flip x {on edge, at center}” visualizations, conditioned on rotation axis and symmetry axis

Labeling probabilities for “flip on edge” and “flip at center” pattern differently, so they are separated here. Preference for a “flip on edge” label was strongest for visualizations containing objects rotating around their Y axis when they had symmetry around their Z axis, whereas for a “flip at center” label, visualizations containing objects symmetric around their Y axis rotating around their Z axis were preferred. Overall, evaluators were more likely to label these visualizations with “turn” than with “flip.” In some cases, this may have been because the object flipped on its edge would fall over after physics effects are evaluated following event completion, violating the physics-independence quality of a prototypical motion event, but it may also indicate a level of “hedging bets” on the part of the evaluators, or of building a measure of certainty into their evaluations. In other words, while the event visualized may have been somewhat less-than-definitely a “flip,” it was very securely a “turn,” making that the best label.


4.2.10 “Close”

Pred   P(select=“close”|Pred)
turn   0.2273
put    0.2143

Table 4.35: Probabilities for selection of “close” predicate for “close x” visualizations, conditioned on motion manner

Manner   Pred     P(select=Pred|Manner)
put      put on   0.4821
put      move     0.2500
put      close    0.2143
turn     turn     0.3636
turn     close    0.2272
turn     open     0.1932

Table 4.36: Top 3 most likely predicate choices for “close x” visualizations, conditioned on motion manner

For both types of “close” realization, “put” events and “turn” events, evaluators were roughly equally likely to choose “close” as the correct label. However, evaluators were much more likely to choose the more specific predicate, “put on” or “turn” (depending on event typing), as the correct label overall, as opposed to “close.” There is also a small incidence of “open” in the “turn” realization, which seems to be evaluator error, or at least counterintuitive to the visualization given.

4.2.11 “Open”

Pred   P(select=“open”|Pred)
turn   0.1932
move   0.6122

Table 4.37: Probabilities for selection of “open” predicate for “open x” visualizations, conditioned on motion manner


Manner   Pred           P(select=Pred|Manner)
turn     turn           0.3636
turn     close          0.2272
turn     open           0.1932
move     open           0.6122
move     move           0.5306
move     lean against   0.0408

Table 4.38: Top 3 most likely predicate choices for “open x” visualizations, conditioned on motion manner

Evaluators were much more likely to label “move” enactments of “open” as “open” events, whereas, as with “close,” “turn” realizations were more likely to be given the label “turn.” This may suggest that, although “turn” is an underspecified motion predicate, “move” is more so, and that the more underspecified the motion predicate, the less likely evaluators are to choose it as an event label for a more fully-specified event visualization. It should be noted that in this task, the heuristics used to generate the three choices given to evaluators may have influenced the results presented. By offering alternative choices to the true input sentence that rely on 1) adjunct alternation, such as “put in” vs. “put on,” and 2) motion superclass/subclass distinctions, such as providing a “move” option for “slide” or a “turn” option for “lean,” these choice sets reinforce rigid taxonomies of motion classes when human interpretation may be more fluid.
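As an illustration of the two heuristics just described, the sketch below generates distractor captions for a true predicate from an adjunct-alternation table and a motion superclass table; the specific table entries here are hypothetical simplifications, not the actual mapping used to build the HET2 choice sets.

    # Hypothetical lookup tables for the two distractor heuristics described above.
    ADJUNCT_ALTERNATIONS = {
        "put on": "put in",
        "put in": "put on",
        "lean on": "lean against",
        "lean against": "lean on",
    }
    SUPERCLASS = {
        "slide": "move",
        "roll": "turn",
        "spin": "turn",
        "flip": "turn",
        "lean on": "turn",
        "put on": "move",
    }

    def distractors(true_pred: str) -> list[str]:
        """Return alternative labels for a true predicate: an adjunct alternation
        if one exists, plus the predicate's motion superclass if one is listed."""
        alts = []
        if true_pred in ADJUNCT_ALTERNATIONS:
            alts.append(ADJUNCT_ALTERNATIONS[true_pred])
        if true_pred in SUPERCLASS:
            alts.append(SUPERCLASS[true_pred])
        return alts

    print(distractors("put on"))  # ['put in', 'move']
    print(distractors("slide"))   # ['move']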

4.3 Automatic Evaluation Task Results

The classifier source code for both the MaxEnt baseline and the DNN evaluator is available in the /utils/analysis/auto-eval/ directory of https://github.com/nkrishnaswamy/thesis-docs-utils.


4.3.1 Baseline: Maximum Entropy Logistic Regression

Granularity 1: Predicting Predicate Only
  Choice set     Total   Correct   Incorrect   µ Accuracy   σ         σ²
  3-way          3357    1628      1729        48.50%       0.29066   0.08448
  Unrestricted   3357    558       2799        16.62%       0.09007   0.00811

Granularity 2: Predicting Predicate w/ Adjunct
  Choice set     Total   Correct   Incorrect   µ Accuracy   σ         σ²
  3-way          3357    1182      2175        35.19%       0.24375   0.05941
  Unrestricted   3357    522       2835        15.54%       0.10732   0.01152

Granularity 3: Predicting Full Sentence
  Choice set     Total   Correct   Incorrect   µ Accuracy   σ         σ²
  3-way          3357    1532      1825        45.64%       0.02424   0.00059
  Unrestricted   3357    34        3323        1.01%        0.00320   0.00001

Table 4.39: Accuracy tables for baseline automatic evaluation


[Figure 4.1: Baseline accuracy on restricted choice set. Accuracy (%): Predicate Only 48.5, Pred+Prep 35.19, Full Sentence 45.64.]

[Figure 4.2: Baseline accuracy on unrestricted choice set. Accuracy (%): Predicate Only 16.62, Pred+Prep 15.54, Full Sentence 1.01.]

Over a 10-fold cross-validation of the test set, using the saved feature vectors as training data, the MaxEnt classifier achieves 48.50% accuracy on selecting the correct event predicate alone, and 45.64% when selecting the correct sentence in its entirety.

The baseline results when selecting the predicate alone display a much higher variance than the results when selecting the entire sentence, pointing to the existence of some “confusing” features when judging the predicate by itself, or indicating that some extra information provided by object features results in more consistent results across folds. Nevertheless, this baseline exhibits only a 12-15% improvement over random chance in a three-way classification task.

When the algorithm is not restricted to the same three choices given to evaluators for each labeling task (effectively increasing the choice from a 3-way classification to an 11-way classification when choosing just the predicate, or a 1,119-way classification when choosing the entire sentence), the accuracy drops quite drastically, to 16.62% on predicting the predicate alone, and to 1.01% on predicting the entire sentence, effectively reducing the accuracy to statistical error.
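In outline, the baseline is easy to reproduce. The following is a minimal sketch using scikit-learn's multinomial logistic regression (equivalent to a maximum-entropy classifier) under 10-fold cross-validation; it is not the repository's implementation, and the placeholder feature matrix, label encoding, and feature count are assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    # X: (n_samples, n_features) saved feature vectors (placeholder random data here).
    # y: integer-encoded labels at the chosen granularity (e.g., 11 predicates).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(3357, 40))
    y = rng.integers(0, 11, size=3357)

    # Multinomial logistic regression is a maximum-entropy classifier.
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))

    print(f"mean accuracy: {scores.mean():.4f}  sigma: {scores.std():.5f}  sigma^2: {scores.var():.5f}")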

4.3.2 Deep Learning Results

The results from the deep learning classifiers are presented graphically, showing the differences in performance across neural network types and granularity levels, but also showing the effects of the number of training steps on each learning method. Full data tables in the format of Table 4.39 are available in Appendix E. Eight neural net configurations were run, on 5 different training lengths, across 10 folds, at 3 levels of granularity apiece, for a total of 1,200 individual automatic evaluations. Aggregate results are presented below. The following charts are sorted by neural net type and learning method, with two graphs for each: one with the network's choice restricted to the three possible captions available to human evaluators for the same visualization in HET2, and one with the choice set open to all options (predicate, predicate plus preposition, or full sentence, depending on granularity level). Each chart shows the three levels of granularity assessed for the baseline. Discussion follows after all the charts.


DNN with Unweighted Features

[Figure 4.3: “Vanilla” DNN accuracy on restricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 97.37-97.88, Pred+Prep 80.87-81.88, Full Sentence 48.8-66.83.]

[Figure 4.4: “Vanilla” DNN accuracy on unrestricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 93.5-95.41, Pred+Prep 55.63-57.18, Full Sentence 0.24-0.66.]


DNN with Weighted Features

[Figure 4.5: DNN with weighted features accuracy on restricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 96.36-97.88, Pred+Prep 80.27-81.34, Full Sentence 45.32-63.17.]

[Figure 4.6: DNN with weighted features accuracy on unrestricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 93.44-95.5, Pred+Prep 55.15-56.7, Full Sentence 0.21-0.6.]


DNN with Weighted Discrete Features

[Figure 4.7: DNN with weighted discrete features accuracy on restricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 96.51-98, Pred+Prep 80.33-81.19, Full Sentence 54.11-65.31.]

[Figure 4.8: DNN with weighted discrete features accuracy on unrestricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 92.75-95.5, Pred+Prep 55.18-56.61, Full Sentence 0.18-0.6.]


DNN with Feature Weights Only

[Figure 4.9: DNN with feature weights only accuracy on restricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 98.27-99.04, Pred+Prep 82.95-83.43, Full Sentence 11.35-69.78.]

[Figure 4.10: DNN with feature weights only accuracy on unrestricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 96.12-97.1, Pred+Prep 75.62-76.57, Full Sentence 0.03-0.63.]


Combined Linear-DNN with Unweighted Features

[Figure 4.11: Linear-DNN accuracy on restricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 95.56-96.15, Pred+Prep 80.18-81.19, Full Sentence 53-58.1.]

[Figure 4.12: Linear-DNN accuracy on unrestricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 90.85-91.5, Pred+Prep 53.51-55.06, Full Sentence 0.36-0.47.]


Combined Linear-DNN with Weighted Features

[Figure 4.13: Linear-DNN with weighted features accuracy on restricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 95.91-96.12, Pred+Prep 80.66-81.46, Full Sentence 50.05-57.21.]

[Figure 4.14: Linear-DNN with weighted features accuracy on unrestricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 90.88-91.59, Pred+Prep 53.99-54.79, Full Sentence 0.3-0.45.]


Combined Linear-DNN with Weighted Discrete Features

[Figure 4.15: Linear-DNN with weighted discrete features accuracy on restricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 95.59-96.15, Pred+Prep 80.18-81.19, Full Sentence 53-58.1.]

[Figure 4.16: Linear-DNN with weighted discrete features accuracy on unrestricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 92.46-96.75, Pred+Prep 53.51-55.06, Full Sentence 0.18-0.3.]


Combined Linear-DNN with Feature Weights Only

[Figure 4.17: Linear-DNN with feature weights only accuracy on restricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 96.45-98.71, Pred+Prep 81.04-83.37, Full Sentence 20.23-57.89.]

[Figure 4.18: Linear-DNN with feature weights only accuracy on unrestricted choice set. Accuracy (%) over 1,000-5,000 training steps: Predicate Only 90.85-91.5, Pred+Prep 71.78-76.25, Full Sentence 0.36-0.45.]


Discussion

All variations of the neural network method were able to identify the motion predicate alone with greater than 90% accuracy, even with only 1,000 training steps and even when given a choice of all available motion predicates, not just the three offered to human evaluators on the same individual task. The addition of a prepositional adjunct distinction caused accuracy to sink to somewhere in the 80-90% range for the restricted choice set, and the 50-60% range for the unrestricted choice set. This is still well above the baseline, but represents a significant drop from the results where the label set was limited to verbs alone. Two exceptions to this are the networks with feature weights only (Figure 4.10 and Figure 4.18), where accuracy remained in the vicinity of 75%. This phenomenon is discussed briefly below and in Section 4.5.1.

One general observation that emerged during early tests of the neural network learning method was that introducing IDF weighting to all the features seemed to add little in terms of final accuracy, and in fact often introduced noise that lowered accuracy at short training intervals (see Figure 4.3, where the “vanilla” DNN achieved 97.73%/81.88%/48.80% accuracy, as opposed to Figure 4.5, where the DNN with feature weights achieved 96.36%/80.27%/45.32% accuracy). This shortfall was usually made up with additional training, but even then rarely exceeded the performance of unweighted features. Meanwhile, assigning IDF weights to the discrete features only provided some increase in performance, notable mostly in the highest granularity level. This led to the intuition that the presence or absence of a feature may be a strong predictor of motion class. Since the distribution of underspecified parameter features varies through events in the test set (that is, some features occur in one event class only, others in multiple classes—which ties back into the notion that the amount and nature of spatial information provided by a predicate is variable), this was transformed into a TF-IDF metric (see discussion under Section 3.4.2).

Both DNN and combined Linear-DNN methods that used feature IDF weights only, in place of actual feature values, actually outperformed all other methods. In the lowest granularity, the advantage was slight (typically up from ∼96% to ∼98% across 5,000 training steps). In the middle granularity, the advantage was in the vicinity of 20%, typically jumping from ∼55% to ∼75%. In the highest granularity, predicting the entire input sentence, the weights-only method drastically underperforms all the other methods at first, but this is quickly made up for by longer training. In the combined network, a gradual increase in performance over training time results in approximate parity with the other methods (Figure 4.17), whereas in the purely deep learning network, the weights-only method ends up besting the others by about 10% after 5,000 training steps (Figure 4.9).2
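The “feature weights only” configuration can be sketched as follows: each feature value is replaced by an IDF-style weight for that feature's presence, so the classifier sees which underspecified parameters were instantiated at all, scaled by how selectively they occur, rather than their actual values. This is an illustrative reconstruction under stated assumptions (a feature counts as “present” when non-zero), not the thesis code.

    import numpy as np

    def idf_weights(X: np.ndarray) -> np.ndarray:
        """IDF-style weight per feature column, treating each feature vector as a
        'document' and a non-zero entry as that feature being 'present' in it."""
        n_docs = X.shape[0]
        doc_freq = np.count_nonzero(X, axis=0)
        return np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF

    def weights_only_features(X: np.ndarray) -> np.ndarray:
        """Replace raw values with presence/absence scaled by IDF, discarding the
        actual parameter values entirely."""
        return (X != 0).astype(float) * idf_weights(X)

    # Parameters that were never instantiated stay zero, so sparsity itself
    # becomes the signal the classifier learns from.
    X = np.array([[0.3, 0.0, 1.2],
                  [0.0, 0.0, 0.9],
                  [0.7, 2.1, 0.0]])
    print(weights_only_features(X))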


TensorFlow documentation typically recommends using a “wide” or linear classifier for continuous features and deep learning for discrete or categorical features, so it was thought that a combined Linear-DNN feeding continuous features to the linear nodes would be more appropriate to mixed input feature types than a wholly deep learning classifier, but overall the entirely deep learning method outperformed the combined method across all variations. The “deep” learning method is actually not very deep, consisting of only 4 layers of artificial neurons, but that is all that was needed to achieve results significantly above the baseline, with one exception (discussed shortly). Some tests on dev-test sets did not show much improvement with the addition of more layers or more artificial neurons per layer, although only a small set of variations was tested, and this testing was conducted informally. Results for these development tests are not recorded, but can be replicated through the DNN code at https://github.com/nkrishnaswamy/thesis-docs-utils/tree/master/utils/analysis/auto-eval.

In the highest level of granularity, predicting the entire input sentence over the unrestricted choice set of all possible input labels, the deep learning methods all actually underperformed the baseline, reaching at most 0.63% accuracy to the baseline's 1.01%. As all these figures (baseline included) fall in the range of statistical noise—barely better than random guessing—it seems clear that the label set in this trial, of 1,119 possible labels, is simply too large to classify using sparse feature data without a very sophisticated method. There do exist recurrent, convolutional, and sequence-to-sequence methods that could be suitable in theory and would be worth testing against this data.

In the first two levels of granularity, results are roughly the same across all training lengths. Often, there is a slight increase in performance with the addition of training steps, but in most cases this plateaus after about 3,000 or 4,000 steps, and only in a few variations of the neural net configuration does the classifier perform best when trained for 5,000 steps. In some cases at the finest-grained level, we see indications that further training may increase performance, particularly when using feature weights alone as input, without values (see Figure 4.17), but more tests would need to be run to verify this. On the finest-grained trials, we also see significant improvement in performance between 1,000 and 2,000 training steps (e.g., Figure 4.9), or across the 1,000-5,000 training step range (Figure 4.17). Slight improvement from 1,000 to 2,000 training steps that plateaus afterwards is also common (e.g., Figure 4.13).
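For reference, a combined wide-and-deep model of the kind described above can be sketched with the Keras API as follows; this is a re-implementation sketch, not the thesis's TensorFlow configuration, and the hidden-layer sizes and the split between continuous and discrete inputs are assumptions.

    import tensorflow as tf

    def build_linear_dnn(n_continuous: int, n_discrete: int, n_classes: int) -> tf.keras.Model:
        """Combined Linear-DNN: continuous features feed a linear (wide) branch,
        discrete features feed a shallow deep branch; their logits are summed."""
        cont_in = tf.keras.Input(shape=(n_continuous,), name="continuous")
        disc_in = tf.keras.Input(shape=(n_discrete,), name="discrete")

        # Wide branch: one linear layer over the continuous features.
        wide_logits = tf.keras.layers.Dense(n_classes, use_bias=False)(cont_in)

        # Deep branch: a shallow stack (the thesis notes ~4 layers sufficed).
        x = disc_in
        for units in (64, 64, 32, 32):
            x = tf.keras.layers.Dense(units, activation="relu")(x)
        deep_logits = tf.keras.layers.Dense(n_classes)(x)

        outputs = tf.keras.layers.Softmax()(tf.keras.layers.Add()([wide_logits, deep_logits]))
        model = tf.keras.Model(inputs=[cont_in, disc_in], outputs=outputs)
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # Hypothetical usage with feature matrices Xc (continuous) and Xd (discrete):
    # model = build_linear_dnn(n_continuous=12, n_discrete=28, n_classes=11)
    # model.fit({"continuous": Xc, "discrete": Xd}, y, epochs=10)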

2This is in the restricted choice set only.

4.4 Mechanical Turk Worker Response

The human evaluation tasks also allowed workers to (optionally) provide short feedback explaining their decisions on each HIT. Since this field of the task was not required to be filled out, the data is not uniform and is also free-form, making it difficult to assess quantitatively. However, a qualitative assessment can be provided to get a sense of how the workers who filled out that portion of the task were reasoning about it. A brief survey of this data is presented as word clouds, to provide a quick, intuitive, and visual assessment of the terms (here, uni- and bigrams) that occurred most frequently in worker comments.

Figure 4.19: Word clouds depicting worker response to HET1

For HET1, workers most often explained a choice of multiple videos for a given input by citing the fact that the videos displayed the same event (“perfectly matching,” “matching video,” “equally

well,” “right process,” etc.). Less frequent, but also prevalent, was discussion of the objects involved.

Figure 4.20: Word clouds depicting worker response to HET2

For HET2, we see similar results (“right process,” “describe event,” etc.). Workers also commented on the task itself, sometimes ungrammatically (“much interesting”). One particular worker, who completed a lot of tasks, also described the videos as “nice work,” which, while validating, is not a very informative response. Overall, workers did not display much tendency to explain their decisions when they chose one distinct answer. Most explanation came when they chose multiple answers, or in some cases “none.” Word clouds were generated with the wordcloud Python package by Andreas Mueller (http://amueller.github.io/word_cloud/).
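Reproducing such a figure with the same package takes only a few lines; the concatenated comment string and the output file name below are placeholders.

    from wordcloud import WordCloud

    # One free-form worker explanation per HIT, concatenated into a single string.
    comments = "perfectly matching matching video equally well right process nice work"

    # collocations=True (the default) also counts bigrams, matching the uni- and
    # bigram view described above.
    wc = WordCloud(width=800, height=400, background_color="white", collocations=True)
    wc.generate(comments)
    wc.to_file("het_worker_responses.png")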

4.5 Summary

The discussion heretofore has been primarily of qualitative inferences taken from the quantitative data, to evaluate the relevance of particular parameters and trends to the prototypicality of the visualization of a given motion event. In Chapter 2, I stated that the goal of this thesis was to determine a set of “best practices” for assigning values to an underspecified motion event in a visual simulation system. From the data revealed by conditioning the responses to HET1 and HET2 on various parameters, I would like to propose the following qualitative criteria for determining the prototypicality of an instance of a motion:

1. Perspicacity — When specifying the parameters of a motion predicate, a certain minimum level of additional information is required for the program to run, but adding too much information seems to have a greater chance of violating an evaluator’s notion of what the original (underspecified) predicate should be. For example, while a rolling or a spinning motion is technically a turning, the prototypical turning motion will have the minimum level of additional specification needed to execute the simulation, and no more. Too much added information tends to change the motion class in an evaluator’s eyes.

2. Obviousness — Motions that are too slight may be mistaken for software jitters or other artifacts and tend not to be classed as distinct motions. The motion should be made in such a way that it is undeniably denoting something.

3. Moderation — For parameters where evaluators have preferred value ranges, those ranges tend to fall in the center of the values available. Motions that are too slow, too fast, too far, not far enough, etc., tend to be less likely to be preferred. This must often be balanced against obviousness, in cases where a motion of longer length or duration is the most obvious kind of enactment.

4. Perspective-independence — The nature (path, manner, relative orientation, etc.) of the motion event should be identifiable from all points of view. An event that looks different from different perspectives is less likely to be preferred as a “best visualization.”

5. Physics-independence — Any physics effects applied after the completion of an event should not change the event class. For example, when a “lean” event is completed, if the leaned object falls off its support, that is unlikely to be considered a “lean.”


6. Reflexivity — An event visualization generated from a given label should be identifiable as that label. The prototypical “move” may be considered a “slide,” since “move” instances respecified as “slide” are more likely to be described as “move” events than other events respecified from “move.”

7. Demonstrativity — An event’s VoxML markup includes a semantic head that broadly indi- cates what class of program is enacted over the arguments, and the event visualization should demonstrate that class of program, such as a continually iterated process, a value assignment, etc. For instance, a prototypical transition event such as “put” will clearly demonstrate the contrast and distinction between the ¬φ and φ states encoded in its typing. In conjunction with the obviousness characteristic, a demonstrative motion will display a key characteristic of the motion that distinguishes it from a different motion.

8. Certainty — The label for the motion should be definite, certain, and unambiguous. If evaluators waver between two choices for a best visualization or motion label, the event depicted is unlikely to be prototypical for either of them. This characteristic serves to distinguish the labels of motion supertypes (e.g., “move” or “turn”) from more fully-specified motion subtypes (e.g., “slide” or “flip”).

“Best values” assigned to an underspecified motion predicate are therefore those values that satisfy the maximum number of these criteria that are relevant to that predicate. These criteria or maxims are not equally relevant to every predicate under examination, and a few warrant further discussion:

(a) The quality of perspicacity as defined above suggests that, at least for motion predicates such as “move” or “turn,” that have a high amount of unspecified information, there exists a base level of information that defines that basic class of motion. This might be termed a “natural prototype” a la Rosch (1973, 1983). As these motions are also supertypes of more specific mo- tions (such as “turn,” a supertype of “spin”), adding parameters that make the underspecified motion more closely approach a natural prototype of a more specified motion predicate may remove the motion from the prototype or “natural category” of the underspecified motion. This comports with the observation that spins and rolls are less likely to be considered acceptable instances of “turn” than arbitrary rotations that do not have another label associated with them in the simulation besides “turn.”


(b) The above observation also applies to the quality of reflexivity. This research relied on two human-driven evaluation tasks: one in which evaluators had to pick the best visualization(s) for a given label and another where they had to pick the best label(s) for a given visualization. The prototypical motion event for a given class is one where these two classifications converge, such that given a motion class C and a visualization V, for V to be considered a prototypical realization of C by the reflexivity maxim, an arbitrary evaluator should be likely to both choose V as an acceptable or best visualization for C, and choose C as an acceptable or best description of V.

(c) The quality of physics-independence suggests that humans evaluate motions not just on the interval from beginning to the instant of satisfaction, but also consider the effects of the motion for at least a short period after the completion of the event, such that an instance of “lean” that does not continue the created support relation is not considered a “lean.” This suggests that the events are taken to be perfective and that prototypical events are judged relative to that Aktionsart.3 However, we should also consider that it may be that all the motions considered herein have natural perfective constructions, thus biasing judgment in that direction. Lexical distinctions such as these could be examined using simulations of semelfactive or atelic verbs (e.g., “blink,” “read,” etc.).

The motion predicates tested broadly fell into three types: those where all underspecified variables have a distinct range of “best values” according to the human judges (“move,” “put touching,” “put near”), those where the precise values of the underspecified variables are routinely judged immaterial (“lean” in HET1), and those where certain variables have a preferred range but others do not (the remainder). As expected, most fell into the last category. One interesting conclusion is that human judges appear to exhibit a preference for the minimum level of additional specification required. That is, there appears to be a preference against overspecification, a kind of Gricean maxim for motion events. There is also the possibility that the “best values” chosen by the human judges may be correlated to object or scene properties such as size of the object or environment/MES. Object features are reflected in some of the parameter values, such as relative offset ranges for “put near” events, but these are very weak signals. To further examine these kinds of effects, a similar set of experiments can be run in different environments to see if correlations emerge between Monte Carlo-generated value ranges and object-independent scene properties, such as the size of the total manipulable area beyond the MES.

3A property of predicates concerned with the internal temporal consistency of the denoted situation (Vendler, 1957; Comrie, 1976; Bache, 1985). The German term is roughly equivalent to “lexical aspect” in English.


Conditioning on these additional variables could reveal the effects of object features even beyond obvious candidates like object size (a primary factor in occlusion, which may have affected evaluators' judgments, as discussed in Section 4.1.7). Since human evaluators were making their judgments on the basis of the visualized events, without access to the feature vectors that created them, it was difficult to quantitatively (or even qualitatively) assess the informativity of particular features in motion event labeling. As humans tend to be better judges of qualitative parameters than of precise quantitative values such as those provided by the feature vectors (particularly the continuous features), the feature vectors are unlikely to have been much help to the human judges, although this question may be worth individual examination, particularly on the basis of individual features. At any rate, the machine learning-based AET provides a heuristic against which to measure individual feature informativity.

4.5.1 Feature Informativity

Independent of its actual value, the presence or absence of a given underspecified feature turns out to be quite a strong predictor of motion class, and even without object features, can be used to automatically discriminate minimal pairs (or triplets) of complete sentences based on motion class alone, with relatively high accuracy, given sufficient training time. This is an interesting result that says something about the data and the task, and not just the machine learning method. The meaning of a motion event, at least according to this data, can be said to be found as much in what is left out as in what is said. This axiom, in many senses and interpretations, seems apt to generalize to semantic informativity in language at large.

Chapter 5

Future Directions

While it has been discussed how composing primitive behaviors into complex events requires in-depth knowledge of the entailments underlying event descriptions, corpora targeted toward needs such as this are only recently being developed (Bowman et al., 2015). Annotated video and image datasets that do not restrict annotator input to a core vocabulary (e.g., Ronchi and Perona (2015)) contain mostly top-level super-events as opposed to the subevent data needed to perform automatic event composition, while those that require a restricted vocabulary rely on differing sets of primitives (Chao et al., 2015; Gupta and Malik, 2015). The data gathered through the course of this research can serve as one such dataset, built on what I believe to be a set of primitive actions grounded solidly in both language understanding and semantics, and in three-dimensional mathematics. Also provided are a set of feature vectors describing the test set of motion events, experimental data inferring some “best values” for those features (should they exist), and a machine-learning method resulting in an assessment of the informativity of those features.

The Brandeis University lab in which this research has been conducted has begun bootstrapping a dataset of videos annotated with event-subevent relations using ECAT, an internally-developed video annotation tool (Do et al., 2016). ECAT allows the annotation of video with labeled events, object participants, and subevents, which can be used to induce the common subevent structures for a labeled superevent. Videos have been recorded of a test set of verbs using simple objects, such as blocks and other objects like those used in the test set here. Similar annotation methods have also been linked to data from movies (Do, 2016; Kehat and Pustejovsky, 2016).


The features of motion events gathered in this dissertation, along with the analyzed informativity metrics, can be used to better understand motion semantics in narratives.

Through this line of research, it can be demonstrated that there exist, for some underspecified variables in motion events, prototypical values that create event enactments (here visual) that comport with human notions of prototypicality for those events. These conclusions flow from analysis of a dataset linking visualized events to linguistic instantiations, which can serve as the beginnings of a corpus of motion events annotated with, effectively, the information “missing” from the linguistic instantiation, such that an utterance describing a motion event, composed with its voxeme and corpus data from a dataset such as this one, results in a complete event visualization with the missing “bits” assigned values, allowing the event to be computationally evaluated. Underlying that is VoxML, a robust, extensible framework for mapping natural language to a minimal model and then to a simulation through the dynamic semantics, DITL. The implementation of this framework is VoxSim, a software platform for simulation in a visual modality, which demonstrates the framework's utility for answering theoretical linguistic questions about motion events. It should be stressed here that visualization is just one available modality. As technology improves, events may be simulated through other modalities, including aural, haptic, or proprioceptive ones.

The deep learning methods used here for event classification provide good results over some permutations of the dataset, but could be better in others, and utterly fail in others still. The notion should be entertained that, as a motion event from start to finish is a sequence, attempting to decode it from a single feature vector is perhaps not the best technique, and sequence-to-sequence methods could be explored for this type of task. In addition, since qualitative spatial relations are readily available in the data (either directly encoded in feature vectors or calculable from other feature values plus generally accessible knowledge about the scene), machine learning algorithms that perform analogical generalization over the QSR data could provide similar or better results, potentially with less data (cf. McLure et al. (2015)).

The machine evaluation presented herein was largely modeled on the second human evaluation task, requiring the machine learning algorithm to predict the original input sentence from the features of a generated visualization. Using this data and the human-evaluated data as a gold standard, a machine learning-based version of the reverse task is also possible, wherein VoxSim generates a set of new visualizations with different random values for the underspecified parameters, and the algorithm predicts which of those new instances are most likely to be judged acceptable by a human evaluator. These results could then be compared to the presented results from the first human evaluation task.

5.1 Extensions to Methodology

Extensions to the research methodology may involve using VoxSim in an augmented or virtual reality environment to examine how human perception of motion events changes in a virtual environment when it is fully immersive rather than viewed on a screen. A variable could be added by introducing a disorienting factor to the human judges' perception, investigating the intersection between spatial perception, cognition, and language processing in virtual environments.

The use of a virtual environment also affords alternative methods of gathering additional data to augment the dataset presented here, which was gathered by a very constrained and specified method in order to present a potentially first-of-its-kind dataset in a new field. Data on alternative features can be gathered with very slight adjustments to the automatic capture code, human evaluation tasks can be designed to evaluate event visualization acceptability on a scale rather than as a binary, and VoxSim's foundation on game design, AI theory, and game engine technology readily lends itself to “gamification” approaches in data gathering that could obviate some need for budget considerations and constraints (Pelling, 2011; Deterding et al., 2011). One such approach might be modeled on the “ESP game” (Von Ahn and Dabbish, 2004), wherein pairs of anonymous players label the same image and receive points when their labels coincide, providing an impetus to keep playing. Using a virtually identical mechanic, with motion events generated by VoxSim instead of static images, could potentially generate a large set of video captions (a la those used in HET2), and the search space could be left open or restricted and then evaluated against both the HET2 results and the presented machine-learning results. This and related approaches could provide many methods for expanding the initial datasets presented in this thesis into a genuine crowd-sourced gold standard, and allow more flexibility in gathering new data about new events or object interactions.

5.2 VoxML and Robotics

The VoxML framework also has relevance to the field of robotics. While a humanoid skeleton in a 3D environment is a directed, rooted graph with nodes laid out in the rough configuration of a human, representing the positions of the major joints, a robotic agent could be virtually represented by a similar type of graph structure with a different configuration, isomorphic to the locations of major pivot points on the robot's external structure, such as those of graspers or robotic limbs.

114 CHAPTER 5. FUTURE DIRECTIONS

A 3D representation of a robotic agent that is operating in the real world would then allow the simulation of events in the 3D world (such as moving the simulated robot around a simulated table that has simulated blocks on it) that represent events and object configurations in the real world. The event simulation then generates position and orientation information for each object in the scene at each time step t, which is isomorphic to the real-world configuration in the same way that the robot's virtual skeleton is isomorphic to its actual joint structure. This allows the real robot, acting as an agent, to be fed a set of translation and rotation “moves” by its virtual embodiment that is a nearly exact representation of the steps it would need to take to satisfy a real-world goal, such as navigating to a target or grasping an object (cf. Thrun et al. (2000); Rusu et al. (2008)). In short, the interdisciplinary nature of the research that led to the creation of VoxML and VoxSim naturally affords many extensions into other disciplines, fields, and specializations.
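As a minimal, hypothetical illustration of the mapping just described (not an existing VoxSim or robot interface), the sketch below represents a robot as a rooted graph of pivot nodes and flattens the skeleton at one simulation time step into the per-node positions and orientations a real controller could consume.

    from dataclasses import dataclass, field

    @dataclass
    class PivotNode:
        """One pivot point (e.g., a grasper joint) in the robot's virtual skeleton."""
        name: str
        position: tuple = (0.0, 0.0, 0.0)
        rotation: tuple = (0.0, 0.0, 0.0, 1.0)  # quaternion (x, y, z, w)
        children: list = field(default_factory=list)

    def pose_at_timestep(root: PivotNode) -> dict:
        """Flatten the skeleton into {node name: (position, rotation)} for one time
        step t -- the payload that would be sent to the physical robot as its next
        set of translation and rotation 'moves'."""
        pose = {root.name: (root.position, root.rotation)}
        for child in root.children:
            pose.update(pose_at_timestep(child))
        return pose

    # Hypothetical two-joint arm; the simulation would update these values each step.
    arm = PivotNode("shoulder", children=[PivotNode("grasper", position=(0.0, 0.2, 0.4))])
    print(pose_at_timestep(arm))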

5.3 Information-Theoretic Implications

The resemblance between the incomplete information provided by a linguistic predicate and models of ignorance about a physical system is quite striking. Ω, the measure of ignorance of a physical system, can be thought of in terms of the number of quantitatively-defined microstates consistent with a qualitative macrostate (or “label”), and in the data gathered here about linguistic presuppositions, we can observe similar patterns unfolding with regard to events, wherein some events (for instance, “lean”) appear to allow a number of configurations (or microstates) that would satisfy that label (or macrostate) according to human evaluators. We also observe circumstances in which the incomplete information provided by a predicate is actually further restricted to some set of values and metrics that appear to correlate more closely to prototypicality of a given motion event than others, suggesting that some level of information is added by the interpreter. That evaluators seem to be resistant to highly underspecified events such as “move” being too far overspecified, or having too much information content assigned to them, suggests that predicates may be bearers of a finite level of information entropy, meaning that a finite level of information is required to describe the system.1 If I may be allowed to philosophize for a moment, this suggests that representability of a proposition is ultimately grounded in physical reality. For a statement to have meaning, irrespective of its truth value, it must be assessed relative to some condition in the world, either as describing a true situation, or in contrast to one. A “roll” is technically a kind of turning, but a “flip” may be a better one. Adding too much, or the wrong kind of, information to an underspecified predicate can change its meaning in the eye of the interpreter. Different intensions and extensions, different senses and references, find their union in the set of parameter values that satisfy both, resulting in a realization that, to the interpreter under the model constructed by the current situational context, appears to be the truth.

1Qualitative reasoning approaches to this notion have been forwarded by Kuipers (1994) and Joskowicz and Sacks (1991), where many of the examples presented involve physical systems. Other discussion of qualitative spatial reasoning in physical systems is presented in the qualitative physics literature, including Forbus (1988) and Faltings and Struss (1992), and a recent gamification approach is presented by Shute et al. (2013).
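To make the analogy explicit as a rough formalization (mine, not a claim drawn from the data), let C be the space of fully specified parameter configurations and treat a label ℓ as a macrostate:

    \Omega(\ell) = \bigl|\{\, c \in \mathcal{C} : c \models \ell \,\}\bigr|,
    \qquad
    H(\ell) = \log_{2} \Omega(\ell)

Under this reading, a highly underspecified predicate such as “move” has a large Ω(ℓ) and a high H(ℓ); each added specification shrinks Ω(ℓ), and past some point the remaining configurations are more consistent with a different, more specific label, which is one way of reading the evaluators' resistance to overspecification.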

Appendix A

VoxML Structures

A.1 Objects

block
  LEX = [ PRED = block
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = rectangular prism[1]
           COMPONENTS = nil
           CONCAVITY = flat
           ROTATSYM = {X, Y, Z}
           REFLECTSYM = {XY, XZ, YZ} ]
  HABITAT = [ INTR = [2][CONSTR = {X = Y + Z}]
              EXTR = ... ]
  AFFORD_STR = [ A1 = H[2] → [put(x, on([1]))]support([1], x)
                 A3 = H[2] → [grasp(x, [1])]hold(x, [1])
                 A4 = H[2] → [slide(x, [1])]H[2] ]
  SCALE =

ball
  LEX = [ PRED = ball
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = ellipsoid[1]
           COMPONENTS = nil
           CONCAVITY = convex
           ROTATSYM = {X, Y, Z}
           REFLECTSYM = {XY, XZ, YZ} ]
  HABITAT = [ INTR = ...
              EXTR = ... ]
  AFFORD_STR = [ A1 = H → [grasp(x, [1])]
                 A2 = H → [hold(x, [1])]lift(x, [1])
                 A3 = H → [roll(x, [1])] ]
  SCALE =

plate
  LEX = [ PRED = plate
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = sheet[1]
           COMPONENTS = base, surface[1]
           CONCAVITY = concave
           ROTATSYM = {Y}
           REFLECTSYM = {XY, YZ}
           CONSTR = {Y  X, Y  Z} ]
  HABITAT = [ INTR = [2][UP = align(Y, EY), TOP = top(+Y)]
              EXTR = [3][UP = align(Y, E⊥Y)] ]
  AFFORD_STR = [ A1 = H[2] → [put(x, on([1]))]support([1], x)
                 A2 = H[2] → [put(x, in([1]))]contain([1], x)
                 A3 = H[2] → [grasp(x, [1])]hold(x, [1])
                 A4 = H[2] → [slide(x, [1])]H[2]
                 A5 = H[3] → [roll(x, [1])]H[3] ]
  SCALE =

cup
  LEX = [ PRED = cup
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = cylindroid[1]
           COMPONENTS = surface, interior
           CONCAVITY = concave
           ROTATSYM = {Y}
           REFLECTSYM = {XY, YZ}
           CONSTR = {Y > X, Y > Z} ]
  HABITAT = [ INTR = [2][UP = align(Y, EY), TOP = top(+Y)]
              EXTR = [3][UP = align(Y, E⊥Y)] ]
  AFFORD_STR = [ A1 = H[2] → [put(x, on([1]))]support([1], x)
                 A2 = H[2] → [put(x, in([1]))]contain([1], x)
                 A3 = H[2] → [grasp(x, [1])]
                 A4 = H[3] → [roll(x, [1])] ]
  SCALE =

disc
  LEX = [ PRED = disc
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = cylindroid[1]
           COMPONENTS = nil
           CONCAVITY = flat
           ROTATSYM = {Y}
           REFLECTSYM = {XY, YZ} ]
  HABITAT = [ INTR = [2][CONSTR = {X > Y, Z > Y}]
              [3][CONSTR = {X([1]) > X(y : physobj), Z([1]) > Z(y : physobj)}]
              EXTR = [4][UP = align(Y, E⊥Y)] ]
  AFFORD_STR = [ A1 = H[2] → [grasp(x, [1])]
                 A2 = H[3] → [put([1], on(y))]close([1], y)
                 A4 = H[4] → [roll(x, [1])] ]
  SCALE =

book
  LEX = [ PRED = book
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = rectangular prism[1]
           COMPONENTS = cover[2]+, page[3]+
           CONCAVITY = flat
           ROTATSYM = nil
           REFLECTSYM = {XY} ]
  HABITAT = [ INTR = [4][UP = align(Y, EY), TOP = front(+Y)]
              EXTR = ... ]
  AFFORD_STR = [ A1 = H → [grasp(x, [2]), move(x, [2], away(from([3])))]open(x, [1])
                 A2 = H → [grasp(x, [2]), move(x, [2], toward([3]))]close(x, [1]) ]
  SCALE =

blackboard
  LEX = [ PRED = blackboard
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = sheet[1]
           COMPONENTS = board, surface[1], back, leg[2]+
           CONCAVITY = flat
           ROTATSYM = nil
           REFLECTSYM = {YZ} ]
  HABITAT = [ UP = align(Y, EY), TOP = top(+Y)
              INTR = [3][UP = align(Z, EZ), TOP = front(+Z)]
              CONSTR[2] = {Y  X, Y  Z}
              EXTR = ... ]
  AFFORD_STR = [ A1 = H → [grasp(x, [1])]
                 A1 = H[3] → [write(x, on([1]))] ]
  EMBODIMENT = [ SCALE = agent
                 MOVABLE = true ]

bottle
  LEX = [ PRED = bottle
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = cylindroid[1]
           COMPONENTS = surface, interior
           CONCAVITY = concave
           ROTATSYM = {Y}
           REFLECTSYM = {XY, YZ}
           CONSTR = {Y > X, Y > Z} ]
  HABITAT = [ INTR = [2][UP = align(Y, EY), TOP = top(+Y)]
              EXTR = [3][UP = align(Y, E⊥Y)] ]
  AFFORD_STR = [ A1 = H[2] → [put(x, on([1]))]support([1], x)
                 A2 = H[2] → [put(x, in([1]))]contain([1], x)
                 A3 = H[2] → [grasp(x, [1])]
                 A4 = H[3] → [roll(x, [1])] ]
  SCALE =

grape
  LEX = [ PRED = grape
          TYPE = physobj ]
  TYPE = [ HEAD = ellipsoid[1]
           COMPONENTS = fruit[1]
           CONCAVITY = flat
           ROTATSYM = {Y}
           REFLECTSYM = {XY, YZ} ]
  HABITAT = [ INTR = ...
              EXTR = [2][UP = align(Y, E⊥Y)] ]
  AFFORD_STR = [ A1 = H → [grasp(x, [1])]
                 A1 = H → [hold(x, [1])]lift(x, [1])
                 A1 = H → [slide(x, [1])]
                 A1 = H[2] → [roll(x, [1])] ]
  SCALE =

apple
  LEX = [ PRED = apple
          TYPE = physobj ]
  TYPE = [ HEAD = ellipsoid[1]
           COMPONENTS = fruit[1], stem, leaf
           CONCAVITY = flat
           ROTATSYM = {Y}
           REFLECTSYM = {XY, YZ} ]
  HABITAT = [ INTR = ...
              EXTR = [2][UP = align(Y, E⊥Y)] ]
  AFFORD_STR = [ A1 = H → [grasp(x, [1])]
                 A1 = H → [hold(x, [1])]lift(x, [1])
                 A1 = H → [slide(x, [1])]
                 A1 = H[2] → [roll(x, [1])] ]
  SCALE =

banana
  LEX = [ PRED = banana
          TYPE = physobj ]
  TYPE = [ HEAD = cylindroid[1]
           COMPONENTS = fruit[1], stem
           CONCAVITY = convex
           ROTATSYM = {Y}
           REFLECTSYM = {YZ} ]
  HABITAT = [ INTR = ...
              EXTR = [2][UP = align(Y, E⊥Y)] ]
  AFFORD_STR = [ A3 = H → [grasp(x, [1])]
                 A3 = H → [hold(x, [1])]lift(x, [1])
                 A4 = H[2] → [slide(x, [1])] ]
  SCALE =

bowl
  LEX = [ PRED = bowl
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = cylindroid[1]
           COMPONENTS = base, interior
           CONCAVITY = concave
           ROTATSYM = {Y}
           REFLECTSYM = {XY, YZ}
           CONSTR = {Y < X, Y < Z} ]
  HABITAT = [ INTR = [2][UP = align(Y, EY), TOP = top(+Y)]
              EXTR = [3][UP = align(Y, E⊥Y)] ]
  AFFORD_STR = [ A1 = H[2] → [put(x, on([1]))]support([1], x)
                 A2 = H[2] → [put(x, in([1]))]contain([1], x)
                 A3 = H[2] → [grasp(x, [1])]
                 A3 = H[2] → [grasp(x, [1])]lift(x, [1])
                 A3 = H[2] → [slide(x, [1])]
                 A3 = H[3] → [roll(x, [1])] ]
  SCALE =

knife
  LEX = [ PRED = knife
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = rectangular prism[1]
           COMPONENTS = handle[2], blade
           CONCAVITY = flat
           ROTATSYM = nil
           REFLECTSYM = {XY}
           CONSTR = {X > Y, X  Z} ]
  HABITAT = [ INTR = [3][FRONT = front(+X)]
              EXTR = ... ]
  AFFORD_STR = [ A1 = H[3] → [grasp(x, [1])]
                 A2 = H[3] → [grasp(x, [2]) → grasp(x, [1])] ]
  SCALE =

   pencil              PRED = pencil        LEX =          TYPE = physobj, artifact                =    HEAD cylindroid[1]             COMPONENTS = shaft[1], eraser, nib             CONCAVITY = convex    TYPE =              ROTATSYM = {Z}                 REFLECTSYM = {XZ,YZ}                         =      CONSTR {Y  X,Y  Z}                  NTR = [4] FORWARD = align(Z, E )      I  Z                 HABITAT =   FRONT = front(+Z)                                  EXTR = [5] FORWARD = align(Z, E )       ⊥Y                            A3 = H[4] → [grasp(x, [1])]             =    AFFORD STR =  A3 H[4] → [hold(x, [1])]lift(x, [1])             = H → [roll(x, [1])]     A4 [5]                 SCALE =


paper sheet
    LEX        = [ PRED = paper sheet
                   TYPE = physobj, artifact ]
    TYPE       = [ HEAD = sheet[1]
                   COMPONENTS = nil
                   CONCAVITY = flat
                   ROTATSYM = {Y}
                   REFLECTSYM = {XY, YZ}
                   CONSTR = {Y  X, Y  Z} ]
    HABITAT    = [ INTR = [2][UP = align(Y, EY), TOP = top(+Y)]
                   EXTR = ... ]
    AFFORD_STR = [ A1 = H[2] → [grasp(x, [1])]
                   A1 = H[2] → [hold(x, [1])]lift(x, [1])
                   A1 = H[2] → [slide(x, [1])] ]
    SCALE      = ...
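For readers implementing against the object entries above, the following is a minimal sketch, not VoxSim's actual data model, of how such a voxeme could be held as a typed structure at runtime, populated with the [[BOWL]] values shown above. All class and field names here are hypothetical and are introduced purely for illustration.

using System.Collections.Generic;

// Hypothetical container types for illustration only; these do not reproduce
// VoxSim's actual VoxML classes.
public class VoxLex { public string Pred; public List<string> Type = new List<string>(); }
public class VoxHabitat { public string Intrinsic; public string Extrinsic; }
public class VoxAffordance { public string Habitat; public string Formula; }

public class VoxObjectEntry {
    public VoxLex Lex = new VoxLex();
    public string Head;
    public List<string> Components = new List<string>();
    public string Concavity;
    public VoxHabitat Habitat = new VoxHabitat();
    public List<VoxAffordance> Affordances = new List<VoxAffordance>();

    // Populate the structure with (part of) the [[BOWL]] entry from above.
    public static VoxObjectEntry Bowl() {
        var bowl = new VoxObjectEntry { Head = "cylindroid", Concavity = "concave" };
        bowl.Lex.Pred = "bowl";
        bowl.Lex.Type.AddRange(new[] { "physobj", "artifact" });
        bowl.Components.AddRange(new[] { "base", "interior" });
        bowl.Habitat.Intrinsic = "UP = align(Y, EY), TOP = top(+Y)";
        bowl.Habitat.Extrinsic = "UP = align(Y, E⊥Y)";
        bowl.Affordances.Add(new VoxAffordance {
            Habitat = "[2]", Formula = "H[2] -> [put(x, on([1]))]support([1], x)" });
        bowl.Affordances.Add(new VoxAffordance {
            Habitat = "[2]", Formula = "H[2] -> [put(x, in([1]))]contain([1], x)" });
        return bowl;
    }
}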

A.2 Programs

move
    LEX  = [ PRED = move
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), move(x, y))] ] ]


turn
    LEX  = [ PRED = turn
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), rotate(y))] ] ]

roll
    LEX  = [ PRED = roll
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj
                      A3 = z:surface ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y),
                               while(EC(y, z), translocate(x, y), rotate(x, y)))] ] ]

slide
    LEX  = [ PRED = slide
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj
                      A3 = z:surface ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y),
                               while(EC(y, z), move(x, y)))] ] ]


push
    LEX  = [ PRED = push
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj
                      A3 = z:physobj ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y),
                               while(EC(y, z), move(x, y)))] ] ]

spin
    LEX  = [ PRED = spin
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj ]
             BODY = [ E1 = grasp(x, y)
                      E2 = rotate(x, y)
                      E3 = ungrasp(x, y) ] ]

lift
    LEX  = [ PRED = lift
             TYPE = process ]
    TYPE = [ HEAD = process
             ARGS = [ A1 = x:agent
                      A2 = y:physobj ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), move(x, y, vec(EY)))] ] ]


stack
    LEX  = [ PRED = stack
             TYPE = transition event ]
    TYPE = [ HEAD = transition
             ARGS = [ A1 = x:agent
                      A2 = y[]:physobj ]
             BODY = [ E1 = def(y[0], as(z)), for(o ∈ y[1..n])
                           [put(o, on(z)), reify((z, o), as(z))] ] ]

put
    LEX  = [ PRED = put
             TYPE = transition event ]
    TYPE = [ HEAD = transition
             ARGS = [ A1 = x:agent
                      A2 = y:physobj
                      A3 = z:location ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), move(x, y))]
                      E3 = [at(y, z) → ungrasp(x, y)] ] ]


lean
    LEX  = [ PRED = lean
             TYPE = transition event ]
    TYPE = [ HEAD = transition
             ARGS = [ A1 = x:agent
                      A2 = y:physobj
                      A3 = z:location ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), turn(x, y,
                               align(minor(y), EY × (90 − θ, about(E⊥Y)))))]
                      E3 = [while(hold(x, y), turn(x, y,
                               align(major(y), EY × (θ, about(E⊥Y))),
                               about(minor(y))))]
                      E4 = [while(hold(x, y), put(x, y))]
                      E5 = [at(y, z) → ungrasp(x, y)] ] ]

flip
    LEX  = [ PRED = flip
             TYPE = transition event ]
    TYPE = [ HEAD = transition
             ARGS = [ A1 = x:agent
                      A2 = y:physobj ]
             BODY = [ E1 = def(w, as(orient(y)))[grasp(x, y)]
                      E2 = [while(hold(x, y), rotate(x, y))]
                      E3 = [(orient(y) = opp(w)) → ungrasp(x, y)] ] ]


close
    LEX  = [ PRED = close
             TYPE = transition event ]
    TYPE = [ HEAD = transition
             ARGS = [ A1 = x:agent
                      A2 = y:physobj ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), move(x, y))]
                      E3 = [DC(interior(y), E) → ungrasp(x, y)] ] ]

open
    LEX  = [ PRED = open
             TYPE = transition event ]
    TYPE = [ HEAD = transition
             ARGS = [ A1 = x:agent
                      A2 = y:physobj ]
             BODY = [ E1 = grasp(x, y)
                      E2 = [while(hold(x, y), move(x, y))]
                      E3 = [EC(interior(y), E) → ungrasp(x, y)] ] ]
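Each program BODY above is a sequence of subevents, some of which run until a condition holds; for example, [[PUT]]'s E2 runs while hold(x, y) is true and E3 fires once at(y, z) is satisfied. Purely as an illustration, the sketch below shows one way such a subevent sequence could be driven from a per-frame update. The scheduler class and its names are hypothetical and are not VoxSim's EventManager.

using System;
using System.Collections.Generic;

// Illustrative only: a toy scheduler for a program BODY such as [[PUT]]'s
//   E1 = grasp(x, y); E2 = [while(hold(x, y), move(x, y))]; E3 = [at(y, z) -> ungrasp(x, y)].
public class SubeventScheduler {
    // Each step runs every tick until its completion test returns true.
    private readonly Queue<(Action step, Func<bool> done)> steps =
        new Queue<(Action, Func<bool>)>();

    public void Enqueue(Action step, Func<bool> done) => steps.Enqueue((step, done));

    // Called once per simulation tick: advance the current subevent and pop it
    // when its completion condition is satisfied, exposing the next one.
    public void Tick() {
        if (steps.Count == 0) return;
        var (step, done) = steps.Peek();
        step();
        if (done()) steps.Dequeue();
    }

    public bool Finished => steps.Count == 0;
}

Under this reading, a [[PUT]] body would enqueue a grasp step, then a move step whose completion test is at(y, z), then an ungrasp step whose test is immediate.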

A.3 Relations

on
    LEX  = [ PRED = on ]
    TYPE = [ CLASS = config
             VALUE = EC
             ARGS = [ A1 = x:3D
                      A2 = y:3D ]
             CONSTR = y→HABITAT→INTR[align] ]


in
    LEX  = [ PRED = in ]
    TYPE = [ CLASS = config
             VALUE = PO ∥ TPP ∥ NTPP
             ARGS = [ A1 = x:3D
                      A2 = y:3D ]
             CONSTR = y→HABITAT→INTR[align]? ]

against
    LEX  = [ PRED = against ]
    TYPE = [ CLASS = force dynamic
             VALUE = EC
             ARGS = [ A1 = x:3D
                      A2 = y:3D ]
             CONSTR = nil ]

at
    LEX  = [ PRED = at ]
    TYPE = [ CLASS = config
             VALUE = DC ∥ EC
             ARGS = [ A1 = x:3D
                      A2 = y:3D ]
             CONSTR = dist(x, y) < ... ]
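The VALUE attributes above are qualitative spatial relations (DC, EC, PO, TPP, NTPP). As a rough illustration only, assuming axis-aligned bounds and a hypothetical tolerance, an EC ("externally connected") test of the kind appealed to by [[ON]] and [[AGAINST]] might be approximated as follows; this is a sketch, not VoxSim's actual relation tester.

using UnityEngine;

// Illustrative approximation of EC over axis-aligned bounds: the (slightly
// expanded) boxes touch, but their interiors do not overlap.  Epsilon and the
// helper below are assumptions.
public static class RelationTest {
    const float Epsilon = 0.001f;

    // Interiors overlap if the boxes penetrate each other by more than epsilon
    // on every axis.
    static bool InteriorsOverlap(Bounds a, Bounds b) {
        return (Mathf.Min(a.max.x, b.max.x) - Mathf.Max(a.min.x, b.min.x) > Epsilon) &&
               (Mathf.Min(a.max.y, b.max.y) - Mathf.Max(a.min.y, b.min.y) > Epsilon) &&
               (Mathf.Min(a.max.z, b.max.z) - Mathf.Max(a.min.z, b.min.z) > Epsilon);
    }

    // EC: grow one box by epsilon; if the grown box meets the other while the
    // original interiors stay disjoint, the two regions are (approximately)
    // externally connected.
    public static bool ExternallyConnected(Bounds a, Bounds b) {
        Bounds grown = new Bounds(a.center, a.size + Vector3.one * 2 * Epsilon);
        return grown.Intersects(b) && !InteriorsOverlap(a, b);
    }
}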

A.4 Functions

edge
    LEX  = [ PRED = edge ]
    TYPE = [ ARG = x:physobj
             REFERENT = x→HEAD
             MAPPING = dimension(n):1
             SPACE = object
             ORIENTATION = [ AXIS = ref:after
                             ARITY = x→HABITAT→INTR[axis] ] ]

center
    LEX  = [ PRED = center ]
    TYPE = [ ARG = x:physobj
             REFERENT = x→HEAD
             MAPPING = dimension(n):0
             SPACE = object
             ORIENTATION = [ AXIS = nil
                             ARITY = intransitive ] ]

Appendix B

Underspecifications

move(y)
    Underspecified parameters:  motion manner
    Satisfaction test:  loc(y)_n ≠ loc(y)_0 ∨ rot(y)_n ≠ rot(y)_0

turn(y)
    Underspecified parameters:  rot speed, rot axis, rot angle, rot dir, motion manner
    Satisfaction test:  rot(y)_n ≠ rot(y)_0

roll(y)
    Underspecified parameters:  transloc dir
    Satisfaction test:  EC(y, z)_[0,n]?, loc(y)_n ≠ loc(y)_0,
                        rot(y) ∝ Σ_{t=0}^{n} (loc(y)_t − loc(y)_{t−1})

slide(y)
    Underspecified parameters:  transloc speed, transloc dir
    Satisfaction test:  EC(y, z)_[0,n]?, loc(y)_n ≠ loc(y)_0,
                        ¬(rot(y) ∝ Σ_{t=0}^{n} (loc(y)_t − loc(y)_{t−1}))

spin(y)
    Underspecified parameters:  rot angle, rot speed, rot axis, rot dir, motion manner
    Satisfaction test:  rot(y)_n ≠ rot(y)_0, (rot(y)_t ≠ rot(y)_{t−1})_[0,n]?

lift(y)
    Underspecified parameters:  transloc speed, transloc dir
    Satisfaction test:  x(y)_n = x(y)_0, y(y)_n > y(y)_0, z(y)_n > z(y)_0,
                        (loc(y)_t ≠ loc(y)_{t−1})_[0,n]?

stack(y[])
    Underspecified parameters:  list
    Satisfaction test:  (EC(y[i], y[i+1]), Y(y[i]) < Y(y[i+1]))_{i∈y}

put(y,z)
    Underspecified parameters:  transloc speed, rel orientation, rel offset
    Satisfaction test:  d(z, loc(y)_n) = 0, (loc(y)_t ≠ loc(y)_{t−1})_[0,n]?

lean(y,z)
    Underspecified parameters:  rot angle
    Satisfaction test:  EC(y, z), ¬align(major(y), EY), rot(y)_n ≠ rot(y)_0

flip(y,z)
    Underspecified parameters:  rot axis, symmetry axis
    Satisfaction test:  vec(X(y))_n = opp(vec(z − loc(y))_0) ∨
                        vec(Y(y))_n = opp(vec(z − loc(y))_0) ∨
                        vec(Z(y))_n = opp(vec(z − loc(y))_0)

close(y)
    Underspecified parameters:  motion speed
    Satisfaction test:  manner = put:  DC(y, z)_[0,n−1], EC(y, z)_n, loc(y)_n ≠ loc(y)_0
                        manner = turn: rot(y)_n ≠ rot(y)_0

open(y)
    Underspecified parameters:  motion speed, transloc dir, rot angle
    Satisfaction test:  manner = move: EC(y, z)_0, DC(y, z)_[1,n], loc(y)_n ≠ loc(y)_0
                        manner = turn: EC(y, z)_[0,n], align(major(y), major(z))_0,
                                       ¬align(major(y), major(z))_[1,n], rot(y)_n ≠ rot(y)_0

Table B.1: Underspecified parameters and satisfaction conditions
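As an illustration of how satisfaction tests of the kind in Table B.1 can be evaluated, the sketch below checks the move(y) and turn(y) conditions against the first and last sampled states of an object. The state struct and tolerance are assumptions introduced here, not VoxSim's SatisfactionTest implementation.

using UnityEngine;

// Illustrative only: first/last sampled object state and two of the Table B.1 tests.
public struct ObjectState {
    public Vector3 Position;    // loc(y)_t
    public Quaternion Rotation; // rot(y)_t
}

public static class SatisfactionSketch {
    const float Tolerance = 0.0001f; // assumed comparison tolerance

    // move(y): loc(y)_n ≠ loc(y)_0  ∨  rot(y)_n ≠ rot(y)_0
    public static bool MoveSatisfied(ObjectState first, ObjectState last) {
        return Vector3.Distance(first.Position, last.Position) > Tolerance ||
               Quaternion.Angle(first.Rotation, last.Rotation) > Tolerance;
    }

    // turn(y): rot(y)_n ≠ rot(y)_0
    public static bool TurnSatisfied(ObjectState first, ObjectState last) {
        return Quaternion.Angle(first.Rotation, last.Rotation) > Tolerance;
    }
}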

Appendix C

[[TURN]]: Complete Operationalization

public void TURN(object[] args) {
    string originalEval = eventManager.events [0];
    Vector3 targetRotation = Vector3.zero;
    float sign = 1.0f;
    float angle = 0.0f;
    string rotAxis = "";

    // look for agent
    GameObject agent = GameObject.FindGameObjectWithTag("Agent");
    if (agent != null) {
        // add preconditions
        if (!SatisfactionTest.IsSatisfied (string.Format ("reach({0})", (args [0] as GameObject).name))) {
            eventManager.InsertEvent (string.Format ("reach({0})", (args [0] as GameObject).name), 0);
            eventManager.InsertEvent (string.Format ("grasp({0})", (args [0] as GameObject).name), 1);
            if (args.Length > 2) {
                eventManager.InsertEvent (
                    eventManager.evalOrig [string.Format ("turn({0},{1})",
                        (args [0] as GameObject).name,
                        Helper.VectorToParsable ((Vector3)args [1]))], 1);
            } else {


eventManager.InsertEvent ( eventManager.evalOrig [string.Format ("turn({0})", (args [0] as GameObject).name)], 1); } eventManager.RemoveEvent (3); return; } else { if (!SatisfactionTest.IsSatisfied (string.Format ("grasp({0})", (args [0] as GameObject).name))) { eventManager.InsertEvent ( string.Format ("grasp({0})", (args [0] as GameObject).name), 0); if (args.Length > 2) { eventManager.InsertEvent ( eventManager.evalOrig [ string.Format ("turn({0},{1})", (args [0] as GameObject).name, Helper.VectorToParsable ((Vector3)args [1]))], 1); } else { eventManager.InsertEvent ( eventManager.evalOrig [string.Format ("turn({0})", (args [0] as GameObject).name)], 1); } eventManager.RemoveEvent (2); return; } }

        // add postconditions
        if (args [args.Length - 1] is bool) {
            if ((bool)args [args.Length - 1] == true) {
                eventManager.InsertEvent (
                    string.Format ("ungrasp({0})", (args [0] as GameObject).name), 1);
            }
        }
    }

    // override physics rigging

    foreach (object arg in args) {
        if (arg is GameObject) {
            (arg as GameObject).GetComponent<Rigging> ().ActivatePhysics(false);
        }
    }

    string prep = rdfTriples.Count > 0 ? rdfTriples [0].Item2.Replace ("turn", "") : "";

    if (args [0] is GameObject) {
        GameObject obj = (args [0] as GameObject);
        Voxeme voxComponent = obj.GetComponent<Voxeme> ();
        if (voxComponent != null) {
            if (!voxComponent.enabled) {
                voxComponent.gameObject.transform.parent = null;
                voxComponent.enabled = true;
            }

            if (args [1] is Vector3 && args [2] is Vector3) {
                // args[1] is local space axis
                // args[2] is world space axis
                if (args [3] is Vector3) {
                    // args[3] is world space axis
                    sign = Mathf.Sign (Vector3.Dot(Vector3.Cross (
                        obj.transform.rotation * (Vector3)args [1],
                        (Vector3)args [2]), (Vector3)args[3]));
                    angle = Vector3.Angle (
                        obj.transform.rotation * (Vector3)args [1],
                        (Vector3)args [2]);
                    // rotation from object axis [1]
                    // to world axis [2]
                    // around world axis [3]
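                    // Descriptive note on the sign computation above: the cross
                    // product of the object's current world-space direction for
                    // local axis args[1] and the desired world axis args[2] gives
                    // the natural axis that would rotate one onto the other;
                    // projecting that axis onto the requested rotation axis args[3]
                    // with the dot product yields the sense (positive or negative)
                    // of the rotation about args[3] needed to close the angle
                    // computed by Vector3.Angle.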

if (voxComponent.turnSpeed == 0.0f) { voxComponent.turnSpeed = RandomHelper.RandomFloat (0.0f, 12.5f, (int)RandomHelper.RangeFlags.MaxInclusive); }

                    targetRotation = (Quaternion.AngleAxis (sign * angle,
                        (Vector3)args [3]) * obj.transform.rotation).eulerAngles;
                    rotAxis = Constants.Axes.FirstOrDefault (
                        a => a.Value == (Vector3)args [3]).Key;
                } else {
                    // rotation from object axis[1] to world axis [2]

if (voxComponent.turnSpeed == 0.0f) { voxComponent.turnSpeed = RandomHelper.RandomFloat ( 0.0f, 12.5f, (int)RandomHelper.RangeFlags.MaxInclusive); }

targetRotation = Quaternion.FromToRotation( (Vector3)args [1], (Vector3)args [2]).eulerAngles; angle = Vector3.Angle ((Vector3)args [1], (Vector3)args [2]); } } else { if (voxComponent.turnSpeed == 0.0f) { voxComponent.turnSpeed = RandomHelper.RandomFloat (0.0f, 12.5f, (int)RandomHelper.RangeFlags.MaxInclusive); }

targetRotation = (obj.transform.rotation * UnityEngine.Random.rotation).eulerAngles; angle = Quaternion.Angle(transform.rotation, Quaternion.Euler(targetRotation)); }

voxComponent.targetRotation = targetRotation; } }

    // add to events manager
    if (args[args.Length-1] is bool) {

        if ((bool)args[args.Length-1] == false) {
            if (args [1] is Vector3 && args [2] is Vector3) {
                if (args [3] is Vector3) {
                    eventManager.events [0] = "turn(" + (args [0] as GameObject).name + "," +
                        Helper.VectorToParsable ((Vector3)args [1]) + "," +
                        Helper.VectorToParsable ((Vector3)args [2]) + "," +
                        Helper.VectorToParsable ((Vector3)args [3]) + ")";
                } else {
                    eventManager.events [0] = "turn(" + (args [0] as GameObject).name + "," +
                        Helper.VectorToParsable ((Vector3)args [1]) + "," +
                        Helper.VectorToParsable ((Vector3)args [2]) + ")";
                }
            } else {
                eventManager.events [0] = "turn(" + (args [0] as GameObject).name + "," +
                    Helper.VectorToParsable ((args [0] as GameObject).transform.rotation * Constants.yAxis) + "," +
                    Helper.VectorToParsable ((args [0] as GameObject).transform.rotation *
                        Quaternion.Euler(targetRotation) * Constants.yAxis) + ")";
            }

            // record parameter values
            OnPrepareLog (this, new ParamsEventArgs ("RotSpeed",
                (args [0] as GameObject).GetComponent<Voxeme> ().turnSpeed.ToString()));

            if (angle > 0.0f) {
                OnPrepareLog (this, new ParamsEventArgs ("RotAngle", angle.ToString()));
                OnPrepareLog (this, new ParamsEventArgs ("RotDir", sign.ToString ()));
            }

if (rotAxis != string.Empty) { OnPrepareLog (this, new ParamsEventArgs ("RotAxis", rotAxis)); }

if ((Helper.GetTopPredicate(eventManager.lastParse) == Helper.GetTopPredicate(eventManager.events [0])) || (PredicateParameters.IsSpecificationOf( Helper.GetTopPredicate(eventManager.events [0]), Helper.GetTopPredicate(eventManager.lastParse)))) { OnParamsCalculated (null, null); } } }

return; }

Figure C.1: C# operationalization of [[TURN]] (unabridged)

The segment below the “add to events manager” comment adds the event to the global event manager, which tracks and updates the state-by-state relations between objects in the scene that result from this [[TURN]] action. OnPrepareLog and OnParamsCalculated handle logging the values assigned to the underspecified parameters to a SQL database for experimentation and evaluation. This feature is available in VoxSim for users who wish to capture simulated events and their properties.
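A minimal sketch of such a logging subscriber is given below, assuming a buffered key-value handler and a hypothetical table layout; the class, method, and column names are illustrative assumptions and do not reproduce the logging component actually used in VoxSim.

using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only: a buffer that a handler wired to OnPrepareLog-style
// events could write into, flushed to one (hypothetical) SQL row when an
// OnParamsCalculated-style event fires.
public class ParameterLogBuffer {
    private readonly Dictionary<string, string> pending = new Dictionary<string, string>();

    // Called with each parameter name/value pair as it is assigned
    // (e.g., "RotSpeed", "RotAngle", "RotDir", "RotAxis" in [[TURN]]).
    public void Record(string paramName, string value) {
        pending[paramName] = value;
    }

    // Called once the event's parameter values are complete: emit a single
    // insert statement for an assumed EventParams table and clear the buffer.
    public string Flush(string eventString) {
        string columns = string.Join(", ", pending.Keys);
        string values = string.Join(", ", pending.Values.Select(v => "'" + v + "'"));
        pending.Clear();
        return string.Format("INSERT INTO EventParams (EventString, {0}) VALUES ('{1}', {2});",
            columns, eventString, values);
    }
}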

Appendix D

Sentence Test Set

1. open the book 16. move the apple 31. turn the apple

2. close the book 17. move the banana 32. turn the banana

3. close the cup 18. move the bowl 33. turn the bowl

4. open the cup 19. move the knife 34. turn the knife

5. close the bottle 20. move the pencil 35. turn the pencil

6. open the bottle 21. move the paper sheet 36. turn the paper sheet

7. move the block 22. turn the block 37. lift the block

8. move the ball 23. turn the ball 38. lift the ball

9. move the plate 24. turn the plate 39. lift the plate

10. move the cup 25. turn the cup 40. lift the cup

11. move the disc 26. turn the disc 41. lift the disc

12. move the book 27. turn the book 42. lift the book

13. move the blackboard 28. turn the blackboard 43. lift the blackboard

14. move the bottle 29. turn the bottle 44. lift the bottle

15. move the grape 30. turn the grape 45. lift the grape


46. lift the apple 72. flip the disc at center 98. roll the ball

47. lift the banana 73. flip the book at center 99. roll the plate

48. lift the bowl 74. flip the bottle at center 100. roll the cup

49. lift the knife 75. flip the grape at center 101. roll the disc

50. lift the pencil 76. flip the apple at center 102. roll the book

51. lift the paper sheet 77. flip the banana at center 103. roll the blackboard

52. spin the block 78. flip the bowl at center 104. roll the bottle

53. spin the ball 79. flip the knife at center 105. roll the grape

54. spin the plate 80. flip the pencil at center 106. roll the apple

55. spin the cup 81. flip the paper sheet at center 107. roll the banana

56. spin the disc 82. slide the block 108. roll the bowl

57. spin the book 83. slide the ball 109. roll the knife

58. spin the blackboard 84. slide the plate 110. roll the pencil

59. spin the bottle 85. slide the cup 111. roll the paper sheet

60. spin the grape 86. slide the disc 112. open the banana

61. spin the apple 87. slide the book 113. put the block in the plate

62. spin the banana 88. slide the blackboard 114. put the block in the cup

63. spin the bowl 89. slide the bottle 115. put the block in the disc

64. spin the knife 90. slide the grape 116. put the block in the bottle

65. spin the pencil 91. slide the apple 117. put the block in the bowl

66. spin the paper sheet 92. slide the banana 118. put the ball in the block

67. flip the block on edge 93. slide the bowl 119. put the ball in the plate

68. flip the book on edge 94. slide the knife 120. put the ball in the cup

69. flip the blackboard on edge 95. slide the pencil 121. put the ball in the disc

70. flip the plate at center 96. slide the paper sheet 122. put the ball in the bottle

71. flip the cup at center 97. roll the block 123. put the ball in the bowl


124. put the plate in the cup 150. put the banana in the bowl 174. put the block touching the disc 125. put the plate in the disc 151. put the bowl in the plate 175. put the block touching the 126. put the plate in the bowl 152. put the bowl in the cup book 127. put the cup in the plate 153. put the bowl in the disc 176. put the block touching the blackboard 128. put the cup in the bottle 154. put the knife in the block 155. put the knife in the ball 177. put the block touching the 129. put the cup in the bowl bottle 156. put the knife in the plate 130. put the disc in the cup 178. put the block touching the 157. put the knife in the cup grape 131. put the disc in the bottle 158. put the knife in the book 179. put the block touching the 132. put the disc in the bowl apple 159. put the knife in the black- 133. put the book in the cup board 180. put the block touching the banana 134. put the book in the bottle 160. put the knife in the bottle 181. put the block touching the 135. put the book in the bowl 161. put the knife in the grape bowl 136. put the bottle in the cup 162. put the knife in the apple 182. put the block touching the knife 137. put the bottle in the disc 163. put the knife in the banana 183. put the block touching the 138. put the bottle in the bowl 164. put the knife in the bowl pencil 139. put the grape in the plate 165. put the knife in the pencil 184. put the block touching the 140. put the grape in the cup 166. put the knife in the paper paper sheet sheet 141. put the grape in the book 185. put the ball touching the 167. put the pencil in the cup block 142. put the grape in the bottle 168. put the pencil in the book 186. put the ball touching the 143. put the grape in the bowl plate 169. put the pencil in the bottle 144. put the apple in the plate 187. put the ball touching the 170. put the pencil in the bowl cup 145. put the apple in the cup 171. put the block touching the 188. put the ball touching the 146. put the apple in the bottle ball disc

147. put the apple in the bowl 172. put the block touching the 189. put the ball touching the plate book 148. put the banana in the plate 173. put the block touching the 190. put the ball touching the 149. put the banana in the cup cup blackboard


191. put the ball touching the 207. put the banana touching the 223. put the pencil touching the bottle block ball

192. put the ball touching the 208. put the bowl touching the 224. put the paper sheet touch- grape block ing the ball

193. put the ball touching the ap- 209. put the knife touching the 225. put the plate touching the ple block cup

194. put the ball touching the ba- 210. put the pencil touching the 226. put the disc touching the nana block cup

195. put the ball touching the 211. put the paper sheet touch- 227. put the book touching the bowl ing the block cup

196. put the ball touching the 212. put the plate touching the 228. put the blackboard touch- knife ball ing the cup

197. put the ball touching the 213. put the cup touching the 229. put the bottle touching the pencil ball cup

198. put the ball touching the pa- 214. put the disc touching the 230. put the grape touching the per sheet ball cup

199. put the plate touching the 215. put the book touching the 231. put the apple touching the block ball cup

200. put the cup touching the 216. put the blackboard touch- 232. put the banana touching the block ing the ball cup

201. put the disc touching the 217. put the bottle touching the 233. put the bowl touching the block ball cup

202. put the book touching the 218. put the grape touching the 234. put the knife touching the block ball cup

203. put the blackboard touch- 219. put the apple touching the 235. put the pencil touching the ing the block ball cup

204. put the bottle touching the 220. put the banana touching the 236. put the paper sheet touch- block ball ing the cup

205. put the grape touching the 221. put the bowl touching the 237. put the cup touching the block ball plate

206. put the apple touching the 222. put the knife touching the 238. put the disc touching the block ball plate


239. put the book touching the 255. put the apple touching the 271. put the pencil touching the plate disc book

240. put the blackboard touch- 256. put the banana touching the 272. put the paper sheet touch- ing the plate disc ing the book

241. put the bottle touching the 257. put the bowl touching the 273. put the plate touching the plate disc blackboard

242. put the grape touching the 258. put the knife touching the 274. put the cup touching the plate disc blackboard

243. put the apple touching the 259. put the pencil touching the 275. put the disc touching the plate disc blackboard

244. put the banana touching the 260. put the paper sheet touch- 276. put the book touching the plate ing the disc blackboard

245. put the bowl touching the 261. put the plate touching the 277. put the bottle touching the plate book blackboard

246. put the knife touching the 262. put the cup touching the 278. put the grape touching the plate book blackboard

247. put the pencil touching the 263. put the disc touching the 279. put the apple touching the plate book blackboard

248. put the paper sheet touch- 264. put the blackboard touch- 280. put the banana touching the ing the plate ing the book blackboard

249. put the plate touching the 265. put the bottle touching the 281. put the bowl touching the disc book blackboard

250. put the cup touching the 266. put the grape touching the 282. put the knife touching the disc book blackboard

251. put the book touching the 267. put the apple touching the 283. put the pencil touching the disc book blackboard

252. put the blackboard touch- 268. put the banana touching the 284. put the paper sheet touch- ing the disc book ing the blackboard

253. put the bottle touching the 269. put the bowl touching the 285. put the plate touching the disc book bottle

254. put the grape touching the 270. put the knife touching the 286. put the cup touching the disc book bottle


287. put the disc touching the 303. put the apple touching the 319. put the pencil touching the bottle grape apple

288. put the book touching the 304. put the banana touching the 320. put the paper sheet touch- bottle grape ing the apple

289. put the blackboard touch- 305. put the bowl touching the 321. put the plate touching the ing the bottle grape banana

290. put the grape touching the 306. put the knife touching the 322. put the cup touching the ba- bottle grape nana

291. put the apple touching the 307. put the pencil touching the 323. put the disc touching the bottle grape banana

292. put the banana touching the 308. put the paper sheet touch- 324. put the book touching the bottle ing the grape banana

293. put the bowl touching the 309. put the plate touching the 325. put the blackboard touch- bottle apple ing the banana

294. put the knife touching the 310. put the cup touching the ap- 326. put the bottle touching the bottle ple banana

295. put the pencil touching the 311. put the disc touching the 327. put the grape touching the bottle apple banana

296. put the paper sheet touch- 312. put the book touching the 328. put the apple touching the ing the bottle apple banana

297. put the plate touching the 313. put the blackboard touch- 329. put the bowl touching the grape ing the apple banana

298. put the cup touching the 314. put the bottle touching the 330. put the knife touching the grape apple banana

299. put the disc touching the 315. put the grape touching the 331. put the pencil touching the grape apple banana

300. put the book touching the 316. put the banana touching the 332. put the paper sheet touch- grape apple ing the banana

301. put the blackboard touch- 317. put the bowl touching the 333. put the plate touching the ing the grape apple bowl

302. put the bottle touching the 318. put the knife touching the 334. put the cup touching the grape apple bowl


335. put the disc touching the 351. put the grape touching the 367. put the knife touching the bowl knife pencil

336. put the book touching the 352. put the apple touching the 368. put the paper sheet touch- bowl knife ing the pencil

337. put the blackboard touch- 353. put the banana touching the 369. put the plate touching the ing the bowl knife paper sheet

338. put the bottle touching the 354. put the bowl touching the 370. put the cup touching the pa- bowl knife per sheet

339. put the grape touching the 355. put the pencil touching the 371. put the disc touching the bowl knife paper sheet 372. put the book touching the 340. put the apple touching the 356. put the paper sheet touch- paper sheet bowl ing the knife 373. put the blackboard touch- 341. put the banana touching the 357. put the plate touching the ing the paper sheet bowl pencil 374. put the bottle touching the 342. put the knife touching the 358. put the cup touching the paper sheet bowl pencil 375. put the grape touching the 343. put the pencil touching the 359. put the disc touching the paper sheet bowl pencil 376. put the apple touching the 344. put the paper sheet touch- 360. put the book touching the paper sheet ing the bowl pencil 377. put the banana touching the 345. put the plate touching the 361. put the blackboard touch- paper sheet knife ing the pencil 378. put the bowl touching the 346. put the cup touching the 362. put the bottle touching the paper sheet knife pencil 379. put the knife touching the 347. put the disc touching the 363. put the grape touching the paper sheet knife pencil 380. put the pencil touching the 348. put the book touching the 364. put the apple touching the paper sheet knife pencil 381. put the block on the ball 349. put the blackboard touch- 365. put the banana touching the 382. put the block on the plate ing the knife pencil 383. put the block on the cup 350. put the bottle touching the 366. put the bowl touching the knife pencil 384. put the block on the disc


385. put the block on the book 409. put the plate on the block 434. put the cup on the knife

386. put the block on the black- 410. put the plate on the ball 435. put the cup on the pencil board 411. put the plate on the cup 436. put the cup on the paper 387. put the block on the bottle sheet 412. put the plate on the disc 388. put the block on the grape 437. put the disc on the block 413. put the plate on the book 438. put the disc on the ball 389. put the block on the apple 414. put the plate on the black- 439. put the disc on the plate 390. put the block on the banana board 440. put the disc on the cup 391. put the block on the bowl 415. put the plate on the bottle 441. put the disc on the book 392. put the block on the knife 416. put the plate on the grape 442. put the disc on the black- 393. put the block on the pencil 417. put the plate on the apple board

394. put the block on the paper 418. put the plate on the banana 443. put the disc on the bottle sheet 419. put the plate on the bowl 444. put the disc on the grape

395. put the ball on the block 420. put the plate on the knife 445. put the disc on the apple

396. put the ball on the plate 421. put the plate on the pencil 446. put the disc on the banana

397. put the ball on the cup 422. put the plate on the paper 447. put the disc on the bowl 398. put the ball on the disc sheet 448. put the disc on the knife

399. put the ball on the book 423. put the cup on the block 449. put the disc on the pencil 424. put the cup on the ball 450. put the disc on the paper 400. put the ball on the black- sheet board 425. put the cup on the plate 451. put the book on the block 401. put the ball on the bottle 426. put the cup on the disc 452. put the book on the ball 402. put the ball on the grape 427. put the cup on the book 453. put the book on the plate 403. put the ball on the apple 428. put the cup on the black- 454. put the book on the cup board 404. put the ball on the banana 455. put the book on the disc 429. put the cup on the bottle 405. put the ball on the bowl 456. put the book on the black- 430. put the cup on the grape 406. put the ball on the knife board 431. put the cup on the apple 407. put the ball on the pencil 457. put the book on the bottle 432. put the cup on the banana 458. put the book on the grape 408. put the ball on the paper sheet 433. put the cup on the bowl 459. put the book on the apple


460. put the book on the banana 484. put the grape on the disc 508. put the banana on the block

461. put the book on the bowl 485. put the grape on the book 509. put the banana on the ball

462. put the book on the knife 486. put the grape on the black- 510. put the banana on the plate board 463. put the book on the pencil 511. put the banana on the cup 487. put the grape on the bottle 512. put the banana on the disc 464. put the book on the paper sheet 488. put the grape on the apple 513. put the banana on the book 465. put the blackboard on the 489. put the grape on the banana 514. put the banana on the bottle paper sheet 490. put the grape on the bowl 515. put the banana on the grape 466. put the bottle on the block 491. put the grape on the knife 516. put the banana on the apple 467. put the bottle on the ball 492. put the grape on the pencil 517. put the banana on the bowl 468. put the bottle on the plate 518. put the banana on the knife 493. put the grape on the paper 469. put the bottle on the cup sheet 519. put the banana on the pencil 470. put the bottle on the disc 494. put the apple on the block 520. put the banana on the paper sheet 471. put the bottle on the book 495. put the apple on the ball 521. put the bowl on the block 472. put the bottle on the black- 496. put the apple on the plate board 522. put the bowl on the ball 497. put the apple on the cup 473. put the bottle on the grape 523. put the bowl on the plate 498. put the apple on the disc 524. put the bowl on the cup 474. put the bottle on the apple 499. put the apple on the book 525. put the bowl on the disc 475. put the bottle on the banana 500. put the apple on the black- 526. put the bowl on the book 476. put the bottle on the bowl board 527. put the bowl on the bottle 477. put the bottle on the knife 501. put the apple on the bottle 528. put the bowl on the grape 478. put the bottle on the pencil 502. put the apple on the grape 529. put the bowl on the apple 479. put the bottle on the paper 503. put the apple on the banana 530. put the bowl on the banana sheet 504. put the apple on the bowl 531. put the bowl on the knife 480. put the grape on the block 505. put the apple on the knife 532. put the bowl on the pencil 481. put the grape on the ball 506. put the apple on the pencil 533. put the bowl on the paper 482. put the grape on the plate sheet 507. put the apple on the paper 483. put the grape on the cup sheet 534. put the knife on the block


535. put the knife on the ball 560. put the pencil on the paper 577. put the block near the disc sheet 536. put the knife on the plate 578. put the block near the book 561. put the paper sheet on the 537. put the knife on the cup 579. put the block near the block blackboard 538. put the knife on the disc 562. put the paper sheet on the 580. put the block near the bottle ball 539. put the knife on the book 581. put the block near the grape 563. put the paper sheet on the 540. put the knife on the black- 582. put the block near the apple board plate 583. put the block near the ba- 541. put the knife on the bottle 564. put the paper sheet on the nana cup 542. put the knife on the grape 584. put the block near the bowl 565. put the paper sheet on the 585. put the block near the knife 543. put the knife on the apple disc 586. put the block near the pen- 544. put the knife on the banana 566. put the paper sheet on the cil book 545. put the knife on the bowl 587. put the block near the paper 567. put the paper sheet on the 546. put the knife on the pencil sheet blackboard 588. put the ball near the block 547. put the knife on the paper 568. put the paper sheet on the sheet 589. put the ball near the cup bottle 548. put the pencil on the block 590. put the ball near the disc 569. put the paper sheet on the 549. put the pencil on the ball grape 591. put the ball near the book 592. put the ball near the black- 550. put the pencil on the plate 570. put the paper sheet on the apple board 551. put the pencil on the cup 593. put the ball near the bottle 571. put the paper sheet on the 552. put the pencil on the disc banana 594. put the ball near the grape

553. put the pencil on the book 572. put the paper sheet on the 595. put the ball near the apple bowl 554. put the pencil on the bottle 596. put the ball near the banana 573. put the paper sheet on the 555. put the pencil on the grape 597. put the ball near the bowl knife 598. put the ball near the knife 556. put the pencil on the apple 574. put the paper sheet on the 599. put the ball near the pencil 557. put the pencil on the banana pencil 600. put the ball near the paper 558. put the pencil on the bowl 575. put the block near the ball sheet 559. put the pencil on the knife 576. put the block near the cup 601. put the plate near the block


602. put the plate near the ball 626. put the cup near the paper 649. put the book near the bowl sheet 603. put the plate near the cup 650. put the book near the knife 627. put the disc near the block 604. put the plate near the disc 651. put the book near the pencil 628. put the disc near the ball 605. put the plate near the book 652. put the blackboard near the 629. put the disc near the cup block 606. put the plate near the black- board 630. put the disc near the book 653. put the blackboard near the ball 607. put the plate near the bottle 631. put the disc near the black- board 654. put the blackboard near the 608. put the plate near the grape cup 632. put the disc near the bottle 609. put the plate near the apple 655. put the blackboard near the 633. put the disc near the grape disc 610. put the plate near the ba- nana 634. put the disc near the apple 656. put the blackboard near the book 611. put the plate near the bowl 635. put the disc near the banana 657. put the blackboard near the 612. put the plate near the knife 636. put the disc near the bowl bottle

613. put the plate near the pencil 637. put the disc near the knife 658. put the blackboard near the grape 614. put the cup near the block 638. put the disc near the pencil 659. put the blackboard near the 615. put the cup near the ball 639. put the disc near the paper apple sheet 616. put the cup near the disc 660. put the blackboard near the 640. put the book near the block banana 617. put the cup near the book 641. put the book near the ball 661. put the blackboard near the 618. put the cup near the black- bowl board 642. put the book near the cup 662. put the blackboard near the 619. put the cup near the bottle 643. put the book near the disc knife

620. put the cup near the grape 644. put the book near the black- 663. put the blackboard near the board pencil 621. put the cup near the apple 645. put the book near the bottle 664. put the blackboard near the 622. put the cup near the banana paper sheet 646. put the book near the grape 623. put the cup near the bowl 665. put the bottle near the block 647. put the book near the apple 624. put the cup near the knife 666. put the bottle near the ball 648. put the book near the ba- 625. put the cup near the pencil nana 667. put the bottle near the cup


668. put the bottle near the disc 690. put the grape near the paper 711. put the banana near the sheet grape 669. put the bottle near the book 691. put the apple near the block 712. put the banana near the ap- 670. put the bottle near the ple blackboard 692. put the apple near the ball 713. put the banana near the 671. put the bottle near the grape 693. put the apple near the cup bowl

672. put the bottle near the apple 694. put the apple near the disc 714. put the banana near the knife 673. put the bottle near the ba- 695. put the apple near the book nana 715. put the banana near the pen- 696. put the apple near the cil 674. put the bottle near the bowl blackboard 716. put the banana near the pa- 675. put the bottle near the knife 697. put the apple near the bottle per sheet

676. put the bottle near the pen- 698. put the apple near the grape 717. put the bowl near the block cil 699. put the apple near the ba- 718. put the bowl near the ball 677. put the bottle near the paper nana 719. put the bowl near the cup sheet 700. put the apple near the bowl 720. put the bowl near the disc 678. put the grape near the block 701. put the apple near the knife 721. put the bowl near the book 679. put the grape near the ball 702. put the apple near the pen- 722. put the bowl near the black- board 680. put the grape near the cup cil 723. put the bowl near the bottle 681. put the grape near the disc 703. put the apple near the paper sheet 724. put the bowl near the grape 682. put the grape near the book 704. put the banana near the 725. put the bowl near the apple 683. put the grape near the block 726. put the bowl near the ba- blackboard 705. put the banana near the ball nana 684. put the grape near the bottle 727. put the bowl near the knife 706. put the banana near the cup 685. put the grape near the apple 728. put the bowl near the pencil 707. put the banana near the disc 686. put the grape near the ba- 729. put the bowl near the paper nana 708. put the banana near the sheet book 687. put the grape near the bowl 730. put the knife near the block 709. put the banana near the 688. put the grape near the knife blackboard 731. put the knife near the ball 732. put the knife near the cup 689. put the grape near the pen- 710. put the banana near the bot- cil tle 733. put the knife near the disc


734. put the knife near the book 755. put the paper sheet near the 773. lean the block on the black- block board 735. put the knife near the black- board 756. put the paper sheet near the 774. lean the block on the bottle ball 736. put the knife near the bottle 775. lean the block on the grape 757. put the paper sheet near the 776. lean the block on the apple 737. put the knife near the grape cup 777. lean the block on the ba- 738. put the knife near the apple 758. put the paper sheet near the nana disc 739. put the knife near the ba- 778. lean the block on the bowl nana 759. put the paper sheet near the 740. put the knife near the bowl book 779. lean the block on the knife

741. put the knife near the pencil 760. put the paper sheet near the 780. lean the block on the pencil blackboard 781. lean the block on the paper 742. put the knife near the paper sheet sheet 761. put the paper sheet near the bottle 782. lean the ball on the block 743. put the pencil near the 762. put the paper sheet near the block 783. lean the ball on the plate grape 744. put the pencil near the ball 784. lean the ball on the cup 763. put the paper sheet near the 745. put the pencil near the cup apple 785. lean the ball on the disc

746. put the pencil near the disc 764. put the paper sheet near the 786. lean the ball on the book banana 747. put the pencil near the book 787. lean the ball on the black- 765. put the paper sheet near the board 748. put the pencil near the bowl blackboard 788. lean the ball on the bottle 766. put the paper sheet near the 749. put the pencil near the bot- knife 789. lean the ball on the grape tle 767. put the paper sheet near the 790. lean the ball on the apple 750. put the pencil near the pencil 791. lean the ball on the banana grape 768. lean the block on the ball 792. lean the ball on the bowl 751. put the pencil near the apple 769. lean the block on the plate 793. lean the ball on the knife 752. put the pencil near the ba- nana 770. lean the block on the cup 794. lean the ball on the pencil

753. put the pencil near the bowl 771. lean the block on the disc 795. lean the plate on the block

754. put the pencil near the knife 772. lean the block on the book 796. lean the plate on the ball


797. lean the plate on the cup 822. lean the disc on the ball 847. lean the bottle on the block

798. lean the plate on the disc 823. lean the disc on the plate 848. lean the bottle on the ball

799. lean the plate on the book 824. lean the disc on the cup 849. lean the bottle on the plate 850. lean the bottle on the cup 800. lean the plate on the black- 825. lean the disc on the book board 851. lean the bottle on the disc 826. lean the disc on the black- 801. lean the plate on the bottle board 852. lean the bottle on the book

802. lean the plate on the grape 827. lean the disc on the bottle 853. lean the bottle on the black- board 803. lean the plate on the apple 828. lean the disc on the grape 854. lean the bottle on the grape 804. lean the plate on the banana 829. lean the disc on the apple 855. lean the bottle on the apple 805. lean the plate on the bowl 830. lean the disc on the banana 856. lean the bottle on the ba- 806. lean the plate on the knife 831. lean the disc on the bowl nana 857. lean the bottle on the bowl 807. lean the plate on the pencil 832. lean the disc on the knife 858. lean the bottle on the knife 808. lean the cup on the block 833. lean the disc on the pencil 859. lean the bottle on the pencil 809. lean the cup on the ball 834. lean the book on the block 860. lean the grape on the block 810. lean the cup on the plate 835. lean the book on the ball 861. lean the grape on the ball 811. lean the cup on the disc 836. lean the book on the plate 862. lean the grape on the plate 812. lean the cup on the book 837. lean the book on the cup 863. lean the grape on the cup 813. lean the cup on the black- 838. lean the book on the disc 864. lean the grape on the disc board 839. lean the book on the black- 865. lean the grape on the book 814. lean the cup on the bottle board 866. lean the grape on the black- 815. lean the cup on the grape 840. lean the book on the bottle board

816. lean the cup on the apple 841. lean the book on the grape 867. lean the grape on the bottle 868. lean the grape on the apple 817. lean the cup on the banana 842. lean the book on the apple 869. lean the grape on the ba- 818. lean the cup on the bowl 843. lean the book on the banana nana 819. lean the cup on the knife 844. lean the book on the bowl 870. lean the grape on the bowl 820. lean the cup on the pencil 845. lean the book on the knife 871. lean the grape on the knife 821. lean the disc on the block 846. lean the book on the pencil 872. lean the grape on the pencil


873. lean the apple on the block 895. lean the banana on the ap- 919. lean the knife on the bottle ple 874. lean the apple on the ball 920. lean the knife on the grape 896. lean the banana on the bowl 875. lean the apple on the plate 921. lean the knife on the apple 897. lean the banana on the knife 922. lean the knife on the banana 876. lean the apple on the cup 898. lean the banana on the pen- 923. lean the knife on the bowl 877. lean the apple on the disc cil 924. lean the knife on the pencil 878. lean the apple on the book 899. lean the bowl on the block 925. lean the pencil on the block 900. lean the bowl on the ball 879. lean the apple on the black- 926. lean the pencil on the ball board 901. lean the bowl on the plate 927. lean the pencil on the plate 880. lean the apple on the bottle 902. lean the bowl on the cup 928. lean the pencil on the cup 881. lean the apple on the grape 903. lean the bowl on the disc 929. lean the pencil on the disc 882. lean the apple on the ba- 904. lean the bowl on the book 930. lean the pencil on the book nana 905. lean the bowl on the black- 931. lean the pencil on the black- 883. lean the apple on the bowl board board

884. lean the apple on the knife 906. lean the bowl on the bottle 932. lean the pencil on the bottle 933. lean the pencil on the grape 885. lean the apple on the pencil 907. lean the bowl on the grape 934. lean the pencil on the apple 886. lean the banana on the 908. lean the bowl on the apple block 935. lean the pencil on the ba- 909. lean the bowl on the banana nana 887. lean the banana on the ball 910. lean the bowl on the knife 936. lean the pencil on the bowl 888. lean the banana on the plate 911. lean the bowl on the pencil 937. lean the pencil on the knife 889. lean the banana on the cup 912. lean the knife on the block 938. lean the paper sheet on the 890. lean the banana on the disc block 913. lean the knife on the ball 939. lean the paper sheet on the 891. lean the banana on the book 914. lean the knife on the plate ball 892. lean the banana on the 915. lean the knife on the cup 940. lean the paper sheet on the blackboard plate 916. lean the knife on the disc 893. lean the banana on the bot- 941. lean the paper sheet on the tle 917. lean the knife on the book cup 894. lean the banana on the 918. lean the knife on the black- 942. lean the paper sheet on the grape board disc


943. lean the paper sheet on the 959. lean the block against the 975. lean the plate against the book grape bowl

944. lean the paper sheet on the 960. lean the block against the 976. lean the plate against the blackboard apple knife

945. lean the paper sheet on the 961. lean the block against the 977. lean the plate against the bottle banana pencil

946. lean the paper sheet on the 962. lean the block against the 978. lean the cup against the grape bowl block

947. lean the paper sheet on the 963. lean the block against the 979. lean the cup against the ball apple knife 980. lean the cup against the 948. lean the paper sheet on the 964. lean the block against the plate banana pencil 981. lean the cup against the disc 949. lean the paper sheet on the 965. lean the plate against the 982. lean the cup against the bowl block book

950. lean the paper sheet on the 966. lean the plate against the 983. lean the cup against the knife ball blackboard

951. lean the paper sheet on the 967. lean the plate against the 984. lean the cup against the bot- pencil cup tle

952. lean the block against the 968. lean the plate against the 985. lean the cup against the ball disc grape

953. lean the block against the 969. lean the plate against the 986. lean the cup against the ap- plate book ple

954. lean the block against the 970. lean the plate against the 987. lean the cup against the ba- cup blackboard nana

955. lean the block against the 971. lean the plate against the 988. lean the cup against the disc bottle bowl

956. lean the block against the 972. lean the plate against the 989. lean the cup against the book grape knife

957. lean the block against the 973. lean the plate against the 990. lean the cup against the blackboard apple pencil

958. lean the block against the 974. lean the plate against the 991. lean the disc against the bottle banana block


992. lean the disc against the ball 1009. lean the book against the 1025. lean the bottle against the blackboard banana 993. lean the disc against the plate 1010. lean the book against the 1026. lean the bottle against the bottle bowl 994. lean the disc against the cup 1011. lean the book against the 1027. lean the bottle against the 995. lean the disc against the grape knife book 1012. lean the book against the 1028. lean the bottle against the 996. lean the disc against the apple pencil blackboard 1013. lean the book against the 1029. lean the grape against the 997. lean the disc against the banana block bottle 1014. lean the book against the 1030. lean the grape against the 998. lean the disc against the bowl ball grape 1015. lean the book against the 1031. lean the grape against the 999. lean the disc against the ap- knife plate ple

1000. lean the disc against the ba- 1016. lean the bottle against the 1032. lean the grape against the nana block cup

1001. lean the disc against the 1017. lean the bottle against the 1033. lean the grape against the bowl ball disc

1002. lean the disc against the 1018. lean the bottle against the 1034. lean the grape against the knife plate book

1003. lean the disc against the 1019. lean the bottle against the 1035. lean the grape against the pencil cup blackboard

1004. lean the book against the 1020. lean the bottle against the 1036. lean the grape against the block disc bottle

1005. lean the book against the 1021. lean the bottle against the 1037. lean the grape against the ball book apple

1006. lean the book against the 1022. lean the bottle against the 1038. lean the grape against the plate blackboard banana

1007. lean the book against the 1023. lean the bottle against the 1039. lean the grape against the cup grape bowl

1008. lean the book against the 1024. lean the bottle against the 1040. lean the grape against the disc apple knife


1041. lean the grape against the pencil
1042. lean the apple against the block
1043. lean the apple against the plate
1044. lean the apple against the cup
1045. lean the apple against the disc
1046. lean the apple against the book
1047. lean the apple against the blackboard
1048. lean the apple against the bottle
1049. lean the apple against the banana
1050. lean the apple against the bowl
1051. lean the apple against the knife
1052. lean the apple against the pencil
1053. lean the banana against the block
1054. lean the banana against the ball
1055. lean the banana against the plate
1056. lean the banana against the cup
1057. lean the banana against the disc
1058. lean the banana against the book
1059. lean the banana against the blackboard
1060. lean the banana against the bottle
1061. lean the banana against the grape
1062. lean the banana against the apple
1063. lean the banana against the bowl
1064. lean the banana against the knife
1065. lean the banana against the pencil
1066. lean the bowl against the block
1067. lean the bowl against the ball
1068. lean the bowl against the plate
1069. lean the bowl against the cup
1070. lean the bowl against the disc
1071. lean the bowl against the book
1072. lean the bowl against the blackboard
1073. lean the bowl against the bottle
1074. lean the bowl against the grape
1075. lean the bowl against the apple
1076. lean the bowl against the banana
1077. lean the bowl against the knife
1078. lean the bowl against the pencil
1079. lean the knife against the block
1080. lean the knife against the ball
1081. lean the knife against the plate
1082. lean the knife against the cup
1083. lean the knife against the disc
1084. lean the knife against the book
1085. lean the knife against the blackboard
1086. lean the knife against the bottle
1087. lean the knife against the grape
1088. lean the knife against the apple


1089. lean the knife against the banana
1090. lean the knife against the bowl
1091. lean the knife against the pencil
1092. lean the pencil against the block
1093. lean the pencil against the ball
1094. lean the pencil against the plate
1095. lean the pencil against the cup
1096. lean the pencil against the disc
1097. lean the pencil against the book
1098. lean the pencil against the blackboard
1099. lean the pencil against the bottle
1100. lean the pencil against the grape
1101. lean the pencil against the apple
1102. lean the pencil against the banana
1103. lean the pencil against the bowl
1104. lean the pencil against the knife
1105. lean the paper sheet against the block
1106. lean the paper sheet against the ball
1107. lean the paper sheet against the plate
1108. lean the paper sheet against the cup
1109. lean the paper sheet against the disc
1110. lean the paper sheet against the book
1111. lean the paper sheet against the blackboard
1112. lean the paper sheet against the bottle
1113. lean the paper sheet against the grape
1114. lean the paper sheet against the apple
1115. lean the paper sheet against the banana
1116. lean the paper sheet against the bowl
1117. lean the paper sheet against the knife
1118. lean the paper sheet against the pencil
1119. put the book near the paper sheet
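The "lean" portion of the test set above is combinatorial: each sentence pairs a moved object with a destination object, skipping identical pairs and a handful of physically implausible combinations. The sketch below is only an illustration of that structure, not the script actually used to build the test set; the object vocabulary is taken from the listing, while the exclusion set shown is a partial, assumed reconstruction.

```python
from itertools import product

# Object vocabulary appearing in the listing above.
OBJECTS = ["block", "ball", "plate", "cup", "disc", "book", "blackboard",
           "bottle", "grape", "apple", "banana", "bowl", "knife", "pencil",
           "paper sheet"]

# Illustrative, incomplete exclusion set: pairs the listing above visibly omits.
EXCLUDED = {("apple", "ball"), ("apple", "grape"), ("book", "pencil")}

def lean_sentences():
    """Yield 'lean the X against the Y' for each admissible ordered pair."""
    for moved, target in product(OBJECTS, repeat=2):
        if moved != target and (moved, target) not in EXCLUDED:
            yield f"lean the {moved} against the {target}"

# Print a numbered listing in the style of this appendix.
for i, sentence in enumerate(lean_sentences(), start=1):
    print(f"{i}. {sentence}")
```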

Appendix E

Data Tables

E.1 DNN with Unweighted Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3268      89          97.37%       0.00763   0.00006
2000             3357    3277      80          97.64%       0.00879   0.00008
3000             3357    3285      72          97.88%       0.00541   0.00003
4000             3357    3283      74          97.82%       0.00450   0.00002
5000             3357    3285      72          97.88%       0.00541   0.00003

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3138      219         93.50%       0.01094   0.00012
2000             3357    3184      173         94.87%       0.01329   0.00018
3000             3357    3202      155         95.41%       0.00491   0.00002
4000             3357    3198      159         95.29%       0.00791   0.00006
5000             3357    3193      164         95.14%       0.00662   0.00004

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2748      609         81.88%       0.00937   0.00009
2000             3357    2714      643         80.87%       0.01255   0.00016
3000             3357    2724      633         81.16%       0.01352   0.00018
4000             3357    2739      618         81.61%       0.02167   0.00047
5000             3357    2725      632         81.19%       0.01443   0.00021

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1867      1490        55.63%       0.01290   0.00017
2000             3357    1895      1462        56.46%       0.02063   0.00043
3000             3357    1867      1490        55.63%       0.02208   0.00049
4000             3357    1919      1438        57.18%       0.01810   0.00033
5000             3357    1899      1458        56.58%       0.01643   0.00027

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1638      1719        48.80%       0.08193   0.00671
2000             3357    2143      1214        63.25%       0.02232   0.00050
3000             3357    2243      1114        66.83%       0.02235   0.00050
4000             3357    2177      1180        64.86%       0.02260   0.00051
5000             3357    2145      1212        63.91%       0.03331   0.00111

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    8         3349        0.24%        0.00190   0.00000
2000             3357    14        3343        0.42%        0.00350   0.00001
3000             3357    12        3345        0.36%        0.00237   0.00001
4000             3357    22        3335        0.66%        0.00481   0.00002
5000             3357    17        3340        0.51%        0.00466   0.00002

Table E.1: Accuracy tables for “vanilla” DNN automatic evaluation
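In these tables, µ Accuracy, σ, and σ² are simply the mean, standard deviation, and variance of accuracy over repeated evaluation runs at each training-step setting. The snippet below is a minimal sketch of that computation; the run count and accuracy values are hypothetical placeholders rather than figures from this evaluation, and the choice of population (rather than sample) deviation is an assumption.

```python
from statistics import mean, pstdev

def summarize(accuracies):
    """Return (mean, standard deviation, variance) for a list of per-run accuracies."""
    mu = mean(accuracies)
    sigma = pstdev(accuracies)  # population std dev; stdev() would give the sample estimate
    return mu, sigma, sigma ** 2

# Hypothetical per-run accuracies for one (model, granularity, training-step) cell.
runs = [0.9731, 0.9745, 0.9728, 0.9742, 0.9738]
mu, sigma, var = summarize(runs)
print(f"µ = {mu:.2%}, σ = {sigma:.5f}, σ² = {var:.5f}")
```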

E.2 DNN with Weighted Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3234      123         96.36%       0.00756   0.00006
2000             3357    3275      82          97.58%       0.00701   0.00005
3000             3357    3284      73          97.85%       0.00771   0.00006
4000             3357    3284      73          97.85%       0.00703   0.00005
5000             3357    3268      89          97.67%       0.01541   0.00024

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3136      221         93.44%       0.01313   0.00017
2000             3357    3190      167         95.05%       0.00987   0.00010
3000             3357    3193      154         95.44%       0.01375   0.00019
4000             3357    3204      153         95.50%       0.00974   0.00009
5000             3357    3204      153         95.47%       0.01079   0.00012

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2694      663         80.27%       0.01754   0.00031
2000             3357    2718      639         80.98%       0.01579   0.00025
3000             3357    2707      650         80.66%       0.02251   0.00051
4000             3357    2711      646         80.78%       0.01977   0.00039
5000             3357    2733      624         81.34%       0.01664   0.00028

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1851      1506        55.15%       0.01230   0.00015
2000             3357    1869      1488        55.69%       0.01621   0.00026
3000             3357    1895      1462        56.46%       0.01279   0.00016
4000             3357    1874      1483        55.84%       0.01694   0.00029
5000             3357    1903      1454        56.70%       0.01693   0.00029

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1521      1836        45.32%       0.05113   0.00261
2000             3357    2018      1339        60.13%       0.03917   0.00153
3000             3357    2120      1237        63.17%       0.03076   0.00095
4000             3357    2091      1266        62.30%       0.03146   0.00099
5000             3357    2107      1250        62.78%       0.02653   0.00070

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    5         3352        0.15%        0.00158   0.00000
2000             3357    8         3349        0.24%        0.00237   0.00001
3000             3357    8         3349        0.24%        0.00190   0.00000
4000             3357    7         3350        0.21%        0.00202   0.00000
5000             3357    20        3337        0.60%        0.00543   0.00003

Table E.2: Accuracy tables for DNN automatic evaluation with weighted features

E.3 DNN with Weighted Discrete Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3239      118         96.51%       0.01140   0.00013
2000             3357    3286      71          97.91%       0.00589   0.00003
3000             3357    3285      72          97.88%       0.00577   0.00003
4000             3357    3283      74          97.82%       0.00530   0.00003
5000             3357    3289      68          98.00%       0.00393   0.00002

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3113      244         92.75%       0.00923   0.00009
2000             3357    3189      168         95.02%       0.00675   0.00005
3000             3357    3194      163         95.17%       0.00658   0.00004
4000             3357    3195      162         95.20%       0.01088   0.00012
5000             3357    3205      152         95.50%       0.00610   0.00004

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2695      662         80.33%       0.01608   0.00026
2000             3357    2719      638         81.01%       0.01594   0.00025
3000             3357    2725      632         81.19%       0.01870   0.00035
4000             3357    2696      661         80.33%       0.01870   0.00035
5000             3357    2717      640         80.95%       0.01553   0.00024

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1852      1505        55.18%       0.01666   0.00028
2000             3357    1881      1476        56.04%       0.02072   0.00043
3000             3357    1900      1457        56.61%       0.00937   0.00009
4000             3357    1886      1471        56.19%       0.00688   0.00005
5000             3357    1895      1462        56.46%       0.01429   0.00020

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1816      1541        54.11%       0.07782   0.00606
2000             3357    2192      1165        65.31%       0.02551   0.00065
3000             3357    2190      1167        65.25%       0.02977   0.00089
4000             3357    2182      1175        65.01%       0.01966   0.00039
5000             3357    2137      1220        63.67%       0.03622   0.00131

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    6         3351        0.18%        0.00155   0.00000
2000             3357    15        3342        0.45%        0.00253   0.00001
3000             3357    18        3339        0.54%        0.00391   0.00002
4000             3357    15        3342        0.45%        0.00321   0.00001
5000             3357    20        3337        0.60%        0.00343   0.00001

Table E.3: Accuracy tables for DNN automatic evaluation with weighted discrete features

E.4 DNN with Feature Weights Only

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3298      59          98.27%       0.00552   0.00003
2000             3357    3321      36          98.95%       0.00401   0.00002
3000             3357    3324      33          99.04%       0.00372   0.00001
4000             3357    3323      34          99.01%       0.00319   0.00001
5000             3357    3321      36          98.95%       0.00375   0.00001

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3226      131         96.12%       0.00664   0.00004
2000             3357    3253      104         96.93%       0.00562   0.00003
3000             3357    3257      100         97.05%       0.00503   0.00003
4000             3357    3259      98          97.10%       0.00437   0.00002
5000             3357    3258      99          97.07%       0.00494   0.00002

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2784      573         82.95%       0.00957   0.00009
2000             3357    2797      560         83.34%       0.00672   0.00005
3000             3357    2800      557         83.43%       0.00896   0.00008
4000             3357    2790      567         83.13%       0.00943   0.00009
5000             3357    2794      563         83.25%       0.00859   0.00007

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2538      819         75.62%       0.01129   0.00013
2000             3357    2546      811         75.86%       0.00878   0.00008
3000             3357    2570      787         76.57%       0.00767   0.00006
4000             3357    2568      789         76.51%       0.00905   0.00008
5000             3357    2567      790         76.48%       0.00899   0.00008

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    381       2976        11.35%       0.05644   0.00319
2000             3357    2255      1102        67.19%       0.04904   0.00240
3000             3357    2342      1015        69.78%       0.02527   0.00064
4000             3357    2340      1017        69.72%       0.02343   0.00055
5000             3357    2320      1037        69.12%       0.01799   0.00032

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1         3356        0.03%        0.00095   0.00000
2000             3357    6         3351        0.18%        0.00253   0.00001
3000             3357    15        3342        0.45%        0.00470   0.00002
4000             3357    21        3336        0.63%        0.00356   0.00001
5000             3357    21        3336        0.63%        0.00453   0.00002

Table E.4: Accuracy tables for DNN automatic evaluation with feature weights alone

E.5 Combined Linear-DNN with Unweighted Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3207      150         95.56%       0.00841   0.00007
2000             3357    3218      139         95.88%       0.00813   0.00007
3000             3357    3221      136         95.97%       0.00733   0.00005
4000             3357    3227      130         96.15%       0.00703   0.00005
5000             3357    3227      130         96.15%       0.00643   0.00004

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3049      308         90.85%       0.00684   0.00005
2000             3357    3061      296         91.20%       0.00538   0.00003
3000             3357    3066      291         91.35%       0.00499   0.00002
4000             3357    3069      288         91.44%       0.00543   0.00003
5000             3357    3071      286         91.50%       0.00513   0.00003

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2691      666         80.18%       0.00700   0.00005
2000             3357    2697      660         80.36%       0.00914   0.00008
3000             3357    2704      653         80.57%       0.00914   0.00008
4000             3357    2716      641         80.92%       0.00934   0.00009
5000             3357    2725      632         81.19%       0.00705   0.00005

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1796      1561        53.51%       0.00486   0.00002
2000             3357    1804      1553        53.75%       0.00483   0.00002
3000             3357    1827      1530        54.43%       0.00574   0.00003
4000             3357    1841      1516        54.85%       0.00783   0.00006
5000             3357    1848      1509        55.06%       0.00720   0.00005

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1779      1578        53.00%       0.04476   0.00200
2000             3357    1886      1471        56.19%       0.04404   0.00194
3000             3357    1922      1435        57.27%       0.04078   0.00166
4000             3357    1935      1422        57.65%       0.04352   0.00189
5000             3357    1950      1407        58.10%       0.04353   0.00189

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    12        3345        0.36%        0.00190   0.00000
2000             3357    14        3343        0.42%        0.00210   0.00000
3000             3357    14        3343        0.42%        0.00210   0.00000
4000             3357    15        3342        0.47%        0.00253   0.00001
5000             3357    14        3343        0.42%        0.00251   0.00001

Table E.5: Accuracy tables for linear-DNN automatic evaluation

E.6 Combined Linear-DNN with Weighted Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3219      138         95.91%       0.00683   0.00005
2000             3357    3223      134         96.03%       0.00628   0.00004
3000             3357    3226      131         96.12%       0.00771   0.00006
4000             3357    3224      133         96.06%       0.00593   0.00004
5000             3357    3222      135         96.00%       0.00674   0.00005

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3050      307         90.88%       0.00693   0.00005
2000             3357    3063      294         91.26%       0.00914   0.00008
3000             3357    3071      286         91.50%       0.00751   0.00006
4000             3357    3074      283         91.59%       0.00688   0.00005
5000             3357    3069      288         91.44%       0.00838   0.00007

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2707      650         80.66%       0.00901   0.00008
2000             3357    2716      641         80.92%       0.00844   0.00007
3000             3357    2724      633         81.16%       0.00897   0.00008
4000             3357    2733      624         81.43%       0.00919   0.00008
5000             3357    2734      623         81.46%       0.01028   0.00011

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1812      1545        53.99%       0.00753   0.00006
2000             3357    1832      1525        54.58%       0.01039   0.00011
3000             3357    1837      1520        54.73%       0.01181   0.00014
4000             3357    1839      1518        54.79%       0.01024   0.00010
5000             3357    1830      1527        54.52%       0.00980   0.00010

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1680      1677        50.05%       0.04326   0.00187
2000             3357    1781      1576        53.06%       0.03441   0.00118
3000             3357    1853      1504        55.21%       0.03388   0.00115
4000             3357    1889      1468        56.28%       0.03365   0.00113
5000             3357    1920      1437        57.21%       0.03123   0.00098

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    10        3347        0.30%        0.00245   0.00001
2000             3357    12        3345        0.36%        0.00126   0.00000
3000             3357    11        3346        0.33%        0.00170   0.00000
4000             3357    13        3344        0.39%        0.00202   0.00000
5000             3357    15        3342        0.45%        0.00253   0.00001

Table E.6: Accuracy tables for linear-DNN automatic evaluation with weighted features

E.7 Combined Linear-DNN with Weighted Discrete Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3207      150         95.56%       0.00841   0.00007
2000             3357    3216      141         95.88%       0.00869   0.00008
3000             3357    3221      136         95.97%       0.00733   0.00005
4000             3357    3227      130         96.15%       0.00703   0.00005
5000             3357    3227      130         96.15%       0.00643   0.00004

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3049      308         90.85%       0.00684   0.00005
2000             3357    3061      296         91.20%       0.00538   0.00003
3000             3357    3066      291         91.35%       0.00499   0.00002
4000             3357    3069      288         91.44%       0.00543   0.00003
5000             3357    3071      286         91.50%       0.00513   0.00003

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2691      666         80.18%       0.00700   0.00005
2000             3357    2697      660         80.36%       0.00914   0.00008
3000             3357    2704      653         80.57%       0.00914   0.00008
4000             3357    2716      641         80.92%       0.00934   0.00009
5000             3357    2725      632         81.19%       0.00705   0.00005

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1796      1561        53.51%       0.00486   0.00002
2000             3357    1804      1553        53.75%       0.00483   0.00002
3000             3357    1827      1530        54.43%       0.00574   0.00003
4000             3357    1841      1516        54.85%       0.00783   0.00006
5000             3357    1848      1509        55.06%       0.00720   0.00005

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1779      1578        53.00%       0.04476   0.00200
2000             3357    1886      1471        56.19%       0.04404   0.00194
3000             3357    1922      1435        57.27%       0.04078   0.00166
4000             3357    1935      1422        57.65%       0.04352   0.00189
5000             3357    1950      1407        58.10%       0.04353   0.00189

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    12        3345        0.36%        0.00190   0.00000
2000             3357    14        3343        0.42%        0.00210   0.00000
3000             3357    14        3343        0.42%        0.00210   0.00000
4000             3357    15        3342        0.45%        0.00253   0.00001
5000             3357    14        3343        0.42%        0.00251   0.00001

Table E.7: Accuracy tables for linear-DNN automatic evaluation with weighted discrete features

E.8 Combined Linear-DNN with Feature Weights Only

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3237      120         96.45%       0.00919   0.00008
2000             3357    3259      98          97.10%       0.00825   0.00007
3000             3357    3272      85          97.49%       0.00599   0.00004
4000             3357    3313      44          98.71%       0.00469   0.00002
5000             3357    3313      44          98.71%       0.00321   0.00001

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3103      254         92.46%       0.01059   0.00011
2000             3357    3171      186         94.48%       0.00626   0.00004
3000             3357    3207      150         95.56%       0.00576   0.00003
4000             3357    3244      113         96.66%       0.00624   0.00004
5000             3357    3247      110         96.75%       0.00522   0.00003

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2720      637         81.04%       0.00827   0.00007
2000             3357    2781      576         82.86%       0.00854   0.00007
3000             3357    2796      561         83.31%       0.00857   0.00007
4000             3357    2792      565         83.19%       0.00864   0.00007
5000             3357    2798      559         83.37%       0.00799   0.00006

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2409      948         71.78%       0.01145   0.00013
2000             3357    2543      814         75.77%       0.00647   0.00004
3000             3357    2550      807         75.98%       0.00771   0.00006
4000             3357    2552      805         76.04%       0.00735   0.00005
5000             3357    2559      798         76.25%       0.00736   0.00005

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    755       2602        22.49%       0.01966   0.00039
2000             3357    670       2687        20.23%       0.05766   0.00332
3000             3357    1008      2349        30.03%       0.10858   0.01179
4000             3357    1520      1837        45.29%       0.14882   0.02215
5000             3357    1952      1405        57.89%       0.12705   0.01614

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    6         3351        0.18%        0.00287   0.00001
2000             3357    6         3351        0.18%        0.00287   0.00001
3000             3357    6         3351        0.18%        0.00320   0.00001
4000             3357    9         3348        0.27%        0.00296   0.00001
5000             3357    10        3347        0.30%        0.00280   0.00001

Table E.8: Accuracy tables for linear-DNN automatic evaluation with feature weights alone

Appendix F

Publication History

• August, 2014 — *SEM workshop, COLING 2014, Dublin, Ireland (Pustejovsky and Krishnaswamy, 2014)

• May, 2016 — LREC 2016, Portorož, Slovenia (Pustejovsky and Krishnaswamy, 2016a)

• May, 2016 — ISA workshop, LREC 2016, Portorož, Slovenia (Do et al., 2016)

• August, 2016 — Spatial Cognition 2016, Philadelphia, PA, USA (Krishnaswamy and Pustejovsky, 2016a) (short paper)

• September, 2016 — CogSci 2016, Philadelphia, PA, USA (Pustejovsky and Krishnaswamy, 2016b)

• December, 2016 — GramLex workshop, COLING 2016, Osaka, Japan (Pustejovsky et al., 2016)

• December, 2016 — COLING 2016, Osaka, Japan (Krishnaswamy and Pustejovsky, 2016b)

• March, 2017 — AAAI Spring Symposium: Interactive Multisensory Object Perception for Embodied Agents, Stanford, CA, USA (Pustejovsky et al., 2017)

• Forthcoming, 2017 — Spatial Cognition X, Springer LNAI series (Krishnaswamy and Pustejovsky, 2016a) (extended paper)

Bibliography

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Savannah, Georgia, USA, 2016.

Julia Albath, Jennifer L. Leopold, Chaman L. Sabharwal, and Anne M. Maglia. RCC-3D: Qualitative spatial reasoning in 3D. In CAINE, pages 74–79, 2010.

James F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843, 1983.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042, 2016.

Carl Bache. Verbal aspect: a general theory and its application to present-day English. Syddansk Universitetsforlag, 1985.

Benjamin K. Bergen. Louder than words: The new science of how the mind makes meaning. Basic Books, 2012.

Mehul Bhatt and Seng Loke. Modelling dynamic spatial systems in the situation calculus. Spatial Cognition and Computation, 2008.

Rama Bindiganavale and Norman I. Badler. Motion abstraction and mapping with spatial constraints. In Modelling and Motion Capture Techniques for Virtual Environments, pages 70–82. Springer, 1998.


Patrick Blackburn and Johan Bos. Computational semantics. THEORIA. An International Journal for Theory, History and Foundations of Science, 18(1), 2008.

Rens Bod. Beyond Grammar: An Experience-Based Theory of Language. CSLI Lecture Notes, 88, 1998.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.

Michael Buhrmester, Tracy Kwang, and Samuel D. Gosling. Amazon’s Mechanical Turk: a new source of inexpensive, yet high-quality, data? Perspectives on psychological science, 6(1):3–5, 2011.

Rudolf Carnap. Meaning and necessity: a study in semantics and modal logic. University of Chicago Press, 1947.

Gavin C. Cawley and Nicola L. C. Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079–2107, 2010.

Angel Chang, Will Monroe, Manolis Savva, Christopher Potts, and Christopher D. Manning. Text to 3D scene generation with rich lexical grounding. arXiv preprint arXiv:1505.06289, 2015.

Chen Chung Chang and H. Jerome Keisler. Model theory, volume 73. Elsevier, 1973.

Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1017–1025, 2015.

Jinho D. Choi and Andrew McCallum. Transition-based dependency parsing with selectional branching. In ACL (1), pages 1052–1062, 2013.

Bernard Comrie. Aspect: An introduction to the study of verbal aspect and related problems, volume 2. Cambridge University Press, 1976.

Bob Coyne and Richard Sproat. WordsEye: an automatic text-to-scene conversion system. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 487–496. ACM, 2001.


Ernest Davis and Gary Marcus. The scope and limits of simulation in automated reasoning. Artificial Intelligence, 233:60–72, 2016.

Sebastian Deterding, Miguel Sicart, Lennart Nacke, Kenton O'Hara, and Dan Dixon. Gamification: using game-design elements in non-gaming contexts. In CHI'11 extended abstracts on human factors in computing systems, pages 2425–2428. ACM, 2011.

Kevin Dill. A game AI approach to autonomous control of virtual characters. In Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), 2011.

Tuan Do. Event-driven movie annotation using MPII movie dataset. 2016.

Tuan Do, Nikhil Krishnaswamy, and James Pustejovsky. ECAT: Event capture annotation tool. Proceedings of ISA-12: International Workshop on Semantic Annotation, 2016.

Simon Dobnik and Robin Cooper. Spatial descriptions in type theory with records. In Proceedings of IWCS 2013 Workshop on Computational Models of Spatial Language Interpretation and Generation (CoSLI-3). Citeseer, 2013.

Simon Dobnik, Robin Cooper, and Staffan Larsson. Modelling language, action, and perception in type theory with records. In Constraint Solving and Language Processing, pages 70–91. Springer, 2013.

Jacques Durand. On the scope of linguistics: data, intuitions, corpora. Corpus analysis and variation in linguistics, pages 25–52, 2009.

Boi Faltings and Peter Struss. Recent advances in qualitative physics. MIT Press, 1992.

Jerome Feldman. From molecule to metaphor: A neural theory of language. MIT press, 2006.

Jerome Feldman and Srinivas Narayanan. Embodied meaning in a neural theory of language. Brain and language, 89(2):385–392, 2004.

George Ferguson, James F. Allen, et al. TRIPS: An integrated intelligent problem-solving assistant. In AAAI/IAAI, pages 567–572, 1998.

Kenneth D. Forbus. Qualitative physics: Past, present and future. Exploring artificial intelligence, pages 239–296, 1988.


Kenneth D. Forbus, James V. Mahoney, and Kevin Dill. How qualitative spatial reasoning can improve strategy game AIs. IEEE Intelligent Systems, 17(4):25–30, 2002.

Gottlob Frege. On sense and reference. 1892. Reprinted in Basic Topics in the Philosophy of Language, pages 142–160. Prentice-Hall, Englewood Cliffs, NJ, 1994.

Michael Gelfond and Vladimir Lifschitz. The stable model semantics for logic programming. In ICLP/SLP, volume 88, pages 1070–1080, 1988.

Mark Giambruno. 3D graphics and animation. New Riders Publishing, 2002.

James J. Gibson. The theory of affordances. Perceiving, Acting, and Knowing: Toward an ecological psychology, pages 67–82, 1977.

James J. Gibson. The Ecological Approach to Visual Perception: Classic Edition. Psychology Press, 1979.

Peter Michael Goebel and Markus Vincze. A cognitive modeling approach for the semantic aggregation of object prototypes from geometric primitives: toward understanding implicit object topology. In Advanced Concepts for Intelligent Vision Systems, pages 84–96. Springer, 2007.

Will Goldstone. Unity Game Development Essentials. Packt Publishing Ltd, 2009.

Branko Grünbaum. Are your polyhedra the same as my polyhedra? In Discrete and Computational Geometry, pages 461–488. Springer, 2003.

Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.

Zellig S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.

Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD workshop on human computation, pages 64–67. ACM, 2010.

Ray Jackendoff. Semantics and Cognition. MIT Press, 1983.

Richard Johansson, Anders Berglund, Magnus Danielsson, and Pierre Nugues. Automatic text-to-scene conversion in the traffic accident domain. In IJCAI, volume 5, pages 1073–1078, 2005.


Mark Johnson. The body in the mind: The bodily basis of meaning, imagination, and reason. University of Chicago Press, 1987.

Leo Joskowicz and Elisha P. Sacks. Computational kinematics. Artificial Intelligence, 51(1-3): 381–416, 1991.

Gitit Kehat and James Pustejovsky. Annotation methodologies for vision and language dataset creation. IEEE CVPR Scene Understanding Workshop (SUNw), Las Vegas, 2016.

H. Jerome Keisler. Model theory for infinitary logic. 1971.

Christopher Kennedy and Louise McNally. From event structure to scale structure: Degree modification in deverbal adjectives. In Semantics and linguistic theory, volume 9, pages 163–180, 1999.

Kara Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. Extensive classifications of English verbs. In Proceedings of the 12th EURALEX International Congress, Turin, Italy, 2006.

Saul A. Kripke. Semantical analysis of intuitionistic logic I. Studies in Logic and the Foundations of Mathematics, 40:92–130, 1965.

Nikhil Krishnaswamy and James Pustejovsky. Multimodal semantic simulations of linguistically underspecified motion events. In Spatial Cognition X: International Conference on Spatial Cognition. Springer, 2016a.

Nikhil Krishnaswamy and James Pustejovsky. VoxSim: A visual platform for modeling motion language. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. ACL, 2016b.

Benjamin Kuipers. Qualitative reasoning: modeling and simulation with incomplete knowledge. MIT press, 1994.

Yohei Kurata and Max Egenhofer. The 9+ intersection for topological relations between a directed line segment and a region. In B. Gottfried, editor, Workshop on Behaviour and Monitoring Interpretation, pages 62–76, Germany, September 2007.

George Lakoff. Women, fire, and dangerous things: What categories reveal about the mind. University of Chicago Press, 1987.


George Lakoff. The neural theory of metaphor. Available at SSRN 1437794, 2009.

Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, 1993.

Edward Loper and Steven Bird. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics-Volume 1, pages 63–70. Association for Computational Linguistics, 2002.

Duncan Luce, David Krantz, Patrick Suppes, and Amos Tversky. Foundations of measurement, Vol. III: Representation, axiomatization, and invariance. 1990.

Minhua Ma and Paul McKevitt. Virtual human animation in natural language visualisation. Artificial Intelligence Review, 25(1-2):37–53, 2006.

Nadia Magnenat-Thalmann, Richard Laperrière, and Daniel Thalmann. Joint-dependent local deformations for hand animation and object grasping. In Proceedings of Graphics Interface, 1988.

David Mark and Max Egenhofer. Topology of prototypical spatial relations between lines and regions in English and Spanish. In Proceedings of the Twelfth International Symposium on Computer-Assisted Cartography, volume 4, pages 245–254, 1995.

David McDonald and James Pustejovsky. On the representation of inferences and their lexicalization. In Advances in Cognitive Systems, volume 3, 2014.

Matthew D. McLure, Scott E. Friedman, and Kenneth D. Forbus. Extending analogical generalization with near-misses. In AAAI, pages 565–571, 2015.

Srinivas Sankara Narayanan. KARMA: Knowledge-based active representations for metaphor and aspect. University of California, Berkeley, 1997.

Ralf Naumann. A dynamic approach to aspect: Verbs as programs. University of Düsseldorf, submitted to Journal of Semantics, 1999.

Nick Pelling. The (short) prehistory of gamification. Funding Startups (& other impossibilities), 2011.


James F. Peters. Near sets. Special theory about nearness of objects. Fundamenta Informaticae, 75(1-4):407–433, 2007.

James Pustejovsky. The Generative Lexicon. MIT Press, Cambridge, MA, 1995.

James Pustejovsky. Dynamic event structure and habitat theory. In Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013), pages 1–10. ACL, 2013.

James Pustejovsky and Nikhil Krishnaswamy. Generating simulations of motion events from verbal descriptions. Lexical and Computational Semantics (*SEM 2014), page 99, 2014.

James Pustejovsky and Nikhil Krishnaswamy. VoxML: A visualization modeling language. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May 2016a. European Language Resources Association (ELRA). ISBN 978-2-9517408-9-1.

James Pustejovsky and Nikhil Krishnaswamy. Visualizing events: Simulating meaning in language. In Proceedings of CogSci, 2016b.

James Pustejovsky and Jessica Moszkowicz. The qualitative spatial dynamics of motion. The Journal of Spatial Cognition and Computation, 2011.

James Pustejovsky, Nikhil Krishnaswamy, Tuan Do, and Gitit Kehat. The development of multimodal lexical resources. GramLex 2016, page 41, 2016.

James Pustejovsky, Nikhil Krishnaswamy, and Tuan Do. Object embodiment in a multimodal simulation. AAAI Spring Symposium: Interactive Multisensory Object Perception for Embodied Agents, 2017.

David Randell, Zhan Cui, and Anthony Cohn. A spatial logic based on regions and connections. In Proceedings of the 3rd International Conference on Knowledge Representation and Reasoning, pages 165–176. Morgan Kaufmann, San Mateo, 1992.

Matteo Ruggero Ronchi and Pietro Perona. Describing common human visual actions in images. arXiv preprint arXiv:1506.02203, 2015.


Eleanor Rosch. Natural categories. Cognitive psychology, 4(3):328–350, 1973.

Eleanor Rosch. Prototype classification and logical classification: The two systems. New trends in conceptual representation: Challenges to Piaget’s theory, pages 73–86, 1983.

Anna Rumshisky, Nick Botchan, Sophie Kushkuley, and James Pustejovsky. Word sense inventories by non-experts. In LREC, pages 4055–4059, 2012.

Radu Bogdan Rusu, Zoltan Csaba Marton, Nico Blodow, Mihai Dolha, and Michael Beetz. Towards 3D point cloud based object maps for household environments. Robotics and Autonomous Systems, 56(11):927–941, 2008.

Shlomo S. Sawilowsky. You think you’ve got trivials? Journal of Modern Applied Statistical Methods, 2(1):21, 2003.

Claude Elwood Shannon. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55, 1948.

Valerie J. Shute, Matthew Ventura, and Yoon Jeon Kim. Assessment and learning of qualitative physics in Newton’s Playground. The Journal of Educational Research, 106(6):423–430, 2013.

Jeffrey Mark Siskind. Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. J. Artif. Intell. Res.(JAIR), 15:31–90, 2001.

Scott Soames. Rethinking language, mind, and meaning. Princeton University Press, 2015.

Stanley Smith Stevens. On the theory of scales of measurement, 1946.

Leonard Talmy. Lexicalization patterns: semantic structure in lexical forms. In T. Shopen, editor, Language typology and syntactic description, Volume 3, pages 36–149. Cambridge University Press, 1985.

Leonard Talmy. Towards a cognitive semantics. MIT Press, 2000.

Alfred Tarski. On the concept of logical consequence. Logic, semantics, metamathematics, 2: 1–11, 1936.


Sebastian Thrun, Michael Beetz, Maren Bennewitz, Wolfram Burgard, Armin B. Cremers, Frank Dellaert, Dieter Fox, Dirk Haehnel, Chuck Rosenberg, Nicholas Roy, et al. Probabilistic algorithms and the interactive museum tour-guide robot Minerva. The International Journal of Robotics Research, 19(11):972–999, 2000.

Johan van Benthem and Jan Bergstra. Logic of transition systems. Journal of Logic, Language and Information, 3(4):247–283, 1994.

Johan van Benthem, Jan Eijck, and Alla Frolova. Changing preferences. Centrum voor Wiskunde en Informatica, 1993.

Johan van Benthem, Jan van Eijck, and Vera Stebletsova. Modal logic, transition systems and processes. Journal of Logic and Computation, 4(5):811–855, 1994.

Johannes Franciscus Abraham Karel van Benthem. Logic and the flow of information. 1991.

Zeno Vendler. Verbs and times. The philosophical review, pages 143–160, 1957.

Luis Von Ahn and Laura Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 319–326. ACM, 2004.

Terry Winograd. Procedures as a representation for data in a computer program for understanding natural language. Technical report, DTIC Document, 1971.
