Computational Perception of Physical Object Properties

by Jiajun Wu

B.Eng., B.Ec., Tsinghua University (2014)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2016

@ Massachusetts Institute of Technology 2016. All rights reserved.

Author: Signature redacted
Department of Electrical Engineering and Computer Science
January 29, 2016

Certified by: Signature redacted
William T. Freeman
Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science
Thesis Supervisor

Certified by: Signature redacted
Joshua B. Tenenbaum
Professor of Computational Cognitive Science
Thesis Supervisor

Accepted by: Signature redacted
Leslie A. Kolodziejski
Chair, Department Committee on Graduate Students

Computational Perception of Physical Object Properties

by Jiajun Wu

Submitted to the Department of Electrical Engineering and Computer Science on January 29, 2016, in partial fulfillment of the requirements for the degree of Master of Science

Abstract

We study the problem of learning physical object properties from visual data. Inspired by findings in cognitive science that even infants are able to perceive a physical world full of dynamic content at an early age, we aim to build models to characterize object properties from synthetic and real-world scenes. We build a novel dataset containing over 17,000 videos with 101 objects in a set of visually simple but physically rich scenarios. We further propose two novel models for learning physical object properties by incorporating physics simulators, either a symbolic interpreter or a mature physics engine, with deep neural nets. Our extensive evaluations demonstrate that these models can learn physical object properties well and that, with a physics engine, the responses of the model positively correlate with human responses. Future research directions include incorporating the knowledge of physical object properties into the understanding of interactions among objects, scenes, and agents.

Thesis Supervisor: William T. Freeman Title: Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science

Thesis Supervisor: Joshua B. Tenenbaum Title: Professor of Computational Cognitive Science

Acknowledgments

I would like to express sincere gratitude to my advisors, Professor William Freeman

and Professor Joshua Tenenbaum. Bill and Josh are always inspiring and encouraging, and have led me through my research with profound insights. Not only have they

taught me how to aim for top-quality research, but they have also shared with me invaluable lessons about life.

I deeply appreciate the guidance and support from my undergraduate advisor, Professor Zhuowen Tu, who introduced me to the world of AI and vision and has been my mentor and friend ever since. I also thank Professor Andrew Chi-

Chih Yao and Professor Jian Li for advising me during my undergraduate study, Dr.

Yuandong Tian for mentoring me at Facebook AI Research, and Dr. Kai Yu and Dr. Yinan Yu for mentoring me at Baidu Research.

The thesis would not have been possible without the inspiration and support from my colleagues in the MIT Vision Group and Computational Cognitive Science (CoCoSci) Group. I would like to express my appreciation to my collaborators, Dr. Joseph Lim, Tianfan Xue, and Dr. Ilker Yildirim. I am also thankful to other encouraging and helpful group members, especially Andrew Owens, Donglai Wei, Dr. Tomer Ullman, Katie Bouman, Kelsey Allen, Tejas Kulkarni, Dr. Dilip Krishnan, Dr. Hossein

Mohabi, Dr. Tali Dekel, Dr. Daniel Zoran, Pedro Tsividis, and Hongyi Zhang.

I would like to extend my appreciation to my dear friends for their backing in my academic and daily life.

I received the Edwin S. Webster Fellowship during my first year, and have been partially funded by NSF-6926677 (Reconstructive Recognition). I appreciate the support from all funding agencies.

Finally, I thank my parents, for their lifelong encouragement and love.

Contents

1 Introduction

2 Modeling the Physical World
  2.1 Scenarios
  2.2 The Physics 101 Dataset

3 Physical Object Model: Learning with a Symbolic Interpreter
  3.1 Visual Property Discoverer
  3.2 Physics Interpreter
  3.3 Physical World Simulator
  3.4 Experiments
    3.4.1 Learning Physical Properties
    3.4.2 Detecting Objects with Unusual Properties
    3.4.3 Predicting Outcomes

4 Physical Object Model: Incorporating a Physics Engine
  4.1 The Galileo Model
    4.1.1 Tracking as Recognition
    4.1.2 Inference
  4.2 Simulations
  4.3 Bootstrapping as Efficient Perception in Static Scenes
  4.4 Experiments
    4.4.1 Outcome Prediction
    4.4.2 Mass Prediction
    4.4.3 "Will it move" Prediction

5 Beyond Understanding Physics

6 Conclusion

List of Figures

2-1 Abstraction of the physical world, and a snapshot of our dataset.

2-2 Illustrations of the scenarios in our Physics 101 dataset.

2-3 Physics 101: the set of objects we used in our experiments. We vary object material, color, shape, and size, together with external conditions such as the slope of a surface or the stiffness of a spring. Videos recording the motions of these objects interacting with target objects will be used to train our algorithm.

3-1 Our first model exploits the advancement of machine learning algorithms (convolutional neural networks); we supervise all levels by a physics interpreter. This interpreter provides the physical constraints on what values each layer can take. During training and testing, our model has no labels of physical properties, in contrast to the standard approaches.

3-2 Charts for the estimations of rings. The physical properties, especially density, of the first ring are different from those of the other rings. The difference is hard to perceive from visual appearance alone; however, by observing videos with object interactions, our algorithm is able to learn the properties and find the outlier. All figures are on a log-normalized scale.

3-3 Heat maps of user predictions, model outputs (in orange), and ground truths (in white). Objects from top to bottom, left to right: dough, metal coin, metal pole, plastic block, plastic doll, and porcelain.

4-1 Our second model formalizes a hypothesis space of physical object representations, where each object is defined by its mass, friction coefficient, 3D shape, and a positional offset w.r.t. an origin. To model videos, we draw objects from that hypothesis space into the physics engine. The simulations from the physics engine are compared to observations in the velocity space.

4-2 Simulation results. Each row represents one video in the data: (a) the first frame of the video, (b) the last frame of the video, (c) the first frame of the simulated scene generated by Bullet, (d) the last frame of the simulated scene, (e) the estimated object with larger mass, (f) the estimated object with larger friction coefficient.

4-3 Mean squared errors of oracle estimation, our estimation, and uniform estimations of mass on a log-normalized scale, and the correlations between estimations and ground truths.

4-4 The log-likelihood traces of several chains with and without recognition-model (LeNet) based initializations.

4-5 Mean errors in numbers of pixels of human predictions, Galileo outputs, and a uniform estimate calculated by averaging ground truth ending points over all test cases. As the error patterns are similar for both target objects (foam and cardboard), the errors here are averaged across target objects for each material.

4-6 Heat maps of user predictions, Galileo outputs (orange crosses), and ground truths (white crosses).

4-7 Average accuracy of human predictions and Galileo outputs on the tasks of mass prediction and "will it move" prediction. Error bars indicate standard deviations of human accuracies.

List of Tables

3.1 Accuracies (%, for oracle) or clustering purities (%, for joint training) on material estimation. In the joint training case, as there is no supervision on the material layer, it is not necessary for the network to specifically map the responses in that layer to material labels, and we do not expect the numbers to be comparable with the oracle case. Our analysis is just to show that even in this case the network implicitly grasps some knowledge of object materials.

3.2 Correlation coefficients of our estimations and ground truth for mass, density, and volume.

3.3 Mean squared errors in pixels of human predictions (H), model outputs (M), or the uniform estimate minimizing the mean squared error (U).

3.4 Correlation coefficients on the tasks of predicting the moving distance and the bounce height, and accuracies on predicting whether an object floats.

4.1 Correlations between pairs of outputs in the mass prediction experiment (in Spearman's coefficient) and in the "will it move" prediction experiment (in Pearson's coefficient).

Chapter 1

Introduction

Our visual system is designed to perceive a physical world that is full of dynamic content. Consider yourself watching a Rube Goldberg machine unfold: as the kinetic energy moves through the machine, you may see objects sliding down ramps, colliding with each other, rolling, entering other objects, falling - many kinds of physical interactions between objects of different masses, materials, and other physical properties. How does our visual system recover so much content from the dynamic physical world? What is the role of experience in interpreting a novel dynamical scene?

Further, there is evidence that babies form a visual understanding of basic physical concepts, as a basic component of common sense knowledge, at a very young age; they learn properties of objects from their motions [1]. As young as 2.5 to 5.5 months old, infants learn basic physics even before they acquire advanced high-level knowledge such as semantic categories of objects [5, 1]. Both infants and adults also use their physics knowledge to learn and discover latent labels of object properties, as well as to predict the physical behavior of objects [2]. These facts suggest the importance of physical understanding to a visual system, and motivate our goal of building a machine with such visual competency.

Recent behavioral and computational studies of human physical scene understanding push forward an account that people's judgments are best explained as probabilistic simulations of a realistic, but mental, physics engine [2, 15]. Specifically, these studies suggest that the brain carries detailed but noisy knowledge of the physical

attributes of objects and the laws of physical interactions between objects (i.e., Newtonian mechanics). To understand a physical scene, and more crucially, to predict the future dynamical evolution of a scene, the brain relies on simulations from this mental physics engine.

Even though the probabilistic simulation account is very appealing, there are missing practical and conceptual leaps. First, as a practical matter, the probabilistic simulation approach has been shown to work only with synthetically generated stimuli in 2D or 3D block worlds. The joint inference of the mass and the coefficient of friction is also not handled [2]. Second, as a conceptual matter, previous research rarely clarifies how a mental physics engine could take advantage of the agent's previous experience [18]. Humans have lifelong experience with dynamical scenes, and a fuller account of human physical scene understanding should address it.

We aim to build on the idea that humans utilize a realistic physics engine as part of a generative model to interpret real-world physical scenes. Given a video as observation to the model, physical scene understanding corresponds to inverting the generative model by probabilistic inference to recover the underlying physical object properties in the scene. Our formulation combines deep learning, which serves as a powerful low-level visual recognition system, with a physics simulator to estimate physical properties directly from unlabeled videos. We study two possible forms of a physics simulator: the first is a symbolic physics interpreter encoded as layers in deep learning; the second is a mature physics engine. Compared to recent studies in vision and robotics on predicting physical interactions for 3D reasoning [10, 23] and tracking [16], our goal is to infer physical object properties directly, and we incorporate a generative physics simulator with a powerful discriminative recognition model, which distinguishes our framework from previous methods introduced in the vision and robotics community for predicting physical interactions or properties of objects for various purposes [14, 20, 10, 23, 19, 3, 4, 8, 24].

We also construct a video dataset for evaluating machine and human performance on real-world data. We collected a dataset of 101 objects made of different materials and with a variety of masses and volumes. We started by collecting videos of these

objects from multiple viewpoints in four different scenarios: objects slide down an

inclined surface and possibly collide with another object; objects fall onto surfaces

made of different materials; objects splash in water; and objects hang on a spring.

These seemingly straightforward setups require understanding multiple physical properties, e.g., material, mass, volume, density, coefficient of friction, and coefficient of

restitution, as discussed later. We call this dataset Physics 101, highlighting that we are learning elementary physics, while also indicating the current object count. Our dataset contains not only over 12,000 RGB videos, but also more than 4,000 depth videos and audio recordings, which could benefit our future study on learning from multi-modal data.

Based on the estimates we derived from visual input with a physics simulator, a natural extension is to generate or synthesize training data for any automatic learning system by bootstrapping from the videos already collected and labeling them with the model's estimates. This is a self-supervised learning algorithm for inferring generic physical properties, and relates to the wake/sleep phases in Helmholtz machines [9], and to the cognitive development of infants. Extensive studies suggest that infants either are born with or quickly learn physical knowledge about objects when they are very young, even before they acquire more advanced high-level knowledge like semantic categories of objects [5, 1]. Young babies are sensitive to the physics of objects mainly through the motion of foreground objects relative to the background [1]; in other words, they learn by watching videos of moving objects. But later in life, and clearly in adulthood, we can perceive physical attributes in static scenes without any motion.

Here, building upon the idea of Helmholtz machines [9], our approach suggests one potential computational path to the development of the ability to perceive physical content in static scenes. Following recent work [22], we train a recognition model

(i.e., sleep cycle) that is in the form of a deep convolutional network, where the training data is generated in a self-supervised manner by the generative model itself

(i.e., wake cycle: real-world videos observed by our model and the resulting physical inferences). Interestingly, this computational solution asserts that the infant starts

with a relatively reliable mental physics engine, or acquires it soon after birth.

Our research admits various generalizations and applications. With physical object properties, we may build intelligent systems for high-level scene understanding, including the study of physics-related concepts like object stability in the scene, and we may incorporate agents that interact with the physical world for particular goals.

Our study is inspired by findings in developmental psychology, but can also lead to interesting and fundamental research questions there, for instance, whether there exist connections between the learning processes of infants and machines on physical concepts.

Chapter 2

Modeling the Physical World

There exist highly involved physical processes in daily events in our physical world, even in simple scenarios like objects sliding down an inclined surface. As shown in

Figure 2-1a, we can divide all involved physical properties into two groups: the first is the intrinsic physical properties of objects like volume, material, and mass, many of which we cannot directly measure from the visual input; the second is the descriptive physical properties which characterize the scenario in the video, including but not limited to the velocity of objects, the distances that objects travel, or whether objects float when thrown into water. The second group of properties is observable and determined by the first group, while both groups determine the content of the videos.

Our goal is to build an architecture that can automatically discover those observable descriptive physical properties from unlabeled videos, and use them as supervision to further learn and infer unobservable latent physical properties. Our generative model can then apply learned knowledge of physical object properties for other tasks like predicting outcomes in the future.

The computer vision community has made much progress through its datasets, and there are datasets of objects, attributes, materials, and scene categories. Here, we introduce a new type of dataset, Physics 101, capturing physical interactions of objects. The dataset consists of four different scenarios, for each of which plenty of intriguing questions may be asked. For example, in the ramp scenario, will the object on the ramp move, and if it does and two objects collide, which of them will move next, and how far?

Figure 2-1: Abstraction of the physical world, and a snapshot of our dataset. (a) Abstraction of physical properties and how they determine the content of a video. (b) Our scenario and a snapshot of our dataset, Physics 101, showing various objects at different stages. Our data are taken by four sensors (3 RGB and 1 depth).

2.1 Scenarios

We seek to learn physical properties of objects by observing videos. To this end, we build a dataset by recording videos of moving objects. We pick an introductory setup with four different scenarios, which are illustrated in Figures 2-1b and 2-2. We then introduce each scenario in detail.

Ramp We put an object on an inclined surface, and the object may either slide down or keep static, due to gravity and friction. This seemingly straightforward scenario already involves understanding many physical object properties including material, coefficient of friction, mass, and velocity. Figure 2-2a analyzes the physics behind our setup.

At first, there are three external forces on the object: a gravitational force G, a normal force N from the surface, and a friction force R. If the friction force R is strong enough, the object will not move. Otherwise, the object will start to slide. After it reaches the ground, these forces still exist, but now the object will slow

down due to the friction force R. If object A slides all the way to B, then A will hit B and both of them will move. How far A and B move depends on their friction coefficients, masses, and the velocity of A at the moment of collision.

Figure 2-2: Illustrations of the scenarios in our Physics 101 dataset. (a) The ramp scenario (I. initial setup, II. before collision, III. at collision, IV. after collision, V. final result): several physical properties determine whether object A will move, whether it will reach object B, and how far each object will move; here N, R, and G indicate a normal force, a friction force, and a gravitational force, respectively. (b) The spring scenario (I. initial setup, II. after extension). (c) The liquid scenario (I. a floating object, II. a sunk object). (d) The fall scenario (I. initial setup, II. at collision, III. bounce).

In this scenario, the observable descriptive physical properties are the velocities of the objects, and the distances both objects traveled. The latent properties directly involved are coefficient of friction and mass.

Spring We hang objects on a spring, and the gravity acting on the object stretches the spring, as shown in Figure 2-2b. Here the observable descriptive physical property is the length by which the spring is stretched, and the latent properties are the mass of the object and the elasticity of the spring.

Fall We drop objects in the air, and they freely fall onto various surfaces. Figure 2-2d illustrates this scenario. Here the observable descriptive physical properties are

the bounce heights of the object, and the latent properties are the coefficients of restitution of the object and the surface.

Figure 2-3: Physics 101: this is the set of objects we used in our experiments. We vary object material, color, shape, and size, together with external conditions such as the slope of a surface or the stiffness of a spring. Videos recording the motions of these objects interacting with target objects will be used to train our algorithm.

Liquid As shown in Figure 2-2c, we drop objects into some liquid, and they may float or sink at various speeds. In this scenario, the observable descriptive physical property is the velocity of the sinking object (0 if it floats), and the latent properties are the densities of the object and the liquid.

2.2 The Physics 101 Dataset

The outcomes of various physical events depend on multiple factors of objects, such as materials (density and friction coefficient), sizes and shapes (volume), and slopes of ramps (gravity), elasticities of springs, etc. We collect our dataset while varying all these conditions. Figure 2-3 shows the entire collection of our 101 objects, and the following are more details about our variations:

20 Material Our 101 objects are made of 15 different materials - cardboard, dough, foam, hollow rubber, hollow wood, metal coin, metal pole, plastic block, plastic doll, plastic ring, plastic toy, porcelain, rubber, wooden block, and wooden pole.

Appearance For each material, we have 4 ~ 12 objects of different sizes, shapes, and colors.

Slope (ramp) We also vary the angle α between the inclined surface and the ground (to vary the gravity force). We set α = 10° and 20° for each object.

Target (ramp) We have two different target objects - a cardboard and a foam box. They are made of different materials, thus having different friction coefficients and densities.

Spring We use two springs with different stiffness.

Surface (fall) We drop objects onto five different surfaces: foam, glass, metal, a wooden table, and a woolen rug. These materials have different coefficients of restitution.

We also measure the physical properties of these objects. We record the mass and volume of each object, which also determine density. Please refer to the supplementary material for the statistics of all these measured properties.

For each setup, we record 3 ~ 10 trials. We record multiple trials because some external factors, e.g., the orientations of objects and the roughness of surfaces, may lead to different outcomes. Having more than one trial per condition increases the diversity of our dataset by making it cover more possible outcomes.

Finally, we record each trial from three different viewpoints: one side view, one top-down view, and one upper-top view. For the first two views, we take data with DSLR cameras, and for the upper-top view, we use a Kinect V2 to record both RGB and depth maps. After removing trials with significant noise, we have 4,352 trials in total. Given that we capture three RGB recordings and one depth recording per trial, there are

17,408 video clips altogether. These video clips constitute the Physics 101 dataset.

Chapter 3

Physical Object Model: Learning with a Symbolic Interpreter

We aim to discover physical object properties under a unified system with minimal supervision, rather than training a separate classifier/regressor for each label (such as material and volume) in a fully supervised manner. With this philosophy, we develop two physical object models; one uses deep learning and a symbolic physics interpreter for recognizing physical properties, and the other incorporates a mature physics engine and predicts physical properties via an analysis-by-synthesis approach. Both methods have built-in knowledge of physics, and work in an unsupervised setting. With these generative models, we are able to not only discover all physical properties (e.g., material, volume) simply by observing motions of objects in unlabeled videos, but also predict different physical interactions (e.g., how far an object will move, if it moves at all) based on inferred physical properties.

In this chapter we describe our first model, shown in Figure 3-1. Our method is based on a convolutional neural network (CNN) [11], which consists of three components. The bottom component is a visual property discoverer, which aims to discover physical properties like material or volume which could at least partially be observed from visual input; the middle component is a physics interpreter, which explicitly encodes physical laws into the network structure and models latent physical properties like density and mass; the top component is a physical world simulator, which

characterizes descriptive physical properties like the distances that objects traveled, all of which we may directly observe from videos.

Figure 3-1: Our first model exploits the advancement of machine learning algorithms (convolutional neural networks); we supervise all levels by a physics interpreter. This interpreter provides the physical constraints on what values each layer can take. During training and testing, our model has no labels of physical properties, in contrast to the standard approaches.

Our network corresponds to our physical world model introduced in Chapter 2. We would like to emphasize here that our model learns object properties from completely unlabeled data. We do not provide any labels for physical properties like material, velocity, or volume; instead, our model automatically discovers observations from videos, uses them as supervision to the top physical world simulator, which in turn advises what the physics interpreter should discover.

3.1 Visual Property Discoverer

The bottom meta-layer of our architecture in Figure 3-1 is designed to discover and predict low-level properties of objects, including material and volume, which can at least partially be perceived from the visual input. These properties are the basic ingredients for predicting any derived physical properties at upper layers, e.g., density and mass.

In order to interpret any physical interaction of objects, we need to be able to

24 first locate objects inside videos. We use a KLT point tracker [17] to track moving

objects. We also compute a general background model for each scenario to locate

foreground objects. Image patches of objects are then supplied to our visual property discoverer.
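For illustration, a tracking step of this kind could be sketched with OpenCV's pyramidal Lucas-Kanade (KLT-style) tracker; the helper name, parameter values, and the choice of OpenCV are assumptions of this sketch, not the thesis's actual implementation.

```python
# Illustrative sketch only: KLT-style tracking with OpenCV's pyramidal
# Lucas-Kanade optical flow. Parameter values are assumptions.
import cv2
import numpy as np

def track_object_centers(video_path):
    """Return the mean location of tracked feature points in each frame."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)
    centers = []
    while pts is not None:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        good = nxt[status.flatten() == 1]
        if len(good) == 0:
            break
        # The mean of the tracked points serves as the object center; a patch
        # around it would be cropped and fed to the visual property discoverer.
        centers.append(good.reshape(-1, 2).mean(axis=0))
        prev_gray, pts = gray, good.reshape(-1, 1, 2)
    cap.release()
    return centers
```

Differences between consecutive centers would then give velocity observations of the kind used by the physical world simulator described later.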

Material and volume Material and volume are properties that can be estimated

directly from image patches. Hence, we have LeNet [12] on top of image patches extracted by the tracker. Once again, rather than directly supervising each LeNet with their labels, we supervise them by automatically discovered observations which

are provided to our physical world simulator. To be precise, we do not have any individual loss layer for LeNet components. Note that inferring volumes of objects

from static images is an ambiguous problem. However, this problem is alleviated by our data from different viewpoints and both RGB and depth maps.

3.2 Physics Interpreter

The second meta-layer of our model is designed to encode physical laws. For instance, if we assume an object is homogeneous, then its density is determined by its material, and its mass is the product of its density and its volume.

Based on material and volume, we expand a number of physical properties in this physics interpreter, which will later be used to connect to real world observations.

The following shows how we represent each physical property as a layer, as depicted in Figure 3-1:

Material An Nm-dimensional vector, where Nm is the number of different materials.

The value of each dimension represents the confidence that the object is made of the corresponding material. This is an output of our visual property discoverer.

Volume A scalar representing the predicted volume of the object. This is an output of our visual property discoverer.

Coefficient of friction and density Each is a scalar representing the predicted physical property based on the output of the material layer. Each output is the inner

product of Nm learned parameters and the responses from the material layer.

Coefficient of restitution An Nm-dimensional vector representing how much of the kinetic energy remains after a collision between the input object and other objects of various materials. The representation is a vector, not a scalar, as the coefficient of restitution is determined by the materials of both objects involved in the collision.

Mass A scalar representing the predicted mass based on the outputs of the density layer and the volume layer. This layer is the product of the density and volume layers.
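The relations above (density and friction as inner products with the material confidences, and mass as the product of density and volume) could be expressed, for instance, as a small differentiable module. The PyTorch re-expression below and all names in it are illustrative assumptions, since the thesis used Torch7.

```python
# Illustrative PyTorch re-expression of the physics interpreter layers.
import torch
import torch.nn as nn

class PhysicsInterpreter(nn.Module):
    def __init__(self, num_materials):
        super().__init__()
        # One learned scalar per material for friction coefficient and density.
        self.friction_per_material = nn.Parameter(torch.rand(num_materials))
        self.density_per_material = nn.Parameter(torch.rand(num_materials))

    def forward(self, material_conf, volume):
        # material_conf: (batch, num_materials) confidences from the discoverer
        # volume: (batch,) volume estimates from the discoverer
        friction = material_conf @ self.friction_per_material  # inner product
        density = material_conf @ self.density_per_material    # inner product
        mass = density * volume                                 # mass = density * volume
        return friction, density, mass
```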

3.3 Physical World Simulator

Our physical world simulator connects the inferred physical properties to real-world observations. We have different observations for different scenarios: we use the velocities of objects and the distances objects traveled as observations for the ramp scenario, the length by which the spring is stretched as the observation for the spring scenario, the bounce height as the observation for the fall scenario, and the velocity at which the object sinks as the observation for the liquid scenario. All observations can be derived from the output of our tracker.

To connect those observations to physical properties our model inferred, we employ physical laws. The physical laws we used in our model include

Newton's law F = mg sin θ − μmg cos θ = ma, or (sin θ − μ cos θ)g = a, where θ is the angle between the inclined surface and the ground, μ is the coefficient of friction, and a is the acceleration of the object (observation). This is used for the ramp scenario.

Conservation of momentum and energy CR = (vb − va)/(ua − ub), where vi is the velocity of object i after the collision, and ui is its velocity before the collision. All ui and vi are observations, and this is also used for the ramp scenario.

Hooke's law F = kX, where X is the distance by which the spring is extended (our observation), k is the stiffness of the spring, and F = G = mg is the gravity on the object. This is used for the spring scenario.

Bounce CR = √(h/H), where CR is the coefficient of restitution, h is the bounce height (observation), and H is the drop height. This can be viewed as another expression of the conservation of energy and momentum, and is used for the fall scenario.

Buoyancy dVg − dwVg = ma = dVa, or (d − dw)g = da, where d is the density of the object, dw is the density of water (a constant), and a is the acceleration of the object in water (observation). Note that for d < dw, a = 0. This is used for the liquid scenario.

We use the MSE between our model's estimate and the target value supplied by the physical world simulator as our loss during training.
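As a concrete illustration of how one of these laws can supervise the network, the sketch below turns the ramp-scenario form of Newton's law into an MSE loss between the implied acceleration and the acceleration observed by the tracker; the function name and g = 9.8 are assumptions of this sketch.

```python
# Illustrative sketch: (sin(theta) - mu*cos(theta)) * g = a as an MSE loss.
import torch

def ramp_law_loss(mu_pred, theta, a_observed, g=9.8):
    """MSE between the acceleration implied by Newton's law and the observation."""
    theta = torch.as_tensor(theta, dtype=torch.float32)  # slope angle in radians
    a_pred = (torch.sin(theta) - mu_pred * torch.cos(theta)) * g
    return torch.mean((a_pred - a_observed) ** 2)
```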

3.4 Experiments

In this section, we present experiments with our models in various settings. We start with extensive verifications of our models on learning physical properties. Later, we investigate the generalization ability of our model on other tasks like detecting objects with unusual properties, predicting outcomes given partial information, and transferring knowledge across different scenarios. We use Torch7 [6] for all experiments. For learning physical properties from Physics 101, we study our algorithm in the following settings:

" Split by frame: for each trial of each object, we use 95% of the patches we get from tracking as training data, while the other 5% of the patches as test data.

" Split by trial: for each trial of each object, we use all patches in 95% of the trials we have as training data, while patches in the other 5% of the trials as test data.

" Split by object: we randomly choose 95% of the objects, and use their patches as training data and the others as test data.

Among these three settings, split by frame is the easiest as for each patch in test data, the algorithm may find some very similar patch in the training data. Split by

27 object is the most difficult setting as it requires the model to generalize to objects that it has never seen before.

We consider training our model in different ways:

" Oracle training: we train our model with images of objects and their associated ground truth labels. We apply oracle training on those properties we have

ground truths labels of (material, mass, density, and volume).

" Standalone training: we train our model on data from one scenario. Automat-

ically extracted observations serve as supervision.

• Joint training: we jointly train the entire network on all training data without any labels of physical properties. Our only supervision is the physical laws encoded in the top physical world simulator. Data from different scenarios supervise different layers in the network.

Oracle training is designed to test the ability of each component and can be viewed as an upper bound of the performance the model may achieve. Our focus is on standalone and joint training, where our model learns from unlabeled videos directly.

We are also interested in understanding how our model can perform at inferring some physical properties purely from depth maps. Therefore, besides using RGB data, we conduct some experiments where training and test data are depth maps only.

3.4.1 Learning Physical Properties

Material perception: We start with the task of material classification. Table 3.1 shows the accuracy of the oracle models on material classification. We observe that they achieve nearly perfect results in the easiest case, and are still significantly better than chance in the most difficult split-by-object setting. Both depth maps

and RGB maps give good performance on this task with oracle training.

Methods          Frame   Trial   Object
Depth (Oracle)   92.6    62.5    35.7
RGB (Oracle)     99.9    77.4    52.2
RGB (ramp)       26.9    24.7    19.7
RGB (spring)     29.9    22.4    14.3
RGB (fall)       29.4    25.0    17.0
RGB (liquid)     22.2    15.4    12.6
RGB (joint)      35.5    28.7    25.7
Depth (joint)    38.3    26.9    22.4
Uniform           6.67    6.67    6.67

Table 3.1: Accuracies (%, for oracle) or clustering purities (%, for joint training) on material estimation. In the joint training case, as there is no supervision on the material layer, it is not necessary for the network to specifically map the responses in that layer to material labels, and we do not expect the numbers to be comparable with the oracle case. Our analysis is just to show that even in this case the network implicitly grasps some knowledge of object materials.

In the standalone and joint training case, given we have no labels on materials, it is not possible for the model to classify materials; instead, we expect it to cluster objects by their materials. To measure this, we perform K-means on the responses of the material layer of test data, and use purity, a common measure for clustering, to

measure if our model indeed discovers clusters of materials automatically. As shown in Table 3.1, the clustering results indicate that the system learns the material of objects to a certain extent.
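A minimal sketch of this evaluation, assuming scikit-learn's K-means and a purity helper written for this example (neither is confirmed by the thesis), is given below.

```python
# Illustrative sketch of the clustering-purity evaluation of the material layer.
import numpy as np
from sklearn.cluster import KMeans

def clustering_purity(responses, material_labels, num_clusters):
    """responses: (N, D) material-layer activations; material_labels: (N,) ints."""
    assignments = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(responses)
    correct = 0
    for c in range(num_clusters):
        members = material_labels[assignments == c]
        if len(members) > 0:
            correct += np.bincount(members).max()  # size of the majority material
    return correct / len(material_labels)
```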

Physical parameter estimation: We then test our systems, trained with or without oracles, on the task of physical property estimation. We use the Pearson product-moment correlation coefficient as our measure. Table 3.2 shows the results on estimating

mass, density, and volume. Notice that here we evaluate the outputs on a log scale to avoid unbalanced emphases on objects with large volumes or masses.

We observe that with an oracle our model can learn all physical parameters well. For standalone and joint learning, our model is also consistently better than a nontrivial baseline, which selects the optimal uniform estimate that minimizes the mean squared error.

                   Mass                    Density                 Volume
Methods          Frame  Trial  Object    Frame  Trial  Object    Frame  Trial  Object
RGB (Oracle)     0.79   0.72   0.67      0.83   0.74   0.65      0.77   0.67   0.61
Depth (Oracle)   0.79   0.72   0.67      0.83   0.74   0.65      0.77   0.67   0.61
RGB (spring)     0.40   0.35   0.20      N/A    N/A    N/A       N/A    N/A    N/A
RGB (liquid)     N/A    N/A    N/A       0.33   0.27   0.30      N/A    N/A    N/A
RGB (joint)      0.58   0.42   0.38      0.38   0.39   0.39      0.40   0.37   0.30
Depth (joint)    0.43   0.32   0.25      0.49   0.37   0.17      0.30   0.20   0.22
Uniform          0      0      0         0      0      0         0      0      0

Table 3.2: Correlation coefficients of our estimations and ground truth for mass, density, and volume.


Figure 3-2: Charts for the estimations of rings. The physical properties, especially density, of the first ring are different from those of the other rings. The difference is hard to perceive from visual appearance alone; however, by observing videos with object interactions, our algorithm is able to learn the properties and find the outlier. All figures are on a log-normalized scale.

3.4.2 Detecting Objects with Unusual Properties

Sometimes objects with similar appearances may have distinct physical properties.

In this section, we test whether our system is able to find these expectation-violation cases.

In Physics 101, among the five plastic rings, the bottom part of the smallest ring is made of a different material with a larger density, which makes its mass greater than those of the other four even though its volume is smaller. The material of the smallest ring also has a lower friction coefficient, so the velocity of the smallest ring at collision is higher than those of the others.

In Figure 3-2, we show the estimations of our RGB joint model on the properties of all five rings, as well as their appearances. As shown, it is hard to perceive the difference between the physical properties of the first ring and those of the others

purely from visual appearances. By observing videos where they slide down and hit other objects, our system can learn the physical parameters and model the outliers.

Table 3.3: Mean squared errors in pixels of human predictions (H), model outputs (M), or the uniform estimate minimizing the mean squared error (U).

                   Foam                   Cardboard
Material          H      M      U        H      M      U
cardboard         28.8   40.7   97.0     15.0   77.2   84.0
dough             27.4   25.2   84.4     150.9  105.1  113.4
hollow wood       35.7   19.4   108.9    81.0   35.0   21.4
metal coin        13.4   32.2   149.8    31.9   33.3   75.8
metal pole        272.9  257.6  280.0    91.4   188.7  184.0
plastic block     29.8   82.1   97.6     46.9   57.2   35.0
plastic doll      49.4   23.6   44.0     128.8  41.8   93.9
plastic toy       30.1   41.9   121.2    33.3   9.5    70.6
porcelain         138.5  127.0  110.9    196.0  216.6  314.8
wooden block      45.9   32.8   36.2     47.3   37.5   14.2
wooden pole       78.9   88.0   138.9    58.7   89.8   74.3
Mean              68.2   70.1   115.4    80.1   81.1   98.3

Figure 3-3: Heat maps of user predictions, model outputs (in orange), and ground truths (in white). Objects from top to bottom, left to right: dough, metal coin, metal pole, plastic block, plastic doll, and porcelain.

3.4.3 Predicting Outcomes

We may apply our model to a variety of outcome prediction tasks for different scenarios. We consider three of them: how far an object will move after being hit by another object; how high an object will bounce after being dropped from a certain height; and whether an object will float in water. With estimated physical object properties, our model can answer these questions using physical laws.

Transferring Knowledge Across Multiple Scenarios As some physical knowledge is shared across multiple scenarios, it is natural to evaluate how knowledge learned in one scenario may be applied to a novel one. Here we consider the case where the model is trained on all but the fall scenario. We then apply the model to the fall scenario to predict how high an object bounces. Our intuition is that the coefficients of restitution learned from the ramp scenario can help with this prediction to some extent.

Tasks            Methods          Frame   Trial   Object
Collision Dist   RGB (joint)      0.65    0.42    0.33
Collision Dist   Uniform          0       0       0
Bounce Height    RGB (joint)      0.35    0.31    0.23
Bounce Height    RGB (transfer)   0.22    0.21    0.11
Spring Ext       Uniform          0       0       0
Float            RGB (joint)      0.94    0.87    0.84
Float            Uniform          0.70    0.70    0.70

Table 3.4: Correlation coefficients on the tasks of predicting the moving distance and the bounce height, and accuracies on predicting whether an object floats.

Results Table 3.4 shows outcome prediction results. We can see that our method works well, and can also transfer learned knowledge across multiple scenarios.

Behavior Experiments We would like to see how well our model does compared to humans. To do this, we conducted experiments on predicting the moving distance of an object after collision on Amazon Mechanical Turk. Specifically, among all objects that slide down, we select one object of each material, show AMT workers videos of the object, but only up to the moment of collision. We then ask workers to label where they believe the target object (either cardboard or foam) will be after the collision, i.e., how far the target will move. Before testing, each user is provided four full videos of other objects made of the same material, which contain complete collisions, so that users can infer the physical properties associated with the material and the target object. We tested 30 users per case.

Table 3.3 shows the mean squared errors in pixels of human predictions (H), model predictions (M), or uniform estimate minimizing the mean squared error (U). We can

see that the performance of our model is close to that of humans on this task. Figure 3-3 shows the heat maps of user predictions, model outputs (orange), and ground truths (white).

Chapter 4

Physical Object Model: Incorporating a Physics Engine

4.1 The Galileo Model

Here we describe our second model. Compared to the first one, our second model

(shown in Figure 4-1) incorporates a physics engine in its core, and the gist of our second model can be summarized as probabilistically inverting the physics engine to recover unobserved physical properties of objects. For this model, we focus on the ramp scenario, and in honor of the famous physicist, we name our model Galileo.

The first component of Galileo is the physical object representations, where each object is a rigid body and represented not only by its 3D geometric shape (or volume) and its position in space, but also by its mass and its friction. All of these object attributes are treated as latent variables in the model, and are approximated or estimated on the basis of the visual input.

Specifically, we collectively refer to the unobserved latent variables of an object as its physical representation T. For each object i, Ti consists of its mass mi, friction coefficient ki, 3D shape Vi, and position offset pi w.r.t. an origin in 3D space. We place uniform priors over the mass and the friction coefficient of each object: mi ~ Uniform(0.001, 1) and ki ~ Uniform(0, 1), respectively. For the 3D shape Vi, we have four variables: a shape type ti, and scaling factors for the three dimensions

xi, yi, zi. We simplify the possible shape space in our model by constraining each shape type ti to be one of three with equal probability: a box, a cylinder, or a torus. Note that applying scaling differently on each dimension to these three basic shapes results in a large space of shapes.¹ The scaling factors are chosen to be uniform over a range of values that captures the extent of different shapes in the dataset.

Figure 4-1: Our second model formalizes a hypothesis space of physical object representations, where each object is defined by its mass, friction coefficient, 3D shape, and a positional offset w.r.t. an origin. To model videos, we draw objects from that hypothesis space into the physics engine. The simulations from the physics engine are compared to observations in the velocity space.

Remember that our scenario consists of an object on the ramp and another on the ground. The position offset, pi, for each object is uniform over the set

{0, 1, 2, . . . , 5}. This indicates that for the object on the ramp, its position can be perturbed along the ramp (i.e., in 2D) by at most 5 units upwards or downwards from its starting position, which is 30 units up the ramp from the ground.

The next component of our generative model is a fully-fledged realistic physics

¹For shape type box, xi, yi, and zi could all take different values; for shape type torus, we constrained the scaling factors such that xi = zi; and for shape type cylinder, we constrained the scaling factors such that yi = zi.
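For illustration, drawing a single object hypothesis from the priors described above might look like the following sketch; the scale range, the exact offset set, and the helper name are assumptions.

```python
# Illustrative sketch of sampling one object hypothesis from the priors above.
import random

def sample_object_hypothesis(scale_range=(0.5, 3.0)):
    shape_type = random.choice(["box", "cylinder", "torus"])  # equal probability
    x = random.uniform(*scale_range)
    y = random.uniform(*scale_range)
    z = random.uniform(*scale_range)
    if shape_type == "torus":       # constrain x = z for a torus
        z = x
    elif shape_type == "cylinder":  # constrain y = z for a cylinder
        z = y
    return {
        "mass": random.uniform(0.001, 1.0),    # m ~ Uniform(0.001, 1)
        "friction": random.uniform(0.0, 1.0),  # k ~ Uniform(0, 1)
        "shape": (shape_type, x, y, z),
        "offset": random.choice(range(0, 6)),  # position offset along the ramp
    }
```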

engine, which we denote as p. Specifically, we use the Bullet physics engine [7], following

the earlier related work. The physics engine takes a specification of each of the physical objects in the scene within the basic ramp setting as input, and simulates

it forward in time, generating simulated velocity vectors for each object in the scene,

vs1 and vs2, respectively, among other physical properties such as the position and a rendered image of each simulation step.

In light of initial qualitative analysis, we use velocity vectors as our feature representation in evaluating the hypotheses generated by the model against the data. We employ a standard tracking algorithm (KLT point tracker [17]) to "lift" the visual observations to the velocity space. That is, for each video, we first run the tracking

algorithm, and we obtain velocities by simply using the center locations of each of

the tracked moving objects between frames. This gives us the velocity vectors for the

object on the ramp and the object on the ground, vo1 and vo2, respectively. Note that we could replace the KLT tracker with state-of-the-art tracking for more complicated scenarios.

The third part of Galileo is the likelihood function. We evaluate the observed real-world videos with respect to the model's hypotheses using the velocity vectors

of objects in the scene. Given a pair of observed velocity vectors, vo1 and vo2, the recovery of the physical object representations T1 and T2 for the two objects via physics-based simulation can be formalized as

P(T1, T2 | vo1, vo2, p(·)) ∝ P(vo1, vo2 | vs1, vs2) · P(vs1, vs2 | T1, T2, p(·)) · P(T1, T2),   (4.1)

where we define the likelihood function as P(vo1, vo2 | vs1, vs2) = N(vo | vs, Σ), where vo is the concatenation of vo1 and vo2, and vs is the concatenation of vs1 and vs2. The dimensionality of vo and vs is kept the same for a video by adjusting the number of simulation steps we use to obtain vs according to the length of the video, but from video to video the length of these vectors may vary. In all of our simulations, we fix Σ to 0.05, which is the only free parameter in our model. Experiments show that the value of Σ does not change our results significantly.
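A sketch of this likelihood in velocity space, assuming an isotropic covariance with a single noise scale (whether the fixed 0.05 is a variance or a standard deviation is not specified, so the code simply treats it as a scale parameter), is shown below; the function name is hypothetical.

```python
# Illustrative sketch of a Gaussian log-likelihood in velocity space.
import numpy as np

def velocity_log_likelihood(v_obs, v_sim, sigma=0.05):
    """v_obs, v_sim: concatenated velocity vectors of the two objects."""
    v_obs, v_sim = np.asarray(v_obs, float), np.asarray(v_sim, float)
    d = v_obs.size
    return (-0.5 * np.sum((v_obs - v_sim) ** 2) / sigma ** 2
            - 0.5 * d * np.log(2 * np.pi * sigma ** 2))
```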

37 4.1.1 Tracking as Recognition

The posterior distribution in Equation 4.1 is intractable. In order to alleviate the burden of posterior inference, we use the output of our recognition model to predict and fix some of the latent variables in the model.

Specifically, we determine Vi, or {ti, xi, yi, zi}, using the output of the tracking algorithm, and fix these variables without further sampling them. Furthermore, we also fix the values of the pi's on the basis of the output of the tracking algorithm.

4.1.2 Inference

Once we initialize and fix the latent variables using the tracking algorithm as our recognition model, we perform single-site Metropolis-Hastings updates on the remaining four latent variables, m1, m2, k1, and k2. At each MCMC sweep, we propose a new value for one of these random variables, where the proposal distribution is Uniform(-0.05, 0.05). In order to help with mixing, we also use a broader proposal distribution, Uniform(-0.5, 0.5), every 20 MCMC sweeps.
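A schematic single-site update of this kind is sketched below; it assumes a log_likelihood(params) callable that runs the simulation and scores it against the observed velocities, and it omits the handling of the uniform prior bounds for brevity.

```python
# Schematic single-site Metropolis-Hastings update over (m1, m2, k1, k2).
import math
import random

def mh_sweep(params, log_likelihood, sweep_index):
    """params is a dict with keys 'm1', 'm2', 'k1', 'k2'."""
    name = random.choice(["m1", "m2", "k1", "k2"])
    width = 0.5 if sweep_index % 20 == 0 else 0.05  # broader proposal every 20 sweeps
    proposal = dict(params)
    proposal[name] = params[name] + random.uniform(-width, width)
    # Accept with probability min(1, exp(new_ll - old_ll)).
    log_accept = min(0.0, log_likelihood(proposal) - log_likelihood(params))
    return proposal if random.random() < math.exp(log_accept) else params
```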

4.2 Simulations

For each video, as mentioned earlier, we use the tracking algorithm to initialize and fix the shapes of the objects, S1 and S2, and the position offsets, p1 and p2. We also obtain the velocity vector for each object using the tracking algorithm. We determine the length of the physics engine simulation by the length of the observed video; that is, the simulation runs until it outputs a velocity vector for each object that is as long as the input velocity vector from the tracking algorithm.

We use 150 videos from our Physics 101 dataset, uniformly distributed across different object categories. We perform 16 MCMC simulations for a single video, each of which is 75 MCMC sweeps long. We report the results with the highest log-likelihood score across the 16 chains (i.e., the MAP estimate). In Figure 4-2, we illustrate the results for three individual videos. Every two frames

of the top row show the first and the last frame of a video, and the bottom-row images show the corresponding frames from our model's simulations with the MAP estimate.

Figure 4-2: Simulation results. Each row represents one video in the data: (a) the first frame of the video, (b) the last frame of the video, (c) the first frame of the simulated scene generated by Bullet, (d) the last frame of the simulated scene, (e) the estimated object with larger mass, (f) the estimated object with larger friction coefficient.

We quantify different aspects of our model in the following behavioral experiments, where we compare our model against human subjects' judgments. Furthermore, we use the inferences made by our model here on the 150 videos to train a recognition model to arrive at physical object perception in static scenes with the model.

Importantly, note that our model can generalize across a broad range of tasks beyond the ramp scenario. For example, once we infer the friction coefficient of an object, we can predict whether it will slide down a ramp with a different slope by running a simulation. We test some of these generalizations in Section 4.4.
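For illustration, this particular question can even be answered without a full simulation: with an estimated friction coefficient μ, an object on an incline of angle θ starts to slide when tan θ > μ (equivalently, when (sin θ − μ cos θ)g > 0). The sketch below is an illustrative check, not the model's simulation-based procedure.

```python
# Illustrative check: does an object with friction coefficient mu slide on a
# slope of the given angle?
import math

def will_slide(mu, theta_degrees):
    return math.tan(math.radians(theta_degrees)) > mu

# E.g., an object with inferred mu = 0.25 slides at 20 degrees but not at 10:
# will_slide(0.25, 20) -> True; will_slide(0.25, 10) -> False
```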

4.3 Bootstrapping as Efficient Perception in Static Scenes

Based on the estimates we derived from the visual input with a physics engine, we bootstrap from the videos already collected by labeling them with Galileo's estimates. This is a self-supervised learning algorithm for inferring generic physical properties. As discussed in Chapter 1, this formulation is also related to the

wake/sleep phases in Helmholtz machines, and to the cognitive development of infants.

Here we focus on two physical properties: mass and friction coefficient. To do this, we first estimate these physical properties using the method described in earlier sections. Then, we train LeNet [13], a widely used deep neural network for small-scale datasets, using image patches cropped from videos based on the output of the tracker as data, and estimated physical properties as labels. The trained model can then be used to predict these physical properties of objects based on purely visual cues, even though they might have never appeared in the training set.
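A schematic of this bootstrapping step, re-expressed in PyTorch (the thesis used Torch7), is given below; the patch size, architecture details, and helper names are illustrative assumptions.

```python
# Schematic of training a LeNet-style regressor on Galileo's pseudo-labels.
import torch
import torch.nn as nn

class LeNetRegressor(nn.Module):
    """LeNet-style CNN that regresses mass and friction from a 32x32 RGB patch."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.ReLU(), nn.Linear(120, 2),
        )

    def forward(self, x):  # x: (batch, 3, 32, 32) image patches from the tracker
        return self.head(self.features(x))

def train_step(model, optimizer, patches, galileo_estimates):
    """galileo_estimates: (batch, 2) mass/friction pseudo-labels from the videos."""
    loss = nn.functional.mse_loss(model(patches), galileo_estimates)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```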

We also measure the masses of all objects in the dataset, which makes it possible to quantitatively evaluate the predictions of the deep network. We choose one object per material as our test cases, use all data of those objects as test data, and the rest as training data. We compare our model with a baseline, which always outputs a uniform estimate calculated by averaging the masses of all objects in the test data, and with an oracle algorithm, which is a LeNet trained on the same training data but with access to the ground-truth masses of the training objects as labels. Naturally, the performance of the oracle model can be viewed as an upper bound on what our Galileo system may achieve.

Figure 4-3 compares the performance of Galileo, the oracle algorithm, and the baseline. We observe that Galileo is much better than the baseline, although there is still room for improvement.

Because we trained LeNet using static images to predict physical object properties such as friction and mass ratios, we can use it to recognize those attributes in a quick bottom-up pass at the very first frame of the video. To the extent that the trained

LeNet is accurate, if we initialize the MCMC chains with these bottom-up predictions, we expect to see an overall boost in our log-likelihood traces. We test by running several chains with and without LeNet-based initializations. Results can be seen in

Figure 4-4. Despite the fact that LeNet is not achieving perfect performance by itself, we indeed get a boost in speed and quality in the inference.

Figure 4-3: Mean squared errors of oracle estimation, our estimation, and uniform estimations of mass on a log-normalized scale, and the correlations between estimations and ground truths.

Methods   MSE     Corr
Oracle    0.042   0.71
Galileo   0.052   0.44
Uniform   0.081   0

Figure 4-4: The log-likelihood traces of several chains with and without recognition-model (LeNet) based initializations.

4.4 Experiments

In this section, we conduct experiments from multiple perspectives to evaluate our model. Specifically, we use the model to predict how far objects will move after the collision; whether the object will remain stable in a different scene; and which of the two objects is heavier based on observations of collisions. For every experiment, we also conduct behavioral experiments on Amazon Mechanical Turk so that we may compare the performance of human and machine on these tasks.

4.4.1 Outcome Prediction

In the outcome prediction experiment, our goal is to measure and compare how well humans and machines can predict the moving distance of an object when only part of the video is observed. Specifically, for behavioral experiments on Amazon

Mechanical Turk, we first provide users with four full videos of objects made of a certain material, which contain complete collisions. In this way, users may infer the physical properties associated with that material. We then select a different object made of the same material and show users a video of that object, but only up to the

41 moment of collision. We finally ask users to label where they believe the target object (either cardboard or foam) will be after the collision, i.e., how far the target will move. We tested 30 users per case.

Given a partial video, for Galileo to generate predicted destinations, we first run it on the observed part of the video to derive an estimate of the object's friction coefficient. We then estimate its density by averaging the density values we derived from other objects of the same material by observing the collisions in which they are involved. We further estimate the density (mass) and friction coefficient of the target object by averaging our estimates from other collisions. We now have all the information required for the model to predict the ending point of the target after the collision. Note that the information available to Galileo is exactly the same as that available to humans.

We compare three kinds of predictions: human feedback, Galileo output, and, as a baseline, a uniform estimate calculated by averaging the ground-truth ending points over all test cases. Figure 4-5 shows the Euclidean distance in pixels between each of them and the ground truth. Human predictions are much better than the uniform estimate, but still far from perfect; Galileo performs comparably to humans on average on this task. Figure 4-6 shows, for some test cases, heat maps of user predictions, Galileo outputs (orange crosses), and ground truths (white crosses). The correlation between human and model errors is 0.70. A correlation analysis for the uniform model is not useful, since its output is a constant independent of the input.
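The error measure and the human-model error correlation amount to the small computation sketched below; the array names are hypothetical and the arrays would hold one row of pixel coordinates per test case.

import numpy as np

def pixel_errors(predictions, ground_truth):
    """Euclidean distance, in pixels, between predicted and true ending points.
    Both inputs are arrays of shape (n_cases, 2)."""
    return np.linalg.norm(predictions - ground_truth, axis=1)

# human_pred, model_pred, gt: (n_cases, 2) arrays of pixel coordinates
# human_err = pixel_errors(human_pred, gt)
# model_err = pixel_errors(model_pred, gt)
# error_correlation = np.corrcoef(human_err, model_err)[0, 1]   # reported value: 0.70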

4.4.2 Mass Prediction

The second experiment is to predict which of two objects is heavier after observing a video of a collision between them. For this task, we again randomly choose 50 objects and test each of them on 50 users. For Galileo, we directly obtain its answer from the estimated masses of the two objects.

Figure 4-7 demonstrates that humans and our model achieve about the same accuracy on this task. We also calculate correlations between the different outputs. For the correlation analysis, we use the ratio of the masses of the two objects estimated by Galileo as its predictor, and we aggregate human responses for each trial to get the proportion of people making each decision. As the relation is highly nonlinear, we calculate Spearman's coefficients. From Table 4.1, we observe that human responses, machine outputs, and ground truths are all positively correlated.
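A minimal sketch of this correlation analysis is given below. The numeric values are purely illustrative; mass_ratio stands for Galileo's estimated mass ratio per trial, human_votes for the aggregated proportion of subjects judging the first object heavier, and truth for the ground-truth comparison.

import numpy as np
from scipy.stats import spearmanr

mass_ratio = np.array([2.3, 0.4, 1.1, 5.0, 0.8])                  # Galileo's estimated mass ratios (illustrative)
human_votes = np.array([0.9, 0.2, 0.55, 0.95, 0.4])               # fraction of users answering "first object is heavier"
truth = (np.array([2.5, 0.3, 1.2, 4.0, 0.7]) > 1).astype(float)   # ground-truth "first object is heavier"

rho_model_human, _ = spearmanr(mass_ratio, human_votes)           # rank correlation handles the nonlinear relation
rho_model_truth, _ = spearmanr(mass_ratio, truth)
print(rho_model_human, rho_model_truth)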


Figure 4-5: Mean errors, in pixels, of human predictions, Galileo outputs, and a uniform estimate calculated by averaging ground-truth ending points over all test cases. As the error patterns are similar for both target objects (foam and cardboard), the errors here are averaged across target objects for each material.

Figure 4-6: Heat maps of user predictions, Galileo outputs (orange crosses), and ground truths (white crosses).

4.4.3 "Will it move" Prediction

Our third experiment is to predict whether a certain object will move in a different scene after observing one of its collisions. On Amazon Mechanical Turk, we show users a video containing a collision of two objects; in this video, the angle between the inclined surface and the ground is 20 degrees. We then show users the first frame of a 10-degree video of the same object and ask them to predict whether the object will slide down the surface in this case. We randomly choose 50 objects for the experiment, divide them into lists of 10 objects per user, and have each item tested on 50 users overall.

Table 4.1: Correlations between pairs of outputs in the mass prediction experiment (in Spearman's coefficient) and in the "will it move" prediction experiment (in Pearson's coefficient).

    Mass                 Spearman's Coeff
    Human vs Galileo     0.51
    Human vs Truth       0.68
    Galileo vs Truth     0.52

    "Will it move"       Pearson's Coeff
    Human vs Galileo     0.56
    Human vs Truth       0.42
    Galileo vs Truth     0.20

Figure 4-7: Average accuracy of human predictions and Galileo outputs on the tasks of mass prediction and "will it move" prediction. Error bars indicate standard deviations of human accuracies.

For Galileo, it is straightforward to predict the stability of an object in the 10-degree case using the estimates from the 20-degree video. Interestingly, both humans and the model are at chance on this task (Figure 4-7), and their responses are reasonably correlated (Table 4.1). Again, we aggregate human responses for each trial to get the proportion of people making each decision. Moreover, both subjects and the model show a bias towards saying "it will move." Future controlled experiments and simulations will investigate what underlies this correspondence.
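As a back-of-the-envelope version of this prediction, a rigid object on an incline slides when the tangent of the incline angle exceeds its static friction coefficient; the friction estimate comes from the 20-degree video, and the query is the 10-degree case. This simplified check is only a sketch and is not the model's actual physics-engine simulation.

import math

def will_it_move(mu_estimate, incline_degrees=10.0):
    """Predict whether the object slides down an incline of the given angle,
    given an estimated static friction coefficient."""
    return math.tan(math.radians(incline_degrees)) > mu_estimate

# Example: a friction coefficient of 0.15 inferred from the 20-degree video
# predicts sliding at 10 degrees, since tan(10 deg) ~ 0.176 > 0.15.
print(will_it_move(0.15))   # True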

Chapter 5

Beyond Understanding Physics

The perception of intrinsic object properties such as physical properties, appearance, and affordances plays a key role in explaining many of our daily observations, including the interactions among objects, between objects and scenes, and between agents and objects.

Scene Understanding Knowledge about physical object properties could be crucial to scene understanding. Is the configuration of the objects in the room stable?

What may happen if someone throws a ball against some particular object? Will people inside the room be safe if there is a minor earthquake? To answer questions like these, a computational system needs to understand basic physical laws, which could be provided by a mature physics engine, as well as some level of physical object properties.

An initial attempt could be to build a system working with synthetic scenes, which we can generate at very little cost and of which we have perfect knowledge. We are actively designing a new generative model with a physics engine, which follows the architecture of our second model but focuses on scene understanding. We hope the model can achieve two goals: first, using physics to help generate physically plausible scenes; and second, discriminatively predicting the stability and other physical properties of every location in a given scene.
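To give a flavor of how a physics engine can answer such stability queries on synthetic scenes, the sketch below uses the Bullet engine through its pybullet Python bindings: it builds a simple stack of boxes, runs the simulation forward, and calls the configuration stable if no box moves appreciably. The scene layout, thresholds, and overall setup are illustrative assumptions, not the design of the planned system.

import pybullet as p

def is_stable(box_positions, half_extent=0.1, steps=240, tol=0.02):
    p.connect(p.DIRECT)                                            # headless physics simulation
    p.setGravity(0, 0, -9.8)
    p.createMultiBody(0, p.createCollisionShape(p.GEOM_PLANE))     # static ground plane
    shape = p.createCollisionShape(p.GEOM_BOX, halfExtents=[half_extent] * 3)
    bodies = [p.createMultiBody(baseMass=1.0, baseCollisionShapeIndex=shape,
                                basePosition=pos) for pos in box_positions]
    for _ in range(steps):                                         # ~1 second at Bullet's default 240 Hz
        p.stepSimulation()
    moved = [abs(p.getBasePositionAndOrientation(b)[0][2] - pos[2]) > tol
             for b, pos in zip(bodies, box_positions)]
    p.disconnect()
    return not any(moved)

# A two-box stack with a large horizontal offset is predicted to topple:
# print(is_stable([(0, 0, 0.1), (0.15, 0, 0.3)]))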

Cognitive Agents In the dynamic world we live in, we are not only observers but also participants. Similarly, with a physical object model, it is natural to incorporate an agent that actively explores and interacts with the world. Such an agent could be unintentional: infants may play with an object not for any particular purpose, but merely to discover what they can do to the object and how it will respond. In perceiving physical object properties, it is reasonable to expect an agent that actively interacts with objects to perform better than a computational system that only learns by watching videos.

Agents may also pursue goals, such as moving an object to a certain place efficiently or deconstructing an unstable pile of building blocks. Besides combining deep learning with a physics engine, another direction we would like to explore is to integrate reinforcement learning into the loop, which has been proven effective in similar tasks.

Developmental Psychology In our second model, we assume uniform priors on physical properties such as mass and coefficient of friction during sampling. This does not align with the intuition that people expect objects with larger volumes to be heavier, or objects with smoother surfaces to have smaller coefficients of friction. To what extent do these priors exist, and if so, how do they affect human decisions? These questions could have profound implications when agents interact with objects; for example, a robot should exert a smaller force on a light and fragile object to avoid breaking it. Rigorously answering these questions, however, requires carefully designed behavioral experiments.
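As a toy illustration of replacing the uniform mass prior with a volume-informed one, the sketch below draws mass log-normally around a typical density times the object's volume, so that larger objects are expected to be heavier. The density value and spread are assumptions made for illustration, not measured parameters.

import numpy as np

def sample_mass(volume_cm3, typical_density=0.5, spread=0.5, rng=None):
    """Prior sample of mass (in grams) given volume (in cm^3)."""
    rng = rng or np.random.default_rng()
    expected_mass = typical_density * volume_cm3
    return expected_mass * np.exp(spread * rng.normal())   # log-normal around the expectation

# Under this prior, a 1000 cm^3 object is centered around ~500 g rather than
# being equally likely to weigh 5 g or 5 kg.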

Further, as discussed in Chapter 1, infants acquire basic concepts of physics at an early age. If we observe these priors in adults, when do young children develop similar concepts, and what kinds of priors do they have in mind? A thorough understanding of these questions could inspire research in both developmental psychology and artificial intelligence.

Chapter 6

Conclusion

In this thesis, we discussed the task of learning physical object properties. We studied several scenarios with which humans are familiar and in which they can learn to infer the relevant physical object properties even at a young age. We proposed a novel dataset, Physics 101, which contains over 17,000 videos from four viewpoints of 101 objects in four scenarios. We further proposed two novel models for learning physical properties of objects by incorporating physics simulators with deep neural nets, and conducted extensive evaluations.

The main contribution of this thesis is showing that a generative vision system with physical object representations and a realistic 3D physics engine or a symbolic physics interpreter at its core can efficiently deal with real-world data when proper recognition models and feature spaces are used. Our behavioral study also points towards the possibility of an account of human vision with generative physical knowledge at its core, and various recognition models as helpers that enable efficient inference. We hope this thesis can inspire future study on learning physical and other commonsense knowledge from visual data.

Bibliography

[1] Renée Baillargeon. Infants' physical world. Current Directions in Psychological Science, 13(3):89-94, 2004.

[2] Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understanding. PNAS, 110(45):18327-18332, 2013.

[3] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the materials in context database. In CVPR, 2015.

[4] Katherine L Bouman, Bei Xiao, Peter Battaglia, and William T Freeman. Estimating the material properties of fabric from video. In ICCV, 2013.

[5] Susan Carey. The origin of concepts. Oxford University Press, 2009.

[6] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[7] Erwin Coumans. Bullet physics engine. Open Source Software: http://bulletphysics.org, 2010.

[8] Abe Davis, Katherine L Bouman, Justin G Chen, Michael Rubinstein, Frédo Durand, and William T Freeman. Visual vibrometry: Estimating material properties from small motions in video. In CVPR, 2015.

[9] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889-904, 1995.

[10] Zhaoyin Jia, Andy Gallagher, Ashutosh Saxena, and Tsuhan Chen. 3D reasoning from blocks to stability. IEEE TPAMI, 2014.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov 1998.

[13] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[14] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.

[15] Adam N Sanborn, Vikash K Mansinghka, and Thomas L Griffiths. Reconciling intuitive physics and Newtonian mechanics for colliding objects. Psychological Review, 120(2):411, 2013.

[16] John Schulman, Alex Lee, Jonathan Ho, and Pieter Abbeel. Tracking deformable objects with point clouds. In ICRA, 2013.

[17] Carlo Tomasi and Takeo Kanade. Detection and tracking of point features. IJCV, 1991.

[18] Tomer Ullman, Andreas Stuhlmüller, Noah Goodman, and Josh Tenenbaum. Learning physics from dynamical scenes. In CogSci, 2014.

[19] Manik Varma and Andrew Zisserman. A statistical approach to material classification using image patch exemplars. IEEE TPAMI, 31(11):2032-2047, 2009.

[20] Jacob Walker, Abhinav Gupta, and Martial Hebert. Patch to the future: Unsupervised visual prediction. In CVPR, 2014.

[21] Jiajun Wu, Ilker Yildirim, Joseph J. Lim, William T. Freeman, and Joshua B. Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In NIPS, 2015.

[22] Ilker Yildirim, Tejas D Kulkarni, Winrich A Freiwald, and Joshua B Tenenbaum. Efficient analysis-by-synthesis in vision: A computational framework, behavioral tests, and modeling neuronal representations. In CogSci, 2015.

[23] Bo Zheng, Yibiao Zhao, Joey C Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Detecting potential falling objects by inferring human action and natural disturbance. In ICRA, 2014.

[24] Yixin Zhu, Yibiao Zhao, and Song-Chun Zhu. Understanding tools: Task-oriented object modeling, learning and recognition. In CVPR, 2015.