Computational Perception of Physical Object Properties

by Jiajun Wu

B.Eng., B.Ec., Tsinghua University (2014)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2016

@ Massachusetts Institute of Technology 2016. All rights reserved.

Author: Signature redacted
Department of Electrical Engineering and Computer Science
January 29, 2016

Certified by: Signature redacted
William T. Freeman
Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science
Thesis Supervisor

Certified by: Signature redacted
Joshua B. Tenenbaum
Professor of Computational Cognitive Science
Thesis Supervisor

Accepted by: Signature redacted
Leslie A. Kolodziejski
Chair, Department Committee on Graduate Students

Computational Perception of Physical Object Properties

by Jiajun Wu

Submitted to the Department of Electrical Engineering and Computer Science on January 29, 2016, in partial fulfillment of the requirements for the degree of Master of Science

Abstract

We study the problem of learning physical object properties from visual data. Inspired by findings in cognitive science that even infants are able to perceive a physical world full of dynamic content at an early age, we aim to build models to characterize object properties from synthetic and real-world scenes. We build a novel dataset containing over 17,000 videos with 101 objects in a set of visually simple but physically rich scenarios. We further propose two novel models for learning physical object properties by incorporating physics simulators, either a symbolic interpreter or a mature physics engine, with deep neural nets. Our extensive evaluations demonstrate that these models can learn physical object properties well and that, with a physics engine, the responses of the model positively correlate with human responses. Future research directions include incorporating the knowledge of physical object properties into the understanding of interactions among objects, scenes, and agents.

Thesis Supervisor: William T. Freeman Title: Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science

Thesis Supervisor: Joshua B. Tenenbaum Title: Professor of Computational Cognitive Science

Acknowledgments

I would like to express sincere gratitude to my advisors, Professor William Freeman

and Professor Joshua Tenenbaum. Bill and Josh are always inspiring and encouraging, and have led me through my research with profound insights. Not only have they

taught me how to aim for top-quality research, but they have also shared with me invaluable lessons about life.

I deeply appreciate the guidance and support from my undergraduate advisor, Professor Zhuowen Tu, who introduced me to the world of AI and vision and has been my mentor and friend ever since. I also thank Professor Andrew Chi-

Chih Yao and Professor Jian Li for advising me during my undergraduate study, Dr.

Yuandong Tian for mentoring me at Facebook AI Research, and Dr. Kai Yu and Dr. Yinan Yu for mentoring me at Baidu Research.

The thesis would not have been possible without the inspiration and support from my colleagues in the MIT Vision Group and Computational Cognitive Science (CoCoSci) Group. I would like to express my appreciation to my collaborators, Dr. Joseph Lim, Tianfan Xue, and Dr. Ilker Yildirim. I am also thankful to other encouraging and helpful group members, especially Andrew Owens, Donglai Wei, Dr. Tomer Ullman, Katie Bouman, Kelsey Allen, Tejas Kulkarni, Dr. Dilip Krishnan, Dr. Hossein

Mohabi, Dr. Tali Dekel, Dr. Daniel Zoran, Pedro Tsividis, and Hongyi Zhang.

I would like to extend my appreciation to my dear friends for their backing in my academic and daily life.

I received the Edwin S. Webster Fellowship during my first year, and have been partially funded by NSF-6926677 (Reconstructive Recognition). I appreciate the support from all funding agencies.

Finally, I thank my parents, for their lifelong encouragement and love.

Contents

1 Introduction

2 Modeling the Physical World
  2.1 Scenarios
  2.2 The Physics 101 Dataset

3 Physical Object Model: Learning with a Symbolic Interpreter
  3.1 Visual Property Discoverer
  3.2 Physics Interpreter
  3.3 Physical World Simulator
  3.4 Experiments
    3.4.1 Learning Physical Properties
    3.4.2 Detecting Objects with Unusual Properties
    3.4.3 Predicting Outcomes

4 Physical Object Model: Incorporating a Physics Engine
  4.1 The Galileo Model
    4.1.1 Tracking as Recognition
    4.1.2 Inference
  4.2 Simulations
  4.3 Bootstrapping as Efficient Perception in Static Scenes
  4.4 Experiments
    4.4.1 Outcome Prediction
    4.4.2 Mass Prediction
    4.4.3 "Will it move" Prediction

5 Beyond Understanding Physics

6 Conclusion

List of Figures

2-1 Abstraction of the physical world, and a snapshot of our dataset.

2-2 Illustrations of the scenarios in our Physics 101 dataset.

2-3 Physics 101: the set of objects we used in our experiments. We vary object material, color, shape, and size, together with external conditions such as the slope of a surface or the stiffness of a spring. Videos recording the motions of these objects interacting with target objects will be used to train our algorithm.

3-1 Our first model exploits the advancement of machine learning algorithms (convolutional neural networks); we supervise all levels by a physics interpreter. This interpreter provides the physical constraints on what values each layer can take. During training and testing, our model has no labels of physical properties, in contrast to the standard approaches.

3-2 Charts for the estimations of rings. The physical properties, especially density, of the first ring are different from those of the other rings. The difference is hard to perceive from visual appearance alone; however, by observing videos with object interactions, our algorithm is able to learn the properties and find the outlier. All figures are on a log-normalized scale.

3-3 Heat maps of user predictions, model outputs (in orange), and ground truths (in white). Objects from top to bottom, left to right: dough, metal coin, metal pole, plastic block, plastic doll, and porcelain.

4-1 Our second model formalizes a hypothesis space of physical object representations, where each object is defined by its mass, friction coefficient, 3D shape, and a positional offset w.r.t. an origin. To model videos, we draw objects from that hypothesis space into the physics engine. The simulations from the physics engine are compared to observations in the velocity space.

4-2 Simulation results. Each row represents one video in the data: (a) the first frame of the video, (b) the last frame of the video, (c) the first frame of the simulated scene generated by Bullet, (d) the last frame of the simulated scene, (e) the estimated object with larger mass, (f) the estimated object with larger friction coefficient.

4-3 Mean squared errors of oracle estimation, our estimation, and uniform estimations of mass on a log-normalized scale, and the correlations between estimations and ground truths.

4-4 The log-likelihood traces of several chains with and without recognition-model (LeNet) based initializations.

4-5 Mean errors in numbers of pixels of human predictions, Galileo outputs, and a uniform estimate calculated by averaging ground truth ending points over all test cases. As the error patterns are similar for both target objects (foam and cardboard), the errors here are averaged across target objects for each material.

4-6 Heat maps of user predictions, Galileo outputs (orange crosses), and ground truths (white crosses).

4-7 Average accuracy of human predictions and Galileo outputs on the tasks of mass prediction and "will it move" prediction. Error bars indicate standard deviations of human accuracies.

List of Tables

3.1 Accuracies (%, for oracle) or clustering purities (%, for joint training) on material estimation. In the joint training case, as there is no supervision on the material layer, it is not necessary for the network to specifically map the responses in that layer to material labels, and we do not expect the numbers to be comparable with the oracle case. Our analysis is just to show that even in this case the network implicitly grasps some knowledge of object materials.

3.2 Correlation coefficients of our estimations and ground truth for mass, density, and volume.

3.3 Mean squared errors in pixels of human predictions (H), model outputs (M), or the uniform estimate minimizing the mean squared error (U).

3.4 Correlation coefficients on the tasks of predicting the moving distance and the bounce height, and accuracies on predicting whether an object floats.

4.1 Correlations between pairs of outputs in the mass prediction experiment (in Spearman's coefficient) and in the "will it move" prediction experiment (in Pearson's coefficient).

Chapter 1

Introduction

Our visual system is designed to perceive a physical world that is full of dynamic content. Consider yourself watching a Rube Goldberg machine unfold: as the kinetic energy moves through the machine, you may see objects sliding down ramps, colliding with each other, rolling, entering other objects, falling - many kinds of physical interactions between objects of different masses, materials, and other physical properties. How does our visual system recover so much content from the dynamic physical world? What is the role of experience in interpreting a novel dynamical scene?

Further, there is evidence that babies form a visual understanding of basic physical concepts, as a basic component of common sense knowledge, at a very young age; they learn properties of objects from their motions [1]. As young as 2.5 to 5.5 months old, infants learn basic physics even before they acquire advanced high-level knowledge such as semantic categories of objects [5, 1]. Both infants and adults also use their physics knowledge to learn and discover latent labels of object properties, as well as to predict the physical behavior of objects [2]. These facts suggest the importance of physical understanding to a visual system, and motivate our goal of building a machine with such visual competency.

Recent behavioral and computational studies of human physical scene understanding push forward an account that people's judgments are best explained as probabilistic simulations of a realistic, but mental, physics engine [2, 15]. Specifically, these studies suggest that the brain carries detailed but noisy knowledge of the physical

attributes of objects and the laws of physical interactions between objects (i.e., Newtonian mechanics). To understand a physical scene, and more crucially, to predict the future dynamical evolution of a scene, the brain relies on simulations from this mental physics engine.

Even though the probabilistic simulation account is very appealing, there are missing practical and conceptual leaps. First, as a practical matter, the probabilistic simulation approach has been shown to work only with synthetically generated stimuli in 2D or 3D block worlds. The joint inference of the mass and the coefficient of friction is also not handled [2]. Second, as a conceptual matter, previous research rarely clarifies how a mental physics engine could take advantage of the agent's previous experience [18]. Humans have lifelong experience with dynamical scenes, and a fuller account of human physical scene understanding should address it.

We aim to build on the idea that humans utilize a realistic physics engine as part of a generative model to interpret real-world physical scenes. Given a video as observation to the model, physical scene understanding corresponds to inverting the generative model by probabilistic inference to recover the underlying physical object properties in the scene. Our formulation combines deep learning, which serves as a powerful low-level visual recognition system, with a physics simulator to estimate physical properties directly from unlabeled videos. We study two possible forms of a physics simulator: the first is a symbolic physics interpreter encoded as layers in deep learning; the second is a mature physics engine. Compared to recent studies in vision and robotics on predicting physical interactions for 3D reasoning [10, 23] and tracking [16], our goal is to infer physical object properties directly, and we incorporate a generative physics simulator with a powerful discriminative recognition model, which distinguishes our framework from previous methods introduced in the vision and robotics community for predicting physical interactions or properties of objects for various purposes [14, 20, 10, 23, 19, 3, 4, 8, 24].

We also construct a video dataset for evaluating machine and human performance on real-world data. We collected a dataset of 101 objects made of different materials and with a variety of masses and volumes. We started by collecting videos of these

objects from multiple viewpoints in four different scenarios: objects slide down an

inclined surface and possibly collide with another object; objects fall onto surfaces

made of different materials; objects splash in water; and objects hang on a spring.

These seemingly straightforward setups require understanding multiple physical properties, e.g., material, mass, volume, density, coefficient of friction, and coefficient of

restitution, as discussed later. We call this dataset Physics 101, highlighting that we are learning elementary physics, while also indicating the current object count. Our dataset contains not only over 12,000 RGB videos, but also more than 4,000 depth videos and audio recordings, which could benefit our future study on learning from multi-modal data.

Based on the estimates we derived from visual input with a physics simulator, a natural extension is to generate or synthesize training data for any automatic learning system by bootstrapping from the videos already collected and labeling them with the model's estimates. This is a self-supervised learning algorithm for inferring generic physical properties, and relates to the wake/sleep phases in Helmholtz machines [9], and to the cognitive development of infants. Extensive studies suggest that infants either are born with or quickly learn physical knowledge about objects when they are very young, even before they acquire more advanced high-level knowledge like semantic categories of objects [5, 1]. Young babies are sensitive to the physics of objects mainly through the motion of foreground objects relative to the background [1]; in other words, they learn by watching videos of moving objects. But later in life, and clearly in adulthood, we can perceive physical attributes in static scenes without any motion.

Here, building upon the idea of Helmholtz machines [9], our approach suggests one potential computational path to the development of the ability to perceive physical content in static scenes. Following recent work [22], we train a recognition model

(i.e., sleep cycle) that is in the form of a deep convolutional network, where the training data is generated in a self-supervised manner by the generative model itself

(i.e., wake cycle: real-world videos observed by our model and the resulting physical inferences). Interestingly, this computational solution asserts that the infant starts

with a relatively reliable mental physics engine, or acquires it soon after birth.

Our research admits various generalizations and applications. With physical object properties, we may build intelligent systems for high-level scene understanding, including the study of physics-related concepts like object stability in the scene, and we may incorporate agents that interact with the physical world for particular goals.

Our study is inspired by findings in developmental psychology, but can also lead to interesting and fundamental research questions there, for instance, whether there exist connections between the learning processes of infants and machines on physical concepts.

Chapter 2

Modeling the Physical World

There exist highly involved physical processes in daily events in our physical world, even in simple scenarios like objects sliding down an inclined surface. As shown in

Figure 2-1a, we can divide all involved physical properties into two groups: the first is the intrinsic physical properties of objects like volume, material, and mass, many of which we cannot directly measure from the visual input; the second is the descriptive physical properties which characterize the scenario in the video, including but not limited to the velocity of objects, the distances that objects travel, or whether objects float when thrown into water. The second group of properties is observable and determined by the first group, while both groups determine the content of the videos.

Our goal is to build an architecture that can automatically discover those observable descriptive physical properties from unlabeled videos, and use them as supervision to further learn and infer unobservable latent physical properties. Our generative model can then apply learned knowledge of physical object properties for other tasks like predicting outcomes in the future.

The computer vision community has made much progress through its datasets, and there are datasets of objects, attributes, materials, and scene categories. Here, we introduce a new type of dataset, Physics 101, capturing physical interactions of objects. The dataset consists of four different scenarios, for each of which plenty of intriguing questions may be asked. For example, in the ramp scenario, will the object on the ramp move, and if it does and two objects collide, which of them will move next, and how far?

Figure 2-1: Abstraction of the physical world, and a snapshot of our dataset. (a) Abstraction of physical properties and how they determine the content of a video. (b) Our scenario and a snapshot of our dataset, Physics 101, showing various objects at different stages. Our data are taken by four sensors (3 RGB and 1 depth).

2.1 Scenarios

We seek to learn physical properties of objects by observing videos. To this end, we build a dataset by recording videos of moving objects. We pick an introductory setup with four different scenarios, which are illustrated in Figures 2-1b and 2-2. We then introduce each scenario in detail.

Ramp We put an object on an inclined surface, and the object may either slide down or keep static, due to gravity and friction. This seemingly straightforward scenario already involves understanding many physical object properties including material, coefficient of friction, mass, and velocity. Figure 2-2a analyzes the physics behind our setup.

At first, there are three external forces on the object: a gravitational force G, a normal force N from the surface, and a friction force R. If the friction force R is strong enough, the object will not move. Otherwise, the object will start to slide. After it reaches the ground, these forces still exist, but now the object will slow

down due to the friction force R. If object A slides all the way to B, then A will hit B and both of them will move. How far A and B move depends on their friction coefficients, masses, and the velocity of A at the moment of collision.

Figure 2-2: Illustrations of the scenarios in our Physics 101 dataset. (a) The ramp scenario (I. initial setup, II. before collision, III. at collision, IV. after collision, V. final result): several physical properties determine whether object A will move, whether it will reach object B, and how far each object will move; here N, R, and G indicate a normal force, a friction force, and a gravitational force, respectively. (b) The spring scenario (I. initial setup, II. after extension). (c) The liquid scenario (I. a floating object, II. a sunk object). (d) The fall scenario (I. initial setup, II. at collision, III. bounce).

In this scenario, the observable descriptive physical properties are the velocities of the objects, and the distances both objects traveled. The latent properties directly involved are coefficient of friction and mass.

Spring We hang objects on a spring, and the gravity acting on the object stretches the spring, as shown in Figure 2-2b. Here the observable descriptive physical property is the length by which the spring is stretched, and the latent properties are the mass of the object and the elasticity of the spring.

Fall We drop objects in the air, and they freely fall onto various surfaces. Figure 2-2d illustrates this scenario. Here the observable descriptive physical properties are

the bounce heights of the object, and the latent properties are the coefficients of restitution of the object and the surface.

Figure 2-3: Physics 101: this is the set of objects we used in our experiments. We vary object material, color, shape, and size, together with external conditions such as the slope of a surface or the stiffness of a spring. Videos recording the motions of these objects interacting with target objects will be used to train our algorithm.

Liquid As shown in Figure 2-2c, we drop objects into some liquid, and they may float or sink at various speeds. In this scenario, the observable descriptive physical property is the velocity of the sinking object (0 if it floats), and the latent properties are the densities of the object and the liquid.

2.2 The Physics 101 Dataset

The outcomes of various physical events depend on multiple factors of objects, such as materials (density and friction coefficient), sizes and shapes (volume), and slopes of ramps (gravity), elasticities of springs, etc. We collect our dataset while varying all these conditions. Figure 2-3 shows the entire collection of our 101 objects, and the following are more details about our variations:

20 Material Our 101 objects are made of 15 different materials - cardboard, dough, foam, hollow rubber, hollow wood, metal coin, metal pole, plastic block, plastic doll, plastic ring, plastic toy, porcelain, rubber, wooden block, and wooden pole.

Appearance For each material, we have 4 ~ 12 objects of different sizes, shapes, and colors.

Slope (ramp) We also vary the angle α between the inclined surface and the ground (to vary the gravity force). We set α = 10° and 20° for each object.

Target (ramp) We have two different target objects - a cardboard and a foam box. They are made of different materials, thus having different friction coefficients and densities.

Spring We use two springs with different stiffness.

Surface (fall) We drop objects onto five different surfaces: foam, glass, metal, a wooden table, and a woolen rug. These materials have different coefficients of restitution.

We also measure the physical properties of these objects. We record the mass and volume of each object, which also determine density. Please refer to the supplementary material for the statistics of all these measured properties.

For each setup, we record 3 ~ 10 trials. We record multiple trials because some external factors, e.g., the orientations of objects and the roughness of surfaces, may lead to different outcomes. Having more than one trial per condition increases the diversity of our dataset by making it cover more possible outcomes.

Finally, we record each trial from three different viewpoints: one side view, one top-down view, and one upper-top view. For the first two views, we take data with DSLR cameras, and for the upper-top view, we use a Kinect V2 to record both RGB and depth maps. After removing trials with significant noise, we have 4,352 trials in total. Given that we capture three RGB recordings and one depth recording per trial, there are

17,408 video clips altogether. These video clips constitute the Physics 101 dataset.

Chapter 3

Physical Object Model: Learning with a Symbolic Interpreter

We aim to discover physical object properties under a unified system with minimal supervision, rather than training a separate classifier/regressor for each label (such as material and volume) in a fully supervised manner. With this philosophy, we develop two physical object models; one uses deep learning and a symbolic physics interpreter for recognizing physical properties, and the other incorporates a mature physics engine and predicts physical properties via an analysis-by-synthesis approach. Both methods have built-in knowledge of physics, and work in an unsupervised setting. With these generative models, we are able to not only discover all physical properties (e.g., material, volume) simply by observing motions of objects in unlabeled videos, but also predict different physical interactions (e.g., how far an object will move, if it moves at all) based on inferred physical properties.

In this chapter we describe our first model, shown in Figure 3-1. Our method is based on a convolutional neural network (CNN) [11], which consists of three components. The bottom component is a visual property discoverer, which aims to discover physical properties like material or volume which could at least partially be observed from visual input; the middle component is a physics interpreter, which explicitly encodes physical laws into the network structure and models latent physical properties like density and mass; the top component is a physical world simulator, which

characterizes descriptive physical properties like the distances that objects traveled, all of which we may directly observe from videos.

Figure 3-1: Our first model exploits the advancement of machine learning algorithms (convolutional neural networks); we supervise all levels by a physics interpreter. This interpreter provides the physical constraints on what values each layer can take. During training and testing, our model has no labels of physical properties, in contrast to the standard approaches.

Our network corresponds to our physical world model introduced in Chapter 2. We would like to emphasize here that our model learns object properties from completely unlabeled data. We do not provide any labels for physical properties like material, velocity, or volume; instead, our model automatically discovers observations from videos, uses them as supervision to the top physical world simulator, which in turn advises what the physics interpreter should discover.

3.1 Visual Property Discoverer

The bottom meta-layer of our architecture in Figure 3-1 is designed to discover and predict low-level properties of objects, including material and volume, which can at least partially be perceived from the visual input. These properties are the basic ingredients for predicting any derived physical properties at upper layers, e.g., density and mass.

In order to interpret any physical interaction of objects, we need to be able to

24 first locate objects inside videos. We use a KLT point tracker [17] to track moving

objects. We also compute a general background model for each scenario to locate

foreground objects. Image patches of objects are then supplied to our visual property discoverer.
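For illustration, a tracking step of this kind could be sketched with OpenCV's pyramidal Lucas-Kanade (KLT-style) tracker; the helper name, parameter values, and the choice of OpenCV are assumptions of this sketch, not the thesis's actual implementation.

```python
# Illustrative sketch only: KLT-style tracking with OpenCV's pyramidal
# Lucas-Kanade optical flow. Parameter values are assumptions.
import cv2
import numpy as np

def track_object_centers(video_path):
    """Return the mean location of tracked feature points in each frame."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)
    centers = []
    while pts is not None:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        good = nxt[status.flatten() == 1]
        if len(good) == 0:
            break
        # The mean of the tracked points serves as the object center; a patch
        # around it would be cropped and fed to the visual property discoverer.
        centers.append(good.reshape(-1, 2).mean(axis=0))
        prev_gray, pts = gray, good.reshape(-1, 1, 2)
    cap.release()
    return centers
```

Differences between consecutive centers would then give velocity observations of the kind used by the physical world simulator described later.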

Material and volume Material and volume are properties that can be estimated

directly from image patches. Hence, we have LeNet [12] on top of image patches extracted by the tracker. Once again, rather than directly supervising each LeNet with their labels, we supervise them by automatically discovered observations which

are provided to our physical world simulator. To be precise, we do not have any individual loss layer for LeNet components. Note that inferring volumes of objects

from static images is an ambiguous problem. However, this problem is alleviated by our data from different viewpoints and both RGB and depth maps.

3.2 Physics Interpreter

The second meta-layer of our model is designed to encode physical laws. For instance, if we assume an object is homogeneous, then its density is determined by its material, and its mass is the product of its density and its volume.

Based on material and volume, we expand a number of physical properties in this physics interpreter, which will later be used to connect to real world observations.

The following shows how we represent each physical property as a layer, as depicted in Figure 3-1:

Material An Nm-dimensional vector, where Nm is the number of different materials.

The value of each dimension represents the confidence that the object is made of the corresponding material. This is an output of our visual property discoverer.

Volume A scalar representing the predicted volume of the object. This is an output of our visual property discoverer.

Coefficient of friction and density Each is a scalar representing the predicted physical property based on the output of the material layer. Each output is the inner

product of Nm learned parameters and the responses from the material layer.

Coefficient of restitution An Nm-dimensional vector representing how much of the kinetic energy remains after a collision between the input object and other objects of various materials. The representation is a vector, not a scalar, as the coefficient of restitution is determined by the materials of both objects involved in the collision.

Mass A scalar representing the predicted mass based on the outputs of the density layer and the volume layer. This layer is the product of the density and volume layers.
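The relations above (density and friction as inner products with the material confidences, and mass as the product of density and volume) could be expressed, for instance, as a small differentiable module. The PyTorch re-expression below and all names in it are illustrative assumptions, since the thesis used Torch7.

```python
# Illustrative PyTorch re-expression of the physics interpreter layers.
import torch
import torch.nn as nn

class PhysicsInterpreter(nn.Module):
    def __init__(self, num_materials):
        super().__init__()
        # One learned scalar per material for friction coefficient and density.
        self.friction_per_material = nn.Parameter(torch.rand(num_materials))
        self.density_per_material = nn.Parameter(torch.rand(num_materials))

    def forward(self, material_conf, volume):
        # material_conf: (batch, num_materials) confidences from the discoverer
        # volume: (batch,) volume estimates from the discoverer
        friction = material_conf @ self.friction_per_material  # inner product
        density = material_conf @ self.density_per_material    # inner product
        mass = density * volume                                 # mass = density * volume
        return friction, density, mass
```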

3.3 Physical World Simulator

Our physical world simulator connects the inferred physical properties to real-world observations. We have different observations for different scenarios: we use the velocities of objects and the distances objects traveled as observations for the ramp scenario, the length by which the spring is stretched as the observation for the spring scenario, the bounce height as the observation for the fall scenario, and the velocity at which the object sinks as the observation for the liquid scenario. All observations can be derived from the output of our tracker.

To connect those observations to physical properties our model inferred, we employ physical laws. The physical laws we used in our model include

Newton's law F = mg sin θ − μmg cos θ = ma, or (sin θ − μ cos θ)g = a, where θ is the angle between the inclined surface and the ground, μ is the coefficient of friction, and a is the acceleration of the object (observation). This is used for the ramp scenario.

Conservation of momentum and energy CR = (vb − va)/(ua − ub), where vi is the velocity of object i after the collision, and ui is its velocity before the collision. All ui and vi are observations, and this is also used for the ramp scenario.

Hooke's law F = kX, where X is the distance by which the spring is extended (our observation), k is the stiffness of the spring, and F = G = mg is the gravity on the object. This is used for the spring scenario.

Bounce CR = √(h/H), where CR is the coefficient of restitution, h is the bounce height (observation), and H is the drop height. This can be viewed as another expression of the conservation of energy and momentum, and is used for the fall scenario.

Buoyancy dVg − dwVg = ma = dVa, or (d − dw)g = da, where d is the density of the object, dw is the density of water (a constant), and a is the acceleration of the object in water (observation). Note that for d < dw, a = 0. This is used for the liquid scenario.

We use the MSE between our model's estimate and the target value supplied by the physical world simulator as our loss during training.
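As a concrete illustration of how one of these laws can supervise the network, the sketch below turns the ramp-scenario form of Newton's law into an MSE loss between the implied acceleration and the acceleration observed by the tracker; the function name and g = 9.8 are assumptions of this sketch.

```python
# Illustrative sketch: (sin(theta) - mu*cos(theta)) * g = a as an MSE loss.
import torch

def ramp_law_loss(mu_pred, theta, a_observed, g=9.8):
    """MSE between the acceleration implied by Newton's law and the observation."""
    theta = torch.as_tensor(theta, dtype=torch.float32)  # slope angle in radians
    a_pred = (torch.sin(theta) - mu_pred * torch.cos(theta)) * g
    return torch.mean((a_pred - a_observed) ** 2)
```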

3.4 Experiments

In this section, we present experiments with our models in various settings. We start with extensive verifications of our models on learning physical properties. Later, we investigate the generalization ability of our model on other tasks like detecting objects with unusual properties, predicting outcomes given partial information, and transferring knowledge across different scenarios. We use Torch7 [6] for all experiments. For learning physical properties from Physics 101, we study our algorithm in the following settings:

" Split by frame: for each trial of each object, we use 95% of the patches we get from tracking as training data, while the other 5% of the patches as test data.

" Split by trial: for each trial of each object, we use all patches in 95% of the trials we have as training data, while patches in the other 5% of the trials as test data.

" Split by object: we randomly choose 95% of the objects, and use their patches as training data and the others as test data.

Among these three settings, split by frame is the easiest as for each patch in test data, the algorithm may find some very similar patch in the training data. Split by

27 object is the most difficult setting as it requires the model to generalize to objects that it has never seen before.

We consider training our model in different ways:

" Oracle training: we train our model with images of objects and their associated ground truth labels. We apply oracle training on those properties we have

ground truths labels of (material, mass, density, and volume).

" Standalone training: we train our model on data from one scenario. Automat-

ically extracted observations serve as supervision.

• Joint training: we jointly train the entire network on all training data without any labels of physical properties. Our only supervision is the physical laws encoded in the top physical world simulator. Data from different scenarios supervise different layers in the network.

Oracle training is designed to test the ability of each component and can be viewed as an upper bound of the performance the model may achieve. Our focus is on standalone and joint training, where our model learns from unlabeled videos directly.

We are also interested in understanding how our model can perform at inferring some physical properties purely from depth maps. Therefore, besides using RGB data, we conduct some experiments where training and test data are depth maps only.

3.4.1 Learning Physical Properties

Material perception: We start with the task of material classification. Table 3.1 shows the accuracy of the oracle models on material classification. We observe that they achieve nearly perfect results in the easiest case, and are still significantly better than chance in the most difficult split-by-object setting. Both depth maps

and RGB maps give good performance on this task with oracle training.

Methods          Frame   Trial   Object
Depth (Oracle)   92.6    62.5    35.7
RGB (Oracle)     99.9    77.4    52.2
RGB (ramp)       26.9    24.7    19.7
RGB (spring)     29.9    22.4    14.3
RGB (fall)       29.4    25.0    17.0
RGB (liquid)     22.2    15.4    12.6
RGB (joint)      35.5    28.7    25.7
Depth (joint)    38.3    26.9    22.4
Uniform           6.67    6.67    6.67

Table 3.1: Accuracies (%, for oracle) or clustering purities (%, for joint training) on material estimation. In the joint training case, as there is no supervision on the material layer, it is not necessary for the network to specifically map the responses in that layer to material labels, and we do not expect the numbers to be comparable with the oracle case. Our analysis is just to show that even in this case the network implicitly grasps some knowledge of object materials.

In the standalone and joint training case, given we have no labels on materials, it is not possible for the model to classify materials; instead, we expect it to cluster objects by their materials. To measure this, we perform K-means on the responses of the material layer of test data, and use purity, a common measure for clustering, to

measure if our model indeed discovers clusters of materials automatically. As shown in Table 3.1, the clustering results indicate that the system learns the material of objects to a certain extent.
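A minimal sketch of this evaluation, assuming scikit-learn's K-means and a purity helper written for this example (neither is confirmed by the thesis), is given below.

```python
# Illustrative sketch of the clustering-purity evaluation of the material layer.
import numpy as np
from sklearn.cluster import KMeans

def clustering_purity(responses, material_labels, num_clusters):
    """responses: (N, D) material-layer activations; material_labels: (N,) ints."""
    assignments = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(responses)
    correct = 0
    for c in range(num_clusters):
        members = material_labels[assignments == c]
        if len(members) > 0:
            correct += np.bincount(members).max()  # size of the majority material
    return correct / len(material_labels)
```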

Physical parameter estimation: We then test our systems, trained with or without oracles, on the task of physical property estimation. We use the Pearson product-moment correlation coefficient as our measure. Table 3.2 shows the results on estimating

mass, density, and volume. Notice that here we evaluate the outputs on a log scale to avoid unbalanced emphases on objects with large volumes or masses.

We observe that with an oracle our model can learn all physical parameters well. For standalone and joint learning, our model is also consistently better than a nontrivial baseline, which selects the optimal uniform estimate that minimizes the mean squared error.

                   Mass                    Density                 Volume
Methods          Frame  Trial  Object    Frame  Trial  Object    Frame  Trial  Object
RGB (Oracle)     0.79   0.72   0.67      0.83   0.74   0.65      0.77   0.67   0.61
Depth (Oracle)   0.79   0.72   0.67      0.83   0.74   0.65      0.77   0.67   0.61
RGB (spring)     0.40   0.35   0.20      N/A    N/A    N/A       N/A    N/A    N/A
RGB (liquid)     N/A    N/A    N/A       0.33   0.27   0.30      N/A    N/A    N/A
RGB (joint)      0.58   0.42   0.38      0.38   0.39   0.39      0.40   0.37   0.30
Depth (joint)    0.43   0.32   0.25      0.49   0.37   0.17      0.30   0.20   0.22
Uniform          0      0      0         0      0      0         0      0      0

Table 3.2: Correlation coefficients of our estimations and ground truth for mass, density, and volume.


Figure 3-2: Charts for the estimations of rings. The physical properties, especially density, of the first ring are different from those of the other rings. The difference is hard to perceive from visual appearance alone; however, by observing videos with object interactions, our algorithm is able to learn the properties and find the outlier. All figures are on a log-normalized scale.

3.4.2 Detecting Objects with Unusual Properties

Sometimes objects with similar appearances may have distinct physical properties.

In this section, we test whether our system is able to find these expectation-violation cases.

In Physics 101, among the five plastic rings, the bottom part of the smallest ring is made of a different material with a larger density, which makes its mass greater than those of the other four even though its volume is smaller. The material of the smallest ring also has a lower friction coefficient, so the velocity of the smallest ring at collision is higher than those of the others.

In Figure 3-2, we show the estimations of our RGB joint model on the properties of all five rings, as well as their appearances. As shown, it is hard to perceive the difference between the physical properties of the first ring and those of the others

purely from visual appearances. By observing videos where they slide down and hit other objects, our system can learn the physical parameters and model the outliers.

Table 3.3: Mean squared errors in pixels of human predictions (H), model outputs (M), or the uniform estimate minimizing the mean squared error (U).

                   Foam                   Cardboard
Material          H      M      U        H      M      U
cardboard         28.8   40.7   97.0     15.0   77.2   84.0
dough             27.4   25.2   84.4     150.9  105.1  113.4
hollow wood       35.7   19.4   108.9    81.0   35.0   21.4
metal coin        13.4   32.2   149.8    31.9   33.3   75.8
metal pole        272.9  257.6  280.0    91.4   188.7  184.0
plastic block     29.8   82.1   97.6     46.9   57.2   35.0
plastic doll      49.4   23.6   44.0     128.8  41.8   93.9
plastic toy       30.1   41.9   121.2    33.3   9.5    70.6
porcelain         138.5  127.0  110.9    196.0  216.6  314.8
wooden block      45.9   32.8   36.2     47.3   37.5   14.2
wooden pole       78.9   88.0   138.9    58.7   89.8   74.3
Mean              68.2   70.1   115.4    80.1   81.1   98.3

Figure 3-3: Heat maps of user predictions, model outputs (in orange), and ground truths (in white). Objects from top to bottom, left to right: dough, metal coin, metal pole, plastic block, plastic doll, and porcelain.

3.4.3 Predicting Outcomes

We may apply our model to a variety of outcome prediction tasks for different scenarios. We consider three of them: how far an object will move after being hit by another object; how high an object will bounce after being dropped from a certain height; and whether an object will float in water. With estimated physical object properties, our model can answer these questions using physical laws.

Transferring Knowledge Across Multiple Scenarios As some physical knowledge is shared across multiple scenarios, it is natural to evaluate how knowledge learned in one scenario may be applied to a novel one. Here we consider the case where the model is trained on all but the fall scenario. We then apply the model to the fall scenario to predict how high an object bounces. Our intuition is that the coefficients of restitution learned from the ramp scenario can help with this prediction to some extent.

Tasks            Methods          Frame   Trial   Object
Collision Dist   RGB (joint)      0.65    0.42    0.33
Collision Dist   Uniform          0       0       0
Bounce Height    RGB (joint)      0.35    0.31    0.23
Bounce Height    RGB (transfer)   0.22    0.21    0.11
Spring Ext       Uniform          0       0       0
Float            RGB (joint)      0.94    0.87    0.84
Float            Uniform          0.70    0.70    0.70

Table 3.4: Correlation coefficients on the tasks of predicting the moving distance and the bounce height, and accuracies on predicting whether an object floats.

Results Table 3.4 shows outcome prediction results. We can see that our method works well, and can also transfer learned knowledge across multiple scenarios.

Behavior Experiments We would like to see how well our model does compared to humans. To do this, we conducted experiments on predicting the moving distance of an object after collision on Amazon Mechanical Turk. Specifically, among all objects that slide down, we select one object of each material, show AMT workers videos of the object, but only up to the moment of collision. We then ask workers to label where they believe the target object (either cardboard or foam) will be after the collision, i.e., how far the target will move. Before testing, each user is provided four full videos of other objects made of the same material, which contain complete collisions, so that users can infer the physical properties associated with the material and the target object. We tested 30 users per case.

Table 3.3 shows the mean squared errors in pixels of human predictions (H), model predictions (M), or uniform estimate minimizing the mean squared error (U). We can

see that the performance of our model is close to that of humans on this task. Figure 3-3 shows the heat maps of user predictions, model outputs (orange), and ground truths (white).

Chapter 4

Physical Object Model: Incorporating a Physics Engine

4.1 The Galileo Model

Here we describe our second model. Compared to the first one, our second model

(shown in Figure 4-1) incorporates a physics engine in its core, and the gist of our second model can be summarized as probabilistically inverting the physics engine to recover unobserved physical properties of objects. For this model, we focus on the ramp scenario, and in honor of the famous physicist, we name our model Galileo.

The first component of Galileo is the physical object representations, where each object is a rigid body and represented not only by its 3D geometric shape (or volume) and its position in space, but also by its mass and its friction. All of these object attributes are treated as latent variables in the model, and are approximated or estimated on the basis of the visual input.

Specifically, we collectively refer to the unobserved latent variables of an object as its physical representation T. For each object i, Ti consists of its mass mi, friction coefficient ki, 3D shape Vi, and position offset pi w.r.t. an origin in 3D space. We place uniform priors over the mass and the friction coefficient of each object: mi ~ Uniform(0.001, 1) and ki ~ Uniform(0, 1), respectively. For the 3D shape Vi, we have four variables: a shape type ti, and scaling factors for the three dimensions

xi, yi, zi. We simplify the possible shape space in our model by constraining each shape type ti to be one of three with equal probability: a box, a cylinder, or a torus. Note that applying scaling differently on each dimension to these three basic shapes results in a large space of shapes.¹ The scaling factors are chosen to be uniform over a range of values that captures the extent of different shapes in the dataset.

Figure 4-1: Our second model formalizes a hypothesis space of physical object representations, where each object is defined by its mass, friction coefficient, 3D shape, and a positional offset w.r.t. an origin. To model videos, we draw objects from that hypothesis space into the physics engine. The simulations from the physics engine are compared to observations in the velocity space.

Remember that our scenario consists of an object on the ramp and another on the ground. The position offset, pi, for each object is uniform over the set

{0, 1, 2, . . . , 5}. This indicates that for the object on the ramp, its position can be perturbed along the ramp (i.e., in 2D) by at most 5 units upwards or downwards from its starting position, which is 30 units up the ramp from the ground.

The next component of our generative model is a fully-fledged realistic physics

¹For shape type box, xi, yi, and zi could all take different values; for shape type torus, we constrained the scaling factors such that xi = zi; and for shape type cylinder, we constrained the scaling factors such that yi = zi.
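For illustration, drawing a single object hypothesis from the priors described above might look like the following sketch; the scale range, the exact offset set, and the helper name are assumptions.

```python
# Illustrative sketch of sampling one object hypothesis from the priors above.
import random

def sample_object_hypothesis(scale_range=(0.5, 3.0)):
    shape_type = random.choice(["box", "cylinder", "torus"])  # equal probability
    x = random.uniform(*scale_range)
    y = random.uniform(*scale_range)
    z = random.uniform(*scale_range)
    if shape_type == "torus":       # constrain x = z for a torus
        z = x
    elif shape_type == "cylinder":  # constrain y = z for a cylinder
        z = y
    return {
        "mass": random.uniform(0.001, 1.0),    # m ~ Uniform(0.001, 1)
        "friction": random.uniform(0.0, 1.0),  # k ~ Uniform(0, 1)
        "shape": (shape_type, x, y, z),
        "offset": random.choice(range(0, 6)),  # position offset along the ramp
    }
```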

engine, which we denote as p. Specifically, we use the Bullet physics engine [7], following

the earlier related work. The physics engine takes a specification of each of the physical objects in the scene within the basic ramp setting as input, and simulates

it forward in time, generating simulated velocity vectors for each object in the scene,

vs1 and vs2, respectively, among other physical properties such as the position and a rendered image of each simulation step.

In light of initial qualitative analysis, we use velocity vectors as our feature representation in evaluating the hypotheses generated by the model against the data. We employ a standard tracking algorithm (KLT point tracker [17]) to "lift" the visual observations to the velocity space. That is, for each video, we first run the tracking

algorithm, and we obtain velocities by simply using the center locations of each of

the tracked moving objects between frames. This gives us the velocity vectors for the

object on the ramp and the object on the ground, vo1 and vo2, respectively. Note that we could replace the KLT tracker with state-of-the-art tracking for more complicated scenarios.

The third part of Galileo is the likelihood function. We evaluate the observed real-world videos with respect to the model's hypotheses using the velocity vectors

of objects in the scene. Given a pair of observed velocity vectors, vo1 and vo2, the recovery of the physical object representations T1 and T2 for the two objects via physics-based simulation can be formalized as

P(T1, T2 | vo1, vo2, p(·)) ∝ P(vo1, vo2 | vs1, vs2) · P(vs1, vs2 | T1, T2, p(·)) · P(T1, T2),   (4.1)

where we define the likelihood function as P(vo1, vo2 | vs1, vs2) = N(vo | vs, Σ), where vo is the concatenation of vo1 and vo2, and vs is the concatenation of vs1 and vs2. The dimensionality of vo and vs is kept the same for a video by adjusting the number of simulation steps we use to obtain vs according to the length of the video, but from video to video the length of these vectors may vary. In all of our simulations, we fix Σ to 0.05, which is the only free parameter in our model. Experiments show that the value of Σ does not change our results significantly.
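A sketch of this likelihood in velocity space, assuming an isotropic covariance with a single noise scale (whether the fixed 0.05 is a variance or a standard deviation is not specified, so the code simply treats it as a scale parameter), is shown below; the function name is hypothetical.

```python
# Illustrative sketch of a Gaussian log-likelihood in velocity space.
import numpy as np

def velocity_log_likelihood(v_obs, v_sim, sigma=0.05):
    """v_obs, v_sim: concatenated velocity vectors of the two objects."""
    v_obs, v_sim = np.asarray(v_obs, float), np.asarray(v_sim, float)
    d = v_obs.size
    return (-0.5 * np.sum((v_obs - v_sim) ** 2) / sigma ** 2
            - 0.5 * d * np.log(2 * np.pi * sigma ** 2))
```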

37 4.1.1 Tracking as Recognition

The posterior distribution in Equation 4.1 is intractable. In order to alleviate the burden of posterior inference, we use the output of our recognition model to predict and fix some of the latent variables in the model.

Specifically, we determine Vi, or {ti, xi, yi, zi}, using the output of the tracking algorithm, and fix these variables without further sampling them. Furthermore, we also fix the values of the pi's on the basis of the output of the tracking algorithm.

4.1.2 Inference

Once we initialize and fix the latent variables using the tracking algorithm as our recognition model, we perform single-site Metropolis-Hastings updates on the remaining four latent variables, m1, m2, k1, and k2. At each MCMC sweep, we propose a new value for one of these random variables, where the proposal distribution is Uniform(-0.05, 0.05). In order to help with mixing, we also use a broader proposal distribution, Uniform(-0.5, 0.5), every 20 MCMC sweeps.
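A schematic single-site update of this kind is sketched below; it assumes a log_likelihood(params) callable that runs the simulation and scores it against the observed velocities, and it omits the handling of the uniform prior bounds for brevity.

```python
# Schematic single-site Metropolis-Hastings update over (m1, m2, k1, k2).
import math
import random

def mh_sweep(params, log_likelihood, sweep_index):
    """params is a dict with keys 'm1', 'm2', 'k1', 'k2'."""
    name = random.choice(["m1", "m2", "k1", "k2"])
    width = 0.5 if sweep_index % 20 == 0 else 0.05  # broader proposal every 20 sweeps
    proposal = dict(params)
    proposal[name] = params[name] + random.uniform(-width, width)
    # Accept with probability min(1, exp(new_ll - old_ll)).
    log_accept = min(0.0, log_likelihood(proposal) - log_likelihood(params))
    return proposal if random.random() < math.exp(log_accept) else params
```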

4.2 Simulations

For each video, as mentioned earlier, we use the tracking algorithm to initialize and fix the shapes of the objects, S1 and S2, and the position offsets, p1 and p2. We also obtain the velocity vector for each object using the tracking algorithm. We determine the length of the physics engine simulation by the length of the observed video; that is, the simulation runs until it outputs a velocity vector for each object that is as long as the input velocity vector from the tracking algorithm.

We use 150 videos from our Physics 101 dataset, uniformly distributed across different object categories. We perform 16 MCMC simulations for a single video, each of which is 75 MCMC sweeps long. We report the results with the highest log-likelihood score across the 16 chains (i.e., the MAP estimate). In Figure 4-2, we illustrate the results for three individual videos. Every two frames

of the top row show the first and the last frame of a video, and the bottom-row images show the corresponding frames from our model's simulations with the MAP estimate.

Figure 4-2: Simulation results. Each row represents one video in the data: (a) the first frame of the video, (b) the last frame of the video, (c) the first frame of the simulated scene generated by Bullet, (d) the last frame of the simulated scene, (e) the estimated object with larger mass, (f) the estimated object with larger friction coefficient.

We quantify different aspects of our model in the following behavioral experiments, where we compare our model against human subjects' judgments. Furthermore, we use the inferences made by our model here on the 150 videos to train a recognition model to arrive at physical object perception in static scenes with the model.

Importantly, note that our model can generalize across a broad range of tasks beyond the ramp scenario. For example, once we infer the friction coefficient of an object, we can predict whether it will slide down a ramp with a different slope by running a simulation. We test some of these generalizations in Section 4.4.
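For illustration, this particular question can even be answered without a full simulation: with an estimated friction coefficient μ, an object on an incline of angle θ starts to slide when tan θ > μ (equivalently, when (sin θ − μ cos θ)g > 0). The sketch below is an illustrative check, not the model's simulation-based procedure.

```python
# Illustrative check: does an object with friction coefficient mu slide on a
# slope of the given angle?
import math

def will_slide(mu, theta_degrees):
    return math.tan(math.radians(theta_degrees)) > mu

# E.g., an object with inferred mu = 0.25 slides at 20 degrees but not at 10:
# will_slide(0.25, 20) -> True; will_slide(0.25, 10) -> False
```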

4.3 Bootstrapping as Efficient Perception in Static Scenes

Based on the estimates we derived from the visual input with a physics engine, we bootstrap from the videos already collected by labeling them with Galileo's estimates. This is a self-supervised learning algorithm for inferring generic physical properties. As discussed in Chapter 1, this formulation is also related to the

wake/sleep phases in Helmholtz machines, and to the cognitive development of infants.

Here we focus on two physical properties: mass and friction coefficient. To do this, we first estimate these physical properties using the method described in earlier sections. Then, we train LeNet [13], a widely used deep neural network for small-scale datasets, using image patches cropped from videos based on the output of the tracker as data, and estimated physical properties as labels. The trained model can then be used to predict these physical properties of objects based on purely visual cues, even though they might have never appeared in the training set.
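A schematic of this bootstrapping step, re-expressed in PyTorch (the thesis used Torch7), is given below; the patch size, architecture details, and helper names are illustrative assumptions.

```python
# Schematic of training a LeNet-style regressor on Galileo's pseudo-labels.
import torch
import torch.nn as nn

class LeNetRegressor(nn.Module):
    """LeNet-style CNN that regresses mass and friction from a 32x32 RGB patch."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.ReLU(), nn.Linear(120, 2),
        )

    def forward(self, x):  # x: (batch, 3, 32, 32) image patches from the tracker
        return self.head(self.features(x))

def train_step(model, optimizer, patches, galileo_estimates):
    """galileo_estimates: (batch, 2) mass/friction pseudo-labels from the videos."""
    loss = nn.functional.mse_loss(model(patches), galileo_estimates)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```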

We also measure the masses of all objects in the dataset, which makes it possible to quantitatively evaluate the predictions of the deep network. We choose one object per material as our test cases, use all data of those objects as test data, and the rest as training data. We compare our model with a baseline, which always outputs a uniform estimate calculated by averaging the masses of all objects in the test data, and with an oracle algorithm, which is a LeNet trained on the same training data but with access to the ground-truth masses of the training objects as labels. Naturally, the performance of the oracle model can be viewed as an upper bound on what our Galileo system may achieve.

Figure 4-3 compares the performance of Galileo, the oracle algorithm, and the baseline. We observe that Galileo is much better than the baseline, although there is still room for improvement.

Because we trained LeNet using static images to predict physical object properties such as friction and mass ratios, we can use it to recognize those attributes in a quick bottom-up pass at the very first frame of the video. To the extent that the trained

LeNet is accurate, if we initialize the MCMC chains with these bottom-up predictions, we expect to see an overall boost in our log-likelihood traces. We test by running several chains with and without LeNet-based initializations. Results can be seen in

Figure 4-4. Despite the fact that LeNet is not achieving perfect performance by itself, we indeed get a boost in speed and quality in the inference.

Figure 4-3: Mean squared errors of oracle estimation, our estimation, and uniform estimations of mass on a log-normalized scale, and the correlations between estimations and ground truths.

Methods   MSE     Corr
Oracle    0.042   0.71
Galileo   0.052   0.44
Uniform   0.081   0

Figure 4-4: The log-likelihood traces of several chains with and without recognition-model (LeNet) based initializations.

4.4 Experiments

In this section, we conduct experiments from multiple perspectives to evaluate our model. Specifically, we use the model to predict how far objects will move after the collision; whether the object will remain stable in a different scene; and which of the two objects is heavier based on observations of collisions. For every experiment, we also conduct behavioral experiments on Amazon Mechanical Turk so that we may compare the performance of human and machine on these tasks.

4.4.1 Outcome Prediction

In the outcome prediction experiment, our goal is to measure and compare how well humans and machines can predict the moving distance of an object when only part of the video is observed. Specifically, for behavioral experiments on Amazon

Mechanical Turk, we first provide users with four full videos of objects made of a certain material, which contain complete collisions. In this way, users may infer the physical properties associated with that material. We then select a different object made of the same material and show users a video of that object, but only up to the

41 moment of collision. We finally ask users to label where they believe the target object (either cardboard or foam) will be after the collision, i.e., how far the target will move. We tested 30 users per case.

Given a partial video, for Galileo to generate predicted destinations, we first run it on the observed part of the video to derive an estimate of the object's friction coefficient. We then estimate its density by averaging the density values we derived from other objects of the same material by observing the collisions in which they are involved. We further estimate the density (mass) and friction coefficient of the target object by averaging our estimates from other collisions. We now have all the information required for the model to predict the ending point of the target after the collision. Note that the information available to Galileo is exactly the same as that available to humans.

We compare three kinds of predictions: human feedback, Galileo output, and, as a baseline, a uniform estimate calculated by averaging the ground-truth ending points over all test cases. Figure 4-5 shows the Euclidean distance in pixels between each of them and the ground truth. Human predictions are much better than the uniform estimate, but still far from perfect; Galileo performs comparably to humans on average on this task. Figure 4-6 shows, for some test cases, heat maps of user predictions, Galileo outputs (orange crosses), and ground truths (white crosses). The correlation between human and model errors is 0.70. A correlation analysis for the uniform model is not useful, since its output is a constant independent of the input.
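The error measure and the human-model error correlation amount to the small computation sketched below; the array names are hypothetical and the arrays would hold one row of pixel coordinates per test case.

import numpy as np

def pixel_errors(predictions, ground_truth):
    """Euclidean distance, in pixels, between predicted and true ending points.
    Both inputs are arrays of shape (n_cases, 2)."""
    return np.linalg.norm(predictions - ground_truth, axis=1)

# human_pred, model_pred, gt: (n_cases, 2) arrays of pixel coordinates
# human_err = pixel_errors(human_pred, gt)
# model_err = pixel_errors(model_pred, gt)
# error_correlation = np.corrcoef(human_err, model_err)[0, 1]   # reported value: 0.70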

4.4.2 Mass Prediction

The second experiment is to predict which of two objects is heavier after observing a video of a collision between them. For this task, we again randomly choose 50 objects and test each of them on 50 users. For Galileo, we directly obtain its answer from the estimated masses of the two objects.

Figure 4-7 demonstrates that humans and our model achieve about the same accuracy on this task. We also calculate correlations between the different outputs. For the correlation analysis, we use the ratio of the masses of the two objects estimated by Galileo as its predictor, and we aggregate human responses for each trial to get the proportion of people making each decision. As the relation is highly nonlinear, we calculate Spearman's coefficients. From Table 4.1, we observe that human responses, machine outputs, and ground truths are all positively correlated.
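A minimal sketch of this correlation analysis is given below. The numeric values are purely illustrative; mass_ratio stands for Galileo's estimated mass ratio per trial, human_votes for the aggregated proportion of subjects judging the first object heavier, and truth for the ground-truth comparison.

import numpy as np
from scipy.stats import spearmanr

mass_ratio = np.array([2.3, 0.4, 1.1, 5.0, 0.8])                  # Galileo's estimated mass ratios (illustrative)
human_votes = np.array([0.9, 0.2, 0.55, 0.95, 0.4])               # fraction of users answering "first object is heavier"
truth = (np.array([2.5, 0.3, 1.2, 4.0, 0.7]) > 1).astype(float)   # ground-truth "first object is heavier"

rho_model_human, _ = spearmanr(mass_ratio, human_votes)           # rank correlation handles the nonlinear relation
rho_model_truth, _ = spearmanr(mass_ratio, truth)
print(rho_model_human, rho_model_truth)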


Figure 4-5: Mean errors, in pixels, of human predictions, Galileo outputs, and a uniform estimate calculated by averaging ground-truth ending points over all test cases. As the error patterns are similar for both target objects (foam and cardboard), the errors here are averaged across target objects for each material.

Figure 4-6: Heat maps of user predictions, Galileo outputs (orange crosses), and ground truths (white crosses).

4.4.3 "Will it move" Prediction

Our third experiment is to predict whether a certain object will move in a different scene after observing one of its collisions. On Amazon Mechanical Turk, we show users a video containing a collision of two objects; in this video, the angle between the inclined surface and the ground is 20 degrees. We then show users the first frame of a 10-degree video of the same object and ask them to predict whether the object will slide down the surface in this case. We randomly choose 50 objects for the experiment, divide them into lists of 10 objects per user, and have each item tested on 50 users overall.

Table 4.1: Correlations between pairs of outputs in the mass prediction experiment (in Spearman's coefficient) and in the "will it move" prediction experiment (in Pearson's coefficient).

    Mass                 Spearman's Coeff
    Human vs Galileo     0.51
    Human vs Truth       0.68
    Galileo vs Truth     0.52

    "Will it move"       Pearson's Coeff
    Human vs Galileo     0.56
    Human vs Truth       0.42
    Galileo vs Truth     0.20

Figure 4-7: Average accuracy of human predictions and Galileo outputs on the tasks of mass prediction and "will it move" prediction. Error bars indicate standard deviations of human accuracies.

For Galileo, it is straightforward to predict the stability of an object in the 10-degree case using the estimates from the 20-degree video. Interestingly, both humans and the model are at chance on this task (Figure 4-7), and their responses are reasonably correlated (Table 4.1). Again, we aggregate human responses for each trial to get the proportion of people making each decision. Moreover, both subjects and the model show a bias towards saying "it will move." Future controlled experiments and simulations will investigate what underlies this correspondence.
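As a back-of-the-envelope version of this prediction, a rigid object on an incline slides when the tangent of the incline angle exceeds its static friction coefficient; the friction estimate comes from the 20-degree video, and the query is the 10-degree case. This simplified check is only a sketch and is not the model's actual physics-engine simulation.

import math

def will_it_move(mu_estimate, incline_degrees=10.0):
    """Predict whether the object slides down an incline of the given angle,
    given an estimated static friction coefficient."""
    return math.tan(math.radians(incline_degrees)) > mu_estimate

# Example: a friction coefficient of 0.15 inferred from the 20-degree video
# predicts sliding at 10 degrees, since tan(10 deg) ~ 0.176 > 0.15.
print(will_it_move(0.15))   # True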

Chapter 5

Beyond Understanding Physics

The perception of intrinsic object properties such as physical properties, appearance, and affordances plays a key role in explaining many of our daily observations, including the interactions among objects, between objects and scenes, and between agents and objects.

Scene Understanding Knowledge about physical object properties could be crucial to scene understanding. Is the configuration of the objects in the room stable?

What may happen if someone throws a ball against some particular object? Will people inside the room be safe if there is a minor earthquake? To answer questions like these, a computational system needs to understand basic physical laws, which could be provided by a mature physics engine, as well as some level of physical object properties.

An initial attempt could be to build a system working with synthetic scenes, which we can generate at very little cost and of which we have perfect knowledge. We are actively designing a new generative model with a physics engine, which follows the architecture of our second model but focuses on scene understanding. We hope the model can achieve two goals: first, using physics to help generate physically plausible scenes; and second, discriminatively predicting the stability and other physical properties of every location in a given scene.
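To give a flavor of how a physics engine can answer such stability queries on synthetic scenes, the sketch below uses the Bullet engine through its pybullet Python bindings: it builds a simple stack of boxes, runs the simulation forward, and calls the configuration stable if no box moves appreciably. The scene layout, thresholds, and overall setup are illustrative assumptions, not the design of the planned system.

import pybullet as p

def is_stable(box_positions, half_extent=0.1, steps=240, tol=0.02):
    p.connect(p.DIRECT)                                            # headless physics simulation
    p.setGravity(0, 0, -9.8)
    p.createMultiBody(0, p.createCollisionShape(p.GEOM_PLANE))     # static ground plane
    shape = p.createCollisionShape(p.GEOM_BOX, halfExtents=[half_extent] * 3)
    bodies = [p.createMultiBody(baseMass=1.0, baseCollisionShapeIndex=shape,
                                basePosition=pos) for pos in box_positions]
    for _ in range(steps):                                         # ~1 second at Bullet's default 240 Hz
        p.stepSimulation()
    moved = [abs(p.getBasePositionAndOrientation(b)[0][2] - pos[2]) > tol
             for b, pos in zip(bodies, box_positions)]
    p.disconnect()
    return not any(moved)

# A two-box stack with a large horizontal offset is predicted to topple:
# print(is_stable([(0, 0, 0.1), (0.15, 0, 0.3)]))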

Cognitive Agents In the dynamic world we live in, we are not only observers but also participants. Similarly, with a physical object model, it is natural to incorporate an agent that actively explores and interacts with the world. Such an agent could be unintentional: infants may play with an object not for any particular purpose, but merely to discover what they can do to the object and how it will respond. In perceiving physical object properties, it is reasonable to expect an agent that actively interacts with objects to perform better than a computational system that only learns by watching videos.

Agents may also pursue goals, such as moving an object to a certain place efficiently or deconstructing an unstable pile of building blocks. Besides combining deep learning with a physics engine, another direction we would like to explore is to integrate reinforcement learning into the loop, which has been proven effective in similar tasks.

Developmental Psychology In our second model, we assume uniform priors on physical properties such as mass and coefficient of friction during sampling. This does not align with the intuition that people expect objects with larger volumes to be heavier, or objects with smoother surfaces to have smaller coefficients of friction. To what extent do these priors exist, and if so, how do they affect human decisions? These questions could have profound implications when agents interact with objects; for example, a robot should exert a smaller force on a light and fragile object to avoid breaking it. Rigorously answering these questions, however, requires carefully designed behavioral experiments.
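As a toy illustration of replacing the uniform mass prior with a volume-informed one, the sketch below draws mass log-normally around a typical density times the object's volume, so that larger objects are expected to be heavier. The density value and spread are assumptions made for illustration, not measured parameters.

import numpy as np

def sample_mass(volume_cm3, typical_density=0.5, spread=0.5, rng=None):
    """Prior sample of mass (in grams) given volume (in cm^3)."""
    rng = rng or np.random.default_rng()
    expected_mass = typical_density * volume_cm3
    return expected_mass * np.exp(spread * rng.normal())   # log-normal around the expectation

# Under this prior, a 1000 cm^3 object is centered around ~500 g rather than
# being equally likely to weigh 5 g or 5 kg.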

Further, as discussed in Chapter 1, infants acquire basic concepts of physics at an early age. If we observe these priors in adults, when do young children develop similar concepts, and what kinds of priors do they have in mind? A thorough understanding of these questions could inspire research in both developmental psychology and artificial intelligence.

Chapter 6

Conclusion

In this thesis, we discussed the task of learning physical object properties. We studied several scenarios with which humans are familiar and in which they can learn to infer the relevant physical object properties even at a young age. We proposed a novel dataset, Physics 101, which contains over 17,000 videos from four viewpoints of 101 objects in four scenarios. We further proposed two novel models for learning physical properties of objects by incorporating physics simulators with deep neural nets, and conducted extensive evaluations.

The main contribution of this thesis is showing that a generative vision system with physical object representations and a realistic 3D physics engine or a symbolic physics interpreter at its core can efficiently deal with real-world data when proper recognition models and feature spaces are used. Our behavioral study also points towards the possibility of an account of human vision with generative physical knowledge at its core, and various recognition models as helpers that enable efficient inference. We hope this thesis can inspire future study on learning physical and other commonsense knowledge from visual data.

Bibliography

[1] Renée Baillargeon. Infants' physical world. Current Directions in Psychological Science, 13(3):89-94, 2004.

[2] Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understanding. PNAS, 110(45):18327-18332, 2013.

[3] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the materials in context database. In CVPR, 2015.

[4] Katherine L Bouman, Bei Xiao, Peter Battaglia, and William T Freeman. Estimating the material properties of fabric from video. In ICCV, 2013.

[5] Susan Carey. The origin of concepts. Oxford University Press, 2009.

[6] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[7] Erwin Coumans. Bullet physics engine. Open Source Software: http://bulletphysics.org, 2010.

[8] Abe Davis, Katherine L Bouman, Justin G Chen, Michael Rubinstein, Frédo Durand, and William T Freeman. Visual vibrometry: Estimating material properties from small motions in video. In CVPR, 2015.

[9] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889-904, 1995.

[10] Zhaoyin Jia, Andy Gallagher, Ashutosh Saxena, and Tsuhan Chen. 3D reasoning from blocks to stability. IEEE TPAMI, 2014.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov 1998.

[13] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[14] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.

[15] Adam N Sanborn, Vikash K Mansinghka, and Thomas L Griffiths. Reconciling intuitive physics and Newtonian mechanics for colliding objects. Psychological Review, 120(2):411, 2013.

[16] John Schulman, Alex Lee, Jonathan Ho, and Pieter Abbeel. Tracking deformable objects with point clouds. In ICRA, 2013.

[17] Carlo Tomasi and Takeo Kanade. Detection and tracking of point features. IJCV, 1991.

[18] Tomer Ullman, Andreas Stuhlmüller, Noah Goodman, and Josh Tenenbaum. Learning physics from dynamical scenes. In CogSci, 2014.

[19] Manik Varma and Andrew Zisserman. A statistical approach to material classification using image patch exemplars. IEEE TPAMI, 31(11):2032-2047, 2009.

[20] Jacob Walker, Abhinav Gupta, and Martial Hebert. Patch to the future: Unsupervised visual prediction. In CVPR, 2014.

[21] Jiajun Wu, Ilker Yildirim, Joseph J. Lim, William T. Freeman, and Joshua B. Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In NIPS, 2015.

[22] Ilker Yildirim, Tejas D Kulkarni, Winrich A Freiwald, and Joshua B Tenenbaum. Efficient analysis-by-synthesis in vision: A computational framework, behavioral tests, and modeling neuronal representations. In CogSci, 2015.

[23] Bo Zheng, Yibiao Zhao, Joey C Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Detecting potential falling objects by inferring human action and natural disturbance. In ICRA, 2014.

[24] Yixin Zhu, Yibiao Zhao, and Song-Chun Zhu. Understanding tools: Task-oriented object modeling, learning and recognition. In CVPR, 2015.