Unsupervised Structured Learning of Human Activities for Robot Perception
UNSUPERVISED STRUCTURED LEARNING OF HUMAN ACTIVITIES FOR ROBOT PERCEPTION

A Dissertation
Presented to the Faculty of the Graduate School of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

by
Chenxia Wu
August 2016

© 2016 Chenxia Wu
ALL RIGHTS RESERVED

UNSUPERVISED STRUCTURED LEARNING OF HUMAN ACTIVITIES FOR ROBOT PERCEPTION
Chenxia Wu, Ph.D.
Cornell University 2016

Learning human activities and environments is important for robot perception. Human activities and environments comprise many aspects, including a wide variety of human actions and the various objects that humans interact with, which makes modeling them very challenging. We observe that these aspects are related to each other spatially, temporally and semantically, forming sequential, hierarchical or graph structures. Understanding these structures is key to the learning algorithms and systems behind robot perception.

This thesis therefore focuses on structured modeling of these complex human activities and environments using unsupervised learning. Our unsupervised learning approaches detect hidden structures in the data itself, without the need for human annotations. In this way, we enable more useful applications, such as forgotten-action detection and object co-segmentation.

While structured models in supervised settings have been well studied and widely used in various domains, discovering latent structures remains a challenging problem in unsupervised learning. In this work, we propose unsupervised structured learning models, including causal topic models and fully connected Conditional Random Field (CRF) auto-encoders, which can model more complex relations with fewer independence assumptions. We also design efficient learning and inference optimizations that keep the computations tractable. As a result, we produce more flexible and accurate robot perception in more interesting applications.

We first note that modeling the hierarchical semantic relations of objects, and of objects' interactions with humans, is very important for developing flexible and reliable robotic perception. We therefore propose a hierarchical semantic labeling algorithm that produces scene labels at different levels of abstraction for specific robot tasks. We also propose unsupervised learning algorithms that leverage the interactions between humans and objects, so that the machine can automatically discover useful common object regions from a set of images.

Second, we note that it is important for a robot to detect not only what a human is currently doing, but also more complex relations, such as temporal relations between actions and human-object relations, enabling better perception performance and more flexible tasks. We therefore propose a causal topic model that incorporates both short-term and long-term temporal relations between human actions, as well as human-object relations, and we develop a new robotic system that watches not only what a human is currently doing, but also what he has forgotten to do, and, where necessary, reminds the person of the forgotten action.

In the domain of human activities and environments, we show how to build models that learn semantic, spatial and temporal structures in the unsupervised setting. We show that these approaches are useful in multiple domains, including robotics, object recognition, human activity modeling, image/video data mining and visual summarization. Since our techniques are unsupervised and structured, they are easily extended and scaled to other areas, such as natural language processing, robotic planning/manipulation, and multimedia analysis.
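To give a concrete flavor of the fully connected CRF encoder the abstract refers to, the sketch below runs a toy mean-field inference loop over a dense pairwise (Potts-style) CRF, where similar items are pushed toward agreeing labels. This is a minimal illustration, not the dissertation's model: the thesis couples such an encoder with a reconstruction objective and learned potentials, whereas here the function names, the Gaussian affinity construction, the pairwise weight, and the iteration count are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dense_crf_mean_field(unary_scores, affinity, pairwise_weight=1.0, n_iters=10):
    """Toy mean-field inference for a fully connected pairwise CRF.

    unary_scores : (n, k) per-item, per-label log-potentials
    affinity     : (n, n) symmetric pairwise similarity, zero diagonal
    Returns an (n, k) matrix of approximate marginals q(y_i = l).
    """
    q = softmax(unary_scores)  # initialize beliefs from unaries
    for _ in range(n_iters):
        # Each item aggregates the current label beliefs of all other
        # items, weighted by affinity (message passing on a dense graph).
        messages = affinity @ q
        q = softmax(unary_scores + pairwise_weight * messages)
    return q

# Tiny usage example with random data: 6 "regions", 2 candidate labels.
rng = np.random.default_rng(0)
unary = rng.normal(size=(6, 2))
feat = rng.normal(size=(6, 3))
aff = np.exp(-np.square(feat[:, None] - feat[None]).sum(-1))  # Gaussian kernel
np.fill_diagonal(aff, 0.0)
print(dense_crf_mean_field(unary, aff).round(2))
```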
BIOGRAPHICAL SKETCH

Chenxia Wu was born and grew up in Zhenjiang, a beautiful eastern Chinese city on the southern bank of the Yangtze River. Before joining the computer science PhD program at Cornell University, he obtained a Bachelor's degree in computer science from Southeast University, China, and a master's degree in computer science from Zhejiang University, China. He enjoys solving challenging science problems in areas such as machine learning, computer vision, robotics and game theory, and he likes to innovate solutions for breakthrough applications such as assistive robots, home automation, and self-driving cars. He loves cooking and photography in his free time.

To my parents – Hui Wu and Xuemei Li, and my wife – Jiemi Zhang.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my advisor, Ashutosh Saxena, for his great guidance and support through the years. Beyond the knowledge of computer science, he taught me how to always pursue breakthroughs in my research and how to solve challenges step by step. He motivated me to work hard and, more importantly, to work smart and to think in depth.

I was also fortunate to collaborate with Silvio Savarese at the Stanford AI Lab, from whom I learned many valuable insights into computer vision research. I am very grateful to Bart Selman, who was very supportive of my research and PhD studies. I am also very grateful to the other members of my thesis committee, Charles Van Loan and Thorsten Joachims, for their insightful and constructive suggestions on my work.

I would also like to thank my colleagues in the Robot Learning Lab at Cornell: Yun Jiang, Hema Koppula, Ian Lenz, Jaeyong Sung, Ozan Sener, Ashesh Jain, and Dipendra K Misra. I am grateful for the help of the members of the Stanford CVGL group: Amir R. Zamir, David Held, Yu Xiang, Kevin Chen, and Kuan Fang. I also thank my friends and colleagues at Cornell: Xilun Chen and Pu Zhang. I would like to thank friends and colleagues at Brain of Things Inc: David Cheriton, Deng Deng, Lukas Kroc, Brendan Berman, Shane Soh, Jingjing Zhou, Pankaj Rajan, and Jeremy Mary. I am grateful to my colleagues at Google X: Ury Zhilinsky, Congcong Li, Zhaoyin Jia, Jiajun Zhu, and Junhua Mao.

Finally, I thank my parents, Hui Wu and Xuemei Li, for their love and support. I especially thank my wife, Jiemi Zhang, who is not only my best partner but also my soulmate.

TABLE OF CONTENTS

Biographical Sketch
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 Human Environments Learning
  1.2 Human Activities Learning
  1.3 First Published Appearances of Described Contributions

2 Hierarchical Semantic Labeling for Task-Relevant RGB-D Perception
  2.1 Introduction
  2.2 Related Work
  2.3 Overview
  2.4 Preliminaries
    2.4.1 Unary term of a segment
    2.4.2 Labeling RGB-D Images with Flat Labels
  2.5 Hierarchical Semantic Labeling
    2.5.1 Labeling Segmentation Trees with Flat Labels
    2.5.2 Labeling Segmentation Trees with Hierarchical Labels
  2.6 Efficient Optimization
  2.7 Scene Labeling Experiments
    2.7.1 Results
  2.8 Robotic Experiments
    2.8.1 Object Search Experiments
    2.8.2 Object Retrieval Experiments
    2.8.3 Object Placement Experiments
  2.9 Summary

3 Human Centered Object Co-Segmentation
  3.1 Introduction
  3.2 Related Work
  3.3 Problem Formulation
  3.4 Model Representation
    3.4.1 Fully Connected CRF Encoding
    3.4.2 Reconstruction
  3.5 Model Learning and Inference
    3.5.1 Efficient Learning and Inference
  3.6 Experiments
    3.6.1 Compared Baselines
    3.6.2 Evaluations
    3.6.3 Datasets
    3.6.4 Results
  3.7 Summary

4 Unsupervised Learning of Human Actions and Relations
  4.1 Introduction
  4.2 Related Work
  4.3 Overview
  4.4 Visual Features
  4.5 Learning Model
  4.6 Gibbs Sampling for Learning and Inference
  4.7 Applications
    4.7.1 Action Segmentation and Recognition
    4.7.2 Action Patching
  4.8 Experiments
    4.8.1 Dataset
    4.8.2 Experimental Setting and Compared Baselines
    4.8.3 Evaluation Metrics
    4.8.4 Results
  4.9 Summary

5 Unsupervised Learning for Reminding Humans of Forgotten Actions
  5.1 Introduction
  5.2 Related Work
  5.3 Watch-Bot System
  5.4 Learning Model
    5.4.1 Learning and Inference
  5.5 Forgotten Action Detection and Reminding
  5.6 Experiments
    5.6.1 Dataset
    5.6.2 Baselines
    5.6.3 Evaluation Metrics
    5.6.4 Results
    5.6.5 Robotic Experiments
  5.7 Summary

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
    6.2.1 Deep Structures in Feature Encoding and Decoding
    6.2.2 Extending to Semi-Supervised Learning
    6.2.3 Practical Robotic Applications

LIST OF TABLES

2.1 Major notations in this chapter.
2.2 Major notations in hierarchical semantic labeling.
2.3 Average class recall of each class level on the NYUD2 dataset.
2.4 Robotic experiment results. Success rates for perception ('perch') and actual robotic execution ('exec') of each task.
3.1 Co-Segmentation results on the CAD-120 dataset (%).
3.2 Co-Segmentation results on the Watch-n-Patch dataset (%).
3.3 Co-Segmentation results on the PPMI dataset (%).
3.4 Co-Segmentation results on the MS COCO + Watch-n-Patch dataset (%).
4.1 Notations in our model.
4.2 Results using the same number of topics as the ground-truth action classes. HMM-DTF and CaTM-DTF use DTF RGB features; the others use our human skeleton and RGB-D features.
5.1 Notations in our model.
5.2 Action segmentation and cluster assignment results, and forgotten action/object detection results.
5.3 Robotic experiment results. The higher the better.