Robust Deep Learning in the Open World with Lifelong Learning and Representation Learning
Robust Deep Learning in the Open World with Lifelong Learning and Representation Learning

by

Kibok Lee

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in the University of Michigan 2020

Doctoral Committee:

Associate Professor Honglak Lee, Chair
Assistant Professor David Fouhey
Professor Alfred O. Hero, III
Assistant Professor Justin Johnson

Kibok Lee
[email protected]
ORCID iD: 0000-0001-6995-7327

© Kibok Lee 2020
All Rights Reserved

ACKNOWLEDGEMENTS

First of all, I would like to express my sincere gratitude to my advisor, Professor Honglak Lee, for his guidance and support throughout my graduate studies. It was an honor and a privilege to have him as a mentor. None of the work presented in this dissertation would have been possible without his insightful advice.

I would also like to thank Professor Jinwoo Shin and Kimin Lee from KAIST for their continuous collaboration with me. More than half of my publications have been written through discussions with them.

I would like to extend my appreciation to my dissertation committee members, Professor David Fouhey, Professor Alfred O. Hero III, and Professor Justin Johnson, who provided valuable feedback on this dissertation.

I would like to thank all my former and current labmates for their valuable discussions and warm friendship: Scott Reed, Yuting Zhang, Wenling Shang, Junhyuk Oh, Ruben Villegas, Xinchen Yan, Ye Liu, Seunghoon Hong, Lajanugen Logeswaran, Rui Zhang, Sungryull Sohn, Jongwook Choi, Yijie Guo, Yunseok Yang, Wilka Carvalho, Anthony Liu, and Thomas Huang.

Many thanks to all my collaborators in academia and industry: Professor Anna Gilbert, Yi Zhang, Kyle Min, and Yian Zhu from UMich; Professor Bo Li from UIUC; Sukmin Yun from KAIST; Woojoo Sim from Samsung; Ersin Yumer, Raquel Urtasun, and Zhuoyuan Chen from Uber ATG; and Kihyuk Sohn and Chun-Liang Li from Google Cloud AI.
I am grateful to have had the opportunity to work on research projects at Uber ATG as a research intern, which allowed me to experience conducting research from an industrial perspective.

I would like to express my deepest gratitude to my family for their endless love and support. My parents and brother have always been a source of comfort and hope in difficult times. Your presence has been a great pleasure and encouragement to me.

Lastly, I gratefully acknowledge the financial support from the Kwanjeong Educational Foundation Scholarship, Samsung Electronics, NSF CAREER IIS-1453651, ONR N00014-16-1-2928, and the DARPA Explainable AI (XAI) program #313498.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS .................................. ii
LIST OF FIGURES ................................... vi
LIST OF TABLES .................................... xi
LIST OF APPENDICES ................................ xv
ABSTRACT .......................................... xvi

CHAPTER

I. Introduction .................................... 1
    1.1 Motivation .................................. 1
    1.2 Organization ................................ 6
    1.3 List of Publications ........................ 7

II. Hierarchical Novelty Detection for Visual Object Recognition ... 9
    2.1 Introduction ................................ 9
    2.2 Related Work ................................ 11
    2.3 Approach .................................... 13
        2.3.1 Taxonomy .............................. 13
        2.3.2 Top-Down Method ....................... 14
        2.3.3 Flatten Method ........................ 16
    2.4 Evaluation: Hierarchical Novelty Detection ... 18
        2.4.1 Evaluation Setups ..................... 18
        2.4.2 Experimental Results .................. 20
    2.5 Evaluation: Generalized Zero-Shot Learning ... 22
        2.5.1 Evaluation Setups ..................... 22
        2.5.2 Experimental Results .................. 24
    2.6 Summary .................................... 25

III. Overcoming Catastrophic Forgetting with Unlabeled Data in the Wild ... 26
    3.1 Introduction ................................ 26
    3.2 Approach .................................... 29
        3.2.1 Preliminaries: Class-Incremental Learning ... 29
        3.2.2 Global Distillation ................... 31
        3.2.3 Sampling External Dataset ............. 33
    3.3 Related Work ................................ 36
    3.4 Experiments ................................. 38
        3.4.1 Experimental Setup .................... 38
        3.4.2 Evaluation ............................ 40
    3.5 Summary .................................... 43

IV. Network Randomization: A Simple Technique for Generalization in Deep Reinforcement Learning ... 44
    4.1 Introduction ................................ 44
    4.2 Related Work ................................ 46
    4.3 Network Randomization Technique for Generalization ... 47
        4.3.1 Training Agents Using Randomized Input Observations ... 48
        4.3.2 Inference Methods for Small Variance ... 50
    4.4 Experiments ................................. 50
        4.4.1 Baselines and Implementation Details .. 51
        4.4.2 Experiments on CoinRun ................ 51
        4.4.3 Experiments on DeepMind Lab and Surreal Robotics Control ... 55
    4.5 Summary .................................... 56

V. i-MixUp: Vicinal Risk Minimization for Contrastive Representation Learning ... 58
    5.1 Introduction ................................ 58
    5.2 Related Work ................................ 60
    5.3 Preliminary ................................. 62
        5.3.1 Contrastive Representation Learning ... 62
        5.3.2 MixUp in Supervised Learning .......... 63
    5.4 Vicinal Risk Minimization for Contrastive Representation Learning ... 64
        5.4.1 i-MixUp for Memory-Free Contrastive Representation Learning ... 64
        5.4.2 i-MixUp for Memory-Based Contrastive Representation Learning ... 65
        5.4.3 InputMix .............................. 66
    5.5 Experiments ................................. 66
        5.5.1 i-MixUp with Contrastive Learning Methods and Transferability ... 67
        5.5.2 Scalability of i-MixUp ................ 68
        5.5.3 Embedding Analysis .................... 69
        5.5.4 Contrastive Learning without Domain-Specific Data Augmentation ... 70
        5.5.5 i-MixUp on Other Domains .............. 71
        5.5.6 i-MixUp on ImageNet ................... 72
    5.6 Summary .................................... 73

VI. Conclusion and Future Directions ............... 74

APPENDICES ........................................ 76
BIBLIOGRAPHY ...................................... 121

LIST OF FIGURES

Figure

2.1 An illustration of our proposed hierarchical novelty detection task. In contrast to prior novelty detection works, we aim to find the most specific class label of novel data on the taxonomy built with known classes. ... 10

2.2 Illustration of the two proposed approaches. In the top-down method, classification starts from the root class and propagates to one of its children until the prediction arrives at a known leaf class (blue), or stops if the prediction is not confident, which means that the prediction is a novel class whose closest super class is the predicted class. In the flatten method, we add a virtual novel class (red) under each super class as a representative of all novel classes, and then flatten the structure for classification. ... 13

2.3 Illustration of strategies to train novel class scores in flatten methods. (a) shows the training images in the taxonomy. (b) shows the relabeling strategy: some training images are relabeled to super classes in a bottom-up manner. (c–d) show the leave-one-out (LOO) strategy: to learn a novel class score under a super class, one of its children is temporarily removed such that its descendant known leaf classes are treated as novel during training. ... 17

2.4 Qualitative results of hierarchical novelty detection on ImageNet.
“GT” is the closest known ancestor of the novel class, which is the expected prediction; “DARTS” is the baseline method proposed in Deng et al. (2012), where we modify the method for our purpose; and the others are our proposed methods. “ϵ” is the distance between the prediction and GT, “A” indicates whether the prediction is an ancestor of GT, and “Word” is the English word of the predicted label. Dashed edges represent multi-hop connections, where the number indicates the number of edges between classes. If the prediction is on a super class (marked with * and rounded), then the test image is classified as a novel class whose closest class in the taxonomy is the super class. ... 21

2.5 Known-novel class accuracy curves obtained by varying the novel class score bias on ImageNet, AwA2, and CUB. In most regions, our proposed methods outperform the baseline method. ... 22

2.6 Seen-unseen class accuracy curves of the best combined models, obtained by varying the unseen class score bias on AwA1, AwA2, and CUB. “Path” is the hierarchical embedding proposed in Akata et al. (2015), and “TD” is the embedding of the multiple softmax probability vectors obtained from the proposed top-down method. In most regions, TD outperforms Path. ... 25

3.1 We propose to leverage a large stream of unlabeled data in the wild for class-incremental learning. At each stage, a confidence-based sampling strategy is applied to build an external dataset. Specifically, some of the unlabeled data are sampled based on the prediction of the model P learned in the previous stage to alleviate catastrophic forgetting, and some of them are randomly sampled for confidence calibration. With the combination of the labeled training dataset and the unlabeled external dataset, a teacher model C first learns the current task, and then the new model M learns