Towards a Unified Framework for Learning and Reasoning

Han Zhao

August 2020
CMU-ML-20-111

Machine Learning Department
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA

Thesis Committee

Geoffrey J. Gordon, Chair
Ruslan Salakhutdinov
Barnabás Póczos
Tommi S. Jaakkola (Massachusetts Institute of Technology)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2020 Han Zhao

This research was supported by the Office of Naval Research under award numbers N000140911052 and N000141512365, the Air Force Research Laboratory award number FA87501720152, a grant from NCS Pearson, Inc., and an Nvidia GPU grant. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.

Keywords: Artificial Intelligence, Deep Learning, Information Theory, Learning Theory, Probabilistic Graphical Models, Representation Learning, Algorithmic Fairness, Domain Adaptation, Transfer Learning, Multitask Learning, Sum-Product Networks, Probabilistic Circuits

To my parents, my brother, and my lovely girlfriend.

Abstract

The success of supervised machine learning in recent years crucially hinges on the availability of large-scale and unbiased data, which is often time-consuming and expensive to collect. Recent advances in deep learning focus on learning rich and invariant representations that have found abundant applications in domain adaptation, multitask learning, algorithmic fairness, and machine translation, just to name a few. However, it is not clear what price we have to pay in terms of task utility for such universal representations. On the other hand, learning is only one of the two most fundamental cognitive abilities of intelligent agents. An intelligent agent needs to have both the ability to learn from experience and the ability to reason from what has been learned. However, classic symbolic reasoning cannot model the inherent uncertainty that ubiquitously exists, and it is not robust to noisy observations. Perhaps more fundamentally, reasoning is computationally intractable in general. As a result, learning, which often takes reasoning as a sub-procedure, is also hard. Building on fundamental concepts from information theory and theoretical computer science, this thesis aims to understand the inherent tradeoff between utility and invariance in learning representations, and to develop efficient algorithms for learning tractable and exact probabilistic inference machines.

This thesis contains two parts. The first part is devoted to understanding and learning invariant representations. In particular, we will focus on understanding the costs of existing invariant representations by characterizing a fundamental tradeoff between invariance and utility. First, we will use domain adaptation as an example to show, both theoretically and empirically, such a tradeoff in achieving small joint generalization error. This result also implies an inherent tradeoff between demographic parity, a statistical notion of group fairness, and utility in both classification and regression settings. Going beyond, we will further show that such a general tradeoff exists in learning with structured data. In particular, we shall derive an impossibility theorem for universal machine translation by learning language-invariant representations. Second, we will focus on designing learning algorithms to escape the existing tradeoff and to utilize the benefits of invariant representations. We will show how an algorithm can be used to guarantee equalized treatment of individuals between groups, and discuss what additional problem structure is required to permit efficient domain adaptation and machine translation through learning invariant representations.

The second part of the thesis is devoted to learning tractable and exact circuits for probabilistic reasoning. It is well known that exact marginal and conditional inference in classic probabilistic graphical models (PGMs), including Bayesian Networks (BNs) and Markov Networks (MNs), is #P-complete. As a result, practitioners usually need to resort to various approximate inference schemes to ensure computational tractability. Probabilistic circuits, which include Sum-Product Networks (SPNs) as a special case, have been proposed as tractable deep models for exact probabilistic inference. They distinguish themselves from other types of probabilistic graphical models by the fact that inference can be done exactly in linear time with respect to the size of the circuit. This has generated a lot of interest, since inference is not only a powerful tool for reasoning under uncertainty, but also a core task for parameter estimation and structure learning. In this part, we will concentrate on both theoretical and practical aspects of learning tractable probabilistic circuits. In particular, we will investigate the representational power of SPNs, as well as their parameter learning procedures in both online and offline settings.

Acknowledgments

First and foremost, I am greatly indebted to my advisor, Geoff Gordon, who, throughout my five-year journey at Carnegie Mellon University, has provided me with constant support, insightful discussions, and the freedom to pursue interesting research problems that fulfill my own curiosity. In addition to being a fantastic researcher, Geoff is also the kindest and most supportive individual I have had the pleasure of working with. He has a great sense of research taste, which has also shaped mine towards important and elegant questions. I would like to sincerely thank Geoff for his intuitive insights in understanding complex problems, for sharing with me his broad knowledge of almost every aspect of machine learning, and for his unwavering faith in me that pushes me to become a better researcher. I hope that I can become a researcher, an advisor and a person like him in the future.

I am very grateful to Ruslan Salakhutdinov, Barnabás Póczos and Tommi S. Jaakkola for all the help they have provided during my Ph.D., including serving as my thesis committee members. Russ has been a brilliant collaborator on a number of research projects during my Ph.D. He is extremely knowledgeable and understands almost all aspects of deep learning. I would like to thank him for his tremendous help with my research and his constant support of my career. Barnabás has been a great friend and mentor. He is always willing to listen to me on both research and life, and to give me his advice and thoughts. Tommi hosted me for a visit at MIT, the results of which include a research paper on the information-theoretic characterization of the fundamental limit of invariant representation learning, and another on using adversarial learning to defend against attribute inference attacks on graphs. During my visit, his sharpness and deep insights helped me achieve a better and clearer understanding of the tradeoff problem I was working on.

I would also like to thank my master's thesis advisor, Pascal Poupart at the University of Waterloo, who led me into this fantastic field of research on tractable probabilistic reasoning, for his patience and encouragement, and his endless support and guidance. Pascal has always been a kind mentor and friend, and it is my privilege to be able to continue collaborating with him after graduating from Waterloo. The results of our collaboration are presented in Chapter 9.

A huge thanks goes to all of my other collaborators throughout my Ph.D.: Tameem Adel, Brandon Amos, Jianfeng Chi, Adam Coates, Remi Tachet des Combes, João P. Costeira, Amanda Coston, Chen Dan, Junjie Hu, Priyank Jaini, Stefanie Jegelka, Nebojsa Jojic, Chen Liang, Chin-Hui Lee, Peizhao Li, Peiyuan Alex Liao, Hongfu Liu, José M. F. Moura, Renato Negrinho, Rema Padman, Abdullah Rashwan, Pradeep Ravikumar, Andrej Risteski, Nihar B. Shah, Jian Shen, Xiaofei Shi, Alexander J. Smola, Otilia Stretcu, Ivan Tashev, Yuan Tian, Yao-Hung Tsai, Richard E. Turner, Wen Wang, Yu-Xiang Wang, Guanhang Wu, Keyulu Xu, Yichong Xu, Makoto Yamada, Jianbo Ye, Shuayb Zarar, Kun Zhang, Shanghang Zhang, Zhenyao Zhu, and Honglei Zhuang. It would not have been as fun or as productive without their help. Special thanks go to Andrej Risteski, Nihar B. Shah, Yao-Hung Tsai and Yichong Xu. It has been my great privilege to work with them; I have benefited not only from our inspiring discussions, but also from countless fun moments in life.
I also want to thank everyone in the Machine Learning Department at Carnegie Mellon University who contributes to making it a vibrant and fun place to pursue graduate studies. In particular, I would like to thank Diane Stidle and the other amazing staff in our department for their endless efforts in making our everyday life in the department easy and enjoyable. Special thanks to all my friends at CMU: Siddharth Ancha, Ifigeneia Apostolopoulou, Shaojie Bai, Devendra Chaplot, Maria De-Arteaga, Carlton Downey, Simon Du, Avinava Dubey, William Guss, Ahmed Hefny, Hanzhang Hu, Zhiting Hu, Lisa Lee, Leqi Liu, Yangyi Lu, Yifei Ma, Ritesh Noothigattu, Anthony Platanios, Mrinmaya Sachan, Wen Sun, Mariya Toneva, Po-Wei Wang, Yining Wang, Eric Wong, Yifan Wu, Yuexin Wu, Pengtao Xie, Qizhe Xie, Keyang Xu, Diyi Yang, Fan Yang, Zhilin Yang, Ian En-Hsu Yen, Yaodong Yu, Hongyang Zhang and Xun Zheng. I hope our friendship will last forever.

Last but most importantly, I would like to thank my parents, Wei Zhao and Yuxia Wang, my younger brother, Rui Zhao, and my lovely girlfriend, Lu Sun, for all of your unconditional love and support during my Ph.D. journey. Needless to say, this thesis would not have been possible without your encouragement along the way. This thesis is dedicated to all of you.

Contents

1 Introduction
  1.1 Main Contributions of This Thesis
  1.2 Overview of Part I: Invariant Representation Learning
    1.2.1 Learning Domain-Invariant Representations
    1.2.2 Learning Fair Representations
    1.2.3 Learning Multilingual Representations
  1.3 Overview of Part II: Tractable Probabilistic Reasoning
    1.3.1 A Unified Framework for Parameter Learning
    1.3.2 Collapsed Variational Inference
    1.3.3 Linear Time Computation of Moments
  1.4 Bibliographic Notes
  1.5 Excluded Research

2 Preliminaries
  2.1 Notation
  2.2 Domain Adaptation
  2.3 Algorithmic Fairness
  2.4 Sum-Product Networks
  2.5 Geometric and Signomial Programming
  2.6 Some Technical Tools

I Understanding and Learning Invariant Representations

3 Domain Adaptation with Multiple Source Environments
  3.1 Introduction
  3.2 Preliminaries
  3.3 Generalization Bound for Multiple Source Domain Adaptation
  3.4 Multisource Domain Adaptation with Adversarial Neural Networks
  3.5 Experiments
    3.5.1 Amazon Reviews
    3.5.2 Digits Datasets
    3.5.3 WebCamT Vehicle Counting Dataset
  3.6 Related Work
  3.7 Proofs
    3.7.1 Proof of Corollary 3.3.1
    3.7.2 Proof of Theorem 3.3.1
    3.7.3 Proof of Theorem 3.3.2
  3.8 Conclusion
  3.A Details on Amazon Reviews Evaluation
  3.B Details on Digit Datasets Evaluation
  3.C Details on WebCamT Vehicle Counting

4 Learning Invariant Representations for Domain Adaptation
  4.1 Introduction
  4.2 Preliminaries
    4.2.1 Problem Setup
    4.2.2 A Theoretical Model for Domain Adaptation
  4.3 Related Work
  4.4 Theoretical Analysis
    4.4.1 Invariant Representation and Small Source Risk are Not Sufficient
    4.4.2 A Generalization Upper Bound
    4.4.3 An Information-Theoretic Lower Bound
  4.5 Experiments
  4.6 Proofs
  4.7 Conclusion

5 Domain Adaptation with Conditional Distribution Matching under Generalized Label Shift
  5.1 Introduction
  5.2 Preliminaries
  5.3 Main Results
    5.3.1 Generalized Label Shift
    5.3.2 An Error Decomposition Theorem based on GLS
    5.3.3 Conditions for Generalized Label Shift
    5.3.4 Estimating the Importance Weights w
    5.3.5 ℱ-IPM for Distributional Alignment
  5.4 Practical Implementation
    5.4.1 Algorithms
    5.4.2 Experiments
  5.5 Proofs
    5.5.1 Definition
    5.5.2 Consistency of the Weighted Domain Adaptation Loss (5.7)
    5.5.3 k-class information-theoretic lower bound
    5.5.4 Proof of Theorem 5.3.1
    5.5.5 Proof of Theorem 5.3.2
    5.5.6 Proof of Lemma 5.3.1
    5.5.7 Proof of Theorem 5.3.3
    5.5.8 Sufficient Conditions for GLS
    5.5.9 Proof of Lemma 5.3.2
    5.5.10 ℱ-IPM for Distributional Alignment
  5.6 Conclusion
  5.A More Experiments
    5.A.1 Description of the domain adaptation tasks
    5.A.2 Full results on the domain adaptation tasks
    5.A.3 Jensen-Shannon divergence of the original and subsampled domain adaptation datasets
    5.A.4 Losses
    5.A.5 Generation of domain adaptation tasks with varying D_JS(𝒟_S(Z) ‖ 𝒟_T(Z))
    5.A.6 Implementation details
    5.A.7 Weight Estimation

6 Learning Fair Representations
  6.1 Introduction
  6.2 Preliminaries
  6.3 Main Results
    6.3.1 Tradeoff between Fairness and Utility
    6.3.2 Tradeoff in Fair Representation Learning
    6.3.3 Fair Representations Lead to Accuracy Parity
  6.4 Empirical Validation
  6.5 Related Work
  6.6 Conclusion

7 Conditional Learning of Fair Representations
  7.1 Introduction
  7.2 Preliminaries
  7.3 Algorithm and Analysis
    7.3.1 Conditional Learning of Fair Representations
    7.3.2 The Preservation of Demographic Parity Gap and Small Joint Error
    7.3.3 Conditional Alignment and Balanced Error Rates Lead to Small Error
    7.3.4 Practical Implementation
  7.4 Empirical Studies
    7.4.1 Experimental Setup
    7.4.2 Results and Analysis
  7.5 Related Work
  7.6 Proofs
    7.6.1 Proof of Proposition 7.3.1
    7.6.2 Proof of Proposition 7.3.2
    7.6.3 Proof of Lemma 7.3.1
    7.6.4 Proof of Theorem 7.3.1
    7.6.5 Proof of Theorem 7.3.2
    7.6.6 Proof of Theorem 7.3.3
  7.7 Conclusion
  7.A Experimental Details
    7.A.1 The Adult Experiment
    7.A.2 The COMPAS Experiment

8 Learning Language-Invariant Representations for Universal Machine Translation
  8.1 Introduction
  8.2 An Impossibility Theorem
    8.2.1 Two-to-One Translation
    8.2.2 Many-to-Many Translation
  8.3 Sample Complexity under a Generative Model
    8.3.1 Language Generation Process and Setup
    8.3.2 Main Result: Translation between Arbitrary Pairs of Languages
    8.3.3 Proof Sketch of the Theorem
    8.3.4 Extension to Randomized Encoders and Decoders
  8.4 Related Work
  8.5 Proofs
  8.6 Conclusion

II Learning Tractable Circuits for Probabilistic Reasoning

9 A Unified Framework for Parameter Learning
  9.1 Sum-Product Networks as a Mixture of Trees
  9.2 Maximum Likelihood Estimation as Signomial Programming
  9.3 Difference of Convex Functions
    9.3.1 Sequential Monomial Approximation
    9.3.2 Concave-convex Procedure
  9.4 Experiments
    9.4.1 Experimental Setting
    9.4.2 Parameter Learning
    9.4.3 Fine Tuning
  9.5 Proofs
    9.5.1 Structural Properties of SPNs
    9.5.2 MLE as Signomial Programming
    9.5.3 Convergence of CCCP for SPNs
  9.A Experiment Details
    9.A.1 Methods
    9.A.2 Experimental Setup

10 Collapsed Variational Inference
  10.1 Introduction
    10.1.1 Motivation
  10.2 Collapsed Variational Inference
  10.3 Upper Bound by Logarithmic Transformation
  10.4 Experiments
    10.4.1 Experimental Setting
    10.4.2 Results
  10.5 Conclusion

11 Linear Time Computation of Moments
  11.1 Exact Posterior Has Exponentially Many Modes
  11.2 The Hardness of Computing Moments
  11.3 Linear Time Reduction from Moment Computation to Joint Inference
  11.4 Efficient Polynomial Evaluation by Differentiation
  11.5 Dynamic Programming: from Quadratic to Linear
  11.6 Applications in Online Moment Matching
  11.7 Conclusion

III Conclusion

12 Conclusion and Future Work
  12.1 Information Analysis of Invariant Representation Learning
  12.2 Efficient Structure Learning of Probabilistic Circuits
  12.3 Unified Framework for Learning and Reasoning

Bibliography

List of Figures

1.1 A model of an intelligent agent, which consists of a knowledge base and an inference engine. Representation learning serves as a technique to help build our knowledge base that maps from real-world objects to their algebraic representations. The inference engine is powered by probabilistic reasoning that allows interactions between the agent and the environment.
1.2 Overview of my Ph.D. research (blue), ongoing research (mixed blue and gray) and future plans (gray) around the theme of representation learning and probabilistic reasoning. Artificial intelligence and machine learning are the core disciplines of my research theme. Topics on the left and right are related to representation learning and probabilistic reasoning, respectively. Topics about the underlying theory and fundamentals appear at the bottom. Applications are at the top. My long-term research goal is to develop a unified framework that can combine representation learning and probabilistic reasoning together.
2.1 An example of a sum-product network over two Boolean variables X1 and X2.
3.1 MDAN network architecture. Feature extractor, domain classifier, and task learning are combined in one training process. Hard version: the source that achieves the minimum domain classification error is backpropagated with gradient reversal; smooth version: all the domain classification risks over k source domains are combined and backpropagated adaptively with gradient reversal.
3.2 Counting results for target camera A (first row) and B (second row). X-axis: frames; Y-axis: counts.
3.3 PAD distance over different source domains along with their changes before and after training MDAN.
3.B.1 MDAN network architecture for digit classification.
3.C.1 Locations of the source and target cameras on the map.
4.1.1 A counterexample where invariant representations lead to large joint error on source and target domains. Before transformation of g(·), h*(x) = 1 iff x ∈ (−1/2, 3/2) achieves perfect classification on both domains. After transformation, source and target distributions are perfectly aligned, but no hypothesis can achieve a small joint error.
4.5.1 The label distributions of MNIST, USPS and SVHN.
4.5.2 Digit classification on MNIST, USPS and SVHN. The horizontal solid line corresponds to the target domain test accuracy without adaptation. The green solid line is the target domain test accuracy under domain adaptation with DANN. We also plot the least square fit (dashed line) of the DANN adaptation results to emphasize the negative slope.
5.4.1 Gains of our algorithms versus their base versions for the 100 tasks described in Section 5.4 (IWDAN/IWCDAN on the left/right). The x-axis represents D_JS(𝒟_S^Y, 𝒟_T^Y), the JSD between label distributions. Lines represent linear fits. The mean improvements over DANN (resp. CDAN) for IWDAN and IWDAN-O (resp. IWCDAN and IWCDAN-O) are 6.55% and 8.14% (resp. 2.25% and 2.81%).
5.A.1 Performance in % of our algorithms and their base versions. The x-axis represents D_JS(𝒟_S^Y, 𝒟_T^Y), the Jensen-Shannon distance between label distributions. Lines represent linear fits to the data. For both sets of algorithms, the larger the JSD, the larger the improvement.
5.A.2 Left: accuracy of various algorithms during training. Right: Euclidean distance between the weights estimated using Lemma 5.3.2 and the true weights. The plots correspond to averages over 5 seeds.
7.4.1 The error gap Δε, equalized odds gap ΔEO, demographic parity gap ΔDP and joint error ε₀ + ε₁ on the Adult dataset with λ ∈ {0.1, 1.0, 10.0, 100.0, 1000.0}.
7.4.2 The error gap Δε, equalized odds gap ΔEO, demographic parity gap ΔDP and joint error ε₀ + ε₁ on the COMPAS dataset with λ ∈ {0.1, 1.0, 10.0}.
8.2.1 Proof by picture: a language-invariant representation g induces the same feature distribution over 𝒵, which leads to the same output distribution over the target language Σ*_L. However, the parallel corpora of the two translation tasks in general have different marginal distributions over the target language, hence a triangle inequality over the output distributions gives the desired lower bound.
8.3.1 An encoder-decoder generative model of translation pairs. There is a global distribution 𝒟 over representation space 𝒵, from which sentences of language L_i are generated via decoder D_i. Similarly, sentences can also be encoded via E_i to 𝒵.
8.3.2 A translation graph H over K = 6 languages. The existence of an edge between a pair of nodes L_i and L_j means that the learner has been trained on the corresponding language pair. In this example the diameter of the graph is diam(H) = 4: L_3, L_1, L_4, L_5, L_6.
9.1.1 A complete and decomposable SPN is a mixture of induced trees. Double circles indicate univariate distributions over X1 and X2. Different colors are used to highlight unique induced trees; each induced tree is a product of univariate distributions over X1 and X2.
9.3.1 Information flow in the computation of the gradient of the log-network polynomial with respect to y_ij(w_ij). Each edge y_ij(w_ij) collects the evaluation value from v_j in the bottom-up pass and the differentiation value from v_i in the top-down pass.
9.4.1 Negative log-likelihood values versus number of iterations for PGD, EG, SMA and CCCP.
9.5.1 Graphical illustration of ∂f_𝒮(x | w)/∂f_{v_i}. The partial derivative of f_𝒮 with respect to f_{v_i} (in red) is a posynomial that is a product of edge weights lying on the path from the root to v_i and network polynomials from nodes that are children of product nodes on the path (highlighted in blue).
9.5.2 A counterexample of an SPN over two binary random variables where the weights w1, w2, w3 are symmetric and indistinguishable.
10.1.1 Graphical model representation of an SPN 𝒮. The box is a plate that represents replication over D training instances. m = O(|𝒮|) corresponds to the number of sum nodes and n is the number of observable variables in 𝒮. Typically, m ≫ n. W, H, X correspond to global latent variables, local latent variables and observable variables, respectively.

List of Tables

3.1 Sentiment classification accuracy.
3.2 Accuracy on digit classification. T: MNIST; M: MNIST-M; S: SVHN; D: SynthDigits.
3.3 Counting error statistics. S is the number of source cameras; T is the target camera id.
3.A.1 Network parameters for proposed and baseline methods.
3.A.2 p-values under the Wilcoxon test.
5.2.1 Common assumptions in the domain adaptation literature.
5.4.1 Average results on the various domains (Digits has 2 tasks, Visda 1, Office-31 6 and Office-Home 12). The prefix s denotes the experiment where the source domain is subsampled to increase D_JS(𝒟_S^Y, 𝒟_T^Y). Each number is a mean over 5 seeds; the subscript denotes the fraction of times (out of 5 seeds × #tasks) our algorithms outperform their base versions.
5.4.2 Ablation study on the original and subsampled Digits data.
5.5.1 List of IPMs with different ℱ. ‖·‖_Lip denotes the Lipschitz seminorm and ℋ is a reproducing kernel Hilbert space (RKHS).
5.A.1 Results on the Digits tasks. M and U stand for MNIST and USPS; the prefix s denotes the experiment where the source domain is subsampled to increase D_JS(𝒟_S^Y, 𝒟_T^Y).
5.A.2 Results on the Visda domain. The prefix s denotes the experiment where the source domain is subsampled to increase D_JS(𝒟_S^Y, 𝒟_T^Y).
5.A.3 Results on the Office dataset.
5.A.4 Results on the subsampled Office dataset.
5.A.5 Results on the Office-Home dataset.
5.A.6 Results on the subsampled Office-Home dataset.
5.A.7 Jensen-Shannon divergence between the label distributions of the Digits and Visda tasks.
5.A.8 Jensen-Shannon divergence between the label distributions of the Office-31 tasks.
5.A.9 Jensen-Shannon divergence between the label distributions of the Office-Home tasks.
6.2.1 List of different f-divergences and their corresponding properties. D_KL(𝒫 ‖ 𝒬) denotes the KL-divergence of 𝒬 from 𝒫, and ℳ := (𝒫 + 𝒬)/2 is the average distribution of 𝒫 and 𝒬. Symm. stands for Symmetric and Tri. stands for Triangle Inequality.
6.4.1 Adversarial debiasing on demographic parity, joint error across groups, and accuracy parity.
7.4.1 Statistics about the Adult and COMPAS datasets.
7.A.1 Hyperparameters used in the Adult experiment.
7.A.2 Hyperparameters used in the COMPAS experiment.
9.3.1 Summary of PGD, EG, SMA and CCCP. Var. means the optimization variables.
9.4.1 Statistics of data sets and models. N is the number of variables modeled by the network, |𝒮| is the size of the network and D is the number of parameters to be estimated in the network. N × V/D means the ratio of the number of training instances times the number of variables to the number of parameters.
9.4.2 Sizes of SPNs produced by LearnSPN and ID-SPN.
9.4.3 Average log-likelihoods on test data. Highest log-likelihoods are highlighted in bold. ↑ shows statistically better log-likelihoods than CCCP and ↓ shows statistically worse log-likelihoods than CCCP. The significance is measured based on the Wilcoxon signed-rank test.
9.A.1 Running time of 4 algorithms on 20 data sets, measured in seconds.
10.4.1 Statistics of data sets and models. n is the number of observable random variables modeled by the network, |𝒮| is the size of the network and p is the number of parameters to be estimated. n × D/p means the ratio of the number of training instances times the number of variables to the number of parameters.
10.4.2 Average log-likelihoods on test data. Highest average log-likelihoods are highlighted in bold. ↑/↓ are used to represent statistically better/worse results than (O)CVB-SPN, respectively.

Chapter 1

Introduction

The recent decade has witnessed phenomenal success in artificial intelligence. In particular, deep learning has gained an unprecedented impact across both research and industry communities by demonstrating better-than-human performance on various kinds of real-world competitions, e.g., the ImageNet recognition task (Krizhevsky et al., 2012), the Stanford question answering competition (Rajpurkar et al., 2016), and the board game Go (Silver et al., 2017). As a result, machine learning tools have been widely adopted to help decision making in various real-world scenarios, e.g., face recognition, machine translation, college admission, etc. While these empirical achievements are exciting, whether the learned model can deliver its promise in real-world scenarios crucially depends on the data used to train the model. Often, however, it is computationally expensive, or sometimes infeasible, to collect labeled data under all possible real-world scenarios. As a result of this distributional shift between the training and test data, the learned model may fail dramatically in practice. Perhaps more importantly, in high-stakes settings such as loan approvals, criminal justice and hiring, if the data used to train the model contain historical bias, then the learned model, without bias mitigation, can only exacerbate the existing discrimination. Hence, the ability to learn representations that are invariant to changes in the environment is crucial, and can provide robustness against various noise and nuisance factors in the real world.

On the other hand, learning is only one of the two most fundamental cognitive abilities of intelligent agents. An intelligent agent needs to have the ability to learn from experience, as well as the ability to reason from what has been learned. As foreseen by the Turing Award laureate Prof. Leslie Valiant (Valiant, 2018), one of the key challenges for AI in the coming decades is the development of integrated learning and reasoning mechanisms. However, classic symbolic reasoning cannot model the inherent uncertainty that ubiquitously exists, and it is not robust to noisy observations. Perhaps more fundamentally, inference and reasoning are computationally intractable in general. As a result, learning, which often takes inference/reasoning as a sub-procedure, is also hard.

This thesis is centered around advancing the frontier of AI in both representation learning and probabilistic reasoning. As shown in Figure 1.1, we believe an intelligent agent should consist of a knowledge base and an inference engine, and the agent interacts with the environment in a loop. In each iteration, the agent receives input from the environment and uses its knowledge base to map the real-world observations or queries to its internal algebraic representations. The inference engine then carries out probabilistic reasoning to answer the query or choose an action to execute in the environment. The long-term research goal of this thesis is thus to build a unified framework that provides a common semantics for learning and reasoning, by developing invariant representations that generalize across different environments as well as an efficient inference engine that allows exact and tractable probabilistic reasoning.

[Figure 1.1: diagram of the agent-environment loop. A Knowledge Base (Representation Learning) receives observations and queries from the Environment and feeds an Inference Engine (Probabilistic Reasoning), which returns actions and answers to the Environment.]

Figure 1.1: A model of an intelligent agent, which consists of a knowledge base and an inference engine. Representation learning serves as a technique to help build our knowledge base that maps from real-world objects to their algebraic representations. The inference engine is powered by probabilistic reasoning that allows interactions between the agent and the environment.

1.1 Main Contributions of This Thesis

Within the broad area of artificial intelligence and machine learning, my Ph.D. research primarily spans two themes: invariant representation learning and tractable probabilistic reasoning. Invariant representation learning serves as a bridge connecting abstract objects in the real world to their corresponding algebraic representations, which are amenable to computation and allow generalization across different environments. Tractable probabilistic reasoning aims to provide an inference mechanism for exact and efficient reasoning under uncertainty. However, it is not well understood what the fundamental limit of invariant representations is in terms of task utility, and it is well known that even approximate probabilistic reasoning is computationally intractable in the worst case (Roth, 1996). Building on fundamental concepts from information theory and theoretical computer science, my work aims to understand the inherent tradeoff between utility and invariance in learning representations, and to develop efficient algorithms for learning tractable and exact probabilistic inference machines.

The key contributions of my thesis research are as follows, and are summarized in Figure 1.2.
1. Analyzed and proved the fundamental limit of learning invariant representations in terms of task utility. With this result, we also identify and explain the inherent tradeoffs in learning domain-invariant representations for unsupervised domain adaptation (Chapter 4 and Chapter 5), learning fair representations for algorithmic fairness (Chapter 6), learning representations for privacy preservation under attribute-inference attacks (Zhao et al., 2019b), and learning multilingual sentence representations for universal machine translation (Chapter 8).
2. Developed an algorithm for learning domain-invariant representations for unsupervised domain adaptation under multiple different source environments (Chapter 3) using adversarial neural networks. To mitigate bias in automated decision making systems, my coauthors and I also proposed an algorithm to learn fair representations that can simultaneously guarantee accuracy parity and equalized odds (Hardt et al., 2016) among different demographic groups (Chapter 7). Analogously, we also developed an algorithm that learns invariant representations to filter out sensitive information, with guarantees on the inference error of malicious adversaries (Zhao et al., 2019b).

3. Established the first equivalence between Sum-Product Networks (Poon and Domingos, 2011), Bayesian networks with algebraic decision diagrams, and mixture models (Zhao et al., 2015b, 2016b) (Chapter 9). Inspired by our theoretical results, we proposed efficient learning algorithms for Sum-Product Networks in offline (Zhao et al., 2016b) (Chapter 9) and online (Rashwan et al., 2016b; Zhao and Gordon, 2017) (Chapter 11) settings, in discrete (Zhao et al., 2016b) and continuous (Jaini et al., 2016) settings, and from both frequentist and Bayesian principles (Chapter 10).

[Figure 1.2: diagram of research topics. Applications (multilingual machine translation, reading comprehension, speech enhancement, vehicle counting, question answering, document modeling, statistical relational learning) appear at the top; foundational topics (invariant representation learning, domain-invariant and fair representations, privacy-preserving learning, multitask learning, representational power, structure and parameter learning of Sum-Product Networks, and the unified learning-and-reasoning framework) appear at the bottom.]

Figure 1.2: Overview of my Ph.D. research (blue), ongoing research (mixed blue and gray) and future plans (gray) around the theme of representation learning and probabilistic reasoning. Artificial intelligence and machine learning are the core disciplines of my research theme. Topics on the left and right are related to representation learning and probabilistic reasoning, respectively. Topics about the underlying theory and fundamentals appear at the bottom. Applications are at the top. My long-term research goal is to develop a unified framework that can combine representation learning and probabilistic reasoning together.

Broadly, all the results above enjoy sound theoretical guarantees and provide insights towards a better understanding of learning invariant representations and building tractable inference machines. On the practical side, I have also devoted much effort to developing software tools for the proposed algorithms and sharing them publicly with the community. For instance, MDAN (Zhao et al., 2018b) (for domain adaptation with multiple sources) has been successfully used as a benchmark algorithm in various follow-up work on unsupervised domain adaptation for vision (Zhao et al., 2019i) and language (Chen and Cardie, 2018).

1.2 Overview of Part I: Invariant Representation Learning

Many machine learning applications involve learning representations that achieve two competing goals: to maximize information or accuracy with respect to a subset of features (e.g., for prediction), while simultaneously maximizing invariance or independence with respect to another, potentially overlapping, subset of features (e.g., for fairness, domain identity, etc.). Typical examples include privacy-preserving learning, multilingual machine translation, domain adaptation, and algorithmic fairness, just to name a few. In fact, all of the above problems admit a common minimax game-theoretic formulation, whose equilibrium represents a fundamental tradeoff between accuracy and invariance.

In this part of the thesis, we provide an information-theoretic analysis of this general and important problem under different application settings, including domain adaptation, algorithmic fairness, and multilingual machine translation. In each of these applications, we analyze the inherent tradeoff between accuracy and invariance by providing lower bounds on the error rates that any classifier has to incur when invariance is satisfied. The general principle here could be understood as a kind of uncertainty principle for learning with invariance constraints: if perfect invariance is attained, then any classifier, even with unbounded computational resources and labeled data, has to incur a large error on at least one subset of the overall population. Informally, if we instantiate this general principle to domain adaptation, multilingual machine translation and algorithmic fairness respectively, we immediately obtain the following implications:
1. In domain adaptation with one source and one target domain, any classifier based on domain-invariant representations has to incur a large error on at least one of the two domains.
2. In multilingual machine translation with a set of different languages, any decoder based on language-invariant sentence representations has to incur a large translation error on at least one of the translation pairs.
3. In a population with two groups, e.g., black and white, if the base rates differ between these two subgroups, then any fair classifier (in the sense of statistical parity) has to achieve a large error on at least one of the groups.

To complement our negative results, in this part of the thesis we also discuss what kinds of problem structure allow us to circumvent the above tradeoffs. In particular, we shall discuss how to design algorithms that can guarantee equalized treatment of individuals between groups, and what additional problem structure is required to permit efficient domain adaptation and machine translation through learning invariant representations.

1.2.1 Learning Domain-Invariant Representations

The success of machine learning has been partially attributed to rich labeled datasets and powerful computation. Unfortunately, collecting and annotating such large-scale training data is prohibitively expensive and time-consuming. To overcome these limitations, different labeled datasets can be combined to build a larger one. However, due to the potential distributional shift between different datasets, models trained on the combined one may still suffer from large generalization error on a target domain different from the training domain. Our work on domain adaptation focuses on understanding the limits of knowledge transfer from a labeled source environment to an unlabeled target environment by learning domain-invariant representations to bridge the gap (Adel et al., 2017; Zhao et al., 2018b, 2019f). The main idea of domain-invariant learning is simple and intuitive: we would like to learn representations that are invariant to the differences among environments while still containing rich information for the desired task.

Theoretically, my research sheds new light on this problem by proving an information-theoretic lower bound on the joint error of any domain adaptation algorithm that attempts to learn invariant representations: there is a fundamental tradeoff between learning domain-invariant representations and achieving small joint error in both environments when the difference between the environments can be used to explain the target task (Zhao et al., 2019e). Specifically, our result implies that any algorithm based on domain-invariant representations has to incur a large error on at least one of the environments. This result characterizes the fundamental limit of the joint utility when learning with domain-invariant representations.
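As an informal preview of this lower bound (the precise statement and its conditions appear in Chapter 4), for a feature transformation g and any hypothesis h on top of it, the joint error obeys a bound of the form

$$
\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) \;\geq\; \frac{1}{2}\Big(d_{\mathrm{JS}}\big(\mathcal{D}^Y_S, \mathcal{D}^Y_T\big) - d_{\mathrm{JS}}\big(\mathcal{D}^Z_S, \mathcal{D}^Z_T\big)\Big)^2,
$$

where d_JS denotes the Jensen-Shannon distance and 𝒟^Y, 𝒟^Z are the label and feature distributions, respectively: the more invariant the learned features are (small second term) while the label distributions across domains differ (large first term), the larger the unavoidable joint error.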

1.2.2 Learning Fair Representations

With the prevalence of machine learning applications in high-stakes domains, e.g., criminal justice, medical testing, online advertising, etc., it is crucial to ensure that automated decision making systems do not propagate existing bias or discrimination that might exist in the historical data used to train these systems. Among many recent proposals for achieving different notions of algorithmic fairness, learning fair representations has received increasing attention due to recent advances in learning rich representations with deep neural networks. At a high level, the underlying idea is that if representations from different groups are similar to each other, then any predictive model on top of them will certainly make decisions independent of group membership.

On the other hand, while it has long been empirically observed (Calders et al., 2009) that there is an underlying tradeoff between utility and fairness, theoretical understanding is lacking. In my recent work (Zhao and Gordon, 2019), I provided the first theoretical result that characterizes the inherent tradeoff between utility and fairness. More precisely, our result shows that any fair algorithm, in the sense of demographic parity, admits an information-theoretic lower bound on the joint error across different demographic groups. To escape this inherent tradeoff, we also propose an alternative algorithm (Zhao et al., 2019c) that learns conditionally group-invariant representations. The proposed algorithm constructs a classifier that is no worse than the optimal classifier in terms of demographic parity gap, and can achieve equalized false positive/negative rates and accuracy parity across different demographic groups simultaneously.
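To give a flavor of this lower bound (stated precisely in Chapter 6), write ε_a(Ŷ) for the error of a classifier Ŷ on the demographic group with sensitive attribute A = a. Then, informally, any Ŷ satisfying demographic parity obeys

$$
\varepsilon_0(\widehat{Y}) + \varepsilon_1(\widehat{Y}) \;\geq\; \Delta_{\mathrm{BR}} := \big|\Pr(Y = 1 \mid A = 0) - \Pr(Y = 1 \mid A = 1)\big|,
$$

so whenever the base rates differ across the two groups, the joint error across groups is bounded away from zero, regardless of the amount of data or computation available.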

1.2.3 Learning Multilingual Representations

The goal of universal machine translation is to learn to translate between any pair of languages, given a corpus of paired translated documents for a small subset of all pairs of languages. Despite impressive empirical results and an increasing interest in massively multilingual models, theoretical analysis of the translation errors made by such universal machine translation models is only nascent. In this thesis we formally prove certain impossibilities of this endeavor in general, as well as prove positive results in the presence of additional (but natural) structure in the data. For the former, we derive a lower bound on the translation error in the many-to-many translation setting, which shows that any algorithm aiming to learn shared sentence representations among multiple language pairs has to make a large translation error on at least one of the translation tasks, if no assumption about the structure of the languages is made. For the latter, we show that if the paired documents in the corpus follow a natural encoder-decoder generative process, we can expect a natural notion of "generalization": a linear number of language pairs, rather than quadratic, suffices to learn a good representation. Our theory also explains what kinds of connection graphs between pairs of languages are better suited: ones with longer paths result in worse sample complexity in terms of the total number of documents per language pair needed.

1.3 Overview of Part II: Tractable Probabilistic Reasoning

Exact marginal and conditional inference in classic probabilistic graphical models (PGMs), including Bayesian Networks (BNs) and Markov Networks (MNs), is #P-complete (Roth, 1996). As a result, practitioners usually need to resort to various approximate inference schemes to ensure computational tractability. Sum-Product Networks (SPNs) have been proposed as tractable deep models (Poon and Domingos, 2011) for exact probabilistic inference. They belong to a more general family of unified models for learning and reasoning, known as probabilistic circuits (Van den Broeck, 2018). SPNs, like other models within the family of probabilistic circuits, distinguish themselves from other types of probabilistic graphical models by the fact that inference can be done exactly in linear time with respect to the size of the circuit. This has generated a lot of interest, since inference is not only a powerful tool for reasoning under uncertainty, but also a core task for parameter estimation and structure learning.

In this part of the thesis, we focus on both theoretical and practical aspects of learning tractable probabilistic circuits, including investigating the representational power of SPNs and the parameter learning of SPNs from both frequentist and Bayesian perspectives. In what follows we first briefly summarize the problems we consider and highlight the results in the rest of this chapter.
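To make the linear-time claim concrete, here is a minimal, self-contained sketch of bottom-up SPN evaluation in Python. The structure, weights, and leaf probabilities are illustrative only (a two-component mixture over two Boolean variables, in the spirit of Figure 2.1), not a model taken from the thesis.

```python
# Minimal SPN evaluation sketch: each node is visited once, so a joint or
# marginal query costs time linear in the circuit size. (For DAG-structured
# SPNs, memoize node values so shared nodes are still evaluated once.)

class Leaf:
    """Univariate Bernoulli distribution over one Boolean variable."""
    def __init__(self, var, p_true):
        self.var, self.p_true = var, p_true

    def value(self, x):
        if x[self.var] is None:         # variable marginalized out:
            return 1.0                  # a marginalized leaf evaluates to 1
        return self.p_true if x[self.var] else 1.0 - self.p_true

class Sum:
    """Weighted sum over children (a mixture); weights sum to one."""
    def __init__(self, children, weights):
        self.children, self.weights = children, weights

    def value(self, x):
        return sum(w * c.value(x) for c, w in zip(self.children, self.weights))

class Product:
    """Product over children with disjoint scopes (decomposability)."""
    def __init__(self, children):
        self.children = children

    def value(self, x):
        out = 1.0
        for c in self.children:
            out *= c.value(x)
        return out

# A toy two-component mixture over Boolean variables X1 (index 0), X2 (index 1).
spn = Sum(
    [Product([Leaf(0, 0.9), Leaf(1, 0.3)]),
     Product([Leaf(0, 0.2), Leaf(1, 0.8)])],
    [0.6, 0.4],
)

print(spn.value({0: True, 1: False}))   # joint query Pr(X1 = 1, X2 = 0)
print(spn.value({0: True, 1: None}))    # marginal query Pr(X1 = 1), exactly
```

The only change between a joint query and a marginal query is that the marginalized variables' leaves evaluate to 1, which is exactly why a single bottom-up pass suffices for exact marginal inference.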

1.3.1 A Unified Framework for Parameter Learning

In Chapter 9 we present a unified framework, which treats two of the previous algorithms (Gens and Domingos, 2012; Poon and Domingos, 2011) as special cases, for learning the parameters of SPNs. We prove that any complete and decomposable SPN is equivalent to a mixture of trees, where each tree corresponds to a product of univariate distributions. Based on this mixture model perspective, we can precisely characterize the functional form of the objective function in terms of the network structure. We show that the optimization problem associated with learning the parameters of SPNs based on the MLE principle can be formulated as a signomial program (SP), where both projected gradient descent (PGD) and exponentiated gradient (EG) can be viewed as first order approximations of the signomial program after suitable transformations of the objective function. We also show that the signomial program formulation can be equivalently transformed into a difference of convex functions (DCP) formulation, where the objective function of the program can be naturally expressed as a difference of two convex functions. The DCP formulation allows us to develop two efficient optimization algorithms for learning the parameters of SPNs based on sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. Both proposed approaches naturally admit multiplicative updates and hence effectively deal with the positivity constraints of the optimization. Furthermore, under our unified framework, we also show that CCCP leads to the same algorithm as EM, even though these two approaches are different from each other in general. Although we mainly focus on MLE-based parameter learning, the mixture model interpretation of SPNs also helps to develop a Bayesian learning method for SPNs (Zhao et al., 2016a). PGD, EG, SMA and CCCP can all be viewed as different levels of convex relaxation of the original SP. Hence the framework also provides an intuitive way to compare all four approaches.
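As a rough sketch of why the updates are multiplicative (the full derivation and convergence analysis are in Chapter 9), the CCCP/EM update for the weight w_ij on the edge from sum node v_i to child v_j has, schematically, the form

$$
w_{ij} \;\leftarrow\; \frac{1}{Z_i}\, w_{ij} \sum_{\mathbf{x} \in \text{data}} \frac{f_{v_j}(\mathbf{x})}{f_{\mathcal{S}}(\mathbf{x})}\, \frac{\partial f_{\mathcal{S}}(\mathbf{x})}{\partial f_{v_i}},
$$

where f_𝒮 is the network polynomial and Z_i normalizes the weights of v_i to sum to one. Every factor is nonnegative, so positivity is preserved automatically, without the projection step that PGD requires.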

1.3.2 Collapsed Variational Inference

In Chapter 10 we propose a collapsed variational inference algorithm for SPNs that is robust to overfitting and can be naturally extended into a stochastic variant to scale to large data sets. We call our algorithm CVB-SPN. CVB-SPN is a deterministic approximate Bayesian inference algorithm that is computationally efficient and easy to implement, while at the same time allowing us to incorporate prior information into the design. Like other variational techniques, CVB-SPN is based on optimization. Unlike other variational techniques, the number of parameters to be optimized is only linear in the network size, O(|𝒮|), and is independent of the size of the training set, as opposed to O(D|𝒮|) in ordinary VI, where D is the size of the training set. It is worth noting that, different from traditional collapsed variational approaches in the literature that marginalize out global latent variables (Teh et al., 2006, 2007), here we consider a complementary approach: instead of marginalizing out the global latent variables in order to spread out the interactions among many local latent variables, CVB-SPN takes advantage of the fast exact inference in SPNs and marginalizes out the local latent variables in order to maintain a marginal variational posterior distribution directly on the global latent variables, i.e., the model parameters. The posterior mean of the variational distribution over model parameters can then be used as a Bayesian estimator. To the best of our knowledge, this is the first general Bayesian approach to learn the parameters of SPNs efficiently in both batch and stochastic settings.

At first glance, marginalizing out all the local latent variables in graphical models seems to be a bad idea, for two reasons. First, by marginalizing out local latent variables we naively appear to incur computation exponential in the tree-width of the graph (Jordan et al., 1999; Wainwright and Jordan, 2008). Except for graphical models with special structures, for example, LDA, Gaussian mixture models, thin junction trees, etc., such exact computation is intractable by itself. Second, marginalization will in general invalidate the conjugacy between the prior distribution and the joint distribution over global and local latent variables, which further makes the expectation over the variational posterior intractable. Fortunately, the ability of SPNs to model context-specific independence helps to solve the first problem, and by using a change of variables CVB-SPN handles the second problem efficiently. Besides the reduced space complexity, we also show that the objective of CVB-SPN forms a strictly better lower bound to be optimized than the evidence lower bound (ELBO) in standard VI. To show the validity of CVB-SPN, we conduct extensive experiments and compare it with maximum likelihood based methods in both batch and stochastic settings.

To tackle the second problem described above, there is a strand of recent work on extending standard VI to general nonconjugate settings using stochastic sampling from the variational posterior (Blei et al., 2012; Kingma and Welling, 2013; Mnih and Gregor, 2014; Ranganath et al., 2014; Titsias and Lázaro-Gredilla, 2014; Titsias, 2015). However, control variates need to be designed in order to reduce the high variance incurred by insufficient samples. Furthermore, for each such sample from the variational posterior, those algorithms need to go through the whole training data set to compute the stochastic gradient, leading to a total computational cost of O(ND|𝒮|), where N is the sample size. This is often prohibitively expensive for SPNs.
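Schematically, the bound ordering mentioned above follows from two applications of Jensen's inequality (the precise collapsed objective, with the logarithmic change of variables, is derived in Chapter 10):

$$
\log p(\mathbf{x}) \;\geq\; \mathbb{E}_{q(\mathbf{w})}\!\left[\log \frac{p(\mathbf{x} \mid \mathbf{w})\, p(\mathbf{w})}{q(\mathbf{w})}\right] \;\geq\; \mathbb{E}_{q(\mathbf{w})\, q(\mathbf{h})}\!\left[\log \frac{p(\mathbf{x}, \mathbf{h} \mid \mathbf{w})\, p(\mathbf{w})}{q(\mathbf{h})\, q(\mathbf{w})}\right],
$$

where w are the global parameters and h the local latent variables. The middle quantity is the collapsed bound: it is at least as tight as the standard ELBO on the right, and it is computable for SPNs precisely because the likelihood p(x | w) can be evaluated exactly in linear time.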

1.3.3 Linear Time Computation of Moments

Most existing batch learning algorithms for SPNs can be straightforwardly adapted to the online setting, where the network updates its parameters after it receives one instance at each time step. This online learning setting makes SPNs more widely applicable in various real-world scenarios, including the cases where either the data set is too large to store at once or the network needs to adapt to changes in the external data distribution. Recently, Rashwan et al. (2016b) proposed an online Bayesian Moment Matching (BMM) algorithm to learn the probability distribution of the model parameters of SPNs based on the method of moments. Later, Jaini et al. (2016) extended this algorithm to the continuous case, where the leaf nodes in the network are assumed to be Gaussian distributions. At a high level, BMM can be understood as an instance of the general assumed density filtering framework (Sorenson and Stubberud, 1968), where the algorithm finds an approximate posterior distribution within a tractable family of distributions by the method of moments. Specifically, BMM for SPNs works by matching the first and second order moments of the approximate tractable posterior distribution to those of the exact but intractable posterior.

An essential sub-routine of the above two algorithms (Jaini et al., 2016; Rashwan et al., 2016b) is to efficiently compute the exact first and second order moments of the one-step update posterior distribution (cf. Section 11.2). Rashwan et al. (2016b) designed a recursive algorithm to achieve this goal in linear time when the underlying network structure is a tree, and this algorithm is also used by Jaini et al. (2016) in the continuous case. However, the algorithm only works when the underlying network structure is a tree, and a naive computation of such moments in a DAG scales quadratically with the network size. Often this quadratic computation is prohibitively expensive even for SPNs of moderate size.

In Chapter 11 we propose a linear time (and space) algorithm that is able to compute any moments of all the network parameters simultaneously, even when the underlying network structure is a DAG (Zhao and Gordon, 2017). There are three key ingredients in the design and analysis of our algorithm: (1) a linear time reduction from the moment computation problem to the joint inference problem in SPNs; (2) a succinct procedure for evaluating a polynomial by differentiation, without expanding it; and (3) a dynamic programming method to further reduce the quadratic computation to linear. The differential approach (Darwiche, 2003) used for polynomial evaluation can also be applied for exact inference in Bayesian networks. This technique has also been implicitly used in the recent development of a concave-convex procedure (CCCP) for optimizing the weights of SPNs (Zhao et al., 2016b). Essentially, by reducing the moment computation problem to a joint inference problem in SPNs, we are able to exploit the fact that the network polynomial of an SPN computes a multilinear function of the model parameters, so we can efficiently evaluate this polynomial by differentiation even if it contains exponentially many monomials, provided that it admits a tractable circuit complexity. Dynamic programming can be further used to trade a constant factor in space complexity (two additional copies of the network) for reducing the quadratic time complexity to linear, so that all the edge moments can be computed simultaneously in two passes over the network.

To demonstrate the usefulness of our linear time sub-routine for computing moments, we apply it to design an efficient assumed density filter (ADF) (Sorenson and Stubberud, 1968) to learn the parameters of SPNs in an online fashion. ADF runs in linear time and space thanks to our efficient sub-routine. As an additional contribution, we also show that ADF and BMM can both be understood under a general framework of moment matching, where the only difference lies in the moments chosen to be matched and how to match the chosen moments.
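As a sketch of the differentiation idea behind ingredients (2) and (3), the snippet below evaluates a toy SPN bottom-up and then accumulates ∂f_𝒮/∂f_v for every node in a single top-down pass, so all node (and hence edge) derivatives come out of two linear-time passes. The array-based circuit is a hypothetical example, not one of the thesis's models.

```python
# Nodes listed in topological order (children before parents). Each node is
# ('leaf', value) | ('prod', [child indices]) | ('sum', [(child, weight), ...]).
nodes = [
    ('leaf', 0.9),                     # 0
    ('leaf', 0.3),                     # 1
    ('leaf', 0.2),                     # 2
    ('leaf', 0.8),                     # 3
    ('prod', [0, 1]),                  # 4
    ('prod', [2, 3]),                  # 5
    ('sum',  [(4, 0.6), (5, 0.4)]),    # 6 (root)
]

# Pass 1 (bottom-up): evaluate every node once.
val = [0.0] * len(nodes)
for i, (kind, arg) in enumerate(nodes):
    if kind == 'leaf':
        val[i] = arg
    elif kind == 'prod':
        v = 1.0
        for c in arg:
            v *= val[c]
        val[i] = v
    else:  # 'sum'
        val[i] = sum(w * val[c] for c, w in arg)

# Pass 2 (top-down): accumulate d f_root / d f_v for all nodes simultaneously.
deriv = [0.0] * len(nodes)
deriv[-1] = 1.0                        # derivative of the root w.r.t. itself
for i in reversed(range(len(nodes))):
    kind, arg = nodes[i]
    if kind == 'sum':
        for c, w in arg:
            deriv[c] += w * deriv[i]
    elif kind == 'prod':
        for c in arg:
            # product of the sibling values; assumes val[c] != 0 in this sketch
            deriv[c] += deriv[i] * (val[i] / val[c])

print(val[-1])   # network value f_root(x)
print(deriv)     # all partial derivatives, from two linear passes
```

Chapter 11 builds on exactly this pattern: moments of the edge weights are read off from the values and derivatives collected in these two passes, instead of being recomputed edge by edge, which is what makes the naive approach quadratic.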

1.4 Bibliographic Notes

The research presented in this thesis is based on joint work with several co-authors, described below. This thesis only includes work for which this author was the primary contributor, or one of the primary contributors. Chapter 3 is based on joint work with Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura and Geoff Gordon. Chapter 4 is based on joint work with Remi Tachet des Combes, Kun Zhang and Geoff Gordon. Chapter 5 is based on joint work with Remi Tachet des Combes, Yu-Xiang Wang and Geoff Gordon. Chapter 6 is based on joint work with Geoff Gordon. Chapter 7 is based on joint work with Amanda Coston, Tameem Adel and Geoff Gordon. Chapter 8 is based on joint work with Junjie Hu and Andrej Risteski. Chapter 9 is based on joint work with Pascal Poupart and Geoff Gordon. Chapter 10 is based on joint work with Tameem Adel, Geoff Gordon and Brandon Amos. Finally, Chapter 11 is based on joint work with Geoff Gordon.

1.5 Excluded Research

In order to keep this thesis succinct and coherent, I have excluded a significant portion of the research I conducted during my Ph.D. The excluded works are listed below.
– Online and Distributed Learning of Sum-Product Networks in both discrete and continuous settings (Jaini et al., 2016; Rashwan et al., 2016b).
– Unsupervised Domain Adaptation (Adel et al., 2017; Zhao et al., 2019f).
– Frank-Wolfe for Symmetric-NMF under Simplicial Constraint (Zhao and Gordon, 2018).
– Convolutional-Recurrent Neural Networks for Speech Enhancement (Zhao et al., 2018a).
– Learning Neural Networks with Adaptive Regularization (Zhao et al., 2019e).
– Multitask Feature and Relationship Learning (Zhao et al., 2017).
– Strategyproof Conference Peer Review (Xu et al., 2018).
– Active learning on graphs (Liang et al., 2018).
– Fair clustering for visual learning (Li et al., 2020).
– Dynamic and interpretable postoperative complication risk scoring (Wang et al., 2020).

– Continual Learning with Adaptive Weights (Adel et al., 2019b).
– Adversarial privacy preservation under attribute inference attack (Zhao et al., 2019a).


Chapter 2

Preliminaries

In this chapter we provide some background knowledge about the learning problems presented in this thesis. In particular, after introducing the notation used throughout the thesis, we give a brief introduction to the problem of unsupervised domain adaptation, learning with fairness constraints, Sum-Product Networks (SPNs), and signomial programming. More detailed definitions of these concepts will be introduced in the corresponding chapters when necessary. The purpose of this chapter is to give a brief introduction to the models and preliminaries that we shall use throughout the entire thesis; however, each chapter is self-contained and has more detailed definitions, with variants tailored to the specific problems considered in the chapter. This chapter ends with a brief introduction to the technical tools from statistical learning theory and information theory that will be used in this thesis.

2.1 Notation

We now describe some notation that we employ throughout this thesis. For an integer n ∈ N, we use [n] to abbreviate the set {1, 2, . . . , n}. We use a capital letter X to denote a random variable and a bold capital letter X to denote a random vector. Similarly, a lowercase letter x is used to denote a value taken by X, and a bold lowercase letter x denotes a joint value taken by the corresponding vector X of random variables. The indicator variable I_A equals 1 iff the event A is true; otherwise it takes the default value 0. For a boolean variable X_i, we use I_{x_i} to denote I_{X_i=1} and I_{x̄_i} to denote I_{X_i=0}, respectively. To simplify the notation, we use Pr(x) to mean Pr(X = x), and similarly for random vectors. Graphs are often denoted as G. Throughout the thesis, we use the standard notation R_+ to denote the set of nonnegative reals and R_{++} to denote the set of strictly positive reals.

For typical learning problems, we use X and Y to denote the input and output space, respectively. Similarly, Z stands for the representation space induced from X by a feature transformation g : X → Z. Accordingly, we use X, Y, Z to denote the random variables which take values in X, Y, Z, respectively. Throughout the thesis, we use the words domain and distribution interchangeably. In particular, a domain corresponds to a distribution D on the input space X and a labeling function f : X → [0, 1]. In the domain adaptation setting, we use ⟨D_S, f_S⟩ and ⟨D_T, f_T⟩ to denote the source and target domains, respectively. A hypothesis is a function h : X → {0, 1}. The error of a hypothesis h w.r.t. the labeling function f under distribution D_S is defined as ε_S(h, f) := E_{x∼D_S}[|h(x) − f(x)|]. When f and h are binary classification functions, this definition reduces to the probability that h disagrees with f under D_S: E_{x∼D_S}[|h(x) − f(x)|] = E_{x∼D_S}[I(f(x) ≠ h(x))] = Pr_{x∼D_S}(f(x) ≠ h(x)). For two functions g and h with compatible domains and ranges, we use h ∘ g to denote the function composition h(g(·)). SPNs and BNs are denoted by S and B, respectively. For a directed acyclic graph (DAG) G and a node v in G, we use G_v to denote the subgraph of G induced by v and all its descendants. Let V be a subset of the nodes of G; then G|_V is the subgraph of G induced by the node set V. Similarly, we use X|_A or x|_A to denote the restriction of a vector to a subset A of its indices. We use node and vertex, as well as arc and edge, interchangeably when we refer to a graph. Furthermore, |G| is used to denote the size of the graph G, which is the sum of the number of nodes and the number of edges in G. We also use tw(G) to represent the tree-width (Robertson and Seymour, 1984) of G.

2.2 Domain Adaptation

We first introduce the problem setup of domain adaptation and review a theoretical model for domain adaptation when there is one source and one target (Ben-David et al., 2007, 2010; Blitzer et al., 2008; Kifer et al., 2004). The key idea is to use the H-divergence to measure the discrepancy between two distributions. Other theoretical models for DA exist (Cortes and Mohri, 2014; Cortes et al., 2008; Mansour et al., 2009a,c); we choose to work with the above model because this distance measure has a particularly natural interpretation and can be well approximated using samples from both domains.

We define the risk of hypothesis h as the error of h w.r.t. the true labeling function under the source domain D_S, i.e., ε_S(h) := ε_S(h, f_S). As is common notation in computational learning theory, we use ε̂_S(h) to denote the empirical risk of h on the source domain. Similarly, we use ε_T(h) and ε̂_T(h) to mean the true risk and the empirical risk on the target domain. The H-divergence is defined as follows:

Definition 2.2.1. Let H be a hypothesis class for instance space X, and A_H be the collection of subsets of X that are the support of some hypothesis in H, i.e., A_H := {h⁻¹({1}) | h ∈ H}. The distance between two distributions D and D′ based on H is

    d_H(D, D′) := 2 sup_{A∈A_H} |Pr_D(A) − Pr_{D′}(A)|.

When the hypothesis class H contains all the possible measurable functions over X, d_H(D, D′)

reduces to the familiar total variation. Given a hypothesis class H, we define its symmetric difference w.r.t. itself as H∆H = {h(x) ⊕ h′(x) | h, h′ ∈ H}, where ⊕ is the XOR operation. Let h* be the optimal hypothesis that achieves the minimum combined risk on both the source and the target domains, h* := argmin_{h∈H} ε_S(h) + ε_T(h), and use λ to denote the combined risk of the optimal hypothesis h*: λ := ε_S(h*) + ε_T(h*). Ben-David et al. (2007) and Blitzer et al. (2008) proved the following generalization bound on the target risk in terms of the source risk and the discrepancy between the single source domain and the target domain:

Theorem 2.2.1 (Blitzer et al. (2008)). Let H be a hypothesis space of VC-dimension d, and let D̂_S (D̂_T) be the empirical distribution induced by a sample of size m drawn from D_S (D_T). Then w.p. at least 1 − δ, ∀h ∈ H,

    ε_T(h) ≤ ε̂_S(h) + (1/2) d_{H∆H}(D̂_S, D̂_T) + λ + O(√((d log(m/d) + log(1/δ)) / m)).    (2.1)

The bound depends on λ, the optimal combined risk that can be achieved by hypotheses in H. The intuition is that if λ is large, we cannot hope for successful domain adaptation. One notable feature is that the empirical discrepancy distance between two samples can be approximated by training a discriminator to distinguish instances from the two domains.
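As a concrete illustration of that last point, the proxy A-distance of Ben-David et al. (2007) replaces the supremum in the definition by the test error of a trained domain discriminator; the sketch below uses synthetic stand-in domains and an off-the-shelf linear classifier, so the numbers are purely illustrative.

```python
# A toy sketch (synthetic stand-in domains) of approximating the discrepancy between
# two samples with a discriminator: train a classifier to tell source from target,
# then convert its test error err into the proxy A-distance 2(1 - 2*err).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(500, 10))
target = rng.normal(0.5, 1.0, size=(500, 10))   # shifted mean: domains differ

X = np.vstack([source, target])
y = np.concatenate([np.zeros(500), np.ones(500)])   # domain labels, not task labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

err = 1.0 - LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
pad = 2.0 * (1.0 - 2.0 * err)   # near 0 when the domains are indistinguishable
print(f"discriminator error: {err:.3f}, proxy A-distance: {pad:.3f}")
```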

2.3 Algorithmic Fairness

We first introduce various definitions of fairness from the literature under the category of group fairness, where the group membership is given by a sensitive attribute A, e.g., race, gender, etc. Even in this context there are many possible definitions of fairness (Narayanan, 2018), and in what follows we provide a brief review of the ones that are most relevant to this thesis. To keep the notation consistent, for a ∈ {0, 1}, we use D_a to mean the conditional distribution of D given A = a. For an event E, D(E) denotes the probability of E under D. In particular, in the literature on fair machine learning, we call D(Y = 1) the base rate of distribution D, and we use ∆_BR(D, D′) := |D(Y = 1) − D′(Y = 1)| to denote the difference of the base rates between two distributions D and D′ over the same sample space. Given a feature transformation function g : X → Z that maps instances from the input space X to the feature space Z, we define g_♯D := D ∘ g⁻¹ to be the induced (pushforward) distribution of D under g, i.e., for any event E′ ⊆ Z, g_♯D(E′) := D(g⁻¹(E′)) = D({x ∈ X | g(x) ∈ E′}).

Definition 2.3.1 (Demographic Parity, a.k.a. Statistical Parity). Given a joint distribution D, a classifier Ŷ satisfies demographic parity if Ŷ is independent of A.

Demographic parity reduces to the requirement that D_0(Ŷ = 1) = D_1(Ŷ = 1), i.e., the positive outcome is given to the two groups at the same rate. When exact equality does not hold, we use the absolute difference between them as an approximate measure:

Definition 2.3.2 (DP Gap). Given a joint distribution D, the demographic parity gap of a classifier Ŷ is ∆_DP(Ŷ) := |D_0(Ŷ = 1) − D_1(Ŷ = 1)|.

Demographic parity is also known as statistical parity, and it has been adopted as a definition of fairness in a series of works (Calders et al., 2009; Edwards and Storkey, 2015; Johndrow et al., 2019; Kamiran and Calders, 2009; Kamishima et al., 2011; Louizos et al., 2015; Madras et al., 2018; Zemel et al., 2013). However, as we shall quantify precisely in Section 6.3, demographic parity may cripple the utility that we hope to achieve, especially in the common scenario where the base rates differ between the two groups, e.g., D_0(Y = 1) ≠ D_1(Y = 1) (Hardt et al., 2016). In light of this, an alternative definition is accuracy parity:

Definition 2.3.3 (Accuracy Parity). Given a joint distribution D, a classifier h satisfies accuracy parity if ε_{D_0}(h) = ε_{D_1}(h).

In the literature, a violation of accuracy parity is also known as disparate mistreatment (Zafar et al., 2017). Again, when h is a deterministic binary classifier, accuracy parity reduces to D_0(h(X) = Y) = D_1(h(X) = Y). Different from demographic parity, the definition of accuracy parity does not eliminate the perfect predictor in the case Y = A, even when the base rates differ between the two groups. When the costs of different error types matter, more refined definitions exist:

Definition 2.3.4 (Positive Rate Parity). Given a joint distribution D, a deterministic classifier h satisfies positive rate parity if D_0(h(X) = 1 | Y = y) = D_1(h(X) = 1 | Y = y), ∀y ∈ {0, 1}.

Positive rate parity is also known as equalized odds (Hardt et al., 2016), which essentially requires equal true positive and false positive rates between the different groups. Furthermore, Hardt et al. (2016) also defined true positive parity, or equal opportunity, to be D_0(h(X) = 1 | Y = 1) = D_1(h(X) = 1 | Y = 1), for when the positive outcome is desirable. Last but not least, predictive rate parity, also known as test fairness (Chouldechova, 2017), asks for an equal chance of positive outcomes across groups given predictions:

Definition 2.3.5 (Predictive Rate Parity). Given a joint distribution D, a probabilistic classifier h satisfies predictive rate parity if D_0(Y = 1 | h(X) = c) = D_1(Y = 1 | h(X) = c), ∀c ∈ [0, 1].

When h is a deterministic binary classifier that only takes values in {0, 1}, Chouldechova (2017) showed an intrinsic tradeoff between predictive rate parity and positive rate parity:

Theorem 2.3.1 (Chouldechova (2017)). Assume D_0(Y = 1) ≠ D_1(Y = 1). Then for any deterministic classifier h : X → {0, 1} that is not perfect, i.e., h(X) ≠ Y, positive rate parity and predictive rate parity cannot hold simultaneously.

A similar tradeoff for probabilistic classifiers has also been observed by Kleinberg et al. (2016), where the authors showed that for any non-perfect predictor, calibration and positive rate parity cannot be achieved simultaneously if the base rates differ across groups. Here a classifier h is said to be calibrated if D(Y = 1 | h(X) = c) = c, ∀c ∈ [0, 1], i.e., if we look at the set of data points that receive a predicted probability of c from h, we would like a c-fraction of them to be positive instances according to Y (Pleiss et al., 2017).
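For concreteness, the finite-sample analogues of these gaps can be computed directly from model predictions. The helper below is a hypothetical utility written for this chapter's notation (binary labels, predictions, and group memberships as 0-1 arrays), not code from the thesis.

```python
# A small sketch computing the demographic parity gap (Definition 2.3.2) and the
# positive rate parity gaps (Definition 2.3.4) from finite samples.
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    g0, g1 = (group == 0), (group == 1)
    # DP gap: |D0(Yhat = 1) - D1(Yhat = 1)|, estimated by group-wise positive rates.
    dp_gap = abs(y_pred[g0].mean() - y_pred[g1].mean())
    # Equalized odds gaps: compare Pr(Yhat = 1 | Y = y) across groups for y in {0, 1}.
    eo_gaps = {}
    for y in (0, 1):
        r0 = y_pred[g0 & (y_true == y)].mean()
        r1 = y_pred[g1 & (y_true == y)].mean()
        eo_gaps[y] = abs(r0 - r1)
    return dp_gap, eo_gaps

rng = np.random.default_rng(1)
group = rng.integers(0, 2, 1000)
y_true = rng.integers(0, 2, 1000)
y_pred = (rng.random(1000) < 0.4 + 0.2 * group).astype(int)  # biased toward group 1
print(fairness_gaps(y_true, y_pred, group))  # DP gap ~ 0.2, EO gaps ~ 0.2
```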

2.4 Sum-Product Networks

A sum-product network (SPN) is a graphical representation of a joint probability distribution over a set of random variables X = {X_1, . . . , X_n}. It is a rooted directed acyclic graph whose interior nodes are sums or products and whose leaves are univariate distributions over the X_i. Edges emanating from sum nodes are parameterized with positive weights. Each node in an SPN encodes an unnormalized marginal distribution over X. An example of an SPN is shown in Fig. 2.1. Let x be an instantiation of the random vector X. We associate an unnormalized probability V_k(x | w) with each node k when the input to the network is x and the network weights are set to w:

    V_k(x | w) = I_{x_i}(x)                        if k is a leaf indicator variable on X_i = 1,
                 I_{x̄_i}(x)                        if k is a leaf indicator variable on X_i = 0,
                 ∏_{j∈Ch(k)} V_j(x | w)             if k is a product node,
                 ∑_{j∈Ch(k)} w_{kj} V_j(x | w)      if k is a sum node,        (2.2)

where Ch(k) is the child list of node k in the graph and w_{kj} is the edge weight associated with sum node k and its child node j. In the example of Fig. 2.1, the root of the graph computes the following polynomial

Figure 2.1: An example of a sum-product network over two boolean variables X_1 and X_2.

function over (I_{x_1}, I_{x̄_1}, I_{x_2}, I_{x̄_2}):

    V_root(x | w) = 10 (6 I_{x_1} + 4 I_{x̄_1}) × (6 I_{x_2} + 14 I_{x̄_2})
                  + 6 (6 I_{x_1} + 4 I_{x̄_1}) × (2 I_{x_2} + 8 I_{x̄_2})
                  + 9 (9 I_{x_1} + 1 I_{x̄_1}) × (2 I_{x_2} + 8 I_{x̄_2})
                  = 594 I_{x_1} I_{x_2} + 1776 I_{x_1} I_{x̄_2} + 306 I_{x̄_1} I_{x_2} + 824 I_{x̄_1} I_{x̄_2}        (2.3)

Given an SPN S, we first define the notion of the network polynomial, which plays an important role throughout this thesis:

Definition 2.4.1 (Network Polynomial). Let f(·) ≥ 0 be an unnormalized probability distribution over a Boolean random vector X_{[n]}. The network polynomial of f(·) is a multilinear function ∑_x f(x) ∏_{i=1}^n I_{x_i} of the indicator variables, where the summation is over all possible instantiations of the Boolean random vector

X_{[n]}.

Note that the network polynomial as defined above is a function over 2n indicator variables. Intuitively, the network polynomial is a Boolean expansion (Boole, 1847) of the unnormalized probability distribution f(·). For example, the network polynomial of a BN X_1 → X_2 is Pr(x_1, x_2) I_{x_1} I_{x_2} + Pr(x_1, x̄_2) I_{x_1} I_{x̄_2} + Pr(x̄_1, x_2) I_{x̄_1} I_{x_2} + Pr(x̄_1, x̄_2) I_{x̄_1} I_{x̄_2}. Similarly, the network polynomial of an SPN is given by V_root(x | w). To simplify the notation, for any partial assignment y of a subset of random variables Y ⊆ X, we slightly abuse the notation V_root(y | w) to mean the value of the network where we set all the indicators w.r.t. X ∈ X \ Y to 1, and for X ∈ Y we set the indicator variables according to the assignment y. For instance, in the SPN given in Fig. 2.1, we use V_root(X_2 = 1 | w) to mean the value of the network with inputs (I_{x_1}, I_{x̄_1}, I_{x_2}, I_{x̄_2}) = (1, 1, 1, 0).

We now proceed to give a formal definition of a sum-product network over binary random variables

X_{[n]}. In later chapters we will also extend the following definition over binary random variables to arbitrary discrete distributions and continuous distributions.

Definition 2.4.2 (Sum-Product Network (Poon and Domingos, 2011)). A Sum-Product Network (SPN) over Boolean variables X_{[n]} is a rooted DAG whose leaves are the indicators I_{x_1}, . . . , I_{x_n} and I_{x̄_1}, . . . , I_{x̄_n} and whose internal nodes are sums and products. Each edge (v_k, v_j) emanating from a sum node k has a positive weight w_{kj}. The value of a product node k is V_k := ∏_{j∈Ch(k)} V_j and the value of a sum node k is V_k := ∑_{v_j∈Ch(v_k)} w_{kj} V_j. The value of an SPN is defined to be the value of its root: V_root(· | w).

By definition, it is clear that each internal node of an SPN also defines a polynomial function over a subset of X_{[n]}. We use S to denote the set of sum nodes and P to denote the set of product nodes in an SPN. We now introduce a concept, scope, to formally describe the domain of the function computed by an internal node. The definition is recursive:

Definition 2.4.3 (Scope). For an SPN S over X_{[n]}, the scope of a node k is defined as follows:
1. scope(k) = {i} if k is a leaf indicator variable over X_i.
2. scope(k) := ∪_{j∈Ch(k)} scope(j) otherwise.

Two more structural properties of SPNs can be further defined based on the notion of scope:

Definition 2.4.4 (Complete). An SPN is complete iff ∀k ∈ S, ∀j ∈ Ch(k), scope(k) = scope(j).

Definition 2.4.5 (Decomposable). An SPN is decomposable iff ∀k ∈ P, ∀j_1, j_2 ∈ Ch(k), j_1 ≠ j_2, scope(j_1) ∩ scope(j_2) = ∅.

In other words, completeness requires that for each sum node in an SPN, the scope of the sum node is the same as the scope of any of its children. On the other hand, decomposability ensures that for any product node, the scopes of any pair of its children are disjoint. It has been shown that every complete and decomposable SPN defines a proper probability distribution (Poon and Domingos, 2011); this is a sufficient but not necessary condition. In this thesis, we focus only on complete and decomposable SPNs, as we are interested in their associated probabilistic semantics. For a complete and decomposable SPN S, each node v in S defines a network polynomial f_v(·) which corresponds to the sub-SPN rooted at v. The network polynomial defined by the root of the SPN can then be computed recursively by taking a weighted sum of the network polynomials defined by the sub-SPNs rooted at the children of each sum node, and a product of the network polynomials defined by the sub-SPNs rooted at the children of each product node.

The joint distribution encoded by an SPN is defined by the graphical structure and the weights. For example, to make the polynomial in Eq. (2.3) a well-defined probability distribution, it suffices to normalize it by the sum of all its coefficients. In the example given in Fig. 2.1, the normalization constant can be computed as 594 + 1776 + 306 + 824 = V_root(1 | w), where we simply set all the leaf indicator variables to 1 and the output of the network gives the desired normalization constant. Formally, the probability/density of a joint assignment X = x in an SPN S is defined to be the value at the root of the SPN with input x, divided by the normalization constant V_root(1 | w):

    Pr_S(x) = V_root(x | w) / V_root(1 | w)        (2.4)

Intuitively, setting all the input leaf indicator variables to 1 corresponds to integrating/marginalizing out the random vector X, which ensures that Eq. (2.4) defines a proper probability distribution.
Eq. (2.4) can also be used to compute the marginal probability of a partial assignment Y = y: simply set all the leaf indicator variables whose corresponding random variable is not in Y to 1, and set the other leaf indicator variables according to the assignment Y = y. Equivalently, this corresponds to integrating out the variables outside of the partial assignment. We can also compute conditional probabilities by evaluating two partial assignments:

    Pr_S(Y = y | Z = z) = Pr_S(Y = y, Z = z) / Pr_S(Z = z) = V_root(y, z | w) / V_root(z | w)

Since joint, marginal and conditional queries can all be computed by two network passes, exact inference takes linear time with respect to the network size:

Theorem 2.4.1 (Linear time exact inference (Poon and Domingos, 2011)). Let S be a complete and decomposable SPN over the boolean random vector X_{[n]}. Then ∀Y, Z ⊆ X_{[n]} with Y ∩ Z = ∅, Pr_S(Y | Z) can be computed in time O(|S|).

Thm. 2.4.1 is perhaps the most interesting property of SPNs, and it distinguishes them from classical probabilistic graphical models, including Bayesian networks, Markov random fields, factor graphs and junction trees. It is well known that exact inference in those models is computationally hard. More precisely, for a Bayesian network B, the complexity of exact inference in B is O(2^{tw(B)}), and it is NP-complete to determine whether a given graph B has tree-width at most a given number t (Arnborg et al., 1987). Hence practitioners often have to resort to various approximate inference schemes, e.g., loopy belief propagation (Murphy et al., 1999), Markov chain Monte Carlo (MCMC) sampling (Neal, 1993), and variational relaxation (Sontag et al., 2011), to name a few. On the other hand, although marginal inference is exact and tractable in SPNs, maximum-a-posteriori (MAP) inference is still NP-hard (Choi and Darwiche, 2017; Mei et al., 2017; Peharz, 2015).

It has recently been shown that any complete and decomposable SPN S over X_{[n]} is equivalent to a BN of size O(n|S|) (Zhao et al., 2015b), if we are allowed to use a more compact data structure, e.g., the algebraic decision diagram (ADD) (Bahar et al., 1997) instead of the usual conditional probability table, to represent the conditional probability distribution at each node of the BN. The insight behind the construction is that each internal sum node in S corresponds to a latent variable in the constructed BN, and the BN is a bipartite graph with one layer of local latent variables pointing to one layer of observable variables X. An observable variable is a child of a local latent variable if and only if the observable variable appears as a descendant of the latent variable (sum node) in the original SPN. Equivalently, this shows that the SPN S can be interpreted as a BN where the number of latent variables per instance is O(|S|). This BN perspective provides an interesting connection between the two models, and later we shall rely on this equivalence to develop a Bayesian variational inference method for learning the parameters of SPNs.
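To make the evaluation semantics of Eq. (2.2) and the marginalization trick of Eq. (2.4) concrete, here is a small sketch that hard-codes the SPN of Fig. 2.1; the node encoding is invented for illustration, and for large DAGs one would memoize node values to obtain the single-pass, linear-time evaluation of Thm. 2.4.1.

```python
# A toy evaluator for the SPN of Fig. 2.1: joint and marginal queries via Eq. (2.4).
def make_fig21_spn():
    leaf = lambda name: ("leaf", name)          # indicator values supplied at query time
    s1 = ("sum", [(6, leaf("x1")), (4, leaf("nx1"))])    # 6*I_{x1} + 4*I_{xbar1}
    s2 = ("sum", [(9, leaf("x1")), (1, leaf("nx1"))])
    s3 = ("sum", [(6, leaf("x2")), (14, leaf("nx2"))])
    s4 = ("sum", [(2, leaf("x2")), (8, leaf("nx2"))])
    p1, p2, p3 = ("prod", [s1, s3]), ("prod", [s1, s4]), ("prod", [s2, s4])
    return ("sum", [(10, p1), (6, p2), (9, p3)])

def value(node, inputs):                        # recursive form of Eq. (2.2)
    kind, payload = node
    if kind == "leaf":
        return inputs[payload]
    if kind == "prod":
        out = 1.0
        for child in payload:
            out *= value(child, inputs)
        return out
    return sum(w * value(child, inputs) for w, child in payload)

root = make_fig21_spn()
Z = value(root, {"x1": 1, "nx1": 1, "x2": 1, "nx2": 1})      # V_root(1|w) = 3500
joint = value(root, {"x1": 1, "nx1": 0, "x2": 1, "nx2": 0})  # 594, cf. Eq. (2.3)
marg = value(root, {"x1": 1, "nx1": 1, "x2": 1, "nx2": 0})   # marginalize X1: 900
print(joint / Z, marg / Z)   # Pr(x1, x2) ~ 0.170 and Pr(x2) ~ 0.257
```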

2.5 Geometric and Signomial Programming

Before we introduce signomial programming (SP), we shall first introduce geometric programming (GP), which is a strict subclass of SP. A monomial is defined as a function h : R^n_{++} → R of the form h(x) = d x_1^{a_1} x_2^{a_2} ··· x_n^{a_n}, where the domain is restricted to the positive orthant R^n_{++}, the coefficient d is positive, and the exponents a_i ∈ R, ∀i. A posynomial is a sum of monomials: g(x) = ∑_{k=1}^K d_k x_1^{a_{1k}} x_2^{a_{2k}} ··· x_n^{a_{nk}}. One of the key properties of posynomials is positivity, which allows us to transform any posynomial into the log domain. A GP in standard form is an optimization problem in which both the objective function and the inequality constraints are posynomials and the equality constraints are monomials. There is also an implicit constraint that x ∈ R^n_{++}:

    minimize    ∑_{k=1}^{K_0} d_{0k} ∏_{t=1}^n x_t^{a_{0kt}}
    subject to  ∑_{k=1}^{K_i} d_{ik} ∏_{t=1}^n x_t^{a_{ikt}} ≤ 1,   i ∈ [p]        (2.5)
                d_j ∏_{t=1}^n x_t^{a_{jt}} = 1,   j ∈ [q]

In its standard form a GP is not a convex program, since posynomials are not convex functions in general. However, we can effectively transform it into a convex problem by using the logarithmic transformation trick on x, on the multiplicative coefficients of each monomial, and on each objective/constraint function (Boyd et al., 2007; Chiang, 2005). We make the following change of variables: let y = log(x) and c_{ik} = log(d_{ik}), ∀i, k, and take

log(·) of each function in (2.5). Then the standard GP form is equivalent to the following formulation:

    minimize    log ∑_{k=1}^{K_0} exp(a_{0k}^T y + c_{0k})
    subject to  log ∑_{k=1}^{K_i} exp(a_{ik}^T y + c_{ik}) ≤ 0,   i ∈ [p]        (2.6)
                a_j^T y + c_j = 0,   j ∈ [q]

which is a convex program, since the log-sum-exp function is convex in its argument and a^T y + c is affine in y. Furthermore, in the convex formulation of GP we have y ∈ R^n, i.e., we naturally remove the positivity constraint on x by taking the log transformation.

An SP has the same form as a GP except that the multiplicative constant d inside each monomial is not restricted to be positive, i.e., d can take any real value. Although the difference seems small, GP and SP are hugely different from the computational perspective: the negative multiplicative constants in monomials invalidate the logarithmic transformation trick frequently used in GP. As a result, SPs cannot be reduced to convex programs and are believed to be hard to solve in general (Boyd et al., 2007).
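The logarithmic transformation is easy to exercise numerically. The sketch below solves a toy unconstrained GP (the posynomial g(x) = x_1 + x_2 + 1/(x_1 x_2) is invented for illustration; its minimum value 3 is attained at x = (1, 1)) by minimizing the convex log-sum-exp objective of Eq. (2.6) in y = log(x).

```python
# A minimal numerical sketch of the GP -> convex transformation in Eq. (2.6).
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Posynomial g(x) = x1 + x2 + 1/(x1*x2): exponent rows a_k, and c_k = log(d_k) = 0.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
c = np.zeros(3)

obj = lambda y: logsumexp(A @ y + c)   # log g(exp(y)), convex in y
res = minimize(obj, x0=np.array([1.0, -1.0]))
print(np.exp(res.x), np.exp(res.fun))  # x* ~ (1, 1), optimal value ~ 3
```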

2.6 Some Technical Tools

We first introduce the concept of Rademacher complexity, which will be used frequently in the following chapters to prove data-dependent generalization bounds.

Definition 2.6.1 (Empirical Rademacher Complexity). Let H be a family of functions mapping from X to [a, b] and S = {x_i}_{i=1}^n a fixed sample of size n with elements in X. Then the empirical Rademacher complexity of H with respect to the sample S is defined as

    Rad_S(H) := E_σ [ sup_{h∈H} (1/n) ∑_{i=1}^n σ_i h(x_i) ],

where σ = {σ_i}_{i=1}^n and the σ_i are i.i.d. uniform random variables taking values in {+1, −1}.

Roughly speaking, the Rademacher complexity measures the ability of a hypothesis class to fit uniformly random noise; hence it can be understood as a measure of the richness of the hypothesis class. The following lemma is particularly useful for providing data-dependent guarantees in terms of the empirical Rademacher complexity:

Lemma 2.6.1 (Bartlett and Mendelson (2002)). Let H ⊆ [0, 1]^X. Then for any δ > 0, w.p. at least 1 − δ, the following inequality holds for all h ∈ H:

    E[h(x)] ≤ (1/n) ∑_{i=1}^n h(x_i) + 2 Rad_S(H) + 3 √(log(2/δ) / (2n)).        (2.7)
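Since Definition 2.6.1 is an expectation over random signs, it can be estimated by Monte Carlo for any finite hypothesis class. The snippet below does this for a small, invented class of one-dimensional threshold functions; it is only a numerical illustration of the definition.

```python
# A Monte Carlo estimate of the empirical Rademacher complexity of a finite class.
import numpy as np

rng = np.random.default_rng(0)
n = 200
xs = rng.uniform(0, 1, n)                                # fixed sample S
thresholds = np.linspace(0, 1, 50)
H = (xs[None, :] <= thresholds[:, None]).astype(float)   # h_t(x) = 1[x <= t]

def rademacher(H, trials=2000):
    n = H.shape[1]
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)          # i.i.d. signs
        total += np.max(H @ sigma) / n                   # sup over the finite class
    return total / trials

print(rademacher(H))  # decays roughly like sqrt(log|H| / n) for a finite class
```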

Ledoux-Talagrand's contraction lemma is a useful technique for upper bounding the Rademacher complexity of function compositions:

Lemma 2.6.2 (Ledoux-Talagrand's contraction lemma). Let φ : R → R be a Lipschitz function with parameter L, i.e., ∀a, b ∈ R, |φ(a) − φ(b)| ≤ L|a − b|. Then

    Rad_S(φ ∘ H) = E_σ [ sup_{h∈H} (1/n) ∑_{i=1}^n σ_i φ(h(x_i)) ] ≤ L · E_σ [ sup_{h∈H} (1/n) ∑_{i=1}^n σ_i h(x_i) ] = L · Rad_S(H),

where φ ∘ H := {φ ∘ h | h ∈ H} is the class of composite functions.

Throughout the thesis, we will frequently use information-theoretic concepts and tools to understand various phenomena in learning representations. Here we give a brief introduction to the ones that we use in the rest of the chapters. For two distributions D and D′, the Jensen-Shannon (JS) divergence D_JS(D ‖ D′) is defined as

    D_JS(D ‖ D′) := (1/2) D_KL(D ‖ D_M) + (1/2) D_KL(D′ ‖ D_M),

where D_KL(· ‖ ·) is the Kullback–Leibler (KL) divergence and D_M := (D + D′)/2. The JS divergence can be viewed as a symmetrized and smoothed version of the KL divergence, and it is closely related to the L1 distance between two distributions through Lin's lemma (Lin, 1991). Unlike the KL divergence, the JS divergence is bounded: 0 ≤ D_JS(D ‖ D′) ≤ 1. Additionally, from the JS divergence we can define a distance metric between two distributions as well, known as the JS distance (Endres and Schindelin, 2003):

    d_JS(D, D′) := √(D_JS(D ‖ D′)).

Lin's lemma gives an upper bound on the JS divergence between two distributions via the L1 distance (total variation distance):

Lemma 2.6.3 (Theorem 3, (Lin, 1991)). Let D and D′ be two distributions. Then D_JS(D, D′) ≤ (1/2) ‖D − D′‖_1.

The following inequality is the celebrated data-processing inequality:

Lemma 2.6.4 (Data processing inequality). Let X → Z → Y be a Markov chain. Then I(X; Z) ≥ I(X; Y), where I(·; ·) is the mutual information.

The following identity is a special case of Fubini's theorem:

Theorem 2.6.1 (Fubini's Theorem). Let X be a nonnegative random variable. Then E[X] = ∫_0^∞ Pr(X ≥ t) dt.

Finally, the following Hoeffding bound will be a powerful tool for concentration arguments:

Theorem 2.6.2 (Hoeffding's inequality). Let {X_t}_{t=1}^n be a sequence of independent random variables such that ∀t ∈ [n], E[X_t] = 0 and X_t ∈ [a_t, b_t] almost surely. Then the following inequality holds:

    Pr( (1/n) |∑_{t=1}^n X_t| ≥ ε ) ≤ 2 exp( −2n²ε² / ∑_{t=1}^n (b_t − a_t)² ).

Other concentration inequalities will be introduced in the corresponding chapters when necessary.
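As a quick numerical sanity check of the definitions above and of Lemma 2.6.3, the snippet below computes the JS divergence (with base-2 logarithms, so that it lies in [0, 1]) for two invented discrete distributions and verifies Lin's bound.

```python
# Computing D_JS and checking Lin's lemma (Lemma 2.6.3) on discrete distributions.
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))   # base 2 keeps D_JS in [0, 1]

def js(p, q):
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
bound = 0.5 * np.abs(p - q).sum()          # (1/2) * ||p - q||_1
print(js(p, q), bound)                     # ~0.333 <= 0.6
assert js(p, q) <= bound
```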


Part I

Understanding and Learning Invariant Representations


Chapter 3

Domain Adaptation with Multiple Source Environments

While domain adaptation has been actively researched, most algorithms focus on the single-source-single-target adaptation setting. In this chapter we propose new generalization bounds and algorithms under both classification and regression settings for unsupervised multiple source domain adaptation. Our theoretical analysis naturally leads to an efficient learning strategy using adversarial neural networks: we show how to interpret it as learning feature representations that are invariant to the multiple domain shifts while still being discriminative for the learning task. To this end, we propose multisource domain adversarial networks (MDAN), which approach domain adaptation by optimizing task-adaptive generalization bounds. To demonstrate the effectiveness of MDAN, we conduct extensive experiments showing superior adaptation performance on both classification and regression problems: sentiment analysis, digit classification, and vehicle counting.

3.1 Introduction

The success of machine learning has been partially attributed to rich datasets with abundant annotations (Russakovsky et al., 2015). Unfortunately, collecting and annotating such large-scale training data is prohibitively expensive and time-consuming. To address these limitations, different labeled datasets can be combined to build a larger one, or synthetic training data can be generated with explicit yet inexpensive annotations (Shrivastava et al., 2016). However, due to the possible shift between training and test samples, learning algorithms based on these cheaper datasets still suffer from high generalization error. Domain adaptation (DA) focuses on such problems by establishing knowledge transfer from a labeled source domain to an unlabeled target domain, and by exploring domain-invariant structures and representations to bridge the gap (Pan and Yang, 2010). Both theoretical results (Ben-David et al., 2010; Gopalan et al., 2014; Mansour and Schain, 2012; Mansour et al., 2009a; Xu and Mannor, 2012) and algorithms (Adel et al., 2017; Ajakan et al., 2014; Becker et al., 2013; Ghifary et al., 2015; Glorot et al., 2011; Hoffman et al., 2012, 2017b; Jhuo et al., 2012; Long et al., 2015; Pei et al., 2018) for DA have been proposed.

Most theoretical results and algorithms for DA focus on the single-source-single-target setting (Ganin et al., 2016; Louizos et al., 2015; Shu et al., 2018; Tzeng et al., 2015, 2017). However, in many application scenarios, the labeled data available may come from multiple domains with different distributions. As a result, naive application of single-source-single-target DA algorithms may lead to suboptimal solutions. Such problems call for an efficient technique for multiple source domain adaptation. Some existing multisource DA methods (Gan et al., 2016; Hoffman et al., 2012, 2017a; Sun et al., 2011; Zhang et al., 2015a) do not lead to effective deep-learning-based algorithms, leaving much room for improving their performance.

In this chapter we analyze the multiple source domain adaptation problem and propose an adversarial learning strategy based on our theoretical results. Specifically, we give new generalization bounds for both classification and regression problems under domain adaptation when there are multiple source domains with labeled instances and one target domain with unlabeled instances. Our theoretical results build on the seminal theoretical model for domain adaptation introduced by Blitzer et al. (2008) and Ben-David et al. (2010), where a divergence measure, known as the H-divergence, was proposed to measure the distance between two distributions based on a given hypothesis space H. Our new result generalizes the bound of (Ben-David et al., 2010, Thm. 2) to the case where there are multiple source domains, and to regression problems as well. The new bounds achieve a finite sample error rate of O˜(√(1/km)), where k is the number of source domains and m is the number of labeled training instances from each domain. Interestingly, our bounds also lead to an efficient algorithm using adversarial neural networks. This algorithm learns feature representations that are both domain-invariant and task-discriminative under multiple domains. Specifically, we propose a novel MDAN model that uses neural networks as rich function approximators to instantiate the generalization bound we derive (Fig. 3.1). MDAN can be viewed as a computationally efficient approximation that optimizes the parameters of the networks in order to minimize the bounds.
We introduce two versions of MDAN: the hard version directly optimizes a simple worst-case generalization bound, while the soft version leads to a more data-efficient model and optimizes an average-case, task-adaptive bound. The optimization of MDAN is a minimax saddle point problem, which can be interpreted as a zero-sum game between two participants competing against each other to learn invariant features. MDAN combines feature extraction, domain classification, and task learning in one training process. We propose to use stochastic optimization with simultaneous updates to optimize the parameters in each iteration.

Contributions   Our contributions in this chapter are three-fold:
1. Theoretically, we provide average case generalization bounds for both classification and regression problems under the multisource domain adaptation setting.

2. Inspired by our theoretical results, we propose efficient algorithms that tackle multisource domain adaptation problems using an adversarial learning strategy.
3. Empirically, to demonstrate the effectiveness of MDAN as well as the relevance of our theoretical results, we conduct extensive experiments on real-world datasets, including both natural language and vision tasks, and both classification and regression problems. We achieve consistently superior adaptation performance on all the tasks, validating the effectiveness of our models.

3.2 Preliminaries

We first introduce the problem setup of domain adaptation and review a theoretical model for domain adaptation when there is one source and one target (Ben-David et al., 2007, 2010; Blitzer et al., 2008; Kifer et al., 2004). The key idea is to use the H-divergence to measure the discrepancy between two distributions. Other theoretical models for DA exist (Cortes and Mohri, 2014; Cortes et al., 2008; Mansour et al., 2009a,c); we choose to work with the above model because this distance measure has a particularly natural interpretation and can be well approximated using samples from both domains.

Problem Setup   We use domain to represent a distribution D on the input space X and a labeling function f : X → [0, 1]. In the setting of one-source-one-target domain adaptation, we use ⟨D_S, f_S⟩ and ⟨D_T, f_T⟩ to denote the source and target, respectively. A hypothesis is a function h : X → [0, 1]. The error of a hypothesis h w.r.t. a labeling function f under distribution D_S is defined as ε_S(h, f) := E_{x∼D_S}[|h(x) − f(x)|]. When f and h are binary classification functions, this definition reduces to the probability that h disagrees with f under D_S: E_{x∼D_S}[|h(x) − f(x)|] = E_{x∼D_S}[I(f(x) ≠ h(x))] = Pr_{x∼D_S}(f(x) ≠ h(x)). We define the risk of hypothesis h as the error of h w.r.t. the true labeling function under the source domain D_S, i.e., ε_S(h) := ε_S(h, f_S). As is common notation in computational learning theory, we use ε̂_S(h) to denote the empirical risk of h on the source domain. Similarly, we use ε_T(h) and ε̂_T(h) to mean the true risk and the empirical risk on the target domain. The H-divergence is defined as follows:

Definition 3.2.1. Let H be a hypothesis class for instance space X, and A_H be the collection of subsets of X that are the support of some hypothesis in H, i.e., A_H := {h⁻¹({1}) | h ∈ H}. The distance between two distributions D and D′ based on H is

    d_H(D, D′) := 2 sup_{A∈A_H} |Pr_D(A) − Pr_{D′}(A)|.

When the hypothesis class H contains all the possible measurable functions over X, d_H(D, D′) reduces to the familiar total variation. Given a hypothesis class H, we define its symmetric difference w.r.t. itself as H∆H = {h(x) ⊕ h′(x) | h, h′ ∈ H}, where ⊕ is the XOR operation. Let h* be the optimal hypothesis that achieves the minimum combined risk on both the source and the target domains, h* := argmin_{h∈H} ε_S(h) + ε_T(h), and use λ to denote the combined risk of the optimal hypothesis h*: λ := ε_S(h*) + ε_T(h*). Ben-David et al. (2007) and Blitzer et al. (2008) proved the following generalization bound on the target risk in terms of the source risk and the discrepancy between the single source domain and the target domain:

Theorem 3.2.1 (Blitzer et al. (2008)). Let H be a hypothesis space of VC-dimension d, and let D̂_S (D̂_T) be the empirical distribution induced by a sample of size m drawn from D_S (D_T). Then w.p. at least 1 − δ, ∀h ∈ H,

    ε_T(h) ≤ ε̂_S(h) + (1/2) d_{H∆H}(D̂_S, D̂_T) + λ + O(√((d log(m/d) + log(1/δ)) / m)).    (3.1)

The bound depends on λ, the optimal combined risk that can be achieved by hypotheses in H. The intuition is that if λ is large, we cannot hope for successful domain adaptation. One notable feature is that the

empirical discrepancy distance between two samples can be approximated by training a discriminator to distinguish instances from the two domains.

3.3 Generalization Bound for Multiple Source Domain Adaptation

In this section we discuss two approaches to obtaining generalization guarantees for multiple source domain adaptation in both classification and regression settings: one by a union bound argument, and one using a reduction from multiple source domains to a single source domain. We conclude this section with a discussion and comparison of our bounds with existing generalization bounds for multisource domain adaptation (Ben-David et al., 2010; Mansour et al., 2009c). We defer the detailed proofs to Sec. 3.7 and mainly focus on discussing the interpretations and implications of the theorems.

Let {D_{S_i}}_{i=1}^k and D_T be the k source domains and the target domain, respectively. One idea to obtain a generalization bound for multiple source domains is to apply Thm. 3.2.1 repeatedly k times, followed by a union bound to combine them. Following this idea, we first obtain the following bound as a corollary of Thm. 3.2.1 in the setting of multiple source domains, serving as a baseline model:

Corollary 3.3.1 (Worst case classification bound). Let H be a hypothesis class with VCdim(H) = d. If D̂_T and {D̂_{S_i}}_{i=1}^k are the empirical distributions generated with m i.i.d. samples from each domain, then, for 0 < δ < 1, with probability at least 1 − δ, for all h ∈ H, we have:

    ε_T(h) ≤ max_{i∈[k]} { ε̂_{S_i}(h) + (1/2) d_{H∆H}(D̂_T; D̂_{S_i}) + λ_i } + O(√((1/m)(log(k/δ) + d log(m/d)))),    (3.2)

where λ_i is the combined risk of the optimal hypothesis on domains S_i and T.

Remark   This bound is quite pessimistic, as it is essentially a worst case bound: the generalization on the target only depends on the worst source domain. However, in many real-world scenarios, when the number of related source domains is large, a single irrelevant source domain may not hurt the generalization too much. Furthermore, in the case of multiple source domains, despite the possible discrepancy between the source domains and the target domain, we effectively have a labeled sample of size km, while the asymptotic convergence rate in Corollary 3.3.1 is O˜(√(1/m)). Hence one natural question to ask is: is it possible to have a generalization bound with finite sample rate O˜(√(1/km))?

In what follows we present a strategy to achieve a generalization bound of rate O˜(√(1/km)). The idea of this strategy is a reduction, via a convex combination, from multiple domains to a single domain, by combining all the labeled instances from the k domains into one.

Theorem 3.3.1 (Average case classification bound). Let H be a hypothesis class with VCdim(H) = d. If {D̂_{S_i}}_{i=1}^k are the empirical distributions generated with m i.i.d. samples from each domain, and D̂_T is the empirical distribution on the target domain generated from mk samples without labels, then, ∀α ∈ R^k_+ with ∑_{i∈[k]} α_i = 1, and for 0 < δ < 1, w.p. at least 1 − δ, for all h ∈ H, we have:

    ε_T(h) ≤ ∑_{i∈[k]} α_i ( ε̂_{S_i}(h) + (1/2) d_{H∆H}(D̂_T; D̂_{S_i}) ) + λ_α + O(√((1/km)(log(1/δ) + d log(km/d)))),    (3.3)

where λ_α is the risk of the optimal hypothesis on the mixture source domain ∑_{i∈[k]} α_i S_i and T.

Different from Corollary 3.3.1, Thm. 3.3.1 requires mk unlabeled instances from the target domain. This is a mild requirement since unlabeled data is cheap to collect. Roughly, the bound in Thm. 3.3.1 can be understood as an average case bound if we choose α_i = 1/k, ∀i ∈ [k]. Note that a simple convex combination obtained by applying Thm. 3.2.1 k times can only achieve a finite sample rate of O˜(√(1/m)), while the one in (3.3) achieves O˜(√(1/km)). On the other hand, the constants max_{i∈[k]} λ_i (in Corollary 3.3.1) and λ_α (in Thm. 3.3.1) are generally not comparable. As a final note, although the proof works for any convex combination α_i, in the next section we will describe a practical method so that we do not need to choose it explicitly.

Thm. 3.3.1 upper bounds the generalization error for classification problems. Next we also provide a generalization guarantee for regression problems, where instead of the VC dimension we use the pseudo-dimension to characterize the structural complexity of the hypothesis class.

Theorem 3.3.2 (Average case regression bound). Let H be a set of real-valued functions from X to [0, 1]¹ with Pdim(H) = d. If {D̂_{S_i}}_{i=1}^k are the empirical distributions generated with m i.i.d. samples from each domain, and D̂_T is the empirical distribution on the target domain generated from mk samples without labels, then, ∀α ∈ R^k_+ with ∑_{i∈[k]} α_i = 1, and for 0 < δ < 1, with probability at least 1 − δ, for all h ∈ H, we have:

    ε_T(h) ≤ ∑_{i∈[k]} α_i ( ε̂_{S_i}(h) + (1/2) d_{H̄}(D̂_T; D̂_{S_i}) ) + λ_α + O(√((1/km)(log(1/δ) + d log(km/d)))),    (3.4)

where λ_α is the risk of the optimal hypothesis on the mixture source domain ∑_{i∈[k]} α_i S_i and T, and H̄ := {I_{|h(x)−h′(x)|>t} : h, h′ ∈ H, 0 ≤ t ≤ 1} is the set of threshold functions induced from H.

Comparison with Existing Bounds   First, it is easy to see that the bounds in both (3.2) and (3.3) reduce to the one in Thm. 3.2.1 when there is only one source domain (k = 1).
Blitzer et al. (2008) give a generalization bound for semi-supervised classification with multiple sources where, besides labeled instances from multiple source domains, the algorithm also has access to a fraction of labeled instances from the target domain. Although in general our bound and the one in (Blitzer et al., 2008, Thm. 3) are incomparable, it is instructive to see the connections and differences between them: our bound works in the unsupervised domain adaptation setting, where we do not have any labeled data from the target. By contrast, the bound in (Blitzer et al., 2008, Thm. 3) is for semi-supervised domain adaptation. As a result, because of the access to labeled instances from the target domain, their bound is expressed relative to the optimal error on the target, while ours is in terms of the empirical error on the source domains; hence theirs is more informative. To the best of our knowledge, our bound in Thm. 3.3.2 is the first one using the idea of the H-divergence for regression problems. The proof of this theorem relies on a reduction from regression to classification. Mansour et al. (2009b) give a generalization bound for multisource domain adaptation under the assumption that the target distribution is a mixture of the k sources and the target hypothesis can be represented as a convex combination of the source hypotheses; their generalized discrepancy measure also applies to other loss functions.
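To get a feel for the gap between the two finite-sample rates, one can simply plug numbers into the O(·) terms of (3.2) and (3.3), ignoring the identical hidden constants; the values of d, m, k, and δ below are arbitrary.

```python
# A toy comparison of the finite-sample terms in the worst-case bound (3.2)
# versus the average-case bound (3.3), with identical constants ignored.
import math

d, m, k, delta = 10, 1000, 10, 0.01
worst = math.sqrt((math.log(k / delta) + d * math.log(m / d)) / m)
avg = math.sqrt((math.log(1 / delta) + d * math.log(k * m / d)) / (k * m))
print(f"worst-case term ~ {worst:.4f}, average-case term ~ {avg:.4f}")
# With k = 10 sources the average-case term is roughly 3x smaller here,
# reflecting the O~(sqrt(1/km)) versus O~(sqrt(1/m)) improvement.
```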

3.4 Multisource Domain Adaptation with Adversarial Neural Networks

Motivated by the bounds given in the last section, in this section we propose our model, multisource domain adversarial networks (MDAN), in two versions: a hard version (as a baseline) and a soft version. Suppose we are given samples drawn from k source domains {D_{S_i}}, each of which contains m instance-label pairs. Additionally, we also have access to unlabeled instances sampled from the target domain D_T. Once we fix our hypothesis class H, the last two terms in the generalization bounds (3.2) and (3.3) will be fixed; hence we can only hope to minimize the bound by minimizing the first two terms, i.e., the source training error

1 This is just for simplicity of presentation; the range can easily be generalized to any bounded set.

Figure 3.1: MDAN network architecture. Feature extraction, domain classification, and task learning are combined in one training process. Hard version: the source that achieves the minimum domain classification error is backpropagated with gradient reversal; soft version: all the domain classification risks over the k source domains are combined and backpropagated adaptively with gradient reversal.

and the discrepancy between the source domains and the target domain. The idea is to train a neural network to learn a representation with the following two properties: 1) indistinguishable between the k source domains and the target domain; 2) informative enough for our desired task to succeed. Note that both requirements are necessary: without the second property, a neural network could learn trivial random noise representations for all the domains, and such representations cannot be distinguished by any discriminator; without the first property, the learned representation does not necessarily generalize to the unseen target domain.

One key observation that leads to a practical approximation of d_{H∆H}(D̂_T; D̂_{S_i}), from Ben-David et al. (2007), is that computing the discrepancy measure is closely related to learning a classifier that is able to distinguish samples from different domains. Let ε̂_{T,S_i}(h) be the empirical risk of hypothesis h in the domain discrimination task. Ignoring the constant terms that do not affect the upper bound, we can minimize the worst case upper bound in (3.2) by solving the following optimization problem:

    Hard version:   minimize  max_{i∈[k]} ( ε̂_{S_i}(h) − min_{h′∈H∆H} ε̂_{T,S_i}(h′) )    (3.5)

The two terms in (3.5) exactly correspond to the two criteria we just proposed: the first term asks for an informative feature representation for our desired task to succeed, while the second term captures the notion of invariant feature representations between the different domains. Inspired by Ganin et al. (2016), we use a gradient reversal layer to effectively implement (3.5) by backpropagation. The network architecture is shown in Fig. 3.1.

As discussed in the last section, one notable drawback of the hard version is that the algorithm may spend too much computational resource optimizing the worst source domain. Furthermore, in each iteration the algorithm only updates its parameters based on the gradient from one of the k domains. This is data-inefficient and can waste computational resources in the forward pass. To avoid both problems, we propose the soft version of MDAN, which optimizes an upper bound of the convex combination bound given in (3.3). To this end, define ε̂_i(h) := ε̂_{S_i}(h) − min_{h′∈H∆H} ε̂_{T,S_i}(h′) and let γ > 0 be a constant. We formulate the following optimization problem:

    Soft version:   minimize  (1/γ) log ∑_{i∈[k]} exp( γ (ε̂_{S_i}(h) − min_{h′∈H∆H} ε̂_{T,S_i}(h′)) )    (3.6)

At first glance, it may not be clear what the above objective function corresponds to. To understand this,

Algorithm 1 Multiple Source Domain Adaptation
 1: for t = 1 to ∞ do
 2:     Sample {S_i^(t)}_{i=1}^k and T^(t) from {D̂_{S_i}}_{i=1}^k and D̂_T, each of size m
 3:     for i = 1 to k do
 4:         ε̂_i^(t) ← ε̂_{S_i^(t)}(h) − min_{h′∈H∆H} ε̂_{T^(t),S_i^(t)}(h′)
 5:         Compute w_i^(t) := exp(ε̂_i^(t))
 6:     end for
 7:     Select i^(t) := argmax_{i∈[k]} ε̂_i^(t)                          // Hard version
 8:     Backpropagate the gradient of ε̂_{i^(t)}^(t)
 9:     for i = 1 to k do
10:         Normalize w_i^(t) ← w_i^(t) / ∑_{i′∈[k]} w_{i′}^(t)          // Soft version
11:     end for
12:     Backpropagate the gradient of ∑_{i∈[k]} w_i^(t) ε̂_i^(t)
13: end for

if we define α_i = exp(ε̂_i(h)) / ∑_{j∈[k]} exp(ε̂_j(h)), then the following inequality holds:

    ∑_{i∈[k]} α_i ε̂_i(h) ≤ log E_α[exp(ε̂_i(h))] = log ( ∑_{i∈[k]} exp²(ε̂_i(h)) / ∑_{i∈[k]} exp(ε̂_i(h)) ) ≤ log ∑_{i∈[k]} exp(ε̂_i(h)).

In other words, the objective function in (3.6) is in fact an upper bound of the convex combination bound given in (3.3), with the combination weights α defined above. Compared with (3.3), one advantage of the objective function in (3.6) is that we do not need to choose the value of α explicitly. Instead, the weight adapts to the loss ε̂_i(h): the larger the loss, the heavier the weight. Alternatively, from the algorithmic perspective, during optimization (3.6) naturally provides an adaptive weighting scheme for the k source domains depending on their relative errors. Using θ to denote all the model parameters, we have

    ∂/∂θ [ (1/γ) log ∑_{i∈[k]} exp( γ (ε̂_{S_i}(h) − min_{h′∈H∆H} ε̂_{T,S_i}(h′)) ) ] = ∑_{i∈[k]} ( exp(γ ε̂_i(h)) / ∑_{i′∈[k]} exp(γ ε̂_{i′}(h)) ) ∂ε̂_i(h)/∂θ.    (3.7)

Compared with (3.5), the log-sum-exp trick not only smooths the objective, but also provides a principled and adaptive way to combine all the gradients from the k source domains. In words, (3.7) says that the gradient of MDAN is a convex combination of the gradients from all the domains: the larger the error from one domain, the larger its combination weight in the ensemble. As we will see in Sec. 3.5, the optimization problem (3.6) often leads to better generalization in practice, which may partly be explained by the ensemble effect of multiple sources implied by the upper bound. Pseudocode for both algorithms is listed in Alg. 1.
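For concreteness, the following sketch renders the soft objective as a training loss; it assumes PyTorch, and the modules feat, task_clf, and dom_clfs (a shared feature extractor, the task classifier, and k domain discriminators) are placeholders. The gradient reversal layer is the standard surrogate for the inner minimization over the domain discriminators, so this is an illustration of the training step rather than the exact implementation.

```python
# A minimal sketch (assumes PyTorch; feat/task_clf/dom_clfs are hypothetical modules)
# of the soft-version MDAN loss in Eq. (3.6). The gradient reversal layer lets the
# discriminators minimize the domain loss while the features maximize it, the usual
# surrogate for the "task error minus domain-discrimination error" terms.
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output  # flip the gradient sign for the feature extractor

def mdan_soft_loss(feat, task_clf, dom_clfs, xs_list, ys_list, xt, gamma=10.0):
    """xs_list, ys_list: labeled batches from the k sources; xt: unlabeled target batch.
    Device handling is omitted for brevity."""
    eps = []
    for i, (xs, ys) in enumerate(zip(xs_list, ys_list)):
        zs, zt = feat(xs), feat(xt)
        task_loss = F.cross_entropy(task_clf(zs), ys)
        z_all = GradReverse.apply(torch.cat([zs, zt]))
        d_lbl = torch.cat([torch.zeros(len(xs)), torch.ones(len(xt))]).long()
        dom_loss = F.cross_entropy(dom_clfs[i](z_all), d_lbl)
        eps.append(task_loss + dom_loss)  # per-source surrogate for eps_i(h)
    # Eq. (3.6): smooth maximum over the k sources; large gamma approaches Eq. (3.5).
    return torch.logsumexp(gamma * torch.stack(eps), dim=0) / gamma
```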

3.5 Experiments

We evaluate both hard and soft MDAN and compare them with state-of-the-art methods on three real-world datasets: the Amazon benchmark dataset (Chen et al., 2012) for sentiment analysis, a digit classification task that includes 4 datasets: MNIST (LeCun et al., 1998b), MNIST-M (Ganin et al., 2016), SVHN (Netzer et al., 2011), and SynthDigits (Ganin et al., 2016), and a public, large-scale image dataset on vehicle counting from multiple city cameras (Zhang et al., 2017a).

3.5.1 Amazon Reviews

Domains within the dataset consist of reviews of a specific kind of product (Books, DVDs, Electronics, and Kitchen appliances). Reviews are encoded as 5000-dimensional feature vectors of unigrams and bigrams, with binary labels indicating sentiment. We conduct 4 experiments: for each of them, we pick one product as the target domain and the rest as source domains. Each source domain has 2000 labeled examples, and the target test set has 3000 to 6000 examples. During training, we randomly sample the same number of unlabeled target examples as source examples in each mini-batch. We implement both the Hard-Max and Soft-Max methods and compare them with three baselines: MLPNet, marginalized stacked denoising autoencoders (mSDA) (Chen et al., 2012), and DANN (Ganin et al., 2016). DANN cannot be directly applied in the multiple-source-domains setting. In order to make a comparison, we use two protocols. The first is to combine all the source domains into a single one and train on it using DANN, which we denote as C-DANN. The second protocol is to train multiple DANNs separately, where each one corresponds to a source-target pair; among all the DANNs, we report the one achieving the best performance on the target domain, and we denote this experiment as B-DANN. For a fair comparison, all these models are built on the same basic network structure with one input layer (5000 units) and three hidden layers (1000, 500, 100 units).

Table 3.1: Sentiment classification accuracy.

Train/Test    MLPNet   mSDA     B-DANN   C-DANN   MDAN Hard-Max   MDAN Soft-Max
D+E+K/B       0.7655   0.7698   0.7650   0.7789   0.7845          0.7863
B+E+K/D       0.7588   0.7861   0.7732   0.7886   0.7797          0.8065
B+D+K/E       0.8460   0.8198   0.8381   0.8491   0.8483          0.8534
B+D+E/K       0.8545   0.8426   0.8433   0.8639   0.8580          0.8626

Results and Analysis   We show the accuracy of the different methods in Table 3.1. Clearly, Soft-Max significantly outperforms all other methods in most settings. When Kitchen is the target domain, C-DANN performs slightly better than Soft-Max, and all the methods perform close to each other. Hard-Max is typically slightly worse than Soft-Max. This is mainly due to the low data-efficiency of the Hard-Max model (Section 3.4, Eq. 3.5, Eq. 3.6). We observe that with more training iterations, the performance of Hard-Max can be further improved. These results verify the effectiveness of MDAN for multisource domain adaptation. To validate the statistical significance of the results, we also run a non-parametric Wilcoxon signed-rank test for each task to compare Soft-Max with the other competitors (see more details in the appendix). From the statistical test, we see that Soft-Max is convincingly better than the other methods.

3.5.2 Digits Datasets

Following the setting in (Ganin et al., 2016), we combine four digits datasets (MNIST, MNIST-M, SVHN, SynthDigits) to build the multisource domain dataset. We take each of MNIST-M, SVHN, and MNIST as the target domain in turn, and the rest as sources. Each source domain has 20,000 labeled images and the target test set has 9,000 examples.

Baselines   We compare the Hard-Max and Soft-Max versions of MDAN with 10 baselines: i) B-Source: a basic network trained on each source domain (20,000 images) without domain adaptation and tested on the

Table 3.2: Accuracy on digit classification. T: MNIST; M: MNIST-M; S: SVHN; D: SynthDigits.

Method      S+M+D/T   T+S+D/M   M+T+D/S   Method      S+M+D/T   T+S+D/M   M+T+D/S
B-Source    0.964     0.519     0.814     C-Source    0.938     0.561     0.771
B-DANN      0.967     0.591     0.818     C-DANN      0.925     0.651     0.776
B-ADDA      0.968     0.657     0.800     C-ADDA      0.927     0.682     0.804
B-MTAE      0.862     0.534     0.703     C-MTAE      0.821     0.596     0.701
Hard-Max    0.976     0.663     0.802     Soft-Max    0.979     0.687     0.816
MDAC        0.755     0.563     0.604     Target      0.987     0.901     0.898

target domain. Among the three models, we report the one that achieves the best performance on the test set. ii) C-Source: a basic network trained on a combination of the three source domains (20,000 images each) without domain adaptation and tested on the target domain. iii) B-DANN: we train DANNs (Ganin et al., 2016) on each source-target domain pair (20,000 images for each source) and test on the target; again, we report the best score among the three. iv) C-DANN: we train a single DANN on a combination of the three source domains (20,000 images each). v) B-ADDA: we train ADDA (Tzeng et al., 2017) on each source-target domain pair (20,000 images for each source) and test it on the target domain; we report the best accuracy among the three. vi) C-ADDA: we train ADDA on a combination of the three source domains (20,000 images each). vii) B-MTAE: we train MTAE (Ghifary et al., 2015) on each source-target domain pair (20,000 images for each source) and test it on the target domain; we report the best accuracy among the three. viii) C-MTAE: we train MTAE on a combination of the three source domains (20,000 images each). ix) MDAC: MDAC (Zhang et al., 2015a) is a multiple source domain adaptation algorithm that explores causal models to represent the relationship between the features X and the class label Y; we directly train MDAC on a combination of the three source domains. x) Target: the basic network trained and tested on the target data; it serves as an upper bound for DA algorithms. All the MDAN and baseline methods are built on the same basic network structure to put them on an equal footing.

Results and Analysis   The classification accuracy is shown in Table 3.2. The results show that MDAN outperforms all the baselines in the first two experiments and is comparable with B-DANN in the third experiment. For the combined sources, MDAN always performs better than the source-only baseline (MDAN vs. C-Source). However, a naive combination of different training datasets can sometimes even decrease the performance of the baseline methods. This conclusion comes from three observations: first, directly training DANN on a combination of multiple sources leads to worse results than the source-only baseline (C-DANN vs. C-Source); second, the performance of C-DANN can be even worse than that of B-DANN (the first and third experiments); third, directly training DANN on a combination of multiple sources always has lower accuracy compared with our approach (C-DANN vs. MDAN). We have similar observations for ADDA and MTAE. Such observations verify that domain adaptation methods designed for a single source lead to suboptimal solutions when applied to multiple sources. They also verify the necessity and superiority of MDAN for multiple source adaptation. Furthermore, we observe that adaptation to the SVHN dataset (the third experiment) is hard. In this case, increasing the number of source domains does not help. We conjecture this is due to the large dissimilarity between the SVHN data and the others. Surprisingly, using a single domain (B-DANN) in this case achieves the best result. This indicates that in domain adaptation the quality of the data (how close it is to the target data) is much more important than the quantity (how many source domains). As a conclusion,

Table 3.3: Counting error statistics. S is the number of source cameras; T is the target camera id.

S   T   MDAN Hard-Max   MDAN Soft-Max   DANN     FCN      T   MDAN Hard-Max   MDAN Soft-Max   DANN     FCN
2   A   1.8101          1.7140          1.9490   1.9094   B   2.5059          2.3438          2.5218   2.6528
3   A   1.3276          1.2363          1.3683   1.5545   B   1.9092          1.8680          2.0122   2.4319
4   A   1.3868          1.1965          1.5520   1.5499   B   1.7375          1.8487          2.1856   2.2351
5   A   1.4021          1.1942          1.4156   1.7925   B   1.7758          1.6016          1.7228   2.0504
6   A   1.4359          1.2877          2.0298   1.7505   B   1.5912          1.4644          1.5484   2.2832
7   A   1.4381          1.2984          1.5426   1.7646   B   1.5989          1.5126          1.5397   1.7324

this experiment further demonstrates the effectiveness of MDAN when there are multiple source domains available, where a naive combination of multiple sources using DANN may hurt generalization.

3.5.3 WebCamT Vehicle Counting Dataset

WebCamT is a public dataset for vehicle counting from large-scale city camera videos, which have low resolution (352 × 240), low frame rate (1 frame/second), and high occlusion. It has 60,000 frames annotated with vehicle bounding boxes and counts, divided into training and testing sets with 42,200 and 17,800 frames, respectively. Here we demonstrate the effectiveness of MDAN for counting vehicles from an unlabeled target camera by adapting from multiple labeled source cameras: we select 8 cameras located at different intersections of the city with different scenes, each of which has more than 2,000 labeled images for our evaluations. Among these 8 cameras, we randomly pick two cameras and take each in turn as the target camera, with the other 7 cameras as sources. We compute the proxy $\mathcal{A}$-distance (PAD) (Ben-David et al., 2007) between each source camera and the target camera to approximate the divergence between them. We then rank the source cameras by the PAD from low to high and choose the first k cameras to form the k source domains. Thus the proposed methods and baselines can be evaluated on different numbers of sources (from 2 to 7). We implement the Hard-Max and Soft-Max MDAN based on FCN (Zhang et al., 2017a), the basic vehicle counting network. We compare our method with two baselines: FCN (Zhang et al., 2017a), a basic network without domain adaptation, and DANN (Ganin et al., 2016), implemented on top of the same basic network. We record the mean absolute error (MAE) between the true count and the estimated count.
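For reference, the PAD computation follows the standard recipe of training a domain classifier and converting its held-out error into a distance. A minimal sketch, assuming feature matrices for a source camera and the target camera are available; the logistic regression discriminator is an illustrative choice, not necessarily the classifier used in our experiments:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(source_feats, target_feats, seed=0):
    """Estimate the proxy A-distance (PAD) between two domains.

    Following Ben-David et al. (2007), train a binary classifier to
    discriminate source from target samples; if its held-out error is
    err, then PAD = 2 * (1 - 2 * err).  A value near 2 means the domains
    are easy to tell apart (large divergence); near 0, they look alike.
    """
    X = np.vstack([source_feats, target_feats])
    y = np.hstack([np.zeros(len(source_feats)), np.ones(len(target_feats))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)
    return 2.0 * (1.0 - 2.0 * err)
```

Ranking the source cameras then amounts to sorting them by the returned value in ascending order.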

Results and Analysis The counting errors of the different methods are compared in Table 3.3. The Hard-Max version achieves lower error than DANN and FCN in most settings for both target cameras. The Soft-Max approximation outperforms all the baselines and the Hard-Max version in most settings, demonstrating the effectiveness of the smooth and adaptive approximation. The lowest MAE achieved by Soft-Max is 1.1942; this means there is only around one vehicle miscount per frame (the average number of vehicles in one frame is around 20). Fig. 3.2 shows the counting results of Soft-Max for the two target cameras under the 5-source-camera setting. We can see that the proposed method accurately counts the vehicles of each target camera over long time sequences. Does adding more source cameras always help improve the performance on the target camera? To answer this question, we analyze the counting error when we vary the number of source cameras, as shown in Fig. 3.3a, where the x-axis refers to the number of source cameras and the y-axis includes both the MAE curve on the target camera and the PAD distance (bar chart) between each pair of source and target cameras. From the curves, we see the counting error goes down with more source cameras at the beginning, while it goes up when more sources are added at the end.

Figure 3.2: Counting results for target camera A (first row) and B (second row); x-axis: frames, y-axis: counts.

This phenomenon shows that the performance on the target domain also depends on its distance to the added source domain, i.e., it is not always beneficial to naively incorporate more source domains into training. To illustrate this better, we also show the PAD of the newly added camera in the bar chart of Fig. 3.3a. By comparing the PAD and the counting error, we see that the performance on the target can degrade when the newly added source camera has a large divergence from the target camera. To show that MDAN can indeed decrease the divergence between the target domain and multiple source domains, in Fig. 3.3b we plot the PAD distances between the target domains and the corresponding source domains. We can see that MDAN consistently decreases the PAD distances between all pairs of target and source domains, for both camera A and camera B. From this experiment we conclude that our proposed MDAN models are effective in multiple source domain adaptation.

3.6 Related Work

A number of adaptation approaches have been studied in recent years. From the theoretical aspect, several results have been derived in the form of upper bounds on the target generalization error when learning from the source data. A key point of these theoretical frameworks is estimating the distribution shift between source and target. Kifer et al. (2004) proposed the $\mathcal{H}$-divergence to measure the similarity between two domains, and derived a generalization bound on the target domain using the empirical error on the source domain and the $\mathcal{H}$-divergence between the source and the target. This idea has later been extended to multisource domain adaptation (Blitzer et al., 2008), and the corresponding generalization bound has been developed as well. Ben-David et al. (2010) provide a generalization bound for domain adaptation on the target risk which generalizes the standard bound on the source risk. This work formalizes a natural intuition of DA, namely reducing the distance between the two distributions while ensuring a low error on the source domain, and it justifies many DA algorithms. Based on this work, Mansour et al. (2009a) introduce a new divergence measure, the discrepancy distance, whose empirical estimate is based on the Rademacher complexity (Koltchinskii, 2001) rather than the VC dimension. Other theoretical works include (Mansour and Schain, 2012), which derives generalization bounds on the target error by making use of the robustness properties introduced in (Xu and Mannor, 2012). See (Cortes et al., 2008; Mansour et al., 2009c) for more details. Following the theoretical developments, many DA algorithms have been proposed, such as instance-based methods (Tsuboi et al., 2009), feature-based methods (Becker et al., 2013), and parameter-based methods (Evgeniou and Pontil, 2004).

(a) Counting error and PAD over different source numbers.

(b) PAD distance before and after training MDAN.

Figure 3.3: PAD distances over different source domains, along with their changes before and after training MDAN.

The general approach for domain adaptation starts from algorithms that focus on the linear hypothesis class (Cortes and Mohri, 2014; Germain et al., 2013). The linear assumption can be relaxed and extended to the non-linear setting using the kernel trick, leading to a reweighting scheme that can be efficiently solved via quadratic programming (Gong et al., 2013). Recently, due to the availability of rich data and powerful computational resources, non-linear representations and hypothesis classes have been increasingly explored (Ajakan et al., 2014; Baktashmotlagh et al., 2013; Chen et al., 2012; Ganin et al., 2016; Glorot et al., 2011). This line of work focuses on building common and robust feature representations among multiple domains using either supervised neural networks (Glorot et al., 2011) or unsupervised pretraining with denoising auto-encoders (Vincent et al., 2008, 2010). Recent studies have shown that deep neural networks can learn more transferable features for DA (Donahue et al., 2014; Glorot et al., 2011; Yosinski et al., 2014). Bousmalis et al. (2016) develop domain separation networks to extract image representations that are partitioned into two subspaces: a domain-private component and a cross-domain shared component. The partitioned representation is used to reconstruct the images from both domains, improving the DA performance. Long et al. (2015) enable classifier adaptation by learning a residual function with reference to the target classifier; this work is limited to classification problems. Ganin et al. (2016) propose a domain-adversarial neural

network to learn features that are indiscriminate with respect to the domain but discriminative for the main learning task. Although these works generally outperform methods not based on deep learning, they only focus on the single-source-single-target DA problem, and much of the work consists of empirical designs without statistical guarantees. Hoffman et al. (2012) present a domain transform mixture model for multisource DA, which is based on non-deep architectures and is difficult to scale up. Adversarial training techniques that aim to build feature representations that are indistinguishable between source and target domains have been proposed in the last few years (Ajakan et al., 2014; Ganin et al., 2016). Specifically, one of the central ideas is to use neural networks, which are powerful function approximators, to approximate a distance measure known as the $\mathcal{H}$-divergence between two domains (Ben-David et al., 2007, 2010; Kifer et al., 2004). The overall algorithm can be viewed as a zero-sum two-player game: one network tries to learn feature representations that can fool the other network, whose goal is to distinguish representations generated from the source domain from those generated from the target domain. The goal of the algorithm is to find a Nash equilibrium of the game, i.e., a stationary point of the min-max saddle point problem. Ideally, at such an equilibrium state, feature representations from the source domain will share the same distribution as those from the target domain, and, as a result, better generalization on the target domain can be expected by training models using only labeled instances from the source domain.
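To make the min-max game concrete, the following is a minimal PyTorch sketch of the gradient reversal construction used by DANN-style methods (Ganin et al., 2016); the layer sizes and the weight lambd are illustrative placeholders, not the architectures used in our experiments:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in
    the backward pass, so that minimizing the domain loss trains the
    discriminator while the feature extractor is trained to maximize it."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

feature = nn.Sequential(nn.Linear(100, 64), nn.ReLU())   # shared extractor g
classifier = nn.Linear(64, 2)                            # task hypothesis h
discriminator = nn.Linear(64, 2)                         # domain critic

def dann_loss(x_src, y_src, x_tgt, lambd=0.1):
    ce = nn.CrossEntropyLoss()
    z_src, z_tgt = feature(x_src), feature(x_tgt)
    task_loss = ce(classifier(z_src), y_src)
    # Domain labels: 0 = source, 1 = target.
    z = torch.cat([GradReverse.apply(z_src, lambd),
                   GradReverse.apply(z_tgt, lambd)])
    d = torch.cat([torch.zeros(len(x_src), dtype=torch.long),
                   torch.ones(len(x_tgt), dtype=torch.long)])
    return task_loss + ce(discriminator(z), d)
```

A single backward pass through this combined loss then updates all three modules, realizing the saddle point objective described above.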

3.7 Proofs

In this section we provide all the missing proofs from Sec. 3.3. To make the presentation easier to follow, in each part we first restate the corresponding theorem from Sec. 3.3 and then provide its proof.

3.7.1 Proof of Corollary 3.3.1

Corollary 3.3.1 (Worst case classification bound). Let $\mathcal{H}$ be a hypothesis class with $\mathrm{VCdim}(\mathcal{H}) = d$. If $\widehat{\mathcal{D}}_T$ and $\{\widehat{\mathcal{D}}_{S_i}\}_{i=1}^k$ are the empirical distributions generated with $m$ i.i.d. samples from each domain, then, for $0 < \delta < 1$, with probability at least $1 - \delta$, for all $h \in \mathcal{H}$, we have:
$$\varepsilon_T(h) \le \max_{i \in [k]} \left\{ \widehat{\varepsilon}_{S_i}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_T; \widehat{\mathcal{D}}_{S_i}) + \lambda_i \right\} + O\left( \sqrt{\frac{1}{m}\left( \log\frac{k}{\delta} + d \log\frac{m}{d} \right)} \right), \tag{3.2}$$
where $\lambda_i$ is the combined risk of the optimal hypothesis on domains $S_i$ and $T$.

Proof. For each one of the $k$ source domains, from Thm. 4.2.1, for $\forall \delta > 0$, w.p.b. at least $1 - \delta/k$, the following inequality holds:
$$\varepsilon_T(h) \le \widehat{\varepsilon}_{S_i}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_{S_i}, \widehat{\mathcal{D}}_T) + \lambda_i + O\left( \sqrt{\frac{d \log(m/d) + \log(k/\delta)}{m}} \right).$$

Using a union bound argument, we have:
$$\begin{aligned}
&\Pr\left( \varepsilon_T(h) > \max_{i \in [k]} \left\{ \widehat{\varepsilon}_{S_i}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_T; \widehat{\mathcal{D}}_{S_i}) + \lambda_i \right\} + O\left( \sqrt{\tfrac{1}{m}\left( \log\tfrac{k}{\delta} + d \log\tfrac{m}{d} \right)} \right) \right) \\
&\le \Pr\left( \bigvee_{i \in [k]} \; \varepsilon_T(h) > \widehat{\varepsilon}_{S_i}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_T; \widehat{\mathcal{D}}_{S_i}) + \lambda_i + O\left( \sqrt{\tfrac{1}{m}\left( \log\tfrac{k}{\delta} + d \log\tfrac{m}{d} \right)} \right) \right) \\
&\le \sum_{i \in [k]} \Pr\left( \varepsilon_T(h) > \widehat{\varepsilon}_{S_i}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_T; \widehat{\mathcal{D}}_{S_i}) + \lambda_i + O\left( \sqrt{\tfrac{1}{m}\left( \log\tfrac{k}{\delta} + d \log\tfrac{m}{d} \right)} \right) \right) \\
&\le \sum_{i \in [k]} \delta/k = \delta,
\end{aligned}$$
which completes the proof. $\blacksquare$

3.7.2 Proof of Theorem 3.3.1

Theorem 3.3.1 (Average case classification bound). Let $\mathcal{H}$ be a hypothesis class with $\mathrm{VCdim}(\mathcal{H}) = d$. If $\{\widehat{\mathcal{D}}_{S_i}\}_{i=1}^k$ are the empirical distributions generated with $m$ i.i.d. samples from each domain, and $\widehat{\mathcal{D}}_T$ is the empirical distribution on the target domain generated from $mk$ samples without labels, then, $\forall \alpha \in \mathbb{R}_+^k$ with $\sum_{i \in [k]} \alpha_i = 1$, and for $0 < \delta < 1$, w.p.b. at least $1 - \delta$, for all $h \in \mathcal{H}$, we have:
$$\varepsilon_T(h) \le \sum_{i \in [k]} \alpha_i \left( \widehat{\varepsilon}_{S_i}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_T; \widehat{\mathcal{D}}_{S_i}) \right) + \lambda_\alpha + O\left( \sqrt{\frac{1}{km}\left( \log\frac{1}{\delta} + d \log\frac{km}{d} \right)} \right), \tag{3.3}$$
where $\lambda_\alpha$ is the risk of the optimal hypothesis on the mixture source domain $\sum_{i \in [k]} \alpha_i S_i$ and $T$.

Proof. Consider a mixture distribution of the $k$ source domains where the mixture weights are given by $\alpha$; denote it as $\mathcal{D}_{\tilde{S}} := \sum_{i \in [k]} \alpha_i \mathcal{D}_{S_i}$. Let $\tilde{S}$ be the combined samples from the $k$ domains; then $\tilde{S}$ can equivalently be seen as a sample of size $km$ sampled i.i.d. from $\mathcal{D}_{\tilde{S}}$. Applying Thm. 3.2.1 with $\mathcal{D}_T$ as the target domain and $\mathcal{D}_{\tilde{S}}$ as the source domain, we know that for $0 < \delta < 1$, w.p.b. at least $1 - \delta$,
$$\varepsilon_T(h) \le \widehat{\varepsilon}_{\tilde{S}}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_{\tilde{S}}, \widehat{\mathcal{D}}_T) + \lambda_\alpha + O\left( \sqrt{\frac{d \log(km/d) + \log(1/\delta)}{km}} \right). \tag{3.8}$$

On the other hand, for $\forall h \in \mathcal{H}$, we have:
$$\widehat{\varepsilon}_{\tilde{S}}(h) = \sum_{i \in [k]} \alpha_i \widehat{\varepsilon}_{S_i}(h),$$

and we can upper bound $d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_{\tilde{S}}, \widehat{\mathcal{D}}_T)$ as follows:

$$\begin{aligned}
d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_{\tilde{S}}, \widehat{\mathcal{D}}_T) &= 2 \sup_{A \in \mathcal{A}_{\mathcal{H}\Delta\mathcal{H}}} \left| \Pr_{\widehat{\mathcal{D}}_{\tilde{S}}}(A) - \Pr_{\widehat{\mathcal{D}}_T}(A) \right| \\
&= 2 \sup_{A \in \mathcal{A}_{\mathcal{H}\Delta\mathcal{H}}} \left| \sum_{i \in [k]} \alpha_i \left( \Pr_{\widehat{\mathcal{D}}_{S_i}}(A) - \Pr_{\widehat{\mathcal{D}}_T}(A) \right) \right| \\
&\le 2 \sup_{A \in \mathcal{A}_{\mathcal{H}\Delta\mathcal{H}}} \sum_{i \in [k]} \alpha_i \left| \Pr_{\widehat{\mathcal{D}}_{S_i}}(A) - \Pr_{\widehat{\mathcal{D}}_T}(A) \right| \\
&\le 2 \sum_{i \in [k]} \alpha_i \sup_{A \in \mathcal{A}_{\mathcal{H}\Delta\mathcal{H}}} \left| \Pr_{\widehat{\mathcal{D}}_{S_i}}(A) - \Pr_{\widehat{\mathcal{D}}_T}(A) \right| \\
&= \sum_{i \in [k]} \alpha_i \, d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_{S_i}, \widehat{\mathcal{D}}_T),
\end{aligned}$$
where the first inequality is due to the triangle inequality and the second inequality is by the sub-additivity of the $\sup$ function. Replacing $\widehat{\varepsilon}_{\tilde{S}}(h)$ with $\sum_{i \in [k]} \alpha_i \widehat{\varepsilon}_{S_i}(h)$ and upper bounding $d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_{\tilde{S}}, \widehat{\mathcal{D}}_T)$ by $\sum_{i \in [k]} \alpha_i d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_{S_i}, \widehat{\mathcal{D}}_T)$ in (3.8) completes the proof. $\blacksquare$

3.7.3 Proof of Theorem 3.3.2

Before we give the full proof, we first describe the proof strategy at a high level. Roughly, the proof contains three parts. The first part contains a reduction from regression to classification by relating the pseudo-dimension of the hypothesis class $\mathcal{H}$ to that of its corresponding threshold binary classifiers. The second part uses the $\bar{\mathcal{H}}$-divergence to relate the source and target domains when they differ. The last part uses a standard generalization analysis with pseudo-dimension for regression, when the source and target domains coincide.

First Part of the Proof

To begin with, let $\mathcal{H} = \{h : \mathcal{X} \to [0, 1]\}$ be a set of bounded real-valued functions from the input space $\mathcal{X}$ to $[0, 1]$. We use $\mathrm{Pdim}(\mathcal{H})$ to denote the pseudo-dimension of $\mathcal{H}$, and let $\mathrm{Pdim}(\mathcal{H}) = d$. We first prove the following lemma that will be used in proving the main theorem:

Lemma 3.7.1. For $h, h' \in \mathcal{H} := \{h : \mathcal{X} \to [0, 1]\}$, where $\mathrm{Pdim}(\mathcal{H}) = d$, and for any distributions $\mathcal{D}_S$, $\mathcal{D}_T$ over $\mathcal{X}$,
$$|\varepsilon_S(h, h') - \varepsilon_T(h, h')| \le \frac{1}{2} d_{\bar{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T),$$
where $\bar{\mathcal{H}} := \{ \mathbb{I}_{|h(x) - h'(x)| > t} : h, h' \in \mathcal{H}, 0 \le t \le 1 \}$.

Proof. By definition, for $\forall h, h' \in \mathcal{H}$, we have:

$$\begin{aligned}
|\varepsilon_S(h, h') - \varepsilon_T(h, h')| &\le \sup_{h, h' \in \mathcal{H}} |\varepsilon_S(h, h') - \varepsilon_T(h, h')| \\
&= \sup_{h, h' \in \mathcal{H}} \left| \mathbb{E}_{x \sim \mathcal{D}_S}[|h(x) - h'(x)|] - \mathbb{E}_{x \sim \mathcal{D}_T}[|h(x) - h'(x)|] \right|. \tag{3.9}
\end{aligned}$$
Since $\|h\|_\infty \le 1$ for all $h \in \mathcal{H}$, we have $0 \le |h(x) - h'(x)| \le 1$, $\forall x \in \mathcal{X}$, $\forall h, h' \in \mathcal{H}$. We now use Fubini's

theorem to bound $\left| \mathbb{E}_{x \sim \mathcal{D}_S}[|h(x) - h'(x)|] - \mathbb{E}_{x \sim \mathcal{D}_T}[|h(x) - h'(x)|] \right|$:

$$\begin{aligned}
&\mathbb{E}_{x \sim \mathcal{D}_S}[|h(x) - h'(x)|] - \mathbb{E}_{x \sim \mathcal{D}_T}[|h(x) - h'(x)|] \\
&= \int_0^1 \left( \Pr_{\mathcal{D}_S}(|h(x) - h'(x)| > t) - \Pr_{\mathcal{D}_T}(|h(x) - h'(x)| > t) \right) \mathrm{d}t \\
&\le \int_0^1 \left| \Pr_{\mathcal{D}_S}(|h(x) - h'(x)| > t) - \Pr_{\mathcal{D}_T}(|h(x) - h'(x)| > t) \right| \mathrm{d}t
\end{aligned}$$

$$\le \sup_{t \in [0,1]} \left| \Pr_{\mathcal{D}_S}(|h(x) - h'(x)| > t) - \Pr_{\mathcal{D}_T}(|h(x) - h'(x)| > t) \right|.$$
Now, in view of (3.9) and the definition of $\bar{\mathcal{H}}$, we have:

$$\begin{aligned}
\sup_{h, h' \in \mathcal{H}} \sup_{t \in [0,1]} \left| \Pr_{\mathcal{D}_S}(|h(x) - h'(x)| > t) - \Pr_{\mathcal{D}_T}(|h(x) - h'(x)| > t) \right| &= \sup_{\bar{h} \in \bar{\mathcal{H}}} \left| \Pr_{\mathcal{D}_S}(\bar{h}(x) = 1) - \Pr_{\mathcal{D}_T}(\bar{h}(x) = 1) \right| \\
&= \sup_{A \in \mathcal{A}_{\bar{\mathcal{H}}}} \left| \Pr_{\mathcal{D}_S}(A) - \Pr_{\mathcal{D}_T}(A) \right| \\
&= \frac{1}{2} d_{\bar{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T).
\end{aligned}$$
Combining all the inequalities above finishes the proof. $\blacksquare$

Next we bound $\mathrm{Pdim}(|\mathcal{H} - \mathcal{H}|)$:

Lemma 3.7.2. If $\mathrm{Pdim}(\mathcal{H}) = d$, then $\mathrm{Pdim}(|\mathcal{H} - \mathcal{H}|) \le 2d$.

Proof. By the definition of pseudo-dimension, we immediately have $\mathrm{VCdim}(\bar{\mathcal{H}}) = \mathrm{Pdim}(|\mathcal{H} - \mathcal{H}|)$. Next, observe that each function $g \in \bar{\mathcal{H}}$ can be represented as a two-layer linear threshold neural network with two hidden units, one bias input unit, and one output unit. Specifically, since

$$|h(x) - h'(x)| > t \iff \max\{h(x) - h'(x) - t, \; h'(x) - h(x) - t\} > 0,$$
which is also equivalent to

$$\operatorname{sgn}(h(x) - h'(x) - t) + \operatorname{sgn}(h'(x) - h(x) - t) > 0, \tag{3.10}$$
the above expression can be implemented by a two-layer linear threshold neural network with one output unit. Hence, from (Anthony and Bartlett, 2009, Chapters 6 and 8), the VC dimension of $\bar{\mathcal{H}}$ is at most twice the pseudo-dimension of $\mathcal{H}$, completing the proof. $\blacksquare$

Second Part of the Proof

One technical lemma we will frequently use is the triangle inequality w.r.t. $\varepsilon_{\mathcal{D}}(h)$, $\forall h \in \mathcal{H}$:

Lemma 3.7.3. For any hypothesis class $\mathcal{H}$ and any distribution $\mathcal{D}$ on $\mathcal{X}$, the following triangle inequality holds: $\forall h, h', f \in \mathcal{H}$,
$$\varepsilon_{\mathcal{D}}(h, h') \le \varepsilon_{\mathcal{D}}(h, f) + \varepsilon_{\mathcal{D}}(f, h').$$

Proof.

$$\varepsilon_{\mathcal{D}}(h, h') = \mathbb{E}_{x \sim \mathcal{D}}[|h(x) - h'(x)|] \le \mathbb{E}_{x \sim \mathcal{D}}[|h(x) - f(x)| + |f(x) - h'(x)|] = \varepsilon_{\mathcal{D}}(h, f) + \varepsilon_{\mathcal{D}}(f, h'). \qquad \blacksquare$$
Now we prove the following lemma in the regression setting when there is only one source domain and one target domain.

Lemma 3.7.4. Let $\mathcal{H}$ be a set of real-valued functions from $\mathcal{X}$ to $[0, 1]$, and let $\mathcal{D}_S$, $\mathcal{D}_T$ be the source and target distributions, respectively. Define $\bar{\mathcal{H}} := \{ \mathbb{I}_{|h(x) - h'(x)| > t} : h, h' \in \mathcal{H}, 0 \le t \le 1 \}$. Then $\forall h \in \mathcal{H}$, the following inequality holds:

$$\varepsilon_T(h) \le \varepsilon_S(h) + \frac{1}{2} d_{\bar{\mathcal{H}}}(\mathcal{D}_T; \mathcal{D}_S) + \lambda,$$
where $\lambda := \inf_{h' \in \mathcal{H}} \varepsilon_S(h') + \varepsilon_T(h')$.

Proof. Let $h^* := \arg\min_{h' \in \mathcal{H}} \varepsilon_S(h') + \varepsilon_T(h')$. For $\forall h \in \mathcal{H}$:
$$\begin{aligned}
\varepsilon_T(h) &\le \varepsilon_T(h^*) + \varepsilon_T(h, h^*) \\
&= \varepsilon_T(h^*) + \varepsilon_T(h, h^*) - \varepsilon_S(h, h^*) + \varepsilon_S(h, h^*) \\
&\le \varepsilon_T(h^*) + |\varepsilon_T(h, h^*) - \varepsilon_S(h, h^*)| + \varepsilon_S(h, h^*) \\
&\le \varepsilon_T(h^*) + \varepsilon_S(h, h^*) + \frac{1}{2} d_{\bar{\mathcal{H}}}(\mathcal{D}_T, \mathcal{D}_S) \\
&\le \varepsilon_T(h^*) + \varepsilon_S(h) + \varepsilon_S(h^*) + \frac{1}{2} d_{\bar{\mathcal{H}}}(\mathcal{D}_T, \mathcal{D}_S) \\
&= \varepsilon_S(h) + \frac{1}{2} d_{\bar{\mathcal{H}}}(\mathcal{D}_T; \mathcal{D}_S) + \lambda.
\end{aligned}$$
The first and fourth inequalities are by the triangle inequality, and the third one is from Lemma 3.7.1. $\blacksquare$

Third Part of the Proof

In this part of the proof, we use concentration inequalities to bound the source domain error $\varepsilon_S(h)$ and the divergence $d_{\bar{\mathcal{H}}}(\mathcal{D}_T; \mathcal{D}_S)$ in Lemma 3.7.4. We first introduce the following generalization theorem in the regression setting when the source and target distributions are the same:

Lemma 3.7.5 (Thm. 10.6, (Mohri et al., 2012)). Let $\mathcal{H}$ be a family of real-valued functions from $\mathcal{X}$ to $[0, 1]$. Assume that $\mathrm{Pdim}(\mathcal{H}) = d$. Then, for $\forall \delta > 0$, w.p.b. at least $1 - \delta$ over the choice of a sample of size $m$, the following inequality holds for all $h \in \mathcal{H}$:
$$\varepsilon(h) \le \widehat{\varepsilon}(h) + O\left( \sqrt{\frac{1}{m}\left( \log\frac{1}{\delta} + d \log\frac{m}{d} \right)} \right).$$

The next lemma bounds the $\bar{\mathcal{H}}$-divergence between the population distribution and its corresponding empirical distribution:

Lemma 3.7.6. Let $\mathcal{D}_S$ and $\mathcal{D}_T$ be the source and target distributions over $\mathcal{X}$, respectively. Let $\mathcal{H}$ be a class of real-valued functions from $\mathcal{X}$ to $[0, 1]$, with $\mathrm{Pdim}(\mathcal{H}) = d$. Define $\bar{\mathcal{H}} := \{ \mathbb{I}_{|h(x) - h'(x)| > t} : h, h' \in \mathcal{H}, 0 \le t \le 1 \}$. If $\widehat{\mathcal{D}}_S$ and $\widehat{\mathcal{D}}_T$ are the empirical distributions of $\mathcal{D}_S$ and $\mathcal{D}_T$ generated with $m$ i.i.d. samples from each domain, then, for $0 < \delta < 1$, w.p.b. at least $1 - \delta$, we have:
$$d_{\bar{\mathcal{H}}}(\mathcal{D}_S; \mathcal{D}_T) \le d_{\bar{\mathcal{H}}}(\widehat{\mathcal{D}}_S; \widehat{\mathcal{D}}_T) + O\left( \sqrt{\frac{1}{m}\left( \log\frac{1}{\delta} + d \log\frac{m}{d} \right)} \right).$$

Proof. By the definition of pseudo-dimension, we have $\mathrm{VCdim}(\bar{\mathcal{H}}) = \mathrm{Pdim}(|\mathcal{H} - \mathcal{H}|)$. Now, from Lemma 3.7.2, $\mathrm{Pdim}(|\mathcal{H} - \mathcal{H}|) \le 2\,\mathrm{Pdim}(\mathcal{H}) = 2d$, so $\mathrm{VCdim}(\bar{\mathcal{H}}) \le 2d$. The lemma then follows from Ben-David et al. (2010, Lemma 1) using a standard concentration inequality. $\blacksquare$

Proof of the Generalization Bound under the Regression Setting

We prove the following generalization bound for regression problems when there is only one source domain and one target domain. The extension to multiple source domains follows exactly as in the proof of Thm. 3.3.1, hence we omit it here.

Theorem 3.7.1. Let $\mathcal{H}$ be a set of real-valued functions from $\mathcal{X}$ to $[0, 1]$ with $\mathrm{Pdim}(\mathcal{H}) = d$. Let $\widehat{\mathcal{D}}_S$ ($\widehat{\mathcal{D}}_T$) be the empirical distribution induced by a sample of size $m$ drawn from $\mathcal{D}_S$ ($\mathcal{D}_T$). Then, w.p.b. at least $1 - \delta$, $\forall h \in \mathcal{H}$,
$$\varepsilon_T(h) \le \widehat{\varepsilon}_S(h) + \frac{1}{2} d_{\bar{\mathcal{H}}}(\widehat{\mathcal{D}}_S, \widehat{\mathcal{D}}_T) + \lambda + O\left( \sqrt{\frac{d \log(m/d) + \log(1/\delta)}{m}} \right), \tag{3.11}$$
where $\lambda = \inf_{h' \in \mathcal{H}} \varepsilon_S(h') + \varepsilon_T(h')$ and $\bar{\mathcal{H}} := \{ \mathbb{I}_{|h(x) - h'(x)| > t} : h, h' \in \mathcal{H}, 0 \le t \le 1 \}$.

Proof. Combining Lemma 3.7.4, Lemma 3.7.5, and Lemma 3.7.6 with a union bound finishes the proof. $\blacksquare$

3.8 Conclusion

In this chapter we theoretically analyze generalization bounds for DA under the setting of multiple source domains with labeled instances and one target domain with unlabeled instances. Specifically, we propose average case generalization bounds for both classification and regression problems. The new bounds have interesting interpretations, and the one for classification reduces to an existing bound when there is only one source domain. Following our theoretical results, we propose two MDAN models to learn feature representations that are invariant under multiple domain shifts while at the same time being discriminative for the learning task. Both the hard and soft versions of MDAN are generalizations of the popular DANN to the case where multiple source domains are available. Empirically, MDAN outperforms the state-of-the-art DA methods on three real-world datasets, including a sentiment analysis task, a digit classification task, and a visual vehicle counting task, demonstrating its effectiveness in multisource domain adaptation for both classification and regression problems.

Appendix

In this section we describe more details about the datasets and the experimental settings. We extensively evaluate the proposed methods on three datasets:
1. We first evaluate our methods on the Amazon Reviews dataset (Chen et al., 2012) for sentiment analysis.
2. We evaluate the proposed methods on the digit classification datasets, including MNIST (LeCun et al., 1998b), MNIST-M (Ganin et al., 2016), SVHN (Netzer et al., 2011), and SynthDigits (Ganin et al., 2016).
3. We further evaluate the proposed methods on the public dataset WebCamT (Zhang et al., 2017a) for vehicle counting; it contains 60,000 labeled images from 12 city cameras with different distributions.
Due to the substantial differences between these datasets and their corresponding learning tasks, we introduce the detailed dataset descriptions, network architectures, and training parameters for each dataset in the following subsections.

3.A Details on Amazon Reviews Evaluation

The Amazon Reviews dataset includes four domains, each composed of reviews on a specific kind of product (Books, DVDs, Electronics, and Kitchen appliances). Reviews are encoded as 5000-dimensional feature vectors of unigrams and bigrams. The labels are binary: 0 if the product is ranked up to 3 stars, and 1 if the product is ranked 4 or 5 stars. We take one product domain as the target and the other three as source domains. Each source domain has 2000 labeled examples, and the target test set has 3000 to 6000 examples. We implement the Hard-Max and Soft-Max methods based on a basic network with one input layer (5000 units) and three hidden layers (1000, 500, 100 units). The network is trained for 50 epochs with dropout rate 0.7. We compare Hard-Max and Soft-Max with three baselines.
Baseline 1: MLPNet. It is the basic network of our methods (one input layer and three hidden layers), trained for 50 epochs with dropout rate 0.01.
Baseline 2: Marginalized Stacked Denoising Autoencoders (mSDA) (Chen et al., 2012). It takes the unlabeled parts of both source and target samples to learn a feature map from the input space to a new representation space. As a denoising autoencoder algorithm, it finds a feature representation from which one can (approximately) reconstruct the original features of an example from its noisy counterpart.
Baseline 3: DANN. We implement DANN based on the algorithm described in (Ganin et al., 2016) with the same basic network as our methods.
Hyperparameters of the proposed and baseline methods are selected by cross validation. Table 3.A.1 summarizes the network architecture and some hyperparameters. To validate the statistical significance of the results, we run a non-parametric Wilcoxon signed-rank test for each task to compare Soft-Max with the other competitors, as shown in Table 3.A.2. Each cell corresponds to the p-value of a Wilcoxon test between Soft-Max and one of the other methods, under the null hypothesis that the two paired samples have the same mean. From these p-values, we see that Soft-Max is convincingly better than the other methods.

Table 3.A.1: Network parameters for the proposed and baseline methods

Method   Input layer   Hidden layers      Epochs   Dropout   Domains   Adaptation weight   γ
MLPNet   5000          (1000, 500, 100)   50       0.01      N/A       N/A                 N/A
DANN     5000          (1000, 500, 100)   50       0.01      1         0.01                N/A
MDAN     5000          (1000, 500, 100)   50       0.7       3         0.1                 10
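For concreteness, a sketch of the basic network from Table 3.A.1 as a PyTorch module; the layer sizes and dropout rates come from the table, while the ReLU activations are an assumption:

```python
import torch.nn as nn

# Basic sentiment network from Table 3.A.1: a 5000-dimensional bag-of-ngrams
# input, three fully connected hidden layers (1000, 500, 100), and a binary
# output head.  Dropout is 0.7 for MDAN and 0.01 for the MLPNet/DANN baselines.
def make_basic_network(dropout=0.7):
    return nn.Sequential(
        nn.Linear(5000, 1000), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(1000, 500), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(500, 100), nn.ReLU(),
        nn.Linear(100, 2),
    )
```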

Table 3.A.2: p-values under Wilcoxon test.

Task   MLPNet vs. Soft-Max   mSDA vs. Soft-Max   Best-Single-DANN vs. Soft-Max   Combine-DANN vs. Soft-Max   Hard-Max vs. Soft-Max
B      0.550                 0.101               0.521                           0.013                       0.946
D      0.000                 0.072               0.000                           0.051                       0.000
E      0.066                 0.000               0.097                           0.150                       0.022
K      0.306                 0.001               0.001                           0.239                       0.008
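Each entry of Table 3.A.2 comes from a standard Wilcoxon signed-rank test on paired scores; a minimal sketch with scipy, where the accuracy arrays are placeholders rather than our actual results:

```python
from scipy.stats import wilcoxon

# Paired per-run accuracies for two methods (placeholder numbers).
soft_max_acc = [0.82, 0.80, 0.85, 0.79, 0.83]
baseline_acc = [0.80, 0.78, 0.84, 0.80, 0.81]

# Wilcoxon signed-rank test on the paired differences; a small p-value
# rejects the null hypothesis that the two methods perform the same.
stat, p_value = wilcoxon(soft_max_acc, baseline_acc)
print(f"p-value: {p_value:.3f}")
```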

3.B Details on Digit Datasets Evaluation

We evaluate the proposed methods on the digit classification problem. Following the experiments in (Ganin et al., 2016), we combine four popular digit datasets (MNIST, MNIST-M, SVHN, and SynthDigits) to build the multi-source domain dataset. MNIST is a handwritten digit database with 60,000 training examples and 10,000 testing examples. The digits have been size-normalized and centered in 28 × 28 images. MNIST-M is generated by blending digits from the original MNIST set over patches randomly extracted from color photos from BSDS500 (Arbelaez et al., 2011; Ganin et al., 2016). It has 59,001 training images and 9,001 testing images with 32 × 32 resolution. An output sample is produced by taking a patch from a photo and inverting its pixels at positions corresponding to the pixels of a digit. For DA problems, this domain is quite distinct from MNIST, as the background and the strokes are no longer constant. SVHN is a real-world house number dataset with 73,257 training images and 26,032 testing images. It can be seen as similar to MNIST, but it comes from a significantly harder, unsolved, real-world problem. SynthDigits consists of 500,000 digit images generated by Ganin et al. (2016) from Windows™ fonts by varying the text, positioning, orientation, background and stroke colors, and the amount of blur. The degrees of variation were chosen to simulate SVHN, but the two datasets are still rather distinct, with the biggest difference being the structured clutter in the background of SVHN images. We take MNIST-M, SVHN, and MNIST as the target domain in turn, and the remaining three as sources. We implement the Hard-Max and Soft-Max versions according to Alg. 1 based on a basic network, as shown in Fig. 3.B.1; a sketch of how the two versions combine the per-domain losses is given below. The baseline methods are also built on the same basic network structure to put them on an equal footing. The network structure and parameters of MDAN are illustrated in Fig. 3.B.1. The learning rate is initialized at 0.01 and adjusted by first and second order momentum during training. The domain adaptation parameter of MDAN is selected by cross validation. In each mini-batch of the MDAN training process, we randomly sample the same number of unlabeled target images as source images.
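As referenced above, the two MDAN variants differ only in how the per-source-domain costs are combined. A minimal sketch of that combination step, assuming each entry of the input is the already-computed cost of one source domain (task loss plus the weighted domain-adversarial loss):

```python
import torch

def combine_domain_losses(per_domain_losses, version="soft", gamma=10.0):
    """Combine per-source-domain costs as in the two MDAN variants.

    Hard-Max optimizes the worst source domain directly; Soft-Max replaces
    the max with a log-sum-exp, a smooth upper bound on the max that spreads
    gradient across all domains (gamma controls the sharpness).
    """
    if version == "hard":
        return per_domain_losses.max()
    return torch.logsumexp(gamma * per_domain_losses, dim=0) / gamma
```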

Figure 3.B.1: MDAN network architecture for digit classification.

3.C Details on WebCamT Vehicle Counting

WebCamT is a public dataset of large-scale city camera videos, which have low resolution (352 × 240), low frame rate (1 frame/second), and high occlusion. WebCamT has 60,000 frames annotated with rich information: bounding box, vehicle type, vehicle orientation, vehicle count, vehicle re-identification, and weather condition. The dataset is divided into training and testing sets, with 42,200 and 17,800 frames, respectively, covering multiple cameras and different weather conditions. WebCamT is an appropriate dataset for evaluating domain adaptation methods, as it covers multiple city cameras, each located at a different intersection of the city with its own perspective and scene; thus, each camera's data has a different distribution from the others. The dataset is quite challenging and in high demand of domain adaptation solutions, as it has 6,000,000 unlabeled images from 200 cameras but only 60,000 labeled images from 12 cameras. The experiments on WebCamT provide an interesting application of our proposed MDAN: when dealing with a spatially and temporally large-scale dataset with substantial variation, it is prohibitively expensive and time-consuming to label a large number of instances covering all the variations. As a result, only a limited portion of the dataset can be annotated, which cannot cover all the data domains in the dataset. MDAN provides an effective solution for this kind of application by adapting the deep model from multiple source domains to the unlabeled target domain. We evaluate the proposed methods on different numbers of source cameras. Each source camera provides 2000 labeled images for training, and the test set has 2000 images from the target camera. In each mini-batch, we randomly sample the same number of unlabeled target images as source images. We implement the Hard-Max and Soft-Max versions of MDAN according to Alg. 1, based on the basic vehicle counting network FCN described in (Zhang et al., 2017a); please refer to (Zhang et al., 2017a) for the detailed network architecture and parameters. The learning rate is initialized at 0.01 and adjusted by first and second order momentum during training. The domain adaptation parameter is selected by cross validation. We compare our method with two baselines.
Baseline 1: FCN. It is our basic network without domain adaptation, as introduced in (Zhang et al., 2017a).
Baseline 2: DANN. We implement DANN on top of the same basic network following the algorithm introduced in (Ganin et al., 2016).

Figure 3.C.1: Locations of the source and target cameras on the map.

Chapter 4

Learning Invariant Representations for Domain Adaptation

Due to the ability of deep neural nets to learn rich representations, recent advances in unsupervised domain adaptation have focused on learning domain-invariant features while achieving a small error on the source domain. The hope is that the learnt representation, together with the hypothesis learnt from the source domain, can generalize to the target domain. In this chapter, we first construct a simple counterexample showing that, contrary to common belief, the above conditions are not sufficient to guarantee successful domain adaptation. In particular, the counterexample exhibits conditional shift: the class-conditional distributions of input features change between source and target domains. To give a sufficient condition for domain adaptation, we propose a natural and interpretable generalization upper bound that explicitly takes into account the aforementioned shift. Moreover, we shed new light on the problem by proving an information-theoretic lower bound on the joint error of any domain adaptation method that attempts to learn invariant representations. Our result characterizes a fundamental tradeoff between learning invariant representations and achieving small joint error on both domains when the marginal label distributions differ from source to target. Finally, we conduct experiments on real-world datasets that corroborate our theoretical findings.

4.1 Introduction

The recent successes of supervised deep learning methods have been partially attributed to rich datasets and increasing computational power. However, in many critical applications, e.g., self-driving cars or personal healthcare, it is often prohibitively expensive and time-consuming to collect large-scale supervised training data. Unsupervised domain adaptation (DA) addresses this limitation by trying to transfer knowledge from a labeled source domain to an unlabeled target domain, and a large body of work tries to achieve this by exploring domain-invariant structures and representations to bridge the gap. Theoretical results (Ben-David et al., 2010; Mansour and Schain, 2012; Mansour et al., 2009a) and algorithms (Adel et al., 2017; Ajakan et al., 2014; Becker et al., 2013; Glorot et al., 2011; Pei et al., 2018) under this setting are abundant.

Due to the ability of deep neural nets to learn rich feature representations, recent advances in domain adaptation have focused on using these networks to learn invariant representations, i.e., intermediate features whose distribution is the same in the source and target domains, while at the same time achieving small error on the source domain. The hope is that the learnt intermediate representation, together with the hypothesis learnt using labeled data from the source domain, can generalize to the target domain. Nevertheless, from a theoretical standpoint, it is not at all clear whether aligned representations and small source error are sufficient to guarantee good generalization on the target domain. In fact, despite being successfully applied in various applications (Hoffman et al., 2017b; Zhang et al., 2017b), such methods have also been reported to fail to generalize on certain closely related source/target pairs, e.g., digit classification from MNIST to SVHN (Ganin et al., 2016).

Given the wide application of domain adaptation methods based on learning invariant representations, we attempt in this chapter to answer the following important and intriguing question: Is finding invariant representations while at the same time achieving a small source error sufficient to guarantee a small target error? If not, under what conditions is it?

Contrary to common belief, we give a negative answer to the above question by constructing a simple example showing that these two conditions are not sufficient to guarantee target generalization, even in the case of perfectly aligned representations between the source and target domains. In fact, our example shows that the objective of learning invariant representations while minimizing the source error can actually be hurtful, in the sense that the better the objective, the larger the target error. At a colloquial level, this happens because learning invariant representations can break the originally favorable underlying problem structure, i.e., close labeling functions and conditional distributions. To understand when such methods work, we propose a generalization upper bound as a sufficient condition that explicitly takes into account the conditional shift between the source and target domains. The proposed upper bound admits a natural interpretation and decomposition in domain adaptation, and we show that it is tighter than existing results in certain cases. Simultaneously, to understand the necessary conditions for representation-based approaches to work, we prove an information-theoretic lower bound on the joint error of both domains for any algorithm based on learning invariant representations.
Our result complements the above upper bound and also extends the constructed example to more general settings. The lower bound sheds new light on this problem by characterizing a fundamental tradeoff between learning invariant representations and achieving small joint error on both domains when the marginal label distributions differ from source to target. Our lower bound directly implies that minimizing the source error while achieving an invariant representation will only increase the target error. We conduct experiments on real-world datasets that corroborate this theoretical implication. Together with the generalization upper bound, our results suggest that adaptation algorithms should be designed to align the label distributions as well when learning an invariant representation (cf. Sec. 4.4.3).

Figure 4.1.1: A counterexample where invariant representations lead to large joint error on the source and target domains. Before the transformation $g(\cdot)$, $h^*(x) = 1$ iff $x \in (-1/2, 3/2)$ achieves perfect classification on both domains. After the transformation, the source and target distributions are perfectly aligned, but no hypothesis can achieve a small joint error.

We believe these insights will be helpful to guide the future design of domain adaptation and representation learning algorithms.

4.2 Preliminaries

We first review a theoretical model for domain adaptation (DA) (Ben-David et al., 2007, 2010; Blitzer et al., 2008; Kifer et al., 2004).

4.2.1 Problem Setup

We consider the unsupervised domain adaptation problem where the learning algorithm has access to a set of $n$ labeled points $\{(x_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$ sampled i.i.d. from the source domain and a set of $m$ unlabeled points $\{x_j\}_{j=1}^m \in \mathcal{X}^m$ sampled i.i.d. from the target domain. At a colloquial level, the goal of an unsupervised domain adaptation algorithm is to generalize well on the target domain by learning from labeled samples from the source domain as well as unlabeled samples from the target domain. Formally, let the risk of hypothesis $h$ be the error of $h$ w.r.t. the true labeling function under domain $\mathcal{D}_S$, i.e., $\varepsilon_S(h) := \varepsilon_S(h, f_S)$. As is common in computational learning theory, we denote by $\widehat{\varepsilon}_S(h)$ the empirical risk of $h$ on the source domain. Similarly, we use $\varepsilon_T(h)$ and $\widehat{\varepsilon}_T(h)$ to denote the true risk and the empirical risk on the target domain. The problem of domain adaptation considered in this chapter can be stated as: under what conditions, and by what algorithms, can we guarantee that a small training error $\widehat{\varepsilon}_S(h)$ implies a small test error $\varepsilon_T(h)$? Clearly, this goal is not always attainable if the source and target domains are far away from each other.

4.2.2 A Theoretical Model for Domain Adaptation

To measure the similarity between two domains, it is crucial to define a discrepancy measure between them. To this end, Ben-David et al. (2010) proposed the $\mathcal{H}$-divergence to measure the distance between two distributions:

Definition 4.2.1 ($\mathcal{H}$-divergence). Let $\mathcal{H}$ be a hypothesis class on input space $\mathcal{X}$, and let $\mathcal{A}_{\mathcal{H}}$ be the collection of subsets of $\mathcal{X}$ that are the support of some hypothesis in $\mathcal{H}$, i.e., $\mathcal{A}_{\mathcal{H}} := \{h^{-1}(1) \mid h \in \mathcal{H}\}$. The distance between two distributions $\mathcal{D}$ and $\mathcal{D}'$ based on $\mathcal{H}$ is:
$$d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}') := \sup_{A \in \mathcal{A}_{\mathcal{H}}} \left| \Pr_{\mathcal{D}}(A) - \Pr_{\mathcal{D}'}(A) \right|.$$

(To be precise, Ben-David et al. (2007)'s original definition of the $\mathcal{H}$-divergence has a factor of 2; we choose the current definition as the constant factor is inessential.)
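To make the definition concrete, consider estimating $d_{\mathcal{H}}$ from samples when $\mathcal{H}$ is the class of one-dimensional threshold classifiers, so that $\mathcal{A}_{\mathcal{H}}$ consists of half-lines. A small numerical sketch under that assumption (the sampled distributions are illustrative):

```python
import numpy as np

def h_divergence_thresholds(xs, xt):
    """Empirical H-divergence for H = {x -> 1[x > a]} on the real line.

    For threshold hypotheses, A_H is the set of half-lines (a, inf), so
    d_H is the largest gap between the two empirical tail probabilities,
    a one-sided Kolmogorov-Smirnov-type statistic.
    """
    candidates = np.concatenate([xs, xt])
    gaps = [abs(np.mean(xs > a) - np.mean(xt > a)) for a in candidates]
    return max(gaps)

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 0, size=1000)   # source ~ U(-1, 0)
xt = rng.uniform(1, 2, size=1000)    # target ~ U(1, 2)
print(h_divergence_thresholds(xs, xt))  # close to 1: fully separable domains
```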

The $\mathcal{H}$-divergence is particularly favorable in the analysis of domain adaptation with binary classification problems, and it has also been generalized to the discrepancy distance (Cortes and Mohri, 2014; Cortes et al., 2008; Mansour et al., 2009a,c) for general loss functions, including losses for regression problems. Both the $\mathcal{H}$-divergence and the discrepancy distance can be estimated using finite unlabeled samples from both domains when $\mathcal{H}$ has a finite VC-dimension.

One flexibility of the $\mathcal{H}$-divergence is that its power in measuring the distance between two distributions can be controlled by the richness of the hypothesis class $\mathcal{H}$. To see this, first consider the situation where $\mathcal{H}$ is very restrictive, so that it only contains the constant functions $h \equiv 0$ and $h \equiv 1$. In this case, it can be readily verified from the definition that $d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}') = 0$, $\forall \mathcal{D}, \mathcal{D}'$. On the other extreme, if $\mathcal{H}$ contains all the measurable binary functions, then $d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}') = 0$ iff $\mathcal{D}(\cdot) = \mathcal{D}'(\cdot)$ almost surely. In this case the $\mathcal{H}$-divergence reduces to the total variation, or equivalently the $L_1$ distance, between the two distributions.

Given a hypothesis class $\mathcal{H}$, we define its symmetric difference w.r.t. itself as $\mathcal{H}\Delta\mathcal{H} = \{ h(x) \oplus h'(x) \mid h, h' \in \mathcal{H} \}$, where $\oplus$ is the xor operation. Let $h^*$ be the optimal hypothesis that achieves the minimum joint risk on both the source and target domains, $h^* := \arg\min_{h \in \mathcal{H}} \varepsilon_S(h) + \varepsilon_T(h)$, and let $\lambda^*$ denote the joint risk of the optimal hypothesis $h^*$: $\lambda^* := \varepsilon_S(h^*) + \varepsilon_T(h^*)$. Ben-David et al. (2007) proved the following generalization bound on the target risk in terms of the empirical source risk and the discrepancy between the source and target domains:

Theorem 4.2.1 (Ben-David et al. (2007)). Let $\mathcal{H}$ be a hypothesis space of VC-dimension $d$ and let $\widehat{\mathcal{D}}_S$ (resp. $\widehat{\mathcal{D}}_T$) be the empirical distribution induced by a sample of size $n$ drawn from $\mathcal{D}_S$ (resp. $\mathcal{D}_T$). Then, w.p. at least $1 - \delta$, $\forall h \in \mathcal{H}$,
$$\varepsilon_T(h) \le \widehat{\varepsilon}_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_S, \widehat{\mathcal{D}}_T) + \lambda^* + O\left( \sqrt{\frac{d \log n + \log(1/\delta)}{n}} \right). \tag{4.1}$$

The bound depends on $\lambda^*$, the optimal joint risk that can be achieved by hypotheses in $\mathcal{H}$. The intuition is the following: if $\lambda^*$ is large, we cannot hope for successful domain adaptation. Later, in Sec. 4.4.3, we shall return to this term to show an information-theoretic lower bound on it for any approach based on learning invariant representations.

Theorem 4.2.1 is the foundation of many recent works on unsupervised domain adaptation via learning invariant representations (Ajakan et al., 2014; Ganin et al., 2016; Pei et al., 2018; Zhao et al., 2018b,c). It has also inspired various applications of domain adaptation with adversarial learning, e.g., video analysis (Hoffman et al., 2016, 2017b; Shrivastava et al., 2016; Tzeng et al., 2017), natural language understanding (Fu et al., 2017; Zhang et al., 2017b), and speech recognition (Hosseini-Asl et al., 2018; Zhao et al., 2019g), to name a few. At a high level, the key idea is to learn a rich and parametrized feature transformation $g : \mathcal{X} \mapsto \mathcal{Z}$ such that the induced source and target distributions (on $\mathcal{Z}$) are close, as measured by the $\mathcal{H}$-divergence. We call $g$ an invariant representation w.r.t. $\mathcal{H}$ if $d_{\mathcal{H}}(\mathcal{D}_S^g, \mathcal{D}_T^g) = 0$, where $\mathcal{D}_S^g$ ($\mathcal{D}_T^g$) is the induced source (target) distribution. At the same time, these algorithms also try to find a new hypothesis (on the representation space $\mathcal{Z}$) that achieves a small empirical error on the source domain. As a whole, these two procedures correspond to simultaneously finding an invariant representation and a hypothesis that minimize the first two terms in the generalization upper bound of Theorem 4.2.1.

4.3 Related Work

A number of adaptation approaches based on learning invariant representations have been proposed in recent years. Although in this chapter we mainly focus on using the $\mathcal{H}$-divergence to characterize the

discrepancy between two distributions, other distance measures can be used as well, e.g., the maximum mean discrepancy (MMD) (Long et al., 2014, 2015, 2016), the Wasserstein distance (Courty et al., 2017a,b; Lee and Raginsky, 2018; Shen et al., 2018), etc. Under the theoretical framework of the $\mathcal{H}$-divergence, Ganin et al. (2016) propose a domain-adversarial neural network (DANN) to learn domain-invariant features. Adversarial training techniques that aim to build feature representations that are indistinguishable between source and target domains have been proposed in the last few years (Ajakan et al., 2014; Ganin et al., 2016). Specifically, one of the central ideas is to use neural networks, which are powerful function approximators, to approximate the $\mathcal{H}$-divergence between two domains (Ben-David et al., 2007, 2010; Kifer et al., 2004). The overall algorithm can be viewed as a zero-sum two-player game: one network tries to learn feature representations that can fool the other network, whose goal is to distinguish the representations generated on the source domain from those generated on the target domain. In a concurrent work, Johansson et al. (2019) also identified the insufficiency of learning domain-invariant representations for successful adaptation. They further analyzed the information loss of non-invertible transformations and proposed a generalization upper bound that directly takes it into account. In our work, by proving an information-theoretic lower bound on the joint error of these methods, we show that although invariant representations can be achieved, this does not necessarily translate to good generalization on the target domain, in particular when the label distributions of the two domains differ significantly. Causal approaches based on conditional and label shift for domain adaptation also exist (Azizzadenesheli et al., 2018; Gong et al., 2016; Lipton et al., 2018; Zhang et al., 2013). One typical assumption made to simplify the analysis in this line of work is that the source and target domains share the same generative distribution and only differ in the marginal label distributions. It is worth noting that Zhang et al. (2013) and Gong et al. (2016) showed that both label and conditional shift can be successfully corrected when the changes in the generative distribution follow certain parametric families. In this chapter we focus on representation learning and do not make such explicit assumptions.

4.4 Theoretical Analysis

Is finding invariant representations alone a sufficient condition for the success of domain adaptation? Clearly it is not. Consider the following simple counterexample: let $g_c : \mathcal{X} \mapsto \mathcal{Z}$ be a constant function, where $\forall x \in \mathcal{X}$, $g_c(x) = c \in \mathcal{Z}$. Then for any discrepancy distance $d(\cdot, \cdot)$ over two distributions, including the $\mathcal{H}$-divergence, MMD, and the Wasserstein distance, and for any distributions $\mathcal{D}_S$, $\mathcal{D}_T$ over the input space $\mathcal{X}$, we have $d(\mathcal{D}_S^{g_c}, \mathcal{D}_T^{g_c}) = 0$, where we use $\mathcal{D}_S^{g_c}$ (resp. $\mathcal{D}_T^{g_c}$) to denote the source (resp. target) distribution induced by the transformation $g_c$ over the representation space $\mathcal{Z}$. Furthermore, it is fairly easy to construct source and target domains $\langle \mathcal{D}_S, f_S \rangle$, $\langle \mathcal{D}_T, f_T \rangle$ such that for any hypothesis $h : \mathcal{Z} \mapsto \mathcal{Y}$, $\varepsilon_T(h \circ g_c) \ge 1/2$, while there exists a classification function $f : \mathcal{X} \to \mathcal{Y}$ that achieves small error, e.g., the labeling function.

One may argue, with good reason, that in the counterexample above, the empirical source error $\widehat{\varepsilon}_S(h \circ g_c)$ is also large with high probability. Intuitively, this is because the simple constant transformation $g_c$ fails to retain the discriminative information about the classification task at hand, despite the fact that it constructs invariant representations.

Is finding invariant representations and achieving a small source error sufficient to guarantee a small target error? In this section we first give a negative answer to this question by constructing a counterexample where there exist a nontrivial transformation function $g : \mathcal{X} \mapsto \mathcal{Z}$ and hypothesis $h : \mathcal{Z} \mapsto \mathcal{Y}$ such that both $\varepsilon_S(h \circ g)$ and $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S^g, \mathcal{D}_T^g)$ are small, while at the same time the target error $\varepsilon_T(h \circ g)$ is large. Motivated by this negative result, we proceed to prove a generalization upper bound that explicitly

characterizes a sufficient condition for the success of domain adaptation. We then complement the upper bound by showing an information-theoretic lower bound on the joint error of any domain adaptation approach based on learning invariant representations.

4.4.1 Invariant Representation and Small Source Risk are Not Sufficient

In this section, we construct a simple 1-dimensional example where there exists a function $h^* : \mathbb{R} \mapsto \{0, 1\}$ that achieves zero error on both source and target domains. Simultaneously, we show that there exists a transformation function $g : \mathbb{R} \mapsto \mathbb{R}$ under which the induced source and target distributions are perfectly aligned, but every hypothesis $h : \mathbb{R} \mapsto \{0, 1\}$ incurs a large joint error on the induced source and target domains. The latter further implies that if we find a hypothesis that achieves small error on the source domain, then it has to incur a large error on the target domain. We illustrate this example in Fig. 4.1.1.

Let $\mathcal{X} = \mathcal{Z} = \mathbb{R}$ and $\mathcal{Y} = \{0, 1\}$. For $a \le b$, we use $U(a, b)$ to denote the uniform distribution over $[a, b]$. Consider the following source and target domains:
$$\mathcal{D}_S = U(-1, 0), \quad f_S(x) = \begin{cases} 0, & x \le -1/2 \\ 1, & x > -1/2 \end{cases}, \qquad \mathcal{D}_T = U(1, 2), \quad f_T(x) = \begin{cases} 0, & x \ge 3/2 \\ 1, & x < 3/2 \end{cases}$$

In the above example, it is easy to verify that the interval hypothesis $h^*(x) = 1$ iff $x \in (-1/2, 3/2)$ achieves perfect classification on both domains. Now consider the following transformation:

$$g(x) = \mathbb{I}_{x \le 0}(x) \cdot (x + 1) + \mathbb{I}_{x > 0}(x) \cdot (x - 1).$$
Since $g(\cdot)$ is a piecewise linear function, it follows that $\mathcal{D}_S^Z = \mathcal{D}_T^Z = U(0, 1)$, and for any distance metric $d(\cdot, \cdot)$ over distributions, we have $d(\mathcal{D}_S^Z, \mathcal{D}_T^Z) = 0$. But now, for any hypothesis $h : \mathbb{R} \mapsto \{0, 1\}$ and $\forall x \in [0, 1]$, $h(x)$ will make an error in exactly one of the domains, hence
$$\forall h : \mathbb{R} \mapsto \{0, 1\}, \quad \varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) = 1.$$
In other words, under the above invariant transformation $g$, the smaller the source error, the larger the target error.

One may argue that this example seems to contradict the generalization upper bound from Theorem 4.2.1, where the first two terms correspond exactly to a small source error and an invariant representation. The key to explaining this apparent contradiction lies in the third term of the upper bound, $\lambda^*$, i.e., the optimal joint error achievable on both domains. In our example, when there is no transformation applied to the input space, $h^*$ achieves 0 error on both domains, hence $\lambda^* = \min_{h \in \mathcal{H}} \varepsilon_S(h) + \varepsilon_T(h) = 0$. However, when the transformation $g$ is applied to the original input space, every hypothesis has joint error 1 on the representation space, hence $\lambda^*_g = 1$. Since we usually do not have access to the optimal hypothesis on both domains, although the generalization bound still holds on the representation space, it becomes vacuous in our example.

An alternative way to interpret the failure of the constructed example is that the labeling functions (or conditional distributions in the stochastic setting) of the source and target domains are far away from each other in the representation space. Specifically, in the induced representation space, the optimal labeling functions on the source and target domains are:
$$f'_S(x) = \begin{cases} 0, & x \le 1/2 \\ 1, & x > 1/2 \end{cases}, \qquad f'_T(x) = \begin{cases} 0, & x > 1/2 \\ 1, & x \le 1/2 \end{cases},$$
and we have $\|f'_S - f'_T\|_1 = \mathbb{E}_{x \sim U(0,1)}[|f'_S(x) - f'_T(x)|] = 1$.
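The counterexample is simple enough to verify numerically. A sketch that samples the two domains, applies $g$, and sweeps threshold hypotheses on the representation space (thresholds stand in for the full hypothesis class, which the argument above covers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
xs = rng.uniform(-1, 0, size=n)            # source ~ U(-1, 0)
xt = rng.uniform(1, 2, size=n)             # target ~ U(1, 2)
ys = (xs > -0.5).astype(int)               # labeling function f_S
yt = (xt < 1.5).astype(int)                # labeling function f_T

g = lambda x: np.where(x <= 0, x + 1, x - 1)   # the invariant transformation
zs, zt = g(xs), g(xt)                          # both ~ U(0, 1): aligned

# Any threshold hypothesis h(z) = 1[z > a] has source + target error = 1.
for a in [0.1, 0.3, 0.5, 0.7, 0.9]:
    err_s = np.mean((zs > a).astype(int) != ys)
    err_t = np.mean((zt > a).astype(int) != yt)
    print(f"a={a}: eps_S + eps_T = {err_s + err_t:.3f}")   # ~1.000 each
```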

4.4.2 A Generalization Upper Bound

For most practical hypothesis spaces $\mathcal{H}$, e.g., half spaces, it is usually intractable to compute the optimal joint error $\lambda^*$ from Theorem 4.2.1. Furthermore, the fact that $\lambda^*$ contains errors from both domains makes the bound very conservative and loose in many cases. In this section, inspired by the example constructed in Sec. 4.4.1, we aim to provide a general, intuitive, and interpretable generalization upper bound for domain adaptation that is free of the pessimistic $\lambda^*$ term. Ideally, the bound should also explicitly characterize how the shift between the labeling functions of the two domains affects domain adaptation.

Because of its flexibility in choosing the witness function class $\mathcal{H}$ and its natural interpretation as adversarial binary classification, we still adopt the $\mathcal{H}$-divergence to measure the discrepancy between two distributions. For any hypothesis space $\mathcal{H}$, it can be readily verified that $d_{\mathcal{H}}(\cdot, \cdot)$ satisfies the triangle inequality:
$$d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}') \le d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}'') + d_{\mathcal{H}}(\mathcal{D}'', \mathcal{D}'),$$
where $\mathcal{D}, \mathcal{D}', \mathcal{D}''$ are any distributions over the same space. We now introduce a technical lemma that will be helpful in proving results related to the $\mathcal{H}$-divergence:

Lemma 4.4.1. Let $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and let $\mathcal{D}, \mathcal{D}'$ be two distributions over $\mathcal{X}$. Then $\forall h, h' \in \mathcal{H}$, $|\varepsilon_{\mathcal{D}}(h, h') - \varepsilon_{\mathcal{D}'}(h, h')| \le d_{\tilde{\mathcal{H}}}(\mathcal{D}, \mathcal{D}')$, where $\tilde{\mathcal{H}} := \{ \operatorname{sgn}(|h(x) - h'(x)| - t) \mid h, h' \in \mathcal{H}, 0 \le t \le 1 \}$.

As a matter of fact, the above lemma also holds for any function class $\mathcal{H}$ (not necessarily a hypothesis space) for which there exists a constant $M > 0$ such that $\|h\|_\infty \le M$ for all $h \in \mathcal{H}$. Another useful lemma is the following triangle inequality:

Lemma 4.4.2. Let $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and let $\mathcal{D}$ be any distribution over $\mathcal{X}$. For any $h, h', h'' \in \mathcal{H}$, we have $\varepsilon_{\mathcal{D}}(h, h') \le \varepsilon_{\mathcal{D}}(h, h'') + \varepsilon_{\mathcal{D}}(h'', h')$.

Let $f_S : \mathcal{X} \to [0, 1]$ and $f_T : \mathcal{X} \to [0, 1]$ be the optimal labeling functions on the source and target domains, respectively. In the stochastic setting, $f_S(x) = \Pr_S(y = 1 \mid x)$ corresponds to the optimal Bayes classifier. With these notations, the following theorem holds:

Theorem 4.4.1. Let $\langle \mathcal{D}_S, f_S \rangle$ and $\langle \mathcal{D}_T, f_T \rangle$ be the source and target domains, respectively. For any function class $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and $\forall h \in \mathcal{H}$, the following inequality holds:
$$\varepsilon_T(h) \le \varepsilon_S(h) + d_{\tilde{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T) + \min\{ \mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|], \; \mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|] \}.$$

Remark The three terms in the upper bound have natural interpretations: the first term is the source error, the second one corresponds to the discrepancy between the marginal distributions, and the third measures the distance between the labeling functions from the source and target domains. Altogether, they form a sufficient condition for the success of domain adaptation: besides a small source error, not only do the marginal distributions need to be close, but so do the labeling functions.

Comparison with Theorem 4.2.1 It is instructive to compare the bound in Theorem 4.4.1 with the one in Theorem 4.2.1. The main difference lies in the $\lambda^*$ term of Theorem 4.2.1 versus the term $\min\{ \mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|], \mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|] \}$ in Theorem 4.4.1. $\lambda^*$ depends on the choice of the hypothesis class $\mathcal{H}$, while our term does not. In fact, our quantity reflects the underlying structure of the problem, i.e., the conditional shift. Finally, consider the example given in the left panel of Fig. 4.1.1. It is easy to verify that $\min\{ \mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|], \mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|] \} = 1/2$ in this case, while for a natural class of hypotheses, i.e., $\mathcal{H} := \{ h(x) = 0 \Leftrightarrow a \le x \le b \mid a < b \}$, we have $\lambda^* = 1$. In that case, our bound is tighter than the one in Theorem 4.2.1.

In the covariate shift setting, where we assume the conditional distributions of $Y \mid X$ are the same between the source and target domains, the third term in the upper bound vanishes. In that case the above theorem says that to guarantee successful domain adaptation, it suffices to match the marginal distributions while achieving small error on the source domain. In general settings where the optimal labeling functions of the source and target domains differ, the above bound says that it is not sufficient to simply match the marginal distributions and achieve small error on the source domain; at the same time, we should also guarantee that the optimal labeling functions (or the conditional distributions of the two domains) are not too far away from each other. As a side note, it is easy to see that $\mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|] = \varepsilon_S(f_T)$ and $\mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|] = \varepsilon_T(f_S)$; in other words, they are essentially the cross-domain errors. When the cross-domain error is small, it implies that the optimal source (resp. target) labeling function generalizes well on the target (resp. source) domain.

Both the error term $\varepsilon_S(h)$ and the divergence $d_{\tilde{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T)$ in Theorem 4.4.1 are with respect to the true underlying distributions $\mathcal{D}_S$ and $\mathcal{D}_T$, which are not available to us during training. In the following, we shall use the Rademacher complexity to provide data-dependent bounds for both terms from empirical samples from $\mathcal{D}_S$ and $\mathcal{D}_T$.

Definition 4.4.1 (Empirical Rademacher Complexity). Let $\mathcal{H}$ be a family of functions mapping from $\mathcal{X}$ to $[a, b]$ and let $S = \{x_i\}_{i=1}^n$ be a fixed sample of size $n$ with elements in $\mathcal{X}$. Then the empirical Rademacher complexity of $\mathcal{H}$ with respect to the sample $S$ is defined as
$$\mathrm{Rad}_S(\mathcal{H}) := \mathbb{E}_\sigma\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i) \right],$$
where $\sigma = \{\sigma_i\}_{i=1}^n$ and the $\sigma_i$ are i.i.d. uniform random variables taking values in $\{+1, -1\}$.

With the empirical Rademacher complexity, we can show that w.h.p. the empirical source error $\widehat{\varepsilon}_S(h)$ cannot be too far away from the population error $\varepsilon_S(h)$ for all $h \in \mathcal{H}$:

Lemma 4.4.3. Let $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$. Then for all $\delta > 0$, w.p. at least $1 - \delta$, the following inequality holds for all $h \in \mathcal{H}$: $\varepsilon_S(h) \le \widehat{\varepsilon}_S(h) + 2\mathrm{Rad}_S(\mathcal{H}) + 3\sqrt{\log(2/\delta)/2n}$.

Similarly, for any distribution $\mathcal{D}$ over $\mathcal{X}$, let $\widehat{\mathcal{D}}$ be its empirical distribution from a sample $S \sim \mathcal{D}^n$ of size $n$. Then for any two distributions $\mathcal{D}$ and $\mathcal{D}'$, we can also use the empirical Rademacher complexity to provide a data-dependent bound for the perturbation between $d_{\tilde{\mathcal{H}}}(\mathcal{D}, \mathcal{D}')$ and $d_{\tilde{\mathcal{H}}}(\widehat{\mathcal{D}}, \widehat{\mathcal{D}}')$:

Lemma 4.4.4. Let $\tilde{\mathcal{H}}$, $\mathcal{D}$, and $\widehat{\mathcal{D}}$ be defined above. Then for all $\delta > 0$, w.p. at least $1 - \delta$, the following inequality holds for all $h \in \tilde{\mathcal{H}}$: $\mathbb{E}_{\mathcal{D}}[\mathbb{I}_h] \le \mathbb{E}_{\widehat{\mathcal{D}}}[\mathbb{I}_h] + 2\mathrm{Rad}_S(\tilde{\mathcal{H}}) + 3\sqrt{\log(2/\delta)/2n}$.
Since $\tilde{\mathcal{H}}$ is a hypothesis class, by definition we have:
$$d_{\tilde{\mathcal{H}}}(\mathcal{D}, \widehat{\mathcal{D}}) = \sup_{A \in \mathcal{A}_{\tilde{\mathcal{H}}}} \left| \Pr_{\mathcal{D}}(A) - \Pr_{\widehat{\mathcal{D}}}(A) \right| = \sup_{h \in \tilde{\mathcal{H}}} \left| \mathbb{E}_{\mathcal{D}}[\mathbb{I}_h] - \mathbb{E}_{\widehat{\mathcal{D}}}[\mathbb{I}_h] \right|.$$
Hence, combining the above identity with Lemma 4.4.4, we immediately have, w.p. at least $1 - \delta$:
$$d_{\tilde{\mathcal{H}}}(\mathcal{D}, \widehat{\mathcal{D}}) \le 2\mathrm{Rad}_S(\tilde{\mathcal{H}}) + 3\sqrt{\log(2/\delta)/2n}. \tag{4.2}$$
Now, using a union bound and the fact that $d_{\tilde{\mathcal{H}}}(\cdot, \cdot)$ satisfies the triangle inequality, we have:

Lemma 4.4.5. Let $\tilde{\mathcal{H}}$, $\mathcal{D}$, $\mathcal{D}'$, and $\widehat{\mathcal{D}}$, $\widehat{\mathcal{D}}'$ be defined above. Then for $\forall \delta > 0$, w.p. at least $1 - \delta$:
$$d_{\tilde{\mathcal{H}}}(\mathcal{D}, \mathcal{D}') \le d_{\tilde{\mathcal{H}}}(\widehat{\mathcal{D}}, \widehat{\mathcal{D}}') + 4\mathrm{Rad}_S(\tilde{\mathcal{H}}) + 6\sqrt{\log(4/\delta)/2n}.$$
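When $\mathcal{H}$ is finite, the supremum in Definition 4.4.1 is a maximum, and $\mathrm{Rad}_S(\mathcal{H})$ can be approximated by Monte Carlo over the Rademacher variables. A sketch with an illustrative random hypothesis class:

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    `preds` is a (|H|, n) array whose rows are the values h(x_1..x_n) of
    each hypothesis on the fixed sample; the sup over H becomes a max
    over rows, and the expectation over sigma is approximated by draws.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    n = preds.shape[1]
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(preds @ sigma) / n
    return total / n_draws

# Illustration: 50 random binary hypotheses evaluated on 200 sample points.
rng = np.random.default_rng(1)
preds = rng.integers(0, 2, size=(50, 200)).astype(float)
print(empirical_rademacher(preds))
```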

Combining Lemma 4.4.3, Lemma 4.4.5, and Theorem 4.4.1 with a union bound argument, we get the following main theorem that characterizes an upper bound for domain adaptation:

Theorem 4.4.2. Let $\langle \mathcal{D}_S, f_S \rangle$ and $\langle \mathcal{D}_T, f_T \rangle$ be the source and target domains, and let $\widehat{\mathcal{D}}_S$, $\widehat{\mathcal{D}}_T$ be the empirical source and target distributions constructed from samples $S = \{S_S, S_T\}$, each of size $n$. Then for any $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and $\forall h \in \mathcal{H}$:
$$\begin{aligned}
\varepsilon_T(h) \le \; & \widehat{\varepsilon}_S(h) + d_{\tilde{\mathcal{H}}}(\widehat{\mathcal{D}}_S, \widehat{\mathcal{D}}_T) + 2\mathrm{Rad}_S(\mathcal{H}) + 4\mathrm{Rad}_S(\tilde{\mathcal{H}}) \\
& + \min\{ \mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|], \; \mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|] \} + O\left( \sqrt{\log(1/\delta)/n} \right),
\end{aligned}$$
where $\tilde{\mathcal{H}} := \{ \operatorname{sgn}(|h(x) - h'(x)| - t) \mid h, h' \in \mathcal{H}, t \in [0, 1] \}$.

Essentially, the generalization upper bound can be decomposed into three parts: the first part comes from the domain adaptation setting, including the empirical source error, the empirical $\tilde{\mathcal{H}}$-divergence, and the shift between labeling functions. The second part corresponds to complexity measures of our hypothesis spaces $\mathcal{H}$ and $\tilde{\mathcal{H}}$, and the last part describes the error caused by finite samples.

4.4.3 An Information-Theoretic Lower Bound

In Sec. 4.4.1, we constructed an example to demonstrate that learning invariant representations could lead to a feature space where the joint error on both domains is large. In this section, we extend the example by showing that a similar result holds in more general settings. Specifically, we shall prove that for any approach based on learning invariant representations, there is an intrinsic lower bound on the joint error of the source and target domains, due to the discrepancy between their marginal label distributions. Our result hence highlights the need to take task-related information into account when designing domain adaptation algorithms based on learning invariant representations.

Before we proceed to the lower bound, we first define several information-theoretic concepts that will be used in the analysis. For two distributions $\mathcal{D}$ and $\mathcal{D}'$, the Jensen-Shannon (JS) divergence $D_{\mathrm{JS}}(\mathcal{D} \,\|\, \mathcal{D}')$ is defined as:
$$D_{\mathrm{JS}}(\mathcal{D} \,\|\, \mathcal{D}') := \frac{1}{2} D_{\mathrm{KL}}(\mathcal{D} \,\|\, \mathcal{D}_M) + \frac{1}{2} D_{\mathrm{KL}}(\mathcal{D}' \,\|\, \mathcal{D}_M),$$
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler (KL) divergence and $\mathcal{D}_M := (\mathcal{D} + \mathcal{D}')/2$. The JS divergence can be viewed as a symmetrized and smoothed version of the KL divergence, and it is closely related to the $L_1$ distance between two distributions through Lin's lemma (Lin, 1991). Unlike the KL divergence, the JS divergence is bounded: $0 \le D_{\mathrm{JS}}(\mathcal{D} \,\|\, \mathcal{D}') \le 1$. Additionally, from the JS divergence we can define a distance metric between two distributions as well, known as the JS distance (Endres and Schindelin, 2003):
$$d_{\mathrm{JS}}(\mathcal{D}, \mathcal{D}') := \sqrt{D_{\mathrm{JS}}(\mathcal{D} \,\|\, \mathcal{D}')}.$$
With respect to the JS distance, for any (stochastic) mapping $h : \mathcal{Z} \mapsto \mathcal{Y}$, we can prove the following lemma via the celebrated data processing inequality:

Lemma 4.4.6. Let $\mathcal{D}_S^Z$ and $\mathcal{D}_T^Z$ be two distributions over $\mathcal{Z}$ and let $\mathcal{D}_S^Y$ and $\mathcal{D}_T^Y$ be the induced distributions over $\mathcal{Y}$ by function $h : \mathcal{Z} \to \mathcal{Y}$, then
$$d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \le d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z). \quad (4.3)$$
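Lemma 4.4.6 is easy to check numerically for discrete distributions. The sketch below (our own toy example, using base-2 logarithms so that $0 \le D_{\mathrm{JS}} \le 1$) computes the JS distance before and after pushing two distributions over $\mathcal{Z}$ through a labeling function $h$:

import numpy as np

def kl(a, b):
    mask = a > 0
    return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

def js_distance(p, q):
    m = (p + q) / 2
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Two distributions over Z = {0, 1, 2, 3} and a labeling function h: Z -> {0, 1}.
pS = np.array([0.4, 0.1, 0.3, 0.2])
pT = np.array([0.1, 0.4, 0.2, 0.3])
h = np.array([0, 0, 1, 1])
pS_Y = np.array([pS[h == y].sum() for y in (0, 1)])   # induced label distribution
pT_Y = np.array([pT[h == y].sum() for y in (0, 1)])
assert js_distance(pS_Y, pT_Y) <= js_distance(pS, pT) + 1e-12   # Lemma 4.4.6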

For methods that aim to learn invariant representations for domain adaptation, an intermediate representation space $\mathcal{Z}$ is found through a feature transformation $g$, based on which a common hypothesis $h : \mathcal{Z} \to \mathcal{Y}$ is shared between both domains (Ganin et al., 2016; Tzeng et al., 2017; Zhao et al., 2018b). Through this process, the following Markov chain holds:

$$X \xrightarrow{g} Z \xrightarrow{h} \hat{Y}, \quad (4.4)$$
where $\hat{Y} = h(g(X))$ is the predicted random variable of interest. Hence for any distribution $\mathcal{D}$ over $\mathcal{X}$, this Markov chain also induces a distribution $\mathcal{D}^Z$ over $\mathcal{Z}$ and $\mathcal{D}^{\hat{Y}}$ over $\mathcal{Y}$. By Lemma 4.4.6, we know that $d_{\mathrm{JS}}(\mathcal{D}_S^{\hat{Y}}, \mathcal{D}_T^{\hat{Y}}) \le d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)$. With these notations, noting that the JS distance is a metric, the following inequality holds:

$$d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \le d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}}) + d_{\mathrm{JS}}(\mathcal{D}_S^{\hat{Y}}, \mathcal{D}_T^{\hat{Y}}) + d_{\mathrm{JS}}(\mathcal{D}_T^{\hat{Y}}, \mathcal{D}_T^Y).$$
Combining the above inequality with Lemma 4.4.6, we immediately have:

$$d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \le d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) + d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}}) + d_{\mathrm{JS}}(\mathcal{D}_T^Y, \mathcal{D}_T^{\hat{Y}}). \quad (4.5)$$
Intuitively, $d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}})$ and $d_{\mathrm{JS}}(\mathcal{D}_T^Y, \mathcal{D}_T^{\hat{Y}})$ measure the distance between the predicted label distribution and the ground truth label distribution on the source and target domain, respectively. Formally, the following result establishes a relationship between $d_{\mathrm{JS}}(\mathcal{D}^Y, \mathcal{D}^{\hat{Y}})$ and the accuracy of the prediction function $h$:

Lemma 4.4.7. Let $Y = f(X) \in \{0, 1\}$ where $f(\cdot)$ is the labeling function and $\hat{Y} = h(g(X)) \in \{0, 1\}$ be the prediction function, then $d_{\mathrm{JS}}(\mathcal{D}^Y, \mathcal{D}^{\hat{Y}}) \le \sqrt{\varepsilon(h \circ g)}$.

We are now ready to present the key lemma of the section:

Lemma 4.4.8. Suppose the Markov chain $X \xrightarrow{g} Z \xrightarrow{h} \hat{Y}$ holds, then
$$d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \le d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) + \sqrt{\varepsilon_S(h \circ g)} + \sqrt{\varepsilon_T(h \circ g)}.$$

Remark This lemma shows that if the marginal label distributions are significantly different between the source and target domains, then in order to achieve a small joint error, the induced distributions over $\mathcal{Z}$ from the source and target domains have to be significantly different as well. Put another way, if we are able to find an invariant representation such that $d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) = 0$, then the joint error of the composition function $h \circ g$ has to be large:

Theorem 4.4.3. Suppose the condition in Lemma 4.4.8 holds and $d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \ge d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)$, then:
$$\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) \ge \frac{1}{2}\left(d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) - d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)\right)^2.$$

Remark The lower bound gives us a necessary condition for the success of any domain adaptation approach based on learning invariant representations: if the marginal label distributions are significantly different between source and target domains, then minimizing $d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)$ and the source error $\varepsilon_S(h \circ g)$ will only increase the target error. In fact, Theorem 4.4.3 can be extended to hold in the setting where different transformation functions are applied in the source and target domains:

Corollary 4.4.1. Let $g_S$, $g_T$ be the source and target transformation functions from $\mathcal{X}$ to $\mathcal{Z}$. Suppose the condition in Lemma 4.4.8 holds and $d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \ge d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)$, then:
$$\varepsilon_S(h \circ g_S) + \varepsilon_T(h \circ g_T) \ge \frac{1}{2}\left(d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) - d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)\right)^2.$$

Recent work has also explored using different transformation functions to achieve invariant representations (Bousmalis et al., 2016; Tzeng et al., 2017), but Corollary 4.4.1 shows that this is not going to help if the marginal label distributions differ between the two domains.

We conclude this section by noting that our bound on the joint error of both domains is not necessarily the tightest one. This can be seen from the example in Sec. 4.4.1, where $d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) = d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) = 0$ and $\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) = 1$, but in this case our result gives a trivial lower bound of 0. Nevertheless, our result still sheds new light on the importance of matching marginal label distributions in learning invariant representations for domain adaptation, which we believe to be a promising direction for the design of better adaptation algorithms.
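To get a sense of scale for Theorem 4.4.3, the following sketch (our own toy computation, not from the thesis) evaluates the lower bound for a balanced source and a skewed target under a perfectly invariant representation, i.e., $d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) = 0$:

import numpy as np

def js_div(p, q):
    m = (p + q) / 2
    kl = lambda a, b: sum(x * np.log2(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pY_S = np.array([0.5, 0.5])               # balanced source label distribution
pY_T = np.array([0.9, 0.1])               # skewed target label distribution
d_js_Y = np.sqrt(js_div(pY_S, pY_T))
lower_bound = 0.5 * (d_js_Y - 0.0) ** 2   # Theorem 4.4.3 with d_JS(D_S^Z, D_T^Z) = 0
print(f"eps_S + eps_T >= {lower_bound:.3f}")   # any invariant method pays at least this much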

4.5 Experiments

Our theoretical results on the lower bound of the joint error imply that over-training the feature transformation function and the discriminator may hurt generalization on the target domain. In this section, we conduct experiments on real-world datasets to verify our theoretical findings. The task is digit classification on three datasets of 10 classes: MNIST, USPS and SVHN. MNIST contains 60,000/10,000 train/test instances; USPS contains 7,291/2,007 train/test instances, and SVHN contains 73,257/26,032 train/test instances. We show the label distributions of these three datasets in Fig. 4.5.1.

Figure 4.5.1: The label distributions of MNIST, USPS and SVHN.

Before training, we preprocess all the samples into gray-scale single-channel images of size 16×16, so they can be used by the same network. In our experiments, to ensure a fair comparison, we use the same network structure for all the experiments: 2 convolutional layers, one fully connected hidden layer, followed by a softmax output layer with 10 units. The convolution kernels in both layers are of size 5×5, with 10 and 20 channels, respectively. The hidden layer has 1280 units connected to 100 units before classification. For domain adaptation, we use the original DANN (Ganin et al., 2016) with its gradient reversal implementation. The discriminator in DANN takes the output of the convolutional layers as its feature input, followed by a 500×100 fully connected layer and a one-unit binary classification output.

Figure 4.5.2: Digit classification on MNIST, USPS and SVHN. Panels: (a) USPS → MNIST, (b) USPS → SVHN, (c) SVHN → MNIST, (d) SVHN → USPS. The horizontal solid line corresponds to the target domain test accuracy without adaptation. The green solid line is the target domain test accuracy under domain adaptation with DANN. We also plot the least squares fit (dashed line) of the DANN adaptation results to emphasize the negative slope.

We plot four adaptation trajectories in Fig. 4.5.2. Among the four adaptation tasks, we observe two phases in the adaptation accuracy. In the first phase, the test set accuracy grows rapidly, within fewer than 10 iterations. In the second phase, it gradually decreases after reaching its peak, despite the fact that the source training accuracy keeps increasing smoothly. These phase transitions can be verified from the negative slopes of the least squares fits of the adaptation curves (dashed lines in Fig. 4.5.2). We observe similar phenomena in additional experiments using artificially unbalanced datasets trained on more powerful networks as well.

The above experimental results imply that over-training the feature transformation and discriminator does not help generalization on the target domain, but can instead hurt it when the label distributions differ (as shown in Fig. 4.5.1). These experimental results are consistent with our theoretical findings.
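For reference, a minimal PyTorch sketch of the network just described; the layer sizes follow the text, while the activations and the absence of pooling are our assumptions (chosen so that the flattened feature size matches the stated 1280 units):

import torch
import torch.nn as nn

class DigitsNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5), nn.ReLU(),   # 16x16 -> 12x12
            nn.Conv2d(10, 20, kernel_size=5), nn.ReLU(),  # 12x12 -> 8x8; 8*8*20 = 1280
            nn.Flatten(),
            nn.Linear(1280, 100), nn.ReLU(),              # hidden layer before classification
        )
        self.classifier = nn.Linear(100, num_classes)      # softmax is applied in the loss

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(4, 1, 16, 16)      # a batch of gray-scale 16x16 digits
print(DigitsNet()(x).shape)         # torch.Size([4, 10])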

4.6 Proofs

In this section we provide all the missing proofs in Sec. 4.4. Again, we will first restate the corresponding statements and then provide proofs for each of them.

Lemma 4.4.1. Let $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and $\mathcal{D}, \mathcal{D}'$ be two distributions over $\mathcal{X}$. Then $\forall h, h' \in \mathcal{H}$, $|\varepsilon_{\mathcal{D}}(h, h') - \varepsilon_{\mathcal{D}'}(h, h')| \le d_{\tilde{\mathcal{H}}}(\mathcal{D}, \mathcal{D}')$, where $\tilde{\mathcal{H}} := \{\mathrm{sgn}(|h(x) - h'(x)| - t) \mid h, h' \in \mathcal{H},\ 0 \le t \le 1\}$.

Proof. By definition, for $\forall h, h' \in \mathcal{H}$, we have:
$$|\varepsilon_S(h, h') - \varepsilon_T(h, h')| \le \sup_{h, h' \in \mathcal{H}} |\varepsilon_S(h, h') - \varepsilon_T(h, h')| = \sup_{h, h' \in \mathcal{H}} \left|\mathbb{E}_{x \sim \mathcal{D}_S}[|h(x) - h'(x)|] - \mathbb{E}_{x \sim \mathcal{D}_T}[|h(x) - h'(x)|]\right| \quad (4.6)$$

Since $\|h\|_\infty \le 1$, $\forall h \in \mathcal{H}$, we have $0 \le |h(x) - h'(x)| \le 1$, $\forall x \in \mathcal{X}$, $\forall h, h' \in \mathcal{H}$. We now use Fubini's theorem to bound $\left|\mathbb{E}_{x \sim \mathcal{D}_S}[|h(x) - h'(x)|] - \mathbb{E}_{x \sim \mathcal{D}_T}[|h(x) - h'(x)|]\right|$:

$$\left|\mathbb{E}_{x \sim \mathcal{D}_S}[|h(x) - h'(x)|] - \mathbb{E}_{x \sim \mathcal{D}_T}[|h(x) - h'(x)|]\right| = \left|\int_0^1 \left(\Pr_{\mathcal{D}_S}(|h(x) - h'(x)| > t) - \Pr_{\mathcal{D}_T}(|h(x) - h'(x)| > t)\right) dt\right|$$
$$\le \int_0^1 \left|\Pr_{\mathcal{D}_S}(|h(x) - h'(x)| > t) - \Pr_{\mathcal{D}_T}(|h(x) - h'(x)| > t)\right| dt$$

$$\le \sup_{t \in [0,1]} \left|\Pr_{\mathcal{D}_S}(|h(x) - h'(x)| > t) - \Pr_{\mathcal{D}_T}(|h(x) - h'(x)| > t)\right|.$$
Now, in view of (4.6) and the definition of $\tilde{\mathcal{H}}$, we have:

$$\sup_{h, h' \in \mathcal{H}}\ \sup_{t \in [0,1]} \left|\Pr_{\mathcal{D}_S}(|h(x) - h'(x)| > t) - \Pr_{\mathcal{D}_T}(|h(x) - h'(x)| > t)\right|$$
$$= \sup_{\tilde{h} \in \tilde{\mathcal{H}}} \left|\Pr_{\mathcal{D}_S}(\tilde{h}(x) = 1) - \Pr_{\mathcal{D}_T}(\tilde{h}(x) = 1)\right| = \sup_{A \in \mathcal{A}_{\tilde{\mathcal{H}}}} \left|\Pr_{\mathcal{D}_S}(A) - \Pr_{\mathcal{D}_T}(A)\right| = d_{\tilde{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T).$$
Combining all the inequalities above finishes the proof. □

Lemma 4.4.2. Let $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and $\mathcal{D}$ be any distribution over $\mathcal{X}$. For any $h, h', h'' \in \mathcal{H}$, we have $\varepsilon_{\mathcal{D}}(h, h') \le \varepsilon_{\mathcal{D}}(h, h'') + \varepsilon_{\mathcal{D}}(h'', h')$.

Proof.

$$\varepsilon_{\mathcal{D}}(h, h') = \mathbb{E}_{x \sim \mathcal{D}}[|h(x) - h'(x)|] = \mathbb{E}_{x \sim \mathcal{D}}[|h(x) - h''(x) + h''(x) - h'(x)|]$$
$$\le \mathbb{E}_{x \sim \mathcal{D}}[|h(x) - h''(x)| + |h''(x) - h'(x)|] = \varepsilon_{\mathcal{D}}(h, h'') + \varepsilon_{\mathcal{D}}(h'', h'). \qquad \square$$

Theorem 4.4.1. Let $\langle \mathcal{D}_S, f_S \rangle$ and $\langle \mathcal{D}_T, f_T \rangle$ be the source and target domains, respectively. For any function class $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and $\forall h \in \mathcal{H}$, the following inequality holds:
$$\varepsilon_T(h) \le \varepsilon_S(h) + d_{\tilde{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T) + \min\{\mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|], \mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|]\}.$$

Proof. On one hand, with Lemma 4.4.1 and Lemma 4.4.2, we have $\forall h \in \mathcal{H}$:
$$\varepsilon_T(h) = \varepsilon_T(h, f_T) \le \varepsilon_S(h, f_T) + d_{\tilde{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T) \le \varepsilon_S(h) + \varepsilon_S(f_S, f_T) + d_{\tilde{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T).$$
On the other hand, by changing the order of the two triangle inequalities, we also have:
$$\varepsilon_T(h) = \varepsilon_T(h, f_T) \le \varepsilon_T(h, f_S) + \varepsilon_T(f_S, f_T) \le \varepsilon_S(h) + \varepsilon_T(f_S, f_T) + d_{\tilde{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T).$$
Realize that by definition $\varepsilon_S(f_S, f_T) = \mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|]$ and $\varepsilon_T(f_S, f_T) = \mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|]$. Combining the above two inequalities completes the proof. □

Lemma 4.4.3. Let $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$, then for all $\delta > 0$, w.p. at least $1 - \delta$, the following inequality holds for all $h \in \mathcal{H}$: $\varepsilon_S(h) \le \hat{\varepsilon}_S(h) + 2\mathrm{Rad}_S(\mathcal{H}) + 3\sqrt{\log(2/\delta)/2n}$.

Proof. Consider the source domain $\mathcal{D}_S$. For $\forall h \in \mathcal{H}$, define the loss function $\ell : \mathcal{X} \to [0, 1]$ as $\ell(x) := |h(x) - f_S(x)|$. First, we know that $\mathrm{Rad}_S(\mathcal{H} - f_S) = \mathrm{Rad}_S(\mathcal{H})$, where we slightly abuse the notation $\mathcal{H} - f_S$ to mean the family of functions $\{h - f_S \mid h \in \mathcal{H}\}$:
$$\mathrm{Rad}_S(\mathcal{H} - f_S) = \mathbb{E}_\sigma\left[\sup_{h' \in \mathcal{H} - f_S} \frac{1}{n}\sum_{i=1}^n \sigma_i h'(x_i)\right] = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i (h(x_i) - f_S(x_i))\right]$$
$$= \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)\right] + \mathbb{E}_\sigma\left[\frac{1}{n}\sum_{i=1}^n \sigma_i f_S(x_i)\right] = \mathrm{Rad}_S(\mathcal{H}),$$
where the last equality holds since $\mathbb{E}_\sigma[\sigma_i] = 0$. Observe that the function $\phi : t \mapsto |t|$ is 1-Lipschitz continuous; then by the Ledoux-Talagrand contraction lemma, we can conclude that
$$\mathrm{Rad}_S(\phi \circ (\mathcal{H} - f_S)) \le \mathrm{Rad}_S(\mathcal{H} - f_S) = \mathrm{Rad}_S(\mathcal{H}).$$
Using Lemma 2.6.1 with the above arguments and realizing that $\varepsilon_S(h) = \mathbb{E}_{x \sim \mathcal{D}_S}[|h(x) - f_S(x)|]$ finishes the proof. □

Lemma 4.4.4. Let $\tilde{\mathcal{H}}$, $\mathcal{D}$ and $\hat{\mathcal{D}}$ be defined above, then for all $\delta > 0$, w.p. at least $1 - \delta$, the following inequality holds for all $h \in \tilde{\mathcal{H}}$: $\mathbb{E}_{\mathcal{D}}[I_h] \le \mathbb{E}_{\hat{\mathcal{D}}}[I_h] + 2\mathrm{Rad}_S(\tilde{\mathcal{H}}) + 3\sqrt{\log(2/\delta)/2n}$.

Proof. Note that $I_h \in \{0, 1\}$; hence this lemma directly follows from Lemma 2.6.1. □

Lemma 4.4.5. Let $\tilde{\mathcal{H}}$, $\mathcal{D}$, $\mathcal{D}'$ and $\hat{\mathcal{D}}$, $\hat{\mathcal{D}}'$ be defined above, then for $\delta > 0$, w.p. at least $1 - \delta$, $\forall h \in \tilde{\mathcal{H}}$:
$$d_{\tilde{\mathcal{H}}}(\mathcal{D}, \mathcal{D}') \le d_{\tilde{\mathcal{H}}}(\hat{\mathcal{D}}, \hat{\mathcal{D}}') + 4\mathrm{Rad}_S(\tilde{\mathcal{H}}) + 6\sqrt{\log(4/\delta)/2n}.$$

Proof. By the triangle inequality of $d_{\tilde{\mathcal{H}}}(\cdot, \cdot)$, we have:
$$d_{\tilde{\mathcal{H}}}(\mathcal{D}, \mathcal{D}') \le d_{\tilde{\mathcal{H}}}(\mathcal{D}, \hat{\mathcal{D}}) + d_{\tilde{\mathcal{H}}}(\hat{\mathcal{D}}, \hat{\mathcal{D}}') + d_{\tilde{\mathcal{H}}}(\hat{\mathcal{D}}', \mathcal{D}').$$
Now, with Lemma 4.4.4, we know that with probability $\ge 1 - \delta/2$:
$$d_{\tilde{\mathcal{H}}}(\mathcal{D}, \hat{\mathcal{D}}) \le 2\mathrm{Rad}_S(\tilde{\mathcal{H}}) + 3\sqrt{\log(4/\delta)/2n}.$$
Similarly, with probability $\ge 1 - \delta/2$, the following inequality also holds:
$$d_{\tilde{\mathcal{H}}}(\mathcal{D}', \hat{\mathcal{D}}') \le 2\mathrm{Rad}_S(\tilde{\mathcal{H}}) + 3\sqrt{\log(4/\delta)/2n}.$$
A union bound combining the above two inequalities then finishes the proof. □

Theorem 4.4.2. Let $\langle \mathcal{D}_S, f_S \rangle$ and $\langle \mathcal{D}_T, f_T \rangle$ be the source and target domains, and let $\hat{\mathcal{D}}_S$, $\hat{\mathcal{D}}_T$ be the empirical source and target distributions constructed from sample $S = \{S_S, S_T\}$, each of size $n$. Then for any $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and $\forall h \in \mathcal{H}$:
$$\varepsilon_T(h) \le \hat{\varepsilon}_S(h) + d_{\tilde{\mathcal{H}}}(\hat{\mathcal{D}}_S, \hat{\mathcal{D}}_T) + 2\mathrm{Rad}_S(\mathcal{H}) + 4\mathrm{Rad}_S(\tilde{\mathcal{H}}) + \min\{\mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|], \mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|]\} + O\left(\sqrt{\log(1/\delta)/n}\right),$$
where $\tilde{\mathcal{H}} := \{\mathrm{sgn}(|h(x) - h'(x)| - t) \mid h, h' \in \mathcal{H}, t \in [0, 1]\}$.

Proof. By Theorem 4.4.1, the following inequality holds:

$$\varepsilon_T(h) \le \varepsilon_S(h) + d_{\tilde{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T) + \min\{\mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|], \mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|]\}.$$

To get probabilistic bounds for both $\varepsilon_S(h)$ and $d_{\tilde{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T)$, we apply Lemma 4.4.3 and Lemma 4.4.5, respectively. The final step, again, is to use a union bound to combine all the inequalities above, which completes the proof. □

Lemma 4.4.6. Let $\mathcal{D}_S^Z$ and $\mathcal{D}_T^Z$ be two distributions over $\mathcal{Z}$ and let $\mathcal{D}_S^Y$ and $\mathcal{D}_T^Y$ be the induced distributions over $\mathcal{Y}$ by function $h : \mathcal{Z} \to \mathcal{Y}$, then
$$d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \le d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z). \quad (4.3)$$

Proof. Let $B$ be a uniform random variable taking values in $\{0, 1\}$ and let the random variable $Y_B$ with distribution $\mathcal{D}_B^Y$ (resp. $Z_B$ with distribution $\mathcal{D}_B^Z$) be the mixture of $\mathcal{D}_S^Y$ and $\mathcal{D}_T^Y$ (resp. $\mathcal{D}_S^Z$ and $\mathcal{D}_T^Z$) according to $B$. We know that:

$$D_{\mathrm{JS}}(\mathcal{D}_S^Z \| \mathcal{D}_T^Z) = I(B; Z_B), \quad \text{and} \quad D_{\mathrm{JS}}(\mathcal{D}_S^Y \| \mathcal{D}_T^Y) = I(B; Y_B). \quad (4.7)$$
Since $\mathcal{D}_S^Y$ (resp. $\mathcal{D}_T^Y$) is induced by the function $h : \mathcal{Z} \to \mathcal{Y}$ from $\mathcal{D}_S^Z$ (resp. $\mathcal{D}_T^Z$), by linearity, $\mathcal{D}_B^Y$ is also induced by $h$ from $\mathcal{D}_B^Z$. Hence $Y_B = h(Z_B)$ and the following Markov chain holds:
$$B \to Z_B \to Y_B.$$
Applying the data processing inequality (Lemma 2.6.4), we have

$$D_{\mathrm{JS}}(\mathcal{D}_S^Z \| \mathcal{D}_T^Z) = I(B; Z_B) \ge I(B; Y_B) = D_{\mathrm{JS}}(\mathcal{D}_S^Y \| \mathcal{D}_T^Y).$$
Taking the square root on both sides of the above inequality completes the proof. □

Lemma 4.4.7. Let $Y = f(X) \in \{0, 1\}$ where $f(\cdot)$ is the labeling function and $\hat{Y} = h(g(X)) \in \{0, 1\}$ be the prediction function, then $d_{\mathrm{JS}}(\mathcal{D}^Y, \mathcal{D}^{\hat{Y}}) \le \sqrt{\varepsilon(h \circ g)}$.

Proof.
$$d_{\mathrm{JS}}(\mathcal{D}^Y, \mathcal{D}^{\hat{Y}}) = \sqrt{D_{\mathrm{JS}}(\mathcal{D}^Y \| \mathcal{D}^{\hat{Y}})} \le \sqrt{\|\mathcal{D}^Y - \mathcal{D}^{\hat{Y}}\|_1 / 2} \quad \text{(Lemma 2.6.3)}$$
$$= \sqrt{\left(|\Pr(Y = 0) - \Pr(\hat{Y} = 0)| + |\Pr(Y = 1) - \Pr(\hat{Y} = 1)|\right)/2} = \sqrt{|\Pr(Y = 1) - \Pr(\hat{Y} = 1)|}$$
$$= \sqrt{|\mathbb{E}_X[f(X)] - \mathbb{E}_X[h(g(X))]|} \le \sqrt{\mathbb{E}_X[|f(X) - h(g(X))|]} = \sqrt{\varepsilon(h \circ g)}. \qquad \square$$

Lemma 4.4.8. Suppose the Markov chain $X \xrightarrow{g} Z \xrightarrow{h} \hat{Y}$ holds, then
$$d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \le d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) + \sqrt{\varepsilon_S(h \circ g)} + \sqrt{\varepsilon_T(h \circ g)}.$$

Proof. Since $X \xrightarrow{g} Z \xrightarrow{h} \hat{Y}$ forms a Markov chain, by Lemma 4.4.6, the following inequality holds:
$$d_{\mathrm{JS}}(\mathcal{D}_S^{\hat{Y}}, \mathcal{D}_T^{\hat{Y}}) \le d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z).$$

On the other hand, since $d_{\mathrm{JS}}(\cdot, \cdot)$ is a distance metric, we also have:
$$d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \le d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}}) + d_{\mathrm{JS}}(\mathcal{D}_S^{\hat{Y}}, \mathcal{D}_T^{\hat{Y}}) + d_{\mathrm{JS}}(\mathcal{D}_T^{\hat{Y}}, \mathcal{D}_T^Y) \le d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}}) + d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) + d_{\mathrm{JS}}(\mathcal{D}_T^{\hat{Y}}, \mathcal{D}_T^Y).$$
Applying Lemma 4.4.7 to both $d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}})$ and $d_{\mathrm{JS}}(\mathcal{D}_T^{\hat{Y}}, \mathcal{D}_T^Y)$ then finishes the proof. □

Theorem 4.4.3. Suppose the condition in Lemma 4.4.8 holds and $d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \ge d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)$, then:
$$\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) \ge \frac{1}{2}\left(d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) - d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)\right)^2.$$

Proof. In view of the result in Lemma 4.4.8, applying the AM-GM inequality, we have:
$$\sqrt{\varepsilon_S(h \circ g)} + \sqrt{\varepsilon_T(h \circ g)} \le \sqrt{2\left(\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g)\right)}.$$
Now, since $d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \ge d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)$, simple algebra shows
$$\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) \ge \frac{1}{2}\left(d_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) - d_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)\right)^2. \qquad \square$$

4.7 Conclusion

In this chapter, we theoretically and empirically study the important problem of learning invariant representations for domain adaptation. We show that learning an invariant representation and achieving a small source error is not enough to guarantee target generalization. We then prove both upper and lower bounds for the target and joint errors, which directly translate to sufficient and necessary conditions for the success of adaptation. We believe our results take an important step towards understanding deep domain adaptation, and also stimulate future work on the design of stronger deep domain adaptation algorithms that align conditional distributions. Another interesting direction for future work is to characterize what properties the feature transformation function should have in order to decrease the conditional shift. It is also worth investigating under which conditions the label distributions can be aligned without explicit labeled data from the target domain.


Chapter 5

Domain Adaptation with Conditional Distribution Matching under Generalized Label Shift

In this chapter, we extend a recent upper bound on the performance of adversarial domain adaptation to multi-class classification and more general discriminators. We then propose generalized label shift (GLS) as a way to improve robustness against mismatched label distributions. GLS states that, conditioned on the label, there exists a representation of the input that is invariant between the source and target domains. Under GLS, we provide theoretical guarantees on the transfer performance of any classifier. We also devise necessary and sufficient conditions for GLS to hold. The conditions are based on the estimation of the relative class weights between domains and on an appropriate reweighting of samples. Guided by our theoretical insights, we modify three widely used algorithms, JAN, DANN and CDAN, and evaluate their performance on standard domain adaptation tasks, where our method outperforms the base versions. We also demonstrate significant gains on artificially created tasks with large divergences between their source and target label distributions.

5.1 Introduction

In spite of impressive successes, most deep learning models (Goodfellow et al., 2017) rely on huge amounts of labelled data, and their features have proven brittle to distribution shifts (McCoy et al., 2019; Yosinski et al., 2014). Building more robust models that learn from fewer samples and/or generalize better out-of-distribution is the focus of many recent works (Arjovsky et al., 2019; Bachman et al., 2019; Yaghoobzadeh et al., 2019). The research direction of interest to this chapter is domain adaptation, which aims at learning features that transfer well between domains. We focus in particular on the unsupervised domain adaptation setting (UDA), where the algorithm has access to labelled samples from a source domain and unlabelled data from a target domain. Its objective is to train a model that generalizes well to the target domain. Building on advances in adversarial learning (Goodfellow et al., 2014), adversarial domain adaptation (ADA) leverages a discriminator to learn an intermediate representation that is invariant between the source and target domains. Simultaneously, the representation is paired with a classifier, trained to perform well on the source domain (Ganin et al., 2016; Liu et al., 2019; Tzeng et al., 2017; Zhao et al., 2018b). ADA is rather successful on a variety of tasks; however, recent work has proven an upper bound on the performance of existing algorithms when source and target domains have mismatched label distributions (Zhao et al., 2019h). Label shift, or prior probability shift, is a property of two domains for which the marginal label distributions differ, but the conditional distributions of input given label stay the same across domains (Storkey, 2009; Zhang et al., 2015b).

In this chapter, we study domain adaptation under mismatched label distributions and design methods that are robust in that setting. Our contributions are the following. First, we extend the upper bound by Zhao et al. (2019h) to k-class classification and to conditional domain adversarial networks, a recently introduced domain adaptation algorithm (Long et al., 2018). Second, we introduce generalized label shift (GLS), a broader version of the standard label shift where conditional invariance between source and target domains is placed in representation rather than input space. Third, we derive performance guarantees for algorithms that seek to enforce GLS via learnt feature transformations, in the form of upper bounds on the error gap and the joint error of the classifier on the source and target domains. Those performance guarantees suggest principled modifications to ADA to improve its robustness to mismatched label distributions. The modifications rely on estimating the class ratios between source and target domains and using those as importance weights in the adversarial and classification objectives. The importance weights estimation is performed using a method from Lipton et al. (2018). Following the theoretical insights, we devise three new algorithms, based on DANNs (Ganin et al., 2016), JANs (Long et al., 2017) and CDANs (Long et al., 2018). We apply our variants to artificial UDA tasks with large divergences between label distributions, and demonstrate significant performance gains compared to the algorithms' base versions. Finally, we evaluate them on standard domain adaptation tasks (for which the divergence between label distributions is rather limited) and show improved performance as well.

5.2 Preliminaries

Notation and Setup In this chapter we focus on the general k-class classification problem. We use $\mathcal{X}$ and $\mathcal{Y}$ to denote the input and output space, respectively. Similarly, $\mathcal{Z}$ stands for the representation space induced from $\mathcal{X}$ by a feature transformation $g : \mathcal{X} \to \mathcal{Z}$. Accordingly, we use $X, Y, Z$ to denote random variables which take values in $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$. In this work, a domain corresponds to a distribution on the input space $\mathcal{X}$ and output space $\mathcal{Y}$, and we use $\mathcal{D}_S$ (resp. $\mathcal{D}_T$) to denote the source (resp. target) domain. Noticeably, this corresponds to a stochastic setting, which is stronger than the deterministic one studied in Ben-David et al. (2007, 2010); Zhao et al. (2019h). A hypothesis is a function $h : \mathcal{X} \to [k]$. The error

of a hypothesis $h$ under distribution $\mathcal{D}_S$ is defined as $\varepsilon_S(h) := \Pr_{\mathcal{D}_S}(h(X) \neq Y)$, i.e., the probability that $h$ disagrees with $Y$ under $\mathcal{D}_S$.

Domain Adaptation via Invariant Representations For source ($\mathcal{D}_S$) and target ($\mathcal{D}_T$) domains, we use $\mathcal{D}_S^X$, $\mathcal{D}_T^X$, $\mathcal{D}_S^Y$ and $\mathcal{D}_T^Y$ to denote the marginal data and label distributions. In UDA, the algorithm has access to $n$ labeled points $\{(x_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$ and $m$ unlabeled points $\{x_j\}_{j=1}^m \in \mathcal{X}^m$ sampled i.i.d. from the source and target domains. Inspired by Ben-David et al. (2010), a common approach is to learn representations invariant to the domain shift. Letting $g : \mathcal{X} \to \mathcal{Z}$ be a feature transformation and $h : \mathcal{Z} \to \mathcal{Y}$ a hypothesis on the feature space, the goal of domain invariant representations (Ganin et al., 2016; Tzeng et al., 2017; Zhao et al., 2018c) is to find a function $g$ that induces similar distributions on $\mathcal{D}_S$ and $\mathcal{D}_T$. Simultaneously, $g$ is required to preserve rich information about the target task so that $\varepsilon_S(h \circ g)$ is small. The above process results in the following Markov chain:

$$X \xrightarrow{g} Z \xrightarrow{h} \hat{Y}, \quad (5.1)$$
with $\hat{Y} = h(g(X))$. We let $\mathcal{D}_S^Z$, $\mathcal{D}_T^Z$, $\mathcal{D}_S^{\hat{Y}}$ and $\mathcal{D}_T^{\hat{Y}}$ denote the pushforwards of $\mathcal{D}_S$ and $\mathcal{D}_T$ by $g$ and $h \circ g$. Invariance in feature space is defined as minimizing a distance or divergence between the source and target feature distributions.

Table 5.2.1: Common assumptions in the domain adaptation literature.

Covariate Shift: $\mathcal{D}_S^X \neq \mathcal{D}_T^X$ and $\forall x \in \mathcal{X},\ \mathcal{D}_S(Y \mid X = x) = \mathcal{D}_T(Y \mid X = x)$
Label Shift: $\mathcal{D}_S^Y \neq \mathcal{D}_T^Y$ and $\forall y \in \mathcal{Y},\ \mathcal{D}_S(X \mid Y = y) = \mathcal{D}_T(X \mid Y = y)$

Adversarial Domain Adaptation Invariance is often attained by training a discriminator $d : \mathcal{Z} \to [0, 1]$ to predict whether a representation $z$ is from the source or target domain. $g$ is then trained both to maximize the discriminator loss and to minimize the classification loss of $h \circ g$ on the source domain ($h$ is also trained with the latter objective). This leads in particular to domain-adversarial neural networks (Ganin et al., 2016, DANN), where $g$, $h$ and $d$ are parameterized with neural networks: $g_\theta$, $h_\phi$ and $d_\psi$. $d_\psi$ outputs the probability to be from the source domain, while $h_\phi$ outputs the probability to belong to each class. The discriminator loss $\mathcal{L}_{DA}$ and classification loss $\mathcal{L}_C$ are simply cross-entropies. $d_\psi$, resp. $g_\theta$, are then trained to minimize, resp. maximize, $\mathcal{L}_{DA}$, while $h_\phi$ and $g_\theta$ minimize $\mathcal{L}_C$ (see Algo. 2 and Appendix 5.A.4 for details).

Building on DANN, conditional domain adversarial networks (Long et al., 2018, CDAN) use the same adversarial paradigm. However, the discriminator now takes as input the outer product, for a given $x$, between the predictions of the network $h(g(x))$ and its representation $g(x)$. In other words, $d$ acts on the outer product:

$$h \otimes g(x) := \left(h_1(g(x)) \cdot g(x),\ \ldots,\ h_k(g(x)) \cdot g(x)\right),$$
rather than on $g(x)$ (where $h_i$ denotes the $i$-th element of vector $h$).
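Concretely, the conditioning map amounts to one outer product per example; the sketch below is our own illustration (flattening the $k \times d$ matrix into the discriminator input is an implementation choice):

import torch

def conditional_features(preds, feats):
    # Outer product h(g(x)) ⊗ g(x): row y of the k x d matrix is scaled by the
    # predicted probability of class y; the result feeds the CDAN discriminator.
    outer = torch.einsum('bk,bd->bkd', preds, feats)   # (batch, k, d)
    return outer.flatten(start_dim=1)                  # (batch, k * d)

preds = torch.softmax(torch.randn(8, 10), dim=1)   # h(g(x)), k = 10 classes
feats = torch.randn(8, 64)                          # g(x), d = 64 features
print(conditional_features(preds, feats).shape)     # torch.Size([8, 640])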

We now highlight a limitation of DANNs and CDANs.

An Information-Theoretic Lower Bound Let $D_{\mathrm{JS}}$ denote the Jensen-Shannon divergence between two distributions (see Appendix 5.5.1 for details), and let $\tilde{Z}$ correspond to $Z$ (for DANN) or to $\hat{Y} \otimes Z$ (for CDAN). The following theorem gives a lower bound on the joint error of the classifier on the source and target domains:

Theorem 5.2.1. Suppose that the Markov chain in (5.1) holds, and that $D_{\mathrm{JS}}(\mathcal{D}_S^Y \| \mathcal{D}_T^Y) \ge D_{\mathrm{JS}}(\mathcal{D}_S^{\tilde{Z}} \| \mathcal{D}_T^{\tilde{Z}})$, then:
$$\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) \ge \frac{1}{2}\left(\sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^Y \| \mathcal{D}_T^Y)} - \sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^{\tilde{Z}} \| \mathcal{D}_T^{\tilde{Z}})}\right)^2.$$

Remark Remarkably, the above lower bound is algorithm-independent. It is also a population-level result and holds asymptotically with increasing data; large data does not help. Zhao et al. (2019h) prove the theorem for $k = 2$ and $\tilde{Z} = Z$, i.e., for DANN on binary classification. We extend it to CDAN and arbitrary $k$ (see Appendix 5.5.3 for the proof). Assuming that label distributions differ between source and target domains, the lower bound in Theorem 5.2.1 says that, for both DANN and CDAN, the better the alignment of marginal feature distributions, the worse the sum of errors on source and target domains. Notably, for an invariant representation ($D_{\mathrm{JS}}(\mathcal{D}_S^{\tilde{Z}}, \mathcal{D}_T^{\tilde{Z}}) = 0$) with no source error, the target error will be larger than $D_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y)/2$. Put another way, algorithms learning invariant representations and minimizing the source empirical risk are fundamentally flawed when marginal label distributions differ between source and target domains. Intuitively, CDAN also suffers from an intrinsic lower bound because, while its discriminator $d$ takes into account the predicted output distribution, $\hat{Y}$ is still a function of $X$.¹ All the information available to the discriminator comes from $X$. From an information-theoretic perspective, to circumvent the above tradeoff between distribution alignment and target error minimization, it is necessary to incorporate the ground-truth label distributions ($\mathcal{D}_S^Y$ and $\mathcal{D}_T^Y$) into the discriminator.

Common Assumptions to Tackle Domain Adaptation Domain adaptation requires assumptions about the data to be possible. Two common ones are covariate shift and label shift. They correspond to different ways of decomposing the joint distribution over $\mathcal{X} \times \mathcal{Y}$, as detailed in Table 5.2.1. From the perspective of representation learning, it has been demonstrated that covariate shift is not robust to feature transformation, and can lead to an effect called negative transfer (Zhao et al., 2019h). At the same time, label shift clearly fails in most practical applications. Consider, for instance, transferring knowledge from synthetic to real images (Visda, 2017): the supports of the input distributions are actually disjoint. In this chapter, we focus on label shift and propose a solution to the above problem.

5.3 Main Results

In light of the limitations of existing assumptions (e.g., covariate shift and label shift), we propose generalized label shift (GLS), a relaxation of label shift that substantially improves its applicability. We first discuss some of its properties and explain why the assumption is favorable in domain adaptation based on representation learning. Motivated by GLS, we then present a novel error decomposition theorem that directly suggests a bound minimization framework for domain adaptation. The framework is naturally compatible with $\mathcal{F}$-integral probability metrics (Müller, 1997, $\mathcal{F}$-IPM) and generates a family of domain adaptation algorithms by choosing various function classes $\mathcal{F}$.

¹Thm. 5.2.1 actually holds for any $\tilde{Z}$ s.t. $\hat{Y} = \tilde{h}(\tilde{Z})$, see Appendix 5.5.3.

In a nutshell, the proposed framework applies Lipton et al. (2018)'s method-of-moments to estimate the importance weights $w$ of the marginal label distributions by solving a quadratic program (QP), and then uses $w$ to align the weighted source feature distribution with the target feature distribution.

5.3.1 Generalized Label Shift

Definition 5.3.1 (Generalized Label Shift, GLS). A representation $Z = g(X)$ satisfies GLS if
$$\mathcal{D}_S(Z \mid Y = y) = \mathcal{D}_T(Z \mid Y = y), \quad \forall y \in \mathcal{Y}. \quad (5.2)$$

First, we note that when $g$ is the identity map, i.e., $Z = X$, the above definition of GLS reduces to the original label shift assumption. Next, GLS is always achievable for any distribution pair $(\mathcal{D}_S, \mathcal{D}_T)$: any constant function $g \equiv c \in \mathbb{R}$ satisfies the above definition. The most important property of GLS is arguably that, unlike label shift, the above definition is compatible with a perfect classifier in the noiseless case. More specifically, suppose there exists a ground-truth labeling function $h^*$ such that $Y = h^*(X)$; then $h^*$ satisfies GLS. As a comparison, without conditioning on $Y = y$, the optimal labeling function does not satisfy $\mathcal{D}_S(h^*(X)) = \mathcal{D}_T(h^*(X))$ if the marginal label distributions are different across domains. This observation is also consistent with the lower bound in Theorem 5.2.1, which holds for arbitrary marginal label distributions.

GLS imposes label shift in the feature space $\mathcal{Z}$ instead of the original input space $\mathcal{X}$. Conceptually, although samples from the same classes in the source and target domain can be dramatically different, the hope is to find an intermediate representation for both domains in which samples from a given class look similar to one another. Taking digit classification as an example and assuming the feature variable $Z$ corresponds to the contour of a digit, it is possible that by using different contour extractors for e.g. MNIST and USPS, those contours look roughly the same in both domains. Technically, GLS can be facilitated by having separate representation extractors $g_S$ and $g_T$ for source and target² (Bousmalis et al., 2016; Tzeng et al., 2017).
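As a toy illustration of Definition 5.3.1 (our own construction, with discrete features for readability): under GLS the label marginals, and hence the feature marginals, may differ across domains, as long as the class conditionals of $Z$ agree:

import numpy as np

# Z takes 3 values, Y takes 2 values; the conditionals D(Z | Y) are shared...
p_z_given_y = np.array([[0.7, 0.2, 0.1],    # D(Z | Y = 0), same in both domains
                        [0.1, 0.3, 0.6]])   # D(Z | Y = 1), same in both domains
# ...but the label marginals differ, i.e., label shift holds in feature space.
pY_S, pY_T = np.array([0.5, 0.5]), np.array([0.8, 0.2])

pZ_S = pY_S @ p_z_given_y    # marginal feature distribution in the source domain
pZ_T = pY_T @ p_z_given_y    # marginal feature distribution in the target domain
print(pZ_S, pZ_T)            # GLS holds by construction, yet the marginals differ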

5.3.2 An Error Decomposition Theorem based on GLS

Before delving into practical ways to enforce GLS, we provide performance guarantees for models that satisfy it, in the form of upper bounds on the error gap and on the joint error between source and target domains. The bound requires the following two concepts:

Definition 5.3.2 (Balanced Error Rate). The balanced error rate (BER) of predictor $\hat{Y}$ on domain $\mathcal{D}_S$ is:

$$\mathrm{BER}_{\mathcal{D}_S}(\hat{Y} \| Y) := \max_{j \in [k]} \mathcal{D}_S(\hat{Y} \neq Y \mid Y = j). \quad (5.3)$$

Definition 5.3.3 (Conditional Error Gap). Given a joint distribution $\mathcal{D}$, the conditional error gap of a classifier $\hat{Y}$ is $\Delta_{CE}(\hat{Y}) := \max_{y \neq y' \in \mathcal{Y}^2} \left|\mathcal{D}_S(\hat{Y} = y' \mid Y = y) - \mathcal{D}_T(\hat{Y} = y' \mid Y = y)\right|$.

When GLS and the Markov chain in (5.1) hold, the conditional error gap is equal to 0. The next theorem gives an upper bound on the error gap between source and target; it can also be used to obtain a generalization upper bound on the target risk.

Theorem 5.3.1 (Error Decomposition Theorem). For any classifier $\hat{Y} = (h \circ g)(X)$,
$$|\varepsilon_S(h \circ g) - \varepsilon_T(h \circ g)| \le \|\mathcal{D}_S^Y - \mathcal{D}_T^Y\|_1 \cdot \mathrm{BER}_{\mathcal{D}_S}(\hat{Y} \| Y) + 2(k-1)\Delta_{CE}(\hat{Y}),$$
where $\|\mathcal{D}_S^Y - \mathcal{D}_T^Y\|_1 := \sum_{i=1}^k |\mathcal{D}_S(Y = i) - \mathcal{D}_T(Y = i)|$ is the $L_1$ distance between $\mathcal{D}_S^Y$ and $\mathcal{D}_T^Y$.

Remark The upper bound in Theorem 5.3.1 provides a way to decompose the error gap between source and target domains. Additionally, with such a bound, we can immediately obtain a generalization bound of the target risk $\varepsilon_T(h)$. The upper bound contains two terms. The first one, $\|\mathcal{D}_S^Y - \mathcal{D}_T^Y\|_1$, measures the distance between the marginal label distributions across domains, and is a constant that only depends on the adaptation problem itself. It also contains BER, a reweighted classification performance on the source domain. The second term, $\Delta_{CE}(\hat{Y})$, by definition, measures the distance between the family of conditional distributions $\hat{Y} \mid Y$. In other words, the above upper bound is oblivious to the optimal labeling functions in feature space. This is in sharp contrast with upper bounds from previous work (Ben-David et al., 2010, Theorem 2), (Zhao et al., 2019h, Theorem 4.1), which essentially decompose the error gap in terms of the distance between the marginal feature distributions ($\mathcal{D}_S^Z$, $\mathcal{D}_T^Z$) and the optimal labeling functions ($f_S^Z$, $f_T^Z$). Because the optimal labeling function in feature space depends on $Z$ and is unknown in practice, such a decomposition is not very informative. As a comparison, Theorem 5.3.1 provides a decomposition orthogonal to previous results and does not require knowledge about unknown optimal labeling functions in feature space.

²For $x \in \mathcal{D}_S$ (resp. $x \in \mathcal{D}_T$), $z = g_S(x)$ (resp. $z = g_T(x)$).

Notably, the balanced error rate, $\mathrm{BER}_{\mathcal{D}_S}(\hat{Y} \| Y)$, only depends on samples from the source domain, hence we can seek to minimize it in order to minimize the upper bound. Furthermore, using a data-processing argument, the conditional error gap $\Delta_{CE}(\hat{Y})$ can be minimized by aligning the conditional feature distributions across domains. Putting everything together, the upper bound on the error difference between source and target domains suggests that, in order to minimize the error gap, it suffices to align the conditional distributions $Z \mid Y = y$ while simultaneously minimizing the balanced error rate. In fact, under the assumption that the conditional distributions are perfectly aligned (i.e., under GLS), we can prove a stronger result, guaranteeing that the joint error is small:

Theorem 5.3.2. If $Z = g(X)$ satisfies GLS, then for any $h : \mathcal{Z} \to \mathcal{Y}$ and letting $\hat{Y} = h(Z)$ be the predictor, we have $\varepsilon_S(\hat{Y}) + \varepsilon_T(\hat{Y}) \le 2\,\mathrm{BER}_{\mathcal{D}_S}(\hat{Y} \| Y)$.

Remark Theorems 5.3.1 and 5.3.2 imply that if the conditional feature distributions are aligned, then both the source and target error will be bounded by $\mathrm{BER}_{\mathcal{D}_S}(\hat{Y} \| Y)$. This suggests seeking models that simultaneously verify GLS and minimize $\mathrm{BER}_{\mathcal{D}_S}(\hat{Y} \| Y)$. Since the balanced error rate only depends on the source domain, using labeled samples from the source domain is sufficient to minimize it.

5.3.3 Conditions for Generalized Label Shift

The main difficulty in applying a bound minimization algorithm inspired by Theorem 5.3.1 is that we do not have access to labels from the target domain in UDA³, so we cannot directly align the conditional label distributions. Below, we provide a necessary condition for GLS that avoids the need to explicitly align the conditional feature distributions.

Definition 5.3.4. Assuming $\mathcal{D}_S(Y = y) > 0$, $\forall y \in \mathcal{Y}$, we let $w \in \mathbb{R}^k$ denote the importance weights of the target and source label distributions:

$$w_y := \frac{\mathcal{D}_T(Y = y)}{\mathcal{D}_S(Y = y)}, \quad \forall y \in \mathcal{Y}. \quad (5.4)$$
Given the importance weights vector, a necessary condition implied by GLS is expressed in the following lemma.

Lemma 5.3.1. Assuming $Z = g(X)$ satisfies GLS, then $\mathcal{D}_T(Z) = \sum_{y \in \mathcal{Y}} w_y \cdot \mathcal{D}_S(Z, Y = y) =: \mathcal{D}_S^w(Z)$.

³Though it could be used directly if we have a few target labels.

Compared to previous work that attempts to align $\mathcal{D}_T(Z)$ with $\mathcal{D}_S(Z)$ using adversarial discriminators (Ganin et al., 2016) or maximum mean discrepancy (MMD) (Long et al., 2015), Lemma 5.3.1 suggests that we should instead align $\mathcal{D}_T(Z)$ with the reweighted marginal distribution $\mathcal{D}_S^w(Z)$.

Reciprocally, one may be interested to know when a perfectly aligned target feature distribution and reweighted source feature distribution imply GLS. The following theorem gives a sufficient condition to answer this question:

Theorem 5.3.3 (Clustering structure implies sufficiency). Let $Z = g(X)$ such that $\mathcal{D}_T(Z) = \mathcal{D}_S^w(Z)$. Assume $\mathcal{D}_T(Y = y) > 0$, $\forall y \in \mathcal{Y}$. If there exists a partition of $\mathcal{Z} = \cup_{y \in \mathcal{Y}} \mathcal{Z}_y$ such that $\forall y \in \mathcal{Y}$, $\mathcal{D}_S(Z \in \mathcal{Z}_y \mid Y = y) = \mathcal{D}_T(Z \in \mathcal{Z}_y \mid Y = y) = 1$, then $Z = g(X)$ satisfies GLS.

Remark Theorem 5.3.3 shows that if there exists a partition of the feature space such that instances with the same label are within the same component, then aligning the target feature distribution with the reweighted source feature distribution implies GLS. While this clustering assumption may seem strong, it is consistent with the goal of reducing classification error: if such a clustering exists, then there also exists a perfect predictor based on the feature $Z = g(X)$, i.e., the cluster index.

We now consider CDAN, an algorithm particularly well suited for conditional alignment. As described in Section 5.2, the CDAN discriminator seeks to match $\mathcal{D}_S(\hat{Y} \otimes Z)$ with $\mathcal{D}_T(\hat{Y} \otimes Z)$. This objective is very aligned with GLS: let us first assume for argument's sake that $\hat{Y}$ is a perfect classifier on both domains. For any sample $(x, y)$, $\hat{y} \otimes z$ is thus a matrix of 0s except on the $y$-th row, which contains $z$. When label distributions match, the effect of fooling the discriminator will result in representations such that the matrices $\hat{Y} \otimes Z$ are equal on the source and target domains. In other words, the model is such that the distributions $Z \mid Y$ match: it verifies GLS (see Thm. 5.3.4 below with $w = 1$). On the other hand, if the label distributions differ, fooling the discriminator actually requires mislabelling certain samples (a fact quantified in Thm. 5.2.1). We now provide a sufficient condition for GLS under a modified CDAN objective (see proofs in Appendix 5.5.8).

Theorem 5.3.4. Let $\hat{Y} = h(Z)$, $\gamma := \min_{y \in \mathcal{Y}} \mathcal{D}_T(Y = y)$ and $w_M := \max_{y \in \mathcal{Y}} w_y$. For $\tilde{Z} = \hat{Y} \otimes Z$, we have:

$$\max_{y \in \mathcal{Y}} d_{TV}\left(\mathcal{D}_S(Z \mid Y = y), \mathcal{D}_T(Z \mid Y = y)\right) \le \frac{1}{\gamma}\left(w_M \varepsilon_S(\hat{Y}) + \varepsilon_T(\hat{Y}) + \sqrt{2 D_{\mathrm{JS}}(\mathcal{D}_S^w(\tilde{Z}), \mathcal{D}_T(\tilde{Z}))}\right).$$

Theorem 5.3.4 suggests that CDANs should match $\mathcal{D}_S^w(\hat{Y} \otimes Z)$ with $\mathcal{D}_T(\hat{Y} \otimes Z)$ to make them robust to mismatched label distributions. In fact, the above upper bound not only applies to CDAN, but also to any algorithm that aims at learning domain-invariant representations⁴.

Theorem 5.3.5. With the same notation as Thm. 5.3.4:

$$\max_{y \in \mathcal{Y}} d_{TV}\left(\mathcal{D}_S(Z \mid Y = y), \mathcal{D}_T(Z \mid Y = y)\right) \le \frac{1}{\gamma} \times \left(\inf_{\hat{Y}} \left(w_M \varepsilon_S(\hat{Y}) + \varepsilon_T(\hat{Y})\right) + \sqrt{8 D_{\mathrm{JS}}(\mathcal{D}_S^w(Z), \mathcal{D}_T(Z))}\right).$$

Remark It is worth pointing out that Theorem 5.3.5 extends Theorem 5.3.3 by incorporating the clustering assumption as the optimal joint error that is achievable by any classifier based on the representations. In

⁴Thm. 5.3.4 does not stem from Thm. 5.3.5 since the LHS of Thm. 5.3.4 applies to $Z$, not to $\tilde{Z}$, and in its RHS, $\tilde{Z}$ depends on $\hat{Y}$.

particular, if the clustering structure assumption holds in Theorem 5.3.3, then the optimal joint error is 0, which means that the first term on the R.H.S. of Thm. 5.3.5 becomes 0; hence in this case aligning the reweighted feature distributions also implies GLS.

5.3.4 Estimating the Importance Weights w

Inspired by the moment matching technique to estimate $w$ under label shift (Lipton et al., 2018), we propose a method to get $w$ under GLS by solving a quadratic program (QP).

Definition 5.3.5. We let $C \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{Y}|}$ denote the confusion matrix of the classifier on the source domain and $\mu \in \mathbb{R}^{|\mathcal{Y}|}$ the distribution of predictions on the target one; $\forall y, y' \in \mathcal{Y}$:
$$C_{y,y'} := \mathcal{D}_S(\hat{Y} = y, Y = y'), \quad \mu_y := \mathcal{D}_T(\hat{Y} = y).$$

The following lemma is adapted from Lipton et al. (2018) to give a consistent estimate of $w$ under GLS; its proof can be found in Appendix 5.5.9.

Lemma 5.3.2. If GLS is verified, and if the confusion matrix $C$ is invertible, then $w = C^{-1}\mu$.

The key insight from Lemma 5.3.2 is that, in order to estimate the importance vector $w$ under GLS, we do not need access to labels from the target domain. It is however well known that matrix inversion is numerically unstable, especially with finite sample estimates $\hat{C}$ and $\hat{\mu}$ of $C$ and $\mu$.⁵ We propose to solve instead the following QP (written as $QP(\hat{C}, \hat{\mu})$), whose solution will be consistent if $\hat{C} \to C$ and $\hat{\mu} \to \mu$:
$$\text{minimize}_w \quad \frac{1}{2}\|\hat{\mu} - \hat{C}w\|_2^2 \qquad \text{subject to} \quad w \ge 0, \quad w^T \mathcal{D}_S(Y) = 1. \quad (5.5)$$
The above QP can be efficiently solved in time $O(|\mathcal{Y}|^3)$, with $|\mathcal{Y}|$ small and constant. Furthermore, by construction, the solution of the above QP is element-wise non-negative, even with limited amounts of data to estimate $C$ and $\mu$.
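A minimal sketch of this estimator follows; the thesis does not prescribe a particular solver, so the use of scipy's SLSQP routine and the toy confusion matrix below are our own assumptions:

import numpy as np
from scipy.optimize import minimize

def estimate_importance_weights(C_hat, mu_hat, pY_S):
    # Solve (5.5): minimize 0.5 * ||mu_hat - C_hat w||^2 s.t. w >= 0, w^T D_S(Y) = 1.
    k = len(mu_hat)
    res = minimize(lambda w: 0.5 * np.sum((mu_hat - C_hat @ w) ** 2),
                   x0=np.ones(k), method='SLSQP',
                   bounds=[(0.0, None)] * k,
                   constraints=[{'type': 'eq', 'fun': lambda w: w @ pY_S - 1.0}])
    return res.x

# Toy example with k = 3 classes.
pY_S = np.array([0.5, 0.3, 0.2])                     # source label marginals
w_true = np.array([0.6, 1.2, 1.7])                   # satisfies w^T pY_S = 1
cond = np.full((3, 3), 0.05) + np.diag([0.85] * 3)   # P(Yhat = y | Y = y'), columns sum to 1
C = cond * pY_S                                      # C[y, y'] = P(Yhat = y | Y = y') D_S(Y = y')
mu = C @ w_true                                      # under GLS, mu = C w (Lemma 5.3.2)
print(estimate_importance_weights(C, mu, pY_S))      # recovers approximately w_true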

5.3.5 $\mathcal{F}$-IPM for Distributional Alignment

In order to align the target feature distribution and the reweighted source feature distribution as suggested by Lemma 5.3.1, we now provide a general framework using the integral probability metric (Müller, 1997, IPM).

Definition 5.3.6. Let $\mathcal{F}$ be a family of real-valued functions. The $\mathcal{F}$-IPM between two distributions $\mathcal{D}$ and $\mathcal{D}'$ is
$$d_{\mathcal{F}}(\mathcal{D}, \mathcal{D}') := \sup_{f \in \mathcal{F}} \left|\mathbb{E}_{X \sim \mathcal{D}}[f(X)] - \mathbb{E}_{X \sim \mathcal{D}'}[f(X)]\right|. \quad (5.6)$$

By approximating any function class $\mathcal{F}$ using parametrized models, e.g., neural networks, we obtain a general framework for domain adaptation by aligning the reweighted source feature distribution and the target feature distribution, i.e., by minimizing $d_{\mathcal{F}}(\mathcal{D}_T(\tilde{Z}), \mathcal{D}_S^w(\tilde{Z}))$. In particular, by choosing $\mathcal{F} = \{f : \|f\|_\infty \le 1\}$, $d_{\mathcal{F}}$ reduces to total variation and the definition (5.6) of the IPM becomes the negative sum of Type-I and Type-II errors (up to a constant) in distinguishing between $\mathcal{D}$ and $\mathcal{D}'$. This leads to our first algorithm IWDAN (cf. Section 5.4.1), an improved variant of the DANN algorithm that also takes into account the difference between label distributions. Similarly, by instantiating $\mathcal{F}$ to be the set of bounded norm functions in a RKHS $\mathcal{H}$ (Gretton et al., 2012), we obtain maximum mean discrepancy methods, leading to IWJAN (cf. Section 5.4.1), a variant of JAN (Long et al., 2017) for UDA. Below, we provide a comprehensive empirical evaluation of these variants.

⁵In fact, $\hat{w} = \hat{C}^{-1}\hat{\mu}$ is not even necessarily non-negative.

Algorithm 2 Importance-Weighted Domain Adaptation

1: Input: source data $(x_S, y_S)$, target data $x_T$, representation $g_\theta$, classifier $h_\phi$ and discriminator $d_\psi$
2: Input: epochs $E$, batches per epoch $B$, batch size $s$
3: Initialize $w_1 = 1$
4: for $t = 1$ to $E$ do
5:   Initialize $\hat{C} = 0$, $\hat{\mu} = 0$
6:   for $b = 1$ to $B$ do
7:     Sample batches $(x_S^i, y_S^i)$ and $(x_T^i)$
8:     Maximize $\mathcal{L}_{DA}^{w_t}$ w.r.t. $\theta$, minimize $\mathcal{L}_{DA}^{w_t}$ w.r.t. $\psi$ and minimize $\mathcal{L}_C^{w_t}$ w.r.t. $\theta$ and $\phi$
9:     for $i = 1$ to $s$ do
10:      $\hat{C}_{\cdot y_S^i} \leftarrow \hat{C}_{\cdot y_S^i} + h_\phi(g_\theta(x_S^i))$  ($y_S^i$-th column)
11:      $\hat{\mu} \leftarrow \hat{\mu} + h_\phi(g_\theta(x_T^i))$
12:    end for
13:  end for
14:  $\hat{C} \leftarrow \hat{C}/sB$ and $\hat{\mu} \leftarrow \hat{\mu}/sB$
15:  $w_{t+1} = \lambda \cdot QP(\hat{C}, \hat{\mu}) + (1 - \lambda)w_t$
16: end for
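In Python, the bookkeeping of lines 5-15 might look as follows. This is our own sketch: predict_proba stands for $h_\phi \circ g_\theta$ and qp_solve for the $QP(\hat{C}, \hat{\mu})$ step (e.g., the SLSQP sketch given in Section 5.3.4), both assumed available:

import numpy as np

def update_weights(w_t, source_batches, target_batches, predict_proba, qp_solve,
                   pY_S, lam=0.5, k=10):
    # One epoch of weight estimation (Algorithm 2, lines 5-15): accumulate C-hat and
    # mu-hat from soft predictions, then take an exponential moving average of the QP solution.
    C_hat, mu_hat, n_s, n_t = np.zeros((k, k)), np.zeros(k), 0, 0
    for x_s, y_s in source_batches:
        probs = predict_proba(x_s)                # (batch, k) class probabilities
        for p_i, y_i in zip(probs, y_s):
            C_hat[:, y_i] += p_i                  # add to the y_S^i-th column (line 10)
        n_s += len(y_s)
    for x_t in target_batches:
        mu_hat += predict_proba(x_t).sum(axis=0)  # line 11
        n_t += len(x_t)
    C_hat, mu_hat = C_hat / n_s, mu_hat / n_t     # line 14
    return lam * qp_solve(C_hat, mu_hat, pY_S) + (1 - lam) * w_t   # line 15 (EMA)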

5.4 Practical Implementation

5.4.1 Algorithms

In the sections above, we have shown a way to estimate the reweighting vector $w$ and defined necessary and sufficient conditions on the source and target feature distributions for GLS to hold. Together, they suggest simple algorithms based on representation learning: 1. estimate $w$ on the fly during training; 2. align the feature distributions $\tilde{Z}$ of the target domain with the reweighted feature distribution of the source domain; and 3. minimize the balanced error rate.

Computing $w$ requires building estimators $\hat{C}$ and $\hat{\mu}$ from finite samples of $C$ and $\mu$. We do so by averaging, during each successive epoch, the predictions of the classifier on the source and target data. This step corresponds to the inner-most loop of Algorithm 2 (lines 9 to 12) and leads to the estimates $\hat{C}$ and $\hat{\mu}$. At the end of each epoch, the reweighting vector $w$ is updated, and the estimators are reset to 0. We have found empirically that using an exponential moving average of $w$ performs better (line 15 in Alg. 2). The results of our experiments all use a factor $\lambda = 0.5$.

With the importance weights $w$ in hand, we can now define our first algorithm, the Importance-Weighted Domain Adversarial Network (IWDAN), which seeks to enforce the necessary condition in Lemma 5.3.1 (i.e., to align $\mathcal{D}_S^w(Z)$ and $\mathcal{D}_T(Z)$) using a discriminator. All it requires is to modify the DANN losses $\mathcal{L}_{DA}$ and $\mathcal{L}_C$. For batches $(x_S^i, y_S^i)$ and $(x_T^i)$ of size $s$, the weighted domain adaptation loss of IWDAN is:
$$\mathcal{L}_{DA}^w(x_S^i, y_S^i, x_T^i; \theta, \psi) = -\frac{1}{s}\sum_{i=1}^s w_{y_S^i} \log(d_\psi(g_\theta(x_S^i))) + \log(1 - d_\psi(g_\theta(x_T^i))). \quad (5.7)$$
We verify in the Appendix, Lemma 5.5.1, that the standard adversarial domain adaptation framework applied to $\mathcal{L}_{DA}^w$ indeed minimizes the JSD between $\mathcal{D}_S^w(Z)$ and $\mathcal{D}_T(Z)$. Our second algorithm, Importance-Weighted Joint Adaptation Networks (IWJAN), is based on JAN (Long et al., 2017) and follows the reweighting principle described in Section 5.3.5 with a learnt RKHS $\mathcal{F}$ (the exact JAN and IWJAN losses are specified in Appendix 5.A.4).
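In code, (5.7) is a per-sample reweighting of the discriminator's binary cross-entropy; a minimal PyTorch sketch (ours, with toy inputs):

import torch

def iwdan_da_loss(d_src, d_tgt, w, y_src):
    # Weighted domain-adversarial loss of Eq. (5.7).
    # d_src, d_tgt: discriminator outputs d(g(x)) in (0, 1); w: class weights; y_src: labels.
    src_term = w[y_src] * torch.log(d_src)     # w_{y_S^i} log d(g(x_S^i))
    tgt_term = torch.log(1.0 - d_tgt)          # log(1 - d(g(x_T^i)))
    return -(src_term + tgt_term).mean()

w = torch.tensor([0.6, 1.4])                   # importance weights for 2 classes
d_src = torch.rand(8).clamp(0.01, 0.99)
d_tgt = torch.rand(8).clamp(0.01, 0.99)
y_src = torch.randint(0, 2, (8,))
print(iwdan_da_loss(d_src, d_tgt, w, y_src))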

[Figure 5.4.1: two panels (IWDAN left, IWCDAN right); y-axis: % improvement over the base algorithm; x-axis: JSD between label distributions; points with linear fits for the estimated-weight and oracle (-O) variants. See caption below.]

Figure 5.4.1: Gains of our algorithms versus their base versions for the 100 tasks described in Section 5.4 (IWDAN/IWCDAN on the left/right). The x-axis represents $D_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y)$, the JSD between label distributions. Lines represent linear fits. The mean improvements over DANN (resp. CDAN) for IWDAN and IWDAN-O (resp. IWCDAN and IWCDAN-O) are 6.55% and 8.14% (resp. 2.25% and 2.81%).

Finally, our third algorithm, based on CDAN, is the Importance-Weighted Conditional Domain Adversarial Network (IWCDAN). It follows Thm. 5.3.4, which suggests matching $\mathcal{D}_S^w(\hat{Y} \otimes Z)$ with $\mathcal{D}_T(\hat{Y} \otimes Z)$. This can be done by replacing the standard adversarial loss in CDAN with the one in Eq. 5.7, where $d_\psi$ takes as input $(h_\phi \circ g_\theta) \otimes g_\theta$ instead of $g_\theta$. The classifier loss for our three variants is:
$$\mathcal{L}_C^w(x_S^i, y_S^i; \theta, \phi) = -\frac{1}{s}\sum_{i=1}^s \frac{1}{k\,\mathcal{D}_S(Y = y_S^i)} \log\left(h_\phi(g_\theta(x_S^i))_{y_S^i}\right). \quad (5.8)$$
This reweighting is suggested by our theoretical analysis from Section 5.3, where we seek to minimize the balanced error rate $\mathrm{BER}_{\mathcal{D}_S}(\hat{Y} \| Y)$. We also define oracle versions, IWDAN-O, IWJAN-O and IWCDAN-O, where the weights $w$ used in the losses are not estimated but computed using the true target label distribution. This gives an idealistic version of the reweighting method, and allows us to assess the soundness of GLS. IWDAN, IWJAN and IWCDAN correspond to Alg. 2 with their respective loss functions on line 8; the oracle versions simply use the true weights $w$ instead of $w_t$.
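The balanced loss (5.8) likewise divides each sample's cross-entropy by $k\,\mathcal{D}_S(Y = y)$; a minimal PyTorch sketch (ours, with toy inputs):

import torch
import torch.nn.functional as F

def iwdan_cls_loss(logits, y_src, pY_S):
    # Balanced classification loss of Eq. (5.8): per-sample weight 1 / (k * D_S(Y = y)).
    k = logits.shape[1]
    log_probs = F.log_softmax(logits, dim=1)
    picked = log_probs[torch.arange(len(y_src)), y_src]   # log h(g(x))_{y_S^i}
    return -(picked / (k * pY_S[y_src])).mean()

pY_S = torch.tensor([0.7, 0.3])      # source label marginals
logits = torch.randn(8, 2)
y_src = torch.randint(0, 2, (8,))
print(iwdan_cls_loss(logits, y_src, pY_S))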

5.4.2 Experiments

We apply our three base algorithms, their importance-weighted versions, and the associated oracles to domain adaptation problems from the following datasets: Digits (MNIST ↔ USPS) (Dheeru and Karra Taniskidou, 2017; LeCun and Cortes, 2010), Visda (2017), Office-31 (Saenko et al., 2010) and Office-Home (Venkateswara et al., 2017)⁶. All values are averages over 5 runs.

Performance vs. $D_{\mathrm{JS}}$ First, we artificially generate a family of tasks from MNIST and USPS by considering various random subsets of the classes in either the source or target domain (see Appendix 5.A.5 for details). This results in 100 domain adaptation tasks, with Jensen-Shannon divergences varying between 0 and 0.1. Applying IWDAN and IWCDAN results in Figure 5.4.1. We see a clear correlation between the improvements provided by our algorithms and $D_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y)$, which is well aligned with Theorem 5.2.1.

⁶Except for JAN, which is not available on Digits.

Table 5.4.1: Average results on the various domains (Digits has 2 tasks, Visda 1, Office-31 6 and Office-Home 12). The prefix s denotes the experiment where the source domain is subsampled to increase $D_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y)$. Each number is a mean over 5 seeds; the parenthesized percentage denotes the fraction of times (out of 5 seeds × #tasks) our algorithms outperform their base versions.

Method    | Digits       | sDigits      | Visda        | sVisda       | O-31         | sO-31        | O-H          | sO-H
No DA     | 77.17        | 75.67        | 48.39        | 49.02        | 77.81        | 75.72        | 56.39        | 51.34
DANN      | 93.15        | 83.24        | 61.88        | 52.85        | 82.74        | 76.17        | 59.62        | 51.83
IWDAN     | 94.90 (100%) | 92.54 (100%) | 63.52 (100%) | 60.18 (100%) | 83.90 (87%)  | 82.60 (100%) | 62.27 (97%)  | 57.61 (100%)
IWDAN-O   | 95.27 (100%) | 94.46 (100%) | 64.19 (100%) | 62.10 (100%) | 85.33 (97%)  | 84.41 (100%) | 64.68 (100%) | 60.87 (100%)
CDAN      | 95.72        | 88.23        | 65.60        | 60.19        | 87.23        | 81.62        | 64.59        | 56.25
IWCDAN    | 95.90 (80%)  | 93.22 (100%) | 66.49 (60%)  | 65.83 (100%) | 87.30 (73%)  | 83.88 (100%) | 65.66 (70%)  | 61.24 (100%)
IWCDAN-O  | 95.85 (90%)  | 94.81 (100%) | 68.15 (100%) | 66.85 (100%) | 88.14 (90%)  | 85.47 (100%) | 67.64 (98%)  | 63.73 (100%)
JAN       | N/A          | N/A          | 56.98        | 50.64        | 85.13        | 78.21        | 59.59        | 53.94
IWJAN     | N/A          | N/A          | 57.56 (100%) | 57.12 (100%) | 85.32 (60%)  | 82.61 (97%)  | 59.78 (63%)  | 55.89 (100%)
IWJAN-O   | N/A          | N/A          | 61.48 (100%) | 61.30 (100%) | 87.14 (100%) | 86.24 (100%) | 60.73 (92%)  | 57.36 (100%)

Table 5.4.2: Ablation study on the original and subsampled Digits data.

Method                      | Digits | sDigits
DANN                        | 93.15  | 83.24
DANN + $\mathcal{L}_C^w$    | 93.27  | 84.52
DANN + $\mathcal{L}_{DA}^w$ | 95.31  | 94.41
IWDAN-O                     | 95.27  | 94.46
CDAN                        | 95.72  | 88.23
CDAN + $\mathcal{L}_C^w$    | 95.65  | 91.01
CDAN + $\mathcal{L}_{DA}^w$ | 95.42  | 93.18
IWCDAN-O                    | 95.85  | 94.81

Moreover, IWDAN outperforms DANN on all 100 tasks, and IWCDAN bests CDAN on 94 of them. Even on small divergences, our algorithms do not suffer compared to their base versions.

Original Datasets Average results on each dataset are shown in Table 5.4.1 (see the tables in Appendix 5.A.2 for the per-task breakdown). Our weighted version IWDAN outperforms the basic algorithm DANN by 1.75%, 1.64%, 1.16% and 2.65% on the Digits, Visda, Office-31 and Office-Home tasks, respectively. Gains for IWCDAN are more limited, but still present: 0.18%, 0.89%, 0.07% and 1.07%, respectively. This can be explained by the fact that, as mentioned above, CDAN already enforces a weak form of GLS. Gains for JAN are 0.58%, 0.19% and 0.19%. Beyond mean performance, we show the fraction of times (over all seeds and tasks) our variants outperform the original algorithms⁷. Even if gains are small, our variants provide consistent improvements. Additionally, the oracle versions show larger improvements, which strongly supports enforcing GLS.

⁷On the original datasets, the variance between seeds is larger than the difference between algorithms, making it uninformative.


Subsampled datasets The original datasets have fairly balanced classes, making the JSD between source and target label distributions $D_{\mathrm{JS}}(\mathcal{D}_S^Y \| \mathcal{D}_T^Y)$ rather small. To evaluate our algorithms on larger divergences, we arbitrarily modify the source domains on all the tasks above by considering only 30% of the samples coming from the first half of the classes. This results in much larger divergences. Performance is shown in Table 5.4.1. For IWDAN, we see gains of 9.3%, 7.33%, 6.43% and 5.58% on the Digits, Visda, Office-31 and Office-Home datasets, respectively. For IWCDAN, improvements are 4.99%, 5.64%, 2.26% and 4.99%, and IWJAN shows gains of 6.48%, 4.40% and 1.95%. Moreover, on all seeds and tasks but one, our variants outperform their base versions. Here as well, the oracles perform even better.

Ablation Study Our algorithms have two components, a weighted adversarial loss $\mathcal{L}_{DA}^w$ and a weighted classification loss $\mathcal{L}_C^w$. In Table 5.4.2, we augment DANN and CDAN using those losses separately (with the true weights). We observe that DANN benefits essentially from the reweighting of its adversarial loss $\mathcal{L}_{DA}^w$; the classification loss has little effect. For CDAN, gains are essentially seen on the subsampled datasets. Both losses help, with a +2% extra gain for $\mathcal{L}_{DA}^w$.

5.5 Proofs

In this section, we provide the theoretical material that completes the main text.

5.5.1 Definition

Definition 5.5.1. Let us recall that for two distributions $\mathcal{D}$ and $\mathcal{D}'$, the Jensen-Shannon (JS) divergence $D_{\mathrm{JS}}(\mathcal{D} \| \mathcal{D}')$ is defined as:
$$D_{\mathrm{JS}}(\mathcal{D} \| \mathcal{D}') := \frac{1}{2} D_{\mathrm{KL}}(\mathcal{D} \| \mathcal{D}_M) + \frac{1}{2} D_{\mathrm{KL}}(\mathcal{D}' \| \mathcal{D}_M),$$
where $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the Kullback–Leibler (KL) divergence and $\mathcal{D}_M := (\mathcal{D} + \mathcal{D}')/2$.

5.5.2 Consistency of the Weighted Domain Adaptation Loss (5.7)

For the sake of conciseness, we verify here that the domain adaptation training objective does lead to minimizing the Jensen-Shannon divergence between the weighted feature distribution of the source domain and the feature distribution of the target domain.

Lemma 5.5.1. Let $p(x, y)$ and $q(x)$ be two density distributions, and $w(y)$ be a positive function such that $\int p(y)w(y)\,dy = 1$. Let $p^w(x) = \int p(x, y)w(y)\,dy$ denote the $w$-reweighted marginal distribution of $x$ under $p$. The minimum value of

$$I(d) := \mathbb{E}_{(x,y) \sim p,\, x' \sim q}\left[-w(y)\log(d(x)) - \log(1 - d(x'))\right]$$
is $\log(4) - 2D_{\mathrm{JS}}(p^w(x) \| q(x))$, and is attained for $d^*(x) = \frac{p^w(x)}{p^w(x) + q(x)}$.

Proof. We see that:
$$I(d) = -\iiint \left[w(y)\log(d(x)) + \log(1 - d(x'))\right] p(x, y)\, q(x')\, dx\, dx'\, dy \quad (5.9)$$
$$= -\int \left[\int w(y)p(x, y)\, dy\right] \log(d(x)) + q(x)\log(1 - d(x))\, dx \quad (5.10)$$
$$= -\int p^w(x)\log(d(x)) + q(x)\log(1 - d(x))\, dx. \quad (5.11)$$
From the last line, we follow the exact method from Goodfellow et al. (2014) to see that point-wise in $x$ the minimum is attained for $d^*(x) = \frac{p^w(x)}{p^w(x) + q(x)}$ and that $I(d^*) = \log(4) - 2D_{\mathrm{JS}}(p^w(x) \| q(x))$. □
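For a discrete toy case, the lemma can be verified numerically. The sketch below (our own check, with made-up $p$, $q$ and $w$) confirms that plugging $d^*$ into $I(d)$ recovers $\log 4 - 2 D_{\mathrm{JS}}(p^w \| q)$:

import numpy as np

def js_div(p, q):
    m = (p + q) / 2
    kl = lambda a, b: sum(x * np.log(x / y) for x, y in zip(a, b) if x > 0)  # natural log
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_xy = np.array([[0.2, 0.1], [0.1, 0.2], [0.1, 0.3]])   # p(x, y): 3 values of x, 2 of y
q_x = np.array([0.5, 0.3, 0.2])                          # q(x)
w_y = np.array([0.5, 4/3])                               # positive, sum_y p(y) w(y) = 1
p_w = p_xy @ w_y                                         # reweighted marginal p^w(x)
d_star = p_w / (p_w + q_x)                               # optimal discriminator

# I(d*) from the definition: E_p[-w(y) log d(x)] + E_q[-log(1 - d(x'))].
I = -(p_xy * w_y * np.log(d_star)[:, None]).sum() - (q_x * np.log(1 - d_star)).sum()
assert np.isclose(I, np.log(4) - 2 * js_div(p_w, q_x))
print(I)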

Applying Lemma 5.5.1 to $\mathcal{D}_S(Z, Y)$ and $\mathcal{D}_T(Z)$ proves that the domain adaptation objective leads to minimizing $D_{\mathrm{JS}}(\mathcal{D}_S^w(Z) \| \mathcal{D}_T(Z))$.

5.5.3 k-class Information-Theoretic Lower Bound

In this section, we prove Theorem 5.2.1, which extends the previous result to the general k-class classification problem.

Theorem 5.2.1. Suppose that the Markov chain in (5.1) holds, and that $D_{\mathrm{JS}}(\mathcal{D}_S^Y \| \mathcal{D}_T^Y) \ge D_{\mathrm{JS}}(\mathcal{D}_S^{\tilde{Z}} \| \mathcal{D}_T^{\tilde{Z}})$, then:
$$\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) \ge \frac{1}{2}\left(\sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^Y \| \mathcal{D}_T^Y)} - \sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^{\tilde{Z}} \| \mathcal{D}_T^{\tilde{Z}})}\right)^2.$$

Proof. We essentially follow the proof from Zhao et al. (2019h), except for their Lemma 4.6, which needs to be adapted to the CDAN framework, and Lemma 4.7, which needs to be extended to k-class classification.

Lemma 4.6 from Zhao et al. (2019h) states that $D_{\mathrm{JS}}(\mathcal{D}_S^{\hat{Y}}, \mathcal{D}_T^{\hat{Y}}) \le D_{\mathrm{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)$, which covers the case $\tilde{Z} = Z$. When $\tilde{Z} = \hat{Y} \otimes Z$, let us first recall that we assume $h$, or equivalently $\hat{Y}$, to be a one-hot prediction of the class. We have the following Markov chain:

$$X \xrightarrow{g} Z \xrightarrow{\tilde{h}} \tilde{Z} \xrightarrow{l} \hat{Y},$$
where $\tilde{h}(z) = h(z) \otimes z$ and $l : \mathcal{Y} \otimes \mathcal{Z} \to \mathcal{Y}$ returns the index of the non-zero block in $\tilde{h}(z)$. There is only one such block since $h$ is one-hot, and its index corresponds to the class predicted by $h$. We can now apply the same proof as in Zhao et al. (2019h) to conclude that:

$$D_{\mathrm{JS}}(\mathcal{D}_S^{\hat{Y}}, \mathcal{D}_T^{\hat{Y}}) \le D_{\mathrm{JS}}(\mathcal{D}_S^{\tilde{Z}}, \mathcal{D}_T^{\tilde{Z}}). \quad (5.12)$$
It essentially boils down to a data-processing argument: the discrimination distance between two distributions cannot increase after the same (possibly stochastic) channel (kernel) is applied to both. Here, the channel corresponds to the (potentially randomized) function $l$.

Remark Additionally, we note that the above inequality holds for any $\tilde{Z}$ such that $\hat{Y} = l(\tilde{Z})$ for a (potentially randomized) function $l$. This covers any and all potential combinations of representations at various layers of the deep net, including the last layer (which corresponds to its predictions $\hat{Y}$).

75 Y Yb Let us move to the second part of the proof. We wish to show that DJS( , ) ε(h g), where can be either or : D D ≤ ◦ D DS DT Y Yb Y Yb 2DJS( , ) (Lin, 1991) D D ≤ kD − D k1 k = ∑ (Yb = i) (Y = i) i=1|D − D | k k = ∑ ∑ (Yb = i Y = j) (Y = j) (Y = i) i=1|j=1D | D − D | k = ∑ (Yb = i Y = i) (Y = i) (Y = i) + ∑ (Yb = i Y = j) (Y = j) i=1|D | D − D j=iD | D | 6 k k ∑ (Yb = i Y = i) 1 (Y = i) + ∑∑ (Yb = i Y = j) (Y = j) ≤ i=1|D | − |D i=1j=iD | D 6 k k = ∑ (Yb = Y Y = i) (Y = i) + ∑∑ (Yb = i Y = j) (Y = j) i=1D 6 | D j=1i=jD | D 6 k = 2∑ (Yb = Y Y = i) (Y = i) = 2 (Yb = Y) = 2ε(h g). (5.13) i=1D 6 | D D 6 ◦

We can now apply the triangle inequality to $\sqrt{D_{\mathrm{JS}}}$, which is a distance metric (Endres and Schindelin, 2003), called the Jensen-Shannon distance. This gives us:
$$\sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y)} \le \sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}})} + \sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^{\hat{Y}}, \mathcal{D}_T^{\hat{Y}})} + \sqrt{D_{\mathrm{JS}}(\mathcal{D}_T^{\hat{Y}}, \mathcal{D}_T^Y)}$$
$$\le \sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}})} + \sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^{\tilde{Z}}, \mathcal{D}_T^{\tilde{Z}})} + \sqrt{D_{\mathrm{JS}}(\mathcal{D}_T^{\hat{Y}}, \mathcal{D}_T^Y)}$$
$$\le \sqrt{\varepsilon_S(h \circ g)} + \sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^{\tilde{Z}}, \mathcal{D}_T^{\tilde{Z}})} + \sqrt{\varepsilon_T(h \circ g)},$$
where we used Equation (5.12) for the second inequality and (5.13) for the third.
Finally, assuming that $D_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \ge D_{\mathrm{JS}}(\mathcal{D}_S^{\tilde{Z}}, \mathcal{D}_T^{\tilde{Z}})$, we get:
$$\left(\sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y)} - \sqrt{D_{\mathrm{JS}}(\mathcal{D}_S^{\tilde{Z}}, \mathcal{D}_T^{\tilde{Z}})}\right)^2 \le \left(\sqrt{\varepsilon_S(h \circ g)} + \sqrt{\varepsilon_T(h \circ g)}\right)^2 \le 2\left(\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g)\right),$$
which concludes the proof. □

5.5.4 Proof of Theorem 5.3.1

To simplify the notation, we define the error gap $\Delta_\varepsilon(\hat{Y})$ as follows:

$$\Delta_\varepsilon(\hat{Y}) := \left|\varepsilon_S(\hat{Y}) - \varepsilon_T(\hat{Y})\right|.$$

Also, in this case we use $\mathcal{D}_a$, $a \in \{S, T\}$, to mean the source and target distributions respectively. Before we give the proof of Theorem 5.3.1, we first prove the following two lemmas that will be used in the proof.

Lemma 5.5.2. Define $\gamma_{a,j} := \mathcal{D}_a(Y = j)$, $\forall a \in \{S, T\}$, $\forall j \in [k]$. Then $\forall \alpha_j, \beta_j \ge 0$ such that $\alpha_j + \beta_j = 1$, and $\forall i \neq j$, the following upper bound holds:
$$\left|\gamma_{S,j}\,\mathcal{D}_S(\hat{Y} = i \mid Y = j) - \gamma_{T,j}\,\mathcal{D}_T(\hat{Y} = i \mid Y = j)\right| \le |\gamma_{S,j} - \gamma_{T,j}| \cdot \left(\alpha_j \mathcal{D}_S(\hat{Y} = i \mid Y = j) + \beta_j \mathcal{D}_T(\hat{Y} = i \mid Y = j)\right) + \gamma_{S,j}\beta_j\Delta_{CE}(\hat{Y}) + \gamma_{T,j}\alpha_j\Delta_{CE}(\hat{Y}).$$

Proof. To make the derivation uncluttered, define $\mathcal{D}_j(\hat{Y} = i) := \alpha_j \mathcal{D}_S(\hat{Y} = i \mid Y = j) + \beta_j \mathcal{D}_T(\hat{Y} = i \mid Y = j)$ to be the mixture conditional probability of $\hat{Y} = i$ given $Y = j$, where the mixture weights are given by $\alpha_j$ and $\beta_j$. Then in order to prove the upper bound in the lemma, it suffices to give the desired upper bound for the following term:

$$\left|\gamma_{S,j}\,\mathcal{D}_S(\hat{Y} = i \mid Y = j) - \gamma_{T,j}\,\mathcal{D}_T(\hat{Y} = i \mid Y = j)\right| - \left|(\gamma_{S,j} - \gamma_{T,j})\,\mathcal{D}_j(\hat{Y} = i)\right|$$
$$\le \left|\gamma_{S,j}\,\mathcal{D}_S(\hat{Y} = i \mid Y = j) - \gamma_{T,j}\,\mathcal{D}_T(\hat{Y} = i \mid Y = j) - (\gamma_{S,j} - \gamma_{T,j})\,\mathcal{D}_j(\hat{Y} = i)\right|$$

$$= \left|\gamma_{S,j}\left(\mathcal{D}_S(\hat{Y} = i \mid Y = j) - \mathcal{D}_j(\hat{Y} = i)\right) - \gamma_{T,j}\left(\mathcal{D}_T(\hat{Y} = i \mid Y = j) - \mathcal{D}_j(\hat{Y} = i)\right)\right|,$$
following which we will have:

$$\left|\gamma_{S,j}\,\mathcal{D}_S(\hat{Y} = i \mid Y = j) - \gamma_{T,j}\,\mathcal{D}_T(\hat{Y} = i \mid Y = j)\right| \le \left|(\gamma_{S,j} - \gamma_{T,j})\,\mathcal{D}_j(\hat{Y} = i)\right| + \left|\gamma_{S,j}\left(\mathcal{D}_S(\hat{Y} = i \mid Y = j) - \mathcal{D}_j(\hat{Y} = i)\right) - \gamma_{T,j}\left(\mathcal{D}_T(\hat{Y} = i \mid Y = j) - \mathcal{D}_j(\hat{Y} = i)\right)\right|$$
$$\le |\gamma_{S,j} - \gamma_{T,j}| \cdot \left(\alpha_j \mathcal{D}_S(\hat{Y} = i \mid Y = j) + \beta_j \mathcal{D}_T(\hat{Y} = i \mid Y = j)\right)$$

$$+\ \gamma_{S,j}\left|\mathcal{D}_S(\hat{Y} = i \mid Y = j) - \mathcal{D}_j(\hat{Y} = i)\right| + \gamma_{T,j}\left|\mathcal{D}_T(\hat{Y} = i \mid Y = j) - \mathcal{D}_j(\hat{Y} = i)\right|.$$
To proceed, let us first simplify $\mathcal{D}_S(\hat{Y} = i \mid Y = j) - \mathcal{D}_j(\hat{Y} = i)$. By definition of $\mathcal{D}_j(\hat{Y} = i) = \alpha_j \mathcal{D}_S(\hat{Y} = i \mid Y = j) + \beta_j \mathcal{D}_T(\hat{Y} = i \mid Y = j)$, we know that:

$$\mathcal{D}_S(\hat{Y} = i \mid Y = j) - \mathcal{D}_j(\hat{Y} = i) = \mathcal{D}_S(\hat{Y} = i \mid Y = j) - \left(\alpha_j \mathcal{D}_S(\hat{Y} = i \mid Y = j) + \beta_j \mathcal{D}_T(\hat{Y} = i \mid Y = j)\right)$$
$$= \beta_j\left(\mathcal{D}_S(\hat{Y} = i \mid Y = j) - \mathcal{D}_T(\hat{Y} = i \mid Y = j)\right).$$
Similarly, for the second term $\mathcal{D}_T(\hat{Y} = i \mid Y = j) - \mathcal{D}_j(\hat{Y} = i)$, we can show that:
$$\mathcal{D}_T(\hat{Y} = i \mid Y = j) - \mathcal{D}_j(\hat{Y} = i) = \alpha_j\left(\mathcal{D}_T(\hat{Y} = i \mid Y = j) - \mathcal{D}_S(\hat{Y} = i \mid Y = j)\right).$$
Plugging these two identities into the above, we can continue the analysis with

γ ( (Yb = i Y = j) (Yb = i)) γ ( (Yb = i Y = j) (Yb = i)) S,j DS | − Dj − T,j DT | − Dj

= γ β( (Yb = i Y = j) (Yb = i Y = j)) γ α ( (Yb = i Y = j) (Yb = i Y = j)) S,j DS | − DT | − T,j j DT | − DS |

γ β ( (Yb = i Y = j) (Yb = i Y = j)) + γ α ( (Yb = i Y = j) (Yb = i Y = j)) ≤ S,j j DS | − DT | T,j j DT | − DS | γ β ∆ (Yb) + γ α ∆ (Yb). ≤ S,j j CE T,j j CE The first inequality holds by the triangle inequality and the second by the definition of the conditional error gap. Combining all the inequalities above completes the proof. 

77 We are now ready to prove the theorem: Theorem 5.3.1. (Error Decomposition Theorem) For any classifier Yb = (h g)(X), ◦ Y Y εS(h g) εT(h g) S T 1 BER S (Yb Y) + 2(k 1)∆CE(Yb), | ◦ − ◦ | ≤ kD − D k · D k − where Y Y := ∑k (Y = i) (Y = i) is the L distance between Y and Y. kDS − DT k1 i=1 |DS − DT | 1 DS DT Proof of Theorem 5.3.1. First, by the law of total probability, it is easy to verify that following identity holds for a S, T : ∈ { } a(Yb = Y) = ∑ a(Yb = i, Y = j) = ∑ γa,j a(Yb = i Y = j). D 6 i=j D i=j D | 6 6 Using this identity, to bound the error gap, we have:

S(Y = Yb) T(Y = Yb) |D 6 − D 6 | = ∑ γS,j S(Yb = i Y = j) ∑ γT,j T(Yb = i Y = j) i=j D | − i=j D | 6 6

∑ γS,j S(Yb = i Y = j) γT,j T(Yb = i Y = j) . ≤ i=j D | − D | 6

Invoking Lemma 7.6.1 to bound the above terms, and since j [k], γS,j, γT,j [0, 1], αj + βj = 1, we get: ∀ ∈ ∈

S(Y = Yb) T(Y = Yb) |D 6 − D 6 | ∑ γS,j S(Yb = i Y = j) γT,j T(Yb = i Y = j) ≤ i=j D | − D | 6   ∑ γS,j γT,j αj S(Yb = i Y = j) + βj T(Yb = i Y = j) + γS,jβj∆CE(Yb) + γT,jαj∆CE(Yb) ≤ i=j | − | · D | D | 6   ∑ γS,j γT,j αj S(Yb = i Y = j) + βj T(Yb = i Y = j) + γS,j∆CE(Yb) + γT,j∆CE(Yb) ≤ i=j | − | · D | D | 6   k = ∑ γS,j γT,j αj S(Yb = i Y = j) + βj T(Yb = i Y = j) + ∑ ∑ γS,j∆CE(Yb) + γT,j∆CE(Yb) i=j | − | · D | D | i=1 j=i 6   6 = ∑ γS,j γT,j αj S(Yb = i Y = j) + βj T(Yb = i Y = j) + 2(k 1)∆CE(Yb). i=j | − | · D | D | − 6 Note that the above holds α , β 0 such that α + β = 1. By choosing α = 1, j [k] and ∀ j j ≥ j j j ∀ ∈ β = 0, j [k], we have: j ∀ ∈ = ∑ γS,j γT,j S(Yb = i Y = j) + 2(k 1)∆CE(Yb) i=j | − | · D | − 6 k k ! = ∑ γS,j γT,j ∑ S(Yb = i Y = j) + 2(k 1)∆CE(Yb) j=1 | − | · i=1,i=j D | − 6 k = ∑ γS,j γT,j S(Yb = Y Y = j) + 2(k 1)∆CE(Yb) j=1 | − | · D 6 | − Y Y S T 1 BER S (Yb Y) + 2(k 1)∆CE(Yb), ≤ kD − D k · D k − where the last line is due to Holder’s inequality, completing the proof. 

78 5.5.5 Proof of Theorem 5.3.2

Theorem 5.3.2. If Z = g(X) satisfies GLS, then for any h : and letting Yb = h(Z) be the Z → Y predictor, we have εS(Yb) + εT(Yb) 2BER (Yb Y). ≤ DS k Proof. First, by the law of total probability, we have:

ε (Yb) + ε (Yb) = (Y = Yb) + (Y = Yb) S T DS 6 DT 6 k = ∑∑ S(Yb = i Y = j) S(Y = j) + T(Yb = i Y = j) T(Y = j). j=1i=jD | D D | D 6 Now, since Yb = (h g)(X) = h(Z), Yb is a function of Z. Given the generalized label shift assumption, this guarantees that:◦

y, y0 , (Yb = y0 Y = y) = (Yb = y0 Y = y). ∀ ∈ Y DS | DT | Thus:

k εS(Yb) + εT(Yb) = ∑∑ S(Yb = i Y = j)( S(Y = j) + T(Y = j)) j=1i=jD | D D 6 = ∑ S(Yb = Y Y = j) ( S(Y = j) + T(Y = j)) j [k] D 6 | · D D ∈ max S(Yb = Y Y = j) S(Y = j) + T(Y = j) [ ] ∑ ≤ j k D 6 | · j [k] D D ∈ ∈

= 2BER S (Yb Y).  D k 5.5.6 Proof of Lemma 5.3.1

w Lemma 5.3.1. Assuming Z = g(X) satisfies GLS, then T(Z) = ∑y wy S(Z, Y = y) =: S (Z). D ∈Y ·D D Proof. Using (5.2) and (5.4) on the second line:

T(Z) = ∑ T(Y = y) T(Z Y = y) D y D ·D | ∈Y = ∑ wy S(Y = y) S(Z Y = y) y ·D ·D | ∈Y = ∑ wy S(Z, Y = y).  y ·D ∈Y 5.5.7 Proof of Theorem 5.3.3 Theorem 5.3.3. (Clustering structure implies sufficiency) Let Z = g(X) such that (Z) = w(Z). DT DS Assume T(Y = y) > 0, y . If there exists a partition of = y y such that y , D ∀ ∈ Y Z ∪ ∈Y Z ∀ ∈ Y (Z Y = y) = (Z Y = y) = 1, then Z = g(X) satisfies GLS. DS ∈ Zy | DT ∈ Zy |

79 Proof. Follow the condition that (Z) = w(Z), by definition of w(Z), we have: DT DS DS

T(Y = y) T(Z) = ∑ D S(Z, Y = y) T(Z) = ∑ T(Y = y) S(Z Y = y) D y S(Y = y) D ⇐⇒ D y D D | ∈Y D ∈Y ∑ T(Y = y) T(Z Y = y) = ∑ T(Y = y) S(Z Y = y). ⇐⇒ y D D | y D D | ∈Y ∈Y Note that the above equation holds for all measurable subsets of . Now by the assumption that = Z Z y y is a partition of , consider y : ∪ ∈Y Z Z Z 0 ( = ) ( = ) = ( = ) ( = ) ∑ T Y y T Z y0 Y y ∑ T Y y S Z y0 Y y . y D D ∈ Z | y D D ∈ Z | ∈Y ∈Y Due to the assumption (Z Y = y) = (Z Y = y) = 1, we know that y = y, DS ∈ Zy | DT ∈ Zy | ∀ 0 6 (Z Y = y) = (Z Y = y) = 0. This shows that both the supports of DT ∈ Zy0 | DS ∈ Zy0 | (Z Y = y) and (Z Y = y) are contained in . Now consider an arbitrary measurable set DS | DT | Zy E y, since y y is a partition of , we know that ⊆ Z ∪ ∈Y Z Z

(Z E Y = y0) = (Z E Y = y0) = 0, y0 = y. DS ∈ | DT ∈ | ∀ 6 Plug Z E into the following identity: ∈ ∑ T(Y = y) T(Z E Y = y) = ∑ T(Y = y) S(Z E Y = y) y D D ∈ | y D D ∈ | ∈Y ∈Y = (Y = y) (Z E Y = y) = (Y = y) (Z E Y = y) ⇒ DT DT ∈ | DT DS ∈ | = (Z E Y = y) = (Z E Y = y), ⇒ DT ∈ | DS ∈ | where the last line holds because (Y = y) = 0. Realize that the choice of E is arbitrary, this shows that DT 6 (Z Y = y) = (Z Y = y), which completes the proof. DS | DT |  5.5.8 Sufficient Conditions for GLS

Theorem 5.3.4. Let Yb = h(Z), γ := miny T(Y = y) and wM := max wy. For Z˜ = Yˆ Z, we ∈Y D y ⊗ have: ∈Y

max dTV( S(Z Y = y), T(Z Y = y)) y D | D | ≤ ∈Y  q  1 w w ε (Yb) + ε (Yb) + 2DJS( (Z˜ ), (Z˜ )) . γ M S T DS DT

Proof. For a given class y, we let R denote the y-th row of Yˆ Z. R is a random variable that lives in y ⊗ y . Let us consider a measurable set E . Two options exist: Z ⊆ Z 0 E, in which case: R E = ( Z E Yˆ = y ) Yˆ = y , • ∈ { y ∈ } { ∈ } ∩ { } ∪ { 6 } 0 / E, in which case: R E = Z E Yˆ = y . • ∈ { y ∈ } { ∈ } ∩ { }

80 This allows us to write:

w w ˆ 1 w ˆ S (Ry E) = S (Z E, Y = y) + 0 E S (Y = y) (5.14) D ∈ D ∈ { ∈ }D 6 = ( ˆ = = )w + 1 ( ˆ = = )w ∑ S Z E, Y y, Y y0 y0 0 E ∑ S Y y00, Y y0 y0 y D ∈ { ∈ }y ,y =yD 0 0 006 ˆ 1 ˆ T(Ry E) = T(Z E, Y = y) + 0 E T(Y = y) (5.15) D ∈ D ∈ { ∈ }D 6 ˆ 1 ˆ = ∑ T(Z E, Y = y, Y = y0) + 0 E ∑ T(Y = y00, Y = y0). y D ∈ { ∈ }y ,y =yD 0 0 006 We are interested in the quantity

1 (Z E Y = y) (Z E Y = y) = (Z E, Y = y)w (Z E, Y = y) , |DS ∈ | − DT ∈ | | (Y = y)|DS ∈ y − DT ∈ | DT which we bound below:

(Z E, Y = y)w (Z E, Y = y) |DS ∈ y − DT ∈ | = (Z E, Y = y)w w(R E) + w(R E) (R E) |DS ∈ y − DS y ∈ DS y ∈ − DT y ∈ + (R E) (Z E, Y = y) DT y ∈ − DT ∈ | w ˆ 1 w ˆ w = S(Z E, Y = y)wy S (Z E, Y = y) 0 E S (Y = y) + S (Ry E) T(Ry E) |D ∈ − D ∈ − { ∈ }D 6 D ∈ − D ∈ 1 ˆ ˆ + 0 E T(Y = y) + T(Z E, Y = y) T(Z E, Y = y) { ∈ }D 6 D ∈ − D ∈ | w ˆ 1 w ˆ ˆ S(Z E, Y = y)wy S (Z E, Y = y) + 0 E S (Y = y) T(Y = y) ≤ |D ∈ − D ∈ | { ∈ }|D 6 − D 6 | + w(R E) (R E) + (Z E, Yˆ = y) (Z E, Y = y) |DS y ∈ − DT y ∈ | |DT ∈ − DT ∈ | (Z E, Y = y)w w(Z E, Yˆ = y) + w(Yˆ = y) (Yˆ = y) ≤ |DS ∈ y − DS ∈ | |DS 6 − DT 6 | + D ( w(R ), (R )) + (Z E, Yˆ = y) (Z E, Y = y) , (5.16) TV DS y DT y |DT ∈ − DT ∈ | where we used Eqs.5.14 and 5.15 on the third line, the triangle inequality on the fourth and sup w(R E |DS y ∈ E) (R E) = D ( w(R ), (R )) on the last. Let us start by upper-bounding the first term: − DT y ∈ | TV DS y DT y (Z E, Y = y)w w(Z E, Yˆ = y) |DS ∈ y − DS ∈ | = (Z E, Yˆ = y, Y = y0)w (Z E, Yˆ = y0, Y = y)w |∑DS ∈ y0 − ∑DS ∈ y| y0 y0 ( ˆ = = )w ( ˆ = = )w ∑ S Z E, Y y, Y y0 y0 S Z E, Y y0, Y y y ≤ |y =yD ∈ − D ∈ | 06 ˆ ˆ wM ∑ S(Z E, Y = y, Y = y0) + S(Z E, Y = y0, Y = y) ≤ y =yD ∈ D ∈ 06 w (Z E, Yˆ = Y) w ε (Yb). ≤ MDS ∈ 6 ≤ M S Similarly, we can prove that: (R E) (Z E, Y = y) ε (Yb) which bounds the third term |DT y ∈ − DT ∈ | ≤ T

81 of 5.16. As far as the second term is concerned:

w( ˆ = ) ( ˆ = ) = ( ˆ = = )w ( ˆ = = ) S Y y T Y y ∑ S Y y00, Y y0 y0 ∑ T Y y00, Y y0 |D 6 − D 6 | |y ,y =yD − y ,y =yD | 0 006 0 006 (Y = y0)w (Y = y0) ≤|∑DS y0 − ∑DT | y0 y0 + ( ˆ = = )w ( ˆ = = ) ∑ S Y y, Y y0 y0 ∑ T Y y, Y y0 |y =yD − y =yD 06 06 + (Yˆ = y, Y = y)w (Yˆ = y, Y = y) DS y − DT | ( ˆ = = )w ( ˆ = = ) ∑ S Y y, Y y0 y0 ∑ T Y y, Y y0 ≤|y =yD − y =yD 06 06 + (Y = y)w (Yˆ = y, Y = y)w DS y − DS 6 y (Y = y) + (Yˆ = y, Y = y) − DT DT 6 | ( ˆ = = )w + ( ˆ = = )w ∑ S Y y, Y y0 y0 S Y y, Y y y ≤y =yD D 6 06 ˆ ˆ + ∑ T(Y = y, Y = y0) + T(Y = y, Y = y) y =yD D 6 06 w ε (Yb) + ε (Yb) ≤ M S T where the first term of the first inequality disappeared because y0 , S(Y = y0)wy0 = T(Y = y0) (we also used that property in the second line of the second inequality).∀ ∈ Y CombiningD these in Eq.5.16,D this guarantees that for any measurable set E:

(Z E, Y = y)w (Z E, Y = y) 2w ε (Yb) + D ( w(R ), (R )) + 2ε (Yb). |DS ∈ y − DT ∈ | ≤ M S TV DS y DT y T w w Finally, we have DTV ( (Ry), T(Ry)) DTV ( (Yˆ Z) T(Yˆ Z)) and from Briët and DS D ≤ DSq ⊗ k D ⊗ w w Harremoës (2009)), D ( (Yˆ Z) (Yˆ Z)) 8DJS( (Yˆ Z) (Yˆ Z)) (the total TV DS ⊗ k DT ⊗ ≤ DS ⊗ k DT ⊗ variation and Jensen-Shannon distance are equivalent), which gives us straightforwardly:

1 (Z E Y = y) (Z E Y = y) = (Z E, Y = y)w (Z E, Y = y) |DS ∈ | − DT ∈ | | (Y = y)|DS ∈ y − DT ∈ | DT 1  w  2wMεS(Yb) + DTV ( S (Ry), T(Ry)) + 2εT(Yb) ≤ T(Y = y) D D D  q  1 w 2w ε (Yb) + 8DJS( (Z˜ ) (Z˜ )) + 2ε (Yb) . ≤ (Y = y) M S DS k DT T DT

Using the fact that DTV ( S(Z Y = y), T(Z Y = y)) = sup S(Z E Y = y) T(Z E D | D | E |D ∈ | − D ∈ | Y = y) gives: |  q  2 w ˜ ˜ DTV ( S(Z Y = y), T(Z Y = y)) wMεS(Yb) + 2DJS( S (Z) T(Z)) + εT(Yb) D | D | ≤ T(Y = y) D k D D q  2 w w ε (Yb) + 2DJS( (Z˜ ) (Z˜ )) + ε (Yb) . ≤ γ M S DS k DT T

Taking the maximum over y on the left-hand side concludes the proof. 

82 Theorem 5.3.5. With the same notation as Thm. 5.3.4:

max dTV( S(Z Y = y), T(Z Y = y)) y D | D | ≤ ∈Y  q  1   w inf wMεS(Yb) + εT(Yb) + 8DJS( S (Z), T(Z)) . γ × Yb D D

Proof. To prove the above upper bound, let us first fix a y and fix a classifier Yb = h(Z) for some ∈ Y h : . Now consider any measurable subset E , we would like to upper bound the following quantity:Z → Y ⊆ Z

1 (Z E Y = y) (Z E Y = y) = (Z E, Y = y)w (Z E, Y = y) |DS ∈ | − DT ∈ | | (Y = y) · |DS ∈ y − DT ∈ | DT 1 (Z E, Y = y)w (Z E, Y = y) . ≤ γ · |DS ∈ y − DT ∈ |

Hence it suffices if we can upper bound S(Z E, Y = y)wy T(Z E, Y = y) . To do so, consider the following decomposition: |D ∈ − D ∈ |

(Z E, Y = y) (Z E, Y = y)w = (Z E, Y = y) (Z E, Yb = y) |DT ∈ − DS ∈ y| |DT ∈ − DT ∈ + (Z E, Yb = y) w(Z E, Yb = y) DT ∈ − DS ∈ + w(Z E, Yb = y) (Z E, Y = y)w DS ∈ − DS ∈ y| (Z E, Y = y) (Z E, Yb = y) ≤ |DT ∈ − DT ∈ | + (Z E, Yb = y) w(Z E, Yb = y) |DT ∈ − DS ∈ | + w(Z E, Yb = y) (Z E, Y = y)w . |DS ∈ − DS ∈ y|

We bound the above three terms in turn. First, consider (Z E, Y = y) (Z E, Yb = y) : |DT ∈ − DT ∈ | (Z E, Y = y) (Z E, Yb = y) = (Z E, Y = y, Yb = y0) (Z E, Yb = y, Y = y0) |DT ∈ − DT ∈ | | ∑ DT ∈ − ∑ DT ∈ | y0 y0 ∑ T(Z E, Y = y, Yb = y0) T(Z E, Yb = y, Y = y0) ≤ y =y |D ∈ − D ∈ | 06 ∑ T(Z E, Y = y, Yb = y0) + T(Z E, Yb = y, Y = y0) ≤ y =y D ∈ D ∈ 06 ∑ T(Y = y, Yb = y0) + T(Yb = y, Y = y0) ≤ y =y D D 06 (Y = Yb) ≤ DT 6 = εT(Yb), where the last inequality is due to the fact that the definition of error rate corresponds to the sum of all the off-diagonal elements in the confusion matrix while the sum here only corresponds to the sum of all the

83 elements in two slices. Similarly, we can bound the third term as follows:

w(Z E, Yb = y) (Z E, Y = y)w |DS ∈ − DS ∈ y| = (Z E, Yb = y, Y = y0)w (Z E, Yb = y0, Y = y)w |∑DS ∈ y0 − ∑DS ∈ y| y0 y0 ( = = )w ( = = )w ∑ S Z E, Yb y, Y y0 y0 S Z E, Yb y0, Y y y ≤ |y =yD ∈ − D ∈ | 06 wM ∑ S(Z E, Yb = y, Y = y0) + S(Z E, Yb = y0, Y = y) ≤ y =yD ∈ D ∈ 06 w (Z E, Yb = Y) ≤ MDS ∈ 6 w ε (Yb). ≤ M S Now we bound the last term. Recall the definition of total variation, we have:

w 1 w 1 T(Z E, Yb = y) S (Z E, Yb = y) = T(Z E Z Yb− (y)) S (Z E Z Yb− (y)) |D ∈ − D ∈ | |D ∈ ∧ ∈ −w D ∈ ∧ ∈ | sup (Z E0) (Z E0) ≤ |DT ∈ − DS ∈ | E0 is measurable w = dTV( (Z), (Z)). DT DS Combining the above three parts yields

1  w  (Z E Y = y) (Z E Y = y) w ε (Yb) + ε (Yb) + dTV( (Z), (Z)) . |DS ∈ | − DT ∈ | | ≤ γ · M S T DS DT

Now realizing that the choice of y and the measurable subset E on the LHS is arbitrary, this leads to ∈ Y

1  w  max sup S(Z E Y = y) T(Z E Y = y) wMεS(Yb) + εT(Yb) + dTV( S (Z), T(Z)) . y |D ∈ | − D ∈ | | ≤ γ · D D ∈Y E

Furthermore, notice that the above upper bound holds for any classifier Yb = h(Z), hence we have

1  w  max dTV( S(Z E Y = y), T(Z E Y = y)) inf wMεS(Yb) + εT(Yb) + dTV( S (Z), T(Z)) , y D ∈ | D ∈ | ≤ γ · D D ∈Y Yb which completes the proof. 

5.5.9 Proof of Lemma 5.3.2

1 Lemma 5.3.2. If GLS is verified, and if the confusion matrix C is invertible, then w = C− µ.

Proof. Given (5.2), and with the joint hypothesis Yb = h(Z) over both source and target domains, it is straightforward to see that the induced conditional distributions over predicted labels match between the source and target domains, i.e.:

(Yb = h(Z) Y = y) = DS | (Yb = h(Z) Y = y), y . (5.17) DT | ∀ ∈ Y

84 This allows us to compute µ , y as y ∀ ∈ Y

T(Yb = y) = ∑ T(Yb = y Y = y0) T(Y = y0) D y D | ·D 0∈Y = ∑ S(Yb = y Y = y0) T(Y = y0) y D | ·D 0∈Y T(Y = y0) = ∑ S(Yb = y, Y = y0) D y D · S(Y = y0) 0∈Y D = C w ∑ y,y0 y0 . y · 0∈Y where we used (5.17) for the second line. We thus have µ = Cw which concludes the proof. 

5.5.10 -IPM for Distributional Alignment F In Table 5.5.1, we list different instances of IPM with different choices of the function class in the above definition, including the total variation distance, Wasserstein-1 distance and the MaximumF mean discrepancy (Gretton et al., 2012).

Table 5.5.1: List of IPMs with different . Lip denotes the Lipschitz seminorm and is a reproducing kernel Hilbert space (RKHS). F k · k H

d F F f : f 1 Total Variation { k k∞ ≤ } f : f Lip 1 Wasserstein-1 distance { k k ≤ } f : f 1 Maximum mean discrepancy { k kH ≤ }

5.6 Conclusion

We have introduced the generalized label shift assumption and theoretically-grounded variations of existing algorithms to handle mismatched label distributions. On (rather) balanced tasks from classic benchmarks, our algorithms outperform (by small margins) their base versions. On unbalanced datasets, the gain becomes significant and, as expected theoretically, correlates well with the JSD between label distributions. We now discuss potential improvements.

Improved importance weights estimation All the results above were obtained with λ = 0.5; we favored simplicity of the algorithm over raw performance. We notice however, that the oracle sometimes shows substantial improvements over the estimated weights algorithm. It suggests that w is not perfectly estimated and that e.g. fine-tuning λ or updating w more or less often could lead to better performance. Y One can also think of settings (e.g. semi-supervised learning) where estimations of T can be obtained via other means. D

85 Extensions The framework we define relies on appropriately reweighting the domain adversarial losses. It can be straightforwardly applied to settings where multiple source and/or target domains are used, by simply maintaining one importance weights vector w for each source/target pair (Peng et al., 2019; Zhao et al., 2018c). In particular, label shift could explain the observation from Zhao et al. (2018c) that too many source domains sometimes hurt performance, and our framework might alleviate the issue.

86 Appendix

5.A More Experiments

5.A.1 Description of the domain adaptation tasks Digits We follow a widely used evaluation protocol (Hoffman et al., 2017b; Long et al., 2018). For the digits datasets MNIST (M, LeCun and Cortes (2010)) and USPS (U, Dheeru and Karra Taniskidou (2017)), we consider the DA tasks: M U and U M. Performance is evaluated on the 10,000/2,007 examples of the MNIST/USPS test sets.→ → Visda (2017) is a sim-to-real domain adaptation task. The synthetic domain contains 2D rendering of 3D models captured at different angles and lighting conditions. The real domain is made of natural images. Overall, the training, validation and test domains contain 152,397, 55,388 and 5,534 images, from 12 different classes. Office-31 (Saenko et al., 2010) is one of the most popular dataset for domain adaptation . It contains 4,652 images from 31 classes. The samples come from three domains: Amazon (A), DSLR (D) and Webcam (W), which generate six possible transfer tasks, A D, A W, D A, D W, W A and W D, which we all evaluate. → → → → → →Office-Home (Venkateswara et al., 2017) is a more complex dataset than Office-31. It consists of 15,500 images from 65 classes depicting objects in office and home environments. The images form four different domains: Artistic (A), Clipart (C), Product (P), and Real-World images (R). We evaluate the 12 possible domain adaptation tasks.

5.A.2 Full results on the domain adaptation tasks Tables 5.A.1, 5.A.2, 5.A.3, 5.A.4, 5.A.5 and 5.A.6 show the detailed results of all the algorithms on each task of the domains described above. The subscript denotes the fraction of seeds for which our variant outperforms the base algorithm. More precisely, by outperform, we mean that for a given seed (which fixes the network initialization as well as the data being fed to the model) the variant has a larger accuracy on the test set than its base version. Doing so allows to assess specifically the effect of the algorithm, all else kept constant.

5.A.3 Jensen-Shannon divergence of the original and subsampled domain adaptation datasets

Tables 5.A.7, 5.A.8 and 5.A.9 show DJS( S(Z) T(Z)) for our four datasets and their subsampled versions, rows correspond to the source domain,D and||D columns to the target one. We recall that subsampling simply consists in taking 30% of the first half of the classes in the source domain (which explains why DJS( (Z) (Z)) is not symmetric for the subsampled datasets). DS ||DT

87 Table 5.A.1: Results on the Digits tasks. M and U stand for MNIST and USPS, the prefix s denotes the Y Y experiment where the source domain is subsampled to increase DJS( , ). DS DT

METHOD M UU M AVG. SM U SU M AVG. → → → → NO AD. 79.04 75.30 77.17 76.02 75.32 75.67 DANN 90.65 95.66 93.15 79.03 87.46 83.24 IWDAN 93.28100% 96.52100% 94.90100% 91.77100% 93.32100% 92.54100% IWDAN-O 93.73100% 96.81100% 95.27100% 92.50100% 96.42100% 94.46100% CDAN 94.16 97.29 95.72 84.91 91.55 88.23 IWCDAN 94.3660% 97.45100% 95.9080% 93.42100% 93.03100% 93.22100% IWCDAN-O 94.3480% 97.35100% 95.8590% 93.37100% 96.26100% 94.81100%

Table 5.A.2: Results on the Visda domain. The prefix s denotes the experiment where the source domain is Y Y subsampled to increase DJS( , ). DS DT

METHOD VISDA SVISDA

NO AD. 48.39 49.02 DANN 61.88 52.85 IWDAN 63.52100% 60.18100% IWDAN-O 64.19100% 62.10100% CDAN 65.60 60.19 IWCDAN 66.4960% 65.83100% IWCDAN-O 68.15100% 66.85100%

JAN 56.98100% 50.64100% IWJAN 57.56100% 57.12100% IWJAN-O 61.48100% 61.30100%

5.A.4 Losses

i i i For batches of data (xS, yS) and (xT) of size s, the DANN losses are: s i i i 1 i i DA(xS, yS, xT; θ, ψ) = ∑ log(dψ(gθ(xS))) + log(1 dψ(gθ(xT))), (5.18) L − s i=1 − s i i 1 i C(x , y ; θ, φ) = log(hφ(gθ(x ) i )). (5.19) S S ∑ S yS L − s i=1 Similarly, the CDAN losses are: s i i i 1 i i i i DA(xS, yS, xT; θ, ψ) = ∑ log(dψ(hφ(gθ(xS)) gθ(xS))) + log(1 dψ(hφ(gθ(xT)) gθ(xT))), L − s i=1 ⊗ − ⊗ (5.20) s i i 1 i C(x , y ; θ, φ) = log(hφ(gθ(x ) i )), (5.21) S S ∑ S yS L − s i=1 i i i i i i i where hφ(gθ(xS)) gθ(xS) := (h1(g(xS))g(xS),..., hk(g(xS))g(xS)) and h1(g(xS)) is the i-th ele- ⊗i ment of vector h(g(xS)).

88 Table 5.A.3: Results on the Office dataset.

METHOD A DA WD AD WW AW D AVG. → → → → → → NO DA 79.60 73.18 59.33 96.30 58.75 99.68 77.81 DANN 84.06 85.41 64.67 96.08 66.77 99.44 82.74 IWDAN 84.3060% 86.42100% 68.38100% 97.13100% 67.1660% 100.0100% 83.9087% IWDAN-O 87.23100% 88.88100% 69.92100% 98.09100% 67.9680% 99.92100% 85.3397% CDAN 89.56 93.01 71.25 99.24 70.32 100.0 87.23 IWCDAN 88.9160% 93.2360% 71.9080% 99.3080% 70.4360% 100.0100% 87.3073% IWCDAN-O 90.0860% 94.52100% 73.11100% 99.3080% 71.83100% 100.0100% 88.1490% JAN 85.94 85.66 70.50 97.48 71.5 99.72 85.13 IWJAN 87.68100% 84.860% 70.3660% 98.98100% 70.060% 100.0100% 85.3260% IWJAN-O 89.68100% 89.18100% 71.96100% 99.02100% 73.0100% 100.0100% 87.14100%

Table 5.A.4: Results on the Subsampled Office dataset.

METHODSA D SA W SD A SD W SW A SW D AVG. → → → → → → NO DA 75.82 70.69 56.82 95.32 58.35 97.31 75.72 DANN 75.46 77.66 56.58 93.76 57.51 96.02 76.17 IWDAN 81.61100% 88.43100% 65.00100% 96.98100% 64.86100% 98.72100% 82.60100% IWDAN-O 84.94100% 91.17100% 68.44100% 97.74100% 64.57100% 99.60100% 84.41100% CDAN 82.45 84.60 62.54 96.83 65.01 98.31 81.62 IWCDAN 86.59100% 87.30100% 66.45100% 97.69100% 66.34100% 98.92100% 83.88100% IWCDAN-O 87.39100% 91.47100% 69.69100% 97.91100% 67.50100% 98.88100% 85.47100% JAN 77.74 77.64 64.48 91.68 92.60 65.10 78.21 IWJAN 84.62100% 83.28100% 65.3080% 96.30100% 98.80100% 67.38100% 82.6197% IWJAN-O 88.42100% 89.44100% 72.06100% 97.26100% 98.96100% 71.30100% 86.24100%

The JAN losses (Long et al., 2017) are :

1 s 1 s 2 s i i i i j i j i j (5.22) DA(xS, yS, xT; θ, ψ) = 2 ∑ k(xS, xS) 2 ∑ k(xT, xT) + 2 ∑ k(xS, xT) L − s i,j=1 − s i,j=1 s i,j=1 s i i 1 i C(x , y ; θ, φ) = log(hφ(gθ(x ) i )), (5.23) S S ∑ S yS L − s i=1 where k corresponds to the kernel of the RKHS used to measure the discrepancy between distributions. H i j Exactly as in Long et al. (2017), it is the product of kernels on various layers of the network k(xS, xS) = l i j l ∏l k (xS, xS). Each individual kernel k is computed as the dot-product between two transformations ∈L l i j l l i l l j l of the representation: k (xS, xS) = dψ(gθ(xS)), dψ(gθ(xS)) (in this case, dψ outputs vectors in a high-dimensional space). See Section 5.A.6h for more details. i

89 Table 5.A.5: Results on the Office-Home dataset.

METHOD A CA PA RC AC PC R → → → → → → NO DA 41.02 62.97 71.26 48.66 58.86 60.91 DANN 46.03 62.23 70.57 49.06 63.05 64.14 IWDAN 48.65100% 69.19100% 73.60100% 53.59100% 66.25100% 66.09100% IWDAN-O 50.19100% 70.53100% 75.44100% 56.69100% 67.40100% 67.98100% CDAN 49.00 69.23 74.55 54.46 68.23 68.9 IWCDAN 49.81100% 73.41100% 77.56100% 56.5100% 69.6480% 70.33100% IWCDAN-O 52.31100% 74.54100% 78.46100% 60.33100% 70.78100% 71.47100% JAN 41.64 67.20 73.12 51.02 62.52 64.46 IWJAN 41.120% 67.5680% 73.1460% 51.70100% 63.42100% 65.22100% IWJAN-O 41.8880% 68.72100% 73.62100% 53.04100% 63.88100% 66.48100%

METHOD P AP CP RR AR CR P AVG. → → → → → → NO DA 47.1 35.94 68.27 61.79 44.42 75.5 56.39 DANN 48.29 44.06 72.62 63.81 53.93 77.64 59.62 IWDAN 52.81100% 46.2480% 73.97100% 64.90100% 54.0280% 77.96100% 62.2797% IWDAN-O 59.33100% 48.28100% 76.37100% 69.42100% 56.09100% 78.45100% 64.68100% CDAN 56.77 48.8 76.83 71.27 55.72 81.27 64.59 IWCDAN 58.99100% 48.410% 77.94100% 69.480% 54.730% 81.0760% 65.6670% IWCDAN-O 62.60100% 50.73100% 78.88100% 72.44100% 57.79100% 81.3180% 67.6498% JAN 54.5 40.36 73.10 64.54 45.98 76.58 59.59 IWJAN 55.2680% 40.3860% 73.0880% 64.4060% 45.680% 76.3640% 59.7863% IWJAN-O 57.78100% 41.32100% 73.66100% 65.40100% 46.68100% 76.3620% 60.7392%

The IWJAN losses are: s s s w i i i 1 i j 1 i j 2 i j (x , y , x ; θ, ψ) = w i w j k(x , x ) k(x , x ) + w i k(x , x ) (5.24) DA S S T 2 ∑ yS y S S 2 ∑ T T 2 ∑ yS S T L − s i,j=1 S − s i,j=1 s i,j=1 s w i i 1 1 i C (xS, yS; θ, φ) = log(hφ(gθ(xS))yi ). (5.25) L − s ∑ k (Y = y) S i=1 DS

5.A.5 Generation of domain adaptation tasks with varying DJS( (Z) (Z)) DS k DT We consider the MNIST USPS task and generate a set of 50 vectors in [0.1, 1]10. Each vector corresponds to the fraction→ of each class to be trained on, eitherV in the source or the target domain (to assess the impact of both). The left bound is chosen as 0.1 to ensure that classes all contain some samples. This methodology creates 100 domain adaptation tasks, 50 for subsampled-MNIST USPS and 50 → for MNIST subsampled-USPS, with Jensen-Shannon divergences varying from 6.1e 3 to 9.53e 28. They are then→ used to evaluate our algorithms, see Section 5.4 and Figures 5.4.1 and 5.A.1.− Fig 5.A.1 shows− the absolute performance of the 6 algorithms we consider here. We see the sharp decrease in performance of the base versions DANN and CDAN. Comparatively, our importance-weighted algorithms maintain good performance even for large divergences between the marginal label distributions.

8We manually rejected some samples to guarantee a rather uniform set of divergences.

90 Table 5.A.6: Results on the subsampled Office-Home dataset.

METHOD A CA PA RC AC PC R → → → → → → NO DA 35.70 54.72 62.61 43.71 52.54 56.62 DANN 36.14 54.16 61.72 44.33 52.56 56.37 IWDAN 39.81100% 63.01100% 68.67100% 47.39100% 61.05100% 60.44100% IWDAN-O 42.79100% 66.22100% 71.40100% 53.39100% 61.47100% 64.97100% CDAN 38.90 56.80 64.77 48.02 60.07 61.17 IWCDAN 42.96100% 65.01100% 71.34100% 52.89100% 64.65100% 66.48100% IWCDAN-O 45.76100% 68.61100% 73.18100% 56.88100% 66.61100% 68.48100% JAN 34.52 56.86 64.54 46.18 56.84 59.06 IWJAN 36.24100% 61.00100% 66.34100% 48.66100% 59.92100% 61.88100% IWJAN-O 37.46100% 62.68100% 66.88100% 49.82100% 60.22100% 62.54100%

METHOD P AP CP RR AR CR PAVG. → → → → → → NO DA 44.29 33.05 65.20 57.12 40.46 70.0 DANN 44.58 37.14 65.21 56.70 43.16 69.86 51.83 IWDAN 50.44100% 41.63100% 72.46100% 61.00100% 49.40100% 76.07100% 57.61100% IWDAN-O 56.05100% 43.39100% 74.87100% 66.73100% 51.72100% 77.46100% 60.87100% CDAN 49.65 41.36 70.24 62.35 46.98 74.69 56.25 IWCDAN 54.87100% 44.80100% 75.91100% 67.02100% 50.45100% 78.55100% 61.24100% IWCDAN-O 59.63100% 46.98100% 77.54100% 69.24100% 53.77100% 78.11100% 63.73100% JAN 50.64 37.24 69.98 58.72 40.64 72.00 53.94 IWJAN 52.92100% 37.68100% 70.88100% 60.32100% 41.54100% 73.26100% 55.89100% IWJAN-O 56.54100% 39.66100% 71.78100% 62.36100% 44.56100% 73.76100% 57.36100%

5.A.6 Implementation details For MNIST and USPS, the architecture is akin to LeNet (LeCun et al., 1998a), with two convolutional layers, ReLU and MaxPooling, followed by two fully connected layers. The representation is also taken as the last hidden layer, and has 500 neurons. The optimizer for those tasks is SGD with a learning rate of 0.02, annealed by 0.5 every five training epochs for M U and 6 for U M. The weight decay is also → → 5e 4 and the momentum 0.9. −For the Office and Visda experiments with IWDAN and IWCDAN, we train a ResNet-50, optimized using SGD with momentum. The weight decay is also 5e 4 and the momentum 0.9. The learning rate is − 3e 4 for the Office-31 tasks A D and D W, 1e 3 otherwise (default learning rates from the CDAN implementation− 9). → → − For the IWJAN experiments, we use the default implementation of Xlearn codebase10 and simply add the weigths estimation and reweighted objectives to it, as described in Section 5.A.4. Parameters, configuration and networks remain the same. Finally, for the Office experiments, we update the importance weights w every 15 passes on the dataset (in order to improve their estimation on small datasets). On Digits and Visda, the importance weights are updated every pass on the source dataset. Here too, fine-tuning that value might lead to a better estimation

9https://github.com/thuml/CDAN/tree/master/pytorch 10https://github.com/thuml/Xlearn/tree/master/pytorch

91 Table 5.A.7: Jensen-Shannon divergence between the label distributions of the Digits and Visda tasks.

(a) Full Dataset (b) Subsampled

MNISTUSPSREAL MNISTUSPSREAL MNIST 0 6.64e 3 - MNIST 0 6.52e 2 - USPS 6.64e 3 0− - USPS 2.75e 2 0− - − − SYNTH.-- 2.61e 2 SYNTH.-- 6.81e 2 − − Table 5.A.8: Jensen-Shannon divergence between the label distributions of the Office-31 tasks.

(a) Full Dataset

AMAZON DSLRWEBCAM AMAZON 0 1.76e 2 9.52e 3 DSLR 1.76e 2 0 − 2.11e−2 − − WEBCAM 9.52e 3 2.11e 2 0 − − (b) Subsampled

AMAZON DSLRWEBCAM AMAZON 0 6.25e 2 4.61e 2 DSLR 5.44e 2 0 − 5.67e−2 − − WEBCAM 5.15e 2 7.05e 2 0 − − of w and help bridge the gap with the oracle versions of the algorithms. We use the cvxopt package11 to solve the quadratic program 5.5.

5.A.7 Weight Estimation We estimate importance weights using Lemma 5.3.2, which relies on the GLS assumption. However, there is no guarantee that GLS is verified at any point during training, so the exact dynamics of w are unclear. Below we discuss those dynamics and provide some intuition about them. In Fig. 5.A.2b, we plot the Euclidean distance between the moving average of weights estimated using 1 the equation w = C− µ and the true weights (note that this can be done for any algorithm). As can be seen in the figure, the distance between the estimated and true weights is highly correlated with the performance of the algorithm (see Fig.5.A.2a). In particular, we see that the estimations for IWDAN is more accurate than for DANN. The estimation for DANN exhibits an interesting shape, improving at first, and then getting worse. At the same time, the estimation for IWDAN improves monotonously. The weights for IWDAN-O get very close to the true weights which is in line with our theoretical results: IWDAN-O gets close to zero error on the target error, Thm. 5.3.4 thus guarantees that GLS is verified, which in turns implies that the weight estimation is accurate (Lemma 5.3.2). Finally, without domain adaptation, the estimation is very poor. The following two lemmas shed some light on the phenomena observed for DANN and IWDAN: w˜ Lemma 5.A.1. If the source error ε (h g) is zero and the source and target marginals verify DJS( (Z), (Z)) = S ◦ DS DT 0, then the estimated weight vector w is equal to w˜ . 11http://cvxopt.org/

92 Table 5.A.9: Jensen-Shannon divergence between the label distributions of the Office-Home tasks.

(a) Full Dataset

ART CLIPART PRODUCT REAL WORLD ART 0 3.85e 2 4.49e 2 2.40e 2 − − − CLIPART 3.85e 2 0 2.33e 2 2.14e 2 − − − PRODUCT 4.49e 2 2.33e 2 0 1.61e 2 − − − REAL WORLD 2.40e 2 2.14e 2 1.61e 2 0 − − − (b) Subsampled

ART CLIPART PRODUCT REAL WORLD ART 0 8.41e 2 8.86e 2 6.69e 2 − − − CLIPART 7.07e 2 0 5.86e 2 5.68e 2 − − − PRODUCT 7.85e 2 6.24e 2 0 5.33e 2 − − − REAL WORLD 6.09e 2 6.52e 2 5.77e 2 0 − − −

Proof. If εS(h g) = 0, then the confusion matrix C is diagonal and its y-th line is S(Y = y). ◦ w˜ D Additionally, if DJS( S (Z), T(Z)) = 0, then from a straightforward extension of Eq. 5.12, we have w˜ D D DJS( (Yˆ ), (Yˆ )) = 0. In other words, the distribution of predictions on the source and target domains DS DT match, i.e. µ = (Yˆ = y) = w˜ (Yˆ = y, Y = y0) = w˜ (Y = y), y (where the last equality y DT ∑ y0 DS yDS ∀ y0 1 comes from ε (h g) = 0). Finally, we get that w = C− µ = w˜ . S ◦  In particular, applying this lemma to DANN (i.e. with 1) suggests that at convergence, the w˜ y0 = estimated weights should tend to 1. Empirically, Fig. 5.A.2b shows that as the marginals get matched, the estimation for DANN does get closer to 1 (1 corresponds to a distance of 2.16)12. We now attempt to provide some intuition on the behavior of IWDAN, with the following lemma: Lemma 5.A.2. If ε (h g) = 0 and if for a given y: S ◦ min(w˜ (Y = y), (Y = y)) µ max(w˜ (Y = y), (Y = y)), (5.26) yDS DT ≤ y ≤ yDS DT 1 then, letting w = C− µ be the estimated weight:

w w∗ w˜ w∗ . | y − y| ≤ | y − y|

Applying this lemma to w˜y = wt, and assuming that (5.26) holds for all the classes y (we discuss what the assumption implies below), we get that:

w w∗ w w∗ , (5.27) k t+1 − yk ≤ k t − yk or in other words, the estimation improves monotonously. Combining this with Lemma 5.A.2 suggests an explanation for the shape of the IWDAN estimated weights on Fig. 5.A.2b: the monotonous improvement of the estimation is counter-balanced by the matching of weighted marginals which, when reached, makes wt constant (Lemma 5.A.1 applied to w˜ = wt). However, we wish to emphasize that the exact dynamics

12It does not reach it as the learning rate is decayed to 0.

93 96 95

94

90 92

90

85 88

IWDAN 86 IWCDAN 80 IWDAN fit IWCDAN fit

performance in % IWDAN-O performance in % 84 IWCDAN-O IWDAN-O fit IWCDAN-O fit 75 DANN 82 CDAN DANN fit CDAN fit 80 0.02 0.04 0.06 0.08 0.02 0.04 0.06 0.08 jsd jsd (a) Performance of DANN, IWDAN and IWDAN-O. (b) Performance of CDAN, CDAN and IWCDAN.

Figure 5.A.1: Performance in % of our algorithms and their base versions. The x-axis represents Y Y DJS( S , T ), the Jensen-Shannon distance between label distributions. Lines represent linear fits to the data.D ForD both sets of algorithms, the larger the jsd, the larger the improvement. of w are complex, and we do not claim understand them fully. In all likelihood, they are the by-product of regularity in the data, properties of deep neural networks and their interaction with stochastic gradient descent. Additionally, the dynamics are also inherently linked to the success of domain adaptation, which to this day remains an open problem. As a matter of fact assumption (5.26) itself relates to successful domain adaptation. Setting aside w˜ , which simply corresponds to a class reweighting of the source domain, (5.26) states that predictions on the target domain fall between a successful prediction (corresponding to (Y = y)) and the prediction of a DT model with matched marginals (corresponding to S(Y = y)). In other words, we assume that the model is naturally in between successful domain adaptationD and successful marginal matching. Empirically, we observed that it holds true for most classes (with w˜ = w˜ t for IWDAN and with w˜ = 1 for DANN), but not all early in training13. To conclude this section, we prove Lemma 5.A.2.

Proof. From εS(h g) = 0, we know that C is diagonal and that its y-th line is S(Y = y). This gives us: C 1 ◦ µy . Hence: D wy = ( − µ)y = (Y=y) DS

min(w˜ (Y = y), (Y = y)) µ max(w˜ (Y = y), (Y = y)) yDS DT ≤ y ≤ yDS DT min(w˜ (Y = y), (Y = y)) µ max(w˜ (Y = y), (Y = y)) yDS DT y yDS DT ⇐⇒ (Y = y) ≤ (Y = y) ≤ (Y = y) DS DS DS min(w˜ , w∗) w max(w˜ , w∗) ⇐⇒ y y ≤ y ≤ y y min(w˜ , w∗) w∗ w w∗ max(w˜ , w∗) w∗ ⇐⇒ y y − y ≤ y − y ≤ y y − y w w∗ w˜ w∗ , ⇐⇒ | y − y| ≤ | y − y| which concludes the proof. 

13In particular at initialization, one class usually dominates the others.

94 100 DANN 2.5 IWDAN IWDAN-O 80 No DA 2.0

60 1.5

40 1.0 DANN distance to true weights

accuracy on target domain IWDAN 20 0.5 IWDAN-O No DA

0 10 20 30 40 0 10 20 30 40 training epoch training epoch (a) Transfer accuracy during training. (b) Distance to true weights during training.

Figure 5.A.2: Left Accuracy of various algorithms during training. Right Euclidean distance between the weights estimated using Lemma 5.3.2 and the true weights. Those plots correspond to averages over 5 seeds.

95

Chapter 6

Learning Fair Representations

In this chapter, through the lens of information theory, we provide the first result that quantitatively characterizes the tradeoff between demographic parity and the joint utility across different population groups. Specifically, when the base rates differ between groups, we show that any method aiming to learn fair representations admits an information-theoretic lower bound on the joint error across these groups. To complement our negative results, we also prove that if the optimal decision functions across different groups are close, then learning fair representations leads to an alternative notion of fairness, known as the accuracy parity, which states that the error rates are close between groups. Finally, our theoretical findings are also confirmed empirically on real-world datasets.

97 6.1 Introduction

With the prevalence of machine learning applications in high-stakes domains, e.g., criminal judgement, medical testing, online advertising, etc., it is crucial to ensure that the automated decision making systems do not propagate existing bias or discrimination that might exist in historical data (Barocas and Selbst, 2016; Berk et al., 2018; of the President, 2016). Among many recent proposals for achieving different notions of algorithmic fairness (Dwork et al., 2012; Hardt et al., 2016; Zafar et al., 2015, 2017; Zemel et al., 2013), learning fair representations has received increasing attention due to recent advances in learning rich representations with deep neural networks (Beutel et al., 2017; Edwards and Storkey, 2015; Louizos et al., 2015; Madras et al., 2018; Song et al., 2019; Zhang et al., 2018). In fact, a line of work has proposed to learn group-invariant representations with adversarial learning techniques in order to achieve statistical parity, also known as the demographic parity in the literature. This line of work dates at least back to Zemel et al. (2013) where the authors proposed to learn predictive models that are independent of the group membership attribute. At a high level, the underlying idea is that if representations of instances from different groups are similar to each other, then any predictive model on top of them will certainly make decisions independent of group membership. On the other hand, it has long been observed that there is an underlying tradeoff between utility and demographic parity: “All methods have in common that to some extent accuracy must be traded-off for lowering the dependency.” (Calders et al., 2009) In particular, it is easy to see that in an extreme case where the group membership coincides with the target task, a call for exact demographic parity will inevitably remove the perfect predictor (Hardt et al., 2016). Empirically, it has also been observed that a tradeoff exists between accuracy and fairness in binary classification (Zliobaite, 2015). Clearly, methods based on learning fair representations are also bound by such inherent tradeoff between utility and fairness. But how does the fairness constraint trade for utility? Will learning fair representations help to achieve other notions of fairness besides the demographic parity? If yes, what is the fundamental limit of utility that we can hope to achieve under such constraint? To answer the above questions, through the lens of information theory, in this chapter we provide the first result that quantitatively characterizes the tradeoff between demographic parity and the joint utility across different population groups. Specifically, when the base rates differ between groups, we provide a tight information-theoretic lower bound on the joint error across these groups. Our lower bound is algorithm-independent so it holds for all methods aiming to learn fair representations. When only approximate demographic parity is achieved, we also present a family of lower bounds to quantify the tradeoff of utility introduced by such approximate constraint. As a side contribution, our proof technique is simple but general, and we expect it to have broader applications in other learning problems using adversarial techniques, e.g., unsupervised domain adaptation (Ganin et al., 2016; Zhao et al., 2019e), privacy-preservation under attribute inference attacks (Hamm, 2017; Zhao et al., 2019a) and multilingual machine translation (Johnson et al., 2017). 
To complement our negative results, we show that if the optimal decision functions across different groups are close, then learning fair representations helps to achieve an alternative notion of fairness, i.e., the accuracy parity, which states that the error rates are close between groups. Empirically, we conduct experiments on a real-world dataset that corroborate both our positive and negative results. We believe our theoretical insights contribute to better understanding of the tradeoff between utility and different notions of fairness, and they are also helpful in guiding the future design of representation learning algorithms to achieve algorithmic fairness.

98 6.2 Preliminaries

We first introduce the notation used throughout the chapter and formally describe the problem setup. We then briefly discuss some information-theoretic concepts that will be used in our analysis.

Notation We use Rd and = 0, 1 to denote the input and output space. Accordingly, we use X X ⊆ Y { } and Y to denote the random variables which take values in and , respectively. Lower case letters x and X Y y are used to denote the instantiation of X and Y. To simplify the presentation, we use A 0, 1 as the sensitive attribute, e.g., race, gender, etc. 1 Let be the hypothesis class of classifiers. In∈ {other} words, H for h , h : is the predictor that outputs a prediction. Note that even if the predictor does not ∈ H X → Y explicitly take the sensitive attribute A as input, this fairness through blindness mechanism can still be biased due to the potential correlations between X and A. In this chapter we study the stochastic setting where there is a joint distribution over X, Y and A from which the data are sampled. To keep the notation D consistent, for a 0, 1 , we use to mean the conditional distribution of given A = a. For an event ∈ { } Da D E, (E) denotes the probability of E under . In particular, in the literature of fair machine learning, we D D call (Y = 1) the base rate of distribution and we use ∆BR( , 0) := (Y = 1) 0(Y = 1) to denoteD the difference of the base rates betweenD two distributionsD Dand |Dover the same− D sample space.| D D0 Given a feature transformation function g : that maps instances from the input space to feature X → Z X space , we define g := g 1 to be the induced (pushforward) distribution of under g, i.e., for Z ]D D ◦ − D any event E , g (E ) := (g 1(E )) = ( x g(x) E ). 0 ⊆ Z ]D 0 D − 0 D { ∈ X | ∈ 0} Problem Setup Given a joint distribution , the error of a predictor h under is defined as ε (h) := D D D E [ Y h(X) ]. Note that for binary classification problems, when h(X) 0, 1 , ε (h) reduces toD the| true− error| rate of binary classification. To make the notation more compact,∈ { } we mayD drop the subscript when it is clear from the context. In this chapter we focus on group fairness where the group D membership is given by the sensitive attribute A. Even in this context there are many possible definitions of fairness (Narayanan, 2018), and in what follows we provide a brief review of the ones that are mostly relevant to this chapter. Definition 6.2.1 (Demographic Parity). Given a joint distribution , a classifier Yb satisfies demographic D parity if Yb is independent of A. Demographic parity reduces to the requirement that 0(Yb = 1) = 1(Yb = 1), i.e., positive outcome is given to the two groups at the same rate. When exactD equality doesD not hold, we use the absolute difference between them as an approximate measure: Definition 6.2.2 (DP Gap). Given a joint distribution , the demographic parity gap of a classifier Yb is D ∆ (Yb) := 0(Yb = 1) (Yb = 1) . DP |D − D1 | Demographic parity is also known as statistical parity, and it has been adopted as definition of fairness in a series of work (Calders et al., 2009; Edwards and Storkey, 2015; Johndrow et al., 2019; Kamiran and Calders, 2009; Kamishima et al., 2011; Louizos et al., 2015; Madras et al., 2018; Zemel et al., 2013). However, as we shall quantify precisely in Section 6.3, demographic parity may cripple the utility that we hope to achieve, especially in the common scenario where the base rates differ between two groups, e.g., (Y = 1) = (Y = 1) (Hardt et al., 2016). In light of this, an alternative definition is accuracy parity: D0 6 D1 Definition 6.2.3 (Accuracy Parity). Given a joint distribution , a classifier h satisfies accuracy parity if D ε (h) = ε (h). 
D0 D1 In the literature, a break of accuracy parity is also known as disparate mistreatment (Zafar et al., 2017). Again, when h is a deterministic binary classifier, accuracy parity reduces to (h(X) = Y) = D0 1Our main results could also be straightforwardly extended to the setting where A is a categorical variable.

99 (h(X) = Y). Different from demographic parity, the definition of accuracy parity does not eliminate D1 the perfect predictor when Y = A when the base rates differ between two groups. When costs of different error types matter, more refined definitions exist: Definition 6.2.4 (Positive Rate Parity). Given a joint distribution , a deterministic classifier h satisfies D positive rate parity if (h(X) = 1 Y = y) = (h(X) = 1 Y = y), y 0, 1 . D0 | D1 | ∀ ∈ { } Positive rate parity is also known as equalized odds (Hardt et al., 2016), which essentially requires equal true positive and false positive rates between different groups. Furthermore, Hardt et al. (2016) also defined true positive parity, or equal opportunity, to be (h(X) = 1 Y = 1) = (h(X) = 1 D0 | D1 | Y = 1) when positive outcome is desirable. Last but not least, predictive rate parity, also known as test fairness (Chouldechova, 2017), asks for equal chance of positive outcomes across groups given predictions:

Definition 6.2.5 (Predictive Rate Parity). Given a joint distribution , a probabilistic classifier h satisfies D predictive rate parity if (Y = 1 h(X) = c) = (Y = 1 h(X) = c), c [0, 1]. D0 | D1 | ∀ ∈ When h is a deterministic binary classifier that only takes value in 0, 1 , Chouldechova (2017) showed an intrinsic tradeoff between predictive rate parity and positive rate parity:{ } Theorem 6.2.1 (Chouldechova (2017)). Assume (Y = 1) = (Y = 1), then for any deterministic D0 6 D1 classifier h : 0, 1 that is not perfect, i.e., h(X) = Y, positive rate parity and predictive rate parity cannot hold simultaneously.X → { } 6 Similar tradeoff result for probabilistic classifier has also been observed by Kleinberg et al. (2016), where the authors showed that for any non-perfect predictors, calibration and positive rate parity cannot be achieved simultaneously if the base rates are different across groups. Here a classifier h is said to be calibrated if (Y = 1 h(X) = c) = c, c [0, 1], i.e., if we look at the set of data that receive a D | ∀ ∈ predicted probability of c by h, we would like c-fraction of them to be positive instances according to Y (Pleiss et al., 2017). f -divergence Introduced by Ali and Silvey (1966) and Csiszár (1964, 1967), f -divergence, also known as the Ali-Silvey distance, is a general class of statistical divergences to measure the difference between two probability distributions and over the same measurable space. P Q Definition 6.2.6 ( f -divergence). Let and be two probability distributions over the same space and P Q assume is absolutely continuous w.r.t. . Then for any convex function f : (0, ∞) R that is strictly P Q → convex at 1 and f (1) = 0, the f -divergence of from is defined as Q P   d  D f ( ) := E f P . (6.1) P k Q Q d Q The function f is called the generator function of D ( ). f · k · Different choices of the generator function f recover popular statistical divergence as special cases, e.g., the KL-divergence. From Jensen’s inequality it is easy to verify that D ( ) 0 and D ( ) = 0 f P k Q ≥ f P k Q iff = almost surely. Note that f -divergence does not necessarily leads to a distance metric, and it is P Q not symmetric in general, i.e., D ( ) = D ( ) provided that . We list some common f P k Q 6 f Q k P P ∼ Q choices of the generator function f and their corresponding properties in Table 6.2.1. Notably, Khosravifard et al. (2007) proved that total variation is the only f -divergence that serves as a metric, i.e., satisfying the triangle inequality.

100 Table 6.2.1: List of different f -divergences and their corresponding properties. DKL( ) denotes the P || Q KL-divergence of from and := ( + )/2 is the average distribution of and . Symm. stands for Symmetric andQ Tri. standsP forM TriangleP Inequality.Q P Q

Name D ( ) Generator f (t) Symm. Tri. f P k Q Kullback-Leibler DKL( ) t log t   P k Q Reverse-KL DKL( ) log t   Q k P 1 − t+1 Jensen-Shannon DJS( , ) := (DKL( ) + DKL( )) t log t (t + 1) log( )   P Q 2R PkM QkM − 2 Squared Hellinger H2( , ) := 1 (√d √d )2 (1 √t)2/2   P Q 2 P − Q − Total Variation dTV( , ) := sup (E) (E) t 1 /2   P Q E |P − Q | | − |

6.3 Main Results

As we briefly mentioned in Section 6.2, it is impossible to have imperfect predictor that is both calibrated and preserves positive rate parity when the base rates differ between two groups. Similar impossibility result also holds between positive rate parity and predictive rate parity. On the other hand, while it has long been observed that demographic parity may eliminate perfect predictor (Hardt et al., 2016), and previous work has empirically verified that tradeoff exists between accuracy and demographic parity (Calders et al., 2009; Kamiran and Calders, 2009; Zliobaite, 2015) on various datasets, so far a quantitative characterization on the exact tradeoff between accuracy and various notions of parity is still missing. In what follows we shall prove a family of information theoretic lower bounds on the accuracy that hold for all algorithms.

6.3.1 Tradeoff between Fairness and Utility

g h Essentially, every prediction function induces a Markov chain: X Z Yb, where g is the feature −→ −→ transformation, h is the classifier on feature space, Z is the feature and Yb is the predicted target variable by h g. Note that simple models, e.g., linear classifiers, are also included by specifying g to be the identity map.◦ With this notation, we first state the following theorem that quantifies an inherent tradeoff between fairness and utility. Theorem 6.3.1. Let Yb = h(g(X)) be the predictor. If Yb satisfies demographic parity, then ε (h g) + D0 ◦ ε (h g) ∆BR( 0, 1). D1 ◦ ≥ D D Remark First of all, ∆ ( , ) is the difference of base rates across groups, and it achieves its BR D0 D1 maximum value of 1 iff Y = A almost surely, i.e., Y indicates group membership. On the other hand, if Y is independent of A, then ∆ ( , ) = 0 so the lower bound does not make any constraint on the BR D0 D1 joint error. Second, Theorem 6.3.1 applies to all possible feature transformation g and predictor h. In particular, if we choose g to be the identity map, then Theorem 6.3.1 says that when the base rates differ, no algorithm can achieve a small joint error on both groups, and it also recovers the previous observation that demographic parity can eliminate the perfect predictor (Hardt et al., 2016). Third, the lower bound in Theorem 6.3.1 is insensitive to the marginal distribution of A, i.e., it treats the errors from both groups equally. As a comparison, let α := (A = 1), then ε (h g) = (1 α)ε (h g) + αε (h g). In D D ◦ − D0 ◦ D1 ◦ this case ε (h g) could still be small even if the minority group suffers a large error. Furthermore,D ◦ by the pigeonhole principle, the following corollary holds: Corollary 6.3.1. If the predictor Yb = h(g(X)) satisfies demographic parity, then max ε (h g), ε (h { D0 ◦ D1 ◦ g) ∆ ( , )/2. } ≥ BR D0 D1

101 In words, this means that for fair predictors in the demographic parity sense, at least one of the subgroups has to incur an error of at least ∆ ( , )/2 which could be large in settings like criminal BR D0 D1 justice where ∆BR( 0, 1) is large. Before we giveD the proof,D we first present a useful lemma that lower bounds the prediction error by the total variation distance.

Lemma 6.3.1. Let Yb = h(g(X)) be the predictor, then for a 0, 1 , dTV( a(Y), a(Yb)) ε (h g). ∈ { } D D ≤ Da ◦ Proof. For a 0, 1 , we have: ∈ { }

dTV( a(Y), a(Yb)) = a(Y = 1) a(h(g(X)) = 1) = E [Y] E [h(g(X))] D D |D − D | | Da − Da | E a [ Y h(g(X)) ] = ε a (h g).  ≤ D | − | D ◦ Now we are ready to prove Theorem 6.3.1:

Proof of Theorem 6.3.1. First of all, we show that if Yb = h(g(X)) satisfies demographic parity, then:  dTV( 0(Yb), (Yb)) = max 0(Yb = 0) (Yb = 0) , 0(Yb = 1) (Yb = 1) D D1 |D − D1 | |D − D1 | = 0(Yb = 1) (Yb = 1) |D − D1 | = (Yb = 1 A = 0) (Yb = 1 A = 1) = 0, |D | − D | | where the last equality follows from the definition of demographic parity. Now from Table 6.2.1, dTV( , ) is symmetric and satisfies the triangle inequality, we have: · ·

dTV( 0(Y), (Y)) dTV( 0(Y), 0(Yb)) + dTV( 0(Yb), (Yb)) + dTV( (Yb), (Y)) D D1 ≤ D D D D1 D1 D1 = dTV( 0(Y), 0(Yb)) + dTV( (Yb), (Y)). (6.2) D D D1 D1

The last step is to bound dTV( a(Y), a(Yb)) in terms of ε (h g) for a 0, 1 using Lemma 6.3.1: D D Da ◦ ∈ { }

dTV( 0(Y), 0(Yb)) ε (h g), dTV( 1(Y), 1(Yb)) ε (h g). D D ≤ D0 ◦ D D ≤ D1 ◦ Combining the above two inequalities and (6.2) completes the proof.  It is not hard to show that our lower bound in Theorem 8.2.1 is tight. To see this, consider the case A = Y, where the lower bound achieves its maximum value of 1. Now consider a constant predictor Yb 1 or Yb 0, which clearly satisfies demographic parity by definition. But in this case either ≡ ≡ ε 0 (h g) = 1, ε 1 (h g) = 0 or ε 0 (h g) = 0, ε 1 (h g) = 1, hence ε 0 (h g) + ε 1 (h g) 1, achievingD ◦ the lowerD bound.◦ D ◦ D ◦ D ◦ D ◦ ≡ To conclude this section, we point out that the choice of total variation in the lower bound is not unique. As we will see shortly in Section 6.3.2, similar lower bounds could be attained using specific choices of the general f -divergence with some desired properties.

6.3.2 Tradeoff in Fair Representation Learning In the last section we show that there is an inherent tradeoff between fairness and utility when a predictor exactly satisfies demographic parity. In practice we may not be able to achieve demographic parity exactly. Instead, most algorithms (Adel et al., 2019a; Beutel et al., 2017; Edwards and Storkey, 2015; Louizos et al., 2015) build an adversarial discriminator that takes as input the feature vector Z = g(X), and the goal is to learn fair representations such that it is hard for the adversarial discriminator to infer the group

102 membership from Z. In this sense due to the limit on the capacity of the adversarial discriminator, only approximate demographic parity can be achieved in the equilibrium. Hence it is natural to ask what is the tradeoff between fair representations and accuracy in this scenario? In this section we shall answer this question by generalizing our previous analysis with f -divergence to prove a family of lower bounds on the joint target prediction error. Our results also show how approximate DP helps to reconcile but not remove the tradeoff between fairness and utility. Before we state and prove the main results in this section, we first introduce the following lemma by Liese and Vajda (2006) as a generalization of the data processing inequality for f -divergence: Lemma 6.3.2 (Liese and Vajda (2006)). Let µ( ) be the space of all probability distributions over . Z Z Then for any f -divergence D ( ), any stochastic kernel κ : µ( ), and any distributions and f · || · X → Z P over , D (κ κ ) D ( ). Q X f P || Q ≤ f P || Q Roughly speaking, Lemma 6.3.2 says that data processing cannot increase discriminating information. p p 2 Define dJS( , ) := DJS( , ) and H( , ) := H ( , ). It is well-known in information the- P Q P Q P Q P Q ory that both dJS( , ) and H( , ) form a bounded distance metric over the space of probability distributions. · · 2 · · Realize that dTV( , ), H ( , ) and DJS( , ) are all f -divergence. The following corollary holds: · · · · · · Corollary 6.3.2. Let h : be any hypothesis, and g be the pushforward distribution of by g, Z → Y ]Da Da a 0, 1 . Let Yb = h(g(X)) be the predictor, then all the following inequalities hold: ∀ ∈ { } 1. dTV( 0(Yb), (Yb)) dTV(g 0, g ) D D1 ≤ ]D ]D1 2. H( 0(Yb), (Yb)) H(g 0, g ) D D1 ≤ ]D ]D1 3. dJS( 0(Yb), (Yb)) dJS(g 0, g ) D D1 ≤ ]D ]D1 Now we are ready to present the following main theorem of this section:

Theorem 6.3.2. Let Yb = h(g(X)) be the predictor. Assume dJS(g 0, g ) dJS( 0(Y), (Y)) and ]D ]D1 ≤ D D1 H(g , g ) H( (Y), (Y)), then the following three inequalities hold: ]D0 ]D1 ≤ D0 D1 1. Total variation lower bound:

ε (h g) + ε (h g) dTV( 0(Y), 1(Y)) dTV(g] 0, g] 1). D0 ◦ D1 ◦ ≥ D D − D D 2. Jensen-Shannon lower bound: 2 ε (h g) + ε (h g) dJS( 0(Y), 1(Y)) dJS(g] 0, g] 1) /2. D0 ◦ D1 ◦ ≥ D D − D D 3. Hellinger lower bound:

2 ε (h g) + ε (h g) H( 0(Y), 1(Y)) H(g] 0, g] 1) /2. D0 ◦ D1 ◦ ≥ D D − D D Remark All the three lower bounds in Theorem 6.3.2 imply a tradeoff between the joint error across demographic subgroups and learning group-invariant feature representations. In a nutshell: For fair representations, it is not possible to construct a predictor that simultaneously minimizes the errors on both demographic subgroups.

When g] 0 = g] 1, which also implies 0(Yb) = 1(Yb), all three lower bounds get larger, in this case we have D D D D   1 2 1 2 max dTV( (Y), (Y)), d ( (Y), (Y)), H ( (Y), (Y)) = dTV( (Y), (Y)) D0 D1 2 JS D0 D1 2 D0 D1 D0 D1 = ∆ ( , ), BR D0 D1 and this reduces to Theorem 6.3.1. Now we give a sketch of the proof for Theorem 6.3.2:

103 Proof Sketch of Theorem 6.3.2. We prove the three inequalities respectively. The total variation lower bound follows the same idea as the proof of Theorem 8.2.1 and the inequality dTV( 0(Yb), (Yb)) D D1 ≤ dTV(g , g ) from Corollary 6.3.2. To prove the Jensen-Shannon lower bound, realize that dJS( , ) ]D0 ]D1 · · is a distance metric over probability distributions. Combining with the inequality dJS( 0(Yb), (Yb)) D D1 ≤ dJS(g , g ) from Corollary 6.3.2, we have: ]D0 ]D1

dJS( 0(Y), (Y)) dJS( 0(Y), 0(Yb)) + dJS(g 0, g ) + dJS( (Yb), (Y)). D D1 ≤ D D ]D ]D1 D1 D1 Now by Lin’s lemma (Lin, 1991, Theorem 3), for any two distributions and , we have d2 ( , ) P Q JS P Q ≤ dTV( , ). Combine Lin’s lemma with Lemma 6.3.1, we get the following lower bound: P Q q q ε (h g) + ε (h g) dJS( 0(Y), 1(Y)) dJS(g] 0, g] 1). D0 ◦ D1 ◦ ≥ D D − D D Apply the AM-GM inequality, we can further bound the L.H.S. by

\[ \sqrt{2\left( \varepsilon_{\mathcal{D}_0}(h \circ g) + \varepsilon_{\mathcal{D}_1}(h \circ g) \right)} \geq \sqrt{\varepsilon_{\mathcal{D}_0}(h \circ g)} + \sqrt{\varepsilon_{\mathcal{D}_1}(h \circ g)}. \]

Under the assumption that $d_{\mathrm{JS}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1) \leq d_{\mathrm{JS}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y))$, squaring both sides then completes the proof of the second inequality. The proof of the Hellinger lower bound follows exactly as the one for the Jensen-Shannon lower bound, except that instead of Lin's lemma we use the fact that $H^2(\mathcal{P}, \mathcal{Q}) \leq d_{\mathrm{TV}}(\mathcal{P}, \mathcal{Q}) \leq \sqrt{2}\,H(\mathcal{P}, \mathcal{Q})$, $\forall \mathcal{P}, \mathcal{Q}$. □

As a simple corollary of Theorem 6.3.2, the following result shows how approximate DP (in terms of the DP gap) helps to reconcile the tradeoff between fairness and utility:

Corollary 6.3.3. Let $\hat{Y} = h(g(X))$ be the predictor; then $\varepsilon_{\mathcal{D}_0}(h \circ g) + \varepsilon_{\mathcal{D}_1}(h \circ g) \geq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) - \Delta_{\mathrm{DP}}(\hat{Y})$.

In a sense, Corollary 6.3.3 means that in order to achieve a small joint error, the DP gap of the predictor cannot be too small. Of course, since the above inequality is a lower bound, it only serves as a necessary condition for small joint error. Hence an interesting question is whether there exists a sufficient condition that guarantees a small joint error while keeping the DP gap of the predictor no larger than that of the perfect predictor, i.e., $\Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)$. We leave this as future work.

6.3.3 Fair Representations Lead to Accuracy Parity

In the previous sections we proved a family of information-theoretic lower bounds that demonstrate an inherent tradeoff between fair representations and the joint error across groups. A natural question to ask, then, is what kind of parity fair representations can bring us. To complement our negative results, in this section we show that learning group-invariant representations helps to reduce the discrepancy of errors (utilities) across groups. First of all, since we work in the stochastic setting where $\mathcal{D}_a$ is a joint distribution over $X$ and $Y$ conditioned on $A = a$, any function mapping $h: \mathcal{X} \to \mathcal{Y}$ will inevitably incur an error due to the noise in the distribution $\mathcal{D}_a$. Formally, for $a \in \{0, 1\}$, define the optimal function $h_a^*: \mathcal{X} \to \mathcal{Y}$ under the absolute error to be $h_a^*(X) := m_{\mathcal{D}_a}(Y \mid X)$, where $m_{\mathcal{D}_a}(Y \mid X)$ denotes the median of $Y$ given $X$ under distribution $\mathcal{D}_a$. Now define the noise of distribution $\mathcal{D}_a$ to be $n_{\mathcal{D}_a} := \mathbb{E}_{\mathcal{D}_a}[\,|Y - h_a^*(X)|\,]$. With these notations, we are now ready to present the following theorem:

Theorem 6.3.3 (Error Decomposition Theorem). For any hypothesis $h \in \mathcal{H}$, $h: \mathcal{X} \to \mathcal{Y}$, the following inequality holds:

\[ |\varepsilon_{\mathcal{D}_0}(h) - \varepsilon_{\mathcal{D}_1}(h)| \leq |n_{\mathcal{D}_0} - n_{\mathcal{D}_1}| + d_{\mathrm{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)) + \min\left\{ \mathbb{E}_{\mathcal{D}_0}[\,|h_0^* - h_1^*|\,],\ \mathbb{E}_{\mathcal{D}_1}[\,|h_0^* - h_1^*|\,] \right\}. \]

Remark Theorem 6.3.3 upper bounds the discrepancy of accuracy across groups by three terms: the noise difference, the distance of representations across groups and the discrepancy of optimal decision functions. In an ideal setting where both distributions are noiseless, i.e., same people in the same group are always treated equally, the upper bound simplifies to the latter two terms:

\[ |\varepsilon_{\mathcal{D}_0}(h) - \varepsilon_{\mathcal{D}_1}(h)| \leq d_{\mathrm{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)) + \min\left\{ \mathbb{E}_{\mathcal{D}_0}[\,|h_0^* - h_1^*|\,],\ \mathbb{E}_{\mathcal{D}_1}[\,|h_0^* - h_1^*|\,] \right\}. \]

If we further require that the optimal decision functions $h_0^*$ and $h_1^*$ are close to each other, i.e., optimal decisions are insensitive to the group membership, then Theorem 6.3.3 implies that a sufficient condition to guarantee accuracy parity is to find a group-invariant representation that minimizes $d_{\mathrm{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X))$. We now present the proof of Theorem 6.3.3:

Proof of Theorem 6.3.3. First, we show that for $a \in \{0, 1\}$, $\varepsilon_{\mathcal{D}_a}(h)$ cannot be too large if $h$ is close to $h_a^*$:

\begin{align*}
\left| \varepsilon_{\mathcal{D}_a}(h) - \mathbb{E}_{\mathcal{D}_a}[\,|h(X) - h_a^*(X)|\,] \right| &= \left| \mathbb{E}_{\mathcal{D}_a}[\,|h(X) - Y|\,] - \mathbb{E}_{\mathcal{D}_a}[\,|h(X) - h_a^*(X)|\,] \right| \\
&\leq \mathbb{E}_{\mathcal{D}_a}\left[\,\big|\,|h(X) - Y| - |h(X) - h_a^*(X)|\,\big|\,\right] \\
&\leq \mathbb{E}_{\mathcal{D}_a}[\,|Y - h_a^*(X)|\,] = n_{\mathcal{D}_a},
\end{align*}
where both inequalities are due to the triangle inequality. Next, we bound $|\varepsilon_{\mathcal{D}_0}(h) - \varepsilon_{\mathcal{D}_1}(h)|$ by:

\[ |\varepsilon_{\mathcal{D}_0}(h) - \varepsilon_{\mathcal{D}_1}(h)| \leq |n_{\mathcal{D}_0} - n_{\mathcal{D}_1}| + \left| \mathbb{E}_{\mathcal{D}_0}[\,|h(X) - h_0^*(X)|\,] - \mathbb{E}_{\mathcal{D}_1}[\,|h(X) - h_1^*(X)|\,] \right|. \]

In order to show this, define $\varepsilon_a(h, h') := \mathbb{E}_{\mathcal{D}_a}[\,|h(X) - h'(X)|\,]$, so that

\[ \left| \mathbb{E}_{\mathcal{D}_0}[\,|h(X) - h_0^*(X)|\,] - \mathbb{E}_{\mathcal{D}_1}[\,|h(X) - h_1^*(X)|\,] \right| = |\varepsilon_0(h, h_0^*) - \varepsilon_1(h, h_1^*)|. \]

To bound $|\varepsilon_0(h, h_0^*) - \varepsilon_1(h, h_1^*)|$, realize that $|h(X) - h_a^*(X)| \in \{0, 1\}$. On one hand, we have:

\begin{align*}
\varepsilon_0(h, h_0^*) - \varepsilon_1(h, h_1^*) &= \varepsilon_0(h, h_0^*) - \varepsilon_0(h, h_1^*) + \varepsilon_0(h, h_1^*) - \varepsilon_1(h, h_1^*) \\
&\leq |\varepsilon_0(h, h_0^*) - \varepsilon_0(h, h_1^*)| + |\varepsilon_0(h, h_1^*) - \varepsilon_1(h, h_1^*)| \\
&\leq \varepsilon_0(h_0^*, h_1^*) + d_{\mathrm{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)),
\end{align*}
where the last inequality is due to $\varepsilon_0(h, h_1^*) - \varepsilon_1(h, h_1^*) = \mathcal{D}_0(|h - h_1^*| = 1) - \mathcal{D}_1(|h - h_1^*| = 1) \leq \sup_E |\mathcal{D}_0(E) - \mathcal{D}_1(E)| = d_{\mathrm{TV}}(\mathcal{D}_0, \mathcal{D}_1)$. Similarly, by subtracting and adding back $\varepsilon_1(h, h_0^*)$ instead, we can also show that $\varepsilon_0(h, h_0^*) - \varepsilon_1(h, h_1^*) \leq \varepsilon_1(h_0^*, h_1^*) + d_{\mathrm{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X))$. Combining the above two inequalities yields:

\[ \varepsilon_0(h, h_0^*) - \varepsilon_1(h, h_1^*) \leq \min\{ \varepsilon_0(h_0^*, h_1^*),\ \varepsilon_1(h_0^*, h_1^*) \} + d_{\mathrm{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)). \]
Incorporating the difference of noise back into the above inequality then completes the proof. □
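To make the decomposition concrete, the following minimal sketch (ours) numerically checks Theorem 6.3.3 on a toy discrete problem with binary X and Y; the two joint distributions and the fixed classifier below are illustrative choices, not data from the thesis.

```python
import numpy as np

# Toy joint distributions D_a over (X, Y), with joint[a][x, y] = D_a(X=x, Y=y).
joint = {
    0: np.array([[0.35, 0.15],   # group 0
                 [0.10, 0.40]]),
    1: np.array([[0.20, 0.10],   # group 1
                 [0.25, 0.45]]),
}

def p_x(a):            # marginal D_a(X)
    return joint[a].sum(axis=1)

def p_y_given_x(a):    # conditional D_a(Y=1 | X)
    return joint[a][:, 1] / p_x(a)

def h_star(a):         # optimal decision = conditional median (majority label)
    return (p_y_given_x(a) >= 0.5).astype(float)

def noise(a):          # n_a = E_{D_a} |Y - h_a*(X)|
    p1 = p_y_given_x(a)
    return float(p_x(a) @ np.minimum(p1, 1 - p1))

def err(a, h):         # epsilon_a(h) = E_{D_a} |Y - h(X)|
    p1 = p_y_given_x(a)
    return float(p_x(a) @ np.where(h == 1, 1 - p1, p1))

h = np.array([0.0, 1.0])           # an arbitrary fixed classifier on X
gap = abs(err(0, h) - err(1, h))   # LHS of Theorem 6.3.3

d_tv_x = 0.5 * np.abs(p_x(0) - p_x(1)).sum()
disagree = np.abs(h_star(0) - h_star(1))
bound = (abs(noise(0) - noise(1)) + d_tv_x
         + min(float(p_x(0) @ disagree), float(p_x(1) @ disagree)))
assert gap <= bound + 1e-12
print(f"|eps_0 - eps_1| = {gap:.4f} <= bound = {bound:.4f}")
```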

6.4 Empirical Validation

Our theoretical results on the lower bound imply that over-training the feature transformation function to achieve group-invariant representations will inevitably lead to large joint errors. On the other hand, our upper bound also implies that group-invariant representations help to achieve accuracy parity. To verify these theoretical implications, in this section we conduct experiments on a real-world benchmark dataset, the UCI Adult dataset, to present empirical results with various metrics.

Dataset The Adult dataset contains 30,162/15,060 training/test instances for income prediction. Each instance in the dataset describes an adult from the 1994 US Census. Attributes include gender, education level, age, etc. In this experiment we use gender (binary) as the sensitive attribute, and we preprocess the dataset to convert categorical variables into one-hot representations. The processed data contains 114 attributes. The target variable (income) is also binary: 1 if income $\geq$ 50K/year, otherwise 0. For the sensitive attribute $A$, $A = 0$ means Male, otherwise Female. In this dataset, the base rates across groups are different: $\Pr(Y = 1 \mid A = 0) = 0.310$ while $\Pr(Y = 1 \mid A = 1) = 0.113$. Also, the group ratios are different: $\Pr(A = 0) = 0.673$.

Experimental Protocol To validate the effect of learning group-invariant representations with adversarial debiasing techniques (Beutel et al., 2017; Madras et al., 2018; Zhang et al., 2018), we perform a controlled experiment by fixing the baseline network architecture to be a three-hidden-layer feed-forward network with ReLU activations. The numbers of units in the hidden layers are 500, 200, and 100, respectively. The output layer corresponds to a logistic regression model. This baseline without debiasing is denoted as NoDebias. For debiasing with adversarial learning techniques, the adversarial discriminator network takes the feature from the last hidden layer as input and connects it to a hidden layer with 50 units, followed by a binary classifier whose goal is to predict the sensitive attribute A. This model is denoted as AdvDebias. Compared with NoDebias, the only difference of AdvDebias in terms of the objective function is that, besides the cross-entropy loss for target prediction, AdvDebias also contains a classification loss from the adversarial discriminator to predict the sensitive attribute A. In the experiment, all other factors are fixed to be the same between the two methods, including learning rate, optimization algorithm, training epochs, and batch size. To see how the adversarial loss affects the joint error, demographic parity, and accuracy parity, we vary the coefficient ρ for the adversarial loss over 0.1, 1.0, 5.0 and 50.0. A sketch of this setup is given below.
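As a rough illustration, the following PyTorch sketch mirrors the architecture described above (500/200/100 feature layers, a 50-unit adversary on the last hidden layer, and the coefficient ρ weighting the adversarial loss). It is a minimal reconstruction under our own assumptions about unspecified details such as the alternating min-max updates, not the exact training code used in the thesis.

```python
import torch
import torch.nn as nn

class AdvDebias(nn.Module):
    def __init__(self, n_features=114):
        super().__init__()
        # Feature extractor g: three hidden layers with ReLU activations.
        self.g = nn.Sequential(
            nn.Linear(n_features, 500), nn.ReLU(),
            nn.Linear(500, 200), nn.ReLU(),
            nn.Linear(200, 100), nn.ReLU(),
        )
        self.clf = nn.Linear(100, 1)  # logistic-regression output layer for Y
        self.adv = nn.Sequential(     # discriminator predicting A from features
            nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 1),
        )

    def forward(self, x):
        z = self.g(x)
        return self.clf(z).squeeze(-1), self.adv(z).squeeze(-1)

bce = nn.BCEWithLogitsLoss()
model = AdvDebias()
rho = 1.0  # adversarial coefficient, varied over {0.1, 1.0, 5.0, 50.0}
opt_main = torch.optim.Adam(list(model.g.parameters()) + list(model.clf.parameters()))
opt_adv = torch.optim.Adam(model.adv.parameters())

def train_step(x, y, a):
    # (1) Adversary ascent: improve prediction of A from the representation.
    _, a_logit = model(x)
    opt_adv.zero_grad()
    bce(a_logit, a).backward()
    opt_adv.step()
    # (2) Main descent: fit Y while fooling the (now fixed) adversary.
    y_logit, a_logit = model(x)
    loss = bce(y_logit, y) - rho * bce(a_logit, a)
    opt_main.zero_grad()
    loss.backward()
    opt_main.step()
    return loss.item()

# Dummy batch just to show the call signature.
x = torch.randn(32, 114)
y = torch.randint(0, 2, (32,)).float()
a = torch.randint(0, 2, (32,)).float()
print(train_step(x, y, a))
```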

Results and Analysis The experimental results are listed in Table 6.4.1. Note that in the table $|\varepsilon_{\mathcal{D}_0} - \varepsilon_{\mathcal{D}_1}|$ can be understood as measuring an approximate version of accuracy parity, and similarly $\Delta_{\mathrm{DP}}(\hat{Y})$ measures how close the classifier is to satisfying demographic parity. From the table, it is clear that with increasing ρ, both the overall error $\varepsilon_{\mathcal{D}}$ (sensitive to the marginal distribution of A) and the joint error $\varepsilon_{\mathcal{D}_0} + \varepsilon_{\mathcal{D}_1}$ (insensitive to the imbalance of A) increase. As expected, $\Delta_{\mathrm{DP}}(\hat{Y})$ decreases drastically as ρ increases. Furthermore, $|\varepsilon_{\mathcal{D}_0} - \varepsilon_{\mathcal{D}_1}|$ is also gradually decreasing, but much more slowly than $\Delta_{\mathrm{DP}}(\hat{Y})$. This is due to the noise in the data as well as the shift between the optimal decision functions across groups, as indicated by our upper bound. To conclude, all the empirical results are consistent with our theoretical findings.
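As a quick sanity check of Corollary 6.3.3 against Table 6.4.1, the following snippet verifies that each row satisfies $\varepsilon_{\mathcal{D}_0} + \varepsilon_{\mathcal{D}_1} \geq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) - \Delta_{\mathrm{DP}}(\hat{Y})$, using the Adult base-rate gap $\Delta_{\mathrm{BR}} = 0.310 - 0.113 = 0.197$; this is just arithmetic on the reported numbers.

```python
delta_br = 0.310 - 0.113  # base-rate gap on Adult
# Rows of Table 6.4.1: (joint error eps_0 + eps_1, DP gap of the predictor).
rows = {
    "NoDebias":            (0.275, 0.189),
    "AdvDebias, rho=0.1":  (0.278, 0.190),
    "AdvDebias, rho=1.0":  (0.286, 0.113),
    "AdvDebias, rho=5.0":  (0.295, 0.032),
    "AdvDebias, rho=50.0": (0.360, 0.028),
}
for name, (joint_err, dp_gap) in rows.items():
    lower = delta_br - dp_gap  # Corollary 6.3.3 lower bound
    print(f"{name:22s} {joint_err:.3f} >= {lower:.3f}: {joint_err >= lower}")
```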

Table 6.4.1: Adversarial debiasing on demographic parity, joint error across groups, and accuracy parity.

                       ε_D     ε_D0 + ε_D1   |ε_D0 − ε_D1|   Δ_DP(Ŷ)
NoDebias              0.157       0.275          0.115        0.189
AdvDebias, ρ = 0.1    0.159       0.278          0.116        0.190
AdvDebias, ρ = 1.0    0.162       0.286          0.106        0.113
AdvDebias, ρ = 5.0    0.166       0.295          0.106        0.032
AdvDebias, ρ = 50.0   0.201       0.360          0.112        0.028

6.5 Related Work

Fairness Frameworks Two central notions of fairness have been extensively studied, i.e., group fairness and individual fairness. In a seminal work, Dwork et al. (2012) define individual fairness as a measure of smoothness of the classification function. Under the assumption that the number of individuals is finite, the authors proposed a linear programming framework to maximize the utility under their fairness constraint. However, their framework requires a priori a distance function that computes the similarity between individuals, and their optimization formulation does not produce an inductive rule to generalize to unseen data. Based on the definition of positive rate parity, Hardt et al. (2016) proposed a post-processing method to achieve fairness by taking as input the prediction and the sensitive attribute. In concurrent work, Kleinberg et al. (2016) offer a calibration technique to achieve the corresponding fairness criterion as well. However, both of the aforementioned approaches require the sensitive attribute during the inference phase, which is not available in many real-world scenarios.

Regularization Techniques The line of work on fairness-aware learning through regularization dates at least back to Kamishima et al. (2012), where the authors argue that simple deletion of sensitive features in data is insufficient for eliminating biases in automated decision making, due to the possible correlations among attributes and sensitive information (Lum and Johndrow, 2016). In light of this, the authors proposed a prejudice remover regularizer that essentially penalizes the mutual information between the predicted goal and the sensitive information. In a more recent approach, Zafar et al. (2015) leveraged a measure of decision boundary fairness and incorporated it via constraints into the objective function of logistic regression as well as support vector machines. As discussed in Section 7.2, both approaches essentially reduce to achieving demographic parity through regularization.

Representation Learning In a pioneering work, Zemel et al. (2013) proposed to preserve both group and individual fairness through the lens of representation learning, where the main idea is to find a good representation of the data with two competing goals: to encode the data for utility maximization while at the same time obfuscating any information about membership in the protected group. Due to the power of learning rich representations offered by deep neural nets, recent advances in building fair automated decision making systems focus on using adversarial techniques to learn fair representations that also preserve enough information for the prediction vendor to achieve its utility (Adel et al., 2019a; Beutel et al., 2017; Edwards and Storkey, 2015; Louizos et al., 2015; Song et al., 2019; Zhang et al., 2018; Zhao et al., 2019d). Madras et al. (2018) further extended this approach by incorporating a reconstruction loss given by an autoencoder into the objective function to preserve demographic parity, equalized odds, and equal opportunity.

6.6 Conclusion

In this chapter we theoretically and empirically study the important problem of quantifying the tradeoff between utility and fairness in learning group-invariant representations. Specifically, we prove a novel lower bound to characterize the tradeoff between demographic parity and the joint utility across different

population groups when the base rates differ between groups. In particular, our results imply that any method aiming to learn fair representations admits an information-theoretic lower bound on the joint error, and the better the representation, the larger the joint error. Complementary to our negative results, we also show that learning fair representations leads to accuracy parity if the optimal decision functions across different groups are close. These theoretical findings are also confirmed empirically on real-world datasets. We believe our results take an important step towards better understanding the tradeoff between utility and different notions of fairness. Inspired by our lower bound, one interesting direction for future work is to design instance-weighting algorithms to balance the base rates during representation learning.

Chapter 7

Conditional Learning of Fair Representations

In this chapter we propose a novel algorithm for learning fair representations that can simultaneously mitigate two notions of disparity among different demographic subgroups in the classification setting. Two key components underpinning the design of our algorithm are balanced error rate and conditional alignment of representations. We show how these two components contribute to ensuring accuracy parity and equalized false-positive and false-negative rates across groups without impacting demographic parity. Furthermore, we also demonstrate both in theory and on two real-world experiments that the proposed algorithm leads to a better utility-fairness trade-off on balanced datasets compared with existing algorithms on learning fair representations for classification.

7.1 Introduction

High-stakes settings, such as loan approvals, criminal justice, and hiring processes, use machine learning tools to help make decisions. A key question in these settings is whether the algorithm makes fair decisions. In settings that have historically had discrimination, we are interested in defining fairness with respect to a protected group, the group which has historically been disadvantaged. The rapidly growing field of algorithmic fairness has a vast literature that proposes various fairness metrics, characterizes the relationship between fairness metrics, and describes methods to build classifiers that satisfy these metrics (Chouldechova and Roth, 2018; Corbett-Davies and Goel, 2018). Among many recent attempts to achieve algorithmic fairness (Dwork et al., 2012; Hardt et al., 2016; Zafar et al., 2015; Zemel et al., 2013), learning fair representations has attracted increasing attention due to its flexibility in learning rich representations based on advances in deep learning (Beutel et al., 2017; Edwards and Storkey, 2015; Louizos et al., 2015; Madras et al., 2018). The backbone idea underpinning this line of work is very intuitive: if the representations of data from different groups are similar to each other, then any classifier acting on such representations will also be agnostic to the group membership.

However, it has long been empirically observed (Calders et al., 2009) and recently been proved (Zhao and Gordon, 2019) that fairness is often at odds with utility. For example, consider demographic parity, which requires the classifier to be independent of the group membership attribute. It is clear that demographic parity will cripple the utility if the demographic group membership and the target variable are indeed correlated. To escape such an inherent trade-off, other notions of fairness, such as equalized odds (Hardt et al., 2016), which asks for equal false positive and negative rates across groups, and accuracy parity (Zafar et al., 2017), which seeks equalized error rates across groups, have been proposed. It is a well-known result that equalized odds is incompatible with demographic parity (Barocas et al., 2017) except in degenerate cases where group membership is independent of the target variable. Accuracy parity and the so-called predictive rate parity (c.f. Definition 7.2.4) could be simultaneously achieved, e.g., by the COMPAS tool (Dieterich et al., 2016). Furthermore, under some conditions, it is also known that demographic parity can lead to accuracy parity (Zhao and Gordon, 2019). However, whether it is possible to simultaneously guarantee equalized odds and accuracy parity remains an open question.

In this chapter, we provide an affirmative answer to the above question by proposing an algorithm to align the conditional distributions (on the target variable) of representations across different demographic subgroups. The proposed formulation is a minimax problem that admits a simple reduction to cost-sensitive learning. The key component underpinning the design of our algorithm is the balanced error rate (BER, c.f. Section 7.2) (Feldman et al., 2015; Menon and Williamson, 2018) over the target variable and protected attributes. We demonstrate both in theory and on two real-world experiments that, together with the conditional alignment, BER helps our algorithm to simultaneously ensure accuracy parity and equalized odds across groups.
Our key contributions are summarized as follows:
• We prove that BER plays a fundamental role in ensuring accuracy parity and a small joint error across groups. Together with the conditional alignment of representations, this implies that we can simultaneously achieve equalized odds and accuracy parity. Furthermore, we also show that when equalized odds is satisfied, BER serves as an upper bound on the error of each subgroup. These results help to justify the design of our algorithm in using BER instead of the marginal error as our loss functions.
• We provide theoretical results that our method achieves equalized odds without impacting demographic parity. This result shows that we can preserve the demographic parity gap for free while simultaneously achieving equalized odds.
• Empirically, among existing fair representation learning methods, we demonstrate that our algorithm is able to achieve a better utility on balanced datasets. On an imbalanced dataset, our algorithm is the only method that achieves accuracy parity; however, it does so at the cost of decreased utility.
We believe our theoretical results contribute to the understanding of the relationship between equalized odds and accuracy parity, and the proposed algorithm provides an alternative in real-world scenarios where accuracy parity and equalized odds are desired.

7.2 Preliminaries

We first introduce the notations used throughout this chapter and formally describe the problem setup and various definitions of fairness explored in the literature.

Notation We use $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1\}$ to denote the input and output space. Accordingly, we use $X$ and $Y$ to denote the random variables which take values in $\mathcal{X}$ and $\mathcal{Y}$, respectively. Lower case letters $x$ and $y$ are used to denote the instantiation of $X$ and $Y$. To simplify the presentation, we use $A \in \{0, 1\}$ as the sensitive attribute, e.g., race, gender, etc.¹ Let $\mathcal{H}$ be the hypothesis class of classifiers. In other words, for $h \in \mathcal{H}$, $h: \mathcal{X} \to \mathcal{Y}$ is the predictor that outputs a prediction. Note that even if the predictor does not explicitly take the sensitive attribute $A$ as input, this fairness through blindness mechanism can still be biased due to the potential correlations between $X$ and $A$. In this chapter we study the stochastic setting where there is a joint distribution $\mathcal{D}$ over $X$, $Y$ and $A$ from which the data are sampled. To keep the notation consistent, for $a, y \in \{0, 1\}$, we use $\mathcal{D}_a$ to mean the conditional distribution of $\mathcal{D}$ given $A = a$ and $\mathcal{D}^y$ to mean the conditional distribution of $\mathcal{D}$ given $Y = y$. For an event $E$, $\mathcal{D}(E)$ denotes the probability of $E$ under $\mathcal{D}$. In particular, in the literature of fair machine learning, we call $\mathcal{D}(Y = 1)$ the base rate of distribution $\mathcal{D}$, and we use $\Delta_{\mathrm{BR}}(\mathcal{D}, \mathcal{D}') := |\mathcal{D}(Y = 1) - \mathcal{D}'(Y = 1)|$ to denote the difference of the base rates between two distributions $\mathcal{D}$ and $\mathcal{D}'$ over the same sample space.

Given a feature transformation function $g: \mathcal{X} \to \mathcal{Z}$ that maps instances from the input space $\mathcal{X}$ to feature space $\mathcal{Z}$, we define $g_\sharp\mathcal{D} := \mathcal{D} \circ g^{-1}$ to be the induced (pushforward) distribution of $\mathcal{D}$ under $g$, i.e., for any event $E' \subseteq \mathcal{Z}$, $g_\sharp\mathcal{D}(E') := \mathcal{D}(g^{-1}(E')) = \mathcal{D}(\{x \in \mathcal{X} \mid g(x) \in E'\})$. To measure the discrepancy between distributions, we use $d_{\mathrm{TV}}(\mathcal{D}, \mathcal{D}')$ to denote the total variation between them: $d_{\mathrm{TV}}(\mathcal{D}, \mathcal{D}') := \sup_E |\mathcal{D}(E) - \mathcal{D}'(E)|$. In particular, for a binary random variable $Y$, it can be readily verified that the total variation between the marginal distributions $\mathcal{D}(Y)$ and $\mathcal{D}'(Y)$ reduces to the difference of their base rates, $\Delta_{\mathrm{BR}}(\mathcal{D}, \mathcal{D}')$. To see this, realize that $d_{\mathrm{TV}}(\mathcal{D}(Y), \mathcal{D}'(Y)) = \max\{|\mathcal{D}(Y = 1) - \mathcal{D}'(Y = 1)|, |\mathcal{D}(Y = 0) - \mathcal{D}'(Y = 0)|\} = |\mathcal{D}(Y = 1) - \mathcal{D}'(Y = 1)| = \Delta_{\mathrm{BR}}(\mathcal{D}, \mathcal{D}')$ by definition.

Given a joint distribution $\mathcal{D}$, the error of a predictor $h$ under $\mathcal{D}$ is defined as $\varepsilon_{\mathcal{D}}(h) := \mathbb{E}_{\mathcal{D}}[\,|Y - h(X)|\,]$. Note that for binary classification problems, when $h(X) \in \{0, 1\}$, $\varepsilon_{\mathcal{D}}(h)$ reduces to the true error rate of binary classification. To make the notation more compact, we may drop the subscript $\mathcal{D}$ when it is clear from the context. Furthermore, we use $\mathrm{CE}_{\mathcal{D}}(\hat{Y} \,\|\, Y)$ to denote the cross-entropy loss function between the predicted variable $\hat{Y}$ and the true label distribution $Y$ over the joint distribution $\mathcal{D}$. For binary random variables $Y$, we define $\mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y)$ to be the balanced error rate of predicting $Y$ using $\hat{Y}$, e.g., $\mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y) := \mathcal{D}(\hat{Y} = 0 \mid Y = 1) + \mathcal{D}(\hat{Y} = 1 \mid Y = 0)$. Realize that $\mathcal{D}(\hat{Y} = 0 \mid Y = 1)$ is the false negative rate (FNR) of $\hat{Y}$ and $\mathcal{D}(\hat{Y} = 1 \mid Y = 0)$ corresponds to the false positive rate (FPR). So the balanced error rate can also be understood as the sum of FPR and FNR using the predictor $\hat{Y}$. We can similarly define $\mathrm{BER}_{\mathcal{D}}(\hat{A} \,\|\, A)$ as well.

¹Our main results could also be straightforwardly extended to the setting where A is a categorical variable.
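Since BER is the quantity the algorithm in this chapter optimizes throughout, here is a small sketch of how it can be estimated from finite samples, following the empirical FNR-plus-FPR form of the definition above; the helper name is ours.

```python
import numpy as np

def balanced_error_rate(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Empirical BER(Y_hat || Y) = FNR + FPR for binary labels."""
    fnr = np.mean(y_hat[y == 1] == 0)  # D(Y_hat = 0 | Y = 1)
    fpr = np.mean(y_hat[y == 0] == 1)  # D(Y_hat = 1 | Y = 0)
    return float(fnr + fpr)

# On a heavily imbalanced sample, a constant predictor has a low error rate
# but BER equal to 1, which is exactly why BER is the loss of choice here.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.1).astype(int)   # 10% positives
y_hat = np.zeros_like(y)                     # always predict 0
print("error rate:", np.mean(y_hat != y))    # ~0.10
print("BER:       ", balanced_error_rate(y_hat, y))  # 1.0
```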

Problem Setup We focus on group fairness where the group membership is given by the sensitive attribute $A$. We assume that the sensitive attribute $A$ is available to the learner during the training phase, but not the inference phase. As a result, post-processing techniques to ensure fairness are not feasible under our setting. In the literature, there are many possible definitions of fairness (Narayanan, 2018), and in what follows we provide a brief review of the ones that are most relevant to this work.

Definition 7.2.1 (Demographic Parity (DP)). Given a joint distribution $\mathcal{D}$, a classifier $\hat{Y}$ satisfies demographic parity if $\hat{Y}$ is independent of $A$.

When $\hat{Y}$ is a deterministic binary classifier, demographic parity reduces to the requirement that $\mathcal{D}_0(\hat{Y} = 1) = \mathcal{D}_1(\hat{Y} = 1)$, i.e., positive outcomes are given to the two groups at the same rate. Demographic parity is also known as statistical parity, and it has been adopted as the default definition of fairness in a series of work (Adel et al., 2019a; Calders and Verwer, 2010; Calders et al., 2009; Edwards and Storkey, 2015; Johndrow et al., 2019; Kamiran and Calders, 2009; Kamishima et al., 2011; Louizos et al., 2015; Madras et al., 2018; Zemel et al., 2013). It is not surprising that demographic parity may cripple the utility that we hope to achieve, especially in the common scenario where the base rates differ between two groups (Hardt et al., 2016). Formally, the following theorem characterizes the trade-off in terms of the joint error across different groups:

Theorem 7.2.1 (Zhao and Gordon, 2019). Let $\hat{Y} = h(g(X))$ be the classifier. Then
\[ \varepsilon_{\mathcal{D}_0}(h \circ g) + \varepsilon_{\mathcal{D}_1}(h \circ g) \geq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) - d_{\mathrm{TV}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1). \]

In the case of representations that are independent of the sensitive attribute $A$, the second term $d_{\mathrm{TV}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1)$ becomes 0, and this implies:

\[ \varepsilon_{\mathcal{D}_0}(h \circ g) + \varepsilon_{\mathcal{D}_1}(h \circ g) \geq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1). \]
At a colloquial level, the above inequality can be understood as an uncertainty principle which says: For fair representations, it is not possible to construct a predictor that simultaneously minimizes the errors on both demographic subgroups. More precisely, by the pigeonhole principle, the following corollary holds:

Corollary 7.2.1. If $d_{\mathrm{TV}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1) = 0$, then for any hypothesis $h$, $\max\{\varepsilon_{\mathcal{D}_0}(h \circ g), \varepsilon_{\mathcal{D}_1}(h \circ g)\} \geq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)/2$.

In words, this means that for fair representations in the demographic parity sense, at least one of the subgroups has to incur a prediction error of at least $\Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)/2$, which could be large in settings like criminal justice where $\Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)$ is large. In light of such an inherent trade-off, an alternative definition is accuracy parity, which asks for equalized error rates across different groups:

Definition 7.2.2 (Accuracy Parity). Given a joint distribution $\mathcal{D}$, a classifier $\hat{Y}$ satisfies accuracy parity if $\mathcal{D}_0(\hat{Y} \neq Y) = \mathcal{D}_1(\hat{Y} \neq Y)$.

A violation of accuracy parity is also known as disparate mistreatment (Zafar et al., 2017). Different from the definition of demographic parity, the definition of accuracy parity does not eliminate the perfect predictor even when the base rates differ across groups. Of course, other more refined definitions of fairness also exist in the literature, such as equalized odds (Hardt et al., 2016).

Definition 7.2.3 (Equalized Odds, a.k.a. Positive Rate Parity). Given a joint distribution $\mathcal{D}$, a classifier $\hat{Y}$ satisfies equalized odds if $\hat{Y}$ is independent of $A$ conditioned on $Y$.

The definition of equalized odds essentially requires equal true positive and false positive rates between different groups, hence it is also known as positive rate parity. Analogous to accuracy parity, equalized odds does not eliminate the perfect classifier (Hardt et al., 2016), and we will also justify this observation by formal analysis shortly. Last but not least, we have the following definition for predictive rate parity:

Definition 7.2.4 (Predictive Rate Parity). Given a joint distribution $\mathcal{D}$, a classifier $\hat{Y}$ satisfies predictive rate parity if $\mathcal{D}_0(Y = 1 \mid \hat{Y} = c) = \mathcal{D}_1(Y = 1 \mid \hat{Y} = c)$, $\forall c \in [0, 1]$.

Note that in the above definition we allow the classifier $\hat{Y}$ to be probabilistic, meaning that the output of $\hat{Y}$ could be any value between 0 and 1. For the case where $\hat{Y}$ is deterministic, Chouldechova (2017) showed that no deterministic classifier can simultaneously satisfy equalized odds and predictive rate parity when the base rates differ across subgroups and the classifier is not perfect.

7.3 Algorithm and Analysis

In this section we first give the proposed optimization formulation and then discuss, through formal analysis, the motivation of our algorithmic design. Specifically, we show in Section 7.3.1 why our formulation helps to escape the utility-fairness trade-off. We then in Section 7.3.2 formally prove that the BERs in the objective function can be used to guarantee a small joint error across different demographic subgroups. In Section 7.3.3 we establish the relationship between equalized odds and accuracy parity by providing an upper bound of the error gap in terms of both BER and the equalized odds gap. We conclude this section with a brief discussion on the practical implementation of the proposed optimization formulation in Section 7.3.4. We defer all the proofs to Section 7.6 and focus on explaining the high-level intuition and implications of our results.

As briefly discussed in Section 7.5, a dominant approach to learning fair representations is via adversarial training. Specifically, the following objective function is optimized:

\[ \min_{h, g}\ \max_{h'}\ \mathrm{CE}_{\mathcal{D}}(h(g(X)) \,\|\, Y) - \lambda\, \mathrm{CE}_{\mathcal{D}}(h'(g(X)) \,\|\, A) \tag{7.1} \]
In the above formulation, the first term corresponds to the minimization of the prediction loss of the target variable and the second term represents the loss incurred by the adversary $h'$. Overall this minimax optimization problem expresses a trade-off (controlled by the hyperparameter $\lambda > 0$) between utility and fairness through the representation learning function $g$: on one hand $g$ needs to preserve sufficient information related to $Y$ in order to minimize the first term, but on the other hand $g$ also needs to filter out information related to $A$ in order to maximize the second term.
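One common way to optimize a minimax objective like (7.1) with a single backward pass is a gradient reversal layer (Ganin et al., 2016), which the experiments in Section 7.4 also use. Below is a minimal PyTorch sketch under our own naming; it illustrates the mechanics of (7.1), not the exact code used in this thesis.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lambda on
    the backward pass. Through the reversed path, descending the adversary's
    loss w.r.t. the feature extractor actually *ascends* it, as Eq. (7.1)
    requires, while the adversary head itself still descends its own loss."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def objective_71(z, y, a, clf, adv, lam):
    # z: features g(X); clf predicts Y from z; adv predicts A from reversed z.
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    loss_y = bce(clf(z).squeeze(-1), y)
    loss_a = bce(adv(GradReverse.apply(z, lam)).squeeze(-1), a)
    return loss_y + loss_a  # one descent step realizes the minimax update

# Tiny usage example with linear heads on 100-dim features.
clf, adv = torch.nn.Linear(100, 1), torch.nn.Linear(100, 1)
z = torch.randn(32, 100, requires_grad=True)
y = torch.randint(0, 2, (32,)).float()
a = torch.randint(0, 2, (32,)).float()
objective_71(z, y, a, clf, adv, lam=1.0).backward()
```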

7.3.1 Conditional Learning of Fair Representations

However, as we introduced in Section 7.2, the above framework is still subject to the inherent trade-off between utility and fairness. To escape such a trade-off, we advocate for the following optimization formulation instead:
\[ \min_{h, g}\ \max_{h', h''}\ \mathrm{BER}_{\mathcal{D}}(h(g(X)) \,\|\, Y) - \lambda\left( \mathrm{BER}_{\mathcal{D}^0}(h'(g(X)) \,\|\, A) + \mathrm{BER}_{\mathcal{D}^1}(h''(g(X)) \,\|\, A) \right) \tag{7.2} \]
Note that here we optimize over two distinct adversaries, one for each conditional distribution $\mathcal{D}^y$, $y \in \{0, 1\}$. Intuitively, the main difference between (7.2) and (7.1) is that we use BER as our objective function in both terms. By definition, since BER corresponds to the sum of Type-I and Type-II errors in classification, the proposed objective function essentially minimizes the conditional errors instead of the original marginal error:
\begin{align}
\mathcal{D}(\hat{Y} \neq Y) &= \mathcal{D}(Y = 0)\,\mathcal{D}(\hat{Y} \neq Y \mid Y = 0) + \mathcal{D}(Y = 1)\,\mathcal{D}(\hat{Y} \neq Y \mid Y = 1) \nonumber \\
\mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y) &\propto \tfrac{1}{2}\,\mathcal{D}(\hat{Y} \neq Y \mid Y = 0) + \tfrac{1}{2}\,\mathcal{D}(\hat{Y} \neq Y \mid Y = 1), \tag{7.3}
\end{align}
which means that the loss function gives equal importance to the classification error from both groups. Note that the BERs in the second term of (7.2) are over $\mathcal{D}^y$, $y \in \{0, 1\}$. Roughly speaking, the second term encourages alignment of the conditional distributions $g_\sharp\mathcal{D}_0^y$ and $g_\sharp\mathcal{D}_1^y$ for $y \in \{0, 1\}$. The following proposition shows that a perfect conditional alignment of the representations also implies that any classifier based on the representations naturally satisfies the equalized odds criterion:

Proposition 7.3.1. For $g: \mathcal{X} \to \mathcal{Z}$, if $d_{\mathrm{TV}}(g_\sharp\mathcal{D}_0^y, g_\sharp\mathcal{D}_1^y) = 0$, $\forall y \in \{0, 1\}$, then for any classifier $h: \mathcal{Z} \to \{0, 1\}$, $h \circ g$ satisfies equalized odds.

To understand why we aim for conditional alignment of distributions instead of aligning the marginal feature distributions, the following proposition characterizes why such alignment helps us to escape the previous trade-off:

Proposition 7.3.2. For $g: \mathcal{X} \to \mathcal{Z}$, if $d_{\mathrm{TV}}(g_\sharp\mathcal{D}_0^y, g_\sharp\mathcal{D}_1^y) = 0$, $\forall y \in \{0, 1\}$, then for any classifier $h: \mathcal{Z} \to \{0, 1\}$, $d_{\mathrm{TV}}((h \circ g)_\sharp\mathcal{D}_0, (h \circ g)_\sharp\mathcal{D}_1) \leq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)$.

As a corollary, this implies that the lower bound given in Theorem 7.2.1 now vanishes if we instead align the conditional distributions of representations:

\[ \varepsilon_{\mathcal{D}_0}(h \circ g) + \varepsilon_{\mathcal{D}_1}(h \circ g) \geq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) - d_{\mathrm{TV}}((h \circ g)_\sharp\mathcal{D}_0, (h \circ g)_\sharp\mathcal{D}_1) \geq 0, \]
where the first inequality is due to Lemma 3.1 (Zhao and Gordon, 2019) together with the triangle inequality of the $d_{\mathrm{TV}}(\cdot, \cdot)$ distance, and the second is due to Proposition 7.3.2. Of course, the above lower bound can only serve as a necessary condition, not a sufficient one, to ensure a small joint error across groups. Later (c.f. Theorem 7.3.2) we will show that together with a small BER on the target variable, it becomes a sufficient condition as well.
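To show how formulation (7.2) differs mechanically from (7.1), the following PyTorch sketch computes the per-batch pieces of the CFAIR-style objective: a class-balanced target loss, plus two separate adversaries acting on the Y = 0 and Y = 1 slices of the batch. The names and the batch-level reweighting are our own simplifications of the formulation above, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def balanced_bce(logits, labels):
    # Class-reweighted BCE: each class weighted by its inverse frequency in
    # the batch, so the loss approximates BER (FNR + FPR) rather than the
    # marginal error; see Eq. (7.3) and Section 7.3.4.
    p = labels.mean().clamp(1e-6, 1 - 1e-6)
    w = torch.where(labels == 1, 1 / p, 1 / (1 - p))
    return F.binary_cross_entropy_with_logits(logits, labels, weight=w)

def cfair_losses(z, y, a, clf, adv0, adv1):
    """Per-batch pieces of objective (7.2). z = g(X); adv0/adv1 are the two
    label-conditional adversaries, each seeing only its slice of the batch."""
    loss_y = balanced_bce(clf(z).squeeze(-1), y)
    loss_a0 = balanced_bce(adv0(z[y == 0]).squeeze(-1), a[y == 0])
    loss_a1 = balanced_bce(adv1(z[y == 1]).squeeze(-1), a[y == 1])
    return loss_y, loss_a0 + loss_a1

# g and clf descend on (loss_y - lam * loss_adv); the adversaries descend on
# their own loss (equivalently, ascend through a gradient reversal layer).
clf = torch.nn.Linear(100, 1)
adv0, adv1 = torch.nn.Linear(100, 1), torch.nn.Linear(100, 1)
z = torch.randn(64, 100, requires_grad=True)
y = torch.randint(0, 2, (64,)).float()
a = torch.randint(0, 2, (64,)).float()
loss_y, loss_adv = cfair_losses(z, y, a, clf, adv0, adv1)
print(loss_y.item(), loss_adv.item())
```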

7.3.2 The Preservation of Demographic Parity Gap and Small Joint Error

In this section we show that learning representations by aligning the conditional distributions across groups cannot increase the DP gap as compared to the DP gap of $Y$. Before we proceed, we first introduce a metric to measure the deviation of a predictor from satisfying demographic parity:

Definition 7.3.1 (DP Gap). Given a joint distribution $\mathcal{D}$, the demographic parity gap of a classifier $\hat{Y}$ is $\Delta_{\mathrm{DP}}(\hat{Y}) := |\mathcal{D}_0(\hat{Y} = 1) - \mathcal{D}_1(\hat{Y} = 1)|$.

Clearly, if $\Delta_{\mathrm{DP}}(\hat{Y}) = 0$, then $\hat{Y}$ satisfies demographic parity. To simplify the exposition, let $\gamma_a := \mathcal{D}_a(Y = 0)$, $\forall a \in \{0, 1\}$. We first prove the following lemma:

Lemma 7.3.1. Assume the conditions in Proposition 7.3.1 hold and let $\hat{Y} = h(g(X))$ be the classifier; then $|\mathcal{D}_0(\hat{Y} = y) - \mathcal{D}_1(\hat{Y} = y)| \leq |\gamma_0 - \gamma_1| \cdot \left( \mathcal{D}^0(\hat{Y} = y) + \mathcal{D}^1(\hat{Y} = y) \right)$, $\forall y \in \{0, 1\}$.

Lemma 7.3.1 gives an upper bound on the difference of the prediction probabilities across different subgroups. Applying Lemma 7.3.1 twice, for $y = 0$ and $y = 1$, we can prove the following theorem:

Theorem 7.3.1. Assume the conditions in Proposition 7.3.1 hold and let $\hat{Y} = h(g(X))$ be the classifier; then $\Delta_{\mathrm{DP}}(\hat{Y}) \leq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) = \Delta_{\mathrm{DP}}(Y)$.

Remark Theorem 7.3.1 shows that aligning the conditional distributions of representations between groups will not add more bias in terms of the demographic parity gap. In particular, the DP gap of any classifier that satisfies equalized odds will be at most the DP gap of the perfect classifier. This is particularly interesting as it is well known in the literature (Barocas et al., 2017) that demographic parity is not compatible with equalized odds except in degenerate cases. Despite this result, Theorem 7.3.1 says that we can still achieve equalized odds and simultaneously preserve the DP gap.

In Section 7.3.1 we showed that aligning the conditional distributions of representations between groups helps reduce the lower bound on the joint error, but that is only a necessary condition. In the next theorem we show that, together with small Type-I and Type-II errors in inferring the target variable $Y$, these two properties are also sufficient to ensure a small joint error across different demographic subgroups.

Theorem 7.3.2. Assume the conditions in Proposition 7.3.1 hold and let $\hat{Y} = h(g(X))$ be the classifier; then $\varepsilon_{\mathcal{D}_0}(\hat{Y}) + \varepsilon_{\mathcal{D}_1}(\hat{Y}) \leq 2\,\mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y)$.

The above bound means that in order to achieve a small joint error across groups, it suffices to minimize the BER if the classifier satisfies equalized odds. Note that by definition, the BER in the bound equals the sum of the Type-I and Type-II classification errors of $\hat{Y}$. Theorem 7.3.2 gives an upper bound on the joint error across groups, and it also serves as a motivation for designing the optimization formulation (7.2) that simultaneously minimizes the BER and aligns the conditional distributions.

7.3.3 Conditional Alignment and Balanced Error Rates Lead to Small Error

In this section we will see that a small BER and equalized odds together not only serve as a guarantee of a small joint error, but also lead to a small error gap between different demographic subgroups. Recall that we define $\gamma_a := \mathcal{D}_a(Y = 0)$, $\forall a \in \{0, 1\}$. Before we proceed, we first formally define the accuracy gap and the equalized odds gap of a classifier $\hat{Y}$:

Definition 7.3.2 (Error Gap). Given a joint distribution $\mathcal{D}$, the error gap of a classifier $\hat{Y}$ is $\Delta_{\varepsilon}(\hat{Y}) := |\mathcal{D}_0(\hat{Y} \neq Y) - \mathcal{D}_1(\hat{Y} \neq Y)|$.

Definition 7.3.3 (Equalized Odds Gap). Given a joint distribution $\mathcal{D}$, the equalized odds gap of a classifier $\hat{Y}$ is $\Delta_{\mathrm{EO}}(\hat{Y}) := \max_{y \in \{0, 1\}} |\mathcal{D}_0^y(\hat{Y} = 1) - \mathcal{D}_1^y(\hat{Y} = 1)|$.

By definition the error gap can also be understood as the accuracy parity gap between different subgroups. The following theorem characterizes the relationship between the error gap, the equalized odds gap, and the difference of base rates across subgroups:

Theorem 7.3.3. For any classifier $\hat{Y}$, $\Delta_{\varepsilon}(\hat{Y}) \leq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) \cdot \mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y) + 2\Delta_{\mathrm{EO}}(\hat{Y})$.

As a direct corollary of Theorem 7.3.3, if the classifier $\hat{Y}$ satisfies equalized odds, then $\Delta_{\mathrm{EO}}(\hat{Y}) = 0$. In this case, since $\Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)$ is a constant, minimizing the balanced error rate $\mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y)$ also leads to minimizing the error gap. Furthermore, if we combine Theorem 7.3.2 and Theorem 7.3.3 together, we can guarantee that each of the group errors cannot be too large:

Corollary 7.3.1. For any joint distribution $\mathcal{D}$ and classifier $\hat{Y}$, if $\hat{Y}$ satisfies equalized odds, then $\max\{\varepsilon_{\mathcal{D}_0}(\hat{Y}), \varepsilon_{\mathcal{D}_1}(\hat{Y})\} \leq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) \cdot \mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y)/2 + \mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y)$.

Remark It is a well-known fact that out of the three fairness criteria, i.e., demographic parity, equalized odds, and predictive rate parity, any two of them cannot hold simultaneously (Barocas et al., 2017) except in degenerate cases. By contrast, Theorem 7.3.3 suggests it is possible to achieve both equalized odds and accuracy parity. In particular, among all classifiers that satisfy equalized odds, it suffices to minimize the sum of the Type-I and Type-II errors, $\mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y)$, in order to achieve accuracy parity. It is also worth pointing out that Theorem 7.3.3 provides only an upper bound, not necessarily the tightest one. In particular, the error gap could still be 0 while $\mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y)$ is greater than 0. To see this, we have
\begin{align*}
\varepsilon_{\mathcal{D}_0}(\hat{Y}) &= \mathcal{D}_0(Y = 0) \cdot \mathcal{D}_0(\hat{Y} = 1 \mid Y = 0) + \mathcal{D}_0(Y = 1) \cdot \mathcal{D}_0(\hat{Y} = 0 \mid Y = 1) \\
\varepsilon_{\mathcal{D}_1}(\hat{Y}) &= \mathcal{D}_1(Y = 0) \cdot \mathcal{D}_1(\hat{Y} = 1 \mid Y = 0) + \mathcal{D}_1(Y = 1) \cdot \mathcal{D}_1(\hat{Y} = 0 \mid Y = 1).
\end{align*}
Now if the predictor $\hat{Y}$ satisfies equalized odds, then

\begin{align*}
\mathcal{D}_0(\hat{Y} = 1 \mid Y = 0) &= \mathcal{D}_1(\hat{Y} = 1 \mid Y = 0) = \mathcal{D}(\hat{Y} = 1 \mid Y = 0), \\
\mathcal{D}_0(\hat{Y} = 0 \mid Y = 1) &= \mathcal{D}_1(\hat{Y} = 0 \mid Y = 1) = \mathcal{D}(\hat{Y} = 0 \mid Y = 1).
\end{align*}

Hence the error gap $\Delta_{\varepsilon}(\hat{Y})$ admits the following identity:
\begin{align*}
\Delta_{\varepsilon}(\hat{Y}) &= \left| \mathcal{D}(\hat{Y} = 1 \mid Y = 0)\left( \mathcal{D}_0(Y = 0) - \mathcal{D}_1(Y = 0) \right) + \mathcal{D}(\hat{Y} = 0 \mid Y = 1)\left( \mathcal{D}_0(Y = 1) - \mathcal{D}_1(Y = 1) \right) \right| \\
&= \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) \cdot \left| \mathcal{D}(\hat{Y} = 1 \mid Y = 0) - \mathcal{D}(\hat{Y} = 0 \mid Y = 1) \right| \\
&= \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) \cdot \left| \mathrm{FPR}(\hat{Y}) - \mathrm{FNR}(\hat{Y}) \right|.
\end{align*}

In other words, if the predictor $\hat{Y}$ satisfies equalized odds, then in order to have equalized accuracy, $\hat{Y}$ only needs to have equal FPR and FNR globally when the base rates differ across groups. This is a much weaker condition than asking for $\mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y) = 0$.

7.3.4 Practical Implementation

We cannot directly optimize the proposed optimization formulation (7.2), since the binary 0/1 loss is NP-hard to optimize, or even to approximately optimize, over a wide range of hypothesis classes (Ben-David et al., 2003). However, observe that for any classifier $\hat{Y}$ and $y \in \{0, 1\}$, the log-loss (cross-entropy loss) $\mathrm{CE}_{\mathcal{D}^y}(\hat{Y} \,\|\, Y)$ is a convex relaxation of the binary loss:

\[ \mathcal{D}(\hat{Y} \neq y \mid Y = y) = \frac{\mathcal{D}(\hat{Y} \neq y, Y = y)}{\mathcal{D}(Y = y)} \leq \frac{\mathrm{CE}_{\mathcal{D}^y}(\hat{Y} \,\|\, Y)}{\mathcal{D}(Y = y)}. \tag{7.4} \]
Hence in practice we can relax the optimization problem (7.2) to a cost-sensitive cross-entropy loss minimization problem, where the weight for each class is given by the inverse marginal probability of the corresponding class. This allows us to optimize the objective function without explicitly computing the conditional distributions.
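Concretely, the relaxation amounts to weighting each example's cross-entropy term by the inverse marginal probability of its class. A minimal sketch, assuming class marginals estimated from the training split; the helper name is ours:

```python
import torch
import torch.nn.functional as F

def cost_sensitive_ce(logits, labels, p_y1):
    """Cross-entropy with per-class weights 1/D(Y=0) and 1/D(Y=1), the
    relaxation of BER suggested by Eq. (7.4). p_y1 is the estimated
    marginal probability D(Y = 1) from the training data."""
    weights = torch.where(labels == 1, 1.0 / p_y1, 1.0 / (1.0 - p_y1))
    return F.binary_cross_entropy_with_logits(logits, labels, weight=weights)

# Usage: on Adult, D(Y = 1) is roughly 0.246 (Table 7.4.1).
logits = torch.randn(16)
labels = torch.randint(0, 2, (16,)).float()
print(cost_sensitive_ce(logits, labels, p_y1=0.246))
```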

7.4 Empirical Studies

In light of our theoretical findings, in this section we verify the effectiveness of the proposed algorithm in simultaneously ensuring equalized odds and accuracy parity using real-world datasets. We also analyze the impact of imposing such parity constraints on the utility of the target classifier, as well as its relationship to the intrinsic structure of the binary classification problem, e.g., the difference of base rates across groups, the global imbalance of the target variable, etc., and we analyze how this imbalance affects the utility-fairness trade-off. As we shall see shortly, we empirically demonstrate that in many cases, especially those where the dataset is imbalanced in terms of the target variable, this will inevitably compromise the target utility. For balanced datasets, this trend is less obvious: the proposed algorithm achieves a better fairness-utility trade-off than existing fair representation learning methods, and we can hope to achieve fairness without sacrificing too much utility.

Table 7.4.1: Statistics about the Adult and COMPAS datasets.

          Train / Test       D_0(Y = 1)   D_1(Y = 1)   Δ_BR(D_0, D_1)   D(Y = 1)   D(A = 1)
Adult     30,162 / 15,060      0.310        0.113           0.196          0.246      0.673
COMPAS     4,320 / 1,852       0.400        0.529           0.129          0.467      0.514

7.4.1 Experimental Setup

To this end, we perform experiments on two popular real-world datasets in the literature of algorithmic fairness, including an income-prediction dataset, known as the Adult dataset, from the UCI Machine Learning Repository (Dua and Graff, 2017), and the Propublica COMPAS dataset (Dieterich et al., 2016). The basic statistics of these two datasets are listed in Table 7.4.1.

Adult Each instance in the Adult dataset describes an adult from the 1994 US Census, with attributes such as gender, education level, age, etc. In this dataset we use gender as the sensitive attribute, and the processed data contains 114 attributes. The target variable (income) is also binary: 1 if income $\geq$ 50K/year, otherwise 0. For the sensitive attribute $A$, $A = 0$ means male, otherwise female. From Table 7.4.1 we can see that the base rates are quite different (0.310 vs. 0.113) across groups in the Adult dataset. The dataset is also imbalanced in the sense that only around 24.6% of the instances have target label 1. Furthermore, the group ratio is also imbalanced: roughly 67.3% of the data are male.

COMPAS The goal of the COMPAS dataset is binary classification on whether a criminal defendant will recidivate within two years or not. Each instance contains 12 attributes, including age, race, gender, number of prior crimes, etc. For this dataset, we use race (white $A = 0$ vs. black $A = 1$) as our sensitive attribute, and the target variable is 1 iff recidivism. The base rates are different across groups, but the COMPAS dataset is balanced in both the target variable and the sensitive attribute.

To validate the effect of ensuring equalized odds and accuracy parity, for each dataset we perform controlled experiments by fixing the baseline network architecture so that it is shared among all the fair representation learning methods. We term the proposed method CFAIR (for conditional fair representations); it minimizes conditional errors in both the target-variable loss and the adversary loss. To demonstrate the importance of using BER in the loss function of the target variable, we compare with a variant of CFAIR that only uses BER in the loss of the adversaries, denoted CFAIR-EO. To see the relative effect of using the cross-entropy loss vs. the L1 loss, we also show the results of LAFTR (Madras et al., 2018), a state-of-the-art method for learning fair representations. Note that LAFTR is closely related to CFAIR-EO but slightly different: LAFTR uses a global cross-entropy loss for the target variable, but a conditional L1 loss for the adversary. Also, there is only one adversary in LAFTR, while there are two adversaries, one for $\mathcal{D}^0$ and one for $\mathcal{D}^1$, in both CFAIR and CFAIR-EO. Lastly, we also present baseline results of FAIR (Edwards and Storkey, 2015), which aims for demographic parity representations, and NODEBIAS, the baseline network without any fairness constraint. For all the fair representation learning methods, we use the gradient reversal layer (Ganin et al., 2016) to implement the gradient descent-ascent (GDA) algorithm to optimize the minimax problem. All the experimental details, including network architectures, learning rates, batch sizes, etc., are provided in the appendix.

7.4.2 Results and Analysis

In Figure 7.4.1 and Figure 7.4.2 we show the error gap ∆ε, equalized odds gap ∆EO, demographic parity gap ∆DP and the joint error across groups ε0 + ε1 of the aforementioned fair representation learning algorithms on both the Adult and the COMPAS datasets. For each algorithm and dataset, we also gradually increase the value of the trade-off parameter λ and compute the corresponding metrics.

Figure 7.4.1: The error gap $\Delta_{\varepsilon}$, equalized odds gap $\Delta_{\mathrm{EO}}$, demographic parity gap $\Delta_{\mathrm{DP}}$ and joint error $\varepsilon_0 + \varepsilon_1$ on the Adult dataset with $\lambda \in \{0.1, 1.0, 10.0, 100.0, 1000.0\}$.

Figure 7.4.2: The error gap $\Delta_{\varepsilon}$, equalized odds gap $\Delta_{\mathrm{EO}}$, demographic parity gap $\Delta_{\mathrm{DP}}$ and joint error $\varepsilon_0 + \varepsilon_1$ on the COMPAS dataset with $\lambda \in \{0.1, 1.0, 10.0\}$.

Adult Due to the imbalance of A in the Adult dataset, in the first plot of Figure 7.4.1 we can see that all the algorithms except CFAIR have a large error gap of around 0.12. As a comparison, observe that the error gap of CFAIR when $\lambda = $ 1e3 almost reduces to 0, confirming the effectiveness of our algorithm in ensuring accuracy parity. From the second plot, we can verify that all three methods, CFAIR, CFAIR-EO and LAFTR, successfully ensure a small equalized odds gap, and they also decrease the demographic parity gaps (the third plot). FAIR is the most effective in mitigating $\Delta_{\mathrm{DP}}$, since its objective function directly optimizes for that goal. Note that from the second plot we can also confirm that CFAIR-EO is more effective than LAFTR in reducing $\Delta_{\mathrm{EO}}$. The reason is two-fold. First, CFAIR-EO uses two distinct adversaries and hence effectively competes with stronger adversaries than LAFTR. Second, CFAIR-EO uses the cross-entropy loss instead of the L1 loss for the adversary, and it is well known that the maximum-likelihood estimator (equivalent to using the cross-entropy loss) is asymptotically efficient and optimal. On the other hand, since the Adult dataset is imbalanced (in terms of Y), using BER in the loss function of the target variable can to a large extent hurt the utility, and this is also confirmed in the last plot, where we show the joint error.

COMPAS The first three plots of Figure 7.4.2 once again verify that CFAIR successfully leads to reduced error gap, equalized odds gap and also demographic parity gap. These experimental results are consistent with our theoretical findings where we show that if the representations satisfy equalized odds, then its ∆DP cannot exceed that of the optimal classifier, as shown by the horizontal dashed line in the third plot. In the fourth plot of Figure 7.4.2, we can see that as we increase λ, all the fair representation learning algorithms sacrifice utility. However, in contrast to Figure 7.4.1, here the proposed algorithm CFAIR has the smallest trade-off: this shows that CFAIR is particularly suited in the cases when the dataset is balanced and we would like to simultaneously ensure accuracy parity and equalized odds. As a comparison, while CFAIR-EO is still effective, it is slightly worse than CFAIR in terms of both ensuring parity and achieving small joint error.

118 7.5 Related Work

Algorithmic Fairness In the literature of algorithmic fairness, two key notions of fairness have been extensively proposed and explored, i.e., group fairness, including various variants defined in Section 7.2, and individual fairness, which means that similar individuals should be treated similarly. Due to the complexity in defining a distance metric to measure the similarity between individuals (Dwork et al., 2012), most recent research focuses on designing efficient algorithms to achieve group fairness (Creager et al., 2019; Hardt et al., 2016; Madras et al., 2018, 2019; Zafar et al., 2015, 2017; Zemel et al., 2013). In particular, Hardt et al. (2016) proposed a post-processing technique to achieve equalized odds by taking as input the model’s prediction and the sensitive attribute. However, the post-processing technique requires access to the sensitive attribute during the inference phase, which is often not available in many real-world scenarios. Another line of work uses causal inference to define notions of causal fairness and to formulate procedures for achieving these notions (Kilbertus et al., 2017; Kusner et al., 2017; Madras et al., 2019; Nabi and Shpitser, 2018; Wang et al., 2019; Zhang et al., 2018). These approaches require making untestable assumptions. Of particular note is the observation in Coston et al. (2019) that fairness-adjustment procedures based on Y in settings with treatment effects may lead to adverse outcomes. To apply our method in such settings, we would need to match conditional counterfactual distributions, which could be a direction of future research.

Theoretical Results on Fairness Theoretical work studying the relationship between different kinds of fairness notions is abundant. Motivated by the controversy over the potential discriminatory bias in recidivism prediction instruments, Chouldechova (2017) showed an intrinsic incompatibility between equalized odds and predictive rate parity. In the seminal work of Kleinberg et al. (2016), the authors demonstrated that when the base rates differ between different groups, a non-perfect classifier cannot simultaneously be statistically calibrated and satisfy equalized odds. In the context of cost-sensitive learning, Menon and Williamson (2018) show that if the optimal decision function is dissimilar to a fair decision, then the fairness constraint will not significantly harm the target utility. The idea of reducing fair classification to cost-sensitive learning is not new. Agarwal et al. (2018) explored the connection between fair classification and a sequence of cost-sensitive learning problems, where each stage corresponds to solving a linear minimax saddle point problem. In a recent work (Zhao and Gordon, 2019), the authors proved a lower bound on the joint error across different groups when a classifier satisfies demographic parity. They also showed that when the decision functions are close between groups, demographic parity also implies accuracy parity. The theoretical results in this work establish a relationship between accuracy parity and equalized odds: these two fairness notions are fundamentally related by the base rate gap and the balanced error rate. Furthermore, we also show that for any predictor that satisfies equalized odds, the balanced error rate also serves as an upper bound on the joint error across demographic subgroups.

Fair Representations Through the lens of representation learning, recent advances in building fair algorithmic decision making systems focus on using adversarial methods to learn fair representations that also preserve sufficient information for the prediction task (Adel et al., 2019a; Beutel et al., 2017; Edwards and Storkey, 2015; Madras et al., 2018; Zhang et al., 2018). In a nutshell, the key idea is to frame the problem of learning fair representations as a two-player game, where the data owner is competing against an adversary. The goal of the adversary is to infer the group attribute as much as possible while the goal of the data owner is to remove information related to the group attribute and simultaneously to preserve utility-related information for accurate prediction. Apart from using adversarial classifiers to enforce group fairness, other distance metrics have also been used to learn fair representations, e.g., the maximum mean

discrepancy (Louizos et al., 2015), and the Wasserstein-1 distance (Jiang et al., 2019). In contrast to these methods, in this chapter we advocate for optimizing BER on both the target loss and the adversary loss in order to simultaneously achieve accuracy parity and equalized odds. We also show that this leads to a better utility-fairness trade-off for balanced datasets.

7.6 Proofs

We provide all the missing proofs in this section.

7.6.1 Proof of Proposition 7.3.1

Proposition 7.3.1. For $g: \mathcal{X} \to \mathcal{Z}$, if $d_{\mathrm{TV}}(g_\sharp\mathcal{D}_0^y, g_\sharp\mathcal{D}_1^y) = 0$, $\forall y \in \{0, 1\}$, then for any classifier $h: \mathcal{Z} \to \{0, 1\}$, $h \circ g$ satisfies equalized odds.

Proof. To prove this proposition, we first show that for any pair of distributions $\mathcal{D}$, $\mathcal{D}'$ over $\mathcal{Z}$ and any hypothesis $h: \mathcal{Z} \to \{0, 1\}$, $d_{\mathrm{TV}}(h_\sharp\mathcal{D}, h_\sharp\mathcal{D}') \leq d_{\mathrm{TV}}(\mathcal{D}, \mathcal{D}')$. Note that since $h$ is a hypothesis, there are only two events in the induced probability space, i.e., $h(\cdot) = 0$ or $h(\cdot) = 1$. Hence by the definition of the induced (pushforward) distribution, we have:

\begin{align*}
d_{\mathrm{TV}}(h_\sharp\mathcal{D}, h_\sharp\mathcal{D}') &= \max_{E = h^{-1}(0) \text{ or } E = h^{-1}(1)} |\mathcal{D}(E) - \mathcal{D}'(E)| \\
&\leq \sup_{E \text{ measurable subset of } \mathcal{Z}} |\mathcal{D}(E) - \mathcal{D}'(E)| \\
&= d_{\mathrm{TV}}(\mathcal{D}, \mathcal{D}').
\end{align*}
Applying the above inequality twice, for $y \in \{0, 1\}$:
\[ 0 \leq d_{\mathrm{TV}}((h \circ g)_\sharp\mathcal{D}_0^y, (h \circ g)_\sharp\mathcal{D}_1^y) \leq d_{\mathrm{TV}}(g_\sharp\mathcal{D}_0^y, g_\sharp\mathcal{D}_1^y) = 0, \]
meaning
\[ d_{\mathrm{TV}}((h \circ g)_\sharp\mathcal{D}_0^y, (h \circ g)_\sharp\mathcal{D}_1^y) = 0, \]
which further implies that $h(g(X))$ is independent of $A$ given $Y = y$, since $h(g(X))$ is binary. □

7.6.2 Proof of Proposition 7.3.2

Proposition 7.3.2. For $g: \mathcal{X} \to \mathcal{Z}$, if $d_{\mathrm{TV}}(g_\sharp\mathcal{D}_0^y, g_\sharp\mathcal{D}_1^y) = 0$, $\forall y \in \{0, 1\}$, then for any classifier $h: \mathcal{Z} \to \{0, 1\}$, $d_{\mathrm{TV}}((h \circ g)_\sharp\mathcal{D}_0, (h \circ g)_\sharp\mathcal{D}_1) \leq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)$.

Proof. Let $\hat{Y} = (h \circ g)(X)$ and note that $\hat{Y}$ is binary; we have
\[ d_{\mathrm{TV}}((h \circ g)_\sharp\mathcal{D}_0, (h \circ g)_\sharp\mathcal{D}_1) = \frac{1}{2}\left( |\mathcal{D}_0(\hat{Y} = 0) - \mathcal{D}_1(\hat{Y} = 0)| + |\mathcal{D}_0(\hat{Y} = 1) - \mathcal{D}_1(\hat{Y} = 1)| \right). \]
Now, by Proposition 7.3.1, if $d_{\mathrm{TV}}(g_\sharp\mathcal{D}_0^y, g_\sharp\mathcal{D}_1^y) = 0$, $\forall y \in \{0, 1\}$, it follows that $d_{\mathrm{TV}}((h \circ g)_\sharp\mathcal{D}_0^y, (h \circ g)_\sharp\mathcal{D}_1^y) = 0$, $\forall y \in \{0, 1\}$ as well. Applying Lemma 7.3.1, we know that $\forall y \in \{0, 1\}$,
\[ |\mathcal{D}_0(\hat{Y} = y) - \mathcal{D}_1(\hat{Y} = y)| \leq |\mathcal{D}_0(Y = 0) - \mathcal{D}_1(Y = 0)| \cdot \left( \mathcal{D}^0(\hat{Y} = y) + \mathcal{D}^1(\hat{Y} = y) \right). \]

Hence,
\begin{align*}
d_{\mathrm{TV}}((h \circ g)_\sharp\mathcal{D}_0, (h \circ g)_\sharp\mathcal{D}_1) &= \frac{1}{2}\left( |\mathcal{D}_0(\hat{Y} = 0) - \mathcal{D}_1(\hat{Y} = 0)| + |\mathcal{D}_0(\hat{Y} = 1) - \mathcal{D}_1(\hat{Y} = 1)| \right) \\
&\leq \frac{|\mathcal{D}_0(Y = 0) - \mathcal{D}_1(Y = 0)|}{2}\left( \mathcal{D}^0(\hat{Y} = 0) + \mathcal{D}^1(\hat{Y} = 0) + \mathcal{D}^0(\hat{Y} = 1) + \mathcal{D}^1(\hat{Y} = 1) \right) \\
&= \frac{|\mathcal{D}_0(Y = 0) - \mathcal{D}_1(Y = 0)|}{2} \cdot 2 \\
&= |\mathcal{D}_0(Y = 0) - \mathcal{D}_1(Y = 0)| \\
&= \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1). \qquad \square
\end{align*}

7.6.3 Proof of Lemma 7.3.1

Recall that we define $\gamma_a := \mathcal{D}_a(Y = 0)$, $\forall a \in \{0, 1\}$.

Lemma 7.3.1. Assume the conditions in Proposition 7.3.1 hold and let $\hat{Y} = h(g(X))$ be the classifier; then $|\mathcal{D}_0(\hat{Y} = y) - \mathcal{D}_1(\hat{Y} = y)| \leq |\gamma_0 - \gamma_1| \cdot \left( \mathcal{D}^0(\hat{Y} = y) + \mathcal{D}^1(\hat{Y} = y) \right)$, $\forall y \in \{0, 1\}$.

Proof. To bound $|\mathcal{D}_0(\hat{Y} = y) - \mathcal{D}_1(\hat{Y} = y)|$ for $y \in \{0, 1\}$, by the law of total probability, we have:
\begin{align*}
&|\mathcal{D}_0(\hat{Y} = y) - \mathcal{D}_1(\hat{Y} = y)| \\
&\quad= \left| \left( \mathcal{D}_0^0(\hat{Y} = y)\,\mathcal{D}_0(Y = 0) + \mathcal{D}_0^1(\hat{Y} = y)\,\mathcal{D}_0(Y = 1) \right) - \left( \mathcal{D}_1^0(\hat{Y} = y)\,\mathcal{D}_1(Y = 0) + \mathcal{D}_1^1(\hat{Y} = y)\,\mathcal{D}_1(Y = 1) \right) \right| \\
&\quad\leq \left| \gamma_0\,\mathcal{D}_0^0(\hat{Y} = y) - \gamma_1\,\mathcal{D}_1^0(\hat{Y} = y) \right| + \left| (1 - \gamma_0)\,\mathcal{D}_0^1(\hat{Y} = y) - (1 - \gamma_1)\,\mathcal{D}_1^1(\hat{Y} = y) \right|,
\end{align*}
where the above inequality is due to the triangle inequality. Now apply Proposition 7.3.1: we know that $\hat{Y}$ satisfies equalized odds, so we have $\mathcal{D}_0^0(\hat{Y} = y) = \mathcal{D}_1^0(\hat{Y} = y) = \mathcal{D}^0(\hat{Y} = y)$ and $\mathcal{D}_0^1(\hat{Y} = y) = \mathcal{D}_1^1(\hat{Y} = y) = \mathcal{D}^1(\hat{Y} = y)$, leading to:
\begin{align*}
&= |\gamma_0 - \gamma_1| \cdot \mathcal{D}^0(\hat{Y} = y) + |(1 - \gamma_0) - (1 - \gamma_1)| \cdot \mathcal{D}^1(\hat{Y} = y) \\
&= |\gamma_0 - \gamma_1| \cdot \left( \mathcal{D}^0(\hat{Y} = y) + \mathcal{D}^1(\hat{Y} = y) \right),
\end{align*}
which completes the proof. □

7.6.4 Proof of Theorem 7.3.1

Theorem 7.3.1. Assume the conditions in Proposition 7.3.1 hold and let $\hat{Y} = h(g(X))$ be the classifier; then $\Delta_{\mathrm{DP}}(\hat{Y}) \leq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) = \Delta_{\mathrm{DP}}(Y)$.

Proof. To bound $\Delta_{\mathrm{DP}}(\hat{Y})$, realize that $|\mathcal{D}_0(\hat{Y} = 0) - \mathcal{D}_1(\hat{Y} = 0)| = |\mathcal{D}_0(\hat{Y} = 1) - \mathcal{D}_1(\hat{Y} = 1)|$, so we can rewrite the DP gap as follows:

\[ \Delta_{\mathrm{DP}}(\hat{Y}) = \frac{1}{2}\left( |\mathcal{D}_0(\hat{Y} = 0) - \mathcal{D}_1(\hat{Y} = 0)| + |\mathcal{D}_0(\hat{Y} = 1) - \mathcal{D}_1(\hat{Y} = 1)| \right). \]
Now apply Lemma 7.3.1 twice, for $y = 0$ and $y = 1$; we have:

\begin{align*}
|\mathcal{D}_0(\hat{Y} = 0) - \mathcal{D}_1(\hat{Y} = 0)| &\leq |\gamma_0 - \gamma_1| \cdot \left( \mathcal{D}^0(\hat{Y} = 0) + \mathcal{D}^1(\hat{Y} = 0) \right) \\
|\mathcal{D}_0(\hat{Y} = 1) - \mathcal{D}_1(\hat{Y} = 1)| &\leq |\gamma_0 - \gamma_1| \cdot \left( \mathcal{D}^0(\hat{Y} = 1) + \mathcal{D}^1(\hat{Y} = 1) \right).
\end{align*}

Taking the sum of the above two inequalities yields

\begin{align*}
&|\mathcal{D}_0(\hat{Y} = 0) - \mathcal{D}_1(\hat{Y} = 0)| + |\mathcal{D}_0(\hat{Y} = 1) - \mathcal{D}_1(\hat{Y} = 1)| \\
&\quad\leq |\gamma_0 - \gamma_1|\left( \mathcal{D}^0(\hat{Y} = 0) + \mathcal{D}^1(\hat{Y} = 0) + \mathcal{D}^0(\hat{Y} = 1) + \mathcal{D}^1(\hat{Y} = 1) \right) \\
&\quad= |\gamma_0 - \gamma_1|\left( \left( \mathcal{D}^0(\hat{Y} = 0) + \mathcal{D}^0(\hat{Y} = 1) \right) + \left( \mathcal{D}^1(\hat{Y} = 0) + \mathcal{D}^1(\hat{Y} = 1) \right) \right)
\end{align*}

\[ = 2\,|\gamma_0 - \gamma_1|. \]
Combining all the inequalities above, we know that

\[ \Delta_{\mathrm{DP}}(\hat{Y}) = \frac{1}{2}\left( |\mathcal{D}_0(\hat{Y} = 0) - \mathcal{D}_1(\hat{Y} = 0)| + |\mathcal{D}_0(\hat{Y} = 1) - \mathcal{D}_1(\hat{Y} = 1)| \right) \]

\begin{align*}
&\leq |\gamma_0 - \gamma_1| \\
&= |\mathcal{D}_0(Y = 0) - \mathcal{D}_1(Y = 0)| \\
&= |\mathcal{D}_0(Y = 1) - \mathcal{D}_1(Y = 1)| \\
&= \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) = \Delta_{\mathrm{DP}}(Y),
\end{align*}
completing the proof. □

7.6.5 Proof of Theorem 7.3.2

Theorem 7.3.2. Assume the conditions in Proposition 7.3.1 hold and let $\hat{Y} = h(g(X))$ be the classifier; then $\varepsilon_{\mathcal{D}_0}(\hat{Y}) + \varepsilon_{\mathcal{D}_1}(\hat{Y}) \leq 2\,\mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y)$.

Proof. First, by the law of total probability, we have:
\begin{align*}
\varepsilon_{\mathcal{D}_0}(\hat{Y}) + \varepsilon_{\mathcal{D}_1}(\hat{Y}) &= \mathcal{D}_0(Y \neq \hat{Y}) + \mathcal{D}_1(Y \neq \hat{Y}) \\
&= \mathcal{D}_0^1(\hat{Y} = 0)\,\mathcal{D}_0(Y = 1) + \mathcal{D}_0^0(\hat{Y} = 1)\,\mathcal{D}_0(Y = 0) + \mathcal{D}_1^1(\hat{Y} = 0)\,\mathcal{D}_1(Y = 1) + \mathcal{D}_1^0(\hat{Y} = 1)\,\mathcal{D}_1(Y = 0).
\end{align*}
Again, by Proposition 7.3.1, the classifier $\hat{Y} = (h \circ g)(X)$ satisfies equalized odds, so we have $\mathcal{D}_0^1(\hat{Y} = 0) = \mathcal{D}^1(\hat{Y} = 0)$, $\mathcal{D}_0^0(\hat{Y} = 1) = \mathcal{D}^0(\hat{Y} = 1)$, $\mathcal{D}_1^1(\hat{Y} = 0) = \mathcal{D}^1(\hat{Y} = 0)$ and $\mathcal{D}_1^0(\hat{Y} = 1) = \mathcal{D}^0(\hat{Y} = 1)$:

\begin{align*}
&= \mathcal{D}^1(\hat{Y} = 0)\,\mathcal{D}_0(Y = 1) + \mathcal{D}^0(\hat{Y} = 1)\,\mathcal{D}_0(Y = 0) + \mathcal{D}^1(\hat{Y} = 0)\,\mathcal{D}_1(Y = 1) + \mathcal{D}^0(\hat{Y} = 1)\,\mathcal{D}_1(Y = 0) \\
&= \mathcal{D}^1(\hat{Y} = 0) \cdot \left( \mathcal{D}_0(Y = 1) + \mathcal{D}_1(Y = 1) \right) + \mathcal{D}^0(\hat{Y} = 1) \cdot \left( \mathcal{D}_0(Y = 0) + \mathcal{D}_1(Y = 0) \right) \\
&\leq 2\,\mathcal{D}^1(\hat{Y} = 0) + 2\,\mathcal{D}^0(\hat{Y} = 1) \\
&= 2\,\mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y),
\end{align*}
which completes the proof. □

7.6.6 Proof of Theorem 7.3.3

Theorem 7.3.3. For any classifier $\hat{Y}$, $\Delta_{\varepsilon}(\hat{Y}) \leq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1) \cdot \mathrm{BER}_{\mathcal{D}}(\hat{Y} \,\|\, Y) + 2\Delta_{\mathrm{EO}}(\hat{Y})$.

Before we give the proof of Theorem 7.3.3, we first prove the following two lemmas that will be used in the proof.

Lemma 7.6.1. Define $\gamma_a := \mathcal{D}_a(Y = 0)$, $\forall a \in \{0, 1\}$; then $|\gamma_0\,\mathcal{D}_0^0(\hat{Y} = 1) - \gamma_1\,\mathcal{D}_1^0(\hat{Y} = 1)| \leq |\gamma_0 - \gamma_1| \cdot \mathcal{D}^0(\hat{Y} = 1) + \gamma_0\,\mathcal{D}^0(A = 1)\,\Delta_{\mathrm{EO}}(\hat{Y}) + \gamma_1\,\mathcal{D}^0(A = 0)\,\Delta_{\mathrm{EO}}(\hat{Y})$.

Proof. In order to prove the upper bound in the lemma, it suffices to give the desired upper bound for the following term

\begin{align*}
&\left| \gamma_0\,\mathcal{D}_0^0(\hat{Y} = 1) - \gamma_1\,\mathcal{D}_1^0(\hat{Y} = 1) \right| - \left| (\gamma_0 - \gamma_1)\,\mathcal{D}^0(\hat{Y} = 1) \right| \\
&\quad\leq \left| \gamma_0\,\mathcal{D}_0^0(\hat{Y} = 1) - \gamma_1\,\mathcal{D}_1^0(\hat{Y} = 1) - (\gamma_0 - \gamma_1)\,\mathcal{D}^0(\hat{Y} = 1) \right| \\
&\quad= \left| \gamma_0\left( \mathcal{D}_0^0(\hat{Y} = 1) - \mathcal{D}^0(\hat{Y} = 1) \right) - \gamma_1\left( \mathcal{D}_1^0(\hat{Y} = 1) - \mathcal{D}^0(\hat{Y} = 1) \right) \right|,
\end{align*}
following which we will have:

\begin{align*}
\left| \gamma_0\,\mathcal{D}_0^0(\hat{Y} = 1) - \gamma_1\,\mathcal{D}_1^0(\hat{Y} = 1) \right| &\leq \left| (\gamma_0 - \gamma_1)\,\mathcal{D}^0(\hat{Y} = 1) \right| \\
&\quad+ \left| \gamma_0\left( \mathcal{D}_0^0(\hat{Y} = 1) - \mathcal{D}^0(\hat{Y} = 1) \right) - \gamma_1\left( \mathcal{D}_1^0(\hat{Y} = 1) - \mathcal{D}^0(\hat{Y} = 1) \right) \right|,
\end{align*}
and an application of Bayes' formula finishes the proof. To do so, let us first simplify $\mathcal{D}_0^0(\hat{Y} = 1) - \mathcal{D}^0(\hat{Y} = 1)$. Applying Bayes' formula, we know that:
\begin{align*}
\mathcal{D}_0^0(\hat{Y} = 1) - \mathcal{D}^0(\hat{Y} = 1) &= \mathcal{D}_0^0(\hat{Y} = 1) - \left( \mathcal{D}_0^0(\hat{Y} = 1)\,\mathcal{D}^0(A = 0) + \mathcal{D}_1^0(\hat{Y} = 1)\,\mathcal{D}^0(A = 1) \right) \\
&= \mathcal{D}_0^0(\hat{Y} = 1)\left( 1 - \mathcal{D}^0(A = 0) \right) - \mathcal{D}_1^0(\hat{Y} = 1)\,\mathcal{D}^0(A = 1) \\
&= \mathcal{D}^0(A = 1)\left( \mathcal{D}_0^0(\hat{Y} = 1) - \mathcal{D}_1^0(\hat{Y} = 1) \right).
\end{align*}
Similarly, for the second term $\mathcal{D}_1^0(\hat{Y} = 1) - \mathcal{D}^0(\hat{Y} = 1)$, we can show that:
\[ \mathcal{D}_1^0(\hat{Y} = 1) - \mathcal{D}^0(\hat{Y} = 1) = \mathcal{D}^0(A = 0)\left( \mathcal{D}_1^0(\hat{Y} = 1) - \mathcal{D}_0^0(\hat{Y} = 1) \right). \]
Plugging these two identities back in, we can continue the analysis with

0 0 0 0 γ0( (Yb = 1) (Yb = 1)) γ ( (Yb = 1) (Yb = 1)) D0 − D − 1 D1 − D 0 0 0 0 0 0 = γ0 (A = 1)( (Yb = 1) (Yb = 1)) γ (A = 0)( (Yb = 1) (Yb = 1)) D D0 − D1 − 1D D1 − D0 0 0 0 0 0 0 γ0 (A = 1)( (Yb = 1) (Yb = 1)) + γ (A = 0)( (Yb = 1) (Yb = 1)) ≤ D D0 − D1 1D D1 − D0 0 0 γ0 (A = 1)∆ (Yb) + γ (A = 0)∆ (Yb). ≤ D EO 1D EO The first inequality holds by triangular inequality and the second one holds by the definition of equalized odds gap.  1 1 Lemma 7.6.2. Define γa := a(Y = 0), a 0, 1 , then (1 γ0) 0(Yb = 0) (1 γ1) 1(Yb = 1 D ∀1 ∈ { } | − D 1 − − D 0) γ0 γ (Yb = 0) + (1 γ0) (A = 1)∆ (Yb) + (1 γ ) (A = 0)∆ (Yb). | ≤ | − 1| · D − D EO − 1 D EO Proof. The proof of this lemma is symmetric to the previous one, so we omit it here.  Now we are ready to prove Theorem 7.3.3:

Proof of Theorem 7.3.3. First, by the law of total probability, it is easy to verify that the following identity holds for $a\in\{0,1\}$:
\begin{align*}
\mathcal{D}_a(\widehat{Y}\neq Y) &= \mathcal{D}_a(Y=1, \widehat{Y}=0) + \mathcal{D}_a(Y=0, \widehat{Y}=1) \\
&= (1-\gamma_a)\,\mathcal{D}_a^1(\widehat{Y}=0) + \gamma_a\,\mathcal{D}_a^0(\widehat{Y}=1).
\end{align*}
Using this identity, to bound the error gap, we have:
\begin{align*}
|\mathcal{D}_0(Y\neq\widehat{Y}) - \mathcal{D}_1(Y\neq\widehat{Y})| &= \Big|\big((1-\gamma_0)\,\mathcal{D}_0^1(\widehat{Y}=0) + \gamma_0\,\mathcal{D}_0^0(\widehat{Y}=1)\big) - \big((1-\gamma_1)\,\mathcal{D}_1^1(\widehat{Y}=0) + \gamma_1\,\mathcal{D}_1^0(\widehat{Y}=1)\big)\Big| \\
&\leq \big|\gamma_0\,\mathcal{D}_0^0(\widehat{Y}=1) - \gamma_1\,\mathcal{D}_1^0(\widehat{Y}=1)\big| + \big|(1-\gamma_0)\,\mathcal{D}_0^1(\widehat{Y}=0) - (1-\gamma_1)\,\mathcal{D}_1^1(\widehat{Y}=0)\big|.
\end{align*}
Invoking Lemma 7.6.1 and Lemma 7.6.2 to bound the above two terms:
\begin{align*}
|\mathcal{D}_0(Y\neq\widehat{Y}) - \mathcal{D}_1(Y\neq\widehat{Y})| \leq\; &\gamma_0\,\mathcal{D}^0(A=1)\,\Delta_{\mathrm{EO}}(\widehat{Y}) + \gamma_1\,\mathcal{D}^0(A=0)\,\Delta_{\mathrm{EO}}(\widehat{Y}) \\
&+ (1-\gamma_0)\,\mathcal{D}^1(A=1)\,\Delta_{\mathrm{EO}}(\widehat{Y}) + (1-\gamma_1)\,\mathcal{D}^1(A=0)\,\Delta_{\mathrm{EO}}(\widehat{Y}) \\
&+ |\gamma_0 - \gamma_1|\,\mathcal{D}^0(\widehat{Y}=1) + |\gamma_0 - \gamma_1|\,\mathcal{D}^1(\widehat{Y}=0).
\end{align*}
Realizing that both $\gamma_0, \gamma_1\in[0,1]$, we have:
\begin{align*}
|\mathcal{D}_0(Y\neq\widehat{Y}) - \mathcal{D}_1(Y\neq\widehat{Y})| &\leq \mathcal{D}^0(A=1)\,\Delta_{\mathrm{EO}}(\widehat{Y}) + \mathcal{D}^0(A=0)\,\Delta_{\mathrm{EO}}(\widehat{Y}) + \mathcal{D}^1(A=1)\,\Delta_{\mathrm{EO}}(\widehat{Y}) + \mathcal{D}^1(A=0)\,\Delta_{\mathrm{EO}}(\widehat{Y}) \\
&\qquad + |\gamma_0 - \gamma_1|\,\mathcal{D}^0(\widehat{Y}=1) + |\gamma_0 - \gamma_1|\,\mathcal{D}^1(\widehat{Y}=0) \\
&= 2\Delta_{\mathrm{EO}}(\widehat{Y}) + |\gamma_0 - \gamma_1|\,\big(\mathcal{D}^0(\widehat{Y}=1) + \mathcal{D}^1(\widehat{Y}=0)\big) \\
&= 2\Delta_{\mathrm{EO}}(\widehat{Y}) + \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)\cdot\mathrm{BER}_{\mathcal{D}}(\widehat{Y}\,\|\,Y),
\end{align*}
which completes the proof. $\square$

We also provide the proof of Corollary 7.3.1:

Corollary 7.3.1. For any joint distribution $\mathcal{D}$ and classifier $\widehat{Y}$, if $\widehat{Y}$ satisfies equalized odds, then
\[
\max\{\varepsilon_{\mathcal{D}_0}(\widehat{Y}), \varepsilon_{\mathcal{D}_1}(\widehat{Y})\} \leq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)\cdot\mathrm{BER}_{\mathcal{D}}(\widehat{Y}\,\|\,Y)/2 + \mathrm{BER}_{\mathcal{D}}(\widehat{Y}\,\|\,Y).
\]

Proof. We first invoke Theorem 7.3.3: if $\widehat{Y}$ satisfies equalized odds, then $\Delta_{\mathrm{EO}}(\widehat{Y}) = 0$, which implies:
\[
\Delta_{\varepsilon}(\widehat{Y}) = |\varepsilon_{\mathcal{D}_0}(\widehat{Y}) - \varepsilon_{\mathcal{D}_1}(\widehat{Y})| \leq \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)\cdot\mathrm{BER}_{\mathcal{D}}(\widehat{Y}\,\|\,Y).
\]
On the other hand, by Theorem 7.3.2, we know that
\[
\varepsilon_{\mathcal{D}_0}(\widehat{Y}) + \varepsilon_{\mathcal{D}_1}(\widehat{Y}) \leq 2\,\mathrm{BER}_{\mathcal{D}}(\widehat{Y}\,\|\,Y).
\]
Combining the above two inequalities and recalling that $\max\{a, b\} = (|a+b| + |a-b|)/2$, $\forall a, b\in\mathbb{R}$, yields:
\begin{align*}
\max\{\varepsilon_{\mathcal{D}_0}(\widehat{Y}), \varepsilon_{\mathcal{D}_1}(\widehat{Y})\} &= \frac{|\varepsilon_{\mathcal{D}_0}(\widehat{Y}) - \varepsilon_{\mathcal{D}_1}(\widehat{Y})| + |\varepsilon_{\mathcal{D}_0}(\widehat{Y}) + \varepsilon_{\mathcal{D}_1}(\widehat{Y})|}{2} \\
&\leq \frac{\Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)\cdot\mathrm{BER}_{\mathcal{D}}(\widehat{Y}\,\|\,Y) + 2\,\mathrm{BER}_{\mathcal{D}}(\widehat{Y}\,\|\,Y)}{2} \\
&= \Delta_{\mathrm{BR}}(\mathcal{D}_0, \mathcal{D}_1)\cdot\mathrm{BER}_{\mathcal{D}}(\widehat{Y}\,\|\,Y)/2 + \mathrm{BER}_{\mathcal{D}}(\widehat{Y}\,\|\,Y),
\end{align*}
completing the proof. $\square$
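Since Theorem 7.3.3 holds for any classifier, it can be checked numerically without any structural assumptions. Below is a small NumPy verification over random joint distributions of $(A, Y, \widehat{Y})$ (again our own sketch; it assumes the chapter's definitions $\Delta_{\mathrm{EO}}(\widehat{Y}) = \max_{y}|\mathcal{D}_0^y(\widehat{Y}=1) - \mathcal{D}_1^y(\widehat{Y}=1)|$ and $\mathrm{BER}_{\mathcal{D}}(\widehat{Y}\,\|\,Y) = \mathcal{D}(\widehat{Y}=0\mid Y=1) + \mathcal{D}(\widehat{Y}=1\mid Y=0)$).

```python
# Numerical verification of Theorem 7.3.3 on random joints over (A, Y, Yhat).
import numpy as np

rng = np.random.default_rng(0)
for _ in range(10_000):
    p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # p[a, y, yhat]
    pa = p.sum(axis=(1, 2))                          # D(A = a)
    err = np.array([(p[a, 0, 1] + p[a, 1, 0]) / pa[a] for a in (0, 1)])
    delta_err = abs(err[0] - err[1])
    gamma = np.array([p[a, 0, :].sum() / pa[a] for a in (0, 1)])  # D_a(Y = 0)
    delta_br = abs(gamma[0] - gamma[1])
    ber = p[:, 1, 0].sum() / p[:, 1, :].sum() + p[:, 0, 1].sum() / p[:, 0, :].sum()
    rate = p[:, :, 1] / p.sum(axis=2)                # D_a^y(Yhat = 1)
    delta_eo = np.abs(rate[0] - rate[1]).max()
    assert delta_err <= delta_br * ber + 2 * delta_eo + 1e-9     # Theorem 7.3.3
```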

7.7 Conclusion

In this chapter we propose a novel representation learning algorithm that aims to simultaneously ensure accuracy parity and equalized odds. The main idea underlying the design of our algorithm is to align the conditional distributions of representations (rather than the marginal distributions) and to use the balanced error rate (i.e., the conditional error) on both the target variable and the sensitive attribute. Theoretically, we prove how these two concepts together help to ensure accuracy parity and equalized odds without impacting demographic parity, and we also show how they can be used to give a guarantee on the joint error across different demographic subgroups. Empirically, we demonstrate in experiments on two real-world datasets that the proposed algorithm effectively leads to the desired notions of fairness, and that it also yields a better utility-fairness trade-off on balanced datasets.

Calibration and Utility. Our work takes a step towards better understanding the relationships between different notions of fairness and their corresponding trade-offs with utility. In some scenarios, e.g., the COMPAS tool, it is desirable to have a decision-making system that is also well calibrated. While it is well known that statistical calibration is not compatible with demographic parity or equalized odds, from a theoretical standpoint it is still not clear whether calibration will harm utility and, if so, what the fundamental limit of a calibrated tool on utility is.

Fairness and Privacy. Future work could also investigate how to make use of the close relationship between privacy and group fairness. At a colloquial level, fairness constraints require a predictor to be (to some extent) agnostic about the group membership attribute. Membership inference attacks in privacy ask essentially the same question: is it possible to guarantee that even an optimal adversary cannot recover personal information through inference attacks? Prior work (Dwork et al., 2012) has described the connection between the notion of individual fairness and differential privacy. Hence it would be interesting to exploit techniques developed in the privacy literature to develop more efficient fairness-aware learning algorithms. Conversely, results obtained in the algorithmic fairness literature could also potentially lead to better privacy-preserving machine learning algorithms (Zhao et al., 2019a).


Appendix

7.A Experimental Details

7.A.1 The Adult Experiment

For the baseline network NODEBIAS, we implement a three-layer neural network with ReLU as the hidden activation function and logistic regression as the target output function. The input layer contains 114 units and the hidden layer contains 60 hidden units. The output layer contains a single unit, whose output is interpreted as the probability $\mathcal{D}(\widehat{Y} = 1 \mid X = x)$.

For the adversary in FAIR and LAFTR, again, we use a three-layer feed-forward network. Specifically, the input to the adversary is the hidden representation of the baseline network, which contains 60 units. The hidden layer of the adversary network contains 50 units, with ReLU activation. Finally, the output of the adversary also contains a single unit, representing the adversary's inference probability $\mathcal{D}(\widehat{A} = 1 \mid Z = z)$.

The network structure of the adversaries in both CFAIR and CFAIR-EO is exactly the same as the one used in FAIR and LAFTR, except that there are two adversaries, one for $\mathcal{D}_0(\widehat{A} = 1 \mid Z = z)$ and one for $\mathcal{D}_1(\widehat{A} = 1 \mid Z = z)$.

The hyperparameters used in the experiment are listed in Table 7.A.1.

Table 7.A.1: Hyperparameters used in the Adult experiment.

Optimization Algorithm    AdaDelta
Learning Rate             1.0
Batch Size                512
Training Epochs           100
λ                         ∈ {0.1, 1.0, 10.0, 100.0, 1000.0}
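For concreteness, here is a minimal PyTorch sketch of the architectures described above. The module and variable names are ours, and the actual training procedure (the $\lambda$-weighted adversarial minimax objective, gradient reversal or alternating updates) is not reproduced here.

```python
# A minimal PyTorch sketch of the Adult architectures described above.
import torch
import torch.nn as nn

class NoDebias(nn.Module):
    """Three-layer baseline: 114 -> 60 -> 1, ReLU hidden, sigmoid output."""
    def __init__(self, n_in=114, n_hidden=60):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.clf = nn.Linear(n_hidden, 1)

    def forward(self, x):
        z = self.encoder(x)                  # hidden representation Z
        y_prob = torch.sigmoid(self.clf(z))  # D(Yhat = 1 | X = x)
        return y_prob, z

class Adversary(nn.Module):
    """Adversary on Z: 60 -> 50 -> 1, predicting D(Ahat = 1 | Z = z)."""
    def __init__(self, n_in=60, n_hidden=50):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, 1))

    def forward(self, z):
        return torch.sigmoid(self.net(z))

# CFAIR / CFAIR-EO use two such adversaries, one per conditional
# distribution D_0(Ahat = 1 | Z) and D_1(Ahat = 1 | Z):
adversaries = nn.ModuleList([Adversary(), Adversary()])
```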

7.A.2 The COMPAS Experiment

Again, for the baseline network NODEBIAS, we implement a three-layer neural network with ReLU as the hidden activation function and logistic regression as the target output function. The input layer contains 11 units and the hidden layer contains 10 hidden units. The output layer contains a single unit, whose output is interpreted as the probability $\mathcal{D}(\widehat{Y} = 1 \mid X = x)$.

For the adversary in FAIR and LAFTR, again, we use a three-layer feed-forward network. Specifically, the input to the adversary is the hidden representation of the baseline network, which contains 10 units. The hidden layer of the adversary network contains 10 units, with ReLU activation. Finally, the output of the adversary also contains a single unit, representing the adversary's inference probability $\mathcal{D}(\widehat{A} = 1 \mid Z = z)$.

The network structure of the adversaries in both CFAIR and CFAIR-EO is exactly the same as the one used in FAIR and LAFTR, except that there are two adversaries, one for $\mathcal{D}_0(\widehat{A} = 1 \mid Z = z)$ and one for $\mathcal{D}_1(\widehat{A} = 1 \mid Z = z)$.

The hyperparameters used in the experiment are listed in Table 7.A.2.

Table 7.A.2: Hyperparameters used in the COMPAS experiment.

Optimization Algorithm              AdaDelta
Learning Rate                       1.0
Batch Size                          512
Training Epochs (λ ∈ {0.1, 1.0})    20
Training Epochs (λ = 10.0)          15

Chapter 8

Learning Language-Invariant Representations for Universal Machine Translation

The goal of universal machine translation is to learn to translate between any pair of languages, given a corpus of paired translated documents for a small subset of all pairs of languages. Despite impressive empirical results and an increasing interest in massively multilingual models, theoretical analysis on translation errors made by such universal machine translation models is only nascent. In this chapter, we formally prove certain impossibilities of this endeavour in general, as well as prove positive results in the presence of additional (but natural) structure of data. For the former, we derive a lower bound on the translation error in the many-to-many translation setting, which shows that any algorithm aiming to learn shared sentence representations among multiple language pairs has to make a large translation error on at least one of the translation tasks, if no assumption on the structure of the languages is made. For the latter, we show that if the paired documents in the corpus follow a natural encoder-decoder generative process, we can expect a natural notion of “generalization”: a linear number of language pairs, rather than quadratic, suffices to learn a good representation. Our theory also explains what kinds of connection graphs between pairs of languages are better suited: ones with longer paths result in worse sample complexity in terms of the total number of documents per language pair needed.

8.1 Introduction

Despite impressive improvements in neural machine translation (NMT), training a large multilingual NMT model with hundreds of millions of parameters usually requires a collection of parallel corpora at a large scale, on the order of millions or even billions of aligned sentences (Arivazhagan et al., 2019; Johnson et al., 2017), for supervised training. Although it is possible to automatically crawl the web (Nie et al., 1999; Resnik, 1999; Resnik and Smith, 2003) to collect parallel sentences for high-resource language pairs such as German-English and Chinese-English, it is often infeasible or expensive to manually translate large amounts of documents for low-resource language pairs, e.g., Nepali-English or Sinhala-English (Guzmán et al., 2019). Much recent progress in low-resource machine translation has been driven by the idea of universal machine translation (UMT), also known as multilingual machine translation (Gu et al., 2018; Johnson et al., 2017; Zoph and Knight, 2016), which aims at training a single NMT model to translate between multiple source and target languages. Typical UMT models leverage either a single shared encoder or language-specific encoders to map all source languages to a shared space, and translate the source sentences into a target language with a decoder.

Inspired by the idea of UMT, there has been a recent trend towards learning language-invariant embeddings for multiple source languages in a shared latent space, which eases cross-lingual generalization from high-resource languages to low-resource languages on many tasks, e.g., parallel corpus mining (Artetxe and Schwenk, 2019; Schwenk, 2018), sentence classification (Conneau et al., 2018b), cross-lingual information retrieval (Litschko et al., 2018), and dependency parsing (Kondratyuk and Straka, 2019), just to name a few. The idea of finding an abstract "lingua franca" is very intuitive and the empirical results are impressive, yet theoretical understanding of various aspects of universal machine translation is limited. In this chapter, we focus on two basic questions:

1. How can we measure the inherent tradeoff between the quality of translation and how language-invariant a representation is?
2. How many language pairs do we need aligned sentences for, to be able to translate between any pair of languages?

Toward answering the first question, we show that in a completely assumption-free setup on the languages and the distribution of the data, it is impossible to avoid making a large translation error on at least one pair of translation tasks. Informally, we highlight our first theorem as follows, and provide the formal statements in Theorems 8.2.1 and 8.2.2.

Theorem 8.1.1 (Impossibility, Informal). There exists a choice of distributions over documents from different languages such that, for any choice of maps from the languages to a common representation, at least one of the translation pairs must incur a high cost. In addition, there is an inherent tradeoff between the translation quality and the degree of representation invariance w.r.t. languages: the better the language invariance, the higher the cost on at least one of the translation pairs.

To answer the second question, we show that under fairly mild generative assumptions on the aligned documents for the pairwise translations, it is possible not only to do well on all of the pairwise translations, but also to do so after seeing aligned documents for only a linear number of language pairs, rather than a quadratic one.
We summarize the second theorem as follows, and provide a formal statement in Theorem 8.3.1.

Theorem 8.1.2 (Sample complexity, Informal). Under a generative model where the documents for each language are generated from a "ground-truth" encoder-decoder model, after seeing aligned documents for a linear number of pairs of languages, we can learn encoders/decoders that perform well on any unseen language pair.

Notation and Setup. We first introduce the notation used throughout the chapter and then briefly describe the problem setting of universal machine translation.

We use $\mathcal{L}$ to denote the set of all possible languages, e.g., \{English, French, German, Chinese, \ldots\}. For any language $L\in\mathcal{L}$, we associate with $L$ an alphabet $\Sigma_L$ that contains all the symbols from $L$. Note that we assume $|\Sigma_L| < \infty$, $\forall L\in\mathcal{L}$, but different languages could potentially share part of the alphabet. Given a language $L$, a sentence $x$ in $L$ is a sequence of symbols from $\Sigma_L$, and we denote by $\Sigma_L^*$ the set of all sentences generated from $\Sigma_L$. Since in principle different languages could share the same alphabet, to avoid ambiguity, for each language $L$ there is a unique token $\langle L\rangle\in\Sigma_L$ with $\langle L\rangle\notin\Sigma_{L'}$, $\forall L'\neq L$. The unique token $\langle L\rangle$ is used to mark the source language: a sentence $x$ in $L$ will have the unique prefix $\langle L\rangle$ to indicate that $x\in\Sigma_L^*$. Also, in this manuscript we will use sentence and string interchangeably.

Formally, let $\{L_i\}_{i\in[K]}$ be the set of $K$ source languages and let $L\notin\{L_i\}_{i\in[K]}$ be the target language we are interested in translating to. (We use $[K]$ to denote the set $\{0, 1, \ldots, K-1\}$.) For a pair of languages $L$ and $L'$, we use $\mathcal{D}_{L,L'}$ to denote the joint distribution over the parallel sentence pairs from $L$ and $L'$. Given this joint distribution, we also use $\mathcal{D}_{L,L'}(L)$ to mean the marginal distribution over sentences from $L$; likewise we use $\mathcal{D}_{L,L'}(L')$ to denote the corresponding marginal distribution over sentences from $L'$. Finally, for two sets $A$ and $B$, we use $A\sqcup B$ to denote the disjoint union of $A$ and $B$. In particular, when $A$ and $B$ are disjoint, their disjoint union equals the usual set union, i.e., $A\sqcup B = A\cup B$.

8.2 An Impossibility Theorem

In this section, for clarity of presentation, we first focus on the deterministic setting where for each language pair $L\neq L'$, there exists a ground-truth translator $f^*_{L\to L'}:\Sigma^*_L\to\Sigma^*_{L'}$ that takes an input sentence $x$ from the source language $L$ and outputs the ground-truth translation $f^*_{L\to L'}(x)\in\Sigma^*_{L'}$. Later we shall extend the setup to a probabilistic setting as well. Before we proceed, we first describe some concepts that will be used in the discussion.

Given a feature map $g:\mathcal{X}\to\mathcal{Z}$ that maps instances from the input space $\mathcal{X}$ to the feature space $\mathcal{Z}$, we define $g_\sharp\mathcal{D} := \mathcal{D}\circ g^{-1}$ to be the induced (pushforward) distribution of $\mathcal{D}$ under $g$, i.e., for any event $E'\subseteq\mathcal{Z}$, $\Pr_{g_\sharp\mathcal{D}}(E') := \Pr_{\mathcal{D}}(g^{-1}(E')) = \Pr_{\mathcal{D}}(\{x\in\mathcal{X}\mid g(x)\in E'\})$. For two distributions $\mathcal{D}$ and $\mathcal{D}'$ over the same sample space, we use the total variation distance to measure the discrepancy between them: $d_{\mathrm{TV}}(\mathcal{D}, \mathcal{D}') := \sup_E |\Pr_{\mathcal{D}}(E) - \Pr_{\mathcal{D}'}(E)|$, where $E$ ranges over all measurable events under the common sample space. We use $\mathbb{I}(E)$ to denote the indicator function, which takes value 1 iff the event $E$ is true and 0 otherwise.

In general, given two sentences $x$ and $x'$, we use $\ell(x, x')$ to denote the loss function used to measure their distance. For example, we could use the 0-1 loss, $\ell_{0\text{-}1}(x, x') = 0$ iff $x = x'$ and 1 otherwise. If both $x$ and $x'$ are embedded in the same Euclidean space, we could also use the squared loss $\ell_2(x, x')$ as a more refined measure. To measure the performance of a translator $f$ on a given language pair $L\to L'$ w.r.t. the ground-truth translator $f^*_{L\to L'}$, we define the error function of $f$ as

\[
\varepsilon^{L\to L'}_{\mathcal{D}}(f) := \mathbb{E}_{\mathcal{D}}\big[\ell_{0\text{-}1}\big(f(X), f^*_{L\to L'}(X)\big)\big],
\]
which is the translation error of $f$ as compared to the ground-truth translator $f^*_{L\to L'}$. For universal machine translation, the input string of the translator can be any sentence from any language. To this end, let $\Sigma^*$ be the union of all the sentences/strings from all the languages of interest: $\Sigma^* := \bigcup_{L\in\mathcal{L}}\Sigma^*_L$. Then a universal machine translator of target language $L\in\mathcal{L}$ is a mapping $f_L:\Sigma^*\to\Sigma^*_L$. In words, $f_L$ takes as input a

string (from one of the possible languages) and outputs the corresponding translation in the target language $L$.

It is not hard to see that for such a task there exists a perfect translator $f^*_L$:

\[
f^*_L(x) = \sum_{L'\in\mathcal{L}} \mathbb{I}(x\in\Sigma^*_{L'})\cdot f^*_{L'\to L}(x). \tag{8.1}
\]

Note that $\{\Sigma^*_{L'}\mid L'\in\mathcal{L}\}$ forms a partition of $\Sigma^*$, so exactly one of the indicators $\mathbb{I}(x\in\Sigma^*_{L'})$ in (8.1) takes value 1.

Given a target language $L$, existing approaches for universal machine translation seek to find an intermediate space $\mathcal{Z}$ such that source sentences from different languages are aligned within $\mathcal{Z}$. In particular, for each source language $L'$, the goal is to find a feature mapping $g_{L'}:\Sigma^*_{L'}\to\mathcal{Z}$ so that the induced distributions of different languages are close in $\mathcal{Z}$. The next step is to construct a decoder $h:\mathcal{Z}\to\Sigma^*_L$ that maps feature representations in $\mathcal{Z}$ to sentences in the target language $L$.

One interesting question about the idea of learning language-invariant representations is whether such a method can succeed even in the benign setting where there is a ground-truth universal translator and the learner has access to an infinite amount of data with unbounded computational resources. That is, we are interested in understanding the information-theoretic limit of such methods for universal machine translation. In this section we first present an impossibility theorem in the restricted setting of translating from two source languages $L_0$ and $L_1$ to a target language $L$. We then use this result to prove a lower bound on the universal translation error in the general many-to-many setting. We will mainly discuss the implications and intuition of our theoretical results and use figures to help illustrate the high-level ideas of the proofs.
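Operationally, (8.1) is nothing but a dispatch over the (disjoint) source languages. The following Python sketch makes this explicit; the helper names are hypothetical placeholders, not part of the formal development.

```python
# A sketch of the perfect translator f*_L in (8.1): since the sets Sigma*_{L'}
# partition Sigma*, exactly one branch of the dispatch fires for any input x.
# `translators` and `source_language_of` are hypothetical placeholders.
def make_perfect_translator(translators, source_language_of):
    # translators[L_prime]: ground-truth translator f*_{L' -> L}
    # source_language_of(x): the unique language L' with x in Sigma*_{L'}
    #   (recoverable from the unique prefix token <L'>)
    def f_star(x):
        return translators[source_language_of(x)](x)
    return f_star
```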

8.2.1 Two-to-One Translation

Recall that for each translation task $L_i\to L$, we have a joint distribution $\mathcal{D}_{L_i,L}$ (parallel corpora) over the aligned source-target sentences. For convenience of notation, we use $\mathcal{D}_i$ to denote the marginal distribution $\mathcal{D}_{L_i,L}(L_i)$, $\forall i\in[K]$, when the target language $L$ is clear from the context. Given a fixed constant $\epsilon > 0$, we first define the $\epsilon$-universal language mapping:

Definition 8.2.1 ($\epsilon$-Universal Language Mapping). A map $g:\bigcup_{i\in[K]}\Sigma^*_{L_i}\to\mathcal{Z}$ is called an $\epsilon$-universal language mapping if $d_{\mathrm{TV}}(g_\sharp\mathcal{D}_i, g_\sharp\mathcal{D}_j)\leq\epsilon$, $\forall i\neq j$.

In particular, if $\epsilon = 0$, we call the corresponding feature mapping a universal language mapping. In other words, a universal language mapping perfectly aligns the feature representations of different languages in the feature space $\mathcal{Z}$. The following lemma provides a useful tool to connect the 0-1 translation error and the TV distance between the corresponding distributions.

Lemma 8.2.1. Let $\Sigma := \bigcup_{L\in\mathcal{L}}\Sigma_L$ and let $\mathcal{D}_\Sigma$ be a language model over $\Sigma^*$. For any two string-to-string maps $f, f':\Sigma^*\to\Sigma^*$, let $f_\sharp\mathcal{D}_\Sigma$ and $f'_\sharp\mathcal{D}_\Sigma$ be the corresponding pushforward distributions. Then $d_{\mathrm{TV}}(f_\sharp\mathcal{D}_\Sigma, f'_\sharp\mathcal{D}_\Sigma)\leq\Pr_{\mathcal{D}_\Sigma}(f(X)\neq f'(X))$, where $X\sim\mathcal{D}_\Sigma$.

Proof. Note that the sample space $\Sigma^*$ is countable. For any two distributions $\mathcal{P}$ and $\mathcal{Q}$ over $\Sigma^*$, it is a

well-known fact that $d_{\mathrm{TV}}(\mathcal{P}, \mathcal{Q}) = \frac{1}{2}\sum_{y\in\Sigma^*}|\mathcal{P}(y) - \mathcal{Q}(y)|$. Using this fact, we have:
\begin{align*}
d_{\mathrm{TV}}(f_\sharp\mathcal{D}, f'_\sharp\mathcal{D}) &= \frac{1}{2}\sum_{y\in\Sigma^*}\big|f_\sharp\mathcal{D}(y) - f'_\sharp\mathcal{D}(y)\big| \\
&= \frac{1}{2}\sum_{y\in\Sigma^*}\Big|\Pr_{\mathcal{D}}(f(X)=y) - \Pr_{\mathcal{D}}(f'(X)=y)\Big| \\
&= \frac{1}{2}\sum_{y\in\Sigma^*}\Big|\mathbb{E}_{\mathcal{D}}[\mathbb{I}(f(X)=y)] - \mathbb{E}_{\mathcal{D}}[\mathbb{I}(f'(X)=y)]\Big| \\
&\leq \frac{1}{2}\sum_{y\in\Sigma^*}\mathbb{E}_{\mathcal{D}}\Big[\big|\mathbb{I}(f(X)=y) - \mathbb{I}(f'(X)=y)\big|\Big] \\
&= \frac{1}{2}\sum_{y\in\Sigma^*}\mathbb{E}_{\mathcal{D}}\big[\mathbb{I}(f(X)=y, f'(X)\neq y) + \mathbb{I}(f(X)\neq y, f'(X)=y)\big] \\
&= \frac{1}{2}\sum_{y\in\Sigma^*}\Big(\mathbb{E}_{\mathcal{D}}\big[\mathbb{I}(f(X)=y, f'(X)\neq f(X))\big] + \mathbb{E}_{\mathcal{D}}\big[\mathbb{I}(f'(X)=y, f'(X)\neq f(X))\big]\Big) \\
&= \sum_{y\in\Sigma^*}\mathbb{E}_{\mathcal{D}}\big[\mathbb{I}(f(X)=y, f'(X)\neq f(X))\big] \\
&= \sum_{y\in\Sigma^*}\Pr_{\mathcal{D}}\big(f(X)=y, f'(X)\neq f(X)\big) \\
&= \Pr_{\mathcal{D}}\big(f(X)\neq f'(X)\big).
\end{align*}
The second equality holds by the definition of the pushforward distribution. The inequality on the fourth line holds by the triangle inequality, and the equality on the seventh line is due to the symmetry between $f(X)$ and $f'(X)$. The last equality holds by the law of total probability. $\square$

The next lemma follows from the data-processing inequality for the total variation distance; it shows that if two languages are close in a feature space, then no decoder can increase the corresponding discrepancy of these two languages in the output space.

Lemma 8.2.2 (Data-processing inequality). Let $\mathcal{D}$ and $\mathcal{D}'$ be any distributions over $\mathcal{Z}$. Then for any decoder $h:\mathcal{Z}\to\Sigma^*_L$, $d_{\mathrm{TV}}(h_\sharp\mathcal{D}, h_\sharp\mathcal{D}')\leq d_{\mathrm{TV}}(\mathcal{D}, \mathcal{D}')$.

As a direct corollary, any distributions induced by a decoder over an $\epsilon$-universal language mapping must also be close in the output space:

Corollary 8.2.1. If $g:\Sigma^*\to\mathcal{Z}$ is an $\epsilon$-universal language mapping, then for any decoder $h:\mathcal{Z}\to\Sigma^*_L$, $d_{\mathrm{TV}}((h\circ g)_\sharp\mathcal{D}_0, (h\circ g)_\sharp\mathcal{D}_1)\leq\epsilon$.

With the above tools, we can state the following theorem that characterizes the translation error in the two-to-one setting:

Theorem 8.2.1 (Lower bound, Two-to-One). Consider a restricted setting of the universal machine translation task with two source languages, where $\Sigma^* = \Sigma^*_{L_0}\cup\Sigma^*_{L_1}$ and the target language is $L$. Let $g:\Sigma^*\to\mathcal{Z}$ be an $\epsilon$-universal language mapping. Then for any decoder $h:\mathcal{Z}\to\Sigma^*_L$, we have

\[
\varepsilon^{L_0\to L}_{\mathcal{D}_0}(h\circ g) + \varepsilon^{L_1\to L}_{\mathcal{D}_1}(h\circ g) \geq d_{\mathrm{TV}}\big(\mathcal{D}_{L_0,L}(L), \mathcal{D}_{L_1,L}(L)\big) - \epsilon. \tag{8.2}
\]

Figure 8.2.1: Proof by picture: a language-invariant representation $g$ induces the same feature distribution over $\mathcal{Z}$, which leads to the same output distribution over the target language $\Sigma^*_L$. However, the parallel corpora of the two translation tasks in general have different marginal distributions over the target language, hence a triangle inequality over the output distributions gives the desired lower bound.

Proof of Theorem 8.2.1. First, realize that $d_{\mathrm{TV}}(\cdot, \cdot)$ is a distance metric, so the following chain of triangle inequalities holds:

\begin{align*}
d_{\mathrm{TV}}\big(\mathcal{D}_{L_0,L}(L), \mathcal{D}_{L_1,L}(L)\big) \leq\; &d_{\mathrm{TV}}\big(\mathcal{D}_{L_0,L}(L), (h\circ g)_\sharp\mathcal{D}_0\big) \\
&+ d_{\mathrm{TV}}\big((h\circ g)_\sharp\mathcal{D}_1, \mathcal{D}_{L_1,L}(L)\big) \\
&+ d_{\mathrm{TV}}\big((h\circ g)_\sharp\mathcal{D}_0, (h\circ g)_\sharp\mathcal{D}_1\big).
\end{align*}
Now, by the assumption that $g$ is an $\epsilon$-universal language mapping and Corollary 8.2.1, the third term on the RHS of the above inequality, $d_{\mathrm{TV}}((h\circ g)_\sharp\mathcal{D}_0, (h\circ g)_\sharp\mathcal{D}_1)$, is upper bounded by $\epsilon$. Furthermore, note that the following equality holds:

\[
\mathcal{D}_{L_i,L}(L) = f^*_{L_i\to L\,\sharp}\mathcal{D}_i, \quad \forall i\in\{0,1\},
\]
so we can further simplify the above inequality as

\[
d_{\mathrm{TV}}\big(\mathcal{D}_{L_0,L}(L), \mathcal{D}_{L_1,L}(L)\big) \leq d_{\mathrm{TV}}\big(f^*_{L_0\to L\,\sharp}\mathcal{D}_0, (h\circ g)_\sharp\mathcal{D}_0\big) + d_{\mathrm{TV}}\big((h\circ g)_\sharp\mathcal{D}_1, f^*_{L_1\to L\,\sharp}\mathcal{D}_1\big) + \epsilon.
\]
Now invoke Lemma 8.2.1 for $i\in\{0,1\}$ to upper bound the first two terms on the RHS, yielding:

\[
d_{\mathrm{TV}}\big(f^*_{L_i\to L\,\sharp}\mathcal{D}_i, (h\circ g)_\sharp\mathcal{D}_i\big) \leq \Pr_{\mathcal{D}_i}\big((h\circ g)(X)\neq f^*_{L_i\to L}(X)\big) = \varepsilon^{L_i\to L}_{\mathcal{D}_i}(h\circ g).
\]
A simple rearranging then completes the proof. $\square$

Remark. Recall that under our setting, there exists a perfect translator $f^*_L:\Sigma^*\to\Sigma^*_L$ in (8.1) that achieves zero translation error on both translation tasks. Nevertheless, the lower bound in Theorem 8.2.1 shows that one cannot hope to simultaneously minimize the joint translation error on both tasks through a universal language mapping. Second, the lower bound is algorithm-independent, and it holds even with unbounded computation and data. Third, the lower bound holds even if all the data are perfect, in the sense that all the data are sampled from the perfect translator on each task. Hence, the above result can be interpreted as a kind of uncertainty principle in the context of universal machine translation: any decoder based on language-invariant representations has to achieve a large translation error on at least one pair of translation tasks. We provide a proof-by-picture in Fig. 8.2.1 to illustrate the main idea underlying the proof of Theorem 8.2.1 in the special case where $\epsilon = 0$.


The lower bound is large whenever the distributions over target sentences differ between the two translation tasks. This often happens in practical scenarios where the parallel corpus of a high-resource language pair contains texts over a diverse domain, whereas, as a comparison, the parallel corpus of a low-resource language pair only contains target translations from a specific domain, e.g., sports, news, product reviews, etc. Such a negative impact on translation quality due to domain mismatch between source and target sentences has also recently been observed and confirmed in practical universal machine translation systems; see Shen et al. (2019) and Pires et al. (2019) for more empirical corroboration.
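The geometry of Fig. 8.2.1 can also be checked numerically. The following toy NumPy snippet (a hypothetical setup of ours, not an experiment from this chapter) instantiates two tasks whose target marginals differ; with a $0$-universal mapping, both tasks share a single output distribution, so by the triangle inequality the two translation errors must sum to at least the TV distance between the target marginals, exactly as (8.2) predicts.

```python
# Toy numerical illustration of Theorem 8.2.1 with epsilon = 0.
import numpy as np

targets = ["t0", "t1", "t2"]
p0 = np.array([0.8, 0.1, 0.1])    # D_{L0,L}(L): target marginal of task 0
p1 = np.array([0.1, 0.1, 0.8])    # D_{L1,L}(L): target marginal of task 1
tv = 0.5 * np.abs(p0 - p1).sum()  # d_TV(p0, p1) = 0.7

# A 0-universal mapping g gives both languages the SAME feature distribution,
# so any decoder h induces one output distribution q for both tasks. By
# Lemma 8.2.1, the error of task i is at least d_TV(q, p_i), and the two
# lower bounds sum to at least d_TV(p0, p1) by the triangle inequality.
rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet(np.ones(3))                # output distribution of h o g
    err_sum_lb = 0.5 * np.abs(q - p0).sum() + 0.5 * np.abs(q - p1).sum()
    assert err_sum_lb >= tv - 1e-12              # matches (8.2) with eps = 0
```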

8.2.2 Many-to-Many Translation

Theorem 8.2.1 presents a negative result in the setting where we have two source languages and one target language. Nevertheless, universal machine translation systems often involve multiple input and output languages simultaneously (Artetxe and Schwenk, 2019; Johnson et al., 2017; Ranzato et al., 2019; Wu et al., 2016). In this section we extend the previous lower bound from the simple two-to-one setting to the more general many-to-many setting.

To enable such an extension, i.e., to be able to make use of multilingual data within a single system, we modify the input sentence by introducing the language token $\langle L\rangle$ at the beginning of the input sentence to indicate the target language $L$ the model should translate to. This simple modification has already been proposed and used in practical MT systems (Johnson et al., 2017, Section 3). To give an example, consider the following English sentence to be translated to French,

$\langle$English$\rangle$ Hello, how are you?

It will be modified to:

$\langle$French$\rangle$ $\langle$English$\rangle$ Hello, how are you?

Note that the first token is used to indicate the target language to translate to, while the second one indicates the source language, avoiding ambiguity due to potentially overlapping alphabets between different languages.

Recall that in Definition 8.2.1 we defined a language map $g$ to be $\epsilon$-universal iff $d_{\mathrm{TV}}(g_\sharp\mathcal{D}_i, g_\sharp\mathcal{D}_j)\leq\epsilon$, $\forall i, j$. This definition is too stringent in the many-to-many translation setting, since it would imply that the feature representations lose the information about which target language to translate to. In what follows we first provide a relaxed definition of the $\epsilon$-universal language mapping in the many-to-many setting, and then show that even under this relaxed definition, learning a universal machine translator via language-invariant representations is impossible in the worst case.

Definition 8.2.2 ($\epsilon$-Universal Language Mapping, Many-to-Many). Let $\mathcal{D}_{L_i,L_k}$, $i, k\in[K]$, be the joint distribution of sentences (parallel corpus) in translating from $L_i$ to $L_k$. A map $g:\bigcup_{i\in[K]}\Sigma^*_{L_i}\to\mathcal{Z}$ is called an $\epsilon$-universal language mapping if there exists a partition $\mathcal{Z} = \sqcup_{k\in[K]}\mathcal{Z}_k$ such that $\forall k\in[K]$ and $\forall i\neq j$, $g_\sharp\mathcal{D}_{L_i,L_k}(L_i)$ and $g_\sharp\mathcal{D}_{L_j,L_k}(L_j)$ are supported on $\mathcal{Z}_k$ and $d_{\mathrm{TV}}\big(g_\sharp\mathcal{D}_{L_i,L_k}(L_i), g_\sharp\mathcal{D}_{L_j,L_k}(L_j)\big)\leq\epsilon$.

First of all, it is clear that when there is only one target language, Definition 8.2.2 reduces to

Definition 8.2.1. Next, the partition of the feature space $\mathcal{Z} = \sqcup_{k\in[K]}\mathcal{Z}_k$ essentially serves as a way to determine the target language $L_k$ the model should translate to. Note that it is important to enforce the partitioning condition on the feature space $\mathcal{Z}$; otherwise there will be ambiguity in determining the target language to translate to. For example, if the following two input sentences

$\langle$French$\rangle$ $\langle$English$\rangle$ Hello, how are you?
$\langle$Chinese$\rangle$ $\langle$English$\rangle$ Hello, how are you?

are mapped to the same feature representation $z\in\mathcal{Z}$, then it is not clear whether the decoder $h$ should translate $z$ to French or Chinese. With the above extensions, we are now ready to present the following theorem, which gives a lower

bound for both the maximum error and the average error in the many-to-many universal translation setting.

Theorem 8.2.2 (Lower bound, Many-to-Many). Consider a universal machine translation task where $\Sigma^* = \bigcup_{i\in[K]}\Sigma^*_{L_i}$. Let $\mathcal{D}_{L_i,L_k}$, $i, k\in[K]$, be the joint distribution of sentences (parallel corpus) in translating from $L_i$ to $L_k$. If $g:\Sigma^*\to\mathcal{Z}$ is an $\epsilon$-universal language mapping, then for any decoder $h:\mathcal{Z}\to\Sigma^*$, we have

\begin{align*}
\max_{i,k\in[K]}\ \varepsilon^{L_i\to L_k}_{\mathcal{D}_{L_i,L_k}}(h\circ g) &\geq \frac{1}{2}\max_{k\in[K]}\max_{i\neq j}\, d_{\mathrm{TV}}\big(\mathcal{D}_{L_i,L_k}(L_k), \mathcal{D}_{L_j,L_k}(L_k)\big) - \frac{\epsilon}{2}, \\
\frac{1}{K^2}\sum_{i,k\in[K]}\varepsilon^{L_i\to L_k}_{\mathcal{D}_{L_i,L_k}}(h\circ g) &\geq \frac{1}{K^2(K-1)}\sum_{k\in[K]}\sum_{i<j} d_{\mathrm{TV}}\big(\mathcal{D}_{L_i,L_k}(L_k), \mathcal{D}_{L_j,L_k}(L_k)\big) - \frac{\epsilon}{2}.
\end{align*}

Proof. Fix a target language $L_k$. Since $g$ is $\epsilon$-universal, applying Theorem 8.2.1 (restricted to $\mathcal{Z}_k$) to any pair of source languages $(L_i, L_j)$ with $i\neq j$ yields
\[
\varepsilon^{L_i\to L_k}_{\mathcal{D}_{L_i,L_k}}(h\circ g) + \varepsilon^{L_j\to L_k}_{\mathcal{D}_{L_j,L_k}}(h\circ g) \geq d_{\mathrm{TV}}\big(\mathcal{D}_{L_i,L_k}(L_k), \mathcal{D}_{L_j,L_k}(L_k)\big) - \epsilon. \tag{8.3}
\]

Now consider the pair of source languages $(L_{i^*}, L_{j^*})$ attaining the maximum $d_{\mathrm{TV}}\big(\mathcal{D}_{L_i,L_k}(L_k), \mathcal{D}_{L_j,L_k}(L_k)\big)$:

\begin{align*}
2\max_{i\in[K]}\ \varepsilon^{L_i\to L_k}_{\mathcal{D}_{L_i,L_k}}(h\circ g) &\geq \varepsilon^{L_{i^*}\to L_k}_{\mathcal{D}_{L_{i^*},L_k}}(h\circ g) + \varepsilon^{L_{j^*}\to L_k}_{\mathcal{D}_{L_{j^*},L_k}}(h\circ g) \\
&\geq \max_{i\neq j}\, d_{\mathrm{TV}}\big(\mathcal{D}_{L_i,L_k}(L_k), \mathcal{D}_{L_j,L_k}(L_k)\big) - \epsilon. \tag{8.4}
\end{align*}

Since the above lower bound (8.4) holds for any target language Lk, taking a maximum over the target languages yields:

\[
2\max_{i,k\in[K]}\ \varepsilon^{L_i\to L_k}_{\mathcal{D}_{L_i,L_k}}(h\circ g) \geq \max_{k\in[K]}\max_{i\neq j}\, d_{\mathrm{TV}}\big(\mathcal{D}_{L_i,L_k}(L_k), \mathcal{D}_{L_j,L_k}(L_k)\big) - \epsilon,
\]
which completes the first part of the proof. For the second part, again for a fixed target language $L_k$, to lower bound the average error we apply the triangle inequality in (8.3) iteratively over all pairs $i < j$, yielding:

\[
(K-1)\sum_{i\in[K]}\varepsilon^{L_i\to L_k}_{\mathcal{D}_{L_i,L_k}}(h\circ g) \geq \sum_{i<j} d_{\mathrm{TV}}\big(\mathcal{D}_{L_i,L_k}(L_k), \mathcal{D}_{L_j,L_k}(L_k)\big) - \frac{K(K-1)}{2}\,\epsilon.
\]
Summing the above inequality over the target languages $L_k$, $k\in[K]$, and dividing both sides by $K^2(K-1)$ completes the proof:

\[
\frac{1}{K^2}\sum_{i,k\in[K]}\varepsilon^{L_i\to L_k}_{\mathcal{D}_{L_i,L_k}}(h\circ g) \geq \frac{1}{K^2(K-1)}\sum_{k\in[K]}\sum_{i<j} d_{\mathrm{TV}}\big(\mathcal{D}_{L_i,L_k}(L_k), \mathcal{D}_{L_j,L_k}(L_k)\big) - \frac{\epsilon}{2}. \qquad\square
\]

How to Bypass this Limitation? There are various ways to get around the limitations pointed out by the theorems in this section. One way is to allow the decoder $h$ to have access to the input sentences (besides the language-invariant representations) during the decoding process, e.g., via an attention mechanism over the input. Technically, such information flow from the input sentences during decoding would break the Markov structure of "input-representation-output" in Fig. 8.2.1, which is an essential ingredient in the proofs of Theorem 8.2.1 and Theorem 8.2.2. Intuitively, in this case both language-invariant (hence language-independent) and language-dependent information would be used.

Another way would be to assume extra structure on the distributions $\mathcal{D}_{L_i,L_j}$, i.e., to assume some natural language generation process for the parallel corpora that are used for training (cf. Section 8.3). Since languages share a lot of semantic and syntactic characteristics, this would make a lot of sense; intuitively, this is what universal translation approaches are banking on. In the next section we will do exactly this: we will show that under a suitable generative model, not only will there be a language-invariant representation, but it will be learnable using corpora from only a small (linear) number of pairs of languages.

8.3 Sample Complexity under a Generative Model

The results from the prior sections showed that, absent additional assumptions on the distributions of the sentences in the corpus, there is a fundamental limitation on learning language-invariant representations for universal machine translation. Note that our negative result also holds in the setting where there exists a ground-truth universal machine translator: it is just that learning language-invariant representations cannot lead to the recovery of this ground-truth translator. In this section we show that with additional natural structure on the distribution of the corpora we can resolve this issue. The structure is a natural underlying generative model from which sentences from different languages are generated, which "models" a common encoder-decoder structure that has been frequently used in practice (Cho et al., 2014; Ha et al., 2016; Sutskever et al., 2014). Under this setting, we show that it is not only possible to learn the optimal translator, but also possible to do so while seeing documents from only a small subset of all the possible language pairs. Moreover, we will formalize a notion of "sample complexity" in terms of the number of pairs of languages for which parallel corpora are necessary, and show how it depends on the structure of the connection graph between language pairs. We first describe our generative model for languages and briefly discuss why such a generative model helps to overcome the negative result in Theorem 8.2.2.

8.3.1 Language Generation Process and Setup

Language Generative Process. The language generation process is illustrated in Fig. 8.3.1. Formally, we assume the existence of a shared "semantic space" $\mathcal{Z}$. Furthermore, for every language $L\in\mathcal{L}$, we have a "ground truth" pair of encoder and decoder $(E_L, D_L)$, where $E_L:\mathbb{R}^d\to\mathbb{R}^d$, $E_L\in\mathcal{F}$ is bijective and $D_L = E_L^{-1}$. We assume that $\mathcal{F}$ has a group structure under function composition: namely, for all $f_1, f_2\in\mathcal{F}$, we have $f_1^{-1}, f_2^{-1}\in\mathcal{F}$ and $f_1\circ f_2^{-1}, f_2\circ f_1^{-1}\in\mathcal{F}$ (e.g., a typical example of such a group is the general linear group $\mathcal{F} = \mathrm{GL}_d(\mathbb{R})$).

To generate a pair of aligned sentences for two languages $L, L'$, we first sample a $z\sim\mathcal{D}$, and subsequently generate
\[
x = D_L(z), \qquad x' = D_{L'}(z), \tag{8.5}
\]

where $x$ is a vector encoding of the appropriate sentence in $L$ (e.g., a typical encoding is a frequency count of the words in the sentence, or a sentence embedding using various neural network models (Kiros et al., 2015; Wang et al., 2017; Zhao et al., 2015a)). Similarly, $x'$ is the corresponding sentence in $L'$. Reciprocally, given a sentence $x$ from language $L$, the encoder $E_L$ maps the sentence $x$ into its corresponding latent vector in $\mathcal{Z}$: $z = E_L(x)$. We note that we assume this deterministic map between $z$ and $x$ for simplicity of exposition; in Section 8.3.4 we extend the results to the setting where $x$ has a conditional distribution given $z$ of a parametric form.

Figure 8.3.1: An encoder-decoder generative model of translation pairs. There is a global distribution $\mathcal{D}$ over the representation space $\mathcal{Z}$, from which sentences of language $L_i$ are generated via the decoder $D_i$. Similarly, sentences can be encoded into $\mathcal{Z}$ via the encoder $E_i$.

We will assume the existence of a graph $H$ capturing the pairs of languages for which we have aligned corpora; we can think of these as the "high-resource" pairs of languages. For each edge in this graph, we will have a corpus $S = \{(x_i, x'_i)\}_{i=1}^n$ of aligned sentences.² The goal will be to learn encoders/decoders that perform well on the potentially unseen pairs of languages. To this end, we will provide a sample complexity analysis for the number of paired sentences needed for each pair of languages with an edge in the graph, so we will need a measure of the complexity of $\mathcal{F}$. We will use the covering number, though our proofs are flexible, and similar results would hold for Rademacher complexity, VC dimension, or any of the usual complexity measures.

Definition 8.3.1 (Covering number). For any $\epsilon > 0$, the covering number $\mathcal{N}(\mathcal{F}, \epsilon)$ of the function class $\mathcal{F}$ under the $\ell_\infty$ norm is the minimum number $k\in\mathbb{N}$ such that $\mathcal{F}$ can be covered with $k$ ($\ell_\infty$) balls of radius $\epsilon$, i.e., there exist $\{f_1, \ldots, f_k\}\subseteq\mathcal{F}$ such that, for all $f\in\mathcal{F}$, there exists $i\in[k]$ with $\|f - f_i\|_\infty = \max_{x\in\mathbb{R}^d}\|f(x) - f_i(x)\|_2 \leq \epsilon$.

Finally, we will assume that the functions in $\mathcal{F}$ are bounded and Lipschitz:

Assumption 8.3.1 (Smoothness and Boundedness). $\mathcal{F}$ is bounded under the $\|\cdot\|_\infty$ norm, i.e., there exists $M > 0$ such that $\forall f\in\mathcal{F}$, $\|f\|_\infty\leq M$. Furthermore, there exists $0\leq\rho < \infty$ such that for all $x, x'\in\mathbb{R}^d$,

²In general each edge can have a different number of aligned sentences. We use the same number of aligned sentences $n$ just for ease of presentation.

\[
\forall f\in\mathcal{F}, \quad \|f(x) - f(x')\|_2 \leq \rho\cdot\|x - x'\|_2.
\]

Training Procedure. Turning to the training procedure, we will be learning an encoder $\widehat{E}_L\in\mathcal{F}$ for each language $L$ (we write $\widehat{E}_L$ for the trained encoder, to distinguish it from the ground-truth $E_L$). The decoder for that language will be $\widehat{E}_L^{-1}$, which is well defined since $\mathcal{F}$ has a group structure. Since we are working with a vector space, rather than using the (crude) 0-1 distance, we will work with a more refined loss metric for a translation task $L\to L'$:
\[
\varepsilon(\widehat{E}_L, \widehat{E}_{L'}) := \big\|\widehat{E}_{L'}^{-1}\circ\widehat{E}_L - E_{L'}^{-1}\circ E_L\big\|^2_{\ell_2(D_{L\,\sharp}\mathcal{D})}. \tag{8.6}
\]
Note that the $\ell_2$ loss is taken over the distribution of the input samples $D_{L\,\sharp}\mathcal{D} = E^{-1}_{L\,\sharp}\mathcal{D}$, which is the natural one under our generative process. Again, the above error measures the discrepancy between the predicted translation and the one given by the ground-truth translator, i.e., the composition of the encoder $E_L$ and the decoder $D_{L'}$. Straightforwardly, the empirical error over a corpus $S = \{(x_i, x'_i)\}_{i=1}^n$ of aligned sentences for a pair of languages $(L, L')$ is defined by

\[
\widehat{\varepsilon}_S(\widehat{E}_L, \widehat{E}_{L'}) := \frac{1}{n}\sum_{i\in[n]}\big\|\widehat{E}_{L'}^{-1}\circ\widehat{E}_L(x_i) - x'_i\big\|_2^2, \tag{8.7}
\]
where $S$ is generated by the generation process. Following the paradigm of empirical risk minimization, the loss used to train the encoders will be the obvious one:
\[
\min_{\{\widehat{E}_L,\, L\in\mathcal{L}\}}\ \sum_{(L,L')\in H}\widehat{\varepsilon}_S(\widehat{E}_L, \widehat{E}_{L'}). \tag{8.8}
\]

Remarks. Before we proceed, one natural question to ask is: how does this generative model assumption circumvent the lower bound in Theorem 8.2.2? To answer this question, note the following easy proposition:

Proposition 8.3.1. Under the encoder-decoder generative assumption, $\forall i, j\in[K]$, $d_{\mathrm{TV}}\big(\mathcal{D}_{L_i,L}(L), \mathcal{D}_{L_j,L}(L)\big) = 0$.

Proposition 8.3.1 holds because the marginal distribution of the target language $L$ under any pair of translation tasks equals the pushforward of $\mathcal{D}$ under $D_L$: $\forall i\in[K]$, $\mathcal{D}_{L_i,L}(L) = D_{L\,\sharp}\mathcal{D}$. Hence the lower bounds gracefully reduce to 0 under our encoder-decoder generative process, meaning that there is no loss of translation accuracy in using a universal language mapping.
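To build intuition for the generative process (8.5) and the role of the graph $H$, here is a toy NumPy instantiation of ours with $\mathcal{F} = \mathrm{GL}_d(\mathbb{R})$. Instead of solving (8.8) jointly, it fits each edge's linear translator by least squares, a simplified stand-in for the ERM objective, and then composes the learned translators along a path to translate an unseen pair.

```python
# Toy instantiation of the generative process (8.5) with linear encoders.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000
decoders = [rng.normal(size=(d, d)) for _ in range(4)]   # D_L, invertible a.s.

def corpus(i, j, n=n):
    """Aligned sentences for the pair (L_i, L_j): x = D_i z, x' = D_j z."""
    z = rng.normal(size=(n, d))                           # z ~ D over Z
    return z @ decoders[i].T, z @ decoders[j].T

# H is a path 0 - 1 - 2 - 3: a linear number of aligned pairs.
edges = [(0, 1), (1, 2), (2, 3)]
T = {}
for i, j in edges:
    X, Xp = corpus(i, j)
    # least-squares estimate of this edge's translator x -> x'
    T[(i, j)], *_ = np.linalg.lstsq(X, Xp, rcond=None)

# Translate the UNSEEN pair (L_0, L_3) by composing along the path in H.
X, Xp = corpus(0, 3, n=500)
pred = X @ T[(0, 1)] @ T[(1, 2)] @ T[(2, 3)]
print("mean squared error on unseen pair:",
      np.mean(np.sum((pred - Xp) ** 2, axis=1)))  # ~0 up to numerical error
```

In this noiseless linear case the composition recovers the ground-truth translator of the unseen pair exactly; Theorem 8.3.1 below quantifies the general case, with errors accumulating along the path.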

8.3.2 Main Result: Translation between Arbitrary Pairs of Languages

The main theorem we prove is that if the graph $H$ capturing the pairs of languages for which we have aligned corpora is connected, then given sufficiently many sentences for each pair, we will learn encoders/decoders that perform well on the unseen pairs. Moreover, we can characterize how good the translation will be, based on the distance between the languages in the graph. Concretely:

Theorem 8.3.1 (Sample complexity under the generative model). Suppose $H$ is connected. Furthermore, suppose the trained $\{\widehat{E}_L\}_{L\in\mathcal{L}}$ satisfy
\[
\forall (L, L')\in H: \quad \widehat{\varepsilon}_S(\widehat{E}_L, \widehat{E}_{L'}) \leq e_{L,L'},
\]
for $e_{L,L'} > 0$. Furthermore, for $0 < \delta < 1$, suppose the number of sentences in the aligned corpus for each training pair $(L, L')$ is $\Omega\Big(\frac{1}{e_{L,L'}^2}\big(\log\mathcal{N}\big(\mathcal{F}, \frac{e_{L,L'}}{16M}\big) + \log(K/\delta)\big)\Big)$. Then, with probability $1-\delta$, for any pair of languages $(L, L')\in\mathcal{L}\times\mathcal{L}$ and any path $L = L_1, L_2, \ldots, L_m = L'$ between $L$ and $L'$ in $H$, we have
\[
\varepsilon(\widehat{E}_L, \widehat{E}_{L'}) \leq 2\rho^2\sum_{k=1}^{m-1} e_{L_k,L_{k+1}}.
\]


Figure 8.3.2: A translation graph $H$ over $K = 6$ languages. The existence of an edge between a pair of nodes $L_i$ and $L_j$ means that the learner has been trained on the corresponding language pair. In this example the diameter of the graph is $\mathrm{diam}(H) = 4$, realized by the path $L_3, L_1, L_4, L_5, L_6$.

Remark. We make several remarks about the theorem statement. Note that the guarantee is in terms of translation error rather than parameter recovery. In fact, due to the identifiability issue, we cannot hope to recover the ground-truth encoders $\{E_L\}_{L\in\mathcal{L}}$: it is easy to see that composing all the encoders with an invertible mapping $f\in\mathcal{F}$ and composing all the decoders with $f^{-1}\in\mathcal{F}$ produces exactly the same outputs.

Furthermore, the upper bound is adaptive, in the sense that for any language pair $(L_i, L_j)$, the error depends on the sum of the errors along a path connecting $(L_i, L_j)$ in the translation graph $H$. One can naturally think of the low-error edges as resource-rich pairs: if the function class $\mathcal{F}$ is parametrized by a finite-dimensional parameter space of dimension $p$, then using a standard result on the covering number of finite-dimensional vector spaces (Anthony and Bartlett, 2009), we know that $\log\mathcal{N}\big(\mathcal{F}, \frac{e}{16M}\big) = \Theta(p\log(1/e))$; as a consequence, the number of documents needed for a pair scales as $\log(1/e_{L,L'})/e_{L,L'}^2$.

As an immediate corollary of the theorem, if we assume $e_{L,L'}\leq e$ for all $(L, L')\in H$, we have $\varepsilon(\widehat{E}_L, \widehat{E}_{L'})\leq 2\rho^2 d_{L,L'}\cdot e$, where $d_{L,L'}$ is the length of the shortest path connecting $L$ and $L'$ in $H$. It also immediately follows that for any pair of languages $L, L'$, we have $\varepsilon(\widehat{E}_L, \widehat{E}_{L'})\leq 2\rho^2\,\mathrm{diam}(H)\cdot e$, where $\mathrm{diam}(H)$ is the diameter of $H$; thus the intuitive conclusion that graphs that do not have long paths are preferable.

The upper bound in Theorem 8.3.1 also provides a counterpoint to the lower bound, showing that under a generative model for the data, it is possible to learn a pair of encoder/decoder for each language pair after seeing aligned corpora for only a linear number of pairs of languages (and not a quadratic one), corresponding to those captured by the edges of the translation graph $H$. As a final note, we would like to point out that an analogous bound can be proved easily for other losses, such as the 0-1 loss or the general $\ell_p$ loss.
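The adaptive nature of the bound suggests a simple computational reading: given per-edge error levels $e_{L,L'}$, the best guarantee for an arbitrary pair follows the path minimizing the accumulated error. A small sketch of ours (with hypothetical numbers and $\rho = 1$) is given below.

```python
# Computing the adaptive bound of Theorem 8.3.1 over a translation graph H:
# run Dijkstra on the per-edge error levels e_{L,L'} and report
# 2 * rho^2 * (minimum accumulated error) for the requested pair.
import heapq

def pairwise_bound(edge_err, src, dst, rho):
    adj = {}
    for (u, v), e in edge_err.items():
        adj.setdefault(u, []).append((v, e))
        adj.setdefault(v, []).append((u, e))
    best, heap = {src: 0.0}, [(0.0, src)]
    while heap:
        cost, u = heapq.heappop(heap)
        if u == dst:
            return 2 * rho**2 * cost
        if cost > best.get(u, float("inf")):
            continue
        for v, e in adj.get(u, []):
            if cost + e < best.get(v, float("inf")):
                best[v] = cost + e
                heapq.heappush(heap, (best[v], v))
    return float("inf")  # pair not connected in H

# Example on the graph of Fig. 8.3.2 with uniform edge error 0.01 and rho = 1:
edges = [("L3", "L1"), ("L1", "L4"), ("L4", "L5"), ("L5", "L6"), ("L1", "L2")]
print(pairwise_bound({e: 0.01 for e in edges}, "L3", "L6", rho=1.0))  # ~0.08
```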

8.3.3 Proof Sketch of the Theorem Before we provide the proof for the theorem, we first state several useful lemmas that will be used during our analysis.

140 Concentration Bounds The first step is to prove a concentration bound for the translation loss metric on each pair of languages. In this case, it will be easier to write the losses in terms of one single 1 function: namely notice that in fact ε(EL, E0 ) only depends on E− EL, and due to the group structure, L L0 1 ◦ f := E− EL. To that end, we will abuse the notation somewhat and denote ε( f ) := ε(EL, E0 ). The L0 L followingF 3 lemma◦ is adapted from Bartlett (1998) where the bound is given in terms of binary classification error while here we present a bound using `2 loss. At a high level, the bound uses covering number to concentrate the empirical loss metric to its corresponding population counterpart. n Lemma 8.3.1. If S = (xi, xi0) i=1 is sampled i.i.d. according to the encoder-decoder generative process, the following bound holds:{ } ! e  ne2  Pr sup ε( f ) ε ( f ) e 2 ( , ) exp − . n bS 4 S f | − | ≥ ≤ N F 16M · 16M ∼D ∈F This lemma can be proved using a e-net argument with covering number. With this lemma, we can bound the error given by an empirical risk minimization algorithm: Theorem 8.3.2. (Generalization, single task) Let S be a sample of size n according to our generative process. Then for any 0 < δ < 1, for any f , w.p. at least 1 δ, the following bound holds: ∈ F − s  log ( , e ) + log(1/δ) ε( f ) ε ( f ) + O  N F 16M  . (8.9) ≤ bS n

Theorem 8.3.2 is a finite sample bound for generalization on a single pair of languages. This bound gives us an error measure on an edge in the translation graph in Fig. 8.3.2. Now, with an upper bound on the translation error of each seen language pair, we are ready to prove the main theorem (Theorem 8.3.1) which bounds the translation error for all possible pairs of translation tasks:

Proof of Theorem 8.3.1. First, under the assumption of Theorem 8.3.1, for any pair of language (L, L0),    e  we know that the corpus contains at least 1 ( L,L0 ) + ( ) parallel sentences. Ω e2 log , 16M log K/δ L,L0 · N F Then by Theorem 8.3.2, with probability 1 δ, for any L, L connected by an edge in H, we have − 0 ε(E , E ) ε(E , E ) + e e + e = 2e , L L0 ≤ b L L0 L,L0 ≤ L,L0 L,L0 L,L0 where the last inequality is due to the assumption that ( ) . Now consider any , bε EL, EL0 eL,L0 L, L0 connected by a path ≤ ∈ L × L L0 = L1, L2, L3,..., Lm = L of length at most m. We will bound the error

1 1 2 ε(EL, EL0 ) = EL− EL E−L EL ` (D ) k 0 ◦ − 0 ◦ k 2 L]D by a judicious use of the triangle inequality. Namely, let’s denote

1 I1 := E− EL , L1 ◦ m 1 1 Ik := E− EL E− EL , 2 k m 1, L1 ◦ k ◦ Lk ◦ m ≤ ≤ − 1 Im := E− EL . L1 ◦ m

141 Then, we can write

m 1 m 1 1 1 − − E− EL E− EL = I I + I I + . (8.10) L0 L0 `2(DL] ) ∑ k k 1 `2(DL] ) ∑ k k 1 `2(DL] ) k ◦ − ◦ k D k k=1 − k D ≤ k=1 k − k D Furthermore, notice that we can rewrite I I as k − k+1 1  1 1  1 E− EL E− EL E− EL E− EL . L1 ◦ k Lk ◦ k+1 − Lk ◦ k+1 Lk+1 ◦ m 1 Given that E− and E are ρ-Lipschitz we have L1 Lk

1  1 1  I I = E− E E− E E− E k k+1 `2(DL] ) L1 Lk Lk Lk+1 Lk Lk+1 k − k ◦ ◦ − ◦ `2(D ) D Lk+1 ]D 2  1 1  ρ E− EL E− EL ≤ Lk ◦ k+1 − Lk ◦ k+1 ` (D ) 2 Lk+1 ]D 2ρ2e , ≤ Lk,Lk+1 where the first line is from the definition of pushforward distribution, the second line is due to the Lipschitzness of and the last line follows since all (Lk, Lk+1) are edges in H. Plugging this into (8.10), we have F m 1 1 2 E− EL E− EL 2ρ eL ,L . L0 L0 `2(DL] ) ∑ k k+1 k ◦ − ◦ k D ≤ k=1 To complete the proof, realize that we need the events ε(L , L ) ε(L , L ) e to hold | k k+1 − b k k+1 | ≤ Lk,Lk+1 simultaneously for all the edges in the graph H. Hence it suffices if we can use a union bound to bound the failing probability. To this end, for each edge, we amplify the success probability by choosing the failure probability to be δ/K2, and we can then bound the overall failure probability as: Pr (At least one edge in the graph H fails to satisfy (8.9))   ( ) ( ) > ∑ Pr ε Li, Lj bε Li, Lj eLi,Lj ≤ (i,j) H | − | ∈ ∑ δ/K2 ≤ (i,j) H ∈ K(K 1) δ − ≤ 2 · K2 δ. ≤ The first inequality above is due to the union bound, and the second one is from Theorem 8.3.2 by choosing the failing probability to be δ/K2. 

8.3.4 Extension to Randomized Encoders and Decoders Our discussions so far on the sample complexity under the encoder-decoder generative process assume that the ground-truth encoders and decoders are deterministic and bijective. This might seem to be a quite restrictive assumption, but nevertheless our underlying proof strategy using transitions on the translation graph still works in more general settings. In this section we shall provide an extension of the previous deterministic encoder-decoder generative process to allow randomness in the generation process. Note that this extension simultaneously relaxes both the deterministic and bijective assumptions before. As a first step of the extension, since there is not a notion of inverse function anymore in the randomized setting, we first define the ground-truth encoder-decoder pair (E , D ) for a language L . L L ∈ L 142 Definition 8.3.2. Let and be two distributions over random seeds r and r respectively. A random- Dr Dr0 0 ized decoder DLi is a deterministic function that maps a feature z along with a random seed r to a sentences in language Li. Similarly, a randomized encoder EL maps a sentence x Σ∗ and a random seed r0 to i ∈ Li a representation in . (ELi , DLi ) is called an encoder-decoder pair if it keeps the distribution over invariant under the randomnessZ of and : D Z Dr Dr0 E (D ( ) ) = , (8.11) Li ] Li ] D × Dr × Dr0 D where we use to denote the product measure of distributions and . D × D0 D D0 Just like the deterministic setting, here we still assume that E , D where is closed under Li Li ∈ F F function composition. Furthermore, in order to satisfy Definition 8.3.2, we assume that D , ∀ Li ∈ F there exists a corresponding ELi , such that (ELi , DLi ) is an encoder-decoder pair that verifies Definition 8.3.2. It is clear that the∈ deterministic F encoder-decoder pair in Section 8.3.1 is a special case 1 of that in Definition 8.3.2: in that case DL = E− so that EL DL = id , the identity map over i Li i ◦ i Z feature space . Furthermore there is no randomness from r and r0, hence the invariant criterion becomes Z , which trivially holds. ELi ]DLi ] = (ELi DLi )] = id ] = The randomnessD ◦ mechanismD inZ DefinitionD D 8.3.2 has several practical implementations in practice. For example, the denoising autoencoder (Vincent et al., 2008), the encoder part of the conditional generative adversarial network (Mirza and Osindero, 2014), etc. Again, in the randomized setting we still need to have an assumption on the structure of , but this time a relaxed one: F Assumption 8.3.2 (Smoothness and Boundedness). is bounded under the norm, i.e., there exists F k · k∞ M > 0, such that f , f M. Furthermore, there exists 0 ρ < ∞, such that for x, x Rd, ∀ ∈ F k k∞ ≤ ≤ ∀ 0 ∈ f , E [ f (x, r) f (x0, r)] 2 ρ x x0 2. ∀ ∈ F k Dr − k ≤ · k − k Correspondingly, we also need to slightly extend our loss metric under the randomized setting to the following: E 2 ε(EL, DL0 ) := r,r0 DL0 EL DL0 EL ` (D ( )), k ◦ − ◦ k 2 L] D×Dr where the expectation is taken over the distributions over random seeds r and r0. The empirical error could be extended in a similar way by replacing the population expectation with the empirical expectation. With the above extended definitions, now we are ready to state the following generalization theorem under randomized setting: Theorem 8.3.3. (Sample complexity under generative model, randomized setting) Suppose H is connected and the trained EL L satisfy { } ∈L L, L0 H : ε (E , D ) e , ∀ ∈ bS L L0 ≤ L,L0 for e , > 0. 
Furthermore, for 0 < δ < 1 suppose the number of sentences for each aligned corpora for L L0    e  each training pair ( ) is 1 ( L,L0 ) + ( ) . Then, with probability , L, L0 Ω e2 log , 16M log K/δ 1 δ L,L0 · N F − for any pair of languages (L, L0) and L = L1, L2,..., Lm = L0 a path between L and L0 in H, 2 m 1 ∈ L × L we have ε(E , D ) 2ρ ∑ − e . L L0 ≤ k=1 Lk,Lk+1 We comment that Theorem 8.3.3 is completely parallel to Theorem 8.3.1, except that we use generalized definitions under the randomized setting instead. Hence all the discussions before on Theorem 8.3.1 also apply here.

8.4 Related Work

Multilingual Machine Translation Early studies on multilingual machine translation mostly focused on pivot methods (Cohn and Lapata, 2007; De Gispert and Marino; Och and Ney, 2001; Utiyama and

143 Isahara, 2007) that use one pivot language to connect the translation between ultimate source and target languages, and train two separate statistical translation models (Koehn et al., 2003) to perform source-to- pivot and pivot-to-target translations. Since the successful application of encoder-decoder architectures in sequential tasks (Sutskever et al., 2014), neural machine translation (Bahdanau et al., 2014; Wu et al., 2016) has made it feasible to jointly learn from parallel corpora in multiple language pairs, and perform translation to multiple languages by a single model. Existing studies have been proposed to explore different variants of encoder-decoder architectures by using separate encoders (decoders) for multiple source (target) languages (Dong et al., 2015; Firat et al., 2016a,b; Platanios et al., 2018; Zoph and Knight, 2016) or sharing the weight of a single encoder (decoder) for all source (target) languages (Ha et al., 2016; Johnson et al., 2017). Recent advances of universal neural machine translation have also been applied to improve low-resource machine translation (Aharoni et al., 2019; Arivazhagan et al., 2019; Gu et al., 2018; Neubig and Hu, 2018) and downstream NLP tasks (Artetxe and Schwenk, 2019; Schwenk and Douze, 2017). Despite the recent empirical success in the literature, theoretical understanding is only nascent. Our work takes a first step towards better understanding the limitation of existing approaches and proposes a sufficient generative assumption that guarantees the success of universal machine translation.

Invariant Representations The line of work on seeking a shared multilingual embedding space started from learning cross-lingual word embeddings from parallel corpora (Gouws et al., 2015; Luong et al., 2015) or a bilingual dictionary (Artetxe et al., 2017; Conneau et al., 2018a; Faruqui and Dyer, 2014; Mikolov et al., 2013), and later extended to learning cross-lingual contextual representations (Conneau et al., 2019; Devlin et al., 2019; Huang et al., 2019; Lample and Conneau, 2019) from monolingual corpora. The idea of learning invariant representations is not unique in machine translation. In fact, similar ideas have already been used in other contexts, including domain adaptation (Combes et al., 2020; Ganin et al., 2016; Zhao et al., 2018b, 2019e), fair representations (Zemel et al., 2013; Zhang et al., 2018; Zhao and Gordon, 2019; Zhao et al., 2019c) and counterfactual reasoning in causal inference (Johansson et al., 2016; Shalit et al., 2017). Different from these existing work which mainly focuses on binary classification, our work provides the first impossibility theorem on learning language-invariant representations in terms of recovering a perfect translator under the setting of seq-to-seq learning.

8.5 Proofs

In this section we provide all the missing proofs in Section 8.3. Again, in what follows we will first restate the corresponding theorems for the ease of reading and then provide the detailed proofs. n Lemma 8.3.1. If S = (xi, xi0) i=1 is sampled i.i.d. according to the encoder-decoder generative process, the following bound holds:{ } ! e  ne2  Pr sup ε( f ) ε ( f ) e 2 ( , ) exp − . n bS 4 S f | − | ≥ ≤ N F 16M · 16M ∼D ∈F Proof. For f , define ` ( f ) := ε( f ) ε ( f ) to be the generalization error of f on sample S. The first ∈ F S − bS step is to prove the following inequality holds for f , f and any sample S: ∀ 1 2 ∈ F ` ( f ) ` ( f ) 8M f f . | S 1 − S 2 | ≤ · k 1 − 2k∞ In other words, ` ( ) is a Lipschitz function in w.r.t. the ` norm. To see, by definition of the S · F ∞

144 generalization error, we have

` ( f ) ` ( f ) | S 1 − S 2 | = ε( f ) ε ( f ) ε( f ) + ε ( f ) | 1 − bS 1 − 2 bS 2 | ε( f ) ε( f ) + ε ( f ) ε ( f ) . ≤ | 1 − 2 | |bS 1 − bS 2 | To get the desired upper bound, it suffices for us to bound ε( f ) ε( f ) by f f and the same | 1 − 2 | k 1 − 2k∞ technique could be used to upper bound ε ( f ) ε ( f ) since the only difference lies in the measure |bS 1 − bS 2 | where the expectation is taken over. We now proceed to upper bound ε( f ) ε( f ) : | 1 − 2 | 2 2 ε( f1) ε( f2) = Ez [ f1(x) x0 2] Ez [ f2(x) x0 2] | − | ∼D k − k − ∼D k − k 2 2 T = Ez [ f1(x) 2 f2(x) 2 2x0 ( f1(x) f2(x))] ∼D k k − k k − − T T Ez ( f1(x) f2(x)) ( f1(x) + f2(x)) 2x0 ( f1(x) f2(x)) ≤ ∼D − − − h T i h T i Ez ( f1(x) f2(x)) ( f1(x) + f2(x)) + 2Ez x0 ( f1(x) f2(x)) ≤ ∼D − ∼D −   Ez [ f1(x) f2(x) f1(x) + f2(x) ] + 2Ez x0 f1(x) f2(x) ≤ ∼D k − k · k k ∼D k k · k − k 2MEz [ f1(x) f2(x) ] + 2MEz [ f1(x) f2(x) ] ≤ ∼D k − k ∼D k − k 4M f f . ≤ k 1 − 2k∞ In the proof above, the first inequality holds due to the monotonicity property of integral. The second inequality holds by triangle inequality. The third one is due to Cauchy-Schwarz inequality. The fourth inequality holds by the assumption that f , maxx f (x) M and the identity mapping is in ∀ ∈ F ∈X k k ≤ so that x0 = id(x0) id( ) ∞ M. The last one holds due to the monotonicity property of integral.F k k k k ≤ k · k ≤ It is easy to see that the same argument could also be used to show that ε ( f ) ε ( f ) 4M f |bS 1 − bS 2 | ≤ k 1 − f . Combine these two inequalities, we have 2k∞ ` ( f ) ` ( f ) ε( f ) ε( f ) + ε ( f ) ε ( f ) | S 1 − S 2 | ≤ | 1 − 2 | |bS 1 − bS 2 | 8M f f . ≤ k 1 − 2k∞

In the next step, we show that suppose could be covered by k subsets 1,..., k, i.e., = i [k] i. F C C F ∪ ∈ C Then for any e > 0, the following upper bound holds:   Pr sup `S( f ) e Pr sup `S( f ) e . S n | | ≥ ≤ ∑ S n | | ≥ ∼D f i [k] ∼D f i ∈F ∈ ∈C This follows from the union bound:  [  Pr sup ` ( f ) e = Pr sup ` ( f ) e n S n S S f | | ≥ S f | | ≥ ∼D ∈F ∼D i [k] ∈Ci ∈  Pr sup `S( f ) e . ≤ ∑ S n | | ≥ i [k] ∼D f i ∈ ∈C e Next, within each L∞ ball i centered at fi with radius 16M such that i [k] i, we bound each term in the above union bound as:C F ⊆ ∪ ∈ C   Pr sup ` ( f ) e Pr ` ( f ) e/2 . n S n S i S f | | ≥ ≤ S | | ≥ ∼D ∈Ci ∼D 145 To see this, realize that f , we have f f e/16M, which implies ∀ ∈ Ci k − ik∞ ≤ e ` ( f ) ` ( f ) 8M f f . | S − S i | ≤ k − ik∞ ≤ 2

Hence we must have `S( fi) e/2, otherwise sup f `S( f ) < e. This argument means that | | ≥ ∈Ci | |   Pr sup ` ( f ) e Pr ` ( f ) e/2 . n S n S i S f | | ≥ ≤ S | | ≥ ∼D ∈Ci ∼D  To finish the proof, we use the standard Hoeffding inequality to upper bound PrS n `S( fi) e/2 as follows: ∼D | | ≥   Pr `S( fi) e/2 = Pr ε( fi) bεS( fi) e/2 S n | | ≥ S n | − | ≥ ∼D ∼D  2n2(e/2)2  2 exp ≤ − n((2M)2 0)2 −  ne2  = 2 exp . −16M4 Now combine everything together, we obtain the desired upper bound as stated in the lemma. ! e  ne2  Pr sup ε( f ) ε ( f ) e 2 ( , ) exp − . n bS 4  S f | − | ≥ ≤ N F 16M · 16M ∼D ∈F We next prove the generalization bound for a single pair of translation task: Theorem 8.3.2. (Generalization, single task) Let S be a sample of size n according to our generative process. Then for any 0 < δ < 1, for any f , w.p. at least 1 δ, the following bound holds: ∈ F − s  log ( , e ) + log(1/δ) ε( f ) ε ( f ) + O  N F 16M  . (8.9) ≤ bS n

Proof. This is a direct corollary of Lemma 8.3.1 by setting the upper bound in Lemma 8.3.1 to be δ and solve for e.  We now provide the proof sketch of Theorem 8.3.3. The main proof idea is exactly the same as the one we have in the deterministic setting, except that we replace the original definitions of errors and Lipschitzness with the generalized definitions under the randomized setting. Theorem 8.3.3. (Sample complexity under generative model, randomized setting) Suppose H is connected and the trained EL L satisfy { } ∈L

L, L0 H : ε (E , D ) e , ∀ ∈ bS L L0 ≤ L,L0 for e , > 0. Furthermore, for 0 < δ < 1 suppose the number of sentences for each aligned corpora for L L0    e  each training pair ( ) is 1 ( L,L0 ) + ( ) . Then, with probability , L, L0 Ω e2 log , 16M log K/δ 1 δ L,L0 · N F − for any pair of languages (L, L0) and L = L1, L2,..., Lm = L0 a path between L and L0 in H, 2 m 1 ∈ L × L we have ε(E , D ) 2ρ ∑ − e . L L0 ≤ k=1 Lk,Lk+1

146 Proof Sketch. The first step is prove the corresponding error concentration lemma using covering numbers as the one in Lemma 8.3.1. Again, due to the assumption that is closed under composition, we have F D E , hence it suffices if we could prove a uniform convergence bound for an arbitrary function L0 ◦ L ∈ F f . To this end, for f , define ` ( f ) := ε( f ) ε ( f ) to be the generalization error of f on ∈ F ∈ F S − bS sample S. The first step is to prove the following inequality holds for f , f and any sample S: ∀ 1 2 ∈ F ` ( f ) ` ( f ) 8M f f . | S 1 − S 2 | ≤ · k 1 − 2k∞

In other words, `S( ) is a Lipschitz function in w.r.t. the `∞ norm. To see this, by definition of the generalization error,· we have F

` ( f ) ` ( f ) = ε( f ) ε ( f ) ε( f ) + ε ( f ) ε( f ) ε( f ) + ε ( f ) ε ( f ) . | S 1 − S 2 | | 1 − bS 1 − 2 bS 2 | ≤ | 1 − 2 | |bS 1 − bS 2 | To get the desired upper bound, it suffices for us to bound ε( f ) ε( f ) by f f and the same | 1 − 2 | k 1 − 2k∞ technique could be used to upper bound bεS( f1) bεS( f2) since the only difference lies in the measure where the expectation is taken over. | − | Before we proceed, in order to make the notation uncluttered, we first simplify ε( f ): h i E 2 ε( f ) = r,r0 f DL0 EL ` (D ( )) . k − ◦ k 2 L] D×Dr Define to mean the sampling process of , and z (x, r, r0) DL]( r) Dr Dr0 x := (x, r, r0) ∼ D . Then ∼ D × D × × x0 := DL0 (EL(x, r0), r) h i E 2 ε( f ) = r,r0 f DL0 EL ` (D ( )) k − ◦ k 2 L] D×Dr 2 = Ez [ f (x) x0 2]. ∼D k − k With the simplified notation, it is now clear that we essentially reduce the problem in the randomized setting to the original one in the deterministic setting. Hence by using exactly the same proof as the one of Lemma 8.3.1, we can obtain the following high probability bound: ! e  ne2  Pr sup ε( f ) bεS( f ) e 2 ( , ) exp − 4 . f | − | ≥ ≤ N F 16M · 16M ∈F As a direct corollary, a similar generalization bound for a single pair of translation task like the one in Theorem 8.3.2 also holds. To finish the proof, by the linearity of the expectation E , it is clear that exactly r,r0 the same chaining argument in the proof of Theorem 8.3.1 could be used as well as the only thing we need to do is to take an additional expectation E at the most outside level. r,r0 

8.6 Conclusion

In this chapter we provided the first theoretical study on using language-invariant representations for universal machine translation. Our results are two-fold. First, we showed that without appropriate assumption on the generative structure of languages, there is an inherent tradeoff between learning language- invariant representations versus achieving good translation performance jointly in general. In particular, our results show that if the distributions (language models) of the target language differ between different translation pairs, then any machine translation method based on learning language-invariant representations is bound to achieve a large error on at least one of the translation tasks, even with unbounded computational resources.

147 On the positive side, we also show that, under appropriate generative model assumption of languages, e.g., a typical encoder-decoder model, it is not only possible to recover the ground-truth translator between any pair of languages that appear in the parallel corpora, but also we can hope to achieve a small translation error on sentences from unseen pair of languages, as long as they are connected in the so-called translation graph. This result holds in both deterministic and randomized settings. In addition, our result also characterizes how the relationship (distance) between these two languages in the graph affects the quality of translation in an intuitive manner: a graph with long connections results in a poorer translation.

148 Part II

Learning Tractable Circuits for Probabilistic Reasoning

149

Chapter 9

A Unified Framework for Parameter Learning

In this chapter we present a unified approach for learning the parameters of Sum-Product networks (SPNs). We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we characterize the objective function when learning SPNs based on the maximum likelihood estimation (MLE) principle and show that the optimization problem can be formulated as a signomial program. We construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. The two proposed methods naturally admit multiplicative updates, hence effectively avoiding the projection operation. With the help of the unified framework, we also show that, in the case of SPNs, CCCP leads to the same algorithm as Expectation Maximization (EM) despite the fact that they are different in general.

151 9.1 Sum-Product Networks as a Mixture of Trees

We introduce the notion of induced trees from SPNs and use it to show that every complete and decom- posable SPN can be interpreted as a mixture of induced trees, where each induced tree corresponds to a product of univariate distributions. From this perspective, an SPN can be understood as a huge mixture model where the effective number of components in the mixture is determined by its network structure. The method we describe here is not the first method for interpreting an SPN (or the related arithmetic circuit) as a mixture distribution Chan and Darwiche; Dennis and Ventura (2015); Zhao et al. (2015b); but, the new method can result in an exponentially smaller mixture, see the end of this section for more details. Definition 9.1.1 (Induced Sum-Product Network). Given a complete and decomposable SPN over X , S [n] let = ( , ) be a subgraph of . is called an induced SPN from if T TV TE S T S 1. Root( ) . S ∈ TV 2. If v V is a sum node, then exactly one child of v in is in V, and the corresponding edge is in . ∈ T S T TE 3. If v V is a product node, then all the children of v in are in V, and the corresponding edges are in∈ T . S T TE For notational convenience we will call an induced SPN by omitting the fact that is induced from if there is no confusion in the context. T T S Theorem 9.1.1. If is an induced SPN from a complete and decomposable SPN , then is a tree that is complete and decomposable.T S T As a result of Thm. 9.1.1, we will use the terms induced SPNs and induced trees interchangeably. With some abuse of notation, we use (x) to mean the value of the network polynomial of with input vector T T x. n I Theorem 9.1.2. If is an induced tree from over X[n], then (x) = ∏(v ,v ) wij ∏i=1 xi , where T S T i j ∈TE wij is the edge weight of (vi, vj) if vi is a sum node and wij = 1 if vi is a product node. Remark. Although we focus our attention on Boolean random variables for the simplicity of discussion and illustration, Thm. 9.1.2 can be extended to the case where the univariate distributions at the leaf nodes are continuous or discrete distributions with countably infinitely many values, e.g., Gaussian distributions n I or Poisson distributions. We can simply replace the product of univariate distributions term, ∏i=1 xi , in n Thm. 9.1.2 to be the general form ∏i=1 Pri(Xi), where Pri(Xi) is a univariate distribution over Xi. Also note that it is possible for two unique induced trees to share the same product of univariate distributions, but in this case their weight terms are guaranteed to be different. As we will see shortly, ∏(vi,vi) E wij ∈T n Thm. 9.1.2 implies that the joint distribution over Xi i=1 represented by an SPN is essentially a mixture model with potentially exponentially many components{ } in the mixture. Definition 9.1.2 (Network cardinality). The network cardinality τ of an SPN is the number of unique induced trees. S S

Theorem 9.1.3. τ = Vroot(1 1), where Vroot(1 1) is the value of the network polynomial of with S | | S input vector 1 and all edge weights set to be 1. τ Theorem 9.1.4. (x) = ∑ S (x), where is the tth unique induced tree of . S t=1 Tt Tt S Remark. The above four theorems prove the fact that an SPN is an ensemble or mixture of trees, S where each tree computes an unnormalized distribution over X . The total number of unique trees in is [n] S the network cardinality τ , which only depends on the structure of . Each component is a simple product of univariate distributions.S We illustrate the theorems above with aS simple example in Fig. 9.1.1. Zhao et al. (2015b) show that every complete and decomposable SPN is equivalent to a bipartite Bayesian network with a layer of hidden variables and a layer of observable random variables. The number

152 + + + + w1 w3 w2 × × × = w1 × +w2 × +w3 ×

X1 X1 X2 X2 X1 X2 X1 X2 X1 X2

Figure 9.1.1: A complete and decomposable SPN is a mixture of induced trees. Double circles indicate univariate distributions over X1 and X2. Different colors are used to highlight unique induced trees; each induced tree is a product of univariate distributions over X1 and X2. of hidden variables in the bipartite Bayesian network is equal to the number of sum nodes in . A naive S O( S ) expansion of such Bayesian network to a mixture model will lead to a huge mixture model with 2 | | components, where S is the number of sum nodes in . Here we complement the theory and show that each complete and decomposable| | SPN is essentially a mixtureS of trees and the effective number of unique induced trees is given by τ . Note that τ = Vroot(1 1) depends only on the network structure, and can S O( S ) S | often be much smaller than 2 | | . Without loss of generality, assuming that in layers of sum nodes are h S alternating with layers of product nodes, then Vroot(1 1) = Ω(2 ), where h is the height of . However, the exponentially many trees are recursively merged and| combined in such that the overall networkS size is still tractable. S

9.2 Maximum Likelihood Estimation as Signomial Programming

Let’s consider the likelihood function computed by an SPN over n binary random variables with model S parameters w and input vector x 0, 1 n. Here the model parameters in are edge weights from every ∈ { } p S sum node, and we collect them together into a long vector w R++, where p corresponds to the number of edges emanating from sum nodes in . By definition, the probability∈ distribution induced by can be computed by S S Vroot(x w) Vroot(x w) Pr(x w) := | = | | ∑ Vroot(x w) Vroot(1 w) S x | | Corollary 9.2.1. Let be an SPN with weights w Rp over input vector x 0, 1 n, the network S ∈ ++ ∈ { } Vroot(1 1) (t) p Iw polynomial is a posynomial: | n I d∈Tt , where I Vroot(x w) Vroot(x w) = ∑ = ∏i=1 xi ∏ w wd t | | t 1 d=1 d ∈T is the indicator variable whether wd is in the t-th induced tree t or not. Each monomial corresponds exactly to a unique induced tree SPN from . T S The above statement is a direct corollary of Thm. 9.1.2, Thm. 9.1.3 and Thm. 9.1.4. From the definition of network polynomial, we know that Vroot(x w) is a multilinear function of the indicator variables. Corollary 9.2.1 works as a complement to characterize| the functional form of a network polynomial in terms of w. It follows that the likelihood function (w) := Pr (x w) can be expressed as the ratio of two posynomial functions. We now show that the optimizationLS problemS | based on MLE is an SP. Using the definition of Pr (x w) and Corollary 9.2.1, let τ = Vroot(1 1), the MLE problem can be rewritten as: S | | (t) p Iw τ n I d∈Tt Vroot(x w) ∑t=1 ∏i=1 xi ∏d=1 wd maximizew | = p Iw Vroot(1 w) τ d∈Tt (9.1) | ∑t=1 ∏d=1 wd subject to w Rp ∈ ++

153 Proposition 9.2.1. The MLE problem for SPNs is a signomial program. Being nonconvex in general, SP is essentially hard to solve from a computational perspective Boyd et al. (2007); Chiang (2005). However, despite the hardness of SP in general, the objective function in the MLE formulation of SPNs has a special structure, i.e., it is the ratio of two posynomials, which makes the design of efficient optimization algorithms possible.

9.3 Difference of Convex Functions

Both projected gradient descent (PGD) and exponentiated gradient (EG) are first-order methods and they can be viewed as approximating the SP after applying a logarithmic transformation to the objective function only. Although ((9.1)) is a signomial program, its objective function is expressed as the ratio of two posynomials. Hence, we can still apply the logarithmic transformation trick used in geometric programming to its objective function and to the variables to be optimized. More concretely, let w = exp(y ), d [p] d d ∀ ∈ and take the log of the objective function; it becomes equivalent to maximize the following new objective without any constraint on y:

τ(x) p !! τ p !! maximize log exp ydIy log exp ydIy (9.2) ∑ ∑ d∈Tt ∑ ∑ d∈Tt t=1 d=1 − t=1 d=1

Note that in the first term of Eq. (9.2) the upper index τ(x) τ := Vroot(1 1) depends on the current ≤ | input x. By transforming into the log-space, we naturally guarantee the positivity of the solution at each iteration, hence transforming a constrained optimization problem into an unconstrained optimization problem without any sacrifice. Both terms in Eq. (9.2) are convex functions in y after the transformation. Hence, the transformed objective function is now expressed as the difference of two convex functions, which is called a DC function (Hartman et al., 1959). This helps us to design two efficient algorithms to solve the problem based on the general idea of sequential convex approximations for nonlinear programming.

9.3.1 Sequential Monomial Approximation Let’s consider the linearization of both terms in Eq. (9.2) in order to apply first-order methods in the transformed space. To compute the gradient with respect to different components of y, we view each node of an SPN as an intermediate function of the network polynomial and apply the chain rule to back-propagate the gradient. The differentiation of Vroot(x w) with respect to the root node of the network is set to be 1. The differentiation of the network polynomial| with respect to a partial function at each node can then be computed in two passes of the network: the bottom-up pass evaluates the values of all partial functions given the current input x and the top-down pass differentiates the network polynomial with respect to each partial function. Since the model parameters y(w) are only associated with sum nodes, they can be easily computed once we have obtained the differentiations for each node as

∂ log Vroot(x w) ∂ log Vroot(x w) ∂V (x w) ∂wij | = | vi | ∂y ∂V (x w) ∂w ∂y ij vi | ij ij ∂ log Vroot(x w) = | V (x w)w (9.3) ∂V (x w) vj | ij vi | where vi is restricted to be a sum node and vj is a child of vi. Following the evaluation-differentiation passes, the gradient of the objective function in ((9.2)) can be computed in O( ). Furthermore, although |S| 154 the computation is conducted in y, the results are fully expressed in terms of w, which suggests that in practice we do not need to explicitly construct y from w. An illustration of the process is provided in Fig. 9.3.1.

∂ log Vroot(x w) | ∂Vv (x w) + i | wij(yij)

Vv (x w) j | × × ×

Figure 9.3.1: Information flow about the computation of the gradient of log-network polynomial with respect to yij(wij). Each edge yij(wij) collects the evaluation value from vj in bottom-up pass and also differentiation value from vi in top-down pass.

Let f (y) := log Vroot(x exp(y)) log Vroot(1 exp(y)). Consider the optimal first-order approxi- | − | mation of f (y) at point y(k):

fˆ(y) := f (y(k)) + f (y(k))T(y y(k)) (9.4) ∇y − f (y(k)) ˆ p ∇yd which is equivalent to exp( f (w)) = C1 ∏d=1 wd in the original space, where C1 is a positive constant w.r.t. y(w). It follows that approximating f (y) with the best linear function is equivalent to using the best monomial approximation of the signomial program ((9.1)). This leads to a sequential monomial approximations of the original SP formulation: at each iteration y(k), we linearize both terms in Eq. (9.2) and form the optimal monomial function in terms of w(k). The additive update of y(k) leads to a multiplicative update of w(k) since w(k) = exp(y(k)), and we use a backtracking line search to determine the step size of the update in each iteration.

9.3.2 Concave-convex Procedure Sequential monomial approximation fails to use the structure of the problem when learning SPNs. Here we propose another approach based on the concave-convex procedure (CCCP) (Yuille et al., 2002) to use the fact that the objective function is expressed as the difference of two convex functions. At a high level CCCP solves a sequence of concave surrogate optimizations until convergence. In many cases, the maximum of a concave surrogate function can only be solved using other convex solvers and as a result the efficiency of the CCCP highly depends on the choice of the convex solvers. However, we show that by a suitable transformation of the network we can compute the maximum of the concave surrogate in closed form in time that is linear in the network size, which leads to a very efficient algorithm for learning the parameters of SPNs. We also prove the convergence properties of our algorithm.

155 Consider the objective function to be maximized in DCP: f (y) = log Vroot(x exp(y)) log Vroot(1 | − | exp(y)) := f (y) + f (y) where f (y) := log Vroot(x exp(y)) is a convex function and f (y) := 1 2 1 | 2 log Vroot(1 exp(y)) is a concave function. We can linearize only the convex part f1(y) to obtain a surrogate− function| fˆ(y, z) = f (z) + f (z)T(y z) + f (y) (9.5) 1 ∇z 1 − 2 for y, z Rp. Now fˆ(y, z) is a concave function in y. Due to the convexity of f (y) we have ∀ ∈ 1 f (y) f (z) + f (z)T(y z), y, z 1 ≥ 1 ∇z 1 − ∀ and as a result the following two properties always hold for y, z: ∀ fˆ(y, z) f (y) and fˆ(y, y) = f (y) (9.6) ≤ CCCP updates y at each iteration k by solving y(k) arg max fˆ(y, y(k 1)) unless we already have ∈ y − (k 1) ˆ (k 1) (k 1) y − arg maxy f (y, y − ), in which case a generalized fixed point y − has been found and the algorithm∈ stops. It is easy to show that at each iteration of CCCP we always have f (y(k)) f (y(k 1)). Note also that ≥ − f (y) is computing the log-likelihood of input x and therefore it is bounded above by 0. By the monotone (k) (k) convergence theorem, limk ∞ f (y ) exists and the sequence f (y ) converges. We now discuss how to→ compute a closed form solution for the{ maximization} of the concave surrogate (k 1) (k 1) (k 1) fˆ(y, y − ). Since fˆ(y, y − ) is differentiable and concave for any fixed y − , a sufficient and necessary condition to find its maximum is

ˆ (k 1) (k 1) y f (y, y − ) = (k 1) f1(y − ) + y f2(y) = 0 (9.7) ∇ ∇y − ∇

In the above equation, if we consider only the partial derivative with respect to yij(wij), we obtain

(k 1) − (x w(k 1)) (k 1) w Vvj − ∂V (x w ) wijVvj (1 w) ∂V (1 w) ij | root − = | root (9.8) (k 1) | (k 1) | Vroot(x w ) ∂V (x w ) Vroot(1 w) ∂Vv (1 w) | − vi | − | i | Eq. (9.8) leads to a system of p nonlinear equations, which is hard to solve in closed form. However, if we S do a change of variable by considering locally normalized weights wij0 (i.e., wij0 0 and ∑j wij0 = 1 i ), then a solution can be easily computed. As described in (Peharz et al., 2015; Zhao≥ et al., 2015b), any∀ ∈ SPN can be transformed into an equivalent normal SPN with locally normalized weights in a bottom up pass as follows:

wijVvj (1 w) w0 = | (9.9) ij ∑ w V (1 w) j ij vj |

We can then replace wijVvj (1 w) in the above equation by the expression it is equal to in Eq. (9.8) to obtain a closed form solution: |

(k 1) (k 1) (k 1) Vvj (x w − ) ∂Vroot(x w − ) w0 ∝ w − | (9.10) ij ij (k 1) | (k 1) Vroot(x w ) ∂V (x w ) | − vi | − Note that in the above derivation both V (1 w)/Vroot(1 w) and ∂Vroot(1 w)/∂V (1 w) can be vi | | | vi | treated as constants and hence absorbed since w , j are constrained to be locally normalized. In order to ij0 ∀ obtain a solution to Eq. (9.8), for each edge weight wij, the sufficient statistics include only three terms, i.e, (k 1) the evaluation value at vj, the differentiation value at vi and the previous edge weight wij − , all of which

156 Table 9.3.1: Summary of PGD, EG, SMA and CCCP. Var. means the optimization variables.

Algo Var. Update Type Update Formula n o (k+1) (k) (k) (k) PGD w Additive w PRe w + γ( f (w ) f (w )) d ← ++ d ∇wd 1 − ∇wd 2 ( + ) ( ) EG w Multiplicative w k 1 w k exp γ( f (w(k)) f (w(k))) d ← d { ∇wd 1 − ∇wd 2 } ( + ) ( ) ( ) SMA log w Multiplicative w k 1 w k exp γw k ( f (w(k)) f (w(k))) d ← d { d × ∇wd 1 − ∇wd 2 } (k+1) (k) (k) (k) CCCP log w Multiplicative w ∝ w Vroot(w ) V (w ) ij ij × ∇vi × vj can be obtained in two passes of the network for each input x. Thus the computational complexity to obtain a maximum of the concave surrogate is O( ). Interestingly, Eq. (9.10) leads to the same update formula as in the EM algorithm Peharz (2015) despite|S| the fact that CCCP and EM start from different perspectives. (k) ∞ We show that all the limit points of the sequence w k=1 are guaranteed to be stationary points of DCP in (9.2). { } (k) ∞ Theorem 9.3.1. Let w k=1 be any sequence generated using Eq. (9.10) from any positive initial { } (k) ∞ point, then all the limiting points of w k=1 are stationary points of the DCP in (9.2). In addition, (k) { } limk ∞ f (y ) = f (y∗), where y∗ is some stationary point of (9.2). → We summarize all four algorithms and highlight their connections and differences in Table 9.3.1. Although we mainly discuss the batch version of those algorithms, all of the four algorithms can be easily adapted to work in stochastic and/or parallel settings.

9.4 Experiments

9.4.1 Experimental Setting We conduct experiments on 20 benchmark data sets from various domains to compare and evaluate the convergence performance of the four algorithms: PGD, EG, SMA and CCCP (EM). We list here the detailed statistics of the 20 data sets used in the experiments in Table 10.4.1. All the features in the 20 data sets are binary features. All the SPNs that are used for comparisons of PGD, EG, SMA and CCCP are trained using LearnSPN (Gens and Domingos, 2013). We discard the weights returned by LearnSPN and use random weights as initial model parameters. The random weights are determined by the same random seed in all four algorithms. The sizes of different SPNs produced by LearnSPN and ID-SPN are shown in Table 9.4.2.

9.4.2 Parameter Learning We implement all four algorithms in C++. For each algorithm, we set the maximum number of iterations to 50. If the absolute difference in the training log-likelihood at two consecutive steps is less than 0.001, the algorithms are stopped. For PGD, EG and SMA, we combine each of them with backtracking line search and use a weight shrinking coefficient set at 0.8. The learning rates are initialized to 1.0 for all three methods. For PGD, we set the projection margin e to 0.01. There is no learning rate and no backtracking line search in CCCP. We set the smoothing parameter to 0.001 in CCCP to avoid numerical issues. We show in Fig. 9.4.1 the average log-likelihood scores on 20 training data sets to evaluate the convergence speed and stability of PGD, EG, SMA and CCCP. Clearly, CCCP wins by a large margin

157 Table 9.4.1: Statistics of data sets and models. N is the number of variables modeled by the network, |S| is the size of the network and D is the number of parameters to be estimated in the network. N V/D means the ratio of training instances times the number of variables to the number parameters. ×

Data set N D Train Valid Test N V/D NLTCS 16 13,733|S| 1,716 16,181 2,157 3,236 ×150.871 MSNBC 17 54,839 24,452 291,326 38,843 58,265 202.541 KDD 2k 64 48,279 14,292 180,092 19,907 34,955 806.457 Plants 69 132,959 58,853 17,412 2,321 3,482 20.414 Audio 100 739,525 196,103 15,000 2,000 3,000 7.649 Jester 100 314,013 180,750 9,000 1,000 4,116 4.979 Netflix 100 161,655 51,601 15,000 2,000 3,000 29.069 Accidents 111 204,501 74,804 12,758 1,700 2,551 18.931 Retail 135 56,931 22,113 22,041 2,938 4,408 134.560 Pumsb-star 163 140,339 63,173 12,262 1,635 2,452 31.638 DNA 180 108,021 52,121 1,600 400 1,186 5.526 Kosarak 190 203,321 53,204 33,375 4,450 6,675 119.187 MSWeb 294 68,853 20,346 29,441 3,270 5,000 425.423 Book 500 190,625 41,122 8,700 1,159 1,739 105.783 EachMovie 500 522,753 188,387 4,524 1,002 591 12.007 WebKB 839 1,439,751 879,893 2,803 558 838 2.673 Reuters-52 889 2,210,325 1,453,390 6,532 1,028 1,540 3.995 20 Newsgrp 910 14,561,965 8,295,407 11,293 3,764 3,764 1.239 BBC 1058 1,879,921 1,222,536 1,670 225 330 1.445 Ad 1556 4,133,421 1,380,676 2,461 327 491 2.774 over PGD, EG and SMA, both in convergence speed and solution quality. Furthermore, among the four algorithms, CCCP is the most stable one due to its guarantee that the log-likelihood (on training data) will not decrease after each iteration. As shown in Fig. 9.4.1, the training curves of CCCP are more smooth than the other three methods in almost all the cases. These 20 experiments also clearly show that CCCP often converges in a few iterations. On the other hand, PGD, EG and SMA are on par with each other since they are all first-order methods. SMA is more stable than PGD and EG and often achieves better solutions than PGD and EG. On large data sets, SMA also converges faster than PGD and EG. Surprisingly, EG performs worse than PGD in some cases and is quite unstable despite the fact that it admits multiplicative updates. The “hook shape” curves of PGD in some data sets, e.g. Kosarak and KDD, are due to the projection operations. The computational complexity per update is O( ) in all four algorithms. CCCP often takes less time than the other three algorithms because it takes fewer|S| iterations to converge. We list detailed running time statistics for all four algorithms on the 20 data sets

9.4.3 Fine Tuning We combine CCCP as a “fine tuning” procedure with the structure learning algorithm LearnSPN and compare it to the state-of-the-art structure learning algorithm ID-SPN (Rooshenas and Lowd, 2014). More concretely, we keep the model parameters learned from LearnSPN and use them to initialize CCCP. We then update the model parameters globally using CCCP as a fine tuning technique. This normally helps

158 Table 9.4.2: Sizes of SPNs produced by LearnSPN and ID-SPN.

Data set LearnSPN ID-SPN NLTCS 13,733 24,690 MSNBC 54,839 579,364 KDD 2k 48,279 1,286,657 Plants 132,959 2,063,708 Audio 739,525 2,643,948 Jester 314,013 4,225,471 Netflix 161,655 7,958,088 Accidents 204,501 2,273,186 Retail 56,931 60,961 Pumsb-star 140,339 1,751,092 DNA 108,021 3,228,616 Kosarak 203,321 1,272,981 MSWeb 68,853 1,886,777 Book 190,625 1,445,501 EachMovie 522,753 2,440,864 WebKB 1,439,751 2,605,141 Reuters-52 2,210,325 4,563,861 20 Newsgrp 14,561,965 3,485,029 BBC 1,879,921 2,426,602 Ad 4,133,421 2,087,253 to obtain a better generative model since the original parameters are learned greedily and locally during the structure learning algorithm. We use the validation set log-likelihood score to avoid overfitting. The algorithm returns the set of parameters that achieve the best validation set log-likelihood score as output. Experimental results are reported in Table. 9.4.3. As shown in Table 9.4.3, the use of CCCP after LearnSPN always helps to improve the model performance. By optimizing model parameters on these 20 data sets, we boost LearnSPN to achieve better results than state-of-the-art ID-SPN on 7 data sets, where the original LearnSPN only outperforms ID-SPN on 1 data set. Note that the sizes of the SPNs returned by LearnSPN are much smaller than those produced by ID-SPN. Hence, it is remarkable that by fine tuning the parameters with CCCP, we can achieve better performance despite the fact that the models are smaller. For a fair comparison, we also list the size of the SPNs returned by ID-SPN in the supplementary material. As a result, we suggest using CCCP after structure learning algorithms to fully exploit the expressiveness of the constructed model.

9.5 Proofs

In this section we provide all the omitted proofs in Sec. 9.1.

9.5.1 Structural Properties of SPNs Theorem 9.1.1. If is an induced SPN from a complete and decomposable SPN , then is a tree that is complete and decomposable.T S T

159 Table 9.4.3: Average log-likelihoods on test data. Highest log-likelihoods are highlighted in bold. shows statistically better log-likelihoods than CCCP and shows statistically worse log-likelihoods than↑ CCCP. The significance is measured based on the Wilcoxon↓ signed-rank test.

Data set CCCP LearnSPN ID-SPN Data set CCCP LearnSPN ID-SPN NLTCS -6.029 -6.099 -6.050 DNA -84.921 -85.237 -84.693 MSNBC -6.045 ↓-6.113 ↓-6.048 Kosarak -10.880 ↓-11.057 ↑-10.605 KDD 2k -2.134 ↓-2.233 -2.153 MSWeb -9.970 ↓-10.269 -9.800 Plants -12.872 ↓-12.955 ↓-12.554 Book -35.009 ↓-36.247 -34.436 Audio -40.020 ↓-40.510 ↑-39.824 EachMovie -52.557 ↓-52.816 ↑-51.550 Jester -52.880 ↓-53.454 -52.912 WebKB -157.492 ↓-158.542 ↑-153.293 Netflix -56.782 ↓-57.385 ↓-56.554 Reuters-52 -84.628 ↓ -85.979 ↑ -84.389 Accidents -27.700 ↓-29.907 ↑-27.232 20 Newsgrp -153.205 ↓-156.605 ↑-151.666 Retail -10.919 ↓-11.138 ↑-10.945 BBC -248.602 ↓-249.794 ↑-252.602 Pumsb-star -24.229 ↓-24.577 -22.552 Ad -27.202 ↓ -27.409 ↓ -40.012 ↓ ↑ ↓ ↓

Proof. Argue by contradiction that is not a tree, then there must exist a node v such that v has T ∈ T more than one parent in . This means that there exist at least two paths R, p ,..., v and R, q ,..., v T 1 1 that connect the root of ( ), which we denote by R, and v. Let t be the last node in R, p ,..., v and S T 1 R, q1,..., v such that R,..., t are common prefix of these two paths. By construction we know that such t must exist since these two paths start from the same root node R (R will be one candidate of such t). Also, we claim that t = v otherwise these two paths overlap with each other, which contradicts the assumption 6 that v has multiple parents. This shows that these two paths can be represented as R,..., t, p,..., v and R,..., t, q,..., v where R,..., t are the common prefix shared by these two paths and p = q since t is 6 the last common node. From the construction process defined in Def. 9.1.1, we know that both p and q are children of t in . Recall that for each sum node in , Def. 9.1.1 takes at most one child, hence we S S claim that t must be a product node, since both p and q are children of t. Then the paths that t p v → and t q v indicate that scope(v) scope(p) scope(t) and scope(v) scope(q) scope(t), → ⊆ ⊆ ⊆ ⊆ leading to ∅ = scope(v) scope(p) scope(q), which is a contradiction of the decomposability of the 6 ⊆ ∩ product node t. Hence as long as is complete and decomposable, must be a tree. The completeness of is triviallyS satisfied because each sum nodeT has only one child in . It is also straightforward to verify thatT satisfies the decomposability as is an induced subgraph ofT , which is T T S decomposable.  n I Theorem 9.1.2. If is an induced tree from over X[n], then (x) = ∏(v ,v ) wij ∏i=1 xi , where T S T i j ∈TE wij is the edge weight of (vi, vj) if vi is a sum node and wij = 1 if vi is a product node. Proof. First, the scope of is the same as the scope of because the root of is also the root of . This T S S T shows that for each Xi there is at least one indicator Ixi in the leaves otherwise the scope of the root node of will be a strict subset of the scope of the root node of . Furthermore, for each variable X there is at T S i most one indicator Ixi in the leaves. This is observed by the fact that there is at most one child collected from a sum node into and if I and I appear simultaneously in the leaves, then their least common T xi x¯i ancestor must be a product node. Note that the least common ancestor of Ixi and Ix¯i is guaranteed to exist because of the tree structure of . However, this leads to a contradiction of the fact that is decomposable. T S As a result, there is exactly one indicator Ixi for each variable Xi in . Hence the multiplicative constant n I T of the monomial admits the form ∏i=1 xi , which is a product of univariate distributions. More specifically, it is a product of indicator variables in the case of Boolean input variables.

160 We have already shown that is a tree and only product nodes in can have multiple children. T T It follows that the functional form of f (x) must be a monomial, and only edge weights that are in T N T contribute to the monomial. Combing all the above, we know that f (x) = ∏(v ,v ) wij ∏n=1 Ixn .  T i i ∈TE Theorem 9.1.3. τ = Vroot(1 1), where Vroot(1 1) is the value of the network polynomial of with S | | S input vector 1 and all edge weights set to be 1. τ Theorem 9.1.4. (x) = ∑ S (x), where is the tth unique induced tree of . S t=1 Tt Tt S Proof. We prove by induction on the height of . If the height of is 2, then depending on the type of the root node, we have two cases: S S 1 1. If the root is a sum node with K children, then there are CK = K different subgraphs that satisfy Def. 9.1.1, which is exactly the value of the network by setting all the indicators and edge weights from the root to be 1. 2. If the root is a product node then there is only 1 subgraph which is the graph itself. Again, this equals to the value of by setting all indicators to be 1. S Assume the theorem is true for SPNs with height h. Consider an SPN with height h + 1. Again, depending on the type of the root node, we need to discuss≤ two cases: S 1. If the root is a sum node with K children, where the kth sub-SPN has f (1 1) unique induced Sk | trees, then by Def. 9.1.1 the total number of unique induced trees of is K K ∑k=1 f k (1 1) = ∑k=1 1 S S | · f (1 1) = f (1 1). Sk | S | 2. If the root is a product node with K children, then the total number of unique induced trees of can then be computed by K . S ∏k=1 f k (1 1) = f (1 1) S | S | The second part of the theorem follows by using distributive law between multiplication and addition to combine unique trees that share the same prefix in bottom-up order. 

9.5.2 MLE as Signomial Programming Proposition 9.2.1. The MLE problem for SPNs is a signomial program.

Proof. Using the definition of Pr(x w) and Corollary 9.2.1, let τ = f (1 1), the MLE problem can be rewritten as | S | (t) Iw τ N I D d∈Tt f (x w) ∑t=1 ∏n=1 xn ∏d=1 wd maximizew S | = Iw f (1 w) τ D d∈Tt (9.11) S | ∑t=1 ∏d=1 wd subject to w RD ∈ ++ which we claim is equivalent to:

minimize z w,z − τ D τ N D Iw (t) Iw subject to d∈Tt I d∈Tt ∑ z ∏ wd ∑ ∏ xn ∏ wd 0 (9.12) t=1 d=1 − l=1 n=1 d=1 ≤ w RD , z > 0 ∈ ++ It is easy to check that both the objective function and constraint function in (9.12) are signomials. To see the equivalence of (9.11) and (9.12), let p∗ be the optimal value of (9.11) achieved at w∗. Choose z = p and w = w in (9.12), then z is also the optimal solution of (9.12) otherwise we can find feasible ∗ ∗ − (z , w ) in (9.12) which has z < z z > z. Combined with the constraint function in (9.12), we 0 0 − 0 − ⇔ 0 161 f (x w0) have p∗ = z < z0 S | , which contradicts the optimality of p∗. In the other direction, let z∗, w∗ be f (1 w0) ≤ S | the solution that achieves optimal value of (9.12), then we claim that z∗ is also the optimal value of (9.11), f (x w) otherwise there exists a feasible in (9.11) such that S | . Since is also feasible in w z , f (1 w) > z∗ (w, z) (9.12) with z < z , this contradicts the optimality of z . S | − − ∗ ∗  The transformation from (9.11) to (9.12) does not make the problem any easier to solve. Rather, it destroys the structure of (9.11), i.e., the objective function of (9.11) is the ratio of two posynomials. However, the equivalent transformation does reveal some insights about the intrinsic complexity of the optimization problem, which indicates that it is hard to solve (9.11) efficiently with the guarantee of achieving a globally optimal solution.

9.5.3 Convergence of CCCP for SPNs We discussed before that the sequence of function values f (y(k)) converges to a limiting point. However, { } this fact alone does not necessarily indicate that f (y(k)) converges to f (y ) where y is a stationary point { } ∗ ∗ of f ( ) nor does it imply that the sequence y(k) converges as k ∞. Zangwill’s global convergence theory· (Zangwill, 1969) has been successfully{ } applied to study→ the convergence properties of many iterative algorithms frequently used in machine learning, including EM (Wu, 1983), generalized alternating minimization (Gunawardana and Byrne, 2005) and also CCCP (Lanckriet and Sriperumbudur, 2009). Here we also apply Zangwill’s theory and combine the analysis from Lanckriet and Sriperumbudur (2009) to show the following theorem: (k) ∞ Theorem 9.3.1. Let w k=1 be any sequence generated using Eq. (9.10) from any positive initial { } (k) ∞ point, then all the limiting points of w k=1 are stationary points of the DCP in (9.2). In addition, (k) { } limk ∞ f (y ) = f (y∗), where y∗ is some stationary point of (9.2). → Proof. We will use Zangwill’s global convergence theory for iterative algorithms (Zangwill, 1969) to show the convergence in our case. Before showing the proof, we need to first introduce the notion of “point-to-set mapping”, where the output of the mapping is defined to be a set. More formally, a point-to-set map Φ from a set to is defined as Φ : ( ), where ( ) is the power set of . Suppose and X Y X 7→ P Y P Y Y X are equipped with the norm and , respectively. A point-to-set map Φ is said to be closed Y k · kX k · kY at x if x , x ∞ x and y , y ∞ y , y Φ(x ) imply that y Φ(x ). ∗ ∈ X k ∈ X { k}k=1 → ∗ k ∈ Y { k}k=1 → ∗ k ∈ k ∗ ∈ ∗ A point-to-set map Φ is said to be closed on S if Φ is closed at every point in S. The concept of ⊆ X closedness in the point-to-set map setting reduces to continuity if we restrict that the output of Φ to be a set of singleton for every possible input, i.e., when Φ is a point-to-point mapping. Theorem 9.5.1 (Global Convergence Theorem (Zangwill, 1969)). Let the sequence x ∞ be generated { k}k=1 by xk+1 Φ(xk), where Φ( ) is a point-to-set map from to . Let a solution set Γ be given, and suppose∈ that: · X X ⊆ X 1. all points x are contained in a compact set S . k ⊆ X 2. Φ is closed over the complement of Γ. 3. there is a continuous function α on such that: X (a) if x Γ, α(x ) > α(x) for x Φ(x). 6∈ 0 ∀ 0 ∈ (b) if x Γ, α(x ) α(x) for x Φ(x). ∈ 0 ≥ ∀ 0 ∈ Then all the limit points of x ∞ are in the solution set Γ and α(x ) converges monotonically to α(x ) { k}k=1 k ∗ for some x Γ. ∗ ∈

162 Let w RD. Let Φ(w(k 1)) = exp(arg max fˆ(y, y(k 1))) and let α(w) = f (log w) = f (y) = ∈ + − y − log f (x exp(y)) log f (1 exp(y)). Here we use w and y interchangeably as w = exp(y) or each S | − S | ˆ (k 1) (k 1) component is a one-to-one mapping. Note that since the arg maxy f (y, y − ) given y − is achievable, D Φ( ) is a well defined point-to-set map for w R+. · (k 1) ∈ Specifically, in our case given w − , at each iteration of Eq. 9.10 we have

(k 1) (k 1) wij fvj (1 w) (k 1) fvj (x w − ) ∂ f (x w − ) w0 = | ∝ w − | S ij ij (k 1) | (k 1) ∑j wij fvj (1 w) f (x w − ) ∂ fv (x w − ) | S | i | i.e., the point-to-set mapping is given by

(k 1) (k 1) (k 1) ∂ f (x w − ) w − f (x w ) S | ij vj − ∂ f (x w(k 1)) (k 1) | vi | − Φij(w − ) = (k 1) (k 1) (k 1) ∂ f (x w − ) ∑ w − fv (x w ) S | ( ) j0 ij0 j − ∂ f (x w k 1 ) 0 | vi | − Let S = [0, 1]D, the D dimensional hyper cube. Then the above update formula indicates that Φ(w(k 1)) − ∈ S. Furthermore, if we assume w(1) S, which can be obtained by local normalization before any update, ∞ ∈ D we can guarantee that wk k=1 S, which is a compact set in R+. { } ⊆(k 1) The solution to maxy fˆ(y, y − ) is not unique. In fact, there are infinitely many solutions to this (k 1) nonlinear equations. However, as we define above, Φ(w − ) returns one solution to this convex program in the D dimensional hyper cube. Hence in our case Φ( ) reduces to a point-to-point map, where the definition of closedness of a point-to-set map reduces to the· notion of continuity of a point-to-point map. Define Γ = w∗ w∗ is a stationary point of α( ) . Hence we only need to verify the continuity of Φ(w) { | · } ∂ f (x w) when . To show this, we first characterize the functional form of S | as it is used inside . w S ∂ f (x w) Φ( ) ∈ vi | · ∂ f (x w) We claim that for each node , S is again, a posynomial function of . A graphical illustration is vi ∂ (x|w) w fvi given in Fig. 9.5.1 to explain the process.| This can also be derived from the sum rules and product rules used during top-down differentiation. More specifically, if vi is a product node, let vj, j = 1, . . . , J be its parents in the network, which are assumed to be sum nodes, the differentiation of f with respect to fvi is ∂ (x w) S ∂ f (x w) J ∂ f (x w) fvj given by S | S | | . We reach ∂ f (x w) = ∑j=1 ∂ f (x w) ∂ f (x w) vi | vj | vi |

$$\frac{\partial f_{\mathcal{S}}(x \mid w)}{\partial f_{v_i}(x \mid w)} = \sum_{j=1}^J w_{ij}\, \frac{\partial f_{\mathcal{S}}(x \mid w)}{\partial f_{v_j}(x \mid w)}. \qquad (9.13)$$

Similarly, if $v_i$ is a sum node and its parents $v_j$, $j = 1, \ldots, J$, are assumed to be product nodes, we have

$$\frac{\partial f_{\mathcal{S}}(x \mid w)}{\partial f_{v_i}(x \mid w)} = \sum_{j=1}^J \frac{\partial f_{\mathcal{S}}(x \mid w)}{\partial f_{v_j}(x \mid w)}\, \frac{f_{v_j}(x \mid w)}{f_{v_i}(x \mid w)}. \qquad (9.14)$$

Since $v_j$ is a product node and $v_j$ is a parent of $v_i$, the last term in Eq. (9.14) can be equivalently expressed as
$$\frac{f_{v_j}(x \mid w)}{f_{v_i}(x \mid w)} = \prod_{h \neq i} f_{v_h}(x \mid w),$$
where the index $h$ ranges over all the children of $v_j$ except $v_i$. Combining the fact that the partial derivative of $f_{\mathcal{S}}$ with respect to the root node is 1 with the fact that each $f_v$ is a posynomial function, it follows by induction in top-down order that $\partial f_{\mathcal{S}}(x \mid w)/\partial f_{v_i}(x \mid w)$ is also a posynomial function of $w$.

We have shown that both the numerator and the denominator of $\Phi(\cdot)$ are posynomial functions of $w$. Because posynomial functions are continuous, in order to show that $\Phi(\cdot)$ is also continuous on $S \setminus \Gamma$, we need to guarantee that the denominator is not a degenerate posynomial function, i.e., that the denominator of $\Phi(w)$ is nonzero for all possible input vectors $x$. Recall that $\Gamma = \{w^* \mid w^* \text{ is a stationary point of } \alpha(\cdot)\}$; hence $\forall w \in S \setminus \Gamma$, $w \not\in \operatorname{bd} S$, where $\operatorname{bd} S$ is the boundary of the $D$-dimensional hypercube $S$. Hence we have $\forall w \in S \setminus \Gamma \Rightarrow w \in \operatorname{int} S \Rightarrow w > 0$ componentwise. This immediately leads to $f_v(x \mid w) > 0$, $\forall v$. As a result, $\Phi(w)$ is continuous on $S \setminus \Gamma$, since it is the ratio of two strictly positive posynomial functions.

We now verify the third property in Zangwill's global convergence theory. At each iteration of CCCP, there are two cases to consider:
1. If $w^{(k-1)} \not\in \Gamma$, i.e., $w^{(k-1)}$ is not a stationary point of $\alpha(w)$, then $y^{(k-1)} \not\in \arg\max_y \hat{f}(y, y^{(k-1)})$, so we have $\alpha(w^{(k)}) = f(y^{(k)}) \geq \hat{f}(y^{(k)}, y^{(k-1)}) > \hat{f}(y^{(k-1)}, y^{(k-1)}) = f(y^{(k-1)}) = \alpha(w^{(k-1)})$.
2. If $w^{(k-1)} \in \Gamma$, i.e., $w^{(k-1)}$ is a stationary point of $\alpha(w)$, then $y^{(k-1)} \in \arg\max_y \hat{f}(y, y^{(k-1)})$, so we have $\alpha(w^{(k)}) = f(y^{(k)}) \geq \hat{f}(y^{(k)}, y^{(k-1)}) = \hat{f}(y^{(k-1)}, y^{(k-1)}) = f(y^{(k-1)}) = \alpha(w^{(k-1)})$.
By Zangwill's global convergence theory, we conclude that all the limit points of $\{w_k\}_{k=1}^\infty$ are in $\Gamma$ and that $\alpha(w_k)$ converges monotonically to $\alpha(w^*)$ for some stationary point $w^* \in \Gamma$. □

Remark. Technically we need to choose $w_0 \in \operatorname{int} S$ to ensure the continuity of $\Phi(\cdot)$. This initial condition, combined with the fact that inside each iteration of CCCP the algorithm only applies a positive multiplicative update followed by renormalization, ensures that after any finite number $k$ of steps, $w_k \in \operatorname{int} S$. Theoretically, only in the limit is it possible that some components of $w_\infty$ become 0. In practice, however, due to the finite numerical precision of floating point arithmetic, it is possible that after finitely many update steps some components of $w_k$ become 0. In a practical implementation we therefore recommend using a small positive number $\epsilon$ to smooth out such 0 components in $w_k$ during the iterations of CCCP. Such smoothing may break the monotonicity of CCCP, but this can only happen when $w_k$ is close to $w^*$, and we can use early stopping to obtain a solution in the interior of $S$.

Thm. 9.3.1 only implies that any limit point of the sequence $\{w_k\}_{k=1}^\infty$ (equivalently $\{y_k\}_{k=1}^\infty$) must be a stationary point of the log-likelihood function, and that $\{f(y_k)\}_{k=1}^\infty$ must converge to some $f(y^*)$ where $y^*$ is a stationary point. Thm. 9.3.1 does not imply that the sequence $\{w_k\}_{k=1}^\infty$ itself is guaranteed to converge. Lanckriet and Sriperumbudur (2009) study the convergence properties of the general CCCP procedure. Under stronger conditions, e.g., strict concavity of the surrogate function or $\Phi(\cdot)$ being a contraction mapping, it is possible to show that the sequence $\{w_k\}_{k=1}^\infty$ also converges. However, none of these conditions hold in our case. In fact, in general there are infinitely many fixed points of $\Phi(\cdot)$, i.e., the equation $\Phi(w) = w$ has infinitely many solutions in $S$. Also, for a fixed value $t$, if $\alpha(w) = t$ has at least one solution, then it has infinitely many solutions. Such properties of SPNs make it very hard in general to guarantee the convergence of the sequence $\{w_k\}_{k=1}^\infty$. We give a very simple example, shown in Fig. 9.5.2, to illustrate this hardness in SPNs. Consider applying the CCCP procedure to learn the parameters of the SPN given in Fig. 9.5.2 on the three instances $\{(0,1), (1,0), (1,1)\}$. If we choose the initial parameter $w_0$ such that the weights over the indicator variables are set as shown in Fig. 9.5.2, then any assignment of $(w_1, w_2, w_3)$ in the probability simplex is equally optimal in terms of the likelihood of the inputs. In this example there are uncountably many equally good solutions, which invalidates the finite-solution-set requirement of Lanckriet and Sriperumbudur (2009) needed to show convergence of $\{w_k\}_{k=1}^\infty$. However, we emphasize that the convergence of the sequence $\{w_k\}_{k=1}^\infty$ is not as important as the convergence of $\{\alpha(w_k)\}_{k=1}^\infty$ to a desired location on the log-likelihood surface, since in practice any $w^*$ with an equally good log-likelihood may suffice for the inference/prediction task.

It is worth pointing out that the above theorem does not imply the convergence of the sequence $\{w^{(k)}\}_{k=1}^\infty$. Thm. 9.3.1 only indicates that all the limiting points of $\{w^{(k)}\}_{k=1}^\infty$, i.e., the limits of its subsequences, are stationary points of the DCP in (9.2). We also presented a negative example in Fig. 9.5.2 that invalidates the application of Zangwill's global convergence theory to the analysis of sequence convergence in this case. The convergence rate of general CCCP is still an open problem (Lanckriet and Sriperumbudur, 2009). Salakhutdinov et al. (2002) studied the convergence rate of unconstrained bound optimization algorithms with differentiable objective functions, of which our problem is a special case. The conclusion is that, depending on the curvature of $f_1$ and $f_2$ (which are functions of the training data), CCCP will exhibit either a quasi-Newton behavior with superlinear convergence or first-order convergence. We show in the experiments that CCCP normally exhibits a fast, superlinear convergence rate compared with PGD, EG and SMA. Both CCCP and EM are special cases of a more general framework known as Majorization-Maximization; we show that in the case of SPNs these two algorithms coincide, i.e., they lead to the same update formulas despite starting from totally different perspectives.
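To make the multiplicative update and the $\epsilon$-smoothing above concrete, here is a minimal numeric sketch (not the implementation used in the experiments): a toy SPN whose root sum node mixes three fixed products of indicator leaves over two binary variables. All function and variable names are our own illustrative choices.

```python
import numpy as np

# A minimal sketch (assumptions: toy structure, illustrative names) of one
# CCCP iteration on an SPN whose root sum node has three product children.

def product_values(x):
    # f_{v_j}(x) for the three product children:
    # P1 = 1{X1=1} 1{X2=1},  P2 = 1{X1=1} 1{X2=0},  P3 = 1{X1=0} 1{X2=1}.
    x1, x2 = x
    return np.array([x1 * x2, x1 * (1 - x2), (1 - x1) * x2], dtype=float)

def cccp_step(w, data, eps=1e-3):
    """One multiplicative CCCP update of the root weights on a batch,
    followed by the epsilon smoothing recommended in the remark above."""
    new_w = np.zeros_like(w)
    for x in data:
        f_children = product_values(x)       # f_{v_j}(x | w)
        f_root = float(w @ f_children)       # f_S(x | w); D_root(x | w) = 1
        new_w += w * f_children / f_root     # w_j * f_{v_j} * D_root / f_S
    new_w /= new_w.sum()                     # renormalize at the sum node
    new_w = np.maximum(new_w, eps)           # smooth any zero components ...
    return new_w / new_w.sum()               # ... and renormalize again

data = [(0, 1), (1, 0), (1, 1)]
w = np.array([0.2, 0.3, 0.5])                # positive initial point in int S
for _ in range(5):
    w = cccp_step(w, data)
print(w)                                     # iterates stay in the interior
```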


Figure 9.4.1: Negative log-likelihood values versus number of iterations for PGD, EG, SMA and CCCP on the 20 benchmark data sets (panels: NLTCS, MSNBC, KDD 2000, Plants, Audio, Jester, Netflix, Accidents, Retail, Pumsb Star, DNA, Kosarek, MSWeb, Book, EachMovie, WebKB, Reuters-52, 20 Newsgroup, BBC, Ad); each panel plots $-\log \Pr(x \mid w)$ against the number of iterations.

Figure 9.5.1: Graphical illustration of $\partial f_{\mathcal{S}}(x \mid w)/\partial f_{v_i}(x \mid w)$. The partial derivative of $f_{\mathcal{S}}$ with respect to $f_{v_i}$ (in red) is a posynomial that is a product of the edge weights lying on the path from the root to $v_i$ and the network polynomials from nodes that are children of product nodes on the path (highlighted in blue).


Figure 9.5.2: A counterexample SPN over two binary random variables whose weights $w_1$, $w_2$, $w_3$ are symmetric and indistinguishable.


Appendix

9.A Experiment Details

9.A.1 Methods

We briefly review the current approach for training SPNs using projected gradient descent (PGD). A related approach is to use exponentiated gradient (EG) (Kivinen and Warmuth, 1997) to optimize (9.11). PGD optimizes the log-likelihood by projecting the intermediate solution back onto the positive orthant after each gradient update. Since the constraint set in (9.11) is an open set, we need to manually create a closed set on which the projection operation is well defined. One feasible choice is to project onto $\mathbb{R}^D_\epsilon \triangleq \{w \in \mathbb{R}^D_{++} \mid w_d \geq \epsilon,\ \forall d\}$, where $\epsilon > 0$ is assumed to be very small. To avoid the projection altogether, one direct solution is to use the exponentiated gradient (EG) method (Kivinen and Warmuth, 1997), which was first applied in an online setting and later successfully extended to batch settings for training convex models. EG admits a multiplicative update at each iteration and hence avoids the need for the projection in PGD. However, EG has mostly been applied in convex settings, and it is not clear whether its convergence guarantees still hold in the nonconvex setting.
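As a concrete illustration of the difference between the two updates, here is a minimal sketch (the function names and the stand-in gradient are our own; this is not the code used in the experiments):

```python
import numpy as np

# A minimal sketch of the two updates on a positive parameter vector w,
# given a gradient g of the log-likelihood w.r.t. w.

def pgd_step(w, g, lr=0.1, eps=1e-2):
    # Gradient ascent step, then projection onto the closed set
    # {w : w_d >= eps for all d} carved out of the open positive orthant.
    return np.maximum(w + lr * g, eps)

def eg_step(w, g, lr=0.1):
    # Exponentiated gradient: a multiplicative update, so w stays strictly
    # positive and no projection is needed.
    return w * np.exp(lr * g)

w = np.array([0.5, 1.5, 2.0])
g = np.array([0.3, -0.2, 0.1])   # a stand-in gradient, for illustration only
print(pgd_step(w, g))
print(eg_step(w, g))
```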

9.A.2 Experimental Setup

Table 9.A.1 shows the detailed running time of PGD, EG, SMA and CCCP on the 20 data sets, measured in seconds.

Table 9.A.1: Running time of the 4 algorithms on the 20 data sets, measured in seconds.

Data set      PGD         EG          SMA         CCCP
NLTCS         438.35      718.98      458.99      206.10
MSNBC         2720.73     2917.72     8078.41     2008.07
KDD 2k        46388.60    22154.10    27101.50    29541.20
Plants        12595.60    10752.10    7604.09     13049.80
Audio         19647.90    3430.69     12801.70    14307.30
Jester        6099.44     6272.80     4082.65     1931.41
Netflix       29573.10    27931.50    15080.50    8400.20
Accidents     14266.50    3431.82     5776.00     20345.90
Retail        28669.50    7729.89     9866.94     5200.20
Pumsb-star    3115.58     13872.80    4864.72     2377.54
DNA           599.93      199.63      727.56      1380.36
Kosarak       122204.00   112273.00   49120.50    42809.30
MSWeb         136524.00   13478.10    65221.20    45132.30
Book          190398.00   6487.84     69730.50    23076.40
EachMovie     30071.60    32793.60    17751.10    60184.00
WebKB         123088.00   50290.90    44004.50    168142.00
Reuters-52    13092.10    5438.35     20603.70    1194.31
20 Newsgrp    151243.00   96025.80    173921.00   11031.80
BBC           20920.60    18065.00    36952.20    3440.37
Ad            12246.40    2183.08     12346.70    731.48

Chapter 10

Collapsed Variational Inference

Existing parameter learning approaches for SPNs are largely based on the maximum likelihood principle and are subject to overfitting compared to more Bayesian approaches. Exact Bayesian posterior inference for SPNs is computationally intractable, and even approximation techniques such as standard variational inference and posterior sampling are computationally infeasible for networks of moderate size, due to the large number of local latent variables per instance. In this chapter, we propose a novel deterministic collapsed variational inference algorithm for SPNs that is computationally efficient and easy to implement, and that at the same time allows us to incorporate prior information into the optimization formulation. Extensive experiments show a significant improvement in accuracy compared with a maximum-likelihood-based approach.

10.1 Introduction

Standard variational Bayes inference is an optimization-based approach to approximate the full posterior distribution of a probabilistic model (Jordan et al., 1999). It works by constructing and maximizing an evidence lower bound (ELBO) on the log marginal likelihood function $\log p(x)$. Equivalently, one can minimize the Kullback-Leibler (KL) divergence between the true posterior distribution and the variational posterior distribution. To review, let $\Theta = \{H, W\}$ represent the set of latent variables in a probabilistic model (including both the local, $H$, and the global, $W$, latent variables) and let $X$ represent the data. For example, in mixture models such as LDA, $H$ corresponds to the topic assignments of the words and $W$ corresponds to the topic-word distribution matrix. The joint probability distribution of $\Theta$ and $X$ is $p(X, \Theta \mid \alpha)$, where $\alpha$ is the set of hyperparameters of the model. Standard VI approximates the true posterior $p(\Theta \mid X, \alpha)$ with a variational distribution $q(\Theta \mid \beta)$ with a set of variational parameters $\beta$. This reduces an inference problem to an optimization problem in which the objective function is given by the ELBO:

$$\log p(x \mid \alpha) \geq \mathbb{E}_q[\log p(x, \Theta \mid \alpha)] + H[q(\Theta \mid \beta)] =: \mathcal{L}_b(\beta).$$

The variational distribution $q(\Theta \mid \beta)$ is typically assumed to be fully factorized, with each variational parameter $\beta_i$ governing one latent variable $\theta_i$. The variational parameters are then optimized to maximize the ELBO, and the optimal variational posterior is used as a surrogate for the true posterior distribution $p(\Theta \mid X, \alpha)$. It is worth noting that in standard VI the number of variational parameters is linearly proportional to the number of latent variables, including both the global and local latent variables.

10.1.1 Motivation

Standard VI methods are computationally infeasible for SPNs due to the large number of local latent variables for each training instance, as shown in Fig. 10.1.1. Let $W$ denote the set of global latent variables (model parameters) and $H_d$ denote the set of local latent variables, where $d$ indexes the training instances. In standard VI we need to maintain a set of variational parameters for each of the latent variables, i.e., for $W$ and $\{H_d\}_{d=1}^D$. In the case of SPNs, the number of local latent variables is exactly the number of internal sum nodes in the network, which can be linearly proportional to the size of the network, $|\mathcal{S}|$. Together with the global latent variables, this leads to a total of $O(D|\mathcal{S}| + |\mathcal{S}|)$ variational parameters to be maintained in standard VI. As we will see in the experiments, this is prohibitively expensive for SPNs, whose sizes range from tens of thousands to millions of nodes.

To deal with the expensive computation and storage in standard VI, we develop CVB-SPN, a collapsed variational algorithm, which, instead of assuming independence among the local latent variables, models the dependence among them in an exact fashion. More specifically, we marginalize all the local latent variables out of the joint posterior distribution and maintain a marginal posterior distribution over the global latent variables directly. This approach would not be an option for many graphical models, but as we will see later, we can take advantage of fast exact inference in SPNs. On the other hand, we still variationally approximate the posterior distribution of the global variables (model parameters) using a mean-field approach. Note that this basic assumption made in CVB-SPN is different from typical collapsed variational approaches developed for LDA and HDP (Teh et al., 2006, 2007), where the global latent variables are integrated out and the local latent variables are assumed to be mutually independent. As a result, CVB-SPN models the marginal posterior distribution over the global latent variables only, and hence the number of variational parameters to be optimized is $O(|\mathcal{S}|)$, compared with $O(D|\mathcal{S}|)$ in the standard case. Intuitively, this is not an unreasonable assumption to make, since compared with the local latent variables, the global latent variables in Fig. 10.1.1 are further away from the influence of the observations of $X$, and they are mutually independent a priori. Besides the computational consideration, another motivation to collapse out the local latent variables is that no interpretation associated with the local latent variables in an SPN has so far proven to be useful in practice. Unlike other mixture models such as LDA, where the local latent variables correspond to the topic assignments of words, the sum nodes in SPNs do not have the kind of statistical interpretation that is useful in real-world applications.


Figure 10.1.1: Graphical model representation of an SPN $\mathcal{S}$. The box is a plate representing replication over the $D$ training instances. $m = O(|\mathcal{S}|)$ corresponds to the number of sum nodes and $n$ is the number of observable variables in $\mathcal{S}$; typically $m \gg n$. $W$, $H$ and $X$ correspond to global latent variables, local latent variables and observable variables, respectively.

CVB-SPN makes fewer assumptions about the independence among random variables and has fewer variational parameters to optimize, but to achieve these benefits we must overcome the following difficulties: the cost of exact marginalization over the local latent variables, and the non-conjugacy between the prior distribution and the likelihood function introduced by the marginalization. The first problem is elegantly handled by the property of SPNs that exact inference over $X$ is always tractable. In what follows we derive the CVB-SPN algorithm, which efficiently solves the second problem.

10.2 Collapsed Variational Inference

Throughout the derivation we will assume that the weights $w_{kj}$ associated with a sum node $k$ are locally normalized, i.e., $\sum_{j \in \mathrm{Ch}(k)} w_{kj} = 1$, $\forall k$. This can be achieved by a bottom-up pass of the network in time $O(|\mathcal{S}|)$ without changing the joint probability distribution over $X$; see Peharz et al. (2015) and Zhao et al. (2015b) for more details. For SPNs with locally normalized weights $w$, it can be verified that $V_{\mathrm{root}}(\mathbf{1} \mid w) = 1$; hence the marginal likelihood function $p(x \mid w)$ that marginalizes out the local latent variables $h$ is given by $V_{\mathrm{root}}(x \mid w)$.

Since the weights associated with each sum node $k$ are locally normalized, we can interpret each sum node as a multinomial random variable with one possible value for each child of the sum node. It follows that we can specify a Dirichlet prior $\mathrm{Dir}(w_k \mid \alpha_k)$ for each sum node $k$. Since all the global latent variables are d-separated before we obtain the observation $x$, a prior distribution over all the weights factorizes as:

$$p(w \mid \alpha) = \prod_{k=1}^m p(w_k \mid \alpha_k) = \prod_{k=1}^m \mathrm{Dir}(w_k \mid \alpha_k). \qquad (10.1)$$
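For concreteness, a draw from the factorized prior in (10.1) is simply one Dirichlet sample per sum node, yielding a set of locally normalized weight vectors. A minimal sketch (node names and hyperparameter values are arbitrary illustrations of ours):

```python
import numpy as np

# A small sketch of the factorized prior in (10.1): one Dirichlet per sum
# node, so a prior draw is a set of locally normalized weight vectors.

rng = np.random.default_rng(0)
alpha = {'k1': np.array([2.0, 1.0, 1.0]),   # sum node k1 has three children
         'k2': np.array([1.0, 1.0])}        # sum node k2 has two children

w = {k: rng.dirichlet(a) for k, a in alpha.items()}
for wk in w.values():
    assert abs(wk.sum() - 1.0) < 1e-9       # locally normalized by construction
print(w)
```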

The true posterior distribution after a sequence of observations $\{x_d\}_{d=1}^D$ is:

$$p(w \mid \{x_d\}_{d=1}^D, \alpha) \;\propto\; p(w \mid \alpha) \prod_{d=1}^D V_{\mathrm{root}}(x_d \mid w) = \prod_{k=1}^m \mathrm{Dir}(w_k \mid \alpha_k) \prod_{d=1}^D \sum_{t=1}^{\tau_{\mathcal{S}}} \prod_{(k,j) \in \mathcal{T}_{tE}} w_{kj} \prod_{i=1}^n p_t(x_{di}). \qquad (10.2)$$

Intuitively, Eq. (10.2) indicates that the true posterior distribution of the model parameters is a mixture model in which the number of components scales as $\tau_{\mathcal{S}}^D$ and each component is a product of $m$ Dirichlets, one for each sum node. Here $\tau_{\mathcal{S}}$ corresponds to the number of induced trees in $\mathcal{S}$ (cf. Thm. 9.1.4). For discrete SPNs the leaf univariate distributions are simply point mass distributions, i.e., indicator variables. For continuous distributions such as Gaussians, we also need to specify priors for the parameters of those leaf distributions, but the analysis is the same as in the discrete case.

Eq. (10.2) is intractable to compute exactly; hence we resort to collapsed variational inference. To simplify notation, we will assume that there is only one instance in the data set, i.e., $D = 1$; the extension of the following derivation to multiple training instances is straightforward. Consider the log marginal probability over the observable variables, which upper bounds the new ELBO $\mathcal{L}(\beta)$:

$$\log p(x \mid \alpha) \geq \mathbb{E}_{q(w \mid \beta)}[\log p(x, w \mid \alpha)] + H[q(w \mid \beta)] \qquad (10.3)$$
$$= \mathbb{E}_{q(w \mid \beta)}\Big[\log \sum_h p(x, h, w \mid \alpha)\Big] + H[q(w \mid \beta)] =: \mathcal{L}(\beta),$$

where $q(w \mid \beta) = \prod_{k=1}^m \mathrm{Dir}(w_k \mid \beta_k)$ is the factorized variational distribution over $w$ used to approximate the true marginal posterior distribution $p(w \mid x, \alpha) = \sum_h p(w, h \mid x, \alpha)$. Note that here we are using $q(w)$, as opposed to $q(w)q(h)$ in standard VI. We argue that $\mathcal{L}(\beta)$ gives us a better lower bound to optimize than the one given by standard VI. To see this, let $\mathcal{L}_b(\beta)$ be the ELBO given by standard VI, i.e.,
$$\mathcal{L}_b(\beta) = \mathbb{E}_{q(w)q(h)}[\log p(x, w, h \mid \alpha)] + H[q(w)q(h)];$$
we have:

$$\begin{aligned}
\mathcal{L}(\beta) &= \mathbb{E}_{q(w)}[\log p(x, w \mid \alpha)] + H[q(w)] \\
&= \mathbb{E}_{q(w)}\Big[\mathbb{E}_{p(h \mid w,x,\alpha)}\Big[\log \frac{p(h, w, x \mid \alpha)}{p(h \mid w, x, \alpha)}\Big]\Big] + H[q(w)] \\
&= \max_{q(h \mid w)} \mathbb{E}_{q(w)}\Big[\mathbb{E}_{q(h \mid w)}\Big[\log \frac{p(h, w, x \mid \alpha)}{q(h \mid w)}\Big]\Big] + H[q(w)] \\
&\geq \max_{q(h)} \mathbb{E}_{q(w)}\Big[\mathbb{E}_{q(h)}\Big[\log \frac{p(h, w, x \mid \alpha)}{q(h)}\Big]\Big] + H[q(w)] \\
&\geq \mathbb{E}_{q(w)q(h)}[\log p(x, w, h \mid \alpha)] + H[q(w)q(h)] \\
&= \mathcal{L}_b(\beta).
\end{aligned}$$

The first equality holds by the definition of $\mathcal{L}(\beta)$, and the second holds since $\log p(x, w \mid \alpha)$ is constant w.r.t. $\mathbb{E}_{p(h \mid w,x,\alpha)}[\cdot]$. The third equality holds because the inner expectation is the maximum that can be achieved by the negative KL divergence between the approximate posterior $q(h \mid w)$ and the true posterior $p(h \mid w, x, \alpha)$. The inequality in the fourth line is due to the fact that the maximization over $q(h \mid w)$ is unconstrained, so the optimal posterior is given by the true conditional posterior $q^*(h \mid w) = p(h \mid w, x, \alpha)$, while the maximization over $q(h)$ in the fourth line needs to satisfy the independence assumption made in standard VI, i.e., $q(h \mid w) = q(h)$. Combining the inequality above with (10.3), we have
$$\log p(x \mid \alpha) \geq \mathcal{L}(\beta) \geq \mathcal{L}_b(\beta), \qquad (10.4)$$
which shows that the ELBO given by collapsed variational inference is a better lower bound than the one given by standard VI. This conclusion is also consistent with the one obtained in collapsed variational inference for LDA and HDP (Teh et al., 2006, 2007), where the authors collapse out the global latent variables instead of the local latent variables. It is straightforward to verify that the difference between $\log p(x \mid \alpha)$ and $\mathcal{L}(\beta)$ is given by $\mathrm{KL}(q(w \mid \beta) \,\|\, p(w \mid x, \alpha))$, i.e., the KL divergence between the variational marginal posterior and the exact marginal posterior distribution. Substituting $q(w \mid \beta)$ into this KL objective and simplifying, we reach the following optimization problem:

$$\operatorname*{minimize}_{\beta}\; \mathrm{KL}(q(w \mid \beta) \,\|\, p(w \mid \alpha)) - \mathbb{E}_{q(w \mid \beta)}[\log p(x \mid w)], \qquad (10.5)$$

where the first part can be interpreted as a regularization term that penalizes a variational posterior that is too far away from the prior, and the second part is a data-fitting term that requires the variational posterior to fit the training data well. The first part of (10.5) can be computed efficiently thanks to the factorization of both the prior and the variational posterior. However, the second part does not admit an analytic form, because after we marginalize out all the local latent variables, $q(w \mid \beta)$ is no longer conjugate to the likelihood function $p(x \mid w) = V_{\mathrm{root}}(x \mid w)$. We address this problem in the next section.
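Since both $q(w \mid \beta)$ and $p(w \mid \alpha)$ factorize over sum nodes, the KL term in (10.5) decomposes into a sum of Dirichlet-to-Dirichlet KL divergences, each available in closed form. A minimal sketch using the standard closed-form expression (the helper name and the numeric values are ours):

```python
import numpy as np
from scipy.special import gammaln, digamma

# Closed-form KL( Dir(beta_k) || Dir(alpha_k) ) for one sum node; the KL
# term in (10.5) is the sum of these over all sum nodes.

def kl_dirichlet(beta, alpha):
    b0, a0 = beta.sum(), alpha.sum()
    return (gammaln(b0) - gammaln(beta).sum()
            - gammaln(a0) + gammaln(alpha).sum()
            + ((beta - alpha) * (digamma(beta) - digamma(b0))).sum())

beta  = [np.array([3.0, 2.0]), np.array([1.5, 1.5, 2.0])]   # variational q
alpha = [np.array([1.0, 1.0]), np.array([1.0, 1.0, 1.0])]   # prior p
print(sum(kl_dirichlet(b, a) for b, a in zip(beta, alpha)))
```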

10.3 Upper Bound by Logarithmic Transformation

The hardness of computing $\mathbb{E}_{q(w \mid \beta)}[\log p(x \mid w)]$ makes direct optimization of (10.5) infeasible in practice. While nonconjugate inference with little analytic work is possible with recent innovations (Blei et al., 2012; Kingma and Welling, 2013; Mnih and Gregor, 2014; Ranganath et al., 2014; Titsias and Lázaro-Gredilla, 2014; Titsias, 2015), these methods fail to make use of a key analytic property of SPNs, namely easy marginalization, which allows the optimal variational distributions of the local variables to be used. In this section we show how to use a logarithmic transformation trick to develop an upper bound on the objective in (10.5), which leads to the efficient CVB-SPN algorithm.

The key observation is that the likelihood of $w$ is a posynomial function (Boyd et al., 2007). To see this, as shown in Thm. 9.1.4, we have $p(x \mid w) = \sum_{t=1}^{\tau_{\mathcal{S}}} \prod_{(k,j) \in \mathcal{T}_{tE}} w_{kj} \prod_{i=1}^n p_t(X_i = x_i)$. The product term with respect to $x$, $\prod_{i=1}^n p_t(X_i = x_i)$, is guaranteed to be nonnegative and can be treated as a constant w.r.t. $w$. Hence each component of $p(x \mid w)$, in terms of $w$, is a monomial function with a positive multiplicative constant, and it follows that $p(x \mid w)$ is a posynomial function of $w$. A natural implication of this observation is that we can apply a change of variables to $w$ such that $\log p(x \mid w)$ becomes a convex function of the transformed variables. More specifically, let $w' = \log w$, where $\log(\cdot)$ is taken elementwise. The log-likelihood, expressed in terms of $w'$, is

$$\log p(x \mid w) = \log\Big(\sum_{t=1}^{\tau_{\mathcal{S}}} \prod_{(k,j) \in \mathcal{T}_{tE}} w_{kj} \prod_{i=1}^n p_t(X_i = x_i)\Big) = \log\Big(\sum_{t=1}^{\tau_{\mathcal{S}}} \exp\Big(c_t + \sum_{(k,j) \in \mathcal{T}_{tE}} w'_{kj}\Big)\Big) =: \log p(x \mid w'), \qquad (10.6)$$

where $c_t$ is defined as $c_t := \log \prod_{i=1}^n p_t(X_i = x_i)$. For each unique tree $\mathcal{T}_t$, $c_t + \sum_{(k,j) \in \mathcal{T}_{tE}} w'_{kj}$ is an affine function of $w'$. Since the log-sum-exp function is convex in its argument, $\log p(x \mid w')$ is convex in $w'$. Such a change-of-variables trick is frequently applied in the geometric programming literature to transform a non-convex posynomial optimization problem into a convex programming problem (Boyd et al., 2007; Chiang, 2005).

The convexity of $\log p(x \mid w')$ helps us develop a lower bound on $\mathbb{E}_{q(w \mid \beta)}[\log p(x \mid w)]$ that is efficient to compute. Let $q'(w')$ be the variational distribution over $w'$ induced from $q(w)$ by the bijective transformation between $w$ and $w'$. We have
$$\mathbb{E}_{q(w)}[\log p(x \mid w)] = \int q(w) \log p(x \mid w)\, dw = \int q'(w') \log p(x \mid w')\, dw' \geq \log p(x \mid \mathbb{E}_{q'(w')}[w']),$$
where $\log p(x \mid \mathbb{E}_{q'(w')}[w'])$ denotes the log-likelihood of $x$ with the edge weights set to $\exp(\mathbb{E}_{q'(w')}[w'])$. Since $q(w) = \prod_{k=1}^m \mathrm{Dir}(w_k \mid \beta_k)$ is a product of Dirichlets, we can compute the weight $\exp(\mathbb{E}_{q'(w')}[w'_{kj}])$ for each edge $(k, j)$ as
$$\mathbb{E}_{q'(w')}[w'_{kj}] = \int q'(w')\, w'_{kj}\, dw' = \int q'(w'_k)\, w'_{kj}\, dw'_k = \int q(w_k) \log w_{kj}\, dw_k = \psi(\beta_{kj}) - \psi\Big(\sum_{j'} \beta_{kj'}\Big),$$
where $\psi(\cdot)$ is the digamma function. This implies that the new edge weight can be computed as
$$\exp\big(\mathbb{E}_{q'(w')}[w'_{kj}]\big) = \exp\Big(\psi(\beta_{kj}) - \psi\Big(\sum_{j'} \beta_{kj'}\Big)\Big) \approx \frac{\beta_{kj} - \frac{1}{2}}{\sum_{j'} \beta_{kj'} - \frac{1}{2}}, \qquad (10.7)$$
where the approximation uses $\exp(\psi(x)) \approx x - \frac{1}{2}$ for $x > 1$. Note that the mean of the variational posterior is given by $\bar{w}_{kj} = \mathbb{E}_{q(w)}[w_{kj}] = \beta_{kj}/\sum_{j'} \beta_{kj'}$, which is close to $\exp(\mathbb{E}_{q'(w')}[w'_{kj}])$ when $\beta_{kj} > 1$. Roughly speaking, this shows that the lower bound we obtained for $\mathbb{E}_{q(w)}[\log p(x \mid w)]$ by utilizing the logarithmic transformation optimizes the variational parameters $\beta$ so that the variational posterior mean fits the training data well. This is exactly what we hope for, since in the end we use the variational posterior mean as a Bayesian estimator of the model parameters. Combining all the analysis above, we formulate the following objective function, an upper bound of the objective given in (10.5), to be minimized:

$$\mathrm{KL}(q(w \mid \beta) \,\|\, p(w \mid \alpha)) - \log p(x \mid \mathbb{E}_{q'(w' \mid \beta)}[w']). \qquad (10.8)$$

Note that we use the approximation given by Eq. (10.7) in the above optimization formulation and in Alg. 3. This is a non-convex optimization problem due to the digamma function involved; nevertheless, we can still reach a local optimum with projected gradient descent in the experiments. We summarize our algorithm, CVB-SPN, in Alg. 3. For each instance, the gradient of the objective function in (10.8) with respect to $\beta$ can be computed by a bottom-up and a top-down pass of the network in time $O(|\mathcal{S}|)$; please refer to Peharz et al. (2016) and Zhao et al. (2016b) for more details.
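As an illustration of the weight update on line 2 of Alg. 3, the following sketch compares the exact value $\exp(\psi(\beta_{kj}) - \psi(\sum_{j'} \beta_{kj'}))$ from Eq. (10.7) with its approximation and with the posterior mean (the $\beta$ values are arbitrary):

```python
import numpy as np
from scipy.special import digamma

# Exact vs. approximate edge weight from Eq. (10.7), plus the posterior
# mean for comparison. Values are illustrative only.

beta_k = np.array([2.0, 3.0, 5.0])       # variational Dirichlet at sum node k
exact  = np.exp(digamma(beta_k) - digamma(beta_k.sum()))
approx = (beta_k - 0.5) / (beta_k.sum() - 0.5)
mean   = beta_k / beta_k.sum()           # \bar{w}_{kj}
print(exact)
print(approx)
print(mean)
```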

Algorithm 3 CVB-SPN
Input: Initial $\beta$, prior hyperparameters $\alpha$, training instances $\{x_d\}_{d=1}^D$.
Output: Locally optimal $\beta^*$.
1: while not converged do
2:   Update $w = \exp(\mathbb{E}_{q'(w' \mid \beta)}[w'])$ with Eq. (10.7).
3:   Set $\nabla_\beta = 0$.
4:   for $d = 1$ to $D$ do
5:     Bottom-up evaluation of $\log p(x_d \mid w)$.
6:     Top-down differentiation of $\partial \log p(x_d \mid w)/\partial w$.
7:     Update $\nabla_\beta$ based on $x_d$.
8:   end for
9:   Update $\nabla_\beta$ based on $\mathrm{KL}(q(w \mid \beta) \,\|\, p(w \mid \alpha))$.
10:  Update $\beta$ with projected GD.
11: end while

Parallel and Stochastic Variants. Note that Alg. 3 is easily parallelizable by splitting the training instances in the loop (Line 4–Line 8) across threads. It can also be extended to the stochastic setting where at each round we sample one instance, or a mini-batch of instances $\{x_{i_k}\}_{k=1}^s$ (where $s$ is the size of the mini-batch), from the training set. The stochastic formulation is helpful when the full training data set is too large to be stored in main memory as a whole, or when the training instances arrive in a streaming fashion so that at each time we only have access to one instance (Hoffman et al., 2013).

10.4 Experiments

10.4.1 Experimental Setting

We conduct experiments on a set of 20 benchmark data sets to compare the performance of the proposed collapsed variational inference method with maximum likelihood estimation (Gens and Domingos, 2012). The 20 real-world data sets used in the experiments have been widely used (Rooshenas and Lowd, 2014) to assess the modeling performance of SPNs. All the features are binary. The 20 data sets cover both low-dimensional and high-dimensional statistical estimation, enabling a thorough experimental comparison. Detailed information about each data set as well as the SPN models is shown in Table 10.4.1. All the SPNs are built using LearnSPN (Gens and Domingos, 2013). We evaluate and compare the performance of CVB-SPN with other parameter learning algorithms in both the batch and online learning settings. As a baseline, we compare to the state-of-the-art parameter learning algorithm for SPNs, which directly optimizes the training set log-likelihood to find a maximum likelihood estimator; we denote it MLE-SPN. We compare CVB-SPN and MLE-SPN in both the batch and online cases. For a fair comparison, we apply the same optimization method, i.e., projected gradient descent, to optimize the objective functions of CVB-SPN and MLE-SPN. CVB-SPN optimizes over the variational parameters of the posterior distribution based on (10.8), while MLE-SPN optimizes directly over the model parameters to maximize the log-likelihood of the training set. Since the optimization variables are constrained to be positive in both CVB-SPN and MLE-SPN, we need to project the parameters back onto the positive orthant after every iteration. We fix the projection margin $\epsilon$ to 0.01, i.e., $w = \max\{w, 0.01\}$, to avoid numerical issues. We implement both methods with backtracking line search to automatically adjust the learning rate at each iteration. In all experiments, the maximum number of iterations is fixed to 50 for both methods. We discard the model parameters returned by LearnSPN and use random weights as the initial model parameters.

Table 10.4.1: Statistics of the data sets and models. $n$ is the number of observable random variables modeled by the network, $|\mathcal{S}|$ is the size of the network, and $p$ is the number of parameters to be estimated. $n \times D / p$ is the ratio of the number of training instances times the number of variables to the number of parameters.

Data set     n     |S|         p          Train    Valid   Test    n×D/p
NLTCS        16    13,733      1,716      16,181   2,157   3,236   150.871
MSNBC        17    54,839      24,452     291,326  38,843  58,265  202.541
KDD 2k       64    48,279      14,292     180,092  19,907  34,955  806.457
Plants       69    132,959     58,853     17,412   2,321   3,482   20.414
Audio        100   739,525     196,103    15,000   2,000   3,000   7.649
Jester       100   314,013     180,750    9,000    1,000   4,116   4.979
Netflix      100   161,655     51,601     15,000   2,000   3,000   29.069
Accidents    111   204,501     74,804     12,758   1,700   2,551   18.931
Retail       135   56,931      22,113     22,041   2,938   4,408   134.560
Pumsb-star   163   140,339     63,173     12,262   1,635   2,452   31.638
DNA          180   108,021     52,121     1,600    400     1,186   5.526
Kosarak      190   203,321     53,204     33,375   4,450   6,675   119.187
MSWeb        294   68,853      20,346     29,441   3,270   5,000   425.423
Book         500   190,625     41,122     8,700    1,159   1,739   105.783
EachMovie    500   522,753     188,387    4,524    1,002   591     12.007
WebKB        839   1,439,751   879,893    2,803    558     838     2.673
Reuters-52   889   2,210,325   1,453,390  6,532    1,028   1,540   3.995
20 Newsgrp   910   14,561,965  8,295,407  11,293   3,764   3,764   1.239
BBC          1058  1,879,921   1,222,536  1,670    225     330     1.445
Ad           1556  4,133,421   1,380,676  2,461    327     491     2.774

CVB-SPN can more flexibly incorporate the model weights returned by LearnSPN, namely as the hyperparameters of the prior distributions. In the experiments we multiply the weights returned by LearnSPN by a positive scalar and treat them as the hyperparameters of the prior Dirichlet distributions. This is slightly better than using randomly initialized priors, but the differences are negligible on most data sets. MLE-SPN can also incorporate the model weights returned by LearnSPN by treating them as the hyperparameters of fixed prior Dirichlets; this corresponds to a MAP formulation. However, in practice we find that the MAP formulation performs no better than MLE-SPN, and on small data sets MAP-SPN gives consistently worse results than MLE-SPN, so we report results only for MLE-SPN. For both MLE-SPN and CVB-SPN we use a held-out validation set to pick the best solution during the optimization process. We report the average log-likelihood on each data set for each method.

Both CVB-SPN and MLE-SPN are easily extended to the online setting, where training instances arrive in a streaming fashion. In this case we also compare CVB-SPN and MLE-SPN to an online Bayesian moment matching method (OBMM) (Rashwan et al., 2016a) designed for learning SPNs. OBMM is a purely online algorithm in the sense that it can only process one instance per update, in order to avoid the exponential blow-up in the number of mixture components. OBMM constructs an approximate posterior distribution after seeing each instance by matching the first and second moments of the approximate posterior to the exact posterior. In this experiment, we compare both CVB-SPN (OCVB-SPN) and MLE-SPN (OMLE-SPN) to OBMM, where only one pass over the training set is allowed for learning and at each round only one instance is available to each algorithm.

Table 10.4.2: Average log-likelihoods on test data. Highest average log-likelihoods are highlighted in bold. ↑/↓ are used to represent statistically better/worse results than (O)CVB-SPN, respectively.

Data set     MLE-SPN    CVB-SPN   OBMM       OMLE-SPN   OCVB-SPN
NLTCS        ↓-6.44     -6.08     ↑-6.07     ↓-7.78     -6.12
MSNBC        ↓-7.02     -6.29     -6.35      ↓-6.94     -6.34
KDD 2k       ↓-4.24     -2.14     -2.14      ↓-27.99    -2.16
Plants       ↓-28.78    -12.86    ↑-15.14    ↓-30.23    -16.03
Audio        ↓-46.42    -40.36    ↓-40.70    ↓-48.90    -40.58
Jester       ↓-59.55    -54.26    -53.86     ↓-63.67    -53.84
Netflix      ↓-64.88    -60.69    -57.99     ↓-65.72    -57.96
Accidents    ↓-50.14    -29.55    ↓-42.66    ↓-58.63    -38.07
Retail       ↓-15.53    -10.91    ↓-11.42    ↓-82.42    -11.31
Pumsb-star   ↓-80.61    -25.93    ↓-45.27    ↓-80.19    -37.05
DNA          ↓-102.62   -86.73    ↓-99.61    ↓-96.84    -91.52
Kosarak      ↓-47.16    -10.70    -11.22     ↓-111.95   -11.12
MSWeb        ↓-19.69    -9.89     ↓-11.33    ↓-140.86   -10.73
Book         ↓-88.16    -34.44    ↓-35.55    ↓-299.02   -34.77
EachMovie    ↓-97.15    -52.63    ↑-59.50    ↓-284.92   -64.75
WebKB        ↓-199.15   -161.46   ↑-165.57   ↓-413.94   -169.31
Reuters-52   ↓-218.97   -85.45    -108.01    ↓-513.97   -108.04
20 Newsgrp   ↓-260.69   -155.61   ↓-158.01   ↓-728.11   -156.63
BBC          ↓-372.45   -251.23   ↓-275.43   ↓-517.36   -272.56
Ad           ↓-311.87   -19.00    ↓-63.81    ↓-572.01   -57.56

10.4.2 Results

All experiments are run on a server with an Intel Xeon E5 2.00GHz CPU. The running time ranges from 2 minutes to around 5 days, depending on the size of the data set and the size of the network. All algorithms have roughly the same running time on the same data set, as they scale linearly in the size of the training set and the size of the network. Table 10.4.2 shows the average joint log-likelihood scores of the different parameter learning algorithms on the 20 data sets. For each configuration, we use bold numbers to highlight the best score among the offline methods and among the online methods. We also use ↑/↓ to indicate whether a competitor method achieves statistically significantly better/worse results than CVB-SPN on the corresponding test data set under the Wilcoxon signed-rank test (Wilcoxon, 1950) with p-value ≤ 0.05. In the batch learning experiments, CVB-SPN consistently dominates MLE-SPN on every data set, by a large margin. The same can be observed in the online learning scenario: both OCVB-SPN and OBMM significantly outperform OMLE-SPN in all 20 experiments. These experiments demonstrate the effectiveness and robustness of Bayesian inference methods for parameter estimation in large graphical models like SPNs. In the online learning scenario, OCVB-SPN beats OBMM on 14 out of the 20 data sets, achieving statistically better results on 10 of those 14, while OBMM obtains statistically better results than OCVB-SPN on 4 of the data sets. This is partly due to the fact that OCVB-SPN explicitly optimizes an objective function that includes the evaluation of the variational posterior mean on the likelihood function, while OBMM only tries to match the first and second moments of the distribution after each update.

179 10.5 Conclusion

We develop a collapsed variational inference method, CVB-SPN, to learn the parameters of SPNs. CVB- SPN directly maintains a variational posterior distribution over the global latent variables by marginalizing out all the local latent variables. As a result, CVB-SPN is more memory efficient than standard VI. We also show that the collapsed ELBO in CVB-SPN is a better lower bound than the standard ELBO. We construct a logarithmic transformation trick to avoid the intractable computation of a high-dimensional expectation. We conduct experiments on 20 data sets to compare the proposed algorithm with state-of-the-art learning algorithms in both batch and online settings. The results demonstrate the effectiveness of CVB-SPN in both cases.

Chapter 11

Linear Time Computation of Moments

Bayesian online algorithms for Sum-Product Networks (SPNs) need to update their posterior distribution after seeing each additional instance. To do so, they must compute the moments of the model parameters under this distribution. The best existing method for computing such moments scales quadratically in the size of the SPN, although it scales linearly for trees. This unfortunate scaling makes Bayesian online algorithms prohibitively expensive, except for small or tree-structured SPNs. In this chapter we propose an optimal linear-time algorithm that works even when the SPN is a general directed acyclic graph (DAG), which significantly broadens the applicability of Bayesian online algorithms for SPNs. There are three key ingredients in the design and analysis of our algorithm:
1. For each edge in the graph, we construct a linear time reduction from the moment computation problem to a joint inference problem in SPNs.
2. Using the property that each SPN computes a multilinear polynomial, we give an efficient procedure for polynomial evaluation by differentiation, without expanding the network, which may contain exponentially many monomials.
3. We propose a dynamic programming method to further reduce the computation of the moments of all the edges in the graph from quadratic to linear.
We demonstrate the usefulness of our linear time algorithm by applying it to develop a linear time assumed density filter (ADF) for SPNs.

11.1 Exact Posterior Has Exponentially Many Modes

Let $m$ be the number of sum nodes in $\mathcal{S}$. Suppose we are given a fully factorized prior distribution $p_0(w \mid \alpha) = \prod_{k=1}^m p_0(w_k \mid \alpha_k)$ over $w$. It is worth pointing out that the fully factorized prior distribution is well justified by the bipartite graph structure of the equivalent BN constructed from the SPN. We are interested in computing the moments of the posterior distribution after we receive one observation from the world. Essentially, this is the Bayesian online learning setting, where we update our belief about the distribution of the model parameters as we observe data from the world sequentially. Note that $w_k$ corresponds to the weight vector associated with sum node $k$, so $w_k$ is a vector that satisfies $w_k > 0$ and $\mathbf{1}^T w_k = 1$. Let us assume that the prior distribution for each $w_k$ is Dirichlet, i.e.,

$$p_0(w \mid \alpha) = \prod_{k=1}^m \mathrm{Dir}(w_k \mid \alpha_k) = \prod_{k=1}^m \frac{\Gamma(\sum_j \alpha_{k,j})}{\prod_j \Gamma(\alpha_{k,j})} \prod_j w_{k,j}^{\alpha_{k,j}-1}.$$

After observing one instance $x$, the exact posterior distribution is $p(w \mid x) = p_0(w; \alpha)\, p(x \mid w)/p(x)$. Let $Z_x \triangleq p(x)$ and recall that the network polynomial also computes the likelihood $p(x \mid w)$. Plugging the expression for the prior distribution as well as the network polynomial into the Bayes formula above, we have:
$$p(w \mid x) = \frac{1}{Z_x} \sum_{t=1}^{\tau_{\mathcal{S}}} \prod_{k=1}^m \mathrm{Dir}(w_k \mid \alpha_k) \prod_{(k,j) \in \mathcal{T}_{tE}} w_{k,j} \prod_{i=1}^n p_t(x_i).$$
Since the Dirichlet is a conjugate distribution to the multinomial, each term in the summation is an updated Dirichlet with a multiplicative constant, so the above equation shows that the exact posterior distribution becomes a mixture of $\tau_{\mathcal{S}}$ Dirichlets after one observation. On a data set of $D$ instances, the exact posterior becomes a mixture of $\tau_{\mathcal{S}}^D$ components, which is intractable to maintain since $\tau_{\mathcal{S}} = \Omega(2^{H(\mathcal{S})})$. The hardness of maintaining the exact posterior distribution calls for an approximate scheme in which we can sequentially update our belief about the distribution while efficiently maintaining the approximation. Assumed density filtering (Sorenson and Stubberud, 1968) is such a framework: the algorithm chooses an approximate distribution from a tractable family of distributions after observing each instance. A typical choice is to match the moments of the approximation to those of the exact posterior.

11.2 The Hardness of Computing Moments

In order to find an approximate distribution that matches the moments of the exact posterior, we need to be able to compute those moments under the exact posterior. This is not a problem for traditional mixture models, including mixtures of Gaussians, latent Dirichlet allocation, etc., since the number of mixture components in those models is assumed to be a small constant. However, this is not the case for SPNs, where the effective number of mixture components is $\tau_{\mathcal{S}} = \Omega(2^{H(\mathcal{S})})$, which also depends on the input network $\mathcal{S}$. To simplify the notation, for each $t \in [\tau_{\mathcal{S}}]$, we define¹
$$c_t := \prod_{i=1}^n p_t(x_i) \quad \text{and} \quad u_t := \int_w p_0(w) \prod_{(k,j) \in \mathcal{T}_{tE}} w_{k,j}\, dw.$$
That is, $c_t$ corresponds to the product of the leaf distributions in the $t$-th induced tree $\mathcal{T}_t$, and $u_t$ is the moment of $\prod_{(k,j) \in \mathcal{T}_{tE}} w_{k,j}$, i.e., the product of the tree edges, under the prior distribution $p_0(w)$.

¹ For ease of notation, we omit the explicit dependency of $c_t$ on the instance $x$.

Realizing that the posterior distribution needs to satisfy the normalization constraint, we have:

$$\sum_{t=1}^{\tau_{\mathcal{S}}} c_t \int_w p_0(w) \prod_{(k,j) \in \mathcal{T}_{tE}} w_{k,j}\, dw = \sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t = Z_x. \qquad (11.1)$$

Note that the prior distribution for each sum node is a Dirichlet distribution. In this case we can compute a closed-form expression for $u_t$:
$$u_t = \int \prod_{(k,j) \in \mathcal{T}_{tE}} p_0(w_k)\, w_{k,j}\, dw_k = \prod_{(k,j) \in \mathcal{T}_{tE}} \mathbb{E}_{p_0(w_k)}[w_{k,j}] = \prod_{(k,j) \in \mathcal{T}_{tE}} \frac{\alpha_{k,j}}{\sum_{j'} \alpha_{k,j'}}. \qquad (11.2)$$
More generally, let $f(\cdot)$ be a function applied to each edge weight in an SPN. We use the notation $M_p(f)$ to denote the moment of the function $f$ evaluated under the distribution $p$. We are interested in computing $M_p(f)$ where $p = p(w \mid x)$, which we call the one-step update posterior distribution. More specifically, for each edge weight $w_{k,j}$, we would like to compute the following quantity:

$$M_p(f(w_{k,j})) = \int_w f(w_{k,j})\, p(w \mid x)\, dw = \frac{1}{Z_x} \sum_{t=1}^{\tau_{\mathcal{S}}} c_t \int_w p_0(w)\, f(w_{k,j}) \prod_{(k',j') \in \mathcal{T}_{tE}} w_{k',j'}\, dw. \qquad (11.3)$$

We note that (11.3) is not trivial to compute, as it involves $\tau_{\mathcal{S}} = \Omega(2^{H(\mathcal{S})})$ terms. Furthermore, in order to conduct moment matching, we need to compute the above moment for each edge $(k, j)$ emanating from a sum node. A naive computation would thus lead to a total time complexity of $\Omega(|\mathcal{S}| \cdot 2^{H(\mathcal{S})})$. A linear time algorithm to compute these moments was designed by Rashwan et al. (2016b) for the case where the underlying structure of $\mathcal{S}$ is a tree; it recursively computes the moments in a top-down fashion along the tree. However, this algorithm breaks down when the graph is a DAG. In what follows we present an $O(|\mathcal{S}|)$ time and space algorithm that is able to compute all the moments simultaneously for general SPNs with DAG structures. We first show a linear time reduction from the moment computation in (11.3) to a joint inference problem in $\mathcal{S}$, and then use the differential trick to efficiently compute (11.3) for each edge in the graph. The final component is a dynamic program that computes (11.3) simultaneously for all edges $w_{k,j}$ in the graph, trading a constant factor in space complexity for a reduction in time complexity.

11.3 Linear Time Reduction from Moment Computation to Joint Inference

Let us first compute (11.3) for a fixed edge $(k, j)$. Our strategy is to partition the induced trees based on whether they contain the tree edge $(k, j)$ or not. Define $\mathcal{T}_F = \{\mathcal{T}_t \mid (k, j) \not\in \mathcal{T}_t,\ t \in [\tau_{\mathcal{S}}]\}$ and $\mathcal{T}_T = \{\mathcal{T}_t \mid (k, j) \in \mathcal{T}_t,\ t \in [\tau_{\mathcal{S}}]\}$. In other words, $\mathcal{T}_F$ corresponds to the set of trees that do not contain edge $(k, j)$ and $\mathcal{T}_T$ corresponds to the set of trees that contain edge $(k, j)$. Then,
$$M_p(f(w_{k,j})) = \frac{1}{Z_x} \sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t \int_w p_0(w)\, f(w_{k,j}) \prod_{(k',j') \in \mathcal{T}_{tE}} w_{k',j'}\, dw \;+\; \frac{1}{Z_x} \sum_{\mathcal{T}_t \in \mathcal{T}_F} c_t \int_w p_0(w)\, f(w_{k,j}) \prod_{(k',j') \in \mathcal{T}_{tE}} w_{k',j'}\, dw. \qquad (11.4)$$

For the induced trees that contain edge $(k, j)$, we have
$$\frac{1}{Z_x} \sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t \int_w p_0(w)\, f(w_{k,j}) \prod_{(k',j') \in \mathcal{T}_{tE}} w_{k',j'}\, dw = \frac{1}{Z_x} \Big(\sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t u_t\Big)\, M_{p'_{0,k}}(f(w_{k,j})), \qquad (11.5)$$
where $p'_{0,k}$ is the one-step update posterior Dirichlet distribution for sum node $k$ after absorbing the term $w_{k,j}$. Similarly, for the induced trees that do not contain edge $(k, j)$:
$$\frac{1}{Z_x} \sum_{\mathcal{T}_t \in \mathcal{T}_F} c_t \int_w p_0(w)\, f(w_{k,j}) \prod_{(k',j') \in \mathcal{T}_{tE}} w_{k',j'}\, dw = \frac{1}{Z_x} \Big(\sum_{\mathcal{T}_t \in \mathcal{T}_F} c_t u_t\Big)\, M_{p_{0,k}}(f(w_{k,j})), \qquad (11.6)$$
where $p_{0,k}$ is the prior Dirichlet distribution for sum node $k$. The above equation holds by changing the order of integration and realizing that, since $(k, j)$ is not in tree $\mathcal{T}_t$, $\prod_{(k',j') \in \mathcal{T}_{tE}} w_{k',j'}$ does not contain the term $w_{k,j}$. Note that both $M_{p_{0,k}}(f(w_{k,j}))$ and $M_{p'_{0,k}}(f(w_{k,j}))$ are independent of the specific induced trees, so we can combine the two parts above to express $M_p(f(w_{k,j}))$ as:
$$M_p(f(w_{k,j})) = \Big(\frac{1}{Z_x} \sum_{\mathcal{T}_t \in \mathcal{T}_F} c_t u_t\Big) M_{p_{0,k}}(f(w_{k,j})) + \Big(\frac{1}{Z_x} \sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t u_t\Big) M_{p'_{0,k}}(f(w_{k,j})). \qquad (11.7)$$
From (11.1) we have
$$\frac{1}{Z_x} \sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t = 1 \quad \text{and} \quad \sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t = \sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t u_t + \sum_{\mathcal{T}_t \in \mathcal{T}_F} c_t u_t.$$
This implies that $M_p(f)$ is in fact a convex combination of $M_{p_{0,k}}(f)$ and $M_{p'_{0,k}}(f)$. In other words, since both $M_{p_{0,k}}(f)$ and $M_{p'_{0,k}}(f)$ can be computed in closed form for each edge $(k, j)$, in order to compute (11.3) we only need to compute the two coefficients efficiently. Recall that for each induced tree $\mathcal{T}_t$ we have $u_t = \prod_{(k,j) \in \mathcal{T}_{tE}} \alpha_{k,j}/\sum_{j'} \alpha_{k,j'}$. The term $\sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t$ can thus be expressed as:
$$\sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t = \sum_{t=1}^{\tau_{\mathcal{S}}} \prod_{(k,j) \in \mathcal{T}_{tE}} \frac{\alpha_{k,j}}{\sum_{j'} \alpha_{k,j'}} \prod_{i=1}^n p_t(x_i). \qquad (11.8)$$
The key observation that allows us to find the linear time reduction lies in the fact that (11.8) shares exactly the same functional form as the network polynomial, the only difference being the specification of the edge weights in the network. The following lemma formalizes this argument.

Lemma 11.3.1. $\sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t$ can be computed in $O(|\mathcal{S}|)$ time and space in a bottom-up evaluation of $\mathcal{S}$.

Proof. Compare the form of (11.8) to the network polynomial:
$$p(x \mid w) = V_{\mathrm{root}}(x \mid w) = \sum_{t=1}^{\tau_{\mathcal{S}}} \prod_{(k,j) \in \mathcal{T}_{tE}} w_{k,j} \prod_{i=1}^n p_t(x_i). \qquad (11.9)$$
Clearly (11.8) and (11.9) share the same functional form; the only difference is that the edge weight used in (11.8) is given by $\alpha_{k,j}/\sum_{j'} \alpha_{k,j'}$ while the edge weight used in (11.9) is given by $w_{k,j}$, both of which are constrained to be positive and locally normalized. This means that in order to compute the value of (11.8), we can replace each edge weight $w_{k,j}$ with $\alpha_{k,j}/\sum_{j'} \alpha_{k,j'}$, and a bottom-up evaluation pass of $\mathcal{S}$ will give us the desired result at the root of the network. The linear time and space complexity then follows from the linear time and space inference complexity of SPNs. □

In other words, we reduce the original moment computation problem for edge $(k, j)$ to a joint inference problem in $\mathcal{S}$ with a set of weights determined by $\alpha$.
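A minimal numeric sketch of this reduction on a toy two-level SPN (the structure and all names below are our own illustrative choices): replacing every sum-node weight by $\alpha_{k,j}/\sum_{j'} \alpha_{k,j'}$ and running one bottom-up pass yields $\sum_t c_t u_t = Z_x$ at the root.

```python
import numpy as np

# Toy SPN: root = sum(P1, P2); P1 = 1{X1=1} * M1 and P2 = 1{X1=0} * M2,
# where M1, M2 are sum nodes over the indicators of X2. Names are ours.

def eval_spn(weights, x):
    """One bottom-up evaluation pass; returns the value at the root."""
    x1, x2 = x
    leaf_x2 = np.array([x2, 1 - x2], dtype=float)   # [1{X2=1}, 1{X2=0}]
    m1 = float(weights['M1'] @ leaf_x2)
    m2 = float(weights['M2'] @ leaf_x2)
    p = np.array([x1 * m1, (1 - x1) * m2])
    return float(weights['R'] @ p)

alpha = {'R':  np.array([2.0, 1.0]),
         'M1': np.array([1.0, 3.0]),
         'M2': np.array([2.0, 2.0])}
# Replace each w_kj by alpha_kj / sum_j' alpha_kj' and evaluate bottom-up:
prior_mean = {k: a / a.sum() for k, a in alpha.items()}
Zx = eval_spn(prior_mean, x=(1, 0))                  # = sum_t c_t u_t
print(Zx)
```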

11.4 Efficient Polynomial Evaluation by Differentiation

To evaluate (11.7), we also need to compute $\sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t u_t$ efficiently, where the sum is over the subset of induced trees that contain edge $(k, j)$. Again, due to the exponential lower bound on the number of unique induced trees, a brute-force computation is infeasible in the worst case. The key observation is that we can use the differential trick to solve this problem, by realizing that $Z_x = \sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t$ is a multilinear function in $w_{k,j} \triangleq \alpha_{k,j}/\sum_{j'} \alpha_{k,j'}$, $\forall k, j$, and that it has a tractable circuit representation, since it shares its network structure with $\mathcal{S}$.

Lemma 11.4.1. $\sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t u_t = w_{k,j}\, \partial\big(\sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t\big)/\partial w_{k,j}$, and it can be computed in $O(|\mathcal{S}|)$ time and space in a top-down differentiation of $\mathcal{S}$.

Proof. Define $w_{k,j} \triangleq \alpha_{k,j}/\sum_{j'} \alpha_{k,j'}$; then
$$\sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t u_t = \sum_{\mathcal{T}_t \in \mathcal{T}_T} \prod_{(k',j') \in \mathcal{T}_{tE}} w_{k',j'} \prod_{i=1}^n p_t(x_i) = w_{k,j} \sum_{\mathcal{T}_t \in \mathcal{T}_T} \prod_{\substack{(k',j') \in \mathcal{T}_{tE} \\ (k',j') \neq (k,j)}} w_{k',j'} \prod_{i=1}^n p_t(x_i) + 0 \cdot \sum_{\mathcal{T}_t \in \mathcal{T}_F} c_t u_t$$
$$= w_{k,j}\, \frac{\partial}{\partial w_{k,j}} \Big(\sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t u_t\Big) + w_{k,j}\, \frac{\partial}{\partial w_{k,j}} \Big(\sum_{\mathcal{T}_t \in \mathcal{T}_F} c_t u_t\Big) = w_{k,j}\, \frac{\partial}{\partial w_{k,j}} \Big(\sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t\Big),$$
where the second equality follows from the observation that the network polynomial is a multilinear function of $w_{k,j}$, and the third equality holds because $\mathcal{T}_F$ is the set of trees that do not contain $w_{k,j}$, so the corresponding partial derivative vanishes. The last equality follows by simple algebraic transformations. In summary, the lemma holds because the differential operator applied to a multilinear function acts as a selector of all the monomials containing a specific variable. Hence $\sum_{\mathcal{T}_t \in \mathcal{T}_F} c_t u_t = \sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t - \sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t u_t$ can also be computed. To see the linear time and space complexity, recall that the differentiation w.r.t. $w_{k,j}$ can be computed efficiently by back-propagation in a top-down pass of $\mathcal{S}$, once we have computed $\sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t$ in a bottom-up pass of $\mathcal{S}$. □
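The selector property at the heart of the lemma can be checked symbolically on a toy multilinear polynomial (a sketch; the coefficients below are made-up stand-ins for the $c_t u_t$ factors):

```python
import sympy as sp

# For a multilinear f, w1 * df/dw1 recovers exactly the monomials of f
# that contain w1 (the selector property used in Lemma 11.4.1).

w1, w2, w3 = sp.symbols('w1 w2 w3', positive=True)
f = 2 * w1 * w3 + 3 * w2 * w3 + 5 * w1 * w2

selected = sp.expand(w1 * sp.diff(f, w1))
print(selected)                                   # 2*w1*w3 + 5*w1*w2
assert sp.simplify(selected - (2 * w1 * w3 + 5 * w1 * w2)) == 0
```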

Remark. The fact that we can compute the differentiation w.r.t. $w_{k,j}$ using the original circuit, without expanding it, underlies many recent advances in the algorithmic design of SPNs. Zhao et al. (2016a,b) used the above differential trick to design a linear time collapsed variational algorithm and the concave-convex procedure for parameter estimation in SPNs. A different but related approach, where the differential operator is taken w.r.t. the input indicators rather than the model parameters, is applied in computing marginal probabilities in Bayesian networks and junction trees (Darwiche, 2003; Park and Darwiche, 2004). We conclude this discussion by noting that whenever the polynomial computed by the network is a multilinear function of the model parameters or input indicators (as in SPNs), the differential operator w.r.t. a variable can be used as an efficient way to compute the sum of the subset of monomials that contain that variable.

11.5 Dynamic Programming: from Quadratic to Linear

Define $D_k(x \mid w) = \partial V_{\mathrm{root}}(x \mid w)/\partial V_k(x \mid w)$. Then the differentiation term $\partial\big(\sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t\big)/\partial w_{k,j}$ in Lemma 11.4.1 can be computed via back-propagation in a top-down pass of the network as follows:
$$\frac{\partial \sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t}{\partial w_{k,j}} = \frac{\partial V_{\mathrm{root}}(x \mid w)}{\partial V_k(x \mid w)}\, \frac{\partial V_k(x \mid w)}{\partial w_{k,j}} = D_k(x \mid w)\, V_j(x \mid w). \qquad (11.10)$$
Let $\lambda_{k,j} = w_{k,j}\, V_j(x \mid w)\, D_k(x \mid w)/V_{\mathrm{root}}(x \mid w)$ and $f_{k,j} = f(w_{k,j})$; then the final formula for computing the moment of the edge weight $w_{k,j}$ under the one-step update posterior $p$ is given by
$$M_p(f_{k,j}) = (1 - \lambda_{k,j})\, M_{p_0}(f_{k,j}) + \lambda_{k,j}\, M_{p_0'}(f_{k,j}). \qquad (11.11)$$

Corollary 11.5.1. For each edge $(k, j)$, (11.7) can be computed in $O(|\mathcal{S}|)$ time and space.

The corollary follows directly from Lemma 11.3.1 and Lemma 11.4.1, under the assumption that the moments under the prior have closed-form solutions. By definition, we also have $\lambda_{k,j} = \sum_{\mathcal{T}_t \in \mathcal{T}_T} c_t u_t / Z_x$, hence $0 \leq \lambda_{k,j} \leq 1$, $\forall (k, j)$. This formula shows that $\lambda_{k,j}$ computes the ratio of the induced trees that contain edge $(k, j)$ to all induced trees in the network. Roughly speaking, it measures how important the contribution of a specific edge is to the whole network polynomial. As a result, we can interpret (11.11) as follows: the more important the edge is, the larger the portion of the moment that comes from the new observation.

CCCP for SPNs was originally derived using a sequential convex relaxation technique, where in each iteration a concave surrogate function is constructed and optimized. The key update in each iteration of CCCP (Zhao et al. (2016b), (7)) is given as follows: $w'_{k,j} \propto w_{k,j} V_j(x \mid w) D_k(x \mid w)/V_{\mathrm{root}}(x \mid w)$, where the right-hand side is exactly the $\lambda_{k,j}$ defined above. From this perspective, CCCP can also be understood as implicitly applying the differential trick to compute $\lambda_{k,j}$, i.e., the relative importance of edge $(k, j)$, and then taking updates according to this importance measure.

In order to compute the moments of all the edge weights $w_{k,j}$, a naive computation would scale as $O(|\mathcal{S}|^2)$, because there are $O(|\mathcal{S}|)$ edges in the graph and, by Cor. 11.5.1, each such computation takes $O(|\mathcal{S}|)$ time. The key observation that allows us to further reduce the complexity to linear comes from the structure of $\lambda_{k,j}$: $\lambda_{k,j}$ only depends on three terms, i.e., the forward evaluation value $V_j(x \mid w)$, the backward differentiation value $D_k(x \mid w)$ and the original weight of the edge, $w_{k,j}$. This implies that we can use dynamic programming to cache both $V_j(x \mid w)$ and $D_k(x \mid w)$ in a bottom-up evaluation pass and a top-down differentiation pass, respectively. At a high level, we trade off a constant factor in space complexity (using two additional copies of the network) to reduce the quadratic time complexity to linear.

Theorem 11.5.1. For all edges $(k, j)$, (11.7) can be computed in $O(|\mathcal{S}|)$ time and space.

Proof. During the bottom-up evaluation pass, in order to compute the value $V_{\mathrm{root}}(x; w)$ at the root of $\mathcal{S}$, we also obtain all the values $V_j(x; w)$ at each node $j$ in the graph. So instead of discarding these intermediate $V_j(x; w)$, we cache them by allocating additional space at each node $j$. Thus, after one bottom-up evaluation pass of the network, we also have all the $V_j(x; w)$ for each node $j$, at the cost of one additional copy of the network. Similarly, during the top-down differentiation pass of the network, because of the chain rule, we also obtain all the intermediate $D_k(x; w)$ at each node $k$; again, we cache them. Once we have both $V_j(x; w)$ and $D_k(x; w)$ for each edge $(k, j)$, by (11.11) we can obtain the moments for all the weighted edges in $\mathcal{S}$ simultaneously. Because the whole process only requires one bottom-up evaluation pass and one top-down differentiation pass of $\mathcal{S}$, the time complexity is $2|\mathcal{S}|$. Since we use two additional copies of $\mathcal{S}$, the space complexity is $3|\mathcal{S}|$. □

We summarize the linear time algorithm for moment computation in Alg. 4.
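Putting the pieces together, here is a minimal end-to-end sketch on the same toy two-level SPN used earlier (all names are illustrative; this is not the thesis implementation): one bottom-up pass caches $V_j$, one top-down pass caches $D_k$, and then every $\lambda_{k,j}$ and every posterior mean follows from (11.11) in linear total time.

```python
import numpy as np

# Toy SPN: root = sum(P1, P2); P1 = 1{X1=1} * M1, P2 = 1{X1=0} * M2.

def forward(w, x):
    x1, x2 = x
    leaf_x2 = np.array([x2, 1 - x2], dtype=float)
    V = {'leaf_x2': leaf_x2}
    V['M1'] = float(w['M1'] @ leaf_x2)
    V['M2'] = float(w['M2'] @ leaf_x2)
    V['P1'] = x1 * V['M1']
    V['P2'] = (1 - x1) * V['M2']
    V['R'] = float(w['R'] @ np.array([V['P1'], V['P2']]))
    return V

def backward(w, x):
    x1 = x[0]
    D = {'R': 1.0}                       # dV_root / dV_root = 1
    D['P1'] = D['R'] * w['R'][0]         # through the root sum node
    D['P2'] = D['R'] * w['R'][1]
    D['M1'] = D['P1'] * x1               # through P1 = x1 * M1
    D['M2'] = D['P2'] * (1 - x1)         # through P2 = (1 - x1) * M2
    return D

alpha = {'R':  np.array([2.0, 1.0]),
         'M1': np.array([1.0, 3.0]),
         'M2': np.array([2.0, 2.0])}
w = {k: a / a.sum() for k, a in alpha.items()}       # line 1 of Alg. 4
x = (1, 0)
V, D = forward(w, x), backward(w, x)

children_V = {'R': np.array([V['P1'], V['P2']]),
              'M1': V['leaf_x2'], 'M2': V['leaf_x2']}
for k in ('R', 'M1', 'M2'):
    lam = w[k] * children_V[k] * D[k] / V['R']       # lambda_kj per edge
    m_prior = alpha[k] / alpha[k].sum()              # mean under p_0
    m_plus = (alpha[k] + 1) / (alpha[k].sum() + 1)   # mean under p_0'
    print(k, (1 - lam) * m_prior + lam * m_plus)     # Eq. (11.11), f = id
```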

11.6 Applications in Online Moment Matching

In this section we use Alg. 4 as a sub-routine to develop a new Bayesian online learning algorithm for SPNs based on assumed density filtering (Sorenson and Stubberud, 1968). To do so, we find an approximate distribution by minimizing the KL divergence between the one-step update posterior and the approximate distribution.

Algorithm 4 Linear Time Exact Moment Computation
Input: Prior $p_0(w \mid \alpha)$, moment $f$, SPN $\mathcal{S}$ and input $x$.
Output: $M_p(f(w_{k,j}))$, $\forall (k, j)$.
1: $w_{k,j} \leftarrow \alpha_{k,j}/\sum_{j'} \alpha_{k,j'}$, $\forall (k, j)$.
2: Compute $M_{p_0}(f(w_{k,j}))$ and $M_{p_0'}(f(w_{k,j}))$, $\forall (k, j)$.
3: Bottom-up evaluation pass of $\mathcal{S}$ with input $x$; record $V_k(x; w)$ at each node $k$.
4: Top-down differentiation pass of $\mathcal{S}$ with input $x$; record $D_k(x; w)$ at each node $k$.
5: Compute the exact moment for each $(k, j)$: $M_p(f_{k,j}) = (1 - \lambda_{k,j}) M_{p_0}(f_{k,j}) + \lambda_{k,j} M_{p_0'}(f_{k,j})$.

Let $\mathcal{P} = \{q \mid q = \prod_{k=1}^m \mathrm{Dir}(w_k; \beta_k)\}$, i.e., $\mathcal{P}$ is the space of products of Dirichlet densities that are decomposable over all the sum nodes in $\mathcal{S}$. Note that since $p_0(w; \alpha)$ is fully decomposable, we have $p_0 \in \mathcal{P}$. One natural choice is to find an approximate distribution $q \in \mathcal{P}$ that minimizes the KL divergence between $p(w \mid x)$ and $q$, i.e.,

$$\hat{p} = \operatorname*{arg\,min}_{q \in \mathcal{P}} D_{\mathrm{KL}}(p(w \mid x) \,\|\, q).$$

It is not hard to show that when $q$ is an exponential family distribution, which is the case in our setting, this minimization problem corresponds to solving the following moment matching equation:

$$\mathbb{E}_{p(w \mid x)}[T(w_k)] = \mathbb{E}_{q(w)}[T(w_k)], \qquad (11.12)$$

where $T(w_k)$ is the vector of sufficient statistics of $q(w_k)$. When $q(\cdot)$ is a Dirichlet, we have $T(w_k) = \log w_k$, where the log is understood to be taken elementwise. This principle of finding an approximate distribution is also known as reverse information projection in the literature of information theory (?). As a comparison, information projection corresponds to minimizing $D_{\mathrm{KL}}(q \,\|\, p(w \mid x))$ within the same family of distributions $q \in \mathcal{P}$. By utilizing our efficient linear time algorithm for exact moment computation, we propose a Bayesian online learning algorithm for SPNs based on the above moment matching principle, called assumed density filtering (ADF). The pseudocode is shown in Alg. 5. In the ADF algorithm, for each edge $w_{k,j}$ the above moment matching equation amounts to solving the following equation:

$$\psi(\beta_{k,j}) - \psi\Big(\sum_{j'} \beta_{k,j'}\Big) = \mathbb{E}_{p(w \mid x)}[\log w_{k,j}]$$

where $\psi(\cdot)$ is the digamma function. This is a system of nonlinear equations in $\beta$, where the R.H.S. of the above equation can be computed using Alg. 4 in $O(|\mathcal{S}|)$ time for all the edges $(k, j)$. To solve it efficiently, we take $\exp(\cdot)$ on both sides of the equation and approximate the L.H.S. using the fact that $\exp(\psi(\beta_{k,j})) \approx \beta_{k,j} - \frac{1}{2}$ for $\beta_{k,j} > 1$. Expanding the R.H.S. of the above equation using the identity from (11.11), we have:

$$\exp\Big(\psi(\beta_{k,j}) - \psi\Big(\sum_{j'} \beta_{k,j'}\Big)\Big) = \exp\big(\mathbb{E}_{p(w \mid x)}[\log w_{k,j}]\big) \;\Leftrightarrow\; \frac{\beta_{k,j} - \frac{1}{2}}{\sum_{j'} \beta_{k,j'} - \frac{1}{2}} = \left(\frac{\alpha_{k,j} - \frac{1}{2}}{\sum_{j'} \alpha_{k,j'} - \frac{1}{2}}\right)^{1 - \lambda_{k,j}} \times \left(\frac{\alpha_{k,j} + \frac{1}{2}}{\sum_{j'} \alpha_{k,j'} + \frac{1}{2}}\right)^{\lambda_{k,j}} \qquad (11.13)$$

Note that $(\alpha_{k,j} - 0.5)/(\sum_{j'} \alpha_{k,j'} - 0.5)$ is approximately the mean of the prior Dirichlet $p_0$, and $(\alpha_{k,j} + 0.5)/(\sum_{j'} \alpha_{k,j'} + 0.5)$ is approximately the mean of $p_0'$, where $p_0'$ is the posterior obtained by adding one pseudo-count to $w_{k,j}$. So (11.13) is essentially finding a posterior with hyperparameter $\beta$ such that the posterior mean is approximately the weighted geometric mean of the means given by $p_0$ and $p_0'$, weighted by $\lambda_{k,j}$.

Algorithm 5 Assumed Density Filtering for SPN
Input: Prior $p_0(w \mid \alpha)$, SPN $\mathcal{S}$, and inputs $\{x_i\}_{i=1}^{\infty}$.
1: $p(w) \leftarrow p_0(w \mid \alpha)$
2: for $i = 1, \ldots, \infty$ do
3:   Apply Alg. 4 to compute $\mathbb{E}_{p(w \mid x_i)}[\log w_{k,j}]$ for all edges $(k, j)$.
4:   Find $\hat{p} = \operatorname*{arg\,min}_{q \in \mathcal{P}} D_{\mathrm{KL}}(p(w \mid x_i) \,\|\, q)$ by solving the moment matching equation (11.12).
5:   $p(w) \leftarrow \hat{p}(w)$.
6: end for
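As a sketch of how line 4 of Alg. 5 plays out at a single sum node under the approximation $\exp(\psi(\beta)) \approx \beta - \frac{1}{2}$, the following hypothetical helper (the name adf_node_update and the array encoding are ours, not from the thesis) turns that node's prior hyperparameters and per-edge ratios $\lambda_{k,\cdot}$ into new normalized weights via the geometric-mean form of (11.13).

import numpy as np

def adf_node_update(alpha, lam):
    """Sketch of the per-sum-node ADF update implied by (11.13): the new mean
    is the lambda-weighted geometric mean of the (approximate) prior mean and
    the mean of the one-pseudo-count posterior p0'.
    alpha: hyperparameters over one sum node's edges; lam: matching lambdas."""
    prior_mean = (alpha - 0.5) / (alpha.sum() - 0.5)   # approx. mean under p0
    post_mean = (alpha + 0.5) / (alpha.sum() + 0.5)    # approx. mean under p0'
    geo = prior_mean ** (1.0 - lam) * post_mean ** lam
    # As with (11.14) below, only the normalized weights are consumed by line 1
    # of Alg. 4 in the next iteration, so we return the renormalized mean.
    return geo / geo.sum()

For example, adf_node_update(np.array([2.0, 3.0]), np.array([0.1, 0.4])) returns the refreshed weight vector for a two-child sum node after one observation.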

Instead of matching the moments given by the sufficient statistics, also known as the natural moments, BMM tries to find an approximate distribution $q$ by matching the first-order moments, i.e., the means of the prior and of the one-step update posterior. Using the same notation, we want $q$ to match the following equation:

$$\mathbb{E}_{q(w)}[w_k] = \mathbb{E}_{p(w \mid x)}[w_k] \;\Leftrightarrow\; \frac{\beta_{k,j}}{\sum_{j'} \beta_{k,j'}} = (1 - \lambda_{k,j})\, \frac{\alpha_{k,j}}{\sum_{j'} \alpha_{k,j'}} + \lambda_{k,j}\, \frac{\alpha_{k,j} + 1}{\sum_{j'} \alpha_{k,j'} + 1} \qquad (11.14)$$

Again, we can interpret the above equation as finding the posterior hyperparameter $\beta$ such that the posterior mean is given by the weighted arithmetic mean of the means given by $p_0$ and $p_0'$, weighted by $\lambda_{k,j}$. Notice that due to the normalization constraint, we cannot solve for $\beta$ directly from the above equations; to solve for $\beta$ we would need to add one more equation to the system. However, from line 1 of Alg. 4, what we need in the next iteration of the algorithm is not $\beta$ itself, but only its normalized version. So we can get rid of the additional equation and use (11.14) as the update formula directly in our algorithm. Using Alg. 4 as a sub-routine, both ADF and BMM enjoy linear running time, sharing the same order of time complexity as CCCP. However, since CCCP directly optimizes the data log-likelihood, in practice we observe that CCCP often outperforms ADF and BMM in log-likelihood scores.
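For contrast with the ADF sketch above, here is a matching sketch of the BMM update (11.14); again the helper name is ours, and only the normalized weights are returned since $\beta$ itself is not identifiable from (11.14) alone.

import numpy as np

def bmm_node_update(alpha, lam):
    """Sketch of the per-sum-node BMM update (11.14): match the first-order
    moment, i.e. take the lambda-weighted arithmetic mean of the prior mean
    and the mean of the one-pseudo-count posterior p0', edge by edge."""
    prior_mean = alpha / alpha.sum()                   # exact mean under p0
    post_mean = (alpha + 1.0) / (alpha.sum() + 1.0)    # mean under p0'
    mean = (1.0 - lam) * prior_mean + lam * post_mean
    # The normalization constraint leaves beta underdetermined; the algorithm
    # only needs the normalized weights, so renormalize the combined mean.
    return mean / mean.sum()

The only difference from the ADF sketch is the arithmetic rather than geometric combination, mirroring the two methods' different moment-matching targets.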

11.7 Conclusion

In this chapter we propose an optimal linear time algorithm to efficiently compute the moments of model parameters in SPNs under online settings. The key techniques used in the design of our algorithm include the linear time reduction from moment computation to joint inference, the differential trick that efficiently evaluates a multilinear function, and dynamic programming to further reduce redundant computations. Using the proposed algorithm as a sub-routine, we are able to improve the time complexity of BMM from quadratic to linear on general SPNs with DAG structures. We also use the proposed algorithm as a sub-routine to design a new online algorithm, ADF. As a future direction, we hope to apply the proposed moment computation algorithm to the design of efficient structure learning algorithms for SPNs. We also expect that the analysis techniques we develop may find other uses in learning SPNs.

Part III

Conclusion


Chapter 12

Conclusion and Future Work

My thesis research contributes to two main themes in artificial intelligence: invariant representation learning and tractable probabilistic reasoning. Moving forward, I will continue working along these two themes towards the long-term goal of building a unified framework that provides a common semantics for learning and reasoning, and also branch out to explore applications related to algorithmic fairness and multilingual natural language understanding.

12.1 Information Analysis of Invariant Representation Learning

Invariant representation learning has abundant applications in domain adaptation, algorithmic fairness, and privacy preservation under attribute-inference attacks. Recently, the idea of learning language-invariant representations has also been actively explored in neural machine translation in order to enable knowledge transfer from high-resource language pairs to low-resource language pairs. Despite these broad applications, many fundamental questions remain open. Our work (Zhao and Gordon, 2019; Zhao et al., 2019b,e) has shown that when exact invariance is attained, there is a provable lower bound on the loss of utility. However, it is not clear what the general form of the tradeoff between utility and invariance is. In particular, under a given budget for approximate invariance, what is the maximum utility we can hope to achieve? This question calls for a characterization of the Pareto frontier between utility and invariance. In my future research, I want to apply tools from information theory to provide a precise answer to the above question, and to use the theory of invariant representation learning to design Pareto-optimal algorithms in the applications mentioned above.

12.2 Efficient Structure Learning of Probabilistic Circuits

SPNs distinguish themselves from other probabilistic graphical models, including both Bayesian Networks and Markov Networks, by the fact that reasoning can be performed exactly in linear time with respect to the size of the network. As with traditional graphical models, there are two main problems in learning SPNs: structure learning and parameter learning. In structure learning the goal is to infer the structure of SPNs directly from the data. As a direction for future work, I am highly interested in developing principled structure learning algorithms that can produce compact SPNs with directed acyclic structures. To date, most structure learning algorithms can only produce SPNs with tree structures, and they rely on various kinds of heuristics to generate SPNs from data, without performance guarantees. In light of these limitations, it is desirable to develop an algorithm that produces SPNs with directed acyclic structures, so as to fully exploit their representational power. I am also excited about extending the domain of SPNs from discrete/continuous data to more structured inputs, e.g., strings and graphs, and applying them to

problems that require the capability of reasoning, including question answering, reading comprehension, and statistical relational learning over graphs.

12.3 Unified Framework for Learning and Reasoning

I believe the holy grail of artificial intelligence is to build intelligent agents that have the ability to learn from experience as well as to reason from what has been learned. In order to achieve this goal, we need a robust and probabilistic framework that unifies learning and reasoning. Such a framework is drastically different from the traditional one, where symbolic representations are used to construct the knowledge base and first-order logic is used to build the inference engine. Instead, as shown in Figure 1.1, I propose to use invariant representations, which map real-world objects to their corresponding algebraic representations, to serve as the foundation of the knowledge base, and to use a tractable probabilistic inference machine, e.g., a Sum-Product Network, to act as the inference engine. Compared with the classic symbolic and logic-based framework, this new framework is inherently probabilistic and hence can handle the ubiquitous uncertainty. In particular, representations that are invariant to changes in the environment can provide robustness against the various noise and nuisance factors in the real world, and the tractability of the exact probabilistic inference machine further allows us to efficiently deal with the uncertainty that arises in real-world logical deduction. Of course, the goal is challenging. First, in order to learn invariant representations, one needs to explicitly specify a set of nuisance factors that the representations should be invariant to. Due to the complexity of the real world, such supervision is not always available or well-defined. Furthermore, when the invariant representations contain internal structure, e.g., the hierarchical structure of sentence representations, it is not clear how to combine such structured data with existing tractable probabilistic inference machines. These problems are both fascinating and challenging, and I believe that solving them would take us a significant step towards the goal of a unified framework for learning and reasoning.

Bibliography

Tameem Adel, Han Zhao, and Alexander Wong. Unsupervised domain adaptation with a relaxed covariate shift assumption. In AAAI, pages 1691–1697, 2017. 1.2.1, 1.5, 3.1, 4.1
Tameem Adel, Isabel Valera, Zoubin Ghahramani, and Adrian Weller. One-network adversarial fairness. In 33rd AAAI Conference on Artificial Intelligence, 2019a. 6.3.2, 6.5, 7.2, 7.5
Tameem Adel, Han Zhao, and Richard E Turner. Continual learning with adaptive weights (CLAW). In International Conference on Learning Representations, 2019b. 1.5
Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60–69, 2018. 7.5
Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1388. 8.4
Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, and Mario Marchand. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446, 2014. 3.1, 3.6, 4.1, 4.2.2, 4.3
Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966. 6.2
Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009. 3.7.3, 8.3.2
Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011. 3.B
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019. 8.1, 8.4
Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization, 2019. URL http://arxiv.org/abs/1907.02893. 5.1
Stefan Arnborg, Derek G Corneil, and Andrzej Proskurowski. Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic Discrete Methods, 8(2):277–284, 1987. 2.4
Mikel Artetxe and Holger Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610, 2019. 8.1, 8.2.2, 8.4
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of ACL 2017, pages 451–462, 2017. 8.4
Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. 2018. 4.3
Philip Bachman, R. Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. CoRR, abs/1906.00910, 2019. 5.1
R Iris Bahar, Erica A Frohm, Charles M Gaona, Gary D Hachtel, Enrico Macii, Abelardo Pardo, and Fabio Somenzi. Algebraic decision diagrams and their applications. Formal Methods in System Design, 10(2-3):171–206, 1997. 2.4
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2014. URL http://arxiv.org/abs/1409.0473. 8.4
Mahsa Baktashmotlagh, Mehrtash T Harandi, Brian C Lovell, and Mathieu Salzmann. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pages 769–776, 2013. 3.6
Solon Barocas and Andrew D Selbst. Big data’s disparate impact. Calif. L. Rev., 104:671, 2016. 6.1
Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness in machine learning. NIPS Tutorial, 2017. 7.1, 7.3.2, 7.3.3
Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002. 2.6.1
Peter L Bartlett. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inf. Theory, 44(2), 1998. 8.3.3
Carlos J Becker, Christos M Christoudias, and Pascal Fua. Non-linear domain adaptation with boosting. In Advances in Neural Information Processing Systems, pages 485–493, 2013. 3.1, 3.6, 4.1
Shai Ben-David, Nadav Eiron, and Philip M Long. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514, 2003. 7.3.4
Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137, 2007. 2.2, 2.2, 3.2, 3.2, 3.4, 3.5.3, 3.6, 4.2, 1, 4.2.2, 4.2.1, 4.3, 5.2
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010. 2.2, 3.1, 3.2, 3.3, 3.6, 3.7.3, 4.1, 4.2, 4.2.2, 4.3, 5.2, 5.2, 5.3.2
Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, page 0049124118782533, 2018. 6.1
Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H Chi. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017. 6.1, 6.3.2, 6.4, 6.5, 7.1, 7.5
David M Blei, Michael I Jordan, and John W Paisley. Variational Bayesian inference with stochastic search. In ICML, 2012. 1.3.2, 10.3

194 John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129–136, 2008. 2.2, 2.2, 2.2.1, 3.1, 3.2, 3.2, 3.2.1, 3.3, 3.6, 4.2 George Boole. The mathematical analysis of logic. Philosophical Library, 1847. 2.4 Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016. 3.6, 4.4.3, 5.3.1 Stephen Boyd, Seung-Jean Kim, Lieven Vandenberghe, and Arash Hassibi. A tutorial on geometric programming. Optimization and Engineering, 8(1):67–127, 2007. 2.5, 2.5, 9.2, 10.3, 10.3 Jop Briët and Peter Harremoës. Properties of classical and quantum jensen-shannon divergence. Phys. Rev. A, 79:052311, May 2009. doi: 10.1103/PhysRevA.79.052311. URL https://link.aps.org/ doi/10.1103/PhysRevA.79.052311. 5.5.8 Toon Calders and Sicco Verwer. Three naive bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010. 7.2 Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pages 13–18. IEEE, 2009. 1.2.2, 2.3, 6.1, 6.2, 6.3, 7.1, 7.2 Hei Chan and Adnan Darwiche. On the robustness of most probable explanations. In In Proceedings of the Twenty Second Conference on Uncertainty in Artificial Intelligence. 9.1 Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. arXiv preprint arXiv:1206.4683, 2012. 3.5, 3.5.1, 3.6, 1, 3.A Xilun Chen and Claire Cardie. Multinomial adversarial networks for multi-domain text classification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, 2018. 1.1 Mung Chiang. Geometric programming for communication systems. Now Publishers Inc, 2005. 2.5, 9.2, 10.3 Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. 8.3 Arthur Choi and Adnan Darwiche. On relaxing determinism in arithmetic circuits. In International Conference on Machine Learning, pages 825–833, 2017. 2.4 Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017. 2.3, 2.3, 2.3.1, 6.2, 6.2, 6.2.1, 7.2, 7.5 Alexandra Chouldechova and Aaron Roth. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810, 2018. 7.1 Trevor Cohn and Mirella Lapata. Machine translation by triangulation: Making effective use of multi- parallel corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 728–735, Prague, Czech Republic, June 2007. Association for Computational Linguis- tics. 8.4 Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoff Gordon. Domain adaptation with conditional distribution matching and generalized label shift. arXiv preprint arXiv:2003.04475, 2020. 8.4

195 Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018a. 8.4 Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium, October-November 2018b. Association for Computational Linguistics. doi: 10.18653/v1/D18-1269. 8.1 Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019. 8.4 Sam Corbett-Davies and Sharad Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023, 2018. 7.1 Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014. 2.2, 3.2, 3.6, 4.2.2 Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International Conference on Algorithmic Learning Theory, pages 38–53. Springer, 2008. 2.2, 3.2, 3.6, 4.2.2 Amanda Coston, Alexandra Chouldechova, and Edward H Kennedy. Counterfactual risk assessments, evaluation, and fairness. arXiv preprint arXiv:1909.00066, 2019. 7.5 Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pages 3730–3739, 2017a. 4.3 Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–1865, 2017b. 4.3 Elliot Creager, David Madras, Joern-Henrik Jacobsen, Marissa Weis, Kevin Swersky, Toniann Pitassi, and Richard Zemel. Flexibly fair representation learning by disentanglement. In International Conference on Machine Learning, pages 1436–1445, 2019. 7.5 Imre Csiszár. Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten. Magyer Tud. Akad. Mat. Kutato Int. Koezl., 8:85–108, 1964. 6.2 Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica, 2:229–318, 1967. 6.2 Adnan Darwiche. A differential approach to inference in Bayesian networks. volume 50, pages 280–305. ACM, 2003. 1.3.3, 11.4 Adrià De Gispert and Jose B Marino. Catalan-english statistical machine translation without parallel corpus: bridging through spanish. Citeseer. 8.4 Aaron Dennis and Dan Ventura. Greedy structure search for sum-product networks. In International Joint Conference on Artificial Intelligence, volume 24, 2015. 9.1 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019.

196 Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. 8.4 Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. 5.4.2, 5.A.1 William Dieterich, Christina Mendoza, and Tim Brennan. Compas risk scales: Demonstrating accuracy equity and predictive parity. Northpoint Inc, 2016. 7.1, 7.4.1 Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and . Decaf: A deep convolutional activation feature for generic visual recognition. In Icml, volume 32, pages 647–655, 2014. 3.6 Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-1166. 8.4 Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics. uci.edu/ml. 7.4.1 Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012. 6.1, 6.5, 7.1, 7.5, 7.7 Harrison Edwards and Amos Storkey. Censoring representations with an adversary. arXiv preprint arXiv:1511.05897, 2015. 2.3, 6.1, 6.2, 6.3.2, 6.5, 7.1, 7.2, 7.4.1, 7.5 Dominik Maria Endres and Johannes E Schindelin. A new metric for probability distributions. IEEE Transactions on Information theory, 2003. 2.6, 4.4.3, 5.5.3 Theodoros Evgeniou and Massimiliano Pontil. Regularized multi–task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004. 3.6 Manaal Faruqui and Chris Dyer. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471, Gothenburg, Sweden, April 2014. Association for Computational Linguistics. doi: 10.3115/v1/E14-1049. 8.4 Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268. ACM, 2015. 7.1 Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California, June 2016a. Association for Computational Linguistics. doi: 10.18653/v1/N16-1101. 8.4 Orhan Firat, Baskaran Sankaran, Yaser Al-onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. Zero- resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, Texas, November 2016b. Association for Computational Linguistics. doi: 10.18653/v1/D16-1026. 8.4 Lisheng Fu, Thien Huu Nguyen, Bonan Min, and Ralph Grishman. Domain adaptation for relation

197 extraction with domain adversarial neural network. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 425–429, 2017. 4.2.2 Chuang Gan, Tianbao Yang, and Boqing Gong. Learning attributes equals multi-source domain generaliza- tion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 87–97, 2016. 3.1 Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016. 3.1, 3.4, 3.5, 3.5.1, 3.5.2, 3.5.2, 3.5.3, 3.6, 2, 3.A, 3.B, 3.C, 4.1, 4.2.2, 4.3, 4.4.3, 4.5, 5.1, 5.2, 5.2, 5.3.3, 6.1, 7.4.1, 8.4 Robert Gens and Pedro Domingos. Discriminative learning of sum-product networks. In Advances in Neural Information Processing Systems, pages 3248–3256, 2012. 1.3.1, 10.4.1 Robert Gens and Pedro Domingos. Learning the structure of sum-product networks. In Proceedings of The 30th International Conference on Machine Learning, pages 873–880, 2013. 9.4.1, 10.4.1 Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A pac-bayesian approach for domain adaptation with specialization to linear classifiers. In ICML (3), pages 738–746, 2013. 3.6 Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision, pages 2551–2559, 2015. 3.1, 3.5.2 Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classi- fication: A deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 513–520, 2011. 3.1, 3.6, 4.1 Boqing Gong, Kristen Grauman, and Fei Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML (1), pages 222–230, 2013. 3.6 Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In International conference on machine learning, pages 2839–2848, 2016. 4.3 Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. 2017. ISBN 9780262035613 0262035618. URL https://www.worldcat.org/title/deep-learning/ oclc/985397543&referer=brief_results. 5.1 Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014. URL http://arxiv.org/ abs/1406.2661. cite arxiv:1406.2661. 5.1, 5.5.1 Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Unsupervised adaptation across domain shifts by generating intermediate data representations. IEEE transactions on pattern analysis and machine intelligence, 36(11):2288–2302, 2014. 3.1 Stephan Gouws, Yoshua Bengio, and Greg Corrado. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of ICML 2015, pages 748–756, 2015. 8.4 Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012. 5.3.5, 5.5.10 Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor OK Li. Universal neural machine translation for

198 extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, 2018. 8.1, 8.4 Asela Gunawardana and William Byrne. Convergence theorems for generalized alternating minimization procedures. The Journal of Machine Learning Research, 6:2049–2073, 2005. 9.5.3 Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc’Aurelio Ranzato. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6098–6111, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1632. 8.1 Thanh-Le Ha, Jan Niehues, and Alexander Waibel. Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798, 2016. 8.3, 8.4 Jihun Hamm. Minimax filter: Learning to preserve privacy from inference attacks. The Journal of Machine Learning Research, 18(1):4704–4734, 2017. 6.1 Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323, 2016. 2, 2.3, 2.3, 6.1, 6.2, 6.2, 6.3, 6.3.1, 6.5, 7.1, 7.2, 7.2, 7.2, 7.5 Philip Hartman et al. On functions representable as a difference of convex functions. Pacific J. Math, 9(3): 707–713, 1959. 9.3 Judy Hoffman, Brian Kulis, Trevor Darrell, and Kate Saenko. Discovering latent domains for multisource domain adaptation. In Computer Vision–ECCV 2012, pages 702–715. Springer, 2012. 3.1, 3.6 Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016. 4.2.2 Judy Hoffman, Mehryar Mohri, and Ningshan Zhang. Multiple-source adaptation for regression problems. arXiv preprint arXiv:1711.05037, 2017a. 3.1 Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017b. 3.1, 4.1, 4.2.2, 5.A.1 Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic Variational Inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013. 10.3 Ehsan Hosseini-Asl, Yingbo Zhou, Caiming Xiong, and Richard Socher. Augmented cyclic adversarial learning for domain adaptation. arXiv preprint arXiv:1807.00374, 2018. 4.2.2 Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2485–2494, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1252. 8.4 Priyank Jaini, Abdullah Rashwan, Han Zhao, Yue Liu, Ershad Banijamali, Zhitang Chen, and Pascal Poupart. Online algorithms for sum-product networks with continuous variables. In Conference on Probabilistic Graphical Models, pages 228–239, 2016. 3, 1.3.3, 1.5 I-Hong Jhuo, Dong Liu, DT Lee, and Shih-Fu Chang. 
Robust visual domain adaptation with low-rank

199 reconstruction. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2168–2175. IEEE, 2012. 3.1 Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. Wasserstein fair classifica- tion. arXiv preprint arXiv:1907.12059, 2019. 7.5 Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In International conference on machine learning, pages 3020–3029, 2016. 8.4 Fredrik D Johansson, Rajesh Ranganath, and David Sontag. Support and invertibility in domain-invariant representations. arXiv preprint arXiv:1903.03448, 2019. 4.3 James E Johndrow, Kristian Lum, et al. An algorithm for removing sensitive information: application to race-independent recidivism prediction. The Annals of Applied Statistics, 13(1):189–220, 2019. 2.3, 6.2, 7.2 Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017. 6.1, 8.1, 8.2.2, 8.4 Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2):183–233, 1999. 1.3.2, 10.1 Faisal Kamiran and Toon Calders. Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication, pages 1–6. IEEE, 2009. 2.3, 6.2, 6.3, 7.2 Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 643–650. IEEE, 2011. 2.3, 6.2, 7.2 Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50. Springer, 2012. 6.5 Mohammadali Khosravifard, Dariush Fooladivanda, and T Aaron Gulliver. Confliction of the convexity and metric properties in f-divergences. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 90(9):1848–1853, 2007. 6.2 Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 180–191. VLDB Endowment, 2004. 2.2, 3.2, 3.6, 4.2, 4.3 Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pages 656–666, 2017. 7.5 Diederik P Kingma and Max Welling. Auto-encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013. 1.3.2, 10.3 Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015. 8.3.1 Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997. 9.A.1 Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination

200 of risk scores. arXiv preprint arXiv:1609.05807, 2016. 2.3, 6.2, 6.5, 7.5 Philipp Koehn, Franz J. Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 127–133, 2003. 8.4 Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001. 3.6 Dan Kondratyuk and Milan Straka. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1279. 8.1 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 1 Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017. 7.5 Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems (NeurIPS), 2019. 8.4 Gert R Lanckriet and Bharath K Sriperumbudur. On the convergence of the concave-convex procedure. pages 1759–1767, 2009. 9.5.3, 9.5.3, 9.5.3 Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a. ISSN 0018-9219. doi: 10.1109/5.726791. 5.A.6 Yann LeCun and Corinna Cortes. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/, 2010. URL http://yann.lecun.com/exdb/mnist/. 5.4.2, 5.A.1 Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998b. 3.5, 2 Jaeho Lee and Maxim Raginsky. Minimax statistical learning with wasserstein distances. In Advances in Neural Information Processing Systems, pages 2692–2701, 2018. 4.3 Peizhao Li, Han Zhao, and Hongfu Liu. Deep fair clustering for visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9070–9079, 2020. 1.5 Chen Liang, Jianbo Ye, Han Zhao, Bart Pursel, and C Lee Giles. Active learning of strict partial orders: A case study on concept prerequisite relations. arXiv preprint arXiv:1801.06481, 2018. 1.5 Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006. 6.3.2, 6.3.2 Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991. 2.6, 2.6.3, 4.4.3, 5.5.3, 6.3.2 Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018. 4.3, 5.1, 5.3, 5.3.4, 5.3.4 Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto, and Ivan Vulic.´ Unsupervised cross-lingual information retrieval using monolingual data only. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1253–1256, 2018. 8.1 Hong Liu, Mingsheng Long, Jianmin Wang, and Michael I. Jordan. Transferable adversarial training: A

201 general approach to adapting deep classifiers. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, ICML, volume 97 of Proceedings of Machine Learning Research, pages 4013–4022. PMLR, 2019. 5.1 Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer joint matching for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1410–1417, 2014. 4.3 Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015. 3.1, 3.6, 4.3, 5.3.3 Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016. 4.3 Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2208–2217. JMLR, 2017. 5.1, 5.3.5, 5.4.1, 5.A.4, 5.A.4 Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. Conditional adversarial domain adaptation. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa- Bianchi, and Roman Garnett, editors, NeurIPS, pages 1647–1657, 2018. 5.1, 5.2, 5.A.1 Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015. 2.3, 3.1, 6.1, 6.2, 6.3.2, 6.5, 7.1, 7.2, 7.5 Kristian Lum and James Johndrow. A statistical framework for fair predictive algorithms. arXiv preprint arXiv:1610.08077, 2016. 6.5 Thang Luong, Hieu Pham, and Christopher D. Manning. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159, 2015. 8.4 David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, pages 3381–3390, 2018. 2.3, 6.1, 6.2, 6.4, 6.5, 7.1, 7.2, 7.4.1, 7.5, 7.5 David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Fairness through causal awareness: Learning causal latent-variable models for biased data. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 349–358. ACM, 2019. 7.5 Yishay Mansour and Mariano Schain. Robust domain adaptation. In ISAIM, 2012. 3.1, 3.6, 4.1 Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009a. 2.2, 3.1, 3.2, 3.6, 4.1, 4.2.2 Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in neural information processing systems, pages 1041–1048, 2009b. 3.3 Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the rényi divergence. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 367–374. AUAI Press, 2009c. 2.2, 3.2, 3.3, 3.6, 4.2.2 R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. Proceedings of the ACL, 2019. 5.1 Jun Mei, Yong Jiang, and Kewei Tu. Maximum a posteriori inference in sum-product networks. arXiv preprint arXiv:1708.04846, 2017. 2.4

202 Aditya Krishna Menon and Robert C Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pages 107–118, 2018. 7.1, 7.5 Tomas Mikolov, Quoc V Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013. 8.4 Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 8.3.4 Andriy Mnih and Karol Gregor. Neural Variational Inference and Learning in Belief Networks. In ICML, 2014. 1.3.2, 10.3 Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012. 3.7.5 Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997. 5.3, 5.3.5 Kevin P Murphy, Yair Weiss, and Michael I Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 467–475. Morgan Kaufmann Publishers Inc., 1999. 2.4 Razieh Nabi and Ilya Shpitser. Fair inference on outcomes. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. 7.5 Arvind Narayanan. Translation tutorial: 21 fairness definitions and their politics. In Proc. Conf. Fairness Accountability Transp., New York, USA, 2018. 2.3, 6.2, 7.2 Radford M Neal. Probabilistic inference using markov chain monte carlo methods. 1993. 2.4 Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011. 3.5, 2 Graham Neubig and Junjie Hu. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1103. 8.4 Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 74–81, 1999. 8.1 Franz Josef Och and Hermann Ney. Statistical multi-source translation. pages 253–258, 2001. 8.4 Executive Office of the President. Big data: A report on algorithmic systems, opportunity, and civil rights. Executive Office of the President, 2016. 6.1 Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010. 3.1 James D Park and Adnan Darwiche. A differential semantics for jointree algorithms. Artificial Intelligence, 156(2):197–216, 2004. 11.4 Robert Peharz. Foundations of sum-product networks for probabilistic modeling. PhD thesis, 2015. 2.4, 9.3.2 Robert Peharz, Sebastian Tschiatschek, Franz Pernkopf, Pedro Domingos, and BEEPRI BioTechMed-Graz.

203 On theoretical properties of sum-product networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 744–752, 2015. 9.3.2, 10.2 Robert Peharz, Robert Gens, Franz Pernkopf, and Pedro Domingos. On the Latent Variable Interpretation in Sum-Product Networks. arXiv preprint arXiv:1601.06180, 2016. 10.3 Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. 2018. 3.1, 4.1, 4.2.2 Xingchao Peng, Zijun Huang, Ximeng Sun, and Kate Saenko. Domain agnostic learning with disentangled representations. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, ICML, volume 97 of Proceedings of Machine Learning Research, pages 5102–5112. PMLR, 2019. 5.6 Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual bert? arXiv preprint arXiv:1906.01502, 2019. 8.2.1 Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. Contextual parameter generation for universal neural machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018. 8.4 Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems, pages 5680–5689, 2017. 2.3, 6.2 Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In Proc. 12th Conf. on Uncertainty in Artificial Intelligence, pages 2551–2558, 2011. 3, 1.3, 1.3.1, 2.4.2, 2.4, 2.4.1 Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016. 1 Rajesh Ranganath, Sean Gerrish, and David Blei. Black Box Variational Inference. In AISTATS, pages 814–822, 2014. 1.3.2, 10.3 Marc’Aurelio Ranzato, Paco Guzmán, Vishrav Chaudhary, Peng-Jen Chen, Jiajun Shen, and Xian Li. Recent advances in low-resource machine translation. 2019. 8.2.2 Abdullah Rashwan, Han Zhao, and Pascal Poupart. Online and Distributed Bayesian Moment Matching for Parameter Learning in Sum-Product Networks. In International Conference on Artificial Intelligence and Statistics, 2016a. 10.4.1 Abdullah Rashwan, Han Zhao, and Pascal Poupart. Online and distributed bayesian moment matching for parameter learning in sum-product networks. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1469–1477, 2016b. 3, 1.3.3, 1.5, 11.2 Philip Resnik. Mining the web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 527–534, College Park, Maryland, USA, June 1999. Association for Computational Linguistics. doi: 10.3115/1034678.1034757. 8.1 Philip Resnik and Noah A. Smith. The web as a parallel corpus. Computational Linguistics, 29(3):349–380, 2003. doi: 10.1162/089120103322711578. 8.1 Neil Robertson and Paul D Seymour. Graph minors. iii. planar tree-width. Journal of Combinatorial Theory, Series B, 36(1):49–64, 1984. 2.1 Amirmohammad Rooshenas and Daniel Lowd. Learning sum-product networks with direct and indirect variable interactions. In Proceedings of The 31st International Conference on Machine Learning, pages 710–718, 2014. 9.4.3, 10.4.1

204 Dan Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1):273–302, 1996. 1.1, 1.3 Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 3.1 Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, ECCV (4), volume 6314 of Lecture Notes in Computer Science, pages 213–226. Springer, 2010. ISBN 978-3-642-15560-4. 5.4.2, 5.A.1 Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. On the convergence of bound optimization algorithms. UAI, 2002. 9.5.3 Holger Schwenk. Filtering and mining parallel data in a joint multilingual space. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 228–234, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2037. 8.1 Holger Schwenk and Matthijs Douze. Learning joint multilingual sentence representations with neural machine translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 157–167, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/ v1/W17-2619. 8.4 Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning- Volume 70, pages 3076–3085. JMLR. org, 2017. 8.4 Jiajun Shen, Peng-Jen Chen, Matt Le, Junxian He, Jiatao Gu, Myle Ott, Michael Auli, and Marc’Aurelio Ranzato. The source-target domain mismatch problem in machine translation. arXiv preprint arXiv:1909.13151, 2019. 8.2.1 Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. 4.3 Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. Learning from simulated and unsupervised images through adversarial training. arXiv preprint arXiv:1612.07828, 2016. 3.1, 4.2.2 Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A dirt-t approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735, 2018. 3.1 David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017. 1 Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. Learning controllable fair representations. In Artificial Intelligence and Statistics, pages 2164–2173, 2019. 6.1, 6.5 David Sontag, Amir Globerson, and Tommi Jaakkola. Introduction to dual composition for inference. In Optimization for Machine Learning. MIT Press, 2011. 2.4 Harold W Sorenson and Allen R Stubberud. Non-linear filtering by approximation of the a posteriori density. International Journal of Control, 8(1):33–51, 1968. 1.3.3, 11.1, 11.6 Amos Storkey. When training and test sets are different: Characterising learning transfer. Dataset shift in machine learning., 2009. 5.1

205 Qian Sun, Rita Chattopadhyay, Sethuraman Panchanathan, and Jieping Ye. A two-stage weighting framework for multi-source domain adaptation. In Advances in neural information processing systems, pages 505–513, 2011. 3.1 Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014. 8.3, 8.4 Yee W Teh, David Newman, and Max Welling. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. In Advances in neural information processing systems, 2006. 1.3.2, 10.1.1, 10.2 Yee W Teh, Kenichi Kurihara, and Max Welling. Collapsed Variational Inference for HDP. In Advances in neural information processing systems, pages 1481–1488, 2007. 1.3.2, 10.1.1, 10.2 Michalis Titsias and Miguel Lázaro-Gredilla. Doubly Stochastic Variational Bayes for Non-conjugate Inference. In Proceedings of the 31st International Conference on Machine Learning, 2014. 1.3.2, 10.3 Michalis K Titsias. Local Expectation Gradients for Doubly Stochastic Variational Inference. arXiv preprint arXiv:1503.01494, 2015. 1.3.2, 10.3 Yuta Tsuboi, Hisashi Kashima, Shohei Hido, Steffen Bickel, and Masashi Sugiyama. Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138–155, 2009. 3.6 Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015. 3.1 Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. arXiv preprint arXiv:1702.05464, 2017. 3.1, 3.5.2, 4.2.2, 4.4.3, 4.4.3, 5.1, 5.2, 5.3.1 Masao Utiyama and Hitoshi Isahara. A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 484–491, Rochester, New York, April 2007. Association for Computational Linguistics. 8.4 Leslie Valiant. What needs to be added to machine learning? In Proceedings of ACM Turing Celebration Conference-China, pages 6–6. ACM, 2018. 1 Guy Van den Broeck. Probabilistic and logistic circuits: A new synthesis of logic and machine learning. 2018. 1.3 Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In (IEEE) Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5.4.2, 5.A.1 Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008. 3.6, 8.3.4 Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010. 3.6 Visda. Visual domain adaptation challenge, 2017. URL http://ai.bu.edu/visda-2017/. 5.2, 5.4.2, 5.A.1

Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008. 1.3.2

Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. Sentence embedding for neural machine translation domain adaptation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 560–566, 2017. 8.3.1
Wen Wang, Han Zhao, Honglei Zhuang, Nirav Shah, and Rema Padman. DyCRS: Dynamic interpretable postoperative complication risk scoring. In Proceedings of The Web Conference 2020, pages 1839–1850, 2020. 1.5
Yixin Wang, Dhanya Sridhar, and David M Blei. Equal opportunity and affirmative action via counterfactual predictions. arXiv preprint arXiv:1905.10870, 2019. 7.5
Frank Wilcoxon. Some rapid approximate statistical procedures. Annals of the New York Academy of Sciences, pages 808–814, 1950. 10.4.2
CF Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95–103, 1983. 9.5.3
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. 8.2.2, 8.4
Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012. 3.1, 3.6
Yichong Xu, Han Zhao, Xiaofei Shi, and Nihar B Shah. On strategyproof conference peer review. arXiv preprint arXiv:1806.06266, 2018. 1.5
Yadollah Yaghoobzadeh, Remi Tachet des Combes, Timothy J. Hazen, and Alessandro Sordoni. Robust natural language inference models with example forgetting. CoRR, abs/1911.03861, 2019. 5.1
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014. 3.6, 5.1
Alan L Yuille and Anand Rangarajan. The concave-convex procedure (CCCP). Advances in Neural Information Processing Systems, 2:1033–1040, 2002. 9.3.2
Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259, 2015. 6.1, 6.5, 7.1, 7.5
Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180. International World Wide Web Conferences Steering Committee, 2017. 2.3, 6.1, 6.2, 7.1, 7.2, 7.5
Willard I Zangwill. Nonlinear programming: A unified approach, volume 196. Prentice-Hall, Englewood Cliffs, NJ, 1969. 9.5.3, 9.5.3, 9.5.1
Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325–333, 2013. 2.3, 6.1, 6.2, 6.5, 7.1, 7.2, 7.5, 8.4
Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340. ACM, 2018. 6.1, 6.4, 6.5, 7.5, 7.5, 8.4
Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827, 2013. 4.3
Kun Zhang, Mingming Gong, and Bernhard Schölkopf. Multi-source domain adaptation: A causal view. In AAAI, pages 3150–3157, 2015a. 3.1, 3.5.2
Shanghang Zhang, Guanhang Wu, Joao P Costeira, and Jose MF Moura. Understanding traffic density from large-scale web camera data. arXiv preprint arXiv:1703.05868, 2017a. 3.5, 3.5.3, 3, 3.C
Xu Zhang, Felix X. Yu, Shih-Fu Chang, and Shengjin Wang. Deep transfer network: Unsupervised domain adaptation. CoRR, abs/1503.00591, 2015b. 5.1
Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. Aspect-augmented adversarial networks for domain adaptation. arXiv preprint arXiv:1701.00188, 2017b. 4.1, 4.2.2
Han Zhao and Geoff Gordon. Frank-Wolfe optimization for symmetric-NMF under simplicial constraint. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, 2018. 1.5
Han Zhao and Geoffrey J Gordon. Linear time computation of moments in sum-product networks. In Advances in Neural Information Processing Systems, pages 6894–6903, 2017. 3, 1.3.3
Han Zhao and Geoffrey J Gordon. Inherent tradeoffs in learning fair representations. In Advances in Neural Information Processing Systems, 2019. 1.2.2, 7.1, 7.2.1, 7.3.1, 7.5, 8.4, 12.1
Han Zhao, Zhengdong Lu, and Pascal Poupart. Self-adaptive hierarchical sentence model. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015a. 8.3.1
Han Zhao, Mazen Melibari, and Pascal Poupart. On the relationship between sum-product networks and Bayesian networks. In International Conference on Machine Learning, pages 116–124, 2015b. 3, 2.4, 9.1, 9.1, 9.3.2, 10.2
Han Zhao, Tameem Adel, Geoff Gordon, and Brandon Amos. Collapsed variational inference for sum-product networks. In International Conference on Machine Learning, pages 1310–1318, 2016a. 1.3.1, 11.4
Han Zhao, Pascal Poupart, and Geoffrey J Gordon. A unified approach for learning the parameters of sum-product networks. In Advances in Neural Information Processing Systems, pages 433–441, 2016b. 3, 1.3.3, 10.3, 11.4, 11.5
Han Zhao, Otilia Stretcu, Renato Negrinho, Alex Smola, and Geoff Gordon. Efficient multi-task feature and relationship learning. arXiv preprint arXiv:1702.04423, 2017. 1.5
Han Zhao, Shuayb Zarar, Ivan Tashev, and Chin-Hui Lee. Convolutional-recurrent neural networks for speech enhancement. In Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018a. 1.5
Han Zhao, Shanghang Zhang, Guanhang Wu, Geoffrey J Gordon, et al. Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, 2018b. 1.1, 1.2.1, 4.2.2, 4.4.3, 5.1, 8.4
Han Zhao, Shanghang Zhang, Guanhang Wu, Geoffrey J Gordon, et al. Multiple source domain adaptation with adversarial learning. 2018c. 4.2.2, 5.2, 5.6
Han Zhao, Jianfeng Chi, Yuan Tian, and Geoffrey J Gordon. Adversarial privacy preservation under attribute inference attack. arXiv preprint arXiv:1906.07902, 2019a. 1.5, 6.1, 7.7
Han Zhao, Jianfeng Chi, Yuan Tian, and Geoffrey J. Gordon. Adversarial privacy preservation under attribute inference attack. arXiv preprint arXiv:1906.07902, 2019b. 1, 2, 12.1
Han Zhao, Amanda Coston, Tameem Adel, and Geoffrey J Gordon. Conditional learning of fair representations. arXiv preprint arXiv:1910.07162, 2019c. 1.2.2, 8.4
Han Zhao, Amanda Coston, Tameem Adel, and Geoffrey J. Gordon. Conditional learning of fair representations. arXiv preprint arXiv:1910.07162, 2019d. 6.5
Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532, 2019e. 1.2.1, 1.5, 6.1, 8.4, 12.1
Han Zhao, Junjie Hu, Zhenyao Zhu, Adam Coates, and Geoff Gordon. Deep generative and discriminative domain adaptation. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 2315–2317. International Foundation for Autonomous Agents and Multiagent Systems, 2019f. 1.2.1, 1.5
Han Zhao, Junjie Hu, Zhenyao Zhu, Adam Coates, and Geoffrey J. Gordon. Deep generative and discriminative domain adaptation. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019g. 4.2.2
Han Zhao, Remi Tachet des Combes, Kun Zhang, and Geoffrey J. Gordon. On learning invariant representations for domain adaptation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, ICML, volume 97 of Proceedings of Machine Learning Research, pages 7523–7532. PMLR, 2019h. 5.1, 5.2, 5.2, 5.2, 5.3.2, 5.5.3
Sicheng Zhao, Bo Li, Xiangyu Yue, Yang Gu, Pengfei Xu, Runbo Hu, Hua Chai, and Kurt Keutzer. Multi-source domain adaptation for semantic segmentation. In Advances in Neural Information Processing Systems, pages 7285–7298, 2019i. 1.1
Indre Zliobaite. On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723, 2015. 6.1, 6.3
Barret Zoph and Kevin Knight. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 30–34, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1004. 8.1, 8.4
