TOPICS IN CAUSAL AND HIGH DIMENSIONAL INFERENCE
A DISSERTATION SUBMITTED TO THE DEPARTMENT OF STATISTICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
Qingyuan Zhao July 2016
© 2016 by Qingyuan Zhao. All Rights Reserved. Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/cp243xv4878
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Trevor Hastie, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Art Owen
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Robert Tibshirani
Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Causality is a central concept in science and philosophy. With the ever increasing amount and complexity of data being collected, statistics is playing a more and more significant role in the inference of causes and effects. This thesis consists of three parts. Part I reviews the mathematical languages necessary to talk about causality. Three major approaches (by potential outcomes in Chapter 2, by graphs in Chapter 3, and by functions in Chapter 4) are described. They are in many senses equivalent or complementary to each other, and excel at different causal tasks. Part II considers the statistical inference of a single causal effect in the potential outcome framework. Chapter 5 reviews state-of-the-art matching and weighting methods. Chapter 6 proposes a new loss function tailored for propensity score estimation, which can boost the performance of the weighting methods. Chapter 7 reviews outcome regression and doubly robust inference and provides some insight into selecting propensity score models and constructing confidence intervals. Part III considers the statistical inference of multiple confounded effects. Chapter 8 introduces a confounding problem in linear models with latent variables. Two examples are given, one in genetics and one in finance. Chapter 9 proposes a two-step procedure to adjust for the hidden confounding variables in high dimensional data. Chapter 10 presents the performance of this method in simulations and in the two real data examples.
Acknowledgments
When writing this page of gratitude, I have physically left Stanford for a week, but I feel my heart is still with the Farm and all the people there. I would like to take this opportunity to express my deepest gratitude to the people who have offered help or brought joy to me in the last five years. First and foremost, the thanks go to my amazing advisor, Trevor Hastie, who guided me through the transition from a good student to an independent researcher. Trevor has been very generous with his time. Throughout the many long discussions we had together, I have benefited not only from his deep insights in statistics, but to a greater extent from his attitude towards work and the way he collaborates with researchers in other fields. I am sure they will have a lasting influence on my future career as an applied statistician. Trevor has been supportive all through my graduate studies and has given me incredible freedom to pursue my own research ideas. I would like to thank both Lynda and Trevor for the enchanting dinners in their home, which made those Thanksgivings and the graduation especially cherishable. Next I would like to thank Art Owen, Robert Tibshirani, Mike Baiocchi, and Jonathan Taylor for serving on my thesis proposal and defense committees. I am also very grateful to all the wonderful faculty and friendly staff at the Stanford Statistics Department. I thank all my teachers for the invaluable classes I have taken at Stanford (especially the first-year core courses). I have been very fortunate to be able to collaborate with many Stanford and non-Stanford people. I am very grateful to Stanford for encouraging multi-disciplinary research, which is extremely important to statisticians. I thank Art Owen for our several collaborations and for his mentorship throughout my Ph.D. studies. I thank
Jingshu Wang, Murat Erdogdu, Hera He, Charles Zheng, and Kinjal Basu for the inspiring discussions we had. I thank Jure Leskovec, Anand Rajaraman, Andres Stapper, and Yang Song for bringing in exciting problems and trusting my knowledge in statistics. I am indebted to Shaji Sebastian from eBay and Daniel Percival, Jean Steiner, and Sangho Yoon from Google for being good mentors and friends during my summer internships. The last five years could not have been so delightful without the company of my friends and classmates. Special shout-out to the extended 206 family—Bhaswar Bhattacharya, Murat Erdogdu, Joshua Loftus and Pooja Loftus, with whom I have shared so many terrific memories. I am sure our friendship will last well beyond graduation. I am grateful to all the students who ever dropped by Sequoia 206 or stopped me in the hallway for some academic or non-academic chats, including Yi Liu, Shilin Zhu, Gourab Mukherjee, Sumit Mukherjee, Alex Chouldechova, Anirban Basak, Jian Li, Dennis Sun, Zhen Zhu, Yunjin Choi, Weijie Su, Stefan Wager, Edgar Dobriban, Ya Le, Charles Zheng, Subhabrata Sen, Rakesh Achanta, Feng Ruan, and Pragya Sur. Special thanks to the fourteen USTCers, Zhouxiao Bao, Fang Liu, Xiaoge Liu, Yuling Liu, Shaoyu Lu, Xiaoxiong Lu, Weiluo Ren, Milan Shen, Haotian Wang, Danjie Wenren, Bing Wu, Fan Yang, Hongyuan Yuan, and Zhao Zhao, for starting our journey at Stanford together and staying good friends. I would like to thank the new friends I have made at Stanford, especially Chen Ran, Cong Meng, Yu He, and Chunjing Jia, for sharing their enthusiasm in research and life. Finally, I thank my parents for all the support, sacrifice, and everything they have ever done and continue to do for me. Because of them, I have never felt alone or discouraged facing the challenges in life. Thank you to Hera He for her love and trust. This thesis would never have been possible without her being my emotional pillar.
Contents
Abstract iv
Acknowledgments v
I MATHEMATICAL LANGUAGES OF CAUSALITY 1
1 Introduction 2
2 The Potential Outcome Approach 6 2.1 Neyman’s model ...... 6 2.2 Rubin’s model ...... 7 2.3 Criticisms ...... 10
3 The Graphical Approach 12 3.1 Conditional dependence and Bayesian network ...... 12 3.2 Causal Bayesian network ...... 14 3.3 Graphical identification of causal effect ...... 15
4 The Functional Approach 18 4.1 Laplacian determinism ...... 18 4.2 Functional model and counterfactuals ...... 20 4.3 Invariant causal prediction ...... 22
II INFERRING A SINGLE EFFECT 25
5 Matching and Weighting 26 5.1 Raw matching ...... 26 5.2 Propensity score matching ...... 28 5.3 Empirical calibration weighting ...... 29 5.4 Propensity score weighting ...... 32
6 Tailoring the Propensity Score Model 34 6.1 Motivation ...... 34 6.2 Background on statistical decision theory ...... 37 6.3 Covariate balancing scoring rules ...... 41 6.4 Adaptive strategies ...... 47 6.5 Theoretical aspects ...... 56 6.6 Discussions ...... 65 6.7 Numerical examples ...... 67 6.8 Theoretical proofs ...... 74
7 Outcome Regression and Doubly Robust Inference 80 7.1 Outcome regression ...... 80 7.2 Doubly robust estimation ...... 81 7.3 Robust inference of ATT ...... 84
III INFERRING MULTIPLE EFFECTS 87
8 A Common Confounding Problem 88 8.1 Linear model with latent variables ...... 88 8.2 Example 1: Batch effects in microarray experiments ...... 90 8.3 Example 2: Unknown risk factors in stock market ...... 95 8.4 Connection to linear structural equation model ...... 96
9 Confounder Adjustment 98 9.1 Rotation ...... 98 9.2 Identifiability ...... 99 9.3 Estimation ...... 102 9.4 Hypotheses testing ...... 108 9.5 Extension to multiple regression ...... 110 9.6 Discussions ...... 114
10 Numerical Examples 118 10.1 Simulations ...... 118 10.2 Real data example: Batch effects in microarray experiments ..... 124 10.3 Real data example: Mutual fund selection ...... 128
List of Tables
5.1 Matching and weighting methods ...... 32
6.1 Estimands and scoring rules in the Beta family...... 43 6.2 Performance of different loss functions combined with adaptive strategies. 71
8.1 Selected literature in multiple hypothesis testing under dependence. . 92
10.1 Performance of confounder adjustment on three microarray datasets. 125 10.2 Rankings of mutual funds sorted on 5-year past CAPM alpha and smart alpha ...... 130 10.3 Performance of portfolios of mutual funds formed on lagged 5-year CAPM alpha and lagged 5-year smart alpha...... 132 10.4 Post-formation returns on mutual funds sorted on 5-year past CAPM alpha and 5-year past smart alpha...... 135
List of Figures
6.1 An illustration of forward stepwise regression with covariate balancing scoring rule...... 37 6.2 Graphical illustration of the Beta-family of scoring rules...... 46 6.3 Kolmogorov-Smirnov statistics along the BalanceBoost path...... 54 6.4 Performance of different scoring rules under four simulation settings. . 69 6.5 Performance of covariate balancing scoring rule on the LaLonde dataset (with forward stepwise regression)...... 75 6.6 Performance of covariate balancing scoring rule on the LaLonde dataset (with reproducing kernel Hilbert space regression)...... 76
8.1 Histogram of test statistics before confounder adjustment...... 94
10.1 Performance of confounder adjustment in unconfounded simulations. 121 10.2 Performance of confounder adjustment in weakly confounded simulations...... 122 10.3 Performance of confounder adjustment in strongly confounded simulations...... 123 10.4 Histograms of test statistics after confounder adjustment...... 126 10.5 Summary of the sample...... 129 10.6 Post-formation returns on portfolios of mutual funds sorted on lagged CAPM alpha and smart alpha ...... 134
Part I
MATHEMATICAL LANGUAGES OF CAUSALITY
Chapter 1
Introduction
Causality is one of the oldest topics in philosophy and remains a staple in contemporary science. Causality or causation connects one process (the cause) with another (the effect) and establishes the relationship that the first is (partly) responsible for the second happening. This concept is so fundamental that it is used constantly and often unconsciously. “We think we have knowledge of a thing only when we have grasped its cause”, said Aristotle in the Posterior Analytics. Philosophers (Aristotle, Hume, Mill, etc.) laid the foundation for causality. But to use causality in our scientific studies on a daily basis, a formal mathematical language is required. Today, probability theory (formalized in the 1930s by Kolmogorov) has become the dominant language in most disciplines that use causal modeling, including economics, epidemiology, and the social sciences (Pearl, 2009a, Imbens and Rubin, 2015). Being the profession that applies probability theory to analyze data, statisticians have made some of the most important contributions to causality in the twentieth century. Fisher's monumental 1935 book The Design of Experiments recognized the indispensable place of randomization in experiments. Rubin and his coauthors' remarkable work in the 1970s and 1980s built a framework that allows us to study causality from observational studies. Unlike randomized experiments, the second movement of causal inference using observational studies encountered substantial resistance inside the statistics community. Historically, although Fisher's argument against the causal relationship of smoking
and lung cancer in the 1950s is now regarded as erroneous, his concern about uncontrolled observational studies should certainly not be dismissed. This dissertation takes a pragmatic empiricist position: since so many observational studies have already proved their ability to find causation (Rosenbaum, 2002), I believe these studies and the associated statistical methods have tremendous value. This view is extremely helpful as the ever-exploding data availability (mostly observational) is pushing statistics into an era of “big data”. However, a great deal of caution is vital in analyzing these observational datasets. Researchers must fully understand the assumptions and limitations of the statistical methods, otherwise enormous mistakes can be made. Due to its vicinity to contemporary philosophy and science, causality has received some of the fiercest discussion in statistical journals. There are mainly two lasting debates:
1. Can we use observational studies to learn causal relationships?
2. What mathematical language (or model) should we use to study causality?
The two questions are of course related, but the discussants are often quite different. Next, we shall turn our attention to the article “Causal Inference Without Counterfactuals” by Dawid (2000) and the discussion therein by Cox, Casella and Schwartz, Pearl, Robins and Greenland, Rubin, Shafer, and Wasserman. The long list of discussants includes most of the important figures who have shaped the current landscape of causal inference. It is interesting that they (all the discussants excluding Shafer and perhaps Cox) seemed to reach a consensus facing Dawid's argument against the usage of counterfactuals in causal inference (a negative answer to question 1), but clashed heavily on question 2 in other debates, most noticeably between Pearl (2009b) and Rubin (2009). A. P. Dawid is a renowned theoretical statistician who accomplished “path-breaking work on predictive probability”, quoting Glenn Shafer's comment. Judging by his long publication list, it seems that Dawid did not work directly on causal inference prior to 2000. That makes his 2000 paper very intriguing and special—a respected decision theorist took a stand against the use of counterfactuals, which are at the very core of causal inference. The article and all the comments really took the discussion
of causality in statistics to a very advanced level, making it a perfect start for this dissertation.
Three levels of questions
Holland (1986), Dawid (2000), and Pearl (2009a) all suggest we should distinguish between the following three types of questions:
Associational: How many people take aspirin when they have a headache?
Interventional (effects of causes): “I have a headache. Will it help if I take aspirin?”
Counterfactual (causes of effects): “My headache has gone. Is it because I took aspirin?”
Notice that the first question is non-causal and does not appear in Holland (1986) or Dawid (2000). I add it to the list in order to distinguish associational inference from causal inference. Classical statistics champions the first task. Take regression for example: we observe predictors X and responses Y and wish to infer the conditional distribution of Y given X. Commonly, we can embed the conditional distribution in a parametric family P(Y | X) = P_θ(Y | X) and use the likelihood principle to infer the parameter θ. A typical example is (Gaussian) linear regression:
\[ P_{\beta,\sigma}(Y \mid X) \propto \exp\left\{ -\frac{(y - \beta^T x)^2}{2\sigma^2} \right\}. \tag{1.1} \]
This type of inference is most useful for making predictions, where the probability distribution is assumed unchanged in the future. New statistical methods, such as the statistical learning algorithms described in Hastie et al. (2009) and the Bayesian methods used by Nate Silver to correctly predict the 2012 U.S. election, have pushed associational inference to a new level of sophistication.
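The associational task in (1.1) can be made concrete with a few lines of code. The following sketch is mine, not the thesis author's, and all numbers in it are invented for illustration; it fits the Gaussian linear model by maximum likelihood, which for β reduces to ordinary least squares:

```python
import numpy as np

# A minimal sketch of associational inference with model (1.1): simulate data
# from a known (beta, sigma), then recover them by maximum likelihood.
# For fixed sigma, maximizing the Gaussian likelihood over beta is OLS.
rng = np.random.default_rng(0)
n = 5000
beta_true = np.array([2.0, -1.0])       # illustrative "true" parameters
X = rng.normal(size=(n, 2))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # MLE of beta = OLS
sigma_hat = np.std(y - X @ beta_hat)               # MLE of sigma (up to n vs n-p)
print(beta_hat, sigma_hat)
```

The fitted model predicts Y from X well, but, as the next paragraph warns, says nothing by itself about what happens when X is intervened on.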
But association does not imply causation, as warned by every instructor in their first class on regression. One of the most famous examples is a graph correlating global warming with the decline in the number of pirates.1 Of course neither of them is the cause of the other. The real question is: can we use statistics to answer the causal questions (the interventional and counterfactual questions)? Statisticians have mixed opinions about this, which is why Dawid (2000) decided to adopt the popular counterfactual language of causality to see how far he could get. Consider a large homogeneous population U, to each member of which we can choose to apply any one treatment out of T = {0, 1} and observe the resulting response Y. The counterfactual approach focuses on the collection of potential outcomes {Y_u(i) : i ∈ T, u ∈ U}, where Y_u(i) denotes “the response that would be observed if treatment i were assigned to unit u”. Note that, for any unit u, one can observe Y_u(i) for at most one treatment i. This is referred to as “the fundamental problem of causal inference” by Holland (1986). One of the main concerns of Dawid (2000) is: should we, or is it necessary to, impose assumptions on the joint distribution of the potential outcomes Y(0) and Y(1), even if they are never jointly observed? The reader can bear this question in mind when reading the rest of this Part. Personally, I believe, after reading Richardson and Robins (2013), that cross-world independencies are unnecessary to answer interventional queries and some counterfactual queries (as long as all the counterfactuals are in the same world). See Sections 3.3 and 4.3 for some discussion. The rest of this Part describes three major mathematical languages designed to answer the “effects of causes” and “causes of effects” questions. These mathematical models are the basis of my doctoral research in Parts II and III.
Personally, I agree with Lauritzen (2004)'s view that “different formalisms of causality are different ‘languages’ ”, and “I have no difficulty accepting that potential responses, structural equations, and graphical models coexist as languages expressing causal concepts each with their virtues and vices.” So I will try to maintain an objective attitude when introducing these languages.
1To see the graph, visit this webpage http://www.venganza.org/about/open-letter/.

Chapter 2
The Potential Outcome Approach
Among the causal languages described in Part I of this dissertation, the potential outcome language is perhaps the most widely adopted approach among applied researchers. It is generally attributed to Donald Rubin's work in the 1970s, though Rubin (1990) himself thinks Neyman (1923) first used this language in randomized experiments. For this reason, the approach is commonly called the Rubin causal model or the Neyman-Rubin causal model. This Chapter first describes the mathematical model in Neyman (1923) and then discusses how it is related to the potential outcome language we use nowadays.
2.1 Neyman’s model
In a section of his doctoral thesis, Neyman (1923) studies agricultural field experiments with m plots and v varieties of crops. He begins with the notation that u_{ik} is the potential yield of the ith variety on the kth plot and then considers an urn model (sampling with replacement) to apply the varieties. This is equivalent to a randomized experiment with m/v plots exposed to each variety. Letting X_i be the sample mean of the plots actually exposed to the i-th variety, Neyman shows that
\[ \mathrm{E}[X_i - X_j \mid u] = \frac{1}{m}\sum_{k=1}^{m} u_{ik} - \frac{1}{m}\sum_{k=1}^{m} u_{jk}, \]
and he also computes the variance of X_i − X_j given u. Notice that the entire randomness of this hypothetical procedure comes from the random assignment (Neyman's urn model). The potential outcomes u_{ik} are assumed fixed. Interestingly, Neyman himself only considered this as a theoretical treatment, and “the randomization was considered as a prerequisite to probabilistic treatment of the results” (Reid, 1982). He attributes the physically randomized experiments to Fisher and his followers. Neyman's thesis was originally written in Polish and translated into English in 1990 by Dabrowska and Speed. Rubin (1990) provided some historic comments on the development of the potential outcome approach. According to Rubin (1990), “Neyman's notation seems to have formalized ideas that were relatively firmly in the minds of some experimenters, philosophers and scientists [including Fisher] prior to 1923”. After Neyman (1923) and before Rubin's introduction of potential outcomes to observational studies, the standard approach used one variable to represent the observed outcome and an indicator to represent treatment assignment. This notation without counterfactuals limits the ability to apply statistical methods to observational studies.
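Neyman's randomization-only view can be checked by simulation. The sketch below is illustrative and not from the thesis (the plot counts and yields are made up): the potential yields u_{ik} are held fixed, and only the assignment of m/v plots to each variety is random, yet the difference in sample means is unbiased for the difference in mean potential yields.

```python
import numpy as np

# Sketch of Neyman's model: fixed potential yields u[i, k]; the only
# randomness is which m/v plots each of the v varieties is assigned to
# (here via a complete randomization rather than Neyman's urn scheme).
rng = np.random.default_rng(1)
m, v = 12, 2                                # plots and varieties (made up)
u = rng.uniform(1.0, 10.0, size=(v, m))     # fixed potential yields u_{ik}

diffs = []
for _ in range(20000):
    perm = rng.permutation(m)
    plots = perm.reshape(v, m // v)         # m/v plots per variety
    diffs.append(u[0, plots[0]].mean() - u[1, plots[1]].mean())

truth = u[0].mean() - u[1].mean()
print(np.mean(diffs), truth)                # close on average: unbiasedness
```

Averaging over assignments reproduces the expectation formula above, with no probability model placed on the yields themselves.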
2.2 Rubin’s model
We now turn to Rubin's causal model of observational studies. Motivated by educational research, Rubin (1974) argued that “the use of carefully controlled nonrandomized data to estimate causal effects is a reasonable and necessary procedure in many cases”. Quite naturally, and without referring to Neyman's work half a century earlier, Rubin (1974) used potential outcomes to define the causal effect of an educational treatment. This language clearly links observational studies to the more general “missing data” problem (Rubin, 1976, 1977). In Rubin's view, the most important quantity in observational studies is the treatment assignment mechanism. Let T_i ∈ {0, 1} indicate the assignment for unit i, where 1 implies the active treatment and 0 implies the control. Let X_i be some pretreatment covariates of unit i and Y_i(0) and Y_i(1) be the potential outcomes. Finally let T, X, Y(0), Y(1) be the vectors (or matrices) of the individual-level
values. The assignment mechanism can be written as
\[ P(T \mid X, Y(0), Y(1)). \]
Rubin (2011) regards T as the only random variable and considers the other values (called the “science” by Rubin) fixed, though he also allows a Bayesian framework that models the science. Rubin (1978) pointed out that all randomized experiments share a critical property called “ignorable”,
\[ P(T \mid X, Y(0), Y(1)) = P(T \mid X, Y). \tag{2.1} \]
Under this assumption, it is justifiable to “ignore” the missing values (unobserved potential outcomes). Here Y = (Y_1, ..., Y_n) and Y_i = T_i Y_i(1) + (1 − T_i) Y_i(0). A stronger and theoretically more convenient assumption is “strongly ignorable” (Rosenbaum and Rubin, 1983) or “unconfounded”,
\[ P(T \mid X, Y(0), Y(1)) = P(T \mid X). \tag{2.2} \]
Sequential experiments can be described with (2.1), whereas classical randomized experiments are described with (2.2) (Rubin, 2011). Another important property of randomized experiments is that the assignment probability is strictly between 0 and 1,
\[ 0 < P(T \mid X, Y) < 1, \quad \text{or} \quad 0 < P(T \mid X) < 1. \tag{2.3} \]
Intuitively this says every unit has a positive chance of being assigned to treatment or control. In the context of observational studies, (2.3) is also called the overlap assumption. One of the fundamental benefits of adopting a counterfactual language is that it allows an easy definition of causal effect. This is the interventional question described in Chapter 1 (effects of causes). The individual causal effect can be represented as Y_i(1) − Y_i(0). Holland (1986) would say the treatment causes the effect Y_i(1) − Y_i(0). However, this individual treatment effect is never observable. In many cases (actually
most traditional observational studies), the quantity of interest (estimand) is the average treatment effect (ATE),
\[ \tau_{\text{ATE}} = \frac{1}{n} \sum_{i=1}^{n} \bigl( Y_i(1) - Y_i(0) \bigr), \tag{2.4} \]

or its variants, such as the average treatment effect on the treated (ATT),
\[ \tau_{\text{ATT}} = \mathrm{E}\left[ \frac{1}{n_1} \sum_{i: T_i = 1} \bigl( Y_i(1) - Y_i(0) \bigr) \right], \quad n_1 = \#\{1 \le i \le n : T_i = 1\}. \tag{2.5} \]

The expectation in (2.5) is taken over the treatment assignment, since that is the only random quantity in Rubin's potential outcome framework. In the definitions above, the selection of units occurs before the treatment assignment, so the estimands in (2.4) and (2.5) are defined with respect to the fixed n units. It is often more practical to view the n units as random samples from a large population. Theoretically, it is more convenient to assume the population is infinite, so the units are i.i.d. draws. Taking this perspective, we can define the estimands as
\[ \tau_{\text{ATE}} = \mathrm{E}[Y(1) - Y(0)], \qquad \tau_{\text{ATT}} = \mathrm{E}[Y(1) - Y(0) \mid T = 1]. \tag{2.6} \]
In (2.6), the expectations are taken over the joint distribution of (T, Y(1), Y(0)), or in other words over a random draw from the infinite population and a random treatment assignment. By adopting this view, we also implicitly assume the stable unit treatment value assumption (SUTVA) of Rubin (1980), which roughly says there is no interference between units. Of course, we do not know if ignorability (2.1) or strong ignorability (2.2) is true for an observational study. But what can be said if we are willing to assume those conditions? This is discussed in Part II of the dissertation.
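The superpopulation estimands in (2.6) can be illustrated with a simple simulation. This sketch is mine, with all distributions invented: units are i.i.d., assignment is completely randomized so that strong ignorability (2.2) and overlap (2.3) hold by construction, and the difference in observed group means then estimates the ATE.

```python
import numpy as np

# Sketch of the superpopulation view of (2.6): i.i.d. units with potential
# outcomes (Y(0), Y(1)) and a randomized (hence strongly ignorable)
# assignment. With a constant effect of 1, tau_ATE = tau_ATT = 1.
rng = np.random.default_rng(2)
n = 200_000
x = rng.normal(size=n)                    # pretreatment covariate
y0 = x + rng.normal(size=n)               # potential outcome under control
y1 = x + 1.0 + rng.normal(size=n)         # potential outcome under treatment
t = rng.binomial(1, 0.5, size=n)          # randomized assignment: (2.2) holds
y = np.where(t == 1, y1, y0)              # observed outcome (SUTVA)

tau_hat = y[t == 1].mean() - y[t == 0].mean()
print(tau_hat)                            # close to the true ATE of 1
```

If the assignment instead depended on (Y(0), Y(1)), the same difference in means would be biased, which is exactly why (2.1)–(2.2) matter for observational data.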
2.3 Criticisms
Dawid (2000) made several criticisms against the usage of potential outcomes in causal inference.
2.3.1 Metaphysical and untestable assumptions
Dawid (2000) views the existence of the potential outcomes (u in Neyman's model and (Y(0), Y(1)) in Rubin's model) as metaphysical and unjustified. To Dawid, the
first concerning fact is that the inference (more specifically, the variance) of τ_ATE indeed depends on the model we use. This disobeys the principle that “mathematically distinct models that cannot be distinguished on the basis of empirical observation should lead to indistinguishable inferences”, which Dawid (2000) calls Jeffreys's law. Then Dawid (2000) turns to the constant treatment effect assumption that is commonly imposed in causal inference. To Dawid, this is an untestable and dangerous assumption. Dawid (2000) summarizes his criticisms by classifying causal analyses into sheep (those that obey Jeffreys's law) and goats (the rest). In his opinion, inference based on randomized experiments and decision theory is a sheep, and the potential outcome approach has the potential to generate goats. But unlike the discussion by Shafer (2000), Dawid (2000) holds a more neutral attitude towards the usage of potential outcomes. He admits that “specific inferential uses of counterfactual models may turn out to be sheep”.
2.3.2 Fatalism
An even harsher criticism of Dawid (2000) is that many counterfactual analyses are based on an attitude he terms fatalism. This considers the various potential responses
Y_i(t) as predetermined attributes of unit i, waiting only to be uncovered by suitable experimentation. One example of fatalism that Dawid (2000) gives is the counterfactual analysis of treatment non-compliance in Imbens and Rubin (1997), where each patient is supposed to be categorizable as a complier, a defier, an always taker, or a never taker. A complier is a person who would take the treatment if prescribed, and not take it if not prescribed. The other categories can be defined in a similar manner. Dawid (2000, Section 7.1) argues that “it is only under the unrealistic assumption of fatalism that this group has any meaningful identity, and thus only in this case could such inferences even begin to have any useful content”. This fatalistic worldview is of course dangerous, but as argued by Casella and Schwartz (2000) in their comment, it is perhaps a “straw man” view that very few statisticians hold. A possible reason for this hollow attack is that cross-world assumptions may look fatalistic. For example, if we use P ∈ {0, 1} to indicate the prescription, T(p) ∈ {0, 1} to indicate taking the prescription, and Y(p, t) to indicate the potential outcome, then the average treatment effect for compliers (one version) can be defined as E[Y(1, 1) − Y(0, 0) | T(0) = 0, T(1) = 1]. Of course we can never observe T(0) and T(1) together, but this usage of potential outcomes is not fatalistic and in fact belongs to the metaphysical criticism.
2.3.3 Determinism
Dawid (2000) explains the popularity of counterfactual models by an implicit view that all problems of causal inference can be cast in the deterministic paradigm, which he believes is only rarely appropriate. This point is further discussed in Section 4.1, and an alternative view based on predictability is described in Section 4.3. Determinism is deeply connected with our understanding of the physical sciences and their explanatory ambitions, on the one hand, and with our views about human free action on the other. Currently, there is no agreement over whether determinism is true or even whether it can be known to be true or false (Hoefer, 2016).

Chapter 3
The Graphical Approach
This Chapter is based on Spirtes et al. (2000), Pearl (2009a, Chapters 2, 3) (1st edition in 2000), and Koller and Friedman (2009, Chapters 3, 21).
3.1 Conditional dependence and Bayesian network
Although it is natural for humans to interpret causality by a graph (representing “X causes Y ” by an arrow from X to Y ), the formal graphical approach was first developed to describe associational knowledge rather than causal knowledge. We shall first briefly review the representation of conditional dependence by a graph—the Bayesian network. Suppose we have a joint distribution P over some set of random variables X =
(X_1, ..., X_d). The core of the Bayesian network representation is a directed acyclic graph (DAG) G, whose nodes are the random variables in X. The graph G can be viewed in two very different but indeed equivalent ways (Koller and Friedman, 2009, Section 3.2): first, G provides the skeleton for a way of factorizing the joint distribution; second, it represents a set of conditional independence assumptions.
Let's define the Bayesian network in the first way. Let Pa_G(X_i) be the parents of X_i in graph G. A Bayesian network is a pair (G, P) such that P factorizes over G, i.e.

\[ P(X_1, \ldots, X_d) = \prod_{i=1}^{d} P(X_i \mid \mathrm{Pa}_G(X_i)). \]

Next we turn to the second representation. Let I(P) denote the set of conditional independence relationships of the form X ⊥⊥ Y | Z in P (X, Y, and Z can be multivariate). Further denote by I_l(G) the local independence relationships in G: I_l(G) = {X ⊥⊥ non-descendants of X | pa_G(X)}. Then P factorizes over G if and only if I_l(G) ⊆ I(P). Given this equivalence, the natural question is: what is I(G), the set of all (not just local) conditional independence relationships for all P that factorize over G? This can be understood using the ideas of an active path and an active vertex on a path. Intuitively, a path is active if it carries information or dependence. Two variables X and Y might be connected by many paths in G, where all, some, or none of the paths are active. The variables X and Y are called d-separated by another set of variables Z if all the active paths connecting them are blocked by some variable in Z. To understand what an active path means, consider all the possible undirected paths between X and Y that go through a third variable Z:
1. X → Z → Y or X ← Z ← Y: this is called a chain or causal trail. It is active if and only if Z is not included.
2. X ← Z → Y: this is called a fork or common cause. It is active if and only if Z is not included.
3. X → Z ← Y: this is called a collider or common effect. It is active if and only if Z is included.
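The three cases above can be written down as a tiny helper function. This is my own illustrative encoding, not from the text; note that full d-separation additionally treats a collider as active when a descendant of Z (rather than Z itself) is conditioned on, a refinement omitted here.

```python
# A minimal encoding of cases 1-3 for a single trail X ? Z ? Y, where each
# '?' is an edge direction. Only the conditioning status of Z matters.
def trail_active(left, right, z_conditioned):
    """left/right are '->' or '<-', describing the edges X?Z and Z?Y."""
    is_collider = (left == '->' and right == '<-')   # X -> Z <- Y
    if is_collider:
        return z_conditioned       # case 3: active iff Z is included
    return not z_conditioned       # cases 1-2 (chain/fork): active iff Z is not

print(trail_active('->', '->', False))   # chain, Z not conditioned on
print(trail_active('<-', '->', True))    # fork blocked by conditioning on Z
print(trail_active('->', '<-', True))    # collider opened by conditioning on Z
```

Checking a general path reduces to checking each consecutive triple with a rule of this form.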
This definition can be further generalized to multivariate X and Y by asking Z to deactivate all the trails connecting one variable in X with another variable in Y. Now we define I(G) to be all the independencies that correspond to d-separation: I(G) = {(X ⊥⊥ Y | Z) : X and Y are d-separated by Z}. Notice that at this point I(G) is solely determined by the graph G and we have not yet linked it to the
distribution P. The connection is elegant: whenever P factorizes over G, we have I(G) ⊆ I(P). The converse of this result is not always true. When it is true, i.e. when all the conditional independencies of P can be read from d-separations in G, the distribution P is called faithful to G. With probability 1, a distribution that factorizes over G is faithful to G (see Spirtes et al., 2000). Uhler et al. (2013) studied the geometry of the faithfulness assumption and suggested that, due to sampling error, the faithfulness condition alone is not sufficient for statistical estimation.
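The direction "factorization implies I(G) ⊆ I(P)" can be verified numerically on the smallest interesting graph. The probability tables below are invented for illustration: on the chain X → Z → Y, d-separation gives X ⊥⊥ Y | Z, and a distribution built from the chain factorization satisfies that independence exactly.

```python
import itertools

# Build P by the chain factorization P(x, z, y) = P(x) P(z|x) P(y|z) and
# check the d-separation statement X ⊥⊥ Y | Z as an exact identity.
pX = {0: 0.7, 1: 0.3}
pZ_X = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # key (z, x)
pY_Z = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.1, (1, 1): 0.9}  # key (y, z)

joint = {(x, z, y): pX[x] * pZ_X[(z, x)] * pY_Z[(y, z)]
         for x, z, y in itertools.product((0, 1), repeat=3)}

def marg(x=None, z=None, y=None):
    """Marginal probability with any subset of the coordinates fixed."""
    return sum(p for (a, b, c), p in joint.items()
               if (x is None or a == x) and (z is None or b == z)
               and (y is None or c == y))

ok = True
for x, z, y in itertools.product((0, 1), repeat=3):
    lhs = marg(x=x, z=z, y=y) / marg(z=z)                      # P(x, y | z)
    rhs = (marg(x=x, z=z) / marg(z=z)) * (marg(z=z, y=y) / marg(z=z))
    ok = ok and abs(lhs - rhs) < 1e-12
print(ok)
```

The converse direction (faithfulness) can fail: for carefully tuned tables an independence can hold in P without a corresponding d-separation in G.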
3.2 Causal Bayesian network
So far we are just using the graphical structure to represent a probability distribution, i.e. storing associational knowledge. This correspondence is elegant, but an important question remaining to be answered is which graph we should use for a given distribution. For example, the complete graph always provides a factorization for any distribution, but it does not reveal any of the conditional independence structure in the distribution. A "good" Bayesian network structure should be sparser, more natural, and robust to changes such as adding or removing a variable.

A causal Bayesian network represents, or tries to represent, a stable and autonomous physical mechanism. It allows us to predict the effect of interventions (interfering in the natural course of events). This is much more informative than probabilistic models because the network also stores causal knowledge, allowing us to answer the second question in Chapter 1.

Intuitively, we want to represent an interventional distribution by a subgraph that deletes all the arrows pointing to the intervention node(s). Formally, a (discrete) probability distribution P is assumed to have many interventional distributions P_do(X=x). The interventional distributions must be compatible with the original distribution, in the sense that P_do(X=x)(Y = y) = 1 if Y ∈ X and Y = y is consistent with X = x, and P_do(X=x)(Y = y | pa_G(Y) = z) = P(Y = y | pa_G(Y) = z) whenever pa_G(Y) = z is consistent with X = x. A DAG G is said to be a causal Bayesian network compatible with P and its compatible interventional distributions if all P_do(X=x) factorize over G.

The main benefit of using a causal Bayesian network is that it gives an efficient representation of P and all its vast interventional distributions. Let V be all the random variables; we can factorize every P_do(X=x) over the edge-deleted subgraph G_x = G_do(X=x) by treating X = x as given:
\[
\begin{aligned}
P_{do(X=x)}(V = v) &= \prod_{i: V_i \notin X} P\bigl(V_i = v_i \mid \mathrm{pa}_{G}(V_i) = v_{\mathrm{pa}_{G}(V_i)}\bigr) \\
&= \prod_{i: V_i \notin X} P\bigl(V_i = v_i \mid \mathrm{pa}_{G_x}(V_i) = v_{\mathrm{pa}_{G_x}(V_i)}, X = x\bigr),
\end{aligned}
\]
if X = x is consistent with V = v.
This also implies two properties which match our intuition: P(X | pa_G(X)) = P_do(pa_G(X))(X), and P_do(pa_G(X), Y)(X) does not depend on Y that is disjoint from X and pa_G(X). We have laid out the causal Bayesian network in a very condensed way. More detail about this approach can be found in Pearl (2009a).
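As a numeric illustration of the truncated factorization, consider a hypothetical three-node network X → T → Y with X → Y (all variables binary). The sketch below computes P_do(T=t)(Y = 1) by dropping the factor P(T | X) and summing out X, and compares it with the observational conditional P(Y = 1 | T = t); the probability tables are made up for illustration only.

```python
# Truncated factorization on a hypothetical confounded network
# X -> T -> Y, X -> Y; all conditional probability tables are invented.
p_x = {0: 0.6, 1: 0.4}                       # P(X = x)
p_t_given_x = {0: 0.3, 1: 0.8}               # P(T = 1 | X = x)
p_y_given_tx = {(0, 0): 0.2, (0, 1): 0.5,    # P(Y = 1 | T = t, X = x)
                (1, 0): 0.4, (1, 1): 0.7}

def p_y_do_t(t):
    # Drop the factor P(T | X) and sum out X: sum_x P(x) P(Y=1 | t, x).
    return sum(p_x[x] * p_y_given_tx[(t, x)] for x in (0, 1))

def p_y_given_t(t):
    # Observational conditional: weights X by P(X | T = t) instead of P(X).
    pt = lambda x: p_t_given_x[x] if t == 1 else 1 - p_t_given_x[x]
    joint = sum(p_x[x] * pt(x) * p_y_given_tx[(t, x)] for x in (0, 1))
    marg = sum(p_x[x] * pt(x) for x in (0, 1))
    return joint / marg

# Because X confounds T and Y, the two quantities differ:
# p_y_do_t(1) = 0.6*0.4 + 0.4*0.7 = 0.52, while p_y_given_t(1) = 0.592.
```

The interventional quantity here coincides with adjusting for X, foreshadowing the back-door formula of Section 3.3.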
3.3 Graphical identification of causal effects
Besides its easy visual representation of causality, what are the other reasons to use the graphical approach? Recall that one of the main goals of causal inference is to establish causal relationships from observational studies. In the potential outcome approach described in Chapter 2, this relies on the ignorability assumption (2.1), which is empirically untestable. In contrast, the causal Bayesian network provides a more structured way to choose the variables appropriate for adjustment.

Suppose we are given a causal diagram G and observational data on a subset of the variables V in G. Our goal is to estimate the causal effect of T on Y. This is the second level question (effects of causes) in Chapter 1. Before estimation, one question must be answered: is this causal effect identifiable? In other words, do there exist two distributions P and P′ (and their interventional distributions) that are compatible with G and have the same marginal distributions of V, yet whose interventional distributions P_do(T=t)(Y) and P′_do(T=t)(Y) are different?

Pearl (1993) first gave the graphical criterion for the strong ignorability assumption (2.2) due to Rosenbaum and Rubin (1983). This graphical criterion is called the back-door criterion, meaning X blocks all the "back-door" trails from T to Y. Formally,
this means that no variable in X is a descendant of T, and X blocks every undirected trail between T and Y that contains an arrow into T. If X satisfies the back-door criterion relative to T and Y, then the interventional distribution is given by
\[
\begin{aligned}
P_{do(T=t)}(Y) &= \sum_x P(Y \mid T, X = x)\, P(X = x) \\
&= \sum_x \frac{P(X = x, T, Y)}{P(T \mid X = x)}.
\end{aligned}
\]

The second formula is called "inverse probability weighting" (see Section 5.4), which makes use of the treatment assignment mechanism (propensity score) P(T | X = x) that is crucial in the potential-outcome analysis.

Putting it in the bigger picture, the back-door condition generalizes the strong ignorability condition (2.2). Recall that strong ignorability says that given X, the value that Y would obtain had the treatment T been t is independent of T. Vaguely speaking, if the counterfactual Y(t) can be represented by the edge-deleted subgraph G_do(T=t), this statement amounts to saying that Y(t) and T are d-separated by X in the subgraph G_do(T=t). However, this reasoning is not quite rigorous, as the subgraph G_do(T=t) does not contain any node corresponding to the counterfactual variable.

This issue is solved in Richardson and Robins (2013) by changing the formulation of G_do(T=t). In their single world intervention graph, G_do(T=t) is obtained not by deleting incoming edges but by splitting the node T into two halves. One half (call it T) inherits the incoming edges of T in G, the other half (call it t = 0, for example) inherits the outgoing edges of T in G, and the two half nodes are not connected. All the descendants Y of T in G now become counterfactual variables Y(t = 0), as they are descendants of the half node t = 0 in the new graph G_do(T=t). The conditional independencies involving counterfactuals can be read from the new G_do(T=t) via d-separation. The back-door criterion is a special case of this approach.

The back-door condition reflects the common advice that the covariates should be unaffected by the treatment. However, this is not necessary. Pearl (1995) gives the front-door criterion, in which the adjustment variables X can be descendants of T. This approach essentially uses the back-door criterion twice: it first computes the causal effect of T on X and then computes the causal effect of X on Y. Based on these two criteria, Pearl (1995) derived a system of rules, called causal calculus, to determine the identifiability of a general interventional distribution given the causal diagram G. See Pearl (2009a, Sections 3.3–3.4) for more detail.

Chapter 4
The Functional Approach
4.1 Laplacian determinism
The functional approach originates from Laplace's demon, or Laplacian determinism, in the history of science. Laplace (1814) presented the following articulation of causal or scientific determinism:
We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.
To Laplace, the whole universe can be described by all its atoms (variables) and the physical laws (functions).

Pearl (2000) provides the following interpretation of Laplacian determinism: "The essence of any scientific law lies in the claim that certain relationships among observable variables remain invariant when the values of those variables change relative to our immediate observations." Pearl (2000) elaborates on this point with the following
example:
For example, Ohm's law (V = IR) asserts that the ratio between the current (I) and the voltage (V) across a resistor remains constant for all values of I, including yet-unobserved values of I. We usually express this claim in a function or a hypothetical sentence: "Had the current in the resistor been I (instead of the observed value I_0), the voltage would have been V = I(V_0/I_0)," knowing perfectly well that there is no way to simultaneously measure I and I_0.

Building on his work on causal diagrams (see Chapter 3), Pearl (2009a, Chapter 5) suggests a quasi-deterministic functional language to describe causality. Pearl's nonparametric structural equation model (SEM) assumes that every node (random variable) X in the graph G can be written as a function of its parents and noise
\[
X = f_X(\mathrm{pa}_G(X), \epsilon_X). \tag{4.1}
\]
In most cases, Pearl assumes the noise variables ε_X are mutually independent. Pearl (2000) views (4.1) as an approximation of Laplace's conception of nature: "randomness surfaces merely due to our ignorance of the underlying boundary conditions". Structural model is a synonym of functional model. Let us consider the following linear structural equation model (Wright, 1934), which is a special case of (4.1): for a graph G with nodes X_1, ..., X_d,
\[
X_i = \sum_{j \in \mathrm{pa}_G(X_i)} \alpha_{ij} X_j + \epsilon_i. \tag{4.2}
\]

The word structural is used to emphasize that equation (4.2) is different from a linear regression, in the sense that (4.2) also predicts interventional settings. Therefore, a unit (interventional) change of X_j for some j ∈ pa_G(X_i), from X_j = x_j to X_j = x_j + 1, will result in the same increase of X_i (namely α_ij) as a unit change from X_j = x_j′ to X_j = x_j′ + 1. Also, once we hold pa_G(X_i) constant, changing all other variables in the model will not affect X_i. The same interpretation holds for Ohm's law (a nonparametric SEM).
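The interventional reading of (4.2) can be checked by simulation. The sketch below assumes a one-coefficient linear SEM (the model and names are made up for illustration) and verifies that shifting an intervention on X by one unit shifts E[Y] by the structural coefficient, regardless of the baseline value.

```python
import numpy as np

# Illustrative two-equation linear SEM: X = eps_X, Y = alpha * X + eps_Y.
rng = np.random.default_rng(0)
alpha = 2.0
n = 200_000

def simulate(do_x=None):
    eps_x = rng.normal(size=n)
    eps_y = rng.normal(size=n)
    # Intervening on X replaces its structural equation by a constant.
    x = eps_x if do_x is None else np.full(n, do_x)
    y = alpha * x + eps_y            # structural equation for Y is untouched
    return x, y

# A unit change in the intervened value of X shifts E[Y] by alpha,
# no matter which baseline we start from.
_, y0 = simulate(do_x=1.0)
_, y1 = simulate(do_x=2.0)
print(y1.mean() - y0.mean())         # close to alpha = 2.0
```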
I want to make several remarks about Laplacian determinism and structural models. First, SEMs are widely used in many scientific disciplines, including genetics, economics, epidemiology, and education. Criticisms of SEM have existed since its first proposal (Wright, 1921) and persist to this day. Bollen and Pearl (2013) argue that much of the controversy is due to misunderstanding. Second, determinism is fundamentally connected to the objectivity of probability: if our world is truly deterministic, some philosophers argue that there is no room for objective probability. Third, determinism is also different from predictability, a concept that statisticians are more used to. Hoefer (2016) gives a fantastic discussion on this:
Laplace probably had God in mind as the powerful intelligence to whose gaze the whole future is open. If not, he should have: 19th and 20th century mathematical studies showed convincingly that neither a finite, nor an infinite but embedded-in-the-world intelligence can have the computing power necessary to predict the actual future, in any world remotely like ours. But even if our aim is only to predict a well-defined subsystem of the world, for a limited period of time, this may be impossible for any reasonable finite agent embedded in the world, as many studies of chaos (sensitive dependence on initial conditions) show. Conversely, certain parts of the world could be highly predictable, in some senses, without the world being deterministic. When it comes to predictability of future events by humans or other finite agents in the world, then, predictability and determinism are simply not logically connected at all.
We will go back to this point later in Section 4.3.
4.2 Functional model and counterfactuals
At first glance, the functional approach, for example the nonparametric SEM in (4.1), may look very different from the counterfactual approach described in Chapter 2. However, the two approaches are indeed equivalent in a strong sense.
One of Pearl's motives in developing the nonparametric SEM was to represent imaginative counterfactuals in terms of mathematical expressions. If we assume the model (4.1) for variables T, X (treated as exogenous), and Y, i.e.
\[
T = f(X, \epsilon_T), \qquad Y = g(T, X, \epsilon_Y),
\]
then we can express the counterfactual Y(0) as g(0, X, ε_Y) and Y(1) as g(1, X, ε_Y), two random variables that are well defined in the model. To check conditional independence for counterfactual variables, Balke and Pearl (1994) developed an approach called the twin network, which augments the original graph G with its counterfactuals and connects the two counterparts through the common noise variables. Conditional independence of counterfactuals can then be read from the twin network via d-separation.

Richardson and Robins (2013, Section 4.2.3) present an example where the twin network method fails. Richardson and Robins (2013) propose to use d-separation in their single-world intervention graphs to check counterfactual queries. Instead of doubling the original graph, they rely on the node-splitting operation described in Section 3.3. Counterfactuals are now represented by variables in the intervention graph. They also show that Pearl's functional representation of counterfactuals assumes many more cross-world independence assumptions (which are empirically unverifiable) than what is necessary (just single-world independence assumptions).

So far we have not talked about statistical inference of causal models. Recall that there are two different types of causal queries in Chapter 1: "effects of causes" and "causes of effects". As discussed earlier, the three mathematical languages described in Chapters 2 to 4 are essentially equivalent to each other. Nevertheless, some languages are more powerful and convenient than others in answering certain questions:
1. Potential outcomes are designed to describe "effects of causes". In fact, Holland (1986) said that "... an emphasis on the effects of causes rather than on the causes of effects is, in itself, an important consequence of bringing statistical reasoning to bear on the analysis of causation ...". In the same article, Holland also argues that only variables with "potential exposability" can be a cause. Inspected today, this view is rather narrow and is largely due to the limitation of the potential outcome language. However, there is no reason to dismiss this approach because of that limitation. In fact, the entire Part II of this dissertation is devoted to the potential outcome approach, as it is still the most convenient language for "effects of causes".
2. The functional approach handles large and complex causal networks the best. Statistical inference is also tractable: in principle, one can use maximum likelihood or the method of moments to solve structural equation models like (4.1) and (4.2). Numerous methods have been developed to test the goodness of fit and the implications of SEMs (see Bollen, 2014). With the estimated functions, it is also more convenient to plan interventions optimally.
3. The graphical approach is by far the clearest and most concise way of summarizing causal knowledge. It allows easy interpretation of causal models and identification of causal effects. However, the statistical inference of causal Bayesian networks is not straightforward. The next Section describes a recently proposed approach that infers the causal diagram using data from multiple interventional settings.
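To make the functional representation of counterfactuals concrete, here is a small simulation sketch: Y(0) and Y(1) are computed from the same structural function g with the same noise draw ε_Y, so the factual outcome coincides with Y(T) (the consistency property). The particular additive-noise model is assumed for illustration, not taken from the text.

```python
import numpy as np

# Counterfactuals in a functional model: Y(t) = g(t, X, eps_Y), where the
# SAME eps_Y is shared across the two "worlds". Hypothetical model below.
rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
eps_t = rng.normal(size=n)
eps_y = rng.normal(size=n)

t = (x + eps_t > 0).astype(float)    # T = f(X, eps_T)

def g(t, x, eps_y):
    return 1.5 * t + x + eps_y       # Y = g(T, X, eps_Y)

y_factual = g(t, x, eps_y)
y0 = g(0.0, x, eps_y)                # counterfactual Y(0), same noise draw
y1 = g(1.0, x, eps_y)                # counterfactual Y(1), same noise draw

# Consistency: the factual outcome equals Y(T) unit by unit.
assert np.allclose(y_factual, np.where(t == 1, y1, y0))
```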
4.3 Invariant causal prediction
As noticed by Dawid (2000) in his rejoinder, "all of the discussants [of his paper] except Shafer and Robins and Greenland seem to be out-and-out Laplacian determinists, for whom nothing short of a functional model relating outputs to inputs will do as a description of nature". Adopters of functional models such as (4.1) or (4.2) are surely determinists. How about people who use potential outcomes?

The answer is not obvious. Consider the strong ignorability assumption (2.2) that is commonly made. It involves the joint distribution of the potential outcomes Y(0) and Y(1), which is never observed under any experimental setting. However, causal targets such as E[Y(1)] and E[Y(0)] are identifiable if we assume only that, marginally, Y(0) and Y(1) are independent of T conditional on X. Borrowing terms from Richardson and Robins (2013), this means many potential-outcome adopters assume extra cross-world independence assumptions. It seems to me that this is a consequence of being a determinist.

One advantage of determinism is that the statistical inference is more convenient. To fit a functional model like (4.1) or (4.2), one only needs to postulate the functional forms and noise distributions and apply standard statistical tools (e.g. maximum likelihood).

A natural question after Dawid's observation is: can we talk about causality without being a Laplacian determinist? To clarify, "talking about causality" means answering the interventional (effects of causes) and counterfactual (causes of effects) questions in Chapter 1. This is an important question because many statisticians¹ and scientists do not believe in this philosophy.

Recently, Peters et al. (2015) proposed an alternative approach that can potentially allow the disbelievers of determinism to work on causality. This approach views causality as invariant prediction under different environments. An environment can be the observational distribution or any interventional setting. In this definition, causality is a consequence of predictability instead of determinism. As discussed earlier in Section 4.1, predictability and determinism are different concepts and not logically connected.

Peters et al. (2015) consider the setting where we have different experimental conditions e ∈ E and an i.i.d. sample of (X^e, Y^e) in each environment. They are interested in discovering the subset S* of the variables in X that can "causally" predict Y, in the sense that the conditional distributions Y^e | X^e_{S*} and Y^f | X^f_{S*} are identical for all environments e, f ∈ E. When Y^e | X^e_{S*} satisfies the linear structural equation model (4.2), they provide identification conditions and a systematic procedure to infer this set from empirical data.

Once the causal parents of Y are found, they can be used to answer the causal queries. For interventional queries, Peters et al.
(2015) give confidence intervals for the structural coefficients. For counterfactual queries, the answer is also immediate from the estimated prediction formula. Notice that some modularity assumptions are necessary to extend the discovery to a new environment e′ ∉ E, and additional independence assumptions are necessary to answer queries with cross-world counterfactuals. Both points are not adequately discussed in the Peters et al. (2015) paper.

¹For example my Ph.D. advisor, Trevor Hastie, said determinism is "a little nutty" to him.

Part II
INFERRING A SINGLE EFFECT
Chapter 5
Matching and Weighting
5.1 Raw matching
Historically, randomized experiments (e.g. Fisher, 1935) were studied before observational studies. The apparent benefit of a randomized experiment is that all pretreatment covariates, observed or unobserved, are always stochastically balanced. If the outcomes of the treatment and control populations are different, the only possible cause is the treatment itself. This isolates the treatment effect and allows us to use standard inference tools in statistics.

In observational studies, stochastic covariate balance in general does not hold. For example, patients in worse condition may be more likely to choose a certain treatment, making that treatment look worse than the others. Simply ignoring this fact can lead to significant confounding bias. For this reason, it is often desirable to mimic a randomized experiment using observational data. In randomized experiments, one of the most common design techniques is blocking, which removes the effect of nuisance factors. This motivates the raw matching methods described in this Section. The term "raw matching" means that these methods only use the raw pretreatment covariates and do not attempt to model the assignment mechanism (the propensity score).

Any matching algorithm must first specify a distance metric d on the covariates.
One example is exact matching:
\[
d(X_i, X_j) =
\begin{cases}
0, & \text{if } X_i = X_j, \\
\infty, & \text{if } X_i \neq X_j.
\end{cases}
\]

In other words, two units are matched only if they have exactly the same covariates. This is often too stringent. In practice, a widely used metric is the Mahalanobis distance:

\[
d(X_i, X_j) = \sqrt{(X_i - X_j)^T \Sigma^{-1} (X_i - X_j)}.
\]

If the estimand is the ATT, Σ is the covariance matrix of X in the control group; if the estimand is the ATE, Σ is chosen to be the covariance matrix in the pooled treatment and control groups. This distance metric is motivated by the multivariate normal distribution. The Mahalanobis distance can also be generalized with an additional weight parameter W (Diamond and Sekhon, 2013a). Formally,
\[
d(X_i, X_j) = \sqrt{(X_i - X_j)^T \bigl(\Sigma^{-1/2}\bigr)^T W \, \Sigma^{-1/2} (X_i - X_j)}. \tag{5.1}
\]

Once the distance metric d is selected, we can apply a matching algorithm. One of the most common methods, and the easiest to implement and understand, is k:1 nearest neighbor matching (Rubin, 1973). In this algorithm, each treated unit is matched to the k control units that are closest in terms of d, and the control units that are not selected are discarded. Therefore nearest neighbor matching estimates the ATT. When k = 1, matching is the counterpart of the paired design in randomized experiments. The user can also choose whether a control unit is allowed to be used multiple times as a match (matching with replacement). In matching without replacement, the algorithm usually proceeds in a greedy fashion and the output is sensitive to which treated units are matched first. It is also possible to avoid this approach and instead minimize a global distance measure, which picks about the same controls but does a better job of assigning matched controls to treated units (Gu and Rosenbaum, 1993).

The state-of-the-art matching algorithm is Genetic Matching, developed by Diamond and Sekhon (2013a). It is available in the R package Matching (Sekhon, 2011).
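As a sketch of the pieces above, the following code performs greedy 1:1 nearest neighbor matching without replacement on the Mahalanobis distance, with Σ estimated from the control group as is appropriate for the ATT. The data and all names are illustrative, not from the text.

```python
import numpy as np

# Greedy 1:1 nearest-neighbor Mahalanobis matching (without replacement).
rng = np.random.default_rng(2)
n_t, n_c, p = 20, 60, 3
x_treat = rng.normal(loc=0.3, size=(n_t, p))   # treated covariates
x_ctrl = rng.normal(size=(n_c, p))             # control covariates

# For the ATT, Sigma is estimated from the control group.
sigma_inv = np.linalg.inv(np.cov(x_ctrl, rowvar=False))

def mahalanobis(a, b):
    d = a - b
    return float(np.sqrt(d @ sigma_inv @ d))

available = set(range(n_c))
matches = {}
for i in range(n_t):                           # greedy: order matters
    j = min(available, key=lambda j: mahalanobis(x_treat[i], x_ctrl[j]))
    matches[i] = j
    available.remove(j)                        # without replacement
```

As the text notes, this greedy pass is sensitive to the order in which treated units are processed; optimal (global) matching would instead minimize the total matched distance.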
Given a user-specified criterion of covariate imbalance, it uses a genetic search algorithm to choose the weights W in the generalized Mahalanobis distance, so that the matched samples optimize the specified imbalance criterion.

One drawback of nearest neighbor matching is that some control units are discarded and not used in the analysis. Subclassification and full matching instead use all individuals and can estimate either the ATE or the ATT. Another motivation for these approaches is the randomized block design, where each block typically contains more than one treated unit. Subclassification forms groups of units with similar pretreatment covariates. Full matching is more sophisticated and creates a series of matched sets, where each matched set contains at least one treated unit and at least one control unit. See Hansen (2004) and Stuart (2010) for more detail on these methods.
5.2 Propensity score matching
As mentioned earlier in Section 2.2, the treatment assignment mechanism plays a key role in observational studies. If we are willing to assume the assignment mechanism P(T | X) is ignorable (2.1) or unconfounded (2.2), the observational study resembles a randomized experiment. How should we then proceed?

In a seminal work, Rosenbaum and Rubin (1983) highlight the role of the "propensity score" p(X) = P(T = 1 | X) in observational studies with binary T. Their work is motivated by the difficulty of extending the raw matching methods (see Section 5.1) to high dimensional covariates. They call a function b(X) of the observed covariates a balancing score if X ⊥⊥ T | b(X). In other words, P(T | X) = P(T | b(X)), so the multivariate matching problem is reduced to a univariate matching problem. In this sense, a balancing score is the "sufficient statistic" of the non-random treatment assignment. Rosenbaum and Rubin (1983) prove that a function b(X) is a balancing score if and only if it is finer than p(X), in the sense that p(X) = f(b(X)) for some function f. In this sense, the propensity score is the most basic tool to adjust for covariate imbalance, and sometimes it is more advantageous to match or subclassify not only on p(X) but on other functions of X as well.
In randomized experiments, the propensity score is usually specified before the experiment (e.g. a randomized block design). In observational studies, p(X) is unknown and needs to be estimated from the data. For this purpose, the most commonly used method is logistic regression solved by maximum likelihood. Here we make two remarks about propensity score estimation: first, propensity score modeling is a means to an end (covariate balance), not an end in itself; second, Chapter 6 shows that, in order to achieve better covariate balance, certain tailored loss functions should be used instead of the Bernoulli likelihood to estimate the propensity score.

After the propensity scores are estimated, we can in principle apply any of the matching methods described in Section 5.1. Notice that the propensity score is a scalar, so this is a univariate matching problem. The most commonly used distance metrics are
\[
\begin{aligned}
d(X_i, X_j) &= |\hat{p}(X_i) - \hat{p}(X_j)|, \quad \text{and} \\
d(X_i, X_j) &= |\mathrm{logit}(\hat{p}(X_i)) - \mathrm{logit}(\hat{p}(X_j))|.
\end{aligned}
\]

Once the distance metric is chosen, one can apply nearest neighbor matching (with or without replacement), subclassification, or full matching as described in Section 5.1.

Notice that the distance metric is defined by the estimated propensity scores, which are not necessarily close to the true ones. The most common reason for this is model misspecification. In standard practice, the matches/subclasses obtained by propensity score matching must go through a diagnostic step to ensure all covariates are well balanced. If not, one needs to explore other specifications until satisfactory covariate balance is achieved. This cyclic procedure, certainly not very pleasant for applied researchers, is sometimes called the "propensity score tautology" in the literature (Ho et al., 2007).
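A minimal sketch of this pipeline: fit a logistic propensity model by Newton's method (written out in numpy in place of a packaged GLM routine), then match each treated unit to its nearest control on the logit of the estimated score. The simulated design and all names are illustrative.

```python
import numpy as np

# Logistic propensity model fit by Newton-Raphson, then 1:1 matching
# (with replacement) on the linear predictor logit(p-hat).
rng = np.random.default_rng(3)
n, p = 500, 2
x = rng.normal(size=(n, p))
true_logit = 0.5 * x[:, 0] - 0.5 * x[:, 1]
t = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

xb = np.hstack([np.ones((n, 1)), x])       # design matrix with intercept
beta = np.zeros(p + 1)
for _ in range(25):                        # Newton steps for the MLE
    mu = 1 / (1 + np.exp(-xb @ beta))
    w = mu * (1 - mu)
    beta += np.linalg.solve((xb.T * w) @ xb, xb.T @ (t - mu))

logit = xb @ beta                          # estimated logit(p-hat)
treated = np.where(t == 1)[0]
controls = np.where(t == 0)[0]
pairs = {i: controls[np.argmin(np.abs(logit[controls] - logit[i]))]
         for i in treated}                 # nearest control for each treated
```

In practice one would follow this with the balance diagnostics described above, since matching on a misspecified score need not balance the covariates.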
5.3 Empirical calibration weighting
We will discuss weighting methods in the next two Sections. In some sense, the matching methods described in Sections 5.1 and 5.2 are also weighting methods with discrete weights. For example, in 1:1 nearest neighbor matching without replacement,
treated units always have weight 1 and control units have weight either 0 or 1. With subclassification or full matching, the weights can be more complicated, but still there are only finitely many possible weights. The weighting methods described in the next two Sections do not have this constraint and thus are inherently different from matching.

In general, weighting methods seek non-negative weights w such that the weighted empirical distributions $F_w(0) = \sum_{T_i=0} w_i \delta_{X_i}$ and $F_w(1) = \sum_{T_i=1} w_i \delta_{X_i}$ ($\delta_x$ is the point mass at $x$) are as close as possible. Typically, the distance between these weighted distributions is measured by the standardized difference (Rosenbaum and Rubin, 1985; Austin and Stuart, 2015) with respect to some covariate function ψ(·),
\[
d_{\mathrm{sd},\psi}(F_w(0), F_w(1)) = \frac{E_{F_w(1)}[\psi(X)] - E_{F_w(0)}[\psi(X)]}{\sqrt{\mathrm{Var}_{F_w(0)+F_w(1)}(\psi(X))}}, \tag{5.2}
\]

or the univariate Kolmogorov–Smirnov statistics,
\[
d_{\mathrm{KS},j} = \max_x |F_{w,j}(1, x) - F_{w,j}(0, x)|, \qquad j = 1, \ldots, p, \tag{5.3}
\]

where $F_{w,j}(t, x)$ ($t = 0$ or $1$) is the marginal cumulative distribution function of $F_w(t)$ for the j-th variable evaluated at x.

However, these distance metrics are too complicated and in general non-convex in w, so it is difficult to minimize them over w directly. To remedy this, we need to find convex alternatives. The genetic matching of Diamond and Sekhon (2013b) is an example of this idea applied to matching. The user usually specifies some combination of the distance metrics in (5.2) and (5.3), but genetic matching instead searches over the weighted Mahalanobis distance metrics (5.1), which are much easier to handle.

Now we turn to empirical calibration weighting, which seeks weights w such that the standardized difference (5.2) is zero or small for some pre-specified functions ψ_k, k = 1, ..., m. Zero standardized difference implies exact sample balance, i.e. the weights w solve
\[
\sum_{T_i=1} w_i \psi_k(X_i) = \sum_{T_i=0} w_i \psi_k(X_i), \qquad k = 1, \ldots, m. \tag{5.4}
\]
We should also avoid the trivial solution to (5.4) by asking
\[
\sum_{T_i=1} w_i = \sum_{T_i=0} w_i = 1. \tag{5.5}
\]

When m is not too large, there are usually infinitely many solutions to (5.4) and (5.5). To pick one of them, we can ask that w be as close to uniform as possible. The practical reason for this is that the weighted difference estimator
\[
\hat{\tau} = \sum_{T_i=1} w_i Y_i - \sum_{T_i=0} w_i Y_i
\]

has variance $\sum_{i=1}^n w_i^2 \mathrm{Var}(Y_i \mid X, T)$ conditional on X and T. If we assume homoskedastic noise $\mathrm{Var}(Y_i \mid X) = \sigma^2$, then we should pick w to minimize its squared norm $\sum_{i=1}^n w_i^2$.

In general, the empirical calibration method minimizes $\sum_{i=1}^n D(w_i, v_i)$ subject to the exact balance (5.4) and the normalization (5.5), where $D(w, v)$ is a function of $w$ that achieves its minimum at $v$, and $\{v_i\}_{i=1}^n$ is a set of uniform weights. In Deville and Särndal (1992), the calibration estimator was originally used in survey sampling with non-random samples. The entire population is unobserved, but we know some population moments (expectation, variance, etc.) of the covariates. To estimate the population average of the response, Deville and Särndal (1992) construct weighted survey samples that are calibrated to the known population moments. This is the same as estimating the ATT in observational studies if we view the treated units as the population whose Y(0)'s are missing.

Although calibration estimation is commonly used in survey sampling (Kim and Park, 2010), it was not considered in observational studies until recently. Hainmueller (2011) considers estimating the ATT with D being the negative Shannon entropy. Zubizarreta (2015) uses the squared norm of w as the objective but allows inexact balance. Chan et al. (2015) prove that the empirical calibration estimators for the ATT and ATE can achieve the semiparametric efficiency bound.
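As an illustration of calibration with D equal to the negative Shannon entropy (in the spirit of Hainmueller, 2011, though this particular implementation is my own sketch), the code below solves the dual problem by Newton's method: the control weights take the exponential-tilting form w_i ∝ exp(λᵀX_i), and λ is chosen so the weighted control means exactly match the treated means (the ATT target). Data and dimensions are made up.

```python
import numpy as np

# Entropy-balancing sketch for the ATT: reweight controls so their
# weighted covariate means equal the treated means exactly.
rng = np.random.default_rng(4)
x_treat = rng.normal(loc=0.5, size=(40, 2))
x_ctrl = rng.normal(size=(200, 2))

target = x_treat.mean(axis=0)
c = x_ctrl - target                        # centered control covariates
lam = np.zeros(2)
for _ in range(50):                        # Newton on the convex dual
    e = np.exp(c @ lam)
    w = e / e.sum()                        # candidate weights, sum to 1
    grad = w @ c                           # current (weighted) balance gap
    hess = (c * w[:, None]).T @ c - np.outer(grad, grad)
    lam -= np.linalg.solve(hess, grad)

e = np.exp(c @ lam)
w = e / e.sum()
print(w @ x_ctrl - target)                 # ≈ [0, 0]: exact mean balance
```

The dual here is the log-sum-exp function log Σ_i exp(λᵀc_i), whose gradient is exactly the weighted balance gap, so Newton's method converges quickly whenever the treated mean lies inside the convex hull of the control covariates.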
5.4 Propensity score weighting
In this Section we describe propensity score weighting, the continuous counterpart of propensity score matching in Section 5.2. One can draw a similar comparison between empirical calibration weighting in Section 5.3 and raw matching in Section 5.1. This is summarized in Table 5.1 below.

                          Discrete weights              Continuous weights
    By raw covariates     Raw matching                  Empirical calibration weighting
    By propensity scores  Propensity score matching     Inverse probability weighting
Table 5.1: Matching and weighting methods
Propensity score weighting is commonly called inverse probability weighting (IPW) in the observational study literature, because that is the form of the weighting used to estimate the ATE. It was first developed by Horvitz and Thompson (1952) to estimate the mean of a population from a stratified random sample (a survey sampling problem), so it is also called the Horvitz–Thompson estimator. IPW is applied to account for different proportions of observations within strata in the target population. If p_i is the inclusion probability of the sample Y_i, the Horvitz–Thompson estimator is given by
\[
\hat{\mu} = \frac{1}{n} \sum_{i=1}^n p_i^{-1} Y_i.
\]

In observational studies, the ATE, E[Y(1) − Y(0)], can be viewed as the difference of two population means. Therefore, the IPW estimator is given by the difference of two Horvitz–Thompson estimators
\[
\hat{\tau}_{\mathrm{ATE}} = \frac{1}{n} \sum_{T_i=1} \frac{Y_i}{\hat{p}(X_i)} - \frac{1}{n} \sum_{T_i=0} \frac{Y_i}{1 - \hat{p}(X_i)}, \tag{5.6}
\]

where $\hat{p}(\cdot)$ is the estimated propensity score. This method can also be extended to other estimands. For the ATT, the IPW estimator is given by
\[
\hat{\tau}_{\mathrm{ATT}} = \frac{1}{n_1} \sum_{T_i=1} Y_i - \frac{1}{n_1} \sum_{T_i=0} \frac{\hat{p}(X_i)}{1 - \hat{p}(X_i)} Y_i, \qquad n_1 = \sum_{i=1}^n T_i. \tag{5.7}
\]
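The two estimators (5.6) and (5.7) can be sketched in a few lines. The simulation below uses the true propensity score in place of an estimated one, with a constant treatment effect of 2 so that the ATE and ATT coincide; the design is invented purely for illustration.

```python
import numpy as np

# IPW estimators (5.6) and (5.7) on a simulated confounded design.
rng = np.random.default_rng(5)
n = 100_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))                   # propensity score P(T=1 | X)
t = (rng.random(n) < p).astype(float)
y = 2.0 * t + x + rng.normal(size=n)       # constant effect: ATE = ATT = 2

# (5.6): difference of two Horvitz-Thompson estimators.
tau_ate = (np.sum(t * y / p) - np.sum((1 - t) * y / (1 - p))) / n

# (5.7): controls reweighted by the odds p / (1 - p), normalized by n_1.
n1 = t.sum()
tau_att = (np.sum(t * y) - np.sum((1 - t) * p / (1 - p) * y)) / n1
```

A naive comparison of group means here would be biased upward, since X raises both the treatment probability and the outcome; the inverse weights undo that imbalance.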
Chapter 6 considers IPW estimators for more general estimands.

Compared to matching methods, IPW is a "cleaner" and more efficient approach, because the discrete matches are replaced by continuous weights. In a key paper, Hirano et al. (2003) showed that IPW paired with a sieve propensity score model can achieve the semiparametric efficiency bound, which gives a theoretical reason to prefer weighting over matching (see Section 6.5.1 for more detail). However, this also comes with a price. The inverse probability weights are more volatile and sensitive to model misspecification. If some estimated propensity score $\hat{p}(X_i)$ is close to 0 (for a treated unit) or 1 (for a control unit), its inverse weight can become very large and unstable, so the IPW estimator may perform poorly in finite samples. One way to mitigate this is to normalize the weights within each treatment group. For example, the normalized IPW estimator of the ATE is
$$\hat{\tau}^{*}_{\mathrm{ATE}} = \left( \sum_{T_i = 1} \frac{1}{\hat{p}(X_i)} \right)^{-1} \sum_{T_i = 1} \frac{Y_i}{\hat{p}(X_i)} - \left( \sum_{T_i = 0} \frac{1}{1 - \hat{p}(X_i)} \right)^{-1} \sum_{T_i = 0} \frac{Y_i}{1 - \hat{p}(X_i)}. \qquad (5.8)$$

Still, this does not completely solve the instability issue: the estimator could be largely decided by just a few observations. The instability issue was officially brought up by Kang and Schafer (2007), but it was perhaps well known by practitioners before that. Kang and Schafer (2007) constructed an artificial example in which the inverse probability weights are very unstable and the IPW estimator performs poorly. Moreover, they showed that if we further augment IPW by an outcome regression (see Chapter 7), the estimator could perform even worse, even though it has the theoretical "double robustness" property. Of course, when inverse probability weights are unstable, they usually do not balance the covariates very well. This example motivated many empirical calibration weighting methods (e.g. Tan, 2010, Hainmueller, 2011, Zubizarreta, 2015, Chan et al., 2015) and also the approach described in the next Chapter, which estimates the propensity score by minimizing loss functions tailored for the objective of covariate balance.

Chapter 6
Tailoring the Propensity Score Model
This Chapter is based on Zhao (2016).
6.1 Motivation
In the last Chapter, we introduced several methods to estimate causal effects from observational studies. As summarized in Table 5.1, some of these methods try to directly balance the raw covariates, while others resort to the propensity score. In general, the propensity score is a more principled approach and easier to implement, because all the confounding information is summarized in a single number. However, this approach is not robust to model misspecification. Consider the following three main steps that most of the propensity-score based methods share:
Step 1 Estimate a propensity score model, most commonly by maximizing some scoring rule such as the Bernoulli likelihood. A scoring rule is a negative loss function and the two terms will be used interchangeably in this Chapter. The generalized linear model (McCullagh and Nelder, 1989) has been a workhorse in practice, but more sophisticated alternatives such as nonparametric regression (e.g. Hirano et al., 2003) and machine learning methods (e.g. McCaffrey et al., 2004, Lee et al., 2010, Wager and Athey, 2015) have also been suggested in the literature.
Step 2 Adjust for covariate imbalance by using the estimated propensity scores from Step 1. Numerous methods have been proposed, including: matching (e.g. Rosenbaum and Rubin, 1985, Abadie and Imbens, 2006), subclassification (e.g. Rosenbaum and Rubin, 1984), and inverse probability weighting (e.g. Robins et al., 1994, Hirano and Imbens, 2001). The reader is referred to Lunceford and Davidian (2004), Imbens (2004), Caliendo and Kopeinig (2008), Stuart (2010) for some comprehensive reviews.
Step 3 Choose a weighted average treatment effect as the estimand and estimate it by using the matches/strata/weights generated in Step 2. Report the point estimate, a confidence interval, evidence of sufficient covariate balance in Step 2, and sensitivity results if necessary.
A leading concern of the propensity-score based methods is that the eventual estimator in Step 3 can be highly sensitive to the outcome of Step 1, the estimated propensity score model (see e.g. Smith and Todd, 2005, Kang and Schafer, 2007). In fact, all the adjustment methods in Step 2 assume that the estimated propensity scores are very close to the truth. This generally requires a correctly specified model or an effective nonparametric regression. In practice, correct model specification is often unrealistic, and nonparametric regression, due to the curse of dimensionality, is a sensible choice only if the sample size is large and the covariates are few. To alleviate the concern of model misspecification, a commonly adopted strategy is to gradually increase the model complexity by forward stepwise regression (Imbens and Rubin, 2015, Sections 13.3–13.4). The first two steps described above are usually repeated several times until satisfactory covariate balance is achieved.

In the standard practice, maximum likelihood is used to fit the propensity score model in Step 1. However, maximum likelihood is suboptimal at balancing covariates. Figure 6.1 implements the aforementioned forward stepwise strategy with logistic regression and inverse probability weighting (IPW). More detail about this simulation example, due to Kang and Schafer (2007), can be found in Section 6.7.1. In this Figure, covariate imbalance is measured by the standardized difference of each predictor between the two treatment groups (precise definition in Section 10.2). A widely used criterion is that a standardized difference above 10% is unacceptable (Normand et al., 2001, Austin and Stuart, 2015), which is the dashed line in Figure 6.1. The left panel of Figure 6.1 uses the Bernoulli likelihood to fit and select logistic regression models. The standardized difference paths are not monotonically decreasing and never achieve a satisfactory level (10%) for more than half of the predictors. This certainly creates inconvenience for applied researchers and, more importantly, limits our understanding of the fundamental bias-variance trade-off in selecting a propensity score model. In contrast, the right panel of Figure 6.1 uses the covariate balancing scoring rule (CBSR) proposed in this Chapter, and all 8 predictors are well balanced after 4 steps. As another highlighting feature, all active predictors (i.e. variables in the selected model) are exactly balanced using inverse probability weights derived by CBSR.

Why doesn't maximum likelihood always generate covariate balancing weights? Let's review the three-step procedure above, in which the user has the freedom to choose: in Step 1, a form of propensity score model (e.g. a certain link function in the GLM) and a scoring rule to fit the model; in Step 2, a propensity-score based adjustment method; in Step 3, a weighted average treatment effect as the estimand. It is understandably tempting to fit a single propensity score model and use it to infer multiple estimands. Along this road, maximum likelihood most efficiently estimates the propensity scores. However, the propensity score model is a means to an end, not an end in itself. The most accurate (or even the true) propensity scores do not necessarily produce the best sample balance.
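The imbalance diagnostic used in Figure 6.1 can be sketched as follows. This is my own minimal implementation of a weighted standardized difference, using weighted group means and an unweighted pooled variance in the spirit of Rosenbaum and Rubin (1985); the dissertation's exact convention is given later in equation (6.25), so treat this as an illustration rather than the definitive formula.

```python
import numpy as np

def standardized_difference(x, t, w=None):
    """Standardized difference (in %) of one covariate x between
    treatment groups: |weighted treated mean - weighted control mean|
    divided by the pooled standard deviation sqrt((s1^2 + s0^2) / 2)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    m1 = np.sum(w * t * x) / np.sum(w * t)
    m0 = np.sum(w * (1 - t) * x) / np.sum(w * (1 - t))
    s1 = np.var(x[t == 1], ddof=1)   # unweighted group variances
    s0 = np.var(x[t == 0], ddof=1)
    return 100 * abs(m1 - m0) / np.sqrt((s1 + s0) / 2)
```

With inverse probability weights, w_i = 1/p̂(X_i) for treated units and w_i = 1/(1 - p̂(X_i)) for controls, values below 10 correspond to the common acceptability threshold mentioned above.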
The central message of this Chapter is that we should tailor the scoring rule according to the estimand, since ultimately we are interested in the estimate in Step 3. CBSR views the three-step procedure as a whole and provides a systematic approach to obtain balancing weights. The CBSR-maximizing propensity scores in Step 1 are best paired with inverse probability weighting (IPW) in Step 2 for its algebraic tractability. As a side note, IPW is also quickly gaining popularity in the literature (Austin and Stuart, 2015) and is more efficient than matching and subclassification. After obtaining the specific form of IPW from the estimand, the covariate balancing scoring rule can be uniquely determined from the link function of the GLM in Step 1.

[Figure 6.1: two panels, "Bernoulli likelihood" (left) and "Covariate balancing scoring rule" (right), plotting the standardized difference (%) of the predictors X1, X1^2, X2, X2^2, X3, X3^2, X4, X4^2 against the forward stepwise step (0 to 8), with a dashed line at 10%.]

Figure 6.1: The covariate balancing scoring rule (CBSR) proposed in this Chapter is much better than the Bernoulli likelihood at reducing covariate imbalance. The propensity score is modeled by logistic regression and fitted by CBSR or the Bernoulli likelihood. The standardized difference is computed using inverse probability weighting (IPW) and the pooled variance for the two treatment groups as in Rosenbaum and Rubin (1985); see equation (6.25) in Section 6.4.5. A standardized difference above 10% is viewed as unacceptable by many practitioners. More detail of the forward stepwise regression and this simulation example can be found in Sections 6.4.1 and 6.7.1.
6.2 Background on statistical decision theory
Propensity score estimation is a decision problem, though an unusual one. In a typical decision problem of making a probabilistic forecast, the decision maker needs to pick an element of P, a convex class of probability measures on some general sample space Ω, as the prediction. For example, a weather forecaster needs to report the chance of rain tomorrow, so the sample space is Ω = {rain, no rain} and the prediction is a Bernoulli distribution. The propensity score is also a (conditional) probability measure, but the goal is to achieve satisfactory covariate balance rather than the best predictive power. This marks a clear difference to the prediction problem. Nevertheless, statistical decision theory provides a general framework and effective tools to fit a covariate balancing propensity score model.
6.2.1 Proper scoring rules
Let’s first review some useful concepts. At the core of statistical decision theory is the scoring rule,whichcanbeanyextendedreal-valuedfunctionS : ⌦ [ , ] P⇥ ! 1 1 such that S(P, )is -integrable for all P (Gneiting and Raftery, 2007). If the · P 2P decision is P and ! materializes, the decision maker’s reward or utility is S(P,!). An equivalent but more pessimistic terminology is loss function,whichisjustthe negative scoring rule. These two terms will be used interchangeably in this Chapter. If the outcome is probabilistic in nature and the actual probability distribution is Q,theexpectedscoreofforecastingP is
$$S(P, Q) = \int S(P, \omega) \, Q(d\omega).$$

To encourage honest decisions, we generally require the scoring rule S to be proper with respect to P, which is defined by
$$S(Q, Q) \geq S(P, Q), \qquad \forall\, P, Q \in \mathcal{P}. \qquad (6.1)$$
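Propriety (6.1) is easy to check numerically for a Bernoulli sample space. The sketch below, my own illustration with an arbitrary grid, verifies that the expected logarithmic score is maximized by reporting the true success probability.

```python
import numpy as np

def expected_log_score(p, q):
    """Expected logarithmic score S(P, Q) on a Bernoulli sample space:
    the forecast is Bernoulli(p), the true distribution is Bernoulli(q)."""
    return q * np.log(p) + (1 - q) * np.log(1 - p)

# Propriety (6.1): the truth q maximizes the expected score over forecasts p.
q = 0.3
grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmax(expected_log_score(grid, q))]
# `best` should coincide with q, reflecting strict propriety of the log score.
```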
The rule is called strictly proper with respect to P if (6.1) holds with equality if and only if P = Q. In estimation problems, strictly proper scoring rules provide appealing loss functions that can be tailored according to the scientific problem.

In observational studies, the sample space is commonly dichotomous, Ω = {0, 1} (two treatment groups: 0 for control and 1 for treated), though there is no essential difficulty in extending the approach in this Chapter to |Ω| > 2 (multiple treatments) or Ω ⊂ R (continuous treatment). In the binary case, Savage (1971) showed that if S(·, 0) and S(·, 1) are real-valued except for possibly S(0, 1) = −∞ or S(1, 0) = −∞, every proper scoring rule S can be characterized by
$$S(p, 1) = G(p) + (1 - p) G'(p) = \int (1 - p) \, G''(p) \, dp,$$
$$S(p, 0) = G(p) - p \, G'(p) = -\int p \, G''(p) \, dp,$$

where G : [0, 1] → R is a convex function and G'(p) is a subgradient of G at the point p ∈ [0, 1]. When G is second-order differentiable, an equivalent but useful representation is

$$\frac{\partial}{\partial p} S(p, t) = (t - p) \, G''(p), \qquad t = 0, 1. \qquad (6.2)$$

Since the function G defines an equivalence class of scoring rules, we shall also call G a scoring rule. A useful class of proper scoring rules is the following Beta family
$$G''_{\alpha, \beta}(p) = p^{\alpha - 1} (1 - p)^{\beta - 1}, \qquad -\infty < \alpha, \beta < \infty. \qquad (6.3)$$
Notice that unlike the Beta family of distributions, the parameters here can be negative. These scoring rules were first introduced by Buja et al. (2005) to approximate the weighted misclassification loss by taking the limit α, β → ∞ and α/β → c. For example, if c = 1, the score G_{α,β} converges to the zero-one misclassification loss. Many important scoring rules belong to this family. For example, the Bernoulli log-likelihood function or the logarithmic score S(p, t) = t log p + (1 − t) log(1 − p) corresponds to α = β = 0, and the Brier score (or equivalently the squared error loss when flipping the sign) S(p, t) = −(t − p)² corresponds to α = β = 1. For our purpose of estimating the propensity score, it will be shown later that the subfamily −1 ≤ α, β ≤ 0 is especially useful.
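The link between the Beta family (6.3) and these familiar scores can be checked through the Savage representation (6.2). The sketch below is my own and compares derivatives only, since scoring rules are defined up to positive affine equivalence.

```python
def beta_score_derivative(p, t, alpha, beta):
    """dS(p, t)/dp under the Savage representation (6.2) with the
    Beta-family choice G''(p) = p^(alpha - 1) * (1 - p)^(beta - 1)."""
    return (t - p) * p ** (alpha - 1) * (1 - p) ** (beta - 1)

def log_score_derivative(p, t):
    """Derivative in p of the logarithmic score t*log(p) + (1-t)*log(1-p)."""
    return t / p - (1 - t) / (1 - p)

# alpha = beta = 0 recovers the logarithmic score exactly, while
# alpha = beta = 1 gives (t - p), i.e. the Brier score -(t - p)^2 up to
# the positive factor 2 (an affine transformation, hence the same rule).
```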
6.2.2 Propensity score estimation by maximizing score
Given i.i.d. observations (X_i, T_i) ∈ R^d × {0, 1}, i = 1, 2, ..., n, where T_i is the binary treatment assignment and X_i is a vector of d pre-treatment covariates, the goal is to fit a model for the propensity score p(X) = P(T = 1 | X). Suppose we are willing to use a parametric model that belongs to the family P = {p_θ(X) : θ ∈ Θ}. Given a strictly proper scoring rule S, the goodness of fit of θ can be measured by the average score
$$\mathcal{S}_n(\theta) = \frac{1}{n} \sum_{i=1}^n S(p_\theta(X_i), T_i).$$

The optimum score estimator is obtained by the unique maximizer of the average score:
$$\hat{\theta}_n = \arg\max_{\theta} \ \mathcal{S}_n(\theta). \qquad (6.4)$$

Notice that the affine transformation S(p, t) ↦ aS(p, t) + b(t) for any a > 0 and