Causal Inference Methods for Bias Correction in Data Analyses

CAUSAL INFERENCE METHODS FOR BIAS CORRECTION IN DATA ANALYSES by Razieh Nabi A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy Baltimore, Maryland March 2021 © 2021 Razieh Nabi All rights reserved Abstract Many problems in the empirical sciences and rational decision making require causal, rather than associative, reasoning. The field of causal inference is concerned with establishing and quantifying cause-effect relationships to inform interventions, even in the absence of direct experimentation or randomization. With the proliferation of massive datasets, it is crucial that we develop principled approaches to drawing actionable conclusions from imperfect information. Inferring valid causal conclusions is impeded by the fact that data are unstructured and filled with different sources of bias. The types of bias that we consider in this thesis include: confounding bias induced by common causes of observed exposures and outcomes, bias in estimation induced by high dimensional data and curse of dimensionality, discriminatory bias encoded in data that reflect historical patterns of discrimination and inequality, and missing data bias where instantiations of variables are systematically missing. The focus of this thesis is on the development of novel causal and statistical methodologies to better understand and resolve these pressing challenges. We draw on methodological insights from both machine learning/artificial intelligence and statistical theory. Specifically, we use ideas from graphical modeling to encode our assumptions about the underlying data generating mechanisms in a clear and succinct manner. Further, we use ideas from nonparametric and semiparametric theories to enable the use of flexible machine learning modes in the estimation of causal effects that are identified as functions of observed data. There are four main contributions to this thesis. First, we bridge the gap between ii identification and semiparametric estimation of causal effects that are identified in causal graphical models with unmeasured confounders. Second, we use semiparametric inference theory for marginal structural models to give the first general approach to causal sufficient dimension reduction of a high dimensional treatment. Third,we address conceptual, methodological, and practical gaps in assessing and overcoming disparities in automated decision making using causal inference and constrained optimization. Fourth, we use graphical representations of missing data mechanisms and provide a complete characterization of identification of the underlying joint distribution where some variables are systematically missing and others are unmeasured. iii Committee Members Dr. Ilya Shpitser (Primary Advisor) John C. Malone Assistant Professor Department of Computer Science Whiting School of Engineering Johns Hopkins University Dr. Daniel Scharfstein Professor of Biostatistics Department of Population Health Sciences School of Medicine University of Utah Dr. Eric Tchetgen Tchetgen Luddy Family President’s Distinguished Professor Statistics Department Wharton School of Business University of Pennsylvania Dr. Elizabeth Ogburn Associate Professor Department of Biostatistics Bloomberg School of Public Health Johns Hopkins University iv Preface Prior to working with Ilya, I had almost no exposure to the field of causal inference. My first encounter with “causal reasoning” was a cosmological argument in my pre-college theology courses, which I was not impressed by. Years later when I was doing my masters in statistics, the mantra of “correlation is not causation” got stuck in my head. The summer before applying to PhD programs, I visited my sister, Marzieh, in California and found the Causality book by Judea Pearl in her bookshelf. I started reading parts of it, and came across this quote: “I would rather discover one causal law than be King of Persia” – Democritus. Semi-seriously I thought to myself: maybe if I pursue a degree in causal inference, one day if I am presented with the throne to be the Queen of Persia, I can decline because at that point I might have learned many causal laws! Marzieh’s book now sits in my bookshelf. When I joined the CS program at Hopkins, I started working with Ilya on two separate projects. The first one was on a causal view of algorithmic fairness. There are many stories where AI algorithms demonstrate discriminatory, and potentially harmful, behaviors towards minorities. Initially, I relied on this work as a positive vehicle for addressing the discrimination I felt due to the restrictive immigration policies and the travel ban in the US. Over time, I found more purpose in my research as it seeks to raise awareness and improve the lives of underrepresented minorities. My research on algorithmic fairness (described in Chapter3) led to the development of a causal framework to interrogate and modify AI algorithms to not rely on sensitive attributes, like race or gender, in inappropriate ways. v The basis for my research on use of semiparametric theory in estimation of causal quantities originally stemmed from my passion for developing a method that establishes the cause-effect relations between the high dimensional treatment of radiation therapy and salivary dysfunctions. This was the second project I was working on in parallel with algorithmic fairness (described in Chapter2). This launched me into reading the book on Semiparametric Theory and Missing Data by Anastasios Tsiatis. In a few months, we (Ilya’s group) joined forces with folks at the Biostats department, and our discussions turned into a regular story time narrated by Dan Scharfstein. For over a year, every Friday we would gather in the library on the 3rd floor of the School of Public Health, and enjoy story time accompanied with coffee and donuts from Dunkin’. Receiving validation from senior researchers in the semiparametrics field like Dan boosted my confidence and I grew to enjoy it even more. On the other hand, a colleague of mine, Rohit, was not quite as impressed as I was about the theory. His main issue was lack of an automated procedure to derive influence functions and perform projections. Focusing on average causal effects, Rohit and I started thinking about an automated procedure to find influence functions for effects that are identified in causal graphical models with unmeasured confounders (described in part in Chapter2). Prior to this, Rohit and I worked on two papers on missing data identification (Chapter4) which stemmed from working on structure learning with missing data and getting stuck at the “wasteland” of non-identifiable laws. We paused the structure learning project and started thinking more carefully about the identifiability aspects of missing data models. If I am ever given a chance to go back in time and re-do my PhD, I would try harder to stick to the principles beautifully presented in this quote: “The important thing in life is not to conquer but to fight well and not to win but to take part.” – Pierre de Coubertin. vi To my lovely mom for being a role model of a brave and independent woman & To my late hardworking dad for teaching me to be ambitious and responsible vii Acknowledgments Acknowledgments are by far the hardest part of the dissertation to write. Feelings, emotions, and sensations cannot easily be summarized in a few pages. There are many people to thank. There are even people whose names or faces I do not know that have played a crucial role to place me where I am today; like the immigration officers who decided for me that I should not pursue a PhD degree in Aerospace Engineering at UW! Some may think that was unfortunate. I cannot agree or disagree with this sentiment as exploring all the counterfactual worlds that I could have ended up in is unattainable. What I can say with certainty however, is that I am happy and thankful to all the causes that brought me to Hopkins. At Hopkins, I started working with Ilya on multiple causal inference projects. I immediately fell in love with the field and here I am today concluding my PhD work on causal inference. Ilya’s passion for research and his support throughout these years has kept me swirling in this world of counterfactuals and I cannot thank him enough for this. Ilya has gathered a wonderful group of scholars in “House-of-Ayli.” It has been a pleasure growing up with them as a person and a researcher. I have learned a lot from each and every one of the “fellow-kids.” I thank Amir, Dan, Eli, Jaron, Noam, Numair, Ranjani, Rohit, and Zach for all the nice memories, collaborations, and friendships. I would like to thank Ilya, Dan S, Betsy, Eric, and Emre for writing strong letters of recommendations for me when I was on the job market, and it is a true honor to have Dan S, Betsy, and Eric in my thesis committee. I especially would like to viii thank Dan S for his gracious mentorship when I needed it the most during my job search and interviews, for planting the seed of passion for semiparametrics in me, and for narrating the semiparametric story time for over a year to the “merry band.” This brought me closer to Bonnie, Lamar, Youjin, and Ryan (who made coffee and donuts a story time tradition.) I learned a lot from them during our discussions and brainstorming sessions over the “clearly” stated claims in the “yellow book.” I would also like to thank Dr. Su for his extensive support when I was writing my first paper in grad school. I had the pleasure of having wonderful mentors like the late Dr. Joan Staniswalis, Dr. Ahmet Bulut, and Dr. Tarik Arici. This whole journey would have been a total wreck if it was not for friends to share with them the ups and downs, the laughter and tears along the way, to celebrate the successes and to get inspired in failures. My heart is filled with memories of my friends in undergrad in Tehran, the Taksim Square and Acibadem in Istanbul, the third and second floor of Malone (especially the time spent procrastinating and making coffee in the kitchen of Malone).

Load more