
AN ALGORITHMIC ENQUIRY CONCERNING CAUSALITY

by Samantha Kleinberg

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Computer Science
New York University
May, 2010

Bhubaneswar Mishra

© Samantha Kleinberg
All Rights Reserved, 2010

Have patience with everything unresolved in your heart and try to love the questions themselves as if they were locked rooms or books written in a very foreign language. Don’t search for the answers, which could not be given to you now, because you would not be able to live them. And the point is, to live everything. Live the questions now. Perhaps then, someday far in the future, you will gradually, without even noticing it, live your way into the answer.
Rainer Maria Rilke, translated by Stephen Mitchell

For Grandma Dorothy

ACKNOWLEDGEMENTS

I must begin by thanking my advisor, Bud Mishra, who has consistently encouraged me to question the status quo and follow my interdisciplinary intuitions. Bud trusted that I could teach myself logic and model checking, and that going all the way back to Hume in my causality reading was a necessary (though not sufficient) condition for understanding. I found myself in Bud’s lab as an undergraduate student primarily through luck, but I cannot imagine a better place to have grown as a scholar or another advisor who would have allowed me the freedom and independence I needed along with subtle guidance during periods of difficulty.

I have been fortunate to have a diverse committee who contributed deeply to this thesis in quite varied ways. Ernest Davis has read a multitude of drafts, always with an unbelievable attention to detail. I am grateful for his blunt honesty about my prose, and any unclear sections that remain are due only to me not taking his advice. I thank Petter Kolm for sharing his expertise in finance and working with me over the last few years as I honed my approach.
Rohit Parikh has been a thorough reader of this and other works. His depth and breadth of experience has made him invaluable for reminding me of the broader context of my work as I became hyperfocused. Michael Strevens saved me time and again from muddling metaphysics and epistemology and provided critical support and guidance as I delved deeper into philosophy and tried to reconcile it with computer science. I also thank Amir Pnueli, whose support during the early stages of this work gave me the confidence to proceed. More fundamentally, without Amir’s contributions to temporal logic, this thesis would not have been possible.

I thank Craig Benham, who gave me my first taste of bioinformatics, and Kevin Kelliher, who very patiently introduced me to programming. My deepest thanks to Sebastian Stoenescu who guided me toward the high school internship that fundamentally changed my path. I have always wondered if I would have found this field that I am so passionate about without his influence, and am relieved that I do not have to find out. I was lucky to have someone who knew my interests and abilities better than I did at such a critical moment in my life.

I thank all of the Bioinformatics Group members past and present. In particular I thank Marco Antoniotti, who introduced me to Common Lisp and changed the way I program.

I thank James, who has read every draft of everything I’ve written in the last six years.

Finally, I want to thank my mother, who instilled in me a passion for learning and a love of writing. I thank my father for continually reminding me of the importance of art and beauty, and contributing to better living through design. I thank Mark for his unwavering support and encouragement over the last twenty years.

ABSTRACT

In many domains we face the problem of determining the underlying causal structure from time-course observations of a system.
Whether we have neural spike trains in neuroscience, gene expression levels in systems biology, or stock price movements in finance, we want to determine why these systems behave the way they do. For this purpose we must assess which of the myriad possible causes are significant while aiming to do so with a feasible computational complexity. At the same time, there has been much work in philosophy on what it means for something to be a cause, but comparatively little attention has been paid to how we can identify these causes. Algorithmic approaches from computer science have provided the first steps in this direction, but fail to capture the complex, probabilistic and temporal nature of the relationships we seek.

This dissertation presents a novel approach to the inference of general (type-level) and singular (token-level) causes. The approach combines philosophical notions of causality with algorithmic approaches built on model checking and statistical techniques for false discovery rate control. By using a probabilistic computation tree logic to describe both cause and effect, we allow for complex relationships and explicit description of the time between cause and effect as well as the probability of this relationship being observed (e.g. “a and b until c, causing d in 10–20 time units”). Using these causal formulas and their associated probabilities, we develop a novel measure for the significance of a cause for its effect, thus allowing discovery of those that are statistically interesting, determined using the concepts of multiple hypothesis testing and false discovery control. We develop algorithms for testing these properties in time-series observations and for relating the inferred general relationships to token-level events (described as sequences of observations). Finally, we illustrate these ideas with example data from both neuroscience and finance, comparing the results to those found with other inference methods.
The results demonstrate that our approach achieves superior control of false discovery rates, due to its ability to appropriately represent and infer temporal information.

TABLE OF CONTENTS

Dedication
Acknowledgements
Abstract
List of Figures
List of Tables
List of Appendices
1 Introduction
  1.1 Overview of thesis
2 A brief review of causality
  2.1 Philosophical Foundations of Causality
  2.2 Modern Philosophical Approaches to Causality
  2.3 Probabilistic Causality
3 Current work in causal inference
  3.1 Causal Inference Algorithms
  3.2 Granger Causality
  3.3 Causality in Logic
  3.4 Experimental inference
4 Defining the object of enquiry
  4.1 Preliminaries
  4.2 A little bit of logic
  4.3 Types of causes and their representation
  4.4 Difficult cases
5 Inferring causality
  5.1 Testing prima facie causality
  5.2 Testing for significance
  5.3 Correctness and Complexity
  5.4 Other approaches
6 Token causality
  6.1 Introduction to token causality
  6.2 From types to tokens
  6.3 Whodunit?
  6.4 Difficult cases
7 Applications
  7.1 Neural spike trains
  7.2 Finance
8 Conclusions and future work
  8.1 Conclusions
  8.2 Future work
  8.3 Bibliographic Note
Appendices
Glossary
Index
Bibliography

LIST OF FIGURES

Figure 2.1 Forks as described by Reichenbach [108].
Figure 2.2 Illustration of Simpson’s paradox example.
Figure 3.1 Faithfulness example.
Figure 3.2 Screening off example.
Figure 3.3 Firing squad example.
Figure 3.4 Desert traveler example.
Figure 4.1 Example probabilistic structure.
Figure 4.2 Smoking, yellow fingers, and lung cancer.
Figure 4.3 Bob and Susie throwing rocks at a glass bottle.
Figure 4.4 Suicide example.
Figure 5.1 Example of a probabilistic structure that might be observed.
Figure 7.1 Comparison of results from various algorithms on synthetic MEA data.
Figure 7.2 Neural spike train example.
Figure 7.3 Close-up of the tail area of Figure 7.2.
Figure 7.4 Histogram of z-values computed from the set of ε_avg values for two tests, using our algorithm.
Figure 7.5 Test results for our inference algorithm on various sized subsets of the actual market data.
Figure 7.6 Relationships found in one year of actual market data.
Figure 7.7 Graph representing results from DBN algorithm on MEA data.
Figure 7.8 Neuronal pattern 1.
Figure 7.9 Neuronal pattern 2.
Figure 7.10 Neuronal pattern 3.
Figure 7.11 Neuronal pattern 4.
Figure 7.12 Neuronal pattern 5.
Figure A.1 Illustrations of CTL formulas.
Figure E.1 Token causality worked through example.

LIST OF TABLES

Table 1 Comparison of results for four algorithms on synthetic MEA data, with ours being AITIA.
Table 2 Summary of synthetic financial time series datasets.
Table 3 Comparison of results for two algorithms on synthetic financial data.

LIST OF APPENDICES

A A brief review of temporal logic & model checking
B A little bit of statistics
C Proofs
D Algorithms
E Examples

1 INTRODUCTION

If a man will begin with certainties he shall end in doubts, but if he will be content to begin with doubts he shall end in certainties.
Francis Bacon

The study of “why” is integral to every facet of science, research and even daily life. When we search for factors that are related to lung cancer or assess fault for a car accident, we seek to predict and explain phenomena or find out who or what is responsible for something.