A Computational Approach to Identification and Comparison of Cell Subsets in Flow Cytometry Data
A COMPUTATIONAL APPROACH TO IDENTIFICATION AND COMPARISON OF CELL SUBSETS IN FLOW CYTOMETRY DATA

A DISSERTATION SUBMITTED TO THE PROGRAM IN BIOMEDICAL INFORMATICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Noah Zimmerman
August 2011

© 2011 by Noah Zimmerman. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/hg137hq6178

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Amarendra Das, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Guenther Walther, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Leonore Herzenberg

Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Changes in frequency and/or biomarker expression in small subsets of peripheral blood cells provide key diagnostics for disease presence, status and prognosis. At present, flow cytometry instruments that measure the joint expression of up to 20 markers in/on large numbers of individual cells are used to measure surface and internal marker expression.
This technology is routinely used to determine the frequencies of various marker-defined cell subsets in patient samples and is often used to inform therapeutic decision-making. Nevertheless, quantitative methods for comparing data between samples are sorely lacking. There are no reliable computational methods for determining the magnitude of differences among samples from different patients, among samples obtained from the same patient on different days, or between aliquots of the same sample measured before and after response to stimulation or other treatment. This thesis describes novel computational methods that provide reliable indices of change in subset representation and/or marker expression by individual subsets of cells. These methods use a non-parametric clustering algorithm, Density-Based Merging (DBM), that we developed to identify subsets (clusters) of cells that express a common set of markers measured independently for each cell by flow cytometry. To quantitate differences between these subsets, we introduce the application of Earth Mover's Distance (EMD), an algorithm borrowed from the image retrieval literature for comparing multivariate distributions. The resultant methods are highly sensitive and reliable for identifying small marker expression differences between subsets of cells in flow cytometry data sets. We show that these methods are easily applied and readily interpreted. Importantly, we demonstrate their practical utility with data from an allergy study in which the expression of two markers on very rare blood cells (basophils) in response to stimulation with an offending allergen indicates whether the patient is allergic to the stimulating antigen. In addition, we have developed novel evaluation criteria for assessing the performance of clustering algorithms on flow cytometry data by combining mixtures of cells identifiable by dimensions "hidden" from the algorithm that provide true cluster membership.
Thus, we expect that the methods described here will introduce a new approach to using flow cytometry to measure biomarker changes as indices of drug response, disease susceptibility, disease progress and prognosis.

Acknowledgments

They say it takes a village to raise a child. Apparently the same is true for raising an interdisciplinary graduate student. First and foremost, I offer my wholehearted thanks to my committee: Amar Das, Guenther Walther and Lee Herzenberg. Without your support, mentorship and patience this work would not have been possible. I am extremely grateful for the time and effort that you have invested in this project and feel fortunate to have been advised by such intellectual giants.

To the amazing staff of the Biomedical Informatics (BMI) program at Stanford, the foundation of a strong community that is nurturing and supportive of its students. Mary Jeanne Oliva and Christine Hilliard for keeping me on track with program milestones, Larry Fagan and Betty Cheng for feedback on talks and slides, Mark Musen for serving on my Qualifying exam committee, Garry Nolan for serving on my Oral Defense committee and Carol Maxwell for making the Medical School Office Building a bright and friendly environment.

This year we lost one of the cornerstones of the BMI program, Darlene Vian. Darlene's commitment to the students was unparalleled and we miss her dearly. BMI will not be the same without Darlene; I feel fortunate for the time we spent together.

To my BMI friends, who have been a great resource both academically and personally. My sage predecessors Kaustubh Supekar, Nikesh Kotecha, Yael Garten and Will Bridewell, my partners in crime Alex Morgan, David Chen, Sarah Aerni, Marina Sirota and Guanglei Xiong, and the next batch of BMI virtuosos Robert Bruggner, Nicholas Tatonetti, Tiffany Chen, Konrad Karczewski, Saeed Hassanpour and Rachel Finck.
Thanks to my non-BMI friends, Mark Shervey, Savitri Glowe and Daniel Horowitz, for listening to me talk about research too!

If you can give a seminar in the Herzenberg Lab meeting, you can give a seminar anywhere. It is a magnet for open-minded, eccentric and passionate individuals. Thank you to everyone who has made it such a special place: Len and Lee Herzenberg, Yang Yang, Yael Gernez, Kondala Atkuri, Rabin Tirouvanziam, David Parks, Wayne Moore, Stephen Meehan, John Mantovani and Claudia Weber, my west coast mother. An extra special thanks to my close friend and professional colleague Eliver Ghosn, who taught me about macrophages and life with equal patience.

Some people are fortunate to have loving and supportive partners who make graduate school possible emotionally and financially. I am lucky enough to have one who does those things AND provides scientific feedback and critique on my slides! Thank you for everything, Veronica.

Thank you to my sister Joanna, for your love and support and for paving the way for Ph.D.s in the family. My brothers from another mother (and father) Jon and Dan, and my extended siblings Stephanie, Andrea and Rachel. And of course my niece Tova, who helps keep life in perspective.

Last, but certainly not least, I want to thank my Mom and Dad, who have always been sources of unconditional love and support.

Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Complexity and the immune system
  1.2 Unmet computational needs
2 Automated Gating of Cell Populations
  2.1 Abstract
  2.2 Introduction
  2.3 Background
    2.3.1 Manual gating
    2.3.2 Automated Gating
    2.3.3 Discussion
  2.4 Density-Based Merging
    2.4.1 Representing data on a grid
    2.4.2 Density estimation
    2.4.3 Constructing "uphill" association pointers
    2.4.4 Assignment to cluster or background
    2.4.5 Merging clusters
  2.5 Materials
    2.5.1 Synthetic data
    2.5.2 Neonatal mouse spleen and peritoneal cavity
    2.5.3 Nut allergy blood samples
  2.6 Results
    2.6.1 Evaluation using manual gating
    2.6.2 Validating differences between gating and clustering
    2.6.3 Comparison with mixture model on synthetic data
  2.7 Discussion
3 Quantifying Changes in Cell Subsets
  3.1 Abstract
  3.2 Introduction
  3.3 Background
    3.3.1 Non-parametric test statistics
    3.3.2 Distance metrics
    3.3.3 Discussion
  3.4 Earth Mover's Distance
    3.4.1 Overview
    3.4.2 Algorithm
  3.5 Materials
    3.5.1 Nut allergy human blood samples
    3.5.2 Spleens from 3 mouse strains
  3.6 Results
    3.6.1 Application to spleens from 3 mouse strains
    3.6.2 Application to human blood samples for nut allergy
    3.6.3 Performance evaluation
  3.7 Discussion
4 Evaluating Automated Gating Algorithms
  4.1 Introduction
  4.2 Background
    4.2.1 Evaluation using synthetic data
    4.2.2 Evaluation using manual gating
  4.3 Cluster Evaluation using Hidden Labels (CEHL)
    4.3.1 Experiment design
    4.3.2 Materials
    4.3.3 Evaluation criteria
  4.4 Results
    4.4.1 Classification accuracy
    4.4.2 Sensitivity
  4.5 Discussion
5 Conclusion and Future Directions
  5.1 Conclusion
  5.2 Future Directions
A Generating synthetic FC-like data
Bibliography

List of Tables

2.1 Results of Density Based Merging on synthetic data - combined F-measure: 0.96
2.2 Results of merging t-mixture components on synthetic data - t-mixture model clustering performed with k=5 components using the flowClust package with default parameter settings, then merged using the flowMerge package with outlier level=0.90. The merging results in 4 clusters, with a combined F-measure: 0.95
4.1 8-parameter assay used to identify 6 cell subsets used for cluster evaluation