A COMPUTATIONAL APPROACH TO IDENTIFICATION AND COMPARISON OF CELL SUBSETS IN FLOW CYTOMETRY DATA

A DISSERTATION SUBMITTED TO THE PROGRAM IN BIOMEDICAL INFORMATICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Noah Zimmerman August 2011

© 2011 by Noah Zimmerman. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/hg137hq6178

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Amarendra Das, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Guenther Walther, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Leonore Herzenberg

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Changes in frequency and/or biomarker expression in small subsets of peripheral blood cells provide key diagnostics for disease presence, status and prognosis. At present, flow cytometry instruments that measure the joint expression of up to 20 markers in/on large numbers of individual cells are used to measure surface and internal marker expression. This technology is routinely used to determine the frequencies of various marker-defined cell subsets in patient samples and is often used to inform therapeutic decision-making. Nevertheless, quantitative methods for comparing data between samples are sorely lacking. There are no reliable computational methods for determining the magnitude of differences among samples from different patients, among samples obtained from the same patient on different days, or between aliquots of the same sample measured before and after response to stimulation or other treatment. This thesis describes novel computational methods that provide reliable indices of change in subset representation and/or marker expression by individual subsets of cells. The methods we have developed utilize a non-parametric clustering algorithm, Density-Based Merging (DBM), that we developed to identify subsets (clusters) of cells that express a common set of markers measured independently for each cell by flow cytometry. To quantitate differences between these subsets, we introduce the application of Earth Mover's Distance (EMD), an algorithm borrowed from the image retrieval literature that compares multivariate distributions. The resultant methods are highly sensitive and reliable for identifying small marker expression differences between subsets of cells in flow cytometry data sets. We show that these methods are easily applied and readily interpreted. Importantly, we demonstrate their practical utility with data from an allergy study in which the expression of two markers on very rare blood cells (basophils) in response to stimulation with an offending allergen indicates whether the patient is allergic to the stimulating antigen. In addition, we have developed novel evaluation criteria for assessing the performance of clustering algorithms on flow cytometry data by combining mixtures of cells identifiable by dimensions “hidden” from the algorithm that provide true cluster membership. Thus, we expect that the methods described here will introduce a new approach to using flow cytometry to measure biomarker changes as indices of drug response, disease susceptibility, disease progress and prognosis.

Acknowledgments

They say it takes a village to raise a child. Apparently the same is true for raising an interdisciplinary graduate student. First and foremost, I offer my wholehearted thanks to my committee: Amar Das, Guenther Walther and Lee Herzenberg. Without your support, mentorship and patience this work would not have been possible. I am extremely grateful for the time and effort that you have invested in this project and feel fortunate to have been advised by such intellectual giants.

To the amazing staff of the Biomedical Informatics (BMI) program at Stanford, the foundation of a strong community that is nurturing and supportive of its students. Mary Jeanne Oliva and Christine Hilliard for keeping me on track with program milestones, Larry Fagan and Betty Cheng for feedback on talks and slides, Mark Musen for serving on my Qualifying exam committee, Garry Nolan for serving on my Oral Defense committee and Carol Maxwell for making the Medical School Office Building a bright and friendly environment.

This year we lost one of the cornerstones of the BMI program, Darlene Vian. Darlene’s commitment to the students was unparalleled and we miss her dearly. BMI will not be the same without Darlene – I feel fortunate for the time we spent together.

To my BMI friends, who have been a great resource both academically and personally. My sage predecessors Kaustubh Supekar, Nikesh Kotecha, Yael Garten and Will Bridewell, my partners in crime Alex Morgan, David Chen, Sarah Aerni, Marina Sirota and Guanglei Xiong, and the next batch of BMI virtuosos Robert Bruggner, Nicholas Tatonetti, Tiffany Chen, Konrad Karczewski, Saeed Hassanpour and Rachel Finck. Thanks to my non-BMI friends, Mark Shervey, Savitri Glowe and Daniel Horowitz, for listening to me talk about research too!

If you can give a seminar in the Herzenberg Lab meeting, you can give a seminar anywhere. It is a magnet for open-minded, eccentric and passionate individuals. Thank you to everyone that has made it such a special place: Len and Lee Herzenberg, Yang Yang, Yael Gernez, Kondala Atkuri, Rabin Tirouvanziam, David Parks, Wayne Moore, Stephen Meehan, John Mantovani and Claudia Weber, my west coast mother. An extra special thanks to my close friend and professional colleague Eliver Ghosn, who taught me about macrophages and life with equal patience.

Some people are fortunate to have loving and supportive partners that make graduate school possible emotionally and financially. I am lucky enough to have one that does those things AND provides scientific feedback and critique on my slides! Thank you for everything, Veronica.

Thank you to my sister Joanna, for your love and support and for paving the way for Ph.D.s in the family. My brothers from another mother (and father) Jon and Dan, and my extended siblings Stephanie, Andrea and Rachel. And of course my niece Tova who helps keep life in perspective. Last, but certainly not least, I want to thank my Mom and Dad, who have always been sources of unconditional love and support.

Contents

Abstract

Acknowledgments

1 Introduction
    1.1 Complexity and the immune system
    1.2 Unmet computational needs

2 Automated Gating of Cell Populations
    2.1 Abstract
    2.2 Introduction
    2.3 Background
        2.3.1 Manual gating
        2.3.2 Automated Gating
        2.3.3 Discussion
    2.4 Density-Based Merging
        2.4.1 Representing data on a grid
        2.4.2 Density estimation
        2.4.3 Constructing “uphill” association pointers
        2.4.4 Assignment to cluster or background
        2.4.5 Merging clusters
    2.5 Materials
        2.5.1 Synthetic data
        2.5.2 Neonatal mouse spleen and peritoneal cavity
        2.5.3 Nut allergy blood samples
    2.6 Results
        2.6.1 Evaluation using manual gating
        2.6.2 Validating differences between gating and clustering
        2.6.3 Comparison with mixture model on synthetic data
    2.7 Discussion

3 Quantifying Changes in Cell Subsets
    3.1 Abstract
    3.2 Introduction
    3.3 Background
        3.3.1 Non-parametric test statistics
        3.3.2 Distance metrics
        3.3.3 Discussion
    3.4 Earth Mover’s Distance
        3.4.1 Overview
        3.4.2 Algorithm
    3.5 Materials
        3.5.1 Nut allergy human blood samples
        3.5.2 Spleens from 3 mouse strains
    3.6 Results
        3.6.1 Application to spleens from 3 mouse strains
        3.6.2 Application to human blood samples for nut allergy
        3.6.3 Performance evaluation
    3.7 Discussion

4 Evaluating Automated Gating Algorithms
    4.1 Introduction
    4.2 Background
        4.2.1 Evaluation using synthetic data
        4.2.2 Evaluation using manual gating
    4.3 Cluster Evaluation using Hidden Labels (CEHL)
        4.3.1 Experiment design
        4.3.2 Materials
        4.3.3 Evaluation criteria
    4.4 Results
        4.4.1 Classification accuracy
        4.4.2 Sensitivity
    4.5 Discussion

5 Conclusion and Future Directions
    5.1 Conclusion
    5.2 Future Directions

A Generating synthetic FC-like data

Bibliography

List of Tables

2.1 Results of Density Based Merging on synthetic data - combined F-measure: 0.96

2.2 Results of merging t-mixture components on synthetic data - t-mixture model clustering performed with k=5 components using the flowClust package with default parameter settings, then merged using the flowMerge package with outlier level=0.90. The merging results in 4 clusters, with a combined F-measure: 0.95

4.1 8-parameter assay used to identify 6 cell subsets used for cluster evaluation

4.2 6 subsets and marker phenotype identified by assay in Table 4.1

List of Figures

2.1 Gating strategy used to identify immune cells present in naive mouse peritoneal cavity based on surface molecules [24]

2.2 A cartoon illustrating the steps of the Density Based Merging algorithm

2.3 2-dimensional synthetic ‘Flow Cytometry-like’ data containing 4 clusters and background (n=10000)

2.4 Synthetic data cluster statistics, purple is background noise

2.5 Comparison of manual and DBM gating in the scatter dimensions: singlet gates are shown as determined by the researcher (top) and DBM (bottom, colored plot frames) for neonatal mouse spleen cells. The subset is further gated using the researcher's live/dead gate and displayed in context of the next gating decision by the researcher. Note that the results of the DBM clustering are displayed with the same software that was used for the manual gating (FlowJo). In the bottom left plot, color is used to code the clusters found by DBM.

2.6 Differences in manual versus DBM gating in scatter dimensions – cells included by both gates (top), cells included in the manual gate and excluded by the DBM gate (middle), and cells included in the DBM gate and excluded by the manual gate (bottom) are displayed (column 1). Cells are live/dead gated as described in the text, and shown in the context of the next manual gating decision (column 2).

2.7 Comparison of manual and DBM gating for 3-step gating sequence: adult mouse spleen cells are analyzed using the researcher's manual gates (top plots) and the corresponding clusters identified by DBM (bottom plots with colored plot frames). Color is used to code the clusters found by DBM in the first three plots on bottom. Each of the manual/DBM gate pairs has < 4% difference in total number of cells. In this study, the researcher is interested in B cells (column 4).

2.8 One-way analysis of variance (ANOVA) on the number of clusters (top) and the number of events in the clusters (bottom) for each of three aliquots from each sample. No panel exhibits a significant difference among the within group (allergic, non-allergic) stimulations. Between groups, there is a difference in the number of basophils (p=0.0095) but no difference in the number of clusters (p=0.2).

2.9 CD123+++ population does not exhibit significant basophil activation compared to CD123+

2.10 Synthetic FC-like data in 2-dimensions showing true class labels and DBM cluster results.

2.11 Top row: t-mixture modeling using k=4 and k=5 components on synthetic data. Bottom row: t-mixture models from top row used as input to mixture model component merging algorithm, yielding k=3 and k=4 components.

3.1 Adaptive binning algorithm for summarizing non-parametric distributions. Illustrating n=1-8 recursions (2-256 bins). Note that in regions of high density there are more bins than in the sparsely populated regions.

3.2 As the smaller green population moves further from the center of the main (black) population, probability binning plateaus, while EMD continues to increase monotonically

3.3 2-dimensional DBM clustering on the patient samples. The basophil cluster was selected using a simple heuristic: the cluster with the highest median expression in the CD123 channel, and the lowest median expression in the dump channel. Arrows indicate the selected basophil cluster.

3.4 Mouse spleen cells from two samples of the same BALB/c (a,b), c57 and RAG mice. Values compute the EMD between the first BALB/c sample (control) and each of the other 3 samples

3.5 Expression of CD203c and CD63 on the surface of basophils for a single patient showing probability contours (top) and the resulting binning (bottom)

3.6 (a) The EMD fold change ratio of allergic patients versus non-allergic patients. Each point represents a single patient in the study. Allergic patients display a 6-fold increase in the EMD fold change ratio for activation markers only in basophils, compared to a 1-fold increase in non-allergic patients. (b) Receiver Operating Characteristic (ROC) curve over a range of EMD fold change values (0-6). This measures the frequency of true positives (patients with diagnosed allergies that we predict have allergies) and false positives (patients with NO diagnosed allergies, that we predict have allergies) given different values of EMD fold change.

3.7 Fold change calculated as the ratio of EMD between unstimulated sample and sample stimulated with offending allergen, normalized by the EMD between unstimulated sample and sample stimulated with non-offending allergen. Samples are grouped by allergic and non-allergic individuals, and metrics computed for: Earth Mover’s Distance, Probability Binning metric, and Mahalanobis distance.

3.8 (A) single human basophil sample binned according to adaptive binning (described in Section 3.6.3) (B) bin summaries using bin centroids and bin center of mass

3.9 Number of bins does not affect the ability to distinguish groups

3.10 Effect of number of bins on the running time. The x-axis is on a log scale.

3.11 Effect of sample size on EMD metric for stimulated basophils

3.12 Increasing sample size increases statistical significance, or why NOT to use p-values in place of distance metrics

4.1 Combining samples from a wild-type CB.17 mouse labeled with GFP and a RAG knockout mouse. 6 of the 8 subsets of cells identified with the 7 marker assay are present in both strains. The 2 remaining subsets are only found in the wild-type CB.17. The mixture can be clustered based on the 7 markers and later de-convolved based on the presence of GFP, which is only on the wild-type strain.

4.2 B-cell gate

4.3 CB.17 events randomly sampled in varying concentrations and mixed with RAG -/- base. 25,000 total events in each sample. CB.17 concentrations range from 0.1%-10% of the total number of events. Nine concentrations plotted, showing Ig κ/λ vs. CD19. The blue rectangle highlights the region where the B-cells appear with increasing frequency as the concentration of CB.17 increases.

4.4 Non-lymphocyte clusters contain a mixture of GFP+ and GFP− cells

4.5 DBM defined B-cell cluster with a range of settings for the outlier parameter, which determines the percentage of events that get classified as background. (a) Fewer outliers increase the cluster boundary, more outliers lead to conservative boundaries. (b) More conservative boundaries decrease recall

4.6 Precision and recall for DBM detection of B-lymphocyte cluster applied to 10 concentrations of CB.17 events in a RAG -/- background.

A.1 Synthetic FC-like data in 4-dimensions showing true class labels, generated using code in Appendix A

Chapter 1

Introduction

“Yet it was with those who had recovered from the disease that the sick and the dying found most compassion. These knew what it was from experience, and had now no fear for themselves; for the same man was never attacked twice – never at least fatally.”

Thucydides, The History of the Peloponnesian War, Chapter VII

1.1 Complexity and the immune system

In his description of the plague of Athens in the second year of the Peloponnesian War, Thucydides offers one of the earliest written accounts (430 BC) of the phenomenon of immunity, observing that survivors of the plague could act as caretakers for the sick because they would not contract the disease a second time. Two thousand years later, we would discover that Thucydides was reporting on immunological memory, one of the hallmark properties of the adaptive immune response, in which secondary exposure to a previously seen antigen results in a stronger, more rapid and more effective response to the pathogen [40].

Long-lived memory response is one of many complex mechanisms of cell-mediated immunity made possible by specialized subsets of cells capable of pathogen recognition and response. Additional mechanisms include multi-cellular signaling networks capable of responding to environmental stimuli [26], pathogen recognition of billions of unique molecular structures [36], and self-nonself recognition [38], a safeguard against autoimmunity that restricts effector response to nonself molecules. This functional diversity is made possible by a layered immune system, composed of hundreds of specialized subsets of cells capable of sensing and reacting to their environment. Characterizing these subsets and their role in the immune response is one of the central goals of modern immunology.

Cell subsets are characterized by intrinsic properties of the cell such as size, volume, shape and internal structure as well as extrinsic properties such as DNA content, enzyme activity, and the amount and type of various membrane and intracellular molecules [79]. For the purposes of this dissertation, a cell subset is defined as a homogeneous group of cells that share some or all of these measurable properties. This definition of a subset is intentionally recursive to allow for new subsets to be defined from existing subsets as more measurable properties become available.

Understanding the role of these subsets provides insight into the functioning of the immune system and the underlying mechanism of disease in a disrupted system. For instance, Tirouvanziam et al. [84] recently demonstrated that profound functional and signaling changes occur in neutrophils in the airways, but not in the blood, of patients with Cystic Fibrosis (CF). Neutrophils are an abundant subset of white blood cells critical to CF pathogenesis. The most common cause of mortality among CF patients is airway disease, owing to the massive recruitment of dysfunctional blood neutrophils to the lungs. Neutrophils in the CF airway differ from their blood counterparts based on increased expression of certain cell surface proteins (CD11b, CD66b, CD63, CD80, CD294, and MHC type II) and decreased expression of other markers (CD14, CD16).
This small neutrophil population demonstrates extensive functional and signaling changes that contribute to CF pathogenesis, including decreased lung function and inability to clear bacteria from the lung. This small dysfunctional subset of neutrophils is a major contributor to CF disease progression and is currently under investigation as a therapeutic target.

A second example illustrating the impact of small subsets of cells on clinical outcomes comes from Gernez et al. [23], who demonstrated that a molecule appearing on the surface of activated basophils (CD63) can be used as an ex vivo diagnostic for nut allergy. Food allergies present with a range of symptoms, from acute and potentially fatal reactions to chronic disease affecting the skin and gastrointestinal tract [13]. While our mechanistic understanding of the basis of food allergy has progressed over the past two decades, techniques for diagnosing and treating the disease have remained largely unchanged [81]. This finding offers a new approach to food allergy diagnosis based on the activity of a rare, well characterized subset of immune cells. In Chapter 3, we discuss this example in more detail and demonstrate how additional markers can improve the sensitivity of the test.

These two examples are meant to illustrate the very basic notion that subsets of cells play a critical role in immune function and clinical outcomes. In each case it would be impossible to characterize the role of these subsets using bulk cell count measurements, which include irrelevant cell populations and reduce the ability to detect cells of interest [14]. Instead, it is cells with very specific phenotype and function that direct the response. These examples highlight the need for advanced technologies and accompanying analytic methods capable of jointly measuring large numbers of cellular markers to characterize the role of the functionally distinct cell subsets that comprise the immune system.

The technology that enables us to rapidly measure properties of large numbers of cells is the Fluorescence-Activated Cell Sorter (FACS). Developed in the late 1960s, the first FACS instrument was built in 1971 by Dick Sweet, Mack Fulwyler, Marvin Van Dilla and Leonard Herzenberg at Stanford University. It presented an alternative to fluorescence microscopy that provided a high-speed, accurate method for cell separation [30]. FACS takes advantage of differences in the physical properties of the cells, as well as properties of the surface membranes and molecules inside the cell, to separate closely related yet functionally distinct cell types [31].

The following is a brief description of the anatomy of a FACS experiment (a term I will use interchangeably with Flow Cytometry), the purpose of which is to lay the groundwork for the computational methods that represent the bulk of this dissertation. For the canonical treatment of Flow Cytometry (FC), including the full anatomy of an FC experiment, consult Practical Flow Cytometry by Howard Shapiro [79].

Anatomy of a FACS Experiment

Animal cells are harvested by breaking down an organ to obtain a suspension of viable dissociated cells in medium. For many tissues, physical dissociation can be done by simply crushing the tissue or spinning it in a centrifuge. For other tissues, enzymes or chemicals are required to break down the intercellular connections.

The cells are prepared by labeling with fluorochrome-conjugated monoclonal antibodies. The specificity of antibody binding ensures that bound antibody marks the presence of the target molecule in/on the cell. Individual cells suspended in a droplet of sheath fluid pass single-file through the flow cytometer where they are struck by a light source. The light source excites the antibody-coupled fluorescent molecules bound to the cell, causing those molecules (and any auto-fluorescent matter in the cell) to emit light that is measured by a photomultiplier tube (PMT). For each cell in the sample, a measurement of the light scattering and fluorescence properties is collected. As the droplet passes through the light source, the optical signals are compared with predefined criteria and the droplet containing the observed cell is deflected into the appropriate reservoir. In the case of cell sorting, the droplets of fluid containing cells are electrostatically charged and deflected to different collection reservoirs based on the fluorescent and optical patterns measured by the instrument.

The pattern of measurements for a single cell is used to define its subset. Identifying finer subsets of cells requires multiple markers to differentiate cell type and function [60]. In a modern FACS experiment, we may utilize 8 parameters to identify cell lineage, and another 6 to assess functional capabilities (cytokine, phosphorylation, apoptosis). Presently, up to 18 fluorescent parameters and 2 light scatter parameters can be measured simultaneously on a single cell [15].
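The sort decision described above, in which each cell's measurements are compared with predefined criteria and the droplet is routed to a reservoir, can be sketched in a few lines of Python. This is only an illustration of the logic, not a description of how any instrument implements it: the names `RectangleGate` and `sort_event`, the channel names `FL1` and `FL2`, the thresholds and the reservoir labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RectangleGate:
    """Hypothetical sort criterion: an axis-aligned box on two measured parameters."""
    x_param: str
    y_param: str
    x_range: tuple  # (low, high), inclusive
    y_range: tuple  # (low, high), inclusive
    reservoir: str

    def contains(self, event: dict) -> bool:
        x, y = event[self.x_param], event[self.y_param]
        return (self.x_range[0] <= x <= self.x_range[1]
                and self.y_range[0] <= y <= self.y_range[1])


def sort_event(event, gates, default="waste"):
    """Return the reservoir of the first gate containing the event, else the default."""
    for gate in gates:
        if gate.contains(event):
            return gate.reservoir
    return default


# Illustrative values only: channel names and thresholds are made up.
gates = [
    RectangleGate("FL1", "FL2", (1000, 100000), (0, 200), "left"),
    RectangleGate("FL1", "FL2", (0, 200), (1000, 100000), "right"),
]
events = [
    {"FSC": 52000, "SSC": 18000, "FL1": 4200, "FL2": 50},
    {"FSC": 48000, "SSC": 15000, "FL1": 90, "FL2": 7600},
    {"FSC": 9000, "SSC": 3000, "FL1": 30, "FL2": 40},
]
decisions = [sort_event(e, gates) for e in events]
print(decisions)
```

Each event is a vector of scatter and fluorescence measurements for one cell. The gate list plays the role of the predefined criteria, and events matching no gate go to a default reservoir.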

Characteristics of FACS data

• Many observations (10^4 to 10^6)
• Comprised of many discrete cell subsets
• Subsets have irregular distributions
• Rare subsets (<1%) are of interest
• Contain outliers (cell debris, dead cells, doublets) and noise
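To make these characteristics concrete, the following is a minimal sketch of a toy two-marker sample that has several of them: many events, a few discrete subsets, one subset below 1% of events, and a uniform scatter of outliers. It is not the synthetic data generator of Appendix A. The function name `simulate_fc_like`, the subset labels, and every fraction, centre and spread are assumptions chosen for illustration. The Gaussian subsets used here do not reproduce the irregular shapes seen in real data.

```python
import random
from collections import Counter


def simulate_fc_like(n_events=100_000, seed=0):
    """Toy two-marker sample; returns a shuffled list of (x, y, label) tuples."""
    rng = random.Random(seed)
    # (label, fraction of events, centre, spread); all values are invented.
    subsets = [
        ("major_a", 0.600, (2.0, 2.0), 0.30),
        ("major_b", 0.370, (4.0, 1.5), 0.40),
        ("rare",    0.005, (4.5, 4.5), 0.15),  # rare subset, below 1% of events
    ]
    outlier_fraction = 0.025  # stands in for debris, dead cells and doublets
    events = []
    for label, fraction, (cx, cy), sd in subsets:
        for _ in range(round(fraction * n_events)):
            events.append((rng.gauss(cx, sd), rng.gauss(cy, sd), label))
    for _ in range(round(outlier_fraction * n_events)):
        events.append((rng.uniform(0, 6), rng.uniform(0, 6), "outlier"))
    rng.shuffle(events)
    return events


events = simulate_fc_like()
counts = Counter(label for _, _, label in events)
rare_fraction = counts["rare"] / len(events)
print(len(events), dict(counts), rare_fraction)
```

In a real sample the labels are of course unknown. They are kept here only so that the rare subset's share of the events can be checked directly.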

1.2 Unmet computational needs

In keeping with the increasing centrality of flow cytometry (FC) in basic research and clinical medicine, the number and data acquisition power of flow cytometry instruments have expanded greatly in the last decade. On the horizon, a next generation mass spectrometry flow cytometer aims to measure up to 100 parameters simultaneously on a single cell [8]. But while flow instrumentation has improved markedly to meet these needs, the development of automated tools for processing and analyzing flow data has lagged sorely behind, thus impacting data quality and restricting the use of advanced technology to laboratories that have the time and expertise to properly analyze the data. FC data processing and analysis still requires substantial time and expertise, and still requires users to perform repetitive tasks and make arbitrary judgments about data quality. Existing methods provide only minimal help for extracting important features from the data (e.g., locating subsets) and for comparing results from several analyses. Thus, even experienced users are commonly unable to exploit the full capabilities of the highly advanced flow instruments that are available at most modern research and medical centers.

The recent advances in FC instrumentation have far surpassed the ability to manually investigate and interpret the resulting high-dimensional data. Reliable gating of FC data, i.e. the sequential identification of cell populations that are homogeneous with respect to marker expression, is a central step in FACS data analysis. Gating has traditionally been performed manually (using a mouse on the computer screen), with a researcher sequentially displaying FACS data from pairs of parameters for which data was taken in a multi-parameter assay and drawing a gate around a population of interest. The operator selects the cells in the gate (or out of it) and initiates another round by displaying two additional measurements collected for the gated cells (a more detailed description of manual gating is presented in Section 2.3.1). This 2-dimensional gating procedure has to be repeated many times to arrive at a gating sequence for multivariate FACS data. As a consequence, manual gating is very time consuming and generates subjective results due to varying user choices about where the boundaries of gates should be placed. Differences in manual gating are one of the largest contributors to variability in flow cytometry experiments [52].

More importantly, sequential manual gating is typically unable to explore the information a modern high-dimensional (Hi-D) FC experiment provides. For example, if two dimensions are chosen manually for a 2-dimensional analysis to initiate analysis of 13-dimensional FC data, then there are 78 possible pairs of dimensions for this first analysis step alone. Furthermore, every subsequent analysis step offers a similar number of possible 2-dimensional views, so that the total number of possibilities (the product of the possibilities in each step) will quickly reach thousands even for an analysis that employs only a few steps. Even with lower dimensional data, the number of possibilities is overwhelming.

Clearly, no manually guided analysis will reasonably explore more than a tiny fraction of the available possibilities. Therefore, investigators typically follow well-beaten paths and potentially leave a wealth of information largely untapped. This is arguably the most serious shortcoming in the analysis of FACS data since there is a growing recognition of the immense utility and potential of Hi-D FACS data in the research and clinical setting. For example, DeRosa et al. make a strong empirical case for the use of Hi-D FACS data in studies identifying or quantitating aspects of T cell subsets. Using Hi-D FACS, these investigators were able to precisely define and characterize naive CD4 and CD8 T cells by phenotype, cytokine production and T-cell receptor diversity.
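The counting argument made earlier in this section, for the number of 2-dimensional views available in 13-dimensional data, can be checked with a short calculation. The sketch below assumes, as a simplification, that every gating step offers the full set of parameter pairs, so the number of distinct sequences of views is the per-step count raised to the number of steps. The helper name `view_sequences` is hypothetical.

```python
from math import comb


def view_sequences(n_params: int, n_steps: int) -> int:
    """Number of ordered sequences of 2-D views, assuming all pairs are available at each step."""
    return comb(n_params, 2) ** n_steps


first_step = comb(13, 2)          # unordered pairs of 13 parameters
two_steps = view_sequences(13, 2)
three_steps = view_sequences(13, 3)
print(first_step, two_steps, three_steps)  # 78 6084 474552
```

Under this assumption, two gating steps already give several thousand possible sequences, which matches the observation that the total quickly reaches thousands. If parameters used in one step were excluded from later steps, the counts would be smaller but would still grow multiplicatively.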
DeRosa et al. credit their achievement to the characteristics of the Hi-D FACS data, pointing out that such a high-dimensional analysis can actually simplify understanding of the objects under investigation. However, they also point out the need to develop “new software tools capable of analyzing, managing and reducing the large and complex data sets” [17].

This thesis presents work on the development of novel computational methods to address the challenges of identifying homogeneous cell subsets in a mixture, quantifying changes in the joint expression of two or more markers on cell subsets and evaluating automated methods of subset identification. The outline of this dissertation is as follows: we begin Chapter 2 with an introduction to the problem of subset identification in flow cytometry data and a review of recent work in the area. Special attention is given to the strengths and weaknesses of a class of parametric mixture-model proposals owing to their current popularity in the literature. We then describe our work on a non-parametric density-based clustering algorithm, Density-Based Merging (DBM), that is computationally efficient and has properties that make it well-suited to flow cytometry data [86]. We apply the DBM algorithm and a mixture model algorithm to a synthetic dataset with flow cytometry-like properties and compare the results. We then apply DBM to a mouse spleen dataset examining B-cells and a human blood sample dataset investigating basophils in allergic response, and show that DBM is able to recreate expert gating decisions and provide clinically relevant characterizations of cell subsets.

In Chapter 3, we explore ways to measure changes in marker expression in/on subsets of cells. Whereas previous work relied on tests of statistical significance to quantify change, we propose the use of distance metrics to measure change. Instead of asking the question, “Are these two things different?” we ask the related question “How different are these two things?” For this purpose we introduce the use of the Earth Mover’s Distance (EMD) and associated methodology for adapting the metric to FC data. EMD is a mathematical measure of the distance between two samples borrowed from the computer vision literature [71]. We apply EMD to mouse spleen data from 3 genetically modified strains of mice and quantify differences between cell subsets in those strains consistent with the underlying genetics.
In addition, we apply EMD to measure the activation of a rare immune cell subset and show that it reliably predicts nut allergy in a 22-patient study. Finally, we compare the performance of EMD to other proposed metrics on the human nut allergy data and show that EMD is superior at differentiating allergic from non-allergic response.

In Chapter 4 we revisit the evaluation of automated population identification algorithms presented in Chapter 2 and discuss the shortcomings of existing approaches. We propose a new approach to evaluating automated subset identification algorithms, leveraging improvements in instrument technology to generate multi-parameter datasets in which the true cluster labels are contained in the dataset. Using the newly designed dataset, we evaluate the classification accuracy of DBM and its sensitivity in detecting rare populations. The design of this experiment represents the first biologically realistic flow cytometry dataset created expressly for gating evaluation, enabling objective benchmarking of new algorithms.

In summary, we have developed novel computational methodologies to identify cell subsets in a mixture and measure differences between those subsets under different experimental conditions. In addition, we designed a new approach to evaluating our own and other algorithms for cell subset identification. It is our hope that this research will lay the methodological foundation for high-throughput immunology research by enabling accurate, rapid and reproducible characterization of cell subsets measured by flow cytometry.

Chapter 2

Automated Gating of Cell Populations

2.1 Abstract

A current limit to the potential of flow cytometry is the lack of automated tools for analyzing the resulting data. We describe methodology and software to automatically identify cell populations in flow cytometry data. Our approach advances the paradigm of manually gating sequential two-dimensional projections of the data to a procedure that automatically produces gates based on statistical theory. Our approach is non-parametric and can reproduce non-convex subsets that are known to occur in flow cytometry samples, but which cannot be produced with current parametric model-based approaches. We illustrate the methodology with a synthetic dataset, samples from mouse spleen and peritoneal cavity cells and human blood samples, and compare with mixture model results.

9 CHAPTER 2. AUTOMATED GATING OF CELL POPULATIONS 10

2.2 Introduction

“An important prerequisite for studying the components of any biological system, be they the molecules in a cell or the cells in an organ, is the ability to isolate those components from one another so that they can be characterized and recombined under controlled conditions”

Leonard Herzenberg

The disparity between the advanced instrumentation and the limited capabilities of existing analytic tools has led to numerous efforts to develop methods for automated analysis and gating of FACS data. In Section 2.3, we sketch the landscape of modern approaches to automated gating of FACS data. Section 2.4 describes a novel non-parametric clustering algorithm called Density-Based Merging, developed specifically to address shortcomings associated with many clustering algorithms applied to FC data. We evaluate the results of the DBM algorithm against manually gated data in Section 2.6.1 and against a synthetically generated dataset with properties similar to FACS data in Section 2.6.3. Results of the DBM analysis on the synthetic data are compared to results from a state-of-the-art mixture model approach and we conclude with a discussion of the two approaches.

2.3 Background

2.3.1 Manual gating

The primary mechanism for analyzing FACS data is an interactive process referred to as gating. Gating is an exploratory data analysis technique intended to reveal features that are obscured or invisible in the raw data. The goal is to decompose a mixture of cells into its component subsets. Gating has traditionally been performed manually (with a mouse on the computer screen), with a researcher sequentially displaying FACS data from pairs of parameters for which data was taken in a multi-parameter assay and drawing a gate around a population of interest. Gates delineate regions in the 2-dimensional projection of the data containing cells with shared marker expression characteristics. The operator selects the cells in the gate (or out of it) and initiates another round by displaying two additional measurements collected for the gated cells. This 2-dimensional gating procedure has to be repeated many times to arrive at a gating sequence for multivariate FACS data, as shown in a sample gating sequence in Figure 2.1 used by Ghosn et al. to identify two macrophage subsets in the mouse spleen [24]. The result is a hierarchy of 2-dimensional gates (gating tree) that describes subsets of events in the multivariate distributions that share common characteristics.

Difficulties with manual gating

• Time consuming
• Requires expertise
• Difficult to reproduce across labs
• Limited to 2-dimensional projections of the data

Differences in manual gating decisions made by trained researchers are one of the largest contributors to variability in flow cytometry experiments [52]. In one study, pre-stained cells were distributed to 15 laboratories as part of a multisite standardization study. Experienced researchers were asked to acquire the samples and analyze the data. The coefficient of variation across the individual laboratory analyses ranged from 17% to 44%, depending on the sample type and analysis method [53]. This level of variability in identical samples is a limitation to more widespread adoption of the technology. The amount of time, the variability in results, and the limited exploration of the full information content of an experiment have resulted in a push towards replacing manual gating with gates defined algorithmically by the data, a process called automated gating.

2.3.2 Automated Gating

Limitations in manual gating gained the attention of computationally inclined flow cytometry researchers as early as 1985, when Robert Murphy applied the K-means clustering algorithm to FC data [57]. Since that time a number of statistical learning algorithms have been used in the analysis of FC data with varying degrees of success and limited generalizability [37][6][11]. These methods can be broadly categorized along the statistical areas of supervised and unsupervised learning. In this Section we highlight a few recent proposals and their application to FC data.

Figure 2.1: Gating strategy used to identify immune cells present in naive mouse peritoneal cavity based on surface molecules [24]

Supervised learning

In supervised learning, a set of predictor variables (features) is used to predict the values of one or more response (dependent) variables. Classical techniques include: (i) nearest-neighbor methods, in which the predictor and response variables are supplied for a set of training data and the response for an unseen observation is determined from the responses of the training observations in its local neighborhood, and (ii) linear models, in which the parameters of a linear model are fit to the predictors and responses of the training data (for instance, using least squares), and the response for a new observation is computed according to the model using the trained parameters [29]. In circumstances where training data are available, supervised learning approaches are very powerful.

Cytometric fingerprinting. One such supervised learning approach is cytometric fingerprinting, a recent proposal for representing and analyzing multidimensional FC data via multi-dimensional binning [69]. This “flattened” representation of the data is computationally efficient and amenable to quantitative comparison of samples using conventional classification algorithms. A template binning generated on a control sample is applied to a test sample to identify regions of density variation with respect to the template. These differences in bin density can be used as features for a Support Vector Machine (SVM) or other classifier to differentiate two types of biological samples. This approach was used to distinguish Acute Myeloid Leukemia samples from healthy controls with high sensitivity (90%) and specificity (99%).

Unsupervised learning

In unsupervised learning, the goal is to directly infer the probability density from a set of observations without the help of a “supervisor” providing correct answers for a set of training data [29]. It is the job of the algorithm to discover structure in the data.

K-Means

The first clustering algorithm in the literature to be applied to FC data [57], K-means is fast and easy to implement, and has shown good results in a number of applied fields. In FC, the application has been limited because the number of clusters has to be specified, it is sensitive to initialization parameters, and it is limited to modeling spherical cell populations. A recent proposal by [2] augments the K-means algorithm to address these issues by merging clusters identified by the K-means algorithm to create non-spherical clusters representative of cell populations. The approach is computationally efficient, but adds an additional decision point of when and how to merge clusters, and does not directly handle background noise.
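The two limitations named above (a fixed number of clusters and sensitivity to initialization) are visible in even a minimal implementation. The following is a sketch of the standard Lloyd iteration in plain Python, not the code used in [57] or [2]; the function name `kmeans`, the toy populations and the seeds are assumptions made for illustration.

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm on a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialization: k data points chosen at random
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, labels

# Two well-separated, roughly spherical toy "populations" (made-up parameters).
rng = random.Random(1)
pop_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
pop_b = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(100)]
centers, labels = kmeans(pop_a + pop_b, k=2)
```

Note that `k` must be supplied by the caller and that every point is forced into some cluster, so background events have nowhere to go.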

Frequency Difference Gating

Frequency difference gating identifies regions in the comparison of two samples in which the distributions are the most different and creates gates based on those regions that select the events, or cluster of events, that are different between samples. Frequency difference gating was applied to PBMC to identify differences between CD4 memory subsets. The primary application of this technique is to identify subtle changes in small populations of cells; it is not a general method for cluster analysis [65].

Mixture Models

A Gaussian mixture model is a probabilistic model for density estimation, with a natural application to clustering. We assume that the observed data are generated by a mixture of probability distributions. The goal is to estimate the parameters (mean, covariance) of those distributions that maximize the likelihood of the observations. The parameters are usually fit by maximum likelihood using the Expectation Maximization algorithm [29]. Mixture models have a number of desirable properties for use in FC cluster analysis: (i) straightforward interpretation: each component of the model describes a population with multivariate parameters; (ii) incorporation of uncertainty into the prediction; (iii) good separation of highly overlapping populations.

A number of groups have applied mixture models to cluster FC data, using increasingly complex distributions (multivariate t-distribution [50], skew t-distribution [63]) to fit the irregular shape of cell subset marker expression measured by FACS. Fitting FACS data using a mixture model provides good detection and discrimination of populations that adhere to the properties of the distribution being fit. However, when the shape of the population does not fit the assumptions underlying the chosen distribution, mixture models do not perform well. For instance, in the case of non-convex populations, mixture models tend to split the population into multiple clusters. The reason is that all the usual parametric component models (such as normals) have convex contours, so a single component cannot follow a non-convex population. In principle, one could fit a mixture of non-convex component distributions, but the usual parametric mixture fitting does not provide them. So, while mixture models have a number of desirable properties, they also have serious limitations, namely: (i) cell subsets measured by flow cytometry have irregularly shaped distributions that are poorly handled by parametric distributions; (ii) parameter estimation in high dimensions is computationally difficult.
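For reference, the Gaussian mixture density, the log-likelihood maximized by the fit, and the posterior membership probabilities computed in the E-step of Expectation Maximization can be written as follows. This is the textbook formulation rather than the specific models of [50] or [63]; the symbols $G$ (number of components), $\pi_g$ (mixing weights) and $\gamma_{ig}$ are notation assumed here and do not appear elsewhere in this chapter.

```latex
% Mixture density with G components (notation assumed for this sketch)
p(x) = \sum_{g=1}^{G} \pi_g \, \mathcal{N}(x \mid \mu_g, \Sigma_g),
\qquad \pi_g \ge 0, \quad \sum_{g=1}^{G} \pi_g = 1

% Log-likelihood of n observations
\ell(\theta) = \sum_{i=1}^{n} \log \sum_{g=1}^{G} \pi_g \, \mathcal{N}(x_i \mid \mu_g, \Sigma_g)

% E-step: posterior probability that observation i belongs to component g
\gamma_{ig} = \frac{\pi_g \, \mathcal{N}(x_i \mid \mu_g, \Sigma_g)}
                   {\sum_{g'=1}^{G} \pi_{g'} \, \mathcal{N}(x_i \mid \mu_{g'}, \Sigma_{g'})}
```

Each level set of a single normal component is an ellipsoid, which is why a curved population is typically covered by several components rather than one.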

2.3.3 Discussion

When the goal of an analysis is to detect the presence or absence of a single type of cell, and the same analysis must be performed repeatedly, there are several supervised learning approaches that have displayed promising results. However, supervised classifiers are highly specific to a particular application and cannot be generally applied to the problem of exploratory gating, which by its nature seeks to characterize unknown cell subsets. The more general process of gating is not a well-formulated supervised learning task: it lacks training data and fixed class labels representative of all cell populations. Thus, unsupervised approaches represent the most active area of research. No approach to date has adequately handled the characteristics of FACS data described in Section 1.1. In the following Section we describe a density-based clustering algorithm with good performance in automated gating of FACS data.

2.4 Density-Based Merging

Here we introduce a novel non-parametric method for clustering multivariate data where the density of clusters is not uniform. Our algorithm, which we have named Density-Based Merging (DBM), is based on the notion that clusters of data can be delineated by the contours of high-density regions [28]; this is the same rationale that underlies manual gating. The key properties of DBM that make it useful for automated gating in FC are:

• Can detect clusters with irregular shape (e.g. non-convex)
• Number of clusters determined automatically from the data
• Does not require training data
• Accounts for background/unclassified events

Sections 2.4.1–2.4.5 provide a formal description of the algorithm, which is divided into 4 steps: (1) represent the distribution on a grid; (2) estimate the density at each grid point; (3) construct association pointers between grid points that follow the gradient ascent; (4) merge clusters not separated by a statistically significant trough. The following description outlines the 2-dimensional case for clarity, but the algorithm is readily generalized to >2 dimensions.

2.4.1 Representing data on a grid

As noted in Section 1.1, FC data can be collected for a large number of events. It is typical to collect between $10^5$ and $10^6$ cells depending on the frequency of the cell population of interest for the experiment. With sample sizes in that range, efficiency is an important analytic consideration. In lieu of operating directly on the $10^6$ data points, we bin the data on a grid ($\sim 10^4$ points), which allows faster processing with little loss in accuracy [87].

Given $n$ data points $x_i = (x_{i,1}, x_{i,2})$, $i = 1, \ldots, n$, we choose a positive integer $M$, typically $M = 128$ or $256$, and construct a grid consisting of $M^2$ points as follows. Set

$$\Delta_j = \frac{\max_i x_{i,j} - \min_i x_{i,j}}{M - 1} \quad \text{for } j = 1, 2 \tag{2.1}$$

and define the $j$th coordinate of $y_{(m_1, m_2)}$ to be $y_{m_j} = \min_i x_{i,j} + (m_j - 1) \times \Delta_j$ for $m_j = 1, \ldots, M$. The grid is defined as $\{y_{(m_1, m_2)} : (m_1, m_2) \in \{1, \ldots, M\}^2\}$. Next, each grid point $y_m$, where $m = (m_1, m_2) \in \{1, \ldots, M\}^2$, is assigned a weight $w_m$ by linearly binning the observations $x_i$ such that

$$w_m = \sum_{i=1}^{n} \prod_{j=1}^{2} \max\left(0,\; 1 - \frac{|x_{i,j} - y_{m_j}|}{\Delta_j}\right) \tag{2.2}$$

The grid $\{y_m, m \in \{1, \ldots, M\}^2\}$ and the associated weights $\{w_m, m \in \{1, \ldots, M\}^2\}$ represent an approximation of the observed data. It has been shown that linear binning leads to substantial accuracy improvements over simple binning [87]. The value one chooses for $M$ is a tradeoff between efficiency and accuracy; a larger choice of $M$ results in a finer grid that more accurately approximates the data at the expense of more computations. In our testing (and in accordance with [87] and Section 3.6.3), we found that a relatively small number of bins ($M = 128$) yields a very good approximation.
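Equations 2.1 and 2.2 can be sketched in a few lines of Python. This is an illustrative sketch and not the Java implementation described in Section 2.6; the function name `linear_bin` and the three toy points are assumptions, and indices are 0-based here whereas the text uses 1-based indices. Because the weight in equation 2.2 vanishes for grid points more than one spacing away, each observation only contributes to the four grid points surrounding it.

```python
import math

def linear_bin(xs, M):
    """Linearly bin 2-d points onto an M x M grid (eqs. 2.1 and 2.2).

    Returns (weights, mins, deltas), where weights[m1][m2] is w_m.
    Assumes the data are not constant along either dimension.
    """
    mins = [min(p[j] for p in xs) for j in range(2)]
    maxs = [max(p[j] for p in xs) for j in range(2)]
    deltas = [(maxs[j] - mins[j]) / (M - 1) for j in range(2)]  # eq. 2.1
    w = [[0.0] * M for _ in range(M)]
    for p in xs:
        # Fractional grid position and the lower grid index of the enclosing cell.
        pos = [(p[j] - mins[j]) / deltas[j] for j in range(2)]
        lo = [min(int(math.floor(pos[j])), M - 2) for j in range(2)]
        frac = [pos[j] - lo[j] for j in range(2)]
        # Only the 4 surrounding grid points receive non-zero weight in eq. 2.2.
        for d1 in (0, 1):
            for d2 in (0, 1):
                w1 = frac[0] if d1 else 1.0 - frac[0]
                w2 = frac[1] if d2 else 1.0 - frac[1]
                w[lo[0] + d1][lo[1] + d2] += w1 * w2
    return w, mins, deltas

# Toy example: three points on a 3 x 3 grid.
weights, mins, deltas = linear_bin([(0.0, 0.0), (1.0, 1.0), (0.5, 0.25)], M=3)
```

The weights of each observation sum to one, so the total weight on the grid equals $n$.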

2.4.2 Density estimation

Having created a grid $\{y_m, m \in \{1, \ldots, M\}^2\}$ and the associated weights $\{w_m, m \in \{1, \ldots, M\}^2\}$, we next estimate the density surface $\hat{f}(y_m)$. Denote by $\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$ the Gaussian kernel. Then the estimated density at $y_m$ is given by:

$$\hat{f}(y_m) = \frac{1}{n} \sum_{l_1 = -Z_1}^{Z_1} \sum_{l_2 = -Z_2}^{Z_2} w_{m-l} \times \prod_{j=1}^{2} \frac{\phi\left(l_j \Delta_j / h_j\right)}{h_j} \tag{2.3}$$

where $l = (l_1, l_2)$, $Z_j = \min(\lfloor 4h_j/\Delta_j \rfloor, M - 1)$, $h_j = \mathrm{SD}(\{x_{i,j}, i = 1, \ldots, n\})\, n^{-1/6}$, and SD denotes standard deviation. Equation 2.3 can be computed efficiently with the Fast Fourier Transform (FFT) in a well-known way [87].

Figure 2.2: A cartoon illustrating the steps of the Density-Based Merging algorithm (panels a–g; the later panels show the density surface with clusters A–E and the background).

For each grid point we have computed the density estimate according to equation 2.3; we now compute the corresponding standard error of the density estimate to test whether the density is significantly different from zero. For each grid point $y_m$ in $\{y_m, m \in \{1, \ldots, M\}^2\}$

$$\hat{\sigma}_m^2 = \frac{1}{n(n-1)} \sum_{l_1 = -Z_1}^{Z_1} \sum_{l_2 = -Z_2}^{Z_2} w_{m-l} \times \prod_{j=1}^{2} \frac{\phi^2\left(l_j \Delta_j / h_j\right)}{h_j^2} - \frac{1}{n-1} \hat{f}(y_m)^2 \tag{2.4}$$

$\hat{\sigma}_m^2$ is an estimate of the variance of the estimated density, so that $\sqrt{\hat{\sigma}_m^2}$ estimates its standard error. The grid points where the density estimate is significantly different from zero form the set

$$S = \left\{ m \in \{1, \ldots, M\}^2 : \hat{f}(y_m) > 4.3 \times \sqrt{\hat{\sigma}_m^2} \right\} \tag{2.5}$$

The factor 4.3 is a correction for multiple testing over the grid, obtained by calculations as in [25]. Thus $S$ is the set of grid points where the density is significantly different from zero. Grid points not in $S$ are marked as background: from each grid point $y_m$ with $m \notin S$, a pointer is created from $y_m$ to a dummy state that represents background noise. Like the density estimate (equation 2.3), equation 2.4 can be efficiently computed with an FFT.
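The density estimate, its variance estimate and the significance test can be sketched by direct summation over the kernel window. This is an illustrative sketch only: it evaluates the sums naively rather than with the FFT mentioned in the text, the function name `density_on_grid` is an assumption, and the bandwidths `h` are passed in rather than derived from the data.

```python
import math

def phi(x):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def density_on_grid(w, deltas, h, n):
    """Direct (non-FFT) evaluation of eqs. 2.3-2.5 on an M x M weight grid.

    w[m1][m2] are the binned weights, deltas the grid spacings, h the
    bandwidths and n the number of observations.
    Returns (f_hat, var_hat, significant).
    """
    M = len(w)
    Z = [min(int(math.floor(4.0 * h[j] / deltas[j])), M - 1) for j in range(2)]
    # 1-d kernel factors phi(l * delta / h) / h for every offset l in the window.
    k = [{l: phi(l * deltas[j] / h[j]) / h[j] for l in range(-Z[j], Z[j] + 1)}
         for j in range(2)]
    f_hat = [[0.0] * M for _ in range(M)]
    var_hat = [[0.0] * M for _ in range(M)]
    significant = [[False] * M for _ in range(M)]
    for m1 in range(M):
        for m2 in range(M):
            s1 = s2 = 0.0
            for l1 in range(-Z[0], Z[0] + 1):
                for l2 in range(-Z[1], Z[1] + 1):
                    a, b = m1 - l1, m2 - l2
                    if 0 <= a < M and 0 <= b < M:  # weights off the grid are zero
                        kern = k[0][l1] * k[1][l2]
                        s1 += w[a][b] * kern
                        s2 += w[a][b] * kern * kern
            f = s1 / n                                   # eq. 2.3
            v = s2 / (n * (n - 1)) - f * f / (n - 1)     # eq. 2.4
            f_hat[m1][m2], var_hat[m1][m2] = f, v
            # eq. 2.5; the variance is clamped at zero to guard against rounding.
            significant[m1][m2] = f > 4.3 * math.sqrt(max(v, 0.0))
    return f_hat, var_hat, significant

# Toy example: all n = 100 observations sit on the central grid point.
M, n = 11, 100
w = [[0.0] * M for _ in range(M)]
w[5][5] = float(n)
f_hat, var_hat, significant = density_on_grid(w, deltas=(1.0, 1.0), h=(0.5, 0.5), n=n)
```

In this toy case the kernel window extends two grid steps, so the corner of the grid has zero estimated density and falls outside $S$.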

2.4.3 Constructing “uphill” association pointers

Next we construct links between grid points that follow the density gradient. The intuition is to create links that point “uphill”, where “uphill” refers to an increase in the density function (e.g. gradient ascent [62]). We do this by visiting each grid point, examining its neighbors (of which there are at most 8 in the 2-d case), and establishing a link to the neighboring grid point with the highest value of the density estimate, provided that the difference in density estimates is statistically significant. The test for statistical significance guards against superfluous links between clusters caused by variability in the density estimate.

Formally, consider all the neighboring grid points $p_1, \ldots, p_{n_m}$, defined as the set of all grid points contained in the box

$$\bigcap_{j=1}^{2} \left\{ x : y_{m_j} - \Delta_j \le x_j \le y_{m_j} + \Delta_j \right\} \tag{2.6}$$

Let $p \in \{p_1, \ldots, p_{n_m}\}$ such that $\hat{f}(p) = \max_{k=1,\ldots,n_m} \hat{f}(p_k)$, splitting ties arbitrarily. Thus $p$ is the neighbor with the highest density estimate. We establish a pointer from $y_m$ to $p$ if the following two conditions hold:

(1) $\hat{f}(p) > \hat{f}(y_m)$

(2) $\frac{\partial}{\partial e} \hat{f}(y_m) > \lambda_m$

where $e = (p - y_m)/\|p - y_m\|$, $\|\cdot\|$ denotes Euclidean norm, and $\frac{\partial}{\partial e} \hat{f}(y_m)$ and $\lambda_m$ are defined as:

$$\frac{\partial}{\partial e} \hat{f}(y_m) = \sum_{a=1}^{2} e_a \frac{\partial}{\partial y_{m_a}} \hat{f}(y_m) \tag{2.7}$$

$$\frac{\partial}{\partial y_{m_a}} \hat{f}(y_m) = \frac{1}{n} \sum_{l_1 = -Z_1}^{Z_1} \sum_{l_2 = -Z_2}^{Z_2} w_{m-l} \times \frac{-l_a \Delta_a}{h_a^2} \prod_{j=1}^{2} \frac{\phi\left(l_j \Delta_j / h_j\right)}{h_j} \tag{2.8a}$$

$$\lambda_m = q\left(0.95^{1/K}\right) \sqrt{\hat{\Sigma}_m^2} \tag{2.9a}$$

$$K = \frac{\#S \times \sum_{m \in S} w_m}{n \, 2\pi \prod_{j=1}^{2} h_j \sum_{m \in S} \hat{f}(y_m)} \tag{2.10a}$$

$$\hat{\Sigma}_m^2 = \frac{1}{n-1} \sum_{a,b=1}^{2} e_a e_b \left( A - \frac{\partial}{\partial y_{m_a}} \hat{f}(y_m) \frac{\partial}{\partial y_{m_b}} \hat{f}(y_m) \right) \tag{2.11a}$$

$$A = \frac{1}{n} \sum_{l_1 = -Z_1}^{Z_1} \sum_{l_2 = -Z_2}^{Z_2} w_{m-l} \times \frac{l_a l_b \Delta_a \Delta_b}{h_a^2 h_b^2} \prod_{j=1}^{2} \frac{\phi^2\left(l_j \Delta_j / h_j\right)}{h_j^2} \tag{2.12a}$$

Here $e_a$ denotes the standard Euclidean basis vector and $\hat{\Sigma}_m^2$ is an estimate of the variance of $\frac{\partial}{\partial e} \hat{f}(y_m)$. $q(0.95^{1/K})$ is the critical value of the normal distribution adjusted for multiple testing via $K$ as in [25], where $q(x)$ denotes the $100 \times x$th percentile of the standard normal distribution. $A$ is an estimate of $\frac{\partial}{\partial y_{m_a}} \hat{f}(y_m) \frac{\partial}{\partial y_{m_b}} \hat{f}(y_m)$.

Recall that the purpose of equation 2.7 is to check that the derivative at $y_m$ in the direction of $p$ is significant, in order to prevent an accidental linking of different clusters. This significance check may result in the inability to establish links at the plateau near the maximum, where the density surface is relatively flat. This is addressed by the merging step described in Section 2.4.5.
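The pointer construction can be illustrated with a simplified Python sketch. Two simplifications are assumptions of this sketch and not part of the method as described: the directional derivative is approximated by a finite difference between the two grid points instead of the analytic expressions in equations 2.7 and 2.8a, and the thresholds `lam` are supplied by the caller instead of being computed from equations 2.9a to 2.12a. The function name `uphill_pointers` is hypothetical.

```python
import math

def uphill_pointers(f_hat, significant, deltas, lam):
    """For each significant grid point, point to its highest-density neighbor.

    lam[m1][m2] plays the role of the threshold lambda_m. The slope toward
    the neighbor is a finite-difference stand-in for the directional
    derivative. Returns a dict mapping (m1, m2) -> (p1, p2).
    """
    M = len(f_hat)
    pointers = {}
    for m1 in range(M):
        for m2 in range(M):
            if not significant[m1][m2]:
                continue  # background points are handled separately
            best, best_f = None, f_hat[m1][m2]
            for d1 in (-1, 0, 1):
                for d2 in (-1, 0, 1):
                    p1, p2 = m1 + d1, m2 + d2
                    if (d1 or d2) and 0 <= p1 < M and 0 <= p2 < M:
                        if f_hat[p1][p2] > best_f:       # condition (1)
                            best, best_f = (p1, p2), f_hat[p1][p2]
            if best is None:
                continue  # local maximum or plateau: no outgoing pointer
            step = math.hypot((best[0] - m1) * deltas[0], (best[1] - m2) * deltas[1])
            slope = (best_f - f_hat[m1][m2]) / step
            if slope > lam[m1][m2]:                      # condition (2)
                pointers[(m1, m2)] = best
    return pointers

# Toy 3 x 3 density surface with a single peak in the center.
f_toy = [[1.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 1.0]]
sig_toy = [[True] * 3 for _ in range(3)]
lam_toy = [[0.5] * 3 for _ in range(3)]
pointers = uphill_pointers(f_toy, sig_toy, deltas=(1.0, 1.0), lam=lam_toy)
```

In the toy surface every non-central grid point links directly to the center, and the center itself has no outgoing pointer.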

2.4.4 Assignment to cluster or background

For all grid points $y_m \in S$: If a link originates at $y_m$, then it will point to a different grid point, which itself may have a pointer originating from it. This succession of pointers is followed until we arrive at a grid point $y_z$ for which either:

(a) yz does not have any pointer originating from it

(b) yz has a pointer originating from it which points to a dummy state that represents a cluster or background noise

In case (a) all the pointers visited in the path leading to $y_z$ will be removed and new pointers originating from each grid point in succession will be established to the dummy state that represents the background noise, provided the following condition holds:

$$\hat{f}(y_z) < q\left(0.95^{1/K}\right) \sqrt{\hat{\sigma}_z^2} \tag{2.13}$$

Otherwise, provided there is a pointer into $y_z$, then a new pointer will be established that originates from $y_z$ and points to a newly established dummy state that represents a new cluster. In case (b) no pointers are removed or established.
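The assignment step amounts to following each chain of pointers to its end point and labeling the whole chain by what is found there. The sketch below is a simplification under stated assumptions: the predicate `is_background_root` is a hypothetical stand-in for the test in equation 2.13, and every end point that is not sent to background starts a new cluster, including an isolated point with no incoming pointer (a case the text does not spell out).

```python
def assign_labels(points, pointers, is_background_root):
    """Follow pointer chains to their end and label every grid point.

    points: iterable of grid indices in S.
    pointers: dict index -> index (at most one outgoing pointer per point).
    is_background_root(z): hypothetical predicate standing in for eq. 2.13.
    Returns dict index -> cluster id (int >= 1), or 0 for background.
    """
    root_label = {}
    labels = {}
    next_id = 1
    for start in points:
        path = [start]
        while path[-1] in pointers:  # pointers only go uphill, so no cycles
            path.append(pointers[path[-1]])
        z = path[-1]
        if z not in root_label:
            if is_background_root(z):
                root_label[z] = 0
            else:
                root_label[z] = next_id  # new dummy state for a new cluster
                next_id += 1
        for q in path:
            labels[q] = root_label[z]
    return labels

# Toy example: two chains, one ending in a cluster and one in background.
toy_pointers = {(0, 0): (1, 1), (0, 1): (1, 1), (3, 3): (3, 4)}
toy_points = [(0, 0), (0, 1), (1, 1), (3, 3), (3, 4)]
toy_labels = assign_labels(toy_points, toy_pointers, lambda z: z == (3, 4))
```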

2.4.5 Merging clusters

In the final step of the algorithm, we determine whether two clusters should be merged because they are connected by a path that possesses no statistically significant trough. This is done by iteratively building a set of grid points which are neighbors to a local maximum of the density surface, are not maxima or background, and do not exhibit a statistically significant change in density when compared to the local maximum. If this set in turn possesses a neighboring grid point that is a local maximum, then we have found a path (via this set) between two local maxima that does not exhibit a statistically significant trough. We merge the corresponding clusters by establishing pointers to the grid point with the highest density, and iterate until there are no more changes in cluster assignment. It can be shown that the procedure converges in a finite number of iterations.

Let $\{y_{m(1)}, \ldots, y_{m(k)}\}$ be the set of all grid points which have a pointer originating from them to a dummy state representing a cluster. A partial ordering of this set is given by $Y : \hat{f}(y_{m(i)}) \ge \hat{f}(y_{m(j)})$, and the merging routine is outlined in Algorithm 1, utilizing the subroutines outlined in Algorithms 2–4. When the merging algorithm terminates, every grid point has a link originating from it. Following the succession of pointers leads to a dummy state which represents either background noise or a cluster. The grid points which are linked to the same dummy state pertain to the same cluster, or to background noise. Cluster membership of individual observations is derived from the membership of the grid points according to Section 2.4.1.

Algorithm 1 Merge
    Input: S    # The set of all non-zero grid point indices
    Input: Y    # The sorted list of all grid point indices pointing to a cluster
    A, B, P := ∅
    updated := 0
    repeat
        for each i in Y do
            A := {i}
            repeat
                P := LocalMaxima(A, S, i)
                A ← P
            until P = ∅
            B := LocalMaximaNeighbors(A, Y)
            if B ≠ ∅ then
                updated := UpdatePointers(A, B)
            end if
        end for
    until updated = 0

Algorithm 2 LocalMaxima
    Input: A    # Set of grid point indices in the neighborhood of a local maximum
    Input: S    # The set of all non-zero grid point indices
    Input: i    # Index of some grid point that points to a cluster
    P := ∅
    for each a ∈ A do
        for each p ∈ S do
            if neighbors(y_a, y_p) AND outpointers(y_p) = ∅ AND f̂(y_p) + σ̂_p ≥ f̂(y_i) − σ̂_i then
                P ← p
            end if
        end for
    end for
    return P

Algorithm 3 LocalMaximaNeighbors
    Input: A    # Set of grid point indices in the neighborhood of a local maximum
    Input: Y    # The set of grid point indices pointing to a cluster
    B := ∅
    for each j in Y do
        if neighbor(y_j, y_p) for some p ∈ A then
            B ← j
        end if
    end for
    return B

Algorithm 4 UpdatePointers
    Input: A    # Set of grid point indices in the neighborhood of a local maximum
    Input: B    # The set of grid point indices pointing to a cluster with a neighbor in A
    q := argmax_{r ∈ B} f̂(y_r)
    for all p ∈ A do
        addpointer(y_p, y_q)
    end for
    updated := 0
    for each r ∈ B, r ≠ q do
        removepointer(y_r, cluster)
        addpointer(y_r, y_q)
        updated := 1
    end for
    return updated
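Once every grid point carries a cluster or background label, the labels have to be carried back to the individual observations. The text states only that this follows the binning of Section 2.4.1; the sketch below makes the simplest possible assumption, namely that each observation takes the label of its nearest grid point. The function name `label_observations` and the convention that label 0 means background are hypothetical, and indices are 0-based.

```python
def label_observations(xs, mins, deltas, grid_labels, M):
    """Give each 2-d observation the label of its nearest grid point.

    grid_labels maps (m1, m2) -> label; grid points without an entry are
    treated as background (label 0).
    """
    out = []
    for p in xs:
        idx = tuple(
            min(max(int(round((p[j] - mins[j]) / deltas[j])), 0), M - 1)
            for j in range(2)
        )
        out.append(grid_labels.get(idx, 0))
    return out

# Toy example on a 3 x 3 grid with unit spacing and two labeled grid points.
toy_grid_labels = {(0, 0): 1, (2, 2): 2}
obs_labels = label_observations(
    [(0.2, 0.1), (1.9, 2.0), (1.0, 1.0)],
    mins=(0.0, 0.0), deltas=(1.0, 1.0), grid_labels=toy_grid_labels, M=3,
)
```

A weighted vote over the four surrounding grid points, mirroring the linear binning weights, would be an equally plausible choice.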

2.5 Materials

2.5.1 Synthetic data

Figure 2.3: 2-dimensional synthetic ‘Flow Cytometry-like’ data containing 4 clusters and background (n=10000). Axes: DIMENSION 1 (horizontal) and DIMENSION 2 (vertical).

Cluster color    Size
Black            6088
Red              2415
Green            977
Blue             311
Purple           209

Figure 2.4: Synthetic data cluster statistics, purple is background noise

‘Flow cytometry’-like data were simulated for n=10000 events in 3 dimensions containing 4 clusters. Four clusters can be seen by examining a plot of the 1st vs. 2nd dimensions. The largest cluster is curved like a kidney-bean, and is generated by mixing two normal distributions together. The 3rd dimension contains noise. The R code used to generate this data can be found in Appendix A. Important properties of the data that make it ‘FC-like’ are the presence of multiple clusters, clusters of varying frequency and density, overlapping clusters, non-convex clusters, large number of events and the presence of background noise.
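A dataset with the same qualitative properties can be generated along the following lines. This is not the R code of Appendix A: the cluster sizes are taken from Figure 2.4, but every mean, standard deviation, correlation and the uniform background range are made-up values chosen only to give a curved large cluster, smaller clusters of varying density and scattered background events.

```python
import math
import random

def correlated_normal(rng, mean, sd, rho):
    """Draw one 2-d normal sample with per-axis sd and correlation rho."""
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x = mean[0] + sd[0] * z1
    y = mean[1] + sd[1] * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
    return x, y

def simulate(seed=0):
    """Return a list of (dim1, dim2, dim3, label) rows, 10000 in total."""
    rng = random.Random(seed)
    # (label, size, mean, sd, rho); sizes follow Figure 2.4, the rest is assumed.
    # The "black" cluster mixes two oppositely tilted normals to bend its shape.
    spec = [
        ("black", 3044, (2.0, 1.0), (1.2, 0.8), 0.8),
        ("black", 3044, (6.0, 1.0), (1.2, 0.8), -0.8),
        ("red", 2415, (10.0, 6.0), (1.0, 1.0), 0.0),
        ("green", 977, (3.0, 7.0), (0.8, 0.8), 0.0),
        ("blue", 311, (12.0, 0.0), (0.5, 0.5), 0.0),
    ]
    events = []
    for label, size, mean, sd, rho in spec:
        for _ in range(size):
            x, y = correlated_normal(rng, mean, sd, rho)
            events.append((x, y, rng.gauss(0, 1), label))  # 3rd dimension is noise
    for _ in range(209):  # uniformly scattered background events
        events.append((rng.uniform(-5, 15), rng.uniform(-5, 10),
                       rng.gauss(0, 1), "background"))
    rng.shuffle(events)
    return events

events = simulate()
```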

2.5.2 Neonatal mouse spleen and peritoneal cavity

Neonatal mouse spleen and peritoneal cavity cells harvested in serum-containing medium were incubated on ice for 15 minutes with a 10-color staining combination. Data were collected on an LSR II (Becton Dickinson). Studies were designed and carried out by Yang Yang.

2.5.3 Nut allergy blood samples

Complete details about materials and methods for this study can be found in [23]. Below we present abbreviated details relevant to the present work.

Human subjects. All 13 food-allergic subjects (or parents, for minors) and all healthy controls (n = 9) signed informed consent forms before the subjects underwent study procedures. Clinical nut or apple allergy was diagnosed by clinical history of food allergy reaction, nut-specific IgE and/or positive skin prick test to nut allergen. Severity was graded based on published scores of anaphylaxis symptoms.

Sample collection and processing. Blood was collected from each patient and stained according to the basophil stimulation assay (below).

Basophil stimulation assay. Three microliters of phosphate-buffered saline (PBS) or of an allergen extract (peanut, cockroach, cashew, walnut, apple) used clinically for skin testing were added to 200 µl of blood from each patient and the mixture was incubated for 10 minutes at 37 °C. 50 to 200 µl of blood were stained with several antibodies against surface determinants for 20 min on ice, in the dark. These antibodies included CD3, CD11b, CD16, CD20, CD41a, CD56, CD63, CD66b, CD123, CD203c, CD294 and HLA-DR. Each patient sample was divided in three and stimulated with: PBS, a non-offending allergen and an offending allergen. Data for 1.5–2 × $10^5$ cells per sample were acquired on an LSRII (Becton Dickinson, Inc.).

2.6 Results

We implemented the density-based merging (DBM) algorithm in a Java application with a graphical user interface that allows cluster visualization and sequential selection of clusters to support progressive gating. To enable comparison of DBM gating with data gated manually with a commercial analysis package (FlowJo, http://www.treestar.com/), we record cluster assignments for each event in association with the original data. These values are used as synthetic gating parameters in the commercial package, where we can directly compare results with those obtained by manual gating.

We implemented our methodology in a sequential 2D setting to automate the traditional manual gating. While the methodology can in principle be implemented in a higher-dimensional setting, there are also advantages to remaining with the traditional sequential procedure. First, many users are familiar with the sequential gating procedure and may be hesitant to work with the high-dimensional output of a black box, which may be difficult to interpret. Second, it is common practice to first project the data on the forward light scatter (FSC) and sideward light scatter (SSC) to distinguish basic cell types (e.g., monocytes and lymphocytes) and to remove dead cells and cell debris. Also, the user may have prior knowledge that leads her to consider certain 2D projections or gating paths. These aspects are readily incorporated in our implementation. Third, sequential 2D gating allows for an informative and straightforward visualization of the gating and the results.

2.6.1 Evaluation using manual gating

Here we compare the results of DBM to data previously analyzed by a senior researcher for two datasets: B-cells in mouse spleen and peritoneal cavity and basophils from human blood samples. Details about the collection of these data were described in Sections 2.5.2 and 2.5.3.

[Figure 2.5 image. Top row: manual gating, with gates of 224059 (80%) and 211203 (94%) events. Bottom row: DBM gating, with gates of 227541 (81%) and 213897 (94%) events. Axes shown include FSC-A, FSC-H, CD11b, CD5+PI and F4/80+Gr-1.]

Figure 2.5: Comparison of manual and DBM gating in the scatter dimensions. Singlet gates are shown as determined by the researcher (top) and DBM (bottom, colored plot frames) for neonatal mouse spleen cells. The subset is further gated using the researcher's live/dead gate and displayed in the context of the next gating decision by the researcher. Note that the results of the DBM clustering are displayed with the same software that was used for the manual gating (FlowJo). In the bottom left plot, color is used to code the clusters found by DBM.

B-cells in mouse spleen and peritoneal cavity

In the following analyses, we replicate manual gating decisions from a dataset previously analyzed by a senior researcher using FlowJo. The researcher sequentially selected gates that progressively restrict the inclusion of cells to ultimately encompass a known functionally distinct subset. For each of these sequential manual gating decisions, we select the corresponding cluster(s) defined by the DBM algorithm. In our analysis, we thus reproduce the existing workflow of the researcher, with the exception that we use gating boundaries that are defined algorithmically by DBM.

Figure 2.5 (first column) compares the initial gating in the forward-scatter area/height dimensions performed manually (top) or with DBM (bottom). The research intention here is to separate single cells from doublets and other debris. Drawing the manual gate requires a great deal of experience, owing to the lack of visual differentiation between the overlapping populations. DBM identifies two clusters that agree well with the manual gate: the red cluster contains 81% of the total events, the corresponding expert gate contains 80% of the total events, and the overlap between the two gates is 98%. Two views of the events encompassed by the clusters are shown in columns 2 and 3 of Figure 2.5. Column 4 shows further gating of the samples with the same manual gate applied to the manually gated (top) and DBM gated (bottom) data shown in columns 2 and 3. The similarity of the yield from the manually gated and DBM gated sample underscores the strong overlap between the two approaches.

In each case, a small percentage of the events captured by one of the gating methods are excluded from the other (Figure 2.6). Importantly, we find that the DBM gate tends to better capture the desired events than does the researcher's gate. We define desirable events as those included in the subsequent gates that the expert set.
The gate set by the expert included fewer cells in the desired subset than the DBM gate, resulting in a loss of desired cells (3474 cells). The expert gate also included fewer cells outside the desired subset. However, the additional nondesired cells included in the DBM gate are not relevant since the expert has gated these out of the subsequent analysis. Thus, in this situation, the DBM gate is more successful than the expert gate. CHAPTER 2. AUTOMATED GATING OF CELL POPULATIONS 30

[Figure 2.6 panels: columns "In scatter gate" and "In live gate"; rows "Found by manual and DBM" (223250; 93.2), "Missed by DBM" (809; 83.7), "Missed by manual" (4291; 74.5). Axes: FSC-A, FSC-H, CD11b, F4/80+Gr-1.]

Figure 2.6: Differences in manual versus DBM gating in scatter dimensions. Cells included by both gates (top), cells included in the manual gate and excluded by the DBM gate (middle), and cells included in the DBM gate and excluded by the manual gate (bottom) are displayed (column 1). Cells are live/dead gated as described in the text, and shown in the context of the next manual gating decision (column 2).

[Figure 2.7 panels: rows "Manual gating" (120658 (80%); 95760 (80%); 86082 (89%)) and "DBM gating" (122115; 92599; 86628); final column labeled "B cells". Axes: FSC-H, FSC-A, CD11b, CD5+PI, F4/80+Gr-1, CD19, B220.]

Figure 2.7: Comparison of manual and DBM gating for a 3-step gating sequence. Adult mouse spleen cells are analyzed using the researcher's manual gates (top plots) and the corresponding clusters identified by DBM (bottom plots with colored plot frames). Color is used to code the clusters found by DBM in the first three plots on the bottom. Each of the manual/DBM gate pairs has < 4% difference in total number of cells. In this study, the researcher is interested in B cells (column 4).

In Figures 2.5 and 2.6, we analyzed the results of a single DBM gate generated to match the first gate that the expert applied in the gating series. Figure 2.7, which is based on a different dataset, compares results from three sequential gates applied by the researcher with the comparable sequential DBM gates. The researcher has chosen three sequential gates (Figure 2.7, top): the first gate excludes doublets and debris; the second gate excludes dead cells (bright PI); the third, which yields a subset that is enriched for B cells (the target of interest to the expert), excludes monocytes and macrophages (bright for CD11b and F4/80+Gr-1). Applying the corresponding sequence of DBM clusters results in a distribution (Figure 2.7, bottom) that is almost indistinguishable from the distribution obtained with the expert's gates. The principal difference is a small increase in the number of cells in the B cell subset desired by the expert, and the inclusion of a small percentage of cells that lie near, but not within, the B cell subset.

Nut allergy human blood samples

In a second application, DBM was applied to data obtained from human blood samples to identify basophils (samples described in Section 2.5.3). Basophils are the least common subset of granulocytes, representing about 0.01% to 0.3% of circulating white blood cells. DBM is computed for each sample using two channels: a dump channel containing lineage markers for cells that are to be excluded from subsequent analysis (CD3, CD16, CD19, CD66b, HLADR) and CD123, a marker on the surface of mature basophils. For each patient (allergic n=13, non-allergic n=9), we performed DBM clustering on each of 3 blood samples: (i) unstimulated; (ii) stimulated with a non-offending allergen; and (iii) stimulated with an offending allergen. The only parameter tuned for the present study is percent outliers (Q), which defines the percentage of events that do not get assigned to a cluster. If this value is set to 0, every event is assigned to a cluster. For this analysis, Q is set at 0.5% to obtain a reasonable number of clusters for the population of interest. All other parameters are set according to statistical theory of density estimation as described in Section 2.4. In all of the 22 patient samples, DBM identified at least one cluster corresponding to the basophil subset (recall that the basophil subset represents ∼0.5% of the total sample). In certain cases, DBM segmented the researcher-defined basophil subset into 2 or more clusters. We investigate one of these cases in detail below.

[Figure 2.8 panels: ALLERGIC and NON-ALLERGIC; top row plots "Num clusters" and bottom row plots "Num basophils" against Stimulation (offending, non-offending, unstimulated).]

Figure 2.8: One-way analysis of variance (ANOVA) on the number of clusters (top) and the number of events in the clusters (bottom) for each of three aliquots from each sample. No panel exhibits a significant difference among the stimulations within a group (allergic, non-allergic). Between groups, there is a difference in the number of basophils (p=0.0095) but no difference in the number of clusters (p=0.2).

We performed a one-way ANOVA (Figure 2.8) on the number of clusters identified and the number of events in each cluster between the allergic and non-allergic samples. There is no significant difference in the mean number of clusters (p=0.2), but there is a significant difference in the number of events in the basophil clusters (median values: allergic=1095, non-allergic=526, p=0.0095). This is consistent with previous observations that individuals with food allergies have elevated baseline basophil frequency [23]. Notably, the basophil cluster was identified in all 22 patients × 3 samples/patient = 66 samples.

2.6.2 Validating differences between gating and clustering

In some cases, the cells identified by the researcher as the basophil subset were segmented into 2 or more subsets by DBM. We examined the branch of these cases where the basophil cluster was segmented into 2 subsets represented by a CD123-intermediate (CD123+) and CD123-high (CD123++) subset (Figure 2.9). For the sample in Figure 2.9, the CD123+ subset contains ∼6× the number of cells compared to the CD123++ subset (1871/294). For each of the two subsets, we look at the activation response in the CD63 and CD203c channels. Notably, while the CD123+ subset exhibits an activation profile consistent with allergic response, the CD123++ subset exhibits substantially lower levels of CD63 and CD203c, a profile more similar to a non-allergic response. This trend was consistent across the 3 allergic patient samples for which the CD123-intermediate/CD123-high split occurred. Although the algorithm produced results that did not directly correspond to the researcher's gating strategy in these cases, this example illustrates how DBM is able to differentiate two subsets of cells where manual gating only detected a single subset. We consider this missing segmentation a false negative for the researcher, as the two subsets of cells clearly demonstrate distinct functional profiles, measured by the activation markers. At present, the biological explanation of the CD123++ subset is unclear. However, including these cells, which are not activated during stimulation, in the diagnostic analysis presented earlier will dampen the activation signal of the CD123+ subset.

2.6.3 Comparison with mixture model on synthetic data

In Section 2.6.1 we used experimental flow cytometry data to examine the performance of DBM with respect to existing manually defined cluster boundaries. These studies test the ability of the algorithm to recapitulate results obtained by a trained domain expert. However, the "gold standard" of manual gating by domain experts is a subjective criterion, making precise quantitative comparison to algorithm performance difficult to interpret. A quantitative comparison of clustering algorithm performance is better suited to data in which the "gold standard" is objective and well-defined. One convenient approach is to generate data in which the cluster labels are known. Using these synthetic datasets, many comparisons can be performed rapidly and classification accuracy is easy to compute and straightforward to interpret. In this Section we present the results of DBM applied to the synthetically generated dataset described in Section 2.5.1. The data are plotted in Figure 2.10 along with the true cluster labels and DBM cluster labels, indicated by color.

[Figure 2.9 axes: Dump, CD123, CD63, CD203c.]

Figure 2.9: CD123++ population does not exhibit significant basophil activation compared to CD123+

Analyzing these data highlights several of the key properties of DBM. The unstructured events colored pink are marked as background because the density estimate at those grid points was not significantly different from zero. This threshold was set according to [25] and did not require sample-specific tuning. The large kidney-bean shaped cluster (black), which is difficult to represent with parametric models, is well-captured (precision=0.98, recall=0.98). The intersection of the black, red and green clusters contains overlapping events, making classification difficult. The green cluster presented the largest error in classification accuracy; ∼10% of the green events were assigned to the neighboring red and black clusters. For each of the clusters, some of the boundary events were assigned to the background (black=0.5%, red=1.7%, green=4.7%, blue=1%). Complete DBM cluster results for the synthetic data are summarized in Table 2.1.

[Figure 2.10 panels: RAW DATA, TRUE CLUSTER LABELS, DBM CLUSTER LABELS.]

Figure 2.10: Synthetic FC-like data in 2 dimensions showing true class labels and DBM cluster results.

DBM          Black    Red  Green   Blue  Unclassified  Precision  Recall
Black         5936     90     28      0            34       0.98    0.98
Red             51   2314     10      0            40       0.94    0.96
Green           60     40    831      0            46       0.95    0.85
Blue             0      0      0    308             3       0.99    0.99
Background      26     17      5      2           159       0.56    0.76

Table 2.1: Results of Density-Based Merging on synthetic data (rows: true labels; columns: DBM assignments) - combined F-measure: 0.96
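As a check on how the precision and recall columns relate to the counts, the following minimal Python sketch recomputes them from the matrix in Table 2.1. It assumes that rows are true labels and columns are DBM assignments (with the "Unclassified" column paired with the "Background" row), an assumption that reproduces the printed values; the function name `precision_recall` is introduced here for illustration only.

```python
# Counts transcribed from Table 2.1.
# Assumption: rows are true labels, columns are labels assigned by DBM.
labels = ["Black", "Red", "Green", "Blue", "Background"]
confusion = [
    [5936, 90, 28, 0, 34],
    [51, 2314, 10, 0, 40],
    [60, 40, 831, 0, 46],
    [0, 0, 0, 308, 3],
    [26, 17, 5, 2, 159],
]


def precision_recall(matrix):
    """Return a list of (precision, recall) pairs, one per class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        assigned = sum(matrix[i][k] for i in range(n))  # column total
        actual = sum(matrix[k])                         # row total
        results.append((true_positive / assigned, true_positive / actual))
    return results


scores = dict(zip(labels, precision_recall(confusion)))
for name, (precision, recall) in scores.items():
    print(f"{name:<11} precision={precision:.2f} recall={recall:.2f}")
```

The same function applies unchanged to any square confusion matrix laid out this way.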

Mixture models

Using the same synthetic data from Section 2.5.1, we applied two mixture model algorithms for clustering FC data recently proposed in the literature. The first is flowClust, a model-based clustering approach using multivariate t-mixture models and a Box-Cox transformation [50]. The second is flowMerge, an extension to flowClust proposed by Finak et al., in which mixture components are merged in order to minimize the entropy of the data under the new cluster assignments [21]. Merging a combination of mixture components provides a means for modeling components with irregular (e.g. non-convex) shapes that are poorly modeled by traditional parametric distributions.

In order to fit a mixture model, one must supply the number of components in the mixture (where components map to clusters). When the number of components is unknown, the value must be estimated from the data. A common approach, which is employed by [50], estimates the parameters of a mixture model with k components for each k = 1, ..., n, and selects the model with k = j that scores best under some model selection criterion. Several model selection criteria exist, such as the Bayesian Information Criterion (BIC) [77], the Akaike Information Criterion (AIC) [3] and the gap statistic [83]; flowClust uses BIC by default.

We analyzed the synthetic data using flowClust mixture model clustering with 4 and 5 components (Figure 2.11, top row). The model with the correct number of components (k = 4) splits the kidney-bean cluster, merging part of it with the cluster centered at (6, 3) (green cluster in Figure ). The model with one additional component (k = 5) offers a better fit to the data, dividing the kidney-bean in half but capturing the cluster centered at (6, 3) missed in the 4-component model. The irregular kidney-bean shaped cluster illustrates the shortcoming of using models based on parametric distributions, which are necessarily convex, on data containing non-convex shapes.

[Figure 2.11 panels: T-MIXTURE MODEL (flowClust) with K=4 and K=5 (top row); MERGED MODEL (flowMerge) with K=3 and K=4 (bottom row).]

Figure 2.11: Top row: t-mixture modeling using k=4 and k=5 components on synthetic data. Bottom row: t-mixture models from top row used as input to mixture model component merging algorithm, yielding k=3 and k=4 components.

To address the shortcomings of model-based clustering for FC, [21] proposed a post-processing step which takes the output of a mixture model as input and merges components based on cluster entropy to improve the model fit. The entropy of clustering for a K-cluster mixture model is defined as

\[ \mathrm{ENT}(K) = -2 \sum_{k=1}^{K} \sum_{i=1}^{N} p_{ik} \log_2(p_{ik}) \tag{2.14} \]

where p_ik is the probability of data point i belonging to cluster k. Highly overlapping clusters have high entropy and clusters with little overlap have entropy near 0. As components are combined by the merging algorithm, the entropy will decrease until there are only well separated clusters, at which point further merging will have a small effect on the total entropy of clustering. This 'elbow' in the curve is reflected in a change of slope in a plot of clustering entropy versus number of components and is used to select the optimal number of clusters.
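The behavior of Equation 2.14 under merging can be illustrated with a small Python sketch. This is not the flowMerge implementation; it only evaluates the formula on a hand-made posterior matrix, and the helper `merge_components` (a hypothetical name) assumes that merging two components amounts to summing their membership probabilities for each event.

```python
import math


def clustering_entropy(posteriors):
    """Entropy of clustering as written in Equation 2.14.

    posteriors[i][k] is p_ik, the probability that event i belongs to
    cluster k; each row is expected to sum to 1.
    """
    total = 0.0
    for row in posteriors:
        for p in row:
            if p > 0.0:  # treat 0 * log(0) as 0
                total += p * math.log2(p)
    return -2.0 * total


def merge_components(posteriors, a, b):
    """Merge components a and b by summing their membership probabilities."""
    merged = []
    for row in posteriors:
        rest = [p for k, p in enumerate(row) if k not in (a, b)]
        merged.append([row[a] + row[b]] + rest)
    return merged


# Three events: the first two are split evenly between components 0 and 1,
# the third belongs entirely to component 2.
posteriors = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
before = clustering_entropy(posteriors)
after = clustering_entropy(merge_components(posteriors, 0, 1))
print(before, after)
```

Merging the two overlapping components removes all ambiguity in this toy example, so the entropy drops to zero; merging further would leave it unchanged, which is the flattening that the elbow criterion looks for.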

flowMerge     Black    Red  Green   Blue  Unclassified  Precision  Recall
Black          5749     60     46      0           233       0.99    0.94
Red              30   2280     21      0            84       0.96    0.94
Green            23     29    904      0            21       0.92    0.93
Blue              0      0      0    305             6       0.99    0.98
Unclassified     18     17     11      2           161       0.32    0.77

Table 2.2: Results of merging t-mixture components on synthetic data (rows: true labels; columns: flowMerge assignments) - t-mixture model clustering performed with k=5 components using the flowClust package with default parameter settings, then merged using the flowMerge package with outlier level=0.90. The merging results in 4 clusters, with a combined F-measure: 0.95

On the synthetic data, flowMerge improves the classification accuracy of flowClust. We ran flowMerge on two models (k = 4, k = 5) generated by flowClust, producing new models with k = 3 and k = 4 components, respectively. We manually tuned the outlier level to produce the most favorable results (outlier level=0.9). Complete classification results for the 4-component flowMerge model are reported in Table 2.2. The 4-component merged t-mixture model demonstrated excellent precision and good recall (0.99, 0.94) on the kidney-bean shaped cluster in the synthetic data, illustrating that merging two t-distribution components can produce a good approximation to a non-convex cluster (Figure 2.11, bottom right). Compared with DBM, flowMerge recall was lower on the non-convex cluster (black; 0.94 vs 0.98) but higher on the green cluster (0.93 vs 0.85). The combined F-measure over all of the clusters is similar for flowMerge and DBM (DBM=0.96, flowMerge=0.95). Overall, both algorithms performed well on the synthetic data.

2.7 Discussion

In a comparison with the recent state-of-the-art model-based clustering approach (flowMerge) using synthetic data, flowMerge and DBM produced comparable results, demonstrating good precision and recall. Although comparable performance between a model-based approach and DBM is shown, we believe practical and theoretical benefits to DBM make it a more suitable approach. On the theoretical side, DBM is not based on assumptions about the underlying data. This is important, as we have no a priori reason to believe that the pattern of marker expression in/on cells measured by FACS would conform to any known distributions. Empirically, irregular population shapes are routinely characterized by experienced researchers. Practically, the DBM method is efficient to compute using well-known optimizations for density estimation (such as the Fast Fourier Transform) and is highly amenable to parallelization. A 2-dimensional analysis of the type presented in this Chapter is computed in ∼1 second on a standard notebook computer. While flowMerge yielded comparable classification accuracy on the synthetic dataset, it took considerably longer to compute. One reason for the long running time is that a mixture-model approach does not directly estimate the number of components. Instead, a "maximal" number of components M is specified, and all models consisting of 1 to M components are generated. Afterwards, a model selection criterion such as BIC or AIC is used to select among them. While providing a good estimate, model selection criteria are generic estimates of the best model that do not account for application-specific concerns (e.g. what is a good penalty for adding components). No reasonable guidance exists for choosing between models with similar selection criteria scores. In contrast, the number of clusters is determined directly by the DBM algorithm and only a single iteration of the algorithm is computed.

At present, the ability to identify cell subsets in a reliable, robust and efficient manner is the most pressing unsolved problem in flow cytometry data analysis. To this end, we developed a novel clustering algorithm, Density-Based Merging, capable of automatically defining subset boundaries with good accuracy. Notably, our method makes no assumptions about the underlying distribution of the data, performs well in the presence of outliers, does not require specialized parameter selection or tuning for different datasets and can be computed efficiently. The number of clusters is determined automatically by the data, and cluster boundaries are based on well established statistical theory of density estimation. We applied DBM to experimental data of B-cells from mouse spleen and peritoneal cavity, and demonstrated that the sequence of DBM clusters corresponding to the expert gating strategy resulted in a distribution that is almost indistinguishable from the distribution obtained with the expert gates. In addition, we applied DBM to 66 blood samples from nut allergy patients to identify basophils, a small subset representing about 0.5% of the total cells collected. DBM correctly identified the basophil subset in all 66 samples, demonstrating the ability to detect rare but clinically relevant populations.

Chapter 3

Quantifying Changes in Cell Subsets

3.1 Abstract

Changes in frequency and/or marker expression in small subsets of peripheral blood cells can serve as important biomarkers of drug response, disease susceptibility, progress and prognosis. Flow cytometry is particularly useful for such purposes because it enables simultaneous measurement of up to 20 markers on the inside and surface of each of a very large number of cells in a blood sample. Although changes in joint expression of two or more independently regulated markers on a given cell type could provide strong indices of response in various modalities, there are still substantial impediments to recognizing and computing such indices from multiparameter flow data. Here we describe methods that facilitate these analyses and provide reliable indices of marker expression by individual types of cells in human blood samples. We show that Earth Mover's Distance provides a suitable metric for these purposes. We demonstrate the practical utility of this method with data from an allergy study that demonstrated shifts in the expression of two markers on very rare blood cells (basophils) in response to stimulation with an offending allergen and a monocyte study involving 3 mouse strains. The studies presented here demonstrate that objective and quantitative criteria can be used to differentiate allergic response and genetic differences in mice, in a manner consistent with supervised analysis.

3.2 Introduction

Once a cell population has been identified, it is common to compare differences in the population between two or more samples. Comparisons between a disease and control, different genetically modified organisms, or samples that have undergone stimulations provide information about the model system. Thus, we require a metric capable of quantitatively measuring these differences.

Subsets of cells identified by flow cytometry are frequently compared to find such differences. Measurements are taken for tens of characteristics on thousands or millions of cells, creating a complex non-parametric multivariate distribution. A classical hypothesis test can easily be applied to compare differences between the distributions. However, as we will show in this Section, hypothesis tests are very sensitive to small movements in the data and output a binary result: different or not different.

Most analyses performed with flow cytometry measurements aim at a measure of biologically meaningful difference. For example, in the allergy example described in Section 3.6.2, researchers want to know: Do samples from diseased subjects differ in a meaningful way from samples from healthy controls? How much does an individual subject differ from controls? Does the magnitude of deviation from controls correlate with the severity of the disease state? The fundamental need is for methods that are able to operate in the "sweet spot" of statistical and biological significance.

Goals

• Quantitate the difference between samples, evaluate statistical significance
• Identify changes in joint expression of multiple markers
• Rank test samples based on the amount of deviation from a control sample
• Identify events that are different between samples

Studies here show that application of Earth Mover's Distance (EMD) measurements to flow cytometry data accurately measures un-coordinated marker expression changes on the same cells and hence provides a robust measure of responsiveness that is not bound to changes in expression levels of a single marker. Importantly, the methods we have developed significantly simplify flow analyses when more than one marker is expressed and readily adapt to automated analysis of flow data.

3.3 Background

The problem of comparing two samples and assessing whether they are different is one of the basic questions of a branch of inferential statistics known as hypothesis testing. Developed in the early 20th century by Sir Ronald Fisher, Jerzy Neyman and Egon Pearson, classical hypothesis testing aims to detect the presence of an effect or a difference under different conditions [80]. This is accomplished through an evaluation of the relative likelihood of two statistical hypotheses in the presence of observed data.

The null hypothesis (H0) is a statement of no effect or difference (i.e. that the two distributions are equal), while the alternative hypothesis (H1) indicates the presence of an effect or a difference. In hypothesis testing, we calculate the probability of the observed data given the null hypothesis. Application-specific thresholds dictate the likelihood at which one is willing to reject the null hypothesis in favor of the alternative hypothesis (e.g. with probability ≤ 0.05).

In the following Sections we briefly introduce representative methods for comparing two samples as they have been applied in the flow cytometry informatics literature. For clarity, we divide the methods into non-parametric test statistics and distance metrics.

3.3.1 Non-parametric test statistics

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (KS) statistic is a nonparametric test appropriate for ordinal data developed independently by Kolmogorov (1933) and Smirnov (1939). The goal of the statistic is to evaluate the hypothesis: "Do two independent samples represent two different populations?" [80]. The test has several properties that suggest it may be appropriate for flow cytometry applications: (i) the test is sensitive to differences with respect to central tendency, variability, skewness and kurtosis; (ii) it belongs to a class of so-called "distribution-free" tests in which "the expected distribution of the test statistic is independent of particular theoretical density law being considered" [59].

The KS test examines the cumulative distribution functions (CDFs) of two populations and looks for the point that maximizes the vertical distance M between the two CDFs:

\[ d_{KS}(H, K) = \max_i \left( |\hat{h}_i - \hat{k}_i| \right) \tag{3.1} \]

where \hat{h} and \hat{k} are cumulative histograms. The KS statistic is defined for univariate data. Extending it to the multivariate case is not straightforward [59], though methods have been proposed (for examples, see [22][51][59]). Empirically, it has been demonstrated that the KS statistic is too sensitive to provide meaningful values for FC data. For example, two identical FC samples run in succession often result in distributions with statistically significant differences [16]. Given statistically significant differences in biologically identical samples, the test does not allow us to make useful conclusions about biologically different samples.
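Equation 3.1 can be evaluated directly on two binned samples. The Python sketch below is a minimal illustration that assumes both histograms are event counts over the same ordered bins; the function name `ks_distance` is introduced here and does not come from any cytometry package.

```python
from itertools import accumulate


def ks_distance(hist_h, hist_k):
    """Maximum absolute difference between two cumulative histograms (Eq. 3.1).

    Both inputs are event counts over the same ordered bins; each is
    normalized by its own total before the cumulative sums are compared.
    """
    if len(hist_h) != len(hist_k):
        raise ValueError("histograms must share the same bins")
    total_h, total_k = sum(hist_h), sum(hist_k)
    cum_h = [c / total_h for c in accumulate(hist_h)]
    cum_k = [c / total_k for c in accumulate(hist_k)]
    return max(abs(a - b) for a, b in zip(cum_h, cum_k))


print(ks_distance([5, 5, 0, 0], [5, 5, 0, 0]))  # identical samples
print(ks_distance([5, 5, 0, 0], [0, 5, 5, 0]))  # shifted by one bin
print(ks_distance([10, 0, 0], [0, 0, 10]))      # fully separated
```

Because the statistic is a maximum over cumulative fractions, it is bounded by 1 and reaches that bound as soon as the two samples no longer overlap.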

Probability Binning

Probability binning (PB) is an extension of the chi-squared statistic for multivariate distributions, enabling the detection of small differences between two populations [66]. Initially described at the International Society for Analytical Cytology meeting in 1994, the statistic was developed specifically to address the shortcomings of applying existing multivariate statistics to flow cytometry data. The PB proposal comprises an efficient method for summarizing non-parametric multivariate distributions and an accompanying test statistic that operates on the summarized data to determine the probability that two multivariate distributions are sampled from the same parent distribution. We will refer to the binning method as adaptive binning and the statistic as the probability binning statistic (PB).

Adaptive Binning. Adaptive binning is a method for dividing k-dimensional data into bins such that all bins contain the same number of events. This strategy requires bins of variable size that "adapt" to the structure of the data. The algorithm begins by calculating the median and variance of each of the k dimensions included in the comparison. Next, we select the dimension j with the maximum variance and divide the data in half along the median value of that parameter, such that each bin contains an equal number of data points. The algorithm proceeds recursively until a pre-defined threshold is met (e.g. minimum number of data points per bin). This results in a collection of k-dimensional hyper-rectangular bins, with each bin containing an equal number of data points. For an illustrative example see Figure 3.1. It should be noted that the adaptive binning approach is simply a special construction of a balanced kd-tree [10].
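The recursion described above can be sketched in a few lines of Python. This is an illustrative version only: it assumes events are tuples of k floats, returns the events in each bin rather than the hyper-rectangle boundaries, and uses a hypothetical `min_events` stopping threshold. Splitting the sorted events at the midpoint stands in for the median split.

```python
from statistics import pvariance


def adaptive_bins(events, min_events=2):
    """Recursively halve events along the highest-variance dimension.

    events is a list of equal-length tuples. Recursion stops when a
    further split would leave a bin with fewer than min_events events.
    """
    if len(events) < 2 * min_events:
        return [events]
    k = len(events[0])
    # Choose the dimension with the largest variance.
    split_dim = max(range(k), key=lambda j: pvariance([e[j] for e in events]))
    # Sorting and cutting at the midpoint divides the data at the median.
    ordered = sorted(events, key=lambda e: e[split_dim])
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[half:]
    return adaptive_bins(lower, min_events) + adaptive_bins(upper, min_events)


# Sixteen events on a small grid, spread more widely in the second dimension.
events = [(float(i % 4), 3.0 * (i // 4)) for i in range(16)]
bins = adaptive_bins(events, min_events=4)
print([len(b) for b in bins])
```

With a power-of-two number of events every bin ends up with the same count; with other sample sizes the counts can differ by one at each split.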

Probability Binning Statistic. The k-dimensional hyper-rectangular bins defined by the control population are applied to a test sample and the normalized chi-squared value (χ²) is calculated as:

\[ \chi^2 = \sum_{i=1}^{\#\mathrm{bins}} \frac{(c_i^n - t_i^n)^2}{c_i^n + t_i^n}, \qquad c_i^n = \frac{c_i}{E^c}, \quad t_i^n = \frac{t_i}{E^t} \tag{3.2} \]

where c_i, t_i are the numbers of control and test events in bin i, and E^c, E^t are the total numbers of events in the control and test samples. The theoretical range of the PB statistic is [0, 2]. The PB statistic is thus a summation over the χ² values between all of the bins for some control and test samples. Using this statistic, the authors propose a "metric" analogous to a t-score to capture the magnitude of the difference between the two samples:

\[ T(\chi) = \max\left(0, \frac{\chi^2 - \bar{\chi}^2}{\sigma_{\chi^2}}\right) \tag{3.3} \]

\[ \bar{\chi}^2 = \frac{B}{E}, \qquad \sigma_{\chi^2} = \frac{\sqrt{B}}{E} \]

[Figure 3.1 panels: successive recursions labeled 1 through 8, each plotted on axes from 0 to 30.]

Figure 3.1: Adaptive binning algorithm for summarizing non-parametric distributions, illustrating n=1-8 recursions (2-256 bins). Note that in regions of high density there are more bins than in the sparsely populated regions.

Baggerly [?] shows that Roederer's T(χ) has a consistent bias, and suggests the following modification:

\[ T_2(\chi) = \frac{\frac{2 n_c n_s}{n_c + n_s} \chi^2 - (B - 1)}{\sqrt{2(B - 1)}} \tag{3.4} \]

where n_c, n_s are the numbers of events in the control and sample. However, even with these modifications, the T(χ) value of Probability Binning is not a true distance metric, an issue which we will examine in detail in Section 4.5.
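The normalized chi-squared of Equation 3.2 is straightforward to compute once both samples have been counted over the same bins. The following Python sketch is a minimal illustration under that assumption; the function name `pb_chi_squared` is introduced here and the counts are made up for the example.

```python
def pb_chi_squared(control_counts, test_counts):
    """Normalized chi-squared of Equation 3.2 over a shared set of bins."""
    total_c, total_t = sum(control_counts), sum(test_counts)
    chi2 = 0.0
    for c, t in zip(control_counts, test_counts):
        cn, tn = c / total_c, t / total_t  # per-bin event fractions
        if cn + tn > 0:                    # skip bins empty in both samples
            chi2 += (cn - tn) ** 2 / (cn + tn)
    return chi2


control = [100, 0, 0, 0]
near = [90, 10, 0, 0]  # 10% of events displaced into the adjacent bin
far = [90, 0, 0, 10]   # the same 10% displaced three bins away

print(pb_chi_squared(control, control))
print(pb_chi_squared(control, near), pb_chi_squared(control, far))
print(pb_chi_squared([10, 0], [0, 10]))  # disjoint samples reach the upper bound
```

The statistic depends only on the per-bin fractions, so the `near` and `far` samples score identically even though the displaced events have moved different distances.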

3.3.2 Distance metrics

Distance metrics offer an alternative to hypothesis testing for comparing two samples. A distance metric directly measures the difference between two things. Instead of asking the question, "Are these two things different?" we ask the related question "How different are these two things?" In this Section we will briefly introduce the fundamentals of distance metrics and explain the rationale for an application to flow cytometry data where one wants to quantitate the difference between two cell populations (for instance, before and after treatment with a drug). While there are natural methods to perform comparison of a single data point with another point, and the comparison of a point with a population distribution, it is not immediately clear how to measure the similarity of two probability distributions [61].

Typically, a distance metric is a function between two objects, for instance the distance between the cities San Francisco, California and Bethesda, Maryland on a map. In the case of FC data we have many objects (cells) and instead want to compare groups of objects (populations of cells) to one another. In the map example, this would be akin to measuring the distance between California and Maryland, where the states are represented as a collection of cities. There are a number of ways to measure a distance between groups. A naive attempt may take the average location of all the cities in California and Maryland and calculate the distance between the two averages. However, it is clear from visual inspection of the geographical shape of California and Maryland that they are not normally distributed shapes, and thus the average city position may be a poor representation of the actual distribution of cities in the state. Alternatively, we might place a grid on top of California and count the number of cities that fall in each bin of the grid, then place the same grid on top of Maryland and do the same, and calculate a χ² value between each of the bins.

Technically, we require a function that operates on two probability distributions and behaves like a measure of distance.

Properties of a distance metric

Formally, a mathematical distance metric is defined by 4 properties:

1. Non-negativity d(x, y) ≥ 0

2. Identity of indiscernibles d(x, y) = 0, iff x = y

3. Symmetry d(x, y) = d(y, x)

4. Triangle inequality d(x, z) ≤ d(x, y) + d(y, z)
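As an illustration, the four properties can be checked numerically on a finite set of points. The sketch below is not part of the original analysis: the function names (`is_metric`, `euclidean`, `squared_euclidean`) are hypothetical, and passing the check on a handful of points does not prove that a function is a metric, although a single violation is enough to show that it is not.

```python
import itertools
import math


def is_metric(d, points, tol=1e-9):
    """Check the four metric properties of d on a finite set of points.

    Returns a dict mapping each property name to True or False.
    """
    pairs = list(itertools.product(points, repeat=2))
    triples = itertools.product(points, repeat=3)
    return {
        # 1. d(x, y) >= 0
        "non_negativity": all(d(x, y) >= -tol for x, y in pairs),
        # 2. d(x, y) = 0 iff x = y
        "identity": all((d(x, y) <= tol) == (x == y) for x, y in pairs),
        # 3. d(x, y) = d(y, x)
        "symmetry": all(abs(d(x, y) - d(y, x)) <= tol for x, y in pairs),
        # 4. d(x, z) <= d(x, y) + d(y, z)
        "triangle": all(
            d(x, z) <= d(x, y) + d(y, z) + tol for x, y, z in triples
        ),
    }


def euclidean(x, y):
    """Ordinary straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Squared distance: a dissimilarity, but not a true metric."""
    return sum((a - b) ** 2 for a, b in zip(x, y))
```

On the collinear points (0, 0), (1, 0) and (2, 0), Euclidean distance passes all four checks, whereas squared Euclidean distance fails the triangle inequality because 4 > 1 + 1.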

In the following Sections, we will examine a few proposals for distances between multivariate distributions and describe properties that are important for application to FC data.

Euclidean distance

Euclidean distance is the ordinary measure of the distance between two points.

$$D(x, y) = \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2} \tag{3.5}$$

Mahalanobis distance

Mahalanobis distance is based on the correlation between variables. Unlike Euclidean distance, Mahalanobis distance is scale-invariant and takes into account the correlations of the data. It is primarily used to define the distance of a point from a distribution by estimating the probability that a point $y_i$ belongs to the distribution $X$:

$$D(X, y_i) = \sqrt{(\bar{X} - y_i) \cdot S_X^{-1} \cdot (\bar{X} - y_i)^T} \tag{3.6}$$

where $S_X$ is the covariance matrix of $X$. Whereas Euclidean distance is a measure between two points, Mahalanobis distance is a measure of the distance between a point and a distribution. Aghaeepour et al. [2] (Equation 3.7) extend the Mahalanobis distance to operate on two distributions:

$$D(X, Y) = \min \left\{ \sqrt{(\bar{X} - \bar{Y}) \cdot S_X^{-1} \cdot (\bar{X} - \bar{Y})^T},\ \sqrt{(\bar{X} - \bar{Y}) \cdot S_Y^{-1} \cdot (\bar{X} - \bar{Y})^T} \right\} \tag{3.7}$$

where $S_X$ and $S_Y$ are the covariance matrices for $X$ and $Y$. This measure does not satisfy the triangle inequality.
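Equations 3.6 and 3.7 can be sketched in a few lines for the two-dimensional case. This is a minimal illustration rather than the implementation used in [2]: it assumes 2-D data so that the covariance matrix can be inverted by hand, it uses the sample covariance with an n − 1 denominator, and all function names are hypothetical.

```python
import math


def mean_vector(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]


def covariance_2d(points):
    """Sample covariance matrix (n - 1 denominator) of 2-D points."""
    n = len(points)
    mx, my = mean_vector(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def quadratic_form_inv(diff, cov):
    """Return diff * cov^-1 * diff^T for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    u, v = diff
    return (u * (inv[0][0] * u + inv[0][1] * v)
            + v * (inv[1][0] * u + inv[1][1] * v))


def mahalanobis_point(X, y):
    """Equation 3.6: distance from the point y to the distribution X."""
    m = mean_vector(X)
    diff = (m[0] - y[0], m[1] - y[1])
    return math.sqrt(quadratic_form_inv(diff, covariance_2d(X)))


def mahalanobis_min(X, Y):
    """Equation 3.7: the smaller of the two mean-to-mean distances."""
    mx, my = mean_vector(X), mean_vector(Y)
    diff = (mx[0] - my[0], mx[1] - my[1])
    return min(
        math.sqrt(quadratic_form_inv(diff, covariance_2d(X))),
        math.sqrt(quadratic_form_inv(diff, covariance_2d(Y))),
    )
```

Rescaling one axis of both the distribution and the query point leaves `mahalanobis_point` unchanged, which is the scale invariance noted above. Taking the minimum over the two covariance matrices makes `mahalanobis_min` symmetric in its arguments.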

3.3.3 Discussion

A central issue of this Chapter is the choice of a distance metric to quantify differences between populations where others have used hypothesis-test statistics. There are several advantages to using a distance metric that is a true metric in the formal sense described in 3.3.2. Informal metrics, such as the T(χ) used in Probability Binning or the Mahalanobis distance according to [2], have shortcomings that make the results difficult to interpret. One such shortcoming is that they cannot be used to rank samples based on their difference from a control, because they do not satisfy the triangle inequality. While such a measure can make a statement about the similarity of a control c0 and two test samples t1 and t2, no statement about the similarity of t1 and t2 can be directly inferred. Conversely, a true distance metric allows for a statement about t1 and t2 based on their joint comparison with c0.

Panel (a) of Figure 3.2 shows two normal distributions: a large population (black) and a smaller population (green). The green population starts with its mean at the same position as the black population and moves along the x axis in fixed increments (2 standard deviations) in each of the successive panels. At each step, we calculate the PB statistic and the Earth Mover’s Distance (a distance metric described in detail in 3.4) between the “unstimulated” first panel in (a) and the joint distribution of the main (black) population with the stimulated (green) population. As the green population moves further from the black population, both the PB statistic and the EMD initially increase. However, when the green population gets past 2 standard deviations from the black population, the PB statistic plateaus, as the two distributions have reached “maximum” separation based on the PB statistic. No additional movement of the green population can provide further evidence for the alternative hypothesis that these two populations are different. This observation is noted in [67]: “Once the separation is such that there is no more overlap between the positive and negative events (i.e., more than two standard deviations apart), T (χ) no longer increases with increasing separation.” Conversely, EMD continues to increase linearly with the growing separation of the green population.

[Figure 3.2 panels: (a) scatter plots (i)-(v) with the green population at 0, 2, 4, 6 and 8 stdevs from the black population; (b) Probability Binning, T(χ) versus distance between black and green population (stdevs); (c) Earth Mover’s Distance, EMD versus distance between black and green population (stdevs).]

Figure 3.2: As the smaller green population moves further from the center of the main (black) population, probability binning plateaus, while EMD continues to increase monotonically

Say we examine two samples: α at 6 stdevs from the control (Figure 3.2, a.iv) and β at 8 stdevs from the control (Figure 3.2, a.v). The EMD values in the interval between α and β increase, while the T(χ) values in the same interval decrease (Figure 3.2). If we use T(χ) as a metric for sorting this list of samples, we would conclude that β (Figure 3.2, a.v) was more similar to the control than α (Figure 3.2, a.iv).

While the data for Figure 3.2 were generated synthetically, one can imagine an experiment in which increasing amounts of a drug are applied, causing a subset of cells to increase expression of a marker according to the amount of the drug. In order to correlate the amount of drug with the level of expression in a reliable fashion, one needs a true distance metric to measure the magnitude of the change in distributions.

3.4 Earth Mover’s Distance

3.4.1 Overview

Earth Mover’s Distance (EMD) is a mathematical measure of the distance between two samples (e.g., between an unstimulated aliquot of a blood sample from a patient with a diagnosed allergy and an aliquot of the same sample stimulated with the offending allergen). If one informally interprets FACS data revealing marker expression on various cell subsets from each sample as a multivariate histogram, and thinks of the two histograms as two piles of dirt in multivariate space, then the EMD between the two samples is the minimum cost of moving one pile into the other. Here cost is defined as the amount of dirt moved times the distance by which it is moved. Thus the biological interpretation of the EMD between two FACS samples involves both the proportion of cells whose marker expression has changed (i.e., cells that differ between the two samples) and the distance by which they differ. Small shifts, which are commonly observed due to instrumentation error or calibration, produce a small EMD that defines the threshold above which EMD values can be taken as meaningful. This insensitivity to small shifts is a desirable property that does not hold for distances based on measures of statistical significance, such as Probability Binning, where a small shift between two samples may be highly significant.
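The "amount of dirt moved times the distance" interpretation is easiest to see in one dimension. The following sketch assumes two 1-D samples of equal size in which every event carries the same weight; in that special case, matching the sorted values in order gives the minimum-cost flow, so no optimization is needed. The function names are hypothetical, and the multivariate case requires the linear program described in Section 3.4.2.

```python
def emd_1d(xs, ys):
    """EMD between two equal-sized 1-D samples, each event weighing 1/n.

    In one dimension the cheapest way to move one pile into the other is
    to match the sorted values in order.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("this sketch assumes equal, non-zero sample sizes")
    n = len(xs)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / n


def shifted_sample(control, fraction, shift):
    """Copy of control with the first `fraction` of events moved by `shift`."""
    k = round(fraction * len(control))
    return [v + shift for v in control[:k]] + list(control[k:])
```

With a control of ten identical events, moving 20% of them by 5 units gives an EMD of 0.2 × 5 = 1, and doubling either the fraction or the shift doubles the EMD. A small shift applied to every event yields an EMD equal to that shift, regardless of how many events were measured.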

3.4.2 Algorithm

Signatures

For efficiency, distributions are summarized by signatures, a compact histogram-like approximation of the data. Whereas histogram bins specify a range of values (e.g., 0-10, 11-20, ...) that fall in each bin, a signature specifies the mean of the values in the bin. Formally, a signature sj = (mj, wmj) is represented by the mean mj of a group of observations j, and the fraction wmj of observations that belong to the group j.

Computing the Earth Mover’s Distance

EMD only requires the specification of a ground distance, the function used to compute the distance between two sample points. For instance, in the case of coordinates on a map, the ground distance might measure the distance between two cities on the map. In computing the EMD we generalize this ground distance, which operates on basic features, to the case of a collection (distribution) of features to get a measure of the distance between two distributions.

Formally, the Earth Mover’s Distance can be stated in terms of a linear programming problem. Let two distributions be represented by signatures $P = \{(p_1, w_{p_1}), \ldots, (p_m, w_{p_m})\}$ and $Q = \{(q_1, w_{q_1}), \ldots, (q_n, w_{q_n})\}$, where $p_i$ and $q_j$ are the centers of mass of the bins with frequencies $w_{p_i}$ and $w_{q_j}$, and let $D = [d_{ij}]$ be the ground distance matrix containing the distance between $p_i$ and $q_j$ for all $i, j$. We want to find a flow $F = [f_{ij}]$ between $p_i$ and $q_j$ that minimizes the total cost:

$$\mathrm{Cost}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij} \tag{3.8}$$

subject to the constraints:

$$
\begin{aligned}
f_{ij} &\geq 0, & 1 \leq i \leq m,\ 1 \leq j \leq n & \qquad (1) \\
\sum_{j=1}^{n} f_{ij} &\leq w_{p_i}, & 1 \leq i \leq m & \qquad (2) \\
\sum_{i=1}^{m} f_{ij} &\leq w_{q_j}, & 1 \leq j \leq n & \qquad (3) \\
\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} &= \min\left( \sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{q_j} \right) & & \qquad (4)
\end{aligned}
$$

Constraint 1 ensures that mass is only transported in one direction (e.g., from the source sample to the destination sample). Constraints 2 and 3 limit the amount of mass that can be moved from/to a given signature bin to their respective weights, and constraint 4 ensures that the maximum possible amount of mass is moved. Solving the linear programming problem determines the optimal flow F between the source and destination signatures subject to constraints 1-4. The Earth Mover’s Distance can then be defined as a function of the optimal flow F = [fij] and the ground distance D = [dij]:

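$$\mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}} \tag{3.9}$$

In the case of signatures with the same total mass, which can always be ensured by normalizing each of the two samples to have mass = 1, the EMD is a true metric for distributions (provided the ground distance is itself a metric) and is equivalent to the Mallows distance [54], as demonstrated by Levina et al. [47]. When P and Q have the same total mass, constraints (2), (3) and (4) can be simplified to

$$
\begin{aligned}
\sum_{j=1}^{n} f_{ij} &= w_{p_i}, & 1 \leq i \leq m & \qquad (5) \\
\sum_{i=1}^{m} f_{ij} &= w_{q_j}, & 1 \leq j \leq n & \qquad (6) \\
\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} &= \sum_{i=1}^{m} w_{p_i} = \sum_{j=1}^{n} w_{q_j} = 1 & & \qquad (7)
\end{aligned}
$$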

Thus, the EMD applied to probability distributions has a clear probabilistic interpretation in the Mallows distance. In the applications described here, we ensure equal mass of two samples but retain the Earth Mover’s Distance terminology because we found the metaphor of moving piles of dirt a useful tool in illustrating the intuition of the metric to biologists. EMD can be computed quickly using the Hungarian algorithm [44].
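For intuition, the equal-mass case can be sketched without a linear programming solver. The code below assumes two signatures with the same number of bins, each bin carrying weight 1/n; under that assumption an optimal flow can always be chosen as a one-to-one assignment of bins, which is also why an assignment solver such as the Hungarian algorithm applies. The brute-force search over all assignments is used here only because it is short and easy to verify; it is not the implementation used in this work, it is impractical beyond a handful of bins, and the function names are hypothetical.

```python
import itertools
import math


def ground_distance(p, q):
    """Euclidean ground distance between two bin centers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def emd_equal_weight(P, Q):
    """EMD between two signatures with n equally weighted bins each.

    P and Q are lists of bin centers; every bin carries weight 1/n.
    Searches all one-to-one assignments of bins (O(n!)), so it is only
    meant for tiny illustrative inputs.
    """
    n = len(P)
    if n == 0 or len(Q) != n:
        raise ValueError("this sketch assumes equal, non-zero bin counts")
    D = [[ground_distance(p, q) for q in Q] for p in P]
    best = min(
        sum(D[i][perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    # Each assigned pair moves mass 1/n and the total flow is 1, so
    # Equation 3.9 reduces to the total cost of the optimal flow.
    return best / n
```

For example, moving the two-bin signature with centers (0, 0) and (1, 0) onto the signature with centers (1, 0) and (2, 0) costs 1.0 whichever assignment is chosen, and the result is the same when the roles of the two signatures are swapped.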

3.5 Materials

3.5.1 Nut allergy human blood samples

We utilize DBM described in Chapter 2 to assign individual cells to clusters based on measurements in two channels and compute the EMD between the clusters. The only parameter tuned for the study presented here is percent outliers (Q), which defines the percentage of events that do not get assigned to a cluster. If this value is set to 0, every event is assigned to a cluster. For this analysis, Q is set at 0.5% to obtain a reasonable number of clusters for the population of interest. All other parameters are set according to statistical theory of density estimation as described in Section 2.4.

[Figure 3.3 panels: Unstimulated, Non-offending allergen, Offending allergen; axes: Dump and CD123.]

Figure 3.3: 2-dimensional DBM clustering on the patient samples. The basophil cluster was selected using a simple heuristic: the cluster with the highest median expression in the CD123 channel, and the lowest median expression in the dump channel. Arrows indicate the selected basophil cluster.

For each patient, we performed DBM clustering on each of 3 blood samples: (i) unstimulated; (ii) stimulated with a non-offending allergen; and (iii) stimulated with an offending allergen. DBM is computed for each sample using two channels: a dump channel containing lineage markers for cells that are to be excluded from subsequent analysis (CD3, CD16, CD19, CD66b, HLADR) and CD123, a marker on the surface of mature basophils. DBM clustering on the patient samples results in ∼9 clusters per sample for the allergic patients and ∼6 clusters per sample in the matched healthy controls. The basophil cluster was selected using a simple heuristic: the cluster with the highest median expression in the CD123 channel, and the lowest median expression in the dump channel (see Figure 3.3). This heuristic was determined by reviewing the manual gating strategy used to analyze these data.

3.5.2 Spleens from 3 mouse strains

Mouse spleen cells were stained for myeloid markers for each of 3 genetically distinct strains of mouse: BALB/c, c57 and RAG -/-. A second BALB/c sample was run as a Fluorescence Minus One (FMO) control in the study. FMO is a control used to precisely identify cells that express a specific marker, and is typically used when expression of that marker is low. The FMO stain is performed on cells from the same sample and contains all of the reagents in the primary assay except the reagent for the given marker of interest. We use the FMO control here as a pseudo-replicate by excluding the channel missing from the FMO sample from the analysis (leaving the same stain on the remaining channels).

3.6 Results

In the following experiments, we show how EMD can be used to identify differences in FACS samples based on several different measurements, rather than differences based on any single measurement or combination of single measurements.

3.6.1 Application to spleens from 3 mouse strains

We ranked samples from 3 mouse strains labelled with the same assay (described in Section 3.5.2) based on their similarity to a control sample. The genetics of the mice provide a notion of the “correct” ordering (BALB/c → BALB/c replicate → c57 → RAG). Using the BALB/c mouse as the “control”, we compute the EMD between the control and each of the 3 other samples using 2 markers, CD11c and F4/80, and display the results in Figure 3.4. In addition, we compute the intra-sample EMD by splitting the same data file in half using 2 approaches: alternating rows and split in half. These intra-sample measurements give an indication of the baseline variability (or the “null distance”, D0).

Figure 3.4: Mouse spleen cells from two samples of the same BALB/c (a,b), c57 and RAG mice. Values give the EMD between the first BALB/c sample (control) and each of the other 3 samples.

As expected, the sample with the smallest distance to the control is the replicate. The next closest sample to the control is c57, which is about 10 times D0. The furthest sample from the control is RAG -/-, at 30 times D0. The results coincide with our existing notion of “similarity” based on the genetics of the 4 samples from 3 mouse strains.
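The null-distance calculation can be sketched as follows. This is an illustration under stated assumptions, not the code used for Figure 3.4: `emd` stands for any function that returns the EMD between two lists of events, the function names are hypothetical, and an odd trailing event is simply dropped so that both halves have the same size.

```python
def null_distance(events, emd):
    """Intra-sample EMD from two ways of halving a single sample.

    Returns the EMD for the alternating-rows split and for the
    first-half/second-half split, in that order.
    """
    n = len(events) - len(events) % 2  # drop one event if the count is odd
    alternating = emd(events[0:n:2], events[1:n:2])
    halves = emd(events[: n // 2], events[n // 2 : n])
    return alternating, halves


def fold_over_null(control, sample, d0, emd):
    """Express the control-to-sample EMD as a multiple of the null distance."""
    if d0 <= 0:
        raise ValueError("null distance must be positive")
    return emd(control, sample) / d0
```

Reporting distances as multiples of D0, as in the text above, places the baseline variability of the instrument and the binning on the same scale as the biological differences being compared.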

3.6.2 Application to human blood samples for nut allergy

In this Section we describe the results of studies applying EMD to human blood samples stimulated with allergens from patients with diagnosed nut allergy (Section 3.5.1). Given defined populations of basophils before (unstimulated) and after stimulation, we measure the dissimilarity between the two populations. Previous results show that after activation with nuts, two basophil surface markers, CD203c and CD63, are upregulated in patients with diagnosed allergies [23]. Our goal is to quantify the increase after stimulation to differentiate allergic response. We selected the basophil cluster with DBM as described in Section 3.5.1. We convert the binned distribution (Figure 3.5) into a sample signature as described in Section 3.6.3.

[Figure 3.5 panels: Unstimulated (1524 cells, 64 bins), Non-offending allergen (1575 cells, 64 bins), Offending allergen (1392 cells, 64 bins); axes: CD203c and CD63.]

Figure 3.5: Expression of CD203c and CD63 on the surface of basophils for a single patient showing probability contours (top) and the resulting binning (bottom)

For each patient sample (n=13) and healthy control (n=9), we calculate the EMD between the unstimulated control and samples stimulated with (i) a non-offending allergen and (ii) an offending allergen, using Euclidean distance as the ground distance. We compute the EMD fold change in the CD63 and CD203c dimensions as a ratio:

$$\mathrm{EMD}_{fc} = \frac{\mathrm{EMD}(c, p)}{\mathrm{EMD}(c, q)} \tag{3.10}$$

where c is the unstimulated control, p is the sample stimulated with the offending allergen, and q is the sample stimulated with the non-offending allergen. The ratio normalizes the shift in the response to the offending allergen by the response to the non-offending allergen.

[Figure 3.6 panels: (a) EMD fold change by status, allergic (n=13) versus non-allergic (n=9); (b) ROC curve, sensitivity (true positive) versus 1-specificity (false positive), area under curve = 0.99.]

Figure 3.6: (a) The EMD fold change ratio of allergic patients versus non-allergic patients. Each point represents a single patient in the study. Allergic patients display a 6-fold increase in the EMD fold change ratio for activation markers only in basophils, compared to a 1-fold increase in non-allergic patients. (b) Receiver Operating Characteristic (ROC) curve over a range of EMD fold change values (0-6). This measures the frequency of true positives (patients with diagnosed allergies that we predict have allergies) and false positives (patients with no diagnosed allergies that we predict have allergies) given different values of EMD fold change.

We fit a nominal logistic regression model to predict allergic status based on EMD fold change. Figure 3.6 illustrates the Receiver Operating Characteristic (ROC) curve of the model with an Area Under Curve (AUC) of 0.99. Based on this model, we define a threshold at 2.3 to distinguish an allergic response with high sensitivity (92%) and specificity (100%).
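The fold change of Equation 3.10 and the effect of a fixed threshold can be sketched as below. The sketch assumes that a sample is called allergic when its fold change is strictly greater than the threshold (the text does not state how ties are handled), it does not reproduce the logistic regression fit, and the function names are hypothetical.

```python
def emd_fold_change(emd_offending, emd_non_offending):
    """Equation 3.10: EMD(c, p) / EMD(c, q)."""
    if emd_non_offending <= 0:
        raise ValueError("EMD(c, q) must be positive")
    return emd_offending / emd_non_offending


def sensitivity_specificity(fold_changes, is_allergic, threshold=2.3):
    """Sensitivity and specificity when calling fold change > threshold allergic."""
    calls = [fc > threshold for fc in fold_changes]
    pairs = list(zip(calls, is_allergic))
    tp = sum(1 for c, a in pairs if c and a)
    fn = sum(1 for c, a in pairs if not c and a)
    tn = sum(1 for c, a in pairs if not c and not a)
    fp = sum(1 for c, a in pairs if c and not a)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one allergic and one non-allergic sample")
    return tp / (tp + fn), tn / (tn + fp)
```

Sweeping `threshold` over the observed range of fold changes and recording the resulting (1 − specificity, sensitivity) pairs traces out an empirical ROC curve of the kind shown in Figure 3.6(b).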

3.6.3 Performance evaluation

In the following Sections we empirically examine the performance characteristics of the EMD algorithm, including the effect of binning parameters and sample size on EMD values.

Comparison to other metrics

While the EMD has theoretical advantages over other methods proposed in the literature for population comparison, we carried out an empirical comparison to evaluate the applied performance of EMD compared to other “metrics” on the human nut allergy data (Section 3.5.1). We replicated the analysis described in Section 3.6.2 using Probability Binning (PB) and Mahalanobis distance in place of EMD. Figure 3.7 shows the EMD (left), PB (center) and Mahalanobis distance (right) fold changes, displayed as box plots. Mahalanobis distance was calculated using the semi-metric defined by Aghaeepour et al. (Equation 3.7). Results show that while the EMD clearly differentiates between the allergic and the non-allergic samples, PB and Mahalanobis distance show a large overlap between the two groups and are not able to differentiate them with sensitivity and specificity comparable to EMD.

[Figure 3.7 panels: Earth Mover's Distance (EMD fold change by status), Probability Binning (PB fold change by status), Mahalanobis Distance (MH fold change by status); status is allergic or non-allergic.]

Figure 3.7: Fold change calculated as the ratio of EMD between unstimulated sample and sample stimulated with offending allergen, normalized by the EMD between unstimulated sample and sample stimulated with non-offending allergen. Samples are grouped by allergic and non-allergic individuals, and metrics computed for: Earth Mover’s Distance, Probability Binning metric, and Mahalanobis distance.

Binning

Recall from Section 3.4.2 that the EMD algorithm operates on signatures, which are histogram-like approximations of the data. In the original application the algorithm was used for content-based image retrieval [71], and signatures were generated by clustering the color distributions of pixels in an image and using the average cluster color mj and the cluster frequency wmj to generate the signature {sj = (mj, wmj)} of an image.

Generating a signature from a flow cytometry data sample requires a different approach than the one used for images. In the case of comparing two cell populations (as in Section 3.6.2) we expect that the cell populations are already defined, either by manual gating or by an automated algorithm. Therefore, further clustering data that likely contain only a single mode would yield a very coarse description of the data. Instead, we bin the data using adaptive binning (3.3.1) and use the resulting bins to generate a population signature.

We follow the algorithm of Roederer [66] for binning the data into groups that will be transformed into a signature. Adaptive binning is a good method for summarizing the data because it is both scalable to high dimensions and non-parametric. We begin by calculating the variance of the data in each of the parameters included in the analysis. Then, we choose the dimension with the largest variance and split the events into 2 bins at the median value in that dimension, such that half of the events fall in each of the 2 resulting bins. The algorithm proceeds recursively on each of the 2 resulting bins, choosing the dimension with the largest variance and splitting the data about the median value in that dimension. The result is a series of n-dimensional hyper-rectangular bins, each containing an equal number of events. An illustration of the binning can be seen in Figure 3.1.

To turn the bins into a signature for EMD, we first summarize the bins. Because of the adaptive binning procedure, each of the bins contains an equal number of events, and the bins are thus weighted equally. Therefore we need only compute the bin “centers” with which to create a signature for the EMD algorithm. Here we compare two methods for calculating bin “centers.”

Bin centroids Each bin is represented by an n-dimensional hyper-rectangle with boundary b = {(UR, LL)1,..., (UR, LL)n}, where UR, LL are n-dimensional vectors such that UR[i] is the position of the upper right corner of the bin boundary in dimension i, and LL[i] is the position of the lower left corner of the bin boundary in dimension i. The centroid of a bin is the midpoint between these two corners.

Algorithm 5 Bin Centroid
    V ← 0
    for each b ∈ bins do
        boundary = binBoundary(b)
        V[b] = (boundary.UR + boundary.LL)/2
    end for
    return V

The resulting n-dimensional vector V is of length #bins where V[i] is the centroid of bin i.

Bin center of mass For each bin represented by an n-dimensional hyper-rectangle, calculate the mean position of all the points belonging to that bin. The function points(b) returns all of the data points falling in bin b.

Algorithm 6 Bin center of mass
    V ← 0
    for each b ∈ bins do
        p = points(b)
        V[b] = mean(p)
    end for
    return V

The resulting n-dimensional vector V is of length #bins where V[i] is the center of mass of bin i.

A comparison of the two bin summary methods is shown in Figure 3.8 for a single unstimulated basophil sample using 128 bins. In the high density region at the center of the distribution, the two methods produce consistent results. However, at the periphery of the distribution, where the bins are large, the difference between the bin centroid and the bin center of mass increases. The exterior boundaries of the bins are set based on the minimum and maximum values of the data in each dimension. An extreme outlier will create a large bin in which much of the area is not occupied, skewing the binned representation.

[Figure 3.8 panels: Adaptive binning (1524 cells, 128 bins); Bin center vs. bin center of mass (128 bins).]

Figure 3.8: (A) single human basophil sample binned according to adaptive binning (described in Section 3.6.3) (B) bin summaries using bin centroids and bin center of mass

Conversely, taking the center of mass of the bin yields a centroid that is robust to outliers and more closely resembles the underlying distribution (i.e. fewer “binning” artifacts). All EMD values reported here utilize signatures based on the center of mass method.
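The adaptive binning procedure and the center-of-mass summary described in this Section can be sketched together. This is a simplified illustration, not the implementation used for the reported results: it assumes the number of bins is a power of two, events are given as equal-length tuples, ties at the median are broken by sort order, and with an odd number of events the two halves differ in size by one. The function names are hypothetical.

```python
import statistics


def adaptive_bins(events, n_bins):
    """Recursively split events at the median of the highest-variance dimension.

    Returns a list of n_bins bins, each a list of events.
    """
    if n_bins < 1 or n_bins & (n_bins - 1):
        raise ValueError("n_bins must be a power of two")
    if len(events) < n_bins:
        raise ValueError("need at least one event per bin")
    if n_bins == 1:
        return [list(events)]
    n_dims = len(events[0])
    # Choose the dimension with the largest variance among these events.
    split_dim = max(
        range(n_dims),
        key=lambda k: statistics.pvariance([e[k] for e in events]),
    )
    # Sorting and cutting at the midpoint splits the events at the median.
    ordered = sorted(events, key=lambda e: e[split_dim])
    half = len(ordered) // 2
    return (adaptive_bins(ordered[:half], n_bins // 2)
            + adaptive_bins(ordered[half:], n_bins // 2))


def center_of_mass(bin_events):
    """Mean position of the events in one bin (Algorithm 6)."""
    n = len(bin_events)
    n_dims = len(bin_events[0])
    return tuple(sum(e[k] for e in bin_events) / n for k in range(n_dims))


def signature(events, n_bins=64):
    """Signature as (center of mass, weight) pairs, one per adaptive bin."""
    bins = adaptive_bins(events, n_bins)
    total = len(events)
    return [(center_of_mass(b), len(b) / total) for b in bins]
```

Because every bin holds the same share of the events, the weights in the resulting signature are all equal, so only the bin centers distinguish one signature from another.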

Number of bins To assess the sensitivity of EMD to binning parameters, we analyzed two datasets from the nut allergy study (Section 3.5.1) for a range of bin counts. Recall that absolute basophil counts for this study ranged from 1,000-1,500 for allergic patients. We bin the data using n = {4, 8, 16, 32, 64, 128, 256} bins and plot the results of EMD between the unstimulated control and samples stimulated with offending and non-offending allergens for 2 individuals in Figure 3.9. Over the range of n, EMD values are consistent and clear separation between stimulation with offending and non-offending allergens is maintained. Choosing the appropriate number of bins may still depend on the application. Detecting movements in small populations of cells with low frequency may require finer binning than comparisons of larger populations, but overall these data suggest that the method is robust to the choice of the number of bins.

[Figure 3.9 panels: (a) Allergic, 100216; (b) Allergic, 100220. Each panel plots EMD versus number of bins for the offending and non-offending allergen.]

Figure 3.9: Number of bins does not affect the ability to distinguish groups

Running time

Running time increases linearly with the number of bins as shown in Figure 3.10 (note that the x-axis is a log scale). In 3.6.3, we saw that the number of bins does not have a significant effect on the metric. Based on these two results, we chose a value of 64 bins for all analyses in this work unless otherwise noted. 64 bins proved to be a good balance of running time and binning resolution. Using 64 bins, the running time for the mouse samples containing 200,000 cells was ∼12 seconds, and for the human basophil samples containing ∼1500 cells it was ∼2 seconds. All running time calculations were performed on a 2.53 GHz Intel Core 2 Duo with 4 GB of RAM running Mac OS X 10.6.6. The EMD algorithm is implemented in C++, but timing was performed via an external library interface to the R environment, using the R function system.time(), which adds overhead to the actual algorithm running time.

[Figure 3.10 panels: (a) BALB/c, C57, RAG mouse strains, running time (s) versus number of bins, average of 3 runs, 200k events; (b) Human basophils (before/after stimulation), running time (s) versus number of bins, 1 comparison, ~1500 events.]

Figure 3.10: Effect of number of bins on the running time. The x-axis is on a log scale.

Sample size

Here we examine the effect of sample size on the EMD metric. Random samples were drawn from samples stimulated with offending and non-offending allergens, and compared with a random sample of the same size from the unstimulated control. Samples of size n = {100, 200, 400, 800, 1000} were drawn and the results plotted in Figure 3.11. The result is robust in the number of cells collected. Good separation exists down to 100-200 cells per sample. Below this threshold results become less reliable for consistently distinguishing allergic response across the entire study. This minimum basophil threshold was independently confirmed by the original study authors [23].
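The subsampling procedure can be sketched as follows. This is an illustration only: `emd` stands for any function that returns the EMD between two lists of events, the seed and function name are hypothetical, and requested sizes larger than either sample are skipped rather than sampled with replacement.

```python
import random


def emd_at_sample_sizes(control, stimulated, sizes, emd, seed=0):
    """EMD between equal-sized random subsamples of two samples.

    Returns a dict mapping each usable sample size to the EMD between a
    random subsample of `control` and one of `stimulated`.
    """
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        if n > len(control) or n > len(stimulated):
            continue  # not enough events to draw without replacement
        results[n] = emd(rng.sample(control, n), rng.sample(stimulated, n))
    return results
```

Repeating the call with several seeds for each size gives a sense of how much of the variation at small n is due to the random draw itself.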

[Figure 3.11 panel: Effect of cell count on EMD; EMD versus sample size for the offending and non-offending allergen, sample 100220.]

Figure 3.11: Effect of sample size on EMD metric for stimulated basophils

3.7 Discussion

One of the primary tools for interrogating differences between two samples is statistical hypothesis testing. Hypothesis testing aims to detect the presence of an effect or a difference under different conditions [80]. These tests are used to scan the data and tease out relationships between subjects in the presence of a drug, disease or stimulation. Hypothesis testing aims at establishing whether two samples are significantly different, where significance refers to the statistical meaning of sufficient evidence. There are 3 factors that affect the power of a statistic to identify differences: sample size, population variance and effect size. If one fixes effect size and population variance for a given experiment and changes only the sample size, one can tweak the significance values to find significant differences in data that otherwise appear similar.

To illustrate this principle we compare two hypothetical patient groups (n=100,000 each), treated with placebo and drug (Figure 3.12). The outcome measure is systolic blood pressure, which differs between the two groups by 0.5%. We randomly sample k patients from each group and, using a two-sample t statistic, calculate a p-value under the null hypothesis that there is no difference in systolic blood pressure between the two groups. This trial is performed with 1,000 random samples of size k from the original patient groups, yielding a distribution of p-values. Performing these trials over a range of values for k demonstrates the behavior of statistical significance in the presence of varying levels of evidence. When k=100, we detect the modest difference between the two groups at an arbitrary threshold of statistical significance (p < 0.05) only 5.2% of the time. As we increase k, the distribution of p-values shifts such that more and more trials become significant. In the final panel (k=10,000), roughly 40% of the trials demonstrated statistically significant differences. A statistically significant difference absent in smaller trials suddenly reveals itself in the large trials.
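A simulation in the spirit of this experiment can be sketched with the standard library. It is not the code used to produce Figure 3.12 and is not expected to reproduce its exact percentages: the population mean (120) and standard deviation (25) are assumptions, since the text gives only the 0.5% difference; events are drawn directly from normal distributions rather than from fixed populations of 100,000; and the p-value uses a normal approximation to the two-sample t statistic, which is reasonable only for large k. The function names are hypothetical.

```python
import math
import random


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    z = (mean_a - mean_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def fraction_significant(k, trials=1000, mean=120.0, sd=25.0,
                         relative_effect=0.005, alpha=0.05, seed=0):
    """Fraction of trials with p < alpha when sampling k patients per group."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        placebo = [rng.gauss(mean, sd) for _ in range(k)]
        drug = [rng.gauss(mean * (1 - relative_effect), sd) for _ in range(k)]
        if two_sample_p_value(placebo, drug) < alpha:
            hits += 1
    return hits / trials
```

Calling `fraction_significant` for increasing values of k, with the effect size and variance held fixed, shows the fraction of significant trials rising with sample size alone.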

[Figure 3.12 panels: histograms of p-values (frequency versus p-value) for sample sizes 100, 1000, 5000 and 10000; percentage of trials with p < .05: 5.2, 8.4, 20.7 and 39.2, respectively.]

Figure 3.12: Increasing sample size increases statistical significance, or why NOT to use p-values in place of distance metrics

This observation is not novel. It follows directly from elementary statistical power and sample size calculations. When the number of samples measured is large, as is the case with most high-throughput technologies, the opportunity for statistically significant differences is large. Given enough examples, there is enough evidence to find small, statistically significant differences in most data. Thus we are left with the problem of untangling statistical significance from biological significance. Two groups can be significantly different in the statistical sense without being different in a biologically meaningful way. It simply means that there is strong evidence that the two samples are different, even if the physical difference (e.g., increase in response) is very small.

As demonstrated in the simulation shown in Figure 3.12, large evidence in support of a difference between two samples has a very different meaning from a large difference between two samples. Thus the common practice in bioinformatics of ranking candidate items (genes, SNPs, drugs) based on the statistical significance of the association biases the results towards items with more evidence. Ranking candidate items based on a measure of their difference prioritizes large effects over small effects, but is not driven strictly by the amount of evidence available.

Chapter 4

Evaluating Automated Gating Algorithms

4.1 Introduction

Criteria used for the gold standard in evaluating automated gating algorithms for flow cytometry broadly fall into 2 categories: synthetic data and manual expert gates. Synthetic data are generated from pre-determined distributions to simulate flow cytometry data, as seen in Section 2.6.3. The primary advantage of using synthetic datasets for evaluation is that the truth about cluster membership is well-defined. Since we have crafted the dataset, it is trivial to add labels that identify each event based on the underlying distribution from which it was generated. Thus, it is straightforward to compute accuracy measurements given the true membership of all the events. Given data with true cluster assignments, the evaluation task becomes identical to classification evaluation.

While synthetic datasets are easy to generate and evaluate, they suffer from an inherent limitation: how do we generate distributions that are faithful to real FC data without biasing towards a particular solution? One can easily imagine constructing a dataset for which any particular algorithm will succeed. Conversely, one can imagine constructing a dataset for which any particular algorithm will fail. Thus, the reader is left to decide whether or not the assumptions used to generate the data are appropriate for their application.

The second type of evaluation is to compare clustering results to results obtained via traditional manual gating performed by an expert (described in 2.3.1 and shown in our own work in Section 2.6.1). In this approach, the same data are manually gated by an experienced researcher and clustered by an algorithm. The expert-defined populations are treated as the true cluster labels and compared to the automated procedure. The strength of this approach is that it allows for a direct comparison between algorithm and experts on real data. However, treating the expert labels as the truth makes it difficult to assess improvement over manual gating, which is one of the goals of this area of research. Even if we are to accept manual gating as a gold standard, there are very few estimates regarding inter-expert agreement in manual gating (see, for instance, [53]). Without such data about expert/expert variability, it is impossible to make meaningful statements about expert/algorithm variability.

Both of these evaluation criteria play an important role in assessing the practical and theoretical properties of automated gating algorithms, yet neither offers an objective measure of algorithm performance on real flow cytometry data. In this Chapter we propose a new approach to evaluating automated gating algorithms, leveraging improvements in instrument technology to generate multi-parameter datasets in which the true cluster labels are contained in the dataset.

Section 4.2 presents background on existing evaluation criteria and associated pitfalls. In Section 4.3 we describe the rationale and design of an experiment for Cluster Evaluation Using Hidden Labels (CEHL). We present preliminary results applying the Density Based Merging algorithm (Section 2.4) to the CEHL data and conclude with a discussion of the current state of automated gating evaluation.

4.2 Background

The literature is replete with methods for validating cluster results, originating both in Statistics ([29], [18]) and Information Retrieval ([55]). The approaches broadly fall into two categories: internal validation criteria, which measure how well the clusters maximize intra-cluster similarity and minimize inter-cluster similarity, and external criteria, which compare the results to an external benchmark. Since high scores on an internal cluster validity measure do not necessarily imply effective results for a particular application ([55]), external measures are a more reliable tool for evaluating application-specific performance. The difficulty is in identifying a reliable external benchmark.

4.2.1 Evaluation using synthetic data

Lo et al. [50] use simulation studies to evaluate a mixture-model-based clustering technique on FC data. In the study, data are generated from each of the following models: t-mixture with Box-Cox transformation, t-mixture, Gaussian mixture with Box-Cox transformation, and Gaussian mixture, using parameter estimates obtained from a previously analyzed dataset. They then use the Expectation Maximization (EM) algorithm to fit Gaussian and t-mixture model parameters to the data that were just generated and evaluate the misclassification rates. They have thus shown that mixture-model-based clustering is capable of recovering clusters in data generated from the same model components (e.g. Gaussian, t-mixture) that the EM algorithm is trying to fit. This circularity makes it difficult to draw any useful conclusions about the performance of the algorithm on real data.

4.2.2 Evaluation using manual gating

The most common approach to evaluating automated gating algorithms is to compare the results to the outcome of manual gating by an experienced researcher [2][63][86][6]. These studies use the human-defined cluster boundary as the true cluster label, and evaluate the ability of an automated approach to recreate these boundaries. Comparative metrics are applied to quantify the concordance between the two results and rank the performance of different approaches compared to an expert. An empirical study of such metrics by Aghaeepour et al. [?] on FC data suggests that the F-measure most closely coincides with expert intuitions about clustering “success” and “failure”. The F-measure [55] is the harmonic mean of precision and recall, defined as

$$F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4.1}$$

The F-measure was adopted as the primary evaluation criterion by the inaugural Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) conference held in September 2010. The purpose of the FlowCAP conference is to “advance the development of computational methods for the identification of cell populations of interest in flow cytometry data.” In FlowCAP I, objective testing was based on comparison to manual analysis by experts using common datasets. However, it was noted in the proceedings of the conference that manual gating is an imperfect gold standard on which to base evaluation [76] because of inconsistency and disagreement over manual gating decisions. As such, FlowCAP II incorporated the notion of clinical/biological outcome prediction as a secondary criterion for evaluating automated gating algorithms, by finding populations that are predictive of some external outcome measure. The ability to identify populations that are predictive of biological and clinical outcome is a desirable characteristic of an automated gating algorithm; however, such predictions can be made without performing a cluster identification step (for instance, using cytometric fingerprinting [69]). As such, this outcome is not an assessment of population identification per se; it is an assessment of classification.

In this Chapter we take an alternative approach that directly measures the performance of automated gating while placing minimal requirements on manual gating. This approach is designed to provide an objective measure of cluster membership and to improve evaluation of automated gating by providing straightforward interpretations of cluster membership.
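As a rough illustration of Equation 4.1, the following Python sketch computes precision, recall and the F-measure for one automated cluster against one expert gate. The function name `f_measure` and the toy event identifiers are hypothetical, and returning zero when the cluster and the gate do not overlap is an assumed convention, not part of the FlowCAP definition.

```python
def f_measure(cluster_events, gate_events):
    """Return (precision, recall, F) for one cluster against one expert gate.

    Both arguments are collections of event identifiers. The expert gate is
    treated as the true label, as in the manual-gating studies above.
    """
    cluster = set(cluster_events)
    gate = set(gate_events)
    true_positives = len(cluster & gate)
    if true_positives == 0:
        # Assumed convention: no overlap gives a score of zero.
        return 0.0, 0.0, 0.0
    precision = true_positives / len(cluster)
    recall = true_positives / len(gate)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Toy example: the gate holds events 0-9, the cluster holds events 2-13.
precision, recall, f = f_measure(range(2, 14), range(0, 10))
print(precision, recall, f)
```

Because the score depends entirely on the gate passed in, any disagreement between experts about that gate carries over directly into the F-measure.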

4.3 Cluster Evaluation using Hidden Labels (CEHL)

In this Section, we describe the design of a flow cytometry dataset for use in evaluating automated gating algorithms. The design of the dataset addresses the shortcomings of the evaluation methods outlined in the previous section, namely, the inability to objectively evaluate clustering results using realistic, biologically relevant data.

In general, it is difficult to evaluate the validity of inferences drawn from most unsupervised learning algorithms. If reliable external criteria exist to assign class membership for a set of training data, then the problem can be approached using any number of supervised statistical learning approaches. Clustering, and other unsupervised statistical learning methods, are useful when we do not know the pattern of the data a priori or it is not feasible to create sufficient training datasets for the application. It is precisely this “uncertainty” that makes clustering results difficult to evaluate, since effectiveness is a matter of opinion [29].

Here we describe the design of a flow cytometry experiment in which one set of parameters can be used to cluster the data, and a second set of parameters in the same data can be used to evaluate the clustering results. We exploit the high dimensionality of flow cytometry technology to conduct an experiment in which the clustering parameters and the cluster labels are contained in a single dataset. The experiment design consists of 3 steps:

1. Create a mixture of cells from different origins (e.g. mouse strain, cell line), in which the origin of the cells in the mixture is labeled, and analyze by flow cytometry

2. Cluster the resulting data, without the origin-labeled dimension

3. Determine the true origin of the cells in each cluster by uncovering the origin label
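The three steps above can be sketched in Python with made-up numbers. Everything in the sketch is an assumption for illustration: the single marker value per event, the fixed threshold that stands in for a clustering algorithm, and the strain proportions. It does not reproduce DBM or the actual assay.

```python
import random

random.seed(0)

# Step 1: build a mixture in which each event carries a hidden origin label.
# Hypothetical model: only wild-type cells can be lymphocytes (marker near 4);
# all other cells, from either strain, have a marker near 1.
events = []
for _ in range(2000):
    origin = random.choice(["wild_type", "knockout"])
    is_lymphocyte = origin == "wild_type" and random.random() < 0.5
    marker = random.gauss(4.0 if is_lymphocyte else 1.0, 0.3)
    events.append({"marker": marker, "origin": origin})


# Step 2: cluster on the marker only; the origin field is never consulted.
# A fixed threshold stands in for a real clustering algorithm.
def cluster_id(event):
    return "high" if event["marker"] > 2.5 else "low"


clusters = {"high": [], "low": []}
for event in events:
    clusters[cluster_id(event)].append(event)


# Step 3: reveal the hidden label and summarise the origin of each cluster.
def wild_type_fraction(members):
    return sum(e["origin"] == "wild_type" for e in members) / len(members)


for name, members in clusters.items():
    print(name, len(members), round(wild_type_fraction(members), 3))
```

In this toy setting the "high" cluster should consist almost entirely of wild-type events, while the "low" cluster mixes both origins, which mirrors the distinction the experiment relies on.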

4.3.1 Experiment design

A typical lab mouse has many cell subsets that are easily identified by flow cytometry using surface markers. If we take a mouse with the same background genetics and knock out a particular gene, that mouse will not be able to develop all of the subsets present in the wild-type mouse. In particular, in a well-studied model, knocking out the RAG-1 gene prevents the development of mature B and T lymphocytes [56]. If we mix cells derived from a RAG-1 knockout mouse with cells derived from a genetically similar wild-type mouse, the majority of subsets in the mixture will contain a combination of cells from both strains. However, certain lymphocyte subsets are present only in the wild-type mouse. Labeling all the immune cells from the wild-type mouse with a marker to indicate their origin, we can examine the clusters composed of those subsets that must be derived exclusively from the wild-type mouse and measure the accuracy of the algorithm based on the presence/absence of the marker indicating the origin of the cells in that cluster (see Figure 4.1). The clustering algorithm operates on all of the dimensions except the strain origin label, which is later used to evaluate the clustering result. Using this model system, we can quantitatively and objectively evaluate 2 properties of an automated gating algorithm: classification accuracy and sensitivity, as described in Section 4.3.3.

[Figure 4.1 diagram: spleen/PerC cells from a RAG -/- (GFP−) mouse and a CB.17 (GFP+) mouse are combined; labeled cell subsets include macrophages, neutrophils, eosinophils, B lymphocytes, VH11+ B lymphocytes and T lymphocytes.]

Figure 4.1: Combining samples from a wild-type CB.17 mouse labeled with GFP and a RAG knockout mouse. 6 of the 8 subsets of cells identified with the 7 marker assay are present in both strains. The 2 remaining subsets are only found in the wild-type CB.17. The mixture can be clustered based on the 7 markers and later de-convolved based on the presence of GFP, which is only on the wild-type strain.

4.3.2 Materials

Experiments were designed with the help of Eliver Ghosn and carried out by Eliver Ghosn at the Stanford Shared FACS Facility.

Mice and Tissue Preparation

CB.17 mice expressing green fluorescent protein (GFP) under the promoter for the MHC-I molecule (CB.17-GFP+) and Recombination Activating Gene 1 knockout (RAG-1 -/-) mice, 6 to 8 weeks old, were maintained at the Stanford Medical School Animal Care Facility. All experiments were conducted with institutional animal care and use committee approval. Peritoneal cells were harvested by injecting 7 mL of staining medium (deficient-RPMI plus 3% newborn calf serum) into the peritoneal cavity (PerC). Spleens were disrupted and resuspended to obtain single cell suspensions.

Experimental procedure and FACS analysis

Mice deficient in the RAG-1 enzyme (RAG-1 -/-) fail to generate lymphoid lineages (B and T lymphocytes) but efficiently generate myeloid and granulocyte lineages (neutrophils, macrophages, etc.). In contrast, CB.17-GFP+ mice efficiently generate all cell lineages of the immune system (lymphoid, myeloid and granulocytes). Moreover, these mice were engineered so that all lineages of the immune system constitutively express GFP, which makes them readily detectable by FACS analysis (i.e. all cells derived from CB.17-GFP+ mice are green). In the studies presented here we generated data from a cell mixture assay: cells derived from CB.17-GFP+ mice were mixed with cells derived from RAG-1 -/- mice. This assay allows us to identify the origin of each cell type present in the cell mixture. To perform the mixture, varying amounts of PerC cells derived from RAG-1 -/- mice were mixed with PerC cells derived from CB.17-GFP+ mice. The same protocol was followed to generate spleen cell mixtures.

Cell mixtures were pre-incubated on ice for 15 min with anti-CD16/CD32 monoclonal antibodies (mAb) to block FcRII/III receptors and stained with the following fluorochrome-conjugated mAb in an 8-color staining combination: PE-labeled anti-Gr-1; PECy5-labeled anti-CD5; PECy5.5-labeled anti-CD19; APC-Cy7-labeled anti-CD11b; biotin-labeled anti-Igκ and anti-Igλ light chains; Qdot 605-labeled anti-VH11 (stains a specific subset of B-lymphocytes expressing the VH11 variable heavy chain locus). Note that cells derived from CB.17 mice are constitutively labeled with green fluorescent protein (GFP). Cells were then washed and stained again on ice for 15 minutes with streptavidin-Alexa 405 (Invitrogen) to reveal biotin-coupled antibodies. After washing, stained cells were resuspended in 10 µg/mL propidium iodide (PI), to exclude dead (PI+) cells. Cells were analyzed on a Becton Dickinson LSRII located in the Stanford Shared FACS Facility. Data were collected for 0.2 × 10^6 cells.

Determinant                      Fluorochrome
MHC-I                            GFP
Gr-1                             PE
CD5                              PECy5
CD19                             PECy5.5
CD11b                            APC-Cy7
Ig κ/λ                           Alexa 405
VH11                             Qdot 605
Dead cells (nuclear staining)    Propidium Iodide

Table 4.1: 8-parameter assay used to identify 6 cell subsets used for cluster evaluation

Subset                  Phenotype
B-lymphocytes           CD19+, Igκ/λ+
VH11+ B-cell subset     CD19+, Igκ/λ+, VH11+
T-lymphocytes           CD5++, CD19−
Neutrophils             CD11b++, Gr-1++, SSc+, CD5−, CD19−
Eosinophils             CD11b+, Gr-1+, SSc++, CD5−, CD19−
Macrophages             CD11b++, Gr-1−

Table 4.2: 6 subsets and marker phenotype identified by assay in Table 4.1

4.3.3 Evaluation criteria

Classification accuracy

Of the 6 subsets listed in Table 4.2, 3 contain a mixture of GFP+ and GFP− cells. The remaining 3 subsets, the B and T lymphocyte subsets, are composed entirely of GFP+ cells, as the GFP− cells come from the knockout mouse, which cannot produce these subsets. We calculate the precision of a B or T lymphocyte cluster C as:

$$\mathrm{Precision}(C) = \frac{GFP^{+}_{C}}{GFP^{+}_{C} + GFP^{-}_{C}} \tag{4.2}$$

where $GFP^{+}_{C}$ and $GFP^{-}_{C}$ are the numbers of GFP+ and GFP− events in cluster C. This measure of precision relies only on the GFP+/GFP− cutoff, which we show in the results (Section 4.4) to be an unambiguous threshold. However, calculating recall requires a gate to delineate true positive cells. The B and T cell subsets are well separated in this experiment, and we rely on broad gates, drawn by a trained expert, to define the true positive lymphocytes (B and T cells). Using the lymphocyte gates, we calculate the recall of a lymphocyte cluster C given the lymphocyte gate G as:

$$\mathrm{Recall}(C) = \frac{GFP^{+}_{C}}{GFP^{+}_{C} + GFP^{+}_{G}} \tag{4.3}$$

By defining broad lymphocyte gates, we risk including contaminating events and considering these part of the same population. However, these represent a small fraction of the total events and remain fixed throughout our experiments. A highly restrictive B-lymphocyte gate contains only 5% fewer total cells than the inclusive gate, and is more subject to expert interpretation.
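A minimal Python sketch of Equations 4.2 and 4.3 follows, using one boolean flag per event. The function names and toy data are hypothetical. The text does not spell out whether the gate term in Equation 4.3 includes events already assigned to C; the sketch assumes it counts GFP+ events inside the gate G that the cluster missed, so that the denominator is the total number of GFP+ events the cluster should have captured.

```python
def gfp_precision(in_cluster, gfp_positive):
    """Equation 4.2: fraction of the events in cluster C that are GFP+."""
    pos = sum(1 for c, g in zip(in_cluster, gfp_positive) if c and g)
    neg = sum(1 for c, g in zip(in_cluster, gfp_positive) if c and not g)
    return pos / (pos + neg)


def gfp_recall(in_cluster, in_gate, gfp_positive):
    """Equation 4.3, under the assumption stated in the lead-in.

    found  = GFP+ events assigned to the cluster
    missed = GFP+ events inside the expert gate but outside the cluster
    """
    found = sum(1 for c, g in zip(in_cluster, gfp_positive) if c and g)
    missed = sum(
        1
        for c, k, g in zip(in_cluster, in_gate, gfp_positive)
        if k and g and not c
    )
    return found / (found + missed)


# Toy data: one boolean per event (hypothetical values).
in_cluster = [True, True, True, True, False, False, False, False]
in_gate = [True, True, True, True, True, True, False, False]
gfp_positive = [True, True, True, False, True, True, False, True]

print(gfp_precision(in_cluster, gfp_positive))
print(gfp_recall(in_cluster, in_gate, gfp_positive))
```

Only the recall function takes the gate as input, which reflects the point made above: precision depends on the GFP cutoff alone, whereas recall inherits whatever judgment went into drawing the gate.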

Sensitivity

In addition to classification accuracy, this model system also allows us to measure the sensitivity of a clustering algorithm to detect rare populations of cells. Mixing increasing concentrations of the wild-type sample into the RAG -/- sample results in the appearance of 3 lymphocyte subsets absent in the RAG -/- sample. Adding CB.17 events at different concentrations allows us to measure the concentration at which the algorithm is able to detect the appearance of a rare subset, as well as the classification accuracy of the algorithm for that subset as the frequency changes.

[Figure 4.2 plot: CD19 vs. Ig κ/λ, with event counts of 169645 and 67280 shown.]

Figure 4.2: B-cell gate

Mixtures can be physical, made in the tube before the sample is collected on the flow cytometer, or in silico, for samples from two strains measured under the same conditions. Mixtures reported here were generated in silico unless otherwise noted. Starting with a RAG -/- base, we randomly sample events from the CB.17 mouse such that the mixture contains [0.1, 0.4, 0.7, 1, 1.3, 1.8, 2.4, 3.2, 4.2, 5.6, 7.5, 10]% CB.17 events (Figure 4.3). Each sample contains 25,000 total events. Figure 4.3 plots two B-lymphocyte markers, Ig κ/λ and CD19, for the 12 concentrations of CB.17. The blue rectangle highlights the region where B-cells appear (CD19+, Ig κ/λ+) in increasing frequency with the addition of CB.17 events.
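The in silico mixing described above can be sketched in Python as follows. The function name `mix_in_silico`, the placeholder event lists and the seed are hypothetical. The sketch also assumes that the RAG -/- base is subsampled without replacement so that every mixture holds exactly 25,000 events, and that the number of CB.17 events is rounded to the nearest integer; the text does not specify either detail.

```python
import random


def mix_in_silico(base_events, spike_events, spike_percent, total=25000, seed=None):
    """Return `total` events, `spike_percent` percent of them from spike_events.

    Events are drawn without replacement from each source, and the number of
    spiked events is rounded to the nearest integer (both are assumptions).
    """
    rng = random.Random(seed)
    n_spike = round(total * spike_percent / 100)
    n_base = total - n_spike
    mixture = rng.sample(base_events, n_base) + rng.sample(spike_events, n_spike)
    rng.shuffle(mixture)
    return mixture


# Hypothetical stand-ins for the two measured samples: (strain, event index).
rag_events = [("RAG", i) for i in range(60000)]
cb17_events = [("CB.17", i) for i in range(60000)]

concentrations = [0.1, 0.4, 0.7, 1, 1.3, 1.8, 2.4, 3.2, 4.2, 5.6, 7.5, 10]
for pct in concentrations:
    sample = mix_in_silico(rag_events, cb17_events, pct, seed=1)
    n_cb17 = sum(1 for strain, _ in sample if strain == "CB.17")
    print(pct, len(sample), n_cb17)
```

Keeping the strain tag with each event plays the role of the hidden GFP label: it is ignored during clustering and consulted only when the result is scored.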

[Figure 4.3 plots, titled “BALB/c samples synthetically mixed into RAG-/-”: twelve panels, one per CB.17 concentration (0.1%, 0.4%, 0.7%, 1%, 1.3%, 1.8%, 2.4%, 3.2%, 4.2%, 5.6%, 7.5%, 10%), each showing CD19 (y-axis) vs. Ig κ/λ (x-axis).]

Figure 4.3: CB.17 events randomly sampled in varying concentrations and mixed with RAG -/- base. 25,000 total events in each sample. CB.17 concentrations range from 0.1% to 10% of the total number of events. Twelve concentrations plotted, showing Ig κ/λ vs. CD19. The blue rectangle highlights the region where the B-cells appear with increasing frequency as the concentration of CB.17 increases.

4.4 Results

We measured the performance of the DBM algorithm on the dataset described in Section 4.3.1 by the two criteria described in Section 4.3.3: classification accuracy and sensitivity.

4.4.1 Classification accuracy

We expect the majority of the clusters to contain a combination of GFP+ and GFP− cells. Examining the clusters for non-lymphocyte markers, as in Figure 4.4, we see a range of concentrations of GFP+ and GFP− cells.

Cluster    GFP+    GFP−
Cyan       40%     60%
Orange     37%     63%
Green      44%     56%
Blue       35%     65%
Red        95%     5%

Figure 4.4: Non-lymphocyte clusters contain a mixture of GFP+ and GFP− cells

For the purposes of this study, we defined B-cells as CD19+, Ig κ/λ+. Classification accuracy of the B-lymphocyte subset was quantified by comparing the B-cell cluster obtained by DBM to the GFP+ cells as outlined in Section 4.3.3. A range of settings for the percent outlier parameter was examined to assess the impact on classification accuracy. The DBM algorithm was applied to a physical mixture of RAG -/- and CB.17 spleen cells in a ratio of 50:10. The DBM-defined B-cell cluster is plotted for the B-cell markers CD19 and Igκ/λ at 9 settings of the % outlier parameter, ranging from 0.1% to 10%. The large number of B-cell events (∼65,000) and the good separation of the CD19 and Igκ/λ positive signal from the negative signal result in extremely high precision (99%) for the B-cell DBM clusters. As expected, recall (a measure of the number of events “left out” from the cluster) decreases as a greater percentage of cells are assigned to the background cluster (Figure 4.5).

4.4.2 Sensitivity

In the previous Section we measured the algorithm performance by fixing the data and tuning a parameter (% outlier) to observe the impact on the classification accuracy measures. In this Section, we fix the algorithm parameters (% outlier = 1%) and instead adjust the composition of the data (as described in Section 4.3.3) to determine the sensitivity of the algorithm to detect rare cell populations (see Figure 4.3).

[Figure 4.5 plots: (a) “DBM defined B-cell cluster at different outlier thresholds”, CD19 vs. Ig κ/λ; (b) recall (0.95 to 1.00) vs. percent outliers (0 to 10).]

Figure 4.5: DBM defined B-cell cluster with a range of settings for the outlier parameter, which determines the percentage of events that get classified as background. (a) Fewer outliers increase the cluster boundary, more outliers lead to conservative boundaries. (b) More conservative boundaries decrease recall

The precision and recall of the DBM algorithm applied to 10 dilutions of GFP+ CB.17 events in a GFP− RAG -/- background are shown in Figure 4.6. DBM was unable to detect a cluster in the B-cell region at concentrations <0.5% (data not displayed). Low frequencies of GFP+ events affect recall more dramatically than precision. This results from the large B-cell gate used to define the true positives for the recall measurement, and the large variability in the expression of the markers defining the B-cell cluster, which spans 2 decades in the Igκ/λ dimension and 1 decade in the CD19 dimension. At low frequencies, randomly sampled events can be spread out over the large range of values such that the local density in any given region does not exceed the threshold for background noise. Thus, the recall analysis presented here is dependent on (i) the size of the gate that defines true positives and (ii) the variance of the cluster distribution. Decreasing either of these factors will lead to an improvement in recall performance. Precision is less sensitive to these factors. Once the density is sufficient to distinguish the events from background, there is a high probability that the low-frequency cluster is centered around the densest region of the population, which, in these data, is well separated from contaminating events.

[Figure 4.6 plots, titled “B-cell cluster detection sensitivity”: precision (0.93 to 0.99) vs. % GFP+ and recall (0.4 to 0.9) vs. % GFP+.]

Figure 4.6: Precision and recall for DBM detection of B-lymphocyte cluster applied to 10 concentrations of CB.17 events in a RAG -/- background.

4.5 Discussion

Evaluation against the performance of experienced experts will continue to play an important role in validating the practical utility of automated gating methods. It may be the case that people prefer to get the same answer as an expert, whether or not it is correct. This preference points to one of the advantages of a 2-dimensional implementation: experts can pick the dimensions as in manual gating, instead of getting a black-box result from clustering on all dimensions simultaneously. Such a system readily incorporates the prior knowledge of the expert into the clustering procedure in a semi-supervised approach.

However, disagreements between even the most experienced users make it impractical to define a single gold standard by which automated gating methods can be evaluated. In order for the field to advance we need clear, objective and realistic benchmarks with which to gauge new methods as they become available.

In this Chapter we proposed the design of an experimental dataset in which cell subsets are labeled with typical immune cell markers and an additional marker that is excluded from consideration by the clustering algorithm. This extra parameter is later examined to assess the performance of an automated clustering algorithm by revealing the origin of the cells in each cluster. Specifically, we combined samples from a GFP+ wild-type CB.17 mouse with a RAG knockout mouse (GFP−), stained the cells with a seven parameter assay to identify eight immune cell subsets, and measured the mixture using flow cytometry. Six of the eight subsets of cells identified are present in both strains. The two remaining subsets are only found in the wild-type CB.17. We cluster the data files based on the 7 lineage markers and evaluate the results by de-convolving the mixture based on the presence of GFP, which is only present on the cells originating from the wild-type strain.

Using this extra marker, we define the “true” cluster labels for cell subsets that are present in the wild-type strain (GFP+) and cannot be produced by the knockout strain (B and T lymphocytes), and measure the classification accuracy and sensitivity of DBM. The B-cell subset is identified with good recall over a range of values of the %-outlier parameter of the DBM algorithm, and almost perfect precision (99%) over the same range. We also measured the ability to detect rare populations by adding different concentrations of the wild-type sample into the knockout background and assessing the effect on algorithm recall and precision. In these studies, DBM first detected the B-cell cluster at a concentration of 0.5%, a result consistent with the basophil studies presented in Section 2.6.1, with high precision (∼92%) but poor recall (∼40%). At 2% concentration, recall and precision improve to ∼82% and ∼97%, respectively. These results provide guidelines for the levels at which DBM can be expected to detect a small population, and for the performance of the algorithm with increasing concentration of the population.

The Cluster Evaluation using Hidden Labels (CEHL) experimental design is a proof of concept illustrating how we can leverage modern high-dimensional FC technology and a bit of biology to objectively evaluate the results of automated gating algorithms. Here we used the absence of specific subsets of cells in one strain to measure the classification accuracy of those subsets when mixed with another strain containing the missing subsets. One can easily imagine many alternative versions of this experiment design. For instance, mixing samples from mice with identical background strains but different immunoglobulin allotype markers should provide a more consistent distribution of cell origin in clusters that contain a mixture of both cell origins. Alternatively, one might utilize the cell barcoding techniques pioneered by Krutzik and Nolan [43] to deconvolve mixtures of more than 2 samples. The common thread in these experiments is the use of additional parameters available on modern FC instruments that are excluded from the automated algorithms, but provide a mechanism for revealing an objective “gold standard” for evaluation.

Chapter 5

Conclusion and Future Directions

5.1 Conclusion

The immune system is a complex environment involving hundreds of different types of cells, with a tremendously rich infrastructure for signaling, long-term memory, distributed regulation, and self vs. non-self recognition. This diversity is possible because of the activity of specialized subsets of cells with distinct, yet often redundant functions. Understanding when an aspect of this system is disrupted provides insight into the functioning of the healthy immune system and the underlying mechanisms of disease in a disrupted system.

Changes in frequency and/or biomarker expression in small subsets of cells provide key diagnostics for disease presence, status and prognosis. Flow cytometry is the workhorse technology that enables rapid, quantitative and high-throughput characterization of these cell subsets. At present, flow cytometry instruments measure the joint expression of up to 20 markers in/on large numbers of individual cells. This technology is routinely used to determine the frequencies of various marker-defined cell subsets in patient samples and is often used to inform therapeutic decision-making. Despite the central role of flow cytometry in biomedical research and the rapid improvements in instrument technology, there has been only modest progress in statistical and informatics methods to unravel the resulting data. Thus, even experienced users are commonly unable to exploit the full capabilities of the highly advanced flow instruments that are available at most modern research and medical centers.

In this thesis, I have attempted to develop a general methodology that provides reliable indices of change in subset representation and marker expression by individual subsets of cells measured by flow cytometry. The methods we have developed utilize a novel non-parametric clustering algorithm based on statistical theory to identify subsets (clusters) of cells that express a common set of markers measured independently for each cell by flow cytometry. The key properties of the method are that it can detect clusters with irregular shape, the number of clusters is determined automatically from the data, and it is efficient to compute for millions of observations using standard desktop computers. The algorithm has reliably reproduced the results of expert researchers in identifying rare cell subsets, and offers a reproducible alternative to manual analysis. Once cell subsets are identified, it is common to compare variation across samples under different conditions, such as disease state or experimental stimulation. We measure changes in the joint expression of multiple uncoordinated markers measured for a subset under different conditions using the Earth Mover’s Distance, an algorithm borrowed from the image retrieval literature that is used to compare multivariate distributions. Whereas current methodology treats changes in multiple markers as a collection of univariate measurements, our multivariate approach significantly simplifies flow analyses when more than one marker is differentially expressed and readily adapts to automated analysis of flow data.

The combination of automated subset identification and reliable measurement of shifts in joint marker expression offers a powerful framework for multiplexed immunological assays. In this dissertation, I demonstrate how these two methods can be applied to identify a rare population, measure changes in the expression of multiple markers in response to activation, and use that as an index of response to predict a clinical response with exquisite accuracy. At present, flow cytometry studies focus on the behavior of a handful of subsets identified in the data via manual gating. Our results demonstrate the potential for characterizing and measuring differences in all of the subsets identified in an assay, providing a system-wide snapshot of the immune response.

5.2 Future Directions

Automatic, reliable identification of cell subsets coupled with a quantitative measure of change in marker expression on those subsets provides the groundwork for a powerful new paradigm of immunology research. Imagine panels of standardized Hi-Dimensional FACS assays that measure a snapshot of the current state and functioning of the immune system. Collecting longitudinal samples with these assays, we could study the behavior of finely characterized cell types in the context of aging, environmental exposures, drug regimens and disease. Whereas current FACS studies interrogate cell types in a directed, hypothesis-driven fashion (e.g. to study one or more particular types of cell), this new approach enables broader exploratory research and new hypothesis generation.

One of the critical computational components required to make this possible is a method to align cell subsets across samples. Given samples from two sources clustered independently, the method would find the correspondence between the clusters from the two samples. The challenging aspect of this work lies in the natural variability in the position (fluorescence intensity), number (populations), and frequency (percent of total sample) of cell populations across samples. Using the output of a clustering procedure (i.e. an assignment of cell observations to a cluster) as the input, a new procedure could use the features of the clusters from each sample to identify correspondences. One approach is to perform a second clustering (meta-clustering) using the cluster features to identify clusters with similar properties. An alternative approach could use network alignment algorithms, treating clusters as nodes and distances between clusters as edges to exploit the shared topology across samples.

Given a cluster correspondence algorithm, the computational pipeline for a group of patient samples might look like: (i) automatically identify cell subsets in each sample; (ii) align cell subsets across samples; (iii) compare changes in cell subsets. In the resulting matrix, rows are individual patients and columns are changes in frequency and/or marker expression for each of the cell subsets measured by the assay. Any number of existing machine learning techniques could be applied to such a matrix to tease out features that correlate with some external clinical criteria (drug efficacy, type of cancer, vaccination response). Such a system-wide snapshot of the immune system could reveal changes in combinations of cell subsets that would be impossible to detect using existing methods.

Appendix A

Generating synthetic FC-like data

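‘Flow cytometry’-like data were simulated for n = 10000 events in 4 dimensions containing 5 clusters. Four clusters can be seen by examining a plot of the 1st vs. 2nd dimensions. The largest cluster is curved like a kidney bean and is generated by mixing two normal distributions. The smallest cluster is in fact two clusters, which can be seen in a plot of the 1st (or 2nd) vs. 3rd dimensions. The 4th dimension contains noise. The following R code by Guenther Walther can be used to generate “FC-like” datasets.

    library(MASS)  # provides mvrnorm()

    n <- 10000

    # mixture proportions of the six Gaussian components
    p1 <- 0.6
    p2 <- 0.25
    p3 <- 0.1
    p4 <- 0.02
    p5 <- 0.01
    p6 <- 1 - p1 - p2 - p3 - p4 - p5  # background
    labels <- rep(0, n)

    # draw n candidate events from each component (dimensions 1-3)
    # NOTE: as printed, the first matrix is not symmetric
    # (entry [3,1] is 0.5 but [1,3] is 0); values are kept verbatim
    x1 <- mvrnorm(n, mu = c(0, 0, 0), Sigma = rbind(
      c(1, 0.5, 0),
      c(0.5, 1, 0),
      c(0.5, 0, 1)))
    x2 <- mvrnorm(n, mu = c(2, 1, 0), Sigma = rbind(
      c(2, 0, 0),
      c(0, 0.5, 0),
      c(0, 0, 1)))
    x3 <- mvrnorm(n, mu = c(6, -1, 0), Sigma = rbind(
      c(1, 0, 0),
      c(0, 1, 0),
      c(0, 0, 1)))
    x4 <- mvrnorm(n, mu = c(6, 3, 3), Sigma = rbind(
      c(1, 0, 0),
      c(0, 1, 0),
      c(0, 0, 1)))
    x5 <- mvrnorm(n, mu = c(11, -1, 6), Sigma = rbind(
      c(0.2, 0, 0),
      c(0, 0.2, 0),
      c(0, 0, 0.2)))
    x6 <- mvrnorm(n, mu = c(11, -1, 8), Sigma = rbind(
      c(0.2, 0, 0),
      c(0, 0.2, 0),
      c(0, 0, 0.2)))

    # add the labels (columns 4 and 5; column 4 is overwritten with noise below)
    x1 <- cbind(x1, rep(1, n), rep(1, n))
    x2 <- cbind(x2, rep(1, n), rep(1, n))
    x3 <- cbind(x3, rep(2, n), rep(2, n))
    x4 <- cbind(x4, rep(3, n), rep(3, n))
    x5 <- cbind(x5, rep(4, n), rep(4, n))
    x6 <- cbind(x6, rep(5, n), rep(5, n))

    u <- runif(n)
    # The source text breaks off after "u1<-(u". The seven lines marked
    # "assumed" are a reconstruction, not the original code: each event is
    # assigned to one component according to the proportions p1..p6.
    u1 <- (u < p1)                                                  # assumed
    u2 <- (u >= p1 & u < p1 + p2)                                   # assumed
    u3 <- (u >= p1 + p2 & u < p1 + p2 + p3)                         # assumed
    u4 <- (u >= p1 + p2 + p3 & u < p1 + p2 + p3 + p4)               # assumed
    u5 <- (u >= p1 + p2 + p3 + p4 & u < p1 + p2 + p3 + p4 + p5)     # assumed
    u6 <- (u >= p1 + p2 + p3 + p4 + p5)                             # assumed
    x <- matrix(0, n, 5)                                            # assumed

    x[u1, ] <- x1[u1, ]  # 1st half of 1st cluster
    x[u2, ] <- x2[u2, ]  # 2nd half of 1st cluster
    x[u3, ] <- x3[u3, ]  # 2nd cluster
    x[u4, ] <- x4[u4, ]  # 3rd cluster
    x[u5, ] <- x5[u5, ]  # 4th cluster
    x[u6, ] <- x6[u6, ]  # 5th cluster
    x[, 4] <- rnorm(n)   # last coordinate is noise

    # x holds the simulated data: columns 1-4 are the measurements,
    # column 5 is the true class label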
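The mixing step in the code above assigns each event to one Gaussian component with probabilities p1 to p6. The following is a minimal sketch of that step in isolation, assuming the assignment is made by comparing a single uniform draw against the cumulative proportions. The variable names component, class_of_component and label are introduced here for illustration and do not appear in the original code.

```r
set.seed(42)
n <- 10000

# proportions p1..p6 as given in the appendix code (p6 is the remainder)
p <- c(0.6, 0.25, 0.1, 0.02, 0.01)
p <- c(p, 1 - sum(p))

# one uniform draw per event; findInterval() counts how many cumulative
# break points are <= u, which gives the component index minus one
u <- runif(n)
component <- findInterval(u, cumsum(p)[-length(p)]) + 1

# components 1 and 2 together form the kidney-bean cluster (class 1)
class_of_component <- c(1, 1, 2, 3, 4, 5)
label <- class_of_component[component]

# compare the realized component frequencies with the target proportions
observed <- as.numeric(table(factor(component, levels = 1:6))) / n
print(round(rbind(expected = p, observed = observed), 3))
```

With n = 10000 the two smallest clusters contain on the order of a hundred events each, which is what makes them a useful test case for detecting rare subsets.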

Figure A.1: Synthetic FC-like data in 4 dimensions showing true class labels, generated using the code in Appendix A. [Panels: Dimension 2 vs. Dimension 1 and Dimension 3 vs. Dimension 1.]