Edinburgh Research Explorer

Eleven grand challenges in single-cell data science Citation for published version: Lähnemann, D, Köster, J, Szczurek, E, McCarthy, DJ, Hicks, SC, Robinson, MD, Vallejos, CA, Campbell, KR, Beerenwinkel, N, Mahfouz, A, Pinello, L, Skums, P, Stamatakis, A, Attolini, CS-O, Aparicio, S, Baaijens, J, Balvert, M, Barbanson, BD, Cappuccio, A, Corleone, G, Dutilh, BE, Florescu, M, Guryev, V, Holmer, R, Jahn, K, Lobo, TJ, Keizer, EM, Khatri, I, Kielbasa, SM, Korbel, JO, Kozlov, AM, Kuo, T-H, Lelieveldt, BPF, Mandoiu, II, Marioni, JC, Marschall, T, Mölder, F, Niknejad, A, Raczkowski, L, Reinders, M, Ridder, JD, Saliba, A-E, Somarakis, A, Stegle, O, Theis, FJ, Yang, H, Zelikovsky, A, McHardy, AC, Raphael, BJ, Shah, SP & Schönhuth, A 2020, 'Eleven grand challenges in single-cell data science', Genome Biology, vol. 21, no. 1, pp. 31. https://doi.org/10.1186/s13059-020-1926-6

Digital Object Identifier (DOI): 10.1186/s13059-020-1926-6

Link: Link to publication record in Edinburgh Research Explorer

Document Version: Peer reviewed version

Published In: Genome Biology

General rights Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact [email protected] providing details, and we will remove access to the work immediately and investigate your claim.

Download date: 04. Oct. 2021 11 grand challenges in single-cell data science

David Lähnemann*,1,2,3, Johannes Köster*,+,1,4, Ewa Szczurek*,5, Davis J. McCarthy*,6,7, Stephanie C. Hicks*,8, Mark D. Robinson*,9, Catalina A. Vallejos*,10,11, Kieran R. Campbell*,15,16,17, Niko Beerenwinkel*,12,13, Ahmed Mahfouz*,18,19, Luca Pinello*,20,21,22, Pavel Skums*,23, Alexandros Stamatakis*,24,25, Camille Stephan-Otto Attolini*,26, Samuel Aparicio16,27, Jasmijn Baaijens29, Marleen Balvert29,31, Buys de Barbanson32,33,34, Antonio Cappuccio35, Giacomo Corleone36, Bas E. Dutilh31,38, Maria Florescu32,33,34, Victor Guryev41, Rens Holmer42, Katharina Jahn12,13, Thamar Jessurun Lobo41, Emma M. Keizer45, Indu Khatri46, Szymon M. Kiełbasa47, Jan O. Korbel48, Alexey M. Kozlov24, Tzu-Hao Kuo3, Boudewijn P.F. Lelieveldt49,50, Ion I. Mandoiu51, John C. Marioni52,53,54, Tobias Marschall55,56, Felix Mölder1,59, Amir Niknejad60,61, Łukasz Rączkowski5, Marcel Reinders18,19, Jeroen de Ridder32,33, Antoine-Emmanuel Saliba62, Antonios Somarakis50, Oliver Stegle48,54,63, Fabian J. Theis67, Huan Yang68, Alex Zelikovsky69,70, Alice C. McHardy+,3, Benjamin J. Raphael+,71, Sohrab P. Shah+,72, and Alexander Schönhuth@,+,*,29,31

*Joint first authors, major contributions to manuscript. 1Algorithms for Reproducible , Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Germany 2Department of Paediatric Oncology, Haematology and Immunology, Medical Faculty, Heinrich Heine University, University Hospital, Düsseldorf, Germany 3Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany 4Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, USA 5Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics University of Warsaw, Poland 6Bioinformatics and Cellular Genomics, St Vincent’s Institute of Medical Research, Fitzroy, Australia 7Melbourne Integrative Genomics, School of BioSciences — School of Mathematics & Statistics, Faculty of Science, University of Melbourne, Australia 8Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA 9Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, Switzerland 10MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, UK 11The Alan Turing Institute, British Library, London, UK 15Department of Statistics, University of British Columbia, Vancouver, Canada 16Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada 17Data Science Institute, University of British Columbia, Vancouver, Canada 12Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland 13SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland 18Leiden Center, Leiden University Medical Center, The Netherlands 19Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and , Delft University of Technology, The Netherlands 20Molecular Pathology Unit and Center for Cancer Research, Massachusetts General Hospital Research Institute, Charlestown, USA 21Department of Pathology, Harvard Medical School, Boston, USA 22Broad Institute of Harvard and MIT, Cambridge, MA, USA 23Department of Computer Science, Georgia State University, Atlanta, USA 24Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Germany 25Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Germany 26Institute for Research in Biomedicine, The Barcelona Institute of Science and Technology, Spain 27Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada 29Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands 31Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, The Netherlands 32Center for Molecular Medicine, University Medical Center Utrecht, The Netherlands 33Oncode Institute, Utrecht, The Netherlands 34Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands 35Institute for Advanced Study, University of Amsterdam, The Netherlands 36Department of Surgery and Cancer, The Imperial Centre for Translational and Experimental Medicine, Imperial College London, UK 38Centre for Molecular and Biomolecular Informatics, Radboud University Medical Center, Nijmegen, The Netherlands 41European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, The Netherlands 42Bioinformatics Group, Wageningen University, The Netherlands 45Biometris, Wageningen University & Research, The Netherlands 46Department of Immunohematology and Blood Transfusion, Leiden University Medical Center, The Netherlands 47Department of Biomedical Data Sciences, Leiden University Medical Center, The Netherlands 48Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany 49PRB lab, Delft University of Technology, The Netherlands 50Division of Image Processing, Department of Radiology, Leiden University Medical Center, The Netherlands 51Computer Science & Engineering Department, University of Connecticut, Storrs, USA 52Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, UK 53Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, UK 54European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK 55Center for Bioinformatics, Saarland University, Saarbrücken, Germany 56Max Planck Institute for Informatics, Saarbrücken, Germany 59Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Germany. 60Computation molecular design, Zuse Institute Berlin, Germany 61Mathematics department, Mount Saint Vincent, New York, USA 62Helmholtz Institute for RNA-based Infection Research, Helmholtz-Center for Infection Research, Würzburg, Germany 63Division of Computational Genomics and Systems Genetics, German Cancer Research Center – DKFZ, Heidelberg, Germany 67Institute of Computational Biology, Helmholtz Zentrum München – German Research Center for Environmental Health, Neuherberg, Germany 68Division of Drug Discovery and Safety, Leiden Academic Center for Drug Research – LACDR — Leiden University, The Netherlands 69Department of Computer Science, Georgia State University, Atlanta, USA 70The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia 71Department of Computer Science, Princeton University, USA 72Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, USA

+Joint last authors, workshop organizers.

@Corresponding author: Alexander Schönhuth, [email protected] 1 The recent boom in microfluidics and com- 3 Challenges in single-cell tran- 38 2 binatorial indexing strategies, further en- scriptomics8 39 3 hanced by low sequencing costs, has turned 3.1 Challenge I: Handling sparsity 40 4 single-cell sequencing into an empowering in single-cell RNA sequencing .8 41 5 technology: analyzing thousands—or even 3.2 Challenge II: Defining flexible 42 6 millions—of cells per experimental run is be- statistical frameworks for dis- 43 7 coming a routine assignment in laboratories covering complex differential 44 8 worldwide. As a consequence, we are wit- patterns in gene expression . . . 13 45 9 nessing a data revolution in single-cell biology. 3.3 Challenge III: Mapping single 46 10 Although some issues are similar in spirit to cells to a reference atlas . . . . 15 47 11 those experienced in bulk sequencing, many 3.4 Challenge IV: Generalizing 48 12 of the emerging data science problems are trajectory inference ...... 16 49 13 unique to single-cell analysis. Together, they 3.5 Challenge V: Finding patterns 50 14 give rise to the new realm of Single-Cell Data in spatially resolved measure- 51 15 Science. ments ...... 18 52 16 Here, we outline eleven challenges that will

17 be central in bringing the field forward. For 4 Challenges in single-cell genomics 20 53 18 each challenge, we review the current state of 4.1 Challenge VI: Dealing with er- 54 19 the art in terms of prior work, and formulate rors and missing data in the 55 20 open problems, with an emphasis on the re- identification of variation from 56 21 search goals that motivate them. single-cell DNA sequencing data. 23 57 22 This compendium is meant to serve as a 23 guideline for established researchers, newcom- 5 Challenges in single-cell phy- 58 24 ers and students alike, highlighting interesting logenomics 25 59 25 and rewarding problems in Single-Cell Data 5.1 Challenge VII: Scaling phylo- 60 26 Science for the coming years. genetic models to many cells 61 and many sites ...... 26 62 5.2 Challenge VIII: Integrating 63 multiple types of variation into 64 phylogenetic models ...... 27 65 27 Contents 5.3 Challenge IX: Inferring pop- 66 ulation genetic parameters of 67 28 1 Introduction3 tumor heterogeneity by model 68 integration ...... 29 69 29 2 Single-cell data science: recur-

30 ring themes5 6 Overarching challenges 32 70 31 2.1 Varying levels of resolution . . .5 6.1 Challenge X: Integration of 71 32 2.2 Quantifying uncertainty of single-cell data: across sam- 72 33 measurements and analysis ples, experiments and types of 73 34 results ...... 6 measurement ...... 32 74 35 2.3 Scaling to higher dimensionali- 6.2 Challenge XI: Validating and 75 36 ties: more cells, more features, benchmarking analysis tools 76 37 broader coverage ...... 6 for single-cell measurements . . 37 77

2 1 7 Acknowledgements 40

2 8 Funding 40

3 9 Conflicts of interest 41

4 10 Contributions 41 Box 1: Abbreviations

5 1 Introduction CNV copy number variation

6 Since being highlighted as “Method of the FISH fluorescent in situ hybridization 7 Year” in 2013 [Nature Methods, 2014], se- 8 quencing of the genetic material of individ- ICA independent component analysis 9 ual cells has become routine when investigat- 10 ing cell-to-cell heterogeneity. Single-cell mea- MALBAC multiple annealing and looping- 11 surements of both RNA and DNA, and more based amplification cycles 12 recently also of epigenetic marks and protein 13 levels, can stratify cells at the finest resolution MDA multiple displacement amplification 14 possible. 15 Single-cell RNA sequencing (scRNA-seq) MSA multiple sequence alignment 16 enables transcriptome-wide gene expression 17 measurement at single-cell resolution, allow- NMF non-negative matrix factorization 18 ing for cell type clusters to be distinguished 19 [for an early example, see Anchang et al., PCA principal component analysis 20 2016], the arrangement of populations of cells 21 according to novel hierarchies, and the identi- PCR polymerase chain reaction 22 fication of cells transitioning between states. 23 This can lead to a much clearer view of the dy- sc-seq single-cell sequencing 24 namics of tissue and organism development, scDNA-seq single-cell DNA sequencing 25 and on structures within cell populations that 26 had so far been perceived as homogeneous. In SCDS Single-Cell Data Science 27 a similar vein, analyses based on single-cell 28 DNA sequencing (scDNA-seq) can highlight scRNA-seq single-cell RNA sequencing 29 somatic clonal structures (e.g., in cancer, see SNV single nucleotide variation 30 Francis et al. [2014], Lawson et al. [2018]), 31 thus helping to track the formation of cell WGA whole genome amplification 32 lineages and provide insight into evolutionary 33 processes acting on somatic mutations. 34 The opportunities arising from single-cell 35 sequencing (sc-seq) are enormous: only now 36 is it possible to re-evaluate hypotheses about 37 differences between pre-defined sample groups 38 at the single-cell level—no matter if such sam- 39 ple groups are disease subtypes, treatment

3 1 groups or simply morphologically distinct cell able data analysis models and methods. Fi- 45 2 types. It is therefore no surprise that enthu- nally, no matter how varied the challenges 46 3 siasm about the possibility to screen the ge- are—by research goal, tissue analyzed, experi- 47 4 netic material of the basic units of life has mental setup or just by whether DNA or RNA 48 5 continued to grow. A prominent example is is sequenced—they are all rooted in data sci- 49 6 the Human Cell Atlas [Regev et al., 2017], ence, i.e., are computational or statistical in 50 7 an initiative aiming to map the numerous cell nature. Here, we propose the data science 51 8 types and states comprising a human being. challenges that we believe to be among the 52 9 Encouraged by the great potential of in- most relevant for bringing SCDS forward. 53 10 vestigating DNA and RNA at the single- 11 cell level, the development of the correspond- 12 ing experimental technologies has experienced This catalog of SCDS challenges aims at fo- 54 13 considerable growth. In particular, the emer- cusing the development of data analysis meth- 55 14 gence of microfluidics techniques and com- ods and the directions of research in this 56 15 binatorial indexing strategies [Zilionis et al., rapidly evolving field. It shall serve as a com- 57 16 2017, Vitak et al., 2017, Svensson et al., pendium for researchers of various commu- 58 17 2018b, Luo et al., 2019, Gao et al., 2019] has nities, looking for rewarding problems that 59 18 led to hundreds of thousands of cells routinely match their personal expertise and interests. 60 19 being sequenced in one experiment. This de- To make it accessible to these different com- 61 20 velopment has even enabled a recent publica- munities, we categorize challenges into: tran- 62 21 tion analyzing millions of cells at once [Cao scriptomics (section 3), genomics (section 4) 63 22 et al., 2019a]. Sc-seq datasets comprising and phylogenomics (section 5). For each 64 23 very large cell numbers are becoming avail- challenge, we provide a thorough review of 65 24 able worldwide, constituting a data revolution the status relative to existing approaches and 66 25 for the field of single-cell analysis. point to possible directions of research to 67 26 These vast quantities of data and the re- solve it. 68 27 search hypotheses that motivate them need to 28 be handled in a computationally efficient and 29 statistically sound manner [Amezquita et al., 30 2019]. As these aspects clearly match a recent Several themes and aspects recur across 69 31 definition of “Data Science” [Hicks and Peng, the boundaries of research communities and 70 32 2019], we posit that we have entered the era methodological approaches. We represent 71 33 of Single-Cell Data Science (SCDS). these overlaps in three different ways. First, 72 34 SCDS exacerbates many of the data science we decided to discuss some problems in mul- 73 35 issues arising in bulk sequencing, but it also tiple contexts, highlighting the relevant as- 74 36 constitutes a set of new, unique challenges pects for the respective research communi- 75 37 for the SCDS community to tackle. Limited ties (e.g., data sparsity in transcriptomics and 76 38 amounts of material available per cell lead to genomics). Second, we separately introduce 77 39 high levels of uncertainty about observations. recurring themes (section 2), thereby keep- 78 40 When amplification is used to generate more ing respective discussions in each challenge 79 41 material, technical noise is added to the re- succinct. Third, if challenges were identi- 80 42 sulting data. Further, any increase in reso- fied as independent of the chosen catego- 81 43 lution results in another—rapidly growing— rization, they are discussed as recapitulatory 82 44 dimension in data matrices, calling for scal- challenges at the end (section 6). 83

4 1 2 Single-cell data science: from different time points, treatment groups 43 or even organisms. This integration theme 44 2 recurring themes runs throughout multiple challenges and is so 45 central that we consider it a challenge worth 46 3 A number of challenging themes are common highlighting (section 6.1). Third, the possi- 47 4 to many or all single-cell analyses, regard- bility to operate on the finest levels of resolu- 48 5 less of the particular assay or data modal- tion casts an important, overarching question: 49 6 ity generated. We will start our review by what level of resolution is appropriate relative 50 7 introducing them. Later, when discussing to the particular research question one has in 51 8 the specific challenges, we will refer to these mind (section 2.1)? We will start by qualify- 52 9 broader themes wherever appropriate and ing this last one. 53 10 outline what they mean in the particular con- 11 text. If challenges covered in later sections 2.1 Varying levels of resolution 54 12 are particularly entangled with the broader 13 themes listed here, we will also refer to them Sc-seq allows for a fine-grained definition of 55 14 from within this section. cell types and states. Hence, it allows for 56 15 The themes may reflect issues one also characterizations of cell populations that are 57 16 experiences when analyzing bulk sequenc- significantly more detailed than those sup- 58 17 ing data. However, even if not unique to ported by bulk sequencing experiments. How- 59 18 single-cell experiments, these issues may dom- ever, even though sc-seq operates at the most 60 19 inate the analysis of sc-seq data and there- basic level, mapping cell types and states at 61 20 fore require particular attention. The two a particular level of resolution of interest may 62 21 most urgent elementary themes, not neces- be challenging: Achieving the targeted level of 63 22 sarily unique to sc-seq, are the need to quan- resolution or granularity for the intended map 64 23 tify measurement uncertainty (see section sec- of cells may require substantial methodolog- 65 24 tion 2.2) and the need to benchmark methods ical efforts and will depend on whether the 66 25 systematically, in a way that highlights the research question allows for a certain freedom 67 26 metrics that are particularly critical in sc-seq. in terms of resolution and on the limits im- 68 27 Since the latter is of central importance and posed by the particular experimental setup. 69 28 an aspect that has gained visibility only re- When drawing maps of cell types and 70 29 cently, we not only mention its importance states, it is important that they: (i) have a 71 30 in relevant challenges, but also consider it a structure that recapitulates both tissue devel- 72 31 challenge in its own right (section 6.2). opment and tissue organization; (ii) account 73 32 We identify three sweeping themes that are for continuous cell states in addition to dis- 74 33 more specific to sc-seq, exacerbated by the crete cell types (i.e. reflecting cell state tra- 75 34 rapid advances in experimental technologies. jectories within cell types and smooth tran- 76 35 First, there is a need to scale to higher di- sitions between cell types, as observed in tis- 77 36 mensional data, be it more cells measured or sue generation); (iii) allow for choosing the 78 37 more data measured per cell (section 2.3). level of resolution flexibly (i.e. the map should 79 38 This need often arises in combination with possibly support zoom-type operations, to 80 39 a second one: the need to integrate data let the researcher choose the desired level 81 40 across different types of single-cell measure- of granularity with respect to cell types and 82 41 ments (e.g., RNA, DNA, proteins, methyla- states conveniently, ranging from whole or- 83 42 tion, and so on) and across samples, be it ganisms via tissues to cell populations and 84

5 1 cellular subtypes); (iv) include biological and tially more uncertain and tasks so far consid- 38 2 functional annotation wherever available and ered routine, such as single nucleotide varia- 39 3 helpful in the intended functional context. tion (SNV) calling in bulk sequencing, require 40 4 An exemplary illustration of how maps of considerable methodological care with sc-seq 41 5 cell types and states can support different lev- data. 42 6 els of resolution are the structure-rich topolo- These issues with data quality and in par- 43 7 gies generated by PAGA based on scRNA- ticular missing data pose challenges that are 44 1 8 seq [Wolf et al., 2019], see Figure 1 . At the unique to sc-seq, and are thus at the core of 45 9 highest levels of resolution, these topologies several challenges: regarding scDNA-seq data 46 10 also reflect intermediate cell states and the quality (see section 4) and especially regard- 47 11 developmental trajectories passing through ing missing data in scDNA-seq (section 4.1) 48 12 them. A similar approach that also allows and scRNA-seq (section 3.1). In contrast, the 49 13 for consistently zooming into more detailed non-negligible batch effects that scRNA-seq 50 14 levels of resolution is provided by hierarchical can suffer from reflect a common problem in 51 15 stochastic neighbor embedding (HSNE, Pez- high-throughput data analysis [Leek et al., 52 16 zotti et al. [2016]), a method pioneered on 2010], and thus are not discussed here (al- 53 17 mass cytometry datasets [Unen et al., 2017, though in certain protocols such effects can be 54 18 Höllt et al., 2018]. In addition, manifold alleviated by careful use of negative control 55 19 learning [Welch et al., 2017, Moon et al., 2018] data in the form of spike-in RNA of known 56 20 and metric learning [Hoffer and Ailon, 2015, content and concentration [Severson et al., 57 21 Bromley et al., 1993] may provide further the- 2018, BEARscc]). 58 22 oretical support for even more accurate maps, Optimally, sc-seq analysis tools would accu- 59 23 because they provide sound theories about rately quantify all uncertainties arising from 60 24 reasonable, continuous distance metrics, in- experimental errors and biases. Such tools 61 25 stead of just distinct, discrete clusters. would prevent the uncertainties from propa- 62 gating to the intended downstream analyses 63 in an uncontrolled manner, and rather trans- 64 26 2.2 Quantifying uncertainty of late them into statistically sound and accu- 65 27 measurements and analysis rately quantified qualifiers of final results. 66

28 results

29 The amount of material sampled from single 2.3 Scaling to higher 67 30 cells is considerably less than that used in dimensionalities: more cells, 68 31 bulk experiments. Signals become more sta- more features, broader 69 32 ble when individual signals are summarized coverage 70 33 (such as in a bulk experiment), thus the in- 34 crease in resolution due to sc-seq also means The current blossoming of experimental 71 35 a reduction of the stability of the support- methods poses considerable statistical chal- 72 36 ing signals. The reduction in signal stability, lenges, and would do so even if measure- 73 37 in turn, implies that data becomes substan- ments were not affected by errors and bi- 74 ases. The increase in the number of sin- 75 1Figure 1 was adapted from Wolf et al. [2019], Fig. 3, provided under Creative Commons gle cells analyzed per experiment translates 76 Attribution 4.0 International License (http:// into more data points being generated, requir- 77 creativecommons.org/licenses/by/4.0/). ing methods to scale rapidly. Some scRNA- 78

6 �ssues cell types single cells

intermediate cell states trajectories

Figure 1: Different levels of resolution are of interest, depending on the research question and the data available. Thus, analysis tools and reference systems (such as cell atlases) will have to accommodate multiple levels of resolution from whole organs and tissues over discrete cell types to continuously mappable intermediate cell states, which are indistinguishable even at the microscopic level. A graph abstraction that enables such multiple levels of focus is provided by PAGA [Wolf et al., 2019], a structure that allows for discretely grouping cells, as well as inferring trajectories as paths through a graph.

1 seq SCDS methodology has started to ad- In parallel to the increase in the number 24 2 dress scalability [Sengupta et al., 2016, Sinha of cells queried and the number of different 25 3 et al., 2018, Wolf et al., 2018, Iacono et al., assays possible, the increase of the resolu- 26 4 2018, Amezquita et al., 2019], but the re- tion per cell of specific measurement types 27 5 spective issues have not been fully resolved causes a steady increase of the dimension- 28 6 and experimental methodology will scale fur- ality of corresponding data spaces. For the 29 7 ther. For scDNA-seq, experimental method- field of SCDS this amounts to a severe and 30 8 ology has just been scaling up to more cells recurring case of the “curse of dimensional- 31 9 recently (see Box 2 and section 5.1), making ity” for all types of measurements. Here 32 10 this a pressing challenge in the development again, scRNA-seq based methods are in the 33 11 of data analysis methods. lead when trying to deal with feature dimen- 34 sionality, while scDNA-seq based methodol- 35 12 Beyond basic scRNA-seq and scDNA-seq ogy (which includes epigenome assays) has yet 36 13 experiments, various assays have been pro- to catch up. 37 14 posed to measure chromatin accessibility 15 [Buenrostro et al., 2015, Cusanovich et al., Finally, there are efforts to measure mul- 38 16 2015], DNA methylation [Karemaker and Ver- tiple feature types in parallel, e.g., from 39 17 meulen, 2018], protein levels [Virant-Klun scDNA-seq (see section 5.2). Also, with spa- 40 18 et al., 2016], protein binding, and also for per- tial and temporal sampling becoming avail- 41 19 forming multiple simultaneous measurements able (see section 3.5 and section 5.3), data in- 42 20 [Clark et al., 2018, Cao et al., 2018] in single tegration methods need to scale to more and 43 21 cells. The corresponding increase in experi- new types of context information for individ- 44 22 mental choices means another possible infla- ual cells (see section 6.1 for a comprehensive 45 23 tion of feature spaces. discussion of data integration approaches). 46

7 1 3 Challenges in single-cell further method development. Sparsity per- 42 vades all aspects of scRNA-seq data analy- 43 2 transcriptomics sis, but in this challenge we focus on the 44 linked problems of learning latent spaces and 45 3 3.1 Challenge I: Handling “imputing” expression values from scRNA-seq 46 4 sparsity in single-cell RNA data (Figure 2). Imputation approaches are 47 48 5 sequencing closely linked to the challenges of normal- ization. But whereas normalization gener- 49 6 A comprehensive characterization of the tran- ally aims to make expression values between 50 7 scriptional status of individual cells enables us cells or experiments more comparable to each 51 8 to gain full insight into the interplay of tran- other, imputation approaches aim to achieve 52 9 scripts within single cells. However, scRNA- adjusted data values that better represent the 53 10 seq measurements typically suffer from large true expression values. Imputation methods 54 11 fractions of observed zeros, where a given could therefore be used for normalization, but 55 12 gene in a given cell has no unique molecu- do not entail all possible or useful approaches 56 13 lar identifiers or reads mapping to it. The to normalization. 57 14 term “dropout” is often used to denote ob- 15 served zero values in scRNA-seq data. But 3.1.1 Status 58 16 this term usually conflates two distinct types 17 of zero values: those attributable to method- The imputation of missing values has been 59 18 ological noise, where a gene is expressed but very successful for genotype data [Das et al., 60 19 not detected by the sequencing technology; 2018b]. Crucially, when imputing genotypes 61 20 and those attributable to biologically-true ab- we typically know which data are missing 62 21 sence of expression. Thus, we recommend (e.g., when no genotype call is possible due to 63 22 against the term “dropout” as a catch-all term no coverage of a locus; although see section 64 23 for observed zeros. Beyond biological vari- section 4.1 for the challenges with scDNA-seq 65 24 ation in the number of unexpressed genes, data). In addition, rich sources of external 66 25 the proportion of observed zeros, or degree of information are available (e.g., haplotype ref- 67 26 sparsity, is attributed to technical limitations erence panels). Thus, genotype imputation 68 27 (Hicks et al. [2018], Bacher and Kendziorski is now highly accurate and a commonly-used 69 28 [2016]). Those can result in artificial ze- step in data processing for genetic association 70 29 ros that are either systematic (e.g., sequence- studies [Das et al., 2018a]. 71 30 specific mRNA degradation during cell ly- The situation is somewhat different for 72 31 sis) or that occur by chance (e.g., barely ex- scRNA-seq data, as we do not routinely have 73 32 pressed transcripts that—at the same expres- external reference information to apply (see 74 33 sion level, due to sampling variation—will section 3.3). In addition, we can never be sure 75 34 sometimes be detected and sometimes not). which of the observed zeros represent “missing 76 35 Accordingly, the degree of sparsity depends data” and which accurately represent a true 77 36 on the scRNA-seq platform used, the sequenc- absence of gene expression in the cell [Hicks 78 37 ing depth and the underlying expression level et al., 2018]. 79 38 of the gene. In general, two broad approaches can be 80 39 Sparsity in scRNA-seq data can hinder applied to tackle this problem of sparsity: 81 40 downstream analyses and is still challenging (i) use statistical models that inherently 82 41 to model or handle appropriately, calling for model the sparsity, sampling variation and 83

8 true popula�on manifold denoising / imputa�on manifold measurement error denoising process "true" data point observed data point denoised data point imputed data point Figure 2: Measurement error requires denoising methods or approaches that quantify uncer- tainty and propagate it down analysis pipelines. Where methods cannot deal with abundant missing values, imputation approaches may be useful. While the true population manifold that generated data is never known, one can usually obtain some estimation of it that can be used for both denoising and imputation.

1 noise modes of scRNA-seq data with an ap- biological zeros. They aim to impute expres- 30 2 propriate data generative model (i.e. quanti- sion levels only for the technical zeros, leaving 31 3 fying uncertainty, see section 2.2); or (ii) at- other observed expression levels untouched. 32 4 tempt to “impute” values for observed zeros (B) Data-smoothing methods define a “simi- 33 5 (ideally the technical zeros; sometimes also larity” between cells (e.g., cells that are neigh- 34 6 non-zero values) that better approximate the bors in a graph or occupy a small region in a 35 7 true gene expression levels (Figure 2). We latent space) and adjust expression values for 36 8 prefer to use the first option where possible, each cell based on expression values in sim- 37 9 and for many single-cell data analysis prob- ilar cells. These methods usually adjust all 38 10 lems there already are statistical models ap- expression values, including technical zeros, 39 11 propriate for sparse count data that should biological zeros and observed non-zero val- 40 12 be used or extended (e.g., for differential ex- ues.( C) Data-reconstruction methods typi- 41 13 pression analysis, see section 3.2). However, cally aim to define a latent space representa- 42 14 there are many cases where the appropriate tion of the cells. This is often done through 43 15 models are not available and accurate im- matrix factorization (e.g., principal compo- 44 16 putation of technical zeros would allow bet- nent analysis) or, increasingly, through ma- 45 17 ter results from downstream methods and chine learning approaches (e.g., variational 46 18 algorithms that cannot handle sparse count autoencoders that exploit deep neural net- 47 19 data. For example, depending on the amount works to capture non-linear relationships). 48 20 of sparsity, imputation could potentially im- Both matrix factorization methods and au- 49 21 prove results of dimension reduction, visual- toencoders (among others) are able to “re- 50 22 ization and clustering applications. construct” the observed data matrix from 51 23 We define three broad (and often overlap- low-rank or simplified representations. The 52 24 ping) categories of methods that can be used reconstructed data matrix will typically no 53 25 to “impute” scRNA-seq data in the absence longer be sparse (with many zeros) and the 54 26 of an external reference (Table 1):( A) Mod- implicitly “imputed” data (or estimated la- 55 27 el-based imputation methods of technical zeros tent spaces if using e.g. variational autoen- 56 28 use probabilistic models to identify which ob- coders) can be used for downstream applica- 57 29 served zeros represent technical rather than tions such as clustering or trajectory inference 58

9 1 (section 3.4). A fourth—distinct—category is after suitable data normalization) as are other 45 2 (T) imputation with an external dataset or widely-used general statistical methods like 46 3 reference, using it for transfer learning. independent component analysis (ICA) and 47 4 The first category of methods generally non-negative matrix factorization (NMF). As 48 5 seeks to infer a probabilistic model that cap- (linear) matrix factorization methods, PCA, 49 6 tures the data generation mechanism. Such ICA and NMF decompose the observed data 50 7 generative models can be used to probabilis- matrix into a “small” number of factors in two 51 8 tically determine which observed zeros cor- low-rank matrices, one representing cell-by- 52 9 respond to technical zeros (to be imputed) factor weights and one gene-by-factor load- 53 10 and which correspond to biological zeros (to ings. Many matrix factorization methods 54 11 be left alone). There are many model-based with tweaks for single-cell data have been pro- 55 12 imputation methods already available that posed in recent years (Table 1C), with some 56 13 use ideas from clustering (e.g., k-means), di- specifically intended for imputation (ALRA, 57 14 mension reduction, regression and other tech- ENHANCE, scRMD). 58 15 niques to impute technical zeros, oftentimes Additionally, machine-learning methods 59 16 combining ideas from several of these ap- have been proposed for scRNA-seq data anal- 60 17 proaches (Table 1A). ysis that can, but need not, use proba- 61 18 Data-smoothing methods adjust all gene bilistic data generative processes to capture 62 19 expression levels based on expression levels low-dimensional or latent space representa- 63 20 in “similar” cells, aiming to “denoise” the val- tions of a dataset (Table 1C). Some of them 64 21 ues (Figure 2). Several such methods have are expressly aimed at imputation (e.g., Au- 65 22 been proposed to handle imputation prob- toImpute, DeepImpute, EnImpute, DCA and 66 23 lems (Table 1B). To take a simplified exam- scVI). But even if imputation is not the main 67 24 ple (Figure 2), we might imagine that single focus, such methods can generate “imputed” 68 25 cells originally refer to points along a curve expression values as an upshot of a model pri- 69 26 across a two-dimensional space. Projecting marily focused on other tasks, like learning 70 27 data points onto that curve eventually allows latent spaces, clustering, batch correction, or 71 28 imputation of the “missing” values (but all visualization (and often several of these tasks 72 29 points are adjusted, or smoothed, not just simultaneously). 73 30 true technical zeros). Finally, a small number of scRNA-seq im- 74 31 A major task in the analysis of high- putation methods extend approaches from 75 32 dimensional single-cell data is to find low- any (combination) of the three categories 76 33 dimensional representations of the data that above by incorporating information external 77 34 capture the salient biological signals and ren- to the current dataset (Table 1T). Approaches 78 35 der the data more interpretable and amenable using cell atlas-type reference resources are 79 36 to further analyses. As it happens, the matrix further discussed in section 3.3 and classified 80 37 factorization and latent-space learning meth- as approach +X+S in section 6.1 (see Figure 6 81 38 ods used for that task also provide a third and Table 3). 82 39 route for imputation: they can reconstruct 40 the observed data matrix from simplified rep- 3.1.2 Open problems 83 41 resentations of it. 42 Principal component analysis (PCA) is one A major challenge in this context is the circu- 84 43 standard matrix factorization method that larity that arises when imputation solely relies 85 44 can be applied to scRNA-seq data (preferably on information that is internal to the imputed 86

10 A: model-based imputation bayNorm binomial model, empirical Bayes prior Tang et al. [2018] BISCUIT Gaussian model of log counts, cell- and cluster-specific parameters Azizi et al. [2017] CIDR decreasing logistic model (DO), non-linear least-squares regression Lin et al. [2017b] (imp) SAVER NB model, Poisson LASSO regression prior Huang et al. [2018] ScImpute mixture model (DO), non-negative least squares regression (imp) Li and Li [2018] scRecover ZINB model (DO identification only) Miao et al. [2019] VIPER sparse non-negative regression model Chen and Zhou [2018]

B: data smoothing DrImpute k-means clustering of PCs of correlation matrix Gong et al. [2018] knn-smooth k-nearest neighbor smoothing Wagner et al. [2018b] LSImpute locality sensitive imputation Moussa and Măndoiu [2019] MAGIC diffusion across nearest neighbor graph Dijk et al. [2018] netSmooth diffusion across PPI network Jonathan Ronen [2018]

C: data reconstruction, matrix factorization ALRA SVD with adaptive thresholding Linderman et al. [2018] ENHANCE denoising PCA with aggregation step Wagner et al. [2019] scRMD robust matrix decomposition Chen et al. [2018] consensus NMF meta-analysis approach to NMF Kotliar et al. [2019] f-scLVM sparse Bayesian latent variable model Buettner et al. [2017] GPLVM Gaussian process latent variable model Verma and Engelhardt [2018] pCMF probab. count matrix factorization with Poisson model Durif et al. [2019] scCoGAPS extension of NMF Stein-O’Brien et al. [2019] SDA sparse decomposition of arrays (Bayesian) Jung et al. [2019] ZIFA ZI factor analysis Pierson and Yau [2015] ZINB-WaVE ZINB factor model Risso et al. [2018] C: data reconstruction, machine learning AutoImpute AE, no error back-propagation for zero counts Talwar et al. [2018] BERMUDA AE for cluster batch correction (MMD and MSE loss function) Wang et al. [2019b] DeepImpute AE, parallelized on gene subsets Arisdakessian et al. [2018] DCA deep count AE (ZINB / NB model) Eraslan et al. [2019] DUSC / DAWN denoising AE (PCA determines hidden layer size) Srinivasan et al. [2019] EnImpute ensemble learning consensus of other tools Zhang et al. [2019c] Expression AE (Poisson negative log-likelihood loss function) Kinalis et al. [2019] Saliency LATE non-zero value AE (MSE loss function) Badsha et al. [2018] Lin_DAE denoising AE (imputation across k-nearest neighbor genes) Lin et al. [2017a] SAUCIE AE (MMD loss function) Amodio et al. [2019] scScope iterative AE Deng et al. [2019] scVAE Gaussian-mixture VAE (NB / ZINB / ZIP model) Grønbech et al. [2019] scVI VAE (ZINB model) Lopez et al. [2018] scvis VAE (objective function based on latent variable model and t-SNE) Ding et al. [2018] VASC VAE (denoising layer; ZI layer, double-exponential and Gumbel dis- Wang and Gu [2018] tribution) Zhang_VAE VAE (MMD loss function) Zhang [2019]

T: using external information ADImpute gene regulatory network information Leote et al. [2019] netSmooth PPI network information Jonathan Ronen [2018] SAVER-X transfer learning with atlas-type resources Wang et al. [2019a] SCRABBLE matched bulk RNA-seq data Peng et al. [2019] TRANSLATE transfer learning with atlas-type resources Badsha et al. [2018] URSM matched bulk RNA-seq data Zhu et al. [2018]

Table 1: Short description of methods for the imputation of missing data in scRNA-seq data. Imputation methods using only data from within a dataset are roughly categorized ap- proaches A (model-based), B (data smoothing) and C (data reconstruction), with the latter further differentiated into matrix factorization and machine learning approaches. In contrast to these methods, those in category T (for transfer learning) also use information external to the dataset to be analyzed. AE - autoencoder; DO - dropout; imp - imputation; MMD - maximum mean discrepancy; MSE - mean squared error; NB - negative binomial; NMF - non-negative matrix factorization; P - Poisson; PC - principal 11 component; PCA - principal component analysis; PPI - protein-protein interaction; SVD - singular value decomposition; VAE - variational autoencoder; ZI - zero-inflated 1 dataset. This circularity can artificially am- seq imputation. This idea was adopted in 45 2 plify the signal contained in the data, lead- SCRABBLE and URSM, where an exter- 46 3 ing to inflated correlations between genes or nal reference is defined by bulk expression 47 4 cells. In turn, this can introduce false pos- measurements from the same population of 48 5 itives in downstream analyses such as differ- cells for which imputation is performed. Of 49 6 ential expression testing and gene network in- course, such orthogonal information can also 50 7 ference [Andrews and Hemberg, 2019]. Han- be provided by different types of molecu- 51 8 dling batch effects and potential confounders lar measurements (see section 6.1). Meth- 52 9 requires further work to ensure that imputa- ods designed to integrate multi-omics data 53 10 tion methods do not mistake unwanted varia- could then be extended to enable scRNA- 54 11 tion from technical sources for biological sig- seq imputation, for example through gener- 55 12 nal. In a similar vein, single-cell experiments ative models that explicitly link scRNA-seq 56 13 are affected by various uncertainties (see sec- with other data types [e.g., clonealign, Camp- 57 14 tion 2.2). Approaches that allow quantifica- bell et al., 2019] or by inferring a shared 58 15 tion and propagation of the uncertainties as- low-dimensional latent structure [e.g., MOFA, 59 16 sociated with expression measurements (sec- Argelaguet et al., 2018] that could be used 60 17 tion 2.2), may help to avoid problems associ- within a data-reconstruction framework. 61 18 ated with “overimputation” and the introduc- With the proliferation of alternative meth- 62 19 tion of spurious signals noted by Andrews and ods, comprehensive benchmarking is urgently 63 20 Hemberg [2019]. required—as for all areas of single-cell data 64 21 To avoid this circularity, it is important to analysis (see section 6.2). Early attempts by 65 22 identify reliable external sources of informa- Zhang and Zhang [2018] and Andrews and 66 23 tion that can inform the imputation process. Hemberg [2019] provide valuable insights into 67 24 One possibility is to exploit external reference the performance of methods available at the 68 25 panels (like in the context of genetic asso- time. But many more methods have since 69 26 ciation studies). Such panels are not gener- been proposed and even more comprehensive 70 27 ally available for scRNA-seq data, but ongo- benchmarking platforms are needed. Some 71 28 ing efforts to develop large scale cell atlases methods, especially those using deep learning, 72 29 [Regev et al., 2017, e.g.] could provide a valu- depend strongly on choice of hyperparameters 73 30 able resource for this purpose. Some meth- [Hu and Greene, 2019]. There, more detailed 74 31 ods have been extended to allow the use of comparisons that explore parameter spaces 75 32 such resources (e.g., SAVER-X and TRANS- would be helpful, extending work like that 76 33 LATE), but this will need to be done for all from Sun et al. [2019] comparing dimensional- 77 34 approaches (see section 3.3). ity reduction methods. Such detailed bench- 78 35 A second approach to avoid circularity is marking would also help to establish when 79 36 the systematic integration of known biologi- normalization methods derived from explicit 80 37 cal network structures in the imputation pro- count models [e.g., Hafemeister and Satija, 81 38 cess. This can be achieved by encoding net- 2019, Townes et al., 2019] may be preferable 82 39 work structure knowledge as prior informa- to imputation. 83 40 tion, as proposed by ADImpute, netSmooth Finally, scalability for large numbers of 84 41 and the tool by Lin et al. [2017a]. cells remains an ongoing concern for meth- 85 42 Finally, a third way of avoiding circular- ods allowing for imputation, as for all high- 86 43 ity in imputation is to explore complemen- throughput single-cell methods and software 87 44 tary types of data that can inform scRNA- (see section 2.3). 88

12 1 3.2 Challenge II: Defining ential expression (Figure 3), such as changes 41 42 2 flexible statistical in variability that account for mean expres- sion confounding [Eling et al., 2018], changes 43 3 frameworks for discovering in trajectory along pseudotime [Campbell and 44 4 complex differential patterns Yau, 2018, van den Berge et al., 2019], or 45 5 in gene expression more generally, changes in distributions [Ko- 46 rthauer et al., 2016b]. Furthermore, meth- 47 6 Beyond simple changes in average gene ex- ods for cross-sample comparisons of gene ex- 48 7 pression between cell types (or across bulk- pression (e.g., cell-type-specific changes in cell 49 8 collected libraries), scRNA-seq enables a state across samples; see section 6.1, Fig- 50 9 high granularity of changes in expression to ure 6 and Table 3) are now emerging, such 51 10 be unraveled. Interesting and informative as pseudo-bulk analyses [L. Lun et al., 2016, 52 11 changes in expression patterns can be re- Kang et al., 2018a, Crowell et al., 2019], where 53 12 vealed, as well as cell-type-specific changes in expression is aggregated over multiple cells 54 13 cell state across samples (Figure 6, approach within each sample, or mixed models, where 55 14 +S). Further understanding of gene expression both within- and between-sample variation is 56 15 changes will enable deeper knowledge across captured [Tung et al., 2017, Crowell et al., 57 16 a myriad of applications, such as immune 2019]. With the expanding capacity of exper- 58 17 responses [Kang et al., 2018b, Stubbington imental techniques to generate multi-sample 59 18 et al., 2017], development [Karaiskos et al., scRNA-seq datasets, further general and flex- 60 19 2017a], drug responses [Kim et al., 2015]. ible statistical frameworks will be required to 61 identify complex differential patterns across 62 63 20 3.2.1 Status samples. This will be particularly critical in clinical applications, where cells are collected 64 21 Currently, the vast majority of differential ex- from multiple patients. 65 22 pression detection methods assume that the 23 groups of cells to be compared are known in 3.2.2 Open problems 66 24 advance (e.g., experimental conditions or cell 25 types). However, current analysis pipelines Accounting for uncertainty in cell type as- 67 26 typically rely on clustering or cell type as- signment and for double use of data will 68 27 signment to identify such groups, before require, first of all, a systematic study of 69 28 downstream differential analysis is performed, their impact. Integrative approaches in which 70 29 without propagating the uncertainty in these clustering and differential testing are simul- 71 30 assignments or accounting for the double use taneously performed [Vavoulis et al., 2015] 72 31 of data (clustering, differential testing be- can address both issues. However, integra- 73 32 tween clusters). tive methods typically require bespoke imple- 74 33 In this context, most methods have fo- mentations, precluding a direct combination 75 34 cused on comparing average expression be- between arbitrary clustering and differential 76 35 tween groups [Kharchenko et al., 2014, Fi- testing tools. In such cases, the adaptation 77 36 nak et al., 2015], but it appears that single- of selective inference methods [Reid et al., 78 37 cell-specific methods do not uniformly outper- 2018] could provide an alternative solution, 79 38 form the state-of-the-art bulk methods [Sone- with an approach based on correcting the se- 80 39 son and Robinson, 2018]. Some attention has lection bias recently proposed [Zhang et al., 81 40 been given to more general patterns of differ- 2019b]. 82

13 popula�on differences in mean variability pseudo�me Q Q R R S S ...... Figure 3: Differential expression of a gene or transcript between cell populations. The top row labels the specific gene or transcript, as is also done in Figure 6. A difference in mean gene expression manifests in a consistent difference of gene expression across all cells of a population (e.g., high vs. low). A difference in variability of gene expression means that in one population, all cells have a very similar expression level, whereas in another population some cells have a much higher expression and some a much lower expression. The resulting average expression level may be the same and in such cases, only single-cell measurements can find the difference between populations. A difference across pseudotime is a change of expression within a population, for example along a developmental trajectory (compare Figure 1). This also constitutes a difference between cell populations that is not apparent from population averages, but requires a pseudo-temporal ordering of measurements on single cells.

1 While some methods exist to identify more etry datasets [Lun et al., 2017, Bruggner 21 2 general patterns of gene expression changes et al., 2014, Weber et al., 2018, Nowicka 22 3 (e.g., variability, distributions), these meth- et al., 2017, Arvaniti and Claassen, 2017]. 23 4 ods could be further improved by integrat- These approaches could be expanded to the 24 5 ing with existing approaches that account for higher dimensions and characteristic aspects 25 6 confounding effects such as cell cycle [Ste- of scRNA-seq data. Alternatively, there 26 7 gle et al., 2015] and complex batch effects is a large space to explore other general 27 8 [Butler et al., 2018a, Haghverdi et al., 2018]. and flexible approaches, such as hierarchi- 28 9 Moreover, our capability to discover interest- cal models where information is borrowed 29 10 ing gene expression patterns will be vastly across samples or exploring changes in full 30 11 expanded by connecting with other aspects distributions, while allowing for sample-to- 31 12 of single-cell expression dynamics, such as sample variability and subpopulation-specific 32 13 cell type composition, RNA velocity [Manno patterns [Crowell et al., 2019]. 33 14 et al., 2018], splicing and allele-specificity. 15 This will allow us to fully exploit the granu- 16 larity contained in single-cell level expression 17 measurements. 18 In the multi-donor setting, several promis- 19 ing methods have been applied to discover 20 state transitions in high-dimensional cytom-

14 1 3.3 Challenge III: Mapping in between clearly annotated cell clusters (see 43 44 2 single cells to a reference Figure 1 for an idea of what cell atlas type reference systems could look like). 45 3 atlas

4 Classifying cells into cell types or states is 3.3.1 Status 46 5 essential for many secondary analyses. As See Table 2 for a list of cell atlas type refer- 47 6 an example, consider studying and classifying ences that have recently been published. For 48 7 how expression within a cell type varies across human, similar endeavors as for the mouse are 49 8 different biological conditions (for differential under way, with the intention to raise a Hu- 50 9 expression analyses, see section 3.2 and data man Cell Atlas [Regev et al., 2017]. Towards 51 10 integration approach +S in Figure 6). To put this end, initial consortia focus on specific or- 52 11 the results of such studies on a map, reliable gans, for example the lung [Schiller et al., 53 12 reference systems with a resolution down to 2019]. 54 13 cell states are required—and depending on The availability of these reference atlases 55 14 the research question at hand, even intermedi- has led to the active development of methods 56 15 ate transition states might be of interest (see that make use of them in the context of su- 57 16 section 2.1). pervised classification of cell types and states 58 17 The lack of appropriate, available refer- [Lieberman et al., 2018, Srivastava et al., 59 18 ences has so far implied that only reference- 2018, Cao et al., 2019b, DePasquale et al., 60 19 free approaches were conceivable. Here, unsu- 2019, Kanter et al., 2019, Sato et al., 2019, 61 20 pervised clustering approaches were the pre- Zhang et al., 2019a]. Also, the systematic 62 21 dominant option (see data integration ap- benchmarking of this dynamic field of tools 63 22 1S proach in Figure 6). Method development has begun [Abdelaal et al., 2019]. A field 64 23 for such unsupervised clustering of cells has that can serve as a source of further inspi- 65 24 already reached a certain level of maturity; for ration is flow/mass cytometry, where several 66 25 a systematic identification of available tech- methods already address the classification of 67 26 niques, we refer to the respective reviews Duò high-dimensional cell type data [Chester and 68 27 et al. [2018], Freytag et al. [2018], Kiselev Maecker, 2015, Weber and Robinson, 2016, 69 28 et al. [2019]. Saeys et al., 2016, Guilliams et al., 2016]. 70 29 However, unsupervised approaches involve 30 manual cluster annotation. There are two 3.3.2 Open problems 71 31 major caveats: (i) manual annotation is a 32 time-consuming process, which also (ii) puts Cell atlases can still be considered under ac- 72 33 certain limits to the reproducibility of the re- tive development, with several computational 73 34 sults. Cell atlases, as reference systems that challenges still open, in particular referring to 74 35 systematically capture cell types and states, the fundamental themes from above [Regev 75 36 either tissue-specific or across different tis- et al., 2017, Schiller et al., 2019, Hon et al., 76 37 sues, remedy this issue (see data integration 2018]. Here, we focus on the mapping of cells 77 38 approach +X+S in Figure 6). They will need to or rather their molecular profiles onto stable 78 39 be able to embed new data points into a stable existing reference atlases to further highlight 79 40 reference framework that allows for different the importance of these fundamental themes. 80 41 levels of resolution and will have to eventu- A computationally and statistically sound 81 42 ally capture transitional cell states that fall method for mapping cells onto atlases for a 82

15 organism scale of cell atlas citation nematode whole organism at larval stage [Cao et al., 2017] Caenorhabditis ele- L2 gans planaria whole organism of the adult an- [Fincher et al., 2018, Plass et al., Schmidtea mediter- imal 2018] ranea fruit fly whole organism at embryonic [Karaiskos et al., 2017b] Drosophila stage melanogaster Zebrafish whole organism at embryonic [Farrell et al., 2018, Wagner stage et al., 2018a] frog whole organism at embryonic [Briggs et al., 2018] Xenopus tropicalis stage Mouse whole adult brain [Rosenberg et al., 2018, Saunders et al., 2018, Zeisel et al., 2018] Mouse whole adult organism [Tabula Muris Consortium, 2018, Han et al., 2018]

Table 2: Published cell atlases of whole tissues or whole organisms.

1 range of conceivable research questions will jaard et al., 2018, Plass et al., 2018, Fincher 23 2 need to: (i) enable operation at various levels et al., 2018, Farrell et al., 2018, Briggs et al., 24 3 of resolution of interest, and also cover con- 2018]. Importantly, additional information 25 4 tinuous, transient cell states (see section 2.1); available from lineage tracing of such simpler 26 5 (ii) quantify the uncertainty of a particular organisms can provide a cross-check with re- 27 6 mapping of cells of unknown type/state (see spect to the transcriptome-profile-based clas- 28 7 section 2.2); (iii) scale to ever more cells and sification [Briggs et al., 2018, Kester and van 29 8 broader coverage of types and states (see sec- Oudenaarden, 2018]. 30 9 tion 2.3); and (iv) eventually integrate in- 10 formation generated not only through scR- 3.4 Challenge IV: Generalizing 31 11 NA-seq experiments, but also through other 12 types of measurements, for example scD- trajectory inference 32 13 NA-seq or protein expression data (see sec- Several biological processes, such as differen- 33 14 tion 6.1 for a discussion of data integration, tiation, immune response or cancer expansion 34 15 especially approaches +M+C and +all in Fig- can be described and represented as continu- 35 16 ure 6). ous dynamic changes in cell type/state space 36 17 Finally, for further benchmarking of meth- using tree, graphical or probabilistic models. 37 18 ods that map cells of unknown type or state A potential path that a cell can undergo in 38 19 onto reference atlases (see section 6.2 for this continuous space is often referred to as 39 20 benchmarking in general), atlases of model a trajectory (Trapnell et al. [2014] and Fig- 40 21 organisms where full lineages of cells have ure 1), and the ordering induced by this path 41 22 been determined can form the basis [Span- is called pseudotime. Several models have 42

16 1 been proposed to describe cell state dynam- this, the following aspects are crucial. First, 41 2 ics starting from transcriptomic data [Saelens it is necessary to define the features to use. 42 3 et al., 2019]. Trajectory inference is in princi- For transcriptomic data the features are well 43 4 ple not limited to transcriptomics. Neverthe- annotated and correspond to expression lev- 44 5 less, modeling of other measurements, such as els of genes. In contrast, clear-cut features 45 6 proteomic, metabolomic, and epigenomic, or are harder to determine for data such as 46 7 even integrating multiple types of data (see methylation profiles and chromatin accessibil- 47 8 section 6.1), is still at its infancy. We be- ity where signals can refer to individual ge- 48 9 lieve the study of complex trajectories inte- nomic sites, but also be pooled over sequence 49 10 grating different data types, especially epige- features or sequence regions. Second, many 50 11 netics and proteomics information in addition of those recent technologies only allow mea- 51 12 to transcriptomics data, will lead to a more surement of a quite limited number of cells 52 13 systematic understanding of the processes de- compared to transcriptomic assays [Macosko 53 14 termining cell fate. et al., 2015, Klein et al., 2015, Zheng et al., 54 2017]. Third, some of those measurements are 55 technically challenging since the input ma- 56 15 3.4.1 Status terial for each cell is limited (for example 57 16 Trajectory methods start from a count matrix two copies of each chromosome for methyla- 58 17 where genes are rows and cells are columns. tion or chromatin accessibility), giving rise 59 18 First, a feature selection or dimensionality re- to more sparsity than scRNA-seq. In the 60 19 duction step is used to explore a subspace latter case it is necessary to define distance 61 20 where distances between cells are more reli- or similarity metrics that take this into ac- 62 21 able. Next, clustering and minimum spanning count. An alternative approach consists of 63 22 trees [Trapnell et al., 2014, Ji and Ji, 2016], pooling/combining information from several 64 23 principal curve or graph fitting [Qiu et al., cells or data imputation (see section 3.1). For 65 24 2017, Chen et al., 2019, Rizvi et al., 2017], example, imputation has been used for single- 66 25 or random walks and diffusion operations on cell DNA methylation [Angermueller et al., 67 26 graphs (Haghverdi et al. [2016], Setty et al. 2017], aggregation over chromatin accessibil- 68 27 [2016] among others) are used to infer pseudo- ity peaks from bulk or pseudo-bulk sample 69 28 time and/or branching trajectories. Alterna- [Cusanovich et al., 2018], and k-mer-based 70 29 tive probabilistic descriptions can be obtained approaches have been proposed [Buenrostro 71 30 using optimal transport analysis [Schiebinger et al., 2018, de Boer and Regev, 2018, Chen 72 31 et al., 2017] or approximation of the Fokker- et al., 2019]. However, so far, no systematic 73 32 Planck equations [Weinreb et al., 2018] or evaluation (see section 6.2) of those choices 74 33 by estimating pseudotime through dimension- has been performed and it is not clear how 75 34 ality reduction with a Gaussian process la- many cells are necessary to reliably define 76 35 tent variable model [Campbell and Yau, 2016, those features. 77 36 Reid and Wernisch, 2016, Ahmed et al., 2019]. A pressing challenge is to assess how the 78 various trajectory inference methods perform 79 on different data types and importantly, to 80 37 3.4.2 Open problems define metrics that are suitable. Also, it is 81 38 Many of the above-mentioned methods for necessary to reason on the ground truth or 82 39 trajectory inference can be extended to data propose reasonable surrogates (e.g., previous 83 40 obtained with non-transcriptomic assays. For knowledge about developmental processes). 84

17 1 Some recent papers explore this idea using tissues can be decomposed using the Niche- 41 2 scATAC-seq data, an assay to measure chro- seq approach [Medaglia et al., 2017]: here 42 3 matin accessibility [Buenrostro et al., 2018, a group of cells are specifically labeled with 43 4 Chen et al., 2019, Pliner et al., 2018]. a fluorescent signal, sorted and subjected to 44 5 Having defined robust methods to recon- scRNA-seq. The Slide-seq approach uses an 45 6 struct trajectories from each data type, an- array of Drop-seq beads with known barcodes 46 7 other future challenge is related to their com- to dissolve corresponding slide sites and se- 47 8 parison or alignment. Here, some ideas from quence them with the respective barcodes 48 9 recent methods used to align transcriptomic [Rodriques et al., 2019]. Ultimately, one 49 10 datasets could be extended [Butler et al., would like to sequence inside a tissue with- 50 11 2018b, Haghverdi et al., 2018, Welch et al., out dissociating the cells and without com- 51 12 2018]. A related unsolved problem is that promising on the unbiased nature of scRNA- 52 13 of comparing different trajectories obtained seq. First approaches aiming to implement 53 14 from the same data type but across individu- sequencing by synthesis in situ were proposed 54 15 als or conditions, in order to highlight unique by Ke et al. [2013] and Lee et al. [2015], the 55 16 and common aspects. latter being referred to as FISSEQ (Fluores- 56 cent in situ sequencing). Recently, starMAP 57 [Wang et al., 2018] was presented. Here, 58 17 3.5 Challenge V: Finding RNA within an intact 3D tissue can be ampli- 59 18 patterns in spatially resolved fied and transferred into a hydrogel. Within 60 19 measurements the hydrogel, amplified DNA barcodes can 61 be sequenced in situ, in order to distinguish 62 20 Single-cell spatial transcriptomics or pro- RNA species while retaining spatial coordi- 63 21 teomics [Crosetto et al., 2015, Strell et al., nates. Instead of performing a direct identi- 64 22 2018, Moffitt et al., 2018] technologies can fication of (parts of) the RNA sequence, flu- 65 23 obtain transcript abundance measurements orescent in situ hybridization (FISH) based 66 24 while retaining spatial coordinates of cells or methods require to design probes for target- 67 25 even transcripts within a tissue (this can be ing RNA species of interest. When multi- 68 26 seen as an additional feature space to inte- plexing several rounds of FISH in combina- 69 27 grate, see approach +M1C in section 6.1, Fig- tion with designed barcodes for each RNA 70 28 ure 6 and Table 3). With such data, the ques- species, it becomes possible to measure hun- 71 29 tion arises of how spatial information can best dreds to thousands of RNA species simulta- 72 30 be leveraged to find patterns, infer cell types neously. Lubeck et al. [2014] have shown a 73 31 or functions, and classify cells in a given tissue first approach of multiplexed, barcoded FISH 74 32 [Tanay and Regev, 2017]. to measure tens of RNA species simultane- 75 ously, called seqFISH. Later, MERFISH was 76 33 3.5.1 Status proposed by Chen et al. [2015], which enabled 77 34 Experimental approaches have been tailored the measurement of hundreds to thousands of 78 35 to either systematically extract foci of cells transcripts in single cells simultaneously while 79 36 and analyze them with scRNA-seq, or to mea- retaining spatial coordinates [Moffitt et al., 80 37 sure RNA and proteins in situ. Histological 2016]. Subsequently, Shah et al. [2016b] have 81 38 sections can be projected in two dimensions scaled seqFISH to hundreds of RNA species 82 39 while preserving spatial information using se- as well. This year, Eng et al. [2019] presented 83 40 quencing arrays [Ståhl et al., 2016]. Whole SeqFISH+, which scales the FISH barcoding 84

18 1 strategy to 10,000 RNA species by splitting marked point processes [Jacobsen, 2005], and 45 2 each of four barcode locations to be scanned Svensson et al. [2018a] provided a method to 46 3 into 20 separate readings to avoid optical sig- perform a spatially resolved differential ex- 47 4 nal crowding. The latter can also be an is- pression analysis. Here, spatial dependence 48 5 sue when fewer RNA species are measured, in for each gene is learned by non-parametric re- 49 6 particular at densely populated regions such gression, enabling the testing of the statistical 50 7 as the nucleus [Chen et al., 2015]. To solve significance for a gene to be differentially ex- 51 8 such issues at the expense of measuring fewer pressed in space. 52 9 RNA species, Codeluppi et al. [2018] have 10 proposed osmFISH, which uses a single fluo- 3.5.2 Open problems 53 11 rescent probe per RNA species and leverages 12 FISH iterations to measure different species The central problem is to consider gene or 54 13 instead of building up a barcode. This leads transcript expression and spatial coordinates 55 14 to a number of recognizable RNA species that of cells, and derive an assignment of cells to 56 15 is linear in the number of FISH iterations. classes, functional groups or cell types. De- 57 16 In addition to the methods that provide in pending on the studied biological question, it 58 17 situ measurements of RNA, mass cytometry can be useful to constrain assignments with 59 18 [Giesen et al., 2014, Angelo et al., 2014] and expectations on the homogeneity of the tis- 60 19 multiplexed immunofluorescence [Lin et al., sue. For example, a set of cells grouped to- 61 20 2018, Saka et al., 2019, Goltsev et al., 2018] gether might be required to appear in one 62 21 can be used to quantify the abundance of pro- or multiple clusters where little to no other 63 22 teins while preserving subcellular resolution. cells are present. Such constraints might de- 64 23 Finally, the recently described Digital Spatial pend on the investigated cell types or tissues. 65 24 Profiling [DSP, Merritt et al., 2019, Van and For example, in cancer, spatial patterns can 66 25 Blank, 2019] promises to provide both RNA occur on multiple scales, ranging from sin- 67 26 and protein measurements with spatial reso- gle infiltrating immune cells [Fridman et al., 68 27 lution. 2011] and minor subclones [Swanton, 2012] to 69 28 For determining cell types, or clustering larger subclonal structures or the embedding 70 29 cells into groups that conduct a common func- in surrounding normal tissue and the tumor 71 30 tion, several methods are available [Zhang microenvironment [Cretu and Brooks, 2007]. 72 31 et al., 2019a, Kiselev et al., 2018, Butler et al., Currently, to the best of our knowledge, there 73 32 2018b], but none of these currently use spa- is no method available that would allow the 74 33 tial information directly. In contrast, spatial encoding of such prior knowledge while infer- 75 34 correlation methods have been used to de- ring cell types by integrating spatial informa- 76 35 tect the aggregation of proteins [Shivanandan tion with transcript or gene expression. The 77 36 et al., 2016]. Shah et al. [2016a] use seqFISH expected tissue heterogeneity therefore also 78 37 to measure transcript abundance of a set of impacts the desired properties of the assign- 79 38 marker genes while retaining the spatial co- ment method itself. For example, in order to 80 39 ordinates of the cells. Cells are clustered by also recognize groups or types of interest that 81 40 gene expression profiles and then assigned to are expected to occur at multiple locations, 82 41 regions in the brain based on their coordinates applicable methods should not strictly rely on 83 42 in the sample. Recently, Edsgärd et al. [2018] co-localization of transcriptional profiles. 84 43 presented a method to detect spatial differ- Another important aspect when modeling 85 44 ential expression patterns per gene based on the relation between space and expression 86

19 1 is whether uncertainty in the measurements and therapeutic choices (Figure 4). 41 2 can be propagated to downstream analyses. Classic bulk sequencing data of tumor sam- 42 3 For example, it is desirable to rely on tran- ples taken during surgery are always a mix- 43 4 script quantification methods that provide ture of tumor and normal cells (including, 44 5 the posterior distribution of transcript expres- e.g., invading immune cells). This means 45 6 sion [Kharchenko et al., 2014, Köster et al., that disentangling mutational profiles of tu- 46 7 2019a] and propagate this information to the mor subclones will always be challenging, 47 8 spatial analysis. Since many spatial measure- which especially holds for rare subclones that 48 9 ment approaches entail an optical, microscopy could nevertheless be the ones bearing resis- 49 10 based component, it would be beneficial to ex- tance mutation combinations prior to a treat- 50 11 tract additional information from these mea- ment. Here, the sequencing of single cells 51 12 surements. For example, cell shape and size, holds the exciting promise of directly identify- 52 13 as well as the subcellular spatial distribution ing and characterizing those subclone profiles 53 14 of transcripts or proteins could be used to ad- (Figure 4). 54 15 ditionally guide the clustering or classification Ideally, scDNA-seq should provide informa- 55 16 process. Finally, in light of issues with spar- tion about the entire repertoire of distinct 56 17 sity in single-cell measurements (section 3.1), events that occurred in the genome of a sin- 57 18 it appears desirable to integrate spatial in- gle cell, such as copy number alterations, ge- 58 19 formation into the quantification itself, and, nomic rearrangements, together with SNVs 59 20 for example, use neighboring cells within the and smaller insertion and deletion variants. 60 21 same tissue for imputation or the inference of However, scDNA-seq requires WGA of the 61 22 a posterior distribution of transcript expres- DNA extracted from single cells and this am- 62 23 sion. plification introduces errors and biases that 63 present a serious challenge to variant call- 64 ing [de Bourcy et al., 2014, Hou et al., 2015, 65 Huang et al., 2015, Estévez-Gómez et al., 66 24 4 Challenges in single-cell 2018]. It is broadly accepted that different 67 25 genomics WGA technologies should be used to detect 68 different types of variation. PCR-based ap- 69 26 With every cell division in an organism, the proaches [Telenius et al., 1992, Zhang et al., 70 27 genome can be altered through mutational 1992, Klein et al., 1999, Arneson et al., 2008] 71 28 events ranging from point mutations, over are best suited for CNV calling, as they 72 29 short insertions and deletions, to large scale achieve a more uniform coverage. But they 73 30 copy number variations and complex struc- require thermostable polymerases that with- 74 31 tural variants. In cancer, the entire reper- stand the temperature maxima during PCR 75 32 toire of these genetic events can occur during cycling, and all such polymerases have rela- 76 33 disease progression (Figure 4). The resulting tively high error rates. In contrast, MDA- 77 34 tumor cell populations are highly heteroge- based techniques are the method of choice for 78 35 neous. As tumor heterogeneity can predict SNV calling, as they achieve much lower error 79 36 patient survival and response to therapy [Mc- rates with the high-fidelity Φ29 DNA poly- 80 37 Granahan and Swanton, 2017, Lawson et al., merase [Blanco et al., 1989, Dean et al., 2002, 81 38 2018], including immunotherapy, quantifying Spits et al., 2006b, Picher et al., 2016a, Paez 82 39 this heterogeneity and understanding its dy- et al., 2004, Spits et al., 2006a] (in an isother- 83 40 namics are crucial for improving diagnosis mal reaction, as it would not be stable at com- 84

20 selec�on pressures small muta�on sampled cell theore�cal (SNV / indel) subclone phylogeny large dele�on amplifica�on

subclone compe��on

biopsy immune detec�on

tumour resec�on

therapy

metastasis resec�on space �me

Figure 4: A tumor evolves somatically—from initiation to detection, to resection, and to pos- sible metastasis. New genomic mutations can confer a selective advantage to the resulting new subclone, that allows it to outperform other tumor subclones (subclone competition). At the same time, the acting selection pressures can change over time (e.g., due to new subclones arising, the immune system detecting certain subclones, or as a result of therapy). Understanding such selective regimes—and how specific mutations alter a subclone’s sus- ceptibility to changes in selection pressures—will help construct an evolutionary model of tumorigenesis. And it is only within this evolutionary model, that more efficient and more patient-specific treatments can be developed. For such a model, unambiguously identifying mutation profiles of subclones via scDNA-seq of resected or biopsied single cells is crucial.

21 Box 2: Whole genome amplification: recent improvements Recent improvements of whole genome amplification (WGA) methods promise to reduce am- plification biases and errors, while scaling throughput to larger cell numbers:

(i) Improved coverage uniformity for multiple displacement amplification (MDA) has been achieved using droplet microfluidics-based methods (eWGA Fu et al. [2015]; sd-MDA, Hosokawa et al. [2017]); ddMDA, Sidore et al. [2016]). A second approach has been to couple the Φ29 DNA polymerase to a primase to reduce priming bias [Picher et al., 2016a].

(ii) One way to reduce the amplification error rate of the polymerase chain reaction (PCR)- based methods (including multiple annealing and looping-based amplification cycles (MALBAC)) would be to employ a thermostable polymerase (necessary for use in PCR) with proof-reading activity similar to Φ29 DNA polymerase, but we are not aware of any PCR DNA polymerases with a fidelity in the range of Φ29 DNA polymerase [Potapov and Ong, 2017].

(iii) Three newer methods use an entirely different approach: they randomly insert transposons into the whole genome and then leverage these as priming sites for amplification and library preparation. Transposon Barcoded (TnBC) library preparation (with a PCR amplifica- tion, [Xi et al., 2017]) and direct library preparation (DLP) (with a shallow library without any amplification, Zahn et al. [2017a]), allow only for copy number variation (CNV) call- ing, but DLP scales up to 80,000 single cells [Laks et al., 2018]. Linear—as opposed to exponential—Amplification via Transposon Insertion (LIANTI, [Chen et al., 2017]) also addresses amplification errors: All copies are generated based on the original genomic DNA through in vitro transcription. With errors unique to individual barcoded copies, the authors report a false positive rate that is even lower than for MDA [Chen et al., 2017].

22 1 mon PCR temperature maxima). But MDA 2 suffers from stronger allelic bias in the amplifi- 3 cation, possibly because it is more sensitive to 10000 4 DNA input quality [Bäumer et al., 2018] and 5 biased priming [Picher et al., 2016b]. The goal 6 must thus be to (i) improve the coverage uni- 11000 7 formity of MDA-based methods, (ii) reduce 8 the error rate of the PCR-based methods, or 9 (iii) create new methods that exhibit both a 11100 10 low error rate and a more uniform amplifica- 11 tion of alleles. Recent years witnessed inten- 12 sive research in these directions (see Box 2), 13 promising scalable methodology for scDNA- 14 seq comparable to that already available for 1?101 1100? 11010 100?0 15 scRNA-seq, while at the same time reducing 16 previously limiting errors and biases. While Figure 5: Mutations (colored stars) accumu- 17 this is not a SCDS challenge, it remains cen- late in cells during somatic cell divisions 18 tral to continuously and systematically evalu- and can be used to reconstruct the develop- 19 ate the whole range of promising WGA meth- mental lineages of individual cells within an 20 ods for the identification of all types of genetic organism (leaf nodes of the tree with muta- 21 variation from SNVs over smaller insertions tional presence / absence profiles attached). 22 and deletions up to copy number variation However, insufficient or unbalanced WGA 23 and structural variants. can lead to the dropout of one or both alle- les at a genomic site. This can be mitigated by better amplification methods, but also 24 4.1 Challenge VI: Dealing with by computational and statistical methods 25 errors and missing data in that can account for or impute the missing

26 the identification of variation values.

27 from single-cell DNA

28 sequencing data. ify cancer patients for the presence of resistant 42 43 29 The aim of scDNA-seq usually is to track so- subclones—it is instrumental to genotype and 44 30 matic evolution at the cellular level, that is, also phase genetic variants in single cells with 45 31 at the finest resolution possible relative to the sufficiently high confidence. 32 laws of reproduction (cell division, Figure 5). The major disturbing factor in scDNA-seq 46 33 Examples are identifying heterogeneity and data is the WGA process (see above). All 47 34 tracking evolution in cancer, as the likely methodologies introduce amplification errors 48 35 most predominant use case (also see below (false positive alternative alleles), but more 49 36 in section 5), but also monitoring the inter- drastic is the effect of amplification bias: 50 37 action of somatic mutation with developmen- the insufficient or complete failure of am- 51 38 tal and differentiation processes. To track ge- plification, which leads to imbalanced pro- 52 39 netic drifts, selective pressures, or other phe- portions or complete lack of variant alleles. 53 40 nomena inherent to the development of cell Overall, one can distinguish between three 54 41 clones or types (Figure 4)—but also to strat- cases: (i) an imbalanced proportion of alle- 55

23 1 les, i.e. loci harboring heterozygous mutations nucleotide polymorphisms. It exploits the 43 2 where preferential amplification of one of the fact that allele drop-out affects contiguous re- 44 3 two alleles leads to distorted read counts; gions of the genome large enough to harbor 45 4 (ii) allele drop-out, i.e. loci harboring het- several, and not only one, heterozygous mu- 46 5 erozygous mutations where only one of the tation loci. SCAN-SNV works along simi- 47 6 alleles was amplified and sequenced, and lar lines, fitting a region-specific allelic bal- 48 7 (iii) site drop-out, which is the complete fail- ance model to germline heterozygous variants 49 8 ure of amplification of both alleles at a site called in a reference bulk sample. Mono- 50 9 and the resulting lack of any observation of a var uses an orthogonal approach to variant 51 10 certain position of the genome. Note that (ii) calling. It does not assume any dependency 52 11 can be considered an extreme case of (i). across sites, but instead handles low and un- 53 12 A sound imputation of missing alleles and even coverage and false positive alternative 54 13 a sufficiently accurate quantification of un- alleles by integrating the sequencing informa- 55 14 certainties will yield massive improvements in tion across multiple cells. While Monovar 56 15 geno- and haplotyping (phasing) somatic vari- merely creates a consensus across cells, in- 57 16 ants. This, in turn, is necessary to substan- tegrating across cells is particularly powerful 58 17 tially improve the identification of subclonal if further knowledge about the dependency 59 18 genotypes and the tracking of evolutionary structure among cells is incorporated. As 60 19 developments. Potential improvements in this pointed out above, due to the lack of recom- 61 20 area include (i) more explicit accounting for bination, any sample of cells derived from an 62 21 possible scDNA-seq error types, (ii) integrat- organism shares an evolutionary history that 63 22 ing with different data types with error pro- can be described by a cell lineage tree (see sec- 64 23 files different from scDNA-seq (e.g., bulk se- tion 5). This tree, however, is in general un- 65 24 quencing or RNA sequencing), or (iii) inte- known and can in turn only be reconstructed 66 25 grating further knowledge of the process of from single-cell mutation profiles. A possible 67 26 somatic evolution, such as the constraints of solution is to infer both mutation calls and 68 27 phylogenetic relationships among cells, into a cell lineage tree at the same time, an ap- 69 28 variant calling models. In this latter context, proach taken by a number of existing tools: 70 29 it is important to realize that somatic evolu- single-cell Genotyper [Roth et al., 2016], Sci- 71 30 tion is asexual. Thus, no recombination oc- CloneFit [Zafar et al., 2018] and SciΦ [Singer 72 31 curs during mitosis, eliminating a major dis- et al., 2018]. Finally, SSrGE, identifies SNVs 73 32 turbing factor usually encountered when aim- correlated with gene expression from scRNA- 74 33 ing to reconstruct species or population trees seq data [Poirion et al., 2018]. 75 34 from germline mutation profiles. Some basic approaches to CNV calling from 76 scDNA-seq data are available. These are usu- 77 ally based on hidden markov models (HMMs) 78 35 4.1.1 Status where the hidden variables correspond to copy 79 36 Current single-cell specific SNV callers in- number states, as, for example, in Aneufinder 80 37 clude Monovar [Zafar et al., 2016], SCcaller [Bakker et al., 2016]. Another tool, Ginkgo, 81 38 [Dong et al., 2017] and SCAN-SNV [Luquette provides interactive CNV detection using cir- 82 39 et al., 2019]. SCcaller detects somatic vari- cular binary segmentation, but is only avail- 83 40 ants independently for each cell, but accounts able as a web-based tool [Garvin et al., 2015]. 84 41 for local allelic amplification biases by inte- ScRNA-seq data, which does not suffer from 85 42 grating across neighboring germline single- the errors and biases of WGA, can also be 86

24 1 used to call CNVs or loss of heterozygosity seq variant callers with those respective capa- 43 2 events: an approach called HoneyBADGER bilities. 44 3 [Fan et al., 2018] utilizes a probabilistic hid- For copy number variation calling, software 45 4 den Markov model, whereas the R package has previously been published mostly in con- 46 5 inferCNV simply averages the expression over junction with data-driven studies. Here, a 47 6 adjacent genes [Patel et al., 2014]. systematic analysis of biases in the most com- 48 mon WGA methods for copy number vari- 49 50 7 4.1.2 Open problems ation calling (including newer methods to come) could further inform method devel- 51 8 SNV callers for scDNA-seq data have already opment. The already mentioned approach 52 9 incorporated amplification error rates and al- of leveraging amplification bias for phasing 53 10 lele dropout in their models. Beyond these could also be informative [Satas and Raphael, 54 11 rates, the challenge remains to further extend 2018]. 55 12 this by directly modeling the amplification The final challenge is a systematic compar- 56 13 process using statistics, that would inherently ison of tools beyond the respective software 57 14 account for both errors and biases, and more publications, which is still lacking for both 58 15 accurately quantify the resulting uncertain- SNV and CNV callers. This requires system- 59 16 ties (see section 2.2). This could be achieved atic benchmarks, which in turn require sim- 60 17 by expanding models that accurately quan- ulation tools to generate synthetic datasets, 61 18 tify uncertainties in related settings [Köster as well as real sample based benchmarking 62 19 et al., 2019b] and would ultimately even al- datasets with a reasonably reliable ground 63 20 low reliable control of false discovery rates in truth (see section 6.2). 64 21 the variant discovery and genotyping process. 22 Such expanded models can build on a number 23 of recent studies in this context, for example 5 Challenges in single-cell 65 24 on a formalization in a recent preprint [Kop- 25 tagel et al., 2018]. Furthermore, such mod- phylogenomics 66 26 els could integrate the structure of cell lin- 27 eage trees with the structure implicit in haplo- Single-cell variation profiles from scDNA-seq, 67 28 types that link alleles. For haplotype phasing, as described above (section 4.1), can be used 68 29 Satas and Raphael [2018] recently proposed in computational models of somatic evolution, 69 30 an approach based on contiguous stretches including cancer evolution as an important 70 31 of amplification bias (similar to SCcaller, see special case (Figure 4). For cancer, there is an 71 32 above), whereas others propose read-backed ongoing, lively discussion about the very na- 72 33 phasing in two recent studies [Bohrson et al., ture of evolutionary processes at play, with 73 34 2019, Hård et al., 2019]. In addition, the in- competing theories such as linear, branch- 74 35 tegration with deep bulk sequencing data, as ing, neutral, and punctuated evolution [Davis 75 36 well as with scRNA-seq data remains unex- et al., 2017]. 76 37 plored, although it promises to improve the Models of cancer evolution may range from 77 38 precision of callers without compromising sen- a simple binary representation of the pres- 78 39 sitivity. ence versus the absence of a particular mu- 79 40 Identification of short insertions and dele- tational event (Figure 5), to elaborate models 80 41 tions (indels) is another major challenge to be of the mechanisms and rates of distinct muta- 81 42 addressed: we are not aware of any scDNA- tional events. There are two main modeling 82

25 1 approaches that lend themselves to the anal- fitness advantage of driver mutations they are 45 2 ysis of tumor evolution [Altrock et al., 2015]: linked to via their haplotype. The parameters 46 3 phylogenetics and population genetics. of population genetic models describe inher- 47 4 Phylogenetics comes with a rich reper- ent features of individual cells that are rele- 48 5 toire of computational methods for likelihood- vant for the evolution of their populations, for 49 6 based inference of phylogenetic trees [Felsen- example fitness and the rates of birth, death, 50 7 stein, 1981]. Traditionally, these methods are and mutations. Such cell-specific parameters 51 8 used to reconstruct the evolutionary history should more naturally apply to and be derived 52 9 of a set of distinct species. However, they can from information gathered by sequencing of 53 10 also be applied to cancer cells or subclones individual cells, as opposed to sequencing of 54 11 (Figure 4). In this setting, tips of the phy- bulk tissue samples. Models using these pa- 55 12 logeny (also called leaves or taxa) represent rameters will, for example, be essential in the 56 13 sampled and sequenced cells or subclones, design of adaptive cancer treatment strategies 57 14 whereas inner nodes (also called ancestral) that aim at managing subclonal tumor com- 58 15 represent their hypothetical common ances- position [Acar et al., 2019, Zhang et al., 2017]. 59 16 tors. The input for a phylogenetic inference 17 commonly consists of a multiple sequence 5.1 Challenge VII: Scaling 60 18 alignment (MSA) of molecular sequences for phylogenetic models to many 61 19 the species of interest. For cancer phyloge- 20 nies, one would concatenate the SNVs (and cells and many sites 62 21 possibly other variant types) to assemble the Even if given perfect data, phylogenetic mod- 63 22 input MSA. The key challenge for phyloge- els of tumor evolution would still face the 64 23 netic method development comprises design- challenge of computational tractability, which 65 24 ing sequence evolution models that are (i) bio- is mainly induced by: (i) the increasing num- 66 25 logically realistic and yet (ii) computationally bers of cells that are sequenced in cancer 67 26 tractable for the increasingly large number of studies, and (ii) the increasing numbers of 68 27 sequenced cells per patient and study. sites that can be queried per genome (see sec- 69 28 In population genetics, the tumor is under- tion 2.3). 70 29 stood as a population of evolving cells (Fig- 30 ure 4). To date, population genetic theory 5.1.1 Open problems 71 31 has been used to model the initiation, pro- 32 gression and spread of tumors from bulk se- (i) While adding data from more single cells 72 33 quencing data [Foo et al., 2011, Beerenwinkel will help improve the resolution of tumor 73 34 et al., 2007, Haeno et al., 2012]. The general phylogenies [Graybeal, 1998, Pollock et al., 74 35 mathematical framework behind these mod- 2002], this exacerbates one of the main chal- 75 36 els are branching processes [Kimmel and Ax- lenges of phylogenetic inference in general: 76 37 elrod, 2015], for example in models of the the immense space of possible tree topologies 77 38 accumulation of driver and passenger muta- that grows super-exponentially with the num- 78 39 tions [Bozic et al., 2016, 2010]. Here, the ber of taxa—in our case the number of sin- 79 40 driver mutations carry a fitness advantage, gle cells. Phylogenetic inference is NP-hard 80 41 as might epistatic interactions among them [Roch, 2006] under most scoring criteria (a 81 42 [Bauer et al., 2014]. In contrast, passenger scoring criterion takes a given tree and MSA 82 43 mutations are assumed to be neutral regard- to calculate how well the tree explains the 83 44 ing fitness; they merely hitchhike along the observed data). Calculating the given score 84

26 1 on all possible trees to find the tree that ure 4). In turn, this should increase the reso- 42 2 best explains the data is computationally not lution and reliability of the resulting trees. 43 3 feasible for MSAs containing more than ap- 4 proximately 20 single cells, and thus requires 5.2.1 Status 44 5 heuristic approaches to explore only promis- 6 ing parts of the tree search space. For phylogenetic tree inference from SNVs of 45 7 (ii) In addition to the growing number of single cells, a considerable number of tools 46 8 cells (taxa), the breadth of genomic sites and exist. The early tools OncoNEM [Ross and 47 9 genomic alterations that can be queried per Markowetz, 2016] and SCITE [Jahn et al., 48 10 genome also increases. Classical approaches 2016] use a binary representation of presence 49 11 thus need not only scale with the number of or absence of a particular SNV. They account 50 12 single cells queried (see above), but also with for false negatives, false positives and missing 51 13 the length of the input MSA. Here, previ- information in SNV calls, where false neg- 52 14 ous efforts for parallelization [Aberer et al., atives are orders of magnitude more likely 53 15 2014, Ayres, 2017] and other optimisation ef- to occur than false positives. The more re- 54 16 forts [Ogilvie et al., 2017] exist and can be cent tool SiFit [Zafar et al., 2017] also uses 55 17 built upon. The breadth of sequencing data a binary SNV representation, but infers tu- 56 18 also allows determination of large numbers of mor phylogenies allowing for both noise in 57 19 invariant sites, which further raises the ques- the calls and for violations of the infinite sites 58 2 20 tion of whether including them will change re- assumption . Another approach allowing for 59 21 sults of phylogenetic inferences in the context violations of the infinite sites assumption is 60 22 of cancer. Excluding invariant sites from the the extension of the Dollo parsimony model 61 23 inference has been coined ascertainment bias. to allow for k losses of a mutation (Dollo- 62 24 For phylogenetic analyses of closely related in- k) [El-Kebir, 2018, Ciccolella et al., 2018]. 63 25 dividuals from a few populations it has been Single-cell genotyper [Roth et al., 2016], Sci- 64 26 shown that accounting for ascertainment bias CloneFit [Zafar et al., 2018], or SciΦ [Singer 65 27 alters branch lengths, but not the resulting et al., 2018] jointly call mutations in indi- 66 28 tree topologies per se [Leaché et al., 2015]. vidual cells and estimate the tumor phy- 67 logeny of these cells, directly from single- 68 cell raw sequencing data. In a recent work 69 29 5.2 Challenge VIII: Integrating [Kozlov, 2018], a standard phylogenetic infer- 70

30 multiple types of variation ence tool RAxML-NG [Kozlov et al., 2019] 71 has been extended to handle single-cell SNV 72 31 into phylogenetic models data. In particular, this implements (i) a 10-s- 73 32 Naturally, downstream analyses—like charac- tate substitution model to represent all pos- 74 33 terizing intratumoral heterogeneity and in- sible unphased diploid genotypes and (ii) an 75 34 ferring its evolutionary history—suffer from explicit error model for allelic dropout and 76 35 the unreliable variant detection in single cells. genotyping/amplification errors. Initial ex- 77 36 However, the better the quality of the variant periments showed that—although a 10-state 78 37 calls becomes, the more important it becomes model incorporates more information—it out- 79 38 to model all types of available signal in math- 2The infinite sites assumption posits a genome with 39 ematical models of tumor evolution: from an infinite number of sites, thus rendering a re- 40 SNVs, over smaller insertions and deletions, peated mutational hit of the same genomic site 41 to large structural variation and CNVs (Fig- along a phylogeny impossible.

27 1 performed the ternary model (as used by vated transition probabilities for changes in 45 2 SiFit) only slightly and only in simulations copy numbers. Towards that goal, approaches 46 3 with very high error rates (10%-50%). How- to calculate the distance between whole copy 47 4 ever, further analysis suggests that benefits of number profiles [Zeira and Shamir, 2018] are 48 5 the genotype model become much more pro- a first step. But for them, a number of chal- 49 6 nounced with an increasing number of cells lenges remain, with several of the underlying 50 7 and, in particular, an increasing number of problems known to be NP-hard [Zeira and 51 8 SNVs (preliminary analysis by Kozlov). Shamir, 2018]. 52 9 While there are no tools yet available to Co-occurrence of all of the above variation 53 10 identify insertions and deletions from scDNA- types further complicates mathematical mod- 54 11 seq (see section 4.1), it is only a matter of time eling, as these events are not independent. 55 12 until such callers will become available. As For example, multiple SNVs that occurred in 56 13 they can already be identified from bulk se- the process of tumor evolution may disappear 57 14 quencing data, some precious efforts to incor- at once via a deletion of a large genomic re- 58 15 porate indels in addition to substitutions into gion. In addition, recent analyses revealed re- 59 16 classical phylogenetic models exist: A decade currence and loss of particular mutational hits 60 17 ago, a simple probabilistic model of indel evo- at specific sites in the life histories of tumors 61 18 lution was proposed [Rivas and Eddy, 2008]. [Kuipers et al., 2017]. This undermines the 62 19 But although some progress has been made validity of the so called infinite sites assump- 63 20 since then, such models are less tractable than tion, commonly made by phylogenetic mod- 64 21 the respective substitution models [Holmes, els. 65 22 2017]. 23 Incorporating CNVs in the reconstruction 5.2.2 Open problems 66 24 of tumor phylogeny can be helpful for un- 25 derstanding tumor progressions, as they rep- For phylogenetic reconstruction from SNVs, 67 26 resent one of the most common mutation we anticipate a shift towards leveraging im- 68 27 types associated to tumor hypermutability provements in input data quality as they are 69 28 [Kim et al., 2013]. CNVs in single cells were achieved through better amplification meth- 70 29 extensively studied in the context of tumor ods and SNV callers (see Box 2 and sec- 71 30 evolution and clonal dynamics [Navin et al., tion 4.1). For indels, variant callers for 72 31 2011, Eirew et al., 2015]. Reconstructing a scDNA-seq data are anticipated but remain 73 32 phylogeny with CNVs is not straightforward. to be developed (see section 4.1). Thus, in- 74 33 The challenges are not only related to ex- del modeling efforts for phylogenetic recon- 75 34 perimental limits, such as the complexity of struction from bulk sequencing data should 76 35 bulk sequencing data [Zaccaria et al., 2017] be adapted. For phylogenetic inference from 77 36 and amplification biases [Gawad et al., 2016], CNVs, the major challenges are (i) determin- 78 37 but also involve computational constraints. ing correct mutational profiles and (ii) com- 79 38 First of all, the causal mechanisms, such as puting realistic transition probabilities be- 80 39 breakage-fusion-bridge cycles [Bignell et al., tween those profiles. 81 40 2007] and chromosome missegregation [San- The final problem will be to incorporate all 82 41 taguida et al., 2017], can lead to overlapping of the above phenomena into a holistic model 83 42 copy number events [Schwarz et al., 2014]. of cancer evolution. However, this will sub- 84 43 Secondly, inferring a phylogeny with CNV stantially increase the computational cost of 85 44 data requires quantifying biologically moti- reconstructing the evolutionary history of tu- 86

28 1 mor cells. Thus, one needs to carefully de- mary tumor (seeding multiple locations with 42 2 termine which phenomena actually do mat- a genotype closer to the late-stage primary 43 3 ter (e.g., which parameters even affect the fi- tumor). Moreover, it is unknown whether a 44 4 nal tree topology) and which features can be single cell can seed a metastasis, or whether 45 5 measured and called (section 4.1) with suffi- the joint migration of a set of cells is required. 46 6 cient accuracy to actually improve modeling Here, sc-seq can provide invaluable resolution 47 7 results. As a consequence one might be able [Navin et al., 2011]. 48 8 to devise more lightweight models for answer- Although many mathematical models of 49 9 ing specific questions and invest considerable tumor evolution have been proposed [Bozic 50 10 effort into optimizing novel tools at the algo- et al., 2010, 2016, Altrock et al., 2015, Foo 51 11 rithmic and technical level (see section 5.3). et al., 2011, Michor et al., 2004, Williams 52 et al., 2016], fundamental parameters char- 53 acterizing the evolutionary processes remain 54 12 5.3 Challenge IX: Inferring elusive. To quantitatively describe the tumor 55 13 population genetic evolution process and evaluate different pos- 56 14 parameters of tumor sible modes against each other (e.g., modes of 57

15 heterogeneity by model metastatic seeding), we would like to estimate 58 fitness values of individual mutations and mu- 59 16 integration tation combinations, as well as rates of muta- 60 17 Tumor heterogeneity is the result of an evo- tion, cell birth and cell death—if possible, on 61 18 lutionary journey of tumor cell populations the level of subclones. These parameters de- 62 19 through both time and space [Swanton, 2012, termine the underlying fitness landscape of in- 63 20 McGranahan and Swanton, 2017]. Microen- dividual cells within their microenvironment, 64 21 vironmental factors like access to the vascular which in turn determines the evolutionary dy- 65 22 system and infiltration with immune cells dif- namics of cancer progression. 66 23 fer greatly—for regions within the original tu- 24 mor as well as between the main tumor and 5.3.1 Status 67 25 metastases, and across different time points 26 [Yang and Lin, 2017]. This imposes different Recent technological advances already allow 68 27 selective pressures on different tumor cells, for measuring the arrangement and relation- 69 28 driving the formation of tumor subclones and ships of tumor cells in space, with cell loca- 70 29 thus determining disease progression (includ- tion basically amounting to a second measure- 71 30 ing metastatic potential), patient outcome ment type requiring data integration within 72 31 and susceptibility to treatment (Junttila and a cell (approach +M1C in section 6.1, Fig- 73 32 de Sauvage [2013], Corredor et al. [2018] and ure 6 and Table 3). While in vivo imag- 74 33 Figure 4). However, even the basic questions ing techniques might also become interesting 75 34 about the resulting dynamics remain unan- for obtaining time series data in the future 76 35 swered [Turajlic and Swanton, 2016]. For ex- [Larue et al., 2017], the automated analysis 77 36 ample, it is unclear whether metastatic seed- of whole slide immunohistochemistry images 78 37 ing from the primary tumor occurs early and [Ghaznavi et al., 2013, Saco et al., 2016] seems 79 38 multiple times in parallel (with metastases the most promising in the context of cancer 80 39 diverging genetically from the primary tu- and mutational profiles from scDNA-seq. It 81 40 mor), or whether seeding of metastases occurs is already amenable to single-cell extraction 82 41 late, from a far-developed subclone in the pri- of characterized cells with known spatial con- 83

29 1 text and subsequent scDNA-seq. Using laser dict single-cell expression from microenviron- 45 2 capture microdissection [Datta et al., 2015] mental features [Goltsev et al., 2018, Battich 46 3 hundreds of single cells have recently been et al., 2015]. 47 4 isolated from tissue sections and analyzed for Regarding temporal resolution, it is already 48 5 copy number variation [Casasent et al., 2018]. common to sequence tumor material from dif- 49 6 For cell and tissue characterization in im- ferent timepoints: biopsies used for diagnosis, 50 7 munohistochemical images, machine learning resected tumors, lymph nodes and metastases 51 8 models are trained to segment the images and upon surgery and tumors after relapse. These 52 9 recognize structures within tissues and cells time-points already lend themselves to tem- 53 10 [Gurcan et al., 2009, Irshad et al., 2014, Ko- poral analyses of clonal dynamics using bulk 54 11 mura and Ishikawa, 2018]: They can, for ex- DNA sequencing data [Johnson et al., 2014], 55 12 ample, determine the densities and quanti- but scDNA-seq is required for a higher resolu- 56 13 ties of mitotic nuclei, vascular invasion, im- tion of subclonal genotypes. In addition, time 57 14 mune cell infiltration on the tissue level, as resolved measurements and resulting prolifer- 58 15 well as stained biomarkers on the level of the ation and death rates promise a higher ac- 59 16 individual cell. These are key parameters of curacy in detecting epistatic interactions in 60 17 the tumor microenvironment, characterizing cancer genomes than available from previous 61 18 the interaction of tumor cells with their envi- analyses of bulk sequenced tumor genomes 62 19 ronment in space [Yuan, 2016, Heindl et al., [Szczurek et al., 2013, Jerby-Arnon et al., 63 20 2015], that are key to mathematical mod- 2014, Matlak and Szczurek, 2017, Wilkins 64 21 els of cancer evolution. Development of re- et al., 2018]. 65 22 liable classifiers for immunohistochemical im- Eventually, population genetic methods 66 23 ages, however, is challenging due to scarcity of and models should be integrated with ap- 67 24 training data. Solutions such as active learn- proaches from phylogenetics, to also lever- 68 25 ing can speed up the training process and re- age the kinship relationships between cells. 69 26 duce the workload of annotating pathologists One prominent example of this recent trend— 70 27 [Rączkowski et al., 2019]. albeit on bulk data—is the use of the multi- 71 28 Classically, mathematical models of tu- species coalescent model for analyzing MSAs 72 29 mor population genetics have assumed well that contain several individuals for several 73 30 mixed populations, ignoring any spatial struc- populations [Rannala and Yang, 2017, Liu 74 31 ture, let alone evolutionary microenviron- et al., 2015]. This naturally translates into 75 32 ments. Recently, methods have been ex- analyzing tumor subclones as populations of 76 33 tended to account for some spatial structure single cells, capturing some of the population 77 34 and have already led to refined predictions of structure seen in cancers. Another recent ex- 78 35 the waiting time to cancer [Martens et al., ample, is a computational model for inference 79 36 2011] and intratumor heterogeneity [Waclaw of fitness landscapes of cancer clone popula- 80 37 et al., 2015]. In particular, spatial statistics tions using scDNA-seq data, SCIFIL [Skums 81 38 have been proposed for the quantitative sta- et al., 2019]. It estimates the maximum likeli- 82 39 tistical analysis of cancer digital pathology hood fitness of clone variants by fitting a repli- 83 40 imaging [Heindl et al., 2015], but the idea cator equation model onto a character-based 84 41 is applicable to other spatially resolved read- tumor phylogeny. 85 42 outs. Further, a number of methods were pro- For a comprehensive integration, key pa- 86 43 posed to model cell-cell interactions [Schapiro rameters will need to be quantified with 87 44 et al., 2017, Arnol et al., 2018] or to pre- higher resolution. For the detection of pos- 88

30 1 itive selection—for example important in the termine subclone-specific model parameters 43 2 discussion whether the evolution of tumors and their variability in more detail. For ex- 44 3 is driven by selection or neutral—a number ample, rates of proliferation, mutation and 45 4 of phylogenetic and population genetic ap- death could be obtained by measuring num- 46 5 proaches have been proposed in a bulk con- bers of mitotic and apoptotic cells per sub- 47 6 text. Phylogenetic trees may be used for clone or by integrating subclone abundance 48 7 detecting branches on which positive [Zhang profiles across time points. Good estimates 49 8 et al., 2005] or diversifying episodic selection of these basic parameters will greatly ben- 50 9 [Smith et al., 2015] is acting. efit the detection of positive and negative 51 10 In this setting, we will have to account selection in cancer, and improve the pre- 52 11 for heterotachy (e.g., see Kolaczkowski and diction of subclone resistance (and thus ex- 53 12 Thornton [2008]), that is, we cannot assume pected treatment success) from subclone fit- 54 13 a single model of substitution for the entire ness estimates. The fitness of individual sub- 55 14 tree, but have to allow different models to act clones could be calculated from comparing ex- 56 15 on distinct branches or subtrees/subclones. panded subclones in drug screens under differ- 57 16 Here, anything from a simple model of rate ent treatment regimes. 58 17 heterogeneity (e.g., Yang [1994]) to an empir- For some of the rates, for example subclone- 59 18 ical mixture model as used for protein evolu- specific rates of mutation, the integration of 60 19 tion [Le et al., 2012] could be considered. models from population genetics and phy- 61 logenetics holds promise and poses a gen- 62 uine SCDS challenge. But for all of these 63 20 5.3.2 Open problems rates, having better estimates implies follow- 64 21 With an increased resolution of scDNA-seq up challenges. 65 22 (section 4, Box 2) and more work on the One of these resulting challenges will be to 66 23 scDNA-seq challenges described in other sec- detect positive or diversifying selection with 67 24 tions, it will be possible to determine sub- greater resolution, building on approaches 68 25 clone genotypes in more detail. The first from the bulk context. Here, tests from the 69 26 challenge will be to integrate this with the area of “classic” phylogenetics might serve as 70 27 spatial location of single cells obtained from a starting point for exploring and adapting 71 28 other measurements. This will enable de- appropriate methods that will allow to asso- 72 29 termining whether cells from the same sub- ciate positive selection events to branches of 73 30 clones are co-located, whether metastases the tumor tree or specific evolutionary events. 74 31 are founded recurrently by the same sub- Evolutionary pressures are often quantified by 75 32 clone(s) and whether individual metastases the dN/dS ratio of non-synonymous and syn- 76 33 are founded by individual or multiple sub- onymous substitutions. In application to tu- 77 34 clones. Studies utilizing multiple region sam- mor cell populations, however, this ratio may 78 35 ples from the same tumor and from distant not be applicable, as it has been shown to be 79 36 metastases already paved the way in inves- relatively insensitive when applied to popula- 80 37 tigating these questions [e.g., Turajlic and tions within the same species [Kryazhimskiy 81 38 Swanton, 2016]. Still, only single-cell spatial and Plotkin, 2008]. Other measures have been 82 39 resolution will allow identification of specific proposed as better suited for detecting selec- 83 40 individual genotypes in specific locations and tion within populations based on time-series 84 41 drawing precise conclusions. data [Neher et al., 2014, Gray et al., 2011, 85 42 In addition, it will become possible to de- Steinbrück and McHardy, 2011] and could po- 86

31 1 tentially be transferred to tumor cell popula- 6 Overarching challenges 45 2 tions. 3 A particular problem with the detection of 6.1 Challenge X: Integration of 46 4 positive or diversifying selection is, to which single-cell data: across 47 5 extent the above tests will be sensitive to samples, experiments and 48 6 errors in cancer data—the tests are already 7 known to produce high false positive rates in types of measurement 49 8 the classic phylogenetic setting when the er- 50 9 ror rate in the input data is too high [Fletcher Biological processes are complex and dy- 51 10 and Yang, 2010]. Computationally intense so- namic, varying across cells and organisms. To 52 11 lutions for decreasing the high false positive comprehensively analyze such processes, dif- 53 12 rate have been proposed [Redelings, 2014], ferent types of measurements from multiple 54 13 but they might not scale to single-cell cancer experiments need to be obtained and inte- 55 14 datasets. grated. Depending on the actual research 56 15 Another resulting problem will be to adapt question, such experiments can be different 57 16 models for the detection of epistatic interac- time points, tissues or organisms. For their 58 17 tions to single-cell data. As some of these integration, we need flexible but rigorous sta- 59 18 epistatic interactions can be hard to spot in tistical and computational frameworks. Fig- 60 19 bulk sequencing data (they may simply dis- ure 6 and Table 3 provide an overview of 61 20 appear because of a low frequency), time- the promises and challenges of creating such 62 21 resolved scDNA-seq might be the only way frameworks, that we outline here in terms of 3 63 22 to spot them. If integrated across individuals six approaches of data integration . All of 64 23 and cells (see section 6.1), it will be possible these approaches are affected by the issues 65 24 to identify pairs or even larger combinations that influence single-cell data analysis in gen- 66 25 of mutations that often occur simultaneously eral, namely: (i) the varying resolution levels 67 26 in the same genome, and combinations that that are of interest depending on the research 68 27 rarely or never do. That is, cells affected by question at hand (section 2.1); (ii) the un- 69 28 negatively selected or synthetic lethal muta- certainty of any measurements and how to 70 29 tions will go extinct in the tumor population quantify them for and during the analyses 71 30 and thus their genotype with the synthetic (section 2.2) and (iii) the scaling of single-cell 72 31 lethal mutations occurring together will not methodology to more cells and more features 73 32 be observed. At the same time, cell death measured at once (section 2.3). All of these 74 33 can be the result of mere chance, so to de- further compound the most important chal- 75 34 tect significant negative pressures, large co- lenge in the integration of single-cell data: to 76 35 horts of repeated time resolved experiments link data from different sources in a way that 77 36 would have to be performed, resulting in an is biologically meaningful and supports the in- 78 37 even larger data integration challenge (see tended analysis. The maps that describe how 79 38 section 6.1). data from different sources is linked will in- 80 39 A final step will then be to integrate all crease in complexity on increasing amounts 40 these parameters with further information 3Graph representation in Figure 6 approaches 41 about local microenvironments (such as vas- +X+S and +all taken from Wolf et al. [2019], 42 cular invasion and immune cell infiltration), Fig. 3, provided under Creative Commons 43 to estimate the selection potential of such lo- Attribution 4.0 International License (http:// 44 cal factors for or against different subclones. creativecommons.org/licenses/by/4.0/)

32 1S +M1C

loca�on chroma�n accessibility ABCDEFG ... A B C DE F G ... x y z gene methylamuta�onal�on profile p1p2p3p4p5 ...... gene expression ...... ABCDEFG ... ? ... ABCDEFG ...... ? ...... ? ...... ? ...... ? ...... ? ......

+S +M+C clonal ABCDEFG ... scDNA-seq profiles scRNA-seq ... ABCDEFG ... 3 5 ABCDEFG ...... 3 1 ......

+X+S +all

ABCDEFG ... ABCDEFG ... p1p2p3p4p5 ...... Figure 6: Approaches for integrating single-cell measurement datasets across measurement types, samples and experiments, as also described in Table 3. 1S: Clustering of cells from one sample from one experiment requires no data integration. +S: Integration of one measurement type across samples requires the linking of cell popula- tions / clusters. +X+S: Integration of one measurement type across experiments conducted in separate laboratories, requires stable reference systems like cell atlases (compare Figure 1). +M1C: Integration of multiple measurement types obtained from the same cell highlights the problem of data sparsity of all available measurement types and the dependency of measure- ment types that needs to be accounted for. +M+C: Integration of different measurement types from different cells of the same cell population requires special care in matching cells through meaningful profiles. +all: One possibility for easing data integration across measurement types from separate cells would be to have a stable reference (cell atlas) across multiple mea- surement types, capturing different cell states, cell populations and organisms. Effectively, this combines the challenges and promises of the approaches +X+S, +M1C and +M+C.

33 Integration example MT example AMs Promises Challenges combination 1S none scDNA-seq clustering / discover new sub- technical noise ↓; unsupervised clones, cell types or data sparsity ↓ cell states +S within 1 MT, scRNA-seq differential analy- identify effects across batch effects ↓; within 1 exp, ses, time series, sample groups, time validate cell type assign- across > 1 smps spatial sampling and space ments ↓ +X+S within 1 MT, merFISH map cells to stable accelerate analyses; standards across experi- across > 1 exp, reference (cell at- increase sample size; mental centers across > 1 smps las) generalize observa- tions +M1C across > 1 MTs, scM&T-seq MOFA, holistic view of cell scaling cell throughput; within 1 exp, (scRNA-seq + DIABLO, state; MT combinations lim- within 1 cell methylome) MINT quantify dependency ited; of MTs dependency of MTs ↓ +M+C across > 1 MTs, scDNA-seq + Cardelino, use existing datasets validate cell/data within 1 exp, scRNA-seq Clonealign, (faster than +M1C); matching; across > 1 cells, MATCHER flexible experimental test assumptions for within 1 cell pop design integrating data +all across > 1 MTs, hypothetical hypothetical holistic view of biolog- all from approaches across > 1 exps, (any combina- (map cells to ical systems +X+S, +M1C and +M+C across > 1 smps, tion) multi-omic HCA, within cells single-cell TCGA)

Table 3: Approaches for data integration, highlighting their promises and challenges. The labelling corresponds to Figure 6. For each approach, one (combination of) measurement type(s) that is available is given, but more exist and several are discussed in the text. As example analysis methods, actual tool names are given where few tools exist to date; otherwise broader categories or imaginable methodologies are described. Abbreviations: ↓– same challenge also applies to all approaches below; AM – analysis method; exp(s) – experiment(s); HCA – human cell atlas; MT – measurement type; smps – samples; TCGA – The Cancer Genome Atlas 1 of samples, time points and types of measure- of intracellular biological processes, as well as 45 2 ments. their mutual dependencies, so as to draw a 46 3 In the simplest setup, we obtain one mea- comprehensive picture of a single cell. Here, 47 4 surement type from multiple cells of a sin- an optimal setup is to collect several types 48 5 gle sample, to identify subpopulations of cells of measurements from each cell at once; for 49 6 (e.g., subclones or cell types). As any anal- example, both scDNA-seq and scRNA-seq 50 7 ysis of sc-seq data, it needs to take into ac- captured from the same cell, possibly fur- 51 8 count the data’s sparsity (see section 3.1 and ther augmented by measurements of chro- 52 9 section 4.1; approach 1S in Figure 6 and Ta- matin accessibility, gene methylation, pro- 53 10 ble 3). teins or metabolites (approach +M1C). The 54 11 When aiming at identifying patterns of dif- most prominent challenge for this setup 55 12 ferential expression or characterizing variabil- is to model inherent dependencies between 56 13 ity across organisms, individuals, or locations, measurement types wherever phenomena are 57 14 the same measurement type (for example, concurrent (e.g., measuring CNV through 58 15 only scRNA-seq) is taken from multiple sam- scDNA-seq at the same time as obtaining 59 16 ples from different time points, different loca- scRNA-seq, with CNV impacting transcrip- 60 17 tions (e.g., different tissues or sites in a tu- tion levels). 61 18 mor), or different organisms (approach +S). However, co-measuring different types of 62 19 Any such combination of samples requires ac- quantities in the same cell can be experimen- 63 20 counting for batch effects among those sam- tally challenging or even just impossible at 64 21 ples and calls for a validation cell type assign- this point in time. An exit strategy to this 65 22 ments across samples. problem is to analyze a population of cells 66 23 Such batch effects are further aggravated that is homogeneous in terms of some cell type 67 24 when integrating across multiple experiments, or state, taking different measurement types 68 25 possibly run in different experimental centers in different single cells (approach +M+C). After 69 26 with similar but distinct setups (approach collecting different measurement types in dif- 70 27 +X+S). But standardizing experimental pro- ferent single cells, one needs to combine the 71 28 cedures and statistically accounting for batch data in a way that is biologically meaningful. 72 29 effects will be well worth the effort wherever An example is to group cells based on com- 73 30 this enables a significant increase in sample monalities in their genotype profile (Figure 6), 74 31 size, so as to generalize (and statistically cor- having become evident only after the applica- 75 32 roborate) observations. Nevertheless, even if tion of a scDNA-seq experiment. This will 76 33 standards have been successfully established require careful validation of the assumptions 77 34 and known batches accounted for, additional made when matching cells via such a group- 78 35 validation of, for example, assignments of cells ing, possibly including functional validation 79 36 to types and states may be required. Even- of group differences. 80 37 tually, an increase in generality will support Finally, the most comprehensive goal will 81 38 the construction of reference systems, such as be a holistic view of the complexity of 82 39 a cell atlas, the existence of which can sup- (intra-)cellular circuits, and charting their 83 40 port decisive speed-ups when classifying cells variability across time, tissues, populations 84 41 or cell states in subsequent experiments (see and organisms (approach +all). Mapping 85 42 section 3.3). cellular circuits in this comprehensive man- 86 43 Yet another scenario manifests when try- ner requires integrating complementary and 87 44 ing to unravel complexity and coordination possibly interdependent measurements in sin- 88

35 1 gle cells and across multiple single cells from scRNA-seq [Angermueller et al., 2016], all 43 2 diverse samples. of scRNA-seq, scDNA-seq, methylation and 44 chromatin accessibility data [Clark et al., 45 46 3 6.1.1 Status 2018], or targeted queries on a cell’s geno- type, expression (scRNA-seq) and methyla- 47 4 For unsupervised clustering (approach 1S in tion status (sc-GEM, Cheow et al. [2016]). 48 5 Figure 6 and Table 3), method development For these single-cell specific approaches, bulk 49 6 is a well-established field. Remaining chal- approaches that address the integration of 50 7 lenges have already been identified system- data from different types of experiments have 51 8 atically, see Duò et al. [2018], Freytag et al. the potential to be adapted to single-cell spe- 52 9 [2018], Kiselev et al. [2019]. cific noise characteristics (MOFA, Argelaguet 53 10 For integrating datasets across samples in et al. [2018], DIABLO, Singh et al. [2018], 54 11 one experiment (approach +S), a few ap- mixOmics, Rohart et al. [2017b] and MINT, 55 12 proaches are available. See for example MNN Rohart et al. [2017a]). 56 13 [Haghverdi et al., 2018], and the methodolo- For integrating across multiple mea- 57 14 gies included in the Seurat package [Satija surement types from separate cells (ap- 58 15 et al., 2015, Butler et al., 2018b, Stuart et al., proach +M+C), all of which stem from a 59 16 2018]. For the challenges and promises refer- population of cells that is homogeneous 60 17 ring to the integration of sc-seq data that vary with respect to some selection criterion, 61 18 in terms of spatial and temporal origin, see technologies such as 10X genomics [Zheng 62 19 the discussions in section 3.5 and section 5.3. et al., 2017] for scRNA-seq and direct library 63 20 For integrating datasets across experiments preparation (DLP, Zahn et al. [2017b]) for 64 21 (approach +X+S), mapping cells to reference scDNA-seq establish a scalable experimental 65 22 datasets such as the Human Cell Atlas [Regev basis. The greater analytical challenge is 66 23 et al., 2017] is currently emerging as the most to identify subpopulations that had so far 67 24 promising strategy. We refer the reader to remained invisible, and whose identification 68 25 more particular and detailed discussions in is crucial so as to not combine different types 69 26 section 3.3. While applicable reference sys- of data in mistaken ways. An example for 70 27 tems are not (fully) available, assembling cell this is the identification of distinct cancer 71 28 type clusters from different experiments is a clones from cells sampled from seemingly 72 29 reasonable strategy, as implemented by sev- homogeneous tumor tissue. Here, only 73 30 eral recently published tools [Zhang et al., performing scDNA-seq experiments can 74 31 2018, Barkas et al., 2018, Gao et al., 2018, definitively reveal the clonal structure of 75 32 Kiselev et al., 2018, Park et al., 2018, Wag- a tumor. If one wishes to correctly link 76 33 ner and Yanai, 2018, Boufea et al., 2019, Jo- mutation with transcription profiles, ignoring 77 34 hansen and Quon, 2019, Johnson et al., 2019]. the clonal structure of a tumor could be 78 35 Integrating across multiple measurement misleading. Several analytical methods that 79 36 types from the same cell (approach +M1C) has address this problem have recently emerged: 80 37 become necessary (and possible) with the ad- (i) clonealign [Campbell et al., 2019] assumes 81 38 vent of experimental protocols that enable a copy-number dosage effect on transcription 82 39 the collection of such data [Macaulay et al., to assign gene expression states to clones; 83 40 2017]. Such protocols combine scDNA-seq (ii) cardelino [McCarthy et al., 2018] aligns 84 41 and scRNA-seq (Dey et al. [2015], Macaulay clone-specific SNVs in scRNA-seq to those 85 42 et al. [2016, 2017]), methylation data and inferred from bulk exome data in order 86

36 1 to infer clone-specific expression patterns; for this in +M+C exists (clonealign, Campbell 43 2 (iii) MATCHER [Welch et al., 2017] uses et al. [2019]) and could be extended to +M1C 44 3 manifold alignment to combine scM&T-seq datasets. For approach +M+C, the possibil- 45 4 [Angermueller et al., 2016] with sc-GEM ity to integrate data from single cells with 46 5 [Cheow et al., 2016], leveraging the common data from bulk sequencing of the same cell 47 6 set of loci. All of these methods are based population also holds promise; for example 48 7 on biologically meaningful assumptions on by using bulk genotypes for imputation of 49 8 how to summarize data measurements across sites with no sequencing coverage in single 50 9 different measurement types and samples, cells. Finally, knowing how to link (differ- 51 10 despite their different physical origin. ent) measurement types acquired from differ- 52 ent cells is essential for building reference sys- 53 tems across experiments, such as cell atlases 54 11 6.1.2 Open problems (see also approaches +X+S and +all, and sec- 55 12 Experimental technologies that enable taking tion 3.3). Thus, exploring further combina- 56 13 multiple measurement types in the same cell tions of measurement types and their mea- 57 14 (approach +M1C in Figure 6 and Table 3) are surement linkage in +M+C datasets remains as 58 15 on the rise and will allow to assay more cells a central SCDS challenge. 59 16 at higher fidelity and reduced cost. While No matter which combinations of measure- 60 17 this type of data naturally links measurement ment types become available—the amounts of 61 18 types within single cells, the SCDS challenge material underlying most measurements will 62 19 is to account for dependencies among those remain tiny, limited by the amounts within a 63 20 measurement types for any obtainable com- single cell as well as by a limited number of 64 21 binations of them. As a prominent example cells available from a particular cell popula- 65 22 consider how gene expression increases with tion. This means that one overarching theme 66 23 higher genomic copy number, a phenomenon will persist: analyses like training models or 67 24 known as measurement linkage [Loper et al., mapping quantities on one another will suf- 68 25 2019], which has not been addressed for dif- fer from missing entire views—samples, time 69 26 ferent measurement types taken in the same points, or measurement types. Thus, inte- 70 27 cell. Statistical models for leveraging those grating data across experiments and differ- 71 28 measurement type combinations thus pose ent measurement types will further compound 72 29 formidable SCDS challenges. the challenge of missing data that we already 73 30 While progress on the approach +M1C may discussed for non-integrative approaches (see 74 31 gradually render approach +M+C obsolete, section 3.1 and section 4.1). 75 32 +M+C will remain the easier—or the only 33 feasible—approach for many measurement 34 type combinations for a while. At the same 6.2 Challenge XI: Validating and 76 35 time, any advances in characterizing depen- benchmarking analysis tools 77 36 dencies between different measurement types for single-cell measurements 78 37 acquired from separate cells (+M+C) provide 38 further ground work for linking them when With the advances in sc-seq and other single- 79 39 acquired from the same cell (+M1C). Take cell technologies, more and more analysis 80 40 the example from above, where copy num- tools become available for researchers, and 81 41 ber profiles will impact gene expression mea- even more are being developed and will be 82 42 surements. Here, an approach that accounts published in the near future. Thus, the need 83

37 1 for datasets and methods that support sys- erated under different assumptions. However, 45 2 tematic benchmarking and evaluation of these development of reliable simulation tools re- 46 3 tools is becoming increasingly pressing. To be quires design and implementation of models 47 4 useful and reliable, algorithms and pipelines that capture the essence of underlying bio- 48 5 should be able to pass the following quality logical processes and technological details of 49 6 control tests: (i) They should produce the ex- single-cell technologies and high-throughput 50 7 pected results (e.g., reconstruct phylogenies, sequencing platforms, establishing single-cell 51 8 estimate differential expressions or cluster the data simulation as a methodologically in- 52 9 data) of high quality and outperform exist- volved challenge. 53 10 ing methods, if such methods exist. (ii) They 11 should be robust to high levels of sequencing 6.2.1 Status 54 12 noise and technological biases, including PCR 13 bias, allele dropout and chimeric signals. In Recent studies [Soneson and Robinson, 2018, 55 14 addition, benchmarking should be conducted Saelens et al., 2019, Abdelaal et al., 2019, 56 15 in a systematic way, following established rec- Crowell et al., 2019, Vieth et al., 2019] show 57 16 ommendations [Mangul et al., 2019, Weber that systematic benchmarking of different 58 17 et al., 2019]. single-cell analysis methodologies has begun. 59 18 Evaluation of tool performance requires However, to the best of our knowledge, there 60 19 benchmarking datasets with known ground is still a shortage of single-cell data simula- 61 20 truth. Such data should include cell pop- tion tools, for all the possible use cases. Many 62 21 ulations with known genomic compositions single-cell data analysis packages include their 63 22 and population structures, in other words own ad hoc data simulators [Vallejos et al., 64 23 where frequencies of clones and alleles are 2015, Korthauer et al., 2016a, Lun et al., 2016, 65 24 known. Currently, such datasets are scarce— Lun and Marioni, 2017, Jahn et al., 2016, 66 25 with some notable exceptions [Grün et al., Satas and Raphael, 2018, Rizzetto et al., 67 26 2014, Tian et al., 2019]—because generating 2017, Köster et al., 2019a, Crowell et al., 68 27 them in genuine laboratory settings is time- 2019]. However, these simulators are usu- 69 28 , labor- and cost-intensive. Experimental ally not available as separate tools or even as 70 29 benchmark datasets for evolutionary analysis a source code, tailored to specific problems 71 30 of single-cell populations are even harder to studied in corresponding papers and some- 72 31 obtain, as they require follow-up samples with times not comprehensively documented, thus 73 32 known information about evolutionary trajec- limiting their utility for the broad research 74 33 tories and developmental times. With lack of community. Furthermore, since such simu- 75 34 time-resolved measurements, only anecdotal lators are used only as auxiliary subroutines 76 35 evidence exists on, for instance, how the ac- inside particular projects and are not pub- 77 36 curacy of phylogenetic inferences is affected lished as stand-alone tools, they themselves 78 37 by data quality. Availability of such gold- are usually not guaranteed to be evaluated, 79 38 standard datasets would benefit single-cell ge- and therefore the accuracy of their reflec- 80 39 nomics research enormously. tion of real biological and technological pro- 81 40 Due to aforementioned difficulties, the most cesses can remain unclear. There are few 82 41 affordable sources of benchmarking and vali- exceptions known to us, including the tools 83 42 dation data are in silico simulations. Sim- Splatter [Zappia et al., 2017], powsimR [Vi- 84 43 ulations provide ground truth test examples eth et al., 2017], and SymSim [Zhang et al., 85 44 that can be rapidly and cost-effectively gen- 2019d], which provide frameworks for simula- 86

38 1 tion of scRNA-seq data and whose accuracy larly complicated, since many analysis tools 43 2 has been validated by comparison of its re- deal with heterogeneous clone populations, 44 3 sults with real data. For single-cell phyloge- which possess multiple biological character- 45 4 nomics, cancer genome evolution simulators istics to be inferred and analyzed. Develop- 46 5 are being designed [Semeraro et al., 2018, Xia ment of a single measure that captures several 47 6 et al., 2018, Meng and Chen, 2018]. of these characteristics is complicated, and in 48 many cases impossible. For example, valida- 49 50 7 6.2.2 Open problems tion of tools for imputation of cellular and transcriptional heterogeneity should simulta- 51 8 Current simulation tools mostly concentrate neously evaluate two measures: (i) how close 52 9 on differential expression analysis, while com- are the reconstructed and true cellular ge- 53 10 prehensive simulation methods for other im- nomic profiles and (ii) how close are recon- 54 11 portant aspects of sc-seq analysis are still to structed and true SNV/haplotype frequency 55 12 be developed. In particular, to the best of distributions. Development of synthetic mea- 56 13 our knowledge, no such tool is available for sures that capture several such characteristics 57 14 scDNA-seq data. (e.g., based on utilization of earth mover’s dis- 58 15 With single-cell phylogenomics, one would tance [Knyazev et al., 2018]) is highly impor- 59 16 like to assess the accuracy of methods for tant. 60 17 phylogenetic inference and subclone identifi- When simulating datasets in general, the 61 18 cation, or the power of population genetics circularity of simulating and inferring pa- 62 19 methods for estimating parameters of interest rameters under the same—possibly simplis- 63 20 (e.g., tests for selection and epistatic interac- tic model—should be critically assessed, as 64 21 tions in cancer, see section 5.3). To this end, should potential biases. Thus, further eval- 65 22 realistic and comprehensive (w.r.t. the evolu- uation on empirical datasets for which some 66 23 tionary phenomena) simulation tools are re- ground truth is known will be invaluable. Ide- 67 24 quired. ally, all single-cell analysis fields should define 68 25 Another interesting computational problem a standard set of benchmark datasets that 69 26 is the development of tools for validation of will allow for assessing and comparing meth- 70 27 simulated sc-seq datasets themselves by their ods or come up with a regular data analysis 71 28 comparison with real data using a compre- challenge. This approach has been very suc- 72 29 hensive set of biological parameters. The first cessful, for example in protein structure pre- 73 4 5 30 such tool for scRNA-seq data is countsimQC diction and metagenomic analyses . A first 74 31 [Soneson and Robinson, 2017], but similar step in this direction was the recent single-cell 75 6 32 tools for scDNA-seq data are needed. Finally, transcriptomics DREAM challenge . 76 33 most of the simulators concentrate on model- Finally, drawing on all the exemplary 77 34 ing of biologically meaningful data, while ig- benchmarking studies mentioned above, it 78 35 noring or simplifying models for sc-seq errors would be immensely beneficial to bring all 79 36 and artifacts. the required efforts together in a community- 80 37 Another important challenge in single-cell supported benchmarking platform: (i) simu- 81 38 analysis tool validation is the selection of com- lating datasets and validating that they cap- 82 39 prehensive evaluation metrics, which should 4http://predictioncenter.org/ 40 be used for comparison of different analysis 5https://data.cami-challenge.org 41 results with each other and with the ground 6https://www.synapse.org/#!Synapse: 42 truth. For single-cell data, it is particu- syn15665609/wiki/582909

39 1 ture important characteristics of real data; the National Science Foundation (NSF: DBI- 39 2 (ii) curating ground-truths for real datasets; 1564899, CCF-1619110) and the National 40 3 (iii) agreeing on comprehensive evaluation Institutes of Health (NIH: 1R01EB025022- 41 4 metrics. Ideally, such a benchmarking frame- 01). BdB was supported by the Oncode In- 42 5 work would remain dynamic beyond an initial stitute (220-H72009 – KWF/2016-1/10158) 43 6 publication—to allow ongoing comparison of BED was supported by the Netherlands Or- 44 7 methods as new approaches are proposed and ganisation for Scientific Research (NWO: 45 8 to easily extend it to entirely new fields of Vidi grant 864.14.004). CAV was supported 46 9 method development. by the University of Edinburgh (Chancel- 47 lor´s Fellowship) and by The Alan Tur- 48 ing Institute (EPSRC grant EP/N510129/1). 49 10 7 Acknowledgements DJM was supported by the National Health 50 and Medical Research Council of Australia 51 11 We are deeply grateful to the Lorentz Cen- (GNT1112681 and GNT1162829). DL was 52 12 ter for hosting the workshop “Single Cell Data supported by Deutsche Krebshilfe funding 53 13 Science: Making Sense of Data from Billions for the national Network Genomic Medicine 54 14 of Single Cells” (4–8 June 2018). In par- (nNGM) Lung Cancer. GC was supported 55 15 ticular, we would like to thank the Lorentz by Marie Skłodowska-Curie grant (agree- 56 16 Center staff, who turned organizing and at- ment No 642691, Epipredict). IIM was 57 17 tending the workshop into a great pleasure. supported by the NSF (Award 1564936). 58 18 For a week, the authors of this review came JCM was supported by core funding from 59 19 together—researchers from the fields of statis- the European Molecular Biology Labora- 60 20 tics and medicine, computer science and biol- tory (EMBL) and core support from Can- 61 21 ogy, and any combinations thereof. In inter- cer Research UK (CRUK: C9545/A29580) 62 22 active workshop sessions, we brought together JdR was supported by the NWO (Vidi 63 23 our knowledge of single-cell analyses, ranging grant 639.072.715). JK was supported by 64 24 from the wet-lab to the server cluster, from the NWO (Veni grant 016.173.076). KJ 65 25 statistical models to algorithms, from can- was supported by SystemsX.ch (RTD Grant 66 26 cer biology to evolutionary genetics. During 2013/150). KRC was funded by postdoc- 67 27 these sessions, we formulated an initial set of toral fellowships from the Canadian Institutes 68 28 challenges that was further systematized and of Health Research, the Canadian Statistical 69 29 refined in the following months, and substan- Sciences Institute (CANSSI), and the UBC 70 30 tiated with extensive literature research of the Data Science Institute. LP was supported 71 31 respective state-of-the-art for this review. by the NIH – National Human Genome Re- 72 search Institute (NHGRI) Career Develop- 73 ment Award (R00HG008399), the Genomic 74 32 8 Funding Innovator Award (R35HG010717) and by 75 the Chan Zuckerberg Initiative (CZI) Donor- 76 33 AC was supported by an IAS Fellowship Advised Fund (DAF) (2018-182734), an ad- 77 34 for external researchers at the University of vised fund of the Silicon Valley Commu- 78 35 Amsterdam. ACM was supported by the nity Foundation. MB was supported by 79 36 Helmholtz Incubator (Sparse2Big ZT-I-0007). the NWO (Vidi grant 639.072.309 and Vidi 80 37 AMK and ASt were supported by the Klaus grant 864.14.004). MDR was supported 81 38 Tschira Foundation. AZ was supported by by the Swiss National Science Foundation 82

40 1 (310030 175841, CRSII5 177208) and the CZI References 36 2 DAF (2018-182828) and acknowledges sup- 3 port from the University Research Priority Tamim Abdelaal, Lieke Michielsen, Davy 37 4 Program Evolution in Action at the Univer- Cats, Dylan Hoogduin, Hailiang Mei, Mar- 38 5 sity of Zurich. PS was supported by the NIH cel J. T. Reinders, and Ahmed Mahfouz. A 39 6 (1R01EB025022). SCH was supported by the comparison of automatic cell identification 40 7 NIH – NHGRI (R00HG009007) and by the methods for single-cell RNA sequencing 41 8 CZI DAF (182891, 193161, 356-01). THK was data. Genome Biology, 20(1):194, Septem- 42 9 supported by the Award of a Research Fellow- ber 2019. ISSN 1474-760X. doi: 10. 43 10 ships for the Promotion of Scientific Cooper- 1186/s13059-019-1795-z. URL https:// 44 11 ation at the Helmholtz-Centre for Infection doi.org/10.1186/s13059-019-1795-z. 45 12 Research. Andre J. Aberer, Kassian Kobert, and 46 Alexandros Stamatakis. ExaBayes: Mas- 47 sively Parallel Bayesian Tree Inference 48 for the Whole-Genome Era. Molec- 49 13 9 Conflicts of interest ular Biology and Evolution, 31(10): 50 2553–2556, October 2014. ISSN 0737- 51 14 IIM is a co-founder and holds an interest in 4038. doi: 10.1093/molbev/msu236. 52 15 SmplBio LLC, a company developing cloud- URL https://academic.oup.com/mbe/ 53 16 based scRNA-Seq analysis software. No prod- article/31/10/2553/1016562. 54 17 ucts, services, or technologies of SmplBio have 55 18 been evaluated or tested in this work. JdR Ahmet Acar, Daniel Nichol, Javier 56 19 is co-founder of Cyclomics BV. All other au- Fernandez-Mateos, George D. Cress- 57 20 thors declare no conflicts of interest. well, Iros Barozzi, Sung Pil Hong, Inmaculada Spiteri, Mark Stubbs, Rose- 58 mary Burke, Adam Stewart, Georgios 59 Vlachogiannis, Carlo C. Maley, Luca 60 Magnani, Nicola Valeri, Udai Banerji, and 61 21 10 Contributions Andrea Sottoriva. Exploiting evolution- 62 ary herding to control drug resistance 63 22 DL, JK, ES, KRC, DJM, SCH, MDR, CAV, in cancer. bioRxiv, page 566950, March 64 23 NB, AM, LP, PS, ASt, CSOA, AMK, THK, 2019. doi: 10.1101/566950. URL 65 24 IIM, ACM, and ASch authored or reviewed https://www.biorxiv.org/content/10. 66 25 substantial parts of the paper. DL, JK, ES, 1101/566950v1. 67 26 KRC, DJM, SCH, MDR, CAV, NB, LP, PS, 27 CSOA, TJL, FM, and ASch prepared figures Sumon Ahmed, Magnus Rattray, and 68 28 and/or tables. JK, ACM, BJR, SPS, and Alexis Boukouvalas. GrandPrix: scaling 69 29 ASch organized and coordinated the work- up the Bayesian GPLVM for single- 70 30 shop. All authors actively participated in the cell data. Bioinformatics, 35(1):47– 71 31 discussions underlying this review, which took 54, January 2019. ISSN 1367-4803. 72 32 place in working groups at the Lorentz work- doi: 10.1093/bioinformatics/bty533. 73 33 shop "Single Cell Data Science: Making Sense URL https://academic.oup.com/ 74 34 of Data from Billions of Single Cells". All au- bioinformatics/article/35/1/47/ 75 35 thors approved the final manuscript. 5047752. 76

41 1 Philipp M. Altrock, Lin L. Liu, and Franziska Tallulah S. Andrews and Martin Hem- 42 2 Michor. The mathematics of cancer: in- berg. False signals induced by single-cell 43 3 tegrating quantitative models. Nature Re- imputation. F1000Research, 7:1740, 44 4 views Cancer, 15(12):730–745, December March 2019. ISSN 2046-1402. doi: 45 5 2015. ISSN 1474-1768. doi: 10.1038/ 10.12688/f1000research.16613.2. URL 46 6 nrc4029. URL https://www.nature.com/ https://f1000research.com/articles/ 47 7 articles/nrc4029. 7-1740/v2. 48

49 8 Robert A. Amezquita, Vince J. Carey, Lind- Michael Angelo, Sean C. Bendall, Rachel 50 9 say N. Carpp, Ludwig Geistlinger, Aaron Finck, Matthew B. Hale, Chuck Hitzman, 51 10 T. L. Lun, Federico Marini, Kevin Rue- Alexander D. Borowsky, Richard M. Lev- 52 11 Albrecht, Davide Risso, Charlotte Sone- enson, John B. Lowe, Scot D. Liu, Shuchun 53 12 son, Levi Waldron, Hervé Pagès, Mike Zhao, Yasodha Natkunam, and Garry P. 54 13 Smith, Wolfgang Huber, Martin Mor- Nolan. Multiplexed ion beam imaging of Nature Medicine 55 14 gan, Raphael Gottardo, and Stephanie C. human breast tumors. , 20 56 15 Hicks. Orchestrating Single-Cell Anal- (4):436–442, April 2014. ISSN 1546-170X. https://www. 57 16 ysis with Bioconductor. bioRxiv, page doi: 10.1038/nm.3488. URL nature.com/articles/nm.3488 58 17 590562, March 2019. doi: 10.1101/ . 18 590562. URL https://www.biorxiv.org/ Christof Angermueller, Stephen J. Clark, 59 19 content/10.1101/590562v1. Heather J. Lee, Iain C. Macaulay, Mabel J. 60 Teng, Tim Xiaoming Hu, Felix Krueger, 61 20 Matthew Amodio, David van Dijk, Krish- Sebastien Smallwood, Chris P. Ponting, 62 21 nan Srinivasan, William S. Chen, Hus- Thierry Voet, Gavin Kelsey, Oliver Ste- 63 22 sein Mohsen, Kevin R. Moon, Allison gle, and Wolf Reik. Parallel single-cell se- 64 23 Campbell, Yujiao Zhao, Xiaomei Wang, quencing links transcriptional and epige- 65 24 Manjunatha Venkataswamy, Anita Desai, netic heterogeneity. Nature Methods, 13(3): 66 25 V. Ravi, Priti Kumar, Ruth Montgomery, 229–232, March 2016. ISSN 1548-7105. doi: 67 26 Guy Wolf, and Smita Krishnaswamy. Ex- 10.1038/nmeth.3728. 68 27 ploring Single-Cell Data with Deep Mul- 28 bioRxiv titasking Neural Networks. , page Christof Angermueller, Heather J. Lee, 69 29 237065, January 2019. doi: 10.1101/ Wolf Reik, and Oliver Stegle. Deep- 70 30 https://www.biorxiv.org/ 237065. URL CpG: accurate prediction of single-cell 71 31 content/10.1101/237065v4 . DNA methylation states using deep learn- 72 ing. Genome Biology, 18(1):67, April 73 32 Benedict Anchang, Tom D. P. Hart, Sean C. 2017. ISSN 1474-760X. doi: 10.1186/ 74 33 Bendall, Peng Qiu, Zach Bjornson, Michael s13059-017-1189-z. URL https://doi. 75 34 Linderman, Garry P. Nolan, and Sylvia K. org/10.1186/s13059-017-1189-z. 76 35 Plevritis. Visualization and cellular 36 hierarchy inference of single-cell data Ricard Argelaguet, Britta Velten, Damien 77 37 using SPADE. Nature Protocols, 11(7): Arnol, Sascha Dietrich, Thorsten Zenz, 78 38 1264–1279, July 2016. ISSN 1754-2189. John C. Marioni, Florian Buettner, Wolf- 79 39 doi: 10.1038/nprot.2016.066. URL http: gang Huber, and Oliver Stegle. Multi- 80 40 //www.nature.com/nprot/journal/v11/ Omics Factor Analysis—a framework for 81 41 n7/full/nprot.2016.066.html. unsupervised integration of multi-omics 82

42 1 data sets. Molecular Systems Biology, thesis, University of Maryland, 2017. 41 2 14(6):e8124, June 2018. ISSN 1744- URL http://drum.lib.umd.edu/handle/ 42 3 4292, 1744-4292. doi: 10.15252/msb. 1903/19951. 43 4 20178124. URL http://msb.embopress. 44 5 org/content/14/6/e8124. Elham Azizi, Sandhya Prabhakaran, Ambrose Carr, and Dana Pe’er. Bayesian Infer- 45 6 C Arisdakessian, O Poirion, B Yunits, ence for Single-cell Clustering and Imput- 46 7 X Zhu, and L Garmire. DeepIm- ing. Genomics and Computational Biol- 47 8 pute: an accurate, fast and scalable ogy, 3(1):46, January 2017. ISSN 2365- 48 9 deep neural network method to im- 7154. doi: 10.18547/gcb.2017.vol3.iss1.e46. 49 10 pute single-cell RNA-Seq data. bioRxiv, URL https://genomicscomputbiol.org/ 50 11 2018. URL https://www.biorxiv.org/ ojs/index.php/GCB/article/view/46. 51 12 content/10.1101/353607v1.abstract. Rhonda Bacher and Christina Kendziorski. 52 13 Nona Arneson, Simon Hughes, Richard Design and computational analysis 53 14 Houlston, and Susan Done. Whole- of single-cell RNA-sequencing ex- 54 15 Genome Amplification by Improved periments. Genome Biology, 17(1): 55 16 Primer Extension Preamplification PCR 63, April 2016. ISSN 1474-760X. 56 17 (I-PEP-PCR). Cold Spring Harbor doi: 10.1186/s13059-016-0927-y. 57 18 Protocols, 2008(1):pdb.prot4921, Jan- URL https://doi.org/10.1186/ 58 19 uary 2008. ISSN 1940-3402, 1559- s13059-016-0927-y. 59 20 6095. doi: 10.1101/pdb.prot4921. URL Md Bahadur Badsha, Rui Li, Boxiang Liu, 60 21 http://cshprotocols.cshlp.org/ Yang I. Li, Min Xian, Nicholas E. Banovich, 61 22 content/2008/1/pdb.prot4921. and Audrey Qiuyan Fu. Imputation of 62 23 Damien Arnol, Denis Schapiro, Bernd Bo- single-cell gene expression with an au- 63 24 denmiller, Julio Saez-Rodriguez, and Oliver toencoder neural network. bioRxiv, page 64 25 Stegle. Modelling cell-cell interactions 504977, December 2018. doi: 10.1101/ 65 26 from spatial molecular data with spatial 504977. URL https://www.biorxiv.org/ 66 27 variance component analysis. bioRxiv, content/10.1101/504977v1. 67 28 page 265256, March 2018. doi: 10.1101/ Bjorn Bakker, Aaron Taudt, Mirjam E. 68 29 265256. URL https://www.biorxiv.org/ Belderbos, David Porubsky, Diana C. J. 69 30 content/10.1101/265256v3. Spierings, Tristan V. de Jong, Nancy 70

31 Eirini Arvaniti and Manfred Claassen. Sen- Halsema, Hinke G. Kazemier, Karina 71 32 sitive detection of rare disease-associated Hoekstra-Wakker, Allan Bradley, Eve- 72 33 cell subsets via representation learning. line S. J. M. de Bont, Anke van den 73 34 Nature Communications, 8(1):1–10, April Berg, Victor Guryev, Peter M. Lans- 74 35 2017. ISSN 2041-1723. doi: 10. dorp, Maria Colomé-Tatché, and Floris Foi- 75 36 1038/ncomms14825. URL https://www. jer. Single-cell sequencing reveals kary- 76 37 nature.com/articles/ncomms14825. otype heterogeneity in murine and human 77 malignancies. Genome Biology, 17:115, 78 38 Daniel L. Ayres. Research And Applica- 2016. ISSN 1474-760X. doi: 10.1186/ 79 39 tion Of Parallel Computing Algorithms For s13059-016-0971-7. URL http://dx.doi. 80 40 Statistical Phylogenetic Inference. PhD org/10.1186/s13059-016-0971-7. 81

43 1 Nikolas Barkas, Viktor Petukhov, Daria Niko- 29 DNA polymerase. symmetrical mode of 42 2 laeva, Yaroslav Lozinsky, Samuel Demhar- DNA replication. J. Biol. Chem., 264(15): 43 3 ter, Konstantin Khodosevich, and Peter V. 8935–8940, May 1989. 44 4 Kharchenko. Wiring together large single- 5 cell RNA-seq sample collections. bioRxiv, Craig L. Bohrson, Alison R. Barton, 45 6 page 460246, November 2018. doi: 10.1101/ Michael A. Lodato, Rachel E. Rodin, 46 7 460246. URL https://www.biorxiv.org/ Lovelace J. Luquette, Vinay V. Viswanad- 47 8 content/10.1101/460246v1. ham, Doga C. Gulhan, Isidro Cortés- 48 Ciriano, Maxwell A. Sherman, Min- 49 9 Nico Battich, Thomas Stoeger, and Lucas seok Kwon, Michael E. Coulter, Alon 50 10 Pelkmans. Control of transcript variabil- Galor, Christopher A. Walsh, and 51 11 Cell ity in single mammalian cells. , 163(7): Peter J. Park. Linked-read analysis 52 12 1596–1610, December 2015. identifies mutations in single-cell DNA- 53 sequencing data. Nature Genetics, 54 13 Benedikt Bauer, Reiner Siebert, and Arne page 1, March 2019. ISSN 1546-1718. 55 14 Traulsen. Cancer initiation with epistatic doi: 10.1038/s41588-019-0366-2. URL 56 15 interactions between driver and passenger https://www.nature.com/articles/ 57 16 mutations. J. Theor. Biol., 358:52–60, Oc- s41588-019-0366-2. 58 17 tober 2014.

18 Niko Beerenwinkel, Tibor Antal, David Katerina Boufea, Sohan Seth, and Nizar N. 59 19 Dingli, Arne Traulsen, Kenneth W Kinzler, Batada. scID: Identification of equivalent 60 20 Victor E Velculescu, , and transcriptional cell populations across 61 21 Martin A Nowak. Genetic progression and single cell RNA-seq data using discrim- 62 bioRxiv 63 22 the waiting time to cancer. PLoS Comput. inant analysis. , page 470203, 23 Biol., 3(11):e225, November 2007. January 2019. doi: 10.1101/470203. URL 64 https://www.biorxiv.org/content/10. 65 24 Graham R. Bignell, Thomas Santarius, Jes- 1101/470203v2. 66 25 sica C.M. Pole, Adam P. Butler, Janet 26 Perry, Erin Pleasance, Chris Greenman, Ivana Bozic, Tibor Antal, Hisashi Ohtsuki, 67 27 Andrew Menzies, Sheila Taylor, Sarah Hannah Carter, Dewey Kim, Sining Chen, 68 28 Edkins, Peter Campbell, Michael Quail, Rachel Karchin, Kenneth W Kinzler, Bert 69 29 Bob Plumb, Lucy Matthews, Kirsten Vogelstein, and Martin A Nowak. Accu- 70 30 McLay, Paul A.W. Edwards, Jane Rogers, mulation of driver and passenger muta- 71 31 Richard Wooster, P. Andrew Futreal, tions during tumor progression. Proc. Natl. 72 32 and Michael R. Stratton. Architec- Acad. Sci. U. S. A., 107(43):18545–18550, 73 33 tures of somatic genomic rearrangement October 2010. 74 34 in human cancer amplicons at sequence- 35 level resolution. Genome Research, 17 Ivana Bozic, Jeffrey M Gerold, and Mar- 75 36 (9):1296–1303, 2007. doi: 10.1101/gr. tin A Nowak. Quantifying clonal and sub- 76 37 6522707. URL http://genome.cshlp. clonal passenger mutations in cancer evolu- 77 PLoS Comput. Biol. 78 38 org/content/17/9/1296.abstract. tion. , 12(2):e1004731, February 2016. 79 39 L Blanco, A Bernad, J M Lázaro, G Martín, 40 C Garmendia, and M Salas. Highly ef- James A. Briggs, Caleb Weinreb, Daniel E. 80 41 ficient DNA synthesis by the phage phi Wagner, Sean Megason, Leonid Peshkin, 81

44 1 Marc W. Kirschner, and Allon M. Klein. scape of Human Hematopoietic Differen- 42 2 The dynamics of gene expression in ver- tiation. Cell, 173(6):1535–1548.e16, 2018. 43 3 tebrate embryogenesis at single-cell res- ISSN 1097-4172. doi: 10.1016/j.cell.2018. 44 4 olution. Science, 360(6392):eaar5780, 03.074. 45 5 June 2018. ISSN 0036-8075, 1095- Florian Buettner, Naruemon Pratanwanich, 46 6 9203. doi: 10.1126/science.aar5780. Davis J McCarthy, John C Marioni, and 47 7 URL http://science.sciencemag.org/ Oliver Stegle. f-scLVM: scalable and ver- 48 8 content/360/6392/eaar5780. satile factor analysis for single-cell RNA- 49 9 Jane Bromley, James W. Bentz, Léon Bot- seq. Genome biology, 18(1):212, Novem- 50 10 tou, Isabelle Guyon, Yann Lecun, Cliff ber 2017. ISSN 1465-6906. doi: 10.1186/ 51 11 Moore, Eduard Säckinger, and Roopak s13059-017-1334-8. URL http://dx.doi. 52 12 Shah. Signature verification using a org/10.1186/s13059-017-1334-8. 53 13 “siamese” time delay neural network. Andrew Butler, Paul Hoffman, Peter Smib- 54 14 International Journal of Pattern Recog- ert, Efthymia Papalexi, and Rahul Satija. 55 15 nition and Artificial Intelligence, 07(04): Integrating single-cell transcriptomic data 56 16 669–688, August 1993. ISSN 0218-0014. across different conditions, technologies, 57 17 doi: 10.1142/S0218001493000339. URL and species. Nat. Biotechnol., 36(5):411– 58 18 https://www.worldscientific.com/ 420, June 2018a. 59 19 doi/10.1142/S0218001493000339. Andrew Butler, Paul Hoffman, Peter Smib- 60 20 Robert V Bruggner, Bernd Bodenmiller, ert, Efthymia Papalexi, and Rahul Satija. 61 21 David L Dill, Robert J Tibshirani, and Integrating single-cell transcriptomic data 62 22 Garry P Nolan. Automated identification across different conditions, technologies, 63 23 of stratifying signatures in cellular subpop- and species. Nature Biotechnology, 36(5): 64 24 ulations. Proc. Natl. Acad. Sci. U. S. A., 411–420, May 2018b. ISSN 1546-1696. 65 25 111(26):E2770–7, July 2014. doi: 10.1038/nbt.4096. URL https:// 66 www.nature.com/articles/nbt.4096. 67 26 Jason D. Buenrostro, Beijing Wu, Ulrike M. 27 Litzenburger, Dave Ruff, Michael L. Christiane Bäumer, Evelyn Fisch, Holger 68 28 Gonzales, Michael P. Snyder, Howard Y. Wedler, Frank Reinecke, and Chris- 69 29 Chang, and William J. Greenleaf. Single- tian Korfhage. Exploring DNA qual- 70 30 cell chromatin accessibility reveals princi- ity of single cells for genome analysis 71 31 ples of regulatory variation. Nature, 523 with simultaneous whole-genome am- 72 32 (7561):486–490, July 2015. ISSN 1476- plification. Scientific Reports, 8(1): 73 33 4687. doi: 10.1038/nature14590. URL 1–10, May 2018. ISSN 2045-2322. 74 34 https://www.nature.com/articles/ doi: 10.1038/s41598-018-25895-7. URL 75 35 nature14590. https://www.nature.com/articles/ 76 s41598-018-25895-7. 77 36 Jason D. Buenrostro, M. Ryan Corces, 37 Caleb A. Lareau, Beijing Wu, Alicia N. Kieran R. Campbell and Christopher 78 38 Schep, Martin J. Aryee, Ravindra Ma- Yau. Order Under Uncertainty: Robust 79 39 jeti, Howard Y. Chang, and William J. Differential Expression Analysis Using 80 40 Greenleaf. Integrated Single-Cell Analy- Probabilistic Models for Pseudotime In- 81 41 sis Maps the Continuous Regulatory Land- ference. PLOS Computational Biology, 12 82

45 1 (11):e1005212, November 2016. ISSN 1553- Pliner, Andrew J. Hill, Riza M. Daza, 42 2 7358. doi: 10.1371/journal.pcbi.1005212. Jose L. McFaline-Figueroa, Jonathan S. 43 3 URL https://journals.plos.org/ Packer, Lena Christiansen, Frank J. 44 4 ploscompbiol/article?id=10.1371/ Steemers, Andrew C. Adey, Cole Trap- 45 5 journal.pcbi.1005212. nell, and Jay Shendure. Joint profil- 46 ing of chromatin accessibility and gene 47 6 Kieran R. Campbell and Christopher Yau. expression in thousands of single cells. 48 7 Uncovering pseudotemporal trajectories Science, 361(6409):1380–1385, Septem- 49 8 with covariates from single cell and bulk ber 2018. ISSN 0036-8075, 1095- 50 9 Nature Communications expression data. , 9203. doi: 10.1126/science.aau0730. 51 10 9(1):2442, June 2018. ISSN 2041-1723. URL https://science.sciencemag.org/ 52 11 doi: 10.1038/s41467-018-04696-6. URL content/361/6409/1380. 53 12 https://www.nature.com/articles/ 13 s41467-018-04696-6. Junyue Cao, Malte Spielmann, Xiaojie Qiu, 54 Xingfan Huang, Daniel M. Ibrahim, An- 55 14 Kieran R. Campbell, Adi Steif, Emma Laks, drew J. Hill, Fan Zhang, Stefan Mundlos, 56 15 Hans Zahn, Daniel Lai, Andrew McPher- Lena Christiansen, Frank J. Steemers, Cole 57 16 son, Hossein Farahani, Farhia Kabeer, Trapnell, and Jay Shendure. The single- 58 17 Ciara O’Flanagan, Justina Biele, Jazmine cell transcriptional landscape of mam- 59 18 Brimhall, Beixi Wang, Pascale Walters, malian organogenesis. Nature, 566(7745): 60 19 IMAXT Consortium, Alexandre Bouchard- 496, February 2019a. ISSN 1476-4687. 61 20 Côté, Samuel Aparicio, and Sohrab P. doi: 10.1038/s41586-019-0969-x. URL 62 21 Shah. clonealign: statistical integra- https://www.nature.com/articles/ 63 22 tion of independent single-cell RNA and s41586-019-0969-x. 64 23 DNA sequencing data from human can- 24 cers. Genome Biology, 20(1):54, March Zhi-Jie Cao, Lin Wei, Shen Lu, De-Chang 65 25 2019. ISSN 1474-760X. doi: 10.1186/ Yang, and Ge Gao. Cell BLAST: Search- 66 26 s13059-019-1645-z. URL https://doi. ing large-scale scRNA-seq database via 67 27 org/10.1186/s13059-019-1645-z. unbiased cell embedding. bioRxiv, page 68 587360, March 2019b. doi: 10.1101/ 69 28 Junyue Cao, Jonathan S. Packer, Vijay Ra- 587360. URL https://www.biorxiv.org/ 70 29 mani, Darren A. Cusanovich, Chau Huynh, content/10.1101/587360v1. 71 30 Riza Daza, Xiaojie Qiu, Choli Lee, Scott N. 31 Furlan, Frank J. Steemers, Andrew Adey, Anna K Casasent, Aislyn Schalck, Ruli 72 32 Robert H. Waterston, Cole Trapnell, and Gao, Emi Sei, Annalyssa Long, William 73 33 Jay Shendure. Comprehensive single- Pangburn, Tod Casasent, Funda Meric- 74 34 cell transcriptional profiling of a multi- Bernstam, Mary E Edgerton, and 75 35 cellular organism. Science, 357(6352): Nicholas E Navin. Multiclonal invasion in 76 36 661–667, August 2017. ISSN 0036-8075, breast tumors identified by topographic 77 37 1095-9203. doi: 10.1126/science.aam8940. single cell sequencing. Cell, 172(1-2): 78 38 URL http://science.sciencemag.org/ 205–217.e12, January 2018. 79 39 content/357/6352/661. Chong Chen, Changjing Wu, Linjie Wu, 80 40 Junyue Cao, Darren A. Cusanovich, Vijay Yishu Wang, Minghua Deng, and Ruibin 81 41 Ramani, Delasa Aghamirzaie, Hannah A. Xi. scRMD: Imputation for single cell 82

46 1 RNA-seq data via robust matrix decom- William F. Burkholder. Single-cell multi- 41 2 position. bioRxiv, page 459404, November modal profiling reveals cellular epigenetic 42 3 2018. doi: 10.1101/459404. URL heterogeneity. Nature Methods, 13(10):833– 43 4 https://www.biorxiv.org/content/10. 836, October 2016. ISSN 1548-7105. doi: 44 5 1101/459404v2. 10.1038/nmeth.3961. URL https://www. 45 nature.com/articles/nmeth.3961. 46 6 Chongyi Chen, Dong Xing, Longzhi Tan,

7 Heng Li, Guangyu Zhou, Lei Huang, and Cariad Chester and Holden T Maecker. Algo- 47 8 X Sunney Xie. Single-cell whole-genome rithmic tools for mining High-Dimensional 48 9 analyses by linear amplification via trans- cytometry data. J. Immunol., 195(3):773– 49 10 Science poson insertion (LIANTI). , 356 779, August 2015. 50 11 (6334):189–194, April 2017.

Simone Ciccolella, Mauricio Soto Gomez, 51 12 Huidong Chen, Luca Albergante, Murray Patterson, Gianluca Della Ve- 52 13 Jonathan Y. Hsu, Caleb A. Lareau, dova, Iman Hajirasouliha, and Paola 53 14 Giosuè Lo Bosco, Jihong Guan, Shuigeng Bonizzoni. Inferring Cancer Progres- 54 15 Zhou, Alexander N. Gorban, Daniel E. sion from Single-cell Sequencing while Al- 55 16 Bauer, Martin J. Aryee, David M. Lan- lowing Mutation Losses. bioRxiv, page 56 17 genau, Andrei Zinovyev, Jason D. Buen- 268243, April 2018. doi: 10.1101/ 57 18 rostro, Guo-Cheng Yuan, and Luca Pinello. 268243. URL https://www.biorxiv.org/ 58 19 Single-cell trajectories reconstruction, ex- content/10.1101/268243v2. 59 20 ploration and mapping of omics data with 21 STREAM. Nature Communications, 10 Stephen J. Clark, Ricard Argelaguet, 60 22 (1):1903, April 2019. ISSN 2041-1723. Chantriolnt-Andreas Kapourani, 61 23 doi: 10.1038/s41467-019-09670-4. URL Thomas M. Stubbs, Heather J. Lee, Celia 62 24 https://www.nature.com/articles/ Alda-Catalinas, Felix Krueger, Guido San- 63 25 s41467-019-09670-4. guinetti, Gavin Kelsey, John C. Marioni, 64 Oliver Stegle, and Wolf Reik. scNMT-seq 65 26 Kok Hao Chen, Alistair N Boettiger, Jef- enables joint profiling of chromatin accessi- 66 27 frey R Moffitt, Siyuan Wang, and Xiaowei bility DNA methylation and transcription 67 28 Zhuang. RNA imaging. spatially resolved, in single cells. Nature Communications, 9 68 29 highly multiplexed RNA profiling in sin- (1):781, February 2018. ISSN 2041-1723. 69 30 gle cells. Science, 348(6233):aaa6090, April doi: 10.1038/s41467-018-03149-4. URL 70 31 2015. https://www.nature.com/articles/ 71

32 Mengjie Chen and Xiang Zhou. VIPER: s41467-018-03149-4. 72 33 variability-preserving imputation for accu- 34 rate gene expression recovery in single-cell Simone Codeluppi, Lars E. Borm, Amit 73 35 RNA sequencing studies. Genome Biol., 19 Zeisel, Gioele La Manno, Josina A. van 74 36 (1):196, November 2018. Lunteren, Camilla I. Svensson, and 75 Sten Linnarsson. Spatial organization 76 37 Lih Feng Cheow, Elise T. Courtois, Yuliana of the somatosensory cortex revealed 77 38 Tan, Ramya Viswanathan, Qiaorui Xing, by osmFISH. Nature Methods, 15(11): 78 39 Rui Zhen Tan, Daniel S. W. Tan, Paul Rob- 932–935, November 2018. ISSN 1548-7105. 79 40 son, Yuin-Han Loh, Stephen R. Quake, and doi: 10.1038/s41592-018-0175-z. URL 80

47 1 https://www.nature.com/articles/ Darren A. Cusanovich, James P. Redding- 41 2 s41592-018-0175-z. ton, David A. Garfield, Riza M. Daza, De- 42 lasa Aghamirzaie, Raquel Marco-Ferreres, 43 3 Germán Corredor, Xiangxue Wang, Yu Zhou, Hannah A. Pliner, Lena Christiansen, Xi- 44 4 Cheng Lu, Pingfu Fu, Konstantinos Syri- aojie Qiu, Frank J. Steemers, Cole Trap- 45 5 gos, David L Rimm, Michael Yang, Ed- nell, Jay Shendure, and Eileen E. M. Fur- 46 6 uardo Romero, Kurt A Schalper, Vam- long. The cis-regulatory dynamics of em- 47 7 sidhar Velcheti, and Anant Madabhushi. bryonic development at single-cell resolu- 48 8 Spatial architecture and arrangement of tion. Nature, 555(7697):538–542, March 49 9 Tumor-Infiltrating lymphocytes for pre- 2018. ISSN 1476-4687. doi: 10.1038/ 50 10 dicting likelihood of recurrence in Early- nature25981. URL https://www.nature. 51 11 Clin. Stage Non-Small cell lung cancer. com/articles/nature25981. 52 12 Cancer Res., September 2018. Sayantan Das, Gonçalo R Abecasis, and 53 13 Alexandra Cretu and Peter C Brooks. Im- Brian L Browning. Genotype Impu- 54 14 pact of the non-cellular tumor microenvi- tation from Large Reference Panels. 55 15 ronment on metastasis: potential thera- Annual review of genomics and hu- 56 16 peutic and imaging opportunities. J. Cell. man genetics, 19:73–96, August 2018a. 57 17 Physiol., 213(2):391–402, November 2007. ISSN 1527-8204, 1545-293X. doi: 58 10.1146/annurev-genom-083117-021602. 59 18 Nicola Crosetto, Magda Bienko, and Alexan- URL http://dx.doi.org/10.1146/ 60 19 der van Oudenaarden. Spatially resolved annurev-genom-083117-021602. 61 20 transcriptomics and beyond. Nat. Rev. 21 Genet., 16(1):57–66, January 2015. Sayantan Das, Gonçalo R. Abecasis, and 62 Brian L. Browning. Genotype Impu- 63 22 Helena L. Crowell, Charlotte Soneson, Pierre- tation from Large Reference Panels. 64 23 Luc Germain, Daniela Calini, Ludovic Annual Review of Genomics and Hu- 65 24 Collin, Catarina Raposo, Dheeraj Malho- man Genetics, 19(1):73–96, August 66 25 tra, and Mark D. Robinson. On the dis- 2018b. ISSN 1527-8204, 1545-293X. doi: 67 26 covery of population-specific state tran- 10.1146/annurev-genom-083117-021602. 68 27 sitions from multi-sample multi-condition URL https://www. 69 28 bioRxiv single-cell RNA sequencing data. , annualreviews.org/doi/10.1146/ 70 29 page 713412, July 2019. doi: 10.1101/ annurev-genom-083117-021602. 71 30 713412. URL https://www.biorxiv.org/ 31 content/10.1101/713412v1. Soma Datta, Lavina Malhotra, Ryan Dicker- 72 son, Scott Chaffee, Chandan K Sen, and 73 32 Darren A. Cusanovich, Riza Daza, Andrew Sashwati Roy. Laser capture microdissec- 74 33 Adey, Hannah A. Pliner, Lena Chris- tion: Big data from small samples. Histol. 75 34 tiansen, Kevin L. Gunderson, Frank J. Histopathol., 30(11):1255–1269, November 76 35 Steemers, Cole Trapnell, and Jay Shen- 2015. 77 36 dure. Multiplex single cell profiling of chro- 37 matin accessibility by combinatorial cellu- Alexander Davis, Ruli Gao, and Nicholas 78 38 lar indexing. Science (New York, N.Y.), Navin. Tumor evolution: Linear, branch- 79 39 348(6237):910–914, May 2015. ISSN 1095- ing, neutral or punctuated? Biochim. Bio- 80 40 9203. doi: 10.1126/science.aab1601. phys. Acta, 1867(2):151–161, April 2017. 81

48 1 Carl G. de Boer and . BROCK- der van Oudenaarden. Integrated genome 42 2 MAN: deciphering variance in epige- and transcriptome sequencing of the same 43 3 nomic regulators by k-mer factoriza- cell. Nature Biotechnology, 33(3):285–289, 44 4 tion. BMC Bioinformatics, 19(1):253, July March 2015. ISSN 1546-1696. doi: 10.1038/ 45 5 2018. ISSN 1471-2105. doi: 10.1186/ nbt.3129. URL https://www.nature. 46 6 s12859-018-2255-6. URL https://doi. com/articles/nbt.3129. 47 7 org/10.1186/s12859-018-2255-6. David van Dijk, Roshan Sharma, Juozas 48 8 Charles F A de Bourcy, Iwijn De Vlam- Nainys, Kristina Yim, Pooja Kathail, 49 9 inck, Jad N Kanbar, Jianbin Wang, Charles Ambrose J. Carr, Cassandra Burdziak, 50 10 Gawad, and Stephen R Quake. A quantita- Kevin R. Moon, Christine L. Chaf- 51 11 tive comparison of single-cell whole genome fer, Diwakar Pattabiraman, Brian Bierie, 52 12 PLoS One amplification methods. , 9(8): Linas Mazutis, Guy Wolf, Smita Krish- 53 13 e105585, August 2014. naswamy, and Dana Pe’er. Recovering 54 Gene Interactions from Single-Cell Data 55 14 Frank B Dean, Seiyu Hosono, Linhua Fang, Using Data Diffusion. Cell, 174(3):716– 56 15 Xiaohong Wu, A Fawad Faruqi, Patricia 729.e27, July 2018. ISSN 0092-8674, 57 16 Bray-Ward, Zhenyu Sun, Qiuling Zong, 1097-4172. doi: 10.1016/j.cell.2018.05. 58 17 Yuefen Du, Jing Du, Mark Driscoll, Wan- 061. URL https://www.cell.com/cell/ 59 18 min Song, Stephen F Kingsmore, Michael abstract/S0092-8674(18)30724-4. 60 19 Egholm, and Roger S Lasken. Compre- 20 hensive human genome amplification using Jiarui Ding, Anne Condon, and Sohrab P 61 21 multiple displacement amplification. Proc. Shah. Interpretable dimensionality reduc- 62 22 Natl. Acad. Sci. U. S. A., 99(8):5261–5266, tion of single cell transcriptome data with 63 23 April 2002. deep generative models. Nat. Commun., 9 64

24 Yue Deng, Feng Bao, Qionghai Dai, Lani F (1):2002, May 2018. 65 25 Wu, and Steven J Altschuler. Scalable anal- Xiao Dong, Lei Zhang, Brandon Milholland, 66 26 ysis of cell-type composition from single- Moonsook Lee, Alexander Y. Maslov, Tao 67 27 cell transcriptomics using deep recurrent Wang, and Jan Vijg. Accurate iden- 68 28 learning. Nature methods, March 2019. tification of single-nucleotide variants in 69 29 ISSN 1548-7091, 1548-7105. doi: 10.1038/ whole-genome-amplified single cells. Na- 70 30 s41592-019-0353-7. URL https://doi. ture Methods, 14(5):491–493, May 2017. 71 31 org/10.1038/s41592-019-0353-7. ISSN 1548-7105. doi: 10.1038/nmeth. 72 32 Erica AK DePasquale, Kyle Ferchen, Stuart 4227. URL https://www.nature.com/ 73 33 Hay, H. Leighton Grimes, and Nathan articles/nmeth.4227. 74 34 Salomonis. cellHarmony: Cell-level 35 matching and comparison of single-cell Angelo Duò, Mark D Robinson, and Char- 75 36 transcriptomes. bioRxiv, page 412080, lotte Soneson. A systematic performance 76 37 January 2019. doi: 10.1101/412080. URL evaluation of clustering methods for single- 77 38 https://www.biorxiv.org/content/10. cell RNA-seq data. F1000Res., 7, July 78 39 1101/412080v4. 2018. 79

40 Siddharth S. Dey, Lennart Kester, Bastiaan G Durif, L Modolo, J E Mold, S Lambert- 80 41 Spanjaard, Magda Bienko, and Alexan- Lacroix, and F Picard. Probabilistic 81

49 1 Count Matrix Factorization for Single URL https://academic.oup.com/ 43 2 Cell Expression Data Analysis. Bioin- bioinformatics/article/34/17/i671/ 44 3 formatics, March 2019. ISSN 1367-4803, 5093218. 45 4 1367-4811. doi: 10.1093/bioinformatics/ 46 5 btz177. URL http://dx.doi.org/10. Nils Eling, Arianne C. Richard, Sylvia 47 6 1093/bioinformatics/btz177. Richardson, John C. Marioni, and Catalina A. Vallejos. Correcting the 48 7 Daniel Edsgärd, Per Johnsson, and Rickard Mean-Variance Dependency for Differen- 49 8 Sandberg. Identification of spatial expres- tial Variability Testing Using Single-Cell 50 9 sion trends in single-cell gene expression RNA Sequencing Data. Cell Systems, 7(3): 51 10 data. Nat. Methods, 15(5):339–342, May 284–294.e12, September 2018. ISSN 2405- 52 11 2018. 4712. doi: 10.1016/j.cels.2018.06.011. URL 53 https://www.cell.com/cell-systems/ 54 12 Peter Eirew, Adi Steif, Jaswinder Khattra, abstract/S2405-4712(18)30278-3. 55 13 Gavin Ha, Damian Yap, Hossein Fara- 14 hani, Karen Gelmon, Stephen Chia, Colin Chee-Huat Linus Eng, Michael Lawson, 56 15 Mar, Adrian Wan, Emma Laks, Justina Qian Zhu, Ruben Dries, Noushin Koulena, 57 16 Biele, Karey Shumansky, Jamie Rosner, Yodai Takei, Jina Yun, Christopher 58 17 Andrew McPherson, Cydney Nielsen, Cronin, Christoph Karp, Guo-Cheng 59 18 Andrew J. L. Roth, Calvin Lefebvre, Ali Yuan, and Long Cai. Transcriptome- 60 19 Bashashati, Camila de Souza, Celia Siu, scale super-resolved imaging in tissues by 61 20 Radhouane Aniba, Jazmine Brimhall, RNA seqFISH+. Nature, 568(7751): 62 21 Arusha Oloumi, Tomo Osako, Alejandra 235, April 2019. ISSN 1476-4687. 63 22 Bruna, Jose L. Sandoval, Teresa Algara, doi: 10.1038/s41586-019-1049-y. URL 64 23 Wendy Greenwood, Kaston Leung, Hong- https://www.nature.com/articles/ 65 24 wei Cheng, Hui Xue, Yuzhuo Wang, s41586-019-1049-y. 66 25 Dong Lin, Andrew J. Mungall, Richard 26 Moore, Yongjun Zhao, Julie Lorette, Long Gökcen Eraslan, Lukas M. Simon, Maria 67 27 Nguyen, David Huntsman, Connie J. Mircea, Nikola S. Mueller, and Fabian J. 68 28 Eaves, Carl Hansen, Marco A. Marra, Car- Theis. Single-cell RNA-seq denois- 69 29 los Caldas, Sohrab P. Shah, and Samuel ing using a deep count autoencoder. 70 30 Aparicio. Dynamics of genomic clones Nature Communications, 10(1):390, 71 31 in breast cancer patient xenografts at January 2019. ISSN 2041-1723. doi: 72 32 single-cell resolution. Nature, 518(7539): 10.1038/s41467-018-07931-2. URL 73 33 422–426, February 2015. ISSN 0028-0836. https://www.nature.com/articles/ 74 34 doi: 10.1038/nature13952. URL http: s41467-018-07931-2. 75 35 //www.nature.com/nature/journal/ Nuria Estévez-Gómez, Tamara Prieto, Amy 76 36 v518/n7539/full/nature13952.html. Guillaumet-Adkins, Holger Heyn, Sonia 77 37 Mohammed El-Kebir. SPhyR: tumor Prado-López, and David Posada. Com- 78 38 phylogeny estimation from single-cell parison of single-cell whole-genome am- 79 39 sequencing data under loss and er- plification strategies. bioRxiv, page 80 40 ror. Bioinformatics, 34(17):i671–i679, 443754, October 2018. doi: 10.1101/ 81 41 September 2018. ISSN 1367-4803. 443754. URL https://www.biorxiv.org/ 82 42 doi: 10.1093/bioinformatics/bty589. content/10.1101/443754v1. 83

50 1 Jean Fan, Hae-Ock Lee, Soohyun Lee, Da- May 2018. ISSN 0036-8075, 1095- 42 2 Eun Ryu, Semin Lee, Catherine Xue, 9203. doi: 10.1126/science.aaq1736. 43 3 Seok Jin Kim, Kihyun Kim, Nikolaos URL http://science.sciencemag.org/ 44 4 Barkas, Peter J Park, Woong-Yang Park, content/360/6391/eaaq1736. 45 5 and Peter V Kharchenko. Linking tran- William Fletcher and Ziheng Yang. The ef- 46 6 scriptional and genetic tumor heterogeneity fect of insertions, deletions, and alignment 47 7 through allele analysis of single-cell RNA- errors on the branch-site test of positive se- 48 8 seq data. Genome Res., 28(8):1217–1227, lection. Mol. Biol. Evol., 27(10):2257–2267, 49 9 August 2018. October 2010. 50 10 Jeffrey A. Farrell, Yiqun Wang, Saman- Jasmine Foo, Kevin Leder, and Franziska Mi- 51 11 tha J. Riesenfeld, Karthik Shekhar, chor. Stochastic dynamics of cancer initi- 52 12 Aviv Regev, and Alexander F. Schier. ation. Phys. Biol., 8(1):015002, February 53 13 Single-cell reconstruction of develop- 2011. 54 14 mental trajectories during zebrafish 15 embryogenesis. Science, 360(6392): Joshua M. Francis, Cheng-Zhong Zhang, 55 16 eaar3131, June 2018. ISSN 0036-8075, Cecile L. Maire, Joonil Jung, Veronica E. 56 17 1095-9203. doi: 10.1126/science.aar3131. Manzo, Viktor A. Adalsteinsson, Heather 57 18 URL http://science.sciencemag.org/ Homer, Sam Haidar, Brendan Blumenstiel, 58 19 content/360/6392/eaar3131. Chandra Sekhar Pedamallu, Azra H. 59 Ligon, J. Christopher Love, Matthew 60 20 Joseph Felsenstein. Evolutionary trees from Meyerson, and Keith L. Ligon. EGFR 61 21 DNA sequences: A maximum likelihood Variant Heterogeneity in Glioblastoma 62 22 approach. J. Mol. Evol., 17(6):368–376, Resolved through Single-Nucleus Sequenc- 63 23 1981. ing. Cancer Discovery, 4(8):956–971, 64 August 2014. ISSN 2159-8274, 2159-8290. 65 24 Greg Finak, Andrew McDavid, Masanao doi: 10.1158/2159-8290.CD-13-0879. 66 25 Yajima, Jingyuan Deng, Vivian Gersuk, URL http://cancerdiscovery. 67 26 Alex K. Shalek, Chloe K. Slichter, Han- aacrjournals.org/content/4/8/956. 68 27 nah W. Miller, M. Juliana McElrath, 28 Martin Prlic, Peter S. Linsley, and Raphael Saskia Freytag, Luyi Tian, Ingrid Lönnst- 69 29 Gottardo. MAST: a flexible statistical edt, Milica Ng, and Melanie Bahlo. 70 30 framework for assessing transcriptional Comparison of clustering tools in R for 71 31 changes and characterizing heterogene- medium-sized 10x Genomics single-cell 72 32 ity in single-cell RNA sequencing data. RNA-sequencing data. F1000Research, 73 33 Genome Biology, 16, 2015. ISSN 1474- 7, December 2018. ISSN 2046-1402. doi: 74 34 7596. doi: 10.1186/s13059-015-0844-5. 10.12688/f1000research.15809.2. URL 75 35 URL https://www.ncbi.nlm.nih.gov/ https://www.ncbi.nlm.nih.gov/pmc/ 76 36 pmc/articles/PMC4676162/. articles/PMC6124389/. 77

37 Christopher T. Fincher, Omri Wurtzel, Wolf H Fridman, Jérome Galon, Marie- 78 38 Thom de Hoog, Kellie M. Kravarik, and Caroline Dieu-Nosjean, Isabelle Cremer, 79 39 Peter W. Reddien. Cell type transcriptome Sylvain Fisson, Diane Damotte, Franck 80 40 atlas for the planarian Schmidtea mediter- Pagès, Eric Tartour, and Catherine Sautès- 81 41 ranea. Science, 360(6391):eaaq1736, Fridman. Immune infiltration in human 82

51 1 cancer: Prognostic significance and disease Charlotte Giesen, Hao A. O. Wang, Denis 41 2 control. In Glenn Dranoff, editor, Cancer Schapiro, Nevena Zivanovic, Andrea Ja- 42 3 Immunology and Immunotherapy, pages 1– cobs, Bodo Hattendorf, Peter J. Schüffler, 43 4 24. Springer Berlin Heidelberg, Berlin, Hei- Daniel Grolimund, Joachim M. Buhmann, 44 5 delberg, 2011. Simone Brandt, Zsuzsanna Varga, Peter J. 45 Wild, Detlef Günther, and Bernd Boden- 46 6 Yusi Fu, Chunmei Li, Sijia Lu, Wenxiong miller. Highly multiplexed imaging of tu- 47 7 Zhou, Fuchou Tang, X Sunney Xie, and mor tissues with subcellular resolution by 48 8 Yanyi Huang. Uniform and accurate mass cytometry. Nature Methods, 11(4): 49 9 single-cell sequencing based on emulsion 417–422, April 2014. ISSN 1548-7105. doi: 50 10 whole-genome amplification. Proc. Natl. 10.1038/nmeth.2869. URL https://www. 51 11 Acad. Sci. U. S. A., 112(38):11923–11928, nature.com/articles/nmeth.2869. 52 12 September 2015. Yury Goltsev, Nikolay Samusik, Julia 53 13 Dan Gao, Feng Jin, Min Zhou, and Yuyang Kennedy-Darling, Salil Bhate, Matthew 54 14 Jiang. Recent advances in single cell ma- Hale, Gustavo Vazquez, Sarah Black, and 55 15 nipulation and biochemical analysis on mi- Garry P. Nolan. Deep Profiling of Mouse 56 16 crofluidics. Analyst, 144(3):766–781, Jan- Splenic Architecture with CODEX Multi- 57 17 uary 2019. plexed Imaging. Cell, 174(4):968–981.e15, 58

18 Xin Gao, Deqing Hu, Madelaine Gogol, August 2018. ISSN 0092-8674. doi: 59 19 and Hua Li. ClusterMap: Com- 10.1016/j.cell.2018.07.010. URL http: 60 20 paring analyses across multiple Single //www.sciencedirect.com/science/ 61 21 Cell RNA-Seq profiles. bioRxiv, page article/pii/S0092867418309048. 62 22 331330, June 2018. doi: 10.1101/ Wuming Gong, Il-Youp Kwak, Pruthvi Pota, 63 23 331330. URL https://www.biorxiv.org/ Naoko Koyano-Nakagawa, and Daniel J. 64 24 content/10.1101/331330v2. Garry. DrImpute: imputing dropout 65 25 Tyler Garvin, Robert Aboukhalil, Jude events in single cell RNA sequencing 66 26 Kendall, Timour Baslan, Gurinder S At- data. BMC Bioinformatics, 19(1):220, June 67 27 wal, James Hicks, Michael Wigler, and 2018. ISSN 1471-2105. doi: 10.1186/ 68 28 Michael C Schatz. Interactive analysis and s12859-018-2226-y. URL https://doi. 69 29 assessment of single-cell copy-number vari- org/10.1186/s12859-018-2226-y. 70 30 ations. Nat. Methods, 12(11):1058–1060, 71 31 November 2015. R R Gray, O G Pybus, and M Salemi. Mea- suring the temporal structure in Serially- 72 32 Charles Gawad, Winston Koh, and Stephen R Sampled phylogenies. Methods Ecol. Evol., 73 33 Quake. Single-cell genome sequencing: cur- 2(5):437–445, October 2011. 74 34 rent state of the science. Nat. Rev. Genet., 35 17(3):175–188, March 2016. Anna Graybeal. Is it better to add taxa or 75 characters to a difficult phylogenetic prob- 76 36 Farzad Ghaznavi, Andrew Evans, Anant lem? Syst. Biol., 47(1):9–17, 1998. 77 37 Madabhushi, and Michael Feldman. Digital 38 imaging in pathology: whole-slide imaging Christopher Heje Grønbech, Maximil- 78 39 and beyond. Annu. Rev. Pathol., 8:331– lian Fornitz Vording, Pascal Timshel, 79 40 359, January 2013. Casper Kaae Sønderby, Tune Hannes 80

52 1 Pers, and Ole Winther. scVAE: Varia- Davis, Joseph M Herman, Christine A 42 2 tional auto-encoders for single-cell gene Iacobuzio-Donahue, and Franziska Michor. 43 3 expression data. bioRxiv, page 318295, Computational modeling of pancreatic can- 44 4 October 2019. doi: 10.1101/318295. URL cer reveals kinetics of metastasis suggesting 45 5 https://www.biorxiv.org/content/10. optimum treatment strategies. Cell, 148(1- 46 6 1101/318295v4. 2):362–375, January 2012. 47

7 Dominic Grün, Lennart Kester, and Alexan- Christoph Hafemeister and Rahul Satija. Nor- 48 8 der van Oudenaarden. Validation of noise malization and variance stabilization of 49 9 models for single-cell transcriptomics. Na- single-cell RNA-seq data using regular- 50 10 ture Methods, 11(6):637–640, June 2014. ized negative binomial regression. bioRxiv, 51 11 ISSN 1548-7105. doi: 10.1038/nmeth. page 576827, March 2019. doi: 10.1101/ 52 12 2930. URL https://www.nature.com/ 576827. URL https://www.biorxiv.org/ 53 13 articles/nmeth.2930. content/10.1101/576827v2. 54

Laleh Haghverdi, Maren Büttner, F. Alexan- 55 14 Martin Guilliams, Charles-Antoine Dutertre, der Wolf, Florian Buettner, and Fabian J. 56 15 Charlotte L. Scott, Naomi McGovern, Theis. Diffusion pseudotime robustly 57 16 Dorine Sichien, Svetoslav Chakarov, Sofie reconstructs lineage branching. Nature 58 17 Van Gassen, Jinmiao Chen, Michael Methods, 13(10):845–848, October 2016. 59 18 Poidinger, Sofie De Prijck, Simon J. ISSN 1548-7105. doi: 10.1038/nmeth. 60 19 Tavernier, Ivy Low, Sergio Erdal Irac, 3971. URL https://www.nature.com/ 61 20 Citra Nurfarah Mattar, Hermi Rizal Suma- articles/nmeth.3971. 62 21 toh, Gillian Hui Ling Low, Tam John Kit 22 Chung, Dedrick Kok Hong Chan, Ker Kan Laleh Haghverdi, Aaron T. L. Lun, 63 23 Tan, Tony Lim Kiat Hon, Even Fos- Michael D. Morgan, and John C. Marioni. 64 24 sum, Bjarne Bogen, Mahesh Choolani, Batch effects in single-cell RNA-sequencing 65 25 Jerry Kok Yen Chan, Anis Larbi, Hervé data are corrected by matching mutual 66 26 Luche, Sandrine Henri, Yvan Saeys, nearest neighbors. Nature Biotechnology, 67 27 Evan William Newell, Bart N. Lambrecht, 36(5):421–427, 2018. ISSN 1546-1696. doi: 68 28 Bernard Malissen, and Florent Ginhoux. 10.1038/nbt.4091. 69 29 Unsupervised High-Dimensional Analysis 30 Aligns Dendritic Cells across Tissues Xiaoping Han, Renying Wang, Yincong Zhou, 70 31 and Species. Immunity, 45(3):669–684, Lijiang Fei, Huiyu Sun, Shujing Lai, Assieh 71 32 September 2016. ISSN 1074-7613. doi: Saadatpour, Ziming Zhou, Haide Chen, 72 33 10.1016/j.immuni.2016.08.015. URL http: Fang Ye, Daosheng Huang, Yang Xu, Wen- 73 34 //www.sciencedirect.com/science/ tao Huang, Mengmeng Jiang, Xinyi Jiang, 74 35 article/pii/S1074761316303399. Jie Mao, Yao Chen, Chenyu Lu, Jin Xie, 75 Qun Fang, Yibin Wang, Rui Yue, Tiefeng 76 36 Metin N Gurcan, Laura Boucheron, Ali Can, Li, He Huang, Stuart H Orkin, Guo-Cheng 77 37 Anant Madabhushi, Nasir Rajpoot, and Yuan, Ming Chen, and Guoji Guo. Map- 78 38 Bulent Yener. Histopathological image ping the mouse cell atlas by Microwell-Seq. 79 39 analysis: A review. IEEE Rev. Biomed. Cell, 172(5):1091–1107.e17, February 2018. 80 40 Eng., 2:147, 2009. Andreas Heindl, Sidra Nawaz, and Yinyin 81 41 Hiroshi Haeno, Mithat Gonen, Meghan B Yuan. Mapping spatial heterogeneity in 82

53 1 the tumor microenvironment: a new era for microfluidics. Sci. Rep., 7(1):5199, July 40 2 digital pathology. Lab. Invest., 95(4):377– 2017. 41 3 384, April 2015. Yong Hou, Kui Wu, Xulian Shi, Fuqiang Li, 42 43 4 Stephanie C. Hicks and Roger D. Peng. Luting Song, Hanjie Wu, Michael Dean, 44 5 Elements and Principles of Data Anal- Guibo Li, Shirley Tsang, Runze Jiang, 45 6 ysis. arXiv:1903.07639 [stat], March Xiaolong Zhang, Bo Li, Geng Liu, Ni- 46 7 2019. URL http://arxiv.org/abs/1903. harika Bedekar, Na Lu, Guoyun Xie, Han 47 8 07639. arXiv: 1903.07639. Liang, Liao Chang, Ting Wang, Jianghao Chen, Yingrui Li, Xiuqing Zhang, Huan- 48 9 Stephanie C. Hicks, F. William Townes, ming Yang, Xun Xu, Ling Wang, and Jun 49 10 Mingxiang Teng, and Rafael A. Irizarry. Wang. Comparison of variations detection 50 11 Missing data and technical variability between whole-genome amplification meth- 51 12 in single-cell RNA-sequencing exper- ods used in single-cell resequencing. Giga- 52 13 iments. Biostatistics, 19(4):562–578, science, 4:37, August 2015. 53 14 October 2018. ISSN 1465-4644. doi: Qiwen Hu and Casey S Greene. Param- 54 15 10.1093/biostatistics/kxx053. URL https: eter tuning is a key part of dimension- 55 16 //academic.oup.com/biostatistics/ ality reduction via deep variational au- 56 17 article/19/4/562/4599254. toencoders for single cell RNA transcrip- 57 tomics. Pacific Symposium on Biocomput- 58 18 Elad Hoffer and Nir Ailon. Deep Metric ing. Pacific Symposium on Biocomputing, 59 19 Learning Using Triplet Network. In Aasa 24:362–373, 2019. ISSN 2335-6936, 2335- 60 20 Feragen, Marcello Pelillo, and Marco Loog, 6928. URL https://www.ncbi.nlm.nih. 61 21 editors, Similarity-Based Pattern Recogni- gov/pubmed/30963075. 62 22 tion, Lecture Notes in Computer Science, 23 pages 84–92. Springer International Pub- Lei Huang, Fei Ma, Alec Chapman, Sijia 63 24 lishing, 2015. ISBN 978-3-319-24261-3. Lu, and Xiaoliang Sunney Xie. Single-Cell 64 Whole-Genome amplification and sequenc- 65 25 Ian H Holmes. Solving the master equation ing: Methodology and applications. Annu. 66 26 for indels. BMC Bioinformatics, 18(1):255, Rev. Genomics Hum. Genet., 16:79–102, 67 27 May 2017. June 2015. 68

28 Chung-Chau Hon, Jay W. Shin, Piero Carn- Mo Huang, Jingshu Wang, Eduardo Torre, 69 29 inci, and Michael J. T. Stubbington. Hannah Dueck, Sydney Shaffer, Roberto 70 30 The Human Cell Atlas: Technical ap- Bonasio, John I. Murray, Arjun Raj, 71 31 Briefings in proaches and challenges. Mingyao Li, and Nancy R. Zhang. 72 32 Functional Genomics , 17(4):283–294, July SAVER: gene expression recovery for 73 33 2018. ISSN 2041-2649. doi: 10.1093/bfgp/ single-cell RNA sequencing. Nature Meth- 74 34 https://academic.oup. elx029. URL ods, 15(7):539, July 2018. ISSN 1548-7105. 75 35 com/bfg/article/17/4/283/4571849 . doi: 10.1038/s41592-018-0033-z. URL 76 https://www.nature.com/articles/ 77 36 Masahito Hosokawa, Yohei Nishikawa, s41592-018-0033-z. 78 37 Masato Kogawa, and Haruko Takeyama. 38 Massively parallel whole genome amplifica- Joanna Hård, Ezeddin Al Hakim, Marie 79 39 tion for single-cell sequencing using droplet Kindblom, Åsa K. Björklund, Bengt 80

54 1 Sennblad, Ilke Demirci, Marta Paterlini, Livnat Jerby-Arnon, Nadja Pfetzer, Yedael Y 41 2 Pedro Reu, Erik Borgström, Patrik L. Waldman, Lynn McGarry, Daniel James, 42 3 Ståhl, Jakob Michaelsson, Jeff E. Mold, Emma Shanks, Brinton Seashore-Ludlow, 43 4 and Jonas Frisén. Conbase: a software Adam Weinstock, Tamar Geiger, Paul A 44 5 for unsupervised discovery of clonal so- Clemons, Eyal Gottlieb, and Eytan Rup- 45 6 matic mutations in single cells through read pin. Predicting cancer-specific vulnerabil- 46 7 phasing. Genome Biology, 20(1):68, April ity via data-driven detection of synthetic 47 8 2019. ISSN 1474-760X. doi: 10.1186/ lethality. Cell, 158(5):1199–1209, August 48 9 s13059-019-1673-8. URL https://doi. 2014. 49 10 org/10.1186/s13059-019-1673-8. Zhicheng Ji and Hongkai Ji. TSCAN: 50 11 T. Höllt, N. Pezzotti, V. van Unen, F. Kon- Pseudo-time reconstruction and evaluation 51 12 ing, B. P. F. Lelieveldt, and A. Vilanova. in single-cell RNA-seq analysis. Nucleic 52 13 CyteGuide: Visual Guidance for Hierar- Acids Research, 44(13):e117, 2016. ISSN 53 14 chical Single-Cell Analysis. IEEE Trans- 1362-4962. doi: 10.1093/nar/gkw430. 54 15 actions on Visualization and Computer Nelson Johansen and Gerald Quon. scAlign: 55 16 Graphics, 24(1):739–748, January 2018. a tool for alignment, integration and 56 17 ISSN 1077-2626. doi: 10.1109/TVCG.2017. rare cell identification from scRNA-seq 57 18 2744318. data. bioRxiv, page 504944, March 58 19 Giovanni Iacono, Elisabetta Mereu, Amy 2019. doi: 10.1101/504944. URL 59 20 Guillaumet-Adkins, Roser Corominas, Ivon https://www.biorxiv.org/content/10. 60 21 Cuscó, Gustavo Rodríguez-Esteban, Marta 1101/504944v4. 61 22 Gut, Luis Alberto Pérez-Jurado, Ivo Gut, Brett E Johnson, Tali Mazor, Chibo Hong, 62 23 and Holger Heyn. bigSCale: an analyti- Michael Barnes, Koki Aihara, Cory Y 63 24 cal framework for big-scale single-cell data. McLean, Shaun D Fouse, Shogo Ya- 64 25 Genome Res., 28(6):878–890, June 2018. mamoto, Hiroki Ueda, Kenji Tatsuno, 65 26 Humayun Irshad, Antoine Veillard, Ludovic Saurabh Asthana, Llewellyn E Jalbert, 66 27 Roux, and Daniel Racoceanu. Methods for Sarah J Nelson, Andrew W Bollen, W Clay 67 28 Nuclei Detection, Segmentation, and Clas- Gustafson, Elise Charron, William A 68 29 sification in Digital Histopathology: A Re- Weiss, Ivan V Smirnov, Jun S Song, 69 30 view—Current Status and Future Poten- Adam B Olshen, Soonmee Cha, Yongjun 70 31 tial. IEEE Reviews in Biomedical Engineer- Zhao, Richard A Moore, Andrew J 71 32 ing, 7:97–114, 2014. ISSN 1937-3333, 1941- Mungall, Steven J M Jones, Martin Hirst, 72 33 1189. doi: 10.1109/RBME.2013.2295804. Marco A Marra, Nobuhito Saito, Hiroyuki 73 Aburatani, Akitake Mukasa, Mitchel S 74 34 Martin Jacobsen. Point Process Theory and Berger, Susan M Chang, Barry S Taylor, 75 35 Applications: Marked Point and Piecewise and Joseph F Costello. Mutational anal- 76 36 Deterministic Processes. Springer Science ysis reveals the origin and therapy-driven 77 37 & Business Media, December 2005. evolution of recurrent glioma. Science, 343 78 (6167):189–193, January 2014. 79 38 Katharina Jahn, Jack Kuipers, and Niko 39 Beerenwinkel. Tree inference for single-cell Travis S. Johnson, Tongxin Wang, Zhi 80 40 data. Genome Biol., 17:86, May 2016. Huang, Christina Y. Yu, Yi Wu, Yatong 81

55 1 Han, Yan Zhang, Kun Huang, and Jie Elizabeth McCarthy, Eunice Wan, Simon 41 2 Zhang. LAmbDA: Label Ambiguous Wong, Lauren Byrnes, Cristina M Lanata, 42 3 Domain Adaptation Dataset Integration Rachel E Gate, Sara Mostafavi, Alexander 43 4 Reduces Batch Effects and Improves Marson, Noah Zaitlen, Lindsey A Criswell, 44 5 Subtype Detection. Bioinformatics, April and Chun Jimmie Ye. Multiplexed droplet 45 6 2019. doi: 10.1093/bioinformatics/btz295. single-cell RNA-sequencing using natural 46 7 URL https://academic.oup.com/ genetic variation. Nat. Biotechnol., 36(1): 47 8 bioinformatics/advance-article/ 89–94, January 2018b. 48 9 doi/10.1093/bioinformatics/btz295/ 10 5481958. Jurrian Kornelis de Kanter, Philip Lijnzaad, 49 Tito Candelli, Thanasis Margaritis, and 50 11 Altuna Akalin Jonathan Ronen. netsmooth: Frank Holstege. CHETAH: a selective, hi- 51 12 Network-smoothing based imputation for erarchical cell type identification method 52 13 F1000Res. single cell RNA-seq. , 7, 2018. for single-cell RNA sequencing. bioRxiv, 53 page 558908, February 2019. doi: 10.1101/ 54 14 Min Jung, Daniel Wells, Jannette Rusch, 558908. URL https://www.biorxiv.org/ 55 15 Suhaira Ahmad, Jonathan Marchini, Si- content/10.1101/558908v1. 56 16 mon R Myers, and Donald F Conrad. Uni- 17 fied single-cell analysis of testis gene regu- Nikos Karaiskos, Philipp Wahle, Jonathan 57 18 lation and pathology in five mouse strains. Alles, Anastasiya Boltengagen, Salah Ay- 58 19 eLife, 8, June 2019. ISSN 2050-084X. oub, Claudia Kipar, Christine Kocks, Niko- 59 20 doi: 10.7554/eLife.43966. URL http:// laus Rajewsky, and Robert P Zinzen. The 60 21 dx.doi.org/10.7554/eLife.43966. drosophila embryo at single-cell transcrip- 61 tome resolution. Science, 358(6360):194– 62 22 Melissa R Junttila and Frederic J de Sauvage. 199, October 2017a. 63 23 Influence of tumour micro-environment 24 heterogeneity on therapeutic response. Na- Nikos Karaiskos, Philipp Wahle, Jonathan 64 25 ture, 501(7467):346–354, September 2013. Alles, Anastasiya Boltengagen, Salah Ay- 65

26 Hyun Min Kang, Meena Subramaniam, Sasha oub, Claudia Kipar, Christine Kocks, Niko- 66 27 Targ, Michelle Nguyen, Lenka Maliskova, laus Rajewsky, and Robert P. Zinzen. The 67 28 Elizabeth McCarthy, Eunice Wan, Simon Drosophila embryo at single-cell transcrip- 68 Science 69 29 Wong, Lauren Byrnes, Cristina Lanata, tome resolution. , 358(6360):194– 30 Rachel Gate, Sara Mostafavi, Alexan- 199, October 2017b. ISSN 0036-8075, 70 31 der Marson, Noah Zaitlen, Lindsey A 1095-9203. doi: 10.1126/science.aan3235. 71 32 Criswell, and Chun Jimmie Ye. Multi- URL http://science.sciencemag.org/ 72 33 plexed droplet single-cell RNA-sequencing content/358/6360/194. 73 34 using natural genetic variation. Na- 35 ture biotechnology, 36(1):89–94, January Ino D. Karemaker and Michiel Vermeulen. 74 36 2018a. ISSN 1087-0156. doi: 10.1038/nbt. Single-Cell DNA Methylation Profiling: 75 37 4042. URL https://www.ncbi.nlm.nih. Technologies and Biological Applications. 76 Trends in Biotechnology 77 38 gov/pmc/articles/PMC5784859/. , 36(9):952–965, September 2018. ISSN 0167-7799, 1879- 78 39 Hyun Min Kang, Meena Subramaniam, Sasha 3096. doi: 10.1016/j.tibtech.2018.04.002. 79 40 Targ, Michelle Nguyen, Lenka Maliskova, URL https://www.cell.com/ 80

56 1 trends/biotechnology/abstract/ 217–227, 2013. doi: 10.1101/gr.140301. 41 2 S0167-7799(18)30115-X. 112. URL http://genome.cshlp.org/ 42 content/23/2/217.abstract. 43 3 Rongqin Ke, Marco Mignardi, Alexan- Branch- 44 4 dra Pacureanu, Jessica Svedlund, Johan Marek Kimmel and David Axelrod. ing Processes in Biology 45 5 Botling, Carolina Wählby, and Mats Nils- . Interdisci- 6 son. In situ sequencing for RNA anal- plinary Applied Mathematics. Springer- 46 7 ysis in preserved tissue and cells. Na- Verlag, New York, 2 edition, 2015. ISBN 47 8 ture Methods, 10(9):857–860, September 978-1-4939-1558-3. URL https://www. 48 9 2013. ISSN 1548-7105. doi: 10.1038/ springer.com/gp/book/9781493915583. 49 10 nmeth.2563. URL https://www.nature. Savvas Kinalis, Finn Cilius Nielsen, Ole 50 11 com/articles/nmeth.2563. Winther, and Frederik Otzen Bagger. 51 Deconvolution of autoencoders to learn bi- 52 12 Lennart Kester and Alexander van Oude- ological regulatory modules from single cell 53 13 naarden. Single-Cell Transcriptomics mRNA sequencing data. BMC bioinfor- 54 14 Meets Lineage Tracing. Cell Stem Cell, 23 matics, 20(1):379, July 2019. ISSN 1471- 55 15 (2):166–179, August 2018. ISSN 19345909. 2105. doi: 10.1186/s12859-019-2952-9. 56 16 doi: 10.1016/j.stem.2018.04.014. URL URL http://dx.doi.org/10.1186/ 57 17 https://linkinghub.elsevier.com/ s12859-019-2952-9. 58 18 retrieve/pii/S1934590918301760.

Vladimir Yu Kiselev, Andrew Yiu, and Mar- 59 19 Peter V Kharchenko, Lev Silberstein, and tin Hemberg. scmap: projection of single- 60 20 David T Scadden. Bayesian approach cell RNA-seq data across data sets. Na- 61 21 to single-cell differential expression analy- ture Methods, 15(5):359–362, May 2018. 62 22 sis. Nature Methods, 11(7):740–742, July ISSN 1548-7105. doi: 10.1038/nmeth. 63 23 2014. ISSN 1548-7091. doi: 10.1038/ 4644. URL https://www.nature.com/ 64 24 nmeth.2967. URL http://www.nature. articles/nmeth.4644. 65 25 com/doifinder/10.1038/nmeth.2967. Vladimir Yu Kiselev, Tallulah S. Andrews, 66 26 Kyu-Tae Kim, Hye Won Lee, Hae-Ock Lee, and Martin Hemberg. Challenges in 67 27 Sang Cheol Kim, Yun Jee Seo, Woosung unsupervised clustering of single-cell 68 28 Chung, Hye Hyeon Eum, Do-Hyun Nam, RNA-seq data. Nature Reviews Genetics, 69 29 Junhyong Kim, Kyeung Min Joo, and page 1, January 2019. ISSN 1471-0064. 70 30 Woong-Yang Park. Single-cell mRNA se- doi: 10.1038/s41576-018-0088-9. URL 71 31 quencing identifies subclonal heterogeneity https://www.nature.com/articles/ 72 32 in anti-cancer drug responses of lung ade- s41576-018-0088-9. 73 33 nocarcinoma cells. Genome Biol., 16:127, 34 June 2015. Allon M. Klein, Linas Mazutis, Ilke Akar- 74 tuna, Naren Tallapragada, Adrian Veres, 75 35 Tae-Min Kim, Ruibin Xi, Lovelace J. Lu- Victor Li, Leonid Peshkin, David A. Weitz, 76 36 quette, Richard W. Park, Mark D. John- and Marc W. Kirschner. Droplet barcod- 77 37 son, and Peter J. Park. Functional ing for single-cell transcriptomics applied 78 38 genomic analysis of chromosomal aber- to embryonic stem cells. Cell, 161(5):1187– 79 39 rations in a compendium of 8000 can- 1201, May 2015. ISSN 1097-4172. doi: 80 40 cer genomes. Genome Research, 23(2): 10.1016/j.cell.2015.04.044. 81

57 1 C A Klein, O Schmidt-Kittler, J A Schardt, identifying differential distributions in 41 2 K Pantel, M R Speicher, and G Rieth- single-cell RNA-seq experiments. Genome 42 3 müller. Comparative genomic hybridiza- Biology, 17(1):222, 2016b. ISSN 1474- 43 4 tion, loss of heterozygosity, and DNA se- 760X. doi: 10.1186/s13059-016-1077-y. 44 5 quence analysis of single cells. Proc. Natl. Dylan Kotliar, Adrian Veres, M Aurel Nagy, 45 6 Acad. Sci. U. S. A., 96(8):4494–4499, April Shervin Tabrizi, Eran Hodis, Douglas A 46 7 1999. Melton, and Pardis C Sabeti. Identify- 47 8 Sergey Knyazev, Viachaslau Tsyvina, An- ing gene expression programs of cell-type 48 9 drew Melnyk, Alexander Artyomenko, Ta- identity and cellular activity with single- 49 10 tiana Malygina, Yuri B Porozov, Ellsworth cell RNA-Seq. Elife, 8:e43803, July 2019. 50 11 Campbell, William M Switzer, Pavel Alexey M. Kozlov, Diego Darriba, Tomáš 51 12 Skums, and Alex Zelikovsky. CliqueSNV: Flouri, Benoit Morel, and Alexan- 52 13 Scalable reconstruction of Intra-Host viral dros Stamatakis. RAxML-NG: a fast, 53 14 populations from NGS reads, 2018. scalable and user-friendly tool for 54 15 Bryan Kolaczkowski and Joseph W Thornton. maximum likelihood phylogenetic in- 55 16 A mixed branch length model of hetero- ference. Bioinformatics, May 2019. 56 17 tachy improves phylogenetic accuracy. Mol. doi: 10.1093/bioinformatics/btz305. 57 18 Biol. Evol., 25(6):1054–1066, June 2008. URL https://academic.oup.com/ 58 bioinformatics/advance-article/ 59 19 Daisuke Komura and Shumpei Ishikawa. Ma- doi/10.1093/bioinformatics/btz305/ 60 20 chine learning methods for histopatho- 5487384. 61 21 logical image analysis. Comput. Struct. 22 Biotechnol. J., 16:34–42, February 2018. O Kozlov. Models, Optimizations, and 62 Tools for Large-Scale Phylogenetic Infer- 63 23 Hazal Koptagel, Seong-Hwan Jun, and ence, Handling Sequence Uncertainty, and 64 24 Jens Lagergren. SCuPhr: A Prob- Taxonomic Validation. PhD thesis, Karl- 65 25 abilistic Framework for Cell Lineage sruhe Institute of Technology (KIT), Octo- 66 26 Tree Reconstruction. bioRxiv, page ber 2018. 67 27 357442, June 2018. doi: 10.1101/ 28 357442. URL https://www.biorxiv.org/ Sergey Kryazhimskiy and Joshua B Plotkin. 68 29 content/early/2018/06/29/357442. The population genetics of dN/dS. PLoS 69 Genet., 4(12):e1000304, December 2008. 70 30 Keegan D Korthauer, Li-Fang Chu, Michael A 31 Newton, Yuan Li, James Thomson, Ron Jack Kuipers, Katharina Jahn, Benjamin J 71 32 Stewart, and Christina Kendziorski. A sta- Raphael, and Niko Beerenwinkel. Single- 72 33 tistical approach for identifying differential cell sequencing data reveal widespread re- 73 34 distributions in single-cell RNA-seq exper- currence and loss of mutational hits in the 74 35 iments. Genome Biol., 17(1):222, October life histories of tumors. Genome Res., 27 75 36 2016a. (11):1885–1894, November 2017. 76

37 Keegan D. Korthauer, Li-Fang Chu, Johannes Köster, Myles Brown, and 77 38 Michael A. Newton, Yuan Li, James X. Shirley Liu. A Bayesian model for 78 39 Thomson, Ron Stewart, and Christina single cell transcript expression analysis 79 40 Kendziorski. A statistical approach for on MERFISH data. Bioinformatics, 35 80

58 1 (6):995–1001, March 2019a. ISSN 1367- 411058, September 2018. doi: 10.1101/ 43 2 4803. doi: 10.1093/bioinformatics/bty718. 411058. URL https://www.biorxiv.org/ 44 3 URL https://academic.oup.com/ content/early/2018/09/13/411058. 45 4 bioinformatics/article/35/6/995/ 5 5078469. Ruben T H M Larue, Gilles Defraene, 46 Dirk De Ruysscher, Philippe Lambin, and 47 6 Johannes Köster, Louis Dijkstra, Tobias Wouter van Elmpt. Quantitative radiomics 48 7 Marschall, and Alexander Schönhuth. studies for tissue characterization: a re- 49 8 Enhancing sensitivity and controlling view of technology and methodological pro- 50 9 false discovery rate in somatic indel cedures. Br. J. Radiol., 90(1070):20160665, 51 10 discovery. bioRxiv, page 741256, Au- February 2017. 52 11 gust 2019b. doi: 10.1101/741256. URL 12 https://www.biorxiv.org/content/10. Devon A. Lawson, Kai Kessenbrock, 53 13 1101/741256v1. Ryan T. Davis, Nicholas Pervolarakis, 54 and Zena Werb. Tumour heterogeneity 55 14 Aaron T. L. Lun, Karsten Bach, and John C. and metastasis at single-cell resolu- 56 15 Marioni. Pooling across cells to normal- tion. Nature Cell Biology, 20(12): 57 16 ize single-cell RNA sequencing data with 1349, December 2018. ISSN 1476-4679. 58 17 Genome Biology many zero counts. , 17(1): doi: 10.1038/s41556-018-0236-7. URL 59 18 75, April 2016. ISSN 1474-760X. doi: 10. https://www.nature.com/articles/ 60 19 https:// 1186/s13059-016-0947-7. URL s41556-018-0236-7. 61 20 doi.org/10.1186/s13059-016-0947-7. Si Quang Le, Cuong Cao Dang, and Olivier 62 21 Emma Laks, Hans Zahn, Daniel Lai, Andrew Gascuel. Modeling protein evolution with 63 22 McPherson, Adi Steif, Jazmine Brimhall, several amino acid replacement matrices 64 23 Justina Biele, Beixi Wang, Tehmina depending on site rates. Mol. Biol. Evol., 65 24 Masud, Diljot Grewal, Cydney Nielsen, 29(10):2921–2936, October 2012. 66 25 Samantha Leung, Viktoria Bojilova, Maia

26 Smith, Oleg Golovko, Steven Poon, Peter Adam D Leaché, Barbara L Banbury, Joseph 67 27 Eirew, Farhia Kabeer, Teresa Ruiz de Al- Felsenstein, Adrián Nieto-Montes de Oca, 68 28 gara, So Ra Lee, M. Jafar Taghiyar, Curtis and Alexandros Stamatakis. Short tree, 69 29 Huebner, Jessica Ngo, Tim Chan, Spencer long tree, right tree, wrong tree: New ac- 70 30 Vatrt-Watts, Pascale Walters, Nafis Abrar, quisition bias corrections for inferring SNP 71 31 Sophia Chan, Matt Wiens, Lauren Mar- phylogenies. Syst. Biol., 64(6):1032–1047, 72 32 tin, R. Wilder Scott, Michael T. Under- 2015. 73 33 hill, Elizabeth Chavez, Christian Steidl, 34 Daniel Da Costa, Yusanne Ma, Robin J. N. Je Hyuk Lee, Evan R Daugharthy, Jonathan 74 35 Coope, Richard Corbett, Stephen Plea- Scheiman, Reza Kalhor, Thomas C Fer- 75 36 sance, Richard Moore, Andy J. Mungall, rante, Richard Terry, Brian M Turczyk, 76 37 Cruk Imaxt Consortium, Marco A. Marra, Joyce L Yang, Ho Suk Lee, John Aach, Kun 77 38 Carl Hansen, Sohrab Shah, and Samuel Zhang, and George M Church. Fluorescent 78 39 Aparicio. Resource: Scalable whole genome in situ sequencing (FISSEQ) of RNA for 79 40 sequencing of 40,000 single cells identifies gene expression profiling in intact cells and 80 41 stochastic aneuploidies, genome replication tissues. Nat. Protoc., 10(3):442–458, March 81 42 states and clonal repertoires. bioRxiv, page 2015. 82

59 1 Jeffrey T. Leek, Robert B. Scharpf, Héc- Sorger. Highly multiplexed immunofluores- 41 2 tor Corrada Bravo, David Simcha, Ben- cence imaging of human tissues and tumors 42 3 jamin Langmead, W. Evan Johnson, Don- using t-CyCIF and conventional optical mi- 43 4 ald Geman, Keith Baggerly, and Rafael A. croscopes. eLife, 7:e31657, July 2018. ISSN 44 5 Irizarry. Tackling the widespread and 2050-084X. doi: 10.7554/eLife.31657. URL 45 6 critical impact of batch effects in high- https://doi.org/10.7554/eLife.31657. 46 7 throughput data. Nature Reviews Genetics, 8 11(10):733–739, October 2010. ISSN 1471- Peijie Lin, Michael Troup, and Joshua W. K. 47 9 0064. doi: 10.1038/nrg2825. URL https: Ho. CIDR: Ultrafast and accurate clus- 48 10 //www.nature.com/articles/nrg2825. tering through imputation for single-cell 49 RNA-seq data. Genome Biology, 18(1):59, 50 11 Ana Carolina Leote, Xiaohui Wu, and March 2017b. ISSN 1474-760X. doi: 10. 51 12 Andreas Beyer. Network-based impu- 1186/s13059-017-1188-0. URL https:// 52 13 tation of dropouts in single-cell RNA doi.org/10.1186/s13059-017-1188-0. 53 14 sequencing data. bioRxiv, page 611517, 15 April 2019. doi: 10.1101/611517. URL G C Linderman, J Zhao, and Y Kluger. Zero- 54 16 https://www.biorxiv.org/content/10. preserving imputation of scRNA-seq data 55 bioRxiv 56 17 1101/611517v1. using low-rank approximation. , 2018. URL https://www.biorxiv.org/ 57 18 Wei Vivian Li and Jingyi Jessica Li. An accu- content/10.1101/397588v1.abstract. 58 19 rate and robust imputation method scim- 20 pute for single-cell RNA-seq data. Nat. Liang Liu, Zhenxiang Xi, Shaoyuan Wu, 59 21 Commun., 9(1):997, March 2018. Charles C Davis, and Scott V Edwards. Es- 60 timating phylogenetic trees from genome- 61 22 Yuval Lieberman, Lior Rokach, and Tal scale data. Ann. N. Y. Acad. Sci., 1360: 62 23 Shay. CaSTLe – Classification of single 36–53, December 2015. 63 24 cells by transfer learning: Harnessing 25 the power of publicly available single cell Jackson Loper, Trygve Bakken, Uygar 64 26 RNA sequencing experiments to annotate Sumbul, Gabe Murphy, Hongkui Zeng, 65 27 new experiments. PLOS ONE, 13(10): David Blei, and Liam Paninski. The 66 28 e0205499, October 2018. ISSN 1932-6203. Markov link method: a nonparametric 67 29 doi: 10.1371/journal.pone.0205499. URL approach to combine observations from 68 30 https://journals.plos.org/plosone/ multiple experiments. bioRxiv, page 69 31 article?id=10.1371/journal.pone. 457283, January 2019. doi: 10.1101/ 70 32 0205499. 457283. URL https://www.biorxiv.org/ 71 content/10.1101/457283v3. 72 33 Chieh Lin, Siddhartha Jain, Hannah Kim, 34 and Ziv Bar-Joseph. Using neural networks Romain Lopez, Jeffrey Regier, Michael B 73 35 for reducing the dimensions of single-cell Cole, Michael I Jordan, and Nir Yosef. 74 36 RNA-Seq data. Nucleic Acids Res., 45(17): Deep generative modeling for single-cell 75 Nature methods 76 37 e156, September 2017a. transcriptomics. , 15 (12):1053–1058, December 2018. ISSN 77 38 Jia-Ren Lin, Benjamin Izar, Shu Wang, 1548-7091, 1548-7105. doi: 10.1038/ 78 39 Clarence Yapp, Shaolin Mei, Parin M s41592-018-0229-2. URL http://dx.doi. 79 40 Shah, Sandro Santagata, and Peter K org/10.1038/s41592-018-0229-2. 80

60 1 Eric Lubeck, Ahmet F Coskun, Timur 1038/nprot.2016.138. URL https://www. 40 2 Zhiyentayev, Mubhij Ahmad, and Long nature.com/articles/nprot.2016.138. 41 3 Cai. Single-cell in situ RNA profiling by 42 4 sequential hybridization. Nat. Methods, 11 Iain C. Macaulay, Chris P. Ponting, and 43 5 (4):360–361, April 2014. Thierry Voet. Single-Cell Multiomics: Multiple Measurements from Single 44 6 Aaron T L Lun and John C Marioni. Over- Cells. Trends in Genetics, 33(2):155– 45 7 coming confounding plate effects in dif- 168, February 2017. ISSN 0168-9525. 46 8 ferential expression analyses of single-cell doi: 10.1016/j.tig.2016.12.003. URL 47 9 RNA-seq data. Biostatistics, 18(3):451– https://www.ncbi.nlm.nih.gov/pmc/ 48 10 464, July 2017. articles/PMC5303816/. 49

11 Aaron T L Lun, Karsten Bach, and John C Evan Z. Macosko, Anindita Basu, Rahul 50 12 Marioni. Pooling across cells to normalize Satija, James Nemesh, Karthik Shekhar, 51 13 single-cell RNA sequencing data with many Melissa Goldman, Itay Tirosh, Allison R. 52 14 zero counts. Genome Biol., 17:75, April Bialas, Nolan Kamitaki, Emily M. Marter- 53 15 2016. steck, John J. Trombetta, David A. Weitz, 54 Joshua R. Sanes, Alex K. Shalek, Aviv 55 16 Aaron T L Lun, Arianne C Richard, and Regev, and Steven A. McCarroll. Highly 56 17 John C Marioni. Testing for differential Parallel Genome-wide Expression Profil- 57 18 abundance in mass cytometry data. Nat. ing of Individual Cells Using Nanoliter 58 19 Methods, 14(7):707–709, July 2017. Droplets. Cell, 161(5):1202–1214, May 59 2015. ISSN 1097-4172. doi: 10.1016/j.cell. 60 20 Tao Luo, Lei Fan, Rong Zhu, and Dong Sun. 2015.05.002. 61 21 Microfluidic Single-Cell manipulation and 22 analysis: Methods and applications. Micro- Serghei Mangul, Lana S. Martin, Brian L. 62 23 machines (Basel), 10(2), February 2019. Hill, Angela Ka-Mei Lam, Margaret G. 63 Distler, Alex Zelikovsky, Eleazar Es- 64 24 Lovelace J. Luquette, Craig L. Bohrson, kin, and Jonathan Flint. Systematic 65 25 Max A. Sherman, and Peter J. Park. Iden- benchmarking of omics computational 66 26 tification of somatic mutations in single tools. Nature Communications, 10(1): 67 27 cell DNA-seq using a spatial model of 1393, March 2019. ISSN 2041-1723. 68 28 allelic imbalance. Nature Communications, doi: 10.1038/s41467-019-09406-4. URL 69 29 10(1):1–14, August 2019. ISSN 2041-1723. https://www.nature.com/articles/ 70 30 doi: 10.1038/s41467-019-11857-8. URL s41467-019-09406-4. 71 31 https://www.nature.com/articles/ 32 s41467-019-11857-8. Gioele La Manno, Ruslan Soldatov, Amit 72 Zeisel, Emelie Braun, Hannah Hochgerner, 73 33 Iain C. Macaulay, Mabel J. Teng, Wilfried Viktor Petukhov, Katja Lidschreiber, 74 34 Haerty, Parveen Kumar, Chris P. Ponting, Maria E. Kastriti, Peter Lönnerberg, 75 35 and Thierry Voet. Separation and paral- Alessandro Furlan, Jean Fan, Lars E. 76 36 lel sequencing of the genomes and tran- Borm, Zehua Liu, David van Bruggen, 77 37 scriptomes of single cells using G&T- Jimin Guo, Xiaoling He, Roger Barker, 78 38 seq. Nature Protocols, 11(11):2081–2103, Erik Sundström, Gonçalo Castelo-Branco, 79 39 November 2016. ISSN 1750-2799. doi: 10. Patrick Cramer, Igor Adameyko, Sten 80

61 1 Linnarsson, and Peter V. Kharchenko. URL http://science.sciencemag.org/ 41 2 RNA velocity of single cells. Nature, 560 content/358/6370/1622. 42 3 (7719):494, August 2018. ISSN 1476-4687. 4 doi: 10.1038/s41586-018-0414-6. URL Jing Meng and Yi-Ping Phoebe Chen. A 43 5 https://www.nature.com/articles/ database of simulated tumor genomes to- 44 6 s41586-018-0414-6. wards accurate detection of somatic small 45 variants in cancer. PLoS One, 13(8): 46 7 Erik A Martens, Rumen Kostadinov, Carlo C e0202982, August 2018. 47 8 Maley, and Oskar Hallatschek. Spatial 9 structure increases the waiting time for can- Christopher R. Merritt, Giang T. Ong, 48 10 cer. New J. Phys., 13, November 2011. Sarah Church, Kristi Barker, Gary Geiss, 49 Margaret Hoang, Jaemyeong Jung, Yan 50 11 Dariusz Matlak and Ewa Szczurek. Epista- Liang, Jill McKay-Fleisch, Karen Nguyen, 51 12 sis in genomic and survival data of can- Kristina Sorg, Isaac Sprague, Charles War- 52 13 cer patients. PLoS Comput. Biol., 13(7): ren, Sarah Warren, Zoey Zhou, Daniel R. 53 14 e1005626, July 2017. Zollinger, Dwayne L. Dunaway, Gordon B. 54 Mills, and Joseph M. Beechem. High 55 15 Davis James McCarthy, Raghd Rostom, multiplex, digital spatial profiling of pro- 56 16 Yuanhua Huang, Daniel J. Kunz, Petr teins and RNA in fixed tissue using ge- 57 17 Danecek, Marc Jan Bonder, Tzachi Hagai, nomic detection methods. bioRxiv, page 58 18 HipSci Consortium, Wenyi Wang, Daniel J. 559021, February 2019. doi: 10.1101/ 59 19 Gaffney, Benjamin D. Simons, Oliver Ste- 559021. URL https://www.biorxiv.org/ 60 20 gle, and Sarah A. Teichmann. Cardelino: content/10.1101/559021v2. 61 21 Integrating whole exomes and single-cell 22 transcriptomes to reveal phenotypic im- Zhun Miao, Jiaqi Li, and Xuegong Zhang. 62 23 pact of somatic variants. bioRxiv, page scRecover: Discriminating true and 63 24 413047, November 2018. doi: 10.1101/ false zeros in single-cell RNA-seq data 64 25 413047. URL https://www.biorxiv.org/ for imputation. bioRxiv, page 665323, 65 26 content/10.1101/413047v2. June 2019. doi: 10.1101/665323. URL 66 https://www.biorxiv.org/content/10. 67 27 Nicholas McGranahan and Charles Swanton. 1101/665323v1. 68 28 Clonal heterogeneity and tumor evolution: 29 Past, present, and the future. Cell, 168(4): Franziska Michor, Yoh Iwasa, and Martin A 69 30 613–628, February 2017. Nowak. Dynamics of cancer progression. 70 Nat. Rev. Cancer, 4(3):197–205, March 71 31 Chiara Medaglia, Amir Giladi, Liat Stoler- 2004. 72 32 Barak, Marco De Giovanni, Tomer Meir 33 Salame, Adi Biram, Eyal David, Han- Jeffrey R Moffitt, Junjie Hao, Dhanan- 73 34 jie Li, Matteo Iannacone, Ziv Shul- jay Bambah-Mukku, Tian Lu, Cather- 74 35 man, and Ido Amit. Spatial recon- ine Dulac, and . High- 75 36 struction of immune niches by combin- performance multiplexed fluorescence in 76 37 ing photoactivatable reporters and scRNA- situ hybridization in culture and tissue with 77 38 seq. Science, 358(6370):1622–1626, De- matrix imprinting and clearing. Proc. Natl. 78 39 cember 2017. ISSN 0036-8075, 1095- Acad. Sci. U. S. A., 113(50):14456–14461, 79 40 9203. doi: 10.1126/science.aao4277. December 2016. 80

62 1 Jeffrey R. Moffitt, Dhananjay Bambah- Richard A Neher, Colin A Russell, and Boris I 41 2 Mukku, Stephen W. Eichhorn, Eric Shraiman. Predicting evolution from the 42 3 Vaughn, Karthik Shekhar, Julio D. shape of genealogical trees. Elife, 3, Novem- 43 4 Perez, Nimrod D. Rubinstein, Junjie ber 2014. 44 5 Hao, Aviv Regev, , 6 and Xiaowei Zhuang. Molecular, spa- Malgorzata Nowicka, Carsten Krieg, Lukas M 45 7 tial, and functional single-cell profiling Weber, Felix J Hartmann, Silvia Guglietta, 46 8 of the hypothalamic preoptic region. Burkhard Becher, Mitchell P Levesque, 47 9 Science, 362(6416):eaau5324, Novem- and Mark D Robinson. CyTOF work- 48 10 ber 2018. ISSN 0036-8075, 1095-9203. flow: differential discovery in high- 49 11 doi: 10.1126/science.aau5324. URL throughput high-dimensional cytometry 50 12 http://science.sciencemag.org/ datasets. F1000Res., 6:748, May 2017. 51 13 content/362/6416/eaau5324. Huw A Ogilvie, Remco R Bouckaert, and 52 Alexei J Drummond. StarBEAST2 brings 53 14 Kevin R Moon, Jay S Stanley, Daniel faster species tree inference and accurate 54 15 Burkhardt, David van Dijk, Guy Wolf, and estimates of substitution rates. Mol. Biol. 55 16 Smita Krishnaswamy. Manifold learning- Evol., 34(8):2101–2114, August 2017. 56 17 based methods for analyzing single-cell 18 RNA-sequencing data. Current Opinion in J Guillermo Paez, Ming Lin, Rameen 57 19 Systems Biology, 7:36–46, 2018. Beroukhim, Jeffrey C Lee, Xiaojun Zhao, 58 Daniel J Richter, Stacey Gabriel, Paula 59 20 Marmar Moussa and Ion I. Măndoiu. Herman, Hidefumi Sasaki, David Alt- 60 21 Locality Sensitive Imputation for Sin- shuler, Cheng Li, Matthew Meyerson, and 61 22 gle Cell RNA-Seq Data. Journal of William R Sellers. Genome coverage and 62 23 Computational Biology, February 2019. sequence fidelity of phi29 polymerase-based 63 24 doi: 10.1089/cmb.2018.0236. URL multiple strand displacement whole genome 64 25 https://www.liebertpub.com/doi/10. amplification. Nucleic Acids Res., 32(9): 65 26 1089/cmb.2018.0236. e71, May 2004. 66

27 Nature Methods. Method of the Year 2013. Jong-Eun Park, Krzysztof Polanski, Kerstin 67 28 Nature Methods , 11(1):1–1, January 2014. Meyer, and Sarah A. Teichmann. Fast 68 29 ISSN 1548-7105. doi: 10.1038/nmeth. Batch Alignment of Single Cell Transcrip- 69 30 2801. URL https://www.nature.com/ tomes Unifies Multiple Mouse Cell Atlases 70 31 articles/nmeth.2801. into an Integrated Landscape. bioRxiv, 71 page 397042, August 2018. doi: 10.1101/ 72 32 Nicholas Navin, Jude Kendall, Jennifer Troge, 397042. URL https://www.biorxiv.org/ 73 33 Peter Andrews, Linda Rodgers, Jeanne content/10.1101/397042v2. 74 34 McIndoo, Kerry Cook, Asya Stepan- 35 sky, Dan Levy, Diane Esposito, Lakshmi Anoop P Patel, Itay Tirosh, John J Trom- 75 36 Muthuswamy, Alex Krasnitz, W Richard betta, Alex K Shalek, Shawn M Gille- 76 37 McCombie, James Hicks, and Michael spie, Hiroaki Wakimoto, Daniel P Cahill, 77 38 Wigler. Tumour evolution inferred by Brian V Nahed, William T Curry, Robert L 78 39 single-cell sequencing. Nature, 472(7341): Martuza, David N Louis, Orit Rozenblatt- 79 40 90–94, April 2011. Rosen, Mario L Suvà, Aviv Regev, and 80

63 1 Bradley E Bernstein. Single-cell RNA- zero-inflated single-cell gene expres- 42 2 seq highlights intratumoral heterogeneity sion analysis. Genome Biology, 16 43 3 in primary glioblastoma. Science, 344 (1):241, November 2015. ISSN 1474- 44 4 (6190):1396–1401, June 2014. 760X. doi: 10.1186/s13059-015-0805-z. 45 URL https://doi.org/10.1186/ 46 5 Tao Peng, Qin Zhu, Penghang Yin, and s13059-015-0805-z. 47 6 Kai Tan. SCRABBLE: single-cell RNA- 7 seq imputation constrained by bulk RNA- Mireya Plass, Jordi Solana, F. Alexan- 48 8 seq data. Genome biology, 20(1):88, May der Wolf, Salah Ayoub, Aristotelis Mi- 49 9 2019. ISSN 1465-6906. doi: 10.1186/ sios, Petar Glažar, Benedikt Obermayer, 50 10 s13059-019-1681-8. URL http://dx.doi. Fabian J. Theis, Christine Kocks, and Niko- 51 11 org/10.1186/s13059-019-1681-8. laus Rajewsky. Cell type atlas and lineage 52

12 N. Pezzotti, T. Höllt, B. Lelieveldt, E. Eise- tree of a whole complex animal by single- 53 Science 54 13 mann, and A. Vilanova. Hierarchical cell transcriptomics. , 360(6391): 14 Stochastic Neighbor Embedding. Com- eaaq1723, May 2018. ISSN 0036-8075, 55 15 puter Graphics Forum, 35(3):21–30, 2016. 1095-9203. doi: 10.1126/science.aaq1723. 56 16 ISSN 1467-8659. doi: 10.1111/cgf. URL http://science.sciencemag.org/ 57 17 12878. URL https://onlinelibrary. content/360/6391/eaaq1723. 58 18 wiley.com/doi/abs/10.1111/cgf.12878. Hannah A. Pliner, Jonathan S. Packer, 59 19 Ángel J Picher, Bettina Budeus, Oliver José L. McFaline-Figueroa, Darren A. 60 20 Wafzig, Carola Krüger, Sara García- Cusanovich, Riza M. Daza, Delasa 61 21 Gómez, María I Martínez-Jiménez, Al- Aghamirzaie, Sanjay Srivatsan, Xiaojie 62 22 berto Díaz-Talavera, Daniela Weber, Luis Qiu, Dana Jackson, Anna Minkina, An- 63 23 Blanco, and Armin Schneider. TruePrime drew C. Adey, Frank J. Steemers, Jay 64 24 is a novel method for whole-genome am- Shendure, and Cole Trapnell. Cicero 65 25 plification from single cells based on Tth- Predicts cis-Regulatory DNA Interactions 66 26 Nat. Commun. PrimPol. , 7:13296, Novem- from Single-Cell Chromatin Accessi- 67 27 ber 2016a. bility Data. Molecular Cell, 71(5): 68 858–871.e8, 2018. ISSN 1097-4164. doi: 69 28 Ángel J. Picher, Bettina Budeus, Oliver 10.1016/j.molcel.2018.06.044. 70 29 Wafzig, Carola Krüger, Sara García- 30 Gómez, María I. Martínez-Jiménez, Al- Olivier Poirion, Xun Zhu, Travers Ching, and 71 31 berto Díaz-Talavera, Daniela Weber, Luis Lana X. Garmire. Using single nucleotide 72 32 Blanco, and Armin Schneider. TruePrime variations in single-cell RNA-seq to identify 73 33 is a novel method for whole-genome subpopulations and genotype-phenotype 74 34 amplification from single cells based on linkage. Nature Communications, 9(1): 75 35 TthPrimPol. Nature Communications, 4892, November 2018. ISSN 2041-1723. 76 36 7:13296, November 2016b. ISSN 2041- doi: 10.1038/s41467-018-07170-5. URL 77 37 1723. doi: 10.1038/ncomms13296. URL https://www.nature.com/articles/ 78 38 https://www.nature.com/articles/ s41467-018-07170-5. 79 39 ncomms13296.

40 Emma Pierson and Christopher Yau. David D Pollock, Derrick J Zwickl, Jimmy A 80 41 ZIFA: Dimensionality reduction for McGuire, and David M Hillis. Increased 81

64 1 taxon sampling is advantageous for phylo- Sanes, Rahul Satija, Ton N. Schumacher, 41 2 genetic inference. Syst. Biol., 51(4):664– Alex Shalek, Ehud Shapiro, Padmanee 42 3 671, August 2002. Sharma, Jay W. Shin, Oliver Stegle, 43 Michael Stratton, Michael J. T. Stubbing- 44 4 Vladimir Potapov and Jennifer L Ong. Ex- ton, Alexander van Oudenaarden, Allon 45 5 amining sources of error in PCR by Single- Wagner, Fiona Watt, Jonathan Weissman, 46 6 PLoS One Molecule sequencing. , 12(1): Barbara Wold, Ramnik Xavier, Nir Yosef, 47 7 e0169774, January 2017. and the Human Cell Atlas Meeting Partic- 48 ipants. The Human Cell Atlas. bioRxiv, 49 8 Xiaojie Qiu, Qi Mao, Ying Tang, Li Wang, page 121202, May 2017. doi: 10.1101/ 50 9 Raghav Chawla, Hannah A. Pliner, and 121202. URL https://www.biorxiv.org/ 51 10 Cole Trapnell. Reversed graph embed- content/10.1101/121202v1. 52 11 ding resolves complex single-cell trajecto- 12 Nature Methods ries. , 14(10):979–982, Oc- John E. Reid and Lorenz Wernisch. Pseu- 53 13 tober 2017. ISSN 1548-7105. doi: 10.1038/ dotime estimation: deconfounding single 54 14 nmeth.4402. URL https://www.nature. cell time series. Bioinformatics, 32(19): 55 15 com/articles/nmeth.4402. 2973–2980, October 2016. ISSN 1367- 56 4803. doi: 10.1093/bioinformatics/btw372. 57 16 Bruce Rannala and Ziheng Yang. Efficient URL https://academic.oup.com/ 58 17 bayesian species tree inference under the bioinformatics/article/32/19/2973/ 59 18 multispecies coalescent. Syst. Biol., 66(5): 2196633. 60 19 823–842, September 2017.

Stephen Reid, Jonathan Taylor, and Robert 61 20 Benjamin Redelings. Erasing errors due to Tibshirani. A general framework for esti- 62 21 alignment ambiguity when estimating posi- mation and inference from clusters of fea- 63 22 tive selection. Mol. Biol. Evol., 31(8):1979– tures. J. Am. Stat. Assoc., 113(521):280– 64 23 1993, August 2014. 293, January 2018. 65

24 Aviv Regev, Sarah A. Teichmann, Eric S. Davide Risso, Fanny Perraudeau, Svet- 66 25 Lander, Ido Amit, Christophe Benoist, lana Gribkova, Sandrine Dudoit, and 67 26 , Bernd Bodenmiller, Peter Jean-Philippe Vert. A general and 68 27 Campbell, Piero Carninci, Menna Clat- flexible method for signal extraction 69 28 worthy, , Bart Deplancke, from single-cell RNA-seq data. Na- 70 29 Ian Dunham, James Eberwine, Roland ture communications, 9(1):284, January 71 30 Eils, Wolfgang Enard, Andrew Farmer, 2018. ISSN 2041-1723. doi: 10.1038/ 72 31 Lars Fugger, Berthold Göttgens, Nir Ha- s41467-017-02554-5. URL https://doi. 73 32 cohen, Muzlifah Haniffa, Martin Hem- org/10.1038/s41467-017-02554-5. 74 33 berg, Seung Kim, Paul Klenerman, Arnold

34 Kriegstein, Ed Lein, Sten Linnarsson, Elena Rivas and Sean R Eddy. Probabilis- 75 35 Joakim Lundeberg, Partha Majumder, tic phylogenetic inference with insertions 76 36 John C. Marioni, Miriam Merad, Musa and deletions. PLoS Comput. Biol., 4(9): 77 37 Mhlanga, Martijn Nawijn, Mihai Netea, e1000172, September 2008. 78 38 Garry Nolan, Dana Pe’er, Anthony Philli- 39 pakis, Chris P. Ponting, Steve Quake, Abbas H. Rizvi, Pablo G. Camara, Elena K. 79 40 Wolf Reik, Orit Rozenblatt-Rosen, Joshua Kandror, Thomas J. Roberts, Ira Schieren, 80

65 1 Tom Maniatis, and Raul Rabadan. Single- An R package for ‘omics feature se- 42 2 cell topological RNA-seq analysis reveals lection and multiple data integration. 43 3 insights into cellular differentiation and de- PLOS Computational Biology, 13(11): 44 4 velopment. Nature Biotechnology, 35(6): e1005752, November 2017b. ISSN 1553- 45 5 551–560, 2017. ISSN 1546-1696. doi: 10. 7358. doi: 10.1371/journal.pcbi.1005752. 46 6 1038/nbt.3854. URL https://journals.plos.org/ 47 ploscompbiol/article?id=10.1371/ 48 7 Simone Rizzetto, Auda A Eltahla, Peijie Lin, journal.pcbi.1005752. 49 8 Rowena Bull, Andrew R Lloyd, Joshua 9 W K Ho, Vanessa Venturi, and Fabio Lu- Alexander B. Rosenberg, Charles M. Roco, 50 10 ciani. Impact of sequencing depth and read Richard A. Muscat, Anna Kuchina, Paul 51 11 length on single cell RNA sequencing data Sample, Zizhen Yao, Lucas T. Graybuck, 52 12 of T cells. Sci. Rep., 7(1):12781, October David J. Peeler, Sumit Mukherjee, Wei 53 13 2017. Chen, Suzie H. Pun, Drew L. Sellers, 54

14 S Roch. A short proof that phylogenetic tree Bosiljka Tasic, and Georg Seelig. Single- 55 15 reconstruction by maximum likelihood is cell profiling of the developing mouse brain 56 16 hard. IEEE/ACM Trans. Comput. Biol. and spinal cord with split-pool barcoding. 57 Science 58 17 Bioinform., 3(1):92–94, 2006. , 360(6385):176–182, April 2018. ISSN 0036-8075, 1095-9203. doi: 10.1126/ 59 18 Samuel G. Rodriques, Robert R. Stick- science.aam8999. URL http://science. 60 19 els, Aleksandrina Goeva, Carly A. Mar- sciencemag.org/content/360/6385/176. 61 20 tin, Evan Murray, Charles R. Vander-

21 burg, Joshua Welch, Linlin M. Chen, Fei Edith M Ross and Florian Markowetz. On- 62 22 Chen, and Evan Z. Macosko. Slide- coNEM: inferring tumor evolution from 63 23 seq: A scalable technology for measur- single-cell sequencing data. Genome Biol., 64 24 ing genome-wide expression at high spa- 17:69, April 2016. 65 25 tial resolution. Science, 363(6434):1463–

26 1467, March 2019. ISSN 0036-8075, Andrew Roth, Andrew McPherson, Emma 66 27 1095-9203. doi: 10.1126/science.aaw1219. Laks, Justina Biele, Damian Yap, Adrian 67 28 URL https://science.sciencemag.org/ Wan, Maia A Smith, Cydney B Nielsen, 68 29 content/363/6434/1463. Jessica N McAlpine, Samuel Aparicio, 69 Alexandre Bouchard-Côté, and Sohrab P 70 30 Florian Rohart, Aida Eslami, Nicholas Mati- Shah. Clonal genotype and population 71 31 gian, Stéphanie Bougeard, and Kim- structure inference from single-cell tumor 72 32 Anh Lê Cao. MINT: a multivari- sequencing. Nat. Methods, 13(7):573–576, 73 33 ate integrative method to identify re- July 2016. 74 34 producible molecular signatures across 35 independent experiments and platforms. Łukasz Rączkowski, Marcin Możejko, Joanna 75 36 BMC Bioinformatics, 18(1):128, February Zambonelli, and Ewa Szczurek. ARA: 76 37 2017a. ISSN 1471-2105. doi: 10.1186/ accurate, reliable and active histopatho- 77 38 s12859-017-1553-8. URL https://doi. logical image classification framework with 78 39 org/10.1186/s12859-017-1553-8. Bayesian deep learning. Scientific Reports, 79 40 Florian Rohart, Benoît Gautier, Amrit 9(1):1–12, October 2019. ISSN 2045-2322. 80 41 Singh, and Kim-Anh Lê Cao. mixOmics: doi: 10.1038/s41598-019-50587-1. URL 81

66 1 https://www.nature.com/articles/ Yao Liang Wong, Nicholas Rhind, Arshad 41 2 s41598-019-50587-1. Desai, and . Chromosome 42 Mis-segregation Generates Cell-Cycle- 43 3 Adela Saco, Jose Ramírez, Natalia Rakislova, Arrested Cells with Complex Karyotypes 44 4 Aurea Mira, and Jaume Ordi. Validation of that Are Eliminated by the Immune 45 5 Whole-Slide imaging for histolopathogical System. Developmental Cell, 41(6): 46 6 Pathobiology diagnosis: Current state. , 83 638–651.e5, June 2017. ISSN 15345807. 47 7 (2-3):89–98, April 2016. doi: 10.1016/j.devcel.2017.05.022. URL 48 https://linkinghub.elsevier.com/ 49 8 Wouter Saelens, Robrecht Cannoodt, retrieve/pii/S1534580717304306. 50 9 Helena Todorov, and Yvan Saeys. A 10 comparison of single-cell trajectory in- Gryte Satas and Benjamin J Raphael. Haplo- 51 11 ference methods. Nature Biotechnology, type phasing in single-cell DNA-sequencing 52 12 page 1, April 2019. ISSN 1546-1696. data. Bioinformatics, 34(13):i211–i217, 53 13 doi: 10.1038/s41587-019-0071-9. URL July 2018. 54 14 https://www.nature.com/articles/ 15 s41587-019-0071-9. Rahul Satija, Jeffrey A Farrell, David Gen- 55 nert, Alexander F Schier, and Aviv Regev. 56 16 Yvan Saeys, Sofie Van Gassen, and Bart N. Spatial reconstruction of single-cell gene ex- 57 17 Lambrecht. Computational flow cytom- pression data. Nat. Biotechnol., 33(5):495– 58 18 etry: helping to make sense of high- 502, May 2015. 59 19 dimensional immunology data. Nature 20 Reviews Immunology, 16(7):449–462, July Kenta Sato, Koki Tsuyuzaki, Kentaro 60 21 2016. ISSN 1474-1741. doi: 10.1038/nri. Shimizu, and Itoshi Nikaido. CellFish- 61 22 2016.56. URL https://www.nature.com/ ing.jl: an ultrafast and scalable cell 62 23 articles/nri.2016.56. search method for single-cell RNA sequenc- 63 ing. Genome Biology, 20(1):31, February 64 24 Sinem K. Saka, Yu Wang, Jocelyn Y. Kishi, 2019. ISSN 1474-760X. doi: 10.1186/ 65 25 Allen Zhu, Yitian Zeng, Wenxin Xie, s13059-019-1639-x. URL https://doi. 66 26 Koray Kirli, Clarence Yapp, Marcelo org/10.1186/s13059-019-1639-x. 67 27 Cicconet, Brian J. Beliveau, Sylvain W. 28 Lapan, Siyuan Yin, Millicent Lin, Ed- Arpiar Saunders, Evan Z. Macosko, Alec 68 29 ward S. Boyden, Pascal S. Kaeser, German Wysoker, Melissa Goldman, Fenna M. 69 30 Pihan, George M. Church, and Peng Yin. Krienen, Heather de Rivera, Elizabeth 70 31 Immuno-SABER enables highly multi- Bien, Matthew Baum, Laura Bortolin, 71 32 plexed and amplified protein imaging in Shuyu Wang, Aleksandrina Goeva, James 72 33 tissues. Nature Biotechnology, 37(9):1080– Nemesh, Nolan Kamitaki, Sara Brum- 73 34 1090, September 2019. ISSN 1546-1696. baugh, David Kulp, and Steven A. 74 35 doi: 10.1038/s41587-019-0207-y. URL McCarroll. Molecular Diversity and Spe- 75 36 https://www.nature.com/articles/ cializations among the Cells of the Adult 76 Cell 77 37 s41587-019-0207-y. Mouse Brain. , 174(4):1015–1030.e16, August 2018. ISSN 0092-8674. doi: 78 38 Stefano Santaguida, Amelia Richardson, 10.1016/j.cell.2018.07.028. URL http: 79 39 Divya Ramalingam Iyer, Ons M’Saad, //www.sciencedirect.com/science/ 80 40 Lauren Zasadil, Kristin A. Knouse, article/pii/S0092867418309553. 81

67 1 Denis Schapiro, Hartland W Jackson, Swetha man, and Florian Markowetz. Phyloge- 43 2 Raghuraman, Jana R Fischer, Vito R T netic Quantification of Intra-tumour Het- 44 3 Zanotelli, Daniel Schulz, Charlotte Giesen, erogeneity. PLoS Computational Biol- 45 4 Raúl Catena, Zsuzsanna Varga, and Bernd ogy, 10(4):e1003535, April 2014. ISSN 46 5 Bodenmiller. histoCAT: analysis of cell 1553-7358. doi: 10.1371/journal.pcbi. 47 6 phenotypes and interactions in multiplex 1003535. URL https://dx.plos.org/10. 48 7 image cytometry data. Nat. Methods, 14 1371/journal.pcbi.1003535. 49 8 (9):873–876, September 2017. Roberto Semeraro, Valerio Orlandini, and Al- 50 9 Geoffrey Schiebinger, Jian Shu, Marcin berto Magi. Xome-Blender: A novel can- 51 10 Tabaka, Brian Cleary, Vidya Subrama- cer genome simulator. PLoS One, 13(4): 52 11 nian, Aryeh Solomon, Siyan Liu, Sta- e0194472, April 2018. 53 12 cie Lin, Peter Berube, Lia Lee, Jenny 13 Chen, Justin Brumbaugh, Philippe Rigol- Debarka Sengupta, Nirmala Arul Rayan, 54 14 let, Konrad Hochedlinger, Rudolf Jaenisch, Michelle Lim, Bing Lim, and Shyam 55 15 Aviv Regev, and Eric S. Lander. Re- Prabhakar. Fast, scalable and accu- 56 16 construction of developmental landscapes rate differential expression analysis for 57 17 by optimal-transport analysis of single- single cells. bioRxiv, page 049734, 58 18 cell gene expression sheds light on cel- April 2016. doi: 10.1101/049734. URL 59 19 bioRxiv lular reprogramming. , page https://www.biorxiv.org/content/10. 60 20 191056, September 2017. doi: 10.1101/ 1101/049734v1. 61 21 191056. URL https://www.biorxiv.org/ 22 content/10.1101/191056v1. Manu Setty, Michelle D. Tadmor, Shlomit 62 Reich-Zeliger, Omer Angel, Tomer Meir 63 23 Herbert B Schiller, Daniel T Montoro, Salame, Pooja Kathail, Kristy Choi, Sean 64 24 Lukas M Simon, Emma L Rawlins, Bendall, , and Dana Pe’er. 65 25 Kerstin B Meyer, Maximilian Strunz, Wishbone identifies bifurcating develop- 66 26 Felipe Vieira Braga, Wim Timens, Ger- mental trajectories from single-cell data. 67 27 ard H Koppelman, G.R. Scott Budinger, Nature Biotechnology, 34(6):637–645, June 68 28 Janette K Burgess, Avinash Waghray, 2016. ISSN 1546-1696. doi: 10.1038/nbt. 69 29 Maarten van den Berge, Fabian J Theis, 3569. URL https://www.nature.com/ 70 30 Aviv Regev, Naftali Kaminski, Jayaraj articles/nbt.3569. 71 31 Rajagopal, Sarah A Teichmann, Alexan- 32 der V Misharin, and Martijn C Nawijn. D T Severson, R P Owen, M J White, X Lu, 72 33 The Human Lung Cell Atlas - A high- and B Schuster-Böckler. BEARscc deter- 73 34 resolution reference map of the human mines robustness of single-cell clusters us- 74 35 lung in health and disease. American Nat. 75 36 Journal of Respiratory Cell and Molecular ing simulated technical replicates. Commun., 9(1):1187, March 2018. 76 37 Biology, April 2019. ISSN 1044-1549. 38 doi: 10.1165/rcmb.2018-0416TR. URL 77 39 https://www.atsjournals.org/doi/ Sheel Shah, Eric Lubeck, Wen Zhou, and Long Cai. In situ transcription profiling of 78 40 abs/10.1165/rcmb.2018-0416TR. single cells reveals spatial organization of 79 41 Roland F. Schwarz, Anne Trinh, Botond cells in the mouse hippocampus. Neuron, 80 42 Sipos, James D. Brenton, Nick Gold- 92(2):342–357, October 2016a. 81

68 1 Sheel Shah, Eric Lubeck, Wen Zhou, and Pavel Skums, Viachaslau Tsyvina, and Alex 41 2 Long Cai. In Situ Transcription Profiling of Zelikovsky. Inference of clonal selec- 42 3 Single Cells Reveals Spatial Organization of tion in cancer populations using single- 43 4 Cells in the Mouse Hippocampus. Neuron, cell sequencing data. bioRxiv, page 44 5 92(2):342–357, October 2016b. ISSN 0896- 465211, January 2019. doi: 10.1101/ 45 6 6273. doi: 10.1016/j.neuron.2016.10.001. 465211. URL https://www.biorxiv.org/ 46 7 URL https://www.cell.com/neuron/ content/10.1101/465211v2. 47 8 abstract/S0896-6273(16)30702-4. Martin D Smith, Joel O Wertheim, Steven 48 9 Arun Shivanandan, Jayakrishnan Unnikrish- Weaver, Ben Murrell, Konrad Scheffler, and 49 10 nan, and Aleksandra Radenovic. On char- Sergei L Kosakovsky Pond. Less is more: an 50 11 acterizing protein spatial clusters with cor- adaptive branch-site random effects model 51 12 Sci. Rep. relation approaches. , 6:31164, for efficient detection of episodic diversify- 52 13 August 2016. ing selection. Mol. Biol. Evol., 32(5):1342– 53 1353, May 2015. 54 14 Angus M Sidore, Freeman Lan, Shaun W 15 Lim, and Adam R Abate. Enhanced se- Charlotte Soneson and Mark D Robinson. To- 55 16 quencing coverage with digital droplet mul- wards unified quality verification of syn- 56 17 tiple displacement amplification. Nucleic thetic count data with countsimQC. Bioin- 57 18 Acids Res., 44(7):e66, April 2016. formatics, 34(4):691–692, 2017. 58

19 Jochen Singer, Jack Kuipers, Katharina Charlotte Soneson and Mark D Robinson. 59 20 Jahn, and Niko Beerenwinkel. Single-cell Bias, robustness and scalability in single- 60 21 mutation identification via phylogenetic cell differential expression analysis. Nat. 61 22 inference. Nature Communications, 9(1): Methods, February 2018. 62 23 5144, December 2018. ISSN 2041-1723. 24 doi: 10.1038/s41467-018-07627-7. URL Bastiaan Spanjaard, Bo Hu, Nina Mitic, 63 25 https://www.nature.com/articles/ Pedro Olivares-Chauvet, Sharan Janjuha, 64 26 s41467-018-07627-7. Nikolay Ninov, and Jan Philipp Junker. 65

27 Amrit Singh, Benoit Gautier, Casey P. Simultaneous lineage tracing and cell-type 66 28 Shannon, Florian Rohart, Michael Vacher, identification using CRISPR-Cas9-induced 67 Nat. Biotechnol. 68 29 Scott J. Tebutt, and Kim-Anh Le Cao. genetic scars. , 36(5):469– 30 DIABLO: from multi-omics assays to 473, June 2018. 69 31 biomarker discovery, an integrative ap- 32 proach. bioRxiv, page 067611, March C Spits, C Le Caignec, M De Rycke, 70 33 2018. doi: 10.1101/067611. URL L Van Haute, A Van Steirteghem, 71 34 https://www.biorxiv.org/content/10. I Liebaers, and K Sermon. Optimization 72 35 1101/067611v2. and evaluation of single-cell whole-genome 73 multiple displacement amplification. Hum. 74 36 Debajyoti Sinha, Akhilesh Kumar, Himan- Mutat., 27(5):496–503, 2006a. 75 37 shu Kumar, Sanghamitra Bandyopadhyay, 38 and Debarka Sengupta. dropclust: efficient Claudia Spits, Cédric Le Caignec, Martine 76 39 clustering of ultra-large scRNA-seq data. De Rycke, Lindsey Van Haute, André 77 40 Nucleic Acids Res., 46(6):e36, April 2018. Van Steirteghem, Inge Liebaers, and Karen 78

69 1 Sermon. Whole-genome multiple displace- Chika Yokota, and Mats Nilsson. Placing 41 2 ment amplification from single cells. Nat. RNA in context and space - methods for 42 3 Protoc., 1(4):1965–1970, November 2006b. spatially resolved transcriptomics. FEBS 43 J., March 2018. 44 4 S Srinivasan, N T Johnson, and D Ko- 5 rkin. A Hybrid Deep Clustering Ap- Tim Stuart, Andrew Butler, Paul Hoffman, 45 6 proach for Robust Cell Type Profiling Us- Christoph Hafemeister, Efthymia Papalexi, 46 7 ing Single-cell RNA-seq Data. bioRxiv, William M. Mauck, Marlon Stoeckius, Pe- 47 8 2019. URL https://www.biorxiv.org/ ter Smibert, and Rahul Satija. Comprehen- 48 9 content/10.1101/511626v1.abstract. sive integration of single cell data. bioRxiv, 49 page 460147, November 2018. doi: 10.1101/ 50 10 Divyanshu Srivastava, Arvind Iyer, Vibhor 460147. URL https://www.biorxiv.org/ 51 11 Kumar, and Debarka Sengupta. Cel- content/10.1101/460147v1. 52 12 lAtlasSearch: a scalable search engine 13 Nucleic Acids Re- for single cells. Michael J T Stubbington, Orit Rozenblatt- 53 14 search , 46(W1):W141–W147, July 2018. Rosen, Aviv Regev, and Sarah A Teich- 54 15 ISSN 0305-1048. doi: 10.1093/nar/ mann. Single-cell transcriptomics to ex- 55 16 https://academic.oup. gky421. URL plore the immune system in health and dis- 56 17 com/nar/article/46/W1/W141/5000022 . ease. Science, 358(6359):58–63, October 57 2017. 58 18 Oliver Stegle, Sarah A Teichmann, and 19 John C Marioni. Computational and an- Patrik L. Ståhl, Fredrik Salmén, Sanja Vick- 59 20 alytical challenges in single-cell transcrip- ovic, Anna Lundmark, José Fernández 60 21 tomics. Nat. Rev. Genet., 16(3):133, Jan- Navarro, Jens Magnusson, Stefania Gia- 61 22 uary 2015. comello, Michaela Asp, Jakub O. West- 62 63 23 Genevieve L Stein-O’Brien, Brian S Clark, holm, Mikael Huss, Annelie Mollbrink, 64 24 Thomas Sherman, Cristina Zibetti, Qi- Sten Linnarsson, Simone Codeluppi, Åke 65 25 wen Hu, Rachel Sealfon, Sheng Liu, Jiang Borg, Fredrik Pontén, Paul Igor Costea, 66 26 Qian, Carlo Colantuoni, Seth Blackshaw, Pelin Sahlén, Jan Mulder, Olaf Bergmann, 67 27 Loyal A Goff, and Elana J Fertig. De- Joakim Lundeberg, and Jonas Frisén. Vi- 68 28 composing Cell Identity for Transfer Learn- sualization and analysis of gene expres- 69 29 ing across Cellular Measurements, Plat- sion in tissue sections by spatial transcrip- Science (New York, N.Y.) 70 30 forms, Tissues, and Species. Cell sys- tomics. , 353 71 31 tems, 8(5):395–411.e8, May 2019. ISSN (6294):78–82, July 2016. ISSN 1095-9203. 72 32 2405-4720, 2405-4712. doi: 10.1016/j.cels. doi: 10.1126/science.aaf2403. 33 2019.04.004. URL http://dx.doi.org/ S Sun, J Zhu, Y Ma, and X Zhou. Ac- 73 34 10.1016/j.cels.2019.04.004. curacy, Robustness and Scalability of Di- 74 35 Lars Steinbrück and Alice Carolyn McHardy. mensionality Reduction Methods for Sin- 75 36 Allele dynamics plots for the study of evolu- gle Cell RNAseq Analysis. bioRxiv, 76 37 tionary dynamics in viral populations. Nu- 2019. URL https://www.biorxiv.org/ 77 38 cleic Acids Res., 39(1):e4, January 2011. content/10.1101/641142v1.abstract. 78

39 Carina Strell, Markus M Hilscher, Navya Valentine Svensson, Sarah A Teichmann, and 79 40 Laxman, Jessica Svedlund, Chenglin Wu, Oliver Stegle. SpatialDE: identification of 80

70 1 spatially variable genes. Nat. Methods, 15 Bayesian gene expression recovery, im- 39 2 (5):343–346, May 2018a. putation and normalisation for single 40 cell RNA-sequencing data. bioRxiv, 41 3 Valentine Svensson, Roser Vento-Tormo, and 2018. URL https://www.biorxiv.org/ 42 4 Sarah A. Teichmann. Exponential scal- content/10.1101/384586v2.abstract. 43 5 ing of single-cell RNA-seq in the past 6 decade. Nature Protocols, 13(4):599–604, H Telenius, N P Carter, C E Bebb, M Norden- 44 7 April 2018b. ISSN 1750-2799. doi: 10. skjöld, B A Ponder, and A Tunnacliffe. De- 45 8 1038/nprot.2017.149. URL https://www. generate oligonucleotide-primed PCR: gen- 46 9 nature.com/articles/nprot.2017.149. eral amplification of target DNA by a single 47 degenerate primer. Genomics, 13(3):718– 48 10 Charles Swanton. Intratumor heterogeneity: 725, July 1992. 49 11 evolution through space and time. Cancer 12 Res. , 72(19):4875–4882, October 2012. Luyi Tian, Xueyi Dong, Saskia Freytag, 50 Kim-Anh Lê Cao, Shian Su, Abolfazl 51 13 Ewa Szczurek, Navodit Misra, and Martin JalalAbadi, Daniela Amann-Zalcenstein, 52 14 Vingron. Synthetic sickness or lethality Tom S. Weber, Azadeh Seidi, Jafar S. 53 15 points at candidate combination therapy Jabbari, Shalin H. Naik, and Matthew E. 54 16 targets in glioblastoma. Int. J. Cancer, 133 Ritchie. Benchmarking single cell RNA- 55 17 (9):2123–2132, November 2013. sequencing analysis pipelines using mixture 56 control experiments. Nature Methods, 16 57 18 The Tabula Muris Consortium. Single-cell (6):479, June 2019. ISSN 1548-7105. 58 19 transcriptomics of 20 mouse organs cre- doi: 10.1038/s41592-019-0425-8. URL 59 20 ates a Tabula Muris. Nature, 562(7727): https://www.nature.com/articles/ 60 21 367, October 2018. ISSN 1476-4687. s41592-019-0425-8. 61 22 doi: 10.1038/s41586-018-0590-4. URL 23 https://www.nature.com/articles/ F. William Townes, Stephanie C. Hicks, 62 24 s41586-018-0590-4. Martin J. Aryee, and Rafael A. Irizarry. 63 64 25 Divyanshu Talwar, Aanchal Mongia, De- Feature Selection and Dimension Reduc- 65 26 barka Sengupta, and Angshul Majum- tion for Single Cell RNA-Seq based on bioRxiv 66 27 dar. AutoImpute: Autoencoder based a Multinomial Model. , page 67 28 imputation of single-cell RNA-seq data. 574574, March 2019. doi: 10.1101/ https://www.biorxiv.org/ 68 29 Scientific reports, 8(1):16329, November 574574. URL content/10.1101/574574v1 69 30 2018. ISSN 2045-2322. doi: 10.1038/ . 31 s41598-018-34688-x. URL http://dx. 70 32 doi.org/10.1038/s41598-018-34688-x. Cole Trapnell, Davide Cacchiarelli, Jonna Grimsby, Prapti Pokharel, Shuqiang Li, 71 33 Amos Tanay and Aviv Regev. Scaling Michael Morse, Niall J. Lennon, Kenneth J. 72 34 single-cell genomics from phenomenology Livak, Tarjei S. Mikkelsen, and John L. 73 35 to mechanism. Nature, 541(7637):331–338, Rinn. The dynamics and regulators of cell 74 36 January 2017. fate decisions are revealed by pseudotempo- 75 ral ordering of single cells. Nature Biotech- 76 37 W Tang, F Bertaux, P Thomas, C Ste- nology, 32(4):381–386, April 2014. ISSN 77 38 fanelli, M Saint, and others. bayNorm: 1546-1696. doi: 10.1038/nbt.2859. 78

71 1 Po-Yuan Tung, John D. Blischak, single-cell sequencing data. bioRxiv, 42 2 Chiaowen Joyce Hsiao, David A. Knowles, page 623397, May 2019. doi: 10.1101/ 43 3 Jonathan E. Burnett, Jonathan K. 623397. URL https://www.biorxiv.org/ 44 4 Pritchard, and Yoav Gilad. Batch effects content/10.1101/623397v1. 45 5 and the effective design of single-cell gene 6 expression studies. Scientific Reports, 7: Dimitrios V Vavoulis, Margherita 46 7 39921, January 2017. ISSN 2045-2322. Francescatto, Peter Heutink, and Ju- 47 8 doi: 10.1038/srep39921. URL https: lian Gough. DGEclust: differential 48 9 //www.nature.com/articles/srep39921. expression analysis of clustered count data. 49 Genome Biol., 16:39, February 2015. 50 10 Samra Turajlic and Charles Swanton. Metas- 11 tasis as an evolutionary process. Science, Archit Verma and Barbara E. Engelhardt. 51 12 352(6282):169–175, April 2016. A robust nonlinear low-dimensional man- 52 ifold for single cell RNA-seq data. bioRxiv, 53 13 Vincent van Unen, Thomas Höllt, Nicola page 443044, October 2018. doi: 10.1101/ 54 14 Pezzotti, Na Li, Marcel J. T. Rein- 443044. URL https://www.biorxiv.org/ 55 15 ders, Elmar Eisemann, Frits Koning, content/10.1101/443044v1. 56 16 Anna Vilanova, and Boudewijn P. F. 17 Lelieveldt. Visual analysis of mass cy- Beate Vieth, Christoph Ziegenhain, Swati 57 18 tometry data by hierarchical stochastic Parekh, Wolfgang Enard, and Ines Hell- 58 19 neighbour embedding reveals rare cell mann. powsimr: power analysis for 59 20 types. Nature Communications, 8(1): bulk and single cell RNA-seq experiments. 60 21 1740, November 2017. ISSN 2041-1723. Bioinformatics, 33(21):3486–3488, Novem- 61 22 doi: 10.1038/s41467-017-01689-9. URL ber 2017. 62 23 https://www.nature.com/articles/

24 s41467-017-01689-9. Beate Vieth, Swati Parekh, Christoph 63 Ziegenhain, Wolfgang Enard, and Ines 64 25 Catalina A Vallejos, John C Marioni, and Hellmann. A systematic evaluation of 65 26 Sylvia Richardson. BASiCS: Bayesian anal- single cell RNA-seq analysis pipelines. 66 27 ysis of Single-Cell sequencing data. PLoS Nature Communications, 10(1):1–11, 67 28 Comput. Biol., 11(6):e1004333, June 2015. October 2019. ISSN 2041-1723. doi: 68 29 Trieu My Van and Christian U. Blank. 10.1038/s41467-019-12266-7. URL 69 30 A user’s perspective on GeoMxTM https://www.nature.com/articles/ 70 31 digital spatial profiling. Immuno- s41467-019-12266-7. 71 32 Oncology Technology, 1:11–18, July 72 33 2019. ISSN 2590-0188, 2590-0188. Irma Virant-Klun, Stefan Leicht, Christopher 73 34 doi: 10.1016/j.iotech.2019.05.001. URL Hughes, and Jeroen Krijgsveld. Identifi- 74 35 https://www.esmoiotech.org/article/ cation of Maturation-Specific Proteins by 75 36 S2590-0188(19)30002-4/abstract. Single-Cell Proteomics of Human Oocytes. Molecular & cellular proteomics: MCP, 15 76 37 Koen van den Berge, Hector Roux de (8):2616–2627, 2016. ISSN 1535-9484. doi: 77 38 Bezieux, Kelly Street, Wouter Saelens, Ro- 10.1074/mcp.M115.056887. 78 39 brecht Cannoodt, Yvan Saeys, Sandrine 40 Dudoit, and Lieven Clement. Trajectory- Sarah A. Vitak, Kristof A. Torkenczy, Jimi L. 79 41 based differential expression analysis for Rosenkrantz, Andrew J. Fields, Lena 80

72 1 Christiansen, Melissa H. Wong, Lucia Car- 655365, June 2019. doi: 10.1101/ 42 2 bone, Frank J. Steemers, and Andrew 655365. URL https://www.biorxiv.org/ 43 3 Adey. Sequencing thousands of single-cell content/10.1101/655365v2. 44 4 genomes with combinatorial indexing. Na- 5 ture Methods, 14(3):302–308, March 2017. Dongfang Wang and Jin Gu. VASC: Dimen- 45 6 ISSN 1548-7105. doi: 10.1038/nmeth. sion reduction and visualization of single- 46 7 4154. URL https://www.nature.com/ cell RNA-seq data by deep variational au- 47 8 articles/nmeth.4154. toencoder. Genomics Proteomics Bioinfor- 48 matics, 16(5):320–331, October 2018. 49 9 Bartlomiej Waclaw, Ivana Bozic, Meredith E 10 Pittman, Ralph H Hruban, Bert Vogelstein, Jingshu Wang, Divyansh Agarwal, 50 11 and Martin A Nowak. A spatial model Mo Huang, Gang Hu, Zilu Zhou, 51 12 predicts that dispersal and cell turnover Chengzhong Ye, and Nancy R. Zhang. 52 13 limit intratumour heterogeneity. Nature, Data denoising with transfer learning 53 14 525(7568):261–264, September 2015. in single-cell transcriptomics. Na- 54 ture Methods 55 15 Daniel E. Wagner, Caleb Weinreb, Zach M. , 16(9):875–878, Septem- 16 Collins, James A. Briggs, Sean G. Megason, ber 2019a. ISSN 1548-7105. doi: 56 17 and Allon M. Klein. Single-cell mapping of 10.1038/s41592-019-0537-1. URL 57 18 gene expression landscapes and lineage in https://www.nature.com/articles/ 58 19 the zebrafish embryo. Science, 360(6392): s41592-019-0537-1. 59 20 981–987, June 2018a. ISSN 0036-8075, 21 1095-9203. doi: 10.1126/science.aar4362. Tongxin Wang, Travis S. Johnson, Wei Shao, 60 22 URL http://science.sciencemag.org/ Zixiao Lu, Bryan R. Helm, Jie Zhang, 61 23 content/360/6392/981. and Kun Huang. BERMUDA: a novel 62 deep transfer learning method for single- 63 24 Florian Wagner and Itai Yanai. Moana: A cell RNA sequencing batch correction re- 64 25 robust and scalable cell type classifica- veals hidden high-resolution cellular sub- 65 26 tion framework for single-cell RNA-Seq types. Genome Biology, 20(1):165, Au- 66 27 bioRxiv data. , page 456129, Octo- gust 2019b. ISSN 1474-760X. doi: 10. 67 28 ber 2018. doi: 10.1101/456129. URL 1186/s13059-019-1764-6. URL https:// 68 29 https://www.biorxiv.org/content/10. doi.org/10.1186/s13059-019-1764-6. 69 30 1101/456129v1.

Xiao Wang, William E. Allen, Matthew A. 70 31 Florian Wagner, Yun Yan, and Itai Wright, Emily L. Sylwestrak, Nikolay 71 32 Yanai. K-nearest neighbor smoothing Samusik, Sam Vesuna, Kathryn Evans, 72 33 for high-throughput single-cell RNA- Cindy Liu, Charu Ramakrishnan, Jia 73 34 Seq data. bioRxiv, page 217737, April Liu, Garry P. Nolan, Felice-Alessio Bava, 74 35 2018b. doi: 10.1101/217737. URL and . Three-dimensional 75 36 https://www.biorxiv.org/content/10. intact-tissue sequencing of single-cell tran- 76 37 1101/217737v3. scriptional states. Science, 361(6400): 77 38 Florian Wagner, Dalia Barkley, and Itai eaat5691, July 2018. ISSN 0036-8075, 78 39 Yanai. Accurate denoising of single- 1095-9203. doi: 10.1126/science.aat5691. 79 40 cell RNA-Seq data using unbiased prin- URL https://science.sciencemag.org/ 80 41 cipal component analysis. bioRxiv, page content/361/6400/eaat5691. 81

73 1 Lukas M. Weber and Mark D. Robinson. 459891. URL https://www.biorxiv.org/ 42 2 Comparison of clustering methods for content/10.1101/459891v1. 43 3 high-dimensional single-cell flow and mass 4 cytometry data. Cytometry Part A, 89 Joshua D Welch, Alexander J Hartemink, and 44 5 (12):1084–1096, December 2016. ISSN Jan F Prins. MATCHER: manifold align- 45 6 1552-4922. doi: 10.1002/cyto.a.23030. ment reveals correspondence between single 46 7 URL https://onlinelibrary.wiley. cell transcriptome and epigenome dynam- 47 Genome Biol. 48 8 com/doi/full/10.1002/cyto.a.23030. ics. , 18(1):138, July 2017.

9 Lukas M. Weber, Malgorzata Nowicka, Char- Jon F. Wilkins, Vincent L. Cannataro, 49 10 lotte Soneson, and Mark D. Robin- Brian Shuch, and Jeffrey P. Townsend. 50 11 son. diffcyt: Differential discovery Analysis of mutation, selection, and epis- 51 12 in high-dimensional cytometry via high- tasis: an informed approach to cancer 52 13 resolution clustering. bioRxiv, page clinical trials. Oncotarget, 9(32):22243– 53 14 349738, November 2018. doi: 10.1101/ 22253, April 2018. ISSN 1949-2553. doi: 54 15 349738. URL https://www.biorxiv.org/ 10.18632/oncotarget.25155. URL http: 55 16 content/10.1101/349738v2. //www.oncotarget.com/index.php? 56 journal=oncotarget&page=article&op= 57 17 Lukas M. Weber, Wouter Saelens, Robrecht view&path[]=25155&path[]=78833. 58 18 Cannoodt, Charlotte Soneson, Alexander 19 Hapfelmeier, Paul P. Gardner, Anne-Laure Marc J Williams, Benjamin Werner, Chris P 59 20 Boulesteix, Yvan Saeys, and Mark D. Barnes, Trevor A Graham, and An- 60 21 Robinson. Essential guidelines for compu- drea Sottoriva. Identification of neu- 61 22 tational method benchmarking. Genome tral tumor evolution across cancer types. 62 23 Biology, 20(1):125, June 2019. ISSN 1474- Nature Genetics, 48(3):238–244, January 63 24 760X. doi: 10.1186/s13059-019-1738-8. 2016. ISSN 1061-4036. doi: 10.1038/ 64 25 URL https://doi.org/10.1186/ ng.3489. URL http://www.nature.com/ 65 26 s13059-019-1738-8. doifinder/10.1038/ng.3489. 66

27 Caleb Weinreb, Samuel Wolock, Betsabeh K. F Alexander Wolf, Philipp Angerer, and 67 28 Tusi, Merav Socolovsky, and Allon M. Fabian J Theis. SCANPY: large-scale 68 29 Klein. Fundamental limits on dynamic in- single-cell gene expression data analysis. 69 30 ference from single-cell snapshots. Proceed- Genome Biol., 19(1):15, February 2018. 70 31 ings of the National Academy of Sciences, 71 32 115(10):E2467–E2476, March 2018. ISSN F. Alexander Wolf, Fiona K. Hamey, 72 33 0027-8424, 1091-6490. doi: 10.1073/pnas. Mireya Plass, Jordi Solana, Joakim S. 73 34 1714723115. URL https://www.pnas. Dahlin, Berthold Göttgens, Nikolaus Ra- 74 35 org/content/115/10/E2467. jewsky, Lukas Simon, and Fabian J. Theis. PAGA: graph abstraction recon- 75 36 Joshua Welch, Velina Kozareva, Ashley Fer- ciles clustering with trajectory inference 76 37 reira, Charles Vanderburg, Carly Martin, through a topology preserving map of sin- 77 38 and Evan Macosko. Integrative inference gle cells. Genome Biology, 20(1):59, March 78 39 of brain cell similarities and differences 2019. ISSN 1474-760X. doi: 10.1186/ 79 40 from single-cell genomics. bioRxiv, page s13059-019-1663-x. URL https://doi. 80 41 459891, November 2018. doi: 10.1101/ org/10.1186/s13059-019-1663-x. 81

74 1 Larry Xi, Alexander Belyaev, Sandra Spur- Hamim Zafar, Yong Wang, Luay Nakhleh, 41 2 geon, Xiaohui Wang, Haibiao Gong, Robert Nicholas Navin, and Ken Chen. Mono- 42 3 Aboukhalil, and Richard Fekete. New li- var: single-nucleotide variant detection in 43 4 brary construction method for single-cell single cells. Nature Methods, 13(6):505– 44 5 genomes. PLoS One, 12(7):e0181163, July 507, June 2016. ISSN 1548-7105. doi: 45 6 2017. 10.1038/nmeth.3835. URL https://www. 46 nature.com/articles/nmeth.3835. 47 7 Li Charlie Xia, Dongmei Ai, Hojoon Lee, 8 Noemi Andor, Chao Li, Nancy R Zhang, Hamim Zafar, Anthony Tzen, Nicholas Navin, 48 9 and Hanlee P Ji. SVEngine: an efficient Ken Chen, and Luay Nakhleh. SiFit: infer- 49 10 and versatile simulator of genome struc- ring tumor trees from single-cell sequenc- 50 11 tural variations with features of cancer ing data under finite-sites models. Genome 51 12 clonal evolution. Gigascience, 7(7), July Biol., 18(1):178, September 2017. 52 13 2018. Hans Zahn, Adi Steif, Emma Laks, Peter 53 14 Li Yang and P Charles Lin. Mechanisms that Eirew, Michael VanInsberghe, Sohrab P 54 15 drive inflammatory tumor microenviron- Shah, Samuel Aparicio, and Carl L Hansen. 55 16 ment, tumor heterogeneity, and metastatic Scalable whole-genome single-cell library 56 17 Semin. Cancer Biol. progression. , 47:185– preparation without preamplification. Nat. 57 18 195, December 2017. Methods, 14(2):167–173, February 2017a. 58

19 Z Yang. Maximum likelihood phylogenetic es- Hans Zahn, Adi Steif, Emma Laks, Peter 59 20 timation from DNA sequences with variable Eirew, Michael VanInsberghe, Sohrab P. 60 21 rates over sites: approximate methods. J. Shah, Samuel Aparicio, and Carl L. 61 22 Mol. Evol., 39(3):306–314, September 1994. Hansen. Scalable whole-genome single- 62 63 23 Yinyin Yuan. Spatial heterogeneity in the tu- cell library preparation without preampli- Nature Methods 64 24 mor microenvironment. Cold Spring Harb. fication. , 14(2):167–173, 65 25 Perspect. Med., 6(8), August 2016. February 2017b. ISSN 1548-7105. doi: 10.1038/nmeth.4140. URL https://www. 66 26 Simone Zaccaria, Mohammed El-Kebir, Gun- nature.com/articles/nmeth.4140. 67 27 nar W. Klau, and Benjamin J. Raphael. 28 The Copy-Number Tree Mixture Deconvo- Luke Zappia, Belinda Phipson, and Alicia 68 29 lution Problem and Applications to Multi- Oshlack. Splatter: simulation of single-cell 69 30 sample Bulk Sequencing Tumor Data. In RNA sequencing data. Genome Biol., 18 70 31 S. Cenk Sahinalp, editor, Research in (1):174, September 2017. 71 32 Computational Molecular Biology, Lecture Ron Zeira and Ron Shamir. Genome 72 33 Notes in Computer Science, pages 318–335. Rearrangement Problems with Sin- 73 34 Springer International Publishing, 2017. gle and Multiple Gene Copies : A 74 35 ISBN 978-3-319-56970-3. Review. Not clear where this was 75 36 H Zafar, N Navin, K Chen, and L Nakhleh. initially published and whether it is 76 37 SiCloneFit: Bayesian inference of popula- peer-reviewed., 2018. URL https: 77 38 tion structure, genotype, and phylogeny of //pdfs.semanticscholar.org/85e6/ 78 39 tumor clones from single-cell genome se- 7eb03d1b3d004c60a12df08c1f937fbaa974. 79 40 quencing data. bioRxiv, 2018. pdf. 80

75 1 Amit Zeisel, Hannah Hochgerner, Peter RNA-seq analysis in recessive dystrophic 43 2 Lönnerberg, Anna Johnsson, Fatima epidermolysis bullosa. PLoS Comput. Biol., 44 3 Memic, Job van der Zwan, Martin Häring, 14(4):e1006053, April 2018. 45 4 Emelie Braun, Lars E. Borm, Gioele Jesse M. Zhang, Govinda M. Kamath, 46 5 La Manno, Simone Codeluppi, Alessan- and David N. Tse. Valid post-clustering 47 6 dro Furlan, Kawai Lee, Nathan Skene, differential analysis for single-cell RNA- 48 7 Kenneth D. Harris, Jens Hjerling-Leffler, Seq. bioRxiv, page 463265, June 49 8 Ernest Arenas, Patrik Ernfors, Ulrika 2019b. doi: 10.1101/463265. URL 50 9 Marklund, and Sten Linnarsson. Molec- https://www.biorxiv.org/content/10. 51 10 ular Architecture of the Mouse Nervous 1101/463265v3. 52 11 System. Cell, 174(4):999–1014.e22, 12 August 2018. ISSN 0092-8674. doi: Jianzhi Zhang, Rasmus Nielsen, and Ziheng 53 13 10.1016/j.cell.2018.06.021. URL http: Yang. Evaluation of an improved branch- 54 14 //www.sciencedirect.com/science/ site likelihood method for detecting posi- 55 15 article/pii/S009286741830789X. tive selection at the molecular level. Mol. 56 Biol. Evol., 22(12):2472–2479, December 57 16 Allen W. Zhang, Ciara O’Flanagan, Eliz- 2005. 58 17 abeth Chavez, Jamie LP Lim, Andrew 18 McPherson, Matt Wiens, Pascale Wal- Jingsong Zhang, Jessica J. Cunningham, 59 19 ters, Tim Chan, Brittany Hewitson, Daniel Joel S. Brown, and Robert A. Gatenby. 60 20 Lai, Anja Mottok, Clementine Sarkozy, Integrating evolutionary dynamics into 61 21 Lauren Chong, Tomohiro Aoki, Xue- treatment of metastatic castrate-resistant 62 22 hai Wang, Andrew P. Weng, Jessica N. prostate cancer. Nature Communications, 8 63 23 McAlpine, Samuel Aparicio, Christian (1):1816, November 2017. ISSN 2041-1723. 64 24 Steidl, Kieran R. Campbell, and Sohrab P. doi: 10.1038/s41467-017-01968-5. URL 65 25 Shah. Probabilistic cell type assignment https://www.nature.com/articles/ 66 26 of single-cell transcriptomic data reveals s41467-017-01968-5. 67 27 spatiotemporal microenvironment dynam- 28 ics in human cancers. bioRxiv, page L. Zhang and S. Zhang. Comparison of com- 68 29 521914, January 2019a. doi: 10.1101/ putational methods for imputing single- 69 30 521914. URL https://www.biorxiv.org/ cell RNA-sequencing data. IEEE/ACM 70 31 content/10.1101/521914v1. Transactions on Computational Biology 71 and Bioinformatics, pages 1–1, 2018. ISSN 72 32 Chao Zhang. Single-Cell Data Analysis 1545-5963. doi: 10.1109/TCBB.2018. 73 33 Using MMD Variational Autoencoder 2848633. 74 34 for a More Informative Latent Repre- 35 sentation. bioRxiv, page 613414, June L Zhang, X Cui, K Schmitt, R Hubert, W Na- 75 36 2019. doi: 10.1101/613414. URL vidi, and N Arnheim. Whole genome am- 76 37 https://www.biorxiv.org/content/10. plification from a single cell: implications 77 38 1101/613414v2. for genetic analysis. Proc. Natl. Acad. Sci. 78 U. S. A., 89(13):5847–5851, July 1992. 79 39 Huanan Zhang, Catherine A A Lee, Zhuliu Li, 40 John R Garbe, Cindy R Eide, Raphael Pe- Xiao-Fei Zhang, Le Ou-Yang, Shuo Yang, 80 41 tegrosso, Rui Kuang, and Jakub Tolar. A Xing-Ming Zhao, Xiaohua Hu, and Hong 81 42 multitask clustering approach for single-cell Yan. EnImpute: imputing dropout 82

76 1 events in single cell RNA sequencing Rapolas Zilionis, Juozas Nainys, Adrian 43 2 data via ensemble learning. Bioinfor- Veres, Virginia Savova, David Zemmour, 44 3 matics, May 2019c. ISSN 1367-4803, Allon M Klein, and Linas Mazutis. Single- 45 4 1367-4811. doi: 10.1093/bioinformatics/ cell barcoding and sequencing using droplet 46 5 btz435. URL http://dx.doi.org/10. microfluidics. Nat. Protoc., 12(1):44–73, 47 6 1093/bioinformatics/btz435. January 2017. 48

7 Xiuwei Zhang, Chenling Xu, and Nir Yosef. 8 SymSim: simulating multi-faceted variabil- 9 ity in single cell RNA sequencing. bioRxiv, 10 page 378646, April 2019d. doi: 10.1101/ 11 378646. URL https://www.biorxiv.org/ 12 content/10.1101/378646v3.

13 Grace X. Y. Zheng, Jessica M. Terry, Phillip 14 Belgrader, Paul Ryvkin, Zachary W. 15 Bent, Ryan Wilson, Solongo B. Ziraldo, 16 Tobias D. Wheeler, Geoff P. McDer- 17 mott, Junjie Zhu, Mark T. Gregory, Joe 18 Shuga, Luz Montesclaros, Jason G. Under- 19 wood, Donald A. Masquelier, Stefanie Y. 20 Nishimura, Michael Schnall-Levin, Paul W. 21 Wyatt, Christopher M. Hindson, Rajiv 22 Bharadwaj, Alexander Wong, Kevin D. 23 Ness, Lan W. Beppu, H. Joachim Deeg, 24 Christopher McFarland, Keith R. Loeb, 25 William J. Valente, Nolan G. Ericson, 26 Emily A. Stevens, Jerald P. Radich, Tar- 27 jei S. Mikkelsen, Benjamin J. Hindson, 28 and Jason H. Bielas. Massively paral- 29 lel digital transcriptional profiling of sin- 30 gle cells. Nature Communications, 8:14049, 31 January 2017. ISSN 2041-1723. doi: 10. 32 1038/ncomms14049. URL https://www. 33 nature.com/articles/ncomms14049.

34 Lingxue Zhu, Jing Lei, Bernie Devlin, 35 and Kathryn Roeder. A unified sta- 36 tistical framework for single cell and 37 bulk RNA sequencing data. The An- 38 nals of Applied Statistics, 12(1):609–632, 39 March 2018. ISSN 1932-6157, 1941- 40 7330. doi: 10.1214/17-AOAS1110. URL 41 https://projecteuclid.org/euclid. 42 aoas/1520564486.

77