Scelestial: Fast and Accurate Single-Cell Lineage Tree Inference Based on a Steiner Tree Approximation Algorithm

bioRxiv preprint doi: https://doi.org/10.1101/2021.05.24.445405; this version posted May 24, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Scelestial: fast and accurate single-cell lineage tree inference based on a Steiner tree approximation algorithm

M.-H. Foroughmand-Araabi¹, S. Goliaei¹, A. C. McHardy¹*
¹ Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
* Corresponding author ([email protected])

Abstract

Single-cell genome sequencing provides a highly granular view of biological systems but is affected by high error rates, allelic amplification bias, and uneven genome coverage. This creates a need for data-specific computational methods, for purposes such as cell lineage tree inference. The objective of cell lineage tree reconstruction is to infer the evolutionary process that generated a set of observed cell genomes. Lineage trees may enable a better understanding of tumor formation and growth, as well as of organ development for healthy body cells. We describe a method, Scelestial, for lineage tree reconstruction from single-cell data, which is based on an approximation algorithm for the Steiner tree problem and is a generalization of the neighbor-joining method. We adapt the algorithm to efficiently select a limited subset of potential sequences as internal nodes, in the presence of missing values, and to minimize cost by lineage tree-based missing value imputation. In a comparison against seven state-of-the-art single-cell lineage tree reconstruction algorithms - BitPhylogeny, OncoNEM, SCITE, SiFit, SASC, SCIPhI, and SiCloneFit - on simulated and real single-cell tumor samples, Scelestial performed best at reconstructing trees in terms of accuracy and run time.
Scelestial has been implemented in C++. It is also available as an R package named RScelestial.

May 12, 2021

1 Introduction

Lineage trees describe the evolutionary process that created a sample of clonally related entities, such as individual cells within an organ or a tumor. A lineage tree also suggests the constitution of a founder cell, as well as of its descendants, and the evolutionary events that occurred during lineage formation. Single-cell genomic data provide a highly resolved view of cellular evolution, substantially more so than bulk genome sequencing [1]. However, they also come with high rates of missing values and distorted allele frequencies arising from amplification biases and sequencing errors [2]. Lineage tree reconstruction from single-cell data therefore requires specific considerations for handling missing values and errors.

Specific approaches to tackle this problem include the methods of Kim and Simon [3], BitPhylogeny [4], OncoNEM [5], SCITE [6], SiFit [7], SASC [8], SPhyR [9], SCIPhI [10], SiCloneFit [11], B-SCITE [12], and PhISCS [13]. Kim and Simon [3] make use of the infinite site assumption and infer a "mutation tree" by calculating a probability for the ordering of mutations in a lineage and then finding the maximum spanning tree in the resulting graph. BitPhylogeny [4] defines a stochastic process, expressed as a graphical model, that generates the given input data, and uses Markov Chain Monte Carlo (MCMC) sampling to search for the best lineage tree model and the associated parameters.
OncoNEM [5] and SCITE [6] infer a phylogenetic tree over all the samples under a maximum likelihood model. OncoNEM uses a heuristic search, whereas SCITE uses MCMC sampling to find a maximum likelihood tree. SiCloneFit [11] infers subclonal structures and a phylogeny via a Bayesian method under the finite site assumption. SASC [8] and SPhyR [9] consider the k-Dollo model, a more relaxed model than the infinite site assumption. In the k-Dollo model, a mutation can be gained once in a tumor but may be lost multiple times afterwards. SASC uses simulated annealing and SPhyR uses k-means to find the best k-Dollo evolutionary tree. B-SCITE [12] and PhISCS [13] infer subclonal evolution based on a combination of single-cell and bulk sequencing data. B-SCITE uses an MCMC method to maximize a likelihood function for trees and sequencing data. PhISCS formulates tree reconstruction as combinatorial and mathematical programming problems, and uses standard mathematical programming solvers to find the solution. Integer and integer linear programming formulations for phylogenetic tree reconstruction are also available [14,15].

The Steiner tree problem is a classic problem in theoretical computer science with a wide range of applications in various areas, including very-large-scale integration design [16], network routing [17], civil engineering [18], and other areas [19–22]. Given a weighted graph in which some vertices are marked as terminals, the Steiner tree problem asks for the minimum-weight tree that connects all the terminal vertices; non-terminal vertices may or may not be included in the optimal tree.
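To make the problem statement concrete, the following is a minimal sketch of an exact solver for very small graphs. It is not the approximation algorithm used by Scelestial; it simply tries every subset of non-terminal vertices and takes the cheapest spanning tree. The function names `mst_weight` and `steiner_tree_weight` and the toy graph are illustrative assumptions, not part of the original work.

```python
from itertools import combinations

def mst_weight(nodes, edges):
    """Kruskal's algorithm restricted to `nodes`; returns None if they are not connected."""
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    total, used = 0, 0
    for w, u, v in sorted(edges):
        if u in parent and v in parent:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                total += w
                used += 1
    return total if used == len(nodes) - 1 else None

def steiner_tree_weight(vertices, edges, terminals):
    """Exact minimum Steiner tree weight, found by trying every subset of non-terminals.

    Exponential in the number of non-terminal vertices, so only usable on toy inputs.
    """
    optional = [v for v in vertices if v not in terminals]
    best = None
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            w = mst_weight(set(terminals) | set(extra), edges)
            if w is not None and (best is None or w < best):
                best = w
    return best

# Hypothetical toy graph: three terminals a, b, c and one optional hub s.
# Edges are (weight, endpoint, endpoint).
vertices = ["a", "b", "c", "s"]
edges = [(2, "a", "b"), (2, "b", "c"), (2, "a", "c"),
         (1, "a", "s"), (1, "b", "s"), (1, "c", "s")]
best = steiner_tree_weight(vertices, edges, ["a", "b", "c"])
print(best)  # 3: routing through s beats the terminal-only spanning tree (weight 4)
```

In this toy case the non-terminal vertex `s` is included in the optimal tree because it lowers the total weight from 4 to 3. The exhaustive search over subsets also hints at why exact solutions do not scale to realistic inputs.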
Although the Steiner tree problem is NP-hard and no polynomial-time exact solution is known, some elegant approximation algorithms are available. They guarantee polynomial run time and a constant-factor approximation ratio, and have been used for phylogenetic reconstruction [23]. This is a great advantage compared with sampling heuristics such as MCMC for finding an optimal solution, which may be trapped in local optima.

Here, we describe Scelestial, a method for lineage tree reconstruction from single-cell datasets, based on the Berman approximation algorithm for the Steiner tree problem [24]. Our method infers the evolutionary history for single-cell data in the form of a lineage tree and imputes the missing values accordingly. Dealing with missing values makes the problem much harder in theory. To overcome this difficulty, we represent each hypercube used in [24] by a sequence with missing values as its center. Although this representation imposes some cost on the results of the algorithm, because the center may be displaced from the best imputation, which we do not know in advance, it helps us to obtain a fast yet accurate and robust algorithm.

2 Results

2.1 Performance in lineage tree reconstruction on simulated data

We compared the performance of Scelestial to SCITE, OncoNEM, BitPhylogeny, SASC (as a recent instance of k-Dollo-based methods), SCIPhI, SiFit, and SiCloneFit. For this, we generated data with the cell evolution simulator provided by OncoNEM and with another tumor evolution simulator that we developed (Section 3.2).
To assess the inferred trees, we calculated their distance to the ground truth lineage trees as the normalized pairwise distances between corresponding samples, as described in Section 3.4.1. In addition, we calculated the similarity between the generated trees and the ground truth by comparing tree splits (Section 3.4.2). For run time evaluation, we executed all the methods on simulated tumor data over a range of numbers of samples and sites. We also assessed the run time performance of the methods in relation to different parameters, such as the number of samples and sites (Section 2.5).

First, we evaluated the algorithms on data created with OncoNEM's simulator. The simulator creates samples by producing clones and sampling from these clones (Section 3.3). It accepts the number of clones (i.e., the number of nodes in the lineage tree), the number of sampled cells, the number of sites, the false positive rate, the false negative rate, and the missing value rate as parameters. We used 1.5% for the false positive rate, 10% for the false negative rate, and 7% for the missing value rate. These parameters are consistent with observed parameters in single-cell data [25]. We performed tests for 50 and 100 samples, 5 and 10 clones, and 20 and 50 sites, and calculated the pairwise distance between the inferred and ground truth trees for all methods (Table 1). From these data, Scelestial reconstructed lineage trees with the lowest distance to the ground truth tree in 5 out of 8 cases. SASC, SCITE, and SiFit each performed best in one of the remaining cases.

Next, we simulated 100 single-cell datasets from a solid tumor covering

Table 1: Comparison of the reconstruction of ground truth lineage trees from data simulated by OncoNEM, showing the distance between the inferred trees and the ground truth for all methods across eight lineage trees.
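The noise settings quoted above (1.5% false positives, 10% false negatives, 7% missing values) can be illustrated with a small sketch. This is not OncoNEM's simulator; it is a hypothetical stand-in that assumes a binary cells-by-sites genotype matrix, applies the three rates independently to each entry, and draws missingness before the error flips. The function name `add_noise` and the random ground truth matrix are assumptions made for illustration.

```python
import random

def add_noise(genotypes, fp_rate=0.015, fn_rate=0.10, missing_rate=0.07, seed=0):
    """Return a noisy copy of a binary cells-by-sites matrix; None marks a missing value.

    Assumed noise model (not taken from the paper): each entry is first dropped
    with probability `missing_rate`; otherwise a 0 flips to 1 with probability
    `fp_rate` and a 1 flips to 0 with probability `fn_rate`.
    """
    rng = random.Random(seed)
    noisy = []
    for row in genotypes:
        new_row = []
        for g in row:
            if rng.random() < missing_rate:
                new_row.append(None)   # missing value
            elif g == 0 and rng.random() < fp_rate:
                new_row.append(1)      # false positive
            elif g == 1 and rng.random() < fn_rate:
                new_row.append(0)      # false negative
            else:
                new_row.append(g)      # observed correctly
        noisy.append(new_row)
    return noisy

# Hypothetical ground truth: 50 cells by 20 sites, matching one of the tested sizes.
truth_rng = random.Random(1)
truth = [[truth_rng.randint(0, 1) for _ in range(20)] for _ in range(50)]
noisy = add_noise(truth)

n_missing = sum(v is None for row in noisy for v in row)
print(f"missing entries: {n_missing} of {50 * 20}")
```

With only 1000 entries, the observed fraction of missing values will fluctuate around the nominal 7% rather than match it exactly.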
