
bioRxiv preprint doi: https://doi.org/10.1101/2021.05.24.445405; this version posted May 24, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Scelestial: fast and accurate single-cell lineage tree inference based on a Steiner tree approximation algorithm

M.-H. Foroughmand-Araabi1, S. Goliaei1, A. C. McHardy1*

1 Department of Computational Biology of Infection Research, Helmholtz

Centre for Infection Research, Braunschweig, Germany

* Corresponding author ([email protected])

Abstract

Single-cell genome sequencing provides a highly granular view of biological

systems but is affected by high error rates, allelic amplification bias, and

uneven genome coverage. This creates a need for data-specific computational

methods, for purposes such as cell lineage tree inference. The objective

of cell lineage tree reconstruction is to infer the evolutionary process that

generated a set of observed cell genomes. Lineage trees may enable a

better understanding of tumor formation and growth, as well as of organ

development for healthy body cells. We describe a method, Scelestial,

for lineage tree reconstruction from single-cell data, which is based on an

approximation algorithm for the Steiner tree problem and is a generalization

of the neighbor-joining method. We adapt the algorithm to efficiently select

a limited subset of potential sequences as internal nodes, in the presence

of missing values, and to minimize cost by lineage tree-based missing

value imputation. In a comparison against seven state-of-the-art single-cell

lineage tree reconstruction algorithms - BitPhylogeny, OncoNEM, SCITE,

SiFit, SASC, SCIPhI, and SiCloneFit - on simulated and real single-cell

tumor samples, Scelestial performed best at reconstructing trees in terms

of accuracy and run time. Scelestial has been implemented in C++. It is

also available as an R package named RScelestial.

May 12, 2021

1 Introduction

Lineage trees describe the evolutionary process that created a sample of

clonally related entities, such as individual cells within an organ or a tumor.

A lineage tree also suggests the constitution of a founder cell, as well as

of its descendants, and the evolutionary events that occurred during lineage

formation. Single-cell genomic data provide a highly resolved view

of cellular , substantially more than bulk genome sequencing [1].

However, they also come with high rates of missing values and distorted

allele frequencies arising from amplification biases and sequencing errors [2].

Lineage tree reconstruction from single-cell data therefore requires specific

considerations for handling missing values and errors. Specific approaches

to tackle this problem include the methods of Kim and Simon [3], BitPhylogeny [4],

OncoNEM [5], SCITE [6], SiFit [7], SASC [8], SPhyR [9],

SCIPhI [10], SiCloneFit [11], B-SCITE [12], and PhISCS [13]. Kim and

Simon [3] make use of the infinite site assumption and infer a “mutation

tree” based on calculating a probability for the ordering of mutations in a lineage

and then finding the maximum spanning tree in the resulting graph.

BitPhylogeny [4] provides a stochastic process through a graphical model

that stochastically generates the given input data and uses Markov Chain

Monte Carlo (MCMC) for sampling to search for the best lineage tree

model and the associated parameters. OncoNEM [5] and SCITE [6] infer a

over all the samples under a maximum likelihood model.

OncoNEM uses a heuristic search, whereas SCITE uses MCMC sampling to

find a maximum likelihood tree. SiCloneFit [11] infers subclonal structures

and a phylogeny via a Bayesian method under the finite site assumption.

SASC [8] and SPhyR [9] consider the k-Dollo model, a more relaxed model

than the infinite site assumption. In the k-Dollo model, a mutation

can be gained only once in a tumor but may be lost up to k times afterwards.

SASC uses simulated annealing and SPhyR uses k-means to find the best

k-Dollo evolutionary tree. B-SCITE [12] and PhISCS [13] infer subclonal

evolution, based on a combination of single-cell and bulk sequencing data.


B-SCITE uses an MCMC method to maximize a likelihood function for

trees and sequencing data. PhISCS formulates tree reconstruction as combinatorial

and mathematical programming problems and uses standard

mathematical programming solvers to find the solution. Integer and integer

linear programming formulations of phylogenetic tree reconstruction are also

available [14,15].

The Steiner tree problem is a classic problem in theoretical computer

science with a wide range of applications, including very-large-scale

integration design [16], network routing [17], civil engineering [18],

and other areas [19–22]. Given a weighted graph and some vertices marked

as terminals, the Steiner tree problem is the problem of finding the minimum

weighted tree that connects all the terminal vertices, with non-terminal

vertices that may or may not be included in the optimal tree. Although

the Steiner tree problem is NP-hard and no polynomial-time exact solution

is known, some elegant approximation algorithms are available. They

guarantee polynomial time and a constant factor approximation ratio,

and have been used for phylogenetic reconstruction [23]. This is a great

advantage compared with sampling heuristics such as MCMC for finding

an optimal solution, which may be trapped in local optima.
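To make the notion of a constant-factor approximation concrete, the following is a minimal sketch of the classic minimum-spanning-tree heuristic for the Steiner tree problem in graphs, which is guaranteed to return a tree of cost at most twice the optimum. It is not the Berman algorithm used by Scelestial; the graph encoding (a dictionary of neighbor-to-weight dictionaries) and all function names are assumptions made for this illustration, and the graph is assumed to be connected and undirected.

```python
import heapq
from itertools import combinations


def dijkstra(graph, source):
    """Shortest-path distances and predecessors from `source`."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, prev


def kruskal(nodes, weighted_edges):
    """Minimum spanning tree as a list of (weight, u, v) edges."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(weighted_edges, key=lambda e: e[0]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def steiner_mst_approx(graph, terminals):
    """Tree connecting all terminals with cost at most twice the optimum."""
    terminals = list(terminals)
    sp = {t: dijkstra(graph, t) for t in terminals}
    # 1. Minimum spanning tree of the metric closure on the terminals.
    closure = [(sp[a][0][b], a, b) for a, b in combinations(terminals, 2)]
    # 2. Expand every closure edge into its underlying shortest path.
    union = set()
    for _, a, b in kruskal(terminals, closure):
        node = b
        while node != a:
            union.add(frozenset((sp[a][1][node], node)))
            node = sp[a][1][node]
    # 3. Remove any cycles created by overlapping paths.
    nodes = {n for edge in union for n in edge}
    tree = kruskal(nodes, [(graph[u][v], u, v) for u, v in map(tuple, union)])
    return tree, sum(w for w, _, _ in tree)
```

Non-terminal vertices enter the result only when they lie on a shortest path between two terminals, which is how the heuristic picks up Steiner vertices without searching over them explicitly.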

Here, we describe Scelestial, a method for lineage tree reconstruction

from single-cell datasets, based on the Berman approximation algorithm for

the Steiner tree problem [24]. Our method infers the evolutionary history

for single-cell data in the form of a lineage tree and imputes the missing

values accordingly. Dealing with missing values makes the problem much

harder in theory. To overcome this difficulty, we represent each hypercube

used in [24] by a sequence with missing values, which serves as its center.

Although this representation imposes some cost on the results of the algorithm,

caused by the displacement of the center from the best imputation (which we

do not know in advance), it helps us to obtain a fast yet accurate and robust

algorithm.
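As a rough illustration of what handling missing values in a distance computation can look like, the sketch below defines a Hamming-style distance in which a missing entry is compatible with any state, plus a simple neighbor-based imputation. The symbol `X`, the `missing_cost` parameter, and both function names are assumptions made for this example; they are not taken from the Scelestial implementation, whose actual cost function may differ.

```python
MISSING = "X"  # hypothetical symbol for a missing value


def distance(a, b, missing_cost=0.0):
    """Hamming-style distance; a position where either entry is missing
    contributes `missing_cost` instead of a full mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for x, y in zip(a, b):
        if x == MISSING or y == MISSING:
            total += missing_cost
        elif x != y:
            total += 1.0
    return total


def impute(seq, neighbor):
    """Fill missing entries of `seq` with the neighbor's state where known."""
    return "".join(
        n if s == MISSING and n != MISSING else s
        for s, n in zip(seq, neighbor)
    )
```

For example, `impute("AXC", "ATG")` fills the middle position from the neighbor and leaves the observed positions untouched.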


2 Results

2.1 Performance in lineage tree reconstruction on simulated data

We compared the performance of Scelestial to SCITE, OncoNEM, BitPhylogeny,

SASC (as a recent instance of k-Dollo-based methods), SCIPhI,

SiFit, and SiCloneFit. For this, we generated data with the cell evolution

simulator provided by OncoNEM and with another tumor evolution simulator

that we developed (Section 3.2). To assess the inferred trees, we

calculated their distance to the ground truth lineage trees as the normalized

pairwise distances between corresponding samples, as described in Section 3.4.1.

In addition, we calculated the similarity between the generated

trees and the ground truth by comparing tree splits (Section 3.4.2). For

run time evaluation, we executed all the methods on simulated tumor data

over a range of numbers of samples and sites, and assessed how the run time

of each method depended on these parameters (Section 2.5).
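The pairwise distance comparison can be sketched as follows. The exact definition is given in Section 3.4.1, which is not reproduced here, so the normalization used below (dividing each tree's pairwise distances by their sum) is an assumption, as are the tree encoding (a dictionary of neighbor-to-edge-length dictionaries) and the function names.

```python
from itertools import combinations


def path_lengths(tree, samples):
    """Path lengths between all ordered pairs of samples in a weighted tree."""
    dist = {}
    for s in samples:
        seen = {s: 0.0}
        stack = [s]
        while stack:  # depth-first traversal from s
            u = stack.pop()
            for v, w in tree[u].items():
                if v not in seen:
                    seen[v] = seen[u] + w
                    stack.append(v)
        for t in samples:
            dist[(s, t)] = seen[t]
    return dist


def pairwise_distance_error(inferred, truth, samples):
    """Sum over sample pairs of the absolute difference between the
    normalized path lengths in the inferred and ground truth trees."""
    pairs = list(combinations(sorted(samples), 2))
    d_inf = path_lengths(inferred, samples)
    d_true = path_lengths(truth, samples)
    norm_inf = sum(d_inf[p] for p in pairs) or 1.0
    norm_true = sum(d_true[p] for p in pairs) or 1.0
    return sum(abs(d_inf[p] / norm_inf - d_true[p] / norm_true) for p in pairs)
```

Under this normalization the measure is invariant to a uniform rescaling of branch lengths, so only the relative placement of samples affects the error.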

First, we evaluated the algorithms on data created with OncoNEM’s

simulator. The simulator creates samples by producing clones and sampling

from these clones (Section 3.3). It accepts the number of clones (i.e., the

number of nodes in the lineage tree), the number of sampled cells, the

number of sites, the false positive rate, the false negative rate, and the

missing value rate as parameters. We used 1.5% as the false positive rate,

10% as the false negative rate, and 7% as the missing value rate.

These parameters are consistent with observed parameters in single-cell

data [25]. We performed tests for 50 and 100 samples, 5 and 10 clones, and

20 and 50 sites, and calculated the pairwise distance between the inferred

and ground truth trees for all methods (Table 1). On these data, Scelestial

reconstructed lineage trees with the lowest distance to the ground truth

tree in 5 out of 8 cases. According to the values in Table 1, SiFit performed

best in the remaining three cases.
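The tally of best-performing methods can be recomputed directly from the values printed in Table 1. The sketch below copies those values and, for each of the eight settings, picks the method with the lowest distance; the variable names and the setting labels are chosen for this illustration only.

```python
from collections import Counter

# Settings in the column order of Table 1: samples / clones / sites.
settings = ["50/5/20", "50/5/50", "50/10/20", "50/10/50",
            "100/5/20", "100/5/50", "100/10/20", "100/10/50"]

# Distances to the ground truth, copied from Table 1.
table = {
    "BitPhylogeny": [0.8947, 0.7949, 0.8875, 0.8665, 1.056, 0.9805, 0.8418, 1.1766],
    "OncoNEM":      [0.9738, 0.781, 0.7412, 0.8044, 0.9783, 0.9867, 0.8429, 0.9426],
    "SCITE":        [1.0205, 0.914, 1.0122, 1.2909, 1.0366, 0.8682, 0.8536, 1.0021],
    "SiFit":        [0.9088, 0.7383, 0.7147, 0.7543, 0.9575, 0.8603, 0.784, 0.861],
    "SASC":         [0.8912, 0.8212, 0.7716, 0.7994, 0.9289, 0.924, 0.8186, 0.9215],
    "SCIPhI":       [0.9671, 1.0897, 0.7049, 0.9466, 1.0628, 1.2485, 1.2483, 1.2194],
    "SiCloneFit":   [1.0935, 0.9441, 0.8728, 0.8433, 1.0641, 1.0464, 0.809, 0.9617],
    "Scelestial":   [0.8904, 0.703, 0.6647, 0.726, 0.9241, 0.8862, 0.7918, 0.8771],
}

# Lower distance is better, so the best method per setting is the argmin.
best = {
    setting: min(table, key=lambda method: table[method][i])
    for i, setting in enumerate(settings)
}
wins = Counter(best.values())
```

With the values as printed, `wins` records five settings for Scelestial and three for SiFit.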

Next, we simulated 100 single-cell datasets from a solid tumor covering


Table 1: Comparison of reconstruction of ground truth lineage tree from data simulated by OncoNEM, showing the distance between the inferred trees and the ground truth for all methods across eight lineage trees. The best result among all the methods for each evolutionary tree is marked with an asterisk (shown in bold in the original).

              50 samples                              100 samples
              5 clones            10 clones           5 clones            10 clones
Method        20 sites  50 sites  20 sites  50 sites  20 sites  50 sites  20 sites  50 sites
BitPhylogeny  0.8947    0.7949    0.8875    0.8665    1.056     0.9805    0.8418    1.1766
OncoNEM       0.9738    0.781     0.7412    0.8044    0.9783    0.9867    0.8429    0.9426
SCITE         1.0205    0.914     1.0122    1.2909    1.0366    0.8682    0.8536    1.0021
SiFit         0.9088    0.7383    0.7147    0.7543    0.9575    0.8603*   0.784*    0.861*
SASC          0.8912    0.8212    0.7716    0.7994    0.9289    0.924     0.8186    0.9215
SCIPhI        0.9671    1.0897    0.7049    0.9466    1.0628    1.2485    1.2483    1.2194
SiCloneFit    1.0935    0.9441    0.8728    0.8433    1.0641    1.0464    0.809     0.9617
Scelestial    0.8904*   0.703*    0.6647*   0.726*    0.9241*   0.8862    0.7918    0.8771

a range of evolutionary time spans (i.e., a range of mutations per branch

in the resulting lineage trees), using a simulation method we implemented

(Section 3.2, results in Fig. 1). The simulated data provide a granular

simulation of tumor evolution and single-cell sequence data. Tumor growth

was simulated with 50 samples and 200 sites. The other parameters (missing

value rate, false positive rate, and false negative rate) were set as in the

OncoNEM simulation. The main difference between our simulation and the

OncoNEM simulation tool was that we considered the relative preference

between clones.
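The three error parameters can be pictured as a noise step applied to a clean genotype matrix. The sketch below is an illustration under stated assumptions, not the simulator of Section 3.2: the encoding (0 for reference, 1 for variant, 3 for missing), the order in which the noise types are applied, and the function name are all hypothetical.

```python
import random

MISSING = 3  # hypothetical code for a missing entry


def add_noise(matrix, fp=0.015, fn=0.10, missing=0.07, seed=0):
    """Return a noisy copy of a 0/1 genotype matrix (list of rows).

    Each entry is first dropped with probability `missing`; otherwise a
    reference entry flips to variant with probability `fp` and a variant
    entry flips to reference with probability `fn`.
    """
    rng = random.Random(seed)
    noisy = []
    for row in matrix:
        out = []
        for g in row:
            if rng.random() < missing:
                out.append(MISSING)
            elif g == 0 and rng.random() < fp:
                out.append(1)  # false positive
            elif g == 1 and rng.random() < fn:
                out.append(0)  # false negative
            else:
                out.append(g)
        noisy.append(out)
    return noisy
```

The default arguments mirror the rates stated in the text (1.5% false positives, 10% false negatives, 7% missing values).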

The lineage tree reconstruction error was calculated as the sum of all

distance errors over every pair of samples (Section 3.4.1). The lower the

lineage tree reconstruction error, the more similar the tree is to the ground

truth one. In addition, we determined the split similarity between the

resulting trees and the ground truth tree (Section 3.4.2). A higher split

similarity measure reflects a better reconstruction of the tree’s topology.
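A split-based comparison can be sketched as follows. Removing an edge from a tree divides the samples into two groups (a split), and two trees can be compared by the splits they share. The exact measure is defined in Section 3.4.2, which is not reproduced here, so the ratio used below (the fraction of ground truth splits recovered by the inferred tree) is an assumption, as are the adjacency-based tree encoding and the function names.

```python
def splits(tree, samples):
    """Non-trivial sample bipartitions induced by removing each edge."""
    samples = frozenset(samples)
    result = set()
    for u in tree:
        for v in tree[u]:
            # Collect the samples on v's side of the edge (u, v).
            seen = {u, v}
            stack = [v]
            side = set()
            while stack:
                x = stack.pop()
                if x in samples:
                    side.add(x)
                for y in tree[x]:
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
            side = frozenset(side)
            other = samples - side
            if len(side) > 1 and len(other) > 1:
                result.add(frozenset((side, other)))
    return result


def split_similarity(inferred, truth, samples):
    """Fraction of the ground truth splits that the inferred tree contains."""
    true_splits = splits(truth, samples)
    if not true_splits:
        return 1.0
    return len(splits(inferred, samples) & true_splits) / len(true_splits)
```

Because the measure depends only on which samples each edge separates, it captures topology and ignores branch lengths.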

Overall, when considering the lineage tree reconstruction error, none of

the resulting trees was highly similar to the ground truth tree (Fig. 1a).

Nevertheless, there were differences in the accuracy of the inferred trees.

With these data, considering the average performance measure of 10 datasets

for each mutation rate, Scelestial performed best in 10 out of 10 cases

regarding the split similarity measure. Scelestial also performed best for


[Figure 1: two line plots comparing BitPhylogeny, OncoNEM, SCITE, Scelestial, SASC, SCIPhI, SiFit, and SiCloneFit. In both panels the x-axis is the average number of mutations on each edge of the lineage tree (2 to 10). (a) Sample distance measure: the y-axis is the lineage tree reconstruction error. (b) Split similarity measure: the y-axis is the split similarity of the tree.]

Fig 1: Comparison of the methods for single-cell lineage tree reconstruction on simulated tumor data. For the lineage tree reconstruction error (Fig. 1a), lower values indicate a better reconstruction. The split similarity measure (Fig. 1b), in contrast, represents the similarity between the reconstructed tree and the ground truth tree, so higher values are favorable.

the distance measure in 9 of 10 cases; SiCloneFit was the best for the

remaining case. The performance of Scelestial, SiCloneFit, and OncoNEM

improved as the number of mutations increased (Fig. 1a). This increase in the

mutation rate had a very large effect on the performance of SiCloneFit and

OncoNEM, whereas the performance of SASC and SiFit was almost stable

across changes in the number of mutations between clones.

Overall, the relative performance of lineage tree reconstruction for

different methods was similar to that observed before (Table 1). This also

indicates that tree inference is of similar relative difficulty on data from the

simulator provided by OncoNEM and on data from our simulator.

2.2 Robustness across varying data properties

We next evaluated the robustness of Scelestial to varying dataset properties

over 420 datasets with varying parameters that we generated for this

purpose with our tumor simulator. We fixed the default configuration

in our simulation as 50 samples, 200 sites, 5 evolutionary nodes in the

simulated evolutionary tree, 1.5% as the false positive rate, 10% as the

false negative rate, and 7% as the missing value rate. We set the average


number of mutations between nodes in the evolutionary tree to be 20. For

evaluating the robustness of Scelestial with respect to one parameter, we

set the other parameters to their default values and evaluated Scelestial

over a range of the parameter under study. We used the following ranges:

false positive rate: 0–50%; false negative rate: 0–30%; missing value rate:

0–50%; samples: 5–200; sites: 20–1000.
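The one-parameter-at-a-time design described above can be written down as a small configuration generator. The default values and ranges are those stated in the text; the number of grid points per parameter is not given in this section, so `steps` is an assumption, as are the key names and the function name.

```python
# Default configuration as stated in the text.
DEFAULTS = {
    "samples": 50,
    "sites": 200,
    "nodes": 5,
    "fp": 0.015,
    "fn": 0.10,
    "missing": 0.07,
    "mutations_per_edge": 20,
}

# Range studied for each parameter (lower bound, upper bound).
RANGES = {
    "fp": (0.0, 0.50),
    "fn": (0.0, 0.30),
    "missing": (0.0, 0.50),
    "samples": (5, 200),
    "sites": (20, 1000),
}


def sweep(param, steps):
    """Yield configurations that vary `param` over an evenly spaced grid
    while every other parameter keeps its default value."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    lo, hi = RANGES[param]
    for i in range(steps):
        value = lo + (hi - lo) * i / (steps - 1)
        if isinstance(DEFAULTS[param], int):
            value = round(value)  # counts must stay integral
        yield {**DEFAULTS, param: value}
```

For instance, `list(sweep("sites", 3))` produces three configurations with 20, 510, and 1000 sites and all other parameters at their defaults.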

Scelestial was not very sensitive to variation in the tested parameters

(Fig. 2, Fig. 3) in terms of their influence on the tree distance error and

topological accuracy measured by split similarity. As expected, performance

decreased with increasing missing value, false negative, and false positive

rates (Fig. 3). For datasets including more than 25 sites, the sample distance

measure was stable.

2.3 Case study: a single-cell dataset from a muscle-invasive bladder tumor

We inferred lineage trees for single-cell data of a muscle-invasive bladder

tumor [26] with Scelestial, OncoNEM, SCITE, and BitPhylogeny. These

data were obtained by single-cell exome sequencing of 44 tumor cells as

well as exome sequencing of normal cells, and included 443 variable sites.

We converted this to the required input matrix for OncoNEM, SCITE,

and BitPhylogeny, in which only the reference state, variant state, and

missing values were specified. In the resulting matrix, 27% of the elements

represented the reference state, 17% represented variant states, and 55%

represented missing values. Unlike the original study [5], we used all

cancerous cells as well as 13 normal cells for lineage tree reconstruction, to

see whether a distinct cancer cell lineage would become apparent in the

inferred trees. All methods except SCITE removed the redundant inner

nodes of the trees (i.e., nodes with no more than one child). For the SCITE

method, to obtain a comparably small tree, we compressed the tree in the

same manner.
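The compression step can be sketched as a single post-order pass over a rooted tree. This is an illustration under assumptions rather than the procedure actually applied to the SCITE output: here a node counts as redundant when it carries no sample and has at most one child, the root is always kept, and the tree encoding (a dictionary from node to list of children, plus a dictionary from node to assigned samples) and the function name are hypothetical.

```python
def compress(children, samples, root):
    """Return a new child map without redundant nodes.

    A sample-free node with exactly one remaining child is spliced out
    (its child is attached to its parent); a sample-free node with no
    remaining children is dropped.
    """
    result = {}

    def visit(node):
        kept = []
        for child in children.get(node, []):
            kept.extend(visit(child))
        if node != root and not samples.get(node) and len(kept) <= 1:
            return kept  # redundant: pass the (0 or 1) children upward
        result[node] = kept
        return [node]

    visit(root)
    return result
```

The traversal is recursive, which is adequate for trees of the size discussed here; very deep trees would call for an explicit stack.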

In the trees inferred by BitPhylogeny, SCITE, SiFit, SASC, and SCIPhI

(Fig. 4), normal and cancerous cells were not separated into distinct subtrees.


[Figure 2: five panels showing the sample-distance error of Scelestial (y-axis, 0 to 1) against the missing value rate, the number of sites, the sample size, the false negative rate, and the false positive rate.]

Fig 2: Robustness of Scelestial to variation in the properties of ground truth lineage trees, in terms of the sample distance between the inferred and ground truth trees.


[Figure 3: five panels showing the split similarity of Scelestial (y-axis, 0 to 1) against the missing value rate, the number of sites, the sample size, the false negative rate, and the false positive rate.]

Fig 3: Robustness of Scelestial to variation in the properties of ground truth lineage trees in terms of topological similarity between the inferred and ground truth trees.


In some cases, they were even assigned to the same clone, which seems

unlikely as an evolutionary scenario. In contrast, the trees of OncoNEM

and Scelestial effectively separated normal and cancerous cells. OncoNEM

placed all the normal cells in one clone and Scelestial separated all the

cancerous cells and normal cells into distinct subtrees.

The evolutionary trees (Fig. 4) returned by all the algorithms except

Scelestial were directed. The most plausible scenario for cancer evolution is

the rooting of a cancer lineage close to the root or to some internal lineage

of normal cells, instead of cancer cells being placed as ancestral to normal

cells. OncoNEM’s tree includes a normal cell as a root, followed by some

cancerous colonies placed between the descendant normal cells and this

root.

2.4 Case study 2: a single-cell dataset from metastatic colorectal cancer

We inferred lineage trees with all methods from single-cell data of colorectal

cancer from two patients [25]. In this dataset, 178 single-cell samples were

gathered from the first patient: 117 samples from normal cells, 33 samples

from the primary tumor, and 28 samples from metastasis of the tumor to

the liver. The data represent 16 genomic sites with 6.7% missing values.

In the reconstructed lineage trees for the first patient (Fig. 5), OncoNEM,

BitPhylogeny, SCIPhI, and SCITE suggest clones including both normal

and cancerous cells, which, as outlined above, is unrealistic. SASC and

SiFit produced lineage trees with several normal and cancerous cells that

evolved from cancerous clones, while Scelestial again separated the two types

of cells into distinct subtrees: except for three metastatic cell samples,

Scelestial suggested evolution of the primary tumor from normal cells and

evolution of the metastatic tumor from the primary tumor. Like Scelestial,

SiCloneFit placed the same three metastatic samples that were misplaced in the

Scelestial tree between normal and primary tumor cells. However, the tree

generated by SiCloneFit suggested independent evolution of the metastatic

cells and primary tumor cells, which seems less likely, though not impossible.


[Figure 4, panels (a) SCIPhI and (b) OncoNEM: tree diagrams whose nodes are labeled with sample identifiers. Legend: no sample; mixed normal and cancerous samples; cancerous samples; normal samples.]

Fig 4: Lineage trees inferred by the different methods on a single-cell dataset from a muscle-invasive bladder tumor. Nodes in these trees represent clones (i.e., inferred evolutionary genomes ancestral to the observed single-cell genomes). Blue nodes represent clones containing only normal cells, orange nodes represent clones containing only cancerous cells, and brown nodes represent clones containing both normal and cancerous cells. Text within the nodes indicates the identifiers of the sample cell(s) assigned to the corresponding clone. White nodes represent nodes with no observed sample.

[Figure 4, continued: panels (c) Scelestial, (d) SiCloneFit, (e) BitPhylogeny, (f) SiFit, (g) SCITE, and (h) SASC.]

However, SiCloneFit placed one metastatic cell (M-176) closer to the primary

tumor. Taken together, Scelestial performed best at separating the normal

cells, primary tumor, and metastatic cells. SiCloneFit was almost as good,

with only one more seemingly misplaced cell.

From the second patient, 181 single-cell samples were taken: 113 normal

cells, 29 primary tumor cells, and 39 metastatic cells. Thirty-six features

were extracted by evaluation of the cancer mutations of the samples. Overall,

7.7% of the entries were missing values.

All algorithms separated normal from cancerous cells less well than

for the first patient (Fig. 6). Similar to their trees for the first patient,

OncoNEM, BitPhylogeny, SCIPhI, and SCITE suggested several mixed

clones including normal and cancerous cells. SASC and SiFit produced

lineage trees with several normal and cancerous cells evolving from each

other. In SiFit’s tree, fewer normal cells evolved from cancerous

cells than in SASC’s tree, but the tree instead had many distinct subtrees with

cancerous cells descending from normal cells. Except for one cell (M-175),

all cancerous cells were separated from normal cells into one subtree in

Scelestial’s tree. In the SiCloneFit tree, the cancerous cells M-175 and

M-178 were also misplaced outside of a cancer lineage and closer to normal

cells. Metastatic and primary tumor cells were not well separated from

one another in the trees generated by Scelestial as well as by SiCloneFit.

Overall, Scelestial and SiCloneFit created the most realistic lineage trees of

all the methods analyzed.

2.5 Run time efficiency

We compared the run times of the eight methods with default settings

on the 110 datasets generated by the OncoNEM simulator (Fig. 7). We

simulated data with 10 mutations on average at each evolutionary step, a

range of 20 to 100 samples, and a range of 50 to 200 sites. We only included

pairs of methods and cases for which the method’s run time was less than

15 minutes.
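A run time measurement with a 15-minute cutoff can be sketched as below. The command lines of the individual tools are not given in this section, so `cmd` is a placeholder to be supplied by the user; the function name and the choice to treat a non-zero exit status like a timeout are assumptions of this illustration.

```python
import subprocess
import time


def timed_run(cmd, timeout_s=15 * 60):
    """Run `cmd` (a list of arguments) and return its wall-clock time in
    seconds, or None if it exceeds `timeout_s` or exits with an error."""
    start = time.perf_counter()
    try:
        subprocess.run(
            cmd,
            check=True,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    return time.perf_counter() - start
```

Method and dataset pairs for which `timed_run` returns `None` would then simply be left out of the comparison.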

SCITE, SCIPhI, BitPhylogeny, and Scelestial were the only methods


[Figure 5, panels (a) SiFit, (b) SCITE, and (c) SASC: tree diagrams with nodes labeled by sample identifiers and edges labeled by numeric values. Legend: no sample; mixed normal/cancerous/metastasis samples; metastasis tumor samples; primary tumor samples; normal samples.]

Fig 5: Lineage tree inferred by the different methods on a single-cell dataset from the first colorectal cancer patient.


[Figure 5, panels (d) Scelestial and (e) SiCloneFit: tree drawings (cell labels and edge lengths) are not recoverable from the text extraction.]


[Figure 5, panels (f) SCIPhI, (g) OncoNEM, and (h) BitPhylogeny: tree drawings (cell labels and edge lengths) are not recoverable from the text extraction.]


[Figure 6, panels (a) SiCloneFit and (b) Scelestial: tree drawings (cell labels and edge lengths) are not recoverable from the text extraction.]

Fig 6: Lineage tree inferred by the different methods on a single-cell dataset from the second colorectal cancer patient.


[Figure 6, panels (c) SiFit and (d) SASC: tree drawings (cell labels and edge lengths) are not recoverable from the text extraction. Legend of node categories: no sample; mixed normal/cancerous/metastasis samples; metastasis tumor samples; primary tumor samples; normal samples.]


[Figure 6, panels (e) SCITE, (f) SCIPhI, (g) OncoNEM, and (h) BitPhylogeny: tree drawings (cell labels and edge lengths) are not recoverable from the text extraction.]


[Figure 7, two panels. Vertical axis: running time (seconds). Methods shown: BitPhylogeny, OncoNEM, SCITE, Scelestial, SASC, SCIPhI, SiFit, SiCloneFit. (a) Run time comparisons for varying sample sizes (horizontal axis: number of samples). (b) Relation of run time to the number of sites (horizontal axis: number of sites).]

Fig 7: Run time comparison in relation to the number of samples and sites.

finishing their task across the whole range of sample sizes within 100 seconds.

The run times of Scelestial, SCIPhI, and SCITE grew almost linearly with an increasing number of samples. The run time of BitPhylogeny did not appear to be directly related to the number of samples and varied irregularly across the tested cases, since BitPhylogeny applies MCMC and its run time depends only on the number of iterations. When we increased the number of sites, the run time of Scelestial also grew almost linearly. Apart from Scelestial, BitPhylogeny and SCITE were the fastest methods. Over all cases, Scelestial was faster than all other methods in seven cases; BitPhylogeny was the fastest in six cases.

On these data, Scelestial was not only the fastest method but, as before, also had the smallest average error in lineage tree reconstruction (Fig. 8b). Among 130 cases, in terms of sample distance error, Scelestial performed best in 112 cases and OncoNEM performed best in 22 cases. In terms of the split similarity measure of topological correctness, Scelestial performed best in 112 cases, OncoNEM in 18 cases, BitPhylogeny in four cases, and SCITE in two cases. In case of ties, best performance was counted for every tied method, which is why these counts sum to more than 130. OncoNEM was slightly more accurate in some cases; however, it had a substantially longer run time. Thus, in these tests Scelestial was best overall in both run time and error. SCITE and BitPhylogeny had similar split similarity, with SCITE being faster than BitPhylogeny on average. Considering the pair distance error, the performance of BitPhylogeny was slightly better than that of SCITE. The performance of OncoNEM with respect to the pair distance error was better than that of BitPhylogeny and SCITE in many cases and on average. However, the variance of the pair distance error and of the split similarity of OncoNEM’s results was higher than for the other methods. The range of variation in OncoNEM’s performance is clearly shown in the split similarity measure chart for this dataset (Fig. 8b).
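The tie rule used for the per-case tallies above (every tied method is credited with a best performance) can be illustrated with a minimal sketch. The function name `count_best`, the method names, and the error values below are invented for illustration only and are not results from this study.

```python
from collections import Counter

def count_best(errors_per_case):
    """Count, per method, the number of cases in which it attains the lowest error.

    Ties credit every tied method, so the counts may sum to more than the
    number of cases.
    """
    wins = Counter()
    for case in errors_per_case:
        best = min(case.values())
        for method, error in case.items():
            if error == best:
                wins[method] += 1
    return wins

# Toy error values (hypothetical, NOT taken from the paper).
toy_cases = [
    {"method_A": 0.2, "method_B": 0.2, "method_C": 0.5},  # A and B tie
    {"method_A": 0.1, "method_B": 0.3, "method_C": 0.4},  # A is best
    {"method_A": 0.6, "method_B": 0.4, "method_C": 0.7},  # B is best
]
wins = count_best(toy_cases)
total_credits = sum(wins.values())
```

With three cases and one tie, `total_credits` is four, which mirrors how tallies over 130 cases can add up to more than 130.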

According to theoretical analysis, the run time of Scelestial is polynomial with respect to the number of samples with exponent k, which is a parameter of the Steiner tree approximation algorithm. In Fig. 7, the quadratic form of the run time might not be directly evident. This is expected because the chart is cropped for large values to show the differences between all the algorithms except OncoNEM and SiFit. The chart is also drawn with a logarithmic y-axis, which makes it hard to see the actual growth. SCIPhI, SCITE, and BitPhylogeny show similar behavior. In practice, the run times of all the algorithms except for Scelestial and SCIPhI grew substantially with an increasing number of sites.
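One way to make a polynomial growth rate visible when it is hard to read from a chart is to fit the exponent directly: if the run time behaves like T(n) = c * n^k, then log T = log c + k log n, and k is the slope of a least-squares line in log-log space. The following is a minimal sketch of that calculation; the function name `fit_growth_exponent` is hypothetical, and the timings are synthetic values generated from a known quadratic, not measurements from this study.

```python
import math

def fit_growth_exponent(sizes, times):
    """Return the least-squares slope of log(time) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic timings from t = 0.001 * n^2 (illustrative only, NOT from the paper).
sizes = [20, 40, 60, 80, 100]
synthetic_times = [0.001 * n ** 2 for n in sizes]
exponent = fit_growth_exponent(sizes, synthetic_times)
```

On real measurements, the fitted slope would only approximate the theoretical exponent, since constant overheads and lower-order terms dominate for small inputs.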

3 Materials and Methods

3.1 Data format

We model the data as an m by n matrix D with single-cell samples as columns and features as rows. The element of D corresponding to a single cell c and a locus f is denoted D[c, f] and represents the result of a variant call obtained from a single-cell sample of a diploid genome, which may be one of the 10 states from the set {A/A, T/T, C/C, G/G, A/T, A/C, A/G, T/C, T/G, C/G} or a missing value X/X. When the data do not provide all the information (e.g., for data obtained from OncoNEM’s simulation tools), we can convert 0/1 (reference state/variant state) matrices to the 10-state format by coding 0 as A/A and 1 as C/C. All the single-cell lineage tree reconstruction methods we consider here support missing values in their input matrices.


Fig 8: Comparison of methods with respect to running time and lineage tree reconstruction error. Both panels plot running time (seconds) against a reconstruction quality measure for BitPhylogeny, OncoNEM, SCITE, Scelestial, SASC, SCIPhI, SiFit, and SiCloneFit. (a) Run time and lineage reconstruction error of four lineage tree inference methods on datasets with different numbers of sites with respect to the pair distance error. (b) Run times and lineage reconstruction errors of four methods on datasets with different numbers of sites with respect to the split similarity measure.

With this coding, we cannot differentiate between the case of two identical

alleles (e.g., A/A) and the case of a missing allele in one strand. We hope

that considering an error as a regularization method helps us to find the

model (which is a lineage tree) that fits the data best.
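The conversion from a 0/1 matrix to the 10-state format can be sketched as follows. This is a minimal illustration, not the Scelestial implementation: the function name, the list-of-lists layout, and the use of `None` as the missing-value marker in the binary input are assumptions made here.

```python
# Hypothetical helper; only the 0 -> A/A and 1 -> C/C coding comes from the text.
MISSING = "X/X"

def binary_to_ten_state(matrix, missing_marker=None):
    """Map a 0/1 matrix to genotype strings, keeping missing entries as X/X."""
    coding = {0: "A/A", 1: "C/C"}
    return [
        [MISSING if value == missing_marker else coding[value] for value in row]
        for row in matrix
    ]

calls = [[0, 1, None],
         [1, 1, 0]]
print(binary_to_ten_state(calls))
```

Any two distinct homozygous states would serve equally well for the coding, since only equality between states enters the cost function used later.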

3.2 Synthetic data generation via tumor simulation

We developed a tumor growth simulation method as a data source for the

evaluation of Scelestial and for a comparison with state-of-the-art methods.

The simulation has three phases: (1) simulation of evolution, (2) sampling

from the tree, and (3) simulation of sequencing.

In the simulation of evolution phase, the evolutionary process is sim-

ulated and a tree is generated. This simulation is based on evolutionary

events that happen in a tumor. We modeled the cell division, mutation,

and selection in the evolutionary process of the tumor in this simulation.

Each node in the evolutionary tree represents a cell (or representative of a

bulk of similar cells) in the history of tumor formation. To each of these,


an advantage value is assigned that shows the relative growth or division

advantage of the cells corresponding to that node. The advantage value of

a node is its parent's advantage value plus or minus a uniform

random number. The evolutionary tree is constructed through several steps.

In each step, one new node is generated and its parent is chosen from the

current nodes with a probability proportional to their advantage values.

We calculated the advantage value for the newly born node as a random

perturbation added to its parent’s advantage value. The actual sequence

for the new node is calculated from the sequence of the parent with some

random mutations. The number of mutations relative to its parent is drawn

from a Poisson distribution whose mean is specified in the

input. The locations of the mutations are chosen uniformly from all the

loci.

In the sampling phase, samples are chosen from the nodes of the tree

with a probability which is proportional to their advantage values. In the

simulation of the sequencing phase, missing values and errors are then

incorporated by stochastic processes.
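The three phases can be illustrated with a short sketch. It is a simplified stand-in for the simulator, under several assumptions that are not stated in the text: sequences are binary (0 = reference, 1 = variant), a mutation flips the state of a locus, advantage values are clamped to a small positive floor, samples are drawn with replacement, and the parameter names and default rates are hypothetical.

```python
import math
import random

def poisson(rng, mean):
    """Knuth's method for drawing a Poisson variate (adequate for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sequence_cell(seq, rng, missing_rate=0.1, error_rate=0.01):
    """Phase 3: per locus, emit a missing value (None), an error, or the truth."""
    out = []
    for value in seq:
        if rng.random() < missing_rate:
            out.append(None)
        elif rng.random() < error_rate:
            out.append(1 - value)
        else:
            out.append(value)
    return out

def simulate_tumor(n_nodes, n_loci, mean_mutations, n_samples, step=0.1, seed=0):
    rng = random.Random(seed)
    parent, advantage, seqs = [None], [1.0], [[0] * n_loci]  # node 0 is the root
    for _ in range(1, n_nodes):
        # Phase 1: parent chosen with probability proportional to advantage.
        p = rng.choices(range(len(parent)), weights=advantage)[0]
        # Child advantage = parent's advantage plus a uniform perturbation
        # (the positive floor is an assumption of this sketch).
        adv = max(1e-6, advantage[p] + rng.uniform(-step, step))
        child = list(seqs[p])
        for _ in range(poisson(rng, mean_mutations)):
            locus = rng.randrange(n_loci)      # uniform over all loci
            child[locus] = 1 - child[locus]
        parent.append(p)
        advantage.append(adv)
        seqs.append(child)
    # Phase 2: sample nodes with probability proportional to advantage.
    sampled = rng.choices(range(n_nodes), weights=advantage, k=n_samples)
    observed = [sequence_cell(seqs[i], rng) for i in sampled]
    return parent, sampled, observed
```

The `parent` list encodes the ground truth tree against which an inferred lineage tree can later be compared.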

3.3 Synthetic data generated by OncoNEM’s simulation tool

We used simulated data generated by OncoNEM’s simulation software

for our evaluation. The OncoNEM simulator is based on the evolution

of clones. First, it generates a lineage tree, then it assigns mutations to

tree nodes under the infinite sites assumption. Afterwards, it generates

output sequences by sampling from the tree and generating sequences with

single-cell sequencing issues, including missing values, false positives, and

false negatives.


3.4 Comparison of a resulting lineage tree with a ground truth

We define two measures between trees: (1) a distance measure that compares

distances between samples in two trees, and (2) a similarity measure that

compares the set of splits made by two trees applied to the set of samples.

For both measures, we used a weighted version of classic 0/1 measures.

Other measures used in the literature include clustering accuracy [12],

order of mutation in a phylogenetic tree, ancestor-descendant accuracy

of mutations, different lineages, and co-clustering mutations [12]. These

measures are used for evaluating mutation trees, which is not what we

generate in the Scelestial method.

3.4.1 Sample distance measure between trees

We define the sample distance measure between two trees based on the shortest-path matrix idea proposed by Ross and Markowetz [5]. For a tree $T$, we calculate the distance between each pair of input cells and store these values in a pairwise distance matrix $PD_T$. We then normalize $PD_T$ to obtain $\overline{PD}_T$ as

$$\overline{PD}_T[i, j] = \frac{PD_T[i, j]}{\sum_{x,y} PD_T[x, y]}.$$

Since different methods use different concepts of weight for the tree edges, the normalization allows us to neglect absolute values and consider only the relative edge distances in the lineage trees provided. We define the distance between two trees $T$ and $T'$ as

$$D(T, T') := \sum_{i,j} \left| \overline{PD}_T[i, j] - \overline{PD}_{T'}[i, j] \right|,$$

which represents the distance between the normalized pairwise distance matrices for the two trees. The value of $D(T, T')$ lies in the range between 0 and 2.
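The sample distance measure can be sketched in a few lines. The representation is an assumption made for illustration: each tree is a list of weighted edges `(u, v, w)`, and the samples are node labels present in both trees.

```python
def path_lengths(edges, source):
    """Path length from `source` to every node of a weighted tree."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist, stack = {source: 0.0}, [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

def normalized_pd(edges, samples):
    """Pairwise distance matrix between samples, scaled to sum to 1."""
    pd = {}
    for a in samples:
        d = path_lengths(edges, a)
        for b in samples:
            pd[a, b] = d[b]
    total = sum(pd.values())
    return {key: value / total for key, value in pd.items()}

def sample_distance(edges1, edges2, samples):
    """Sum of absolute differences between the two normalized matrices."""
    p, q = normalized_pd(edges1, samples), normalized_pd(edges2, samples)
    return sum(abs(p[key] - q[key]) for key in p)
```

Because each normalized matrix sums to 1, rescaling all edge weights of one tree by a constant leaves the measure unchanged, and the result cannot exceed 2.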

3.4.2 Split similarity measure between trees

We define the split similarity between two trees $T$ and $T'$ as the similarity

between the two sets of splits generated by the trees. For each edge $e$ in a

tree $T$, we define the split $S_e$ as the set $\{A, B\}$, where $A$ and $B$ are the sets of samples separated by the edge $e$ in $T$. We define a similarity between


two splits $\{A_1, B_1\}$ and $\{A_2, B_2\}$ as the number of samples that are split in the same way by both. However, since the two sides of a split are unordered, we define the similarity score as:

$$\max\{|A_1 \cap A_2| + |B_1 \cap B_2|,\ |A_1 \cap B_2| + |B_1 \cap A_2|\}$$

To calculate the distance between two sets of splits, we find the mapping between the elements of the two sets with the maximum similarity score. The mapping similarity score is the sum of the similarity scores of the matched splits. We then define $DS(T, T')$ as the normalized matching score, which is the mapping score between $T$ and $T'$ divided by the mapping score of $T$ with $T$. The split score value is between 0 and 1.
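A small sketch of the split similarity follows. Splits are represented as pairs of frozensets, and the best mapping is found by brute force over permutations, which is only practical for tiny examples; a maximum-weight bipartite matching algorithm would be the natural replacement for realistic tree sizes. The function names and the one-to-one matching are assumptions of this sketch.

```python
from itertools import permutations

def split_similarity(s1, s2):
    """Number of samples placed consistently by two unordered splits."""
    (a1, b1), (a2, b2) = s1, s2
    return max(len(a1 & a2) + len(b1 & b2), len(a1 & b2) + len(b1 & a2))

def mapping_score(splits1, splits2):
    """Best total similarity over one-to-one mappings (brute force)."""
    small, large = sorted((splits1, splits2), key=len)
    return max(
        sum(split_similarity(s, t) for s, t in zip(small, perm))
        for perm in permutations(large, len(small))
    )

def normalized_split_score(splits_t, splits_other):
    """Mapping score of T with the other tree, divided by that of T with T."""
    return mapping_score(splits_t, splits_other) / mapping_score(splits_t, splits_t)

ab_cd = (frozenset("ab"), frozenset("cd"))
ac_bd = (frozenset("ac"), frozenset("bd"))
print(normalized_split_score([ab_cd], [ac_bd]))
```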

3.5 The Scelestial algorithm

The Scelestial algorithm is based on an established theoretical computer

science problem called the Steiner tree problem. We incorporate an ap-

proximation algorithm for this problem provided by Berman et al. [24]

and its modification for lineage tree reconstruction [23]. We modify the

algorithm to support missing values and imputation. In the following, we

examine the details of the Steiner tree problem, the reduction of lineage

tree reconstruction to the Steiner tree problem and the incorporation of

missing values into our method.

A schematic view of the Scelestial algorithm is illustrated in Fig. 9.

The main part of the Scelestial algorithm is described in Algorithm 1.

Sub-modules of the Scelestial algorithm are presented in Algorithms 2–5.

3.5.1 The Steiner tree problem

The Scelestial algorithm is based on the Berman approximation algorithm

for the Steiner tree problem [24]. The input of a Steiner tree problem consists

of a weighted graph G = (V, E, w) and a subset of its vertices S ⊆ V , which

are called terminals. It is convenient to suppose that the weight function

w satisfies the triangle inequality (i.e., w(x, z) ≤ w(x, y) + w(y, z), for all


three vertices x, y, z ∈ V). When no missing values occur in

the data, the triangle inequality constraint is satisfied, for example, for the

Hamming distance between nodes. However, in the case of single-cell data,

which contain a lot of missing values, we should consider this constraint

carefully.

The Steiner tree problem is the problem of finding a minimum-weight

connected subgraph of G that contains all the terminals S. The Steiner

tree problem is an NP-hard problem and it is known that under reasonable

complexity assumptions, no approximation algorithm can approximate the

result better than a factor of 96/95 ≈ 1.0105 [27].

Most approximation algorithms for the Steiner tree problem focus on

results that consist of a set of subtrees, each having k terminals at most,

for some constant k. A solution for the Steiner tree problem with this

property is called a k-restricted Steiner tree. We call the value k the

restriction number of a Steiner tree. Borchers and Du [28] showed that

restriction of the search space to k-restricted Steiner trees does not change

the approximation factor of an algorithm too much. More specifically, if

$k = 2^r + s$ where $0 \le s \le 2^r$, the restriction to $k$-restricted Steiner trees

reduces the approximation ratio to

$$\frac{r 2^r + s}{(r + 1) 2^r + s}.$$
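To make the expression concrete, the following sketch evaluates it for small restriction numbers, taking $r$ as the largest integer with $2^r \le k$ and $s = k - 2^r$. The function name is hypothetical, and the code only evaluates the formula as stated here for $k \ge 2$.

```python
from fractions import Fraction

def k_restricted_ratio(k):
    """Evaluate (r*2^r + s) / ((r+1)*2^r + s) for k = 2^r + s, k >= 2."""
    r = k.bit_length() - 1        # largest r with 2**r <= k
    s = k - 2**r
    return Fraction(r * 2**r + s, (r + 1) * 2**r + s)

for k in (2, 3, 4, 8):
    print(k, k_restricted_ratio(k))
```

The value is 1/2 for k = 2, 3/5 for k = 3, 2/3 for k = 4, and 3/4 for k = 8, so it approaches 1 as the restriction number grows.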

3.5.2 Berman’s approximation algorithm for the Steiner tree

problem

Berman et al.’s approximation algorithm [24] consists of three phases:

examination, evaluation, and application.

During the examination phase, the algorithm maintains a minimum

spanning tree M on the terminal set S. In this phase, the algorithm

considers all the subsets K of the terminals with a size of at most k (for

some fixed constant k) and all the topologies τ for the trees with K as its

leaves. For each K and τ, the algorithm finds the best lineage tree t (i.e.,

the best sequences for the internal nodes). This could be done through


dynamic programming. Adding t to the spanning tree M creates (k − 1) cycles. The algorithm finds a subset of edges of M with maximum cost to be removed from M + t to obtain a tree again. These edges are called bridges β. A gain is defined for t as the decrease in the cost of the resulting tree if t is incorporated, that is, cost(β) − cost(t), where the cost of a set of edges is the sum of the costs of its elements. If the gain is greater than 0, the algorithm adds t to a stack for the evaluation phase, removes β from M, and adds some new edges to M instead. For each edge e of β, the algorithm finds the vertices u and v from K that would be disconnected by removing e, and adds the edge (v, u) to M with the cost cost(e) − gain. At the end of the examination phase, we have a stack of trees and the modified M.

In the evaluation phase, the algorithm pops the trees t one by one from

the stack. It starts with an initially empty set of edges M'. At each step, it

checks whether t creates a cycle with M'. If it does not, the algorithm

accepts t; otherwise, it rejects t. Finally, in the application phase, the

algorithm merges the accepted trees.
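The bridge and gain computation of the examination phase can be sketched as follows. The sketch relies on a standard observation rather than on the authors' code: once the terminals in K are joined by t, the bridges are exactly the spanning tree edges that a Kruskal-style scan rejects because they close a cycle. The edge-list representation and function names are assumptions, and the replacement of bridges by reduced-cost edges is omitted.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def bridges(mst_edges, terminals_k):
    """Edges of M that must be removed once t connects all terminals in K."""
    nodes = {u for u, _, _ in mst_edges} | {v for _, v, _ in mst_edges}
    parent = {x: x for x in nodes}
    first, *rest = terminals_k
    for x in rest:                            # t joins all of K into one component
        parent[find(parent, x)] = find(parent, first)
    removed = []
    for u, v, w in sorted(mst_edges, key=lambda e: e[2]):
        ru, rv = find(parent, u), find(parent, v)
        if ru == rv:
            removed.append((u, v, w))         # heaviest edge on a newly closed cycle
        else:
            parent[ru] = rv
    return removed

def gain(mst_edges, terminals_k, cost_t):
    """gain = cost(bridges) - cost(t); t is worth considering when this is positive."""
    return sum(w for _, _, w in bridges(mst_edges, terminals_k)) - cost_t

M = [("a", "b", 2), ("b", "c", 5), ("c", "d", 1)]
print(bridges(M, ["a", "d"]), gain(M, ["a", "d"], cost_t=3))
```

In this toy path a-b-c-d, connecting a and d directly makes the edge (b, c) of cost 5 redundant, so a candidate tree of cost 3 yields a gain of 2.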

The run time of the algorithm for a general Steiner tree problem is

$O(n^3 + n^{k + 1/2} + N^{k-2} n^{k-1})$, where $N$ is the number of vertices in the graph.

Note that n is the number of terminal vertices S. The algorithm is an

11/6-approximation algorithm for the Steiner tree problem, if we set k = 3.

If we set k = 4, the algorithm would be a 16/9-approximation algorithm.

For larger values of k, we do not know a better approximation factor for

the result of the algorithm, but the results of the execution of the algorithm

show that the performance of the algorithm gets better as we increase the

value of k [24].

3.5.3 Modeling lineage tree reconstruction as a Steiner tree

problem

The single-cell lineage tree inference problem can be modeled in the following

way. Let $g_i$ for $1 \le i \le n$ be the input sequences over some alphabet Σ + µ, where µ is a special character representing the missing value. Each


location on a sequence could be considered as a feature, which may be a

locus in the sequenced genomes or may represent a genomic aberration. We

suppose that all the sequences have the same length m. A cost function is

defined for any potential lineage trees to show how well a lineage tree is

fitted to the data. Normally, the cost function assigns a cost to each edge

of the tree, and the cost of a tree is the sum of its edge costs. The lineage

tree inference problem is the problem of finding a minimum-cost lineage tree

that contains all the input sequences $g_i$ and potentially some other nodes. In the resulting tree, all non-input nodes are assigned a label of length m

from the alphabet Σ.

A general lineage tree inference problem that does not incorporate

missing values could be modeled as a Steiner tree problem as follows. Let

G = (V, E, w) be the graph containing all sequences of length m from the

alphabet Σ that have edge weights derived from the tree cost function.

Thus, the lineage tree problem would be equivalent to finding a Steiner tree

in this graph with the input sequences as its terminal set S.

3.5.4 Incorporating an approximation algorithm for lineage tree

reconstruction

We showed how lineage tree reconstruction could be modeled as a Steiner

tree problem in the previous section. However, through this method, the

size of the graph is $|\Sigma|^m$, which is exponential in the length of the input of the

lineage tree reconstruction problem. On the other hand, some approximation

algorithms do not consider all the vertices of graph G and they only work

with the best Steiner trees over certain subsets of terminals. Since we can

solve the Steiner tree problem over a small subset of terminals, we can use

these approximation approaches.

To apply Berman et al.’s algorithm for lineage tree reconstruction, as

already suggested by Alon et al. [23], for every subset S' ⊆ S of

size at most k, we consider all the possible tree topologies with k leaves at

most. Since the number k is constant, the number of these tree topologies

would also be constant. Next, for the pair consisting of a subset S' and a


tree T , we find the minimum-cost Steiner tree by dynamic programming.

The rest is done by Berman et al.’s algorithm [24]. This approach gives us an 11/6-approximation algorithm for the lineage tree reconstruction problem with k = 3 and a 16/9-approximation algorithm for k = 4. The cost of an edge between two nodes with the sequences $g_i$ and $g_j$ is $\sum_{t=1}^{m} c(g_i[t], g_j[t])$, for some cost function $c$. In this work, we assign a cost of 1 to every mutation and a cost of zero to non-mutated locations (i.e., $c(x, x) = 0$ for $x \in \Sigma$ and $c(x, y) = 1$ for $x \neq y \in \Sigma$).
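The dynamic program for a fixed subset and topology can be illustrated with the classical Sankoff recursion, applied independently at each site. This is a sketch under assumptions: the alphabet is reduced to four nucleotides instead of the 10 genotype states, leaves contain no missing values, and a topology is written as nested tuples whose leaves are sequence strings. With the symmetric unit cost used here, the position of the root does not affect the result.

```python
ALPHABET = "ACGT"   # simplification; the paper uses 10 diploid genotype states

def c(x, y):
    """Unit cost per mutation, zero for unchanged characters."""
    return 0 if x == y else 1

def site_costs(node, site):
    """For each character, the cheapest cost of the subtree below `node`."""
    if isinstance(node, str):                     # leaf with an observed character
        return {a: (0 if a == node[site] else float("inf")) for a in ALPHABET}
    tables = [site_costs(child, site) for child in node]
    return {
        a: sum(min(c(a, b) + table[b] for b in ALPHABET) for table in tables)
        for a in ALPHABET
    }

def tree_cost(topology):
    """Minimum total cost over all labelings of the internal nodes."""
    leaf = topology
    while not isinstance(leaf, str):
        leaf = leaf[0]
    return sum(min(site_costs(topology, s).values()) for s in range(len(leaf)))

print(tree_cost(("CAC", "GCC", "GAC")))           # star with one internal node
print(tree_cost((("AA", "AA"), ("CC", "CC"))))    # two cherries joined at the root
```

For the three leaves CAC, GCC, and GAC, the internal sequence GAC attains the minimum cost of 2.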

3.5.5 Support for missing values

To adapt the method to the single-cell setting, we incorporate missing values

into the algorithm. We can simply do this by extending the domain of the

character cost function $c : \Sigma \times \Sigma \to \mathbb{R}^*$ to $c : (\Sigma + \mu) \times (\Sigma + \mu) \to \mathbb{R}^*$. To do so, we should define $c(\mu, x) = c(x, \mu)$ for $x \in \Sigma$ as well as $c(\mu, \mu)$; $c(\mu, x)$

could be considered as the cost of imputation for a locus. One may consider

the assignment of c(µ, x) = c(µ, µ) = 0. However, this assignment violates

the triangle inequality for the graph G. To preserve the triangle inequality

between the edges’ weights, we may assign c(µ, x) = 0.5 + ε for some small

constant ε (e.g., $10^{-5}$). Furthermore, letting c(µ, µ) = 0 does not violate

the triangle inequality any longer.
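The effect of the two choices for c(µ, x) can be checked exhaustively on a small alphabet. In this sketch the four-letter alphabet and the symbol `?` for µ are stand-ins chosen for illustration.

```python
from itertools import product

SIGMA = "ACGT"      # simplified alphabet for the check
MU = "?"            # stand-in symbol for the missing value

def make_cost(missing_cost):
    """Character cost with c(mu, x) = missing_cost and c(mu, mu) = 0."""
    def c(x, y):
        if x == MU and y == MU:
            return 0.0
        if x == MU or y == MU:
            return missing_cost
        return 0.0 if x == y else 1.0
    return c

def satisfies_triangle_inequality(c):
    """Check c(x, z) <= c(x, y) + c(y, z) for every triple of characters."""
    symbols = SIGMA + MU
    return all(
        c(x, z) <= c(x, y) + c(y, z) for x, y, z in product(symbols, repeat=3)
    )

print(satisfies_triangle_inequality(make_cost(0.0)))          # zero-cost imputation
print(satisfies_triangle_inequality(make_cost(0.5 + 1e-5)))   # cost 0.5 + epsilon
```

With a zero imputation cost, two different characters x and y can be connected through µ at total cost 0, which is below c(x, y) = 1; with cost 0.5 + ε, the detour through µ costs slightly more than 1.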

3.5.6 Missing value imputation

The original approach of representing lineage tree reconstruction as a

Steiner tree problem does not consider the case of missing values in the

input sequences. However, for single-cell data, this is important. We

therefore designed a dynamic program that finds internal node sequences

with non-missing values. As a result, missing values may occur only in

input sequences. The translation of the missing value imputation task in

our model would be the task of filling missing values with some characters

from the alphabet Σ to minimize the cost function. To impute missing

values after finding the lineage tree, we replace each missing value in a node

with the most abundant character found within its neighbors.
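The neighbor-based imputation rule can be sketched as follows. The dictionary-based tree representation, the `?` symbol for missing values, and the handling of positions where no neighbor has an observed character (left missing) are assumptions of this sketch.

```python
from collections import Counter

MU = "?"   # stand-in symbol for the missing value

def impute(seqs, adjacency):
    """Replace each missing character with the most common neighbor character."""
    result = {}
    for node, seq in seqs.items():
        filled = []
        for i, ch in enumerate(seq):
            if ch == MU:
                votes = Counter(
                    seqs[nb][i] for nb in adjacency[node] if seqs[nb][i] != MU
                )
                # Leave the position missing if no neighbor offers a character.
                ch = votes.most_common(1)[0][0] if votes else MU
            filled.append(ch)
        result[node] = "".join(filled)
    return result

seqs = {"s": "?AC", "u": "CAC", "v": "CAC", "w": "GAC"}
adjacency = {"s": ["u", "v", "w"], "u": ["s"], "v": ["s"], "w": ["s"]}
print(impute(seqs, adjacency)["s"])
```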


Algorithm 1 Pseudocode of the main part of the Scelestial algorithm

Input:
    Samples g_i for 1 ≤ i ≤ n. Every sample g_i is a string of length m.
    A character cost function c as defined in Section 3.5.5.
    k: Degree of restriction of restricted Steiner trees.
Output:
    The lineage tree T = (V_T, E_T).
    A partial function seq : V_T → Σ^m that assigns sequences with length m to tree nodes.

M ← Initialize({g_1, ..., g_n})
stack ← EnumerateRestrictedTrees({g_1, ..., g_n}, k, M)
T ← ConstructTree(stack, M)
T ← Impute({g_1, ..., g_n}, T)

Algorithm 2 Initialize function (part of the Scelestial algorithm)

function ĉ(a, b)
    return Σ_{i=1}^{m} c(a[i], b[i])
end function

function Initialize({g_1, ..., g_n})
    Build a complete graph G on the g_i.
    Assign weight ĉ(g_i, g_j) to the edges of G.
    Let M = MST(G).
    Let cost(e) be the cost of edge e for e ∈ E_M.
    return M
end function
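As an illustration of the initialization step, the following sketch builds the minimum spanning tree over the samples with Prim's algorithm. The choice of Prim's algorithm, the function names, and the `?` symbol for missing values are assumptions; the pseudocode above only requires some MST of the complete graph.

```python
MU = "?"   # stand-in symbol for the missing value

def c(x, y, eps=1e-5):
    """Character cost: 0 if equal, 0.5 + eps if one side is missing, else 1."""
    if x == y:
        return 0.0
    if x == MU or y == MU:
        return 0.5 + eps
    return 1.0

def c_hat(a, b):
    """Sequence cost: sum of character costs over all sites."""
    return sum(c(x, y) for x, y in zip(a, b))

def initialize(samples):
    """Prim's algorithm on the complete graph; returns edges (i, j, weight)."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(samples):
        w, i, j = min(
            (c_hat(samples[i], samples[j]), i, j)
            for i in in_tree
            for j in range(len(samples))
            if j not in in_tree
        )
        in_tree.add(j)
        edges.append((i, j, w))
    return edges

print(initialize(["CAC", "GAG", "GCC", "?AC"]))
```

The complete-graph scan makes this sketch cubic in the number of samples, which is adequate for illustration only.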

Algorithm 3 EnumerateRestrictedTrees function (part of the Scelestial algorithm)

function EnumerateRestrictedTrees({g_1, ..., g_n}, k, M)
    for all K ⊂ {g_1, ..., g_n} with 3 ≤ |K| ≤ k do
        for all tree topologies τ with K as its leaves do
            Fill internal nodes of τ with sequences that minimize the cost of the tree t.
            Let cost(t) be the sum of the costs of the edges of t.
            Let M' = M + t.
            Let β be the heaviest edges from M whose removal leaves no cycle in M'.
            Let cost(β) be the sum of the costs of the edges in β.
            Let gain = cost(β) − cost(t)
            if gain > 0 then
                Let M_rep = ∅
                for all edges e in t do
                    Find the vertices u and v such that if we remove e from M' − β they will be disconnected.
                    Add edge (v, u) to M_rep with the new cost cost(e) − gain
                end for
                push (t, β, M_rep) to stack
                Let M = M − β ∪ M_rep
            end if
        end for
    end for
    return stack
end function


Algorithm 4 ConstructTree function (part of the Scelestial algorithm)

function ConstructTree(stack, M)
    Let T be an empty tree
    for all g_i do
        Add vertex v_i to T.
    end for
    Let seq be a function with an empty domain.
    Let E = M
    while stack is not empty do
        pop (t, β, M_new) from stack
        UpdateTree(T, E, M, t, β, M_new)
    end while
    for all edges e of M do
        if adding e to T does not make a cycle then
            add e to T
        end if
    end for
    return T
end function

function UpdateTree(T, E, M, t, β, M_new)
    Let E = E ∪ β
    if M_new ⊆ M then
        add t to T
        extend seq function to internal nodes of t
    else
        Let E = E − M_new
        Let M = MST(E)
    end if
end function

Algorithm 5 Impute function (part of the Scelestial algorithm)

function Impute({g_1, ..., g_n}, T)
    for all samples g_i do
        Add vertex g_i to T and add edge (g_i, v_i) to T.
        Find the minimum-cost sequence for vertex v_i.    ▷ This sequence serves as imputation for g_i.
    end for
    return T
end function


Fig 9: Illustration of the Scelestial algorithm. The panels show an example with the samples CAC, GAG, GCC, and ?AC and the inferred internal node GAC, and describe the following steps.

1 - Initialize: Find the minimum spanning tree (MST) between input nodes.

2 - Enumerate Restricted Trees: For every k-subset of the samples K, and every tree topology t with K as its leaves, perform the following steps:

2A - Fill internal nodes of t: Find the best evolutionary tree τ with K as its leaves by a dynamic program.

2B - Find bridges: Find bridges β, which are edges that should be removed from the spanning tree between S to avoid cycles after adding τ to the tree.

2C - Project τ onto S: Let gain = cost(β) − cost(t). If gain > 0, push t to the stack and add the projection of the bridges on K (weight of new edge = weight of bridge − gain).

3 - Construct Tree: For every tree t in the stack, if it is compatible with the current tree, update the tree by adding t.

4 - Impute: Replace each node which has missing values with a fully imputed one, which minimizes the cost of the tree.


4 Discussion and conclusions

Inference of a tumor’s evolutionary history is a crucial step towards under-

standing of the common patterns of cancer evolution. There are a range

of methods available for phylogenetic inference with single-cell datasets.

For optimal performance, many of these probabilistic approaches require

substantial time for optimization (e.g., using MCMC). The increasing sample

sizes of single-cell datasets emphasize the need for fast and scalable

methods in this domain that also maintain high accuracy [29].

Here, we describe Scelestial, a computationally lightweight and accurate

method for lineage tree reconstruction from single-cell variant calls. The

method is based on a Steiner tree algorithm with a known approximation

guarantee, which we adapted to lineage tree reconstruction for single-cell

data with missing values by using a variation of the Steiner tree problem

called the group Steiner tree. This problem is a special case of the group

Steiner problem, in which the groups are sub-hypercubes of the original

graph. To achieve a fast algorithm, we further adapted the group Steiner

tree algorithm through modeling a hypercube corresponding to the missing

values with one representative vertex. To facilitate the use of Scelestial, we

provide an implementation as a freely available R package. From another

point of view, Scelestial could be considered as a generalization of the

neighbor-joining method. At each step, the neighbor-joining method finds

the two most similar elements (samples) and merges them to build a tree. A

generalization of this idea may consider more than two samples for merging

in each step. Determining a good objective function for ranking three sample

candidates is not trivial. Scelestial, based on Berman’s algorithm [24],

provides a generalized guaranteed neighbor-joining method.

We evaluated Scelestial’s performance with a diverse set of test cases

similar to modern single-cell datasets. The datasets were generated with

various ranges of clones, false positives, false negatives, and missing values,

all of which were derived from real single-cell datasets [25,30]. The simulated

data were produced by a tumor simulator emulating the process of tumor


growth via mutation and proliferation. In this way, data for different tumors

with various parameters were generated to assess computational methods

over a wide range of data types.

On these benchmark datasets, we compared Scelestial with seven other

state-of-the-art phylogeny reconstruction methods, namely BitPhylogeny [4],

OncoNEM [5], SCITE [6], SASC [8], SCIPhI [10], SiFit [7], and SiCloneFit

[11]. As the comparisons were not straightforward, we excluded B-SCITE

[12], which uses a combination of single-cell and bulk data, as well as

the approach of Kim and Simons [3], where the input is hard-coded, and

PhISCS [13], as it does not infer a lineage tree. Of the methods using the k-

Dollo assumption, namely SASC [8], which is based on simulated annealing,

and SPhyR [9], which is based on integer programming, we included SASC

in the comparison. For the assessment of lineage tree quality, we applied

two commonly used metrics from population genetics and .

In this comparison, Scelestial performed best both at reconstructing the ground

truth tree’s topology and in the similarity of the inferred tree to the ground

truth tree when branch lengths were considered.

When the methods were applied to cancer data, only Scelestial inferred

lineage trees that separated all cancerous cells from normal cells for all three

datasets, except for one cell. SiCloneFit had the second-best accuracy with

two misplaced cells, and OncoNEM was the only other method that did

not mix cancerous and normal cells in one single clone, although it mixed

cancerous and normal cells in the evolutionary tree. The run time analysis

showed Scelestial to perform up to two orders of magnitude faster than the

other methods when the default settings were used. This is particularly

important, as the number of single-cell genomes published in individual

studies continues to be on the rise, as are multi-dataset meta-studies, as

in [1].

Overall, Scelestial substantially improves lineage tree reconstruction

from single-cell variant calls across key criteria such as the scalability, run

times, and accuracy of lineage tree reconstruction on both real and simu-

lated data. Cell lineage reconstructions based on diverse tumor datasets


in combination with massive data gathering, resulting from fast advancing

technologies, provide a better understanding of the evolutionary landscape

and the associated mutations of tumors, and may also indicate the dependencies

between them [31–33]. These factors may make Scelestial instrumental

in furthering our understanding of the mutational landscape and the mech-

anisms of cancer formation and survival, as omics technologies continue to

thrive. Furthermore, the results of this paper can be seen as a case study

for translating concepts from theoretical computer science into advances in

computational biology.

Acknowledgments

This work was funded by a Helmholtz Incubator grant (Sparse2Big ZT-I-

0007).

References

1. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD,

et al. Eleven grand challenges in single-cell data science.

Genome biology. 2020;21(1):1–35.

2. Kuipers J, Jahn K, Beerenwinkel N. Advances in understanding

tumour evolution through single-cell sequencing. Biochimica et Bio-

physica Acta (BBA)-Reviews on Cancer. 2017;1867(2):127–138.

3. Kim KI, Simon R. Using single cell sequencing data to model the

evolutionary history of a tumor. BMC bioinformatics. 2014;15(1):27.

4. Yuan K, Sakoparnig T, Markowetz F, Beerenwinkel N. BitPhylogeny:

a probabilistic framework for reconstructing intra-tumor phylogenies.

Genome biology. 2015;16(1):36.

5. Ross EM, Markowetz F. OncoNEM: inferring tumor evolution from

single-cell sequencing data. Genome biology. 2016;17(1):69.


6. Jahn K, Kuipers J, Beerenwinkel N. Tree inference for single-cell

data. Genome biology. 2016;17(1):86.

7. Zafar H, Tzen A, Navin N, Chen K, Nakhleh L. SiFit: inferring

tumor trees from single-cell sequencing data under finite-sites models.

Genome biology. 2017;18(1):178.

8. Inferring Cancer Progression from Single-cell Sequencing while Al-

lowing Mutation Losses; 2018.

9. El-Kebir M. SPhyR: tumor phylogeny estimation from single-cell se-

quencing data under loss and error. Bioinformatics. 2018;34(17):i671–

i679.

10. Singer J, Kuipers J, Jahn K, Beerenwinkel N. Single-cell mutation

identification via phylogenetic inference. Nature communications.

2018;9(1):1–8.

11. Zafar H, Navin N, Chen K, Nakhleh L. SiCloneFit: Bayesian in-

ference of population structure, genotype, and phylogeny of tumor

clones from single-cell genome sequencing data. Genome research.

2019;29(11):1847–1859.

12. Malikic S, Jahn K, Kuipers J, Sahinalp SC, Beerenwinkel N. Inte-

grative inference of subclonal tumour evolution from single-cell and

bulk sequencing data. Nature communications. 2019;10(1):1–12.

13. Malikic S, Mehrabadi FR, Ciccolella S, Rahman MK, Ricketts C,

Haghshenas E, et al. PhISCS: a combinatorial approach for subperfect

tumor phylogeny reconstruction via integrative use of single-cell and

bulk sequencing data. Genome research. 2019;29(11):1860–1877.

14. Catanzaro D, Ravi R, Schwartz R. A mixed integer linear programming

model to reconstruct phylogenies from single nucleotide

polymorphism haplotypes under the maximum parsimony criterion.

Algorithms for Molecular Biology. 2013;8(1):3.

15. Sridhar S, Lam F, Blelloch GE, Ravi R, Schwartz R. Mixed integer

linear programming for maximum-parsimony phylogeny inference.

IEEE/ACM Transactions on Computational Biology and Bioinformatics.

2008;5(3):323–331.

16. Caldwell AE, Kahng AB, Mantik S, Markov IL, Zelikovsky A. On

wirelength estimations for row-based placement. IEEE Transactions

on Computer-Aided Design of Integrated Circuits and Systems.

1999;18(9):1265–1278.

17. Peyer S. Shortest paths and Steiner trees in VLSI routing [PhD thesis].

Research Institute for Discrete Mathematics,

University of Bonn; 2007.

18. Gilbert EN, Pollak HO. Steiner minimal trees. SIAM Journal on

Applied Mathematics. 1968;16(1):1–29.

19. Cieslik D. The Steiner ratio. vol. 10. Springer Science & Business

Media; 2013.

20. Du DZ, Smith J, Rubinstein JH. Advances in Steiner trees. vol. 6.

Springer Science & Business Media; 2013.

21. Ivanov AO, Tuzhilin AA. Minimal Networks: The Steiner Problem

and Its Generalizations. CRC Press; 1994.

22. Prömel HJ, Steger A. The Steiner tree problem: a tour through

graphs, algorithms, and complexity. Springer Science & Business

Media; 2012.

23. Alon N, Chor B, Pardi F, Rapoport A. Approximate maximum parsimony

and ancestral maximum likelihood. IEEE/ACM Transactions

on Computational Biology and Bioinformatics. 2008;7(1):183–187.

24. Berman P, Ramaiyer V. Improved approximations for the Steiner

tree problem. Journal of Algorithms. 1994;17(3):381–408.

25. Leung ML, Davis A, Gao R, Casasent A, Wang Y, Sei E, et al. Single-cell

DNA sequencing reveals a late-dissemination model in metastatic

colorectal cancer. Genome research. 2017;27(8):1287–1299.

26. Li Y, Xu X, Song L, Hou Y, Li Z, Tsang S, et al. Single-cell sequencing

analysis characterizes common and cell-lineage-specific mutations in

a muscle-invasive bladder cancer. GigaScience. 2012;1(1):12.

27. Chlebík M, Chlebíková J. The Steiner tree problem on graphs: Inapproximability

results. Theoretical Computer Science. 2008;406(3):207–214.

28. Borchers A, Du DZ. The k-Steiner Ratio in Graphs. SIAM Journal

on Computing. 1997;26(3):857–869.

29. Lähnemann D, Köster J, Fischer U, Borkhardt A, McHardy AC,

Schönhuth A. ProSolo: Accurate Variant Calling from Single Cell

DNA Sequencing Data. bioRxiv. 2020.

30. Nowell PC. The clonal evolution of tumor cell populations. Science.

1976;194(4260):23–28.

31. Croce CM. miRNAs in the spotlight: understanding cancer gene

dependency. Nature medicine. 2011;17(8):935–936.

32. Tsherniak A, Vazquez F, Montgomery PG, Weir BA, Kryukov

G, Cowley GS, et al. Defining a cancer dependency map. Cell.

2017;170(3):564–576.

33. Dempster JM, Pacini C, Pantel S, Behan FM, Green T, Krill-Burger

J, et al. Agreement between two large pan-cancer CRISPR-Cas9 gene

dependency data sets. Nature communications. 2019;10(1):1–14.
