bioRxiv preprint doi: https://doi.org/10.1101/2021.05.24.445405; this version posted May 24, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
Scelestial: fast and accurate single-cell lineage tree inference based on a Steiner tree approximation algorithm
M.-H. Foroughmand-Araabi1, S. Goliaei1, A. C. McHardy1*
1 Department of Computational Biology of Infection Research, Helmholtz
Centre for Infection Research, Braunschweig, Germany
* Corresponding author ([email protected])
Abstract
Single-cell genome sequencing provides a highly granular view of biological
systems but is affected by high error rates, allelic amplification bias, and
uneven genome coverage. This creates a need for data-specific computational
methods, for purposes such as cell lineage tree inference. The objective
of cell lineage tree reconstruction is to infer the evolutionary process that
generated a set of observed cell genomes. Lineage trees may enable a
better understanding of tumor formation and growth, as well as of organ
development for healthy body cells. We describe a method, Scelestial,
for lineage tree reconstruction from single-cell data, which is based on an
approximation algorithm for the Steiner tree problem and is a generalization
of the neighbor-joining method. We adapt the algorithm to efficiently select
a limited subset of potential sequences as internal nodes, in the presence
of missing values, and to minimize cost by lineage tree-based missing
value imputation. In a comparison against seven state-of-the-art single-cell
lineage tree reconstruction algorithms - BitPhylogeny, OncoNEM, SCITE,
SiFit, SASC, SCIPhI, and SiCloneFit - on simulated and real single-cell
tumor samples, Scelestial performed best at reconstructing trees in terms
of accuracy and run time. Scelestial has been implemented in C++. It is
also available as an R package named RScelestial.
May 12, 2021
1 Introduction
Lineage trees describe the evolutionary process that created a sample of
clonally related entities, such as individual cells within an organ or a tumor.
A lineage tree also suggests the constitution of a founder cell, as well as
of its descendants, and the evolutionary events that occurred during
lineage formation. Single-cell genomic data provide a highly resolved view
of cellular evolution, substantially more than bulk genome sequencing [1].
However, they also come with high rates of missing values and distorted
allele frequencies arising from amplification biases and sequencing errors [2].
Lineage tree reconstruction from single-cell data therefore requires specific
considerations for handling missing values and errors. Specific approaches
to tackle this problem include the method of Kim and Simon [3],
BitPhylogeny [4], OncoNEM [5], SCITE [6], SiFit [7], SASC [8], SPhyR [9],
SCIPhI [10], SiCloneFit [11], B-SCITE [12], and PhISCS [13]. Kim and
Simon [3] make use of the infinite site assumption and infer a “mutation
tree” by calculating probabilities for the ordering of mutations in a lineage
and then finding the maximum spanning tree in the resulting graph.
BitPhylogeny [4] defines a graphical model of a stochastic process
that generates the given input data and uses Markov Chain
Monte Carlo (MCMC) sampling to search for the best lineage tree
model and the associated parameters. OncoNEM [5] and SCITE [6] infer a
phylogenetic tree over all the samples under a maximum likelihood model.
OncoNEM uses a heuristic search, whereas SCITE uses MCMC sampling to
find a maximum likelihood tree. SiCloneFit [11] infers subclonal structures
and a phylogeny via a Bayesian method under the finite site assumption.
SASC [8] and SPhyR [9] consider the k-Dollo model, which is more relaxed
than the infinite site assumption. In the k-Dollo model, a mutation
can be gained once in a tumor but may be lost up to k times afterwards.
SASC uses simulated annealing and SPhyR uses k-means to find the best
k-Dollo evolutionary tree. B-SCITE [12] and PhISCS [13] infer subclonal
evolution, based on a combination of single-cell and bulk sequencing data.
B-SCITE uses an MCMC method to maximize a likelihood function for
trees and sequencing data. PhISCS formulates tree reconstruction as a
combinatorial and mathematical programming problem, and uses standard
mathematical programming solvers to find the solution. Integer or integer
linear programming formulations for phylogenetic tree reconstruction are also
available [14,15].
The Steiner tree problem is a classic problem in theoretical computer
science with a wide range of applications in various areas including very-
large-scale integration design [16], network routing [17], civil engineering [18],
and other areas [19–22]. Given a weighted graph and some vertices marked
as terminals, the Steiner tree problem is the problem of finding the minimum
weighted tree that connects all the terminal vertices, with non-terminal
vertices that may or may not be included in the optimal tree. Although
the Steiner tree problem is NP-hard and no polynomial-time exact solution
is known, some elegant approximation algorithms are available. They
guarantee polynomial time and a constant factor approximation ratio,
and have been used for phylogenetic reconstruction [23]. This is a great
advantage compared with sampling heuristics such as MCMC for finding
an optimal solution, which may be trapped in local optima.
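To make the problem statement above concrete, the following minimal sketch applies the classic minimum-spanning-tree heuristic for the Steiner tree problem, which is known to stay within a factor of 2 of the optimum. It is not the Berman algorithm used by Scelestial; it is a simpler, well-known stand-in chosen for illustration. The graph encoding (a dictionary of neighbor weights) and all function names are assumptions made for this example.

```python
import heapq
from itertools import combinations


def shortest_paths(graph, source):
    """Dijkstra: return (distance, predecessor) maps from source."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def kruskal(nodes, weighted_edges):
    """Minimum spanning forest of (weight, u, v) edges via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for w, u, v in sorted(weighted_edges, key=lambda e: e[0]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            chosen.append((w, u, v))
    return chosen


def steiner_mst_approx(graph, terminals):
    """Return {edge: weight} for a tree connecting all terminals."""
    terminals = list(terminals)
    terminal_set = set(terminals)
    sp = {t: shortest_paths(graph, t) for t in terminals}
    # 1) Spanning tree of the terminals under shortest-path distances.
    closure = [(sp[a][0][b], a, b) for a, b in combinations(terminals, 2)]
    union = set()
    for _, a, b in kruskal(terminals, closure):
        prev, node = sp[a][1], b
        while node != a:  # expand the closure edge into graph edges
            union.add(frozenset((node, prev[node])))
            node = prev[node]
    # 2) A spanning tree of the expanded edges removes any cycles.
    nodes = {n for edge in union for n in edge}
    weighted = [(graph[u][v], u, v) for u, v in map(tuple, union)]
    tree = {frozenset((u, v)): w for w, u, v in kruskal(nodes, weighted)}
    # 3) Repeatedly drop leaves that are not terminals.
    while True:
        degree = {}
        for edge in tree:
            for n in edge:
                degree[n] = degree.get(n, 0) + 1
        dangling = [e for e in tree
                    if any(degree[n] == 1 and n not in terminal_set for n in e)]
        if not dangling:
            return tree
        for e in dangling:
            del tree[e]


# Toy instance: terminals a, b, c and one optional hub vertex s.
graph = {
    "a": {"s": 1, "b": 3, "c": 3},
    "b": {"s": 1, "a": 3, "c": 3},
    "c": {"s": 1, "a": 3, "b": 3},
    "s": {"a": 1, "b": 1, "c": 1},
}
tree = steiner_mst_approx(graph, ["a", "b", "c"])
```

In this toy instance the non-terminal vertex s is included because routing through it is cheaper than connecting the terminals directly, which is the role that inferred internal sequences play in a lineage tree.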
Here, we describe Scelestial, a method for lineage tree reconstruction
from single-cell datasets, based on the Berman approximation algorithm for
the Steiner tree problem [24]. Our method infers the evolutionary history
for single-cell data in the form of a lineage tree and imputes the missing
values accordingly. Dealing with missing values makes the problem much
harder in theory. To overcome this difficulty, we represent a hypercube,
as used in [24], by a sequence with missing values as its center. Although
this representation imposes some cost on the results of the algorithm,
caused by displacement of the center from the best imputation, which we
do not know in advance, it helps us to obtain a fast yet accurate and robust
algorithm.
2 Results
2.1 Performance in lineage tree reconstruction on simulated data
We compared the performance of Scelestial to SCITE, OncoNEM,
BitPhylogeny, SASC (as a recent instance of k-Dollo-based methods), SCIPhI,
SiFit, and SiCloneFit. For this, we generated data with the cell evolution
simulator provided by OncoNEM and with another tumor evolution
simulator that we developed (Section 3.2). To assess the inferred trees, we
calculated their distance to the ground truth lineage trees as the normalized
pairwise distances between corresponding samples, as described in
Section 3.4.1. In addition, we calculated the similarity between the generated
trees and the ground truth by comparing tree splits (Section 3.4.2). For
run time evaluation, we executed all the methods on simulated tumor data
over a range of numbers of samples and sites (Section 2.5).
First, we evaluated the algorithms on data created with OncoNEM’s
simulator. The simulator creates samples by producing clones and sampling
from these clones (Section 3.3). It accepts the number of clones (i.e., the
number of nodes in the lineage tree), the number of sampled cells, the
number of sites, the false positive rate, the false negative rate, and the
missing value rate as parameters. We used 1.5% as the false
positive rate, 10% as the false negative rate, and 7% as the missing value rate.
These parameters are consistent with observed parameters in single-cell
data [25]. We performed tests for 50 and 100 samples, 5 and 10 clones, and
20 and 50 sites, and calculated the pairwise distance between the inferred
and ground truth trees for all methods (Table 1). From these data, Scelestial
reconstructed lineage trees with the lowest distance to the ground truth
tree in 5 out of 8 cases. According to the values in Table 1, SiFit
performed best in the three remaining cases.
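The error parameters described above can be illustrated with a small sketch that corrupts a clean binary genotype matrix (rows are cells, columns are sites). This is not the OncoNEM simulator. The function name, the use of `None` for missing entries, and the order in which missingness and errors are applied are assumptions made for illustration only.

```python
import random


def add_noise(matrix, fp_rate=0.015, fn_rate=0.10, missing_rate=0.07, seed=0):
    """Return a noisy copy of a 0/1 genotype matrix.

    Each entry is first masked as missing (None) with probability
    missing_rate. Otherwise a 0 becomes 1 with probability fp_rate
    (false positive) and a 1 becomes 0 with probability fn_rate
    (false negative).
    """
    rng = random.Random(seed)
    noisy = []
    for row in matrix:
        new_row = []
        for value in row:
            if rng.random() < missing_rate:
                new_row.append(None)  # missing observation
            elif value == 0 and rng.random() < fp_rate:
                new_row.append(1)     # false positive
            elif value == 1 and rng.random() < fn_rate:
                new_row.append(0)     # false negative
            else:
                new_row.append(value)
        noisy.append(new_row)
    return noisy


# Two cells, three sites, corrupted with the rates quoted in the text.
clean = [[0, 1, 1], [1, 0, 0]]
observed = add_noise(clean)
```

The default rates mirror the settings stated in the text (1.5% false positives, 10% false negatives, 7% missing values). Setting all rates to zero returns the clean matrix unchanged.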
Table 1: Comparison of reconstruction of ground truth lineage tree from data simulated by OncoNEM, showing the distance between the inferred trees and the ground truth for all methods across eight lineage trees. The best results among all the methods for each evolutionary tree are shown in bold.

| Method | 50 samples, 5 clones, 20 sites | 50 samples, 5 clones, 50 sites | 50 samples, 10 clones, 20 sites | 50 samples, 10 clones, 50 sites | 100 samples, 5 clones, 20 sites | 100 samples, 5 clones, 50 sites | 100 samples, 10 clones, 20 sites | 100 samples, 10 clones, 50 sites |
|---|---|---|---|---|---|---|---|---|
| BitPhylogeny | 0.8947 | 0.7949 | 0.8875 | 0.8665 | 1.056 | 0.9805 | 0.8418 | 1.1766 |
| OncoNEM | 0.9738 | 0.781 | 0.7412 | 0.8044 | 0.9783 | 0.9867 | 0.8429 | 0.9426 |
| SCITE | 1.0205 | 0.914 | 1.0122 | 1.2909 | 1.0366 | 0.8682 | 0.8536 | 1.0021 |
| SiFit | 0.9088 | 0.7383 | 0.7147 | 0.7543 | 0.9575 | **0.8603** | **0.784** | **0.861** |
| SASC | 0.8912 | 0.8212 | 0.7716 | 0.7994 | 0.9289 | 0.924 | 0.8186 | 0.9215 |
| SCIPhI | 0.9671 | 1.0897 | 0.7049 | 0.9466 | 1.0628 | 1.2485 | 1.2483 | 1.2194 |
| SiCloneFit | 1.0935 | 0.9441 | 0.8728 | 0.8433 | 1.0641 | 1.0464 | 0.809 | 0.9617 |
| Scelestial | **0.8904** | **0.703** | **0.6647** | **0.726** | **0.9241** | 0.8862 | 0.7918 | 0.8771 |

Next, we simulated 100 single-cell datasets from a solid tumor covering
a range of evolutionary time spans (i.e., a range of mutations per branch
in the resulting lineage trees), using a simulation method we implemented
(Section 3.2, results in Fig. 1). The simulated data provide a granular
simulation of tumor evolution and single-cell sequence data. Tumor growth
was simulated with 50 samples and 200 sites. The other parameters (missing
value rate, false positive rate, and false negative rate) were set as in the
OncoNEM simulation. The main difference between our simulation and the
OncoNEM simulation tool was that we considered the relative preference
between clones.
The lineage tree reconstruction error was calculated as the sum of all
distance errors over every pair of samples (Section 3.4.1). The lower the
lineage tree reconstruction error, the more similar the tree is to the ground
truth one. In addition, we determined the split similarity between the
resulting trees and the ground truth tree (Section 3.4.2). A higher split
similarity measure reflects a better reconstruction of the tree’s topology.
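As an illustration of comparing trees by their splits, the sketch below computes the fraction of non-trivial sample bipartitions of a ground truth tree that also occur in an inferred tree. The exact measure used in the paper is defined in Section 3.4.2, which is not reproduced here, so this normalization, the edge-list tree encoding, and the function names are assumptions rather than the authors' definition.

```python
def splits(edges, samples):
    """Return the non-trivial sample bipartitions induced by tree edges.

    Each split is stored as the side that does not contain a fixed
    anchor sample, so the two sides of one edge map to one key.
    """
    samples = set(samples)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    anchor = min(samples)
    result = set()
    for u, v in edges:
        # Nodes reachable from u without crossing the edge (u, v).
        seen, stack = {u}, [u]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen and {node, nxt} != {u, v}:
                    seen.add(nxt)
                    stack.append(nxt)
        side = frozenset(seen & samples)
        if anchor in side:
            side = frozenset(samples - side)
        # Skip splits that separate at most one sample from the rest.
        if 1 < len(side) < len(samples) - 1:
            result.add(side)
    return result


def split_similarity(true_edges, inferred_edges, samples):
    """Fraction of ground truth splits recovered by the inferred tree."""
    truth = splits(true_edges, samples)
    inferred = splits(inferred_edges, samples)
    if not truth:
        return 1.0  # convention chosen here: nothing to recover
    return len(truth & inferred) / len(truth)


samples = ["a", "b", "c", "d"]
truth = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
same = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"), ("d", "q")]
other = [("a", "p"), ("c", "p"), ("p", "q"), ("b", "q"), ("d", "q")]
```

Internal node names do not matter in this comparison, because only the partition of the observed samples on either side of an edge is compared.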
Overall, when considering the lineage tree reconstruction error, none of
the resulting trees was highly similar to the ground truth tree (Fig. 1a).
Nevertheless, there were differences in the accuracy of the inferred trees.
With these data, considering the average performance measure of 10 datasets
for each mutation rate, Scelestial performed best in 10 out of 10 cases
regarding the split similarity measure. Scelestial also performed best for
the distance measure in 9 of 10 cases; SiCloneFit was the best for the
remaining case. The performance of Scelestial, SiCloneFit, and OncoNEM
increased with the number of mutations (Fig. 1a). This increase in the
mutation rate had a very large effect on the performance of SiCloneFit and
OncoNEM, whereas the performance of SASC and SiFit was almost stable
across changes in the number of mutations between clones.

Fig 1: Comparison of the methods for single-cell lineage tree reconstruction on simulated tumor data (BitPhylogeny, OncoNEM, SCITE, Scelestial, SASC, SCIPhI, SiFit, and SiCloneFit). (a) Sample distance measure: lineage tree reconstruction error plotted against the average number of mutations on each edge of the lineage tree; lower values show a better reconstruction. (b) Split similarity measure: split similarity of the tree plotted against the average number of mutations on each edge of the lineage tree; it represents the similarity between the reconstructed tree and the ground truth tree, making higher values favorable.

Overall, the relative performance of lineage tree reconstruction for
different methods was similar to that observed before (Table 1). This also
indicates that tree inference is similarly difficult for data generated by the
OncoNEM simulator and by our simulation method.
2.2 Robustness across varying data properties
We next evaluated the robustness of Scelestial to varying dataset properties
over 420 datasets with varying parameters that we generated for this
purpose with our tumor simulator. We fixed the default configuration
in our simulation as 50 samples, 200 sites, 5 evolutionary nodes in the
simulated evolutionary tree, 1.5% as the false positive rate, 10% as the
false negative rate, and 7% as the missing value rate. We set the average
number of mutations between nodes in the evolutionary tree to be 20. For
evaluating the robustness of Scelestial with respect to one parameter, we
set the other parameters to their default values and evaluated Scelestial
over a range of the parameter under study. We used the following ranges:
false positive rate: 0–50%; false negative rate: 0–30%; missing value rate:
0–50%; samples: 5–200; sites: 20–1000.
Scelestial was not very sensitive to variation in the tested parameters
(Fig. 2, Fig. 3) in terms of their influence on the tree distance error and
topological accuracy measured by split similarity. As expected, performance
decreased with increasing missing value, false negative, and false positive
rates (Fig. 3). For datasets including more than 25 sites, the sample distance
measure was stable.
2.3 Case study: a single-cell dataset from a muscle-invasive bladder tumor
We inferred lineage trees for single-cell data of a muscle-invasive bladder
tumor [26] with Scelestial, OncoNEM, SCITE, and BitPhylogeny. These
data were obtained by single-cell exome sequencing of 44 tumor cells as
well as exome sequencing of normal cells, and included 443 variable sites.
We converted this to the required input matrix for OncoNEM, SCITE,
and BitPhylogeny, in which only the reference state, variant state, and
missing values were specified. In the resulting matrix, 27% of the elements
represented the reference state, 17% represented variant states, and 55%
represented missing values. Unlike the original study [5], we used all
cancerous cells as well as 13 normal cells for lineage tree reconstruction, to
see whether a distinct cancer cell lineage would become apparent in the
inferred trees. All methods except SCITE removed the redundant inner
nodes of the trees (i.e., inner nodes with at most one child). For the SCITE
method, to obtain a comparably small tree, we compressed the tree in the
same manner.
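The compression step described above can be sketched as follows for a rooted tree. The representation (a dictionary mapping each node to its list of children, plus a set of nodes that carry observed samples) and the function name are assumptions for illustration, and the sketch interprets "redundant" as unobserved nodes with no child or exactly one child. It is not the code used in the study.

```python
def compress(children, root, observed):
    """Remove unobserved nodes that have at most one child.

    children: dict mapping a node to the list of its child nodes.
    observed: set of nodes to which at least one sample is assigned.
    Returns (new_root, new_children). An unobserved node with exactly
    one surviving child is spliced out, and an unobserved node with no
    surviving child is dropped.
    """
    new_children = {}

    def visit(node):
        """Return the node that replaces `node`, or None if dropped."""
        kept = []
        for child in children.get(node, []):
            replacement = visit(child)
            if replacement is not None:
                kept.append(replacement)
        if node not in observed:
            if not kept:
                return None      # empty leaf: drop it
            if len(kept) == 1:
                return kept[0]   # unary node: splice it out
        new_children[node] = kept
        return node

    return visit(root), new_children


# Hypothetical tree: r -> N, u; u -> w; w -> C, e (e carries no sample).
tree = {"r": ["N", "u"], "u": ["w"], "w": ["C", "e"]}
new_root, new_tree = compress(tree, "r", observed={"N", "C"})
```

In this hypothetical example the chain u, w and the empty leaf e are removed, so the observed node C becomes a direct child of r.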
In the trees inferred by BitPhylogeny, SCITE, SiFit, SASC, and SCIPhI
(Fig. 4), normal and cancerous cells were not separated into distinct subtrees.
Fig 2: Robustness of Scelestial to variation in the properties of ground truth lineage trees, in terms of the sample distance between the inferred and ground truth trees. Panels show the sample-distance error as a function of the missing value rate, the number of sites, the sample size, the false negative rate, and the false positive rate.

Fig 3: Robustness of Scelestial to variation in the properties of ground truth lineage trees, in terms of topological similarity between the inferred and ground truth trees. Panels show the split similarity as a function of the missing value rate, the number of sites, the sample size, the false negative rate, and the false positive rate.

In some cases, they were even assigned to the same clone, which seems
unlikely as an evolutionary scenario. In contrast, the trees of OncoNEM
and Scelestial effectively separated normal and cancerous cells. OncoNEM
placed all the normal cells in one clone and Scelestial separated all the
cancerous cells and normal cells into distinct subtrees.
The evolutionary trees (Fig. 4) returned by all the algorithms except
Scelestial were directed. The most plausible scenario for cancer evolution is
the rooting of a cancer lineage close to the root or to some internal lineage
of normal cells, instead of cancer cells being placed as ancestral to normal
cells. OncoNEM’s tree includes a normal cell as a root, followed by some
cancerous clones placed between the descendant normal cells and this
root.
2.4 Case study 2: a single-cell dataset from metastatic colorectal cancer
We inferred lineage trees with all methods on single-cell data of colorectal
cancer from two patients [25]. In this dataset, 178 single-cell samples were
gathered from the first patient: 117 samples from normal cells, 33 samples
from the primary tumor, and 28 samples from metastasis of the tumor to
the liver. The data represent 16 genomic sites with 6.7% missing values.
In the reconstructed lineage trees for the first patient (Fig. 5), OncoNEM,
BitPhylogeny, SCIPhI, and SCITE suggest clones including both normal
and cancerous cells, which, as outlined above, is unrealistic. SASC and
SiFit produced lineage trees with several normal and cancerous cells that
evolved from cancerous clones, while Scelestial again separated the two types
of cells into distinct sub-clades: except for three metastatic cell samples,
Scelestial suggested evolution of the primary tumor from normal cells and
evolution of the metastatic tumor from the primary tumor. Like Scelestial,
SiCloneFit placed the three metastatic samples that were misplaced in the
Scelestial tree between normal and primary tumor cells. However, the tree
generated by SiCloneFit suggested independent evolution of the metastatic
cells and primary tumor cells, which seems less likely, though not impossible.
Fig 4: Lineage trees inferred by the different methods on a single-cell dataset from a muscle-invasive bladder tumor. Panels: (a) SCIPhI, (b) OncoNEM, (c) Scelestial, (d) SiCloneFit, (e) BitPhylogeny, (f) SiFit, (g) SCITE, (h) SASC. Nodes in these trees represent clones (i.e., inferred, evolutionary genomes ancestral to the observed single-cell genomes). Blue nodes represent clones containing only normal cells, orange nodes represent clones containing only cancerous cells, and brown nodes represent clones containing both normal and cancerous cells. Text within the nodes indicates the identification number of the sample cell(s) assigned to the corresponding clone. White nodes represent nodes with no observed sample.

In addition, SiCloneFit placed one metastatic cell (M-176) closer to the primary
tumor. Taken together, Scelestial performed best at separating the normal
cells, primary tumor, and metastatic cells. SiCloneFit was almost as good,
with only one more seemingly misplaced cell.
From the second patient, 181 single-cell samples were taken: 113 normal
cells, 29 primary tumor cells, and 39 metastatic cells. Thirty-six features
were extracted by evaluation of the cancer mutations of the samples. Of all
sites, 7.7% were missing values.
All algorithms separated normal from cancerous cells less well than
for the first patient (Fig. 6). Similar to their trees for the first patient,
OncoNEM, BitPhylogeny, SCIPhI, and SCITE suggested several mixed
clones including normal and cancerous cells. SASC and SiFit produced
lineage trees with several normal and cancerous cells evolving from each
other. In SiFit’s tree, fewer normal cells evolved from cancerous
cells than in SASC’s tree, but there were many distinct subtrees with
cancerous cells descending from normal cells. Except for one cell (M-175),
all cancerous cells were separated from normal cells into one subtree in
Scelestial’s tree. In the SiCloneFit tree, the cancerous cells M-175 and
M-178 were also misplaced outside of a cancer lineage and closer to normal
cells. Metastatic and primary tumor cells were not well separated from
one another in the trees generated by Scelestial as well as by SiCloneFit.
Overall, Scelestial and SiCloneFit created the most realistic lineage trees of
all the methods analyzed.
2.5 Run time efficiency
We compared the run times of the eight methods with default settings
on the 110 datasets generated by the OncoNEM simulator (Fig. 7). We
simulated data with 10 mutations on average at each evolutionary step, a
range of 20 to 100 samples, and a range of 50 to 200 sites. We only included
pairs of methods and cases for which the method’s run time was less than
15 minutes.
SCITE, SCIPhI, BitPhylogeny, and Scelestial were the only methods
Fig 5: Lineage trees inferred by the different methods on a single-cell dataset from the first colorectal cancer patient. Panels: (a) SiFit, (b) SCITE, (c) SASC. Node categories in the legend: normal samples, primary tumor samples, metastasis tumor samples, mixed normal/cancerous/metastasis samples, and no sample.

May 12, 2021 15/39 bioRxiv preprint doi: https://doi.org/10.1101/2021.05.24.445405; this version posted May 24, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
N-1 N-2
0.0 N-2
0.0 N-3 0.0 N-3
0.0 N-4 0.0 N-4
0.0 N-5 0.0 N-5
0.0 N-6 0.0 N-6 0.0 N-7
0.0 N-7 0.0 N-8
0.0 N-9 0.0 N-8
0.0 N-10 0.0 N-9 0.0 N-11
0.0 N-10 0.0 N-12
0.0 0.0 N-11 N-13
0.0 N-14 0.0 N-12 0.0 N-15
0.0 N-13 0.0 N-16
0.0 N-14 0.0 N-17
0.0 N-18 0.0 N-15
0.0 N-19
0.0 N-16 0.0 N-20
0.0 N-17 0.0 N-21
0.0 N-22 0.0 N-18
0.0 N-23
0.0 N-19 0.0 N-24
0.0 N-20 0.0 N-25
0.0 N-26 0.0 N-21
0.0 N-27 0.0 N-22 0.0 N-28
0.0 N-23 0.0 N-29
0.0 N-30 0.0 N-24
0.0 N-31 0.0 N-25 0.0 N-32
0.0 N-26 0.0 N-33
0.0 0.0 N-27 N-34
0.0 N-35 0.0 N-28 0.0 N-36
0.0 N-29 0.0 N-37
0.0 N-30 0.0 N-38
0.0 N-39 0.0 N-31
0.0 N-40
0.0 N-32 0.0 N-41
0.0 N-33 0.0 N-42
0.0 N-43 0.0 N-34
0.0 N-44
0.0 N-35 0.0 N-45
0.0 N-36 0.0 N-46
0.0 N-47 0.0 N-37
0.0 N-48 0.0 N-38 0.0 N-49
0.5 N-39 0.0 N-50
0.0 E-51 0.5 N-40
0.0 E-52 0.5 N-41 0.0 E-53
0.5 N-42 0.0 N-54
0.0 0.5 N-43 N-55
0.0 N-56 0.5 N-44 0.0 N-57
0.5 N-45 0.0 N-58 0.0
0.5 N-46 0.0 N-59 0.0 0.0 N-60 0.5 N-47 0.0 N-61 0.0 0.5 N-48 N-62 0.0
0.5 N-49 N-63 0.0
N-64 0.0 0.5 N-50
N-65 0.0 0.5 N-54 N-66 0.0
1.0 N-55 N-67 0.0
0.0 N-68 N-56 0.0
N-69 0.0 0.0 N-57 N-70 0.0 0.5 N-58 N-71 0.0 0.5 0.1 N-72 0.5 N-59 0.0 0.0 N-68 N-1 0.5 N-73 0.0 0.5 N-60 N-74 0.0 0.5
N-61 N-75 1.0 0.0
N-76 N-62 0.0 1.0 N-77 0.0 N-63 1.5 N-78 0.0 N-64 1.0 N-79 0.0
N-80 1.5 N-65 0.0
N-81 0.0 0.5 N-66 N-82 0.0 N-69 0.5 N-83 0.0
N-84 0.5 N-70 0.0
N-85 0.0 0.5 N-71 N-86 0.0
0.5 N-72 N-87 0.0
N-73 N-88 1.0 0.0
N-89 0.0 1.0 N-74 N-90 0.0 2.0 N-75 N-91 0.0
N-76 N-92 1.5 0.0
N-93 0.0 2.0 N-77
N-94 0.0 P-131 1.5 N-78 N-95 0.0 0.0 P-132
N-79 N-96 1.0 0.0 0.0 P-135 N-97 1.0 N-80 0.0 0.0 P-136 N-98 0.0 N-81 1.0 0.0 N-99 P-138 0.0
N-82 0.0 0.5 N-100 P-139 0.0 0.0
N-83 P-140 0.5 0.0 N-101 0.0
0.0 1.5 N-84 N-102 0.0 N-102 0.0 P-144 1.0 0.6 0.0
0.0 N-103 P-145 1.0 N-85 0.0
N-104 P-146 0.0 0.0 1.0 N-86
N-105 P-147 0.0 0.0 0.3 1.5 N-87 0.0 N-106 P-149
1.0 N-88 0.0 N-107 P-150
1.0 N-89 0.0 N-108 P-118
N-90 0.0 1.0 0.0 N-109 P-119
0.0
N-91 0.0 N-110 P-121 0.5 0.0
0.0 0.0 0.0 N-111 P-122 0.5 N-92 0.0 0.0 P-120 0.0 N-112 P-123 2.0 N-93 0.0 0.0 P-124 0.0 N-113 P-133 0.0 N-94 1.5 0.0 0.0 P-125 0.0 N-114 P-134
1.0 N-95 0.0 P-126 0.0 N-115 M-176
N-96 0.0 2.5 N-116 P-127 M-153 0.0
1.0 N-97 N-117 0.0 0.0 P-128 0.3 0.0 M-154 0.0 0.0 0.0 P-129 1.5 N-98 0.0 0.0 0.9 M-157 M-155 0.0 0.0 P-130 N-99 0.0 0.9 0.0 0.0 1.0 M-158 0.0 P-137 0.0 0.0 1.0 N-100 M-160 P-141 0.0 M-152 N-101 0.0 1.0 1.0 P-142 0.0 0.0 M-178 2.0 P-143 0.5 N-103 N-67 P-119 M-151 P-148 0.0 0.5 N-104 P-120 0.0 M-156 M-159 2.5 0.5 0.5 N-105 P-121 0.0 M-159 M-163 0.5 1.4 M-161 0.0 M-161 N-106 P-122 0.5 0.5 1.5 M-164 1.0 0.0 M-162 1.5 0.5 N-107 P-118 P-123 0.5 3.5 0.0 M-165 M-163 2.0 1.0 0.5 1.0 M-162 N-108 P-124 P-144 1.0 0.0 0.5 2.0 M-164 1.0 2.0 1.0 1.0 1.5 1.0 M-168 M-166 M-156 M-158 M-157 M-160 0.5 1.5 0.5 0.0 M-176 M-165 2.0 N-109 P-126 0.5 2.5 2.0 0.0 0.0 M-167 M-151 1.0 P-130 1.5 0.0 M-166 1.0 3.5 4.0 1.0 1.0 N-110 N-111 M-153 P-127 P-133 0.5 1.5 1.0 0.0 1.0 P-131 1.0 0.0 M-167 0.5 P-125 1.0 M-169 M-152 P-132 1.5 1.0 1.0 P-134 0.0 1.0 N-112 0.5 1.5 P-129 P-146 1.0 M-168 1.0 1.0 0.0 P-147 1.0 M-173 M-177 M-178 1.5 1.5 P-135 2.0 N-113 P-137 M-169 1.0 0.0 1.0 1.0 0.5 M-171 M-170 0.5 0.0 M-170 N-114 3.0 P-141 P-136 0.5 1.5 M-174 P-142 1.5 0.5 M-172 1.5 0.0 M-171 1.0 N-115 P-143 P-128 1.0 1.0 M-175 2.5 P-138 0.0 M-172 1.0 P-148 P-140 1.0 N-116 1.5 1.0 0.0 M-173 P-139 2.0 N-117 M-154 P-149 P-145 0.0 M-174 3.0 0.0 1.0 2.5 0.5 E-51 E-53 1.5 M-175 P-150 0.5 M-155 E-52 M-177 (d) Scelestial (e) SiCloneFit
Fig 5: Lineage tree inferred by the different methods on a single-cell dataset from the first colorectal cancer patient.
May 12, 2021 16/39 bioRxiv preprint doi: https://doi.org/10.1101/2021.05.24.445405; this version posted May 24, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
1.0 P-145, P-148, 1.0 1.0 P-149, P-150
1.0 1.0 1.0 1.0 1.0 1.0 1.0 P-118, P-119, P-120, 1.0 P-121, P-122, P-123, P-124, P-125, P-126, 1.0 P-127, P-128, P-129, 1.0 P-130, P-131, P-132, 1.0 P-138, P-139, P-140, 1.0 P-142, P-143, P-144, 1.0 P-146, P-147, M-151, 1.0 M-156, M-157, M-158, 1.0 M-162 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 N-102 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0
1.0
1.0 1.0 1.0 P-141, 1.0 1.0 M-154 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 N-69, 1.0 1.0 1.0 1.0 N-98 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 N-63, N-64, 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 N-65, N-70, 1.0 N-71, N-72, 1.0 1.0 1.0 1.0 1.0 1.0 N-73, N-74, 1.0 1.0 1.0 1.0 N-75, N-76, 1.0 1.0 1.0 N-77, N-78, 1.0 1.0 1.0 1.0 1.0 N-90, N-112, 1.0 1.0 1.0 N-114 1.0 1.0 1.0 1.0 N-1, N-2, N-3, N-4, N-5, N-6, 1.0 1.0 N-7, N-8, N-9, N-10, N-11, 1.0 1.0 N-12, N-13, N-14, N-15, N-16, 1.0 1.0 1.0 1.0 1.0 N-17, N-18, N-19, N-20, N-21, 1.0 1.0 N-54 1.0 1.0 N-22, N-23, N-24, N-25, N-26, 1.0 1.0 N-27, N-28, N-29, N-30, N-31, 1.0 1.0 1.0 N-32, N-33, N-34, N-35, N-36, 1.0 1.0 1.0 N-37, N-38, N-39, N-40, N-41, 1.0 1.0 N-42, N-43, N-44, N-45, N-46, 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 N-47, N-48, N-49, N-50, N-55, 1.0 1.0 1.0 N-56, N-57, N-58, N-59, N-60, 1.0 1.0 1.0 N-61, N-62, N-66, N-68, N-79, 1.0 1.0 1.0 N-80, N-81, N-82, N-83, N-84, 1.0 1.0 1.0 N-85, N-86, N-87, N-88, N-89, N-110, 1.0 1.0 1.0 P-135, 1.0 1.0 N-91, N-92, N-93, N-94, N-95, N-111, 1.0 1.0 P-136 N-96, N-97, N-99, N-100, N-101, M-153 1.0 1.0 N-103, N-104, N-105, N-106, 1.0 1.0 N-107, N-108, N-109, N-113, 1.0 N-115, N-116, N-117 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 E-52, M-152, M-159, M-161, 1.0 M-155, 1.0 1.0 1.0 1.0 M-166, M-167, M-176 M-168, M-169, 1.0 1.0 1.0 M-170, M-171, 1.0 1.0 1.0 M-172, M-173, 1.0 E-51, M-174, M-175, 1.0 1.0 1.0 1.0 E-53 1.0 M-177, M-178 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 P-134, M-160, 1.0 M-163, M-164, 1.0 1.0 M-165 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 N-67 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 P-133 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 P-137 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
(f) SCIPhI
N-1, N-2, N-3, N-4, N-5, N-6, N-7, N-8, N-9, N-10, N-11, N-12, N-13, N-14, N-15, N-16, N-17, N-18, N-19, N-20, N-21, N-22, N-23, N-24, N-25, N-26, N-27, N-28, N-29, N-30, N-31, N-32, N-33, N-34, N-35, N-36, N-37, P-132, P-136, N-38, N-39, N-40, N-41, N-42, P-146, P-147, N-43, N-44, N-45, N-46, N-47, P-149, P-150 N-48, N-49, N-50, E-51, E-52, 1.0 P-118, P-119, E-53, N-54, N-55, N-56, N-57, P-128, P-131, P-124, P-125, P-120, P-121, N-58, N-59, N-60, N-61, N-62, 3.1 M-154, 2.5 P-135, 1.5 P-138, P-139, 2.0 1.0 P-126, P-127, P-122, P-123, N-63, N-64, N-65, N-66, N-67, M-155 P-141 P-140, P-142, P-129, P-130, P-133, P-134, M-152, M-163, N-68, N-69, N-70, N-71, N-72, P-143, M-158 P-137, P-148 2.2 P-144, P-145 M-164, M-165, N-73, N-74, N-75, N-76, N-77, M-151, M-156, M-166, M-167, N-78, N-79, N-80, N-81, N-82, M-159, 1.8 M-157, 1.0 M-168, M-169, N-83, N-84, N-85, N-86, N-87, M-162, M-160, M-170, M-171, N-88, N-89, N-90, N-91, N-92, M-176 M-161 M-172, M-173, N-93, N-94, N-95, N-96, N-97, M-174, M-175, N-98, N-99, N-100, N-101, N-102, M-177, M-178 N-103, N-104, N-105, N-106, N-107, N-108, N-109, N-110, N-111, N-112, N-113, N-114, N-115, N-116, N-117, M-153
(g) OncoNEM
P-129, P-133, P-136, P-137, P-138, P-139, P-141, P-142, P-144, P-145, P-146, P-148, P-149, P-150, M-159, M-170, M-176, P-118, P-119, P-120, P-121, P-123, P-125, P-126, P-127
0.9 N-97, N-99, N-76, N-101, N-38, N-39, N-40, N-41, N-42, N-43, N-44, N-45, N-46, N-47, N-48, N-49, N-50, E-51, N-88, N-104, N-102
0.2
P-147, M-151, M-152, M-154, N-65, N-66, M-156, M-157, N-100, N-111, M-158, M-160, N-80, N-81, M-161, M-163, M-153, N-91, 0.8 M-164, M-165, N-57, N-58, M-166, M-167, N-59, N-60, M-168, M-169, N-61, N-62, M-171, M-172, N-63, N-64 M-173, M-174, M-175, M-177, M-178, P-122 0.7
0.1
P-130, P-131, P-132, P-134, P-135, P-140, P-143, N-115, P-124, P-128
N-98, N-75, N-1, N-2, N-3, N-4, N-5, N-77, N-78, N-6, N-7, N-8, N-9, N-10, N-79, N-112, 0.9 N-11, N-12, N-13, N-14, 0.4 N-95, N-96 M-162 N-15, N-16, N-17, N-18, N-19, N-20, N-21, N-22, N-23, N-24, N-25, N-26, N-27, N-28, N-29, N-30, N-31, N-32, N-33, N-34, N-35, N-36, N-37, N-54, 0.1 N-55, N-56, N-67, N-68, N-116, N-117, N-82, N-83, 0.0 N-69, N-70, N-71, N-72, N-86, N-87, N-84, N-85, N-73, N-74, N-90, N-93, N-89, M-155 E-53, N-92 N-94, N-103, N-105, N-106, N-107, N-108, N-109, 0.7 N-110, N-113, N-114
E-52
(h) BitPhylogeny
Fig 5: Lineage tree inferred by the different methods on a single-cell dataset from the first colorectal cancer patient.
May 12, 2021 17/39 bioRxiv preprint doi: https://doi.org/10.1101/2021.05.24.445405; this version posted May 24, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
[Figure 6, tree diagrams. Panels: (a) SiCloneFit, (b) Scelestial, (c) SiFit, (d) SASC, (e) SCITE, (f) SCIPhI, (g) OncoNEM, (h) BitPhylogeny. Legend: no sample; mixed normal/cancerous/metastasis samples; metastasis tumor samples; primary tumor samples; normal samples.]
Fig 6: Lineage tree inferred by the different methods on a single-cell dataset from the second colorectal cancer patient.
[Figure 7, line charts. Panels: (a) Run time comparisons for varying sample sizes (x-axis: number of samples; y-axis: running time in seconds). (b) Relation of run time to the number of sites (x-axis: number of sites; y-axis: running time in seconds). Legend: BitPhylogeny, OncoNEM, SCITE, Scelestial, SASC, SCIPhI, SiFit, SiCloneFit.]
Fig 7: Run time comparison in relation to the number of samples and sites.
finishing their task across the whole range of sample sizes within 100 seconds.
Scelestial, SCIPhI, and SCITE’s run times grew almost linearly with an
increasing number of samples. The run time of BitPhylogeny did not appear
to depend directly on the number of samples and varied irregularly across
the tested cases, since it applies MCMC and its run time depends only on
the number of iterations. When we increased the number of sites, the run
time of Scelestial also grew almost linearly. Next to Scelestial, BitPhylogeny
and SCITE were the fastest methods. Across all cases, Scelestial was the
fastest method in seven cases and BitPhylogeny in six cases.
On these data, Scelestial was not only the fastest method but, as before,
also had the smallest average error in lineage tree reconstruction (Fig. 8b).
Among 130 cases, in terms of sample distance error, Scelestial performed
best in 112 cases and OncoNEM performed best in 22 cases. In terms of
the split similarity measure of topological correctness, Scelestial performed
best in 112 cases, OncoNEM in 18 cases, BitPhylogeny in four cases, and
SCITE in two cases. In case of ties, the best performance was credited to all
tied methods. OncoNEM was slightly more accurate in some cases; however, it
had a substantially larger run time. Thus, in these tests Scelestial was best
overall in both run time and error. SCITE and BitPhylogeny had similar
split similarity, with SCITE being faster than BitPhylogeny on average.
Considering the pair distance error, the performance of BitPhylogeny was
slightly better than that of SCITE. The performance of OncoNEM with
respect to the pair distance error was better than that of BitPhylogeny and
SCITE in many cases and on average. However, the variance of the pair
distance error and of the split similarity of OncoNEM’s results was higher
than that of the other methods. This variation in OncoNEM’s performance is
clearly visible in the split similarity measure chart for this dataset (Fig. 8b).
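Because ties are credited to every tied method, the per-method counts of best cases can sum to more than the number of cases (for example, 112 + 22 = 134 over 130 cases for the sample distance error). The following is a minimal sketch of such a counting rule, not the evaluation code used in this study. The function name, the tolerance, and the toy scores for the placeholder methods "A" and "B" are assumptions made only for illustration.

```python
from collections import Counter

def count_best(scores_per_case, lower_is_better=True, tol=1e-12):
    """Count, for each method, the cases in which it attains the best score.

    Every method whose score is within `tol` of the best score in a case
    is credited, so a tie adds one count to each tied method.
    """
    pick = min if lower_is_better else max
    wins = Counter()
    for scores in scores_per_case:
        best = pick(scores.values())
        for method, score in scores.items():
            if abs(score - best) <= tol:
                wins[method] += 1
    return wins

# Toy error values (illustrative only, not results from the paper).
toy_cases = [
    {"A": 0.2, "B": 0.5},  # A is best
    {"A": 0.3, "B": 0.3},  # tie: both A and B are credited
    {"A": 0.6, "B": 0.4},  # B is best
]
wins = count_best(toy_cases)
total_credits = sum(wins.values())
```

In this toy example the total number of credits exceeds the number of cases by exactly the number of tied cases. For a similarity measure, where larger is better, the same function can be called with lower_is_better=False.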
According to theoretical analysis, the run time of Scelestial is polynomial
in the number of samples, with exponent k, a parameter of the Steiner tree
approximation algorithm. In Fig. 7, the quadratic form
of the run time might not be directly evident. This is expected, because the
chart is cropped at large values to show the differences among all the algorithms
except OncoNEM and SiFit. The y-axis of the chart is also logarithmic,
which makes the actual growth hard to see. SCIPhI, SCITE,
and BitPhylogeny show similar behavior. In practice, the run times of all
the algorithms except for Scelestial and SCIPhI grew substantially with an
increasing number of sites.
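A polynomial run time T(n) = c n^k satisfies log T = k log n + log c, so on a chart with only the y-axis logarithmic it appears as a flattening curve, whereas on log-log axes it is a straight line with slope k. The sketch below estimates this slope by least squares. It is an illustration under stated assumptions: the function name is hypothetical, and the timings are synthetic values generated from an exact quadratic, not measurements from our experiments.

```python
import math

def fit_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size).

    For timings that follow c * n**k, the returned slope equals k.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic timings from an exact quadratic (assumed constant 0.001).
sizes = [20, 40, 60, 80, 100]
synthetic_times = [0.001 * n ** 2 for n in sizes]
estimated_k = fit_exponent(sizes, synthetic_times)
```

Applied to measured run times, the fitted slope would only approximate the exponent, since lower-order terms and measurement noise also affect the timings.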
3 Materials and Methods
3.1 Data format
We model the data as an m by n matrix D with single-cell samples as
columns and features as rows. The element of D corresponding to a single
cell c and a locus f is denoted D[c, f] and represents the result of a variant
call obtained from a single-cell sample of a diploid genome, which may be one
of the 10 states from the set {A/A, T/T, C/C, G/G, A/T, A/C, A/G, T/C, T/G, C/G}
or a missing value X/X. When the data do not provide all the information
(e.g., for data obtained from OncoNEM’s simulation tools), we can convert
0/1 (reference state/variant state) matrices to the 10-state format by coding
0 as A/A and 1 as C/C. All the single-cell lineage tree reconstruction
methods we consider here support missing values in their input matrices.
[Figure 8, scatter plots. Panels: (a) Run time and lineage reconstruction error of four lineage tree inference methods on datasets with different numbers of sites with respect to the pair distance error (x-axis: lineage tree reconstruction sample distance error; y-axis: running time in seconds). (b) Run times and lineage reconstruction errors of four methods on datasets with different numbers of sites with respect to the split similarity measure (x-axis: lineage tree reconstruction split similarity measure; y-axis: running time in seconds). Legend: BitPhylogeny, OncoNEM, SCITE, Scelestial, SASC, SCIPhI, SiFit, SiCloneFit.]
Fig 8: Comparison of methods with respect to running time and lineage tree reconstruction error.
With this coding, we cannot differentiate between the case of two identical
alleles (e.g., A/A) and the case of a missing allele in one strand. We hope
that considering an error as a regularization method helps us to find the
model (which is a lineage tree) that fits the data best.
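The 0/1 conversion described in this section can be illustrated with a minimal Python sketch. The function name `binary_to_ten_state` and the use of `None` for missing entries are assumptions made for illustration; they are not taken from the Scelestial implementation.

```python
# Hypothetical helper: convert a 0/1 matrix to the 10-state genotype coding.
# Assumption: missing entries are given as None and are written as "X/X".
CODING = {0: "A/A", 1: "C/C", None: "X/X"}


def binary_to_ten_state(matrix):
    """Map each 0/1/None entry to its genotype string, keeping the layout."""
    return [[CODING[value] for value in row] for row in matrix]


example = [[0, 1, None],
           [1, 1, 0]]
converted = binary_to_ten_state(example)
```

Any pair of distinct homozygous states would serve equally well for this purpose; A/A and C/C are simply the pair named in the text.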
3.2 Synthetic data generation via tumor simulation
We developed a tumor growth simulation method as a data source for the
evaluation of Scelestial and for a comparison with state-of-the-art methods.
The simulation has three phases: (1) simulation of evolution, (2) sampling
from the tree, and (3) simulation of sequencing.
In the simulation of evolution phase, the evolutionary process is sim-
ulated and a tree is generated. This simulation is based on evolutionary
events that happen in a tumor. We modeled the cell division, mutation,
and selection in the evolutionary process of the tumor in this simulation.
Each node in the evolutionary tree represents a cell (or representative of a
bulk of similar cells) in the history of tumor formation. To each of these,
an advantage value is assigned that shows the relative growth or division
advantage of the cells corresponding to that node. The advantage value of
a node is the same as its parent advantage value plus or minus a uniform
random number. The evolutionary tree is constructed through several steps.
In each step, one new node is generated and its parent is chosen from the
current nodes with a probability proportional to their advantage values.
We calculated the advantage value for the newly born node as a random
perturbation added to its parent’s advantage value. The actual sequence
for the new node is calculated from the sequence of the parent with some
random mutations. The number of mutations from its parent is derived
from a Poisson distribution about an average parameter specified in the
input. The locations of the mutations are chosen uniformly from all the
loci.
In the sampling phase, samples are chosen from the nodes of the tree
with a probability which is proportional to their advantage values. In the
simulation of the sequencing phase, missing values and errors are then
incorporated by stochastic processes.
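The three phases can be sketched as follows. This is a simplified illustration, not the simulator used in the paper: it uses a haploid A/C/G/T alphabet instead of diploid genotypes, clamps advantage values to a small positive floor so that they remain valid sampling weights, and samples nodes with replacement. All function and parameter names (`simulate_tree`, `sample_and_sequence`, `step`, `missing_rate`, `error_rate`) are hypothetical.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw a Poisson-distributed integer with Knuth's multiplication method."""
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_tree(n_nodes, n_loci, mean_mutations, step=0.1, seed=0):
    """Phase 1: grow a tree; parents are drawn proportionally to advantage."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    parents, advantages = [None], [1.0]
    sequences = [[rng.choice(alphabet) for _ in range(n_loci)]]
    for _ in range(n_nodes - 1):
        parent = rng.choices(range(len(parents)), weights=advantages)[0]
        # Parent advantage plus or minus a uniform perturbation.
        # Assumption: clamp to a positive floor so the weight stays valid.
        advantage = max(1e-6, advantages[parent] + rng.uniform(-step, step))
        sequence = list(sequences[parent])
        for _ in range(sample_poisson(rng, mean_mutations)):
            locus = rng.randrange(n_loci)  # mutation loci chosen uniformly
            sequence[locus] = rng.choice(
                [c for c in alphabet if c != sequence[locus]])
        parents.append(parent)
        advantages.append(advantage)
        sequences.append(sequence)
    return parents, advantages, sequences


def sample_and_sequence(sequences, advantages, n_samples,
                        missing_rate, error_rate, seed=1):
    """Phases 2 and 3: sample nodes by advantage, then add missing values
    ("X") and substitution errors independently per locus."""
    rng = random.Random(seed)
    chosen = rng.choices(range(len(sequences)), weights=advantages,
                         k=n_samples)
    observed = []
    for node in chosen:
        read = []
        for base in sequences[node]:
            if rng.random() < missing_rate:
                read.append("X")
            elif rng.random() < error_rate:
                read.append(rng.choice([c for c in "ACGT" if c != base]))
            else:
                read.append(base)
        observed.append("".join(read))
    return chosen, observed


parents, advantages, sequences = simulate_tree(20, 30, mean_mutations=2.0)
chosen, observed = sample_and_sequence(sequences, advantages, 8, 0.1, 0.01)
```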
3.3 Synthetic data generated by OncoNEM’s simulation tool
We used simulated data generated by OncoNEM’s simulation software
for our evaluation. The OncoNEM simulator is based on the evolution
of clones. First, it generates a lineage tree, then it assigns mutations to
tree nodes under the infinite site assumption. Afterwards, it generates
output sequences by sampling from the tree and generating sequences with
single-cell sequencing issues, including missing values, false positives, and
false negatives.
3.4 Comparison of a resulting lineage tree with a ground truth
We define two measures between trees: (1) a distance measure that compares
distances between samples in two trees, and (2) a similarity measure that
compares the set of splits made by two trees applied to the set of samples.
For both measures, we used a weighted version of classic 0/1 measures.
Other measures used in the literature include clustering accuracy [12],
the order of mutations in a phylogenetic tree, ancestor-descendant accuracy
of mutations, different lineages, and co-clustering of mutations [12]. These
measures are used for evaluating mutation trees, which is not what we
generate in the Scelestial method.
3.4.1 Sample distance measure between trees
We define the sample distance measure between two trees based on the
shortest-path matrix idea proposed by [5]. The basic idea is that we
calculate pairwise distances between each pair of input cells in a tree $T$.
We also create a pairwise distance matrix $PD_T$ between the inputs for
the tree. Following this, we normalize the matrix $PD_T$ to obtain $\overline{PD}_T$ as
$$\overline{PD}_T[i, j] = \frac{PD_T[i, j]}{\sum_{x,y} PD_T[x, y]}.$$
Since different concepts of weight for the tree edges are used in different
methods, the normalization phase
allows us to neglect absolute values and only consider the relative edge
distances in the lineage trees provided. We define the distance between
two trees $T$ and $T'$ as
$$D(T, T') := \sum_{i,j} \left| \overline{PD}_T[i, j] - \overline{PD}_{T'}[i, j] \right|,$$
which represents the distance between the normalized pairwise distance matrices
for the two trees. The value of $D(T, T')$ lies in the range between 0 and 2.
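The sample distance measure can be sketched in Python as follows. This is an illustrative implementation under stated assumptions: a tree is given as a list of weighted edges `(u, v, weight)`, the sum runs over ordered pairs of samples, and the function names are hypothetical.

```python
from collections import defaultdict


def pairwise_distances(edges, samples):
    """Path length between every ordered pair of samples in a weighted tree."""
    adjacency = defaultdict(list)
    for u, v, weight in edges:
        adjacency[u].append((v, weight))
        adjacency[v].append((u, weight))
    matrix = {}
    for source in samples:
        dist = {source: 0.0}
        stack = [source]
        while stack:  # depth-first traversal; paths in a tree are unique
            node = stack.pop()
            for neighbor, weight in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + weight
                    stack.append(neighbor)
        for target in samples:
            matrix[source, target] = dist[target]
    return matrix


def normalize(matrix):
    """Divide every entry by the sum of all entries."""
    total = sum(matrix.values())
    return {pair: value / total for pair, value in matrix.items()}


def sample_distance(edges_a, edges_b, samples):
    """Sum of absolute differences of the normalized distance matrices."""
    first = normalize(pairwise_distances(edges_a, samples))
    second = normalize(pairwise_distances(edges_b, samples))
    return sum(abs(first[pair] - second[pair]) for pair in first)


samples = ["a", "b", "c"]
star = [("a", "x", 1), ("b", "x", 1), ("c", "x", 1)]
scaled_star = [("a", "x", 10), ("b", "x", 10), ("c", "x", 10)]
path = [("a", "b", 1), ("b", "c", 1)]
d_same_shape = sample_distance(star, scaled_star, samples)
d_star_path = sample_distance(star, path, samples)
```

The scaled star has the same relative distances as the original star, so the normalization makes their distance vanish, which is the behavior the normalization step is meant to provide.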
3.4.2 Split similarity measure between trees
We defined the split similarity between two trees $T$ and $T'$ as the similarity
between two sets of splits generated by two trees. For each edge $e$ in a
tree $T$, we define the split $S_e$ as the set $\{A, B\}$, where $A$ and $B$ are the sets of samples separated by the edge $e$ in $T$. We define a similarity between
two splits $\{A_1, B_1\}$ and $\{A_2, B_2\}$ as the number of samples that are split similarly in the two splits. However, since the order of $A$ and
$B$ within a split is arbitrary, we define the similarity score as:
$$\max\{|A_1 \cap A_2| + |B_1 \cap B_2|,\ |A_1 \cap B_2| + |B_1 \cap A_2|\}$$
To calculate the similarity between two sets of splits, we find the mapping
between the elements of the two sets with the maximum similarity score. The
mapping similarity score is the sum of the similarity scores of the matched
splits. We then define $DS(T, T')$ as the normalized matching score, which is the mapping score between $T$ and $T'$ divided by the mapping score of $T$
with $T$. The split score value is between 0 and 1.
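A small Python sketch of the split similarity follows. It is illustrative only: trees are given as lists of unweighted edges `(u, v)`, the function names are hypothetical, and the best mapping is found by brute force over permutations, which is practical only for very small trees (an assignment solver would be needed for realistic sizes).

```python
from itertools import permutations


def splits_of(edges, samples):
    """For each edge, the bipartition of the samples obtained by removing it."""
    samples = set(samples)
    result = []
    for index, (start, _) in enumerate(edges):
        remaining = edges[:index] + edges[index + 1:]
        seen, stack = {start}, [start]
        while stack:  # collect everything still reachable from one endpoint
            node = stack.pop()
            for u, v in remaining:
                for a, b in ((u, v), (v, u)):
                    if a == node and b not in seen:
                        seen.add(b)
                        stack.append(b)
        side = frozenset(seen & samples)
        result.append((side, frozenset(samples - side)))
    return result


def split_score(first, second):
    """Number of samples split alike, ignoring the order of the two sides."""
    (a1, b1), (a2, b2) = first, second
    return max(len(a1 & a2) + len(b1 & b2), len(a1 & b2) + len(b1 & a2))


def mapping_score(splits_a, splits_b):
    """Best total score over all one-to-one mappings (brute force)."""
    small, large = sorted((splits_a, splits_b), key=len)
    return max(
        sum(split_score(split, large[j]) for split, j in zip(small, chosen))
        for chosen in permutations(range(len(large)), len(small)))


def split_similarity(edges_a, edges_b, samples):
    """Mapping score of T with T', divided by the mapping score of T with T."""
    splits_a = splits_of(edges_a, samples)
    splits_b = splits_of(edges_b, samples)
    return mapping_score(splits_a, splits_b) / mapping_score(splits_a, splits_a)


samples = ["a", "b", "c", "d"]
tree_ab_cd = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
tree_ac_bd = [("a", "x"), ("c", "x"), ("x", "y"), ("b", "y"), ("d", "y")]
same = split_similarity(tree_ab_cd, tree_ab_cd, samples)
different = split_similarity(tree_ab_cd, tree_ac_bd, samples)
```

In this four-sample example the two trees share all four single-sample splits and differ only in the internal split, so the similarity is high but below 1.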
3.5 The Scelestial algorithm
The Scelestial algorithm is based on an established theoretical computer
science problem called the Steiner tree problem. We incorporate an ap-
proximation algorithm for this problem provided by Berman et al. [24]
and its modification for lineage tree reconstruction [23]. We modify the
algorithm to support missing values and imputation. In the following, we
examine the details of the Steiner tree problem, the reduction of lineage
tree reconstruction to the Steiner tree problem and the incorporation of
missing values into our method.
A schematic view of the Scelestial algorithm is illustrated in Fig. 9.
The main part of the Scelestial algorithm is described in Algorithm 1.
Sub-modules of the Scelestial algorithm are presented in Algorithms 2-5.
3.5.1 The Steiner tree problem
The Scelestial algorithm is based on the Berman approximation algorithm
for the Steiner tree problem [24]. The input of a Steiner tree problem consists
of a weighted graph G = (V, E, w) and a subset of its vertices S ⊆ V , which
are called terminals. It is convenient to suppose that the weight function
w satisfies the triangle inequality (i.e., w(x, z) ≤ w(x, y) + w(y, z), for all
three vertices x, y, z ∈ V). In the case of no missing values occurring in
the data, the triangle inequality constraint is satisfied, for example, for the
Hamming distance between nodes. However, in the case of single-cell data,
which contain a lot of missing values, we should consider this constraint
carefully.
The Steiner tree problem is the problem of finding a minimum-weight
connected subgraph of G that contains all the terminals S. The Steiner
tree problem is an NP-hard problem and it is known that under reasonable
complexity assumptions, no approximation algorithm can approximate the
result better than a factor of 96/95 ≈ 1.0105 [27].
Most approximation algorithms for the Steiner tree problem focus on
results that consist of a set of subtrees, each having k terminals at most,
for some constant k. A solution for the Steiner tree problem with this
property is called a k-restricted Steiner tree. We call the value k the
restriction number of a Steiner tree. Borchers and Du [28] showed that
restriction of the search space to k-restricted Steiner trees does not change
the approximation factor of an algorithm too much. More specifically, if
$k = 2^r + s$ where $0 \le s \le 2^r$, the restriction to k-restricted Steiner trees
reduces the approximation ratio to
$$\frac{r 2^r + s}{(r + 1) 2^r + s}.$$
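To make the expression concrete, the following sketch evaluates $(r 2^r + s) / ((r + 1) 2^r + s)$ for a few restriction numbers. The function name `restricted_ratio` is hypothetical, and the decomposition uses $0 \le s < 2^r$, which gives the same value as the boundary case $s = 2^r$.

```python
from fractions import Fraction


def restricted_ratio(k):
    """Evaluate (r*2^r + s) / ((r+1)*2^r + s) for k = 2^r + s, 0 <= s < 2^r."""
    r = k.bit_length() - 1  # largest r with 2^r <= k
    s = k - 2 ** r
    return Fraction(r * 2 ** r + s, (r + 1) * 2 ** r + s)


ratios = {k: restricted_ratio(k) for k in (2, 3, 4, 8)}
```

The values are 1/2, 3/5, 2/3, and 3/4 for k = 2, 3, 4, and 8; the expression approaches 1 as k grows, which is the sense in which the restriction costs little for larger k.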
3.5.2 Berman’s approximation algorithm for the Steiner tree problem
Berman et al.’s approximation algorithm [24] consists of three phases:
examination, evaluation, and application.
During the examination phase, the algorithm maintains a minimum
spanning tree M on the terminal set S. In this phase, the algorithm
considers all the subsets K of the terminals with a size of at most k (for
some fixed constant k) and all the topologies τ for the trees with K as its
leaves. For each K and τ, the algorithm finds the best lineage tree t (i.e.,
the best sequences for the internal nodes). This could be done through
dynamic programming. Adding t to the spanning tree M creates
(k − 1) cycles. The algorithm finds a subset of edges of M with maximum
cost to be removed from M + t to obtain a tree again. These edges are
called bridges β. A value gain is defined for t as the decrease in the cost
of the resulting tree if we decide to incorporate t, which is equal
to cost(β) − cost(t), where the cost of a set of edges is defined as the sum
of the costs of its elements. If the gain is greater than 0, the algorithm
adds t to a stack for the evaluation phase, removes β from M, and adds
some new edges to M instead. For each edge e of β, the algorithm finds the
vertices u and v from K that would be disconnected by removing
e, and adds the edge (v, u) to M with the cost cost(e) − gain. At the end of
the examination phase, we have a stack of trees and the modified M.
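The bridge and gain computation for a single candidate tree can be sketched as follows. This is one possible way to find the bridges, not necessarily how Scelestial implements it: the terminals in K are contracted into one component, Kruskal's algorithm is run over the edges of M from lightest to heaviest, and every edge that would close a cycle is reported as a bridge. The function names and the example weights are hypothetical.

```python
def find_bridges(mst_edges, subset):
    """Edges of the spanning tree that must be dropped once the vertices in
    `subset` are connected through a new tree (heaviest edge on each cycle)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False
        parent[root_a] = root_b
        return True

    subset = list(subset)
    for vertex in subset[1:]:  # contract K: the new tree connects it
        union(subset[0], vertex)
    bridges = []
    for u, v, weight in sorted(mst_edges, key=lambda edge: edge[2]):
        if not union(u, v):  # the edge would close a cycle
            bridges.append((u, v, weight))
    return bridges


def gain(mst_edges, subset, tree_cost):
    """gain = cost(bridges) - cost(t), as defined in the text."""
    bridges = find_bridges(mst_edges, subset)
    return sum(weight for _, _, weight in bridges) - tree_cost


# Spanning tree a-b-c-d and a candidate tree of cost 4 on K = {a, b, c}.
spanning_tree = [("a", "b", 2), ("b", "c", 3), ("c", "d", 1)]
bridges = find_bridges(spanning_tree, ["a", "b", "c"])
candidate_gain = gain(spanning_tree, ["a", "b", "c"], tree_cost=4)
```

Here the two edges inside K are the bridges (total cost 5), so a candidate tree of cost 4 has gain 1 and would be pushed to the stack.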
In the evaluation phase, the algorithm pops the trees t one by one from
the stack. It starts with an initially empty set of edges M′. At each step it
checks if t does not make a loop with M′. If this is the case, the algorithm
accepts t; otherwise, it rejects t. Finally, in the application phase, the
algorithm merges the accepted trees.
The run time of the algorithm for a general Steiner tree problem is
$O(n^3 + n^{k + \frac{1}{2}} + N^{k-2} n^{k-1})$, where N is the number of vertices in the graph.
Note that n is the number of terminal vertices S. The algorithm is an
11/6-approximation algorithm for the Steiner tree problem, if we set k = 3.
If we set k = 4, the algorithm would be a 16/9-approximation algorithm.
For larger values of k, we do not know a better approximation factor for
the result of the algorithm, but the results of the execution of the algorithm
show that the performance of the algorithm gets better as we increase the
value of k [24].
3.5.3 Modeling lineage tree reconstruction as a Steiner tree problem
The single-cell lineage tree inference problem can be modeled in the following
way. Let $g_i$ for $1 \le i \le n$ be the set of input sequences over some alphabet Σ + µ, where µ is a special character representing the missing value. Each
location on a sequence could be considered as a feature, which may be a
locus in the sequenced genomes or may represent a genomic aberration. We
suppose that all the sequences have the same length m. A cost function is
defined for any potential lineage trees to show how well a lineage tree is
fitted to the data. Normally, the cost function assigns a cost to each edge
of the tree, and the cost of a tree is the sum of its edge costs. The lineage
tree inference problem is the problem of finding a minimum-cost lineage tree
that contains all the input sequences $g_i$ and potentially some other nodes. In the resulting tree, all non-input nodes are assigned a label of length m
from the alphabet Σ.
A general lineage tree inference problem that does not incorporate
missing values could be modeled as a Steiner tree problem as follows. Let
G = (V, E, w) be the graph containing all sequences of length m from the
alphabet Σ that have edge weights derived from the tree cost function.
Thus, the lineage tree problem would be equivalent to finding a Steiner tree
in this graph with the input sequences as its terminal set S.
3.5.4 Incorporating an approximation algorithm for lineage tree reconstruction
We showed how lineage tree reconstruction could be modeled as a Steiner
tree problem in the previous section. However, through this method, the
size of the graph is $|\Sigma|^m$, which is exponential in the length of the input of the
lineage tree reconstruction problem. On the other hand, some approximation
algorithms do not consider all the vertices of graph G and they only work
with the best Steiner trees over certain subsets of terminals. Since we can
solve the Steiner tree problem over a small subset of terminals, we can use
these approximation approaches.
To apply Berman et al.’s algorithm for lineage tree reconstruction, as
already suggested by Alon et al. [23], for every subset S′ ⊆ S of
size at most k, we consider all the possible tree topologies with at most k
leaves. Since the number k is constant, the number of these tree topologies
would also be constant. Next, for the pair consisting of a subset S′ and a
tree T, we find the minimum-cost Steiner tree by dynamic programming.
The rest is done by the Berman [24] algorithm. This approach gives us an
11/6-approximation algorithm for the lineage tree reconstruction problem
with k = 3 and a 16/9-approximation algorithm for k = 4. The cost of an
edge between two nodes with the sequences $g_i$ and $g_j$ is $\sum_{t=1}^{m} c(g_i[t], g_j[t])$, for some cost function c. In this work, we assign a cost of 1 to every mutation and a cost of zero to non-mutated locations (i.e.,
c(x, x) = 0 for x ∈ Σ and c(x, y) = 1 for x ≠ y ∈ Σ).
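The text does not specify the dynamic program used to fill the internal nodes of a fixed topology. One standard choice is a Sankoff-style small-parsimony recursion, sketched below for a single site with the unit cost function. This is an assumption made for illustration, not a description of the Scelestial implementation; it uses a haploid A/C/G/T alphabet, does not handle missing values at the leaves, and roots the topology at an arbitrary internal node (with a symmetric cost the minimum does not depend on this choice). The total tree cost is the sum of the per-site costs.

```python
ALPHABET = "ACGT"  # simplification: haploid states instead of 10 genotypes


def unit_cost(x, y):
    """Cost 1 for every mutation, 0 otherwise."""
    return 0 if x == y else 1


def sankoff(children, leaf_state, root):
    """Return (minimum cost, state of every node) for one site of a rooted
    topology given as a dict from internal node to its list of children."""
    table = {}

    def fill(node):
        if node not in children:  # leaf: only the observed state is allowed
            table[node] = {s: (0 if s == leaf_state[node] else float("inf"))
                           for s in ALPHABET}
            return
        for child in children[node]:
            fill(child)
        table[node] = {
            s: sum(min(table[child][t] + unit_cost(s, t) for t in ALPHABET)
                   for child in children[node])
            for s in ALPHABET}

    assignment = {}

    def trace(node, state):
        assignment[node] = state
        for child in children.get(node, ()):
            best = min(ALPHABET,
                       key=lambda t: table[child][t] + unit_cost(state, t))
            trace(child, best)

    fill(root)
    root_state = min(ALPHABET, key=lambda s: table[root][s])
    trace(root, root_state)
    return table[root][root_state], assignment


# Topology ((a, b), c) with observed states A, A, G at one site.
topology = {"r": ["u", "c"], "u": ["a", "b"]}
observed = {"a": "A", "b": "A", "c": "G"}
site_cost, states = sankoff(topology, observed, "r")
```

In this example one mutation suffices, and the internal node above the two identical leaves is assigned their shared state.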
3.5.5 Support for missing values
To adapt the method to the single-cell setting, we incorporate missing values
into the algorithm. We can simply do this by extending the domain of the
character cost function c : Σ × Σ → R∗ to c : (Σ + µ) × (Σ + µ) → R∗. To do so, we should define c(µ, x) = c(x, µ) for x ∈ Σ as well as c(µ, µ); c(µ, x)
could be considered as the cost of imputation for a locus. One may consider
the assignment of c(µ, x) = c(µ, µ) = 0. However, this assignment violates
the triangle inequality for the graph G, since c(x, µ) + c(µ, y) = 0 < 1 = c(x, y)
for x ≠ y ∈ Σ. To preserve the triangle inequality
between the edges’ weights, we may assign c(µ, x) = 0.5 + ε for some small
constant ε (e.g., $10^{-5}$). With this choice, letting c(µ, µ) = 0 no longer
violates the triangle inequality.
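The effect of the two assignments can be checked with a short sketch. The names `char_cost`, `violations`, and `sequence_cost` are hypothetical, the character "X" stands in for µ, and the small constant is set to the example value from the text.

```python
from itertools import product

MISSING = "X"    # stands in for the missing-value character
EPSILON = 1e-5   # the small constant from the text


def char_cost(x, y, missing_cost=0.5 + EPSILON):
    """Character cost extended to the missing-value character."""
    if x == y:
        return 0.0
    if MISSING in (x, y):
        return missing_cost
    return 1.0


def violations(cost, symbols):
    """All triples (x, y, z) with cost(x, z) > cost(x, y) + cost(y, z)."""
    return [(x, y, z) for x, y, z in product(symbols, repeat=3)
            if cost(x, z) > cost(x, y) + cost(y, z) + 1e-12]


def sequence_cost(a, b, cost=char_cost):
    """Edge cost between two sequences: the sum of per-character costs."""
    return sum(cost(x, y) for x, y in zip(a, b))


def zero_missing_cost(x, y):
    """The rejected assignment: missing values cost nothing."""
    return char_cost(x, y, missing_cost=0.0)


symbols = "ACGT" + MISSING
ok = violations(char_cost, symbols)
broken = violations(zero_missing_cost, symbols)
edge = sequence_cost("CAC", "XAC")
```

With a zero cost for missing values, every pair of distinct characters can be connected for free through the missing character, so triples such as (A, X, C) violate the inequality; with 0.5 + ε no triple does.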
3.5.6 Missing value imputation
The original approach of representing lineage tree reconstruction as a
Steiner tree problem does not consider the case of missing values in the
input sequences. However, for single-cell data, this is important. We
therefore designed a dynamic program that finds internal node sequences
with non-missing values. As a result, missing values may occur only in
input sequences. The translation of the missing value imputation task in
our model would be the task of filling missing values with some characters
from the alphabet Σ to minimize the cost function. To impute missing
values after finding the lineage tree, we replace each missing value in a node
with the most abundant character found within its neighbors.
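The final imputation rule, replacing a missing value with the most abundant character among the neighbors of a node, can be sketched as follows. The data layout (a dict of sequences and a dict of neighbor lists), the function name, and the tie-breaking behavior are assumptions for illustration.

```python
from collections import Counter

MISSING = "X"  # stands in for the missing-value character


def impute_from_neighbors(sequences, neighbors):
    """Fill each missing position with the most common non-missing character
    found at that position among the node's tree neighbors.
    Assumption: ties go to the character seen first; a position stays
    missing if no neighbor has a value there."""
    imputed = {}
    for node, sequence in sequences.items():
        chars = list(sequence)
        for position, char in enumerate(chars):
            if char != MISSING:
                continue
            votes = Counter(
                sequences[other][position]
                for other in neighbors.get(node, ())
                if sequences[other][position] != MISSING)
            if votes:
                chars[position] = votes.most_common(1)[0][0]
        imputed[node] = "".join(chars)
    return imputed


sequences = {"s1": "XAC", "s2": "CAC", "s3": "GAC", "s4": "CAG"}
neighbors = {"s1": ["s2", "s3", "s4"]}
imputed = impute_from_neighbors(sequences, neighbors)
```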
Algorithm 1 Pseudocode of the main part of the Scelestial algorithm
Input: Samples g_i for 1 ≤ i ≤ n. Every sample g_i is a string of length m.
       A character cost function c as defined in Section 3.5.5.
       k: Degree of restriction of restricted Steiner trees.
Output: The lineage tree T = (V_T, E_T).
        A partial function seq : V_T → Σ^m that assigns sequences with length m to tree nodes.
    M ← Initialize({g_1, ..., g_n})
    stack ← EnumerateRestrictedTrees({g_1, ..., g_n}, k, M)
    T ← ConstructTree(stack, M)
    T ← Impute({g_1, ..., g_n}, T)

Algorithm 2 Initialize function (part of the Scelestial algorithm)
function ĉ(a, b)
    return $\sum_{i=1}^{m} c(a[i], b[i])$
end function
function Initialize({g_1, ..., g_n})
    Build a complete graph G on g_i.
    Assign weight ĉ(g_i, g_j) to the edges of G.
    Let M = MST(G).
    Let cost(e) be the cost of edge e for e ∈ E_M.
    return M
end function

Algorithm 3 EnumerateRestrictedTrees function (part of the Scelestial algorithm)
function EnumerateRestrictedTrees({g_1, ..., g_n}, k, M)
    for all K ⊂ {g_1, ..., g_n} with 3 ≤ |K| ≤ k do
        for all tree topologies τ with K as its leaves do
            Fill internal nodes of τ with sequences that minimize the cost of the tree t.
            Let cost(t) be the sum of the costs of the edges of t.
            Let M′ = M + t.
            Let β be the heaviest edges from M for which, if we remove them, there would be no cycle in M′.
            Let cost(β) be the sum of the costs of the edges in β.
            Let gain = cost(β) − cost(t)
            if gain > 0 then
                Let Mrep = ∅
                for all edges e in β do
                    Find the vertices u and v of K that are disconnected when e is removed from M.
                    Add edge (v, u) to Mrep with the new cost cost(e) − gain
                end for
                push (t, β, Mrep) to stack
                Let M = (M − β) ∪ Mrep
            end if
        end for
    end for
    return stack
end function

Algorithm 4 ConstructTree function (part of the Scelestial algorithm)
function ConstructTree(stack, M)
    Let T be an empty tree
    for all g_i do
        Add vertex v_i to T.
    end for
    Let seq be a function with an empty domain.
    Let E = M
    while stack is not empty do
        pop (t, β, Mnew) from stack
        UpdateTree(T, E, M, t, β, Mnew)
    end while
    for all edges e of M do
        if adding e to T does not make a cycle then
            add e to T
        end if
    end for
    return T
end function
function UpdateTree(T, E, M, t, β, Mnew)
    Let E = E ∪ β
    if Mnew ⊆ M then
        add t to T
        extend seq function to internal nodes of t
    else
        Let E = E − Mnew
        Let M = MST(E)
    end if
end function

Algorithm 5 Impute function (part of the Scelestial algorithm)
function Impute({g_1, ..., g_n}, T)
    for all samples g_i do
        Add vertex g_i to T and add edge (g_i, v_i) to T.
        Find the minimum-cost sequence for vertex v_i.    (This sequence serves as imputation for g_i.)
    end for
    return T
end function
[Figure 9 schematic, drawn on the example samples CAC, GAG, GCC, and ?AC with the internal node GAC. The steps shown are:]
1 - Initialize: Find the minimum spanning tree (MST) between input nodes.
2 - Enumerate Restricted Trees: For every k-subset of the samples K, and every tree topology t with K as its leaves, perform the following steps:
2A - Fill internal nodes of t: Find the best evolutionary tree τ with K as its leaves by a dynamic program.
2B - Find bridges: Find bridges β, which are edges that should be removed from the spanning tree between S to avoid cycles after adding τ to the tree.
2C - Project τ onto S: Let gain = cost(β) − cost(t). If gain > 0, push t to the stack and add the projection of the bridges on K (weight of new edge = weight of bridge − gain).
3 - Construct Tree: For every tree t in the stack, if it is compatible with the current tree, update the tree by adding t.
4 - Impute: Replace each node which has missing values with a fully imputed one, which minimizes the cost of the tree.
Fig 9: Illustration of the Scelestial algorithm.
4 Discussion and conclusions
Inference of a tumor’s evolutionary history is a crucial step towards under-
standing of the common patterns of cancer evolution. There are a range
of methods available for phylogenetic inference with single-cell datasets.
For optimal performance, many of these probabilistic approaches require
substantial time for optimization (e.g., using MCMC). The increasing sample
sizes of single-cell datasets emphasize the need for fast and scalable
methods in this domain that also maintain high accuracy [29].
Here, we describe Scelestial, a computationally lightweight and accurate
method for lineage tree reconstruction from single-cell variant calls. The
method is based on a Steiner tree algorithm with a known approximation
guarantee, which we adapted to lineage tree reconstruction for single-cell
data with missing values by using a variation of the Steiner tree problem
called the group Steiner tree. This problem is a special case of the group
Steiner problem, in which the groups are sub-hypercubes of the original
graph. To achieve a fast algorithm, we further adapted the group Steiner
tree algorithm through modeling a hypercube corresponding to the missing
values with one representative vertex. To facilitate the use of Scelestial, we
provide an implementation as a freely available R package. From another
point of view, Scelestial could be considered as a generalization of the
neighbor-joining method. At each step, the neighbor-joining method finds
the two most similar elements (samples) and merges them to build a tree. A
generalization of this idea may consider more than two samples for merging
in each step. Determining a good objective function for ranking three sample
candidates is not trivial. Scelestial, based on Berman’s algorithm [24],
provides a generalized guaranteed neighbor-joining method.
We evaluated Scelestial’s performance with a diverse set of test cases
similar to modern single-cell datasets. The datasets were generated with
various ranges of clones, false positives, false negatives, and missing values,
all of which were derived from real single-cell datasets [25,30]. The simulated
data were produced by a tumor simulator emulating the process of tumor
growth via mutation and proliferation. In this way, data for different tumors
with various parameters were generated to assess computational methods
over a wide range of data types.
On these benchmark datasets, we compared Scelestial with seven other
state-of-the-art phylogeny reconstruction methods, namely BitPhylogeny [4],
OncoNEM [5], SCITE [6], SASC [8], SCIPhI [10], SiFit [7], and SiCloneFit
[11]. As the comparisons were not straightforward, we excluded B-SCITE
[12], which uses a combination of single-cell and bulk data, as well as
the approach of Kim and Simon [3], where the input is hard-coded, and
PhISCS [13], as it does not infer a lineage tree. Of the methods using the k-
Dollo assumption, namely SASC [8], which is based on simulated annealing,
and SPhyR [9], which is based on integer programming, we included SASC
in the comparison. For the assessment of lineage tree quality, we applied
two commonly used metrics from population genetics and phylogenetics.
In this comparison, Scelestial performed best both at reconstructing the ground
truth tree’s topology and in the similarity of the inferred tree to the ground
truth tree when branch lengths were considered.
When the methods were applied to cancer data, only Scelestial inferred
lineage trees that separated all cancerous cells from normal cells for all three
datasets, except for one cell. SiCloneFit had the second-best accuracy with
two misplaced cells, and OncoNEM was the only other method that did
not mix cancerous and normal cells in one single clone, although it mixed
cancerous and normal cells in the evolutionary tree. The run time analysis
showed Scelestial to perform up to two orders of magnitude faster than the
other methods when the default settings were used. This is particularly
important, as the number of single-cell genomes published in individual
studies continues to be on the rise, as are multi-dataset meta-studies, as
in [1].
Overall, Scelestial substantially improves lineage tree reconstruction
from single-cell variant calls across key criteria such as the scalability, run
times, and accuracy of lineage tree reconstruction on both real and simu-
lated data. Cell lineage reconstructions based on diverse tumor datasets
in combination with massive data gathering, resulting from fast advancing
technologies, provide a better understanding of the evolutionary landscape
and the associated mutations of tumors, and may also indicate the
dependencies between them [31–33]. These factors may make Scelestial instrumental
in furthering our understanding of the mutational landscape and the mech-
anisms of cancer formation and survival, as omics technologies continue to
thrive. Furthermore, the results of this paper can be seen as a case study
for translating concepts from theoretical computer science into advances in
computational biology.
Acknowledgments
This work was funded by a Helmholtz Incubator grant (Sparse2Big ZT-I-
0007).
References
1. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson
MD, et al. Eleven grand challenges in single-cell data science.
Genome biology. 2020;21(1):1–35.
2. Kuipers J, Jahn K, Beerenwinkel N. Advances in understanding
tumour evolution through single-cell sequencing. Biochimica et Bio-
physica Acta (BBA)-Reviews on Cancer. 2017;1867(2):127–138.
3. Kim KI, Simon R. Using single cell sequencing data to model the
evolutionary history of a tumor. BMC bioinformatics. 2014;15(1):27.
4. Yuan K, Sakoparnig T, Markowetz F, Beerenwinkel N. BitPhylogeny:
a probabilistic framework for reconstructing intra-tumor phylogenies.
Genome biology. 2015;16(1):36.
5. Ross EM, Markowetz F. OncoNEM: inferring tumor evolution from
single-cell sequencing data. Genome biology. 2016;17(1):69.
6. Jahn K, Kuipers J, Beerenwinkel N. Tree inference for single-cell
data. Genome biology. 2016;17(1):86.
7. Zafar H, Tzen A, Navin N, Chen K, Nakhleh L. SiFit: inferring
tumor trees from single-cell sequencing data under finite-sites models.
Genome biology. 2017;18(1):178.
8. Inferring Cancer Progression from Single-cell Sequencing while Al-
lowing Mutation Losses; 2018.
9. El-Kebir M. SPhyR: tumor phylogeny estimation from single-cell se-
quencing data under loss and error. Bioinformatics. 2018;34(17):i671–
i679.
10. Singer J, Kuipers J, Jahn K, Beerenwinkel N. Single-cell mutation
identification via phylogenetic inference. Nature communications.
2018;9(1):1–8.
11. Zafar H, Navin N, Chen K, Nakhleh L. SiCloneFit: Bayesian in-
ference of population structure, genotype, and phylogeny of tumor
clones from single-cell genome sequencing data. Genome research.
2019;29(11):1847–1859.
12. Malikic S, Jahn K, Kuipers J, Sahinalp SC, Beerenwinkel N. Inte-
grative inference of subclonal tumour evolution from single-cell and
bulk sequencing data. Nature communications. 2019;10(1):1–12.
13. Malikic S, Mehrabadi FR, Ciccolella S, Rahman MK, Ricketts C,
Haghshenas E, et al. PhISCS: a combinatorial approach for subperfect
tumor phylogeny reconstruction via integrative use of single-cell and
bulk sequencing data. Genome research. 2019;29(11):1860–1877.
14. Catanzaro D, Ravi R, Schwartz R. A mixed integer linear pro-
gramming model to reconstruct phylogenies from single nucleotide
polymorphism haplotypes under the maximum parsimony criterion.
Algorithms for Molecular Biology. 2013;8(1):3.
15. Sridhar S, Lam F, Blelloch GE, Ravi R, Schwartz R. Mixed integer
linear programming for maximum-parsimony phylogeny inference.
IEEE/ACM Transactions on Computational Biology and Bioinfor-
matics. 2008;5(3):323–331.
16. Caldwell AE, Kahng AB, Mantik S, Markov IL, Zelikovsky A. On
wirelength estimations for row-based placement. IEEE Transac-
tions on Computer-Aided Design of Integrated Circuits and Systems.
1999;18(9):1265–1278.
17. Peyer S. Shortest paths and Steiner trees in VLSI routing [PhD
thesis]. Research Institute for Discrete Mathematics, University of
Bonn; 2007.
18. Gilbert EN, Pollak HO. Steiner minimal trees. SIAM Journal on
Applied Mathematics. 1968;16(1):1–29.
19. Cieslik D. The Steiner ratio. vol. 10. Springer Science & Business
Media; 2013.
20. Du DZ, Smith J, Rubinstein JH. Advances in Steiner trees. vol. 6.
Springer Science & Business Media; 2013.
21. Ivanov AO, Tuzhilin AA. Minimal Networks: The Steiner Problem
and Its Generalizations. CRC Press; 1994.
22. Prömel HJ, Steger A. The Steiner tree problem: a tour through
graphs, algorithms, and complexity. Springer Science & Business
Media; 2012.
23. Alon N, Chor B, Pardi F, Rapoport A. Approximate maximum par-
simony and ancestral maximum likelihood. IEEE/ACM Transactions
on Computational Biology and Bioinformatics. 2008;7(1):183–187.
24. Berman P, Ramaiyer V. Improved approximations for the Steiner
tree problem. Journal of Algorithms. 1994;17(3):381–408.
25. Leung ML, Davis A, Gao R, Casasent A, Wang Y, Sei E, et al. Single-
cell DNA sequencing reveals a late-dissemination model in metastatic
colorectal cancer. Genome research. 2017;27(8):1287–1299.
26. Li Y, Xu X, Song L, Hou Y, Li Z, Tsang S, et al. Single-cell sequencing
analysis characterizes common and cell-lineage-specific mutations in
a muscle-invasive bladder cancer. GigaScience. 2012;1(1):12.
27. Chlebík M, Chlebíková J. The Steiner tree problem on graphs:
Inapproximability results. Theoretical Computer Science.
2008;406(3):207–214.
28. Borchers A, Du DZ. The k-Steiner Ratio in Graphs. SIAM Journal
on Computing. 1997;26(3):857–869.
29. Lähnemann D, Köster J, Fischer U, Borkhardt A, McHardy AC,
Schönhuth A. ProSolo: Accurate Variant Calling from Single Cell
DNA Sequencing Data. bioRxiv. 2020.
30. Nowell PC. The clonal evolution of tumor cell populations. Science.
1976;194(4260):23–28.
31. Croce CM. miRNAs in the spotlight: understanding cancer gene
dependency. Nature medicine. 2011;17(8):935–936.
32. Tsherniak A, Vazquez F, Montgomery PG, Weir BA, Kryukov
G, Cowley GS, et al. Defining a cancer dependency map. Cell.
2017;170(3):564–576.
33. Dempster JM, Pacini C, Pantel S, Behan FM, Green T, Krill-Burger
J, et al. Agreement between two large pan-cancer CRISPR-Cas9 gene
dependency data sets. Nature communications. 2019;10(1):1–14.