Universal Prediction of Cell Cycle Position Using Transfer Learning

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Universal prediction of cell cycle position using transfer learning

Shijie C. Zheng1, Genevieve Stein-O’Brien2,3,4,5, Jonathan J. Augustin3, Jared Slosberg2, Giovanni A. Carosso2, Briana Winer2, Gloria Shin2, Hans T. Bjornsson2,6,7,8, Loyal A. Goff2,3,4,*, and Kasper D. Hansen1,2,*

1Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health 2Department of Genetic Medicine, Johns Hopkins School of Medicine 3Department of Neuroscience, Johns Hopkins School of Medicine 4Kavli Neurodiscovery Institute, Johns Hopkins University 5Division of Biostatistics and Bioinformatics, Department of Oncology, Johns Hopkins School of Medicine 6Department of Pediatrics, Johns Hopkins University School of Medicine 7Faculty of Medicine, University of Iceland 8Landspitali University Hospital *Correspondence to [email protected] (LAG), [email protected] (KDH)

ABSTRACT

The cell cycle is a highly conserved, continuous process which controls faithful replication and division of cells. Single-cell technologies have enabled increasingly precise measurements of the cell cycle as both as a biological process of interest and as a possible confounding factor. Despite its importance and conservation, there is no universally applicable approach to infer position in the cell cycle with high-resolution from single-cell RNA-seq data. Here, we present tricycle, an R/Bioconductor package, to address this challenge by leveraging key features of the biology of the cell cycle, the mathematical properties of principal component analysis of periodic functions, and the ubiquitous applicability of transfer learning. We show that tricycle can predict any cell’s position in the cell cycle regardless of the cell type, species of origin, and even sequencing assay. The accuracy of tricycle compares favorably to gold-standard experimental assays which generally require specialized measurements in speciﬁcally constructed in vitro systems. Unlike gold-standard assays, tricycle is easily applicable to any single-cell RNA-seq dataset. Tricycle is highly scalable, universally accurate, and eminently pertinent for atlas-level data.

INTRODUCTION served mechanism with and integral role in gen- erating the diversity of cell types within multicel- The cell cycle is the biological process which con- lular organisms. As a result, maladaptive modiﬁ- trols faithful replication and division of cells across cations of the cell cycle can have devastating con- all species of life. Despite existing as a continuous sequences in development and disease (McConnell process, cell cycle has historically been character- and Kaznowski, 1991; Ambros, 1999; Ohnuma and ized as having four discrete stages during which Harris, 2003). Despite its importance, many of the the cell performs growth and maintenance (G1), molecular mechanism regulating and interacting replicates its DNA (S), increases further in size and with cell cycle remain poorly understood. prepares for mitosis (G2), and undergoes mitosis High-throughput expression data has been uti- and cytokinesis (M). Cell cycle is a highly con- lized for studying the cell cycle since the seminal

Zheng et al. | 2021 | bioRχiv | Page 1 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

work on the yeast cell cycle by Spellman et al. (1998) al. (2019). After projection, we infer cell cycle pseu- and Cho et al. (1998) at the dawn of the microar- dotime as the polar angle around the origin. This ray era. This work used various approaches to syn- pseudotime variable takes values in [0, 2π] and is chronize cells in specific cell cycle stages followed unrelated to wall time, but rather represents pro- by assaying cells in bulk. The data from Spellman gression through the cell cycle phases. We refer to et al. (1998) were later used by Alter et al. (2000) this psedudotime variable as cell cycle position to to show that principal component analysis reveals avoid confusion with wall time and to emphasize a circular pattern which represents the cyclical na- its periodic nature. ture of the cell cycle; widely cited as one of the first To define a reference cell cycle embedding, we examples of the use of principal component anal- leverage key features of principal component analysis and singular value decomposition in analysis ysis of cell cycle genes. Previous work has found of high-throughput expression data. Subsequent that principal component analysis on expression work sought to systematically identify both period- data sometimes yield an ellipsoid pattern. This was ically expressed genes and cell cycle marker genes first described by Alter et al. (2000); it has later and deposited these into widely used databases been observed independently in multiple data sets (Whitfield et al., 2002; Gauthier et al., 2008). (Schwabe et al., 2020; Liu et al., 2017; Mahdessian et Single-cell technologies have enabled the ability al., 2021). Here, we demonstrate that the ellipsoid to study the effects of cell cycle in multicellular or- pattern is a consequence of a link between Fourier ganisms with a degree of sensitivity and accuracy analysis of periodic functions and principal com- only previously available in monocelluar or clonal ponent analysis. The shape is created by the fact systems. Thus, cell cycle has been the subject of that cell cycle genes are periodic with a single peak substantial interest, both as a biological variable of of expression (which differs between genes). Thus, interest and as a possible confounding feature for there is a direct link between progression through other comparisons of interest (Buettner et al., 2015). the cell cycle process and angular position on the A number of methods have been developed to es- ellipsoid. timate cell cycle state from single-cell expression We use the first two principal components to de- data (Leng et al., 2015; Scialdone et al., 2015; Liu fine a reference embedding representing the cell et al., 2017; Stuart et al., 2019; Hsiao et al., 2020; cycle. Because this reference embedding is a low Schwabe et al., 2020). These methods differ broadly dimensional linear space, we obtain an orthogonal in the definition of cell cycle state (discrete stages vs. projection operator allowing us to project any new continuous pseudotime) as well as the use of spe- data set into the reference embedding. We show cial training data. Most of these methods have been that projecting new data into the reference cell cy- demonstrated to be effective on datasets consisting cle embedding overcomes technical and biological of a single cell type. Despite the conservation of the challenges posed by data sets where substantial cell cycle process, none of these methods have been variation is explained by one or more factors differ- shown to be applicable across single-cell technolo- ent from cell cycle, such as cellular differentiation. gies and mammalian tissues. Principal component analysis and periodic RESULTS functions To gain insight into gene expression dynamics over Transfer learning the cell cycle, we start by analyzing principal com- To develop a universal method for estimating a con- ponent analysis of periodic functions. Our model is tinuous cell cycle pseudotime for a single-cell ex- a collection of periodic functions with a single peak, pression data set independent of technology, cell taking the form type, or species, we leverage transfer learning via xg(θ) = Ag cos(θ − Lg) dimensionality reduction (Pan et al., 2008). We define a reference cell cycle embedding (or latent with a gene-specific amplitude (Ag) and location of space) into which we project a new data set; an ap- the peak (Lg) with 0 ≤ θ < 2π representing the un- proach originally advocated for in Stein-O’Brien et known cell cycle position. Figure 1a,b depicts the

Zheng et al. | 2021 | bioRχiv | Page 2 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a Unobserved

Gene Gene 1 Gene2

0 c Inference of the time order d Recovered Expression -1 Gene Gene 1 Gene2 1 2 0 0.5π 1π 1.5π 2π ^ Timepoint θ θ 0 0 Observed

b Expression -1 PC1 (11.7%) Gene Gene 1 Gene2 -2 1 0 0.5π 1π 1.5π 2π -5 0 5 ^ Recoverd timepoint θ 0 PC1 (77.2%) Expression -1

0 0.5π 1π 1.5π 2π Permuted timepoint θ'

Figure 1. Principal component analysis recovers time ordering in simulations. Simulations are based on cosine functions with Gaussian noise (Methods). (a) Expression vs. time for 2 genes with different peak locations and amplitudes. Each of the two gene peaks are replicated 50 times for a total of 500 genes and 1,000 time points (cells). (b) Expression vs. permuted time, representing the unknown time order of observed data which obscures the periodicity of the functions. (c) Principal component analysis of the data from (b) and (a); the two datasets have equivalent principal components. We infer cell cycle position (θˆ) by the angle of the ellipsoid. The red dot indicates θ = 0. (d) Expression vs. inferred cell cycle position.

unobserved (true) time ordering, observed on a dis- (Methods). The assumption of a single expression crete grid of time points, together with a random peak for each gene is supported by empirical data permutation of these time points; this represents for genes in the cell cycle expression program (see the observed data which is not ordered by time. A below). The two first principal components of this key insight is the fact that the first two principal data can be represented as components are the same for the observed and the ( ) = t ( ( ) ( )) unobserved data (Figure 1c), when performed on pc1 θ b1 cos θ , sin θ t a discrete set of observation times. The unknown pc2(θ) = b2(cos(θ), sin(θ)) time order can be inferred from the principal component plot as the angle of each point, making it where b1, b2 are two dimensional vectors which are possible to fully reconstruct the unobserved time linear functions of the Eigenvectors and -values of a × order (Figure 1d), i.e., the first two principal com- 2 2 matrix entirely determined by the set of peak ( ) ponents form an orthogonal projection into a two- locations and amplitudes Lg, Ag (Methods). No dimensional space representing the periodic time. matter how many distinct peak locations and ampli- For this result to hold, it is required that the gene tudes are present, the space representing periodic expression data exhibits at least two distinct peak time will always be two-dimensional. Higher di- locations (not separated by exactly π) and that each mensions are only required when individual genes gene has at most one peak over the time period have multiple peaks. Previous empirical investiga- tions of cell cycle using expression data supports

Zheng et al. | 2021 | bioRχiv | Page 3 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

the observation of a 2-dimensional space for princi- with a sparse/empty interior (Figure 2a). Using the pal component analysis Buettner et al. (2015) and modified-Schwabe cell cycle stage predictor, we ob- Schwabe et al. (2020). serve a strong relationship between polar angle on The simulated data depicted in Figure 1 has the ellipsoid and predicted cell cycle stage. Gaussian noise, but we have verified that the result The strong relationship between polar angle on holds for data generated using the negative bino- the ellipsoid and predicted cell cycle stage was also mial distribution with an associated mean-variance observed on an independent dataset on cultured relationship. Using the negative binomial dis- primary mouse hippocampal progenitors from a tributed data required more than 2 distinct peaks to wild-type mouse as well as from a Kmt2d+/βgeo be stable (Supplementary Figures S1, S2). For both mouse, a previously described model of Kabuki distributions, this approach is robust to downsam- syndrome (Carosso et al., 2019). The data were pro- pling of the data similar to what is seen with the cessed similarly to the neurosphere data. Again, increased sparsity from droplet based sequencing we select the top 500 most variable cell cycle genes technology. In simulations, we can recover cell cy- and perform a principal component analysis (Fig- cle position with as little as 10 total counts per cell ure 2b) which reveal an ellipsoid pattern. The shape across 100 genes (depending on noise levels and of the principal component plot differs between heights of the peaks) (Supplementary Figure S3). the two datasets, but the weights used to form the first two principal components are highly concor- Recovering cell cycle position using principal dant (Figure 2c, Supplementary Figure S5 for PC2) component analysis on cell cycle genes for the 318 genes present in both cell cycle embeddings. Almost all of the highly ranked genes (ab- We next assess our model on experimental data, solute weights > 0.1, highlighted in red and la- and learn an embedding representing cell cycle. belled with gene name) represent important regula- We use 10x Genomics Chromium single-cell RNA- tors of, or participants in, the cell cycle. For exam- Sequencing (scRNA-Seq) data on two replicate cul- ple, the highest ranked gene is Topoisomerase 2A tures of E14.5 mouse cortical neurospheres (Meth- Top2a which controls the topological state of DNA ods), integrated using Seurat 3 and transformed to strands and catalyzes the breaking and rejoining of log2-scale. The use of an alignment method (CCA DNA to relieve supercoiling tension during DNA in Seurat3) to integrate the two samples is impor- replication and transcription (Lodish et al., 2008). tant for the quality of the ellipsoid, by maximizing Also highly ranked are Smc2 and Smc4 which com- the correlation structure between the two samples. pose the core subunits of condensin, which regu- Since neurospheres are maintained in a prolifera- lates chromosome assembly and segregation (Ono tive state, we expect that cell cycle phase is an im- et al., 2003; Wei-Shan et al., 2019). portant contributor to the variation in expression Given our mathematical analysis as well as the within this single-cell dataset. To confirm this ex- strong empirical relationship between polar angle pectation, we consider a UMAP representation of on the ellipsoid and cell cycle stage predictions, we the data based on all variable genes (Supplemen- define a method to learn cell cycle position as the tary Figure S4) colored according to the predictions polar angle around the origin on the coordinate from two separate cell cycle stage estimation utili- plane which we denote by θ. We center the coor- ties (cyclone and a modification of Schwabe et al. dinate plane on (0, 0) whose location corresponds (2020) we call modified-Schwabe, see Methods); to cells with zero expression for all 500 variable cell this analysis demonstrates that the cell cycle is a cycle genes. major source of transcriptional variation in the neu- To demonstrate that cell cycle position reflects the rosphere dataset. true biological cell cycle progression, we consider We then perform principal component analysis expression dynamics of specific cell cycle genes. of the top 500 most variable genes amongst the For Top2a and Smc2 the peak expressions are ob- roughly 1700 genes annotated with the Gene On- served at G2 stage around π (Figure 2e), consis- tology cell cycle term (GO:0007049, Methods) (Ash- tent with their known increased expression through burner et al., 2000). As suggested by our model, the S phase and into G2 (Heck et al., 1988; Belluti first two principal components form an ellipsoid et al., 2013; Wei-Shan et al., 2019). Furthermore,

Zheng et al. | 2021 | bioRχiv | Page 4 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mNeurosphere (n=12805) b mHippNPC (n=9188) c Overlapped genes(n=321)

CC Stage 10 10 PCC = 0.93 G1.S Mcm6 0.0 Cdkn2c 5 S 5 G2 Topbp1 Plk4 Nasp Mcm7 G2.M θ θ Bub3 Kif18a Mdc1 0 Rad21 M.G1 0 -0.1 Tipin NA Nde1 Ezh2 Tacc3 Nusap1 Cit PC2 (2%) -5 PC2 (2%) -5 Kif11 Ncapd2 Smc4 Ckap2 -0.2 Smc2 -10 Top2a Hjurp Dbf4 -10 Lmnb1 Knl1

-10 0 10 -15 -10 -5 0 5 10 mHippNPC PC1 weights -0.20 -0.15 -0.10 -0.05 0.00 PC1 (11%) PC1 (7%) mNeurosphere PC1 weights

d Top2A e Smc4 f TotalUMIs

Dataset-loess 5 mHippNPC mNeurosphere 4 17 4 Dataset mHippNPC 3 16 3 mNeurosphere 15 2 2 (TotalUMIs) (expression) (expression) 2

2 2 14 1 1 log log log 13 0 0 0 0.5π 1π 1.5π 2π 0 0.5π 1π 1.5π 2π 0 0.5π 1π 1.5π 2π PCA θ PCA θ PCA θ

Figure 2. The cell cycle ellipsoid and cell cycle position. (a) Top 2 principal components of GO cell cycle genes from E14.5 primary mouse cortical neurospheres, in which the variation is primarily driven by cell cycle. Each point represents a single cell, which is colored by 5 stage cell cycle representation, inferred using the modiﬁed Schwabe method (Schwabe et al., 2020). The cell cycle position θ (with values in [0, 2π); sometimes called cell cycle pseudotime) is the polar angle. (b) As in (a), but for a dataset of primary mouse hippocampal progenitor cells from both a mouse model of Kabuki syndrome and a wildtype. (c) A comparison of the weights on principal component 1 between the cortical neurosphere and hippocampal progenitor datasets. Genes with high weights (|score| > 0.1 for either vector) are highlighted in red. (d,e) The expression dynamics of (d) Top2A and (e) Smc4 using the inferred cell cycle position, with a periodic loess line (Methods). (f) The dynamics of total UMI using the inferred cell cycle position, with a periodic loess line, illustrating the high agreement of the dynamics between datasets.

the dynamics are highly similar between the inde- to the end of G2 and the middle of M stage re- pendently analyzed cortical neurosphere and hip- spectively. We observe the total UMI number be- pocampal NPC datasets, which supports the obser- gins to increase at the beginning of G1/S phase vation that the two different embeddings yield con- and to decrease sharply as cells progress through cordant cell cycle positions (despite each including M phase. The difference between the maximum dataset-speciﬁc genes). These observations hold for and minimum of the periodic loess line is 1, cor- all genes with high weights (Supplementary Fig- responding to a two-fold difference in total UMI, ure S6. This approach serves as an internal control which is known to be proportional to cell size (Mar- in any single-cell RNA-seq data set and can be used guerat and Bahler,¨ 2012; Padovan-Merhar et al., to assess the quality of any continuous ordering. 2015). This observation, and the timing with re- Next, we directly relate θ to the measured tran- spect to cell cycle position, is consistent with the

scription values. Figure 2d shows the log2 trans- approximate reduction in cellular volume by one formed total UMI numbers against θ, with a pe- half as a result of cytokinesis in M phase and the riodic loess smoother for each dataset. In both formation of two daughter cells of roughly equal datasets, the maximum level is reached around π size. and the minimum around 1.5π, which corresponds Note that these principal component analyses

Zheng et al. | 2021 | bioRχiv | Page 5 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a UMAP (mPancreas) b PCA of GO (mPancreas) c PCA of Ductal (n=895)

Cell type 4 4 Ductal 3 CC Stage Ngn3 low EP G1.S Ngn3 high EP 0 0 0 S Pre-endocrine G2 Beta UMAP-2 -4 G2.M

PC2 (3%) PC2 (3%) -3 -4 Alpha M.G1 Delta Epsilon -8 -6 -10 -5 0 5 10 -10 -5 0 5 -5 0 5 UMAP-1 PC1 (8%) PC1 (8%)

Figure 3. When principal component analysis fails to describe the cell cycle. Data is from the developing mouse pancreas. (a) UMAP embedding using all variable genes. Cells are colored by cell type. (b) PCA plot of the cell cycle genes; this reﬂects the differentiation path in (a). (c) PCA plot of the cell cycle genes for ductal cells only; this plot reﬂects cell cycle.

are differentiating G2/M cells from G1/G0 cells existing mouse developing pancreas dataset, with on the first principal component. This is in con- cell type labels (Bastidas-Ponce et al., 2019). A ma- trast to the mathematical analysis where the start- jor source of variation in this dataset is cellular dif- ing point (θ = 0) can be any location (red point in ferentiation as demonstrated by a standard UMAP Figure 1) as there is no clear starting point for a pe- embedding (based on all variable genes) illustrat- riodic function. That the first principal component ing the previously described (Bastidas-Ponce et al., differentiates G2M from G1/G0 can be explained 2019) differentiation trajectories (Figure 3a). When by the nature of principal component analysis. Be- we perform principal component analysis using fore principal component analysis we subtract each only the variable cell cycle genes, the resulting PCA gene’s mean expression. However, genes mark- plot still reflects the differentiation trajectory and ing G2/M usually have very high expression com- does not resemble the ellipsoid pattern observed in pared to other stages, with G0/G1 being the lowest the previous section (Figure 3b,c). Note that PC1 (Supplementary Figure S7), ensuring that this be- has some relationship with cell cycle since the dif- comes the first principal component. A clustering ferentiation path goes from cycling to non-cycling analysis of the expression patterns provides further cells, but it also reflects the progression from cy- evidence that cell cycle genes have a single peak cling multipotent cells to terminally differentiated pattern of expression (Supplementary Figure S7). cells. This result strongly suggests that some of the Thus, the observed behavior of the cell cycle genes cell cycle genes may participate in biological pro- in these data sets fits the theoretical requirements cesses other than the cell cycle and demonstrates of our model. that PCA of cell cycle genes does not always exclu- In summary, principal component analysis of the sively capture cell cycle variance. cell cycle genes predicts cell cycle progression for However, when we perform principal compo- the mNeurosphere and mHippNPC datasets with nent analysis only on a subset of cells from a sin- a high degree of similarity between the cell cycle gle, proliferating progenitor cell type, the ellipsoid position inferred independently in the two datasets pattern returns (Supplementary Figure S8a,b). This as predicted by our mathematical model. highlights the challenge of inferring cell cycle for datasets that contain many different cell types, in- When principal component analysis fails to cluding postmitotic cells. reflect cell cycle position Transfer learning through projection A principal component analysis does not always yield an ellipsoid pattern; a requirement for this to To overcome the challenges of inferring cell cycle work is for the first principal component to domi- position in arbitrary datasets, we propose a simple, nated by cell cycle. To illustrate this, we used an yet highly effective transfer learning approach we

Zheng et al. | 2021 | bioRχiv | Page 6 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

mHippNPC (n=9188) mPancreas (n=3559) mRetina (n=99260) HeLa2 (n=2463) a 5.0 4 2.5 2.5 2.5 CC Stage 2 θ θ G1.S 0.0 θ 0.0 0.0 S 0 θ G2

Space Dim 2 Space Dim 2 -2.5 Space Dim 2 -2.5 Space Dim 2 G2.M

ns ns ns ns -2.5 -2 M.G1 NA CC CC -5.0 CC -5.0 CC -6 -4 -2 0 2 -8 -4 0 -7.5 -5.0 -2.5 0.0 2.5 -5.0 -2.5 0.0 2.5 5.0 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1 b mHippNPC Top2A mPancreas Top2A mRetina Top2A HeLa2 TOP2A 5 5 3 4 CC Stage 4 4 G1.S 3 3 3 2 S 2 G2 2 2

(expression) (expression) 1 (expression) (expression) G2.M 2 1 2 2 1 2 1 M.G1 log 0 log 0 log 0 log 0 NA 0 0.5π 1π 1.5π 2π 0 0.5π 1π 1.5π 2π 0 0.5π 1π 1.5π 2π 0 0.5π 1π 1.5π 2π CCns Position θ CCns Position θ CCns Position θ CCns Position θ c mHippNPC mPancreas mRetina HeLa2 4 4 2.5 2 5 0.5π 0 0.0 0 0 π 0/2π UMAP-2 UMAP-2 UMAP-2 UMAP-2 -2 -4 -2.5 1.5π -5

-3 0 3 6 -10 -5 0 5 10 -6 -3 0 3 -5.0 -2.5 0.0 2.5 UMAP-1 UMAP-1 UMAP-1 UMAP-1

Figure 4. A pre-learned weights matrix learned from proliferating cortical neurospheres enables cell cycle position estimation in other proliferating datasets. (a) Different datasets (hippocampal NPCs, mouse pancreas, mouse retina and HeLa set 2) projected into the cell cycle embedding deﬁned by the cortical neurosphere dataset. Cell cycle position θ is estimated as the polar angle. (b) Inferred expression dynamics of Top2a (TOP2A for human), with a periodic loess line (Methods). (c) UMAP colored by cell cycle position using a circular color scale.

term tricycle (transferable representation and infer- cell cycle position in multiple independent and dis- ence of cell cycle). In short, we first construct a refer- parate datasets; evidence of which is provided be- ence embedding representing the cell cycle process low. Specifically, using the cortical neurosphere using a fixed dataset where cell cycle is the primary dataset as a fixed reference, our transfer learning source of transcriptional variation. For the remain- approach generalizes across cell types, species (hu- der of this manuscript we will use the cortical neu- man/mouse), sequencing depths and even single- rosphere data as this reference. We show that the cell RNA sequencing protocols. learned reference embedding generalizes across all As a demonstration, we consider a diverse se- datasets we have examined. Because our reference lection of single-cell RNA-seq datasets represent- embedding is a linear subspace, we benefit from ing different species (mouse and human), cell types an orthogonal projection operator which allows us and technologies (10x Chromium, SMARTer-Seq, to map new data into the reference embedding, Drop-seq and Fluidigm C1) (Table S1). We project with well understood mathematical properties. Fi- these datasets into the cell cycle embedding learned nally, we infer cell cycle position by the polar an- from the neurosphere data (Figure 4, Supplemen- gle around the origin of each cell in the embedding tary Figure S9), and color the projections accord- space. The robustness of this approach is demon- ing to the modified Schwabe estimator of cell cycle strated by the ability of this projection to estimate stage. Although the shape of the projection varies

Zheng et al. | 2021 | bioRχiv | Page 7 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

from dataset to dataset, the cells of the same stage OS cells to derive a FUCCI-based pseudo-time scor- always appear at a similar position of θ, such as ing. Their FUCCI measurements form a distinct cells at S stage centering at 0.75π. To verify our cell horseshoe shape with the left side of the horseshoe cycle ordering, we look at the expression dynamics representing time post-metaphase-anaphase tran- of Top2a and Smc4 as a function of θ (Figure 4, Sup- sition with a continuous progression through G1, plementary Figure S10). PCA plots of the GO cell S, G2 and ending pre-metaphase-anaphase transi- cycle genes for each dataset illustrates the advan- tion (Figure 5; this depiction mirrors other data tage of using a fixed embedding to represent cell presentations (Sakaue-Sawano et al., 2008; Sakaue- cycle (Supplementary Figure S11). Together, these Sawano et al., 2017)). Cell cycle is a continuous results strongly supports that tricycle generalizes process which is not immediately reflected in the across data modalities. horseshoe form because of the large gap (in the x- Having inferred cell cycle position, we can vi- axis) between the two ends of the horsehoe. The x- sualize the cell cycle dynamics on a UMAP plot axis reflects the protein levels of geminin (GMNN) representing the full transcriptional variation, as is which is degraded during the metaphase-anaphase standard in the scRNA-Seq literature (Figure 4). To transition (McGarry and Kirschner, 1998) and the effectively visualize cell cycle position, we use a cir- two ’open’ ends of the horseshoe are closely con- cular color scale to account for the fact that position nected in time despite the visual gap in the scatter- “wrap around” from 2π to 0. Doing so reveals the plot. This fact gives the FUCCI system the ability to smooth behaviour of the tricycle predictions (de- assess whether a cell in M phase is before or after spite not using smoothing or imputation) and ar- this transition, or said differently, a high temporal gues for representing cell cycle in gene expression resolution around this transition despite the rela- data as a continual progression rather than discrete tively short wall time compared to the rest of the states. cell cycle. We observe a close correspondence between tricycle cell cycle position and FUCCI pseu- Cell cycle position estimation on gold-standard dotime. The only cells for which there is a superfi- datasets cial disagreement are placed in M phase by tricycle (cell cycle position around 0.85π) and are split be- We validated tricycle on multiple datasets con- tween pre-metaphase-anaphase transition and post- taining “gold-standard” cell cycle measurements, metaphase-anaphase transition by FUCCI pseudo- including measurements by proxy using the flu- time, for this particular transition the FUCCI sys- orescent ubiquitination-based cell-cycle indicator tem has higher temporal resolution than tricycle; (FUCCI) system and by fluorescence-activated cell adding a small offset to these cells results in a re- sorting (FACS) of cells in discrete cell cycle stages. markable concordance between the two systems Both of these approaches allow for assignment to (Figure 5). Elsewhere in the cell cycle, there is no or selection of cells from discrete phases of the cell evidence of better temporal resolution with FUCCI; cycle. The FUCCI system uses a dual reporter as- examining expression dynamics suggests that tri- say in which the reporters are fused to two genes cycle does at least as good as FUCCI as order- with dynamic and opposing regulation during the ing key cell cycle genes. We can use tricycle to cell cycle (Sakaue-Sawano et al., 2008), allowing for examine the expression dynamics of GMNN and a quantitative assessment of whether cells are in CDT1 which reveals that GMNN expression is sta- G1 or S/G2/M phase. In contrast to FACS, FUCCI ble across the cell cycle (Supplementary Figure S12), systems, combined with an appropriate quantifi- suggesting the protein is predominantly regulated cation method, make it possible to continuously post-transcriptionally during mitosis. measure cell cycle progression by placing the 2 Hsiao et al. (2020) used FUCCI on human in- protein measurements in a 2-dimensional space. duced pluripotent stem cells (iPSC) followed by Cell cycle pseudotime needs to be inferred from scRNA sequencing using Fluidigm C1. While the these 2-dimensional measurements, which is usu- Mahdessian et al. (2021) FUCCI data look like a ally done by a variant of polar angle (Hsiao et al., horseshoe, the Hsiao et al. (2020) FUCCI data are 2020; Mahdessian et al., 2021). more akin to a cloud (the data differ in quantifica- Mahdessian et al. (2021) measured human U-2 tion and normalization of the FUCCI scores). These

Zheng et al. | 2021 | bioRχiv | Page 8 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

hU2OS FUCCI (n=1114) hU2OS FUCCI (n=1114) hU2OS R2

a b c CDCA8 KIF23CDCA2 TOP2A 1.00 2 AURKA HJURP 0 FAM83D 0.6 NDC80 ASPM CCNF 0.5π 0.75 KIF2C AURKB UBE2C NUF2 -2 0.4 0.50 π 0/2π

Position θ R 0.2 -4 0.25 ns CDT1-Red585 1.5π FUCCI pseudotime 0.00 CC 0.0 -4 -2 0 0.8π 1.3π 1.8π 0.3π 0.8π 0.0 0.40.2 0.6 2 GMNN-Green530 CCnsPosition θ FUCCI pseudotime R

2 hiPSCs FUCCI (n=888) hiPSCs log2(TOP2A) hiPSCs log2(TOP2A) hiPSCs R

d e f g CKS2 2 UBE2C 6 6 0.4 TPX2 AURKA TOP2A 1 MCM5 CCNB1 CENPF CENPE PCNA 5 5 0.3 PRC1 NUSAP1 HJURP DTL

CDC6 CDK1 4 4 0.2 DLGAP5 0 KIF23 CKAP2 (expr TOP2A) Position θ R (expr TOP2A) 3 2 3 2 0.1 2 2 ns

CDT1-mCherry R = 0.422 R = 0.274 log log 2 2 -1 CC 0.0 -1.5 -1.0 -0.5 0.0 0.5 0π 0.5π π 1.5π 2π 0π 0.5π π 1.5π 2π 0.30.20.10.0 0.4 2 GMNN-EGFP CCnsPosition θ FUCCI pseudotime FUCCI pseudotime R

Figure 5. Evaluation of tricycle on FUCCI datasets. (a-c) Data from Mahdessian et al. (2021). (a) FUCCI scores colored by tricycle cell cycle position. (b) Comparison between FUCCI pseudotime and tricycle cell cycle position with a periodic loess line. Cells in the dotted rectangle were moved by adding one period 2π to tricycle θ to reﬂect the higher temporal resolution around the anaphase-metaphase transition for FUCCI pseudotime. Note that the x-axis starts at 0.85π, which corresponds to FUCCI pseudotime 0. (c) R2 values of periodic loess line of all projection genes when using tricycle θ and FUCCI pseudotime as the predictor. The dashed line represents y = x. (d-g) Data from Hsiao et al. (2020). (d) FUCCI scores colored by tricycle cell cycle position. (d,e) Expression dynamics of Top2a with a periodic loess line using either (d) tricycle cell cycle position or (e) FUCCI pseudotime inferred by Hsiao et al. (2020). Cells are colored by 5 stage cell cycle representation, inferred using the modiﬁed Schwabe method Schwabe et al. (2020). (g) Similar to (c), but for the data from Hsiao et al. (2020).

data are used to estimate a continuous cell cycle po- (2015) assays mouse embryonic stem cells (mESC) sition (which we term “FUCCI pseudotime”) based using Hoechst 33342-staining followed by cell iso- on polar angle of the FUCCI scores. Compared with lation using the Fluidigm C1. They use very con- the data in Mahdessian et al. (2021), there are larger servative gating for G1 and G2M at the cost of less differences between FUCCI pseudotime and tricy- conservative gating for S phase. Leng et al. (2015) cle cell cycle position. However, we can directly uses FACS on FUCCI labeled H1 human embryonic compare the associated expression dynamics of key stem cells (hESC) followed by cell isolation using cell cycle genes (Figure 5 for TOP2A, Supplemen- the Fluidigm C1. In both experiments, cells largely tary Figure S13 for 8 additional genes). These re- appear as expected in the cell cycle embedding de- sults suggests that tricycle cell cycle position is at ﬁned by the cortical neurosphere reference embed- least as good or better as the FUCCI pseudotime ding (Supplementary Figure S14). For the mESC, at ordering the cells along the cell cycle; the R2 for we note that some cells labeled S (but not G1 or TOP2A is 0.42 for tricycle compared with 0.27 for G2M) appear outside the position expected for this peco. stage, consistent with the gating strategy used for In contrasts to FUCCI measurements, FACS sort- these data. ing and enrichment of cells yields groups of genes Summarizing this evidence, we conclude that tri- in (supposedly) distinct phases of the cell cycle. We cycle recapitulates and reﬁnes the cell cycle order- consider 2 different datasets where FACS has been ing consistent with current “state of the art” experi- combined with single-cell RNA-seq. Buettner et al. mental methods. Tricycle cell cycle position is com-

Zheng et al. | 2021 | bioRχiv | Page 9 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

petitive with FUCCI based measurements, except other stages. for cells in the metaphase to anaphase transition Additionally, we benchmarked the computa- during mitosis. tional speed and performance of tricycle against other cell cycle estimation algorithms. We briefly Comparison to existing tools for cell cycle compared the running time of several methods us- position inference ing subsets of the mRetina dataset (Supplementary Figure S22). To compute continuous estimates us- We next sought to compare tricycle cell cycle posi- ing tricycle takes a mean of about 0.58, 0.86 and 1.48 tion estimates to those obtained from other avail- seconds when the number of cells is 5000, 10000, able methods. Existing methods for cell cycle as- and 50000 respectively. In contrast, to compute fi- sessment can be divided into those which infer a nite discrete stages Seurat takes a mean of about continuous position and those which assign a dis- 1.10, 1.22 and 4.95 seconds for a three stage esti- crete stage. We have evaluated the following meth- mation and cyclone takes a mean of about 7.96, ods: peco (Hsiao et al., 2020), Revelio (Schwabe et 11.50 and 50.66 minutes for a three stage estima- al., 2020), Oscope (Leng et al., 2015), reCAT (Liu tion, when the number of cells is 5000, 10000, and et al., 2017), cyclone (Scialdone et al., 2015), Seurat 50000 respectively. Other methods (peco, Oscope, (Stuart et al., 2019), the original Schwabe Schwabe reCAT) are not capable of processing large (10k- et al., 2020, and the modified Schwabe 5 stage as- 100k+) datasets. All of the comparisons were run signment method. Each method differs in which on Apple Mac mini (2018) with 3.2 GHz 6-Core In- datasets it works well on and which issues it might tel Core i7 CPU, 64GB RAM, and operating system have; a detailed comparison is available in the Sup- macOS 11.2. Thus, tricycle is able to scale with the plement (Supplemental Methods, Supplementary increasing size of datasets. Figures S15-S21). Issues with existing methods include (a) ability Application of tricycle to a single-cell RNAseq to work on datasets with multiple cell types, (b) atlas the ability to scale to tens of thousands of cells or more, and (c) the ability to work on less information To demonstrate the scalability and generalizability rich datasets such as those generated by droplet- of tricycle we applied it to a recent dataset of ≈ 4 based or in situ scRNA-Seq methods. Oscope re- million cells from the developing human (Cao et quires data on many genes due to its use of pair- al., 2020). The data were generated using combi- wise correlations, and therefore does not work on natorial indexing (sci-RNA-seq3) and are relatively less information rich platforms (e.g 10x Chromium lightly sequenced with a median of 429 − 892 total or Drop-Seq). peco works better on less sparse, and UMIs for 4 single-cell profiled tissues and 354 − 795 information-rich data (e.g. Fluidigm C1), but even for 11 single-nuclei profiled tissues (Supplementary on data from this platform, it is outperformed by Figure S23). Using tricycle, we are able to rapidly tricycle. reCAT is critically dependent on the extent and robustly annotate cell cycle position for each of to which a principal component analysis of the cell the cells/nuclei in this atlas (Figure 6a, Supplemen- cycle genes reflect cell cycle and only infers a cell tary Figure S24). Within a global UMAP embed- ordering; it is not straightforward to interpret the re- ding, tricycle annotations enable immediate visual CAT ordering, especially across datasets. Revelio is identification of proliferating and/or progenitor primarily a visualization tool, which appears to fail cell populations for most cell types and tissues. The on datasets where substantial variation is driven by rapid annotation of cell cycle position on this refer- processes other than the cell cycle. Of the discrete ence dataset further allowed us to examine the rel- predictors, Seurat agrees well with tricycle (and is ative differences in the proportion of cells actively very scalable) but is limited by only predicting a proliferating across different tissues and cell types 3 stage cell cycle representation (G1/S/G2M). Cy- in the developing human. To quantify this, we dis- clone appears to do poorly in labelling cells in S cretized all cells along θ into two bins correspond- phase and only predicts 3 stages. The (modified) ing to actively proliferating (0.25π < θ < 1.5π; Schwabe predictor assigns 5 stages, but has many S/G2/M) or non-proliferating (G1/G0). We next missing labels and mis-assigns cells from G0/G1 to ranked each tissue by the relative proportion of ac-

Zheng et al. | 2021 | bioRχiv | Page 10 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

tively proliferating cells to identify the tissues and mapping between the two data sets contains only a cell types with the highest proliferative index (Fig- subset of the 500 genes used in the embedding. The ure 6b). To examine cell-type specific differences number of genes available for the feature mapping in proliferation potential, we computed the cell cy- has an impact on the shape of the resulting embed- cle embedding as well as the proliferative index for ding; the mNeurosphere and mHippNPC datasets the 9 most abundant cell types within each tissue have almost the same shape when restricted to a (Supplementary Figures S25 and S26). set of common genes (Supplementary Figure S28). Tissue-level proliferation indexes identified thy- To establish the stability of tricycle, we randomly mus, cerebrum, and adrenal gland as having the removed genes from the neurosphere dataset and highest overall proportions of dividing cells across computed tricycle cell cycle positions; we used the the sampled fetal timepoints. Within the thymus, neurosphere dataset as a positive control to ensure thymocytes represent both the most abundant cell all genes are present. We used the circular correla- type and the most ’prolific’ cell types as a function tion coefficient to assess the similarity between the of the proporation of mitotic cells. Thymocytes ex- tricycle cell cycle position for the full dataset vs. the hibit a circular embedding in UMAP space that ef- dataset with randomly pruned genes (Supplemen- fectively recapitulates the estimated cell cycle po- tary Figures S29, S30). This reveals excellent stabil- sition predictions from tricycle (Supplemental Fig- ity (circular ρ > 0.8) using as little as 100 genes. ure S26k). Within this circular embedding, there To examine the impact of sequencing depth, we is a gap of cells with cell cycle position estimates downsampled the mHippNPC dataset (Supplemen- at ≈ π, consistent with dropout of cells and lower tary Figures S31, S32), and used the circular cor- information content in M-phase. Comparison of relation coefficient to quantify to similarity to the tricycle cell cycle annotations to modified Schwabe cell cycle position inferred using the full sample. cell cycle phase calls in this embedding suggests Originally, the median of library sizes (total UMIs) that tricycle more accurately estimates cell cycle pois 10,000 for mHippNPC data. Downsampling to sition even on cell types with a mean total UMI of 20% of the original depth(approximate median of 354 (Supplementary Figure S27). library sizes 2,000) kept circular ρ > 0.8. This Within tissues, lymphoid cells are often the cell is congruent with the observed robustness of the type with the highest proliferation index (Supple- method to the varying sequencing depth of the var- mentary Figures S26, S25); often with a greater ious datasets examined above. number of actively proliferating cells than not. Next, we examined the stability of tricycle wrt. Within the fetal liver and spleen – both sites of the choice of reference embedding. Above, we early embryonic erythropoiesis during human de- show a cell cycle space estimated separately for velopment (Cumano and Godin, 2007) – erythrob- the mNeurosphere and the mHippNPC datasets lasts represent the cell type with the highest frac- (Figure 2). We observe that the inferred expres- tion of proliferating cells. Across developmental sion dynamics are more alike in the two datasets time, most tissues maintain relatively monotonic if we project the mHippNPC into the mNeuro- proliferation indices, however several (liver, pla- sphere embedding compare to using its own em- centa, intestine) exhibit dynamic changes across the bedding. To quantify this, we pick key cell cy- sampled timepoints. This application illustrates the cle genes (previously examined in Supplementary utility of tricycle to atlas-level data. Figure S6) and compare the location of peak expression in the mNeurosphere dataset compare to Stability of the cell cycle position assignments the mHippNPC dataset with cell cycle position estimated using these two approaches (Supplemen- To test the robustness of tricycle we performed in- tary Figure S33). For the vast majority of genes, silico experiments to determine the stability of cell the highest expression appear at a closer position cycle position assignments. We evaluated three dif- when we estimate cell cycle position by projecting ferent types of stability wrt. (a) missing genes, (b) the mHippNPC dataset into the mNeurosphere em- sequencing depth, and (c) data preprocessing. bedding. When projecting new data into the cell cycle ref- To examine the impact of preprocessing data erence embedding, it is common that the feature prior to projection, we compared cell cycle posi-

Zheng et al. | 2021 | bioRχiv | Page 11 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Human fetal tissues atlas (n=4062980) a 0.5π

π 0/2π

10 1.5π

0 UMAP-2

-10

-10 0 10 20 UMAP-1

b 1.00 Actively proliferating False True 0.75

0.50

Percentage 0.25

0.00

Eye Lung Liver Spleen Heart Kidney Muscle Thymus Adrenal Placenta Stomach Intestine Pancreas Cerebrum Cerebellum

Figure 6. Application of tricycle on a human fetal tissue atlas. Data is from from Cao et al. (2020). (a) UMAP embedding of human fetal tissue atlas data colored by cell cycle position θ estimated using mNeurosphere reference. (b) The percentage of actively proliferating cells in human fetal tissue atlas. Tissues are ordered decreasingly with the percentage. Tissue and cell type annotations are available in Supplementary Figure S24

tion inferred using data processed with and with- across a high dynamic range of both number of de- out Seurat. Note that when we estimate the cell tectable genes within the feature map as well as cycle space, we use Seurat to align the different bi- depth of the information content in the target cells. ological samples. But this is not done when we project new data using the pre-learned reference. DISCUSSION We observe negligible differences, whether or not Seurat is used (Supplementary Figure S34). Here, we have demonstrated the ability of tricycle These results demonstrate the high sensitivity of to accurately call cell cycle position in 26 datasets tricycle to accurately estimate the cell cycle position across species, cell types, and assay technologies.

Zheng et al. | 2021 | bioRχiv | Page 12 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Tricycle achieves its universality by leveraging key the biological process itself. Here, we use a fixed ref- features of the biology of the cell cycle, the mathe- erence embedding to represent cell cycle, defined matical properties of principal component analysis using the mouse cortical neurosphere dataset. This on periodic functions, and ubiquitous applicability raises the question: is there a single best embed- of transfer learning to enable rapid and efficient use ding? One approach would be to decrease the size across a diverse collection of datasets. Our embed- of the gene list used to construct the embedding. In ding – shaped by the fact that the first dimension support of this, Hsiao et al. (2020) reports that as stratifies G2/M from G1/G0 – ensures that we can little as 6 genes yield good performance. We find a easily interpret cell cycle position between datasets, small set (though larger than 6) of genes with high overcoming one challenge of cell cycle inference. weights (Figure 2), but that making the list too small The stage specific periodicity of cell cycle markers, results in inferior performance. tied to their biological function, implies that the cell Another approach would be to optimize the em- cycle space becomes two dimensional. Our defini- bedding to be as circular as possible. However, de- tion of cell cycle position as the polar angle of a cell spite different shapes, embeddings based on the embedded in the reference cell cycle space, serves cortical neurosphere and the primary hippocampal as a form of internal normalization and helps with NPC datasets result in similar cell cycle position es- the generalizability of tricycle across datasets. De- timates. Both results argue that the robustness of spite this, it is still remarkable that we can project the method is derived from the structure created new data – without data integration or batch effect by the relationship of the genes to each other rather removal – and still get a useful and accurate embed- than the behavior of any individual marker gene. ding of the data into the cell cycle space with min- Thus, so long as the structure of the embedding is imal computational effort. Because the projection driven by the cell cycle, the specific source of the operator is a single low-dimensional linear opera- reference embedding is irrelevant. Here, we use tion, tricycle has excellent scalability and can easily 500 genes as well as a single, clean, dataset to de- be applied to atlas-scale datasets. Thus, tricycle is a fine the cell cycle embedding, and we show that powerful tool for quickly and accurately inferring this achieves excellent generalization performance cell cycle position for single-cell RNA-seq data. without any optimization. While our use of a single, The cell cycle is a major source of transcriptional fixed, reference embedding is a clear advantage to variation in many biological systems. In particular, users, our package contains functions to define and highly studied systems such as developmental and use a custom reference embedding. disease processes rely on proper regulation of the cell cycle. In many single-cell experiments however, METHODS cell cycle is often considered a confounding factor and as such, methods exist to remove this effect Using principal component analysis to recover from the data prior to analysis. We caution against time ordering removing cell cycle progression blindly as it can be intimately intertwined with other sources of varia- We will consider the following statistical model. tion of interest. Taking the mPancreas data as an The mean expression of each gene is modelled as example, there is a clear relationship between the f (θ) = A cos(θ − d ) number of cycling cells and differentiation as the g g g multi-potent ductal cells advance to be terminally here Ag is a gene-specific amplitude and dg is a differentiated alpha and beta cells. If correction for mean-specific displacements (location of the peak). cell cycle progression is warranted, our analysis of In this formulation, the mean function has a sin- the mPancreas data suggests that the common ap- gle peak and is periodic . We have G genes and proach of regressing out principal components of each gene has its own (but not necessarily unique) cell cycle genes may remove biological variation of (Ag, dg). interest. Basic trigonometry yields the identity The success of tricycle’s application using a sin- ( ) = ( − ) gle arbitrary cell cycle embedding raises interesting fg θ Ag cos θ dg questions about the robustness and universality of = Ag cos(dg) cos(θ) + Ag sin(dg) sin(θ)

Zheng et al. | 2021 | bioRχiv | Page 13 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

which we can write as of 0.2 was added. The depictions in Figure 1a,b √ were each one of the realizations of the two differ- fg(θ) = Ag cos(dg) π φ (θ) √ 1 ent cosine functions. t + Ag sin(dg) π φ2(θ) = agφ(θ) For Supplementary Figures S1, S2 and S3 we simulated data using the negative binomial distribu- using the orthonormal functions tion, inspired by the setup in Splatter (Zappia et al., 2017). In addition to a gene-specific ampli- 1 θ 7→ φ(θ) = (φ1(θ), φ2(θ)) = √ (cos(θ), sin(θ)) tude (Ag) and location of the peak (Lg), we also π consider different library size (l), which is an ap- Our derivation is based on Ramsay and Silverman proximate as we still have some cell-to-cell vari- 0 (2005) section 8.4. This section shows that the ance. For a cell, we let λg = Ag cos(θ − Lg) + c, 0 variance-covariance operator is given by with c a constant to ensure positivity of λg. Then 0 0 G 0 the cell mean is λg = l · λg/ ∑ λg. The trended − v(s, θ) = φt(s) G 1CtC φ(θ) cell mean is simulated from a Gamma distribution 2 0 2 as λg ∼ Gamma(1/B , λgB ), with B the biological where the inner matrix (which turns out to deter- coefficient of variation (we fix B as 0.1 in our sim- mine the principal components) is a 2 × 2 matrix ulations). Thus, the counts for gene g is given as equal to yg ∼ Pois(λg). We always simulate a 100 genes times 5000 cells count matrix, with cell timepoint θ 1 t 1 t 1 2 t uniformed distributed between 0 and 2π. We only C C = ∑ agag = ∑ Agπφ(dg)φ(dg) G G g G g varies one of the Lg, Ag and l in Supplementary Fig- 2 ures S1, S2 and S3. Specifically, in Supplementary 1 2 cos (dg) cos(dg) sin(dg) = ∑ Ag 2 Figures S1, we used different number of distinct G cos(dg) sin(dg) sin (dg) g peak locations across 100 genes, and fixed the am- The principal component analysis is given by plitudes (across 100 genes) as 3 and library size as the Eigen-functions and -values of the variance- 2000. In Supplementary Figures S2, we used dif- covariance operator. Such an Eigen-function and ferent numbers of distinct amplitudes across 100 -value pair ξ, ρ takes the form genes, and fixed the number of distinct peak locations (across 100 genes) as 100 and library size as ξ(θ) = btφ(θ) 2000. In Supplementary Figures S3, we changed the library size l, and fixed the number of distinct for a vector b which satisfies peak locations (across 100 genes) as 100 and the amplitudes (across 100 genes) as 3. PCA was per- G−1CtCb = ρb formed on the library size normalized and log2 ie. b, ρ are Eigen-vectors and -values for the transformed matrix after we got the count matrix. −1 t G C C matrix. Specifically, if q1, q2, λ1, λ2 are two such Eigen-vectors- and -values then the two first Generation of mouse primary hippocampal NPC principal components are given by scRNA-Seq dataset √ p Hippocampal neural stem/progenitor cells (NPCs) θ 7→ ξ(θ) = (cos(θ), sin(θ))tq G λ i i were isolated by microdissection from E17 day embryos (offspring of male Kmt2d+/βgeo and female Simulations C57Bl/6J) and cultured on Matrigel as described in For Figure 1 we performed the following simula- Carosso et al. (2019). We verified neuronal lineage tion. 50 realization of a cosine function with a lo- by demonstrating Nestin, Calbindin, and Prox1 ex- cation of 0.2 and an amplitude of 0.5 as well as 50 pression (not shown). Cells were maintained in an realizations of a cosine function with a location of undifferentiated state with growth factor inhibition 1.2 and an amplitude of 1. Each function was evalu- (EGF, FGF2) in Neurobasal media. In a prior publi- +/βgeo ated on an equidistant grid of 1000 points and inde- cation, we have demonstrated that the Kmt2d pendent Gaussian noise with a standard deviation cells exhibit defects in proliferation (Carosso et al.,

Zheng et al. | 2021 | bioRχiv | Page 14 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

2019). Following isolation we collected cells from were created using the Chromium Single-Cell 3’ li- both genotypes at the undifferentiated state (day brary & Gel Bead Kit v2 (10x Genomics) according 0) and then after growth factor removal on days 4, to manufacturer protocol. 7, 10 and 14, capturing cells that were ever more differentiated. sc-RNA-Seq libraries were created Reference genome and mapping index building with a Chromium Single-Cell 3’ library & Gel Bead Kit v2 (10x Genomics) according to manufacturer For mouse, GRCm38 reference genome fasta protocol. Only cells from day 0 are analyzed here. file and primary gene annotation GTF file (v25) were downloaded from GENCODE (https:// www.gencodegenes.org). Similarly, GRCh38 Generation of mouse E14.5 Neurosphere reference genome fasta file and primary gene an- scRNA-Seq dataset notation GFT file(v35) were downloaded for hu- Cortical neurospheres were generated from the man. We built a reference index for use by alevin dissociated telencephalon of embryonic day 14.5 as described by Soneson (2020) using R package (E14.5) wild type embryos. Embryos were har- eisaR(v1.2.0), which we use to quantify both spliced vested and the dorsal telencephalon was dissected and unspliced counts of annotated genes. away and collected in 1X HBSS at RT temperature. The dorsal telencephalon was gently triturated us- scRNA-Seq preprocessing ing p1000 pipette tips and the resultant cell suspension was spun at 500G for 5min and the media was Mouse Neurosphere (mNeurosphere) dataset: aspirated off. The cell pellets were resuspended fastqs files were used to quantify both spliced and in complete neurosphere media 7ml (CNM) and unspliced counts by Alevin (Salmon v1.3.0) with de- plated in ultra-low adherence T25 flasks. CNM is fault settings as described by Soneson (2020). Abun- made from combining 480ml DMEM-F12 with glu- dances matrices were read in by R package tximeta tamine, 1.45g of glucose, 1X N2 supplement, 1X (v1.8.1). The spliced counts were treated as the ex- B27 supplement without retinoic acid, 1x penicil- pression counts. We removed cells with less than lium/streptomycin and 10ng/ml of both epidermal 200 expressed genes, and cells flagged as outliers growth factor (EGF) and basic fibroblast growth fac- (deviating more than triple median absolute devi- tor (bFGF). The cell pellets were cultured for 3-5 ations(MAD) from the median of log2(TotalUMIs), days, or until spheroids have formed. The neuro- log2(number of expressed genes), percentage of mi- spheres were then collected and spun at 100G for tochondrial gene counts, or log10(doublet scores)). 5min and the supernatant was removed. Neuro- The doublet scores were computed using doublet- spheres were resuspeneded in 5ml TrypLE and in- Cells function in R package scran (v1.18.1). All cubated for a maximum of 5min at 37◦C with gen- mitochondrial genes and any genes which were tle trituration every 1.5min with a p1000 until the expressed in less than 20 cells were further ex- neurospheres are mostly a single-cell suspension. cluded from all subsequent analyses. Expression The cells were spun down at 500G for 5min and abundances were then library size normalized and the supernatant was removed. The cells were resus- log2 transformed by function normalizeCounts in pended in 15ml of CNM and gently passed through R package scuttle (v1.0.2). The biological samples a 40uM filter to remove large cell clumps. The resul- were integrated together by Seurat (v3.2.2). We then tant cell suspension was then plated in T75 flasks run PCA on the top 2000 highly variable genes of for another 2-5 days or until spheres begun to have the integrated log2(expression) using the runPCA dark centers. This process was repeated two more function with default parameters, followed by run- times before cells were collected for 10X Genomics ing the runUMAP function on the resulting top single-cell library prep. Before single-cell library 30 principal components with default parameters. preparation, the neurospheres were dissociated as Note that we did not restrict genes to cell cycle described above and passed through a 40uM filter genes in this step, as we would like to see the over- to ensure a single-cell suspension. Approx. 7000 all variation of the data. Cell types were inferred cells were selected from each sample for input to by SingleR package v1.4.0 using built-in MouseR- the scRNA-Seq library prep. sc-RNA-Seq libraries NAseqData dataset as the reference.

Zheng et al. | 2021 | bioRχiv | Page 15 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Mouse primary hippocampal NPC (mHipp- pressed genes), percentage of mitochondrial gene

NPC) dataset: All preprocessing are the same as for counts, or log10(doublet scores)). As the total UMIs the mouse Neurosphere (mNeurosphere) dataset. depend on cell type, we ﬁltered the cells by block- Mouse developing pancreas (mPancreas) ing for each cell type. The doublet scores were com- dataset: We obtained the spliced and unspliced puted using doubletCells function in R package count matrices of the Mouse developing pancreas scran (v1.18.1). All mitochondrial genes and any dataset from the python package scvelo (v0.2.1). genes which were expressed in less than 20 cells The spliced counts were treated as the expression were further excluded from all subsequent analy- counts. We removed cells with less than 200 ses. Expression abundances were then library size expressed genes, and any cells ﬂagged as outliers normalized and log2 transformed by function nor- (deviating more than triple median absolute devia- malizeCounts. We used the cell type annotations

tions(MAD) from the median of log2(TotalUMIs), as the new CellType column in the provided pheno- log2(number of expressed genes), percentage of mi- type ﬁle. The single-cell libraries of the data was tochondrial gene counts, or log10(doublet scores)). generated using 10x Genomics’ Chromium v2 sys- Here, the doublet scores were computed using tem. doubletCells function in R package scran (v1.18.1). HeLa cell lines datasets: The spliced and un- All mitochondrial genes and any genes which spliced count matrices of HeLa Set 1 (HeLa1) and were expressed in less than 20 cells were further HeLa Set 2 (HeLa2) were downloaded from GEO excluded from all subsequent analyses. Expression website with accession number GSE142277 and abundances were then library size normalized and GSE142356. Both datasets were generated by the

log2 transformed by function normalizeCounts same lab under the same protocol, while the se- in R package scuttle (v1.0.2). We run PCA on the quencing depth of Set 2 is only about half that top 500 highly variable genes using the runPCA of Set 1 (Schwabe et al., 2020). For each dataset, function with default parameters, followed by we only used the genes existing in both spliced running the runUMAP function on the resulting and unspliced count matrices. The spliced counts top 30 principal components. When running the were treated as the expression counts. We re- UMAP, we set min dist to 0.5 instead of default moved cells with less than 200 expressed genes, and value 0.01 to replicate the UMAP ﬁgure shown in cells ﬂagged as outliers (deviating more than triple Bergen et al. (2020) with other parameters default. median absolute deviations(MAD) from the me-

Of note, the single-cell libraries of the data was dian of log2(TotalUMIs), log2(number of expressed generated using 10x Genomics’ Chromium v2 genes), percentage of mitochondrial gene counts,

system. or log10(doublet scores)). All mitochondrial genes Mouse Hematopoietic Stem Cell (mHSC) and any genes which were expressed in less than

dataset: We downloaded processed log2 transform 20 cells were further excluded from all subsequent TPM matrix directly from GEO under accession analyses. Expression abundances were then library

number GSE59114 (Kowalczyk et al., 2015). We size normalized and log2 transformed by function only used the cells from C57BL/6 strain, of which normalizeCounts. The single-cell libraries of the contains more cells, as the number of overlapped data was generated using Drop-seq system. genes between xlsx file of C57BL/6 strain and Mouse embryonic stem cell (mESC) DBA/2 strain is too small. Because the data was dataset: The processed count matrix was already processed and filtered, we did not perform downloaded from ArrayExpress website any other processing. Unlike the above mentioned under accession number E-MTAB-2805 dataset, the SMARTer protocol was applied during (https://www.ebi.ac.uk/arrayexpress/ library preparation. experiments/E-MTAB-2805/). We only re- Mouse Retina (mRetina) dataset: This dataset tained 279 cells with log2(counts) greater than 15. is available at https://github.com/gofflab/ The count matrix were library size normalized developing_mouse_retina_scRNASeq. We re- across cells and log2 transformed by function nor- moved cells flagged as outliers (deviating more malizeCounts. The RNA-Seq data was generated than triple median absolute deviations(MAD) from using Fluidigm C1 system in this dataset.

the median of log2(TotalUMIs), log2(number of ex- Human embroyonic stem cells (hESC) dataset:

Zheng et al. | 2021 | bioRχiv | Page 16 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

The processed count matrix was downloaded from stage k, we could calculate the mean expression GEO under accession number GSE64016. We only across genes in the gene list lk for the jth cell as 1 retained FACS sorted cells. The count matrix were mj,k = Σ i E i with E i as the log trans- pk gk∈lk gk,j gk,j 2 library size normalized across cells and log trans- i 2 formed expression value of gene gk and cell j. Then formed by function normalizeCounts. The RNA- we assess how well a gene in a gene list correlates Seq data was generated using Fluidigm C1 system to the mean expression level of that gene list as in this dataset. cgi = cor Egi , mk . For each stage, the gene list Human U-2 OS cells (hU2OS) dataset: The TPM k k is pruned to genes with c i > 0.2. (For the fetal tis- matrix was downloaded from GEO under acces- gk sues dataset, we used c i > 0.15 since the extremely sion number GSE146773. We only retained FACS gk shallowly sequenced data shows less co-expression sorted cells with log2(counts) greater than the 3 times MAD range. Genes which were expressed in patterns and the threshold 0.2 could leave us with less than 20 cell were removed. The left TPM matrix no genes.) We label this pruned new gene list as q L = {g1, g2, ··· , g k } with q the number of genes. were library size normalized across cells and log2 k k k k k transformed by function normalizeCounts. The The stage assignment score for cell j and stage k is RNA-Seq data was generated using SMART-seq2 given as chemistry in this dataset. 1 A = Σ i E i k,j g ∈Lk g ,j Human induced pluripotent stem cells (hiP- qk k k SCs) dataset: The processed FUCCI intensity and The 5-by-n matrix A, of which the number of RNA-seq data was downloaded from https:// columns equals to the number of cells, follows z- github.com/jdblischak/fucci-seq/blob/ score transformations w.r.t. first rows and then master/data/eset-final.rds?raw=true. columns, resulting the 5-by-n matrix Ae = (Aek,j). The preprocessing was described in Hsiao et al. For each cell, we compute the preliminary stage as- (2020). The count matrix were library size nor- signment as sj = arg max{Aek,j}. malized across cells and log2 transformed by k function normalizeCounts. The RNA-Seq data As in the Schwabe et al. (2020), we also apply was generated using Fluidigm C1 system in this two filtering steps. The first filtering, which is ex- dataset. actly the same described by the original paper. We Fetal tissue dataset: We got the loom file con- require sj˜, the stage with the second largest assign- taining gene counts of all tissue from GEO under ment score to be the neighboring stage to sj. This accession number GSE156793. We then processed requirement corresponds to that the 5 stages are and analyzed each tissue separately. For each tis- continuously cyclic processes. As for the second filtering step, the original sue type, cells of which log2(TotalUMIs) is lower than median − 3 × MAD, and genes expressed in method discards all cells with the second largest assignment score A > 0.75. We found the thresh- less than 20 cells were excluded from further analy- esj˜,j ses. The count matrix were library size normalized old of 0.75 to some extent not applicable, as in some datasets it leads to losing 90% of cells. There- across cells and log2 transformed by function nor- malizeCounts. All 4 tissues profiled using single- fore, we use a more adaptive threshold by requiring A − A > cell and 9 tissues profiled using single-nuclei were esj,j esj˜,j 0.3. generated on sci-RNA-seq3 system. If the cell passes two filtering steps, it will be assigned to a stage sj. Otherwise, it would be as- 5 stage cell cycle assignments signed as NA w.r.t. 5 stages of the cell cycle. To mitigate the batch effect on the 5 stage assignments, The 5 stage (G1S, S, G2, G2M, and MG1) cell cy- the assigning procedures are done for each sam- cle assignments were adapted from Schwabe et al. ple/batch separately within each dataset, as recom- (2020) with some modifications. Briefly, the assign- mended in Revelio package (Schwabe et al., 2020). ments use the high expression genes list for each stage, curated by Whitfield et al. (2002). Let k repre- 1 2 pk sent one of the 5 stages, and lk = {gk, gk, ··· , gk } represent the gene list with pk genes. For each

Zheng et al. | 2021 | bioRχiv | Page 17 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

PCA of GO cell cycle genes boundaries 0 and 2π. Hence, we concatenate triple y and triple θ with one period shift to form [y, y, y] For each dataset, we subsetted the preprocessed and [θ − 2π, θ, θ + 2π], on which the loess line is log transformed expression matrix to genes in the 2 fitted. We then only use the fitted value yˆ when θ is GO term cell cycle (GO:0007049). If there are clear between 0 and 2π for visualization purpose. batches defined in the dataset, such as sample or The calculation of the coefficient of determina- batch, we use Seurat3 to remove batch effect. In tion R2 of fitted loess model is given by the case of using Seurat3, we used a library size normalized count matrix as input instead of log 2 2 SSres transformed values. The integration anchors were R = 1 − SStotal searched in the space of top 30 PCs. The output in- n 2 n tegrated matrix is a log2 transformed matrix of top Here SSres = ∑i (yi − yî) and SStotal = ∑i (yi − 500 most variable genes. We then performed prin- y¯)2. Note that instead of using all three copies of cipal component analysis on the gene-wise mean data points, we restrict the calculation of SSres and centered expression matrix. In the case of no batch SStotal on the original data points (the middle copy). exiting, we also restricting to top 500 variable genes The residuals are not the same for the three copies, among GO cell cycle genes. especially at the beginning and end of [−2π, 2π].

Projection of new data to cell cycle embedding The circular correlation coefficient ρ and calculation of cell cycle position θ We use the circular correlation coefficient ρ defined The projection using pre-learned weights matrix by Jammalamadaka and Sarma, 1988 to evaluate during PCA of GO cell cycle genes is straight for- concordance between two polar vectors θ1 and θ2. ward, given by It is defined as follows P = E˜ t · R Σ[sin(θ − µ ) · sin(θ − µ )] ρ = 1 1 2 2 where R represents the o-by-2 reference matrix (o ≤ q Σ[sin2(θ − µ )] · Σ[sin2(θ − µ )] 500), contains the weights of top 2 PCs learned from 1 1 2 2 PCA of GO cell cycle genes; E˜ is a o-by-n matrix, µ and µ represent the mean of θ and θ respec- subsetted from E (the log transformed expression 1 2 1 2 2 tively, and are estimated by maximum likelihood matrix) with genes in the weights matrix and row- estimation under von Mises distribution assump- means centered. The resulting n-by-2 P is the cell tion. cycle embedding projected by the reference. The calculation of the cell cycle position θ is given by Running other methods P θ = arctan( 2 ) For other cell cycle inference methods, we use all P1 default parameters and its built-in reference (if needed) in the following packages: cyclone in scran where Pi is the ith column of matrix P. When mapping the genes between weights matrix and (v1.18.5), CellCycleScoring in Seurat (v4.0.0.9015), the data that we want to project, the Ensemble ID Revelio (v0.1.0), peco (v1.1.21), and reCAT (v1.1.0). is given higher priority than the gene symbol for mouse. For across species projection, we only con- Silhouette index on angular separation distance sider the homologous genes of the same gene sym- of tricycle cell cycle position θ bols. For cyclone and Seurat, we could use Silhouette index to describe consistency between discretized cell Periodic loess cycle stage and tricycle cell cycle position θ. We use As θ is a circular variable bound between 0 to 2π, angular separation distance metric to quantify the fitting a traditional loess model y ∼ θ, with y as any distance between cell i and cell j as response variable, such as the gene expression of d(i, j) = 1 − cos(θi − θj) gene, or log2(TotalUMIs), has problems around the

Zheng et al. | 2021 | bioRχiv | Page 18 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

(i) For a cell i ∈ Sk(i) 3 k ∈ {G1, S, G2M}. The mean Acknowledgements distance between cell i and all other cells assigned Funding: This project has been made possible to the same stage in part by grant number CZF2019-002443 from 1 the Chan Zuckerberg Initiative DAF, an advised a(i) = d(i, j) | ( ) | − ∑ fund of Silicon Valley Community Foundation. Re- Sk i 1 j∈S ,i6=j k(i) search reported in this publication was supported by the National Institute of General Medical Sci- with |S (i) | the cardinality of S (i) . Specially, a(i) = 0 k k ences of the National Institutes of Health under if |S (i) | = 1.The mean distance from cell i to all cells k award R01GM121459. This work was additionally assigned to other stage k0 such that k0 6= k(i) ∧ k0 ∈ supported by awards from the National Science {G1, S, G2M} is Foundation (IOS-1665692), the National Institute of 1 Aging (R01AG066768), and the Maryland Stem Cell b(i) = min d(i, j) 0 (i) |S 0 | ∑ Research Foundation (2016-MSCRFI-2805). GSO is k 6=k k j∈S 0 k supported by postdoctoral fellowship awards from The Silhouette index for cell i is given as the Kavli Neurodiscovery Institute, the Johns Hop- kins Provost Award Program, and the BRAIN Ini- ( b(i)−a(i) | ( ) | > tiative in partnership with the National Institute of max{a(i),b(i)} if Sk i 1 s(i) = Neurological Disorders (K99NS122085). 0 if |Sk(i) | = 1

For any cell i, the Silhouette index s(i) is bound be- Disclaimer: The content is solely the responsibil- tween −1 to 1 ( −1 ≤ s(i) ≤ 1). An s(i) close to 1 ity of the authors and does not necessarily repre- means the cell is consistently assigned to its neigh- sent the official views of the National Institutes of bors w.r.t. its cell cycle position θi. An s(i) close Health or National Science Foundation. to −1 means the cell is closer to the other stage. An s(i) equals to 0 means the cells is on the bor- Conflict of Interest: None declared. der of two stages. The mean Silhouette index on all cells measures how tight the stage assignments BIBLIOGRAPHY are. In this context, this value must be interpreted carefully as it is different from traditional cluster- Alter, O, Brown, PO, and Botstein, D (2000). Singu- ing which might puts hard boundaries and gaps lar value decomposition for genome-wide expression between clusters. As the cell cycle process is contin- data processing and modeling. Proceedings of the uous in nature, there must be cells assigned on the National Academy of Sciences of the United States of boundaries and ambiguous to either stage, and no America 97, 10101–10106. DOI: 10.1073/pnas.97. 18.10101 gap should appear between stages. Thus the mean . Ambros, V (1999). Cell cycle-dependent sequencing of Silhouette index greater than 0 might be appropri- cell fate decisions in Caenorhabditis elegans vulva ate to conclude the agreement between tricycle cell precursor cells. Development 126, 1947–56. cycle position θ and discretized cell cycle stages. Ashburner, M, Ball, CA, Blake, JA, Botstein, D, Butler, H, Michael Cherry, J, Davis, AP, Dolinski, K, Dwight, Data availability SS, Eppig, JT, et al. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25–29. Data reported in this publication is being submitted DOI: 10.1038/75556. to NCBI GEO. Get in touch if you want it sooner! Bastidas-Ponce, A, Tritschler, S, Dony, L, Scheibner, K, Tarquis-Medina, M, Salinno, C, Schirge, S, Burtscher, Software availability I, Bottcher,¨ A, Theis, FJ, et al. (2019). Comprehen- sive single cell mRNA profiling reveals a detailed The tricycle method is implemented in the R pack- roadmap for pancreatic endocrinogenesis. Develop- age tricycle containing the mNeurosphere refer- ment 146, dev173849. DOI: 10.1242/dev.173849. ence, which is available on https://github. com/hansenlab/tricycle. This package is being submitted to Bioconductor.

Zheng et al. | 2021 | bioRχiv | Page 19 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Belluti, S, Basile, V, Benatti, P, Ferrari, E, Marverti, G, Heck, MM, Hittelman, WN, and Earnshaw, WC (1988). and Imbriano, C (2013). Concurrent inhibition of Differential expression of DNA topoisomerases I and enzymatic activity and NF-Y-mediated transcription II during the eukaryotic cell cycle. Proceedings of the of Topoisomerase-IIα by bis-DemethoxyCurcumin in National Academy of Sciences 85, 1086–1090. DOI: cancer cells. Cell Death & Disease 4, e756–e756. DOI: 10.1073/pnas.85.4.1086. 10.1038/cddis.2013.287. Hsiao, CJ, Tung, P, Blischak, JD, Burnett, JE, Barr, KA, Bergen, V, Lange, M, Peidli, S, Wolf, FA, and Theis, FJ Dey, KK, Stephens, M, and Gilad, Y (2020). Charac- (2020). Generalizing RNA velocity to transient cell terizing and inferring quantitative cell cycle phase in states through dynamical modeling. Nature Biotech- single-cell RNA-seq data analysis. Genome Research nology, 1–7. DOI: 10.1038/s41587-020-0591-3. 30, 611–621. DOI: 10.1101/gr.247759.118. Buettner, F, Natarajan, KN, Casale, FP, Proserpio, V, Jammalamadaka, SR and Sarma, Y (1988). A correlation Scialdone, A, Theis, FJ, Teichmann, SA, Marioni, JC, coefficient for angular variables. Statistical theory and and Stegle, O (2015). Computational analysis of cell- data analysis II, 349–364. to-cell heterogeneity in single-cell RNA-sequencing Kowalczyk, MS, Tirosh, I, Heckl, D, Rao, TN, Dixit, A, data reveals hidden subpopulations of cells. Nature Haas, BJ, Schneider, RK, Wagers, AJ, Ebert, BL, and Biotechnology 33, 155–160. DOI: 10 . 1038 / nbt . Regev, A (2015). Single-cell RNA-seq reveals changes 3102. in cell cycle and differentiation programs upon ag- Cao, J, O’Day, DR, Pliner, HA, Kingsley, PD, Deng, M, ing of hematopoietic stem cells. Genome Research 25, Daza, RM, Zager, MA, Aldinger, KA, Blecher-Gonen, 1860–1872. DOI: 10.1101/gr.192237.115. R, Zhang, F, et al. (2020). A human cell atlas of fe- Leng, N, Chu, LF, Barry, C, Li, Y, Choi, J, Li, X, Jiang, P, tal gene expression. Science 370, eaba7721. DOI: 10. Stewart, RM, Thomson, JA, and Kendziorski, C (2015). 1126/science.aba7721. Oscope identifies oscillatory genes in unsynchronized Carosso, GA, Boukas, L, Augustin, JJ, Nguyen, HN, single-cell RNA-seq experiments. Nature Methods 12, Winer, BL, Cannon, GH, Robertson, JD, Zhang, L, 947–950. DOI: 10.1038/nmeth.3549. Hansen, KD, Goff, LA, et al. (2019). Precocious neu- Liu, Z, Lou, H, Xie, K, Wang, H, Chen, N, Aparicio, OM, ronal differentiation and disrupted oxygen responses Zhang, MQ, Jiang, R, and Chen, T (2017). Reconstruct- in Kabuki syndrome. JCI Insight 4. DOI: 10.1172/ ing cell cycle pseudo time-series via single-cell tran- jci.insight.129375. scriptome data. Nature Communications 8, 22. DOI: Cho, RJ, Campbell, MJ, Winzeler, EA, Steinmetz, L, Con- 10.1038/s41467-017-00039-z. way, A, Wodicka, L, Wolfsberg, TG, Gabrielian, AE, Lodish, Berk, Harvey and, Kaiser, Arnold and, Kaiser, Landsman, D, Lockhart, DJ, et al. (1998). A genome- Chris A and, Krieger, Chris and, Scott, Monty and, wide transcriptional analysis of the mitotic cell cycle. Bretscher, Matthew P and, Ploegh, Anthony and, Mat- Molecular Cell 2, 65–73. DOI: 10 . 1016 / s1097 - sudaira, Hidde and, and others, Paul and (2008). “Sec- 2765(00)80114-8. tion 12.3 The Role of Topoisomerases in DNA Repli- Clark, BS, Stein-O’Brien, GL, Shiau, F, Cannon, GH, cation”. Molecular Cell Biology. 4th edition. New York: Davis-Marcisak, E, Sherman, T, Santiago, CP, Hoang, W. H. Freeman. TV, Rajaii, F, James-Esposito, RE, et al. (2019). Single- Mahdessian, D, Cesnik, AJ, Gnann, C, Danielsson, F, Cell RNA-Seq Analysis of Retinal Development Iden- Stenstrom,¨ L, Arif, M, Zhang, C, Le, T, Johansson, F, tifies NFI Factors as Regulating Mitotic Exit and Late- Shutten, R, et al. (2021). Spatiotemporal dissection of Born Cell Specification. Neuron 102, 1111–1126.e5. the cell cycle with single-cell proteogenomics. Nature DOI: 10.1016/j.neuron.2019.04.010. 590, 649–654. DOI: 10.1038/s41586-021-03232- Cumano, A and Godin, I (2007). Ontogeny of the 9. hematopoietic system. Annual Review of Immunol- Marguerat, S and Bahler,¨ J (2012). Coordinating genome ogy 25, 745–785. DOI: 10 . 1146 / annurev . expression with cell size. Trends in Genetics 28, 560– immunol.25.022106.141538. 565. DOI: 10.1016/j.tig.2012.07.003. Dolatabadi, S, Candia, J, Akrap, N, Vannas, C, Tomic, McConnell, S and Kaznowski, C (1991). Cell cycle depen- TT, Losert, W, Landberg, G, Aman,˚ P, and Stahlberg,˚ dence of laminar determination in developing neocor- A (2017). Cell Cycle and Cell Size Dependent Gene tex. Science 254, 282–285. DOI: 10.1126/science. Expression Reveals Distinct Subpopulations at Single- 1925583. Cell Level. Frontiers in Genetics 8, 1. DOI: 10.3389/ McGarry, TJ and Kirschner, MW (1998). Geminin, an fgene.2017.00001. inhibitor of DNA replication, is degraded during mi- Gauthier, NP, Larsen, ME, Wernersson, R, Lichtenberg, tosis. Cell 93, 1043–1053. DOI: 10.1016/s0092- U de, Jensen, LJ, Brunak, S, and Jensen, TS (2008). 8674(00)81209-x. Cyclebase.org–a comprehensive multi-organism on- Ohnuma, Si and Harris, WA (2003). Neurogenesis and line database of cell-cycle experiments. Nucleic Acids the Cell Cycle. Neuron 40, 199–208. DOI: 10.1016/ Research 36, D854–9. DOI: 10.1093/nar/gkm729. s0896-6273(03)00632-9.

Zheng et al. | 2021 | bioRχiv | Page 20 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Ono, T, Losada, A, Hirano, M, Myers, MP, Neuwald, Wei-Shan, H, Amit, VC, and Clarke, DJ (2019). ell cycle AF, and Hirano, T (2003). Differential Contributions regulation of condensin Smc4. Oncotarget 10, 263– of Condensin I and Condensin II to Mitotic Chromo- 276. DOI: 10.18632/oncotarget.26467. some Architecture in Vertebrate Cells. Cell 115, 109– Whitfield, ML, Sherlock, G, Saldanha, AJ, Murray, JI, 121. DOI: 10.1016/s0092-8674(03)00724-4. Ball, CA, Alexander, KE, Matese, JC, Perou, CM, Hurt, Padovan-Merhar, O, Nair, GP, Biaesch, AG, Mayer, A, MM, Brown, PO, et al. (2002). Identification of Genes Scarfone, S, Foley, SW, Wu, AR, Churchman, LS, Periodically Expressed in the Human Cell Cycle and Singh, A, and Raj, A (2015). Single Mammalian Their Expression in Tumors. Molecular Biology of the Cells Compensate for Differences in Cellular Volume Cell 13, 1977–2000. DOI: 10.1091/mbc.02- 02- and DNA Copy Number through Independent Global 0030. Transcriptional Mechanisms. Molecular Cell 58, 339– Zappia, L, Phipson, B, and Oshlack, A (2017). Splat- 352. DOI: 10.1016/j.molcel.2015.03.005. ter: simulation of single-cell RNA sequencing data. Pan, SJ, Kwok, JT, and Yang, Q (2008). “Transfer learning Genome Biology 18, 174. DOI: 10.1186/s13059- via dimensionality reduction”. AAAI. Vol. 8, 677–682. 017-1305-0. Ramsay, H and Silverman, BW (2005). Functional Data Analysis, 2nd ed. Springer Verlag, New York. Sakaue-Sawano, A, Kurokawa, H, Morimura, T, Hanyu, A, Hama, H, Osawa, H, Kashiwagi, S, Fukami, K, Miy- ata, T, Miyoshi, H, et al. (2008). Visualizing Spatiotem- poral Dynamics of Multicellular Cell-Cycle Progres- sion. Cell 132, 487–498. DOI: 10.1016/j.cell. 2007.12.033. Sakaue-Sawano, A, Yo, M, Komatsu, N, Hiratsuka, T, Kogure, T, Hoshida, T, Goshima, N, Matsuda, M, Miyoshi, H, and Miyawaki, A (2017). Genetically En- coded Tools for Optical Dissection of the Mammalian Cell Cycle. Molecular Cell 68, 626–640.e5. DOI: 10. 1016/j.molcel.2017.10.001. Schwabe, D, Formichetti, S, Junker, JP, Falcke, M, and Rajewsky, N (2020). The transcriptome dynamics of single cells during the cell cycle. Molecular Systems Biology 16, e9946. DOI: 10.15252/msb.20209946. Scialdone, A, Natarajan, KN, Saraiva, LR, Proserpio, V, Teichmann, SA, Stegle, O, Marioni, JC, and Buettner, F (2015). Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85, 54– 61. DOI: 10.1016/j.ymeth.2015.06.021. Soneson, C (2020). RNA Velocity with alevin. Spellman, PT, Sherlock, G, Zhang, MQ, Iyer, VR, Anders, K, Eisen, MB, Brown, PO, Botstein, D, and Futcher, B (1998). Comprehensive identification of cell cycle- regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9, 3273–3297. DOI: 10.1091/mbc.9.12. 3273. Stein-O’Brien, GL, Clark, BS, Sherman, T, Zibetti, C, Hu, Q, Sealfon, R, Liu, S, Qian, J, Colantuoni, C, Black- shaw, S, et al. (2019). Decomposing Cell Identity for Transfer Learning across Cellular Measurements, Platforms, Tissues, and Species. Cell Systems 8, 395– 411.e8. DOI: 10.1016/j.cels.2019.04.004. Stuart, T, Butler, A, Hoffman, P, Hafemeister, C, Pa- palexi, E, Mauck 3rd, WM, Hao, Y, Stoeckius, M, Smibert, P, and Satija, R (2019). Comprehensive In- tegration of Single-Cell Data. Cell 177, 1888–1902.e21. DOI: 10.1016/j.cell.2019.05.031.

Zheng et al. | 2021 | bioRχiv | Page 21 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

S UPPLEMENTARY M ATERIALS

• Supplementary Methods.

• Supplementary Figures S1-S34.

• Supplementary Table S1.

Zheng et al. | 2021 | bioRχiv | Page S22 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

SUPPLEMENTARY METHODS

Comparison to existing cell cycle tools Oscope Oscope poses signiﬁcant challenges when run on shallow data (10X, sci-RNA-seq3, or DropSeq), since the method requires quantiﬁcation of a high number of genes in every cell. For this reason, we do not evaluate Oscope.

peco Peco supplies 2 models: one trained on 101 genes and one trained on 5 genes. We used the 101 gene model to be robust to some genes not being measurable in all datasets. We applied peco to all dataset described in Supplementary Table S1, except mRetina and human fetal tissues. For human fetal tissues, we only use a subset of random 2000 cells selected from human fetal intestine data (termed “hfIntestineSub”). We assess the expression dynamics of 4 genes highlighted in Hsiao et al. (2020): CDK1, TOP2A, UBE2C and H4C3 (Supplementary Figure S15); not all datasets have these genes measured in which case they are absent from the ﬁgure. To systematically compare tricycle and peco we use the R2 associated with two different cell cycle positions. This is a comparison between R2 for the same data, but using the same periodic loess approach with two different position variables. For these genes, across all dataset, tricycle cell cycle position has a higher R2 than peco cell cycle position (Supplementary Figure S15). Generally, information-rich Fluidigm C1 data does better with peco compared to information-poor 10X, Drop-Seq.

Revelio Revelio is designed to search for an ellipsoid pattern amongst (rotated) principal components, by ﬁnding the directions having strongest association to 5 discrete cell cycle stages. The output of Revelio is therefore supposed to be an ellipsoid. Revelio by itself does not quantify cell cycle position, although it seems natural to do so by the angle. When we use Revelio, we do indeed observe an ellipsoid in 4 datasets (Supplementary Figure S16a, b, f, g, i and j), but it clearly fails in 3 datasets: mPancreas dataset, mRetina dataset, and mHSC dataset (Supplementary Figure S16c, d, and e). These 3 datasets all have substantial variation which is not associated with cell cycle, such as cell types and differentiation, which we believe explains the non-ellipsoidal embedding. For example, in the mPancreas data some of the differentiation effect is perfectly confounded with cell cycle as the terminally differentiated cells stop cycling. It is not clear that simply rotating the principal components will help us ﬁnd a better cell cycle exclusive dimension. Additionally, Revelio removes any cell which does not have a prediction using the Schwabe stage predictor; in the mRetina dataset only 30k out of more than 90k cells are retained.

reCAT reCAT starts with a principal component analysis of the cell cycle genes, and infers an ordering by solving a traveling salesman problem on this representation. This produces an ordering, but this ordering is hard to interpret because it is not directly linked to cell cycle stage. To address this, the authors provide two different stage predictors. Because the method requires the solution of a traveling salesman problem, it scales poorly. Due to these issues, we only ran reCAT on data with less than 5000 cells. The orderings inferred by reCAT are largely consistent with our cell cycle position θ using mNeurosphere reference for all dataset except the most shallow sequenced hfIntestineSub data (Supplementary Figure S17 last sub-panel in each panel). And the expression dynamics of Top2A on the time series also conﬁrms the appropriate ordering of cells (Supplementary Figure S17 the third sub-panel in each panel). However, the two stage predictors given by reCAT yield different predictions on stages. For example, for the mPancreas dataset (Supplementary Figure S17a), the majority of cells are at S stage based on Bayes

Zheng et al. | 2021 | bioRχiv | Page S23 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

scores but are at G1 stage based on mean scores. Note that the reCAT function requires the user to feed an approximate cutoff position to assign a cell cycle stage based on Bayes scores. However, in all the datasets, we are unable to assign cutoff position to let each stage have its own highest scores interval. Without a useful stage assignment, the ability to make use of the cell orders is substantially restricted as the percentage of each stage is different across dataset.

Cyclone We observe general agreement between the 3 stage predictions of cyclone and tricycle cell cycle position, as the cyclone stages cluster together (Supplementary Figure S18). We note that cyclone assigns very few cells to the S stage. We believe this is caused by the assignment strategy (cells are assigned to S stage if both G1 and G2M scores are below 0.5). To expand on this comparison, we computed silhouette index with a distance deﬁned by the tricycle cell cycle position (Methods). For cyclone, the under- representation of S stage drags down the silhouette index for both G1 and S stages, as cells at S stages are usually mixed with G1 cells, making the mean distance to all cells at G1 stage and to all cells at S stage not that differentiable. We note that cyclone works best on the last two FACS dataset, with one of them (mESC) is the training dataset for cyclone gene list.

Seurat We observe good agreement between the 3 stage predictions of Seurat and tricycle cell cycle position, better than cyclone (Supplementary Figure S19). Compared to cyclone, we have a much higher silhouette index for Seurat; the highest observed mean is 0.74 for the mHSC dataset, which conﬁrms the highly visual agreement between Seurat assignments and tricycle. The main disadvantage of Seurat is the inherent limitation of a 3 stage prediction.

(modified) Schwabe The (modified) Schwabe method assigns cells to 5 different stages. Because of the higher resolution, it is the main predictor we use in our work. By default, the Schwabe method as reported in Schwabe et al. (2020) produces a substantial amount of missing labels, and we have therefore modified the method to address this (Methods); we call this the modified Schwabe predictor. Broadly, the (modified) Schwabe predictor agrees with tricycle, with one specific type of disagreement. These inconsistencies are examined in Supplementary Figure S20. Some cells with a tricycle cell cycle position of 0/2π (G0/G1) are assigned to other stages by modified Schwabe (Supplementary Figure S20 second sub-panel of each row). It is well appreciated that there are many more genes specifically expressed at S, G2 or M stage as compared to G0/G1 stage (Dolatabadi et al., 2017). For each dataset, we plot out the percentage of non-expressed genes over all projection genes in the first sub-panels, which show that the dynamics of percentages are captured by cell cycle position θ using mNeurosphere reference. We plot the percentage of non-expressed genes conditioned on stage and whether tricycle cell cycle position is around 0/2π (Supplementary Figure S20 third sub-panel of each row), which confirm that for each stage there exist two distinct groups. This is reinforced by the different expression patterns of Top2a and Smc4 between flagged cells and non-flagged cells in the last two sub-panels. Thus, we conclude the cells around 0/2π are likely to be wrongly assigned to other stages, probably due to low information content. To assess whether these inconsistencies are caused by our modification of Schwabe, we repeat the comparison using the original Schwabe assignments and arrive at the same conclusion (Supplementary Figure S21). This assessment highlights the large number of missing labels from the original Schwabe predictor, for example only 30k out of 90k cells in the mRetina dataset are labelled.

Zheng et al. | 2021 | bioRχiv | Page S24 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

SUPPLEMENTARY FIGURES

a Gene PCA (nPeak=2) Cell PCA (nPeak=2) Cell PC1 (nPeak=2) Cell PC2 (nPeak=2)

1 0.1 4 5

0.0 0 0 PC2 PC1 PC2 -1 PC2 (1.3%) -0.1 -4 -5

-2 -10 -0.10 -0.05 0.00 0.05 0.10 0.15 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (61.9%) Timepoint θ Timepoint θ

b Gene PCA (nPeak=3) Cell PCA (nPeak=3) Cell PC1 (nPeak=3) Cell PC2 (nPeak=3) 10

0.1 4 5 2.5

0.0 0 0 0.0 PC2 PC1 PC2 PC2 (7.5%) -5 -0.1 -4 -2.5

-10

-0.10 -0.05 0.00 0.05 0.10 0.15 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (65.7%) Timepoint θ Timepoint θ

c Gene PCA (nPeak=5) Cell PCA (nPeak=5) Cell PC1 (nPeak=5) Cell PC2 (nPeak=5) 5.0

0.1 2.5 4 5

0.0 0.0 0 0 PC2 PC1 PC2 -2.5 PC2 (20.6%) -0.1 -4 -5 -5.0

-7.5 -0.10 -0.05 0.00 0.05 0.10 0.15 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (50.8%) Timepoint θ Timepoint θ

d Gene PCA (nPeak=10) Cell PCA (nPeak=10) Cell PC1 (nPeak=10) Cell PC2 (nPeak=10)

0.1 4 4 5

0 0.0 0 0 PC2 PC1 PC2 PC2 (26.3%) -4 -0.1 -4 -5

-0.10 -0.05 0.00 0.05 0.10 0.15 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (47.6%) Timepoint θ Timepoint θ

e Gene PCA (nPeak=50) Cell PCA (nPeak=50) Cell PC1 (nPeak=50) Cell PC2 (nPeak=50)

0.1 4 5 4

0.0 0 0 0 PC2 PC1 PC2 PC2 (31.8%) -0.1 -4 -4 -5

-0.10 -0.05 0.00 0.05 0.10 0.15 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (41.5%) Timepoint θ Timepoint θ

Supplementary Figure S1. Simulations using negative binomial distribution with different number of distinct peak locations. We used different number of distinct peak locations across 100 genes, and ﬁxed the amplitudes (across 100 genes) as 3 and library size as 2000. The number of distinct peak locations across 100 genes is (a) 2, (b) 3, (c) 5, (d) 10, and (e) 50. As long as we have more than 2 distinct peak locations, we get an ellipsoid.

Zheng et al. | 2021 | bioRχiv | Page S25 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a Gene PCA (nAmplitude=1) Cell PCA (nAmplitude=1) Cell PC1 (nAmplitude=1) Cell PC2 (nAmplitude=1) 10 0.2 2.5 2.5 5 0.1

0.0 0 0.0 0.0 PC2 PC1 PC2

-0.1 PC2 (21.1%) -5 -2.5 -2.5 -0.2 -10

-0.2 -0.1 0.0 0.1 0.2 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (22.5%) Timepoint θ Timepoint θ

b Gene PCA (nAmplitude=2) Cell PCA (nAmplitude=2) Cell PC1 (nAmplitude=2) Cell PC2 (nAmplitude=2) 10 0.2 5 5 5 0.1

0.0 0 0 0 PC2 PC1 PC2

-0.1 PC2 (33.8%) -5 -5 -5 -0.2 -10 -10 -0.2 -0.1 0.0 0.1 0.2 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (40.6%) Timepoint θ Timepoint θ

c Gene PCA (nAmplitude=3) Cell PCA (nAmplitude=3) Cell PC1 (nAmplitude=3) Cell PC2 (nAmplitude=3)

10 10 10 0.2

5 5 5 0.1

0.0 0 0 0 PC2 PC1 PC2

-0.1 PC2 (38.2%) -5 -5 -5

-0.2 -10 -10 -10 -0.2 -0.1 0.0 0.1 0.2 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (40.0%) Timepoint θ Timepoint θ

d Gene PCA (nAmplitude=5) Cell PCA (nAmplitude=5) Cell PC1 (nAmplitude=5) Cell PC2 (nAmplitude=5) 10 0.2

5 5 4 0.1

0.0 0 0 0 PC2 PC1 PC2

-0.1 PC2 (29.6%) -5 -5 -4

-0.2 -10 -8 -0.2 -0.1 0.0 0.1 0.2 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (44.5%) Timepoint θ Timepoint θ

e Gene PCA (nAmplitude=10) Cell PCA (nAmplitude=10) Cell PC1 (nAmplitude=10) Cell PC2 (nAmplitude=10) 10 10 0.2

5 5 5 0.1

0.0 0 0 0 PC2 PC1 PC2

-0.1 PC2 (32.4%) -5 -5 -5

-0.2 -10 -10

-0.2 -0.1 0.0 0.1 0.2 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (44.8%) Timepoint θ Timepoint θ

f Gene PCA (nAmplitude=50) Cell PCA (nAmplitude=50) Cell PC1 (nAmplitude=50) Cell PC2 (nAmplitude=50) 10 10 10 0.2

5 5 5 0.1

0 0.0 0 0 PC2 PC1 PC2

-0.1 PC2 (36.3%) -5 -5 -5

-0.2 -10 -10 -10

-0.2 -0.1 0.0 0.1 0.2 -10 -5 0 5 10 0 2 4 6 0 2 4 6 PC1 PC1 (41.9%) Timepoint θ Timepoint θ

Supplementary Figure S2. Simulations using negative binomial distribution with different number of distinct amplitudes. We used different numbers of distinct amplitudes across 100 genes, and ﬁxed the number of distinct peak locations (across 100 genes) as 100 and library size as 2000. The number of distinct amplitudes across 100 genes is (a) 1, (b) 2, (c) 3, (d) 5, (e) 10, and (f) 50. No matter what the number of distinct amplitude(s) is, we always get an ellipsoid.

Zheng et al. | 2021 | bioRχiv | Page S26 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a Gene PCA (library.size=2000) Cell PCA (library.size=2000) Cell PC1 (library.size=2000) Cell PC2 (library.size=2000) 8

0.1 5 5 4

0.0 0 0 0 PC2 PC1 PC2

-0.1 PC2 (35.8%) -4 -5 -5 -0.2 -8 -0.1 0.0 0.1 -5 0 5 0 2 4 6 0 2 4 6 PC1 PC1 (38.2%) Timepoint θ Timepoint θ

b Gene PCA (library.size=500) Cell PCA (library.size=500) Cell PC1 (library.size=500) Cell PC2 (library.size=500)

5 0.1 5 4

0.0 0 0 0 PC2 PC1 PC2

-0.1 PC2 (22.9%) -5 -4 -5 -0.2

-0.1 0.0 0.1 -5 0 5 0 2 4 6 0 2 4 6 PC1 PC1 (23.7%) Timepoint θ Timepoint θ

c Gene PCA (library.size=100) Cell PCA (library.size=100) Cell PC1 (library.size=100) Cell PC2 (library.size=100) 5.0 5.0

0.1 5 2.5 2.5

0.0 0 0.0 0.0 PC2 PC1 PC2 PC2 (6.9%) -0.1 -2.5 -2.5 -5

-0.2 -5.0 -0.1 0.0 0.1 -5 0 5 0 2 4 6 0 2 4 6 PC1 PC1 (10.1%) Timepoint θ Timepoint θ

d Gene PCA (library.size=10) Cell PCA (library.size=10) Cell PC1 (library.size=10) Cell PC2 (library.size=10)

1.0 1.0 0.1 5 0.5 0.5 0.0 0.0 0 PC2 PC1 0.0 PC2

PC2 (1.7%) -0.5 -0.1 -0.5

-5 -1.0 -0.2 -1.0

-0.1 0.0 0.1 -5 0 5 0 2 4 6 0 2 4 6 PC1 PC1 (1.9%) Timepoint θ Timepoint θ

Supplementary Figure S3. Simulations using negative binomial distribution with different library size. We changed the library size l, and fixed the number of distinct peak locations (across 100 genes) as 100 and the amplitudes (across 100 genes) as 3. The library size is (a) 2000, (b) 500, (c) 100, and (d) 10. The range of x-axis and y-axis of the first two sub-panels are fixed across (a)-(d). With library size decreasing, the ellipsoid shrinks to the (0, 0). However, the orders of cell can still be recovered.

Zheng et al. | 2021 | bioRχiv | Page S27 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mNeurosphere (n=12805) b mNeurosphere (n=12805) c mNeurosphere (n=12805)

2.5 2.5 Cell type 2.5 log2(TotalUMIs) aNSCs Sample Astrocytes 16 WT1 NPCs 15 WT2 Oligodendrocytes 14

UMAP-2 0.0 UMAP-2 0.0 UMAP-2 0.0 qNSCs 13 Other

-2.5 -2.5 -2.5

-3 0 3 6 -3 0 3 6 -3 0 3 6 UMAP-1 UMAP-1 UMAP-1

d mNeurosphere (n=12805) e mNeurosphere (n=12805) f mNeurosphere (n=12805)

2.5 2.5 2.5 CC Stage Cyclone SeuratCC G1.S S G1 G1 G2 S S G2.M G2M G2M UMAP-2 0.0 UMAP-2 0.0 UMAP-2 0.0 M.G1 NA

-2.5 -2.5 -2.5

-3 0 3 6 -3 0 3 6 -3 0 3 6 UMAP-1 UMAP-1 UMAP-1

Supplementary Figure S4. UMAPs of the mouse cortical Neurosphere dataset. Scatter plots show the UMAPs of Seurat3 mergerd Neurosphere data colored by (a) sample, (b) cell type inferred by SingleR,

(c) log2(TotalUMIs), (d) inferred cell cycle stage by cyclone, (e) inferred cell cycle stage by Seurat, (f) inferred cell cycle stage by the modiﬁed Schwabe method Schwabe et al. (2020) (See Methods). The UMAP coordinates were computed using the PCA on top 2000 highly variable genes after integreation by Seurat3.

Zheng et al. | 2021 | bioRχiv | Page S28 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mNeurosphere (n=500) b mHippNPC (n=500)

0.2 Tipin Mcm3 Hells Mcm5 Mcm6 Ccnd2 Mcm6 Rpa2 Tipin Mcm4 Pcna Chaf1a Ccne1 Hells Uhrf1 Camk2b Pclaf Lig1 Mcm2 Chaf1b Rif1 Gmnn Topbp1 Dtl Camk1 0.1 Mcm7 Dtl 0.1 Smc2 Nasp E2f1 Ppp3ca

Clspn Ccne2 Nasp Smc4 Fen1 Mcm7 Mdc1

Top2a Incenp Cdk1 Hjurp 0.0 Bub3 Plk4 Mki67 Lmnb1 0.0 Smc2 Ezh2 Spc25 Kif18a Nusap1 Cdkn2c Birc5 Tpx2 Smc4 Dbf4 Prc1 Ccnb2 Top2a Cit Pbk Cdkn3 Ncapd2 Cenpf Racgap1 Lmnb1 -0.1 Knl1 Anln Cenpe Tacc3 Cdca8 Cks2 Knstrn -0.1 Nusap1 Rad21 mHippNPC PC2 weights Kif11 Ube2s Ccnb1 Nde1 mNeurosphere PC2 weights Ube2c Cdc20 -0.2 Cenpa Ckap2 Tacc3 -0.20 -0.15 -0.10 -0.05 0.00 -0.2 -0.1 0.0 mNeurosphere PC1 weights mHippNPC PC1 weights

c PC1 weights of overlapped genes(n=321) d PC2 weights of overlapped genes(n=321)

Mcm6 Topbp1 0.0 Tipin 0.1 Nasp Topbp1 Mdc1 Cdkn2c Bub3 Mcm7 Kif18a Plk4 Ezh2 Bub3 Rad21 Mcm7 Plk4 Mdc1 0.0 Top2a -0.1 Hjurp Tipin Nasp Dbf4 Kif18a Mcm6 Nde1 Cdkn2c Smc2 Dbf4 Ezh2 Cit Cit Tacc3 Rad21 Ncapd2 Smc4 Ckap2 -0.1 Nusap1 Lmnb1 Hjurp Ncapd2 Knl1 Lmnb1 Knl1 Nusap1 Smc4 -0.2 Kif11 Smc2 Kif11 Nde1 mHippNPC PC1 weights PCC = 0.93 mHippNPC PC2 weights Tacc3 Top2a -0.2 Ckap2 PCC = 0.69 -0.15 -0.10 -0.05 0.00 -0.1 0.0 0.1 0.2 mNeurosphere PC1 weights mNeurosphere PC2 weights

Supplementary Figure S5. Weights of PCA on GO cell cycle genes. (a) The weights of top 2 PCs learned from doing PCA on GO cell cycle genes of cortical Neurosphere data. (b) The weights of top 2 PCs learned from doing PCA on GO cell cycle genes of mouse primary hippocampal NPC data. (c) A comparison of the weights on principal component 1 between the cortical neurosphere and hippocampal progenitor datasets. (d) As (c), but for PC2. Genes with high weights (|score| > 0.1 for either vector) are highlighted in red. PCC: Pearson’s Correlation Coefﬁcient.

Zheng et al. | 2021 | bioRχiv | Page S29 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Supplementary Figure S6. Expression dynamics of top ranked genes. Similar to Figure 2d and e, but now showing all overlapped projection genes with absolute weights greater than 0.1 in either PC1 or PC2 of either dataset. Yellow points are cells of mHippNPC data, while blue points are cells of mNeurosphere data. Two loess lines were ﬁtted for two dataset respectively. There is high agreement of the dynamics between datasets.

Zheng et al. | 2021 | bioRχiv | Page S30 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Supplementary Figure S7. Characteristics of expression patterns of the mNeurosphere reference. (a) Heatmap shows the z-scores of 500 projections genes in the mNeurosphere data. Each row represents a gene and each column represents a cell, ordered by the cell cycle position θ from PCA. We also annotate the position of half π as the cells are not uniformed distributed along 0 to 2π. (b) The ﬁtted loess line of z-scores over cell cycle position θ for all 500 projection genes. (c-e) The three different clusters in (b). (c) The cluster of genes with highest z-scores less than 0.5. (d) The cluster of genes with highest z-scores greater than 0.5 and peak position before π This cluster corresponds to high expression genes at G1/S stage. (e) The cluster of genes with highest z-scores greater than 0.5 and peak position after π. This cluster corresponds to high expression genes at G2/M stage.

Zheng et al. | 2021 | bioRχiv | Page S31 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a PCA of Ductal cells (n=895) b PCA of Ngn3LEp cells (n=257) c PCA of Ngn3HEp cells(n=620) 6

3 10 3

0 0 5 PC2 (3%) PC2 (3%) PC2 (2%)

-3 -3 0

-6 -6 -5 0 5 -10 -5 0 -5.0 -2.5 0.0 2.5 PC1 (8%) PC1 (10%) PC1 (3%)

d Ngn3LEp projected by Ductal reference e Ngn3HEp projection by Ductal reference

3 3 CC Stage G1.S S 0 0

G2 Space Dim 2 Space Dim 2 G2.M

M.G1 ductal ductal -3 -3 CC CC

-5 0 -12 -8 -4 0 CCductal Space Dim 1 CCductal Space Dim 1

Supplementary Figure S8. PCA and projections of the mouse developing pancreas data. (a-c) The top 2 PCs of GO cell cycle genes of the the three most multipotent cell types in the mouse developing pancreas data. PCA was performed independently for each cell type. (d) Projection of allNgn3LEP cells of mPancreas data using the learned top 2 PCs weights on GO cell cycle genes of Ductal cells. (e) Projection of allNgn3HEP cells of mPancreas data using the learned top 2 PCs weights on GO cell cycle genes of Ductal cells.

Zheng et al. | 2021 | bioRχiv | Page S32 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mHSC (n=1434) HeLa1 (n=1398) 5.0 20

10 2.5 θ CCStage G1.S 0 θ 0.0 S G2

Space Dim 2 -10 Space Dim 2 G2.M

ns ns M.G1 -2.5 NA CC -20 CC

-30 -5.0 -40 -20 0 20 -8 -4 0 4 CCns Space Dim 1 CCns Space Dim 1 b mHSC Top2A (n=1434) HeLa1 TOP2A (n=1398)

9 4 CC Stage G1.S 6 S G2 2 G2.M M.G1 3 NA (expression of Top2A) (expression of TOP2A) 2 2 log log 0 0 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π CCns Position θ CCns Position θ c mHSC UMAP (n=1434) HeLa1 UMAP (n=1398) 3

2 2 0.5π

0 0 π 0/2π UMAP-2 UMAP-2 -1 -2 -2 1.5π

-5.0 -2.5 0.0 2.5 -5.0 -2.5 0.0 2.5 UMAP-1 UMAP-1

Supplementary Figure S9. A pre-learned rotation matrix learned from proliferating cortical neurospheres enables cell cycle position estimation in other proliferating datasets. This ﬁgure include two other datasets in addition to the four datasets in Figure 4. (a) Different datasets (mouse hematopoietic stem cell and Hela set 1) projected into the cell cycle embedding deﬁned by the cortical neurosphere dataset. Cell cycle position θ is estimated using polar angle. (b) Inferred expression dynamics of Top2A(or TOP2A for human), with a periodic loess line (Methods). (c) UMAP colored by cell cycle position using a circular color scale.

Zheng et al. | 2021 | bioRχiv | Page S33 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mHippNPC Smc4 (n=9188) b mPancreas Smc4 (n=3559) c mRetina Smc4 (n=99260)

CC Stage 4 G1.S 4 S 3 G2 G2.M 3 3 M.G1 NA 2 2 2

(expression of Smc4) 1 (expression of Smc4) (expression of Smc4) 1 2 2 2 log log log 0 0 0 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π CCns Position θ CCns Position θ CCns Position θ

d HeLa2 SMC4 (n=2463) e mHSC Smc4 (n=1434) f HeLa1 SMC4 (n=1398)

9 4 4

2 2 3 (expression of Smc4) (expression of SMC4) (expression of SMC4) 2 2 2 log log log 0 0 0 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π CCns Position θ CCns Position θ CCns Position θ

Supplementary Figure S10. The dynamics of Smc4 expression over cell cycle position θ. Inferred expression dynamics of Smc4(or SMC4 for human) over cell cycle position inferred using cortical neurospheres reference, with a periodic loess line (Methods) for (a) hippocampal NPCs, (b) mouse pancreas, (c) mouse retina, (d) Hela set 2, (e) mouse hematopoietic stem cell, and (f) Hela set 1 data. These data are the same data used in Figure 4 and Supplementary Figure S9.

Zheng et al. | 2021 | bioRχiv | Page S34 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mHippNPC PCA (n=9188) b mPancreas PCA (n=3559) c mHSC PCA (n=1434)

4 5

0 0

PC2 (2%) PC2 (3%) PC2 (2%) CC Stage -5 -10 G1.S -4 S G2 G2.M -10 M.G1

-15 -10 -5 0 5 10 -8 -4 0 4 -20 -10 0 10 PC1 (7%) PC1 (8%) PC1 (7%)

d HeLa1 PCA (n=1398) e HeLa2 PCA (n=2463) f mRetina PCA (n=99260) 5 10 10 5 0

0 0

PC2 (3%) -5 PC2 (2%) -5 PC2 (3%)

-10 -10

-15 -10 -10 0 10 20 -10 -5 0 5 -20 -10 0 10 PC1 (7%) PC1 (5%) PC1 (9%)

Supplementary Figure S11. The top 2 PCs of GO cell cycle genes. The ﬁgure consists top 2 PCs of PCA performed on GO cell cycle genes of each dataset. They serve as companion ﬁgures to Figure 4 and Supplementary Figure S9. Note that the cell cycle progression is hidden by direct PCA on datasets with higher heterogeneity, such as mPancreas and mRetina dataset, while cell cycle progression is visible in other datasets.

Zheng et al. | 2021 | bioRχiv | Page S35 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a hU2OS log2(GMNN) b hU2OS log2(CDT1)

CC Stage 5 G1.S S 2 G2 (expression of CDT1)

(expression of GMNN) 2 2 2 R = 0.00844 G2.M R 2 = 0.0592 M.G1 log log 0 NA 0 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 FUCCI pseudotime FUCCI pseudotime

c hU2OS log2(GMNN) d hU2OS log2(CDT1)

5 2 (expression of CDT1)

(expression of GMNN) 2 2 2 R = 0.0475 R 2 = 0.182 log log 0 0 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π CCns Position θ CCns Position θ

Supplementary Figure S12. Expression dynamics of GMNN and CDT1 on FUCCI pseudotime and tricycle position of hU2OS data. (a) The gene expression of GMNN in hU2OS dataset is stable over the FUCCI pseudotime. (b) The gene expression of CDT1 in hU2OS dataset is stable over the FUCCI pseudotime. (c-d) Similar to (a-b), but we use the cell cycle position θ using mNeurosphere reference as the predictor.

Zheng et al. | 2021 | bioRχiv | Page S36 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a hiPSCs CDK1 (n=888) hiPSCs CDK1 (n=888) b hiPSCs UBE2C (n=888) hiPSCs UBE2C (n=888) 7 7 R 2 = 0.176 R 2 = 0.342 R 2 = 0.45 R 2 = 0.307 6 6

4 4 5 5

CC Stage 4 4 G1.S 2 2 S 3 3

(expression of CDK1) G2 (expression of CDK1) (expression of UBE2C) (expression of UBE2C) 2 2

G2.M 2 2 2 2 log M.G1 log log log 0 NA 0 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π CCns Position θ FUCCI pseudotime CCns Position θ FUCCI pseudotime

c hiPSCs SMC2 (n=888) hiPSCs SMC2 (n=888) d hiPSCs SMC4 (n=888) hiPSCs SMC4 (n=888)

2 2 2 2 R = 0.0482 R = 0.00649 5 R = 0.0956 5 R = 0.0161 4 4

4 4 3 3

2 2 3 3

(expression of SMC2) 1 (expression of SMC2) 1 (expression of SMC4) (expression of SMC4) 2 2 2 2 2 2 log log log log 0 0 1 1 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π CCns Position θ FUCCI pseudotime CCns Position θ FUCCI pseudotime

e hiPSCs UBE2S (n=888) hiPSCs UBE2S (n=888) f hiPSCs CKAP2 (n=888) hiPSCs CKAP2 (n=888) 4 4 R 2 = 0.0531 R 2 = 0.00833 R 2 = 0.212 R 2 = 0.0484 4 4 3 3 3 3

2 2 2 2

1 1

(expression of UBE2S) (expression of UBE2S) (expression of CKAP2) 1 (expression of CKAP2) 1 2 2 2 2 log log log log 0 0 0 0 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π CCns Position θ FUCCI pseudotime CCns Position θ FUCCI pseudotime

g hiPSCs MCM6 (n=888) hiPSCs MCM6 (n=888) h hiPSCs HELLS (n=888) hiPSCs HELLS (n=888)

R 2 = 0.185 R 2 = 0.06 R 2 = 0.0295 R 2 = 0.0138 4 4 6 6

3 3 5 5

2 2 4 4

(expression of MCM6) 1 (expression of MCM6) 1 (expression of HELLS) (expression of HELLS) 2 2 2 2 3 3 log log log log 0 0 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π CCns Position θ FUCCI pseudotime CCns Position θ FUCCI pseudotime

Supplementary Figure S13. Expression dynamics of selected cell cycle genes of hiPSCs dataset. Similar to Figure 5e,f, but now we show more cell cycle related genes. In each panel, the left sub-panel shows the expression of the gene over tricycle cell cycle position θ using mNeurosphere reference, and the right sub-panel over the FUCCI pseudotime inferred by Hsiao et al. (2020). Cells are colored by 5 stage cell cycle representation, inferred using the modiﬁed Schwabe method Schwabe et al. (2020). Periodic loess lines and R2 are added for each sub-panel (Methods).

Zheng et al. | 2021 | bioRχiv | Page S37 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mESC (n=279) b mESC Top2A (n=279) FACS G1 10 S 12.5 G2M θ 10.0 0 Space Dim 2 ns 7.5 (expression of Top2A) CC

-10 2 log 5.0 0 20 40 0π 0.5π 1π 1.5π 2π CCns Space Dim 1 CCns Position θ

c hESC (n=226) d hESC TOP2A (n=226)

10 12.5

5 10.0 θ

0 7.5 Space Dim 2 ns -5 5.0 CC (expression of TOP2A) 2

-10 log 2.5

-10 0 10 0π 0.5π 1π 1.5π 2π CCns Space Dim 1 CCns Position θ

Supplementary Figure S14. Evaluation of tricycle on FACS datasets (a-b) Data from Buettner et al. (2015). (a) The data is projected to the cell cycle embedding deﬁned by the cortical neurosphere dataset. Cells are colored by FACS labels. (b) Expression dynamics of Top2A with a periodic loess line using tricycle cell cycle position estimated by projection in (a). (c-d) Similar to (a,b), but for data from Leng et al. (2015).

Zheng et al. | 2021 | bioRχiv | Page S38 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

mNeurosphere peco (Cdk1) mNeurosphere peco (Ube2C) mNeurosphere peco (Top2A) mHippNPC peco (Top2A) mPancreas peco (Cdk1) 4 CC Stage 4 4 4 G1.S S G2 2 2 2 G2.M 2 2 M.G1 NA 0 0 0 0 0

-2 -2 -2 -2 -2 peco θ R 2 = 0.316 peco θ R 2 = 0.351 peco θ R 2 = 0.13 peco θ R 2 = 0.125 peco θ R 2 = 0.0833

peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 CCns Position θ R = 0.463 CCns Position θ R = 0.562 CCns Position θ R = 0.484 CCns Position θ R = 0.509 CCns Position θ R = 0.298 -4 -4 -4 -4 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π peco θ peco θ peco θ peco θ peco θ

mPancreas peco (Top2A) mHSC peco (Cdk1) mHSC peco (Ube2C) mHSC peco (Top2A) HeLa1 peco (CDK1)

2 2 2 2 2

0 0 0 0 0

-2 -2 -2 -2 -2 peco θ R 2 = 0.0545 peco θ R 2 = 0.121 peco θ R 2 = 0.0487 peco θ R 2 = 0.00724 peco θ R 2 = 0.28

peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 CCns Position θ R = 0.223 CCns Position θ R = 0.245 CCns Position θ R = 0.12 CCns Position θ R = 0.391 CCns Position θ R = 0.335

0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π peco θ peco θ peco θ peco θ peco θ

HeLa1 peco (TOP2A) HeLa2 peco (CDK1) HeLa2 peco (UBE2C) HeLa2 peco (TOP2A) hfIntestineSub peco (CDK1)

2 2 2 2 2

0 0 0 0 0

-2 -2 -2 -2 -2 peco θ R 2 = 0.312 peco θ R 2 = 0.237 peco θ R 2 = 0.305 peco θ R 2 = 0.217 peco θ R 2 = -0.151

peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 CCns Position θ R = 0.482 CCns Position θ R = 0.337 CCns Position θ R = 0.441 CCns Position θ R = 0.372 CCns Position θ R = 0.00558

0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π peco θ peco θ peco θ peco θ peco θ

hfIntestineSub peco (TOP2A) hU2OS peco (CDK1) hU2OS peco (UBE2C) hU2OS peco (TOP2A) hU2OS peco (H4C3)

2 2 2 2 2

0 0 0 0 0

-2 -2 -2 -2 -2 peco θ R 2 = -0.0343 peco θ R 2 = 0.317 peco θ R 2 = 0.381 peco θ R 2 = 0.342 peco θ R 2 = 0.0961

peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 CCns Position θ R = 0.0683 CCns Position θ R = 0.476 CCns Position θ R = 0.62 CCns Position θ R = 0.636 CCns Position θ R = 0.386

0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π peco θ peco θ peco θ peco θ peco θ

hiPSCs peco (CDK1) hiPSCs peco (UBE2C) hiPSCs peco (TOP2A) mESC peco (Cdk1) mESC peco (Ube2C) 3 FACS 3 G1 2 S 2 2 2 2 G2M 1 1

0 0 0 0 0

-1 -1

-2 -2 -2 peco θ R 2 = 0.0159 peco θ R 2 = 0.378 peco θ R 2 = 0.317 -2 peco θ R 2 = 0.227 -2 peco θ R 2 = 0.398

peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 CCns Position θ R = 0.182 CCns Position θ R = 0.423 CCns Position θ R = 0.4 CCns Position θ R = 0.268 CCns Position θ R = 0.499 -3 -3 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π peco θ peco θ peco θ peco θ peco θ

mESC peco (Top2A) hESC peco (CDK1) hESC peco (UBE2C) hESC peco (TOP2A) 3 3 3 3

2 2 2 2

1 1 1 1

0 0 0 0

-1 -1 -1 -1

-2 peco θ R 2 = 0.179 -2 peco θ R 2 = 0.0234 -2 peco θ R 2 = 0.135 -2 peco θ R 2 = 0.157

peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 peco normalized expression 2 CCns Position θ R = 0.386 CCns Position θ R = 0.254 CCns Position θ R = 0.315 CCns Position θ R = 0.283 -3 -3 -3 -3 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π peco θ peco θ peco θ peco θ

Supplementary Figure S15. Expression dynamics of cell cycle genes on peco cell cycle position. We run peco on all dataset described in Supplementary Table S1, except mRetina and human fetal tissues. mRetina data has too many cells, and for human fetal tissues, we only use a subset of random 2000 cells from intestine data (hfIntestineSub). For each data, the expression dynamics of Cdk1, Top2A, Ube2C and H4c3, as long as they exist in the target dataset, over peco inferred θ are plotted out. Note that in this ﬁgure, the y-axis represents the peco normalized expression values, as peco has its own normalization requirement. We annotate the each panel with R2 of loess line calculated on peco inferred θ and R2 of loess line on tricycle inferred θ using mNeurosphere reference (although we have not plotted out the expression dynamics over tricycle inferred θ). Across all datasets and genes, the tricycle inferred θs have greater R2 to peco θ, and are highlighted as red.

Zheng et al. | 2021 | bioRχiv | Page S39 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mNeurosphere Revelio (n=11131) b mHippNPC Revelio (n=7786) c mPancreas Revelio (n=2895) d mRetina Revelio (n=32993) 15 CC Stage 10 G1.S S 10 G2 G2.M 10 M.G1 0 5 0

0 0 Revelio DC-2 Revelio DC-2 Revelio DC-2 -10 Revelio DC-2 -5 -10

-10 -10 0 10 20 -10 0 10 -10 -5 0 5 10 15 -5 0 5 10 15 Revelio DC-1 Revelio DC-1 Revelio DC-1 Revelio DC-1

e mHSC Revelio (n=1171) f HeLa1 Revelio (n=1132) g HeLa2 Revelio (n=1718) h hfIntestineSub Revelio (n=130)

15 10 5 10 4 5

5 0 0 0 0 Revelio DC-2 Revelio DC-2 Revelio DC-2 Revelio DC-2 -5 -5 -4 -5

-10 -20 -10 0 10 -5 0 5 10 -10 -5 0 5 -4 -2 0 2 4 Revelio DC-1 Revelio DC-1 Revelio DC-1 Revelio DC-1

i hU2OS Revelio (n=1010) j hiPSCs Revelio (n=786) k mESC Revelio (n=228) l hESC Revelio (n=203)

40 30 10

20 5 10 20 0 10 0 -5 0 Revelio DC-2 0 Revelio DC-2 Revelio DC-2 Revelio DC-2

-10 -10 -10

-15 -10 -5 0 5 10 -10 -5 0 5 10 -40 -20 0 20 -20 -10 0 10 Revelio DC-1 Revelio DC-1 Revelio DC-1 Revelio DC-1

Supplementary Figure S16. Cell cycle embeddings by Revelio. The cell cycle embedding produced by Revelio for each data. Cells are colored by 5 stage cell cycle representation, inferred using the original Schwabe method Schwabe et al. (2020) as implemented in the Revelio package. Note that all cells without a valid stage assignment (assigned to ”NA”) are removed by the functions in Revelio package.

Zheng et al. | 2021 | bioRχiv | Page S40 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Supplementary Figure S17. Cell cycle stage and order estimations by reCAT. Panels show the cell cycle stage scores and cell orders estimated by reCAT for (a) mPancreas, (b) mHSC, (c) HeLa set 1, (d) HeLa set 2, (e) hfIntestineSub, (f) hU2OS, (g) hiPSCs, (h) mESC, and (i) hESC data. For each data, the ﬁrst sub-panel shows the Bayes scores for G1, S, and G2/M stage over the estimated cell orders(time series t). For each cell, there will be three scores (data points) colored by stage. The second sub-panel shows the mean scores for G1, G1/S, S, G2, G2/M, and M stage over the estimated cell orders. For each cell, there will be six scores (data points) colored by stage. The third sub-panel is the expression dynamic of Top2A(or TOP2A for human) over reCAT estimated cell orders. Each data point is a cell, colored by 5 stage cell cycle representation, inferred using the modiﬁed Schwabe method Schwabe et al. (2020) (except the last two FACS datastes, for which we color the cells by FACS stage.). Note that although reCAT package provide function to assign cell cycle stage, it requires manual input cutoff for Bayes scores. It is unrealistic for us to pick some appropriate cutoffs for most of the datasets presented here. For example, for mPancreas data in (a), we cannot decide which region has the consistent G1 scores. The last sub-panels compares tricycle cell cycle position using mNeurosphere reference and reCAT cell orders.

Zheng et al. | 2021 | bioRχiv | Page S41 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mNeurosphere (n=12805) mNeurosphere Mean:0.32 mNeurosphere Top2A (n=12805) b mHippNPC (n=9188) mHippNPC Mean:0.33 mHippNPC Top2A (n=9188) cyclone 1.0 cyclone 1.0 G1 G1 5 S S G2M G2M 0.5 0.5 G2M 2.5

0 G2M 0.0 0.0 S Space Dim 2 Space Dim 2

ns ns 0.0 S

-5 Silhouette index -0.5 cyclone Silhouette index -0.5 CC CC G1 G1 S G1 -1.0 G2M -1.0 -2.5 -15 -10 -5 0 5 G1(n=10323) S(n=189) G2M(n=2293) 0 2 4 6 -6 -4 -2 0 2 G1(n=7382) S(n=383) G2M(n=1423) 0 2 4 6 CCns Space Dim 1 cyclone log2(expression of Top2A) CCns Space Dim 1 cyclone log2(expression of Top2A)

c mPancreas (n=3559) mPancreas Mean:0.54 mPancreas Top2A (n=3559) d mRetina (n=99260) mRetina Top2A (n=99260) 5.0 1.0

2.5 2.5 0.5 G2M

0.0 NA 0.0 0.0 S

Space Dim 2 Space Dim 2 G2M ns ns -2.5 Silhouette index -0.5 -2.5

CC CC S G1 G1 -5.0 -1.0 -5.0 -8 -4 0 G1(n=3311) S(n=36) G2M(n=212) 0 1 2 3 4 -7.5 -5.0 -2.5 0.0 2.5 0 1 2 3 4 5 CCns Space Dim 1 cyclone log2(expression of Top2A) CCns Space Dim 1 log2(expression of Top2A)

e mHSC (n=1434) mHSC Mean:0.38 mHSC Top2A (n=1434) f HeLa1 (n=1398) HeLa1 Mean:-0.26 HeLa1 TOP2A (n=1398) 1.0 1.0 20

2.5 10 0.5 0.5

0 G2M 0.0 0.0 0.0 G2M

Space Dim 2 -10 Space Dim 2

ns S ns -2.5 Silhouette index -0.5 Silhouette index -0.5 S CC -20 CC

G1 G1 -30 -1.0 -5.0 -1.0 -40 -20 0 G1(n=1255) S(n=72) G2M(n=107) 0 5 10 -8 -4 0 4 G1(n=1094) S(n=28) G2M(n=276) 0 2 4 6 CCns Space Dim 1 cyclone log2(expression of Top2A) CCns Space Dim 1 cyclone log2(expression of TOP2A)

g HeLa2 (n=2463) HeLa2 Mean:-0.011 HeLa2 TOP2A (n=2463) h hfIntestineSub (n=2000) hfIntestineSub Mean:-0.13 hfIntestineSub TOP2A (n=2000) 1.0 1.0 1.0 2.5 0.5 0.5 0.5

G2M 0.0 0.0 0.0 0.0 G2M

Space Dim 2 Space Dim 2 -0.5

ns S ns S Silhouette index -0.5 Silhouette index -0.5

CC -2.5 CC -1.0 G1 G1 -1.0 -1.5 -1.0 -5.0 -2.5 0.0 2.5 5.0 G1(n=1543) S(n=156) G2M(n=764) 0 2 4 -3 -2 -1 0 G1(n=117) S(n=386) G2M(n=1497) 0 1 2 3 CCns Space Dim 1 cyclone log2(expression of TOP2A) CCns Space Dim 1 cyclone log2(expression of TOP2A)

i hU2OS (n=1114) hU2OS Mean:0.052 hU2OS TOP2A (n=1114) j hiPSCs (n=888) hiPSCs Mean:-0.16 hiPSCs TOP2A (n=888) 1.0 1.0

10 2.5 0.5 0.5

0.0 0 0.0 0.0 G2M G2M

Space Dim 2 Space Dim 2 -2.5 ns ns

Silhouette index -0.5 S Silhouette index -0.5 S CC CC -10 -5.0 G1 G1 -1.0 -1.0 0 20 40 60 G1(n=788) S(n=6) G2M(n=320) 0 4 8 12 -6 -3 0 3 G1(n=641) S(n=150) G2M(n=97) 2 4 6 CCns Space Dim 1 cyclone log2(expression of TOP2A) CCns Space Dim 1 cyclone log2(expression of TOP2A)

k mESC (n=279) mESC Mean:0.21 mESC Top2A (n=279) l hESC (n=226) hESC Mean:0.17 hESC TOP2A (n=226) 10 1.0 1.0

5 0.5 0.5 0 0 0.0 0.0 G2M G2M Space Dim 2 Space Dim 2

ns ns -5

-10 Silhouette index -0.5 S Silhouette index -0.5 S CC CC

G1 -10 G1 -1.0 -1.0 0 20 40 G1(n=85) S(n=105) G2M(n=89) 6 9 12 15 -10 0 10 G1(n=83) S(n=75) G2M(n=68) 5 10 CCns Space Dim 1 cyclone log2(expression of Top2A) CCns Space Dim 1 cyclone log2(expression of TOP2A)

Supplementary Figure S18. Comparison between cyclone assigned stages and tricycle cell cycle position using mNeurosphere reference. Each panel describe one data, specifically for (a) mNeurosphere, (c) mHippNPC, (c) mPancreas, (d) mRetina, (e) mHSC, (f) HeLa set 1, (g) HeLa set 2, (h) hfIntestineSub, (i) hU2OS, (j) hiPSCs, (k) mESC, (l) hESC data. For each data, the first sub-panel shows the cell cycle embedding projection by mNeurosphere reference, and each point is a cell, colored by cyclone inferred cell cycle stage. The second sub-panel shows silhouette index computed using angular separation distance of tricycle cell cycle position θ estimated using mNeurosphere reference (Methods), stratified by cyclone inferred cell cycle stage. The mean silhouette index across all cells is given in the title. Boxes indicate 25th and 75th percentiles. Whiskers extend to the largest values no further than 1.5 × interquartile range (IQR) from these percentiles. For mRetina data, the pairwise distance matrix is too big to substantiate, so we could not compute silhouette index. The third sub-panel shows the marginal density of Top2A(or TOP2A for human) expression conditioned on cyclone cell cycle stage.

Zheng et al. | 2021 | bioRχiv | Page S42 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mNeurosphere (n=12805) mNeurosphere Mean:0.73 mNeurosphere Top2A (n=12805) b mHippNPC (n=9188) mHippNPC Mean:0.66 mHippNPC Top2A (n=9188) SeuratCC 1.0 SeuratCC 1.0 G1 G1 5 S S G2M G2M 0.5 G2M 0.5 2.5 G2M

0 0.0 0.0 S S Space Dim 2 Space Dim 2

ns ns 0.0

-5 Silhouette index -0.5 SeuratCC Silhouette index -0.5 CC CC G1 G1 G1 S -1.0 G2M -1.0 -2.5 -15 -10 -5 0 5 G1(n=6371) S(n=3209) G2M(n=3225) 0 2 4 6 -6 -4 -2 0 2 G1(n=5047) S(n=1978) G2M(n=2163) 0 2 4 6 CCns Space Dim 1 SeuratCC log2(expression of Top2A) CCns Space Dim 1 SeuratCC log2(expression of Top2A)

c mPancreas (n=3559) mPancreas Mean:0.55 mPancreas Top2A (n=3559) d mRetina (n=99260) mRetina Top2A (n=99260) 5.0 1.0

2.5 2.5 0.5 G2M G2M 0.0 0.0 0.0 S

Space Dim 2 S Space Dim 2 ns ns -2.5 Silhouette index -0.5 -2.5 CC CC G1 G1

-5.0 -1.0 -5.0 -8 -4 0 G1(n=2305) S(n=917) G2M(n=337) 0 1 2 3 4 -7.5 -5.0 -2.5 0.0 2.5 0 1 2 3 4 5 CCns Space Dim 1 SeuratCC log2(expression of Top2A) CCns Space Dim 1 log2(expression of Top2A)

e mHSC (n=1434) mHSC Mean:0.74 mHSC Top2A (n=1434) f HeLa1 (n=1398) HeLa1 Mean:0.35 HeLa1 TOP2A (n=1398) 1.0 1.0 20

2.5 10 0.5 0.5 G2M G2M 0 0.0 0.0 0.0

Space Dim 2 -10 Space Dim 2 S S ns ns -2.5 Silhouette index -0.5 Silhouette index -0.5 CC -20 CC G1 G1 -30 -1.0 -5.0 -1.0 -40 -20 0 G1(n=942) S(n=382) G2M(n=110) 0 5 10 -8 -4 0 4 G1(n=294) S(n=595) G2M(n=509) 0 2 4 6 CCns Space Dim 1 SeuratCC log2(expression of Top2A) CCns Space Dim 1 SeuratCC log2(expression of TOP2A)

g HeLa2 (n=2463) HeLa2 Mean:0.34 HeLa2 TOP2A (n=2463) h hfIntestineSub (n=2000) hfIntestineSub Mean:0.18 hfIntestineSub TOP2A (n=2000) 1.0 1.0 1.0 2.5 0.5 0.5 0.5 G2M 0.0 0.0 0.0 0.0 G2M

Space Dim 2 S Space Dim 2 -0.5 ns ns S

Silhouette index -0.5 Silhouette index -0.5

CC -2.5 CC -1.0 G1 G1 -1.0 -1.5 -1.0 -5.0 -2.5 0.0 2.5 5.0 G1(n=460) S(n=1112) G2M(n=891) 0 2 4 -3 -2 -1 0 G1(n=996) S(n=526) G2M(n=478) 0 1 2 3 CCns Space Dim 1 SeuratCC log2(expression of TOP2A) CCns Space Dim 1 SeuratCC log2(expression of TOP2A)

i hU2OS (n=1114) hU2OS Mean:0.62 hU2OS TOP2A (n=1114) j hiPSCs (n=888) hiPSCs Mean:0.41 hiPSCs TOP2A (n=888) 1.0 1.0

10 2.5 0.5 0.5

0.0 0 0.0 0.0 G2M G2M

Space Dim 2 Space Dim 2 -2.5

ns ns S

Silhouette index -0.5 S Silhouette index -0.5 CC CC -10 -5.0 G1 G1 -1.0 -1.0 0 20 40 60 G1(n=193) S(n=442) G2M(n=479) 0.0 2.5 5.0 7.5 10.0 -6 -3 0 3 G1(n=178) S(n=360) G2M(n=350) 2 4 6 CCns Space Dim 1 SeuratCC log2(expression of TOP2A) CCns Space Dim 1 SeuratCC log2(expression of TOP2A)

k mESC (n=279) mESC Mean:0.16 mESC Top2A (n=279) l hESC (n=226) hESC Mean:0.19 hESC TOP2A (n=226) 10 1.0 1.0

5 0.5 0.5

0 G2M 0 0.0 0.0 G2M

Space Dim 2 Space Dim 2 S ns ns -5

-10 Silhouette index -0.5 S Silhouette index -0.5 CC CC G1 G1 -10 -1.0 -1.0 0 20 40 G1(n=45) S(n=106) G2M(n=128) 6 9 12 15 -10 0 10 G1(n=34) S(n=64) G2M(n=128) 5 10 CCns Space Dim 1 SeuratCC log2(expression of Top2A) CCns Space Dim 1 SeuratCC log2(expression of TOP2A)

Supplementary Figure S19. Comparison between Seurat assigned stages and tricycle cell cycle position using mNeurosphere reference. Each panel describe one data, specifically for (a) mNeurosphere, (c) mHippNPC, (c) mPancreas, (d) mRetina, (e) mHSC, (f) HeLa set 1, (g) HeLa set 2, (h) hfIntestineSub, (i) hU2OS, (j) hiPSCs, (k) mESC, (l) hESC data. For each data, the first sub-panel shows the cell cycle embedding projection by mNeurosphere reference, and each point is a cell, colored by Seurat inferred cell cycle stage. The second sub-panel shows silhouette index computed using angular separation distance of tricycle cell cycle position θ estimated using mNeurosphere reference (Methods), stratified by Seurat inferred cell cycle stage. The mean silhouette index across all cells is given in the title. Boxes indicate 25th and 75th percentiles. Whiskers extend to the largest values no further than 1.5·IQR from these percentiles. For mRetina data, the pairwise distance matrix is too big to substantiate, so we could not compute silhouette index. The third sub-panel shows the marginal density of Top2A(or TOP2A for human) expression conditioned on Seurat cell cycle stage.

Zheng et al. | 2021 | bioRχiv | Page S43 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mNeurosphere Pct0 (n=12805) mNeurosphere mNeurosphere (n=12805) mNeurosphere Top2A (n=12805) mNeurosphere Smc4 (n=12805) 0.5π 1.129 5 0.8 0.8 4 0.565 Schwabe 4 G1.S 0 3 S 0.6 0.6 3 G2 1π 0π G2.M 2 M.G1

Density 2 NA 0.4 0.4

(expression of Smc4) 1 (expression of Top2A) 1 2 2 Flag log log

Pct of 0 (projection genes) Pct of 0 (projection genes) FALSE 0.2 1.5π 0.2 0 0 TRUE 0π 0.5π 1π 1.5π 2π S S S CCns Position θ G1.S G2 G2.M M.G1 NA G1.S G2 G2.M M.G1 NA G1.S G2 G2.M M.G1 NA (n=2233)(n=783)(n=977)(n=1212)(n=1998)(n=5602) (n=2233)(n=783)(n=977)(n=1212)(n=1998)(n=5602) (n=2233)(n=783)(n=977)(n=1212)(n=1998)(n=5602) b mHippNPC Pct0 (n=9188) mHippNPC mHippNPC (n=9188) mHippNPC Top2A (n=9188) mHippNPC Smc4 (n=9188) 0.5π 0.972 5 4 0.75 0.486 0.75 Schwabe 4 G1.S 0 3 S 3 G2 0.50 1π 0π 0.50 G2.M 2 M.G1 Density 2 NA

0.25 0.25 (expression of Smc4) 1 (expression of Top2A)

1 2 2 Flag log log

Pct of 0 (projection genes) Pct of 0 (projection genes) FALSE 1.5π 0 0 TRUE 0π 0.5π 1π 1.5π 2π S S S CCns Position θ G1.S G2 G2.M M.G1 NA G1.S G2 G2.M M.G1 NA G1.S G2 G2.M M.G1 NA (n=1394)(n=770)(n=704)(n=672)(n=1355)(n=4293) (n=1394)(n=770)(n=704)(n=672)(n=1355)(n=4293) (n=1394)(n=770)(n=704)(n=672)(n=1355)(n=4293) c mPancreas Pct0 (n=3559) mPancreas mPancreas (n=3559) mPancreas Top2A (n=3559) mPancreas Smc4 (n=3559) 0.5π 0.9 0.982 0.9 3 0.491 3 Schwabe G1.S 0 0.8 0.8 S 2 2 G2 1π 0π G2.M 0.7 0.7 M.G1 Density NA 1 1 0.6 0.6 (expression of Smc4) (expression of Top2A) 2 2 Flag log log

Pct of 0 (projection genes) Pct of 0 (projection genes) FALSE 0.5 1.5π 0.5 0 0 TRUE 0π 0.5π 1π 1.5π 2π S S S CCns Position θ G1.S G2 G2.M M.G1 NA G1.S G2 G2.M M.G1 NA G1.S G2 G2.M M.G1 NA (n=364)(n=247)(n=177)(n=229)(n=627)(n=1915) (n=364)(n=247)(n=177)(n=229)(n=627)(n=1915) (n=364)(n=247)(n=177)(n=229)(n=627)(n=1915) d mRetina Pct0 (n=99260) mRetina mRetina (n=99260) mRetina Top2A (n=99260) mRetina Smc4 (n=99260) 1.0 0.5π 1.0 4 1.056 4 0.528 Schwabe G1.S 0 3 0.8 0.8 3 S G2 1π 0π 2 G2.M 2 M.G1 0.6 Density 0.6 NA 1

1 (expression of Smc4) (expression of Top2A) 2 2 Flag log log

Pct of 0 (projection genes) Pct of 0 (projection genes) FALSE 0.4 0.4 1.5π 0 0 TRUE 0π 0.5π 1π 1.5π 2π S S S CCns Position θ G1.S G2 G2.M M.G1 NA G1.S G2 G2.M M.G1 NA G1.S G2 G2.M M.G1 NA (n=16672)(n=9075)(n=6508)(n=7453)(n=10714)(n=48838) (n=16672)(n=9075)(n=6508)(n=7453)(n=10714)(n=48838) (n=16672)(n=9075)(n=6508)(n=7453)(n=10714)(n=48838) e HeLa1 Pct0 (n=1398) HeLa1 HeLa1 (n=1398) HeLa1 TOP2A (n=1398) HeLa1 SMC4 (n=1398) 0.5π 0.921 0.460 Schwabe 0.75 0.75 G1.S 0 4 4 S G2 0.50 1π 0π 0.50 G2.M M.G1 Density 2 2 NA

0.25 0.25 (expression of SMC4) (expression of TOP2A) 2

2 Flag log

Pct of 0 (projection genes) Pct of 0 (projection genes) FALSE log 1.5π 0 0 TRUE 0π 0.5π 1π 1.5π 2π S G2 S G2 S G2 CCns Position θ G1.S G2.M M.G1 NA G1.S G2.M M.G1 NA G1.S G2.M M.G1 NA (n=348)(n=163) (n=84) (n=84) (n=172)(n=547) (n=348)(n=163) (n=84) (n=84) (n=172)(n=547) (n=348)(n=163) (n=84) (n=84) (n=172)(n=547) f HeLa2 Pct0 (n=2463) HeLa2 HeLa2 (n=2463) HeLa2 TOP2A (n=2463) HeLa2 SMC4 (n=2463) 1.0 0.5π 1.0 5 1.011 Schwabe 0.505 4 4 G1.S 0.8 0 0.8 S 3 G2 0.6 1π 0π 0.6 G2.M 2 M.G1 Density 2 NA

0.4 0.4 1 (expression of SMC4) (expression of TOP2A) 2

2 Flag log

Pct of 0 (projection genes) Pct of 0 (projection genes) FALSE log 1.5π 0 0 TRUE 0π 0.5π 1π 1.5π 2π S G2 S G2 S G2 CCns Position θ G1.S G2.M M.G1 NA G1.S G2.M M.G1 NA G1.S G2.M M.G1 NA (n=610)(n=287)(n=143)(n=128)(n=317)(n=978) (n=610)(n=287)(n=143)(n=128)(n=317)(n=978) (n=610)(n=287)(n=143)(n=128)(n=317)(n=978)

Supplementary Figure S20. Comparison between modiﬁed 5 stage assignments and tricycle cell cycle position using mNeurosphere reference. See next page for caption.

Zheng et al. | 2021 | bioRχiv | Page S44 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Supplementary Figure S20. (Continued). Each row or panel contains analysis for a dataset, speciﬁcally (a) for mNeurosphere, (b) for mHippNPC, (c) for mPancreas, (d) for mRetina, (e) for HeLa set 1, (f) for HeLa set 2 data. For each data, the ﬁrst sub-panel shows the dynamics of percentage of non-expressed genes over all overlapped genes with mNeurosphere projection matrix (number of genes with 0 expression divided by the number overlapped genes with mNeurosphere projection matrix) w.r.t. tricycle cell cycle position θ using mNeurosphere reference. Cells are colored by 5 stage assignment. The second panel shows the marginal density of tricycle cell cycle position θ conditioned on 5 stage assignments using von Mises kernel on polar coordinate system. The third sub-panel shows the percentage of non-expressed genes over all overlapped genes with mNeurosphere projection matrix conditioned on 5 stages assignment and whether cells appear in the G1/G0 cluster - θ < 0.25π or θ > 1.5π as boxplots. The forth and the last sub-panel show the expression of Top2A and Smc4 conditioned on 5 stages assignment and whether cells appear in the G1/G0 cluster. Boxes indicate 25th and 75th percentiles. Whiskers extend to the largest values no further than 1.5·IQR from these percentiles.

Zheng et al. | 2021 | bioRχiv | Page S45 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mNeurosphere Pct0 (n=11131) mNeurosphere mNeurosphere (n=11131) mNeurosphere Top2A (n=11131) mNeurosphere Smc4 (n=11131) 0.5π 1.146 5 0.8 0.8 4 0.573 Flag 4 0 FALSE 3 TRUE 0.6 0.6 3 1π 0π 2 Schwabe

Density 2 G1.S 0.4 0.4 S

(expression of Smc4) 1 (expression of Top2A) 1 2 G2 2 G2.M log log Pct of 0 (projection genes) Pct of 0 (projection genes) 0.2 1.5π 0.2 0 0 M.G1 0π 0.5π 1π 1.5π 2π S S S CCns Position θ G1.S G2 G2.M M.G1 G1.S G2 G2.M M.G1 G1.S G2 G2.M M.G1 (n=3032)(n=1339)(n=1696)(n=2050)(n=3014) (n=3032)(n=1339)(n=1696)(n=2050)(n=3014) (n=3032)(n=1339)(n=1696)(n=2050)(n=3014) b mHippNPC Pct0 (n=7786) mHippNPC mHippNPC (n=7786) mHippNPC Top2A (n=7786) mHippNPC Smc4 (n=7786) 0.5π 1.004 5 0.75 0.75 4 0.502 Flag 4 0 FALSE 3 TRUE 0.50 0.50 3 1π 0π 2 Schwabe

Density 2 G1.S 0.25 0.25 S (expression of Smc4) 1 (expression of Top2A)

1 2 G2 2 G2.M log log Pct of 0 (projection genes) 1.5π Pct of 0 (projection genes) 0 0 M.G1 0π 0.5π 1π 1.5π 2π S S S CCns Position θ G1.S G2 G2.M M.G1 G1.S G2 G2.M M.G1 G1.S G2 G2.M M.G1 (n=2048)(n=1201)(n=1236)(n=1188)(n=2113) (n=2048)(n=1201)(n=1236)(n=1188)(n=2113) (n=2048)(n=1201)(n=1236)(n=1188)(n=2113) c mPancreas Pct0 (n=2895) mPancreas mPancreas (n=2895) mPancreas Top2A (n=2895) mPancreas Smc4 (n=2895) 0.5π 0.9 0.995 0.9 3 3 0.497 Flag 0.8 0 0.8 FALSE TRUE 2 2 1π 0π 0.7 0.7 Schwabe Density G1.S 1 1 0.6 0.6 S (expression of Smc4) (expression of Top2A)

2 G2 2 G2.M log log Pct of 0 (projection genes) 0.5 1.5π Pct of 0 (projection genes) 0.5 0 0 M.G1 0π 0.5π 1π 1.5π 2π S S S CCns Position θ G1.S G2 G2.M M.G1 G1.S G2 G2.M M.G1 G1.S G2 G2.M M.G1 (n=663) (n=443) (n=311) (n=414) (n=1064) (n=663) (n=443) (n=311) (n=414) (n=1064) (n=663) (n=443) (n=311) (n=414) (n=1064) d mRetina Pct0 (n=32993) mRetina mRetina (n=32993) mRetina Top2A (n=32993) mRetina Smc4 (n=32993) 0.5π 3 1.125

0.563 3 Flag 0.8 0 0.8 FALSE 2 TRUE 2 1π 0π Schwabe Density 0.6 0.6 1 G1.S 1 S (expression of Smc4) (expression of Top2A)

2 G2 2 G2.M log log Pct of 0 (projection genes) 0.4 1.5π Pct of 0 (projection genes) 0.4 0 0 M.G1 0π 0.5π 1π 1.5π 2π S S S CCns Position θ G1.S G2 G2.M M.G1 G1.S G2 G2.M M.G1 G1.S G2 G2.M M.G1 (n=10245)(n=5598)(n=4696)(n=5773)(n=6681) (n=10245)(n=5598)(n=4696)(n=5773)(n=6681) (n=10245)(n=5598)(n=4696)(n=5773)(n=6681) e HeLa1 Pct0 (n=1132) HeLa1 HeLa1 (n=1132) HeLa1 TOP2A (n=1132) HeLa1 SMC4 (n=1132) 0.5π 5 0.8 1.2 0.8 0.6 4 Flag 0 4 FALSE 0.6 0.6 3 TRUE 1π 0π 0.4 0.4 Schwabe

Density 2 2 G1.S S

(expression of SMC4) 1 (expression of TOP2A) 0.2 0.2 2 G2 2 G2.M log Pct of 0 (projection genes) Pct of 0 (projection genes) 1.5π log 0 0 M.G1 0π 0.5π 1π 1.5π 2π S G2 S G2 S G2 CCns Position θ G1.S G2.M M.G1 G1.S G2.M M.G1 G1.S G2.M M.G1 (n=415) (n=211) (n=121) (n=142) (n=243) (n=415) (n=211) (n=121) (n=142) (n=243) (n=415) (n=211) (n=121) (n=142) (n=243) f HeLa2 Pct0 (n=1718) HeLa2 HeLa2 (n=1718) HeLa2 TOP2A (n=1718) HeLa2 SMC4 (n=1718) 0.9 0.5π 0.9 1.296 4 4 0.8 0.648 0.8 Flag 0 3 3 FALSE 0.7 0.7 TRUE

1π 0π 2 0.6 0.6 2 Schwabe Density G1.S 0.5 0.5 1 1 S (expression of SMC4) (expression of TOP2A)

2 G2 2 0.4 0.4 G2.M log Pct of 0 (projection genes) Pct of 0 (projection genes) 1.5π log 0 0 M.G1 0π 0.5π 1π 1.5π 2π S G2 S G2 S G2 CCns Position θ G1.S G2.M M.G1 G1.S G2.M M.G1 G1.S G2.M M.G1 (n=645) (n=324) (n=219) (n=184) (n=346) (n=645) (n=324) (n=219) (n=184) (n=346) (n=645) (n=324) (n=219) (n=184) (n=346)

Supplementary Figure S21. Comparison between original 5 stage assignments and tricycle cell cycle position using mNeurosphere reference. This ﬁgure shows the exact same data and comparisons as in Supplementary Figure S20, but now we use the original Schwabe method as implemented in the Revelio package (Schwabe et al., 2020). Note that the number of cells in each dataset is decreased as any cell without a valid stage assignment (assigned to ”NA”) is removed by the functions in Revelio package.

Zheng et al. | 2021 | bioRχiv | Page S46 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Benchmarking running time

Method Cyclone Seurat 40 TriCycle

20 Time (mins)

5000 10000 50000 Number of cells

Supplementary Figure S22. Running time comparisons between cyclone, Seurat, and tricycle cell cycle inference We record the elapsing time for each method when running them on 10 random subsets of mRetina data with 5000, 10000, and 50000 cells. For cyclone and Seurat, the time is recorded for the cell cycle stage assignment function. For tricycle, the time is recorded for cell cycle position estimation using mNeurosphere reference. Note that we add jitters to the data points to avoid excessive overlaps.

Number of totalUMIs of human fetal tissues atlas

2e+05 1e+05 50000

10000 5000

1000 500 300 200 Number of totalUMIs (log10 scaled)

Kidney Eye Heart Liver Lung Muscle Spleen Intestine Pancreas Stomach Adrenal Cerebellum Cerebrum Placenta Thymus (n=51650)(n=155386)(n=45653) (n=12106)(n=387771)(n=1092000)(n=1751246)(n=51836)(n=101749)(n=113138)(n=217738)(n=30872) (n=29876) (n=13180) (n=8779)

Supplementary Figure S23. TotalUMIs of human fetal atlas. For each tissue type of the human fetal atlas data (Cao et al., 2020), we show the total UMIs of a cell. The dashed line separates 4 single-cell proﬁled tissues with 11 single-nuclei proﬁled tissues. Boxes indicate 25th and 75th percentiles. Whiskers extend to the largest values no further than 1.5 × interquartile range (IQR) from these percentiles.

Zheng et al. | 2021 | bioRχiv | Page S47 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a Human fetal tissue atalas (n=4062980)

Tissue Cerebrum (n=1751246) Cerebellum (n=1092000) Adrenal (n=387771) Lung (n=217738) 0 Kidney (n=155386) Liver (n=113138) Heart (n=101749)

UMAP-2 Eye (n=51836) Intestine (n=51650) Pancreas (n=45653) Muscle (n=30872) Placenta (n=29876) Spleen (n=13180) Stomach (n=12106) Thymus (n=8779)

-10

-10 0 10 20 UMAP-1

b Human fetal tissue atalas (n=4062980)

Cell Type Excitatory neurons (n=1258818) Inhibitory neurons (n=350759) Adrenocortical cells (n=328920) 10 Astrocytes (n=318105) Granule neurons (n=312675) Purkinje neurons (n=280377) Stromal cells (n=152957) Inhibitory interneurons (n=129890) Bronchiolar and alveolar epithelial cells (n=101249) Metanephric cells (n=90876) Vascular endothelial cells (n=75104) Cardiomyocytes (n=67610) Hepatoblasts (n=58447) Unipolar brush cells (n=52646) Erythroblasts (n=46707) Limbic system neurons (n=45989) Mesangial cells (n=44177) Oligodendrocytes (n=33748) Intestinal epithelial cells (n=24545) Chromaffin cells (n=21055) SLC24A4_PEX5L positive cells (n=19722) 0 Myeloid cells (n=18300) Lymphoid cells (n=16873) SKOR2_NPSR1 positive cells (n=15005) Retinal progenitors and Muller glia (n=14865) UMAP-2 Syncytiotrophoblasts and villous cytotrophoblasts (n=14567) Skeletal muscle cells (n=13516) Photoreceptor cells (n=12398) Acinar cells (n=11388) Ganglion cells (n=10143) Ureteric bud cells (n=9138) Microglia (n=8975) Thymocytes (n=7852) Endocardial cells (n=7398) Smooth muscle cells (n=7344) Amacrine cells (n=7341) Lymphatic endothelial cells (n=7305) Satellite cells (n=5255) Ductal cells (n=4829) -10 Islet endocrine cells (n=4768) STC2_TLX1 positive cells (n=4487) Sympathoblasts (n=3805) Goblet cells (n=3624) Megakaryocytes (n=3498) ENS glia (n=3219) Ciliated epithelial cells (n=2886) ENS neurons (n=2514) Parietal and chief cells (n=2322) Horizontal cells (n=2300) Squamous epithelial cells (n=2238)

-10 0 10 20 UMAP-1

Supplementary Figure S24. Human fetal tissue atlas UMAP embeddings with all tissues Human fetal tissue atlas UMAP embeddings with all tissues, colored by (a) tissue and (b) cell type.

Zheng et al. | 2021 | bioRχiv | Page S48 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a Intestine (n=51650) Intestine (n=51650) Intestine (n=51650) Percentage of actively Intestine (n=51650) Percentage of actively proliferating cells in intestine (n=51650) proliferating cells in intestine (n=51650) 2.0 2.0 2.0 1.00 1.00 2 Actively proliferating FALSE TRUE Actively proliferating FALSE TRUE Cell Type Chromaffin cells (n=599) Day 1.5 1.5 0.75 1.5 0.75 1 ENS glia (n=2087) 72 (n=12911) ENS neurons (n=2154) 74 (n=12481) 0.50 Intestinal epithelial cells (n=24545) 85 (n=14332) 0.50 0 1.0 1.0 Lymphoid cells (n=2144) 1.0 96 (n=5597) Space Dim 2 Myeloid cells (n=2830) Percentage 100 (n=1012) UMAP-2 UMAP-2 0.25 UMAP-2 Percentage ns -1 Smooth muscle cells (n=1149) 112 (n=1700) 0.25 0.5 0.5 Stromal cells (n=13970) 0.5 117 (n=3596) CC Vascular endothelial cells (n=1459) 0.00 125 (n=21) -2 NA 0.00 0.0 0.0 0.0 72 74 85 96 100 112 117 125 ENS glia -5 -4 -3 -2 -1 0 1 Lymphoid(n=599) cells Stromal cellsMyeloid cells ENS neuronsChromaffin cells (n=12911)(n=12481)(n=14332) (n=5597) (n=1012) (n=1700) (n=3596) (n=21) 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 (n=2087)(n=2154)(n=24545)(n=2144)(n=2830)Smooth(n=1149) muscle (n=13970)cells (n=1459) 0.0 0.5 1.0 1.5 Vascular endothelialIntestinal epithelial cells cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

b Kidney (n=155386) Kidney (n=155386) Kidney (n=155386) Percentage of actively Kidney (n=155386) Percentage of actively 2.0 proliferating cells in kidney (n=155386) proliferating cells in kidney (n=155386)

2 1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE 1.6 Cell Type 1.6 Day Erythroblasts (n=250) 0.75 1.5 72 (n=19512) 0.75 Lymphoid cells (n=577) 85 (n=18229) 0 Megakaryocytes (n=67) 0.50 96 (n=6067) 1.2 Mesangial cells (n=44177) 1.2 0.50 100 (n=35992) Metanephric cells (n=90876) Space Dim 2 1.0 Percentage 110 (n=11346) UMAP-2 UMAP-2 0.25 UMAP-2 Myeloid cells (n=3413) Percentage ns 112 (n=49436) 0.25 Stromal cells (n=1026) -2 0.8 0.8 117 (n=9890) CC Ureteric bud cells (n=9138) 0.00 125 (n=4914) Vascular endothelial cells (n=5862) 0.00 0.5

0.4 0.4 72 85 96 100 110 112 117 125 -5.0 -2.5 0.0 Erythroblasts Myeloid cells Stromal cells 0.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5 Metanephric(n=250) cells(n=577)Lymphoid(n=67)Megakaryocytes cells(n=44177)(n=90876)Ureteric(n=3413) bud cells(n=1026)Mesangial(n=9138) cells(n=5862) 0.0 0.5 1.0 1.5 (n=19512)(n=18229) (n=6067) (n=35992)(n=11346)(n=49436) (n=9890) (n=4914) Vascular endothelial cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

c Pancreas (n=45653) Pancreas (n=45653) Pancreas (n=45653) Percentage of actively Pancreas (n=45653) Percentage of actively proliferating cells in pancreas (n=45653) proliferating cells in pancreas (n=45653) 2.0 2.0 2.0 1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE 2 Cell Type Acinar cells (n=11388) 0.75 0.75 1.5 1.5 Ductal cells (n=4829) 1.5 Erythroblasts (n=1261) 0.50 Day 0 Islet endocrine cells (n=4768) 0.50 1.0 1.0 Lymphoid cells (n=6330) 1.0 110 (n=26705) Space Dim 2 Myeloid cells (n=1519) Percentage 117 (n=18948) UMAP-2 UMAP-2 0.25 UMAP-2 Percentage ns Smooth muscle cells (n=2657) 0.25 0.5 Stromal cells (n=6306) CC -2 0.5 0.5 Vascular endothelial cells (n=4804) 0.00 NA 0.00 0.0 0.0 0.0 110 117 Ductal cells Acinar cells -5.0 -2.5 0.0 ErythroblastsLymphoid cells Myeloid cells Stromal cells (n=26705) (n=18948) 0.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 (n=11388)(n=4829)(n=1261)(n=4768)Smooth(n=6330) muscle cells(n=1519)Islet endocrine(n=2657) cells(n=6306)(n=4804) 0.5 1.0 1.5 2.0 Vascular endothelial cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

d Stomach (n=12106) Stomach (n=12106) Stomach (n=12106) Percentage of actively Stomach (n=12106) Percentage of actively 2.5 proliferating cells in stomach (n=12106) proliferating cells in stomach (n=12106)

1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE 2.0 2.0 2.0 Cell Type 1 Ciliated epithelial cells (n=298) 0.75 Goblet cells (n=3624) 0.75 1.5 1.5 1.5 Lymphoid cells (n=436) 0.50 Day MUC13_DMBT1 positive cells (n=792) 96 (n=2256) 0.50 0 Neuroendocrine cells (n=300) 1.0 1.0 1.0 110 (n=6883) Percentage 0.25 Space Dim 2 Parietal and chief cells (n=2322) UMAP-2 UMAP-2 UMAP-2

117 (n=2967) Percentage ns Squamous epithelial cells (n=1956) 0.25 Stromal cells (n=1368) CC 0.5 0.5 0.00 0.5 -1 Vascular endothelial cells (n=310) NA 0.00 0.0 0.0 0.0 96 110 117 Goblet cells Lymphoid(n=298) cells (n=436) (n=792)Stromal(n=300) cells (n=310) -3 -2 -1 0 (n=3624) (n=2322)(n=1956)(n=1368)Neuroendocrine cells (n=2256) (n=6883) (n=2967) 0.0 0.5 1.0 1.5 2.0 2.5 0.0 0.5 1.0 1.5 2.0 Vascular Squamousendothelial epithelialcells cells Ciliated epithelialParietal andcells chief cells 0.0 0.5 1.0 1.5 2.0 MUC13_DMBT1 positive cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

Supplementary Figure S25. Application of tricycle on 4 single-cell profiled human tissues We show one tissue type in each row/panel (a) intestine, (b) kidney, (c) pancreas, and (d) stomach. For each tissue, the cell cycle embedding using mNeurosphere reference is given in the first sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by cell cycle position θ in the second sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by cell type in the third sub-panel, percentage of actively proliferating cells for each cell type in decreasing order in the forth sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by development days in the fifth sub-panel, and percentage of actively proliferating cells for each development day in the last sub-panel.

Zheng et al. | 2021 | bioRχiv | Page S49 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a Adrenal (n=387771) Adrenal (n=387771) Adrenal (n=387771) Percentage of actively Adrenal (n=387771) Percentage of actively proliferating cells in adrenal (n=387771) proliferating cells in adrenal (n=387771) 1.25 2 1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE Cell Type 1.00 1.00 1.00 1 Adrenocortical cells (n=328920) 0.75 Day Chromaffin cells (n=20456) 0.75 89 (n=59307) Erythroblasts (n=2750) 0 0.75 0.75 0.75 94 (n=58944) Megakaryocytes (n=931) 0.50 110 (n=87839) 0.50 Myeloid cells (n=2389) 113 (n=45218) Space Dim 2 Schwann cells (n=981) Percentage -1 UMAP-2 UMAP-2 0.25 UMAP-2 0.50 0.50 0.50 117 (n=63677) Percentage ns Stromal cells (n=10543) 0.25 120 (n=32258) Sympathoblasts (n=3805) CC -2 0.00 122 (n=40528) 0.25 Vascular endothelial cells (n=16202) 0.25 NA 0.25 0.00 -3 89 94 110 113 117 120 122 -5 -4 -3 -2 -1 0 1 ErythroblastsSchwann Stromalcells cells Myeloid cells Sympathoblasts 0.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5 Megakaryocytes(n=328920)Chromaffin(n=20456) cells(n=2750) (n=931) (n=2389) (n=981)(n=10543)Adrenocortical(n=3805) (n=16202)cells 0.0 0.5 1.0 1.5 (n=59307) (n=58944) (n=87839) (n=45218) (n=63677) (n=32258) (n=40528) Vascular endothelial cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

b Cerebellum (n=1092000) Cerebellum (n=1092000) Cerebellum (n=1092000) Percentage of actively Cerebellum (n=1092000) Percentage of actively proliferating cells in cerebellum (n=1092000) proliferating cells in cerebellum (n=1092000) 2.0 2.0 2.0 1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE 2 Cell Type 0.75 1.5 1.5 Astrocytes (n=268809) 1.5 0.75 Granule neurons (n=312675) Day 0 Inhibitory interneurons (n=129890) 0.50 89 (n=119954) Microglia (n=4428) 94 (n=133761) 0.50 1.0 1.0 1.0 Oligodendrocytes (n=16104) 110 (n=289878) Percentage 0.25 Space Dim 2 UMAP-2 UMAP-2 UMAP-2

Purkinje neurons (n=280377) 115 (n=188524) Percentage ns 0.25 -2 SLC24A4_PEX5L positive cells (n=19722) 125 (n=359883) 0.00

CC 0.5 0.5 Unipolar brush cells (n=52646) 0.5 Vascular endothelial cells (n=7349) 0.00

-4 0.0 89 94 110 115 125 0.0 Astrocytes Microglia 0.0 -8 -6 -4 -2 0 (n=268809)(n=312675)Oligodendrocytes(n=129890)(n=4428)(n=16104)Granule(n=280377) neuronsPurkinje(n=19722) neurons(n=52646)Unipolar(n=7349) brush cells 0.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5 Vascular endothelial cells Inhibitory interneurons 0.0 0.5 1.0 1.5 (n=119954) (n=133761) (n=289878) (n=188524) (n=359883) SLC24A4_PEX5L positive cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

c Cerebrum (n=1751246) Cerebrum (n=1751246) Cerebrum (n=1751246) Percentage of actively Cerebrum (n=1751246) Percentage of actively proliferating cells in cerebrum (n=1751246) proliferating cells in cerebrum (n=1751246) 2.5 2.5 1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE 2.0 2.0 2.0 Cell Type 0.75 Astrocytes (n=49115) Day 0.75 Excitatory neurons (n=1258818) 0.0 1.5 1.5 1.5 89 (n=448243) Inhibitory neurons (n=350759) 0.50 110 (n=936397) Limbic system neurons (n=45989) 0.50 115 (n=165032) 1.0 Megakaryocytes (n=36) 1.0 Percentage 0.25 1.0 Space Dim 2 119 (n=16900) UMAP-2 UMAP-2 UMAP-2

Microglia (n=4459) Percentage ns 120 (n=26703) 0.25 -2.5 Oligodendrocytes (n=17644) 122 (n=157971) CC 0.5 0.5 SKOR2_NPSR1 positive cells (n=15005) 0.00 0.5 Vascular endothelial cells (n=9421) 0.00 0.0 0.0 0.0 89 110 115 119 120 122 AstrocytesMicroglia Megakaryocytes (n=36) -8 -6 -4 -2 0 (n=49115)(n=1258818)(n=350759)(n=45989) Oligodendrocytes(n=4459)Excitatory(n=17644)Inhibitory neurons(n=15005) neurons(n=9421) (n=448243) (n=936397) (n=165032) (n=16900) (n=26703) (n=157971) 1 2 0.5 1.0 1.5 2.0 2.5 Vascular endothelial cells Limbic system neurons 0.5 1.0 1.5 2.0 2.5 SKOR2_NPSR1 positive cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

d Eye (n=51836) Eye (n=51836) Eye (n=51836) Percentage of actively Eye (n=51836) Percentage of actively proliferating cells in eye (n=51836) proliferating cells in eye (n=51836) 2.0 2.0 2.0 1.00 1 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE Cell Type Amacrine cells (n=7341) 0.75 1.5 1.5 1.5 0.75 Bipolar cells (n=1207) Day 0 Ganglion cells (n=10143) 0.50 94 (n=13603) Horizontal cells (n=2300) 96 (n=1890) 0.50 1.0 1.0 PDE11A_FAM19A2 positive cells (n=806) 1.0 Percentage 0.25 115 (n=16845)

Space Dim 2 Photoreceptor cells (n=12398) UMAP-2 UMAP-2 UMAP-2

117 (n=12851) Percentage ns -1 Retinal pigment cells (n=704) 0.25 0.00 129 (n=6647) Retinal progenitors and Muller glia (n=14865) CC 0.5 0.5 Stromal cells (n=1049) 0.5 NA 0.00 -2

0.0 Bipolar cells 94 96 115 117 129 AmacrineHorizontal cells Stromalcells(n=806) cells (n=704)Ganglion cells (n=7341)Photoreceptor(n=1207) cells(n=10143)(n=2300) Retinal(n=12398) pigment cells (n=14865)(n=1049) -4 -3 -2 -1 0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 (n=13603) (n=1890) (n=16845) (n=12851) (n=6647) Retinal progenitors and Muller glia PDE11A_FAM19A2 positive cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

e Heart (n=101749) Heart (n=101749) Heart (n=101749) Percentage of actively Heart (n=101749) Percentage of actively proliferating cells in heart (n=101749) proliferating cells in heart (n=101749)

2.0 1.00 Actively proliferating FALSE TRUE 2.0 1.00 Actively proliferating FALSE TRUE 1 2.0 Cell Type

Cardiomyocytes (n=67610) 0.75 Day Endocardial cells (n=7398) 0.75 90 (n=12238) 0 1.5 Epicardial fat cells (n=1891) 1.5 1.5 94 (n=5378) Lymphatic endothelial cells (n=1251) 0.50 110 (n=42276) 0.50 Lymphoid cells (n=1348) 113 (n=4129) Percentage

Space Dim 2 Myeloid cells (n=1040) -1 UMAP-2 UMAP-2 0.25 UMAP-2 115 (n=9084) Percentage ns 1.0 1.0 Smooth muscle cells (n=2036) 1.0 0.25 120 (n=6231) Stromal cells (n=12430) CC 0.00 122 (n=22413) Vascular endothelial cells (n=4911) -2 0.00 0.5 NA 0.5 0.5 90 94 110 113 115 120 122 -4 -3 -2 -1 0 Lymphoid cells CardiomyocytesMyeloid cells Stromal cells 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 (n=67610)Smooth(n=7398) muscle cells(n=1891)(n=1251)(n=1348)(n=1040)Epicardial(n=2036) fat cells(n=12430)Endocardial(n=4911) cells 0.0 0.5 1.0 1.5 (n=12238) (n=5378) (n=42276) (n=4129) (n=9084) (n=6231) (n=22413) Vascular endothelial cells Lymphatic endothelial cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

f Liver (n=113138) Liver (n=113138) Liver (n=113138) Percentage of actively Liver (n=113138) Percentage of actively proliferating cells in liver (n=113138) proliferating cells in liver (n=113138) 2.0 2.0 2 2.0 1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE Cell Type 1 1.5 1.5 Erythroblasts (n=41645) 0.75 1.5 Day 0.75 Hematopoietic stem cells (n=1096) 89 (n=6642) Hepatoblasts (n=58447) 0.50 90 (n=36348) 0 1.0 1.0 Lymphoid cells (n=954) 1.0 94 (n=10177) 0.50 Megakaryocytes (n=1661) 110 (n=27404) Space Dim 2 Percentage UMAP-2 UMAP-2 0.25 UMAP-2 Mesothelial cells (n=246) 115 (n=3412) Percentage ns 0.25 -1 Myeloid cells (n=2766) 120 (n=28725) 0.5 0.5 0.5 CC Stellate cells (n=1743) 0.00 122 (n=430) Vascular endothelial cells (n=4580) -2 0.00 0.0 0.0 0.0 89 90 94 110 115 120 122 -3 -2 -1 0 1 Erythroblasts Myeloid cellsStellate cells Hepatoblasts (n=430) 1 2 0.5 1.0 1.5 2.0 2.5 (n=41645)(n=1096)Lymphoid(n=58447)Megakaryocytes cells(n=954) (n=1661) (n=246) (n=2766)Mesothelial(n=1743) cells(n=4580) 0.5 1.0 1.5 2.0 2.5 (n=6642) (n=36348) (n=10177) (n=27404) (n=3412) (n=28725) Hematopoietic stem cells Vascular endothelial cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

Supplementary Figure S26. Application of tricycle on 11 single-nuclei profiled human tissues Similar to Figure S25, but for 11 tissues with single-nuclei RNA profiled. We show one tissue type in each panel (a) adrenal, (b) cerebellum, (c) cerebrum, (d) eye, (e) heart, (f) liver, (g) lung, (h) muscle, (i) placenta, (j) spleen, and (k) thymus. For each tissue, the cell cycle embedding using mNeurosphere reference is given in the first sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by cell cycle position θ in the second sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by cell type in the third sub-panel, percentage of actively proliferating cells for each cell type in decreasing order in the forth sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by development days in the fifth sub-panel, and percentage of actively proliferating cells for each development day in the last sub-panel.

Zheng et al. | 2021 | bioRχiv | Page S50 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

g Lung (n=217738) Lung (n=217738) Lung (n=217738) Percentage of actively Lung (n=217738) Percentage of actively 2.5 proliferating cells in lung (n=217738) 2.5 proliferating cells in lung (n=217738) 1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE Cell Type 1 2.0 2.0 2 Bronchiolar and alveolar epithelial cells (n=101249) 0.75 Day Ciliated epithelial cells (n=2588) 89 (n=33553) 0.75 0.50 0 1.5 Lymphatic endothelial cells (n=5423) 1.5 94 (n=10105) Lymphoid cells (n=2274) 110 (n=57847) 0.50 0.25 Mesothelial cells (n=1074) Percentage 113 (n=15481) 1 1.0 1.0 Space Dim 2 Myeloid cells (n=2416) 115 (n=25001) UMAP-2 UMAP-2 UMAP-2 Percentage ns -1 Neuroendocrine cells (n=1524) 0.00 120 (n=19788) 0.25 Stromal cells (n=87037) 122 (n=32954) CC 0.5 0.5 Vascular endothelial cells (n=13288) 125 (n=23009) -2 NA 0.00 0 0.0 0.0 Lymphoid cells Stromal cellsMyeloid cells Mesothelial cells 89 94 110 113 115 120 122 125 (n=101249)(n=2588)(n=5423)(n=2274)(n=1074)(n=2416)(n=1524)Neuroendocrine(n=87037)(n=13288) cells Vascular endothelial cells Ciliated epithelial cells -4 -3 -2 -1 0 0.5 1.0 1.5 2.0 2.5 0.5 1.0 1.5 2.0 Lymphatic endothelial cells 0.5 1.0 1.5 2.0 (n=33553)(n=10105)(n=57847)(n=15481)(n=25001)(n=19788)(n=32954)(n=23009) Bronchiolar and alveolar epithelial cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

h Muscle (n=30872) Muscle (n=30872) Muscle (n=30872) Percentage of actively Muscle (n=30872) Percentage of actively 2.0 proliferating cells in muscle (n=30872) proliferating cells in muscle (n=30872) 1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE 1 Cell Type 1.5 1.5 Lymphatic endothelial cells (n=115) 0.75 1.5 Day Lymphoid cells (n=88) 0.75 89 (n=2295) Myeloid cells (n=284) 90 (n=1681) 0 Satellite cells (n=5255) 0.50 94 (n=1768) 0.50 Schwann cells (n=253) 1.0 1.0 1.0 110 (n=14154) Percentage

Space Dim 2 Skeletal muscle cells (n=13368) UMAP-2 UMAP-2 0.25 UMAP-2

115 (n=6977) Percentage ns Smooth muscle cells (n=575) 0.25 120 (n=857) -1 Stromal cells (n=8793) CC 0.00 125 (n=3140) 0.5 Vascular endothelial cells (n=2091) 0.5 NA 0.5 0.00

-2 89 90 94 110 115 120 125 Myeloid cells Stromal cells -3 -2 -1 0 Lymphoid(n=115) Satellitecells(n=88) Schwanncells(n=284) cells (n=253) (n=575) (n=2295) (n=1681) (n=1768) (n=6977) (n=857) (n=3140) 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 (n=5255)Smooth muscle (n=13368)cells (n=8793)Skeletal(n=2091) muscle cells 0.0 0.5 1.0 1.5 2.0 (n=14154) Vascular endothelial cells Lymphatic endothelial cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

i Placenta (n=29876) Placenta (n=29876) Placenta (n=29876) Percentage of actively Placenta (n=29876) Percentage of actively proliferating cells in placenta (n=29876) proliferating cells in placenta (n=29876) 2.0 2.0 1 2.0 1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE Cell Type 0.75 Extravillous trophoblasts (n=1535) Day 1.5 1.5 0.75 1.5 IGFBP1_DKK1 positive cells (n=701) 0.50 89 (n=3470) 0 Lymphoid cells (n=465) 90 (n=3122) Myeloid cells (n=341) 0.25 94 (n=5675)

Percentage 0.50 1.0 1.0 Smooth muscle cells (n=882) 1.0 110 (n=7744) 0.00

Space Dim 2 Stromal cells (n=8379) 113 (n=123) UMAP-2 UMAP-2 UMAP-2 Percentage ns Syncytiotrophoblasts and villous cytotrophoblasts (n=14567) 115 (n=1611) 0.25 -1 0.5 Trophoblast giant cells (n=1137) 120 (n=5774)

CC 0.5 0.5 Vascular endothelial cells (n=1464) 122 (n=2357) NA 0.00 0.0 Myeloid cellsStromal cells 0.0 (n=1535)Lymphoid(n=701) cells(n=465) (n=341) (n=882) (n=8379)(n=14567)(n=1137)(n=1464) 0.0 Smooth muscle cells ExtravillousTrophoblast trophoblasts giant cells Vascular endothelial cells IGFBP1_DKK1 positive cells 89 90 94 110 113 115 120 122 -3 -2 -1 0 0.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 (n=3470) (n=3122) (n=5675) (n=7744) (n=123) (n=1611) (n=5774) (n=2357) Syncytiotrophoblasts and villous cytotrophoblasts CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

j Spleen (n=13180) Spleen (n=13180) Spleen (n=13180) Percentage of actively Spleen (n=13180) Percentage of actively proliferating cells in spleen (n=13180) proliferating cells in spleen (n=13180)

1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE 1.5 1.5 1 1.5 Cell Type 0.75 AFP_ALB positive cells (n=94) Day 0.75 Erythroblasts (n=439) 94 (n=722) 0 Lymphoid cells (n=1624) 1.0 1.0 0.50 1.0 110 (n=3249) Megakaryocytes (n=95) 0.50 113 (n=790) Mesothelial cells (n=258) Percentage Space Dim 2 115 (n=6088) UMAP-2 UMAP-2 0.25 UMAP-2

Myeloid cells (n=1178) Percentage ns -1 117 (n=1304) 0.25 0.5 0.5 STC2_TLX1 positive cells (n=4487) 0.5 122 (n=1027) CC Stromal cells (n=1944) 0.00 Vascular endothelial cells (n=3061) 0.00 -2 0.0 0.0 0.0 94 110 113 115 117 122 Erythroblasts Stromal cellsMyeloid cells -3 -2 -1 0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 (n=94)Lymphoid(n=439) cells(n=1624) (n=95)Megakaryocytes(n=258) (n=1178)(n=4487)Mesothelial(n=1944) cells(n=3061) 0.5 1.0 1.5 2.0 (n=722) (n=3249) (n=790) (n=6088) (n=1304) (n=1027) STC2_TLX1Vascular positive endothelial cells cells AFP_ALB positive cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

k Thymus (n=8779) Thymus (n=8779) Thymus (n=8779) Percentage of actively Thymus (n=8779) Percentage of actively 2.5 proliferating cells in thymus (n=8779) proliferating cells in thymus (n=8779) 1.00 Actively proliferating FALSE TRUE 1.00 Actively proliferating FALSE TRUE 1.0 2.0 2.0 2.0 0.75 0.75 0.5 Cell Type Day Antigen presenting cells (n=465) 0.50 110 (n=31) 0.0 Stromal cells (n=112) 0.50 1.5 1.5 1.5 113 (n=6310) Thymic epithelial cells (n=259) Space Dim 2 Percentage 115 (n=2375) UMAP-2 UMAP-2 0.25 UMAP-2

Thymocytes (n=7852) Percentage ns -0.5 117 (n=63) Vascular endothelial cells (n=91) 0.25

CC 1.0 1.0 0.00 1.0 -1.0 0.00

-1.5 0.5 110 113 115 117 Thymocytes Stromal cells -2 -1 0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 (n=465) (n=112) (n=259) (n=7852) (n=91) 0.0 0.5 1.0 1.5 2.0 (n=31) (n=6310) (n=2375) (n=63) Antigen presenting cells Vascular endothelialThymic cells epithelial cells CCns Space Dim 1 UMAP-1 UMAP-1 UMAP-1 Development day

Supplementary Figure S26. (Continued) with (g) lung, (h) muscle, (i) placenta, (j) spleen, and (k) thymus.

Zheng et al. | 2021 | bioRχiv | Page S51 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a Human fetal Thymus (n=8779) b Human fetal Thymus (n=8779) 2.5 CC Stage G1.S S 2.0 G2 2.0 G2.M M.G1 NA 1.5 1.5 UMAP-2 UMAP-2

1.0 1.0

0.5 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 UMAP-1 UMAP-1

c Human fetal Thymus (n=8779) d Human fetal Thymus (n=8779) cyclone SeuratCC G1 G1 S S 2.0 G2M 2.0 G2M

1.5 1.5 UMAP-2 UMAP-2

1.0 1.0

0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 UMAP-1 UMAP-1

Supplementary Figure S27. Human fetal thymus UMAPs colored by cell cycle position or stage. (a) Same as Supplementary Figure S26k second sub-panel, which shows the UMAP embeddings of human fetal thymus, colored by cell cycle position θ. (b) Same UMAP embedding as in (a), but colored by 5 stage cell cycle representation, inferred using the modiﬁed Schwabe method from Schwabe et al. (2020). (c) Same UMAP embedding as in (a), but colored by 3 stage cell cycle representation, inferred by cyclone (Scialdone et al., 2015). (d) Same UMAP embedding as in (a), but colored by 3 stage cell cycle representation, inferred by Seurat (Stuart et al., 2019).

mNeurosphere projected by mNeurosphere mHippNPC projected by mNeurosphere a (384 common genes) b (384 common genes)

4 CC Stage G1.S S G2 2 G2.M 2.5 M.G1

0 Space Dim 2 Space Dim 2 0.0 ns ns CC CC -2 -2.5 -5.0 -2.5 0.0 2.5 -4 -2 0 2 CCns Space Dim 1 CCns Space Dim 1

Supplementary Figure S28. Projection using the exact same genes on two datasets. This ﬁgure shows cell cycle embeddings for (a) mNeurosphere and (b) mHippNPC dataset using the subset mNeurosphere reference restricted to 384 genes existing in both datasets. Cells are colored by 5 stage cell cycle representation, inferred using the modiﬁed Schwabe method Schwabe et al. (2020).

Zheng et al. | 2021 | bioRχiv | Page S52 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mNeurosphere (N.genes=400) mNeurosphere (N.genes=300) mNeurosphere (N.genes=200) mNeurosphere (N.genes=100) mNeurosphere (N.genes=50) CC Stage 4 2 1.0 G1.S 4 S G2 3 1 0.5 2 G2.M M.G1 0 0.0 0 0 0

Space Dim 2 Space Dim 2 Space Dim 2 Space Dim 2 -1 Space Dim 2

ns ns ns ns ns -0.5

CC CC -3 CC CC CC -4 -2 -2 -1.0

-3 -10 -5 0 5 -10 -5 0 5 -7.5 -5.0 -2.5 0.0 2.5 -3 -2 -1 0 1 -2 -1 0 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1

b mNeurosphere (N.genes=400) mNeurosphere (N.genes=300) mNeurosphere (N.genes=200) mNeurosphere (N.genes=100) mNeurosphere (N.genes=50) 2π 2π 2π 2π 2π Circular correlation Circular correlation Circular correlation Circular correlation Circular correlation 1.5π ρ=0.990 1.5π ρ=0.983 1.5π ρ=0.968 1.5π ρ=0.889 1.5π ρ=0.753

1π 1π 1π 1π 1π

0.5π 0.5π 0.5π 0.5π 0.5π Position θ (N.genes=50) Position θ (N.genes=400) Position θ (N.genes=300) Position θ (N.genes=200) Position θ (N.genes=100) ns ns ns ns ns CC

CC 0π CC 0π CC 0π CC 0π 0π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π CCns Position θ (N.genes=500) CCns Position θ (N.genes=500) CCns Position θ (N.genes=500) CCns Position θ (N.genes=500) CCns Position θ (N.genes=500)

Supplementary Figure S29. Examples of mNeurosphere dataset projections with randomly sub-sampled projection matrices. Each column represents an example of a projection using the sub-sampled genes from original 500 projection genes. From left to the right, the numbers of genes are 400, 300, 200, 100, and 50. (a) The projected cell cycle embedding using the sub-sampled projection matrix. Cells are colored by 5 stage cell cycle representation, inferred using the modiﬁed Schwabe method from Schwabe et al. (2020). (b) Comparisons of cell cycle positions θ estimated using the full 500 projection matrix and using the sub-sampled projection matrix. The circular correlation ρ is given in the ﬁgure.

Tolerance of removing genes from the mNeuroshoere reference 1

0.75

0.5

0.25 Circular correlation ρ

0 450 400 350 300 250 200 150 100 50 Number of genes retianed in the mNeurosphere reference

Supplementary Figure S30. Stability assessment with projection genes missing. This ﬁgure shows comprehensive assessments as complement to Supplementary Figure S29. For each target number of genes retained in the mNeurosphere reference matrix, we randomly sampled different genes 30 times. For each run, the circular correlation coefﬁcient ρ was calculated between θ from projection using the full reference matrix and θ from projection using sub-sampled reference.

Zheng et al. | 2021 | bioRχiv | Page S53 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a mHippNPC b mHippNPC mHippNPC mHippNPC mHippNPC mHippNPC (original; lib.size median:9996) (downsampled; lib.size median:7996) (downsampled; lib.size median:6000) (downsampled; lib.size median:4001) (downsampled; lib.size median:2000) (downsampled; lib.size median:999) 4 CC Stage 2 3 G1.S 1.0 S G2 2.5 2 2 1 2.5 G2.M 0.5 M.G1 1 0 0.0 Space Dim 2 0.0 Space Dim 2 0.0 Space Dim 2 0 Space Dim 2 0 Space Dim 2 Space Dim 2 ns ns ns ns ns ns -0.5 CC CC CC CC CC CC -1 -1 -2 -2.5 -2.5 -1.0 -4 -2 0 2 -4 -2 0 2 -5.0 -2.5 0.0 2.5 -4 -2 0 2 -2 -1 0 1 -2 -1 0 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1

c mHippNPC d mHippNPC mHippNPC mHippNPC mHippNPC mHippNPC (multiple lib.size) (downsampled; lib.size median:7996) (downsampled; lib.size median:6000) (downsampled; lib.size median:4001) (downsampled; lib.size median:2000) (downsampled; lib.size median:999) lib.s median 2π 2π 2π 2π 2π 9996 7996 Circular correlation Circular correlation Circular correlation Circular correlation Circular correlation 6000 1.5π 1.5π 1.5π 1.5π 1.5π 2.5 ρ=0.974 ρ=0.946 ρ=0.902 ρ=0.788 ρ=0.649 4001 2000 999 1π 1π 1π 1π 1π

Space Dim 2 0.0 ns 0.5π 0.5π 0.5π 0.5π 0.5π CC Position θ (downsampled) Position θ (downsampled) Position θ (downsampled) Position θ (downsampled) Position θ (downsampled) ns ns ns ns ns -2.5 0π 0π 0π 0π 0π CC CC CC CC CC -6 -4 -2 0 2 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π 0π 0.5π 1π 1.5π 2π CCns Space Dim 1 CCns Position θ (original) CCns Position θ (original) CCns Position θ (original) CCns Position θ (original) CCns Position θ (original)

e mHippNPC f mHippNPC mHippNPC mHippNPC mHippNPC mHippNPC (original; lib.size median:9996) (downsampled; lib.size median:7996) (downsampled; lib.size median:6000) (downsampled; lib.size median:4001) (downsampled; lib.size median:2000) (downsampled; lib.size median:999) 4 5.0 log2(lib.size) log2(lib.size) log2(lib.size) log2(lib.size) 2 log2(lib.size) log2(lib.size) 15 14 13 16 15 3 15 14 13 1.0 12 15 14 14 13 12 11 14 13 2.5 13 2 2 12 1 11 10 2.5 13 12 0.5 12 11 10 9 12 11 11 1 10 9 8 0 0.0 Space Dim 2 0.0 Space Dim 2 0.0 Space Dim 2 0 Space Dim 2 0 Space Dim 2 Space Dim 2 ns ns ns ns ns ns -0.5 CC CC CC CC CC CC -1 -1 -2.5 -2 -2.5 -1.0 -6 -4 -2 0 2 -6 -4 -2 0 2 -5.0 -2.5 0.0 2.5 -4 -2 0 2 -2 -1 0 1 -2 -1 0 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1 CCns Space Dim 1

Supplementary Figure S31. Examples of projections on downsampled mHippNPC dataset. (a) The cell cycle embedding projection using the mNeurosphere reference on original mHippNPC data. (b) Each sub-panel represents the same projection as in (a), but the mHippNPC is downsampled to the 80%, 60%, 40%, 20%, and 10% of its original library size (corresponding to median of library size is given in the panel title). Note that the ranges of both x-axis and y-axis are different across sub-panels. (c) By overlaying (a) and all sub-panels of (b), it shows the shrinkage of projections with library size decreasing. (d) Comparisons of cell cycle positions θ estimated from the original mHippNPC data and

from the downsampled mHippNPC data. (e) Similar to (a), but the points are colored by log2 transformed library size. (f) Similar to (b), but the points are colored by log2 transformed library size.

Stability with libray size downsampled (mHippNPC)

0.75

0.5

0.25 Circular correlation ρ

0 9000 8000 7000 6000 5000 4000 3000 2000 1000 Approx. libray size median

Supplementary Figure S32. Stability assessment with decreasing sequencing depths.This ﬁgure shows comprehensive assessments as complement to Supplementary Figure S31. We repeated the downsampling processes for mHippNPC for each target downsampling percentage. For each run, the circular correlation coefﬁcient ρ was calculated between θ estimated on the original mHippNPC data and θ estimated on the downsampled data.

Zheng et al. | 2021 | bioRχiv | Page S54 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

a b c Absolute distance of peaks between Exp. peak positions on PCA θ Exp. peak positions on CCns Position θ mNeurosphere and mHippNPC dataset

Ube2s 1.5π 1.5π Ncapd2 Ckap2 Nde1 Tacc3 Nusap1 Ckap2 Cit Ncapd2 Kif18a Kif11 Bub3 Rad21 Ube2s Rad21 Ube2s 0.24π Tacc3 Smc4 Smc4 Cit Ezh2 Hjurp Kif11 Lmnb1 Nde1 Tacc3 Bub3 Tipin Plk4 Smc2 Kif18a Nusap1 Position θ Knl1

1π ns 1π Lmnb1 Ezh2 Top2A Chaf1a Mcm7 Top2A Knl1 Cdkn2c Hjurp Mcm7 Plk4 Dbf4 Cdkn2c 0.16π Dtl Nasp Mdc1 Dbf4 Smc2 Chaf1a Topbp1 Position θ Hells Mcm4 Topbp1 Tipin Mdc1 ns Dtl Nasp Mcm7 Ezh2 Mcm6 Rif1 Camk2b Nasp 0.5π 0.5π Hells Mcm4 Bub3 Smc4 Tipin Ccnd2 Ccnd2 Camk1 Topbp1 Rif1 on CC 0.08π Mcm6 Camk1 Mcm6 Exp. peak positions of Exp. peak positions of Camk1 Mcm4 Camk2b Plk4 Hjurp Camk2b Nde1 Lmnb1 Absolute distance of peaks mHippNPC data on PCA θ Top2A Rif1 Chaf1a Cdkn2c Kif18a Ccnd2 Smc2 Dbf4 Rad21 Ckap2 mHippNPC data on CC Cit Mdc1 0 0 0 Nusap1 Kif11 Knl1Ncapd2 Hells Dtl

0 0.5π 1π 1.5π 0 0.5π 1π 1.5π 0 0.08π 0.16π 0.24π Exp. peak positions of Exp. peak positions of Absolute distance of peaks on PCA θ

mNeurosphere data on PCA θ mNeurosphere data on CCns Position θ

Supplementary Figure S33. Comparison of positions of peak expression for θ estimated on independent PCA and projection by mNeurosphere reference. (a) For each gene dipicted in Supplementary Figure S6, we estimate and compare when the peak expression is reached between 0 to 2π for mNeurosphere and mHippNPC data. The position θ is based on independent PCA on GO cell cycle genes of each data. (b) Similar to (a), but now we use position θ estimated using pre-learned mNeurosphere reference. (c) The majority of genes are better aligned on θ pre-learned mNeurosphere reference. x-axis represents the absolute distance of position of peak expression on θ estimated on independent PCA, while y-axis represents those estimated using pre-learned mNeurosphere reference. Genes with a larger absolute distance on θ estimated on independent PCA compared to θ estimated using pre-learned mNeurosphere reference are colored as blue, and genes are colored by red if showing the opposite direction.

a mNeurosphere (n=12805) b Consistency of two θs c mNeurosphere (n=12805) CC Stage 2π G1.S 5 S Circular correlation 5 G2 1.5π G2.M ρ=0.968 M.G1 θ 0 NA 0 1π Position θ Space Dim 2 Space Dim 2 ns ns ns

-5 CC 0.5π -5 CC CC

0π -15 -10 -5 0 5 0π 0.5π 1π 1.5π 2π -15 -10 -5 0 5 CCns Space Dim 1 PCA θ CCns Space Dim 1

Supplementary Figure S34. Self-projection to test method sensitivity on a positive control. (a) The cell cycle embedding of mNeurosphere data using the reference learned from itself. Note the projections are different from direct PCA, as the PCA is done on Seurat corrected expressions while the projection is calculated on non-corrected expressions. (b) Comparisons of cell cycle positions θ estimated from the direct PCA and from the projection. (c) RNA velocity embedding of the projection genes on the cell cycle embedding for mNeurosphere data. Cells are colored by 5 stage cell cycle representation, inferred using the modiﬁed from Schwabe method Schwabe et al. (2020).

Zheng et al. | 2021 | bioRχiv | Page S55 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.06.438463; this version posted April 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

SUPPLEMENTARY TABLES

Supplementary Table S1. Datasets

Dataset Species Platform # cells Note Reference

mNeurosphere Mouse 10x 12805 mHippNPC Mouse 10x 9188 mPancreas Mouse 10x 3559 Bastidas-Ponce et al. (2019) mHSC Mouse SMARTer 1343 Kowalczyk et al. (2015) mRetina Mouse 10x 99260 Clark et al. (2019) HeLa 1 Human Drop-seq 1398 Schwabe et al. (2020) HeLa 2 Human Drop-seq 2463 Schwabe et al. (2020) mESC Mouse Fluidigm C1 279 FACS Buettner et al. (2015) hESC Human Fluidigm C1 226 FACS Leng et al. (2015) hU2OS Human SMART-seq2 1114 FUCCI Mahdessian et al. (2021) hiPSCs Human Fluidigm C1 888 FUCCI Hsiao et al. (2020) Fetal tissue atlas Human sci-RNA-seq3 Varies Cao et al. (2020)

Zheng et al. | 2021 | bioRχiv | Page S56