RNA Velocity Analysis for Perturb-Seq
by
Mesert Kebed
B.S. Computer Science and Engineering, Massachusetts Institute of Technology (2018)
Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2020
© Massachusetts Institute of Technology 2020. All rights reserved.

Author...... Department of Electrical Engineering and Computer Science August 14, 2020

Certified by...... Aviv Regev, Professor of Biology, Thesis Supervisor

Accepted by ...... Katrina LaCurts, Chair, Master of Engineering Thesis Committee

Submitted to the Department of Electrical Engineering and Computer Science on August 14, 2020, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Recent developments in single-cell RNA-seq and CRISPR-based perturbations have enabled researchers to carry out hundreds of perturbation experiments in a pooled format in an experimental approach called Perturb-Seq [7]. Prior analysis of Perturb-Seq measured the overall effect of a perturbation on each gene; however, it remains difficult to capture temporal responses to a perturbation. In this thesis, we compare the effectiveness of three RNA velocity informed models and two cell-cell similarity based models in providing a pseudo-temporal ordering of cells. We find that pseudotime estimated with the dynamical model for computing velocity provides the most reliable ordering of cells. We use this pseudo-temporal ordering to bin cells into three time-resolved groups and compute the effect of a perturbation at each time point. This analysis provides a promising start to understanding the temporal effects of a perturbation.

Thesis Supervisor: Aviv Regev
Title: Professor of Biology

Acknowledgments

Growing up I often heard the proverb it takes a village to raise a child - and I am no exception. There are a lot more people who have made this work a possibility than I can exhaustively acknowledge here, but I hope to express my gratitude to a few of those people.

First, I would like to thank Oana Ursu, who mentored me through the duration of the thesis and patiently taught me everything I know in the field. This work would not have been possible without her gentle guidance and kind words of encouragement. She has made me a better researcher, writer, learner and person and for that I will be forever grateful.

I would like to thank Professor Regev, who initially inspired me to pursue computational biology and exposed me to a wonderfully balanced group of experimentalists, mathematicians and computer scientists. I’m extremely grateful to the Regev Group and the Broad Institute for providing an environment that nurtured and celebrated my curiosity.

I would like to thank all of my professors, instructors and staff who have instilled in me a desire to seek and tackle challenging problems; the Dept. of EECS admins who have supported me through numerous years at the Institute; and John Guttag and Ana Bell, who have welcomed me into their teaching staff with open arms.

Lastly, I would like to thank my family for inspiring my curiosity, creativity, and drive from a young age. I’m also grateful to my friends who have kept me company through the late nights and early mornings, exposed me to new experiences and truly made MIT my home away from home.

Contents

1 Introduction
  1.1 Perturb-Seq: genetic screens for studying gene function
    1.1.1 Previous analysis of Perturb-Seq: using linear regression for identifying which genes are affected by a given perturbation
    1.1.2 Challenges in previous analysis methods of Perturb-Seq: understanding the temporal progression of gene expression changes induced by a gene knockout
  1.2 RNA Velocity: inferring a time based ordering of cells
  1.3 Our proposed approach: using RNA velocity to increase the temporal resolution of perturbation-induced gene expression changes
    1.3.1 Using RNA velocity to arrange cells over time, towards distinguishing cells that are early from late responders
    1.3.2 Compare the effect of perturbations on gene expression as inferred by a) traditional Perturb-Seq analyses and b) incorporating insights from RNA velocity

2 Related Work
  2.1 Trajectory inference: infers projection of cells using diffusion maps
  2.2 RNA Velocity: captures the rate of change in the cell’s expression state
    2.2.1 Estimating RNA velocity
    2.2.2 Steady-state model: captures variations from an observed steady-state expression
    2.2.3 Dynamical model: solves the full gene-wise transcriptional dynamics

3 Data
  3.1 Time-series dataset: mouse BMDCs sequenced at 1 hour intervals following LPS stimulation
    3.1.1 Downloading and preparing the dataset
    3.1.2 Filtering the dataset
    3.1.3 Processing the dataset
  3.2 Perturb-Seq: mouse BMDCs with 24 perturbations at 0 and 3 hours after LPS stimulation
    3.2.1 Downloading and preparing the dataset
    3.2.2 Filtering the dataset
    3.2.3 Processing the Dataset
    3.2.4 Post-Processing

4 Methods
  4.1 Characterising the various pseudotime ordering methods
  4.2 Computing a pseudo-temporal ordering of cells
  4.3 Computing the transition matrix
    4.3.1 For velocity informed models
    4.3.2 For cell-cell similarity based methods
  4.4 Software Packages

5 Analysing the time-series dataset
  5.1 Goal
  5.2 Data and code
  5.3 Results and Analysis
    5.3.1 Picking the number of neighbours for the transition matrix
    5.3.2 Comparing the pseudotime estimates with the ground truth
    5.3.3 Exploring the genes that are strongly correlated with pseudotime estimates for each method

6 Analysing the Perturb-Seq dataset
  6.1 Goal
  6.2 Data and Code
  6.3 Results and Analysis
    6.3.1 Picking the number of neighbours for the transition matrix
    6.3.2 Comparing the pseudotime estimates among each of the velocity models
    6.3.3 Exploring the genes that are strongly correlated with pseudotime
    6.3.4 Compare the Beta’s that we get by sorting the cells into pseudotime ordered groups to the Beta’s we get from previous analysis of Perturb-Seq
  6.4 Conclusion and Future work

List of Figures

1-1 Linear regression model for Perturb-Seq. The model predicts the gene expression matrix Y (given) as a product of X, a matrix that specifies which cell received which perturbation (given), and beta, a set of coefficients, which represent the effects of each perturbation on each gene. The beta coefficients are then used by biologists to understand the biological processes affected by different perturbations.

1-2 Direct and indirect effects of Gene A. In this simplified model of a cell pathway, Gene A activates expression of Genes B and C, and represses expression of Gene D, which go on to activate and repress other genes. In this figure Genes X, Y and Z are indirectly affected by the expression of Gene A.

1-3 Illustration of RNA velocity on Perturb-Seq. RNA velocity allows us to compute c(t+1), from which we can infer that c’(t) likely responded to the stimulus before c(t), and is therefore at a later cell state as compared to c(t).

2-1 Overview of RNA Velocity. a) Gene model: DNA is transcribed to RNA at rate $\alpha$, spliced at rate $\beta$, and degraded at rate $\gamma$. b) Phase diagram capturing regions of induction and repression based on the amount of unspliced and spliced RNA.

3-1 Distribution of total counts for each group of cells in time series dataset. The plot reveals that the unstimulated group has more reads than the others. We subsample the reads per cell to 500,000 to account for this.

3-2 Quality check for cells in time series dataset. Scatter plots of (left) the number of counts against percent mitochondrial genes; (right) the number of counts against the number of genes in each cell. We remove all cells that have greater than 0.1% mitochondrial genes and greater than 7,000 genes.

3-3 Overview of cells in time series dataset. a) Low dimensional representation in UMAP space representing the 20 clusters identified in the dataset, the 0hr, 1hr, 2hr, 4hr, and 6hr cells. b) Average expression of LPS gene groups in the dataset across cells.

3-4 Cluster disruption. Expression of cluster disruptive genes (Cd83 and SerpinB6b) across the cells represented in the UMAP space.

3-5 Quality check on Perturb-Seq dataset. a) Scatter plots of (left) the number of counts against percent mitochondrial genes; (right) the number of counts against the number of genes in each cell. We remove all cells that have greater than 0.1% mitochondrial genes and greater than 2000 genes.

3-6 Linear regression model for batch correction. The model measures the effect of the covariates on the observed expression profile. We subtract out the covariates related to batch and keep the error and hour effect. In this case, our resulting expression would be $B_1 \cdot (0\mathrm{hr}) + B_2 \cdot (3\mathrm{hr}) + \mathrm{error}$.

3-7 Overview of cells in Perturb-Seq dataset. a) Low dimensional representation in UMAP space representing the 15 clusters identified in the dataset, the 0hr cells and 3hr cells. b) Average expression of LPS gene groups in the dataset across cells.

3-8 Distribution of the total counts per batch in the Perturb-Seq dataset. Batch D9 has significantly fewer counts than the others and is thus removed from further processing.

3-9 Distribution of cell cycle and cluster disruptive genes in Perturb-Seq dataset. (left to right) Cell clusters in the dataset; we remove clusters that are related to the marker genes shown. Expression levels of genes related to cell cycle, and of the Cd83 and Serpinb6b marker genes used to identify cells that are not dendritic cells.

3-10 A heat map of the correlation matrix of guides from Perturb-Seq. The matrix shows that the chosen group of guides (in the green box) are strongly correlated with each other.

4-1 Computing the velocity graph, which forms the basis of the transition matrix. Depicted are cells $x_i$, $x_j$. We compute a measure of the concordance between the velocity at $x_i$ and the difference between $x_i$ and $x_j$. If the velocity at $x_i$ points to a transition towards cell $x_j$, then $x_j - x_i$ and $v_i(t)$ will have similar directions. We compute this concordance via a cosine similarity.

4-2 Estimating the expected gene expression after some time. A linear combination of the observed gene expression matrix combined with the cell velocities gives us the expected gene expression state at time t.

5-1 Comparing the effect of restricting cell-cell transitions to the nearest $n$ neighbors. a) Euclidean distances across top 50 PCs between each cell and its nth closest neighbour: (left) shows the plot containing all of the cells; (right) focuses on the nearest 50 neighbours. b) Summarizes the effect on pseudotime estimates when varying the transitions allowed for each cell to its n nearest neighbours.

5-2 Comparison of the various pseudo-temporal orderings against ground truth. a) Low dimensional representation of cells in UMAP space showing ground truths in the data: (top) 1hr, 2hr, 4hr and 6hr cells; (bottom) based on expression scores for LPS gene groups. b) Pseudotime estimates based on each of the velocity/similarity based models. c) Swarm plots depicting the distribution of pseudotime estimates grouped by ground truth. d) Velocity confidence measures for each cell based on concordance of velocity with neighbours.

5-3 Exploring genes that are correlated with pseudotime for the time series dataset. a) Clustermap of the genes that are highly correlated with at least one of the pseudotime estimates. We annotate each gene with the LPS gene group it belongs to (if any) from [20]. b) Heatmap of the correlation between pseudotime estimates and gene expression grouped by the LPS gene groups from [20].

6-1 Euclidean distance among the top 50 PCs between a cell and its nth nearest neighbor. (left) shows the distance between all possible neighbours; (right) focuses on the 400 nearest neighbours.

6-2 Comparison of the various pseudo-temporal orderings against expectation based on LPS gene groups. a) Low dimensional representation of cells in UMAP space showing: (left) the clusters found in the data; (middle, right) ground truths in the data based on expression scores for LPS gene groups. b) Pseudotime estimates based on each of the velocity/similarity based models. c) Velocity confidence measures for each cell based on concordance of velocity with neighbours.

6-3 Exploring genes that are correlated with pseudotime for Perturb-Seq. a) Clustermap of the genes that are highly correlated with at least one of the pseudotime estimates. We annotate each gene with the LPS gene group it belongs to (if any). b) Heatmap of the correlation between pseudotime estimates and gene expression grouped by the LPS gene groups from [20].

6-4 Evaluating effect of a perturbation on each gene in each pseudo-temporal bin. (left) Clustermap representing the effect of the perturbation on each gene, where the cells are sorted into 3 groups based on the pseudotime values from the Dynamical model. (right) Heatmap of the overall effect of the perturbation on each gene, akin to prior analysis methods of Perturb-Seq. (a-e) show the graphs for each of the five guides we use for our analysis.

Chapter 1

Introduction

The human genome has approximately 20,000 genes that work together in a complex system to enable life. However, understanding the function of each of these genes is a challenging task. One way to identify the function of a gene is to evaluate whether that gene is required in known cellular processes, by perturbing (for example, knocking out) that gene and analyzing the effect of the knockout on various cell functions [7]. Perturbing a gene can lead to one of three possible outcomes in the cell: (1) an observable change in cell biology phenotype, such as a change in growth or drug resistance; (2) an observable change in molecular phenotype, such as a change in the expression of other genes in response to the knockout; or (3) no observable effect. The first and second outcomes are not mutually exclusive, and both reflect how the perturbed gene is linked to the function or expression of other genes in the cell. This thesis focuses on characterizing the genes that are affected by a perturbed gene.

We can evaluate the effect of knocking out a gene by measuring which genes become activated or repressed in response to knocking out the gene of interest. One approach to do this is through a combination of genetic screens (for knocking out the gene) and RNA sequencing (RNA-seq, for measuring the response of cells to the perturbation) [16, 4]. RNA-seq measures the extent to which each gene is active in a given cell, through that cell’s gene expression profile. Once we measure the gene expression profile for each cell, we can identify the specific genes that change due to the perturbation by comparing the gene expression profile of perturbed vs. control cells (that were not perturbed, but were otherwise subjected to all the same experimental steps). This allows us to learn the consequences of knocking out a specific gene.

Often there are multiple genes that are affected by knocking out a specific gene, especially when the knocked out gene is a transcription factor (a gene that regulates the transcription of other genes). Genes that are co-regulated (regulated by at least one common transcription factor) are frequently linked to the same processes. Therefore, gene expression profiles serve as a proxy for the overall cellular state of a cell, and changes in cell state induced by a perturbation, as measured by RNA-seq, are a readout of the importance of a given gene for cellular processes.

1.1 Perturb-Seq: genetic screens for studying gene function

Recent advances have enabled us to conduct and analyze multiple genetic perturbations in parallel for their effect on expression profiles. Perturbation experiments have massively increased in scale and throughput due to advances in both (1) perturbation techniques, especially genome editing techniques using CRISPR, resulting in the ability to perform thousands of perturbations in parallel in a single experiment with high precision [6, 17], and (2) the ability to perform RNA-seq at the single cell level, making it possible not only to read out changes in gene expression profiles within single cells, but also to pool together numerous perturbations in a single experiment. Such advances have enabled researchers to carry out knockout experiments in a pooled format in an experimental design called Perturb-Seq. In particular, Perturb-Seq and related experimental approaches [7, 1, 13] enable the study of hundreds of perturbations at once, measuring both the expression profile and the identity of the perturbation present in each cell (where the perturbation identity is detected as an expressed transcript or an expressed transcript barcode). From Perturb-Seq we can infer the effect of each perturbation on each individual gene using techniques such as linear regression [7].

1.1.1 Previous analysis of Perturb-Seq: using linear regression for identifying which genes are affected by a given perturbation

Perturb-Seq experiments provide us with two important datasets: the expression levels for all the genes across cells (Y) and the perturbations in each cell (X). Given these two datasets we can fit a linear regression model to infer $\beta$, the effect of a perturbation on each gene (Figure 1-1). This allows us to measure the effect of each perturbation (i) on each gene (j) in a single coefficient $\beta_{i,j}$, which captures the overall consequence of knocking out a gene on cell states.

Figure 1-1: Linear regression model for Perturb-Seq. The model predicts the gene expression matrix Y (given) as a product of X, a matrix that specifies which cell received which perturbation (given), and beta, a set of coefficients, which represent the effects of each perturbation on each gene. The beta coefficients are then used by biologists to understand the biological processes affected by different perturbations.
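As a hedged illustration of the model in Figure 1-1, the sketch below fits the coefficient matrix by ordinary least squares on simulated data. The simulated matrices, the one-guide-per-cell design and the use of plain, unregularized least squares are assumptions made for this example only; they are not taken from the analysis in [7].

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_perts, n_genes = 300, 3, 5

# X: one-hot design matrix (cells x perturbations); each toy cell carries one guide.
guide = rng.integers(0, n_perts, size=n_cells)
X = np.eye(n_perts)[guide]

# Simulated "true" effects, used only to generate toy expression data Y.
beta_true = rng.normal(size=(n_perts, n_genes))
Y = X @ beta_true + 0.1 * rng.normal(size=(n_cells, n_genes))

# Least-squares estimate of beta (perturbations x genes): Y ~ X @ beta_hat.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Each entry `beta_hat[i, j]` plays the role of the coefficient for perturbation i and gene j. In this toy design it reduces to the mean expression of gene j among the cells that received guide i.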

1.1.2 Challenges in previous analysis methods of Perturb-Seq: understanding the temporal progression of gene expression changes induced by a gene knockout

There are two main shortcomings to the previous analysis methods applied to Perturb-Seq data.

First, previous Perturb-Seq analysis does not distinguish between direct and indirect targets of the perturbation, which leads to a loss of critical information about the order in which genes are affected. In the cell, however, the perturbed gene can have both early direct targets and indirect targets that are affected later, as the cell responds to the direct outcome of the perturbation. For example, consider the case when the perturbed gene encodes a transcription factor that directly regulates the expression of other genes, in a simplified transcriptional pathway (Figure 1-2). Knocking out gene A will affect the expression of genes B, C, D, X, Y, and Z. Traditional Perturb-Seq analysis will reveal that B, C, D, X, Y, and Z have been impacted by gene A; however, it will not distinguish between the groups of genes B, C, D and X, Y, Z, which are direct and indirect effects of the perturbation respectively.

Figure 1-2: Direct and indirect effects of Gene A. In this simplified model of a cell pathway, Gene A activates expression of Genes B and C, and represses expression of Gene D, which go on to activate and repress other genes. In this figure Genes X, Y and Z are indirectly affected by the expression of Gene A.

Second, prior Perturb-Seq analyses estimated the overall effect of the perturbation on each gene and did not capture temporal responses to a stimulus. In this thesis, we focus on the impact of gene perturbations in response to lipopolysaccharide (LPS) in mouse bone-marrow derived dendritic cells (BMDCs) after 3 hours of stimulation. Previous studies on the LPS response pathway in dendritic cells (DCs) have revealed a distinct temporal response observed among a group of genes [20]. For example, a subset of genes called the “peaked inflammatory” program first increases in expression and then decreases, whereas another program, the “sustained inflammatory” program, consistently increases in expression in response to LPS. Traditional Perturb-Seq analysis may miss such differences in the timing of effects on genes if the cells in a given sample are not completely synchronized in their response.

1.2 RNA Velocity: inferring a time based ordering of cells

Though all of the cells in the same batch receive the perturbation reagents and the stimulus at the same time, the perturbation reagent may take effect at different times in different cells [16, 7], and the cells do not respond to the stimulus in a synchronized manner [20]. Therefore, if we profile cells 3 hours after LPS stimulation, as in the dataset we use here, the cells could still be at various stages of their response to the stimulus. Some cells may have begun responding as soon as the stimulus was added, whereas others could have initiated a response only 30 minutes or an hour later. Traditional Perturb-Seq analysis addresses this variation in the data by considering the mean expression of each gene across all cells receiving the same perturbation in the experiment. However, this ignores the potential time signal present in the dataset, which is what we set out to explore in this work.

An emerging approach in the analysis of scRNA-seq, not yet applied to perturbation data, is RNA velocity, which is a predictor of the future state of a cell. RNA velocity is a measure of the rate and direction of change within a cell. It uses the relative abundance of unspliced and spliced RNAs in the cell to predict the future state of a cell given its current expression profile [14, 3]. This is based on the observation that newly activated genes manifest as a low spliced/unspliced ratio (as most of the transcript would be new, and not yet spliced) [18, 19], whereas repressed genes should show the opposite pattern. As such, RNA velocity may be a promising measure to arrange cells across time, as it captures relationships between cell states represented in a pool of cells.

We discuss the formulation of RNA velocity in further detail in Chapter 2.

1.3 Our proposed approach: using RNA velocity to increase the temporal resolution of perturbation-induced gene expression changes

In this thesis, I explore whether we can use the newly developed framework of RNA velocity [14] to understand the effect of a single perturbation over time. If successful, this will allow us to identify the cascading behaviours of cell pathways for further investigation, and will serve as a general methodology to extract time-resolved information from Perturb-Seq experiments.

1.3.1 Using RNA velocity to arrange cells over time, towards distinguishing cells that are early from late responders

Our first goal is to assess whether we can use RNA velocity to arrange the cells in a Perturb-Seq experiment across time and resolve the implicit time variation in the pool of cells. This will allow us to identify the cells that are early vs. late in their response to the stimulus, and how the perturbation effects may cascade over time.

My key hypothesis is that RNA velocity will allow us to learn how cell states are related to each other temporally by providing an estimate of the future state of each cell. For example, consider a cell c at time t following the stimulus, and denote its cell state as c(t). We can use RNA velocity to compute its expected state after some time, c(t+1). Now, if we identify another cell c’ in our set of perturbed cells whose state at time t, c’(t), is most similar to c(t+1), we can infer that cell c’ is in a “later” state than cell c. In other words, cell c’ responded to the stimulus earlier than cell c (Figure 1-3). Using this approach, we devise an algorithm to systematically arrange the cells across a time dimension.

Figure 1-3: Illustration of RNA velocity on Perturb-Seq. RNA velocity allows us to compute c(t+1), from which we can infer that c’(t) likely responded to the stimulus before c(t), and is therefore at a later cell state as compared to c(t).

As there are two distinct stimuli applied to the cells, gene perturbation and LPS stimulation, velocity could be picking up on variations caused by either one or both to inform the prediction of future cell states. Therefore, we will look at the genes that are driving velocity, so we can better understand which one (if any) of the stimuli is driving change. Control cells in the pool, which are affected only by LPS stimulation, will allow us to distinguish the two temporal processes.
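The ordering idea of this section (extrapolate c(t) to c(t+1), then look for the observed cell closest to that prediction) can be sketched as follows. This is a minimal illustration, not the algorithm developed later in the thesis: the linear extrapolation c(t) + Δt · v(t), the Euclidean distance and the function name `later_neighbor` are all assumptions made for this example.

```python
import numpy as np

def later_neighbor(states, velocities, dt=1.0):
    """For each cell c, return the index of the cell c' whose observed state
    is closest to the extrapolated state c(t + dt) = c(t) + dt * v(t)."""
    future = states + dt * velocities
    # Pairwise Euclidean distances between predicted and observed states.
    dist = np.linalg.norm(future[:, None, :] - states[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a cell cannot be its own successor
    return dist.argmin(axis=1)

# Toy example: four cells along a line in a 2-gene space, all moving "right".
states = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
velocities = np.tile([1.0, 0.0], (4, 1))
successors = later_neighbor(states, velocities)
```

In the toy data each of the first three cells is matched to the cell one step ahead of it, so following the successor links recovers the left-to-right ordering. The last cell has no later cell in the pool, so its nearest match is not informative.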

1.3.2 Compare the effect of perturbations on gene expression as inferred by a) traditional Perturb-Seq analyses and b) incorporating insights from RNA velocity

My second goal for this thesis is to determine whether RNA velocity is able to capture temporal responses to the stimulus at the gene level. This addresses the core hypothesis of this thesis, which is that RNA velocity will enable us to differentiate between direct and indirect effects of the perturbation (Figure 1-2) and/or capture temporal responses to a stimulus.

The key idea here is that RNA velocity will enable us to arrange the cells over time such that we can split the cells into three time-resolved groups (“early”, “mid”, “late”). Then we can compute the effect of each perturbation on each gene in a coefficient matrix $\beta$ (as described in Section 1.1.1) for each group of cells. The variations in the magnitude of the $\beta$ values for a gene across the three time bins mirror temporal changes to the expression of that gene in response to the stimulus. We consider a gene to be important if it has a significant $\beta$ value in any one time bin. Ultimately, we compare the important genes as evaluated by the described method against what we get from a typical analysis of Perturb-Seq data.
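The binning step described above can be illustrated with a short sketch that sorts cells by pseudotime, splits them into three groups and fits one coefficient matrix per group. The equal-sized bins, the plain least-squares fit, the simulated data and the function name `beta_per_time_bin` are assumptions made for this example; the thesis does not specify them at this point.

```python
import numpy as np

def beta_per_time_bin(X, Y, pseudotime, n_bins=3):
    """Fit Y ~ X @ beta separately within each pseudotime bin.

    Returns an array of shape (n_bins, n_covariates, n_genes)."""
    order = np.argsort(pseudotime)  # cells from earliest to latest
    betas = []
    for idx in np.array_split(order, n_bins):  # "early", "mid", "late" groups
        beta, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        betas.append(beta)
    return np.stack(betas)

# Toy data: one gene whose response to the guide grows with pseudotime.
rng = np.random.default_rng(1)
n_cells = 600
pseudotime = rng.uniform(0, 1, n_cells)
perturbed = rng.integers(0, 2, n_cells)
X = np.column_stack([np.ones(n_cells), perturbed])  # intercept + guide indicator
Y = (2.0 * pseudotime * perturbed + 0.05 * rng.normal(size=n_cells))[:, None]

betas = beta_per_time_bin(X, Y, pseudotime)
effect_over_time = betas[:, 1, 0]  # guide effect in the early, mid and late bins
```

In this toy example the guide effect increases from the early to the late bin, which is the kind of temporal pattern a single pooled coefficient would average away.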

Chapter 2

Related Work

2.1 Trajectory inference: infers projection of cells using diffusion maps

Trajectory inference estimates a pseudo-temporal ordering of cells based on random-walk based distances in the diffusion map space [9, 21]. The root (starting) cell for the trajectory is explicitly defined by setting adata.uns['iroot']. The pseudotime estimates for this approach are computed in three steps.

1. Compute a transition matrix that approximates the probability of one cell, x, transitioning to another cell, y. This is done by superimposing local kernels at the expression levels of cells x and y.

2. The distance between two cells x and y, $\mathrm{dpt}(x, y)$, is computed as:

$$
\mathrm{dpt}(x, y) = \left\| M(x, \cdot) - M(y, \cdot) \right\|_{1/\varphi_0}, \qquad M = \sum_{t=1}^{\infty} \tilde{T}^{t} \tag{2.1}
$$

Here $M$ accumulates the transition probabilities over random walks of all lengths, so $\mathrm{dpt}(x, y)$ compares the accumulated transition profiles of cells x and y. Since the root cell is defined by the user, the pseudotime of cell x is its distance to the root r, $\mathrm{dpt}(x, r)$.

3. Branching points (points where cells diverge into two separate lineages) are identified.
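The first two steps above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the implementation used in [9, 21]: it assumes a Gaussian kernel with a fixed bandwidth `sigma`, skips density normalization, removes the stationary component of the transition matrix to form the accumulated matrix, and uses an unweighted Euclidean norm in place of the weighted norm in Equation 2.1. The function name `dpt_pseudotime` is hypothetical.

```python
import numpy as np

def dpt_pseudotime(X, root, sigma=1.0):
    """Simplified diffusion pseudotime of every cell relative to a root cell."""
    n = len(X)
    # Step 1: Gaussian kernel between cells, symmetrically normalised into T.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2 * sigma**2))
    d = W.sum(axis=1)
    T = W / np.sqrt(np.outer(d, d))
    # Remove the stationary component (eigenvalue 1) so the series converges.
    psi0 = np.sqrt(d) / np.linalg.norm(np.sqrt(d))
    T_tilde = T - np.outer(psi0, psi0)
    # Step 2: M = sum_{t>=1} T_tilde^t = (I - T_tilde)^{-1} - I.
    M = np.linalg.inv(np.eye(n) - T_tilde) - np.eye(n)
    # Pseudotime is the distance between each cell's row of M and the root's row.
    return np.linalg.norm(M - M[root], axis=1)

# Toy example: 20 cells spread along a one-dimensional trajectory.
cells = np.linspace(0, 5, 20)[:, None]
pseudotime = dpt_pseudotime(cells, root=0)
```

The closed form for the infinite sum holds because, once the stationary component is removed, all remaining eigenvalues are below one. In the toy example the root has pseudotime zero and the cell at the far end of the line is assigned a larger pseudotime than a cell in the middle.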

In Chapters 5 and 6, we compare the effectiveness of a diffusion based pseudotime ordering method with that of RNA velocity based approaches when applied to stimulation and perturbation experiments.

2.2 RNA Velocity: captures the rate of change in the cell’s expression state

The work done by [14, 3] in estimating RNA velocity and predicting future cell states based on the current gene expression and the estimated velocity has provided the groundwork for this thesis. RNA velocity uses the relative abundance of unspliced and spliced RNAs in the cell to determine the rate at which each gene is being induced (activated) or repressed, based on the observation that newly activated genes manifest as a high unspliced-spliced ratio, whereas repressed genes show the opposite [14].

Previous work by [14, 3] showed that RNA velocity is a promising measure for accurately arranging cells across time when working with datasets collected along cell differentiation, such as pancreatic endocrinogenesis [2] and dentate gyrus neurogenesis [10]. In this thesis, we extend the application of RNA velocity to arrange cells over time in a response to an environmental stimulus within a Perturb-Seq setting, where there are fewer genes that are changing over time. We will analyse the arrangement of cells based on three velocity models (namely the deterministic, stochastic, and dynamical models) introduced by [3] and implemented and maintained through the Python package scVelo.

2.2.1 Estimating RNA velocity

RNA velocity measures the expected change in gene expression based on the ratio between unspliced and spliced RNA in the cell. The cell’s DNA containing the genetic code is stored in the nucleus; therefore it needs to be transcribed to RNA and transported to the cytoplasm to be translated to protein. Once DNA has been transcribed to RNA, it is spliced to remove non-protein coding intervening regions (also known as introns), leaving only the coding exons and the untranslated regions flanking the translation start and stop codons. [14, 3] assume that each gene is in one of the four states described below:

∙ Induction: the gene is actively induced, i.e. there is an increase in the expression of this gene.

∙ Repression: the gene is actively repressed, i.e. there is a decrease in the expression of this gene.

∙ Steadily on: the gene is expressed at a steady rate.

∙ Steadily off: the gene is not expressed.

Figure 2-1: Overview of RNA Velocity. a) Gene transcription model: DNA is transcribed to RNA at rate $\alpha$, spliced at rate $\beta$, and degraded at rate $\gamma$. b) Phase diagram capturing regions of induction and repression based on the amount of unspliced and spliced RNA.

RNA velocity $v(t)$ leverages the transcription ($\alpha$), splicing ($\beta$), and degradation ($\gamma$) rates and the abundance of unspliced, $u(t)$, and spliced, $s(t)$, RNAs to determine the rate at which new spliced RNA is produced [14]. The rate constants ($\alpha$, $\beta$, and $\gamma$) are gene specific and estimated based on the observed RNA expression for that gene among all cells. The formula for determining RNA velocity is outlined below:

$$
\begin{aligned}
\frac{du}{dt} &= \alpha - \beta u(t) \\
v(t) = \frac{ds}{dt} &= \beta u(t) - \gamma s(t)
\end{aligned}
\tag{2.2}
$$

where $\alpha$ is the rate at which new unspliced RNA is being made, $\beta$ is the rate at which unspliced RNA gets spliced and $\gamma$ is the rate at which spliced RNA is degraded [14].

[14, 3] showed that given $\alpha$, $\beta$, $\gamma$ and the abundance of unspliced and spliced RNA in the cell, we can predict the level of mature (spliced) mRNA in the cell after some time has passed. The accuracy of this prediction depends highly on the fidelity of the estimated $\alpha$, $\beta$, $\gamma$ rates for each gene. There are two main approaches to computing the gene specific rate constants: 1) a steady-state model that assumes we observe the steady on and off states and computes the rate constants based on an estimated steady-state ratio; and 2) a dynamical model that makes no assumptions about the observed states and instead computes a different rate constant for each transcriptional state.
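As a small worked example of Equation 2.2, the sketch below computes the velocity of a single gene in a single cell and uses it for a first-order prediction of the spliced level a short time later. The function names, the numerical rate values and the linear (Euler) extrapolation step are assumptions chosen for illustration.

```python
def velocity(u, s, beta, gamma):
    """RNA velocity v = ds/dt = beta * u - gamma * s for one gene in one cell."""
    return beta * u - gamma * s

def extrapolate_spliced(u, s, beta, gamma, dt):
    """First-order prediction of the spliced abundance after a short time dt."""
    return s + dt * velocity(u, s, beta, gamma)

# Illustrative rates: splicing rate beta = 1.0, degradation rate gamma = 0.5.
v_induced = velocity(u=2.0, s=1.0, beta=1.0, gamma=0.5)    # excess unspliced RNA
v_steady = velocity(u=1.0, s=2.0, beta=1.0, gamma=0.5)     # u / s = gamma / beta
v_repressed = velocity(u=0.2, s=2.0, beta=1.0, gamma=0.5)  # excess spliced RNA
s_next = extrapolate_spliced(u=2.0, s=1.0, beta=1.0, gamma=0.5, dt=0.1)
```

A positive velocity indicates that the spliced level is expected to rise (induction), a negative velocity that it is expected to fall (repression), and a velocity of zero corresponds to a steady state.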

2.2.2 Steady-state model: captures variations from an observed steady-state expression

The steady-state model for measuring velocity by [14, 3] assumes that the observed expression profile captures at least a subset of cells that are at the induced and repressed steady states. At steady state, the transcriptional state is constant and $v(t) = 0$. Based on this observation and the above equation, the ratio of unspliced to spliced RNA at steady state equals $\gamma / \beta$. Once we have the steady-state ratio, velocity is computed as the deviation from steady state. There are two approaches to estimating this steady-state ratio, described below.

Deterministic model

In the deterministic model, a linear regression on the observed cells at steady state is used to approximate the ratio γ/β. These cells at steady state are expected at the lower and upper quantiles in the phase space. For this thesis, we use the [3] implementation of this method maintained through the scVelo package. The method can be called using scvelo.tl.velocity(adata, mode='deterministic').
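The idea behind the deterministic fit can be sketched for a single gene as follows. This is a simplified illustration, not the scVelo implementation: the 5% quantile cut-off, the regression through the origin, and the convention of expressing velocity in units where β = 1 are assumptions of this sketch, and the function names are hypothetical.

```python
import numpy as np

def steady_state_ratio(u, s, quantile=0.05):
    """Estimate gamma/beta for one gene from cells in the extreme quantiles of s."""
    lo, hi = np.quantile(s, [quantile, 1.0 - quantile])
    extreme = (s <= lo) | (s >= hi)
    # Least-squares slope of a line through the origin: u ~ ratio * s
    return float(np.dot(u[extreme], s[extreme]) / np.dot(s[extreme], s[extreme]))

def steady_state_velocity(u, s, ratio):
    """Deviation from the steady-state line (assuming beta = 1)."""
    return u - ratio * s

# Synthetic gene whose cells all lie on a steady-state line with slope 0.3.
s = np.linspace(0.0, 10.0, 101)
u = 0.3 * s
ratio = steady_state_ratio(u, s)

# A cell above the fitted line has positive velocity (induction).
v = steady_state_velocity(np.array([2.0]), np.array([5.0]), ratio)
```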

Stochastic model

The stochastic model extends the steady state assumption to treat transcription, splicing and degradation as probabilistic events. This allows the model to account for the fact that u(t) and s(t) are not independent; their joint behaviour is captured by the mixed moment ⟨u_t s_t⟩. Therefore, we get three new equations to use to compute the steady state ratio.

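$$
\begin{aligned}
\frac{d\langle u_t^2 \rangle}{dt} &= \alpha + 2\alpha\langle u_t \rangle + \beta\langle u_t \rangle - 2\beta\langle u_t^2 \rangle \\
\frac{d\langle u_t s_t \rangle}{dt} &= \alpha\langle s_t \rangle + \beta\langle u_t^2 \rangle - \beta\langle u_t s_t \rangle - \gamma\langle u_t s_t \rangle \\
\frac{d\langle s_t^2 \rangle}{dt} &= \beta\langle u_t \rangle + 2\beta\langle u_t s_t \rangle + \gamma\langle s_t \rangle - 2\gamma\langle s_t^2 \rangle
\end{aligned} \tag{2.3}
$$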

The steady state ratio is obtained through generalized least squares on Equations (2.2) and (2.3). We use the [3] implementation of this method maintained through the scVelo package. The method can be called using scvelo.tl.velocity(adata, mode='stochastic').

2.2.3 Dynamical model: solves the full gene-wise transcriptional dynamics

Contrary to the steady state models, the dynamical model removes the assumption that the cells at steady on and off states are observed among the experimental set. Instead, it adds an additional parameter that represents the transcription state k for each cell, and assumes that the rate constants (α, β, and γ) and initial conditions are state-dependent. Consequently, we need to keep track of another parameter, $t_0^{(k)}$, that captures the time point of switching from one state to another. Integrating Equation (2.2) gives the following equation:

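$$
\begin{aligned}
u(t) &= u_0 e^{-\beta\tau} + \frac{\alpha^{(k)}}{\beta}\left(1 - e^{-\beta\tau}\right) \\
s(t) &= s_0 e^{-\gamma\tau} + \frac{\alpha^{(k)}}{\gamma}\left(1 - e^{-\gamma\tau}\right) + \frac{\alpha^{(k)} - \beta u_0}{\gamma - \beta}\left(e^{-\gamma\tau} - e^{-\beta\tau}\right), \qquad \tau = t - t_0^{(k)}
\end{aligned} \tag{2.4}
$$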

with reaction rates $\alpha^{(k)}$, β, γ, cell-specific time points t ∈ (t_1, ..., t_n), and initial conditions u_0 = u(t_0), s_0 = s(t_0). As described above, each cell could be in one of four transcription states: induction, repression, steady on and steady off. An expectation maximisation algorithm is run to infer the transcription state, the state-dependent rate constants, the time point for switching states, and the cell-specific time points. We used the scvelo.tl.recover_dynamics(adata) method to compute these when working with the dynamical model.

Once the described parameters have been estimated for each gene, we are able to compute RNA velocity directly using Equation (2.2). The functionality is implemented in scvelo.tl.velocity(adata, mode='dynamical'); note that recover_dynamics must first be run to compute the rate estimates.

Unlike the steady state model, the dynamical model computes a cell- and gene-specific time point based on that cell's location in the phase space for each gene. This time point captures the cell's internal clock for each gene. In order to get a single latent time value for each cell that is shared across all genes, [14] normalize to a common overall timescale (across well-fitted genes), and find the median time point across all genes.
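Equation (2.4) can be evaluated directly once the rates and initial conditions are known. The sketch below is a plain NumPy transcription of the two closed-form expressions for a single gene in a single transcriptional state; the function names and parameter values are hypothetical, and the formula as written requires γ ≠ β.

```python
import numpy as np

def unspliced(tau, u0, alpha, beta):
    """u(t) from Equation (2.4), with tau = t - t0."""
    e_b = np.exp(-beta * tau)
    return u0 * e_b + alpha / beta * (1.0 - e_b)

def spliced(tau, u0, s0, alpha, beta, gamma):
    """s(t) from Equation (2.4); assumes gamma != beta."""
    e_b = np.exp(-beta * tau)
    e_g = np.exp(-gamma * tau)
    return (s0 * e_g
            + alpha / gamma * (1.0 - e_g)
            + (alpha - beta * u0) / (gamma - beta) * (e_g - e_b))

# Induction from an empty cell (u0 = s0 = 0) with illustrative rates.
alpha, beta, gamma = 2.0, 1.0, 0.5
tau = np.array([0.0, 50.0])
u_t = unspliced(tau, 0.0, alpha, beta)
s_t = spliced(tau, 0.0, 0.0, alpha, beta, gamma)
```

At τ = 0 the expressions return the initial conditions, and for large τ they approach the steady-state levels α/β and α/γ, which is a quick sanity check on the reconstruction of the formula.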

Chapter 3

Data

3.1 Time-series dataset: mouse BMDCs sequenced at 1 hour intervals following LPS stimulation

In order to test our hypothesis that RNA velocity is able to capture cell dynamics and provide a temporal ordering of cells over time in response to an environmental stimulus, we apply it to a time series dataset from [20]. We focus on an experiment examining mouse bone marrow-derived dendritic cells (BMDCs) at 1, 2, 4 and 6 hours following stimulation with lipopolysaccharide (LPS), a component of gram-negative bacteria, along with an unstimulated control group. We perform a series of filtering steps, described in Section 3.1.2, to obtain a final dataset of 1,233 cells and 2,759 genes, with 344 cells from the 1hr group, 277 from the 2hr group, 304 from the 4hr group, and 308 from the 6hr group. Next we describe how we obtained, filtered and processed the data in order to get the final dataset. The code used to:

∙ download and prepare (section 3.1.1) the dataset can be found here

∙ combine the individual cell information into one object can be found here

∙ filter and process the dataset (sections 3.1.2 - 3.1.3) can be found here

3.1.1 Downloading and preparing the dataset

The time course dataset metadata is publicly available through GEO GSE48968. We downloaded the reads for cells (in FASTQ format) that belonged to these stimulation groups: 1h LPS Stimulation, 2h LPS Stimulation, 4h LPS Stimulation, 6h LPS Stim- ulation, Unstimulation (0hr). We did not include cells that were from technical and biological replicate experiments, as well as cells where secretion had been inhibited. For each cell in the dataset, we aligned the reads to the mouse genome (mm10) and counted the occurrence of each gene using the package STAR [8]. Finally, we use velocyto to count the abundance of spliced and unspliced RNA for each gene. The reads are processed separately for each cell and thus we need to combine them into one adata object before the filtering and processing step. We use scanpy (version 1.5.1) and scvelo (version 0.2.2) to read and combine the dataset getting an initial size of 1,916 cells and 27,998 genes.

3.1.2 Filtering the dataset

Our intention with this filtering process is to identify and reduce the impact of experimental bias that may have leaked into the dataset. The first step in doing so is removing cells and genes that do not carry enough information, i.e. cells that have fewer than 200 genes expressed and genes that are detected in fewer than 3 cells. The next step is to ensure that no group of cells is overrepresented, as this could bias our analysis. In order to identify such bias, we plot the distribution of total counts for each group (Figure 3-1). This plot reveals that the unstimulated cells have more reads than all of the other groups. In order to account for this we downsample the counts per cell to 500,000 using the method scanpy.pp.downsample_counts().

Figure 3-1: Distribution of total counts for each group of cells in time series dataset. The plot reveals that the unstimulated group has more reads than the others. We subsample the reads per cell to 500,000 to account for this.

Our next step is to check for cell quality. The ratio of mitochondrial gene counts to total gene counts can be used as a proxy for cell quality, as profiles that have a high proportion of mitochondrial counts are indicative of poor quality [12, 11]. This is presumably because cytoplasmic RNA is lost from perforated cells, whereas mitochondria are larger and thus less able to pass through small holes in the cell membrane. Based on the plot in Figure 3-2 left, we chose an upper bound of 0.1 for the percentage of mitochondrial genes to other genes in the cell and removed all cells that had a higher percentage than that. We also removed cells that had more than 7,000 expressed genes, which are more likely to be cell doublets (Figure 3-2 right). This leaves us with 1,683 cells.

Next, we remove all genes where either the spliced or unspliced RNA count is 0, or where the shared spliced + unspliced count is not greater than 10, using the scvelo.pp.filter_genes() method. We normalize the reads to 10,000 per cell, so that expression is comparable among cells, and store the logarithm of the reads so that it more closely resembles a normal distribution. Lastly, we restrict the genes to those that are highly variable across all the cells, as these are likely the genes that are changing in response to LPS.
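The cell- and gene-level quality filters described in this section can be sketched on a raw count matrix as below. This is a NumPy-only illustration rather than the scanpy calls used in the thesis: the function name is hypothetical, the 0.1 threshold is interpreted here as a fraction of total counts (an assumption), and all criteria are evaluated on the unfiltered matrix in one pass, which simplifies the step-by-step order described above.

```python
import numpy as np

def qc_filter(counts, is_mito, min_genes=200, min_cells=3,
              max_genes=7000, max_mito=0.1):
    """Return boolean masks (keep_cells, keep_genes) for a cells x genes count matrix."""
    detected = counts > 0
    genes_per_cell = detected.sum(axis=1)   # number of genes detected in each cell
    cells_per_gene = detected.sum(axis=0)   # number of cells in which each gene is detected
    total = counts.sum(axis=1)
    mito_frac = counts[:, is_mito].sum(axis=1) / np.maximum(total, 1)
    keep_cells = ((genes_per_cell >= min_genes)
                  & (genes_per_cell <= max_genes)
                  & (mito_frac <= max_mito))
    keep_genes = cells_per_gene >= min_cells
    return keep_cells, keep_genes

# Tiny toy matrix (4 cells x 5 genes); thresholds are scaled down to match its size.
counts = np.array([[1, 1, 1, 0, 0],
                   [1, 0, 0, 0, 0],
                   [1, 1, 1, 1, 1],
                   [1, 1, 0, 0, 8]])
is_mito = np.array([False, False, False, False, True])
keep_cells, keep_genes = qc_filter(counts, is_mito, min_genes=2, min_cells=2,
                                   max_genes=4, max_mito=0.5)
```

In the toy example only the first cell passes: the second has too few detected genes, the third too many, and the fourth is dominated by mitochondrial counts.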

3.1.3 Processing the dataset

Figure 3-2: Quality check for cells in time series dataset. Scatter plots of (left) the number of counts against percent mitochondrial genes; (right) the number of counts against the number of genes in each cell. We remove all cells that have greater than 0.1% mitochondrial genes or greater than 7,000 genes.

We correct for biases in the expression data caused by total counts and percentage of mitochondrial genes using the method scanpy.pp.regress_out(). Next, we convert the expression values to z-scores, i.e. each gene has a mean of 0 and a standard deviation of 1 across all cells, and clip values higher than 10 for each gene. To further de-noise our data, we reduce its dimensionality using principal component analysis. We compute and store the top 50 principal components, using scanpy.tl.pca(). We use this reduced representation to compute the 10 nearest neighbours for each cell and cluster cells into 20 unique clusters, using Louvain clustering [5, 15]. Figure 3-3a shows these clusters and the distribution of cells in each group. Figure 3-3b depicts the mean expression of the LPS gene groups identified by [20].
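The scaling, clipping and dimensionality reduction steps can be sketched with NumPy as follows. This is an illustrative stand-in for the scanpy functions named above, not their implementation: the function names are hypothetical, the PCA is computed with a plain singular value decomposition, and the random input matrix only serves to exercise the code.

```python
import numpy as np

def scale_and_clip(x, max_value=10.0):
    """Z-score each gene (column) across cells, then clip values above max_value."""
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0          # avoid division by zero for constant genes
    z = (x - mu) / sd
    return np.minimum(z, max_value)

def pca_scores(z, n_components=50):
    """Project cells onto the top principal components via SVD."""
    centered = z - z.mean(axis=0)
    u, sing, _ = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, sing.size)
    return u[:, :k] * sing[:k]

# Synthetic expression matrix: 100 cells x 20 genes.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 20))
z = scale_and_clip(x)
pcs = pca_scores(z, n_components=5)
```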

Lastly, we remove the cells that have high levels of the cluster disruptive genes Cd83 and SerpinB6b (which cause pathogen-independent maturation among cells) [20]. Cd83 and SerpinB6b are positive markers of cluster disruption, which is a known artifact of the culture process. We remove the clusters of cells that have high expression of these genes (Figure 3-4). We also filter out the unstimulated (0hr) cells at this step, as they represent a different cell state than the cells that are dynamically responding to LPS stimulation.

This results in our final dataset of 1,233 cells and 2,759 genes.

Figure 3-3: Overview of cells in time series dataset. a) Low dimensional representation in UMAP space representing the 20 clusters identified in the dataset, the 0hr, 1hr, 2hr, 4hr, and 6hr cells. b) Average expression of LPS gene groups in the dataset across cells.

3.2 Perturb-Seq: mouse BMDCs with 24 perturbations at 0 and 3 hours after LPS stimulation

For this thesis, we use the Perturb-Seq dataset from [7] for our analysis, focusing on an experiment studying mouse BMDCs at 0 and 3 hours after stimulation with LPS, in the presence of 24 gene knockouts, achieved via 57 guides, including 1 negative control guide. In future work, we plan to expand across more datasets.

Figure 3-4: Cluster disruption. Expression of cluster disruptive genes (Cd83 and SerpinB6b) across the cells represented in the UMAP space.

We performed a series of filtering steps described in the following section, ultimately obtaining a filtered dataset containing 1,877 cells (from the 3hr group) and 2,088 genes. Below we describe how we got the dataset, the filtering and processing steps and the clusters observed in the resulting dataset. The code used to:

∙ prepare the dataset (section 3.2.1) is made available here

∙ filter and process the dataset (sections 3.2.2 - 3.2.4) is made available here

3.2.1 Downloading and preparing the dataset

We downloaded the metadata for the dataset from GEO GSE90063 and obtained the aligned reads from the SRA Run Selector SRX2360554 (for 0hr) and SRX2360553 (for 3hr) in bam format. We used the bamtofastq package to convert the aligned reads to fastq files, which served as input to our analysis. The data processing consisted of two parts. First, we aligned the reads using cellranger count (version 3.0.2) to the mouse genome (mm10) and counted the number of UMIs per gene for each cell. Second, we quantified the number of spliced and unspliced UMIs in each cell using the velocyto package. Finally we load the datasets into Jupyter notebooks for further analysis. The initial dataset consists of 88,890 cells and 27,998 genes.

3.2.2 Filtering the dataset

We filtered out cells that have fewer than 200 genes and genes that are not expressed in at least 3 cells. We also filter out cells that have more than 2,000 genes, to remove doublets (Figure 3-5 right). We used different parameters than for the time series dataset given the different scRNA-Seq technology.

Next, we examined the percentage of mitochondrial gene counts/total gene counts to serve as a proxy for cell quality, as described in Section 3.1.2 (Figure 3-5 left). We chose an upper bound of 0.1 for the percentage of mitochondrial genes to other genes in the cell.

Figure 3-5: Quality check on Perturb-Seq dataset. Scatter plots of (left) the number of counts against percent mitochondrial genes; (right) the number of counts against the number of genes in each cell. We remove all cells that have greater than 0.1% mitochondrial genes or greater than 2,000 genes.

Next we normalize counts such that the total counts in every cell sum to 10,000, to compare expression across cells in our dataset. We also compute raw expression as log(counts+1) towards downweighting the contribution of the highly expressed genes in the dataset and making the expression values closer to a normal distribution. This resulted in 73,771 cells and 8,780 genes in the dataset.
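The normalization and log transform can be sketched as follows. This is a minimal NumPy illustration of scaling each cell to a fixed total and applying log(counts + 1); the function name is hypothetical and the toy counts are illustrative.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(x + 1)."""
    totals = counts.sum(axis=1, keepdims=True)
    scaled = counts / np.maximum(totals, 1) * target_sum
    return np.log1p(scaled)

# Two cells with the same composition but a tenfold difference in sequencing depth.
counts = np.array([[1, 3],
                   [10, 30]])
expr = normalize_log1p(counts)
```

After normalization the two cells have identical expression values, which is the sense in which expression becomes comparable across cells with different depths.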

To focus on genes that likely changed in response to the perturbation and LPS stimulation, we selected highly variable genes with scanpy.pp.highly_variable_genes. We compute the highly variable genes for each time group separately, as well as together, and take the variable genes to be the union of the two groups. These steps lead to a processed dataset with 73,771 cells and 2,088 genes.

3.2.3 Processing the Dataset

We regress out the effect of total counts and percentage of mitochondrial genes from the expression data using the scanpy.pp.regress_out() function. Next, we aim to regress out the batch effect in expression. For this, we aim to account for batch-specific effects without removing biological variation between 0hr and 3hr. In order to achieve this, we fit a linear regression to get the impact of the batches and time point on expression. Consider a simple case where we have 4 batches, two for each time point, i.e. batches 1 and 2 contain the cells from 0hr and batches 3 and 4 contain the cells from 3hr. The observed gene expression is affected by both biological changes and experimental conditions, such as minor changes in handling conditions between the batches. Our goal is to correct for these batch effects in the expression; however, in doing so we can mute the expression variation caused by time. That is, the expected gene expression for the 3hr cells is inherently different from the 0hr cells. We want to ensure that this variation is not lost in the batch correction. To account for this, we fit a linear regression to identify the contribution of each of these variables and subtract out the effect on expression that is explained by the batch while keeping the contribution from time.

Figure 3-6: Linear regression model for batch correction. The model measures the effect of the covariates on the observed expression profile. We subtract out the covariates related to batch and keep the error and hour effect. In this case, our resulting expression would be B1 · (0hr) + B2 · (3hr) + error.
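The four-batch example above can be sketched with an ordinary least squares fit. This is an illustrative reconstruction, not the thesis code: the function name, the choice of one reference batch per time point (needed so that batch and time are not collinear), and the toy data are all assumptions of this sketch.

```python
import numpy as np

def remove_batch_keep_time(expr, hour, batch, reference_batches):
    """Subtract fitted batch effects from expr (cells x genes), keeping time effects.

    reference_batches: one batch per time point whose level serves as the baseline.
    """
    hours = np.unique(hour)
    batches = [b for b in np.unique(batch) if b not in reference_batches]
    time_cols = [(hour == h).astype(float) for h in hours]        # hour effects
    batch_cols = [(batch == b).astype(float) for b in batches]    # batch effects
    design = np.column_stack(time_cols + batch_cols)
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    n_time = len(time_cols)
    batch_part = design[:, n_time:] @ coef[n_time:]
    return expr - batch_part    # hour effect + residual error remain

# One gene, eight cells: 0hr baseline 1, 3hr baseline 5, batch offsets +2 (b2) and -1 (b4).
hour = np.array([0, 0, 0, 0, 3, 3, 3, 3])
batch = np.array(["b1", "b1", "b2", "b2", "b3", "b3", "b4", "b4"])
expr = np.array([1.1, 0.9, 3.1, 2.9, 5.1, 4.9, 4.1, 3.9]).reshape(-1, 1)
corrected = remove_batch_keep_time(expr, hour, batch, reference_batches=("b1", "b3"))
```

After correction, batches from the same time point share the same mean, while the difference between the 3hr and 0hr cells is preserved.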

We convert the batch-corrected expression values to z-scores for each gene separately, such that every gene has a mean of 0 and standard deviation of 1 across cells, and clip values higher than 10. Then we compute and store the top 50 principal components to reduce the dimensionality of the data. We work with this reduced dataset to compute the 10 nearest neighbors for each cell. We then use Louvain clustering to group similar cells into clusters, obtaining 15 clusters. Figure 3-7a shows these Louvain clusters and the distribution of 0hr and 3hr cells in the dataset, whereas Figure 3-7b displays the mean expression across the LPS gene groups [20] in our dataset.

Figure 3-7: Overview of cells in Perturb-Seq dataset. a) Low dimensional representation in UMAP space representing the 15 clusters identified in the dataset, the 0hr cells and 3hr cells. b) Average expression of LPS gene groups in the dataset across cells.

3.2.4 Post-Processing

Further investigation of the dataset reveals that batch D9 has fewer reads (Figure 3-8) than the rest of the batches, so we discarded it in our analysis. Figure 3-8 also shows that we have fewer reads among the 0hr cells, as compared to the 3hr cells. We intend to account for this in future iterations of this work by downsampling the reads among the 3hr cells to match the 0hr cells.

Figure 3-8: Distribution of the total counts per batch in the Perturb-Seq dataset. Batch D9 has significantly fewer counts than the others and is thus removed from further processing.

Filtering out sources of variation unrelated to time

To focus on a homogeneous cell type, we clustered all cells in the dataset using Louvain clustering, resulting in 15 clusters. First, we noticed that clusters 10 and 11 likely represented cluster-disrupted DCs, which are marked by Cd83 and SerpinB6b [20]. Second, clusters 5 and 9 represent cycling cells (Figure 3-9). This led us to filter out clusters 5, 9, 10, and 11, which were identified by the above markers and which we believed to vary along an axis distinct from time. This left us with 54,847 cells.

Figure 3-9: Distribution of cell cycle and cluster disruptive genes in Perturb-Seq dataset. (left to right) Cell clusters in the dataset; we remove clusters that are related to the marker genes shown. Expression levels of genes related to cell cycle, and of the Cd83 and Serpinb6b marker genes used to identify cells that are not dendritic cells.

To focus on a specific subset of related perturbations, we set out to identify those perturbations that had a significant effect on gene expression and which showed concordant effects across different guides targeting the same gene. One major assumption of RNA velocity is that all cells can be ordered along a time axis and that there are no subsets of cells that cannot transition between each other. However, in the case of Perturb-Seq, different perturbations can induce different trajectories, and we should not allow a cell with perturbation x to transition to a cell with perturbation y. Although such inter-perturbation transitions cannot occur, they are likely to be wrongly identified in our dataset, for two reasons: (1) perturbations may have small effects and thus make cells with different perturbations look similar to each other, and (2) two perturbations in the same pathway can have very similar effects. To select against inter-perturbation transitions, we need to ensure that the cells all include guides that either target the same gene, or target different genes with the same outcome when knocked out. We chose the following guides for our analysis, targeting the two genes Stat1 and Stat2: Stat1_1, Stat1_3, Stat2_4, Stat2_2, Stat2_3. This particular group of guides was chosen because they show 1) a strong positive correlation with each other and 2) a low correlation with the negative control guide (Figure 3-10). In accordance with this, we also remove cells that received multiple guides. Lastly, we remove the unstimulated (0hr) cells at this step, as they represent a different cell state than the cells that are dynamically responding to LPS stimulation (3hr). In the end, the dataset we used consisted of 1,877 cells and 2,088 genes.

Figure 3-10: A heat map of the correlation matrix of guides from Perturb-Seq. The matrix shows that the chosen group of guides (in the green box) are strongly correlated with each other.

Chapter 4

Methods

To characterize the benefit of using RNA velocity for temporal gene regulatory inference in Perturb-Seq, we first get a pseudotime ordering of the cells with or without velocity information, and then compare these two classes of approaches in terms of their inferred temporally resolved regulatory matrices. Below, we describe each of these steps in further detail.

4.1 Characterising the various pseudotime ordering methods

There are two general approaches we use for ordering cells. The first is based on inferring a “latent time” from RNA velocity alone, and the second computes a pseudotime estimate for each cell based on a random-walk framework that is described below. The random-walk frameworks depend on a cell-cell transition matrix that outlines the probability that one cell will transition into another cell in the dataset after some time. The various approaches used are outlined below:

∙ Dynamical Pseudotime: gene velocities are computed using the dynamical model from scVelo [3]. These velocities are used to compute a transition matrix.

∙ Stochastic Pseudotime: gene velocities are computed using the stochastic model from scVelo [3]. These velocities are used to compute a transition matrix.

∙ Deterministic Pseudotime: gene velocities are computed using the deterministic model from scVelo [3]. These velocities are used to compute a transition matrix.

∙ Similarity Pseudotime: we measure the similarity between each pair of cells and compute the transition probabilities based on how similar the cells are to each other.

∙ DPT Pseudotime: this is the benchmark similarity based pseudotime method implemented in scanpy [9, 21]. We call scanpy.tl.dpt() in order to get the results from this ordering.

∙ Latent time: computes a latent time estimate for each cell based on the cell's estimated position along the gene expression dynamics, which are composed of steady state low expression, expression induction, steady state high expression and expression repression [3]. This is only available for the dynamical velocity model.

4.2 Computing a pseudo-temporal ordering of cells

Given the expression matrix and velocity for each cell, we calculate the probability that one cell will transition to another after some time has passed. This transition probability is calculated for each cell-cell pair in the dataset. Then, we infer an ordering of cells using a Markov process based on these transition probabilities, computing for each cell the probability of being reached at steady state. We then use these estimates to order cells by pseudotime. We begin at each cell with equal probability and follow a random walk to identify the end points, i.e. the cells that are later in the pseudotime ordering. Once we have identified the end points, we can traverse the random walk in reverse, starting at the end points, in order to find the root cells. The pseudotime for each cell is therefore a measure of how far the cell is from the root and end cells in the random walk.
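The end-point and root-cell identification described above can be illustrated on a small transition matrix. The sketch below uses simple power iteration from a uniform starting distribution and a row-normalized transpose for the reversed walk; both choices and the function names are assumptions made for illustration and are not necessarily how scVelo computes these quantities.

```python
import numpy as np

def stationary_distribution(transition, n_steps=500):
    """Propagate a uniform starting distribution through the Markov chain."""
    n = transition.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(n_steps):
        p = p @ transition
    return p / p.sum()

def reverse_walk(transition):
    """Row-normalized transpose, used here as a simple backward transition matrix."""
    rev = transition.T.copy()
    row_sums = rev.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    return rev / row_sums

# Three cells on a line: cell 0 tends to move to cell 1, cell 1 to cell 2, cell 2 stays put.
T = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.1, 0.9],
              [0.0, 0.0, 1.0]])
end_prob = stationary_distribution(T)
root_prob = stationary_distribution(reverse_walk(T))
end_cell = int(np.argmax(end_prob))
root_cell = int(np.argmax(root_prob))
```

In this toy chain the forward walk concentrates on the last cell (the end point) and the reversed walk concentrates on the first cell (the root).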

Note that we assume that this Markov process and associated transition matrix specifically capture temporal progression. However, the transition matrix could also be guided by other variation that is independent from time (see Chapter 5 for a comparison between velocity and a time course). This is an assumption that should be further investigated and tested experimentally.

4.3 Computing the transition matrix

Cell-cell transition probabilities are captured in the transition matrix. We obtain the transition matrix by first computing a connectivity graph (either based on RNA velocity, resulting in a velocity graph, or based on cell-cell similarity), and then applying an exponential kernel to obtain a valid transition matrix.

$$
\frac{1}{\sum_{j} \exp\left(\frac{\theta_{ij}}{\sigma_i^2}\right)} \cdot \exp\left(\frac{\theta_{ij}}{\sigma_i^2}\right) \tag{4.1}
$$

where θ_ij measures the concordance between the velocity of cell i and the direction towards cell j (when using velocity), or the similarity between cells i and j (when not using velocity); σ_i is the kernel width, which is adjusted for each cell across neighboring cells. For all of our approaches, we further subset the transition matrix to only contain the top 200 nearest neighbors, and set the rest of the entries to 0.
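A minimal NumPy sketch of Equation (4.1) with the neighbor restriction is given below. The function name and the boolean neighbor mask are hypothetical; the sketch also assumes that rows are renormalized after the non-neighbor entries are zeroed, which the text above does not specify.

```python
import numpy as np

def transition_matrix(theta, sigma, neighbors):
    """Exponential kernel of Equation (4.1), restricted to allowed neighbors.

    theta:     n x n concordance (or similarity) scores
    sigma:     length-n kernel width per cell
    neighbors: n x n boolean mask of allowed transitions (at least one per row)
    """
    weights = np.exp(theta / sigma[:, None] ** 2)
    weights = np.where(neighbors, weights, 0.0)            # zero out non-neighbors
    return weights / weights.sum(axis=1, keepdims=True)    # each row sums to 1

# Toy example with three cells; self-transitions are disallowed by the mask.
theta = np.array([[0.0, 1.0, -1.0],
                  [1.0, 0.0, 0.5],
                  [-1.0, 0.5, 0.0]])
sigma = np.ones(3)
neighbors = ~np.eye(3, dtype=bool)
T = transition_matrix(theta, sigma, neighbors)
```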

4.3.1 For velocity informed models

We obtain for each cell x_i a velocity vector v_i(t). There are three ways of computing RNA velocity (chapter 2), and here we compare the outputs from all three velocity models. Based on insights from [3] we expected the dynamical and stochastic models to perform better than the deterministic model, as they are better able to capture dynamical information from the cells.

Next we compute the velocity graph that captures how likely one cell, x_i, is to transition into another cell, x_j, for each pair of cells. The transition, Pr(x_i, x_j), is defined as the cosine similarity between the expected cell state x_i' (cell x_i plus the velocity vector at cell x_i) and x_j. This depends on the angle θ between x_i' and x_j in Figure 4-1, with smaller angles between velocity and cell-cell directionality indicating more likely transitions.

Figure 4-1: Computing the velocity graph, which forms the basis of the transition matrix. Depicted are cells x_i, x_j. We compute a measure of the concordance between the velocity at x_i and the difference between x_i and x_j. If the velocity at x_i points to a transition towards cell x_j, then x_j − x_i and v_i(t) will have similar directions. We compute this concordance via a cosine similarity.

More specifically, we get the expected expression vector x_i' by considering how cell x_i will change after time t has passed, guided by RNA velocity. Specifically, the expression after time t is x_i' = x_i + ... (Figure 4-2). This represents the expected gene expression for each cell and each gene at time t.

Note that the resulting transition matrix is not symmetric.
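The concordance measure described in the caption of Figure 4-1 can be sketched as the cosine similarity between each cell's velocity and the displacement to every other cell. The function name is hypothetical, and the convention of assigning 0 to self-pairs and to cells with zero velocity is an assumption of this sketch.

```python
import numpy as np

def velocity_graph(x, v):
    """Cosine similarity between v_i and (x_j - x_i) for every ordered pair of cells.

    x: cells x genes expression matrix; v: cells x genes velocity matrix.
    """
    n = x.shape[0]
    graph = np.zeros((n, n))
    for i in range(n):
        dx = x - x[i]                                    # displacement from cell i to every cell
        norms = np.linalg.norm(dx, axis=1) * np.linalg.norm(v[i])
        valid = norms > 0                                # skip self-pairs and zero velocities
        graph[i, valid] = dx[valid] @ v[i] / norms[valid]
    return graph

# Three cells in a two-gene space.
x = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
v = np.array([[1.0, 0.0],    # cell 0 moves towards cell 1
              [0.0, 0.0],    # cell 1 is not moving
              [0.0, -1.0]])  # cell 2 moves towards cell 0
graph = velocity_graph(x, v)
```

The toy graph is not symmetric: cell 2 points directly at cell 0, whereas the velocity of cell 0 is orthogonal to the direction of cell 2.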

Figure 4-2: Estimating the expected gene expression after some time. A linear combination of the observed gene expression matrix combined with the cell velocities gives us the expected gene expression state at time t.

4.3.2 For cell-cell similarity based methods

To estimate the transition matrix using cell-cell similarity, we calculate the cosine similarity between each pair of cells. In order to enhance the impact of highly variable genes, we compute this similarity with the first 50 principal components. To be consistent with the other approaches, we post-process the transition matrix in a similar way, and thus only allow transitioning to the nearest n neighbours, and set the rest to 0. Once we have computed the transition matrix, we use the same methods in scVelo to compute the root and end cells, as well as the pseudotime arrangement of cells.
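The similarity computation and neighbor restriction can be sketched as follows. The function names are hypothetical, and selecting the nearest neighbours by highest cosine similarity (rather than, for example, by Euclidean distance) is an assumption of this sketch.

```python
import numpy as np

def cosine_similarity_matrix(pcs):
    """Pairwise cosine similarity between cells (rows) in PC space."""
    norms = np.linalg.norm(pcs, axis=1, keepdims=True)
    unit = pcs / np.maximum(norms, 1e-12)
    return unit @ unit.T

def keep_top_neighbors(sim, n_neighbors):
    """Zero all entries except the n most similar other cells in each row."""
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)                 # never pick the cell itself
    order = np.argsort(-masked, axis=1)[:, :n_neighbors]
    keep = np.zeros_like(sim, dtype=bool)
    np.put_along_axis(keep, order, True, axis=1)
    return np.where(keep, sim, 0.0)

# Four cells described by two principal components.
pcs = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [-1.0, 0.0]])
sim = cosine_similarity_matrix(pcs)
sparse_sim = keep_top_neighbors(sim, n_neighbors=1)
```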

4.4 Software Packages

For this work we worked with: scVelo 0.2.2, scanpy 1.5.1, anndata 0.7, umap 0.4.4, numpy 1.19.0, scipy 1.5.0, pandas 1.0.5, scikit-learn 0.23.1, statsmodels 0.11.1, python-igraph 0.8.2, louvain 0.7.0, and python 3.8.3.

Chapter 5

Analysing the time-series dataset

5.1 Goal

In this chapter, we aim to test the accuracy and reliability of the various RNA velocity models in providing a temporal ordering of cells in a time course dataset after addition of an external stimulus. The time series dataset from [20] contains cells that were sequenced in roughly 1 hour intervals following LPS stimulation thereby providing a realistic ground truth reference to validate the pseudotime ordering we get from the various models.

In order to achieve this goal, we sort the pseudotime ordered cells into four groups and compare the cells in each group to the corresponding ground truth group of cells. We do this for each velocity and similarity based ordering of cells to understand the additive effect of velocity in discerning a temporal ordering.
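One way to carry out such a comparison is sketched below: cells are sorted by pseudotime, split into consecutive groups, and the resulting labels are compared with the ground truth. The function names are hypothetical, and choosing group sizes equal to the ground truth group sizes and scoring by simple label agreement are assumptions of this sketch rather than the procedure used in the thesis.

```python
import numpy as np

def bin_by_pseudotime(pseudotime, group_sizes):
    """Assign cells to consecutive groups after sorting by pseudotime."""
    order = np.argsort(pseudotime, kind="stable")
    labels = np.empty(len(pseudotime), dtype=int)
    edges = np.cumsum(group_sizes)
    ranks = np.arange(len(pseudotime))
    labels[order] = np.searchsorted(edges, ranks, side="right")
    return labels

def agreement(pred, truth):
    """Fraction of cells whose pseudotime group matches the ground truth group."""
    return float(np.mean(pred == truth))

# Six cells from three time groups (0, 1, 2) with illustrative pseudotime values.
pseudotime = np.array([0.1, 0.9, 0.4, 0.2, 0.8, 0.5])
truth = np.array([0, 2, 1, 0, 2, 1])
pred = bin_by_pseudotime(pseudotime, group_sizes=[2, 2, 2])
score = agreement(pred, truth)
```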

We find that the velocity based models are better able to capture the underlying temporal trend and order cells in accordance with the time following LPS stimulation as compared to the similarity based models. The pseudotime estimates using the dynamical model for computing velocity perform best across the three velocity models. We also find that the genes correlated with pseudotime are aligned with the core antiviral (Id) and sustained inflammatory (IIId) LPS gene groups from [20].

5.2 Data and code

We use the filtered and processed time series dataset from [20] as described in Section 3.1. The code reproducing the work can be found:

∙ Evaluation of distance between neighbours (section 5.3.1)

∙ Pseudotime computation based on the various models (section 5.3)

∙ Comparison and analysis of the results (section 5.3.2-5.3.3)

5.3 Results and Analysis

5.3.1 Picking the number of neighbours for the transition matrix

A subset of the methods compared here use random walks to assign cells a position along pseudotime. The key idea is to use either cell-cell similarity or RNA velocity to direct random walks along a graph of cells, such that a given cell is more likely to transition to cells similar to it. The steady state distribution of the random walk across cells then defines so-called end points, with a reversed random walk procedure starting at these end points being used to identify starting point cells, or root cells. In our analysis, we checked whether subsetting the transition matrix used for the random walk to only the closest n neighbors for each cell would result in qualitatively different pseudotime estimates, and we found this to be the case. Specifically, low numbers of neighbors result in disconnected pseudotime estimates for each cluster of cells in the UMAP representation, whereas larger numbers of neighbors lead to a reversed pseudotime ordering (the cells with low pseudotime values become the 4/6hr cells). To determine the number of neighbors to use, we considered the distances between each cell and its nth closest neighbour to identify whether there exists a sharp change between a cell's nth and (n+1)th neighbour (Figure 5-1a). We found that for most cells there is a sharp increase in the distance between the 1st neighbour and the 5th neighbour followed by a plateau, suggesting that considering the closest 5 neighbors may contain most of the information about cell-cell similarity.
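The distance-to-nth-neighbour curve examined here can be computed along the following lines. This is a brute-force NumPy sketch suitable for small inputs; the function name is hypothetical and the three toy cells are illustrative only.

```python
import numpy as np

def kth_neighbor_distances(pcs):
    """Return an n x (n-1) array whose column k-1 holds each cell's Euclidean
    distance to its k-th closest neighbour in PC space."""
    diff = pcs[:, None, :] - pcs[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.sort(dist, axis=1)[:, 1:]    # drop the zero self-distance

# Three cells on a line in a two-dimensional PC space.
pcs = np.array([[0.0, 0.0],
                [1.0, 0.0],
                [3.0, 0.0]])
d = kth_neighbor_distances(pcs)
mean_curve = d.mean(axis=0)    # average distance to the 1st, 2nd, ... neighbour
jumps = np.diff(mean_curve)    # large values indicate a sharp change between n and n + 1
```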

We compare the effect on the resulting pseudotime ordering of using 5, 30, 100, 200, 500, and all (no optimization on the allowed transitions) neighbours (Figure 5-1b). We performed this analysis only for the dynamical model for estimating velocity, as it had the most promising ordering, as we will see in the next section. Based on these representations, we decided to proceed with 200 neighbors when comparing the results of these models against the ground truth. Future work is needed to determine an orthogonal way to select the number of neighbors to use in this type of analysis.

Figure 5-1: Comparing the effect of restricting cell-cell transitions to the nearest n neighbors. a) Euclidean distances across the top 50 PCs between each cell and its nth closest neighbour: (left) shows the plot containing all of the cells; (right) focuses on the nearest 50 neighbours. b) Summarizes the effect on pseudotime estimates when varying the transitions allowed for each cell to its n nearest neighbours.

5.3.2 Comparing the pseudotime estimates with the ground truth

In this chapter, we benchmark the extent to which pseudotime methods and RNA velocity are able to recover the known ground truth temporal ordering of cells collected in a time course. Specifically, we consider cells sequenced following LPS stimulation at timepoints 1h, 2h, 4h and 6h [20]. Based on this hourly information, [20] have identified genes that co-vary in response to LPS stimulation, which were grouped into 9 gene groups, including ones capturing core antiviral response (Id), maturity (IIIb), peaked (IIIc) and sustained inflammatory responses (IIId). Figure 5-2a highlights the mean expression for these groups of genes as well as the distribution of cells across the 4 time groups. For example, during the response to LPS there is increased expression of the core antiviral and sustained inflammatory gene sets, while the peaked inflammatory genes first increase and then decrease in expression. Finally, the maturity genes are at their highest towards the end of the time course. In general, most changes in expression occur by the 4h time point, with small differences between 4h and 6h.

Given a set of cells with temporal ground truth labels, we proceeded to ask which of the models for estimating pseudotime is best at capturing the known progression of cells across time. We observed a strong agreement between the dynamical, stochastic and deterministic models in the pseudotime ordering (Figure 5-2b). This alignment is largely concordant with the ground truth timestamp of cells.

The dynamical model for computing velocity also computes the latent time (representing the cell’s internal “clock”) based on underlying cellular processes. When applied to differentiation datasets such as pancreatic endocrinogenesis, Bergen et al. [3] show that latent time is the most promising measure for approximating the real time experienced by cells. However, the latent time estimates for the time series dataset considered here revealed less separable pseudotime values between distinct timepoints, compared to velocity pseudotime estimates (Figure 5-2b). Latent time still captures the overall trend that cells in the 1/2hr groups are earlier in their LPS response than cells in the 4/6hr group.

The two similarity based pseudotime approaches have varying outcomes. The trajectory inference technique described in Section 2.2 is able to capture the overall cell dynamics to a similar degree as the models informed by RNA velocity. However, it is unclear whether these models are driven by the temporal changes in gene expression, or whether they are highlighting changes in densities of cells across timepoints. This is because there is no fixed start and end point provided, but rather these are estimated from the steady state distribution of the random walks biased by cell-cell similarity. In this time series dataset, the 4hr and 6hr cells group together, creating a densely populated cluster which could be the driver for the pseudotime estimates for the similarity-based methods. To check the effect of density on the performance of similarity-based methods, we plan in future work to subsample the cells in the 4/6hr group to control for variation in density across the groups. The similarity based model described in Section 4.2 performs worse than the velocity informed models, confirming our initial hypothesis that velocity provides additional information towards discerning the temporal ordering of cells.

Overall, we find that the velocity based methods are better able to identify a temporal ordering of cells that is well aligned with the ground truth. The pseudotime estimates informed by the dynamical model for computing velocity have the best concordance with the underlying ground truth.

5.3.3 Exploring the genes that are strongly correlated with pseudotime estimates for each method

Given the pseudotime estimates, we asked which genes are driving the ordering of cells by identifying the genes that are most highly correlated with these estimates (Figure 5-3a). We found that there are parallels between the LPS gene groups identified by [20] and the genes that are strongly correlated with pseudotime (especially in the velocity informed models). We verify this by studying the correlation between each pseudotime estimate and the aggregated score from each LPS gene group across single cells (Figure 5-3b). We find that genes from the core antiviral (Id) and sustained inflammatory (IIId) groups are positively correlated with the velocity based pseudotime estimates, whereas IIa group genes are negatively correlated. Using more sophisticated approaches to detect genes with non-monotonic patterns and their relation to the pseudotime estimates is left for future work.

Figure 5-2: Comparison of the various pseudo-temporal orderings against ground truth. a) Low dimensional representation of cells in UMAP space showing ground truths in the data: (top) 1hr, 2hr, 4hr and 6hr cells, (bottom) expression scores for LPS gene groups. b) Pseudotime estimates based on each of the velocity/similarity based models. c) Swarm plots depicting the distribution of pseudotime estimates grouped by ground truth. d) Velocity confidence measures for each cell based on concordance of velocity with neighbours.

Figure 5-3: Exploring genes that are correlated with pseudotime for the time series dataset. a) Clustermap of the genes that are highly correlated with at least one of the pseudotime estimates. We annotate each gene with the LPS gene group it belongs to (if any) from [20]. b) Heatmap of the correlation between pseudotime estimates and gene expression grouped by the LPS gene groups from [20].

Chapter 6

Analysing the Perturb-Seq dataset

6.1 Goal

In this section we study the impact of RNA velocity in determining a temporal ordering of cells, hypothesizing that such an ordering would allow us to distinguish genes that respond early, as direct effects of a perturbation, from indirect effects that occur later. Such ordering is typically done with trajectory inference methods that compute a so-called pseudotime based on cell-cell similarity, without dynamic information such as RNA velocity. Here, we ask how much additional information the RNA velocity vector contributes to the pseudotime ordering of cells as compared to cell-cell similarity. To achieve this goal, we compare the pseudotime ordering of cells obtained using RNA velocity on the one hand and cell-cell similarity on the other hand.

Our second goal is to identify whether a pseudotime ordering based on RNA velocity enables us to identify temporal responses to a perturbation. Previous analysis of Perturb-Seq studied the overall consequence of the perturbation on each gene; here we aim to investigate whether RNA velocity can be useful in discerning temporal responses. To investigate our hypothesis, we first partition the cells into three equal sized bins (“early”, “mid”, “late”) based on the pseudotime order we computed using RNA velocity. Then we fit a linear regression model to each group of cells separately to identify the genes that are induced and repressed within each pseudotime bin. Finally, we compare the two approaches to identify common and distinct gene regulatory inferences between them.

We find that RNA velocity helps capture subtle changes in expression, enabling us to provide an arrangement of cells within the same time group (3h). However, the movement direction captured by these models is the reverse of what we expect, i.e. the directionality goes from mature dendritic cells towards those expressing peaked inflammatory genes.

6.2 Data and Code

We work with the Perturb-Seq dendritic cell dataset from [7] that has been filtered and processed as described in Section 3.2. For this analysis, we strive to eliminate known sources of variation beyond time, such as cell type, cell cycle and perturbation. Therefore, we restrict the analysis to a subset of the dataset and filter cells to remove known sources of heterogeneity unrelated to time (Section 3.2.4). In future work, we plan to extend our analysis beyond this subset of cells and perturbations to a more generalized framework. The code reproduces the following analyses:

∙ Finding the number of neighbours to use (Section 6.3.1)

∙ Pseudotime computation based on the various models (Section 6.3)

∙ Comparison and analysis of the results (Sections 6.3.2-6.3.4)

6.3 Results and Analysis

6.3.1 Picking the number of neighbours for the transition matrix

As before, we examined the degree to which restricting the transition matrices for the random walks to a subset of nearest neighbors affected the results. We plotted the distance between each cell and its neighbours in order to determine an intuitive cutoff for the number of neighbours a cell is allowed to transition to. Based on the graphs in Figure 6-1, the distance increases sharply between the 30th and 50th neighbour, suggesting a cutoff in that range. However, we found the most biologically consistent results using a higher number of neighbors (200), and this is what we used in our analysis.
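The neighbour-distance curve described above can be sketched as follows. This is a minimal illustration using NumPy only, not the code used for the thesis: the random matrix `pcs` is a hypothetical stand-in for the top 50 PCs of the real dataset, and the function name `kth_neighbour_distances` is introduced here for illustration.

```python
import numpy as np


def kth_neighbour_distances(pcs, max_k):
    """Distance from every cell to its 1st..max_k-th nearest neighbour.

    pcs: (n_cells, n_pcs) array of principal component coordinates.
    Returns an (n_cells, max_k) array sorted by increasing distance.
    """
    # Dense pairwise Euclidean distances; fine for a small illustration,
    # a nearest-neighbour index would be needed for large datasets.
    diff = pcs[:, None, :] - pcs[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    dist.sort(axis=1)
    return dist[:, :max_k]


# Hypothetical stand-in for the top 50 PCs of the real dataset.
rng = np.random.default_rng(0)
pcs = rng.normal(size=(200, 50))

knn_dist = kth_neighbour_distances(pcs, max_k=100)
mean_curve = knn_dist.mean(axis=0)  # average distance to the k-th neighbour
step = np.diff(mean_curve)          # increase from neighbour k to k+1
```

Plotting `mean_curve` (or the per-cell rows of `knn_dist`) against k gives a curve of the kind used to look for a sharp increase in distance; large values in `step` mark candidate cutoffs.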

Figure 6-1: Euclidean distance across the top 50 PCs between a cell and its nth nearest neighbor. (left) shows the distance between all possible neighbours, (right) focuses on the 400 nearest neighbours.

6.3.2 Comparing the pseudotime estimates among each of the velocity models

In this section, we examine the difference in the pseudotime estimates for each cell in our dataset based on the various velocity models and similarity based approaches. Our goal is to analyse the impact of velocity in determining a temporal ordering of cells compared to similarity based methods. We evaluate the expression of gene groups (namely the peaked inflammatory and maturity genes) that have been shown to be related to the LPS response of dendritic cells [20] (Figure 6-2a). We find that the peaked inflammatory genes are highly expressed among the left subset of 3hr cells, whereas the maturity genes are expressed in the right subset of 3hr cells as represented with UMAP. Based on these expression patterns, we expect the temporal arrangement of cells to be left to right.

We compute the pseudotime estimate for each model and depict the distribution of these estimates across the low dimensional representation of these cells (Figure 6-2b). We find that all three velocity models suggest a direction of movement (right to left) that is the reverse of what we expect based on the expression of peaked inflammatory and maturity gene groups. This is inconsistent with what we find when applying RNA velocity to determine a pseudotime ordering in the time series dataset. There are two notable differences between the two datasets that could have led to the inconsistent results: (1) the time series dataset has 100x more reads per cell than the Perturb-Seq dataset; (2) Perturb-Seq contains perturbations in addition to LPS stimulation that could confound the velocity models. In future work, we aim to test if subsampling the reads in the time series dataset and/or working exclusively with unperturbed cells from Perturb-Seq can explain the reversal of direction.

Close examination of the pseudotime estimates for the three velocity models reveals that the dynamical model is able to identify three distinct groups of cells that would be closely aligned to the expression of peaked inflammatory and maturity genes (Figure 6-2b). The stochastic and deterministic pseudotime estimates are less clear at distinguishing cells with high expression of peaked inflammatory genes. We conclude that the dynamical model is better able to capture the cell dynamics; however, as noted above, we plan to do more analysis to understand the direction of movement.

The similarity based pseudotime estimates, trajectory inference and similarity, capture fewer (two) distinct temporally ordered groups of cells as compared to the velocity based models (Figure 6-2b). However, we find that trajectory inference is able to capture the expected direction of movement (left to right). This is likely due to the fact that trajectory inference requires a single user defined root cell and projects the trajectory in reference to the root. For our analysis we chose one of the cells in cluster 3 (high expression of peaked inflammatory genes) to be the root cell.

The latent time model provides the least distinctly ordered groups of cells compared to the others. However, the latent time model is still under development [3]. While latent time was computed in an unsupervised way in earlier versions, in more recent versions it includes the root cells identified by the dynamical model in its estimation. Thus, we are careful about drawing firm conclusions with regard to this model.

We can calculate the model’s confidence in the velocity estimate for a particular cell based on how the velocity vector of that cell correlates with the velocities of its neighbouring cells. We use the scvelo.tl.velocity_confidence() function to compute these confidence values for each of the velocity models (Figure 6-2c). The dynamical model has the most cells with high-confidence velocities, with the exception of the matured cells (bottom right cluster). The stochastic model has similarly high velocity confidence values, suggesting that these two models produce more consistent estimates for similar cells than the deterministic model.
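To make the idea of neighbour-based confidence concrete, the sketch below computes, for each cell, the mean Pearson correlation between its velocity vector and those of its neighbours. This is a simplified illustration of the concept and should not be read as scVelo's actual implementation; the function name `neighbour_velocity_agreement` and the toy inputs are assumptions introduced here.

```python
import numpy as np


def neighbour_velocity_agreement(velocity, neighbours):
    """Mean correlation between each cell's velocity and its neighbours'.

    velocity:   (n_cells, n_genes) array of velocity vectors.
    neighbours: (n_cells, k) integer array of neighbour indices.
    Returns an (n_cells,) array with values in [-1, 1].
    """
    # Centre and scale each cell's vector so that a dot product between
    # two cells equals the Pearson correlation of their velocities.
    v = velocity - velocity.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-constant vectors at zero
    v = v / norms
    # v[neighbours] has shape (n_cells, k, n_genes).
    corr = np.einsum("ig,ikg->ik", v, v[neighbours])
    return corr.mean(axis=1)


# Toy example: cells 0 and 1 move in the same direction,
# cells 2 and 3 move in opposite directions.
velocity = np.array([[1.0, 2.0, 3.0],
                     [2.0, 4.0, 6.0],
                     [1.0, 2.0, 3.0],
                     [3.0, 2.0, 1.0]])
neighbours = np.array([[1], [0], [3], [2]])
confidence = neighbour_velocity_agreement(velocity, neighbours)
```

In this toy example the first pair of cells receives a value of 1 and the second pair a value of -1, mirroring the intuition that cells whose velocities agree with their neighbourhood are assigned high confidence.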

6.3.3 Exploring the genes that are strongly correlated with pseudotime

Given a pseudotime based arrangement of cells for each of the models, our next goal is to understand the genes that are driving these pseudotime estimates. We compute the Spearman correlation coefficient between each of the pseudotime estimates and the expression of each gene. In Figure 6-3a, we subset to the genes that are correlated with at least one of the pseudotime estimates (determined by a threshold of ±0.4). This reveals that all the pseudotime estimates, except for trajectory inference, are positively correlated with expression of genes from the core antiviral (Id), peaked inflammatory (IIIc) and sustained inflammatory (IIId) gene groups, as established by [20], and negatively correlated with maturity (IIIb) genes. We also observe that there are very few genes that are strongly correlated with pseudotime (≤ 20 genes with a correlation beyond ±0.6), making these models very sensitive to small changes in how we subset the dataset. Further work will test the robustness of conclusions derived from RNA velocity for Perturb-Seq.
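The gene selection step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the thesis code: the helper names, the two placeholder pseudotime vectors (`method_a`, `method_b`) and the three synthetic genes are all hypothetical. Ranks are averaged over ties, which matters for expression data with many identical values.

```python
import numpy as np


def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return (rank_sums / counts)[inverse]


def spearman_with_pseudotime(expr, pseudotime):
    """Spearman correlation of every gene (column of expr) with pseudotime."""
    rt = average_ranks(pseudotime)
    rt = rt - rt.mean()
    rho = np.zeros(expr.shape[1])
    for g in range(expr.shape[1]):
        rg = average_ranks(expr[:, g])
        rg = rg - rg.mean()
        denom = np.sqrt((rt @ rt) * (rg @ rg))
        # Genes with constant expression get a correlation of zero.
        rho[g] = (rt @ rg) / denom if denom > 0 else 0.0
    return rho


# Synthetic example: one increasing, one decreasing, one constant gene.
t = np.linspace(0.0, 1.0, 50)
pseudotimes = {"method_a": t, "method_b": 1.0 - t}
expr = np.column_stack([np.exp(t), -(t ** 3), np.ones_like(t)])

rho = np.vstack([spearman_with_pseudotime(expr, pt)
                 for pt in pseudotimes.values()])
# Keep genes correlated beyond the threshold with at least one estimate.
keep = (np.abs(rho) >= 0.4).any(axis=0)
```

Here the monotone genes are retained and the constant gene is dropped; raising the threshold to 0.6 in the last line would correspond to the stricter cutoff mentioned in the text.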

6.3.4 Comparing the betas obtained from pseudotime-ordered groups of cells with the betas from previous analysis of Perturb-Seq

Previous analysis of Perturb-Seq identified the overall consequence of a perturbation on each gene by studying changes in expression between the perturbed and control groups. However, such an approach does not identify temporal effects in response to the perturbation/stimulation. For instance, in the time series dataset we find that there is a distinct correlation between the expression of LPS gene groups, such as core antiviral (Id) and sustained inflammatory (IIId), and the hours since LPS stimulation. In this section we compare the results of prior analysis of Perturb-Seq against one informed by RNA velocity.

To identify the temporal effect of a perturbation on each gene, we begin by binning the cells into three equal sized groups (“early”, “mid”, “late”) based on their pseudotime estimate. We use the pseudotime estimate from the dynamical model for our binning, as we showed it to be the most consistent with our expectation (Section 6.3.2). Once we have binned the cells into the pseudotime ordered groups, we fit a linear regression to each group in order to identify the effect of the perturbation/stimulation on each gene in each pseudo-temporal bin. This allows us to measure the response of a gene at three distinct timepoints and compare that to the overall effect (with all cells) on the gene as per previously established analysis of Perturb-Seq.
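The binning and per-bin regression can be sketched as follows. This is a simplified illustration and not the thesis pipeline: it assumes a plain ordinary least squares fit with an intercept and a single binary perturbation indicator, whereas the regression used in prior Perturb-Seq analysis may include regularization and additional covariates. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def pseudotime_bins(pseudotime, n_bins=3):
    """Assign each cell to one of n_bins equal-sized bins by pseudotime rank."""
    order = np.argsort(pseudotime, kind="mergesort")
    bins = np.empty(len(pseudotime), dtype=int)
    # np.array_split yields chunks whose sizes differ by at most one cell.
    for b, idx in enumerate(np.array_split(order, n_bins)):
        bins[idx] = b
    return bins


def perturbation_betas(expr, perturbed):
    """Least-squares slope of each gene on a binary perturbation indicator."""
    X = np.column_stack([np.ones(len(perturbed)), perturbed.astype(float)])
    coef, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return coef[1]  # row 0 is the intercept, row 1 the perturbation effect


# Synthetic example: gene 0 responds to the perturbation only in early cells,
# gene 1 does not respond at all.
n_cells = 90
pseudotime = np.linspace(0.0, 1.0, n_cells)
perturbed = np.arange(n_cells) % 2
bins = pseudotime_bins(pseudotime)  # 0 = early, 1 = mid, 2 = late

expr = np.ones((n_cells, 2))
expr[:, 0] = 2.0 * perturbed * (bins == 0)

betas_by_bin = np.vstack([
    perturbation_betas(expr[bins == b], perturbed[bins == b])
    for b in range(3)
])
betas_overall = perturbation_betas(expr, perturbed)
```

In this synthetic case the early-bin coefficient for gene 0 is 2, while the coefficient fitted on all cells is only about 0.67, which illustrates how an effect confined to one pseudotime bin is diluted when all cells are pooled.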

We find that for most genes there is a distinct increase (or decrease) in the expression of the gene in response to the perturbation/stimulation that is seen in one time bin, but the effect is muted when we consider all of the cells. This confirms our initial hypothesis that RNA velocity can help to identify and characterise such temporal effects in response to a stimulus. In future work, we aim to further investigate these genes that display temporal response patterns.

6.4 Conclusion and Future work

In conclusion, RNA velocity is able to provide additional information that cannot be captured by similarity based techniques alone. However, both approaches to estimating pseudotime (velocity informed and cell-cell similarity) are driven by relatively few genes (≤ 20 genes with a Spearman correlation beyond ±0.6). These models are therefore very sensitive to minor changes in the gene or cell subset, leading us to be cautious in their interpretation. In future iterations, we hope to:

1. Explore why the pseudotime estimates (for both velocity and cell-cell similarity based approaches) are reversed.

2. Further investigate the genes that are driving velocity and pseudotime.

3. Extend the study to include other guide groups from [7].

4. Examine the latent time estimates and why they deviate from our expectation.

Figure 6-2: Comparison of the various pseudo-temporal orderings against expectation based on LPS gene groups. a) Low dimensional representation of cells in UMAP space showing: (left) the clusters found in the data, (middle, right) ground truths in the data based on expression scores for LPS gene groups. b) Pseudotime estimates based on each of the velocity/similarity based models. c) Velocity confidence measures for each cell based on concordance of velocity with neighbours.

Figure 6-3: Exploring genes that are correlated with pseudotime for Perturb-Seq. a) Clustermap of the genes that are highly correlated with at least one of the pseudotime estimates. We annotate each gene with the LPS gene group it belongs to (if any). b) Heatmap of the correlation between pseudotime estimates and gene expression grouped by the LPS gene groups from [20].

Figure 6-4: Evaluating effect of a perturbation on each gene in each pseudo-temporal bin. (left) Clustermap representing the effect of the perturbation on each gene, where the cells are sorted into 3 groups based on the pseudotime values from the dynamical model. (right) Heatmap of the overall effect of the perturbation on each gene, akin to prior analysis methods of Perturb-Seq. (a-e) show the graphs for each of the five guides we use for our analysis.

Bibliography

[1] Britt Adamson, Thomas M Norman, Marco Jost, Min Y Cho, James K Nuñez, Yuwen Chen, Jacqueline E Villalta, Luke A Gilbert, Max A Horlbeck, Marco Y Hein, et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell, 167(7):1867–1882, 2016.

[2] Aimée Bastidas-Ponce, Sophie Tritschler, Leander Dony, Katharina Scheibner, Marta Tarquis-Medina, Ciro Salinno, Silvia Schirge, Ingo Burtscher, Anika Böttcher, Fabian J Theis, et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development, 146(12), 2019.

[3] Volker Bergen, Marius Lange, Stefan Peidli, F Alexander Wolf, and Fabian J Theis. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol., August 2020.

[4] Alice H Berger, Angela N Brooks, Xiaoyun Wu, Yashaswi Shrestha, Candace Chouinard, Federica Piccioni, Mukta Bagul, Atanas Kamburov, Marcin Imielinski, Larson Hogstrom, Cong Zhu, Xiaoping Yang, Sasha Pantel, Ryo Sakai, Jacqueline Watson, Nathan Kaplan, Joshua D Campbell, Shantanu Singh, David E Root, Rajiv Narayan, Ted Natoli, David L Lahr, Itay Tirosh, Pablo Tamayo, Gad Getz, Bang Wong, John Doench, Aravind Subramanian, Todd R Golub, Matthew Meyerson, and Jesse S Boehm. High-throughput phenotyping of lung cancer somatic mutations. Cancer Cell, 32(6):884, December 2017.

[5] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[6] Le Cong, F Ann Ran, David Cox, Shuailiang Lin, Robert Barretto, Naomi Habib, Patrick D Hsu, Xuebing Wu, Wenyan Jiang, Luciano A Marraffini, and Feng Zhang. Multiplex genome engineering using CRISPR/Cas systems. Science, 339(6121):819–823, February 2013.

[7] Atray Dixit, Oren Parnas, Biyu Li, Jenny Chen, Charles P Fulco, Livnat Jerby-Arnon, Nemanja D Marjanovic, Danielle Dionne, Tyler Burks, Raktima Raychowdhury, Britt Adamson, Thomas M Norman, Eric S Lander, Jonathan S Weissman, Nir Friedman, and Aviv Regev. Perturb-Seq: Dissecting molecular circuits with scalable Single-Cell RNA profiling of pooled genetic screens. Cell, 167(7):1853–1866.e17, December 2016.

[8] Alexander Dobin, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, January 2013.

[9] Laleh Haghverdi, Maren Büttner, F Alexander Wolf, Florian Buettner, and Fabian J Theis. Diffusion pseudotime robustly reconstructs lineage branching. Nature Methods, 13(10):845, 2016.

[10] Hannah Hochgerner, Amit Zeisel, Peter Lönnerberg, and Sten Linnarsson. Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nature Neuroscience, 21(2):290–299, 2018.

[11] Tomislav Ilicic, Jong Kyoung Kim, Aleksandra A Kolodziejczyk, Frederik Otzen Bagger, Davis James McCarthy, John C Marioni, and Sarah A Teichmann. Classification of low quality cells from single-cell RNA-seq data. Genome Biology, 17(1):1–15, 2016.

[12] Saiful Islam, Amit Zeisel, Simon Joost, Gioele La Manno, Pawel Zajac, Maria Kasper, Peter Lönnerberg, and Sten Linnarsson. Quantitative single-cell RNA-seq with unique molecular identifiers. Nature Methods, 11(2):163, 2014.

[13] Diego Adhemar Jaitin, Assaf Weiner, Ido Yofe, David Lara-Astiaso, Hadas Keren-Shaul, Eyal David, Tomer Meir Salame, Amos Tanay, Alexander van Oudenaarden, and Ido Amit. Dissecting immune circuits by linking CRISPR-pooled screens with single-cell RNA-seq. Cell, 167(7):1883–1896, 2016.

[14] Gioele La Manno, Ruslan Soldatov, Amit Zeisel, Emelie Braun, Hannah Hochgerner, Viktor Petukhov, Katja Lidschreiber, Maria E Kastriti, Peter Lönnerberg, Alessandro Furlan, Jean Fan, Lars E Borm, Zehua Liu, David van Bruggen, Jimin Guo, Xiaoling He, Roger Barker, Erik Sundström, Gonçalo Castelo-Branco, Patrick Cramer, Igor Adameyko, Sten Linnarsson, and Peter V Kharchenko. RNA velocity of single cells. Nature, 560(7719):494–498, August 2018.

[15] Jacob H Levine, Erin F Simonds, Sean C Bendall, Kara L Davis, D Amir El-ad, Michelle D Tadmor, Oren Litvin, Harris G Fienberg, Astraea Jager, Eli R Zunder, et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell, 162(1):184–197, 2015.

[16] Oren Parnas, Marko Jovanovic, Thomas M Eisenhaure, Rebecca H Herbst, Atray Dixit, Chun Jimmie Ye, Dariusz Przybylski, Randall J Platt, Itay Tirosh, Neville E Sanjana, Ophir Shalem, Rahul Satija, Raktima Raychowdhury, Philipp Mertins, Steven A Carr, Feng Zhang, Nir Hacohen, and Aviv Regev. A genome-wide CRISPR screen in primary immune cells to dissect regulatory networks. Cell, 162(3):675–686, July 2015.

[17] Lei S Qi, Matthew H Larson, Luke A Gilbert, Jennifer A Doudna, Jonathan S Weissman, Adam P Arkin, and Wendell A Lim. Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression. Cell, 152(5):1173–1183, February 2013.

[18] Michal Rabani, Joshua Z Levin, Lin Fan, Xian Adiconis, Raktima Raychowdhury, Manuel Garber, Andreas Gnirke, Chad Nusbaum, Nir Hacohen, Nir Friedman, Ido Amit, and Aviv Regev. Metabolic labeling of RNA uncovers principles of RNA production and degradation dynamics in mammalian cells. Nat. Biotechnol., 29(5):436–442, May 2011.

[19] Michal Rabani, Raktima Raychowdhury, Marko Jovanovic, Michael Rooney, Deborah J Stumpo, Andrea Pauli, Nir Hacohen, Alexander F Schier, Perry J Blackshear, Nir Friedman, Ido Amit, and Aviv Regev. High-resolution sequencing and modeling identifies distinct dynamic RNA regulatory strategies. Cell, 159(7):1698–1710, December 2014.

[20] Alex K Shalek, Rahul Satija, Joe Shuga, John J Trombetta, Dave Gennert, Diana Lu, Peilin Chen, Rona S Gertner, Jellert T Gaublomme, Nir Yosef, Schraga Schwartz, Brian Fowler, Suzanne Weaver, Jing Wang, Xiaohui Wang, Ruihua Ding, Raktima Raychowdhury, Nir Friedman, Nir Hacohen, Hongkun Park, Andrew P May, and Aviv Regev. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature, 510(7505):363–369, June 2014.

[21] F Alexander Wolf, Fiona K Hamey, Mireya Plass, Jordi Solana, Joakim S Dahlin, Berthold Göttgens, Nikolaus Rajewsky, Lukas Simon, and Fabian J Theis. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biology, 20(1):1–9, 2019.