RNA Velocity Analysis for Perturb-Seq
Mesert Kebed
RNA Velocity Analysis for Perturb-Seq
by
Mesert Kebed

B.S. Computer Science and Engineering, Massachusetts Institute of Technology (2018)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 14, 2020
Certified by: Aviv Regev, Professor of Biology, Thesis Supervisor
Accepted by: Katrina LaCurts, Chair, Master of Engineering Thesis Committee

Abstract

Recent developments in single-cell RNA-seq and CRISPR-based perturbations have enabled researchers to carry out hundreds of perturbation experiments in a pooled format, an experimental approach called Perturb-Seq [7]. Prior analyses of Perturb-Seq measured the overall effect of a perturbation on each gene; however, it remains difficult to capture temporal responses to a perturbation. In this thesis, we compare the effectiveness of three RNA velocity-informed models and two cell-cell similarity-based models in providing a pseudo-temporal ordering of cells. We find that pseudotime estimated with the dynamical model for computing velocity provides the most reliable ordering of cells. We use this pseudo-temporal ordering to bin cells into three time-resolved groups and compute the effect of a perturbation at each time point. This analysis provides a promising start to understanding the temporal effects of a perturbation.
Thesis Supervisor: Aviv Regev
Title: Professor of Biology

Acknowledgments

Growing up, I often heard the proverb "it takes a village to raise a child", and I am no exception. Many more people have made this work possible than I can exhaustively acknowledge here, but I hope to express my gratitude to a few of them.

First, I would like to thank Oana Ursu, who mentored me through the duration of the thesis and patiently taught me everything I know in the field. This work would not have been possible without her gentle guidance and kind words of encouragement. She has made me a better researcher, writer, learner and person, and for that I will be forever grateful.

I would like to thank Professor Regev, who initially inspired me to pursue computational biology and exposed me to a wonderfully balanced group of experimentalists, mathematicians and computer scientists. I’m extremely grateful to the Regev Group and the Broad Institute for providing an environment that nurtured and celebrated my curiosity.

I would like to thank all of my professors, instructors and staff, who have instilled in me a desire to seek and tackle challenging problems; the Dept. of EECS admins, who have supported me through numerous years at the Institute; and John Guttag and Ana Bell, who have welcomed me into their teaching staff with open arms.

Lastly, I would like to thank my family for inspiring my curiosity, creativity, and drive from a young age. I’m also grateful to my friends, who have kept me company through the late nights and early mornings, exposed me to new experiences and truly made MIT my home away from home.

Contents

1 Introduction
  1.1 Perturb-Seq: genetic screens for studying gene function
    1.1.1 Previous analysis of Perturb-Seq: using linear regression for identifying which genes are affected by a given perturbation
    1.1.2 Challenges in previous analysis methods of Perturb-Seq: understanding the temporal progression of gene expression changes induced by a gene knockout
  1.2 RNA Velocity: inferring a time-based ordering of cells
  1.3 Our proposed approach: using RNA velocity to increase the temporal resolution of perturbation-induced gene expression changes
    1.3.1 Using RNA velocity to arrange cells over time, towards distinguishing cells that are early from late responders
    1.3.2 Compare the effect of perturbations on gene expression as inferred by a) traditional Perturb-Seq analyses and b) incorporating insights from RNA velocity
2 Related Work
  2.1 Trajectory inference: infers projection of cells using diffusion maps
  2.2 RNA Velocity: captures the rate of change in the cell’s expression state
    2.2.1 Estimating RNA velocity
    2.2.2 Steady-state model: captures variations from an observed steady-state expression
    2.2.3 Dynamical model: solves the full gene-wise transcriptional dynamics
3 Data
  3.1 Time-series dataset: mouse BMDCs sequenced at 1 hour intervals following LPS stimulation
    3.1.1 Downloading and preparing the dataset
    3.1.2 Filtering the dataset
    3.1.3 Processing the dataset
  3.2 Perturb-Seq: mouse BMDCs with 24 perturbations at 0 and 3 hours after LPS stimulation
    3.2.1 Downloading and preparing the dataset
    3.2.2 Filtering the dataset
    3.2.3 Processing the Dataset
    3.2.4 Post-Processing
4 Methods
  4.1 Characterising the various pseudotime ordering methods
  4.2 Computing a pseudo-temporal ordering of cells
  4.3 Computing the transition matrix
    4.3.1 For velocity informed models
    4.3.2 For cell-cell similarity based methods
  4.4 Software Packages
5 Analysing the time-series dataset
  5.1 Goal
  5.2 Data and code
  5.3 Results and Analysis
    5.3.1 Picking the number of neighbours for the transition matrix
    5.3.2 Comparing the pseudotime estimates with the ground truth
    5.3.3 Exploring the genes that are strongly correlated with pseudotime estimates for each method
6 Analysing the Perturb-Seq dataset
  6.1 Goal
  6.2 Data and Code
  6.3 Results and Analysis
    6.3.1 Picking the number of neighbours for the transition matrix
    6.3.2 Comparing the pseudotime estimates among each of the velocity models
    6.3.3 Exploring the genes that are strongly correlated with pseudotime
    6.3.4 Compare the Beta’s that we get by sorting the cells into pseudotime ordered groups to the Beta’s we get from previous analysis of Perturb-Seq
  6.4 Conclusion and Future work

List of Figures

1-1 Linear regression model for Perturb-Seq. The model predicts the gene expression matrix Y (given) as a product of X, a matrix that specifies which cell received which perturbation (given), and beta, a set of coefficients that represent the effect of each perturbation on each gene. The beta coefficients are then used by biologists to understand the biological processes affected by different perturbations.
1-2 Direct and indirect effects of Gene A. In this simplified model of a cell pathway, Gene A activates expression of Genes B and C and represses expression of Gene D; these genes go on to activate and repress other genes. In this figure, Genes X, Y and Z are indirectly affected by the expression of Gene A.
1-3 Illustration of RNA velocity on Perturb-Seq. RNA velocity allows us to compute c(t+1), from which we can infer that c’(t) likely responded to the stimulus before c(t) and is therefore at a later cell state than c(t).
2-1 Overview of RNA Velocity. a) Gene transcription model: DNA is transcribed to RNA at rate α, spliced at rate β, and degraded at rate γ.
b) Phase diagram capturing regions of induction and repression based on the amount of unspliced and spliced RNA.
3-1 Distribution of total counts for each group of cells in time series dataset. The plot reveals that the unstimulated group has more reads than the others. We subsample the reads per cell to 500,000 to account for this.
3-2 Quality check for cells in time series dataset. Scatter plots of (left) the number of counts against the percentage of mitochondrial genes and (right) the number of counts against the number of genes in each cell. We remove all cells that have greater than 0.1% mitochondrial genes and greater than 7,000 genes.
3-3 Overview of cells in time series dataset. a) Low dimensional representation in UMAP space representing the 20 clusters identified in the dataset, the 0hr, 1hr, 2hr, 4hr, and 6hr cells. b) Average expression of LPS gene groups in the dataset across cells.
3-4 Cluster disruption. Expression of cluster disruptive genes (Cd83 and SerpinB6b) across the cells represented in the UMAP space.
3-5 Quality check on Perturb-Seq dataset. a) Scatter plots of (left) the number of counts against the percentage of mitochondrial genes and (right) the number of counts against the number of genes in each cell. We remove all cells that have greater than 0.1% mitochondrial genes and greater than 2,000 genes.
3-6 Linear regression model for batch correction. The model measures the effect of the covariates on the observed expression profile. We subtract out the covariates related to batch and keep the error and hour effect. In this case, our resulting expression would be B1 · (0hr) + B2 · (3hr) + error.
3-7 Overview of cells in Perturb-Seq dataset. Low dimensional representation in UMAP space representing the 15 clusters identified in the dataset, the 0hr cells and 3hr cells.