Museum of Spatial Transcriptomics
Total Page:16
File Type:pdf, Size:1020Kb
Museum of Spatial Transcriptomics Lambda Moses, [email protected] Lior Pachter, [email protected] 2021-05-11 1mailto:[email protected] 2mailto:[email protected] 2 Contents Preface 5 1 Introduction 7 I Prequel era 11 2 Prequel era 13 2.1 Enhancer and gene traps ....................... 14 2.2 In situ reporter ............................ 18 2.3 ISH and WMISH atlases ....................... 20 2.4 Databases of the prequel era ..................... 27 2.5 Geography of the prequel era .................... 29 3 Data analysis in the prequel era 33 3.1 Gene patterns ............................. 34 3.2 Spatial regions ............................ 38 3.3 Gene interactions ........................... 38 3.4 Decline ................................ 39 3.5 Geography of prequel data analysis ................. 40 II Current era 43 4 From the past to the present 45 3 4 CONTENTS 5 Current era technologies 57 5.1 Microdissection ............................ 57 5.2 Single molecular FISH ........................ 66 5.3 In situ sequencing .......................... 86 5.4 In situ array capture ......................... 94 5.5 No imaging .............................. 102 5.6 Spatial multi-omics .......................... 104 5.7 Databases of the current era ..................... 104 6 Text mining LCM transcriptomics abstracts 107 6.1 Topic modeling ............................ 109 6.2 Changes of word usage through time ................ 113 6.3 Changes of topic prevalence through time ............. 115 6.4 Association of topics with city .................... 118 6.5 GloVe word embedding ....................... 122 7 Data analysis in the current era 127 7.1 Preprocessing ............................. 133 7.2 Exploratory data analysis ...................... 142 7.3 Spatial reconstruction of scRNA-seq data ............. 145 7.4 Cell type deconvolution ....................... 154 7.5 Spatially variable genes ....................... 157 7.6 Gene patterns ............................. 163 7.7 Spatial regions ............................ 165 7.8 Cell-cell interaction .......................... 167 7.9 Gene-gene interaction ........................ 171 7.10 Subcellular transcript localization .................. 172 7.11 Gene expression imputation from H&E ............... 174 III Future perspectives 177 8 From the past to the present to the future 179 References 183 Preface This supplement to the paper Museum of Spatial Transcriptomics and the associated database of spatial transcriptomics literature1 is inspired by mu- seum catalogs that provide insight and detail to further understanding of the exhibits. The results presented are based on code that can be run interac- tively on RStudio Cloud2. We present key analyses of metadata curated for the database, and provide further analyses and results beyond what could be included here in the more_analyses directory of this repository. The mark- down that generates this text is on GitHub, and is version controlled so that its development can be tracked now and in the future. Please notify us of errors, omissions, or other suggestions via submission of issues on GitHub: https://github.com/pachterlab/LP_2021 This document is built with the bookdown package from a collection of R Markdown files. How some of figures look depends on parameters thatcan be changed, such as size of bins when binning number of publications in time to show a trend. The source code is on RStudio Cloud3. The dependencies are pre-installed in the RStudio Cloud project. By default, when the database is queried by code, the most up to date version is used, which can be newer than the rendered static version on github.io. To build the document in RStudio Cloud (i.e., both the web page gitbook version and a PDF), run this in the R console: bookdown::render_book("index.Rmd", output_format = c("bookdown::gitbook", "bookdown::pdf_book")) If you are cloning this repo into a fresh RStudio Cloud project or a fresh machine, install the packages required to build the book as follows: First install remotes with install.packages("remotes"). Then use remotes:install_deps(dependencies = TRUE) to install all required pack- ages from CRAN, Bioconductor, and GitHub. So in short, 1https://docs.google.com/spreadsheets/d/1sJDb9B7AtYmfKv4-m8XR7uc3XXw_ k4kGSout8cqZ8bY/edit?usp=sharing 2https://rstudio.cloud/project/2492054 3https://rstudio.cloud/project/2492054 5 6 CONTENTS install.packages("remotes") remotes::install_deps(dependencies = TRUE) Because many packages are installed, the installation can be sped up with the argument Ncpus in install_deps() to specify the number of CPU cores to use to install packages in parallel, such as Ncpus = 2L for 2 cores. The free plan of RStudio Cloud only has 1 core, but this argument can be used when multiple cores are available. By default, the most up to date version of the database is downloaded for analy- ses in this book. However, as the museumst R package written for these analyses contains a cached version of the database, historical versions of the database can be viewed by installing older versions of museumst and setting update = FALSE when calling museumst::read_metadata() when running code from this book on RStudio Cloud or your computer. Older versions of museumst can be installed with remotes::install_github("pachterlab/museumst", ref = "v0.0.0.9016") where ref refers to a release. Release history of museumst can be seen here4. Documentation of museumst can be seen here5. 4https://github.com/pachterlab/museumst/releases 5https://pachterlab.github.io/museumst/ Chapter 1 Introduction The spatial organization of the components of biological systems is crucial for their proper function. For instance, morphogen gradients in embryos are tightly regulated to ensure that the right cell types differentiate at the right place. In adults, spatial organization of cells in tissues is important to proper functions of organs. For instance, the liver lobule is divided in labor according to distance from the portal triad as such distance affects suitability of different tasks. Both oxygen level and morphogen gradient regulate zonation of metabolism (Geb- hardt 2014); there is more oxidative phosphorylation and gluconeogenesis in the more oxygenated periportal region and more glycolysis in the more deoxy- genated pericentral region. How cell types and cellular functions vary in space is measured by quantifying gene expression in space. Conversely, the expression of an unknown gene in space can give clues to its function. Gene expression is usually quantified by quantifying proteins or transcripts encoded by the gene, and high throughput spatial methods exist for both protein and transcripts. In other words, cellular function exemplifies the maxim that “the whole is greater than the sum of its parts”, and in large part this follows from “location, location, location”. Here we focus on spatial transcriptomics (the field of spatial proteomics is cov- ered elsewhere (Lundberg and Borner 2019; Baharlou et al. 2019; Buchberger et al. 2018)). Even spatial transcriptomics is a vast field, and it is useful to begin by considering the scope of what it contains. Naïvely, one may say, spa- tial transcriptomics means quantifying the complete set of RNAs encoded by the genome in space. Usually the “in space” is at some microscopic resolution rather than geospatial as often assumed in the term “spatial statistics”; the resolution is usually cellular, though sometimes subcellular. The “spatial” is in contrast to other transcriptomics methods that by virtue of the nature of their assays, lose information of tissue structure in space. That is the case with microarray technology for bulk tissue analysis, for bulk RNA-seq, and single cell RNA-seq (scRNA-seq) that is based on dissociation of tissue – the “spatial” 7 8 CHAPTER 1. INTRODUCTION usually means tissue structure in space. More broadly, the “spatial” can mean knowing spatial context of samples although the spatial context is only a label and the coordinates are not collected or not used, such as in some laser capture microdissection (LCM) literature (Baccin et al. 2020; Nichterwitz et al. 2016), Niche-seq (Medaglia et al. 2017), and APEX-seq (Fazal et al. 2019). The “spatial” can also mean preserving spatial coordinates of samples within tissue, though the coordinates may or may not be explicitly used in data analysis, such as in the various single molecular fluorescent in situ hybridization (smFISH) based technologies such as seqFISH (Lubeck et al. 2014) and MERFISH (K. H. Chen et al. 2015) and array based technologies such as Spatial Transciptomics (ST) (Ståhl et al. 2016). There is more complexity in defining “transcriptomics”. While some technologies usually called “spatial transcriptomics” are indeed transcriptome-wide, such as ST, Visium, and LCM followed by RNA-seq, many technologies that only profile a panel of usually a few hundred genes are nevertheless considered part of “spa- tial transcriptomics”. Here “transcriptomics” actually means high-throughput quantification of gene expression, preferably highly multiplexed, quantifying nu- merous genes within the same piece of tissue at the same time. However, what counts as “high-throughput”? Is there a minimum number of genes required? Should 50 genes be enough? Or a hundred genes? The threshold number of genes required to be considered “high-throughput” is difficult to define; here, by “high-throughput”, we mean the intent to quantify expression of more genes than normally done with fluorescent in situ hybridization (FISH)