bioRxiv preprint doi: https://doi.org/10.1101/844522; this version posted November 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
A comprehensive online database for exploring ~20,000 public Arabidopsis RNA-Seq libraries
Hong Zhang1†, Fei Zhang1†, Li Feng1, Jinbu Jia1, and Jixian Zhai1,*

1 Department of Biology & Institute of Plant and Food Science, Southern University of Science and Technology, Shenzhen, Guangdong 518055, China.
† These authors contributed equally to this work.
* Correspondence: [email protected] (J.Z.)
Abstract
The application of next-generation sequencing (NGS) technology to transcriptome profiling has greatly improved our understanding of transcriptional regulation at the genome-wide scale over the last decade, and tens of thousands of RNA-sequencing (RNA-seq) libraries have been produced by the research community. However, accessing such a huge amount of RNA-seq data poses a major challenge for groups that lack dedicated bioinformatics personnel or expensive computational resources. Here, we introduce the Arabidopsis RNA-seq database (ARS), a free, web-accessible, and user-friendly database for quickly exploring the expression level of any gene in 20,000+ publicly available Arabidopsis RNA-seq libraries.

In the last decade, RNA-sequencing (RNA-seq) has surpassed microarrays to become the gold standard for gene expression profiling, owing to the continuous drop in sequencing cost and the development of easy-to-use library construction kits. To date, the Arabidopsis community has collectively released more than 20,000 RNA-seq libraries, with over 1,300 libraries deposited in the first quarter of 2019 alone (Figure 1A). This vast resource is tremendously useful for Arabidopsis researchers studying the transcriptional regulation, tissue specificity, and developmental dynamics of their genes of interest. However, accessing such a large amount of RNA-seq data remains a major challenge for many groups that lack dedicated bioinformatics personnel or expensive computational resources.

Here, we present the Arabidopsis RNA-seq database (ARS, http://ipf.sustc.edu.cn/pub/athrna/), which integrates 20,068 publicly available Arabidopsis RNA-seq libraries deposited in the Gene Expression Omnibus (GEO) (Barrett et al., 2013), Sequence Read Archive (SRA) (Kodama et al., 2012), European Nucleotide Archive (ENA) (Harrison et al., 2019), and DNA Data Bank of Japan (DDBJ) (Kodama et al., 2019) databases before the end of March 2019 (Table S1). We downloaded the raw data of all libraries and re-processed them with a standardized pipeline: we mapped the reads to the TAIR10 genome (Figure 1B) and calculated a normalized expression level in FPKM (Fragments Per Kilobase of transcript per Million mapped reads) for each of the 37,336 genes annotated in Araport11 (Cheng et al., 2017) in each library (Figure 1C).

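To make the FPKM unit concrete, the sketch below computes it from a fragment count, a transcript length, and a library size. It is a minimal illustration only: the function name and the numbers are hypothetical, and the pipeline described here relies on StringTie, whose internal estimation may differ from this simple count-based formula.

```python
def fpkm(fragments: float, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("transcript length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions


# Hypothetical example: 500 fragments on a 2,000 bp transcript
# in a library with 20 million mapped fragments.
example = fpkm(500, 2_000, 20_000_000)
print(example)
```

Dividing by both transcript length and library size is what allows values to be compared across genes of different lengths and libraries of different sequencing depths.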
ARS is a free, web-accessible, and user-friendly database that supports queries by gene ID, library ID, or BioProject ID, or by a combination of these to show specific genes in selected libraries. Search results are displayed in several forms, including a data table, plots, and a built-in online IGV browser for convenient exploration. Taking the query results for AT2G17690 (SUPPRESSOR OF DRM1 DRM2 CMT3, SDC) as an example (Figure S1), the “Information” page shows basic information about SDC, including the maximum, minimum, median, and mean FPKM values across all libraries, and gene details such as locus type, alias, symbol, genome coordinates, and gene direction (Figure S1A). The “Data Table” page presents the FPKM values of SDC in all ~20,000 libraries, together with library information such as the sample name, project, ecotype, genotype, tissue, and release date of each library (Figure S1B). The “Data Plot” page visualizes the FPKM values of SDC for easy comparison across multiple libraries (Figure S1D). ARS also integrates an online Integrative Genomics Viewer (IGV) (Robinson et al., 2017) for browsing read coverage at the SDC locus in selected libraries (Figure S1C).

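The per-gene statistics shown on the “Information” page can be reproduced from a downloaded data table along the following lines. This is a sketch under assumptions: the library identifiers and FPKM values are invented for illustration, and it does not describe how the ARS website computes its statistics internally.

```python
from statistics import mean, median


def summarize_fpkm(values):
    """Return the maximum, minimum, median and mean of a gene's FPKM values."""
    values = list(values)
    if not values:
        raise ValueError("at least one FPKM value is required")
    return {
        "max": max(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
    }


# Hypothetical FPKM values of one gene in four libraries.
fpkm_by_library = {"LIB_A": 0.0, "LIB_B": 0.5, "LIB_C": 1.5, "LIB_D": 10.0}
summary = summarize_fpkm(fpkm_by_library.values())
print(summary)
```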
Furthermore, our database can be used to quickly perform in silico screening across 20,000+ libraries to identify specific genotypes, conditions, or tissues that exhibit altered expression of any gene of interest. We again take SDC as an example: it is a classic marker gene whose normal expression is strictly restricted to the endosperm but which is expressed in somatic tissues when transcriptional gene silencing is defective (Henderson and Jacobsen, 2008). A search for SDC in our database immediately identified RNA-seq libraries from mutants of well-known components of the silencing pathway, such as met1 and drm12cmt23, and also uncovered two sets of libraries from mutants that are related to epigenetic regulation but have not previously been reported to be involved in silencing of the endogenous SDC (Figure 1D). The first set consists of two biological replicates of a coilin mutant (hgf1-1 rep1 and rep2) (Figure 1D, Figure S2). Coilin is a scaffold protein required for the formation of Cajal bodies (Kanno et al., 2016), which colocalize with AGO4 and NRPE1 in the nuclear processing center and are required for the full function of DNA methylation (Li et al., 2006; Pontes et al., 2006; Li et al., 2008). The second set consists of three biological replicates of a mutant of GCN5, a histone acetyltransferase (Figure S2); reader proteins for acetylated histones have been shown to regulate SDC expression (Zhang et al., 2016). Therefore, ARS provides a quick, sensitive, and high-throughput screening method to help identify novel players using a “big data” approach. This approach will become more effective as the number of RNA-seq libraries deposited in the public domain increases.

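The screening idea above can be sketched offline on an exported data table: flag the libraries in which a normally silent gene is expressed far above its typical level. The code below is an illustrative assumption, not the ARS implementation; the library names, the tenfold cutoff, and the pseudocount are all hypothetical choices.

```python
from statistics import median


def screen_libraries(fpkm_by_library, fold=10.0, pseudocount=1.0):
    """Return (library, FPKM) pairs at least `fold` times the median, highest first.

    A pseudocount is added to both sides so that a median of zero
    does not flag every library with trace expression.
    """
    baseline = median(fpkm_by_library.values()) + pseudocount
    hits = [
        (library, value)
        for library, value in fpkm_by_library.items()
        if value + pseudocount >= fold * baseline
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical FPKM values for one gene across five libraries.
fpkm_by_library = {
    "wild_type_leaf_rep1": 0.1,
    "wild_type_leaf_rep2": 0.0,
    "wild_type_root_rep1": 0.2,
    "mutant_x_rep1": 35.0,
    "mutant_x_rep2": 28.0,
}
hits = screen_libraries(fpkm_by_library)
print(hits)
```

Ranking hits by expression and then checking whether replicates of the same genotype appear together is one simple way to separate reproducible candidates from single outlier libraries.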
Many excellent web-based resources have been developed for hosting and analyzing mRNA-seq data, such as the MPSS database (Nakano et al., 2006), the UCSC Genome Browser (Kent et al., 2002), the Anno-J Browser (Lister et al., 2008), the EPIC-CoGe Browser (Nelson et al., 2018), ePlant (Waese et al., 2017), and CoNekT (Proost and Mutwil, 2018). These existing resources are designed either for hosting single or multiple projects or for exploring the spatial and temporal dynamics of gene expression, which requires a smaller number of microarray or RNA-seq libraries (ePlant with 1,385 microarray samples, CoNekT with 913 RNA-seq samples). In comparison, ARS can quickly extract the abundance of any gene from 20,000+ Arabidopsis RNA-seq libraries using a simple “Google-like” search, and it provides easy access to these data through a built-in online IGV browser. Given the rapidly growing number of RNA-seq libraries, we plan to update ARS regularly.

RNA-seq data processing
We used the search term “((Arabidopsis thaliana[Organism]) AND "transcriptomic"[Source]) AND "rna seq"[Strategy]” to collect libraries from GEO and other public databases. For data processing, we aligned raw reads to the TAIR10 reference genome using HISAT2 version 2.0.5 (Kim et al., 2015) with the parameters “-max-intron-length=5000 -k 1 -dta --n-ceil -L,0,0.15” and removed duplicate reads with Samtools rmdup version 1.4.1 (Li et al., 2009). FPKMs were calculated with StringTie version 1.3.3b using the parameters “-e -r -G” (Pertea et al., 2015).

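For readers who wish to repeat the library search programmatically, the search term quoted above can be passed to the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and sends nothing over the network. Querying the `sra` database and the `retmax` value are assumptions made for illustration; the text does not state how the search was actually submitted.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Search term quoted in the text above.
SEARCH_TERM = (
    '((Arabidopsis thaliana[Organism]) AND "transcriptomic"[Source]) '
    'AND "rna seq"[Strategy]'
)
# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="sra", retmax=100):
    """Build an ESearch URL; db="sra" and retmax=100 are illustrative assumptions."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


url = build_esearch_url(SEARCH_TERM)
# Round-trip check: decoding the query string recovers the original term.
decoded_term = parse_qs(urlparse(url).query)["term"][0]
print(url)
```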
Acknowledgements
We thank all the research groups that contributed RNA-seq data to the public domain, and we apologize for not being able to cite all the related papers in the main text due to limited space. References for all libraries that we used are listed in Table S1. The group of J.Z. is supported by the National Natural Science Foundation of China (31871234), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (2016ZT06S172), and the Shenzhen Sci-Tech Fund (KYTDPT 20181011104005).

Author Contributions
H.Z. processed the data; H.Z., F.Z., and L.F. built the database and website; J.J. and J.Z. oversaw the study. H.Z., F.L., and J.Z. wrote the manuscript.

References
Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Holko, M., Yefanov, A., Lee, H., Zhang, N., Robertson, C.L., Serova, N., Davis, S., and Soboleva, A. (2013). NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41, D991-995.
Cheng, C.Y., Krishnakumar, V., Chan, A.P., Thibaud-Nissen, F., Schobel, S., and Town, C.D. (2017). Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J 89, 789-804.
Harrison, P.W., Alako, B., Amid, C., Cerdeno-Tarraga, A., Cleland, I., Holt, S., Hussein, A., Jayathilaka, S., Kay, S., Keane, T., Leinonen, R., Liu, X., Martinez-Villacorta, J., Milano, A., Pakseresht, N., Rajan, J., Reddy, K., Richards, E., Rosello, M., Silvester, N., Smirnov, D., Toribio, A.L., Vijayaraja, S., and Cochrane, G. (2019). The European Nucleotide Archive in 2018. Nucleic Acids Res 47, D84-D88.
Henderson, I.R., and Jacobsen, S.E. (2008). Tandem repeats upstream of the Arabidopsis endogene SDC recruit non-CG DNA methylation and initiate siRNA spreading. Genes Dev 22, 1597-1606.
Kanno, T., Lin, W.D., Fu, J.L., Wu, M.T., Yang, H.W., Lin, S.S., Matzke, A.J., and Matzke, M. (2016). Identification of Coilin Mutants in a Screen for Enhanced Expression of an Alternatively Spliced GFP Reporter Gene in Arabidopsis thaliana. Genetics 203, 1709-1720.
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. (2002). The human genome browser at UCSC. Genome Res 12, 996-1006.
Kim, D., Langmead, B., and Salzberg, S.L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12, 357-360.
Kodama, Y., Shumway, M., Leinonen, R., and International Nucleotide Sequence Database Collaboration (2012). The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res 40, D54-56.
Kodama, Y., Mashima, J., Kosuge, T., and Ogasawara, O. (2019). DDBJ update: the Genomic Expression Archive (GEA) for functional genomics data. Nucleic Acids Res 47, D69-D73.
Li, C.F., Henderson, I.R., Song, L., Fedoroff, N., Lagrange, T., and Jacobsen, S.E. (2008). Dynamic regulation of ARGONAUTE4 within multiple nuclear bodies in Arabidopsis thaliana. PLoS Genet 4, e27.
Li, C.F., Pontes, O., El-Shami, M., Henderson, I.R., Bernatavichute, Y.V., Chan, S.W.-L., Lagrange, T., Pikaard, C.S., and Jacobsen, S.E. (2006). An ARGONAUTE4-containing nuclear processing center colocalized with Cajal bodies in Arabidopsis thaliana. Cell 126, 93-106.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079.
Lister, R., O'Malley, R.C., Tonti-Filippini, J., Gregory, B.D., Berry, C.C., Millar, A.H., and Ecker, J.R. (2008). Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133, 523-536.
Nakano, M., Nobuta, K., Vemaraju, K., Tej, S.S., Skogen, J.W., and Meyers, B.C. (2006). Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res 34, D731-735.
Nelson, A.D.L., Haug-Baltzell, A.K., Davey, S., Gregory, B.D., and Lyons, E. (2018). EPIC-CoGe: managing and analyzing genomic data. Bioinformatics 34, 2651-2653.
Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.C., Mendell, J.T., and Salzberg, S.L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290-295.
Pontes, O., Li, C.F., Nunes, P.C., Haag, J., Ream, T., Vitins, A., Jacobsen, S.E., and Pikaard, C.S. (2006). The Arabidopsis chromatin-modifying nuclear siRNA pathway involves a nucleolar RNA processing center. Cell 126, 79-92.
Proost, S., and Mutwil, M. (2018). CoNekT: an open-source framework for comparative genomic and transcriptomic network analyses. Nucleic Acids Res 46, W133-W140.
Robinson, J.T., Thorvaldsdottir, H., Wenger, A.M., Zehir, A., and Mesirov, J.P. (2017). Variant Review with the Integrative Genomics Viewer. Cancer Res 77, e31-e34.
Waese, J., Fan, J., Pasha, A., Yu, H., Fucile, G., Shi, R., Cumming, M., Kelley, L.A., Sternberg, M.J., and Krishnakumar, V. (2017). ePlant: visualizing and exploring multiple levels of data for hypothesis generation in plant biology. Plant Cell 29, 1806-1821.
Zhang, C.-J., Hou, X.-M., Tan, L.-M., Shao, C.-R., Huang, H.-W., Li, Y.-Q., Li, L., Cai, T., Chen, S., and He, X.-J. (2016). The Arabidopsis acetylated histone-binding protein BRAT1 forms a complex with BRP1 and prevents transcriptional silencing. Nat Commun 7, 11715.

Figure Legends
Figure 1. Overview of the Arabidopsis RNA-Seq Database.
A. The number of sequenced bases per year from 2009 to 2018. The x-axis represents the year of data generation, and the y-axis is the number of sequenced bases in GB.
B. Overall distribution of the percentage of uniquely mapped reads versus the total number of sequenced raw reads in 20,068 libraries. Each dot represents a library, and the color represents the density of the dots.
C. Workflow for the construction of the Arabidopsis RNA-Seq Database (ARS). A total of 20,068 publicly available Arabidopsis RNA-seq libraries were collected from the GEO, SRA, DDBJ, and ENA databases and processed with a unified pipeline. All gene- and library-related information can be accessed via keyword-based searching on our ARS website (http://ipf.sustech.edu.cn/pub/athrna/).
D. The expression levels of SDC in all libraries. A housekeeping gene, UBQ, is plotted on the x-axis. Some libraries from mutants with a known role in the silencing of SDC are labeled in black, and two potential novel players (hgf1 and gcn5 mutants) are labeled in red.
Figure S1. Example of a query using SDC. A-D show the gene information, data table, data plot, and IGV browser, respectively.
Figure S2. Expression level of SDC in hgf1 and gcn5 mutants. The data plot page obtained by searching AT2G17690 (SDC) and the two related BioProject numbers.
Table S1. The detailed information of all RNA-seq libraries collected from public databases.

[Figure 1. Arabidopsis RNA-Seq Database. Panel A: bases deposited per year and total number of bases (GB), 2009 to 2018; data labels as printed: 49,292; 33,384; 19,590; 8,729; 3,482; 1,717; 23; 36; 150; 943. Panel C: workflow from data collection (GEO, SRA, DDBJ, ENA; 20,068 libraries) to the database and website, with search (gene ID, library ID, BioProject ID), filter, browse (read alignments in online IGV), and download functions.]
Supplemental Figures
[Figure S1, panels A-D: example of a query using SDC.]
[Figure S2: expression level of SDC in hgf1 and gcn5 mutants.]