
bioRxiv preprint doi: https://doi.org/10.1101/844522; this version posted November 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

A comprehensive online database for exploring ~20,000 public Arabidopsis RNA-Seq libraries

Hong Zhang1†, Fei Zhang1†, Li Feng1, Jinbu Jia1, and Jixian Zhai1,*

1 Department of Biology & Institute of Plant and Food Science, Southern University of Science and Technology, Shenzhen, Guangdong 518055, China.

† These authors contributed equally to this work.

* Correspondence: [email protected] (J.Z.)

Abstract

Application of Next Generation Sequencing (NGS) technology in transcriptome profiling has greatly improved our understanding of transcriptional regulation at a genome-wide scale in the last decade, and tens of thousands of RNA-sequencing (RNA-seq) libraries have been produced by the research community. However, accessing such a huge amount of RNA-seq data poses a big challenge for groups that lack dedicated bioinformatic personnel or expensive computational resources. Here, we introduce the Arabidopsis RNA-seq database (ARS), a free, web-accessible, and user-friendly database to quickly explore the expression level of any gene in 20,000+ publicly available Arabidopsis RNA-seq libraries.

In the last decade, RNA-sequencing (RNA-seq) has surpassed microarrays to become the gold standard for gene expression profiling due to the continuous drop in sequencing cost and the development of easy-to-use library construction kits. To date, the Arabidopsis community has collectively released more than 20,000 RNA-seq libraries, with over 1,300 libraries deposited in the first quarter of 2019 alone (Figure 1A). This vast resource is tremendously useful for all Arabidopsis researchers studying the transcriptional regulation, tissue specificity, and developmental dynamics of the genes they are interested in. However, accessing such a large amount of RNA-seq data remains a big challenge for many groups that lack dedicated bioinformatic personnel or expensive computational resources.

Here, we present the Arabidopsis RNA-seq database (ARS, http://ipf.sustc.edu.cn/pub/athrna/), which integrates 20,068 publicly available Arabidopsis RNA-seq libraries deposited in the Gene Expression Omnibus (GEO) (Barrett et al., 2013), Sequence Read Archive (SRA) (Kodama et al., 2012), European Nucleotide Archive (ENA) (Harrison et al., 2019), and DNA Data Bank of Japan (DDBJ) (Kodama et al., 2019) databases (Table S1) before the end of March 2019. We downloaded the raw data of all libraries and re-processed them with a standardized pipeline: we mapped the reads to the TAIR10 genome (Figure 1B) and calculated a normalized expression level in FPKM (Fragments Per Kilobase of transcript per Million mapped reads) for each of the 37,336 genes annotated in Araport11 (Cheng et al., 2017) in each library (Figure 1C).
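The FPKM normalization can be illustrated with a minimal Python sketch. It follows the standard definition of the unit (fragments assigned to a gene, divided by the transcript length in kilobases and by the total number of mapped fragments in millions). The function name and the example numbers are hypothetical, and this is not the StringTie implementation used to build ARS.

```python
def fpkm(fragments: int, length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    All three arguments are illustrative inputs, not values from ARS.
    """
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = length_bp / 1_000                      # transcript length in kb
    millions = total_mapped_fragments / 1_000_000      # library size in millions
    return fragments / (kilobases * millions)


# Hypothetical gene: 500 fragments, 2,000 bp transcript, 20 million mapped fragments.
example_value = fpkm(500, 2_000, 20_000_000)
print(example_value)  # 12.5
```

Dividing by both length and library size is what makes values comparable between genes of different sizes and between libraries sequenced to different depths.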

ARS is a free, web-accessible, and user-friendly database that supports queries by gene ID, library ID, or BioProject ID, or by a combination of these to show specific genes in selected libraries. The search results are displayed in several forms, including a data table, plots, and a built-in online IGV browser for convenient exploration. Taking the query results for AT2G17690 (SUPPRESSOR OF DRM1 DRM2 CMT3, SDC) as an example (Figure S1), the “Information” page shows basic information on SDC, including the maximum, minimum, median, and mean FPKM values across all libraries, and details of the SDC gene such as locus type, alias, symbol, genome coordinates, and gene direction (Figure S1A). The “Data Table” page presents the FPKM values of SDC in all ~20,000 libraries, with library information such as the sample name, project, , genotype, tissue, and release date of each library (Figure S1B). The “Data Plot” page visualizes the FPKM values of SDC for easy comparison across multiple libraries (Figure S1D). ARS also integrates an online Integrative Genomics Viewer (IGV) (Robinson et al., 2017) for browsing read coverage at the SDC locus in selected libraries (Figure S1C).
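The summary statistics on the “Information” page and the restriction of a query to selected libraries can be sketched in a few lines of Python. The records, identifiers, and function names below are hypothetical, and treating a combined library/BioProject query as a union of the two selections is an assumption; the sketch does not reflect how the ARS website is actually implemented.

```python
from statistics import mean, median

# Hypothetical records for one gene: (library ID, BioProject ID, FPKM).
records = [
    ("LIB001", "PRJ_A", 0.0),
    ("LIB002", "PRJ_A", 0.5),
    ("LIB003", "PRJ_B", 12.0),
    ("LIB004", "PRJ_B", 3.5),
]


def summarize(fpkms):
    """Maximum, minimum, median, and mean of a gene's FPKM values."""
    values = list(fpkms)
    return {
        "max": max(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
    }


def select(rows, libraries=(), bioprojects=()):
    """Keep rows matching any requested library or BioProject ID.

    With no filter given, all rows are returned (assumed behaviour).
    """
    if not libraries and not bioprojects:
        return list(rows)
    return [r for r in rows if r[0] in libraries or r[1] in bioprojects]


all_stats = summarize(fpkm for _, _, fpkm in records)
project_b = select(records, bioprojects={"PRJ_B"})
print(all_stats)
print(project_b)
```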

Furthermore, our database can be used to quickly perform in silico screening among 20,000+ libraries to identify specific genotypes, conditions, or tissues that exhibit altered expression of any gene of interest. We again take SDC as an example: it is a classic marker gene whose normal expression is strictly restricted to the endosperm but which is expressed in somatic tissues when transcriptional gene silencing is defective (Henderson et al., 2008). A search for SDC in our database immediately identified RNA-seq libraries from mutants of well-known components of the silencing pathway, such as met1 and drm12cmt23, and also uncovered two sets of libraries that are related to epigenetic regulation but have not previously been reported to be involved in silencing of the endogenous SDC (Figure 1D). The first set consists of two biological replicates of a coilin mutant (hgf1-1 rep1 and rep2) (Figure 1D, Figure S2). Coilin is a scaffold protein required for the formation of the Cajal body (Kanno et al., 2016), which colocalizes with AGO4 and NRPE1 in the nuclear processing center and is required for the full function of DNA methylation (Li et al., 2006; Pontes et al., 2006; Li et al., 2008). The second set consists of three biological replicates from a mutant of GCN5, a histone acetyltransferase (Figure S2); a reader protein for acetylated histones has been shown to regulate SDC expression (Zhang et al., 2016). Therefore, ARS provides a quick, sensitive, and high-throughput screening method to help identify novel players using a “big data” approach. This approach will continue to grow in efficiency as the number of RNA-seq libraries deposited in public domains increases.
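The idea behind such an in silico screen can be sketched as a simple filter over per-library expression values. In the Python sketch below, all library names, genotype labels, FPKM values, and cutoffs are made up for illustration. Requiring a minimum level of a housekeeping reference gene is an assumption, included here to set aside libraries in which low expression of everything could simply reflect poor data.

```python
from typing import List, NamedTuple


class LibraryExpression(NamedTuple):
    library: str
    genotype: str
    target_fpkm: float      # gene of interest (a marker gene such as SDC)
    reference_fpkm: float   # housekeeping gene used as a sanity check


def screen(libraries: List[LibraryExpression],
           min_target: float = 5.0,
           min_reference: float = 10.0) -> List[LibraryExpression]:
    """Return libraries with elevated target expression, highest first.

    Both cutoffs are hypothetical and would need tuning for a real gene.
    """
    hits = [
        lib for lib in libraries
        if lib.target_fpkm >= min_target and lib.reference_fpkm >= min_reference
    ]
    return sorted(hits, key=lambda lib: lib.target_fpkm, reverse=True)


# Made-up example data; none of these values come from ARS.
toy_libraries = [
    LibraryExpression("LIB_A", "wild type", 0.1, 150.0),
    LibraryExpression("LIB_B", "mutant_1", 42.0, 140.0),
    LibraryExpression("LIB_C", "mutant_2", 8.5, 160.0),
    LibraryExpression("LIB_D", "mutant_3", 30.0, 2.0),   # reference too low
]

for hit in screen(toy_libraries):
    print(hit.library, hit.genotype, hit.target_fpkm)
```

Candidate genotypes found this way are leads for follow-up experiments, not evidence of a role in silencing on their own.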

Many excellent web-based resources have been developed for hosting and analyzing mRNA-seq data, such as the MPSS database (Nakano et al., 2006), the UCSC Genome Browser (Kent et al., 2002), the Anno-J Browser (Lister et al., 2008), the EPIC-CoGe Browser (Nelson et al., 2018), ePlant (Waese et al., 2017), and CoNekT (Proost and Mutwil, 2018). These existing resources are designed either for hosting single or multiple projects or for exploring the spatial and temporal dynamics of gene expression, which requires a smaller number of microarray or RNA-seq libraries (ePlant with 1,385 microarray samples, CoNekT with 913 RNA-seq samples). In comparison, ARS can quickly extract the abundance information of any gene from 20,000+ Arabidopsis RNA-seq libraries using a simple “Google-like” search, and also provides easy access to view these data via a built-in online IGV browser. With a rapidly growing number of RNA-seq libraries, we plan to update ARS regularly in the future.

RNA-seq data processing

We used the search term “((Arabidopsis thaliana[Organism]) AND "transcriptomic"[Source]) AND "rna seq"[Strategy]” to collect libraries from GEO and other public databases. For data processing, we aligned raw reads to the TAIR10 reference genome using HISAT2 version 2.0.5 (Kim et al., 2015) with parameters “-max-intron-length=5000 -k 1 -dta --n-ceil -L,0,0.15”, and removed duplicate reads with Samtools rmdup version 1.4.1 (Li et al., 2009). FPKMs were calculated by StringTie version 1.3.3b with the parameters “-e -r -G” (Pertea et al., 2015).

Acknowledgements

We thank all the research groups that contributed RNA-seq data to the public domain, and we apologize for not being able to cite all the related papers in the main text due to limited space. References for all libraries that we used are listed in Table S1. The group of J.Z. is supported by the National Natural Science Foundation of China (31871234), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (2016ZT06S172), and the Shenzhen Sci-Tech Fund (KYTDPT 20181011104005).

Author Contributions

H.Z. processed the data; H.Z., F.Z., and L.F. built the database and website; J.J. and J.Z. oversaw the study. H.Z., F.L., and J.Z. wrote the manuscript.

References

Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Holko, M., Yefanov, A., Lee, H., Zhang, N., Robertson, C.L., Serova, N., Davis, S., and Soboleva, A. (2013). NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41, D991-995.

Cheng, C.Y., Krishnakumar, V., Chan, A.P., Thibaud-Nissen, F., Schobel, S., and Town, C.D. (2017). Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J 89, 789-804.

Harrison, P.W., Alako, B., Amid, C., Cerdeno-Tarraga, A., Cleland, I., Holt, S., Hussein, A., Jayathilaka, S., Kay, S., Keane, T., Leinonen, R., Liu, X., Martinez-Villacorta, J., Milano, A., Pakseresht, N., Rajan, J., Reddy, K., Richards, E., Rosello, M., Silvester, N., Smirnov, D., Toribio, A.L., Vijayaraja, S., and Cochrane, G. (2019). The European Nucleotide Archive in 2018. Nucleic Acids Res 47, D84-D88.

Henderson, I.R., and Jacobsen, S.E. (2008). Tandem repeats upstream of the Arabidopsis endogene SDC recruit non-CG DNA methylation and initiate siRNA spreading. Genes Dev 22, 1597-1606.

Kanno, T., Lin, W.D., Fu, J.L., Wu, M.T., Yang, H.W., Lin, S.S., Matzke, A.J., and Matzke, M. (2016). Identification of Coilin Mutants in a Screen for Enhanced Expression of an Alternatively Spliced GFP Reporter Gene in Arabidopsis thaliana. Genetics 203, 1709-1720.

Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. (2002). The human genome browser at UCSC. Genome Res 12, 996-1006.

Kim, D., Langmead, B., and Salzberg, S.L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12, 357-360.

Kodama, Y., Shumway, M., Leinonen, R., and International Nucleotide Sequence Database Collaboration (2012). The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res 40, D54-56.

Kodama, Y., Mashima, J., Kosuge, T., and Ogasawara, O. (2019). DDBJ update: the Genomic Expression Archive (GEA) for functional genomics data. Nucleic Acids Res 47, D69-D73.

Li, C.F., Henderson, I.R., Song, L., Fedoroff, N., Lagrange, T., and Jacobsen, S.E. (2008). Dynamic regulation of ARGONAUTE4 within multiple nuclear bodies in Arabidopsis thaliana. PLoS Genet 4, e27.

Li, C.F., Pontes, O., El-Shami, M., Henderson, I.R., Bernatavichute, Y.V., Chan, S.W.-L., Lagrange, T., Pikaard, C.S., and Jacobsen, S.E. (2006). An ARGONAUTE4-containing nuclear processing center colocalized with Cajal bodies in Arabidopsis thaliana. Cell 126, 93-106.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079.

Lister, R., O'Malley, R.C., Tonti-Filippini, J., Gregory, B.D., Berry, C.C., Millar, A.H., and Ecker, J.R. (2008). Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133, 523-536.

Nakano, M., Nobuta, K., Vemaraju, K., Tej, S.S., Skogen, J.W., and Meyers, B.C. (2006). Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res 34, D731-735.

Nelson, A.D.L., Haug-Baltzell, A.K., Davey, S., Gregory, B.D., and Lyons, E. (2018). EPIC-CoGe: managing and analyzing genomic data. Bioinformatics 34, 2651-2653.

Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.C., Mendell, J.T., and Salzberg, S.L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290-295.

Pontes, O., Li, C.F., Nunes, P.C., Haag, J., Ream, T., Vitins, A., Jacobsen, S.E., and Pikaard, C.S. (2006). The Arabidopsis chromatin-modifying nuclear siRNA pathway involves a nucleolar RNA processing center. Cell 126, 79-92.

Proost, S., and Mutwil, M. (2018). CoNekT: an open-source framework for comparative genomic and transcriptomic network analyses. Nucleic Acids Res 46, W133-W140.

Robinson, J.T., Thorvaldsdottir, H., Wenger, A.M., Zehir, A., and Mesirov, J.P. (2017). Variant Review with the Integrative Genomics Viewer. Cancer Res 77, e31-e34.

Waese, J., Fan, J., Pasha, A., Yu, H., Fucile, G., Shi, R., Cumming, M., Kelley, L.A., Sternberg, M.J., and Krishnakumar, V. (2017). ePlant: visualizing and exploring multiple levels of data for hypothesis generation in plant biology. Plant Cell 29, 1806-1821.

Zhang, C.-J., Hou, X.-M., Tan, L.-M., Shao, C.-R., Huang, H.-W., Li, Y.-Q., Li, L., Cai, T., Chen, S., and He, X.-J. (2016). The Arabidopsis acetylated histone-binding protein BRAT1 forms a complex with BRP1 and prevents transcriptional silencing. Nat Commun 7, 11715.

Figure Legends

Figure 1. Overview of the Arabidopsis RNA-Seq Database.

A. The number of sequenced bases per year from 2009 to 2018. The x-axis represents the year of data generation, and the y-axis is the number of sequenced bases in GB.

B. Overall distribution of the percentage of uniquely mapped reads versus the total number of sequenced raw reads in 20,068 libraries. Each dot represents a library, and the color represents the density of the dots.

C. Workflow for the construction of the Arabidopsis RNA-Seq Database (ARS). A total of 20,068 publicly available Arabidopsis RNA-seq libraries were collected from the GEO, SRA, DDBJ, and ENA databases and processed with a unified pipeline. All gene- and library-related information can be accessed via keyword-based searching on our ARS website (http://ipf.sustech.edu.cn/pub/athrna/).

D. The expression levels of SDC in all libraries. A housekeeping gene, UBQ, is plotted on the x-axis. Libraries from mutants with a known role in silencing of SDC are labeled in black, and two potential novel players (hgf1 and gcn5 mutants) are labeled in red.

Figure S1. Example of a query using SDC. A-D show the gene information, data table, data plot, and IGV browser, respectively.

Figure S2. Expression level of SDC in hgf1 and gcn5 mutants. The data plot page obtained by searching AT2G17690 (SDC) and the two related BioProject numbers.

Table S1. The detailed information of all RNA-seq libraries collected from public databases.

Figure 1. Arabidopsis RNA-Seq Database. Panel A is a chart for the years 2009 to 2018 with the series "Bases deposited per year (GB)" and "Total number of bases (GB)". Panel C shows "Data Collection" (GEO, SRA, DDBJ, and ENA; 20,068 libraries) and "Database & Website" (Search, Filter, Browse, Download). Panels B and D are plots.

Supplemental Figures

Figure S1. Example of a query using SDC. A-D show the gene information, data table, data plot, and IGV browser, respectively.

Figure S2. Expression level of SDC in hgf1 and gcn5 mutants. The data plot page obtained by searching AT2G17690 (SDC) and the two related BioProject numbers.