Comparisons between Arabidopsis thaliana and Drosophila melanogaster in relation to coding and noncoding sequence length and gene expression
University of Wollongong Research Online
Faculty of Science, Medicine and Health - Papers
2015

Rachel Caldwell, University of Wollongong, [email protected]
Yan-Xia Lin, University of Wollongong, [email protected]
Ren Zhang, University of Wollongong, [email protected]

Publication Details
Caldwell, R., Lin, Y. & Zhang, R. (2015). Comparisons between Arabidopsis thaliana and Drosophila melanogaster in relation to coding and noncoding sequence length and gene expression. International Journal of Genomics, 2015 269127-1 - 269127-13.

Research Online is the open access institutional repository for the University of Wollongong. For further information contact the UOW Library: [email protected]

Disciplines
Medicine and Health Sciences | Social and Behavioral Sciences

This journal article is available at Research Online: http://ro.uow.edu.au/smhpapers/3227

Hindawi Publishing Corporation
International Journal of Genomics
Volume 2015, Article ID 269127, 13 pages
http://dx.doi.org/10.1155/2015/269127

Research Article
Comparisons between Arabidopsis thaliana and Drosophila melanogaster in relation to Coding and Noncoding Sequence Length and Gene Expression

Rachel Caldwell,1 Yan-Xia Lin,2 and Ren Zhang1
1 School of Biological Sciences, University of Wollongong, Northfields Avenue, Keiraville, Wollongong, NSW 2522, Australia
2 National Institute for Applied Statistics Research Australia (NIASRA), School of Mathematics and Applied Statistics, University of Wollongong, Northfields Avenue, Keiraville, Wollongong, NSW 2522, Australia

Correspondence should be addressed to Rachel Caldwell; [email protected]
Received 18 March 2015; Accepted 11 May 2015
Academic Editor: Sylvia Hagemann
Copyright © 2015 Rachel Caldwell et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
There is a continuing interest in the analysis of gene architecture and gene expression to determine the relationship that may exist. Advances in high-quality sequencing technologies and large-scale resource datasets have increased the understanding of relationships and cross-referencing of expression data to the large genome data. Although a negative correlation between expression level and gene (especially transcript) length has been generally accepted, there have been some conflicting results arising from the literature concerning the impacts of different regions of genes, and the underlying reason is not well understood. The research aims to apply quantile regression techniques for statistical analysis of coding and noncoding sequence length and gene expression data in the plant, Arabidopsis thaliana, and fruit fly, Drosophila melanogaster, to determine if a relationship exists and if there is any variation or similarities between these species. The quantile regression analysis found that the coding sequence length and gene expression correlations varied, and similarities emerged for the noncoding sequence length (5' and 3' UTRs) between animal and plant species. In conclusion, the information described in this study provides the basis for further exploration into gene regulation with regard to coding and noncoding sequence length.
1. Introduction

Advances in high-quality sequencing technologies [1, 2] and large-scale resource datasets [3, 4] have enhanced genomics research. Conducting large-scale sequence comparisons has the advantage of identifying the genetic variation and speciation among organisms [5]. Whole-genome expression experiments have also expanded a new era in bioinformatics analyses [6–9]. Understanding relationships and cross-referencing of expression data to large genome data can now be attained and facilitates a greater insight of organismal complexity and the tightly regulated process of gene expression.

There is a continuing interest in the analysis of gene architecture and gene expression to determine the relationship that may exist [10]. Current investigations on the similarities and differences between plant and animal genome structure have led to a greater understanding in biochemical pathways, genetic mechanisms, sequence structures, and functions [11], and comparative studies are more powerful than studying the sequence of a single genome [5]. Furthermore, control of gene expression has been used as a measurement of variation and is often well conserved between species in the coding sequences. In unicellular organisms such as the yeast Saccharomyces cerevisiae, research has found that highly expressed genes tend to have smaller compact protein sizes [12]. Other animal genome studies have found that highly expressed genes have fewer and shorter introns and shorter coding sequences and protein lengths and favour more compactness in highly expressed genes [13, 14]. Previous research, however, is divided in opinion, with highly expressed genes not always being compact in plants. There is evidence that suggests that, in higher plant genomes, highly expressed genes comprise longer introns and primary transcripts [15] in contrast with other research on Arabidopsis and rice, finding that highly expressed genes are more compact [16], specifically the lengths of the coding sequence (CDS) [17]. Negative correlation between protein length and gene expression breadth in the plant species Populus tremula was also observed [18]. Taken together, these observations suggest that the difference in length in relation to gene expression is not merely due to adaptive evolution but rather has specific biological significance [19].

Significance of noncoding regions is less understood across species compared to the coding regions. A range of genomic studies over the last decade has supported the opinion that there are tightly regulated processes and levels of control in the regulation of gene expression. This has included the untranslated gene regions, notably the 5' and 3' untranslated regions (UTRs), which may play the most important role in the regulation of gene expression [20]. A study by Lin and Li revealed a strong negative correlation between the 5' UTR length and expression correlation with cytosolic ribosomal protein patterns in S. cerevisiae and C. albicans [21], with highly expressed eukaryotic genes tending to have more compact 5' UTR regions [22]. A plant study on both Arabidopsis and rice also reported negative correlation

the annotation file which contained only one gene model for each gene. The downloaded expression data were already normalized by Bioconductor (http://www.bioconductor.org/) R software. The final sample size for analysis was 18,445 genes, excluding two (2) genes from the coding sequence that only had 1 bp which was classified as an intron. The accession string and ID reference from the arrays were used to link the data together to create a master database of length and gene expression data for analysis.

The Drosophila melanogaster sequence data were downloaded from FlyBase website: http://www.flybase.org/. The raw CEL gene expression data files were downloaded from the NCBI GEO Datasets database (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42255) under series GSE42255 [27]. Affymetrix microarrays were used to analyse the adult Drosophila and the raw CEL files were normalized using the Bioconductor (http://www.bioconductor.org/) affy package in the R software environment. The annotation file was included and the Entrez UniGene name (GC numbers) and the ID from the platform data table were used to link the data together to create a master database of length and gene
Taken together, these observations suggest that excluding two (2) genes from the coding sequence that only the difference in length in relation to gene expression is had 1 bp which was classified as an intron. The accession not merely due to adaptive evolution but rather has specific string and ID reference from the arrays were used to link the biological significance19 [ ]. data together to create a master database of length and gene Significance of noncoding regions is less understood expression data for analysis. across species compared to the coding regions. A range The Drosophila melanogaster sequence data were down- of genomic studies over the last decade has supported the loaded from FlyBase website: http://www.flybase.org/.The opinion that there are tightly regulated processes and levels raw CEL gene expression data files were downloaded from of control in the regulation of gene expression. This has the NCBI GEO Datasets database (http://www.ncbi.nlm included the untranslated gene regions, notably the 5 and .nih.gov/geo/query/acc.cgi?acc=GSE42255) under series 3 untranslated regions (UTRs), which may play the most GSE42255 [27]. Affymetrix microarrays were used to analyse important role in the regulation of gene expression [20]. A the adult Drosophila and the raw CEL files were normalized studybyLinandLirevealedastrongnegativecorrelation using the Bioconductor (http://www.bioconductor.org/)affy between the 5 UTR length and expression correlation with packageintheRsoftwareenvironment.Theannotationfile cytosolic ribosomal protein patterns in S. cerevisiae and C. was included and the Entrez UniGene name (GC numbers) albicans [21], with highly expressed eukaryotic genes tending and the ID from the platform data table were used to link the to have more compact 5 UTR regions [22]. A plant study on data together to create a master database of length and gene both Arabidopsis andricealsoreportednegativecorrelation