Revealing the Missing Expressed Genes Beyond the Human Reference
Total Page:16
File Type:pdf, Size:1020Kb
Chen et al. BMC Genomics 2011, 12:590 http://www.biomedcentral.com/1471-2164/12/590 RESEARCH ARTICLE Open Access Revealing the missing expressed genes beyond the human reference genome by RNA-Seq Geng Chen1, Ruiyuan Li2, Leming Shi3, Junyi Qi1, Pengzhan Hu1, Jian Luo1, Mingyao Liu1 and Tieliu Shi1,4* Abstract Background: The complete and accurate human reference genome is important for functional genomics researches. Therefore, the incomplete reference genome and individual specific sequences have significant effects on various studies. Results: we used two RNA-Seq datasets from human brain tissues and 10 mixed cell lines to investigate the completeness of human reference genome. First, we demonstrated that in previously identified ~5 Mb Asian and ~5 Mb African novel sequences that are absent from the human reference genome of NCBI build 36, ~211 kb and ~201 kb of them could be transcribed, respectively. Our results suggest that many of those transcribed regions are not specific to Asian and African, but also present in Caucasian. Then, we found that the expressions of 104 RefSeq genes that are unalignable to NCBI build 37 in brain and cell lines are higher than 0.1 RPKM. 55 of them are conserved across human, chimpanzee and macaque, suggesting that there are still a significant number of functional human genes absent from the human reference genome. Moreover, we identified hundreds of novel transcript contigs that cannot be aligned to NCBI build 37, RefSeq genes and EST sequences. Some of those novel transcript contigs are also conserved among human, chimpanzee and macaque. By positioning those contigs onto the human genome, we identified several large deletions in the reference genome. Several conserved novel transcript contigs were further validated by RT-PCR. Conclusion: Our findings demonstrate that a significant number of genes are still absent from the incomplete human reference genome, highlighting the importance of further refining the human reference genome and curating those missing genes. Our study also shows the importance of de novo transcriptome assembly. The comparative approach between reference genome and other related human genomes based on the transcriptome provides an alternative way to refine the human reference genome. Background structure of genes, inferring splicing patterns and quan- The latest version of the public human genome assem- tifying the expression level of each transcript [7-10]. bly NCBI build 37 (also known as GRCh37) has been Therefore, the completeness of human reference gen- released and is considered to be the successor to NCBI ome is crucial for comprehensively understanding the Build 36. Currently, different types of human genetic structure of human genome. variation studies including single-nucleotide polymorph- Although the annotation information on the human isms (SNPs), deletions, insertions and structure variation reference genome has provided important information are all based on the human reference genome [1-6]. in the study of evolutionary history, population diversity, Moreover, the reference genome is the fundamental gene regulatory mechanisms and various biological resource for the researches in genomics and transcrip- events, it is still incomplete. In previous studies, Khaja tomics, such as investigating the transcriptional et al. [11] and Kidd et al. [12] have reported that a nota- ble portion of human genomic sequences were absent from NCBI build 35 or build 36, suggesting that the * Correspondence: [email protected] 1Center for Bioinformatics and Computational Biology, and the Institute of updated human reference genome is still not completely Biomedical Sciences, School of Life Science, East China Normal University, assembled and annotated. During the processes of map- Shanghai 200241, China ping transcriptome sequencing reads onto the human Full list of author information is available at the end of the article © 2011 Chen et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Chen et al. BMC Genomics 2011, 12:590 Page 2 of 9 http://www.biomedcentral.com/1471-2164/12/590 reference genome to quantify the gene expression and MicroArray Quality Control (MAQC) project [18] with infer alternative splicing or transcripts, the reads that Illumina next-generation sequencing technology to carry cannot be mapped onto the human reference genome out our study (Figure 1). The two reference RNA sam- are usually simply discarded. In fact, some of those ples consist of the Universal Human Reference RNA unmapped reads could be generated from certain func- (UHRR or UHR) from 10 human cell lines of various tional genes that not present in the reference genome. origins [19] and the Human Brain Reference RNA Simply discarding such sequences can result in the loss (HBRR or HBR) from several regions of the brain of 23 of important information. adult donors (22 of these donors are Caucasian, one is By comparing to the human reference genome NCBI unknown). The two datasets consist of ~70.1 million build 36, Li et al. also identified a significant amount of and ~58.6 million of sequencing reads, respectively, with Asian (YH) and African (NA18507) novel sequences read-lengths of 100 bp (see Methods). that not present in NCBI build 36, and they speculated In a previous study, by comparing to the human refer- that most of those novel sequences are specific to Asian ence genome NCBI build 36 with > 100 bp long and < and African [13]. The new version of human reference 90% identity criteria, Li et al. identified ~5 Mb of Asian genome NCBI build 37 resolves some problems existed (YH) and ~5 Mb African (NA18507) novel sequences in NCBI build 36 and provides us the opportunity to were not present in NCBI build 36, and they inferred further explore these potential Asian (YH) and African that most of those novel sequences are individual or (NA18507) novel sequences. Besides, the high sequen- population specific [13]. To examine how many of these cing throughput and feasibility of the sequencing tech- Asian (YH) and African (NA18507) novel sequences nology for transcriptome [14-17] allow us to better have been incorporated into GRCh37, we aligned [20] understand the genome structure of the human genome. these novel sequences to GRCh37. Using 90% identity Here we further investigated the completeness of and 98% coverage as threshold, we found that only human reference genome and revealed a significant 991.5 kb (19.35%) Asian (YH) and 926.7 kb (19.31%) number of human genes and novel transcript contigs African (NA18507) novel sequences could be aligned to that are absent from NCBI build 37. GRCh37, and the rest were still unalignable to GRCh37. Our findings show that the Asian (YH) and African The results suggest that those alignable Asian (YH) and (NA18507) novel sequences that are absent from NCBI African (NA18507) novel sequences are part of gaps or build 36 contain a significant portion of regions that can unsolved issues in NCBI build 36. be transcribed. And our results indicate that many of those transcribed sequences are not specific to Asian Human brain and cell line (YH) and African (NA18507), but are also present in transcriptome sequencing Caucasian populations. We also found that about half of reads 176 RefSeq genes that are unalignable to the human reference genome are expressed in human brain or cell lines. In addition, we identified hundreds of novel tran- Validate transcribed regions in Asian and Find expressed unalignable Identify novel transcript RefSeq genes script contigs in human brain and cell lines through de African novel sequences contigs novo assemblies of brain and cell line transcriptome. Align RefSeq mRNAs Predict genes in Infer splice junctions Assemble brain and cell onto GRCh37 to collect Our study suggests that the human reference genome is Asian and African in Asian and African line transcriptomes unalignable RefSeq novel sequences novel sequences still missing a significant number of genes that can be genes transcribed and many of those missing genes are likely Align transcript contigs de novo onto GRCh37, RefSeq functional. With the help of transcriptome Carry out transcript Quantify the expression genes, EST assemblies on Asian of unalignable RefSeq assembly, we can further detect the missing information and African novel genes of the reference genome and better understand the sequences Collect unmapped transcript contigs structure of human genome. The comparative approach of genome sequences between reference genome and Identify novel transcript other related human genomes based on the transcrip- contigs tome provides an alternative approach to refine the Validate conserved novel transcript contigs by RT- human reference genome. PCR Figure 1 Overview of identification of the missing expressed Results genes beyond the human reference genome. Human brain and Detecting transcribed regions in Asian and African novel cell transcriptome sequencing reads were used to validate the sequences transcribed regions in Asian and African novel sequences, quantify We used two transcriptome sequencing datasets from the expression of unalignable RefSeq genes and identify novel transcript contigs. two reference RNA samples established by the Chen et al. BMC Genomics 2011, 12:590 Page 3 of 9 http://www.biomedcentral.com/1471-2164/12/590