Designating Eukaryotic Orthology Via Processed Transcription Units Meng-Ru Ho1,2,3, Wen-Jung Jang2,3,4, Chun-Houh Chen5, Lan-Yang Ch’Ang3 and Wen-Chang Lin1,3,*
Total Page:16
File Type:pdf, Size:1020Kb
3436–3442 Nucleic Acids Research, 2008, Vol. 36, No. 10 Published online 29 April 2008 doi:10.1093/nar/gkn227 Designating eukaryotic orthology via processed transcription units Meng-Ru Ho1,2,3, Wen-Jung Jang2,3,4, Chun-houh Chen5, Lan-Yang Ch’ang3 and Wen-chang Lin1,3,* 1Institute of Biomedical Informatics, National Yang-Ming University, 2Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, 3Institute of Biomedical Sciences, Academia Sinica, Taipei, 4Institute of Bioinformatics, National Chiao-Tung University, Hsinchu and 5Institute of Statistical Sciences, Academia Sinica, Taipei, Taiwan Received January 4, 2008; Revised April 8, 2008; Accepted April 11, 2008 ABSTRACT functional annotations. Orthologs are defined as genes in different species that originated from a single gene locus in Orthology is a widely used concept in comparative the last common ancestor (1–3). Therefore, orthology is and evolutionary genomics. In addition to prokar- a strong indicator of functional conservation, allows yotic orthology, delineating eukaryotic orthology has genome annotation based on information available from provided insight into the evolution of higher organ- other species, provides raw material for evolutionary anal- isms. Indeed, many eukaryotic ortholog databases ysis and comparative genomics and identifies taxonomi- have been established for this purpose. However, cally restricted sequences. However, computer-generated unlike prokaryotes, alternative splicing (AS) has ortholog designation is complicated because it demands hampered eukaryotic orthology assignments. There- knowledge of the ancestral state of genes and requires fore, existing databases likely contain ambiguous knowledge of complete gene repertoires. It is particularly eukaryotic ortholog relationships and possibly challenging with respect to complex eukaryotic genomes due to gene duplication, conversion to pseudogenes, and misclassify alternatively spliced protein isoforms gene loss and fusion. as in-paralogs, which are duplicated genes that There are several approaches for delineating ortho- arise following speciation. Here, we propose a new logous genes. The phylogenetic tree-based methods approach for designating eukaryotic orthology using HOVERGEN (4) and TreeFam (5) provide good accuracy, processed transcription units, and we present an but because the trees are constructed by multiple sequence orthology database prototype using the human and alignments or are corrected by experts, they provide mouse genomes. Currently existing programs cover limited comprehensiveness, homogeneity of quality and less than 69% of the human reference sequences expandability to new species. Alternatively, the Best- when assigning human/mouse orthologs. In con- Reciprocal-Hits (BRHs) method clusters orthologous trast, our method encompasses up to 80% of the genes based on their whole-length protein sequence human reference sequences. Moreover, the ortholog similarities and was first introduced by the Cluster of database presented herein is more than 92% con- Orthologous Groups (6) database. Phylogenetically, BRHs could be interpreted as genes from different species sistent with the existing databases. In addition to having the shortest connecting path over the distance- managing AS, this approach is capable of identifying based tree. The identification of BRHs is widely adopted orthologs of embedded genes and fusion genes in comparative genomics for its simplicity and feasibil- using syntenic evidence. In summary, this new ity of application to large-scale data. For example, approach is sensitive, specific and can generate a Ensembl Compara (7) (http://www.ensembl.org/biomart/ more comprehensive and accurate compilation of martview/) identifies orthologous genes by searching eukaryotic orthologs. BRHs between paired species. This notion has also been extended and applied to other approaches. Guided by a sequence similarity-based tree, HomoloGene (http:// www.ncbi.nlm.nih.gov/sites/entrez?db=homologene) uses INTRODUCTION a blastp program to group proteins from input organisms The rapidly growing number of available complete and to restrict the molecular distance to prevent unlikely genome sequences attests to the insistent need for orthologs from being grouped together. Inparanoid (8,9) *To whom correspondence should be addressed. Tel: +(886) 2 2652 3967; Fax: +(886) 2 2782 7654; Email: [email protected] ß 2008 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2008, Vol. 36, No. 10 3437 (http://inparanoid.sbc.su.se/cgi-bin/index.cgi) assigns the (http://www.ncbi.nlm.nih.gov/RefSeq/), (http://www.ncbi. orthologous pair having the shortest distance among nlm.nih.gov/Genomes/). There were 40 814 coding the BRHs of a given ortholog group as the anchor, and sequences (NM_ and XM_) among the 41 677 human identifies other additional co-orthologs as in-paralogs. reference sequences and 48 797 (NM_ and XM_) coding Allowing ortholog designations among multiple genomes, sequences among the 50 481 mouse reference sequences. OrthoMCL (10,11) utilizes the Markov Cluster algorithm, This study focused only on the coding sequences. The a probabilistic model that groups (putative) orthologs and detailed genomic locations of human reference sequences paralogs. were obtained and downloaded from the NCBI ftp site: All currently available databases mentioned above are (ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/H_sapiens/ based on protein sequence alignments and do not take mapview/). This information is used as the gold standard alternative splicing (AS) events into full consideration. for further dataset comparison and examination. In AS in eukaryotic genomes plays an important role in comparison of various datasets, we did find some incon- augmenting biological complexity (12) such that a single sistent coordinates of annotated genes from different gene can result in the generation of multiple proteins with data sources, which could be resulted from different high similarity. However, these proteins are isoforms and genomic mapping processes or distinct genome assembly should not be annotated as in-paralogs or orthologs. versions. We then performed subsequent re-alignment Unfortunately, the current eukaryotic ortholog databases process to ensure the genomic transcription locations and all discard AS by utilizing an all-against-all protein to extract genomic sequences in order to perform comparison that cannot exclude isoforms, which results orthologous gene database comparison on the identical in aberrant assignment of orthologs and in-paralogs. For version of genome build. example, the gene SORBS2 has two NM entries for human and several dozen XM entries in the transitory The gene-oriented ortholog database mouse annotation of May 2006 at our initial data exam- Human and mouse genomes were used to demonstrate ination (Supplementary Table 1). Identifying the ortholog the ortholog assignment capability of our program, and of SORBS2 between human and mouse without taking the resulting data are presented as the Gene Oriented AS into account would be a challenging task. Thanking Ortholog Database (GOOD). Figure 1A illustrates the the continuous efforts of the reference gene curation at basic methodology used to designate orthologs based on National Center for Biotechnology Information (NCBI) the processed transcription units of gene-oriented genomic and of the MGI (mouse genome informatics) project, this regions. We utilized 36 018 human reference sequences gene now has a single NM entry in the current mouse and 41 815 mouse reference sequences to identify 21 544 annotation. Therefore, taking the BRH of human iso- human and 22 511 mouse transcription regions. Among forms as the only annotated ortholog at the beginning of these transcription regions, we identified 17 214 orthologs. this study would derive incomplete ortholog annotations. This serves as an example that such issues could create Identifying genomic transcription regions assignment inconsistencies among different databases with separated update schedules. Alternatively, isoforms might To define transcription regions within genomic sequences, be misclassified as in-paralogs by methods such as we used the Blast-like alignment tool (BLAT) (15) Inparanoid, which assigns an anchor pair based on the program to determine the genome locations of reference best alignment score and leaves the rest as in-paralogs. sequences. Several of the human and mouse reference Therefore, including AS in orthologous gene identification sequences from NCBI have poly-A tails of varying is crucial for accuracy. lengths. These poly-A tails reduce the apparent sequence In this study, we propose a new approach for delineat- identities calculated by BLAT in subsequent analyses. ing orthologs among species that are somewhat related. Therefore, we removed all poly-A tails from the reference We utilized processed transcription units rather than sequences to increase the accuracy of the BLAT results. protein sequences as the BRHs input. This accounts for Because BLAT provides all possible alignment results for AS by recognizing gene-oriented genome regions, but it each sequence, we needed to define a threshold to maintains the advantages provided by sequence align-