Comprehensive Comparative Homeobox Gene Annotation in Human and Mouse Laurens G
Total Page:16
File Type:pdf, Size:1020Kb
Database, 2015, 1–12 doi: 10.1093/database/bav091 Original Article Original Article Comprehensive comparative homeobox gene annotation in human and mouse Laurens G. Wilming*,Veronika Boychenko and Jennifer L. Harrow HAVANA Group, Informatics Department, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK *Corresponding author: Tel: +44(0)1223496830, Fax: +44(0)1223494919 Email: [email protected] Citation details: Wilming,L., Boychenko, V. and Harrow,J. Comprehensive comparative homeobox gene annotation in human and mouse. Database (2015) Vol. 2015: article ID bav091; doi:10.1093/database/bav091 Received 12 February 2015; Revised 28 August 2015; Accepted 31 August 2015 Abstract Homeobox genes are a group of genes coding for transcription factors with a DNA-bind- ing helix-turn-helix structure called a homeodomain and which play a crucial role in pat- tern formation during embryogenesis. Many homeobox genes are located in clusters and some of these, most notably the HOX genes, are known to have antisense or oppo- site strand long non-coding RNA (lncRNA) genes that play a regulatory role. Because automated annotation of both gene clusters and non-coding genes is fraught with diffi- culty (over-prediction, under-prediction, inaccurate transcript structures), we set out to manually annotate all homeobox genes in the mouse and human genomes. This includes all supported splice variants, pseudogenes and both antisense and flanking lncRNAs. One of the areas where manual annotation has a significant advantage is the annotation of duplicated gene clusters. After comprehensive annotation of all homeobox genes and their antisense genes in human and in mouse, we found some discrepancies with the current gene set in RefSeq regarding exact gene structures and coding versus pseudogene locus biotype. We also identified previously un-annotated pseudogenes in the DUX, Rhox and Obox gene clusters, which helped us re-evaluate and update the gene nomenclature in these regions. We found that human homeobox genes are enriched in antisense lncRNA loci, some of which are known to play a role in gene or gene cluster regulation, compared to their mouse orthologues. Of the annotated set of 241 human protein-coding homeobox genes, 98 have an antisense locus (41%) while of the 277 orthologous mouse genes, only 62 protein coding gene have an antisense locus (22%), based on publicly available transcriptional evidence. Introduction Homeobox genes were first discovered in Drosophila mu- Homeobox genes code for transcription factors that have tants where they were found to affect segmentation and the homeodomain, a DNA-binding helix-turn-helix struc- subsequently they have been found in virtually all other ture encoded by the homeobox, as the defining feature (1). animals and in plants and fungi (2). Through their VC The Author(s) 2015. Published by Oxford University Press. Page 1 of 12 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. (page number not for citation purposes) Page 2 of 12 Database, Vol. 2015, Article ID bav091 influence on patterning and cell differentiation and reprog- (16–18). It was shown by Zhong et al.(6) that the DUX ramming, homeodomain family proteins play an important subclass of the PRD domain family in human (with our role in embryogenesis (2–4). new annotation 47 members in total) is mostly clustered A large number of homeobox genes exist in gene clus- on chromosomes 4, 10 and Y, with isolated family mem- ters (4). Automatic annotation of cluster genes formed by bers found on other chromosomes, while a total of only six genomic duplication is hampered by high sequence similar- DUX members (eight with our new annotation) were ity between the genes and, in addition, it can be difficult to found in mice, with three of these (Duxbl1, Duxbl2 and distinguish coding genes from pseudogenes. In order to Duxbl3) found in a locally triplicated region on chromo- generate a complete and accurate homeobox gene set, we some 14 (19). Data presented in the paper showed that the initiated an annotation project within the ENCODE (5) human genome is enriched in neighboring lncRNAs com- consortium, focused on homeobox family genes in the pared with the mouse genome and in some genomic re- human and mouse genomes, using the HomeoDB database gions a human protein coding gene had an antisense (6–8) as our main reference source. lncRNA where the mouse orthologue had an opposite The emerging data on the role of long non-coding RNAs strand lincRNA and vice versa. This finding implies that (lncRNAs) in epigenetic regulation (9) inspired us to annotate the antisense nature of non-coding RNAs (as currently all lncRNAs—which includes long intergenic ncRNAs defined) is not as crucial as the simple presence of opposite (lincRNAs), anti-sense lncRNAs and sense intronic strand lncRNAs in the vicinity of a coding gene or gene lncRNAs—in the vicinity of the homeobox genes. Human cluster. This observation is in line with emerging experi- cell line microarray data on HOX clusters (10) revealed strik- mental data showing a more complex functionality of ingly coordinated transcriptional activity antisense to HOX lncRNAs than that which could be drawn from their gen- genes in intergenic regions, suggesting that previously over- omic position relative to coding genes (20). looked lincRNAs play an important role in gene expression In this article, we present an updated analysis of the regulation through yet to be discovered mechanisms. homeobox gene containing regions in human and mouse Recently, the HAVANA group (11) developed new guidelines and highlight the similarities and differences of architecture for lncRNA annotation allowing us to distinguish two main within each genome and give insights into their evolution. groups of lncRNAs according to their genomic location rela- tive to coding genes: antisense and intergenic (lincRNA). This positional classification allows researchers to study the correl- Methods ation, if any, between the expression of different types of Annotation was performed using our in-house Otterlace an- non-coding loci and protein-coding genes. We define a locus notation system, which includes the ZMap graphical ana- (or single transcript) as antisense if it is positioned on the op- lysis and annotation viewer (21, 22). Briefly, genomic posite strand of a protein-coding gene and their maximum sequence in the form of genomic clones (mostly BACs) is genomic spans overlap; intergenic non-coding RNAs, i.e. not analysed through an automated pipeline comprising se- overlapping a protein coding locus, receive the lincRNA bio- quence similarity searches against peptide and nucleotide se- type. Experimental data on a number of antisense lncRNAs quence databases and analysis for repeats, protein domains, obtained in several laboratories has not shown a direct gene CpG islands and gene predictions. In addition, data from a silencing effect caused by antisense transcripts or siRNAs large number of external sources are imported, such as derived from it. However, it was shown that some partici- ENCODE and Ensembl BodyMap (23) RNA-seq transcript pated in other regulatory processes in cis, such as histone models, RNA-seq reads, polyA-seq, Ensembl and RefSeq demethylation, and, more interestingly, in trans,suchasregu- gene models, and CAGE-TSS transcription start site predic- lating genes on the same or a different chromosome by partic- tions (24, 25). Using the annotation system, annotators visu- ipating in polycomb mediated biochemical pathways (12, 13) alize the analysis results and where necessary perform or as ceRNA (competitive endogenous RNA) ‘sponges’ regu- additional analysis, and annotate transcript models where lating the distribution of miRNAs (14). evidence is deemed to support such models. Support comes In general, homeobox genes and gene clusters are con- primarily from aligned sequence data (ESTs, mRNAs, pep- served between the human and mouse genomes. However, tides) from the analysis pipeline with additional features, some families exist in one species but not in the other, or such as functional genomics, transcriptomics and prote- have expanded in one species relative to another. For ex- omics data, taken into account, as outlined in our annota- ample, Obox (oocyte specific homeobox) clusters (15) are tion guidelines (11) and described elsewhere (26–29). specific to mouse (or rodents) while the Rhox (reproduct- Relative levels of sequence similarity referred to in this paper ive homeobox) family is represented by just three members were judged from BLAST or Dotter (30) alignments or, in in human versus 42 members on mouse chromosome X the case of RHOXF versus Rhox, CLUSTALW. Alignments Database, Vol. 2015, Article ID bav091 Page 3 of 12 Figure 1. The Obox cluster and its neighbourhood compared to the orthologous region in human. Figure is not to scale. See figure for a guide to sym- bols and colours. Overlapping symbols on same strand indicate nested genes; overlapping symbols on opposite strands indicate antisense genes. Gene names in italic between brackets indicate—for yet to be named coding genes—the name of the family or closest homologue or—for pseudo- genes—the name of the parent gene or gene family; approved gene names are in bold; pseudogene and lncRNA names are in italic. Some unnamed genes are provided with RefSeq or VEGA identifiers (for the latter, prefix the 11-digit number with OTTHUMG or OTTMUSG for the full ID for human and mouse, respectively). Core duplicated gene cassettes are boxed. Note the complete absence of any OBOX loci in the human genome between the orthologues of the mouse genes that flank its Obox cluster. The bulk of the expansion of the cluster, which contains 52 Obox genes, appears to have been through the tandem duplication of a six-gene cassette—Obox–Obox3–Obox4–Gtpbp4–Ranbp2–Obox—of which eleven copies (not all complete) are present.