
MEETING REPORT

Highlights of the ‘Gene Nomenclature Across Species’ Meeting

Elspeth A. Bruford*

Project Coordinator, HUGO Gene Nomenclature Committee (HGNC), EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK *Correspondence to: E-mail: [email protected]

Date received (in revised form): 24th February, 2010

Abstract

The first ‘Gene Nomenclature Across Species’ meeting was held on 12th and 13th October 2009, at the Møller Centre in Cambridge, UK. This meeting, organised and hosted by the HUGO Gene Nomenclature Committee (HGNC), brought together invited experts from the fields of gene nomenclature, phylogenetics and genome assembly and annotation. The central aim of the meeting was to discuss the issues of coordinating gene naming across vertebrates, culminating in the publication of recommendations for assigning nomenclature to genes across multiple species.

Meeting summary

The meeting began with a welcome and outline of the agenda from Elspeth Bruford, one of the meeting organisers and the group coordinator for the HUGO Gene Nomenclature Committee (HGNC). HGNC has been based at the European Bioinformatics Institute (EBI) at Hinxton, UK, since 2007. Since its inception in 1979, the HGNC has been assigning gene symbols and names to all human genes, including pseudogenes and non-coding RNAs.

The first session was chaired by Jennifer Harrow, who leads the Human and Vertebrate Analysis and Annotation (Havana) group from the Wellcome Trust Sanger Institute (WTSI), also located on the Hinxton campus. This session was devoted to introducing the three established gene nomenclature groups for mammals: HGNC; the Mouse Genome Nomenclature Committee (MGNC), based at the Mouse Genome Informatics Database (MGI) at the Jackson Laboratory in Maine, USA; and the Rat Genome and Nomenclature Committee, based at the Rat Genome Database (RGD) in Milwaukee, Wisconsin, USA.

Matt Wright from the HGNC, co-organiser of the meeting, kicked off by discussing the current work of the HGNC, ‘An Essential Resource for the Genome’. Matt outlined the roles of the HGNC, including a summary of the process of symbol assignment, and its current efforts in coordinating gene naming across vertebrates. He also highlighted instances where the lack of approved gene nomenclature for most mammalian genomes has resulted in valuable published data for these species being absent or confused in the genomic databases.

He was followed by Janan Eppig, principal investigator of the MGI, who, in her talk, ‘What’s in a Name’, told us about current nomenclature issues and activities for the mouse. As well as genes, the group at MGI also name genetic markers, alleles, mutations and strains. Current efforts are focused on creating a unified gene catalogue for the mouse, by comparing gene models from the National Center for Biotechnology Information (NCBI)’s Entrez Gene database, the Ensembl database and the Havana group’s Vega database. The mouse genetics community began naming genes in a standardised way long before the human community, with the first Mouse Nomenclature Guide published in 1940. In

© HENRY STEWART PUBLICATIONS 1479–7364. HUMAN GENOMICS. VOL 4. NO 3. 213–217 FEBRUARY 2010

2003, the International Committee on Standardized Genetic Nomenclature for Mice and the Rat Genome and Nomenclature Committee agreed to unify rules and guidelines for gene, allele and mutation nomenclature in the mouse and rat. It was therefore apt that Janan was followed by Mary Shimoyama from the RGD. Mary talked about ‘Nomenclature Assignment, Review and Resolution at the Rat Genome Database’, starting with a discussion of the pipelines and software they have established for naming rat genes, quantitative trait loci (QTLs) and strains, and for making nomenclature updates and orthology assignments between rat, mouse and human. The state of the current rat genome assembly can prove problematic, and there is a need to establish a core consensus rat gene set in a manner similar to that of the Consensus CDS (CCDS) projects that are currently in place for the human and mouse genomes (PMID: 19498102). Other issues raised by Mary included problems with synchronising updates between databases, the need for timely adoption of RGD gene nomenclature by some databases, and the lack of requirement for authors to use standardised nomenclature in many journals.

The second session was chaired by Derek Stemple, head of the Vertebrate Development and Genetics group at the WTSI, and focused on the three further vertebrate nomenclature groups, starting with a report from Monte Westerfield, the principal investigator of the Zebrafish Model Organism Database (ZFIN). Zebrafish gene names are based on human names wherever possible, but the symbols are written in lower case to distinguish them from human gene symbols (which are in upper case letters) or mouse/rat symbols (which are lower case except for an initial upper case letter). Monte raised the important point that species-specific mutants can drive the naming of genes, such as the one-eyed pinhead (oep) gene in zebrafish, which is the orthologue of the human teratocarcinoma-derived growth factor 1 (TDGF1) gene.

The next speaker was Erik Segerdell from Xenbase, a Xenopus laevis and Xenopus tropicalis resource based at the University of Calgary in Alberta, Canada. As for zebrafish, Xenopus gene nomenclature is identical to human where possible; where a gene has been duplicated in Xenopus relative to mammals, the gene symbols are appended with a numeral or letter suffix to indicate this.

The newest nomenclature group, the Chicken Gene Nomenclature Committee (CGNC; PMID: 19607656), also aims to name chicken genes based on the names assigned to human genes. Alan Archibald from the Roslin Institute, Edinburgh, UK, updated us on the progress of the CGNC, which has begun its naming efforts by transferring the human gene symbols to 1:1 orthologues in chicken. To date, over 8,000 genes with a confirmed 1:1 orthologue in human have been assigned approved names by the CGNC.

After lunch, the third session turned to look at other mammalian genomes that do not have an established nomenclature group. Elizabeth Murchison from the WTSI spoke first on ‘Gene Annotation and Nomenclature in Marsupials and Monotremes’. While currently they are only represented by three ‘complete’ genomes in the public domain (namely those for the opossum, wallaby and platypus), the important positions of these non-eutherian mammals in the vertebrate phylogeny mean that they should be able to teach us some fascinating lessons about the evolution of the mammalian genome. In most cases, marsupial and monotreme genes do have clear eutherian orthologues, but Elizabeth also discussed the platypus defensin genes, which have shown us that duplication of these immune genes has independently resulted in the convergent evolution of venom in both monotremes and reptiles.

Chris Elsik and Ross Tellam, the analysis leaders of the Bovine Genome Sequencing and Analysis Consortium, then told us about the ‘Annotation of the Bovine Genome: the Easy and the Difficult’. This talk highlighted several common and recurring themes from the meeting: the importance of high coverage and a quality genome assembly; the necessity of producing a consensus gene set that is deposited in a centralised database (in this case the Bovine Genome Database, www.bovinegenome.org); and the need for expert input into specific groups and families of genes. As currently there are

no guidelines for assigning bovine gene symbols, of the 5,757 bovine gene models found in both Ensembl and Entrez Gene, over 60 per cent have different symbols assigned to them in each database, so, clearly, there is a need for standardising the nomenclature for this genome.

Jim Reecy, the bioinformatics coordination leader of the USA’s National Animal Genome Research Program, then talked to us about porcine gene annotation. To date, over 17,000 gene models have been annotated in the swine genome, of which nearly 10,000 have been projected from other species. Manual annotation, both from the Havana team at WTSI and from community annotation, is now being used to refine these gene models. Jim also mentioned the International Society of Animal Genetics (ISAG), which is an established forum for the livestock genetics community. Its genome sequence workshops could provide an excellent opportunity for gene nomenclature committees to meet.

The final speaker of this session was Noelle Cockett, the sheep genome coordinator, based at Utah State University, USA, who updated us on the ‘Assembly of the Ovine Whole Genome Reference Sequence’. The sheep genome is still in the early stages of assembly. There is currently a ‘virtual sheep genome’ available, which is based on a reorganised version of the human, dog and bovine genomes, and provides 70 per cent coverage of the ovine genome with a 0.05 per cent false positive rate. It is anticipated that the eventual Ovine Whole Genome Reference Sequence will be to a depth of 7X and will cover 95 per cent of the unique ovine genome.

Noelle was followed by a telepresentation, courtesy of Lisa Stubbs from the Kruppel Zinc Finger Catalog, based at the University of Illinois at Urbana-Champaign, USA, who discussed the ‘Rapidly Evolving Transcription Factor Genes: the KRAB-Family’. She outlined the nomenclature issues raised by these complex tandem gene families that differ significantly in gene content between species. While most zinc fingers are grouped into clusters that are found in syntenic locations, lineage-specific gene duplications and losses mean that 1:1 orthologues are rare. In over 400 human genes and over 400 mouse genes, there are only around 120 sets of 1:1 orthologues, making the direct transfer of gene names between species impossible without extensive manual curation.

The afternoon concluded with a lively discussion on nomenclature guidelines across species, chaired by Alan Archibald. All those present at the meeting agreed that it would be useful to have a common set of nomenclature rules that could be applied to any novel vertebrate genome, and that these would be based on human gene nomenclature but also take into account species-specific characteristics. This should prove an invaluable resource for assigning standardised gene names to newly sequenced genomes.

The next day, the proceedings began with two in-depth talks on complex gene families, following on from Lisa’s presentation the previous afternoon on zinc fingers. This session was chaired by Vasilis Vasiliou from the University of Denver, Colorado, USA, an expert in the aldehyde dehydrogenase family. The first talk came from Jed Goldstone from Woods Hole Oceanographic Institution in Massachusetts, USA, who studies the evolution of the cytochrome P450 (CYP) superfamily. While there are 57 CYP genes in humans, to date, over 11,500 CYP sequences have been named across species by the CYP Gene Nomenclature Committee. This relies on the dedication of David Nelson at the University of Tennessee Health Science Center (PMID: 19951895), who individually analyses and assigns names to each sequence, and consults with other experts where necessary. The CYP nomenclature divides the genes into families (40 per cent predicted amino acid identity cut-off) and subfamilies (55 per cent identity cut-off); for example, cytochrome P450 family 1, subfamily A, polypeptide 1 is CYP1A1. Several other established gene families, such as the aldehyde dehydrogenases (ALDH) and aldo-keto reductases (AKR), use similar rules for naming. Clear 1:1 CYP orthologues have the same names across species; but where the orthology is unclear, novel genes are given the next available number in the subfamily, which can prove complicated when dealing with incomplete genomes.
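The CYP naming convention described above (families at a 40 per cent amino acid identity cut-off, subfamilies at 55 per cent, and symbols of the form CYP1A1) can be illustrated with a minimal Python sketch. The function names, the treatment of the cut-offs as inclusive thresholds and the simple symbol pattern are assumptions made for illustration only; in practice, as noted above, each sequence is analysed and named individually by an expert curator.

```python
import re
from typing import NamedTuple

# Cut-offs quoted in the text (per cent predicted amino acid identity).
# Treating them as inclusive thresholds is an assumption of this sketch.
FAMILY_CUTOFF = 40.0
SUBFAMILY_CUTOFF = 55.0


class CypName(NamedTuple):
    family: int        # e.g. 1 in CYP1A1
    subfamily: str     # e.g. "A" in CYP1A1
    polypeptide: int   # e.g. the final 1 in CYP1A1


# Simplified pattern: root "CYP", family number, subfamily letter(s), polypeptide number.
# Suffixes such as pseudogene markers are deliberately not handled here.
_CYP_PATTERN = re.compile(r"^CYP(\d+)([A-Z]+)(\d+)$")


def parse_cyp_symbol(symbol: str) -> CypName:
    """Split a symbol such as 'CYP1A1' into family, subfamily and polypeptide."""
    match = _CYP_PATTERN.match(symbol.upper())
    if match is None:
        raise ValueError(f"not a recognised CYP symbol: {symbol!r}")
    family, subfamily, polypeptide = match.groups()
    return CypName(int(family), subfamily, int(polypeptide))


def shared_level(percent_identity: float) -> str:
    """Return the deepest level two sequences would share under the cut-offs."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= SUBFAMILY_CUTOFF:
        return "subfamily"
    if percent_identity >= FAMILY_CUTOFF:
        return "family"
    return "none"


print(parse_cyp_symbol("CYP1A1"))  # CypName(family=1, subfamily='A', polypeptide=1)
print(shared_level(47.5))          # family
```

A sketch like this only captures the arithmetic of the cut-offs; it does not address the harder problem raised in the talk, namely choosing the next available number in a subfamily when the genome is incomplete.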


Jed was followed by Doron Lancet, the principal investigator of the Human Olfactory Data Explorer (HORDE) and GeneCards databases. Olfactory receptor (OR) genes encode seven-helix G-protein-coupled receptors and comprise the largest gene superfamily in the human genome, with a total of 855 genes. Of these, around 370 are predicted to encode functional proteins, around 60 are segregating pseudogenes (i.e. can encode both functional and non-functional alleles in the human population) and the remainder are pseudogenes. In humans, this superfamily has been named using a similar nomenclature system to that for the CYPs, with divisions into families and subfamilies. The HORDE database currently contains data on the human, chimp, dog, opossum and platypus olfactory receptor repertoires, and Doron showed us how these repertoires can vary significantly between vertebrate species. Nevertheless, the presence of putative ancestral OR clusters helps in the identification of orthologues between species, and hence could enable the current human nomenclature scheme to be expanded to other species in combination with expert manual curation.

The final presentation session of the meeting concentrated on multi-species databases and orthology resources, and was chaired by Ewan Birney, a senior scientist from the EBI at Hinxton, and one of the principal investigators of the Ensembl database. The first speaker was Donna Maglott from the NCBI in Bethesda, Maryland, USA, who told us about ‘Naming Genes at NCBI’. The NCBI’s Entrez Gene database contains data from multiple species that do not yet have a nomenclature authority. Currently, names are assigned to genes in these species based on their homology to a gene with an informative name (as calculated by HomoloGene, which uses a pairwise gene comparison-based approach). Hence, the HGNC name is projected to non-rodent mammals, the MGNC name to rodents excluding rat (which is given RGD names), the ZFIN name across fishes, the CGNC name across birds and the Xenbase name across amphibia. These assignments exclude olfactory receptors and genes from other known complex families. To date, this has allowed over 11,000 dog genes and over 12,000 chimp genes in Entrez Gene to be assigned a meaningful name automatically, based on their human orthologue in HomoloGene.

The next speaker was Albert Villela from the EnsemblCompara group, based at the EBI, who talked about the ‘EnsemblCompara GeneTrees: Gene Orthologs and Paralogs in Ensembl’. Albert explained how EnsemblCompara produces complex gene trees that identify both orthologues and paralogues using data from all the species in Ensembl and using the longest translation of each gene. In a similar situation to that at Entrez Gene, any 1:1 orthology assignments produced by Compara are then used to project gene names from human genes to other vertebrates, excluding zebrafish and rodents.

The final speaker of the meeting was Leo Goodstadt from the MRC Functional Genomics Unit at Oxford University, UK. Leo’s talk, entitled ‘Accurate Inferences of Orthology Among Closely Related Species’, began by outlining the different methods of predicting orthology. He stated that different phylogenetic methods often offer comparable accuracy, which can be improved by taking into account conserved gene order (synteny), and that phylogenetic inferences are mostly limited by problems with the genomic data and information content of the sequence. By looking at the human, mouse, dog, opossum, platypus and chicken genomes, he has identified a set of 9,675 1:1 orthologues. He suggested that these comprise a core, conserved non-duplicating gene set that exists between vertebrate species. This set would comprise clear candidates for easily transferring gene names between species.

The meeting concluded with a discussion chaired by David Landsman, Chief of the Computational Biology Branch of the NCBI, on how to implement gene nomenclature across species in the databases. This interesting debate concluded that coordination between databases and orthology resources is required to identify a core set of agreed 1:1 orthologues between any given species. Such a consensus set could then be candidates for automatic transferral of gene names

between species. Everyone also agreed that it is clear that some complex gene families cannot be named in an automated manner, and that expert manual curation is required and should be sought for these families.

Conclusions

The key points agreed as a result of this meeting can be summarised as follows:

• Gene nomenclature should, where possible, reflect homologous relationships across vertebrate species;
• Consensus naming, predominantly based on human gene nomenclature, has already been implemented between six vertebrate species (human, mouse, rat, chicken, zebrafish and Xenopus), and this effort should be expanded to other vertebrate genomes;
• Care must be taken when attempting to assign gene names in ‘incomplete’ genomes, and to avoid ‘humanisation’ of non-human genomes;
• Guidelines for the naming of genes across vertebrates should be published; these will build on current guidelines and include basic rules for the naming of paralogues;
• A list of complex gene families, which will require expert manual curation for cross-species nomenclature, should be compiled;
• Potential funding should be sought for curation of the nomenclature of these complex gene families and the construction of a database framework for superfamily nomenclature;
• The formation of novel species-specific gene nomenclature committees should be encouraged, with the aim of at least one per order for mammals;
• Automated naming efforts should initially concentrate on consensus 1:1 orthologues as identified by at least two independent and comprehensive orthology resources;
• There is a need to increase community awareness of standardised gene nomenclature, especially in journals.

Acknowledgments

The HGNC would like to acknowledge that this meeting was made possible by funding from NHGRI grant P41 HG03345 and Wellcome Trust grant 081979/Z/07/Z.
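As an illustrative aside on the recommendation that automated naming should start from consensus 1:1 orthologues identified by at least two independent orthology resources, the idea can be sketched in Python as follows. The resource names, gene identifiers and function names below are hypothetical and chosen purely for illustration; the sketch assumes each resource can be reduced to a list of (human symbol, target gene) pairs and does not reflect how any particular database implements name projection.

```python
from typing import Dict, Iterable, Set, Tuple

# (human gene symbol, gene identifier in the target species)
Pair = Tuple[str, str]


def one_to_one(pairs: Iterable[Pair]) -> Set[Pair]:
    """Keep only pairs in which each gene has exactly one partner."""
    unique_pairs = set(pairs)
    human_counts: Dict[str, int] = {}
    target_counts: Dict[str, int] = {}
    for human, target in unique_pairs:
        human_counts[human] = human_counts.get(human, 0) + 1
        target_counts[target] = target_counts.get(target, 0) + 1
    return {
        (human, target)
        for human, target in unique_pairs
        if human_counts[human] == 1 and target_counts[target] == 1
    }


def consensus_names(
    resources: Dict[str, Iterable[Pair]], min_support: int = 2
) -> Dict[str, str]:
    """Map target gene -> human symbol for 1:1 pairs seen in >= min_support resources."""
    support: Dict[Pair, int] = {}
    for pairs in resources.values():
        for pair in one_to_one(pairs):
            support[pair] = support.get(pair, 0) + 1
    agreed = {pair for pair, count in support.items() if count >= min_support}
    # Drop any agreed pairs that still conflict with one another.
    return {target: human for human, target in one_to_one(agreed)}


# Hypothetical input: two orthology resources reporting human-to-dog pairs.
example_resources = {
    "resource_a": [
        ("TDGF1", "dog_1"),
        ("ALDH1A1", "dog_2"),
        ("OR1A1", "dog_3"),  # one-to-many, so excluded from automated naming
        ("OR1A1", "dog_4"),
    ],
    "resource_b": [
        ("TDGF1", "dog_1"),
        ("ALDH1A1", "dog_9"),  # the two resources disagree on this gene
    ],
}

print(consensus_names(example_resources))  # {'dog_1': 'TDGF1'}
```

In this toy input only one gene is named automatically; the disputed and one-to-many cases are left unnamed, which mirrors the point agreed at the meeting that such genes should be referred for expert manual curation.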
