Dispatch R311

Mouse genomics: Making sense of the sequence Ian J. Jackson

Interpretation of the human genome sequence relies on studies of model genetic organisms. Mouse genetics and genomics will help to identify all the genes, and to determine their function.

Address: MRC Human Genetics Unit, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK.

Current Biology 2001, 11:R311–R314

0960-9822/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved.

For once the hype surrounding the publication of the draft human genome sequence [1,2] is justified; practically all of the important human DNA now resides in publicly accessible databases. Of course this fantastic resource generates innumerable questions, but these boil down to two fundamental ones. Where are the genes? And what does each gene do?

Ongoing work on the mouse genome will provide important leads to answering these questions. One commonly used method for identifying genes in genomic sequence is to find the sequence represented in cDNA libraries, and a recently published collection of mouse cDNA sequences adds substantially to these [3]. Furthermore, the similarity between mouse and human genomes means that the respective genomic DNA sequences can be easily aligned, and the most highly conserved segments (mostly corresponding to coding exons) readily spotted. Several mouse genome sequencing projects are at various stages of production of qualitatively different datasets, all of which will be invaluable for annotating the human sequence.

Finally, a recently announced international consortium aims to discover a function for all mouse genes, and by homology for all human genes [4]. The genome projects have brought to biology a new modus operandi; just as molecular technologies revolutionised cell and developmental biology 15 or 20 years ago, so genomic approaches will fundamentally change the way we carry out biological experimentation.

Mouse as a genetic model

Many organisms are valid genetic models of humans. If a human gene has a clearly identifiable equivalent in another sequenced genome, then useful information should be gained from studying the model organism. There are over 1,000 genes that are present in single copy in the human, nematode worm and fruitfly genomes [1]. On the whole, however, the gene content in invertebrate models appears to be quite different from human. For example, humans encode 616 G-protein-coupled receptors, whereas flies have 146 and worms, 284 [2]. In many cases it is just not possible to find a gene that is equivalent in a comparison of human to invertebrates. By contrast, the mouse is evolutionarily much closer to humans and its gene content is largely identical. The mouse and human genomes are derived from a common mammalian ancestor, and as few as 100 chromosomal rearrangements separate each genome from this ancestral genome [1,5]. Genes linked in one species are often linked in the other, and genomic sequence comparisons indicate that gene order is typically conserved over many megabases of DNA [1].

More mouse cDNAs

Expressed sequence tags, or ESTs, are short sequence reads, typically of a few hundred bases, from the ends of cDNA clones. A few caveats aside, these represent transcripts and therefore genes, and so have been invaluable for finding genes in genomic sequence. Millions of ESTs have been produced from hundreds of cDNA libraries. Projects such as Unigene (www.ncbi.nlm.nih.gov/Unigene) have used automated methods to assign ESTs into clusters on the basis of sequence matches, and these clusters provide one estimate of gene number. These are overestimates, partly because a substantial fraction of ESTs derive from genomic DNA rather than cDNA, and partly because multiple, non-overlapping clusters can derive from the same gene. Furthermore, ESTs by their nature are short ‘tags’, which often do not contain the protein coding segment of the transcript, and so do not help in cataloguing functional gene content. As they lack conserved features (coding potential), they are often not useful for cross-species comparisons either.

A better representation of mouse cDNA has been produced by human curation and annotation by the Genome Exploration Group at RIKEN, Japan [3]. In this project, almost one million mouse cDNA sequence tags from numerous libraries were clustered, from which about 21,000 clones were sequenced. These sequences contained redundancies identifiable by cross comparisons, and further redundancies in which non-overlapping sequences derived from the same known gene. By extrapolating the incidence of this latter redundancy across their collection of novel sequences, the authors could estimate that they have representatives of just under 13,000 unique genes. RIKEN hosted an annotation ‘jamboree’ at which curators examined the sequences and, where possible, assigned definitions to the cDNA on the basis of likely function or similarity to known genes. A key tool in the annotation is the vocabulary developed by the Gene Ontology (GO) Consortium [6]. GO annotations assign to gene products standard terms that describe the biological process, the molecular function and the cellular component or location of the product. GO terms are intended to enable gene function and content information to be readily interpretable across species.

This cDNA resource has already proved useful in measuring the gene content of the draft human sequence. The International Human Genome Sequencing Consortium attempted to compile an index of genes from the available sequence [1]. They derived a list of over 31,000 predictions, almost 15,000 of which are known genes, with about 17,000 predicted by various methods (which probably have a fairly high false-positive rate). When the RIKEN set was compared to the 31,000 predicted genes, 69% of them showed sequence similarity. If the same RIKEN sequences were used to search the total human sequence, 81% found a match. So 81% of the mouse cDNA set detects a match against the whole human genome sequence, but only 69% pick up hits to the human gene index, indicating that the human gene index underrepresents the gene set and contains only 69/81, or 85%, of the mouse cDNAs that match the genome. The reverse comparison, of the human set to the RIKEN collection, found that 69% were represented; for known genes the figure was 78%. So the comparisons indicate that both collections of genes are incomplete, but there are some problems in deciding how incomplete. The mouse cDNAs were selected to bias for novel genes, but we do not know the effect of that bias on overall representation in the collection (although we know that only 78% of known genes are present).

Mouse genome sequencing

These comparisons, as well as other analyses, are the basis for the surprisingly low prediction for the human gene number of 32,000 (another, higher estimate based on the same data by F.A. Wright et al. can be found in an electronic preprint available at genomebiology.com). A firmer estimate will come from doing a whole genome comparison of human to mouse. All the current methods used to predict genes in genomic sequence are subject to error. Methods using matches to cDNA may underestimate because of incomplete representation in the libraries, whereas ab initio methods produce overestimates that must be tempered by additional evidence, such as similarity to already described genes, which in turn will overcompensate and miss truly novel genes. Mouse genomic sequence will give a new and powerful means of finding genes. The human and rodent lineages diverged sufficiently long ago that only sequences subject to selection, such as exons, will retain extensive similarity.

Coincident with the publication of the draft human sequence, a consortium funded by public, charity and industrial sources released the first batch of the mouse genome sequence, with the intention of releasing threefold coverage by April 2001. This sequence is a whole genome shotgun, which essentially means it consists of random reads, each of a few hundred base pairs, from throughout the genome, and these will not overlap into larger contiguous segments to any significant degree. Instead, the intention is that the mouse data will align along the human sequence, indicating conserved sequence. This is currently viewable at the Ensembl web site (www.ensembl.org). At the moment, these mouse matches should be treated with caution, but indications are that they will be useful cross-species sequence tags, whose location in the human genome is defined.

The mouse genome is also being sequenced at a higher level of coverage in a clone-by-clone approach. Much of the genome will be completed at ‘draft’ level in 2001, and certain regions have been targeted for production of finished sequence by October 2002 [7]. These sequences will enable a large-scale overview of sequence similarity between mouse and human genomes, and identification of likely conserved exons and other features. Figure 1 shows a ‘percent identity plot’ of a 100 kilobase region of the X chromosome of mouse and human [8,9]. Exons of known genes are clearly distinguished by the higher percent identity compared to surrounding DNA, and putative novel genes may be identified.

Figure 1

Percent identity plot (PIP) comparing a region of mouse and human X chromosomes [8,9]. The human sequence is represented on the abscissa and percentage sequence identity is plotted on the ordinate. The symbols above the plot represent features of the human sequence, including confirmed and putative exons which are depicted as numbered black boxes. ECRA1–ECRA3 are evolutionarily conserved regions that may predict a gene. Caltractin and Nsdhl are known mouse and human genes. (Figure from [9].)

[Plot not reproduced. Five panels of 20 kb each cover 0k to 100k of the human sequence, with the identity axis running from 50% to 100%. Feature labels recoverable from the plot: magea9, 6.4f2, ECRA1–ECRA3, Caltractin (exons 5 to 1) and Nsdhl (exons 1 to 8).]

More mouse mutations

Mouse studies will also be of key importance in providing information about gene function. The phenotype of a mouse with a mutation in a single gene provides clear evidence of at least one function of that gene. There are several ways that such mutant mice can be made, the most widely used over the past decade being the targeting of genes by homologous recombination in embryonic stem cells. Several thousand genes have so far been mutagenised in this way, but this is a long way short of the total number in the genome, and the technique is very labour intensive. Methods have been developed to accelerate the stem cell approach, in particular the use of gene traps, which cause mutations by the random insertion of marker DNA, via which the insertion site can be sequenced to identify the disrupted gene before a mutant mouse is generated. This is a genotype-driven technology, in that the identity of the gene is known before the mutation is made.

A complementary, phenotype-driven approach is to create point mutations at random and identify mice with informative phenotypes from the progeny. The availability of the mouse genome sequence should permit the mutated genes to be readily identified. The last few years have seen the development of numerous phenotype-driven programmes, utilising the powerful mutagen ethyl nitrosourea (ENU). Initial studies, in the UK and in Germany, have catalogued several hundred new dominant mutations conferring a range of behavioural, developmental and other phenotypes [10,11].

More ENU programmes are now underway throughout the world. The NIH has funded several large collaborative projects which are looking for dominant and recessive mutations that affect development, nervous system function and complex behaviour [7]. Screens elsewhere in the world are also expanding to find recessive mutations. The mouse genetics community has recognised that it is now possible to set the goal of creating a collection of mouse lines that together have a mutation in each gene of the genome. A recent publication in Science [4] marks the formation of the International Mouse Mutagenesis Consortium, which brings together geneticists from across the world with the common goal of assigning a function to every gene.

The genomic view of biology

About 20 years ago, the techniques of molecular biology began to be used to study cell biology and developmental biology. It was not only the methodology that was brought to bear, but also a particular philosophy, which was that complex processes could be reduced to simple, tractable interactions. Now, another fundamental change is underway in biology with the advent of genomic techniques. Mass collections of data, whether sequence, expression profiles, protein content, molecular interactions or mutations, generate information on whole systems rather than on isolated parts. The philosophy is that we can gain meaningful information from problems whose answers are too large and complex to be written in a lab notebook, and can only reside in a computer. Biologists will have to change the way we think about experiments to take advantage of these resources.
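
The gene index comparison discussed under "More mouse cDNAs" rests on a simple ratio: the share of mouse cDNAs that hit the human gene index, divided by the share that hit the human genome at all. The following is a minimal Python sketch of that arithmetic, using only the two percentages quoted in the text. The function and variable names are illustrative assumptions and do not come from the cited studies.

```python
# Percentages quoted in the text for the RIKEN mouse cDNA set.
# Variable and function names are hypothetical, chosen for illustration.
frac_hit_genome = 0.81      # cDNAs matching anywhere in the draft human sequence
frac_hit_gene_index = 0.69  # cDNAs matching the predicted human gene index


def index_completeness(hit_index: float, hit_genome: float) -> float:
    """Share of genome-detectable cDNAs that the gene index also captures."""
    if not 0.0 < hit_genome <= 1.0:
        raise ValueError("hit_genome must be a fraction in (0, 1]")
    if not 0.0 <= hit_index <= hit_genome:
        raise ValueError("hit_index must lie between 0 and hit_genome")
    return hit_index / hit_genome


completeness = index_completeness(frac_hit_gene_index, frac_hit_genome)
print(f"Gene index captures {completeness:.0%} of matching cDNAs")  # 85%
```

The ratio only measures the index against cDNAs that match the genome at all, so it says nothing about genes missing from both collections. This is one reason the text treats the degree of incompleteness as uncertain.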

References

1. International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.
2. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al.: The sequence of the human genome. Science 2001, 291:1304-1351.
3. The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM Consortium: Functional annotation of a full-length mouse cDNA collection. Nature 2001, 409:685-689.
4. The International Mouse Mutagenesis Consortium: Annotating genome sequences with biological functions in mice. Science 2001, 291:1251-1255.
5. Nadeau JH, Taylor BA: Lengths of chromosomal segments conserved since divergence of mouse and man. Proc Natl Acad Sci USA 1984, 81:814-818.
6. The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nat Genet 2000, 25:25-29.
7. Graham B, Battey E, Jordan E: Report of second follow-up workshop on priority setting for mouse genomics. Mamm Genome 2001, 12:1-2.
8. Hardison RC, Oeltjen J, Miller W: Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res 1997, 7:959-966.
9. Mallon AM, Platzer M, Bate R, Gloeckner G, Botcherby MR, Nordsiek G, Strivens MA, Kioschis P, Dangel A, et al.: Comparative genome sequence analysis of the Bpa/Str region in mouse and man. Genome Res 2000, 10:758-775.
10. Nolan PM, Peters J, Strivens M, Rogers D, Hagan J, Spurr N, Gray IC, Vizor L, Brooker D, Whitehill E, et al.: A systematic, genome-wide, phenotype-driven mutagenesis programme for gene function studies in the mouse. Nat Genet 2000, 25:440-443.
11. Hrabé de Angelis M, Flaswinkel H, Fuchs H, Rathkolb B, Soewarto D, Marschall S, Heffner S, Pargent W, Wuensch K, Jung M, et al.: Genome-wide, large-scale production of mutant mice by ENU mutagenesis. Nat Genet 2000, 25:444-447.