Primer Comparative Ross C. Hardison complete sequence of features of two will often an can be considered be encoded within the DNA that is Ato be the ultimate genetic conserved between the . More map, in the sense that the heritable precisely, the DNA sequences encoding characteristics are encoded within the and responsible the DNA and that the order of all the for functions that were conserved along each from the last common ancestor is known. However, knowledge of should be preserved in contemporary the DNA sequence does not tell us genome sequences. Likewise, the DNA directly how this genetic information sequences controlling the expression leads to the observable traits and of that are regulated similarly behaviors () that we in two related species should also be want to understand. Finding all the conserved. Conversely, sequences that functional parts of genome sequences encode (or control the expression and using this information to improve of) proteins and RNAs responsible the health of individuals and society for differences between species will are the focus of the next phase of the themselves be divergent. (Collins et al. Different questions can be addressed 2003). Comparative analyses of genome by comparing at different sequences will be a major part of this phylogenetic distances (Figure 1). effort. Broad insights about types of genes can The major principles of comparative be gleaned by genomic comparisons genomics are straightforward. Common at very long phylogenetic distances,

Ross C. Hardison is at the Center for Comparative Ge- nomics and at The Pennsylvania State Primers provide a concise introduction into an University in University Park, Pennsylvania, United important aspect of that is of broad and States of America. E-mail: [email protected] current interest. DOI: 10.1371/journal.pbio.0000058

Volume 1 | Issue 2 | Page 156 sequence to increase the number of Glossary positions with matching nucleotides. Several powerful alignment algorithms Conserved: Derived from a common ancestor and retained in contemporary related have been developed to align two or species. Conserved features may or may not be under selection. more sequences. Evolutionary drift: The accumulation of sequence differences that have little or no However, the computational power impact on the fitness of an organism; such neutral mutations are not under selection. required to align billions of nucleotides Sequence polymorphisms arise randomly in a population, most of which have no effect on between two or more species vastly function. Stochastic processes allow a small fraction of these to increase in frequency until exceeds what is normally available in they are fixed in a population; these are detectable as neutral substitutions in interspecies individual laboratories. Thus, several comparisons. research groups make available Homologs: Features (including DNA and sequences) in species being compared precomputed alignments of genome that are similar because they are ancestrally related. sequences through servers or browsers Negative selection: The removal of deleterious mutations from a population; also referred (Table 1). An early example is EnteriX, to as purifying selection. for enteric (Florea et al. Nonredundant protein sets: The set of proteins from which similar proteins, derived from 2000; McClelland et al. 2000). Aligned duplicated genes, have been removed. human, mouse, and rat genomes can Orthologs: Homologous genes (or any DNA sequences) that separated because of be accessed at several sites, including a speciation event; they are derived from the same in the last common ancestor. VISTA (Mayor et al. 2000; Couronne et Orthologs are distinguished from paralogs, which are homologous genes that separated al. 2003), the conservation tracks at the because of events. University of California at Santa Cruz Phylogenetic distances: Measures of the degree of separation between two organisms (UCSC) genome browser (Kent et al. or their genomes, expressed in various terms such as number of accumulated sequences 2002), Ensembl (Clamp et al. 2003), changes, number of years, or number of generations. The distances are often placed on and GALA (Giardine et al. 2003). phylogenetic trees, which show the deduced relationships among the organisms. Positive selection: The retention of mutations that benefit an organism; also referred to as What Can You Learn about Genome Darwinian selection. ? : The property of being on the same chromosome. Conserved synteny means that The basic observation in comparative genes that are on the same chromosome in one species are also on the same chromosome in genomics is a description of the the comparison species; these are also referred to as blocks. matches between genomes. For example, in the roughly 75–80 e.g., greater than 1 billion years class of certain DNA segments, such as million years since diverged since their separation. For example, coding exons, noncoding RNAs, and from mouse, the large-scale gene comparing the genomes of yeast, some gene regulatory regions. Examples organization and gene order have worms, and flies reveals that these of analyses at this distance include been preserved (International Mouse encode many of the same comparisons among enteric bacteria Genome Sequencing Consortium proteins, and the nonredundant (McClelland et al. 2000), among several 2002). About 90% of the human protein sets of flies and worms are species of yeast (Cliften et al. 2001, genome is in large blocks of homology about the same size, being only twice 2003; Kellis et al. 2003), and between with mouse. These regions of conserved that of yeast (Rubin et al. 2000). The mouse and human (International synteny have many genes from one more complex developmental biology Mouse Genome Sequencing human chromosome that match genes of flies and worms is reflected in the Consortium 2002). The new comparison on a mouse chromosome, often in very greater number of signaling pathways of the genomes of Caenorhabditis briggsae similar orders. in these two species than in yeast. Over and (Stein et al. Sequences with no obvious function, such very large distances, the order of 2003) falls in this category. In contrast, such as relics of transposons that were genes and the sequences regulating very similar genomes, such as those of last active in the common ancestor of their expression are generally not humans and (separated human and mouse, can still align in conserved. At moderate phylogenetic by about 5 million years of evolution), mammalian comparisons; thus, not distances (roughly 70–100 million years are particularly apt for finding the all the aligning DNA is functional. By of divergence), both functional and key sequence differences that may evaluating the quality of the alignments nonfunctional DNA is found within account for the differences in the genome-wide, the proportion that the conserved DNA. In these cases, organisms. These are sequence changes scores significantly higher than the functional sequences will show under positive selection. Comparative alignments in the ancestral (and a signature of purifying or negative genomics is thus a powerful and presumed nonfunctional) repeats can selection, which is that the functional burgeoning discipline that becomes be determined. This analysis leads to an sequences will have changed less than more and more informative as genomic estimate that about 5% of the human the nonfunctional or neutral DNA sequence data accumulate. genome is under purifying selection (Jukes and Kimura 1984). Not only Alignment of DNA sequences is the and thus is functional. This portion of does aim to core process in comparative genomics. the under selection discriminate conserved from divergent An alignment is a mapping of the is about three times larger than the and functional from nonfunctional nucleotides in one sequence onto the portion coding for protein. Within the DNA, this approach is also contributing nucleotides in the other sequence, with noncoding sequences under selection, to identifying the general functional gaps introduced into one or the other one expects to find noncoding RNA PLoS Biology | http://biology.plosjournals.org Volume 1 | Issue 2 | Page 157 mouse genome (International Mouse Genome Sequencing Consortium 2002). The other 60% is composed of at least two classes of sequences, resulting from lineage-specific insertions, deletions, and possibly other mechanisms. One class, occupying about 24% of the genome, is comprised of the repetitive elements that arose by transposition only on the human lineage (International Mouse Genome Sequencing Consortium 2002). These particular insertions did not occur in mice, and thus they cannot align between human and mouse. Likewise, rodent- and mouse-specific retrotransposons independently expanded to occupy about 33% of the mouse genome. The lineage-specific and ancestral repeated elements occupy a substantial portion of the genome of all multicellular organisms, averaging about 50% in mammalian genomes and expanding even higher in the maize genome (San Miguel et al. 1996). The remaining 36% of the human genome currently cannot be accounted for unambiguously. Some of it could DOI: 10.1371/journal.pbio.0000058.g001 be explained by limitations in the Figure 1. Comparisons of Genomes at Different Phylogenetic Distances Are Appropriate to Address sensitivity of the alignment procedures; Different Questions i.e., some of the nonaligning DNA A generalized phylogenetic tree is shown, leading to four different organisms, with A could be orthologous DNA that and D the most distantly related pairs. Examples of the types of questions that can be addressed by comparisons between genomes at the different distances are given in the has changed so much that current boxes. programs cannot recognize that the sequence has evolved from a common genes, sequences involved in regulation Only about 60% of the C. elegans genes ancestor sequence. However, the of , and other critical encoding proteins have clear homologs homologs to some of the nonaligning components of the genome. in C. briggsae (Stein et al. 2003). The DNA in human could be deleted Virtually all (99%) of the protein- two worms are difficult to distinguish in mouse (International Mouse coding genes in humans align with morphologically and probably have Genome Sequencing Consortium homologs in mouse, and over 80% similar patterns of development, but 2002; Hardison et al. 2003). Given are clear 1:1 orthologs. In most cases, they achieve these similarities with the large expansion of mammalian the intron-exon structures are highly some significant differences in the genomes by transposable elements, conserved. This extensive conservation gene sets. Detailed comparisons of one would expect that a compensatory in protein-coding regions may be the similarities and differences in the amount of the ancestral DNA would be expected, because many biochemical relevant genes in these organisms will deleted from the genome. As genome functions of humans should also be therefore provide useful insights into sequences from additional species found in mouse. However, it is not seen developmental processes. are determined, the various possible in all comparisons over an equivalent At the level, about 40% explanations for this nonaligning, amount of phylogenetic separation. of the human genome aligns with the nonrepetitive DNA can be tested.

Table 1. URLs for Accessing Precomputed Whole-Genome Alignments and Their Analysis

Server or Browser Genomes Covered URL

EnteriX enteric bacteria http://bio.cse.psu.edu/ VISTA Genome Browser human, mouse, rat http://pipeline.lbl.gov/ UCSC Genome Browser , worms, zooa http://genome.ucsc.edu/ Ensembl mammals, fish, , worms http://www.ensembl.org/ GALA human, mouse, rat http://bio.cse.psu.edu/ a Data from multiple alignments of 13 vertebrate genome sequences homologous to the human CFTR region (Thomas et al. 2003) are included. DOI: 10.1371/journal.pbio.0000058.t001

PLoS Biology | http://biology.plosjournals.org Volume 1 | Issue 2 | Page 158 DOI: 10.1371/journal.pbio.0000058.g002 Figure 2. Examples of UCSC Genome Browser Views of Genes and Alignments The unc-52 gene in C. elegans (A) and part of its homolog HSPG2 in human (B) are shown, with rectangles for exons and lines for introns; arrows along the introns show the direction of transcription. Both genes encode a chondroitin sulfate proteoglycan. The gene in C. elegans is much smaller (about 29 kb) than the gene in humans (about 180 kb; only the 5’ portion is shown in [B]). The positions of alignments between C. elegans and C. briggsae are shown by the purple rectangles in (A). The probability that alignments between human and mouse result from purifying selection are plotted along the Human Cons track in (B). Note that in both comparisons, substantial amounts of intronic and flanking regions align, and several peaks of likely-selected DNA are seen for the human-mouse alignments in the noncoding regions. Among these are candidates for regulatory elements. Not only do the rates of evolution regions of genomic instability may play genomics and bioinformatics (Rivas vary along phylogenetic lineages, but a role in expanding the diversity of the and Eddy 2001), and it is likely that also they are also highly variable within proteins encoded in the genome. interspecies comparisons will also help genomes (Wolfe et al. 1989). With the in this analysis. whole-genome alignments between What Can You Learn about Regions of noncoding DNA with mammals, which encompass many Genome Function? a particularly high similarity among sites that are highly likely to have no Information on sequence similarity species have long been recognized as function, it is clear that the neutral rate among genomes is a major resource good candidates for functional regions varies significantly in large, megabase- for finding functional regions and (Hardison et al. 1997; Pennacchio sized regions along mammalian for predicting what those functions and Rubin 2001), and several have . The rates of insertion are. One of the best examples is the been confirmed as gene regulatory of certain classes of retrotransposons, improvement in identification of sequences (e.g., Loots et al. 2000). inferred large deletions, and meiotic protein-coding genes. Software that However, the appropriate threshold for recombination vary along with the incorporates interspecies similarity the level of sequence similarity that is neutral substitution rates (International into (Batzoglou et diagnostic for functional sequences has Mouse Genome Sequencing al. 2000; Korf et al. 2001; Wiehe et al. not been established, and investigators Consortium 2002; Hardison et al. 2001; Alexandersson et al. 2003) is use a variety of such thresholds. What 2003). These observations indicate being used to analyze large genomes. is needed is a robust assessment of the that large segments of mammalian Several of the novel genes predicted likelihood that a particular alignment chromosomes have an inherent in mammals using these programs results from purifying selection rather tendency to change by any of have been verified experimentally, than evolutionary drift. The analysis several processes, such as nucleotide adding about 1,000 new genes to the is complicated by the variable rate substitution, insertion of transposons, mammalian set (International Mouse of neutral evolution within species, and recombination. Genome Sequencing Consortium but solutions have been developed Another major source of change in 2002). Comparisons of the worm and are being improved. Comparison genomes is segmental duplications, genomes led to 1,275 well-supported of the rates of within-species which are particularly prominent in suggestions for new genes in C. elegans, polymorphism and between-species primate genomes (Bailey et al. 2002). adding significantly to the roughly divergence has proven effective for These large duplications of tens to 20,000 known and predicted genes. monitoring selection in nucleotides thousands of kilobases are revealed Predicting and verifying noncoding sequences from Drosophila species by intraspecies comparisons. These RNA genes is a current challenge in (Hudson et al. 1987). This method

PLoS Biology | http://biology.plosjournals.org Volume 1 | Issue 2 | Page 159 uses the intraspecies polymorphism sequences for functional prediction is Phylogenetic footprinting reveals a nuclear protein which binds to silencer sequences in the measurements as a monitor of neutral shown dramatically by the analyses of human  and  globin genes. Mol Cell Biol 12: evolution, and deviations from 13 genomic sequences from species 4919–4929. neutrality, measured as significantly less ranging from fish to humans (Thomas Hardison R, Oeltjen J, Miller W (1997) Long human-mouse sequence alignments reveal interspecies change than expected, are et al. 2003). Other approaches using novel regulatory elements: A reason to indicators of selection. For the human- multiple sequences from more closely sequence the mouse genome. Genome Res 7: mouse genome comparisons, the related species substantially improve 959–966. Hardison RC, Roskin KM, Yang S, Diekhans local neutral rate was estimated from the resolving power of comparative M, Kent WJ, et al. (2003) Covariation the divergence of aligned ancestral genomics (Gumucio et al. 1992; Boffelli in frequencies of substitution, , transposition and recombination during repeats, and similarity scores were et al. 2003). The Human Genome eutherian evolution. Genome Res 13: 13–26. adjusted accordingly. By evaluating the Project recognizes the power of this Holt RA, Subramanian GM, Halpern A, Sutton distribution of these similarity scores broad comparative analysis (Collins et GG, Charlab R, et al. (2002) The genome sequence of the malaria Anopheles in likely-neutral DNA and in DNA al. 2003). Researchers may reasonably gambiae. 298: 129–149. inferred as being under selection, a expect in the near future to have Hudson RR, Kreitman M, Aguadé M (1987) A probability that any human-mouse results of this comparative analysis test of neutral based on nucleotide data. 116: 153–159. alignment reflects purifying selection readily available. By calibrating these International Mouse Genome Sequencing can be computed (Figure 2), and such results, such as estimated likelihoods Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. scores are available genome-wide on of being under selection, likelihood Nature 420: 520–562. the UCSC Genome Browser. of being a coding exon, etc., against Jukes TH, Kimura M (1984) Evolutionary Predicting exactly what the function known functional elements, the power constraints and the neutral theory. J Mol Evol 21: 90–92. is of these noncoding sequences under of the comparative approaches should Kellis M, Patterson N, Endrizzi M, Birren B, selection is a major challenge. One improve. The critical next stage is Lander ES (2003) Sequencing and comparison promising approach is to collect good large-scale experimental tests of the of yeast species to identify genes and regulatory elements. Nature 423: 241–254. training sets of alignments within predictions, which should prove Kent WJ, Sugnet CW, Furey TS, Roskin KM, sequences of known functions, such exciting and challenging.  Pringle TH, et al. (2002) The human genome as gene regulatory sequences, and use browser at UCSC. Genome Res 12: 996–1006. References Korf I, Flicek P, Duan D, Brent MR (2001) those alignments to develop statistical Alexandersson M, Cawley S, Pachter L (2003) Integrating genomic homology into gene models for estimating a likelihood SLAM: Cross-species gene finding and structure prediction. Bioinformatics 17 Suppl 1: alignment with a generalized pair hidden S140–S148. that any given alignment could be Markov model. Genome Res 13: 496–502. Loots GG, Locksley RM, Blankespoor CM, Wang generated by that model (e.g., Elnitski Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, ZE, Miller W, et al. (2000) Identification of a et al. 2003). This type of approach et al. (2002) Recent segmental duplications in coordinate regulator of interleukins 4, 13, and 5 the human genome. Science 297: 1003–1007. by cross-species sequence comparisons. Science could be applied to any functional Batzoglou S, Pachter L, Mesirov JP, Berger B, 288: 136–140. category in which the conserved DNA Lander ES (2000) Human and mouse gene Mayor C, Brudno M, Schwartz JR, Poliakov A, sequence is critical to the function. For structure: Comparative analysis and application Rubin EM, et al. (2000) VISTA: Visualizing to exon prediction. Genome Res 10: 950–958. global DNA sequence alignments of arbitrary instance, it is still not clear whether Boffelli D, McAuliffe J, Ovcharenko D, Lewis length. Bioinformatics 16: 1046–1047. conserved DNA sequences are critical KD, Ovcharenko I, et al. (2003) Phylogenetic McClelland M, Florea L, Sanderson K, Clifton to the function of replication origins; shadowing of primate sequences to find SW, Parkhill J, et al. (2000) Comparison of functional regions of the human genome. the Escherichia coli K-12 genome with sampled if they are not, then this analytical Science 299: 1391–1394. genomes of a Klebsiella pneumoniae and three model will not successfully predict Clamp M, Andrews D, Barker D, Bevan P, Salmonella enterica serovars, Typhimurium, Typhi Cameron G, et al. (2003) Ensembl 2002: and Paratyphi. Nucleic Acids Res 28: 4974–4986. this important functional category, Accommodating comparative genomics. Pennacchio LA, Rubin EM (2001) Genomic and other methods will need to be Nucleic Acids Res 31: 38–42. strategies to identify mammalian regulatory developed. Cliften PF, Hillier LW, Fulton L, Graves T, Miner sequences. Nat Rev Genet 2: 100–109. T, et al. (2001) Surveying Saccharomyces Rivas E, Eddy SR (2001) Noncoding RNA gene genomes to identify functional elements by detection using comparative sequence analysis. Prospects comparative DNA sequence analysis. Genome BMC Bioinformatics 2: 8. Res 11: 1175–1186. Rubin GM, Yandell MD, Wortman JR, Gabor The past year has brought the Cliften PF, Sudarsanam P, Desikan A, Fulton Miklos GL, Nelson CR, et al. (2000) genome sequences of species that L, Fulton B, et al. (2003) Finding functional Comparative genomics of the eukaryotes. are close relatives of many model features in Saccharomyces genomes by Science 287: 2204–2215. phylogenetic footprinting. Science 301: 71–76. San Miguel P, Tikhonov A, Jin YK, Motchoulskaia organisms. The list includes several Collins FS, Green ED, Guttmacher AE, Guyer N, Zakharov D, et al. (1996) Nested yeast species to compare with MS (2003) A vision for the future of genomics retrotransposons in the intergenic regions of research. Nature 422: 835–847. the maize genome. Science 274: 765–768. , another Couronne O, Poliakov A, Bray N, Ishkhanov T, Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent Drosophila species and Anopheles (Holt Ryaboy D, et al. (2003) Strategies and tools for MR, et al. (2003) The genome sequence et al. 2002) to compare with Drosophila whole-genome alignments. Genome Res 13: of Caenorhabditis briggsae: A platform for 73–80. comparative genomics. PLoS Biol 1: DOI: melanogaster, mouse to compare with Elnitski L, Hardison RC, Li J, Yang S, Kolbe D, et 10.1371/journal.pbio.0000044. human, and now C. briggsae to compare al. (2003) Distinguishing regulatory DNA from Thomas JW, Touchman JW, Blakesley RW, with C. elegans (Stein et al. 2003). Fully neutral sites. Genome Res 13: 64–72. Bouffard GG, Beckstrom-Sternberg SM, et al. Florea L, Riemer C, Schwartz S, Zhang Z, (2003) Comparative analyses of multi-species harvesting the information in these Stojanovic N, et al. (2000) Web-based sequences from targeted genomic regions. comparative analyses and integrating it visualization tools for Nature 424: 788–793. alignments. Nucleic Acids Res 28: 3486–3496. Wiehe T, Gebauer-Jung S, Mitchell-Olds T, Guigo across the many comparisons will be a Giardine BM, Elnitski L, Riemer C, Makalowska I, R (2001) SGP-1: Prediction and validation continuing and fruitful exercise. Schwartz S, et al. (2003) GALA: A database for of homologous genes based on sequence Comparing more than two genomic genomic sequence alignments and annotations. alignments. Genome Res 11: 1574–1583. Genome Res 13: 732–741. Wolfe KH, Sharp PM, Li WH (1989) Mutation sequences provides even more resolving Gumucio DL, Heilstedt-Williamson H, Gray rates differ among regions of the mammalian power. The efficacy of multiple TA, Tarle SA, Shelton DA, et al. (1992) genome. Nature 337: 283–285.

PLoS Biology | http://biology.plosjournals.org Volume 1 | Issue 2 | Page 160