Comparative Genomics Ross C

Primer Comparative Genomics Ross C. Hardison complete genome sequence of features of two organisms will often an organism can be considered be encoded within the DNA that is A to be the ultimate genetic conserved between the species. More map, in the sense that the heritable precisely, the DNA sequences encoding characteristics are encoded within the proteins and RNAs responsible the DNA and that the order of all the for functions that were conserved nucleotides along each chromosome from the last common ancestor is known. However, knowledge of should be preserved in contemporary the DNA sequence does not tell us genome sequences. Likewise, the DNA directly how this genetic information sequences controlling the expression leads to the observable traits and of genes that are regulated similarly behaviors (phenotypes) that we in two related species should also be want to understand. Finding all the conserved. Conversely, sequences that functional parts of genome sequences encode (or control the expression and using this information to improve of) proteins and RNAs responsible the health of individuals and society for differences between species will are the focus of the next phase of the themselves be divergent. Human Genome Project (Collins et al. Different questions can be addressed 2003). Comparative analyses of genome by comparing genomes at different sequences will be a major part of this phylogenetic distances (Figure 1). effort. Broad insights about types of genes can The major principles of comparative be gleaned by genomic comparisons genomics are straightforward. Common at very long phylogenetic distances, Ross C. Hardison is at the Center for Comparative Ge- nomics and Bioinformatics at The Pennsylvania State Primers provide a concise introduction into an University in University Park, Pennsylvania, United important aspect of biology that is of broad and States of America. E-mail: [email protected] current interest. DOI: 10.1371/journal.pbio.0000058 Volume 1 | Issue 2 | Page 156 sequence to increase the number of Glossary positions with matching nucleotides. Several powerful alignment algorithms Conserved: Derived from a common ancestor and retained in contemporary related have been developed to align two or species. Conserved features may or may not be under selection. more sequences. Evolutionary drift: The accumulation of sequence differences that have little or no However, the computational power impact on the fitness of an organism; such neutral mutations are not under selection. required to align billions of nucleotides Sequence polymorphisms arise randomly in a population, most of which have no effect on between two or more species vastly function. Stochastic processes allow a small fraction of these to increase in frequency until exceeds what is normally available in they are fixed in a population; these are detectable as neutral substitutions in interspecies individual laboratories. Thus, several comparisons. research groups make available Homologs: Features (including DNA and protein sequences) in species being compared precomputed alignments of genome that are similar because they are ancestrally related. sequences through servers or browsers Negative selection: The removal of deleterious mutations from a population; also referred (Table 1). An early example is EnteriX, to as purifying selection. for enteric bacteria (Florea et al. Nonredundant protein sets: The set of proteins from which similar proteins, derived from 2000; McClelland et al. 2000). Aligned duplicated genes, have been removed. human, mouse, and rat genomes can Orthologs: Homologous genes (or any DNA sequences) that separated because of be accessed at several sites, including a speciation event; they are derived from the same gene in the last common ancestor. VISTA (Mayor et al. 2000; Couronne et Orthologs are distinguished from paralogs, which are homologous genes that separated al. 2003), the conservation tracks at the because of gene duplication events. University of California at Santa Cruz Phylogenetic distances: Measures of the degree of separation between two organisms (UCSC) genome browser (Kent et al. or their genomes, expressed in various terms such as number of accumulated sequences 2002), Ensembl (Clamp et al. 2003), changes, number of years, or number of generations. The distances are often placed on and GALA (Giardine et al. 2003). phylogenetic trees, which show the deduced relationships among the organisms. Positive selection: The retention of mutations that benefit an organism; also referred to as What Can You Learn about Genome Darwinian selection. Evolution? Synteny: The property of being on the same chromosome. Conserved synteny means that The basic observation in comparative genes that are on the same chromosome in one species are also on the same chromosome in genomics is a description of the the comparison species; these are also referred to as homology blocks. matches between genomes. For example, in the roughly 75–80 e.g., greater than 1 billion years class of certain DNA segments, such as million years since humans diverged since their separation. For example, coding exons, noncoding RNAs, and from mouse, the large-scale gene comparing the genomes of yeast, some gene regulatory regions. Examples organization and gene order have worms, and flies reveals that these of analyses at this distance include been preserved (International Mouse eukaryotes encode many of the same comparisons among enteric bacteria Genome Sequencing Consortium proteins, and the nonredundant (McClelland et al. 2000), among several 2002). About 90% of the human protein sets of flies and worms are species of yeast (Cliften et al. 2001, genome is in large blocks of homology about the same size, being only twice 2003; Kellis et al. 2003), and between with mouse. These regions of conserved that of yeast (Rubin et al. 2000). The mouse and human (International synteny have many genes from one more complex developmental biology Mouse Genome Sequencing human chromosome that match genes of flies and worms is reflected in the Consortium 2002). The new comparison on a mouse chromosome, often in very greater number of signaling pathways of the genomes of Caenorhabditis briggsae similar orders. in these two species than in yeast. Over and Caenorhabditis elegans (Stein et al. Sequences with no obvious function, such very large distances, the order of 2003) falls in this category. In contrast, such as relics of transposons that were genes and the sequences regulating very similar genomes, such as those of last active in the common ancestor of their expression are generally not humans and chimpanzees (separated human and mouse, can still align in conserved. At moderate phylogenetic by about 5 million years of evolution), mammalian comparisons; thus, not distances (roughly 70–100 million years are particularly apt for finding the all the aligning DNA is functional. By of divergence), both functional and key sequence differences that may evaluating the quality of the alignments nonfunctional DNA is found within account for the differences in the genome-wide, the proportion that the conserved DNA. In these cases, organisms. These are sequence changes scores significantly higher than the functional sequences will show under positive selection. Comparative alignments in the ancestral (and a signature of purifying or negative genomics is thus a powerful and presumed nonfunctional) repeats can selection, which is that the functional burgeoning discipline that becomes be determined. This analysis leads to an sequences will have changed less than more and more informative as genomic estimate that about 5% of the human the nonfunctional or neutral DNA sequence data accumulate. genome is under purifying selection (Jukes and Kimura 1984). Not only Alignment of DNA sequences is the and thus is functional. This portion of does comparative genomics aim to core process in comparative genomics. the human genome under selection discriminate conserved from divergent An alignment is a mapping of the is about three times larger than the and functional from nonfunctional nucleotides in one sequence onto the portion coding for protein. Within the DNA, this approach is also contributing nucleotides in the other sequence, with noncoding sequences under selection, to identifying the general functional gaps introduced into one or the other one expects to find noncoding RNA PLoS Biology | http://biology.plosjournals.org Volume 1 | Issue 2 | Page 157 mouse genome (International Mouse Genome Sequencing Consortium 2002). The other 60% is composed of at least two classes of sequences, resulting from lineage-specific insertions, deletions, and possibly other mechanisms. One class, occupying about 24% of the genome, is comprised of the repetitive elements that arose by transposition only on the human lineage (International Mouse Genome Sequencing Consortium 2002). These particular insertions did not occur in mice, and thus they cannot align between human and mouse. Likewise, rodent- and mouse-specific retrotransposons independently expanded to occupy about 33% of the mouse genome. The lineage-specific and ancestral repeated elements occupy a substantial portion of the genome of all multicellular organisms, averaging about 50% in mammalian genomes and expanding even higher in the maize genome (San Miguel et al. 1996). The remaining 36% of the human genome currently cannot be accounted for unambiguously. Some of it could DOI: 10.1371/journal.pbio.0000058.g001 be explained by limitations in the

Comparative Genomics Ross C

The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet Is Available from the National Academies Press, 500 Fifth Street, NW, Washington, D.C

Genomics and Its Impact on Science and Society: the Human Genome Project and Beyond

The Human Genome Project Focus of the Human Genome Project

From the Human Genome Project to Genomic Medicine a Journey to Advance Human Health

Comparative Genome Evolution Working Group

Gene Prediction Using Deep Learning

The Genome Project–Write Issues, Enabling Inclusive Decision-Making on the Topics Mentioned Above (7)

"An Overview of Gene Identification: Approaches, Strategies, and Considerations"

Gene Prediction and Verification in a Compact Genome with Numerous Small Introns

Bioinformatic Analysis of Cis-Encoded Antisense Transcription

Model-Based Integration of Genomics and Metabolomics Reveals SNP Functionality in Mycobacterium Tuberculosis

6.047/6.878 Lecture 4: Comparative Genomics I: Genome Annotation Using Evolutionary Signatures