Early Fixation of an Optimal Genetic Code

Early Fixation of an Optimal Genetic Code Stephen J. Freeland,* Robin D. Knight,* Laura F. Landweber,* and Laurence D. Hurst² *Department of Ecology and Evolution, Princeton University; and ²Department of Biology and Biochemistry, University of Bath, Bath, England The evolutionary forces that produced the canonical genetic code before the last universal ancestor remain obscure. One hypothesis is that the arrangement of amino acid/codon assignments results from selection to minimize the effects of errors (e.g., mistranslation and mutation) on resulting proteins. If amino acid similarity is measured as polarity, the canonical code does indeed outperform most theoretical alternatives. However, this ®nding does not hold for other amino acid properties, ignores plausible restrictions on possible code structure, and does not address the naturally occurring nonstandard genetic codes. Finally, other analyses have shown that signi®cantly better code structures are possible. Here, we show that if theoretically possible code structures are limited to re¯ect plausible biological constraints, and amino acid similarity is quanti®ed using empirical data of substitution frequencies, the canonical code is at or very close to a global optimum for error minimization across plausible parameter space. This result is robust to variation in the methods and assumptions of the analysis. Although signi®cantly better codes do exist under some assumptions, they are extremely rare and thus consistent with reports of an adaptive code: previous analyses which suggest otherwise derive from a misleading metric. However, all extant, naturally occurring, secondarily derived, nonstandard genetic codes do appear less adaptive. The arrangement of amino acid assignments to the codons of the standard genetic code appears to be a direct product of natural selection for a system that minimizes the phenotypic impact of genetic error. Potential criticisms of previous analyses appear to be without substance. That known variants of the standard genetic code appear less adaptive suggests that different evolutionary factors predominated before and after ®xation of the canonical code. While the evidence for an adaptive code is clear, the process by which the code achieved this optimization requires further attention. Introduction All known nonstandard genetic codes appear to be porate known biases in mutation and mistranslation rates secondarily derived minor modi®cations of the canoni- (Ardell 1998; Freeland and Hurst 1998a). cal code (Osawa 1995). The fact that codon reassign- ment is not always lethal indicates that the amino acid/ Adaptive Evidence as an Artifact of Stereochemistry? codon assignments of the canonical code need not be a ``frozen accident'' of history (Crick 1965), but, rather, However, this evidence is potentially ¯awed in sev- require explanation. One hypothesis is that the arrange- eral ways. First, when other measures of amino acid similarity, such as volume or charge, are used, the ca- ment of amino acid assignments results from natural se- nonical code no longer appears special (Haig and Hurst lection among different codes favoring those that min- 1991). Without clear evidence that amino acid polarity imize the phenotypic impact of genetic error by maxi- is the major de®ning factor of protein ®tness, claims for mizing the similarity of amino acids assigned to codons an adaptive canonical code might be spurious. In partic- differing by only a single nucleotide (Sonneborn 1965; ular, the ``stereochemical'' hypothesis proposes that the Woese 1965; Zuckerkandl and Pauling 1965). canonical code originated through speci®c steric inter- Previous evidence for an adaptive code structure actions between amino acids and their associated codons derives as follows. The canonical code's susceptibility (Yarus 1998; Knight, Freeland, and Landweber 1999). to error is quanti®ed as a ``code error value,'' Dcode, rep- Because Polar Requirement values derive from chro- resenting the average change in amino acid meaning (ac- matographic partitioning of amino acids in a water/pyr- cording to some quantitative similarity measure) result- idine system (Woese et al. 1966), amino acids that bind ing from all single-nucleotide substitutions in all codons. nucleotides similarly but act differently in proteins could This is then compared with equivalent values measured be responsible for Polar Requirement values, such that for theoretical alternative codes. When amino acid sim- a code formed through stereochemical interactions ilarity is measured in terms of Polar Requirement might appear adaptive as an artifact. We address this (Woese et al. 1966) (essentially a measure of hydropho- weakness by employing point accepted mutations bicity), the canonical code outperforms all but one in (PAM) 74±100 matrix data (Benner, Cohen, and Gonnet 10,000 randomly generated alternatives (Haig and Hurst 1994), which are derived from the pattern of amino acid 1991; Freeland and Hurst 1998a) and appears to be bet- substitution frequencies observed within naturally oc- ter still when calculation of Dcode is adjusted to incor- curring pairs of homologous proteins and thus provide a direct measure of amino acid similarity in terms of protein biochemistry. Key words: genetic code, adaptation, evolution, PAM matrix, Po- lar Requirement. Potential Problems with PAM Address for correspondence and reprints: Stephen J. Freeland, De- partment of Ecology and Evolution, Princeton University, Princeton, A subtle and potentially profound problem with us- New Jersey 08544. E-mail: [email protected]. ing PAM matrix data in this context is that PAM matrix Mol. Biol. Evol. 17(4):511±518. 2000 values may simply re¯ect the structure of the genetic q 2000 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038 code. Speci®cally, over a short evolutionary period, 511 512 Freeland et al. most amino acid substitutions are likely to result from Adaptive Evidence as an Artifact of Biosynthetic single point mutations within codons: PAM scores for Relatedness? minimally diverged proteins are therefore likely to re- A second weakness in previous adaptive analyses ¯ect the arrangement of codon assignments rather than (Wong 1980; DiGiulio 1989, 1994; Haig and Hurst amino acid similarity in terms of protein chemistry, ren- 1991; Goldman 1993; Ardell 1998; Freeland and Hurst dering code analysis based on PAM scores a tautologous 1998a) is that most have assumed that each synonymous exercise. The PAM 74±100 uniquely avoids this prob- codon block of the canonical code could have taken any lem, since it is derived from only highly diverged pro- amino acid assignment (®g. 1a). A growing body of tein sequences (other PAM matrices are calculated from circumstantial evidence questions this assumption homologs of low divergence and manipulated mathe- (Knight, Freeland, and Landweber 1999). Many of the matically to predict substitution patterns at high diver- 20 canonical amino acids are not plausible products of gence). Quantitative analysis of the PAM 74±100 dem- prebiotic chemistry (Wong and Bronskill 1979) and are onstrates its superior measurement of amino acid simi- only produced in extant organisms as biosynthetic mod- larity. PAM matrices derived for homologs of increasing i®cations of their plausibly primordial counterparts. Fur- divergence do differ from one another, consistent with thermore, biosynthetically related amino acids are often the idea that they decreasingly re¯ect code structure and assigned to similar codons within the canonical code increasingly re¯ect amino acid similarity (in terms of (Wong 1975; Taylor and Coates 1989). Taken together, protein biochemistry). That PAMs of increasing diver- these observations have provoked the code coevolution gence approach an asymptote below the divergence hypothesis (Wong 1975), proposing that the canonical threshold used to choose proteins for the calculation of code evolved from a simpler ancestral form (encoding the 74±100 indicates that this matrix predominantly re- fewer amino acids with greater redundancy) by succes- ¯ects amino acid biochemical similarity (i.e., higher di- sively reassigning subsets of synonymous codons to in- vergence would provide no further surprises). Further corporate novel amino acid biosynthetic derivatives. Al- qualitative observations support this interpretation. For though detailed perceived patterns (Wong 1975) are un- example, within the standard code, Tryptophan is as- trustworthy because of the biosynthetic interrelatedness signed a single codon semisurrounded by termination of most amino acids within present-day metabolism codons and can only change to arginine by means of a (Amirnovin 1997), it does appear that amino acids from single nucleotide transition error (UGG→CGR). How- the same biosynthetic pathway are generally assigned to ever, tryptophan's side chain is a hydrophobic ring, codons sharing the same ®rst base (Taylor and Coates whereas arginine's side chain is aliphatic and hydro- 1989) (®g. 1b). If this re¯ects a history of biosynthetic philic. It is intuitive that tryptophan's protein biochem- expansion from some primordial code, then the implied istry would be more similar to that of, say, phenylala- restrictions on code evolution would reduce the number nine (with an aromatic, hydrophobic side chain), which of possible codes so greatly as to render previous adap- lies two point mutations away in codon space. Low- tive results meaningless (Freeland and Hurst 1998b). We divergence PAMs imply

Early Fixation of an Optimal Genetic Code

Traveling Salesman Problem

Using Local Alignments for Relation Recognition

1 Introduction

Efficient Alignment-Free Sequence Similarity Measurement Based on Kendall Statistics

Amino Acid Substitution Matrices from an Information Theoretic Perspective

Similarity in Design: a Framework for Evaluation

A Web Tool for Protein Semantic Similarity

Similarity Measures for Clustering SNP and Epidemiological Data

Research Article a Topology-Based Metric for Measuring Term Similarity in the Gene Ontology

Significance and Functional Similarity for Identification of Disease Genes 3

Sequence Analysis

A New Measure for Similarity Searching in DNA Sequences 1