Proc. Natl. Acad. Sci. USA
Vol. 77, No. 6, pp. 3539-3543, June 1980
Genetics

Predicting coding function from sequence or survival of "fitness" of tRNA
(DNA/mRNA/sequence constraints/genotypic selection/bonded discontinuous ultrathin acrylamide gel)

GEORGE PIECZENIK
Department of Biochemistry, Bureau of Biological Research, Rutgers University-Busch Campus, New Brunswick, New Jersey 08903

Communicated by Rollin D. Hotchkiss, December 26, 1979

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.

ABSTRACT The sequence of a nucleotide region of f1 bacteriophage was determined on a bonded ultrathin acrylamide gel with a discontinuous buffer system by using the dideoxy-DNA sequencing method. This sequence and one other were analyzed for maximal base pairing with tRNAs. The results allow a prediction of the direction and phase of possible coding functions. The implications of sequence constraints on mRNA codon frequency, tRNA structure, the origin of protein synthesis, and triplet reading are discussed in terms of neutral, Darwinian, and genotypic selectionist perspectives of evolution. The model of F. H. C. Crick, S. Brenner, A. Klug, and G. Pieczenik [(1976) Origins of Life 7, 389-397] for the origin of the [...] is used to interpret contemporary adaptive and functional nucleic acid sequences.

Pieczenik et al. (1) have tested the assumption that mRNA is directly determined by its complementary sequence. In this paper I wish to question the assumption that triplet codon translation necessarily implies that the tRNA-mRNA interaction is a three-base-pair interaction.

It is widely believed that DNA sequences evolve or are conserved in evolution as the result of selective pressures exerted on organisms carrying their functional end products. The view that nucleic acid sequences themselves may be subject to direct biochemical selection, which imposes evolutionary constraints on nucleotide order (2, 3), was in part illustrated by a recent proposal for the primordial mechanism of protein synthesis. In this proposal, Crick et al. (4) suggested that in the primitive protein-synthesizing system, the tRNAs used five-base-pair interactions with the mRNAs. From this, a comma-free, triplet, partially overlapping code could easily develop. In the present paper, the possibility of such overlapping interactions will be examined for existing contemporary nucleotide sequences and known tRNAs.

Such interactions would impose constraints upon mRNA sequences and these would be independent of, but compatible with, the internal (out-of-phase) terminator constraint for determining whether a sequence is a message sequence (5, 6). Because the universality of the UGA terminator codon has now been questioned (7), an additional method for determining phase-of-reading becomes more interesting.

The existence of such an mRNA constraint would also bring under question the assumption that synonymous codon usage is selectively neutral (8, 9). It also argues that selection may not be exclusively at the total organism level and that nucleotide sequences are a simple record of the successful and adaptive historical accidents a species has encountered in its survival in nature (10). An mRNA constraint based on tRNA interaction can be a historical vestige of the original protein-synthesizing mechanism based on the Darwinian principle of descent from a common ancestor (4), or it may have arisen as a consequence of continuous selection at the nucleotide level [i.e., genotypic selection (3)], or it may be a consequence of another, as yet unspecified, limitation.

Presented are two oligonucleotide sequences unselected except in the respect that they were determined by me and that they offer little in the form of initiator or terminator codons. When they are first determined, such short sequences give little or no clue as to coding functions. The two sequences will be analyzed as they were first viewed in the light of the mRNA constraint, without reference to any collateral information. The results will be compared with later proposals for coding function.

METHODS

Determining a Nucleotide Sequence. Bacteriophage f1 is a single-stranded DNA bacteriophage, parts of which have been sequenced (6, 11, 12). Single-stranded viral DNA, (+)strand, was prepared from plaque-purified f1 bacteriophage as described (13) and made up to a concentration of 0.6 mg/ml. Covalently closed, duplex, supercoiled replication form (RF) I of f1 DNA was prepared by the published method (6) with the modification of eliminating the sucrose gradient. Escherichia coli K-38 was the host strain.

Hae III restriction fragments were prepared by digesting 100 μg of RF I DNA with Hae III (obtained from New England BioLabs) for 15 min at 1 unit/μg of DNA for 2 hr in 50 μl of 10 mM Tris-HCl, pH 7.8/66 mM MgCl2/60 mM 2-mercaptoethanol. Fragments were separated on a modified version of an Ornstein-Davis discontinuous acrylamide gel system (2, 14, 15). The stacking gel was 2.75% acrylamide (1:10 bisacrylamide/acrylamide) at pH 6.7. The separating gel was composed of 5% acrylamide and 10% acrylamide at pH 8.9 at a 1:40 bisacrylamide/acrylamide ratio. The running buffer was Tris glycine at pH 8.3 (15). Electrophoresis was at 0°C at 100-200 V until the bromophenol blue marker reached the bottom of a 20-cm gel. Eight of the nine recognized restriction fragments were separated. The seventh band from the origin was eluted electrophoretically into a dialysis bag, acidified with sodium acetate (pH 5.0), and precipitated with 2.5 vol of 95% (vol/vol) ethanol at -20°C. The fragment was pelleted and resuspended in 100 μl of water. This restriction fragment, called Hae III H fragment, was used to prime the f1 (+)strand for generating a sequence.

The DNA sequence was determined with chain-terminating dideoxy inhibitors by the procedure of Sanger et al. (16), with the modification that all four terminators were 2',3'-dideoxytriphosphates and at a concentration 100 times the respective deoxytriphosphate concentrations. The dideoxytriphosphates were purchased from P-L Biochemicals. Reaction mixtures were prepared for [α-32P]dATP as the labeling triphosphate as well as for [α-32P]dCTP. The specific activities were in the range of 300-500 Ci/mmol (1 Ci = 3.7 × 10^10 becquerels). In addition to the dATP "chase" for the [32P]dATP-labeled reaction, each respective reaction mixture was chased with its unlabeled triphosphate in 1 μl of 0.5 mM dNTP. The primer was recleaved with Hae III (4 units per reaction mixture for 10 min at 37°C). Reactions were terminated with EDTA and then
EDTA/dye in formamide as described. The products were fractionated on two gel systems. One was a continuous system of Tris borate/7 M urea/12% acrylamide (1:20 bisacrylamide/acrylamide) at pH 8.3 with a thickness of 1 mm (17). The other was a discontinuous buffer system and the gel was covalently bonded to its backing, allowing a 0.18-mm gel.

Ultrathin Discontinuous Bonded Acrylamide Gel System. By covalently bonding the gel to its glass backing one solves two problems of resolution. One can make an ultrathin gel and shrink the gel, without distortion, immediately after electrophoresis. This limits bandwidth as a consequence of gel thickness and diffusion. The question of gel thickness has also already been addressed by Sanger and Coulson (18). The strategy of developing a thin starting zone was originally devised by Ornstein and Davis (14, 15), using a moving boundary. They set up a discontinuous pH system to generate moving boundaries; I have set up a system with cations of different mobilities.

The gel was a 12% Tris borate/urea gel at pH 8.9; the upper (and lower) reservoir buffer was Tris glycine at pH 8.3. The concentrations were as given for Tris glycine buffer (15) and Tris borate buffer and Tris borate/urea 12% gels (17). The gel was bonded to its glass plate with γ-methacryloxypropyltrimethoxysilane (19). Runners for the vertical ultrathin gel were made from cut x-ray film with the emulsion removed. Slot formers of the same material were 1 cm wide × 1 cm high × 0.18 mm thick with a 2-mm space between the slots. The gel was prerun with Tris borate buffer for 1/2 hr prior to layering of sample and change of buffer system. This thin gel runs at 2 kV and 10-14 mA, but can be run as low as 1 kV and 5 mA. The gel was dehydrated in 95% ethanol immediately after electrophoresis, which removes both the urea and water and shrinks the gel to less than 0.1 mm, thereby eliminating diffusion problems. A film was placed directly on the dry gel for autoradiography at -70°C with a Cronex III intensifying screen. Fig. 1 shows results with the standard gel fractionation system (Tris borate/Tris borate) and the discontinuous ultrathin bonded gel system (Tris glycine/Tris borate) when the same sample was divided and run on the two systems simultaneously. The diffuse thick bands overlying the Tris glycine/Tris borate sequence bands do not appear on the Tris borate/Tris borate system. The last "C" band shown on this gel, which was light when a [32P]ATP-labeled mixture was used, showed up clearly with an equivalent [32P]CTP-labeled mixture.

FIG. 1. DNA sequence of bacteriophage f1 with Hae III restriction fragment used to prime the viral DNA (+)strand and dideoxy sequencing method of Sanger et al. (16). TB/TB refers to continuous Tris borate buffer gel electrophoresis system. TG/TB refers to Tris glycine buffer and Tris borate gel buffer and bonded acrylamide gel electrophoresis. (Lanes G, A, T, and C are shown for each system.)

RESULTS AND DISCUSSION

Fig. 1 shows the pattern of nucleotides built up adjacent to the Hae III restriction fragment by polymerase synthesis along the template phage (+)strand. With the directions of synthesis and translation known, we can say that the segment (5')C-A-A-T-A-T-T-A-C-C-G-C-C-A-G-C-C-A-T-T-G-C-A-A-C-A-G-G-A-A-A-A-A-C(3') [the (-)strand] and its complement (5')G-T-T-T-T-T-C-C-T-G-T-T-G-C-A-A-T-G-G-C-T-G-G-C-G-G-T-A-A-T-A-T-T-G(3') [(+)strand] are produced by f1 bacteriophage, and one of them, read in the 5'-to-3' direction, may determine a function.

Given so short a sequence, can we propose whether it is a coding or noncoding sequence and, if coding, in which phase? Figs. 2 and 3 show the corresponding RNA transcripts of the DNA (-) and (+)strands, respectively, each parsed by the genetic code into three possible reading frames, phases 1-6.

FIG. 2. Analysis of phases 1, 2, and 3 of putative f1 mRNA coded by the DNA sequence derived in Fig. 1. (Panels: phases 1, 2, and 3, each with the aligned tRNA anticodon loops; histogram of the number of extra base pairs in tRNA-mRNA interaction against nucleotide number.)

FIG. 3. Analysis of the three possible triplet phases of reading of the putative f1 mRNA coded by the complement of the DNA sequence in Fig. 1. (Panels: phases 4, 5, and 6, each with the aligned tRNA anticodon loops; histogram of the number of extra base pairs in tRNA-mRNA interaction against nucleotide number.)

One first notes that many possible peptides could be coded for by this sequence and its complement. There is only one AUG potential initiator codon (phase 1), but this does not have any sequence more than two nucleotides long 5'-ward to it complementary to E. coli 16S RNA and, therefore, does not satisfy the Shine-Dalgarno rules (20) for an initiator, nor do we find internal terminators or symmetry elements 3'-ward to it to satisfy sequence constraints for initiator regions (6). The single terminator codon UAA in phase 3 might eliminate one of the possible six phases; on the other hand it might represent an actual carboxyl terminus in this phase.

With initiators and terminators ambiguous and in the absence of other foreknowledge, it is not certain that the sequence is coding for protein at all and, if so, the chance of choosing the right phase is less than 1 in 6, possibly something like 1 in 12.

We can ask about any sequence whether it is recognized by tRNAs and, if so, whether it is recognized by just a central trinucleotide or by a larger part of the anticodon loop. Along the left-hand margin of Figs. 2 and 3 I have indicated all the relevant tRNA sequences that were available in 1978 (from E. coli or, if not available, from T4 bacteriophage). Underneath each possible reading frame of my sequence I have aligned each tRNA anticodon with its corresponding codon and the nucleotides adjacent to the anticodon. The bold-lettered anticodon sequences correspond to anticodon loop nucleotides that can potentially base pair with the possible mRNA. G-C, A-U, and G-U are the allowable base pairs. Modified bases not in the anticodon are considered, for this analysis, to pair as if unmodified.

The histograms below Figs. 2 and 3 show the number of possible contiguous base pairs that can be formed between each successive tRNA anticodon loop and the mRNA codon with its adjacent nucleotides, less three for the anticodon. This is called the number of "extra" base pairs for each potential tRNA-mRNA interaction. If we sum the number of extra base pairs in phase 1 (that is, nucleotide numbers 1, 4, 7, 10, etc.), we get 22 extra base pairs (i.e., 22/11 or 2 extra nucleotide base pairs per codon-anticodon interaction). For phase 2 (numbers 2, 5, 8, 11, etc.), we find 9 extra base pairs or 0.82 extra nucleotide base pair per codon-anticodon interaction. Phase 3 has 0.86 base pair per tRNA (fewer interactions postulated because of the terminator and because E. coli tRNACys is unknown). Phases 4, 5, and 6 have 0.09, 1.11, and 0.70 extra base pair, respectively, per tRNA-mRNA interaction. Clearly, phase 1 provides the most extra base pairs and totals five potential base pairs for the average interaction. This phase is present in the complement ([...] product) of the DNA (+)strand sequenced. Phase 4, by contrast, has only one possible extra base pair among all 11 interactions. In assuming the classical model of a three-base-pair anticodon-codon interaction, one might even suppose neighboring base pairs to interfere with access to adjacent codons; then only phase 4 would be compatible with the original model. All of the other possible phases provide four or more base pairs per unit tRNA-mRNA interaction. If potential extra base pairing were incompatible with function, phase 4 would be preferred, yet this exists in a transcript of the viral (+)strand, not of the (-)strand shown to be actually functional (1). On the other hand, if we propose that tRNA-mRNA interactions frequently or sometimes take advantage of more than a three-base-pair interaction, as did Crick et al. (4), who postulated an initial five-base-pair interaction for proto-tRNA-mRNA, then we can see that phase 1 best fits the postulate and is moreover on the transcript of the known coding strand of f1.

The histogram at the bottom of Fig. 2 shows the periodicity of the available extra base pairs. At nearly every position corresponding to phase 1, the number of extra nucleotide pairs is higher than for adjoining phase 2 or 3. That is, there is a peak with more base pairing in phase 1 than in phase 2 or 3 and this peak occurs every third nucleotide. For every interaction except the fifth, tRNAs coding for phase 1 interact with more base pairs than do tRNAs for phase 2 or 3. What is interesting about this one case is that the codon starting at nucleotide 5 codes (phase 2) for phenylalanine, the same amino acid as the codon at nucleotide 4 (phase 1).

Peaks of tRNA-mRNA extra base pairs occurring every third nucleotide, in phase 1, would be quite consistent with a triplet code. It is comma free in a very specific sense. Fig. 4 shows a segment of phase 1 sequence, coding for Gly-Gly (amino acids 8 and 9). It shows all the tRNAs and release factors known to be capable of interacting with this sequence in any of the three phases. It is seen that E. coli tRNAGly potentially can form 11 hydrogen bonds and 2 G-U stabilizing bonds and the second E. coli tRNAGly can form 10 hydrogen bonds and 1 G-U stabilizing bond, many more than all the other tRNAs could give.

FIG. 4. Comma-free characteristic of tRNA interaction with mRNA. (Shown aligned with the f1 mRNA segment: E. coli tRNAGly, tRNAAla, and tRNAArg anticodon loops and release factor.)

If we imagine a competitive affinity test in which this mRNA sequence is immersed in a "pool" of these four tRNAs, E. coli tRNAGly would bind to U-G-G-C-G-G strongly and then a second one (another E. coli tRNAGly) would compete with the right end of this sequence (G-C base pairs versus G-U base pairs) at G-G-U-A-A. Therefore, Gly-Gly would be the favored coding in terms of the number and order of possible hydrogen-bonding interactions, even without necessarily taking the known direction of translation of ribosome function into account.

The triplet peaks in the histogram at the bottom of Fig. 2 would tend to show that on such a basis the amino acid sequence that would be favored would be identical to the one given for phase 1 even though a phase 2 tRNA is contributing the phenylalanine in the sequence.

This decoding is comma free, i.e., phase can be inferred in a relatively small stretch of sequence independent of having initiator or terminator codons. It is a structural alternative to the comma-free coding originally suggested as intrinsic to the genetic code by Crick et al. (21).

The constraint of optimal base pairing and the triplet peaking can also be used to analyze a sequence whose biochemical function is known. The sequence of a ribosome binding site of f1 was determined (6) and identified as coding for the initiation of gene IV (22). This sequence is given in Fig. 5 and parsed into all three phases of reading. This sequence contains a coding and a noncoding region. The region 5'-ward from the AUG should be noncoding and the region 3'-ward from the AUG should be coding. There are no more than two consecutive bases complementary to 16S RNA preceding the AUG in this sequence. If we subject this sequence to the same type of analysis, we note that phase 1 has 1.5 extra base pairs whereas phases 2 and 3 have 0.75 and 1.0 extra base pair, respectively, on the average. Phase 3 has two terminator codons, which effectively eliminate it from being a coding phase. However, for phase 1 with an average 4.5 base pairs per interaction, the histogram shows characteristic peaking every third base in phase 1 more often than in any other phase, particularly in the "noncoding" region of this ribosome binding site.

FIG. 5. Analysis of phases 1, 2, and 3 of f1 ribosome binding site for possible overlapping gene function. (Panels: phases 1, 2, and 3, each with the aligned tRNA anticodon loops; histogram of the number of extra base pairs in tRNA-mRNA interaction against nucleotide number.)

We know that phase 2 is the actual phase for the initiation of reading and, yet, phase 1 displays the most extra base pairs and the most characteristic triplet peaking of all three phases. This apparent contradiction can be resolved by proposing that phase 1 codes for a protein that overlaps the ribosome binding site and is later terminated by an internal terminator of phase 2. Thus, although the biochemical criterion of ribosome binding suggests phase 2 is the coding phase, this syntactical constraint suggests that phase 1 is also coding.

Is there any biochemical evidence to support these proposals? They lead to the following predictions:

(i) The template or viral strand should be homologous to the mRNA sequence, and its complement is the one that determines the mRNA coding in f1. This was demonstrated (1).

(ii) The Hae III restriction fragment used as primer will be mapped in, and adjacent to, a genetically [...]. The fragment used in this experiment co-runs on a Tris acetate (pH 7.4) gel with the defined Hae III H restriction fragment (23, 24). This fragment was also used to rescue an amber mutant, R143 of f1, genetically mapped in a coding region, gene IV (23).

(iii) Closely homologous sequences will be found in similar bacteriophages that differ only in synonymous codons or in minor shifts of amino acid type. Workers in several laboratories recently determined (25) a complete sequence of fd, a single-stranded DNA bacteriophage that complements f1 mutants and has a similar Hae III restriction map. By computer search using a similarity measure of homology (unpublished data) we located a sequence that is almost identical to the f1 sequence of Fig. 2. This corresponding fd sequence runs from nucleotide 5169 to nucleotide 5202 (enumerated from the single HindII site), except for nucleotide 5177, which is a C in fd and a T in f1 [(+)strand]. On Fig. 1, the complement of this T is the dark band near the top of the A slot bounded by a weak C band and two strong G bands (C-A-G-G). This T/C difference is a synonymous difference only in phase 1, where it constitutes the third position of a triplet.

(iv) A protein product of gene IV exists with an amino acid sequence corresponding to that determined by reading phase 1. That such a protein exists is quite likely; a ribosome binding site has been isolated for gene IV mRNA (1, 6, 22) and in vitro transcription-translation experiments have produced a putative protein product of gene IV. The amino acid sequence of this protein has not been determined directly.

The analysis of sequence 2 suggests that besides being a ribosome binding site for gene IV mRNA, it is also coding for the middle or terminus of another protein. Sanger et al. (26) clearly demonstrated, by using both DNA and protein sequences, that of the nine genes in φX174, the genes D with E and A with B are in overlapping coding configurations. Schaller et al. (25) suggested in their analysis of a sequence for bacteriophage fd that the carboxy-terminal region of gene I overlaps the ribosome binding site for gene IV. f1 and fd differ structurally from φX174 in that they are filamentous bacteriophages with no apparent packaging constraint on DNA size. Therefore, the existence of an overlapping region in f1 and fd is more significant and open to question than in φX174. Their analysis of this region depends on the relationships between initiator codons, internal terminators, and termination codons. In addition, the gene IV ribosome binding site sequence has been extended (25) and confirms the palindromic nature of the internal terminator regions predicted in ref. 6. If the postulated overlap turns out to be present, an internal terminator of gene IV functions as the actual terminator of gene I. This illustrates how selection for one function can introduce sequence patterns affecting another function. A carboxy-terminal sequence of gene I protein product should confirm or disprove the prediction of the overlap made in both the conventional sequence analysis (25) and in the syntactical analysis given here.

When the same principles are used to analyze a long sequence where there are sufficient internal terminators to determine phase, such as the MS2 RNA bacteriophage coat protein, we find that the coding phase gives 1.4 extra base pairs per triplet, while the other two phases give 1.0 and 0.8, respectively (over a region of 130 codons). An extensive study of this constraint in respect to known sequences coding for known gene products, with overlapping coding configurations, with alternate codings to the known genetic code, and with intervening sequences will be presented when the class of tRNA sequences
corresponding to potential codings becomes available or more nearly complete. This tRNA-mRNA constraint is proposed as an additional method of analyzing nucleic acid sequences, even relatively short ones. In reality, it is independent from, though compatible with, determining coding by the phase relationship of initiators, internal terminators, and proper terminators.

Given the known tRNA nucleotides adjacent to the anticodons, in E. coli codons will be favored that have a first-position purine and a third-position pyrimidine (the sequence purine-N-pyrimidine) favorable to binding of tRNA for adjacent codons. This amounts to a compositional constraint upon which specific codons are used, at least in this organism, and not a constraint upon the sequence of codons or peptides. Brenner (27) showed in his analysis of overlapping codes that there was no constraint on protein sequence. This suggests that a codon catalogue with a purine-N-pyrimidine compositional bias would have a selective advantage for translation, particularly in E. coli.

Although the overlapping tRNA-mRNA analysis includes all possible base pairs, it is clear that tRNA has a structure and that nucleotides adjacent to the 3' end of the anticodon are modified. Some of these modifications are quite substantial and could clearly inhibit base pairing next to this end. This modification could have evolved, in effect, to limit such overlapping interaction and to allow codons to be less context sensitive, thereby allowing more flexibility in the amino acid sequences available to proteins. It is similar to Darwin's observation that the pistil and stamen of flowers are physically so close as to allow self fertilization, yet the way they function precludes self fertilization. As sexual variability is important for adaptability to new environments, so too one presumes is codon context variability (28). Codon context sensitivity is suggested by certain suppressor experiments (29-31).

The overlapping tRNA sequence constraint on mRNA I have presented can explain in some measure the high frequency of third-position T in codons for φX174, a characteristic often used to order and correct φX174 sequence (18). Most tRNAs have A or G 3'-ward from the anticodon; those with the G can base pair with a third-position C of the adjacent codon. However, a U in this third position (coming from a T) can base pair with either the G or the A. It is also clear that a third-position U and a first-position A/G is compatible with out-of-phase internal terminator sequences. The ratio of U-NNN-A to A-NNN-U in long sequences can be used as a first-order approximation of this constraint for determining phase without knowledge of specific tRNA. Similarly, the use of C in fMet-tRNAfMet (eukaryotic) is in keeping with the higher use of G codon coded amino acids at the start of eukaryotic proteins.

The overlapping tRNA-mRNA interaction is in a sense a virtual constraint. To be actual, it predicts and requires that tRNA anticodon loops have more than three base pairs stacked and that anticodon loops can assume at least two conformations (4). The crystal structure of yeast tRNA shows five bases in a stacked configuration (32). Temperature jump experiments also lend strong support to this prediction of alternate conformation for tRNA anticodon loops (33).

In summary, I present a method for analyzing the functionality of short nucleotide sequences that is predictive and complementary to previous methods of sequence analysis for coding function. The analysis presumes that a message sequence preserves sequence constraints that are a consequence of its history, function, or direct selection (genotypic selection). It suggests that the translation machinery "sees" codons in a context, that the triplet nature of coding is not a consequence of an obligatory three-base-pair tRNA-mRNA interaction, and that a structural manifestation of the mRNA constraint requires that tRNA anticodon loops can take up at least two configurations.

I thank Dr. R. Hotchkiss for generous guidance in preparing this manuscript.

1. Pieczenik, G., Horiuchi, K., Model, P., McGill, C., Mazur, B. J., Vovis, G. F. & Zinder, N. D. (1975) Nature (London) 253, 131-132.
2. Pieczenik, G. (1973) Dissertation, New York University (University Microfilms, Ann Arbor, MI, 73-19/955).
3. Pieczenik, G. (1977) Hearings Before the Subcommittee on Science, Research and Technology of the Committee on Science and Technology, 95th Congress, 1st Session, no. 24 (GPO, Washington, DC), pp. 334-340.
4. Crick, F. H. C., Brenner, S., Klug, A. & Pieczenik, G. (1976) Origins of Life 7, 389-397.
5. Pieczenik, G., Barrell, B. G. & Gefter, M. (1972) Arch. Biochem. Biophys. 152, 152-165.
6. Pieczenik, G., Model, P. & Robertson, H. (1974) J. Mol. Biol. 90, 191-214.
7. Macino, G., Coruzzi, G., Nobrega, F. G., Li, M. & Tzagoloff, A. (1979) Proc. Natl. Acad. Sci. USA 76, 3784-3785.
8. Kimura, M. & Ohta, T. (1971) Nature (London) 229, 467-469.
9. King, J. L. & Jukes, T. H. (1969) Science 164, 788-798.
10. Dawkins, R. (1976) The Selfish Gene (Oxford Univ. Press, Oxford, England).
11. Sanger, F., Donelson, J. E., Coulson, A. R., Kossel, H. & Fischer, D. (1974) J. Mol. Biol. 90, 315-3.
12. Ravetch, J. V., Horiuchi, K. & Zinder, N. (1977) Proc. Natl. Acad. Sci. USA 74, 4219-4222.
13. Horiuchi, K., Vovis, G. & Zinder, N. (1974) J. Biol. Chem. 249, 543-552.
14. Ornstein, L. (1964) Ann. N.Y. Acad. Sci. 121, 211-255.
15. Davis, B. J. (1964) Ann. N.Y. Acad. Sci. 121, 404-427.
16. Sanger, F., Nicklen, S. & Coulson, A. R. (1977) Proc. Natl. Acad. Sci. USA 74, 5463-5467.
17. Barrell, B. (1978) Biochemistry of Nucleic Acids, ed. Clark, B. F. C. (University Park Press, Baltimore, MD), Vol. 2, pp. 125-179.
18. Sanger, F. & Coulson, A. R. (1978) FEBS Lett. 87, 107-110.
19. Plueddemann, E. P. (1968) J. Paint Technol. 40 (No. 516), 1.
20. Shine, J. & Dalgarno, L. (1975) Eur. J. Biochem. 57, 221-230.
21. Crick, F. H. C., Griffith, J. A. & Orgel, L. E. (1957) Proc. Natl. Acad. Sci. USA 43, 416-421.
22. Ravetch, J., Horiuchi, K. & Model, P. (1977) Virology 81, 341-351.
23. Horiuchi, K., Vovis, G., Enea, V. & Zinder, N. (1975) J. Mol. Biol. 95, 147-165.
24. Van den Hondel, C. & Schoenmakers, G. (1976) J. Virol. 18, 1024-1039.
25. Schaller, H., Beck, E. & Takanami, M. (1978) The Single-Stranded DNA Phages, eds. Denhardt, D. T., Dressler, D. & Ray, D. S. (Cold Spring Harbor Laboratory, Cold Spring Harbor, NY), pp. 139-163.
26. Sanger, F., Air, G., Barrell, B., Brown, N., Coulson, A., Fiddes, J., Friedman, J. & Smith, A. (1978) The Single-Stranded DNA Phages, eds. Denhardt, D. T., Dressler, D. & Ray, D. S. (Cold Spring Harbor Laboratory, Cold Spring Harbor, NY), pp. 659-669.
27. Brenner, S. (1957) Proc. Natl. Acad. Sci. USA 43, 687-694.
28. Darwin, C. (1958) The Origin of Species (Pelican, London), p. 144.
29. Comer, M., Foss, K. & McClain, W. (1975) J. Mol. Biol. 99, 283-293.
30. Hirsch, D. (1970) Nature (London) 228, 37.
31. Atkins, J. (1979) in Transfer RNA, eds. Soll, D., Abelson, J. & Shimmel, P. (Cold Spring Harbor Laboratory, Cold Spring Harbor, NY), pp. 64-75.
32. Ladner, J., Jack, A., Robertus, J., Brown, R., Rhodes, D., Clark, B. & Klug, A. (1975) Proc. Natl. Acad. Sci. USA 72, 4414-4418.
33. Urbanke, C. & Maass, G. (1978) Nucleic Acids Res. 5, 1221-1560.
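
The parsing into phases and the counting of "extra" base pairs described under RESULTS AND DISCUSSION can be illustrated with a minimal sketch. It is not the procedure actually used for Figs. 2, 3, and 5, which relied on tRNA sequences that are not reproduced here. The function names, the seven-nucleotide loop in the last example, and the simplification that the codon-anticodon triplet itself is taken as paired are all assumptions made for illustration; only the (-)strand sequence and the allowed pairs (G-C, A-U, G-U) come from the text.

```python
"""Sketch: parse the f1 transcript into phases and count "extra" base pairs.

All function names and the example anticodon loop are hypothetical.
"""

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code in one-letter symbols; "*" marks UAA, UAG, and UGA.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Base pairs allowed in the analysis: G-C, A-U, and G-U.
ALLOWED_PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}

# (-)strand segment as printed in the text, written 5' to 3'.
MINUS_STRAND = "CAATATTACCGCCAGCCATTGCAACAGGAAAAAC"


def reverse_complement(dna):
    """Return the complementary DNA strand, written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


PLUS_STRAND = reverse_complement(MINUS_STRAND)
# A transcript of the (-)strand has the (+)strand sequence with U for T.
MRNA = PLUS_STRAND.replace("T", "U")


def parse_phase(mrna, offset):
    """Translate complete triplets starting at a 0-based offset (0, 1, or 2)."""
    return "".join(
        CODON_TABLE[mrna[i:i + 3]] for i in range(offset, len(mrna) - 2, 3)
    )


def extra_base_pairs(mrna, codon_start, loop, anticodon_offset=2):
    """Count contiguous pairs flanking the codon-anticodon triplet.

    mrna and loop are both written 5' to 3'; codon_start is 0-based.
    anticodon_offset is the index in loop of the first anticodon base
    (2 for a seven-nucleotide loop with the anticodon in the middle).
    Assumption: the triplet itself is paired, so only flanks are tested.
    """
    rev = loop[::-1]  # read 3' to 5' so that it lines up with the mRNA
    first = len(loop) - (anticodon_offset + 3)  # rev index facing codon base 1
    extra = 0
    i, j = codon_start - 1, first - 1  # extend 5'-ward along the mRNA
    while i >= 0 and j >= 0 and mrna[i] + rev[j] in ALLOWED_PAIRS:
        extra, i, j = extra + 1, i - 1, j - 1
    i, j = codon_start + 3, first + 3  # extend 3'-ward along the mRNA
    while i < len(mrna) and j < len(rev) and mrna[i] + rev[j] in ALLOWED_PAIRS:
        extra, i, j = extra + 1, i + 1, j + 1
    return extra


if __name__ == "__main__":
    for phase in range(3):
        print("phase", phase + 1, parse_phase(MRNA, phase))
    # Hypothetical loop CGGCCUA (anticodon GCC) against a toy mRNA.
    print(extra_base_pairs("AAGGCUU", 2, "CGGCCUA"))
```

Run as a script, the sketch prints the three parses of the transcript. Phase 1 reads VFPVAMAGGNI in one-letter code, which places Gly-Gly at amino acids 8 and 9, and phase 3 contains the single UAA terminator. Summing `extra_base_pairs` over the codons of one phase and dividing by the number of interactions would give a per-phase average of the kind quoted in the text (for example 22/11 = 2), provided real anticodon loop sequences are supplied.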