
Proc. Natl. Acad. Sci. USA Vol. 78, No. 12, pp. 7665-7669, December 1981

Enhanced graphic matrix analysis of nucleic acid and protein sequences

(sequence/transposition/secondary structure/immunoglobulin/microcomputer)

JACOB V. MAIZEL, JR., AND ROBERT P. LENK

Section on Molecular Structure, Laboratory of Molecular Genetics, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland 20205

Communicated by Philip Leder, September 23, 1981

ABSTRACT The enhanced graphic matrix procedure analyzes nucleic acid and protein sequences for features of possible biological interest and reveals the spatial patterns of such features. When a sequence is compared to itself the technique shows regions of self-complementarity, direct repeats, and palindromic subsequences. Comparison of two different sequences, exemplified by immunoglobulin κ light chain genes, by using colored graphic matrices showed domains of similarity, regions of divergence, and features explainable by transpositions. Analysis of mouse constant domain immunoglobulin sequences revealed self-complementary regions that can be used to fold the molecule into a structure consistent with electron microscopic observations. Computer translation of nucleic acid sequences into all possible amino acid sequences followed by graphic matrix analysis provides a way to detect the most likely protein encoding regions and can predict the correct reading frames in sequences in which splicing patterns are not defined. Application of this technique to regions of simian virus 40 and polyoma virus demonstrates the frames of translation and shows the agreement of sequences determined in separate laboratories with different virus isolates. The graphic matrix technique can also be used to assemble fragmentary sequences during determination, to display local variations in base composition, to detect distant evolutionary relationships, and to display intragenic variation in rates of evolution.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.

The recent development of techniques for rapid determination of nucleic acid sequences (1, 2) has allowed complete specification of a number of cellular genes and the entire genomes of several viruses. The data, consisting simply of finite arrays of nucleotides, embody the intrinsic blueprint of an organism. Transmission and correlation of this large and growing collection of data are ideally handled by computer.

Several widely distributed computer programs and their descendants are in use (3-7). A review of some programs for nucleic acid analysis has appeared (8). Most frequently used are programs that search for occurrences of short subsequences that are used by enzymes as signals to recognize, modify, and express nucleic acids, that determine the frequency and locations of short strings of nucleotides, and that translate nucleic acid sequences into amino acid sequences or complementary polynucleotide strands.

Another category of analysis, originally applied to amino acid sequences, uses programs that collate large stretches of sequence data to determine imperfect homologies and thereby show evolutionary relationships. Often the approach is to generate a matrix with a numerical calculation for each comparison. Aside from sophisticated caveats, such as the relative efficacy of various algorithms for measuring evolutionary distances (9), these programs are limited in the body of data they can handle, because the computations increase as a power of the sequence length. Available resources are exhausted by sequence lengths on the order of 10³ to 10⁴. An even more severe limitation exists for programs that attempt to predict the folding of single-stranded nucleic acids by evaluating intra-strand self-complementarity.

This paper describes a simple sequence comparison based on a graphic matrix display. Graphic techniques were used by Tinoco et al. (10) to find regions of self-complementarity in RNA and by Gibbs and McIntyre (11), Fitch (12), and McLachlan (13) to determine regions of homology in amino acid and nucleic acid sequences. The refinements described here have already been applied to globin (14) and immunoglobulin sequences (15, 16). These programs enable the approach to be used with sequences of more than 10,000 bases and are enhanced by various subroutines that detect partial homology between sequences. A large graphic matrix can be compressed into a single page for a broader overview. Additional programs permute sequences, either by complementing or translating them into amino acids, and use color to increase the information content of the graphic matrix.

MATERIALS AND METHODS

The graphic matrix programs are written in Hewlett-Packard extended BASIC specifically for a Hewlett-Packard 9845 desktop computer with 64 kilobytes of memory and integral dot matrix printer. They could be adapted to other microcomputers. Ancillary equipment includes a model 9872A four-color plotter and a model 9885 flexible disc drive. Linkage with the data storage facilities at the National Institutes of Health, the Stanford Molgen project, and the Dayhoff Nucleic Acid Sequence project is through a communications-terminal emulating program, acoustic coupling device, and standard telephone.

The basic program, called REVEAL, functions by projecting two sequences to be compared along the horizontal and vertical axes of a two-dimensional array. The matrix is constructed by marking the position corresponding to every element with a dot where the bases corresponding to the vertical and horizontal coordinates are identical. A nucleic acid sequence comparison of one base at a time consists of only four kinds of horizontal rows, one for each of the four types of bases (Fig. 1). This feature is utilized to simplify the actual computation by storing each type of row and recalling it when the corresponding base is encountered along the vertical column. A comparison of two 1000-base sequences, one base at a time, is completed and printed in 1-2 min. For analysis of three bases at a time, using stored rows for 64 triplets, about 5 min is required.
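The stored-row idea behind the base-for-base comparison can be illustrated with a minimal Python sketch. This is not the authors' Hewlett-Packard BASIC program: the function names `build_rows`, `dot_matrix`, and `render`, and the text rendering with `*` and `.`, are assumptions made only for illustration.

```python
def build_rows(horizontal, alphabet="ACGT"):
    """Precompute one matrix row for each kind of base (four rows for DNA)."""
    return {base: [h == base for h in horizontal] for base in alphabet}


def dot_matrix(horizontal, vertical, alphabet="ACGT"):
    """Build the matrix by recalling the stored row for each vertical base.

    Rows for the same base are the same list object, so nothing is
    recomputed as the vertical sequence is walked.
    """
    rows = build_rows(horizontal, alphabet)
    empty = [False] * len(horizontal)  # used for symbols outside the alphabet
    return [rows.get(v, empty) for v in vertical]


def render(matrix, dot="*", blank="."):
    """Return the matrix as text, one line per vertical position."""
    return "\n".join(
        "".join(dot if cell else blank for cell in row) for row in matrix
    )


if __name__ == "__main__":
    seq = "GATTACA"  # hypothetical example sequence
    print(render(dot_matrix(seq, seq)))
```

In a self-comparison every base matches itself, so the main diagonal of the printed matrix is fully dotted. The same approach would extend to triplets by keying the stored rows on the 64 possible three-base strings.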

The utility of the graphic matrix procedure is enhanced by a filtration procedure in which n bases (n is an odd number) are collated at each step. The program allows for a minimal number of homologous matches, m, out of n bases to be scored by entering a dot in the position of the central coordinate. In the ensuing step the horizontal n-base subsequence is stepped over one base and the homologies are assessed again. At the end of a line the vertical subsequence is dropped by one base and the process is repeated. Considerably greater time is needed for this type of analysis in comparison with the high-speed base-for-base comparison above.

Additional subroutines permute sequences in one of several ways, including translation into amino acids in three reading frames, generating a sequence's complement, or allowing the weak interaction of G·T or G·U base pairs to contribute to reverse complement matching.

RESULTS

Demonstration of Features in a Hypothetical Polynucleotide by Using the Graphic Matrix. Fig. 1 shows the analysis of a hypothetical nucleic acid to demonstrate the concepts and the main features revealed by the graphic matrix technique. If a sequence is compared with itself, as in Fig. 1A, a perfect diagonal arises from the match of each base with itself. Furthermore, repeated regions within the sequence appear as parallel lines; for example, the subsequence A-T-C-A-C-G starting at bases 4 and 25. A self-comparison matrix has identical halves above and below the main diagonal, but comparison of different sequences does not. Alphabetic palindromes are recognized as diagonal lines in the lower left to upper right direction. An example is A-C-G-T-A and A-T-G-C-A starting at positions 7 and 11. No biological relevance has been attached to this type of feature; therefore, it provides a measure of the "chance" occurrence of matches of various lengths.

Comparison of the sequence with its reversed complement reveals regions of self-complementarity that can be involved in the formation of secondary structure in single-stranded molecules, as shown in Fig. 1C. This function has increasing importance as secondary and higher order structures are implicated in the expression of single-stranded nucleic acids. Self-complementarity is revealed by diagonal lines in the upper left to lower right direction; for example, G-T-A-T-G-C beginning at position 9 is self-complementary to G-C-A-T-A-C beginning at position 19. There is a 2-fold symmetry on either side of the nonmatching diagonal (dashed line) that runs from the lower left to the upper right corners that is useful in recognizing the regions of self-complementarity. The two subsequences that could pair are identified by projecting vertical lines from the ends of the two diagonal segments to the horizontal (forward) sequence. Then the complementary regions can be joined and the molecule folded as shown in Fig. 1B. From even a small sequence, a large number of possible combinations can occur. The evaluation of all their stabilities in order to choose the most stable one is a large task. Programs to search and evaluate these possibilities in a rigorous manner are an active subject of investigation (9, 17, 18).

FIG. 1. Demonstration of the graphic matrix technique using artificial sequences. (A) Self-comparison of a sequence reveals direct repeats (solid line). (C) Comparison of a sequence and its reversed and complemented sequence reveals self-complementary regions (solid line). The diagonal of 2-fold symmetry is shown as a dashed line. (B) Folding of a single-stranded polynucleotide based on self-complementarity.

Comparison of Two Related Sequences: Mouse κ Chain Variable Regions. Fig. 2 Left is a heterologous comparison of portions of κ-2 (K2) and κ-41 (K41) variable regions of two subgroups from the mouse immunoglobulin system. These subsequences are of the germ-line genome and contain the leader, intervening, and coding sequences (19-21). The sequence is displayed with each of the four bases assigned a different color, allowing direct reading of the homologous sequences from the graphic matrix. Local variations in base composition are readily apparent. Runs of repeated bases appear as lines or rectangles of a single color.

Lines of partial homology are seen in Fig. 2 Left between the regions of bases 540-590 of K2 and 300-350 of K41 and to varying degrees between 620-850 of K2 and 380-600 of K41.

The overall homology is 68% and typical of partially divergent sequences. Fig. 2 Right illustrates a matrix in which the extent of homology in nine bases is indicated by the color of a dot placed at the center of the group. This method effectively removes many of the chance background homologies. In addition to the main diagonal, short regions of enhanced homology are seen between 633-646 of K2 and 191-204 of K41 and between 611-622 of K2 and 528-539 of K41. They coincide with places where the main diagonal has vanished. It is possible to read the sequences directly from the colored matrix, especially Fig. 2 Left, Insets A and B. They are:

    A                           B
    K2-633   AAGTGGGAATATTC     K2-611   CTGTCACCATCA
    K41-191  AATTGGGAA1TTTC     K41-528  CTCTCAGCATCA

On the basis of a numerical program (5, 6), these homologies are significantly improbable (4.8 × 10⁻⁶ and 1.8 × 10⁻⁶); there are no other homologies with probabilities less than 5 × 10⁻⁵. Transpositions, or duplications followed by deletion, of portions of the K41 sequence bases 390→191 and 378→538 would account for this result. The former event would involve a move from a coding region to an intervening sequence; the latter would involve a move of a coding region to another coding region. It is not known whether the events occurred by reciprocal exchange of sequence at these sites or by segmental gene conversion (22). A role for such transpositions in the function of the immunoglobulin gene remains to be found, but because the germ-line and a differentiated gene have similar sequences, it is not necessary to imagine a role as a somatic generator of diversity.

FIG. 2. Graphic matrix of two κ chain variable regions from the mouse. κ chain variable region 2 is on the horizontal axis, and region 41 is on the vertical. (Left) Each dot represents one base: black, T-T; red, A-A; green, C-C; blue, G-G. (Right) The same sequences filtered by collating nine nucleotides at a step. Black, all nine bases are homologous; red, eight of the nine; blue, seven of nine; green, six of nine. Absence of a dot indicates five or fewer matches.

Secondary Structure of RNA Inferred from the Graphic Matrix. The filtration subroutine can reveal regions of reverse complementarity within a sequence and facilitate the selection of regions for combination into a folded structure. Fig. 3A demonstrates the reverse complements of the mouse κ chain mRNA (16, 23). Two filtrations are superimposed. The small dots show perfect complementary segments of three nucleotides; Xs indicate regions where at least 9 of 11 bases are complementary. Fig. 3B illustrates how these regions of reverse complementarity might pair in a folded molecule. Each pair of complements is projected onto the horizontal axis; a hypothesized structure is shown below it. The electron micrograph in Fig. 3C, similar to that in ref. 19, shows a typical structure of κ chain mRNA hybridized to a DNA fragment containing the κ variable region gene. Because the cloned fragments do not contain J or constant region sequences, the unhybridized constant region projects from the resulting R loop. The resolution of the electron microscopic procedure used here does not allow confirmation of the specific structure illustrated, but both the size and location of the hairpin are consistent with the prediction.

Protein Coding Regions Inferred from the Graphic Matrix. An accessory filtration and compressed plot procedure gives a useful overview of sequences too long for examination in a 1:1 matrix reduced to page size. Fig. 4A is a comparison of the sequences of simian virus 40 (24) and polyoma virus (25), two distantly related papovaviruses of more than 5000 base pairs each. Here the DNA sequences are compressed by a combination of collating five bases at a time for perfect match and by scaling the plotted display to page size. Regions of interspecies homology are distributed selectively along the genomes. Highest interspecies sequence conservation is shown by the gene for coat protein VP1. Note that an insertion or deletion between the early and late regions of either one or both genomes has led to a shift of the homology from the diagonal.

The imperfect homology of the early region genes is also reflected in the computer-translated amino acid sequences, as demonstrated in Fig. 4B. Nine graphic matrices comparing all frames of translation for the early regions of simian virus 40 and polyoma virus are shown. One comparison (frame 3 of simian virus 40 versus frame 3 of polyoma virus) shows strikingly greater homology than do the eight other translation comparisons or the direct nucleic acid comparison (Fig. 4A). This region encodes the portion of the large tumor antigen acquired by splicing frame 2 into frame 3 to circumvent the small tumor antigen termination codons.
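The nine-way comparison of reading frames can be sketched in Python as below. This is only an illustration under stated assumptions: it uses the standard genetic code, replaces visual inspection of each matrix with a crude numerical stand-in (the largest number of identical residues on any single diagonal), and the names `translate`, `best_diagonal_score`, and `frame_scores` are hypothetical.

```python
from itertools import product

# Standard genetic code in TCAG order; "*" marks a termination codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame):
    """Translate dna in frame 1, 2, or 3 (frame 1 starts at the first base)."""
    start = frame - 1
    return "".join(
        CODON_TABLE.get(dna[p:p + 3], "X")  # "X" for unrecognized codons
        for p in range(start, len(dna) - 2, 3)
    )


def best_diagonal_score(a, b):
    """Largest count of identical residues found on any one diagonal."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        lo, hi = max(0, offset), min(len(a), len(b) + offset)
        score = sum(1 for i in range(lo, hi) if a[i] == b[i - offset])
        best = max(best, score)
    return best


def frame_scores(dna1, dna2):
    """Score all nine pairings of the three frames of each sequence."""
    return {
        (f1, f2): best_diagonal_score(translate(dna1, f1), translate(dna2, f2))
        for f1, f2 in product((1, 2, 3), repeat=2)
    }


if __name__ == "__main__":
    # Hypothetical sequences encoding the same peptide with different
    # codons; the second is shifted by one base.
    first = "ATGGCTGCTAAAGAACTG"
    second = "C" + "ATGGCAGCGAAGGAGTTA"
    scores = frame_scores(first, second)
    print(max(scores, key=scores.get), scores)
```

A pairing of frames that scores well above the other eight is a candidate for a conserved coding region. The single-diagonal score is a simplification; the graphic matrix itself also shows where along the sequences the homology lies.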

Another region of increased homology can be found in the first 100-200 amino acids of frame 2 versus frame 2 corresponding to the small tumor antigen in an unfiltered comparison (data not shown). Increased homology at the amino acid level as opposed to nucleic acid level was noted previously (26).

Used in this manner, the graphic matrix alone reveals potential coding regions in two related sequences even if no other data are available. This relies on the simple hypothesis that an enhanced homology in one set of frames implies evolutionary conservation of amino acid sequence. This is most likely to occur if real [...] exist for those frames of translation. The technique could also aid in predicting the correct choices among many possible alternatives for coding regions in genes that have [...] after intervening sequences.

FIG. 3. Secondary structure of the transcript of the mouse κ chain constant region. The sequence is that of the appropriate J region appended to the constant region sequence. (A) Reverse complement plot of this sequence. x, Nine of a block of 11 nucleotides are complementary; ·, three of three are complements. (B) Folded structure on the basis of the self-complementarity regions noted in A. (C) Sketch and electron micrograph of mouse κ chain mRNA hybridized to a plasmid containing the variable region gene only.

FIG. 4. Compressed graphic matrix of comparison of polyoma virus with simian virus 40 (SV40) sequences. (A) Entire nucleic acid sequences comprising 5226 × 5160 = 26,966,160 individual elements. A dot indicates that five bases in a row are homologous. The early and late regions of the simian virus 40 sequence were rearranged from the published form to bring both sequences into the same sense and orientation. Features of the genome indicated along the margin: t, small tumor antigen; mT1 and mT2, the two portions of the genome which, when spliced, encode the middle-size tumor antigen of polyoma virus; T1 and T2, portions of large tumor antigen RNA, the primary transcript of the early region of the genomes; VP1, VP2, and VP3, the three virus coat proteins; O, origin of replication. (B) The early regions of the two sequences compared after translating each into its three reading frames and concatenating them (frame 1 starts with the first base of each sequence, frame 2 with the second, and frame 3 with the third). This results in nine possible pairwise comparisons.

DISCUSSION

The graphic matrix technique takes advantage of the sensitive discrimination of the eye and mind in detecting patterns in nucleic acid and protein sequences. Most of the deductions made from the sequences presented here can be discovered, but only with difficulty, either by aligning the sequence data manually or by performing a numerical analysis. Both approaches are expedited by the graphic matrix. For example, several programs exist for predicting the secondary structure of RNA but they either require large amounts of computer time or take uncertain shortcuts (18, 26, 27). The graphic matrix can be used to guide numerical programs by eliminating unfavorable regions and thereby shortening calculation times. Existing numerical methods to produce "best" alignments of homologous sequences do not deal with transposed regions within sequences, as exemplified in the κ gene comparisons (Fig. 2), but the graphic matrix technique reveals them readily.

Another application, not shown here, utilizes pairwise analyses on transparent sheets overlaid to reveal similarities in patterns of repeats or reverse complements. Related patterns of repeats or reverse complements can occur in sequences that are otherwise dissimilar, suggesting that general features rather than specific sequences may be recognized by certain processes (28).

The graphic matrix method has proved to be of great practical value in ordering fragments of DNA in the sequence determination process (15). Fragments are compared in forward and reverse complement forms and overlapping fragmentary sequences are arranged in order by generating the comparison matrices and observing the overlaps in the form of homologous regions between ends of the fragments. Direct visual inspection deals successfully with imperfectly repeated sequences which may confuse programs designed to order fragments automatically (5, 8). Comparison with a previously known nucleic acid or amino acid sequence, even if from a different species, can aid in ordering the fragments.

Due to its simplicity and versatility, the graphic matrix can be adapted with ease to reveal various properties and patterns in biological sequences.

We gratefully acknowledge the expert assistance of Terri Broderick in preparation of this manuscript. We thank Jon Seidman for providing sequences and electron micrographs and encouraging the development of these programs, M. Waterman (Los Alamos Scientific Laboratories) and B. C. Bush and Leventhal (Columbia University) for introduction to the literature on matrix methods, Margery Sullivan and Barbara Norman for technical assistance, and the Stanford Molgen and the National Biomedical Research Foundation data base projects.

1. Maxam, A. M. & Gilbert, W. (1977) Proc. Natl. Acad. Sci. USA 74, 560-564.
2. Sanger, F., Nicklen, S. & Coulson, A. R. (1977) Proc. Natl. Acad. Sci. USA 74, 5463-5467.
3. Staden, R. (1977) Nucleic Acids Res. 4, 4037-4051.
4. Staden, R. (1978) Nucleic Acids Res. 5, 1013-1015.
5. Staden, R. (1979) Nucleic Acids Res. 6, 2601-2610.
6. Korn, L. J., Queen, C. L. & Wegman, M. N. (1977) Proc. Natl. Acad. Sci. USA 74, 4401-4405.
7. Queen, C. L. & Korn, L. J. (1980) Methods Enzymol. 65, 595-609.
8. Gingeras, T. R. & Roberts, R. J. (1980) Science 209, 1322-1328.
9. Waterman, M. S., Smith, T. F. & Beyer, W. A. (1976) Adv. Math. 20, 367-387.
10. Tinoco, I., Jr., Uhlenbeck, O. C. & Levine, M. D. (1971) Nature (London) 230, 362-367.
11. Gibbs, A. J. & McIntyre, G. A. (1970) Eur. J. Biochem. 16, 1-11.
12. Fitch, W. M. (1969) Biochem. Genet. 3, 99-108.
13. McLachlan, A. D. (1971) J. Mol. Biol. 61, 409-424.
14. Konkel, D. A., Maizel, J. V., Jr., & Leder, P. (1979) Cell 18, 865-873.
15. Hieter, P. A., Max, E. E., Seidman, J. G., Maizel, J. V., Jr., & Leder, P. (1980) Cell 22, 197-202.
16. Max, E. E., Maizel, J. V., Jr., & Leder, P. (1981) J. Biol. Chem. 256, 5116-5120.
17. Nussinov, R. & Jacobson, A. B. (1980) Proc. Natl. Acad. Sci. USA 77, 6309-6313.
18. Zuker, M. & Stiegler, P. (1981) Nucleic Acids Res. 9, 133-148.
19. Seidman, J. G., Leder, A., Edgell, M. H., Polsky, F., Tilghman, S. M., Tiemeier, D. C. & Leder, P. (1978) Proc. Natl. Acad. Sci. USA 75, 3881-3885.
20. Nishioka, Y. & Leder, P. (1980) J. Biol. Chem. 255, 3691-3694.
21. Seidman, J. G., Max, E. E. & Leder, P. (1979) Nature (London) 280, 370-375.
22. Seidman, J. G., Leder, A., Nau, M., Norman, B. & Leder, P. (1978) Science 202, 11-15.
23. Hamlyn, P. H., Brownlee, G. G., Cheng, C., Gait, M. J. & Milstein, C. (1978) Cell 15, 1067-1075.
24. Reddy, V. B., Thimmappaya, B., Dhar, R., Subramanian, K. N., Zain, B. S., Pan, J., Ghosh, P. K., Celma, M. L. & Weissman, S. M. (1978) Science 200, 494-502.
25. Soeda, E., Arrand, J., Smolar, N., Walsh, J. & Griffin, B. (1980) Nature (London) 283, 445-453.
26. Friedman, T., Doolittle, R. F. & Walter, G. (1978) Nature (London) 274, 291-293.
27. Sellers, P. H. (1974) SIAM J. Appl. Math. 26, 787-793.
28. Studnicka, G. M., Eisenling, F. A. & Lake, J. A. (1981) Nucleic Acids Res. 9, 1885-1904.