RMTEAAEY OLQIMPERSPECTIVE COLLOQUIUM ACADEMY: THE FROM

Statistical signals in

Samuel Karlin† Department of Mathematics, , Stanford, CA 94305-2125

Contributed by Samuel Karlin, July 7, 2005

The Arthur M. Sackler Colloquium of the National Academy of Sciences, ‘‘Frontiers in Bioinformatics: Unsolved Problems and Chal- lenges,’’ organized by , , and myself, was held October 15–17, 2004, to provide a forum for discussing concepts and methods in bioinformatics serving the biological and medical sciences. The deluge of genomic and proteomic data in the last two decades has driven the creation of tools that search and analyze biomolecular sequences and structures. Bioinformatics is highly interdisciplinary, using knowledge from mathematics, statistics, computer science, biology, medicine, physics, chemistry, and engineering.

BLAST ͉ repeat sequences ͉ r-scan statistics ͉ frequent and rare oligonucleotides and peptides

ore than 200 bacterial, 20 ment) (13, 14); r-scan statistics (15) (de- to evaluate gene expression with a differ- archaeal, and 20 eukaryotic tecting anomalous spacing of specific ent set of limitations (32–34). Ribosomal genomes as well as 1,600 markers distributed along a sequence); protein (RP) codon frequencies deviate viral genomes have been and ‘‘frequent’’ and ‘‘rare’’ oligonucleo- strongly from average codon frequencies completelyM sequenced. Moreover, several tides and peptides (sequence words in many bacteria, especially during rapid hundred mitochondrial chromosomal sets that occur statistically more or less fre- growth. By contrast, the expression levels and at least 30 chromosomal plastids have quently than would be expected by for RPs in archaea are variable; 15–30% been sequenced. A deeper understanding chance) (16–18). of archaeal genes have average codon of basic biology can be gained from a The standard BLAST protocol compares usage (35, 36). The most biased codon comparison of organisms in different evo- a query sequence with a large database of usages are found for genes generally lutionary lineages, and this understanding protein sequences to uncover significant involved in general processes of transcrip- is the aim of the currently intense se- similarities that help to circumscribe func- tion͞translation, chaperone͞degradation, quencing efforts. New species thriving in tion and structure of the query sequence. and energy metabolism (33, 34). environments throughout the earth and BLAST-like programs are extended to oceans reveal microbes in geothermal ar- searches using multiple alignments (13, 14, Highlights of the Sackler Colloquium eas, in arctic ice, in acidic springs, in toxic 19); to single-sequence analyses (20), e.g., Discussions at the Colloquium focused on waste sites, in assorted air currents, and in to identify DNA-binding peptides and the following topics: sequence patterns subterranean habitats (1, 2). Unique mixes transmembrane tracts; and to three-di- (12, 18, 20), comparative genomics and of microbes that thrive in different ecosys- mensional analyses, e.g., to identify charge proteomics, modeling of molecular inter- tems have been described (3). For exam- clusters in protein structures and cysteine actions and gene regulatory networks, ple, a new type of bacterial rhodopsin was knots (21, 22). The BLAST programs cur- management and interpretation of protein discovered by genomic analysis of natu- rently serve Ϸ250,000 queries per day at expression data (33, 34), microarrays (29), rally occurring marine bacterioplankton the National Center for Biotechnology over- and underrepresentations of words (4). Prokaryote genomic samples from the Information (NCBI) in Washington, DC. in genomes and proteomes, molecular human oral cavity and the intestinal tract, The theory relies on fundamental proper- evolution (37–39), alternative splicing (40– and their changes in time, are progres- ties of extremal statistical distributions and 43), polymorphisms and SNPs (44), sively being followed (5). The diversity of the stochastic theory of large deviations genomic haplotypes (HAP-MAP) (45), microbes in cutaneous wounds of humans (23–25). There are natural relationships and three-dimensional macromolecular is of great practical importance. Sequenc- between these analyses and studies on the taxonomy (46). In this article, I will sum- ing efforts facilitate health research and maximum service time among customers marize results and perspectives issuing understanding of pathogenesis and may in queuing systems, as well as in applica- from Colloquium talks and then review have commercial, industrial, and agricul- tions to insurance risk and traffic flow some general concepts and methods of tural benefits as well. It is clear that lack models (see refs. 26 and 27 and below) bioinformatics. of data is not a problem today; rather, the DNA microarrays (DNA chips) aim to George Miklos called attention to diffi- challenge will be the analysis of the vast dissect gene expression under varied phys- culties arising from noise in microarray quantity of information already available. iological, clinical, and environmental con- data sets. Conflicts in the clinical applica- Bioinformatic methods underlie most of ditions. Microarrays are used to monitor tion of microarray data to cancers and computational biology. Current emphasis well characterized genes in different situa- complex diseases are exposed in refs. 47 is on computationally efficient and wide- tions; to discover disease genes; to assess ranging algorithms that have been imple- gene expression during treatment with This paper serves as an introduction to the Arthur M. Sackler mented and tested on real and simulated drugs, chemicals, or toxins; to discover Colloquium of the National Academy of Sciences, ‘‘Frontiers data sets. Exact and empirical algorithms genes that compensate for knockout mu- in Bioinformatics: Unsolved Problems and Challenges,’’ have been applied to genomics, proteom- tations; and to profile gene expression in held October 15–17, 2004, at the Arnold and Mabel Beck- ics, gene networks, structure prediction, temporal and in tissue-specific localiza- man Center of the National Academies of Sciences and Engineering in Irvine, CA. Papers from this Colloquium will and drug design. Bioinformatic tools are tions (28–30). Experimental evaluations of be available as a collection on the PNAS web site. The used daily, including the generalized protein abundances under different cellu- complete program is available on the NAS web site at BLAST programs (sequence similarity eval- lar conditions can be assayed by two-di- www.nasonline.org͞bioinformatics. uations) (6–9); GENSCAN and GENIE (gene mensional gel electrophoresis (31) and Abbreviations: AS, alternative splicing; i.i.d., independent discovery) (10, 11); SAPS (statistical analy- supplemented by mass spectrometry, anti- identically distributed; USS, uptake signal sequence. sis of protein sequences) (12); CLUSTAL body associations, and biochemical tests. †E-mail: [email protected]. and ITERALIGN (multiple sequence align- Codon usage analysis offers another way © 2005 by The National Academy of Sciences of the USA

www.pnas.org͞cgi͞doi͞10.1073͞pnas.0501804102 PNAS ͉ September 20, 2005 ͉ vol. 102 ͉ no. 38 ͉ 13355–13362 Downloaded by guest on September 24, 2021 and 48. Marc Gerstein reviewed data on interacting genes. Per Bork described gene CpG Dinucleotide Suppression the numbers and distribution of pseudo- control in metazoans and how it influ- The role of CpG dinucleotides, particu- genes (called fossil records) over several ences genome evolution. He proffered a larly in the context of CpG methylation, is complete eukaryotic genomes, especially gene neighborhood on predicting gene of considerable interest (e.g., see ref. 69). the human genome. Ribosomal protein function (43). reported In addition to causing increased mutation genes predominate among pseudogenes in how gene regulation and the arrangement rates, CpG methylation alters the shape of human sequences counting more than of protein pairs participate in protein– the major groove of DNA, leading to 2,000 cases (49, 50). In chromosomes 21 protein interactions in Escherichia coli and modified chromatin structure, and thus it and 22 (see ref. 51), the median length of Saccharomyces cerevisiae. She displayed is capable of altering patterns and rates of processed pseudogenes is approximately situations of chromosomal adjacency gene transcription (70). Recent studies in the same median length as that of single- among these genes (58). humans have shown that the nonpromoter exon (intronless) genes. The median also dealt with protein–protein interaction CpG islands are targets for de novo meth- length of single-exon genes is approxi- and the structural problem of docking ylation, playing a role in cancer and aging mately congruent to the median internal (59). David Eisenberg reviewed several (71). As a mechanism for modifying gene exon length times the average number of programs relevant to the study of protein expression levels and tissue-specific ex- exons per gene. These observations sup- interactions comparing multiple bacterial pression, it offers regulatory possibilities port the proposition that many single-exon genomes. The information can be applied that are exploited in genomic imprinting, genes derive from processed multiexon in structural genomics to determine pro- X chromosome inactivation, transposon genes transposed into the genome (51). tein partners and predict protein interac- inactivation, and developmental processes identified the most con- tions (60, 61). An interesting family of (71–73). It has been suggested that meth- served (called ultraconserved) intergenic proteins featured in Eisenberg’s talk are ylation is partly a defense against the up- regions across several higher eukaryotic the PGRS (polymorphic GC-rich repeti- take and integration of foreign DNA (74). genomes (human, chimpanzee, mouse, rat, tive sequences) of Mycobacterium tuber- CpG islands have also been suggested to dog, and chicken) and attributes this con- culosis. There are genes for Ϸ70 such be origins of replication (75). However, servation to purifying selection (52). Pavel proteins in the M. tuberculosis genome CpG suppression can be strong even in Pezner described synteny (genome rear- that have many glycine–glycine doublets, the absence of CpG methylation (76). rangements) and characterizations of have few charged, aliphatic, or aromatic A natural measure of dinucleotide bias breakpoints in mammalian genomes (53). residues, and are virtually devoid of cys- is the symmetrized dinucleotide relative Several Colloquium talks focused on teine residues. Their composition pre- abundance values ␳ ϭ f ͞f f the regulatory role of RNA selection cludes electrostatic, hydrophobic, or disul- XpY XpY X Y, where pressure and aspects of splicing. Chris fide-bridge interactions. M. tuberculosis fX, fY, and fXpY are the frequencies of the Burge used the specificity of splicing to pathogenicity islands contain putative bases X and Y and the dinucleotide XpY, identify exons in the proximity of exon- PGRS genes (62), and there is evidence respectively, calculated over a DNA strand concatenated with its inverted comple- splice enhancer sequences and exon- to suggest that the proteins encoded by ␳ splice silencer sequences (54, 55). Chris these genes are surface-exposed and can ment. can be computed for complete Lee pointed out that alternative splicing obstruct the host immune system. Some genomes or sequence ‘‘windows.’’ Statisti- cally, the dinucleotide XpY is said to be (AS) in humans is found at levels from experimental evidence suggests that sev- ␳ Յ 60% to 80%, according to tissue, devel- eral PGRS genes contribute virulence and underrepresented if XpY 0.78 and to be ␳ Ն opmental stage, and disease contingen- persistence of M. tuberculosis in macro- overrepresented if XpY 1.23 (18). Early cies. Some AS exons can be related to phage environments (63). biochemical experiments measuring near- protein domains. On human chromo- compared a hierarchy of est-neighbor frequencies established that some 22, at least 25% of gene structures programs useful for structural protein the set of dinucleotide biases is a remark- have 5Ј and 3Ј untranslated exons alignment (64). Helen Berman reviewed ably stable property of the DNA of an (UTEs) (51). These UTEs may play an the tools available for probing the Protein organism (77–79). From this perspective, important role in AS, as in the case of Data Bank (PDB) crystallographic data- the set of dinucleotide biases constitutes a G protein-coupled receptor proteins. base (65). Volker Brendel and Terry ‘‘genomic signature’’ that can discriminate These analyses were carried out on al- Gaasterland discussed the functional plant sequences from different organisms. The ternatively spliced exons in comparing genomics of Arabidopsis thaliana and dinucleotide biases appear to reflect spe- human–chimp, mouse–rat, and human– maize. The challenges of building data- cies-specific properties of the enzymes of mouse sequences. Amino acid mutation bases that associate genotype with pheno- DNA modification, replication, and repair. differences and synonymous mutation type were discussed by Russ Altman (66). In addition, the genomic signature is use- differences could be interpreted in My focus was on the subject of a charac- ful for detecting pathogenicity islands and terms of AS. Synonymous mutations are terization of the highly expressed genes in horizontal gene transfer between bacterial known to disrupt existing splicing signals archaeal genomes (35). genomes (62). as they introduce new splicing signals Context-dependent mutational pro- Human nuclear DNA and that of other leading to potential human diseases. Lee cesses (mutations that depend on flanking mammals is significantly CpG-suppressed, proposed that AS is adaptive and posi- nucleotide content) in mammalian species but generally nonvertebrate DNA does tively selected during evolution and hy- were considered by Phil Green (67) who not have this property. Animal mtDNA is pothesized the creation of new exons by emphasized the methylation–deamina- not methylated, apparently because meth- means of an AS mechanism (56). He tion–mutation scenario associated with the yltransferases cannot gain access to the emphasized RNA-level selection pres- low frequency of the CpG dinucleotide. mitochondrion, yet almost all metazoan sures distinguished from protein-level Green proposed that mutational asymme- mitochondria are CpG-suppressed (80). selection. Shawn Eddy discussed RNA try results from transcription-coupled re- CpG has the highest free energy (stacking genes and RNA-based regulatory cir- pair. He also discussed a generation time stability) of all dinucleotides, 16% higher cuits that control gene function (57). effect of mutations that occur in conjunc- than GpC (81). Accordingly, reduced Two sessions focused on proteomic in- tion with DNA replication (68). CpG mu- CpG helps local strand separation, with teractions, networks of molecular interac- tations have been reported to have a re- the potential for higher replication and tions, and chromosomal adjacency of the duced generation time effect (67). transcription rates, and easier access for

13356 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0501804102 Karlin Downloaded by guest on September 24, 2021 host factors. Differences between species (iv) Repeat Patterns. These occur at three that {Sm} entails a negative mean score in their DNA replication, repair, and tran- levels: (i) microsatellites and minisatellites, and one or more si is positive. A quan- scription machinery might favor different e.g., the triplet repeats in neurological tity fundamental to the asymptotic (n 3 dinucleotide abundances; for example, diseases; (ii) motifs, highly frequent occur- ϱ) distribution of Mn is the unique posi- through variations in processing efficiency rences of words of moderate length; and tive root ␪* of the equation E[exp(␪*X)] and accuracy depending on local DNA (iii) two or more occurrences of very long ϭ 1 that is structure (base-step configuration) and words that tolerate a few short mis- DNA curvature, DNA modifications, and matches and͞or indels (89–91). r ␪ ͸ *si ϭ generation of context-dependent mutation pie 1, rates. DNA topology is determined princi- (v) Molecular Evolution. This method com- iϭ1 pally by the base-step configuration, spe- prises prokaryotes compared to eu- cifically through dinucleotide stacking en- karyotes (36, 92), issues of horizontal where E signifies expectation. It has ergies and charge interactions (82). gene transfer (93, 94), phylogenetic re- been proven (24) for n large that Further, many protein–DNA interactions constructions (37, 38, 95–97), and the ln n involve DNA bending as well as sequence- existence of three domains of life (98, ͫ Ͼ ϩ ͬ Pr Mn ␪ x specific interactions. DNA repair enzymes 99). Conflicting results often accompany * [1] phylogenetic inference with respect to recognize lesions in DNA by their shape Ϸ Ϫ ͕Ϫ Ϫ␪*x͖ (83). Hence, the dinucleotide biases that RNA, DNA, or protein sequences across 1 exp K*e , characterize different genomes might re- diverse organisms (37). Causes contrib- sult from coevolution of genome dinucle- uting to these conflicts relate to ambigu- with accessible computation for K* and ␪* otides with genome maintenance pro- ities in identifying homologous charac- (6, 20). The formula in Eq. 1 can be used cesses (84–88). ters in alignments, sensitivity of tree- to establish benchmarks of statistical sig- making methods to unequal nificance. For example, we set the right Some Methods for evolutionary rates (e.g., fast evolution), side of Eq. 1 to some significance level, biases in species sampling, unrecognized (i) Score-Based Sequence and Structure for example, p* ϭ 0.01, and solve for x* paralogy, functional differentiation, and Analysis. The theory of score-based meth- ϭ x(p*). A maximal segment score ex- difficulties with the assumptions and ods has been developed in three contexts: ceeding (ln n͞␪*) ϩ x* is significant at the approximations used to infer phylogeny. (i) analysis of a single sequence to identify p* level. Attempts to surmount these difficulties distinctive sequence features (e.g., mem- The analysis also provides information by averaging over many proteins and brane tracts, proline concentrations, etc.), gene order considerations are problem- on the composition of high-scoring seg- often reflected in motifs of appropriate ments and related variables (6, 20, 23, 24, atic because of inherent biases of se- Ͼ ϭ high score (20, 23); (ii) comparisons lected families, lack of signal in others, 101). For each level y 0, let L(y) T(y) among multiple sequences to identify seg- Ϫ K(y) be the length extending from K(y) and lateral transfer events, fusion, ϩ ments having high similarity score (13, and͞or chimerism. Assessing reliability 1toT(y) as the first segment of aggre- 14); and (iii) patterns that can be investi- by using the bootstrap protocol is strewn gate score exceeding y (20). Let Um be a gated in protein structures (e.g., charge with obstacles because of lack of inde- sequence of vector random variables

clusters, cysteine knots, etc.) (21, 22). pendence and inhomogeneity in the mo- where Um is independent of Xk, k m. lecular data. Several recent methods Form (ii) r-scan Statistics. r-scan statistics pro- with limitations of their own include the T͑y͒ vide means for characterizing nonran- signature of Gupta (98), the genomic W͑y͒ ϭ ͸ U domness in the distribution of a specific signature introduced by Karlin and col- k K͑y͒ϩ1 marker array in DNA and amino acid leagues (87), discrimination by protein sequence data. r-scan methodology can domain Pfam content (100), and struc- so that W(y) cumulates Uk samples in a also be used to discern significant peaks tural folds (38, 46). high-scoring segment. Then W(y)͞L(y) 3 in the analysis of counts in sliding win- ␪*X1 u*asy 3 ϱ, u* ϭ E[U1e ]. Taking Xk dows. In particular, r-scan statistics char- Score-Based Methods ʦ A equal to 1 and 0 otherwise, then acterize clustering, overdispersion, or Probabilities of High-Scoring Segments. The W(y)͞L(y) is the fraction of samples in A excessive evenness in the distribution of simplest model is as follows (20). Let {X1, that lie in a high-scoring segment. Thus, specific markers in a sequence. By vary- X2,...,Xn} be independent identically over high-scoring segments, the relative ing r, sequence organization on different distributed (i.i.d.) letters drawn from a frequency of score si is approximately qi scales can be analyzed (15). finite alphabet {ai} with associated scores ϭ piexp{␪*si} (6, 20). It follows that {si} such that Prob{X ϭ si} ϭ pi, i ϭ 1, scores defined by si ϭ ln(qi͞pi) (a positive (iii) The Identity of Frequent (Abundant) and 2,...,r, pi Ͼ 0, ͚pi ϭ 1. Let multiplicative scaling of si will not change Rare (Avoided) Oligonucleotides and Pep- any of the theory or its applications) iden- tides, Including General Analysis of Compo- m ϭ ϭ ͸ ϭ tify high-scoring segments of target fre- sitional Extremes. For example, the Chi S0 0, Sm Xi, m 1, 2, . . . , quencies qi. These log ratio scores auto- sequence CGTGGTGG, abundant in the iϭ1 matically satisfy the assumptions of E. coli genome, is important in promoting negative mean score with some s positive. recombination events. Over- and under- be the cumulative score process. The i quantity representation measures in evaluating Examples of Natural Scoring Assignments. compositional biases in short oligonucleo- ϭ ͑ Ϫ ͒ Mn max Sl Sk tides include the examples of CpG dinu- 0ՅkՅlՅn (i) Amino acid scores emphasizing posi- cleotide suppression in vertebrates and the tive charge. For Lys (K) and Arg (R) CTAG tetranucleotide underrepresenta- corresponds to a segment of the se- set s ϭϩ2; for Asp (D) and Glu (E) n ϭϪ tion in many proteobacterial and archaeal quence {Sm}0 with maximal aggregate set s 2; and for other amino ac- genomes. score. The essential assumptions are ids set s ϭϪ1.

Karlin PNAS ͉ September 20, 2005 ͉ vol. 102 ͉ no. 38 ͉ 13357 Downloaded by guest on September 24, 2021 ⌬

M ϭ max ͭ ͸F͑X ϩ , Y ϩ ͒ͮ . n i l j l Յ Յ Ϫ⌬ 0 i, j n lϭ1 ⌬Ն0 Suppose the two sequences are indepen- dent: X1,...,Xn i.i.d. following the distri- bution law ␮(x) and Y1,...,Yn i.i.d. ␯(y). Of primary relevance is the case where the expected score per pair is negative and there is positive probability of attain- ing some positive pair score. So we as- sume E␮(x)ϫ␯(y)(F) Ͻ 0, (␮ ϫ ␯)(F Ͼ 0) Ͼ 0, in which case Mn 3 ϱ corresponds Fig. 1. High-scoring assignments to identify transmembrane protein segments. A, experimentally to a rare event. Also, Mn͞log n 3 ␥* established transmembrane tracts. where 0 Ͻ ␥* Ͻϱ. Determine ␪* Ͼ 0 to satisfy E{␪*F} ϭ 1. The conjugate measure is defined to be ␣*(x, y) ϭ ␪ (ii) Scoring applied to gene finding (10, ment score of the sequence {X1, X2,...,}. ␮(x)␯(y)e *F(x,y), which is the two- 18, 23). A 61 letter alphabet (codons) The analysis also applies with Markov dimensional analog of target frequencies. ␣ ϭ ϭ 1,...,61;q␣ observed fre- dependence among the {Ui} and sepa- The relative entropy for two probability ϭ ϭ quency in coding; p␣ observed fre- rately among the {Ti}. Define Ua,b(x) 1, measures ␮ and ␯ is the quantity quency of triplet nucleotides gener- a Յ x Յb; 0 otherwise. During high-scor- ϭ ͞ ͑␯͉␮͒ ϭ ͸␯͑ ͒ ͑␯͑ ͒͞␮͑ ͒͒ ally; and s␣ ln(q␣ p␣). ing intervals H ai log ai ai . (iii) Scores for hydrophobic proflies can i ͕ Յ Յ ͉ ͑ ͒ Յ Յ ͑ ͖͒ be based on the Kyte–Doolittle scale # a Xi b K y i T y or any of the many other scales for H͑␣*͉␮ ϫ ␯͒ ⅐ ͑ ͞ ͑ ͒͒ 3 ͓ ͑ ͒ ␪*X͔ measuring hydrophobicity (102). 1 L y E Ua,b X e Ͼ ͓ ͑␣* ͉␮͒ ͑␣*͉␯͔͒ iv 2max H X H Y , ( ) Scores derived from target frequen- as y 3 ϱ. cies. Letters in a high-scoring seg- Condition E ment have an intrinsic biased compo- In particular, during long waiting times sition such that letter ai occurs in among the first n customers, the fraction where ␣*x is the marginal x distribution of ϭ these segments with frequency qi of interarrival times between a and b is ␣ and ␣*y corresponds to the y variable. ␪*si pie . This result can be used as the ␪*X ϭ E[Ua,be ]. Condition E holds if F(x, y) F(y, x), i.e., basis for defining the scoring system. F is symmetric and ␮ Ϸ ␯ (i.e., ␮ and ␯ Suppose the overall letter frequencies Multiple High-Scoring Segments. Applica- are quite similar) (23–25). are {p1, p2,...,pr}. Let {q1, q2,..., tions of the scoring method often concern qr} be a set of target frequencies that the sum of the t highest segment scores Comparing Two Sequences (24). Let sij be 7 Ј corresponds to the composition in (23, 104). This assessment is relevant scores for the pairing ai aj that occur Ј representative segments of the type when there may be several distinct seg- with probabilities pipj. Assuming s ϭ ͚ Ј ␪ we wish to identify. The scores i ments within the sequence of a given type pipjs(i, j) is negative, * is the unique ln(q ͞p ), i ϭ 1,2,...,r are appropri- ͚ Ј ␪ i i (e.g., several transmembrane segments, positive solution of pipjexp( *s(i, j)) ϭ ate because in a high-scoring segment multiple charge clusters, etc.). For se- 1, and the sequence lengths N and M letter a occurs approximately with i quence comparisons, insertions or dele- grow roughly at similar rates, then the the frequency q ϭ p exp(␪*s ). i i i tions can break an alignment into several maximal matching segment score S sat- (v) High-scoring assignments that iden- isfies the probability law Pr{S Ͼ y} Ϸ 1 pieces. Denote the t highest-scoring seg- Ϫ␪*y tify transmembrane protein segments (1) Ϫ exp{ϪK*MNe }. ments of the model of Eq. 1 as Mn , (see Fig. 1). qi ϭ frequencies over A (2) (t) Mn ,...,Mn . It is convenient to deal segments (where A equals experi- (i) ϭ The Significant Segment Pair Alignment with the centered segment scores Vn mentally established transmembrane (i) Ϫ ͞␪ ϭ (SSPA) Protocol (14). A pairwise amino acid ϭ Mn (ln nK*)(1 *), i 1, 2, . . . , t. tracts) and pi overall letter fre- (i) t similarity matrix s(i, j) [e.g., BLOSUM, PAM quencies. The scores for high-scoring The limiting density of {Vn }iϭ1 is f(x1, ϭ ␪ t (105, 106)] is often used to assess amino ϭ ͞ ..., xt) ( *) segments with si log(qi pi) gener- Ϫ␪ Ϫ␪ ͑ ϩ ϩ ϩ ͒ acid matching. Given two sequences to be Ϫ *xt * x1 x2 ... xt ally identify transmembrane seg- exp{ e }e defined on aligned, the global similarity between the Յ Յ Յ ments. Combining experimental data the domain xt xtϪ1 ... x1. two protein sequences is scored as follows. also can be adapted to identify DNA- The number of high-scoring segments First, all high-scoring segment pairs Ͼ ͞␪ ϩ binding proteins, signal peptides, and of level (log n) * x closely adheres (HSSPs), significant at the 1% level, are amphipathic helices (20). to a Poisson distribution with parameter identified. Next, the HSSPs are combined Ϫ␪ K*exp{ *x}. into a consistent alignment, labeled SSPA. Connections to Queuing Models. The stan- The alignment score is the maximal value dard G͞G͞1, one server queue, involves Maximal Score for Sequence Matching (23, over all sets of sequence segments calcu- (i.i.d.) service times U1, U2, . . . and (i.i.d.) 24). In DNA and protein sequences, lated by summing HSSP segment scores interarrival times T1, T2, . . . of successive matching segment scores are of the form and then normalizing to allow compari- customers (103). A stable queue results if F(ai, ajЈ), where ai is the ith letter in the sons among proteins of different sizes and Ј E[Ui Ϫ Ti] Ͻ 0 (the queue line does not first sequence, aj is the jth letter in the quality (14). For the sequence pairing become infinite). Let Ui Ϫ Ti ϭ Xi. The second sequence, and F(x, y) is the score with at least one segment having a signifi- maximal waiting time among the first n for the letter pair (x, y). The maximal seg- cantly high score match, additional seg- customer is the same as the maximal seg- ment score allowing shifts is ments are identified by using a lower

13358 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0501804102 Karlin Downloaded by guest on September 24, 2021 threshold. The use of the reduced thresh- Regulatory proteins frequently interact n(1ϩ1/r). For a sequence of length L, when (r) Յ * old helps to fill in regions between the with DNA, RNA, or other proteins. Elec- m br L,anr-scan cluster is asserted. more significant HSSPs. The SSPA scores trostatic interactions that exert relatively Similarly, a significantly even spacing is (r) Ͼ * * can be used to deduce groupings of se- long-range, rapid, and localized effects indicated by m ar L, where ar is deter- quences. A group is deemed coherent if presumably mediate or facilitate processes mined from the first equation by setting the SSPA score within the group invariably such as protein sorting, translocation, the probability of m(r) to 0.99. For sam- exceeds the SSPA scores with sequences docking, orientation, and binding to DNA pling from a density f(x), not in the group and if the scores for all and to other protein molecules. It is im- members of the group are consistent. The portant to distinguish the occurrence of a xr 1 ␭ ϭ ͵ ͓ f͑␰͔͒rϩ1d␰. ITERALIGN multiple alignment method charge cluster from a preponderant net r! uses a symmetric-iterated protocol that charge over the whole protein. For exam- 0 combines a motif-finding procedure and a ple, the eukaryotic histones have a sub- local dynamic programming procedure. stantial net positive charge of at least Distribution of the Tetranucleotide CTAG The use of each sequence as a template to 15%, but the charge distribution is with- Sites in Human Herpesviruses and Pro- which all sequences are aligned distin- out clusters. teobacterial Genomes. Human cytomegalo- guishes ITERALIGN from methods of pro- virus (229 kbp) contains 341 CTAG sites gressive pairwise alignments. Consensus Nonrandomness in a Marker Array Along (frequency Ϸ 0.0015). A 10-scan cluster at sequences are generated from each of the a Sequence position 91832 stretching 1,046 bp gives alignments, and the procedure is iterated Particular markers (e.g., specific DNA the significant level (0.01), which overlaps leading to enhanced discrimination of restriction sites, nucleosome placements, the orilyt region of human cytomegalovi- conserved blocks of alignment and vari- gene locations) are distributed along chro- rus (108). The Epstein–Barr virus (172 able-length unaligned insertions. Each of mosomes. Let X be the gap (in DNA kbp) contains 342 CTAG sites (fre- i Ϸ the aligned blocks can be independently units) between the ith and the (i ϩ 1)th quency 0.002). A significant 5-scan clus- ͞ studied as a potential functional structural markers. General issues of sequence het- ter occurs at position 53082 stretching 255 unit. erogeneity lead to statistical consider- bp, which again overlaps the orilyt region of Epstein–Barr virus. The human herpes- ations of the r-scan process{Ri ϭ Charge Clusters in Protein Sequences. A ͚iϩrϪ1 viruses herpes simplex virus 1 and varicella- jϭi Xj}, the array of distances between charge cluster indicates a nonrandom dis- the ith and the (i ϩ r)th markers, i ϭ 1, 2, zoster virus show no significant 1-, 3-, 5-, tribution of the charged residues in a pro- 3, . . . , where r is an integer parameter. It and 10-cluster scans; no significant 1-scan tein sequence, producing high local con- is of interest to characterize r-scan lengths clusters occurred in all genomes. centrations of certain charge types. In harboring clusters or overdispersions of The frequency of CTAG is significantly eukaryotes, charge clusters are often asso- the markers along the sequence. n ϭ low in many proteobacterial and archaeal ciated with transcriptional activation, de- genomes. Some possible contributing fac- number of markers. We form the order velopmental control, and regulation of * Յ * Յ Յ * tors may be as follows. The DCM methyl- statistics R1 R2 ... RnϪrϩ1. Let membrane receptor activity, and they are (r) * (r) * ase (short-patch DNA repair system in m ϭ R (kth smallest), M ϭ R Ϫ Ϫ ϩ generally lacking in cytoplasmic enzymes k k k n r k 2 E. coli) targets the second C of the pen- (kth largest), say k ϭ 1, 2, 3; m(r) too and housekeeping proteins. Clusters of k tanucleotide CCAGG, which can then small indicates clustering; m(r) too large opposite charge in different proteins may k mutate by deamination to CTAGG. Gen- and͞or M(r) too small indicates significant mediate the formation of multiprotein k erally, the repair system corrects T͞G mis- evenness; and M(r) too large indicates complexes. Charge clusters of like sign k matches back to C͞G. However, if the may help to keep certain protein assem- overdispersion. The distribution of a repair system lacked perfect specificity marker is evaluated by comparing the dis- blages apart. Charge clusters within one * and sometimes corrected legitimate protein could contribute to intramolecular tribution of {Ri } calculated for a random CTAG tetranucleotides, this configuration folding or cooperative protein–protein and sequence to those actually observed. Let might operate to some extent in limiting protein–nucleic acid interactions. the minimum and maximum r-scans be CTAG representations. The perfect dyad (r) ϭ * (r) ϭ * The percentage of proteins with signifi- m mini Ri and M maxi Ri ,re- symmetry ACTAGTT AACTAGTisthe cant charge clusters averages Ϸ20% to spectively. The theoretical probabilities for consensus binding site for the E. coli trpR- 25% in most eukaryotic species but only the extremal r-scan statistics of a marker encoded repressor, and this regulatory Ϸ7% in E. coli. Proteins with multiple array of n points distributed randomly activity might impose sufficient rarity of charge clusters are uncommon, Ϸ3.5% in (uniformly), are CTAG. There is some evidence from the human, mouse, fly, and yeast, and ex- crystal structure of the trp-repressor͞oper- x xr tremely rare in E. coli (Ͻ 0.1%) (107). ͩ ͑r͒ Ͼ ͪ Ϸ Ϫ␭ ␭ ϭ ator complex that the two CTAG tet- Pr m 1ϩ1/r exp{ }, , Most developmental control proteins in n r! ranucleotides ‘‘kink’’ when bound by trpR, Drosophila (e.g., Antp, Bicoid, Ftz) carry which may, under conditions of supercoil- 1 at least one charge cluster, commonly Prͩ M ͑r͒ Ͻ ͑ln n ϩ ͑r Ϫ 1͒ln ln n ϩ x͒ͪ ing, be structurally deleterious (109). It proximal or overlapping a homeodomain n has been demonstrated experimentally or a zinc-finger region. The homeodomain that CTAG promotes crosslinking be- Ϫx is generally associated with a mixed- e tween complementary DNA strands by Ϸ exp͕Ϫ␮͖, ␮ ϭ charge cluster that mediates DNA bind- ͑r Ϫ 1͒! UV irradiation at a much greater rate ing, often in a cooperative dimeric confor- than any other tetranucleotides (110). Be- mation. The homeoproteins Cut, These equations allow criteria whether cause crosslinking generally entails delete- Deformed, Engrailed, and Paired carry minimum and maximum observed spac- rious effects, avoidance of CTAG motifs multiple charge clusters. The nervous sys- ings deviate significantly from random could be a natural consequence. tem embryogenetic protein Cut (2175 resi- expectations. For example, setting the CTAG is missing in the left half of the dues long) contains five separated charge probability of the minimum to a required ␭-genome in a segment of 24,743 bp, and clusters, but a homeodomain ascribed to significance (typically 0.01) yields the con- occurrences concentrate in three clusters Ϫ r ͞ ϭ positions 1743–1803 does not qualify as a dition exp( xb r!) 0.01, which is solved in the right half. Eight are located in non- * ϭ ͞ charge clusters by our statistical tests. for xb, yielding the threshold br xb coding regions or at stop codons, four in

Karlin PNAS ͉ September 20, 2005 ͉ vol. 102 ͉ no. 38 ͉ 13359 Downloaded by guest on September 24, 2021 ORFs of undetermined expression, one in number of its occurrences N(L)inase- Frequent words often include repetitive, the CI gene near the carboxyl end, and quence of length L and compare this structural, regulatory, and transposable one in gene S (affecting cell lysis). Thus, count with the expected count, postulating elements [e.g., uptake signal sequences the distribution of CTAG sites in ␭ is rare independently or Markov-generated se- (USSs; see below) in Haemophilus influen- and nonrandom. In E. coli, CTAG occurs quences. Let ␮ be the mean and ␴2 the zae], Chi sites in association with the relatively more frequently in the rRNA variance of the length between successive RecBCD recombination complex, and operon than elsewhere. The low fre- occurrences of the target word. For the REP elements (repeated extragenic palin- quency of CTAG persists in proteobacte- independence model, the quantity c(L) ϭ drome) of unknown function, the latter rial genomes entailing clustering in the ␮3/2(N(L) Ϫ L͞␮)͞(␴͌L) follows approx- two in E. coli. In proteins, frequent oli- 16S and 23S ribosomal RNA genes. Is it imately the standard normal distribution gopeptides often reflect characteristic mo- possible that CTAG sites are nucleation for large L. The tails of the normal distri- tifs shared in certain protein functional or anchor points in the assembly of the bution can be used as thresholds for rare families, e.g., the sequence environment of ribosomal complex? and frequent levels. However, this method the catalytic triad of serine proteases. is difficult to implement, especially the Cluster of DAM Sites in the ori-C Region of computation of ␴ for each word (114). Frequent Oligonucleotides and Peptides of E. coli. DAM sites are important regula- Some problems concerning long repeats the H. influenzae Genome. Two major tory signals composed of the tetranucle- in a random letter sequence are idealized classes of frequent oligonucleotides in otide GATC. These sequences serve in into a ball-in-urn model (urns correspond the H. influenzae genome stand out: (i) part to distinguish the template strand to all DNA words of a given size, and oligos related to the USSs AAGT- (fully methylated) from the newly synthe- balls refer to the observed words in a GCGGT (USSϩ737 occurrences) and its sized strand (unmethylated) during semi- given sequence). Limit theorems for sev- inverted complement (USSϪ 734) of al- conservative replication and repair. These eral generalized occupancy problems are most equal counts and (ii) multiple tet- sites also are associated with genes in- germane. A sequence of n indistinguish- ranucleotide iterations. The USSϩ and volved in the SOS response, transposon able balls are allocated independently USSϪ as established by r-scan statistics function, and bacteriophage infection. In equally likely into an array of m urns. Two are remarkably evenly spaced around the 245-bp sequence that defines the mini- Poisson limit laws refer to the variable Nr, the genome and appear predominantly mal ori-C region of E. coli, there are 8 the number of urns containing r balls: in the same coding frame. The above DAM sites. In a 350-bp stretch flanking with n, m 3 ϱ, Nr has a Poisson limit law findings suggest that USSs contribute to the ori-C region, there are an additional with parameter c͞r!ifn͞m(rϪ1)/r 3 c Ͼ 0; global genomic functions. 12 DAM sites. Do the 8 DAM methyl- and if n ϭ m(lnm) ϩ rm(lnlnm) ϩ mx ϩ A major hypothesis concerning H. influ- ation sites observed in a stretch of 245 bp o(m), then Nr has a Poisson limit law with enzae (and some other bacterial organ- that includes the E. coli origin of replica- parameter exp(Ϫx͞r!). Limiting distribu- isms) is that natural genetic competence tion or that are joined with the additional tions for the waiting times until some urn (transformation) evolved and is main- 12 DAM sites located in the flanking 350 first acquires r balls ensue through their tained for the task of acquiring templates bp or both represent a statistically signifi- duality relations with the occupancy prob- mediating repair of DNA lesions. One cant cluster? We apply the formula first in lems. A compendium of results of this possibility is that the uptake of DNA fol- the case of r ϭ 7, where n is the number kind can be found in ref. 17. lowed by the production of single- of DAM sites over the E. coli genome. In As possible words (urns) and L (actually stranded tails could induce higher levels the E. coli genome, the GATC frequency L Ϫ s ϩ 1) samples then, as L 3 ϱ, of RecA enzyme activities and concomi- is 0.0044. There are 20,680 DAM sites As 3 ϱ and Pr{urn 1 contains Ն r tantly increase the extent of DNA repair. throughout the genome. Now, exp{Ϫx7͞ balls} Ϸ [L͞As]r (1͞r!) for moderate r and In fact, single-stranded DNA is known to 7!} ϭ 96 bp when x ϭ 1.75. Eight DAM As ϾϾ L. The expected number of such be an inducing signal of SOS repair and sites in a stretch of 245 bp somewhere in frequent words is approximately (L͞As)rϪ1 RecA polymerization in binding single- the E. coli genome would occur with (L͞r!) ϭ ␭, and provided ␭ is small, the stranded DNA of E. coli. Other possible probability Ϸ0.06. Thus, the concentration number of frequent words is approxi- roles of natural genetic competence have of DAM sites in the ori-C region of E. mately Poisson-distributed with parameter been attributed to benefits for horizontal coli is not statistically significant. How- ␭ (115, 116). Guided by these formulas, gene transfer for the repair of damaged ever, the calculation with r ϭ 19 implies for a given sequence of length L,wede- chromosomes that are rescued by recom- that a segment of 1,068 bp in length or termine the word size s to satisfy s Ϫ 1 Յ bination with exogenous homologous less containing 20 DAM sites presents a (logL)͞(logA) Ͻ s, and then we determine DNA, for conversion of mutant alleles to statistically significant cluster. the copy number r to satisfy (r Ϫ 1)͞r Ͻ functional alleles, or simply as a good nu- Applications of scan-statistics analysis (logL͞logAs) Յ (r͞(r ϩ 1)). Accordingly, trient source. The uptake mechanisms are for a fixed-length sliding window pertain few s-words are expected to occur at least largely unknown. Another bacterium with to phenomena such as clusters of disease r times in the sequence and can be consid- corresponding directed uptake is Neisseria in time, generalized birthday proximities, ered ‘‘frequent words.’’ gonorrhoeae. Generally, a small percent- and rth nearest-neighbor problems. Early Rare words are characterized as fol- age (Ϸ10%) of a population of Bacillus work on scan statistics focused mainly on lows. Set the word size s obeying the ine- subtilis can become competent for uptake exact formulae (111), exploiting calcula- qualities As log As Յ L Յ Asϩ1 log Asϩ1 of nonspecific DNA sequences. In B. sub- tions of coincidence probabilities in diffu- and then determine r satisfying As(log As tilis and Streptococcus pneumoniae, compe- sion stochastic processes. The asymptotic ϩ rlog log as) Ͻ L Յ As(log As ϩ (r ϩ tence is genetic competence regulated by results reported here are based on the 1)log log As). Following these prescrip- cell density, cell–cell signaling, and nutri- powerful Chen–Stein method of Poisson tions, s-words occurring at most r times tional signaling dependence on growth approximations (112, 113). are deemed ‘‘rare words.’’ conditions. Although DNA uptake is For DNA, rare words might be binding widespread in bacterial cells, nonspecific Frequent Words (Oligonucleotides sites for transcription control factors re- integration into the chromosome seems to and Peptides) stricted to specific locations. Alternatively, be rare. A classical approach for deciding whether rare words may be discriminated against Many bacteria can develop the state of a given word is frequent is to count the because of structural incompatibilities. physiological competence for natural

13360 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0501804102 Karlin Downloaded by guest on September 24, 2021 DNA uptake that is consistent with a bac- In frame 1, (AGTC) 32 is part of a 194-aa highly conserved. Peptide words FQNRR terial gene transfer of free DNA (117). H. ORF. In frame 2, the gene encodes Mod and HFNRY are frequent peptides in influenzae (and N. gonorrhoeae) can only (629 aa), similar to a type III restriction- homeobox proteins. bind and take up double-stranded and modification enzyme of E. coli. Frame 3 is single-stranded DNA from the same or ‘‘null,’’ flanked by multiple termination Rare Words in Human Herpesvirus Genomes. closely related species. This finding is dif- codons. The sequence exhibits a rare ex- Consider the genome of the major hu- ferent from B. subtilis, where the DNA ample of two genes encoded in the same man herpesviruses herpes simplex virus uptake tends to be nonspecific and most orientation in different reading frames 1, varicella-zoster virus, cytomegalovi- cells are not competent, whereas natural overlapping Ͼ40 aa. Is it possible that rus, and Epstein–Barr virus. The DNA genetic competence in H. influenzae and their movement around the genome is totals 678,780 bp. The criterion for rare Azotobacter vinelandii can be attained by channeled through transposon activity? words is size s ϭ 7 and at most r ϭ 12 almost 100% of cells. As noted previously, Variation in the number of AGTC itera- copies. For these characteristics, 728 the degree of bacterial cell competence tions is a strategy that can alter the trans- rare words qualify from a totality of 47 seems to be correlated with the presence lational frame and͞or intensity of DNA Ϸ 16,000. All 7-words occurred at least of highly frequent words. supercoiling in regulation of gene expres- once, and the bottom 11 of least occur- The palindrome GGCGATCGCC la- sion and provide a repertoire of genetic rences are TCTAGTA (1 occurrence), beled HIP1 (highly iterated palindrome) is polymorphism. Mechanisms capable of ACTAGGC (3 occurrences), CTA- highly frequent in Synechocystis (118) and generating such nonstandard random vari- ACTC (3 occurrences), TCTAGTC (3 in most cyanobacteria. The r-scan analysis ation putatively provide a solution to the occurrences), AAGTTAG (4 occur- shows a significantly even distribution problem of enabling rapid and reversible rences), ACTTAGG (4 occurrences), where the observed minimal spacing be- response to environmental changes that ATCACTC (4 occurrences), CTTAGCT tween any two successive occurrences is are frequently encountered in the bacte- (4 occurrences), GACCTAA (4 occur- 52 bp. Thus, the even spacing of the HIP1 rial habitat. rences), GGACTAG (4 occurrences), in Synechocystis is considerably more dra- Frequent peptides are related to the ͞ and TACTAAG (4 occurrences). It is matic than the even spacing of the USSs Walker A box motif GXXGXGK(S notable that most of these words con- in H. influenzae. Synechocystis, like H. in- T)TL. Frequent peptides related to the tain stop codons and the tetranucleotide fluenzae, is known to be transformable. motif LLDEPTN are associated with the CTAG. Whether the HIP1 sequences serve as ATP hydrolysis B site generally located recognition sites in this capacity is un- 40–70 residues downstream of the A-site Concluding Remarks known. The significance of its palindromic (90) in the form ␾␾␾␾D, representing In my view, the role of statistics in se- character also is intriguing. four successive unspecified aliphatic resi- quence analysis is primarily exploratory Mechanisms allowing changes in the dues culminating with the essential aspar- and interactive with the data, generating frequency of gene expression include tate residue. As identified in x-ray crystal new questions and lines of experimental introduction of frameshifts that affect structures, the Walker A and B boxes transcription and͞or translation. Moxon mostly contribute to the ATP-binding investigation. Rather than fitting models et al. (119), for a number of pathogenic pocket of ATP-dependent transport pro- to biomolecular sequences with the pur- bacterial populations, highlight non- teins. A third motif approximately of the pose of statistical hypothesis testing, the standard mutation mechanisms that oc- form LSGGQ(Q͞R)Q Ϸ20 aa upstream analysis of the extreme tails of distribu- cur at special loci, and they explicitly from the B site was identified in ref. 90. tions derived from random sequences discuss the case of the repeat tract, There is considerable agreement of ATP can provide benchmarks for the selec- tion of sequences, part of sequences, or (TCAA)16, present in H. influenzae at and GTP DNA-binding motifs in prokary- the 5Ј end of the lic2 gene. The H. influ- otic and eukaryotic species. sequence features to concentrate on for enzae genome contains 11 impressive The frequent motifs HVDHG, further study. Essential in this approach microsatellites in the form of tandem VDHGK, and DHGKT, which combine is the use of a mixture of different sta- tetranucleotide repeats each extending into HVDHGKT, are notable because tistics and interaction with the data at least 15 iterations. Generally, in tan- they occur in the elongation factor Tu, in and the experimenter. Here robustness dem repeats (in the coding region tufB-B,inselB (translation factor), in infB includes sensitivity of the statistics to and͞or in gene regulatory regions) poly- (initiation factor 2), and in other transla- outliers due to sampling biases, concor- merase slippage, homologous recombi- tion factors. dance among several different measures nation, or mismatch repair occurring Most of the frequent words in higher that examine the data in different ways, during chromosomal replication can eukaryotes have been characteristic of and consistency among independently generate a heterogeneous population of zinc fingers, chymotrypsin proteases, sampled data sets. There are also many cells that can facilitate infection or can serine͞threonine and tyrosine kinases, Ig challenging problems related to classifi- counter host defense mechanisms. Other heavy and light chains, and homoebox cation of protein and DNA sequences examples of variable gene expression proteins among others. The active sites in with reference to function, structure, putatively controlled by repeat sequence these classes of proteins generally have at subcellular localization, expression, phy- tracts occur in Bordetella pertussis, Neis- least one frequent word associated with logenetic relations, and other biological seria meningitides, and N. gonorrhoeae. them. These include explicitly CGKAF, and medical criteria. Statistical stratifi- The example of (AGTC)32 in H. influ- CEECG in zing fingers, LTAAH, GDS- cation of the databases can also aid enzae that offers at least two alternatively GGP in chymotrypsin protease proteins, these tasks as more sequences and encoded genes is particularly interesting. and ADFGL, FGQGT in kinases are genomes become available.

1. Pace, N. R. (1997) Science 276, 734–740. (2003) Environ. Microbiol. 5, 212–216. Natl. Acad. Sci. USA 101, 6176–6181. 2. Venter, J. C., Remington, K., Heidelberg, J. F., 4. Beja, O., Aravind, L., Koonin, E. V., Suzuki, M. T., 6. Karlin, S. & Altschul, S. F. (1990) Proc. Natl. Acad. Halpern, A. L., Rusch, D., Eisen, J. A., Wu, D., Hadd, A., Nguyen, L. P., Jovanovich, S. B., Gates, Sci. USA 87, 2264–2268. Paulsen, I., Nelson, K. E., Nelson, W., et al. (2004) C. M., Feldman, R. A., Spudich, J. L., et al. (2000) 7. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Science 304, 66–74. Science 289, 1902–1906. Lipman, D. J. (1990) J. Mol. Biol. 215, 403–410. 3. Zeidner, G., Preston, C. M., Delong, E. F., Mas- 5. Lepp, P. W., Brinig, M. M., Ouverney, C. C., Palm, 8. Altschul, S. F., Madden, T. L., Schaffer, A. A., sana, R., Post, A. F., Scanlan, D. J. & Beja, O. K., Armitage, G. C. & Relman, D. A. (2004) Proc. Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic

Karlin PNAS ͉ September 20, 2005 ͉ vol. 102 ͉ no. 38 ͉ 13361 Downloaded by guest on September 24, 2021 Acids Res. 25, 3389–3402. 47. Miklos, G. L. & Maleszka, R. (2004) Nat. Biotechnol. 85. Karlin, S. & Burge, C. (1995) Trends Genet. 11, 9. Ewens, W. S. & Grant, G. R. (2001) Statistical 22, 615–621. 283–290. Methods in Bioinformatics: An Introduction 48. Yeung, K. Y., Medvedovic, M. & Bumgarner, R. E. 86. Gentles, A. J. & Karlin, S. (2001) Genome Res. 11, (Springer, New York). (2004) Genome Biol. 5, R48. 540–546. 10. Burge, C. & Karlin S. (1997) J. Mol. Biol. 268, 78–94. 49. Zhang, Z., Harrison, P. M., Liu, Y. P. & Gerstein, M. 87. Karlin, S. (1998) Curr. Opin. Microbiol. 1, 598–610. 11. Kulp, D., Haussler, D., Reese, M. G. & Eckman, (2003) Genome Res. 13, 2541–2558. 88. Krieg, A. M., Wu, T. Weeratna, R., Efler, S. M., F. H. (1996) Proc. Int. Conf. Intell. Syst. Mol. Biol. 4, 50. Zhang, Z., Carriero, N. & Gerstein, M. (2004) Trends Love-Homan, L., Yang, L., Yi, A. K., Short, D. & 134–142. Genet. 20, 62–67. Davis, H. L. (1998) Proc. Natl. Acad. Sci. USA 95, 12. Brendel, V., Bucher, P., Nourbakhsh, I. R., Blaisdell, 51. Chen C., Gentles, A. J., Jurka, J. & Karlin, S. (2002) 12631–12636. B. E. & Karlin, S. (1992) Proc. Natl. Acad. Sci. USA Proc. Natl. Acad. Sci. USA 99, 2930–2935. 89. Krawiec, S. & Riley, M. (1990) Microb. Rev. 54, 89, 2002–2006. 52. Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., 502–539. Kent, W. J., Mattick, J. S. & Haussler, D. (2004) 13. Higgins, D., Thompson, J., Gibson, T., Thompson, 90. Blaisdell, B. E., Rudd, K. E., Matin, A. & Karlin, S. Science 304, 1321–1325. J. D., Higgins, D. G. & Gibson, T. J. (1994) Nucleic (1993) J. Mol. Biol. 229, 833–848. 53. Bourque, G., Pevzner, P. A. & Tesler, G. (2004) Acids Res. 22, 4673–4680. 91. Leung, M.-Y., Blaisdell, B. E., Burge, E. & Karlin, S. Genome Res. 14, 507–516. 14. Brocchieri, L. & Karlin, S. (1998) J. Mol. Biol. 276, (1991) J. Mol. Biol. 221, 1367–1378. 249–264. 54. Fairbrother, W. G., Holste, D., Burge, C. B. & Sharp, P. A. (2004) PLoS Biol. 2, E268. 92. Woese, C. R., Kandler, O. & Wheelis, M. L. (1990) 15. Dembo, A. & Karlin, S. (1992) Ann. Appl. Prob. 2, Proc. Natl. Acad. Sci. USA 87, 4576–4579. 329–357. 55. Yeo, G., Hoon, S., Venkatesh, B. & Burge, C. B. 93. Campbell, A. (2000) Theor. Popul. Biol. 57, 71–77. 16. Karlin, S., Mrazek, J. & Campbell, A. M. (1996) (2004) Proc. Natl. Acad. Sci. USA 101, 15700–15705. 94. Ochman, H., Lawrence, J. G. & Groisman, E. A. Nucleic Acids Res. 24, 4263–4272. 56. Xing, Y. & Lee, C. (2005) Proc. Natl. Acad. Sci. USA (2000) Nature 405, 299–304. 17. Johnson, N. L. & Kotz, S. (1997) Urn Models and 102, 13526–13531. 95. Felsenstein, L. (1989) PHYLIP: Phylogeny Package, Their Applications (Wiley, New York). 57. Eddy, S. R. (2001) Nat. Rev. Genet. 2, 919–929. Version 3.2, Cladistics 5, 164–166. 18. Karlin, S. & Cardon, L. (1994) Annu. Rev. Microbiol. 58. Yeger-Lotem, E., Sattath, S., Kashtan, N., Itzkovitz, 96. Swofford, D. L. (1993) PAUP: Phylogenetic Analysis 48, 619–654. S., Milo, R., Pinter, R. Y., Alon, U. & Margalit, H. 19. Park, Y. & Spouge, J. L. (2002) Bioinformatics 18, (2004) Proc. Natl. Acad. Sci. USA 101, 5934–5939. Using Parsimony (Illinois Natural History Survey, 1236–1242. 59. Wodak, S. J. & Mendez, R. (2004) Curr. Opin. Struct. Champaign, IL), Version 3.1. 20. Karlin, S. & Brendel, V. (1992) Science 257, 39–49. Biol. 14, 242–249. 97. Kumar, S., Tamura K. & Nei, M. (2004) MEGA3: 21. Karlin, S. & Zhu, Z. Y. (1996) Proc. Natl. Acad. Sci. 60. Strong, M., Graeber, T. G., Beeby, M., Pellegrini, M., Integrated Software for Molecular Evolutionary USA 93, 8344–8349. Thompson, M. J., Yeates, T. O. & Eisenberg, D. Genetic Analysis and Sequence Alignment, Brief. 22. Zhu, Z. Y. & Karlin, S. (1996) Proc. Natl. Acad. Sci. (2003) Nucleic Acids Res. 31, 7099–7109. Bioinform. 5, 150–163. USA 93, 8350–8355. 61. Salwinski, L. & Eisenberg, D. (2004) Nat. Biotechnol. 98. Gupta, R. S. (1998) Microbiol. Rev. 62, 1435–1491. 22, 1017–1019. 23. Karlin, S. (1994) Philos. Trans. R. Soc. London B 344, 99. Karlin, S., Brocchieri, L., Trent, J., Blaisdell, B. E. & 62. Karlin, S. (2001) Trends Microbiol. 9, 335–343. 391–402. Mrazek, J. (2002) Theor. Popul. Biol. 61, 367–390. 63. Ramakrishnan, L., Federspiel, N. A. & Falkow, S. 24. Dembo, A., Karlin, S. & Zeitouni, O. (1994) Ann. 100. Bateman, A., Birney, E., Cerruti, L., Durbin, R., (2000) Science 288, 1436–1439. Prob. 22, 1993–2021. Etwiller, L., Eddy, S. R., Griffiths-Jones, S., Howe, 64. Kolodny, R., Koehl, P. & Levitt, M. (2005) J. Mol. 25. Dembo, A. & Zeitouni, O. (1998) Large Deviations K. L., Marshall, M. & Sonnhammer, E. L. (2002) Biol. 346, 1173–1188. Techniques and Applications (Springer, New York), Nucleic Acids Res. 30, 276–280. 65. Bourne, P. E., Westbrook, J. & Berman, H. M. (2004) 2nd Ed. 101. Karlin, S. & Dembo, A. (1992) Adv. Appl. Prob. 24, Brief Bioinform. 5, 23–30. 26. Iglehart, D. (1972) Ann. Math. Stat. 43, 627–635. 113–140. 66. Klein, T. E. & Altman, R. B. (2004) Pharmaco- 27. Assmussen, S. (1982) Adv. Appl. Prob. 14, 143–170. 102. Von Heijne, G. (1987) Sequence Analysis in Molec- genomics J. 4, 1. 28. Futcher, B., Latter, G. I., Monardo, P., Mclaughlin, ular Biology: Treasure Trove or Trivial Pursuit (Aca- 67. Hwang, D. G. & Green, P. (2004) Proc. Natl. Acad. C. S. & Garrels, J. I. (1999) Mol. Cell. Biol. 19, Sci. USA 101, 13994–14001. demic, New York). 7357–7368. 68. Meunier, J. & Duret, L. (2004) Mol. Biol. Evol. 21, 103. Feller, W. (1968) An Introduction to Probability The- 29. Yates, J. R., III. (2000) Trends Genet. 16, 5–8. 984–990. ory and Its Applications (Wiley, New York), Vol. II. 30. Gygi, S. P., Rochon, Y., Franza, B. R. & Aebersold, 69. Gentles, A. J. & Karlin, S. (2004) Recent Develop- 104. Karlin, S. & Altschul, S. F. (1993) Proc. Natl. Acad. R. (1999) Mol. Cell. Biol. 19, 1720–1730. ments in Nucleic Acids Research (Transworld Re- Sci. USA 90, 5873–5877. 31. Van Bogelen, R. A., Schiller, E. E., Thomas, J. D. & search Network, Kerala, India), Vol. 1, Part I, pp. 105. Henikoff, S. & Henikoff, J. G. (1992) Protein Sci. 6, Neidhardt, F. C. (1999) Electrophoresis 20, 2149– 35–51. 698–705. 2159. 70. Takai, D. & Jones, P. A. (2002) Proc. Natl. Acad. Sci. 106. Dayhoff, M. O. (1978) Atlas of Protein Sequence and 32. Sharp, P. M. & Matassi, G. (1994) Curr. Opin. Genet. USA 99, 3740–3745. Structure (National Biomedical Research Founda- Dev. 4, 851–860. 71. Nguyen, C., Liang, G. M., Nguyen, T. T., Tsao-Wei, tion, Washington, DC), Vol. 5. 33. Karlin, S. & Mrazek, J. (2000) J. Bacteriol. 182, D., Groshen, S., Lubbert, M., Zhou, J. H., Benedict, 107. Karlin S. (1995) Curr. Opin. Struct. Biol. 5, 360–371. 5238–5250. W. F. & Jones, P. A. (2001) J. Natl. Cancer Inst. 93, 108. Masse, M. J. O., Karlin, S., Schachtel, G & Mocarski, 34. Karlin, S. Mrazek, J., Campbell, A. & Kaiser, D. 1465–1472. E. S. (1992) Proc. Natl. Acad. Sci. USA 89, 5246– (2001) J. Bacteriol. 183, 5025–5040. 72. Avner, P. & Heard, E. (2001) Nat. Rev. Genet. 2, 5250. 35. Karlin, S., Mrazek, J., Ma, J., Brocchieri, L. (2005) 59–67. 109. Gunsalus, R. P. J. & Yanovsky, C. (1980) Proc. Natl. Proc. Natl. Acad. Sci. USA 102, 7303–7308. 73. Bird, A. (2002) Genes Dev. 16, 6–21. Acad. Sci. USA 77, 7117–7121. 36. Karlin, S., Brocchieri, L., Campbell, A. M., Cyert, M., 74. Doerfler, W. (1991) Biol. Chem. 372, 557–564. 110. Burge, C., Campbell, A. M. & Karlin, S. (1992) Proc. Mrazek, J. (2005) Proc. Natl. Acad. Sci. USA 102, 75. Antequera, F. & Bird, A. (1999) Curr. Biol. 9, Natl. Acad. Sci. USA 89, 1358–1362. 7309–7314. 661–667. 111. Naus, J. I. (1979) Int. Stat. Rev. 47, 47–78. 37. Brocchieri, L. (2001) Theor. Popul. Biol. 59, 27–40. 76. Kress, C., Thomassin, H. & Grange, T. (2001) FEBS 112. Arratia, R., Goldstein, L. & Gordon, L. (1989) Ann. 38. Yang, S., Doolittle, R. F. & Bourne, P. E. (2005) Lett. 494, 135–140. Prob. 17, 9–25. Proc. Natl. Acad. Sci. USA 102, 373–378. 77. Josse, J., Kaiser, A. D. & Kornberg, A. (1961) J. Biol. 113. Barbour, A. D., Holst, L. & Janson, S. (1992) Poisson 39. Felsenstein, J. (2003) Inferring Phylogenies (Sinauer, Chem. 236, 864–875. Approximations (Oxford Univ. Press, Oxford). Sunderland, MA). 78. Russel, G. J., Walker, P. M. B., Elton, R. A. & 114. Kleffe, J. & Borodovsky, M. (1992) Comput. Appl. 40. Modrek, B. & Lee, C. (2002) Nat. Genet. 30, 13–19. Subak-Sharpe, J. H. (1976) J. Mol. Biol. 108, 1–28. Biosci. 8, 433–441. 41. Modrek, B. & Lee, C. J. (2003) Nat. Genet. 34, 79. Russel, G. J. & Subak-Sharpe, J. H. (1977) Nature 4, 177–180. 266, 533–535. 115. Karlin, S. & Leung, M.-Y. (1991) Ann. Appl. Prob. 42. Sorek, R. & Ast, G. (2003) Genome Res. 13, 1631– 80. Groot, G. S. & Kroon, A. M. (1979) Biochim. Bio- 513–538. 1637. phys. Acta 564, 355–357. 116. Reinert, G., Schbath, S. & Waterman, M. S. (2003) 43. Boue, S., Letunic, I. & Bork, P. (2003) BioEssays 25, 81. Breslauer, K. J., Frank, R., Blocker, H. & Marky, J. Comp. Biol. 7, 1–46. 1031–1034. L. A. (1986) Proc. Natl. Acad. Sci. USA 83, 3746– 117. Lorenz, M. G. & Wackernagel, W. (1994) Microbiol. 44. The International SNP Map Working Group (2001) 3750. Rev. 58, 563–602. Nature 409, 928–933. 82. Hunter, C. A. (1993) J. Mol. Biol. 230, 1025–1054. 118. Robinson, N. J., Robinson, P. J., Gupta, A., Bleasby, 45. The International HapMap Consortium (2003) Na- 83. Kunkel, T. A. & Bebeneck, K. (2000) Annu. Rev. A. J., Whitton, B. A. & Morby, A. P. (1995) Nucleic ture 426, 789–796. Biochem. 69, 497–529. Acids Res. 23, 729–735. 46. Murzin, A. G., Brenner, S. E., Hubbard, T. & Cho- 84. Sinden, R. R. (1994) DNA Structure and Function 119. Moxon, E. R., Rainey, P. B., Nowak, M. A. & Lenski, thia, C. (1995) J. Mol. Biol. 247, 536–540. (Academic, San Diego). R. E. (1994) Curr. Biol. 4, 24–33.

13362 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0501804102 Karlin Downloaded by guest on September 24, 2021