Statistical Signals in Bioinformatics

FROM THE ACADEMY: COLLOQUIUM PERSPECTIVE Statistical signals in bioinformatics Samuel Karlin† Department of Mathematics, Stanford University, Stanford, CA 94305-2125 Contributed by Samuel Karlin, July 7, 2005 The Arthur M. Sackler Colloquium of the National Academy of Sciences, ‘‘Frontiers in Bioinformatics: Unsolved Problems and Chal- lenges,’’ organized by David Eisenberg, Russ Altman, and myself, was held October 15–17, 2004, to provide a forum for discussing concepts and methods in bioinformatics serving the biological and medical sciences. The deluge of genomic and proteomic data in the last two decades has driven the creation of tools that search and analyze biomolecular sequences and structures. Bioinformatics is highly interdisciplinary, using knowledge from mathematics, statistics, computer science, biology, medicine, physics, chemistry, and engineering. BLAST ͉ repeat sequences ͉ r-scan statistics ͉ frequent and rare oligonucleotides and peptides ore than 200 bacterial, 20 ment) (13, 14); r-scan statistics (15) (de- to evaluate gene expression with a differ- archaeal, and 20 eukaryotic tecting anomalous spacing of specific ent set of limitations (32–34). Ribosomal genomes as well as 1,600 markers distributed along a sequence); protein (RP) codon frequencies deviate viral genomes have been and ‘‘frequent’’ and ‘‘rare’’ oligonucleo- strongly from average codon frequencies Mcompletely sequenced. Moreover, several tides and peptides (sequence words in many bacteria, especially during rapid hundred mitochondrial chromosomal sets that occur statistically more or less fre- growth. By contrast, the expression levels and at least 30 chromosomal plastids have quently than would be expected by for RPs in archaea are variable; 15–30% been sequenced. A deeper understanding chance) (16–18). of archaeal genes have average codon of basic biology can be gained from a The standard BLAST protocol compares usage (35, 36). The most biased codon comparison of organisms in different evo- a query sequence with a large database of usages are found for genes generally lutionary lineages, and this understanding protein sequences to uncover significant involved in general processes of transcrip- is the aim of the currently intense se- similarities that help to circumscribe func- tion͞translation, chaperone͞degradation, quencing efforts. New species thriving in tion and structure of the query sequence. and energy metabolism (33, 34). environments throughout the earth and BLAST-like programs are extended to oceans reveal microbes in geothermal ar- searches using multiple alignments (13, 14, Highlights of the Sackler Colloquium eas, in arctic ice, in acidic springs, in toxic 19); to single-sequence analyses (20), e.g., Discussions at the Colloquium focused on waste sites, in assorted air currents, and in to identify DNA-binding peptides and the following topics: sequence patterns subterranean habitats (1, 2). Unique mixes transmembrane tracts; and to three-di- (12, 18, 20), comparative genomics and of microbes that thrive in different ecosys- mensional analyses, e.g., to identify charge proteomics, modeling of molecular inter- tems have been described (3). For exam- clusters in protein structures and cysteine actions and gene regulatory networks, ple, a new type of bacterial rhodopsin was knots (21, 22). The BLAST programs cur- management and interpretation of protein discovered by genomic analysis of natu- rently serve Ϸ250,000 queries per day at expression data (33, 34), microarrays (29), rally occurring marine bacterioplankton the National Center for Biotechnology over- and underrepresentations of words (4). Prokaryote genomic samples from the Information (NCBI) in Washington, DC. in genomes and proteomes, molecular human oral cavity and the intestinal tract, The theory relies on fundamental proper- evolution (37–39), alternative splicing (40– and their changes in time, are progres- ties of extremal statistical distributions and 43), polymorphisms and SNPs (44), sively being followed (5). The diversity of the stochastic theory of large deviations genomic haplotypes (HAP-MAP) (45), microbes in cutaneous wounds of humans (23–25). There are natural relationships and three-dimensional macromolecular is of great practical importance. Sequenc- between these analyses and studies on the taxonomy (46). In this article, I will sum- ing efforts facilitate health research and maximum service time among customers marize results and perspectives issuing understanding of pathogenesis and may in queuing systems, as well as in applica- from Colloquium talks and then review have commercial, industrial, and agricul- tions to insurance risk and traffic flow some general concepts and methods of tural benefits as well. It is clear that lack models (see refs. 26 and 27 and below) bioinformatics. of data is not a problem today; rather, the DNA microarrays (DNA chips) aim to George Miklos called attention to diffi- challenge will be the analysis of the vast dissect gene expression under varied phys- culties arising from noise in microarray quantity of information already available. iological, clinical, and environmental con- data sets. Conflicts in the clinical applica- Bioinformatic methods underlie most of ditions. Microarrays are used to monitor tion of microarray data to cancers and computational biology. Current emphasis well characterized genes in different situa- complex diseases are exposed in refs. 47 is on computationally efficient and wide- tions; to discover disease genes; to assess ranging algorithms that have been imple- gene expression during treatment with This paper serves as an introduction to the Arthur M. Sackler mented and tested on real and simulated drugs, chemicals, or toxins; to discover Colloquium of the National Academy of Sciences, ‘‘Frontiers data sets. Exact and empirical algorithms genes that compensate for knockout mu- in Bioinformatics: Unsolved Problems and Challenges,’’ have been applied to genomics, proteom- tations; and to profile gene expression in held October 15–17, 2004, at the Arnold and Mabel Beck- ics, gene networks, structure prediction, temporal and in tissue-specific localiza- man Center of the National Academies of Sciences and Engineering in Irvine, CA. Papers from this Colloquium will and drug design. Bioinformatic tools are tions (28–30). Experimental evaluations of be available as a collection on the PNAS web site. The used daily, including the generalized protein abundances under different cellu- complete program is available on the NAS web site at BLAST programs (sequence similarity eval- lar conditions can be assayed by two-di- www.nasonline.org͞bioinformatics. uations) (6–9); GENSCAN and GENIE (gene mensional gel electrophoresis (31) and Abbreviations: AS, alternative splicing; i.i.d., independent discovery) (10, 11); SAPS (statistical analy- supplemented by mass spectrometry, anti- identically distributed; USS, uptake signal sequence. sis of protein sequences) (12); CLUSTAL body associations, and biochemical tests. †E-mail: [email protected]. and ITERALIGN (multiple sequence align- Codon usage analysis offers another way © 2005 by The National Academy of Sciences of the USA www.pnas.org͞cgi͞doi͞10.1073͞pnas.0501804102 PNAS ͉ September 20, 2005 ͉ vol. 102 ͉ no. 38 ͉ 13355–13362 Downloaded by guest on September 24, 2021 and 48. Marc Gerstein reviewed data on interacting genes. Per Bork described gene CpG Dinucleotide Suppression the numbers and distribution of pseudo- control in metazoans and how it influ- The role of CpG dinucleotides, particu- genes (called fossil records) over several ences genome evolution. He proffered a larly in the context of CpG methylation, is complete eukaryotic genomes, especially gene neighborhood on predicting gene of considerable interest (e.g., see ref. 69). the human genome. Ribosomal protein function (43). Hanah Margalit reported In addition to causing increased mutation genes predominate among pseudogenes in how gene regulation and the arrangement rates, CpG methylation alters the shape of human sequences counting more than of protein pairs participate in protein– the major groove of DNA, leading to 2,000 cases (49, 50). In chromosomes 21 protein interactions in Escherichia coli and modified chromatin structure, and thus it and 22 (see ref. 51), the median length of Saccharomyces cerevisiae. She displayed is capable of altering patterns and rates of processed pseudogenes is approximately situations of chromosomal adjacency gene transcription (70). Recent studies in the same median length as that of single- among these genes (58). Shoshana Wodak humans have shown that the nonpromoter exon (intronless) genes. The median also dealt with protein–protein interaction CpG islands are targets for de novo meth- length of single-exon genes is approxi- and the structural problem of docking ylation, playing a role in cancer and aging mately congruent to the median internal (59). David Eisenberg reviewed several (71). As a mechanism for modifying gene exon length times the average number of programs relevant to the study of protein expression levels and tissue-specific ex- exons per gene. These observations sup- interactions comparing multiple bacterial pression, it offers regulatory possibilities port the proposition that many single-exon genomes. The information can be applied that are exploited in genomic imprinting, genes derive from processed multiexon in structural genomics to determine pro- X chromosome inactivation, transposon genes transposed into the genome (51). tein

Statistical Signals in Bioinformatics

RECENT ADVANCES in BIOLOGY, BIOPHYSICS, BIOENGINEERING and COMPUTATIONAL CHEMISTRY

ISMB 2008 Toronto

From DNA Sequence to Chromatin Dynamics: Computational Analysis of Transcriptional Regulation

Program Book

Conference Proceedingssmall

Assigning Folds to the Proteins Encoded by the Genome of Mycoplasma Genitalium (Protein Fold Recognition͞computer Analysis of Genome Sequences)

Practical Structure-Sequence Alignment of Pseudoknotted Rnas Wei Wang

BIOINFORMATICS Doi:10.1093/Bioinformatics/Btq499

A Hidden Markov Model Framewrok for Studying Regulation from Chromatin Through RNA

EYAL AKIVA, Phd

UC San Diego UC San Diego Electronic Theses and Dissertations

Pacific Symposium on Biocomputing 2020 January 3-7, 2020 Big Island of Hawaii