Hidden Markov Models

Jacques van Helden
Aix-Marseille Université (AMU)
Lab. Theory and Approaches of Genomic Complexity (TAGC), https://tagc.univ-amu.fr/
Institut Français de Bioinformatique (IFB), http://www.france-bioinformatique.fr
https://orcid.org/0000-0002-8799-8584

A seminal book

• In 1998, Richard Durbin, Sean Eddy, A. Krogh and G. Mitchison published a seminal book entitled "Biological sequence analysis".
  - A tutorial introduction to hidden Markov models and other probabilistic modelling approaches in computational sequence analysis.
  - The authors restate the classical sequence analysis problems in terms of Hidden Markov Models (HMM).
  - Even their table of contents is presented as an HMM (their Figure 1.1).
• Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison. Cambridge University Press, 1998. ISBN 0-521-62041-4 (hardback)

Applications of Hidden Markov Models in biology

• Hidden Markov models can be applied to solve a diversity of problems in bioinformatics.
• Sequence segmentation
  - Detection of CpG islands
  - Intron/exon prediction
• Motif detection
  - Protein domains (long motifs in peptide sequences)
  - Transcription factor binding sites (short motifs in DNA sequences)
• Secondary structure prediction
• …

Markov models (nothing to hide so far)

Markov process

• A Markov process is defined by
  - A finite number of states (A, B, C, …)
• Example: 2-state Markov process
  - States: {X, Y}
  - Transitions: {X → X, X → Y, Y → X, Y → Y}
• The probability of transition from each state to each other one is described in a transition matrix.
  - Rows: current state $s_i$
  - Columns: next state $s_{i+1}$
  - Values: $P(s_{i+1} \mid s_i)$
  - Transition probabilities sum to 1 on each row
• Examples of Markov models to annotate genomic sequences
  1. State X = intron, state Y = exon
  2. State X = transcribed region, state Y = intergenic region
  3. State X = CpG island, state Y = other genomic region

Transition matrix (2-state Markov process)

      X     Y
X   0.9   0.1
Y   0.2   0.8

Examples of biological applications
  1. Segmentation of the genome into transcribed and intergenic regions (genome fragment, transcript)
  2. Segmentation of transcribed regions into introns and exons
  3. Segmentation of the genome into CpG islands and non-CpG islands

Markov process (k states)

• In order to annotate the genome, we could conceive a multi-state Markov model that would represent the different types of regions.
  1. State W = intron
  2. State X = exon
  3. State Y = CpG island
  4. State Z = other genomic region

[Figure: k-state Markov process with states W, X, Y, Z and begin (B) and end (E) states. Segmentation of the genome into different types of regions: intron, exon, CpG island, other type of genomic region]

Transition matrix (arbitrary values)

    W          X          Y          Z
W   0.990      0.010      0.000      0.000
X   0.010      0.988      0.001      0.001
Y   0.00000    0.00002    0.99898    0.00100
Z   0          0.000002   0.000001   0.999997

Markov model of a sequence

• We can model a macromolecular sequence as a Markov process.
  - DNA: n = 4 states (A, C, G, T)
  - Proteins: n = 20 states (amino acids)
  - Optionally, additional states can be used to represent the beginning (B) and the end (E) of the sequence. This makes it possible to generate sequences of different lengths.
• Transition probabilities indicate the probability of generating a given residue (suffix) given the current residue (prefix).
• Exercise
  - DNA sequences are generated using a Markov model with an ending probability of 0.99 (irrespective of the current residue). What is the distribution of sequence lengths?

[Figure: 4-state Markov process for DNA sequence, with states A, C, G, T and begin (B) and end (E) states]

Probability of a sequence segment

• What is the probability of a given sequence segment?
• Different models can be chosen.
  - Bernoulli model
    * Assumes independence between successive nucleotides.
    * The probability of each residue is fixed a priori (prior residue probability).
      Example: P(A) = 0.35; P(T) = 0.32; P(C) = 0.17; P(G) = 0.16
    * Particular case: equiprobable residues, P(A) = P(T) = P(C) = P(G) = 0.25. Simple, but NOT realistic!
  - Markov model
    * The probability of each residue depends on the m preceding residues.
    * The parameter m is called the order of the Markov model.
    * Remark: a Bernoulli model can be considered as a Markov model of order 0.

Independent and equiprobable nucleotides

• The simplest model: Bernoulli with independent and identically distributed (i.i.d.) nucleotides.
  $p = P(A) = P(C) = P(G) = P(T) = 0.25$
• The probability of a sequence, $P(S) = p^L$,
  - is the product of its residue probabilities (independence);
  - equiprobability: since all residues have the same probability, it is simply the residue probability ($p$) raised to the power of the sequence length ($L$).
    * $S$ is a sequence segment (e.g. an oligonucleotide)
    * $L$ is the length of the sequence segment
    * $p$ is the nucleotide probability
    * $P(S)$ is the probability of observing this sequence segment at a given position of a larger sequence
• Example
  - $P(\mathrm{CACGTG}) = 0.25^6 = 2.44 \times 10^{-4}$

Bernoulli model: independently distributed nucleotides

• A more refined model consists in using residue-specific probabilities. The probability of each residue is assumed to be constant over the whole sequence (Bernoulli scheme).
  $P(S) = \prod_{i=1}^{L} P(r_i)$
• The probability of a sequence is the product of its residue probabilities.
  - $i = 1..L$ is the index of nucleotide positions
  - $r_i$ is the residue found at position $i$
  - $P(r_i)$ is the probability of this residue
• Example: non-coding sequences in the yeast genome
  - P(A) = P(T) = 0.325
  - P(C) = P(G) = 0.175
  - $P(\mathrm{CACGTG}) = P(C)\,P(A)\,P(C)\,P(G)\,P(T)\,P(G) = 0.325^2 \times 0.175^4 = 9.91 \times 10^{-5}$

Bernoulli models

• A Bernoulli model assumes that
  - each residue has a specific prior probability;
  - this probability is constant over the sequence (no context dependencies).
• The heat maps depict the nucleotide frequencies in non-coding upstream sequences of various organisms.
• The frequencies of AT versus CG show strong inter-organism differences.

[Figure: nucleotide frequency heat maps for Saccharomyces cerevisiae (Fungus), Escherichia coli K12 (Proteobacteria), Mycobacterium leprae (Actinobacteria), Mycoplasma genitalium (Firmicute, intracellular), Bacillus subtilis (Firmicute, extracellular), Plasmodium falciparum (Apicomplexa, intracellular), Anopheles gambiae (Insect), Homo sapiens (Mammal)]

Markov chains and transition matrices

• In a Markov model, the probability of finding a letter at position $i$, $P(r_i \mid S_{i-m,i-1})$, depends on the residues found at the m preceding positions.
• The tables represent the transition matrices for Markov chain models of order m = 1 and m = 2.
• Each row specifies one prefix, each column one suffix.

Transition matrix, order 1

Prefix   A        C        G        T
A        P(A|A)   P(C|A)   P(G|A)   P(T|A)
C        P(A|C)   P(C|C)   P(G|C)   P(T|C)
G        P(A|G)   P(C|G)   P(G|G)   P(T|G)
T        P(A|T)   P(C|T)   P(G|T)   P(T|T)
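The prefix/suffix layout of these transition matrices generalizes to any order m: there are 4^m prefix rows and 4 suffix columns. The following is only a minimal sketch of that layout, not code from the lecture; the function names `uniform_transition_table` and `conditional_probability` are hypothetical, and the table is filled with equiprobable placeholder values rather than trained probabilities.

```python
from itertools import product

ALPHABET = "ACGT"


def uniform_transition_table(order):
    """Build {prefix: {suffix: probability}} for a Markov model of the given order.

    Probabilities are equiprobable placeholders (an assumption for illustration);
    a trained model would replace them with estimated values.
    """
    prefixes = ["".join(letters) for letters in product(ALPHABET, repeat=order)]
    return {
        prefix: {suffix: 1 / len(ALPHABET) for suffix in ALPHABET}
        for prefix in prefixes
    }


def conditional_probability(table, sequence, i, order):
    """Return P(r_i | S_{i-m,i-1}) for a 0-based position i >= order."""
    prefix = sequence[i - order:i]  # the m residues preceding position i
    return table[prefix][sequence[i]]


if __name__ == "__main__":
    for m in (0, 1, 2):
        table = uniform_transition_table(m)
        print(f"order {m}: {len(table)} prefix rows")
```

With order 0 the only prefix is the empty string, so the table has a single row, which matches the remark that a Bernoulli model is a Markov model of order 0.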
Transition matrix, order 2

Prefix   A         C         G         T
AA       P(A|AA)   P(C|AA)   P(G|AA)   P(T|AA)
AC       P(A|AC)   P(C|AC)   P(G|AC)   P(T|AC)
AG       P(A|AG)   P(C|AG)   P(G|AG)   P(T|AG)
AT       P(A|AT)   P(C|AT)   P(G|AT)   P(T|AT)
CA       P(A|CA)   P(C|CA)   P(G|CA)   P(T|CA)
CC       P(A|CC)   P(C|CC)   P(G|CC)   P(T|CC)
CG       P(A|CG)   P(C|CG)   P(G|CG)   P(T|CG)
CT       P(A|CT)   P(C|CT)   P(G|CT)   P(T|CT)
GA       P(A|GA)   P(C|GA)   P(G|GA)   P(T|GA)
GC       P(A|GC)   P(C|GC)   P(G|GC)   P(T|GC)
GG       P(A|GG)   P(C|GG)   P(G|GG)   P(T|GG)
GT       P(A|GT)   P(C|GT)   P(G|GT)   P(T|GT)
TA       P(A|TA)   P(C|TA)   P(G|TA)   P(T|TA)
TC       P(A|TC)   P(C|TC)   P(G|TC)   P(T|TC)
TG       P(A|TG)   P(C|TG)   P(G|TG)   P(T|TG)
TT       P(A|TT)   P(C|TT)   P(G|TT)   P(T|TT)

• The values indicate the probability to observe a given residue (suffix $r_i$) at position $i$ of the sequence, as a function of the m preceding residues (the prefix $S_{i-m,i-1}$).
• Particular case
  - A Bernoulli model is a Markov model of order 0.

Markov model estimation ("training")

• Transition frequencies for a Markov model of order m can be estimated from the frequencies observed for oligomers (k-mers) of length k = m + 1 in a reference sequence set.
• Example
  - The table shows dinucleotide frequencies (k = 2) computed from the whole set of upstream sequences of the yeast Saccharomyces cerevisiae.
  - This table can be used to estimate a Markov model of order m = k - 1 = 1.

Dinucleotide frequencies

Sequence S   Occurrences N(S)   Frequency F(S)
AA           526,149            0.112
AC           251,377            0.054
AG           275,056            0.059
AT           414,453            0.088
CA           294,423            0.063
CC           178,324            0.038
CG           146,052            0.031
CT           275,859            0.059
GA           277,343            0.059
GC           184,367            0.039
GG           173,404            0.037
GT           239,569            0.051
TA           369,980            0.079
TC           280,475            0.060
TG           279,932            0.060
TT           521,236            0.111
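As a rough sketch of the estimation step described above, the dinucleotide occurrences can be turned into an order-1 transition matrix by dividing each count N(prefix + suffix) by the sum of the counts sharing the same prefix. The code below is an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, and the prior of the first residue is taken from the first-letter totals of the dinucleotide counts, which is one possible choice among others.

```python
from math import prod

ALPHABET = "ACGT"

# Dinucleotide occurrences N(S) copied from the table above
# (upstream sequences of Saccharomyces cerevisiae).
DINUC_COUNTS = {
    "AA": 526149, "AC": 251377, "AG": 275056, "AT": 414453,
    "CA": 294423, "CC": 178324, "CG": 146052, "CT": 275859,
    "GA": 277343, "GC": 184367, "GG": 173404, "GT": 239569,
    "TA": 369980, "TC": 280475, "TG": 279932, "TT": 521236,
}


def estimate_transitions(counts):
    """Return {prefix: {suffix: P(suffix | prefix)}} from dinucleotide counts."""
    transitions = {}
    for prefix in ALPHABET:
        row_total = sum(counts[prefix + suffix] for suffix in ALPHABET)
        transitions[prefix] = {
            suffix: counts[prefix + suffix] / row_total for suffix in ALPHABET
        }
    return transitions


def estimate_priors(counts):
    """Residue priors from first-letter totals (an assumption of this sketch)."""
    total = sum(counts.values())
    return {
        residue: sum(counts[residue + suffix] for suffix in ALPHABET) / total
        for residue in ALPHABET
    }


def bernoulli_probability(sequence, priors):
    """P(S) as the product of independent residue probabilities."""
    return prod(priors[residue] for residue in sequence)


def markov_probability(sequence, priors, transitions):
    """P(S) under an order-1 Markov model: prior of r_1 times the transitions."""
    probability = priors[sequence[0]]
    for previous, current in zip(sequence, sequence[1:]):
        probability *= transitions[previous][current]
    return probability


if __name__ == "__main__":
    transitions = estimate_transitions(DINUC_COUNTS)
    priors = estimate_priors(DINUC_COUNTS)
    print("Bernoulli:", bernoulli_probability("CACGTG", priors))
    print("Markov (order 1):", markov_probability("CACGTG", priors, transitions))
```

Normalizing by row totals rather than by the published frequencies F(S) avoids the rounding in the three-decimal frequency column and guarantees that each row of the estimated matrix sums to 1.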