Lecture 10: Phylogeny 25,27/12/12 Phylogeny

Total Page:16

File Type:pdf, Size:1020Kb

Lecture 10: Phylogeny 25,27/12/12 Phylogeny גנומיקה חישובית Computational Genomics פרופ' רון שמיר ופרופ' רודד שרן Prof. Ron Shamir & Prof. Roded Sharan ביה"ס למדעי המחשב אוניברסיטת תל אביב School of Computer Science, Tel Aviv University , Lecture 10: Phylogeny 25,27/12/12 Phylogeny Slides: • Adi Akavia • Nir Friedman’s slides at HUJI (based on ALGMB 98) •Anders Gorm Pedersen,Technical University of Denmark Sources: Joe Felsenstein “Inferring Phylogenies” (2004) 1 CG © Ron Shamir Phylogeny • Phylogeny: the ancestral relationship of a set of species. • Represented by a phylogenetic tree branch ? leaf ? Internal node ? ? ? Leaves - contemporary Internal nodes - ancestral 2 CG Branch© Ron Shamir length - distance between sequences 3 CG © Ron Shamir 4 CG © Ron Shamir 5 CG © Ron Shamir 6 CG © Ron Shamir “classical” Phylogeny schools Classical vs. Modern: • Classical - morphological characters • Modern - molecular sequences. 7 CG © Ron Shamir Trees and Models • rooted / unrooted “molecular • topology / distance clock” • binary / general 8 CG © Ron Shamir To root or not to root? • Unrooted tree: phylogeny without direction. 9 CG © Ron Shamir Rooting an Unrooted Tree • We can estimate the position of the root by introducing an outgroup: – a species that is definitely most distant from all the species of interest Proposed root Falcon Aardvark Bison Chimp Dog Elephant 10 CG © Ron Shamir A Scally et al. Nature 483, 169-175 (2012) doi:10.1038/nature10842 HOW DO WE FIGURE OUT THESE TREES? TIMES? 12 CG © Ron Shamir Dangers of Paralogs • Right species distance: (1,(2,3)) Sequence Homology Caused By: Gene Duplication •Orthologs - speciation, •Paralogs - duplication •Xenologs - horizontal (e.g., by virus) Speciation events 1A 2A 3A 3B 2B 1B 13 CG © Ron Shamir Dangers of Paralogs • Right species distance: (1,(2,3)) • If we only consider 1A, 2B, and 3A: ((1,3),2) Gene Duplication Speciation events 1A 2A 3A 3B 2B 1B 14 CG © Ron Shamir Type of Data • Distance-based – Input: matrix of distances between species – Distance can be • fraction of residue they disagree on, • alignment score between them, • … • Character-based – Examine each character (e.g., residue) separately 15 CG © Ron Shamir Distance Based Methods 16 CG © Ron Shamir Tree based distances • d(i,j)=sum of arc lengths on the path ij • Given d, can we find – an exactly matching tree? – An approximately matching tree? 17 CG © Ron Shamir The Problem Input: matrix d of distances between species Goal: Find a tree with leaves=chars and edge distances, matching d best. Quality measure: sum of squares: 2 SSQ(T ) = ∑∑ wij (dij − tij ) i j≠i tij: distance in the tree 2 wij: pair weighting. Options: (1) ≡1 (2) 1/dij (3) 1/dij 18 CGNP © Ron- hardShamir (Day ’86). We’ll describe common heuristics UPGMA Clustering (Sokal & Michener 1958) (Unweighted pair-group method with arithmetic mean) • Approach: Form a tree; closer species according to input distances should be closer in the tree • Build the tree bottom up, each time merging two smaller trees • All leaves are at same distance from the root 19 CG © Ron Shamir Efficiency lemma • Approach: gradually form clusters: sets of species • Repeatedly identify two clusters and merge them. • For clusters Ci Cj , define the distance between them to be the average dist betw their members: 1 d (Ci ,C j ) = ∑ ∑d (p,q) |Ci ||C j | p∈Ci q∈C j • Lemma: If Ck is formed by merging Ci and Cj then for every other cluster Cl d(Ck,Cl) = (|Ci|*d(Ci,Cl) + |Cj|*d(Cj,Cl)) / (|Ci| + |Cj|) Can update distances between clusters in time prop. to the number of clusters. 20 CG © Ron Shamir UPGMA algorithm Initialize: each node is a cluster Ci={i}. d(Ci,Cj)=d(i,j) set height(i) = 0 ∀i Iterate: • Find Ci,Cj with smallest d(Ci,Cj) • Introduce a new cluster node Ck that replaces Ci and Cj //Ck represents all the leaves in clusters Ci and Cj • Introduce a new tree node Aij with height(Aij) = d(Ci,Cj)/2 // d(Ci,Cj) is the average dist among leaves of Ci and Cj • Connect the corresponding tree nodes Ci,Cj to Aij with length(Ci, Aij) = height(Aij) - height(Ci) length(Cj,Aij) = height(Aij) - height(Cj) • For all other Cl: d(Ck,Cl) = (|Ci|*d(Ci,Cl) + |Cj|*d(Cj,Cl)) / (|Ci| + |Cj|) //dist to any old cluster is the ave dist between its leaves and leaves in Ci, Cj Time: Naïve: O(n3); Can show O(n2 logn) (ex.) and O(n2) (harder ex.) 21 CG © Ron Shamir UPGMA alg (2) • Orange nodes represent the groups of nodes that they replaced, and maintain the average dist of the set from other leaf nodes/clusters A12345 h123 A123 h12 A12 h12 A45 1 2 3 4 5 C 3 C123 C 12 C45 45 1 22 22 CG © Ron Shamir,Shamir ‘09 UPGMA Algorithm 4 3 5 2 1 1 2 3 5 4 CG © Ron Shamir 23 UPGMA Algorithm 4 3 5 2 1 1 2 3 5 4 CG © Ron Shamir 24 UPGMA Algorithm 4 3 5 2 1 1 2 3 5 4 CG © Ron Shamir 25 UPGMA Algorithm 4 3 5 2 1 1 2 3 5 4 CG © Ron Shamir 26 UPGMA Algorithm 4 3 5 2 1 1 2 3 5 4 CG © Ron Shamir 27 Molecular Clock • UPGMA implicitly assumes the tree is ultrametric: equal leaf-root distances => common uniform clock 2 3 2 3 4 1 1 4 • Works reasonably well for nearby species 28 CG © Ron Shamir Additivity • A weaker additivity assumption: distances between species are the sum of distances between intermediate nodes (even if the tree is not ultrametric) k = + c d (i , j ) a b b d (i ,k ) = a + c j a d (j,k ) = b + c i 29 CG © Ron Shamir Consequences of Additivity • Suppose input distances are additive • For any three leaves k d (i , j ) = a + b c b d (i ,k ) = a + c j a m d (j,k ) = b + c i • Thus 1 d (m,k ) = (d (i ,k ) + d (j,k ) − d (i , j )) 30 CG © Ron Shamir 2 Consequences of Additivity II • Suppose input distances are additive • For any four leaves • Induced tree wt: W k j d(i, j) + d(k,l) = W + b c d d(i,l) + d( j,k) = W + b b a d(i,k) + d( j,l) = W − b e l i • Largest sum must appear twice 31 CG © Ron Shamir Consequences of Additivity III • If we can identify neighbor leaves, then can use pairwise distances to reconstruct the tree: • Remove neighbors i, j from the leaf set • Add k i m • Set dkm= (dim + djm –dij)/2 k But can we find neighbor leaves? b 32 CG © Ron Shamir j Closest pairs may not be neighbors! • Closest pair: kj even though they are not neighbors k j 1 1 1 4 4 i l 33 CG © Ron Shamir Neighbor Joining (Saitou-Nei ’87) • Let D(i , j ) = d (i , j ) − (ri + rj ) where 1 = “Corrected” average ri ∑ d(i,k) distance of i from all | L | −2 k other nodes Theorem: if D(i,j) is minimum among all pairs of leaves, then i and j are neighbors in the tree 34 CG © Ron Shamir 1 r = d(i,k) i − ∑ Neighbor Joining | L | 2 k D(i , j ) = d (i , j ) − (ri + rj ) • Set L to contain all leaves Iteration: • Choose i,j such that D(i,j) is minimum • Create new node k, and set i m d(i,k) = (d(i, j) + r − r ) / 2 i j k = + − d( j,k) (d(i, j) rj ri ) / 2 d(k,m) = (d(i,m) + d( j,m) − d(i, j)) / 2 j • remove i,j from L, and add k Time: O(n3) • Update r, D • Termination: when |L| =2 connect the two nodes Thm: Opt tree guaranteed if distances match a tree 35 Does CG © Ron notShamir assume a clock Neighbor Joining Algorithm A B C D A - 17 21 27 B - 12 18 C - 14 D - CENTER FOR BIOLOGICALSEQUENCE ANALYSIS Neighbor Joining Algorithm A B C D i ui A - 17 21 27 A ( 17+21+27)/2=32. 5 B - 12 18 B ( 17+12+18)/2=23. 5 C - 14 C ( 21+12+14)/2=23. 5 D - D ( 27+18+14)/2=29. 5 CENTER FOR BIOLOGICALSEQUENCE ANALYSIS Neighbor Joining Algorithm A B C D i ui A - 17 21 27 A ( 17+21+27)/2=32. 5 B - 12 18 B ( 17+12+18)/2=23. 5 C - 14 C ( 21+12+14)/2=23. 5 D - D ( 27+18+14)/2=29. 5 A B C D A - - 39 - 35 - 35 CENTER FOR BIOLOGICALSEQUENCE ANALYSIS B - - 35 - 35 C - - 39 D - Dij- ui - uj Neighbor Joining Algorithm A B C D i ui A - 17 21 27 A ( 17+21+27)/2=32. 5 B - 12 18 B ( 17+12+18)/2=23. 5 C - 14 C ( 21+12+14)/2=23. 5 D - D ( 27+18+14)/2=29. 5 A B C D A - - 39 - 35 - 35 CENTER FOR BIOLOGICALSEQUENCE ANALYSIS B - - 35 - 35 C - - 39 D - Dij- ui - uj Neighbor Joining Algorithm A B C D i ui A - 17 21 27 A ( 17+21+27)/2=32. 5 B - 12 18 B ( 17+12+18)/2=23. 5 C - 14 C ( 21+12+14)/2=23. 5 D - D ( 27+18+14)/2=29. 5 A B C D A - - 39 - 35 - 35 C D CENTER FOR BIOLOGICALSEQUENCE ANALYSIS B - - 35 - 35 X C - - 39 D - Dij- ui - uj Neighbor Joining Algorithm A B C D i ui A - 17 21 27 A ( 17+21+27)/2=32.
Recommended publications
  • Bonnie Berger Named ISCB 2019 ISCB Accomplishments by a Senior
    F1000Research 2019, 8(ISCB Comm J):721 Last updated: 09 APR 2020 EDITORIAL Bonnie Berger named ISCB 2019 ISCB Accomplishments by a Senior Scientist Award recipient [version 1; peer review: not peer reviewed] Diane Kovats 1, Ron Shamir1,2, Christiana Fogg3 1International Society for Computational Biology, Leesburg, VA, USA 2Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel 3Freelance Writer, Kensington, USA First published: 23 May 2019, 8(ISCB Comm J):721 ( Not Peer Reviewed v1 https://doi.org/10.12688/f1000research.19219.1) Latest published: 23 May 2019, 8(ISCB Comm J):721 ( This article is an Editorial and has not been subject https://doi.org/10.12688/f1000research.19219.1) to external peer review. Abstract Any comments on the article can be found at the The International Society for Computational Biology (ISCB) honors a leader in the fields of computational biology and bioinformatics each year with the end of the article. Accomplishments by a Senior Scientist Award. This award is the highest honor conferred by ISCB to a scientist who is recognized for significant research, education, and service contributions. Bonnie Berger, Simons Professor of Mathematics and Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT) is the 2019 recipient of the Accomplishments by a Senior Scientist Award. She is receiving her award and presenting a keynote address at the 2019 Joint International Conference on Intelligent Systems for Molecular Biology/European Conference on Computational Biology in Basel, Switzerland on July 21-25, 2019. Keywords ISCB, Bonnie Berger, Award This article is included in the International Society for Computational Biology Community Journal gateway.
    [Show full text]
  • DREAM: a Dialogue on Reverse Engineering Assessment And
    DREAM:DREAM: aa DialogueDialogue onon ReverseReverse EngineeringEngineering AssessmentAssessment andand MethodsMethods Andrea Califano: MAGNet: Center for the Multiscale Analysis of Genetic and Cellular Networks C2B2: Center for Computational Biology and Bioinformatics ICRC: Irving Cancer RResearchesearch Center Columbia University 1 ReverseReverse EngineeringEngineering • Inference of a predictive (generative) model from data. E.g. argmax[P(Data|Model)] • Assumptions: – Model variables (E.g., DNA, mRNA, Proteins, cellular sub- structures) – Model variable space: At equilibrium, temporal dynamics, spatio- temporal dynamics, etc. – Model variable interactions: probabilistics (linear, non-linear), explicit kinetics, etc. – Model topology: known a-priori, inferred. • Question: – Model ~= Reality? ReverseReverse EngineeringEngineering Data Biological System Expression Proteomics > NFAT ATGATGGATG CTCGCATGAT CGACGATCAG GTGTAGCCTG High-throughput GGCTGGA Structure Sequence Biology … Biochemical Model Validation Control X-Y- Control X+Y+ Y X Z X+Y- X-Y+ Control Control Specific Prediction SomeSome ReverseReverse EngineeringEngineering MethodsMethods • Optimization: High-Dimensional objective function max corresponds to best topology – Liang S, Fuhrman S, Somogyi (REVEAL) – Gat-Viks and R. Shamir (Chain Functions) – Segal E, Shapira M, Regev A, Pe’er D, Botstein D, KolKollerler D, and Friedman N (Prob. Graphical Models) – Jing Yu, V. Anne Smith, Paul P. Wang, Alexander J. Hartemink, Erich D. Jarvis (Dynamic Bayesian Networks) – … • Regression: Create a general model of biochemical interactions and fit the parameters – Gardner TS, di Bernardo D, Lorentz D, and Collins JJ (NIR) – Alberto de la Fuente, Paul Brazhnik, Pedro Mendes – Roven C and Bussemaker H (REDUCE) – … • Probabilistic and Information Theoretic: Compute probability of interaction and filter with statistical criteria – Atul Butte et al. (Relevance Networks) – Gustavo Stolovitzky et al. (Co-Expression Networks) – Andrea CaCalifanolifano et al.
    [Show full text]
  • Research Report 2006 Max Planck Institute for Molecular Genetics, Berlin Imprint | Research Report 2006
    MAX PLANCK INSTITUTE FOR MOLECULAR GENETICS Research Report 2006 Max Planck Institute for Molecular Genetics, Berlin Imprint | Research Report 2006 Published by the Max Planck Institute for Molecular Genetics (MPIMG), Berlin, Germany, August 2006 Editorial Board Bernhard Herrmann, Hans Lehrach, H.-Hilger Ropers, Martin Vingron Coordination Claudia Falter, Ingrid Stark Design & Production UNICOM Werbeagentur GmbH, Berlin Number of copies: 1,500 Photos Katrin Ullrich, MPIMG; David Ausserhofer Contact Max Planck Institute for Molecular Genetics Ihnestr. 63–73 14195 Berlin, Germany Phone: +49 (0)30-8413 - 0 Fax: +49 (0)30-8413 - 1207 Email: [email protected] For further information about the MPIMG please see our website: www.molgen.mpg.de MPI for Molecular Genetics Research Report 2006 Table of Contents The Max Planck Institute for Molecular Genetics . 4 • Organisational Structure. 4 • MPIMG – Mission, Development of the Institute, Research Concept. .5 Department of Developmental Genetics (Bernhard Herrmann) . 7 • Transmission ratio distortion (Hermann Bauer) . .11 • Signal Transduction in Embryogenesis and Tumor Progression (Markus Morkel). 14 • Development of Endodermal Organs (Heiner Schrewe) . 16 • Gene Expression and 3D-Reconstruction (Ralf Spörle). 18 • Somitogenesis (Lars Wittler). 21 Department of Vertebrate Genomics (Hans Lehrach) . 25 • Molecular Embryology and Aging (James Adjaye). .31 • Protein Expression and Protein Structure (Konrad Büssow). .34 • Mass Spectrometry (Johan Gobom). 37 • Bioinformatics (Ralf Herwig). .40 • Comparative and Functional Genomics (Heinz Himmelbauer). 44 • Genetic Variation (Margret Hoehe). 48 • Cell Arrays/Oligofingerprinting (Michal Janitz). .52 • Kinetic Modeling (Edda Klipp) . .56 • In Vitro Ligand Screening (Zoltán Konthur). .60 • Neurodegenerative Disorders (Sylvia Krobitsch). .64 • Protein Complexes & Cell Organelle Assembly/ USN (Bodo Lange/Thorsten Mielke). .67 • Automation & Technology Development (Hans Lehrach).
    [Show full text]
  • Machine Learning and Statistical Methods for Clustering Single-Cell RNA-Sequencing Data Raphael Petegrosso 1, Zhuliu Li 1 and Rui Kuang 1,∗
    i i “main” — 2019/5/3 — 12:56 — page 1 — #1 i i Briefings in Bioinformatics doi.10.1093/bioinformatics/xxxxxx Advance Access Publication Date: Day Month Year Manuscript Category Subject Section Machine Learning and Statistical Methods for Clustering Single-cell RNA-sequencing Data Raphael Petegrosso 1, Zhuliu Li 1 and Rui Kuang 1,∗ 1Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, MN, USA ∗To whom correspondence should be addressed. Associate Editor: XXXXXXX Received on XXXXX; revised on XXXXX; accepted on XXXXX Abstract Single-cell RNA-sequencing (scRNA-seq) technologies have enabled the large-scale whole-transcriptome profiling of each individual single cell in a cell population. A core analysis of the scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA- seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, k-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNAs from single cells, and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells.
    [Show full text]
  • Annual Meeting 2016
    The Genetics Society of Israel Annual Meeting 2016 The Hebrew University of Jerusalem Edmond J. Safra Campus Givat Ram January 25, 2016 FRONTIERS IN GENETICS X PROGRAM MEETING ORGANIZER Target Conferences Ltd. P.O. Box 51227 Tel Aviv 6713818, Israel Tel: +972 3 5175150, Fax: +972 3 5175155 E-mail: [email protected] 79 Honorary Membership in the Genetic Society of Israel is awarded to Prof. Eliezer Lifschitz from the Technion – Israel Institute of Technology, Israel We are indebted to you for your contributions to Genetics in Israel תואר חבר כבוד מוענק בשם החברה לגנטיקה בישראל לפרופ' אליעזר ליפשיץ הטכניון מכון טכנולוגי לישראל אנו מוקירים את תרומתך למחקר בתחום הגנטיקה בישראל 2 TABLE OF CONTENTS Page Board Members .......................................................................................................... 4 Acknowledgements .................................................................................................... 4 General Information .................................................................................................. 5 Exhibitors ................................................................................................................... 6 Scientific Program ....................................................................................................... 8 Invited Speakers ....................................................................................................... 10 Poster Presentations................................................................................................
    [Show full text]
  • Personal Data Education Current Research Interests Academic
    Vitae of Ron Shamir 1 CURRICULUM VITAE Ron Shamir July 2011 Personal Data Address: Blavatnik School of Computer Science Tel Aviv University Tel Aviv, 69978 ISRAEL Telephones: Office: (03)640-5383, Assistant: (03)640-5391 Home: (08)946-6864 Facsimile: (03)640-5384 E-mail: [email protected] Web Page: http//www.cs.tau.ac.il/∼rshamir Group Web Page: http//actg.cs.tau.ac.il/ Born: Nov. 29, 1953, Jerusalem, Israel Married: Michal Oren-Shamir, September 8 1982 Children: Alon Shamir February 12 1985 Ittai Shamir April 25 1988 Yoav Shamir December 29 1993 Education University of California, Berkeley, USA (1981-1984) Ph.D., Operations Research. Advisors: Richard M. Karp and Ilan Adler Tel Aviv University (1978-1981) M.Sc. Operations Research. Advisor: Uri Yechiali (completed at U. C. Berkeley) The Hebrew University, Jerusalem (1975-1977) B.Sc. Mathematics and Physics Tel Aviv University (1973-1975) B.Sc. Mathematics and Physics (completed at the Hebrew University) Current Research Interests Bioinformatics / Computational molecular biology. Computational genomics, gene regulation, systems biology, genome rearrangements, human disease, in- tegrative analysis of heterogeneous high throughput biological data, Bioinformatics education. Design and analysis of algorithms. Algorithmic graph theory. Academic appointments Professor (2000-today) Tel Aviv University, School of Computer Science Associate professor (1995-2000) Tel Aviv University, School of Computer Science Senior Lecturer (1990-1995) Tel Aviv University, School of Mathematics, Department of Computer Science Lecturer; Alon fellow (1987-1989) Tel Aviv University, School of Mathematics, Department of Computer Science Vitae of Ron Shamir 2 Visiting appointments Visiting professor (March 2009) University of California, San Diego, Department of Computer Science and Engineering.
    [Show full text]
  • Igor Ulitsky – CV September 2016
    Igor Ulitsky – CV September 2016 Date of birth: Aug 26, 1980 (St. Petersburg, Russia) Marital Status: Married+4 Citizenship: Israeli Mailing address: Department of Biological Regulation, Weizmann Institute of Science, Rehovot, Israel 76100 Email: [email protected] Homepage: http://www.weizmann.ac.il/~igoru Education / Positions held 8/2013- Senior Scientist, Department of Biological Regulation, Weizmann Institute of Science, Rehovot, Israel 9/2009-8/2013 Postdoctoral Fellow at the Bartel Lab, Whitehead Institute for Biomedical Research, Cambridge MA 7/2004-7/2009 Ph.D. (direct track) in Computer Science, Tel Aviv University, (awarded 5/2010) Advisor: Prof. Ron Shamir. Thesis: Algorithmic methods for integrating heterogeneous biological data for disease modeling. 2001 – 2004 B.Sc. in Computer Science (Summa Cum Laude) and Life Sciences (Summa Cum Laude), Tel Aviv University, combined program with an emphasis on Bioinformatics (double major). Fellowships and Awards 2015-2020 ERC Starting Grant award 2014-2017 Alon fellowship (“Milgat Alon”), Israel Council for Higher Education 2010-2011 EMBO long term postdoctoral fellowship 2009 Legacy Heritage Fund stem cells research fellowship 2008 Wolf prize for outstanding PhD students (a nation-wide prize) 2005-2009 Fellow, Edmond J. Safra Program in bioinformatics, Tel Aviv University 2004-2005 President and Rector's M.Sc fellowship 2005 Special excellence award, Knesset education committee and university directors committee (a nation-wide prize) 2004 B.Sc. summa cum laude Teaching Experience International schools 4th international course on the noncoding genome @Institute Curie, Paris, February 2014 EMBO practical course “‘Non-coding RNAs: From discovery to function”, Brno, July 2015 @Weizmann RNA World (w/ Prof.
    [Show full text]
  • Front Matter
    Cambridge University Press 978-1-107-01146-5 - Bioinformatics for Biologists Edited by Pavel Pevzner and Ron Shamir Frontmatter More information BIOINFORMATICS FOR BIOLOGISTS The computational education of biologists is changing to prepare students for facing the complex data sets of today’s life science research. In this concise textbook, the authors’ fresh pedagogical approaches lead biology students from first principles towards computational thinking. A team of renowned bioinformaticians take innovative routes to introduce computational ideas in the context of real biological problems. Intuitive explanations promote deep understanding, using little mathematical formalism. Self-contained chapters show how computational procedures are developed and applied to central topics in bioinformatics and genomics, such as the genetic basis of disease, genome evolution, or the tree of life concept. Using bioinformatic resources requires a basic understanding of what bioinformatics is and what it can do. Rather than just presenting tools, the authors – each a leading scientist – engage the students’ problem-solving skills, preparing them to meet the computational challenges of their life science careers. PAVEL PEVZNER is Ronald R. Taylor Professor of Computer Science and Director of the Bioinformatics and Systems Biology Program at the University of California, San Diego. He was named a Howard Hughes Medical Institute Professor in 2006. RON SHAMIR is Raymond and Beverly Sackler Professor of Bioinformatics and head of the Edmond J. Safra Bioinformatics
    [Show full text]
  • Biclustering Usinig Message Passing
    Biclustering Using Message Passing Luke O’Connor Soheil Feizi Bioinformatics and Integrative Genomics Electrical Engineering and Computer Science Harvard University Massachusetts Institute of Technology Cambridge, MA 02138 Cambridge, MA 02139 [email protected] [email protected] Abstract Biclustering is the analog of clustering on a bipartite graph. Existent methods infer biclusters through local search strategies that find one cluster at a time; a common technique is to update the row memberships based on the current column member- ships, and vice versa. We propose a biclustering algorithm that maximizes a global objective function using message passing. Our objective function closely approx- imates a general likelihood function, separating a cluster size penalty term into row- and column-count penalties. Because we use a global optimization frame- work, our approach excels at resolving the overlaps between biclusters, which are important features of biclusters in practice. Moreover, Expectation-Maximization can be used to learn the model parameters if they are unknown. In simulations, we find that our method outperforms two of the best existing biclustering algorithms, ISA and LAS, when the planted clusters overlap. Applied to three gene expres- sion datasets, our method finds coregulated gene clusters that have high quality in terms of cluster size and density. 1 Introduction The term biclustering has been used to describe several distinct problems variants. In this paper, In this paper, we consider the problem of biclustering as a bipartite analogue of clustering: Given an N × M matrix, a bicluster is a subset of rows that are heavily connected to a subset of columns. In this framework, biclustering methods are data mining techniques allowing simultaneous clustering of the rows and columns of a matrix.
    [Show full text]
  • Biclustering Algorithms for Biological Data Analysis: a Survey* Sara C
    INESC-ID TECHNICAL REPORT 1/2004, JANUARY 2004 1 Biclustering Algorithms for Biological Data Analysis: A Survey* Sara C. Madeira and Arlindo L. Oliveira Abstract A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results of the application of standard clustering methods to genes are limited. These limited results are imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the gene expression matrix has been proposed to date. This simultaneous clustering, usually designated by biclustering, seeks to find sub-matrices, that is subgroups of genes and subgroups of columns, where the genes exhibit highly correlated activities for every condition. This type of algorithms has also been proposed and used in other fields, such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search and the target applications. Index Terms Biclustering, simultaneous clustering, co-clustering, two-way clustering, subspace clustering, bi-dimensional clustering, microarray data analysis, biological data analysis I. INTRODUCTION NA chips and other techniques measure the expression level of a large number of genes, perhaps all D genes of an organism, within a number of different experimental samples (conditions).
    [Show full text]
  • Reconstructing Cancer Karyotypes from Short Read Data: the Half Empty and Half Full Glass Rami Eitan and Ron Shamir*
    Eitan and Shamir BMC Bioinformatics (2017) 18:488 DOI 10.1186/s12859-017-1929-9 METHODOLOGY ARTICLE Open Access Reconstructing cancer karyotypes from short read data: the half empty and half full glass Rami Eitan and Ron Shamir* Abstract Background: During cancer progression genomes undergo point mutations as well as larger segmental changes. The latter include, among others, segmental deletions duplications, translocations and inversions.The result is a highly complex, patient-specific cancer karyotype. Using high-throughput technologies of deep sequencing and microarrays it is possible to interrogate a cancer genome and produce chromosomal copy number profiles and a list of breakpoints (“jumps”) relative to the normal genome. This information is very detailed but local, and does not give the overall picture of the cancer genome. One of the basic challenges in cancer genome research is to use such information to infer the cancer karyotype. We present here an algorithmic approach, based on graph theory and integer linear programming, that receives segmental copy number and breakpoint data as input and produces a cancer karyotype that is most concordant with them. We used simulations to evaluate the utility of our approach, and applied it to real data. Results: By using a simulation model, we were able to estimate the correctness and robustness of the algorithm in a spectrum of scenarios. Under our base scenario, designed according to observations in real data, the algorithm correctly inferred 69% of the karyotypes. However, when using less stringent correctness metrics that account for incomplete and noisy data, 87% of the reconstructed karyotypes were correct. Furthermore, in scenarios where the data were very clean and complete, accuracy rose to 90%–100%.
    [Show full text]
  • Opportunities and Obstacles for Deep Learning in Biology and Medicine
    bioRxiv preprint doi: https://doi.org/10.1101/142760; this version posted January 19, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Opportunities and obstacles for deep learning in biology and medicine A DOI-citable preprint of this manuscript is available at https://doi.org/10.1101/142760. This manuscript was automatically generated from greenelab/deep-review@a01dd71 on January 19, 2018. Authors 1,☯ 2 3 4 Travers Ching , Daniel S. Himmelstein , Brett K. Beaulieu-Jones , Alexandr A. Kalinin , 5 2 6 7 2 Brian T. Do , Gregory P. Way , Enrico Ferrero , Paul-Michael Agapow , Michael Zietz , 8,9,10 11 12 13 Michael M. Hoffman , Wei Xie , Gail L. Rosen , Benjamin J. Lengerich , Johnny 14 15 12 16 Israeli , Jack Lanchantin , Stephen Woloszynek , Anne E. Carpenter , Avanti 17 18 19,20 21 Shrikumar , Jinbo Xu , Evan M. Cofer , Christopher A. Lavender , Srinivas C. 22 17 23 24 25 Turaga , Amr M. Alexandari , Zhiyong Lu , David J. Harris , Dave DeCaprio , 15 17,26 23 27 28 Yanjun Qi , Anshul Kundaje , Yifan Peng , Laura K. Wiley , Marwin H.S. Segler , 29 30 31 32,33,† Simina M. Boca , S. Joshua Swamidass , Austin Huang , Anthony Gitter , 2,† Casey S. Greene ☯ — Author order was determined with a randomized algorithm † — To whom correspondence should be addressed: [email protected] (A.G.) and [email protected] (C.S.G.) 1. Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI 2.
    [Show full text]