Eval2011.Pdf

William Stafford Noble
Department of Genome Sciences
University of Washington
2006–2011

Research

Since 2006, major research accomplishments include the publication and widespread adoption of a semi-supervised learning method for improving peptide identification from shotgun proteomics. The software, Percolator, published with Mike MacCoss in Nature Methods in 2007, is now distributed along with every copy of Mascot, the most widely used MS/MS search engine, and will soon be distributed by Thermo as well. Also, last year, with Tony Blau, Stan Fields and Jay Shendure, we published in Nature a kilobase-resolution model of the 3D structure of the yeast genome in vivo. This year, a postdoc in my lab, Michael Hoffman, received K99 funding. I also received the University of Washington Postdoc Mentor of the Year award from the UWPA.

My lab is now approaching a major transition. All three of my current PhD students will be finished by the end of August. One of my postdocs and two master's students are also leaving. This leaves four postdocs and one programmer in my lab. I am currently advertising for the following seven postdoc positions, which span most of the current research in my lab:

1. Structure of mammalian genomes: Last year, in collaboration with Tony Blau's lab, we published a detailed description of the three-dimensional architecture of the yeast genome in vivo. We have recently received NIH funding to continue this work in mammalian systems. The postdoc involved in this project would work on developing and applying statistical methods for interpreting the raw sequencing data, for relating these data to known classes of functional elements, and for improving our ability to infer 3D structure from observed pairs of interactions. Funded by a new R01, with Tony Blau as PI.

2. Clonal population of cancer: More recently, also in collaboration with Tony Blau, we have been developing next-generation sequencing strategies for characterizing the population of clones in a single cancer by assaying paired cancerous and non-cancerous samples. This project will employ dynamic Bayesian network models to infer the clonal population structure. Funded by Tony Blau.

3. Genomics and proteomics of Plasmodium: Our lab is about to embark on a new research direction, focusing on analyses of Plasmodium falciparum, the parasite responsible for the most lethal form of malaria. In collaboration with Karine Le Roch's lab at UC Riverside, we will investigate local and global DNA structure, with the goal of building a computational model of gene regulation in this organism. We will also be applying our expertise in interpreting shotgun proteomics data to help shed light on the differences between RNA and protein expression. Funded by the Yeast Resource Center P41. I am planning to submit an R01 in the fall on genome structure in yeast and Plasmodium, with two or three co-investigators (Karine Le Roch at UCR, Zhijun Duan in Hematology, and possibly Linda Breeden at the FHCRC).

4. Local chromatin structure and gene regulation: This project involves investigating the relationship between DNA sequence and chromatin structure of the human genome. Computational models, such as dynamic Bayesian networks or support vector machines, will be employed to investigate the competitive binding of proteins to nuclear DNA and to understand their collective influence on gene regulation. This project is a collaboration with Prof. Zhiping Weng at the University of Massachusetts Medical School. Funded by an NSF award, with Zhiping Weng as the PI.

5. Integration of functional genomics data: This project will be carried out in the context of the NIH ENCODE Consortium, the aim of which is to discover all of the functional elements in the human genome. Our lab's role in this consortium is to develop unsupervised and semi-supervised machine learning methods for identifying new instances and new types of functional elements. Funded by the ENCODE Data Analysis Center, with Ewan Birney as PI.

6. Machine learning for mass spectrometry analysis: In collaboration with Mike MacCoss's lab here in Genome Sciences, as well as Jeff Bilmes' lab in Electrical Engineering, we have developed a series of machine learning and statistical methods for interpreting shotgun proteomics data sets. The postdoc working on this project will have opportunities to develop new methods for quantifying proteins, interpreting targeted proteomics data, identifying modified proteins, etc. Funded by my R01, with Jeff Bilmes as co-I.

7. Genomics and proteomics of auditory pathways: Dr. Ed Rubel's lab, in the UW Department of Otolaryngology, studies auditory pathways in the developing mouse brain. A collaboration involving Ed, Mike MacCoss and our lab will collect a series of RNA and protein samples from microdissected mouse brains at particular time points. These samples will be subjected to shotgun proteomics and RNA-seq analysis, with the goal of identifying genes and proteins involved in development of these pathways. The postdoc working on this project would have the opportunity to work in any of the three collaborating labs. Funded by an R01, with Ed Rubel as PI.

In addition, I have an R01, with Tim Bailey as co-investigator, to maintain and develop the MEME Suite. This grant funds one senior programmer in my lab. I also have a pending R01 application with Christina Leslie at Memorial Sloan-Kettering as the PI.

Teaching

This year I taught part of GENOME 541 and managed the entire course. Last year, due to my sabbatical, I did no teaching, but the previous year I taught an entire 10-week undergraduate course (GENOME 373) in addition to part of GENOME 541. Finally, every quarter I help Martin Tompa, Larry Ruzzo and Joe Felsenstein run the CMB journal club (CS590C).

Service

I am currently serving on three editorial boards: PLoS Computational Biology, IEEE/ACM Transactions on Computational Biology and Bioinformatics, and Journal of Bioinformatics and Computational Biology. In addition to extensive ad hoc reviewing and program committee memberships, I have served on five NIH review panels since 2006 and am slated for another in early July. I am just finishing a three-year term on the Board of Directors of the International Society for Computational Biology. With Job Dekker and Tony Blau, I will be leading a workshop on genome structure and function at the Pacific Symposium on Biocomputing in January. More details are provided in my attached CV.