Analysis, Visualization, and Machine Learning of Epigenomic Data

Total Page:16

File Type:pdf, Size:1020Kb

Analysis, Visualization, and Machine Learning of Epigenomic Data University of Massachusetts Medical School eScholarship@UMMS GSBS Dissertations and Theses Graduate School of Biomedical Sciences 2017-12-12 Analysis, Visualization, and Machine Learning of Epigenomic Data Michael J. Purcaro University of Massachusetts Medical School Let us know how access to this document benefits ou.y Follow this and additional works at: https://escholarship.umassmed.edu/gsbs_diss Part of the Computational Biology Commons, Genomics Commons, and the Integrative Biology Commons Repository Citation Purcaro MJ. (2017). Analysis, Visualization, and Machine Learning of Epigenomic Data. GSBS Dissertations and Theses. https://doi.org/10.13028/M23T1Q. Retrieved from https://escholarship.umassmed.edu/gsbs_diss/938 Creative Commons License This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License This material is brought to you by eScholarship@UMMS. It has been accepted for inclusion in GSBS Dissertations and Theses by an authorized administrator of eScholarship@UMMS. For more information, please contact [email protected]. ANALYSIS, VISUALIZATION, AND MACHINE LEARNING OF EPIGENOMIC DATA A Dissertation Presented By MICHAEL JOSEPH PURCARO Submitted to the Faculty of the University of Massachusetts Graduate School of Biomedical Sciences, Worcester in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY DECEMBER 12, 2017 BIOINFORMATICS AND COMPUTATIONAL BIOLOGY M.D., PH.D. PROGRAM I-ii ANALYSIS, VISUALIZATION, AND MACHINE LEARNING OF EPIGENOMIC DATA A Dissertation Presented By MICHAEL JOSEPH PURCARO This work was undertaken in the Graduate School of Biomedical Sciences Bioinformatics and Computational Biology Under the mentorship of ________________________________________________ Zhiping Weng, PhD, Thesis Advisor ________________________________________________ Elinor Karlsson, PhD, Member of Committee ________________________________________________ Manuel Garber, PhD, Member of Committee ________________________________________________ Robert Brewster, PhD, Member of Committee ________________________________________________ Ross Hardison, PhD, External Member of Committee ________________________________________________ Jeffrey Bailey, MD, PhD, Chair of Committee ________________________________________________ Anthony Carruthers, Ph.D., Dean of the Graduate School of Biomedical Sciences December 12, 2017 iii This work is dedicated to the most loving, understanding, and special woman in my world: my wife and best friend, Zemei. It is also dedicated to my parents, who have always encouraged me to pursue my dreams, despite the many, many years of training it has entailed. iv Acknowledgements Being involved for 4.5 years in an undertaking as complex as the ENCODE Consortium complicates writing acknowledgments. With so many moving pieces—data, metadata, code, papers, conference calls, and Google Docs—to deal with, ENCODE, and especially the Weng Lab, is a swirl of activity and ideas; keeping track of the intellectual and personal impact so many people have made on me would be a job in itself. Certain people, though, were particularly important in my journey: Arjan van der Velde, Nicholas Hathaway, Henry Pratt, Nathaniel Erskine, and Eugenio Mattei (in no particular order) were critical in my survival, allowing me to test ideas (about code, science, medicine, and life in general) and get honest, always entertaining feedback. I learned much from the intellectual prowess of our lab mates: Jill Moore, Tyler Borrman, Jack Huey, Sweta Vangaveti, Sowmya Iyer, Xiao-Ou Zhang, Shikui Tu, Junko Tsuhi, Jie Wang, Hao Chen, Thom Vreven, Yu Fu, Adam Wespiser, and Wei Wang. From the long suffering ENCODE DCC group, I would like to particularly thank "data heroes" Cricket Sloan, Kathrina Onate, and Jason Hilton for all their work in supporting us. From the IT department, I would like to thank David Plamondon, Lewis Robbins, and Charles Davidson for help and advice. I am very appreciative of mentoring and advice from Thomas Smith, who has helped put my work in a wider clinical perspective. I am also very appreciative of my TRAC committee—Jeffrey Bailey, Elinor Karlsson, Konstantin Zeldovich, and Robert Brewster—for helping me steer through this process, as well as the MD/PhD program for giving me the opportunity to undertake training here for a career as a physician scientist. v Life in the lab would come to a crashing halt without the endless support, advice, and motherly watchfulness from Barbara Bucciaglia, Heidi Beberman, Christine Tonevski, and Maureen Schulz. And, of course, none of this would have been possible without Zhiping Weng, whose endless generosity, curiosity, and intellectual insight have been deeply inspiring, and made the lab into an intellectual—and computational— playground. vi Abstract The goal of the Encyclopedia of DNA Elements (ENCODE) project has been to characterize all the functional elements of the human genome. These elements include expressed transcripts and genomic regions bound by transcription factors (TFs), occupied by nucleosomes, occupied by nucleosomes with modified histones, or hypersensitive to DNase I cleavage, etc. Chromatin Immunoprecipitation (ChIP-seq) is an experimental technique for detecting TF binding in living cells, and the genomic regions bound by TFs are called ChIP-seq peaks. ENCODE has performed and compiled results from tens of thousands of experiments, including ChIP-seq, DNase, RNA-seq and Hi-C. These efforts have culminated in two web-based resources from our lab— Factorbook and SCREEN—for the exploration of epigenomic data for both human and mouse. Factorbook is a peak-centric resource presenting data such as motif enrichment and histone modification profiles for transcription factor binding sites computed from ENCODE ChIP-seq data. SCREEN provides an encyclopedia of ~2 million regulatory elements, including promoters and enhancers, identified using ENCODE ChIP-seq and DNase data, with an extensive UI for searching and visualization. While we have successfully utilized the thousands of available ENCODE ChIP- seq experiments to build the Encyclopedia and visualizers, we have also struggled with the practical and theoretical inability to assay every possible experiment on every possible biosample under every conceivable biological scenario. We have used machine learning techniques to predict TF binding sites and enhancers location, and demonstrate machine learning is critical to help decipher functional regions of the genome. vii Table of Contents I. Chapter I: Introduction ........................................................................................ 2 II. Chapter II: Building and Visualizing an Encyclopedia of ENCODE candidate Regulatory Elements ........................................................ 9 II.1 Preface .......................................................................................................................................... 9 II.2 Summary ..................................................................................................................................... 11 II.3 Introduction ................................................................................................................................. 11 II.4 Results ......................................................................................................................................... 14 II.4.1 Summary of Encode Phase 3 Data Production .................................................................................. 14 II.4.2 The Encode Portal and Uniformly Processed Data ........................................................................... 18 II.4.3 The Encode Encyclopedia ................................................................................................................. 19 II.4.4 Encyclopedia Ground Level .............................................................................................................. 20 II.4.5 Encyclopedia Integrative Level ......................................................................................................... 21 II.4.6 The Registry of candidate Regulatory Elements ............................................................................... 23 II.4.7 Selection of cREs for the Registry .................................................................................................... 27 II.4.8 Comprehensiveness of the current Registry of cREs ........................................................................ 28 II.4.9 Classifying cREs in the Registry ....................................................................................................... 30 II.4.10 Relative abundance of cREs-PLS vs. cREs-ELS .............................................................................. 34 II.4.11 Comparison between cREs and the corresponding ChromHMM states............................................ 35 II.4.12 Cell and tissue type clustering ........................................................................................................... 35 II.5 SCREEN: A Web Engine for Searching and Visualizing cREs ................................................. 36 II.5.1 SCREEN Methods ............................................................................................................................ 40 II.5.2 SCREEN Usage ................................................................................................................................ 46 II.5.3 Testing SCREEN User Interface ......................................................................................................
Recommended publications
  • Aberrant Methylation Underlies Insulin Gene Expression in Human Insulinoma
    ARTICLE https://doi.org/10.1038/s41467-020-18839-1 OPEN Aberrant methylation underlies insulin gene expression in human insulinoma Esra Karakose1,6, Huan Wang 2,6, William Inabnet1, Rajesh V. Thakker 3, Steven Libutti4, Gustavo Fernandez-Ranvier 1, Hyunsuk Suh1, Mark Stevenson 3, Yayoi Kinoshita1, Michael Donovan1, Yevgeniy Antipin1,2, Yan Li5, Xiaoxiao Liu 5, Fulai Jin 5, Peng Wang 1, Andrew Uzilov 1,2, ✉ Carmen Argmann 1, Eric E. Schadt 1,2, Andrew F. Stewart 1,7 , Donald K. Scott 1,7 & Luca Lambertini 1,6 1234567890():,; Human insulinomas are rare, benign, slowly proliferating, insulin-producing beta cell tumors that provide a molecular “recipe” or “roadmap” for pathways that control human beta cell regeneration. An earlier study revealed abnormal methylation in the imprinted p15.5-p15.4 region of chromosome 11, known to be abnormally methylated in another disorder of expanded beta cell mass and function: the focal variant of congenital hyperinsulinism. Here, we compare deep DNA methylome sequencing on 19 human insulinomas, and five sets of normal beta cells. We find a remarkably consistent, abnormal methylation pattern in insu- linomas. The findings suggest that abnormal insulin (INS) promoter methylation and altered transcription factor expression create alternative drivers of INS expression, replacing cano- nical PDX1-driven beta cell specification with a pathological, looping, distal enhancer-based form of transcriptional regulation. Finally, NFaT transcription factors, rather than the cano- nical PDX1 enhancer complex, are predicted to drive INS transactivation. 1 From the Diabetes Obesity and Metabolism Institute, The Department of Surgery, The Department of Pathology, The Department of Genetics and Genomics Sciences and The Institute for Genomics and Multiscale Biology, The Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
    [Show full text]
  • Analysis of Gene Expression Data for Gene Ontology
    ANALYSIS OF GENE EXPRESSION DATA FOR GENE ONTOLOGY BASED PROTEIN FUNCTION PREDICTION A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Robert Daniel Macholan May 2011 ANALYSIS OF GENE EXPRESSION DATA FOR GENE ONTOLOGY BASED PROTEIN FUNCTION PREDICTION Robert Daniel Macholan Thesis Approved: Accepted: _______________________________ _______________________________ Advisor Department Chair Dr. Zhong-Hui Duan Dr. Chien-Chung Chan _______________________________ _______________________________ Committee Member Dean of the College Dr. Chien-Chung Chan Dr. Chand K. Midha _______________________________ _______________________________ Committee Member Dean of the Graduate School Dr. Yingcai Xiao Dr. George R. Newkome _______________________________ Date ii ABSTRACT A tremendous increase in genomic data has encouraged biologists to turn to bioinformatics in order to assist in its interpretation and processing. One of the present challenges that need to be overcome in order to understand this data more completely is the development of a reliable method to accurately predict the function of a protein from its genomic information. This study focuses on developing an effective algorithm for protein function prediction. The algorithm is based on proteins that have similar expression patterns. The similarity of the expression data is determined using a novel measure, the slope matrix. The slope matrix introduces a normalized method for the comparison of expression levels throughout a proteome. The algorithm is tested using real microarray gene expression data. Their functions are characterized using gene ontology annotations. The results of the case study indicate the protein function prediction algorithm developed is comparable to the prediction algorithms that are based on the annotations of homologous proteins.
    [Show full text]
  • Original Article Correlation Between LSP1 Polymorphisms and the Susceptibility to Breast Cancer
    Int J Clin Exp Pathol 2015;8(5):5798-5802 www.ijcep.com /ISSN:1936-2625/IJCEP0005835 Original Article Correlation between LSP1 polymorphisms and the susceptibility to breast cancer Hai Chen, Xiaodong Qi, Ping Qiu, Jiali Zhao Department of Galactophore, The General Hospital of Beijing Military Command, Beijing, China Received January 12, 2015; Accepted March 16, 2015; Epub May 1, 2015; Published May 15, 2015 Abstract: Objective: The present study aimed at assessing the relationship between Leukocyte-specific protein 1 gene (LSP1) polymorphisms (rs569550 and rs592373) and the pathogenesis of breast cancer (BC). Methods: 70 BC patients and 72 healthy subjects were enrolled in the study. Rs569550 and rs592373 polymorphisms were genotyped by polymerase chain reaction-restriction fragment length polymorphism (PCR-RFLP). Odds ratio (OR) with 95% confidence interval (CI) were calculated by the chi-squared test to assess the relationship between LSP1 polymorphisms and BC risk. Linkage disequilibrium (LD) and haplotypes were also analyzed by HaploView software. Results: Genotype distribution of the control was in accordance with Hardy-Weinberg equilibrium (HWE). The homo- zygous genotype TT and T allele of rs569550 could significantly increase the risk of BC (TT vs. GG: OR=3.17, 95% CI=1.23-8.91; T vs. G: OR=1.63, 95% CI=1.01-2.64). For rs592373, mutation homozygous genotype CC and C allele were significantly associated with BC susceptibility (CC vs. TT: OR=4.45, 95% CI=1.38-14.8; C vs. T: OR=1.70, 95% CI=1.03-2.81). LD and haplotypes analysis of rs569550 and rs592373 polymorphisms showed that T-C haplotype was a risk factor for BC (T-C vs.
    [Show full text]
  • Meta-Analysis of Nasopharyngeal Carcinoma
    BMC Genomics BioMed Central Research article Open Access Meta-analysis of nasopharyngeal carcinoma microarray data explores mechanism of EBV-regulated neoplastic transformation Xia Chen†1,2, Shuang Liang†1, WenLing Zheng1,3, ZhiJun Liao1, Tao Shang1 and WenLi Ma*1 Address: 1Institute of Genetic Engineering, Southern Medical University, Guangzhou, PR China, 2Xiangya Pingkuang associated hospital, Pingxiang, Jiangxi, PR China and 3Southern Genomics Research Center, Guangzhou, Guangdong, PR China Email: Xia Chen - [email protected]; Shuang Liang - [email protected]; WenLing Zheng - [email protected]; ZhiJun Liao - [email protected]; Tao Shang - [email protected]; WenLi Ma* - [email protected] * Corresponding author †Equal contributors Published: 7 July 2008 Received: 16 February 2008 Accepted: 7 July 2008 BMC Genomics 2008, 9:322 doi:10.1186/1471-2164-9-322 This article is available from: http://www.biomedcentral.com/1471-2164/9/322 © 2008 Chen et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: Epstein-Barr virus (EBV) presumably plays an important role in the pathogenesis of nasopharyngeal carcinoma (NPC), but the molecular mechanism of EBV-dependent neoplastic transformation is not well understood. The combination of bioinformatics with evidences from biological experiments paved a new way to gain more insights into the molecular mechanism of cancer. Results: We profiled gene expression using a meta-analysis approach. Two sets of meta-genes were obtained. Meta-A genes were identified by finding those commonly activated/deactivated upon EBV infection/reactivation.
    [Show full text]
  • LSP1-Myosin1e Bi-Molecular Complex Regulates Focal Adhesion Dynamics
    bioRxiv preprint doi: https://doi.org/10.1101/2020.02.26.963991; this version posted February 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. LSP1-myosin1e bi-molecular complex regulates focal adhesion dynamics and cell migration Katja Schäringer1*, Sebastian Maxeiner1*, Carmen Schalla1, Stephan Rütten2, Martin Zenke1 and Antonio Sechi1 1Institute of Biomedical Engineering, Dept. of Cell Biology, RWTH Aachen University, Pauwelsstrasse, 30, D-52074 Aachen, Germany 2Electron Microscopy Facility, Institute of Pathology, RWTH Aachen University, Pauwelsstrasse, 30, D-52074 Aachen, Germany *equal contribution Corresponding author: Email: [email protected] Telephone: +49 241 8085248 Running head: LSP1-myosin1e complex regulates cell migration. Keywords: Actin cytoskeleton remodelling, focal adhesions, cell motility, macrophages. 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.02.26.963991; this version posted February 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Abstract Several cytoskeleton-associated proteins and signalling pathways work in concert to regulate actin cytoskeleton remodelling, cell adhesion and migration. We have recently demonstrated that the bi-molecular complex between the leukocyte-specific protein 1 (LSP1) and myosin1e controls actin cytoskeleton remodelling during phagocytosis. In this study, we show that LSP1 down regulation severely impairs cell migration, lamellipodia formation and focal adhesion dynamics in macrophages. Inhibition of the interaction between LSP1 and myosin1e also impairs these processes resulting in poorly motile cells, which are characterised by few and small lamellipodia.
    [Show full text]
  • Anti-LSP1 Antibody (ARG55377)
    Product datasheet [email protected] ARG55377 Package: 50 μl anti-LSP1 antibody Store at: -20°C Summary Product Description Rabbit Polyclonal antibody recognizes LSP1 Tested Reactivity Hu Tested Application WB Host Rabbit Clonality Polyclonal Isotype IgG Target Name LSP1 Antigen Species Human Immunogen Synthetic peptide corresponding to aa. 121-149 of Human LSP1. Conjugation Un-conjugated Alternate Names Lymphocyte-specific protein 1; pp52; 52 kDa phosphoprotein; 47 kDa actin-binding protein; WP34; Lymphocyte-specific antigen WP34 Application Instructions Application table Application Dilution WB 1:1000 Application Note * The dilutions indicate recommended starting dilutions and the optimal dilutions or concentrations should be determined by the scientist. Positive Control Daudi Calculated Mw 37 kDa Properties Form Liquid Purification Affinity purification with immunogen. Buffer PBS (without Mg2+ and Ca2+, pH 7.4), 150mM NaCl, 0.02% Sodium azide and 50% Glycerol Preservative 0.02% Sodium azide Stabilizer 50% Glycerol Storage instruction For continuous use, store undiluted antibody at 2-8°C for up to a week. For long-term storage, aliquot and store at -20°C. Storage in frost free freezers is not recommended. Avoid repeated freeze/thaw cycles. Suggest spin the vial prior to opening. The antibody solution should be gently mixed before use. Note For laboratory research only, not for drug, diagnostic or other use. www.arigobio.com 1/2 Bioinformation Database links GeneID: 4046 Human Swiss-port # P33241 Human Gene Symbol LSP1 Gene Full Name lymphocyte-specific protein 1 Background This gene encodes an intracellular F-actin binding protein. The protein is expressed in lymphocytes, neutrophils, macrophages, and endothelium and may regulate neutrophil motility, adhesion to fibrinogen matrix proteins, and transendothelial migration.
    [Show full text]
  • Expression of Mouse LSP1/S37 Isoforms S37 Is Expressed in Embryonic Mesenchymal Cells
    Journal of Cell Science 107, 3591-3600 (1994) 3591 Printed in Great Britain © The Company of Biologists Limited 1994 Expression of mouse LSP1/S37 isoforms S37 is expressed in embryonic mesenchymal cells V. L. Misener1, C.-c. Hui2, I. A. Malapitan1, M.-E. Ittel1, A. L. Joyner2,3 and J. Jongstra1,* 1The Arthritis Centre-Research Unit, The Toronto Hospital Research Institute and Department of Immunology, University of Toronto, Toronto, Ontario, Canada 2Division of Molecular and Developmental Biology, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada 3Department of Molecular and Medical Genetics, University of Toronto, Toronto, Ontario, Canada *Author for correspondence at: Toronto Western Hospital, Room 13-419, 399 Bathurst Street, Toronto, Ontario, M5T 2S8 Canada SUMMARY Mouse LSP1 is a 330 amino acid intracellular F-actin 18 bp encoding the 6 amino acids HLIRHQ of the acidic binding protein expressed in lymphocytes and domain. Therefore, the Lsp1 gene encodes four protein macrophages but not in non-hematopoietic tissues. A 328 isoforms: full-length LSP1 and S37 proteins, designated amino acid LSP1-related protein, designated S37, is LSP1-I and S37-I and the same proteins without the expressed in murine bone marrow stromal cells, in fibro- HLIRHQ sequence, designated LSP1-II and S37-II. By in blasts, and in a myocyte cell line. The two proteins differ situ hybridization analysis we show that the S37 isoforms only at their N termini, the first 23 amino acid residues of are expressed in mesenchymal tissue, but not in adjacent LSP1 being replaced by 21 different residues in S37. The epithelial tissue, of several developing organs during mouse presence of different amino termini suggests that the LSP1 embryogenesis.
    [Show full text]
  • Ontology-Based Methods for Analyzing Life Science Data
    Habilitation a` Diriger des Recherches pr´esent´ee par Olivier Dameron Ontology-based methods for analyzing life science data Soutenue publiquement le 11 janvier 2016 devant le jury compos´ede Anita Burgun Professeur, Universit´eRen´eDescartes Paris Examinatrice Marie-Dominique Devignes Charg´eede recherches CNRS, LORIA Nancy Examinatrice Michel Dumontier Associate professor, Stanford University USA Rapporteur Christine Froidevaux Professeur, Universit´eParis Sud Rapporteure Fabien Gandon Directeur de recherches, Inria Sophia-Antipolis Rapporteur Anne Siegel Directrice de recherches CNRS, IRISA Rennes Examinatrice Alexandre Termier Professeur, Universit´ede Rennes 1 Examinateur 2 Contents 1 Introduction 9 1.1 Context ......................................... 10 1.2 Challenges . 11 1.3 Summary of the contributions . 14 1.4 Organization of the manuscript . 18 2 Reasoning based on hierarchies 21 2.1 Principle......................................... 21 2.1.1 RDF for describing data . 21 2.1.2 RDFS for describing types . 24 2.1.3 RDFS entailments . 26 2.1.4 Typical uses of RDFS entailments in life science . 26 2.1.5 Synthesis . 30 2.2 Case study: integrating diseases and pathways . 31 2.2.1 Context . 31 2.2.2 Objective . 32 2.2.3 Linking pathways and diseases using GO, KO and SNOMED-CT . 32 2.2.4 Querying associated diseases and pathways . 33 2.3 Methodology: Web services composition . 39 2.3.1 Context . 39 2.3.2 Objective . 40 2.3.3 Semantic compatibility of services parameters . 40 2.3.4 Algorithm for pairing services parameters . 40 2.4 Application: ontology-based query expansion with GO2PUB . 43 2.4.1 Context . 43 2.4.2 Objective .
    [Show full text]
  • PREDICTD: Parallel Epigenomics Data Imputation with Cloud-Based Tensor Decomposition
    bioRxiv preprint doi: https://doi.org/10.1101/123927; this version posted April 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. PREDICTD: PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition Timothy J. Durham Maxwell W. Libbrecht Department of Genome Sciences Department of Genome Sciences University of Washington University of Washington J. Jeffry Howbert Jeff Bilmes Department of Genome Sciences Department of Electrical Engineering University of Washington University of Washington William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington April 4, 2017 Abstract The Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Project have produced thousands of data sets mapping the epigenome in hundreds of cell types. How- ever, the number of cell types remains too great to comprehensively map given current time and financial constraints. We present a method, PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD), to address this issue by computationally im- puting missing experiments in collections of epigenomics experiments. PREDICTD leverages an intuitive and natural model called \tensor decomposition" to impute many experiments si- multaneously. Compared with the current state-of-the-art method, ChromImpute, PREDICTD produces lower overall mean squared error, and combining methods yields further improvement. We show that PREDICTD data can be used to investigate enhancer biology at non-coding human accelerated regions. PREDICTD provides reference imputed data sets and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, two technologies increasingly applicable in bioinformatics.
    [Show full text]
  • Manolis Kellis Piotr Indyk
    6.095 / 6.895 Computational Biology: Genomes, Networks, Evolution Manolis Kellis Rapid database search Courtsey of CCRNP, The National Cancer Institute. Piotr Indyk Protein interaction network Courtesy of GTL Center for Molecular and Cellular Systems. Genome duplication Courtesy of Talking Glossary of Genetics. Administrivia • Course information – Lecturers: Manolis Kellis and Piotr Indyk • Grading: Part. Problem sets 50% Final Project 25% Midterm 20% 5% • 5 problem sets: – Each problem set: covers 4 lectures, contains 4 problems. – Algorithmic problems and programming assignments – Graduate version includes 5th problem on current research •Exams – In-class midterm, no final exam • Collaboration policy – Collaboration allowed, but you must: • Work independently on each problem before discussing it • Write solutions on your own • Acknowledge sources and collaborators. No outsourcing. Goals for the term • Introduction to computational biology – Fundamental problems in computational biology – Algorithmic/machine learning techniques for data analysis – Research directions for active participation in the field • Ability to tackle research – Problem set questions: algorithmic rigorous thinking – Programming assignments: hands-on experience w/ real datasets • Final project: – Research initiative to propose an innovative project – Ability to carry out project’s goals, produce deliverables – Write-up goals, approach, and findings in conference format – Present your project to your peers in conference setting Course outline • Organization – Duality:
    [Show full text]
  • Prof. Manolis Kellis April 15, 2008
    Chromosomes inside the cell Introduction to Algorithms 6.046J/18.401J • Eukaryote cell LECTURE 18 • Prokaryote Computational Biology cell • Bio intro: Regulatory Motifs • Combinatorial motif discovery - Median string finding • Probabilistic motif discovery - Expectation maximization • Comparative genomics Prof. Manolis Kellis April 15, 2008 DNA packaging DNA: The double helix • Why packaging • The most noble molecule of our time – DNA is very long – Cell is very small • Compression – Chromosome is 50,000 times shorter than extended DNA • Using the DNA – Before a piece of DNA is used for anything, this compact structure must open locally ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA ATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC ATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC AATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC GCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT GCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT TTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG TTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA
    [Show full text]
  • ENCODE Consortium Meeting
    ENCODE Consortium Meeting June 17-19, 2008 Hilton Washington DC/Rockville Executive Meeting Center Rockville, Maryland PARTICIPANTS Bradley Bernstein Piero Carninci, Ph.D. Molecular Pathology Unit Leader Massachusetts General Hospital Functional Genomics Technology Team and 149 13th Street Omics Resource Development Unit Charlestown, MA 02129 Deputy Project Director (617) 726-6906 LSA Technology Development Group (617) 726-5684 Fax Omics Science Center [email protected] RIKEN Yokohama Institute 1-7-22 Suehiro-cho, Tsurumi-ku Ewan Birney Yokohama 230-0045 Joint Team Leader Japan Panda Group Nucleotides +81-(0)901-709-2277 Panda Coordination and Outreach [email protected] Panda Metabolism European Molecular Biology Laboratory Philip Cayting European Bioinformatics Institute Gerstein Laboratory Hinxton Outstation Department of Molecular Biophysics and Wellcome Trust Genome Campus Biochemistry Hinxton, Cambridge CB10 1SD Yale University United Kingdom P.O. Box 208114 +44-(0)1223-494 444, ext. 4420 New Haven, CT 06520-8114 +44-(0)1223-494 494 Fax (203) 432-6337 [email protected] [email protected] Michael Brent, Ph.D. Howard Y. Chang, M.D., Ph.D. Professor Assistant Professor Center for Genome Sciences Stanford University Washington University Center for Clinical Sciences Research, Campus Box 8510 Room 2155C 4444 Forest Park 269 Campus Drive Saint Louis, MO 63108 Stanford, CA 94305 (314) 286-0210 (650) 736-0306 [email protected] [email protected] James Bentley Brown Mike Cherry, Ph.D. Graduate Student Researcher Associate Professor Graduate Program in Applied Science and Department of Genetics Technology Stanford University Bickel Group 300 Pasteur Drive University of California, Berkeley Stanford, CA 94305-5120 Room 2 (650) 723-7541 1246 Hearst Avenue [email protected] Berkeley, CA 94702 (510) 703-4706 [email protected] Francis S.
    [Show full text]