Motif Selection Using Simulated Annealing Algorithm with Application to Identify Regulatory Elements


A thesis presented to
the faculty of
the Russ College of Engineering and Technology of Ohio University

In partial fulfillment
of the requirements for the degree
Master of Science

Liang Chen
August 2018
© 2018 Liang Chen. All Rights Reserved.

This thesis titled
Motif Selection Using Simulated Annealing Algorithm with Application to Identify Regulatory Elements
by
LIANG CHEN
has been approved for
the Department of Electrical Engineering and Computer Science
and the Russ College of Engineering and Technology by

Lonnie Welch
Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean, Russ College of Engineering and Technology

Abstract

CHEN, LIANG, M.S., August 2018, Computer Science Master Program
Motif Selection Using Simulated Annealing Algorithm with Application to Identify Regulatory Elements (106 pp.)
Director of Thesis: Lonnie Welch

Modern research on gene regulation and disorder-related pathways uses tools such as microarrays and RNA-Seq to analyze changes in the expression levels of large sets of genes. In silico motif discovery performed on gene expression profile data generates a large set of candidate motifs (usually hundreds or thousands). Picking a set of biologically meaningful motifs from this candidate set is a challenging biological and computational problem. As a computational problem, it can be modeled as the motif selection problem (MSP). Solutions to the motif selection problem will directly help biologists find transcription factors (TFs) that are strongly related to specific pathways and gain insight into the relationships between genes.
This study implemented an algorithm based on the simulated annealing (SA) optimization algorithm for the motif selection problem and investigated the properties of the implementation using real-world datasets (ENCODE project data). The evaluation on the ENCODE datasets indicates that simulated annealing is well suited to solving the motif selection problem. The performance of the algorithm can be tuned through several parameters to meet specific requirements. Future improvements may be achieved by extending the algorithm model (adaptive simulated annealing) and by applying a high-dimensional cost function.

Dedication

To my family, and my parents.

Acknowledgments

First, I would like to thank my advisor, Dr. Lonnie Welch, for his mentoring and support of my daily study and research project. I would also like to thank my graduate committee members, Dr. Frank Drews and Dr. Razvan Bunescu, for their support, help, comments, and suggestions on my research. I also want to thank Dr. Karen Coschigano for serving as college representative for my thesis defense. Special thanks to graduate students Rami Al-Ouran, Yi-Chao Li, and Yating Liu in Dr. Welch’s lab; graduate student alumni Jens Schmidt, Robert Schmidt, and Krystine Garcia in Dr. Welch’s lab; and graduate students Bibo Shi and Zhe-Wei Wang in Dr. Jundong Liu’s lab.

Table of Contents

Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
List of Acronyms
1 Introduction
  1.1 Background
  1.2 Biological Motivation
  1.3 Foundations of Computational Modeling and Optimization Algorithm
  1.4 Problem Statement
  1.5 Contributions
2 Methods
  2.1 Motif Selection Problem
  2.2 Set Cover Problem (SCP)
  2.3 Mapping Motif Selection Problem to Set Cover Problem
  2.4 SA Relaxed Version
  2.5 Simulated Annealing Algorithm
  2.6 Implementation for Solving MSP
  2.7 Adjustable Parameters of SA Implementation for MSP
3 Evaluation Using ENCODE Data
  3.1 Overview
  3.2 Datasets
  3.3 Parameters
  3.4 Results
  3.5 Analysis on Results
  3.6 Biological Insights of Selected Motifs
4 Conclusion and Future Work
  4.1 Conclusion
  4.2 Future Work
References
Appendix A: Source Code
Appendix B: Supplementary Contents
Appendix C: Disclaimer

List of Tables

2.1 Parameter Settings for Simulated Annealing Algorithm
3.1 Parameter Settings for ENCODE Datasets
B.1 ENCODE TF Group Datasets
B.2 Feature Set Size Result
B.3 Sequence Sensitivity Result
B.4 Motifs selected by SAr85 from BATF group
B.5 Examples of TOMTOM reported alignments
B.6 Motifs selected by SAr85 from PBX3 group
B.7 Examples of TOMTOM reported alignments

List of Figures

1.1 General Pipeline for Motif Selection
2.1 Flowchart for Simulated Annealing
2.2 Temperature Curve for Exponential Cooling
2.3 Class Relationships
3.1 Overview of ENCODE Project
3.2 Boxplot for Feature Set Size
3.3 Line plot for Feature Set Size
3.4 Boxplot for Sequence Sensitivity (sSn)
3.5 Line plot for Sequence Sensitivity (sSn)
3.6 Comprehensive comparison: SA
3.7 Comprehensive comparison: SAr85
3.8 Comprehensive comparison: SAr70
List of Acronyms

ChIP      Chromatin Immunoprecipitation
CPL       Common Public License
DECOD     DECOnvolved Discriminative motif discovery
DME       Discriminating Matrix Enumerator
DNA       DeoxyriboNucleic Acid
DP        Dynamic Programming
ENCODE    Encyclopedia of DNA Elements
FIMO      Find Individual Motif Occurrences
GNU       GNU’s Not Unix
GPL       General Public License
HGP       Human Genome Project
ILP       Integer Linear Programming
LP        Linear Programming
MEME      Multiple EM for Motif Elicitation
MSP       Motif Selection Problem
NCBI      National Center for Biotechnology Information
NGS       Next Generation Sequencing
NP        Non-deterministic Polynomial
PWM       Position Weight Matrix
RILP      Relaxed Integer Linear Programming
RNA       RiboNucleic Acid
SA        Simulated Annealing
SCP       Set Cover Problem
TF        Transcription Factor
TFBS      Transcription Factor Binding Site
TSS       Transcription Start Site
UTR       UnTranslated Region

1 Introduction

This research project focuses on the implementation and evaluation of a simulated annealing optimization algorithm for the motif selection problem, with application to ENCODE datasets.

1.1 Background

Biologists have established that every species of living organism on Earth carries its own genetic code, which stores the information on how to construct the organism and control the metabolic processes essential to its survival, development, and reproduction [1, 2]. To investigate the internal mechanisms of these genetic codes and decode the information they carry, a great deal of work has been done: from the structure and properties of deoxyribonucleic acid (DNA) molecules [3], the amino acid sequences of proteins [4], and classical genetics theories [5], to the modern views of genomes and genes and the various projects and achievements in genomic information such as the Human Genome Project (HGP) [6], the International HapMap Project [7], and the ENCODE project [8].
With continuous effort and international collaboration, many species have had their whole genomes sequenced, including Drosophila melanogaster (fruit fly, model species) [9], Caenorhabditis elegans (worm, model species) [10], Escherichia coli (bacterium, model species) [11], Arabidopsis thaliana (model plant species) [12], Oryza sativa (rice, food crop) [13], and Homo sapiens (human being) [14]. With technical advances and more specific sequencing targets [15, 16], new problems have emerged, such as storing and interpreting these biological datasets. Scientists are no longer satisfied with just obtaining raw genomic information such as DNA and RNA sequences; they are more interested in how these genomic elements interact with each other and with a variable environment. For example, BRAF mutations [17–19] have been widely accepted as an indicator for certain types of cancer such as melanoma [20, 21] and colorectal cancer [22–25]. Another example is the association between EGFR mutations and prostate cancer [26]. With the emergence of genomic testing methods and their practice in clinical medicine (some commercial genomic tests [27, 28] are already available to physicians and patients), the demand for interpreting genomic data and applying the information to improve the medical treatment of patients has increased dramatically.
Interestingly, research on gene interactions is not as straightforward as neuroscience research on acute reactions in living animals (another active topic in basic science, which may reveal the mechanisms and rules by which human beings perform intelligent work such as thinking and learning). Neuroscientists can insert tiny electrodes into neural tissue such as the cerebral cortex or peripheral neural ganglia to record the electrical signals of currently functioning cells (“neurons”) [29–31], and they can use the temporal and strength relations of these signals between different groups of neurons to establish how the groups interact; some of the predicted relationships may be supported by anatomical structures [32]. In contrast to electrophysiology studies, molecular genetic research usually depends on extracting samples from targeted models (animals, plants, or bacteria, with additional treatments or conditions and optional genetic modifications), sequencing the samples to acquire the expression levels of genes and biomarkers, and applying bioinformatics tools to analyze and interpret the results [33, 34]. For instance, bioinformatics tools such as BLAST [35, 36], FASTA [37], and ClustalW [38] are widely used for sequence alignment to compare the similarity between biological sequences. Early