Approximate Algorithms for Regulatory Motif Discovery in DNA

Total Page:16

File Type:pdf, Size:1020Kb

Approximate Algorithms for Regulatory Motif Discovery in DNA Western Michigan University ScholarWorks at WMU Dissertations Graduate College 6-2019 Approximate Algorithms for Regulatory Motif Discovery in DNA Hasnaa Imad Al-Shaikhli Western Michigan University, [email protected] Follow this and additional works at: https://scholarworks.wmich.edu/dissertations Part of the Computer Sciences Commons Recommended Citation Al-Shaikhli, Hasnaa Imad, "Approximate Algorithms for Regulatory Motif Discovery in DNA" (2019). Dissertations. 3454. https://scholarworks.wmich.edu/dissertations/3454 This Dissertation-Open Access is brought to you for free and open access by the Graduate College at ScholarWorks at WMU. It has been accepted for inclusion in Dissertations by an authorized administrator of ScholarWorks at WMU. For more information, please contact [email protected]. APPROXIMATE ALGORITHMS FOR REGULATORY MOTIF DISCOVERY IN DNA by Hasnaa Imad Al-Shaikhli A dissertation submitted to the Graduate College in partial fulfillment of the requirements for the degree of Doctor of Philosophy Computer Science Western Michigan University June 2019 Doctoral Committee: Dr. Elise de Doncker, Chair Dr. Alvis Fong Dr. Todd Barkman Dr. Nancy Deng Copyright by Hasnaa Imad Al-Shaikhli 2019 ACKNOWLEDGEMENTS I would like to express my deepest appreciation to my committee Chair, Professor Dr. Elise de Doncker. Without her guidance and persistent help, this dissertation would not have been possible. I learned from her how to define a research problem, find a solution to it, and finally publish the results. Besides my advisor, I sincerely thank the other members of my dissertation committee members, Dr. Alvis Fong, Dr. Todd Barkman, and Dr. Nancy Deng for their support and advice on problems I encountered during my research. In addition, my gratitude goes to the Department of Computer Science Chair, Dr. Steven Carr, and staff for the last minute favors. I also thank my sponsor, the Higher Committee for Education Development in Iraq (HCED), for their financial support, without I which would not have been able to study at Western Michigan University. Last but not least, I want to express my deepest gratitude to my family and friends. This dissertation would not have been possible without their warm love, continued patience and endless support. Hasnaa Imad Al-Shaikhli ii APPROXIMATE ALGORITHMS FOR REGULATORY MOTIF DISCOVERY IN DNA Hasnaa Imad Al-Shaikhli, Ph.D. Western Michigan University, 2019 Motif discovery is the problem of finding common substrings within a set of biological strings. Therefore it can be applied to finding Transcription Factor Binding Sites (TFBS) that have common patterns (motifs). A transcription factor molecule can bind to multiple binding sites in the promoter region of different genes to make these genes co-regulating. The Planted (l, d) Motif Problem (PMP) is a classic version of motif discovery where l is the motif length and d represents the maximum allowed mutation distance. The quorum Planted (l, d, q) Motif Problem (qPMP) is a version of PMP where the motif of length l occurs in at least q percent of the sequences with up to d mismatches. In this thesis we develop the Strong Motif Finder (SMF) and quorum Strong Motif Finder (qSMF) algorithms and evaluate their performance. The Strong Motif Finder (SMF) returns a list of its highest ranked (strongest) motifs. The performance of SMF is compared with the APMotif and MEME algorithms with respect to execution time and prediction accuracy. Several performance metrics are used at both the nucleotide and the site level. The algorithms are tested on simulated datasets. The time comparisons show that SMF is faster than the APMotif and the MEME (ANR) and similar in speed to the MEME (ZOOPS). The MEME algorithm with choice OOPS is the fastest but is not practical if no prior knowledge is available. The prediction accuracy results reveal that the SMF outperforms the APMotif, and performs at the level of the best prediction accuracy of the MEME (with OOPS choice), notwithstanding that the SMF is not given a-priori information. In addition, the SMF is tested on real DNA datasets of orthologous regularity regions from multiple species, without using their related phylogenetic tree. The experiments indicate that the SMF results agree with published motifs. The quorum Strong Motif Finder (qSMF) returns a list of highest ranked (strongest) motifs occurring in at least q percent of the data sequences. The algorithm is tested on ChIP- Seq (large) data that was sampled using the SamSelect algorithm. In comparison with the FMotif algorithm, the experimental results show that qSMF is faster and returns predicted motifs similar to results in the literature and to motifs discovered by the ENCODE project tool which uses the established motif finding algorithms of AlignACE, MEME, MDscan, Trawler, and Weeder. In order to determine the strength or the significance of the predicted motifs, a scoring function, the Motif Strength Score (MSS), is proposed for ranking the discovered motifs in both algorithms. In future work, this score can be combined with other statistical scores, such as the complexity score, P-value and information content, to better determine the motif significance. TABLE OF CONTENTS ACKNOWLEDGEMENTS . ii LIST OF TABLES . vii LIST OF FIGURES . viii LIST OF ABBREVIATIONS . ix 1. INTRODUCTION . 1 1.1. Biological Background . 1 1.1.1. What is the Basis of Life? . 1 1.1.2. What is a Genome? . 2 1.1.3. What are Genes? . 2 1.1.4. What is DNA? . 3 1.1.5. DNA Discovery Story . 3 1.1.6. Central Dogma . 3 1.1.7. What is Gene Expression? . 4 1.1.8. What are Motifs? . 5 1.1.9. What are Mutations? . 6 1.2. Motif Discovery Problem Description . 7 1.3. Motif Discovery Problem Versions . 7 1.4. Motif Search Computational Algorithms Categories . 9 1.5. Importance of the Motif Finding Problem . 11 1.6. Summary . 12 iii Table of Contents—Continued 2. LITERATURE REVIEW . 13 2.1. Introduction . 13 2.2. Motif Finding Problem Complexity . 14 2.3. Mostly Used Scoring Functions . 15 2.3.1. Information Content (IC) . 15 2.3.2. Complexity Score (CS) . 16 2.3.3. P-Value . 16 2.3.4. Z-Score . 17 2.4. Algorithms for Solving Planted Motif Problem . 17 2.4.1. MEME . 17 2.4.2. APMotif . 20 2.4.3. YMF . 20 2.4.4. MFMD . 21 2.4.5. Weeder . 22 2.4.6. Trawler . 24 2.4.7. MDscan . 24 2.4.8. FMotif . 25 2.4.9. AlignACE . 25 2.5. Algorithms for Large Datasets . 26 2.6. Useful Online Tools . 28 2.7. Useful Databases . 30 2.8. Ensemble Algorithms . 31 2.9. Prediction Accuracy Evaluation . 32 2.10. Computational Methods: Limitations and Challenges . 35 2.11. Summary . 37 iv Table of Contents—Continued 3. PROPOSED ALGORITHMS . 38 3.1. SMF Algorithm . 38 3.1.1. Planted (l, d) Motif Problem . 38 3.1.2. Proposed Algorithm . 39 3.1.3. Algorithm Input and Output . 39 3.1.4. Algorithm Description . 40 3.1.5. Proposed Scoring Function . 42 3.1.6. Algorithm Time Complexity . 44 3.2. qSMF Algorithm . 46 3.2.1. Quorum Planted (l, d, q) Motif Problem . 46 3.2.2. Proposed Algorithm . 47 3.2.3. Algorithm Input and Output . 47 3.2.4. Algorithm Description . 48 3.2.5. Scoring Functions . 51 3.2.6. Algorithm Time Complexity . 53 4. EXPERIMENTAL RESULTS . 56 4.1. Experimental Results of SMF . 56 4.1.1. Experimental Results on Simulated Data . 56 4.1.2. Experimental Results on Real Data . 59 4.2. Experimental Results of qSMF . 65 4.2.1. Experimental Results on Real Data . 65 5. CONCLUSIONS AND FUTURE WORK . 68 5.1. Conclusions . 68 5.2. Future Work . 72 v Table of Contents—Continued BIBLIOGRAPHY . 74 APPENDICES . 96 A. Algorithms for Motif Discovery Problem . 96 B. Planted (l, d) Motif Finding Problem Solvability Analysis . 97 C. Permissions . 100 vi LIST OF TABLES 2.1 Useful Databases . 31 3.1 Complexity Score and Vectors for Strings of Length l = 10 Derived from a Mono-Nucleotide String for Different Values of N#pos and Ntype . 52 4.1 Top Motifs Returned by SMF When Applied on Real Datasets . 64 4.2 Results of Eight Homo Sapiens Real Datasets Selected from the ENCODE TF ChIP-Seq Dataset . 67 1 List of Current Algorithms for Motif Finding Problem . 96 2 Last Solvable Problem Instances Depending on Et Values . 98 3 Expected Numbers of Motifs t = 20, n = 600, and 8 ≤ l ≤ 30 with the Critical Instances . 99 vii LIST OF FIGURES 1.1 Central Dogma Process Illustration . 4 1.2 Motif Location with its Corresponding Gene in the Promoter Region of a DNA Sequence . 5 2.1 Visual Description of TP, FN, FP, and TN . 32.
Recommended publications
  • A Call for Public Archives for Biological Image Data Public Data Archives Are the Backbone of Modern Biological Research
    comment A call for public archives for biological image data Public data archives are the backbone of modern biological research. Biomolecular archives are well established, but bioimaging resources lag behind them. The technology required for imaging archives is now available, thus enabling the creation of the frst public bioimage datasets. We present the rationale for the construction of bioimage archives and their associated databases to underpin the next revolution in bioinformatics discovery. Jan Ellenberg, Jason R. Swedlow, Mary Barlow, Charles E. Cook, Ugis Sarkans, Ardan Patwardhan, Alvis Brazma and Ewan Birney ince the mid-1970s it has been possible image data to encourage both reuse and in automation and throughput allow this to analyze the molecular composition extraction of knowledge. to be done in a systematic manner. We Sof living organisms, from the individual Despite the long history of biological expect that, for example, super-resolution units (nucleotides, amino acids, and imaging, the tools and resources for imaging will be applied across biological metabolites) to the macromolecules they collecting, managing, and sharing image and biomedical imaging; examples in are part of (DNA, RNA, and proteins), data are immature compared with those multidimensional imaging8 and high- as well as the interactions among available for sequence and 3D structure content screening9 have recently been these components1–3. The cost of such data. In the following sections we reported. It is very likely that imaging data measurements has decreased remarkably outline the current barriers to progress, volume and complexity will continue to over time, and technological development the technological developments that grow. Second, the emergence of powerful has widened their scope from investigations could provide solutions, and how those computer-based approaches for image of gene and transcript sequences (genomics/ infrastructure solutions can meet the interpretation, quantification, and modeling transcriptomics) to proteins (proteomics) outlined need.
    [Show full text]
  • 3 on the Nature of Biological Data
    ON THE NATURE OF BIOLOGICAL DATA 35 3 On the Nature of Biological Data Twenty-first century biology will be a data-intensive enterprise. Laboratory data will continue to underpin biology’s tradition of being empirical and descriptive. In addition, they will provide confirming or disconfirming evidence for the various theories and models of biological phenomena that researchers build. Also, because 21st century biology will be a collective effort, it is critical that data be widely shareable and interoperable among diverse laboratories and computer systems. This chapter describes the nature of biological data and the requirements that scientists place on data so that they are useful. 3.1 DATA HETEROGENEITY An immense challenge—one of the most central facing 21st century biology—is that of managing the variety and complexity of data types, the hierarchy of biology, and the inevitable need to acquire data by a wide variety of modalities. Biological data come in many types. For instance, biological data may consist of the following:1 • Sequences. Sequence data, such as those associated with the DNA of various species, have grown enormously with the development of automated sequencing technology. In addition to the human genome, a variety of other genomes have been collected, covering organisms including bacteria, yeast, chicken, fruit flies, and mice.2 Other projects seek to characterize the genomes of all of the organisms living in a given ecosystem even without knowing all of them beforehand.3 Sequence data generally 1This discussion of data types draws heavily on H.V. Jagadish and F. Olken, eds., Data Management for the Biosciences, Report of the NSF/NLM Workshop of Data Management for Molecular and Cell Biology, February 2-3, 2003, Available at http:// www.eecs.umich.edu/~jag/wdmbio/wdmb_rpt.pdf.
    [Show full text]
  • Visualization of Biomedical Data
    Visualization of Biomedical Data Corresponding author: Seán I. O’Donoghue; email: [email protected] • Data61, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Eveleigh NSW 2015, Australia • Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney NSW 2010, Australia • School of Biotechnology and Biomolecular Sciences, UNSW, Kensington NSW 2033, Australia Benedetta Frida Baldi; email: [email protected] • Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney NSW 2010, Australia Susan J Clark; email: [email protected] • Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney NSW 2010, Australia Aaron E. Darling; email: [email protected] • The ithree institute, University of Technology Sydney, Ultimo NSW 2007, Australia James M. Hogan; email: [email protected] • School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane QLD, 4000, Australia Sandeep Kaur; email: [email protected] • School of Computer Science and Engineering, UNSW, Kensington NSW 2033, Australia Lena Maier-Hein; email: [email protected] • Div. Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany Davis J. McCarthy; email: [email protected] • European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, CB10 1SD, Hinxton, Cambridge, UK • St. Vincent’s Institute of Medical Research, Fitzroy VIC 3065, Australia William
    [Show full text]
  • When Biology Gets Personal: Hidden Challenges of Privacy and Ethics in Biological Big Data
    Pacific Symposium on Biocomputing 2019 When Biology Gets Personal: Hidden Challenges of Privacy and Ethics in Biological Big Data Gamze Gürsoy* Computational Biology and Bioinformatics Program, Molecular Biophysics & Biochemistry, Yale University, New Haven, CT, 06511, USA Email: [email protected] Arif Harmanci Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA Email: [email protected] Haixu Tang† School of Informatics, Computing and Engineering, Indiana University Bloomington, Bloomington, IN, 47405, USA Email: [email protected] Erman Ayday Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, 44106, USA Email: [email protected] Steven E. Brenner# University of California Berkeley, CA, 94720-3012, USA Email: [email protected] High-throughput technologies for biological data acquisition are advancing at an increasing pace. Most prominently, the decreasing cost of DNA sequencing has led to an exponential growth of sequence information, including individual human genomes. This session of the 2019 Pacific Symposium on Biocomputing presents the distinctive privacy and ethical challenges related to the generation, storage, processing, study, and sharing of individuals’ biological data generated by multitude of technologies including but not limited to genomics, proteomics, metagenomics, bioimaging, biosensors, and personal health trackers. The mission is to bring together computational biologists, experimental biologists, computer scientists, ethicists, and policy and lawmakers to share ideas, discuss the challenges related to biological data and privacy. Keywords: biological data privacy, genomics, genetic testing * This work is partially supported by NIH grant U01EB023686. † This work is partially supported by NIH grant U01EB023685 and NSF grant CNS-1408874.
    [Show full text]
  • From Big Data Analysis to Personalized Medicine for All: Challenges and Opportunities Akram Alyass1, Michelle Turcotte1 and David Meyre1,2*
    Alyass et al. BMC Medical Genomics (2015) 8:33 DOI 10.1186/s12920-015-0108-y REVIEW Open Access From big data analysis to personalized medicine for all: challenges and opportunities Akram Alyass1, Michelle Turcotte1 and David Meyre1,2* Abstract Recent advances in high-throughput technologies have led to the emergence of systems biology as a holistic science to achieve more precise modeling of complex diseases. Many predict the emergence of personalized medicine in the near future. We are, however, moving from two-tiered health systems to a two-tiered personalized medicine. Omics facilities are restricted to affluent regions, and personalized medicine is likely to widen the growing gap in health systems between high and low-income countries. This is mirrored by an increasing lag between our ability to generate and analyze big data. Several bottlenecks slow-down the transition from conventional to personalized medicine: generation of cost-effective high-throughput data; hybrid education and multidisciplinary teams; data storage and processing; data integration and interpretation; and individual and global economic relevance. This review provides an update of important developments in the analysis of big data and forward strategies to accelerate the global transition to personalized medicine. Keywords: Big data, Omics, Personalized medicine, High-throughput technologies, Cloud computing, Integrative methods, High-dimensionality Introduction of biology and medicine [5, 6]. The use of deterministic Access to large omics (genomics, transcriptomics, proteo- networks for normal and abnormal phenotypes are mics, epigenomic, metagenomics, metabolomics, nutrio- thought to allow for the proactive maintenance of wellness mics, etc.) data has revolutionized biology and has led to specific to the individual, that is predictive, preventive, the emergence of systems biology for a better understand- personalized, and participatory medicine (P4, or more ing of biological mechanisms.
    [Show full text]
  • Personalized Medicine in Big Data
    INTERNATIONAL JOURNAL FOR RESEARCH IN EMERGING SCIENCE AND TECHNOLOGY E-ISSN: 2349-7610 Personalized Medicine in Big Data Mr.A.Kesavan, M.C.A., M.Phil., Dr.C.Jayakumari, M.C.A., Ph.D., Mr.A.Kesavan, M.C.A, M.Phil., Research Scholar, Bharathiar University, India. [email protected] Dr.C.Jayakumari, M.C.A, Ph.D., Associate Professor, Middle East College, Oman, [email protected] ABSTRACT The rapid and ongoing digitalization of society leads to an exponential growth of both structured and unstructured data, so-called Big Data. This wealth of information opens the door to the development of more advanced personalized medicine technologies. The analysis of log data from such applications and wearables provide the opportunity to personalize and to improve their strength and long-term use. Personalized medicine is the use of a patient’s genomic information to form individualized prevention, diagnosis, and treatment plans to combat disease. Keywords: Big Data, Personalized Medicine, Omics Data, High-throughput technologies, Integrative methods, High- dimensionality. 1. INTRODUCTION Big Data refers to the practice of combining huge volumes of The overarching goal of “Personalized Medicine” is to create a diversely sourced information and analyzing them, using more framework that leverage patient EHRs and Omics data to sophisticated algorithms to inform decisions. Big Data relies facilitate clinical decision-making that is predictive, not only on the increasing ability of technology to support the personalized, preventive and participatory [4]. The emphasis is collection and storage of large amounts of data, but also on its to customize medical treatment based on the individual ability to analyses, understand and take advantage of the full characteristics of patients and nature of their disease, value of data.
    [Show full text]
  • Introduction to Bioinformatics
    INTRODUCTION TO BIOINFORMATICS > Please take the initial BIOINF525 questionnaire: < http://tinyurl.com/bioinf525-questions Barry Grant University of Michigan www.thegrantlab.org BIOINF 525 http://bioboot.github.io/bioinf525_w16/ 12-Jan-2016 Barry Grant, Ph.D. [email protected] Ryan Mills, Ph.D. [email protected] Hongyang Li (GSI) [email protected] COURSE LOGISTICS Lectures: Tuesdays 2:30-4:00 PM Rm. 2062 Palmer Commons Labs: Session I: Thursdays 2:30 - 4:00 PM Session II: Fridays 10:30 - 12:00 PM Rm. 2036 Palmer Commons Website: http://tinyurl.com/bioinf525-w16 Lecture, lab and background reading material plus homework and course announcements MODULE OVERVIEW Objective: Provide an introduction to the practice of bioinformatics as well as a practical guide to using common bioinformatics databases and algorithms 1.1. ‣ Introduction to Bioinformatics 1.2. ‣ Sequence Alignment and Database Searching 1.3 ‣ Structural Bioinformatics 1.4 ‣ Genome Informatics: High Throughput Sequencing Applications and Analytical Methods TODAYS MENU Overview of bioinformatics • The what, why and how of bioinformatics? • Major bioinformatics research areas. • Skepticism and common problems with bioinformatics. Bioinformatics databases and associated tools • Primary, secondary and composite databases. • Nucleotide sequence databases (GenBank & RefSeq). • Protein sequence database (UniProt). • Composite databases (PFAM & OMIM). Database usage vignette • Searching with ENTREZ and BLAST. • Reference slides and handout on major databases. HOMEWORK Complete the initial
    [Show full text]
  • Introduction to Computational Biology & Bioinformatics – Part I
    Introduction to Computational Biology & Bioinformatics – Part I ENGR 1166 Biomedical Engineering Computational biology A marriage between biology and computer Comput. biology vs. bioinformatics Biological data sets are large: 1 Comput. biology vs. bioinformatics Biological data sets are large: ⇒ We need to manage “big data” (bioinformatics) Comput. biology vs. bioinformatics Biological data sets are large: ⇒ We need to manage “big data” (bioinformatics) Biological systems (physical) are complex: Comput. biology vs. bioinformatics Biological data sets are large: ⇒ We need to manage “big data” (bioinformatics) Biological systems (physical) are complex: ⇒ We need to perform “big compute” (computational biology) 2 What is big data? source: https://vimeo.com/90017983 What is the goal? To develop computer algorithms and theory to interpret large biological data and to understand complex biological systems What is the goal? To develop computer algorithms and theory to interpret large biological data and to understand complex biological systems An interdisciplinary enterprise: o Biology o Chemistry o Physics o Statistics /applied math o Computer Science o Engineering 3 Why do we need this expertise? Rapid explosion in our ability to acquire biological data o How can we find robust patterns in these data? Why do we need this expertise? Rapid explosion in our ability to acquire biological data o How can we find robust patterns in these data? Recognition that biological phenomena are enormously complex and biological problems benefit from interdisciplinary approaches o How can we understand, predict, and manipulate these systems? Applications Origins of Humanity: DNA sequencing is used to trace the most recent common maternal ancestor of all living humans 4 Applications Personalized medicine: Data about the subject’s own genome and mathematical models are used to tailor therapy to each individual Key concepts The genome is the complete genetic material of an organism.
    [Show full text]
  • Application of Data Mining in Bioinformatics
    Khalid Raza / Indian Journal of Computer Science and Engineering Vol 1 No 2, 114-118 APPLICATION OF DATA MINING IN BIOINFORMATICS KHALID RAZA Centre for Theoretical Physics, Jamia Millia Islamia, New Delhi-110025, India Abstract This article highlights some of the basic concepts of bioinformatics and data mining. The major research areas of bioinformatics are highlighted. The application of data mining in the domain of bioinformatics is explained. It also highlights some of the current challenges and opportunities of data mining in bioinformatics. Keywords: Data Mining, Bioinformatics, Protein Sequences Analysis, Bioinformatics Tools. 1. Introduction In recent years, rapid developments in genomics and proteomics have generated a large amount of biological data. Drawing conclusions from these data requires sophisticated computational analyses. Bioinformatics, or computational biology, is the interdisciplinary science of interpreting biological data using information technology and computer science. The importance of this new field of inquiry will grow as we continue to generate and integrate large quantities of genomic, proteomic, and other data. A particular active area of research in bioinformatics is the application and development of data mining techniques to solve biological problems. Analyzing large biological data sets requires making sense of the data by inferring structure or generalizations from the data. Examples of this type of analysis include protein structure prediction, gene classification, cancer classification based on microarray data, clustering of gene expression data, statistical modeling of protein-protein interaction, etc. Therefore, we see a great potential to increase the interaction between data mining and bioinformatics. 2. Bioinformatics The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of informatic processes in biotic systems.
    [Show full text]
  • Deep Learning in Mining Biological Data
    Cognitive Computation (2021) 13:1–33 https://doi.org/10.1007/s12559-020-09773-x Deep Learning in Mining Biological Data Mufti Mahmud1,5 · M. Shamim Kaiser2 · T. Martin McGinnity1,3 · Amir Hussain4 Received: 1 May 2020 / Accepted: 28 September 2020 / Published online: 5 January 2021 © The Author(s) 2020 Abstract Recent technological advancements in data acquisition tools allowed life scientists to acquire multimodal data from difer- ent biological application domains. Categorized in three broad types (i.e. images, signals, and sequences), these data are huge in amount and complex in nature. Mining such enormous amount of data for pattern recognition is a big challenge and requires sophisticated data-intensive machine learning techniques. Artifcial neural network-based learning systems are well known for their pattern recognition capabilities, and lately their deep architectures—known as deep learning (DL)—have been successfully applied to solve many complex pattern recognition problems. To investigate how DL—especially its dif- ferent architectures—has contributed and been utilized in the mining of biological data pertaining to those three types, a meta-analysis has been performed and the resulting resources have been critically analysed. Focusing on the use of DL to analyse patterns in data from diverse biological domains, this work investigates diferent DL architectures’ applications to these data. This is followed by an exploration of available open access data sources pertaining to the three data types along with popular open-source DL tools applicable to these data. Also, comparative investigations of these tools from qualita- tive, quantitative, and benchmarking perspectives are provided. Finally, some open research challenges in using DL to mine biological data are outlined and a number of possible future perspectives are put forward.
    [Show full text]
  • Machine Learning and Complex Biological Data Chunming Xu1,2* and Scott A
    Xu and Jackson Genome Biology (2019) 20:76 https://doi.org/10.1186/s13059-019-1689-0 EDITORIAL Open Access Machine learning and complex biological data Chunming Xu1,2* and Scott A. Jackson1* Abstract this reason, the integrated analysis of different data types has been attracting more attention. Integration Machine learning has demonstrated potential in of different data types should, in theory, lead to a analyzing large, complex biological data. In practice, more holistic understanding of complex biological however, biological information is required in addition phenomena, but this is difficult due to the challenges to machine learning for successful application. of heterogeneous data and the implicitly noisy nature of biological data [1]. Another challenge is data di- The revolution of biological techniques and mensionality: omics data are high resolution, or stated another way, highly dimensional. In biological studies, demands for new data mining methods the number of samples is often limited and much In order to more completely understand complex bio- fewer than the number of variables due to costs or logical phenomena, such as many human diseases or available sources (e.g., cancer samples, plant/animal quantitative traits in animals/plants, massive amounts and replicates); this is also referred to as the ‘curse of multiple types of ‘big’ data are generated from compli- dimensionality’, which may lead to data sparsity, mul- cated studies. In the not so distant past, data generation ticollinearity, multiple testing, and overfitting [2]. was the bottleneck, now it is data mining, or extracting useful biological insights from large, complicated datasets. In the past decade, technological advances in data gener- Machine learning versus statistics ation have advanced studies of complex biological phe- The boundary between machine learning and statistics is nomena.
    [Show full text]
  • From Big Data Analysis to Personalized Medicine for All: Challenges and Opportunities Akram Alyass1, Michelle Turcotte1 and David Meyre1,2*
    Alyass et al. BMC Medical Genomics (2015) 8:33 DOI 10.1186/s12920-015-0108-y REVIEW Open Access From big data analysis to personalized medicine for all: challenges and opportunities Akram Alyass1, Michelle Turcotte1 and David Meyre1,2* Abstract Recent advances in high-throughput technologies have led to the emergence of systems biology as a holistic science to achieve more precise modeling of complex diseases. Many predict the emergence of personalized medicine in the near future. We are, however, moving from two-tiered health systems to a two-tiered personalized medicine. Omics facilities are restricted to affluent regions, and personalized medicine is likely to widen the growing gap in health systems between high and low-income countries. This is mirrored by an increasing lag between our ability to generate and analyze big data. Several bottlenecks slow-down the transition from conventional to personalized medicine: generation of cost-effective high-throughput data; hybrid education and multidisciplinary teams; data storage and processing; data integration and interpretation; and individual and global economic relevance. This review provides an update of important developments in the analysis of big data and forward strategies to accelerate the global transition to personalized medicine. Keywords: Big data, Omics, Personalized medicine, High-throughput technologies, Cloud computing, Integrative methods, High-dimensionality Introduction of biology and medicine [5, 6]. The use of deterministic Access to large omics (genomics, transcriptomics, proteo- networks for normal and abnormal phenotypes are mics, epigenomic, metagenomics, metabolomics, nutrio- thought to allow for the proactive maintenance of wellness mics, etc.) data has revolutionized biology and has led to specific to the individual, that is predictive, preventive, the emergence of systems biology for a better understand- personalized, and participatory medicine (P4, or more ing of biological mechanisms.
    [Show full text]