Approximate Algorithms for Regulatory Motif Discovery in DNA

Western Michigan University ScholarWorks at WMU Dissertations Graduate College 6-2019 Approximate Algorithms for Regulatory Motif Discovery in DNA Hasnaa Imad Al-Shaikhli Western Michigan University, [email protected] Follow this and additional works at: https://scholarworks.wmich.edu/dissertations Part of the Computer Sciences Commons Recommended Citation Al-Shaikhli, Hasnaa Imad, "Approximate Algorithms for Regulatory Motif Discovery in DNA" (2019). Dissertations. 3454. https://scholarworks.wmich.edu/dissertations/3454 This Dissertation-Open Access is brought to you for free and open access by the Graduate College at ScholarWorks at WMU. It has been accepted for inclusion in Dissertations by an authorized administrator of ScholarWorks at WMU. For more information, please contact [email protected]. APPROXIMATE ALGORITHMS FOR REGULATORY MOTIF DISCOVERY IN DNA by Hasnaa Imad Al-Shaikhli A dissertation submitted to the Graduate College in partial fulfillment of the requirements for the degree of Doctor of Philosophy Computer Science Western Michigan University June 2019 Doctoral Committee: Dr. Elise de Doncker, Chair Dr. Alvis Fong Dr. Todd Barkman Dr. Nancy Deng Copyright by Hasnaa Imad Al-Shaikhli 2019 ACKNOWLEDGEMENTS I would like to express my deepest appreciation to my committee Chair, Professor Dr. Elise de Doncker. Without her guidance and persistent help, this dissertation would not have been possible. I learned from her how to define a research problem, find a solution to it, and finally publish the results. Besides my advisor, I sincerely thank the other members of my dissertation committee members, Dr. Alvis Fong, Dr. Todd Barkman, and Dr. Nancy Deng for their support and advice on problems I encountered during my research. In addition, my gratitude goes to the Department of Computer Science Chair, Dr. Steven Carr, and staff for the last minute favors. I also thank my sponsor, the Higher Committee for Education Development in Iraq (HCED), for their financial support, without I which would not have been able to study at Western Michigan University. Last but not least, I want to express my deepest gratitude to my family and friends. This dissertation would not have been possible without their warm love, continued patience and endless support. Hasnaa Imad Al-Shaikhli ii APPROXIMATE ALGORITHMS FOR REGULATORY MOTIF DISCOVERY IN DNA Hasnaa Imad Al-Shaikhli, Ph.D. Western Michigan University, 2019 Motif discovery is the problem of finding common substrings within a set of biological strings. Therefore it can be applied to finding Transcription Factor Binding Sites (TFBS) that have common patterns (motifs). A transcription factor molecule can bind to multiple binding sites in the promoter region of different genes to make these genes co-regulating. The Planted (l, d) Motif Problem (PMP) is a classic version of motif discovery where l is the motif length and d represents the maximum allowed mutation distance. The quorum Planted (l, d, q) Motif Problem (qPMP) is a version of PMP where the motif of length l occurs in at least q percent of the sequences with up to d mismatches. In this thesis we develop the Strong Motif Finder (SMF) and quorum Strong Motif Finder (qSMF) algorithms and evaluate their performance. The Strong Motif Finder (SMF) returns a list of its highest ranked (strongest) motifs. The performance of SMF is compared with the APMotif and MEME algorithms with respect to execution time and prediction accuracy. Several performance metrics are used at both the nucleotide and the site level. The algorithms are tested on simulated datasets. The time comparisons show that SMF is faster than the APMotif and the MEME (ANR) and similar in speed to the MEME (ZOOPS). The MEME algorithm with choice OOPS is the fastest but is not practical if no prior knowledge is available. The prediction accuracy results reveal that the SMF outperforms the APMotif, and performs at the level of the best prediction accuracy of the MEME (with OOPS choice), notwithstanding that the SMF is not given a-priori information. In addition, the SMF is tested on real DNA datasets of orthologous regularity regions from multiple species, without using their related phylogenetic tree. The experiments indicate that the SMF results agree with published motifs. The quorum Strong Motif Finder (qSMF) returns a list of highest ranked (strongest) motifs occurring in at least q percent of the data sequences. The algorithm is tested on ChIP- Seq (large) data that was sampled using the SamSelect algorithm. In comparison with the FMotif algorithm, the experimental results show that qSMF is faster and returns predicted motifs similar to results in the literature and to motifs discovered by the ENCODE project tool which uses the established motif finding algorithms of AlignACE, MEME, MDscan, Trawler, and Weeder. In order to determine the strength or the significance of the predicted motifs, a scoring function, the Motif Strength Score (MSS), is proposed for ranking the discovered motifs in both algorithms. In future work, this score can be combined with other statistical scores, such as the complexity score, P-value and information content, to better determine the motif significance. TABLE OF CONTENTS ACKNOWLEDGEMENTS . ii LIST OF TABLES . vii LIST OF FIGURES . viii LIST OF ABBREVIATIONS . ix 1. INTRODUCTION . 1 1.1. Biological Background . 1 1.1.1. What is the Basis of Life? . 1 1.1.2. What is a Genome? . 2 1.1.3. What are Genes? . 2 1.1.4. What is DNA? . 3 1.1.5. DNA Discovery Story . 3 1.1.6. Central Dogma . 3 1.1.7. What is Gene Expression? . 4 1.1.8. What are Motifs? . 5 1.1.9. What are Mutations? . 6 1.2. Motif Discovery Problem Description . 7 1.3. Motif Discovery Problem Versions . 7 1.4. Motif Search Computational Algorithms Categories . 9 1.5. Importance of the Motif Finding Problem . 11 1.6. Summary . 12 iii Table of Contents—Continued 2. LITERATURE REVIEW . 13 2.1. Introduction . 13 2.2. Motif Finding Problem Complexity . 14 2.3. Mostly Used Scoring Functions . 15 2.3.1. Information Content (IC) . 15 2.3.2. Complexity Score (CS) . 16 2.3.3. P-Value . 16 2.3.4. Z-Score . 17 2.4. Algorithms for Solving Planted Motif Problem . 17 2.4.1. MEME . 17 2.4.2. APMotif . 20 2.4.3. YMF . 20 2.4.4. MFMD . 21 2.4.5. Weeder . 22 2.4.6. Trawler . 24 2.4.7. MDscan . 24 2.4.8. FMotif . 25 2.4.9. AlignACE . 25 2.5. Algorithms for Large Datasets . 26 2.6. Useful Online Tools . 28 2.7. Useful Databases . 30 2.8. Ensemble Algorithms . 31 2.9. Prediction Accuracy Evaluation . 32 2.10. Computational Methods: Limitations and Challenges . 35 2.11. Summary . 37 iv Table of Contents—Continued 3. PROPOSED ALGORITHMS . 38 3.1. SMF Algorithm . 38 3.1.1. Planted (l, d) Motif Problem . 38 3.1.2. Proposed Algorithm . 39 3.1.3. Algorithm Input and Output . 39 3.1.4. Algorithm Description . 40 3.1.5. Proposed Scoring Function . 42 3.1.6. Algorithm Time Complexity . 44 3.2. qSMF Algorithm . 46 3.2.1. Quorum Planted (l, d, q) Motif Problem . 46 3.2.2. Proposed Algorithm . 47 3.2.3. Algorithm Input and Output . 47 3.2.4. Algorithm Description . 48 3.2.5. Scoring Functions . 51 3.2.6. Algorithm Time Complexity . 53 4. EXPERIMENTAL RESULTS . 56 4.1. Experimental Results of SMF . 56 4.1.1. Experimental Results on Simulated Data . 56 4.1.2. Experimental Results on Real Data . 59 4.2. Experimental Results of qSMF . 65 4.2.1. Experimental Results on Real Data . 65 5. CONCLUSIONS AND FUTURE WORK . 68 5.1. Conclusions . 68 5.2. Future Work . 72 v Table of Contents—Continued BIBLIOGRAPHY . 74 APPENDICES . 96 A. Algorithms for Motif Discovery Problem . 96 B. Planted (l, d) Motif Finding Problem Solvability Analysis . 97 C. Permissions . 100 vi LIST OF TABLES 2.1 Useful Databases . 31 3.1 Complexity Score and Vectors for Strings of Length l = 10 Derived from a Mono-Nucleotide String for Different Values of N#pos and Ntype . 52 4.1 Top Motifs Returned by SMF When Applied on Real Datasets . 64 4.2 Results of Eight Homo Sapiens Real Datasets Selected from the ENCODE TF ChIP-Seq Dataset . 67 1 List of Current Algorithms for Motif Finding Problem . 96 2 Last Solvable Problem Instances Depending on Et Values . 98 3 Expected Numbers of Motifs t = 20, n = 600, and 8 ≤ l ≤ 30 with the Critical Instances . 99 vii LIST OF FIGURES 1.1 Central Dogma Process Illustration . 4 1.2 Motif Location with its Corresponding Gene in the Promoter Region of a DNA Sequence . 5 2.1 Visual Description of TP, FN, FP, and TN . 32.

Load more