ABSTRACT

A COMPREHENSIVE AND HIGH-PERFORMANCE MOTIF FINDING APPROACH ON HETEROGENEOUS SYSTEMS

Unknown regulatory motif finding on DNA sequences is a crucial task for understanding gene expression, and the task requires both accuracy and efficiency. We propose DMF, a combinatorial approach that uses hash-based heuristics to skip unnecessary computations while retaining maximum accuracy. Parallelized versions of our DMF approach, called PDMF, have been developed for CPU, GPU and heterogeneous computing architectures in order to achieve the maximum performance. PDMF also incorporates SIMD instructions to further accelerate the task of unknown motif search. Our experimental results show that the multicore version (PDMFm) achieved an 8.87x speedup over DMF. The GPU version (PDMFg) achieved a 41.48x and 9.95x average speedup over the serial version and PDMFm, respectively. Our SIMD-enhanced heterogeneous approach (PDMFh) achieved a 3.42x speedup over our fastest GPU model (PDMFg1). The proposed approach was tested for performance against popular approximate and suffix-based approaches with various sized real-world datasets, and the experimental results showed that the proposed approach achieved the maximum accuracy within a practical time bound for motif lengths 6~14.

Sanjay Soundarajan May 2020

A COMPREHENSIVE AND HIGH-PERFORMANCE MOTIF FINDING APPROACH ON HETEROGENEOUS SYSTEMS

by

Sanjay Soundarajan

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

in the College of Science and Mathematics California State University, Fresno

May 2020

© 2020 Sanjay Soundarajan

APPROVED

For the Department of Computer Science:

We, the undersigned, certify that the thesis of the following student meets the required standards of scholarship, format, and style of the university and the student's graduate degree program for the awarding of the master's degree.

Sanjay Soundarajan Thesis Author

Jin Park (Chair) Computer Science

Hubert Cecotti Computer Science

David Ruby Computer Science

For the University Graduate Committee:

Dean, Division of Graduate Studies

AUTHORIZATION FOR REPRODUCTION OF MASTER’S THESIS

I grant permission for the reproduction of this thesis in part or in its entirety without further authorization from me, on the condition that the person or agency requesting reproduction absorbs the cost and provides proper acknowledgment of authorship.

X Permission to reproduce this thesis in part or in its entirety must be obtained from me.

Signature of thesis author:

ACKNOWLEDGMENTS

First, and most of all, I would like to thank my mentor Dr. Jin Park, for his guidance throughout my graduate education. Without his expertise, knowledge and patience this thesis would not have been possible. I would like to thank my committee members Dr. Hubert Cecotti and Dr. David Ruby for their suggestions and encouragement. I would also like to extend my thanks to my family and friends who have supported me throughout my academic career. I wouldn’t be able to complete my education without them. In addition, I would like to thank Dr. Jenny Banh for her unwavering confidence in me and for supporting me at every step of my journey. Last of all, a special thank you goes to my research colleague and friend, Michelle Salomon, for always putting a smile on my face.

TABLE OF CONTENTS

Page

LIST OF TABLES ...... vii

LIST OF FIGURES ...... viii

INTRODUCTION ...... 1

UNKNOWN REGULATORY MOTIF FINDING ...... 4

Problem Definition...... 4

Solution Approaches ...... 6

PROPOSED APPROACH: DMF (DICTIONARY MOTIF FINDER) ...... 10

Background...... 10

Hash-based Heuristic Approach (DMF) ...... 18

PERFORMANCE OF DMF ...... 23

ACHIEVING HIGH PERFORMANCE ...... 29

Background...... 29

Parallel Dictionary Motif Finder (PDMF) ...... 34

PDMFm (MultiCore version) ...... 34

PDMFg (GPU version) ...... 39

PDMFracing (Heterogeneous Model) ...... 47

Enhanced Heterogeneous Model with SIMD Vectors - PDMFh ...... 49

PERFORMANCE OF PDMF ...... 59

CONCLUSION ...... 66

REFERENCES ...... 68

LIST OF TABLES

Page

Table 1. DMF vs. Branch-and-Bound Execution Time ...... 25

Table 2. Strength of Hypothetical Consensus Patterns ...... 26

Table 3. Accuracy: MEME vs. DMF ...... 27

Table 4. DMF vs. SPELLER vs. WEEDER Execution Time...... 28

Table 5. DMF vs. PDMFm2 vs. PDMFg1 Execution Times ...... 62

Table 6. PDMFg1 vs. PDMFracing vs. PDMFh Execution Times ...... 64

LIST OF FIGURES

Page

Figure 1. Consensus score of a motif ...... 5

Figure 2. Tree representation of a motif search ...... 13

Figure 3. Bypass paths on the L-mer tree ...... 15

Figure 4. Branch and bound performance vs. k and dataset size ...... 24

Figure 5. DMF vs. Branch-and-Bound performance ...... 25

Figure 6. Efficiency: DMF vs. MEME ...... 27

Figure 7. PDMFm1 computation model ...... 36

Figure 8. PDMFm2 block division computation model ...... 38

Figure 9. PDMFm2 cyclic division computation model ...... 38

Figure 10. PDMFm3 computation model ...... 39

Figure 11. GPU parallel architecture ...... 41

Figure 12. PDMFg1 computation model ...... 44

Figure 13. PDMFg2 computation model ...... 46

Figure 14. PDMFracing computation model ...... 48

Figure 15. PDMFh computation model ...... 49

Figure 16. SIMD Vector execution on multiple sequences ...... 54

Figure 17. SSE-based min operation ...... 56

Figure 18. SSE vector set operation ...... 57

Figure 19. PDMFm performance: m1 vs. m2 vs. m3 ...... 60

Figure 20. Performance of PDMFm3 with different thread allocations ...... 61

Figure 21. PDMFg1 vs. PDMFm2 performance ...... 62

Figure 22. Performance: PDMF GPU vs. Heterogeneous Models ...... 63

Figure 23. PDMFh scalability comparison ...... 65

INTRODUCTION

In the field of bioinformatics, finding recurring patterns of nucleotide base pairs in genomic data is a crucial task. These recurring patterns, or transcription factor binding sites, are referred to as motifs. Motifs represent possible protein interaction sites within the genome where various chemical reactions can take place. The occurrence of a motif within multiple genetically significant areas can highlight the relationship between the pattern and its effect on diseases such as cancer in organisms [1].

With the cost of DNA sequencing falling rapidly in the last two decades, the amount of biological data that exists in digital format has been rising almost exponentially. This increase in the size of genomic databases has caused problems in searching and processing times, as some of these datasets are tens or even hundreds of gigabytes in size. A popular protein database, nr, sits at over 150GB as of March 2020. Early work in the area of motif searching relied on simple combinatorial approaches [1] that faced many performance penalties in the face of large datasets. Naive combinatorial algorithms tend to have execution time complexities spanning multiple polynomial orders of magnitude. Despite the accuracy of these approaches, practical running times were crucial for researchers in the field. A solution to this problem was the use of heuristic motif searching algorithms that utilize statistics and other approximation approaches [1] to retrieve statistically relevant motifs. These approaches tackled the run time concerns but did not fully handle the accuracy portion of the problem. Since approximating the solution motifs did not always lead to accurate answers, a better solution to the motif searching problem was needed. Furthermore, a more complex version of the motif searching problem, introduced in 2002, utilized hamming distance-based metrics to filter solution motifs. Referred to as Planted Motif Search, its distance and quorum variables increased the complexity of the search, with even longer execution times [1].

With the introduction of multiple core CPUs, GPGPUs and FPGAs, increasing processing power was starting to become available to researchers in multiple fields. These devices provide massive amounts of computing power using smart organizations of hardware-level components, offering large amounts of parallelization to tasks that are able to harness it. Biological applications of parallelization saw a great increase in the field of bioinformatics as users were now able to use parallel computing as a means of accelerating their applications. Exact parallel combinatorial approaches were now starting to be used in a variety of subfields such as pairwise sequence alignment and motif searching, to name a few. Larger motifs in modern databases were able to be searched in reasonable times using accurate combinatorial approaches.

In this thesis, we introduce a novel combinatorial approach that uses smart heuristics to reduce the search space and multiple parallelization strategies to effectively handle large motif lengths and databases. This proposed approach utilizes the branch-and-bound technique to filter solution motifs within the 4^L search space, where L refers to the length of the motif and 4 is the size of the DNA alphabet. Parallelization is handled using a heterogeneous approach where a GPU and a multi-core CPU process possible solution motifs along a load-balanced execution path. Inside each CPU core, Streaming SIMD Extensions (SSE) and their later variants are used to accelerate the motif search process. All three parallelization techniques (multicore, GPU and SIMD) are used simultaneously in our system.

The next section delves into the background of the (planted) motif searching problem and the heuristics we have developed to effectively reduce the search space of the problem. The section following that expands on our parallel approach and how we developed a static heterogeneous execution path to effectively keep all the computing resources well fed with tasks without resorting to a master-slave based parallelization system. The final two sections will explore our testing methodologies and results before I conclude this thesis.

UNKNOWN REGULATORY MOTIF FINDING

Problem Definition

Searching for a motif in a database of DNA sequences can be abstracted to the following string-matching problem:

Given a set of DNA sequences (s1, s2, s3, …, sn) with sequence lengths (m1, m2, m3, …, mn), each from an alphabet of Σ, and a motif length of L; Find motif string x such that |x| = L and the sum of hamming distances of x over every sequence in the database is minimized across the set of all possible motif strings. (1)

The hamming distance corresponds to the number of differences between a pair of strings. This operation can be viewed as an XOR operation between two characters or words in the database. In the following example the hamming distance, d, between a pair of two strings of equal length is shown.

Seq1:              A C G G C T A G C
Seq2:              C C G G T C A G C
Hamming distance:  1 0 0 0 1 1 0 0 0

Total Hamming Distance: 3
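As an illustration of this operation, a minimal C++ sketch of the pairwise computation is shown below (the function name is ours and not part of the DMF source):

#include <cassert>
#include <cstddef>
#include <string>

// Hamming distance between two equal-length strings:
// counts the positions at which the characters differ.
std::size_t hammingDistance(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::size_t dist = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++dist;   // mismatch at position i
    }
    return dist;
}

// Example from above: hammingDistance("ACGGCTAGC", "CCGGTCAGC") returns 3.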

In a set of sequences of various lengths, this problem requires the aligning of motifs across multiple sequences in order to find the true hamming distance of a motif. As shown in Figure 1, the motif search problem is expanded when a motif is needed from a more real-world biological database.

Figure 1. Consensus score of a motif

Due to the increased number of locations the common motif could be in, one must look at every possible motif of length L, within the alphabet Σ, in the dataset. With a possible solution list of 4^L, the motif search problem is considered to be NP-complete due to its exponentially increasing search space with respect to motif length. The planted motif search problem [1] adds an additional variable where a maximum number of possible mutations is specified. To describe this problem, we can modify our former definition of the motif searching problem to include this new parameter as follows:

Given a set of DNA sequences (s1, s2, s3, …, sn) with sequence lengths (m1, m2, m3, …, mn), each from an alphabet of Σ, and a motif length of L; Find motif string x such that |x| = L and the hamming distance of x with any sequence is not more than d while the total hamming distance of x over all the sequences is minimized across the set of all possible motif strings. (2)

To better enforce the distance d across a database, the planted motif search problem includes an additional quorum variable that allows for a small leeway in the number of sequences requiring the distance d. The quorum, q, states the percentage of sequences within the database that must contain the motif within a distance of at most d. Adding this to our motif search problem definition gives us the following updated definition:

Given a set of DNA sequences (s1, s2, s3, …, sn) with sequence lengths (m1, m2, m3, …, mn), each from an alphabet of Σ, and a motif length of L; Find motif string x such that |x| = L and q percent of sequences in the database have at most a hamming distance of d. (3)

This final problem definition is usually abbreviated as the (L, d)-q problem, where L corresponds to the motif length, d corresponds to the maximum allowed distance, and q corresponds to the percentage of sequences allowed to have a maximum distance of d. The "planted" in the planted motif search problem refers to the introduction of a randomly generated motif within a randomly generated DNA database of n sequences with length m. The generated motif is mutated in d places before being planted in each sequence in the dataset. For the purpose of this thesis, the sequence database will not be mutated or have sequences inserted artificially. All database searches will be performed on naturally occurring data within real-world datasets and their smaller subsets.

In this thesis, we add one final parameter to this motif search problem. Since we are trying to find top-ranking motifs, we add the variable k to signify the number of top-ranking motifs we are requesting. Therefore, at the end of a motif search, we expect the top-k scoring motifs within a database, where a higher rank is given to the motifs with lower total hamming distance. This is an important distinction, because many problem definitions only request the single best scoring motif in the regular motif search or the best planted motif search problem.

Solution Approaches

This section will explore related literature in the area of motif searching and some of the various ways in which the solution to the motif search has been showcased in the last three decades. These approaches can be categorized into statistical (approximation) approaches [1-10] and combinatorial (exact) algorithms [11-22]. Statistical or approximate approaches often return the best consensus-scored motifs in the database using heuristically derived methods. Despite their speed in execution time, one cannot guarantee that the true best motif will be found on every run. Gibbs Motif Sampler [2] uses Gibbs Sampling to select motifs. The key feature relies on the probabilities of unobserved positions being inferred from the application of the Bayesian theorem on observed sequence data. This iterative sampling method can get stuck in local optima, but the authors have included methods of phase shifting to reduce this risk. However, this algorithm gives different results on different runs [1]. A more advanced version of the Gibbs sampling method can be found in the AlignAce program [3]. Based on the Expectation-Maximization model, MEME [4] is an approximate approach that repeatedly performs expectation and maximization steps until it converges to locally optimal motifs, using a sample-driven method to find starting points of convergence. This also gives the benefit of increasing the chance of globally optimal motifs being found early on. MEME's greedy selection of starting points can lead to lower runtimes, but the statistical approach cannot guarantee that the truly optimal motif will be found. Using a Nested Sampling inference strategy, NestedMICA [5] is an approximate scalable pattern-discovery system for finding motifs in biological databases.

NestedMICA has been designed to scale to large sets of data on multiprocessor machines and clusters. CONSENSUS [6] is another well-known approximation approach, developed to perform motif searches on unaligned DNA sequences. Using frequency matrices to represent the possibility of nucleotide occurrence, CONSENSUS creates statistical profiles to determine the probability of common motifs within a dataset. For each possible motif, a probability profile matrix is created. Each of these matrices is combined with L-mers in successive sequences to form new matrices. The lowest probability matrix in each combination procedure is saved to describe the consensus pattern at the end of the algorithm's iterations. More statistical approaches can be found in [7-10]. Combinatorial approaches are able to find the truly optimal motif, but the number of possible solutions they must consider is very large. Some early pattern-driven approaches shown in [11-12] were able to process all candidate patterns to find the best motif, but the time and space complexities of these approaches are extremely high. Sample-driven approaches use the database itself to consider possible solution motifs [4]. By only considering motifs sampled from the sequences themselves, the search space is reduced by a significant amount. However, this strategy still runs a risk of missing the true best motif due to the algorithm not considering solution motifs outside of the sequence parameters. Multiprofiler [13] uses multi-positional profiles to limit the number of possible patterns in this extended sample-driven approach.

A group of branch-and-bound pattern-driven approaches for solving the planted motif search problem uses suffix or mismatch trees to return all possible candidate motifs without providing the ranking of each possible motif [14-18]. All these approaches utilize the (L, d)-q parameters where L stands for motif length, d stands for maximum allowed mutations and q represents the quorum variable. SPELLER [14] generates the motif models by increasing length and by simulating the traversal of the lexicographic tree of all possible objects over the search space. This algorithm has a space complexity of O(n). MITRA [17] uses a data structure known as a mismatch tree to split the search space of all possible patterns into disjoint subspaces that start with a given prefix.

By categorizing subspaces as weak based on the number of neighborhoods generated, MITRA is able to reduce its memory usage (a major disadvantage for suffix tree-based methods).

A web accessible approach called WEEDER-Web[15] uses an accelerated exhaustive search where pre-processed input sequences are organized in a suffix tree indexing structure. This approximation approach allows a user to automatically search for varying length motifs with predefined distance metrics for each specified length. Multiple scans can be requested in the program at the cost of lower quorum values.

RISOTTO [16] is a more modern suffix tree-based approach, developed from an earlier implementation (called RISO [23-24]), where maximal extensibility information is stored to prevent the expansion of non-solution motifs. This approach returns every possible solution motif without ranking. Motif [18] is an exhaustive suffix tree algorithm that has been implemented to work on ChIP-enriched sequences. This approach also attempts to overcome noise generated by the sequences as a possible source of error.

A family of sorting and enumeration-based approaches under the prefix PMS is showcased in works [19-22]. These works are focused on the planted motif search problem using all the (L, d)-q parameters for all search algorithms.

PMS8, for instance, uses pattern-driven and sample-driven methods to prune the d-neighbor search spaces. To improve speed, the authors compress the L-mers in their pruning matrix using 16-bit integers. The hamming distance operations are performed using logical operators, and a cache locality scheme is used for faster access to their pruning matrix. This approach also includes a parallel implementation with a master-slave load balancing and control scheme using the OpenMPI framework in C++.

PROPOSED APPROACH: DMF (DICTIONARY MOTIF FINDER)

Background

In this thesis, we explore two different motif definitions. As stated in the previous section, we will be utilizing the definitions Eq. (1) and Eq. (3) as our search parameters. We will also be searching for the top-k scoring motifs, so a ranked and filtered solution list must be returned. The strength of a motif is determined by the scoring function of total hamming distance shown in Eq. (4) [25], where M is the candidate motif, p_i is the i-th sequence, and δ(p_i, M) is the minimum hamming distance of motif M over all positions in sequence p_i.

∑_{i=1}^{t} δ(p_i, M)    (4)

When summed over the entire database, the total hamming distance correlates to the consensus score with which each motif should be ranked. This relation is shown in Eq. (5), where (L × t) is the maximum possible consensus score.

total distance = (L × t) − consensus score    (5)
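To make Eq. (4) and Eq. (5) concrete, the following C++ sketch (our own illustration rather than the exact DMF implementation) computes δ(p_i, M) as the minimum hamming distance of M over all L-length windows of a sequence and sums it over all t sequences:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimum hamming distance of motif M over all windows of sequence seq (the delta term in Eq. 4).
int minDistanceInSequence(const std::string& seq, const std::string& M) {
    const std::size_t L = M.size();
    int best = static_cast<int>(L);                  // distance can never exceed L
    for (std::size_t pos = 0; pos + L <= seq.size(); ++pos) {
        int d = 0;
        for (std::size_t j = 0; j < L && d < best; ++j)
            if (seq[pos + j] != M[j]) ++d;           // count mismatches in this window
        best = std::min(best, d);
    }
    return best;
}

// Total distance of Eq. (4): the sum of per-sequence minimum distances.
int totalDistance(const std::vector<std::string>& dna, const std::string& M) {
    int total = 0;
    for (const std::string& seq : dna) total += minDistanceInSequence(seq, M);
    return total;
}

// Eq. (5): for t sequences, consensus score = L * t - totalDistance(dna, M).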

The motif search problem can be brute forced by looking through all possible starting positions within a list of sequences. Algorithm 1 shows the brute force motif search on a database of sequences.

Algorithm 1: BRUTEFORCEMOTIFSEARCH (DNA, t, n, L)
1   bestScore ← 0
2   for each (s1, …, st) from (1, …, 1) to (n − l + 1, …, n − l + 1)
3       if Score (s, DNA) > bestScore
4           bestScore ← Score (s, DNA)
5           bestMotif ← (s1, s2, …, st)
6   return bestMotif

At each index there are (n − l + 1) choices for starting the search. At each of those points there are (n − l + 1) choices for the next round of the algorithm, and so on. For a database of t sequences, there are (n − l + 1)^t positions within which the motif search algorithm can be run. For each one of these iterations, the scoring function takes O(l) operations. Therefore, the overall complexity of the algorithm is O(l · n^t). The number of possible motifs for a given length L is based on the alphabet that the patterns are encoded in. In our case of DNA motifs, |Σ| = 4. This means that for any given L the solution space can include 4^L motifs. For L = 3, these can be represented as:

AAA AAC AAG AAT ACA … TGT TTA TTC TTG TTT

In an alphabet of DNA nucleotides, Σ = {A, C, G, T}, we can represent possible patterns with numerical representations of Σ = {1, 2, 3, 4}. The new representation can be packed within bytes more efficiently for better computing performance. For L = 3, the solution motif list can be represented as:

(1,1,1) (1,1,2) (1,1,3) (1,1,4) (1,2,1) … (4,3,4) (4,4,1) (4,4,2) (4,4,3) (4,4,4)
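As a side note, the remark above about packing the numerical representation into bytes can be illustrated with a small sketch (ours, not DMF's actual encoding) that maps each nucleotide to the values 0-3 rather than the 1-4 used above, so that a motif of length up to 16 fits in a single 32-bit word:

#include <cstdint>
#include <string>

// Encode an L-mer (L <= 16) as a 32-bit integer using 2 bits per nucleotide:
// A -> 0, C -> 1, G -> 2, T -> 3.
std::uint32_t packLmer(const std::string& lmer) {
    std::uint32_t code = 0;
    for (char c : lmer) {
        std::uint32_t symbol = 0;
        switch (c) {
            case 'A': symbol = 0; break;
            case 'C': symbol = 1; break;
            case 'G': symbol = 2; break;
            case 'T': symbol = 3; break;
        }
        code = (code << 2) | symbol;   // append two bits for this nucleotide
    }
    return code;
}

// Example: packLmer("ACG") yields binary 000110, i.e. 6.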

Each L-mer, a, can be shown as an array of length L where each array slot refers to the character at that position:

a = (a1, a2, a3, …, aL)

To iterate through this list, we use the numerical representation of the solution list as our iterator. Given a motif within this new alphabet, we can move to the next motif in the list using an algorithm (Algorithm 2) we will refer to as NEXTLEAF (a, L, k).

Algorithm 2: NEXTLEAF (a, L, k)
1   for i ← L to 1
2       if ai < k
3           ai ← ai + 1
4           return a
5       ai ← 1
6   return a

In this algorithm, the array containing the current motif will wrap around to the next available solution, with the iterations going from right to left. This behavior is similar to a base-4 counting system where the digit values wrap around as we reach higher numbers.
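For illustration only (this is not taken from the DMF source), a direct C++ translation of NEXTLEAF, treating the array a as a base-k counter, could look like the following:

#include <vector>

// Advance the length-L array a (values 1..k) to the next L-mer, wrapping
// around to (1, ..., 1) after the last one, exactly as in Algorithm 2.
void nextLeaf(std::vector<int>& a, int L, int k) {
    for (int i = L - 1; i >= 0; --i) {   // rightmost position first (0-based indexing here)
        if (a[i] < k) {
            ++a[i];                       // increment this position and stop
            return;
        }
        a[i] = 1;                         // carry over to the position on the left
    }
}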

With the NEXTLEAF function defined in our system, we can now generate a list of the solution motifs in ascending order. The ALLLEAVES (L, k) function, shown in Algorithm 3, will generate this list for the user. Even though it might seem like the loop will go on indefinitely, the final element will always wrap around back to the start of the solution list, so an infinite loop is not present in this system.

Algorithm 3: ALLLEAVES (L, k)
1   a ← (1, ..., 1)
2   while forever
3       output a
4       a ← NEXTLEAF (a, L, k)
5       if a = (1, 1, …, 1)
6           return

With the introduction of the NEXTLEAF method, we have a way of simplifying the brute force motif search problem slightly by finding the median string from the 4^L choices in the possible solution list, with a time complexity of O(4^L · n · t) and a space complexity of O(n · t), where n is the average length of the sequences. The pseudocode for the new brute force method is shown in Algorithm 4.

Algorithm 4: BRUTEFORCEMOTIFSEARCH (DNA, t, n, L)
1   s ← (1, ..., 1)
2   bestScore ← Score (s, DNA)
3   while forever
4       s ← NEXTLEAF (s, t, n − l + 1)
5       if Score (s, DNA) > bestScore
6           bestScore ← Score (s, DNA)
7           bestMotif ← (s1, s2, …, st)
8       if s = (1, 1, …, 1)
9           return bestMotif

The possible motifs in a solution list can be represented in a tree to better understand a key property within these lists. The entire list of L-mers can be shown in a tree with ∑_{i=1}^{L} k^i nodes (excluding the root node), where k = 4 (= |Σ|) represents the number of child nodes that any non-leaf node in the system will contain. At the leaf level, each node corresponds to a possible motif in the tree, and the depth of the tree is represented by the length of the motif. A visual representation of this concept is shown in Figure 2.

Figure 2. Tree representation of a motif search

With this tree representation of the motif solution list, we can update our NEXTLEAF algorithm to be able to traverse to the parent nodes as well. For this system, we use preorder traversal as our method of moving from node to node. This new NEXTVERTEX (a, i, L, k) algorithm takes a new parameter i, where i represents the level of the tree we are currently in. The updated algorithm is shown in Algorithm 5.

Algorithm 5: NEXTVERTEX (a, i, L, k)
1   if i < L
2       ai+1 ← 1
3       return (a, i + 1)
4   else
5       for j ← L to 1
6           if aj < k
7               aj ← aj + 1
8               return (a, j)
9   return (a, 0)

The brute force motif search algorithm can be simplified to utilize the tree-based representation for traversing motifs. Even though this approach iterates through every possible solution motif, it will be a good starting point for showcasing the benefits of skipping hopeless solutions later on. The simple motif search algorithm is shown in Algorithm 6.

Algorithm 6: SIMPLEMOTIFSEARCH (DNA, t, n, l)
1   s ← (1, . . . , 1)
2   bestScore ← 0
3   i ← 1
4   while i > 0
5       if i < t
6           (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
7       else
8           if Score(s, DNA) > bestScore
9               bestScore ← Score(s, DNA)
10              bestMotif ← (s1, s2, . . . , st)
11          (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
12  return bestMotif

A key distinction to make with this representation of the motif solution tree is that only the path that we can follow has been represented here. Since we can reach every node in this tree, the entire motif solution list exists in a storage-less form. Traversing the list in order can be useful in trying to find a way to improve our brute force search algorithm. Since our traversal path reaches all the nodes within a level before moving up the tree, it is beneficial for us to find a way to filter our computation by not exploring a specific subtree. If we were to rule out a node at the lower levels, all nodes within that subtree are skipped, leading to improved overall processing time. This approach is called branch-and-bound.

To perform a bypass of a useless subtree, the BYPASS (a, i, L, k) algorithm is used. This algorithm lets us jump levels, if needed, at any point in the tree traversal. The pseudocode for this approach is shown in Algorithm 7.

Algorithm 7: BYPASS (a, i, L, k)
1   for j ← i to 1
2       if aj < k
3           aj ← aj + 1
4           return (a, j)
5   return (a, 0)

Some possible bypass paths in the search tree are shown in Figure 3.

Figure 3. Bypass paths on the L-mer tree
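As a minimal C++ sketch of the two traversal helpers (Algorithms 5 and 7), written here purely for illustration with the current prefix stored in a[1..i] and symbol values in 1..k, the pair of routines could be implemented as:

#include <vector>

// Preorder step on the virtual L-mer tree (Algorithm 5).
// a holds the current prefix in a[1..i] (index 0 unused); the return value is
// the new level, and 0 means the traversal is finished.
int nextVertex(std::vector<int>& a, int i, int L, int k) {
    if (i < L) {                      // descend: extend the prefix with the first symbol
        a[i + 1] = 1;
        return i + 1;
    }
    for (int j = L; j >= 1; --j) {    // at a leaf: advance like a base-k counter
        if (a[j] < k) {
            ++a[j];
            return j;
        }
    }
    return 0;
}

// Skip the whole subtree rooted at the current prefix (Algorithm 7).
// (The L parameter of Algorithm 7 is not needed here and is omitted.)
int bypass(std::vector<int>& a, int i, int k) {
    for (int j = i; j >= 1; --j) {
        if (a[j] < k) {
            ++a[j];
            return j;
        }
    }
    return 0;
}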

The bypass procedure relies on a heuristic to filter subtrees and prevent unnecessary traversal. The basis of this filtering procedure relies on the same mechanism used to create the multilevel tree. In this representation of the tree, we can see that the actual motifs within the internal nodes are built up one letter at a time. By employing a similar understanding, we can see that our hamming distance metric can also be summed as we go deeper into the tree. For example, if a certain internal node had hamming distance x, every subtree underneath that node would also have a hamming distance of at least x. Since we are adding characters to the motif, the hamming distance of any descendant is guaranteed to be equal to or greater than that of the parent node. If we know that a certain subtree will not lead to a better hamming distance value, we can save time and computation resources by not exploring this node's children. The bypass function is very useful in the levels of the tree closer to the root due to the number of possible motif solutions that we can skip. This skipping mechanism is the basis of the branch-and-bound approach and is employed in the implementation of this algorithm in other areas of study. One example use of this algorithm is in solving the NP-hard knapsack problem. By integrating the hamming distance function within our traversal, we can now rewrite our motif search algorithm to include the bypass function. The pseudocode for this basic algorithm is shown in Algorithm 8.

Algorithm 8: BRANCHANDBOUNDMOTIFSEARCH (DNA, t, n, l)
1   s ← (1, . . . , 1)
2   bestScore ← 0
3   i ← 1
4   while i > 0
5       if i < t
6           optimisticScore ← Score(s, i, DNA) + (t − i) · l
7           if optimisticScore < bestScore
8               (s, i) ← BYPASS (s, i, t, n − l + 1)
9           else
10              (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
11      else
12          if Score(s, DNA) > bestScore
13              bestScore ← Score(s, DNA)
14              bestMotif ← (s1, s2, . . . , st)
15          (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
16  return bestMotif

However, we must keep in mind that the worst-case time complexity of our algorithm is not improved with this approach. In a problem instance where no skipping can take place, we will still traverse into every leaf node with very little skipping by the algorithm. With the large solution space of 4^L for our DNA alphabet, this approach does not yield a practical time bound for execution on longer motif strings.

The planted motif definition requires a minor addition to the basic motif search algorithm. Since we are looking for a motif that satisfies both d and q parameters as well, we must include these definitions within our algorithm. The updated version of this is shown in Algorithm 9.

Algorithm 9: BRANCHANDBOUNDPLANTEDMOTIFSEARCH (DNA, t, n, l, d, q)
1   s ← (1, . . . , 1)
2   bestScore ← 0
3   i ← 1
4   while i > 0
5       if i < t
6           optimisticScore ← Score(s, i, DNA) + (t − i) · l
7           if optimisticScore < bestScore
8               (s, i) ← BYPASS (s, i, t, n − l + 1)
9           else
10              (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
11      else
12          if Score(s, DNA, d) > bestScore
13              if Quorum(s, DNA, d) > q
14                  bestScore ← Score(s, DNA)
15                  bestMotif ← (s1, s2, . . . , st)
16          (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
17  return bestMotif

A key distinction to make with the definition of the planted motif search algorithm is that the mutation and quorum values only apply to the leaf-level nodes. Using the bypass method within the internal nodes can lead to incorrect results. The d parameter stands for the maximum number of allowed mutations on the final solution motifs. If we were to employ a maximum distance metric on each internal level of the tree, the maximum distance would accumulate with every additional level in the tree. Therefore, for a tree of depth 8, with d = 2, we could possibly see a maximum distance of 16 by the time the algorithm traverses to the bottom of the tree. This is not the intended result when submitting a motif search request.

Hash-based Heuristic Approach (DMF)

To better utilize the branch-and-bound approach to motif finding, we use a couple of hashing-based heuristics to improve the bypass ratio. Since the branch-and-bound algorithms start at the bottom level of the tree, finding good candidate motifs to act as the lowest bound distance limit (best hamming distance) before any traversal occurs will speed up the algorithm. The first heuristic looks at the number of motif occurrences in the database before determining the hypothetical best motif. A sample-driven method is utilized to find a solution motif that occurs the most in the input database. Since the search problem is ideally trying to find the pattern that occurs most frequently in the database, it follows general reasoning that this pattern is either the solution or close to the intended solution. Algorithm 10 showcases the mechanism used to locate the motif that occurs most frequently. Since this approach is implemented in C++, we use the unordered_map data structure present in the language. This is analogous to the dictionary in other languages and will occasionally be referred to as such in this thesis. Since my approach relies on hashing-based heuristics to provide efficient computation, the resulting tool showcased here is named DMF (Dictionary Motif Finder).

Algorithm 10: FINDHYPOTHETICALPATTERN (DNA, t, L)
// building an unordered_map UM with
// One Occurrence Per Sequence (OOPS) model
1   for each sequence_i
2       scan L-mer tiles in sequence_i with OOPS
3       if (L-mer found in UM)
4           UM[L-mer]++        //increment count
5       else
6           UM[L-mer] ← 1      //new pattern with count 1
//determine the hypothetical pattern (best or best-k)
7   hypothetical consensus pattern(s) ← best (or best-k) counted pattern(s) from UM
8   compute total_distance(s) of the hypothetical pattern(s)
9   return the best (or best-k) hypothetical pattern(s) with score(s)
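A simplified C++ sketch of this dictionary-building step (heuristic 1) is shown below; it is an illustration under the assumption that the OOPS model counts each distinct L-mer at most once per sequence, and the helper name is ours:

#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Build the frequency dictionary UM used by heuristic 1: for every sequence,
// each distinct L-mer is counted at most once (One Occurrence Per Sequence).
std::unordered_map<std::string, int>
buildLmerDictionary(const std::vector<std::string>& dna, std::size_t L) {
    std::unordered_map<std::string, int> UM;
    for (const std::string& seq : dna) {
        std::unordered_set<std::string> seen;             // enforces OOPS per sequence
        for (std::size_t pos = 0; pos + L <= seq.size(); ++pos) {
            std::string lmer = seq.substr(pos, L);
            if (seen.insert(lmer).second) ++UM[lmer];     // first occurrence in this sequence
        }
    }
    return UM;
}

// The hypothetical consensus pattern(s) are then the most frequently counted
// entries of UM, and their true total distances are computed with Eq. (4).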

The first heuristic is shown in line 7 of Algorithm 10. The time complexity of this algorithm is O(p + L · n · t), where n is the average sequence length and p refers to the computation of the total hamming distance for the hypothetical best motif. Each unordered_map operation takes O(1), so it is not considered in the overall time complexity. Even though this is the simplest motif search definition, we can show this search in terms of the planted motif search problem definition where d = 0 and q = 0. Our second heuristic utilizes an approximation method to determine which subtrees can be skipped in the algorithm based on the unordered_map created in Algorithm 10. The concept relies on a hypothetical optimum total hamming distance measure that is directly tied to the frequency of occurrence of a pattern in a database. Algorithm 11 shows the hypothetical score computation for a given L-mer.

Algorithm 11: OPTIMISTICDISTANCECOMPUTE (UM, L-mer, t)
//Assume: UM is unordered_map
1   search UM for L-mer
2   if found
3       optimistic_dist = (t – count)   //assume L-mer has 1 mismatch in each remaining seq.
4   else
5       optimistic_dist = t             //assume L-mer has 1 mismatch in each seq.
6   return optimistic_dist
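A corresponding C++ sketch of this optimistic-distance lookup (heuristic 2), reusing the dictionary built in the previous sketch, might look like this:

#include <string>
#include <unordered_map>

// Optimistic (best-case) total distance for an L-mer: every sequence that does
// not contain the L-mer exactly is assumed to contribute exactly one mismatch.
int optimisticDistance(const std::unordered_map<std::string, int>& UM,
                       const std::string& lmer, int t) {
    auto it = UM.find(lmer);                        // O(1) expected lookup
    int count = (it != UM.end()) ? it->second : 0;  // exact occurrences (OOPS count)
    return t - count;                               // matches both branches of Algorithm 11
}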

Since we have a count of how many times a motif occurs for each sequence in a database, we can assume every sequence that the motif does not occur in has at minimum a hamming distance of 1. When we subtract the number of sequences containing the motif from the total number of sequences, we get the hypothetical best distance value that we can use as a motif filter in the branch-and-bound approach. Since this heuristic assumes the best possible case for a motif in terms of total hamming distance (real-world databases usually have a distance of more than 1), we will not miss any possible solution motifs in our DMF algorithm. Finally, some additional optimizations to the total hamming distance algorithm have also been included in the unordered_map creation phase. For each discovered motif in the database, a log of which sequence it came from is also recorded. This prevents us from having to recheck a sequence for a motif in our total hamming distance computation. The updated hypothetical pattern creation is shown in Algorithm 12.

Algorithm 12: FINDHYPOTHETICALPATTERN (DNA, t, L)
// building an unordered_map UM with
// One Occurrence Per Sequence (OOPS) model
1   for each sequence_i
2       scan L-mer tiles in sequence_i with OOPS
3       if (L-mer found in UM)
4           UM[L-mer]++        //increment count
5       else
6           UM[L-mer] ← 1      //new pattern with count 1
7       location_map[i] ← true
//determine the hypothetical pattern (best or best-k)
8   hypothetical consensus pattern(s) ← best (or best-k) counted pattern(s) from UM
9   compute total_distance(s) of the hypothetical pattern(s)
10  return the best (or best-k) hypothetical pattern(s) with score(s)

Using these two hashing-based heuristics, we can now rewrite our original branch-and-bound algorithm with the intention of achieving high bypassing ratios. The complete DMF algorithm is shown in Algorithm 13.

Algorithm 13: DICTIONARY MOTIF FINDER (DNA, t, L)
//Assume: Total_dist_compute (v, DNA) returns total hamming distance between pattern v and all DNA sequences with OOPS
1   best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2   a ← (1, 1, …, 1)    //starts from the 1st leaf vertex, AA...AA
3   i ← L               //starts from the leaf level
4   while (i > 0)
5       if (i < L)      //non-leaf vertex
6           prefix ← nucleotide symbols corresponding to (a1, a2, ..., ai)
7           optimistic_dist ← Total_dist_compute (prefix, DNA)
8           if (optimistic_dist ≥ best_distance)
9               (a, i) ← BYPASS (a, i, L, 4)    //skips subtree
10          else
11              (a, i) ← NEXTVERTEX (a, i, L, 4)
12      else            //leaf vertex
13          word ← nucleotide symbols corresponding to (a1, a2, ..., aL)
14          optimistic_dist ← OPTIMISTICDISTANCECOMPUTE (UM, word, t)
15          if (optimistic_dist ≥ best_distance)
16              skip this L-mer
17          else        //need to compute total hamming distance
18              if (x ← TOTALDISTCOMPUTE (word, DNA) < best_distance)
19                  best_distance ← x
20                  best_pattern ← word
21          (a, i) ← NEXTVERTEX (a, i, L, 4)
22  return best_distance/pattern

In Algorithm 13, line 1 finds the initial hypothetical best pattern with its optimistic best distance and builds the unordered map (dictionary) containing the frequency of pattern occurrences. Within the internal nodes of the tree, lines 8 and 9 refer to the bypassing of nodes based on the hypothetical pattern found earlier. If a solution motif is found to be within the same distance range as the optimistic approximation, the algorithm will traverse into the subtree to confirm whether it is better than the current best motif. Line 14 refers to the optimistic score measure using the frequency of motif occurrences. This leads to an O(1) call to the dictionary structure to determine if the algorithm should attempt a full hamming distance computation. Lines 15 and 16 refer to the skipping procedure that results from the optimistic distance calculation. DMF can expand the number of requested motifs in an efficient manner. If the user requires multiple consensus patterns, the algorithm uses priority queues to determine the best possible motifs, with each solution motif ranked with respect to the other solution motifs. This problem definition is referred to as the best-k motif search. This is a key point when compared to other motif searching solutions, because the algorithm finds the truly global best-k motifs without a loss in accuracy. Since our heuristics only affect the number of skipped unnecessary computations, and not the scoring or motif generation procedure itself, DMF is the best possible solution for a combinatorial approach that returns the best-k motifs when searching for motifs in a database. The implementation details and the performance of DMF are shown in the next section.
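To illustrate how such a ranked best-k list can be maintained, the short C++ sketch below keeps the k motifs with the smallest total distance in a std::priority_queue (a max-heap whose top is the current worst of the best-k); it is our own illustration and does not reflect DMF's exact data layout:

#include <climits>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>

// Keep the k motifs with the smallest total distance seen so far.
// The heap's top element is the current worst of the best-k, which serves as
// the cut-off when deciding whether a new candidate should be kept.
struct TopKMotifs {
    std::size_t k;
    std::priority_queue<std::pair<int, std::string>> heap;  // <total distance, motif>

    explicit TopKMotifs(std::size_t k_) : k(k_) {}

    int worstDistance() const {
        return heap.size() < k ? INT_MAX : heap.top().first;
    }

    void offer(int distance, const std::string& motif) {
        if (heap.size() < k) {
            heap.push({distance, motif});
        } else if (distance < heap.top().first) {
            heap.pop();                       // evict the current worst motif
            heap.push({distance, motif});
        }
    }
};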

PERFORMANCE OF DMF

To measure the performance of the proposed solution, DMF was implemented in C++ using the unordered_map and priority queue libraries. These modules are well-tested and optimized data structures within the language and suit our needs very well. The performance tests were conducted on a system that contains an Intel Core i7-4790 processor (4 cores, 3.6 GHz) with 16GB of system memory. Ubuntu 18.04 LTS was used as the operating system. An Nvidia Quadro K1200 GPU sits as the graphical acceleration device but is only used as a display output in this phase of performance testing. The HMP (Human Microbiome Project) database [49] was used as the input data within the system. In certain instances of testing, smaller dataset portions are used to determine whether database size has an effect on the execution time of the DMF algorithm. In order to create a consistent testing scheme, the original dataset is split into partitions of 2^n (1 ≤ n ≤ 8), where the partitions from each split are chosen at random. For the test cases in this section, we were searching for motifs of length L = 6 and the top-scoring best-1, 5 and 10 motif strings. Figure 4 shows the performance of the original branch-and-bound algorithm with respect to finding the top-k best motifs. For this test, all the database partitions are used as input data. We see that increasing the k value causes an increase in the execution time of the algorithm. This can be attributed to the additional number of solution motifs requested requiring more comparison operations during the execution of the algorithm. Since k is larger than 1, a priority queue that holds ranking information also needs additional sorting operations every time a possible solution motif is found. Since there is a difference between the best and worst consensus patterns observed during the process of finding the best hypothetical optimal pattern, our heuristic uses 2k comparisons to have a better chance of finding the solution motifs. This corresponds to finding the 20 best optimum patterns from the dictionary structure in heuristic 1 when the search problem asks for the best-10. Using this doubled searching strategy allows us to increase our chances of finding the true best motifs present in the database.

Figure 4. Branch and bound performance vs. k and dataset size

Figure 5 shows a comparison of skipped L-mer ratios between DMF and the original branch-and-bound algorithm (BaB). This experiment was run with L = 6 and three different k values on 8 different-sized databases. The performance gain of the DMF algorithm is also shown in this graph, as a higher bypassing ratio leads to a better execution time overall. The values shown in the graph are averaged values, and DMF achieves a very high performance gain in all the tested cases. On average, DMF has a 97.5% bypassing ratio, which corresponds to, on average, 97.5% of the 4^6 possible solution motifs skipping the total distance processing. The regular branch-and-bound algorithm only shows, on average, about a 27.5% bypassing ratio. As in the case of the original branch-and-bound algorithm, increasing k leads to a reduced bypassing ratio overall. This experiment was also used to verify the accuracy of DMF with respect to the original branch-and-bound algorithm. Execution times for Figure 5 are shown in Table 1.

Figure 5. DMF vs. Branch-and-Bound performance

Table 1. DMF vs. Branch-and-Bound Execution Time

Execution Time (sec.)
Dataset      BaB best-1   DMF best-1   BaB best-5   DMF best-5   BaB best-10   DMF best-10
DS           3773.5       154.27       3973.9       210.38       4040.7        311.43
DS (1/2)     1951.1       129.26       2028.7       151.88       2024.6        189.22
DS (1/4)     1002.5       63.44        1028.3       111.61       1028.4        138.65
DS (1/8)     495.60       50.17        511.32       72.73        511.87        99.85
DS (1/16)    203.21       3.99         212.77       4.44         234.34        6.82
DS (1/32)    110.08       1.93         118.99       3.37         125.03        4.27

DMF has a speedup, on average, of 28x, 21x, and 16x for the best-1, best-5, and best-10 cases respectively. Overall, DMF has an average speedup of 22.48x over the original branch-and-bound algorithm. A large performance gain is found at the DS (1/16) and DS (1/32) partitions, which can be attributed to the random partition selection process. In these two cases, DMF has an average speedup of 42.5x over the original branch-and-bound algorithm. One of the heuristics helping DMF achieve this speedup is the process of finding the hypothetical best motifs at the start of the algorithm. To see the benefit of this heuristic, Table 2 shows the strength of the hypothetical motif patterns when compared to the final solution list. L-mer length = 6 and randomly selected partitions are used in this experiment. The table displays the position of each hypothetically best-ranked motif with respect to the final ranking. Most of the hypothetical best-k motifs are present in the same order, confirming that our hypothesis of approximating the solution list based on the frequency of motifs is supported by the experimental results.

Table 2. Strength of Hypothetical Consensus Patterns

Final Rankings of Hypothetical Consensus Patterns
Dataset          best-5            best-10
DS (1/16) - 1    1, 2, 3, 4, 5     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
DS (1/16) - 2    1, 2, 3, 5, 7     1, 2, 3, 4, 5, 7, 8, 10, 11, 12
DS (1/32) - 1    1, 2, 3, 4, 5     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
DS (1/32) - 2    1, 2, 3, 4, 5     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
DS (1/64) - 1    1, 2, 6, 7, 11    1, 2, 3, 4, 6, 7, 8, 9, 10, 11
DS (1/64) - 2    1, 2, 3, 4, 5     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
DS (1/128) - 1   1, 2, 3, 4, 5     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
DS (1/128) - 2   1, 2, 3, 4, 7     1, 2, 3, 4, 5, 6, 7, 8, 9, 10

In fact, we find that our approximation heuristic has better accuracy than MEME [4], a popularly used statistical approach for finding motifs. Table 3 showcases an accuracy comparison between MEME and our DMF approach. The table shows the number of solution motifs produced by DMF that are also present in the MEME output after execution. Both tools are run with L-mer length = 6 and search for the best-5 and best-10 motifs. MEME was unable to maintain the same accuracy as DMF, which provides the truly global best solution motifs in each case.

Table 3. Accuracy: MEME vs. DMF

Output size & matches   DS   DS (1/2)   DS (1/4)   DS (1/8)   DS (1/16)   DS (1/32)   DS (1/64)   DS (1/128)
best-5  # matches       2    0          0          1          0           1           0           1
best-10 # matches       2    1          0          1          1           2           1           2

The execution times of both DMF and MEME for finding the best-k (k = 1, 5, 10) motifs of length L = 6 with different sized databases are shown in Figure 6. MEME shows a consistent execution time regardless of the database size, but DMF has better performance on the smaller-sized datasets despite being a combinatorial approach.

Figure 6. Efficiency: DMF vs. MEME

A performance comparison between DMF, SPELLER and WEEDER is shown in Table 4. This comparison uses the planted motif search problem definition, i.e. the (L, d)-q problem, where L = 6 and d = 2 (2 maximum mismatches). Both SISMA SPELLER [D39] and WEEDER [D31] are suffix tree-based approaches that focus on the planted motif search problem definition. Suffix tree-based approaches suffer from poor execution times [27], and processing large DNA datasets can lead to a need for excessive memory and storage space during runtime. As shown in Table 4, SISMA SPELLER was unable to complete the search on our full-size database (12,488 sequences) because it ran out of storage space on our test machine. On average, DMF had a 9.9x and 2.94x speedup over SISMA SPELLER and WEEDER respectively. These tools do not rank the final solution motifs but rather return every qualified motif, so further execution time (not included in the table) is needed to sort the solution lists they return.

Table 4. DMF vs. SPELLER vs. WEEDER Execution Time

Execution Time (sec.)
Dataset      DMF      SISMA SPELLER   WEEDER
DS           183.66   -               675.37
DS (1/2)     129.29   4782.47         212.24
DS (1/4)     86.14    942.72          82.68
DS (1/8)     61.32    210.22          36.22
DS (1/16)    4.40     49.07           17.08
DS (1/32)    3.33     10.43           8.83
DS (1/64)    0.94     2.33            4.02
DS (1/128)   0.39     0.61            2.29

We can see the performance gain of DMF, which uses the two hash-based heuristics of finding the hypothetical best motifs at the start of the algorithm and approximating the hypothetical best hamming distance, while maintaining the accuracy of our combinatorial approach. DMF also shows better performance and accuracy than MEME, a statistical approximation approach, and than SISMA SPELLER and WEEDER, two suffix tree-based planted motif search algorithms.

ACHIEVING HIGH PERFORMANCE

As with other pattern-driven combinatorial motif searching algorithms, DMF suffers when processing larger length motif strings due to the maximum accuracy provided by the algorithm. However, DMF is scalable regarding the amount of computing resources that can be utilized to help with the processing of motifs. The structure of the branch-and-bound algorithm tree allows us to divide the workload into even sections that can be parallelized to allow for a more efficient form of computation. Our total hamming distance computation function is also data parallel, meaning that we can easily split our workload for using additional computing resources. This was the motivation behind developing high performance computing versions of DMF that retain the accuracy of the original algorithm while being able to both process longer length motifs as well as larger databases.

Background

With the introduction of parallel computing architectures like multi- and many-core CPUs and CUDA or OpenCL based GPUs, formerly exhaustive and accurate but time-consuming approaches have been parallelized to take advantage of the additional computing resources. Statistical approaches have also utilized the extra resource availability to further improve their execution times for large motif lengths with fewer approximation heuristics. In the category of the statistical approaches, parallel versions of the MEME [26] algorithm, paraMEME [30], GPU-MEME [31], and mCUDA-MEME [32], are implemented on the XP/S system, a single GPU, and multiple GPUs, respectively.

Using the Intel Paragon XP/S parallel computer, paraMEME [30] creates statistical models of motifs that are based on probability distributions, and the workload is split by assigning a unique subset of initial models to each processor to run in SPMD mode. Each processor compares the best initial models, and the highest scoring model is passed to every processor. The initial model creation and the expectation-maximization operation are parallelized as well. This approach supports MPI and is claimed to be scalable over a large cluster network. GPU-MEME [31] showcases an OpenGL based GPU acceleration approach to accelerate the MEME algorithm. This version of MEME supports both OOPS (One Occurrence Per Sequence) and ZOOPS (Zero or One Occurrence Per Sequence) models and makes use of fast texture memory to store each sequence.

The calculation of score and reduction operations are handled by GPU threads. Single-GPU and two-GPU systems show one and two orders of magnitude speedup over the sequential version of MEME, respectively. mCUDA-MEME [32] is a heterogeneous CUDA GPU version of MEME that uses a hybrid combination of CUDA, OpenMP, and MPI and supports both OOPS and ZOOPS models for motif searching. Master control and worker kernel threads are used within the GPU, while the heterogeneous version can be parallelized only on one workstation. This algorithm is focused on multiple GPU clusters, with symmetric workload distribution being the priority.

Most of the parallel approaches have been developed for the combinatorial approaches under the problem definition of the planted (L, d)-q motif finding. PCVoting [33], an OpenMP accelerated version of CVoting [34], uses a simple divide and merge parallel scheme. In the divide phase, each L-mer from a sequence is assigned to a processor dynamically to generate the set of motifs. The generated candidate motifs are combined into a larger set in the merge phase, and radix sort is used to remove any redundant motifs. This scheme is claimed to have an average efficiency of over 90% due to the independent nature of the original CVoting algorithm. The authors also state that their scalability, on multicore architectures, is linear.

qPMS9 [22] is a parallel exact searching algorithm based on the works of PMS8, where the authors improve the previous approach by modifying the tuple L-mer generation portion of their program to maximize the common neighborhood space on expansion. PHEPPMSprune [35] parallelizes HEPPMSprune [36] by accelerating the candidate motif creation and validation operations. This approach utilizes a hybrid exact pattern motif search, where mechanisms of candidate motif generation from other algorithms like Voting [37] and PMSP [38] and pattern matching from PMSprune [39] have been included. This approach's runtime is greatly determined by the user's requested quorum percentage, so the authors leave that decision up to the end user. The acceleration is done by assigning a set of L-mers to each processor to generate the set of neighboring motifs and then evaluating all candidate motifs in parallel across the processors.

The time complexity of PMSprune is O(t(n − l + 1)^2 (l + p_2d · ∑_{i=1}^{2d−d′+1} C(l, i) · 3^i)), where p_2d is the probability that the hamming distance between two strings is at most 2d. mSPELLER [40] accelerates the suffix tree-based approach SPELLER by parallelizing the spelling operation. Node lists containing both explicit and implicit nodes at each level i are generated. Using a parallelized algorithm, the authors are able to divide this workload across multiple processors using a dynamic load balancing scheme to prevent uneven workloads. This approach does not accelerate the suffix tree creation process. PMS6MC [41] is a multicore accelerated version of PMS6 [42]. It uses an inner-outer level parallelism scheme with Pthread libraries, where the outer level parallelism corresponds to the L-mer block assignment while the inner level parallelism corresponds to the individual steps of the inner loop executed in parallel. Threads are partitioned into thread blocks and L-mers are assigned to these blocks, where each thread is responsible for finding its neighbors. The authors state that modifying thread block sizes and threads per block did not result in better performance due to more stalls for memory access. The BitBased [43] algorithm adopts the following schemes to improve performance. The first is incremental support, which minimizes the problem to a smaller (L, d) problem and incrementally works up to the actual size. In addition, the motif search space is reduced, and a filtering scheme is employed in the candidate motif selection process. Each core receives a 4^L matrix, with scalability shown to be linear across multiple cores. However, the amount of memory required is a drawback, with larger runtimes being attributed to limited system memory. A GPU version of this algorithm is also shown in [44]. Due to the increased performance degradation with multiple memory accesses, repartitioning and reordering of data structures for the shared memory modules of the GPUs has been performed to prevent memory bank conflicts.

A MapReduce version of the PMSP algorithm, named PMSPMR [45], is implemented on a cloud system. In this approach the workload is distributed across p nodes, so each node is only responsible for 1/p of the total workload. The Map phase is used to divide the task and send it to each node in the cluster. The Reduce phase is then used to aggregate the results from each of the nodes back into the main node. The authors of the approach described in [46] use a hybrid combination of MPI and POSIX threads in a 4-node SMP system for solving the planted motif search problem. A sample-driven approach is used for selecting triplet L-mer patterns, while a pattern-driven method is used for computing the d-neighborhoods. Three L-mers that are close-proximity neighbors are grouped together to generate the d-neighbor set using a recursive approach. To find motifs, intersections on the common d-neighbors are computed with bit vectors. The final solution motifs are created by performing a union on the results of the previous step. In this approach, one CPU is reserved as the master node and the remaining are labelled as worker nodes. Each worker node sends all of its solution motifs, at the end of its schedule, to the master node for the final union step. One downside of this method is that some processors end up starving, since the execution of neighbors is sensitive to the volume of neighborhoods being generated, while other processors might also have unbalanced workloads. gSPELLER [40] is the GPU version of the suffix tree-based planted motif finding algorithm SPELLER. To overcome the high memory requirements of the original algorithm, the new approach uses dynamic allocation of sequences within the GPU memory. Filtering is also used to limit the number of candidate motifs entering the GPU so that suffix tree expansion is minimized. Since GPUs have high global memory latency, all accesses to the device global memory are coalesced to make the maximum use of the throughput of the transfer medium. These memory limitations, however, make gSPELLER perform worse than mSPELLER, due to the CPU's access to more free memory. Multiple approaches to accelerating the brute force motif algorithm on parallel architectures can be found in [47].

OpenMP and MPI to divide the workload on a multicore processor. They also showcased a CUDA-based approach that outperformed the CPU version for both few-long and many-short sequences. To prevent the GPU from timing out, they use multiple kernel calls where the motif search space is divided into 8 subspaces and the subspaces themselves are then divided into blocks of size 512. Each GPU thread within a block searches for a motif.

A different approach to load balancing motif searches is shown by the authors of [48]. The target of this approach is heterogeneous systems where workload and computation power are unbalanced across different systems (clusters containing multicore, manycore and GPU devices within the same network). To address this difference, the authors suggest a custom task scheduler that splits workloads based on the target architecture. The entire workload is split into chunks and each chunk is assigned to an architecture based on precalculated ratios for each system. To showcase the benefit of this approach they highlighted the speedup of the system over the brute force version of the motif search problem.

Parallel Dictionary Motif Finder (PDMF)

There exist many different high-performance computing architectures that we can use to accelerate the DMF algorithm. In this thesis, we explore multicore CPUs and GPUs as a means of providing multiple computing cores that can work on the motif search. Three proposed models that run on CPU only, on GPU only, and on a heterogeneous platform (both CPU and GPU), each parallelizing different points of the DMF algorithm, are explored in the next three sections.

PDMFm (MultiCore version)

One of the simplest architectures for parallelizing algorithms is the multicore CPU present in almost every modern computer. Commodity desktop processors usually provide some form of multicore processing with anywhere between 2-4 individual cores that can process data independently of each other. Furthermore, many CPUs also support simultaneous multithreading (SMT), which effectively doubles the number of hardware threads available on a chip. Since all the cores are visible to the operating system, a multicore-focused approach can access system memory over a very fast memory bus, leading to a simple parallel solution that many algorithms can treat as just one procedural call for large computing power requests.

Due to the nature of the motif searching problem we cannot employ a straightforward data parallelism strategy, because a globally optimal motif is required. Splitting our datasets into partitions would not allow us to use a traversal tree that is able to filter motifs based on a branch-and-bound strategy. With DMF, we can look at two possible ways to accelerate the algorithm that are beneficial to our goal. The storage-less, tree traversal-based nature of the algorithm allows us to partition our solution space so that partitions can be worked on independently at the same time. This form of parallelization can also be detrimental to the algorithm's core purpose, as adding partitions might degrade the bypassing effectiveness between partitions. A simpler form of parallelization can be applied to the total hamming distance computation module. Since each total hamming distance calculation needs to be done on every sequence in the dataset over multiple iterations, accelerating the most computation-intensive procedure will lead to better execution times.

For all multicore processing in our C++ DMF program, we use the OpenMP parallel processing library. The library is optimized and tested with performance in mind while remaining simple for the end user. OpenMP uses native Pthread parallel processing primitives in its base layer, so it is well suited to any machine that is capable of running multithreaded instructions.

Within the OpenMP language, a few important directives used in the pseudocode are described here. omp parallel num_threads { … } refers to a block of code that is ready to be run on multiple processing cores. The num_threads clause gives the number of threads that should be created to run the code within the { … }. Each thread has its own thread data and storage space, and threads execute independently of each other. Since all threads are connected to the same global system memory, they are all able to access any valid slot within the program space, allowing for global data processing. To synchronize data accesses between threads, shared variables can be used to both affect and inherit scope from the parent code or module. To prevent race conditions caused by multiple threads accessing the same memory slot, we can use an omp critical { … } directive that serializes access for all code within the { … }.

PDMFm1 – Multicore Model 1

The first multicore model in the program partitions the solution space and allocates computing resources to each individual partition. Figure 7 shows a possible split of the L-mer search tree that uses 8 threads. In each partition, a thread will utilize the

NEXTVERTEX and BYPASS functions to traverse its section of the search tree within the boundaries of the start and end L-mers.
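To make these directives concrete, the following is a minimal, hypothetical C++/OpenMP sketch of the per-partition thread setup just described; the function and variable names (searchPartition, bestDistance) are invented for illustration and this is not the actual PDMF source.

#include <omp.h>
#include <climits>
#include <cstdio>

// Placeholder for the per-partition tree traversal; returns the best total
// hamming distance found inside [start, end).
static long long searchPartition(long long start, long long end) {
    return start % 97;  // dummy work for the sketch
}

int main() {
    const int threadCount = 8;
    const long long spaceSize = 1LL << 16;           // stands in for 4^L
    long long bestDistance = LLONG_MAX;

    #pragma omp parallel num_threads(threadCount)
    {
        int tid = omp_get_thread_num();
        long long chunk = spaceSize / threadCount;   // local search space size
        long long start = tid * chunk;
        long long end   = (tid == threadCount - 1) ? spaceSize : start + chunk;

        long long localBest = searchPartition(start, end);

        // Serialize updates to the shared best score, as in Algorithm 15.
        #pragma omp critical
        {
            if (localBest < bestDistance) bestDistance = localBest;
        }
    }
    std::printf("best distance: %lld\n", bestDistance);
    return 0;
}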

Figure 7. PDMFm1 computation model

In order to determine where each partition’s start and end points lie, irrespective of the number of threads, an algorithm that splits the entire search space into even portions was created. For example, for a motif search problem of motif length L = 3 and two threads, the first thread would handle all the motifs from AAA…CTT and the second thread would compute on motifs between GAA…TTT. Since we use an array-based numerical representation for a motif, the algorithm used to find the starting motif for a thread is shown in Algorithm 14.

Algorithm 14: DISPLACEMENTCOMP (thread_size, L, result)
//thread_size is the local searching space size
//result is an array of values, each ranged 0...3 for A...T
1 if (thread_size < 4)
2     result[L] ← thread_size
3 else
4     for i ← 1 to L
5         result[L - i] ← thread_size % 4   // alphabet size 4
6         thread_size ← thread_size / 4     // alphabet size 4
7 return result
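A compact C++ rendering of the same idea is shown below purely as an illustration: a linear offset into the 4^L search space is converted to base-4 digits, one digit per motif position. The function name and template parameter are assumptions, not the PDMF source.

#include <array>
#include <cstdio>

template <int L>
std::array<int, L> displacementComp(long long offset) {
    std::array<int, L> result{};            // all positions start at 0 (i.e., 'A')
    for (int i = L - 1; i >= 0 && offset > 0; --i) {
        result[i] = static_cast<int>(offset % 4);   // alphabet size 4
        offset /= 4;
    }
    return result;
}

int main() {
    // For L = 3 and two threads, the second thread starts at offset 4^3 / 2 = 32,
    // which maps to digits (2, 0, 0) -> "GAA", matching the example in the text.
    auto start = displacementComp<3>(32);
    std::printf("%d %d %d\n", start[0], start[1], start[2]);   // prints 2 0 0
    return 0;
}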

The PDMF algorithm using the multicore parallelization strategy described in this section is shown in Algorithm 15. Lines 6 and 7 refer to the start and end motif generation phase. The update of the best scoring motif (lines 31-33) is wrapped in an omp critical directive because it accesses the global best score, preventing race conditions.

Algorithm 15: PDMFM1 (DNA, t, L)
//Assume: TOTALDISTCOMPUTE (v, DNA) returns total hamming distance
//        between pattern v and all DNA sequences with OOPS
1 best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2 omp parallel num_threads (thread_count)
3 {
4     thread_id ← omp_get_thread_num()
5     thread_size ← 4^L / thread_count
6     a = e ← (1, 1, ..., 1) //initial local start, end L-mers, AA...AA
7     a ← a + DISPLACEMENTCOMP (thread_id * thread_size, L, a)
8     e ← e + DISPLACEMENTCOMP ((thread_id + 1) * thread_size - 1, L, e)
9     i ← L //starts from the leaf level
10    while (i > 0)
11        if (i < L) //non-leaf vertex
12            prefix ← nucl. symbols corresponding to (a1, a2, …, ai)
13            end_prefix ← nucl. symbols corresponding to (e1, e2, ..., ei)
14            if (prefix.substring(0, i) > end_prefix.substring(0, i))
15                break //terminates thread
16            optimistic_dist ← TOTALDISTCOMPUTE (prefix, DNA)
17            if (optimistic_dist > best_distance)
18                (a, i) ← BYPASS (a, i, L, 4) //skips subtree
19            else
20                (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
21        else //leaf vertex
22            word ← nucl. symbols corresponding to (a1, a2, …, aL)
23            end_word ← nucl. symbols corresponding to (e1, e2, ..., eL)
24            if (word.substring(0, i) > end_word.substring(0, i))
25                break //terminates thread
26            optimistic_dist ← OPTIMISTICDISTCOMPUTE (UM, word, t)
27            if (optimistic_dist ≥ best_distance)
28                skip this L-mer
29            else //need to compute total hamming distance
30                if (x ← TOTALDISTCOMPUTE (word, DNA) < best_distance)
31                    omp critical
32                        best_distance ← x
33                        best_pattern ← word
34            (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
35 }//omp
36 return best_distance/pattern

PDMFm2 – Multicore Model 2

Each vertex in the tree requires a total hamming distance calculation before it can

make the decision of whether to bypass or traverse the subtree. The second multicore-based parallelization strategy looks at accelerating the total hamming distance computation module, which is a computational bottleneck in DMF. A data parallel approach

can be used in this approach as the hamming distance between a motif and a set of sequences can be split up based on the number of threads available.

Figure 8. PDMFm2 block division computation model
Figure 9. PDMFm2 cyclic division computation model

We can use two different strategies for assigning sequences to threads in this approach. A simple block division method where each thread is responsible for a set of sequences (shown in Figure 8) or a cyclic allocation strategy (shown in Figure 9) can be

used for the total hamming distance computation. For the block division strategy, we use the OpenMP directive omp parallel for num_threads { … }, where num_threads stands for the number of blocks into which we want to partition the iterations. This directive automatically splits the for loop iterations into equal partitions without any additional code.
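As a hedged illustration of this block-division strategy, the sketch below sums the per-sequence best hamming distances with an OpenMP parallel for and a reduction, so each thread keeps its own running total. The data layout and names (totalDistCompute, bestHammingInSequence) are assumptions, not the PDMF implementation.

#include <omp.h>
#include <string>
#include <vector>
#include <cstdio>

static int bestHammingInSequence(const std::string& seq, const std::string& motif) {
    const int L = static_cast<int>(motif.size());
    int best = L;                                    // worst possible value
    for (size_t s = 0; s + L <= seq.size(); ++s) {
        int d = 0;
        for (int j = 0; j < L; ++j) d += (seq[s + j] != motif[j]);
        if (d < best) best = d;
    }
    return best;
}

int totalDistCompute(const std::vector<std::string>& dna, const std::string& motif) {
    int total = 0;
    // OpenMP splits the iteration range into blocks, one block per thread.
    #pragma omp parallel for num_threads(8) reduction(+ : total)
    for (int i = 0; i < static_cast<int>(dna.size()); ++i)
        total += bestHammingInSequence(dna[i], motif);
    return total;
}

int main() {
    std::vector<std::string> dna = {"ACGTACGT", "TTTTACGG", "ACGAACGA"};
    std::printf("total distance: %d\n", totalDistCompute(dna, "ACG"));
    return 0;
}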

Additional runtime optimizations are automatically added by the OpenMP library. With both strategies, each thread is responsible for keeping track of its own running hamming

distance total, and the totals are summed at the end of the computation.

PDMFm3 – Multicore Model 3

A hybrid model consisting of the approaches shown in both PDMFm1 and

PDMFm2 was also developed and tested. In this approach, a combination of threads for both search space partitioning and total hamming distance computation was utilized. As shown in Figure 10, this approach requires a large number of threads and can lead to fewer idle threads during runtime.


Figure 10. PDMFm3 computation model

A case can be made that Model 1 and Model 2 are the two extremes of the hybrid model, where either only search space partitioning or only database partitioning is used. The performance of multiple combinations of threads was tested for this approach.

PDMFg (GPU version)

Early GPUs provided a way of accelerating graphics calculations in a computer system to relieve the additional load from the CPU. The nature of graphical calculations required the same instruction to be run over many pieces of data (the pixels on the screen) at a much larger data size. To perform such a large number of calculations on the GPU, specialized hardware was added to the device as a means of accelerating the computations. Furthermore, a global memory storage space was added to give the ALU (Arithmetic Logic Unit) chips on the device access to fast random-access memory. Programmable pixel and vertex shader units allowed faster floating point and looping calculations to be implemented with relative ease. These additions allowed GPUs to become very specialized in the area of graphical acceleration.

Researchers in the late 2000s noticed that the architecture of GPUs was ideal for high performance computing applications in various research fields. With GPUs present in many regular desktop computers, a large computing resource was available at a relatively low cost. However, early efforts in GPU programming had to use graphical storage spaces to run code. Data had to be passed into the device as texture maps and other forms of data storage meant for creating images. The Nvidia CUDA platform, introduced in 2007, allowed the processing units (called CUDA cores) to be used with standard programming language definitions. This platform was based on the

C and C++ languages due to their strict type definitions and static data types. Within a GPU, threads execute code on specialized hardware. The architecture of this device is built with highly parallel data processing in mind. Within a GPU, a group of 32 threads is called a warp, and a thread block can have up to 1024 threads at a time. Each thread block is assigned to a streaming multiprocessor (SM) on the GPU. These SMs can process warps of threads at any given time. Threads within a warp run in SIMD (Single Instruction Multiple Data) mode, where a single instruction is run across all the given data at the same time. Figure 11 shows the organization of threads in a GPU system. To access the threads in a block, CUDA uses a three-dimensional indexing scheme that is referred to by x, y, and z coordinates.

Figure 11. GPU parallel architecture

For example, a thread block can be instantiated with (32, 32, 1), which creates up to 1024 threads. Furthermore, we can allocate many thread blocks, each with up to 1024 threads, in a CUDA kernel. There is a hard limit of 65535 blocks in any kernel call. Since each one of these blocks can have many threads, certain programming applications scale well to this parallelization style, as blocks can be mapped to data structures and threads to elements within the data structures. There is no inter-block communication allowed in the CUDA programming paradigm, but threads within a block can communicate with the help of small and limited shared data structures. One final point to mention is that, since warps are run on an SM at a time, it is ideal to pick thread counts that are multiples of 32. This allows each SM to queue threads in an order that is ideal for fast and efficient processing.

For PDMF, we can utilize the large number of available threads to improve the runtime of our computation-intensive algorithm. Since the goal is to build a heterogeneous approach that combines the best CPU and GPU based models, two GPU-only models were developed to handle the task. An important precursor to developing the models was the handling of data inside the GPU memory. Since the GPU has its own device memory, separate from and inaccessible to system main memory, computation data has to be explicitly copied to the device's local memory. To avoid wasting time, we created a double buffering algorithm that copies database sequences to the GPU at the same time they are read into CPU memory. This allows us to hide the memory latency during data transfer and reduces overhead later in the program. Algorithm 16 showcases this mechanism in detail.

Algorithm 16: DOUBLEBUFFERING (dataset, seq_array, device_array)
1 i ← 1
2 read_flag ← true
3 buffer_size ← x //variable size
4 if read_thread
5     while (sequence ← dataset)
6         seq_array[i] ← sequence
7         i ← i + 1
8     read_flag ← false
9 else //write thread
10    buffer_start ← 0
11    while (read_flag || buffer_start + buffer_size < dataset.length)
12        device_copy(device_array, seq_array, buffer_start, buffer_size)
13        buffer_start ← buffer_start + buffer_size
14    buffer_size ← dataset.length - buffer_start //move excess data
15    if (buffer_size != 0)
16        device_copy(device_array, seq_array, buffer_start, buffer_size)
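For illustration only, the sketch below shows one way the double-buffering idea could be expressed with a reader thread and a writer thread copying chunks to a CUDA device buffer. The chunk size, the fixed-width sequence layout, and all names are assumptions rather than the actual PDMF implementation.

#include <cuda_runtime.h>
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

constexpr int kSeqLen   = 600;     // assumed fixed sequence length
constexpr int kSeqCount = 1024;    // assumed dataset size
constexpr int kChunk    = 128;     // sequences copied per transfer

int main() {
    std::vector<char> host(kSeqCount * kSeqLen);
    char* device = nullptr;
    cudaMalloc(&device, host.size());

    std::atomic<int> ready(0);     // number of sequences parsed so far
    std::atomic<bool> done(false);

    std::thread reader([&] {
        for (int i = 0; i < kSeqCount; ++i) {
            // ... parse sequence i from the dataset into host memory ...
            ready.store(i + 1);
        }
        done.store(true);
    });

    std::thread writer([&] {
        int copied = 0;
        while (copied < kSeqCount) {
            if (ready.load() - copied >= kChunk || done.load()) {
                int n = std::min(kChunk, ready.load() - copied);
                if (n == 0) continue;
                cudaMemcpy(device + copied * kSeqLen, host.data() + copied * kSeqLen,
                           n * kSeqLen, cudaMemcpyHostToDevice);
                copied += n;       // overlap transfers with the reader thread
            }
        }
    });

    reader.join();
    writer.join();
    cudaFree(device);
    return 0;
}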

Since we need to manage the number of GPU threads and blocks we create in a kernel call, we use a maximized block memory usage policy to help with the overall efficiency of the program. This involves creating more threads per block within fewer overall blocks, as opposed to more blocks with fewer threads per block. As noted previously, Nvidia GPUs use a 32-thread warp as the basic building block for processing data. Any multiple of this number allows for an efficient processing cycle on the device. In this thesis, we empirically elected to use 64 threads per block for all processing, with this number being adjustable as we see fit for processing larger database sizes. To determine the number of blocks needed in PDMF, we used the following calculation:

block_size ← (db.size() + thread_count − 1) / thread_count

This allows us to use the minimum number of blocks needed based on the current number of threads per block. The main goal of the GPU is to accelerate the total hamming distance computation in DMF. Since this portion of the algorithm is inherently data parallel, the GPU is one of the best devices for this purpose. The GPU version of the simple motif search total hamming distance computation module is shown in Algorithm 17. We use a simple cyclic assignment methodology to reuse threads if we need to; in a real-world database, we would rarely need to, since the number of sequences in a dataset is well below the thread limitations of the GPU.

Algorithm 17: GPUTOTALDISTANCE (a, i, dataset, L, tot_dist)
1 seq_num ← thread_id + block_id * block_dimension //thread's data access position
2 tot_dist ← 0
3 while (current sequence is in dataset)
4     sequence ← dataset[seq_num]
5     counter ← 0 //substring start position
6     best_hamm_dist ← L //initially worst value
7     while (counter < sequence.length – L + 1)
8         word ← sequence.substr(counter, L)
9         if (hamm_dist(word, a) < best_hamm_dist)
10            best_hamm_dist ← hamm_dist(word, a)
11        counter ← counter + 1
12    atomicAdd: tot_dist ← tot_dist + best_hamm_dist
13    seq_num ← seq_num + block_dimension * grid_dimension
14 return tot_dist
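A hedged CUDA rendering of Algorithm 17 is shown below. It assumes sequences are stored in a flat, fixed-width device array; the kernel name and data layout are illustrative and not taken from the PDMF source.

#include <cuda_runtime.h>

__global__ void gpuTotalDistance(const char* sequences, int seqCount, int seqLen,
                                 const char* motif, int L, int* totalDist) {
    int seqNum = blockIdx.x * blockDim.x + threadIdx.x;   // thread's data position
    while (seqNum < seqCount) {
        const char* seq = sequences + seqNum * seqLen;
        int best = L;                                     // initially worst value
        for (int start = 0; start <= seqLen - L; ++start) {
            int d = 0;
            for (int j = 0; j < L; ++j) d += (seq[start + j] != motif[j]);
            if (d < best) best = d;
        }
        atomicAdd(totalDist, best);                       // accumulate per-sequence best
        seqNum += blockDim.x * gridDim.x;                 // cyclic (grid-stride) reuse of threads
    }
}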

PDMFg1 – GPU Model 1

The first GPU model for PDMF is a simple adaptation of the PDMFm algorithms. It uses one CPU thread for kernel control and tree traversal while the total hamming distance function runs under GPU control. Figure 12 shows the computation model of this GPU-based solution. The PDMFg1 algorithm is shown in Algorithm 18.


Figure 12. PDMFg1 computation model

Algorithm 18: PDMFG1 (DNA, t, L)
//Assume: GPUTOTALDISTCOMPUTE (v, DNA) returns total hamming distance
//        between pattern v and all DNA sequences with OOPS
//        kernel called with b blocks and th threads per block
1 best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2 a ← (1, 1, …, 1) //starts from the 1st leaf vertex, AA...AA
3 i ← L //starts from the leaf level
4 while (i > 0)
5     if (i < L) //non-leaf vertex
6         prefix ← nucleotide symbols corresponding to (a1, a2, ..., ai)
7         optimistic_dist ← TOTALDISTCOMPUTE (prefix, DNA)
8         if (optimistic_dist ≥ best_distance)
9             (a, i) ← BYPASS (a, i, L, 4) //skips subtree
10        else
11            (a, i) ← NEXTVERTEX (a, i, L, 4)
12    else //leaf vertex
13        word ← nucleotide symbols corresponding to (a1, a2, ..., aL)
14        optimistic_dist ← OPTIMISTICDISTCOMPUTE (UM, word, t)
15        if (optimistic_dist ≥ best_distance)
16            skip this L-mer
17        else //need to compute total hamming distance
18            if (x ← GPUTOTALDISTCOMPUTE<<<b, th>>> (word, DNA) < best_distance)
19                best_distance ← x
20                best_pattern ← word
21        (a, i) ← NEXTVERTEX (a, i, L, 4)
22 return best_distance/pattern

Line 18 refers to the kernel call itself where we instruct the GPU total hamming distance function to start computation. The values within the <<< >>> on line 18 refer to the number of blocks and threads we want the kernel call to run with. This model’s structure is similar to the PDMFm2 model where only the total hamming distance computation is accelerated.
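For reference, a possible host-side launch for this model is sketched below; it applies the 64-threads-per-block policy and the block-count formula given earlier. The kernel signature is the one assumed in the previous sketch, and all names are placeholders rather than the PDMF source.

#include <cuda_runtime.h>

// Assumed to be the kernel from the previous sketch, defined elsewhere.
__global__ void gpuTotalDistance(const char*, int, int, const char*, int, int*);

int launchTotalDistance(const char* dSeqs, int seqCount, int seqLen,
                        const char* dMotif, int L) {
    const int threadsPerBlock = 64;                                   // multiple of the 32-thread warp
    const int blocks = (seqCount + threadsPerBlock - 1) / threadsPerBlock;

    int* dTotal = nullptr;
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemset(dTotal, 0, sizeof(int));

    gpuTotalDistance<<<blocks, threadsPerBlock>>>(dSeqs, seqCount, seqLen, dMotif, L, dTotal);

    int total = 0;
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel to finish
    cudaFree(dTotal);
    return total;
}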

PDMFg2 – GPU Model 2

The second GPU model is based on the PDMFm3 model, where we combine solution search space partitioning with total hamming distance computation. The GPU is able to run multiple instances of the same kernel call, provided there is enough memory to hold all the necessary data. Since PDMFg1 leaves multicore threads idle, the rationale behind this approach is to allow more threads to start their own kernel call instances while also following the tree traversal approach. One concern with this approach is the likelihood of a reduced bypass ratio, since possible bypass paths are interrupted by the vertical partitioning of the tree. The GPU also has a limit on the number of simultaneous kernel calls it can run at the same time. Since thread blocks are processed on the streaming multiprocessors, a GPU will be forced to run warps in a serial manner if all the available SMs are currently processing data. Figure 13 shows the computation model of this approach.

For this PDMF model, we need to explicitly define a stream that the kernel call will run on. If a stream value is not provided in the function call, CUDA will issue every instance of the function on the default stream, which serializes the kernel calls. This is equivalent to running the PDMF program serially. PDMFg2's pseudocode is shown in Algorithm 19.
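The sketch below illustrates, under assumed names, how separate CUDA streams can be created and passed as the launch parameter so that kernel instances are eligible to overlap instead of serializing on the default stream; it is not the PDMFg2 implementation itself.

#include <cuda_runtime.h>

// Assumed to be the kernel sketched earlier, defined elsewhere.
__global__ void gpuTotalDistance(const char*, int, int, const char*, int, int*);

void launchOnStreams(const char* dSeqs, int seqCount, int seqLen,
                     const char* dMotifs, int L, int* dTotals, int streamCount) {
    cudaStream_t* streams = new cudaStream_t[streamCount];
    for (int s = 0; s < streamCount; ++s) cudaStreamCreate(&streams[s]);

    const int threadsPerBlock = 64;
    const int blocks = (seqCount + threadsPerBlock - 1) / threadsPerBlock;

    // One candidate motif per stream; the fourth launch parameter selects the stream.
    for (int s = 0; s < streamCount; ++s)
        gpuTotalDistance<<<blocks, threadsPerBlock, 0, streams[s]>>>(
            dSeqs, seqCount, seqLen, dMotifs + s * L, L, dTotals + s);

    for (int s = 0; s < streamCount; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
    delete[] streams;
}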


Figure 13. PDMFg2 computation model

Algorithm 19: PDMFG2 (DNA, t, L)
//Assume: GPUTOTALDISTCOMPUTE (v, DNA) returns total hamming distance
//        between pattern v and all DNA sequences with OOPS
//        kernel called with b blocks, st streams and th threads per block
1 best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2 omp parallel num_threads (thread_count) {
3     thread_id ← omp_get_thread_num()
4     thread_size ← 4^L / thread_count
5     a = e ← (1, 1, ..., 1) //initial local start, end L-mers, AA…AA
6     a ← a + DISPLACEMENTCOMP (thread_id * thread_size, L, a)
7     e ← e + DISPLACEMENTCOMP ((thread_id + 1) * thread_size - 1, L, e)
8     i ← L //starts from the leaf level
9     while (i > 0)
10        if (i < L) //non-leaf vertex
11            prefix ← nucl. symbols corresponding to (a1, a2, …, ai)
12            end_prefix ← nucl. symbols corresponding to (e1, e2, ..., ei)
13            if (prefix.substring(0, i) > end_prefix.substring(0, i))
14                break //terminates thread
15            optimistic_dist ← TOTALDISTCOMPUTE (prefix, DNA)
16            if (optimistic_dist > best_distance)
17                (a, i) ← BYPASS (a, i, L, 4) //skips subtree
18            else
19                (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
20        else //leaf vertex
21            word ← nucl. symbols corresponding to (a1, a2, …, aL)
22            end_word ← nucl. symbols corresponding to (e1, e2, ..., eL)
23            if (word.substring(0, i) > end_word.substring(0, i))
24                break //terminates thread
25            optimistic_dist ← OPTIMISTICDISTCOMPUTE (UM, word, t)
26            if (optimistic_dist ≥ best_distance)
27                skip this L-mer
28            else //need to compute total hamming distance
29                if (x ← GPUTOTALDISTCOMPUTE<<<b, th, st>>> (word, DNA) < best_distance)
30                    omp critical
31                        best_distance ← x
32                        best_pattern ← word
33            (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
34 }//omp
35 return best_distance/pattern

The third parameter in the <<< >>> in line 29 refers to the stream that the kernel call is supposed to run in. Arguably, this model could be called heterogeneous, but in this thesis we decided not to do so. Since the CPU threads only work on tree traversal and kernel launch points, we classify this model as a GPU model because most of the computational work is done on the GPU.

PDMFracing (Heterogeneous Model)

In order to better utilize the resources at hand, we designed a new heterogeneous model that makes use of the available CPU threads on the system. Since we have two separate architectures on our system, we decided to use the slower

CPU architecture to process its share of the solution space at its own fastest speed. Figure 14 shows the concept of the computation model used in PDMFracing. By splitting the tree space into two halves, we have the CPU and GPU working on different sections of the tree. Since one CPU thread is used for tree traversal and kernel control, the remaining threads are used for the total hamming distance computation of the second half. Since the GPU's computation power is much higher than that of the CPU, the race ends with the GPU's win at the center point of the search tree. At this point, the CPU gives up control of the tree and the GPU immediately resumes execution at the point where the CPU left off. This role change considers BYPASS and

NEXTVERTEX instructions in order to seamlessly transition into the CPU's execution path. The purpose of this model is to alleviate the GPU load as much as possible: the CPU's execution time is hidden because it becomes the performance gain over the basic GPU computation model.

Figure 14. PDMFracing computation model

By sharing the total workload between the two computing devices using the proposed dynamic load balancing scheme, we can achieve the highest performance gain in the heterogeneous system. To achieve a better load balancing result, we could repeatedly divide the remaining tree space into two halves and execute the racing-based mechanism again until the end of execution. Although this might be beneficial for longer motifs, where a larger searching space is possible, more control overhead is required and the reduced bypass ratio is a concern. Therefore, we limit our execution to a single swap.

Enhanced Heterogeneous Model with SIMD Vectors - PDMFh

After developing PDMFracing, we decided to use a different methodology to improve the workload balancing of the system. A converging processing scheme is better at running the two architectures at full speed, allowing the most work to be done by both systems since each one runs without any hindrance. To facilitate this, the two systems start on opposite sides of the search tree and work towards each other. This mirroring allows both sides to keep their bypassing effectiveness, as each entity assumes it can traverse the entire tree. The program ends when they cross each other, and the best motif has been found. The concept of the computation model used in PDMFh is shown in Figure 15.

Figure 15. PDMFh computation model

In order to improve the computation speed of the CPU, we decided to implement CPU-based SSE instructions using vectors. SSE (Streaming SIMD Extensions) is a SIMD

(Single Instruction Multiple Data) instruction set that can be executed on the x86 architecture. Introduced by Intel in 1999, it allowed CPUs to use the floating-point hardware present in the device to perform an instruction on multiple data items at a time. This can be thought of as data parallel operation in a typical parallel system. In

SSE instructions, however, the data needs to be packed into a single 128-bit data structure in order to be processed. Furthermore, only a limited number of SSE instructions are present in the instruction set. Since these commands are meant for highly parallel workloads such as graphics or image processing, finding other applications that benefit from SSE instructions can be challenging. However, PDMF's total hamming distance calculation only requires equality and addition operators, which allows us to massively parallelize the workload within our motif search. With SSE instructions, we work on packed data structures known as vectors. These vectors are loaded into dedicated vector registers in the CPU, and each instruction executes in roughly one clock cycle. Integers in the C++ platform are encoded with 32 bits, and SSE instructions are usually run on 128-bit or 256-bit packed vectors, which allows us to pack 4 or 8 integers into a vector respectively. In a single CPU core, using 256-bit vectors, we are able to process 8 data items in one instruction. With an SMT-capable CPU, we effectively double the amount of computation power available due to the additional processing achieved by interleaving instructions in an execution cycle. On a modern Core i7 carrying up to 6 hyper-threaded cores, we can effectively run an instruction on 96 items of data (6 cores * 2 for hyper-threading * 8 data items). This is a massive increase from the regular 12 threads we would be able to run using PDMFm.

The example below shows five regular integer instructions:

vu ← v1u + v2u
vw ← v1w + v2w
vx ← v1x + v2x
vy ← v1y + v2y
vz ← v1z + v2z

We can condense these five operations into one SIMD instruction. SSE instructions can only be executed on vectors, so we need to load the integers into registers on the CPU.

vecLoad v1 xmm0      // load v1 into the xmm0 register
vecLoad v2 xmm1      // load v2 into the xmm1 register
vecAdd xmm1 xmm0     // add all integer components and store the output in xmm1
vecExtract v xmm1    // load the components of xmm1 into v
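The same condensation can be expressed with real SSE intrinsics. The sketch below, with assumed variable names, performs four 32-bit additions in a single vector instruction; the intrinsics loosely mirror the vecLoad/vecAdd/vecExtract pseudo-instructions above.

#include <emmintrin.h>   // SSE2
#include <cstdint>
#include <cstdio>

int main() {
    alignas(16) int32_t v1[4] = {1, 2, 3, 4};
    alignas(16) int32_t v2[4] = {10, 20, 30, 40};
    alignas(16) int32_t v[4];

    __m128i xmm0 = _mm_load_si128(reinterpret_cast<const __m128i*>(v1)); // vecLoad v1
    __m128i xmm1 = _mm_load_si128(reinterpret_cast<const __m128i*>(v2)); // vecLoad v2
    xmm1 = _mm_add_epi32(xmm1, xmm0);                                    // vecAdd: 4 adds at once
    _mm_store_si128(reinterpret_cast<__m128i*>(v), xmm1);                // vecExtract

    std::printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);                // 11 22 33 44
    return 0;
}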

This form of parallelism requires additional computation in the preparation of vectors and other non-intuitive ways to perform simple instructions, but the amount of processing derived from the packing of these integers far outweighs the cost. In our PDMF system, since we can use 8-bit integers, due to motif lengths not being higher than

128, and pack 128-bit vectors we can run 16 data items with the same instruction.

Since we are only accelerating the total hamming distance function, we can reuse the GPU total hamming distance function of PDMFg2 on our heterogeneous system. However, we need to rewrite how our next vertex and bypass algorithms are run. Since they work in reverse, the next motif in our solution space needs to be the predecessor of the current motif. The new BYPASSREVERSE and NEXTVERTEXREVERSE functions are shown in Algorithms 20 and 21, respectively.

Algorithm 20: BYPASSREVERSE (a, i, L, k)
1 for j ← i to 1
2     if aj > 1
3         aj ← aj - 1
4         return (a, j)
5 return (a, 0)

Algorithm 21: NEXTVERTEXREVERSE (a, i, L, k)
1 if i < L
2     ai+1 ← 4
3     return (a, i + 1)
4 else
5     for j ← L to 1
6         if aj > 1
7             aj ← aj - 1
8             return (a, j)
9 return (a, 0)

The CPU and GPU sides of our PDMFh algorithm differ slightly. Algorithms 22 and 23 show the execution cycle of each, respectively.

Algorithm 22: PDMFHCPU (DNA, t, L)
//Assume: TOTALDISTCOMPUTE (v, DNA) returns total hamming distance
//        between pattern v and all DNA sequences with OOPS
1 best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2 omp parallel num_threads (thread_count) {
3     thread_id ← omp_get_thread_num()
4     thread_size ← 4^L / thread_count
5     a = e ← (1, 1, ..., 1) //initial local start, end L-mers, AA…AA
6     a ← a + DISPLACEMENTCOMP (thread_id * thread_size, L, a)
7     e ← e + DISPLACEMENTCOMP ((thread_id + 1) * thread_size - 1, L, e)
8     i ← L //starts from the leaf level
9     while (i > 0)
10        if (i < L) //non-leaf vertex
11            prefix ← nucl. symbols corresponding to (a1, a2, …, ai)
12            end_prefix ← nucl. symbols corresponding to (e1, e2, ..., ei)
13            if (prefix.substring(0, i) > end_prefix.substring(0, i))
14                break //terminates thread
15            optimistic_dist ← TOTALDISTCOMPUTE (prefix, DNA)
16            if (optimistic_dist > best_distance)
17                (a, i) ← BYPASS (a, i, L, 4) //skips subtree
18            else
19                (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
20        else //leaf vertex
21            word ← nucl. symbols corresponding to (a1, a2, …, aL)
22            if (word.substring(0, i) > GPU_word) //check for GPU position
23                break //terminates thread
24            optimistic_dist ← OPTIMISTICDISTCOMPUTE (UM, word, t)
25            if (optimistic_dist ≥ best_distance)
26                skip this L-mer
27            else //need to compute total hamming distance
28                if ((xd, xq) ← SSETOTALDISTCOMPUTE (word, d, DNA) < best_distance)
29                    if (xq > q) //check quorum
30                        omp critical
31                            best_distance ← xd
32                            best_pattern ← word
33            (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
34 }//omp
35 return best_distance/pattern

Algorithm 23: PDMFHGPU (DNA, t, L)
//Assume: GPUTOTALDISTCOMPUTE (v, DNA) returns total hamming distance
//        between pattern v and all DNA sequences with OOPS
//        kernel called with b blocks and th threads per block
1 best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2 omp parallel num_threads (thread_count) {
3     thread_id ← omp_get_thread_num()
4     thread_size ← 4^L / thread_count
5     a = e ← (1, 1, ..., 1) //initial local start, end L-mers, AA…AA
6     a ← a + DISPLACEMENTCOMP (thread_id * thread_size, L, a)
7     e ← e + DISPLACEMENTCOMP ((thread_id + 1) * thread_size - 1, L, e)
8     i ← L //starts from the leaf level
9     while (i > 0)
10        if (i < L) //non-leaf vertex
11            prefix ← nucl. symbols corresponding to (a1, a2, …, ai)
12            end_prefix ← nucl. symbols corresponding to (e1, e2, ..., ei)
13            if (prefix.substring(0, i) > end_prefix.substring(0, i))
14                break //terminates thread
15            optimistic_dist ← TOTALDISTCOMPUTE (prefix, DNA)
16            if (optimistic_dist > best_distance)
17                (a, i) ← BYPASSREVERSE (a, i, L, 4) //skips subtree
18            else
19                (a, i) ← NEXTVERTEXREVERSE (a, i, L, 4) //finds next vertex
20        else //leaf vertex
21            word ← nucl. symbols corresponding to (a1, a2, …, aL)
22            if (word.substring(0, i) < CPU_word) //check for CPU position
23                break //terminates thread
24            optimistic_dist ← OPTIMISTICDISTCOMPUTE (UM, word, t)
25            if (optimistic_dist ≥ best_distance)
26                skip this L-mer
27            else //need to compute total hamming distance
28                if ((xd, xq) ← GPUTOTALDISTCOMPUTE<<<b, th>>> (word, DNA) < best_distance)
29                    if (xq > q) //check quorum
30                        omp critical
31                            best_distance ← xd
32                            best_pattern ← word
33            (a, i) ← NEXTVERTEXREVERSE (a, i, L, 4) //finds next vertex
34 }//omp
35 return best_distance/pattern

We utilize two separate threads to handle the CPU and GPU instances of our

PDMFh algorithm. The threads are independent of one another and constantly check on each other's progress to prevent duplicate processing.

The CPU total hamming distance function has been rewritten to handle SIMD-based processing. In this approach, we use a channel-based method to handle the sequences passing through the vectors. Figure 16 shows an 8-channel example with the sequences currently being processed. The sequences in light grey are next in line on a channel and will be processed when the current sequence is complete.

Figure 16. SIMD Vector execution on multiple sequences

The SSE instruction set only accounts for about 70 vector instructions present on a CPU. SSE2, SSE3 and SSE4 are updates to the original instruction set that add support for additional instructions that PDMF requires. Certain modern CPUs also have access to

wider 256-bit and 512-bit vectors through the AVX and AVX-512 instruction sets, but our hardware is not capable of running these instructions. Since the majority of our instructions come from the SSE4 package, we will show those commands as native instructions and refer to them as SSE or SIMD instructions in this thesis.

For our total hamming distance function, we need to create some helper functions within our SSE computation cycle. Since the hamming distance looks at the differences between the motif string and our database sequence, we need to use a SIMD inequality instruction. However, such an instruction does not exist in the SSE instruction set. Therefore, we must use the equality operator and subtract the difference to account for the inequalities. Algorithm 24 shows the SIMD-based total hamming distance function.

Algorithm 24: SSETOTALDISTCOMPUTE (DNA, t, L)
1 for (i ← 0 to channel_count)
2     . . . //initialize channel array with channel object
3 while (si < t)
4     motifVctr ← Mj
5     . . . //load sequences into vector units
6     resultVctr ← vctrNotEqual (motifVctr, seqVctr) //vector comparison instruction
7     totalVctr ← vctrAdd (resultVctr, totalVctr) //add output to tile total
8     if (j = L)
9         bestDistanceVctr ← vectorMin (totalVctr, bestDistanceVctr, i)
10    if (end of sequence)
11        bestDistance_i ← extract (bestDistanceVctr, i)
12        if (bestDistance_i < d)
13            totDist ← totDist + bestDistance_i
14            xq ← xq + 1
15        . . . //load new sequence into channel
16 if (xq < q)
17     totDist ← L * t
18 return totDist

There is no per-element minimum operation among the SSE instructions we rely on. To overcome this limitation, a new method of finding the minimum values in a pair of vectors is shown in Figure 17. The pseudocode for this approach is shown in Algorithm 25.

Algorithm 25: VECTORMIN (tileTotalVctr, bestDstVctr)
//tileTotalVctr contains the current hamming distance
//bestDstVctr contains the best hamming distance so far
1 step1Vctr ← vectorEquals (tileTotalVctr, bestDstVctr)
2 step1Vctr ← vectorAnd (step1Vctr, tileTotalVctr)
3 step2Vctr ← vectorLessThan (tileTotalVctr, bestDstVctr)
4 step2Vctr ← vectorAnd (step2Vctr, tileTotalVctr)
5 step3Vctr ← vectorLessThan (bestDstVctr, tileTotalVctr)
6 step3Vctr ← vectorAnd (step3Vctr, bestDstVctr)
7 tempVctr ← vectorAdd (step1Vctr, step2Vctr)
8 bestDistanceVctr ← vectorAdd (step3Vctr, tempVctr)

Figure 17. SSE-based min operation

By using 8 SSE instructions, we can find the minimum values between any two vectors using bit-based mathematics. In line 1 we find the positions with the same value in both vectors and mark all the bits within those positions with 1. Line 2 performs a logical AND with the input vector, so the result holds the values that are equal in both vectors. Line 3 performs a less-than comparison of the two vectors, marking with 1 every position where the tile total vector is less than the best distance vector. When a logical AND is performed with the first vector, only the values that are smaller are kept in the result vector. The reverse of this operation is also conducted with the best distance vector, and the results are added to the results of the previous two steps. Since each step keeps either the lower value or the equal value, running this module finds and keeps the per-position minimum of the two vectors.
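As an illustration of this technique with actual SSE intrinsics, the sketch below selects the per-lane minimum of two vectors of packed 8-bit distances; it mirrors the equal/less-than/AND/ADD steps above and uses illustrative names only. Signed byte comparisons are sufficient here because hamming distances stay well below 128.

#include <emmintrin.h>   // SSE2
#include <cstdint>
#include <cstdio>

__m128i vectorMin(__m128i tileTotal, __m128i bestDst) {
    __m128i eq  = _mm_and_si128(_mm_cmpeq_epi8(tileTotal, bestDst), tileTotal); // equal lanes
    __m128i ltA = _mm_and_si128(_mm_cmplt_epi8(tileTotal, bestDst), tileTotal); // lanes where tile < best
    __m128i ltB = _mm_and_si128(_mm_cmplt_epi8(bestDst, tileTotal), bestDst);   // lanes where best < tile
    return _mm_add_epi8(_mm_add_epi8(eq, ltA), ltB);  // exactly one term is non-zero per lane
}

int main() {
    alignas(16) int8_t a[16] = {3, 7, 2, 9, 3, 7, 2, 9, 3, 7, 2, 9, 3, 7, 2, 9};
    alignas(16) int8_t b[16] = {5, 1, 2, 8, 5, 1, 2, 8, 5, 1, 2, 8, 5, 1, 2, 8};
    alignas(16) int8_t out[16];

    __m128i m = vectorMin(_mm_load_si128(reinterpret_cast<const __m128i*>(a)),
                          _mm_load_si128(reinterpret_cast<const __m128i*>(b)));
    _mm_store_si128(reinterpret_cast<__m128i*>(out), m);
    std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 3 1 2 8
    return 0;
}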

The SSE instructions we use also provide no direct method of setting a specific vector position to a value. Since a vector's contents need to be accessed by an explicit call instruction (with an index referring to the element we would like to read), we need to create a way to set a value in the vector. It is easy to load a vector from an array, but we only need to modify a single value, so rewriting the vector contents from a modified array is costly. To overcome this limitation, we use some SSE-based bit mathematics to set a value in a vector. Algorithm 26 shows the instructions we need to use to complete this task.

Algorithm 26: SETVALUE (aVctr, i, val)
//aVctr corresponds to the running total
//i stands for the location in the vector
//val stands for the value to be inserted
1 setMaskVctr ← loadVector(0)
2 clearMaskVctr ← loadVector(0xFF)
3 valVctr ← loadVector(val)
4 setMaskVctr[i] ← 0xFF
5 clearMaskVctr[i] ← 0
6 resultVctr1 ← vectorAnd (clearMaskVctr, aVctr)
7 resultVctr2 ← vectorAnd (setMaskVctr, valVctr)
8 return vectorAdd (resultVctr1, resultVctr2)

Figure 18 shows an example of this operation. In this example, we want to set the third position in the vector with the value 5. We use arrays to load the mask vectors as these are easily modifiable before loading.
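An illustrative intrinsic-level version of this mask-based set operation might look like the following sketch; the helper name setValue and the mask-array approach are assumptions made for demonstration, not the PDMF source.

#include <emmintrin.h>
#include <cstdint>
#include <cstring>

__m128i setValue(__m128i a, int i, uint8_t val) {
    alignas(16) uint8_t clearMask[16];
    alignas(16) uint8_t setMask[16];
    std::memset(clearMask, 0xFF, 16);       // keep every byte...
    std::memset(setMask, 0x00, 16);
    clearMask[i] = 0x00;                    // ...except position i
    setMask[i]   = 0xFF;                    // only position i receives the new value

    __m128i clearVctr = _mm_load_si128(reinterpret_cast<const __m128i*>(clearMask));
    __m128i setVctr   = _mm_load_si128(reinterpret_cast<const __m128i*>(setMask));
    __m128i valVctr   = _mm_set1_epi8(static_cast<char>(val));

    __m128i kept     = _mm_and_si128(clearVctr, a);        // original bytes, position i zeroed
    __m128i inserted = _mm_and_si128(setVctr, valVctr);    // new value only at position i
    return _mm_add_epi8(kept, inserted);                   // combine (positions never overlap)
}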

Figure 18. SSE vector set operation

Our last helper function is the vector extract module. Since we are packing 16 integers into a 128-bit data structure, we need to be able to extract a specific value from the vector as well. SSE provides an extract instruction called _mm_extract_epi8

(vector, index) but the index variable must be defined at compile time. This translates to the need for immediate values in that field so that appropriate machine code can be generated. A simple way to overcome this issue is to explicitly define the index 16 times in a case statement as follows:

Algorithm 27: VECTOREXTRACT (__m128i a, const int index)
1 switch (index)
2     case 0: return (uint8_t)_mm_extract_epi8(a, 0)
3     case 1: return (uint8_t)_mm_extract_epi8(a, 1)
4     case 2: return (uint8_t)_mm_extract_epi8(a, 2)
…
16    case 14: return (uint8_t)_mm_extract_epi8(a, 14)
17    case 15: return (uint8_t)_mm_extract_epi8(a, 15)

However, this function is not ideal for computation as the switch statement requires unnecessary equality checks for a single extract statement. We have created an

SSE based function that can extract a value much faster using SIMD instructions instead. This procedure can be written in three lines. The algorithm is written using C++ instructions and shown in Algorithm 28.

Algorithm 28: VECTOREXTRACT (__m128i a, i)
1 __m128i index ← _mm_cvtsi32_si128(i)
2 __m128i value ← _mm_shuffle_epi8(a, index)
3 return (uint8_t) _mm_cvtsi128_si32(value)

Line 1 copies the 32-bit integer i to the lowest element of the index vector and zeros its upper elements. Line 2 shuffles the packed 8-bit integers in the input vector according to the shuffle mask in the index vector. The last line of the algorithm then copies the lower 32-bit integer of the value vector and returns it to the function caller.

PERFORMANCE OF PDMF

To measure the performance of the proposed high-performance solution, PDMF was implemented in C++ using the same built-in unordered_map and priority_queue containers as DMF. The multicore models of PDMFm use the OpenMP library for multicore-based parallelization. The GPU models of PDMFg use the NVIDIA CUDA platform to create and execute kernel-level code on the GPU device. Our SSE-based SIMD computation model of PDMFh uses Intel SSE2, SSE3 and SSE4 instructions. These modules are well tested and optimized within the language and suit our needs very well. The performance tests were conducted on two systems. The first machine contains an Intel Core i7-4790 processor (4 cores, 3.6 GHz) with 16 GB of system memory, running Ubuntu 18.04 LTS. An Nvidia Quadro K1200 GPU with 4 GB of device memory is used for the GPU models on this machine. Our second computing system contains two Intel Xeon E5-2670 processors (16 cores, 2.6 GHz) with 32 GB of system memory, running OpenSUSE 13.1. The second machine has an Nvidia Tesla M2075 GPU with 6 GB of device memory.

The HMP (Human Microbiome Project) database was used as the input data for the system. In certain tests, smaller dataset portions are used to determine whether database size has an effect on the execution time of the DMF algorithm. In order to create a consistent testing scheme, the original dataset is split into 2^n partitions (1 ≤ n ≤ 8), with the partitions for each split chosen at random.

Figure 19 shows the performance of the multicore PDMF models in relation to each other. This experiment was run on the system with 16 cores (32 threads) using the

DS (1/8) dataset for varying L-mer lengths. PDMFm1 has the worst performance out of the three models. This can be attributed to the decreased bypassing ratio caused by the splitting of the search space into vertical partitions. Adding more threads leads to more fine-grained search space partitions and higher overall execution time. PDMFm2 has the best performance, with an amplified performance gain when using more CPU threads.

This suggests that using our computation resources to accelerate the total distance computation module is the best strategy among the multicore CPU models. The performance of the hybrid approach, PDMFm3, resides between PDMFm1 and PDMFm2. Since we are using a combination of both the search space split and the total distance computation acceleration, we see performance that corresponds to the mid-level between the two individual multicore approaches. For L-mer length = 8, PDMFm2 showed an average speedup of 8.87x and 4.63x on the 4-core and 16-core systems respectively.

Figure 19. PDMFm performance: m1 vs. m2 vs. m3

Figure 20 shows the performance of multiple combinations of threads on our 16-core system for variations of our hybrid multicore approach PDMFm3. L-mer length = 8 and varying sized datasets are used in this comparison. The best performance gain is shown by the usage of one search space tree traversal thread and 16 total distance computation threads, further confirming the benefits of parallelizing the computation-intensive module.

Figure 20. Performance of PDMFm3 with different thread allocations

Figure 21 shows the performance of our baseline GPU approach (PDMFg1) with respect to the fastest multicore model (PDMFm2) for various sized datasets. The base GPU model uses one CPU thread for kernel control and search space tree traversal while the total distance computation is handled by the GPU itself. This comparison is done on the system containing the Core-i7 CPU and Quadro K1200 GPU. For the shorter motif length, L = 6, the performance difference between the two models is not very large, but the performance gain improves for the longer motif length.

The performance gain is further amplified when using larger datasets, as there are more total distance computations with more sequences. Since the GPU's computation is much faster than the CPU's, the total execution time is lower. As shown in the graph, the performance gap between PDMFm and PDMFg is more pronounced when searching for longer motifs and in larger datasets.

Figure 21. PDMFg1 vs. PDMFm2 performance

Table 5 shows the execution time of DMF, PDMFm2, and PDMFg1 for L = 8 and varying sized datasets. The average speedup of PDMFm2 over DMF is 4.63x while the average speedup of PDMFg1 over DMF and PDMFm2 is 41.46x and 9.95x respectively.

Table 5. DMF vs. PDMFm2 vs. PDMFg1 Execution Times

             DMF (serial)     PDMFm2 (Multicore - Model 2)           PDMFg1 (GPU - Model 1)
Dataset      Run time (min.)  Run time (min.)  Speedup over DMF      Run time (min.)  Speedup over DMF  Speedup over PDMFm2
DS           926.32           177.64           5.21x                 22.09            41.93x            8.04x
DS (1/2)     447.57           85.46            5.24x                 9.41             47.56x            9.08x
DS (1/4)     202.14           39.72            5.09x                 4.27             47.34x            9.30x
DS (1/8)     51.47            22.09            2.33x                 1.24             47.51x            17.81x
DS (1/16)    14.24            2.70             5.27x                 0.49             29.06x            5.51x

Figure 22 shows the performance comparison of our first heterogeneous model (PDMFracing) with respect to our two GPU models. This experiment was conducted on the 16-core Xeon system with the Tesla M2075 GPU using L-mer length = 8 and the DS (1/8) database. PDMFg1's execution time does not change because it is the baseline GPU model and does not use more than one thread for the search tree.

PDMFg2 shows worsening performance as threads are added to the model. This can be attributed to the lower bypassing ratio caused by the search space split performed in this computation strategy. Furthermore, additional control and communication overhead related to managing multiple parallel kernels can worsen the performance. We also see a large deviation in performance after 8 threads in this model. The Nvidia Tesla M2075 GPU can only handle 8 concurrent kernel executions at one time. Starting more kernel threads leads to worse performance since all the kernel calls will be queued up on the

GPU waiting for their processing turn. In this case our baseline GPU model (PDMFg1) has the best performance amongst the PDMF GPU models. However, our racing-based computation strategy has the best performance amongst the three models. This can be attributed to the performance gain from the CPU execution strategy in the search space.

Since the motifs handled by the CPU do not need to be recomputed on the GPU, the difference between the two values on the graph is solely due to the racing-based dynamic load balancing method.

Figure 22. Performance: PDMF GPU vs. Heterogeneous Models

Table 6 shows the execution time and speedup of our PDMFh, PDMFracing and

PDMFg1 models for various L-mer lengths on the DS (1/8) database, on the system containing the Core-i7 CPU and Quadro K1200 GPU. We see increased efficiency with longer L-mers on our SIMD-based heterogeneous PDMF model. PDMFh has an average speedup of 3.42x and 2.48x in relation to PDMFg1 and PDMFracing, respectively. The increase in performance gain for longer L-mers can be attributed to the ability to run more SIMD instructions using the SSE-based model, leading to fewer computation cycles wasted on regular instructions. The massive throughput we get from the addition of SIMD computation also helps in this instance.

Table 6. PDMFg1 vs. PDMFracing vs. PDMFh Execution Times

Execution Time (min.)
                          L = 6     L = 8    L = 10, d = 2   L = 12, d = 3   L = 14, d = 4
PDMFg1                    0.0505    1.37     22.19           326.96          11349.43
PDMFracing (PDMFh1)       0.046     1.04     15.89           223.31          7466.73
PDMFh (PDMFh2)            0.0201    0.52     7.76            104.35          1895.11
Speedup (racing over g1)  1.10x     1.32x    1.40x           1.46x           1.52x
Speedup (h2 over racing)  2.28x     2.00x    2.04x           2.14x           3.94x
Speedup (h2 over g1)      2.51x     2.63x    2.86x           3.13x           5.99x

Figure 23 compares the scalability of our second heterogeneous model

(PDMFh) that uses SIMD instructions with respect to various sized databases. We find that the execution time of the SSE-based approach does not double when the size of the database is doubled. In fact, for L-mer length = 8, our PDMF model shows only a 1.5x increase in execution time on average, and for L-mer length = 10, an increase of 1.31x on average is observed. The performance loss caused by doubling the input size has been mitigated by the use of SIMD instructions in our heterogeneous model PDMFh.

Figure 23. PDMFh scalability comparison

CONCLUSION

We have designed and implemented a new combinatorial approach to the motif searching problem in this thesis. The proposed approach DMF uses hash-based heuristics to bypass unnecessary computations by approximating the hypothetical best motifs and total distance using hashing data structures. Our DMF algorithm shows a speedup of

22.48x on average over the regular branch-and-bound motif searching approach. The DMF approach was also compared against the popular motif searching tools MEME, SISMA, SPELLER, and WEEDER on real-world datasets. DMF was on average 9.9x and 2.94x faster than SPELLER and WEEDER. Not only was DMF faster than the mentioned motif searching tools, it was also able to return the accurate best-k motifs within a database. Our experience demonstrates that the proposed hash-based heuristics used in DMF are highly efficient in reducing the search space of tree-based branch-and-bound combinatorial approaches to motif finding. Utilizing parallel computation methods, we also developed parallel computation models of the DMF algorithm, called PDMF. CPU-only, GPU-only, and heterogeneous (CPU and GPU) computation models have been developed with specific target architectures in mind. Two new dynamic load balancing and computation strategies (racing-based and mirror-based) have also been implemented in the heterogeneous models. Furthermore, SIMD instructions have been integrated into our second heterogeneous model for accelerated CPU processing. Among the multicore models, PDMFm2, which focuses on accelerating the total distance computation task, showed the best performance with average speedups of 4.63x and 8.87x over the serial version

(DMF) on a system with 4 cores and a system with 16 cores, respectively. The baseline GPU model (PDMFg1) showed average speedups of 41.48x and 9.95x over the serial version (DMF) and PDMFm2, respectively, on a system with a 4-core CPU and a GPU.

Our first heterogeneous model (PDMFracing) showed an average 1.36x speedup over the baseline GPU model (PDMFg1) by using the racing-based dynamic load balancing method. Our fastest heterogeneous model (PDMFh) showed average speedups of 3.42x and 2.48x over the fastest GPU model (PDMFg1) and PDMFracing, respectively. In our practice, we observed that the performance gain of our GPU and heterogeneous models is amplified with increasing motif and dataset sizes. With the developed high-performance computation models, we could find motifs of lengths 6 to 14 within a reasonable time bound for all the databases that we tested.

REFERENCES

[1] W.K. Sung, “Algorithms in Bioinformatics: A Practical Introduction,” CRC Press, 2010.

[2] C.E. Lawrence, S.F. Altschul, M.S. Boguski, et al., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment,” Science, Vol. 262, No. 5131, pp. 208-214, 1993.

[3] J.D. Hughes, P. W. Estep, S. Tavazoie and G. M. Church, “Computational Identification of Cis-regulatory Elements Associated with Groups of Functionally Related Genes in Saccharomyces cerevisiae,” Journal of Molecular Biology, Vol. 296, No. 5, pp. 1205-1214, 2000.

[4] T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Machine Learning, Vol. 21, No. 1, pp. 51-80, 1995.

[5] T. A. Down and T. J. P. Hubbard, “NestedMICA: sensitive inference of over- represented motifs in nucleic acid sequence,” Nucleic Acids Research, Vol. 33, No. 1, pp. 1445–1453, 2005.

[6] G. Z. Hertz, G. W. Hartzell, III, and G. D. Stormo, “Identification of consensus patterns in unaligned DNA sequences known to be functionally related,” Computer Applications in Biosciences (CABIOS), Vol. 6, No. 2, pp. 81-92, 1990.

[7] M. K. Das and H.-K. Dai, “A survey of DNA motif finding algorithms,” BMC Bioinformatics, Vol. 8, No. 7, pp. S21, 2007.

[8] J. Hu, B. Li, and D. Kihara, “Limitations and potentials of current motif discovery algorithms,” Nucleic Acids Research, Vol. 33, No. 15, pp. 4899–913, 2005.

[9] S. Vijayvargiya and P. Shukla, “Regulatory Motif identification in Biological Sequences: An Overview of Computational Methodologies,” Advances in Enzyme Biotechnology, Chapter 8, Springer, 2013.

[10] Angela Makolo, “A Comparative Analysis of Motif Discovery Algorithms,” Computational Biology and Bioinformatics, Vol. 4, No. 1, pp. 1-9, 2016.

[11] M. S. Waterman, R. Arratia and D. J. Galas, “Pattern Recognition in Several Sequences: Consensus and Alignment,” Bulletin of Mathematical Biology, Vol. 46, No. 4, pp. 515-527, 1984.

[12] J. V. Helden, B. Andre, and J. Collado-Vides, “Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of Oligonucleotide Frequencies,” Journal of Molecular Biology, Vol. 281, pp. 827-842, 1998.

[13] U. Keich and P. Pevzner, “Finding motifs in the twilight zone,” Bioinformatics, Vol. 18, No. 1, pp. 1374–1381, 2002.

[14] M.-F. Sagot, “Spelling approximate repeated or common motifs using a suffix tree,” LNCS 1380, pp. 111-127, 1998.

[15] G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole, “Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes,” Nucleic Acids Research, Vol. 32, No. 1, pp. W199–W203, 2004.

[16] N. Pisanti, A. M. Carvalho, L. Marsan, and M.-F. Sagot, “Risotto: Fast extraction of motifs with mismatches,” in LATIN 2006: Theoretical Informatics, pp. 757–768, 2006.

[17] E. Eskin and P. A. Pevzner, “Finding composite regulatory patterns in DNA sequences,” Bioinformatics, Vol. 18, No. 1, pp. S354–S363, 2002.

[18] C. Jia, M. B. Carson, Y. Wang, Y. Lin, and H. Lu, “A new exhaustive method and strategy for finding motifs in chip-enriched regions,” PLOS ONE, Vol. 9, No. 1, pp. 1–13, 2014.

[19] H. Dinh and S. Rajasekaran, “PMS: A panoptic motif search tool,” PLOS ONE, Vol. 8, No. 1, pp. 1 -7, 2013.

[20] H. Dinh, S. Rajasekaran, and V. K. Kundeti, “PMS5: an efficient exact algorithm for the (l, d)-motif finding problem,” BMC Bioinformatics, Vol. 12, No. 1, pp. 410, 2011.

[21] M. Nicolae and S. Rajasekaran, “Efficient sequential and parallel algorithms for planted motif search,” BMC Bioinformatics, Vol. 15, No. 1, pp. 34, 2014.

[22] M. Nicolae and S. Rajasekaran, “qPMS9: An efficient algorithm for quorum planted motif search,” Scientific Reports, Vol. 5, No. 1, pp. 7813 EP, 2015.

[23] Carvalho A, Freitas A, Oliveira A, Sagot M: Efficient Extraction of Structured Motifs Using Box-links. String Processing and Information Retrieval Conference. 2004, 267-278.

[24] Carvalho A, Freitas A, Oliveira A, Sagot M: A highly scalable algorithm for the extraction of cis-regulatory regions. Asia-Pacific Bioinformatics Conference 2005:273-283

[25] M. S. Waterman, R. Arratia and D. J. Galas, “Pattern Recognition in Several Sequences: Consensus and Alignment,” Bulletin of Mathematical Biology, Vol. 46, No. 4, pp. 515-527, 1984.

[26] T. L. Bailey, N. Williams, C. Misleh, and W. W. Li, “MEME: discovering and analyzing DNA and protein sequence motifs,” Nucleic Acids Research, Vol. 34, pp. W369–W373, 2006.

[27] M. Federico, M. Leoncini, M. Montangero and P. Valente, “Direct vs 2- stage approaches to structured motif finding,” Algorithms for Molecular Biology, 7:20, 2012.

[28] G. Pavesi, P. Mereghetti, G. Mauri and G. Pesole, “Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes,” Nucleic Acids Research, Vol.32 W199–W203, 2004.

[29] Q. Yu, D. Wei and H. Huo, “SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets,” BMC Bioinformatics, 19:228, 2018.

[30] G. N. William, T. L. Bailey, and C. P. Elkan. “ParaMEME: a parallel implementation and a web interface for a DNA and protein motif discovery tool.” Bioinformatics, Vol. 12, No. 4, pp. 303-310, 1996.

[31] C. Chen, B. Schmidt, L. Weiguo, and W. Muller-Wittig, “GPU-MEME: Using graphics hardware

[32] Y. Liu, B. Schmidt, and D. L. Maskell, “An Ultrafast Scalable Many-Core Motif Discovery Algorithm for Multiple GPUs,” 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, Shanghai, pp. 428- 434, 2011.

[33] M. M. Abbas, Q. M. Malluhi and P. Balakrishnan, "Scalable Multi-core Implementation for Motif Finding Problem," 2014 IEEE 13th International Symposium on Parallel and Distributed Computing, Marseilles, pp. 178-183, 2014.

[34] Xu, Yun & Jiaoyun, Yang & Zhao, Yuzhong & Shang, Yi. An improved voting algorithm for planted (1,d) motif search. Information Sciences. 237. 305–312

[35] M. M. Abbas, M. Abouelhoda, and H. M. Bahig, “A hybrid method for the exact planted (l, d) motif finding problem and its parallelization,” BMC Bioinformatics, Vol. 13, No. 17, pp. S10, 2012.

[36] Abbas, M., Bahig, M., Abouelhoda, H., & Mohie-Eldin, M. (2014). Parallelizing exact motif finding algorithms on multi-core. The Journal of Supercomputing, 69(2), 814-826

[37] Chin FYL, Leung HCM: Voting algorithms for discovering long motifs. Proceedings of Third Asia Pacific Bioinformatics Conference. 2005, 261-271. 72 72 [38] Davila J, Balla S, Rajasekaran S: Space and time efficient algorithms for planted motif search. Proceedings of Second International Workshop on Bioinformatics Research and Applications (LNCS 3992). 2006, 822-829.

[39] Davila J, Balla S, Rajasekaran S: Fastand practical algorithms for planted (l, d) motif search. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2007, 544-552.

[40] N. S. Dasari, D. Ranjan and M. Zubair, “High performance implementation of planted motif problem using suffix trees,” 2011 International Conference on High Performance Computing & Simulation, Istanbul, pp. 200-206, 2011.

[41] S. Bandyopadhyay, S. Sahni, and S. Rajasekaran, “PMS6MC: A multicore algorithm for motif discovery," Algorithms, Vol. 6, No. 4, pp. 805-823, 2013.

[42] S. Bandyopadhyay and S. Sahni. Psm6: A fast algorithm for motif discovery. In Second IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), 2012.

[43] N. S. Dasari, R. Desh and Z. M, “An efficient multicore implementation of planted motif problem,” 2010 International Conference on High Performance Computing & Simulation, Caen, pp. 9-15, 2010.

[44] N. S. Dasari, D. Ranjan and M. Zubair, “High-performance implementation of planted motif problem on multicore and GPU,” Concurrency and Computation: Practice and Experience, Vol. 25, No. 1, pp. 1340-1355, 2013.

[45] H. Huo, S. Lin, Q. Yu, Y. Zhang and V. Stojkovic, “A MapReduce-based Algorithm for Motif Search,” 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, pp. 2052-2060, 2012.

[46] S. Mohanty and B. Sahoo, “Improved Exact Parallel Algorithm for Planted (l, d) Motif Search,” Asian Journal of Information Technology, Vol. 15, No. 1, pp. 4835- 4843, 2016.

[47] M. M. Al-Qutt, H. Khaled, R. ElGohary, et. al., “Accelerating Motif Finding Problem Using Skip Brute-Force on CPUs and GPU’s Architect-ures,” Proc. of the 2017 Int’l Conf. on Parallel and Distributed Processing Technology and Application, pp. 155-161, 2017.

[48] M. M. Al-Qutt, H. Khaled, R. ElGohary, et. al., “Accelerating Motif Finding Problem Using Skip Brute-Force on CPUs and GPU’s Architect-ures,” Proc. of the 2017 Int’l Conf. on Parallel and Distributed Processing Technology and Application, pp. 155-161, 2017.

Fresno State Non-exclusive Distribution License (Keep for your records) (to archive your thesis/dissertation electronically via the Fresno State Digital Repository)

By submitting this license, you (the author or copyright holder) grant to the Fresno State Digital Repository the non-exclusive right to reproduce, translate (as defined in the next paragraph), and/or distribute your submission (including the abstract) worldwide in print and electronic format and in any medium, including but not limited to audio or video.

You agree that Fresno State may, without changing the content, translate the submission to any medium or format for the purpose of preservation.

You also agree that the submission is your original work, and that you have the right to grant the rights contained in this license. You also represent that your submission does not, to the best of your knowledge, infringe upon anyone’s copyright.

If the submission reproduces material for which you do not hold copyright and that would not be considered fair use under copyright law, you represent that you have obtained the unrestricted permission of the copyright owner to grant Fresno State the rights required by this license, and that such third-party material is clearly identified and acknowledged within the text or content of the submission.

If the submission is based upon work that has been sponsored or supported by an agency or organization other than Fresno State, you represent that you have fulfilled any right of review or other obligations required by such contract or agreement.

Fresno State will clearly identify your name as the author or owner of the submission and will not make any alteration, other than as allowed by this license, to your submission. By typing your name and date in the fields below, you indicate your agreement to the terms of this license.

Publish/embargo options (type X in one of the boxes).

X Make my thesis or dissertation available to the Fresno State Digital Repository immediately upon submission.

Embargo my thesis or dissertation for a period of 2 years from date of graduation. After 2 years, I understand that my work will automatically become part of the university’s public institutional repository unless I choose to renew this embargo here: [email protected]

Embargo my thesis or dissertation for a period of 5 years from date of graduation. After 5 years, I understand that my work will automatically become part of the university’s public institutional repository unless I choose to renew this embargo here: [email protected]

Sanjay Soundarajan

Type full name as it appears on submission

May 8, 2020

Date