FPGAs in Bioinformatics
Implementation and Evaluation of Common Bioinformatics Algorithms in Reconfigurable Logic

Dipl.-Inf. Lars Wienbrandt

Dissertation submitted in 2015 to the Faculty of Engineering (Technische Fakultät) of the Christian-Albrechts-Universität zu Kiel for the academic degree of Doktor der Ingenieurwissenschaften (Dr.-Ing.)

Kiel Computer Science Series (KCSS) 2016/2 v1.0 dated 2016-03-15
ISSN 2193-6781 (print version)
ISSN 2194-6639 (electronic version)

Electronic version, updates, errata available via https://www.informatik.uni-kiel.de/kcss
The author can be contacted via http://www.techinf.informatik.uni-kiel.de

Published by the Department of Computer Science, Kiel University
Technical Computer Science Group

Please cite as:
Lars Wienbrandt. FPGAs in Bioinformatics. Number 2016/2 in Kiel Computer Science Series. Department of Computer Science, 2016. Dissertation, Faculty of Engineering, Kiel University.

@book{Wienbrandt16,
  author    = {Lars Wienbrandt},
  title     = {{FPGAs in Bioinformatics}},
  publisher = {Department of Computer Science, Kiel University},
  year      = {2016},
  number    = {2016/2},
  series    = {Kiel Computer Science Series},
  note      = {Dissertation, Faculty of Engineering, Kiel University.}
}

© 2016 by Lars Wienbrandt

About this Series

The Kiel Computer Science Series (KCSS) covers dissertations, habilitation theses, lecture notes, textbooks, surveys, collections, handbooks, etc. written at the Department of Computer Science at Kiel University. It was initiated in 2011 to support authors in the dissemination of their work in electronic and printed form, without restricting their rights to their work. The series provides a unified appearance and aims at high-quality typography. The KCSS is an open access series; all series titles are electronically available free of charge at the department's website. In addition, authors are encouraged to make printed copies available at a reasonable price, typically with a print-on-demand service.
Please visit http://www.informatik.uni-kiel.de/kcss for more information, for instructions on how to publish in the KCSS, and for access to all existing publications.

First reviewer: Prof. Dr. rer. nat. Manfred Schimmler
Technical Computer Science Group
Department of Computer Science
Christian-Albrechts-University of Kiel

Second reviewer: Prof. Dr. rer. nat. Andre Franke
Genetics & Bioinformatics Group
Institute of Clinical Molecular Biology
Christian-Albrechts-University of Kiel

Third reviewer: Prof. Dr. Olaf Wolkenhauer
Systems Biology & Bioinformatics Group
Institute of Computer Science
University of Rostock

Date of the oral examination: 4 March 2016

Acknowledgments

FPGA technology already fascinated me during my computer science studies. I soon came into contact with the field of bioinformatics and saw great potential in combining the two. After my graduation I continued research in this area and had the opportunity to publish many of my results. I was able to present most of them at several conferences around the world, some of them even by invitation.

Thus, first and foremost, I owe special thanks to my supervisor Manfred Schimmler, who always supported, promoted and motivated my work and gave me the freedom I needed, including the approval of travel funding. He also encouraged me to apply for project funding together with Andre Franke, to whom I owe grateful thanks for supporting me and my work in this manner. This successful project funded my research over the last three years.

I also owe my true appreciation to Bertil Schmidt for the numerous lucrative publications and for additional motivation and support in the second successfully raised collaboration project funding, which sustained the SNP interaction analysis project and financed Jan Kässens, who turned out to be my most cooperative colleague. His work and mine complemented each other perfectly.
With his knowledge of writing extremely efficient host code, he always squeezed out the last bits of total system performance. With him developing the host part while I worked out the hardware design, we minimized the development time for our applications. I sincerely thank him not just for being an invaluable colleague but also for being my friend.

Furthermore, I would like to thank my colleagues and former colleagues for their constructive help, namely Daniel Siebert for his counterpart on the BLASTp application; Jorge González-Domínguez and Jost Bissel for their work in the SNP interaction analysis project, especially Jorge for creating the competitive solutions on GPUs; Ayman Abbas, who gave me insight into the fruitful FPGA application area of cryptanalysis; and Vasco Grossmann, Sven Koschnicke and Christoph Starke, who tested FPGA technology in stock market analysis. I also sincerely thank Matthias Hübenthal and David Ellinghaus from the ICMB for their help with the mathematics of the SHAPEIT2 genotype phasing tool.

My special thanks go to our secretary Brigitte Scheidemann. She helped me with all kinds of administrative tasks, especially when I once again had a complicated travel cost refund application. Moreover, I truly appreciate the valuable results contributed by all the students whose theses I supervised. And thank you, Dirk Nowotka, Reinhard Koch and again Andre Franke, for joining my examination committee.

Last but not least, I want to thank my mother and my sister for unquestioningly supporting me in all kinds of situations, and, of course, I truly appreciate the backing of my wife. She always encouraged me to go on when I ran into a situation that seemed to be a dead end, she always told me to keep my head up when I lost my self-confidence, and she always calmed me when I thought I was going insane. Thank you, Ilze, for being there and for always being someone I can rely on!
And finally, I do not want to forget to mention my little daughter Alina Annija, who is my sunshine, even when it is raining, and thus always reminds me of what is most important in life.

This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from http://www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under awards 076113 and 085475.

Zusammenfassung (Summary)

Life. A great deal of effort is spent on giving humanity insight into this fascinating and complex, yet fundamental topic. To understand relationships and derive consequences, humans have begun to sequence their genome, i.e. to determine their DNA in order to infer information, e.g. regarding hereditary diseases. The process of DNA sequencing and the subsequent analyses are a challenge for current computing systems because of the huge amounts of data alone. Runtimes of more than a day for the analysis of simple datasets are common, even when the process already runs on a compute cluster.

This thesis shows how this common problem in bioinformatics can be tackled with reconfigurable hardware, specifically FPGAs. Three compute-intensive topics are highlighted: sequence alignment, SNP interaction analysis and genotype imputation.

In the area of sequence alignment, the BLASTp software for searching protein sequence databases is presented, implemented and evaluated as an example. SNP interaction analysis is presented with three methods for the exhaustive search for interactions, including the associated statistical test: the measurement of the Kullback-Leibler divergence in BOOST, the r-difference in iLOCi, and the measurement of mutual information. All methods are implemented on FPGA hardware and evaluated, with a striking speedup in the three-digit range over standard computers.

The last area, genotype imputation, is a two-part procedure consisting of phasing and the actual imputation. The focus is on the phasing step, which is addressed with the SHAPEIT2 tool. SHAPEIT2 is discussed in detail together with its underlying mathematical methods, and finally implemented and evaluated. Here too, a considerable speedup of 46 is achieved.

Abstract

Life. Much effort is made to grant humanity a little insight into this fascinating and complex but fundamental topic. In order to understand the relations and to derive consequences, humans have begun to sequence their genomes, i.e. to determine their DNA sequences to infer information, e.g. related to genetic diseases. The process of DNA sequencing as well as the subsequent analysis presents a computational challenge for current computing systems due to the large amounts of data alone. Runtimes of more than one day for the analysis of simple datasets are common, even if the process is already run on a CPU cluster.

This thesis shows how this general problem in the area of bioinformatics can be tackled with reconfigurable hardware, especially FPGAs. Three compute-intensive problems are highlighted: sequence alignment, SNP interaction analysis and genotype imputation.

In the area of sequence alignment, the software BLASTp for protein database searches is presented, implemented and evaluated as an example. SNP interaction analysis is presented with three applications performing an exhaustive search for interactions including the corresponding statistical tests: BOOST, iLOCi and the mutual information measurement. All applications are implemented in FPGA hardware and evaluated,