
DEGREE PROJECT IN ENGINEERING PHYSICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2019

Quantum Methods for Sequence Alignment and Metagenomics

OSCAR BULANCEA LINDVALL

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES

Quantum Methods for Sequence Alignment and Metagenomics

Oscar Bulancea Lindvall [email protected]

Quantum and Biophotonics, Department of Applied Physics, School of Engineering Sciences, Royal Institute of Technology, SE-106 91 Stockholm, Sweden

Sweden, 2019. Typeset in LaTeX.

TRITA-SCI-GRU 2019:219

Graduation thesis on the subject of Applied Physics for the degree of Master of Science in Engineering Physics. The thesis was presented June 10, 2019 at 11.00 in Albanova University Center, Roslagstullsbacken 21, Stockholm.

© Oscar Bulancea Lindvall, 2019. This work is licensed under a Creative Commons Attribution 4.0 International License. Printed in Sweden by KTH, Department of Applied Physics, Stockholm 2019.

Abstract

The analysis of genomic data poses a big data challenge. With sequencing techniques becoming both cheaper and more accessible, genomic data banks are expected to grow exponentially and surpass even the scale of astronomical data. The field of bioinformatics will therefore require more efficient methods of analysis. At the same time, quantum computing is being explored for its potential in areas such as chemistry, optimization and machine learning, and there has been recent interest in applying quantum computing to genomic analysis problems to achieve more efficient algorithms. This thesis aims to provide an overview of proposed quantum algorithms for sequence alignment, discussing their efficiency and implementability. Other quantum methods are also explored for their possible use in this field. An Ising formulation of the multiple sequence alignment problem is proposed, with qubit usage scaling only linearly in the number of sequences, and the use of this formulation is demonstrated for small-scale problems on a D-Wave quantum annealer. The properties of quantum associative memories are also discussed and investigated in the context of improving K-mer matching techniques in metagenomics, with simulations highlighting both their capabilities and current limitations.

Acknowledgements

This project was highly interdisciplinary and presented many challenges in understanding the relevant areas of both quantum computing and bioinformatics. It has been a great learning experience and I am thankful to all the people who have helped make this possible. I would like to start by expressing my gratitude toward my supervisor Dr. Xiu Gu. I am deeply grateful for her guidance that helped me navigate this vast field of research and through the uncertain points of my work. I want to thank both her and Prof. Göran Johansson for allowing me to participate in this project, making me feel part of the team and for letting me be part of many interesting discussions. I would also like to thank Per Sikora and Ebba Carbonnier for their contributions to my project. The meetings we had to discuss quantum applications in life sciences gave invaluable insight into bioinformatics and the possible uses of quantum computers, which undoubtedly helped shape my thesis. I am also grateful to everyone at the department of applied quantum physics at Chalmers for letting me be part of such a fun and inspiring workplace, and for giving me so many ideas of techniques to explore during my work. I learned much about quantum computing and quantum technology from the many interesting seminars you provided. And lastly, I want to give thanks to all my friends and family who have supported me through my years at university; my mother and father who have always cheered me on, and Eva Johansson and Mikael Bjurek who provided me with a second home while writing this thesis. Thank you all very much!

Contents

1 Introduction
2 Bioinformatics
  2.1 DNA-Structure
  2.2 Sequence Alignment
    2.2.1 Global Alignment
    2.2.2 Local Alignment
    2.2.3 Multiple Sequence Alignment
  2.3 Metagenomics
3 Quantum Computing
  3.1 Computational Model and Implementation
  3.2 Quantum Gates
  3.3 Grover Search
  3.4 Quantum Annealing
4 Quantum Methods for Alignment
  4.1 Grover-Based Methods
    4.1.1 Hamming Distance Lookup
    4.1.2 Conditional Oracle Search
    4.1.3 String Matching and Edit Distance
  4.2 Fourier-Based Methods
    4.2.1 Multiple Alignment Algorithm
5 Ising Formulation of Multiple Sequence Alignment
  5.1 Maximum Trace Formulation
  5.2 Column Alignment Formulation
  5.3 Simulation Results
  5.4 D-Wave Annealing Results
6 Quantum Associative Memories
  6.1 Grover-Based Quantum Associative Memory
    6.1.1 Ventura Model
    6.1.2 Distributed Queries
    6.1.3 Further Improvement
    6.1.4 Efficiency Considerations
  6.2 Trugenberger Quantum Associative Memory
    6.2.1 Efficiency Considerations
7 Quantum Applications in Metagenomics
  7.1 Grover-Based Quantum Associative Memory
    7.1.1 Adaption for Metagenomics
    7.1.2 Performance Tests
  7.2 Trugenberger Quantum Associative Memory
8 Summary
9 Further Research
  9.1 Quantum Pattern-Matching
  9.2 Alignment Methods
10 Appendix
  10.1 Quantum Associative Memories
    10.1.1 The QuAM Class
    10.1.2 The Maximum Probability Test
    10.1.3 The Realistic Probability Tests
  10.2 Trugenberger Model Calculations
  10.3 Quantum and Simulated Annealing Code
    10.3.1 Hamiltonian Construction
    10.3.2 Annealing Tests

1 Introduction

In the future, genome analysis could provide targeted treatments and aid in understanding and preventing complex diseases. Ever since the start of the human genome project, the sequencing and analysis of a human genome have become increasingly faster and cheaper, opening up more possibilities for medical and research applications. Even when not used for diagnostic purposes, methods for the analysis and comparison of genomic structure contribute to research in diseases, medicine design and evolutionary developments, to name a few. However, the analysis of genetic sequences poses a big data challenge: with sequence data banks increasing exponentially in size[1], the field is and will be in need of more efficient methods of analysis. At the same time, the field of quantum computing is on the rise and is being explored for its potential use in chemistry, optimization, machine learning and many other areas. Quantum computers are being researched by many around the world, and developed by companies such as IBM[2], Rigetti[3] and D-Wave[4], who are already offering public and commercial quantum computing platforms. With this, we are moving closer to the so-called era of Noisy Intermediate Scale Quantum (NISQ) computers[5], where quantum devices on the scale of 50–100 qubits will be available, going beyond what any classical computer is capable of simulating and where it might be possible to demonstrate quantum supremacy. In recent years, there has been interest in how quantum computing could be applied in the field of bioinformatics to provide more efficient methods for analyzing genomic data, a question explored by many parts of the scientific community, as well as by companies such as Entropica labs[6]. This thesis aims to provide an overview of methods proposed for quantum applications in bioinformatics, listing useful properties as well as current drawbacks or limitations. Areas and techniques which could be of use in this field in the years to come are also explored. In particular, the thesis explores quantum pattern-matching structures for genomic classification and quantum optimization for use in sequence alignment. In section 2, a general background is given for the relevant problems in bioinformatics, and a brief introduction to quantum computing and its framework is given in section 3. This is followed by an overview of recently proposed quantum methods in section 4, focusing on the sequence alignment problems presented in section 2. The topic of quantum optimization for use in alignment is brought up in section 5, where an Ising formulation for the Multiple Sequence Alignment (MSA) problem is formulated in sections 5.1 and 5.2 and tested in sections 5.3 and 5.4. Section 6 goes on to present the field of quantum pattern-matching using Quantum Associative Memories (QuAMs), which in section 7 are adapted and investigated for their potential use in the field of metagenomics.

2 Bioinformatics

A brief introduction is given to the fields of bioinformatics that are encountered in this thesis and definitions of relevant problems are presented.

2.1 DNA-Structure

Deoxyribonucleic acid, or DNA, is a double-helical structure built out of a sugar-phosphate backbone, forming the helices, and bridging components bonding the two helices together, called nucleotides, as illustrated in fig. 1. The nucleotides differ in their base component, of which there are cytosine (C), guanine (G), adenine (A) and thymine (T). These substances bind in pairs of A−T and G−C in the DNA structure, making a strand of DNA contain two complementary sequences. When divided into single-helical strands, i.e. ribonucleic acid (RNA), one sequence is called the sense sequence and is usually used for protein transcription, while the other, the anti-sense sequence, is used for DNA replication. In the analysis of DNA one typically speaks in terms of the sense strand, and following this convention, DNA will be referred to as if only having one sequence. The DNA of an organism is contained in all of its cells containing a nucleus and is used for coding the structure and inner functions of the organism itself. The size of the entire genetic content of an organism, called the genome, varies vastly between different kinds of creatures, with the vomiting bug (NORO virus) consisting of approximately seven thousand bases (having RNA and no DNA) and the human genome having on the order of 3·10⁹ base pairs. The order of occurrence of the bases in DNA is used to code for the components in proteins. The amino acids and their order are critical for a protein's specified function. By inference, the order of DNA bases

Figure 1: DNA structure (from Wikipedia[7]).

in a specific sequence is used to decide the function and control of almost all bodily functions. It is this that makes the analysis of DNA sequences, in particular with the base order in mind, a cornerstone in understanding biological functions and differences between organisms. Finding sequences which code for a particular function, i.e. genes, and using the knowledge of those to infer functional properties, is an important field in both medicinal and more fundamental biological/biochemical research. DNA is highly varying, with changes in the structure, e.g. mutations or arbitrary insertions or deletions (indels) of base pairs, occurring both between generations and within an organism's lifetime. This change is a mechanism for biological diversity, enabling the evolutionary process to occur. While such changes are capable of creating differing protein functions, it is not unusual for homologous DNA sequences, i.e. sequences with a common ancestry, to function in similar ways. If they are homologous but exhibit different behavior in their coded proteins, finding out what regions in the sequences cause the differing properties is both an interesting and challenging problem. Regions of similarity or knowledge of the particular mutations are both valuable forms of information in this field. Therefore, methods capable of highlighting and quantifying these properties have been developed for this problem, named the sequence alignment problem.

2.2 Sequence Alignment

The problem of sequence alignment is to measure and identify the similarity between string sequences. As explained in the previous section, analyzing the similarity between DNA sequences may reveal not only evolutionary relations, but also functional relations in the associated proteins, giving it a central role in bioinformatics. There are however different ways of measuring similarity between DNA sequences, depending on what kind of comparison one is interested in making. In bioinformatics, the problem is therefore separated into two main categories, global and local alignments. In the latter problem, one is only interested in finding regions of similarity, possibly substantially smaller than the length of the sequences involved. In the global alignment case, however, one is interested in the difference between the sequences in their entirety. The goal in global alignment is to find the least number of base mutations and indels explaining the deviation between the strings, typically treating it as an optimization problem where matching elements give a reward and mismatches along with indels give some kind of penalty. In this manner, it is similar to the problem of finding the minimal edit distance between strings, which is also used for analysis of natural languages and financial data. However, the score for matching bases and the penalties for edits and indels can vary depending on the type of alignment one wants to make, even to the point of making the penalty non-linear. Global alignment should therefore be considered a generalization of the edit distance problem.

2.2.1 Global Alignment

The problem of global, pair-wise, sequence alignment is defined as follows. One is initially given two sequences, A = a_1, a_2, ..., a_n and B = b_1, b_2, ..., b_m, where the characters {a_i} and {b_i} are part of some common alphabet, e.g. {'A','T','C','G'} for DNA. The problem then consists of adding indel characters (usually added to the alphabet as a character such as "–"), making new sequences A′ = a′_1, ..., a′_N and B′ = b′_1, ..., b′_N, such that the pairing of characters between them optimizes some similarity score function sim(A′, B′). Note that every alignment will consist of at least |n − m| indels, making N ≥ max(n, m), due to the difference in sequence length. Typically, this score function is a sum over the individual matching scores of the pairs of characters between the sequences, e.g.

sim(A′, B′) = ∑_{i=1}^{N} score(a′_i, b′_i),    (2.1)

where the score(a, b) function assigns a reward or penalty depending on which of the scenarios

• a = b, matching bases,

• a ≠ b, mismatching bases,
• a or b is an indel

is encountered. For example, say one chooses a score of +1 for each match and −1 for a mismatch or indel; then the sequences ACGCT and CCAT would have the alignment

A′: A C G C – T
B′: – C – C A T

where vertical alignment indicates pairing of elements. According to our scoring scheme, the score is

sim( A C G C – T
     – C – C A T ) = [score(A,–) + score(G,–) + score(–,A)] + [2·score(C,C) + score(T,T)] = −3 + 3 = 0,

which one can visually confirm is an optimal solution. In this way, sequence similarity can be identified by interpreting mismatches as possible base mutations and indels as arbitrary insertions/deletions of nucleotides, respectively. A global alignment therefore gives a relatively indicative picture of how each sequence evolved, and detailed information about which parts the sequences could initially have had in common. However, the alignment procedure is not only used when detailed alignment information is needed. It is common in many areas of bioinformatics to compare a genomic sequence to those contained in a database, and for the purpose of quantifying the similarity, the similarity function defined in global alignment is commonly used. The classical method for finding the optimal alignment is the Needleman-Wunsch algorithm, which makes use of dynamic programming. To give a perspective on the complexity of the alignment method, a brief description of the algorithm is provided.

1. A so-called substitution matrix, D, is constructed for the two sequences involved, consisting of the matching scores for the initial sequences, i.e. D_{i,j} = score(a_i, b_j).

2. A cumulative score matrix, S, is constructed by the recursive scheme

S_{i,j} = { (i+j)g,                                                      if i = 0 or j = 0,
          { max( S_{i−1,j−1} + D_{i,j}, S_{i−1,j} + g, S_{i,j−1} + g ),  otherwise,    (2.2)

where g is the indel penalty. This recursive scheme maximizes the reward and minimizes the penalties by considering all possible indel insertions. If at position (i, j) the term S_{i−1,j−1} + D_{i,j} is chosen, it means the score is maximized by simply letting the bases a_i and b_j be aligned; if any of the other terms is larger, then it is more favorable to insert an indel into the first or second sequence, respectively. The score in S_{n,m} is thus the score of the optimal alignment.

3. By backtracking the choices which created the score S_{n,m}, one then finds the optimal alignment configuration.

An example of running this algorithm for the sequences {A, G, C} and {A, C, T}, with a reward of 1 for matching and a penalty of −1 for mismatches and indels, is shown in fig. 2. As described above, the algorithm has a worst-case complexity of O(nm) in both memory and time, due to the initialization and updating of the substitution matrix.

Figure 2: An example of the steps in the Needleman-Wunsch algorithm, showing the substitution matrix construction, the cumulative score calculation and the backtracking resulting in the displayed alignment, from left to right. The found path indicates, in order from start to finish: 1. match, 2. indel in second sequence, 3. match, 4. indel in first sequence.
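To make the recursion (2.2) concrete, the following is a minimal Python sketch of the Needleman-Wunsch algorithm (illustrative only, not the implementation used in this thesis), with the same scoring as in fig. 2: +1 for a match and −1 for a mismatch or indel.

    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
        """Global alignment by dynamic programming, O(n*m) time and memory."""
        n, m = len(a), len(b)
        # Cumulative score matrix S, with boundary cells (i+j)*g as in (2.2).
        S = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            S[i][0] = i * gap
        for j in range(m + 1):
            S[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = match if a[i - 1] == b[j - 1] else mismatch
                S[i][j] = max(S[i - 1][j - 1] + d,   # align a_i with b_j
                              S[i - 1][j] + gap,     # indel in second sequence
                              S[i][j - 1] + gap)     # indel in first sequence
        # Backtrack from S[n][m] to recover one optimal alignment.
        A, B = [], []
        i, j = n, m
        while i > 0 or j > 0:
            d = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
            if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + d:
                A.append(a[i - 1]); B.append(b[j - 1]); i -= 1; j -= 1
            elif i > 0 and S[i][j] == S[i - 1][j] + gap:
                A.append(a[i - 1]); B.append('-'); i -= 1
            else:
                A.append('-'); B.append(b[j - 1]); j -= 1
        return S[n][m], ''.join(reversed(A)), ''.join(reversed(B))

    # Example from fig. 2: aligning AGC and ACT gives score 0.
    print(needleman_wunsch("AGC", "ACT"))

Running this on the sequences of fig. 2 reproduces the score 0 and the displayed alignment AGC−/A−CT.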

2.2.2 Local Alignment

In contrast to global alignment, local alignment methods try to find an alignment for the most similar regions of two sequences, even if the alignments of other regions are disregarded entirely. As such, the problem is defined similarly as in section 2.2.1, except for a modification of the similarity score (2.1): the sum is not taken over the entire alignment configuration, but is instead optimized over all different subsequences. Such a procedure is useful in finding conserved domains between genomes of different species and for finding hidden motifs, i.e. recurring sequences that tend to be biologically significant. There exist qualitative, but also simple, methods of finding such alignments, called dot matrix methods. Such methods are based on constructing the so-called dot matrix. It is similar to the substitution matrix made in the Needleman-Wunsch algorithm, but is simpler in the sense that matching base pairs are only marked and not assigned a score. The dot matrix for the example sequences ACCGT and TCCGTACC is constructed as in fig. 3, and the desired local alignments can be found visually by identifying diagonal lines of relevant lengths. It is effective for visualizing the similarities when the noise levels are low, and in finding matching recurrences between sequences. However, it is more fitting as a visual aid, and more robust methods are needed for large-scale data. The Smith-Waterman algorithm is a modification of the Needleman-Wunsch method, using dynamic programming to find one or more local alignments between two sequences. It works as follows:

Step 1 As in the Needleman-Wunsch method, the substitution matrix, D, is created.

Step 2 The cumulative score matrix, S, is also constructed similarly, however with slightly modified rules:

S_{i,j} = { 0,                                                           if i = 0 or j = 0,
          { max( S_{i−1,j−1} + D_{i,j}, S_{i−1,j} + g, S_{i,j−1} + g, 0 ),  otherwise,    (2.3)

    T C C G T A C C
A             •
C     • •       • •
C     • •       • •
G         •
T   •       •
Figure 3: Dot matrix of ACCGT and TCCGTACC.

Here, the score never takes negative values, which is important, as negative scores in the global alignment case indicate a failing alignment. In local alignment, it is not as interesting to include failing parts of the alignment, and the zero scores therefore act as markers for suitable local alignment separators.

Step 3 Finally, a traceback is performed, starting in the matrix cell of highest score and continuing until a cell with a zero score is reached. The path taken gives the best local match between the sequences. The traceback process may also be repeated if other local similarities are desired.

As in the global alignment, methods relying on dynamic programming require the creation and updating of the substitution matrix, which consists of O(nm) operations in time and memory. Although having the same complexity, the Smith-Waterman algorithm is slightly more expensive, as one also has to find the cell with the highest score before the traceback.
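A corresponding sketch of the Smith-Waterman variant (2.3), reusing the conventions of the Needleman-Wunsch example above (again illustrative, not the code used in this thesis):

    def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
        """Local alignment: scores are floored at zero and the traceback
        starts from the global maximum, stopping at the first zero cell."""
        n, m = len(a), len(b)
        S = [[0] * (m + 1) for _ in range(n + 1)]  # boundary cells stay 0
        best, bi, bj = 0, 0, 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = match if a[i - 1] == b[j - 1] else mismatch
                S[i][j] = max(S[i - 1][j - 1] + d, S[i - 1][j] + gap,
                              S[i][j - 1] + gap, 0)
                if S[i][j] > best:                 # remember highest-scoring cell
                    best, bi, bj = S[i][j], i, j
        A, B = [], []
        i, j = bi, bj
        while i > 0 and j > 0 and S[i][j] > 0:     # traceback until a zero cell
            d = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + d:
                A.append(a[i - 1]); B.append(b[j - 1]); i -= 1; j -= 1
            elif S[i][j] == S[i - 1][j] + gap:
                A.append(a[i - 1]); B.append('-'); i -= 1
            else:
                A.append('-'); B.append(b[j - 1]); j -= 1
        return best, ''.join(reversed(A)), ''.join(reversed(B))

    # The sequences of fig. 3 share the local region CCGT.
    print(smith_waterman("ACCGT", "TCCGTACC"))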

2.2.3 Multiple Sequence Alignment

So far, we have only considered pair-wise alignment, i.e. alignment between two sequences. However, in bioinformatics it can be of interest to find similar regions or a global alignment of more than two sequences. Such an alignment can highlight conserved sequences between different generations of genes, as well as mutations. Both are important traits for understanding the genetic and functional relationships between homologous genes. Multiple sequence alignment (MSA) is more complex than the two-sequence case, as the optimal solution for the multiple alignment can often differ from the optimal alignment of each pair of sequences. There is also no unique way of quantifying the alignment score between more than two sequences. There is, however, a commonly used scoring scheme called sum-of-pairs (SP) scoring, where the score for each alignment column in the configuration is evaluated as the sum of the pair-wise scores of all element pairs in the column. The problem is defined as starting with L sequences {S_1, ..., S_L}, each with a number of characters S_i = s_{i,1}, s_{i,2}, ..., s_{i,n_i}. The sequences are then modified by insertion of indels. Assuming they are modified to have the same number of characters N, the similarity score is defined as

sim(S_1, ..., S_L) = ∑_{i=1}^{N} columnscore(s_{1,i}, s_{2,i}, ..., s_{L,i}),    (2.4)

which with the SP scoring would be formulated using a pair-wise scoring function, as

sim(S_1, ..., S_L) = ∑_{i=1}^{N} ∑_{j=1}^{L} ∑_{k=j+1}^{L} score(s_{j,i}, s_{k,i}).    (2.5)

In this way, multiple alignment can be seen as a generalization of the global alignment problem and can in fact be solved with a generalization of the Needleman-Wunsch algorithm using an L-dimensional substitution matrix, although this would incur a memory and time complexity of O(N^L) if no optimization or heuristic is used. It has in fact been shown that multiple sequence alignment with the SP scoring scheme is NP-complete[8].
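As a small illustration of the SP scoring (2.5), the Python function below computes the sum-of-pairs similarity of a gapped alignment; scoring a gap paired with a gap as zero is a common convention, assumed here and not spelled out above.

    from itertools import combinations

    def sp_score(alignment, match=1, mismatch=-1, gap=-1):
        """Sum-of-pairs score of an MSA given as equal-length gapped strings."""
        def score(a, b):
            if a == '-' and b == '-':
                return 0            # gap paired with gap: conventionally neutral
            if a == '-' or b == '-':
                return gap
            return match if a == b else mismatch
        total = 0
        for column in zip(*alignment):              # iterate over alignment columns
            for a, b in combinations(column, 2):    # all pairs within the column
                total += score(a, b)
        return total

    print(sp_score(["ATCG--", "A-GT--", "TG-AA-"]))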

With the lack of efficient methods yielding the optimal solution, most algorithms for MSA are based on heuristic approaches. Progressive methods, as they are named, construct the alignment by a step-by-step approach using the pair-wise alignments. Starting from the most similar sequence pair, the MSA alignment is made by appending one sequence at a time, guided by a strategy using the similarity distance between all pairs as input[9]. Clustal and T-Coffee are among the most popular progressive alignment methods[10, 11]. Another type of approach consists of genetic algorithms, which rely on optimization methods to find the best alignment. Different alignments are considered based on where indels are placed, and the candidate alignments are randomly modified according to preset rules conjectured for the evolutionary process of genomes, and then selected or discarded depending on how the alignment affects an objective function.

2.3 Metagenomics

Metagenomics is the study of the genetic makeup of samples taken directly from their native environments. Early genomic analysis mostly used artificially grown cultures, but with the advance of sequencing technology, it was shown that such cultures contained less than one percent of the diversity of the original sample[13]. Methods were therefore developed to analyze samples taken directly from the corresponding environment, so as to conserve this diversity. Metagenomic analysis is capable of measuring the microbial content of a given sample, as well as what metabolic processes are possible within the sample environment. Such methods present new opportunities in a variety of fields, such as ecology and medicine, where the bacterial and viral content have a vital role. In agriculture, knowledge of how the microbial content of the soil affects plant growth could lead to more effective disease detection and more efficient farming practices. In environmental studies, the microbial content and its metabolic processes give insight into how well an ecosystem, such as a lake, is capable of handling pollution. And in medicine, metagenomics is being used to research the connection between the human microbial gut content and bowel diseases[14], and has been a successful method for finding connections between certain microbial infections and the development of cancer, which could be used to improve treatment methods[15]. A metagenomics project typically follows the workflow of fig. 4. A sample is extracted from the environment one wishes to study and is thereafter processed to filter out unwanted content if necessary. The process of pre-filtering is often applied when the sample comes from a host that is not the target of the study. If the volume of the DNA content one wants to research is low, it is amplified by artificial means. Then, after isolating the DNA content, extraction of the sequence base pairs is performed by sequencing techniques. A modern and commonly used group of sequencing methods are the next-generation (high-

Figure 4: Flow diagram of a typical metagenomics project. Taken from [12].

throughput) techniques. In such procedures, one starts by fragmenting the DNA content by chemical means and then identifies sequence fragments of typically small length (e.g. an Illumina sequencer, producing ∼150 base pairs), but in parallel, producing a large volume of genomic data. Combined with strategies for fragmenting and post-processing long sequences, e.g. shotgun sequencing[16], one can nowadays sequence up to 18,000 human genomes each year[17]. What one does with the obtained genomic data then depends on the purpose of the study. If the goal is to find the sequence and analyze its structure, one would use sequence assembly methods, which reconstruct the contained sequence by combining the sequenced fragments. This thesis will however focus on another type of goal, which is taxonomic classification. In some of the areas mentioned above, one is not as interested in the exact structure of the DNA found, but rather whether it can be classified as belonging to certain organisms or genes. This is achieved by binning techniques, effectively sorting each read into some taxonomic category and at the end of the analysis drawing some conclusion based on the binning distribution. There are several tools which use different strategies for binning, but in this thesis, attention is turned to methods specializing in so-called "K-mer" based techniques. In particular, focus is given to the method of Kraken and its binning technique. Kraken[18] uses a database of reference sequences, making up its taxonomic categories. However, the

Figure 5: The Kraken sequence classification algorithm. To classify a sequence, each K-mer in the sequence is mapped to the lowest common ancestor (LCA) of the genomes that contain that K-mer in a database. The taxa associated with the sequence's K-mers, as well as the taxa's ancestors, form a pruned subtree of the general taxonomy tree, which is used for classification. Taken from [18].

database is not structured around the sequences themselves. Each sequence is identified by a characteristic K-mer, a subsequence of length K. The binning technique used by Kraken, illustrated in fig. 5, then consists of extracting all K-mers contained in the given query sequence, which could be produced by a sequencer, and searching the database with the K-mers as search queries. If a K-mer is contained in the database, then a taxonomic tree, representing the evolutionary relationship between the reference K-mers, is updated accordingly. After processing every query K-mer, the taxonomic tree is processed to give a conclusion as to what the taxonomic identity of the query sequence should be. Kraken and similar methods are however not without flaws as they, for example, rely on exact matching of their queries. While exact matching can be made highly efficient for the typical K-mer length of K = 31 or less, sequencing errors, caused either by erroneous DNA cloning or readout, result in a reduced identification accuracy. And while similarity-based matching can be incorporated to dampen the effect of these errors, it is computationally expensive to implement in a database search. The processes of Kraken and other methods making use of this "K-mer counting" taxonomic classification approach could provide good starting points for quantum improvements in metagenomics. The biggest restriction placed on universal quantum computing for use in genomic applications is the limitation on the number of qubits. Most methods rely on encoding the entire DNA, RNA or protein sequence into a bit string (i.e. a single state) or a quantum structure of similar size. Considering the length of a typical gene or genome, which could be several thousands or millions of base pairs, this seems infeasible at the modern scale of genomics. However, with the short-length K-mers used in these methods, they are much more feasible to consider running on quantum computers within the years to come. In section 7, research is done

on quantum methods that could be suitable for improving the search procedure in these binning methods, mostly focusing on the scenario of Kraken.
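To make the binning idea concrete, here is a toy Python sketch of K-mer based classification. The reference dictionary and the majority vote are simplifications made for illustration; Kraken itself maps each K-mer to a lowest common ancestor and scores root-to-leaf paths in the taxonomy tree.

    from collections import Counter

    def kmers(seq, k=31):
        """All overlapping K-mers of a sequence."""
        return (seq[i:i + k] for i in range(len(seq) - k + 1))

    def classify(read, reference, k=4):
        """Toy binning: count exact K-mer hits per taxon and return the most
        frequent one (a stand-in for Kraken's pruned-subtree scoring)."""
        hits = Counter(reference[km] for km in kmers(read, k) if km in reference)
        return hits.most_common(1)[0][0] if hits else "unclassified"

    # Hypothetical reference database: K-mer -> taxon label.
    reference = {"ACGT": "E. coli", "CGTA": "E. coli", "TTAG": "S. aureus"}
    print(classify("AACGTAC", reference, k=4))   # -> 'E. coli'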

3 Quantum Computing

Quantum computers are devices much like classical digital computers, with the vital difference that information is stored and processed using the elementary two-level quantum system, the qubit

|q⟩ = α|0⟩ + β|1⟩.    (3.1)

The possibility of quantum superposition, letting |q⟩ = (|0⟩ + |1⟩)/√2 or any choice of α and β adhering to |α|² + |β|² = 1, opens new doors for parallel processing, and the growth of the state space through the tensor product,

|q_1, ..., q_n⟩ = (α_1|0⟩ + β_1|1⟩) ⊗ ··· ⊗ (α_n|0⟩ + β_n|1⟩) ∈ C^{2^n},    (3.2)

being exponential in the number of bits, could reduce resource requirements. But this has to be balanced against the nature of quantum mechanics, which limits the flexibility of quantum algorithms through the probabilistic state measurement and collapse. The linearity of the theory is also the cause of several limitations. The no-cloning theorem[19] in particular prevents the copying of information with quantum states, in the sense of perfectly and deterministically replicating an arbitrary and unknown quantum state |ψ⟩ through a unitary U as

U |ψ⟩ ⊗ |ϕ⟩ = |ψ⟩ ⊗ |ψ⟩.    (3.3)

Despite these limitations, theoretical uses for quantum computation have been investigated in the areas of prime factoring and unordered search, where the famous Shor's algorithm promises exponential speedup compared to classical prime factoring if implemented correctly, and Grover search is able to retrieve a desired element from an unordered database of N items in O(√N) database look-ups, although this speedup depends on the specific problem and type of search, as discussed further on. Quantum computers are also expected to improve results in simulations of chemical systems, in optimization and in machine learning.

3.1 Computational Model and Implementation

A quantum algorithm has to be tailored to and expressed in a model for the hardware it is applied within. For the qubit, one can in essence utilize any two-level system, or even a restricted multilevel one, e.g. anything from the nuclear spin of an atom or a photon in an optical cavity, to advanced solutions such as the quantum states of a superconducting system. There are however pros and cons with each implementation, particularly regarding the system's stability against decoherence and the time needed to perform operations on the qubits. This determines the number of one-qubit operations that can be performed on the same qubits before the state needs to be reset, and table 1 lists relevant values for various implementations, which are summarized in [20]. The listed implementations would differ in the types of operations that are implementable, in how we as experimenters would interact with the system and in how the qubits themselves would interact with each other. The way to perform quantum computations would have to be modeled accordingly, and among the many computational models, the so-called quantum circuit model is commonly used, and supported by most publicly available quantum hardware platforms, such as IBMQ[2] and Rigetti[3]. The reader should note, though, that this model treats universal quantum computers, universal in the sense of being able to run or simulate any quantum device. There are other forms of quantum computation, such as quantum adiabatic computation, which is performed in D-Wave's systems, but those are generally limited in the algorithms they can run. The quantum circuit model is an extension of the classical Boolean circuit model of computation, allowing the use of qubits, reversible and unitary operators, and measurements, based on the following assumptions.

1. The quantum computer is coupled to a classical device that is responsible for measurement registration and processing. One can also assume it is capable of controlling basis state preparation and the application of quantum operators.

System              Coherence time τ_Q (s)   Operation time τ_op (s)   η_op = τ_Q/τ_op
Nuclear spin        10^−2 – 10^8             10^−3 – 10^−6             10^5 – 10^14
Electron spin       10^−3                    10^−7                     10^4
Ion trap (In+)      10^−1                    10^−14                    10^13
Electron – Au       10^−8                    10^−14                    10^6
Electron – GaAs     10^−10                   10^−13                    10^3
Quantum dot         10^−6                    10^−9                     10^3
Optical cavity      10^−5                    10^−14                    10^9
Microwave cavity    10^0                     10^−4                     10^4

Table 1: Estimates of decoherence times τ_Q and operation times τ_op in seconds, and the feasible number of operations that can be performed on the same quantum state, η_op, for various qubit implementations[20].

Figure 6: Circuit model of an operation creating the entangled state |ψ⟩ = (|00⟩ + |11⟩)/√2 and subsequently destroying it by measuring both qubits, producing either the |00⟩ or |11⟩ state with equal probability. Evaluated left to right.

2. The circuit acts on registers, i.e. collections of n qubits, with a predefined computational basis, written as the |0⟩ and |1⟩ states in correspondence to the 0 and 1 states of a classical bit.

3. Any one state of the computational basis can be prepared in at most n steps.

4. A gate, i.e. a quantum operation, can be applied to any group of qubits in the circuit, and there exists a universal set of gates that can be implemented by the hardware.

5. A measurement can be performed at any point in time, on one or more circuit qubits.

A demonstration of a quantum circuit is given in fig. 6, showing an initial state preparation in the |00⟩ state, standard operations (see table 2) creating an entangled state and finishing with measurements on both qubits.
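The circuit of fig. 6 is small enough to reproduce numerically. A minimal NumPy sketch, building the Hadamard and CNOT matrices of table 2 explicitly:

    import numpy as np

    # Computational basis |0>, |1> and the gates of table 2.
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    I = np.eye(2)
    CNOT = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 1],
                     [0, 0, 1, 0]])

    psi = np.kron([1, 0], [1, 0])          # |00>
    psi = np.kron(H, I) @ psi              # Hadamard on the first qubit
    psi = CNOT @ psi                       # entangle: (|00> + |11>)/sqrt(2)
    print(psi)                             # [0.707 0 0 0.707]

    # Measurement probabilities for |00>, |01>, |10>, |11>.
    print(np.abs(psi) ** 2)                # [0.5 0 0 0.5]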

3.2 Quantum Gates

Quantum algorithms would in their simplest form be applications of arbitrary unitaries, finishing with measurements, but in actual use one is limited to applying unitary operations implementable by the hardware. Table 2 lists standard and commonly used quantum gates and their effects, which are mostly one-qubit gates, with the exception of the CNOT, SWAP and TOFFOLI gates. Limited to the use of these gates, one is required to examine the topic of gate decomposition, i.e. separating a gate acting on d qubits into a series of one- or two-qubit gates, and gate set universality. In Boolean logic circuits, the AND, OR and NOT gates are together said to be universal, in the sense that they can be used to construct any other Boolean operation. Similarly, in quantum computing there exist operators, with a standard choice being the Hadamard, CNOT and T gates, which can be used to approximate any one- or two-qubit unitary operation to arbitrary accuracy. The choice of a universal gate set is not unique, but the existence of one is vital to prove the computability of operations in the given model. In the quantum

Name                                   Effect
Hadamard (H)                           H|0⟩ = (|0⟩ + |1⟩)/√2,  H|1⟩ = (|0⟩ − |1⟩)/√2
Pauli-X                                X|0⟩ = |1⟩,  X|1⟩ = |0⟩
Pauli-Y                                Y|0⟩ = i|1⟩,  Y|1⟩ = −i|0⟩
Pauli-Z                                Z|0⟩ = |0⟩,  Z|1⟩ = −|1⟩
T, or π/8-gate                         T|0⟩ = |0⟩,  T|1⟩ = e^{iπ/4}|1⟩
P rotation, P ∈ {X, Y, Z}              R_P(θ) = cos(θ/2) I − i sin(θ/2) P
Controlled-NOT (CNOT)                  CNOT = |0⟩⟨0| ⊗ I + |1⟩⟨1| ⊗ X
Controlled-unitary CU, U one-qubit     CU = |0⟩⟨0| ⊗ I + |1⟩⟨1| ⊗ U
SWAP                                   SWAP_{q1,q2} |q_1 q_2⟩ = |q_2 q_1⟩
Toffoli, or multi-controlled NOT       TOFFOLI |q_1 q_2 q_3⟩ = |q_1 q_2⟩ ⊗ X|q_3⟩ if |q_1 q_2⟩ = |11⟩, otherwise |q_1 q_2 q_3⟩

Table 2: Standard and typically implementable simple quantum gates and their effects.

circuit model, it is assumed that the hardware is capable of implementing at least one such universal gate set, such that in theory all unitary operations are constructable. On the topic of decomposition however, it has been shown that an arbitrary n-qubit unitary can be decomposed into at most O(n²4ⁿ) one-qubit and CNOT gates[20]. This is certainly not an efficient number of operations, but it is merely an upper bound, and there are unitaries which can be decomposed more efficiently. But there are also unitaries which cannot be decomposed efficiently, and this should be considered in the design of quantum algorithms. This also relates to the preparation of an arbitrary quantum state, which has been shown to be inefficient in the sense of requiring an exponential number of simple gates. This is easily realized, as the preparation of an arbitrary state |ψ⟩ from some initial state |0, ..., 0⟩ is the same as designing an arbitrary unitary mapping the zero state to |ψ⟩.

3.3 Grover Search

Grover search has become an important part of many newly developed quantum algorithms, for its possible speedup and adaptability. As many of the methods presented here are Grover-inspired, a brief introduction to the algorithm is given. Grover search is performed as described in algorithm 1, and consists of three main components. The first is the preparation of the search state |ψ₀⟩ in the uniform superposition over all states. Such an operation can be achieved by applying a Hadamard gate on each qubit,

|s⟩ = (1/√(2^n)) ∑_{i=0}^{2^n−1} |i⟩ = ( ⊗_{i=1}^{n} H_i ) |0_1, ..., 0_n⟩,    (3.4)

as each Hadamard was seen in table 2 to transform a computational basis state into an equal superposition of both states. The other main parts are the phase inversion operator I_O |i⟩ = (−1)^{O(i)} |i⟩ and the Grover operator

Figure 7: One iteration of Grover search with one marked state among 8 items: (a) initial state amplitudes, (b) after target phase inversion, (c) after inversion around the average.

G = 2|s⟩⟨s| − I. The former operator inverts the phase of all states satisfying an oracle function O by O(i) = 1. All states not satisfying the oracle take O(i) = 0. After phase inversion, the Grover operator, otherwise called the diffusion operator, can be applied as

G = 2|s⟩⟨s| − I = ( ⊗_{i=1}^{n} H_i ) ( 2|0_1,...,0_n⟩⟨0_1,...,0_n| − I ) ( ⊗_{i=1}^{n} H_i ),    (3.5)

where the middle operator can be decomposed using 2n X-gates and an n-bit Toffoli gate, making the entire operator decomposable into O(n) simple gates. The intuition behind this operator is to "invert amplitudes around the mean", causing states with inverted phase to become peaked, as illustrated in fig. 7, giving a high probability of measurement at an optimal number of iterations, found to be O(√(2ⁿ)). While algorithm 1 demonstrates the case of assuming only one marked state, the method has been generalized to taking an arbitrary initial state[21], having an arbitrary number of solutions which can be counted before execution[22], and allowing for a more arbitrary phase change induced by I_O to reduce the number of iterations[23]. The algorithm can be implemented efficiently, up to the construction of the phase inversion operator, which is dependent on the oracle. The oracle is a way of making the method adaptable to any

Algorithm 1 Grover search over N = 2ⁿ elements with one solution
1: |ψ⟩ = (1/√N) ∑_{i=0}^{N−1} |i⟩ ▷ Initialize database state
2: for i = 1 to ROUND(π√N/4) do
3:   Set |ψ⟩ = I_O |ψ⟩ ▷ Mark solution state
4:   Set |ψ⟩ = G |ψ⟩ ▷ Invert around average
5: end for
6: Measure |ψ⟩

problem, but if it is inefficient to evaluate by quantum computations for a particular problem, then so is the Grover search.
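Algorithm 1 and the amplitude picture of fig. 7 can be simulated directly with state vectors. A small NumPy sketch for n = 3 qubits and a single marked item (the oracle is represented simply as a sign flip on the marked index):

    import numpy as np

    n, marked = 3, 5                       # 8 items, solution index 5
    N = 2 ** n
    psi = np.full(N, 1 / np.sqrt(N))       # uniform superposition |s>

    oracle = np.ones(N); oracle[marked] = -1          # I_O: phase inversion
    iterations = int(round(np.pi / 4 * np.sqrt(N)))   # optimal iteration count
    for _ in range(iterations):
        psi = oracle * psi                            # mark solution state
        psi = 2 * psi.mean() - psi                    # invert around average (G)

    print(np.abs(psi) ** 2)    # probability concentrated on index 5 (~0.945)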

3.4 Quantum Annealing

Quantum annealing is a form of adiabatic computation meant to solve unconstrained optimization problems, and is implemented in D-Wave quantum devices. The problem the quantum annealer solves is to find the ground state of a given Hamiltonian H_opt. The quantum annealing technique is to let the state of the device evolve adiabatically from an easily constructed state, described as the ground state of an operator H_0. Qubit interactions are scaled with time toward the final Hamiltonian H_opt. Based on the normalized time parameter s = t/T ∈ [0, 1], where T is the total annealing time, the total system Hamiltonian is described as

H = A(s)H_0 + B(s)H_opt,    (3.6)

with A(s) and B(s) as functions that can be chosen depending on the implementation, but that satisfy A(0) = 1 − B(0) = 1 and A(1) = 1 − B(1) = 0. The adiabatic theorem of quantum mechanics states that if the annealing time T is long enough, or rather if A(s) and B(s) vary slowly enough, then the system will remain in the ground state of the total Hamiltonian throughout this evolution, ending up in the desired ground state of H_opt[24]. In practice this technique is not perfect and relies on repeating such an adiabatic evolution and measurement, but the measurement result with the largest probability should be the ground state if done correctly. One can compare this to simulated annealing, in which the solution states (not necessarily quantum) are explored through a temperature-dependent random walk, and similarly sampled. In implementing these techniques in real devices there are however physical and technical limitations to what can be achieved. In a quantum annealing device, qubits are meant to be acted upon by external forces so as to develop the interactions desired in the problem formulation. However, the implementable interaction strength is typically limited, and the connectivity of the quantum device is constrained to what is called the chimera graph, representing the supported connectivity of the device. When encoding the problem onto the quantum annealer, one therefore requires an embedding onto the chimera graph, i.e. a reformulation of the problem using the interactions which are available. This could require scaling interactions and using more than one physical qubit to encode one logical (problem) qubit, as demonstrated in fig. 8.
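As a numerical illustration of (3.6), the sketch below tracks the spectral gap of a two-qubit annealing Hamiltonian under a linear schedule A(s) = 1 − s, B(s) = s; both the linear schedule and the toy problem Hamiltonian are assumptions made for illustration, not the schedule or problem of any real device.

    import numpy as np

    X = np.array([[0, 1], [1, 0]])
    Z = np.array([[1, 0], [0, -1]])
    I = np.eye(2)

    # Driver H0 = -(X1 + X2), whose ground state is the uniform superposition,
    # and a toy problem Hamiltonian Hopt = Z1 Z2 - Z1, with ground state |01>.
    H0 = -(np.kron(X, I) + np.kron(I, X))
    Hopt = np.kron(Z, Z) - np.kron(Z, I)

    for s in np.linspace(0, 1, 5):
        H = (1 - s) * H0 + s * Hopt        # A(s) = 1 - s, B(s) = s
        evals = np.linalg.eigvalsh(H)      # sorted eigenvalues
        print(f"s = {s:.2f}, gap = {evals[1] - evals[0]:.3f}")

The adiabatic theorem ties the required annealing time to the minimum of this gap: the smaller the gap along the path, the slower A(s) and B(s) must vary.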

4 Quantum Methods for Alignment

One advantage of quantum algorithms lies in the inherent parallelism of quantum superposition, and in the interference among such superposed states. This is what provides the speedup in Grover search and other algorithms such as the quantum Fourier transform[26]. The properties of these algorithms that utilize this parallelism have therefore become popular starting points for newly developed methods, and this section introduces some of the methods recently proposed for use in the sequence alignment or string matching problems.

4.1 Grover-Based Methods

4.1.1 Hamming Distance Lookup

One search algorithm adapted to genomic matching was proposed by Hollenberg[27]; it makes use of a Grover search procedure to find the database position of matching sequences. What is considered

Figure 8: Chimera graph unit cell of D-Wave[25] and chimera graph embedding example. (a) Chimera graph example of a unit cell of the D-Wave 2000Q, with qubits numbered 1 to 8. (b) Embedding of a three-qubit loop onto the chimera graph of fig. 8a; using two physical qubits to represent one logical qubit is required.

is a database of genomic sequences, D = {R_1, ..., R_N}, and a reference sequence r = r_1, ..., r_M to be searched for in the database. The physical interpretation of this database could be either a collection of genes from a real database, or a list of subsequences of length M in some sequence of length N + M. The method was proposed for matching protein sequences, which in comparison with DNA are generally shorter, albeit over a larger alphabet (20 amino acids versus 4 DNA bases), but it is possible to apply it to either type of sequence. His method consists of entangling two registers Q_1 and Q_2 with database sequences and indices, respectively, in which the latter could encode the subsequence position in its origin. The database state would be constructed as

|D⟩ = (1/√(N−M+1)) ∑_{i=1}^{N} |B_i⟩ ⊗ |i⟩,    (4.1)

where |B_i⟩ in Q_1 holds the bit representation of sequence R_i and |i⟩ the index state. After database construction, the register Q_1 is transformed according to the states' difference to the reference, using a series of CNOT gates with the reference bits acting as controls. This changes the sequence states

|B_i⟩ into what he calls "Hamming states", |B̄_i⟩, showing the differing sequence and reference elements as qubits in the |1⟩ state. This is then used to construct an oracle in Grover search, marking the states having all zeros, or in terms of the Hamming distance function T, having T = 0. The phase inversion operator in Grover search would then be designed as

I_S |B̄_i⟩ = { −|B̄_i⟩  if T(i) = 0,
            { |B̄_i⟩   otherwise,    (4.2)

and the diffusion operator as I_H = 2|H⟩⟨H| − I, acting around the Hamming state

|H⟩ = (1/√(N−M+1)) ∑_{i=1}^{N} |B̄_i⟩ ⊗ |i⟩,    (4.3)

on register Q_1. If these operators are applied while knowing the number of marked states k in the database, for which there are strategies to calculate[22], then one of those solutions will be found in O(√(N/k)) Grover iterations. This solution would then mark the position in the database or sequence where the reference is located. While this formulation finds an exact match, an extension is suggested to allow for fuzzy matching, i.e. similarity-based matching, by repeating the Grover search and each time searching for a new specific Hamming distance T = n, as is done in attempts to use Grover search for optimization[28, 29]. Taking this into account, one would in the worst case check all possible Hamming distances, making the worst-case complexity O(rM√(N/k)), with r being a certainty measure in the subroutine counting the number of marked states. However, this method was proposed as a starting point in quantum sequence alignment

and does not discuss more practical aspects of its implementation. One thing to note in his approach is the transformation of database states into Hamming states, which greatly simplifies the construction of the phase inversion operator. For the T = 0 case, it is easily constructed by a multibit Toffoli gate, together with X and Hadamard gates, as shown in fig. 9.

Figure 9: Phase inversion circuit for the exact matching T = 0 case.

As the transformation is done over the superposition of database entries and therefore in parallel, it effectively summarizes the comparison of database elements in one simple operation, which cannot be done classically. The efficiency of finding a solution when performing exact matching might be superior to the classical case, assuming the database is initially unsorted. However, if it were an extensive genomic database, then it is likely for there to be a sorting strategy in place, reducing the complexity of classical exact searching to at least O(M log N) through binary searching. Although, were the database constructed from a genomic sequence without initial pre-processing, as in the local alignment problem, it would essentially keep its advantage when using small reference patterns, in comparison to the Smith-Waterman algorithm and other exact string matching methods, having complexities of O(M + N) or O(NM)[30]. But the Smith-Waterman algorithm would also acquire approximate matches, for which this method would keep its advantage only if

1. the new oracle for checking T = n could be implemented efficiently,

2. and the database construction of (4.1) could be done in sub-polynomial time.

If these conditions cannot be fulfilled, then either the oracle call or the memory reconstruction after a collapsing measurement would cause the search over one Hamming distance to be at least O(N), effectively ridding the method of its advantage.
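State by state, the CNOT transformation into Hamming states is just a bitwise XOR with the reference. A classical Python analogue (illustrative only; the quantum version acts on the whole superposition in a single pass):

    # Classical analogue of the Hamming-state transformation: each database
    # entry B_i is XORed bitwise with the reference r, so matching positions
    # become 0 and T(i) is the number of remaining ones.
    def hamming_state(entry_bits, reference_bits):
        return [b ^ r for b, r in zip(entry_bits, reference_bits)]

    database = [[1, 0, 1, 1], [0, 1, 1, 0], [0, 1, 0, 1]]
    reference = [0, 1, 1, 0]

    for i, entry in enumerate(database):
        h = hamming_state(entry, reference)
        print(i, h, "match" if sum(h) == 0 else f"T = {sum(h)}")
    # Entry 1 yields the all-zero Hamming state, i.e. T = 0, and would be
    # the state marked by the oracle of fig. 9.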

4.1.2 Conditional Oracle Search

Another technique inspired by Grover search is that of Mateus and Omar[31, 32] for closest pattern matching, using an oracle-based recall procedure. While the method is not directly targeted toward genetic pattern search, the adaption is straightforward. The idea of the method is to amplify the states in a superposition representing the positions and elements in a sequence. This is executed through a Grover-inspired search procedure, but with a number of oracles equal to the size of the sequence alphabet Σ, executed with a pattern of length M, p_1 ... p_M, and a search sequence of length N, with characters w_1 ... w_N, according to algorithm 2. Here, the operators U_{p_j} act upon a state |i⟩ by inverting the phase if w_i = p_j, and G is the usual Grover operator G = 2|ψ_0⟩⟨ψ_0| − I. Which phase operator is applied is chosen conditionally on the reference element being processed, and one would essentially need M oracles. By the repeated inversion of phase for random positions in the reference pattern, and amplification through the Grover diffusion operator, a state storing a subsection of the search sequence containing M′ matching elements would on average be amplified to a probability of M′/M. The authors state that the total run time of this algorithm is O(M log³(N) + N^{3/2} log²(N) log(M)) in the number of elementary quantum operations, which is faster than the classical counterpart O(MN²), calculated by the authors when counting the number of Boolean gates that would be used in the classical circuit. The proposed method seems promising in its complexity, although its usage requires the allocation of M registers capable of holding all N positions of the search pattern, yielding a qubit requirement of at least M⌈log₂(N)⌉. This is a rather modest usage for general patterns, and setting a limit of ∼100 qubits for future hardware, it would not be unreasonable to have sequences on the size of ∼100-1000 base pairs

Algorithm 2 Closest pattern matching with oracle-based search[31]
1: choose r ∈ [0, ⌊√(N − M + 1)⌋]
2: set |ψ_0⟩ = (1/√(N−M+1)) ∑_{k=1}^{N−M+1} |k, k+1, ..., k+M−1⟩
3: for i = 1 to r do
4:   choose j ∈ [1, M] uniformly
5:   set |ψ⟩ = ( I^{⊗(j−1)} ⊗ U_{p_j} ⊗ I^{⊗(M−j)} ) |ψ⟩
6:   set |ψ⟩ = ( G ⊗ I^{⊗(M−1)} ) |ψ⟩
7: end for
8: set m to the result of measurement on the first component of |ψ⟩

and a reference of ∼10. It might be unsuited for usage on the scale of the human genome, but could be applicable for small-scale problems, if its speedup is relevant at such a scale. The performance would however need to be investigated further before any conclusion on its applicability can be reached. Another thesis provides a more extensive survey of the aforementioned methods, providing detailed circuit usage analysis and simulation implementations, to which the more interested reader is directed[33].

4.1.3 String Matching and Edit Distance

Techniques have also been suggested for solving the problems of exact string matching and edit distance, both directly related to certain scenarios of the alignment problem. Among those is the string matching algorithm in [34], which achieves a complexity of Õ(√N + √M) in time, with Õ meaning up to logarithmic factors of the sequence length N and reference length M. This method uses a technique called deterministic sampling and a probabilistic oracle capable of checking for a mismatch at a certain position of the search sequence, which it is proposed to implement using Grover search. Another algorithm works on approximating the edit distance between strings, which for some score functions is equivalent to global alignment. The achievement of this algorithm is to approximate the edit distance in subquadratic time, reportedly in O(N^{1.858}) or O(N^{1.781}) depending on the approximation factor and assuming both strings are of the same length N. Both algorithms are heavily reliant on the efficient implementation of certain oracles used in Grover search, but also on statistical methods which would be too involved to cover here. The interested reader is therefore directed to the respective papers[34, 35].

4.2 Fourier-Based Methods

An algorithm based on the quantum Fourier transform was invented by Schutzhold[36] and interpreted for the purpose of local sequence alignment by Prousalis and Konofaos[37]; it performs analysis of dot matrices, mentioned in section 2.2.2. Assuming a dot matrix A_ij ∈ {0, 1} is given for a pair of sequences of lengths N and M, with the goal being to identify nearly diagonal lines in this matrix, one encodes the dot matrix into quantum memory as the state

|ψ⟩ = (1/√(∑_{i,j} A_{ij})) ∑_{i=0}^{N−1} ∑_{j=0}^{M−1} A_{ij} |i⟩ ⊗ |j⟩ = (1/√(ρ_NM)) ∑_{k=1}^{ρ_NM} |z_k⟩,    (4.4)

with ρ_NM being the number of ones in the matrix A_ij and z_k = (Ni + j)_k denoting the linear index position of the kth non-zero point. The intuition behind creating this state is that a line of certain inclination and significant length in the matrix causes a periodic spike when viewing the matrix A_ij as a function of the linear index, A(z) = A(Ni + j). By performing Fourier analysis on such a function, the relevant features of the line can be extracted from the peaks of the obtained spectrum. The quantum advantage then lies in the exponentially more efficient application of the Fourier transform that is the Quantum Fourier Transform (QFT),

QFT(|z⟩) = (1/√(NM)) ∑_{ω=0}^{NM−1} e^{2πi zω/(NM)} |ω⟩,    (4.5)

which can be performed in O(log₂²(NM)) simple operations[26]. However, as one is interested in merely measuring this state to obtain information on the frequency peaks, the algorithm can be optimized to only

use O(log₂(NM)) simple operations. Iterating this transform a number of times Ω to get representative data of the spectrum, the inclination and width of the line can be extracted by analysis borrowed from Laue diffraction. The location of the pattern, however, cannot be found by this analysis alone. This would require splitting up the dot matrix and repeating the procedure for each part, until one is certain to have isolated any line that was previously discovered. So in summary, if one would use a simple interval search strategy for locating a line, this method could run in O(Ω log₂²(NM)) simple operations. Additionally, the qubit usage is only that required to encode the dot matrix in a quantum state, which would be merely log₂(NM) qubits. Essentially, one could use sequences on the order of 2⁵⁰ ≈ 10¹⁵ base pairs in this method if allowing for ∼100 qubits on a future quantum computer. What one should note, however, is that it is not certain how to prepare the state (4.4) efficiently. The authors more or less state this as a black-box operation, BB, acting as

BB (|i⟩ ⊗ |j⟩ ⊗ |0⟩) = |i⟩ ⊗ |j⟩ ⊗ |A_ij⟩,    (4.6)

for which one could initialize the state as a uniform superposition of all |i⟩|j⟩ states, which can be done with log₂(NM) Hadamard gates, and then perform a postselection, i.e. a deciding measurement, on the third register. If the measurement result is the state |1⟩, the state is prepared as in (4.4). While Schutzhold[36] suggests a quantum optical implementation of this black box, it has not yet been tested and it is unclear how such an operator should be implemented efficiently on other quantum computing platforms. Another difficulty is the Fourier analysis to be performed after obtaining frequency peak measurements. As the dot matrix is not very likely to contain just one simple and straight line, the spectrum analysis would have to be sufficiently robust to extract the dominant patterns, despite other noise-inducing patterns being present. The method is therefore promising for local alignment purposes, but further development is required, both in the analysis and in the implementation, before it is ready to be practically applicable.
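The classical core of the idea can be checked with an ordinary FFT over the linearized dot matrix. A NumPy sketch for the sequences of fig. 3 (the quantum method would replace this O(NM log(NM)) transform with the QFT; the row-major linearization used here is a choice made for illustration):

    import numpy as np

    def dot_matrix(a, b):
        """Binary matrix with A[i, j] = 1 where the sequences match."""
        return np.array([[1 if x == y else 0 for y in b] for x in a])

    a, b = "ACCGT", "TCCGTACC"
    A = dot_matrix(a, b)
    N, M = A.shape

    # With row-major linear index z = i*M + j, a diagonal line steps by
    # M + 1 per element, so it shows up as an approximately periodic spike
    # and the spectrum peaks near multiples of N*M/(M + 1).
    spectrum = np.abs(np.fft.fft(A.flatten()))
    print(np.argsort(spectrum[1:])[::-1][:5] + 1)   # dominant frequencies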

4.2.1 Multiple Alignment Algorithm Apart from the Grover-based and Fourier-based pattern and string matching algorithms, there has been a suggestion of an algorithm to solve the multiple alignment problem using only unitary operations at its core. This algorithm, proposed by Iriyama and Ohya[38], works as presented in algorithm 3. Having L sequences of length N, assuming that gaps have been added at the ends to make their lengths uniform, the algorithm would use L registers, {Q_i}_{i=1}^L, of length N to represent the alignment, plus two additional registers: one holding a threshold score |s⟩ and a single qubit b to be used for computation. Alignments are then represented by letting positions at which a character is placed have the value |1⟩ and indel positions |0⟩, one example being

S1 : A T C G − −   Q1 : |111100⟩
S2 : A − G T − −   Q2 : |101100⟩
S3 : T G − A A −   Q3 : |110110⟩ .
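For concreteness, the mapping from a gapped alignment to these register bit strings takes only a few lines of Python (a trivial sketch; the function name is ours, not from [38]):

# Map aligned rows (with '-' for indels) to the |Q_i> bit strings above.
def alignment_to_registers(rows):
    return ["".join("0" if c == "-" else "1" for c in row) for row in rows]

rows = ["ATCG--", "A-GT--", "TG-AA-"]
for row, q in zip(rows, alignment_to_registers(rows)):
    print(row, "->", "|" + q + ">")   # reproduces |111100>, |101100>, |110110>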

The algorithm would then use various simply implemented unitaries to permute these bits, for instance to move the indels within the alignment. Once enough permutations have been performed, a black-box unitary U_C calculates the score of the states and compares it to the chosen threshold score s. If the score of a superposed state is larger, b is set to |1⟩ for that state; otherwise it stays |0⟩. If a postselection is then performed on the bit b and the result |1⟩ is obtained, a measurement of the L registers {Q_i} yields one of the desirable permutations. Depending on the threshold, this method could produce an approximate or optimal alignment, and would reportedly do so in O(NL) operations while using NL + log₂(s) + 1 qubits. The number of qubits may be too large to run on a universal computer at the scale of genomic sequences, but the complexity is at least sub-exponential in the number of sequences and therefore low in comparison to other methods for multiple alignment. It is, however, unclear what factors are accounted for in calculating this complexity. The authors propose a very advanced black-box unitary for the latter part of their algorithm and provide explicit details of neither its decomposition nor their obtained complexity. In the worst case, the design of their unitary will be exponential in the number of bits, making the complexity exponential as well. The method was, however, proposed as an initial suggestion for algorithms to tackle

this alignment problem and shows the capacity of quantum parallelism given efficient unitaries. In future endeavors, one could perhaps provide an improved version of this method or such an efficient unitary implementation.

Algorithm 3 Quantum Multiple Sequence Alignment
1: Prepare initial state |ψ⟩ = (⊗_{i=1}^L Q_i) ⊗ |s⟩ ⊗ |0⟩ according to given sequences and indel positions
2: Set |ψ⟩ = ⊗_g H_g |ψ⟩  ▷ Create uniform superposition over all indel qubits g
3: for every sequence i = 1 to L do
4:   For every gap g_j, make a pair with a random character c_j  ▷ Choose permutations
5:   Set |ψ⟩ = ∏_{all pairs (g_j,c_j)} CX_{g_j,c_j} |ψ⟩  ▷ Permute characters c_j with chosen gaps g_j
6: end for
7: Apply U_C to registers {Q_i}_{i=1}^L, U_C |ψ⟩ = |f(Q_1,...,Q_L)⟩ ⊗ |s⟩ ⊗ |0⟩  ▷ Calculate score function of permutations
8: Apply U_S to marking bit |b⟩ = |0⟩
9: Measure |b⟩
10: if |b⟩ = |1⟩ then
11:   Measure registers {Q_i}_{i=1}^L
12: end if

5 Ising Formulation of Multiple Sequence Alignment

The field of quantum optimization has attracted substantial attention in recent years. Quantum annealing devices are being realized by companies such as D-Wave, and new optimization methods such as the Variational Quantum Eigensolver (VQE)[39] and the Quantum Approximate Optimization Algorithm (QAOA)[40] are being developed and studied for the purpose of solving hard problems both in the field of chemistry and in the realm of NP-hard combinatorial problems[41]. Especially among the NP-complete problems, many have been translated into finding the minimum energy of an Ising Hamiltonian[42], with hopes of finding efficient solutions or accurate approximate solutions through quantum methods. Within the field of bioinformatics, the problem of multiple sequence alignment was presented in section 2.2.3; it has been shown to be NP-complete under the SP alignment scoring scheme, and its exact solution typically requires space exponential in the number of sequences. As such, it could benefit from future methods for quantum optimization. There have been announcements of using quantum hardware to solve the MSA problem more efficiently with quantum annealing software[43], and prototype tools have been developed to encode MSA problems into Ising formulations to run on quantum annealing hardware[44]. However, to the extent of this author's knowledge, there is a lack of published work describing such translations and demonstrating their capabilities. Therefore, in this thesis, such an Ising formulation is derived and its correctness is demonstrated through simulations and quantum hardware execution. In Ising formulations of combinatorial optimization problems[42], one writes the objective function of the problem, assumed to be expressible as a function of a string of classical bits z = z_1, . . . , z_N describing the configuration of the solution,

f_opt = f(z_1, . . . , z_N) , (5.1)

and translates it into a Hamiltonian of a number of quantum spin operators,

H_opt = H(σ_1^z, . . . , σ_N^z, σ_1^x, . . . , σ_N^x) , (5.2)

such that finding the ground state solves the original problem. The only criterion is that the Hamiltonian eigenvalues {λ_i} correctly reflect the costs of the various problem configurations, or at least that the minimum eigenvalue is attained for the spin values representing the optimal solution to the problem. However, most optimization problems are defined with a set of rules for valid configurations, which are expressed as constraints on the objective function. These are usually formulated as hard constraints, in

which the optimization procedure is forced to respect the conditions while solving, thus always keeping them satisfied. Unfortunately, quantum annealing hardware is generally not capable of enforcing hard constraints. They must therefore be reformulated as soft constraints, which the optimizer is allowed to violate but is penalized for doing so. As such, one introduces a penalty Hamiltonian, H_P, and formulates the total Hamiltonian as

H_tot = A H_opt + B H_P , (5.3)

where A and B are real constants. The design should ideally be such that H_P is positive semidefinite, with positive eigenvalues for all states that are invalid solutions. The coefficients A and B are chosen such that the penalty is sufficiently severe that a spin change violating the conditions can never yield a net negative energy. If one identifies ∆H_opt as the largest reward, and ∆H_P as the smallest penalty, obtainable from changing a single spin, then it suffices to fulfill

B ∆H_P ≥ A ∆H_opt ⇐⇒ B/A ≥ ∆H_opt/∆H_P . (5.4)

In the MSA problem formulation, there is a natural objective function, the SP score (2.5), which will be the main Hamiltonian representing the problem. However, the SP score itself does not enforce any conditions on the configuration of the sequence base elements. For any configuration of the alignment, the base pairs are required to respect the order of the initial sequences, and no mixing of base pairs between sequences should occur. These are conditions to be enforced in addition to the optimization. Moreover, one may have to enforce the existence of each base pair, depending on the type of encoding used. To summarize, the following kinds of Hamiltonians have to be designed for the MSA problem:

1. The SP score Hamiltonian, HSP

2. The order Hamiltonian, enforcing the sequence order in the alignment, HO

3. The validity Hamiltonian, H_val, which penalizes whatever states may be invalid for the given encoding.

5.1 Maximum Trace Formulation A natural consideration in designing the score Hamiltonian is the existing mapping of MSA to the Maximum Trace (MT) problem, formulated as a graph problem[45]. In the MT formulation of the MSA problem, put simply, one is given a fully connected, undirected, weighted graph of nodes, each representing a base pair among the sequences. The solution consists of choosing edges (forming a trace) that maximize the total weight, while ensuring that the chosen edges, i.e. element pairs, respect the sequence order. This is achieved by introducing directed edges between the nodes in the same sequences and reformulating the problem as forming a trace containing no cycles[46]. From a practical standpoint, a trivial choice of encoding would be to let each edge be represented by a spin s_i,j. The score Hamiltonian could then be constructed as a sum weighting each edge appropriately,

H_SP,trace = ∑_{i,j} w_i,j (s_i,j + 1)/2 . (5.5)

But this kind of encoding would quickly require large numbers of spins, as there would have to be one spin per possible alignment pair. For L sequences of largest length N, this would grow as O(N^L), quickly becoming infeasible for the scale of modern annealing devices. Furthermore, to enforce the "no cycles" condition, auxiliary bits encoding a node order would be needed, making the formulation even more expensive. Considering the encoding cost, one realizes that the maximum trace formulation of H_SP is required only if the encoding should incorporate even the most extreme alignment possibilities (e.g. every sequence aligning at opposite ends). If one can restrict the alignment opportunities to less extreme ones, the spin usage can be considerably reduced. This is what is proposed in the "Column alignment" formulation.

5.2 Column Alignment Formulation In this formulation, the alignment is viewed less as a "pairing" of base pairs and more in terms of "aligned placement". Here, a number of columns C is allocated for base pair/indel placement, and all elements placed in the same column are then decidedly aligned, as illustrated in fig. 10. Such a method would only

Figure 10: Alignment through column placement. Sequence elements are aligned by placement in pre-allocated columns, respecting the original order. Empty positions indicate indels.

require an encoding capturing the placement of base pairs into the mentioned columns. Thus, letting a base pair in sequence s, at position n in the original sequence, have a spin s_(s,n,i) indicating whether the base pair is placed in column i or not, is all that is required to encode the problem. Note, however, that we shall not use the direct spin variables s_(s,n,i) ∈ {−1, 1}, but their Boolean counterparts x_(s,n,i) = (1 + s_(s,n,i))/2 ∈ {0, 1}, and the corresponding spin operators, which will be used interchangeably with the variables for convenience. While this restricts the alignment configurations to those achievable with the allocated columns, the spin usage is also limited to the number of base pairs and their placements. Specifically, having C columns with L sequences of lengths {n_1, . . . , n_L}, the spin usage S is

S = C ∑_{i=1}^{L} n_i , (5.6)

which, with N the maximum number of base pairs in a sequence, can be seen as S ∈ O(CLN), i.e. polynomial in the problem parameters. It may be more common for MSA problems to choose the number of columns as N + G, with G being a number of extra gaps allowing for more flexible alignment configurations; the bit usage would then be O(L(N² + NG)) = O(LN²), with G expected to be small in comparison to the size of the largest sequence. Note in particular that the size requirement in terms of spins is merely linear in the number of sequences, in contrast to the classical space requirement, which is exponential. In constructing the SP score Hamiltonian, it is assumed that matching scores for each pair of relevant sequence elements are given as a weight matrix ω_(s1,n1,s2,n2), according to some defined matching scores, and that there also exists an indel matching penalty g. This Hamiltonian will then consist of two main terms: the term responsible for rewarding matches and penalizing mismatches, and the term for indel matching, which has to be designed differently since the spins only encode the positions of base pairs. For the former term, a sufficient design is

H_SP,1 = ∑_{s1<s2} ∑_{n1,n2} ∑_i ω_(s1,n1,s2,n2) x_(s1,n1,i) x_(s2,n2,i) . (5.7)

The summation is performed first over all pairs of sequences (without double counting), then over all pairs of elements in the chosen sequences. The final sum is taken over all columns, adding a term weighted by the reward/penalty for the specific base pairs, which equals the weight if these base pairs are placed in the same column, and 0 otherwise. Assuming a simple linear gap penalty for the alignment scoring, the Hamiltonian penalizing the alignment with gaps, for g ≠ 0 and constant, can be written as

H_SP,2 = g ∑_{s1,s2} ∑_{n1} ∑_i x_(s1,n1,i) (1 − ∑_{n2} x_(s2,n2,i)) , (5.8)

where the summation is again taken over all sequences, s1 and s2, and the following summations over sequence elements n1 and n2 in sequences s1 and s2, respectively. Since the chosen encoding merely captures the placement of base pairs, the only way to describe alignment with an indel is to have no alignment with anything from the other specific sequence, which is captured by the term 1 − ∑_{n2} x_(s2,n2,i).

However, one should note that such a term is valid only when ∑_{n2} x_(s2,n2,i) ≤ 1, i.e. when only one base pair per sequence can occupy a specific column, for otherwise it would become beneficial for the optimizer to match with indels. This condition will be enforced by a term constructed below; one can, however, note that the term could be squared, as

(1 − ∑_{n2} x_(s2,n2,i))² , (5.9)

which would instead increase the penalty for such invalid placements. However, it would make (5.8) non-quadratic and therefore require simplification, as only quadratic forms are supported on certain hardware, e.g. D-Wave's annealers. As discussed before, a validity Hamiltonian, which enforces the conditions of valid placement, is necessary. The conditions for the spin configuration to be valid, in the sense of being able to represent an alignment configuration, are

1. that each base pair exists in exactly one column (assuming one does not allow for explicit deletion of base pairs),

2. that a column position for a specific sequence is occupied by a maximum of one base pair.

The first condition can be enforced by

H_val = ∑_{s,n} (1 − ∑_i x_(s,n,i))² , (5.10)

where the first sum is over every sequence and contained base pair. The term can be seen to be positive when base pairs are missing placements (∀i, x_(s,n,i) = 0), and the penalty increases quadratically with the deviation from valid configurations. The second condition is more concisely described together with the term enforcing the sequence order. Assuming that complete deletion of base pairs is not allowed in the alignment, one can restrict the placements of the base pairs in each sequence by

H_O = ∑_{s,n} ∑_{i≤j} x_(s,n,j) x_(s,n+1,i) , (5.11)

making the alignment both respect the sequence orders (for i ≠ j) and limiting the placement onto the same column (for i = j). The intuition behind this Hamiltonian is illustrated in fig. 11.

Figure 11: Illustration of order violation, and of the elements whose interaction gives a penalty (red) according to (5.11). Both examples illustrate the penalty interaction activating when the placement order does not follow the sequence order. The upper example shows that only the interaction with the nearest sequence element activates the penalty, even if the order is violated for more than two elements (s_{n+1} and s_{n+2} both placed before s_n).

It can be seen that an alignment featuring a sequence element at an earlier position will give a penalty if it is placed farther to the right than a later occurring base pair. One can note that the penalty does not grow when non-neighboring base pairs are also placed out of order; however, it is sufficient that a penalty is given by a single interaction.
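To make the construction concrete, the following condensed sketch builds the QUBO coefficients for (5.7), (5.10) and (5.11) with g = 0, so that the gap term (5.8) vanishes. The full script used in this thesis is given in section 10.3; the helper names and default values here are illustrative.

# Build the column-alignment QUBO as a dictionary {(variable, variable): bias},
# with variables keyed as (sequence, position, column).
from collections import defaultdict
from itertools import combinations

def match_weight(a, b, match=-1.0, mismatch=1.0):
    return match if a == b else mismatch

def build_qubo(seqs, A=1.0, B=None):
    C = max(len(s) for s in seqs)              # number of columns, as in table 3
    if B is None:                              # eq (5.19) with g = 0: B = N_tot * max(omega)
        B = float(sum(len(s) for s in seqs))
    Q = defaultdict(float)

    # H_SP,1 (5.7): reward/penalize co-placement of elements from different sequences
    for s1, s2 in combinations(range(len(seqs)), 2):
        for n1, a in enumerate(seqs[s1]):
            for n2, b in enumerate(seqs[s2]):
                for i in range(C):
                    Q[(s1, n1, i), (s2, n2, i)] += A * match_weight(a, b)

    # H_val (5.10): (1 - sum_i x)^2 per element; since x^2 = x the linear part is -x
    for s, seq in enumerate(seqs):
        for n in range(len(seq)):
            for i in range(C):
                Q[(s, n, i), (s, n, i)] += -B
                for j in range(i + 1, C):
                    Q[(s, n, i), (s, n, j)] += 2 * B

    # H_O (5.11): penalize element n+1 placed in a column i <= j of element n
    for s, seq in enumerate(seqs):
        for n in range(len(seq) - 1):
            for j in range(C):
                for i in range(j + 1):
                    Q[(s, n, j), (s, n + 1, i)] += B
    return dict(Q)

Q = build_qubo(["AT", "T"])   # the toy problem used in the simulation tests

Since the dictionary is already quadratic, it can be passed directly to any QUBO-accepting Ocean sampler.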

To summarize, the Hamiltonian solving the problem of multiple sequence alignment by column placement of the involved sequence elements, given by (5.7), (5.8), (5.10) and (5.11) with weights according to (5.3), is

H = A ∑_{s1<s2} ∑_{n1,n2} ∑_i ω_(s1,n1,s2,n2) x_(s1,n1,i) x_(s2,n2,i)
    + A g ∑_{s1,s2} ∑_{n1} ∑_i x_(s1,n1,i) (1 − ∑_{n2} x_(s2,n2,i))
    + B ∑_{s,n} (1 − ∑_i x_(s,n,i))²
    + B ∑_{s,n} ∑_{i≤j} x_(s,n,j) x_(s,n+1,i) . (5.12)

To choose the coefficients A and B according to (5.4), consider the part of the SP Hamiltonian that contains a given spin x_(c,m,k),

H_SP,(c,m,k) = A ∑_{s1<s2} ∑_{n1,n2} ∑_i δ_{(s1,n1,i),(c,m,k)} ω_(s1,n1,s2,n2) x_(s1,n1,i) x_(s2,n2,i)
             + A ∑_{s1<s2} ∑_{n1,n2} ∑_i δ_{(s2,n2,i),(c,m,k)} ω_(s1,n1,s2,n2) x_(s1,n1,i) x_(s2,n2,i) ,

together with the corresponding restriction of the gap term (5.8).

Now, imagine the spin x_(c,m,k) is flipped, i.e. changed by an amount ∆x ∈ {−1, 1}. The corresponding change in the SP Hamiltonian will then be

∆H_SP = A ∆x [ gL + ∑_{s,n} (ω_(c,m,s,n) − 2g) x_(s,n,k) ] . (5.15)

Choosing the sign convention such that A > 0, and g > 0 according to its definition as a penalty, the greatest reward is obtained for the minimal (most negative) value of ∆H_SP possible over the values of ∆x, which, with N_tot as the total number of base pairs, is bounded according to

∆H_SP ≥ A min{ −N_tot max_{s1,n1,s2,n2}(2g − ω_(s1,n1,s2,n2)) − gL ,  −gL + N_tot min_{s1,n1,s2,n2}(ω_(s1,n1,s2,n2) − 2g) } = AΣ . (5.16)

It also becomes clear from this expression, since L ≤ N_tot, that ∆H_SP can be negative for an arbitrary spin flip, as long as the matching reward is chosen appropriately non-positive. Knowing this, the coefficients of the objective Hamiltonian terms can be chosen according to

|∆HP | ≥ |∆HSP | ⇐⇒ B ≥ A |Σ| . (5.17)

However, this bound is somewhat conservative, since the estimation assumed that all spins x_(s,n,k) = 1, which is an unlikely scenario. A more heuristic but practical bound could therefore be chosen if the interaction strength needs to adapt to the limitations of the solving device.
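As a small numeric check of (5.16)-(5.17), using the table-3 costs (match −1, mismatch +1, g = 0; the helper is ours and the sequence choice illustrative), the bound for the {AT, T} test case with three base pairs in total evaluates to B ≥ 3:

# Evaluate the penalty bound (5.16)-(5.17) from the extremal matching weight.
def penalty_bound(n_tot, L, w_min=-1.0, g=0.0, A=1.0):
    branch1 = -n_tot * (2 * g - w_min) - g * L   # max over omega of (2g - omega)
    branch2 = -g * L + n_tot * (w_min - 2 * g)   # min over omega of (omega - 2g)
    return A * abs(min(branch1, branch2))        # B >= A * |Sigma|, eq (5.17)

print(penalty_bound(n_tot=3, L=2))               # -> 3.0 for the {AT, T} case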

5.3 Simulation Results To demonstrate the correctness of the constructed Hamiltonian (5.12), it is given as input to a simulated annealing sampler implemented in the Ocean SDK package, provided by D-Wave[47]. To use the Ocean software, the problem has to be formulated in QUBO form, i.e.

H = x⃗ᵀ Q x⃗ , (5.18)

where Q is a matrix containing the coupling constants and x⃗ is a vector of spin operators. By design, our Hamiltonian can be written in such a form without simplification, and is thus straightforward to use. Section 10.3 shows the Python code for the construction of the Hamiltonian, the calculation of the matching matrix, and the input to the D-Wave Ocean SDK function used in the following tests. The alignment score is calculated using the costs summarized in table 3. As written, the number of alignment columns is chosen as the number of characters in the longest sequence, N_max.

Parameter               Value
Match cost              −1
Mismatch cost           1
Gap penalty, g          0
Number of columns, C    N_max

Table 3: Parameters used in alignment tests.

Given the costs and rewards for matching, and setting the first Hamiltonian coefficient to A = 1 in (5.12), the penalty coefficient can, according to (5.16), be chosen as

B = g(2N_tot + L) + N_tot max ω , (5.19)

which for g = 0 reduces to B = N_tot max ω. g = 0 was chosen because this bound is seen to be lower than in the g ≠ 0 case, and could therefore be more suitable for running on quantum hardware with limited interaction strength. In fig. 12, the number of samples obtained from the simulated annealer for different alignments, along with the score of each alignment, is shown for various sequences. The first plot, fig. 12a, shows that the Hamiltonian is constructed correctly for the most trivial case and that the annealer can find the optimal solution. This stays true for increasingly larger and more difficult alignment problems, in the sense of longer and more numerous sequences. One can note how in fig. 12c the annealer manages to sample the two existing optimal solutions almost equally and give them the majority, despite there being no alignment with all matching base pairs, in contrast to fig. 12b. An alternative would have been to violate the initial sequence order, but that scenario was not among the most populated results, showing the correctness of the penalty Hamiltonian. In figs. 12d and 12e, the number of sequences is increased to test the validity beyond pairwise alignment, and it can be seen that the Hamiltonian energies correspond to the correct SP scores for the parameters in table 3, and that the optimal solution is the most populated here as well. One relevant comment is that both invalid spin configurations and alignments violating sequence order were present during the sampling, but their counts were far too small to include among the results in fig. 12. The constructed Hamiltonian has thus been shown to encode the MSA problem correctly for a collection of trivial test cases, as sampled by a simulated annealer.

[Figure 12, five bar plots of sample counts and alignment energies:
(a) Alignment results for sequences {AT, T}.
(b) The five most occurring alignments for sequences {ATGC, GC}.
(c) The five most occurring alignments for sequences {ATGC, CG}.
(d) The five most occurring alignments for sequences {ATGC, T, G}.
(e) The five most occurring alignments for sequences {ATGC, T, GC}.]

Figure 12: Alignment results as gathered by simulated annealing with the constructed Hamiltonian, measuring 1000 samples. Blue bars (left) show the number of obtained samples corresponding to the given alignment, and red bars (right) show the corresponding Hamiltonian energy, i.e. the alignment score, favoring lower values.

Solving larger problems is not attempted, as the purpose of these tests was to demonstrate the correctness of the Hamiltonian. The aim of this construction is not to solve complicated alignment problems or to push the limits of combinatorial optimization using Ising Hamiltonians; it is merely to showcase one such construction as a first step for future research on this topic. However, the question remains as to how well the Hamiltonian formulation functions on real quantum hardware.
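For reference, sampling a QUBO of this kind with the Ocean software amounts to a few lines. The sketch below assumes the dwave-neal package and the illustrative build_qubo helper from the earlier sketch; the thesis' actual script is given in section 10.3.

# Sample the MSA QUBO with a classical simulated annealer from Ocean.
import neal

sampler = neal.SimulatedAnnealingSampler()
sampleset = sampler.sample_qubo(build_qubo(["AT", "T"]), num_reads=1000)

# Aggregate identical samples and list counts with their energies (cf. fig. 12).
for datum in sampleset.aggregate().data(["sample", "energy", "num_occurrences"]):
    print(datum.energy, datum.num_occurrences)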

5.4 D-Wave Annealing Results In this section, the results of running the MSA Ising formulation on a D-Wave quantum annealer are showcased. In these tests, the same scoring parameters as in table 3 and the choice of coefficients (5.19) are used in constructing the Hamiltonians. However, instead of supplying the QUBO-formulated problem to a simulated annealing sampler in the Ocean software, it is given to one of D-Wave's quantum annealing processors, the DW2000Q system. As briefly mentioned in section 3, a D-Wave quantum annealer is built to support quadratic quantum interactions across a pre-constructed network of spins, called the chimera graph, and as such, the network of interactions of the specific problem needs to be embedded onto this graph. For these test cases, this was not attempted by hand. Instead, the automatic, although heuristic, embedding software provided in the Ocean SDK package was used. The code is shown in section 10.3, being the same script as in the previous simulations except for the change of an execution parameter. In fig. 13, results of sampling from the D-Wave machine are shown for various sequences. Starting with the sequences {AT, T}, in fig. 13a, the samples are shown to be centered at the optimal alignment, as expected. There are, however, some invalid and order-violating terms being measured which were not present in simulations, although in negligible numbers. After the most trivial sequences were tested, a test of the sequences {ATC, T} was performed. The result of this test is not shown in any plot, as the quantum annealer failed to sample even a single valid configuration among the 1000 samples taken. Therefore, accepting two alignment columns to be the limit for the formulation on a real machine, attention was instead turned to testing the limit in the number of sequences. Figures 13b to 13d together show alignment results for an increasing number of sequences. Figures 13b and 13c display excellent results, both having a minimal number of suboptimal results and having no or a tiny amount of order violation. In fig. 13d, however, the limit is reached. While it obtains an adequate number of counts for the optimal alignment, it is outvoted by an invalid spin configuration. It should be noted that, as there are several possible invalid configurations, the "invalid" category is calculated as the maximum among all such results, and there could therefore be more counts in this category altogether than what is shown in fig. 13d. Investigating the reasons for failure in detail, in the long-sequence case and in fig. 13d, falls outside the scope of this thesis. However, some obvious candidate causes can be mentioned: low interaction coupling in the annealing hardware, and unsatisfactory embedding onto the chimera graph. The former stems from the hardware limitations of the quantum annealer, which restrict the range of coupling constants that can be encoded. When a problem requires interactions beyond this limit, they are scaled down to fit the limited range[25]. In doing so, the finite hardware precision can cause weaker interactions to be neglected or even reversed, especially if the problem requires weak, precise and strong interactions simultaneously. As for the latter cause, the interactions used in the final embedding may differ from what is given as input, not only due to the previously mentioned scaling, but also because of how the interactions are divided onto the chimera graph.
The difficulty of finding an embedding also increases with the number of spins to couple. Table 4 summarizes the necessary range of interactions to encode the

Test Sequences min |J| max |J| Spins 1 AT, T 1.0 10.0 6 2 ATC, T 1.0 8.0 12 3 AT, T, A 1.0 14.0 8 4 AT, T, T, A 1.0 18.0 10 5 AT, T, A, T, A 1.0 22.0 12

Table 4: Minimum and maximum interaction coefficients used in the MSA Ising formulation forthe sequences tested on a D-Wave annealer. alignment problems that have been tested, together with the total spin usage in encoding the alignments.

24 6 400 4

4 500 200 2 Counts

Counts 2

0 0 Energy/Cost Energy/Cost 0 0 −2 AT AT AT AT TA AT AT TA invalid TA −T T− −T T− T− −T T− T− −T A− A− −A −A −A Alignments Alignments (a) The five most occurring alignments of {AT, T}. (b) The five most occurring alignments of {AT, T, A}.

2 500 100 5

0 0 0 0

−500 Counts

Counts −2 Energy/Cost

−1,000 Energy/Cost −100 −5 −4 AT AT AT AT AT AT AT AT TA invalid −T T− −T −T −T −T −T −T T− −T −T T− −T T− A− −A A− −A A− A− A− −A −A −T −T −T T− A− A− −A −A Alignments Alignments (c) The five most occurring alignments of {AT, T, A}. (d) The five most occurring alignments of {AT, T,A, T, A}.

Figure 13: Alignment results as sampled from a D-Wave quantum annealer, using 1000 samples. Blue bar (left) shows the amount of samples corresponding to the given alignment, and red (right) shows corresponding Hamiltonian energy, i.e. alignment score favoring lower values.

25 What can be observed from the failed test cases 2 and 5 is that too large a range in coupling strength is unlikely to be the cause as test 2, which failed completely in finding a solution, had the smallest range of all. Instead, what both failures seem to have in common is the spin usage, which is simultaneously the same and the largest among all tests. This indicates that an inefficient chimera graph embedding could be the reason for failure, however it is difficult to draw a conclusion with the relatively few tests performed.

6 Quantum Associative Memories

Quantum Associative Memories (QuAMs) are quantum memory structures developed under the umbrella of neural networks. As Grover search and other quantum computational ideas began to emerge, the computational learning community tried to develop quantum structures making use of the storage and parallelism capabilities that quantum mechanics seemed to offer. The particular structure discussed here is an associative memory, or otherwise called content-addressable memory in the sense that an element of the memory or database is returned based on its similarity to a query, along with any possibly related information (illustrated by fig. 14). Therefore, the memory could be used for several pattern-matching purposes, making use of the unique interference of quantum mechanics to find matches in the memory. The

110000 Query 101100 Recall 011010 01???? 011010 001001

110000 Query 101100 Recall 010111 011010 011010 001001

Figure 14: Examples of associative memory recall. Query elements can be accessed based on partial information (above), or based on similarity to query pattern (below). quantum associative memories which shall be reviewed here have also been studied for their application in track pattern matching in particle detectors[48] and bioinformatic purposes[33], although focusing on the genetic assembly problem in sequencing.

6.1 Grover-Based Quantum Associative Memory 6.1.1 Ventura Model The QuAM data structure was first proposed by Ventura and Martinez[49], who saw the potential of Grover search as a retrieval mechanism for a completing memory. The first part of the procedure is construction of the quantum memory. For this, Ventura and Martinez[50] proposed an algorithm capable of creating a uniform superposition state of the memory states, using a number of operations linear in the number of memory entries. This algorithm was later reused, generalized and simplified by following works[49, 51]. A description of the algorithm as written in [51] is shown in algorithm 4, but a simple demonstration will be given, with explanation of each operation. To start with, three main registers are allocated for use; the input register, to hold the query state, the memory register to hold the memory states and the auxiliary register to be used in computations. The { }n quantum state for storing n-bit patterns, with input qubits pi i=1, two auxiliary qubits u1 and u2, and { }n memory qubits mi i=1, is structured as

|ψ⟩ = |p1, . . . , pn; u1, u2; m1, . . . , mn⟩ . (6.1)

26 ⟩ ⟩ 1 1 1 If one is aiming to store the first of N patterns p = p1, . . . , pn into the memory register, it is assumed the qubits are initially put into the state ⟩ ⟩ 1 1 1 ψ0 = p1, . . . , pn; 01; 01,..., 0n , (6.2) by trivial initialization operators on the input and auxiliary register. To ease the understanding of this procedure, the reader should be aware that the auxiliary bit u2 acts as a flag for stored u2 = 0 and processing states u2 = 1. First, the state is acted on by a series of 2-bit Toffoli gates to copy the pattern p1 from the input in the empty memory register, i.e. ⟩ ∏n ⟩ ⟩ 1 1 1 1 1 1 ψ1 = TOFFOLIpj u2,mj ψ0 = p1, . . . , pn; 01; p1, . . . , pn , (6.3) j=1 given that u2 = 1 also. Thereafter, the operation ⟩ ∏n ⟩ 1 1 ψ2 = Xmj CNOTpj mj ψ1 , (6.4) j=1 is applied, which has the effect of giving all bits of a state in the memory register thevalue 1, if that state was identical to the input state. This is inevitably what happens for our state, after applying (6.3) for u2 = 1. Then, the first auxiliary bit is flipped for the all-one memory state ⟩ ⟩ 1 1 ψ3 = TOFFOLIm1...mn,u1 ψ2 , (6.5) marking the state for further processing. After doing this, the auxiliary register is operated on by the controlled unitary ⟩ ⟩ ψ1 = CSp ψ1 , (6.6) 4 u1u2 3 with the unitary √  ℓ−1 √1 Sℓ =  ℓ √ ℓ  . (6.7) −√1 ℓ−1 ℓ ℓ | ⟩ | ⟩ At this point, the√ auxiliary bit state u1u2 = 11 , after application of (6.6), will split into two parts − | ⟩ √1 | ⟩ p k | ⟩ u1u2 = p 10 + p 11 , one which will store the processed pattern (u2 = 0), and another on which to continue input processing (u2 = 1). Finally, doing the inverse of operations (6.3), (6.4) and (6.5), one obtains √ ⟩ 1 ⟩ p − 1 ⟩ ψ1 = √ p1; 00; p1 + p1; 01; 0 ,..., 0 , (6.8) p p 1 n The input register can then be reset with a new pattern and the process repeated from (6.3) to (6.8) for each pattern. Doing so, although modifying (6.6) for the current (kth) pattern as ⟩ ⟩ ψk = CSp+1−k ψk , (6.9) 4 u1u2 3 it can be seen that the final result is ( ) ⟩ ⟩ 1 ∑N ⟩ ⟩ ψN = pN ; 00 ⊗ √ pi = pN ; 00 ⊗ |M⟩ , (6.10) N i=1 where the memory register state is in an equal superposition of the patterns, as desired. Having constructed the memory state, summarized in algorithm 4, the next step is retrieval of the matching memories. The mechanism for memory retrieval uses the potential of Grover search to apply its search algorithm only to specific parts of binary strings in a database. The algorithm would then ideally find asubstring in the database that matches the query, and by having stored the entire strings in quantum memory, the rest of the string would follow through quantum entanglement of the involved registers. To give a detailed description, focusing on the memory retrieval procedure, a substring oracle Iτ is constructed which inverts the phase of a chosen state τ, which is given by the query and acts on a chosen subset of register qubits. As an example, suppose there are five register qubits, {q1, q2, q3, q4, q5}, and the query could be given as

27 Algorithm 4 QuAM Memory construction

1: procedure Memory construction(p1, p2, . . . , pN ) 2: Initialize registers |ψ⟩ = |p1, . . . , pn; u1, u2; m1, . . . , mn⟩ → |01,..., 0n; 0, 1; 01,..., 0n⟩ 3: for every pattern p k do ⟩ | ⟩ → k k 4: p1, . . . , pn p1 , . . . , pn ▷ Load pattern into input register 5: for every qubit j do | ⟩ 6: TOFFOLIpj u2,mj ψ ▷ Duplicate pattern into memory | ⟩ 7: Xmj CNOTpj mj ψ ▷ Transform memory state into bit difference state 8: end for | ⟩ 9: TOFFOLIm1...mn,u1 ψ ▷ Flip work bit for term to add to memory − 10: CSN+1 k |ψ⟩ ▷ Put the pattern in memory storage u1u2 | ⟩ 11: TOFFOLIm1...mn,u1 ψ ▷ Restore work bit 12: for every bit j do | ⟩ 13: CNOTpj mj Xmj ψ ▷ Restore memory state to normal form | ⟩ 14: TOFFOLIpj u2,mj ψ ▷ “uncopy” pattern from memory 15: end for 16: end for 17: return |M⟩ = |m1, . . . , mn⟩ 18: end procedure

|τ⟩ = |?0?11⟩, with wildcard characters on position 1 and 3, meaning the search is not performed on qubits q1 and q3. That implies that Iτ would be constructed as { − |q1, q2, q3, q4, q5⟩ , if |q2, q4, q5⟩ = |011⟩ Iτ |q1, q2, q3, q4, q5⟩ = . (6.11) |q1, q2, q3, q4, q5⟩ , otherwise

In the Grover search, the standard Grover operator, as presented in section 3,

G = |s⟩⟨s| − I, (6.12)

∑ n with the uniform superposition |s⟩ = √1 2 |i⟩, is used to evolve the states after marking. These two 2n i=1 operators are applied to a superposition consisting of the states representing the bit strings in memory. Thus far, the procedure is not much different from that of a Grover search on an arbitrary initial state, which has been shown to have a limited success probability[21]. What distinguishes the presented QuAM from a normal Grover search, is how the problem of limited success probability is addressed. An extra phase rotation IP = I − 2PM , (6.13) ∑ | ⟩⟨ | using the projector onto the memory states PM = m∈M m m , replaces the oracle at the second Grover iteration. This phase rotation inverts the phase of all states in memory, without concern for if it is marked or not and this supposedly reinforces the retrieval of marked memory states by making the non-marked non-memory (spurious) states have the same amplitude sign as the non-marked memory states. After the application of this memory-based phase inversion, the Grover search is performed as usual for a determined number of iterations T . In total, the Ventura model retrieval operator is ( ) ∏T R = GIτ GIP GIτ . (6.14) i=1 For this method, it is shown that the proposed associative memory has a maximal probability of retrieving a marked pattern

2 2 Pmax = 1 − (N − p − r0) ℓ0 − ℓ − (p − r0) ℓ1 − ℓ , (6.15) where N is the total number of basis states, p the number of stored states, r0 and r1 being the number of marked states in memory and outside of memory, respectively[49]. l0 and l1 are the total amplitude of the unmarked spurious and non-spurious states, respectively, evaluated after line 9 in algorithm 5. The

28 Algorithm 5 Ventura model retrieval 1: procedure Retrieval(τ) 2: Construct phase inverter for partial search Iτ 3: Construct Grover operator G = |s⟩⟨s| − I ∑ − | ⟩⟨ | 4: Construct memory phase inverter∑IM = I 2 m∈M m m 5: Initialize state |ψ⟩ → |M⟩ = √1 |m⟩ N m∈M 6: Iτ |ψ⟩ ▷ Invert phase for matching non-wildcard characters 7: G |ψ⟩ 8: IM |ψ⟩ ▷ Invert phase for all memory states 9: G |ψ⟩ 10: for i from 1 to T do ▷ Grover iteration phase 11: GIτ |ψ⟩ 12: end for 13: Measure |ψ⟩ in computational basis 14: return |ψ⟩ 15: end procedure barred quantity, ℓ is correspondingly the average amplitude of unmarked spurious states in the initial superposition. The authors also show that to get as close to this maximum probability as possible, one should iterate the Grover search according to  ( √ ) π − k r0+r1 2 arctan N−r −r T = ROUND  ( ℓ 0 ) 1  , (6.16) − r0+r1 arccos 1 2 N where k, in correspondence to ℓ, is the average amplitude of marked states after applying the initial procedure of the search, and ROUND is the rounding function to closest integer. An additional remark is that one can determine k and ℓ as ( ) 8(p − 2r ) p + r r k = 1 1 − 0 + 1 and (6.17) N N r + r ( 0 1 ) 4(p − 2r ) N + p − r − 2r 2(p + r ) p − r ℓ = 1 0 1 − 0 − 1 , (6.18) N N − r0 − r1 N N − r0 − r1 and as such, information of the amount of marked states in and outside of memory (r0 and r1) is all the extra information needed to maximize the retrieval success probability.

6.1.2 Distributed Queries The work of Ventura and Martinez has been developed further by Ezhov, Nifanova and Ventura[52], who aimed at making the QuAM structure capable of fuzzy matching, in addition to its pattern completing properties. Their approach mainly consists of extending the concept of the marking oracle Iτ , allowing it to not only invert the phase of a single state, but provide arbitrary phase change to states according to a chosen matching priority. A step by step comparison with the method of Ventura and Martinez starts with the construction of the memory state. While the extension in this work allows for the same type of memory state structure as in section 6.1.1, the authors propose constructing the memory with an exclusion rule, initializing the memory state as ⟩ 1 ∑ M/ = √ |m⟩ , (6.19) 2n − N m̸∈M i.e. the complement of the actual memory. This has been shown to not only decrease the probability of irrelevant retrieval[53], but also reduce the amount of information required to calculate the amount of search iterations, which is why it will be the only choice considered henceforth. When it comes to the search procedure, the article introduces new types of phase∑ change and diffusion | ⟩ | ⟩ operators, OQ and D. The phase change is defined through a query state Q = i bi i as

OQ = I − 2 |Q⟩⟨Q| , (6.20)

29 such that it inverts the phase about the given state. The query form proposed by Ezhov was a binomial distribution on the Hamming distance between patterns, i.e. √ d (q,p ) n−d (q,p ) bi = a H i (1 − a) H i , (6.21)

≤ 1 for a query center q with n bits and parameter 0 < a 2 . The diffusion operator is chosen as ⟩⟨ D = 2 M/ M/ − I, (6.22) which in relation to the Grover operator G that inverts around the average, instead inverts the amplitudes around the overlap to the exclusive memory state. With these operator choices, the retrieval process consists of a standard Grover evolution procedure, with the total retrieval operator defined as

∏Λ R = DOQ, (6.23) τ=1 as summarized in algorithm 6.

Algorithm 6 Memory retrieval with distributed query 1: procedure Retrieval ⟩

2: Construct exclusive memory M/ 3: Construct query operator OQ = I − 2 |Q⟩⟨⟩⟨Q |

4: Construct diffusion operator D =⟩ 2 M/ M/ − I

5: Initialize search state |ψ⟩ → M/ 6: for i from 1 to Λ do 7: DOQ |ψ⟩ 8: end for 9: Measure |ψ⟩ in computational basis 10: return |ψ⟩ 11: end procedure

The contribution of the distributed query is demonstrated∑ in the analytic derivation of state amplitudes. | ⟩ | ⟩ It is shown that state amplitudes in the search state ϕ = i ai i , can be written on the form

ai(τ) = Ai cos(ωτ + δi), (6.24) as a function of the number of iterations τ. The angular frequency of amplitude oscillation ω is decided by the overlap between memory and query state, as (⟨ ⟩) ω = 2 arcsin Q M/ . (6.25)

Meanwhile the maximum amplitudes, ⟩ Ai, and phase shifts, ⟩ δi, are given by the details of the distributed query state, memory state M/ and initial conditions ψ0 . More specifically, they are given as √ ( )2 (b − m sin(ω/2))2 + mi sin ω i i 2⟨Q|M/ ⟩ Ai = B and (6.26)  cos(ω/2)     −  − ω √ bi mi sin(ω/2)  δi = ϕ + arccos ( ) , (6.27) 2  2  (b − m sin(ω/2))2 + mi sin ω i i 2⟨Q|M/ ⟩ with query amplitudes, bi, and memory amplitudes, mi, where B and ϕ are parameters to be found from the initial state before iterations. It can be noted that the amplitudes are non-zero even for states not in memory, meaning the retrieval of spurious states is a possibility. Also note how having no overlap

30 ⟨ ⟩ between query and the exclusive memory Q M/ = 0 causes the expression to become ill-defined, and in fact makes the retrieval process (6.23) into an identity operation, and thus completely meaningless for retrieval. Therefore, this method requires a certain amount of mismatch between query and memory, and in fact, it is what determines the speed of retrieval, due to expression (6.25). Although the more significant quality is observed for the states which are in memory, i.e. mi = 0. For those states, the quantities (6.26) and (6.27) simplify to b A = B i , (6.28) i cos(ω/2) ω δ = ϕ − , (6.29) i 2 showing that maximal amplitudes become proportional to the associated query amplitudes. This allows the query to become a description of the retrieval priority, as query components with low or zero amplitude will retain low or zero amplitude in memory retrieval. As for the number of iterations, Λ, it is clear from the form of the solution (6.24) and (6.25), that prior knowledge on the similarity between memory and query is required to know this value exactly. However, the authors argue that given that the number of patterns in memory is small compared to the number of all possible patterns, 2n with n as the bit string length, one can use the approximation ( ) (⟨ ⟩) ∑2n 1 ω = 2 arcsin Q M/ ≈ 2 arcsin √ bi . (6.30) n 2 i=1 Assuming that the angular oscillation frequency is known, then the fact that the amplitudes of states in memory would be on the form ai = Ai sin(ωτ), (6.31) because of their initial condition of ai(τ = 0) = 0 for the choice of exclusive memory, makes it certain that the maximum probability of memory retrieval is obtained at [ ( )] 2π 1 Λ = ROUND + α (6.32) ω 4 where α is an arbitrary integer to be chosen such that Λ is as close to an integer as possible.

6.1.3 Further improvement Further improvements upon the distributed query model were proposed by Njafa, Nana and Woafo[54]. They modified the retrieval procedure by applying an additional operator IM , similarly to what was originally done in section 6.1.1. With algorithm 7 showing the proposed modified procedure, the authors let IM be either: 1. A memory-based state inversion similar to that in section 6.1.2,

IM = I − 2PM , (6.33) ∑ where PM = |m⟩⟨m| is the projector onto the memory state subspace. m∈M

2. A query-based phase shift, different from that of OQ. Although in their paper, they limited themselves to a simple change in parameters in the query proposed by Ezhov, Nifanova and Ventura, (6.21). While the second option was shown through numerical simulations to reduce the number of iterations to reach maximum correct retrieval probability, the first option appeared to significantly improve this probability. The state evolution during the iterative phase of the procedure is still the same as in section 6.1.2, and thus the expressions (6.26) and (6.27) should still hold. But in contrast to initially having the exclusive memory state, as in the Ezhov model, if one applies the procedure of algorithm 7 up until the iterative phase from line 11, one obtains the ”pre-iterative” initial state ⟩ ⟩ ⟨ ⟩ 0 ψ = M/ + 2 Q M/ (2PM − I) |Q⟩ . (6.34)

31 Algorithm 7 Improved retrieval with distributed query 1: procedure Retrieval ⟩

2: Construct exclusive memory M/ 3: Construct query operator OQ = I − 2 |Q⟩⟨⟩⟨Q |

4: Construct diffusion operator D = 2 M/ M/ − I 5: Construct additional operator IM⟩

6: Initialize search state |ψ⟩ → M/ 7: Set |ψ⟩ = OQ |ψ⟩ 8: Set |ψ⟩ = D |ψ⟩ 9: Set |ψ⟩ = IM |ψ⟩ 10: Set |ψ⟩ = D |ψ⟩ 11: for i from 1 to Λ-1 do 12: Set |ψ⟩ = DOQ |ψ⟩ 13: end for 14: Measure |ψ⟩ in computational basis 15: return |ψ⟩ 16: end procedure

One thing to note, is that the state is now dependent on the projection of |Q⟩ onto the memory subspace. Because of this, the amplitude of memory states in this initial state is non-zero. While this apparently improves the probability of successful retrieval, the iteration estimation done in the Ezhov model is no longer applicable. Furthermore, the authors do not discuss the issue of knowing the number of iterations, apart from stating that they use a choice of [ ( )] 2π 1 Λ = ROUND + α − 1, (6.35) ω 4 with the same quantities as defined in section 6.1.2. Although, following the analysis done by Ezhov et al. [52] one can use (6.34) to find the exact analytic expressions of the memory states. As an exclusive memory is used, the simplified expressions (6.28) and (6.29) hold. To determine the iterations, one mainly needs the initial phase ϕ, which according to [52] is determined by the initial state |ψ0⟩ before iteration, by ( ⟨ ⟩ )

1 1 ψ M/ ⟨ ⟩ ϕ = arctan − − 2 0 Q M/ , (6.36) sin ω tan ω ⟨ψ0|Q⟩ which with the initial state (6.34) can be written ( ⟨ ⟩ ) 2 1 1 2 − 4 Q M/ ϕ = arctan − − . (6.37) sin ω tan ω 4 ⟨Q|PM |Q⟩ − 1

If this quantity is known, then the optimal number of iterations can be determined from (6.24) as [ ] απ − ϕ 1 Λ = ROUND + , (6.38) opt ω 2 with an arbitrary positive integer α.

6.1.4 Efficiency Considerations Memory construction for the initial model would require O(Nn) simple gates for N patterns of bit length n, while using 2(n + 1) bits for the registers, not counting any eventual extra work bits for efficient implementation. But as the newly developed methods make use of an exclusive memory, one could question the efficiency of the initially proposed algorithm due to the fact that one needs toprocess 2n − N states instead, making the worst case gate usage O(n2n). Although, research has been done on techniques of state deletion[55, 56, 57], where a number of orthogonal states |τ1⟩ ,..., |τm⟩ are deleted from an arbitrary

32 superposition state |γ⟩. It could prove more efficient to fill the memory completely and delete undesired states, than to fill the memory one state at a time. As described in[56], the method is Grover search inspired, with iteration operator −1 S = −UI0U Im, (6.39) where U is the state construction operator, |γ⟩ = U |01,..., 0n⟩, and I0 and Im are operators changing the phase of the all-zero state and of states not to delete, respectively. (6.39) is then applied for a certain number of iterations which depends on the number of states being deleted and the specific change in ≤ 3 · n phase caused by I0 and Im. However, as long as N 4 2 , then the paper shows that the states could be deleted in just one application of S, by using phase matching techniques[58] developed for improving Grover search. Using this method, one could start from the uniform superposition state, which is easily constructed as ∏n |s⟩ = Hi |01,..., 0n⟩ , (6.40) i=1 where Hi acts on the ith qubit, meaning the U in (6.39) consists of only n Hadamard gates. As the I0 operator could be made through a, somewhat advanced but efficiently implementable, n-bit controlled unitary gate, the main difficulty in using this method lies inthe Im operator. But if that operator could be applied efficiently or even sub-exponentially in the number of bits, it could improve memory construction efficiency for many practical cases, while also keeping the bit usage down to those needed for storingthe n bit memory states (not including eventual extra bits for computations). The time efficiency of state retrieval, is decided mainly by how efficiently the involved operators canbe implemented, and how many iterations have to be performed for optimal retrieval probability. Construction of arbitrary operators is a general issue for most quantum algorithms, and the retrieval algorithms for the associated memories are no exceptions. The operations in the original version by Ventura and Martinez do not deviate from Grover search in any significant regard and as such is not as challenging to construct apart from the marking oracle. However, it is a different story for the methods based on distributed queries and onward. These methods include generalizations of the Grover search operations, allowing more arbitrary phase change during the evolution. Operators allowing such evolution are more complex, and while they are explicitly known in their form, decomposition could result in highly inefficient circuits. An alternative technique which could be applicable is given by a trick for exponentiation used in quantum tomography techniques[59]. Using k copies of a state |ψ⟩, one can approximate the application of exp(−i |ψ⟩⟨ψ|) on a state |ϕ⟩ by repeated application of ( ) − i S trψ e k |ψ⟩ ⊗ |ϕ⟩ (6.41) where S is the SWAP gate acting on the |ψ⟩ and |ϕ⟩ states, of which exponentiation can be done efficiently. Creating appropriate states |ψ⟩ one could implement arbitrary phase changes based on the eigenvalues of |ψ⟩⟨ψ|, but k would none-the-less scale as k ∼ 1/ϵ with ϵ as the error and put more demand on the efficiency of state preparation. As has been seen in sections 6.1.2 and 6.1.3, the number of iterations needed for retrieval is dependent on information about the overall match between the given query and the memory. When the query consists of a single basis state, then strategies suggested for Grover search with multiple solutions[22] could be attempted to obtain this before retrieval. 
However, in the methods adapted for fuzzy matching, it is more difficult, as the entire overlap between query and memory is needed. While there are methodsfor estimating the overlap between states, such as the swap test[60] used in quantum machine learning, and quantum amplitude estimation[61], they tend to be either purely statistical and require many copies of the states, or make use of oracles and other hard-to-construct black box operators. The approximation (6.30) proposed by Ezhov, Nifanova and Ventura is therefore an important factor in the practical usage of these methods, as only a priori information would be used. Following the approximation, the number of iterations would be   ( ) ∈ O 1 O  ( 1 ) T = ∑ n , (6.42) ω arcsin √1 2 b 2n i=1 i from (6.32), making the query distribution the deciding factor. However, as it is formed as a quantum

state, and therefore normalized as ∑_i b_i² = 1, the amplitude sum is bounded such that

T ≤ O( 1 / arcsin(1/√(2^n)) ) ≈ O(√(2^n)) , (6.43)

in the large-n case. Still, it is not certain that the proposed approximation will be satisfactory for practical applications. Regrettably, the improved method of section 6.1.3 requires even more knowledge of the memory and query similarity, as is understood from the non-trivial overlap dependence in (6.34), which is not as easily approximated. While the authors propose what appears to be a heuristic approach, (6.35), for calculating the number of iterations, and reportedly have applied it with adequate results in their following works[62], it is not obvious how well it would function in practice, especially on other forms of queries than those investigated so far. When considering the case of distributed queries, there is also the issue of failure, in the sense of retrieving spurious states or memories with low priority. As the algorithm itself provides no way of checking whether a spurious or undesired state has been retrieved, caution is required. And while the analytic expressions in section 6.1.2 can describe the theoretical amplitudes of states, it is difficult to analyze how the amplitudes of spurious states will behave when implementing approximations and heuristics, or when using other kinds of queries than those first suggested. One would ideally want a query prioritizing only the desired kind of states, using a query with no spurious overlap, ⟨Q|M̸⟩ = 0, but this causes the retrieval to fail completely, with no state change taking place, thus forcing the query to be designed to have overlap with spurious states. This problem is at the core of the proposed methods, and for any query used, the efficiency of the approximations and heuristics would need to be evaluated.

6.2 Trugenberger Quantum Associative Memory During the development of associative memories such as the Ventura model, Trugenberger proposed an alternative retrieval mechanism[51, 63]. In his approach, the application of unitary operators designed according to a certain metric creates a superposition state with amplitudes corresponding to the similarity between each stored state and the query, according to the given metric. In his works, a Hamming distance metric is used as an example, but the method could be extended to include other metrics as well. For constructing the quantum memory state, the method presented in section 6.1.1 for the Ventura model is used. Having constructed the memory state, the retrieval process again begins by using three registers. The first two registers are the same as in the memory construction, an input register |p_1, . . . , p_n⟩ and the memory register |m_1, . . . , m_n⟩, but lastly there is a register with auxiliary bits {b_i}_{i=1}^B used in computations. This register can include the previous auxiliary bits to save on qubit usage. For simplicity of demonstration, it is initially assumed that B = 1. The retrieval begins by loading the query pattern, Q = q_1, . . . , q_n, into the input register, with the auxiliary bit in the zero state, i.e.

|ψ_0⟩ = |q_1, . . . , q_n⟩ |M⟩ |0⟩ . (6.44)

The memory state is then prepared such that the difference between memory and query is easily retrievable. This is done through the same method as in memory construction, (6.4), i.e.

|ψ_1⟩ = ∏_{j=1}^n X_{m_j} CNOT_{p_j m_j} |ψ_0⟩ . (6.45)

The result of this is, as before, to make the memory state bits m_j^k for a stored kth pattern become |1⟩ if, for the query state, |q_j⟩ = |m_j^k⟩, and |0⟩ otherwise. Thus, the memory register will now encode the "difference flags"

d_j^k = { 1 if |q_j⟩ = |m_j^k⟩ ;  0 otherwise } , (6.46)

for the various memories and the query pattern, in terms of the bit values. Then, a superposition state in one of the auxiliary bits is made by a Hadamard gate,

|ψ_2⟩ = H_{b_1} |ψ_1⟩ = (1/√(2p)) ∑_{k=1}^p |Q⟩ |d_1^k, . . . , d_n^k⟩ |0⟩ + (1/√(2p)) ∑_{k=1}^p |Q⟩ |d_1^k, . . . , d_n^k⟩ |1⟩ . (6.47)

Once this stage is reached, the metric unitary H is applied, which is of the form

H = exp( i (π/(2D_max)) D_m ⊗ Z_{b_1} ) , (6.48)

where D_m is a metric operator applied on the memory register and D_max is its maximum eigenvalue. This is the central operation of the algorithm, which determines the desired amplitude distribution. In the proposed version of this structure, the Hamming distance measure d_H(q, p) was assumed, i.e.

D_m = ∑_{k=1}^n (Z_k + 1)/2 , (6.49)

with D_max = n, such that, acting on a memory state in its "difference form", it counts the number of mismatched positions in the state,

D_m |d_1^k, . . . , d_n^k⟩ = d_H(Q, p^k) |d_1^k, . . . , d_n^k⟩ = ( ∑_{j=1}^n (1 − d_j^k) ) |d_1^k, . . . , d_n^k⟩ . (6.50)

A decomposition of H would be needed for actual implementation, and for the Hamming distance metric this is given by Trugenberger as

H = ( ∏_{i=1}^n CU⁻²_{c m_i} ) ( ∏_{i=1}^n U_{m_i} ) , for U = ( e^{iπ/2n}  0 ;  0  1 ) . (6.51)

By applying the unitary (6.48) to the obtained state, one gets

|ψ_3⟩ = (1/√(2p)) ∑_{k=1}^p exp( i (π/2n) d_H(Q, p^k) ) |Q⟩ |d_1^k, . . . , d_n^k⟩ |0⟩
      + (1/√(2p)) ∑_{k=1}^p exp( −i (π/2n) d_H(Q, p^k) ) |Q⟩ |d_1^k, . . . , d_n^k⟩ |1⟩ . (6.52)

Now, applying the inverse operations of (6.45) and (6.47), the memory register is restored to the memory state, and a new superposition is made by the Hadamard operation on the auxiliary bit, namely

|ψ_5⟩ = (1/√p) ∑_{k=1}^p cos( (π/2n) d_H(Q, p^k) ) |Q⟩ |p^k⟩ |0⟩ + (i/√p) ∑_{k=1}^p sin( (π/2n) d_H(Q, p^k) ) |Q⟩ |p^k⟩ |1⟩ . (6.53)

At this point it is clear that Trugenberger has managed to create a desirable state in one term of the superposition, the cosine term. This term exhibits a maximum for d_H = 0 and monotonically decreases as d_H increases, reaching zero at d_H = n. It is possible to obtain this state by a postselection on the auxiliary bit, expecting the |0⟩ state. However, obtaining the other term is undesirable, as it has exactly the opposite properties to the cosine state. The probabilities of obtaining the positive or negative result are

P(|b_1⟩ = |0⟩) = (1/p) ∑_{k=1}^p cos²( (π/2n) d_H(Q, p^k) ) , (6.54)

P(|b_1⟩ = |1⟩) = (1/p) ∑_{k=1}^p sin²( (π/2n) d_H(Q, p^k) ) , (6.55)

which are directly affected by the overall match between query and database. If there are many memory states which are relatively close to the query state by the defined metric, then the probability will be biased toward (6.54), and vice versa for (6.55). In later contributions, Trugenberger and Diamantini[63] extend the procedure by allowing an arbitrary number of auxiliary bits, B, and repeating processes (6.45) to (6.53) for every such bit. The resulting state is then obtained as

|ψ⟩ = (1/√p) ∑_{k=1}^p ∑_{ℓ=0}^B i^ℓ cos^{B−ℓ}( (π/2n) d_H(Q, p^k) ) sin^ℓ( (π/2n) d_H(Q, p^k) ) |Q⟩ |p^k⟩ ∑_{{J^ℓ}} |J^ℓ⟩ , (6.56)

where {J^ℓ} stands for all possible permutations of the B auxiliary bits with exactly ℓ bits equal to one. As the optimal distribution resulting from measurement is still obtained for the result |J^ℓ⟩ = |0_1, \ldots, 0_B⟩, the recognition probability, i.e. the postselection success probability, is only changed to

P_{\mathrm{rec}} = \frac{1}{p} \sum_{k=1}^{p} \cos^{2B}\left( \frac{\pi}{2n} d_H(Q, p^k) \right) ,   (6.57)

which is strictly lower than (6.54) for B > 1, and decreasing with increasing B, for any memory non-identical to the query, due to the narrowing of the distribution toward d_H = 0. However, the resulting distribution if the postselection succeeds also changes in the same way: the postselected state |\psi_{\mathrm{post}}\rangle yields each stored pattern with probability

P(p^k) = \frac{1}{p P_{\mathrm{rec}}} \cos^{2B}\left( \frac{\pi}{2n} d_H(Q, p^k) \right) ,   (6.58)

becoming more narrowly peaked on existing states closer to the query with increasing B. As such, the number of auxiliary bits can be used as a parameter specifying a range for relevant state retrieval, in terms of the introduced metric.
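As a quick numerical illustration of (6.57) and (6.58), the following minimal Python sketch (written for this text, not part of the thesis code) evaluates the recognition probability and the postselected retrieval distribution for a toy memory of bit strings:

import numpy as np

def d_H(a, b):
    # Hamming distance between two equal-length bit strings
    return sum(x != y for x, y in zip(a, b))

def recognition_probability(query, memories, B=1):
    # P_rec of (6.57): average of cos^(2B)(pi*d_H/(2n)) over the p stored patterns
    n, p = len(query), len(memories)
    return sum(np.cos(np.pi*d_H(query, m)/(2*n))**(2*B) for m in memories)/p

def postselected_distribution(query, memories, B=1):
    # Retrieval probabilities (6.58) over the stored patterns, given success
    n, p = len(query), len(memories)
    P_rec = recognition_probability(query, memories, B)
    return [np.cos(np.pi*d_H(query, m)/(2*n))**(2*B)/(p*P_rec) for m in memories]

memories = ["0000", "0011", "0111"]
for B in (1, 2, 4):
    print(B, recognition_probability("0001", memories, B),
          postselected_distribution("0001", memories, B))

Increasing B visibly concentrates the postselected distribution on the stored pattern closest to the query, while the recognition probability shrinks.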

6.2.1 Efficiency Considerations

On the complexity of usage, it is clear that the memory construction is O(Nn) in the usage of simple gates, as it is the same algorithm as used in section 6.1.1. In contrast to the Grover-based memories, however, the retrieval algorithm initially proposed[51] uses merely O(n) simple gates, all with fairly small elementary-gate decompositions if used with the introduced Hamming distance metric. The extended version[63] would, however, increase in complexity to O(Bn) since the initial algorithm is repeated. The bit usage is at most 2n + max(2, B) bits considering both memory construction and retrieval, although this neglects any garbage qubits used for efficient implementation of advanced gates. However, Trugenberger and Diamantini suggest that for the retrieval, the input register could be removed in favor of applying unitary operators designed from the query instead. In doing so, the step (6.45) would be replaced by the operator

V^k = \prod_{j=1}^{n} U_j^k , \quad \text{with } U_j^k = \sin\left( \frac{\pi}{2} p_j^k \right) I + \cos\left( \frac{\pi}{2} p_j^k \right) X ,   (6.59)

reducing the bit usage of retrieval by n. Although not discussed by the authors, it is not unthinkable that a similar change could be applied to memory construction. As is explained in section 6.1.1, the input register is merely used for copying and transforming the memory state into its “difference form” in relation to the pattern being processed. The conditional gates making use of the input register could very well be replaced by operators such as those in (6.59). This would lower the overall bit usage to mainly n + B bits, greatly increasing the length of the patterns allowed in the structure. This should, however, be weighed against the additional preprocessing required in the circuit design.

While the efficiency in gate cost of retrieval is highly beneficial, the postselection could pose a problem for practical use. If the postselection fails and one wishes to complete the retrieval, one is forced to reconstruct the involved state from scratch, starting from the memory state. If this procedure is to replace or aid any kind of database retrieval, then a time complexity of O(N) is already on the verge of being inefficient, without consideration of repeated attempts. Trugenberger therefore considers the application of quantum copying (not exact cloning), of which there are approximate[64] or probabilistic[65] methods which in theory could circumvent a total reconstruction of the memory. However, approximate cloning might not be desirable as the reduced fidelity could greatly affect the retrieval, and so far, probabilistic cloning techniques have yet to demonstrate a significant success rate, with recent studies reaching at most 5%[66].

Another approach, suggested by Trugenberger and Diamantini[63], is the use of amplitude amplification[61] to amplify the probability of successful postselection, although this is only to be used as an alternative to simply repeating the procedure T times in hopes of success. The reason is that, if we denote by R_M the operator for both memory construction and retrieval state preparation, amplitude amplification would require the application of

\mathcal{Q} = -R_M S_0 (R_M)^{-1} S_B ,   (6.60)

on the finished state (6.56), where S_B changes the sign of the desired |b_1, \ldots, b_B⟩ = |0_1, \ldots, 0_B⟩ state, and S_0 that of the zero state of all registers involved. The operator \mathcal{Q} requires at least twice the number of quantum gates compared to the initial procedure R_M and would be applied several times. But on average, \mathcal{Q} would be applied T_{\mathrm{amp}} = 1/\sqrt{P_{\mathrm{rec}}} times to make successful recognition highly probable, in comparison to the case without amplification, which would on average require T = 1/P_{\mathrm{rec}} attempts. This could make amplitude amplification the preferred choice when a success is required, even if \mathcal{Q} is more expensive. However, with P_{\mathrm{rec}} determined by the overall match between query and memory, P_{\mathrm{rec}} is normally unknown, and so is the number of applications of \mathcal{Q}. The amplitude amplification is then reduced to a mere heuristic with no guarantee of success. At most, one could limit the iteration number by

P_{\mathrm{rec}} \geq \frac{N-1}{N} \sin^{2B}\left( \frac{\pi}{2n} \right) \implies T_{\mathrm{amp}} \leq \sqrt{\frac{N}{N-1}} \, \frac{1}{\sin^{B}\left( \frac{\pi}{2n} \right)} ,   (6.61)

following the derivation in [63], although this grows quite quickly with increasing n and B.

But reevaluating why the postselection could fail, the reason for a low success probability is that the query is too dissimilar to most states in the memory. If one were to prioritize, for example, a Hamming distance of up to D, and choose B accordingly, then failing more than a set number of times could be a good indication that no state with d_H ≤ D exists in memory. In contrast to the more advanced Grover-based memories, the Trugenberger model gives the advantage that the properties of the postselected state are well known if the measurement results on the auxiliary bits are known. Furthermore, the resulting state (6.58) holds no spurious memories, given that the construction and measurements are done error-free, which is not as certain for the other described models that allow fuzzy matching.

With all things considered, the structure provides a relatively fail-safe, although probabilistic, method of acquiring an amplitude distribution suitable for retrieval and prioritizing similarity to a query. One downside is the form of this distribution, which is restricted to the shown form of cosines and sines because of the way interference is created during the retrieval. Although this distribution is restricted, the structure allows some degree of shaping through the number of auxiliary bits, B, capable of setting the priority for metric values. Increasing B would weigh the distribution toward smaller distance values, but as shown by (6.57) and the bound (6.61), increasing B generally makes the retrieval success less likely or harder to achieve for all but the most ideal cases. One is of course free to choose a different metric in (6.48), but the difficulty in doing so involves finding a metric which encapsulates the priority in similarity, gives an adequate probability of recognition, and is easily decomposed.
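To get a feel for how quickly the bound (6.61) grows, a small sketch (with arbitrarily chosen n, B and N, purely for illustration):

import numpy as np

def t_amp_bound(n, B, N):
    # Upper bound (6.61) on the number of amplitude-amplification rounds
    return np.sqrt(N/(N - 1))/np.sin(np.pi/(2*n))**B

for B in (1, 2, 4, 8):
    print(B, t_amp_bound(n=62, B=B, N=1024))

Already for n = 62 and B = 4 the bound exceeds a million applications of \mathcal{Q}, illustrating the remark above.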

7 Quantum Applications in Metagenomics

As has been explained in section 2.3, the two vital parts of analysis in a metagenomics project are genetic assembly and binning. There have already been suggestions of using quantum pattern matching in assembly methods[33] to improve efficiency, but binning methods such as Kraken[18], which rely on K-mers of small size in their analysis, are also worth exploring for possible quantum improvements. As first introduced in section 2.3, the framework of Kraken is used for taxonomic binning in the analysis of environmental microbial content. It constructs a database of taxonomic IDs, with key entries being K-mers of the corresponding genetic string. As K-mer counting approaches use K-mers of extremely small length in comparison to a full sequence, e.g. K ≤ 31 in Kraken, this provides a good starting point for improvements by quantum means.

As the procedure of Kraken is based on exact matching of a query K-mer in the database, the first thing that comes to mind is whether a Grover-based search could be designed to replace this database lookup. While technically possible to implement in a similar fashion to the methods proposed in section 4 for alignment, the disadvantage of such an approach is the problem of state preparation and collapse, as with many of the quantum algorithms that have been reviewed. Without any successful form of quantum copying, the construction of the database state would have to be repeated for every search query. Even if it were possible to use a form of quantum cloning, the search structure in Kraken would be hard to outperform, as it is ordered and extremely efficient. Quantum ordered search has been shown to require fewer comparisons than classical ordered search[67, 68], but only by a constant factor, and it shares the same log₂ N scaling as all search algorithms relying on comparisons. Therefore a quantum exact search might not achieve any extraordinary speedup in comparison to classical algorithms even if adequately implemented.


Figure 15: Suggested strategy for improving metagenomic binning techniques with quantum associative memories. Each K-mer is processed through an exact classical search, but if no exact match is found, it is queried through a QuAM.
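The control flow of fig. 15 fits in a few lines. The sketch below is only schematic: exact_db and quam_query stand for a classical K-mer database lookup and a (possibly failing) quantum fuzzy retrieval, respectively; both names are hypothetical and not part of any existing tool.

def classify_kmer(kmer, exact_db, quam_query, max_attempts=10):
    # Classical exact matching first, as in Kraken
    tax_id = exact_db.get(kmer)
    if tax_id is not None:
        return tax_id
    # Fall back to the quantum associative memory; retrieval is
    # probabilistic, so allow a limited number of repeated attempts
    for _ in range(max_attempts):
        match = quam_query(kmer)  # a similar stored K-mer, or None
        if match is not None:
            return exact_db.get(match)
    return None  # leave the K-mer unclassified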

It is instead more realistic to propose an augmentation of the Kraken classification, and the most obvious suggestion would be an extension of the exact search to allow for fuzzy searching. For this purpose, the quantum associative memories discussed in section 6 provide good starting points for researching a possible fuzzy retrieval of database entries. Through entanglement, the memory entries could be coupled to states in a taxonomic register storing the various taxonomic tags, generalizing the memory as

|M\rangle = \sum_{m \in \mathcal{M}} |m\rangle \otimes |T_m\rangle ,   (7.1)

where |T_m⟩ would be the tag state for pattern m. The fuzzy retrieval methods presented in sections 6.1 and 6.2 could then act on the memory register only, and measurement would retrieve both the pattern and the associated tag. If this could be performed together with methods such as Kraken, it could improve the binning accuracy and add resilience to sequencing errors. But usage of these methods requires special consideration of efficiency, as discussed in sections 6.1.4 and 6.2.1, and the construction of the database state is generally an expensive procedure. Therefore, it is proposed that one performs the quantum search only when the exact matching fails, as described in fig. 15. That way, one limits the number of queries to the quantum device to cases where fuzzy matching is required. Additionally, the memories have to be adapted for storing and retrieving DNA sequences along with any taxonomic tags, according to an appropriate metric.

In the following sections, suggestions for this adaptation to DNA sequences are described and the efficiency is tested to an extent. However, we will not consider the extension of tag states in the memory as it, apart from the additional bit usage, is theoretically not relevant for the efficiency of retrieval.
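Although the tag register is not simulated in the following sections, constructing the joint state (7.1) is straightforward in QuTiP; a toy sketch with arbitrarily chosen pattern and tag indices:

import qutip as qu

def tagged_memory(pattern_to_tag, n_bits, t_bits):
    # |M> = sum over stored m of |m> (x) |T_m>, normalized
    terms = [qu.tensor(qu.basis(2**n_bits, m), qu.basis(2**t_bits, t))
             for m, t in pattern_to_tag.items()]
    return sum(terms[1:], terms[0]).unit()

# Two 3-bit patterns coupled to 2-bit taxonomic tag states
M = tagged_memory({0b010: 1, 0b110: 3}, n_bits=3, t_bits=2)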

7.1 Grover-Based Quantum Associative Memory

7.1.1 Adaptation for Metagenomics

The Grover-based pattern matching methods which support fuzzy matching are what we shall refer to as the Ezhov model in section 6.1.2 and the improved model in section 6.1.3. Using any of these pattern-matching structures, one must choose an encoding such that sequences can be uniquely translated to bit-string patterns, which the structure supports. For the case of DNA, one has the four bases {A, T, C, G}, which by the use of two bits per character can be encoded as

E : \quad A \to 00 , \quad T \to 01 , \quad C \to 10 , \quad G \to 11 .   (7.2)

One can note that this encoding for DNA encapsulates all possible permutations of the two bits, which is ideal. If there existed bit strings not corresponding to any valid patterns, as for a similar encoding of the 20 amino acids, they would be difficult to exclude from the process when using an exclusive memory, except at the cost of additional expensive processing, causing an increasing amount of inevitably spurious states.

Then, one must tackle the change in metric used for similarity in DNA sequences in contrast to simple bit strings. Although there are several metrics one could use, what is considered here is an extension of the Hamming distance to a character difference metric acting on the base pairs as characters, i.e. with the use of the encoding (7.2), define the distance between the bit strings A = a_1, \ldots, a_n and B = b_1, \ldots, b_n as

d_c(A, B) = \sum_{i=1}^{n/2} \left( 1 - \delta_{a_{2i-1}, b_{2i-1}} \delta_{a_{2i}, b_{2i}} \right) , \quad \delta_{a_{2i-1}, b_{2i-1}} \delta_{a_{2i}, b_{2i}} = \begin{cases} 1 & \text{if } a_{2i-1} a_{2i} = b_{2i-1} b_{2i} \\ 0 & \text{otherwise} \end{cases} ,   (7.3)

with each pair of bits representing a character contributing a score of 1 only if the characters are not equal.

Then, based on this introduced metric, the shape of the query amplitude distribution should be designed so as to allow, to some extent, matches with dissimilarities to the query sequence. As we imagine the scenario of running the quantum routine when an exact match is not present, one can start by letting the amplitude at the query center (where d_c = 0) be zero. Our suggestion for the remaining states is to use the binomial distribution (6.21) that has been proposed and used with these structures, up to a point d_c ≤ D. Binning methods in metagenomics could improve when counting entries with some deviation, but there would be a threshold in the metric distance beyond which an entry should not be included. The parameter D therefore introduces the choice of searching for differing sequences up to a limit. To summarize, the query state for a given query pattern q would be constructed as |Q\rangle = \sum_{i=1}^{N} Q_q(p_i) |p_i\rangle with the amplitude distribution defined as

Q_q(p) = \begin{cases} 0 & p = q \\ C_{(1-\rho)} \sqrt{\gamma^{d_c(p,q)} (1-\gamma)^{n/2 - d_c(p,q)}} & 0 < d_c(p, q) \leq D \\ C_\rho & \text{otherwise} \end{cases} ,   (7.4)

for n-bit patterns, i.e. sequences of length n/2, where C_\rho and C_{(1-\rho)} are normalization constants chosen such that unwanted states, for which d_c > D, only receive a fixed probability ratio of ρ in the distribution. The reason for this is simply to avoid scenarios where the evolution fails because the query has no overlap with the exclusive memory state. More exactly, in adhering to the normalization condition \sum_p Q_q(p)^2 = 1, the constants should be chosen as

C_\rho = \sqrt{ \frac{\rho}{ \sum_{k=D+1}^{n/2} (L-1)^k \binom{n/2}{k} } } , \quad C_{(1-\rho)} = \sqrt{ \frac{1-\rho}{ \sum_{k=1}^{D} \gamma^k (1-\gamma)^{n/2-k} (L-1)^k \binom{n/2}{k} } } ,   (7.5)

where L stands for the number of letters in the alphabet, i.e. L = 4 for DNA. An example of this distribution is illustrated in fig. 16, where the zero amplitude at the center, the binomial shape and the change at the limit d_c = D are shown.
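The encoding (7.2), the metric (7.3) and the normalized query amplitudes (7.4)–(7.5) are straightforward to compute classically. The sketch below (written for this text, and brute-forcing all 4^{n/2} sequences, so only feasible for very short sequences) also verifies the normalization:

import numpy as np
from itertools import product
from math import comb

ENC = {"A": "00", "T": "01", "C": "10", "G": "11"}  # encoding (7.2)

def d_c(s, t):
    # Character difference metric (7.3), acting on base characters
    return sum(a != b for a, b in zip(s, t))

def query_amplitudes(q, D, gamma=0.4, rho=0.1, L=4):
    m = len(q)  # number of characters, n/2 in the text
    # Normalization constants (7.5)
    C_rho = np.sqrt(rho/sum((L - 1)**k*comb(m, k)
                            for k in range(D + 1, m + 1)))
    C_1rho = np.sqrt((1 - rho)/sum(gamma**k*(1 - gamma)**(m - k)*(L - 1)**k*comb(m, k)
                                   for k in range(1, D + 1)))
    amps = {}
    for p in map("".join, product("ATCG", repeat=m)):
        d = d_c(p, q)
        if d == 0:
            amps[p] = 0.0  # zero amplitude at the query center
        elif d <= D:
            amps[p] = C_1rho*np.sqrt(gamma**d*(1 - gamma)**(m - d))
        else:
            amps[p] = C_rho  # fixed probability ratio rho outside the limit
    return amps

amps = query_amplitudes("ATCG", D=2)
assert abs(sum(a**2 for a in amps.values()) - 1) < 1e-12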

7.1.2 Performance Tests

Although the design of the encoding and associated query is well motivated, as mentioned in sections 6.1.4 and 6.2.1 the probability of success and the different types of failure depend on the shape of the query and its correspondence to the memory, and there is also the question of how well the necessary approximations and heuristics work together with this design. In order to test this, the Ezhov and improved models were implemented in Python using QuTiP[69] to simulate the quantum objects and operations (see section 10.1). Using these simulations, memory retrieval should be tested for the different methods, sequence lengths, memory states and search criteria (e.g. D).

First, the ideal retrieval probability is tested, calculated as the maximum probability obtained during a limited number of iterations, in order to first measure the efficiency of retrieval without relying on approximations and heuristics. As for this iteration limit, an upper limit of N, for N memory patterns, is chosen, but if a local maximum cannot be found within this interval, the upper limit is extended

Figure 16: Illustration of the proposed query distribution, using 8-character sequences, a binomial parameter of γ = 0.4, a probability ratio ρ = 0.1 and character search limit D = 3.

by N until one is found. This performance test is executed using the proposed query (7.4) with binomial parameter γ = 0.4 and a spurious ratio of ρ = 0.1, and mainly tests the success probability as a function of the length of the sequences, the density of relevant solutions in memory, and the search limit D. Success probabilities may also depend on the number of spurious memories, and exactly which states are in the memory could have some effect on the result. These quantities are therefore randomly varied during each test run, and averaged over, so as to present the data in terms of the more relevant quantities. To show the dispersion of these values, each data point is plotted together with its standard error SE = σ/√N, where σ is the standard deviation of the N values being averaged. However, due to time limitations in this project, it was not possible to sample a statistically significant number of these random configurations and, as such, the displayed SE is not a good estimate of any actual variance margin. The purpose of plotting the standard error here is to obtain a sense of the relative level of dispersion between the different methods.

The success probabilities of the Ezhov model are shown in figs. 17a to 17c, each plot being for a different value of the search limit D. Inspecting these plots, it can be seen that the retrieval probability is largely dependent on the density of desired states in the memory. One can note how the probability of retrieval seems to be directly proportional to the target density, especially in fig. 17a. This shows that the ideal retrieval probability is not much better than that of simply picking a state at random from memory.

In the following tests, the same procedure as before is performed, except for the improved model. With this approach, the plots in figs. 18a to 18c were obtained. Comparing the plots in figs. 17a to 17c to those in figs. 18a to 18c, it is clear that introducing the projection operator does improve success probabilities significantly, reaching average success probabilities of 90% for the larger character limit, and also being more resilient to the lack of desired states for most sequence lengths.

In the previous tests, the probability of unwanted retrieval, i.e. finding states with low priority in the query (e.g. states for which d_c > D) but that exist in memory, as well as spurious retrieval, i.e. obtaining a state not in memory, were also measured to determine the behavior of these structures regarding failure. The results are shown for the D = 3 case. Figures 17e and 18e and figs. 17d and 18d show the probabilities of unwanted and spurious retrieval, respectively. The first thing to note is the almost nonexistent values for unwanted retrieval. This is not unexpected, as the analytic expression (6.28) shows that a memory state with low query amplitude is expected to have low retrieval amplitude, and with the query of (7.4) unwanted states will by design have low probability. However, the cases of success and failure should span the space of possible scenarios, which is confirmed when comparing figs. 17c and 18c with figs. 17d and 18d, which are nearly each other's complements. Therefore, if the method fails to retrieve a desired state, it will almost certainly return a spurious state.
The next question is how these structures perform under a more realistic scenario, when using a precomputed number of iterations. For the Ezhov model, the number of iterations is calculated according to (6.32); for the improved model, however, there is the question of whether the heuristic (6.35) is effective

(a) Maximum success probability measured for a character distance limit of D = 1. (b) Maximum success probability measured for a character distance limit of D = 2. (c) Maximum success probability measured for a character distance limit of D = 3. (d) Probability of spurious retrieval measured for a character distance limit of D = 3. (e) Probability of unwanted retrieval measured for a character distance limit of D = 3.

Figure 17: Probabilities measured at maximum success probability with standard error for the Ezhov model for varying density of target states in memory, sequence lengths and metric limits D, using the proposed query (7.4) with γ = 0.4 and ρ = 0.1.

(a) Maximum success probability measured for a character distance limit of D = 1. (b) Maximum success probability measured for a character distance limit of D = 2. (c) Maximum success probability measured for a character distance limit of D = 3. (d) Probability of spurious retrieval measured for a character distance limit of D = 3. (e) Probability of unwanted retrieval measured for a character distance limit of D = 3.

Figure 18: Probabilities measured at maximum success probability with standard error for the improved model for varying density of target states in memory, sequence lengths and metric limits D, using the proposed query (7.4) with γ = 0.4 and ρ = 0.1.

for the proposed query distribution. Therefore, in the first set of these tests, the oscillation frequency is obtained exactly using the query and memory states in simulation, and the respective iteration formulas are tested. However, two variants of the improved-model results will be given: one using the exact formula (6.38) derived in this thesis, and the other using the heuristic (6.35). The second set of tests does the same, but with the oscillation approximation (6.30). The approximation is, however, not applicable when using the exact number of iterations for the improved model, and therefore only the heuristic is used. All of the iteration formulas use an arbitrary integer α to increase the number of iterations in search of a more optimal result, but here the integer is limited such that the number of iterations is less than N, the number of stored patterns, when possible.
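Condensed from the simulation code in section 10.1, the heuristic iteration choice for a measured or approximated amplitude B = |⟨M|Q⟩| works as in the sketch below (the full version, including the analytic variant for the improved model, is given in the appendix):

import numpy as np

def iteration_count(B, N):
    # Grover-style oscillation with period T = 2*pi/(2*arcsin(B)); choose
    # the candidate T*(1/4 + x) closest to an integer, keeping the number
    # of iterations below the number of stored patterns N when possible
    T = 2*np.pi/(2*np.arcsin(abs(B)))
    xs = range(max(1, int(np.floor((N - T/4)/T)) + 1))
    candidates = [T*(1/4 + x) for x in xs]
    best = min(candidates, key=lambda v: abs(v - round(v)))
    return max(1, int(round(best)))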

(a) Success probabilities for the Ezhov model. (b) Success probabilities for the improved model using the heuristic for the number of iterations. (c) Success probabilities for the improved model using the analytic result for the number of iterations.

Figure 19: Success probabilities with standard error for the relevant models using exact information on the states, but with the heuristics needed in practice, shown for D = 3, using the proposed query (7.4) with γ = 0.4 and ρ = 0.1.

In figs. 19a and 19b the first set of tests is shown for a search limit of D = 3. The results in fig. 19a are more or less identical to the ideal probabilities found in the maximum-probability test of fig. 17c, while the improved-model case of fig. 19b, using the heuristic approach, achieves generally lower probabilities than in the ideal case. One can also comment on the results of fig. 19c, which should be similar to fig. 18c but are not. The obtained average probabilities are generally lower than those of fig. 18c and in fact similar to the heuristic result of fig. 19b. The dispersion in the improved model is also larger than for any of the previously shown results, indicating that these methods of finding the number of iterations are less reliable than for the Ezhov model. However, as the number of iterations was limited by a realistic bound dependent on the database size, the dispersion and the worse results could be explained by the algorithms choosing a suboptimal number of iterations because of discretization. This limitation may shadow the true capabilities of the associative memories, but these tests were designed to test realistic performance, and that includes the limitations of the structures when restricted to realistic running times.

(a) Success probabilities for the Ezhov model. (b) Success probabilities for the improved model.

Figure 20: Success probabilities with standard error for the relevant models using all approximations and heuristics needed in practice, shown for D = 3, using the proposed query (7.4) with γ = 0.4 and ρ = 0.1.

Performing the second set of tests, with results shown in fig. 20, the trend shows a decrease in success probabilities for the Ezhov model in comparison to all previous results. The heuristic approach in the improved model appears more inconsistent as a function of the target density; all previous tests have more or less indicated an increasing probability with increasing target density, which seems reasonable. What can be seen in fig. 20b is instead an oscillation, possibly caused by the heuristic “failing” to choose the correct number of iterations, which the increased dispersion for low values appears to indicate. It does, however, manage to score higher than the Ezhov model for a larger part of the density values.

To summarize, the results show promise in the improved model for the proposed goals in metagenomics, while the Ezhov model did not appear to offer much higher probability than random selection, particularly not when using the required approximations. Moreover, the improved method is only proven to be useful in ideal conditions. In other scenarios it appears unpredictable, with considerably lower probabilities of successful retrieval, especially when the similarity of query state and memory is initially unknown. If it is to be used in practice, more sophisticated approximation techniques or heuristics would need to be developed. Alternatively, it could be applied in a scenario where this information is partially known or can be estimated in advance.

7.2 Trugenberger Quantum Associative Memory

The Trugenberger version of associative memories was described in section 6.2 and determined to be relatively safe from providing false matches (in the sense of not retrieving spurious states). However, it was also determined that it does not offer as much flexibility in its priority distribution in comparison to the models with distributed queries. What will be discussed in this section is mainly how to choose parameters and a distance metric to adapt this method to the proposed metagenomics scenario.

The same DNA encoding as (7.2) is assumed for simplicity; however, one can note that this model would not be hindered by encodings that do not span all possible bit patterns. Even if there would exist spurious states by the nature of the chosen encoding, spurious states are non-retrievable in

this structure, and thus would not matter for its usage. With this choice of encoding, the same metric proposed in the previous section, (7.3), shall also be used. What then has to be investigated is an efficient decomposition of the metric unitary operator (6.48), how this affects the resulting distribution, and how the auxiliary bit parameter would be selected for some distance limit D.

For the first requirement, one should note the similarity between the current distance measure and the Hamming distance. Since the Hamming distance operator only needed a diagonal single-qubit gate in its decomposition, giving an appropriate phase change for the |0⟩ state (indicating equality of the query and pattern bits), this type of decomposition unitary can also be provided for our character distance measure, although as a two-qubit gate. One can readily observe that the metric can be obtained using the same kind of decomposition as (6.51),

\mathcal{H}_c = \prod_{i=1}^{n/2} CU^{-2}_{c,\,m_{2i-1} m_{2i}} \prod_{i=1}^{n/2} U_{m_{2i-1} m_{2i}} ,   (7.6)

except using the two-qubit gate U = \mathrm{diag}\left( e^{i\pi/n}, 1, 1, 1 \right). This unitary can in turn, because of its diagonality, be decomposed into five simple gates[70], with the circuit shown in fig. 21.

Figure 21: Decomposition of diagonal gate for DNA Hamming distance operator.
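The exact angles of fig. 21 are not legible in this copy of the thesis, but the underlying fact, that a diagonal two-qubit gate decomposes into three RZ rotations and two CNOTs, is easy to verify numerically. The sketch below checks one standard such decomposition (an assumption; the angle convention may differ from the figure) of U = diag(e^{iπ/n}, 1, 1, 1), which becomes exact once the global phase e^{iπ/4n} is included:

import numpy as np

n = 62            # pattern length in bits
phi = np.pi/n     # phase of U = diag(e^{i*phi}, 1, 1, 1)

def rz(theta):
    # RZ(theta) = exp(-i*theta*Z/2)
    return np.diag([np.exp(-1j*theta/2), np.exp(1j*theta/2)])

I2 = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

U = (np.exp(1j*phi/4)
     * np.kron(rz(-phi/2), rz(-phi/2)) @ CNOT @ np.kron(I2, rz(-phi/2)) @ CNOT)
assert np.allclose(U, np.diag([np.exp(1j*phi), 1, 1, 1]))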

With the decomposition provided, it is clear from the modifications in (7.6) that the resulting state will be the same as (6.56), except for the change of metric d_H → 2d_c. This produces a retrieval probability distribution as a function of the character distance d_c given by

P(p_i) = \frac{1}{N P_{\mathrm{rec}}} \cos^{2B}\left( \frac{\pi}{n} d_c(Q, p_i) \right) ,   (7.7)

with the appropriate normalization by P_{\mathrm{rec}} after the postselection measurement. The shape of this distribution as a function of the metric d_c is shown in fig. 22a for different choices of B. It is apparent that if one wishes to have a stricter priority with a low distance limit, the number of auxiliary bits required does not increase linearly. In fact, given a threshold δ defining the maximum priority past the character distance d_c = D, the number of auxiliary bits required to achieve this is found to be

\cos^{2B}\left( \frac{D\pi}{n} \right) \leq \delta \implies B \geq \frac{\log(\delta)}{2 \log\left( \cos\left( \frac{D\pi}{n} \right) \right)} .   (7.8)
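For concreteness, the bound (7.8) can be evaluated directly; a one-function sketch with illustrative parameter values:

import numpy as np

def aux_bits_pointwise(D, n, delta):
    # Smallest integer B satisfying the pointwise bound (7.8)
    return int(np.ceil(np.log(delta)/(2*np.log(np.cos(D*np.pi/n)))))

print(aux_bits_pointwise(D=5, n=62, delta=0.1))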

Of course, the actual probability is normalized by an initially unknown constant P_{\mathrm{rec}}, dependent on the overall match between query and memory, so knowing an appropriate δ as defined in (7.8) is difficult. One should instead reformulate the condition purely in terms of the distribution, i.e. demand that, when counting all possible patterns p, the summed priority of all states below the distance limit D is at least a ratio δ in comparison to all states. Writing

\frac{ \sum_{d_c(Q,p) \leq D} P(p) }{ \sum_{p} P(p) } \geq \delta ,   (7.9)

one can choose δ such that the distribution is mostly centered on the distances one prioritizes. Choosing δ = 0.9 as an example, and working with patterns of length n = 62, i.e. sequences of length K = 31, the number of bits required can be found numerically, as shown in fig. 22b. As is seen, the number of auxiliary qubits required increases exponentially with a lower distance limit.
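Since the number of patterns at character distance d from a fixed query is \binom{n/2}{d}(L-1)^d, the condition (7.9) can be checked without enumerating individual states. A sketch assuming the L = 4 DNA alphabet; a computation of this kind underlies the curve in fig. 22b:

from math import comb, cos, pi

def aux_bits_mass(D, n, delta, B_max=200):
    # Smallest B for which the priority mass below distance D reaches delta (7.9)
    m = n//2  # number of characters
    counts = [comb(m, d)*3**d for d in range(m + 1)]  # patterns at distance d
    for B in range(1, B_max + 1):
        w = [c*cos(pi*d/n)**(2*B) for d, c in enumerate(counts)]
        if sum(w[:D + 1])/sum(w) >= delta:
            return B
    return None

print(aux_bits_mass(D=5, n=62, delta=0.9))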

Figure 22: (a) The priority distribution \cos^{2B}(\pi d_c / n) for different B, using n = 62, i.e. 31 sequence characters. (b) The number of auxiliary bits required to put a priority limit at D, for δ = 0.9; note the logarithmic y-scale.

However, the bit usage is not unreasonable for D ≥ 5 at this threshold, considering a reasonable resource limit of a hundred qubits for future quantum computers.

We have thus seen how to adapt this technique to DNA sequences and for metagenomics purposes. The Trugenberger model offers more certainty in performance than the Grover-based models, since the properties of the retrieval state before measurement are known. The remaining problems are the bit usage and the probability of obtaining said state, i.e. getting past the recognition phase. With the recognition probability possibly reaching extremely low levels when using many auxiliary bits, testing the methods mentioned in section 6.2.1 to circumvent this would be a good next step in researching this structure, but this is left for further research.

8 Summary

In section 4 the most recent methods developed for bioinformatics purposes, or for similar problems, were presented. The methods demonstrate interesting ways of using both Grover search and the quantum Fourier transform to perform local alignment, and they could act as fundamental techniques in methods to be developed. However, among the current methods, few stand out in comparison to the commonly used classical algorithms, especially at the scale of most bioinformatic problems. The pattern-matching algorithm for local sequence alignment by Prousalis and Konofaos[37] would be the sole exception, if one could get past its more practical difficulties. The limitations of the methods mostly concern inefficient implementation of the involved operators, be it a black-box unitary or some oracle operator, the small number of qubits at one's disposal, and efficient transfer of classical information into the quantum device, i.e. state preparation. These challenges will need to be faced for a quantum device to replace classical methods for the alignment problem. It was not explored whether these techniques are applicable in a more niche setting where smaller sequences occur, or as a subroutine of other classical algorithms.

For the purpose of augmenting K-mer based binning methods in metagenomics with fuzzy matching capabilities, quantum associative memories were in section 7 adapted to the matching of DNA sequences, in the scenario where the quantum algorithm follows a failed exact matching. Grover-based memories had their success probabilities and spurious retrieval investigated for the different cases of ideal retrieval conditions, retrieval with full knowledge of the similarity between query and memory, and a realistic scenario assuming that no relation between query and memory is known. What we called the improved model, proposed by [54], showed excellent potential for successful retrieval under ideal conditions, but displayed lower and more unpredictable success probabilities when using the necessary approximations. Although efficient methods of memory state and operator construction still need to be studied in order for these structures to be efficient in practice, if one could tackle those issues and find improved approximations for the necessary information regarding query and memory, then Grover-based associative memories could prove a useful tool in improving metagenomic binning methods. A DNA sequence adaptation was provided

for the associative memory proposed by Trugenberger[51], and the effect of the choice of parameters was studied. This model has the advantage of being simple and efficient to implement, incapable of spurious retrieval in a noise-free environment, and having reasonable bit usage for the considered sequence lengths. While the issues regarding postselection failure would need to be researched further, especially when the fuzzy search is to be limited to a small distance range, the structure would make a near-ideal retrieval method for the purposes of metagenomics if these issues could be resolved or improved upon.

In section 5 an Ising spin formulation of the multiple sequence alignment problem was presented, exhibiting a spin usage which is linear in the number of sequences and quadratic in their lengths. It was shown to be correctly constructed and capable of solving simple alignment problems using both a simulated annealer and a D-Wave quantum annealer. Although there were limitations to which configurations it could solve on real quantum hardware, its correctness and demonstration will hopefully be a good first step for further developments. Its issues may have been caused by an inefficient chimera graph embedding or insufficient annealing time. The investigation of this, and eventual improvements to this solving technique, is left for future endeavors.

9 Further Research

It is difficult to cover all areas and discoveries in quantum computation that could be applied to bioinformatics. This was attempted within the limited time frame of this thesis, but inevitably there are areas of potential interest that could not be thoroughly studied. Here, proposals for areas to be researched further are listed.

9.1 Quantum Pattern-Matching

There could be ways of improving the quantum associative memories we have seen in this thesis, either to increase the probability of matching or to extend the capabilities of fuzzy matching and completion. It has been investigated whether one could make the wildcard positions in associative memory completion unknown and still find a match[71]. By using another kind of phase change in the Grover search operators, an almost certain success rate for pattern recognition has been achieved[72], although limited to cases where certain conditions on the memory are fulfilled. A non-linear retrieval method has been proposed as an alternative to Grover-based retrieval[73], and such methods have also been shown to make quantum computations capable of solving more difficult problems efficiently[74], but it is unclear whether such operations could be implemented in practice, especially in our current, circuit-based computational model.

Other forms of quantum associative memories, and general pattern matching algorithms, have also been studied to some extent. It has been suggested to implement a quantum associative memory through adiabatic computation[75], and basis states other than the computational basis have been proposed for memory state representation[76]. As for other forms of pattern matching, the problem has been formulated in terms of quantum template matching[77, 78], and a type of quantum neural network, the quantum Hopfield network, has also been investigated for its classification capabilities using RNA data[79].

9.2 Alignment Methods

While most methods developed for any kind of genetic sequence alignment have been summarized in section 4, there is a plethora of methods that make up modern routines for (multiple) sequence alignment that could potentially be adapted to bioinformatic purposes if thoroughly studied. Implementation of quantum dynamic programming has been studied, with speedup for some problem configurations[80], and could possibly be applied to improve parts of popular alignment algorithms such as the Needleman-Wunsch and Smith-Waterman methods, which rely heavily on dynamic programming techniques. As has been described previously, some modern methods of multiple sequence alignment rely on iterative techniques, such as evolutionary and genetic algorithms, to find approximate solutions. There have been successful quantum-inspired algorithms that make use of quantum mechanisms to improve the search for iterative solutions, although designed for classical computers[81]. This shows the potential of quantum mechanics for such purposes if the iterative approaches could be implemented using true quantum computations, which has been suggested and reviewed[82, 83, 84]. Alignment problems have also been treated classically using hidden Markov models[85], and the development of quantum Markov models[86] could be of use in making those approaches more memory efficient.

References

[1] Z. D. Stephens et al. “Big Data: Astronomical or Genomical?” In: PLOS Biology 13.7 (July 2015), pp. 1–11.
[2] IBMQ. 2019. url: https://www.research.ibm.com/ibm-q/.
[3] Rigetti Computing. 2019. url: https://www.rigetti.com/.
[4] D-Wave Systems. 2019. url: https://www.dwavesys.com/.
[5] J. Preskill. “Quantum Computing in the NISQ era and beyond”. In: Quantum 2 (2018), p. 79.
[6] Entropica labs. 2019. url: https://entropicalabs.com/about.
[7] Madprime. DNA Chemical Structure. 2007. url: https://commons.wikimedia.org/wiki/File:DNA_chemical_structure.svg.
[8] L. Wang and T. Jiang. “On the Complexity of Multiple Sequence Alignment”. In: Journal of Computational Biology 1.4 (1994), pp. 337–348.
[9] D. F. Feng and R. F. Doolittle. “Progressive sequence alignment as a prerequisite to correct phylogenetic trees.” In: Journal of Molecular Evolution 25.4 (1987), pp. 351–60.
[10] J. D. Thompson, D. G. Higgins, and T. J. Gibson. “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”. In: Nucleic Acids Research 22.22 (1994), pp. 4673–4680.
[11] C. Notredame, D. G. Higgins, and J. Heringa. “T-coffee: a novel method for fast and accurate multiple sequence alignment”. In: Journal of Molecular Biology 302.1 (2000), pp. 205–217.
[12] T. Thomas, J. Gilbert, and F. Meyer. “Metagenomics – a guide from sampling to data analysis”. In: Microbial Informatics and Experimentation 2.1 (2012), p. 3.
[13] V. H. Pham and J. Kim. “Cultivation of unculturable soil bacteria”. In: Trends in Biotechnology 30.9 (2012), pp. 475–484.
[14] P. J. Turnbaugh et al. “The Human Microbiome Project”. In: Nature 449.7164 (2007), pp. 804–810.
[15] J. Banerjee, N. Mishra, and Y. Dhas. “Metagenomics: A new horizon in cancer research”. In: Meta Gene 5 (2015), pp. 84–89.
[16] S. Anderson. “Shotgun DNA sequencing using cloned DNase I-generated fragments”. In: Nucleic Acids Research 9.13 (1981), pp. 3015–3027.
[17] Illumina. Illumina Expands World's Most Comprehensive Next-Generation Sequencing Portfolio. Jan. 12, 2015. url: https://www.illumina.com/company/news-center/press-releases/press-release-details.html?newsid=2006979.
[18] D. E. Wood and S. L. Salzberg. “Kraken: ultrafast metagenomic sequence classification using exact alignments”. In: Genome Biology 15.3 (2014), R46.
[19] W. Wootters and W. H. Zurek. “A single quantum cannot be cloned”. In: Nature 299 (1982), pp. 802–803.
[20] M. A. Nielsen and I. L. Chuang. “Quantum Circuits”. In: Quantum Computation and Quantum Information. Cambridge: Cambridge University Press, 2010. Chap. 4, pp. 171–216.
[21] D. Biron et al. “Generalized Grover Search Algorithm for Arbitrary Initial Amplitude Distribution”. In: ArXiv eprints (1998). arXiv: quant-ph/9801066.
[22] M. Boyer et al. “Tight Bounds on Quantum Searching”. In: Fortschritte der Physik 46.4-5 (1998), pp. 493–505.
[23] P. Li and K. Song. “Adaptive Phase Matching in Grover's Algorithm”. In: Journal of Quantum Information Science 01.02 (2011), pp. 43–49.
[24] E. Farhi et al. “Quantum Computation by Adiabatic Evolution”. In: ArXiv eprints (2000). arXiv: quant-ph/0001106.
[25] D-Wave System Documentation. May 30, 2019. url: https://docs.dwavesys.com/docs/latest/index.html.

[26] M. A. Nielsen and I. L. Chuang. “Quantum Fourier Transform and Its Applications”. In: Quantum Computation and Quantum Information. Cambridge: Cambridge University Press, 2010. Chap. 5, pp. 216–242.
[27] L. C. L. Hollenberg. “Fast Quantum Search Algorithms in Protein Sequence Comparison – Quantum Biocomputing”. In: Physical Review E 62.5 (2000), p. 5. arXiv: quant-ph/0002076v1.
[28] W. P. Baritompa, D. W. Bulger, and G. R. Wood. “Grover's Quantum Algorithm Applied to Global Optimization”. In: SIAM Journal on Optimization 15.4 (2005), pp. 1170–1184.
[29] A. Ahuja and S. Kapoor. “A Quantum Algorithm for finding the Maximum”. In: ArXiv eprints (1999), pp. 1–5. arXiv: quant-ph/9911082.
[30] S. Hakak et al. “Exact String Matching Algorithms: Survey, Issues, and Future Research Directions”. In: IEEE Access PP (2019), pp. 1–1.
[31] P. Mateus and Y. Omar. “Quantum Pattern Matching”. In: ArXiv eprints (2005). arXiv: quant-ph/0508237.
[32] P. Mateus and Y. Omar. “A Quantum Algorithm for Closest Pattern Matching”. In: Quantum Information Processing: From Theory to Experiment (2006), pp. 180–183.
[33] A. Sarkar. “Quantum Algorithms for pattern-matching in genomic sequences”. Master thesis. University of Delft, 2018, pp. 1–116.
[34] H. Ramesh and V. Vinay. “String matching in Õ(√n + √m) quantum time”. In: Journal of Discrete Algorithms 1.1 (2003), pp. 103–110.
[35] M. Boroujeni et al. “Approximating Edit Distance in Truly Subquadratic Time: Quantum and MapReduce”. In: ArXiv eprints (2018), pp. 1–38. arXiv: 1804.04178 [quant-ph].
[36] R. Schützhold. “Pattern recognition on a quantum computer”. In: Physical Review A - Atomic, Molecular, and Optical Physics 67.6 (2003), p. 6.
[37] K. Prousalis and N. Konofaos. “Quantum Pattern Recognition for Local Sequence Alignment”. In: 2017 IEEE Globecom Workshops (GC Wkshps). IEEE, 2017, pp. 1–5.
[38] S. Iriyama and M. Ohya. “On Quantum Algorithm for Multiple Alignment of Amino Acid Sequences”. In: Quantum Bio-Informatics II 2 (2009), pp. 92–104.
[39] A. Peruzzo et al. “A variational eigenvalue solver on a photonic quantum processor”. In: Nature Communications 5.1 (2014), p. 4213. arXiv: 1304.3061v1 [quant-ph].
[40] E. Farhi, J. Goldstone, and S. Gutmann. “A Quantum Approximate Optimization Algorithm”. In: ArXiv eprints (2014), pp. 1–16. arXiv: 1411.4028 [quant-ph].
[41] N. Moll et al. “Quantum optimization using variational algorithms on near-term quantum devices”. In: Quantum Science and Technology 3.3 (2018), pp. 1–30. arXiv: 1710.01022v2 [quant-ph].
[42] A. Lucas. “Ising formulations of many NP problems”. In: Frontiers in Physics 2 (2014), pp. 1–27. arXiv: 1302.5843 [quant-ph].
[43] D-Wave Systems. D-Wave Initiates Open Quantum Software Environment. Jan. 1, 2017. url: https://www.dwavesys.com/press-releases/d-wave-initiates-open-quantum-software-environment (visited on 05/30/2019).
[44] D. Mapleson et al. msa2qubo. https://github.com/maplesond/msa2qubo. 2016.
[45] J. Kececioglu. “The maximum weight trace problem in multiple sequence alignment”. In: Combinatorial Pattern Matching (2005), pp. 106–119.
[46] H. Lenhof, B. Morgenstern, and K. Reinert. “An exact solution for the segment-to-segment multiple sequence alignment problem”. In: Bioinformatics 15.3 (1999), pp. 203–210.
[47] D-Wave Ocean SDK. https://github.com/dwavesystems/dwave-ocean-sdk. 2019.
[48] I. Shapoval and P. Calafiura. “Quantum Associative Memory in HEP Track Pattern Recognition”. In: ArXiv eprints (2019). arXiv: 1902.00498 [quant-ph].
[49] D. Ventura and T. Martinez. “Quantum associative memory”. In: Information Sciences 124.1-4 (2000), pp. 273–296.

[50] D. Ventura and T. Martinez. “Initializing the Amplitude Distribution of a Quantum State”. In: ArXiv eprints 12.6 (1998), pp. 547–559. arXiv: quant-ph/9807054.
[51] C. A. Trugenberger. “Probabilistic Quantum Memories”. In: Physical Review Letters 87.6 (2001), pp. 067901-1–067901-4.
[52] A. Ezhov, A. Nifanova, and D. Ventura. “Quantum associative memory with distributed queries”. In: Information Sciences 128.3-4 (2000), pp. 271–293.
[53] D. Ventura. “Pattern Classification Using a Quantum System”. In: Proceedings of the 6th Joint Conference on Information Sciences. 2002, pp. 537–540.
[54] J. P. Tchapet Njafa, S. G. Nana Engo, and P. Woafo. “Quantum Associative Memory with Improved Distributed Queries”. In: International Journal of Theoretical Physics 52.6 (2013), pp. 1787–1801.
[55] Y. Liu and G. L. Long. “Deleting a marked basis-state from an even superposition of all basis-states within a single query”. In: International Journal of Quantum Information 07.02 (2009), pp. 567–572.
[56] Y. Liu and X. Ouyang. “A quantum algorithm that deletes marked states from an arbitrary database”. In: Chinese Science Bulletin 58.19 (2013), pp. 2329–2333.
[57] Y. Liu. “Deleting a marked state in quantum database in a duality computing mode”. In: Chinese Science Bulletin 58.24 (2013), pp. 2927–2931.
[58] P. Li and S. Li. “Phase matching in quantum searching algorithm with weighted targets”. In: Jisuan Wuli/Chinese Journal of Computational Physics 25.5 (2008), pp. 363–369.
[59] S. Lloyd, M. Mohseni, and P. Rebentrost. “Quantum principal component analysis”. In: Nature Physics 10.9 (2014), pp. 631–633.
[60] H. Buhrman et al. “Quantum Fingerprinting”. In: Physical Review Letters 87.16 (2001), p. 167902. arXiv: quant-ph/0102001.
[61] G. Brassard et al. “Quantum Amplitude Amplification and Estimation”. In: ArXiv eprints (2000). arXiv: quant-ph/0005055.
[62] J. P. Tchapet Njafa and S. G. Nana Engo. “Quantum associative memory with linear and non-linear algorithms for the diagnosis of some tropical diseases”. In: Neural Networks 97 (2018), pp. 1–10.
[63] M. C. Diamantini and C. A. Trugenberger. “High-Capacity Quantum Associative Memories”. In: Journal of Applied Mathematics and Physics 04.11 (2016), pp. 2079–2112.
[64] V. Bužek and M. Hillery. “Quantum copying: Beyond the no-cloning theorem”. In: Physical Review A - Atomic, Molecular, and Optical Physics 54.3 (1996), pp. 1844–1852. arXiv: quant-ph/9607018v1.
[65] L.-M. Duan and G.-C. Guo. “Probabilistic Cloning and Identification of Linearly Independent Quantum States”. In: Physical Review Letters 80.22 (1998), pp. 4999–5002.
[66] J. Y. Haw et al. “Surpassing the no-cloning limit with a heralded hybrid linear amplifier”. In: 2017 Conference on Lasers and Electro-Optics Pacific Rim (CLEO-PR). Vol. 2017-Janua. IEEE, 2017, pp. 1–1.
[67] A. M. Childs, A. J. Landahl, and P. A. Parrilo. “Quantum algorithms for the ordered search problem via semidefinite programming”. In: Physical Review A 75.3 (2007), p. 032335. arXiv: quant-ph/0608161v1.
[68] P. Høyer, J. Neerbek, and Y. Shi. “Quantum Complexities of Ordered Searching, Sorting, and Element Distinctness”. In: 2001, pp. 346–357. arXiv: quant-ph/0102078.
[69] J. Johansson, P. Nation, and F. Nori. “QuTiP 2: A Python framework for the dynamics of open quantum systems”. In: Computer Physics Communications 184.4 (2013), pp. 1234–1240.
[70] S. S. Bullock and I. L. Markov. “Arbitrary two-qubit computation in 23 elementary gates”. In: Physical Review A - Atomic, Molecular, and Optical Physics 68.1 (2003), p. 7. arXiv: quant-ph/0211002v3.
[71] R. G. Zhou et al. “Quantum Pattern Search with Closed Match”. In: International Journal of Theoretical Physics 52.11 (2013), pp. 3970–3980.
[72] R. Zhou and Q. Ding. “Quantum pattern recognition with probability of 100%”. In: International Journal of Theoretical Physics 47.5 (2008), pp. 1278–1285.

[73] J.-P. Tchapet Njafa and S. Nana Engo. “Concise quantum associative memories with nonlinear search algorithm”. In: Fortschritte der Physik 64.2-3 (2016), pp. 250–268.
[74] D. S. Abrams and S. Lloyd. “Nonlinear Quantum Mechanics Implies Polynomial-Time Solution for NP-Complete and #P Problems”. In: Physical Review Letters 81.18 (1998), pp. 3992–3995.
[75] Y. Osakabe et al. “Quantum Associative Memory with Quantum Neural Network via Adiabatic Hamiltonian Evolution”. In: IEICE Transactions on Information and Systems E100.D.11 (2017), pp. 2683–2689.
[76] M. P. Singh and B. S. Rajput. “New Maximally Entangled States for Pattern-Association Through Evolutionary Processes in a Two-Qubit System”. In: International Journal of Theoretical Physics 56.4 (2017), pp. 1274–1285.
[77] M. Sasaki, A. Carlini, and R. Jozsa. “Quantum template matching”. In: Physical Review A 64.2 (2001), p. 022317. arXiv: quant-ph/0102020.
[78] D. Curtis and D. A. Meyer. “Towards quantum template matching”. In: Quantum Communications and Quantum Imaging 5161 (2004), p. 134.
[79] P. Rebentrost et al. “Quantum Hopfield neural network”. In: Physical Review A 98.4 (2018), pp. 1–13. arXiv: 1710.03599v3 [quant-ph].
[80] B. Furrow. “A Panoply of Quantum Algorithms”. In: ArXiv eprints (2006), pp. 1–32. arXiv: quant-ph/0606127.
[81] K. H. Tsutomu, T. Yokoyama, and T. Shimizu. “Multiple Sequence Alignment by Genetic Algorithm”. In: Genome Informatics 11 (2000), pp. 317–318.
[82] D. A. Sofge. “Prospective Algorithms for Quantum Evolutionary Computation”. In: ArXiv eprints (2008). arXiv: 0804.1133 [quant-ph].
[83] D. Johannsen, P. P. Kurur, and J. Lengler. “Evolutionary algorithms for quantum computers”. In: Algorithmica 68.1 (2014), pp. 152–189.
[84] R. Lahoz-Beltra. “Quantum Genetic Algorithms for Computer Scientists”. In: Computers 5.4 (2016), p. 24.
[85] S. R. Eddy. “Multiple alignment using hidden Markov models”. In: Int Conf Intell Syst Mol Biol. Vol. 3. 1995, pp. 114–120.
[86] S. Srinivasan, G. Gordon, and B. Boots. “Learning Hidden Quantum Markov Models”. In: ArXiv eprints (2017), pp. 1–19. arXiv: 1710.09016 [quant-ph].

10 Appendix

10.1 Quantum Associative Memories

This section includes the code used to simulate and test the Grover-based quantum associative memories that have been investigated. It consists of three files: the code containing the QuAM class used to simulate the memories, the test file for calculating maximum retrieval probabilities, and the code for calculating probabilities under the more realistic conditions (using approximations and heuristics).

10.1.1 The QuAM Class

The class file for the QuAM object that is used to prepare and execute a QuAM retrieval simulation.

import numpy as np import qutip as qu import matplotlib.pyplot as plt import types

def bit_arr_to_int(pattern): return np.sum(pattern*np.power(2, np.arange(len(pattern)-1, -1, -1)))

def str_to_bit_arr(bit_str): return np.array([x == "1" for x in bit_str])

def find_closest_int(func, iter_limit=100, scheme="ezhov"): """ Finds closest integer for function func within given limit """ start = func(0) point = func(1) increase = point - start limit = int(np.floor((iter_limit - start)/increase)) while func(limit) < 1: limit = 1 nums = np.arange(0, limit+1) vals = func(nums) vals = np.abs(np.mod(vals, 1) - 0.5) alph = np.argmax(vals) M = int(np.round(func(nums[alph]))) if scheme != "ezhov" and M < 2: M = 2 return M

def hamm_bin(center, *params): """ built-in function for binomial distribution on the hamming distance """ a = params[0] def hamm_dist(pattern, ref): return np.sum(np.logical_not(np.logical_xor(pattern, ref))) N = len(center) perm_numbers = np.arange(2**N-1, -1, -1) perms = [format(s, "0" + str(N) + "b") for s in range(len(perm_numbers))] perms = np.array([np.array([x == "1" for x in s]) for s in perms]) weights = np.zeros(len(perms)) for i in range(len(perms)): dist = hamm_dist(perms[i], center) weights[i] = np.sqrt((a**dist)*(1-a)**(N-dist)) for i in range(len(perms)): if i == 0: state = weights[i]*qu.basis(len(perm_numbers), perm_numbers[i]) else: state += weights[i]*qu.basis(len(perm_numbers), perm_numbers[i]) return state.unit()

def excl_mem(pattern, state): """ exclusive memory creation

52 """ N = 2**len(pattern) if state is None: state = sum([qu.basis(N, i) for i in range(N)]) state = state.unit() pattern_state = qu.basis(N, bit_arr_to_int(pattern)) state -= pattern_state.overlap(state)*pattern_state return state def incl_mem(pattern, state): """ inclusive memory creation """ N = 2**len(pattern) if state is None: state = qu.Qobj() state += qu.basis(N, bit_arr_to_int(pattern)) return state class QuAM:

def __init__(self): self.diffusion = None self.oracle = None self.mem_op = None self.query = None self.patterns = None self.memory = None self.memories = None self.state = None

def set_query(self, patterns, weights=None, scheme=None, bin_param=0.25): """Sets the pattern of the search

Unless weights or different scheme is given, uses a Hamming distance binomial distribution for the distributed pattern query

patterns - list of binary array for the patterns sought after : list of bool array weights - (relative) probability weights for the patterns in the query : numpy array of floats scheme - Function for making query distribution : function handle with 1 argument, the pattern """ if weights is None: if scheme is None: scheme = lambda pattern: hamm_bin(pattern, bin_param) state = qu.Qobj() for pattern in patterns: state += scheme(str_to_bit_arr(pattern)) else: state = qu.Qobj() max_ID = 2**len(patterns[0]) for (pattern, weight) in zip(patterns, weights): state_id = bit_arr_to_int(str_to_bit_arr(pattern)) state += weight*qu.basis(max_ID, state_id) state = state.unit() self.query = state self.patterns = patterns

def set_mem(self, memories, scheme=None): """Sets the memory of the search

Takes a list of states to remember, and a scheme of how to input them into the memory (standards: 'inclusion', 'exclusion', 'vary_phase')

memories - list of binary arrays to memorize : list of bool numpy arrays scheme - the memorization scheme (standard: exclusion) : string or function handle taking a pattern and current memory state """ if not scheme: scheme = "exclusion" if isinstance(scheme, str):

53 if scheme == "inclusion": scheme = incl_mem elif scheme == "exclusion": scheme = excl_mem elif scheme == "vary_phase": # TODO pass elif not isinstance(scheme, types.FunctionType): raise Exception("Memorization scheme is not valid")

        state = None
        mem_op = qu.Qobj()
        N = 2**len(memories[0])
        for mem in memories:
            state = scheme(str_to_bit_arr(mem), state)
            state_ID = bit_arr_to_int(str_to_bit_arr(mem))
            mem_op += qu.ket2dm(qu.basis(N, state_ID))
        # reflection about the memory patterns
        self.mem_op = qu.identity(N) - 2*mem_op
        if state.norm() > 0:
            self.memory = state.unit()
        else:
            self.memory = state
        self.memories = memories

    def get_iter_num(self, match_type, scheme="approx", mem=None, quer=None):
        """ Calculates the number of iterations given a strategy for the
        calculation and whether approximations or exact information is used """
        if mem is None:
            mem = self.memory
        if quer is None:
            quer = self.query

        if match_type == "ezhov":
            # Ezhov model
            if scheme == "approx":
                B = np.sum(quer.data)/np.sqrt(quer.shape[0])
            else:
                B = quer.overlap(mem)
            B = np.abs(B)
            w = 2*np.arcsin(B)
            T = 2*np.pi/w
            iter_limit = len(self.memories)
            M = find_closest_int(lambda x: T*(1/4 + x),
                                 iter_limit=iter_limit, scheme=match_type)
        elif match_type == "C1" and scheme in ("approx", "semi_approx"):
            # Improved Ezhov model, heuristic approach: "approx" estimates the
            # overlap from the query alone, "semi_approx" uses the exact overlap
            if scheme == "approx":
                B = np.sum(quer.data)/np.sqrt(quer.shape[0])
            else:
                B = quer.overlap(mem)
            B = np.abs(B)
            w = 2*np.arcsin(B)
            T = 2*np.pi/w
            iter_limit = len(self.memories)
            M = find_closest_int(lambda x: T*(1/4 + x),
                                 iter_limit=iter_limit, scheme=match_type)
        elif match_type == "C1" and scheme == "exact":
            # Improved Ezhov model with exact info, analytic approach
            B = np.abs(quer.overlap(mem))
            w = 2*np.arcsin(B)
            N = 2**len(self.memories[0])
            mem_projector = (qu.identity(N) - self.mem_op)/2

            init_state = self.memory + 2*B*(2*mem_projector - 1)*self.query
            alpha = np.real(init_state.overlap(self.memory))
            beta = np.real(init_state.overlap(self.query))
            phi = np.arctan((1 - np.cos(w) - 2*B*alpha/beta)/np.sin(w))

            iter_limit = len(self.memories)
            func = lambda n: (n*np.pi - phi)/w + 1/2
            M = find_closest_int(func, iter_limit=iter_limit,
                                 scheme=match_type) + 1
        else:
            raise Exception("Invalid iteration identifiers")
        return M

    def match_ezhov(self, iteration="approx"):
        """ Match using the Ezhov QuAM """
        if isinstance(iteration, str):
            M = self.get_iter_num("ezhov", iteration)
        elif isinstance(iteration, int):
            M = iteration
        else:
            raise Exception("Invalid iteration parameter")

        state = self.memory.copy()
        state_hist = np.zeros(M + 1, dtype=qu.Qobj)
        state_hist[0] = state
        for i in range(M):
            # Grover-type iteration: oracle reflection followed by diffusion
            state = self.oracle*state
            state = self.diffusion*state
            state_hist[i + 1] = state
        return state, state_hist

    def match_C1(self, iteration="approx"):
        """ Match using the improved QuAM """
        if type(iteration) is int:
            M = iteration
        else:
            M = self.get_iter_num("C1", iteration)

        if M == 0:
            raise Exception("match_C1 needs at least one iteration")

        state_hist = np.zeros(M + 2, dtype=qu.Qobj)
        state = self.memory.copy()
        state_hist[0] = state
        # the first iteration reflects about the query...
        state = self.oracle*state
        state = self.diffusion*state
        state_hist[1] = state
        # ...the second about the stored memories
        state = self.mem_op*state
        state = self.diffusion*state
        state_hist[2] = state
        for i in range(M - 1):
            state = self.oracle*state
            state = self.diffusion*state
            state_hist[i + 3] = state
        return state, state_hist

    def set_oracle(self, oracle=None):
        """ Instantiates the oracle operator """
        N = self.query.shape[0]
        if oracle is None:
            # reflection about the query state
            self.oracle = qu.identity(N) - 2*qu.ket2dm(self.query)
        elif isinstance(oracle, qu.Qobj) and oracle.type == "ket":
            self.oracle = qu.identity(N) - 2*qu.ket2dm(oracle)
        else:
            self.oracle = oracle

    def set_diffusion(self, diffusion=None):
        """ Instantiates the diffusion operator """
        if not diffusion:
            # reflection about the memory state
            N = self.memory.shape[0]
            self.diffusion = 2*qu.ket2dm(self.memory) - qu.identity(N)
        else:
            self.diffusion = diffusion

    def set_state(self, state):
        self.state = state
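
For orientation, the class above is driven in the test files below roughly as follows. This is a minimal usage sketch with hypothetical four-bit patterns, not part of the original test files:

import numpy as np
import qutip as qu

# hypothetical four-bit memory patterns and a query centered on "0110"
quam = QuAM()
quam.set_mem(["0110", "1001", "1111"])     # load memory (exclusion scheme)
quam.set_query(["0110"], bin_param=0.25)   # binomial query around the target
quam.set_oracle()                          # reflection about the query
quam.set_diffusion()                       # reflection about the memory
final_state, history = quam.match_ezhov()  # run the Ezhov matching
probs = np.abs(final_state.full().flatten())**2
print("most likely pattern:", format(np.argmax(probs), "04b"))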

10.1.2 The Maximum Probability Test

This is the test file used for calculating the ideal retrieval probabilities under the metagenomic scenario suggested in section 7.1.

import sys

import Ventura_QuAM

import numpy as np
import qutip as qu
import matplotlib.pyplot as plt

""" Test for measuring the maximal probability of retrieval and failure of QuAMs """

def int_to_bitstr(integer, bits=None):
    if bits is None:
        return format(integer, "b")
    else:
        return format(integer, "0" + str(bits) + "b")

def hamm_dist(state1, state2, n):
    """ character Hamming distance between two integer-encoded states """
    s_state1 = format(state1, "0" + str(n) + "b")
    s_state2 = format(state2, "0" + str(n) + "b")
    return str_dist(s_state1, s_state2)


def str_dist(s1, s2):
    """ Hamming distance counted per two-bit character (one DNA base) """
    dist = 0
    for i in range(int(len(s1)/2)):
        c1 = s1[2*i:2*(i + 1)]
        c2 = s2[2*i:2*(i + 1)]
        if c1 != c2:
            dist += 1
    return dist


def vicinity_ideal_test(data_file=None, method="ezhov"):
    """Vicinity test

    Match with a distributed query of one target, iterating over the number of
    qubits and the filling of the database, as well as averaging over random
    choices of marked state. Include states in memory which are at least a
    Hamming distance 3 away from the mark, and also a certain percentage of
    states within a specified distance from the target.
    """
    max_bits = 10  # maximum number of qubits
    n_bits = np.arange(4, max_bits + 1, 2)  # step of 2 to only have DNA sequences
    n_nums = len(n_bits)
    number_of_marks = 10  # number of random marks to test for
    max_dist = 3
    distances = np.arange(1, max_dist + 1)  # distances at which to put the target memory states
    ratio_vals = 4  # number of ratio parameters to test for
    mark_ratio_vals = 10
    ratio_range = np.linspace(0.1, 0.9, ratio_vals)
    mark_ratios = np.linspace(0.1, 0.9, mark_ratio_vals)
    peak_factor = 0.4  # 0 is a delta peak at the mark
    amp_ratio = 0.1

    data_struct = (n_nums, number_of_marks, len(mark_ratios), ratio_vals, len(distances))
    ventura_probs = np.zeros(data_struct)
    ventura_prob_inds = np.zeros(data_struct)
    target_failure_probs = np.zeros(data_struct)
    memory_failure_probs = np.zeros(data_struct)

    for i1 in range(len(n_bits)):  # iterate number of bits
        n = n_bits[i1]
        N = 2**n
        mark_num = min(N, number_of_marks)  # number of marks to test for
        for i2 in range(number_of_marks):  # iterate number of marks
            mark = np.random.randint(0, N)
            for dist_ind in range(len(distances)):
                target_dist = distances[dist_ind]
                max_dist = target_dist
                if target_dist >= int(n/2):
                    continue
                possible_states = [i for i in range(N)]
                dist_states = set()
                mark_projector = qu.Qobj()
                i = 0
                while i < len(possible_states):
                    if possible_states[i] == mark:  # remove mark from memory
                        possible_states.pop(i)
                    elif hamm_dist(possible_states[i], mark, n) <= target_dist:
                        dist_states.add(possible_states.pop(i))
                    elif hamm_dist(possible_states[i], mark, n) <= max_dist:
                        possible_states.pop(i)
                    else:
                        i += 1

                mem_states = set()
                for mark_ratio_ind in range(len(mark_ratios)):
                    # prepare target states and projector
                    mark_ratio = mark_ratios[mark_ratio_ind]
                    target_states = set()
                    possible_targets = dist_states.copy()
                    target_num = int(len(possible_targets)*mark_ratio)
                    if target_num < 1:
                        target_num = 1
                    mark_projector = qu.Qobj()
                    for i in range(target_num):
                        state = possible_targets.pop()
                        target_states.add(state)
                        mark_projector += qu.projection(N, state, state)
                    for i3 in range(len(ratio_range)):  # iterate ratios
                        ratio = ratio_range[i3]

                        # create memory list
                        mem_states = target_states.copy()

                        # number of memory states
                        available_states = possible_states.copy()
                        N_mem = int(np.ceil(ratio*len(available_states)))

                        for j in range(N_mem):  # add random memories
                            num_ind = np.random.randint(0, len(available_states))
                            add_state = available_states.pop(num_ind)
                            mem_states.add(add_state)

                        # create memory projector
                        mem_projector = qu.Qobj()
                        for mem in mem_states:
                            mem_projector += qu.projection(N, mem, mem)
                        # create complement (spurious) projector to the memory
                        non_mem_projector = qu.identity(N) - mem_projector
                        non_target_mem_projector = mem_projector - mark_projector

                        # make query and memory list
                        query = [int_to_bitstr(i, n) for i in range(N)]
                        weights = np.zeros(N)
                        for i in range(N):
                            dist = hamm_dist(i, mark, n)
                            if dist <= target_dist:
                                weights[i] = np.sqrt((peak_factor)**dist
                                                     * (1 - peak_factor)**(n - dist))
                        weights[mark] = 0
                        # spread a small amplitude over the zero-weight states
                        n_zero = len(weights[weights == 0]) - 1
                        if n_zero > 0:
                            weights[weights == 0] = np.sqrt(np.sum(weights**2)
                                                            * amp_ratio/(1 - amp_ratio))/n_zero
                            weights[mark] = 0

                        memory = [int_to_bitstr(mem, n) for mem in mem_states]

                        # initialize QuAM
                        ventura = Ventura_QuAM.QuAM()
                        ventura.set_mem(memory)
                        ventura.set_query(query, weights=weights)
                        ventura.set_oracle()
                        ventura.set_diffusion()

                        # perform matching
                        repeat_case = True
                        iteration_cap = int(np.ceil(np.sqrt(len(mem_states))))
                        iteration_limit = iteration_cap
                        while repeat_case:
                            repeat_case = False
                            if method == "ezhov":
                                _, ventura_states = ventura.match_ezhov(iteration=iteration_limit)
                            if method == "improved":
                                _, ventura_states = ventura.match_C1(iteration=iteration_limit)

                            # gather probabilities
                            success_probs = [(mark_projector*state).norm()**2
                                             for state in ventura_states]
                            max_prob_ind = np.argmax(np.array(success_probs))
                            max_prob = success_probs[max_prob_ind]
                            good_failure_prob = (non_target_mem_projector
                                                 * ventura_states[max_prob_ind]).norm()**2
                            bad_failure_prob = (non_mem_projector
                                                * ventura_states[max_prob_ind]).norm()**2
                            print("n ratio mark_ratio dist prob fail_prob bad_fail ind")
                            print(n, format(ratio, "0.3f"),
                                  format(mark_ratio, "0.3f"), target_dist, format(max_prob, "0.4f"),
                                  format(good_failure_prob, "0.4f"), format(bad_failure_prob, "0.4f"),
                                  max_prob_ind)

                            ventura_probs[i1, i2, mark_ratio_ind, i3, dist_ind] = max_prob
                            target_failure_probs[i1, i2, mark_ratio_ind, i3, dist_ind] = good_failure_prob
                            memory_failure_probs[i1, i2, mark_ratio_ind, i3, dist_ind] = bad_failure_prob
                            ventura_prob_inds[i1, i2, mark_ratio_ind, i3, dist_ind] = max_prob_ind

                            # extend the iteration window if the peak lies on its edge
                            if max_prob_ind >= iteration_limit or max_prob_ind <= 0:
                                iteration_limit += iteration_cap
                                repeat_case = True

    # save results
    if data_file is None:
        data_file = "data"

    np.savez(data_file, bits=n_bits, ratios=ratio_range, mark_ratios=mark_ratios,
             peak_factor=np.array([peak_factor]), amp_ratio=np.array([amp_ratio]),
             distances=distances, values=ventura_probs, good_failure=target_failure_probs,
             bad_failure=memory_failure_probs, inds=ventura_prob_inds)


def plot_vicinity_data(data_file):
    data = np.load(data_file)

    bits = data["bits"]
    ratios = data["ratios"]
    mark_ratios = data["mark_ratios"]
    param = data["peak_factor"]
    vals = data["values"]
    inds = data["inds"]
    distances = data["distances"]
    failure_vals = data["bad_failure"]
    good_failure_vals = data["good_failure"]

    vals[vals == 0] = np.nan

    # project out the ratio parameter and the random iterations;
    # the sample count N is recovered from sum/mean to form standard errors
    mean_vals = np.nanmean(vals, axis=(1, 3))
    sum_vals = np.nansum(vals, axis=(1, 3))
    N_vals = sum_vals/mean_vals
    error_vals = np.nanstd(vals, axis=(1, 3))/np.sqrt(N_vals)

    failure_vals[failure_vals == 0] = np.nan
    good_failure_vals[good_failure_vals == 0] = np.nan

    failure_vals_means = np.nanmean(failure_vals, axis=(1, 3))
    N_vals = np.nansum(failure_vals, axis=(1, 3))/failure_vals_means
    failure_vals_errors = np.nanstd(failure_vals, axis=(1, 3))/np.sqrt(N_vals)

    good_failure_vals_means = np.nanmean(good_failure_vals, axis=(1, 3))
    N_vals = np.nansum(good_failure_vals, axis=(1, 3))/good_failure_vals_means
    good_failure_vals_errors = np.nanstd(good_failure_vals, axis=(1, 3))/np.sqrt(N_vals)

    for i in range(len(distances)):
        fig = plt.figure()
        ax = fig.gca()
        X, Y = mark_ratios, bits
        Z = mean_vals[:, :, i]

        for y_ind in range(len(Y)):
            # only plot sequence lengths for which the distance is meaningful
            if Y[y_ind]/2 > distances[i]:
                ax.errorbar(X, Z[y_ind, :], yerr=error_vals[y_ind, :, i],
                            label=str(int(Y[y_ind]/2)) + " chars", marker='o',
                            markersize=5, capsize=5)

        ax.legend()
        plt.xlabel("Target density")
        plt.ylabel("Probability")
        plt.grid(True)
        plt.ylim([0, 1])

        print("Showing "
              + "success probability of items up to distance {} from target".format(distances[i]))
        plt.show()

    for i in range(len(distances)):
        fig = plt.figure()
        ax = fig.gca()
        X, Y = mark_ratios, bits
        Z = failure_vals_means[:, :, i]

        for y_ind in range(len(Y)):
            if Y[y_ind]/2 > distances[i]:
                ax.errorbar(X, Z[y_ind, :], yerr=failure_vals_errors[y_ind, :, i],
                            label=str(int(Y[y_ind]/2)) + " chars", marker='o',
                            markersize=5, capsize=5)

        ax.legend()
        plt.xlabel("Target density")
        plt.ylabel("Probability")
        plt.grid(True)
        plt.ylim([0, 1])

        print("Showing "
              + "false positives probability of items up to distance {} from target".format(distances[i]))
        plt.show()

    for i in range(len(distances)):
        fig = plt.figure()
        ax = fig.gca()
        X, Y = mark_ratios, bits
        Z = good_failure_vals_means[:, :, i]

        for y_ind in range(len(Y)):
            if Y[y_ind]/2 > distances[i]:
                ax.errorbar(X, Z[y_ind, :], yerr=good_failure_vals_errors[y_ind, :, i],
                            label=str(int(Y[y_ind]/2)) + " chars", marker='o',
                            markersize=5, capsize=5)

        ax.legend()
        plt.xlabel("Target density")
        plt.ylabel("Probability")
        plt.grid(True)

        print("Showing "
              + "false negatives probability of items up to distance {} from target".format(distances[i]))
        plt.show()


name = "filename"  # choose filename for saving and loading data

# run the test with a chosen QuAM
vicinity_ideal_test(name, method="improved")

# plot data
plot_vicinity_data(name + ".npz")

10.1.3 The Realistic Probability Tests

This is the code for simulating the QuAM retrieval under realistic conditions, using approximations and heuristics. The exact scenario is determined by the input to the QuAM class object.

import sys

import Ventura_QuAM

import numpy as np
import qutip as qu
import matplotlib.pyplot as plt

""" Test for measuring the realistic probability of retrieval and failure of QuAMs """

def int_to_bitstr(integer, bits=None):
    if bits is None:
        return format(integer, "b")
    else:
        return format(integer, "0" + str(bits) + "b")

def hamm_dist(state1, state2, n):
    """ character Hamming distance between two integer-encoded states """
    s_state1 = format(state1, "0" + str(n) + "b")
    s_state2 = format(state2, "0" + str(n) + "b")
    return str_dist(s_state1, s_state2)


def str_dist(s1, s2):
    """ Hamming distance counted per two-bit character (one DNA base) """
    dist = 0
    for i in range(int(len(s1)/2)):
        c1 = s1[2*i:2*(i + 1)]
        c2 = s2[2*i:2*(i + 1)]
        if c1 != c2:
            dist += 1
    return dist

def vicinity_ideal_test(data_file=None, method="ezhov", approx_scheme="approx"):
    """Vicinity test

    Match with a distributed query of one target, iterating over the number of
    qubits and the filling of the database, as well as averaging over random
    choices of marked state. Include states in memory which are at least a
    Hamming distance 3 away from the mark, and also a certain percentage of
    states within a specified distance from the target.
    """
    max_bits = 10  # maximum number of qubits
    n_bits = np.arange(4, max_bits + 1, 2)  # step of 2 to only have DNA sequences
    n_nums = len(n_bits)
    number_of_marks = 10  # number of random marks to test for
    max_dist = 3
    distances = np.arange(1, max_dist + 1)  # distances at which to put the target memory states
    ratio_vals = 4  # number of ratio parameters to test for
    mark_ratio_vals = 10
    ratio_range = np.linspace(0.1, 0.9, ratio_vals)
    mark_ratios = np.linspace(0.1, 0.9, mark_ratio_vals)
    peak_factor = 0.4  # 0 is a delta peak at the mark
    amp_ratio = 0.1

    data_struct = (n_nums, number_of_marks, len(mark_ratios), ratio_vals, len(distances))
    ventura_probs = np.zeros(data_struct)
    ventura_prob_inds = np.zeros(data_struct)
    target_failure_probs = np.zeros(data_struct)
    memory_failure_probs = np.zeros(data_struct)

    for i1 in range(len(n_bits)):  # iterate number of bits
        n = n_bits[i1]
        N = 2**n
        mark_num = min(N, number_of_marks)  # number of marks to test for
        for i2 in range(number_of_marks):  # iterate number of marks
            mark = np.random.randint(0, N)
            for dist_ind in range(len(distances)):
                target_dist = distances[dist_ind]
                max_dist = target_dist
                if target_dist >= int(n/2):
                    continue
                possible_states = [i for i in range(N)]
                dist_states = set()
                mark_projector = qu.Qobj()
                i = 0
                while i < len(possible_states):
                    if possible_states[i] == mark:  # remove mark from memory
                        possible_states.pop(i)
                    elif hamm_dist(possible_states[i], mark, n) <= target_dist:
                        dist_states.add(possible_states.pop(i))
                    elif hamm_dist(possible_states[i], mark, n) <= max_dist:
                        possible_states.pop(i)
                    else:
                        i += 1

                mem_states = set()
                for mark_ratio_ind in range(len(mark_ratios)):
                    # prepare target states and projector
                    mark_ratio = mark_ratios[mark_ratio_ind]
                    target_states = set()
                    possible_targets = dist_states.copy()
                    target_num = int(len(possible_targets)*mark_ratio)
                    if target_num < 1:
                        target_num = 1
                    mark_projector = qu.Qobj()
                    for i in range(target_num):
                        state = possible_targets.pop()
                        target_states.add(state)
                        mark_projector += qu.projection(N, state, state)

                    for i3 in range(len(ratio_range)):  # iterate ratios
                        ratio = ratio_range[i3]

                        # create memory list
                        mem_states = target_states.copy()

                        # number of memory states
                        available_states = possible_states.copy()
                        N_mem = int(np.ceil(ratio*len(available_states)))

                        for j in range(N_mem):  # add random memories
                            num_ind = np.random.randint(0, len(available_states))
                            add_state = available_states.pop(num_ind)
                            mem_states.add(add_state)

                        # create memory projector
                        mem_projector = qu.Qobj()
                        for mem in mem_states:
                            mem_projector += qu.projection(N, mem, mem)
                        # create complement (spurious) projector to the memory
                        non_mem_projector = qu.identity(N) - mem_projector
                        non_target_mem_projector = mem_projector - mark_projector

                        # make query and memory list
                        query = [int_to_bitstr(i, n) for i in range(N)]
                        weights = np.zeros(N)
                        for i in range(N):
                            dist = hamm_dist(i, mark, n)
                            if dist <= max_dist:
                                weights[i] = np.sqrt((peak_factor)**dist
                                                     * (1 - peak_factor)**(n - dist))
                        weights[mark] = 0
                        # spread a small amplitude over the zero-weight states
                        n_zero = len(weights[weights == 0]) - 1
                        if n_zero > 0:
                            weights[weights == 0] = np.sqrt(np.sum(weights**2)*amp_ratio
                                                            /(1 - amp_ratio))/n_zero
                            weights[mark] = 0

                        memory = [int_to_bitstr(mem, n) for mem in mem_states]

                        # initialize QuAM
                        ventura = Ventura_QuAM.QuAM()
                        ventura.set_mem(memory)
                        ventura.set_query(query, weights=weights)
                        ventura.set_oracle()
                        ventura.set_diffusion()

                        # perform matching with the chosen iteration strategy
                        repeat_case = True
                        iteration_cap = int(np.ceil(np.sqrt(N)))
                        iteration_limit = iteration_cap
                        if method == "ezhov":
                            _, ventura_states = ventura.match_ezhov(iteration=approx_scheme)
                        if method == "improved":
                            _, ventura_states = ventura.match_C1(iteration=approx_scheme)

                        # gather probabilities at the final state
                        success_prob = (mark_projector*ventura_states[-1]).norm()**2
                        max_prob_ind = len(ventura_states)
                        good_failure_prob = (non_target_mem_projector*ventura_states[-1]).norm()**2
                        bad_failure_prob = (non_mem_projector*ventura_states[-1]).norm()**2
                        print("n ratio mark_ratio dist prob fail_prob bad_fail ind")
                        print(n, format(ratio, "0.3f"), format(mark_ratio, "0.3f"), target_dist,
                              format(success_prob, "0.4f"), format(good_failure_prob, "0.4f"),
                              format(bad_failure_prob, "0.4f"), max_prob_ind)

                        ventura_probs[i1, i2, mark_ratio_ind, i3, dist_ind] = success_prob
                        target_failure_probs[i1, i2, mark_ratio_ind, i3, dist_ind] = good_failure_prob
                        memory_failure_probs[i1, i2, mark_ratio_ind, i3, dist_ind] = bad_failure_prob
                        ventura_prob_inds[i1, i2, mark_ratio_ind, i3, dist_ind] = max_prob_ind

    # save results
    if data_file is None:
        data_file = "data"

    np.savez(data_file, bits=n_bits, ratios=ratio_range, mark_ratios=mark_ratios,
             peak_factor=np.array([peak_factor]), amp_ratio=np.array([amp_ratio]),
             distances=distances, values=ventura_probs, good_failure=target_failure_probs,
             bad_failure=memory_failure_probs, inds=ventura_prob_inds)


def plot_vicinity_data(data_file):
    data = np.load(data_file)

    bits = data["bits"]
    ratios = data["ratios"]
    mark_ratios = data["mark_ratios"]
    param = data["peak_factor"]
    vals = data["values"]
    inds = data["inds"]
    distances = data["distances"]
    failure_vals = data["bad_failure"]
    good_failure_vals = data["good_failure"]

    vals[vals == 0] = np.nan
    failure_vals[failure_vals == 0] = np.nan
    good_failure_vals[good_failure_vals == 0] = np.nan

    # project out the ratio parameter and the random iterations
    mean_vals = np.nanmean(vals, axis=(1, 3))
    sum_vals = np.nansum(vals, axis=(1, 3))
    N_vals = sum_vals/mean_vals
    error_vals = np.nanstd(vals, axis=(1, 3))/np.sqrt(N_vals)

    failure_vals_means = np.nanmean(failure_vals, axis=(1, 3))
    N_vals = np.nansum(failure_vals, axis=(1, 3))/failure_vals_means
    failure_vals_errors = np.nanstd(failure_vals, axis=(1, 3))/np.sqrt(N_vals)

    good_failure_vals_means = np.nanmean(good_failure_vals, axis=(1, 3))
    N_vals = np.nansum(good_failure_vals, axis=(1, 3))/good_failure_vals_means
    good_failure_vals_errors = np.nanstd(good_failure_vals, axis=(1, 3))/np.sqrt(N_vals)

    print("vals shape", mean_vals.shape)

    for i in range(len(distances)):
        fig = plt.figure()
        ax = fig.gca()
        X, Y = mark_ratios, bits
        Z = mean_vals[:, :, i]

        for y_ind in range(len(Y)):
            if Y[y_ind]/2 > distances[i]:
                ax.errorbar(X, Z[y_ind, :], yerr=error_vals[y_ind, :, i],
                            label=str(int(Y[y_ind]/2)) + " chars", marker='o',
                            markersize=5, capsize=5)

        ax.legend()
        plt.xlabel("Target density")
        plt.ylabel("Probability")
        plt.grid(True)
        plt.ylim([0, 1])

        print("Showing "
              + "success probability of items up to distance {} from target".format(distances[i]))
        plt.show()

    for i in range(len(distances)):
        fig = plt.figure()
        ax = fig.gca()
        X, Y = mark_ratios, bits
        Z = failure_vals_means[:, :, i]

        for y_ind in range(len(Y)):
            if Y[y_ind]/2 > distances[i]:
                ax.errorbar(X, Z[y_ind, :], yerr=failure_vals_errors[y_ind, :, i],
                            label=str(int(Y[y_ind]/2)) + " chars", marker='o',
                            markersize=5, capsize=5)

        ax.legend()
        plt.xlabel("Target density")
        plt.ylabel("Probability")
        plt.grid(True)
        plt.ylim([0, 1])

        print("Showing "
              + "false positives probability of items up to distance {} from target".format(distances[i]))
        plt.show()

    for i in range(len(distances)):
        fig = plt.figure()
        ax = fig.gca()
        X, Y = mark_ratios, bits
        Z = good_failure_vals_means[:, :, i]

        for y_ind in range(len(Y)):
            if Y[y_ind]/2 > distances[i]:
                ax.errorbar(X, Z[y_ind, :], yerr=good_failure_vals_errors[y_ind, :, i],
                            label=str(int(Y[y_ind]/2)) + " chars", marker='o',
                            markersize=5, capsize=5)

        ax.legend()
        plt.xlabel("Target density")
        plt.ylabel("Probability")
        plt.grid(True)

        print("Showing "
              + "false negatives probability of items up to distance {} from target".format(distances[i]))
        plt.show()


name = "filename"  # choose filename for saving and loading data

# run the test with a chosen QuAM and iteration strategy:
# method is "improved" or "ezhov"; approx_scheme is "exact", "approx" or "semi_approx"
vicinity_ideal_test(name, method="improved", approx_scheme="exact")

# plot data
plot_vicinity_data(name + ".npz")

10.2 Trugenberger Model Calculations

This section includes a small script used for the numerical solving in fig. 22b.

import numpy as np
import matplotlib.pyplot as plt


def ratio(n, x, D, B):
    """ Ratio of the retrieval weight carried by states within
    distance D to the total weight over all distances x """
    weight = np.cos(np.pi/n*x)**(2*B)
    below = np.sum(weight[x <= D])
    total = np.sum(weight)
    return below/total

# parameters to test for, assuming 62-bit patterns, i.e. 31-character sequences
n = 62
D = np.arange(1, int(n/2))
d_vals = np.arange(1, int(n/2))
B_vals = np.arange(1, int(n/2))
threshold = 0.9

# for each distance limit, find the smallest B value fulfilling the ratio criterion
for i in range(len(d_vals)):
    f = lambda B: ratio(n, d_vals, D[i], B) - threshold
    for B in range(10000):
        if f(B) >= 0:
            B_vals[i] = B
            break
B_vals[B_vals == 0] = 1

plt.semilogy(d_vals, B_vals, marker="o", markersize=5)
plt.grid(True)
plt.xlabel("Distance limit, $D$")
plt.ylabel("Auxiliary bits, $B$")
plt.show()
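In other words, the script searches, for each distance limit $D$, for the smallest number of auxiliary bits $B$ such that the quantity computed by ratio above,

\[
r(D, B) = \frac{\sum_{d \leq D} \cos^{2B}\!\left(\pi d / n\right)}{\sum_{d} \cos^{2B}\!\left(\pi d / n\right)},
\]

reaches the threshold (0.9 in the script), with the sums running over the distances $d = 1, \dots, n/2 - 1$ considered in the code.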

10.3 Quantum and Simulated Annealing Code

Here, the code used for encoding and running the multiple sequence alignment problem on a quantum annealer is included.

10.3.1 Hamiltonian Construction

This code defines functions for constructing the Hamiltonian used for solving the MSA problem. It takes the sequences, parameters for the cost function, and encoding parameters such as the number of columns as input, and returns the QUBO form of the problem.

import numpy as np

from collections import OrderedDict


def get_MSA_qubitmat(sizes, weights, gap_pen=0, extra_inserts=0,
                     allow_delete=False, coeffs=None):
    """Generate Hamiltonian for Multiple Sequence Alignment (MSA)
    column formulation, as a minimization problem.

    Args:
        sizes - size of each sequence : 1D np.array
        weights - weight/cost matrix between elements in sequences (s1,n1,s2,n2) : 4D np.array
        gap_pen - gap penalty : float
        extra_inserts - number of extra inserts to allow : int
        allow_delete - flag for allowing deletion of elements in the problem : Bool

    Returns:
        spin interaction (QUBO) matrix, energy shift for the Hamiltonian and
        an index scheme to map spin indices to indices for sequence, base
        element and position, and vice versa.

    Goals:
        1 Minimize cost function
        2 Place every element somewhere
        3 Respect element order
    """
    L = len(sizes)  # number of sequences
    n_tot = np.sum(sizes)
    N = np.max(sizes)  # max number of elements in the sequences
    num_pos = N + extra_inserts + 1*allow_delete  # number of positions

    """Number of spins x_{s,n,i}
    s in [1, L]
    n in [1, sizes(s)]
    i in [1, N + extra + {1 if deletion}]
    number of spins = n_tot*num_pos
    """
    num_spins = n_tot*num_pos
    pauli_list = []
    shift = 0

    spin_mat = np.zeros((num_spins, num_spins))

    if coeffs is None:
        match_max = np.max(weights)
        match_max = max(match_max, gap_pen)
        """ maximum penalty given by (maximum row size)*sum_0_to_n(i*(i-1))*max_cost
        sum_0_to_n(i*(i-1)) = [n(n+1)(2n+1)/6] - [n(n+1)/2]
                            = [n(n+1)/2][(2n+1)/3 - 1] = n(n+1)(n-1)/3 = n(n^2-1)/3
        """
        sum_0_to_n = lambda x: x*(x**2 - 1)/3
        max_penalty = (np.max(sizes) + extra_inserts)*sum_0_to_n(len(sizes))*match_max
        coeffs = [1, max_penalty]
        print("max_penalty 1=", max_penalty)

        """ Tighter penalty, given by
        P = min(-N*max(2g - w) - L, L + N*min(w - 2g))
        """
        shifted_weights = 2*gap_pen - weights
        max_val = np.amax(shifted_weights, axis=(0, 1, 2, 3))
        min_val = np.amin(-shifted_weights, axis=(0, 1, 2, 3))
        max_penalty = np.abs(min([-n_tot*max_val - gap_pen*L, gap_pen*L + n_tot*min_val]))
        print("max_penalty 2=", max_penalty)
        # the tighter bound overrides the first estimate
        coeffs = [1, max_penalty]

    A = coeffs[0]  # cost function coefficient
    B = coeffs[1]  # placement coefficient
    C = coeffs[1]  # order coefficient

    def pos2ind(s, n, i):
        """Return spin index from sequence, element and position indices

        Index scheme: the first N*L spins are 'removal' spins,
        the following L*N*num_pos spins are location spins
        """
        return int((np.sum(sizes)*L)*allow_delete
                   + (np.sum(sizes[:s]) + n)*num_pos + i)

    rev_ind_scheme = np.empty(num_spins, dtype=tuple)
    for s in range(L):
        for n in range(sizes[s]):
            for i in range(num_pos):
                rev_ind_scheme[pos2ind(s, n, i)] = (s, n, i)

    def to_bool_arr(arr):
        length = int(np.ceil(np.log2(np.max(arr))))
        bool_arr = [0]*len(arr)
        for i in range(len(arr)):
            bin_string = format(arr[i], "0" + str(length) + "b")
            bool_arr[i] = np.array([x == '1' for x in bin_string], dtype=bool)
        return bool_arr

    def add_pauli_bool(coeff, *inds):
        """ Adds a QUBO term with the given coefficient for the spins
        addressed by (sequence, element, position) index tuples """
        nonlocal shift
        nonlocal spin_mat
        inds_arr = np.zeros(len(inds), dtype=np.int32)
        if len(inds) > 2 or len(inds) <= 0:
            raise ValueError("Invalid number of indices, must be on QUBO form")

        for i in range(len(inds)):
            s, n, j = inds[i]
            inds_arr[i] = pos2ind(s, n, j)

        if len(inds_arr) == 1:
            spin_mat[inds_arr[0], inds_arr[0]] += coeff
        elif len(inds_arr) >= 2:
            inds_arr = np.sort(inds_arr)
            spin_mat[tuple(inds_arr)] += coeff

"""Cost function (Goal 1) Matching at same position H_matching = A*sum_{s1,s2} sum_{n1,n2} sum_i w_{s1,s2,n1,n2} * x_{s1,n1,i}*x_{s2,n2,i} """ for s1 in range(L): for s2 in range(s1+1,L): for n1 in range(sizes[s1]): for n2 in range(sizes[s2]): for i in range(num_pos): # matching cost w = weights[s1,n1,s2,n2] add_pauli_bool(A*w, (s1,n1,i), (s2,n2,i))

"""Penalties version 2 (pair of sum penalties) Pairing with gaps H_gap = A*sum_{s1,n1}sum_{s2}sum_i g*x_{s1,n1,i}(1 - sum_n2 x_{s2,n2,i}) Represents pairing of (s1,n1) at i to nothing in s2

68 """ if gap_pen != 0: for s1 in range(L): for n1 in range(sizes[s1]): for s2 in range(L): for i in range(num_pos): w = A*gap_pen

add_pauli_bool(w, (s1, n1, i)) for n2 in range(sizes[s2]): add_pauli_bool(-w, (s1, n1, i), (s2, n2, i))

"""Placement terms Place at only one position H_placement = B*sum_{s,n} (1-sum_i x_{s,n,i})^2 = B*n_tot - 2*B*sum_{s,n}sum_i x_{s,n,i} \ + sum_{s,n} sum_{i,j} x_{s,n,i}x_{s,n,j} = B*n_tot - B*sum_{s,n}sum_i x_{s,n,i} + B*sum_{s,n}sum_{i!=j} x_{s,n,i}x_{s,n,j} """ shift += B*n_tot for s in range(L): for n in range(sizes[s]): for i in range(num_pos): for j in range(num_pos): if i != j: w = B add_pauli_bool(w, (s,n,i), (s,n,j)) else: w = -B add_pauli_bool(w,(s,n,i))

"""Order terms no deletions H_order_no_del = C*sum_{s,n} sum_{i<=j} x_{s,n,j}x_{s,n+1,i} with deletions H_order = C*sum_s sum_{n1

    return spin_mat, shift, rev_ind_scheme

def get_alignment_string(sequences, gaps, positions):
    """ Builds the aligned strings, with '-' marking gaps, from a dictionary
    mapping (sequence, element) to a column position """
    string_size = max([len(s) for s in sequences]) + gaps
    align_strings = [["-" for i in range(string_size)] for i in range(len(sequences))]
    for (key, value) in positions.items():
        align_strings[key[0]][value] = sequences[key[0]][key[1]]
    return align_strings
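
As a small illustration of the encoding size, the following hedged example (the sequences and costs are hypothetical, not taken from the annealing tests below) shows the QUBO produced by the constructor:

import numpy as np

# hypothetical example: three sequences of sizes 2, 1 and 1, no extra inserts
sizes = np.array([2, 1, 1])
weights = np.zeros((3, 2, 3, 2))  # all matching costs zero for illustration

mat, shift, rev_inds = get_MSA_qubitmat(sizes, weights)

# n_tot = 4 elements, num_pos = max(sizes) = 2 columns,
# so the QUBO acts on n_tot*num_pos = 8 binary spins
print(mat.shape)    # (8, 8)
print(rev_inds[0])  # (0, 0, 0): sequence 0, element 0, column 0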

10.3.2 Annealing Tests

This section includes two code files, one for running the given sequences on an annealer and one for analyzing and plotting the result.

import MSA_column
from data_formats import SeqQuery
import numpy as np
import dimod
import neal
import pickle
from dwave.system.samplers import DWaveSampler
from dwave.system.composites import EmbeddingComposite

samples = 1000  # number of samples to run
save_file = "filename"
sequences = ["AT", "T", "T", "A"]  # write your sequences here
simulation = False  # flag determining whether to run the simulated or the quantum annealer

# rewards (negative) and costs (positive)
match_cost = -1
mismatch_cost = 1

# quantum spin column version
sizes = [len(sequences[i]) for i in range(len(sequences))]

# calculate weights for matching
matchings = np.zeros((len(sequences), max(sizes), len(sequences), max(sizes)))
for s1 in range(len(sequences)):
    for s2 in range(len(sequences)):
        for n1 in range(sizes[s1]):
            for n2 in range(sizes[s2]):
                if sequences[s1][n1] == sequences[s2][n2]:
                    matchings[s1, n1, s2, n2] = match_cost
                else:
                    matchings[s1, n1, s2, n2] = mismatch_cost

inserts = 0  # the number of extra gaps in addition to the longest sequence length
gap_penalty = 0  # the gap penalty
params = {"gap_pen": gap_penalty, "extra_inserts": inserts}
mat, shift, rev_inds = MSA_column.get_MSA_qubitmat(sizes, matchings,
                                                   gap_pen=gap_penalty, extra_inserts=inserts)


def mat_to_dimod_format(matrix):
    """ Interprets the QUBO matrix of interactions into the separate
    formats for linear and interaction terms used by dimod """
    n = matrix.shape[0]
    linear = {}
    interaction = {}
    for i in range(n):
        linear[i] = matrix[i, i]
        for j in range(i + 1, n):
            interaction[(i, j)] = matrix[i, j]
    return linear, interaction


h, J = mat_to_dimod_format(mat)

if not simulation:
    cont = input("Continue? y/n ")
    if cont != "y":
        exit()

bqm = dimod.BinaryQuadraticModel(h, J, shift, dimod.BINARY)
if simulation:
    solver = neal.SimulatedAnnealingSampler()
else:
    # WARNING! requires setup of a D-Wave certificate file beforehand
    solver = EmbeddingComposite(DWaveSampler())
response = solver.sample(bqm, num_reads=samples)

data_query = SeqQuery()
data_query.sequences = sequences
data_query.params = params
data_query.costs = [match_cost, mismatch_cost, gap_penalty]
data_query.spin_mat = mat
data_query.spin_shift = shift
data_query.rev_inds = rev_inds
data_query.h = h
data_query.J = J
data_query.response = response

# pickle results
file = open(save_file, "wb")
pickle.dump(data_query, file, pickle.HIGHEST_PROTOCOL)
file.close()
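
The analysis file itself is not reproduced here. As a rough sketch of the decoding step, the lowest-energy sample can be mapped back to an alignment through the index scheme and get_alignment_string from MSA_column; the filename is a placeholder, and data_formats must be importable when unpickling:

import pickle
import MSA_column

# load the pickled results from the run above (placeholder filename)
with open("filename", "rb") as f:
    data_query = pickle.load(f)

# take the lowest-energy sample returned by the annealer
best = data_query.response.first.sample

# map every spin that is set to its (sequence, element, position) triple
positions = {}
for spin_ind, value in best.items():
    if value == 1:
        s, n, i = data_query.rev_inds[spin_ind]
        positions[(s, n)] = i

# build and print the aligned strings
aligned = MSA_column.get_alignment_string(data_query.sequences,
                                          data_query.params["extra_inserts"],
                                          positions)
for row in aligned:
    print("".join(row))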
