DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ELECTRICAL ENGINEERING
AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Preprocessing of Nanopore Current Signals for DNA Base Calling

JOSEF MALMSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Machine Learning
Date: August 31, 2020
Supervisor: Xuechun Xu
Examiner: Joakim Jaldén
School of Electrical Engineering and Computer Science
Swedish title: Förbehandling av strömsignaler från en nanopor för sekvensiering av DNA


Abstract

DNA is a molecule containing genetic information in all living organisms and many viruses. The process of determining the underlying genetic code in the DNA of an organism is known as DNA sequencing, and is commonly used for instance to study viruses, perform forensic analysis, and for medical diagnosis. One modern sequencing technique is known as nanopore sequencing. In nanopore sequencing, an electrical current signal that varies in amplitude depending on the genetic sequence is acquired by feeding a DNA strand through a nanometer scale protein pore, a so-called nanopore. The process of then inferring the underlying genetic sequence from the raw current signal is known as base calling. Base calling is commonly modeled as a machine learning problem, typically using Deep Neural Networks (DNNs) or Hidden Markov Models (HMMs). In this thesis, we seek to investigate how preprocessing of the raw electrical current signals can impact the performance of a subsequent base calling model. Specifically, we apply different methods for normalization, filtering, feature extraction, and quantization to the raw current signals, and evaluate the performance of these methods using a base caller built from a so-called Explicit Duration Hidden Markov Model (ED-HMM), a variation of the regular HMM. The results show that the application of various preprocessing techniques can have a moderate impact on the performance of the base caller. With appropriately chosen preprocessing methods, the performance of the studied ED-HMM base caller was improved by 2-3 percentage points, compared to a conventional preprocessing scheme. Possible future research directions for instance include exploring the generalizability of the results to deep base calling models, and evaluating other more sophisticated preprocessing methods from adjacent fields.

Sammanfattning

DNA är den molekyl som bär den genetiska informationen i alla levande organismer och många virus. Processen genom vilken man tolkar den underliggande genetiska koden i en DNA-molekyl kallas för DNA-sekvensiering, och används exempelvis för att studera virus, utföra rättsmedicinska undersökningar, och för att diagnostisera sjukdomar. En modern sekvensieringsteknik kallas för nanopor-sekvensiering (eng: nanopore sequencing). I nanopor-sekvensiering erhålls en elektrisk strömsignal som varierar i amplitud beroende på den underliggande genetiska koden genom att inmata en DNA-sträng genom en proteinpor i nanometer-skala, en så kallad nanopor. Processen genom vilken den underliggande genetiska koden bestäms från den obearbetade strömsignalen kallas för basbestämning (eng: base calling). Basbestämning modelleras vanligen som ett maskininlärningsproblem, exempelvis med hjälp av djupa artificiella neuronnät (DNNs) eller dolda Markovmodeller (HMMs). I det här examensarbetet ämnar vi att utforska hur förbehandling av de elektriska strömsignalerna kan påverka prestandan hos en basbestämningsmodell. Vi applicerar olika metoder för normalisering, filtrering, mönsterextraktion (eng: feature extraction), och kvantisering på de obearbetade strömsignalerna, och utvärderar prestandan av metoderna med en basbestämningsmodell som använder sig av en så kallad dold Markovmodell med explicit varaktighet (ED-HMM), en variant av en vanlig HMM. Resultaten visar att tillämpningen av olika förbehandlingsmetoder kan ha en måttlig inverkan på basbestämningsmodellens prestanda. Med lämpliga val av förbehandlingsmetoder ökade prestandan hos den studerade ED-HMM-basbestämningsmodellen med 2-3 procentenheter jämfört med en konventionell förbehandlingskonfiguration. Möjliga framtida forskningsriktningar inkluderar att undersöka hur väl dessa resultat generaliserar till basbestämningsmodeller som använder djupa neuronnät, och att utforska andra mer sofistikerade förbehandlingsmetoder från närliggande forskningsområden.

Acknowledgment

I would like to express my sincerest gratitude to my supervisor Xuechun Xu, as well as Professor Joakim Jaldén. Firstly for welcoming me into this project at such short notice, and secondly for going above and beyond their duties to provide me with excellent guidance and advice for the thesis.

Contents

1 Introduction
  1.1 Problem statement
  1.2 Outline

2 Background
  2.1 DNA sequencing
    2.1.1 The DNA molecule
    2.1.2 Methods for DNA sequencing
    2.1.3 Nanopore sequencing
  2.2 Base calling
    2.2.1 Challenges
    2.2.2 Past methods
    2.2.3 State-of-the-art methods
    2.2.4 Preprocessing for base calling
  2.3 Machine learning for base calling
    2.3.1 Machine learning
    2.3.2 Hidden Markov Models (HMMs)
    2.3.3 Hidden Semi Markov Models (HSMMs)
    2.3.4 Explicit Duration HMM as a DNA base caller
  2.4 Time series analysis
    2.4.1 Normalization
    2.4.2 Filtering
  2.5 Quantization
    2.5.1 Uniform quantization
    2.5.2 k-means quantization
    2.5.3 Information loss optimized quantization

3 Experiments and Results
  3.1 Experiment setting
    3.1.1 Dataset
    3.1.2 Implementation and environment
    3.1.3 Model configurations
  3.2 Preprocessing experiments
    3.2.1 Normalization
    3.2.2 Filtering
    3.2.3 Feature extraction
    3.2.4 Quantization
    3.2.5 Best performing preprocessing configuration

4 Discussion
  4.1 Result summary
  4.2 Limitations
  4.3 Future work
  4.4 Ethics and society

5 Conclusion

Bibliography

A Detailed background on the ED-HMM base caller
  A.1 Hidden Markov Models (HMMs)
    A.1.1 The Baum-Welch algorithm
    A.1.2 The Viterbi algorithm
  A.2 Hidden Semi Markov Models (HSMMs)
    A.2.1 Explicit Duration HMM
  A.3 Explicit Duration HMM as a DNA base caller
    A.3.1 Model definition
    A.3.2 Model parameters
    A.3.3 Initialization
    A.3.4 Training
    A.3.5 Inference
    A.3.6 Evaluation

B Observed shift in DNA translocation speed

C Supplemental results

D Further experiments with standard deviation feature

Chapter 1

Introduction

In a world full of viruses, genetic disorders, and serial killers, the study of deoxyribonucleic acid (DNA) can be the difference between life and death. The process of determining the underlying genetic code in the DNA of an organism is known as DNA sequencing, and is commonly used to study viruses, provide medical diagnosis, perform forensic analysis, and for various other applications. Over the past decades, advancements in DNA sequencing technology have made the process easier, and significantly cheaper. This has increased the availability of the technology both in research and commercial settings. The speed at which genes can be sequenced has also increased by several orders of magnitude, meaning larger sections of DNA (or even full genomes) can be sequenced in a reasonable time frame [1]. A state-of-the-art method for DNA sequencing is known as nanopore sequencing [2]. In nanopore sequencing, a strand of DNA is fed through a nanometer-scale protein pore, a so-called nanopore, and through an adjacent membrane over which a voltage is placed. As the DNA travels through the pore, the sequence of building blocks that make up the genetic code, the so-called nucleobases, affects the flow of ions across the membrane. Thus, by measuring the electrical current through the membrane, the sequence of nucleobases can be determined. The process of determining the underlying sequence of nucleobases given an acquired electrical current signal is known as base calling. The problem of achieving accurate base calling is a research problem of its own, typically approached using concepts from the field of machine learning, for instance Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs).


An aspect of the base calling problem that has been left mostly unexplored is whether or not preprocessing of the electrical current signals from a nanopore could play an important role in achieving competitive performance. In this thesis, we therefore seek to evaluate different preprocessing techniques and their effects on the performance of a particular base caller.

1.1 Problem statement

The aim of the thesis project was to investigate whether the application of different preprocessing techniques on the signals from a nanopore DNA sequencer can have a significant impact on the performance of a subsequent base calling model. Specifically, the effects of applying different methods in the following preprocessing domains were to be explored:

• Normalization of the electrical current signals.

• Filtering of the signals.

• Quantization of the raw signal samples into a digital representation.

• Feature extraction on the raw current signals, to obtain alternate feature representations of the signals for the base calling model.

1.2 Outline

The rest of the thesis is structured as follows:

• In Chapter 2, relevant background on DNA sequencing, base calling, machine learning, time series analysis, and quantization is provided.

• In Chapter 3, the methods and results of the performed preprocessing experiments are presented and analyzed.

• In Chapter 4, a summary of the result analysis is provided, the significance of this work in a broader context is discussed, and potential future research directions are suggested.

• In Chapter 5, a summary of the conclusions drawn from this thesis is provided.

Chapter 2

Background

2.1 DNA sequencing

DNA is a molecule containing genetic information in all organisms and many viruses. The study of DNA has various applications in biological and medical research, medical diagnosis, forensics, and virology. The process of extracting the genetic code of a DNA molecule is commonly referred to as DNA sequencing. This section will provide relevant background information on the topic of DNA sequencing, starting with a brief review of the structure of DNA, thereafter proceeding to historic and state-of-the-art methods for DNA sequencing.

2.1.1 The DNA molecule

DNA consists of two strands that coil around each other to form the structure of a double helix. Each strand has a number of smaller units known as nucleotides. Upon each nucleotide, one of four so-called nucleobases attaches. The four nucleobases are known as cytosine (C), guanine (G), adenine (A), and thymine (T). It is the sequence of nucleobases in the DNA strand that encodes the genetic information. Additionally, each nucleobase in a strand bonds to the nucleobase in the corresponding position on the opposing strand, forming a so-called base pair. The structure of the nucleotides is such that C can only bond with G, and A can only bond with T, see Figure 2.1. This means that given only one of the two DNA strands, the sequence of nucleobases on the second strand can be directly inferred.


Figure 2.1: Schematic visualization of the DNA double helix. Adenine (A) bonds only to thymine (T), and cytosine (C) bonds only to guanine (G).

2.1.2 Methods for DNA sequencing

The first methods for DNA sequencing, proposed in the 1970s, utilized various chemical modifications of the DNA molecule. An approach known as Maxam-Gilbert sequencing allowed identification of the sequence of base pairs through radioactive labeling of the molecule that could be detected with X-ray imaging. Another method, known as the chain-termination method, chemically modified the DNA in such a way that the different nucleobases appeared different in color. Modern methods for DNA sequencing also utilize a variety of different techniques. A number of popular methods use various approaches to fluorescent labeling of the nucleotides, which can be detected with cameras. Others chemically modify the DNA molecule in such a way that it releases hydrogen ions which are detected by a sensor. A state-of-the-art method is known as nanopore sequencing, described in greater detail in the following subsection.

2.1.3 Nanopore sequencing

In nanopore sequencing, a nanopore is utilized to detect the sequence of nucleobases. The DNA molecule is separated into its two strands, and one strand is fed through the nanopore. The nanopore also sits on a membrane over which a voltage is placed. As the DNA strand is fed through the nanopore, the nucleobases currently inside the pore affect the electrical resistance of the membrane, causing variations in the electrical current. By measuring and analyzing the current signal, the sequence of nucleobases can then be inferred. A schematic illustration of the process is provided in Figure 2.2.

The leading producer of nanopore sequencing technology is the UK-based company Oxford Nanopore Technologies (ONT) [2]. ONT provides a number of different machines for nanopore sequencing, ranging in performance in terms of speed, read length, and on-board computational capability. For small-scale or field experiments, ONT provides the so-called Flongle, a DNA sequencer in dongle format that plugs into a regular PC. On the other end of the spectrum is the PromethION sequencer, which is designed for large-scale experiments and provides, among other things, a much higher throughput as well as on-board computation. A popular mid-range sequencer that provides a trade-off between these two machines in terms of speed, portability, and cost is known as the MinION.

Figure 2.2: Schematic illustration of nanopore sequencing, reproduced from [3]. Individual nucleotides passing through the pore affect the resistance in the membrane, resulting in current variations.

2.2 Base calling

In nanopore sequencing, the process of translating the measured current signals into a sequence of nucleobases is commonly referred to as base calling. This task is typically framed as a machine learning problem, where the input is a time series of electrical current samples, and the target is the corresponding sequence of nucleobases.

2.2.1 Challenges

Several complicating factors make achieving high accuracy base calling a difficult problem. Multiple adjacent nucleobases affect the current signal simultaneously as they travel through the nanopore, meaning different bases cannot be trivially mapped to an individual current level [4]. Accurately base calling long sequences of the same nucleobase, so-called homopolymers, can be especially challenging, as it is difficult to detect a transition from one set of nucleobases to another identical set of nucleobases. The speed at which the DNA strand travels through the nanopore is also non-uniform, which further complicates the mapping from current samples to bases [4]. Since the translocation speed is not constant, the number of electrical current samples originating from each nucleobase (or set of nucleobases) will vary over time. The inherent nature of the nanopore sequencing technology also results in the presence of random noise and disruptions in the current signal. Since the modulations of the current are caused by single molecules, or small groups of molecules, they are low in amplitude and therefore sensitive to noise from the surrounding environment.

2.2.2 Past methods

Historically, the problem of base calling has typically been modeled using Hidden Markov Models (HMMs) [4]. Background on HMMs and their use as base calling models is provided in section 2.3. In the typical HMM model of the base calling problem, nucleobase subsequences of length k are considered, commonly referred to as k-mers. The choice of k is generally made in relation to how many nucleotides are assumed to affect the nanopore simultaneously (typically 3-7). The hidden state of the HMM represents the k-mer currently passing through the nanopore, while the observations are simply the current signal samples, or some representation thereof. Several variations adding additional complexities also exist, such as including a time duration for the k-mer's presence in the nanopore in the hidden state. The earliest base calling models applied an initial processing step that translated the current samples into an event-based signal, which was then used as input to the base calling model [4]. Each event in this representation summarized a segment of the raw current signal in a set of statistical metrics (e.g. mean, standard deviation, and duration). In contrast, modern base callers typically utilize an end-to-end approach that takes the raw current signal as input and predicts the nucleobase sequence.

2.2.3 State-of-the-art methods

A recent review of the state-of-the-art approaches to nanopore base calling is provided by [5]. The top performers include models named Guppy and Flappie, developed by Oxford Nanopore Technologies (ONT), the leading company in nanopore sequencing technology [6]. A competitor is the independently developed model referred to as Chiron [7]. As stated by [5], all of the state-of-the-art base callers employ deep neural networks. Chiron utilizes an architecture that couples a convolutional neural network (CNN) with a recurrent neural network (RNN) and a connectionist temporal classification (CTC) decoder [7]. ONT's base callers have not been made public, but are known to be based on deep networks [5]. However, parts of the research community argue that a HMM-based approach could still be favorable over the use of deep networks. According to the review by [5], Guppy generally achieves the best results of the available base callers. The most commonly used performance metric in base calling is known as identity rate, of which a definition is provided in section 2.3.4. Guppy generally attains identity rates in the range 87-91 %, depending on model configuration and the data used for training.

2.2.4 Preprocessing for base calling

To the best of the author's knowledge, there have been no previous publications dedicated to studying preprocessing of nanopore signals for the purpose of DNA base calling. Several previous works on base calling do however state the preprocessing techniques that were used for the data. Most base callers that operate directly on the raw input signals apply some form of normalization before calling the signals. Several works including [8], [9], [10] apply median absolute deviation (MAD) normalization to each signal read. In [7], on the other hand, each read is z-score normalized. Definitions of these normalization methods are provided in section 2.4.1. Base callers that require an initial processing of the signals into a segmented or event based representation, such as [11], often apply more specialized normalization. In this work however, we focus on preprocessing in the case where the base caller uses a regular signal representation, rather than a segmented or event based encoding. Other applicable preprocessing techniques such as feature extraction, filtering and quantization of the signals appear to be largely unexplored in the field of DNA base calling. The author has not been able to identify any publication where these techniques were mentioned, indicating that uniform quantization was likely used without any preceding feature extraction or filtering.

2.3 Machine learning for base calling

This section provides the necessary background on machine learning (especially HMMs and variations thereof) to gain a basic understanding of the studied base caller. For the interested reader, a more detailed background on this topic is provided in Appendix A.

2.3.1 Machine learning

Machine learning is the study of algorithms that learn from data without explicit instruction. The field is generally broadly categorized into two sub-fields based on the task to be solved: supervised learning and unsupervised learning. Supervised learning assumes access to a dataset D = {(x^(i), y^(i))}_{i=1}^N of N inputs x^(i) and corresponding targets y^(i). This dataset is referred to as the training set, as it is what is used by the machine learning algorithm in its learning procedure. Given the training set, the task of supervised learning is to find a general mapping from inputs x^(i) to targets y^(i), such that for new data which was not seen during the training phase, the machine learning algorithm can accurately infer the target given only the input. In unsupervised learning, the training set consists only of inputs D = {x^(i)}_{i=1}^N. The task is then instead to infer information about the inherent structure of the input data, for instance by identifying clusters or groupings of the samples.

2.3.2 Hidden Markov Models (HMMs)

In many unsupervised learning settings, where only inputs are observed, there is a notion of additional, hidden variables (so-called latent variables) which are assumed to affect the inputs in a way that cannot be observed directly from the data. In such settings, one can attempt to explicitly account for these latent variables in the machine learning model, either as a means to improve the performance, or because inference of the latent variable itself might be of interest. A frequently used model for sequence data that has this structure is the Hidden Markov Model (HMM).

In a HMM, the modeled system is assumed to be a Markov process with hidden states. This means that at each time step, the system takes one of a number of possible hidden states with a probability dependent only on the previous state. Additionally, it assumes that the observed data at time step t is dependent only on the state of the model at time step t. This observed data, which we previously referred to as inputs, is typically denoted as observations in this context. A Hidden Markov Model is commonly represented as a Bayesian network, as illustrated in Figure 2.3. There exist several variations of the HMM. A more detailed review is provided in [12].

Figure 2.3: A Hidden Markov Model with the hidden states S1, S2, ..., and observations X1, X2, ..., represented as a Bayesian network.

The Baum-Welch algorithm

In order to use a Hidden Markov Model for inference, its parameters must be learned from data. The procedure for training a HMM is known as the Baum-Welch algorithm, and has its basis in a more general machine learning algorithm known as the Expectation Maximization (EM) algorithm. The EM algorithm is used to train models in which some of the variables are unobserved (i.e. latent). By taking the expectation over all latent variables (known as the E-step), a lower bound for the log-likelihood of the data in the vicinity of the current estimate of the model parameters can be found. The estimate of the model parameters is then updated to maximize this lower bound (known as the M-step). By iteratively performing the E-step and M-step, given some initial guess for the model parameters, the EM algorithm converges to a local maximum. In the Baum-Welch algorithm, expectation maximization is used in combination with recursive computations of probabilities in the HMM to learn the parameters of the model. A more detailed account of the Baum-Welch algorithm can be found in [12].

The Viterbi algorithm

Given a trained HMM, one often wants to compute the most probable sequence of hidden states given a particular sequence of observations. Formally, this can be formulated as finding the state sequence (î_1, ..., î_T) such that

    (î_1, ..., î_T) = argmax_{(i_1, ..., i_T)} P[S_1 = i_1, ..., S_T = i_T | x_1, ..., x_T, λ],    (2.1)

where λ denotes the parameters of the model. With N different possible states, there are N^T possible hidden state sequences that could have generated an observed sequence of length T. Thus, in general, this problem cannot be reasonably solved with a brute-force search. Instead, one can use a procedure known as the Viterbi algorithm, which utilizes dynamic programming to cover only the necessary parts of the full search space. The algorithm operates recursively for t = 1, ..., T. The key idea is that at time step t, only the most probable path that results in each state i for i = 1, ..., N needs to be considered further, since the state at time step t + 1 is dependent only on the state at time step t. Thus, at any time t there are only N possible candidates for the most probable sequence. As the algorithm reaches t = T, the most probable sequence can be selected from the N candidates. A more detailed account of the Viterbi algorithm can be found in [12].
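To make the recursion concrete, the following is a minimal sketch of the Viterbi algorithm for a generic discrete-observation HMM, not the ED-HMM base caller implementation (which is written in C and uses a more involved super-state lattice). The parameter names pi, A, and B are our own illustrative choices, assuming an initial distribution, a transition matrix, and an emission matrix.

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Most probable state sequence for a discrete-observation HMM.

    pi: (N,) initial state probabilities
    A:  (N, N) transition probabilities, A[i, j] = P(S_{t+1} = j | S_t = i)
    B:  (N, M) emission probabilities, B[i, o] = P(X_t = o | S_t = i)
    observations: sequence of observation indices in {0, ..., M-1}
    """
    N = len(pi)
    T = len(observations)
    # Work in log-space to avoid numerical underflow on long sequences.
    log_delta = np.log(pi) + np.log(B[:, observations[0]])
    backpointers = np.zeros((T, N), dtype=int)

    for t in range(1, T):
        # scores[i, j]: log-probability of the best path in state i at t-1 moving to j at t
        scores = log_delta[:, None] + np.log(A)
        backpointers[t] = np.argmax(scores, axis=0)
        log_delta = scores[backpointers[t], np.arange(N)] + np.log(B[:, observations[t]])

    # Backtrack from the most probable final state.
    states = [int(np.argmax(log_delta))]
    for t in range(T - 1, 0, -1):
        states.append(int(backpointers[t, states[-1]]))
    return states[::-1]
```

At each time step only the N best partial paths are kept, which is exactly the dynamic-programming argument made above.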

2.3.3 Hidden Semi Markov Models (HSMMs)

In the traditional HMM, the only way to model an extended stay in some state is through repeated self-transitions. Since the self-transition probability for any given state is modeled by a single scalar, the level of nuance in the modeling of state durations is highly restricted. A variation of the common HMM where the durations of states are modeled explicitly is known as a Hidden Semi Markov Model (HSMM) [13]. In a HSMM, each state has an associated variable duration, during which the model will remain in the state and produce additional observations. The addition of state durations adds significant complexity to the model, as the durations must be incorporated in the transition and observation probabilities. However, training and inference with the model can still be performed using extended Baum-Welch and Viterbi algorithms that follow the same principles as in the regular HMM. A detailed account of these algorithms for the HSMM is provided in [13].

Explicit Duration HMM

Different simplifying assumptions can be made with regards to how the distribution of durations is modeled in the HSMM. For instance, one could assume that

1. a transition to the current state is independent of the duration of the previous state and,

2. the duration is conditioned only on the current state.

A HSMM utilizing these assumptions is known as an Explicit Duration HMM (ED-HMM) [13]. Due to its simplicity compared to other variations of the HSMM, the ED-HMM is the most popular HSMM in many applications [13].

2.3.4 Explicit Duration HMM as a DNA base caller

In nanopore DNA sequencing, HMMs (and especially ED-HMMs) can be utilized to model the sequence of nucleobases in a DNA strand, given the observed current signal. Since the sequence of nucleobases is the unobserved quantity that we wish to infer, they correspond to the sequence of states in the model. The measured current signal corresponds to the observations in the model. As the translocation speed of the DNA strand through the nanopore is non-uniform, ED-HMMs that can model the duration of states with more nuance are especially well suited to this application. In this section we proceed to describe one such ED-HMM model that can function as a base caller for nanopore sequencing.

Model definition

It is generally assumed that a number of nucleobases are inside the pore at any given time (typically between 3 and 7) and thus affect the signal simultaneously. The state representation is therefore chosen to be the sequence of k nucleobases, a so-called k-mer, currently affecting the signal. For a model using a 5-mer state representation, for instance, the current state might be the sequence 'ATACG'. The model remains in the same state for a duration d ∈ Z+, up to a maximum of d = D steps. Henceforth, we refer to the tuple (S, d) of a state S and its associated duration d as a so-called super-state. The duration d functions as a timer for the state in the following way. For durations d = 2, ..., D − 1, super-state (S, d) always transitions into the super-state (S, d − 1), i.e. the same state but with a duration one step shorter. From the super-state (S, 1), the model transitions into some new super-state (Ŝ, d), i.e. a new state with a reset timer. This transition models a shift of a single nucleobase in/out of the k-mer representing state S. Therefore, compared to a standard ED-HMM, there are additional constraints on what state transitions are possible, since the new k-mer must omit the first nucleobase of the previous k-mer, and append a new nucleobase at the end. For instance, in a 3-mer state representation, 'TTT' can only transition to 'TTA', 'TTC', 'TTG' or 'TTT'. Figure 2.4 shows an illustration of a lattice representation of the model, taking these constraints into account.

Figure 2.4: The ED-HMM illustrated in a lattice representation with added k-mer transition constraints. The k-mer corresponding to the new state must omit the first nucleobase of the current k-mer, and append a new nucleobase at the end. Note that the figure shows a 3-mer state representation for the sake of brevity. In practice, a 5-mer or 7-mer state is used.
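As a small illustration of the transition constraint described above, the following sketch enumerates the k-mers reachable from a given k-mer. The function name and the alphabet ordering are our own choices and are not taken from the base caller implementation.

```python
BASES = "ACGT"

def successor_kmers(kmer):
    """All k-mers reachable from `kmer` when one nucleobase shifts through the pore:
    drop the first base and append one of the four bases."""
    return [kmer[1:] + b for b in BASES]

# In a 3-mer state representation, 'TTT' can only transition to 'TTA', 'TTC', 'TTG' or 'TTT'.
print(successor_kmers("TTT"))  # ['TTA', 'TTC', 'TTG', 'TTT']
```

Each k-mer thus has exactly four possible successor states, which is what gives the lattice in Figure 2.4 its sparse structure.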

Training

Much like with the regular HMM, the parameters of an ED-HMM can be learned from data using a variation of the Baum-Welch algorithm. A detailed account of how the algorithm works for an ED-HMM and other HSMMs can be found in [13]. In contrast to the Baum-Welch algorithm, which utilizes probabilistic modeling of the alignments, another training algorithm that considers only the most probable alignment at any given training step may also be used. We henceforth refer to training with the latter algorithm as hard training, and training with the modified Baum-Welch algorithm as soft training.

Inference

Base calling consists of finding the most probable sequence of nucleobases (i.e. states) given samples of the nanopore current signal (i.e. observations). Given a trained ED-HMM, much like in the regular HMM, this sequence can be found using the Viterbi algorithm. A more detailed account of how the Viterbi algorithm works for the ED-HMM and other HSMMs can be found in [13].

Evaluation

After performing inference with any base caller, one commonly wants to compare the inferred sequence of nucleobases to the ground-truth reference sequence in order to evaluate the performance of the base calling model in question. Performing such a comparison between nucleobase sequences requires that the sequences are first aligned, so that each base is correctly compared to the corresponding base in the other sequence. This alignment is herein performed using a collection of publicly available sequence alignment software tools known as minimap2 [14]. Once the inferred sequence has been aligned with the reference sequence, performance is measured in a metric known as identity rate, which in turn can be split into an insertion rate, a deletion rate, and a substitution rate. The insertion, deletion, and substitution rates measure how many bases have been incorrectly inserted, deleted, and substituted in the inferred sequence, in relation to the reference sequence. They are respectively defined as

    insertion rate = ic / (mc + ic + dc + sc),    (2.2)
    deletion rate = dc / (mc + ic + dc + sc),    (2.3)
    substitution rate = sc / (mc + ic + dc + sc),    (2.4)

where mc = count(matches), ic = count(insertions), dc = count(deletions), and sc = count(substitutions). The identity rate is defined as

    identity rate = mc / (mc + ic + dc + sc)    (2.5)

and thus provides a measure of similarity between the inferred sequence and the reference sequence, such that the identity rate is 1 if the sequences are exactly identical, and 0 if they do not match in any base. An example of how the different metrics are computed when applied to two short nucleobase sequences is provided in Figure 2.5.
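As a worked illustration of (2.2)-(2.5), the sketch below computes the four rates from alignment counts. The counts in the example are hypothetical; in practice they would be obtained from the minimap2 alignment of an inferred read against the reference.

```python
def alignment_rates(mc, ic, dc, sc):
    """Compute identity, insertion, deletion, and substitution rates
    from match, insertion, deletion, and substitution counts."""
    total = mc + ic + dc + sc
    return {
        "identity": mc / total,
        "insertion": ic / total,
        "deletion": dc / total,
        "substitution": sc / total,
    }

# Hypothetical counts for a short aligned read.
print(alignment_rates(mc=870, ic=40, dc=50, sc=40))
# {'identity': 0.87, 'insertion': 0.04, 'deletion': 0.05, 'substitution': 0.04}
```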

Figure 2.5: The different performance metrics and their relation illustrated on two short example sequences.

2.4 Time series analysis

A time series is a sequence of data points that are listed, or otherwise referred to, in time order. All time series are discrete in time, and points are most commonly evenly spaced in time. Points in a time series may however take values from a continuous domain. Time series analysis is the practice of processing time series with the purpose of extracting useful statistics or other characteristics. Methods for time series analysis are commonly divided into two categories: time-domain methods, i.e. methods that operate directly on the time ordered samples, and frequency-domain methods, i.e. methods that instead operate on a time series in terms of its frequency content. Below, we selectively provide brief background on some relevant time series analysis methods in the time domain.

2.4.1 Normalization

Normalization, in the broad context of statistics and adjacent topics, commonly refers to the rescaling of a set of values into some common scale. Such a rescaling can be done in any number of ways depending on the application, but often seeks to transform values into a representation that simplifies comparisons between values, or other relevant operations. In many settings, the application of normalization also serves the simple purpose of eliminating constant factors or terms for notational simplicity. In the setting of machine learning it has been shown, for instance by [15] and [16], that applying appropriate normalization to the input data can play a critical role in attaining stable model training, and achieving optimal performance. Intuitively, the process of normalizing input data ensures that different dimensions, or features, of the input do not have vastly different scales, mitigating dominance of the particular features that happen to have a broader range of values. The importance of normalization can therefore vary vastly, depending on the type of features used in the input data. A variety of different techniques are commonly used for normalization of time series data. Given a time series x(t), t = 0, ..., T, a normalization method produces a corresponding normalized series x_norm(t), t = 0, ..., T. Below follows a brief review of some of the most common normalization methods. Most methods extract statistical metrics from the entirety of the series, and operate on each value of the series using these metrics. Note that any such method can also be applied in sliding windows of the series, thus normalizing segments of the series separately, based only on local properties.

In this thesis, the z-score and median absolute deviation (MAD) normalization methods specifically will be studied as candidate normalization methods for the input data in DNA base calling.

Min-max normalization

With min-max normalization, the data can be scaled to an arbitrary range [l, h] (e.g. [0, 1] or [−1, 1]) [17]. Each value x_norm(t) in the normalized series is computed as

    x_norm(t) = (x(t) − min_t x(t)) / (max_t x(t) − min_t x(t)) · (h − l) + l.    (2.6)

Decimal scaling normalization

Decimal scaling normalization moves the decimal point of values by scaling them in such a way that max_t |x_norm(t)| < 1 [17]. Specifically, each value x_norm(t) in the normalized series is computed as

    x_norm(t) = x(t) / 10^d    (2.7)

where d is the smallest integer such that max_t |x_norm(t)| < 1.

Median normalization

In median normalization, values are normalized by the median of the series [17]. This has the advantage of not being affected by the magnitude of extreme outliers. Each value x_norm(t) in the normalized series is computed simply as

    x_norm(t) = x(t) / median    (2.8)

where median is the median of x(t), t = 0, ..., T.

z-score normalization

In z-score or standard scale normalization, values are scaled to an interval centered on 0 by subtracting the sample mean, and dividing by the sample standard deviation [17]. Thus, each value x_norm(t) in the normalized series is computed as

    x_norm(t) = (x(t) − µ) / σ    (2.9)

where µ is the sample mean and σ is the sample standard deviation of x(t), t = 0, ..., T.

Median absolute deviation normalization

MAD normalization scales values similarly to z-score normalization, but using the median and MAD instead of the mean and standard deviation [18]. This has the advantage of being more resilient to outliers. Each value x_norm(t) in the normalized series is computed as

    x_norm(t) = (x(t) − median) / MAD    (2.10)

where median is the median of x(t), t = 0, ..., T, and MAD = median_t |x(t) − median|.

Sigmoid normalization

Sigmoid normalization is arguably the simplest form of normalization, as it does not account for any statistical properties of the time series, but simply maps all values to the interval [0, 1] using the sigmoid function [17]. Each value x_norm(t) in the normalized series is computed simply as

    x_norm(t) = 1 / (1 + e^{−x(t)}).    (2.11)
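The two methods studied later in this thesis, z-score normalization (2.9) and MAD normalization (2.10), are straightforward to implement. The sketch below is a minimal numpy version applied to a full read; the function names are our own and need not match the thesis' preprocessing module.

```python
import numpy as np

def zscore_normalize(x):
    """z-score normalization (2.9): subtract the sample mean, divide by the sample std."""
    x = np.asarray(x, dtype=float)
    return (x - np.mean(x)) / np.std(x)

def mad_normalize(x):
    """MAD normalization (2.10): subtract the median, divide by the median absolute deviation."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / mad

# Toy signal with one outlier: the MAD statistics are barely affected by the spike,
# while the mean and standard deviation used by z-score normalization are.
x = np.array([10.0, 11.0, 9.5, 10.5, 100.0, 10.2])
print(zscore_normalize(x))
print(mad_normalize(x))
```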

2.4.2 Filtering

In order to mitigate the effects of noise or other artefacts, a variety of filtering techniques can be applied to time series. Filters can commonly be defined in the frequency domain, so as to filter out artefacts of a known frequency using for instance a low-pass or a high-pass filter. Here, we instead choose to focus on filtering methods defined in the time domain. Two simple forms of such filtering methods are median and mean filtering.

In median filtering, the values in the filtered series are computed as the median of the raw series in a sliding window of fixed size. Specifically, if the size of the sliding window is w ∈ Z+, the values of the filtered series x_f can be expressed as

    x_f(t) = median({x(t') | 0 ≤ t' < w}),                                   0 ≤ t < ceil(w/2)
    x_f(t) = median({x(t') | T − w < t' ≤ T}),                               T − ceil(w/2) < t ≤ T
    x_f(t) = median({x(t') | t − floor(w/2) ≤ t' ≤ t + floor(w/2)}),         otherwise.

Here we use the convention that at the start (end), where the sliding window only partially overlaps with the series, the filtered series is set to be the median of the first (last) w values.

Analogously, a mean filter can be defined in an identical way, replacing only the median with the mean. The mean filter can be made more computationally efficient by utilizing a running sum to which values can be iteratively added and subtracted to represent the current window.
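The sketch below implements the median filter described above; the function name is our own, and the exact boundary indices are an approximation of the convention stated in the piecewise definition rather than a verbatim transcription of it.

```python
import numpy as np

def median_filter(x, w):
    """Sliding-window median filter with window size w.

    Positions where a centered window would not fit inside the series are set
    to the median of the first (last) w values, approximating the edge
    convention described in the text.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    half = w // 2
    out = np.empty(n)
    head = np.median(x[:w])   # median of the first w values
    tail = np.median(x[-w:])  # median of the last w values
    for t in range(n):
        if t - half < 0:
            out[t] = head
        elif t + half >= n:
            out[t] = tail
        else:
            out[t] = np.median(x[t - half:t + half + 1])
    return out

# A mean filter is obtained by replacing np.median with np.mean
# (or, more efficiently, by maintaining a running sum over the window).
```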

2.5 Quantization

Quantization is the process of mapping values in a large (often continuous) set into a discrete, countable set. In digital signal processing, analog signals, such as electrical currents, must be quantized in order to be processed digitally. Quantization is also a means of data compression, and inherently forms the basis of nearly all lossy compression algorithms. The application of quantization to multi-dimensional data is commonly referred to as vector quantization.

Intuitively, the process of defining a quantization scheme can be interpreted as finding a partitioning of the input space (usually R or R^n) into a number of regions. In the context of quantization, the regions that make up this partitioning of the input space are commonly referred to as quantization bins. By choosing a fixed but arbitrary ordering of the quantization bins, each bin can be identified by an integer index. To get the quantized representation of a point in the input space, one simply checks which bin the point resides in, and selects the corresponding index. In this setting, we will describe each quantizer as a function Q : R^n → {0, ..., N_q − 1}, mapping each input value to the index of one of the N_q quantization bins.

Below we provide a brief review of two relevant and common quantization and vector quantization methods. We also briefly review a method developed specifically for the purpose of quantizing multi-dimensional decision variables for classification tasks. In our experiments, all three methods will be evaluated as candidate quantization methods for the base calling task.

2.5.1 Uniform quantization

Arguably, uniform quantization is the simplest possible method for quantizing data. In this quantization method, the input space is partitioned into regions of uniform size, up to a set of specified maximum/minimum levels. In the scalar setting, given a choice of maximum level l_max ∈ R, minimum level l_min ∈ R, and a number of quantization bins N_q, a uniform quantizer quantizes a sample x as

    Q(x) = min( N_q − 1, round( N_q · max(0, x − l_min) / (l_max − l_min) ) )    (2.12)

where round denotes rounding to the nearest integer. Here the max and min operators ensure that any input value x < l_min, or x > l_max, is mapped to the bin corresponding to l_min and l_max respectively. While the above definition is only applicable in the scalar setting, it can trivially be extended to multi-dimensional input, where bins are created by uniformly partitioning along each dimension. For instance, a uniform vector quantizer in R^n with ∏_{i=1}^n N_q^(i) bins can be implemented by first applying a scalar quantizer separately to each of the n dimensions of the input, resulting in output that lies in S_q = {0, ..., N_q^(1) − 1} × ··· × {0, ..., N_q^(n) − 1}, where × denotes the Cartesian product. By then deciding an arbitrary ordering of all vectors in S_q, and assigning a bin index to each output sample according to this ordering, output samples in {0, ..., ∏_{i=1}^n N_q^(i) − 1} are obtained. This scheme results in hyperrectangle quantization bins, as illustrated in Figure 2.6.

Figure 2.6: Illustration of a uniform vector quantizer in R^2, with N_q^(1) = 3 and N_q^(2) = 6. The point x is quantized as the index of the rectangular quantization bin in which it resides.
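A scalar uniform quantizer following (2.12) can be written as a few lines of numpy. In the example below, the 101-bin count mirrors the value used later in the experiments, but the clipping levels l_min and l_max are arbitrary choices made for illustration and are not taken from the thesis.

```python
import numpy as np

def uniform_quantize(x, l_min, l_max, n_bins):
    """Scalar uniform quantizer as in (2.12): clip to [l_min, l_max] and map to
    one of n_bins integer bin indices in {0, ..., n_bins - 1}."""
    x = np.asarray(x, dtype=float)
    scaled = n_bins * np.maximum(0.0, x - l_min) / (l_max - l_min)
    return np.minimum(n_bins - 1, np.round(scaled)).astype(int)

# Example: quantize a normalized signal into 101 bins, assuming clipping levels [-5, 5].
x = np.array([-7.2, -1.3, 0.0, 2.4, 9.9])
print(uniform_quantize(x, l_min=-5.0, l_max=5.0, n_bins=101))
```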

2.5.2 k-means quantization

k-means is an unsupervised learning algorithm that finds a segmentation of a set of multidimensional data points into k ∈ Z+ clusters. In the resulting clusters, each data point belongs to the cluster for which the squared Euclidean distance between the point and the mean of the points in the cluster is the smallest. A formal description of the basic k-means algorithm can be found in Algorithm 1. Given some initialization of cluster means c_1, ..., c_k, the algorithm iteratively assigns points to corresponding sets C_1, ..., C_k, updates the cluster means as the mean of all points in the corresponding set, and repeats this process until the point assignments no longer change. The cluster means can in theory be initialized randomly; however, in practice more deliberate initialization, such as the scheme proposed by [19] in the so-called k-means++ algorithm, yields faster convergence and more accurate clusters.

Algorithm 1: The standard k-means algorithm.
Input: Number of desired clusters k ∈ Z+,
       Data points x_1, ..., x_N ∈ R^n,
       Initial cluster means c_1^(0), ..., c_k^(0) ∈ R^n.
Output: Learned cluster means c_1, ..., c_k and corresponding disjoint sets C_1, ..., C_k jointly containing the points x_1, ..., x_N.

while assignment step yields change in point assignments do
    Assignment step. Assign each point to the cluster with the closest cluster mean:

        C_i^(t) = {x_p : ||x_p − c_i^(t)||^2 ≤ ||x_p − c_j^(t)||^2 ∀ j ∈ {1, 2, ..., k}}

    Update step. Recompute the cluster means with the new point assignments:

        c_i^(t+1) = (1 / |C_i^(t)|) Σ_{x_p ∈ C_i^(t)} x_p
end

The k-means algorithm was first proposed as an approach to vector quantization. By running the k-means algorithm with k = N_q on some set of input samples, the set of cluster means m_1, ..., m_{N_q} is acquired. The quantization function of new inputs x can then be defined as

    Q(x) = ( argmin_i ||x − m_i||^2 ) − 1    (2.13)

where the subtraction of 1 has been added to comply with our convention that Q maps to values in {0, ..., N_q − 1}. This approach provides a partitioning of the input space in terms of so-called Voronoi regions. A partitioning of R^n into N_q Voronoi regions is defined by a set of N_q points. The Voronoi region of each point is the point itself, along with all points that are closer to the point than any of the other N_q − 1 points. This is exactly what (2.13) describes.
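In practice a k-means quantizer amounts to fitting a codebook of cluster means and then assigning each new sample the index of its nearest mean. The sketch below uses scikit-learn (one of the packages the preprocessing module relies on), but the exact usage here is our own assumption rather than the thesis' implementation; note that scikit-learn's labels are already 0-based, so the explicit "−1" in (2.13) is not needed, and its KMeans uses k-means++ initialization by default.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_kmeans_quantizer(samples, n_bins, seed=0):
    """Fit a k-means based vector quantizer: the cluster means act as the codebook."""
    km = KMeans(n_clusters=n_bins, n_init=10, random_state=seed)
    km.fit(samples)
    return km

def kmeans_quantize(km, x):
    """Quantize inputs as the index of the nearest cluster mean (its Voronoi region)."""
    return km.predict(x)

# Example with 1-D current samples reshaped to (N, 1); the bin count is illustrative.
samples = np.random.default_rng(0).normal(size=(10000, 1))
km = fit_kmeans_quantizer(samples, n_bins=101)
print(kmeans_quantize(km, np.array([[-1.0], [0.0], [2.5]])))
```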

2.5.3 Information loss optimized quantization

A specific setting for designing a vector quantizer is when one wishes to quantize data for a classification task. A method for optimizing the design of such a quantizer, to the classification task at hand, was proposed in [20]. The approach is rooted in the concepts of information theory, and specifically seeks a quantizer that maximizes the information about the target variable contained in the quantized representation of the data. In this context, finding an appropriate quantization scheme is framed as a supervised learning problem. The algorithm thus assumes the availability of a training set containing both samples of the quantity to be quantized and samples of the classification target variable. In the application of base calling, training sets consisting of aligned sequences of electrical current samples and a ground truth nucleobase sequence are used. A quantizer could then be trained on this data in a supervised manner by considering a set of current samples and their corresponding ground truth nucleobases.

Problem definition

Formally, [20] describes the problem as follows. Consider a multi-dimensional random variable X ∈ R^n, a scalar random variable Y ∈ Y ⊆ R, and a set {(X_i, Y_i)}_{i=1}^N of independent samples drawn from the joint distribution of (X, Y). A partitioning of R^n into N_q disjoint sets is sought, such that a random variable K ∈ {0, ..., N_q − 1}, giving the index of the set that X belongs to, is a so-called sufficient statistic of X for Y. K is said to be a sufficient statistic of X for Y, if and only if

    I(K; Y) = I(X; Y)    (2.14)

where I(X; Y) denotes the mutual information between X and Y, which is a symmetric construct from information theory that describes the amount of information obtained about a random variable by observing another. However, since the quantized representation K necessarily contains less information about Y than X does, the strict equality in (2.14) cannot be achieved in practice. Instead, one can seek to minimize the so-called information loss over all possible choices of K, which is defined to be

    L = I(X; Y) − I(K; Y)    (2.15)

i.e. the difference in information obtained about Y when observing X, and information obtained about Y when observing K. The approach outlined in [20] consists of performing this minimization under a set of necessary constraints on the choice of K.

Constraints

In order to permit quantization of samples outside of the training set (i.e. samples X_i for which the corresponding target Y_i is not known), certain constraints are added to the optimization problem described above. Much like in k-means quantization, the sets indexed by K are constrained to be defined by a set of cluster means {m_1, ..., m_{N_q}}, where each m_i represents the set M_i = {x ∈ R^n : ||x − m_i|| ≤ ||x − m_j||, ∀ j ∈ {1, 2, ..., N_q}}, i.e. the Voronoi region of m_i. Additionally, in order to make the optimization problem tractable, a relaxation of the information loss function is introduced, which allows a "soft" partition of the input space. Instead of strictly assigning a sample x to a cluster mean m_i, w_i(x) denotes the "weight" of the assignment of x to m_i, where the weights are such that Σ_{i=1}^{N_q} w_i(x) = 1. The weights are set according to a Gibbs distribution such that

    w_i(x) = exp(−β ||x − m_i||^2 / 2) / Σ_j exp(−β ||x − m_j||^2 / 2)    (2.16)

where β is a hyperparameter controlling the "softness" of the assignments. Smaller values of β correspond to softer assignments, and letting β approach infinity yields hard clustering.
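The Gibbs weights in (2.16) are simply a softmax over negative scaled squared distances. The sketch below computes them for one sample; the function name is our own, and the numerical stabilization step is an implementation detail not discussed in [20].

```python
import numpy as np

def soft_assignment_weights(x, means, beta):
    """Soft cluster assignment weights of a point x to the cluster means, as in (2.16).

    Larger beta gives harder assignments; letting beta grow without bound
    recovers nearest-mean (hard) clustering.
    """
    sq_dists = np.sum((means - x) ** 2, axis=1)  # ||x - m_i||^2 for every cluster mean
    logits = -beta * sq_dists / 2.0
    logits -= logits.max()                       # stabilize the softmax numerically
    w = np.exp(logits)
    return w / w.sum()

# Example: three cluster means in R^2 and a moderately soft assignment.
means = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
print(soft_assignment_weights(np.array([0.6, 0.4]), means, beta=2.0))
```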

Algorithm

Given the constraints, it can be shown that the empirical version of the information loss L in (2.15) on a set of samples {(X_i, Y_i)}_{i=1}^N can be expressed as

    L = Σ_{i=1}^N Σ_{k=1}^{N_q} w_k(X_i) D_KL(P_{X_i} || π_k)    (2.17)

where D_KL(p || q) = Σ_x p(x) log(p(x) / q(x)) is the so-called Kullback-Leibler divergence between probability distributions p and q, P_{X_i} denotes an empirical estimate of the distribution P(Y | X = X_i), and π_k an empirical estimate of the target distribution P(Y | X ∈ M_k). Further, it can be shown that a local minimum of L can be found through an EM style algorithm described in Algorithm 2. The cluster means and target distributions are iteratively updated until the loss has converged.

It is noted by [20] that good convergence has been observed when simply initializing cluster means with a k-means clustering, and then initializing the distributions with the point mass P_{X_i} = δ_{Y_i}, and averaging rule π_k = (1 / |M_k|) Σ_{X_i ∈ M_k} P_{X_i}, respectively.

In [20], the described algorithm and initialization are evaluated on a bag-of-features image classification task. It is shown that quantizing input data using the information-loss quantization algorithm yields better performance than k-means quantization for both a Naive Bayes (NB) model and a Support Vector Machine (SVM) model on the studied task [20].

Algorithm 2: Information loss optimization algorithm [20]
Input: Number of desired clusters N_q ∈ Z+,
       Data points X_1, ..., X_N ∈ R^n,
       Targets Y_1, ..., Y_N ∈ Y ⊆ R,
       Estimates of P_{X_i}, ∀ i = 1, ..., N,
       Initial estimates of π_k, ∀ k = 1, ..., N_q,
       Initial cluster means m_1^(0), ..., m_{N_q}^(0) ∈ R^n,
       Assignment softness parameter β ≥ 0,
       Learning rate α > 0.
Output: Learned cluster means m_1, ..., m_{N_q} and corresponding target distribution estimates π_k = P(Y | K = k), ∀ k = 1, ..., N_q.

while not converged do
    Update cluster means:

        m_k^(t+1) = m_k^(t) − α Σ_{i=1}^N Σ_{j=1}^{N_q} D_KL(P_{X_i} || π_j^(t)) · ∂w_j^(t)(X_i) / ∂m_k^(t)

    where

        ∂w_j^(t)(X_i) / ∂m_k^(t) = β [δ_{jk} w_k(X_i) − w_k(X_i) w_j(X_i)] (X_i − m_k^(t))

    Update cluster target distributions:

        π_k^(t+1)(y) = Σ_{i=1}^N w_k^(t+1)(X_i) P_{X_i}(y) / Σ_{y'} Σ_{i=1}^N w_k^(t+1)(X_i) P_{X_i}(y'),    ∀ y ∈ Y
end

Chapter 3

Experiments and Results

This chapter will provide a detailed description of each of the performed preprocessing experiments. After the description of the method used in each experiment, results in terms of base calling performance given the studied preprocessing scheme will be presented. To simplify reading, all result tables are placed at the end of the chapter. We begin with a review of the experiment setting, including the dataset, model implementation, and environment used.

3.1 Experiment setting

3.1.1 Dataset

Background

The data used in the study was an existing dataset acquired in a collaboration between KTH Royal Institute of Technology and SciLifeLab, an institution for the advancement of molecular biosciences in Stockholm, Sweden. Researchers at SciLifeLab sequenced the DNA of an E. coli bacterium using the MinION sequencer, developed by Oxford Nanopore Technologies (ONT). They used a second sequencer by Illumina [21] to acquire a reference genome of the same organism. The Illumina sequencer utilizes a fluorescence based sequencing technique which is slower but more accurate. Researchers at SciLifeLab found an alignment between the reference genome and the acquired current signal reads by first base calling each read with ONT's base caller Guppy, currently the top-performing publicly available base caller. Researchers at KTH then used the software tool Taiyaki [22] by ONT to get an estimated rough alignment between current signal samples and bases in the reference.


Training, validation, and test split

The acquired dataset in its entirety consists of roughly 500 000 signal reads, of varying length, organized sequentially in files of about 4 000 reads each. Each read consists of three time series: signal (containing the electrical current signal), reference (containing the ground-truth nucleobase sequence), and ref_to_signal (containing the rough alignment between the signal and the reference sequence). From the full dataset, three subsets were selected by the author to be used as training, validation, and test sets respectively for the preprocessing experiments. An observed pattern in the data, where the mean translocation speed of the DNA strand drifted unexpectedly over time, put restrictions on what data could be used for the experiments. In order to not invalidate the assumption of the ED-HMM model that the mean translocation speed is relatively stable, only the first 10 files (of roughly 40 000 reads in total) were used for the preprocessing experiments. A more detailed description of the unexpected observed pattern and its implications is provided in Appendix B. For the 10 chosen files, the following split into validation, test, and training sets was made:

• Validation set: File #1 (roughly 4 000 reads).

• Test set: Files #2 - #5 (roughly 16 000 reads).

• Small training set: File #6 (roughly 4 000 reads).

• Large training set: Files #6 - #10 (roughly 20 000 reads).

Training sets of two different sizes were used due to uncertainty in the amount of training data needed when adding feature extraction preprocessing techniques that increase the dimensionality of the input data. Since past experiments by KTH researchers who developed the ED-HMM base caller have shown that a training set of a single file is sufficient in a standard model configuration, the small training set was used by default, with the larger training set applied only when it was suspected to improve performance.

3.1.2 Implementation and environment

For all experiments, an existing implementation of the ED-HMM model described in section 2.3.4 was used. The model implementation is written in C and supports training and inference on a GPU. For all experiments, training and inference with the model were run on an NVIDIA Tesla V100 DGXS 32 GB graphics card.

A preprocessing module was built by the author, separate from the base caller. The preprocessing module was implemented in Python 3.5 using various popular scientific computing packages (e.g. numpy, matplotlib, and scikit-learn).

3.1.3 Model configurations

Two different versions of the ED-HMM base caller were used: one utilizing a 5-mer state representation and one utilizing a 7-mer state representation. While the 7-mer version of the model provides significantly better performance compared to the 5-mer version, the 5-mer version is still of interest due to its high base calling speed. The same settings and initialization scheme were used for both versions of the model, differing only in the dimensions. Details on the initialization scheme and hyperparameters used can be found in Appendix A.

The small training set and the validation set were used to find a training iteration configuration with good convergence. For both sets, all reads were z-score normalized, using the length of the full signal as window size. A number of training configurations were run on the small training set, and performance was evaluated on the validation set. It was found that, without additional features, 7 iterations of soft training followed by 5 iterations of hard training yielded the optimal results. With added features, better results were achieved by running 20 iterations of soft training, followed again by 5 iterations of hard training. These two training iteration configurations were therefore used for all preprocessing experiments, with and without additional features, respectively.

3.2 Preprocessing experiments

Four different aspects of the preprocessing of current signals were studied: normalization, filtering, feature extraction, and quantization. This section will provide detailed descriptions of the methods applied, along with results in terms of the base caller's performance on the test set for different preprocessing configurations. Base calling performance is reported in tables containing mean identity rate on the test set, along with mean insertion rate, mean deletion rate, and mean substitution rate. We mark the best value achieved, for all metrics, in bold in each table.

In addition to the mean performance rates, the base calling software allows generation of histograms showing the identity rate (and/or insertion, deletion, and substitution rate) on each read of the test set separately. As these histograms look very similar for all experiment results (apart from a shift in mean), we omit them from the main text for the sake of brevity. For examples of these histograms that provide metrics for the base calling performance on a per-read basis, refer to Appendix C.

3.2.1 Normalization

The electrical current signal produced by a nanopore is known to be strongly affected by noise, and to therefore contain sudden large spikes in amplitude, as seen in Figure 3.1. Normalization techniques that are more sensitive to large outliers, such as min-max normalization, decimal scaling normalization, and sigmoid normalization, were thus immediately rejected and not experimented with. As stated in section 2.2.4, previous work indicates that MAD normalization and z-score normalization are likely the most appropriate techniques for nanopore signals. These methods were therefore selected as the two normalization method candidates. For a background description of these, and other, normalization methods, refer to section 2.4.1.

Figure 3.1: Example segment of a raw nanopore current signal from the dataset used. Sudden spikes in amplitude are common throughout the signals.

The normalization experiments then consisted of normalizing each read of the datasets using z-score normalization and MAD normalization, and comparing the base calling performance for the two methods. While the default approach, used in previous work, is to normalize each full read at once (i.e. with the full signal length as window size), the effect of normalizing reads in sliding windows of varying sizes was also evaluated for both methods. The lengths of the signals vary greatly in the dataset used, as seen in Table 3.1. Given the statistics in Table 3.1, a range of 9 different sliding window sizes in the interval [1024, 131072] was selected for evaluation, along with full signal length normalization. All normalization experiments were performed using the small training set as described in section 3.1.1. However, an additional experiment utilizing the large training set was performed, using full signal length z-score normalization, to confirm that training on the small training set does not lead to significant overfitting, or otherwise compromise the performance on the test set. In all normalization experiments, the signals were uniformly quantized into Nq = 101 bins after normalization was applied. Further details on the quantization methodology can be found in section 3.2.4.
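For reference, a minimal sketch of the two candidate normalization methods applied in sliding windows is given below. The exact window handling used in the preprocessing module is not specified here, so non-overlapping blocks and a small guard against division by zero are assumptions of this sketch.

```python
import numpy as np


def zscore_normalize(signal: np.ndarray, window: int) -> np.ndarray:
    # Z-score normalize each block of `window` samples independently:
    # subtract the block mean and divide by the block standard deviation.
    out = np.empty_like(signal, dtype=float)
    for start in range(0, len(signal), window):
        block = signal[start:start + window]
        out[start:start + window] = (block - block.mean()) / (block.std() + 1e-12)
    return out


def mad_normalize(signal: np.ndarray, window: int) -> np.ndarray:
    # MAD normalization: subtract the block median and divide by the median
    # absolute deviation, which is less sensitive to large amplitude spikes.
    # No scaling constant is applied to the MAD in this sketch.
    out = np.empty_like(signal, dtype=float)
    for start in range(0, len(signal), window):
        block = signal[start:start + window]
        med = np.median(block)
        mad = np.median(np.abs(block - med)) + 1e-12
        out[start:start + window] = (block - med) / mad
    return out
```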

                mean     stddev   min     max
signal length   75 877   56 521   2 256   481 982

Table 3.1: Metrics describing the distribution of signal lengths in the validation set file. Signal lengths were similarly distributed in all files used.

Tables 3.3 and 3.4 show the resulting performance metrics on the test set, for the 5-mer and 7-mer base calling models respectively, when normalizing data with MAD normalization and varying the sliding window size. Tables 3.5 and 3.6 contain the corresponding results for z-score normalization. Tables 3.7 and 3.8 contain a summary of the results, along with performance when training on the large training set, for the 5-mer and 7-mer model respectively.

We see in Tables 3.3 and 3.4 that a larger window size generally yields better results when using MAD normalization. However, full signal length normalization was suboptimal, and the highest identity rate was achieved with a window size of 98034, for both the 5-mer and 7-mer models. For z-score normalization, full signal length normalization was not optimal either, as seen in Tables 3.5 and 3.6. Here, a window size of 32768 yielded the highest identity rate, again for both the 5-mer and 7-mer models. The fact that the lengths of the signals vary greatly, as seen in Table 3.1, makes it difficult to analyze what an optimal normalization window size would be. However, it is clear that a window size larger than the mean signal length was preferable for MAD normalization, while for z-score normalization a window shorter than the mean signal length yielded the best results. Intuitively, this conclusion aligns well with what we know about the two normalization methods. Since MAD normalization computes its normalization parameters from the median of samples in the signal window, the inclusion of more outliers has little effect on the resulting parameters. In z-score normalization, on the other hand, the normalization parameters are computed from the mean of samples in the signal window, meaning the inclusion of additional outliers can skew the resulting parameters severely. Therefore, the fact that MAD normalization works better with a larger window size than z-score normalization is likely due to its higher resilience to outliers.

However, a closer comparison between the performance of the normalization methods, as seen in Tables 3.7 and 3.8, shows that z-score normalization consistently outperforms MAD normalization, both for full signal length normalization and when using the optimal window size for each of the methods. The fact that z-score normalization consistently performs better than MAD normalization seems to indicate that the influence of outliers in the signals is not as significant as feared. Since normalization using the mean and standard deviation works well, it could be the case that large outliers are few enough, or small enough, to not have a significant impact on the normalization parameters.

Comparing the best performing window sizes for normalization to the distribution of signal lengths in Table 3.1, it seems that a trade-off between near-global features and more local features yields the best results when normalizing the signals. A probable explanation for this result is that the signal windows need to be large enough to capture a broad enough range of amplitudes to accurately represent the global distribution of amplitudes, but small enough to handle a slow drift in mean current amplitude, which is known to occur in nanopore signals.

We see in Tables 3.7 and 3.8 that using a larger training set did not improve performance in the case of the 5-mer model, but in fact made it slightly worse. However, for the 7-mer model, using a larger training set yielded a marginal improvement in base calling performance. The difference in performance between using the small and large training sets is deemed small enough that using the small training set is sufficient to evaluate different preprocessing methods, given that the preprocessing methods do not add significant additional complexity or dimensionality to the data or to the model. A probable explanation for the drop in performance when training the 5-mer model on the larger training set is that the model overfits when trained on this larger set. The model is more likely to overfit to the larger training set only under the assumption that the data in the added files is distributed very similarly to the data in the smaller training set. However, this is likely the case, since DNA strands are randomly chopped and fed into the sequencer in a random order. This reasoning could also explain why performance did not drop for the 7-mer model. Since the 7-mer model is significantly more complex than the 5-mer model (with 4^7 possible states compared to the 4^5 of the 5-mer model), it is far less likely to overfit. Note, however, that it has not been confirmed, for instance by comparing performance on the test set to that on the training set, whether the trained models have in fact overfitted. Further experiments would thus be needed to draw convincing conclusions on this matter.

3.2.2 Filtering

Since the electrical current signals from a nanopore are subject to a significant amount of noise, it is conceivable that filtering of the signals may improve the base calling performance. As the underlying base sequence is encoded in the signal amplitude, and the frequency content of the noise is largely unknown, filters acting in the frequency domain were not considered. Instead, experiments were performed with simple median and mean filters, as described in section 2.4.2.

Mean and median filtering with the three window sizes 3, 5, and 7 were evaluated. As the current signal contains abrupt fluctuations in amplitude, only small window sizes were considered. It was reasoned that the signal would otherwise be excessively smoothed, leading to a loss of information about the underlying base sequence. Additionally, the mean number of current samples per k-mer was found to be roughly 9.5 for the files used, as shown in Appendix B. The window sizes were therefore also chosen so as to not exceed this number. Before applying each filter, the reads were z-score normalized using window size 32768, as this was the normalization technique that yielded the best base calling performance, see section 3.2.1. After being normalized and filtered, the signals were uniformly quantized into Nq = 101 bins.

Tables 3.9 and 3.10 show the resulting performance metrics on the test set, for the 5-mer and 7-mer base calling models respectively, when applying median and mean filtering with different window sizes to the input data. We see clearly, from both Table 3.9 and Table 3.10, that the filtering does not have a positive effect on base calling performance. For both the 5-mer and the 7-mer models, better performance was achieved when using no filtering at all, compared to all tested filter configurations. It is clear, however, that applying the filters in shorter windows is favorable over longer windows, for both the median and mean filter. A likely explanation for the drop in performance when applying this type of filtering in general is that even for the shortest evaluated window size, the filtering leads to excessive smoothing, as shown in Figure 3.2. It is probable that replacing the signal samples with the median or mean, even in a small proximity, obscures abrupt changes in signal amplitude that could for instance be indicative of a shift to a new k-mer in the nanopore.
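For reference, the following is a minimal sketch of the two filters evaluated. The thesis does not state whether the preprocessing module used library routines or hand-written loops; scipy.signal.medfilt and a moving-average convolution are used here as reasonable stand-ins.

```python
import numpy as np
from scipy.signal import medfilt


def median_filter(signal: np.ndarray, w: int) -> np.ndarray:
    # Replace each sample by the median of a length-w window centred on it.
    # medfilt requires an odd kernel size, which holds for w in {3, 5, 7}.
    return medfilt(signal, kernel_size=w)


def mean_filter(signal: np.ndarray, w: int) -> np.ndarray:
    # Replace each sample by the mean of a length-w window centred on it.
    kernel = np.ones(w) / w
    return np.convolve(signal, kernel, mode="same")
```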

Figure 3.2: Example of a signal segment in its plain form, compared to the resulting signal when taking the mean or median in windows of size w = 3.

3.2.3 Feature extraction

While a quantized representation of the raw electrical current signal is typically used directly as observations in the ED-HMM base calling model, there are other possible ways to represent the observed signals. A potential means for improving base calling performance is therefore to consider other such representations. Specifically, since the model allows for multidimensional observations, multiple features could potentially be extracted from the signals and stacked in vectors, so that each observation is a point in Z^nf rather than a scalar in Z, where nf is the number of features.

In addition to the raw current signal, several features were considered that could potentially contain more explicit information regarding the underlying base sequence. Firstly, while median or mean filtering of the signals did not yield an improvement in base calling performance, the possibility was considered that adding the median or mean in sliding windows of the signal as an additional feature in the observations may still be beneficial. Secondly, features that capture variance or dispersion in the signal amplitude were considered. As sudden fluctuations in the current signal often indicate the transition from one k-mer to another, it was reasoned that representing these fluctuations more explicitly may convey more information about the underlying base sequence to the base calling model. One dispersion based feature considered is the sample standard deviation $\bar{s}$ in sliding windows of size $w$ of the signal, computed as

$$
\bar{s}(t) =
\begin{cases}
\mathrm{stddev}(\{x(t') \mid 0 \le t' < w\}), & 0 \le t < \lceil \tfrac{w}{2} \rceil \\
\mathrm{stddev}(\{x(t') \mid T - w < t' \le T\}), & T - \lceil \tfrac{w}{2} \rceil < t \le T \\
\mathrm{stddev}(\{x(t') \mid t - \lfloor \tfrac{w}{2} \rfloor \le t' \le t + \lfloor \tfrac{w}{2} \rfloor\}), & \text{otherwise}
\end{cases}
$$

where $x$ is the electrical current signal of length $T$, and $\mathrm{stddev}(X)$ denotes the sample standard deviation of a set $X$, such that

$$
\mathrm{stddev}(X) = \sqrt{\frac{1}{|X|} \sum_{x \in X} (x - \mu_X)^2} \qquad (3.1)
$$

where $\mu_X = \frac{1}{|X|}\sum_{x \in X} x$. A second dispersion based feature considered is the difference $\bar{d}$ between adjacent samples of the current signal, such that

$$
\bar{d}(t) =
\begin{cases}
x(t) - x(t+1), & t = 0, \ldots, T-1 \\
x(T-1) - x(T), & t = T
\end{cases}
\qquad (3.2)
$$

where $x$ is the electrical current signal of length $T$. Note that we henceforth refer collectively to the standard deviation and difference features as dispersion based features, and to the mean and median features as average based features.

The feature extraction experiments proceeded as follows. For each of the four features (mean, median, standard deviation, and difference), reads were first normalized using z-score normalization with window size 32768. The feature in question was then extracted and each sample was stacked in a vector together with the plain signal sample, [signal(t), feature(t)]^T. These vectors were then uniformly quantized into Nq,1 = 101 bins along the first dimension, and Nq,2 bins along the second, where Nq,2 ∈ {21, 51, 101}, resulting in a quantized representation with a total number of bins Nq = Nq,1 Nq,2 ∈ {2121, 5151, 10201}. The mean, median, and standard deviation features were extracted from sliding windows with window size w = 5. In all feature extraction experiments, the models were trained with 20 soft iterations, followed by 5 hard iterations. Each model was trained on the small training set. In an additional experiment, a model was trained on the large training set in order to investigate whether the addition of extra features in the observations implied a need for more training data.

Tables 3.11 and 3.12 show the resulting performance metrics on the test set, for the 5-mer and 7-mer base calling models respectively, when adding different extracted features to the data. Tables 3.13 and 3.14 contain a summary of the best results for each feature, along with performance when training on the small versus the large training set with the difference feature quantized to Nq,2 = 51 bins, for the 5-mer and 7-mer model respectively.

We see in Tables 3.11 and 3.12 that the addition of the median and the mean feature both lead to worse base calling performance, no matter the number of quantization bins. However, the addition of the difference feature yields a significant boost in performance, both for the 5-mer and the 7-mer model. Interestingly, for the 5-mer model the largest boost in performance is achieved when quantizing the difference feature to Nq,2 = 101 bins, while for the 7-mer model the best performance is achieved for Nq,2 = 51. Another notable difference between the 5-mer and 7-mer models is that for the 5-mer model, the addition of the standard deviation feature yields a similar boost in performance as the difference feature, while for the 7-mer model adding the standard deviation feature leads to a decline in performance. As this decline in performance was unexpected, further experiments were made to verify that the drop in performance was not caused by poor convergence, or by the use of an excessively large number of quantization bins. The details of these experiments are provided in Appendix D. It was concluded that neither the convergence nor the quantization appears to be the cause of the performance drop. There are other potential hypotheses as to why the addition of the standard deviation would lead to a decline in performance for the 7-mer model. For instance, it could be that different window sizes w for the extraction of the standard deviation work better depending on the choice of k-mer. Further work would however be needed to draw reliable conclusions.

Even with the performance drop seen when adding the standard deviation feature in the 7-mer model, it is clear that, in general, the dispersion based features are favorable over the average based features. As previously described, this is not entirely unexpected, since abrupt changes in the amplitude of the current signal are common, and can for instance be indicative of a shift from one k-mer to another in the nanopore. Given the poor performance seen in the filtering experiments, described in section 3.2.2, the decline in performance seen when adding the average based features is not entirely unexpected either. Further, one can conclude from the results in Tables 3.13 and 3.14 that using a larger training set when adding the additional difference feature does not lead to an improvement in base calling performance, but rather results in marginally worse performance. As previously discussed in section 3.2.1, this may be a result of overfitting, but further experimentation would be needed to draw convincing conclusions on the matter.
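As a reference for the above, the following is a minimal sketch of the two dispersion based features and of how a feature is stacked with the plain signal into two-dimensional observations. The edge handling of the windowed standard deviation follows the piecewise definition above only approximately (window indices are simply clipped to the signal boundaries), and the quantization step is omitted.

```python
import numpy as np


def windowed_std(x: np.ndarray, w: int = 5) -> np.ndarray:
    # Sample standard deviation in a window of size w around each sample.
    half = w // 2
    out = np.empty_like(x, dtype=float)
    for t in range(len(x)):
        lo, hi = max(0, t - half), min(len(x), t + half + 1)
        out[t] = x[lo:hi].std()
    return out


def adjacent_difference(x: np.ndarray) -> np.ndarray:
    # Difference between adjacent samples, x(t) - x(t+1), with the last
    # difference repeated at the final position.
    d = np.empty_like(x, dtype=float)
    d[:-1] = x[:-1] - x[1:]
    d[-1] = x[-2] - x[-1]
    return d


def stack_observations(signal: np.ndarray, feature: np.ndarray) -> np.ndarray:
    # Each observation becomes a vector [signal(t), feature(t)]^T,
    # i.e. an array of shape (T, 2).
    return np.stack([signal, feature], axis=1)
```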

3.2.4 Quantization

Another aspect of the preprocessing that could potentially have a substantial effect on base calling performance is the question of how to optimally quantize the raw signal into a representation that can be handled by the ED-HMM model. To study this aspect, two sets of experiments were conducted: one for the case of a scalar observation representation (i.e. with only the plain signal samples as observations), and one for the case of multi-dimensional observations (i.e. with additional features added). In both cases, three different quantization methods were evaluated: uniform quantization, k-means quantization, and the information loss optimized quantization proposed in [20]. For a background description of these methods, refer to section 2.5.

Hyperparameters

A number of different hyperparameters exist in the three evaluated quantization methods. Here we describe how the values for these hyperparameters were selected for our experiments.

Scalar uniform quantization For the default uniform quantization of the z-score normalized scalar signal, the minimum level lmin and maximum level lmax were selected by plotting a histogram of signal levels (rounded to the closest integer) for a random subset of reads in a signal file, and observing in what interval most signal samples lie. The resulting histogram is provided in Figure 3.3. Given that most samples lie in the interval [−5.0, 5.0], the minimum and maximum levels were selected as lmin = −5.0 and lmax = 5.0, respectively.
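A minimal sketch of this histogram inspection is given below, assuming reads is a list of z-score normalized signals stored as numpy arrays (the function name and exact plotting calls are illustrative, not taken from the thesis code).

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_level_histogram(reads):
    # Round each normalized sample to the closest integer and count how
    # often each rounded level occurs across the given reads.
    levels = np.concatenate([np.round(r).astype(int) for r in reads])
    values, counts = np.unique(levels, return_counts=True)
    plt.bar(values, counts)
    plt.xlabel("rounded signal level")
    plt.ylabel("number of samples")
    plt.show()
```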

Figure 3.3: Histogram of rounded signal levels for the plain, z-score normalized signal. We see that most samples lie in the interval [−5.0, 5.0].

A number of bins Nq was selected by inspecting the resulting quantized signal for a number of iteratively larger values of Nq and seeing at what point the quantized signal has no discernible visual differences from the raw signal. It was found that this occurred for Nq = 101. However, to investigate whether a smaller or larger number of bins could be better suited, experiments were made with Nq ∈ {21, 51, 101, 201}.
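A minimal sketch of the uniform quantizer follows; clipping samples to [lmin, lmax] before binning is an assumption of this sketch, as the exact edge handling of the thesis implementation is not described here.

```python
import numpy as np


def uniform_quantize(x: np.ndarray, n_bins: int = 101,
                     lmin: float = -5.0, lmax: float = 5.0) -> np.ndarray:
    # Clip to [lmin, lmax], rescale to [0, 1], and map to integer bin
    # indices in {0, ..., n_bins - 1} of equal width.
    clipped = np.clip(x, lmin, lmax)
    scaled = (clipped - lmin) / (lmax - lmin)
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)
```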

Vector uniform quantization In the vector quantization setting, where a vector containing the plain signal as well as an additional feature is to be quantized, the same parameters were used for the plain signal dimension, i.e. Nq,1 = 101, lmin,1 = −5.0, and lmax,1 = 5.0. For the additional feature, the same procedure as in the scalar case was performed to find appropriate values of Nq,2, lmin,2, and lmax,2. It was found that Nq,2 = 101 was a reasonable value for all features, and therefore this value was tested for all features, along with Nq,2 = 51 and Nq,2 = 21, which gave some quantization artefacts. Figure 3.4 provides an example of a segment of the difference feature in its raw form, and quantized to the three different values of Nq,2. We see that the feature quantized to 101 bins is nearly identical to the raw feature, while the feature quantized to 21 or 51 bins contains some artefacts. For each of the features, the same histogram approach as for the plain signal was utilized to find appropriate values of lmin,2 and lmax,2. The values used for each feature are provided in Table 3.2.

Figure 3.4: The difference feature (i.e. the difference between adjacent signal samples) in its raw form, and quantized to 21, 51, and 101 bins.

feature      lmin,2   lmax,2
median       -4.0     4.0
mean         -5.0     5.0
difference   -5.0     5.0
stddev        0.0     4.0

Table 3.2: The minimum and maximum levels used for uniform quantization of the different features.

k-means quantization For the k-means quantization, the scikit-learn implementation of mini-batch k-means clustering was used [23]. The same numbers of bins were used in the scalar and vector settings as in the scalar uniform quantization and vector uniform quantization, respectively. Each k-means model was trained on the small training set, on batches of 100 000 samples at a time. Every sample in the small training set was used, and each batch was only seen once during training.
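A minimal sketch of this k-means quantizer, built on the scikit-learn MiniBatchKMeans implementation referenced above, is given below. Training options other than the number of clusters and the batch size are not stated in the thesis, so library defaults are assumed.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def fit_kmeans_quantizer(samples: np.ndarray, n_bins: int) -> MiniBatchKMeans:
    # samples has shape (N, d): d = 1 in the scalar case, d = 2 with an
    # added feature. Each batch of 100 000 samples is seen exactly once.
    km = MiniBatchKMeans(n_clusters=n_bins, batch_size=100_000)
    for start in range(0, len(samples), 100_000):
        km.partial_fit(samples[start:start + 100_000])
    return km


def kmeans_quantize(km: MiniBatchKMeans, samples: np.ndarray) -> np.ndarray:
    # Each sample is quantized to the index of its nearest cluster centre.
    return km.predict(samples)
```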

Information loss optimized quantization For the information loss optimized quantization, a custom implementation of Algorithm 2, as described in section 2.5, was used. The same numbers of bins were used in the scalar and vector settings as in the scalar uniform quantization and vector uniform quantization, respectively. The quantization loss model was trained for 30 iterations, and fed a different batch of 10 000 samples from the small training set on each iteration. The soft cluster assignment parameter β was set by grid search on the values {0.01, 0.1, 1, 10, 100}, and the value β = 0.1, which yielded the best base calling performance on the validation set, was selected.

Scalar quantization For the scalar quantization experiments, the reads were first normalized using z-score normalization with window size 32768. Signal samples were then quantized into Nq bins, where Nq ∈ {21, 51, 101, 201}, for each of the three quantization methods.

Tables 3.15 and 3.16 show the resulting performance metrics on the test set, for the 5-mer and 7-mer base calling models respectively, when quantizing the current signal with different quantization methods and a varying number of quantization bins. We see in both Table 3.15 and Table 3.16 that k-means quantization and information loss optimized quantization outperform uniform quantization in terms of resulting base calling performance for any choice of the number of quantization bins. However, we also note that while the gain in performance from using k-means quantization over simple uniform quantization is significant when the number of bins is small (e.g. for Nq = 21), the difference is nearly negligible for a large number of bins (e.g. for Nq = 201). The use of k-means quantization adds significant computational complexity at inference time, as the closest cluster mean must be found for each sample we want to quantize. Therefore, using k-means for quantization may not be justified if uniform quantization with a larger number of bins can yield approximately the same level of performance, which the results indicate is the case.

The difference in performance between k-means quantization and information loss optimized quantization is largely non-existent. The application of the information loss optimization only occasionally results in a fraction of a percentage point increase in the identity rate. Since this optimization adds additional computational overhead compared to k-means (as the information loss optimization algorithm is initialized from the k-means quantization), it is concluded that in the scalar case, using the information loss optimized quantization is not worth the effort. However, it should be noted that in the original algorithm, proposed by [20], each training update is performed using all of the data in the training set. In our experiments, each update is done using a random batch of training data, since the full training set is too large to be used in its entirety for each update. How well the information loss optimized quantization algorithm performs with this methodology is previously unexplored, and it could therefore be the case that the algorithm is not as effective when applied in this manner. The choice of hyperparameters is another factor that could not be mimicked exactly from [20], and thus this could potentially also explain the poor performance of the algorithm.

Vector quantization The two most promising features from the experiments on feature extraction, described in section 3.2.3, were selected for further experimentation on quantization when additional features are present. The two evaluated features were the difference between adjacent signal samples, and the standard deviation in sliding windows of the signal. The vector quantization experiments proceeded as follows. For each of the two features, reads were first normalized using z-score normalization with window size 32768. For the uniform quantizer, the signal and features were then quantized using the exact same methodology as in the feature extraction experiments, described in section 3.2.3. For the other quantization methods (k-means and information loss quantization), the equivalent total numbers of bins Nq ∈ {2121, 5151, 10201} were used.

Tables 3.17 and 3.18 show the resulting performance metrics on the test set, for the 5-mer and 7-mer base calling models respectively, when quantizing the signals with the added difference feature, with different quantization methods and a varying number of quantization bins. Tables 3.19 and 3.20 show the corresponding results for the standard deviation feature. We see in Tables 3.17-3.20 that k-means quantization and information loss optimized quantization provide some improvement in performance over uniform quantization when the number of quantization bins is sufficiently small (i.e. Nq = 2121). However, for a larger number of bins, uniform quantization performs significantly better. Since the information loss quantization algorithm, as well as k-means, aims to find an optimal set of quantization bins according to some metric, one might expect these methods to perform at least as well as a uniform quantizer. There could be several reasons why this is not what we observe in the results. The most likely explanation is that for the cases where Nq > 2121, the number of bins becomes too large for the k-means algorithm and the information loss quantization algorithm to handle. Further, the training configurations of both the k-means and information loss quantizers have not been tuned rigorously. Therefore, the possibility cannot be ruled out that the tested quantizers could be undertrained or overtrained, or that a different choice of hyperparameters would yield different results.

3.2.5 Best performing preprocessing configuration

In summary, we provide the performance of the best performing preprocessing configuration, compared to the performance of the previous default approach, for the 5-mer and 7-mer model respectively, in Tables 3.21 and 3.22. We see that the best configurations found yield a 2.59 percentage point increase and a 2.22 percentage point increase, for the 5-mer and 7-mer model respectively, compared to only applying full signal MAD normalization. For the 5-mer model, the addition of the standard deviation feature provided a significant boost in performance, while for the 7-mer model, the addition of the difference feature yielded a larger improvement. For both models, applying z-score normalization in windows of size w = 32768 yielded the best performance. For the 5-mer model, the best performance was achieved with Nq = 10201 bins, with uniform quantization. For the 7-mer model, the optimal number of bins was Nq = 5151, once again with uniform quantization.

window size   identity   insertion   deletion   substitution
full signal   76.33%     5.53%       9.47%      8.67%
131072        76.41%     5.51%       9.45%      8.63%
98034         76.44%     5.49%       9.46%      8.61%
65536         76.43%     5.47%       9.48%      8.61%
32768         76.27%     5.47%       9.55%      8.71%
16384         75.79%     5.46%       9.78%      8.97%
8192          74.87%     5.52%       10.15%     9.46%
4096          73.39%     5.63%       10.73%     10.25%
2048          71.36%     5.83%       11.47%     11.34%
1024          69.04%     6.06%       12.24%     12.66%

Table 3.3: Base calling performance of 5-mer model when using MAD normalization in varying window sizes.

window size   identity   insertion   deletion   substitution
full signal   80.25%     5.12%       7.47%      7.16%
131072        80.36%     5.09%       7.45%      7.09%
98034         80.41%     5.07%       7.45%      7.07%
65536         80.40%     5.06%       7.46%      7.08%
32768         80.28%     5.03%       7.53%      7.16%
16384         79.64%     5.08%       7.80%      7.48%
8192          78.46%     5.19%       8.25%      8.09%
4096          76.56%     5.39%       8.97%      9.09%
2048          73.87%     5.67%       9.98%      10.48%
1024          70.58%     5.99%       11.25%     12.18%

Table 3.4: Base calling performance of 7-mer model when using MAD normalization in varying window sizes.

window size   identity   insertion   deletion   substitution
full signal   77.02%     5.44%       9.26%      8.28%
131072        77.13%     5.41%       9.24%      8.23%
98034         77.18%     5.39%       9.23%      8.20%
65536         77.22%     5.39%       9.22%      8.17%
32768         77.27%     5.36%       9.22%      8.15%
16384         77.18%     5.34%       9.28%      8.21%
8192          76.83%     5.36%       9.41%      8.41%
4096          75.99%     5.45%       9.71%      8.85%
2048          74.66%     5.59%       10.18%     9.57%
1024          72.66%     5.81%       10.86%     10.67%

Table 3.5: Base calling performance of 5-mer model when using z-score normalization in varying window sizes.

window size   identity   insertion   deletion   substitution
full signal   81.18%     4.95%       7.22%      6.65%
131072        81.34%     4.91%       7.19%      6.56%
98034         81.35%     5.57%       6.58%      6.50%
65536         81.50%     4.85%       7.15%      6.50%
32768         81.61%     4.82%       7.14%      6.44%
16384         81.54%     4.78%       7.19%      6.49%
8192          81.11%     4.85%       7.33%      6.71%
4096          80.05%     4.99%       7.71%      7.25%
2048          78.25%     5.92%       7.72%      8.11%
1024          75.81%     5.54%       9.18%      9.47%

Table 3.6: Base calling performance of 7-mer model when using z-score normalization in varying window sizes.

method                               identity   insertion   deletion   substitution
MAD (full signal)                    76.33%     5.53%       9.47%      8.67%
z-score (full signal)                77.02%     5.44%       9.26%      8.28%

MAD (w=98034)                        76.44%     5.49%       9.46%      8.61%
z-score (w=32768)                    77.27%     5.36%       9.22%      8.15%

z-score (full signal, small train)   77.02%     5.44%       9.26%      8.28%
z-score (full signal, large train)   76.95%     5.44%       9.31%      8.30%

Table 3.7: Base calling performance for 5-mer model with different normalization methods. Results shown are for full signal length normalization, optimal window size normalization (98034 for MAD and 32768 for z-score), and full signal length z-score normalization with different-sized training sets.

method                               identity   insertion   deletion   substitution
MAD (full signal)                    80.25%     5.12%       7.47%      7.16%
z-score (full signal)                81.18%     4.95%       7.22%      6.65%

MAD (w=98034)                        80.41%     5.07%       7.45%      7.07%
z-score (w=32768)                    81.61%     4.82%       7.14%      6.44%

z-score (full signal, small train)   81.18%     4.95%       7.22%      6.65%
z-score (full signal, large train)   81.36%     4.82%       7.27%      6.54%

Table 3.8: Base calling performance for 7-mer model with different normalization methods. Results shown are for full signal length normalization, optimal window size normalization (98034 for MAD and 32768 for z-score), and full signal length z-score normalization with different-sized training sets.

method         identity   insertion   deletion   substitution
no filtering   77.27%     5.36%       9.22%      8.15%

median, w=3    75.87%     6.63%       8.56%      8.94%
median, w=5    73.96%     7.37%       8.83%      9.84%
median, w=7    71.47%     7.95%       9.59%      10.99%

mean, w=3      72.58%     7.46%       9.69%      10.27%
mean, w=5      67.96%     9.58%       9.91%      12.55%
mean, w=7      68.39%     10.00%      8.79%      12.82%

Table 3.9: Base calling performance of 5-mer model when applying median and mean filtering with varying window sizes.

method         identity   insertion   deletion   substitution
no filtering   81.61%     4.82%       7.14%      6.44%

median, w=3    79.89%     6.40%       6.48%      7.22%
median, w=5    77.45%     7.44%       6.83%      8.29%
median, w=7    74.26%     8.32%       7.70%      9.71%

mean, w=3      76.27%     7.30%       7.77%      8.66%
mean, w=5      69.91%     9.89%       8.69%      11.51%
mean, w=7      66.91%     11.33%      8.54%      13.22%

Table 3.10: Base calling performance of 7-mer model when applying median and mean filtering with varying window sizes.

method                 identity   insertion   deletion   substitution
no extra features      77.27%     5.36%       9.22%      8.15%

median, 21 bins        76.04%     6.51%       8.62%      8.83%
median, 51 bins        75.71%     7.12%       8.18%      8.99%
median, 101 bins       75.61%     7.35%       8.01%      9.03%

mean, 21 bins          76.23%     7.27%       7.27%      8.54%
mean, 51 bins          75.43%     8.66%       7.22%      8.68%
mean, 101 bins         75.05%     9.34%       6.89%      8.72%

difference, 21 bins    78.44%     6.63%       7.12%      7.81%
difference, 51 bins    78.70%     7.08%       6.52%      7.70%
difference, 101 bins   78.78%     7.21%       6.36%      7.65%

stddev, 21 bins        78.53%     6.50%       7.26%      7.71%
stddev, 51 bins        78.81%     6.69%       6.95%      7.54%
stddev, 101 bins       78.92%     6.73%       6.86%      7.49%

Table 3.11: Base calling performance of 5-mer model when adding different additional extracted features to the observations.

method                 identity   insertion   deletion   substitution
no extra features      81.61%     4.82%       7.14%      6.44%

median, 21 bins        80.04%     6.35%       6.44%      7.17%
median, 51 bins        79.28%     7.41%       5.88%      7.43%
median, 101 bins       78.78%     8.02%       5.63%      7.57%

mean, 21 bins          80.06%     6.89%       6.10%      6.94%
mean, 51 bins          79.07%     8.45%       5.41%      7.06%
mean, 101 bins         78.67%     9.18%       5.07%      7.07%

difference, 21 bins    82.37%     6.25%       5.19%      6.19%
difference, 51 bins    82.47%     6.94%       4.50%      6.09%
difference, 101 bins   82.21%     7.44%       4.23%      6.12%

stddev, 21 bins        80.91%     7.02%       5.58%      6.49%
stddev, 51 bins        81.07%     7.49%       5.10%      6.35%
stddev, 101 bins       80.80%     7.92%       4.88%      6.41%

Table 3.12: Base calling performance of 7-mer model when adding different additional extracted features to the observations.

method                     identity   insertion   deletion   substitution
no extra features          77.27%     5.36%       9.22%      8.15%
median                     76.04%     6.51%       8.62%      8.83%
mean                       76.23%     7.27%       7.27%      8.54%
difference                 78.78%     7.21%       6.36%      7.65%
stddev                     78.92%     6.73%       6.86%      7.49%

difference (small train)   78.70%     7.08%       6.52%      7.70%
difference (large train)   77.90%     8.31%       5.96%      7.83%

Table 3.13: Summary of base calling performance of 5-mer model when adding different additional extracted features to the observations. Here we provide the best achieved results for each feature, along with results when training with the small versus large training set with the difference feature quantized into Nq,2 = 51 bins.

method                     identity   insertion   deletion   substitution
no extra features          81.61%     4.82%       7.14%      6.44%
median                     80.04%     6.35%       6.44%      7.17%
mean                       80.06%     6.89%       6.10%      6.94%
difference                 82.47%     6.94%       4.50%      6.09%
stddev                     81.07%     7.49%       5.10%      6.35%

difference (small train)   82.47%     6.94%       4.50%      6.09%
difference (large train)   82.43%     7.44%       4.21%      5.93%

Table 3.14: Summary of base calling performance of 7-mer model when adding different additional extracted features to the observations. Here we provide the best achieved results for each feature, along with results when training with the small versus large training set with the difference feature quantized into Nq,2 = 51 bins.

method                identity   insertion   deletion   substitution
uniform, 21 bins      74.02%     5.22%       11.35%     9.41%
uniform, 51 bins      77.02%     5.25%       9.52%      8.20%
uniform, 101 bins     77.27%     5.36%       9.22%      8.15%
uniform, 201 bins     77.31%     5.39%       9.17%      8.13%

k-means, 21 bins      76.94%     5.22%       9.59%      8.25%
k-means, 51 bins      77.27%     5.36%       9.22%      8.15%
k-means, 101 bins     77.33%     5.38%       9.16%      8.13%
k-means, 201 bins     77.34%     5.40%       9.14%      8.13%

info-loss, 21 bins    76.94%     5.22%       9.58%      8.26%
info-loss, 51 bins    77.29%     5.35%       9.22%      8.15%
info-loss, 101 bins   77.34%     5.38%       9.15%      8.13%
info-loss, 201 bins   77.34%     5.39%       9.14%      8.13%

Table 3.15: Base calling performance of 5-mer model when quantizing the electrical current signal with different methods, and varying number of quantization bins.

method                identity   insertion   deletion   substitution
uniform, 21 bins      77.00%     4.91%       9.88%      8.22%
uniform, 51 bins      81.18%     4.73%       7.51%      6.58%
uniform, 101 bins     81.61%     4.82%       7.13%      6.44%
uniform, 201 bins     81.69%     4.86%       7.03%      6.41%

k-means, 21 bins      81.09%     4.69%       7.60%      6.62%
k-means, 51 bins      81.60%     4.82%       7.14%      6.44%
k-means, 101 bins     81.71%     4.85%       7.03%      6.41%
k-means, 201 bins     81.70%     4.92%       6.97%      6.41%

info-loss, 21 bins    81.09%     4.69%       7.59%      6.63%
info-loss, 51 bins    81.65%     4.82%       7.11%      6.42%
info-loss, 101 bins   81.72%     4.87%       7.01%      6.40%
info-loss, 201 bins   81.70%     4.93%       6.96%      6.41%

Table 3.16: Base calling performance of 7-mer model when quantizing the electrical current signal with different methods, and varying number of quantization bins.

method                            identity   insertion   deletion   substitution
difference, uniform, 21 bins      78.44%     6.63%       7.12%      7.81%
difference, uniform, 51 bins      78.70%     7.08%       6.52%      7.70%
difference, uniform, 101 bins     78.78%     7.21%       6.36%      7.65%

difference, k-means, 21 bins      78.54%     7.38%       6.34%      7.75%
difference, k-means, 51 bins      78.53%     7.52%       6.22%      7.73%
difference, k-means, 101 bins     78.46%     7.71%       6.07%      7.76%

difference, info-loss, 21 bins    78.63%     7.37%       6.31%      7.70%
difference, info-loss, 51 bins    78.56%     7.51%       6.20%      7.73%
difference, info-loss, 101 bins   78.45%     7.74%       6.07%      7.75%

Table 3.17: Base calling performance of 5-mer model when quantizing the signals and added difference feature, with different methods, and varying number of quantization bins. Note that the number of quantization bins Nq ∈ 101 × {21, 51, 101} = {2121, 5151, 10201}.

method                            identity   insertion   deletion   substitution
difference, uniform, 21 bins      82.37%     6.25%       5.19%      6.19%
difference, uniform, 51 bins      82.47%     6.94%       4.50%      6.09%
difference, uniform, 101 bins     82.21%     7.44%       4.23%      6.12%

difference, k-means, 21 bins      81.62%     7.97%       4.14%      6.27%
difference, k-means, 51 bins      79.57%     9.86%       3.79%      6.78%
difference, k-means, 101 bins     75.79%     13.10%      3.45%      7.66%

difference, info-loss, 21 bins    81.78%     7.96%       4.04%      6.21%
difference, info-loss, 51 bins    79.79%     9.76%       3.74%      6.71%
difference, info-loss, 101 bins   76.08%     12.91%      3.43%      7.58%

Table 3.18: Base calling performance of 7-mer model when quantizing the signals and added difference feature, with different methods, and varying number of quantization bins. Note that the number of quantization bins Nq ∈ 101 × {21, 51, 101} = {2121, 5151, 10201}.

method                        identity   insertion   deletion   substitution
stddev, uniform, 21 bins      78.53%     6.50%       7.26%      7.71%
stddev, uniform, 51 bins      78.81%     6.69%       6.95%      7.54%
stddev, uniform, 101 bins     78.92%     6.73%       6.86%      7.49%

stddev, k-means, 21 bins      78.05%     8.05%       6.41%      7.49%
stddev, k-means, 51 bins      78.05%     8.16%       6.32%      7.47%
stddev, k-means, 101 bins     78.01%     8.32%       6.19%      7.48%

stddev, info-loss, 21 bins    78.10%     8.04%       6.39%      7.47%
stddev, info-loss, 51 bins    78.08%     8.16%       6.30%      7.46%
stddev, info-loss, 101 bins   78.04%     8.31%       6.19%      7.46%

Table 3.19: Base calling performance of 5-mer model when quantizing the signals and added standard deviation feature, with different methods, and varying number of quantization bins. Note that the number of quantization bins Nq ∈ 101 × {21, 51, 101} = {2121, 5151, 10201}.

method                        identity   insertion   deletion   substitution
stddev, uniform, 21 bins      80.91%     7.02%       5.58%      6.49%
stddev, uniform, 51 bins      81.07%     7.49%       5.10%      6.35%
stddev, uniform, 101 bins     80.80%     7.92%       4.88%      6.41%

stddev, k-means, 21 bins      80.82%     8.63%       4.37%      6.18%
stddev, k-means, 51 bins      78.85%     10.45%      4.00%      6.69%
stddev, k-means, 101 bins     75.30%     13.59%      3.61%      7.50%

stddev, info-loss, 21 bins    81.01%     8.50%       4.33%      6.16%
stddev, info-loss, 51 bins    79.00%     10.36%      3.96%      6.67%
stddev, info-loss, 101 bins   75.46%     13.45%      3.62%      7.47%

Table 3.20: Base calling performance of 7-mer model when quantizing the signals and added standard deviation feature, with different methods, and varying number of quantization bins. Note that the number of quantization bins Nq ∈ 101 × {21, 51, 101} = {2121, 5151, 10201}.

method                                              identity   insertion   deletion   substitution
default preprocessing (MAD-norm only)               76.33%     5.53%       9.47%      8.67%
best preprocessing (z-score-norm, stddev feature)   78.92%     6.73%       6.86%      7.49%

Table 3.21: Base calling performance of 5-mer model for the conventional preprocessing configuration (full signal MAD normalization), and the best performing preprocessing configuration found (z-score normalization with w = 32768, and added standard deviation feature).

method                                                  identity   insertion   deletion   substitution
default preprocessing (MAD-norm only)                   80.25%     5.12%       7.47%      7.16%
best preprocessing (z-score-norm, difference feature)   82.47%     6.94%       4.50%      6.09%

Table 3.22: Base calling performance of 7-mer model for the conventional preprocessing configuration (full signal MAD normalization), and the best performing preprocessing configuration found (z-score normalization with w = 32768, and added difference feature).

Chapter 4

Discussion

4.1 Result summary

In general, the results of the experiments show that applying appropriate preprocessing techniques to the electrical current signals can yield moderate improvements in the performance of a subsequent base caller. In particular, the results indicate that the application of z-score normalization in fairly large windows of the signals (w = 32768), compared to other techniques such as MAD normalization, is particularly impactful on the base calling performance. The fact that z-score normalization consistently performed better than MAD normalization could be an indication that the influence of outliers in the signals is not as significant as feared. Since normalization using the mean and standard deviation worked well, it could be the case that large outliers are few enough, or small enough, to not have a significant impact on the normalization parameters.

The application of mean or median filters on the signals, in varying window sizes, led to worse base calling performance. We believe that a likely explanation for this drop in performance is that even for the shortest evaluated window size, the filtering leads to excessive smoothing. Similarly, the inclusion of the mean or median as additional features also led to worse base calling performance. On the other hand, the inclusion of dispersion based features, either in the form of the difference between adjacent signal samples, or the standard deviation in windows of the signals, generally had a positive impact on the base calling performance. However, a clear exception was that for the 7-mer model, the addition of the standard deviation feature led to worse performance. Nonetheless, given all other results it is clear that, in general, the dispersion based features are favorable over the average based features. This is not entirely unexpected, since abrupt changes in the amplitude of the current signal are common, and can for instance be indicative of a shift from one k-mer to another in the nanopore. Further work is however needed to investigate why the standard deviation feature was not beneficial in all cases.

The application of more sophisticated quantization techniques (such as k-means quantization or information loss optimized quantization) had a moderate positive impact on the base calling performance in the scalar case (i.e. with no added features). However, the performance gains are so small compared to the added computational overhead that using these techniques is most likely not worth the effort in practice. In the multidimensional case (i.e. with added features), the non-uniform quantization techniques did not have any significant positive impact on the base calling performance, but instead led to worse performance compared to simple uniform quantization. The underlying reasons for the poor performance of the non-uniform quantization methods are largely unknown, but the results appear to indicate that the number of quantization bins in the tested configurations was too large for the non-uniform quantization algorithms to handle. Further, the possibility cannot be ruled out that the quantization models were undertrained or overtrained, or that a different choice of hyperparameters would yield different results, since the training procedures for these models were not thoroughly explored.

4.2 Limitations

While the obtained results are valuable, it is important to note that the scope of this thesis is fairly limited. For instance, the effects of the preprocessing techniques were only evaluated for a single base calling model (the ED-HMM base caller). To draw any general conclusions regarding the optimal preprocessing scheme for an arbitrary base caller, the techniques would need to be evaluated on more models, especially since most modern base callers utilize deep networks in place of HMMs.

Another limitation is that only one dataset was used for the preprocessing experiments. In order to draw conclusions regarding the generalizability of the results to data acquired with other sequencers, or from other organisms, the preprocessing methods would need to be evaluated on multiple datasets. Further, only a small fraction of the full dataset was used, due to the limited capability of the base calling model to take a varying DNA translocation speed into account (see Appendix B for details). Evaluating the preprocessing methods on a larger subset of the dataset would give higher credibility to the results.

An additional limitation concerns the hyperparameters of the various preprocessing methods that were evaluated. Given the limited time frame for the thesis and the limited computational power available, an exhaustive hyperparameter search has not been done for any method. Rather, the hyperparameters have at best been found through sparse grid searches, and at worst been set based on some heuristic or previous work. It is therefore possible that the chosen hyperparameters are suboptimal, and that the results would be different for the optimal choice of hyperparameters.

4.3 Future work

Given the obtained results, there are several potential future research directions. An obvious extension of this work would be to evaluate the considered preprocessing techniques using additional datasets and/or additional base calling models, in order to explore the generalizability of the results. Similarly, the same preprocessing methods could be evaluated with more thorough hyperparameter searches, to verify the validity of the results.

As the experiments with quantization methods beyond uniform quantization have been fairly limited in this thesis, a potential direction of future work would be to investigate further the importance of quantization in achieving optimal base calling performance. While both k-means quantization and information loss optimized quantization were found to degrade performance in our experiments, it is for instance possible that a different configuration (e.g. with fewer quantization bins, or different hyperparameters) could yield better results.

Since for normalization, filtering, and feature extraction only a small set of basic methods were considered, a more interesting future research direction would be to study more sophisticated methods within these domains. One possibility would be to look at previous work in fields with similar input data, for instance speech processing, and consider whether methods used there could potentially be applicable to the electrical current signals from a nanopore. A likely challenge lies in the fact that in the nanopore signals, information about the underlying nucleobase sequence is modulated in the signal amplitude, while in speech processing (and other similar fields) signals are often treated in terms of their frequency content.

Another suggestion for potentially exploring further the impact of different preprocessing methods would be to study the resulting base calling performance in greater detail. While the aggregate base calling performance (in terms of identity, insertion, deletion, and substitution rates) on the full test set is perhaps the most relevant target metric, further analysis could potentially be performed by studying histograms of these metrics across the test set, such as the ones exemplified in Appendix C. Comparing the base calling performance given different preprocessing schemes on specific reads could potentially yield deeper insight into what the benefits of certain methods are compared to others.

4.4 Ethics and society

While DNA sequencing techniques are evolving, the technology is not yet at a stage where all of society is aware of its existence or its implications. It is however undeniable that the current applications of DNA sequencing have made a considerable positive impact on society. The use of sequencing in forensics enables the identification of criminal perpetrators through genealogical research. For example, the so-called Golden State Killer, an infamous serial killer who committed at least 13 murders in California during the 1970s and 1980s, was identified in 2018 using genealogical research [24]. DNA sequencing is also the primary tool for identifying and studying viruses, for instance for the purpose of developing vaccines, as most viruses are too small to be seen in a microscope. In the currently ongoing COVID-19 pandemic, this particular application is perhaps more relevant than ever before. In clinical medicine, DNA sequencing can for instance be used for diagnosis of diseases that are associated with known mutations in one or multiple genes. Sequencing also plays an important role in research within the fields of biology and medicine, to increase our understanding of genetics and life in general.

Even though the societal benefits of DNA sequencing are clear, there are also considerable ethical concerns regarding the technology. The eventual wide-spread availability of DNA sequencers raises questions, for instance, regarding who should be allowed to read or share information about the DNA of a human person. One concern is that, in the wrong hands, DNA sequencing could enable a new form of systematic discrimination, based on the genetics of individuals rather than just their externally perceived attributes. Similarly, another potential use-case of ethically questionable nature is if DNA sequencing were to be used to perform genetic tests, for instance for intelligence or athletic ability, by employers in recruitment processes, or by insurance companies to determine individual insurance policies. While the current understanding of genetics and its probabilistic nature speaks against the eventual possibility of performing such tests, these are real concerns since developments in the field happen quickly. The speed at which research progress in the field of DNA sequencing has been made over the past decade in particular also raises concerns regarding public awareness of the technology. One could question whether an individual who consents to having their DNA sequenced is sufficiently aware of the different ways the acquired data could come to be used.

Chapter 5

Conclusion

In this thesis, we have shown that the application of various preprocessing techniques to the signals from a nanopore DNA sequencer can have a moderate impact on the performance of a subsequent DNA base caller. With appropriate methods for normalization, feature extraction, and quantization, the mean identity rate of the studied ED-HMM base caller was improved by 2 - 3 percentage points, compared to a conventional preprocessing scheme. It was found that the application of z-score normalization in sliding windows of the input signals, as well as the inclusion of additional dispersion based features, was especially impactful in improving the base calling performance.

A relevant future research direction would be to explore the generalizability of the derived results to other base callers or other datasets. Since most modern base callers utilize deep neural networks, reproducing the experiments with a deep base calling model would be of particular interest. Further work could also be done investigating the impact of quantization of the nanopore signals, as experiments on this topic were fairly limited in this thesis. An additional research direction would be to evaluate more sophisticated methods for signal normalization, filtering, or feature extraction, perhaps by taking inspiration from adjacent fields such as speech processing, as only basic methods were considered in this work.

Bibliography

[1] Erik Pettersson, Joakim Lundeberg, and Afshin Ahmadian. Generations of sequencing technologies. Feb. 2009. doi: 10.1016/j.ygeno.2008.10.003.
[2] Oxford Nanopore Technologies. url: https://nanoporetech.com/.
[3] Kerstin Göpfrich and Kim Judge. "Decoding DNA with a pocket-sized sequencer". In: Science in School 43 (2018), pp. 17–20. url: https://www.scienceinschool.org/content/decoding-dna-pocket-sized-sequencer.
[4] Franka J. Rang, Wigard P. Kloosterman, and Jeroen de Ridder. From squiggle to basepair: Computational approaches for improving nanopore sequencing read accuracy. July 2018. doi: 10.1186/s13059-018-1462-9. url: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1462-9.
[5] Ryan R. Wick, Louise M. Judd, and Kathryn E. Holt. "Performance of neural network basecalling tools for Oxford Nanopore sequencing". In: bioRxiv (Feb. 2019), p. 543439. doi: 10.1101/543439.
[6] Oxford Nanopore Technologies. Analysis solutions for nanopore sequencing data. 2019. url: https://nanoporetech.com/nanopore-sequencing-data-analysis.
[7] Haotian Teng et al. "Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning". In: GigaScience 7.5 (Apr. 2018). doi: 10.1093/GIGASCIENCE/giy037. url: https://academic.oup.com/gigascience/article/7/5/giy037/4966989.


[8] Marcus H Stoiber et al. "De novo Identification of DNA Modifications Enabled by Genome-Guided Nanopore Signal Processing". In: bioRxiv (Dec. 2016), p. 094672. doi: 10.1101/094672. url: https://www.biorxiv.org/content/10.1101/094672v1.abstract.
[9] Peng Ni et al. "DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning". In: Bioinformatics 35.22 (2019), pp. 4586–4595. doi: 10.1093/bioinformatics/btz276. url: https://academic.oup.com/bioinformatics/article/35/22/4586/5474907.
[10] Qian Liu et al. "Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data". In: Nature Communications 10.1 (Dec. 2019), pp. 1–11. issn: 20411723. doi: 10.1038/s41467-019-10168-2. url: https://www.nature.com/articles/s41467-019-10168-2.
[11] Vladimír Boža, Broňa Brejová, and Tomáš Vinař. "DeepNano: Deep recurrent neural networks for base calling in MinION Nanopore reads". In: PLoS ONE 12.6 (June 2017). issn: 19326203. doi: 10.1371/journal.pone.0178751. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5459436/.
[12] Arne Leijon and Gustav Eje Henter. Pattern Recognition - Fundamental Theory and Exercise Problems. Stockholm: KTH - School of Electrical Engineering, 2015. doi: 10.1002/wics.99.
[13] Shun Zheng Yu. Hidden semi-Markov models. Feb. 2010. doi: 10.1016/j.artint.2009.11.011.
[14] Heng Li. "Minimap2: pairwise alignment for nucleotide sequences". In: Bioinformatics 34.18 (2018), pp. 3094–3100. issn: 1367-4803. doi: 10.1093/bioinformatics/bty191. url: https://doi.org/10.1093/bioinformatics/bty191.
[15] Sergey Ioffe and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift". In: 32nd International Conference on Machine Learning, ICML 2015. Vol. 1. International Machine Learning Society (IMLS), Feb. 2015, pp. 448–456. isbn: 9781510810587.
[16] T Jayalakshmi and A Santhakumaran. "Statistical normalization and back propagation for classification". In: International Journal of Computer Theory and Engineering 3.1 (2011), pp. 1793–8201.

[17] Samit Bhanja and Abhishek Das. "Impact of Data Normalization on Deep Neural Network for Time Series Forecasting". In: arXiv preprint arXiv:1812.05519 (Dec. 2018). url: http://arxiv.org/abs/1812.05519.
[18] S C Nayak, B B Misra, and H S Behera. Impact of Data Normalization on Stock Index Forecasting. Tech. rep. 2014, pp. 257–269. url: www.mirlabs.net/ijcisim/index.html.
[19] David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. Tech. rep. Stanford University, June 2009. url: http://ilpubs.stanford.edu:8090/778.
[20] Svetlana Lazebnik and Maxim Raginsky. "Supervised learning of quantizer codebooks by information loss minimization". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 31.7 (2009), pp. 1294–1309. issn: 01628828. doi: 10.1109/TPAMI.2008.138. url: https://ieeexplore.ieee.org/document/4531751.
[21] Illumina | Sequencing and array-based solutions for genetic research. url: https://emea.illumina.com/.
[22] nanoporetech/taiyaki: Training models for basecalling Oxford Nanopore reads. url: https://github.com/nanoporetech/taiyaki.
[23] scikit-learn.org. scikit-learn - Mini-Batch K-Means clustering. 2019. url: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html.
[24] JV Chamary. How Genetic Genealogy Helped Catch The Golden State Killer. 2020. url: https://www.forbes.com/sites/jvchamary/2020/06/30/genetic-genealogy-golden-state-killer/#3c457be5a6d0.

Appendix A

Detailed background on the ED-HMM base caller

This appendix provides a more detailed description of the ED-HMM base caller than what was provided in section 2.3. We give a deeper background on HMMs (and HSMMs), and account more extensively for the mathematical details of the ED-HMM base calling model. Note that the text has considerable overlap with section 2.3, but delves deeper into most topics.

A.1 Hidden Markov Models (HMMs)

In many unsupervised learning settings, where only inputs are observed, there is a notion of additional, hidden variables (so-called latent variables) which are assumed to affect the inputs in a way that cannot be observed directly from the data. In such settings, one can attempt to explicitly account for these latent variables in the machine learning model, either as a means to improve performance, or because inference of the latent variable itself might be of interest. A frequently used model for sequence data that has this structure is the Hidden Markov Model (HMM). In a HMM, the modeled system is assumed to be a Markov process with hidden states. This means that at each time step, the system takes one of a number of possible hidden states with a probability dependent only on the previous state. Additionally, it is assumed that the observed data at time step t is dependent only on the state of the model at time step t. This observed data, which we previously referred to as inputs, is commonly denoted as observations in this context. The sequence of hidden states in a HMM for time steps t = 1, ..., T can be denoted S_1, ..., S_T, where S_t assumes a value from a set of N possible states.


Figure A.1: A Hidden Markov Model represented as a Bayesian network.

The corresponding sequence of observations is denoted X_1, ..., X_T, where X_t ∈ R^K. The parameters of a HMM can be denoted by λ = {q, A, B}, where the elements of q ∈ R^N, A ∈ R^{N×N}, and B ∈ R^{N×K} are defined as

q_j = P[S_1 = j]        (A.1)

a_ij = P[S_{t+1} = j | S_t = i]        (A.2)

b_j(x) = P[X_t = x | S_t = j]        (A.3)

where b_j is the j-th row of B. A Hidden Markov Model is commonly represented as a Bayesian network, as illustrated in Figure A.1. There exist several variations of the HMM. A more detailed review is provided in [12].

A.1.1 The Baum-Welch algorithm

In order to use a Hidden Markov Model for inference, its parameters must be learned from data. The procedure for training a HMM is known as the Baum-Welch algorithm, and has its basis in a more general machine learning algorithm known as the Expectation Maximization (EM) algorithm. The EM algorithm is used to train models in which some of the variables are unobserved (i.e. latent). By taking the expectation over all latent variables (known as the E-step), a lower bound for the log-likelihood of the data in the vicinity of the current estimate of the model parameters can be found. The estimate of the model parameters is then updated to maximize this lower bound (known as the M-step). By iteratively performing the E-step and M-step, given some initial guess for the model parameters, the EM algorithm converges to a local maximum. In the Baum-Welch algorithm, expectation maximization is used in combination with recursive computations of probabilities in the HMM to learn the parameters of the model. A more detailed account of the Baum-Welch algorithm can be found in [12].
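To make the E-step and M-step concrete, the following is a minimal Python/NumPy sketch of a single Baum-Welch iteration for a generic HMM with discrete observations. It is a textbook-style reconstruction using the notation of (A.1)-(A.3), not the training code of the base caller software, and all function and variable names are illustrative.

```python
import numpy as np

def baum_welch_step(obs, q, A, B):
    """One EM iteration for an HMM with discrete observations.

    obs : (T,) integer observation symbols
    q   : (N,) initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P[S_{t+1}=j | S_t=i]
    B   : (N, K) observation probabilities, B[j, x] = P[X_t=x | S_t=j]
    """
    obs = np.asarray(obs)
    T, N = len(obs), len(q)

    # E-step: scaled forward pass (alpha) and backward pass (beta).
    alpha, scale = np.zeros((T, N)), np.zeros(T)
    alpha[0] = q * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    # Posterior state probabilities and pairwise transition posteriors.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)

    # M-step: re-estimate the parameters from the expected counts.
    q_new = gamma[0]
    A_new = xi.sum(axis=0)
    A_new /= A_new.sum(axis=1, keepdims=True)
    B_new = np.zeros_like(B)
    for x in range(B.shape[1]):
        B_new[:, x] = gamma[obs == x].sum(axis=0)
    B_new /= B_new.sum(axis=1, keepdims=True)
    return q_new, A_new, B_new
```

Iterating this update from an initial guess, and monitoring the log-likelihood (the sum of the logarithms of the scale factors), corresponds to the E/M iteration described above.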

A.1.2 The Viterbi algorithm

Given a trained HMM, one often wants to compute the most probable sequence of hidden states given a particular sequence of observations. Formally, this can be formulated as finding the state sequence (î_1, ..., î_T) such that

(î_1, ..., î_T) = arg max_{(i_1, ..., i_T)} P[S_1 = i_1, ..., S_T = i_T | x_1, ..., x_T, λ].        (A.4)

Since there are N^T possible hidden state sequences that could have generated an observed sequence of length T, this problem cannot be reasonably solved with a brute-force search. Instead, one can use a procedure known as the Viterbi algorithm, which utilizes dynamic programming to cover only the necessary parts of the full search space. The algorithm operates recursively for t = 1, ..., T. The key idea is that at time step t, only the most probable path that results in each state i, for i = 1, ..., N, needs to be considered further, since the state at time step t + 1 is dependent only on the state at time step t. Thus, at any time t there are only N possible candidates for the most probable sequence. As the algorithm reaches t = T, the most probable sequence can be selected from the N candidates. A more detailed account of the Viterbi algorithm can be found in [12].
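For illustration, a log-domain Viterbi decoder for the same generic discrete-observation HMM can be sketched as follows; this is a standard textbook construction, not the decoder of the ED-HMM base caller.

```python
import numpy as np

def viterbi(obs, q, A, B):
    """Most probable state sequence for an HMM with discrete observations.

    Computed in log-space to avoid numerical underflow for long sequences;
    zero probabilities become -inf and are never selected by argmax.
    """
    T, N = len(obs), len(q)
    log_q, log_A, log_B = np.log(q), np.log(A), np.log(B)

    delta = np.zeros((T, N))           # best log-probability of a path ending in each state
    psi = np.zeros((T, N), dtype=int)  # back-pointers to the best predecessor state
    delta[0] = log_q + log_B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] + log_A   # entry (i, j): best path into state j via i
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) + log_B[:, obs[t]]

    # Backtrack from the most probable final state.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

At each time step only the N running maxima in delta are kept, which is exactly the pruning argument made above.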

A.2 Hidden Semi Markov Models (HSMMs)

In the traditional HMM, the only way to model an extended stay in some state is through repeated self-transitions. Since the self-transition probability for any given state is modeled by a single scalar, the level of nuance in the modeling of state durations is highly restricted. A variation of the common HMM where the durations of states are modeled explicitly is known as a Hidden Semi Markov Model (HSMM) [13]. In a HSMM, each state has an associated variable duration, during which the model will remain in the state and produce additional observations. The addition of state durations adds significant complexity to the model, as the durations must be incorporated in the transition and observation probabilities. However, training and inference with the model can still be performed using extended Baum-Welch and Viterbi algorithms that follow the same principles as in the regular HMM. A detailed account of these algorithms for the HSMM is provided in [13].

A.2.1 Explicit Duration HMM

Different simplifying assumptions can be made with regard to how the distribution of durations is modeled in the HSMM. For instance, one could assume that

1. a transition to the current state is independent of the duration of the previous state, and

2. the duration is conditioned only on the current state.

A HSMM utilizing these assumptions is known as an Explicit Duration HMM (ED-HMM) [13]. Due to its simplicity compared to other variations of the HSMM, the ED-HMM is the most popular HSMM in many applications [13].

A.3 Explicit Duration HMM as a DNA base caller

In nanopore DNA sequencing, HMMs (and especially ED-HMMs) can be utilized to model the sequence of nucleobases in a DNA strand, given the observed current signal. Since the sequence of nucleobases is the unobserved quantity that we wish to infer, it corresponds to the sequence of states in the model. The measured current signal corresponds to the observations in the model. As the translocation speed of the DNA strand through the nanopore is non-uniform, ED-HMMs, which can model the duration of states with more nuance, are especially well suited to this application. In this section we proceed to describe one such ED-HMM model that can function as a base caller for nanopore sequencing.

A.3.1 Model definition

It is generally assumed that a number of nucleobases are inside the pore at any given time (typically between 3 and 7) and thus affect the signal simultaneously. The state representation is therefore chosen to be the sequence of k nucleobases, a so-called k-mer, currently affecting the signal. For a model using a 5-mer state representation, for instance, the current state might be the sequence ‘ATACG’. The model remains in the same state for a duration d ∈ Z_+, up to a maximum of d = D steps. Henceforth, we refer to the tuple (S, d) of a state S and its associated duration d as a so-called super-state.

Figure A.2: The ED-HMM illustrated in a lattice representation. For d > 1, the model transitions to the same state, but with a lower duration. When d = 1, the model transitions to a new state with a new random duration d. In this example, D = 3.

The modeling of state durations is then implemented as follows. Upon entry to a state, a duration d is drawn from a probability distribution described by the model parameter C ∈ R^D, such that C_j = p(d = j). For durations d = 2, ..., D − 1, super-state (S, d) always transitions into the super-state (S, d − 1), i.e. the same state but with a duration one step shorter. Like in the regular HMM, an observation is produced on every transition. When the super-state eventually reaches (S, 1), the model transitions into some new state Ŝ (in accordance with the transition probabilities of the model), and a duration for the new state is once again drawn from p(d). In order to facilitate the possibility of remaining in the same state for longer than D steps, the super-state (S, D) transitions back into (S, D) with probability c ≤ 1, and into (S, D − 1) with probability 1 − c. The modeling of durations thus adds parameters C and c to be learned, in addition to the transition and observation probabilities. Figure A.2 provides a simple illustration of the above scheme in a lattice representation. The transition from the super-state (S, 1) into some new super-state (Ŝ, d) models a shift of a single nucleobase in/out of the k-mer representing state S. Compared to a standard ED-HMM, this adds constraints on what state transitions are possible, since the new k-mer must omit the first nucleobase of the previous k-mer and append a new nucleobase at the end.

Figure A.3: The ED-HMM illustrated in a lattice representation with added k-mer transition constraints. The k-mer corresponding to the new state must omit the first nucleobase of the current k-mer, and append a new nucleobase at the end. Note that the figure shows a 3-mer state representation for the sake of brevity. In practice, a 5-mer or 7-mer state is used.

For instance, in a 3-mer state representation, ‘TTT’ can only transition to ‘TTA’, ‘TTC’, ‘TTG’, or ‘TTT’. Figure A.3 shows an illustration of a lattice representation of the model, taking these constraints into account.
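The transition constraint can be made concrete with a small helper that enumerates the k-mers reachable from a given state. The integer encoding shown here is only a hypothetical way of indexing k-mer states and is not taken from the base caller software.

```python
BASES = "ACGT"

def kmer_index(kmer):
    """Encode a k-mer as an integer in [0, 4^k), interpreting it as a base-4 number."""
    index = 0
    for base in kmer:
        index = 4 * index + BASES.index(base)
    return index

def allowed_successors(kmer):
    """The four k-mers reachable from `kmer`: drop the first base, append a new one."""
    return [kmer[1:] + base for base in BASES]

# Example from the text: 'TTT' can only transition to 'TTA', 'TTC', 'TTG', or 'TTT'.
print(allowed_successors("TTT"))   # ['TTA', 'TTC', 'TTG', 'TTT']
```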

A.3.2 Model parameters

The dimensions of the parameters of the model described above can be expressed in terms of the following quantities. N_q ∈ Z_+ is the number of bins into which the raw nanopore current signal is quantized, k ∈ {5, 7} is the k-mer length, and D ∈ Z_+ is the maximum duration length. Since there are four different nucleobases, the number of k-mers, and thus the number of states, is 4^k. The parameters of the model are the initial state probabilities q ∈ R^{4^k}, the transition probabilities A ∈ R^{4×4^k}, the observation probabilities B ∈ R^{N_q×4^k}, the duration probabilities C ∈ R^D, and c ∈ R. The model parameters are defined so that

q_j = P[S_1 = j]        (A.5)

a_ij = P[next nucleobase = i | S_t = j]        (A.6)

b_ij = P[X = i | S = j]        (A.7)

C_j = p(d = j)        (A.8)

c = P[S_{t+1} = S, d = D | S_t = S, d = D]        (A.9)
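To make the dimensions above concrete, the following sketch allocates arrays of the corresponding shapes for a 5-mer model. The numerical values of N_q and D are assumptions chosen for illustration only, not the configuration used in the experiments.

```python
import numpy as np

k, N_q, D = 5, 101, 10           # k-mer length, quantization bins, max duration (assumed values)
n_states = 4 ** k                # one state per possible k-mer

q = np.full(n_states, 1.0 / n_states)   # initial state probabilities, shape (4^k,)
A = np.full((4, n_states), 0.25)        # P[next nucleobase = i | S_t = j], shape (4, 4^k)
B = np.zeros((N_q, n_states))           # P[X = i | S = j], shape (N_q, 4^k)
C = np.zeros(D)                         # duration distribution p(d), shape (D,)
c = 0.0                                 # self-transition probability at d = D, scalar
```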

A.3.3 Initialization

To apply a training procedure where the parameters of an ED-HMM can be learned from data, the model parameters must be appropriately initialized. Below, we exemplify an initialization scheme for the parameters of the ED-HMM base caller. The initial state probabilities can be initialized uniformly, so that q_j = 1/4^k ∀j. Each column b_j(X) = P[X|S = j] of the observation probability matrix B is initialized with a Gaussian probability distribution. It has been observed empirically that k-mers with ’C’ or ’T’ as their mid-point nucleobase typically produce higher current signal values in the nanopore than k-mers with ’A’ or ’G’ as their mid-point nucleobase. For this reason, each column b_j is initialized with one of two Gaussians, N(µ_CT, σ²) or N(µ_AG, σ²), depending on which nucleobase is in the mid-point of the j-th k-mer. With a current signal that has been normalized so that samples lie in the interval [−5.0, 5.0], the parameters of the Gaussians are set to µ_CT = 1.0, µ_AG = −1.0, and σ = 2.0. The transition probability matrix A can be initialized uniformly, so that a_ij = 1/4 ∀i, j. It has however been observed that better results can be achieved if the transition probabilities are set using statistics from previously base called DNA. From a known sequence of nucleobases, the probability of the next nucleobase being an ’A’, ’C’, ’G’, or ’T’ given the previous k-mer can trivially be estimated, and the entries of the transition probability matrix A set accordingly. For the duration probability distribution, a Poisson distribution with mean λ is assumed, such that

p(d) = λ^{d−1} e^{−λ} / (d−1)!,   for d = 1, ..., D − 1,
p(D) = 1 − Σ_{d′=1}^{D−1} p(d′).        (A.10)

The probability c of self-transitioning into duration d = D is set to

c = (λ^D e^{−λ} / D!) / (λ^{D−1} e^{−λ} / (D−1)!) = λ / D        (A.11)

where c < 1 if and only if λ < D. Given this construction, only the value of the parameter λ needs to be set in order to fully initialize the duration distribution. It can be shown that the expected value of the number of steps that the model remains in the same state, henceforth referred to as the expected duration E_d,

can be expressed

E_d = p(D) (D + c / (1 − c)) + Σ_{d=1}^{D−1} d · p(d).        (A.12)

The value of E_d can be estimated empirically given a previously base called DNA sequence and the corresponding raw current measurements. Further, it can be shown that, if regarded as a function E_d = E_d(λ), E_d is strictly increasing in λ, and E_d > λ + 1. Therefore, given an empirical estimate Ẽ_d of E_d, a bisection search can be used to find the value of λ ∈ (Ẽ_d − 1, D) such that E_d(λ) = Ẽ_d.
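The duration initialization can be summarized in a short sketch: build p(d) and c for a candidate λ, evaluate E_d(λ) via (A.12), and bisect until it matches the empirical estimate Ẽ_d. This is a reconstruction from the formulas above rather than the base caller's own code; the function names and the tolerance are arbitrary.

```python
import numpy as np
from scipy.stats import poisson

def duration_pmf(lam, D):
    """p(d) for d = 1..D: shifted Poisson, with the tail mass folded into d = D, as in (A.10)."""
    p = poisson.pmf(np.arange(0, D - 1), lam)     # p(d) for d = 1, ..., D-1
    return np.append(p, 1.0 - p.sum())            # p(D) = remaining probability mass

def expected_duration(lam, D):
    """E_d as in (A.12), using c = lambda / D from (A.11)."""
    p = duration_pmf(lam, D)
    c = lam / D
    d = np.arange(1, D)
    return p[-1] * (D + c / (1.0 - c)) + np.sum(d * p[:-1])

def fit_lambda(Ed_target, D, tol=1e-6):
    """Bisection search for lambda such that E_d(lambda) matches the empirical estimate."""
    lo, hi = max(Ed_target - 1.0, 1e-6), D - 1e-6
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_duration(mid, D) < Ed_target:
            lo = mid        # E_d is increasing in lambda, so lambda must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Given the fitted λ, C can be set to duration_pmf(lam, D) and c to lam / D, completing the initialization of the duration model.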

A.3.4 Training

Much like with the regular HMM, the parameters of an ED-HMM can be learned from data using a variation of the Baum-Welch algorithm. A detailed account of how the algorithm works for an ED-HMM and other HSMMs can be found in [13]. Typically, in the application of base calling, a ground-truth sequence of nucleobases is available in the training data, along with the electrical current time series. While the ground-truth sequence of nucleobases is known in the training data, there is no ground-truth alignment between this sequence and the current time series indicating which current samples originated from which nucleobase. However, if we consider a hypothetical scenario in which such an alignment was available, learning the parameters of the ED-HMM would be greatly simplified. If the alignment was known, one could, for instance, for each possible k-mer, form a histogram from the training data indicating the number of times different current levels were observed while this k-mer was in the nanopore. The j-th column of the observation probabilities b_j = P[X|S = j] could then be set directly as a normalized version of the histogram for the j-th k-mer. The remaining model parameters could similarly be set directly using histograms created from the training data. In actuality, we do not have access to a ground-truth alignment between the current samples and the sequence of nucleobases. Instead, along with the model parameters, the alignment can be learned from the training data. The Baum-Welch algorithm allows learning of the model parameters, similar to the scenario in which a ground-truth alignment is known, but instead taking into account all possible alignments and their respective probabilities. In contrast to the Baum-Welch algorithm, which utilizes probabilistic modeling of the alignments, another training algorithm that considers only the most probable alignment at any given training step may also be used.

We henceforth refer to training with the latter algorithm as hard training, and training with the modified Baum-Welch algorithm as soft training.

A.3.5 Inference

Base calling consists of finding the most probable sequence of nucleobases (i.e. states) given samples of the nanopore current signal (i.e. observations). Given a trained ED-HMM, much like in the regular HMM, this sequence can be found using the Viterbi algorithm. A more detailed account of how the Viterbi algorithm works for the ED-HMM and other HSMMs can be found in [13].

A.3.6 Evaluation

After performing inference with any base caller, one commonly wants to compare the inferred sequence of nucleobases to the ground-truth reference sequence in order to evaluate the performance of the base calling model in question. Performing such a comparison between nucleobase sequences requires that the sequences are first aligned, so that each base is correctly compared to the corresponding base in the other sequence. This alignment is performed using the publicly available sequence alignment software minimap2 [14]. Once the inferred sequence has been aligned with the reference sequence, performance is commonly measured using a metric known as the identity rate, which in turn can be split into an insertion rate, a deletion rate, and a substitution rate. The insertion, deletion, and substitution rates measure how many bases have been incorrectly inserted, deleted, and substituted in the inferred sequence, in relation to the reference sequence. They are respectively defined as

insertion rate = i_c / (m_c + i_c + d_c + s_c)        (A.13)

deletion rate = d_c / (m_c + i_c + d_c + s_c)        (A.14)

substitution rate = s_c / (m_c + i_c + d_c + s_c)        (A.15)

where m_c = count(matches), i_c = count(insertions), d_c = count(deletions), and s_c = count(substitutions). The identity rate is defined as

identity rate = m_c / (m_c + i_c + d_c + s_c)        (A.16)

and thus provides a measure of similarity between the inferred sequence and the reference sequence, such that the identity rate is 1 if the sequences are exactly identical, and 0 if they do not match in any base. Note that, given the above construction, it also holds true that

identity rate = 1 − insertion rate − deletion rate − substitution rate.        (A.17)
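A direct translation of (A.13)-(A.17) into code could look as follows, where the counts are assumed to be extracted from an alignment (for instance from the CIGAR string of a minimap2 alignment); the numbers in the example call are made up.

```python
def alignment_metrics(mc, ic, dc, sc):
    """Identity, insertion, deletion, and substitution rates from alignment counts."""
    total = mc + ic + dc + sc
    return {
        "identity": mc / total,
        "insertion": ic / total,
        "deletion": dc / total,
        "substitution": sc / total,
    }

# Sanity check: the four rates always sum to 1, consistent with (A.17).
rates = alignment_metrics(mc=90, ic=4, dc=3, sc=3)
assert abs(sum(rates.values()) - 1.0) < 1e-9
```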

An example of how the different metrics are computed when applied to two short sequences is provided in Figure A.4.

Figure A.4: The different performance metrics and their relation illustrated on two short example sequences.

Appendix B

Observed shift in DNA translocation speed

While the ED-HMM model used allows for a non-uniform translocation speed, it assumes that the mean translocation speed over a large number of reads does not change over time. In order to check whether the mean translocation speed is in fact constant in the full dataset, the average duration that each nucleobase stays in the nanopore was computed separately for each file (of approximately 4 000 reads each), as

Ê_d = Σ_{r∈reads} length(signal_r) / Σ_{r∈reads} length(reference_r)        (B.1)

where reads denotes all reads in the file, and signal_r and reference_r denote the signal and reference of read r in the file, respectively. A plot showing the expected duration for each file is provided in Figure B.1. Since the expected durations are plotted according to the acquisition order of the files, it is immediately apparent that the expected duration increased over time. An increasing expected duration indicates that the mean translocation speed of the DNA strands decreased over time, since an average nucleobase staying longer in the nanopore necessarily implies that the DNA strand moved more slowly. Since the ED-HMM base caller assumes that durations are Poisson distributed with constant parameter λ, it can only successfully model data where the expected duration over a large number of reads is constant. This is an inherent limitation of the model, and finding a way to model the translocation speed of the DNA strands more flexibly is an ongoing research problem. The current modeling of the translocation speed limits usage of the full dataset for investigating the performance of preprocessing, since the performance of the model will be largely inconsistent across files.
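A sketch of how the per-file estimate in (B.1) could be computed is given below; the assumption that each read is available as a (signal, reference) pair is a hypothetical data layout, not the actual file format.

```python
def estimate_expected_duration(reads):
    """Empirical expected duration as in (B.1) for one file of reads.

    `reads` is assumed to be an iterable of (signal, reference) pairs, where
    `signal` holds the raw current samples and `reference` the aligned bases.
    """
    total_signal = sum(len(signal) for signal, _ in reads)
    total_reference = sum(len(reference) for _, reference in reads)
    return total_signal / total_reference
```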


Figure B.1: Expected duration of a nucleobase in the nanopore for each file, plotted in the acquisition order of the files.

For instance, a model trained on the first few files will not perform well on the last few files, which have a vastly different expected duration, but will perform well on other early files. If, however, only the first 10 files are used for training and evaluation, the expected duration is roughly constant across the files, and performance will be reasonably consistent.

Appendix C

Supplemental results

As mentioned in the beginning of section 3.2, the base caller software allows generation of histograms showing the identity rate (and/or insertion, deletion, and substitution rate) on a per-read basis. This gives a more fine-grained perspective on the performance on the test set. Examples of histograms for the identity rate, insertion rate, deletion rate, and substitution rate can be found in Figures C.1, C.2, C.3, and C.4, respectively. In these examples, signals were normalized using z-score normalization with window size w = 32768 and then uniformly quantized into N_q = 101 bins.
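For reference, a histogram of the kind shown in Figure C.1 can be produced with a few lines of matplotlib. The identity rates below are synthetic stand-ins drawn from a Beta distribution, since the real per-read values are produced by the base caller software.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic per-read identity rates, used only as a stand-in for real output.
rng = np.random.default_rng(0)
identity_rates = rng.beta(20, 5, size=4000)

plt.hist(identity_rates, bins=50, range=(0.0, 1.0))
plt.xlabel("Identity rate")
plt.ylabel("Number of reads")
plt.title("Per-read identity rates on the test set")
plt.show()
```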

Figure C.1: Histogram of identity rates on the test set.


Figure C.2: Histogram of insertion rates on the test set.

Figure C.3: Histogram of deletion rates on the test set.

Figure C.4: Histogram of substitution rates on the test set.

Appendix D

Further experiments with standard deviation feature

In order to investigate the unexpected results with the 7-mer model when adding the standard deviation feature, as described in section 3.2.3, some further experiments were made. With a configuration otherwise identical to that previously described, further training iterations were run with the same models. Table D.1 shows the resulting performance after training for an additional 2 or 4 soft iterations. Table D.2 shows the resulting performance after training for an additional 2 or 4 soft iterations, followed by one additional hard iteration. We see that the models with a higher number of quantization bins improve moderately from additional soft iterations, but that results otherwise worsen from further training. We argue that one can therefore safely conclude that the unexpected drop in performance when adding the standard deviation is not caused by poor convergence. One additional experiment was made where the number of quantization bins was changed to N_{q,2} = 11, while keeping the rest of the configuration the same. The resulting performance from this experiment is shown in Table D.3. Compared to the results of Table D.1, the performance is slightly worse than with N_{q,2} = 21. Therefore, we can conclude that the drop in performance when adding the standard deviation is likely not caused by the number of quantization bins being too large.


method                                 identity   insertion   deletion   substitutions
20 + 5 iterations, stddev, 21 bins     80.91%     7.02%       5.58%      6.49%
20 + 5 iterations, stddev, 51 bins     81.07%     7.49%       5.10%      6.35%
20 + 5 iterations, stddev, 101 bins    80.80%     7.92%       4.88%      6.41%

22 + 5 iterations, stddev, 21 bins     80.89%     7.21%       5.51%      6.39%
22 + 5 iterations, stddev, 51 bins     81.12%     7.62%       5.04%      6.23%
22 + 5 iterations, stddev, 101 bins    80.98%     7.94%       4.83%      6.24%

24 + 5 iterations, stddev, 51 bins     80.96%     7.75%       5.02%      6.27%
24 + 5 iterations, stddev, 101 bins    80.83%     8.07%       4.82%      6.28%

Table D.1: Resulting performance of the 7-mer model on the test set when adding the standard deviation feature and training with more soft iterations.

method                                 identity   insertion   deletion   substitutions
22 + 6 iterations, stddev, 21 bins     80.33%     6.98%       6.05%      6.64%
24 + 6 iterations, stddev, 51 bins     80.48%     7.48%       5.52%      6.52%
24 + 6 iterations, stddev, 101 bins    80.26%     7.86%       5.31%      6.58%

Table D.2: Resulting performance of the 7-mer model on the test set when adding the standard deviation feature and training with both more soft iterations and more hard iterations.

method                                 identity   insertion   deletion   substitutions
20 + 5 iterations, stddev, 11 bins     80.21%     6.90%       6.08%      6.80%

Table D.3: Resulting performance of the 7-mer model on the test set when adding the standard deviation feature and lowering the number of quantization bins to N_{q,2} = 11.