DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ELECTRICAL ENGINEERING AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020
Preprocessing of Nanopore Current Signals for DNA Base Calling
JOSEF MALMSTRÖM
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Machine Learning
Date: August 31, 2020
Supervisor: Xuechun Xu
Examiner: Joakim Jaldén
School of Electrical Engineering and Computer Science
Swedish title: Förbehandling av strömsignaler från en nanopor för sekvensiering av DNA
Abstract
DNA is a molecule containing genetic information in all living organisms and many viruses. The process of determining the underlying genetic code in the DNA of an organism is known as DNA sequencing, and is commonly used for instance to study viruses, perform forensic analysis, and for medical diagnosis. One modern sequencing technique is known as nanopore sequencing. In nanopore sequencing, an electrical current signal that varies in amplitude depending on the genetic sequence is acquired by feeding a DNA strand through a nanometer-scale protein pore, a so-called nanopore. The process of then inferring the underlying genetic sequence from the raw current signal is known as base calling. Base calling is commonly modeled as a machine learning problem, typically using Deep Neural Networks (DNNs) or Hidden Markov Models (HMMs). In this thesis, we investigate how preprocessing of the raw electrical current signals can impact the performance of a subsequent base calling model. Specifically, we apply different methods for normalization, filtering, feature extraction, and quantization to the raw current signals, and evaluate the performance of these methods using a base caller built from a so-called Explicit Duration Hidden Markov Model (ED-HMM), a variation of the regular HMM. The results show that the application of various preprocessing techniques can have a moderate impact on the performance of the base caller. With appropriately chosen preprocessing methods, the performance of the studied ED-HMM base caller was improved by 2-3 percentage points compared to a conventional preprocessing scheme. Possible future research directions include exploring the generalizability of the results to deep base calling models, and evaluating other more sophisticated preprocessing methods from adjacent fields.
Sammanfattning
DNA is the molecule that carries the genetic information in all living organisms and many viruses. The process of interpreting the underlying genetic code in a DNA molecule is called DNA sequencing, and is used for instance to study viruses, perform forensic investigations, and diagnose diseases. One modern sequencing technique is called nanopore sequencing. In nanopore sequencing, an electrical current signal that varies in amplitude depending on the underlying genetic code is obtained by feeding a DNA strand through a nanometer-scale protein pore, a so-called nanopore. The process by which the underlying genetic code is determined from the raw current signal is called base calling. Base calling is usually modeled as a machine learning problem, for instance using deep artificial neural networks (DNNs) or hidden Markov models (HMMs). In this thesis, we aim to explore how preprocessing of the electrical current signals can affect the performance of a base calling model. We apply different methods for normalization, filtering, feature extraction, and quantization to the raw current signals, and evaluate the performance of the methods using a base calling model built on a so-called explicit duration hidden Markov model (ED-HMM), a variant of the regular HMM. The results show that the application of different preprocessing methods can have a moderate impact on the performance of the base calling model. With suitable choices of preprocessing methods, the performance of the studied ED-HMM base caller improved by 2-3 percentage points compared to a conventional preprocessing configuration. Possible future research directions include investigating how well these results generalize to base calling models that use deep neural networks, and exploring other more sophisticated preprocessing methods from adjacent research fields.
Acknowledgment
I would like to express my sincerest gratitude to my supervisor Xuechun Xu, as well as Professor Joakim Jaldén: firstly for welcoming me into this project at such short notice, and secondly for going above and beyond their duties to provide me with excellent guidance and advice for the thesis.

Contents
1 Introduction
  1.1 Problem statement
  1.2 Outline

2 Background
  2.1 DNA sequencing
    2.1.1 The DNA molecule
    2.1.2 Methods for DNA sequencing
    2.1.3 Nanopore sequencing
  2.2 Base calling
    2.2.1 Challenges
    2.2.2 Past methods
    2.2.3 State-of-the-art methods
    2.2.4 Preprocessing for base calling
  2.3 Machine learning for base calling
    2.3.1 Machine learning
    2.3.2 Hidden Markov Models (HMMs)
    2.3.3 Hidden Semi Markov Models (HSMMs)
    2.3.4 Explicit Duration HMM as a DNA base caller
  2.4 Time series analysis
    2.4.1 Normalization
    2.4.2 Filtering
  2.5 Quantization
    2.5.1 Uniform quantization
    2.5.2 k-means quantization
    2.5.3 Information loss optimized quantization

3 Experiments and Results
  3.1 Experiment setting
    3.1.1 Dataset
    3.1.2 Implementation and environment
    3.1.3 Model configurations
  3.2 Preprocessing experiments
    3.2.1 Normalization
    3.2.2 Filtering
    3.2.3 Feature extraction
    3.2.4 Quantization
    3.2.5 Best performing preprocessing configuration

4 Discussion
  4.1 Result summary
  4.2 Limitations
  4.3 Future work
  4.4 Ethics and society

5 Conclusion

Bibliography

A Detailed background on the ED-HMM base caller
  A.1 Hidden Markov Models (HMMs)
    A.1.1 The Baum-Welch algorithm
    A.1.2 The Viterbi algorithm
  A.2 Hidden Semi Markov Models (HSMMs)
    A.2.1 Explicit Duration HMM
  A.3 Explicit Duration HMM as a DNA base caller
    A.3.1 Model definition
    A.3.2 Model parameters
    A.3.3 Initialization
    A.3.4 Training
    A.3.5 Inference
    A.3.6 Evaluation

B Observed shift in DNA translocation speed

C Supplemental results

D Further experiments with standard deviation feature
Chapter 1
Introduction
In a world full of viruses, genetic disorders, and serial killers, the study of deoxyribonucleic acid (DNA) can be the difference between life and death. The process of determining the underlying genetic code in the DNA of an organism is known as DNA sequencing, and is commonly used to study viruses, provide medical diagnosis, perform forensic analysis, and for various other applications. Over the past decades, advancements in DNA sequencing technology have made the process easier and significantly cheaper. This has increased the availability of the technology in both research and commercial settings. The speed at which genes can be sequenced has also increased by several orders of magnitude, meaning larger sections of DNA (or even full genomes) can be sequenced in a reasonable time frame [1].
A state-of-the-art method for DNA sequencing is known as nanopore sequencing [2]. In nanopore sequencing, a strand of DNA is fed through a nanometer-scale protein pore, a so-called nanopore, which sits on a membrane over which a voltage is placed. As the DNA travels through the pore, its sequence of building blocks that make up the genetic code, the so-called nucleobases, affects the flow of ions across the membrane. Thus, by measuring the electrical current through the membrane, the sequence of nucleobases can be determined. The process of determining the underlying sequence of nucleobases given an acquired electrical current signal is known as base calling. Achieving accurate base calling is a research problem of its own, typically approached using concepts from the field of machine learning, for instance Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs).
An aspect of the base calling problem that has been left mostly unexplored is whether or not preprocessing of the electrical current signals from a nanopore could play an important role in achieving competitive performance. In this thesis, we therefore seek to evaluate different preprocessing techniques and their effects on the performance of a particular base caller.
1.1 Problem statement
The aim of the thesis project was to investigate whether the application of different preprocessing techniques to the signals from a nanopore DNA sequencer can have a significant impact on the performance of a subsequent base calling model. Specifically, the effects of applying different methods in the following preprocessing domains were to be explored:
• Normalization of the electrical current signals.
• Filtering of the signals.
• Quantization of the raw signal samples into a digital representation.
• Feature extraction on the raw current signals, to obtain alternate feature representations of the signals for the base calling model.
1.2 Outline
The rest of the thesis is structured as follows:
• In Chapter 2, relevant background on DNA sequencing, base calling, machine learning, time series analysis, and quantization is provided.
• In Chapter 3, the methods and results of the performed preprocessing experiments are presented and analyzed.
• In Chapter 4, a summary of the result analysis is provided, the signifi- cance of this work in a broader context is discussed, and potential future research directions are suggested.
• In Chapter 5, a summary of the conclusions drawn from this thesis is provided.

Chapter 2
Background
2.1 DNA sequencing
DNA is a molecule containing genetic information in all organisms and many viruses. The study of DNA has various applications in biological and medical research, medical diagnosis, forensics, and virology. The process of extracting the genetic code of a DNA molecule is commonly referred to as DNA sequencing. This section will provide relevant background information on the topic of DNA sequencing, starting with a brief review of the structure of DNA, thereafter proceeding to historic and state-of-the-art methods for DNA sequencing.
2.1.1 The DNA molecule

DNA consists of two strands that coil around each other to form the structure of a double helix. Each strand has a number of smaller units known as nucleotides. Upon each nucleotide, one of four so-called nucleobases attaches. The four nucleobases are known as cytosine (C), guanine (G), adenine (A), and thymine (T). It is the sequence of nucleobases in the DNA strand that encodes the genetic information. Additionally, each nucleobase in a strand bonds to the nucleobase in the corresponding position on the opposing strand, forming a so-called base pair. The structure of the nucleotides is such that C can only bond with G, and A can only bond with T; see Figure 2.1. This means that given only one of the two DNA strands, the sequence of nucleobases on the second strand can be directly inferred.
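As a minimal illustration of this complementarity, the base-pairing rules can be encoded in a few lines of Python (the mapping and the helper name complement_strand are ours, for illustration only):

# Base-pairing rules: A bonds with T, C bonds with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_strand(strand: str) -> str:
    """Infer the sequence of the opposing strand, base by base."""
    return "".join(COMPLEMENT[base] for base in strand)

print(complement_strand("ATACG"))  # -> "TATGC"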
Figure 2.1: Schematic visualization of the DNA double helix. Adenine (A) bonds only to thymine (T), and cytosine (C) bonds only to guanine (G).
2.1.2 Methods for DNA sequencing

The first methods for DNA sequencing, proposed in the 1970s, utilized various chemical modifications of the DNA molecule. An approach known as Maxam-Gilbert sequencing allowed identification of the sequence of base pairs through radioactive labeling of the molecule that could be detected with X-ray imaging. Another method, known as the chain-termination method, chemically modified the DNA in such a way that the different nucleobases appeared different in color.
Modern methods for DNA sequencing also utilize a variety of different techniques. A number of popular methods use various approaches to fluorescent labeling of the nucleotides, which can be detected with cameras. Others chemically modify the DNA molecule in such a way that it releases hydrogen ions, which are detected by a sensor. A state-of-the-art method is known as nanopore sequencing, described in greater detail in the following subsection.
2.1.3 Nanopore sequencing

In nanopore sequencing, a nanopore is utilized to detect the sequence of nucleobases. The DNA molecule is separated into its two strands, and one strand is fed through the nanopore. The nanopore sits on a membrane over which a voltage is placed. As the DNA strand is fed through the nanopore, the nucleobases currently inside the pore affect the electrical resistance of the membrane, causing variations in the electrical current. By measuring and analyzing the current signal, the sequence of nucleobases can then be inferred. A schematic illustration of the process is provided in Figure 2.2.
The leading producer of nanopore sequencing technology is the UK-based company Oxford Nanopore Technologies (ONT) [2]. ONT provides a number of different machines for nanopore sequencing, ranging in performance in terms of speed, read length, and on-board computational capability. For small-scale or field experiments, ONT provides the so-called Flongle, a DNA sequencer in dongle format that plugs into a regular PC.
Figure 2.2: Schematic illustration of nanopore sequencing, reproduced from [3]. Individual nucleotides passing through the pore affect the resistance in the membrane, resulting in current variations.

On the other end of the spectrum is the PromethION sequencer, which is designed for large-scale experiments and provides, among other things, a much higher throughput as well as on-board computation. A popular mid-range sequencer that provides a trade-off between these two machines in terms of speed, portability, and cost is known as the MinION.
2.2 Base calling
In nanopore sequencing, the process of translating the measured current signals into a sequence of nucleobases is commonly referred to as base calling. This task is typically framed as a machine learning problem, where the input is a time series of electrical current samples, and the target is the corresponding sequence of nucleobases.
2.2.1 Challenges

Several complicating factors make achieving high-accuracy base calling a difficult problem. Multiple adjacent nucleobases affect the current signal simultaneously as they travel through the nanopore, meaning different bases cannot be trivially mapped to an individual current level [4]. Accurately base calling long sequences of the same nucleobase, so-called homopolymers, can be especially challenging, as it is difficult to detect a transition from one set of nucleobases to another identical set of nucleobases. The speed at which the DNA strand travels through the nanopore is also non-uniform, which further complicates the mapping from current samples to bases [4]. Since the translocation speed is not constant, the number of electrical current samples originating from each nucleobase (or set of nucleobases) will vary over time. The inherent nature of the nanopore sequencing technology also results in the presence of random noise and disruptions in the current signal. Since the modulations of the current are caused by single molecules, or small groups of molecules, they are low in amplitude and therefore sensitive to noise from the surrounding environment.
2.2.2 Past methods

Historically, the problem of base calling has typically been modeled using Hidden Markov Models (HMMs) [4]. Background on HMMs and their use as base calling models is provided in section 2.3. In the typical HMM model of the base calling problem, nucleobase subsequences of length k are considered, commonly referred to as k-mers. The choice of k is generally made in relation to how many nucleotides are assumed to affect the nanopore simultaneously (typically 3-7). The hidden state of the HMM represents the k-mer currently passing through the nanopore, while the observations are simply the current signal samples, or some representation thereof. Several variations adding additional complexity also exist, such as including in the hidden state a time duration for the k-mer's presence in the nanopore.
The earliest base calling models applied an initial processing step that translated the current samples into an event-based signal, which was then used as input to the base calling model [4]. Each event in this representation summarized a segment of the raw current signal in a set of statistical metrics (e.g. mean, standard deviation, and duration). In contrast, modern base callers typically utilize an end-to-end approach that takes the raw current signal as input and predicts the nucleobase sequence.
2.2.3 State-of-the-art methods

A recent review of the state-of-the-art approaches to nanopore base calling is provided by [5]. The top performers include the models named Guppy and Flappie, developed by Oxford Nanopore Technologies (ONT), the leading company in nanopore sequencing technology [6]. A competitor is the independently developed model referred to as Chiron [7]. As stated by [5], all of the state-of-the-art base callers employ deep neural networks. Chiron utilizes an architecture that couples a convolutional neural network (CNN) with a recurrent neural network (RNN) and a connectionist temporal classification (CTC) decoder [7]. ONT's base callers have not been made public, but are known to be based on deep networks [5]. However, parts of the research community argue that an HMM-based approach could still be favorable over the use of deep networks.
According to the review by [5], Guppy generally achieves the best results of the available base callers. The most commonly used performance metric in base calling is known as identity rate, a definition of which is provided in section 2.3.4. Guppy generally attains identity rates in the range 87-91%, depending on model configuration and the data used for training.
2.2.4 Preprocessing for base calling

To the best of the author's knowledge, there have been no previous publications dedicated to studying preprocessing of nanopore signals for the purpose of DNA base calling. Several previous works on base calling do, however, state the preprocessing techniques that were used for the data. Most base callers that operate directly on the raw input signals apply some form of normalization before calling the signals. Several works, including [8], [9], [10], apply median absolute deviation (MAD) normalization to each signal read. In [7], on the other hand, each read is z-score normalized. Definitions of these normalization methods are provided in section 2.4.1. Base callers that require an initial processing of the signals into a segmented or event-based representation, such as [11], often apply more specialized normalization. In this work, however, we focus on preprocessing in the case where the base caller uses a regular signal representation, rather than a segmented or event-based encoding.
Other applicable preprocessing techniques such as feature extraction, filtering, and quantization of the signals appear to be largely unexplored in the field of DNA base calling. The author has not been able to identify any publication where these techniques were mentioned, indicating that uniform quantization was likely used without any preceding feature extraction or filtering.
2.3 Machine learning for base calling
This section provides the necessary background on machine learning (especially HMMs and variations thereof) to gain a basic understanding of the studied base caller. For the interested reader, a more detailed background on this topic is provided in Appendix A.
2.3.1 Machine learning

Machine learning is the study of algorithms that learn from data without explicit instruction. The field is generally broadly categorized into two sub-fields based on the task to be solved: supervised learning and unsupervised learning. Supervised learning assumes access to a dataset D = {(x^(i), y^(i))}_{i=1}^N of N inputs x^(i) and corresponding targets y^(i). This dataset is referred to as the training set, as it is what is used by the machine learning algorithm in its learning procedure. Given the training set, the task of supervised learning is to find a general mapping from inputs x^(i) to targets y^(i), such that for new data which was not seen during the training phase, the machine learning algorithm can accurately infer the target given only the input. In unsupervised learning, the training set consists only of inputs D = {x^(i)}_{i=1}^N. The task is then instead to infer information about the inherent structure of the input data, for instance by identifying clusters or groupings of the samples.
2.3.2 Hidden Markov Models (HMMs)

In many unsupervised learning settings, where only inputs are observed, there is a notion of additional, hidden variables (so-called latent variables) which are assumed to affect the inputs in a way that cannot be observed directly from the data. In such settings, one can attempt to explicitly account for these latent variables in the machine learning model, either as a means to improve the performance, or because inference of the latent variable itself might be of interest. A frequently used model for sequence data that has this structure is the Hidden Markov Model (HMM).
In an HMM, the modeled system is assumed to be a Markov process with hidden states. This means that at each time step, the system takes one of a number of possible hidden states with a probability dependent only on the previous state. Additionally, it is assumed that the observed data at time step t is dependent only on the state of the model at time step t. This observed data, which we previously referred to as inputs, is typically denoted as observations in this context.
Figure 2.3: A Hidden Markov Model with the hidden states S_1, S_2, ..., and observations X_1, X_2, ..., represented as a Bayesian network.

A Hidden Markov Model is commonly represented as a Bayesian network, as illustrated in Figure 2.3. There exist several variations of the HMM. A more detailed review is provided in [12].
The Baum-Welch algorithm

In order to use a Hidden Markov Model for inference, its parameters must be learned from data. The procedure for training an HMM is known as the Baum-Welch algorithm, and has its basis in a more general machine learning algorithm known as the Expectation Maximization (EM) algorithm. The EM algorithm is used to train models in which some of the variables are unobserved (i.e. latent). By taking the expectation over all latent variables (known as the E-step), a lower bound for the log-likelihood of the data in the vicinity of the current estimate of the model parameters can be found. The estimate of the model parameters is then updated to maximize this lower bound (known as the M-step). By iteratively performing the E-step and M-step, given some initial guess for the model parameters, the EM algorithm converges to a local maximum. In the Baum-Welch algorithm, expectation maximization is used in combination with recursive computations of probabilities in the HMM to learn the parameters of the model. A more detailed account of the Baum-Welch algorithm can be found in [12].
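To make the E-step and M-step concrete, the following Python/NumPy sketch implements Baum-Welch re-estimation for a basic discrete-observation HMM. It is only an illustration: the scaled forward/backward recursions, the random initialization, and all names are our own choices, and this is not the ED-HMM trainer used in this thesis.

import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=50, seed=0):
    """Estimate (pi, A, B) for a discrete-observation HMM from a single
    observation sequence, using scaled forward/backward recursions."""
    obs = np.asarray(obs)
    rng = np.random.default_rng(seed)
    # Random row-stochastic initialization of the parameters.
    A = rng.random((n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols))
    B /= B.sum(axis=1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)
    T = len(obs)
    for _ in range(n_iter):
        # E-step: scaled forward pass (alpha) ...
        alpha = np.zeros((T, n_states))
        scale = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        scale[0] = alpha[0].sum()
        alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            scale[t] = alpha[t].sum()
            alpha[t] /= scale[t]
        # ... and scaled backward pass (beta).
        beta = np.zeros((T, n_states))
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
        gamma = alpha * beta  # state posteriors P(S_t = i | x, lambda)
        xi_sum = np.zeros((n_states, n_states))  # expected transition counts
        for t in range(T - 1):
            xi_sum += (alpha[t][:, None] * A * B[:, obs[t + 1]]
                       * beta[t + 1]) / scale[t + 1]
        # M-step: re-estimate the parameters from the expected counts
        # (no smoothing is applied, for brevity).
        pi = gamma[0]
        A = xi_sum / gamma[:-1].sum(axis=0)[:, None]
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B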
The Viterbi algorithm

Given a trained HMM, one often wants to compute the most probable sequence of hidden states given a particular sequence of observations. Formally, this can be formulated as finding the state sequence (î_1, ..., î_T) such that

(î_1, ..., î_T) = argmax_{(i_1, ..., i_T)} P[S_1 = i_1, ..., S_T = i_T | x_1, ..., x_T, λ],   (2.1)
where λ denotes the parameters of the model. With N different possible states, there are N^T possible hidden state sequences that could have generated an observed sequence of length T. Thus, in general, this problem cannot be reasonably solved with a brute-force search. Instead, one can use a procedure known as the Viterbi algorithm, which utilizes dynamic programming to cover only the necessary parts of the full search space. The algorithm operates recursively for t = 1, ..., T. The key idea is that at time step t, only the most probable path that results in each state i, for i = 1, ..., N, needs to be considered further, since the state at time step t + 1 is dependent only on the state at time step t. Thus, at any time t there are only N possible candidates for the most probable sequence. As the algorithm reaches t = T, the most probable sequence can be selected from the N candidates. A more detailed account of the Viterbi algorithm can be found in [12].
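As an illustration, the Viterbi recursion for a basic discrete-observation HMM can be sketched in a few lines of Python/NumPy. Working in log-probabilities is our choice here (it avoids numerical underflow on long sequences) and is not part of the algorithm's definition.

import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable state sequence for observation indices obs, given the
    initial distribution pi, transition matrix A, and emission matrix B."""
    T, N = len(obs), len(pi)
    with np.errstate(divide="ignore"):  # log(0) = -inf marks impossible moves
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = log_pi + log_B[:, obs[0]]  # best log-prob of paths ending in each state
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: best path at i, extended i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final state; only N candidates remain at t = T.
    path = np.empty(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path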
2.3.3 Hidden Semi Markov Models (HSMMs)

In the traditional HMM, the only way to model an extended stay in some state is through repeated self-transitions. Since the self-transition probability for any given state is modeled by a single scalar, the level of nuance in the modeling of state durations is highly restricted.
A variation of the common HMM where the durations of states are modeled explicitly is known as a Hidden Semi Markov Model (HSMM) [13]. In an HSMM, each state has an associated variable duration, during which the model will remain in the state and produce additional observations. The addition of state durations adds significant complexity to the model, as the durations must be incorporated in the transition and observation probabilities. However, training and inference with the model can still be performed using extended Baum-Welch and Viterbi algorithms that follow the same principles as in the regular HMM. A detailed account of these algorithms for the HSMM is provided in [13].
Explicit Duration HMM

Different simplifying assumptions can be made with regard to how the distribution of durations is modeled in the HSMM. For instance, one could assume that

1. a transition to the current state is independent of the duration of the previous state, and

2. the duration is conditioned only on the current state.
An HSMM utilizing these assumptions is known as an Explicit Duration HMM (ED-HMM) [13]. Due to its simplicity compared to other variations of the HSMM, the ED-HMM is the most popular HSMM in many applications [13].
2.3.4 Explicit Duration HMM as a DNA base caller

In nanopore DNA sequencing, HMMs (and especially ED-HMMs) can be utilized to model the sequence of nucleobases in a DNA strand, given the observed current signal. Since the sequence of nucleobases is the unobserved quantity that we wish to infer, it corresponds to the sequence of states in the model. The measured current signal corresponds to the observations in the model. As the translocation speed of the DNA strand through the nanopore is non-uniform, ED-HMMs, which can model the duration of states with more nuance, are especially well suited to this application. In this section we proceed to describe one such ED-HMM model that can function as a base caller for nanopore sequencing.
Model definition

It is generally assumed that a number of nucleobases are inside the pore at any given time (typically between 3 and 7) and thus affect the signal simultaneously. The state representation is therefore chosen to be the sequence of k nucleobases, a so-called k-mer, currently affecting the signal. For a model using a 5-mer state representation, for instance, the current state might be the sequence 'ATACG'. The model remains in the same state for a duration d ∈ Z+, up to a maximum of d = D steps. Henceforth, we refer to the tuple (S, d) of a state S and its associated duration d as a so-called super-state. The duration d functions as a timer for the state in the following way. For durations d = 2, ..., D, super-state (S, d) always transitions into the super-state (S, d − 1), i.e. the same state but with a duration one step shorter. From the super-state (S, 1), the model transitions into some new super-state (Ŝ, d), i.e. a new state with a reset timer. This transition models a shift of a single nucleobase in/out of the k-mer representing state S. Therefore, compared to a standard ED-HMM, there are additional constraints on which state transitions are possible, since the new k-mer must omit the first nucleobase of the previous k-mer, and append a new nucleobase at the end. For instance, in a 3-mer state representation, 'TTT' can only transition to 'TTA', 'TTC', 'TTG' or 'TTT'. Figure 2.4 shows an illustration of a lattice representation of the model, taking these constraints into account.
Figure 2.4: The ED-HMM illustrated in a lattice representation with added k-mer transition constraints. The k-mer corresponding to the new state must omit the first nucleobase of the current k-mer, and append a new nucleobase at the end. Note that the figure shows a 3-mer state representation for the sake of brevity. In practice, a 5-mer or 7-mer state is used.
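The k-mer transition constraint can be made concrete with a short Python sketch (the helper name successors is ours, for illustration):

BASES = "ACGT"

def successors(kmer: str) -> list:
    """States reachable from state `kmer` when its timer runs out: drop the
    first nucleobase and append one new nucleobase at the end."""
    return [kmer[1:] + base for base in BASES]

print(successors("TTT"))  # -> ['TTA', 'TTC', 'TTG', 'TTT']

Hence each k-mer state has exactly four possible successor states, rather than transitions to any of the 4^k states as in an unconstrained model.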
Training

Much like with the regular HMM, the parameters of an ED-HMM can be learned from data using a variation of the Baum-Welch algorithm. A detailed account of how the algorithm works for an ED-HMM and other HSMMs can be found in [13]. In contrast to the Baum-Welch algorithm, which utilizes probabilistic modeling of the alignments, another training algorithm, which considers only the most probable alignment at any given training step, may also be used. We henceforth refer to training with the latter algorithm as hard training, and training with the modified Baum-Welch algorithm as soft training.
Inference

Base calling consists of finding the most probable sequence of nucleobases (i.e. states) given samples of the nanopore current signal (i.e. observations). Given a trained ED-HMM, much like in the regular HMM, this sequence can be found using the Viterbi algorithm. A more detailed account of how the Viterbi algorithm works for the ED-HMM and other HSMMs can be found in [13].
Evaluation

After performing inference with any base caller, one commonly wants to compare the inferred sequence of nucleobases to the ground-truth reference sequence in order to evaluate the performance of the base calling model in question. Performing such a comparison between nucleobase sequences requires that the sequences first be aligned, so that each base is correctly compared to the corresponding base in the other sequence. This alignment is herein performed using the publicly available sequence alignment software tool known as minimap2 [14].
Once the inferred sequence has been aligned with the reference sequence, performance is measured with a metric known as identity rate, which in turn can be split into an insertion rate, a deletion rate, and a substitution rate. The insertion, deletion, and substitution rates measure how many bases have been incorrectly inserted, deleted, and substituted in the inferred sequence, in relation to the reference sequence. They are respectively defined as
insertion rate = ic / (mc + ic + dc + sc),   (2.2)
deletion rate = dc / (mc + ic + dc + sc),   (2.3)
substitution rate = sc / (mc + ic + dc + sc),   (2.4)

where mc = count(matches), ic = count(insertions), dc = count(deletions), and sc = count(substitutions). The identity rate is defined as

identity rate = mc / (mc + ic + dc + sc),   (2.5)

and thus provides a measure of similarity between the inferred sequence and the reference sequence, such that the identity rate is 1 if the sequences are exactly identical, and 0 if they do not match in any base. An example of how the different metrics are computed when applied to two short nucleobase sequences is provided in Figure 2.5.
Figure 2.5: The different performance metrics and their relation illustrated on two short example sequences.
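As a worked example, the rates in equations (2.2)-(2.5) can be computed directly from the four alignment counts; the helper below is a hypothetical illustration, not part of minimap2.

def alignment_rates(mc: int, ic: int, dc: int, sc: int) -> dict:
    """Identity, insertion, deletion, and substitution rates from the counts
    of matches (mc), insertions (ic), deletions (dc), substitutions (sc)."""
    total = mc + ic + dc + sc
    return {
        "identity": mc / total,
        "insertion": ic / total,
        "deletion": dc / total,
        "substitution": sc / total,
    }

print(alignment_rates(mc=85, ic=3, dc=5, sc=7)["identity"])  # -> 0.85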
2.4 Time series analysis
A time series is a sequence of data points that are listed, or otherwise referred to, in time order. All time series are discrete in time, and points are most commonly evenly spaced in time. Points in a time series may, however, take values from a continuous domain. Time series analysis is the practice of processing time series with the purpose of extracting useful statistics or other characteristics. Methods for time series analysis are commonly divided into two categories: time-domain methods, i.e. methods that operate directly on the time-ordered samples, and frequency-domain methods, i.e. methods that instead operate on a time series in terms of its frequency content.
Below, we selectively provide brief background on some relevant time series analysis methods in the time domain.
2.4.1 Normalization

Normalization, in the broad context of statistics and adjacent topics, commonly refers to the rescaling of a set of values into some common scale. Such a rescaling can be done in any number of ways depending on the application, but often seeks to transform values into a representation that simplifies comparisons between values, or other relevant operations. In many settings, the application of normalization also serves the simple purpose of eliminating constant factors or terms for notational simplicity. In the setting of machine learning it has been shown, for instance by [15] and [16], that applying appropriate normalization to the input data can play a critical role in attaining stable model training and achieving optimal performance. Intuitively, the process of normalizing input data ensures that different dimensions, or features, of the input do not have vastly different scales, mitigating dominance by the particular features that happen to have a broader range of values. The importance of normalization can therefore vary vastly, depending on the type of features used in the input data.
A variety of different techniques are commonly used for normalization of time series data. Given a time series x(t), t = 0, ..., T, a normalization method produces a corresponding normalized series x_norm(t), t = 0, ..., T. Below follows a brief review of some of the most common normalization methods. Most methods extract statistical metrics from the entirety of the series, and operate on each value of the series using these metrics. Note that any such method can also be applied in sliding windows over the series, thus normalizing segments of the series separately, based only on local properties.
In this thesis, the z-score and median absolute deviation (MAD) normalization methods specifically will be studied as candidate normalization methods for the input data in DNA base calling.
Min-max normalization

With min-max normalization, the data can be scaled to an arbitrary range [l, h] (e.g. [0, 1] or [−1, 1]) [17]. Each value x_norm(t) in the normalized series is computed as

x_norm(t) = (h − l) · (x(t) − min_t x(t)) / (max_t x(t) − min_t x(t)) + l.   (2.6)
Decimal scaling normalization

Decimal scaling normalization moves the decimal point of values by scaling them in such a way that max_t |x_norm(t)| < 1 [17]. Specifically, each value x_norm(t) in the normalized series is computed as

x_norm(t) = x(t) / 10^d,   (2.7)

where d is the smallest integer such that max_t |x_norm(t)| < 1.
Median normalization

In median normalization, values are normalized by the median of the series [17]. This has the advantage of not being affected by the magnitude of extreme outliers. Each value x_norm(t) in the normalized series is computed simply as

x_norm(t) = x(t) / median,   (2.8)

where median is the median of x(t), t = 0, ..., T.

z-score normalization

In z-score or standard scale normalization, values are scaled to an interval centered on 0 by subtracting the sample mean, and dividing by the sample standard deviation [17]. Thus, each value x_norm(t) in the normalized series is computed as

x_norm(t) = (x(t) − µ) / σ,   (2.9)

where µ is the sample mean and σ is the sample standard deviation of x(t), t = 0, ..., T.
Median absolute deviation normalization

MAD normalization scales values similarly to z-score normalization, but using the median and MAD instead of the mean and standard deviation [18]. This has the advantage of being more resilient to outliers. Each value x_norm(t) in the normalized series is computed as

x_norm(t) = (x(t) − median) / MAD,   (2.10)

where median is the median of x(t), t = 0, ..., T, and MAD = median_t |x(t) − median|.
Sigmoid normalization

Sigmoid normalization is arguably the simplest form of normalization, as it does not account for any statistical properties of the time series, but simply maps all values to the interval [0, 1] using the sigmoid function [17]. Each value x_norm(t) in the normalized series is computed simply as

x_norm(t) = 1 / (1 + e^{−x(t)}).   (2.11)
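For concreteness, minimal NumPy sketches of three of the reviewed methods are given below. The function names are ours; z-score and MAD normalization are the two candidate methods studied in this thesis.

import numpy as np

def minmax_normalize(x: np.ndarray, l: float = 0.0, h: float = 1.0) -> np.ndarray:
    """Min-max normalization to the range [l, h], equation (2.6)."""
    return (h - l) * (x - x.min()) / (x.max() - x.min()) + l

def zscore_normalize(x: np.ndarray) -> np.ndarray:
    """z-score normalization, equation (2.9)."""
    return (x - x.mean()) / x.std()

def mad_normalize(x: np.ndarray) -> np.ndarray:
    """MAD normalization, equation (2.10); more robust to the outlier
    spikes that are common in raw nanopore reads."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / mad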
2.4.2 Filtering

In order to mitigate the effects of noise or other artefacts, a variety of filtering techniques can be applied to time series. Filters can commonly be defined in the frequency domain, so as to filter out artefacts of a known frequency using, for instance, a low-pass or a high-pass filter. Here, we instead choose to focus on filtering methods defined in the time domain. Two simple forms of such filtering methods are median and mean filtering.
In median filtering, the values in the filtered series are computed as the median of the raw series in a sliding window of fixed size. Specifically, if the size of the sliding window is w ∈ Z+, the values of the filtered series x_f can be expressed as

x_f(t) = median({x(t') | 0 ≤ t' < w})                                  if 0 ≤ t < ceil(w/2),
x_f(t) = median({x(t') | T − w < t' ≤ T})                              if T − ceil(w/2) < t ≤ T,
x_f(t) = median({x(t') | t − floor(w/2) ≤ t' ≤ t + floor(w/2)})        otherwise.

Here we use the convention that at the start (end), where the sliding window only partially overlaps with the series, the filtered series is set to be the median of the first (last) w values.
Analogously, a mean filter can be defined in an identical way, replacing only the median with the mean. The mean filter can be made more computa- tionally efficient by utilizing a running sum to which values can be iteratively added and subtracted to represent the current window.
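A minimal NumPy sketch of both filters, using the boundary convention defined above, is given below. In the mean filter, the running sum is realized with a precomputed cumulative sum, so each window mean still costs O(1); all implementation details are our own illustrative choices.

import numpy as np

def _window_bounds(t: int, T: int, w: int):
    """Window [lo, hi) at position t, following the boundary convention."""
    half, edge = w // 2, -(-w // 2)  # floor(w/2) and ceil(w/2)
    if t < edge:                      # start: first w samples
        return 0, w
    if t > T - edge - 1:              # end: last w samples
        return T - w, T
    return t - half, t + half + 1     # interior: centered window

def median_filter(x: np.ndarray, w: int) -> np.ndarray:
    T = len(x)
    out = np.empty(T, dtype=float)
    for t in range(T):
        lo, hi = _window_bounds(t, T, w)
        out[t] = np.median(x[lo:hi])
    return out

def mean_filter(x: np.ndarray, w: int) -> np.ndarray:
    T = len(x)
    csum = np.concatenate(([0.0], np.cumsum(x)))  # running sum of the series
    out = np.empty(T, dtype=float)
    for t in range(T):
        lo, hi = _window_bounds(t, T, w)
        out[t] = (csum[hi] - csum[lo]) / (hi - lo)  # O(1) window mean
    return out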
2.5 Quantization
Quantization is the process of mapping values in a large (often continuous) set into a discrete, countable set. In digital signal processing, analog signals, such as electrical currents, must be quantized in order to be processed digitally. Quantization is also a means of data compression, and inherently forms the basis of nearly all lossy compression algorithms. The application of quantization to multi-dimensional data is commonly referred to as vector quantization.
Intuitively, the process of defining a quantization scheme can be interpreted as finding a partitioning of the input space (usually R or R^n) into a number of regions. In the context of quantization, the regions that make up this partitioning of the input space are commonly referred to as quantization bins. By choosing a fixed but arbitrary ordering of the quantization bins, each bin can be identified by an integer index. To get the quantized representation of a point in the input space, one simply checks which bin the point resides in, and selects the corresponding index. In this setting, we will describe each quantizer as a function Q : R^n → {0, ..., N_q − 1}, mapping each input value to the index of one of the N_q quantization bins.
Below we provide a brief review of two relevant and common quantization and vector quantization methods. We also briefly review a method developed specifically for the purpose of quantizing multi-dimensional decision variables for classification tasks. In our experiments, all three methods will be evaluated as candidate quantization methods for the base calling task.
2.5.1 Uniform quantization

Uniform quantization is arguably the simplest possible method for quantizing data. In this quantization method, the input space is partitioned into regions of uniform size, up to a set of specified maximum/minimum levels. In the scalar setting, given a choice of maximum level l_max ∈ R, minimum level l_min ∈ R, and a number of quantization bins N_q, a uniform quantizer quantizes a sample x as

Q(x) = min(N_q − 1, round(N_q · max(0, x − l_min) / (l_max − l_min))),   (2.12)
where round denotes rounding to the nearest integer. Here the max and min operators ensure that any input value x < l_min or x > l_max is mapped to the bin corresponding to l_min and l_max, respectively.
While the above definition is only applicable in the scalar setting, it can trivially be extended to multi-dimensional input, where bins are created by uniformly partitioning along each dimension. For instance, a uniform vector quantizer in R^n with ∏_{i=1}^n N_q^(i) bins can be implemented by first applying a scalar quantizer separately to each of the n dimensions of the input, resulting in output that lies in S_q = {0, ..., N_q^(1) − 1} × ··· × {0, ..., N_q^(n) − 1}, where × denotes the Cartesian product. By then deciding an arbitrary ordering of all vectors in S_q, and assigning a bin index to each output sample according to this ordering, output samples in {0, ..., ∏_{i=1}^n N_q^(i) − 1} are obtained. This scheme results in hyperrectangular quantization bins, as illustrated in Figure 2.6.
Figure 2.6: Illustration of a uniform vector quantizer in R^2, with N_q^(1) = 3 and N_q^(2) = 6. The point x is quantized as the index of the rectangular quantization bin in which it resides.
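A minimal NumPy sketch of the scalar quantizer in equation (2.12) is given below, together with one possible ordering of the vectors in S_q for the vector case (a mixed-radix encoding; the ordering is arbitrary, and this particular choice is ours).

import numpy as np

def uniform_quantize(x, l_min: float, l_max: float, n_bins: int):
    """Scalar uniform quantizer, equation (2.12): values outside
    [l_min, l_max] are mapped to the outermost bins."""
    idx = np.round(n_bins * np.maximum(0.0, np.asarray(x, dtype=float) - l_min)
                   / (l_max - l_min))
    return np.minimum(n_bins - 1, idx).astype(int)

def uniform_vector_quantize(x, l_min, l_max, bins):
    """Uniform vector quantizer: quantize each dimension separately, then
    combine the per-dimension indices into one bin index (mixed radix)."""
    index, base = 0, 1
    for i in range(len(x)):
        index += int(uniform_quantize(x[i], l_min[i], l_max[i], bins[i])) * base
        base *= bins[i]
    return index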
2.5.2 k-means quantization

k-means is an unsupervised learning algorithm that finds a segmentation of a set of multi-dimensional data points into k ∈ Z+ clusters. In the resulting clustering, each data point belongs to the cluster for which the squared Euclidean distance between the point and the mean of the points in the cluster is the smallest. A formal description of the basic k-means algorithm can be found in Algorithm 1. Given some initialization of the cluster means c_1, ..., c_k, the algorithm iteratively assigns points to corresponding sets C_1, ..., C_k, updates the cluster means as the means of all points in the corresponding sets, and repeats this process until the point assignments no longer change. The cluster means can in theory be initialized randomly; however, in practice more deliberate initialization, such as the scheme proposed by [19] in the so-called k-means++ algorithm, yields faster convergence and more accurate clusters.
Algorithm 1: The standard k-means algorithm.

Input: Number of desired clusters k ∈ Z+; data points x_1, ..., x_N ∈ R^n; initial cluster means c_1^(0), ..., c_k^(0) ∈ R^n.
Output: Learned cluster means c_1, ..., c_k and corresponding disjoint sets C_1, ..., C_k jointly containing the points x_1, ..., x_N.

while assignment step yields change in point assignments do
    Assignment step. Assign each point to the cluster with the closest cluster mean:
        C_i^(t) = {x_p : ||x_p − c_i^(t)||^2 ≤ ||x_p − c_j^(t)||^2 ∀ j ∈ {1, 2, ..., k}}
    Update step. Recompute the cluster means with the new point assignments:
        c_i^(t+1) = (1 / |C_i^(t)|) Σ_{x_p ∈ C_i^(t)} x_p
end
The k-means algorithm was first proposed as an approach to vector quantization. By running the k-means algorithm with k = N_q on some set of input samples, the set of cluster means m_1, ..., m_{N_q} is acquired. The quantization function for new inputs x can then be defined as

Q(x) = argmin_{i ∈ {0, ..., N_q − 1}} ||x − m_{i+1}||^2,

i.e. each input is mapped to the index of the nearest cluster mean.
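A minimal NumPy sketch of Algorithm 1 and the resulting quantization function is given below. Random initialization is used for brevity, where k-means++ [19] would be preferred in practice; all names are our own illustrative choices.

import numpy as np

def kmeans_fit(x: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    """Algorithm 1: learn k cluster means from data points x of shape (N, n)."""
    rng = np.random.default_rng(seed)
    means = x[rng.choice(len(x), size=k, replace=False)]  # random initialization
    for _ in range(max_iter):
        # Assignment step: index of the closest cluster mean per point.
        d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        # Update step: recompute each mean from its assigned points
        # (an empty cluster keeps its previous mean).
        new_means = np.array([x[assign == i].mean(axis=0)
                              if np.any(assign == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):  # assignments have stabilized
            break
        means = new_means
    return means

def kmeans_quantize(x: np.ndarray, means: np.ndarray) -> np.ndarray:
    """Quantize samples x of shape (M, n) as the (zero-based) index of the
    nearest cluster mean."""
    d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)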