DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ELECTRICAL ENGINEERING
AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Preprocessing of Nanopore Current Signals for DNA Base Calling

JOSEF MALMSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Machine Learning
Date: August 31, 2020
Supervisor: Xuechun Xu
Examiner: Joakim Jaldén
School of Electrical Engineering and Computer Science
Swedish title: Förbehandling av strömsignaler från en nanopor för sekvensiering av DNA


Abstract

DNA is a molecule containing genetic information in all living organisms and many viruses. The process of determining the underlying genetic code in the DNA of an organism is known as DNA sequencing, and is commonly used for instance to study viruses, perform forensic analysis, and for medical diagnosis. One modern sequencing technique is known as nanopore sequencing. In nanopore sequencing, an electrical current signal that varies in amplitude depending on the genetic sequence is acquired by feeding a DNA strand through a nanometer scale protein pore, a so-called nanopore. The process of then inferring the underlying genetic sequence from the raw current signal is known as base calling. Base calling is commonly modeled as a machine learning problem, typically using Deep Neural Networks (DNNs) or Hidden Markov Models (HMMs). In this thesis, we seek to investigate how preprocessing of the raw electrical current signals can impact the performance of a subsequent base calling model. Specifically, we apply different methods for normalization, filtering, feature extraction, and quantization to the raw current signals, and evaluate the performance of these methods using a base caller built from a so-called Explicit Duration Hidden Markov Model (ED-HMM), a variation of the regular HMM. The results show that the application of various preprocessing techniques can have a moderate impact on the performance of the base caller. With appropriately chosen preprocessing methods, the performance of the studied ED-HMM base caller was improved by 2-3 percentage points, compared to a conventional preprocessing scheme. Possible future research directions for instance include exploring the generalizability of the results to deep base calling models, and evaluating other more sophisticated preprocessing methods from adjacent fields.

Sammanfattning

DNA är den molekyl som bär den genetiska informationen i alla levande organismer och många virus. Processen genom vilken man tolkar den underliggande genetiska koden i en DNA-molekyl kallas för DNA-sekvensiering, och används exempelvis för att studera virus, utföra rättsmedicinska undersökningar, och för att diagnostisera sjukdomar. En modern sekvensieringsteknik kallas för nanopor-sekvensiering (eng: nanopore sequencing). I nanopor-sekvensiering erhålls en elektrisk strömsignal som varierar i amplitud beroende på den underliggande genetiska koden genom att inmata en DNA-sträng genom en proteinpor i nanometer-skala, en så kallad nanopor. Processen genom vilken den underliggande genetiska koden bestäms från den obearbetade strömsignalen kallas för basbestämning (eng: base calling). Basbestämning modelleras vanligen som ett maskininlärningsproblem, exempelvis med hjälp av djupa artificiella neuronnät (DNNs) eller dolda Markovmodeller (HMMs). I det här examensarbetet ämnar vi att utforska hur förbehandling av de elektriska strömsignalerna kan påverka prestandan hos en basbestämningsmodell. Vi applicerar olika metoder för normalisering, filtrering, mönsterextraktion (eng: feature extraction), och kvantisering på de obearbetade strömsignalerna, och utvärderar prestandan av metoderna med en basbestämningsmodell som använder sig av en så kallad dold Markovmodell med explicit varaktighet (ED-HMM), en variant av en vanlig HMM. Resultaten visar att tillämpningen av olika förbehandlingsmetoder kan ha en måttlig inverkan på basbestämningsmodellens prestanda. Med lämpliga val av förbehandlingsmetoder ökade prestandan hos den studerade ED-HMM-basbestämningsmodellen med 2-3 procentenheter jämfört med en konventionell förbehandlingskonfiguration. Möjliga framtida forskningsriktningar inkluderar att undersöka hur väl dessa resultat generaliserar till basbestämningsmodeller som använder djupa neuronnät, och att utforska andra mer sofistikerade förbehandlingsmetoder från närliggande forskningsområden.

Acknowledgment

I would like to express my sincerest gratitude to my supervisor Xuechun Xu, as well as Professor Joakim Jaldén. Firstly for welcoming me into this project at such short notice, and secondly for going above and beyond their duties to provide me with excellent guidance and advice for the thesis.

Contents

1 Introduction
  1.1 Problem statement
  1.2 Outline

2 Background
  2.1 DNA sequencing
    2.1.1 The DNA molecule
    2.1.2 Methods for DNA sequencing
    2.1.3 Nanopore sequencing
  2.2 Base calling
    2.2.1 Challenges
    2.2.2 Past methods
    2.2.3 State-of-the-art methods
    2.2.4 Preprocessing for base calling
  2.3 Machine learning for base calling
    2.3.1 Machine learning
    2.3.2 Hidden Markov Models (HMMs)
    2.3.3 Hidden Semi Markov Models (HSMMs)
    2.3.4 Explicit Duration HMM as a DNA base caller
  2.4 Time series analysis
    2.4.1 Normalization
    2.4.2 Filtering
  2.5 Quantization
    2.5.1 Uniform quantization
    2.5.2 k-means quantization
    2.5.3 Information loss optimized quantization

3 Experiments and Results
  3.1 Experiment setting
    3.1.1 Dataset
    3.1.2 Implementation and environment
    3.1.3 Model configurations
  3.2 Preprocessing experiments
    3.2.1 Normalization
    3.2.2 Filtering
    3.2.3 Feature extraction
    3.2.4 Quantization
    3.2.5 Best performing preprocessing configuration

4 Discussion
  4.1 Result summary
  4.2 Limitations
  4.3 Future work
  4.4 Ethics and society

5 Conclusion

Bibliography

A Detailed background on the ED-HMM base caller
  A.1 Hidden Markov Models (HMMs)
    A.1.1 The Baum-Welch algorithm
    A.1.2 The Viterbi algorithm
  A.2 Hidden Semi Markov Models (HSMMs)
    A.2.1 Explicit Duration HMM
  A.3 Explicit Duration HMM as a DNA base caller
    A.3.1 Model definition
    A.3.2 Model parameters
    A.3.3 Initialization
    A.3.4 Training
    A.3.5 Inference
    A.3.6 Evaluation

B Observed shift in DNA translocation speed

C Supplemental results

D Further experiments with standard deviation feature

Chapter 1

Introduction

In a world full of viruses, genetic disorders, and serial killers, the study of deoxyribonucleic acid (DNA) can be the difference between life and death. The process of determining the underlying genetic code in the DNA of an organism is known as DNA sequencing, and is commonly used to study viruses, provide medical diagnosis, perform forensic analysis, and for various other applications. Over the past decades, advancements in DNA sequencing technology have made the process easier, and significantly cheaper. This has increased the availability of the technology both in research and commercial settings. The speed at which genes can be sequenced has also increased by several orders of magnitude, meaning larger sections of DNA (or even full genomes) can be sequenced in a reasonable time frame [1]. A state-of-the-art method for DNA sequencing is known as nanopore sequencing [2]. In nanopore sequencing, a strand of DNA is fed through a nanometer-scale protein pore, a so-called nanopore, and through an adjacent membrane over which a voltage is placed. As the DNA travels through the pore, the sequence of building blocks that make up the genetic code, the so-called nucleobases, affects the flow of ions across the membrane. Thus, by measuring the electrical current through the membrane, the sequence of nucleobases can be determined. The process of determining the underlying sequence of nucleobases given an acquired electrical current signal is known as base calling. The problem of achieving accurate base calling is a research problem of its own, typically approached using concepts from the field of machine learning, for instance Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs).


An aspect of the base calling problem that has been left mostly unexplored is whether or not preprocessing of the electrical current signals from a nanopore could play an important role in achieving competitive performance. In this thesis, we therefore seek to evaluate different preprocessing techniques and their effects on the performance of a particular base caller.

1.1 Problem statement

The aim of the thesis project was to investigate whether the application of different preprocessing techniques on the signals from a nanopore DNA sequencer can have a significant impact on the performance of a subsequent base calling model. Specifically, the effects of applying different methods in the following preprocessing domains were to be explored:

• Normalization of the electrical current signals.

• Filtering of the signals.

• Quantization of the raw signal samples into a digital representation.

• Feature extraction on the raw current signals, to obtain alternate feature representations of the signals for the base calling model.

1.2 Outline

The rest of the thesis is structured as follows:

• In Chapter 2, relevant background on DNA sequencing, base calling, machine learning, time series analysis, and quantization is provided.

• In Chapter 3, the methods and results of the performed preprocessing experiments are presented and analyzed.

• In Chapter 4, a summary of the result analysis is provided, the significance of this work in a broader context is discussed, and potential future research directions are suggested.

• In Chapter 5, a summary of the conclusions drawn from this thesis is provided.

Chapter 2

Background

2.1 DNA sequencing

DNA is a molecule containing genetic information in all organisms and many viruses. The study of DNA has various applications in biological and medical research, medical diagnosis, forensics, and virology. The process of extracting the genetic code of a DNA molecule is commonly referred to as DNA sequencing. This section will provide relevant background information on the topic of DNA sequencing, starting with a brief review of the structure of DNA, thereafter proceeding to historic and state-of-the-art methods for DNA sequencing.

2.1.1 The DNA molecule

DNA consists of two strands that coil around each other to form the structure of a double helix. Each strand has a number of smaller units known as nucleotides. Upon each nucleotide, one of four so-called nucleobases attaches. The four nucleobases are known as cytosine (C), guanine (G), adenine (A), and thymine (T). It is the sequence of nucleobases in the DNA strand that encodes the genetic information. Additionally, each nucleobase in a strand bonds to the nucleobase in the corresponding position on the opposing strand, forming a so-called base pair. The structure of the nucleotides is such that C can only bond with G, and A can only bond with T, see Figure 2.1. This means that given only one of the two DNA strands, the sequence of nucleobases on the second strand can be directly inferred.


Figure 2.1: Schematic visualization of the DNA double helix. Adenine (A) bonds only to thymine (T), and cytosine (C) bonds only to guanine (G).

2.1.2 Methods for DNA sequencing

The first methods for DNA sequencing, proposed in the 1970s, utilized various chemical modifications of the DNA molecule. An approach known as Maxam-Gilbert sequencing allowed identification of the sequence of base pairs through radioactive labeling of the molecule that could be detected with X-ray imaging. Another method, known as the chain-termination method, chemically modified the DNA in such a way that the different nucleobases appeared different in color. Modern methods for DNA sequencing also utilize a variety of different techniques. A number of popular methods use various approaches to fluorescent labeling of the nucleotides, which can be detected with cameras. Others chemically modify the DNA molecule in such a way that it releases hydrogen ions which are detected by a sensor. A state-of-the-art method is known as nanopore sequencing, described in greater detail in the following subsection.

2.1.3 Nanopore sequencing

In nanopore sequencing, a nanopore is utilized to detect the sequence of nucleobases. The DNA molecule is separated into its two strands, and one strand is fed through the nanopore. The nanopore also sits on a membrane over which a voltage is placed. As the DNA strand is fed through the nanopore, the nucleobases currently inside the pore affect the electrical resistance of the membrane, causing variations in the electrical current. By measuring and analyzing the current signal, the sequence of nucleobases can then be inferred. A schematic illustration of the process is provided in Figure 2.2.

The leading producer of nanopore sequencing technology is the UK-based company Oxford Nanopore Technologies (ONT) [2]. ONT provides a number of different machines for nanopore sequencing, ranging in performance in terms of speed, read length, and on-board computational capability. For small-scale or field experiments, ONT provides the so-called Flongle, a DNA sequencer in dongle format that plugs into a regular PC. On the other end of the spectrum is the PromethION sequencer, which is designed for large-scale experiments and provides, among other things, a much higher throughput as well as on-board computation. A popular mid-range sequencer that provides a trade-off between these two machines in terms of speed, portability, and cost is known as the MinION.

Figure 2.2: Schematic illustration of nanopore sequencing, reproduced from [3]. Individual nucleotides passing through the pore affect the resistance in the membrane, resulting in current variations.

2.2 Base calling

In nanopore sequencing, the process of translating the measured current signals into a sequence of nucleobases is commonly referred to as base calling. This task is typically framed as a machine learning problem, where the input is a time series of electrical current samples, and the target is the corresponding sequence of nucleobases.

2.2.1 Challenges

Several complicating factors make achieving high accuracy base calling a difficult problem. Multiple adjacent nucleobases affect the current signal simultaneously as they travel through the nanopore, meaning different bases cannot be trivially mapped to an individual current level [4]. Accurately base calling long sequences of the same nucleobase, so-called homopolymers, can be especially challenging, as it is difficult to detect a transition from one set of nucleobases to another identical set of nucleobases. The speed at which the DNA strand travels through the nanopore is also non-uniform, which further complicates the mapping from current samples to bases [4]. Since the translocation speed is not constant, the number of electrical current samples originating from each nucleobase (or set of nucleobases) will vary over time. The inherent nature of the nanopore sequencing technology also results in the presence of random noise and disruptions in the current signal. Since the modulations of the current are caused by single molecules, or small groups of molecules, they are low in amplitude and therefore sensitive to noise from the surrounding environment.

2.2.2 Past methods

Historically, the problem of base calling has typically been modeled using Hidden Markov Models (HMMs) [4]. Background on HMMs and their use as base calling models is provided in section 2.3. In the typical HMM model of the base calling problem, nucleobase subsequences of length k are considered, commonly referred to as k-mers. The choice of k is generally made in relation to how many nucleotides are assumed to affect the nanopore simultaneously (typically 3-7). The hidden state of the HMM represents the k-mer currently passing through the nanopore, while the observations are simply the current signal samples, or some representation thereof. Several variations adding additional complexities also exist, such as including a time duration for the k-mer's presence in the nanopore in the hidden state. The earliest base calling models applied an initial processing step that translated the current samples into an event-based signal, which was then used as input to the base calling model [4]. Each event in this representation summarized a segment of the raw current signal in a set of statistical metrics (e.g. mean, standard deviation, and duration). In contrast, modern base callers typically utilize an end-to-end approach that takes the raw current signal as input and predicts the nucleobase sequence.

2.2.3 State-of-the-art methods

A recent review of the state-of-the-art approaches to nanopore base calling is provided by [5]. The top performers include models named Guppy and Flappie, developed by Oxford Nanopore Technologies (ONT), the leading company in nanopore sequencing technology [6]. A competitor is the independently developed model referred to as Chiron [7]. As stated by [5], all of the state-of-the-art base callers employ deep neural networks. Chiron utilizes an architecture that couples a convolutional neural network (CNN) with a recurrent neural network (RNN) and a connectionist temporal classification (CTC) decoder [7]. ONT's base callers have not been made public, but are known to be based on deep networks [5]. However, parts of the research community argue that a HMM-based approach could still be favorable over the use of deep networks. According to the review by [5], Guppy generally achieves the best results of the available base callers. The most commonly used performance metric in base calling is known as identity rate, of which a definition is provided in section 2.3.4. Guppy generally attains identity rates in the range 87-91 %, depending on model configuration and the data used for training.

2.2.4 Preprocessing for base calling

To the best of the author's knowledge, there have been no previous publications dedicated to studying preprocessing of nanopore signals for the purpose of DNA base calling. Several previous works on base calling do however state the preprocessing techniques that were used for the data. Most base callers that operate directly on the raw input signals apply some form of normalization before calling the signals. Several works including [8], [9], [10] apply median absolute deviation (MAD) normalization to each signal read. In [7], on the other hand, each read is z-score normalized. Definitions of these normalization methods are provided in section 2.4.1. Base callers that require an initial processing of the signals into a segmented or event based representation, such as [11], often apply more specialized normalization. In this work however, we focus on preprocessing in the case where the base caller uses a regular signal representation, rather than a segmented or event based encoding. Other applicable preprocessing techniques such as feature extraction, filtering and quantization of the signals appear to be largely unexplored in the field of DNA base calling. The author has not been able to identify any publication where these techniques were mentioned, indicating that uniform quantization was likely used without any preceding feature extraction or filtering.

2.3 Machine learning for base calling

This section provides the necessary background on machine learning (especially HMMs and variations thereof) to gain a basic understanding of the studied base caller. For the interested reader, a more detailed background on this topic is provided in Appendix A.

2.3.1 Machine learning

Machine learning is the study of algorithms that learn from data without explicit instruction. The field is generally broadly categorized into two sub-fields based on the task to be solved: supervised learning and unsupervised learning. Supervised learning assumes access to a dataset D = {(x^(i), y^(i))}_{i=1}^N of N inputs x^(i) and corresponding targets y^(i). This dataset is referred to as the training set, as it is what is used by the machine learning algorithm in its learning procedure. Given the training set, the task of supervised learning is to find a general mapping from inputs x^(i) to targets y^(i), such that for new data which was not seen during the training phase, the machine learning algorithm can accurately infer the target given only the input. In unsupervised learning, the training set consists only of inputs D = {x^(i)}_{i=1}^N. The task is then instead to infer information about the inherent structure of the input data, for instance by identifying clusters or groupings of the samples.

2.3.2 Hidden Markov Models (HMMs)

In many unsupervised learning settings, where only inputs are observed, there is a notion of additional, hidden variables (so-called latent variables) which are assumed to affect the inputs in a way that cannot be observed directly from the data. In such settings, one can attempt to explicitly account for these latent variables in the machine learning model, either as a means to improve the performance, or because inference of the latent variable itself might be of interest. A frequently used model for sequence data that has this structure is the Hidden Markov Model (HMM).

In a HMM, the modeled system is assumed to be a Markov process with hidden states. This means that at each time step, the system takes one of a number of possible hidden states with a probability dependent only on the previous state. Additionally, it assumes that the observed data at time step t is dependent only on the state of the model at time step t. This observed data, which we previously referred to as inputs, is typically denoted as observations in this context. A Hidden Markov Model is commonly represented as a Bayesian network, as illustrated in Figure 2.3. There exist several variations of the HMM. A more detailed review is provided in [12].

Figure 2.3: A Hidden Markov Model with the hidden states S1, S2, ..., and observations X1, X2, ..., represented as a Bayesian network.

The Baum-Welch algorithm

In order to use a Hidden Markov Model for inference, its parameters must be learned from data. The procedure for training a HMM is known as the Baum-Welch algorithm, and has its basis in a more general machine learning algorithm known as the Expectation Maximization (EM) algorithm. The EM algorithm is used to train models in which some of the variables are unobserved (i.e. latent). By taking the expectation over all latent variables (known as the E-step), a lower bound for the log-likelihood of the data in the vicinity of the current estimate of the model parameters can be found. The estimate of the model parameters is then updated to maximize this lower bound (known as the M-step). By iteratively performing the E-step and M-step, given some initial guess for the model parameters, the EM algorithm converges to a local maximum. In the Baum-Welch algorithm, expectation maximization is used in combination with recursive computations of probabilities in the HMM to learn the parameters of the model. A more detailed account of the Baum-Welch algorithm can be found in [12].

The Viterbi algorithm

Given a trained HMM, one often wants to compute the most probable sequence of hidden states given a particular sequence of observations. Formally, this can be formulated as finding the state sequence (î_1, ..., î_T) such that

    (î_1, ..., î_T) = argmax_{(i_1, ..., i_T)} P[S_1 = i_1, ..., S_T = i_T | x_1, ..., x_T, λ],    (2.1)

where λ denotes the parameters of the model. With N different possible states, there are N^T possible hidden state sequences that could have generated an observed sequence of length T. Thus, in general, this problem cannot be reasonably solved with a brute-force search. Instead, one can use a procedure known as the Viterbi algorithm, which utilizes dynamic programming to cover only the necessary parts of the full search space. The algorithm operates recursively for t = 1, ..., T. The key idea is that at time step t, only the most probable path that results in each state i for i = 1, ..., N needs to be considered further, since the state at time step t + 1 is dependent only on the state at time step t. Thus, at any time t there are only N possible candidates for the most probable sequence. As the algorithm reaches t = T, the most probable sequence can be selected from the N candidates. A more detailed account of the Viterbi algorithm can be found in [12].
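To make the recursion concrete, the following is a minimal sketch of the Viterbi algorithm for a generic discrete-observation HMM, not the ED-HMM base caller implementation (which is written in C and uses a more involved super-state lattice). The parameter names pi, A, and B are our own illustrative choices, assuming an initial distribution, a transition matrix, and an emission matrix.

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Most probable state sequence for a discrete-observation HMM.

    pi: (N,) initial state probabilities
    A:  (N, N) transition probabilities, A[i, j] = P(S_{t+1} = j | S_t = i)
    B:  (N, M) emission probabilities, B[i, o] = P(X_t = o | S_t = i)
    observations: sequence of observation indices in {0, ..., M-1}
    """
    N = len(pi)
    T = len(observations)
    # Work in log-space to avoid numerical underflow on long sequences.
    log_delta = np.log(pi) + np.log(B[:, observations[0]])
    backpointers = np.zeros((T, N), dtype=int)

    for t in range(1, T):
        # scores[i, j]: log-probability of the best path in state i at t-1 moving to j at t
        scores = log_delta[:, None] + np.log(A)
        backpointers[t] = np.argmax(scores, axis=0)
        log_delta = scores[backpointers[t], np.arange(N)] + np.log(B[:, observations[t]])

    # Backtrack from the most probable final state.
    states = [int(np.argmax(log_delta))]
    for t in range(T - 1, 0, -1):
        states.append(int(backpointers[t, states[-1]]))
    return states[::-1]
```

At each time step only the N best partial paths are kept, which is exactly the dynamic-programming argument made above.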

2.3.3 Hidden Semi Markov Models (HSMMs)

In the traditional HMM, the only way to model an extended stay in some state is through repeated self-transitions. Since the self-transition probability for any given state is modeled by a single scalar, the level of nuance in the modeling of state durations is highly restricted. A variation of the common HMM where the durations of states are modeled explicitly is known as a Hidden Semi Markov Model (HSMM) [13]. In a HSMM, each state has an associated variable duration, during which the model will remain in the state and produce additional observations. The addition of state durations adds significant complexity to the model, as the durations must be incorporated in the transition and observation probabilities. However, training and inference with the model can still be performed using extended Baum-Welch and Viterbi algorithms that follow the same principles as in the regular HMM. A detailed account of these algorithms for the HSMM is provided in [13].

Explicit Duration HMM

Different simplifying assumptions can be made with regards to how the distribution of durations is modeled in the HSMM. For instance, one could assume that

1. a transition to the current state is independent of the duration of the previous state and,

2. the duration is conditioned only on the current state.

A HSMM utilizing these assumptions is known as an Explicit Duration HMM (ED-HMM) [13]. Due to its simplicity compared to other variations of the HSMM, the ED-HMM is the most popular HSMM in many applications [13].

2.3.4 Explicit Duration HMM as a DNA base caller

In nanopore DNA sequencing, HMMs (and especially ED-HMMs) can be utilized to model the sequence of nucleobases in a DNA strand, given the observed current signal. Since the sequence of nucleobases is the unobserved quantity that we wish to infer, they correspond to the sequence of states in the model. The measured current signal corresponds to the observations in the model. As the translocation speed of the DNA strand through the nanopore is non-uniform, ED-HMMs that can model the duration of states with more nuance are especially well suited to this application. In this section we proceed to describe one such ED-HMM model that can function as a base caller for nanopore sequencing.

Model definition

It is generally assumed that a number of nucleobases are inside the pore at any given time (typically between 3 and 7) and thus affect the signal simultaneously. The state representation is therefore chosen to be the sequence of k nucleobases, a so-called k-mer, currently affecting the signal. For a model using a 5-mer state representation, for instance, the current state might be the sequence 'ATACG'. The model remains in the same state for a duration d ∈ Z+, up to a maximum of d = D steps. Henceforth, we refer to the tuple (S, d) of a state S and its associated duration d as a so-called super-state. The duration d functions as a timer for the state in the following way. For durations d = 2, ..., D − 1, super-state (S, d) always transitions into the super-state (S, d − 1), i.e. the same state but with a duration one step shorter. From the super-state (S, 1), the model transitions into some new super-state (Ŝ, d), i.e. a new state with a reset timer. This transition models a shift of a single nucleobase in/out of the k-mer representing state S. Therefore, compared to a standard ED-HMM, there are additional constraints on what state transitions are possible, since the new k-mer must omit the first nucleobase of the previous k-mer, and append a new nucleobase at the end. For instance, in a 3-mer state representation, 'TTT' can only transition to 'TTA', 'TTC', 'TTG' or 'TTT'. Figure 2.4 shows an illustration of a lattice representation of the model, taking these constraints into account.

Figure 2.4: The ED-HMM illustrated in a lattice representation with added k-mer transition constraints. The k-mer corresponding to the new state must omit the first nucleobase of the current k-mer, and append a new nucleobase at the end. Note that the figure shows a 3-mer state representation for the sake of brevity. In practice, a 5-mer or 7-mer state is used.
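As a small illustration of the transition constraint described above, the following sketch enumerates the k-mers reachable from a given k-mer. The function name and the alphabet ordering are our own choices and are not taken from the base caller implementation.

```python
BASES = "ACGT"

def successor_kmers(kmer):
    """All k-mers reachable from `kmer` when one nucleobase shifts through the pore:
    drop the first base and append one of the four bases."""
    return [kmer[1:] + b for b in BASES]

# In a 3-mer state representation, 'TTT' can only transition to 'TTA', 'TTC', 'TTG' or 'TTT'.
print(successor_kmers("TTT"))  # ['TTA', 'TTC', 'TTG', 'TTT']
```

Each k-mer thus has exactly four possible successor states, which is what gives the lattice in Figure 2.4 its sparse structure.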

Training

Much like with the regular HMM, the parameters of an ED-HMM can be learned from data using a variation of the Baum-Welch algorithm. A detailed account of how the algorithm works for an ED-HMM and other HSMMs can be found in [13]. In contrast to the Baum-Welch algorithm, which utilizes probabilistic modeling of the alignments, another training algorithm that considers only the most probable alignment at any given training step may also be used. We henceforth refer to training with the latter algorithm as hard training, and training with the modified Baum-Welch algorithm as soft training.

Inference

Base calling consists of finding the most probable sequence of nucleobases (i.e. states) given samples of the nanopore current signal (i.e. observations). Given a trained ED-HMM, much like in the regular HMM, this sequence can be found using the Viterbi algorithm. A more detailed account of how the Viterbi algorithm works for the ED-HMM and other HSMMs can be found in [13].

Evaluation

After performing inference with any base caller, one commonly wants to compare the inferred sequence of nucleobases to the ground-truth reference sequence in order to evaluate the performance of the base calling model in question. Performing such a comparison between nucleobase sequences requires that the sequences are first aligned, so that each base is correctly compared to the corresponding base in the other sequence. This alignment is herein performed using a collection of publicly available sequence alignment software tools known as minimap2 [14]. Once the inferred sequence has been aligned with the reference sequence, performance is measured in a metric known as identity rate, which in turn can be split into an insertion rate, a deletion rate, and a substitution rate. The insertion, deletion, and substitution rates measure how many bases have been incorrectly inserted, deleted, and substituted in the inferred sequence, in relation to the reference sequence. They are respectively defined as

    insertion rate = ic / (mc + ic + dc + sc),    (2.2)
    deletion rate = dc / (mc + ic + dc + sc),    (2.3)
    substitution rate = sc / (mc + ic + dc + sc),    (2.4)

where mc = count(matches), ic = count(insertions), dc = count(deletions), and sc = count(substitutions). The identity rate is defined as

    identity rate = mc / (mc + ic + dc + sc)    (2.5)

and thus provides a measure of similarity between the inferred sequence and the reference sequence, such that the identity rate is 1 if the sequences are exactly identical, and 0 if they do not match in any base. An example of how the different metrics are computed when applied to two short nucleobase sequences is provided in Figure 2.5.
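As a worked illustration of (2.2)-(2.5), the sketch below computes the four rates from alignment counts. The counts in the example are hypothetical; in practice they would be obtained from the minimap2 alignment of an inferred read against the reference.

```python
def alignment_rates(mc, ic, dc, sc):
    """Compute identity, insertion, deletion, and substitution rates
    from match, insertion, deletion, and substitution counts."""
    total = mc + ic + dc + sc
    return {
        "identity": mc / total,
        "insertion": ic / total,
        "deletion": dc / total,
        "substitution": sc / total,
    }

# Hypothetical counts for a short aligned read.
print(alignment_rates(mc=870, ic=40, dc=50, sc=40))
# {'identity': 0.87, 'insertion': 0.04, 'deletion': 0.05, 'substitution': 0.04}
```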

Figure 2.5: The different performance metrics and their relation illustrated on two short example sequences.

2.4 Time series analysis

A time series is a sequence of data points that are listed, or otherwise referred to, in time order. All time series are discrete in time, and points are most commonly evenly spaced in time. Points in a time series may however take values from a continuous domain. Time series analysis is the practice of processing time series with the purpose of extracting useful statistics or other characteristics. Methods for time series analysis are commonly divided into two categories: time-domain methods, i.e. methods that operate directly on the time ordered samples, and frequency-domain methods, i.e. methods that instead operate on a time series in terms of its frequency content. Below, we selectively provide brief background on some relevant time series analysis methods in the time domain.

2.4.1 Normalization

Normalization, in the broad context of statistics and adjacent topics, commonly refers to the rescaling of a set of values into some common scale. Such a rescaling can be done in any number of ways depending on the application, but often seeks to transform values into a representation that simplifies comparisons between values, or other relevant operations. In many settings, the application of normalization also serves the simple purpose of eliminating constant factors or terms for notational simplicity. In the setting of machine learning it has been shown, for instance by [15] and [16], that applying appropriate normalization to the input data can play a critical role in attaining stable model training, and achieving optimal performance. Intuitively, the process of normalizing input data ensures that different dimensions, or features, of the input do not have vastly different scales, mitigating dominance of the particular features that happen to have a broader range of values. The importance of normalization can therefore vary vastly, depending on the type of features used in the input data. A variety of different techniques are commonly used for normalization of time series data. Given a time series x(t), t = 0, ..., T, a normalization method produces a corresponding normalized series x_norm(t), t = 0, ..., T. Below follows a brief review of some of the most common normalization methods. Most methods extract statistical metrics from the entirety of the series, and operate on each value of the series using these metrics. Note that any such method can also be applied in sliding windows of the series, thus normalizing segments of the series separately, based only on local properties.

In this thesis, the z-score and median absolute deviation (MAD) normalization methods specifically will be studied as candidate normalization methods for the input data in DNA base calling.

Min-max normalization

With min-max normalization, the data can be scaled to an arbitrary range [l, h] (e.g. [0, 1] or [−1, 1]) [17]. Each value x_norm(t) in the normalized series is computed as

    x_norm(t) = (x(t) − min_t x(t)) / (max_t x(t) − min_t x(t)) · (h − l) + l.    (2.6)

Decimal scaling normalization

Decimal scaling normalization moves the decimal point of values by scaling them in such a way that max_t |x_norm(t)| < 1 [17]. Specifically, each value x_norm(t) in the normalized series is computed as

    x_norm(t) = x(t) / 10^d    (2.7)

where d is the smallest integer such that max_t |x_norm(t)| < 1.

Median normalization

In median normalization, values are normalized by the median of the series [17]. This has the advantage of not being affected by the magnitude of extreme outliers. Each value x_norm(t) in the normalized series is computed simply as

    x_norm(t) = x(t) / median    (2.8)

where median is the median of x(t), t = 0, ..., T.

z-score normalization

In z-score or standard scale normalization, values are scaled to an interval centered on 0 by subtracting the sample mean, and dividing by the sample standard deviation [17]. Thus, each value x_norm(t) in the normalized series is computed as

    x_norm(t) = (x(t) − µ) / σ    (2.9)

where µ is the sample mean and σ is the sample standard deviation of x(t), t = 0, ..., T.

Median absolute deviation normalization

MAD normalization scales values similarly to z-score normalization, but using the median and MAD instead of the mean and standard deviation [18]. This has the advantage of being more resilient to outliers. Each value x_norm(t) in the normalized series is computed as

    x_norm(t) = (x(t) − median) / MAD    (2.10)

where median is the median of x(t), t = 0, ..., T, and MAD = median_t |x(t) − median|.

Sigmoid normalization

Sigmoid normalization is arguably the simplest form of normalization, as it does not account for any statistical properties of the time series, but simply maps all values to the interval [0, 1] using the sigmoid function [17]. Each value x_norm(t) in the normalized series is computed simply as

    x_norm(t) = 1 / (1 + e^{−x(t)}).    (2.11)
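The two methods studied later in this thesis, z-score normalization (2.9) and MAD normalization (2.10), are straightforward to implement. The sketch below is a minimal numpy version applied to a full read; the function names are our own and need not match the thesis' preprocessing module.

```python
import numpy as np

def zscore_normalize(x):
    """z-score normalization (2.9): subtract the sample mean, divide by the sample std."""
    x = np.asarray(x, dtype=float)
    return (x - np.mean(x)) / np.std(x)

def mad_normalize(x):
    """MAD normalization (2.10): subtract the median, divide by the median absolute deviation."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / mad

# Toy signal with one outlier: the MAD statistics are barely affected by the spike,
# while the mean and standard deviation used by z-score normalization are.
x = np.array([10.0, 11.0, 9.5, 10.5, 100.0, 10.2])
print(zscore_normalize(x))
print(mad_normalize(x))
```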

2.4.2 Filtering

In order to mitigate the effects of noise or other artefacts, a variety of filtering techniques can be applied to time series. Filters can commonly be defined in the frequency domain, so as to filter out artefacts of a known frequency using for instance a low-pass or a high-pass filter. Here, we instead choose to focus on filtering methods defined in the time domain. Two simple forms of such filtering methods are median and mean filtering.

In median filtering, the values in the filtered series are computed as the median of the raw series in a sliding window of fixed size. Specifically, if the size of the sliding window is w ∈ Z+, the values of the filtered series x_f can be expressed as

    x_f(t) = median({x(t') | 0 ≤ t' < w}),                                   0 ≤ t < ceil(w/2)
    x_f(t) = median({x(t') | T − w < t' ≤ T}),                               T − ceil(w/2) < t ≤ T
    x_f(t) = median({x(t') | t − floor(w/2) ≤ t' ≤ t + floor(w/2)}),         otherwise.

Here we use the convention that at the start (end), where the sliding window only partially overlaps with the series, the filtered series is set to be the median of the first (last) w values.

Analogously, a mean filter can be defined in an identical way, replacing only the median with the mean. The mean filter can be made more computationally efficient by utilizing a running sum to which values can be iteratively added and subtracted to represent the current window.
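The sketch below implements the median filter described above; the function name is our own, and the exact boundary indices are an approximation of the convention stated in the piecewise definition rather than a verbatim transcription of it.

```python
import numpy as np

def median_filter(x, w):
    """Sliding-window median filter with window size w.

    Positions where a centered window would not fit inside the series are set
    to the median of the first (last) w values, approximating the edge
    convention described in the text.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    half = w // 2
    out = np.empty(n)
    head = np.median(x[:w])   # median of the first w values
    tail = np.median(x[-w:])  # median of the last w values
    for t in range(n):
        if t - half < 0:
            out[t] = head
        elif t + half >= n:
            out[t] = tail
        else:
            out[t] = np.median(x[t - half:t + half + 1])
    return out

# A mean filter is obtained by replacing np.median with np.mean
# (or, more efficiently, by maintaining a running sum over the window).
```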

2.5 Quantization

Quantization is the process of mapping values in a large (often continuous) set into a discrete, countable set. In digital signal processing, analog signals, such as electrical currents, must be quantized in order to be processed digitally. Quantization is also a means of data compression, and inherently forms the basis of nearly all lossy compression algorithms. The application of quantization to multi-dimensional data is commonly referred to as vector quantization.

Intuitively, the process of defining a quantization scheme can be interpreted as finding a partitioning of the input space (usually R or R^n) into a number of regions. In the context of quantization, the regions that make up this partitioning of the input space are commonly referred to as quantization bins. By choosing a fixed but arbitrary ordering of the quantization bins, each bin can be identified by an integer index. To get the quantized representation of a point in the input space, one simply checks which bin the point resides in, and selects the corresponding index. In this setting, we will describe each quantizer as a function Q : R^n → {0, ..., N_q − 1}, mapping each input value to the index of one of the N_q quantization bins.

Below we provide a brief review of two relevant and common quantization and vector quantization methods. We also briefly review a method developed specifically for the purpose of quantizing multi-dimensional decision variables for classification tasks. In our experiments, all three methods will be evaluated as candidate quantization methods for the base calling task.

2.5.1 Uniform quantization

Arguably, uniform quantization is the simplest possible method for quantizing data. In this quantization method, the input space is partitioned into regions of uniform size, up to a set of specified maximum/minimum levels. In the scalar setting, given a choice of maximum level l_max ∈ R, minimum level l_min ∈ R, and a number of quantization bins N_q, a uniform quantizer quantizes a sample x as

    Q(x) = min( N_q − 1, round( N_q · max(0, x − l_min) / (l_max − l_min) ) )    (2.12)

where round denotes rounding to the nearest integer. Here the max and min operators ensure that any input value x < l_min, or x > l_max, is mapped to the bin corresponding to l_min and l_max respectively. While the above definition is only applicable in the scalar setting, it can trivially be extended to multi-dimensional input, where bins are created by uniformly partitioning along each dimension. For instance, a uniform vector quantizer in R^n with ∏_{i=1}^n N_q^(i) bins can be implemented by first applying a scalar quantizer separately to each of the n dimensions of the input, resulting in output that lies in S_q = {0, ..., N_q^(1) − 1} × ··· × {0, ..., N_q^(n) − 1}, where × denotes the Cartesian product. By then deciding an arbitrary ordering of all vectors in S_q, and assigning a bin index to each output sample according to this ordering, output samples in {0, ..., ∏_{i=1}^n N_q^(i) − 1} are obtained. This scheme results in hyperrectangle quantization bins, as illustrated in Figure 2.6.

Figure 2.6: Illustration of a uniform vector quantizer in R^2, with N_q^(1) = 3 and N_q^(2) = 6. The point x is quantized as the index of the rectangular quantization bin in which it resides.
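A scalar uniform quantizer following (2.12) can be written as a few lines of numpy. In the example below, the 101-bin count mirrors the value used later in the experiments, but the clipping levels l_min and l_max are arbitrary choices made for illustration and are not taken from the thesis.

```python
import numpy as np

def uniform_quantize(x, l_min, l_max, n_bins):
    """Scalar uniform quantizer as in (2.12): clip to [l_min, l_max] and map to
    one of n_bins integer bin indices in {0, ..., n_bins - 1}."""
    x = np.asarray(x, dtype=float)
    scaled = n_bins * np.maximum(0.0, x - l_min) / (l_max - l_min)
    return np.minimum(n_bins - 1, np.round(scaled)).astype(int)

# Example: quantize a normalized signal into 101 bins, assuming clipping levels [-5, 5].
x = np.array([-7.2, -1.3, 0.0, 2.4, 9.9])
print(uniform_quantize(x, l_min=-5.0, l_max=5.0, n_bins=101))
```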

2.5.2 k-means quantization

k-means is an unsupervised learning algorithm that finds a segmentation of a set of multidimensional data points into k ∈ Z+ clusters. In the resulting clusters, each data point belongs to the cluster for which the squared Euclidean distance between the point and the mean of the points in the cluster is the smallest. A formal description of the basic k-means algorithm can be found in Algorithm 1. Given some initialization of cluster means c_1, ..., c_k, the algorithm iteratively assigns points to corresponding sets C_1, ..., C_k, updates the cluster means as the mean of all points in the corresponding set, and repeats this process until the point assignments no longer change. The cluster means can in theory be initialized randomly; however, in practice more deliberate initialization, such as the scheme proposed by [19] in the so-called k-means++ algorithm, yields faster convergence and more accurate clusters.

Algorithm 1: The standard k-means algorithm.
Input: Number of desired clusters k ∈ Z+,
       Data points x_1, ..., x_N ∈ R^n,
       Initial cluster means c_1^(0), ..., c_k^(0) ∈ R^n.
Output: Learned cluster means c_1, ..., c_k and corresponding disjoint sets C_1, ..., C_k jointly containing the points x_1, ..., x_N.

while assignment step yields change in point assignments do
    Assignment step. Assign each point to the cluster with the closest cluster mean:

        C_i^(t) = {x_p : ||x_p − c_i^(t)||^2 ≤ ||x_p − c_j^(t)||^2 ∀ j ∈ {1, 2, ..., k}}

    Update step. Recompute the cluster means with the new point assignments:

        c_i^(t+1) = (1 / |C_i^(t)|) Σ_{x_p ∈ C_i^(t)} x_p
end

The k-means algorithm was first proposed as an approach to vector quantization. By running the k-means algorithm with k = N_q on some set of input samples, the set of cluster means m_1, ..., m_{N_q} is acquired. The quantization function of new inputs x can then be defined as

    Q(x) = ( argmin_i ||x − m_i||^2 ) − 1    (2.13)

where the subtraction of 1 has been added to comply with our convention that Q maps to values in {0, ..., N_q − 1}. This approach provides a partitioning of the input space in terms of so-called Voronoi regions. A partitioning of R^n into N_q Voronoi regions is defined by a set of N_q points. The Voronoi region of each point is the point itself, along with all points that are closer to the point than any of the other N_q − 1 points. This is exactly what (2.13) describes.
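In practice a k-means quantizer amounts to fitting a codebook of cluster means and then assigning each new sample the index of its nearest mean. The sketch below uses scikit-learn (one of the packages the preprocessing module relies on), but the exact usage here is our own assumption rather than the thesis' implementation; note that scikit-learn's labels are already 0-based, so the explicit "−1" in (2.13) is not needed, and its KMeans uses k-means++ initialization by default.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_kmeans_quantizer(samples, n_bins, seed=0):
    """Fit a k-means based vector quantizer: the cluster means act as the codebook."""
    km = KMeans(n_clusters=n_bins, n_init=10, random_state=seed)
    km.fit(samples)
    return km

def kmeans_quantize(km, x):
    """Quantize inputs as the index of the nearest cluster mean (its Voronoi region)."""
    return km.predict(x)

# Example with 1-D current samples reshaped to (N, 1); the bin count is illustrative.
samples = np.random.default_rng(0).normal(size=(10000, 1))
km = fit_kmeans_quantizer(samples, n_bins=101)
print(kmeans_quantize(km, np.array([[-1.0], [0.0], [2.5]])))
```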

2.5.3 Information loss optimized quantization

A specific setting for designing a vector quantizer is when one wishes to quantize data for a classification task. A method for optimizing the design of such a quantizer, to the classification task at hand, was proposed in [20]. The approach is rooted in the concepts of information theory, and specifically seeks a quantizer that maximizes the information about the target variable contained in the quantized representation of the data. In this context, finding an appropriate quantization scheme is framed as a supervised learning problem. The algorithm thus assumes the availability of a training set containing both samples of the quantity to be quantized and samples of the classification target variable. In the application of base calling, training sets consisting of aligned sequences of electrical current samples and a ground truth nucleobase sequence are used. A quantizer could then be trained on this data in a supervised manner by considering a set of current samples and their corresponding ground truth nucleobases.

Problem definition

Formally, [20] describes the problem as follows. Consider a multi-dimensional random variable X ∈ R^n, a scalar random variable Y ∈ Y ⊆ R, and a set {(X_i, Y_i)}_{i=1}^N of independent samples drawn from the joint distribution of (X, Y). A partitioning of R^n into N_q disjoint sets is sought, such that a random variable K ∈ {0, ..., N_q − 1}, giving the index of the set that X belongs to, is a so-called sufficient statistic of X for Y. K is said to be a sufficient statistic of X for Y, if and only if

    I(K; Y) = I(X; Y)    (2.14)

where I(X; Y) denotes the mutual information between X and Y, which is a symmetric construct from information theory that describes the amount of information obtained about a random variable by observing another. However, since the quantized representation K necessarily contains less information about Y than X does, the strict equality in (2.14) cannot be achieved in practice. Instead, one can seek to minimize the so-called information loss over all possible choices of K, which is defined to be

    L = I(X; Y) − I(K; Y)    (2.15)

i.e. the difference in information obtained about Y when observing X, and information obtained about Y when observing K. The approach outlined in [20] consists of performing this minimization under a set of necessary constraints on the choice of K.

Constraints

In order to permit quantization of samples outside of the training set (i.e. samples X_i for which the corresponding target Y_i is not known), certain constraints are added to the optimization problem described above. Much like in k-means quantization, the sets indexed by K are constrained to be defined by a set of cluster means {m_1, ..., m_{N_q}}, where each m_i represents the set M_i = {x ∈ R^n : ||x − m_i|| ≤ ||x − m_j||, ∀ j ∈ {1, 2, ..., N_q}}, i.e. the Voronoi region of m_i. Additionally, in order to make the optimization problem tractable, a relaxation of the information loss function is introduced, which allows a "soft" partition of the input space. Instead of strictly assigning a sample x to a cluster mean m_i, w_i(x) denotes the "weight" of the assignment of x to m_i, where the weights are such that Σ_{i=1}^{N_q} w_i(x) = 1. The weights are set according to a Gibbs distribution such that

    w_i(x) = exp(−β ||x − m_i||^2 / 2) / Σ_j exp(−β ||x − m_j||^2 / 2)    (2.16)

where β is a hyperparameter controlling the "softness" of the assignments. Smaller values of β correspond to softer assignments, and letting β approach infinity yields hard clustering.
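The Gibbs weights in (2.16) are simply a softmax over negative scaled squared distances. The sketch below computes them for one sample; the function name is our own, and the numerical stabilization step is an implementation detail not discussed in [20].

```python
import numpy as np

def soft_assignment_weights(x, means, beta):
    """Soft cluster assignment weights of a point x to the cluster means, as in (2.16).

    Larger beta gives harder assignments; letting beta grow without bound
    recovers nearest-mean (hard) clustering.
    """
    sq_dists = np.sum((means - x) ** 2, axis=1)  # ||x - m_i||^2 for every cluster mean
    logits = -beta * sq_dists / 2.0
    logits -= logits.max()                       # stabilize the softmax numerically
    w = np.exp(logits)
    return w / w.sum()

# Example: three cluster means in R^2 and a moderately soft assignment.
means = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
print(soft_assignment_weights(np.array([0.6, 0.4]), means, beta=2.0))
```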

Algorithm

Given the constraints, it can be shown that the empirical version of the information loss L in (2.15) on a set of samples {(X_i, Y_i)}_{i=1}^N can be expressed as

    L = Σ_{i=1}^N Σ_{k=1}^{N_q} w_k(X_i) D_KL(P_{X_i} || π_k)    (2.17)

where D_KL(p || q) = Σ_x p(x) log(p(x) / q(x)) is the so-called Kullback-Leibler divergence between probability distributions p and q, P_{X_i} denotes an empirical estimate of the distribution P(Y | X = X_i), and π_k an empirical estimate of the target distribution P(Y | X ∈ M_k). Further, it can be shown that a local minimum of L can be found through an EM style algorithm described in Algorithm 2. The cluster means and target distributions are iteratively updated until the loss has converged.

It is noted by [20] that good convergence has been observed when simply initializing cluster means with a k-means clustering, and then initializing the distributions with the point mass P_{X_i} = δ_{Y_i}, and averaging rule π_k = (1 / |M_k|) Σ_{X_i ∈ M_k} P_{X_i}, respectively.

In [20], the described algorithm and initialization are evaluated on a bag-of-features image classification task. It is shown that quantizing input data using the information-loss quantization algorithm yields better performance than k-means quantization for both a Naive Bayes (NB) model and a Support Vector Machine (SVM) model on the studied task [20].

Algorithm 2: Information loss optimization algorithm [20]
Input: Number of desired clusters N_q ∈ Z+,
       Data points X_1, ..., X_N ∈ R^n,
       Targets Y_1, ..., Y_N ∈ Y ⊆ R,
       Estimates of P_{X_i}, ∀ i = 1, ..., N,
       Initial estimates of π_k, ∀ k = 1, ..., N_q,
       Initial cluster means m_1^(0), ..., m_{N_q}^(0) ∈ R^n,
       Assignment softness parameter β ≥ 0,
       Learning rate α > 0.
Output: Learned cluster means m_1, ..., m_{N_q} and corresponding target distribution estimates π_k = P(Y | K = k), ∀ k = 1, ..., N_q.

while not converged do
    Update cluster means:

        m_k^(t+1) = m_k^(t) − α Σ_{i=1}^N Σ_{j=1}^{N_q} D_KL(P_{X_i} || π_j^(t)) · ∂w_j^(t)(X_i) / ∂m_k^(t)

    where

        ∂w_j^(t)(X_i) / ∂m_k^(t) = β [δ_{jk} w_k(X_i) − w_k(X_i) w_j(X_i)] (X_i − m_k^(t))

    Update cluster target distributions:

        π_k^(t+1)(y) = Σ_{i=1}^N w_k^(t+1)(X_i) P_{X_i}(y) / Σ_{y'} Σ_{i=1}^N w_k^(t+1)(X_i) P_{X_i}(y'),    ∀ y ∈ Y
end

Chapter 3

Experiments and Results

This chapter will provide a detailed description of each of the performed preprocessing experiments. After the description of the method used in each experiment, results in terms of base calling performance given the studied preprocessing scheme will be presented. To simplify reading, all result tables are placed at the end of the chapter. We begin with a review of the experiment setting, including the dataset, model implementation, and environment used.

3.1 Experiment setting

3.1.1 Dataset

Background

The data used in the study was an existing dataset acquired in a collaboration between KTH Royal Institute of Technology and SciLifeLab, an institution for the advancement of molecular biosciences in Stockholm, Sweden. Researchers at SciLifeLab sequenced the DNA of an E. coli bacterium using the MinION sequencer, developed by Oxford Nanopore Technologies (ONT). They used a second sequencer by Illumina [21] to acquire a reference genome of the same organism. The Illumina sequencer utilizes a fluorescence based sequencing technique which is slower but more accurate. Researchers at SciLifeLab found an alignment between the reference genome and the acquired current signal reads by first base calling each read with ONT's base caller Guppy, currently the top-performing publicly available base caller. Researchers at KTH then used the software tool Taiyaki [22] by ONT to get an estimated rough alignment between current signal samples and bases in the reference.


Training, validation, and test split

The acquired dataset in its entirety consists of roughly 500 000 signal reads, of varying length, organized sequentially in files of about 4 000 reads each. Each read consists of three time series: signal (containing the electrical current signal), reference (containing the ground-truth nucleobase sequence), and ref_to_signal (containing the rough alignment between the signal and the reference sequence). From the full dataset, three subsets were selected by the author to be used as training, validation, and test sets respectively for the preprocessing experiments. An observed pattern in the data, where the mean translocation speed of the DNA strand drifted unexpectedly over time, put restrictions on what data could be used for the experiments. In order to not invalidate the assumption of the ED-HMM model that the mean translocation speed is relatively stable, only the first 10 files (of roughly 40 000 reads in total) were used for the preprocessing experiments. A more detailed description of the unexpected observed pattern and its implications is provided in Appendix B. For the 10 chosen files, the following split into validation, test, and training sets was made:

• Validation set: File #1 (roughly 4 000 reads).

• Test set: Files #2 - #5 (roughly 16 000 reads).

• Small training set: File #6 (roughly 4 000 reads).

• Large training set: Files #6 - #10 (roughly 20 000 reads).

Training sets of two different sizes were used due to uncertainty in the amount of training data needed when adding feature extraction preprocessing techniques that increase the dimensionality of the input data. Since past experiments by KTH researchers who developed the ED-HMM base caller have shown that a training set of a single file is sufficient in a standard model configuration, the small training set was used by default, with the larger training set applied only when it was suspected to improve performance.

3.1.2 Implementation and environment

For all experiments, an existing implementation of the ED-HMM model described in section 2.3.4 was used. The model implementation is written in C and supports training and inference on a GPU. For all experiments, training and inference with the model were run on an NVIDIA Tesla V100 DGXS 32 GB graphics card.

A preprocessing module was built by the author, separate from the base caller. The preprocessing module was implemented in Python 3.5 using various popular scientific computing packages (e.g. numpy, matplotlib, and scikit-learn).

3.1.3 Model configurations

Two different versions of the ED-HMM base caller were used: one utilizing a 5-mer state representation and one utilizing a 7-mer state representation. While the 7-mer version of the model provides significantly better performance compared to the 5-mer version, the 5-mer version is still of interest due to its high base calling speed. The same settings and initialization scheme were used for both versions of the model, differing only in the dimensions. Details on the initialization scheme and hyperparameters used can be found in Appendix A.

The small training set and the validation set were used to find a training iteration configuration with good convergence. For both sets, all reads were z-score normalized, using the length of the full signal as window size. A number of training configurations were run on the small training set, and performance was evaluated on the validation set. It was found that, without additional features, 7 iterations of soft training followed by 5 iterations of hard training yielded the optimal results. With added features, better results were achieved by running 20 iterations of soft training, followed again by 5 iterations of hard training. These two training iteration configurations were therefore used for all preprocessing experiments, with and without additional features, respectively.

3.2 Preprocessing experiments

Four different aspects of the preprocessing of current signals were studied: normalization, filtering, feature extraction, and quantization. This section will provide detailed descriptions of the methods applied, along with results in terms of the base caller's performance on the test set for different preprocessing configurations. Base calling performance is reported in tables containing mean identity rate on the test set, along with mean insertion rate, mean deletion rate, and mean substitution rate. We mark the best value achieved, for all metrics, in bold in each table.

In addition to the mean performance rates, the base calling software allows generation of histograms showing the identity rate (and/or insertion, deletion, and substitution rate) on each read of the test set separately. As these histograms look very similar for all experiment results (apart from a shift in mean), we omit them from the main text for the sake of brevity. For examples of these histograms that provide metrics for the base calling performance on a per-read basis, refer to Appendix C.

3.2.1 Normalization

The electrical current signal produced by a nanopore is known to be strongly affected by noise, and to therefore contain sudden large spikes in amplitude, as seen in Figure 3.1. Normalization techniques that are more sensitive to large outliers, such as min-max normalization, decimal scaling normalization, and sigmoid normalization, were thus immediately rejected and not experimented with. As stated in section 2.2.4, previous work indicates that MAD normalization and z-score normalization are likely the most appropriate techniques for nanopore signals. These methods were therefore selected as the two normalization method candidates. For a background description of these, and other, normalization methods, refer to section 2.4.1.

Figure 3.1: Example segment of a raw nanopore current signal from the dataset used. Sudden spikes in amplitude are common throughout the signals.

The normalization experiments then consisted of normalizing each read of the datasets using z-score normalization and MAD normalization, and comparing the base calling performance for the two methods. While the default approach, used in previous work, is to normalize each full read at once (i.e. with the full signal length as window size), the effect of normalizing reads in sliding windows of varying sizes was also evaluated for both methods. The lengths of the signals vary greatly in the dataset used, as seen in Table 3.1. Given the statistics in Table 3.1, a range of 9 different sliding window sizes in the interval [1024, 131072] was selected for evaluation, along with full signal length normalization. All normalization experiments were performed using the small training set as described in section 3.1.1. However, an additional experiment utilizing the large training set was performed, using full signal length z-score normalization, to confirm that training on the small training set does not lead to significant overfitting, or otherwise compromise the performance on the test set. In all normalization experiments, the signals were uniformly quantized into Nq = 101 bins after normalization was applied. Further details on the quantization methodology can be found in section 3.2.4.
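For reference, a minimal sketch of the two candidate normalization methods applied in sliding windows is given below. The exact window handling used in the preprocessing module is not specified here, so non-overlapping blocks and a small guard against division by zero are assumptions of this sketch.

```python
import numpy as np


def zscore_normalize(signal: np.ndarray, window: int) -> np.ndarray:
    # Z-score normalize each block of `window` samples independently:
    # subtract the block mean and divide by the block standard deviation.
    out = np.empty_like(signal, dtype=float)
    for start in range(0, len(signal), window):
        block = signal[start:start + window]
        out[start:start + window] = (block - block.mean()) / (block.std() + 1e-12)
    return out


def mad_normalize(signal: np.ndarray, window: int) -> np.ndarray:
    # MAD normalization: subtract the block median and divide by the median
    # absolute deviation, which is less sensitive to large amplitude spikes.
    # No scaling constant is applied to the MAD in this sketch.
    out = np.empty_like(signal, dtype=float)
    for start in range(0, len(signal), window):
        block = signal[start:start + window]
        med = np.median(block)
        mad = np.median(np.abs(block - med)) + 1e-12
        out[start:start + window] = (block - med) / mad
    return out
```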

                mean     stddev   min     max
signal length   75 877   56 521   2 256   481 982

Table 3.1: Metrics describing the distribution of signal lengths in the validation set file. Signal lengths were similarly distributed in all files used.

Tables 3.3 and 3.4 show the resulting performance metrics on the test set, for the 5-mer and 7-mer base calling models respectively, when normalizing data with MAD normalization and varying the sliding window size. Tables 3.5 and 3.6 contain the corresponding results for z-score normalization. Tables 3.7 and 3.8 contain a summary of the results, along with performance when training on the large training set, for the 5-mer and 7-mer model respectively.

We see in Tables 3.3 and 3.4 that a larger window size generally yields better results when using MAD normalization. However, full signal length normalization was suboptimal, and the highest identity rate was achieved with a window size of 98034, for both the 5-mer and 7-mer models. For z-score normalization, full signal length normalization was not optimal either, as seen in Tables 3.5 and 3.6. Here, a window size of 32768 yielded the highest identity rate, again for both the 5-mer and 7-mer models. The fact that the lengths of the signals vary greatly, as seen in Table 3.1, makes it difficult to analyze what an optimal normalization window size would be. However, it is clear that a window size larger than the mean signal length was preferable for MAD normalization, while for z-score normalization a window shorter than the mean signal length yielded the best results. Intuitively, this conclusion aligns well with what we know about the two normalization methods. Since MAD normalization computes its normalization parameters from the median of samples in the signal window, the inclusion of more outliers has little effect on the resulting parameters. In z-score normalization, on the other hand, the normalization parameters are computed from the mean of samples in the signal window, meaning the inclusion of additional outliers can skew the resulting parameters severely. Therefore, the fact that MAD normalization works better with a larger window size than z-score normalization is likely due to its higher resilience to outliers.

However, a closer comparison between the performance of the normalization methods, as seen in Tables 3.7 and 3.8, shows that z-score normalization consistently outperforms MAD normalization, both for full signal length normalization and when using the optimal window size for each of the methods. The fact that z-score normalization consistently performs better than MAD normalization seems to indicate that the influence of outliers in the signals is not as significant as feared. Since normalization using the mean and standard deviation works well, it could be the case that large outliers are few enough, or small enough, to not have a significant impact on the normalization parameters.

Comparing the best performing window sizes for normalization to the distribution of signal lengths in Table 3.1, it seems that a trade-off between near-global features and more local features yields the best results when normalizing the signals. A probable explanation for this result is that the signal windows need to be large enough to capture a broad enough range of amplitudes to accurately represent the global distribution of amplitudes, but small enough to handle a slow drift in mean current amplitude, which is known to occur in nanopore signals.

We see in Tables 3.7 and 3.8 that using a larger training set did not improve performance in the case of the 5-mer model, but in fact made it slightly worse. However, for the 7-mer model, using a larger training set yielded a marginal improvement in base calling performance. The difference in performance between using the small and large training sets is deemed small enough that using the small training set is sufficient to evaluate different preprocessing methods, given that the preprocessing methods do not add significant additional complexity or dimensionality to the data or to the model. A probable explanation for the drop in performance when training the 5-mer model on the larger training set is that the model overfits when trained on this larger set. The model is more likely to overfit to the larger training set only under the assumption that the data in the added files is distributed very similarly to the data in the smaller training set. However, this is likely the case, since DNA strands are randomly chopped and fed into the sequencer in a random order. This reasoning could also explain why performance did not drop for the 7-mer model. Since the 7-mer model is significantly more complex than the 5-mer model (with 4^7 possible states compared to the 4^5 of the 5-mer model), it is far less likely to overfit. Note, however, that it has not been confirmed, for instance by comparing performance on the test set to that on the training set, whether the trained models have in fact overfitted. Further experiments would thus be needed to draw convincing conclusions on this matter.

3.2.2 Filtering

Since the electrical current signals from a nanopore are subject to a significant amount of noise, it is conceivable that filtering of the signals may improve the base calling performance. As the underlying base sequence is encoded in the signal amplitude, and the frequency content of the noise is largely unknown, filters acting in the frequency domain were not considered. Instead, experiments were performed with simple median and mean filters, as described in section 2.4.2.

Mean and median filtering with the three window sizes 3, 5, and 7 were evaluated. As the current signal contains abrupt fluctuations in amplitude, only small window sizes were considered. It was reasoned that the signal would otherwise be excessively smoothed, leading to a loss of information about the underlying base sequence. Additionally, the mean number of current samples per k-mer was found to be roughly 9.5 for the files used, as shown in Appendix B. The window sizes were therefore also chosen so as to not exceed this number. Before applying each filter, the reads were z-score normalized using window size 32768, as this was the normalization technique that yielded the best base calling performance, see section 3.2.1. After being normalized and filtered, the signals were uniformly quantized into Nq = 101 bins.

Tables 3.9 and 3.10 show the resulting performance metrics on the test set, for the 5-mer and 7-mer base calling models respectively, when applying median and mean filtering with different window sizes to the input data. We see clearly, from both Table 3.9 and Table 3.10, that the filtering does not have a positive effect on base calling performance. For both the 5-mer and the 7-mer models, better performance was achieved when using no filtering at all, compared to all tested filter configurations. It is clear, however, that applying the filters in shorter windows is favorable over longer windows, for both the median and mean filter. A likely explanation for the drop in performance when applying this type of filtering in general is that even for the shortest evaluated window size, the filtering leads to excessive smoothing, as shown in Figure 3.2. It is probable that replacing the signal samples with the median or mean, even in a small proximity, obscures abrupt changes in signal amplitude that could for instance be indicative of a shift to a new k-mer in the nanopore.
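For reference, the following is a minimal sketch of the two filters evaluated. The thesis does not state whether the preprocessing module used library routines or hand-written loops; scipy.signal.medfilt and a moving-average convolution are used here as reasonable stand-ins.

```python
import numpy as np
from scipy.signal import medfilt


def median_filter(signal: np.ndarray, w: int) -> np.ndarray:
    # Replace each sample by the median of a length-w window centred on it.
    # medfilt requires an odd kernel size, which holds for w in {3, 5, 7}.
    return medfilt(signal, kernel_size=w)


def mean_filter(signal: np.ndarray, w: int) -> np.ndarray:
    # Replace each sample by the mean of a length-w window centred on it.
    kernel = np.ones(w) / w
    return np.convolve(signal, kernel, mode="same")
```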

Figure 3.2: Example of a signal segment in its plain form, compared to the resulting signal when taking the mean or median in windows of size w = 3.

3.2.3 Feature extraction

While a quantized representation of the raw electrical current signal is typically used directly as observations in the ED-HMM base calling model, there are other possible ways to represent the observed signals. A potential means for improving base calling performance is therefore to consider other such representations. Specifically, since the model allows for multidimensional observations, multiple features could potentially be extracted from the signals and stacked in vectors, so that each observation is a point in Z^nf rather than a scalar in Z, where nf is the number of features.

In addition to the raw current signal, several features were considered that could potentially contain more explicit information regarding the underlying base sequence. Firstly, while median or mean filtering of the signals did not yield an improvement in base calling performance, the possibility was considered that adding the median or mean in sliding windows of the signal as an additional feature in the observations may still be beneficial. Secondly, features that capture variance or dispersion in the signal amplitude were considered. As sudden fluctuations in the current signal often indicate the transition from one k-mer to another, it was reasoned that representing these fluctuations more explicitly may convey more information about the underlying base sequence to the base calling model. One dispersion based feature considered is the sample standard deviation $\bar{s}$ in sliding windows of size $w$ of the signal, computed as

$$
\bar{s}(t) =
\begin{cases}
\mathrm{stddev}(\{x(t') \mid 0 \le t' < w\}), & 0 \le t < \lceil \tfrac{w}{2} \rceil \\
\mathrm{stddev}(\{x(t') \mid T - w < t' \le T\}), & T - \lceil \tfrac{w}{2} \rceil < t \le T \\
\mathrm{stddev}(\{x(t') \mid t - \lfloor \tfrac{w}{2} \rfloor \le t' \le t + \lfloor \tfrac{w}{2} \rfloor\}), & \text{otherwise}
\end{cases}
$$

where $x$ is the electrical current signal of length $T$, and $\mathrm{stddev}(X)$ denotes the sample standard deviation of a set $X$, such that

$$
\mathrm{stddev}(X) = \sqrt{\frac{1}{|X|} \sum_{x \in X} (x - \mu_X)^2} \qquad (3.1)
$$

where $\mu_X = \frac{1}{|X|}\sum_{x \in X} x$. A second dispersion based feature considered is the difference $\bar{d}$ between adjacent samples of the current signal, such that

$$
\bar{d}(t) =
\begin{cases}
x(t) - x(t+1), & t = 0, \ldots, T-1 \\
x(T-1) - x(T), & t = T
\end{cases}
\qquad (3.2)
$$

where $x$ is the electrical current signal of length $T$. Note that we henceforth refer collectively to the standard deviation and difference features as dispersion based features, and to the mean and median features as average based features.

The feature extraction experiments proceeded as follows. For each of the four features (mean, median, standard deviation, and difference), reads were first normalized using z-score normalization with window size 32768. The feature in question was then extracted and each sample was stacked in a vector together with the plain signal sample, [signal(t), feature(t)]^T. These vectors were then uniformly quantized into Nq,1 = 101 bins along the first dimension, and Nq,2 bins along the second, where Nq,2 ∈ {21, 51, 101}, resulting in a quantized representation with a total number of bins Nq = Nq,1 Nq,2 ∈ {2121, 5151, 10201}. The mean, median, and standard deviation features were extracted from sliding windows with window size w = 5. In all feature extraction experiments, the models were trained with 20 soft iterations, followed by 5 hard iterations. Each model was trained on the small training set. In an additional experiment, a model was trained on the large training set in order to investigate whether the addition of extra features in the observations implied a need for more training data.

Tables 3.11 and 3.12 show the resulting performance metrics on the test set, for the 5-mer and 7-mer base calling models respectively, when adding different extracted features to the data. Tables 3.13 and 3.14 contain a summary of the best results for each feature, along with performance when training on the small versus the large training set with the difference feature quantized to Nq,2 = 51 bins, for the 5-mer and 7-mer model respectively.

We see in Tables 3.11 and 3.12 that the addition of the median and the mean feature both lead to worse base calling performance, no matter the number of quantization bins. However, the addition of the difference feature yields a significant boost in performance, both for the 5-mer and the 7-mer model. Interestingly, for the 5-mer model the largest boost in performance is achieved when quantizing the difference feature to Nq,2 = 101 bins, while for the 7-mer model the best performance is achieved for Nq,2 = 51. Another notable difference between the 5-mer and 7-mer models is that for the 5-mer model, the addition of the standard deviation feature yields a similar boost in performance as the difference feature, while for the 7-mer model adding the standard deviation feature leads to a decline in performance. As this decline in performance was unexpected, further experiments were made to verify that the drop in performance was not caused by poor convergence, or by the use of an excessively large number of quantization bins. The details of these experiments are provided in Appendix D. It was concluded that neither the convergence nor the quantization appears to be the cause of the performance drop. There are other potential hypotheses as to why the addition of the standard deviation would lead to a decline in performance for the 7-mer model. For instance, it could be that different window sizes w for the extraction of the standard deviation work better depending on the choice of k-mer. Further work would however be needed to draw reliable conclusions.

Even with the performance drop seen when adding the standard deviation feature in the 7-mer model, it is clear that, in general, the dispersion based features are favorable over the average based features. As previously described, this is not entirely unexpected, since abrupt changes in the amplitude of the current signal are common, and can for instance be indicative of a shift from one k-mer to another in the nanopore. Given the poor performance seen in the filtering experiments, described in section 3.2.2, the decline in performance seen when adding the average based features is not entirely unexpected either. Further, one can conclude from the results in Tables 3.13 and 3.14 that using a larger training set when adding the additional difference feature does not lead to an improvement in base calling performance, but rather results in marginally worse performance. As previously discussed in section 3.2.1, this may be a result of overfitting, but further experimentation would be needed to draw convincing conclusions on the matter.
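As a reference for the above, the following is a minimal sketch of the two dispersion based features and of how a feature is stacked with the plain signal into two-dimensional observations. The edge handling of the windowed standard deviation follows the piecewise definition above only approximately (window indices are simply clipped to the signal boundaries), and the quantization step is omitted.

```python
import numpy as np


def windowed_std(x: np.ndarray, w: int = 5) -> np.ndarray:
    # Sample standard deviation in a window of size w around each sample.
    half = w // 2
    out = np.empty_like(x, dtype=float)
    for t in range(len(x)):
        lo, hi = max(0, t - half), min(len(x), t + half + 1)
        out[t] = x[lo:hi].std()
    return out


def adjacent_difference(x: np.ndarray) -> np.ndarray:
    # Difference between adjacent samples, x(t) - x(t+1), with the last
    # difference repeated at the final position.
    d = np.empty_like(x, dtype=float)
    d[:-1] = x[:-1] - x[1:]
    d[-1] = x[-2] - x[-1]
    return d


def stack_observations(signal: np.ndarray, feature: np.ndarray) -> np.ndarray:
    # Each observation becomes a vector [signal(t), feature(t)]^T,
    # i.e. an array of shape (T, 2).
    return np.stack([signal, feature], axis=1)
```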

3.2.4 Quantization

Another aspect of the preprocessing that could potentially have a substantial effect on base calling performance is the question of how to optimally quantize the raw signal into a representation that can be handled by the ED-HMM model. To study this aspect, two sets of experiments were conducted: one for the case of a scalar observation representation (i.e. with only the plain signal samples as observations), and one for the case of multi-dimensional observations (i.e. with additional features added). In both cases, three different quantization methods were evaluated: uniform quantization, k-means quantization, and the information loss optimized quantization proposed in [20]. For a background description of these methods, refer to section 2.5.

Hyperparameters

A number of different hyperparameters exist in the three evaluated quantization methods. Here we describe how the values for these hyperparameters were selected for our experiments.

Scalar uniform quantization For the default uniform quantization of the z-score normalized scalar signal, the minimum level lmin and maximum level lmax were selected by plotting a histogram of signal levels (rounded to the closest integer) for a random subset of reads in a signal file, and observing in what interval most signal samples lie. The resulting histogram is provided in Figure 3.3. Given that most samples lie in the interval [−5.0, 5.0], the minimum and maximum levels were selected as lmin = −5.0 and lmax = 5.0, respectively.
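A minimal sketch of this histogram inspection is given below, assuming reads is a list of z-score normalized signals stored as numpy arrays (the function name and exact plotting calls are illustrative, not taken from the thesis code).

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_level_histogram(reads):
    # Round each normalized sample to the closest integer and count how
    # often each rounded level occurs across the given reads.
    levels = np.concatenate([np.round(r).astype(int) for r in reads])
    values, counts = np.unique(levels, return_counts=True)
    plt.bar(values, counts)
    plt.xlabel("rounded signal level")
    plt.ylabel("number of samples")
    plt.show()
```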

Figure 3.3: Histogram of rounded signal levels for the plain, z-score normalized signal. We see that most samples lie in the interval [−5.0, 5.0].

A number of bins Nq was selected by inspecting the resulting quantized signal for a number of iteratively larger values of Nq and seeing at what point the quantized signal has no discernible visual differences from the raw signal. It was found that this occurred for Nq = 101. However, to investigate whether a smaller or larger number of bins could be better suited, experiments were made with Nq ∈ {21, 51, 101, 201}.
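A minimal sketch of the uniform quantizer follows; clipping samples to [lmin, lmax] before binning is an assumption of this sketch, as the exact edge handling of the thesis implementation is not described here.

```python
import numpy as np


def uniform_quantize(x: np.ndarray, n_bins: int = 101,
                     lmin: float = -5.0, lmax: float = 5.0) -> np.ndarray:
    # Clip to [lmin, lmax], rescale to [0, 1], and map to integer bin
    # indices in {0, ..., n_bins - 1} of equal width.
    clipped = np.clip(x, lmin, lmax)
    scaled = (clipped - lmin) / (lmax - lmin)
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)
```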

Vector uniform quantization In the vector quantization setting, where a vector containing the plain signal as well as an additional feature is to be quantized, the same parameters were used for the plain signal dimension, i.e. Nq,1 = 101, lmin,1 = −5.0, and lmax,1 = 5.0. For the additional feature, the same procedure as in the scalar case was performed to find appropriate values of Nq,2, lmin,2, and lmax,2. It was found that Nq,2 = 101 was a reasonable value for all features, and therefore this value was tested for all features, along with Nq,2 = 51 and Nq,2 = 21, which gave some quantization artefacts. Figure 3.4 provides an example of a segment of the difference feature in its raw form, and quantized to the three different values of Nq,2. We see that the feature quantized to 101 bins is nearly identical to the raw feature, while the feature quantized to 21 or 51 bins contains some artefacts. For each of the features, the same histogram approach as for the plain signal was utilized to find appropriate values of lmin,2 and lmax,2. The values used for each feature are provided in Table 3.2.

Figure 3.4: The difference feature (i.e. the difference between adjacent signal samples) in its raw form, and quantized to 21, 51, and 101 bins.

feature      lmin,2   lmax,2
median       -4.0     4.0
mean         -5.0     5.0
difference   -5.0     5.0
stddev        0.0     4.0

Table 3.2: The minimum and maximum levels used for uniform quantization of the different features.

k-means quantization For the k-means quantization, the scikit-learn implementation of mini-batch k-means clustering was used [23]. The same numbers of bins were used in the scalar and vector settings as in the scalar uniform quantization and vector uniform quantization, respectively. Each k-means model was trained on the small training set, on batches of 100 000 samples at a time. Every sample in the small training set was used, and each batch was only seen once during training.
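A minimal sketch of this k-means quantizer, built on the scikit-learn MiniBatchKMeans implementation referenced above, is given below. Training options other than the number of clusters and the batch size are not stated in the thesis, so library defaults are assumed.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def fit_kmeans_quantizer(samples: np.ndarray, n_bins: int) -> MiniBatchKMeans:
    # samples has shape (N, d): d = 1 in the scalar case, d = 2 with an
    # added feature. Each batch of 100 000 samples is seen exactly once.
    km = MiniBatchKMeans(n_clusters=n_bins, batch_size=100_000)
    for start in range(0, len(samples), 100_000):
        km.partial_fit(samples[start:start + 100_000])
    return km


def kmeans_quantize(km: MiniBatchKMeans, samples: np.ndarray) -> np.ndarray:
    # Each sample is quantized to the index of its nearest cluster centre.
    return km.predict(samples)
```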

Information loss optimized quantization For the information loss optimized quantization, a custom implementation of Algorithm 2, as described in section 2.5, was used. The same numbers of bins were used in the scalar and vector settings as in the scalar uniform quantization and vector uniform quantization, respectively. The quantization loss model was trained for 30 iterations, and fed a different batch of 10 000 samples from the small training set on each iteration. The soft cluster assignment parameter β was set by grid search on the values {0.01, 0.1, 1, 10, 100}, and the value β = 0.1, which yielded the best base calling performance on the validation set, was selected.

Scalar quantization For the scalar quantization experiments, the reads were first normalized using z-score normalization with window size 32768. Signal samples were then quantized into Nq bins, where Nq ∈ {21, 51, 101, 201}, for each of the three quantization methods.

Tables 3.15 and 3.16 show the resulting performance metrics on the test set, for the 5-mer and 7-mer base calling models respectively, when quantizing the current signal with different quantization methods and a varying number of quantization bins. We see in both Table 3.15 and Table 3.16 that k-means quantization and information loss optimized quantization outperform uniform quantization in terms of resulting base calling performance for any choice of the number of quantization bins. However, we also note that while the gain in performance from using k-means quantization over simple uniform quantization is significant when the number of bins is small (e.g. for Nq = 21), the difference is nearly negligible for a large number of bins (e.g. for Nq = 201). The use of k-means quantization adds significant computational complexity at inference time, as the closest cluster mean must be found for each sample we want to quantize. Therefore, using k-means for quantization may not be justified if uniform quantization with a larger number of bins can yield approximately the same level of performance, which the results indicate is the case.

The difference in performance between k-means quantization and information loss optimized quantization is largely non-existent. The application of the information loss optimization only occasionally results in a fraction of a percentage point increase in the identity rate. Since this optimization adds additional computational overhead compared to k-means (as the information loss optimization algorithm is initialized from the k-means quantization), it is concluded that in the scalar case, using the information loss optimized quantization is not worth the effort. However, it should be noted that in the original algorithm, proposed by [20], each training update is performed using all of the data in the training set. In our experiments, each update is done using a random batch of training data, since the full training set is too large to be used in its entirety for each update. How well the information loss optimized quantization algorithm performs with this methodology is previously unexplored, and it could therefore be the case that the algorithm is not as effective when applied in this manner. The choice of hyperparameters is another factor that could not be mimicked exactly from [20], and thus this could potentially also explain the poor performance of the algorithm.

Vector quantization The two most promising features from the experiments on feature extraction, described in section 3.2.3, were selected for further experimentation on quantization when additional features are present. The two evaluated features were the difference between adjacent signal samples, and the standard deviation in sliding windows of the signal. The vector quantization experiments proceeded as follows. For each of the two features, reads were first normalized using z-score normalization with window size 32768. For the uniform quantizer, the signal and features were then quantized using the exact same methodology as in the feature extraction experiments, described in section 3.2.3. For the other quantization methods (k-means and information loss quantization), the equivalent total numbers of bins Nq ∈ {2121, 5151, 10201} were used.

Tables 3.17 and 3.18 show the resulting performance metrics on the test set, for the 5-mer and 7-mer base calling models respectively, when quantizing the signals with the added difference feature, with different quantization methods and a varying number of quantization bins. Tables 3.19 and 3.20 show the corresponding results for the standard deviation feature. We see in Tables 3.17-3.20 that k-means quantization and information loss optimized quantization provide some improvement in performance over uniform quantization when the number of quantization bins is sufficiently small (i.e. Nq = 2121). However, for a larger number of bins, uniform quantization performs significantly better. Since the information loss quantization algorithm, as well as k-means, aims to find an optimal set of quantization bins according to some metric, one might expect these methods to perform at least as well as a uniform quantizer. There could be several reasons why this is not what we observe in the results. The most likely explanation is that for the cases where Nq > 2121, the number of bins becomes too large for the k-means algorithm and the information loss quantization algorithm to handle. Further, the training configurations of both the k-means and information loss quantizers have not been tuned rigorously. Therefore, the possibility cannot be ruled out that the tested quantizers could be undertrained or overtrained, or that a different choice of hyperparameters would yield different results.

3.2.5 Best performing preprocessing configuration

In summary, we provide the performance of the best performing preprocessing configuration, compared to the performance of the previous default approach, for the 5-mer and 7-mer model respectively, in Tables 3.21 and 3.22. We see that the best configurations found yield a 2.59 percentage point increase and a 2.22 percentage point increase, for the 5-mer and 7-mer model respectively, compared to only applying full signal MAD normalization. For the 5-mer model, the addition of the standard deviation feature provided a significant boost in performance, while for the 7-mer model, the addition of the difference feature yielded a larger improvement. For both models, applying z-score normalization in windows of size w = 32768 yielded the best performance. For the 5-mer model, the best performance was achieved with Nq = 10201 bins, with uniform quantization. For the 7-mer model, the optimal number of bins was Nq = 5151, once again with uniform quantization.

window size   identity   insertion   deletion   substitution
full signal   76.33%     5.53%       9.47%      8.67%
131072        76.41%     5.51%       9.45%      8.63%
98034         76.44%     5.49%       9.46%      8.61%
65536         76.43%     5.47%       9.48%      8.61%
32768         76.27%     5.47%       9.55%      8.71%
16384         75.79%     5.46%       9.78%      8.97%
8192          74.87%     5.52%       10.15%     9.46%
4096          73.39%     5.63%       10.73%     10.25%
2048          71.36%     5.83%       11.47%     11.34%
1024          69.04%     6.06%       12.24%     12.66%

Table 3.3: Base calling performance of 5-mer model when using MAD normalization in varying window sizes.

window size   identity   insertion   deletion   substitution
full signal   80.25%     5.12%       7.47%      7.16%
131072        80.36%     5.09%       7.45%      7.09%
98034         80.41%     5.07%       7.45%      7.07%
65536         80.40%     5.06%       7.46%      7.08%
32768         80.28%     5.03%       7.53%      7.16%
16384         79.64%     5.08%       7.80%      7.48%
8192          78.46%     5.19%       8.25%      8.09%
4096          76.56%     5.39%       8.97%      9.09%
2048          73.87%     5.67%       9.98%      10.48%
1024          70.58%     5.99%       11.25%     12.18%

Table 3.4: Base calling performance of 7-mer model when using MAD normalization in varying window sizes.

window size   identity   insertion   deletion   substitution
full signal   77.02%     5.44%       9.26%      8.28%
131072        77.13%     5.41%       9.24%      8.23%
98034         77.18%     5.39%       9.23%      8.20%
65536         77.22%     5.39%       9.22%      8.17%
32768         77.27%     5.36%       9.22%      8.15%
16384         77.18%     5.34%       9.28%      8.21%
8192          76.83%     5.36%       9.41%      8.41%
4096          75.99%     5.45%       9.71%      8.85%
2048          74.66%     5.59%       10.18%     9.57%
1024          72.66%     5.81%       10.86%     10.67%

Table 3.5: Base calling performance of 5-mer model when using z-score normalization in varying window sizes.

window size   identity   insertion   deletion   substitution
full signal   81.18%     4.95%       7.22%      6.65%
131072        81.34%     4.91%       7.19%      6.56%
98034         81.35%     5.57%       6.58%      6.50%
65536         81.50%     4.85%       7.15%      6.50%
32768         81.61%     4.82%       7.14%      6.44%
16384         81.54%     4.78%       7.19%      6.49%
8192          81.11%     4.85%       7.33%      6.71%
4096          80.05%     4.99%       7.71%      7.25%
2048          78.25%     5.92%       7.72%      8.11%
1024          75.81%     5.54%       9.18%      9.47%

Table 3.6: Base calling performance of 7-mer model when using z-score normalization in varying window sizes.

method                               identity   insertion   deletion   substitution
MAD (full signal)                    76.33%     5.53%       9.47%      8.67%
z-score (full signal)                77.02%     5.44%       9.26%      8.28%

MAD (w=98034)                        76.44%     5.49%       9.46%      8.61%
z-score (w=32768)                    77.27%     5.36%       9.22%      8.15%

z-score (full signal, small train)   77.02%     5.44%       9.26%      8.28%
z-score (full signal, large train)   76.95%     5.44%       9.31%      8.30%

Table 3.7: Base calling performance for 5-mer model with different normalization methods. Results shown are for full signal length normalization, optimal window size normalization (98034 for MAD and 32768 for z-score), and full signal length z-score normalization with different-sized training sets.

method                               identity   insertion   deletion   substitution
MAD (full signal)                    80.25%     5.12%       7.47%      7.16%
z-score (full signal)                81.18%     4.95%       7.22%      6.65%

MAD (w=98034)                        80.41%     5.07%       7.45%      7.07%
z-score (w=32768)                    81.61%     4.82%       7.14%      6.44%

z-score (full signal, small train)   81.18%     4.95%       7.22%      6.65%
z-score (full signal, large train)   81.36%     4.82%       7.27%      6.54%

Table 3.8: Base calling performance for 7-mer model with different normalization methods. Results shown are for full signal length normalization, optimal window size normalization (98034 for MAD and 32768 for z-score), and full signal length z-score normalization with different-sized training sets.

method         identity   insertion   deletion   substitution
no filtering   77.27%     5.36%       9.22%      8.15%

median, w=3    75.87%     6.63%       8.56%      8.94%
median, w=5    73.96%     7.37%       8.83%      9.84%
median, w=7    71.47%     7.95%       9.59%      10.99%

mean, w=3      72.58%     7.46%       9.69%      10.27%
mean, w=5      67.96%     9.58%       9.91%      12.55%
mean, w=7      68.39%     10.00%      8.79%      12.82%

Table 3.9: Base calling performance of 5-mer model when applying median and mean filtering with varying window sizes.

method         identity   insertion   deletion   substitution
no filtering   81.61%     4.82%       7.14%      6.44%

median, w=3    79.89%     6.40%       6.48%      7.22%
median, w=5    77.45%     7.44%       6.83%      8.29%
median, w=7    74.26%     8.32%       7.70%      9.71%

mean, w=3      76.27%     7.30%       7.77%      8.66%
mean, w=5      69.91%     9.89%       8.69%      11.51%
mean, w=7      66.91%     11.33%      8.54%      13.22%

Table 3.10: Base calling performance of 7-mer model when applying median and mean filtering with varying window sizes.

method                 identity   insertion   deletion   substitution
no extra features      77.27%     5.36%       9.22%      8.15%

median, 21 bins        76.04%     6.51%       8.62%      8.83%
median, 51 bins        75.71%     7.12%       8.18%      8.99%
median, 101 bins       75.61%     7.35%       8.01%      9.03%

mean, 21 bins          76.23%     7.27%       7.27%      8.54%
mean, 51 bins          75.43%     8.66%       7.22%      8.68%
mean, 101 bins         75.05%     9.34%       6.89%      8.72%

difference, 21 bins    78.44%     6.63%       7.12%      7.81%
difference, 51 bins    78.70%     7.08%       6.52%      7.70%
difference, 101 bins   78.78%     7.21%       6.36%      7.65%

stddev, 21 bins        78.53%     6.50%       7.26%      7.71%
stddev, 51 bins        78.81%     6.69%       6.95%      7.54%
stddev, 101 bins       78.92%     6.73%       6.86%      7.49%

Table 3.11: Base calling performance of 5-mer model when adding different additional extracted features to the observations.

method                 identity   insertion   deletion   substitution
no extra features      81.61%     4.82%       7.14%      6.44%

median, 21 bins        80.04%     6.35%       6.44%      7.17%
median, 51 bins        79.28%     7.41%       5.88%      7.43%
median, 101 bins       78.78%     8.02%       5.63%      7.57%

mean, 21 bins          80.06%     6.89%       6.10%      6.94%
mean, 51 bins          79.07%     8.45%       5.41%      7.06%
mean, 101 bins         78.67%     9.18%       5.07%      7.07%

difference, 21 bins    82.37%     6.25%       5.19%      6.19%
difference, 51 bins    82.47%     6.94%       4.50%      6.09%
difference, 101 bins   82.21%     7.44%       4.23%      6.12%

stddev, 21 bins        80.91%     7.02%       5.58%      6.49%
stddev, 51 bins        81.07%     7.49%       5.10%      6.35%
stddev, 101 bins       80.80%     7.92%       4.88%      6.41%

Table 3.12: Base calling performance of 7-mer model when adding different additional extracted features to the observations.

method                     identity   insertion   deletion   substitution
no extra features          77.27%     5.36%       9.22%      8.15%
median                     76.04%     6.51%       8.62%      8.83%
mean                       76.23%     7.27%       7.27%      8.54%
difference                 78.78%     7.21%       6.36%      7.65%
stddev                     78.92%     6.73%       6.86%      7.49%

difference (small train)   78.70%     7.08%       6.52%      7.70%
difference (large train)   77.90%     8.31%       5.96%      7.83%

Table 3.13: Summary of base calling performance of 5-mer model when adding different additional extracted features to the observations. Here we provide the best achieved results for each feature, along with results when training with the small versus large training set with the difference feature quantized into Nq,2 = 51 bins.

method                     identity   insertion   deletion   substitution
no extra features          81.61%     4.82%       7.14%      6.44%
median                     80.04%     6.35%       6.44%      7.17%
mean                       80.06%     6.89%       6.10%      6.94%
difference                 82.47%     6.94%       4.50%      6.09%
stddev                     81.07%     7.49%       5.10%      6.35%

difference (small train)   82.47%     6.94%       4.50%      6.09%
difference (large train)   82.43%     7.44%       4.21%      5.93%

Table 3.14: Summary of base calling performance of 7-mer model when adding different additional extracted features to the observations. Here we provide the best achieved results for each feature, along with results when training with the small versus large training set with the difference feature quantized into Nq,2 = 51 bins.

method                identity   insertion   deletion   substitution
uniform, 21 bins      74.02%     5.22%       11.35%     9.41%
uniform, 51 bins      77.02%     5.25%       9.52%      8.20%
uniform, 101 bins     77.27%     5.36%       9.22%      8.15%
uniform, 201 bins     77.31%     5.39%       9.17%      8.13%

k-means, 21 bins      76.94%     5.22%       9.59%      8.25%
k-means, 51 bins      77.27%     5.36%       9.22%      8.15%
k-means, 101 bins     77.33%     5.38%       9.16%      8.13%
k-means, 201 bins     77.34%     5.40%       9.14%      8.13%

info-loss, 21 bins    76.94%     5.22%       9.58%      8.26%
info-loss, 51 bins    77.29%     5.35%       9.22%      8.15%
info-loss, 101 bins   77.34%     5.38%       9.15%      8.13%
info-loss, 201 bins   77.34%     5.39%       9.14%      8.13%

Table 3.15: Base calling performance of 5-mer model when quantizing the electrical current signal with different methods, and varying number of quantization bins.

method                identity   insertion   deletion   substitution
uniform, 21 bins      77.00%     4.91%       9.88%      8.22%
uniform, 51 bins      81.18%     4.73%       7.51%      6.58%
uniform, 101 bins     81.61%     4.82%       7.13%      6.44%
uniform, 201 bins     81.69%     4.86%       7.03%      6.41%

k-means, 21 bins      81.09%     4.69%       7.60%      6.62%
k-means, 51 bins      81.60%     4.82%       7.14%      6.44%
k-means, 101 bins     81.71%     4.85%       7.03%      6.41%
k-means, 201 bins     81.70%     4.92%       6.97%      6.41%

info-loss, 21 bins    81.09%     4.69%       7.59%      6.63%
info-loss, 51 bins    81.65%     4.82%       7.11%      6.42%
info-loss, 101 bins   81.72%     4.87%       7.01%      6.40%
info-loss, 201 bins   81.70%     4.93%       6.96%      6.41%

Table 3.16: Base calling performance of 7-mer model when quantizing the electrical current signal with different methods, and varying number of quantization bins.

method                            identity   insertion   deletion   substitution
difference, uniform, 21 bins      78.44%     6.63%       7.12%      7.81%
difference, uniform, 51 bins      78.70%     7.08%       6.52%      7.70%
difference, uniform, 101 bins     78.78%     7.21%       6.36%      7.65%

difference, k-means, 21 bins      78.54%     7.38%       6.34%      7.75%
difference, k-means, 51 bins      78.53%     7.52%       6.22%      7.73%
difference, k-means, 101 bins     78.46%     7.71%       6.07%      7.76%

difference, info-loss, 21 bins    78.63%     7.37%       6.31%      7.70%
difference, info-loss, 51 bins    78.56%     7.51%       6.20%      7.73%
difference, info-loss, 101 bins   78.45%     7.74%       6.07%      7.75%

Table 3.17: Base calling performance of 5-mer model when quantizing the signals and added difference feature, with different methods, and varying number of quantization bins. Note that the number of quantization bins Nq ∈ 101 × {21, 51, 101} = {2121, 5151, 10201}.

method                            identity   insertion   deletion   substitution
difference, uniform, 21 bins      82.37%     6.25%       5.19%      6.19%
difference, uniform, 51 bins      82.47%     6.94%       4.50%      6.09%
difference, uniform, 101 bins     82.21%     7.44%       4.23%      6.12%

difference, k-means, 21 bins      81.62%     7.97%       4.14%      6.27%
difference, k-means, 51 bins      79.57%     9.86%       3.79%      6.78%
difference, k-means, 101 bins     75.79%     13.10%      3.45%      7.66%

difference, info-loss, 21 bins    81.78%     7.96%       4.04%      6.21%
difference, info-loss, 51 bins    79.79%     9.76%       3.74%      6.71%
difference, info-loss, 101 bins   76.08%     12.91%      3.43%      7.58%

Table 3.18: Base calling performance of 7-mer model when quantizing the signals and added difference feature, with different methods, and varying number of quantization bins. Note that the number of quantization bins Nq ∈ 101 × {21, 51, 101} = {2121, 5151, 10201}.

method                        identity   insertion   deletion   substitution
stddev, uniform, 21 bins      78.53%     6.50%       7.26%      7.71%
stddev, uniform, 51 bins      78.81%     6.69%       6.95%      7.54%
stddev, uniform, 101 bins     78.92%     6.73%       6.86%      7.49%

stddev, k-means, 21 bins      78.05%     8.05%       6.41%      7.49%
stddev, k-means, 51 bins      78.05%     8.16%       6.32%      7.47%
stddev, k-means, 101 bins     78.01%     8.32%       6.19%      7.48%

stddev, info-loss, 21 bins    78.10%     8.04%       6.39%      7.47%
stddev, info-loss, 51 bins    78.08%     8.16%       6.30%      7.46%
stddev, info-loss, 101 bins   78.04%     8.31%       6.19%      7.46%

Table 3.19: Base calling performance of 5-mer model when quantizing the signals and added standard deviation feature, with different methods, and varying number of quantization bins. Note that the number of quantization bins Nq ∈ 101 × {21, 51, 101} = {2121, 5151, 10201}.

method                        identity   insertion   deletion   substitution
stddev, uniform, 21 bins      80.91%     7.02%       5.58%      6.49%
stddev, uniform, 51 bins      81.07%     7.49%       5.10%      6.35%
stddev, uniform, 101 bins     80.80%     7.92%       4.88%      6.41%

stddev, k-means, 21 bins      80.82%     8.63%       4.37%      6.18%
stddev, k-means, 51 bins      78.85%     10.45%      4.00%      6.69%
stddev, k-means, 101 bins     75.30%     13.59%      3.61%      7.50%

stddev, info-loss, 21 bins    81.01%     8.50%       4.33%      6.16%
stddev, info-loss, 51 bins    79.00%     10.36%      3.96%      6.67%
stddev, info-loss, 101 bins   75.46%     13.45%      3.62%      7.47%

Table 3.20: Base calling performance of 7-mer model when quantizing the signals and added standard deviation feature, with different methods, and varying number of quantization bins. Note that the number of quantization bins Nq ∈ 101 × {21, 51, 101} = {2121, 5151, 10201}.

method                                              identity   insertion   deletion   substitution
default preprocessing (MAD-norm only)               76.33%     5.53%       9.47%      8.67%
best preprocessing (z-score-norm, stddev feature)   78.92%     6.73%       6.86%      7.49%

Table 3.21: Base calling performance of 5-mer model for the conventional preprocessing configuration (full signal MAD normalization), and the best performing preprocessing configuration found (z-score normalization with w = 32768, and added standard deviation feature).

method                                                  identity   insertion   deletion   substitution
default preprocessing (MAD-norm only)                   80.25%     5.12%       7.47%      7.16%
best preprocessing (z-score-norm, difference feature)   82.47%     6.94%       4.50%      6.09%

Table 3.22: Base calling performance of 7-mer model for the conventional preprocessing configuration (full signal MAD normalization), and the best performing preprocessing configuration found (z-score normalization with w = 32768, and added difference feature).

Chapter 4

Discussion

4.1 Result summary

In general, the results of the experiments show that applying appropriate preprocessing techniques to the electrical current signals can yield moderate improvements in the performance of a subsequent base caller. In particular, the results indicate that the application of z-score normalization in fairly large windows of the signals (w = 32768), compared to other techniques such as MAD normalization, is particularly impactful on the base calling performance. The fact that z-score normalization consistently performed better than MAD normalization could be an indication that the influence of outliers in the signals is not as significant as feared. Since normalization using the mean and standard deviation worked well, it could be the case that large outliers are few enough, or small enough, to not have a significant impact on the normalization parameters.

The application of mean or median filters on the signals, in varying window sizes, led to worse base calling performance. We believe that a likely explanation for this drop in performance is that even for the shortest evaluated window size, the filtering leads to excessive smoothing. Similarly, the inclusion of the mean or median as additional features also led to worse base calling performance. On the other hand, the inclusion of dispersion based features, either in the form of the difference between adjacent signal samples, or the standard deviation in windows of the signals, generally had a positive impact on the base calling performance. However, a clear exception was that for the 7-mer model, the addition of the standard deviation feature led to worse performance. Nonetheless, given all other results it is clear that, in general, the dispersion based features are favorable over the average based features. This is not entirely unexpected, since abrupt changes in the amplitude of the current signal are common, and can for instance be indicative of a shift from one k-mer to another in the nanopore. Further work is however needed to investigate why the standard deviation feature was not beneficial in all cases.

The application of more sophisticated quantization techniques (such as k-means quantization or information loss optimized quantization) had a moderate positive impact on the base calling performance in the scalar case (i.e. with no added features). However, the performance gains are so small compared to the added computational overhead that using these techniques is most likely not worth the effort in practice. In the multidimensional case (i.e. with added features), the non-uniform quantization techniques did not have any significant positive impact on the base calling performance, but instead led to worse performance compared to simple uniform quantization. The underlying reasons for the poor performance of the non-uniform quantization methods are largely unknown, but the results appear to indicate that the number of quantization bins in the tested configurations was too large for the non-uniform quantization algorithms to handle. Further, the possibility cannot be ruled out that the quantization models were undertrained or overtrained, or that a different choice of hyperparameters would yield different results, since the training procedures for these models were not thoroughly explored.

4.2 Limitations

While the obtained results are valuable, it is important to note that the scope of this thesis is fairly limited. For instance, the effects of the preprocessing techniques were only evaluated for a single base calling model (the ED-HMM base caller). To draw any general conclusions regarding the optimal preprocessing scheme for an arbitrary base caller, the techniques would need to be evaluated on more models, especially since most modern base callers utilize deep networks in place of HMMs.

Another limitation is that only one dataset was used for the preprocessing experiments. In order to draw conclusions regarding the generalizability of the results to data acquired with other sequencers, or from other organisms, the preprocessing methods would need to be evaluated on multiple datasets. Further, only a small fraction of the full dataset was used, due to the limited capability of the base calling model to take a varying DNA translocation speed into account (see Appendix B for details). Evaluating the preprocessing methods on a larger subset of the dataset would give higher credibility to the results.

An additional limitation concerns the hyperparameters of the various preprocessing methods that were evaluated. Given the limited time frame for the thesis and the limited computational power available, an exhaustive hyperparameter search has not been done for any method. Rather, the hyperparameters have at best been found through sparse grid searches, and at worst been set based on some heuristic or previous work. It is therefore possible that the chosen hyperparameters are suboptimal, and that the results would be different for the optimal choice of hyperparameters.

4.3 Future work

Given the obtained results, there are several potential future research directions. An obvious extension of this work would be to evaluate the considered preprocessing techniques using additional datasets and/or additional base calling models, in order to explore the generalizability of the results. Similarly, the same preprocessing methods could be evaluated with more thorough hyperparameter searches, to verify the validity of the results.

As the experiments with quantization methods beyond uniform quantization have been fairly limited in this thesis, a potential direction of future work would be to investigate further the importance of quantization in achieving optimal base calling performance. While both k-means quantization and information loss optimized quantization were found to degrade performance in our experiments, it is for instance possible that a different configuration (e.g. with fewer quantization bins, or different hyperparameters) could yield better results.

Since for normalization, filtering, and feature extraction only a small set of basic methods were considered, a more interesting future research direction would be to study more sophisticated methods within these domains. One possibility would be to look at previous work in fields with similar input data, for instance speech processing, and consider whether methods used there could potentially be applicable to the electrical current signals from a nanopore. A likely challenge lies in the fact that in the nanopore signals, information about the underlying nucleobase sequence is modulated in the signal amplitude, while in speech processing (and other similar fields) signals are often treated in terms of their frequency content.

Another suggestion for potentially exploring further the impact of different preprocessing methods would be to study the resulting base calling performance in greater detail. While the aggregate base calling performance (in terms of identity, insertion, deletion, and substitution rates) on the full test set is perhaps the most relevant target metric, further analysis could potentially be performed by studying histograms of these metrics across the test set, such as the ones exemplified in Appendix C. Comparing the base calling performance given different preprocessing schemes on specific reads could potentially yield deeper insight into what the benefits of certain methods are compared to others.

4.4 Ethics and society

While DNA sequencing techniques are evolving, the technology is not yet at a stage where all of society is aware of its existence or its implications. It is however undeniable that the current applications of DNA sequencing have made a considerable positive impact on society. The use of sequencing in forensics enables the identification of criminal perpetrators through genealogical research. For example, the so-called Golden State Killer, an infamous serial killer who committed at least 13 murders in California during the 1970s and 1980s, was identified in 2018 using genealogical research [24]. DNA sequencing is also the primary tool for identifying and studying viruses, for instance for the purpose of developing vaccines, as most viruses are too small to be seen in a microscope. In the currently ongoing COVID-19 pandemic, this particular application is perhaps more relevant than ever before. In clinical medicine, DNA sequencing can for instance be used for diagnosis of diseases that are associated with known mutations in one or multiple genes. Sequencing also plays an important role in research within the fields of biology and medicine, to increase our understanding of genetics and life in general.

Even though the societal benefits of DNA sequencing are clear, there are also considerable ethical concerns regarding the technology. The eventual wide-spread availability of DNA sequencers raises questions, for instance, regarding who should be allowed to read or share information about the DNA of a human person. One concern is that, in the wrong hands, DNA sequencing could enable a new form of systematic discrimination, based on the genetics of individuals rather than just their externally perceived attributes. Similarly, another potential use-case of ethically questionable nature is if DNA sequencing were to be used to perform genetic tests, for instance for intelligence or athletic ability, by employers in recruitment processes, or by insurance companies to determine individual insurance policies. While the current understanding of genetics and its probabilistic nature speaks against the eventual possibility of performing such tests, these are real concerns since developments in the field happen quickly. The speed at which research progress in the field of DNA sequencing has been made over the past decade in particular also raises concerns regarding public awareness of the technology. One could question whether an individual who consents to having their DNA sequenced is sufficiently aware of the different ways the acquired data could come to be used.

Chapter 5

Conclusion

In this thesis, we have shown that the application of various preprocessing techniques to the signals from a nanopore DNA sequencer can have a moderate impact on the performance of a subsequent DNA base caller. With appropriate methods for normalization, feature extraction, and quantization, the mean identity rate of the studied ED-HMM base caller was improved by 2 - 3 percentage points, compared to a conventional preprocessing scheme. It was found that the application of z-score normalization in sliding windows of the input signals, as well as the inclusion of additional dispersion based features, was especially impactful in improving the base calling performance.

A relevant future research direction would be to explore the generalizability of the derived results to other base callers or other datasets. Since most modern base callers utilize deep neural networks, reproducing the experiments with a deep base calling model would be of particular interest. Further work could also be done investigating the impact of quantization of the nanopore signals, as experiments on this topic were fairly limited in this thesis. An additional research direction would be to evaluate more sophisticated methods for signal normalization, filtering, or feature extraction, perhaps by taking inspiration from adjacent fields such as speech processing, as only basic methods were considered in this work.

Bibliography

[1] Erik Pettersson, Joakim Lundeberg, and Afshin Ahmadian. Generations of sequencing technologies. Feb. 2009. doi: 10.1016/j.ygeno.2008.10.003.
[2] Oxford Nanopore Technologies. url: https://nanoporetech.com/.
[3] Kerstin Göpfrich and Kim Judge. "Decoding DNA with a pocket-sized sequencer". In: Science in School 43 (2018), pp. 17–20. url: https://www.scienceinschool.org/content/decoding-dna-pocket-sized-sequencer.
[4] Franka J. Rang, Wigard P. Kloosterman, and Jeroen de Ridder. From squiggle to basepair: Computational approaches for improving nanopore sequencing read accuracy. July 2018. doi: 10.1186/s13059-018-1462-9. url: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1462-9.
[5] Ryan R. Wick, Louise M. Judd, and Kathryn E. Holt. "Performance of neural network basecalling tools for Oxford Nanopore sequencing". In: bioRxiv (Feb. 2019), p. 543439. doi: 10.1101/543439.
[6] Oxford Nanopore Technologies. Analysis solutions for nanopore sequencing data. 2019. url: https://nanoporetech.com/nanopore-sequencing-data-analysis.
[7] Haotian Teng et al. "Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning". In: GigaScience 7.5 (Apr. 2018). doi: 10.1093/GIGASCIENCE/giy037. url: https://academic.oup.com/gigascience/article/7/5/giy037/4966989.


[8] Marcus H Stoiber et al. "De novo Identification of DNA Modifications Enabled by Genome-Guided Nanopore Signal Processing". In: bioRxiv (Dec. 2016), p. 094672. doi: 10.1101/094672. url: https://www.biorxiv.org/content/10.1101/094672v1.abstract.
[9] Peng Ni et al. "DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning". In: Bioinformatics 35.22 (2019), pp. 4586–4595. doi: 10.1093/bioinformatics/btz276. url: https://academic.oup.com/bioinformatics/article/35/22/4586/5474907.
[10] Qian Liu et al. "Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data". In: Nature Communications 10.1 (Dec. 2019), pp. 1–11. issn: 20411723. doi: 10.1038/s41467-019-10168-2. url: https://www.nature.com/articles/s41467-019-10168-2.
[11] Vladimír Boža, Broňa Brejová, and Tomáš Vinař. "DeepNano: Deep recurrent neural networks for base calling in MinION Nanopore reads". In: PLoS ONE 12.6 (June 2017). issn: 19326203. doi: 10.1371/journal.pone.0178751. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5459436/.
[12] Arne Leijon and Gustav Eje Henter. Pattern Recognition - Fundamental Theory and Exercise Problems. Stockholm: KTH - School of Electrical Engineering, 2015. doi: 10.1002/wics.99.
[13] Shun Zheng Yu. Hidden semi-Markov models. Feb. 2010. doi: 10.1016/j.artint.2009.11.011.
[14] Heng Li. "Minimap2: pairwise alignment for nucleotide sequences". In: Bioinformatics 34.18 (2018), pp. 3094–3100. issn: 1367-4803. doi: 10.1093/bioinformatics/bty191. url: https://doi.org/10.1093/bioinformatics/bty191.
[15] Sergey Ioffe and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift". In: 32nd International Conference on Machine Learning, ICML 2015. Vol. 1. International Machine Learning Society (IMLS), Feb. 2015, pp. 448–456. isbn: 9781510810587.
[16] T Jayalakshmi and A Santhakumaran. "Statistical normalization and back propagation for classification". In: International Journal of Computer Theory and Engineering 3.1 (2011), pp. 1793–8201.

[17] Samit Bhanja and Abhishek Das. "Impact of Data Normalization on Deep Neural Network for Time Series Forecasting". In: arXiv preprint arXiv:1812.05519 (Dec. 2018). url: http://arxiv.org/abs/1812.05519.
[18] S C Nayak, B B Misra, and H S Behera. Impact of Data Normalization on Stock Index Forecasting. Tech. rep. 2014, pp. 257–269. url: www.mirlabs.net/ijcisim/index.html.
[19] David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. Tech. rep. Stanford University, June 2009. url: http://ilpubs.stanford.edu:8090/778.
[20] Svetlana Lazebnik and Maxim Raginsky. "Supervised learning of quantizer codebooks by information loss minimization". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 31.7 (2009), pp. 1294–1309. issn: 01628828. doi: 10.1109/TPAMI.2008.138. url: https://ieeexplore.ieee.org/document/4531751.
[21] Illumina | Sequencing and array-based solutions for genetic research. url: https://emea.illumina.com/.
[22] nanoporetech/taiyaki: Training models for basecalling Oxford Nanopore reads. url: https://github.com/nanoporetech/taiyaki.
[23] scikit-learn.org. scikit-learn - Mini-Batch K-Means clustering. 2019. url: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html.
[24] JV Chamary. How Genetic Genealogy Helped Catch The Golden State Killer. 2020. url: https://www.forbes.com/sites/jvchamary/2020/06/30/genetic-genealogy-golden-state-killer/#3c457be5a6d0.

Appendix A

Detailed background on the ED-HMM base caller

This appendix provides a more detailed description of the ED-HMM base caller than what was provided in section 2.3. We give a deeper background on HMMs (and HSMMs), and account more extensively for the mathematical details of the ED-HMM base calling model. Note that the text has considerable overlap with section 2.3, but delves deeper into most topics.

A.1 Hidden Markov Models (HMMs)

In many unsupervised learning settings, where only inputs are observed, there is a notion of additional, hidden variables (so-called latent variables) which are assumed to affect the inputs in a way that cannot be observed directly from the data. In such settings, one can attempt to explicitly account for these latent variables in the machine learning model, either as a means to improve performance, or because inference of the latent variable itself might be of interest. A frequently used model for sequence data that has this structure is the Hidden Markov Model (HMM). In a HMM, the modeled system is assumed to be a Markov process with hidden states. This means that at each time step, the system takes one of a number of possible hidden states with a probability dependent only on the previous state. Additionally, it is assumed that the observed data at time step t is dependent only on the state of the model at time step t. This observed data, which we previously referred to as inputs, is commonly denoted as observations in this context. The sequence of hidden states in a HMM for time steps t = 1, ..., T can be denoted S_1, ..., S_T, where S_t assumes a value from a set of N possible states.


Figure A.1: A Hidden Markov Model represented as a Bayesian network.

The corresponding sequence of observations is denoted X_1, ..., X_T, where X_t ∈ R^K. The parameters of a HMM can be denoted by λ = {q, A, B}, where the elements of q ∈ R^N, A ∈ R^{N×N}, and B ∈ R^{N×K} are defined as

q_j = P[S_1 = j]        (A.1)

a_ij = P[S_{t+1} = j | S_t = i]        (A.2)

b_j(x) = P[X_t = x | S_t = j]        (A.3)

where b_j is the j-th row of B. A Hidden Markov Model is commonly represented as a Bayesian network, as illustrated in Figure A.1. There exist several variations of the HMM. A more detailed review is provided in [12].

A.1.1 The Baum-Welch algorithm

In order to use a Hidden Markov Model for inference, its parameters must be learned from data. The procedure for training a HMM is known as the Baum-Welch algorithm, and has its basis in a more general machine learning algorithm known as the Expectation Maximization (EM) algorithm. The EM algorithm is used to train models in which some of the variables are unobserved (i.e. latent). By taking the expectation over all latent variables (known as the E-step), a lower bound for the log-likelihood of the data in the vicinity of the current estimate of the model parameters can be found. The estimate of the model parameters is then updated to maximize this lower bound (known as the M-step). By iteratively performing the E-step and M-step, given some initial guess for the model parameters, the EM algorithm converges to a local maximum. In the Baum-Welch algorithm, expectation maximization is used in combination with recursive computations of probabilities in the HMM to learn the parameters of the model. A more detailed account of the Baum-Welch algorithm can be found in [12].
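To make the E-step and M-step concrete, the following is a minimal Python/NumPy sketch of a single Baum-Welch iteration for a generic HMM with discrete observations. It is a textbook-style reconstruction using the notation of (A.1)-(A.3), not the training code of the base caller software, and all function and variable names are illustrative.

```python
import numpy as np

def baum_welch_step(obs, q, A, B):
    """One EM iteration for an HMM with discrete observations.

    obs : (T,) integer observation symbols
    q   : (N,) initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P[S_{t+1}=j | S_t=i]
    B   : (N, K) observation probabilities, B[j, x] = P[X_t=x | S_t=j]
    """
    obs = np.asarray(obs)
    T, N = len(obs), len(q)

    # E-step: scaled forward pass (alpha) and backward pass (beta).
    alpha, scale = np.zeros((T, N)), np.zeros(T)
    alpha[0] = q * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    # Posterior state probabilities and pairwise transition posteriors.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)

    # M-step: re-estimate the parameters from the expected counts.
    q_new = gamma[0]
    A_new = xi.sum(axis=0)
    A_new /= A_new.sum(axis=1, keepdims=True)
    B_new = np.zeros_like(B)
    for x in range(B.shape[1]):
        B_new[:, x] = gamma[obs == x].sum(axis=0)
    B_new /= B_new.sum(axis=1, keepdims=True)
    return q_new, A_new, B_new
```

Iterating this update from an initial guess, and monitoring the log-likelihood (the sum of the logarithms of the scale factors), corresponds to the E/M iteration described above.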

A.1.2 The Viterbi algorithm

Given a trained HMM, one often wants to compute the most probable sequence of hidden states given a particular sequence of observations. Formally, this can be formulated as finding the state sequence (î_1, ..., î_T) such that

(î_1, ..., î_T) = arg max_{(i_1, ..., i_T)} P[S_1 = i_1, ..., S_T = i_T | x_1, ..., x_T, λ].        (A.4)

Since there are N^T possible hidden state sequences that could have generated an observed sequence of length T, this problem cannot be reasonably solved with a brute-force search. Instead, one can use a procedure known as the Viterbi algorithm, which utilizes dynamic programming to cover only the necessary parts of the full search space. The algorithm operates recursively for t = 1, ..., T. The key idea is that at time step t, only the most probable path that results in each state i, for i = 1, ..., N, needs to be considered further, since the state at time step t + 1 is dependent only on the state at time step t. Thus, at any time t there are only N possible candidates for the most probable sequence. As the algorithm reaches t = T, the most probable sequence can be selected from the N candidates. A more detailed account of the Viterbi algorithm can be found in [12].
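For illustration, a log-domain Viterbi decoder for the same generic discrete-observation HMM can be sketched as follows; this is a standard textbook construction, not the decoder of the ED-HMM base caller.

```python
import numpy as np

def viterbi(obs, q, A, B):
    """Most probable state sequence for an HMM with discrete observations.

    Computed in log-space to avoid numerical underflow for long sequences;
    zero probabilities become -inf and are never selected by argmax.
    """
    T, N = len(obs), len(q)
    log_q, log_A, log_B = np.log(q), np.log(A), np.log(B)

    delta = np.zeros((T, N))           # best log-probability of a path ending in each state
    psi = np.zeros((T, N), dtype=int)  # back-pointers to the best predecessor state
    delta[0] = log_q + log_B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] + log_A   # entry (i, j): best path into state j via i
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) + log_B[:, obs[t]]

    # Backtrack from the most probable final state.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

At each time step only the N running maxima in delta are kept, which is exactly the pruning argument made above.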

A.2 Hidden Semi Markov Models (HSMMs)

In the traditional HMM, the only way to model an extended stay in some state is through repeated self-transitions. Since the self-transition probability for any given state is modeled by a single scalar, the level of nuance in the modeling of state durations is highly restricted. A variation of the common HMM where the durations of states are modeled explicitly is known as a Hidden Semi Markov Model (HSMM) [13]. In a HSMM, each state has an associated variable duration, during which the model will remain in the state and produce additional observations. The addition of state durations adds significant complexity to the model, as the durations must be incorporated in the transition and observation probabilities. However, training and inference with the model can still be performed using extended Baum-Welch and Viterbi algorithms that follow the same principles as in the regular HMM. A detailed account of these algorithms for the HSMM is provided in [13].

A.2.1 Explicit Duration HMM

Different simplifying assumptions can be made with regard to how the distribution of durations is modeled in the HSMM. For instance, one could assume that

1. a transition to the current state is independent of the duration of the previous state, and

2. the duration is conditioned only on the current state.

A HSMM utilizing these assumptions is known as an Explicit Duration HMM (ED-HMM) [13]. Due to its simplicity compared to other variations of the HSMM, the ED-HMM is the most popular HSMM in many applications [13].

A.3 Explicit Duration HMM as a DNA base caller

In nanopore DNA sequencing, HMMs (and especially ED-HMMs) can be utilized to model the sequence of nucleobases in a DNA strand, given the observed current signal. Since the sequence of nucleobases is the unobserved quantity that we wish to infer, it corresponds to the sequence of states in the model. The measured current signal corresponds to the observations in the model. As the translocation speed of the DNA strand through the nanopore is non-uniform, ED-HMMs, which can model the duration of states with more nuance, are especially well suited to this application. In this section we proceed to describe one such ED-HMM model that can function as a base caller for nanopore sequencing.

A.3.1 Model definition

It is generally assumed that a number of nucleobases are inside the pore at any given time (typically between 3 and 7) and thus affect the signal simultaneously. The state representation is therefore chosen to be the sequence of k nucleobases, a so-called k-mer, currently affecting the signal. For a model using a 5-mer state representation, for instance, the current state might be the sequence ‘ATACG’. The model remains in the same state for a duration d ∈ Z_+, up to a maximum of d = D steps. Henceforth, we refer to the tuple (S, d) of a state S and its associated duration d as a so-called super-state.

Figure A.2: The ED-HMM illustrated in a lattice representation. For d > 1, the model transitions to the same state, but with a lower duration. When d = 1, the model transitions to a new state with a new random duration d. In this example, D = 3.

The modeling of state durations is then implemented as follows. Upon entry to a state, a duration d is drawn from a probability distribution described by the model parameter C ∈ R^D, such that C_j = p(d = j). For durations d = 2, ..., D − 1, super-state (S, d) always transitions into the super-state (S, d − 1), i.e. the same state but with a duration one step shorter. Like in the regular HMM, an observation is produced on every transition. When the super-state eventually reaches (S, 1), the model transitions into some new state Ŝ (in accordance with the transition probabilities of the model), and a duration for the new state is once again drawn from p(d). In order to facilitate the possibility of remaining in the same state for longer than D steps, the super-state (S, D) transitions back into (S, D) with probability c ≤ 1, and into (S, D − 1) with probability 1 − c. The modeling of durations thus adds parameters C and c to be learned, in addition to the transition and observation probabilities. Figure A.2 provides a simple illustration of the above scheme in a lattice representation. The transition from the super-state (S, 1) into some new super-state (Ŝ, d) models a shift of a single nucleobase in/out of the k-mer representing state S. Compared to a standard ED-HMM, this adds constraints on what state transitions are possible, since the new k-mer must omit the first nucleobase of the previous k-mer and append a new nucleobase at the end.

Figure A.3: The ED-HMM illustrated in a lattice representation with added k-mer transition constraints. The k-mer corresponding to the new state must omit the first nucleobase of the current k-mer, and append a new nucleobase at the end. Note that the figure shows a 3-mer state representation for the sake of brevity. In practice, a 5-mer or 7-mer state is used.

For instance, in a 3-mer state representation, ‘TTT’ can only transition to ‘TTA’, ‘TTC’, ‘TTG’, or ‘TTT’. Figure A.3 shows an illustration of a lattice representation of the model, taking these constraints into account.
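The transition constraint can be made concrete with a small helper that enumerates the k-mers reachable from a given state. The integer encoding shown here is only a hypothetical way of indexing k-mer states and is not taken from the base caller software.

```python
BASES = "ACGT"

def kmer_index(kmer):
    """Encode a k-mer as an integer in [0, 4^k), interpreting it as a base-4 number."""
    index = 0
    for base in kmer:
        index = 4 * index + BASES.index(base)
    return index

def allowed_successors(kmer):
    """The four k-mers reachable from `kmer`: drop the first base, append a new one."""
    return [kmer[1:] + base for base in BASES]

# Example from the text: 'TTT' can only transition to 'TTA', 'TTC', 'TTG', or 'TTT'.
print(allowed_successors("TTT"))   # ['TTA', 'TTC', 'TTG', 'TTT']
```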

A.3.2 Model parameters

The dimensions of the parameters of the model described above can be expressed in terms of the following quantities. N_q ∈ Z_+ is the number of bins into which the raw nanopore current signal is quantized, k ∈ {5, 7} is the k-mer length, and D ∈ Z_+ is the maximum duration length. Since there are four different nucleobases, the number of k-mers, and thus the number of states, is 4^k. The parameters of the model are the initial state probabilities q ∈ R^{4^k}, the transition probabilities A ∈ R^{4×4^k}, the observation probabilities B ∈ R^{N_q×4^k}, the duration probabilities C ∈ R^D, and c ∈ R. The model parameters are defined so that

q_j = P[S_1 = j]        (A.5)

a_ij = P[next nucleobase = i | S_t = j]        (A.6)

b_ij = P[X = i | S = j]        (A.7)

C_j = p(d = j)        (A.8)

c = P[S_{t+1} = S, d = D | S_t = S, d = D]        (A.9)
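To make the dimensions above concrete, the following sketch allocates arrays of the corresponding shapes for a 5-mer model. The numerical values of N_q and D are assumptions chosen for illustration only, not the configuration used in the experiments.

```python
import numpy as np

k, N_q, D = 5, 101, 10           # k-mer length, quantization bins, max duration (assumed values)
n_states = 4 ** k                # one state per possible k-mer

q = np.full(n_states, 1.0 / n_states)   # initial state probabilities, shape (4^k,)
A = np.full((4, n_states), 0.25)        # P[next nucleobase = i | S_t = j], shape (4, 4^k)
B = np.zeros((N_q, n_states))           # P[X = i | S = j], shape (N_q, 4^k)
C = np.zeros(D)                         # duration distribution p(d), shape (D,)
c = 0.0                                 # self-transition probability at d = D, scalar
```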

A.3.3 Initialization

To apply a training procedure where the parameters of an ED-HMM can be learned from data, the model parameters must be appropriately initialized. Below, we exemplify an initialization scheme for the parameters of the ED-HMM base caller. The initial state probabilities can be initialized uniformly, so that q_j = 1/4^k ∀j. Each column b_j(X) = P[X|S = j] of the observation probability matrix B is initialized with a Gaussian probability distribution. It has been observed empirically that k-mers with ’C’ or ’T’ as their mid-point nucleobase typically produce higher current signal values in the nanopore than k-mers with ’A’ or ’G’ as their mid-point nucleobase. For this reason, each column b_j is initialized with one of two Gaussians, N(µ_CT, σ²) or N(µ_AG, σ²), depending on which nucleobase is in the mid-point of the j-th k-mer. With a current signal that has been normalized so that samples lie in the interval [−5.0, 5.0], the parameters of the Gaussians are set to µ_CT = 1.0, µ_AG = −1.0, and σ = 2.0. The transition probability matrix A can be initialized uniformly, so that a_ij = 1/4 ∀i, j. It has however been observed that better results can be achieved if the transition probabilities are set using statistics from previously base called DNA. From a known sequence of nucleobases, the probability of the next nucleobase being an ’A’, ’C’, ’G’, or ’T’ given the previous k-mer can trivially be estimated, and the entries of the transition probability matrix A set accordingly. For the duration probability distribution, a Poisson distribution with mean λ is assumed, such that

p(d) = λ^{d−1} e^{−λ} / (d−1)!,   for d = 1, ..., D − 1,
p(D) = 1 − Σ_{d′=1}^{D−1} p(d′).        (A.10)

The probability c of self-transitioning into duration d = D is set to

c = (λ^D e^{−λ} / D!) / (λ^{D−1} e^{−λ} / (D−1)!) = λ / D        (A.11)

where c < 1 if and only if λ < D. Given this construction, only the value of the parameter λ needs to be set in order to fully initialize the duration distribution. It can be shown that the expected value of the number of steps that the model remains in the same state, henceforth referred to as the expected duration E_d,

can be expressed

E_d = p(D) (D + c / (1 − c)) + Σ_{d=1}^{D−1} d · p(d).        (A.12)

The value of E_d can be estimated empirically given a previously base called DNA sequence and the corresponding raw current measurements. Further, it can be shown that, if regarded as a function E_d = E_d(λ), E_d is strictly increasing in λ, and E_d > λ + 1. Therefore, given an empirical estimate Ẽ_d of E_d, a bisection search can be used to find the value of λ ∈ (Ẽ_d − 1, D) such that E_d(λ) = Ẽ_d.
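The duration initialization can be summarized in a short sketch: build p(d) and c for a candidate λ, evaluate E_d(λ) via (A.12), and bisect until it matches the empirical estimate Ẽ_d. This is a reconstruction from the formulas above rather than the base caller's own code; the function names and the tolerance are arbitrary.

```python
import numpy as np
from scipy.stats import poisson

def duration_pmf(lam, D):
    """p(d) for d = 1..D: shifted Poisson, with the tail mass folded into d = D, as in (A.10)."""
    p = poisson.pmf(np.arange(0, D - 1), lam)     # p(d) for d = 1, ..., D-1
    return np.append(p, 1.0 - p.sum())            # p(D) = remaining probability mass

def expected_duration(lam, D):
    """E_d as in (A.12), using c = lambda / D from (A.11)."""
    p = duration_pmf(lam, D)
    c = lam / D
    d = np.arange(1, D)
    return p[-1] * (D + c / (1.0 - c)) + np.sum(d * p[:-1])

def fit_lambda(Ed_target, D, tol=1e-6):
    """Bisection search for lambda such that E_d(lambda) matches the empirical estimate."""
    lo, hi = max(Ed_target - 1.0, 1e-6), D - 1e-6
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_duration(mid, D) < Ed_target:
            lo = mid        # E_d is increasing in lambda, so lambda must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Given the fitted λ, C can be set to duration_pmf(lam, D) and c to lam / D, completing the initialization of the duration model.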

A.3.4 Training

Much like with the regular HMM, the parameters of an ED-HMM can be learned from data using a variation of the Baum-Welch algorithm. A detailed account of how the algorithm works for an ED-HMM and other HSMMs can be found in [13]. Typically, in the application of base calling, a ground-truth sequence of nucleobases is available in the training data, along with the electrical current time series. While the ground-truth sequence of nucleobases is known in the training data, there is no ground-truth alignment between this sequence and the current time series indicating which current samples originated from which nucleobase. However, if we consider a hypothetical scenario in which such an alignment was available, learning the parameters of the ED-HMM would be greatly simplified. If the alignment was known, one could, for instance, for each possible k-mer, form a histogram from the training data indicating the number of times different current levels were observed while this k-mer was in the nanopore. The j-th column of the observation probabilities b_j = P[X|S = j] could then be set directly as a normalized version of the histogram for the j-th k-mer. The remaining model parameters could similarly be set directly using histograms created from the training data. In actuality, we do not have access to a ground-truth alignment between the current samples and the sequence of nucleobases. Instead, along with the model parameters, the alignment can be learned from the training data. The Baum-Welch algorithm allows learning of the model parameters, similar to the scenario in which a ground-truth alignment is known, but instead taking into account all possible alignments and their respective probabilities. In contrast to the Baum-Welch algorithm, which utilizes probabilistic modeling of the alignments, another training algorithm that considers only the most probable alignment at any given training step may also be used.

We henceforth refer to training with the latter algorithm as hard training, and training with the modified Baum-Welch algorithm as soft training.

A.3.5 Inference

Base calling consists of finding the most probable sequence of nucleobases (i.e. states) given samples of the nanopore current signal (i.e. observations). Given a trained ED-HMM, much like in the regular HMM, this sequence can be found using the Viterbi algorithm. A more detailed account of how the Viterbi algorithm works for the ED-HMM and other HSMMs can be found in [13].

A.3.6 Evaluation

After performing inference with any base caller, one commonly wants to compare the inferred sequence of nucleobases to the ground-truth reference sequence in order to evaluate the performance of the base calling model in question. Performing such a comparison between nucleobase sequences requires that the sequences are first aligned, so that each base is correctly compared to the corresponding base in the other sequence. This alignment is performed using the publicly available sequence alignment software minimap2 [14]. Once the inferred sequence has been aligned with the reference sequence, performance is commonly measured using a metric known as the identity rate, which in turn can be split into an insertion rate, a deletion rate, and a substitution rate. The insertion, deletion, and substitution rates measure how many bases have been incorrectly inserted, deleted, and substituted in the inferred sequence, in relation to the reference sequence. They are respectively defined as

insertion rate = i_c / (m_c + i_c + d_c + s_c)        (A.13)

deletion rate = d_c / (m_c + i_c + d_c + s_c)        (A.14)

substitution rate = s_c / (m_c + i_c + d_c + s_c)        (A.15)

where m_c = count(matches), i_c = count(insertions), d_c = count(deletions), and s_c = count(substitutions). The identity rate is defined as

identity rate = m_c / (m_c + i_c + d_c + s_c)        (A.16)

and thus provides a measure of similarity between the inferred sequence and the reference sequence, such that the identity rate is 1 if the sequences are exactly identical, and 0 if they do not match in any base. Note that, given the above construction, it also holds true that

identity rate = 1 − insertion rate − deletion rate − substitution rate.        (A.17)
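A direct translation of (A.13)-(A.17) into code could look as follows, where the counts are assumed to be extracted from an alignment (for instance from the CIGAR string of a minimap2 alignment); the numbers in the example call are made up.

```python
def alignment_metrics(mc, ic, dc, sc):
    """Identity, insertion, deletion, and substitution rates from alignment counts."""
    total = mc + ic + dc + sc
    return {
        "identity": mc / total,
        "insertion": ic / total,
        "deletion": dc / total,
        "substitution": sc / total,
    }

# Sanity check: the four rates always sum to 1, consistent with (A.17).
rates = alignment_metrics(mc=90, ic=4, dc=3, sc=3)
assert abs(sum(rates.values()) - 1.0) < 1e-9
```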

An example of how the different metrics are computed when applied to two short sequences is provided in Figure A.4.

Figure A.4: The different performance metrics and their relation illustrated on two short example sequences.

Appendix B

Observed shift in DNA translocation speed

While the ED-HMM model used allows for a non-uniform translocation speed, it assumes that the mean translocation speed over a large number of reads does not change over time. In order to check whether the mean translocation speed is in fact constant in the full dataset, the average duration that each nucleobase stays in the nanopore was computed separately for each file (of approximately 4 000 reads each), as

Ê_d = Σ_{r∈reads} length(signal_r) / Σ_{r∈reads} length(reference_r)        (B.1)

where reads denotes all reads in the file, and signal_r and reference_r denote the signal and reference of read r in the file, respectively. A plot showing the expected duration for each file is provided in Figure B.1. Since the expected durations are plotted according to the acquisition order of the files, it is immediately apparent that the expected duration increased over time. An increasing expected duration indicates that the mean translocation speed of the DNA strands decreased over time, since an average nucleobase staying longer in the nanopore necessarily implies that the DNA strand moved more slowly. Since the ED-HMM base caller assumes that durations are Poisson distributed with constant parameter λ, it can only successfully model data where the expected duration over a large number of reads is constant. This is an inherent limitation of the model, and finding a way to model the translocation speed of the DNA strands more flexibly is an ongoing research problem. The current modeling of the translocation speed limits usage of the full dataset for investigating the performance of preprocessing, since the performance of the model will be largely inconsistent across files.
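A sketch of how the per-file estimate in (B.1) could be computed is given below; the assumption that each read is available as a (signal, reference) pair is a hypothetical data layout, not the actual file format.

```python
def estimate_expected_duration(reads):
    """Empirical expected duration as in (B.1) for one file of reads.

    `reads` is assumed to be an iterable of (signal, reference) pairs, where
    `signal` holds the raw current samples and `reference` the aligned bases.
    """
    total_signal = sum(len(signal) for signal, _ in reads)
    total_reference = sum(len(reference) for _, reference in reads)
    return total_signal / total_reference
```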


Figure B.1: Expected duration of a nucleobase in the nanopore for each file, plotted in the acquisition order of the files.

For instance, a model trained on the first few files will not perform well on the last few files, which have a vastly different expected duration, but will perform well on other early files. If, however, only the first 10 files are used for training and evaluation, the expected duration is roughly constant across the files, and performance will be reasonably consistent.

Appendix C

Supplemental results

As mentioned in the beginning of section 3.2, the base caller software allows generation of histograms showing the identity rate (and/or insertion, deletion, and substitution rate) on a per-read basis. This gives a more fine-grained perspective on the performance on the test set. Examples of histograms for the identity rate, insertion rate, deletion rate, and substitution rate can be found in Figures C.1, C.2, C.3, and C.4, respectively. In these examples, signals were normalized using z-score normalization with window size w = 32768 and then uniformly quantized into N_q = 101 bins.
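For reference, a histogram of the kind shown in Figure C.1 can be produced with a few lines of matplotlib. The identity rates below are synthetic stand-ins drawn from a Beta distribution, since the real per-read values are produced by the base caller software.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic per-read identity rates, used only as a stand-in for real output.
rng = np.random.default_rng(0)
identity_rates = rng.beta(20, 5, size=4000)

plt.hist(identity_rates, bins=50, range=(0.0, 1.0))
plt.xlabel("Identity rate")
plt.ylabel("Number of reads")
plt.title("Per-read identity rates on the test set")
plt.show()
```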

Figure C.1: Histogram of identity rates on the test set.


Figure C.2: Histogram of insertion rates on the test set.

Figure C.3: Histogram of deletion rates on the test set.

Figure C.4: Histogram of substitution rates on the test set.

Appendix D

Further experiments with standard deviation feature

In order to investigate the unexpected results with the 7-mer model when adding the standard deviation feature, as described in section 3.2.3, some further experiments were made. With a configuration otherwise identical to that previously described, further training iterations were run with the same models. Table D.1 shows the resulting performance after training for an additional 2 or 4 soft iterations. Table D.2 shows the resulting performance after training for an additional 2 or 4 soft iterations, followed by one additional hard iteration. We see that the models with a higher number of quantization bins improve moderately from additional soft iterations, but that results otherwise worsen from further training. We argue that one can therefore safely conclude that the unexpected drop in performance when adding the standard deviation is not caused by poor convergence. One additional experiment was made where the number of quantization bins was changed to N_{q,2} = 11, while keeping the rest of the configuration the same. The resulting performance from this experiment is shown in Table D.3. Compared to the results of Table D.1, the performance is slightly worse than with N_{q,2} = 21. Therefore, we can conclude that the drop in performance when adding the standard deviation is likely not caused by the number of quantization bins being too large.


method                                 identity   insertion   deletion   substitutions
20 + 5 iterations, stddev, 21 bins     80.91%     7.02%       5.58%      6.49%
20 + 5 iterations, stddev, 51 bins     81.07%     7.49%       5.10%      6.35%
20 + 5 iterations, stddev, 101 bins    80.80%     7.92%       4.88%      6.41%

22 + 5 iterations, stddev, 21 bins     80.89%     7.21%       5.51%      6.39%
22 + 5 iterations, stddev, 51 bins     81.12%     7.62%       5.04%      6.23%
22 + 5 iterations, stddev, 101 bins    80.98%     7.94%       4.83%      6.24%

24 + 5 iterations, stddev, 51 bins     80.96%     7.75%       5.02%      6.27%
24 + 5 iterations, stddev, 101 bins    80.83%     8.07%       4.82%      6.28%

Table D.1: Resulting performance of the 7-mer model on the test set when adding the standard deviation feature and training with more soft iterations.

method                                 identity   insertion   deletion   substitutions
22 + 6 iterations, stddev, 21 bins     80.33%     6.98%       6.05%      6.64%
24 + 6 iterations, stddev, 51 bins     80.48%     7.48%       5.52%      6.52%
24 + 6 iterations, stddev, 101 bins    80.26%     7.86%       5.31%      6.58%

Table D.2: Resulting performance of the 7-mer model on the test set when adding the standard deviation feature and training with both more soft iterations and more hard iterations.

method                                 identity   insertion   deletion   substitutions
20 + 5 iterations, stddev, 11 bins     80.21%     6.90%       6.08%      6.80%

Table D.3: Resulting performance of the 7-mer model on the test set when adding the standard deviation feature and lowering the number of quantization bins to N_{q,2} = 11.