FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

DEVELOPING SRSF SHAPE ANALYSIS TECHNIQUES

FOR

APPLICATIONS IN NEUROSCIENCE AND GENOMICS

By

SERGIUSZ WESOLOWSKI

A Dissertation submitted to the Department of Mathematics in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2017

Copyright © 2017 Sergiusz Wesolowski. All Rights Reserved.

Sergiusz Wesolowski defended this dissertation on October 30, 2017. The members of the supervisory committee were:

Wei Wu Professor Co-Directing Dissertation

Richard Bertram Professor Co-Directing Dissertation

Anuj Srivastava University Representative

Peter Beerli Committee Member

Washington Mio Committee Member

Giray Okten Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the dissertation has been approved in accordance with university requirements.

ACKNOWLEDGMENTS

First and foremost, I would like to express my deepest gratitude to my advisers, Dr. Richard Bertram and Dr. Wei Wu, for their continuous mentoring and overseeing of my progress. I would like to thank the Center of Genomics and Personalized Medicine and its director, Dr. Daniel Vera, for continuous support, collaboration, feedback and for providing the genomic data used in parts of this work. I would also like to thank Dr. David Gilbert for consultations and interesting ideas for applications of the developed framework. I would like to acknowledge the support, enthusiasm and encouragement of the FSU Shape Analysis Group, with special thanks to Dr. Anuj Srivastava and Dr. Derek Tucker for maintaining the “fdasrvf” Github repository. I would also like to thank my colleagues, professors and the administration of the Mathematics department. I would like to express special thanks to Eva, the janitor on my floor, who always had a smile for me. I would like to mention Jorge Martinez, an undergraduate mathematics student at FSU, who through his collaboration reminded me of my own career path. Last, but not least, I would like to thank Sepideh Ebadi, my better half, for the motivation and the support she has given me in every struggle. Thanks to her I pursued further, investigated deeper and accomplished more.

TABLE OF CONTENTS

List of Figures
List of Theorems
Abstract

1 Introduction
1.1 Stochastic Point Processes and Neural Spike Trains
1.2 Functional Data Analysis and Genomics
1.2.1 Exon Level Gene Differential Expression
1.2.2 Changes in Nucleosomal DNA Positioning and Gene Expression
1.3 Theoretical Background: Statistics in Function Spaces with the SRSF Framework

2 A New Framework for Euclidean Summary Statistics in the Neural Spike Train Space
2.1 Introduction
2.2 Methods
2.2.1 GVP Metric
2.2.2 Definition of the Summary Statistics and Their Properties
2.2.3 Computation of the Mean Spike Train
2.2.4 Advantages Over Previous Methods of Averaging Spike Trains
2.3 Results
2.3.1 Noise Removal Method
2.3.2 Result for Simulated Data
2.3.3 Result in Real Data in Gustatory System
2.4 Discussion

3 SRSF Shape Analysis for Sequencing Data Reveals New Differentiating Patterns
3.1 Introduction
3.2 Methods
3.2.1 Functional ANOVA for Read Densities
3.2.2 Pre-processing of the Raw Data
3.2.3 SRSFseq: Base Model
3.2.4 SRSFseq: Noise Removal (Shape and Energy Preserving Alignment)
3.3 Results
3.3.1 RNAseq Expression Analysis with Base, Shape and Energy Models
3.3.2 Misalignment as Differences in Activity Patterns
3.4 Discussion

4 How Changes in Shape of Nucleosomal DNA Near TSS Influence Changes of Gene Expression
4.1 Introduction
4.2 Methods
4.2.1 Experimental Description
4.2.2 Mathematical Model
4.2.3 Algorithm Description
4.3 Results
4.4 Discussion

5 Theoretical Developments for the SRSF Framework
5.1 Introduction
5.2 Theoretical Results
5.2.1 Robustness in the Space F
5.2.2 Robustness in the SRSF Space
5.3 Discussion
5.3.1 Towards the ANOVA Test in the SRSF Space
5.4 Auxiliary Lemmas, Proofs and Definitions
5.4.1 General Purpose Lemmas
5.4.2 Robustness in the Space F
5.4.3 Robustness in the SRSF Space

6 Summary and Discussion

Bibliography
Biographical Sketch

LIST OF FIGURES

1.1 Obtaining spike trains from the neural voltage traces. Panel A: An example of obtaining a spike train from a single voltage trace by recording the spike timing. Panel B: A simulated dataset of thirty spike trains. Figure in panel A is adapted and modified from [51].

1.2 Next Generation Sequencing (NGS) workflow schematic with highlighted possible variability biases. Figure adapted and modified from [68].

1.3 Obtaining the point pattern data from the mapped reads. The leftmost coordinate of each read on the reference genome is reported.

1.4 Amplitude and phase variability in a set of curves. Each of the simulated curves has a similar shape feature. The variability in the shape feature consists of the height and the location, which correspond to the amplitude and the phase variability components. Figure adapted and modified from [54].

2.1 A: 30 spike trains generated from a homogeneous Poisson process. Each vertical line indicates a spike. B: Estimation results when $\lambda^2 = 6$. Upper panel: the sum of squared distances (SSD) over all iterations. Lower panel: the estimated mean spike train over all iterations. The initial estimate is the spike train in the top row (0), and the final estimate is the spike train in the bottom row (12th). C: Estimation result when $\lambda^2 = 60$.

2.2 Averaged spike trains according to four different methods.

2.3 Scheme differentiating the noise removal approach from standard inference on spike train data. Dashed boxes indicate the components of the standard inference framework; the solid lines indicate where the noise removal framework is introduced.

2.4 Illustration of the noise addition and the noise removal with the use of the $\oplus$, $\ominus$ operations. A: Background noise - 40 spike trains generated from HPP(10); the mean background noise is presented with dashed lines in the bottom row. B: $2 \times 20$ spike trains from IPP($\rho_X$) (asterisks) and IPP($\rho_Y$), respectively. C: Sum of the spike trains from A and B. D: Spike trains after the background noise is removed.

2.5 The noise removal influence on classification performance with respect to increasing noise level $\alpha$. A: the classification performance for the noisy data. The bold lines represent the average classification score among 50 simulations; the dotted lines indicate the standard deviation from the average classification score. B: Same as A, but for the noise-removed data. C: mean classification score curves from A (dashed line) and B (dotted line).

2.6 An example of spike trains from Cell 10. Each group of 3 or 4 rows corresponds to a different type of stimulus applied. A: The 5-second pre-stimulus spike trains, whose mean spike train, calculated by the MAPC algorithm, is shown by the thick vertical bars at the top of the panel. B: The 5-second stimulus period. C: the same 5-second period of spike trains as in B, but with spontaneous activity subtracted out.

2.7 The result of the noise removal procedure applied to each of the 21 recorded cells. The marker coding is the same for both panels and indicates the influence of the noise removal approach on the classification score: black circles - increase, grey diamonds - decrease, black asterisks - unchanged. A: Raw classification scores for each cell in each condition. B: the same result as in A, but in terms of classification score increase with respect to mean noise size. The vertical black line corresponds to the noise size cutoff of 10 spikes.

3.1 A simulated example of six samples of filtered density functions coming from $k = 2$ different conditions with significantly different underlying true density functions $\mu_{red}$, $\mu_{black}$.

3.2 Six intensities generated from two conditions, red and black, with significantly different true base density functions $\mu_{red}$, $\mu_{black}$. A) The unaligned raw intensities. B) The same density functions after the phase noise removal procedure.

3.3 Six intensities generated from two conditions, red and black, with the same underlying true base density functions $\mu_{red} = \mu_{black}$. A) The unaligned raw intensities. B) The same density functions after the shape noise removal procedure.

3.4 The heat-maps of the overlaps between lists of genes called differentially expressed by SRSFseq models (Base, Shape, Energy) and count-based methods, using the significance level (A) $\alpha = 0.05$, (B) $\alpha = 0.01$.

3.5 (A) Example of an exonic region called differentially expressed by all three SRSFseq models, but not detected by any of the count-based methods. Two conditions: control (black), HOXA1 KO (red). Top panel: the point patterns over the reference genome obtained by mapping the first bp of each read; middle panel: filtered density functions; third panel: aligned density functions according to the shape-preserving model; bottom panel: aligned density functions according to the energy-preserving model. The p-values reported by other methods for the whole gene are: Cufflinks: 0.229, DESeq2: 0.908, Limma-voom: 0.983. (B) A similar example, but for convenience we provide the UCSC genome browser screenshot for the region on top of the main figure. The comparison with the genome browser indicates that the new differential patterns detected by SRSFseq can be explained by the current knowledge about gene location. The p-values reported by other methods: Cufflinks: 0.077, DESeq2: not reported, Limma-voom: not reported.

3.6 (A) Example of an exonic region called differentially expressed at the significance level of $\alpha = 0.01$ only after the shape noise is removed. Two conditions: control (black), HOXA1 KO (red). Top panel: the point patterns over the reference genome obtained by mapping the first base pair of each read. Middle panel: filtered density functions. Bottom panel: aligned density functions. (B) Example of an exon region that was called differentially expressed by the base model, but lost significance after applying the shape-noise removal procedure. The sum of square distances was inflated due to the noise. Two conditions: control (black), HOXA1 KO (red). Top panel: the point patterns over the reference genome obtained by mapping the first base pair of each read. Middle panel: the observed filtered density functions. Bottom panel: the aligned density functions.

3.7 (A) Energy-preserving noise removal improves the detection of differences compared to the shape-preserving alignment and the base model, by capturing a false positive. (B) Energy-preserving noise removal improves the detection of differences compared to the shape-preserving alignment and the base model, by capturing a false negative. (C) Energy-preserving noise removal causes a loss of information and fails to detect a significant difference between expression patterns. This difference is successfully captured by the shape-preserving alignment and the base model.

4.1 DNA shift on the nucleosome as presented in [63]. The figure describes three different setups that can occur in DNA positioning around the nucleosome. The first row shows DNA wrapped around the nucleosome tightly or forming a loop (shift). The last panel shows how the “shift” in DNA can allow Transcription Factors (TF) to bind and initiate transcription. The middle row shows DNA behavior in the regular, not protected, state. Figure adapted and modified from [63].

4.2 Illustration of the two-step alignment algorithm. Panel 1: Gathering short read coordinates and mapping them to the reference genome. Each row represents a different sample. There are two samples per two conditions (red, black). Each dot represents a read position on the reference genome. Panel 2: Estimating the densities for each sample. Mathematical model: the observed i-th sample in the j-th condition is represented by $\mu_{ij} = \mu_j \circ \gamma_{ij} + \epsilon_{ij}$, where $\gamma_{ij}$ is a random diffeomorphism representing the compositional noise and $\mu_j$ is the true, unknown nucleosomal DNA shape specific to the j-th condition. Panel 3 (part I): First-step alignment, estimating $\hat\mu_j$ using the SRSF. Solid black and red curves are the SRSF-aligned condition-specific averages. Mathematical model: $\mu_j = \mu \circ \gamma_j + \epsilon_j$, where $\gamma_j$ is a condition-specific change of shape. Panel 3 (part II): Second-step alignment, estimating $\gamma_j$ using the SRSF framework to find the optimal $\hat\gamma_j$. The dashed red curve represents the result of the alignment between the black and the red conditions. Panel 4: Measuring the change of shape between the red and the black: $\gamma_{red}$ vs $\gamma_{black}$. The utilized test statistic is the net area between the red and the black curves: $\int \left(1 - \frac{d}{dt}\gamma_{red}(\gamma_{black}^{-1}(t))\right)^2 dt$.

4.3 The figure describes the effect of a nucleosome shift increasing the chance of differential gene expression. The x-axis indicates what percentage of the largest observed shift is detected. The y-axis reflects the proportion of genes that are called differentially expressed with at least x% of shift detected to all genes that have at least x% shift. The shift amount is calculated on the DNA region corresponding to the nucleosome located directly before the gene. On any significance level (0.1, 0.05, 0.01) the proportion of DE genes increases as the magnitude of the shift increases.

4.4 Example of two TSS regions with captured large nucleosome rearrangements according to the two-step alignment algorithm. The rearrangement was quantified using the truncated phase distance. The two-step alignment algorithm was applied to the curves specified on the whole TSS domain. The statistic, reflecting the change in nucleosome positioning, was calculated only between the dashed vertical lines. For a detailed description of each panel, we refer the reader to Figure 4.2.

LIST OF THEOREMS

1 Definition (Amplitude of a curve)
2 Definition (Γ group action on F)
3 Definition (Fisher-Rao Riemannian metric)
4 Definition (Amplitude distance)
5 Definition (Phase distance)
1 Remark (Existence of optimal warping)

1 Algorithm (Matching-Adjusting-Pruning-Checking (MAPC) Algorithm)
6 Definition (Poisson point process)
1 Corollary (Poisson point process characterization)
1 Theorem (Convergence of the MAPC algorithm)

2 Algorithm (Two-step alignment algorithm)

7 Definition (The SRSF model)
8 Definition (SRSF mean estimation algorithm)
9 Definition (Γ_L)
1 Lemma (Robustness of the mean orbits estimation)
2 Corollary (Robustness of the mean orbit estimation)
2 Lemma (Robustness of the center of the orbit)
3 Lemma (Robustness of optimal warping γ*)
2 Theorem (Robustness of the aligned functions in the space F)
3 Theorem (Robustness of optimal warping γ* in SRSF space)
4 Lemma (Trace of the covariance operator is invariant under warping)
5 Lemma (Eigenvalues of the covariance operator are invariant under warping)
6 Lemma (Square norm inequality)
7 Lemma (Norm equivalence of the warping functions)
8 Lemma (Invariance under simultaneous warping)
9 Lemma (Convergence of the inverse warping functions)
10 Lemma (Rate of convergence for argmin_γ)
11 Lemma (L2, γ-continuity of the elastic metric)
12 Lemma (Robustness of orbit of q̄)
13 Lemma (Robustness of the center of the orbit, q̄ = (q + ε̄, γ0))
3 Corollary (Functional LLN convergence)
14 Lemma (Robustness of KM of warping functions γ̂n)
15 Lemma (Robustness of optimal warping γ*)
16 Lemma (γ̇ convergence)
4 Corollary (γ̇ convergence rate)
17 Lemma (γ̇⁻¹ convergence rate)
5 Corollary (Commutative equivalence of warpings in SRSF space)
18 Lemma (SRSF convergence of γ̄n implies L2 convergence)

ABSTRACT

This dissertation focuses on exploring the capabilities of the SRSF statistical shape analysis framework through various applications. Each application gives rise to a specific mathematical shape analysis model. The theoretical investigation of the models, driven by real data problems, gives rise to new tools and theorems necessary to conduct sound inference in the space of shapes. From a theoretical standpoint, robustness results are provided for model parameter estimation, and an ANOVA-like statistical testing procedure is discussed. The projects were a result of the collaboration between theoretical and application-focused research groups: the Shape Analysis Group at the Department of Statistics at Florida State University, the Center of Genomics and Personalized Medicine at FSU, and FSU's Program in Neuroscience. As a consequence, each of the projects consists of two aspects - the theoretical investigation of the mathematical model and the application driven by a real-life problem. The applications' components are similar from a data modeling standpoint. In each case the problem is set in an infinite-dimensional space, elements of which are experimental data points that can be viewed as shapes. The three projects are:

• “A new framework for Euclidean summary statistics in the neural spike train space”. The project provides a statistical framework for analyzing spike train data and a new noise removal procedure for neural spike trains. This framework adapts the SRSF elastic metric in the space of point patterns to provide a new notion of distance.

• “SRSF shape analysis for sequencing data reveals new differentiating patterns”. This project uses the shape interpretation of Next Generation Sequencing data to provide a new point of view on exon-level gene activity. The novel approach reveals new differential gene behavior that cannot be captured by the state-of-the-art techniques. The program code is available online in the Github repository.

• “How changes in shape of nucleosomal DNA near TSS influence changes of gene expression”. The result of this work is a novel shape analysis model explaining the relation between the change of the DNA arrangement on nucleosomes and the change in differential gene expression.

CHAPTER 1

INTRODUCTION

In this chapter, for each of the projects, we introduce the necessary theoretical foundations, the data structure description and the mathematical models that connect them. The models are based on adaptations of the Square Root Slope Function shape analysis framework (see [55]). The designs allow us to draw new conclusions about problems in the areas of neuroscience, gene expression, gene regulation and DNA structure. Each application considered in this work has a separate introduction section intended to familiarize the reader with the biological concepts needed to understand the problem and the impact of the results. In addition, in a separate chapter we provide new theoretical results for the generative shape analysis models used in the genomic applications.

1.1 Stochastic Point Processes and Neural Spike Trains

Neural spike trains are the axon voltage traces reflecting neural activity. A voltage spike, also called an electrical impulse, is the fundamental unit of information in the brain (see Figure 1.1A). Neural spike trains are often called the language of the brain and are the focus of many investigations in computational neuroscience. Statistical analysis and inference on spike trains is one of the central topics in neural coding. It is of great interest to understand the underlying structure of given neural data and capture basic summary properties like the notion of a mean or a variance (see Figure 1.1B). The challenge is that the spike trains vary in the number of spikes and their distribution across the domain. The consequence is that the space of point patterns representing the spike timings does not have a vector space structure and lacks certain Euclidean properties. In particular, a notion of an addition operation between spike trains is not well specified. That means that basic summary statistics, such as a mean, cannot be calculated as a typical summation divided by the total count of spike trains. To tackle this problem, we can use the notion of the Karcher mean (see [21]), which allows us to define a mean as long as the considered space is equipped with a metric. We do not need a vector space structure.


Figure 1.1: Obtaining spike trains from the neural voltage traces. Panel A: An example of obtaining a spike train from a single voltage trace by recording the spike timing. Panel B: A simulated dataset of thirty spike trains. Figure in panel A is adapted and modified from [51].

The mean is defined as the minimizer of a sum of squared distances, where the distance is calculated according to the specified metric. The formal definition is provided in Chapter 2. Based on various metric distances between spike trains, recent investigations have introduced the notion of an average or prototype spike train to characterize the template pattern in neural activity. However, as those metric spaces lack certain Euclidean and vector space properties, these averages are non-unique and do not have the conventional properties of a mean. This makes the inference statistically inconsistent and hard to model from a mathematical standpoint. In this project, we propose a new framework to define the mean spike train. To do that, we interpret the arising spike trains as realizations of a Poisson stochastic point process and adopt a metric from an $L^p$ family. We have chosen the Poisson point process model because its assumptions coincide with common-sense expectations of the spike train data structure. The Poisson point process with intensity function $\rho$ models the spike train timings as randomly generated point clouds on the interval representing the time span of occurrence of the spikes. The randomness is modeled in two ways:

1. For any sub-interval of the time domain that the spike train is defined on, the number of spike events in that domain follows a Poisson distribution with a mean parameter defined as a measure of the interval. The measure of the sub-interval $B$ is defined as

$$\mu(B) = \int_B \rho(x)\,dx.$$

2. Conditioned on the number of spiking events, the spike timings (over a specified interval $B$) are distributed according to a density function $f$ that is proportional to the intensity function $\rho$:
$$f(x) = \frac{\rho(x)}{\int_B \rho(x)\,dx}.$$

An interesting property of the Poisson point process, which is particularly useful from the simulation point of view, is that spike trains on disjoint time domains are independent (see the simulation sketch below). The formal definition of the Poisson point process and its properties are provided in Chapter 2. On Poisson point process simulations, we demonstrate that the new mean spike train properly represents the average pattern in the conventional fashion, and can be effectively computed using a theoretically-proven convergent procedure. We compare this mean with other spike train averages and demonstrate its superiority. Furthermore, we apply the new framework to a recording from the rodent geniculate ganglion, where background firing activity is a common issue. We show that the proposed mean spike train can be utilized to remove the background noise and improve decoding performance. We also provide an estimation procedure for the new mean spike train and prove its convergence.
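To make the two-part definition above concrete, here is a minimal simulation sketch in Python using the standard thinning (rejection) construction; the function names and the rate bound `rho_max` are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_poisson_spike_train(rho, rho_max, T=1.0):
    """Draw one spike train on [0, T] from a Poisson point process with
    intensity function rho, via thinning.  rho_max bounds rho on [0, T]."""
    # Candidate spikes from a homogeneous process with rate rho_max:
    # the count is Poisson(rho_max * T), the timings are uniform on [0, T].
    n = rng.poisson(rho_max * T)
    candidates = np.sort(rng.uniform(0.0, T, size=n))
    # Keep each candidate t with probability rho(t) / rho_max; the retained
    # points follow the inhomogeneous process with intensity rho.
    keep = rng.uniform(size=n) < np.array([rho(t) for t in candidates]) / rho_max
    return candidates[keep]

# Homogeneous example: rate 8 spikes/sec on a one-second interval.
train = simulate_poisson_spike_train(lambda t: 8.0, rho_max=8.0)
```

Because counts on disjoint sub-intervals are independent, trains on separate time windows can be simulated separately and concatenated.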

1.2 Functional Data Analysis and Genomics

To understand how functional data analysis can be used in genomics, we have to explain the process by which the data is obtained. One of the bottlenecks in this process was how to reliably and cost-efficiently uncover the order of nucleotides of the investigated genetic material (determine which of the four nitrogenous bases (adenine (A), guanine (G), thymine (T), cytosine (C)) appear in the genetic material and in what order). This step is called the sequencing step. A milestone in this area is the development of the sequencing technology referred to as Next Generation Sequencing (NGS) (also called shotgun sequencing, high-throughput sequencing or massively parallel sequencing) [38, 41, 39, 44]. NGS gives researchers a powerful tool to take a closer look at DNA and RNA patterning. The technology, popularized in 2007, is gradually being improved, and new applications are being developed. It has not only increased the reliability of sequencing results [39] but also significantly decreased the costs of the experiment. This resulted in a rapid expansion of sequencing-based scientific projects, expansion of the available database of experimental results and a rapid development of tools to analyze genomic phenomena,

such as gene expression [1], alternative splicing [60, 1], copy number aberrations [53], and gene regulation. Sequencing-based methods rely on inferences from the abundance and distribution of the DNA objects called “short reads”, which are derived from the Illumina sequencing platform (an industry-standard sequencing technique). The usual experimental pipeline follows the steps listed in Figure 1.2. The short reads are pieces of DNA or RNA that are extracted from the biological sample and processed in several steps. They are obtained by chopping the collected genetic material into short fragments which are then “amplified”. The amplification procedure works like real-life bootstrapping, meaning the genetic material is copied several times. This is done to ensure that no active parts of the genetic material are omitted in the following sequencing step. After the sequencing step, short reads are mapped to a reference standard genome in order to give context to the observed genetic activity (using, for example, the Bowtie2 [28] or the BWA [32] software packages). By recording the positions and counts of the mapped reads (see Figure 1.3), one can get insight into the genetic activity of the genomic region of interest. Each experimental step listed has an influence on the qualitative features of the data. Drawing sound inferences from such experiments relies on appropriate mathematical methods to model the distribution of reads along the genome.

Figure 1.2: Next Generation Sequencing (NGS) workflow schematic with highlighted possible variability biases. Figure adapted and modified from [68].

The genomic feature is usually modeled with random variables, which reflect the number of mapped reads originating from the genomic region of interest. This is done by assuming discrete distributions for the response variables in the model (e.g., Poisson, Poisson-gamma, binomial, binomial-beta). Our aim is to quantify not only the differences in the intensity of the mapped reads but also the differences in the position of the mapped reads in the genomic locus of interest. To do this, we interpret the point patterns resulting from the NGS experiment as a realization of a stochastic point process.

Figure 1.3: Obtaining the point pattern data from the mapped reads. The leftmost coordinate of each read on the reference genome is reported.

We assume that the experimental results follow a non-homogeneous Poisson point process [43]. That is, the positions of mapped reads on the genome region of interest are the realization of a non-homogeneous Poisson point process. Under this assumption, we estimate the intensity functions of the underlying point patterns. The intensity functions over a specified domain on the genome are treated as shapes. This interpretation allows us to proceed with Square Root Slope Function shape analysis (see [55]) modeling for genomic data. In the following chapters, we show that the SRSF interpretation uncovers previously unobserved genomic features.
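As an illustration of this step, the sketch below estimates an intensity function from mapped read positions by Gaussian kernel smoothing. This is only one reasonable smoothing choice; the bandwidth value and function names are assumptions made for the example, not the dissertation's fixed pipeline.

```python
import numpy as np

def read_intensity(read_starts, region_start, region_end,
                   bandwidth=50.0, grid_size=500):
    """Kernel estimate of the Poisson intensity from leftmost read coordinates.
    Returns an evaluation grid and the estimated intensity on it; the curve
    integrates to (approximately) the number of reads in the region."""
    grid = np.linspace(region_start, region_end, grid_size)
    reads = np.asarray(read_starts, dtype=float)
    # One Gaussian bump per read, summed over all reads.
    z = (grid[:, None] - reads[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return grid, bumps.sum(axis=1)

grid, rho_hat = read_intensity([105, 130, 131, 220, 480], 0, 1000)
```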

1.2.1 Exon Level Gene Differential Expression

Genes are referred to as the coding fragments of the DNA. In a quick sketch, a gene is constructed as a sequence of blocks, starting with a Transcription Starting Site (TSS) and followed by a series of blocks called exons, the essential coding sequences. The exon blocks are separated by introns, the non-coding sequences. One of the reasons for great scientific interest in differential gene expression patterns is that it helps in understanding gene regulation and cell behavior in general. Due to the rapid development of sequencing tools, and especially the NGS methods, more detailed questions can be asked. If equipped with proper mathematical tools, we could give some insight into those questions. To quantify the activity of a gene, or a difference in activity between different conditions, the notions of gene expression and gene differential expression were introduced, which, for NGS, are derived from the number of short reads mapped to the gene locus. The underlying assumption is that the regions of the genome that are more active (are more expressed) have proportionally more reads mapped. If the regions are selected to correspond to genomic loci, we obtain information about gene activity (gene expression).

The increased reliability and the design of the NGS experiment allow for a more sophisticated mathematical framework which takes into account not only the intensity of expression but also the position of particular reads aligned to the genome region. The idea is motivated by events of alternative splicing [67] and by drugs blocking selected exon regions. In both cases, one can experimentally obtain genes with the same expression intensity, but with drastically different spatial patterns of the mapped short reads. Quantifying the difference between those is crucial for understanding gene regulation. Motivated by that, we focus on the additional information contained in the positioning of the mappings and on the underlying patterns in the distributions of the reads. In Figure 1.2 this refers to mathematically reinterpreting the steps corresponding to the last two arrows. Following the short read interpretation from the beginning of this section, we encode the positioning information as shapes of short read density curves over gene loci.

1.2.2 Changes in Nucleosomal DNA Positioning and Gene Expression

To fully understand the gene network of interactions, besides gene expression we need to consider various phenomena that regulate gene activity. One of those features is the spatial organization of the DNA that contains the gene and how it is wrapped around histones to form nucleosomes. Understanding how nucleosomes are distributed relative to DNA can advance the field of genomics and explain the mechanisms of gene regulation. Despite recent advances in sequencing technology and experimental design, a full comprehension of how nucleosomes are arranged in DNA remains elusive. Unlocking the full potential of second- and third-generation sequencing experiments, and drawing sound inferences from such experiments, relies on appropriate mathematical methods that can capture the complexity of the structure of the underlying biological processes. It is known that the DNA packed around nucleosomes enters a “protected” state in which it is difficult for other compounds, including, for example, transcription factors, to access the DNA. As a consequence, if a gene is located in the vicinity of a nucleosome, its expression might be altered. Some recent advances managed to capture certain features of nucleosomes [9], [42], [13]. Nevertheless, a full comprehension of how nucleosomes are arranged in DNA has not yet been reached. Numerous Next Generation Sequencing techniques are used to detect the DNA-nucleosome arrangements. In our case we are using the MNase-seq protocol, in which the DNA is digested using the micrococcal nuclease enzyme, leaving only DNA fragments that were protected by nucleosomes.

The DNA material undergoes the standard sequencing procedure mentioned at the beginning of this section. The final product is the so-called library of short reads mapped to a reference genome. This time the short reads originate from nucleosome-protected regions, rather than from active genomic regions. There are many algorithms to detect the enrichment of reads in NGS experiments, and also several algorithms to detect changes in the enrichment of reads among different experimental conditions. These approaches, however, are largely limited to gross changes in read enrichment, and generally fail to detect small but significant shifts in the locations of histone-DNA interactions that may be biologically relevant. This is especially true for changes in nucleosome arrangement that occur in the vicinity of Transcription Starting Sites. Shifts in the relative positions of the DNA may affect the access of transcription factors and, in turn, alter the gene activity pattern. To address this analytical gap, we describe a new statistical shape analysis framework based on Square Root Slope Functions. The new model redefines experimental results as shapes over a reference genome. The shape interpretation allows us to establish the connection between changes of the DNA arrangement around the nucleosome and changes in the neighboring gene's expression. The model explains how nucleosome shape and position can regulate gene activity.

1.3 Theoretical Background: Statistics in Function Spaces with the SRSF Framework

In current research in any area that deals with data quantification, statistics and mathematics play key roles in interpreting the data. The designed tools help with the analysis, hypothesis testing and drawing conclusions. Such tools are particularly useful when the data set is large. Functional data analysis (FDA) is a branch of statistics that allows one to access complicated and expanded data structures that come in the form of infinite-dimensional objects (e.g., curves, shapes or surfaces). It allows a completely new way of approaching such problems and thus has the potential to uncover new patterns in the data. In this work, we utilize the FDA approach to show new results in data-driven problems in neuroscience and genomics. A detailed description of the state-of-the-art shape analysis techniques can be found in [55]. We focus on a particular method, which, in our opinion, is the most suitable for the questions that we

address. This method relies on the “SRSF transformation” for real-valued, absolutely continuous functions defined over a fixed unit interval. We denote the space of those functions as $\mathcal{F}$.

SRSF stands for Square Root Slope Function, and the related transformation is what the name indicates. Given a function $\mu \in \mathcal{F}$, the SRSF transformation, denoted by $q$, is given by:
$$SRSF(\mu) = q := \operatorname{sgn}(\dot\mu)\sqrt{|\dot\mu|} = \frac{\dot\mu}{\sqrt{|\dot\mu|}}. \qquad (1.1)$$
The simplicity of the transformation begs the question of how it can make an impact on analysis techniques and be mentioned in the main scope of this work. To answer this, we take a step back to review some properties of shapes and shape analysis, following the book [55]. To perform statistical analysis on shapes of real functions, we need a notion of a distance in the space of shapes. The distance that we use has two modes of variability: amplitude and phase. Those two components are visualized in Figure 1.4. The amplitude of a function $\mu \in \mathcal{F}$ is an equivalence class, defined as follows:

Definition 1 (Amplitude of a curve).

$$[\mu] = \{\mu \circ \gamma \mid \gamma \in \Gamma\}, \qquad (1.2)$$
where $\gamma$ is an orientation-preserving diffeomorphism of the unit interval and $\Gamma$ is the space of all such diffeomorphisms.
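Before continuing with the group structure on Γ, here is a minimal numerical sketch of the SRSF map in Eq. (1.1) for functions sampled on a grid; the finite-difference derivative via `np.gradient` is an implementation assumption, not part of the original definition.

```python
import numpy as np

def srsf(mu, t):
    """SRSF transform of Eq. (1.1): q = sign(mu') * sqrt(|mu'|),
    computed for a function mu sampled on the grid t."""
    dmu = np.gradient(mu, t)               # finite-difference slope
    return np.sign(dmu) * np.sqrt(np.abs(dmu))

t = np.linspace(0.0, 1.0, 200)
mu = np.exp(-(t - 0.5)**2 / 0.01)          # a smooth bump on [0, 1]
q = srsf(mu, t)
```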

The space $\Gamma$ is also a group, with the operation of function composition and the identity element

being the identity function, denoted by $\gamma_{id}$. We impose a common-sense requirement on the metric: it should be invariant under the action of the group $\Gamma$. The action of the group $\Gamma$ on the space $\mathcal{F}$ is by right-composition.

Definition 2 ($\Gamma$ group action on $\mathcal{F}$). An action of a group $G$ on a space $H$ is defined as a pair $(g, h)$ with $g \in G$ and $h \in H$, such that $(g, h) \in H$ and $(e, h) = h$, where $e$ denotes the identity element of the group $G$. The action of the group $\Gamma$ on the space of functions $\mathcal{F}$ is defined as:
$$(\gamma, f) = f \circ \gamma. \qquad (1.3)$$
Upon the SRSF transformation, the action translates into:

$$(\gamma, q) = (q \circ \gamma)\sqrt{\dot\gamma}. \qquad (1.4)$$

Figure 1.4: Amplitude and phase variability in a set of curves. Each of the simulated curves has a similar shape feature. The variability in the shape feature consists of the height and the location, which correspond to the amplitude and the phase variability components. Figure adapted and modified from [54].

Does such a metric exist? Thankfully, the answer is “yes”. Moreover, according to a theorem in [5], if the dimension of the domain is two or higher, there exists only one metric (up to a scalar) that satisfies the desired properties. This metric is the Fisher-Rao generalized Riemannian metric, which is defined as follows:

Definition 3 (Fisher-Rao Riemannian metric). Given a function $f \in \mathcal{F}$ and two vectors tangent to $\mathcal{F}$ at $f$, $v_1, v_2 \in T_f\mathcal{F}$, then
$$\langle v_1, v_2 \rangle = \int_0^1 \dot v_1(t)\,\dot v_2(t)\,\frac{1}{|\dot f(t)|}\,dt \qquad (1.5)$$
is a Riemannian metric on $\mathcal{F}$ and can be used to calculate the length of paths between points in $\mathcal{F}$ (denote such a distance as $dist_{FR}(f_1, f_2)$).

The existence of the metric is not enough for performing the data analysis. As it is a Riemannian metric, it is given as an inner product localized at a point in the tangent space. It does allow one to calculate the lengths of paths in $\mathcal{F}$, but it does not specify the notion of distance between points in $\mathcal{F}$ explicitly. This complicates the calculations for finding the shortest path. Not only does it not

have a closed-form formula, but it also adds computational costs to the already expensive functional data analysis. The power behind the SRSF transformation comes from the fact, noted in [56], that upon the SRSF transformation the Fisher-Rao Riemannian metric becomes the standard $L^2$ metric on the space of SRSF-transformed curves. That is, the distance between two absolutely continuous functions

$f_1, f_2$ is given by:
$$dist^2_{FR}(f_1, f_2) = \|q_1 - q_2\|^2, \qquad (1.6)$$

where $q_1 = SRSF(f_1)$ and $q_2 = SRSF(f_2)$. Then the amplitude distance between curves can be defined, following [56], as:

Definition 4 (Amplitude distance).

$$dist_{amplitude}(f_1, f_2) = \inf_{\gamma} dist_{FR}(f_1, f_2 \circ \gamma) = \inf_{\gamma} \|q_1 - (q_2 \circ \gamma)\sqrt{\dot\gamma}\|. \qquad (1.7)$$

To define the phase distance in a consistent manner, we need to impose a metric on $\Gamma$. The metric also has to be invariant under the actions of the group $\Gamma$ on itself. Thus, again, we use the Fisher-Rao generalized metric and the SRSF transformation, $SRSF(\gamma) = h := \sqrt{\dot\gamma}$. A simple check shows that $\|h\|_2 = 1$. In fact, it can be shown, following [55], that under the SRSF transformation the space $\Gamma$ becomes a Hilbert sphere in $L^2$ ($SRSF(\Gamma) = \{h : \|h\|_2 = 1\}$). Such a geometric structure allows us to explicitly calculate the distances between different warping functions $\gamma_1, \gamma_2 \in \Gamma$. This distance can be calculated as the arc-length of the shorter arc of the great circle connecting the SRSF-transformed warping functions on the Hilbert sphere. We refer to this distance as the intrinsic distance.

$$dist_{intrinsic}(h_1, h_2) := \cos^{-1}(\langle h_1, h_2 \rangle) = \cos^{-1}\!\left(\int_0^1 h_1(t)h_2(t)\,dt\right) = \cos^{-1}\!\left(\int_0^1 \sqrt{\dot\gamma_1(t)\dot\gamma_2(t)}\,dt\right). \qquad (1.8)$$
An alternative way of calculating the distance is by considering the whole $L^2$ space and calculating the standard $L^2$ distance. We refer to this distance as the extrinsic distance.

$$dist_{extrinsic}(h_1, h_2) = \|h_1 - h_2\|_2 = \|\sqrt{\dot\gamma_1} - \sqrt{\dot\gamma_2}\|_2 = \|(1, \gamma_1) - (1, \gamma_2)\|_2. \qquad (1.9)$$
For computational purposes, we choose to use the latter to define the phase distance. Later, in Chapter 5, also for computational purposes, we adopt an extrinsic approach to calculate the Karcher mean (see [21]) of the set of warping functions in $\Gamma$.
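The two distances in Eqs. (1.8) and (1.9) are straightforward to evaluate numerically for discretized warpings; the sketch below, with assumed function names and a trapezoidal quadrature, illustrates both.

```python
import numpy as np

def phase_distances(gamma1, gamma2, t):
    """Intrinsic (Eq. 1.8) and extrinsic (Eq. 1.9) distances between two
    warping functions sampled on a grid t of [0, 1]."""
    h1 = np.sqrt(np.gradient(gamma1, t))   # SRSF of a warping: sqrt of slope
    h2 = np.sqrt(np.gradient(gamma2, t))
    inner = np.trapz(h1 * h2, t)           # <h1, h2> on the Hilbert sphere
    intrinsic = np.arccos(np.clip(inner, -1.0, 1.0))
    extrinsic = np.sqrt(np.trapz((h1 - h2)**2, t))
    return intrinsic, extrinsic

t = np.linspace(0.0, 1.0, 500)
d_in, d_ex = phase_distances(t**2, t, t)   # gamma(t) = t^2 versus identity
```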

Definition 5 (Phase distance). Let $\gamma_*$ be the optimal warping between two functions $f_1, f_2 \in \mathcal{F}$, that is, $\gamma_* = \operatorname{argmin}_\gamma \|q_1 - (q_2 \circ \gamma)\sqrt{\dot\gamma}\|$. Then the phase distance between $f_1$ and $f_2$ is:

$$dist_{phase}(f_1, f_2) = dist_{extrinsic}(1, \sqrt{\dot\gamma_*}) = \|1 - \sqrt{\dot\gamma_*}\|_2. \qquad (1.10)$$

form,γ ∗ can still be approximated numerically. Throughout this work, by a warping function that is a minimizer of an amplitude distance (e.g.γ = argmin q q γ √ ˙γ ), we understand its ∗ γ || 1 − 2 ◦ || discrete approximation.

Equipped with these tools, we begin analyzing a new generative model in the SRSF shape space. The model describes the experimental results that allow different conditions (e.g., treatment and control groups or cancer and healthy cell types) and a different number of data points per condition. The i-th shape in j-th condition follows:

q = (q +� ) γ ˙γ , (1.11) ij j ij ◦ ij ij � whereq j is a deterministic function,γ ij models the compositional noise (random domain warping)

and� ij is a representing the additive noise. One might ask why we do not set up the modeling in the function space prior to SRSF trans- formation. Then the model would look simpler:

µ = (µ +� ) γ (1.12) ij j ij ◦ ij

The reason for that is that, despite being seemingly simple in the design, the model above has to undergo the SRSF transformation in order for us to perform the shape analysis. This transformation

raises the complicated issue of defining the derivative of the additive noise˙� ij. What would be a candidate distribution for such a process? What are its basic variance and mean properties? Due to these complications, we deem the SRSF modeling approach to be more suitable.

11 CHAPTER 2

A NEW FRAMEWORK FOR EUCLIDEAN SUMMARY STATISTICS IN THE NEURAL SPIKE TRAIN SPACE

Results published as: Wesolowski, Sergiusz, Robert J. Contreras, and Wei Wu. ”A new framework for Euclidean summary statistics in the neural spike train space.” The Annals of Applied Statistics 9.3 (2015): 1278-1297.

2.1 Introduction

Due to the stochastic nature of the spike discharge record, probabilistic and statistical methods have been extensively investigated to examine the underlyingfiring patterns [49, 8, 24, 6, 23]. However, these methods mostly focus on parametric representations at each given time and therefore can prove to be limited in data-driven problems in the space of spike trains. Alternative approaches for analyzing spike train data are based on metricizing the spike train space. Over the past two decades, various methods have been developed to measure distances or dissimilarities between spike trains, for example, the distances in discrete state space, discrete time models [35, 37, 49], those in discrete state space, continuous time models [65, 3, 4, 64, 70], those in continuous state space, continuous time models [62, 18, 17]; and a number of others [52, 26, 46, 19, 45]. An ongoing pursuit of great interest in computational neuroscience is defining an average that can represent tendency of a set of spike trains. What follows is the problem of defining basic summary statistics reflecting the intuitive properties of the mean and the variance, which are crucial for further statistical inference methods. In particular, to make thefirst-order statistic, mean, convenient for constructing new framework and inference methods, it should satisfy the following properties:

1. The mean is uniquely defined in a certain framework.

2. The mean is still a spike train.

12 3. The mean represents the conventional intuition of average as in the Euclidean space.

4. The mean depends on exact spike times only, and is independent of the recording time period.

5. The mean can be computed efficiently.

Property 3 can be described as follows: Given a set ofN spike trains with each havingK spikes, we denote these trains using vectors (x , ,x ) N , where each coordinatex represents the { i1 ··· iK }i=1 ik spike timing. Then the mean spike train is expected to resemble 1 N (x , ,x ). N i=1 i1 ··· iK In [66], the authors considered a “consensus” spike train, which� is the centroid of a spike train set (under the Victor-Purpura metric). This idea was further generalized in [11] to a “prototype” spike train which does not have to belong to the given set of spike trains, but its spike times are chosen from the set of all recorded spike times. Recently, a notion of an “average” based on kernel smoothing methods was introduced in [20]. In [70, 71], the authors proposed an elastic metric on inter-spike intervals to define a mean directly in the spike train space. However, none of these approaches satisfies the 5 desirable properties listed above, and therefore may result in limited use in practical applications. In this chapter, we propose a new framework for defining the mean spike train. We adopt a recently-developed metric related to anL p family withp 1, which inherits desirable properties in ≥ the special case ofp = 2 [12]. This metric is a direct generalization of the Victor-Purpura metric, and we refer to it as a GVP (Generalized Victor-Purpura) metric. We will demonstrate that this new mean spike train satisfies all aforementioned 5 properties. In particular, the new framework is the only one (over all available methods) that has desirable Euclidean properties on the given spike times. Such properties are crucial for the definition of summary statistics such as the mean, variance, and covariance in the spike train space. In general, these 5 properties assure intuitiveness of the summary statistics, as well as efficiency in their estimation. In contrast, previous methods have issues such as non-uniqueness, dependence on model assumptions, or more complicated computa- tions, and therefore do not result in the same level of performance (see the detailed comparison in Section 2.2.4). One direct application of the mean spike train is in neural decoding in the rodent peripheral gustatory system [69]. The neural data was recorded from single cells in the geniculate ganglion,

13 as the spiking activity in these neurons was modulated with respect to different taste stimuli on the tongue. It is commonly known that spontaneous spiking activity can be observed even if only artificial saliva is applied. Thus, the neural response is actually a mixture of a background activity and a taste-stimulus activity. We demonstrate using simulation as well as real data that the pro- posed new framework can be used to remove the background activity, which leads to improvement in decoding performance. In Section 2.2, we define the new framework by introducing the mean spike train under the GVP metric, and provide an efficient algorithm to estimate it. In Section 2.3, we extend this framework by developing a statistical approach for noise removal and apply the method to the experimental data. We then discuss the merits of the new framework in Section 2.4. Finally in Appendix, we provide mathematical details on the convergence of the mean estimation algorithm.

2.2 Methods

Before we turn to describing the methods, it is necessary to set up a formal notation in the spike train space. AssumeS is a spike train with spike times 0

M S=(s j)j=1 = (s1, s2, . . . , sM ).

We define the space of all spike-trains containingM spikes to be and the space of all spike-trains S M to be = . This can be equivalently described as a space of all bounded,finite, increasing S ∪ M∞=0 SM sequences. A time warping on the spike times (or inter-spike intervals) has been commonly used to measure distance between two spike trains [65, 12, 70]. LetΓ be the set of all time warping functions, where a time warping is defined as an orientation-preserving diffeomorphism of the domain [0,T ]. That is, Γ= γ : [0,T] [0,T] γ(0) = 0,γ(T)=T,0<˙γ(t)< . { → | ∞} It is easy to verify thatΓ is a group with the operation being the composition of functions. By applyingγ Γ on a spike trainS=(s )M , one obtains a warped spike trainγ(S)=(γ(s ))M . ∈ j j=1 j j=1

14 2.2.1 GVP Metric

In [12], a new spike train metric was introduced with parameterp [1, ). This metric is a ∈ ∞ direct generalization of the classical Victor-Purpura (VP) metric (VP is a special case whenp = 1), and we refer to it as the Generalized Victor-Purpura (GVP) metric. In particular, whenp = 2, this metric resembles an EuclideanL 2 distance. M N Assume, thatX=(x i)i=1 andY=(y j)j=1 are two spike trains in [0,T ]. Forλ(> 0), the GVP metric betweenX andY is given in the following form:

1/2 2 2 dGV P [λ](X,Y ) = min EOR(X,γ(Y )) +λ (xi y j) . (2.1) γ Γ  −  ∈ i,j:x =γ(y ) { �i j }   whereE ( , ) denotes the cardinality of the Exclusive OR (i.e. union minus intersection) of two OR · · sets. That is,E OR(X,γ(Y )) measures the number of unmatched spike times betweenX andγ(Y) and can be computed as

M N EOR(X,γ(Y )) =M+N 2 1 γ(y )=x − { j i} �i=1 �j=1 where1 is an indicator function. The constantλ(> 0) is the penalty coefficient. We emphasize {·} thatd GV P is a proper metric; that is, it satisfies positive definiteness, symmetry, and the inequality. It shares a lot of similarities with the classicalL 2 norm. Similarly to the result in [71], one can show that the optimal time warping between two spike M N trainsX=(x i)i=1 andY=(y j)j=1 must be a strictly increasing, piece-wise linear function, with nodes mapping from points inY to points inX. Based on this fact, a dynamic programming algorithm was developed to compute the distanced GV P with the computational cost of the order of O(MN). Using the matching theory, another efficient algorithm was also developed to computed GV P in the cost ofO(MN) [12].

2.2.2 Definition of the Summary Statistics and Their Properties

Conventional statistical inferences in the Euclidean space are based on basic quantities such as mean and variance. For statistical inferences in the spike train space, we can analogously use an Euclidean spike train metric to define these summary statistics as follows. For a set of spike trainsS ,S , ,S where the corresponding numbers of spikes are 1 2 ··· K ∈S n , n , ,n (arbitrary non-negative integers), respectively, we define their sample mean using 1 2 ··· K

15 the classical Karcher mean [21] as follows:

K 2 S∗ = argmin dGV P [λ](Sk,S) . (2.2) S ∈S �k=1 2 When the mean spike trainS ∗ is known, the associated (scalar) sample variance,σ , can be defined in the following form, K 2 1 2 σ = d [λ](S ,S∗) . (2.3) K 1 GV P k − �k=1 The computation of this variance is straightforward, and the main challenge is in computing the mean spike train for anyλ(> 0). Before we move on to the computational methods for the summary statistics, we list several basic theoretical properties of the mean spike trains using thed GV P metric. The proofs are omitted here to save space.

1. The optimal time warping between two spike trains must be a continuous, increasing, and piece-wise linear function between subsets of spike times in these two trains.

2. Let spike trainsX=(x )M ,Y=(y )M be defined on [0,T ]. Ifλ 2 <1/(MT 2), then i i=1 i i=1 ∈S M M 1/2 2 2 dGV P [λ](X,Y)= λ (xi y i) � − � �i=1 3. Assume a set of spike trainsS ,S , ,S withn , n , ,n spikes, respectively, and 1 2 ··· K ∈S 1 2 ··· K letN = max(n , n , ,n ). Ifλ 2 <1/(KN T 2), then the number of spikes in the max 1 2 ··· K max mean train is the median of n K . { k}k=1 4. Let spike trainsS , ,S . Ifλ 2 <1/(KMT 2) , then the mean spike train has a 1 ··· K ∈S M conventional closed-form: K 1 S . K k �k=1 2.2.3 Computation of the Mean Spike Train

To compute the mean spike trainS ∗ under the GVP metric, we need to estimate two unknowns: 1) the number of spikesn, and 2) the placements of these spikes in [0,T ]. For a general value of λ> 0, neither the matching term nor the penalty term is dominant, and therefore we cannot identify the number of spikes in the mean before estimating their placements [70]. A key challenge

16 is, that we need to update the number of spikes in the algorithm. In this work, we propose a general algorithm to estimate the mean spike train. We initialize the number of spikes in the mean spike train equal to the maximum of n , n , ,n , and then adjust this number during the { 1 2 ··· K } iterations. We present here, how the Karcher mean in Eqn. 2.2, can be efficiently computed using a convergent procedure.

Algorithm. Assume that we have a set ofK spike trains,S , ,S withn , n , ,n 1 ··· K 1 2 ··· K k nk n spikes, respectively. DenoteS k = (si )i=1 andS=(s i)i=1 . Then the sum of squared distances in Eqn. 2.2 is:

K K k 2 k 2 k 2 dGV P [λ](S ,S) = min EOR[S ,γ(S)] +λ (si s j) . (2.4) γ Γ  −  k=1 k=1 ∈ i,j:sk=γ(s ) � � { �i j }   K k 2 We develop here an iterative procedure to minimize k=1 dGV P [λ](S ,S) (as a function ofS) and estimate the optimalS ∗. This new algorithm has� four main steps in each iteration: Matching, Adjusting, Pruning, and Checking, and we refer to it as the MAPC algorithm. In particular, the Adjusting step corresponds to the Centering step in the MCP-algorithm in [71]; in contrast to the nonlinear warping-based Centering-step, the Adjusting step utilizes the Euclidean property and updates the mean spike train in an efficient linear fashion. The Checking step is mainly used to avoid underestimating the number of spikes in the mean. This step adds one spike into the current mean and checks if such addition results in a better mean (i.e. smaller mean squared distances). In contrast, such problem is not addressed in the MCP algorithm.

Algorithm 1 (Matching-Adjusting-Pruning-Checking (MAPC) Algorithm:). 1. Letn= max n , n , ,n . (Randomly) set initial times for then spikes in [0,T ] to form an initial { 1 2 ··· K } S.

2. Matching Step: Use the dynamic programming procedure [70] tofind the optimal matching γk fromS toS k, k=1, ,K. That is, ···

k k 2 k 2 γ = argmin EOR[S ,γ(S)] +λ (si s j) (2.5) γ Γ  −  ∈ i,j:sk=γ(s ) { �i j }   3. Adjusting Step:

17 (a) Fork=1, ,K,j=1, ,n, define ··· ··· sk if i 1, ,n , s.t.γk(s ) =s k rk = i ∃ ∈{ ··· k} j i j s otherwise � j (b) DenoteR = (rk, ,r k), k=1, ,K. Then we update the mean spike train to be k 1 ··· n ··· ˜ 1 K S= K i=1 Ri.

4. Pruning Step� : Remove spikes from the proposed mean S¯ that are matched less thanK/2 times.

k k K (a) For eachj=1, ,n, count the number of timess j appears in γ (S ) k=1. That is, N ··· { } hj = k=1 1s γk(Sk). j ∈ (b) Remove�s j from S˜ ifh j K/2, j=1, ,n, and denote the updated mean spike train ≤ ··· as S˜∗. Then S˜∗ = s S˜ h > K/2 . { j ∈ | j } 5. Checking Step: To avoid being stuck in a local minimum, we check if an insertion or/and deletion of a specific spike can improve the mean estimation:

(a) Let Sˆ∗ be S˜∗ except one spike with the minimal number of appearances (randomly chosen if there are multiple spikes at the minimum) in the Pruning step. Then, update the mean as ˆ K k ˆ 2 K k ˜ 2 S∗ if k=1 dGV P [λ](S , S∗) < k=1 dGV P [λ](S , S∗) S∗∗ = S˜ otherwise � ∗ � �

(b) Let Sˆ∗∗ be the current meanS ∗∗ with one spike inserted at random within [0,T]. Then update the mean as

ˆ K k ˆ 2 K k 2 S∗∗ if k=1 dGV P [λ](S , S∗∗) < k=1 dGV P [λ](S ,S∗∗) S∗∗∗ = S˜ otherwise � ∗∗ � �

6. Mean Update: LetS=S ∗∗∗ andn be the number of spikes inS.

7. If the sum of squared distances stabilizes over Steps 2-6, then the mean spike train is the current estimate and we can stop the procedure. Otherwise, go back to Step 2. Denote the estimated mean in themth iteration asS (m).

K k (m) 2 One can show that the sum of squared distances (SSD), k=1 dGV P [λ](S ,S ) , decreases iteratively as a function ofm. As 0 is a natural lower bound, the� SSD will always converge when m gets large. The detailed proof is given in the Appendix. Note that this MAPC algorithm takes only linear computational order on the number of spike trains in the set. In practical applications,

18 wefind that this algorithm has great efficiency in reaching a reasonable convergence to a mean spike train. In general, when the penalty coefficientλ gets large, the optimal time warping chooses to have fewer matchings between spikes to lower the warping cost. Some of the spikes in the mean will be removed during the iterations to minimize the SSD. In the extreme case, whenλ is sufficiently large, any time warping would be discouraged (as that will result in a larger distance than simply the Exclusive OR operation). In this case, the mean spike train will be an empty train. This result indicates that in order to get a meaningful estimate of the mean spike train, the penalty coefficient λ should not take a very large value. In practical use, one may use a cross-validation to decide the optimal value ofλ.

Illustration of the MAPC Algorithm. To illustrate the algorithm, we simulate spike train data from homogeneous and inhomogeneous Poisson point processes. Following [43], we have the following definition and corollary:

Definition 6 (Poisson point process). We say that $X$ is a Poisson point process on a space $S$ equipped with a measure $\mu$ (in our case, $S$ is an interval in $\mathbb{R}$ and $\mu$ is the standard Lebesgue measure) with a locally integrable, positive intensity function $\rho$ if
$$ \forall\, B \subset S,\ \mu(B) < \infty:\quad N(B) \sim \mathrm{Poiss}(\mu(B)), \qquad (2.6) $$
and
$$ \forall\, n \in \mathbb{N},\ B \subset S,\ \mu(B) < \infty:\quad \{X_B \mid N(B) = n\} = \{x_i\}_{i=1}^{n} \text{ are iid with density } f(x) = \frac{\rho(x)}{\mu(B)}, \qquad (2.7) $$
where $\mu(B) = \int_B \rho(x)\,dx$ and $X_B = X \cap B$.

It can be shown as a corollary that:

Corollary 1 (Poisson point process characterization). If $X$ is a Poisson point process on $S$ and $B_i \subset S$ for all $i$, then $X_{B_1}, X_{B_2}, \ldots$ are independent whenever $B_1, B_2, \ldots$ are disjoint.

The definition reflects the idea that spike trains appear in a stochastic manner and are independent on disjoint time periods. To test the performance of the MAPC algorithm, we illustrate the mean estimation using 30 spike trains randomly generated from a homogeneous Poisson point process. Let the total time

$T = 1$ (sec) and the Poisson rate $\rho = 8$ (spikes/sec). The individual spike trains are shown in Fig. 2.1A. The number of spikes in these trains varies from 3 to 13, with a median value of 9. Therefore, $n$, the number of spikes in the mean, is initialized to be 13, and we adopt 13 randomly distributed spikes in $[0, T]$ as the initial mean in each case. We let $\lambda^2$ vary between 6 and 60 to show the behavior for a small and a large warping penalty.
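A homogeneous Poisson train on $[0, T]$ can be simulated directly from Definition 6: draw the spike count from $\mathrm{Poiss}(\rho T)$ and, given the count, place the spikes iid uniformly (Eqn. 2.7 with a constant intensity). A minimal R sketch, with function names of our choosing:

```r
# Simulate K spike trains from a homogeneous Poisson process on [0, T]:
# N(0,T) ~ Poiss(rho*T), and given N = n the spike times are iid
# Uniform(0, T) by Eqn. 2.7 with constant intensity rho.
simulate_hpp <- function(K, T = 1, rho = 8) {
  lapply(seq_len(K), function(k) sort(runif(rpois(1, rho * T), min = 0, max = T)))
}
set.seed(1)
trains <- simulate_hpp(30, T = 1, rho = 8)  # the setting of Fig. 2.1A
```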


Figure 2.1: A: 30 spike trains generated from a homogeneous Poisson process; each vertical line indicates a spike. B: Estimation results when $\lambda^2 = 6$. Upper panel: the sum of squared distances (SSD) over all iterations. Lower panel: the estimated mean spike train over all iterations; the initial is the spike train in the top row (0), and the final estimate is the spike train in the bottom row (12th). C: Estimation results when $\lambda^2 = 60$.

The result of the MAPC algorithm for $\lambda^2 = 6$ is shown in Fig. 2.1B. The upper panel shows the evolution of the SSD in Eqn. 2.2 versus the iteration index. We see that it takes only a few iterations for the SSD to decrease and converge to a minimum. The estimated mean spike trains over the iterations are shown in the lower panel. Apparent changes are observed in the first few (1 to 5) iterations, after which the process stabilizes. Note that the spikes in the mean train are approximately evenly spaced, which properly captures the homogeneous nature of the underlying process. We also note that the number of spikes in this mean spike train is 9, which equals the median of the numbers of spikes in the set. The result for $\lambda^2 = 60$ is shown in Fig. 2.1C. With a larger penalty, the optimal time warping between spike trains chooses fewer matchings between spikes to lower the warping cost, and some of the spikes in the mean are removed during the iterations. In this case, the convergent SSD is about 150, which is greater than the SSD when $\lambda^2 = 6$ (about 80). Note that when $\lambda$ is even larger, we expect fewer or even no spikes to appear in the estimated mean.

2.2.4 Advantages Over Previous Methods of Averaging Spike Trains

Table 2.1: Comparison of Averages of Spike Trains for Different Methods

Method                                 | "mean" (proposed framework)                                                               | "consensus" [66]          | "prototype" [11]                  | "average" [20]         | "mean" [70]
Metric used                            | GVP metric                                                                                | Victor-Purpura metric     | Victor-Purpura metric             | van Rossum metric      | Elastic metric
Properties (in Introduction) satisfied | 1, 2, 3, 4, 5                                                                             | 2, 4, 5                   | 2, 4, 5                           | 1, 2, 4                | 1, 2, 5
Domain                                 | full spike train space                                                                    | given sample set          | given spike times set             | full spike train space | full spike train space
Number of spikes                       | median of $\{n_1, \cdots, n_N\}$ (for $\lambda^2 \ll 1$)                                  | NA                        | NA                                | NA                     | median of $\{n_1, \cdots, n_N\}$
Spike times                            | $c_j = \frac{1}{N}\sum_{k=1}^{N} s_{kj}$ if $n_1 = \cdots = n_N$ (for $\lambda^2 \ll 1$)  | restricted to sample set  | restricted to sampled spike times | NA                     | ISI-based nonlinear form
Uniqueness in the full space           | almost surely                                                                             | non-unique                | non-unique                        | not known              | almost surely

There have been multiple prior attempts to capture the general trend in a set of spike trains, including the "consensus" spike train [66], the "prototype" spike train [11], the "average" spike train [20], and the "mean" spike train [70]. However, none of these concepts satisfies all five desirable properties of a mean spike train listed in the Introduction. We summarize the most relevant differentiating features in Table 2.1. In the case of the "consensus" and "prototype" spike trains, one main problem lies in the non-uniqueness of the results in the spike train space, which arises directly from the underlying metric used (it resembles the Manhattan distance). If the estimated spiking times of those averages are restricted to the spiking times in the sample sets, then the estimates can be unreliable, particularly when sample sizes are relatively small. The "average" design uses the van Rossum metric, which relies on kernel smoothing of the spike trains; the estimation of the "average" is based on a greedy algorithm, but the accuracy and other properties of the method have not been carefully examined. We propose a new notion of a "mean" spike train based on the kernel-free GVP metric. The key advantages of our design are the Euclidean properties of the GVP distance and the subsequent Karcher mean definition (in Eqn. 2.2). The new framework satisfies all five desirable properties, which distinguishes it from the others.

It is worth noting that, to the best of our knowledge, the GVP metric is one of only two spike train metrics with Euclidean properties (the other being the Elastic metric proposed in [70]). However, the Elastic metric satisfies only 3 of the 5 properties, and the GVP metric has two apparent advantages over it. Firstly, the mean spike train under the Elastic metric depends explicitly on the recording interval; such dependence may introduce an additional level of noise arising from experimental parameters, making the inference less reliable. In contrast, the mean spike train under the GVP metric relies only on the exact spike times in the given data and is independent of the recording interval (Property 4). Secondly, the fact that the Elastic mean is estimated through the inter-spike intervals (ISIs) makes the result difficult to interpret, whereas the GVP mean is estimated directly through spike times and matches the intuition behind mean estimation (Property 3). For illustrative purposes, we compare the spike train "averages" of all methods using the 30 spike trains of Sec. 2.2.3, where the data were simulated under a homogeneous Poisson process. A natural expectation is that these averages should be equi-distantly spaced across the time domain. We adopt a simple measure of equi-distant spacing: the standard deviation of the ISIs in each train, denoted by $SD_{ISI}$; smaller $SD_{ISI}$ values indicate more even spacing. In Fig. 2.2, we show the averages estimated using the GVP mean, Elastic mean, "Prototype", and "Consensus" methods (for the GVP and Elastic methods, we let the penalty coefficients be sufficiently small). The $SD_{ISI}$ of the GVP mean is 0.019, the smallest of all four methods.
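The evenness measure is straightforward to compute; a one-line R helper (the function name is ours):

```r
# Standard deviation of the inter-spike intervals (SD_ISI): smaller
# values indicate more evenly spaced spikes.
sd_isi <- function(spikes) sd(diff(sort(spikes)))
```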

Figure 2.2: Averaged spike trains according to four different methods: GVP mean ($SD_{ISI} = 0.019$), Elastic mean ($SD_{ISI} = 0.029$), "Prototype" ($SD_{ISI} = 0.037$), and "Consensus" ($SD_{ISI} = 0.124$).

2.3 Results

The notion of the mean spike train has a direct application in neural decoding. In this chapter, we examine how the mean can be used to remove spontaneous activity in the geniculate ganglion neurons and improve decoding performance.

2.3.1 Noise Removal Method

Geniculate ganglion neurons exhibit a spiking response to chemical stimuli applied to the taste buds on the tongue. Such neuronal activity is commonly used to study neural coding in the peripheral gustatory system [10, 7, 30]. We note that these neurons fire even if no stimulus is applied or the stimulus is a control solution (artificial saliva). That is, the observed spike trains under stimulation are likely a mixture of spontaneous activity and responses to the taste stimuli. In the context of neural decoding, such spontaneous activity can be viewed as "background noise", and a "de-noised" spiking activity is expected to better characterize the neural response to the taste stimulus and result in better decoding performance.


Figure 2.3: Scheme differentiating the noise removal approach from standard inference on spike train data. Dashed boxes indicate the components of the standard inference framework; the solid lines indicate where the noise removal framework is introduced.

Previous approaches to noise removal focus mainly on spike counts across the time domain and do not use a temporal matching between spikes. Here we propose a novel noise-removal procedure based on our new framework. Fig. 2.3 shows the schematic idea of incorporating noise removal into statistical inference. The procedure assumes that the observed data are a "sum" of isolated neuron responses and spontaneous activity. To improve neural decoding, we first use the stimulus-free spike recordings to estimate the mean background noise with the MAPC algorithm. Then, we "subtract" this mean from the observed stimulus-dependent data.

Obviously, for random variables $X, Y$ in vector spaces one cannot assume that $\tilde X = X + Y - \bar Y$ (where $\bar Y$ denotes the mean of $Y$) is a noise-reduced "version" of $X + Y$. However, in the space of spike trains we can establish such a procedure by utilizing the warping matchings on the GVP mean: $\tilde X = X \oplus Y \ominus \bar Y$ indeed gives a noise-reduced version $\tilde X$ of a point pattern $X \oplus Y$. This approach is made possible by the following definitions of addition ($\oplus$) and subtraction ($\ominus$) in the spike train space:

Adding spike trains. We assume that the noise is additive and that adding spike trains is achieved by the set union operation. That is, let $X = (x_1, \ldots, x_N)$ and $Y = (y_1, \ldots, y_M)$ be two spike trains of lengths $N$ and $M$, respectively. We define the spike train $Z = X \oplus Y$ as a spike train of length $N + M$ such that
$$ Z = X \oplus Y = (\{x_1, \ldots, x_N\} \cup \{y_1, \ldots, y_M\}). $$

Subtracting spike trains. Defining the subtraction is more challenging, as it cannot follow directly from set operations: two different spike trains are unlikely to have coinciding spike times. To perform the subtraction we turn to the definition of the GVP metric and the optimal warping between two spike trains (Eqn. 2.1). We define the subtraction of a spike train $Y$ from a spike train $X$ as removing from $X$ all spikes that are matched with spikes in $Y$ under the optimal warping $\gamma$. We say that a spike pair $(x_i, y_j)$ is matched if $x_i = \gamma(y_j)$ for some pair $(i, j)$.

Formally, let $X = (x_1, \ldots, x_N)$ and $Y = (y_1, \ldots, y_M)$ be two spike trains and $\gamma$ be the optimal warping between them according to the $d_{GVP}$ metric. We define the subtraction of $Y$ from $X$, denoted by $Z = X \ominus Y$, as follows:
$$ Z = X \ominus Y = (\{x_1, \ldots, x_N\} \setminus \{x_i : x_i = \gamma(y_j) \text{ for some pair } (i, j)\}). $$

Once $\oplus$ and $\ominus$ are established, we can describe the noise removal method as follows: we "subtract" $\mathrm{mean}(Y)$ from the observed $X \oplus Y$ using the matching of the GVP metric and obtain a spike train $X \oplus Y \ominus \mathrm{mean}(Y)$ by removing the matched spikes. We first use a simulation to illustrate both the $\oplus$ and $\ominus$ operations in Section 2.3.2. The new method is then applied to a real experimental dataset in Section 2.3.3.
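A minimal R sketch of the two operations follows. The addition is exact; for the subtraction, a hypothetical greedy nearest-match within a tolerance `tol` stands in for the optimal GVP warping, so this is only an approximation of the matching actually used.

```r
# Spike-train addition Z = X (+) Y: union of the spike times.
add_trains <- function(X, Y) sort(c(X, Y))

# Spike-train subtraction Z = X (-) Y: remove from X every spike matched
# to a spike of Y. A greedy nearest match within `tol` approximates the
# optimal warping gamma of the GVP metric.
subtract_trains <- function(X, Y, tol = 0.02) {
  keep <- rep(TRUE, length(X))
  for (y in Y) {
    d <- abs(X - y)
    d[!keep] <- Inf                 # each spike of X is matched at most once
    if (length(d) > 0 && min(d) <= tol) keep[which.min(d)] <- FALSE
  }
  X[keep]
}
```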



Figure 2.4: Illustration of the noise addition and the noise removal using the $\oplus$, $\ominus$ operations. A. Background noise: 40 spike trains generated from HPP(10); the mean background noise is shown with dashed lines in the bottom row. B. $2 \times 20$ spike trains from IPP($\rho_X$) (asterisks) and IPP($\rho_Y$) (circles), respectively. C. Sums of the spike trains from A and B. D. Spike trains after the background noise is removed.

2.3.2 Result for Simulated Data

To illustrate the noise removal framework, we first generate 40 independent realizations $\{\mu_i\}_{i=1}^{40}$ of a homogeneous Poisson process on $[0, 2]$ with constant intensity $\alpha = 10$. These realizations represent the noise and are used to estimate the mean background noise $\hat\mu$ with the MAPC algorithm. The results are shown in Fig. 2.4A. Next we generate two sets of 20 independent spike trains, $\{X_i\}_{i=1}^{20}$ and $\{Y_i\}_{i=1}^{20}$, as realizations of inhomogeneous Poisson processes (IPP) with intensity functions $\rho_X(t) = \exp(-(t - 1.5)^2)$ and $\rho_Y(t) = \exp(-(t - 0.5)^2)$, respectively. The generated spike trains are shown in Fig. 2.4B; in our framework they correspond to the underlying true neuronal signals. In the third step we obtain the equivalent of the "observed" data by adding the previously generated noise, each $\mu_i$, to the corresponding spike trains $X_i$ and $Y_i$. The combined results are shown in Fig. 2.4C. Here, adding spike trains is understood in set-operation terms, and we obtain spike trains following the Poisson processes $X_i \oplus \mu_i \sim \mathrm{IPP}(\rho_X + \alpha)$ and $Y_i \oplus \mu_{i+20} \sim \mathrm{IPP}(\rho_Y + \alpha)$. The mean background noise spike train $\hat\mu$ is then subtracted out from each realization of the noised dataset, according to the procedure described in Section 2.3.1. For each $i = 1, \ldots, 20$, we

obtain the noise-removed spike trains $X_i \oplus \mu_i \ominus \hat\mu$ and $Y_i \oplus \mu_{i+20} \ominus \hat\mu$, shown in Fig. 2.4D.
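The inhomogeneous trains can be generated by Lewis-Shedler thinning, since both intensities here are bounded by 1; a short R sketch (function names are ours):

```r
# Simulate an inhomogeneous Poisson process on [0, Tmax] by thinning:
# draw candidates from a homogeneous process at the upper bound rho_max
# and keep each candidate t with probability rho(t) / rho_max.
simulate_ipp <- function(rho, Tmax = 2, rho_max = 1) {
  cand <- sort(runif(rpois(1, rho_max * Tmax), 0, Tmax))
  cand[runif(length(cand)) < rho(cand) / rho_max]
}
rho_X <- function(t) exp(-(t - 1.5)^2)  # peaks near t = 1.5
rho_Y <- function(t) exp(-(t - 0.5)^2)  # peaks near t = 0.5
X <- replicate(20, simulate_ipp(rho_X), simplify = FALSE)
Y <- replicate(20, simulate_ipp(rho_Y), simplify = FALSE)
```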


Figure 2.5: The influence of noise removal on classification performance with respect to increasing noise level $\alpha$. A. Classification performance on the noisy data; the bold lines represent the average classification score among 50 simulations, and the dotted lines indicate the standard deviation around the average. B. Same as A, but for the noise-removed data. C. Mean classification score curves from A (dashed line) and B (dotted line).

We repeat this simulation procedure 50 times for each level of $\alpha \in [2, 20]$ and perform classification on the noisy data $X_i \oplus \mu_i$, $Y_i \oplus \mu_{i+20}$ as well as on the noise-removed data $X_i \oplus \mu_i \ominus \hat\mu$, $Y_i \oplus \mu_{i+20} \ominus \hat\mu$. The classification score (classification accuracy) is obtained by standard leave-one-out cross-validation. We record the average score and its standard deviation for each $\alpha$ level; the results are shown in Fig. 2.5A,B. As anticipated, with increasing noise intensity $\alpha$ the classification performance declines on both the noisy and the noise-reduced datasets. However, comparing the two average classification scores in Fig. 2.5C, we see that the noise removal framework outperforms classification on the noisy data once the noise intensity $\alpha$ becomes non-negligible. This result indicates that the proposed noise-removal procedure can increase the contrast between classes and improve classification. Next, we examine this procedure on a real experimental dataset.
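The classification scheme is ordinary leave-one-out nearest-neighbour classification; a sketch in R, where `dist_fun` is any spike-train distance (the GVP distance in our experiments):

```r
# Leave-one-out nearest-neighbour classification score. `trains` is a
# list of spike trains, `labels` their class labels, and `dist_fun` a
# distance between two trains; any function with this signature works.
loocv_score <- function(trains, labels, dist_fun) {
  n <- length(trains)
  hits <- 0
  for (i in seq_len(n)) {
    others <- setdiff(seq_len(n), i)
    d <- vapply(others, function(j) dist_fun(trains[[i]], trains[[j]]), numeric(1))
    hits <- hits + (labels[others[which.min(d)]] == labels[i])
  }
  hits / n
}
```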

2.3.3 Result in Real Data in Gustatory System

Here we apply the noise removal procedure to neural responses in the gustatory system and test whether the decoding (i.e., classification with respect to taste stimuli) improves after the spontaneous activity is removed. The data consist of spike train recordings of rat geniculate ganglion neurons and were previously used in [69].


Figure 2.6: An example of spike trains from Cell 10. Each group of 3 or 4 rows corresponds to a different type of stimulus applied; the rows are labeled N (0.1M NaCl), A1-A4 (0.003-0.1M AA), NA1-NA4 (AA + 0.1M NaCl), and C (0.01M CA). A. The 5-second pre-stimulus spike trains, whose mean spike train, calculated by the MAPC algorithm, is shown by the thick vertical bars at the top of the panel. B. The 5-second stimulus period. C. The same 5-second period of spike trains as in B, but with the spontaneous activity subtracted out.

Briefly, adult male Sprague-Dawley rat geniculate ganglion neurons innervating the tongue were stimulated with 10 different solutions over a time period of 5 seconds: 0.1 M NaCl, 0.01 M citric acid (CA), 0.003, 0.01, 0.03, and 0.1 M acetic acid (AA), and each AA concentration mixed with 0.1 M NaCl. Each stimulus was presented 2-4 times. Stimulus trials were divided into three time regions: a 5-second pre-stimulus period, a 5-second stimulus application period, and a 5-second post-stimulus period. During the first and third regions, a control solution of artificial saliva was applied; during the stimulus period, one of the 10 aforementioned solutions was applied. In this study, we focus on classifying the given spike trains according to the 10 stimuli presented in each of 21 observed neurons. In Fig. 2.6A,B, we present the recordings in the first and second time regions from one example neuron. The spike trains in the 5-second pre-stimulus period reflect spontaneous activity with artificial saliva applied; they are treated as "noise" data, in contrast to the stimulus-dependent responses. We compute their mean spike train with the parameter $\lambda^2 = 0.001$ (a small value, to retain more spikes in the mean). The result is shown in the top row of Fig. 2.6A. This mean properly summarizes the spiking activity during the pre-stimulus period. In the next step, we subtract this mean noise from the data recorded during the stimulus period (spike trains between the 5th and 10th second). The noise-removed spike trains are shown in Fig. 2.6C. We can now compare the decoding performance using the observed stimulus-response data and the "noise removed" data. To reliably evaluate classification scores, we use leave-one-out cross-validation; in both cases, the class is assigned according to the nearest neighbor's class under the $d_{GVP}$ metric. In this classification analysis we use $\lambda^2 = 225$, a relatively large value, to emphasize the importance of both the matching term and the penalty term in Eqn. 2.1.


Figure 2.7: The result of the noise removal procedure applied to each of the 21 recorded cells. The marker coding is the same for both panels and indicates the influence of the noise removal approach on the classification score: black circles - increase, grey diamonds - decrease, black asterisks - unchanged. A. Raw classification scores for each cell in each condition. B. The same result as in A, but in terms of the classification score increase with respect to the mean noise size. The vertical black line corresponds to the noise size cutoff of 10 spikes.

The comparison of classification accuracy is shown in Fig. 2.7A. In 10 of the 21 cells the classification improved after the noise removal procedure, and in only 4 cells it was hindered. Classification in 7 cells remained unchanged after noise removal, a seemingly large fraction. To explain this, we investigated the size of the mean background noise and its influence on the change in classification performance. It turned out, as seen in Fig. 2.7B, that in 5 of these 7 unchanged cells the pre-stimulus spiking is negligible (the estimated mean spike train has 0 or 1 spike). In those cases, subtracting out the mean noise spike train obviously has minimal influence. The remaining two cases are associated with the opposite problem of noise size: the number of spikes in the pre-stimulus period is comparable to or greater than the number of spikes in the stimulation period. When such a mean noise is subtracted out, it can also remove relevant information and thus may not improve the decoding. These noise-size issues are consistent with common intuition about noise removal. It is worth noting, however, that the size of the estimated mean background noise can be controlled in our framework by adjusting the penalty parameter $\lambda$ (Section 2.2.3). More investigation will be conducted on the selection of $\lambda$ in future work. When focusing on cells with a significant noise influence in their spiking pattern (at least 10 spikes in the estimated mean noise spike train), we see that in the majority of cases (8 out of 13) the noise removal improved the classification score (Fig. 2.7B). In extreme cases we obtain up to a 20% improvement in decoding performance (note that with 10 different stimuli, a random guess yields 10% average accuracy). Only 3 of the 13 cells indicate a loss of information, and 2 are not influenced by the noise removal. In summary, we find that our notion of mean background noise for spike train data is in agreement with the common understanding of additive noise for the majority of recorded neurons. Moreover, the proposed noise removal framework effectively improves neural decoding, provided that the pre-stimulus spiking has a high enough intensity.

2.4 Discussion

We proposed a new framework for defining the mean of a set of spike trains and the deviation from the mean. We provide an efficient algorithm for computing the mean spike train and prove the convergence of the method. The framework is based on the $d_{GVP}$ metric [12], which resembles the Euclidean distance. This concept gives an intuitive sample mean point pattern, in which the spike positions in the mean are averaged among matched spikes across the set of spike trains. Our summary statistics provide the basis for inference on point pattern data, and we utilize them to develop a mean-based noise removal approach. We show that our procedure improves the classification score for simulated inhomogeneous Poisson point process data with various non-negligible noise levels. We have also applied the new tools to a neural decoding problem in the rat gustatory system, where the mean point pattern approach and the noise removal framework significantly improved the neural decoding among the set of 21 neurons.

In the noise removal framework, we defined the operations of addition and subtraction between spike trains using the matching component of the GVP metric. We note, however, that these operations do not satisfy the law of associativity. For more advanced analysis, it is desirable to establish an algebraic structure on the space of point patterns; refining these approaches will be pursued in future work. Once the algebraic structure is established, statistical models can be built and regression analysis can be performed. With this setting and the already developed mean and deviation-from-the-mean approaches, we expect to develop classical statistical inferences such as hypothesis tests, confidence intervals/regions, FANOVA (functional ANOVA), FPCA (functional PCA), and regressions on functional data [48, 61]. All these tools are expected to provide a new methodology for more effective analysis and modeling of neural spike trains, or any point pattern data in general.

Proof of Convergence of the MAPC Algorithm

Theorem 1 (Convergence of the MAPC algorithm). Denote the estimated mean in the $m$th iteration of the MAPC algorithm as $S^{(m)}$. Then the sum of squared distances $\sum_{k=1}^{K} d_{GVP}(S^k, S^{(m)})^2$ decreases iteratively. That is,
$$ \sum_{k=1}^{K} d_{GVP}(S^k, S^{(m+1)})^2 \le \sum_{k=1}^{K} d_{GVP}(S^k, S^{(m)})^2. $$

Proof: The proof goes through Steps 2-6 of the algorithm; in each step we show that the overall distance to the proposed mean $S^{(m)} = (s_1^{(m)}, \cdots, s_n^{(m)})$ is non-increasing.

1. Matching: In the $m$th iteration, we find the optimal matching $\gamma^k$ from $S^{(m)}$ to $S^k$ for each $k \in \{1, \ldots, K\}$. Having those, we can write:
$$ \sum_{k=1}^{K} d_{GVP}(S^k, S^{(m)})^2 = \sum_{k=1}^{K} E_{OR}(S^k, \gamma^k(S^{(m)})) + \lambda^2 \sum_{k=1}^{K} \sum_{\{i,j:\, s_i^k = \gamma^k(s_j^{(m)})\}} (s_i^k - s_j^{(m)})^2. \qquad (2.8) $$

2. Adjusting: By definition, we update $S^{(m)}$ to $\tilde S^{(m)} = (\tilde s_1^{(m)}, \cdots, \tilde s_n^{(m)}) = \frac{1}{K} \sum_{k=1}^{K} R_k$, where $R_k = (r_1^k, \cdots, r_n^k)$ with
$$ r_j^k = \begin{cases} s_i^k & \text{if } \exists\, i \in \{1, \cdots, n_k\} \text{ s.t. } \gamma^k(s_j^{(m)}) = s_i^k, \\ s_j^{(m)} & \text{otherwise,} \end{cases} \qquad k = 1, \cdots, K, \quad j = 1, \cdots, n. $$
Hence, $\sum_{k=1}^{K} \sum_{\{i,j:\, s_i^k = \gamma^k(s_j^{(m)})\}} (s_i^k - s_j^{(m)})^2 = \sum_{k=1}^{K} \sum_{j=1}^{n} (r_j^k - s_j^{(m)})^2 \ge \sum_{k=1}^{K} \sum_{j=1}^{n} (r_j^k - \tilde s_j^{(m)})^2$, since each $\tilde s_j^{(m)}$ is the minimizer of $\sum_{k=1}^{K} (r_j^k - c)^2$ over $c$.

Let $\gamma$ be the piecewise linear warping function from $S^{(m)}$ to $\tilde S^{(m)}$, i.e., $\tilde S^{(m)} = \gamma(S^{(m)})$. Then
$$ \begin{aligned} \sum_{k=1}^{K} d_{GVP}(S^k, S^{(m)})^2 &\ge \sum_{k=1}^{K} E_{OR}(S^k, \gamma^k(S^{(m)})) + \lambda^2 \sum_{k=1}^{K} \sum_{j=1}^{n} (r_j^k - \tilde s_j^{(m)})^2 \\ &\ge \sum_{k=1}^{K} E_{OR}(S^k, \gamma^k(S^{(m)})) + \lambda^2 \sum_{k=1}^{K} \sum_{\{i,j:\, s_i^k = \gamma^k(s_j^{(m)})\}} (s_i^k - \tilde s_j^{(m)})^2 \\ &= \sum_{k=1}^{K} E_{OR}(S^k, \gamma^k \circ \gamma^{-1}(\tilde S^{(m)})) + \lambda^2 \sum_{k=1}^{K} \sum_{\{i,j:\, s_i^k = \gamma^k \circ \gamma^{-1}(\tilde s_j^{(m)})\}} (s_i^k - \tilde s_j^{(m)})^2 \ \ge\ \sum_{k=1}^{K} d_{GVP}(S^k, \tilde S^{(m)})^2. \end{aligned} $$

3. Pruning: $\tilde S^{*(m)} = \{ s_j \in \tilde S^{(m)} \mid \sum_{k=1}^{K} 1_{s_j \in \gamma^k(S^k)} \ge K/2 \}$ is the subset of $\tilde S^{(m)}$ whose spikes appear in $\{\gamma^k(S^k)\}_{k=1}^{K}$ at least $K/2$ times. Based on the result in the Adjusting step, we have
$$ \sum_{k=1}^{K} d_{GVP}(S^k, S^{(m)})^2 \ge \sum_{k=1}^{K} E_{OR}(S^k, \gamma^k \circ \gamma^{-1}(\tilde S^{(m)})) + \lambda^2 \sum_{k=1}^{K} \sum_{\{i,j:\, s_i^k = \gamma^k \circ \gamma^{-1}(\tilde s_j^{(m)})\}} (s_i^k - \tilde s_j^{(m)})^2. $$
Using the basic rule of the Exclusive OR, it is easy to find that
$$ \sum_{k=1}^{K} E_{OR}(S^k, \gamma^k \circ \gamma^{-1}(\tilde S^{*(m)})) \le \sum_{k=1}^{K} E_{OR}(S^k, \gamma^k \circ \gamma^{-1}(\tilde S^{(m)})). $$
Let $\tilde S^{*(m)} = (s_1^{*(m)}, \cdots, s_{n^*}^{*(m)})$, where $n^*$ denotes the number of spikes in $\tilde S^{*(m)}$. Then,
$$ \sum_{k=1}^{K} d_{GVP}(S^k, S^{(m)})^2 \ge \sum_{k=1}^{K} E_{OR}(S^k, \gamma^k \circ \gamma^{-1}(\tilde S^{*(m)})) + \lambda^2 \sum_{k=1}^{K} \sum_{\{i,j:\, s_i^k = \gamma^k \circ \gamma^{-1}(s_j^{*(m)})\}} (s_i^k - s_j^{*(m)})^2 \ge \sum_{k=1}^{K} d_{GVP}(S^k, \tilde S^{*(m)})^2. $$

4. Checking: Finally, we perform the checking step to avoid possible local minima in the pruning process. In the test of whether a spike can be removed from $\tilde S^{*(m)}$, we let $\hat S^{*(m)}$ be $\tilde S^{*(m)}$ with the spike of minimal number of appearances removed. Then update the mean spike train as
$$ S^{**(m)} = \begin{cases} \hat S^{*(m)} & \text{if } \sum_{k=1}^{K} d_{GVP}(S^k, \hat S^{*(m)})^2 < \sum_{k=1}^{K} d_{GVP}(S^k, \tilde S^{*(m)})^2, \\ \tilde S^{*(m)} & \text{otherwise.} \end{cases} $$
It is easy to verify that $\sum_{k=1}^{K} d_{GVP}(S^k, S^{(m)})^2 \ge \sum_{k=1}^{K} d_{GVP}(S^k, S^{**(m)})^2$. In the test of whether a spike can be added to $S^{**(m)}$, we let $\hat S^{**(m)}$ be $S^{**(m)}$ with one spike inserted at random within $[0, T]$. Then update the mean spike train as
$$ S^{***(m)} = \begin{cases} \hat S^{**(m)} & \text{if } \sum_{k=1}^{K} d_{GVP}(S^k, \hat S^{**(m)})^2 < \sum_{k=1}^{K} d_{GVP}(S^k, S^{**(m)})^2, \\ S^{**(m)} & \text{otherwise.} \end{cases} $$
It is easy to see that $\sum_{k=1}^{K} d_{GVP}(S^k, S^{(m)})^2 \ge \sum_{k=1}^{K} d_{GVP}(S^k, S^{***(m)})^2$.

Using Step 6, the mean at the $(m+1)$th iteration is $S^{(m+1)} = S^{***(m)}$. Hence,
$$ \sum_{k=1}^{K} d_{GVP}(S^k, S^{(m+1)})^2 \le \sum_{k=1}^{K} d_{GVP}(S^k, S^{(m)})^2. $$

CHAPTER 3

SRSF SHAPE ANALYSIS FOR SEQUENCING DATA REVEAL NEW DIFFERENTIATING PATTERNS

Results published as: Wesolowski, Sergiusz, Daniel Vera, and Wei Wu. ”SRSF shape analysis for sequencing data reveal new differentiating patterns.” Computational Biology and Chemistry 70 (2017): 56-64.

3.1 Introduction

In this work we propose a new framework (SRSFseq), based on Square Root Slope Function shape analysis, to analyse Illumina sequencing data. In this new approach the basic unit of information is the density of mapped reads over a region of interest located on the known reference genome. The densities are interpreted as shapes, and a new shape analysis model is proposed. An equivalent of a Fisher test is used to quantify the significance of shape differences in read distribution patterns between groups of density functions in different experimental conditions. We evaluated the performance of this new framework for analyzing RNA-seq data at the exon level, which enabled the detection of variation in read distributions and abundances between experimental conditions not detected by other methods. Thus, the method is a suitable supplement to the state-of-the-art count-based techniques. The variety of density representations and the flexibility of the mathematical design allow the model to be easily adapted to other data types or problems in which the distribution of reads is to be tested. The functional interpretation and the SRSF phase-amplitude separation technique give an efficient noise reduction procedure, improving the sensitivity and specificity of the method. Second-generation sequencing technologies, such as Illumina sequencing, have allowed researchers to discover fundamental features of genomes and their regulation, organization, and dynamics. For example, sequencing experiments that examine the dynamics of transcription of genomic DNA into RNA (RNA-seq) involve the isolation of RNA from populations of cells, which is experimentally processed and sequenced. The generated sequences are often mapped to a reference genome to

identify the genes from which the sequences originated. The quantification of sequences mapped to each gene is then used to estimate the level to which each gene is expressed. These gene expression estimates are often compared between experimental conditions to make inferences about gene expression differences. This sequence-map-quantify-compare paradigm is the basis for many functional genomics experiments. The data provided by these experiments are in the form of genomic coordinates from which the reads are assumed to derive, and they number on the order of tens of millions of observations. Because of various biological and technical aspects of these experiments, these read distributions have proved difficult to model [15]. While there have been numerous attempts to accurately model these data, nearly all involve reducing the data to a form that summarizes read counts over defined genomic regions, which discards or significantly reduces information on the spatial distribution of reads. In the case of RNA-seq or ChIP-seq, which typically focus on examining discrete genomic units such as genes, these read distributions are generally reduced by summarizing the number of reads that map to each gene. This approach discards information about the spatial distribution of these reads relative to the gene or exon. While this simplifies the complex data and is more easily modeled, it also loses information that may provide insight into gene expression dynamics associated with the shape of the read distribution (e.g., alternative splicing, variation in transcription start and termination sites, and differential exon usage). This method also suffers from inaccuracies introduced by the presence of overlapping genes, where the expression of one gene may cause inaccurate counting for another, overlapping gene. Several statistical models have been developed to attempt to address these issues, including DESeq2, DEXSeq, Cufflinks, Limma, BaySeq, and EBSeq [36, 2, 34, 59, 58, 29, 14, 31]. Here, we present SRSFseq, a new framework for analyzing genomics data based on second-generation sequencing. The framework interprets the distribution of read alignments across the genome as shapes. This approach takes into account information provided by the base-level distribution of the mapped reads in order to examine variability in the shapes of the read densities over genomic regions. It takes into account the relative read abundances and the differences in read density profiles. We show how this framework can be used to identify new differential expression behaviors and successfully supplement the results established by state-of-the-art, count-based methods.

In the following sections we demonstrate the functionality of SRSFseq on an example of exon-level differential expression analysis. The functional interpretation of the model allows us to use the phase-amplitude separation [57], which accounts for additional levels of noise and normalizes the data. We utilize the functional F-test [72] to determine differential expression and compare the results with selected popular methods: Cuffdiff, DESeq2, and Limma-voom. Next we propose an alternative extension of SRSFseq for detecting shifts and shape changes in MNase-seq data for nucleosome positioning.

3.2 Methods

In order to model the distribution of read alignments, we first obtain the read densities of a specific genomic region of interest ($G$). This region is assumed to be common among different samples, e.g., an exon, gene, or transcription start site. The read densities are from now on treated as the basic unit of information for further modelling. In this work, to obtain read densities we use a standard kernel density estimator applied to the coordinates of the mapped reads, as seen in the example in Figure 1.3. Throughout this paper we refer to this step as the filtering. The choice of the density estimation technique may bear significant influence on which features of the Next Generation Sequencing (NGS) data are extracted. As a consequence of the filtering step, the data is automatically normalized. Summarizing, our approach moves the modeling from the vector-valued variables used in most of the available methods to the infinite-dimensional space of read densities. The mathematical complexity is higher, but it is necessary in order to benefit from the advanced shape analysis modeling tools and to unlock the full potential of Next Generation Sequencing.
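As a concrete illustration of the filtering step, the sketch below estimates a read density on a common grid with base R's `density()`; the bandwidth and grid size are our choices and are not prescribed by the method.

```r
# Filtering step (sketch): kernel density estimate of mapped-read
# coordinates over a region of interest G = [g1, g2], evaluated on a
# common grid so that samples are directly comparable.
filter_density <- function(read_pos, g1, g2, n_grid = 512) {
  d  <- density(read_pos, from = g1, to = g2, n = n_grid)
  dx <- d$x[2] - d$x[1]
  d$y / (sum(d$y) * dx)  # renormalize so the density integrates to 1 over G
}
```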

3.2.1 Functional ANOVA for Read Densities

The density normalization (filtering) is essential, as we want to focus on uncovering new information stored in the NGS results. The density normalization allows us to mod out all differences arising from discrepancies in the numbers of mapped reads, and it makes the data comparable between experimental samples. As our normalized data no longer depend on read counts, we expect to detect information encoded in the NGS output that differs from what the count-based methods detect. We confirm this hypothesis in Section 3.3.

In general, SRSFseq is suitable for comparing and modeling any point patterns arising from mapping NGS reads to a reference genome. For the sake of clarity we focus on exon-level differential gene expression and RNA-seq experiments, but we emphasize that the methods described below can be applied to any NGS output as long as it consists of reads mapped to a known genome. In this example we aim to compare the gene expression patterns between $j = 1, \ldots, k$ conditions over a genomic region of interest (in our case, an exon). In our approach a gene is differentially expressed if at least one of its exons is statistically significantly differentially expressed, and different gene isoforms are treated as different genes. To quantify the difference between conditions we utilize the functional ANOVA F-Snedecor test with null hypothesis:

$$ H_0: \mu_1 = \mu_2 = \ldots = \mu_k. \qquad (3.1) $$

The test statistic is a ratio of the sums of squared distances between and within conditions, scaled by their degrees of freedom:
$$ T = \frac{SS_B/(k-1)}{SS_W/(n-k)}, \qquad (3.2) $$
where $k$ represents the number of conditions tested, and $SS_B$ and $SS_W$ are the sums of squared distances between and within conditions, defined as in the classical ANOVA statistic but using the $L^2$ norm:
$$ SS_B = \sum_{j=1}^{k} n_j \|\bar\mu_{.j} - \bar\mu_{..}\|_2^2, \qquad SS_W = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \|\mu_{ij} - \bar\mu_{.j}\|_2^2, \qquad (3.3) $$
with
$$ \bar\mu_{.j} = \frac{1}{n_j} \sum_{i=1}^{n_j} \mu_{ij}, \qquad \bar\mu_{..} = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_j} \mu_{ij}, \qquad \text{where } n = \sum_{j=1}^{k} n_j. $$
The statistic approximately follows the $F(\kappa(k-1), \kappa(n-k))$ distribution under the null hypothesis, where $\kappa$ is a scaling constant obtained by the two-cumulant approximation method (see [72]). Due to the low sample sizes of NGS experiments, the cumulant approximation is not reliable; thus in SRSFseq we use the crude $F(k-1, n-k)$ distribution for the test statistic. Equipped with this tool, we move to application examples and performance evaluation of the new framework on RNA- and DNA-seq data.
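A minimal R implementation of the statistic in Eqns. (3.2)-(3.3), with the $L^2$ norms approximated by Riemann sums on a common grid (the matrix layout and function name are ours):

```r
# Functional ANOVA F-test (sketch). `mu` is an n x p matrix of filtered
# densities on a common grid with spacing dt (rows = samples); `cond` is
# a factor of length n giving each sample's condition.
fanova_pvalue <- function(mu, cond, dt) {
  n <- nrow(mu); k <- nlevels(cond); nj <- as.vector(table(cond))
  grand <- colMeans(mu)                   # pooled mean curve mu-bar-..
  group <- rowsum(mu, cond) / nj          # k x p condition means mu-bar-.j
  ss_b  <- sum(nj * rowSums(sweep(group, 2, grand)^2)) * dt
  ss_w  <- sum((mu - group[as.integer(cond), , drop = FALSE])^2) * dt
  Tstat <- (ss_b / (k - 1)) / (ss_w / (n - k))
  pf(Tstat, k - 1, n - k, lower.tail = FALSE)  # crude F(k-1, n-k) null
}
```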

36 3.2.2 Pre-processing of the Raw Data

To analyze the differences, we first have to perform the filtering step and obtain the functional interpretation of the sequencing over the exon locations (exon coordinates were obtained from the UCSC database [25], extracted in the form of a GTF file obtained from [22]). To do this we provide R scripts [47], which take as input BAM files with mapped reads. Various software suites are available to transform the raw sequencing data into the required format; we used samtools [33] and bowtie2 [28] for alignment against the human genome (HG19). In our analysis we use the benchmark datasets described in [58]; in each case the software parameters were set as described in the benchmark analysis of the same dataset. As a result of this procedure we obtain a set of functions over a shared reference region $G$, each function representing a different NGS experiment sample over a different exon. Our goal is to compare the NGS experiments between conditions. A sample in functional form from the $j$-th condition is denoted $\tilde\mu_{ij}(t)$, where $t$ is the position in the common reference domain $G$. We assume that in each condition the observed filtered data $\tilde\mu_{ij}$ arise as a distortion of an unknown true density specific to condition $j$ (denoted by $\mu_j$). We propose three ways of modelling the intensities for detecting differences in NGS results; each model accounts for a different type of distortion. In each model the intensities are normalized prior to the analysis, so that $\int_G \tilde\mu_{ij}(s)\,ds = 1$. The models differ in the assumptions made on the source of variability between the intensities. In the discussion section we add one more model to show possible extensions of the shape analysis framework.

3.2.3 SRSFseq: Base Model

We propose a simple ANOVA-like setup for the observed, pre-processed density functions. The density representation of the $i$-th sample in the $j$-th condition is modelled as:
$$ \tilde\mu_{ij} = \mu_j + \epsilon_{ij}, \qquad (3.4) $$
where $\epsilon_{ij}$ is a Gaussian stochastic process reflecting the noise in the data, with $E\epsilon_{ij} = 0$ and common covariance function $K(s, t)$, and $\mu_j$ is the base normalized density function of the $j$-th condition. The $\epsilon_{ij}$ are assumed to be pairwise independent. The density functions $\mu_{ij} = \tilde\mu_{ij}$ can then be directly plugged into the test statistic described above.

The base model handles obvious density differences well, as exemplified in Figure 3.1. As we show in Section 3.3.1, even this simplest functional case proves useful in discovering new differential patterns.

Figure 3.1: A simulated example of six samples of filtered density functions coming from $k = 2$ different conditions with significantly different underlying true density functions $\mu_{red}$, $\mu_{black}$.

3.2.4 SRSFseq: Noise Removal (Shape and Energy Preserving Alignment)

Unfortunately, due to the low sample sizes of NGS experiments, the filtering pre-processing step is very sensitive to noise when obtaining the density functions. This may inflate the type I and type II errors of the test, due to misalignment of the filtered functions. To account for this issue we extend the analysis by an additional preprocessing step using the SRSF phase-amplitude separation method [57]. We assume that the density functions may be distorted by domain shifts, which we refer to as warping functions $\gamma$. Before conducting the analysis it is necessary to remove the warping noise, i.e., to align the density functions. We consider two types of distortions: shape preserving and energy preserving. Each model captures different aspects of the functional data and a different way to remove the noise.

SRSFseq: Shape Preserving Noise Removal (Shape). We assume that the observed warped density functions $\tilde\mu_{ij}$ follow:
$$ \tilde\mu_{ij} = \mu_{ij} \circ \gamma_{ij}, \qquad (3.5) $$
where $\gamma_{ij}: G \mapsto G$ is an orientation-preserving diffeomorphism corresponding to the warping noise. Using the phase-amplitude separation we are able to find the optimal, shape-preserving alignment $\hat\gamma_{ij}$ and use it to obtain the undistorted intensities $\hat\mu_{ij}$:
$$ \hat\mu_{ij} = \mu_{ij} \circ \gamma_{ij} \circ \hat\gamma_{ij}^{-1}. \qquad (3.6) $$

The warping $\gamma_{ij}$, representing the phase noise, may have a significant influence on the differential expression test; thus, to properly evaluate the test statistic, it is necessary to conduct the inference with the unwarped density functions $\hat\mu_{ij}$. Aligning the intensities, by composing $\tilde\mu_{ij}$ with the warping function $\hat\gamma_{ij}^{-1}$, reduces the phase noise. The aligned intensities $\hat\mu_{ij}$ can then be used in the functional ANOVA test statistic and are modelled analogously to the base model:
$$ \hat\mu_{ij} = \mu_j + \epsilon_{ij}. \qquad (3.7) $$
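In practice the alignment can be computed with the fdasrvf R package (maintained at the "fdasrvf" Github repository acknowledged earlier); to our understanding its `time_warping()` function performs the phase-amplitude separation of [57], but the argument and field names below should be verified against the package documentation.

```r
# Shape-preserving noise removal (sketch) via SRSF phase-amplitude
# separation. `f` is a p x n matrix of filtered densities (columns =
# samples) on the common grid `t`. The field names fn and gam follow
# our reading of the fdasrvf docs and are assumptions to verify.
library(fdasrvf)
res   <- time_warping(f, time = t)
f_hat <- res$fn   # aligned (amplitude) density functions mu-hat
gam   <- res$gam  # estimated warping functions gamma-hat
```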

The advantage of the warping noise reduction in the model is that we are able to eliminate misalignment between density functions that would mistakenly inflate or deflate the values of the test statistic. To visualize the issue we simulated six Gaussian curves which differ in amplitude or phase and compared them before and after alignment. In Figure 3.2 the curves were generated from two Gaussian intensities $\mu_{red}$ and $\mu_{black}$ differing significantly in the variance parameter. In panel A) the difference is not obvious from the point of view of the test, as the $L^2$ variability within each condition is comparable to the variability between conditions; after alignment the difference between conditions becomes apparent (panel B). The second scenario, shown in Figure 3.3, corresponds to a problem where the random distortions accidentally drive the intensities to appear different. This can occur only by chance, but due to the low sample sizes of NGS experiments and the large number of genomic regions tested, such events have a non-negligible probability of occurring, which has to be accounted for. As shown in Figure 3.3, six intensities were generated as distortions of the same Gaussian curve, but by accident three of them, consecutively generated, were shifted to the right (panel A). In such a case the difference between those conditions could be falsely called significant. After alignment the two groups of intensities are indistinguishable (panel B).


Figure 3.2: Six intensities generated from two conditions (red and black) with significantly different true base density functions $\mu_{red}$, $\mu_{black}$. A) The unaligned raw intensities. B) The same density functions after the phase noise removal procedure.


Figure 3.3: Six intensities generated from two conditions (red and black) with the same underlying true base density function $\mu_{red} = \mu_{black}$. A) The unaligned raw intensities. B) The same density functions after the shape noise removal procedure.

SRSFseq: Energy Preserving Noise Removal. In this model, as before, we assume that the original density functions are distorted by a random warping $\gamma_{ij}$. This time, however, the warping does not necessarily preserve the shape of the curve, but is constrained to maintain the energy (the $L^2$ norm). The intuition behind the energy-preserving model is the same as for the shape-preserving model, with the sole difference that the noise is introduced in an energy-preserving way. Accounting for energy-preserving noise has the advantage over shape-preserving noise that it can cope with noise yielding significantly different shapes. The cost, however, is that the less constrained (energy) noise removal procedure may also accidentally remove critical information from the data. We assume the following model for the distorted intensities:
$$ \tilde\mu_{ij} = \big( (\mu_j + \epsilon_{ij}) \circ \gamma_{ij} \big) \sqrt{\dot\gamma_{ij}}. \qquad (3.8) $$
Using the same SRSF phase-amplitude separation method [57] we obtain the optimal alignment $\hat\gamma_{ij}^{-1}$ between density functions that preserves the energy norm of each curve. The aligned model is then:
$$ \hat\mu_{ij} = \big( \tilde\mu_{ij} \circ \hat\gamma_{ij}^{-1} \big) \sqrt{\dot{\hat\gamma}_{ij}^{-1}} = \mu_j + \epsilon_{ij}. \qquad (3.9) $$
The obtained energy-aligned density functions $\hat\mu_{ij}$ can then be used in the functional ANOVA test statistic.
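The defining property of the energy-preserving action $(\mu \circ \gamma)\sqrt{\dot\gamma}$ is that it leaves the $L^2$ norm unchanged, by the change of variables $s = \gamma(t)$. A small numerical check in R (the example warping and curve are ours):

```r
# Numerical check that (mu o gamma) * sqrt(gamma') preserves L2 energy.
t    <- seq(0, 1, length.out = 1001)
dt   <- t[2] - t[1]
mu   <- dnorm(t, mean = 0.5, sd = 0.1)  # an example density curve
gam  <- t^2                             # an example warping of [0, 1]
dgam <- 2 * t                           # its derivative, known in closed form
mu_w <- approx(t, mu, xout = gam)$y * sqrt(dgam)
c(sum(mu^2) * dt, sum(mu_w^2) * dt)     # the two energies agree closely
```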

3.3 Results

3.3.1 RNAseq Expression Analysis with Base, Shape and Energy Models

To evaluate the information provided by SRSFseq, we compared the new functional models (denoted "Base", "Shape" and "Energy"; Sections 3.2.3 and 3.2.4) with several other differential expression methods: Cufflinks, DESeq2, DEXSeq and Limma-voom, using published HOXA1 knock-out RNA-seq data [58]. In our design a gene or exon is defined by the UCSC gene models. We treat each gene isoform as a separate gene, and we call a gene differentially expressed if at least one of its exons shows a change in the density function. The heat-maps in Figure 3.4 show the dissimilarity between the SRSFseq and count-based methods: each heatmap entry indicates the number of genes called differentially expressed by both methods (row and column). For all of the new models, there is little overlap with the count-based methods, while the SRSFseq methods overlap highly with one another. This is to be expected, as

SRSF normalization eliminates all count-based differences between samples. Interestingly, several genes identified specifically within our framework are related to developmental regulation, including COL1A1 and BAX, which have been previously implicated as targets of HOXA1 [40, 73]. What may be surprising at first is that the SRSFseq methods report a lower number of differentially expressed regions. This, however, is also to be expected, as differences in the shape of the read density occur less frequently than count-based differences.


Figure 3.4: Heat-maps of the overlaps between the lists of genes called differentially expressed by the SRSFseq models (Base, Shape, Energy) and the count-based methods, using significance level (A) $\alpha = 0.05$, (B) $\alpha = 0.01$.

New Differential Expression Patterns Uncovered. Figure 3.5A,B shows example genes (uc001bvt.2, uc003vec.2) that display a clear difference in the filtered read densities; this difference, however, is not captured by any count-based method, because the overall counts at these genes do not change significantly (Cufflinks: p=0.229, DESeq2: p=0.908, Limma-voom: p=0.983). We observed similar differential patterns in the 272 genes called differentially expressed ($\alpha = 0.05$) only by the base model. At the significance level of $\alpha = 0.01$, 30 out of 37 showed potential exon overlap (Figure 3.5B), where the UCSC genome browser indicates an overlap between HBP1 and COG5 exactly on the region where the density functions differ.

[Figure 3.5 panels report the SRSFseq p-values for the two example genes: (A) uc001bvt.2 - ANOVA 7e-04, shape-aligned 5e-04, energy-aligned 0.0028; (B) uc003vec.2 - ANOVA 0.0078, shape-aligned 0.0058, energy-aligned 0.0019.]

Figure 3.5: (A) Example of an exonic region called differentially expressed by all three SRSFseq models but not detected by any of the count-based methods. Two conditions: control (black), HOXA1 KO (red). Top panel: the point patterns over the reference genome obtained by mapping the first bp of each read; second panel: filtered density functions; third panel: aligned density functions according to the shape-preserving model; bottom panel: aligned density functions according to the energy-preserving model. The p-values reported by other methods for the whole gene are: Cufflinks: 0.229, DESeq2: 0.908, Limma-voom: 0.983. (B) A similar example; for convenience the UCSC genome browser screen-shot for the region is provided on top of the main figure. The comparison with the genome browser indicates that the new differential patterns detected by SRSFseq can be explained by the current knowledge about gene locations. The p-values reported by other methods: Cufflinks: 0.077, DESeq2: not reported, Limma-voom: not reported.

Advantage of the Shape- and Energy-Preserving Noise Removal. Accounting for the phase variability is designed to improve the results of the new method by controlling for variability in read distributions that would otherwise produce false positives or affect the statistical significance of the

truly differentially expressed genes that were not detected by the base model alone (nor by any count-based method) (Figure 3.6). Figure 3.6A shows an example gene where the noise removal procedure improves differential expression detection: aligning the density functions reduces the $SS_W$ component of the test statistic, which previously kept the statistic below the significance level $\alpha = 0.01$. Figure 3.6B shows an example gene where the noise removal procedure increases the p-value above the significance level of $\alpha = 0.05$, consistent with a lack of strong evidence for differential expression based on the read densities alone. In both instances (A, B) the noise removal improved the results by capturing either a false positive or a false negative.

Which noise removal is superior? In Figure 3.7 we highlight that, although noise removal is desirable and the overall performance improves compared with the base model, neither of the proposed alignments proves significantly better than the other. Figures 3.7A,B show the advantage of the energy-preserving alignment over the shape-preserving alignment in reducing the type I and type II errors. The third panel (Figure 3.7C), however, emphasizes that even though the energy-preserving alignment seems to perform better, its relatively weak constraint (constant energy, i.e., $L^2$ norm) may cause information loss after the noise removal. In particular, the energy-preserving alignment cannot distinguish between density functions that have the same energy even if their shapes are significantly different, as seen in the bottom panel of the figure.

3.3.2 Misalignment as Differences in Activity Patterns

The proposed functional framework describes a novel way of modelling the genomic distribution of reads by identifying differences in the shape of the read density functions between experimental conditions. To show the potential of SRSFseq, we exemplify how our generative models (energy- and shape-preserving alignment) can be extended by changing the roles of the aligning components in the model (the $\gamma$ functions). In the new setting we assume that the observed intensities

$\tilde\mu_{ij}$ arise as condition-specific shape changes $\gamma_j$ of the same true base density $\mu$, and we test whether the shape changes $\gamma_j$ differ significantly between conditions. The aligning functions $\gamma$ are no longer treated as noise; they now carry potentially significant information. In short, we assume that:

$$ \tilde\mu_{ij} = (\mu + \epsilon_{ij}) \circ \gamma_j, \quad \text{or equivalently} \quad \tilde\mu_{ij} = \mu_j + \epsilon_{ij}, \ \text{where } \mu_j = \mu \circ \gamma_j. \qquad (3.10) $$

[Figure 3.6 panels report the SRSFseq p-values for the two example genes: (A) uc010dkf.3 - ANOVA 0.0138, shape-aligned 0.0041, energy-aligned 0.0069; (B) uc010mcc.3 - ANOVA 0.0443, shape-aligned 0.0521, energy-aligned 0.1003.]

Figure 3.6: (A) Example of an exonic region called differentially expressed at the significance level of $\alpha = 0.01$ only after the shape noise is removed. Two conditions: control (black), HOXA1 KO (red). Top panel: the point patterns over the reference genome obtained by mapping the first base pair of each read. Middle panel: filtered density functions. Bottom panel: aligned density functions. (B) Example of an exonic region that was called differentially expressed by the base model but lost significance after applying the shape-noise removal procedure; the sums of squared distances were inflated due to the noise. Two conditions: control (black), HOXA1 KO (red). Top panel: the point patterns over the reference genome obtained by mapping the first base pair of each read. Middle panel: the observed filtered density functions. Bottom panel: the aligned density functions.

We aim to test the null hypothesis of no difference between the shape changes of any two conditions $j_1, j_2$: $H_0: \gamma_1 = \gamma_2 = \ldots = \gamma_k$, or alternatively $H_0: \forall\, j_1, j_2 = 1, \ldots, k,\ j_1 \ne j_2:\ \dot\gamma_{j_1 j_2} = 1$, where $\gamma_{j_1 j_2} = \gamma_{j_1} \circ \gamma_{j_2}^{-1}$.

[Figure 3.7 panels report the SRSFseq p-values for the three example genes: (A) uc004fad.1 - ANOVA 0.0617, shape-aligned 0.0647, energy-aligned 0.0087; (B) uc002fdb.2 - ANOVA 0.0479, shape-aligned 0.0363, energy-aligned 0.3883; (C) uc001bvs.3 - ANOVA 6e-04, shape-aligned 6e-04, energy-aligned 0.0885.]

Figure 3.7: (A) Energy-preserving noise removal improves the detection of differences compared to the shape-preserving alignment and the base model, by capturing a false positive. (B) Energy-preserving noise removal improves the detection of differences compared to the shape-preserving alignment and the base model, by capturing a false negative. (C) Energy-preserving noise removal causes a loss of information and fails to detect a significant difference between expression patterns; this difference is successfully captured by the shape-preserving alignment and the base model.

As $\gamma_{j_1 j_2}$ can be estimated by the phase-amplitude separation algorithm [57], we can measure the magnitude of local shifts between conditions, locate those shifts, or quantify the differences between the aligned patterns. As a consequence of the design, this particular model has little application in direct exon-level expression analysis, because there is no known biological interpretation for RNA-seq read densities being consistently shifted. It opens, however, the possibility of analyzing shifts or shape changes in genomic data problems for which very few approaches are currently available, e.g., differences in the positions of nucleosomes [9, 16, 42, 13] (see the next chapter).
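A sketch of the pairwise estimation: align the two condition-mean densities with fdasrvf's pairwise alignment. The function and field names reflect our reading of the package and should be treated as assumptions to verify against its documentation.

```r
# Estimate the condition-to-condition warping gamma_{j1 j2} (sketch) by
# aligning the mean density of condition 2 to that of condition 1.
# `mu_bar_1` and `mu_bar_2` are condition-mean curves on the grid `t`.
library(fdasrvf)
pa <- pair_align_functions(f1 = mu_bar_1, f2 = mu_bar_2, time = t)
gamma_12 <- pa$gam  # estimated relative warping between the two conditions
```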

3.4 Discussion

We have proposed a new framework (SRSFseq) for investigating NGS data through a functional interpretation. We have shown that the new approach can be successfully used in

analysing NGS outcomes and can uncover information not possible to decode with the state-of-the-art methods. We have equipped SRSFseq with two functional noise removal procedures, improving the type I and type II errors. Interestingly, if we perform the same analysis on whole spliced genes instead of on separate exons, we still obtain significantly different lists of genes called differentially expressed, compared to both exon-level SRSFseq and all count-based methods. The results can be seen in Figure 3.4. In addition, we have shown the flexibility of SRSFseq and have given examples of how it can be tuned to address the experimental questions. In the filtering step we have used the simple kernel density estimator; however, we would like to point out that other density estimation techniques can be used depending on the particular application, especially when one needs to account for over-dispersion of read counts or the read clustering problem. We would like to emphasize that, in the case of the RNAseq data, the new framework, due to its normalization procedure, will not detect gene-wide count-based differences. As such, it should not be viewed as a replacement for methods that detect global differences in gene expression, but it can be effectively used to supplement their results where overlapping genes result in false negatives. Software in the form of an R script, used to obtain the results, is available in a reproducible and adjustable way on a public Github repository: https://github.com/FSUgenomics/SRSFseq
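For readers who want to experiment with the filtering step described above before turning to the full scripts, a minimal sketch in R is given below; the read coordinates are simulated and the bandwidth is purely illustrative, not the setting used in our experiments.

## Sketch of the filtering step: turning read start coordinates on an exon
## into a density function (simulated coordinates; illustrative bandwidth).
set.seed(1)
exon  <- c(135954500, 135957000)             # hypothetical exon boundaries
reads <- runif(200, exon[1], exon[2])        # simulated read start positions
dens  <- density(reads, bw = 50, from = exon[1], to = exon[2], n = 512)
## density() integrates to approximately 1, which removes the library-size
## bias; other estimators could be substituted here, e.g. to handle
## over-dispersion or read clustering.
plot(dens, main = "Filtered read density (sketch)")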

CHAPTER 4

HOW CHANGES IN SHAPE OF NUCLEOSOMAL DNA NEAR TSS INFLUENCE CHANGES OF GENE EXPRESSION

4.1 Introduction

In this chapter we focus on the mechanism by which nucleosomes influence gene activity. We hypothesize that the differences in nucleosomal DNA shapes, and the locations of these differences, pertain to differential gene expression. By obtaining the locations of rearrangements of the nucleosomal DNA, one can identify genes that are possibly being repressed or activated. In Figure 4.1, we have depicted a schematic example of a DNA-histone interaction with nucleosome rearrangements that can alter gene expression. Our goal is to understand the mechanisms that connect the changes of the nucleosome near the Transcription Start Site (TSS) with changes in expression of the gene that follows the TSS. Differential gene expression is well established in the area of genomics; the challenge arises in detecting and quantifying the changes in nucleosome positioning. We establish a complete, stand-alone, novel framework for analyzing NGS experiments (in particular, to model nucleosome positioning). The framework allows us to access nucleosomal DNA features viewed as shapes. In this way we avoid the information loss caused by the normalization procedures used for NGS experimental data. We incorporate the SRVF functional data analysis model to capture, quantify, and classify different nucleosome shapes. We show that our model can capture new, biologically meaningful differences in nucleosomes and that it is capable of relating them to the changes in gene expression of the neighboring genes. The novelty and advantage of the SRSF-based method is that it captures the mathematical properties that can be visualized as scaling and stretching of shapes. These mathematical constructs are then used to determine the shapes of nucleosomes, quantify how they differ from each other, and finally check whether a particular shape change is related to gene activity. Contrary to standard analysis, the analysis is done not on numbers, but on shapes.

The unit of information is not a real-valued variable, but a whole function representing the shape of a nucleosome. In the final step we design the statistic that tells us whether the stretching or scaling necessary to align nucleosomes is significantly related to changes in gene activity.

4.2 Methods

We hypothesize that the differences in shapes of read densities around nucleosomes near the TSS pertain to differential gene expression by controlling DNA accessibility. By obtaining the locations of nucleosomes, one can identify genes possibly being repressed or activated. Our goal is to establish the connection between the changes of nucleosome positioning in the proximity of the Transcription Start Site (TSS) and changes in expression of genes located directly after the TSS. Differential gene expression is well established in the area of genomics. The challenge arises in detecting and quantifying the changes in nucleosome positioning. In this section we set the scene for the mathematical framework that solves this problem.

Figure 4.1: DNA shift on a nucleosome as presented in [63]. The figure describes three different setups that can occur in DNA positioning around the nucleosome. The first row shows DNA wrapped around the nucleosome tightly or forming a loop (shift). The last panel shows how the "shift" in DNA can allow Transcription Factors (TF) to bind and initiate transcription. The middle row shows DNA behavior in a regular, unprotected state. Figure adapted and modified from [63].

4.2.1 Experimental Description

The experiments were conducted by the Center of Genomics and Personalized Medicine at Florida State University. The experimental protocol to obtain the data is the following. The DNA material was collected from two conditions (control and treatment), with two samples in each condition. Next, the DNA material is subjected to heavy MNase digestion, leaving only highly preserved fragments. Those fragments are presumed to come from the "protected" nucleosome regions. The extracted DNA is then fed into an Illumina HiSeq 2000 sequencer. The result comes in the form of four libraries of short reads corresponding to two samples per each of the two conditions. The samples are then mapped to the reference genome (hg38 [22, 25]) using the fast short-read aligner Bowtie2 [28], configured for "phred" quality scores of 30. From the quality-filtered reads we extract and sort the coordinates of reads that fall into TSS regions. The TSS regions are specified as intervals of the reference genome defined by the Ensembl TSS coordinate $\pm$1000 bp.

4.2.2 Mathematical Model

In this step, using point process filtering, the recorded coordinates of mapped reads are transformed into an intensity function and normalized to a density to remove the library-size bias. The densities are the final product of the preprocessing step; they are then used in the mathematical modeling, the algorithm, and the prediction. The densities are obtained using Gaussian kernel estimation with weighted observation influence. Weights are assigned by a prior Gaussian kernel density estimated on the same data without weighting. This is a regularizing step, which improves the data quality for the shape alignment procedure that follows. To analyze the data we use a two-step hierarchical approach to account for the within-group and the between-group variability. In both cases we investigate the phase and amplitude changes. To account for within-group variability we utilize the phase noise removal model. In each of the two conditions ($j \in \{1,2\}$), we assume that the observed $i$-th sample ($i \in \{1,\ldots,n_j\}$) of the read density functions $\mu_{ij}$ follows:

$\mu_{ij} = (\mu_j + \epsilon_{ij}) \circ \gamma_{ij}$ (4.1)

where $\mu_j$ is the true deterministic underlying signal (to be estimated), $\gamma_{ij}$ is the random diffeomorphism representing the warping noise, and $\epsilon_{ij}$ is a Gaussian process reflecting the amplitude noise.

With this model setup, we use the estimated densities as input for the SRSF alignment algorithm for noise removal and obtain the estimate $\hat\mu_j$. Such pre-processed estimates are then used as the "observed" input for the second-step model, given by:

$\hat\mu_j = (\mu + \epsilon_j) \circ \gamma_j$ (4.2)

This model also accounts for phase and amplitude variability, and the estimation of the model parameters is done with the SRSF alignment algorithm. The curve $\hat\mu_j$ is the observed process, $\mu$ is a deterministic function, $\epsilon_j$ is a Gaussian process reflecting the amplitude noise, and $\gamma_j$ is a condition-specific warping. By model design, the difference is now in the goal: rather than removing the warping noise and estimating $\mu$, we want to estimate $\gamma_j$ and use the phase distance as the statistic describing the magnitude of the change in nucleosome positioning.

4.2.3 Algorithm Description

Interpreting the experimental results as intensities over the genomic region of interest (surrounding the TSS) moves the statistical analysis into function spaces. In our approach, each intensity (four samples: two conditions, two samples per condition) is understood as a shape, which can undergo scaling and stretching. Quantifying the difference between shapes, and capturing the modes of variability in shapes between different conditions, is key to predicting and controlling gene activity through nucleosome positioning. The two-step alignment algorithm used to obtain the statistic of the size of the nucleosomal DNA rearrangement is described below and can be visualized through the four panels of Figure 4.2.

Algorithm 2 (Two-step alignment algorithm). Let $P_{ij} = (p_{ij}^1, p_{ij}^2, \ldots, p_{ij}^{n_{ij}})$ represent the vector of read coordinates in the $i$-th sample, in the $j$-th condition ($j \in \{black, red\}$). The vector elements $p_{ij}^k$ represent the position of the read on the TSS interval.

1. For each point pattern $P_{ij}$, estimate the weight function $w_{ij}$ of the coordinates as a density function using the Gaussian kernel approximation.

2. For each point pattern $P_{ij}$, estimate the read density function $\mu_{ij}$ using weighted Gaussian kernel density estimation with weights defined by the weight function $w_{ij}$. The weight of the coordinate $p_{ij}^k$ is given by $w_{ij}(p_{ij}^k)$.

3. For each condition, estimate the within-condition averages $\hat\mu_j$ using the SRSF alignment algorithm under the model $\mu_{ij} = (\mu_j + \epsilon_{ij}) \circ \gamma_{ij}$.

4. Estimate the phase $\hat\gamma(black, red)$ between $\hat\mu(black)$ and $\hat\mu(red)$ using the SRSF alignment algorithm under the model $\hat\mu_j = (\mu + \epsilon_j) \circ \gamma_j$.

5. Calculate the phase distance for $\hat\gamma(black, red)$.

The phase distance quantifies the change in the nucleosome arrangement.
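As a concrete illustration of steps 1 and 2 of the algorithm above, a minimal numerical sketch in R is given below; the read coordinates are simulated, the kernel settings are illustrative, and steps 3-5 would then be carried out with an SRSF alignment implementation such as the one in the "fdasrvf" repository.

## Sketch of steps 1-2: weighted Gaussian kernel density estimation, with
## weights taken from a prior (unweighted) kernel estimate on the same reads.
set.seed(2)
tss  <- c(-1000, 1000)                       # TSS interval (+/- 1000 bp)
p_ij <- rnorm(300, mean = -150, sd = 120)    # simulated read positions

w_fun <- density(p_ij, from = tss[1], to = tss[2], n = 512)   # step 1: weight function
w_k   <- approx(w_fun$x, w_fun$y, xout = p_ij)$y              # w_ij at each p_ij^k
w_k   <- w_k / sum(w_k)                                       # weights must sum to 1
mu_ij <- density(p_ij, weights = w_k, from = tss[1], to = tss[2], n = 512)  # step 2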

4.3 Results

For every known TSS in the human genome, we have calculated the shift statistic score. First, we look at a particular example of a TSS that has a DNA shift between conditions in the top 0.001 quantile of all recorded shifts. This is shown in Figure 4.2. The figure shows a clear example that the proposed two-step algorithm can correctly capture changes in relative nucleosome positioning on DNA. The changes are visible in panels 2 and 3 of Figure 4.2. A particular advantage of the algorithm, besides capturing the TSS, is that it also captures the $\gamma_j$ reflecting the particular changes that occur, in terms of magnitude and location. The magnitude and the location of the changes can be extracted by looking at the alignment curve used to calculate the statistical score of the shift. This is shown in red in panel 4. If no alignment is needed (there is no change in nucleosome positioning), the alignment curve should coincide with the black horizontal line. On the contrary, the larger the deviation from the black horizontal line, the larger the change in position. The coordinates of peaks and valleys on the red curve give information on the location of the change, whereas the net area between the black and red curves gives the amount of observed shift per genomic region in nucleotide base pairs (bp). This single example does not, of course, imply that the framework always works correctly. Thus, next we move to the general results considering all TSS regions and their nucleosome shifts. Figure 4.3 indicates that the change in nucleosome positioning, calculated with the SRSF alignment technique, explains changes in differential gene expression. That is, the larger the shift in the SRSF sense, the more likely the genes are to be called differentially expressed. The differential expression statistic was calculated for each gene following the TSS region, using the DESeq2 software package [36].
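The net-area statistic itself is straightforward to compute once the two warping functions have been discretized. The sketch below is one way to do it in R, assuming gamma_red and gamma_black are the condition warpings from the two-step SRSF alignment evaluated on a common unit grid (hypothetical inputs); the numerical inverse and the forward-difference derivative are simplifications.

## Sketch of the shift statistic:
## integral of (1 - d/dt gamma_red(gamma_black^{-1}(t)))^2 over [0, 1].
shift_statistic <- function(t, gamma_red, gamma_black) {
  g_black_inv <- approx(gamma_black, t, xout = t)$y    # numerical inverse of gamma_black
  g_rel <- approx(t, gamma_red, xout = g_black_inv)$y  # gamma_red(gamma_black^{-1}(t))
  d_rel <- c(diff(g_rel) / diff(t), 1)                 # crude derivative; last value padded
  sum((1 - d_rel)^2 * c(diff(t), 0))                   # Riemann sum of the squared deviation
}
t <- seq(0, 1, length.out = 201)
shift_statistic(t, t + 0.03 * sin(pi * t), t)          # toy example: a modest shift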

[Figure 4.2 graphic: four stacked panels over reference genome positions 32664000-32669000 at the TSS of gene ENST00000418548.3: (1) mapped reads by sample index, (2) read densities, (3) averaged vs. aligned read densities, (4) local shift size necessary for alignment.]

Figure 4.2: Illustration of the two-step alignment algorithm. Panel 1: gathering short read coordinates and mapping them to the reference genome. Each row represents a different sample; there are two samples per two conditions (red, black), and each dot represents a read position on the reference genome. Panel 2: estimating the densities for each sample. Mathematical model: the observed $i$-th sample in the $j$-th condition is represented by $\mu_{ij} = \mu_j \circ \gamma_{ij} + \epsilon_{ij}$, where $\gamma_{ij}$ is a random diffeomorphism representing the compositional noise and $\mu_j$ is the true, unknown nucleosomal DNA shape specific to the $j$-th condition. Panel 3 (part I): first-step alignment, estimating $\hat\mu_j$ using the SRSF. Solid black and red curves are the SRSF-aligned condition-specific averages. Mathematical model: $\mu_j = \mu \circ \gamma_j + \epsilon_j$, where $\gamma_j$ is a condition-specific change of shape. Panel 3 (part II): second-step alignment, estimating $\gamma_j$ using the SRSF framework to find the optimal $\hat\gamma_j$. The dashed red curve represents the result of the alignment between the black and the red conditions. Panel 4: measuring the change of the shape between the red and the black: $\gamma_{red}$ vs. $\gamma_{black}$. The utilized test statistic is the net area between the red and the black curves: $\int_0^1 \left(1 - \frac{d}{dt}\,\gamma_{red}(\gamma_{black}^{-1}(t))\right)^2 dt$

[Figure 4.3 graphic: proportion of differentially expressed genes vs. shift size; x-axis: fraction of the largest observed shift (0.70-1.00); y-axis: proportion of differentially expressed genes with at least that shift; three curves for significance levels a = 0.1, 0.05, and 0.01.]

Figure 4.3: The figure describes the effect of the nucleosome shift on the chance of differential gene expression. The x-axis indicates what percentage of the largest observed shift is detected. The y-axis reflects the proportion of genes that are called differentially expressed with at least x% of shift detected, among all genes that have at least an x% shift. The shift amount is calculated on the DNA region corresponding to the nucleosome located directly before the gene. At every significance level (0.1, 0.05, 0.01) the proportion of DE genes increases as the magnitude of the shift increases.

4.4 Discussion

We have proposed a new framework utilizing novel shape analysis techniques, which helps explain the relationship between changes in nucleosome positioning near the TSS and differential expression of the gene following the TSS. The framework can pinpoint the changes of nucleosome positioning and describe them in a quantitative and qualitative manner. The mathematical model, resting on the foundations of the SRSF alignment procedure, can be adapted to various research problems involving DNA rearrangements. The proposed shape analysis technique gives more insight into gene expression and gene regulation. It shows that gene expression is regulated by changes in DNA arrangements within nucleosomes.

These changes are especially influential if localized in the vicinity of the TSS. Another consequence is that, as nucleosome positioning is highly preserved between individuals, it can be treated as a genomic feature responsible for genetic variability (in the same manner as SNPs are). The shape analysis framework, given enough computational power, can be expanded to whole-genome analysis of nucleosome positioning. This could lead to the discovery of nucleosome-related genomic features. The initial results seem promising, but the two-step SRSF alignment procedure leaves room for improvement. In particular, the SRSF alignment algorithm might give unwanted results if the compared functions have overlapping areas of zero derivative. In such cases, the alignment might produce estimates of the $\gamma$ warping functions with larger than expected deviations in the zero-derivative regions. This is due to the fact that the $L^2$ distance between constant curves does not change, regardless of the changes in the domain. This can potentially alter the statistic, which quantifies the magnitude of the nucleosome rearrangement. An example of such an unexpected alignment can be seen in panel 4 of Figure 4.2, towards the right end of the TSS. In the figure, we can observe a relatively large spike in the alignment cost that does not correspond to any visible changes in panels 1, 2 and 3. This problem can be addressed by adding a regularization term to the cost function of the SRSF alignment algorithm that penalizes large derivatives of the warping functions (see [55]). An alternative approach would be to exclude the regions of zero derivative when calculating the phase distance. If $M$ is the union of all zero-derivative regions, then the statistic used to calculate the changes in nucleosome positioning is given by:

$\int_{[0,1]\setminus M} \left(1 - \frac{d}{dt}\,\gamma_{red}(\gamma_{black}^{-1}(t))\right)^2 dt$ (4.3)

In Figure 4.4 we show examples of two TSS regions for which the changes in the nucleosome arrangement were calculated after truncating the zero-derivative regions. On the scale of all TSS regions, the sets of captured differences did not vary significantly between the truncated and the original phase distance statistic. From the application point of view, both approaches seem to address the zero-derivative alignment problem. The drawback is that neither of them (neither the penalty nor the truncation) has a sound interpretation in the SRSF framework. A formal way of tackling this problem remains to be found.
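A sketch of how the truncation in (4.3) could be implemented numerically is given below (in R); dens_deriv, the derivative of the condition-average density used to locate the zero-derivative set $M$, and the tolerance defining "zero" are assumptions of this illustration.

## Sketch of the truncated statistic (4.3): mask the zero-derivative set M
## before integrating the squared deviation of the relative warping.
truncated_shift <- function(t, gamma_red, gamma_black, dens_deriv, tol = 1e-6) {
  keep  <- abs(dens_deriv) > tol                       # grid points in [0,1] \ M
  g_inv <- approx(gamma_black, t, xout = t)$y          # numerical inverse
  g_rel <- approx(t, gamma_red, xout = g_inv)$y        # gamma_red(gamma_black^{-1}(t))
  d_rel <- c(diff(g_rel) / diff(t), 1)                 # crude derivative; last value padded
  sum(((1 - d_rel)^2 * c(diff(t), 0))[keep])           # integrate outside M only
}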

[Figure 4.4 graphic: the four panels of the two-step alignment algorithm (mapped reads, read densities, averaged vs. aligned read densities, local shift size necessary for alignment) shown side by side for the TSS of gene ENST00000417811.2 (reference genome positions 29722000-29725000) and the TSS of gene ENST00000406910.6 (positions 112498000-112501000).]

Figure 4.4: Example of two TSS regions with large captured nucleosome rearrangements according to the two-step alignment algorithm. The rearrangement was quantified using the truncated phase distance. The two-step alignment algorithm was applied to the curves specified on the whole TSS domain. The statistic, reflecting the change in nucleosome positioning, was calculated only between the dashed vertical lines. For a detailed description of each panel, we refer the reader to Figure 4.2.

CHAPTER 5

THEORETICAL DEVELOPMENTS FOR THE SRSF FRAMEWORK

5.1 Introduction

The functional interpretation allows us to use the powerful SRSF alignment algorithm to uncover new information stored in genomic data. We have successfully used this framework to provide novel insights into exon-level gene expression and nucleosome positioning. In this chapter we explore the theoretical foundations of the SRSF model used. Behind the SRSF algorithm there is a mathematical model formulation that is needed to show the consistency of the alignment and estimation algorithm. These results solidify our confidence in the algorithm's performance and in the correct interpretation of the inference outcomes. The theory developed in [56] accounts for a mathematical model of the following structure:

$\mu_i = c_i\,\mu \circ \gamma_i + d_i$ (5.1)

where $c_i$ and $d_i$ are scalar random variables such that $Ec_i = 1$ and $Ed_i = 0$. This model accounts for scaling and uniform shifting of shapes. These assumptions are sometimes too strict for the data problems arising in genomics. For the purposes of our applications we investigate the properties of a more general model:

$\mu_i = (\mu + E_i) \circ \gamma_i$ (5.2)

where $E_i$ is a stochastic process with mean 0. Unfortunately, this model creates theoretical problems when considering the SRSF transformation, due to the derivative component. In particular, we would have to construct a differentiable stochastic process $E_i$ and specify its distribution. Instead, we propose to set the model directly after the SRSF transformation. The new model is:

Definition 7 (The SRSF model). Consider $n$ functional samples $\{q_i\}_{i=1}^n$ on the fixed interval $[0,1]$ and assume that they are realizations of a stochastic process defined in the SRSF space as follows:

$q_i = (q_0 + \epsilon_i, \gamma_i) = ((q_0 + \epsilon_i) \circ \gamma_i)\sqrt{\dot\gamma_i}$ (5.3)

where the $q_i$ are square integrable; the $\gamma_i$ are random, orientation-preserving diffeomorphisms of the interval $[0,1]$; the $\epsilon_i$ are iid (independent, identically distributed) Gaussian processes with mean 0 and a common covariance function $K(s,t)$ ($\epsilon_i \sim GP(0,K)$); and $q_0$ is a deterministic function representing the underlying mean of the model. Additionally, we assume that the $\gamma_i$ are drawn from a distribution with the "zero-mean" property, described as $E\gamma(t) = t$.
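A small simulation sketch of this generative model is given below (in R); the Gaussian process is approximated by a short random Fourier sum, and the random warping only approximately satisfies the zero-mean property, so all settings are purely illustrative.

## Simulation sketch of Definition 7: q0 plus truncated Gaussian-process noise,
## composed with a random warping via the SRSF group action.
set.seed(3)
t  <- seq(0, 1, length.out = 401)
q0 <- sin(2 * pi * t)                                  # deterministic mean SRSF
eps <- rowSums(sapply(1:5, function(r)                 # truncated K-L expansion of the GP
         rnorm(1, sd = 1 / r) * sqrt(2) * sin(r * pi * t)))
g   <- exp(0.3 * rnorm(1) * sin(pi * t))               # positive random function
gam <- cumsum(g) - g[1]; gam <- gam / gam[length(gam)] # warping with gam(0)=0, gam(1)=1
dgam <- diff(gam) / diff(t); dgam <- c(dgam, dgam[length(dgam)])
q_i  <- approx(t, q0 + eps, xout = gam)$y * sqrt(dgam) # the action (q0 + eps, gam)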

The model is designed to estimate the true underlying signal $q_0$, given the observable $q_i$. The model parameters can be heuristically identified by using the SRSF shape analysis algorithm [56]. The robustness of the estimates of the SRSF framework was only established for the scalar noise model (equation (5.1)) and cannot be taken for granted for the model defined in Definition 7. To be confident in the model results in our applications, we investigate the sensitivity of the model to additive stochastic process noise. An additional motivation behind investigating the model's robustness is designing an ANOVA-like statistic in the SRSF space. This requires imposing a distribution on a space of functions, which is achieved by introducing a Gaussian process reflecting the additive noise. In this setup the SRSF model shares interesting properties with the standard functional ANOVA model described in [72]. Despite these similarities, imposing a test distribution for the ANOVA-like setup is not straightforward in the SRSF case. The problem is discussed in detail in Section 3.4 and a heuristic solution is provided.

5.2 Theoretical Results

5.2.1 Robustness in the Space $\mathcal{F}$

To show the influence of the noise on the estimation with the SRSF algorithm, we analyze each step of the algorithm separately. In each step we provide an upper bound on the estimation error. For the reader's convenience we provide the algorithm description (see [56] for details):

Definition 8 (SRSF mean estimation algorithm).

1. Compute the Karcher mean of the orbits $[q_1], [q_2], \ldots, [q_n]$. Denote it by $[\bar q]$.

2. Find the center of $[\bar q]$ with respect to $\{q_i\}$; call it $\bar q$.

3. For $i = 1, 2, \ldots, n$ find the optimal warping $\gamma_i^*$ by solving:

$\gamma_i^* = \operatorname{argmin}_{\gamma \in \Gamma} \|\bar q - (q_i, \gamma)\|$

4. Compute the aligned SRSFs $\hat q_i = (q_i, \gamma_i^*)$.

The outcome of the algorithm is the template $\bar q$, the warping functions $\{\gamma_i^*\}$, and the aligned SRSF functions $\{\hat q_i\}$. The problem at the base of the robustness property can be rephrased as: what can we say about the magnitude $\|\gamma_\epsilon - \gamma_{id}\|$, and in what sense (what norm), if

$\gamma_{id} = \operatorname{argmin}_\gamma \|q - (q,\gamma)\|$ and $\gamma_\epsilon = \operatorname{argmin}_\gamma \|q + \epsilon_1 - (q + \epsilon_2, \gamma)\|$ (5.4)

and $\gamma_\epsilon$ is estimated with the SRSF algorithm. The key lemmas that allow us to provide the error bounds, relating to the continuity of the optimal warping estimate with respect to the additive Gaussian noise, are listed in Section 5.4.1. The results require the following additional smoothness and regularity assumptions:

1. Regularity of warping functions:

Definition 9 ($\Gamma_L$).

$\Gamma_L = \left\{\gamma \in \Gamma \;\middle|\; \frac{1}{L} \le \dot\gamma \le L\right\}$ (5.5)

For the rest of the chapter we restrict all warping functions to have uniformly bounded derivatives; thus we will be assuming that $\gamma \in \Gamma_L$.

2. Smoothness assumption: the SRSF transforms of the considered functions have to have the following property: let $F(t) = \int_0^t q^2(s)\,ds$; then $F$ is bi-Lipschitz.

Equipped with these essential assumptions and a series of minor observations listed in Section 5.4.1, we are ready to tackle the algorithm's robustness to additive Gaussian noise in a step-by-step manner. We thus have:

1. Robustness in the first step is governed by the following results:

Lemma 1 (Robustness of the mean orbit estimation). Let $[\bar q] = \operatorname{argmin}_{[q]} \sum_{i=1}^n d^2_{Elastic}([q],[q_i])$ and assume that we are working with the generative model mentioned above (Definition 7). Then:

$d^2_{Elastic}([\bar q],[q_0]) \le \frac{8}{n}\sum_{i=1}^n \|\epsilon_i\|^2$ (5.6)

Proof. See Section 5.4.2, Lemma 12.

Additionally, we have:

Corollary 2 (Robustness of the mean orbit estimation). Let $\bar q \in [\bar q]$ be an orbit representative with an associated, unknown warping $\gamma_0$ such that $\gamma_0 = \operatorname{argmin}_\gamma \|\bar q - (q_0,\gamma)\|$. Then, despite $\gamma_0$ not being given directly, we can use it implicitly and, with Lemma 1, claim that:

$d([\bar q],[q_0]) = \inf_\gamma \|\bar q - (q_0,\gamma)\|_2 = \|\bar q - (q_0,\gamma_0)\|_2 \le \sqrt{\frac{8}{n}\sum_{i=1}^n \|\epsilon_i\|^2}$ (5.7)

Thus, there exists $\bar\epsilon = o\!\left(\sqrt{\frac{8}{n}\sum_{i=1}^n \|\epsilon_i\|^2}\right)$ such that

$\bar q = (q_0 + \bar\epsilon, \gamma_0)$, or $\|\bar q - (q_0,\gamma_0)\|_2 \le \sqrt{\frac{8}{n}\sum_{i=1}^n \|\epsilon_i\|^2}$ (5.8)

2. Using the results in Section 5.4.2 (Corollary 3) and the following lemma, we show the robustness of the center-of-the-orbit estimation procedure.

Lemma 2 (Robustness of the center of the orbit). Let again $\bar q \in \operatorname{argmin}_q \sum_{i=1}^n d^2_{Elastic}([q],[q_i])$, defined as before. With $\tilde\gamma_i = \operatorname{argmin}_\gamma \|(q_i,\gamma) - (\bar q,\gamma_0)\|$, let $\bar\gamma_n$ denote the Karcher mean of the $\tilde\gamma_i$. Then:

$\bar\gamma_n = \gamma_0 + \epsilon_\gamma + \delta_n$ and $\tilde\gamma_i = \gamma_i^{-1} \circ \gamma_0 + \epsilon_\gamma$ (5.9)

where $\epsilon_\gamma = o(32\|q_0\|^2(\|\bar\epsilon\|^2 + \|\epsilon_i\|^2))$ and $\delta_n$ is the LLN convergence rate from Corollary 3.

Proof. See Section 5.4.2, Lemma 13.

3. Robustness of the optimal warping.

Lemma 3 (Robustness of the optimal warping $\gamma^*$). Define:

$\hat q_i = (q_i, \tilde\gamma_i \circ \bar\gamma_n^{-1}) = ((q_0 + \epsilon_i, \gamma_i), \tilde\gamma_i \circ \bar\gamma_n^{-1}) = (q_0 + \epsilon_i,\ \gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1})$ (5.10)

Let also

$\gamma_i^* = \operatorname{argmin}_\gamma \|\hat q_i - (q_0, \gamma)\|$ (5.11)

Then

$\gamma_i^* = \gamma_{id} + \tilde\delta_n + \tilde\epsilon_\gamma$, where $\tilde\epsilon_\gamma = o(32\|q_0\|^2\|\epsilon_i\|^2)$ and $\tilde\delta_n = o\!\left(\sqrt{\frac{2\log(\log n)}{n}}\right)$ (5.12)

Proof. See Section 5.4.2, Lemma 15.

4. Robustness of aligned functions.

Theorem 2 (Robustness of the aligned functions in the space $\mathcal{F}$). Let $q_i$ and $\hat q_i$ be defined as previously (Lemma 3) and consider the following inverse SRSF functions:

$F(t) = \int_0^t q_0(s)\,|q_0(s)|\,ds$

$F_i(t) = \int_0^t (q_0(s) + \epsilon_i(s))\,|q_0(s) + \epsilon_i(s)|\,ds = F(t) + o_\epsilon(1)$

$\hat F_i(t) = \int_0^t \hat q_i(s)\,|\hat q_i(s)|\,ds$

Then $\|\hat F_i - F\| = o(\epsilon, n)$.

Proof.

$\hat F_i = F_i \circ (\gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1} + \epsilon_\gamma) = (F + o_\epsilon(1)) \circ (\gamma_{id} + o_{\epsilon,n}(1)) = F + o_{\epsilon,n}(1)$ (5.13)

The first equality utilizes the definition of $\hat F_i$ and, consequently, of $\hat q_i$. The second equality arises from the result in Lemma 3 and the definition of the SRSF transformation of $F_i$. Finally, the last equality follows if understood in the $L^2$ sense.

5.2.2 Robustness in the SRSF Space

The above results provide only robustness in the function space $\mathcal{F}$. To achieve robustness results in the full SRSF sense, we need more assumptions and a slightly different formulation of the problem. Robustness in the SRSF space refers to finding bounds for $\|\hat q_i - q_0\|_2$. This can be translated to an equivalent statement in the following manner:

$\|\hat q_i - q_0\| = \|(q_0 + \epsilon_\gamma, \gamma^*) - q_0\| \le \|q_0 - (q_0, \gamma^*)\| + \|\epsilon_\gamma\|$ (5.14)

By adding and subtracting the term $q_0\sqrt{\dot\gamma^*}$, and using the triangle inequality followed by a change of variables $\gamma(t) \to t$ in one of the integrals, we have:

$\|q_0 - (q_0, \gamma^*)\| \le \|q_0(1 - \sqrt{\dot\gamma^*})\| + \|q_0 - q_0(\gamma^*)\|$ (5.15)

Combining equations (5.14) and (5.15):

$\|\hat q_i - q_0\| \le \|q_0(1 - \sqrt{\dot\gamma^*})\| + \|q_0 - q_0(\gamma^*)\| + \|\epsilon_\gamma\|$ (5.16)

From this setup we see that an additional assumption of Lipschitz continuity or equicontinuity will be sufficient to control the convergence of the middle term $\|q_0 - q_0(\gamma^*)\|$. Thus, the initial formulation is brought down to investigating $\|q_0(1 - \sqrt{\dot\gamma^*})\|$.

Theorem 3 (Robustness of the optimal warping $\gamma^*$ in SRSF space). In the setting of the previous section, but with the stronger convergence rate assumption for $\bar\gamma_n$, namely $\|(1, \bar\gamma_n) - (1, \gamma_0)\| = o(\epsilon, n)$, we have the following convergence:

$\|(1,\ \gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1}) - 1\| \to 0$ in $L^2$ (5.17)

Proof. Using Lemma 16 for $\tilde\gamma_i = \operatorname{argmin}_\gamma \|(q_0 + \epsilon_i,\ \gamma_i \circ \gamma \circ \gamma_0^{-1}) - (q_0 + \bar\epsilon)\|$, we have:

$\|(1,\ \gamma_i \circ \tilde\gamma_i \circ \gamma_0^{-1}) - 1\| = o(\|\epsilon\|)$ (5.18)

As a consequence we can write:

$\|(1,\ \gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1}) - 1\| \le C\|(1,\ \gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1}) - (1,\ \gamma_i \circ \tilde\gamma_i \circ \gamma_0^{-1})\| + \|(1,\ \gamma_i \circ \tilde\gamma_i \circ \gamma_0^{-1}) - 1\|$ (5.19)

$\le C\|(1,\ \gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1}) - (1,\ \gamma_i \circ \tilde\gamma_i \circ \gamma_0^{-1})\| + o(\|\epsilon\|)$ (5.20)

Using the property established in Lemma 17, we can remove the $\gamma_i \circ \tilde\gamma_i$ composition:

$\le \tilde C\,\|(1, \bar\gamma_n) - (1, \gamma_0)\| + o(\|\epsilon\|) \le o(\epsilon, n)$ (5.21)

where the last inequality is a consequence of the convergence assumptions for $\bar\gamma_n$.

5.3 Discussion

We have expanded the theoretical basis for statistical modeling in the SRSF space with additive stochastic process noise. For the described generative model we have provided robustness results for the parameter estimation. This can be viewed as the first step in establishing consistency and testing schemes for the generative model. Unfortunately, the consistency of the model had not yet been established at the time this work was completed. Some properties that might prove useful in establishing those results are mentioned in the following section.

5.3.1 Towards the ANOVA Test in the SRSF Space

We are working towards providing a statistical testing framework for functional data that accounts for random warping noise. Thus, we aim to design a statistical test for the SRSF models.

At the basis of our approach lies the Gaussian distribution imposed on the noise process $\epsilon_i$, with mean zero and common covariance function $K(s,t)$. We aim to use the following ANOVA setup

in an attempt to imitate a functional ANOVA test. Let $q_{ij}$ represent the observed $i$-th functional sample coming from the $j$-th condition as:

$q_{ij} = (q_j + \epsilon_{ij},\ \gamma_{ij}) = ((q_j + \epsilon_{ij}) \circ \gamma_{ij})\sqrt{\dot\gamma_{ij}}$ (5.22)

where:

1. $\epsilon_{ij}$ is the iid (independent, identically distributed) zero-mean Gaussian noise process with common covariance function $K(s,t)$;

2. $\gamma_{ij}$ is the warping noise (a random domain diffeomorphism);

3. $q_j$ is the true underlying deterministic, condition-specific function.

The difference from the standard functional ANOVA statistic for this model is that the mean statistic is calculated in the SRSF sense, that is, with phase alignment. Thus, for the $j$-th group the sample mean statistic is:

$\bar q_{.j} = \frac{1}{n_j}\sum_{i=1}^{n_j} \hat q_{ij} = \frac{1}{n_j}\sum_{i=1}^{n_j} (q_j + \epsilon_{ij},\ \gamma_{ij}^*)$ (5.23)

where $\gamma_{ij}^*$ is the optimal warping found with the SRSF alignment algorithm. In the case of testing the means of two groups of samples, we want to use the standard pivotal test statistic (see [72], Chapters 4 and 5). The standard pivotal test statistic relies on the distribution of the $L^2$ norm of the estimated mean. The distribution of the statistic can be calculated using the Karhunen-Loève theorem as follows:

$\|\bar\epsilon_{.j}\|^2 \sim \sum_{r=1}^m \lambda_r A_r$, where the $A_r$ are iid $\chi^2_1$ (5.24)

where $m$ indicates the number of positive eigenvalues (possibly infinite).
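The mixture distribution in (5.24) is easy to work with by simulation. A minimal sketch (in R, with purely illustrative eigenvalues) is:

## Sketch: simulate the null distribution in (5.24), a mixture of independent
## chi-square(1) variables weighted by the covariance eigenvalues.
lambda <- 1 / (1:20)^2                                 # assumed positive eigenvalues
stat   <- replicate(10000, sum(lambda * rchisq(20, df = 1)))
quantile(stat, 0.95)                                   # e.g. a 5%-level critical value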

In our case, instead of $\epsilon_{ij}$, we are working with additional diffeomorphism noise $(\epsilon_{ij}, \gamma_{ij})$. Despite the fact that the SRSF group action is by isometries, meaning $\|\epsilon_{ij}\| = \|(\epsilon_{ij}, \gamma_{ij}^*)\|$, we cannot generalize the approach to obtain the distribution of the mean statistic $\|\overline{(\epsilon_{.j}, \gamma_{.j}^*)}\|$. The issue preventing us from performing the same Karhunen-Loève expansion is that, after applying $\gamma_{ij}^*$ in the alignment procedure in each sample, the iid Gaussian noise process loses the common covariance property. Namely, if $\epsilon_{ij} \sim GP(0, K(s,t))$, then

$(\epsilon_{ij}, \gamma_{ij}^*) \sim GP(0, K^*)$, where $K^* = K(\gamma_{ij}^*(t), \gamma_{ij}^*(s))\sqrt{\dot\gamma_{ij}^*(t)\,\dot\gamma_{ij}^*(s)}$ (5.25)

An interesting result that motivates us to pursue this approach relies on the assumption of common eigenvalues of the covariance function rather than on whole covariance functions. In the lemmas below we show how the eigenvalues are invariant under SRSF warping.

Lemma 4 (Trace of the covariance operator is invariant under warping).

$Tr(K^*) = \int_0^1 K(\gamma_{ij}^*(t), \gamma_{ij}^*(t))\,\dot\gamma_{ij}^*(t)\,dt = \int_0^1 K(t,t)\,dt = Tr(K)$ (5.26)

Proof. Follows immediately from a change of variables.

Lemma 5 (Eigenvalues of the covariance operator are invariant under warping). Let $(\lambda, v)$ be an arbitrary eigenvalue/eigenfunction pair of the common covariance operator $K(\cdot,\cdot)$ of $\epsilon_{ij}$, that is:

$\int_0^1 K(s,t)\,v(s)\,ds = \lambda v(t)$ (5.27)

if and only if $(\lambda, (v, \gamma_{ij}))$ is also an eigenvalue/eigenfunction pair, but for the covariance operator $K^*$.

Proof. Follows from a simple check and a change of variables.
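For completeness, one way the check could be written out, using the action $(v,\gamma) = (v \circ \gamma)\sqrt{\dot\gamma}$ and $K^*(s,t) = K(\gamma_{ij}(s), \gamma_{ij}(t))\sqrt{\dot\gamma_{ij}(s)\,\dot\gamma_{ij}(t)}$ as in (5.25), is:

$\int_0^1 K^*(s,t)\,(v,\gamma_{ij})(s)\,ds = \int_0^1 K(\gamma_{ij}(s), \gamma_{ij}(t))\sqrt{\dot\gamma_{ij}(s)\,\dot\gamma_{ij}(t)}\; v(\gamma_{ij}(s))\sqrt{\dot\gamma_{ij}(s)}\,ds$

$= \sqrt{\dot\gamma_{ij}(t)}\int_0^1 K(u, \gamma_{ij}(t))\,v(u)\,du \qquad (u = \gamma_{ij}(s),\ du = \dot\gamma_{ij}(s)\,ds)$

$= \lambda\, v(\gamma_{ij}(t))\sqrt{\dot\gamma_{ij}(t)} = \lambda\,(v, \gamma_{ij})(t)$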

This, however, is not yet enough to claim the test distribution proposed in [72]. Even though we have shown that the eigenvalues of the warped covariance functions remain unchanged, we cannot proceed with the Karhunen-Loève expansion for the pivotal test statistic, as the eigenfunctions are not common. Establishing the proper testing distribution remains unsolved and is within the scope of our future investigations.

5.4 Auxiliary Lemmas, Proofs and Definitions

5.4.1 General Purpose Lemmas

Lemma 6 (Square norm inequality). For any $f, g \in L^2[0,1]$,

$\|f + g\|^2 \le 2\|f\|^2 + 2\|g\|^2$ (5.28)

Proof. A simple expansion and direct comparison gives the result:

$\|f + g\|^2 = \|f\|^2 + 2\langle f, g\rangle + \|g\|^2 \le 2\|f\|^2 + 2\|g\|^2$ (5.29)

Lemma 7 (Norm equivalence of the warping functions). The $L^p$ norms of the warping functions are equivalent. In particular we have:

$\frac{1}{2}\|\gamma_\epsilon - \gamma_0\|_2^2 \le \|\gamma_\epsilon - \gamma_0\|_1 \le \|\gamma_\epsilon - \gamma_0\|_2$ (5.30)

Proof. This follows from the fact that the $\gamma \in \Gamma$ are bounded and compactly supported.
"$\Rightarrow$": as every $\gamma \in \Gamma$ is bounded, we have:

$\|\gamma_\epsilon - \gamma_0\|_2^2 = \int_0^1 |\gamma_\epsilon - \gamma_0|\,|\gamma_\epsilon - \gamma_0| \le 2\gamma_{max}\,\|\gamma_\epsilon - \gamma_0\|_1$ (5.31)

"$\Leftarrow$": as every $\gamma \in \Gamma$ is compactly supported, by Hölder's inequality we have the standard result

$\|\gamma_\epsilon - \gamma_0\|_1 \le \|1_{[0,1]}\|_2\,\|\gamma_\epsilon - \gamma_0\|_2$ (5.32)

The proof can be generalized in a similar way to any $p$-norm, $p \ge 1$.

Lemma 8 (Invariance under simultaneous warping). For any $\gamma \in \Gamma$, the following inequalities hold:

$\frac{1}{L}\|\gamma_\epsilon \circ \gamma - \gamma_0 \circ \gamma\|_2^2 \le \|\gamma_\epsilon - \gamma_0\|_2^2 \le L\|\gamma_\epsilon \circ \gamma - \gamma_0 \circ \gamma\|_2^2$ (5.33)

$\frac{1}{L^2}\|\gamma \circ \gamma_\epsilon - \gamma \circ \gamma_0\|_2^2 \le \|\gamma_\epsilon - \gamma_0\|_2^2 \le L^2\|\gamma \circ \gamma_\epsilon - \gamma \circ \gamma_0\|_2^2$ (5.34)

The results can easily be generalized to any $p$-norm, using the norm equivalence Lemma 7.

Proof. Inequalities (5.33): the result follows from a simple change of variables and the Schwarz inequality, followed by the use of Definition 9. Both inequalities follow the same change-of-variables scheme:

$\int_0^1 (\gamma_\epsilon(\gamma(t)) - \gamma_0(\gamma(t)))^2\,dt = \int_0^1 (\gamma_\epsilon(s) - \gamma_0(s))^2\,\frac{ds}{\dot\gamma(\gamma^{-1}(s))} \le L\|\gamma_\epsilon - \gamma_0\|_{L^2}^2$ (5.35)

$\int_0^1 (\gamma_\epsilon(\gamma(t)) - \gamma_0(\gamma(t)))^2\,dt = \int_0^1 (\gamma_\epsilon(s) - \gamma_0(s))^2\,\frac{ds}{\dot\gamma(\gamma^{-1}(s))} \ge \frac{1}{L}\|\gamma_\epsilon - \gamma_0\|_{L^2}^2$ (5.36)

Inequalities (5.34): by the MVT we have:

$\forall t \in (0,1)\ \exists \xi_t \in (\gamma_\epsilon(t), \gamma_0(t)):\ |\gamma(\gamma_\epsilon(t)) - \gamma(\gamma_0(t))| = \dot\gamma(\xi_t)\,|\gamma_\epsilon(t) - \gamma_0(t)|$ (5.37)

Using the MVT result and the Schwarz inequality followed by Definition 9, we can write:

$\int_0^1 |\gamma(\gamma_\epsilon(t)) - \gamma(\gamma_0(t))|^2\,dt = \int_0^1 \dot\gamma(\xi_t)^2\,|\gamma_\epsilon(t) - \gamma_0(t)|^2\,dt \le L^2\|\gamma_\epsilon - \gamma_0\|_{L^2}^2$ (5.38)

The opposite side of the inequality also follows from the MVT, but instead of the upper limit for $\dot\gamma$ one uses the lower limit, resulting in the constant $\frac{1}{L^2}$:

$\int_0^1 |\gamma(\gamma_\epsilon(t)) - \gamma(\gamma_0(t))|^2\,dt = \int_0^1 \dot\gamma(\xi_t)^2\,|\gamma_\epsilon(t) - \gamma_0(t)|^2\,dt \ge \frac{1}{L^2}\|\gamma_\epsilon - \gamma_0\|_{L^2}^2$ (5.39)

Lemma 9 (Convergence of the inverse warping functions). For any $p \ge 1$, under the $L^p$ norm we have the following convergence results in $\Gamma$:

$\gamma_\epsilon \to \gamma_0 \iff \gamma_\epsilon^{-1} \to \gamma_0^{-1}$ (5.40)

with convergence rates:

$\frac{1}{L^4}\|\gamma_0^{-1} - \gamma_\epsilon^{-1}\|_2^2 \le \|\gamma_0 - \gamma_\epsilon\|_2^2 \le L^4\|\gamma_0^{-1} - \gamma_\epsilon^{-1}\|_2^2$ (5.41)

Proof. The result follows for the $L^1$ norm from a simple geometric argument. By the equivalence of norms (Lemma 7) we obtain the result for all $p \ge 1$. Without referring to the geometric argument, the proof is as follows:

$\|\gamma_\epsilon^{-1} - \gamma_0^{-1}\|_{L^2}^2 = \int_0^1 (\gamma_\epsilon^{-1}(t) - \gamma_0^{-1}(t))^2\,dt \le L\int_0^1 (\gamma_\epsilon^{-1} \circ \gamma_\epsilon \circ \gamma_0(s) - \gamma_0^{-1} \circ \gamma_\epsilon \circ \gamma_0(s))^2\,ds =$ (5.42)

$= L\|\gamma_0^{-1} \circ \gamma_0 \circ \gamma_0 - \gamma_0^{-1} \circ \gamma_\epsilon \circ \gamma_0\|_{L^2}^2 \le L^4\|\gamma_0 - \gamma_\epsilon\|^2$ (5.43)

By both parts of Lemma 8, the last inequality arises from the two compositions. The lower bound follows from the same argument, but using the lower bounds in Lemma 8 and in Definition 9.

Lemma 10 (Rate of convergence for $\gamma_\epsilon$). Let $\gamma_\epsilon = \operatorname{argmin}_\gamma \|q + \epsilon_1 - (q + \epsilon_2, \gamma)\|$. Then

$\|(q, \gamma_\epsilon) - (q, \gamma_{id})\| = \|q - (q, \gamma_\epsilon)\| \le 2\|\epsilon_1\| + 2\|\epsilon_2\| = o(\|\epsilon_1\| + \|\epsilon_2\|)$ (5.44)

Proof.

$\|q - (q, \gamma_\epsilon)\| \le \|q + \epsilon_1 - (q, \gamma_\epsilon)\| + \|\epsilon_1\| = \|(q + \epsilon_1, \gamma_\epsilon^{-1}) - q\| + \|\epsilon_1\|$

$\le \|q + \epsilon_1 - (q + \epsilon_2, \gamma_\epsilon)\| + \|\epsilon_1\| + \|\epsilon_2\|$

By choosing $\gamma_{id}$ for $\gamma_\epsilon$ (i.e., using the minimality of $\gamma_\epsilon$), we have:

$\le \|\epsilon_1 - \epsilon_2\| + \|\epsilon_1\| + \|\epsilon_2\| \le 2\|\epsilon_1\| + 2\|\epsilon_2\| = o(\|\epsilon_1\| + \|\epsilon_2\|)$

Lemma 11 ($L^2$, $\gamma$-continuity of the elastic metric). Let $F(t) = \int_0^t q^2(s)\,ds$. If $F$ is bi-Lipschitz (bi-Hölder?),

$\frac{1}{C}|t - s| \le |F(t) - F(s)| \le C|t - s|$ (5.45)

then

$32\|q\|^2(\|\epsilon_1\|^2 + \|\epsilon_2\|^2) \ge \|F - F \circ \gamma_\epsilon\|^2 \ge \frac{1}{C^2}\|\gamma_{id} - \gamma_\epsilon\|_2^2$ (5.46)

Proof. The lower bound follows immediately from the bi-Lipschitz property of $F$. It remains to show the upper bound.

$\|F - F \circ \gamma_\epsilon\|^2 = \int_0^1 (F(t) - F(\gamma_\epsilon(t)))^2\,dt = \int_0^1\left(\int_0^t q^2(s)\,ds - \int_0^t (q, \gamma_\epsilon)^2(s)\,ds\right)^2 dt$ (5.47)

By putting the absolute value inside the inner integral and then extending the limit of integration to $(0,1)$, we have

$\le \left(\int_0^1 |q^2(s) - (q, \gamma_\epsilon)^2(s)|\,ds\right)^2 \le \left(\int_0^1 |q(s) - (q, \gamma_\epsilon)(s)|\,|q(s) + (q, \gamma_\epsilon)(s)|\,ds\right)^2$ (5.48)

Now, using the Cauchy-Schwarz inequality,

$\le \|q - (q, \gamma_\epsilon)\|^2\,\|q + (q, \gamma_\epsilon)\|^2 \le 16\|q\|^2(\|\epsilon_1\|^2 + \|\epsilon_2\|^2 + 2\|\epsilon_1\|\|\epsilon_2\|) \le 32\|q\|^2(\|\epsilon_1\|^2 + \|\epsilon_2\|^2)$ (5.49)

where the last inequality is a consequence of Lemma 10 applied to $\|q - (q, \gamma_\epsilon)\|^2$ and Lemma 6 applied to $\|q + (q, \gamma_\epsilon)\|^2$.

5.4.2 Robustness in the Space $\mathcal{F}$

q (q,γ ) 2 q+(q,γ ) 2 16 q 2( � 2 + � 2 + 2 � � 32 q 2( � 2 + � 2) (5.49) ≤ || − � || || � || ≤ || || || 1|| || 2|| || 1|||| 2||≤ || || || 1|| || 2|| Where the last inequality is a consequence of Lemma 10 applied to q (q,γ ) 2 and Lemma 6 || − � || applied to q+(q,γ ) 2 || � || 5.4.2 Robustness in Space F n Lemma 12 (Robustness of orbit of ¯q). Let [¯q] = argmin[q] i=1 dElastic([q][qi]) and assume that we are working with the following generative model: �

qi = (q0 +� i,γ i) (5.50)

Then: n 8 d2([¯q],[q]) � 2 (5.51) 0 ≤ n || i|| �i=1 n 2 Proof. First we consider i=1 dElastic([qi][q]) for arbitrary SRSFq and use Lemma 6 to obtain the inequality: � n n n 2 2 2 d ([qi][q]) = inf q0 +� i (q,γ) inf 2 q0 (q,γ) + 2 �i (5.52) Elastic || − || ≤ γ || − || || || �i=1 �i=1 �i=1 Thus, when replacing [q] by [q0], we have: n n d2 ([q ][q ]) 2 � 2 (5.53) Elastic i 0 ≤ || i|| �i=1 �i=1 Now lets replace the arbitraty [q] with the minimizer of the sum of square distances - [¯q]. Then n n n 2 4 8 d2([¯q][q]) (d2([¯q],[q]) +d 2([q ],[q ])) d2([q ],[q ]) � 2 (5.54) 0 ≤ n i i 0 ≤ n i 0 ≤ n || i|| �i=1 �i=1 �i=1

68 Thefirst inequality arises form averaging the result in Lemma 6 over allq i. The second inequality is

obtained by replacing the minimizer ¯q withq0 in thefirst sum. The last inequality is a consequence of inequality 5.53.

Lemma 13 (Robustness of the center of the orbit, $\bar q = (q_0 + \bar\epsilon, \gamma_0)$). Choose one element of the orbit, $\bar q \in [\bar q]$, with $\gamma_0$ such that $\gamma_0 = \operatorname{argmin}_\gamma \|\bar q - (q_0, \gamma)\|$. As a consequence we can write:

$d([\bar q],[q_0]) = \inf_\gamma \|\bar q - (q_0, \gamma)\|_2 = \|\bar q - (q_0, \gamma_0)\|_2 \le \sqrt{\frac{8}{n}\sum_{i=1}^n \|\epsilon_i\|^2}$ (5.55)

Thus there exists $\bar\epsilon = o\!\left(\sqrt{\frac{8}{n}\sum_{i=1}^n \|\epsilon_i\|^2}\right)$ such that

$\bar q = (q_0 + \bar\epsilon, \gamma_0)$, or $\|\bar q - (q_0, \gamma_0)\|_2 \le \sqrt{\frac{8}{n}\sum_{i=1}^n \|\epsilon_i\|^2}$ (5.56)

Corollary 3 (Functional LLN convergence). Based on Theorem 4.1 from [27]: if $a_n = \sqrt{2n\log(\log n)}$, then:

• $S_n/a_n \to 0$ in probability

• $S_n/a_n$ is uniformly bounded a.s.

• $ES_n = a_n$

Lemma 14 (Robustness of the Karcher mean of the warping functions $\bar\gamma_n$). In addition to the generative model setup from Lemma 1, we include the following assumption about the random warpings and their convergence rate:

$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n \gamma_i^{-1} = \gamma_{id}$, or $\left\|\frac{1}{n}\sum_{i=1}^n \gamma_i^{-1} - \gamma_{id}\right\| = \delta_n = o\!\left(\sqrt{\frac{2\log(\log n)}{n}}\right)$ (5.57)

Here, for the first time, we are referring to the stochasticity of the warping functions $\gamma$. The convergence rate for the functional law of large numbers is shown in [27]. Let again $\bar q = \operatorname{argmin}_q \sum_{i=1}^n d^2_{Elastic}([q],[q_i])$. With $\tilde\gamma_i = \operatorname{argmin}_\gamma \|(q_i, \gamma) - (\bar q, \gamma_0)\|$, let $\bar\gamma_n$ denote the Karcher mean of the $\tilde\gamma_i$. Then:

$\bar\gamma_n = \gamma_0 + \epsilon_\gamma + \delta_n$ and $\tilde\gamma_i = \gamma_i^{-1} \circ \gamma_0 + \epsilon_\gamma$ (5.58)

where $\epsilon_\gamma = o(32\|q_0\|^2(\|\bar\epsilon\|^2 + \|\epsilon_i\|^2))$.

Proof. By the result in Lemma 1 we can substitute $\bar q = (q_0 + \bar\epsilon, \gamma_0)$. Using the properties of the group action of $\Gamma$ in the SRSF space, we can write:

$\tilde\gamma_i = \operatorname{argmin}_\gamma \|(q_0 + \epsilon_i,\ \gamma_i \circ \gamma \circ \gamma_0^{-1}) - (q_0 + \bar\epsilon)\|$ (5.59)

By Theorem 11 we obtain

$\|\gamma_i \circ \tilde\gamma_i \circ \gamma_0^{-1} - \gamma_{id}\| \le 32\|q_0\|^2(\|\bar\epsilon\|^2 + \|\epsilon_i\|^2)$ (5.60)

and

$\|\tilde\gamma_i - \gamma_i^{-1} \circ \gamma_0\| \le 32 L^2\|q_0\|^2(\|\bar\epsilon\|^2 + \|\epsilon_i\|^2)$ (5.61)

As a consequence we know that:

$\exists\,\epsilon_\gamma = o(32 L^2\|q_0\|^2(\|\bar\epsilon\|^2 + \|\epsilon_i\|^2))$ such that $\tilde\gamma_i = \gamma_i^{-1} \circ \gamma_0 + \epsilon_\gamma$ (5.62)

And thus we can write:

$\bar\gamma_n = \frac{1}{n}\sum_{i=1}^n \tilde\gamma_i = \frac{1}{n}\sum_{i=1}^n (\gamma_i^{-1} \circ \gamma_0 + \epsilon_\gamma) = \gamma_0 + \delta_n + \epsilon_\gamma$ (5.63)

Lemma 15 (Robustness of the optimal warping $\gamma_i^*$). Define:

$\hat q_i = (q_i, \tilde\gamma_i \circ \bar\gamma_n^{-1}) = ((q_0 + \epsilon_i, \gamma_i), \tilde\gamma_i \circ \bar\gamma_n^{-1}) = (q_0 + \epsilon_i,\ \gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1})$ (5.64)

Let also

$\gamma_i^* = \operatorname{argmin}_\gamma \|\hat q_i - (q_0, \gamma)\|$ (5.65)

Then

$\gamma_i^* = \gamma_{id} + \tilde\delta_n + \tilde\epsilon_\gamma$, where $\tilde\epsilon_\gamma = o(32\|q_0\|^2\|\epsilon_i\|^2)$ and $\tilde\delta_n = o\!\left(\sqrt{\frac{2\log(\log n)}{n}}\right)$ (5.66)

Proof. By the continuity-of-argmin results established in Theorem 11 and the invariance under simultaneous warping (Lemma 8), we know that

$\gamma_i^* = \operatorname{argmin}_\gamma \|\hat q_i - (q_0, \gamma)\| = \operatorname{argmin}_\gamma \|(q_0 + \epsilon_i,\ \gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1}) - (q_0, \gamma)\|$ (5.67)

$= \operatorname{argmin}_\gamma \|q_0 + \epsilon_i - (q_0,\ \gamma \circ \bar\gamma_n \circ \tilde\gamma_i^{-1} \circ \gamma_i^{-1})\|$ (5.68)

By Lemma 14 we have:

$\|\gamma_{id} - \gamma_i^* \circ \bar\gamma_n \circ \tilde\gamma_i^{-1} \circ \gamma_i^{-1}\| \le 32\|q_0\|^2\|\epsilon_i\|^2$ (5.69)

This implies that

$\exists\,\epsilon_\gamma = o(32\|q_0\|^2\|\epsilon_i\|^2)$ such that $\gamma_i^* = \gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1} + \epsilon_\gamma$ (5.70)

By the results in Lemma 14 we know that $\bar\gamma_n = \gamma_0 + \epsilon_\gamma + \delta_n$ and that $\tilde\gamma_i = \gamma_i^{-1} \circ \gamma_0 + \epsilon_\gamma$. By Lemma 9 we also have $\bar\gamma_n^{-1} = \gamma_0^{-1} + \epsilon_\gamma + \delta_n$. Substituting into $\gamma_i^*$ from equation (5.70), we have:

$\gamma_i^* = \gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1} + \epsilon_\gamma = \gamma_i \circ (\gamma_i^{-1} \circ \gamma_0 + \epsilon_\gamma) \circ (\gamma_0^{-1} + \epsilon_\gamma + \delta_n) = \gamma_{id} + \tilde\epsilon_\gamma + \tilde\delta_n$ (5.71)

which should be understood in the $L^2$ sense in the following way:

$\|\gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1} + \epsilon_\gamma - \gamma_{id}\| \le \|\gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1} - \gamma_{id}\| + \|\epsilon_\gamma\| \le L\|\gamma_i \circ \tilde\gamma_i - \bar\gamma_n\| + \|\epsilon_\gamma\|$ (5.72)

Now, focusing on the $\gamma_i \circ \tilde\gamma_i$ component in the first norm, we can write:

$\|\gamma_i \circ \tilde\gamma_i - \gamma_0\| \le L\|\tilde\gamma_i - \gamma_i^{-1} \circ \gamma_0\| = L\|\epsilon_\gamma\|$ (5.73)

The inequality is a consequence of composition with $\gamma_i^{-1}$ and Lemma 8. That means that $\gamma_i \circ \tilde\gamma_i = \gamma_0 + \epsilon_\gamma$. With $\bar\gamma_n = \gamma_0 + \epsilon_\gamma + \delta_n$ (Lemma 14), substituting into the last inequality in equation (5.72), we obtain:

$\|\gamma_i \circ \tilde\gamma_i \circ \bar\gamma_n^{-1} + \epsilon_\gamma - \gamma_{id}\| \le L\|\gamma_0 - \gamma_0 + \epsilon_\gamma + \delta_n\| + \|\epsilon_\gamma\| \le \|\tilde\epsilon_\gamma\| + \|\tilde\delta_n\|$ (5.74)

5.4.3 Robustness in the SRSF Space

Lemma 16 ($\dot\gamma$ convergence). Let $\gamma_\epsilon = \operatorname{argmin}_\gamma \|q + \epsilon - (q, \gamma)\|$. Assume also that $q$ is Lipschitz continuous. Then $\sqrt{\dot\gamma_\epsilon} \to 1$ in the $L^2$ sense.

Proof. Using the triangle inequality, followed by a change of variables in the second norm, and assuming Lipschitz continuity for $q$, we can write:

$\|q(1 - \sqrt{\dot\gamma_\epsilon})\| \le \|q + \epsilon - (q, \gamma_\epsilon)\| + \|q(\gamma_\epsilon) - q - \epsilon\| \le \|\epsilon_\gamma\| + 2\|\epsilon\|$ (5.75)

where the convergence rate is a consequence of the Lipschitz property combined with the result in Theorem 11 for the second norm and Lemma 10 for the first norm. Thus $\|q(1 - \sqrt{\dot\gamma_\epsilon})\|$ converges. The question is whether we can establish the $L^2$ convergence of $\|1 - \sqrt{\dot\gamma_\epsilon}\|$:

1. As $\|q(1 - \sqrt{\dot\gamma_\epsilon})\| \to 0$, there exists a subsequence such that $\|q(1 - \sqrt{\dot\gamma_{\epsilon,n_k}})\| \to 0$.

2. That means that there exists a subsequence $(1 - \sqrt{\dot\gamma_{\epsilon,n_k}}) \to 0$ almost everywhere.

3. $(1 - \sqrt{\dot\gamma_{\epsilon,n_k}})$ is bounded; thus, by using the dominated convergence theorem, we have:

$\lim_{k\to\infty}\|(1 - \sqrt{\dot\gamma_{\epsilon,n_k}})\| = \|\lim_{k\to\infty}(1 - \sqrt{\dot\gamma_{\epsilon,n_k}})\| = 0$ (5.76)

Now, assume that the subsequence is not enough, and that for some different subsequence the convergence does not hold (name it $(1 - \sqrt{\dot\gamma_{m,\epsilon}})$). Then, after applying the dominated convergence theorem again, we have a contradiction with $\|q(1 - \sqrt{\dot\gamma_{m,\epsilon}})\| \to 0$.

Corollary 4 ($\dot\gamma$ convergence rate). If $\|\frac{1}{q}\| < \infty$, then we can prove the convergence $\|q(1 - \sqrt{\dot\gamma_\epsilon})\| \to 0$ with a convergence rate, by using the Cauchy-Schwarz inequality and the norm equivalence Lemma 7:

$\|1 - \sqrt{\dot\gamma_\epsilon}\|_{L^1} \le \left\|\frac{1}{q}\right\|_{L^2}\,\|q(1 - \sqrt{\dot\gamma_\epsilon})\|_{L^2}$ (5.77)

Lemma 17 ($\dot\gamma^{-1}$ convergence rate). Let $\gamma_1, \gamma_2 \in \Gamma_L$ be two warping functions and assume that $\|\gamma_1 - \gamma_2\| \le \|\epsilon\|$. Also assume that $\gamma_1, \gamma_2$ not only have bounded derivatives but are also bi-Lipschitz (with common Lipschitz constant $K$). Then there exist constants $C, \tilde C$ such that:

$\frac{1}{C}\|(1, \gamma_1) - (1, \gamma_2)\| - \tilde C\|\epsilon\| \le \|(1, \gamma_1^{-1}) - (1, \gamma_2^{-1})\| \le C\|(1, \gamma_1) - (1, \gamma_2)\| + \tilde C\|\epsilon\|$ (5.78)

Proof. First we write:

$\|(1, \gamma_1) - (1, \gamma_2)\| = \|\sqrt{\dot\gamma_1} - \sqrt{\dot\gamma_2}\| = \left\|\frac{\dot\gamma_1 - \dot\gamma_2}{\sqrt{\dot\gamma_1} + \sqrt{\dot\gamma_2}}\right\| \le 0.5 L^{0.5}\,\|\dot\gamma_1 - \dot\gamma_2\|$ (5.79)

Next, using the inverse derivatives and the bounded-derivatives assumption, we can write:

$0.5 L^{0.5}\|\dot\gamma_1 - \dot\gamma_2\| \le 0.5 L^{0.5}\left\|\frac{\dot\gamma_1^{-1}(\gamma_1) - \dot\gamma_2^{-1}(\gamma_2)}{\dot\gamma_1^{-1}(\gamma_1)\,\dot\gamma_2^{-1}(\gamma_2)}\right\| \le 0.5 L^{2.5}\,\|\dot\gamma_1^{-1}(\gamma_1) - \dot\gamma_2^{-1}(\gamma_2)\|$ (5.80)

Next, with the triangle inequality and the Lipschitz assumption for $\dot\gamma_1^{-1}, \dot\gamma_2^{-1}$, we have

$\le 0.5 L^{2.5}\|\dot\gamma_1^{-1}(\gamma_1) - \dot\gamma_2^{-1}(\gamma_1)\| + 0.5 L^{2.5}\|\dot\gamma_2^{-1}(\gamma_1) - \dot\gamma_2^{-1}(\gamma_2)\|$

$\le 0.5 L^{2.5}\|\dot\gamma_1^{-1}(\gamma_1) - \dot\gamma_2^{-1}(\gamma_1)\| + 0.5 L^{2.5} K\|\gamma_1 - \gamma_2\|$

$\le 0.5 L^{2.5}\left(\|\dot\gamma_1^{-1}(\gamma_1) - \dot\gamma_2^{-1}(\gamma_1)\| + K\|\epsilon\|\right) \le L^3\|\dot\gamma_1^{-1}(\gamma_1) - \dot\gamma_2^{-1}(\gamma_1)\| + 0.5 L^{2.5} K\|\epsilon\|$ (5.81)

Iterating this approach one more time for $\|\dot\gamma_1^{-1}(\gamma_1) - \dot\gamma_2^{-1}(\gamma_1)\|$, we get the result.

Corollary 5 (Commutative equivalence of warpings in SRSF space). With the assumptions as in Lemma 17, we have:

$\frac{1}{\tilde C}\|1 - (1, \gamma_2 \circ \gamma_1^{-1})\| \le \|1 - (1, \gamma_1^{-1} \circ \gamma_2)\| \le \tilde C\,\|1 - (1, \gamma_2 \circ \gamma_1^{-1})\|$ (5.82)

Lemma 18 (SRSF convergence of $\bar\gamma_n$ implies $L^2$ convergence).

$\|\gamma_n - \gamma_0\|_{L^2} \le C\,\|(1, \gamma_n) - (1, \gamma_0)\|$ (5.83)

Proof.

$\|\gamma_n - \gamma_0\|^2 \le \int_0^1\left(\int_0^t |\dot\gamma_n(s) - \dot\gamma_0(s)|\,ds\right)^2 dt$

$= \int_0^1\left(\int_0^t \left|\sqrt{\dot\gamma_n(s)} - \sqrt{\dot\gamma_0(s)}\right|\left(\sqrt{\dot\gamma_n(s)} + \sqrt{\dot\gamma_0(s)}\right) ds\right)^2 dt$

$\le \int_0^1\left(\int_0^1 \left|\sqrt{\dot\gamma_n(s)} - \sqrt{\dot\gamma_0(s)}\right|\left(\sqrt{\dot\gamma_n(s)} + \sqrt{\dot\gamma_0(s)}\right) ds\right)^2 dt$

$\le 4L\left(\int_0^1 \left|\sqrt{\dot\gamma_n(s)} - \sqrt{\dot\gamma_0(s)}\right| ds\right)^2 = 4L\,\|(1, \gamma_n) - (1, \gamma_0)\|_1^2$ (5.84)

CHAPTER 6

SUMMARY AND DISCUSSION

We have presented applications and theoretical developments for mathematical models based on the general SRSF framework, driven by real data problems. The applications focus on the areas of neuroscience (the neural spike train data) and genomics (the Next Generation Sequencing short-read data). Despite seeming distant, the two areas share a core mathematical description: both require developing mathematical tools for analysis in an infinite-dimensional point pattern space. The natural way of obtaining point patterns in neuroscience data is through recording neural voltage spikes over time, whereas in genomics the point patterns are obtained by recording NGS short-read coordinates over the reference genome. As we proposed in Chapter 2, the point pattern space can be used directly for the analysis of the neural spike train data. The alternative path is to convert the point clouds to intensity or density functions, as we have done in Chapter 3 and Chapter 4. For the point pattern approach, we have provided a noise-removal procedure and a notion of mean and variance estimation. The estimation algorithm is supplemented with a proof of convergence. For the functional interpretation, we have designed a statistical model in the SRSF space with additive noise and have shown its robustness with respect to the $L^2$ norm of the noise process. The mathematical tool connecting these two approaches is stochastic point process filtering [43]. In the case of converting point clouds to densities, the filtering utilizes the Poisson point process assumption, which is then used in estimating the underlying intensity functions. Worth noting is the possibility of using other point process models to account for various data features prior to the core SRSF analysis step. The Gibbs process, the Cox process, or a doubly stochastic process can be used to account for additional noise in the data, over-dispersion, or autocorrelation. This can take the modeling one step closer to the data but, as usual, comes at the cost of additional mathematical complexity and increased computational time. Investigating other filtering techniques is a critical part of our future work.

Another extension can be achieved by adapting the framework to different domains and different scales of the data problem. This is especially feasible for genomic applications. In this work we have explored the exonic and the TSS regions, which are relatively small compared to the size of the human genome. By considering larger domains, such as whole chromosomes or gene families, one can again find new features in genomic data. This could be particularly interesting in the areas of GWAS (Genome Wide Association Studies), 3D chromatin structure analysis, chromosome replication timing, or copy number alterations. From the theoretical point of view, a crucial step that remains to be made is establishing the ANOVA test distribution in the SRSF space to test the difference between means, $H_0: \mu_j = \mu_k$ for $j \neq k$, for the model:

$\mu_{ij} = ((\mu_j + \epsilon_{ij}) \circ \gamma_{ij})\sqrt{|\dot\gamma_{ij}|}$ (6.1)

The next step would be to extend the testing techniques to impose a significance score on the phase distance statistic. In this way we could formally add a test for the hypothesis in the above model for the warping functions: $H_0: \gamma_j = \gamma_k$, for $j \neq k$.

BIBLIOGRAPHY

[1] Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Genome Biol, 11(10):R106, 2010.

[2] Simon Anders, Alejandro Reyes, and Wolfgang Huber. Detecting differential usage of exons from rna-seq data. Genome Research, 22(10):2008–2017, 2012.

[3] Dmitriy Aronov, Daniel S. Reich, Ferenc Mechler, and Jonathan D. Victor. Neural coding of spatial phase in v1 of the macaque monkey. Journal of Neurophysiology, 89(6):3304–3327, 2003.

[4] Dmitriy Aronov and Jonathan D Victor. Non-euclidean properties of spike train metric spaces. Physical Review E, 69(6):061905, 2004.

[5] Martin Bauer, Martins Bruveris, and Peter W Michor. Uniqueness of the fisher–rao metric on the space of smooth densities. Bulletin of the London Mathematical Society, 48(3):499–506, 2016.

[6] George Box, William Gordon Hunter, and J. Stuart Hunter. Statistics for experimenters: an introduction to design, data analysis, and model building, volume 1. JSTOR, 1978.

[7] Joseph M. Breza, Alexandre A. Nikonov, and Robert J. Contreras. Response latency to lingual taste stimulation distinguishes neuron types within the geniculate ganglion. Journal of Neurophysiology, 103(4):1771–1784, 2010.

[8] Emery N. Brown, Riccardo Barbieri, Valérie Ventura, Robert E. Kass, and Loren M. Frank. The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14(2):325–346, 2002.

[9] Kaifu Chen, Yuanxin Xi, Xuewen Pan, Zhaoyu Li, Klaus Kaestner, Jessica Tyler, Sharon Dent, Xiangwei He, and Wei Li. Danpos: dynamic analysis of nucleosome position and occupancy by sequencing. Genome Research, 23(2):341–351, 2013.

[10] Patricia M Di Lorenzo, Jen-Yung Chen, and Jonathan D Victor. Quality time: representation of a multidimensional sensory domain through temporal coding. Journal of Neuroscience, 29(29):9227–9238, 2009.

[11] David M. Diez, Frederic P. Schoenberg, and Charles D. Woody. Algorithms for computing spike time distance and point process prototypes with application to feline neuronal responses to acoustic stimuli. Journal of Neuroscience Methods, 203(1):186 – 192, 2012.

[12] Alexander J. Dubbs, Brad A. Seiler, and Marcelo O. Magnasco. A fast $l^p$ spike alignment metric. Neural Computation, 22(11):2785–2808, 2010.

[13] Kai Fu, Qianzi Tang, Jianxing Feng, X Shirley Liu, and Yong Zhang. Dinup: a systematic approach to identify regions of differential nucleosome positioning. Bioinformatics, 28(15):1965–1971, 2012.

[14] Thomas J. Hardcastle and Krystyna A. Kelly. bayseq: empirical bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics, 11(1):422, 2010.

[15] Katharina E. Hayer, Angel Pizarro, Nicholas F. Lahens, John B. Hogenesch, and Gregory R. Grant. Benchmark analysis of algorithms for determining and quantifying full-length mrna splice forms from rna-seq data. Bioinformatics, page btv488, 2015.

[16] Housheng Hansen He, Clifford A. Meyer, Hyunjin Shin, Shannon T. Bailey, Gang Wei, Qianben Wang, Yong Zhang, Kexin Xu, Min Ni, Mathieu Lupien, et al. Nucleosome dynamics define transcriptional enhancers. Nature Genetics, 42(4):343–347, 2010.

[17] Conor Houghton. Studying spike trains using a van rossum metric with a synapse-like filter. Journal of Computational Neuroscience, 26(1):149–155, 2009.

[18] Conor Houghton and Kamal Sen. A new multineuron spike train metric. Neural computation, 20(6):1495–1511, 2008.

[19] John D. Hunter and John G. Milton. Amplitude and frequency dependence of spike timing: implications for dynamic regulation. Journal of Neurophysiology, 90(1):387–394, 2003.

[20] Hannah Julienne and Conor Houghton. A simple algorithm for averaging spike trains. The Journal of Mathematical Neuroscience (JMN), pages 1–14, 2013.

[21] Hermann Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30:509–541, 1977.

[22] Donna Karolchik, Angela S. Hinrichs, Terrence S. Furey, Krishna M. Roskin, Charles W. Sugnet, David Haussler, and W. James Kent. The ucsc table browser data retrieval tool. Nucleic Acids Research, 32(suppl 1):D493–D496, 2004.

[23] Robert E. Kass and Valérie Ventura. A spike-train probability model. Neural Computation, 13(8):1713–1720, 2001.

[24] Robert E Kass, Valérie Ventura, and Emery N. Brown. Statistical issues in the analysis of neuronal data. Journal of Neurophysiology, 94(1):8–25, 2005.

[25] W James Kent, Charles W Sugnet, Terrence S Furey, Krishna M Roskin, Tom H Pringle, Alan M Zahler, and David Haussler. The human genome browser at ucsc. Genome Research, 12(6):996–1006, 2002.

[26] Thomas Kreuz, Julie S. Haas, Alice Morelli, Henry D.I. Abarbanel, and Antonio Politi. Measuring spike train synchrony. Journal of Neuroscience Methods, 165(1):151–161, 2007.

[27] Jim Kuelbs. Kolmogorov's law of the iterated logarithm for Banach space valued random variables. Technical report, DTIC Document, 1976.

[28] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nature methods, 9(4):357–359, 2012.

[29] Charity W. Law, Yunshun Chen, Wei Shi, and Gordon K. Smyth. Voom: precision weights unlock linear model analysis tools for rna-seq read counts. Genome Biology, 15(2):R29, 2014.

[30] Vernon Lawhern, Alexandre A. Nikonov, Wei Wu, and Robert J. Contreras. Spike rate and spike timing contributions to coding taste quality information in rat periphery. Frontiers in Integrative Neuroscience, 5, 2011.

[31] Ning Leng, John A. Dawson, James A. Thomson, Victor Ruotti, Anna I. Rissman, Bart M.G. Smits, Jill D. Haag, Michael N. Gould, Ron M. Stewart, and Christina Kendziorski. Ebseq: an empirical bayes hierarchical model for inference in rna-seq experiments. Bioinformatics, page btt087, 2013.

[32] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.

[33] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, et al. The sequence alignment/map format and samtools. Bioinformatics, 25(16):2078–2079, 2009.

[34] Yafang Li, Xiayu Rao, William W Mattox, Christopher I Amos, and Bin Liu. Rna-seq analysis of differential splice junction usage and intron retentions by dexseq. PloS one, 10(9):e0136653, 2015.

[35] Dukhwan Lim and Robert R. Capranica. Measurement of temporal regularity of spike train responses in auditory nerve fibers of the green treefrog. Journal of Neuroscience Methods, 52(2):203–213, 1994.

[36] Michael I. Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):1–21, 2014.

[37] Katrina MacLeod, Alex Bäcker, and Gilles Laurent. Who reads temporal information contained across synchronized and oscillatory spike trains? Nature, 395(6703):693, 1998.

[38] Elaine R. Mardis. A decade’s perspective on DNA sequencing technology. Nature, 470(7333):198–203, 2011.

[39] John C. Marioni, Christopher E. Mason, Shrikant M. Mane, Matthew Stephens, and Yoav Gilad. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18(9):1509–1517, 2008.

[40] Eduardo Martinez-Ceballos, Pierre Chambon, and Lorraine J. Gudas. Differences in gene expression between wild type and Hoxa1 knockout embryonic stem cells after retinoic acid treatment or leukemia inhibitory factor (LIF) removal. Journal of Biological Chemistry, 280(16):16484–16498, 2005.

[41] Michael L. Metzker. Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1):31–46, 2010.

[42] Clifford A. Meyer, Housheng H. He, Myles Brown, and X. Shirley Liu. BINOCh: binding inference from nucleosome occupancy changes. Bioinformatics, 27(13):1867–1868, 2011.

[43] Jesper Møller and Rasmus P. Waagepetersen. Modern statistics for spatial point processes. Scandinavian Journal of Statistics, 34(4):643–684, 2007.

[44] Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong Shou, Debasish Raha, Mark Gerstein, and Michael Snyder. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 320(5881):1344–1349, 2008.

[45] António R. C. Paiva, Il Park, and José C. Príncipe. A reproducing kernel framework for spike train signal processing. Neural Computation, 21(2):424–449, 2009.

[46] R. Quian Quiroga, Thomas Kreuz, and Peter Grassberger. Event synchronization: a simple and fast method to measure synchronicity and time delay patterns. Physical Review E, 66(4):041904, 2002.

[47] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015.

[48] James O. Ramsay. Functional data analysis. Wiley Online Library, 2006.

[49] Fred Rieke, David Warland, Rob de Ruyter van Steveninck, and William Bialek. Spikes: Exploring the Neural Code. MIT Press, 1997.

[50] Stefanie Rosa and Peter Shaw. Insights into chromatin structure and dynamics in plants. Biology, 2(4):1378–1410, 2013.

[51] Cyrille Rossant, Dan F.M. Goodman, Bertrand Fontaine, Jonathan Platkiewicz, Anna K. Magnusson, and Romain Brette. Fitting neuron models to spike trains. Frontiers in Neuroscience, 5, 2011.

[52] Susanne Schreiber, Jean-Marc Fellous, D. Whitmer, P. Tiesinga, and Terrence J. Sejnowski. A new correlation-based measure of spike timing reliability. Neurocomputing, 52:925–931, 2003.

[53] Jeremy J. Shen and Nancy R. Zhang. Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing. The Annals of Applied Statistics, 6(2):476–496, 2012.

[54] A. Srivastava, W. Wu, S. Kurtek, E. Klassen, and J. S. Marron. Registration of functional data using Fisher-Rao metric. Journal of the Royal Statistical Society, 2011. Under review.

[55] Anuj Srivastava and Eric Klassen. Functional and shape data analysis. Springer, 2016.

[56] Anuj Srivastava, Eric Klassen, Shantanu H. Joshi, and Ian H. Jermyn. Shape analysis of elastic curves in Euclidean spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7):1415–1428, 2011.

[57] Anuj Srivastava, W. Wu, Sebastian Kurtek, E. Klassen, and J. S. Marron. Statistical analysis and modeling of elastic functions. arXiv preprint arXiv:1103.3817, 2011.

[58] Cole Trapnell, David G. Hendrickson, Martin Sauvageau, Loyal Goff, John L. Rinn, and Lior Pachter. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology, 31(1):46–53, 2013.

[59] Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R. Kelley, Harold Pimentel, Steven L. Salzberg, John L. Rinn, and Lior Pachter. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3):562–578, 2012.

[60] Cole Trapnell, Brian A. Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J. van Baren, Steven L. Salzberg, Barbara J. Wold, and Lior Pachter. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28(5):511–515, 2010.

[61] Mariano J. Valderrama. An overview to modelling functional data. Computational Statistics, 22:331–334, 2007.

[62] Mark C.W. van Rossum. A novel spike distance. Neural Computation, 13:751–763, 2001.

[63] Daniel L. Vera, Thelma F. Madzima, Jonathan D. Labonne, Mohammad P. Alam, Gregg G. Hoffman, S.B. Girimurugan, Jinfeng Zhang, Karen M. McGinnis, Jonathan H. Dennis, and Hank W. Bass. Differential nuclease sensitivity profiling of chromatin reveals biochemical footprints coupled to gene expression and functional DNA elements in maize. The Plant Cell, 26(10):3883–3893, 2014.

[64] Jonathan D. Victor, David H. Goldberg, and Daniel Gardner. Dynamic programming algorithms for comparing multineuronal spike trains via cost-based metrics and alignments. Journal of Neuroscience Methods, 161(2):351–360, 2007.

[65] Jonathan D. Victor and Keith P. Purpura. Nature and precision of temporal coding in visual cortex: a metric-space analysis. Journal of Neurophysiology, 76(2):1310–1326, 1996.

[66] Jonathan D. Victor and Keith P. Purpura. Sensory coding in cortical neurons. Annals of the New York Academy of Sciences, 835(1):330–352, 1997.

[67] Eric T. Wang, Rickard Sandberg, Shujun Luo, Irina Khrebtukova, Lu Zhang, Christine Mayr, Stephen F. Kingsmore, Gary P. Schroth, and Christopher B. Burge. Alternative isoform regulation in human tissue transcriptomes. Nature, 456:470–476, 2008.

[68] Sergiusz Wesolowski, Marc R. Birtwistle, and Grzegorz A. Rempala. A comparison of methods for RNA-seq differential expression analysis and a new empirical Bayes approach. Biosensors, 3(3):238–258, 2013.

[69] Wei Wu, Thomas G. Mast, Christopher Ziembko, Joseph M. Breza, and Robert J. Contreras. Statistical analysis and decoding of neural activity in the rodent geniculate ganglion using a metric-based inference system. PLoS ONE, 8(5):e65439, 2013.

[70] Wei Wu and Anuj Srivastava. An information-geometric framework for statistical inferences in the neural spike train space. Journal of Computational Neuroscience, 31(3):725–748, 2011.

[71] Wei Wu and Anuj Srivastava. Estimating summary statistics in the spike-train space. Journal of Computational Neuroscience, 34(3):391–410, 2013.

[72] Jin-Ting Zhang. Analysis of variance for functional data. CRC Press, 2013.

[73] Xin Zhang, Tao Zhu, Yong Chen, Hichem C. Mertani, Kok-Onn Lee, and Peter E. Lobie. Human growth hormone-regulated HOXA1 is a human mammary epithelial oncogene. Journal of Biological Chemistry, 278(9):7580–7590, 2003.

BIOGRAPHICAL SKETCH

Sergiusz Wesolowski was born in Poland and spent most of his life in the capital city, Warsaw, where he completed his BSc and MSc, both in Mathematics, the latter with a specialization in Mathematical Statistics. Through various internships and collaborations (Georgia Health Sciences University, University of Hasselt CenStat) he deepened his interest in and knowledge of genomics. This brought him to the Biomathematics PhD program at Florida State University, where he began a collaboration with the Shape Analysis Group in the Department of Statistics and the Center of Genomics and Personalized Medicine. Throughout his career he has published several articles.

List of Publications.

1. Shape-based data analysis for event classification in power systems, Jose Cordova, Reza Arghandeh, Yuxun Zhou, Sergiusz Wesolowski, Wei Wu and Matthias Stifter, PowerTech, 2017 IEEE Manchester, 1–6, 2017, IEEE

2. SRSF shape analysis for sequencing data reveal new differentiating patterns, Sergiusz Wesolowski, Daniel Vera and Wei Wu, Computational Biology and Chemistry, 70, 56–64, 2017, Elsevier

3. A new framework for Euclidean summary statistics in the neural spike train space, Sergiusz Wesolowski, Alexandre Nikonov, Robert Contreras and Wei Wu, The Annals of Applied Statistics, 9(3), 1278–1297, 2015, Institute of Mathematical Statistics

4. A comparison of Euclidean metrics and their application in statistical inferences in the spike train space, Sergiusz Wesolowski, Alexandre Nikonov, Robert Contreras and Wei Wu, arXiv preprint arXiv:1402.0863, 2014.

5. A comparison of methods for RNA-Seq differential expression analysis and a new empirical Bayes approach, Sergiusz Wesolowski, Marc Birtwistle, and Grzegorz Rempala, Biosensors, 3(3), 238–258, 2013, Multidisciplinary Digital Publishing Institute.

6. Stochasticity and time delays in evolutionary games, Jacek Miekisz, Sergiusz Wesolowski, Dynamic Games and Applications, 1(3), 440, 2011, SP Birkhäuser Verlag Boston.

In addition to publishing articles, he popularizes the interdisciplinary aspects of his research by presenting at scientific events, ranging from theory-focused conferences such as JMM and SPA, through more application-focused meetings such as SMB, SIAM and ICSA, to computational and bioinformatics-oriented venues such as CSHL Genome Informatics.

Conferences and Events.

10 2017 Workshop: Applications-Driven Geometric Functional Data Analysis, Tallahassee, FL, USA, Best Poster Award: How Changes in Shape of Nucleosomal DNA Near TSS Influences Changes of Gene Expression. Sergiusz Wesolowski, Jorge Martinez, Daniel Vera, Wei Wu.

03 2017 2017 SIAM-SEAS, Tallahassee, FL, USA, Talk: Functional Data Analysis for Next Generation Sequencing Experiments. Sergiusz Wesolowski, Jorge Martinez, Daniel Vera, Wei Wu.

06 2016 2016 ICSA Applied Statistics Symposium, Atlanta, GA, USA, Talk: Stochastic Point Processes for Next Generation Sequencing. Sergiusz Wesolowski, Wei Wu.

07 2015 The 2015 Annual Meeting of the Society for Mathematical Biology, Atlanta, GA, USA, Talk: Stochastic Point Processes for Next Generation Sequencing. Sergiusz Wesolowski, Wei Wu.

11 2013 International Year of Statistics, Florida State University, Tallahassee, FL, USA, Speed talk and poster session: Stochastic Point Processes for Next Generation Sequencing. Sergiusz Wesolowski, Alexandre A. Nikonov, Robert J. Contreras, Wei Wu.

07 2013 The 36th Conference on Stochastic Processes and Their Applications, Boulder, CO, USA, Poster: A Comparison of Euclidean metrics in spike train space. Sergiusz Wesolowski, Wei Wu.

02 2012 Recent Advances in Statistical Inference for Mathematical Biology, MBI workshop, Ohio State University, OH, USA, Poster: Improving statistical models for discovering cell-type specific genes. Sergiusz Wesolowski, Piotr Kraj.

10 2011 9th Workshop on Bioinformatics and 4th Convention of the Polish Bioinformatics Society, Krakow, Poland, Talk: Method for comparing microarray data in investigating cell type specific genes. Sergiusz Wesolowski, Piotr Kraj.

07 2011 European Conference on Mathematical and Theoretical Biology 2011, Krakow, Poland, Talk: Improving statistical models for discovering cell type specific genes. Sergiusz Wesolowski, Piotr Kraj.

Besides his enthusiasm for data problems in genomics and shape analysis techniques, Sergiusz is interested in outdoor activities including, but not limited to, mountain biking, kite-surfing, sailing, climbing, hiking and kayaking.
