Learning and Estimation of Single Molecule Behavior

Learning and Estimation of Single Molecule Behavior Sivaraman Rajaganapathy1;a, James Melbourne1;b, Tanuj Aggarwal2;c, Rachit Shrivastava1;d, and Murti V. Salapaka1;e Abstract— Data analysis in single molecule studies often of tension. The unfolding of a domain leads to a step involves estimation of parameters and the detection of abrupt change in the length of the protein. How much time is changes in measured signals. For single molecule studies, tools needed for the domains to unfold, the relationship to the for automated analysis that are crucial for rapid progress, need to be effective under large noise magnitudes, and often must force applied, the changes in the structure under folding assume little or no prior knowledge of parameters being stud- and unfolding events are questions that are studied with ied. This article examines an iterated, dynamic programming AFM based force spectroscopy. Most of the studies involve based step detection algorithm (SDA). It is established that numerous iterations of experiments. Often, data is collected given a prior estimate, an iteration of the SDA necessarily from a large number of experiments, not just to ensure a improves the estimate. The analysis provides an explanation and a confirmation of the effectiveness of the learning and high confidence on statistical information, but also because estimation capabilities of the algorithm observed empirically. experiment success rates are sometimes intentionally kept Further, an alternative application of the SDA is demonstrated, low. For example, in single molecule force spectroscopy, wherein the parameters of a worm-like chain (WLC) model the protein solution is diluted enough to ensure that the are estimated, for the automated analysis of data from single the success rate of an attachment forming between the tip molecule protein pulling experiments. and protein is between two to ten percent [4], [11]. The I. INTRODUCTION low probabilities of attachment ensure that the probability of attaching to multiple protein molecules is low [4], [11]. The recently found ability of performing single-molecule The task of obtaining accurate yet precise statistics of events experiments have provided crucial insights into the biological under large uncertainties and the confounding dynamics of machinery of the cell [1], [2]. Single molecule experiments the instrumentation is a daunting one. For example, in AFM have revealed that many biological systems exhibit discrete experiments, the protein domains can unfold at rapid rates behavior [3], [4], [5], [6]. For example, motor-proteins (also where the dynamics of the cantilever cannot be ignored. The known as molecular motors) such as, kinesin and dynein, timing of domain unfolding or folding and changes in protein take discrete steps over microtubules while carrying cargo structure have to be ascertained from the measured cantilever and form a fundamental mode of transport inside cells [5]. deflection data which is corrupted by effects of thermal noise Instruments such as atomic force microscopes (AFMs) and (process noise), measurement noise and the dynamics of the optical tweezers have enabled resolution of measuring forces cantilever probe. It is evident that the needs of fundamental in the femto to multiple pico Newton range with spatial biophysics research necessitate the development of tools for resolution at the nanometer scale. These instruments have automatic detection and estimation of abrupt changes in made it possible to perform protein folding and unfolding measurements and parameters of models used to describe the experiments [7], [8], [9], [10], where domains fold and experimental data. In addition, these tools must be capable unfold with force differentials in the pico-Newton and femto- of operating with minimal prior information on the systems Newton range. The length of the domains are in the tens to being studied. In this article, events characterized by steps hundreds of nanometers. The challenges of single molecule appearing in data and parameters are emphasized; however, studies are many. For example, in force spectroscopy ex- the methods discussed are applicable and extensible to many periments using AFMs (see Fig. 1), a solution with protein other characteristics of events (such as impulsive changes in molecules to be studied is deposited on a substrate, and a the data). cantilever with a sharp tip is pressed against the substrate A number of step detection tools for the study of single to enable a protein molecule to attach to the tip. Subse- molecules are available [12], [13], [14], [15], [16], [17], quent to pressing the cantilever against the substrate, the [18], [19], [20], [21]. A comparative study of tools in [17], cantilever is retracted away from it. If there is a successful [18], [19], [20], [21] reported almost similar performance attachment the protein is extended and comes under tension. [3], with the χ2 based technique in [21] observed to have The domains of the protein unfold under the application improved temporal resolution. The step detection algorithm (SDA) [12], an iterative algorithm involving dynamic pro- 1Department of Electrical & Computer Engineering, University of Min- nesota, Minneapolis, MN 55455, USA gramming (Viterbi) based methodology, was demonstrated to 2Cymer LLC (An ASML Company), 17075 Thornmint Court, San Diego, successfully extract stepping statistics from signals with low CA 92127, USA signal to noise ratios (SNR), and with no prior information a<[email protected]>, b<[email protected]>, c<[email protected]>, on the number of steps or their sizes. Further, the algorithm d<[email protected]>, e<[email protected]> was capable of removing the undesirable effects of probe LASER DETECTOR y y y y Let X = fx1; x2; :::xT g be the true stepping signal generated by a system with its dynamics given by y y xt = x t−1 + ut; CANTILEVER where, ut is a stochastic variable with an unknown distribution p(ut) independent of time t. Let Y = fy1; y2; :::; yT g be the observations of Xy corrupted by zero mean Gaussian EXTENSION noise with, PROTEIN y FOLDED CHAIN yt = xt + ηt; DOMAINS 2 where, ηt ∼ N (0; σ ) for all t. The task of the step detection SUBSTRATE algorithm is to find an accurate estimate X^ = fx^1; x^2; :::x^T g of Xy given the observations Y , and, to obtain an estimate Fig. 1. Schematic of a protein pulling experiment. A polypeptide chain of the step-size distribution (that is, an estimate of p(u)). It protein is stretched between a substrate and a cantilever tip of an AFM. is to be noted that even though we have not introduced probe dynamics that can confound measurements, the methodology is applicable in the presence of such dynamics. The simpler dynamics on the data. The SDA’s effectiveness is evident model is used to keep the presentation accessible and for from its use in a number of bio-physical studies that include, space considerations. the detection of wear in molecular tracks [22], the detection of fluorophore events in the study of nanodot arrays [23], B. SDA Formulation and HP1 chromatin binding [24]. Although SDA has found As presented in [12], the SDA takes the measured data as applications in multiple fundamental biophysics studies and input and a fits a staircase possibly with different step-sizes its efficacy demonstrated in simulation studies (where the to the data (we assume that the staircase also admits negative ground truth is known), the reasons for its effectiveness are steps). The SDA is iterative. Assuming the measurement not established. The algorithm is empirically observed to noise is zero mean with variance σ2, where, an estimate of demonstrate learning, in which previous estimates of the step the variance is known, the SDA has as its first iterate, the size distributions are used to provide a new, more accurate solution of the following optimization problem: estimate of the same. Toward an analysis and substantial extension of SDA’s applicability, this article makes two ( T ) main contributions: (i) It provides the framework to evaluate ∗ X 2 X^ = arg min (yt − xt) + Wt(xt − xt−1) ; (1) X and analyze the learning capability of the SDA. A key t=1 result of the article is that it rigorously establishes that where with every learning iteration of the SDA, the estimation W (x − x ) = 9σ2δ¯(x − x ); (2) improves. (ii) It extends the SDA for learning parameters of t t t−1 t t−1 models that explain single molecule behavior. Furthermore, with ( it demonstrates the efficacy of the methods developed, on 0 u = 0 δ¯(u) = : data obtained from simulations as well as experiments. 1 otherwise The article is organized as follows. Section II introduces the step detection problem and provides the details of the Equation 1 has a quadratic error term and a penalty term that SDA. Section III establishes that the SDA’s estimates im- penalizes steps in the fit to the data. The initial high penalty prove with every iteration. Section IV extends the SDA of nine times the variance of noise given by (2), ensures that for identifying worm like chain (WLC) models used in any step included in the first staircase fit is a true step with protein force spectroscopy experiments. The results of this high probability. The optimization problem is solved using reformulation are presented and discussed in Sections V & dynamic programming. It is reasonable to expect that any VI with simulation and experiment data. steps in the fit with penalty (2) will have high probability of being a true step; however, many true steps will be missed II. THE STEP DETECTION PROBLEM in the fit. Subsequent to the first iteration, the staircase fit We begin with a description of the problem setting and obtained is used to create a histogram of step sizes obtained the step detection algorithm proposed in [12]. from the fit. The histogram is normalized to obtain the penalty term for the subsequent iterate, where, step-sizes A.

Learning and Estimation of Single Molecule Behavior

Edge Detection with a Preprocessing Approach

Modèles De Markov Cachés Récents Pour La Détection Et La Reconnaissance D’Activités À Partir D’Un Capteur IMU Placé Sur Une Jambe Haoyu Li

A Method for Wavelet-Based Time Series Analysis of Historical Newspapers

On Change Point Detection Using the Fused Lasso Method∗

Online Parametric and Non Parametric Drift Detection Techniques for Streaming Data in Machine Learning: a Review P.V.N

Title: Single-Molecule Techniques in Biophysics: a Review of the Progress in Methods and Applications

An Anomalous Noise Events Detector for Dynamic Road Traffic Noise

An Online Algorithm for Piecewise Curve Estimation Using &Ell;<Sup>

Hidden Markov Models, Theory and Applications.Indd

Real-Time Step Detection Using Unconstrained Smartphone

A Novel Walking Detection and Step Counting Algorithm Using Unconstrained Smartphones

Low Complexity TOA Estimation for Impulse Radio UWB Systems