DEVELOPING NEW AFM IMAGING TECHNIQUE AND SOFTWARE FOR DNA MISMATCH REPAIR

Zimeng Li

A dissertation submitted to the faculty at the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Physics and Astronomy.

Chapel Hill 2019

Approved by:

Dorothy Erie

Tom Clegg

Richard Superfine

Wu Yue

Tom Kunkel

Paul Modrich

©2019 Zimeng Li ALL RIGHTS RESERVED

ABSTRACT

Zimeng Li: Developing New AFM Imaging Technique and Software for DNA Mismatch Repair (Under the direction of Dorothy Erie)

Atomic Force Microscopy (AFM) is a powerful technique to study the assembly and function of multiprotein-DNA complexes, such as the MutS-MutL-DNA complex in DNA mismatch repair. As a high-resolution, single-molecule imaging technique, AFM has the advantage of directly visualizing individual protein-DNA complexes in their native conformations, but it has two severe limitations. First, it is unable to resolve the location of the DNA inside the protein. Second, it lacks comprehensive software tailored for single-molecule studies and high-throughput analyses. To tackle these issues, we developed DREEM (Dual-Resonance-frequency-Enhanced Electrostatic Force Microscopy). DREEM is a new AFM imaging method that is capable of resolving the DNA path inside the protein-DNA complex. With DREEM, we can reveal the path of the DNA wrapping around histones in nucleosomes and the path of DNA through multiprotein mismatch repair complexes. We also developed Image Metrics, a full-featured AFM software package that excels in high-throughput shape analysis and single-molecule analysis. Built using MATLAB, Image Metrics is uniquely positioned in single-molecule studies with its specially designed single-molecule analysis and shape analysis modules. It also combines flexibility with a user-friendly interface that allows for easy customization and automation of user workflows. Finally, as a case study, we used AFM and Image Metrics to study DNA mismatch repair (MMR) in the context of trinucleotide repeat (TNR) expansion diseases. Our data show that when interacting with heteroduplex DNAs with the TNR context, MMR proteins adopt conformations that are different from a repair-signaling-capable species, suggesting the altered conformations may not be competent for repair.

TABLE OF CONTENTS

LIST OF TABLES ...... ix

LIST OF FIGURES ...... x

LIST OF ABBREVIATIONS ...... xiv

CHAPTER 1. OVERVIEW ...... 1

CHAPTER 2. VISUALIZING THE PATH OF DNA THROUGH PROTEINS USING DREEM IMAGING ...... 4

2.1. Introduction ...... 5

2.2. Design...... 8

2.3. Results and Discussion ...... 13

CHAPTER 3. IMAGE METRICS, A NEXT-GEN AFM IMAGE ANALYSIS SOFTWARE ...... 24

3.1. Introduction ...... 25

3.2. Interface ...... 29

3.3. Basic Operations ...... 32

3.4. Image Processing...... 36

3.4.1. Surface Correction ...... 36

3.4.2. Cross-correlation ...... 46

3.4.3. Fast Fourier Transform ...... 48

3.5. Image Analysis ...... 52

3.5.1. Particle Detection ...... 52

3.5.2. Shape Analysis ...... 57

3.5.3. Single-Molecule Analysis ...... 68

3.5.4. Compare Measurement to Other AFM Software ...... 81

3.6. Macros and User Extensions ...... 92

3.7. Author’s Remarks and Future Directions...... 94

3.7.1. Motivations on Creating Image Metrics ...... 94

3.7.2. Particle Detection ...... 96

3.7.3. Shape Analysis ...... 99

3.7.4. Single-Molecule Analysis ...... 103

3.7.5. Open Development Model ...... 106

3.8. License and Distribution ...... 108

CHAPTER 4. STRUCTURAL AND FUNCTIONAL STUDY OF DNA MISMATCH REPAIR IN THE CONTEXT OF TRINUCLEOTIDES REPEATS EXPANSION...... 109

4.1. Introduction ...... 110

4.1.1. DNA Mismatch Repair (MMR) ...... 113

4.1.2. Trinucleotides Repeat Expansion (TNR) ...... 118

4.1.3. Molecular Mechanisms of TNR ...... 120

4.1.4. Relations between TNR and MMR...... 126

4.2. Materials and Methods ...... 128

4.2.1. Proteins and DNAs ...... 128

4.2.2. AFM Sample Preparation ...... 129

4.2.3. Deposition, Imaging, and Analysis ...... 130

4.3. Results ...... 134

4.3.1. (CTG)1...... 135

4.3.2. (CTG)5...... 139

4.3.3. (CTG)56/(CAG)54 ...... 142

4.4. Discussion ...... 155

APPENDIX A. SUPPLEMENTAL INFORMATION FOR DREEM ...... 163

A.I. Supplemental Figures ...... 163

A.II. Theoretical Basis of DREEM Measurements ...... 166

A.III. Supplemental Experimental Procedures ...... 169

APPENDIX B. AUTHOR’S CONTRIBUTION TO DREEM...... 174

APPENDIX C. DREEM OPTIMIZATION ...... 176

APPENDIX D. SUPPLEMENT FIGURES FOR IMAGE METRICS ...... 184

APPENDIX E. SUPPLEMENT FIGURES FOR MMR AND TNR ...... 195

APPENDIX F. IMAGE ARTIFACTS ...... 197

APPENDIX G. PARTICLE CLASSIFICATION ALGORITHMS ...... 201

G.I. Classification through Eigenanalysis ...... 202

G.II. Classification through Clustering Analysis ...... 212

G.III. Verifications and Best Practices ...... 214

APPENDIX H. IMAGE SIMULATION ...... 230

REFERENCES ...... 236

LIST OF TABLES

Table 3.1 Overview of major AFM Software for Single-Molecule AFM Study ...... 27

Table 3.2 Comparison of Image Flattening Results between Image Metrics and other AFM software...... 40

Table 3.3 Selected List of Particle Metrics ...... 58

Table 3.4 Comparison of Selected Metrics in Particle Analysis ...... 84

Table 3.5 Software Comparisons on Fiber Measurement ...... 88

Table 3.6 Software Comparisons on Fiber Measurement (cont’d) ...... 89

Table 3.7 Comparison of Selected Metrics in Single-Molecule Analysis ...... 91

Table 4.1 DNA substrates length and position of slip-out ...... 129

Table 4.2 AFM sample preparation recipe for MutSβ-MutLα-DNA reactions ...... 130

Table 4.3 Percent Population of DNA-bound Proteins that Form Clusters or Multi-protein Complexes ...... 152

Table C.1 Terms in the minimum detectable force gradient ...... 177

Table G.1 Particle Alignment and Correlation ...... 202

LIST OF FIGURES

Figure 2.1 Instrumental Design for Simultaneous AFM and DREEM Imaging ...... 10

Figure 2.2 Representative Topographic AFM and DREEM Images of Nucleosomes ...... 16

Figure 2.3 Topographic AFM and DREEM Images of Mismatch Repair Complexes on 2 kbp DNA Containing a GT Mismatch ...... 19

Figure 3.1 Image Metrics Application Launcher ...... 30

Figure 3.2 Image Metrics User Interface ...... 31

Figure 3.3. Image Visualization ...... 35

Figure 3.4 AFM Scan Line Image Artifact ...... 38

Figure 3.5 Detrend Operation ...... 40

Figure 3.6 Auto Threshold by Gaussian Method ...... 41

Figure 3.7. Correcting Line-wise Image Artifact...... 44

Figure 3.8 Line Removal Algorithms Compared ...... 45

Figure 3.9 Feature Tracking and Enhancement through Cross-Correlation and Correlation Averaging ...... 48

Figure 3.10 Processing Periodic Patterns using FFT and Correlation ...... 50

Figure 3.11 Particle Detection – Connectivity and Bridging...... 53

Figure 3.12 Masking Operation ...... 56

Figure 3.13. Particle Analysis ...... 59

Figure 3.14 The Insufficiencies of Particle Metrics in Describing Shapes ...... 60

Figure 3.15. Shape Matching Analysis of Proteins UHRF1 ...... 63

Figure 3.16 Particle Classification Module ...... 67

Figure 3.17 DNA Tracing and Fiber Analysis ...... 72

Figure 3.18 Single-Molecule Analysis ...... 78

Figure 3.19 Different Contour Measurement Mechanisms ...... 83

Figure 3.20 Perimeter Measurement of Contour by Pixel Center ...... 83

Figure 3.21 Models for Volume Measurement ...... 86

Figure 3.22 Fiber/Skeleton Segmentation...... 105

Figure 4.1 Mechanism schematics of DNA Mismatch Repair through the MutSα pathway ..... 114

Figure 4.2 Strand Discrimination Signal Searching Mechanism ...... 116

Figure 4.3 Mechanism of Repeat Expansion ...... 121

Figure 4.4 Experiment Design ...... 131

Figure 4.5 A Typical AFM Image of a Protein-DNA Complex and Its Analysis ...... 132

Figure 4.6 Selecting Particles for Stoichiometry and Position Analysis ...... 132

Figure 4.7 Position Analysis ...... 133

Figure 4.8 Volume Distribution of MutSβ Complexes ...... 145

Figure 4.9 Stoichiometry of protein complex per DNA ...... 147

Figure 4.10 Position Distribution of Protein Complex on DNA ...... 148

Figure 4.11 Position Distribution of Protein Complex on (CTG)1 (Single Cut) ...... 149

Figure 4.12 DNA Bending by MutSβ ...... 150

Figure 4.13 Measuring Kinks on Free DNA ...... 152

Figure 4.14 Native DNA Kinks and Their Locations ...... 153

Figure 4.15 AFM Images of MutSβ-MutLα-DNA Complexes ...... 154

Figure 4.16 Model of Repair Signaling ...... 162

Figure A.1 Topographic and DREEM Images of a Polarized BaTiO3 (BTO) Thin Film ...... 163

Figure A.2 Additional DREEM Images of Histone Alone and Nucleosomes ...... 164

Figure A.3 DREEM Imaging Reveals the Path of the DNA in hMutSα Mobile Clamp Complexes Loaded onto a Circular DNA Substrate (4 kbp) Containing Two GT Mismatches 2 kbp Apart ...... 165

Figure C.1 Frequency shift on the first overtone with different DC bias ...... 180

Figure C.2 DREEM imaging in attractive mode vs. repulsive mode ...... 180

Figure C.3 Frequency shift on image contrast ...... 182

Figure C.4 Operative Frequency on Image Contrast ...... 182

Figure D.1 Welcome Screen of a Typical Module ...... 184

Figure D.2 Image Filter Module ...... 185

Figure D.3 Action History ...... 186

Figure D.4 Macro Builder ...... 187

Figure D.5 Batch Processing through Macros ...... 188

Figure D.6 Script Editor ...... 189

Figure D.7 Image Acquisition...... 190

Figure D.8 Image Browser ...... 191

Figure D.9 Command Console ...... 192

Figure D.10 Shape Matching Analysis of MutSα-DNA Complexes ...... 193

Figure D.11 App Manager ...... 194

Figure E.1 Restriction map of DNA substrates ...... 195

Figure E.2 Position Distribution of Homoduplex from the Double-cut DNA ...... 195

Figure E.3 Stoichiometry of Protein Complexes per DNA on Homoduplex from the Double-cut DNA ...... 196

Figure F.1 Other Image Artifacts ...... 199

Figure G.1 Sub-grouping ...... 210

Figure G.2 Super-grouping ...... 211

Figure G.3 Hierarchical Ascendant Clustering (HAC) ...... 213

Figure G.4 Image Simulation using Smiley Faces...... 214

Figure G.5 Particle Classification using Artificial Data Set ...... 216

Figure G.6 Simulated AFM Image from Classified Particle Images ...... 217

Figure G.7 K-means Clustering ...... 218

Figure G.8 Cluster Silhouette Plot ...... 220

Figure G.9 Comparisons of K-means Clustering using Different Seeds ...... 221

Figure G.10 Hierarchical Ascendant Clustering – Flat Cutoff Scheme ...... 223

Figure G.11 Hierarchical Ascendant Clustering – Inconsistency Scheme ...... 223

Figure G.12 HAC Parameter Optimization ...... 224

Figure G.13 Eigenanalysis Parameter Optimization ...... 226

Figure G.14 Summary of Classes (w/ Highest Silhouette Means) ...... 228

Figure G.15 Summary of Classes (w/ Local Silhouette Maxima) ...... 229

Figure H.1 Image Simulator ...... 231

Figure H.2 Tip Modeling ...... 232

Figure H.3 Tip Geometry Parameterization ...... 233

Figure H.4 Tip Apex Simulation ...... 233

Figure H.5 Tip Dilation...... 234

Figure H.6 Image Tip-Dilation Comparison ...... 235

LIST OF ABBREVIATIONS

AFM Atomic Force Microscopy

CA Clustering Analysis; Correspondence Analysis

DM Myotonic Dystrophy

DREEM Dual-Resonance-frequency-Enhanced Electrostatic Force Microscopy

EFM Electrostatic Force Microscopy

EM Electron Microscopy

FRDA Friedreich’s Ataxia

FXTAS Fragile X Tremor and Ataxia Syndrome

HAC Hierarchical Ascendant Clustering

HD Huntington’s Disease

IM Image Metrics

MMR Mismatch Repair

MRA Multiple Reference Alignment

MSA Multivariate Statistical Analysis

PCA Principal Component Analysis

ROI Region of Interest

SCA Spinocerebellar Ataxia

SNR Signal to Noise Ratio

SL MutSβ-MutLα-DNA Complex

SPIP Scanning Probe Imaging Processor

SXM Scanning X (Force, Probe, etc.) Microscopy

TNR Trinucleotide Repeat

CHAPTER 1. OVERVIEW

Our DNA is subjected to constant damage and metabolic activities during our lifetime. It can be damaged through various exogenous and endogenous factors, or through ‘programmed’ damage during maintenance. It can undergo various structural changes, such as folding into super condensed protein-DNA structures called chromosomes, or unwinding from its double helical structure while undergoing DNA replication, transcription, recombination, and repair [1].

Maintaining the high fidelity of DNA replication during these activities is the cornerstone of the central dogma that governs the precise flow of genetic information from DNA to proteins, which is key to our survival.

There are many important cellular mechanisms that govern the high fidelity of DNA replication. DNA mismatch repair (MMR) is one of them [2]. DNA mismatch repair corrects errors left by the DNA replication machinery during DNA replication. Mutations in key DNA mismatch repair genes, first identified in the 1980s, are associated with many cancers, including colon cancer, endometrial cancer, and ovarian cancer, collectively called Lynch Syndrome [3]. It is important to note that DNA mismatch repair proteins also participate in many other cellular processes – most often they assist in positive roles such as signaling for apoptosis, double-strand break repair, and homologous recombination [2], but they can also participate in undesirable activities such as promoting a number of trinucleotide repeat (TNR) expansion-related neurological disorders [4], such as Huntington’s Disease (HD).

To understand how MMR proteins interact with DNA in various cellular contexts, it is important to study the assembly and structure of protein-DNA complexes – an area where Atomic Force Microscopy (AFM) is widely used. As a high-resolution, single-molecule imaging technique, AFM has the advantage of directly correlating the conformational structures of protein-DNA complexes to their functions in the context of a repair event [5, 6], while resolving conformationally dynamic species that are not observed in crystal structures. However, a limitation of AFM is its inability to resolve the location of the DNA within the multiprotein complex. To tackle this limitation, we developed a new electrostatic force microscopy (EFM) method called DREEM (Dual-Resonance-frequency-Enhanced EFM) that is capable of resolving DNA within protein-DNA complexes (CHAPTER 2).

Another severe limitation to single-molecule AFM studies is the lack of comprehensive software for high-throughput single-molecule protein-DNA analysis. To reach a statistically relevant conclusion in a single-molecule study, one has to collect and process a large quantity of data, and data analysis is often the bottleneck. To improve the quality and throughput of data analysis, I developed Image Metrics, a new-generation image analysis software package (CHAPTER 3). Designed from the ground up, Image Metrics is a comprehensive MATLAB¹-based image analysis package that not only implements features comparable to those found in leading AFM software packages, but also is specifically designed for easy customization, automation, and streamlining of data analysis. Notably, it features a single-molecule analysis module that is specifically tailored for high-throughput protein-DNA analysis, the first of its kind to my knowledge. With Image Metrics, users can not only perform all the image processing and analysis in a single package, but also easily add custom functions, configure their own workflow, and automate laborious routines. I aim to develop Image Metrics as an indispensable tool for AFM image analysis and beyond.

¹ MATLAB is a programming language designed for technical computing and developed by MathWorks, Inc.

Finally, as a case study, DNA mismatch repair is examined in the context of TNR expansion (CHAPTER 4). AFM is used to visualize directly how MMR proteins interact with heteroduplex DNAs at the initial stages of repair processing. Because a variety of DNA substrates (both with and without the TNR context) and interaction conditions are involved, a large quantity of data had to be collected for all the conditions combined. The processing and analysis of the collected images of protein-DNA complexes were made possible by utilizing Image Metrics’ high-throughput processing capabilities. Our data analysis shows that when interacting with heteroduplex DNAs without the TNR context, MMR proteins adopt conformations that are consistent with a repair signaling capable formation; when interacting with heteroduplex DNAs with the TNR context, MMR proteins adopt very different conformations that may not be competent for repair signaling.

CHAPTER 2. VISUALIZING THE PATH OF DNA THROUGH PROTEINS USING DREEM IMAGING²

Many cellular functions require the assembly of multiprotein-DNA complexes. A growing area of structural biology aims to characterize these dynamic structures by combining atomic-resolution crystal structures with lower resolution data from techniques that provide distributions of species, such as small angle x-ray scattering, electron microscopy, and atomic force microscopy (AFM). A significant limitation in these combinatorial methods is localization of the DNA within the multiprotein complex. Here, we combine AFM with a new electrostatic force microscopy (EFM) method to develop an exquisitely sensitive Dual-Resonance-frequency-Enhanced EFM (DREEM) capable of resolving DNA within protein-DNA complexes. Imaging of nucleosomes and DNA mismatch repair complexes demonstrates that DREEM can reveal both the path of the DNA wrapping around histones and the path of DNA as it passes through both single proteins and multiprotein complexes.

2 This chapter previously appeared as an article in Molecular Cell. The original citation is as follows - Wu, D., P. Kaur, Z. M. Li, K. C. Bradford, H. Wang and D. A. Erie (2016). "Visualizing the path of DNA through proteins using DREEM imaging." Mol Cell 61(2): 315-323. Additional information on my contribution to the publication and additional work on optimization can be seen in APPENDIX B and APPENDIX C.

2.1. Introduction

DNA transactions in the cell, such as replication, repair, and transcription, require the assembly of multiple proteins on DNA. Determining the structures of these complexes is essential to understanding their function; however, several factors make characterization of multiprotein-DNA complexes particularly difficult. First, many of the individual proteins are large and contain structured domains connected to one another by intrinsically disordered regions, making them conformationally diverse. Second, the assembly of the different proteins is not necessarily an ordered process, which results in a heterogeneous population of complexes with different conformations and containing different protein stoichiometries [7]. Finally, the assembly process may occur over long DNA lengths and/or bring distal DNA regions together.

An emerging area of structural biology, which is beginning to address this problem, is the combination of high resolution data from crystallography and NMR with lower resolution data from techniques such as small angle x-ray scattering, which provides estimates of the distribution of conformational states [8-11], and electron microscopy (EM) and atomic force microscopy (AFM), which provide images of individual complexes [12-26]. Although these hybrid methods are promising, a significant limitation to the existing lower resolution techniques is their limited capability for resolving the location of the nucleic acids within protein-DNA complexes.

Phosphorus mapping through electron spectroscopic imaging (ESI) has been used to characterize the nucleic acid distribution in transcriptionally active chromatin [27]. In addition, recent advances in sorting of particles in cryoEM datasets are beginning to allow visualization of multiple conformations [28], and the trajectories of DNA have been estimated by tagging the end of DNA with streptavidin [22, 29]. Finally, recent EM studies revealed the location of the DNA in human RNA polymerase [30] complexes and the RNA in the ribosome (e.g., [31, 32]).

Currently, no microscopy method allows visualization of DNA within flexible and/or large heterogeneous protein-DNA complexes. Because scanning force microscopy methods can provide images of individual complexes and because both proteins and DNA are significantly charged and interactions between proteins and DNA result in charge neutralization, we reasoned that it may be possible to visualize the path of DNA within individual protein-DNA complexes by high-resolution imaging of their electrostatic properties.

Electrostatic force microscopy (EFM) and Kelvin probe force microscopy (KPFM) have been used to image the electrostatic surface potential of a large variety of materials with high spatial resolution and sensitivity [33, 34]. There are several different modes of EFM and KPFM. In many applications, a modulated bias voltage ($V_{DC} + V_{AC}\sin(\omega t)$) is applied between the tip and sample. This bias generates an attractive electrostatic force between the tip and the sample, $F_{el} = -\frac{1}{2}\frac{\partial C}{\partial z}\Delta V^2$, where $\Delta V = (V_{DC} - \Delta\phi_{TS}) + V_{AC}\sin(\omega t)$, and is expressed as the sum of three spectral components [34-36]:

$$F_{DC} = -\frac{1}{2}\frac{\partial C}{\partial z}\left[(\Delta\phi_{TS} - V_{DC})^2 + \frac{V_{AC}^2}{2}\right] \tag{2.1}$$

$$F_{\omega} = \frac{\partial C}{\partial z}(\Delta\phi_{TS} - V_{DC})\,V_{AC}\sin(\omega t) \tag{2.2}$$

$$F_{2\omega} = \frac{1}{4}\frac{\partial C}{\partial z}V_{AC}^2\cos(2\omega t) \tag{2.3}$$

where $\Delta\phi_{TS}$ and $\frac{\partial C}{\partial z}$ are the contact potential difference and capacitance gradient, respectively, between the tip and the sample, and $z$ is normal to the surface. This force is used to induce a vibration in the cantilever at the frequency of the AC bias ($\omega$). In KPFM, a feedback loop is used to adjust $V_{DC}$ such that it compensates for $\Delta\phi_{TS}$, thereby nullifying $F_{\omega}$ and generating a potential map of the surface; whereas, in EFM, there is no feedback voltage, and although EFM does not measure surface potential, images of the electrostatic properties of the surface are produced by monitoring the amplitude and/or phase of the induced vibration. Dual-frequency single-pass techniques, where the topography and the surface electrical potential are monitored simultaneously, have the highest sensitivity [33, 36-38]. In fact, dual-frequency KPFM has been used to obtain images of DNA [37] and transcription complexes [39]; however, no details about the DNA in the transcription complexes were revealed.
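The decomposition in Equations 2.1 to 2.3 can be checked numerically. The following is a minimal sketch and not part of the original work; the function names and all parameter values (capacitance gradient, contact potential difference, biases, frequency) are illustrative assumptions chosen only to exercise the formulas.

```python
import numpy as np

def electrostatic_force(t, dC_dz, dphi_ts, v_dc, v_ac, omega):
    """Total force F_el = -(1/2) dC/dz * dV^2, dV = (V_DC - dphi_TS) + V_AC sin(wt)."""
    dv = (v_dc - dphi_ts) + v_ac * np.sin(omega * t)
    return -0.5 * dC_dz * dv**2

def spectral_components(t, dC_dz, dphi_ts, v_dc, v_ac, omega):
    """Static, first-harmonic, and second-harmonic terms (Eqs. 2.1 to 2.3)."""
    f_dc = -0.5 * dC_dz * ((dphi_ts - v_dc) ** 2 + v_ac**2 / 2) * np.ones_like(t)
    f_w = dC_dz * (dphi_ts - v_dc) * v_ac * np.sin(omega * t)
    f_2w = 0.25 * dC_dz * v_ac**2 * np.cos(2 * omega * t)
    return f_dc, f_w, f_2w

# Arbitrary illustrative values (assumptions, not measured quantities).
params = dict(dC_dz=1e-10, dphi_ts=0.3, v_dc=1.0, v_ac=10.0, omega=2 * np.pi * 500e3)
t = np.linspace(0.0, 1e-5, 2001)

total = electrostatic_force(t, **params)
f_dc, f_w, f_2w = spectral_components(t, **params)
max_residual = np.max(np.abs(total - (f_dc + f_w + f_2w)))

# KPFM-style nulling: setting V_DC equal to dphi_TS removes the first harmonic.
nulled = dict(params, v_dc=params["dphi_ts"])
_, f_w_nulled, _ = spectral_components(t, **nulled)
```

The residual between the total force and the sum of the three components should be at the level of floating-point round-off, and the first-harmonic term vanishes when the DC bias equals the contact potential difference, which is the condition a KPFM feedback loop seeks.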

Considering the weak electrostatic signals generated by DNA and proteins, we developed a sensitive high-resolution Dual-Resonance-frequency Enhanced EFM (DREEM) to resolve the DNA within protein-DNA complexes deposited on mica (Figure 2.1). This dual frequency technique enables simultaneous collection of AFM topographic and DREEM images. DREEM images reveal DNA wrapping around individual nucleosomes and the path of DNA passing through DNA mismatch repair proteins. These data yield unprecedented details about DNA conformations within individual protein-DNA complexes.

2.2. Design

We adapted and extended the dual frequency single-pass techniques that take advantage of the resonance properties of the cantilever [36-38, 40-42]. To obtain simultaneously both topographic and DREEM images, we mechanically vibrate the cantilever near the fundamental resonance ($\omega_1$), as is done in standard repulsive intermittent contact mode topographic imaging, while applying a static and a modulated bias voltage ($V_{DC}$ and $V_{AC}$, respectively) to the tip at the first overtone ($\omega_2$) to monitor the surface electrical properties (Figure 2.1) [41]. Instead of using the DC bias to nullify $F_{\omega}$ as is done in KPFM, we use an AC bias at $\omega_2$ to generate a vibration at $\omega_2$ and apply the DC bias after engaging in repulsive mode to optimize the amplitude at $\omega_2$ for electrostatic imaging. We then monitor the vibration amplitude ($A_{\omega_2}$) and phase ($\varphi_{\omega_2}$) as a function of sample position. Because there is no feedback at the first overtone, the DREEM amplitude and phase signals depend on both the strength of the electrostatic force and force gradient, including the static force gradient ($F'_{DC}$) (APPENDIX A) [38, 43-45]. In addition, other forces may contribute to the signal at $\omega_2$ if they are not canceled by the feedback at the fundamental frequency [38, 43-47]. Generally, the phase image produces higher contrast due to the nonlinear dependence of the phase on the force gradient and energy dissipation ($\varphi_{\omega_2}$ depends on the arcsine of the force gradient and the energy dissipation) [43-45]. For example, studies using dual frequency AFM (with mechanically driven vibration at both frequencies) to image antibodies found that the signal to noise ratio for the phase signal is ~50 times higher than that of the amplitude signal at $\omega_2$ [47]. Because the force gradient depends on both the capacitance and the electrostatic potential of the sample, changes in either of these properties will contribute to the observed signals. To maximize resolution in both the AFM topographic and DREEM images, we use highly doped sharp silicon cantilevers and operate in repulsive intermittent contact mode. Operating in repulsive mode keeps the tip at a constant minimal distance from the sample, which in turn, maximizes the sensitivity of detection of the electrostatic force gradient. Although highly doped silicon cantilevers are the only available cantilevers that are sufficiently sharp to provide high-resolution images, the variability of the oxidation layers on the silicon cantilevers limits the possibilities for quantitative comparison of DREEM signals collected using different cantilevers (see Limitations).
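Extracting an amplitude and phase at the first overtone from a deflection signal that also contains the much larger fundamental vibration is a lock-in measurement. The sketch below is a simplified digital illustration of that idea, not the external lock-in amplifier used in this work; the sampling rate, record length, and signal values are assumptions chosen so that the record spans an integer number of periods of both frequencies.

```python
import numpy as np

def lock_in(signal, t, f_ref):
    """Return (amplitude, phase) of the component of `signal` at f_ref.

    Assumes the record spans an integer number of reference periods, so that
    averaging rejects components at other frequencies.
    """
    ref_sin = np.sin(2 * np.pi * f_ref * t)
    ref_cos = np.cos(2 * np.pi * f_ref * t)
    x = 2 * np.mean(signal * ref_sin)  # in-phase component
    y = 2 * np.mean(signal * ref_cos)  # quadrature component
    return np.hypot(x, y), np.arctan2(y, x)

# Hypothetical acquisition settings: 10 MHz sampling, 1 ms record.
fs = 10e6
n_samples = 10_000
t = np.arange(n_samples) / fs

f1, f2 = 80e3, 500e3            # fundamental and first overtone (Hz)
a1, a2, phi2 = 40.0, 1.0, -0.4  # nm, nm, rad (illustrative values)

# Deflection = large mechanical vibration at f1 + small electrical vibration at f2.
deflection = a1 * np.sin(2 * np.pi * f1 * t) + a2 * np.sin(2 * np.pi * f2 * t + phi2)

amp2, phase2 = lock_in(deflection, t, f2)
```

In this idealized, noise-free case the demodulated amplitude and phase match the values used to synthesize the overtone component, even though the fundamental is forty times larger.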


Figure 2.1 Instrumental Design for Simultaneous AFM and DREEM Imaging

The AFM (MFP-3D, Asylum Research) is operated in repulsive oscillating (intermittent contact) mode with the cantilever mechanically vibrated near its resonance frequency ($\omega_1 = 2\pi f_1$; $f_1$ ~80 kHz for the cantilever used in this study) to collect the topographic information. To simultaneously collect the DREEM image, AC and DC biases are applied to a highly doped silicon cantilever (Nanosensors, PPP-FMR, force constant ~2.8 N/m), with the frequency of the AC bias centered on the cantilever’s first overtone ($\omega_2 = 2\pi f_2$; $f_2$ ~500 kHz). An external lock-in amplifier is used to separate the $\omega_2$ component from the output signal and compare it with the reference input AC signal to generate the electrostatic amplitude and phase signals. The DC bias is maintained constant and is used to adjust the electrical vibration amplitude to produce optimal contrast in the DREEM images. In the current setup, the AC and DC biases can be adjusted from 0 V to 20 V and -2.5 V to 2.5 V, respectively. The inset shows the thermal motion of a typical cantilever used in our experiments as a function of the frequency. The frequencies and Q factors for the fundamental ($f_1$, $Q_1$) and first overtone ($f_2$, $Q_2$) frequencies are shown by each peak.

Using the first overtone for electrostatic imaging and the fundamental frequency for topographic imaging has several advantages. First, it is preferable to conduct topographic imaging of soft samples with a minimal force to avoid damage, and the effective force constant at $\omega_1$ (~80 kHz) is approximately forty times less than that at $\omega_2$ (~500 kHz) [$k_2 = k_1(\omega_2/\omega_1)^2$] [48]. Second, $\omega_2$ is more sensitive to changes in force gradient than $\omega_1$, because the minimal detectable force gradient is inversely proportional to the frequency and the Q-factor of the resonance peak, which is higher at $\omega_2$ ($Q(\omega_2)$ ~500) than at $\omega_1$ ($Q(\omega_1)$ ~170) [49]. Third, the contribution of the electrostatic interaction between the cantilever and the sample to the electrostatic force is minimized at $\omega_2$, thereby enhancing spatial resolution in the DREEM image [50]. Fourth, higher eigenmodes provide enhanced phase contrast compared to the fundamental mode of tip oscillation for both AFM and EFM imaging [38, 47, 51].
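The "approximately forty times" figure follows directly from the quoted frequencies. The short calculation below is a sketch using only the approximate values given in the text; treating the nominal ~2.8 N/m force constant as the fundamental-mode value is an assumption, and the frequency-Q product isolates only the two factors named above while leaving out any other terms in the minimal detectable force gradient.

```python
# Approximate values quoted in the text for the cantilever used in this study.
f1_hz, f2_hz = 80e3, 500e3   # fundamental and first-overtone frequencies
q1, q2 = 170, 500            # Q factors at the two resonances
k1 = 2.8                     # N/m, assumed to be the fundamental-mode force constant

# Effective force constant scaling: k2 = k1 * (w2 / w1)^2.
# The angular-frequency ratio equals the frequency ratio.
k_ratio = (f2_hz / f1_hz) ** 2
k2 = k1 * k_ratio

# Ratio of the frequency-Q products at the overtone and the fundamental.
fq_ratio = (f2_hz * q2) / (f1_hz * q1)

print(f"k2/k1 = {k_ratio:.1f}, k2 = {k2:.0f} N/m, (f2*Q2)/(f1*Q1) = {fq_ratio:.1f}")
```

With these inputs the force constant ratio is about 39, consistent with the "approximately forty times" stated in the text, and the frequency-Q product is roughly 18 times larger at the first overtone.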

To determine the optimum voltage for obtaining the highest resolution DREEM amplitude and phase images, we hold the AC bias constant (usually $V_{AC}$ = 10 to 20 V) and vary the DC bias between +2.5 V and -2.5 V. The optimum DC bias depends on the tip, because the tips can have different extents of oxidation on their surface, which affects $\Delta\phi_{TS}$ [52]. Operating in repulsive mode using a cantilever with force constant of ~2.8 N/m, the amplitude of vibration at $\omega_2$ ($A_{\omega_2}$) is ~1 nm, which is 30 to 50 times smaller than the mechanical vibration amplitude ($A_{\omega_1}$) at the fundamental frequency. This $A_{\omega_2}$ is sufficiently large to produce high quality DREEM images and yet small enough compared to $A_{\omega_1}$ that no crosstalk from the DREEM to topographic signals is observed (see below). $A_{\omega_2}$ depends not only on the force at $\omega_2$, but also on the force gradient, $\frac{\partial F}{\partial z}$ (i.e., $F'$), because $F'$ changes the effective spring constant of the cantilever and shifts its resonance frequency, which in turn changes $A_{\omega_2}$ [53]. Upon engaging in repulsive mode, the force gradient due to repulsive atomic interactions ($F'_a$) causes the resonance peak to shift to a higher frequency, significantly reducing $A_{\omega_2}$. In our experiments, $A_{\omega_2}$ decreased by approximately a factor of two upon repulsive engage. During scanning, $F'_a$ and $F_a$ are kept constant via feedback on the topographic signal at $\omega_1$, and therefore, changes in $A_{\omega_2}$ [$\Delta A_{\omega_2}(x, y)$] depend primarily on the electrostatic force and force gradient. For small changes in electrostatic potential and/or capacitance, the frequency shift due to changes in force gradient will dominate $\Delta A_{\omega_2}(x, y)$, with the electrostatic force making only a small contribution (Appendix A.II) [54]. Notably, monitoring $F'$ instead of $F$ significantly increases spatial resolution and sensitivity, because $F'$ has a shorter distance dependence than $F$ [54-57].
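To illustrate why a force gradient changes the vibration amplitude at a fixed drive frequency, the sketch below models the first overtone as a driven damped harmonic oscillator. This is a textbook simplification offered as an assumption, not the analysis in Appendix A.II; the effective spring constant and the force-gradient value are hypothetical, with the latter chosen to give roughly the factor-of-two amplitude drop described in the text.

```python
import math

def relative_amplitude(f_drive, f_res, q):
    """Amplitude of a driven damped harmonic oscillator at f_drive,
    normalized to its amplitude when driven exactly at its resonance f_res."""
    u = f_drive / f_res
    return (1.0 / q) / math.sqrt((1.0 - u**2) ** 2 + (u / q) ** 2)

def shifted_resonance(f0, k_eff, force_gradient):
    """Resonance frequency when a force gradient dF/dz changes the effective
    spring constant from k_eff to k_eff - dF/dz."""
    return f0 * math.sqrt(1.0 - force_gradient / k_eff)

F2_HZ = 500e3      # first-overtone frequency quoted in the text
Q2 = 500           # Q factor quoted in the text
K2_EFF = 109.0     # N/m, hypothetical effective spring constant at the overtone
GRADIENT = -0.38   # N/m, hypothetical; negative dF/dz stiffens the cantilever

f_res_engaged = shifted_resonance(F2_HZ, K2_EFF, GRADIENT)
amp_free = relative_amplitude(F2_HZ, F2_HZ, Q2)             # drive on resonance
amp_engaged = relative_amplitude(F2_HZ, f_res_engaged, Q2)  # resonance shifted upward

print(f"resonance shift: {f_res_engaged - F2_HZ:.0f} Hz, "
      f"relative amplitude: {amp_engaged:.2f}")
```

Because the resonance peak is narrow at high Q, a shift of under 1 kHz at a fixed 500 kHz drive is enough in this model to cut the amplitude roughly in half, which is why small variations in force gradient across the sample translate into measurable amplitude and phase contrast.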

2.3. Results and Discussion

We verified the capabilities of DREEM for detecting surface electrical potential by imaging a BaTiO3 thin film, which can maintain a stable polarization state after being polarized by an external electric field [58-60]. We generated a pattern of very weak negatively and positively charged areas (~2 electrons/nm²) on a BaTiO3 film (Figure A.1A) [61] and then imaged the sample with AFM and DREEM with different DC and AC biases (e.g., Figure A.1). The topographic image reveals only a rough surface with a large contaminant particle, with no evidence of the charge pattern. In contrast, both the DREEM phase and amplitude signals clearly show the charge pattern, which corresponds accurately to the differently charged areas (Figure A.1B), but show no evidence of the contaminant particle seen in the topographic image. These results demonstrate the capability of DREEM for detecting weak surface charges (< 2 electrons/nm²), with no significant crosstalk between the topographic and DREEM signals. Furthermore, the observation that the contaminant particle does not produce any signal in either the DREEM phase or amplitude images suggests that the dominant force acting at $\omega_2$ is the electrostatic force.

Visualizing the Path of DNA within Protein-DNA Complexes

To demonstrate the power of DREEM for imaging protein-DNA complexes, we imaged nucleosomes and DNA mismatch repair (MMR) proteins bound to DNA, as well as free proteins. In the crystal structure of a nucleosomal core particle, 147 base pairs of DNA wrap around the histone octamer 1.67 times [62, 63]; whereas, in MMR complexes, the DNA passes through DNA mismatch recognition protein MutS [64-66], and multiple MutS and MutL proteins can assemble onto DNA containing a mismatch [2, 67-70]. The DREEM images of free histones, free mismatch repair proteins, and DNA show a decrease in the phase and an increase in amplitude, relative to the mica surface, with proteins producing greater contrast than DNA (Figure 2.2, Figure A.2, and Figure A.3A), as seen in previous EFM studies [37, 39]. The features seen in the DREEM images of free protein mimic those seen in the topographic images (Figure A.2A and Figure A.3A).

Figure 2.2 shows AFM topographic and DREEM images of nucleosomes. In the topographic images, the nucleosomes appear as smooth peaks protruding above the DNA, consistent with previous work [18, 71-77]. In contrast, in the DREEM images, the nucleosomes show regions of decreased intensity within the nucleosomal core particle, and these features are reproducible in multiple scans, scans at different angles, and in trace and retrace images (Figure

A.2B). Furthermore, multiple nucleosomes in individual DREEM images display DNA paths at different orientations (Figure A.2). The decreased intensities indicate regions of weaker electrostatic interactions between the tip and sample, which likely result from neutralization of charge and possibly changes in capacitance associated with the interaction between the protein and DNA. Consistent with this suggestion, using these decreased intensities to trace the path of

DNA on the histone yields a model in which the DNA wraps around the histone core (compare the models and images in Figure 2.2) [62, 63]. In the crystal structure, the DNA is wrapped around the histone octamer 1.67 turns [62, 63], but nucleosomes exist in a dynamic equilibrium of states that have different extents of DNA wrapping [63]. Consequently, one or two strands of DNA may be revealed in the DREEM images, depending on both the orientation of the nucleosomes on the surface and the extent of DNA wrapping. In addition, the ability to resolve two DNA strands wrapping around the histone will depend on the sharpness of the AFM tip and the quality of the

DREEM signal. In half of the nucleosome images (n = 21 out of 41 nucleosomes), we observe one DNA strand wrapping around histones (Figure 2.2A, Figure 2.2B, and Figure A.2), and in

the other half (n = 20 out of 41 nucleosomes), we can visualize two DNA strands wrapping around the histone core, where cross-section analysis reveals two distinct peaks corresponding to

DNA (Figure 2.2C, and Figure A.2). The distance between the two peaks corresponding to two

DNA double strands is 4.2 ± 0.8 nm, which is slightly larger than that seen in the crystal structure (~3 nm) [62]. This difference is likely due to both different conformations of the nucleosomes on the surface and the limit of our resolution. In the images in which two DNA strands are seen, the tip was particularly sharp, as revealed by the width of the DNA in the topographic and DREEM images (e.g., Figure 2.2C). This result suggests that the spatial resolution of the DREEM images, like that of the topographic images, is limited by the tip size.
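The peak-to-peak measurement described above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the analysis code used for these images: the profile values, the pixel size, and the helper names `local_maxima` and `peak_separation` are hypothetical, and the signal is assumed to be oriented so that the DNA appears as maxima.

```python
def local_maxima(profile):
    """Indices of interior local maxima (a plateau is counted at its left edge)."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] > profile[i - 1] and profile[i] >= profile[i + 1]]


def peak_separation(profile, nm_per_pixel):
    """Distance in nm between the two highest local maxima of a line profile.

    Returns None when fewer than two maxima are present, i.e. when only one
    DNA strand (or none) is resolved along the cross-section.
    """
    peaks = local_maxima(profile)
    if len(peaks) < 2:
        return None
    two_highest = sorted(peaks, key=lambda i: profile[i], reverse=True)[:2]
    left, right = sorted(two_highest)
    return (right - left) * nm_per_pixel


# Hypothetical cross-section through a nucleosome (arbitrary signal units),
# sampled at an assumed 1.0 nm per pixel.
profile = [0.0, 0.1, 0.6, 1.0, 0.5, 0.2, 0.4, 0.9, 0.3, 0.0]
separation_nm = peak_separation(profile, nm_per_pixel=1.0)
```

Taking only the two highest maxima keeps the sketch robust to small noise bumps; a real analysis would more likely smooth the profile or fit two peaks before measuring.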

Notably, it is possible to overlay the crystal structure of the nucleosome onto the DREEM image of the nucleosome showing two strands (Figure 2.2C). Taken together, these results demonstrate that DREEM can be a powerful method for resolving the path of DNA wrapped around proteins.


Figure 2.2 Representative Topographic AFM and DREEM Images of Nucleosomes

(A & B): Topographic (A top, B left), DREEM phase (A middle, B center), and DREEM amplitude (A bottom, B right) images of nucleosomes showing DNA wrapping around histones one time. (C): Topographic (left) and DREEM phase (right) images of a nucleosome showing DNA wrapping around the histones twice. Insets show graphs of the height cross-section for the line drawn across the nucleosome in topographic (left) and DREEM phase (right) images. The two dots on the graph correspond to the positions of the two dots shown on the line across the image, which mark the position of the peaks corresponding to the DNA in the DREEM image. The distance between the two peaks corresponding to the two DNA double strands (dots on graph) is 3.4 nm, which is similar to that seen in the crystal structure (~3 nm) [62]. Cartoon models of the DNA wrapping around histones are shown on each DREEM phase image (models are not to scale). The crystal structure of a nucleosome [62] overlaid on the DREEM phase image is shown in the inset of the phase image in C. The white scale bars are 50 nm. All topographic images are scaled to the same height, and the height scale bar is shown in A. Both the topographic and DREEM phase images in C are sharper than those in A and B as a result of a sharper AFM tip. All features in the images are seen in both the trace and retrace scans (Figure A.2B). Nucleosomes were reconstituted on a 2743 bp linear fragment containing the 147 bp 601 nucleosome positioning sequence. Unlike the images of nucleosomes, DREEM images of free histones show only a smooth “hemispherical shape”, similar to the topographic images (Figure A.2A). See also Figure A.2.

To further test the capability of DREEM for visualizing DNA contained within protein complexes, we imaged protein-DNA complexes involved in DNA mismatch repair (MMR)

(Appendix A.III). In MMR, MutS homologs recognize DNA mismatches and subsequently

form multimeric complexes with MutL homologs in the presence of ATP [2, 67-70, 78]. MutS homologs are dimers with DNA binding and ATPase domains, and the DNA binding domains encircle and bend the DNA (Figure 2.3A) [64-66]. In addition, two MutS dimers can associate to form DNA loops [6, 79, 80]. Furthermore, in the presence of ATP, MutS homologs form a mobile clamp after mismatch recognition that can move away from the mismatch, which allows multiple proteins to load onto DNA containing a single mismatch [78, 81-83]. Topographic AFM images of T. aquaticus (Taq) MutS bound to a GT mismatch (Figure 2.3B) and two MutS dimers forming a DNA loop between the mismatch and a DNA end (Figure 2.3C) show the typical smooth peaks on the DNA corresponding to Taq MutS [5, 6]. In contrast, in the DREEM images (Figure 2.3) the “peaks” corresponding to the position of MutS show regions of decreased intensity, similar to our observations with nucleosomes (Figure 2.2 and Figure A.2).

The regions of decreased intensity reveal the path of the DNA through MutS, which is hidden in the topographic AFM images. For example, in Figure 2.3B, MutS appears to be lying on its side

(relative to the model in Figure 2.3A) such that the bend in the DNA is clearly revealed. In this orientation, only a small amount of protein is on top of the DNA, allowing the complete path of the DNA to be visualized. In Figure 2.3C, the path of the DNA is partially obscured by MutS, which appears to be sitting upright on top of the DNA at the mismatch. As illustrated in the model, the DNA appears to come from underneath the protein (going from top to bottom of image) and exit on the top (where the DNA can be clearly visualized exiting the protein), with the DNA bend potentially occurring perpendicular to the surface and hidden by the protein. After exiting the protein at the mismatch, the DNA loops back to interact with the second MutS bound at the end of the DNA. Images of multiple hMutS proteins loaded onto DNA in the presence of ATP also clearly show the DNA passing through the proteins (Figure A.3). Inspection of these and

other images (not shown) suggests that the contrast between the DNA and protein in the DREEM images depends on how close the protein-DNA interaction site is to the tip. If the DNA is underneath a large amount of protein, then the electrostatic properties of the protein will likely screen out the effect of the DNA. This observation is similar to that seen with carbon nanotubes embedded in a polymer matrix, in which the contrast of the nanotubes decreases with increasing depth of the nanotubes in the matrix [38]. In addition to visualizing the DNA inside the complex, the DREEM data taken together with structural data on MutS [64] allow us to model the general orientation of the MutS dimers in the complexes (Figure A.3B, Figure A.3C). The potential power of DREEM is revealed in the image of a large multiprotein complex of human MutS and

MutL bound to DNA containing a GT mismatch (Figure A.3D). In the topographic image, a large protein complex is seen at the end of the DNA. This complex is one of the larger MutS-

MutL complexes that we observe, and it was chosen to demonstrate the capability on DREEM for resolving DNA in large multiprotein-DNA complexes. A detailed analysis of the properties of MutS and MutL complexes is the focus of another manuscript. The volume of this complex is consistent with it containing ~10 proteins [84]. The length of the DNA that is not inside the protein complex is ~120 nm shorter than the expected length for 2 kbp DNA. Inspection of the

DREEM amplitude and phase images reveals the path of the DNA in this large complex (Figure

A.3D). Including the DNA inside the proteins yields a DNA length that is within 5% of the expected length. These results suggest that DREEM may be a powerful tool for examining the path of DNA in large multiprotein-DNA complexes that may not be amenable to characterization by other techniques. In fact, the DNA path is often easier to discern in larger protein-DNA or multiprotein-DNA complexes because the DREEM signal of protein surrounding the DNA provides better contrast relative to DNA on the mica surface.
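The length accounting in this argument can be written out as a short worked example. The Python sketch below is illustrative only: the rise of 0.34 nm per base pair is the nominal B-form value and is an assumption here (the rise actually measured by AFM in air may differ), and the traced length inside the complex (`traced_inside_nm`) is a hypothetical number chosen purely to show the arithmetic, not a measured value.

```python
RISE_NM_PER_BP = 0.34  # assumed nominal B-form rise, not a value from this study


def expected_length_nm(n_bp, rise_nm_per_bp=RISE_NM_PER_BP):
    """Expected contour length of a DNA fragment of n_bp base pairs."""
    return n_bp * rise_nm_per_bp


def fractional_deviation(measured_nm, expected_nm):
    """Signed deviation of a measured length from the expected length."""
    return (measured_nm - expected_nm) / expected_nm


expected_nm = expected_length_nm(2000)   # about 680 nm for 2 kbp
visible_nm = expected_nm - 120.0         # DNA outside the complex (~120 nm short)
traced_inside_nm = 100.0                 # hypothetical path traced in DREEM image
total_nm = visible_nm + traced_inside_nm

deficit_without_trace = fractional_deviation(visible_nm, expected_nm)
deviation_with_trace = fractional_deviation(total_nm, expected_nm)
```

Under these assumptions the visible DNA alone falls roughly 18% short of the expected length, whereas adding the traced path brings the total to within the 5% agreement quoted above.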


Figure 2.3 Topographic AFM and DREEM Images of Mismatch Repair Complexes on 2 kbp DNA Containing a GT Mismatch

(A) Space-filling model of the crystal structure of Taq MutS (generated from PDB 1EWQ). Subunits A and B and the DNA are colored blue, gold, and cyan, respectively. MutS bends the DNA by ~60° as it passes through the DNA binding channel. (B) AFM topographic (left) and DREEM phase (center) and amplitude (right) images of a Taq MutS-DNA mismatch complex. Model of the complex is shown overlaid onto the AFM images and next to the phase images. (C) AFM topographic (left) and DREEM phase (right) images of two MutS dimers forming a loop in the DNA between the location of the mismatch (375 bp from one end) and DNA end. Model of the complex is shown overlaid onto the AFM images and next to the phase images. The model is based on the volume of the complex in the topographic image (consistent with two dimers), the location of the DNA in the DREEM image, as well as the crystal structure and the location of the tetramerization (two MutS dimers) interface [85, 86]. A topographic surface plot of this image is shown in Figure 2.1. (D) AFM topographic (left: surface plot) and DREEM phase (middle: surface plot; right: top view) images of a large MutS-MutL-DNA complex containing ~10 proteins. The path of the DNA is identified as the regions with highest reduction of the magnitude of DREEM signals compared to protein alone and traced in the inset in blue. Interestingly, the DNA appears to be sharply bent after entering the complex at the expected position of the mismatch (MM). Z-scale bars are in nanometers for AFM images and arbitrary units for the DREEM images. See also Figure A.3.

Limitations

Other than the requirement that the samples must be deposited on a surface to be imaged, which is common to all scanning probe microscopies, the primary limitation of DREEM relates to the use of highly-doped silicon cantilevers. Although doped diamond-coated cantilevers (tip radius ~100 nm) and metal-coated cantilevers (tip radius ~30 nm) are typical choices for EFM imaging [87], they are not sufficiently sharp to produce high-resolution images. Highly-doped silicon cantilevers are sharp (5-8 nm) and sufficiently conductive for high-resolution topographic and DREEM imaging; however, the quality of the DREEM image appears to depend on the oxidation layers on the surface. The oxidation layer on the silicon cantilevers requires that the

DC and AC biases be optimized for each cantilever. These differences in oxidation layers prevent quantitative comparison of the magnitudes of the DREEM signals collected with different tips, or the same tip after collecting a series of images. In addition, ~30% of prepared conductive silicon cantilevers do not generate sufficient contrast between the protein and DNA to allow us to discern paths of DNA in protein-DNA complexes in DREEM images. Argon plasma cleaning of the cantilevers prior to use appears to improve their quality for DREEM imaging.

Finally, the quality of the DREEM images degrades during imaging faster than that of the topographic images. Typically, ~10 to 12 high-quality DREEM images can be obtained from a single AFM tip.

Similar to conventional AFM imaging techniques, DREEM imaging can also experience tip artifacts, due to the asymmetry in the electric field between the AFM tip and sample surface.

For example, in some cases, half-moon like asymmetries, with one side of the DREEM signal consistently stronger than the other side, are seen in the same orientation for all complexes in a single DREEM image. As with tip artifacts in topographic images, these artifacts can be

identified by the repetitive features in different molecules from the same image and by scanning at various angles.

A final limitation of DREEM is that it is currently limited to imaging in air. At present, we have not been able to identify operating parameters that allow contrast in an aqueous environment. A few studies demonstrate EFM imaging of solid materials at low ionic strength using lift mode [88, 89]; however, the resolution and detection limit in these images appear low.

It is likely that the electrostatic double layer significantly damps the DREEM signals from proteins and DNA in electrolyte solutions.
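To give a sense of the length scale over which such screening acts, the Python sketch below evaluates the standard textbook approximation for the Debye length of a 1:1 electrolyte in water at 25 °C, λ_D ≈ 0.304/√I nm with I in mol/L. This is an illustrative estimate under those assumptions rather than a result of this study; the function name `debye_length_nm` and the example ionic strengths are hypothetical choices.

```python
import math


def debye_length_nm(ionic_strength_molar):
    """Approximate Debye length (nm) for a 1:1 electrolyte in water at 25 °C.

    Uses the textbook approximation lambda_D ~ 0.304 / sqrt(I), with I in mol/L.
    """
    if ionic_strength_molar <= 0:
        raise ValueError("ionic strength must be positive")
    return 0.304 / math.sqrt(ionic_strength_molar)


# Example ionic strengths (assumed values, for illustration only).
low_salt_nm = debye_length_nm(0.001)   # 1 mM
high_salt_nm = debye_length_nm(0.1)    # 100 mM
```

At 100 mM the screening length is below 1 nm, while at 1 mM it is close to 10 nm, which is consistent with liquid EFM having been reported only at low ionic strength.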

Conclusions

In summary, while the paths of DNA are hidden in protein complexes in traditional microscopy imaging techniques, such as AFM and EM imaging, DREEM allows the visualization of the conformation of DNA within individual protein-DNA complexes. In addition to the studies presented here, DREEM also has been utilized to visualize DNA conformations within telomere binding proteins (Benarroch-Popivker et al., 2016; unpublished results, Kaur,

Wu, Lin, Countryman, Bradford, Erie, Riehn, Ospresko, Wang). Taken together, the capability of DREEM to detect very small changes in electrostatic force gradient with high resolution makes it a powerful tool for characterizing the structure of protein-DNA complexes at the single-molecule level. It will be especially useful for characterizing protein-DNA complexes with long length scales and those that result in heterogeneous populations of proteins on the DNA.

Furthermore, a growing area in structural biology is the combination of atomic resolution crystal structures with lower resolution data from small angle x-ray scattering (SAXS), EM, and AFM to generate atomic level structures of complex assemblies and conformationally flexible proteins

[8-25]. DREEM has the capability to significantly increase the constraints on the possible

orientations of proteins in multiprotein assemblies on DNA, as demonstrated by our ability to dock the crystal structure of the nucleosome into a subset of the images. In addition, DREEM allows the path of DNA to be resolved in large heterogeneous multi-protein-DNA complexes. It also will be applicable for characterizing the electrostatic properties of other biological specimens, such as viruses and membranes, as well as non-biological samples. With sharper tips and further refinement of the technique, it is highly likely that the resolution can be further increased in the future. Finally, with the addition of only two components (a function generator and a lock-in amplifier, Figure 2.1), DREEM can be implemented on many of the commercially available AFMs, making it readily available to many labs.

Experimental Procedures

Instrument design

Our experimental setup for simultaneous AFM and DREEM is described in Figure 2.1.

In our setup, we apply an AC bias at the first overtone ($\omega_2$) and monitor the vibration amplitude ($A_{\omega_2}$) and phase ($\varphi_{\omega_2}$) as a function of position, while simultaneously collecting the topographic image at the fundamental frequency ($\omega_1$).
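The role of the lock-in amplifier in this scheme can be illustrated in software. The Python sketch below is a minimal digital analogue of lock-in detection, not a description of the instrument used here: it recovers the amplitude and phase of a small component at a reference frequency from a signal dominated by a second, larger oscillation. The sampling rate, the two frequencies, and the signal amplitudes are hypothetical values chosen so that the record spans a whole number of periods of both components.

```python
import math


def lock_in(samples, sample_rate_hz, ref_freq_hz):
    """Estimate (amplitude, phase in rad) of the component at ref_freq_hz.

    Assumes the record spans a whole number of periods of every component
    present, so that contributions at other frequencies average to zero.
    """
    in_phase = 0.0
    quadrature = 0.0
    for k, s in enumerate(samples):
        theta = 2.0 * math.pi * ref_freq_hz * k / sample_rate_hz
        in_phase += s * math.cos(theta)
        quadrature += s * math.sin(theta)
    in_phase *= 2.0 / len(samples)
    quadrature *= 2.0 / len(samples)
    # For s = A*cos(theta + phi): in_phase = A*cos(phi), quadrature = -A*sin(phi).
    return math.hypot(in_phase, quadrature), math.atan2(-quadrature, in_phase)


# Hypothetical deflection signal: a large oscillation at F1 (standing in for
# the fundamental) plus a small response at F2 (standing in for the overtone).
FS = 100_000.0
F1, F2 = 1_000.0, 6_000.0
signal = [
    1.0 * math.cos(2.0 * math.pi * F1 * k / FS)
    + 0.05 * math.cos(2.0 * math.pi * F2 * k / FS + 0.3)
    for k in range(1000)
]
amplitude, phase = lock_in(signal, FS, F2)
```

Because the reference is multiplied in at F2 only, the much larger F1 oscillation does not contaminate the recovered amplitude and phase, which mirrors how the two channels can be collected simultaneously.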

The detailed methods for conductive cantilever preparation, substrate grounding, selection of imaging conditions, sample preparation, deposition, and analysis are described in

Appendix A.III.

Supplementary Information

Supplementary Information, which includes the theoretical basis of DREEM,

Supplementary Experimental Procedures and three figures, can be found in APPENDIX A.

Author Contributions

D.W. and D.A.E. invented the DREEM method. KCB prepared human mismatch repair protein-DNA samples. D.W., P.K., Z.L., H.W., and D.A.E. designed and conducted experiments, analyzed data, and wrote the manuscript.

Acknowledgements

We would like to thank James Jorgenson for the use of the lock-in amplifier, Peggy Hsieh for providing human MutS and MutL proteins, and Elizabeth Sacho and Keith Weninger for helpful discussions and critical reading of the manuscript. This work was supported by National Institutes of Health grants R01 GM079480 (DAE), R00 ES016758 (HW), and R01 GM107559 (HW), and an NCSU start-up fund (HW).

CHAPTER 3. IMAGE METRICS, A NEXT-GEN AFM IMAGE ANALYSIS SOFTWARE

One of the biggest hurdles in single-molecule AFM studies is the lack of a comprehensive software package that allows for high-throughput analysis. In fact, data analysis is often the bottleneck in single-molecule studies. To tackle this issue, I developed Image Metrics, a full-featured MATLAB³-based AFM image analysis software package. Relying on MATLAB’s enormous scientific library, Image Metrics is able to blend powerful features and flexibility into a user-friendly interface, and enables users to perform high-throughput multi-faceted image analysis. In particular, Image Metrics features unique modules for single-molecule analysis and shape analysis such as particle classification that are not available in other AFM software. With

Image Metrics, single-molecule AFM analysis is streamlined - image correction, measurement, and analysis are all processed in a single software package, and users can easily program user functions, customize workflows, and automate laborious routines. The software is designed to be the next generation research tool for AFM and other imaging fields.

³ MATLAB is a programming language designed for technical computing and developed by MathWorks, Inc.

3.1. Introduction

In single-molecule AFM studies, a large amount of data usually has to be collected and processed to reach a statistically relevant conclusion. Increasing the quantity of data should improve the quality of data analysis; however, the time required for image analysis is a major limiting factor in increasing the sample size. Consequently, a software package capable of analyzing single-molecule data in high throughput is sorely needed.

Despite the many software packages available ([90], Ref. Appendix C), none of them are tailored for single-molecule AFM studies. Table 3.1 (white rows) gives an overview of the major software on the market. While particle analysis exists as a stock feature for many programs

(Table 3.1 Particle Analysis) [91, 92], the particle metrics (Table 3.3) analyzed are too generalized to describe complex conformations such as those of protein-DNA complexes. To analyze such conformations, users have to manually zoom in to the region of interest and perform a variety of measurements, such as DNA profile and DNA bend angle, which is a highly repetitive and time-consuming task. On top of that, none of the software offers vertically integrated capabilities to manage data recordings for such analysis⁴. While certain workflows can be automated with preset functions in some programs (Table 3.1 Macros), they are often basic or difficult to use. Few software packages are comprehensive enough to complete all the image processing and analyses required for single-molecule AFM studies in a single package, and programming custom functions is too complicated for users without a computer science

⁴ The recording and management of measurement data in existing software is deferred to the users, often requiring them to manually input data into an external spreadsheet program.

background⁵. Many software packages are also poorly maintained, documented, or interfaced⁶.

Overall, none of the software is capable of performing single-molecule analysis in high throughput. Their shortcomings add significant time overhead and costs to users’ workflows, which typically involve three or more programs, and present a huge challenge to single-molecule AFM studies.

To tackle this problem, I developed Image Metrics (Table 3.1 grey row), a professional image processing and analysis software. Image Metrics is written in MATLAB from the ground up and incorporates innovative and comprehensive features into a user-friendly, well-documented interface. It is specifically designed to allow for high-throughput single-molecule analysis, and is compiled as a royalty-free standalone application that runs across all major platforms. With the advent of Image Metrics, single-molecule AFM analysis is streamlined: image import, calibration, processing, analysis, and results output can all be carried out within the same program. Notably, Image Metrics features advanced fiber analysis and shape analysis that are unique to the field of single-molecule AFM studies, such as the study of protein-DNA conformations and the classification of molecular conformations. Image Metrics also provides unique flexibility that allows users to customize and automate workflows without advanced coding skills and to write custom functions through the built-in scripting interface. Collectively, these strengths provide researchers with a cost-effective and attractive application platform to port

⁵ For example, writing extensions for these programs requires advanced knowledge of a general-purpose programming language (C, C++, Python, Java, etc.; Table 3.1 Extensions), which leaves users to handle many general aspects of computing, such as memory management, complicated syntax, and integration of external libraries.

⁶ For example, many software packages (Asylum Research, Nanoscope analysis, ImageJ, WSxM, etc.) are not adapted to high DPI (dots per inch) displays, making them difficult to use on modern screens. Other than the commercial software SPIP, most software packages are buggy and frequently broken in features, and have poorly designed user interfaces and outdated documentation. In the case of Gwyddion, the interface can be difficult to understand for many users even with documentation.

their own applications, which could open up Image Metrics to fields far beyond its original scope.

| # | Software | Image Formats | Platforms | Image Processing | Particle Analysis | Fiber Analysis | Bend Angle |
|---|---|---|---|---|---|---|---|
| 1 | Asylum Research | Asylum Research format | Windows, macOS | Yes | Yes | Yes | No |
| 2 | Gwyddion | Various | Multiple | Yes | Yes | No | No |
| 3 | Image Metrics | Various | Multiple | Yes | Yes | Yes | Yes |
| 4 | ImageJ | Nanoscope format | Multiple | Yes | Yes | Yes | No |
| 5 | ImageSXM | Various | macOS | Yes | Yes | No | No |
| 6 | Nanoscope Analysis | Nanoscope format | Windows | Yes | Yes | No | No |
| 7 | SPIP | Various | Windows | Yes | Yes | Yes | Yes |
| 8 | WSxM | Various | Windows | Yes | No | No | No |

| # | Macros | Extensions | Source | License | Ref. | URL |
|---|---|---|---|---|---|---|
| 1 | Yes | Igor Pro | Available⁷ | Free | | http://support.asylumresearch.com |
| 2 | Yes | C, Python | Open | GNU | [92] | http://gwyddion.net |
| 3 | Yes | MATLAB | Available⁸ | Free | | http://im.zimengli.com |
| 4 | Yes | Java | Open | BSD | [93, 94] | http://imagej.net |
| 5 | Yes | No | Closed | Free | [95] | https://www.liverpool.ac.uk/~sdb/ImageSXM |
| 6 | Yes | No | Closed | Free | | http://nanoscaleworld.bruker-axs.com/ |
| 7 | Yes | C++, C# | Closed | ~$10k | [96] | http://imagemet.com |
| 8 | Yes | No | Closed | Free | [97] | http://www.wsxmsolutions.com |

Table 3.1 Overview of major AFM Software for Single-Molecule AFM Study

Other AFM software includes SFMetrics [98], GXSM [99], DockAFM [100], OpenFovea [101], FRAME [102], DeStripe [103], DNA Trace [104], FiberApp [105], and various manufacturer software from JPK Instruments, Bruker, etc. Image Formats: ‘Various’ means support for images from multiple AFM vendors. Platforms: ‘Multiple’ means support for Windows, macOS, and Linux. Image Processing: standard image corrections (Section 3.4). Particle Analysis: batch analysis of particle metrics (Section 3.5.2A). Fiber Analysis: measuring the profile and length of fibers such as DNA (Section 3.5.3B). Bend Angle: measuring DNA bend angles and curvatures (Section 3.5.3D). Macros: scripts that automate program functions (Section 3.6), also known as global analysis or batch analysis. Extensions: external programs that act natively and implement additional functions (Section 3.6), also known as plugins or modules. Source: open source licenses such as GNU and BSD allow for free modification, contribution, and distribution of code. Source-available software limits the modification and distribution of code in some way. Closed source software is also referred to as proprietary software. License: GNU and BSD are open and free licenses. SPIP pricing is quoted as the full/premium-feature installation price per seat.

⁷ Users can modify, but not contribute to the code.

⁸ Users can modify and contribute to the code, but distribution of code may be limited.

Due to the scope of this study, I will mostly discuss applications of Image Metrics that are relevant to single-molecule AFM studies, and briefly mention other applications where applicable. Details of all its functions and instructions can be seen on its website

(http://im.zimengli.com).

3.2. Interface

Image Metrics is designed to maximize the accessibility, flexibility, and availability to users of all major desktop platforms. It runs on the latest Windows, macOS, and Linux distributions. It consists of different modules (called ‘apps’) united under a single application launcher (Figure 3.1). The design language of a typical module consists of ribbons⁹, toolbars, a search bar, a workspace, and a status bar (Figure 3.2).

A welcome screen serves as both recent file lists and tutorials on most modules (Figure D.1). Tutorials are composed of step-by-step guides, demos (usually images), online videos, and online instructions. Functions are designed to be easily accessible throughout the program. Users can place their most often used functions in the mini toolbar or in the quick access bar, and keyboard shortcuts can be assigned to most functions. Accessibility-wise, the software is DPI¹⁰ aware and fits into most monitor resolutions, sizes, and dimensions. Font and window size can be adjusted without compromising software usability. Most functions will also remember their last known parameters, so user preferences will be saved upon exiting and reloaded upon relaunching.

⁹ Also called Toolstrips in software such as MATLAB.

¹⁰ Dots per inch or DPI is a measure of display scaling factor. Larger DPI means larger, more visible interface elements.


Figure 3.1 Image Metrics Application Launcher

The Application Launcher is the central place to manage and launch Image Metrics apps. The apps are arranged in groups based on their categories. Apps can be added to favorites as shortcuts. Users can load custom apps in their source form or compiled form, or download user-published apps from the App Store (in development). Read more on user extensions in Section 3.6.


Figure 3.2 Image Metrics User Interface

Application modules are launched through the Application Launcher. Featured in the image is the Region Inspector module. The toolbar and ribbon provide access to major application functions. The search bar allows users to find a function by its description or name, or to search for help. The workspace is the main working area of an application module. Frequently accessed functions can be placed on the mini toolbar and/or the quick access bar. The mini toolbar is context sensitive, but can be pinned on top of the workspace. The status bar can display helpful information such as tooltips, computing progress, and axes location of the mouse pointer.

3.3. Basic Operations

A. Image I/O and Calibration

Similar to many programs (Table 3.1 Image Formats), Image Metrics supports multiple image formats. Currently, Image Metrics supports direct import of raw AFM data from several manufacturers (Asylum Research, Veeco/Bruker, JPK Instruments, etc.)¹¹ in addition to conventional image and video formats¹². Users can also set up custom imports if file I/O code is provided, much of which can be found on MATLAB File Exchange or on third-party websites. Image Metrics also features an Image Acquisition module (Figure D.7) and Image

Simulator module (Figure H.1) for acquiring images from external and synthetic sources. The

Image Acquisition module relies on MATLAB’s Image Acquisition Toolbox, which supports many types of microscope hardware¹³. While the module does not directly support AFM image acquisition, it is possible to communicate with external AFM hardware if a software interface is provided by the

AFM manufacturer¹⁴. The Image Simulator module provides simulation of AFM images using

3D models that can either be parameterized shapes or 3D models from external files such as those from the protein data bank (PDB). The module also lets users perform tip modeling and apply tip dilation to images using methods described in previously published papers [106, 107].

Interested readers can find more details in APPENDIX H. To save processed or analyzed data,

¹¹ Manufacturer file I/O is achieved via several submissions in MATLAB File Exchange; see Section 3.8 for license and copyright information.

¹² Formats that are natively supported by MATLAB.

¹³ Currently, most hardware packages are based on camera systems. AFM vendors traditionally use proprietary software to control their systems. However, the potential to open such access to users is there.

¹⁴ For example, in Asylum Research software, many AFM commands can be accessed at the command line and therefore programmed externally.

users can save them into an Image Metrics-specific file (.im) that can also be imported into MATLAB as a MAT file¹⁵ for further processing in MATLAB.
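To make the tip dilation step mentioned above concrete, the Python sketch below applies a generic grey-scale morphological dilation to a small synthetic surface. It is a minimal illustration under stated assumptions: the function names, the parabolic tip model, and all numerical values are hypothetical, and the code is not claimed to reproduce the Image Simulator module or the methods of the cited papers.

```python
def parabolic_tip(half_width_px, radius_nm, nm_per_pixel):
    """Height of a parabolic tip above its apex, z = r^2 / (2R), on a square grid."""
    offsets = range(-half_width_px, half_width_px + 1)
    return [
        [((dx * dx + dy * dy) * nm_per_pixel ** 2) / (2.0 * radius_nm) for dx in offsets]
        for dy in offsets
    ]


def dilate(surface, tip):
    """Simulate the height image recorded when `tip` scans over `surface`.

    At each pixel the apex rests at the lowest height for which no part of
    the tip penetrates the surface: max over offsets of surface - tip.
    Offsets that fall outside the surface are ignored.
    """
    rows, cols = len(surface), len(surface[0])
    r = len(tip) // 2
    image = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            best = float("-inf")
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < rows and 0 <= xx < cols:
                        best = max(best, surface[yy][xx] - tip[dy + r][dx + r])
            image[y][x] = best
    return image


# Hypothetical example: a single 2 nm tall, one-pixel-wide feature on a flat
# surface, imaged with an assumed 8 nm radius tip at 2 nm per pixel.
surface = [[0.0] * 5 for _ in range(5)]
surface[2][2] = 2.0
tip = parabolic_tip(half_width_px=1, radius_nm=8.0, nm_per_pixel=2.0)
image = dilate(surface, tip)
```

In the simulated image the one-pixel feature keeps its height but spreads into the neighboring pixels, which is the broadening that a finite tip radius introduces in real topographs.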

After images are imported, they can be automatically calibrated to proper units if calibration data are contained in image files. Otherwise, images can be calibrated manually in the four dimensions (x, y, z, t) in user defined units. The wide range of image formats supported and the flexibility to use custom units allow Image Metrics to adapt to a wide range of imaging applications, ranging from microscopic (molecular and cellular biology, material sciences) to macroscopic (geography and astronomy), from still images (e.g. static AFM imaging) to dynamic images (e.g. high-speed or time-lapsed AFM imaging, fluorescence microscopy, etc.).
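The calibration described above amounts to mapping raw indices and values onto physical axes. The following Python sketch shows one way this could be represented; it is an illustrative data structure, not the Image Metrics implementation, and the class names, the linear scale-plus-offset model, and the example scan parameters are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class AxisCalibration:
    """Linear calibration for one axis: physical = raw * scale + offset."""
    scale: float          # physical units per raw step (pixel, count, or frame)
    offset: float = 0.0
    unit: str = "px"

    def apply(self, raw):
        return raw * self.scale + self.offset


@dataclass
class ImageCalibration:
    """Calibration of the four dimensions (x, y, z, t) of an image stack."""
    x: AxisCalibration
    y: AxisCalibration
    z: AxisCalibration
    t: AxisCalibration

    def to_physical(self, col, row, raw_value, frame=0):
        return (
            self.x.apply(col),
            self.y.apply(row),
            self.z.apply(raw_value),
            self.t.apply(frame),
        )


# Hypothetical scan: 512 x 512 pixels over 1000 nm, 0.01 nm per height count,
# and 0.5 s per frame for a time-lapse series.
calibration = ImageCalibration(
    x=AxisCalibration(1000.0 / 512, unit="nm"),
    y=AxisCalibration(1000.0 / 512, unit="nm"),
    z=AxisCalibration(0.01, unit="nm"),
    t=AxisCalibration(0.5, unit="s"),
)
x_nm, y_nm, z_nm, t_s = calibration.to_physical(col=256, row=128, raw_value=150, frame=4)
```

Keeping the unit as a free-form string per axis is what would let the same structure describe nanometers for AFM data or kilometers for geographic data.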

B. Image Navigation

In Image Metrics, images are treated as containers. Each image can contain one or more layers (named data channels in Image Metrics). For example, an AFM image can contain multiple channels (height, amplitude, phase, etc.); a color image contains three color channels

(RGB); a dynamic image (video) uses its frames as channels. Similar to other AFM programs, an

Image Browser module (Figure D.8) is incorporated in Image Metrics to help navigate the images and different image layers more easily. From there, users can remove images, open new images, and duplicate images. Image Metrics also features an Image Viewer module (similar to

SPIP’s inspection window) so that users can open the same image, or different images, in different windows for close-up inspections, cross comparisons, and/or synchronized views.

Imported images can be edited (e.g. resized, cropped, corrected) and exported as individual images or videos for publication or processing in other programs.
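The container model described in this section can be sketched as a small data structure. The Python example below is a hypothetical illustration of the idea only: the class name `ImageContainer`, its methods, and the requirement that all channels share one shape are assumptions, not the internal representation used by Image Metrics.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ImageContainer:
    """An image holding one or more named data channels of equal shape."""
    name: str
    channels: dict = field(default_factory=dict)  # channel name -> 2D list

    def add_channel(self, channel_name, data):
        shape = (len(data), len(data[0]))
        for existing in self.channels.values():
            if (len(existing), len(existing[0])) != shape:
                raise ValueError("all channels must share the same shape")
        self.channels[channel_name] = data

    def duplicate(self, new_name):
        """Return an independent copy, as when duplicating an image in a browser."""
        return ImageContainer(new_name, copy.deepcopy(self.channels))


# An AFM image with two channels; an RGB image would hold three color
# channels, and a video would hold one channel per frame.
afm = ImageContainer("scan_001")
afm.add_channel("height", [[0.0, 1.2], [0.8, 2.5]])
afm.add_channel("phase", [[10.0, 12.0], [11.0, 9.0]])

duplicate = afm.duplicate("scan_001_copy")
duplicate.channels["height"][0][0] = 99.0  # editing the copy leaves `afm` intact
```

Using a deep copy for duplication means edits such as cropping or correction on the duplicate never alter the originally imported data.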

¹⁵ A MAT file is the typical data file used in MATLAB.

C. Image Visualization

AFM images are intensity images. As in other AFM programs, they are visualized via a pseudo-color (false-color) colormap [92, 97, 108]. Image Metrics offers a variety of colormaps to help image visualization. Users can change the contrast of a colormap or customize the colormap with easy slider-based adjustment tools. The colormap can also be loaded from or saved to external files for easy sharing and backup, and many professional colormaps and colormap utilities¹⁶ from various fields of study can be downloaded from MATLAB File Exchange. To change the data range to which the colormap applies, users can either manually enter the data range or use the program’s built-in function to determine the data range based on a region of interest, which can be either the whole image or a custom drawn region (Figure 3.3A). Users can also make the intensity distribution plot, where they can place range markers to change the lower and upper end of the colormap interactively (Figure 3.3B). Many AFM programs interpolate data when zoomed in (e.g. Asylum Research). Although Image Metrics does not interpolate data by default when zoomed in (Figure 3.3C), it can be done if the user chooses (Figure 3.3E). As with many AFM programs, 3D surface plots can be rendered to visualize the topographic information. Apart from the conventional 3D operations, the program also features a very useful image overlay function that allows another image channel (usually representing another AFM signal channel such as force, phase, voltage, etc.) to be overlaid on top of the topography as a color layer. Contour plots can also be overlaid to easily visualize the change of intensities in regions of interest (Figure 3.3F). Besides Image Metrics, the image and contour overlay function is available only in the commercial software SPIP.
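Two of the operations described above, scaling the colormap to a region of interest and interpolating data on zoom, can be sketched in a few lines of Python. This is a minimal illustration rather than the Image Metrics implementation: the function names are hypothetical, the region is assumed to be an axis-aligned rectangle, and bilinear interpolation is an assumed choice of interpolation scheme.

```python
def roi_range(image, top, left, bottom, right):
    """Minimum and maximum inside a rectangular region (bottom/right exclusive)."""
    values = [image[r][c] for r in range(top, bottom) for c in range(left, right)]
    return min(values), max(values)


def normalize(value, low, high):
    """Map a data value to a colormap position in [0, 1], clipping outliers."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))


def bilinear_sample(image, row, col):
    """Bilinearly interpolate `image` at a fractional (row, col) position."""
    r0, c0 = int(row), int(col)
    r1 = min(r0 + 1, len(image) - 1)
    c1 = min(c0 + 1, len(image[0]) - 1)
    fr, fc = row - r0, col - c0
    top = image[r0][c0] * (1 - fc) + image[r0][c1] * fc
    bottom = image[r1][c0] * (1 - fc) + image[r1][c1] * fc
    return top * (1 - fr) + bottom * fr


# Hypothetical height data with one tall outlier outside the region of interest.
image = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 10.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
]
low, high = roi_range(image, top=0, left=0, bottom=2, right=2)
outlier_color = normalize(image[2][2], low, high)   # clipped to the top of the map
midpoint = bilinear_sample(image, 0.5, 0.5)         # value between four pixels
```

Restricting the range to the region of interest spends the full colormap on the features being inspected, at the cost of saturating taller features elsewhere, such as the outlier in this example.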

16 For example, ColorBrewer and Cubehelix are third-party utilities that can be installed as extensions (Section 3.6) in Image Metrics to provide advanced colormaps and adjustment capabilities.


Figure 3.3. Image Visualization

A. Image to be visualized (height channel). In Image Metrics, users can draw an area (blue box) that automatically scales the colormap to the data range within that area. B. Intensity distribution plot. Users can place two marker bars to adjust the upper and lower limits of the current colormap and interactively change the image appearance in A. C. DREEM phase channel (see DREEM imaging in CHAPTER 2) of the same image as A. D. Zoomed-in view of the red box region in C. E. Interpolated view of D, which appears much finer in detail than D. Use discretion with data interpolation, which may or may not improve the interpretation of the data. F. 3D view with DREEM phase and contour overlay. The combined visualization allows easier tracing of DNA paths along the topography.

3.4. Image Processing

Like other imaging techniques, AFM imaging produces image artifacts that usually require correction before the images can be analyzed. AFM image artifacts typically derive from factors that impact the probe-surface interaction, such as: (1) tip degradation and contamination; (2) excessive or insufficient imaging force, which leads to improper tracking of the surface; (3) factors that impact the performance of the AFM scanner and feedback system, such as environmental noise, piezo drift, and improper gain settings on the feedback system.

Although many artifacts can be corrected to various degrees, not all corrections make for easier or better analysis. Some require complex calculations and calibrations and may introduce new artifacts through over-correction. The decision to correct a given artifact must therefore be made at the level of the individual study. In Image Metrics, the Image Processor module (Figure D.1) is specifically designed to process and correct images. Here, I discuss the features in Image Metrics most relevant to correcting some of these image artifacts. For processing other typical image artifacts, see APPENDIX F.

3.4.1. Surface Correction

Perhaps the most obvious difference between an AFM image and an image obtained by an optical imaging technique is the apparent line-wise image artifacts in AFM as indicated by the constant changes in heights across an image even for a flat surface (Figure 3.4, Figure 3.7A).

These line variations can result from multiple causes: (a) AFM piezoelectric scanner drift (mechanical or thermal) during the scan ([109] Ref. Section 3.5, [110]): drift in the Z piezo results in changing heights along the scan line (the fast scan axis or x axis, Figure 3.4 orange and green box) and among different scan lines (the slow scan axis or y axis, Figure 3.4 green box); drift in the X and Y piezos results in shortening or lengthening of the distance between surface features and, in the worst scenario, transformation of surface features (Figure 3.7B red box). The amount of drift is directly impacted by scan speed, with faster scanning resulting in less drift. (b) Non-linearity of piezo response along the scan line ([110], [90] Ref. Section 2.2.1): the piezo typically suffers from non-linear response17 with increasing scan size, especially when the scan size approaches the scanner limit, resulting in a ‘scan bow’ effect (Figure 3.7C), and from a hysteresis effect18 between tracing and retracing of the surface, resulting in differential heights between the trace and the retrace images. (c) Scanning over an uneven or tilted surface ([90] Ref. Section 5.1.1): even for a flat, even surface like freshly peeled mica, tilt can still occur if the mica is not glued evenly to its mounting substrate, the substrate is unevenly mounted, or the scanner is not leveled parallel to the platform upon imaging, resulting in an uneven image (Figure 3.4).

These changing heights along the line or from line to line, if not normalized, result in an uneven image (Figure 3.4, Figure 3.7C) of what should otherwise be an even, flat surface, such as the mica used in single-molecule protein-DNA AFM studies. Even if the surface is inherently uneven, it may still require normalization for analyzing some surface features, especially if height-based calculation (e.g. masking, volume calculation, etc., more on that later19) is needed. Without correction, the unevenness offsets surface features differentially by their local surface heights and makes it difficult to measure or process anything height related accurately.

17 Piezo linearity is indicated by a linear curve of applied voltage vs. piezo displacement.

18 Hysteresis is a common effect in ferromagnetic, ferroelectric, and piezoelectric materials. It is indicated by the non-linearity of the material response when raising and lowering the applied voltage between two set voltages.

19 Also discussed in Section 3.5.1 and Section 3.5.3B.


Figure 3.4 AFM Scan Line Image Artifact

A typical, unprocessed AFM image is shown. Three types of scan line artifacts may arise. Green box: Uneven heights along the slow scan axis (y axis) originating from Z-piezo drift. Orange box: Uneven heights compared to the green box along the same scan lines. This artifact could originate from multiple causes: (a) non-linearity of piezo response along the scan line; (b) Z-piezo drift (thermal or mechanical); (c) an uneven or tilted surface. Blue box: Line artifact along the scan line when the tip stumbles upon a tall feature (resulting in spikes) or drops into a low feature (resulting in dark stripes).

Surface normalization is a staple feature of many AFM programs (e.g. flatten in Asylum Research and Nanoscope Analysis software, detrend in SPIP, leveling in Gwyddion, etc.). An example step-by-step correction for line-wise artifacts in Image Metrics is shown in Figure 3.7. Despite their different names, these operations all use a polynomial curve fitting approach to normalize the surface height ([109] Ref. Section 3.7.4, [90] Ref. Section 5.1.1) (Figure 3.5). By subtracting a polynomial curve (Figure 3.5 red curve) fitted along every scan line, the median of the height distribution (i.e. the surface) is normalized to zero (Figure 3.5 right), resulting in a flat surface (Figure 3.7E). Indeed, a first order flattening is often applied in real time during the scan to help visualization of the image (Figure 3.7B), but further corrections upon inspection of individual images are usually needed to normalize the surface accurately. To validate the algorithm used in Image Metrics, a comparison between Image Metrics and other AFM software is shown in Table 3.2. The differences (Table 3.2, red highlights) are orders of magnitude below the pixel variations of a flat mica surface (Table 3.2, std. deviation and avg. deviation), and likely result from rounding errors in the underlying numerical packages used by the different programs. In addition, surface features, including high features such as protein-DNA complexes or low features such as surface wells or pores, can be excluded or masked from the curve fitting using a height-based threshold ([90] Ref. Section 5.1.1) or clustering methods [108] (Figure 3.7D), thereby eliminating outliers that could derail the surface normalization. Such an operation is often called masked flattening (Asylum Research software) or flattening with thresholding (Nanoscope Analysis). Image Metrics also provides a novel Gaussian-based algorithm that automatically calculates the optimal surface threshold used to mask the features without masking too much surface (Figure 3.6). It should be noted that the line-wise correction will not correct a truly warped or slanted surface, but a flattened surface may still be desirable for easier measurement of relative heights. In those scenarios, a planefit (two-dimensional polynomial fit or surface fit) is used instead ([109] Ref. Section 3.7.4).
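The idea behind the Gaussian-based threshold (detailed in the caption of Figure 3.6) can be illustrated with a short sketch. It is written in Python for illustration and is not the Image Metrics implementation: instead of fitting a Gaussian curve to the histogram peak, it assumes the surface peak can be approximated by the median and the median absolute deviation (MAD) of the pixel heights, and the parameter `drop_fraction` (the fraction of the peak height at which the threshold is placed) is a hypothetical name.

```python
import math
import statistics


def gaussian_surface_threshold(heights, drop_fraction=0.01):
    """Estimate a surface threshold from a flat list of pixel heights.

    Assumption: the surface peak is roughly Gaussian, with its center
    estimated by the median and its width by 1.4826 * MAD (a robust stand-in
    for a least-squares Gaussian fit to the intensity histogram).
    """
    center = statistics.median(heights)
    mad = statistics.median(abs(h - center) for h in heights)
    sigma = 1.4826 * mad
    # A Gaussian falls to `drop_fraction` of its peak at center + k * sigma,
    # so the offset scales with the breadth of the peak (the noise level).
    k = math.sqrt(-2.0 * math.log(drop_fraction))
    return center + k * sigma


# Hypothetical data: a noisy flat surface plus a few tall feature pixels (nm).
surface = [-0.2, -0.1, 0.0, 0.1, 0.2] * 20
features = [2.0] * 10
threshold = gaussian_surface_threshold(surface + features)
mask = [h > threshold for h in surface + features]
```

Because the offset is expressed in units of the peak width, a noisier surface automatically receives a higher threshold, which is the property the caption attributes to the method.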


Figure 3.5 Detrend Operation

A section profile along the scan line is shown on the left (blue), which can be fit into a polynomial curve (red curve, here the 3rd order polynomial fit is used). Subtracting the polynomial curve from the section profile results in a flat, normalized base line in the profile (right). In this example, thresholding is used to exclude the outliers (spikes) from polynomial fitting.
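The detrend operation in Figure 3.5 can be sketched as follows. This is a minimal illustration in Python, not the algorithm used by Image Metrics or any of the cited programs: it fits each scan line with a least-squares polynomial through the normal equations, skips pixels flagged in a precomputed feature mask, and subtracts the fitted curve. The function names and the boolean `mask` argument (True for pixels excluded from the fit) are assumptions.

```python
def polyfit(xs, ys, order):
    """Least-squares polynomial coefficients, lowest power first."""
    n = order + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting on the normal equations.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def flatten_lines(image, order=1, mask=None):
    """Subtract a polynomial fit from every scan line (row) of the image.

    Assumes each line has at least `order + 1` pixels and more than one pixel.
    """
    flattened = []
    for r, line in enumerate(image):
        width = len(line)
        # Rescale x to [-1, 1] to keep the normal equations well conditioned.
        xs = [2.0 * i / (width - 1) - 1.0 for i in range(width)]
        keep = [i for i in range(width) if mask is None or not mask[r][i]]
        if len(keep) <= order:  # too few unmasked pixels: fall back to all
            keep = list(range(width))
        c = polyfit([xs[i] for i in keep], [line[i] for i in keep], order)
        fit = [sum(ck * xs[i] ** k for k, ck in enumerate(c)) for i in range(width)]
        flattened.append([line[i] - fit[i] for i in range(width)])
    return flattened


# A tilted line with one masked spike: the tilt is removed, the spike is kept.
row = [0.5 * i + 3.0 for i in range(10)]
row[4] += 10.0
flat = flatten_lines([row], order=1, mask=[[i == 4 for i in range(10)]])
```

Without the mask, the spike would pull the fitted line upward and push the neighboring surface below zero, which is the over-correction visible as a dark line in Figure 3.7B.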

Flatten order     1st order flattening, image 1         3rd order flattening, image 2
Program           Nanoscope Analysis   Image Metrics    Asylum Research   Image Metrics
Std. deviation    1.97E-10             1.97E-10         2.11E-10          2.11E-10
Avg. deviation    1.02E-10             1.02E-10         1.16E-10          1.16E-10
Max               1.63E-08             1.63E-08         6.90E-09          6.90E-09
Min               -9.94E-10            -9.94E-10        -6.90E-10         -6.90E-10
Avg               4.04E-26             2.95E-14         -1.48E-25         4.10E-21
Median            -1.69E-11            -1.64E-11        -2.67E-11         -2.67E-11

Table 3.2 Comparison of Image Flattening Results between Image Metrics and other AFM software.

Examples of two different orders of flattening (1st and 3rd) in Image Metrics compared with two other AFM software packages. A different image is used for each comparison because the two packages do not open the same image format. The samples imaged consist of a mica surface with protein-DNA sample (e.g. image 1 can be seen in Figure 3.7A-B), and therefore are relatively flat (with a surface standard deviation of around 200 pm). The differences resulting from the flattening procedures used by the different AFM programs are highlighted in red. As can be seen in the table, the differences are orders of magnitude below the surface’s standard deviation. Because the surface is relatively flat, the differences can be considered negligible, and they likely result from rounding errors in the numerical packages used by each program.


Figure 3.6 Auto Threshold by Gaussian Method

A. When a user chooses a height threshold, the choice is often subjective and inaccurate, resulting in fluctuations in how much excess surface is masked (and consequently affecting the accuracy of the masks). B. The threshold is manually chosen by looking at the intensity distribution. C. Using the built-in Gaussian method, the threshold is chosen by the computer, resulting in less subjectivity and enhanced reproducibility and accuracy. D. The Gaussian method fits the intensity peak to a Gaussian curve and uses an offset from the center of the Gaussian to determine the surface threshold. The offset is predefined as a percentage drop from the peak. Because the breadth of the peak is indicative of the intensity of the background noise, this method adjusts the offset proportionally to the breadth of the peak, and therefore is independent of background noise and able to achieve high accuracy in masking the features without masking the surface.

A different kind of line-wise artifact arises when the tip temporarily detaches from or digs into the surface when it “stumbles” over a surface feature, or when the gain settings are too low or too high such that the feedback system fails to track the surface properly ([110], [90] Ref. Section 4.2.3). The abnormality in surface tracking can result in an elongated feature along the scan line, or a sharp scan line (also called stripe, spike, or shot noise) that stands far above or below the normal scanning height (Figure 3.8A red boxes, Figure 3.4 blue box, Figure 3.7B white box), or, in the case of excessively high gain settings, a periodic recurrence of up and down patterns known as ringing noise. Generally, the information is lost or distorted, and features affected by the artifacts should be discarded, but scan line noise can usually be partially negated by filling with neighboring pixels, in which case a threshold-gated median filter is often used (Figure 3.8D) [96, 108]. The median filter20 defines a pixel as an outlier if its value deviates by more than a certain threshold from the average value of the pixel and its nearby neighbors (also called the kernel window or kernel size, which can be set by the users). In Image Metrics, the threshold is defined as a percentile of the standard deviation of the kernel window. It can also be defined as a percentage of worst outliers to be modified (SPIP) [96]. The outlier pixel, once determined, is then replaced by the median value of the filter kernel. Locating and removing the stripe artifact can be manual (e.g. erase line function in Asylum Research, Figure 3.8F) or automatic (e.g. remove thin bright line function in SPIP, Figure 3.8E). The automatic procedure used by SPIP applies the median filter across the whole image, which could incorrectly filter information that is not shot noise (such as a tall protein) (Figure 3.8D). To prevent such situations, it is preferable to apply the filter only on the line or line segment where the shot noise is. In a previously published algorithm [108], the line is located by comparing the average of each line to identify the outlier as the aberrant line with the stripe. In Image Metrics, in addition to filtering the whole line by outlier, the line can further be segmented by filtering the line by area or by an outlier threshold21 (Figure 3.8E). For example, the area of a stripe line segment is always larger than that of isolated “noise” from misinterpreted tall proteins. The outlier threshold further separates the true noise from randomly misinterpreted “noise” by filtering out data with only small variations. In Image Metrics, all the filter operations are handled in the Image Filters module (Figure D.2), where filter parameters can be adjusted interactively to preview results in real time, and they can be saved as custom filtering profiles for batch processing (Section 3.6).

20 There are several kernel filters available in Image Metrics, including median, Gaussian, Wiener, standard deviation, entropy, and range. Users can also use custom-scripted kernels as filters.

21 The outlier threshold is a threshold placed on the changes of data (e.g. Figure 3.8D) after applying an initial median filter. It defines whether the change is big (i.e. above the threshold), which would represent true noise, or small (i.e. below the threshold), which would represent non-noise instances (such as a tall protein). The outlier threshold makes sure only data (after applying the initial median filter, e.g. Figure 3.8D) that have big changes (i.e. above the threshold value) are modified.
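A threshold-gated median filter with an additional outlier threshold can be sketched as below. This Python sketch is illustrative only and makes several assumptions that are not taken from Image Metrics: the kernel is a vertical window spanning neighboring scan lines (since stripes run along scan lines), the gate is a multiple `n_sigma` of the window's standard deviation, and `min_change` plays the role of the outlier threshold on the size of the change. The area threshold is omitted for brevity.

```python
import statistics


def gated_median_filter(image, kernel_rows=5, n_sigma=1.5, min_change=0.0):
    """Replace stripe-like outliers by the median of a vertical kernel window.

    A pixel is modified only if (1) it deviates from the window mean by more
    than n_sigma window standard deviations and (2) replacing it by the window
    median would change it by more than min_change.
    """
    rows, cols = len(image), len(image[0])
    half = kernel_rows // 2
    filtered = [list(row) for row in image]
    for r in range(rows):
        for c in range(cols):
            window = [image[rr][c]
                      for rr in range(max(0, r - half), min(rows, r + half + 1))]
            mean = statistics.fmean(window)
            spread = statistics.pstdev(window)
            median = statistics.median(window)
            is_outlier = abs(image[r][c] - mean) > n_sigma * spread
            big_change = abs(image[r][c] - median) > min_change
            if is_outlier and big_change:
                filtered[r][c] = median
    return filtered


# Hypothetical image: a flat surface with one bright scan line (row 3).
image = [[0.0] * 4 for _ in range(7)]
image[3] = [5.0] * 4
cleaned = gated_median_filter(image)
```

With these settings the one-pixel-thick stripe is replaced by the surrounding surface height, while a feature several scan lines tall is left untouched because it is not an outlier within its own vertical window.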


Figure 3.7. Correcting Line-wise Image Artifact

A. Raw AFM trace image (i.e. scan direction is from left to right) of a protein-DNA sample deposited on mica. Notice the variation in heights among different scan lines along the y axis. B. 1st order flattening of the raw image. White box: scan line noise resulting from the tip stumbling over a protein complex. Notice that a black line occurs along the same scan line due to overcorrection by the flattening procedure. Red box: feature distortion (diagonal elongation) caused by piezo drift and/or residual movement of the piezo. C. Image in B masked by a height-based threshold to reveal the shape of the measured surface. In this case, the surface is warped in the middle, resulting from a parabolic tracing pattern along the scan line. D. A new threshold is chosen to mask only the protein and DNA molecules, thereby excluding them from the surface normalization. E. A 2nd order flattening is then applied (to correct the parabolic patterns), resulting in the complete flattening of the surface (compare to B). A threshold-gated median filter is applied locally to remove the sharp scan noise (white box in B).


Figure 3.8 Line Removal Algorithms Compared

(A) Raw image. Shot noise is highlighted in red boxes. (B) Automatic line removal through Image Metrics. (C) Manual line removal through Asylum Research software. (D) The binary (black and white) difference image generated from Image Metrics’ or SPIP’s median filter procedure. The difference image is obtained by subtracting the modified image from the raw image, followed by a logical operation that converts modified pixels to 1 (bright color) and unmodified pixels to 0 (dark color). As seen in the image, the modified pixels include both the stripe noise and non-noise information. (E) The binary difference image generated after applying an outlier threshold and/or an area threshold. The outlier threshold ensures that only data with big changes (i.e. above the threshold value) after applying the initial median filter in D are modified. The area threshold ensures that the noise is line-like in nature and not random isolated noise (as a line stripe contains more pixels than isolated pixels). As seen in the image, only the line segments representing the stripe noise are modified. (F) The binary difference image generated from Asylum Research software’s manual line removal procedure. The manual removal erases the whole line, and therefore pixels along the whole line are modified. This could have undesirable consequences if features of interest lie along the line, where their information could be altered.

3.4.2. Cross-correlation

AFM can suffer from sporadic noise when the signals are weak, such as during fast scanning in liquid (Figure 3.9E) or DREEM22 imaging (Figure 3.9G). Generally, this noise can be smoothed out by two-dimensional filters (Gaussian, median, etc.). It can also be reduced through over-sampling, a technique often used in photography, in which a number of similar images are stacked together [111]. Over-sampling requires feature tracking for image stabilization, where image correlation methods such as cross-correlation are routinely used. Although cross-correlation (and the related FFT technique, see Section 3.4.3) is incorporated in many AFM software packages (e.g. SPIP and ImageSXM), it is most often used to study repeating unit cell features such as 2D crystals ([96], [109] Ref. Section 3.7.7), to track the movement of sample features in videos or images taken sequentially [95, 112], to study sample deformations [113], and to align, stabilize, and stitch images [114]. It has rarely been used, however, as an over-sampling technique to enhance the signal-to-noise ratio of AFM images.

Image Metrics is, to my knowledge, the first AFM software to provide over-sampling through correlation methods23, in addition to feature tracking and/or image stabilization. Figure 3.9A, B shows the tracking of a DNA molecule from time-lapse AFM imaging in solution. Once tracked, all frames of the tracked feature can be averaged to produce an image ensemble that enhances the stationary part and cancels out the dynamic part (also known as correlation averaging [96, 115]), thereby improving the signal-to-noise ratio (Figure 3.9E-H) and/or highlighting the dynamics of a feature (Figure 3.9C, D)24. Correlation averaging is similar to signal processing in two dimensions, where one can enhance the signal if the signals are in phase (‘constructive interference’) or reduce the signal if the signals are out of phase (‘destructive interference’). Since every AFM image is scanned bi-directionally and contains data from both the trace and retrace scan, even a single scan can benefit from the over-sampling technique.

22 See CHAPTER 2 for the DREEM imaging technique.

23 Cross-correlation and others, see Section 3.5.2B.

24 Dynamics analysis using correlation and correlation averaging is also used in fluorescence correlation spectroscopy (FCS) [116, 117]. In FCS, the cross-correlation of images between time frames is called temporal autocorrelation, which literally means the correlation of the same image(s) shifted by a time interval.
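Correlation averaging as described in this section can be sketched on one-dimensional profiles. The Python sketch below is illustrative, not the Image Metrics implementation: it assumes integer drift between frames, uses an unnormalized cross-correlation score, and works on single scan-line profiles rather than 2D frames to stay short. The function names are hypothetical.

```python
def best_shift(reference, signal, max_shift):
    """Return the integer shift s that maximizes sum(reference[i] * signal[i + s])."""
    n = len(reference)
    best_s, best_score = 0, float("-inf")
    # Try small shifts first so that ties resolve toward the smallest drift.
    for s in sorted(range(-max_shift, max_shift + 1), key=abs):
        score = sum(reference[i] * signal[i + s]
                    for i in range(n) if 0 <= i + s < n)
        if score > best_score:
            best_s, best_score = s, score
    return best_s


def correlation_average(frames, max_shift=3):
    """Align every frame to the first one by cross-correlation, then average."""
    reference = frames[0]
    n = len(reference)
    sums = [0.0] * n
    counts = [0] * n
    for frame in frames:
        s = best_shift(reference, frame, max_shift)
        for i in range(n):
            if 0 <= i + s < n:  # skip pixels shifted out of the frame
                sums[i] += frame[i + s]
                counts[i] += 1
    return [sums[i] / counts[i] for i in range(n)]


# Three hypothetical frames of the same feature, drifted by 0, +2 and -1 pixels.
frames = [
    [0, 0, 0, 1, 3, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 3, 1, 0, 0],
    [0, 0, 1, 3, 1, 0, 0, 0, 0, 0],
]
average = correlation_average(frames)
```

In practice the frames would carry independent noise, so the aligned average retains the stationary feature while the noise is reduced; a per-pixel standard deviation over the aligned frames would instead highlight the dynamic parts, as in Figure 3.9C.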


Figure 3.9 Feature Tracking and Enhancement through Cross-Correlation and Correlation Averaging

A. One image frame with the tracked DNA cropped in the green box. Notice that the area boxed in red is empty. B. Another image frame tracking the same DNA (green box) using cross-correlation. Notice that the DNA has relocated in the image (through image drift) and that a DNA loop appears (red box) where there is none in A. C. Standard deviation of the tracked feature from 13 frames (e.g. green boxes from A and B). Notice that the DNA loop is revealed (red box), indicating that the loop is dynamic. D. Average of the tracked feature from 13 frames. Notice that the DNA loop in the red box is weak in signal, indicating that it is dynamic. Features that are stationary are enhanced. E. One phase image frame from fast-scan imaging of DNA molecules in solution. F. Correlation averaging of 37 frames, resulting in significant improvement in signal-to-noise ratio (when features are stationary). G. One image channel of a DREEM image (the trace of the DREEM phase channel is shown). H. Average of two image channels (trace and retrace of DREEM phase), showing a reduction in background noise and a slight improvement in contrast inside the protein-DNA complex (see also APPENDIX C). The DNA path inside the protein is more visible and is marked by the blue cartoon in the red box inset.

3.4.3. Fast Fourier Transform

AFM images can also suffer from periodic noise originating from building vibrations (low frequency) or electromagnetic interference (usually high frequency). For example, illumination of the sample and/or tip from an external light source typically results in some high frequency noise in the image. Fast Fourier Transform (FFT) and cross-correlation techniques are routinely used to study structural patterns in the frequency space and the coordinate space ([109] Ref. Section 3.7.6, [90] Ref. Section 5.3.4), and they are therefore ideal for removing or keeping periodic image patterns.

Like many AFM programs (e.g. SPIP [96], ImageSXM [95], and Asylum Research), Image Metrics provides powerful FFT tools to transform images conveniently between the frequency domain and the space domain and to keep or discard frequency patterns selectively. Unlike many programs, Image Metrics provides real-time previews of the results. Figure 3.10A-D shows an example of high frequency noise removal. The image before processing (Figure 3.10A) contains periodic noise that appears as wavy lines. The image is then transformed into its frequency space (Figure 3.10B), where high frequency patterns reveal themselves as dense patches. To identify the right frequency pattern for the periodic noise, a trial and error approach is often used. Image Metrics simultaneously outputs the reverse FFT image (Figure 3.10D) and its differential image with the original (Figure 3.10C). As regions to keep or discard are marked in the FFT image (Figure 3.10B, marked regions of interest indicated by blue arrows), the output reverse FFT and differential images are displayed in real time, so users know whether the marked regions contain the information (e.g. the periodic noise) for which they are searching. In the example shown, the periodic pattern is identified as two dense patches in the frequency space (Figure 3.10B, blue arrows). The wavy lines (Figure 3.10C) are removed after removing the dense patches and performing the reverse FFT (Figure 3.10D).
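The discard-and-invert workflow above can be sketched on a single scan line. This Python sketch is for illustration only: it uses a naive discrete Fourier transform in place of an FFT so that it needs no external libraries, and it works in one dimension, whereas the program operates on 2D images. The function names and the list of frequency bins to discard are assumptions.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (O(n^2)), standing in for an FFT."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def remove_frequencies(line, bins_to_discard):
    """Zero the chosen frequency bins (and their mirrored negative frequencies).

    Returns the filtered line and the difference image (original - filtered),
    mirroring the reverse FFT and differential outputs described in the text.
    """
    n = len(line)
    spectrum = dft(line)
    for k in bins_to_discard:
        spectrum[k % n] = 0
        spectrum[(-k) % n] = 0
    filtered = [value.real for value in inverse_dft(spectrum)]
    difference = [a - b for a, b in zip(line, filtered)]
    return filtered, difference


# Hypothetical scan line: a flat surface at 1.0 with periodic noise in bin 4.
n = 32
line = [1.0 + 0.5 * math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
filtered, difference = remove_frequencies(line, [4])
```

Here the difference output contains exactly the sinusoid that was discarded, which is how the differential image lets a user confirm that the marked region in frequency space corresponds to the periodic noise.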


Figure 3.10 Processing Periodic Patterns using FFT and Correlation

A. Original image before FFT. A high frequency noise pattern can be seen. B. FFT of the image in A. Depending on the intensity, high frequency information is revealed as patterns in red, while low frequency information is revealed in yellow or blue. Patterns can be masked by regions of interest (grey mask overlays indicated by blue arrows) to be discarded (this example) or kept, depending on which frequency information users want to keep. C. Differential image between the original image (A) and the output image after reverse FFT (D). This shows that the regions of interest masked in B are indeed the high frequency noise seen in A. D. Output reverse FFT image after discarding the masked information in B. E. Original image of a bacterial rhodopsin 2D crystal. A selected unit cell is boxed in red. F. Zooming into a unit cell of arbitrary size in E (red box) shows notable background noise. G. Correlation averaging of the repeating unit cells identified in I. H. Autocorrelation of the image in E, useful for determining the size of the unit cell from the spacing between the autocorrelation patterns. I. Cross-correlation of the selected unit cell in F (i.e. red box in E) with the whole image (E) identifies all repeating cells (numbered in yellow) above a chosen correlation threshold (red masks).

On the other hand, if users want to enhance a high frequency pattern (such as that of a repeating structure) while discarding low frequency, non-repeating patterns, both correlation and FFT can be used. FFT is performed similarly, except that regions of interest are marked to be kept instead of being discarded. With correlation, the repeating pattern size (known as the lattice size or the ‘unit cell’ size) needs to be identified by the user, and autocorrelation or PSDF25 can be used to determine the unit cell size ([96, 118], [119] Ref. Section 4.9). Once the unit cell size is determined, cross-correlation can be performed to identify all repeating units, and correlation averaging over the repeating units (i.e. over-sampling) is used to enhance the structural patterns ([96], [119] Ref. Section 4.17). Figure 3.10E-I shows an example using correlation to obtain a high-resolution image of bacterial rhodopsin 2D crystals26. The raw image (Figure 3.10E, F) shows significant noise in the background. Autocorrelation is performed, and the spacing between intensity peaks is used to determine the size of the repeating unit cell (Figure 3.10H). Cross-correlation is carried out with one unit cell of the designated size (known as the ‘template cell’) over the whole image to identify all repeating cells using a threshold filter (Figure 3.10I). Cells that match the template cell are labeled with numbers. The threshold (masked in red, Figure 3.10I) is manually chosen to balance the quality of the match to the template against the number of cells. Finally, all repeating cells are averaged to produce a high-resolution ensemble image of all the cells with background noise significantly reduced (Figure 3.10G).
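Determining the unit cell size from the spacing of autocorrelation peaks can be sketched in one dimension. The Python sketch below is illustrative and simplified: it works on a single line profile rather than a 2D image, subtracts the mean before correlating, and takes the first positive local maximum after zero lag as the repeat distance. These choices and the function names are assumptions.

```python
def autocorrelation(profile):
    """Mean-subtracted autocorrelation of a 1D profile for lags 0..n-1."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [p - mean for p in profile]
    return [sum(centered[i] * centered[i + lag] for i in range(n - lag))
            for lag in range(n)]


def unit_cell_size(profile):
    """Return the lag of the first positive local maximum after lag 0, or None."""
    ac = autocorrelation(profile)
    for lag in range(1, len(ac) - 1):
        if ac[lag] > 0 and ac[lag] > ac[lag - 1] and ac[lag] >= ac[lag + 1]:
            return lag
    return None


# Hypothetical lattice profile that repeats every 5 pixels.
profile = [1.0, 0.0, 0.0, 0.0, 0.0] * 6
period = unit_cell_size(profile)
```

For noisy data a smoothing step or a minimum peak height would likely be needed before picking the peak; the sketch omits this.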

25 Power Spectral Density Function. The PSDF of a lattice corresponds to the Fourier transform of the autocorrelation function of the lattice, also known as the inverse lattice. FFT is typically used to calculate PSDF.

26 The image is obtained from Asylum Research website: https://afm.oxinst.com/learning/uploads/asylum_gallery/CypherBRStoryOI.jpg

3.5. Image Analysis

Image Metrics provides all the basic analyses found in other AFM software (Table 3.1), including but not limited to section analysis, roughness analysis, and particle analysis. The greatest strength of Image Metrics, however, lies in its unique shape analysis and single-molecule analysis. Properly masking the features is critical before these analyses can proceed.

3.5.1. Particle Detection

In the previous section (Section 3.4.1), a height-based threshold was introduced to mask the features for surface correction (Figure 3.7, Figure 3.12B). The same mechanism can be used to mask the features for analysis (also known as particle detection). The threshold can be picked manually by the user or automatically by thresholding algorithms such as Otsu’s method [120] and the Gaussian method described earlier (Section 3.4.1). Similar to ImageSXM and SPIP, Image Metrics provides a density slice feature that allows users to adjust the threshold value interactively from the image’s intensity distribution (Figure 3.3B). As in many AFM software packages, users can also specify the threshold direction to detect either particles (up direction) or pores (down direction). In addition, Image Metrics provides double density slices such that both an upper threshold and a lower threshold can be specified, which allows users to mask features with heights between those two thresholds.
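Automatic thresholding and the double density slice can be sketched together. The Python sketch below gives a plain histogram-based version of Otsu's method (choosing the threshold that maximizes the between-class variance) and a mask that keeps heights between a lower and an upper threshold. The bin count, function names, and the convention that the lower bound is exclusive are assumptions for illustration, not the behavior of Image Metrics.

```python
def otsu_threshold(values, bins=64):
    """Histogram-based Otsu threshold for a flat list of heights."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    sum_all = sum(h * c for h, c in zip(hist, centers))
    best_variance, best_threshold = -1.0, lo
    weight0, sum0 = 0, 0.0
    for i in range(bins - 1):
        weight0 += hist[i]
        sum0 += hist[i] * centers[i]
        weight1 = total - weight0
        if weight0 == 0 or weight1 == 0:
            continue
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight0 * weight1 * (mean0 - mean1) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = lo + (i + 1) * width  # upper edge of bin i
    return best_threshold


def density_slice(image, lower, upper=float("inf")):
    """Mask pixels with lower < height <= upper (a double density slice)."""
    return [[lower < h <= upper for h in row] for row in image]


# Hypothetical heights (nm): a surface cluster and a cluster of features.
heights = [0.0, 0.1, 0.2] * 10 + [2.0, 2.1, 2.2] * 5
threshold = otsu_threshold(heights)
particle_mask = density_slice([heights], threshold)
```

Detecting pores instead of particles would amount to masking heights below a threshold, for example `density_slice(image, float("-inf"), threshold)`.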

If two particles border each other, they can be detected as two particles or as a single particle, depending on the continuity of their masks (also known as connectivity) and on whether unconnected pixels are reconnected through bridging27 (Figure 3.11). In Image Metrics, the connectivity defaults to 8 and bridging can be enabled through morphologic operations (discussed below). Particle detection was validated against other AFM software, and in testing Image Metrics detects the same particles when other variables are controlled (Table 3.4).

27 If two unconnected pixels are separated by a single void pixel (unmasked pixel), they can be reconnected by filling/masking neighboring void pixels, a process known as bridging. Refer to the bwmorph function in the MATLAB documentation.
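The effect of connectivity on particle counting can be illustrated with a small connected-component labeling sketch. It is written in Python with a breadth-first flood fill and is not the routine used by Image Metrics; the function name and the returned label image are assumptions. The example mask reproduces the diagonal pixel pair (pixels 7 and 10) from the middle panel of Figure 3.11.

```python
from collections import deque


def label_particles(mask, connectivity=8):
    """Label connected masked pixels; return (label image, particle count)."""
    if connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif connectivity == 8:
        offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if (dy, dx) != (0, 0)]
    else:
        raise ValueError("connectivity must be 4 or 8")
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # flood fill from the seed pixel
                y, x = queue.popleft()
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# 4 x 4 grid numbered 1..16 row by row; pixels 7 and 10 touch only at a corner.
mask = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
_, particles_8 = label_particles(mask, connectivity=8)  # one particle
_, particles_4 = label_particles(mask, connectivity=4)  # two particles
```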

Figure 3.11 Particle Detection – Connectivity and Bridging.

A diagram of 16 pixels (labeled) is shown. The masked pixels are in blue. Connectivity defines how a pixel connects to its neighboring pixels. A connectivity of 4 counts the four directly neighboring pixels. For example, for the yellow labeled pixel No. 10, its neighboring pixels in the four directions are labeled in red. A connectivity of 8 counts the four corner pixels (orange) in addition to the four directly neighboring pixels (red). Left. Pixel 7 and pixel 10 are both bordered by pixel 11 in either connectivity, and therefore are connected pixels, which form a single mask/particle. Middle. Pixel 7 and pixel 10 are connected by their corners (connectivity = 8) but disconnected with respect to their four direct neighbors (connectivity = 4). They can be reconnected by bridging, which additionally masks the single pixel(s) that separate them (pixels 6 and 11, light blue). Right. Even if two pixels (e.g. pixels 10 and 4) are completely separated in either connectivity, they can still be reconnected through bridging pixels (which uses a connectivity of 8 by default).

Threshold-based masking does not always work: the surface image needs to be flat28, the image cannot be too noisy, and sample features cannot be too crowded. Otherwise, features are either over-masked or under-masked, resulting in incorrect registration/detection of features in the form of broken, disconnected masks or dilated, connected masks. In many AFM software packages (e.g. Asylum Research software and SPIP), masks can be manipulated manually (e.g. by drawing custom masks) or automatically through morphologic transformations (e.g. erosion and dilation) [121]. Image Metrics implements both options. A polygon tool is used to draw custom shapes to mask or unmask an area. The bwmorph function from MATLAB’s Image Processing Toolbox is used to perform morphologic operations. Unlike other programs, Image Metrics allows users to script and configure step-wise morphologic operations and save the configuration as profiles. A profile can then be used to automate a set of morphologic operations for batch processing. In addition, Image Metrics allows users to copy and paste masks between different images, which is very useful for masking features from data channels that are not height based.

28 To produce a flat surface image, the surface needs to be flat, and the tip-surface interactions need to be stable. In the case of mode hopping (APPENDIX C) and attractive surface (APPENDIX F), the imaged surface will not look flat.

To remove particles from detection, masks can also be filtered by their particle metrics. A comparison between unfiltered masks and filtered masks is shown in Figure 3.12B-D. Most AFM software (with the exception of SPIP) only offers the area and edge filters (Figure 3.12C). In Image Metrics, masks can be filtered using a combination of multiple particle metrics (e.g. height, area, and fiber length), as well as edge clearance (Figure 3.12D). To help determine the parameters for the metrics chosen for filtering, Image Metrics allows users to preview the distribution of each metric within the masked features and interactively pick the upper and lower bounds of the filter, which is also a unique feature of the software. Most importantly, the mask filters and the other aforementioned masking techniques (e.g. morphologic operations), as well as the ability to perform set operations on different sets of masks (Section 3.7.2), can be used in conjunction with macros to automate very complex particle detection tasks (Section 3.6), which makes Image Metrics a much more powerful tool for particle detection than existing AFM software.
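A combined filter of this kind can be sketched as a set of lower and upper bounds, one pair per metric, plus an edge-clearance test. The sketch below is in plain Python and is only an illustration: the function name, the dictionary layout, and every numeric value (including the implied units) are hypothetical and are not taken from Image Metrics or from Figure 3.12.

```python
# Illustrative sketch; all names and numbers are hypothetical.
INF = float("inf")


def filter_particles(particles, bounds, drop_edge=True):
    """Keep particles whose metrics all fall inside their (lower, upper) bounds."""
    kept = []
    for p in particles:
        if drop_edge and p["touches_edge"]:
            continue  # edge clearance: skip particles bordering the image edge
        if all(lo <= p[name] <= hi for name, (lo, hi) in bounds.items()):
            kept.append(p)
    return kept


particles = [
    {"id": 1, "area": 40, "height": 0.6, "fiber_length": 12, "touches_edge": False},    # small debris
    {"id": 2, "area": 900, "height": 0.7, "fiber_length": 300, "touches_edge": False},  # bare DNA
    {"id": 3, "area": 950, "height": 2.5, "fiber_length": 310, "touches_edge": False},  # protein-bound DNA
    {"id": 4, "area": 980, "height": 2.4, "fiber_length": 305, "touches_edge": True},   # cut off by the edge
]

# Large area and long fiber select DNA; the height bound selects protein-bound DNA.
bounds = {"area": (500, INF), "height": (1.5, INF), "fiber_length": (200, INF)}
selected = filter_particles(particles, bounds)
selected_ids = [p["id"] for p in selected]
```

With these made-up values only particle 3 passes all three bounds and the edge test, which mirrors the logic (not the data) of the protein-bound DNA selection in Figure 3.12D.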

In Image Metrics, masks are visualized through a binary layer on top of the original image (Figure 3.12B). Both the color and the opacity of the mask can be adjusted. To help distinguish disconnected features, the masks can be mapped into assorted colors (via the colorcube function) so that masks that are close, but separate, can be told apart by their different colors (Figure 3.12C). This feature is very useful when used interactively with the aforementioned masking operations (e.g. adjusting the height threshold and morphologic parameters). For example, users can use this feature to re-connect disconnected DNA masks (Figure 3.12D) and re-connect disconnected protein masks such as those with long linker arms (Figure 3.16) or flexible dimerization domains (Figure 3.15). Users can also overlay automatically traced DNA fibers (discussed in Section 3.5.3B) to verify the connectivity of the DNA masks (Figure 3.12D).


Figure 3.12 Masking Operation

(A) Raw image of protein-DNA deposition. (B) Masking using a height-based threshold, unfiltered. (C) Filter masks using an area filter. Notice that separated masks are colored differently. This feature is useful to verify whether the DNA masks are disconnected (blue box) or are connected (red box). The masks bordering the edges of the image are also removed. (D) Filter masks using a combination of area, height, and fiber length filter. In this scenario, only protein-bound DNAs are detected (i.e. DNA has large area, protein-bound DNA has taller height, and DNA has longer fiber length). The height threshold used for masking is adjusted to re-connect the DNA masks that are broken in C. The reconnected masks can be seen by their continuous fiber plot (red) and their detected particle order (number).

3.5.2. Shape Analysis

Shape analysis is an important application in computer vision that often involves image registration, recognition, and classification [122-125]. To describe and distinguish the shape of an object, shape descriptors (heights, contours, critical points, etc.) and shape contexts [126-129] are often used, and the shape has to be transformed to adjust for its scaling, three-dimensional orientation, and deformation [130-133]. In AFM, since particles such as proteins and nucleic acids are often imaged on a flat, blank surface, shape descriptors are routinely used to describe particle shapes. In particular, shape description via particle metrics (Table 3.3) is a staple feature of many AFM programs (e.g. Particle and Pore Analysis in SPIP [96], or Grain Analysis in Gwyddion [92]). Image Metrics provides a similar particle analysis module called Particle Analyzer that can quantitatively measure particle metrics in batch (Section 3.5.2A). Image Metrics can also match and classify particles using a correlation-based particle alignment and classification method (Section 3.5.2B, C) that is unique in AFM. Like the particle alignment and classification techniques used in cryo-EM [134], Image Metrics can adjust for orientation differences, classify molecules based on their shape, and take class averages to obtain a more refined molecular conformation [115, 135, 136].

A. Particle Metrics

Particle metrics are geometric descriptions of particles, such as height and area (Table 3.3). The analysis of particle metrics is known as particle analysis in AFM. In Image Metrics, they can be measured in batch via the Particle Analyzer module. First, the particles of interest need to be masked and detected (Section 3.5.1). After that, batch analysis for most particle metrics can be performed by the Image Processing Toolbox in MATLAB. Unlike most AFM programs (with the exception of SPIP), which only allow batch processing of particles in a single image, Image Metrics allows batch processing of particles in a collection of images. Image Metrics reports the measurement results and statistics to the end users in a table (Figure 3.13D, E). From there, users can sort, filter, categorize, and locate particles by their metrics, and graph the statistical results (Figure 3.13F).

Particle Metrics

Area/Convex Area/Filled Area      Intensity (max, mean, min)
Eccentricity                      Major/Minor Axis Length
Equivalent Diameter               Orientation
Euler Number                      Perimeter
Extent                            Solidity
Fiber length                      Volume

Table 3.3 Selected List of Particle Metrics

The description of the majority of metrics can be found in the documentation of MATLAB’s regionprops function²⁹. Volume and fiber length are measured as described in Section 3.5.3B, C. Measurements of selected particle metrics are compared across multiple AFM programs in Section 3.5.4. To improve performance, users can select only the metrics they want to calculate. Pre-filtering unnecessary particles while making the mask also helps (Section 3.5.1). After the measurements are made, users can filter particles by the numeric values of their respective metrics and, if the particles are categorized (Section 3.5.2B), by set operations on their respective categories. The filters can be applied to all the particles, or to particles that are currently selected. Users can apply multiple filters at the same time. Users can also make a subset using the current particle selections and perform all filtering operations on the particles within the subset only. Users can manually remove particles from, or add particles to, the filtered particles. As in SPIP, filtered particles can be categorized into sets using color and name labels, and users can perform set operations (intersection and union) to further dissect or group the categorized particles. In addition, particles can be displayed in a grid for better visibility (Figure 3.16), a feature not found in other programs. Collectively, these operations give users maximum flexibility in filtering, categorizing, and ultimately picking out the particles that satisfy given descriptions and criteria.

29 https://www.mathworks.com/help/images/ref/regionprops.html
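To make a few of the metrics in Table 3.3 concrete, the sketch below labels 8-connected particles in a thresholded height image and measures area, extent (area divided by bounding-box area), equivalent diameter (the diameter of a circle with the same area), maximum height, and volume (the sum of pixel heights inside the mask, as defined in Section 3.5.3C). It is plain Python written for illustration; the function names, the toy image, and the 0.5 threshold are assumptions and do not come from Image Metrics.

```python
import math
from collections import deque


def label_particles(mask):
    """Group set pixels into 8-connected particles; returns one pixel list per particle."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    particles = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            particles.append(pixels)
    return particles


def measure(pixels, height):
    """Compute a handful of particle metrics for one particle."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    heights = [height[y][x] for y, x in pixels]
    area = len(pixels)
    bbox_area = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
    return {
        "area": area,
        "extent": area / bbox_area,
        "equiv_diameter": math.sqrt(4 * area / math.pi),
        "max_height": max(heights),
        "volume": sum(heights),
    }


# Toy height image (arbitrary units) with two particles.
height = [
    [0, 0, 0, 0, 0],
    [0, 2, 2, 0, 0],
    [0, 2, 2, 0, 1],
    [0, 0, 0, 0, 1],
]
mask = [[int(h > 0.5) for h in row] for row in height]  # hypothetical height threshold
metrics = [measure(p, height) for p in label_particles(mask)]
```

Here the 2x2 particle has area 4 and volume 8, and the two-pixel particle has area 2 and volume 2; both fill their bounding boxes, so their extent is 1.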

Figure 3.13. Particle Analysis

The main interface of the Particle Analyzer module is shown with different panels cropped in different color boxes. A. Processed image. Particles are numbered and color labeled; different colors represent the different categories to which they are assigned. B. Regional inspection (zoomed-in view of the blue box in A). C. Particle inspection of particle #22 (labeled in A). D. Table listing metrics of all selected particles. E. Statistics of a selected metric. F. Distribution (histogram plot) of the selected metric.

B. Particle Matching

It is often difficult to fully distinguish the shape of a particle by using simple geometric metrics. For example, in Figure 3.14, particles of very different shapes may have the same size (area), and particles of the same conformation may differ slightly in ways that result in notably different measurements, such as fiber length. Using a combination of the mathematical descriptions listed in Table 3.3 can help recognize, but not completely distinguish, different shapes.

Figure 3.14 The Insufficiencies of Particle Metrics in Describing Shapes

Three hMutLα proteins (CHAPTER 4) are shown in A, B, C. Under the same height threshold that outlines them in red contours (D, E, F), their masks all have similar sizes (area), but the particle in B has a globular shape, which is notably different from the other two particles (A, C) that have more open conformations. On the other hand, although particles A and C both have extended conformations and may be categorized into the same shape group, they have varying fiber lengths (black lines, D, F).

Correlation methods have traditionally been used in EM to identify and match particles precisely with a reference shape [115, 137, 138], which could be a custom shape or the shape of a template particle. To calculate the correlation between two particles, they have to be aligned both translationally and rotationally. In EM, auto-correlation and cross-correlation are used to solve for both the translational and rotational shifts between a given particle pair [115]. After particles are aligned, their correlation can be used for particle classification, and class averages through correlation averaging can be used to obtain higher resolution of a particle image [136, 139].

Following EM, Image Metrics implements a similar rotary correlation matching algorithm, which has not been implemented in mainstream AFM software. In this method, a particle is chosen as the template, and another particle is chosen as the target. The two particles form a particle pair, and the program tries to align the particles by rotating the target particle at designated intervals and calculating its correlation with the template particle at each step. When the target particle aligns with the template particle, a maximum correlation is reached and the rotation angle is recorded. This procedure is repeated for all possible particle pairs. Eventually a correlation map and a rotation map are generated that allow the program to look up the correlation and alignment angle of any particle pair. A threshold correlation (called quality score in Image Metrics) is designated by the user to determine whether a particle pair is a match. Therefore, by cycling through all possible template particles, users are able to find all matching particles and put them into categories according to the shape of the templates.
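The rotate-and-correlate step for a single particle pair can be sketched as follows. This is a simplified illustration in plain Python, not the Image Metrics implementation: it assumes that both particles are already centered in square images of the same size (so the translational alignment is skipped), it uses nearest-neighbor sampling for the rotation, and it scores each rotation with a zero-mean normalized correlation. All function names and the toy template are hypothetical.

```python
import math


def rotate(img, angle_deg):
    """Rotate a square image about its center using nearest-neighbor sampling."""
    n = len(img)
    ctr = (n - 1) / 2
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            y, x = r - ctr, c - ctr
            # Inverse mapping: find the source pixel that lands on (r, c).
            sc = round(cos_a * x + sin_a * y + ctr)
            sr = round(-sin_a * x + cos_a * y + ctr)
            if 0 <= sr < n and 0 <= sc < n:
                out[r][c] = img[sr][sc]
    return out


def ncc(a, b):
    """Zero-mean normalized correlation between two images of equal size."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    ma, mb = sum(fa) / len(fa), sum(fb) / len(fb)
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    den = math.sqrt(sum((x - ma) ** 2 for x in fa) * sum((y - mb) ** 2 for y in fb))
    return num / den if den else 0.0


def best_rotation(template, target, step_deg=30):
    """Rotate the target at fixed intervals; return the best (angle, correlation)."""
    best_angle, best_score = 0, -2.0
    for angle in range(0, 360, step_deg):
        score = ncc(template, rotate(target, angle))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score


# Toy template: an asymmetric L-shaped particle of height 3.
template = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 3, 3, 3, 3, 3, 0],
    [0, 3, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
target = rotate(template, 90)  # the same particle in a different orientation
angle, score = best_rotation(template, target, step_deg=30)
```

Because the target was produced by a 90-degree rotation, a further 270-degree rotation restores the template, so the search returns an angle of 270 with a correlation of essentially 1. Repeating the call for every particle pair fills the correlation and rotation maps.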

In Image Metrics, users can choose one of two different correlation metrics for the aforementioned operation: normalized cross correlation (NCORR [140]) and sum of square difference (SSD [141]). Although both are aimed at matching features, their emphases are different. NCORR obtains a normalized correlation by using normalized image data for the correlation calculation, whereas SSD obtains a normalized correlation by using raw image data for the correlation calculation and normalizing the result afterwards. Therefore, NCORR tends to match features with the same textures regardless of their intensities, while SSD tends to match features with the same intensities [142]. In general, NCORR is better suited to finding matches if the surface is not flat (or cannot be flattened), while SSD is better suited to finding matches if the surface is flat and every particle is on even “ground”. The results from NCORR may not be desirable if a tall particle matches a short particle in texture, whereas the results from SSD may suffer from uneven surface heights, such that a short particle on a high background matches a tall particle on a low background in absolute height.
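The difference in emphasis can be seen in a small numeric sketch. The code below uses the textbook forms of the two measures (a zero-mean normalized correlation and a raw sum of squared differences, without the subsequent normalization), which may differ in detail from the versions used in Image Metrics; the three height profiles are made-up values.

```python
import math


def ncc(a, b):
    """Zero-mean normalized correlation: 1.0 means identical texture."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0


def ssd(a, b):
    """Sum of squared differences on raw values: 0 means identical heights."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


short = [0, 1, 2, 1, 0]   # template profile
tall = [0, 3, 6, 3, 0]    # same texture, three times the height
other = [0, 2, 1, 1, 0]   # similar height, different texture

ncc_tall, ncc_other = ncc(short, tall), ncc(short, other)
ssd_tall, ssd_other = ssd(short, tall), ssd(short, other)
```

With these values the normalized correlation ranks the tall particle as the better match (about 1.0 versus about 0.64), whereas the sum of squared differences ranks the particle of similar height as the better match (2 versus 24).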

For example, aligning 70 particles against one another over 360 rotation steps requires nearly 2 million correlation calculations (70 × 70 × 360 ≈ 1.76 million). To speed up the calculation in Image Metrics, symmetry³⁰ can be applied, and parallel computing³¹ on multi-core computers or on computer clusters running MDCS³² can be used. The speed can be further improved if some of the EM alignment methods mentioned earlier [115] are used³³.
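The saving from particle pairing symmetry can be sketched as follows: only one ordering of each pair is aligned, and the mirrored entry is filled in with the same correlation and the opposite rotation angle. For 70 particles this reduces the work from 70 × 70 × 360 = 1,764,000 correlations to 70 × 69 / 2 × 360 = 869,400. The code is an illustration in plain Python; `build_maps` and the stand-in `fake_align` function are hypothetical names, and the stand-in simply treats each "particle" as an orientation in degrees.

```python
def build_maps(particles, align):
    """Fill correlation and rotation maps while aligning each pair only once.

    `align(template, target)` is a hypothetical callable returning
    (correlation, rotation angle in degrees).
    """
    n = len(particles)
    corr = [[1.0] * n for _ in range(n)]  # a particle matches itself perfectly
    rot = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only
            score, angle = align(particles[i], particles[j])
            corr[i][j] = corr[j][i] = score
            rot[i][j] = angle
            rot[j][i] = (-angle) % 360  # the reverse alignment is the opposite rotation
    return corr, rot


calls = []


def fake_align(template, target):
    """Stand-in for a real alignment routine; records how often it is called."""
    calls.append((template, target))
    return 0.5, (target - template) % 360


corr, rot = build_maps([0, 30, 90, 200], fake_align)
```

For the four stand-in particles, `fake_align` is called 6 times rather than 16. As the footnote on symmetry notes, a real implementation may see small differences between the two orderings because of interpolation, so the mirrored values are an approximation.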

Once the correlation and rotation maps are obtained, finding matching particles is an iterative process. Figure 3.15 shows an example of this process used to characterize the shape of the protein UHRF1, which is studied in a paper that I co-authored [143]³⁴. Another example characterizing MutSα-DNA conformations is shown in Figure D.10. Users start by choosing a template particle (e.g. bottom left particle, Figure 3.15A) and defining a correlation score (quality score). Image Metrics finds all particles above the designated score, aligns them to the template, and displays their correlations (Figure 3.15B). Users can further filter them by hand, or by using a new quality score (Figure 3.15B). After the particles are filtered to the user’s satisfaction, they are assigned a category (e.g. a category of a specific conformation). The aligned particles can also be averaged to obtain a refined image of the particle category (Figure 3.15C, also see Section 3.4.2). This process is repeated by using found particles as new templates, or by using a new particle as the template, until users have cycled through and assigned categories to all the particles. After the initial rough assignment, users can review the assignment by category and make changes. This reviewing process is repeated until users have fine-tuned the assignment for every particle. Image Metrics can plot the categories in a pie chart so that the populations of the categories can be compared (Figure 3.15D).

30 Symmetry includes rotation symmetry and particle pairing symmetry. For example, if particle 1 is aligned with particle 2 by rotating particle 1 clockwise by 60 degrees, then particle 2 should be aligned with particle 1 by rotating particle 2 counterclockwise by 60 degrees (in reality it may be slightly different due to interpolation in the image rotation operations). To save time, the two rotations can be obtained by rotating only one of the particles. Similarly, the particle pairing is also symmetric. The correlation of particle 1 to particle 2 should in principle be the same as the correlation of particle 2 to particle 1. However, minor differences might occur depending on which particle is used as the template for cross-correlation. To save time, only one pairing order needs to be calculated instead of both.

31 https://www.mathworks.com/help/distcomp/index.html

32 MDCS - MATLAB Distributed Computing Server

33 The current implementation in Image Metrics performs one cross-correlation for every rotation. Some of the EM methods use the auto-correlation function (ACF) of the particle image, which is translationally invariant. Therefore, the rotations can be performed on the ACF without needing translational alignment, and only one cross-correlation calculation is required to align the image translationally after the optimal rotation angle is found.

34 UHRF1 is short for Ubiquitin-like, containing PHD and RING finger domains, 1. It is important in the epigenetic inheritance of the DNA methylation process.
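The template-driven assignment described above can be approximated by a simple greedy loop over the correlation map. The sketch below is only a stand-in for the interactive process: it picks the lowest-numbered unassigned particle as the next template instead of asking the user, it applies a single quality score throughout, and it omits the manual review step. The function name and the 4 × 4 correlation map are hypothetical.

```python
def categorize(corr, quality):
    """Greedy stand-in for interactive template matching on a correlation map."""
    unassigned = set(range(len(corr)))
    categories = []
    while unassigned:
        template = min(unassigned)  # a user would choose the template interactively
        members = [template] + [
            j for j in sorted(unassigned)
            if j != template and corr[template][j] >= quality
        ]
        categories.append(members)
        unassigned -= set(members)
    return categories


# Hypothetical correlation map for four particles.
corr = [
    [1.00, 0.90, 0.20, 0.85],
    [0.90, 1.00, 0.30, 0.80],
    [0.20, 0.30, 1.00, 0.10],
    [0.85, 0.80, 0.10, 1.00],
]
categories = categorize(corr, quality=0.8)
populations = [len(members) for members in categories]
```

With a quality score of 0.8, particles 0, 1, and 3 fall into one category and particle 2 is left in a category of its own; the category sizes are what a pie chart such as Figure 3.15D would display.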

Figure 3.15. Shape Matching Analysis of Proteins UHRF1

A. Overview of UHRF1 proteins. B. Shape analysis: UHRF1 proteins are aligned to a template, correlation matching scores are calculated and labeled, and proteins are removed (hollow circles) either automatically because of their low scores (purple) or manually. The shape conformation is named (‘double’). C. Selected proteins are correlation averaged, and the averaged protein image is compared to the template. D. Categorization of the protein conformations after inspecting all of them and repeating the process in B and C. Their populations are calculated.

C. Particle Classification

Putting particles into categories is a process of classification, one of the most common tasks in computer vision. In the previous sub-section, I described a manual classification process through particle correlation and alignment. Computer-assisted automatic classification methods have also been developed, of which clustering analysis (CA) is routinely used ([124] Ref. Section 8.3). In a typical clustering analysis, objects are clustered based on their similarity, where the distance between object attributes is calculated as their similarity measure³⁵. In shape analysis, the attributes could be one or more of their shape descriptors (metrics such as height, area, contour, etc.) ([129], [124] Ref. Chapter 6) or their individual pixels³⁶. A variety of distance metrics exist, such as Jaccard distance and Euclidean distance [144, 145]. After the distances between objects are obtained, the objects can be clustered by using one of many clustering methods available, such as hierarchical ascendant clustering (HAC) and K-means clustering ([145], [124] Ref. Section 8.3).
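The two distance metrics and a basic hierarchical (agglomerative) clustering can be sketched in a few lines. This is a generic textbook illustration in plain Python, not one of the clustering schemes implemented in Image Metrics: the average-linkage rule, the distance cutoff used as the stopping criterion, and the example descriptor vectors are all assumptions.

```python
import math


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of mask pixels."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def euclidean_distance(u, v):
    """Straight-line distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def agglomerate(items, distance, cutoff):
    """Merge the two closest clusters (average linkage) until none is within the cutoff."""
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        total = sum(distance(items[i], items[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical (height, area) descriptors in arbitrary, pre-scaled units.
descriptors = [(1.0, 1.0), (1.1, 0.9), (5.0, 5.2), (5.1, 4.9), (0.9, 1.1)]
clusters = agglomerate(descriptors, euclidean_distance, cutoff=1.0)

# Jaccard distance between two small masks that share one of three pixels.
mask_distance = jaccard_distance({(0, 0), (0, 1)}, {(0, 1), (1, 1)})
```

The five descriptors separate into a small-particle cluster (items 0, 1, 4) and a large-particle cluster (items 2, 3). Passing `jaccard_distance` together with pixel sets instead of descriptor vectors would cluster the same particles by mask overlap.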

Particle classification by shape is widely used in electron microscopy (EM). In EM, particles are usually classified through multivariate statistical analysis³⁷ (MSA) followed by clustering analysis [135, 139, 146]. Particle classes can then be used to categorize conformations and reconstruct their 3D models [138, 147]. Advances in algorithms, computing power, and cryogenic technology have enabled scientists to resolve molecular structures at near-atomic resolution in their native state [148], which may be one of the most important scientific breakthroughs in recent history [149]. However, particle classification in AFM is still, to my knowledge, a novelty.

35 The shorter the distance between objects’ attributes, the higher the similarity between the objects.

36 The similarity between the individual pixels of two objects is also called correlation [115]. For example, cross-correlation is one of the correlation metrics.

37 Specifically in EM, principal component analysis (PCA) and correspondence analysis (CA) are used.

Image Metrics aims to bridge that gap. Similar to EM, I have implemented automatic classification methods (APPENDIX F). Initially, I developed a custom clustering scheme that involves a variant of multivariate statistical analysis using Jaccard distance (which I named eigenanalysis, see Appendix G.I). I have since implemented standard hierarchical clustering and K-means clustering based on the distance between particle correlations using MATLAB’s Statistical Toolbox (Appendix G.II). In Appendix G.III, the concept and usage of the different clustering schemes in Image Metrics are explained using a simulated data set, and the accuracy of clustering is compared and verified. The section also explains how to optimize classification by screening major parameters and/or by hybridizing different clustering methods.

Compared to EM, where the goal of clustering is to generate distinct classes with the ultimate purpose of minimizing the intra-class distances, Image Metrics focuses on maximizing inter-class distances so that distinct conformations can be separated. The results may be similar, but the emphases are different. In EM, data that contain minor conformational species or subtle conformational changes are often discarded in favor of obtaining higher resolution class averages for 3D reconstruction [139], whereas in Image Metrics a single class often encompasses some variations of the same conformation. In other words, EM clustering aims to generate more, but refined, classes so that classes are distinct from each other even for minor conformational changes; Image Metrics clustering aims to generate fewer, but “messier”, classes that distinguish particles by major instead of minor conformational changes. As discussed in Section 3.7.3, the methods implemented in Image Metrics are well suited to the type of data (AFM) it processes, and they can also be useful for image classification of a variety of image types in other fields.

In Image Metrics, a module called Particle Categorization is provided to process particle classification. Particles can be classified into groups, and groups can be merged into categories, either manually or using one of the aforementioned methods. In a typical workflow, users group particles first by choosing a number of parameters that gauge the quality of the match and the size of the groups. Once initial grouping is performed by the computer, users can inspect the groups, throw away outlier particles, regroup the particles, and merge groups into categories.

The module can classify particles with high throughput. For a sample size of a thousand particles, initial grouping on an average computer takes only seconds to finish, and adding user inspections to complete the whole classification process (grouping and categorization) usually takes about half an hour. For example, in Figure 3.16, more than a thousand Saccharomyces cerevisiae MutLα proteins³⁸ are categorized using the aforementioned clustering schemes. After automatic grouping, the proteins of a group or a category are displayed in a gallery (Figure 3.16 red box). Users can choose a template protein (Figure 3.16 orange box) within the group or category (or let the computer choose the one that has the best matches) and align all other proteins to that template protein (Figure 3.16 red box). The matching scores can be displayed for each protein and sorted in order (Figure 3.16 blue numbers in red box). From the gallery, users can discard outliers into the trash, or put them in another group (Figure 3.16 blue box). After each group is confirmed to the user’s satisfaction, the groups can be automatically or manually merged into categories (Figure 3.16 dark red box). After each category is confirmed to the user’s satisfaction, the classification process is complete.

38 The images were collected by Elizabeth Sacho, a former graduate student.


Figure 3.16 Particle Classification Module

Shown in the figure is the classification process of 1112 Saccharomyces cerevisiae MutLα proteins. Red box: gallery that displays particles from a selected group or category. The particles are centered in the images and masked with assorted color overlays (to separate nearby particles). Orange box: panel that displays the template particle that is used for the alignment of all other particles in its group or category. Blue box: groups that are created manually or automatically. Numbers indicate correlation quality to the template. Dark red box: categories that are created manually or automatically.

3.5.3. Single-Molecule Analysis

The shape analysis described in Section 3.5.2 is batch analysis in nature, which often ignores the local complexity of individual particles. Batch analysis becomes problematic when analyzing complicated molecules such as protein-DNA complexes, where the complex needs to be analyzed in parts, and manual inspection is often required for proper tracing, cropping, measuring, and classification of the regions of interest. None of the existing AFM software is designed to process particles of this complexity, so establishing a workflow using these programs is very difficult and highly inefficient. To tackle this problem, Image Metrics introduces a dedicated module called the Region Inspector (Figure 3.2) to facilitate high-throughput single-molecule analysis, such as the analysis of protein-DNA complexes. Unlike shape analysis, which allows users to conduct batch analysis in particle filtering, alignment, correlation averaging, and ultimately classification in a single package, single-molecule analysis allows users to frame complex particles such as protein-DNA complexes into regions of interest and precisely measure their conformations in a sequential manner. Whereas shape analysis excels in throughput and big data, region analysis excels in measurement precision. The two modules complement each other and are tightly integrated in Image Metrics. Together, they allow users to conduct feature analysis both in precision and in scale.

Here, the workflow of single-molecule protein-DNA analysis in Image Metrics is described. In single-molecule AFM studies of protein-DNA complexes, the following aspects are typically analyzed: (1) Specificity (defined as the relative binding affinity for the specific site versus a non-specific site): by measuring the position of the protein complex along the DNA [150]; (2) Stoichiometry and binding affinity: by measuring the number of proteins in each complex, the number of complexes on each DNA, and the free protein and DNA molecules [84, 150]; (3) Conformation: by evaluating the structure of the complex and the DNA bending at the complex [5, 151]. The specificity is the relative affinity of a protein for a specific sequence versus any other sequence on the DNA [150]; the stoichiometry is the number of proteins bound to DNA and/or within a protein complex; the binding affinity is a measure of how tightly the protein binds to the DNA; and the conformation yields information about the structure of the protein-DNA complexes and how it relates to function. The combined information above makes AFM a very powerful tool for dissecting how biology works at the molecular level, but it also presents a significant challenge due to the complexity and quantity of the analysis.

A. Identifying the Protein-DNA Complex

First, we need to isolate the protein-DNA complexes from the image; this usually occurs after the image is processed in Image Processor (Section 3.4). Typically, features such as protein-DNA complexes (called particles in AFM software terminology) can be cropped manually or automatically. In the automated procedure, particles are masked and detected as described earlier (Section 3.5.1). Their particle metrics (Table 3.3) are then analyzed based on the masked region of interest, and features of interest can be further located by finely filtering their particle metrics. As in many AFM programs, masked particles can be filtered by their particle metrics after they are analyzed (post-analysis filtering). In Image Metrics (and in SPIP), users can also filter the particles before the analysis (pre-analysis filtering) to save computation time on irrelevant features (see also Section 3.5.1). Users can use any combination of metrics in Image Metrics to pre-filter particles of interest (Figure 3.12D, Figure 3.18A); the most common metrics are their size (area, height, and volume), their fiber length, and whether they border the image edges [91, 96, 152].

Due to the threshold applied, image artifacts, features that are themselves too low, or inadequate filtering parameters, the masks can sometimes split up a sample feature, bridge two sample features, or incorrectly mask features that users do not want to analyze. In these scenarios, Image Metrics (like AFM programs such as Asylum Research software and SPIP) allows users to manually add, remove, split up, and group features. Users can also automatically group features using the morphologic close operation (dilation followed by erosion) so that separated features that are perceivably linked can be re-united. More details on particle detection techniques are discussed in Section 3.5.1.

After features (particles) are identified, they can be cropped (Figure 3.18A, blue box) and zoomed in for further inspection (Figure 3.18B). In other AFM software, this process is performed mostly manually. For example, in SPIP, you can use the inspection box feature to crop a region of the image, which opens a new window showing the zoomed-in area of the image [96]. In Asylum Research software, you can pop up an inspection window of the particle itself, but you cannot adjust the window size (in case the view of a particle is cropped), nor can you easily navigate among the particles. In Image Metrics, however, the process is mostly automatic. Individual particles are enlarged and inspected sequentially within the Region Inspector module (Figure 3.2), where measurements of particles (specificity, stoichiometry, etc.) take place. The next particle automatically pops into view once the measurement of the current particle is finished. The separate inspection module hides the rest of the image so that only the features of interest are presented to the end users.

B. Specificity

The specificity of protein binding to the DNA can be estimated by measuring the position distribution of proteins along the DNA (Figure 3.18E) [150]. To measure the position of a protein complex along the DNA, the DNA molecule is traced and profiled (Figure 3.18B, Figure 3.17). The tracing of a DNA molecule is a type of fiber analysis (Table 3.1 Fiber Analysis), which is found in many specialized software packages³⁹ for measuring neurites (e.g. NeuronJ [94]) as well as DNA (e.g. DNA Trace [104], FiberApp [105]). In most AFM-specific software packages (e.g. Nanoscope Analysis and Gwyddion), the standard profile analysis (also known as section analysis) does not allow a freehand tracing option, which makes them unsuitable for analyzing DNA molecules. Other software offers a freehand tracing option (e.g. Asylum Research software and NeuronJ). However, freehand tracing is performed manually by users who locate and trace individual molecules, which makes it time consuming, and its accuracy can suffer from users’ subjective choices⁴⁰. Automatic and semi-automatic tracing options [152, 153] do not have this problem. In Image Metrics, automatic and semi-automatic tracing use morphological transforms⁴¹ and geodesic distance transformations⁴² to trace and profile untangled linear DNA molecules [152, 154]. A step-by-step procedure of this operation is shown in Figure 3.17. A similar automatic tracing feature is also found in SPIP [96] and Asylum Research software⁴³. After the DNA fiber is traced, the measurement of DNA fiber length can be carried out as described previously [153, 155]. In Image Metrics, the Euclidean distance is used to estimate the length⁴⁴. In addition to automatically tracing the DNA, the proteins’ positions, represented by peaks on the DNA profile, can also be located and recorded automatically (Figure 3.18B inset), which is not possible in other AFM programs. The program can also record the short arm distance (the distance from the protein to the nearest end of the DNA). However, the automatic procedure can fail to trace the DNA properly when the DNA molecule cannot be masked properly or when the DNA molecule is closed, branched, or tangled with other DNA strands. To resolve these more complicated scenarios, users can: (1) tailor, split, or bridge the masks (Section 3.5.1), (2) use region of interest tools (Figure 3.18B, blue crops), or (3) use manual freehand or semi-automatic options to assist in tracing the DNA.

39 These software packages are either plugins for ImageJ (such as NeuronJ) or MATLAB-based applications (DNA Trace and FiberApp). A list of fiber analysis software (including ones not listed in the text) can be found in [105].

40 In my own testing (data not shown), depending on the orientation of the molecule and how the freehand trace is digitized, the fiber length obtained from freehand tracing can be systematically longer or shorter than that obtained from automatic tracing methods.

41 The specific morphological transformation used is the skeletonize or thinning transformation. See also Section 3.5.1. This transformation is required for semi-automatic tracing.

42 This distance transform, achieved by the bwdistgeodesic function, allows the tracing of the longest, non-crossover distance between any two fiber points, which are essentially the two farthest distal points (ends) of the fiber. The transform allows measuring the shortest distance between the two fiber ends by eliminating distances from branched fiber detours. This transformation, in conjunction with the morphological thinning transformation, is required for fully automatic tracing.

43 The Asylum Research software only allows semi-automatic tracing.
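The geodesic tracing step can be sketched on an already-skeletonized mask. The code below is a plain-Python illustration and not the Image Metrics implementation: it takes the skeleton as a set of pixel coordinates (the thinning step is assumed to have been done), measures path length with steps of 1 for orthogonal moves and √2 for diagonal moves (in the spirit of a quasi-Euclidean metric), and finds the two fiber ends with a double sweep, which assumes a tree-like skeleton without loops. The function names and the toy skeleton are hypothetical.

```python
import heapq
import math


def geodesic(skeleton, start):
    """Shortest path lengths from `start` to every skeleton pixel (8-connected)."""
    dist, prev = {start: 0.0}, {start: None}
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue  # stale heap entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nxt = (r + dr, c + dc)
                if (dr or dc) and nxt in skeleton:
                    nd = d + math.hypot(dr, dc)  # 1 or sqrt(2) per step
                    if nd < dist.get(nxt, math.inf):
                        dist[nxt], prev[nxt] = nd, (r, c)
                        heapq.heappush(heap, (nd, nxt))
    return dist, prev


def trace_fiber(skeleton):
    """Return (path, length) of the longest shortest path between two skeleton ends."""
    dist, _ = geodesic(skeleton, min(skeleton))   # sweep 1: reach one fiber end
    end_a = max(dist, key=dist.get)
    dist, prev = geodesic(skeleton, end_a)        # sweep 2: reach the opposite end
    end_b = max(dist, key=dist.get)
    path, node = [], end_b
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    return path, dist[end_b]


def protein_position(path, height):
    """Arc length along the path at which the height profile peaks."""
    arc, best_arc, best_height = 0.0, 0.0, -math.inf
    for i, (r, c) in enumerate(path):
        if i:
            arc += math.hypot(r - path[i - 1][0], c - path[i - 1][1])
        if height[r][c] > best_height:
            best_height, best_arc = height[r][c], arc
    return best_arc


# Toy skeleton: a straight fiber in row 1 with a one-pixel branch at (0, 2).
skeleton = {(1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (0, 2)}
height = [
    [0.0, 0.0, 0.5, 0.0, 0.0],
    [0.5, 2.0, 0.5, 0.5, 0.5],  # the protein peak sits at column 1
]
path, length = trace_fiber(skeleton)
position = protein_position(path, height)
short_arm = min(position, length - position)
```

The branch pixel is left out of the traced path because the detour through it is longer than the direct route, which gives a fiber length of 4 pixels and a short arm distance of 1 pixel for this toy example.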

Figure 3.17 DNA Tracing and Fiber Analysis

Shown in the figure is the automatic fiber tracing feature in Image Metrics. (top-left) A DNA image is shown. (top-middle) DNA is masked (orange) by a height threshold, and the mask is morphologically transformed into a skeleton (green). (top-right) The fiber (black) is determined as the longest path between any two points inside the skeleton without looping, which is the shortest path between the two ends of the skeleton. The location of the protein (cross mark) is determined as the location of the peak in the height profile of the fiber (bottom figure). The start and end anchor points of the DNA fiber used in the height profile are marked by the square (start point) and the circle (end point). (bottom) The height profile of the fiber from the top-right figure. The location of the protein is marked by a cross symbol.

44 This distance is given by the bwdistgeodesic function.

C. Stoichiometry and Binding Affinity

Several stoichiometry metrics can be estimated, including the number of protein complexes per DNA, the number of proteins per complex, and ultimately the number of proteins per DNA. To extract the number of proteins per complex, we use volume analysis [84, 91, 150, 156]. In AFM terminology, volume describes the summation of pixel intensities within a mask, and it is usually proportional to the size (area and intensity) of the masked feature. Like many AFM software packages (e.g. Asylum Research software, SPIP, and Gwyddion), Image Metrics supports volume measurement as a particle metric (Table 3.3). Particle volumes can be used to extract stoichiometry information because the volume of a protein complex is roughly proportional to the number of proteins inside the complex [84]. To measure the volume of a protein complex on the DNA, a typical workflow includes the following steps: (1) The protein complex is first masked, and then separated from the DNA using region of interest (ROI) tools (Figure 3.18B, yellow crops). In Image Metrics, a freehand drawing tool is provided to draw ROIs or reverse ROIs. Similar ROI tools are also available in SPIP. (2) The volume of the protein complex is measured within the ROI, and its distribution is plotted (Figure 3.18C). The number of protein complexes on the DNA is also counted. In Image Metrics, this counting is performed automatically as the volumes are being measured. (3) If the protein is known to be predominantly monomeric, the first peak of the volume distribution can be used to identify the monomer state of the protein complex45. Users can also identify the stoichiometry of the peak by comparing the volume of free protein to that of DNA-bound protein. (4) Finally, the number of proteins per complex and per DNA can be calculated by normalizing the volume of the complex(es) to that of a single protein (Figure 3.18D).

45 The assumption here is that the protein will have a notable monomer state population that presents itself as a peak in the volume distribution, which may not be the case. Therefore, it is recommended to perform verification using other methods like dynamic light scattering (DLS), analytical ultracentrifugation, or mass spectrometry.
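The volume normalization in steps (2) to (4) can be sketched in a few lines. The example below is a minimal illustration written in Python rather than in the MATLAB of Image Metrics; the function names, the toy height map, the pixel size, and the complex volumes are all hypothetical, and the monomer volume is simply assumed to be known from the first peak of the volume distribution.

```python
def particle_volume(height_nm, mask, pixel_area_nm2):
    """Sum the heights of masked pixels and scale by the area of one pixel (nm^3)."""
    total = 0.0
    for height_row, mask_row in zip(height_nm, mask):
        for h, m in zip(height_row, mask_row):
            if m:
                total += h
    return total * pixel_area_nm2


def proteins_per_complex(complex_volumes, monomer_volume):
    """Normalize each complex volume to the monomer volume, rounded to whole proteins."""
    return [round(v / monomer_volume) for v in complex_volumes]


# Hypothetical 4x4 height map (nm) holding one 2x2 particle, with 2 nm x 2 nm pixels.
height = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.5, 1.5, 0.0],
    [0.0, 1.5, 1.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
mask = [[h > 0.5 for h in row] for row in height]  # height-threshold mask
volume = particle_volume(height, mask, pixel_area_nm2=4.0)

# Hypothetical complex volumes (nm^3); the monomer volume is assumed to come
# from the first peak of the volume distribution.
counts = proteins_per_complex([24.0, 50.0, 70.0], monomer_volume=24.0)
```

As noted in the text, the rounded counts are only rough estimates, because the conformation of a complex affects its measured volume.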

It should be noted that the volumes of protein-DNA complexes can vary greatly because the conformation of the complex can affect its volume measurement46, so the number of proteins per complex estimated through normalization from step (4) above may not be accurate. But since the volumes of larger complexes usually do not overlap with those of smaller complexes, the peak with a higher volume roughly corresponds to a complex with an increased number of proteins. In addition, because volume measurement depends directly on the height and area measurements, any factors that affect those measurements also affect the volume measurement.

For example, the height may be higher or lower if the surface is not normalized properly at the local level or if tip-sample interactions change (such as tip degradation or contamination). In that case, users can try to flatten the region locally (instead of flattening the whole image), calculate and offset the surface height from the height measurement of the feature, or normalize the volume distribution to help offset the difference. The masked area may also be larger or smaller than desired depending on the height threshold used for masking47, and users may have to readjust the threshold as they take the volume measurement.

In addition to counting protein-DNA complexes, free proteins and DNAs can also be counted (Figure 3.18A), and binding affinity can be estimated as described previously [150]. In Image Metrics, free proteins and DNAs can be filtered and counted through particle analysis (Section 3.5.2A). A module called Particle Counter can also be used to count particles manually, where users mark the particles with the mouse cursor. Great caution must be taken in validating the binding affinity, however: it can be overestimated because of deposition artifacts such as random landing48 and/or local binding49 events, or underestimated because different types of biomolecules bind to the surface with different preferences.

46 For instance, a protein sitting tall will likely have larger volume than the same protein lying flat on the surface because of the tip-dilation effect.

47 For example, if the surface is not flat, the height threshold may over-mask some particles while under-masking other particles, resulting in larger or smaller masked areas than desired.

D. Conformation

Perhaps the most outstanding strength of AFM in single-molecule protein-DNA studies is that we can directly visualize the conformation of protein-DNA complexes under physiological conditions with relative ease. We can either qualitatively describe the conformation of a complex, such as whether a complex loops the DNA, or how a complex sterically binds to the DNA; or quantitatively describe the conformation using particle metrics (Table 3.3) and/or DNA bend angles at the complex [91, 156, 157]. For instance, we can measure both the external bend angles between the two DNA arms extending out of the protein complex (Figure 3.18B, Figure 3.18F), and the internal bend angles embedded within the protein complex as visualized via DREEM (CHAPTER 2). Both bend angle metrics reflect internal conformations of a protein-DNA complex [154, 157]. Combined with biochemical functional studies, these conformations can then be correlated to their biological functions based on where and when they occur.
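As a minimal sketch of an external bend angle calculation, the Python example below takes three user-picked points: one on each DNA arm and one at the protein. It assumes the convention that the bend angle is the deviation from straight DNA (180° minus the angle between the two arms); this convention and the function name are assumptions for illustration, not a description of the angle tool in Image Metrics.

```python
import math


def bend_angle_deg(arm1_point, vertex, arm2_point):
    """Bend angle in degrees, taken as 180 minus the angle between the two arms."""
    v1 = (arm1_point[0] - vertex[0], arm1_point[1] - vertex[1])
    v2 = (arm2_point[0] - vertex[0], arm2_point[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 of |cross| and dot gives the unsigned angle between the arms (0 to 180).
    arm_angle = math.degrees(math.atan2(abs(cross), dot))
    return 180.0 - arm_angle


straight = bend_angle_deg((-1, 0), (0, 0), (1, 0))   # arms point in opposite directions
right_angle = bend_angle_deg((-1, 0), (0, 0), (0, 1))  # arms are perpendicular
```

Under this convention, unbent DNA gives a bend angle of zero and perpendicular arms give 90°.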

Image Metrics provides users with powerful tools to visualize and measure the conformation. For example, unlike other AFM programs (with the exception of SPIP) where users have to resort to third-party screen protractor tools, an angle measurement tool is built directly into Image Metrics to provide interactive angle measurement, and the measurement result can be recorded directly into a database (Section 3.5.3E). Another example is profile analysis (also discussed in Section 3.5.3B), which is often used to dissect and inspect the topography along a certain path [98, 108]. Profile analysis is most powerful when used in conjunction with visualization techniques (3D, contour, and image overlay, see Section 3.3C), especially when users want to quantify the conformation of the DNA and the protein along a 2D slice, such as their locations relative to the surface and/or to each other (Figure 3.18I).

48 Random landing – a protein can randomly land on the DNA

49 Local binding – proteins can continue to bind DNA even after they are deposited onto the surface

Compared to other AFM software, profile analysis in Image Metrics is more flexible, more interactive, and more comprehensive. For instance, while many software packages offer only a line profile tool (e.g. Nanoscope Analysis, Gwyddion), Image Metrics offers both line profile and freehand profile tools (Figure 3.18H). Users can draw multiple lines on the same image (Figure 3.18H), plot multiple profiles on the same graph (Figure 3.18I, graph), mark multiple locations to measure (Figure 3.18I, vertical lines in graph), and take measurements on the coordinates and relative distances between markers (Figure 3.18I, table). Unlike other software where distance measurement is taken between markers within the same profile, Image Metrics allows the measurement to be taken across different profiles. Also unique to Image Metrics, users can interactively change the lines’ positions, sizes, and colors after they are drawn (Figure 3.18H), and their profiles will be updated automatically. In addition, color-tagged markers are placed on both the image and the graph for easy tracking of feature spots. A data channel slider is also implemented for users to visualize and measure the line profiles in different data channels from the same lines, which could be very useful if data channels are composed of different types of information (e.g. phase and amplitude) or are composed of images scanned from a time series. For example, a novel use of this feature is to track the movement of a DNA molecule by tracking the profile along a user-defined path (Figure 3.9A-B). The profile tool allows users to take snapshots of the profiles at different time stamps of the images (after the time unit is calibrated, see Section 3.3A) and measure the displacement and velocity of the movement.
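The displacement and velocity measurement described above can be sketched as follows. This Python example samples a height profile along the same straight line in two frames using bilinear interpolation and compares the peak positions; the frames, the pixel size, and the time interval are hypothetical values, and the helper names are not part of Image Metrics.

```python
import math


def bilinear(image, x, y):
    """Interpolate image[row][col] at fractional column x and row y."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1 = min(x0 + 1, len(image[0]) - 1)
    y1 = min(y0 + 1, len(image) - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def line_profile(image, start, end, n_samples):
    """Return (distance in pixels, height) pairs sampled from start to end (x, y)."""
    (xa, ya), (xb, yb) = start, end
    length = math.hypot(xb - xa, yb - ya)
    profile = []
    for i in range(n_samples):
        t = i / (n_samples - 1)
        x, y = xa + t * (xb - xa), ya + t * (yb - ya)
        profile.append((t * length, bilinear(image, x, y)))
    return profile


def peak_position(profile):
    """Distance along the line at which the profile is highest."""
    return max(profile, key=lambda point: point[1])[0]


# Two hypothetical frames of a time series; the feature moves from column 1 to column 4.
frame_a = [[0.0] * 6, [0.0, 1.0, 0.0, 0.0, 0.0, 0.0], [0.0] * 6]
frame_b = [[0.0] * 6, [0.0, 0.0, 0.0, 0.0, 1.0, 0.0], [0.0] * 6]
pixel_size_nm, frame_interval_s = 2.0, 1.5  # assumed calibration values

pos_a = peak_position(line_profile(frame_a, (0, 1), (5, 1), 6))
pos_b = peak_position(line_profile(frame_b, (0, 1), (5, 1), 6))
displacement_nm = (pos_b - pos_a) * pixel_size_nm
velocity_nm_per_s = displacement_nm / frame_interval_s
```

The same sampling function could be applied to a freehand path by chaining short line segments, which is omitted here for brevity.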

Finally, Image Metrics provides powerful morphologic transformations (Section 3.5.1) not seen in other AFM software, which could open up additional conformational metrics that can be measured (Figure 3.18G). One of the transformations, skeleton, is used to trace the DNA fiber as described earlier (Section 3.5.3B).
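The fiber tracing idea, in which the fiber is the longest of the shortest paths through the skeleton, can be illustrated with a small graph search. The Python sketch below is not the MATLAB implementation used in Image Metrics (which relies on bwdistgeodesic); it assumes a loop-free skeleton given as a set of (row, column) pixels, uses 8-connectivity with diagonal steps of length √2, and finds the two ends with a two-pass search that is exact only for tree-like skeletons. All names are hypothetical.

```python
import heapq
import math


def geodesic_distances(skeleton, source):
    """Dijkstra over 8-connected skeleton pixels; returns distances and parents."""
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue  # stale heap entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nxt = (r + dr, c + dc)
                if (dr, dc) == (0, 0) or nxt not in skeleton:
                    continue
                nd = d + math.hypot(dr, dc)
                if nd < dist.get(nxt, math.inf):
                    dist[nxt] = nd
                    parent[nxt] = (r, c)
                    heapq.heappush(heap, (nd, nxt))
    return dist, parent


def trace_fiber(skeleton):
    """Return (path, length in pixels) of the longest geodesic in a loop-free skeleton."""
    start = next(iter(skeleton))
    dist, _ = geodesic_distances(skeleton, start)
    end_a = max(dist, key=dist.get)           # farthest pixel from an arbitrary start
    dist, parent = geodesic_distances(skeleton, end_a)
    end_b = max(dist, key=dist.get)           # farthest pixel from the first end
    path, node = [], end_b
    while node is not None:                   # walk back from end_b to end_a
        path.append(node)
        node = parent[node]
    return path[::-1], dist[end_b]


# Hypothetical skeleton: a 7-pixel backbone along row 2 with a short side branch.
skeleton = {(2, c) for c in range(7)} | {(1, 3), (0, 3)}
fiber, fiber_length = trace_fiber(skeleton)
```

In this toy skeleton the side branch is discarded and the traced fiber runs along the backbone; multiplying the length by the pixel size would give the fiber length in nanometers.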


Figure 3.18 Single-Molecule Analysis

A. DNA molecules are masked, filtered, labeled, categorized, counted, and traced. The colors represent different configurations of DNA molecules – Purple – free DNA with no protein bound; Dark red – tangled DNA; Light blue – DNA trapped by a bunch of proteins; Red – DNA longer than 450nm; Green – DNA shorter than 450nm. B. Inspection of one DNA molecule. DNA is traced automatically and profiled (top inset). Protein positions on the DNA are identified, labeled, marked, and recorded both on the profile and the image. DNA bend angle is measured on the first protein. Region of interest (ROI) – blue ROI blocks certain areas (from being analyzed and traced); yellow ROI crops areas to be analyzed. Bottom inset – topographic visualization of the DNA molecule with phase overlay. C. Volume distribution of protein-DNA complexes (e.g. yellow crops in B) as done by volume analysis. D. Stoichiometry of protein to DNA as calculated from counting and volume normalization of the complexes as in B and C. E. Position distribution of protein-DNA complexes as plotted from position measurement (B – top inset). F. DNA bend angle distribution as plotted from angle measurement in B. G. Morphology transformation of masks. In this example, an ‘open’ morphologic transformation is used so that the transformed masks (green) trace the contour of the original masks (red). This transformation can be useful for contour measurement. H. Section analysis. Line section (olive), freehand section (green), and distance section (blue) are shown. I. Profiles of the sections (top), the lengths of the sections (top inset), and measurement of coordinates and distances (bottom table) between markers (vertical lines).

E. Data Management and High-throughput Analysis

Few AFM programs have built-in data management capabilities. Measurement data are usually exported to be processed by external statistical analysis and graphing programs. Some AFM programs (Table 3.1) are built on numerical computing platforms such as Igor Pro (Asylum Research software) and MATLAB (e.g. FiberApp and SFMetrics), and therefore allow for direct data manipulation, analysis, and graphing using the capabilities provided by their respective underlying platforms. However, this process is far from streamlined. For example, for most of single-molecule analysis, users still have to manually record their measurements (length, volume, bend angle, etc.) in a table. The responsibility for data storage is also left to the users, which creates a big disconnect between the measurement data and the original image. Even with sophisticated workarounds, it is time consuming to connect measured data points to the original particles from which the measurements were taken, and trying to reproduce or redo the measurements often requires a complete re-processing of the original image(s). The commercial program SPIP has attempted a more integrative approach, but it is geared towards particle analysis (Section 3.5.2A) and is not applicable to the single-molecule analysis discussed in this section.

Image Metrics attempts to tackle this problem on multiple fronts. First, Image Metrics automatically records all the measurements in a spreadsheet while they are being taken, so no manual input is required. Users can freely modify or redo the measurement in situ as long as the particle is under current inspection. After the measurement of a particle is confirmed, Image Metrics can take a snapshot of the measured particles (e.g. Figure 3.18B, H) and the original image with marked location of the particles (Figure 3.18A). Users are then presented with the next particle and the measurements are repeated until all particles are measured. The information in the spreadsheet provides references to the particles and their locations so that users can, when combined with the image snapshots with marked locations, easily go back to the original image to review and redo their measurements. Since Image Metrics allows users to save processed and analyzed images, re-processing of the images is unnecessary. With the information in the spreadsheet and the snapshot, users can trace an individual measurement from the image, to the inspected region, to the DNA, and ultimately down to the protein or protein parts where the measurement is taken. New measurements from different images or data sets can be appended to an existing spreadsheet to form a database where measurements from all data sets can be analyzed and compared together, and the spreadsheet can be exported as Excel or tab-delimited text files to be further processed by external statistical analysis and graphing programs50. The top-down integration of measurement and data storage greatly enhances the throughput of single-molecule analysis while also allowing users to improve their analysis through iterations and revisions, which is not possible through other AFM programs.

50 Plans to integrate statistical analysis and graphing inside Image Metrics are also in the works, see Section 3.7.4. Alternatively, users can also directly export the measurement data into MATLAB for further processing.
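The idea of a measurement spreadsheet that keeps each value linked to its image and particle can be sketched as a small record log. The Python example below is only an illustration of the data layout; the field names, the class, and the example values are hypothetical and do not reflect the actual spreadsheet format of Image Metrics.

```python
import csv
import io

# Hypothetical column layout: every value carries a reference to its source particle.
FIELDS = ["image", "particle_id", "x_px", "y_px", "metric", "value", "unit"]


class MeasurementLog:
    """Append-only table of measurements that can be merged and exported."""

    def __init__(self):
        self.rows = []

    def record(self, image, particle_id, location, metric, value, unit):
        self.rows.append({
            "image": image, "particle_id": particle_id,
            "x_px": location[0], "y_px": location[1],
            "metric": metric, "value": value, "unit": unit,
        })

    def extend(self, other):
        """Append the measurements of another data set to form one database."""
        self.rows.extend(other.rows)

    def find(self, image, particle_id):
        """Look up all measurements taken on one particle of one image."""
        return [r for r in self.rows
                if r["image"] == image and r["particle_id"] == particle_id]

    def to_tsv(self):
        """Export as tab-delimited text for external statistics programs."""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=FIELDS,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(self.rows)
        return buffer.getvalue()


# Hypothetical measurements from two images, merged into one database.
log_a = MeasurementLog()
log_a.record("scan_001", 7, (120, 88), "fiber_length", 914.0, "nm")
log_a.record("scan_001", 7, (120, 88), "bend_angle", 42.0, "deg")
log_b = MeasurementLog()
log_b.record("scan_002", 3, (40, 200), "fiber_length", 927.0, "nm")
log_a.extend(log_b)
tsv_text = log_a.to_tsv()
```

Because every row stores the image name, particle identifier, and location, a value in the exported table can be traced back to the particle on which it was measured.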

3.5.4. Compare Measurement to Other AFM Software

A. Shape Analysis

To validate the algorithms used in Image Metrics for measuring shape metrics, measurements made in Image Metrics are compared to measurements made in other AFM programs (Table 3.4, also see Table 3.7). It should be noted that the measurements can vary across different AFM programs (not just Image Metrics) due to different definitions of the metrics and algorithms used, as well as rounding errors (as seen in the difference in flattening operation in Table 3.2), and any difference should be viewed with an understanding of the algorithm being used. Some of the most notable differences come from area measurement, which could relate to the measurement difference in contour (perimeter), as well as mean height (affected by perimeter pixels) and volume (also affected by perimeter pixels). To help define the area and the contour, a square particle with four blue pixels (pixels 6, 7, 10, and 11) is shown in Figure 3.19. In many AFM programs, the contour (Figure 3.19, red polygon) is not synonymous with the borders of the pixels (Figure 3.19, out-facing borders of the blue pixels). In Asylum Research software and in Image Metrics, the contour is traced by connecting the center of each pixel (Figure 3.19 left, A-B-C-D). In SPIP, the contour can be traced using either the pixel border (Figure 3.19 right) or weighted pixels51 (Figure 3.19 middle, A-B-C-D-E-F-G, also known as contour smoothing) [96]. Gwyddion also employs a different version of weighted pixels for contours ([119], Ref. Section 4.11). In all software packages, the perimeter is measured as the total Euclidean distance between consecutive waypoints in the contour. However, the perimeter of a wrap-around line segment (e.g. the line segment DE in Figure 3.20) could be interpreted differently in different software packages. For example, in Image Metrics, the perimeter of such a line segment equals 1x the length of the line segment52, whereas Asylum Research software counts twice the length of the line segment, resulting in a longer perimeter in general (Table 3.7). The wrap-around line artifact (singular line) is a side effect of contour tracking by pixel center, as are point singularities resulting from single isolated pixel(s), which would have a perimeter of zero. On the other hand, contours by pixel borders or weighted pixels have no singularities and always have a non-zero perimeter. Not surprisingly, contour by pixel center has a smaller perimeter than contour by weighted pixels, which in turn has a smaller perimeter than contour by pixel borders (Table 3.4 Perimeter).

51 In the contour smoothing algorithm, each weighted pixel is calculated from a block of four actual pixels. In Figure 3.19 middle, depending on whether a pixel belongs to the minority or the majority (masked or unmasked) inside the block, it either carries a weight or no weight. Adding weight to the minority pixel results in smoothing of the particle contour (e.g. pixel 6 vs. weighted pixel B). The position of the weighted pixel is calculated as the weighted average of the positions of the four pixels (algorithm omitted here).

52 This behavior is dictated by the underlying regionprops function of MATLAB.
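The single-counting versus double-counting interpretations of a wrap-around segment can be made concrete with a toy contour. The Python sketch below is a hypothetical illustration and does not reproduce the code of regionprops or of any AFM program: it walks a closed list of pixel-center waypoints and either counts every traversed segment or counts a retraced segment only once.

```python
import math


def contour_perimeter(waypoints, count_retraced_once=False):
    """Length of the closed walk through pixel-center waypoints.

    With count_retraced_once=True, a segment walked out and back (wrap-around)
    contributes its length once instead of twice.
    """
    seen = set()
    total = 0.0
    n = len(waypoints)
    for i in range(n):
        a, b = waypoints[i], waypoints[(i + 1) % n]
        edge = frozenset((a, b))
        if count_retraced_once and edge in seen:
            continue
        seen.add(edge)
        total += math.hypot(b[0] - a[0], b[1] - a[1])
    return total


# 2x2 block of pixels: the contour through the four pixel centers is a unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
# Same block with one extra pixel attached at (2, 1): the walk goes out and back.
with_tail = [(0, 0), (1, 0), (1, 1), (2, 1), (1, 1), (0, 1)]
# A single isolated pixel is a point singularity.
isolated = [(0, 0)]

square_perimeter = contour_perimeter(square)
double_counted = contour_perimeter(with_tail)
single_counted = contour_perimeter(with_tail, count_retraced_once=True)
isolated_perimeter = contour_perimeter(isolated)
```

For comparison, the pixel-border contour of the same 2x2 block would have a perimeter of 8 pixel units, which is consistent with the ordering of perimeters described above.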


Figure 3.19 Different Contour Measurement Mechanisms

A block of 16 pixels is shown, with masked pixels labeled in blue. Left. Contour is traced through centers of masked pixels (dots, A-B-C-D). Middle. Contour is traced through weighted pixels (A-B-C-D-E-F-G-H). The weighted pixels are calculated from every block of four pixels that has at least one non-void pixel, and results in smoother edges compared to contour of the pixel borders. Right. Contour is traced through outward-facing borders of the pixels.

Figure 3.20 Perimeter Measurement of Contour by Pixel Center

The perimeter for a wrap-around line segment can be interpreted differently in different AFM programs, resulting in either single counting (1x segment length) or double counting (2x segment length).

Nanoscope Metrics Image Metrics Gwyddion SPIP Analysis Detected 78 78 78 78 Particles

547 nm³ Volume 566 nm³ N/A 566 nm³ (-3%)

Max 2258 nm² 2267 nm² N/A 2267 nm² Area (-0.4%)

Min 15 nm² 15 nm² N/A 15 nm² Area

Mean 439 nm² 439 nm² 441 nm² 441 nm² Area (-0.5%) (-0.5%)

68 nm 83 nm Perimeter 53 nm N/A (+28%) (+57%)

Max 4569 pm 4569 pm N/A 4569 pm Height

Min 520 pm 520 pm 520 pm 520 pm Height

Mean 1458 pm 1458 pm 1458 pm 1458 pm Height

Table 3.4 Comparison of Selected Metrics in Particle Analysis

A selection of particle metrics is measured for an AFM image composed of MutSβ proteins (see CHAPTER 4). Percent differences relative to the measurements made in Image Metrics are included in brackets, and notable differences are also highlighted in red. Metrics that are not available in the respective AFM programs are labeled N/A. The perimeter measured in Image Metrics uses contour by pixel center, whereas the perimeter measured in Gwyddion uses contour smoothing (weighted pixels) and the perimeter measured in SPIP shown here is without contour smoothing.

In most AFM software packages (including Image Metrics), the area is measured as the enclosed area of the pixel border (Figure 3.19 right), despite their differences in contour measurements. The areas measured in these programs are identical apart from minor rounding errors (Table 3.4). Notably, however, the area in Asylum Research software is measured as the enclosed area of the contour (rounded to multiples of a unit pixel area53), which could be significantly smaller for small particles (Table 3.7). For example, the area in Figure 3.19 (left) will be measured as 1 instead of 4 (in pixel units) in Asylum Research software. Since volume is defined as some integration of height and area, its measurement is also impacted accordingly (Table 3.7). Traditionally in volume measurement, the pixels are modeled as blocks (Figure 3.21 left) and therefore the volume can be directly derived as the product of height and area. With height measurements identical across all compared programs (Table 3.4), the method used in area calculation would be solely responsible for measurement differences in volume.
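The factor of four between the two area definitions for the 2x2 particle follows directly from the polygon area (shoelace) formula. The Python sketch below is illustrative only; the coordinates are a hypothetical placement of the four-pixel particle with pixel centers on integer positions and pixel borders half a pixel further out.

```python
def shoelace_area(vertices):
    """Area enclosed by a closed polygon given as ordered (x, y) vertices."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


# Contour through the centers of the four masked pixels (unit pixel spacing).
center_contour = [(1, 1), (2, 1), (2, 2), (1, 2)]
# Contour along the outward-facing borders of the same four pixels.
border_contour = [(0.5, 0.5), (2.5, 0.5), (2.5, 2.5), (0.5, 2.5)]

area_by_centers = shoelace_area(center_contour)   # enclosed area of the center contour
area_by_borders = shoelace_area(border_contour)   # equals the number of masked pixels
```

The relative gap between the two definitions shrinks as particles grow, which is why the effect matters most for small particles.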

Pixels can also be modeled as vertices, a process called triangulation, to better model the surface area and encompassing volume (Figure 3.21 right), as is used in Gwyddion. In the triangulation scheme, the centers of pixels are modeled as vertices of the triangles, and the volume can be obtained using trapezoid integration ([119], Ref. Section 4.11). A new area metric, called surface area, can also be obtained by adding up the areas of all the triangles54. In the triangulation scheme, because volumes could be tapered off around the edges of the particles (Figure 3.21 right), the volume obtained could be slightly smaller than in the block scheme, as seen in Table 3.4.
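The effect of tapering at the particle edges can be seen in a one-dimensional simplification of the two models. The Python sketch below is not Gwyddion's algorithm; it assumes a single row of three masked pixels with hypothetical heights, treats each pixel as a block in one case, and integrates between pixel centers with the trapezoid rule in the other.

```python
def block_volume(heights, dx, dy):
    """Block model: every pixel contributes height times pixel area."""
    return sum(heights) * dx * dy


def trapezoid_volume(heights, dx, dy):
    """Vertex model (1D simplification): trapezoid rule between pixel centers."""
    strips = [(a + b) / 2.0 for a, b in zip(heights, heights[1:])]
    return sum(strips) * dx * dy


# Hypothetical row of three masked pixels (heights in nm, 1 nm x 1 nm pixels).
row_heights = [1.0, 2.0, 1.0]
volume_blocks = block_volume(row_heights, dx=1.0, dy=1.0)
volume_vertices = trapezoid_volume(row_heights, dx=1.0, dy=1.0)
```

In this toy case the vertex model loses half a pixel of width at each end of the row, so its volume is smaller than the block volume; for larger particles the edge contribution becomes a smaller fraction of the total.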

53 In the Asylum Research software, the enclosed area is rounded to multiples of the area of a single pixel, even though the actual enclosed area may be smaller than that. For example, the area in Figure 3.20 will be measured as 2 instead of 1.5 (in pixel units).

54 In Image Metrics, surface area calculation is supported in the roughness analysis module.


Figure 3.21 Models for Volume Measurement

A 3D view of a particle of three pixels is shown as wireframe diagrams. Left. Pixels are modeled as blocks, and therefore the volume is the direct multiplication of height and area. Right. Pixels are modeled as vertices (a process called triangulation), and therefore the volume is the integration of height along the vertices.

B. Single-Molecule Analysis

Protein-DNA single-molecule analysis places a central focus on fiber length and position measurement in addition to measuring general shape metrics. Table 3.5 and Table 3.6 compare the DNA fiber length measurements featured in Figure 3.17 across multiple AFM programs. The measurement difference is insignificant (±4 nm in a ~900 nm DNA, or 0.4%), and is mostly attributable to the different morphological transformation algorithms used in different software. This can be seen in the transformed fiber in the mask images (Table 3.5 and Table 3.6, green lines), where no two programs produce identical results even with the same morphological procedure (skeleton). Notably, when the morphological thin operation is used in place of skeleton, Image Metrics preserves the kinks of the DNA molecule much better than its skeleton counterparts (Table 3.5 and Table 3.6, red/blue arrows and insets), where DNA fibers are more smoothed out. Without the smoothing, the DNA also measures longer (927 nm vs. 914 nm, Table 3.6), but arguably has a better tracing result55 (Table 3.5 and Table 3.6, green and black lines). The kink preservation is important in measuring DNA bending by the protein, as well as in accurately tracing the DNA at rough edges and estimating the position of the protein.

However, among all the software compared, SPIP is best at picking accurate end points of the traced fiber (Table 3.5 and Table 3.6, green arrows and insets), where the points are picked at the centroid of the ends (SPIP) rather than at the boundaries of the ends (Image Metrics)56. This ‘truncation’ in fiber ends is also reflected in the shortened length measurement (910 nm vs. 914 nm, Table 3.5 and Table 3.6), and could be useful in estimating accurate positions when proteins are bound to DNA ends57. However, all the methods will introduce some biases without prior knowledge of the exact position where the protein binds to the DNA end58; and despite the differences arising from individual measurements, on average the positions measured by different methods are near-identical and agree within their statistical errors (Table 3.7).

55 Ideally, the DNA trace should overlay on top of the DNA backbone. Assuming the DNA and the protein are symmetric (which most of them are at the imaging resolution), the threshold-based mask should symmetrically mask the particles around its height while the morphologic thin operation symmetrically reduces mask boundaries such that the resulting fiber should overlay on top of local maxima of the masked particles.

56 The difference comes from the underlying morphological algorithms, where the end points in the transformed skeleton in SPIP do not extend to the boundaries, whereas the end points in skeleton/thin operation in MATLAB extend to the boundaries.

57 Without the truncation (i.e. fiber extends to the mask boundary), errors in estimating the position of an end-binding protein could occur depending on the assumptions of where the protein binds at the DNA ends. Assuming a 10nm radius protein binding to a 300nm long DNA at the very end (i.e. 0% length), without the truncation, the protein location will be measured at 10nm in its fiber profile, and its percent position of the total length is estimated as 10/300~3% instead of 0%.

58 For example, MutS homologs require a portion of non-specific sites adjacent to the binding site [66, 158]. In other words, MutS proteins will not bind at the very end of the DNA, but at some fraction of the DNA length near the end. In addition, AFM could not resolve the exact location of the DNA end within the end-binding protein. Therefore, none of the fiber morphological transformations is bias free in this regard.

Software | Mask w/ Fiber | Image w/ Fiber
Asylum Research software, 918nm (skeleton) | [image] | [image]
SPIP, 910nm (skeleton) | [image] | [image]

Table 3.5 Software Comparisons on Fiber Measurement

(Table continued on next page) The table compares the measurement of an example DNA fiber in selected software. The software and their measurements are displayed in the first column. The base algorithm used for the fiber measurement is enclosed in brackets. A step-by-step fiber tracing procedure of the same molecule can be found in Figure 3.17. (Mask w/ Fiber column) The DNA particle is subjected to the same height threshold mask (orange). The morphologically transformed mask (via thinning or skeletonizing) is used to determine the fiber (green), which is the longest non-crossing path between any two points in the transformed mask. Notable differences among transformations used in different software are highlighted by arrows and their color-matched inset boxes. (Image w/ Fiber column) Traced DNA fiber (black) is overlaid on the DNA image, so that how well each software traces the DNA can be compared directly. See main text for a discussion of the differences.

Software | Mask w/ Fiber | Image w/ Fiber
Image Metrics, 914nm (skeleton) | [image] | [image]
Image Metrics, 927nm (thin) | [image] | [image]

Table 3.6 Software Comparisons on Fiber Measurement (cont’d)

(Table continued from previous page)

To replicate a complete workflow and reflect a more realistic application of the respective software, a test set of measurements is also made without controlling all the variables, so that the measurements include human errors. For example, in the test set, masking is performed separately in each program and therefore does not reproduce the exact same masks due to subjectivity in choosing the threshold, which affects measurements that are based on the masks59. In the test set, Image Metrics and Asylum Research software are compared (Table 3.7); comparisons are also made to other AFM programs (e.g. ImageSXM and SPIP, data not shown) with similar results. Overall, the difference in length measurement (Table 3.7, total length and short arm length), for both freehand tracing (*) and automatic tracing (**), is small (<2%), and likely results from human errors in masking the features (in addition to the errors caused by the algorithms discussed earlier). The difference in maximum height (2%, Table 3.7) likely results from differences in the masked flattening operation (Section 3.4.1), whose result also depends on the masks made. Bend angle measurements also agree within human error (~20°, Table 3.7), which is typical for a protractor-assisted angle measurement for the types of data we analyze. Other metrics (area, volume, perimeter) are mainly subject to differences in algorithms as discussed earlier in part A of this section.

59 For example, mask area can be smaller if a higher threshold is chosen when making the masks. A smaller mask can affect the measurement of fiber length, area, average height, volume, perimeter, etc.

Metrics | Asylum Research | Image Metrics (Difference) | %Diff
--- | --- | --- | ---
Short Arm* | N/A | 4 ± 5 nm | N/A
Total Length* | 351 nm | 6 ± 10 nm | 1.7%
Short Arm Fraction* | N/A | 1 ± 1% | N/A
Short Arm** | N/A | 3 ± 17 nm | N/A
Total Length** | 324 nm | 1 ± 23 nm | 0.3%
Short Arm Fraction** | N/A | 1 ± 3% | N/A
Bend Angle | N/A | −6 ± 18 | N/A
Max Height | 2600 pm | −56 ± 65 pm | -2%
Mean Height | 1200 pm | −100 ± 65 pm | -8%
Area | 374 nm² | 75 ± 57 nm² | 20%
Volume | 445 nm³ | 41 ± 40 nm³ | 9%
Perimeter | 76 nm | −9 ± 7 nm | -12%

Table 3.7 Comparison of Selected Metrics in Single-Molecule Analysis

In this table, the measurements from Image Metrics are compared to those from Asylum Research software. Ten particles are measured in both programs, and the results are compared. (*) indicates measurement done by the freehand tool. (**) indicates measurement done by the automatic tracing tool. The DNA measured is ~340nm in length and the short arm length (the distance from the protein to the nearest end of the DNA) is ~133nm.

Taken together, these results indicate that the measurements between Image Metrics and other AFM programs are mostly interchangeable. For the few metrics that are measured differently, the measurement differences are often negligible when compared to the differences in sample features (e.g. volume of different protein oligomers); and when normalized against measurements made in the same program, the difference is close to non-existent (e.g. normalized volume and fiber length).

3.6. Macros and User Extensions

In measurement science, a statistical conclusion is only as convincing as the quantity of the data that are analyzed and the quality of the data that are collected and measured. In single-molecule studies, not only AFM studies, it is often easier to collect more data than one can analyze, and much information remains unanalyzed unless the throughput of data analysis can be increased. One of the top design goals of Image Metrics is to increase analysis throughput and streamline the analysis pipeline, including using macros to automate a processing or analysis sequence, incorporating batch operations to process and analyze images, and building powerful tools for comprehensive measurement, filtering, categorization, visualization, graphing, and reporting.

In Image Metrics, users can configure macros in two modules – Action History and Macro Builder. Action History records the history of all user actions, including all their clicks and the values of the controls they use (Figure D.3). Users can easily select their past actions, save them as a macro, and execute the same macro on all images. This feature is particularly useful for executing a set of commands repeatedly on a batch of images without any user input. The macro can be further edited or created anew in Macro Builder to customize the actions (add/delete/edit) (Figure D.4). Similar modules are also seen in other AFM software, but not in combination, and the customization is often limited in other software. In addition to batch processing, users can also execute the macro one step at a time in Image Metrics to make sure actions perform as expected, or pause the macro at any time during the execution to review the result or rewind any unintended actions (Figure D.5).
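The record-and-replay idea behind Action History and Macro Builder can be sketched generically. The Python example below is a hypothetical illustration and not the Image Metrics implementation: actions are recorded as a name plus parameter values, a selection of them is saved as a macro, and the macro is then either run on a batch of images or stepped through one action at a time. The two toy actions and all names are assumptions made for the example.

```python
class ActionHistory:
    """Records each user action as (action name, parameter values)."""

    def __init__(self):
        self.entries = []

    def record(self, name, **params):
        self.entries.append((name, params))

    def save_macro(self, indices):
        """Build a macro from the selected past actions."""
        return [self.entries[i] for i in indices]


def run_macro(macro, image, actions):
    """Apply the macro one step at a time, yielding each intermediate result."""
    for name, params in macro:
        image = actions[name](image, **params)
        yield name, image


def run_batch(macro, images, actions):
    """Run the whole macro on every image without user input."""
    results = []
    for image in images:
        final = image
        for _, final in run_macro(macro, image, actions):
            pass
        results.append(final)
    return results


# Two toy actions on images stored as lists of rows (hypothetical stand-ins).
ACTIONS = {
    "offset": lambda img, amount: [[v - amount for v in row] for row in img],
    "threshold": lambda img, level: [[1 if v > level else 0 for v in row] for row in img],
}

history = ActionHistory()
history.record("offset", amount=1.0)
history.record("threshold", level=0.5)
macro = history.save_macro([0, 1])

images = [[[1.0, 2.0]], [[3.0, 1.2]]]
batch_results = run_batch(macro, images, ACTIONS)

# Step mode: execute only the first action and inspect the result before continuing.
stepper = run_macro(macro, images[0], ACTIONS)
first_step = next(stepper)
```

Using a generator for the macro makes pausing natural, since execution simply stops between steps until the next result is requested.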

Macros can only go as far as the existing components and functions a program provides. For user extensions, Image Metrics provides a miniature integrated development environment (IDE) that features a script editor, a debugger, and a command console to help users develop their own applications. The Script Editor module (Figure D.6) allows users to write their own scripts and functions, which can be debugged and executed inside Image Metrics. The Command Console module (Figure D.9) allows users to test simple scripts in real time; it functions like a mini MATLAB desktop that features a workspace where users can store, inspect, and manipulate variables at the backbone of the program. These features greatly expand the functionality of Image Metrics. Users are no longer restricted to what Image Metrics has to offer. They can write functions that further automate the processing or perform complex analyses that are not provided in the program. The relatively full-featured script editor, debugger, and command console make Image Metrics not only a user application, but also an application development tool independent of MATLAB.

User applications and macros can be deployed locally through the Apps Manager (Figure D.11) and distributed through a web interface (App Store, in development). Image Metrics also provides interfaces that allow user apps and macros to be integrated into the program as if they are native functions. These features greatly expand the usability of the program as a user-centric tool that provides users maximum flexibility in customizing and automating their own workflows.

3.7. Author’s Remarks and Future Directions

3.7.1. Motivations on Creating Image Metrics

Even though there are already very comprehensive AFM programs on the market, none of them is tailored for high-throughput single-molecule protein-DNA analysis, and high throughput is crucial for these types of analyses to reach any statistically meaningful conclusion. The complexity of protein-DNA complexes makes them unsuitable for the typical batch particle analysis found in these programs, yet most of their analyses can be divided into repetitive sub-routines that can benefit greatly from automation. Many of these sub-routines rely heavily on human discretion. They include cropping individual particle complexes from the image for inspection, tracing molecules for their height profiles and locations, drawing regions of interest on molecular components to obtain their shape parameters, and writing down the measurements and taking snapshots for backtracking. The manual processing of these sub-routines adds biases and imprecision to the overall analysis and dominates the time consumption. Unfortunately, none of the programs on the market is flexible and sophisticated enough to accommodate the coordination and automation of these sub-routines. Hence there is a need for a new type of image analysis software that is built on flexibility and easy customization, allowing users to curate and streamline their own workflows rather than being confined to hard-coded preset operations.

Luckily, this new type of software can be made possible through numeric computing platforms such as MATLAB, a programming language used by millions of scientific professionals to build specialized in-house applications. The strength of MATLAB lies in its enormous scientific computing and graphing libraries that can be readily used in user applications, while being easy enough to learn for users without any pre-existing programming knowledge. MATLAB does most of the heavy lifting in programming so that users can instead focus on implementing key algorithms, which are mathematical in nature.

That is where Image Metrics comes in. Unlike other MATLAB-based AFM programs that target only a specific set of analyses, Image Metrics is fully featured to compete with mainstream AFM software vendors. It also lays the foundations for automated, highly customizable image analysis that takes full advantage of MATLAB’s capabilities, with the aim of reducing human biases while improving throughput through better algorithms and automation.

With close to a hundred thousand lines of code dedicated to AFM image processing and compiled to run on all major platforms, Image Metrics incorporates a complete AFM package that does all the heavy lifting for AFM users to conduct AFM analysis both inside and outside MATLAB, on any platform of their choosing. Whereas users are hard-wired to preset functions in other software, they have full control in developing their own workflows in Image Metrics. This level of flexibility not only allows users to greatly enhance the throughput of highly complex yet highly repetitive single-molecule analysis, but also allows them to tackle any demanding analysis goals.

3.7.2. Particle Detection

Particle detection in single-molecule AFM studies remains a difficult task in a number of scenarios that cannot be handled by any current AFM software. One scenario involves detecting particles in images with low SNR, such as phase and EFM images, or detecting specific particles in a heterogeneous sample where different types of particles cannot be easily separated by their particle metrics. The masking method using an intensity threshold (Section 3.5.1) will not work well in this scenario. To detect particles in low-SNR images or in heterogeneous samples, methods developed in EM can be used [115]. In EM, a reference particle is used as a template to detect similar particles. The template particle is rotationally averaged before being cross-correlated with the full image field. Particles that are similar to the template particle produce local maxima in the cross-correlation map, and the map around each maximum resembles the autocorrelation function (ACF) of the template particle (footnote 60). Therefore, the local maxima can be used to mark the locations of the detected particles. The EM method works in images with extremely low SNR by AFM standards, and is able to filter particles by their correlation to a user-supplied reference instead of by their particle metrics. Since the particles detected by this method are pre-filtered by correlation, it is easier to perform subsequent particle alignment and classification (Section 3.5.2B-C and Section 3.7.3) due to the reduced sample size and sample heterogeneity. The disadvantage, however, is that the detected particles will highly resemble the reference, which could prevent other types of particles (e.g. a different conformation of the same particle, or a different oligomerization state of the particle) from being detected if users do not have enough prior knowledge of their sample [115].
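The template-based detection described above can be illustrated with a minimal Python sketch. It is not the implementation used in Image Metrics or in any EM package. The function names, the synthetic Gaussian "particles", the noise level, and the threshold of half the maximum correlation are all assumptions chosen for illustration, and the cross-correlation is computed circularly via the FFT for brevity.

```python
import numpy as np


def rotational_average(template):
    """Replace every pixel by the mean of all pixels at the same (rounded) radius."""
    h, w = template.shape
    yy, xx = np.indices((h, w))
    radius = np.rint(np.hypot(yy - (h - 1) / 2, xx - (w - 1) / 2)).astype(int)
    sums = np.bincount(radius.ravel(), weights=template.ravel())
    counts = np.maximum(np.bincount(radius.ravel()), 1)  # avoid division by zero
    return (sums / counts)[radius]


def cross_correlate(image, template):
    """Circular cross-correlation of a mean-subtracted image and template via FFT."""
    img = image - image.mean()
    th, tw = template.shape
    padded = np.zeros_like(img)
    padded[:th, :tw] = template - template.mean()
    # Shift the template so that its center sits at index (0, 0); a peak in the
    # map then marks the center of a matching particle.
    padded = np.roll(padded, (-(th // 2), -(tw // 2)), axis=(0, 1))
    spectrum = np.fft.fft2(img) * np.conj(np.fft.fft2(padded))
    return np.real(np.fft.ifft2(spectrum))


def find_peaks(cc_map, threshold):
    """Return (row, col) of local maxima (8-neighborhood) at or above threshold."""
    peaks = []
    rows, cols = cc_map.shape
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            value = cc_map[y, x]
            if value >= threshold and value == cc_map[y - 1:y + 2, x - 1:x + 2].max():
                peaks.append((y, x))
    return peaks


# Synthetic test image (assumption): two Gaussian blobs on a weak noise background.
rng = np.random.default_rng(0)
gy, gx = np.mgrid[-3:4, -3:4]
blob = np.exp(-(gy**2 + gx**2) / (2 * 1.5**2))
image = rng.normal(0.0, 0.02, (32, 32))
for cy, cx in [(10, 12), (22, 20)]:
    image[cy - 3:cy + 4, cx - 3:cx + 4] += blob

cc_map = cross_correlate(image, rotational_average(blob))
peaks = find_peaks(cc_map, 0.5 * cc_map.max())
print(peaks)
```

Because the template is rotationally averaged first, the same template responds to particles in any in-plane orientation, at the cost of discarding any angular detail of the reference.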

60 This result is expected because the rotationally invariant template particle (after being subjected to rotational averaging) will correlate with the same particles oriented in other directions, thereby yielding a version of its autocorrelation function.

A second scenario involves the detection of particles with loose components, such as proteins with long linker arms or flexible domains, which cannot be easily masked by a height threshold without masking the surface, especially if the surface is less than ideally flat. As described earlier in Section 3.5.1, the masks on these loose components can be joined together via a morphological closing operation (dilation followed by erosion). However, dilating the masks risks over-masking areas that are not features of interest, resulting in overestimation of various particle metrics (e.g. volume, area, etc.). An alternative to the morphological operation is to group the masks that are close in distance so that masks within the same group are measured as a whole (footnote 61). Essentially, the disconnected masks that are close in distance become a single, disconnected mask that is subjected to the same measurements as other connected masks. In this way, users make measurements only on the masked areas they are comfortable with, but they risk underestimating some particle metrics if loose components of the features are under-masked or unmasked. Ideally, an algorithm that masks features based on their local surface heights instead of a universal height would produce masks that better identify the features.
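The grouping alternative can be sketched as follows, using the centroid-distance criterion mentioned in the accompanying footnote. This is a minimal illustration under stated assumptions: masks are represented as lists of (row, col) pixels, groups are formed by single linkage (any two masks whose centroids are within `max_distance` end up in the same group), and the names `centroid`, `group_masks`, and `max_distance` are hypothetical.

```python
from math import hypot


def centroid(pixels):
    """Mean (row, col) position of a mask given as a list of pixel coordinates."""
    n = len(pixels)
    return (sum(p[0] for p in pixels) / n, sum(p[1] for p in pixels) / n)


def group_masks(masks, max_distance):
    """Group mask indices whose centroids lie within max_distance (single linkage)."""
    parent = list(range(len(masks)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    centers = [centroid(m) for m in masks]
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            dy = centers[i][0] - centers[j][0]
            dx = centers[i][1] - centers[j][1]
            if hypot(dy, dx) <= max_distance:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(masks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two fragments of one loosely connected particle plus one distant particle.
masks = [[(0, 0), (0, 1)], [(0, 4), (0, 5)], [(20, 20)]]
groups = group_masks(masks, max_distance=5)
# Metrics are then computed per group, e.g. the total masked area in pixels.
areas = [sum(len(masks[i]) for i in group) for group in groups]
print(groups, areas)
```

Unlike dilation, this approach never adds pixels to a mask, so metrics such as area and volume are computed only from pixels that passed the threshold.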

61 For example, the centroid metric of disconnected masks can be obtained to calculate the distance between different centroids, which can then be used to determine whether the masks need to be grouped.

The third scenario involves the detection of particle components or sub-particles, such as the proteins on a protein-DNA complex, or individual monomeric units in a protein dimer. Currently, measurements on particle parts can only be performed manually through ROI tools (Section 3.5.3C). However, the process could be automated through the introduction of mask sets. Set operations on masks have been developed in Gwyddion ([119], Ref. Section 4.4). Basically, one can perform different sets of masking operations on the same image, resulting in the detection of different types of particles. By saving the different masks as mask sets and performing set operations (e.g. logical AND and OR) on them, users can identify masks from one set that are related to masks from the other set. For example, using logical AND (also known as intersection), users can identify only those proteins (from the protein masks) that are bound to the DNA (from the DNA masks), or identify only those protein monomers (from masks based on a higher height threshold) that come from a protein dimer (from masks based on a lower height threshold) (footnote 62). Combined with the powerful macros and scripting interface built into Image Metrics (Section 3.6) and the planned new data structure (Section 3.7.4), this technique should greatly enhance the capability of the program to detect particles in very complex scenarios and to batch-process their analyses, such as the analysis of individual protein complexes on each DNA, or the analysis of individual protein subunits in a multimeric protein complex. An alternative method of detecting particle parts is the use of automated ROI tools, which is discussed later in Section 3.7.4.
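As an illustration of the dimer example, the following Python sketch builds two mask sets from the same height image (a low and a high threshold) and uses a logical AND to relate each high-threshold mask to the low-threshold mask that contains it. It is a sketch under stated assumptions, not the planned Image Metrics feature or the Gwyddion implementation. The helper `label` is a simple 4-connected flood fill written here for self-containment, and the toy height image and thresholds are invented for the example.

```python
import numpy as np


def label(mask):
    """4-connected component labeling; returns (label image, number of components)."""
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue
        count += 1
        labels[seed] = count
        stack = [seed]
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                if inside and mask[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = count
                    stack.append((ny, nx))
    return labels, count


def monomers_in_dimers(height, low, high):
    """Map each low-threshold mask to the high-threshold masks inside it,
    keeping only low-threshold masks that contain exactly two of them."""
    complex_labels, _ = label(height > low)      # mask set 1: whole complexes
    monomer_labels, n_monomers = label(height > high)  # mask set 2: tall subunits
    members = {}
    for m in range(1, n_monomers + 1):
        # Logical AND (intersection) of this monomer mask with the complex mask set.
        overlap = np.logical_and(monomer_labels == m, complex_labels > 0)
        parent = int(complex_labels[overlap].max())
        members.setdefault(parent, []).append(m)
    return {c: ms for c, ms in members.items() if len(ms) == 2}


# Toy height image: two tall monomers joined by a low linker, plus a lone monomer.
height = np.zeros((5, 12))
height[2, 1:6] = 1.0
height[2, 1] = height[2, 5] = 3.0
height[2, 9] = 3.0

dimers = monomers_in_dimers(height, low=0.5, high=2.0)
print(dimers)
```

The same pattern applies to the protein-DNA case by replacing the two thresholds with a protein mask set and a DNA mask set.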

62 Assuming the protein dimer has a long, flexible linker arm between its two monomers, a high threshold would not mask the protein as a whole, but rather mask its individual subunits separately. Conversely, a low threshold would likely mask the protein dimer as a whole.

3.7.3. Shape Analysis

In particle matching and classification, I expect to incorporate a new alignment method for better particle alignment and classification. The current method uses a reference particle (‘template’) to align other particles. Although the method allows the best reference to be picked by the software (Section 3.5.2C, Figure 3.16), it is still a reference-based alignment algorithm. Using such an algorithm could cause misalignment for particles that are not strongly correlated to the reference. Reference-free alignment methods, such as those using rotation-invariant kernels (footnote 63) or exhaustive sampling of rotation space (footnote 64), have been developed for EM [115, 130]. These methods allow each particle to be aligned optimally to a user-defined angle independently of any reference. Better alignment will enable users to obtain higher-resolution class averages, and the refined class averages can be subjected to multi-reference alignment (MRA) for better particle classification [115, 135, 136]. If a structural model of the particle is already available, it can be projected or docked into the images by minimizing the surface distance to the model’s center of gravity [139] or by minimizing conformational variances between the images and the model [100]. These structurally docked images can be used for supervised classification to classify only particles with a matching structural signature [139].

63 An example of a rotation-invariant kernel is the double auto-correlation function of a particle image. The first auto-correlation function (calculated translationally in Cartesian coordinates) makes the image translationally invariant. A second, rotational auto-correlation function (calculated along a ring in polar coordinates) makes the image rotationally invariant. Therefore, particles after applying the kernel are both translationally and rotationally invariant, and alignment can be performed by the inverse operation with a user-defined rotation phase.

64 In this approach, the rotation space (azimuthal angle) is segmented into divisions. Particles that are close to each other in the same division are aligned and averaged. The averaged particle is then used to align averaged particles from other divisions. Using the averaged particle as the reference reduces the alignment errors that arise from using an individual particle as the reference.

It should be noted that the difference in emphasis between EM and AFM in approaching particle classification (Section 3.5.2C) reflects the type of images collected by the two techniques. EM image resolution is known to be limited by noise [159, 160], because EM often has to use a low electron dose to protect the sample. The noise originates from the insufficient number of electrons passing through the sample, but this also makes EM an ideal candidate for oversampling techniques (Section 3.4.2). In EM, particularly in cryo-EM, data typically have a low signal-to-noise ratio (SNR) [147, 161], but they can be collected in large quantities. Therefore, users can still obtain large quantities of particle images in a class even as they increase the number of classes (which reduces the size of each class), and significantly improve SNR through correlation averaging. The disadvantage is that the particles often have to be picked based on prior knowledge, since the SNR of EM images is usually not high enough to distinguish features from noise [160]. This limitation could prevent some of the heterogeneous conformations from being detected if they are not known to the user in advance [115]. On the other hand, AFM data typically have high SNR, and all particles, regardless of their conformational states, can be picked up with little ambiguity through a simple intensity threshold (Figure 3.7D). Since the particles are deposited onto a surface, the surface tends to induce preferred orientations of the particles (compared to random orientations in cryo-EM) [162, 163], making it easier to separate different conformations (vs. different orientations) within a heterogeneous sample. The disadvantage of AFM data is that the origin of many image artifacts and noise sources (APPENDIX F) does not allow them to benefit directly from oversampling techniques, especially in the topographic images. In particular, the resolution is limited by the tip size (through the tip dilation effect), which cannot be improved through oversampling. In a typical experiment, the number of particles that can be collected in AFM is also small in comparison to EM. These shortcomings prevent particle classification techniques from obtaining refined classes and class averages on AFM data. Therefore, the emphasis of particle classification in Image Metrics on resolving classes by major conformational changes and sample heterogeneity suits the type of data (AFM) it processes.

Two-dimensional (2D) classes are obtained through alignment in the azimuthal angle. They can also be used to reconstruct a 3D model if the spherical coordinates of their 2D projections are calculated [164], for which the common line method (footnote 65) or the simulated annealing method (footnote 66) can be used [138, 165]. The ability to obtain refined 2D classes, combined with the more heterogeneous particle orientations on the grid, allows cryo-EM to reconstruct 3D models from 2D class averages [136]. EM also benefits from the Random Conical Tilt technique, which allows for in situ reconstruction of a 3D model from an individual particle [166] and is often used to build initial particle models and identify homogeneous data sets from a heterogeneous sample [135]. Even a lower-resolution technique such as negative staining EM benefits from all these advancements, which facilitate the reconstruction of low-resolution 3D models [135]. AFM is in many ways similar to negative staining EM. For example, they both have a supporting surface (e.g. mica in AFM, and carbon support film in negative staining EM), which facilitates classification of the conformations of heterogeneous samples due to their preferred orientations on the surface [135]. There are important distinctions, though. While EM images are 2D projections of a see-through 3D structure, AFM images are essentially topographic images. The places that electrons can reach, such as holes inside a structure, are less accessible to AFM probes. The 3D models reconstructed from AFM topographic images will therefore be topographic in nature. If the sample is homogeneous in conformation and yet enough instances of different orientations can be collected, AFM can potentially benefit from EM computation techniques and be used to reconstruct low-resolution 3D topographic models as well.

65 If two 2D images come from two projections of the same 3D model, they will share a common line in their features (think of it as a rotation-invariant line). In the common line technique, images are placed in a 3D Fourier space as 2D planes, and their common lines are the intersections of these planes. Once the lines are found, their relative positions can be used to back-calculate the orientations of the images, which are used to put the 2D planes into their proper orientations in the 3D Fourier space. The 3D model can then be reconstructed by performing an inverse Fourier transform.

66 In this technique, particle images that are close in 3D orientation are ‘annealed’, i.e. assigned a similar spherical angle. Particle orientations can be fully restored once all relative angles are obtained by repeating this process for all particles.

To accelerate particle shape recognition, I plan to implement machine learning techniques, which have seen applications in EM particle detection and classification [167, 168]. The computer will be trained to recognize particle shapes using a data set, which allows it to recognize particle shapes in future data sets without going through the more laborious alignment and classification procedure. This procedure will allow Image Metrics to conduct particle metrics analysis and classification simultaneously, significantly improving the throughput of shape analyses.

3.7.4. Single-Molecule Analysis

The single-molecule analysis module in Image Metrics provides users with a very comprehensive toolset to analyze individual molecules microscopically. This feature separates it from other AFM programs. At its core are powerful masking and region of interest (ROI) tools, which function like tweezers and scissors to crop the particles or particle parts of interest to be analyzed, and a powerful set of measurement tools that allows for easy DNA tracing and protein locating with a streamlined data recording and batch processing interface. These features, while not completely automatic, greatly reduce the time needed for such analysis, increasing the throughput by as much as tenfold in some analysis workflows. However, many improvements are still highly desirable as I continue to work through data analysis of protein-DNA complexes (CHAPTER 4).

High on the list is to incorporate a data management system that reorganizes the current structure of data storage (Section 3.5.3E) for single-molecule analysis and shape analysis. Right now, users still have to resort to external spreadsheet and graphing programs to collect and analyze their results, and going back to the original data to inspect and/or redo a measurement is still a non-trivial process. The program SPIP has attempted solutions to tackle the data management issue, but its flexibility is severely lacking. Spreadsheets as a carrier for data storage are also severely limited by the types of data they can store. For one, the data stored in a spreadsheet are tabular in nature and lack structural organization. By reorganizing the data storage into a hierarchical format (such as that of a structure type) and aggregating the data into a centrally managed database inside Image Metrics, data from all experiments can be gathered in one place for comprehensive statistical analysis and cross-comparisons. Image Metrics will have built-in data storage and graphing capability to process these data and eliminate the need for external programs for statistical analysis (though users will still have the option to export their results to those programs). Most importantly, users will have the full flexibility to examine and redo their measurements on any data point without having to locate the images and the particles manually.
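To make the idea of hierarchical storage concrete, the following Python sketch nests measurements under particles, particles under images, and images under an experiment, so that every data point keeps a path back to its source image and ROI. This is only an illustration of the concept, not the planned Image Metrics data structure. All class names, field names, file names, and values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Particle:
    particle_id: int
    roi: tuple  # (row, col, height, width) of the crop in pixels
    measurements: dict = field(default_factory=dict)


@dataclass
class ImageRecord:
    filename: str
    particles: list = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    images: list = field(default_factory=list)

    def collect(self, key):
        """Return (filename, particle_id, value) for every particle that has `key`.

        Each value stays linked to its image and particle, so a data point can be
        traced back and re-measured without searching for the image manually.
        """
        return [
            (image.filename, particle.particle_id, particle.measurements[key])
            for image in self.images
            for particle in image.particles
            if key in particle.measurements
        ]


# Hypothetical records for illustration only.
experiment = Experiment(name="example")
scan = ImageRecord(filename="scan_001.ibw")
scan.particles.append(Particle(1, (10, 20, 64, 64), {"bend_angle_deg": 42.0}))
scan.particles.append(Particle(2, (90, 15, 64, 64), {"dna_length_nm": 310.5}))
experiment.images.append(scan)

bend_angles = experiment.collect("bend_angle_deg")
print(bend_angles)
```

A flat spreadsheet row would hold the same numbers, but the nested form also keeps the ROI and image reference that are needed to revisit a measurement.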

I also plan to automate several operations, including bend angle measurement, fiber length measurement of looped structures, and ROI cropping. The bend angle may be measured by using the tangents of neighboring pixels relative to the kink center. Indeed, automatic DNA curvature measurement has been done in a previous study [169]. For looped fiber structures (e.g. if two fiber strands cross each other), it is possible to segment the fibers at each crossing point and calculate the fiber length as the sum of all fiber segments. Tracking the fiber would require user discretion, but automation using profile heights should be able to resolve most cases. In fact, the segmentation of fibers has been implemented in Asylum Research AFM software (Figure 3.22), but the process of picking the segments sequentially could benefit greatly from automation. The profile heights can also be used to separate proteins from the DNA by creating automatic ROIs (footnote 67). Therefore, the repetitive and yet complex process of cropping ROIs, measuring bend angles, and tracing looped DNA could be completely automated. This improvement could have a profound impact on the throughput of single-molecule analysis. If automatic measurement is 90% accurate, it can be combined with imaging, image processing, particle analysis, and statistical analysis to completely automate the process from raw data collection to final result. If enough data can be processed, minor inaccuracies are unlikely to affect the main conclusion.
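The two geometric measurements proposed above can be sketched in a few lines of Python. The sketch assumes that a DNA trace is already available as an ordered list of (x, y) points and that the kink position is known. The bend angle is taken as the deviation from a straight line (0 degrees for straight DNA), and the tangents are approximated by chords spanning `span` points on each side of the kink. These conventions and all function names are assumptions for illustration, not the method of [169] or of Image Metrics.

```python
from math import atan2, degrees, hypot


def path_length(points):
    """Length of a polyline given as an ordered list of (x, y) points."""
    return sum(
        hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


def fiber_length(segments):
    """Total length of a fiber that was split into segments at crossing points."""
    return sum(path_length(segment) for segment in segments)


def bend_angle(trace, kink_index, span):
    """Angle in degrees between the incoming and outgoing tangents at the kink."""
    x0, y0 = trace[kink_index]
    xa, ya = trace[kink_index - span]
    xb, yb = trace[kink_index + span]
    incoming = atan2(y0 - ya, x0 - xa)
    outgoing = atan2(yb - y0, xb - x0)
    difference = abs(degrees(outgoing - incoming)) % 360
    return min(difference, 360 - difference)


# An L-shaped trace with a right-angle kink at index 3.
trace = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
angle = bend_angle(trace, kink_index=3, span=2)

# A looped fiber split into two segments at a crossing point.
total = fiber_length([[(0, 0), (3, 0)], [(3, 0), (3, 4)]])
print(angle, total)
```

A larger `span` averages over more of the trace and is less sensitive to pixel noise, but it can underestimate sharp, localized kinks.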

67 The boundary between the protein and the DNA can often be distinguished by thresholding the DNA profile (because the protein is usually much taller than the DNA). This boundary can be used to create ROIs. Alternatively, proteins can be separated from DNA using set operations on separate protein masks and DNA masks, as is described in Section 3.7.2.


Figure 3.22 Fiber/Skeleton Segmentation

In Asylum Research software, the fiber/skeleton is segmented whenever branching occurs. (Right) Original image showing a nucleosome DNA complex. It is the same image shown in Figure A.2. (Left) The fiber/skeleton is generated through the skeleton morphological operation (Table 3.5) on the masked DNA complex (orange) and displayed in blue or green. When branching/looping is detected, the fiber/skeleton is segmented at the branches. The branches with a longer detour than the shortest path between the two ends of the DNA (see also Section 3.5.3B) are displayed in blue, whereas the shortest path is displayed in green. The branch points are indicated by circles (inset), and segments are formed between branch points. In the case of a real looping/branching event, users can manually pick the segments that travel along the correct path of the DNA, and obtain the correct DNA length (1.17 µm, black box) instead of the length of the shortest path (939.39 nm, green box).

3.7.5. Open Development Model

Throughout the history of software development, software projects have never been as free and open as they are now, and computer languages are evolving extremely rapidly while becoming easier than ever to understand. Software packages are also becoming more user oriented and increasingly complex by building on decades of previous work. The cost of managing an increasingly complicated project far exceeds the capability and knowledge of a single developer, but with challenges come opportunities. The most exciting features will come from our users. Our users are also our developers. I fully believe in open source and source-available software and collaborations, and am committed to building Image Metrics as a publication platform for applications developed by and for the users. I am working on a licensing model where Image Metrics could be free, open source, or source available to contributors, while end users may be charged based on modules for development, maintenance, and support.

Currently, users can develop and test modules for offline use in Image Metrics. I plan to integrate an online App exchange module (App Store) where users can contribute their own applications. The App Store aims to facilitate the packaging, distribution, and licensing of user apps. Individual users can choose their own licensing terms and monetization models. The long-term goal is to build a researcher-friendly development community to support and accelerate the development and distribution of scientific and research applications. Traditionally, researchers have had little incentive to continue development of in-house software outside of their scientific projects. In the absence of a mature distribution channel and financial incentives, the cost to support, document, and distribute a software package is too high for most researchers. Numerous innovations and much potential are wasted in the corners of lab archives. Code collaboration platforms (e.g. GitHub) and software distribution platforms (e.g. App Store, etc.) exist, but they are not geared toward researchers, and finding research-specific applications is difficult. Those platforms also have a steep learning curve, which demands time that most researchers cannot afford. Many technical computing environments have their own code sharing platforms. For example, MATLAB has its own sharing platform in MATLAB File Exchange (footnote 68), where MATLAB Addons/Apps/Toolboxes can be submitted by end users. OriginLab’s Origin/OriginPro also has its own file exchange platform (footnote 69). But they are closed environments (walled gardens) whose content can only be distributed within their own proprietary software. A programming-language-agnostic platform that is geared towards researchers has yet to exist. I believe that there is a market for such a platform to serve the research community, much as specialized social networks like LinkedIn and ResearchGate were created to target their professional communities.

68 https://www.mathworks.com/matlabcentral/fileexchange/

69 https://www.originlab.com/fileExchange/

3.8. License and Distribution

Image Metrics is currently in active development and in the process of open-sourcing part or all of its components. For now, I am adopting a proprietary freeware license model with plans to release the source code as source-available or open source components. The details of the license can be found inside the software by clicking ‘About’ under the ‘Help’ menu in the application launcher (Figure 3.1). The license of the open source code will be based on variants of BSD or GPL. Image Metrics also uses open source components including GUI Layout Toolbox [170] and Widgets Toolbox [171] as well as many others. Users can access a list of open source components and their license information in the same place under ‘Help’. At the time of writing, Image Metrics is solely distributed at http://im.zimengli.com.

CHAPTER 4. STRUCTURAL AND FUNCTIONAL STUDY OF DNA MISMATCH REPAIR IN THE CONTEXT OF TRINUCLEOTIDES REPEATS EXPANSION

The DNA mismatch repair protein MutSβ is involved in maintaining genetic stability by repairing insertion/deletion loop (IDL) errors that occur during DNA replication. However, studies have found that MutSβ is also implicated in genetic instability that involves the expansion of trinucleotide repeat (TNR) sequences. TNR expansion is key to the development of several neurodegenerative diseases such as Huntington’s disease. MutSβ interacts with IDLs that are predicted to form during TNR expansion, and this interaction could be key to understanding the molecular mechanisms. To investigate how MutSβ interacts with TNR IDLs and non-TNR IDLs, we use atomic force microscopy (AFM) to visualize directly the conformations of MutSβ-DNA complexes in the initial stages of repair processing. Our data suggest that when interacting with non-TNR IDLs, MutSβ adopts conformations that are consistent with a repair-signaling-capable formation; when interacting with TNR IDLs, MutSβ adopts very different conformations that may not be competent for repair signaling.

4.1. Introduction

The integrity of DNA is constantly challenged by various exogenous and endogenous sources [172]. Exogenous factors such as UV light, ionizing radiation, and chemical mutagens can induce DNA breaks and damage bases [173], as can many endogenous factors such as reactive oxygen species (ROS), DNA hydrolysis, and DNA replication errors [2, 174]. Improper maintenance of the DNA can result in mutations that lead to severe diseases, such as cancers.

Multiple cellular repair mechanisms have developed during biological evolution to maintain the integrity of DNA. Among the most prominent repair pathways are base excision repair (BER), nucleotide excision repair (NER), and mismatch repair (MMR). Base excision repair utilizes DNA glycosylases to remove DNA bases damaged by oxidation, deamination, and alkylation. One prominent example particularly relevant to this study is 8-Oxoguanine (8-oxo-G), which often results from reactive oxygen species. Nucleotide excision repair removes bulky nucleotide adducts such as those induced by UV light. Mismatch repair, the subject of this study, fixes post-replication errors in dividing cells. DNA repair is of such extraordinary importance in science that in 2015 three scientists were awarded the Nobel Prize for their studies in this field [175-177].

Mutations in key DNA mismatch repair genes, first identified in the 1980s, are found to cause many cancers, including colon cancer, endometrial cancer, and ovarian cancer, collectively called Lynch Syndrome [3]. It is important to note that although DNA mismatch repair proteins mostly play positive roles such as fixing DNA mismatches, signaling for cell apoptosis, double-strand break repair, and homologous recombination [2], sometimes their roles can lead to undesirable effects. In particular, the role of the MMR proteins MutSβ and MutLα in promoting a number of trinucleotide repeat (TNR) expansion related neurologic disorders has generated a lot of interest in the research community [4], with Huntington’s Disease (HD) being one of the most notable examples.

Huntington’s Disease (HD) is a severe neurodegenerative disease of the brain, with some patients describing it as Parkinson's, Alzheimer's, and motor neuron disease rolled into one [178]. The disease originates from a genetic defect in the Huntingtin gene, which encodes the Huntingtin protein and is crucial for brain development. People carrying an intermediate or full mutation (footnote 70) of this gene will progressively develop the disease later in life, and their offspring have a 50-50 chance of inheriting the disease as well [179].

The genetic defect in the Huntingtin gene is embodied by the expansion of CAG repeats in the coding sequence, which results in misfolded proteins that are prone to aggregation. The brain cells are killed as a result (footnote 71). Similar repeat-based disease pathogenesis, collectively called microsatellite instability (MSI) (footnote 72) or more specifically trinucleotide repeat (TNR) expansion, is also found in myotonic dystrophy and a number of spinocerebellar ataxias [180]. Although there is currently no cure, recent breakthroughs using antisense oligonucleotides (ASOs) targeting mRNA [181] have successfully reduced the level of Huntingtin proteins in a 46-person human clinical trial (footnote 73) [182], and the group performing the trial is currently evaluating long-term effects on the patients. If successful, this treatment could be the biggest breakthrough in neurodegenerative diseases in the past 50 years [178].

70 Premutation: the individual who carries this mutation does not have the disease, but their offspring will inherit a more severe form of the mutation. Intermediate mutation: the individual who carries this mutation will develop the disease in a less severe form or with a later age of onset. Full mutation: the individual who carries this mutation will fully develop the disease.

71 Protein toxicity induced diseases are also called protein gain-of-toxic-function diseases. The gain or loss of function at the protein or mRNA level is the foundation of many neurologic disorders such as Parkinson’s Disease and Alzheimer’s disease.

72 Although a primary function of MMR is to reduce MSI, it can also introduce more MSI, as is shown later.

73 The ASOs are delivered intrathecally and catalyze the degradation of Huntingtin mRNA by RNase H.

Despite the recent advancement in treatment, the molecular mechanism of trinucleotide-repeat-expansion diseases is still poorly understood. It is known that the repeats can form secondary structures such as hairpins, single-stranded loops, G-quadruplexes, and R-loops [183] that can lead to aberrant DNA processing. In addition, studies in human and mouse models have indicated that the MMR proteins MutSβ, MutLα, and MutLγ have prominent roles in promoting the repeat expansion and the disease [184-187], potentially through their interaction with these DNA secondary structures [188, 189]. It is unclear how MutSβ and MutLα interact with these DNA structures. There are conflicting evidence and postulations on whether MutSβ could be trapped at these unusual structures and whether the signaling pathway is hijacked [190-192]. While MutLα interactions can be seen in functional assays [193, 194], the molecular mechanism of how MutLα interacts with these secondary structures is largely unknown. Therefore, a more thorough understanding of these aberrant structures in relation to MMR is sorely needed.

In this study, we use state-of-the-art Atomic Force Microscopy (AFM) techniques to study the structural and functional relationship between TNR DNA substrates and MMR, specifically in the initial stages of repair processing. Our data show that when interacting with non-TNR IDLs, MutSβ adopts conformations that are consistent with a formation capable of repair signaling; when interacting with TNR IDLs, MutSβ adopts very different conformations that may not be competent for repair signaling.

4.1.1. DNA Mismatch Repair (MMR)

Errors in DNA synthesis typically occur in the daughter strands during DNA replication in a dividing cell, but they can also occur during DNA re-synthesis in somatic cells during repair events [192, 195]. The function of DNA mismatch repair is to correct DNA replication errors including base-base mismatches and short insertion/deletion loops (IDLs). By maintaining the integrity of DNA replication, MMR plays an essential role in sustaining genomic stability and cell function.

In eukaryotes, MSH2, MSH3, and MSH6 homologs hetero-dimerize to form MutS proteins that identify replication errors and initiate MMR [2]. Specifically, they are MSH2-MSH6 heterodimers (MutSα) and MSH2-MSH3 heterodimers (MutSβ). All MSH homologs have one Walker A motif in the C-terminus, allowing them to bind and hydrolyze nucleotides [66, 158]. Their ATPase activity is crucial to mismatch repair function [196, 197]. Although both MutSα and MutSβ use similar structures to bind the DNA, key structural differences enable MutSα to bind specifically to base-base mismatches and 1-2 base IDLs, and MutSβ to bind mainly to IDLs of 2-15 bases74 [66, 158].

74 MutSβ also partially binds to single-base mismatches or IDLs.


Figure 4.1 Mechanism schematics of DNA Mismatch Repair through the MutSα pathway

A. Mismatch identification by MutSα. B. MutSα activation by ATP. C. SL Complex formation and MutLα activation by PCNA. The activated MutLα can nick the DNA on both sides of the mismatch. D. EXO1 loading through the nick and activation by MutSα to cleave the nascent strand. E. DNA re-synthesis and ligation.

A schematic of the mismatch repair process via the MutSα pathway is shown in Figure 4.1. Upon binding to DNA, MutS homologs can search for the mismatch through a diffusive mechanism (Figure 4.1A) [198]. Mismatch recognition occurs through specific interactions between MutS and the mismatch, which distinguish specific binding sites from non-specific binding sites [150]. The specific and non-specific interactions also transform MutS into different conformational states, with respective DNA bending profiles ranging from bent to unbent [6, 199]. The bend angles revealed by crystal structures show a 45 degree kink in MutSα-induced DNA bending at the base-base mismatch [66], and a 90 to 120 degree kink in MutSβ-induced DNA bending at the IDLs, increasing with IDL size [158].

MutS can then undergo a conformational change that unbends the DNA (unbent state) upon ATP binding and hydrolysis (Figure 4.1B) [5, 6]. The ATP-activated MutS can slide away from the mismatch through a diffusive mechanism, mimicking a mobile clamp formation called the sliding clamp75 [81, 200]. As a result, the mismatch (also known as the specific site76) is freed to load additional MutS proteins, and increased MutS loading onto the DNA can be observed when ATP is present [201]. MutS can also form a quaternary structure (the ‘SL’ complex) with MutL (MutLα77 in eukaryotes) [200, 202], a heterodimer with an endonuclease subunit78, which is essential to cleave the nascent strand79 [203]. The formation of the SL complex is important to identify the discriminated strand (i.e. the nascent strand) that contains the mismatch; however, the mechanism of finding the discriminated strand is unclear. First, MutLα must interact with PCNA80 to activate its endonuclease activity (Figure 4.1C) [204]. The specific orientation of PCNA, which is loaded onto DNA by RFC81 on the 3’ end at the junction of single- and double-stranded DNA (such as a nicked site82), directs MutLα to cleave only the nascent strand [204, 207]. The PCNA, therefore, appears to serve as the strand discrimination signal. To reach PCNA loaded from a nearby or a distal site, it was proposed that the SL complex can either slide or polymerize along the DNA, or loop the DNA [79, 200, 208]. Both MutS and PCNA can move freely on the DNA as mobile clamps, and the PCNA-binding motif (PIP83 box) on MutSα and MutSβ will help them locate PCNA84 [209, 210]. However, it could be much more difficult for the mobile clamps to cross over an IDL than a single base-base mismatch (Figure 4.2B vs. Figure 4.2A), suggesting that the IDL-directed searching mechanism may be very different from that of base-base mismatches, and a long-range interaction (e.g. looping) may be required to connect the PCNA and the SL complex on opposing sides of the IDL [207, 209].

75 The sliding clamp state is not to be confused with the non-specific binding state during the initial search, with the sliding clamp having longer diffusion persistence length, lower direct dissociation rate and lower DNA bending.

76 Typically, locations in the DNA sequence that contain mismatches or IDLs are called specific sites, and locations that are not specific sites are called non-specific sites.

77 MLH1-PMS2 in Homo sapiens, and MLH1-PMS1 in Saccharomyces cerevisiae.

78 PMS homologs are the endonuclease subunit.

79 MutLα only cleaves one strand of the DNA (i.e. nicking).

80 PCNA - Proliferating cell nuclear antigen.

81 RFC – Replication factor C.

82 In dividing cells, the nick could be present in the lagging strand as the gap between Okazaki fragments [205], or be created in the leading strand during ribonucleotide excision repair, where RNA mis-incorporated into the DNA during replication is removed by ribonuclease H2 (RNase H2) [206]. In somatic cells, the nick can be introduced on either strand by apurinic/apyrimidinic endonuclease after a damaged base is removed by glycosylase in base excision repair [195].

Figure 4.2 Strand Discrimination Signal Searching Mechanism

PCNA could serve as the strand discrimination signal on both the leading (A) and lagging (B) strands. To locate PCNA, it is suggested that PCNA and/or MutSβ-MutLα complex (the ‘SL’ complex) could move along the DNA as mobile clamps till they collide (A). The movement could be restricted, however, when a large IDL (such as a hairpin) blocks the way between PCNA and the SL complex (B), whereas a single base-base mismatch may not present an obstruction to the movement (A). Because of this difference, the PCNA searching mechanism involving IDLs might be very different from those involving base-base mismatches. Figure adapted from [207].

83 PIP – PCNA interacting peptide

84 Unlike MutSα, MutSβ will dislocate MutLα when interacting with PCNA as MutLα and PCNA share the same binding site on MutSβ [209].

Once MutLα cleaves the 5’ side of the mismatch through its interaction with PCNA, EXO185 can then be loaded onto the newly nicked site and be activated by MutS to excise the strand containing the mismatch in a 5’ to 3’ direction, thereby removing the mismatch (Figure 4.1D). DNA re-synthesis is carried out by DNA polymerase δ, PCNA, and RPA86, and finally the remaining nick is sealed by DNA ligase to complete the repair process (Figure 4.1E).

Defects in DNA mismatch repair genes (MSH2, MSH6, MLH1, and PMS2) are associated with a series of cancers, collectively called Lynch Syndrome [3]. MMR mutations also decrease cell apoptosis, increase cell survival, and increase cellular resistance to many chemotherapeutic agents [2]. Furthermore, inaccurate, incorrect, or escaped repair may also jeopardize the integrity of the genome [193]. Specifically, the profound role of the MutSβ-dependent pathway involving MutLα and MutLγ87 in promoting the expansion of trinucleotide repeat (TNR) sequences points to aberrant processing by mismatch repair proteins in certain cellular processes [4]. This finding has sparked strong interest in the research community, since discovering the mechanism behind this aberrant processing would have a profound impact on understanding the development of TNR-related neurodegenerative diseases such as Huntington’s Disease and myotonic dystrophy [207, 213].

85 EXO1 - Exonuclease 1

86 RPA – Replication Protein A

87 MutLγ is a heterodimer of MLH1-MLH3. It is a MutSβ-stimulated endonuclease that is involved in meiotic recombination [211] and MutSβ-dependent mismatch repair [212].

4.1.2. Trinucleotide Repeat Expansion (TNR)

Trinucleotide repeat (TNR) expansion, or triplet repeat expansion, is a genetic instability that involves the expansion of three-nucleotide microsatellite repeat sequences. It is a genetic defect at the heart of many neurodegenerative diseases including Huntington’s Disease (HD), Friedreich’s ataxia (FRDA), fragile X tremor and ataxia syndrome (FXTAS), myotonic dystrophy (DM), and variations of spinocerebellar ataxia (SCA) [214].

The disease pathogenesis centers around the expansion of several triplet repeat sequences, notably (CAG)n for HD and SCA, (CTG)n for DM88, (GAA)n for FRDA, and (CGG)n for FXTAS. Depending on the location of the repeats and the function of the protein, the mutation can either lead to protein or RNA gain-of-function, or protein loss-of-function89 [180, 214]. In the case of HD, the disease could result from a dominant-negative mutant huntingtin that leads to a potential protein loss-of-function, or from a toxic protein gain-of-function effect that allows mutant huntingtin to interact with new binding partners in neurons. Although the function of huntingtin is poorly understood, a protein gain-of-function mechanism is considered more likely, given evidence that inclusion bodies of mutant huntingtin found in disease-affected neurons could be the main driver of toxicity [215]. The expanded polyglutamine tract, coded by the (CAG)n in the huntingtin gene, increases the propensity for the protein to aggregate, and is thought to fuel the formation of these inclusions.

88 (CTG)n for myotonic dystrophy type 1 (DM1) and (CCTG)n for myotonic dystrophy type 2 (DM2)

89 Protein gain-of-function: (Location – exon) The protein acquires toxicity through a mutation that affects its structure and function. RNA gain-of-function: (Location – intron) Like the DNA as shown later, TNR transcribed RNA also has a high propensity to form stable secondary structures, which may sequester proteins in inclusion bodies, prevent their function and alter gene expression. Protein loss-of-function: (Location – intron) TNR transcribed RNA, when sharing sequence similarities, can also function to degrade target RNA and silence selective genes, thereby decreasing expression of target proteins.

Usually, the repeat length dictates the stage of the disease, with severity increasing with the repeat length. It also dictates the propensity for the repeat to expand, and therefore the age of onset. A longer repeat will have an earlier age of onset, a more severe degeneration, a higher risk of transmission to offspring, and a higher risk for genetic anticipation (an increased number of repeats inherited by next generations). Because of this unique correlation, the repeat length is often classified into three categories by a certain threshold that is specific to the individual disease - normal (well below the threshold, will not expand), premutation (around the threshold, may expand) and full mutation (above the threshold, will expand) [216]. In addition, the repeat expansion is often tissue specific and time sensitive, which may be accounted for by the differential expansion rate in different tissues and at different stages of the lifetime of a tissue. For example, in HD patients, the striatum is expansion-prone, while the cerebellum and blood cells are deletion-biased or stable, presenting high somatic mosaicism. While HD expansion goes on throughout a person’s adult life, the expansion in DM1 patients, which is found predominantly in muscle cells, occurs only prior to terminal differentiation, therefore presenting a more limited window of expansion [213, 217].

4.1.3. Molecular Mechanisms of TNR

Although the link between repeat expansion and disease is well studied, the molecular mechanisms of expansion remain largely unclear. Various models have been proposed, with the consensus centering on the formation of novel slipped DNA structures (also known as slip-outs or extrahelical elements) during DNA strand separation events, but the models differ on the factors leading to such events. Figure 4.3 illustrates a generic scheme for repeat expansion, with an overview of the various factors in cis and in trans90 that have been explored. Interested readers can learn more in reviews [4, 180, 218].

90 Cis-factor: factors related to the same DNA such as DNA sequence and structure; Trans-factor: factors not related to the same DNA, such as interactions with proteins.


Figure 4.3 Mechanism of Repeat Expansion

A. Strand Separation. DNA strand separation events are precursors to DNA slip-outs. DNA strand separation could occur in the nascent strand (red) or in the template strand (green). In DNA replication, this event could occur during leading and lagging strand synthesis. In DNA repair, this event could occur during strand displacement re-synthesis at nicks generated by DNA repair. Blue sphere – DNA polymerases. B. Slip-out Formation. Once DNA slippage occurs, the dangling strand can fold into alternative structures based on non-Watson-Crick base pairing. Left. Hairpin/loop structure formed by (CNG)n repeats. Middle. A quadruplex structure formed by (CGG)n repeat. Right. A triplex structure formed by (GAA)n repeat. Star - Hoogsteen pair; Grey rectangle – G-quadruplex; Yellow rectangle – an i-motif. Adapted from [214]. C. Stabilization. Once formed, the slip-outs can further be stabilized when binding to various trans factors (e.g. MutSβ, circle) that may hamper their repair, or stabilized by flanking sequences (pink) and interruptions (grey). D. Expansion. The slip-outs, if not correctly removed, result in expansion (if the slip-out occurs on the nascent strand, red), or contraction (if the slip-out occurs on the template strand, green) during the next cycle of DNA replication. Bright red – newly synthesized strand with expansion or contraction while using top strand as template.

Strand Separation

The initiation stage (Figure 4.3A) leading to the formation of DNA slip-outs usually occurs during DNA strand separation, which allows the DNA to explore alternative conformations. DNA strand separation typically takes place in the form of nicks, gaps, overhangs, junctions, and branches, which can occur in various DNA metabolism events such as replication, recombination, repair, and transcription [188, 219, 220]. It could also take place during DNA breathing (bubbling) events, in which the DNA spontaneously disrupts its local base pairing through thermal fluctuations [219]. For example, during DNA replication (Figure 4.3A Left), DNA strands are separated. The lagging strand, partially unpaired due to its replication mechanism91, is free to form alternative structures [221]. The leading strand, even though fully paired, could also experience strand separation in the form of DNA slippage (single-stranded looping) at the replication fork when the replication fork is stalled [214, 219, 222]. In DNA repair (Figure 4.3A Right), DNA nicks could serve as entry points for strand separation during re-synthesis, in which an overhanging single strand can be created as it is displaced by DNA synthesis. These nicks can be generated during base excision repair, nucleotide excision repair, and mismatch repair, where damaged bases or mismatches are recognized and cleaved by OGG1/APE1, XPF/ERCC1/XPG, and MutS/MutL, respectively [2, 195, 218]. Similar strand separations can also be created by nucleases and strand invasions during double-strand break repair and DNA recombination [188, 223], or in transcription [223].

91 The lagging strand synthesis is carried out in a ‘back-stitching’ mechanism because DNA polymerase can only extend DNA in the 5’ to 3’ direction, which is opposite to the direction of the replication fork. Primers have to be periodically inserted on the lagging strand, forming Okazaki fragments, so that the synthesis is completed in segments. Because the synthesis is not continuous, part of the nascent strand on the lagging strand could be missing, leaving the lagging strand partially unpaired [172].

Slip-out Formation

The conversion from separated strand to stable slip-out structure (Figure 4.3B) is more likely for triplet repeat sequences because of enhanced stability resulting from partial Watson-Crick pairing when the DNA single strand is palindromically paired, as is the case of hairpin/loop formations on (CNG)n repeat (Figure 4.3B Left)92, or contains Hoogsteen pairing to form triplex structure such as the sticky-DNA (for (GAA)n repeat, Figure 4.3B Middle), or G-quadruplex structure (for (CGG)n repeat, Figure 4.3B Right) [214]. For example, in the case of (CNG)n, although the free energy still favors the complementary Watson-Crick base pairs over the slip-out construct, the difference is small enough so that the DNA can fold into the slip-out structure with high propensity93 [183, 225]. Slip-out structures generated by TNR repeats have been observed in the past in electrophoretic mobility assays [226], in EM [227, 228], and in AFM [229, 230].
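The partial Watson-Crick pairing described above can be illustrated with a minimal Python sketch. It assumes an idealized fold-back in which the single strand pairs with itself in antiparallel register with no unpaired loop at the tip; a real hairpin has a loop and its stability depends on thermodynamics that are not modeled here. The function names `fold_back_pairs` and `pairing_summary` are hypothetical and are not part of any software described in this work.

```python
# Idealized fold-back model (assumption): base i pairs with base len-1-i,
# with no loop at the hairpin tip. Only canonical Watson-Crick pairs are counted.
WATSON_CRICK = {("C", "G"), ("G", "C"), ("A", "T"), ("T", "A")}


def fold_back_pairs(strand):
    """Return the list of (5' base, 3' base) pairs of a strand folded onto itself."""
    half = len(strand) // 2
    return [(strand[i], strand[-1 - i]) for i in range(half)]


def pairing_summary(repeat_unit, n_repeats):
    """Count Watson-Crick and mismatched pairs for a folded (repeat_unit)n strand."""
    strand = repeat_unit * n_repeats
    pairs = fold_back_pairs(strand)
    watson_crick = sum(1 for pair in pairs if pair in WATSON_CRICK)
    return {
        "pairs": len(pairs),
        "watson_crick": watson_crick,
        "mismatched": len(pairs) - watson_crick,
    }


if __name__ == "__main__":
    for unit in ("CTG", "CAG", "GAA"):
        print(unit, pairing_summary(unit, 4))
```

In this register, (CTG)4 and (CAG)4 each give four C-G pairs and two N·N mismatches out of six pairs, whereas (GAA)4 gives no Watson-Crick pairs, which is consistent with (CNG)n repeats favoring hairpins or loops while (GAA)n relies on other pairing modes.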

Stabilization

The slip-out structure can easily relax into the more favorable homoduplex states, but it can be strengthened or weakened by other cis factors including the size of the slip-out, its flanking sequence, and interruptions in the repeats (Figure 4.3C Right). The size of the slip-out, which is inherently linked to the length of the repeat, strengthens the stability of the slip-out as it grows [225]. Longer repeats could increase both the number and the size of the slip-outs. Longer repeats do not automatically lead to longer slip-outs, however; repeat asymmetries caused by new insertions on one strand could greatly promote the formation of longer slip-outs [228].

92 In an antiparallel duplex formed by a (CNG)n single strand, the cytosine and guanine are Watson-Crick base paired, and the unpaired nucleotide N could still form a weak H bond through stacking interactions [183, 224].

93 Based on the strength of bonding between the mis-pairs, the (CNG)n slip-out can either form a hairpin (as in the case of (CTG)n) or a loop (as in the case of (CAG)n).

The flanking sequence also influences the stability of the slip-out since it forms the base of the slip-out structure [225]. CpG islands are found to flank the repeat region in almost all disease-related genes and are regulated epigenetically through DNA methylation [180]. Abolishing the sequence context could greatly hamper the expansion of the repeat by destabilizing formation of the slip-outs or changing the context for gene expression [180, 213]. Frequent interruptions within the repeat tract could stabilize the repeat by compartmentalizing the repeat in smaller segments, thereby reducing the strength of the slip-out and the stability of its base pairing [213]. They could also induce mismatch repair to remove the weakened slip-outs through mismatches introduced by these interruptions [224]. The slip-out, depending on its size, can still be removed efficiently through several repair pathways94; however, the cellular repair mechanism could be overwhelmed if the rate of slip-out formation exceeds its repair capacity [189], or compromised if the slip-out is further stabilized or promoted by proteins that could counter its repair [218] (Figure 4.3C Left). Among the stabilizing proteins are mismatch repair proteins, which are the focus of this study and are discussed in detail in the next section. The particular structure of the slip-out can also be a barrier to its removal by blocking repair proteins from passing through and/or by resisting being processed by those proteins95 [188, 214, 218].

Expansion or Contraction

In dividing cells, failure to remove the slip-out will ultimately lead to its incorporation into the DNA (during the next round of replication), leading to expansion or contraction depending on whether the slip-out occurs on the nascent strand (Figure 4.3D Left) or on the template strand (Figure 4.3D Right) [218]. Although slip-outs can be formed on both the leading and lagging strand, they are made of complementary sequences and will have differing stabilities96. Shifting the replication origin could therefore switch between expansion and deletion biases by flipping the direction of replication [180]. In non-dividing cells, expansion or contraction could also occur during the processing of the slip-outs depending on whether the slip-out is formed on new DNA (such as during DNA repair) or on existing DNA (such as during DNA breathing events).

94 Long hairpins are removed by an unknown hairpin removal repair [193]; short hairpins are removed by mismatch repair [231].

95 For example, FEN1 cannot excise the 5’-flap if the overhang is folded into strong secondary structures. The larger structures are also resistant to being processed by mismatch repair.

Many aspects of the disease can be explained by the factors discussed in this section. For example, the threshold length, which is closely related to degeneration rate and age of onset, could be explained by the increased propensity and stability of slip-out formations in longer repeats [189, 213, 227, 228]. Differences in the threshold length of different diseases might be explained by the sequence composition and context that affect the stability of the slip-out, or by the location of the repeat97 [232]. Paternal transmission (genetic anticipation from paternal germ lines) in Huntington’s Disease would be explained by the proliferative cellular divisions in male gametes prior to meiosis98 [234-236], supporting the role of DNA replication. Degeneration (somatic instability) would support the role of DNA repair, as the repeat keeps expanding throughout a person’s adult life [218]. Somatic mosaicism could be explained by the varying levels of repair genes expressed in different tissues in response to different levels of damage [4, 237].

96 The lagging strand synthesis contributes to contraction instability because the slip-out is formed on the template strand. The leading strand contributes to expansion instability because DNA slippage occurs on the nascent strand [214]. Given the same sequence, the slip-out formed on the lagging strand is likely weaker due to the lack of pairing from the nascent strand (Figure 4.3A Left, green vs. red).

97 Repeats located in the non-coding area have a higher repeat threshold and intergenerational change than repeats located in the coding area. This result could be due to the replication origin, DNA methylation, or transcription.

98 Expansion occurs pre-meiosis in DM1 mice [233], post-meiosis in HD mice [234], and both prior to and post-meiosis in humans [235].

4.1.4. Relations between TNR and MMR

DNA mismatch repair is known to curb microsatellite instability and remove short insertion deletion loop errors (IDLs) [2]. Consequently, it came as a surprise that normal alleles of mismatch repair genes are found to be involved in triplet repeat expansion, a form of microsatellite instability [233, 238]. Specifically, in humans, MSH2 is required for 100% of expansions in both the germ line and somatic cells; MSH3 also promotes expansion, but in an MSH3-expression dependent manner99, whereas MSH6 has little impact on repeat expansion [239]. Both MLH1 and PMS2 promote expansions, though PMS2 has a lesser impact and also protects against contractions100 [4, 185, 187]. Deficiency in MLH3 also ablates expansion [187]. These results suggest important roles of MutSβ, MutLα, and MutLγ in repeat expansion, but not MutSα. The lesser impact of MSH3 in MutSβ (MSH2-MSH3) on repeat expansion suggests that other mismatch repair proteins such as MSH2 homodimers may also play important roles [239]. A previous study showed that MutSβ’s binding to secondary structures containing TNR sequence is very different from binding to other structures101 [190], and the study postulated that MutSβ could be trapped on these unusual DNA structures and inhibit repair as a result. However, this assumption is not consistent with a later study showing that MutSβ can still be released from these structures upon ATP activation [191], suggesting a different mechanism might be in play [192]. The interplay between MutSβ and MutLα, whether they act dependently or independently on repeat expansion, is less clear, and it has triggered a debate on whether the mismatch repair mechanism is being hijacked to promote repeat instability [232, 240]. While MutLα’s primary role is in mismatch repair, MutSβ’s ability to bind to the IDLs, including the slip-outs, that are predicted in various cellular processes extends its role to many expansion models outside of mismatch repair. These processes may include base excision repair, nucleotide excision repair, and non-homologous tail removal (NHTR) during genetic recombination [188, 218]. MutSβ’s role is unlikely to be limited to binding because its ATPase activity is found to be essential for repeat expansion, suggesting ATP exchange and/or hydrolysis are still important [4, 240]. This observation suggests that MutLα and MutLγ are likely involved in the pathway following MutSβ activation [194, 211, 212], and the key to expansion lies in the aberrant processing of the slip-outs instead of in the recognition stage [4].

99 MSH3 leads to full expansion when it is highly expressed, such as in neurons; it has limited to no impact on cells not expressing the gene, such as germ line cells and some types of somatic cells. This result could explain the tissue specificity observed in these cells.

100 PMS2 is only responsible for 50% of expansions in mice studies. MLH1 forms heterodimers with MLH3 (MutLγ) and PMS2 (MutLα). MLH3 also has a bigger impact than PMS2 in promoting expansion. Therefore, it is not unexpected that MLH1 has a bigger impact than PMS2.

101 Error-prone structures – structures formed by unstable repeats that cannot be effectively processed by DNA repair. Error-free structures – structures that can be effectively processed by DNA repair.

Previous biochemistry studies suggest that the structures of DNA are at the core of understanding MutSβ’s dual functions: promoting TNR slip-outs while repairing non-TNR slip-outs [190, 194]. The molecular mechanism of MutSβ-DNA interactions, however, remains largely unsolved. To begin to understand the molecular mechanism of this aberrant processing of slip-outs predicted in TNR instability, we compare the conformations of several slip-outs of different sizes by directly visualizing their interactions with the mismatch repair proteins MutSβ and/or MutLα using Atomic Force Microscopy (AFM). Compared to other techniques, AFM offers a unique advantage in determining conformations of heterogeneous samples, and it allows us to probe the modes of processing (normal vs. aberrant) by establishing a direct link between the conformations and functions. Our study reveals various conformations that may be important to signal for the different repair pathways.

4.2. Materials and Methods

4.2.1. Proteins and DNAs

MutSβ and DNA substrates are gifts from the Modrich lab and were prepared as described in previous work [194, 204, 209]. MutLα is obtained from the Hsieh lab. To produce substrates for AFM studies, relaxed closed circular DNAs are cut twice to produce two strands (one homoduplex strand and one heteroduplex strand, see Table 4.1 column 3 and 4), or once to produce a single heteroduplex linear DNA (Table 4.1 column 2). Details of the cut site(s) can be found in Figure E.1. Unless specified, all the analysis and results are based on the double-cut substrates. DNA substrates with 0, 1, or 5 CTG repeat slip-out insertions and one hybrid slip-out insertion, (CTG)56/(CAG)54, are used [194]102. Notably, the insertion site of the repeat sequence on the DNA is also flanked by sequence context from the disease myotonic dystrophy DM1, which makes the DNA substrates more relevant in a disease context. The length of individual cut strands and percent position of the slip-out for each DNA substrate are listed in Table 4.1. Position of the slip-out is measured from the nearest end of the DNA to the slip-out. Percent position represents the position of the slip-out as a percentage of the DNA length103.

102 Other DNA substrates are also analyzed, such as (CAG)5. Results for those substrates are included in the main text where appropriate.

103 Since the two ends of the DNA cannot be distinguished, the position and percent positions are measured from the nearest end, and therefore will overlay the non-specific binding events from symmetric positions in measurement. The non-specific binding additions can be accounted for mathematically and do not affect our specificity calculation [150].
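The nearest-end convention for percent position can be sketched in a few lines of Python. This is a minimal illustration of the definition given above, not the Image Metrics implementation; the function name `percent_position` and its arguments are hypothetical.

```python
def percent_position(distance_from_one_end_bp, dna_length_bp):
    """Position of a feature as a percentage of DNA length, measured from the nearest end.

    Because the two DNA ends are indistinguishable in an AFM image, a feature at
    distance d from one end and a feature at distance (length - d) map to the
    same value, so the result always lies between 0 and 50 percent.
    """
    if dna_length_bp <= 0:
        raise ValueError("DNA length must be positive")
    if not 0 <= distance_from_one_end_bp <= dna_length_bp:
        raise ValueError("position must lie on the DNA")
    nearest = min(distance_from_one_end_bp, dna_length_bp - distance_from_one_end_bp)
    return 100.0 * nearest / dna_length_bp


if __name__ == "__main__":
    # Two symmetric positions on a hypothetical 1000 bp fragment fold onto the same value.
    print(percent_position(250, 1000))
    print(percent_position(750, 1000))
```

The folding of symmetric positions onto one value is the reason non-specific binding events from both halves of the DNA overlay in the position distributions.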

DNA substrates            Single cut            Double cut            Double cut
(CTG)0                    3074bp Homoduplex     2033bp Homoduplex     1041bp Homoduplex
(CTG)1                    3074bp 18%            2033bp Homoduplex     1041bp 39%
(CTG)5                    3074bp 18%            2033bp Homoduplex     1041bp 39%
(CTG)56/(CAG)54 hybrid    3236bp 17%~22%        2033bp Homoduplex     1203bp 33%~47%

Table 4.1 DNA substrates length and position of slip-out

On the left column, relaxed closed circular DNA substrates and their schematics are listed. The repeat insertion is labeled in red. On the right columns, the lengths of the linearized DNAs are listed. For heteroduplexes, the percent positions of the slip-out are listed in red. Homoduplex DNAs are also text-labeled to distinguish them from the heteroduplex DNAs.

4.2.2. AFM Sample Preparation

Reaction conditions are listed in Table 4.2 and are selected to optimize imaging in AFM. DNA is incubated with MutSβ (or MutSβ+MutLα) in buffer for a set period on ice, with ADP or ATP. MutSβ+MutLα+DNA reactions are performed at higher concentrations to optimize SL complex formation, and the reactions are crosslinked before diluting 1:10 for optimal AFM imaging. The DNA sample used in reactions may contain heterogeneous species as control groups. For example, the double-cut species in Table 4.1 are not further separated because they can be distinguished by size. An incomplete single cut may be performed to preserve some populations of circular DNAs.
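The dilution arithmetic behind these conditions can be checked with a short Python sketch. It assumes that "1:10" means a tenfold dilution (one part reaction in ten parts total) and uses the concentrations from the MutSβ+MutLα+DNA row of Table 4.2; the helper names are hypothetical.

```python
def diluted_concentration(initial_concentration, dilution_factor):
    """Concentration after diluting by dilution_factor (10 means tenfold)."""
    if dilution_factor <= 0:
        raise ValueError("dilution factor must be positive")
    return initial_concentration / dilution_factor


def stock_volume_needed(stock_fold, final_fold, total_volume):
    """Volume of a concentrated stock (e.g. 4x buffer) giving final_fold in total_volume."""
    if stock_fold <= 0:
        raise ValueError("stock concentration must be positive")
    return final_fold * total_volume / stock_fold


if __name__ == "__main__":
    # SL reaction after crosslinking, diluted tenfold before deposition (assumed reading of 1:10).
    print(diluted_concentration(125.0, 10), "nM protein")
    print(diluted_concentration(10.0, 10), "ug/ml DNA")
    # Volume of a 4x buffer stock needed for 1x buffer in a 4 uL reaction.
    print(stock_volume_needed(4.0, 1.0, 4.0), "uL of 4x buffer")
```

Under this reading, the diluted SL reaction carries 12.5 nM of each protein and 1 µg/ml DNA, which is close to the concentrations used directly in the MutSβ+DNA reaction.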

Reactions          MutSβ    MutLα    DNA        ATP/ADP    Reaction Time    Crosslink                      Reaction Size
MutSβ+DNA          10nM     N/A      1µg/ml     1mM        1, 5, 10 min     N/A                            20uL
MutSβ+MutLα+DNA    125nM    125nM    10µg/ml    1mM        1, 5, 10 min     0.85% Glutaraldehyde, 1min     4uL

1x Buffer (both reactions): 25mM HEPES, 10mM Magnesium Acetate, 100mM Sodium Acetate, 5% Glycerol, 1mM DTT, pH 7.5

Table 4.2 AFM sample preparation recipe for MutSβ-MutLα-DNA reactions

Buffer is filtered through a 0.2um syringe filter for purity. To ensure the final buffer concentration, concentrated buffer (e.g. 4x concentration of the same buffer in the table) may be required for the 4uL SL reaction.

4.2.3. Deposition, Imaging, and Analysis

The reaction mix is deposited on freshly peeled, chemically treated mica (APS [241] for the MutSβ-DNA reaction and Ethanolamine [242] for the MutSβ-MutLα-DNA reaction), blotted dry, rinsed with deionized water (Sigma-Aldrich), and dried with nitrogen before transferring to the AFM for imaging. A summary of the experiment design is shown in Figure 4.4. Images are collected on several systems (MFP3D (Asylum Research), NanoScope III and IIIa (Digital Instruments)), at a scan speed of 10µm/s and scan resolution of 3.9nm/pixel. See our methods review paper [242] for the detailed AFM deposition protocol. Image analysis is performed using Image Metrics (CHAPTER 3). A typical AFM image of a MutSβ-DNA complex and the structural properties we analyze are shown in Figure 4.5. A detailed description of the analysis using Image Metrics can be found in Section 3.5.3. Notably, the protein-DNA complexes are filtered for some types of analysis (Figure 4.6). A schematic of how the position analysis is performed is shown in Figure 4.7. It should be noted that our analysis measures the DNA bend angles that project externally from the protein-DNA complex (Figure 4.5), which are indicative of, but do not actually capture, the three-dimensional bend angles inside the complex. The DREEM technique (CHAPTER 2) we developed is able to visualize internal bend angles directly. Sample DREEM images of MutSβ-DNA complexes have been captured (Appendix Figure C.4), showing different external and internal bend angles. We will discuss more of the analysis in the results.

Figure 4.4 Experiment Design

The DNA substrates used are shown on the left pane. Different protein and nucleotide combinations (middle pane) are used to examine different stages (blue boxes) of the repair process. The reaction is timed for a set length of time before the reaction mixture is put down on a mica disc for AFM imaging.

Figure 4.5 A Typical AFM Image of a Protein-DNA Complex and Its Analysis

Shown in the image is one DNA with two MutSβ complexes. Measured quantities are labeled and color coded on the image and explained in the side boxes. A. Stoichiometry. Stoichiometric relationships between protein complex and DNA can be counted directly from distinctive individual complexes on the DNA. Protein stoichiometry within a complex can be measured by volume analysis [84]. B. Position. By measuring DNA profiles (sections along the DNA, see example in the inset), positions of protein complexes can be identified (arrows) to calculate the specificity of protein binding. C. DNA Bending. DNA bending is assessed through the deflection of the outgoing path from the incoming path of DNA through a protein complex. It is an effective way to assess the internal conformations of the protein-DNA complex. D. Protein-Protein Coordination. A DNA-bound protein may interact (‘coordinate’) with another protein through short range (‘association’) or long range (‘looping’) interactions that may be significant in repair signaling. This interaction may be visualized as a multi-protein complex or as individual complexes that share borders (‘neighboring’ complexes).
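
As a minimal illustration of the bend-angle definition in panel C, the deflection of the outgoing DNA path from the incoming path can be computed from three points traced along the DNA. The sketch below is written in Python rather than the MATLAB used by Image Metrics; the function name bend_angle_deg and the three-point input are assumptions made for illustration and do not represent the Image Metrics implementation.

```python
import math


def bend_angle_deg(p_in, p_center, p_out):
    """Deflection (degrees) of the outgoing DNA path from the incoming path.

    p_in, p_center, p_out are hypothetical (x, y) points: one on the incoming
    DNA arm, one at the protein complex, and one on the outgoing DNA arm.
    0 means the DNA passes straight through; larger values mean a sharper bend.
    """
    # Direction of travel into the complex and out of it.
    v_in = (p_center[0] - p_in[0], p_center[1] - p_in[1])
    v_out = (p_out[0] - p_center[0], p_out[1] - p_center[1])
    dot = v_in[0] * v_out[0] + v_in[1] * v_out[1]
    cross = v_in[0] * v_out[1] - v_in[1] * v_out[0]
    # atan2 of |cross| and dot gives the unsigned angle between the two directions.
    return math.degrees(math.atan2(abs(cross), dot))


straight = bend_angle_deg((0, 0), (1, 0), (2, 0))
right_angle = bend_angle_deg((0, 0), (1, 0), (1, 1))
```

With this convention an unbent DNA gives 0° and a right-angle turn gives 90°, matching the way bend angles are quoted in this chapter.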

Figure 4.6 Selecting Particles for Stoichiometry and Position Analysis

(Left) Typically, our AFM image contains a mixture of proteins (red dots) and DNAs (blue squiggles). (Middle) To analyze the stoichiometry of protein complexes on a DNA (with at least one protein complex bound), free proteins and free DNAs are filtered (displayed in opaque colors in the figure). (Right) To measure the position distribution of a protein complex on DNA, the DNAs that contain more than one protein are filtered (i.e. only DNAs that are bound with a single-protein complex are measured).
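
The two selection rules in this figure can be sketched in a few lines. The example below is a Python illustration only (Image Metrics itself is built in MATLAB); the record layout with an n_complexes field and all function names are hypothetical. The cutoff of three complexes per DNA follows the convention stated for Figure 4.9.

```python
from collections import Counter


def select_for_stoichiometry(dnas):
    """Keep protein-bound DNAs (at least one complex); free DNAs are dropped."""
    return [d for d in dnas if d["n_complexes"] >= 1]


def select_for_position(dnas):
    """Keep only DNAs bound by exactly one protein complex."""
    return [d for d in dnas if d["n_complexes"] == 1]


def stoichiometry_fractions(dnas, max_count=3):
    """Fraction of protein-bound DNAs carrying 1, 2, ... max_count complexes."""
    bound = select_for_stoichiometry(dnas)
    counts = Counter(d["n_complexes"] for d in bound if d["n_complexes"] <= max_count)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: counts[k] / total for k in sorted(counts)}


# Hypothetical image content: number of complexes found on each DNA.
dnas = [{"n_complexes": n} for n in (0, 1, 1, 2, 3, 5)]
fractions = stoichiometry_fractions(dnas)
single_bound = select_for_position(dnas)
```

Free proteins do not appear in this sketch because they are not attached to any DNA record.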

Figure 4.7 Position Analysis

(A) Once the DNA height profile is plotted (top, adapted from Figure 4.5), the locations of the peaks can be mapped. In the middle, the schematic of the DNA substrate is shown with the slip-out location (marked in red) matching the location of one of the peaks. A distribution of the peak locations can be plotted as a histogram after analyzing all the particles (bottom). The location of the slip-out will appear as a peak in the distribution. The dashed line marks the location of the slip-out across the height profile, the DNA schematic, and the position distribution. For our analysis, we look at the position distribution on DNAs with a single protein bound. (B) For unblocked DNA, the protein can land on symmetric points (e.g. a distance of d) from either end of a DNA of length 2L, but only one location can be specific (purple); other locations are non-specific (pink). Since we cannot distinguish the two DNA ends on unblocked DNA, only the short-arm length (the distance to the nearest end) of the protein’s location is measured, which ranges from 0 to L (half-length of the DNA). (C) The position distribution is a distribution plot of the short-arm lengths. Because measuring the short-arm length folds symmetric sites together, the specific binding events (purple) will be overlaid on the non-specific binding events (pink) in the plot (adapted from [150]). The areas that are masked by their respective colors, Asp and Ansp, represent the populations of proteins that bind specifically and non-specifically to the DNA. The location of the peak frequency, Pmax, represents the location of the slip-out (the binding site with the highest binding affinity). The baseline frequency, Pmin, represents the average frequency of non-specific binding. The specificity of the specific site (i.e. the slip-out) can be calculated as $S = N \times \frac{A_{sp}}{A_{nsp}} + 1 = N \times \left(\frac{P_{avg}}{P_{min}} - 1\right) + 1$, where N is the number of binding sites and Pavg is the average occurrence probability of the whole distribution [150]. Accordingly, the larger the fraction of the specific binding population over the non-specific binding population ($\frac{A_{sp}}{A_{nsp}}$), the higher the specificity of the specific site (the slip-out).
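
The short-arm measurement and the specificity calculation described in this caption can be illustrated numerically. The Python sketch below is not the Image Metrics code; the function names, the five-bin example distribution, and the value N = 1000 are hypothetical, and the baseline Pmin is supplied by the caller because the way it is estimated from the data is not specified here. The two specificity functions correspond to the area form and the baseline form of the equation and should agree when the non-specific area is taken as the area under the baseline.

```python
def short_arm_fraction(position_bp, dna_length_bp):
    """Fractional distance from a binding position to the nearest DNA end (0 to 0.5)."""
    fraction = position_bp / dna_length_bp
    return min(fraction, 1.0 - fraction)


def specificity_from_baseline(probabilities, p_min, n_sites):
    """S = N * (P_avg / P_min - 1) + 1."""
    p_avg = sum(probabilities) / len(probabilities)
    return n_sites * (p_avg / p_min - 1.0) + 1.0


def specificity_from_areas(probabilities, p_min, n_sites):
    """S = N * A_sp / A_nsp + 1, with the areas split at the baseline P_min."""
    a_nsp = p_min * len(probabilities)   # area under the baseline (non-specific)
    a_sp = sum(probabilities) - a_nsp    # area above the baseline (specific)
    return n_sites * a_sp / a_nsp + 1.0


# Hypothetical five-bin short-arm distribution with a peak in the third bin.
example = [0.15, 0.15, 0.40, 0.15, 0.15]
s_baseline = specificity_from_baseline(example, p_min=0.15, n_sites=1000)
s_areas = specificity_from_areas(example, p_min=0.15, n_sites=1000)
```

For a flat distribution Pavg equals Pmin, so both forms return S = 1, i.e. no preference for any site. Positions at 39% and 61% of the DNA length both map to a short-arm fraction of 0.39.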

4.3. Results

The DNA substrates used in our experiment (Table 4.1) are chosen based on the following considerations. As discussed in Section 4.1.3, the length of a repeat is strongly correlated with its instability: the longer the repeat (beyond a certain threshold), the more likely the repeat is to expand. Secondary structures (slip-outs) predicted in repeat instability models are more likely to form and are more stable when the repeat is longer. The slip-out size and structure are also affected by the repeat length and construct [228], and could affect how they interact with proteins. The DNA substrates chosen, (CTG)1, (CTG)5, and (CTG)56/(CAG)54, are predicted to form different structures, and were previously shown to activate different MutLα incision activities that are MutSβ, PCNA, and RFC dependent [194]. That study shows that a slip-out size of two or three repeats is sufficient for PCNA to load effectively onto DNA to activate MutLα’s incision activity, and suggests that the slip-out size correlates positively with PCNA loading. However, for larger slip-outs (n > 3), the activity decreases. We postulate that while these larger slip-outs continue to facilitate PCNA loading, they could inhibit the functions of mismatch repair proteins. The structures of the MutSβ-DNA and MutSβ-MutLα-DNA complexes on these DNA substrates, which have not been shown, could be key to understanding the decrease in incision activities on larger slip-outs. Therefore, resolving the conformations of the MutSβ-DNA complex and the MutSβ-MutLα-DNA complex on these substrates is an important step towards unveiling the molecular interplay between MMR and TNR. Specifically, the (CTG)1 substrate is chosen as a positive control for functional MMR as demonstrated in past studies [189, 231]. Conversely, the (CTG)0 substrate is chosen as a negative control for non-specific interactions. The (CTG)5 substrate is selected as a larger slip-out that is predicted to exhibit aberrant processing based on previous studies on similarly sized slip-outs [190, 194]. Since repeat instability occurs as n exceeds a certain threshold of about 30 repeats [218], the (CTG)56/(CAG)54 substrate is selected as a model heteroduplex with slip-out(s) embedded in a long repeat tract, with important implications for modeling how repeats expand in vivo [243].

All the analyses are carried out across 3 or more replicate experiments, with results carefully examined across the replicates. As in any experiment, inconsistencies do arise. The results are filtered and aggregated based on the following factors: image quality, quality of sample deposition, and consistency of other experiment variables. Because the analysis is incomplete and efforts to develop better analysis of the data are ongoing (CHAPTER 3), not enough data have been analyzed, and more experiments may be needed to resolve existing inconsistencies in the data. Despite the author’s best effort, the analyzed data are not consistent enough for the author to make a compelling statistical assessment. As a result, statistical analysis is omitted in this section and only the aggregated results are shown.

4.3.1. (CTG)1

To estimate the stoichiometry of proteins in a protein complex on (CTG)1, we performed volume analysis as described in Section 3.5.3C. The volumes of protein complexes are measured and their distribution is plotted in Figure 4.8B. Since AFM volume correlates positively with molecular weight [84], the stoichiometry of proteins in a protein complex can be estimated from the volume distribution. In both nucleotide conditions (ADP or ATP), a dominant peak representing a single MutSβ protein can be distinctly resolved at 600-800nm³ (Figure 4.8B). The volume at the peak location is similar to that of the free proteins (Figure 4.8E), which are known to exist as single-protein heterodimers [232]. This result suggests that the majority of complexes comprise a single MutSβ protein only, a stoichiometry of 1:1. A much smaller second peak could also be resolved at 1200-1300nm³, suggesting that a fraction of the complexes are multi-protein complexes (Table 4.3). The volumes are independent of their position on the DNA, suggesting that complexes at the specific site (i.e. the slip-out) have the same size as those at the non-specific sites or DNA terminus.

We then measure the stoichiometry of protein complexes on protein-bound DNA (i.e. free DNAs are filtered out) by counting the number of protein complexes on those DNAs (Figure 4.6 middle). In the presence of ADP, our data show that the majority of the protein-bound DNAs are bound by one protein complex (Figure 4.9B, ADP), similar to results obtained from the homoduplex controls (Figure 4.9A, Figure E.3, ADP). In the presence of ATP, the stoichiometry increases on DNA containing (CTG)1 (Figure 4.9B, ATP), which is not observed in the homoduplex controls (Figure 4.9A, Figure E.3, ATP). These results suggest that MutSβ loading is mainly facilitated through the specific site, and that the number of MutSβ complexes on the (CTG)1 substrate increases in the presence of ATP, consistent with the prediction of the sliding clamp model (Section 4.1.1). Overall, about 15% of DNA-bound proteins are clustered in the vicinity of one another (Table 4.3). Here, clustering is defined as an event where individual complexes neighbor each other (Figure 4.5 green box) or proteins form a single multi-protein complex (Figure 4.8B, >1000nm³). Protein clustering may be indicative of coordination between proteins in repair processing.

Next, protein positions are measured on protein-bound DNAs that have one protein bound (Figure 4.6 right). Because we use unblocked DNAs, we use the short-arm distance (the distance to the nearest end of the DNA) for our position measurement (Figure 4.7B), and therefore positions from one side (the side that includes the specific site) will overlay symmetric positions from the other side (Figure 4.7C). The position distribution would be an overlay of the specific binding distribution on top of the non-specific binding distribution (footnote 104), but this will not affect our ability to estimate the binding specificity for the specific site (see Figure 4.7C description). In the presence of ADP, the position distribution shows a binding preference around the specific site on (CTG)1 (~39% length, Figure 4.10B, ADP), suggesting that MutSβ binds preferentially to the specific site relative to the non-specific sites. In the presence of ATP, the position distribution continues to show a similar preference at the specific site (Figure 4.10B, ATP). This result appears to be inconsistent with the sliding clamp model (Section 4.1.1), which predicts a drop in MutSβ’s binding affinity for the specific site upon ATP activation. However, it is possible that MutSβ’s binding affinity for the non-specific sites also decreases in the presence of ATP, leaving the relative preference for the specific site unchanged.

We also did experiments on a longer (CTG)1 heteroduplex (Table 4.1, single-cut column). On this substrate, the preference for the specific site is notably lower in the presence of ATP than in the presence of ADP (Figure 4.11). We performed the specificity calculation for the ADP case (calculation not shown), and it is ~1000 (i.e. the binding affinity at the specific site is ~1000 fold stronger than that at the non-specific sites). Because of the inconsistencies in results between the short and the long (CTG)1 heteroduplexes, more analysis and/or experiments are needed. On the short DNA substrates (Table 4.1, double-cut column), strangely, MutSβ also seems to show some preference around the same region (~40% length) on the homoduplex DNA substrate (CTG)0 when ADP is present (Figure 4.10A, ADP). However, no preference for this region is found on the (CTG)0 homoduplex in the presence of ATP (Figure 4.10A, ATP). We also investigated the distribution on a different and longer homoduplex (Table 4.1, 2033bp Homoduplex), and the distribution is essentially flat (Figure E.2). These results suggest that some systematic errors may be at play when measuring the position in the presence of ADP. In position analysis (Figure 4.7C), the baseline occurrence probability Pmin (i.e. the average non-specific binding probability) can be used to estimate binding specificity (Section 3.5.3B), as expressed in the specificity equation (see Figure 4.7C description). The smaller the Pmin, the higher the binding specificity (and the preference for the specific site). In our data, despite the complication on the homoduplex control, the baseline of the non-specific distribution (Figure 4.10A, B, dashed line) is lower on (CTG)1 compared to its homoduplex counterpart (CTG)0, suggesting that MutSβ still favors the specific site on (CTG)1 more than the non-specific site on (CTG)0 around the same area.

104 Since the short-arm length caps at half-length of the DNA, the short-arm length fraction caps at 50% as shown in Figure 4.10.

To characterize the conformation of the complexes, we measured DNA bending with and without MutSβ. Hairpin sizes of 7bp-18bp have been visualized by AFM before [244] (Figure 4.13 left); however, the hairpin can often be indistinguishable from natural DNA distortions and other background features. As an alternative, we measured notable kinks and their locations instead (Figure 4.13 right). Because the kinks are spotted by eye and filtered by their notability, lower-bent and unbent states are filtered out. Our preliminary data show that the kink angles at the specific site (~40°) are indistinguishable from those at the non-specific sites (Figure 4.14 left column, (CTG)1). However, since lower-bent and unbent states, which dominate the non-specific sites (unpublished data, footnote 105), are filtered out in our analysis, the bend angles observed at the specific site are likely less affected by the filter and are therefore more likely to recapitulate the native kink angles of the slip-out. We also examined the locations of sharp kinks (> 60°), and they do not show any preference for the slip-out location (Figure 4.14 right column, (CTG)1). Since we are making a big assumption (about the filter), we do not draw a conclusion on the native bend angle for (CTG)1 here and only note that it appears as 40° in this preliminary analysis.

105 Our past AFM data showed that the non-specific bending distribution on free DNAs is a half-Gaussian centered at 0°, i.e. on average, the DNA is unbent on non-specific locations.

The DNA bending with MutSβ shows a distribution of two bent states. One of the bent states (‘high bent state’) shows that MutSβ sharply bends the DNA by 100° at the slip-out in the presence of ADP (Figure 4.12B, Specific Site, ADP), consistent with the sharp bending of a 3nt slip-out seen in the crystal structure [158]. The other state (‘unbent state’) shows an unbent population around 0°, which is not seen in the crystal structure. On the non-specific sites and on the homoduplex DNA fragment, MutSβ only bends the DNA by ~50° (Figure 4.12B, Non-specific Site; Figure 4.12A, ADP), similar to what we have seen with E. coli and Taq MutS [6]. In the presence of ATP, interestingly, the DNA bending distribution shows a transition towards the lower 50° bent state at the specific site, and a transition towards an unbent state around 0° at the non-specific sites, suggesting potential conformational changes at both locations (Figure 4.12B). The shift towards decreased DNA bending is consistent with the ‘unbent state’ revealed in past studies on sliding clamp formation, which is likely induced by ATP exchange and/or ATPase activity [5, 6, 83]. Our data are mostly consistent with sliding clamp formation on small slip-outs despite MutSβ’s continued preference for the specific site.

4.3.2. (CTG)5

On (CTG)5, volume analysis resolves a dominant species with a volume consistent with a single MutSβ as well as a smaller second species composed of multi-protein complexes in both nucleotide conditions (Figure 4.8C). We also observe that the DNA is mainly occupied by one protein complex in both nucleotide conditions (Figure 4.9C). Since increased loading of MutSβ in the presence of ATP is not observed, this result suggests that the efficiency of multiple loading on (CTG)5 could be lower than that on (CTG)1. Because we use linear unblocked DNA, we cannot rule out the possibility of multiple loading (and sliding clamp formation) on (CTG)5 based on this result (more on that in the discussion). Overall, the population of proteins that are clustered ranges from 12% (ADP) to 20% (ATP) (Table 4.3), which is similar to the level of protein clustering seen on (CTG)1. Further error analysis is required, however, to understand the increase in clustering in the presence of ATP.

As discussed in the (CTG)1 section, we use the short-arm length for our position measurement on the unblocked DNA, which addresses the issue of binding sites at symmetric locations (i.e. positions at 39% and 61% of the length will both be measured as 39%). The position distribution shows that (CTG)5 is recognized with very high specificity in the presence of ADP, as indicated by a significant peak at the slip-out (~39% length, Figure 4.10C, ADP). The increase in specificity compared to (CTG)1 suggests that a larger slip-out may facilitate tighter MutSβ binding. Interestingly, MutSβ’s specificity for the slip-out decreases in the presence of ATP, as seen in the drop in binding preference for the slip-out (Figure 4.10C, ATP).

The DNA bending on (CTG)5 without MutSβ is similar to that of (CTG)1, albeit slightly larger at ~50°, which also shows up at the non-specific sites (Figure 4.14 left column). We attribute the increase to surface changes between different AFM depositions that could alter how much the DNA kinks. Interestingly, on (CTG)5, the kinks (at least the sharp kinks) show up more at the specific site than at non-specific sites, unlike on (CTG)1 (Figure 4.14 right column). Similar results are also obtained on (CAG)5 (Figure 4.14 right column, (CAG)5). These results suggest that the size of the slip-out correlates positively with the occurrence of kinks, with larger slip-outs showing kinks more frequently than smaller slip-outs. With the assumption we made earlier (see the discussion in the (CTG)1 section) and the observation that sharp kinks appear more at the specific site than at the non-specific sites, we think the native kink angle on (CTG)5 could be larger than that observed on (CTG)1, with a nominal value of ~50° and reaching up to 80°. Again, since this result comes from a very preliminary analysis, we do not draw any conclusion on the native kink angle of (CTG)5 here.

Previously, crystal structures showed that the kink angle of MutSβ-bound DNA increases as the number of CA dinucleotide repeats in an insertion loop increases [158]. From this result, we predict that the bend angle of MutSβ-bound DNA increases as the number of CTG repeats in the slip-out increases. However, the bend angle of the high bent state measured on MutSβ-bound (CTG)5 in AFM is similar to that measured on (CTG)1 (~100°, Figure 4.12C, Specific Site, ADP). A lower bent state around 60° can also be seen. Interestingly, the unbent population seen on (CTG)1 is missing on (CTG)5 (Figure 4.12B, Specific Site, ADP). In the presence of ATP, DNA bending at the slip-out shifts towards an unbent state around 0° (Figure 4.12C, Specific Site, ATP) instead of the 40° seen on (CTG)1 (Figure 4.12B, Specific Site, ATP). On the non-specific sites, similar to (CTG)1 and the homoduplex controls, MutSβ induces the same 50° bend in the presence of ADP (Figure 4.12C, Non-specific Site, ADP). No change in DNA bending is seen, however, when ATP is present instead of ADP (Figure 4.12C, Non-specific Site, ATP).

Taken together, these results suggest that MutSβ may not be able to convert to the same conformation on (CTG)5 as seen on (CTG)1. Although the specificity of MutSβ for the slip-out on (CTG)5 is lower in the presence of ATP, increased loading has not been observed (pending further experiments on circular or blocked DNA), and MutSβ does not induce the same conformational change as seen on (CTG)1 with ATP present. If MutSβ could not form the sliding clamp on (CTG)5, the drop in specificity suggests that MutSβ, while unable to induce the same conformational change as on (CTG)1, could dissociate from the DNA directly instead of proceeding down the sliding clamp repair pathway. This suggestion may explain why (CTG)5 was found to be refractory to processing in a past study [194].

Since the frequency of protein clustering on (CTG)5 is similar to that on (CTG)1 (Table 4.3), MutSβ clustering at large TNR slip-outs may not block repair as we initially hypothesized. Interestingly, with the addition of MutLα, we see larger complexes forming in most protein-DNA complexes, which could be the SL complexes (see footnote 106) (preliminary data, Figure 4.15A-B). This result suggests that MutSβ may still be able to recruit MutLα despite its seeming inability to induce the same conformational change as on (CTG)1. This signaling-incapable SL complex may ultimately block the slip-out from being processed.

4.3.3. (CTG)56/(CAG)54

Compared to the other two CTG-only slip-outs, the (CTG)56/(CAG)54 substrate is a hybrid with complementary repeats on both sides (the CTG side is two repeats longer than the CAG side), and the slip-out may vary in location, size, and number. The repeat is located between 33% and 47% of the DNA length (Table 4.1). The hybrid has important implications for repeat expansion diseases because the length of the repeat is in the unstable zone and is prone to expand in vivo.

106 The MutSβ-MutLα-DNA experiment was done at higher concentration with chemical crosslinking (Section 4.2.2), which could promote the formation of larger complexes not seen in the MutSβ-DNA experiment (carried out at low concentration). A control MutSβ-DNA experiment done under the same conditions as the MutSβ-MutLα-DNA experiment would help clarify the composition of these large complexes (whether they could also be MutSβ multimers instead of the SL complexes). However, even with the control, the exact composition of these complexes will not be clear since AFM cannot distinguish MutLα from MutSβ. A volume analysis on these complexes could help resolve their composition since MutSβ and MutLα have a measurable difference in molecular weight (AFM volume). In other words, the SL complex will have a different volume than the MutSβ multimer, and this difference will show up in the volume distribution.

On the (CTG)56/(CAG)54 substrate, both the volume analysis measuring the oligomerization state of the complex (Figure 4.8D) and the stoichiometry of protein complexes per DNA (Figure 4.9D) resolve a dominant species in both nucleotide conditions, suggesting that MutSβ binds (CTG)56/(CAG)54 mainly as a single-protein complex. No increase in multiple loading is observed on the linear unblocked DNA substrate in the presence of ATP, similar to (CTG)5. As with the analysis on (CTG)5, a proper assessment of sliding clamp formation and multiple loading would require further experiments on end-blocked or circular DNA. Overall, ~16% of the DNA-bound proteins are clustered on the (CTG)56/(CAG)54 substrate in both nucleotide conditions (Table 4.3), which is similar to the level of protein clustering on the other DNA substrates.

A previous structural study on slipped DNA suggests that the hybrid will predominantly form a single two-repeat slip-out at the center of the repeat tract [228]. The position distribution, however, reveals a broad distribution of strong binding preferences covering the whole repeat tract in the presence of ADP, not just its center (Figure 4.10D). This result suggests that MutSβ does recognize the slip-outs formed in the hybrid, but that the slip-outs’ locations are flexible and could migrate within the repeat tract, making the whole repeat tract ‘specific’ to MutSβ. Interestingly, the specificity for the repeat area stays the same even in the presence of ATP, similar to (CTG)1.

We did not analyze the native kink angle on the (CTG)56/(CAG)54 substrate since the exact location of the slip-out is unknown. On MutSβ-bound DNAs, the distribution of DNA bend angles in the specific area is broad in the presence of ADP (Figure 4.12D, Specific Site, ADP). The non-specific bending at a lower 50° is expected because most sites on the hybrid are non-specific (footnote 107). The broad distribution of highly bent states (>90°), however, is in striking contrast to the more distinct states found on (CTG)1 and (CTG)5 (Figure 4.12B-C, Specific Site, ADP). This result suggests that the slip-outs formed within the repeat may exist in multiple conformations, potentially with multiple sizes, thereby contributing to the breadth of the bending distribution. In the presence of ATP, the highly bent populations shift towards lower bent states (<100°), resulting in a broad distribution with a mixture of bent and unbent states (Figure 4.12D, Specific Site, ATP). On the non-specific sites, DNA bending is similar to, but slightly lower than, that on the other DNA substrates with ADP present (Figure 4.12A-D, Non-specific Site, ADP). The addition of ATP only slightly alters the bending populations relative to ADP (Figure 4.12D, Non-specific Site, ATP). These results seem to suggest that a mixture of conformations exists on (CTG)56/(CAG)54, which could be a combination of signaling-capable conformations (seen on (CTG)1) and signaling-incapable conformations (seen on (CTG)5), although it is hard to pinpoint a particular conformation given the broad distribution of DNA bending.

The formation of larger complexes when MutLα is added (preliminary data, Figure 4.15C-D) is consistent with a previous study showing that MutLα can still be recruited to process the (CTG)56/(CAG)54 substrate [194]. The signaling-incapable population may behave similarly to that on (CTG)5 and contribute negatively to the processing of the slip-out, thereby contributing to the decrease in processing efficiency seen in the previous study.

107 In a hypothetical scenario, the long repeat tract could contain multiple specific sites, as slip-outs could sprout on either strand at multiple locations within the repeat tract. However, as described earlier, past evidence suggests that single slip-out formation is preferred on the slip-out intermediate substrate (i.e. DNA with one strand containing more repeats than the other strand) [228], therefore leaving most binding sites on the repeat tract non-specific. This result can also be inferred from AFM images (data not shown), in which hairpins and kinks are not observed on native DNAs within the repeat area.

Figure 4.8 Volume Distribution of MutSβ Complexes

The volume of DNA-bound proteins (A-D) measured here is similar to that of free proteins (E). The volumes are consistent within experimental error, which could cause a slight shift (~100nm³) across experiments. For DNA-bound protein complexes, the majority exist as monomers with a volume of 600-800nm³. A second population representing a dimer (of the heterodimer protein) can also be resolved at 1200-1300nm³. The second population constitutes ~15% or less of the total complexes (Table 4.3).

Figure 4.9 Stoichiometry of protein complex per DNA

This stoichiometry measures the number of protein complexes on each protein-bound DNA. Increased loading of proteins in the presence of ATP on linear blocked DNA or circular DNA has been used in the past to demonstrate sliding clamp formation (Section 4.1.1) [81, 201]. Since we use linear unblocked DNA, our result is best used to compare the relative efficacy of sliding clamp formation rather than to assess its presence. DNAs with more than three complexes bound are not counted because of their insignificant populations. A small schematic of the DNA substrate is also shown beneath each sub-figure’s title. A. (CTG)0. MutSβ does not show multiple loading in either nucleotide condition. B. (CTG)1. MutSβ shows an increase in multiple loading in the presence of ATP, consistent with sliding clamp formation. The multiple binding is likely achieved through successive loading at the specific site once the site is freed by the sliding of the previously bound MutSβ complex. C. (CTG)5. MutSβ does not show an increase in multiple loading in the presence of ATP, suggesting it may not form the sliding clamp as efficiently as on (CTG)1. D. (CTG)56/(CAG)54. MutSβ does not show an increase in multiple loading with ATP, suggesting it may not form the sliding clamp as efficiently as on (CTG)1.

Figure 4.10 Position Distribution of Protein Complex on DNA

Position distribution is a measure of protein specificity for various binding sites on the DNA, and is counted on DNAs with a single protein complex bound. If a protein binds specifically to a site, the protein will be located predominantly around the site, resulting in a high population at that location. Since we are using unblocked DNA, the short-arm length is used for position measurement (Figure 4.7B). The DNA schematic shown beneath the figure titles is the half of the DNA that contains the specific site (e.g. Figure 4.7C). A. (CTG)0. MutSβ shows a slight preference towards 40% of the length when ADP is present. This result could be caused by systematic errors given that the preference is not seen when ATP is present. The blue line indicates the baseline of non-specific binding probability. Further analysis is needed to clear potential errors in this plot. B. (CTG)1. MutSβ shows preferential binding to the slip-out. The preference is not lower in the presence of ATP, in contrast with the sliding clamp model. The blue line, indicating the baseline of non-specific binding probability, is lower compared to that of (CTG)0, suggesting the existence of a specific site. C. (CTG)5. MutSβ shows a much stronger preference for the slip-out compared to (CTG)1 in B, indicating higher specificity. The specificity decreases in the presence of ATP, suggesting dissociation from the slip-out. D. (CTG)56/(CAG)54. MutSβ shows a strong preference for the repeat area, suggesting that the slip-out could form anywhere across the area. The specificity for the area is not lower in the presence of ATP, suggesting MutSβ may not be able to form the sliding clamp.


Figure 4.11 Position Distribution of Protein Complex on (CTG)1 (Single Cut)

MutSβ’s position distribution on (CTG)1 was also analyzed on the single-cut heteroduplex (Table 4.1, single cut column). The slip-out is inserted at 18% of the DNA length (Table 4.1) and indicated by the blue arrows in the figure. (A) In the presence of ADP, MutSβ shows a preference for the specific site (the slip-out). (B) In the presence of ATP, MutSβ retains the preference for the specific site, but its specificity is lower compared to the specificity with ADP present.


Figure 4.12 DNA Bending by MutSβ

(Figure description on next page)

DNA bending reflects the conformational states of the protein-DNA complex. A. (CTG)0. MutSβ bends the DNA non-specifically at 50° in both nucleotide conditions. B. (CTG)1. MutSβ bends the slip-out at two bend angles with ADP, but shifts towards an intermediate bent state at 40° with ATP. Elsewhere, MutSβ bends the DNA non-specifically at 50° with ADP, but unbends the DNA (0°) with ATP. This result is consistent with the unbent state in the sliding clamp model. C. (CTG)5. MutSβ bends the slip-out in two states (100° and 60°) with ADP, but unbends the slip-out with ATP. Elsewhere, MutSβ bends non-specifically regardless of nucleotide conditions. This result is consistent with a lack of sliding clamp formation. D. (CTG)56/(CAG)54. MutSβ bends the DNA around the repeat area with a broad distribution; the lower bent state is likely contributed by non-specific bending within that area. The high bent states in the repeat area seen with ADP shift towards lower bent states when ATP is added, but a (less) bent population still exists. Elsewhere, MutSβ seems to bend the DNA non-specifically around 40° in both nucleotide conditions.

        (CTG)1    (CTG)5    (CTG)56/(CAG)54
ADP     15%       12%       17%
ATP     15%       20%       16%

Table 4.3 Percent Population of DNA-bound Proteins that Form Clusters or Multi-protein Complexes

Here protein clustering is defined as an event where two or more proteins are found in the direct neighborhood of one another. A protein complex containing more than one protein is a cluster (i.e. a multi-protein complex), as are two proteins that border each other but are otherwise resolved as two individual protein complexes (e.g. Figure 4.5 green box). The data shown in the table are the percentages of complexes that are involved in forming protein clusters, calculated as $\frac{\text{Number of complexes that form clusters}}{\text{Total number of complexes}}$. Note that the number counted here is the number of complexes, not the number of clusters. If two complexes form a single cluster, the numerator is counted as 2. A volume threshold of 1000 nm³ is applied to Figure 4.8 to roughly categorize complexes that contain more than one protein. Complexes that share borders are counted manually.
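The counting rule in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Complex` record, the `shares_border` flag (standing in for the manual border check), and the example volumes are hypothetical, and treating the 1000 nm³ threshold as a strict "greater than" comparison is an assumption rather than something the caption specifies.

```python
from dataclasses import dataclass

# Volume threshold (nm^3) quoted in the caption; the strict ">" is an assumption.
VOLUME_THRESHOLD_NM3 = 1000.0


@dataclass
class Complex:
    """One DNA-bound complex (hypothetical record, for illustration only)."""
    volume_nm3: float
    shares_border: bool = False  # stands in for the manual border check


def is_clustered(c: Complex) -> bool:
    """A complex counts as clustered if it is large or borders another complex."""
    return c.volume_nm3 > VOLUME_THRESHOLD_NM3 or c.shares_border


def percent_clustered(complexes: list[Complex]) -> float:
    """Percentage of complexes (not clusters) that take part in clustering."""
    if not complexes:
        raise ValueError("at least one complex is required")
    clustered = sum(is_clustered(c) for c in complexes)
    return 100.0 * clustered / len(complexes)


# Made-up example: 17 isolated single-protein complexes, one large
# multi-protein complex, and two complexes that border each other.
example = (
    [Complex(700.0) for _ in range(17)]
    + [Complex(1250.0)]
    + [Complex(650.0, shares_border=True), Complex(720.0, shares_border=True)]
)
print(percent_clustered(example))
```

In this made-up example the two bordering complexes each add one to the numerator, which mirrors the rule that a cluster formed by two complexes is counted as 2.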

Figure 4.13 Measuring Kinks on Free DNA

(Left) Hairpins of 7-18 bp can be seen as bumps on DNA (red arrows, figure adapted from [244]). The bumps can be hard to distinguish from the background without prior knowledge of their location. (Right) Notable kinks on DNA are spotted manually and their angles are measured as marked (yellow) in the figure. The hairpin can be seen as a bump (red arrow). Whether a distortion counts as a kink is determined subjectively from how much the DNA bends, and the fiber skeleton of the DNA (sea green) is used to aid the determination. If a DNA distortion is small (e.g. <20°), it is not counted as a kink. The location of each kink is used to determine whether it is at a specific site or at a non-specific site. We don’t use the bump (red arrow) to identify the hairpin location because it is often hard to distinguish the bump from the background and other DNA distortions.


Figure 4.14 Native DNA Kinks and Their Locations

(Figure description see next page)

The bend angles of notable kinks on free DNAs are measured as described in Figure 4.13 and their distributions are plotted (left column). For sharp kinks with bend angles larger than 60°, their position distributions are also plotted (right column) with the slip-out location marked by a red arrow. Data are preliminary and not all DNA substrates are analyzed. As seen in the figure, the bend angles at the specific site are not distinguishable from those at non-specific sites (40° − 50°). Both could shift ~10° in different AFM depositions, possibly due to changed surface conditions that affect the native kinks on the DNA. Interestingly, in the position distributions of sharp kinks on both (CTG)5 and (CAG)5 (more so on (CAG)5), a preferred location (i.e. a peak) of sharp kinks can be found around the specific site (red arrow). Since the preference may not be restricted to kinks with sharp bend angles (i.e. kinks with bend angles smaller than 60° may also exhibit a similar preference; data not analyzed), the preference is best interpreted as an increased frequency of kinks around the specific site compared to non-specific sites on the larger slip-outs.
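The angle thresholds used in Figures 4.13 and 4.14 can be illustrated with a short Python sketch. The kinks in this study are identified manually and subjectively, so the code below is not the procedure used here; it only shows how a bend angle (0° meaning a straight DNA path) could be computed from three hypothetical points along the DNA skeleton and sorted with the 20° and 60° cut-offs mentioned in the captions. The function names and the three-point construction are assumptions made for illustration.

```python
import math

MIN_KINK_DEG = 20.0    # smaller distortions are not counted (quoted as "e.g." in Figure 4.13)
SHARP_KINK_DEG = 60.0  # larger bend angles are treated as sharp kinks (Figure 4.14)


def bend_angle_deg(p_before, vertex, p_after):
    """Bend angle at `vertex`; 0 degrees means the path continues straight."""
    v_in = (vertex[0] - p_before[0], vertex[1] - p_before[1])
    v_out = (p_after[0] - vertex[0], p_after[1] - vertex[1])
    norm = math.hypot(*v_in) * math.hypot(*v_out)
    if norm == 0.0:
        raise ValueError("the three points must be distinct")
    cos_angle = (v_in[0] * v_out[0] + v_in[1] * v_out[1]) / norm
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle))


def classify_distortion(angle_deg):
    """Sort a bend angle into the three categories implied by the captions."""
    if angle_deg < MIN_KINK_DEG:
        return "not a kink"
    if angle_deg > SHARP_KINK_DEG:
        return "sharp kink"
    return "kink"


# Hypothetical skeleton points (arbitrary units): a right-angle turn.
angle = bend_angle_deg((0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
print(classify_distortion(angle))
```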

Figure 4.15 AFM Images of MutSβ-MutLα-DNA Complexes

As seen in the images, large SL complexes can form in both (CTG)5 and (CTG)56/(CAG)54 DNA substrates. These complexes are significantly larger in volume compared to a single-protein complex.

4.4. Discussion

The slip-out substrates we tested have wide implications for how MutSβ interacts with the DNA. Functional studies of processing using slip-outs of the same sizes have been carried out previously [189, 194, 231]. Indeed, our result supports sliding clamp formation on small slip-outs such as (CTG)1, and is therefore consistent with in vitro repair assays demonstrating their efficient repair [189, 231]. Our result on (CTG)5 suggests a conformation different from the sliding clamp seen on (CTG)1, and this conformation may lead to the significant drop in MutLα processing shown in a previous study [194]. The (CTG)56/(CAG)54, a long repeat hybrid with a short insertion, is constructed to mimic the slip-out intermediate predicted in repeat expansion diseases. Our result on (CTG)56/(CAG)54 shows that MutSβ could adopt a mixed population of conformations, some signaling-capable and others not, when processing slip-outs within long repeats. The population of a signaling-capable species is important for MutLα’s recruitment and/or activation, and the observation of a mixed population is consistent with the decreased, but not completely abrogated, MutLα processing seen previously [194]. In the previous study [194], it is unclear whether the decrease in processing on (CTG)5 and (CTG)56/(CAG)54 is caused by a lack of recruitment or a lack of activation. Our preliminary data show that MutLα may still be recruited on those substrates, suggesting that the activation of MutLα, rather than its recruitment, might be the main driver of the observed decrease in processing. This suggestion is consistent with a previous mouse study that shows a reduction of the expansion rate by ~50% in Pms2-deficient mice [185]. Elucidating how MutSβ and MutLα interact with these slip-out constructs is crucial to understanding the mutagenesis and mechanism that lead to expansion of TNR sequences and TNR-related diseases.

Our structural study reveals new insights into the various conformational species of the MutSβ-DNA complex that may be important to its signaling in the downstream repair pathway. We partially recapitulated the high bent state (~100°) seen in the crystal structure upon slip-out recognition [158], but we also observe lower bent or unbent populations (Figure 4.12B-D, Specific Site, ADP) that may be unique to the construct of the slip-outs. Our data suggest a unique bending conformation on (CTG)1 that could be important to sliding clamp formation (Figure 4.12B ATP). On repair-incompetent substrates, this bending conformation is either not seen ((CTG)5, Figure 4.12C ATP) or buried among other conformations ((CTG)56/(CAG)54, Figure 4.12D ATP). Since the DNA slip-outs may have different inherent bend angles (Figure 4.13), the conformational differences among the DNA slip-outs are likely to cause the different bending states that we see on these substrates. The inability to induce the right conformational change to form the correct signaling species may be the reason why abrogated ((CTG)5) or decreased ((CTG)56/(CAG)54) processing is seen on those substrates [194].

Interestingly, the recognition conformation seen on (CTG)1, with a high bent state (100°) and an unbent state (0°) (Figure 4.12B, Specific Site, ADP), can also be observed on (CTG)5 (and possibly on (CTG)56/(CAG)54), but only in the presence of ATP (Figure 4.12C-D, Specific Site, ATP). This observation suggests that (CTG)5 and (CTG)56/(CAG)54 may be stuck in the inactive conformation seen on (CTG)1 (if we dub the recognition conformation the ‘inactive’ form prior to sliding clamp formation) when ATP is present. This possibility might explain why (CTG)5 and (CTG)56/(CAG)54 may have difficulty forming the signaling-capable conformation: they may be ‘trapped’ in an inactive recognition conformation. To reiterate from the Methods section (Section 4.2.3), all the bend angles discussed here are external bend angles, as DNA strands internal to the protein-DNA complex cannot be visualized directly. Therefore, we do not expect the external bend angles we measure to be the same as those obtained from the crystal structure, but we use them as indicators of conformational states. It is possible that the external bend angles are less than optimal indicators of conformational states, and inconsistencies may arise when comparing them to results obtained from higher-resolution methods (such as the crystal structure). For example, using the DREEM technique we developed (CHAPTER 2), we can directly visualize the DNA strands within the MutSβ-DNA complex and measure the internal bend angle, which could be much higher than the external bend angle (Figure C.4).

We initially predicted that the mismatches in the slip-out could recruit more MutSβ proteins as the size of the slip-out grows, or that a MutSβ protein could cluster with another MutSβ protein through long-range (looping) or short-range (association) interactions. A previous sedimentation equilibrium analysis measured the molecular weight of the MutSβ-DNA complex on a CAG hairpin, and found a mass 15% greater than that expected from a 1:1 protein to DNA ratio, suggesting that a fraction of the DNA was binding more than one protein on the CAG substrate [232]. Indeed, in our reaction condition (10 nM MutSβ), although MutSβ predominantly exists as a single-protein complex at the slip-out (Figure 4.8), ~15% of the complexes are clustered (Table 4.3), which would predict a ~15% increase in the average complex mass108. This observation also suggests the possibility of higher oligomerization states and more clustering events if the reaction were to occur at higher concentrations (e.g. 100 nM). Our data support the idea that MutSβ may possess the ability for regional and long-range coordination in the cellular context even if the frequency of those events is relatively low in our in vitro testing condition.

108 Most of the clustering involves two proteins. For example, the secondary peaks (1200-1300 nm³) in Figure 4.8 correspond to complexes containing two proteins (the volume of a single protein is 600-800 nm³). Supposing the molecular weight of a single-protein MutSβ-DNA complex is x and there are n complexes, the average mass of a complex can be calculated as $\frac{0.85 \cdot n \cdot x + 0.15 \cdot n \cdot 2x}{n} = 1.15x$, which is 15% greater than the mass of a single-protein complex.
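The arithmetic in footnote 108 generalizes to other clustering fractions. The minimal Python sketch below reproduces that estimate; the function name and the `proteins_per_cluster` parameter are hypothetical, and the assumption that every clustered complex contains exactly two proteins is the same simplification the footnote makes.

```python
def relative_average_mass(fraction_clustered: float, proteins_per_cluster: int = 2) -> float:
    """Average complex mass in units of the single-protein complex mass x.

    Assumes every clustered complex holds `proteins_per_cluster` proteins and
    every other complex holds exactly one (the footnote's simplification).
    """
    if not 0.0 <= fraction_clustered <= 1.0:
        raise ValueError("fraction_clustered must lie between 0 and 1")
    if proteins_per_cluster < 1:
        raise ValueError("proteins_per_cluster must be at least 1")
    single_fraction = 1.0 - fraction_clustered
    return single_fraction * 1.0 + fraction_clustered * proteins_per_cluster


# With 15% of complexes clustered as pairs, the average mass is about 1.15x.
print(round(relative_average_mass(0.15), 2))
```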

The sliding clamp model (Section 4.1.1) predicts multi-complex formation in the presence of ATP. A previous study using EMSA (Electrophoretic Mobility Shift Assay) has shown that MutSα forms a hydrolysis-independent sliding clamp on mismatched DNA109 [81]. A similar result was also observed for MutSβ using SPR (Surface Plasmon Resonance) [201]. Both studies saw an increase in binding in the presence of ATP, consistent with a sliding clamp mechanism that allows for stochastic loading of additional MutS proteins onto the DNA through the mismatch. However, important differences between our experiment and previous studies must be considered when interpreting our stoichiometry results. In previous studies, multiple loading was only observed on end-blocked or circular DNA. Specifically, in the SPR study, when the end block was removed after the MutSβ sliding clamps were formed, MutSβ dissociated from the DNA extremely rapidly and binding affinity essentially dropped to the non-specific level. Therefore, on linear unblocked DNAs, such as the ones used in our study, the absence of multi-complex formation does not rule out the possibility of sliding clamp formation110. Our data may, however, be used to interpret the relative efficacy of sliding clamp formation. We did observe increased loading111 on (CTG)1 in the presence of ATP compared to the other DNA substrates, suggesting that MutSβ may form the sliding clamp more efficiently on (CTG)1. However, since loading results in the AFM sample are sensitive to local concentrations of proteins and DNA112, the results are not consistent enough to support a convincing statistical assessment (hence statistical analysis is omitted and only the aggregated data are shown in our stoichiometry result). More experiments, in particular experiments using circular DNAs or blocked DNAs, are needed for further assessment of sliding clamp formation.

109 When multiple loading occurs in the presence of ATP, the mobility of the protein-bound DNA is reduced to form a broad band in the gel assay. In the presence of ADP, only a single distinctive band can be resolved, representing a single conformation of the mismatch-bound protein-DNA complex.

110 We also use a lower protein concentration than that used in the SPR study (10 nM vs. 50 nM) [201], which may lower the stoichiometry that we see (MutSβ’s binding affinity on the DNA substrates that we use was previously measured as 10~20 nM [194]).

The MutSβ sliding clamp has been shown to recruit MutLα on repair-competent DNA substrates in a previous study [201]. It is less clear whether MutLα can be recruited as efficiently on repair-incompetent substrates as on repair-competent substrates, and whether MutSβ sliding clamp formation is required to recruit and activate MutLα. The potential observation of MutLα recruitment on repair-incompetent slip-outs, such as (CTG)5 and (CTG)56/(CAG)54, adds some clarity to the causes of the decreased MutLα processing shown in the previous study [194]. We initially predicted that the decreased MutLα processing is caused by a decrease in MutLα recruitment. But from our preliminary data, MutLα may still be recruited efficiently even on repair-incompetent slip-outs. This result suggests that the activation of MutLα, rather than its recruitment, might be the main cause of the observed decrease in processing. As discussed previously, the formation of the sliding clamp cannot be clearly interpreted from our current data. Therefore, the question of whether MutSβ sliding clamp formation is required to recruit and activate MutLα requires further investigation.

111 I attribute this increased loading to the longer DNA we used compared to the SPR study (1041 bp vs. 236 bp). The longer DNA makes it harder for MutSβ to diffuse off the DNA, which may explain why MutSβ sliding clamps were completely dissociated in the SPR data after the end-blocks were removed [201], whereas in our data we could still see ‘remnants’ of the sliding clamps that have not been completely off-loaded.

112 Because the protein-DNA complex is not chemically cross-linked, the reaction could still occur during sample deposition onto the mica surface. We found that the local concentration of protein and DNA was not even across the mica surface. The variation in reaction concentrations results in inconsistencies in the observable binding stoichiometry on different spots of the mica, even for the same deposition.

Collectively, our results give new insight into how the mismatch repair system can incorporate these slip-outs into the DNA and ultimately lead to repeat instability. The activation of MutLα’s endonuclease activity requires PCNA, which can be loaded through DNA openings such as slip-outs, as shown in a previous study [194]. The loading has no preferred orientation, which causes MutLα to promiscuously incise the DNA on either strand. A previous study [193] showed that nicking on the same strand as the slip-out leads to ‘error-prone repair’, a failed repair that does not completely remove the slip-out; nicking on the opposite strand of the slip-out leads to correct repair (for CAG slip-out), where the slip-out is completely removed, or ‘escaped repair’ (for CTG slip-out), where the slip-out remains completely intact. These results suggest a plausible mechanism for repeat instability in somatic cells.

Alternatively, in a dividing cell, PCNA can preexist on the leading strand behind the DNA polymerase, which could produce expansion-prone slip-outs; or it can be loaded from the nick found between the Okazaki fragments during replication, which produces deletion-prone slip-outs. In either cellular scenario (somatic or proliferating cells), whether the slip-out is expansion-prone or deletion-prone, a trapped MutSβ-MutLα-DNA complex at the slip-out would stabilize the slip-out and prevent its removal, ultimately leading to repeat instability. The recruitment and activation of MutLα at repair-incompetent slip-outs such as the hybrid (e.g. (CTG)56/(CAG)54) could also create a novel route to further slip-out formation through nicking, excision, and strand displacement synthesis. These new slip-outs would trap more MutSβ-MutLα-DNA complexes, creating a vicious cycle of stabilizing and promoting more and more slip-outs, leading to increasingly severe repeat instability.

Several pathways have been proposed for how MMR impacts slip-out incorporation and formation. In dividing cells, both expansion-prone and deletion-prone slip-outs can arise through DNA replication, and MMR can be hijacked to stabilize these slip-outs while processing them [207]. In somatic cells, expansion-prone slip-outs could occur during nick-directed DNA synthesis, which could be generated by BER through 8-oxoguanine removal or MMR through DNA breathing and misalignment [194, 218]. Building on these models, our data further refine the process MMR employs to stabilize and promote slip-out formation; the refined process, based on our data and hypothesis, is summarized in Figure 4.16.

The degenerative nature of repeat expansion is directly associated with the continued length of non-interrupted repeat. When stabilizing interruptions are lost through mutations, the repeat expands in continuity and provides the stability required for the initial seeds of stable slip-out formation. MutSβ and MutLα could further worsen the situation through non-productive trapping and aberrant processing on those slip-outs, countering their effective removal by other cellular mechanisms. Our study fits in with existing models of expansion and recapitulates results from past studies, while expanding on details of the processing of both repair-competent and repair-incompetent slip-outs, furthering our understanding of the molecular mechanism of MMR in normal repair and in repeat expansion.


Figure 4.16 Model of Repair Signaling

In this model, we assume the signal-incapable conformations that we saw are the non-sliding clamp conformations (although our data cannot establish whether they are in fact not sliding clamps). A. (CTG)1. MutSβ recognizes the (CTG)1 by sharply bending the DNA. MutSβ is then activated by ATP and converts to the sliding clamp. More proteins can be loaded onto the DNA when MutSβ slides away from the slip-out. MutSβ can then recruit MutLα, which can properly incise the DNA (with PCNA activation). B. (CTG)5. MutSβ recognizes the (CTG)5 by sharply bending the DNA. Upon adding ATP, MutSβ directly dissociates from the slip-out without converting to the sliding clamp. MutSβ can still recruit MutLα onto the DNA, but the SL complex is unable to incise the DNA. C. (CTG)56/(CAG)54. MutSβ sharply bends the DNA within the repeat tract. Upon adding ATP, MutSβ shifts to a lower bent state, with a mixture of sliding clamp and non-sliding clamp populations. MutLα can be recruited by either population. The sliding clamp population can properly signal MutLα to incise the DNA, whereas the non-sliding clamp population cannot signal MutLα to incise the DNA.

APPENDIX A. SUPPLEMENTAL INFORMATION FOR DREEM

A.I. Supplemental Figures

Figure A.1 Topographic and DREEM Images of a Polarized BaTiO3 (BTO) Thin Film

(A) Schematic showing the generation of a surface pattern with different polarization states on a BTO film. External electric fields (-5 V DC bias for the larger area followed by +5 V bias for the smaller area) are applied through a conductive AFM cantilever. The charge density on the BTO thin film after polarization was estimated to be approximately 2 electrons/nm². (B) Topographic (left), DREEM phase (middle), and DREEM amplitude (right) of the polarized areas on the BTO thin film. The DREEM phase and amplitude images directly reveal the pattern of charged areas, without any detectable crosstalk into the topographic channel. In addition, the large contaminant particle seen in the topographic image (white arrow) is not seen in the DREEM images, indicating that there is no crosstalk of the topographic signal into the DREEM signals. The XY scale bars are 1 μm.


Figure A.2 Additional DREEM Images of Histone Alone and Nucleosomes

(A) DREEM images of histone proteins alone in the absence of DNA. Topographic (left panels), DREEM phase (middle panels), and DREEM amplitude (right panels) images of two individual histone proteins (top and bottom panels). (B) Repeated scans (top panels) and retrace images (bottom panels) of the nucleosome shown in Figure 2.2B. The topographic (left panels), DREEM phase (middle panels), and DREEM amplitude images (right panels) demonstrate the reproducibility of DREEM imaging. The XY scale bars are 20 nm. (C) DREEM imaging reveals DNA paths on multiple nucleosomes in individual images. Left panel: The topographic image. Right panel: A DREEM phase image of the same DNA molecule with multiple nucleosomes (middle image). Zoomed in areas with individual nucleosomes are shown surrounding the full image. Each nucleosome is identified by the matching color outlining the nucleosome molecules in the middle image. The inserts in the zoomed-in images show the corresponding topographic images (not to scale). The DNA paths revealed in DREEM images are at different orientations, ruling out the possibility that the signals consistent with DNA paths are due to scanning artifacts.


Figure A.3 DREEM Imaging Reveals the Path of the DNA in hMutSα Mobile Clamp Complexes Loaded onto a Circular DNA Substrate (4 kbp) Containing Two GT Mismatches 2 kbp Apart

(A) Topographic (left), DREEM phase (middle), and DREEM amplitude (right) images of a sample containing both hMutLα, which adopts multiple conformations [245], and hMutSα without DNA (hMutSα MW=257 kDa; hMutLα MW=180 kDa). The larger protein is hMutSα (identified by arrow). (B) Topographic (left), DREEM phase (middle), and DREEM amplitude (right) images of multiple hMutSα sliding clamps formed by incubating 125 nM hMutSα and 1 mM ATP with the mismatch-containing DNA. Based on the volume of the complexes in the topographic images, each of the complexes (in boxed regions) contains two or more hMutSα proteins. The scale bars are 50 nm. (C) Zoomed-in images showing the cross-section analysis of the hMutSα-DNA complexes in the boxed regions in B. The scale bars are 20 nm. The pairs of perpendicular lines on the section plots indicate the positions of the DNA strands. In the zoomed-in image in C (top left) (as well as in the overview image in B), two MutSα proteins can clearly be seen adjacent to one another on the DNA. In C (top right), the MutSα complex is interacting with two DNA strands, and the two individual duplex DNA strands can be clearly seen in the DREEM images and the cross-section analysis. Only the DREEM images, not the topographic images, reveal the DNA in hMutSα-DNA complexes.

A.II. Theoretical Basis of DREEM Measurements

As demonstrated below, because we are monitoring very small changes in surface potential on a modestly charged surface (mica), $\Delta A_{\omega_2}(x,y)$ will be dominated by changes in the force gradient, with only small contributions from the force. This method is similar to the amplitude slope detection method used to monitor the atomic force gradient in topographic AFM images [54].

Because the AC bias is applied at the first overtone frequency ($\omega_2$), the applied force induces a vibration, with a free amplitude (assuming no damping)

\[ A_{0,\omega_2} = \left(\frac{Q_2}{k_2}\right) F_{\omega_2} = a \left(\frac{Q_2}{k_2}\right) (\Delta\phi_{TS} - V_{DC}) V_{AC} \tag{A.1} \]

where $Q_2$ and $k_2$ are the quality factor and effective spring constant, respectively, of the first overtone of the cantilever, and $a$ is a constant that depends on the tip radius and tip-sample separation [35, 246, 247]. In addition, the force gradient, $F'$, changes the effective spring constant of the cantilever and shifts its first overtone frequency by

\[ \Delta\omega_2 = \frac{\omega_2 F'}{2 k_c} \tag{A.2} \]

where $k_c$ is the spring constant of the cantilever (which is equal to $k_1$, the spring constant of $\omega_1$) [49, 53, 54], thereby reducing the vibration amplitude at $\omega_2$ to

\[ A_{\omega_2} = A_{0,\omega_2}\left[1 + b \left(\frac{Q_2}{k_c}\right) F'\right] \tag{A.3} \]

This approximation assumes that the applied force is slightly off the resonance frequency, on the side of the resonance peak where the slope of the peak is maximum and $b = \frac{2}{3\sqrt{3}}$ [54]. Notably, the frequency shift and therefore the change in amplitude depend on both the static and dynamic components of the electrostatic force gradient (i.e., $F'_{DC}$ and $F'_{\omega_2}$; Eq. 2.1 & Eq. 2.2) [246, 248].
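Eqs. A.1 to A.3 can be evaluated numerically to get a feel for the magnitudes involved. The Python sketch below is only an illustration: the spring constants and quality factor are the cantilever values quoted later in this appendix, while the drive force and the force gradient are hypothetical numbers chosen for the example. The sign convention of $F'$ is not restated here, so the code simply evaluates the expressions as written.

```python
import math

# Slope-detection constant b = 2 / (3 * sqrt(3)) from the text.
B_SLOPE = 2.0 / (3.0 * math.sqrt(3.0))


def free_amplitude(force_n, q2, k2_n_per_m):
    """Eq. A.1: free amplitude A0 = (Q2 / k2) * F, in meters."""
    return (q2 / k2_n_per_m) * force_n


def frequency_shift(omega2, force_gradient_n_per_m, kc_n_per_m):
    """Eq. A.2: shift of the first overtone frequency, in the units of omega2."""
    return omega2 * force_gradient_n_per_m / (2.0 * kc_n_per_m)


def amplitude(a0_m, force_gradient_n_per_m, q2, kc_n_per_m, b=B_SLOPE):
    """Eq. A.3: amplitude at omega2 in the presence of a force gradient."""
    return a0_m * (1.0 + b * (q2 / kc_n_per_m) * force_gradient_n_per_m)


# Cantilever parameters quoted in this appendix (N/m, N/m, dimensionless).
KC, K2, Q2 = 2.8, 110.0, 500.0

# Hypothetical drive force, chosen so that the free amplitude is about 1 nm.
a0 = free_amplitude(2.2e-10, Q2, K2)

# Hypothetical force gradient of 1e-3 N/m; gives the fractional amplitude change.
fractional_change = amplitude(a0, 1.0e-3, Q2, KC) / a0 - 1.0
print(a0, fractional_change)
```

With these illustrative inputs, a force gradient of only 1e-3 N/m changes the amplitude by several percent, because the factor $b\,Q_2/k_c$ is large for a soft cantilever with a high-Q overtone.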

Because we are operating in intermittent contact, the force gradient ($F'_a$) due to repulsive atomic interactions is significantly greater than that due to the attractive electrostatic interactions ($F'_{el}$) [249, 250]; therefore, $\Delta\omega_2 > 0$. (We verified that $\Delta\omega_2 > 0$ in our experiments by monitoring the vibration amplitude as a function of the AC bias frequency.) Under our imaging conditions, $A_{\omega_2}$ is ~1/2 $A_{0,\omega_2}$ after engaging in repulsive mode. During scanning, however, $F'_a$ should be constant because the topographic signal at $\omega_1$ maintains a constant atomic force gradient, and therefore, changes in $\Delta\omega_2(x,y)$ will be dominated by $F'_{el}$. In addition, because $F'_a \gg F'_{DC}$, $F'_{DC}$ does not significantly contribute to the signal at $\omega_1$. Consequently, the static electrical force gradient does not affect the topographic image, and therefore, $F'_{DC}$ is not maintained constant during imaging and will contribute to the signal at $\omega_2$. We confirmed that $F'_{DC}$ does not affect the topographic images by turning the modulated bias voltage on and off while scanning.

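Assuming that the atomic force gradient is constant as a function of the (x,y) position of the tip, the change in $A_{\omega_2}$ [$\Delta A_{\omega_2}(x,y)$] due to changes in the electrostatic force and force gradient associated with a change in position from $(x_1,y_1)$ to $(x_2,y_2)$ on the surface is approximately

\[ \Delta A_{\omega_2} = \left(\frac{Q_2}{k_2}\right)\left[\Delta F_{\omega_2,el}(x,y) + b\left(\frac{Q_2}{k_c}\right)\Delta\left(F_{\omega_2,el}(x,y)\,F'_{el}(x,y)\right)\right] \tag{A.4} \]

\[ \Delta A_{\omega_2} = \left(\frac{Q_2}{k_2}\right)\left[\Delta F_{\omega_2,el}(x,y) + b\left(\frac{Q_2}{k_c}\right)\left[\left[F_{\omega_2,el}(x_1,y_1) + \Delta F_{\omega_2,el}(x,y)\right]F'_{el}(x_2,y_2) - F_{\omega_2,el}(x_1,y_1)\,F'_{el}(x_1,y_1)\right]\right] \tag{A.5} \]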

For small changes in surface potential [$\Delta\phi(x,y)$] or capacitance, such as those in the current experiments where $\Delta\phi(x,y)$ and capacitance changes are very small (only the difference in potential and/or capacitance between the mica substrate and the deposited protein and DNA molecules), $\Delta F_{\omega_2,el}(x,y) \ll F_{\omega_2,el}$ and

\[ \Delta A_{\omega_2} = \left(\frac{Q_2}{k_2}\right)\left[\Delta F_{\omega_2,el}(x,y) + b\left(\frac{Q_2}{k_c}\right)F_{\omega_2,el}(x,y)\,\Delta F'_{el}(x,y)\right] \tag{A.6} \]

Because $F_{\omega_2,el}$ is sensitive to electrostatic potential over a greater distance than $F'_{el}$, the tip cone and the cantilever, as well as the tip apex, make contributions to $F_{\omega_2,el}(x,y)$, and therefore, $F_{\omega_2,el}(x,y)$ will be averaged over a greater area of the sample than $F'_{el}(x,y)$ [36, 50, 54-56, 251, 252]. Consequently, for small changes in capacitance and surface potential [i.e., $\Delta\phi(x,y) \ll (\Delta\phi_{TS} - V_{DC})$] over an area similar to the tip radius, $F_{\omega_2,el}(x,y)$ may be relatively constant. If the force is approximately constant as a function of position then

\[ \Delta A_{\omega_2} = b\left(\frac{Q_2^2}{k_c k_2}\right)F_{\omega_2,el}(x,y)\,\Delta F'_{el}(x,y) \tag{A.7} \]

and only the force gradient contributes to $\Delta A_{\omega_2}(x,y)$. (For the cantilevers used in our experimental setup, $k_c \approx 2.8$ N/m, $k_2 \approx 110$ N/m, $Q_1 \approx 170$, and $Q_2 \approx 500$.) Because $\Delta\phi \ll (\Delta\phi_{TS} - V_{DC})$ for proteins and DNA deposited on mica [37, 253], $\Delta A_{\omega_2}(x,y)$ is dominated by $F'_{el}$.
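The reduction from Eq. A.6 to Eq. A.7 can be checked numerically with the cantilever parameters quoted above. The following Python sketch is illustrative only: the function names are hypothetical, and the force and force-gradient values passed in are made-up numbers used to show that the two expressions agree when the change in force is negligible.

```python
import math

B_SLOPE = 2.0 / (3.0 * math.sqrt(3.0))  # b from the slope-detection approximation
KC, K2, Q2 = 2.8, 110.0, 500.0          # cantilever values quoted in the text


def delta_amplitude_a6(d_force, force, d_gradient, q2=Q2, k2=K2, kc=KC, b=B_SLOPE):
    """Eq. A.6: contribution of the force change plus the force-gradient change."""
    return (q2 / k2) * (d_force + b * (q2 / kc) * force * d_gradient)


def delta_amplitude_a7(force, d_gradient, q2=Q2, k2=K2, kc=KC, b=B_SLOPE):
    """Eq. A.7: force-gradient term only (force taken as constant)."""
    return b * q2**2 / (kc * k2) * force * d_gradient


# Prefactor b * Q2^2 / (kc * k2) that multiplies F * dF' in Eq. A.7, in (m/N)^2.
PREFACTOR_A7 = B_SLOPE * Q2**2 / (KC * K2)

# Hypothetical inputs: constant force of 2e-10 N, gradient change of 1e-3 N/m.
full = delta_amplitude_a6(0.0, 2.0e-10, 1.0e-3)
gradient_only = delta_amplitude_a7(2.0e-10, 1.0e-3)
print(PREFACTOR_A7, full, gradient_only)
```

Setting the force change to zero in the A.6 function reproduces the A.7 value, which is the limit the text describes.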

A.III. Supplemental Experimental Procedures

Conductive cantilever preparation

To obtain high-resolution topography and DREEM images, we used highly doped silicon cantilevers (PPP-FMR from Nanosensor; 2.8 N/m) instead of metal-coated cantilevers, because the radius of curvature of the metal-coated tip is ~20 nm, while that of the non-coated tip is ~7 nm. The conductivity of the doped cantilevers is comparable to that of the metal-coated tips. It should be noted, however, that these doped silicon tips are easily oxidized, which results in the formation of a nanometer-thin non-conductive oxide layer. Consequently, to make a conductive connection between the cantilever and the external input power source, it is essential to penetrate the oxide layer. As described below, we have devised a straightforward method for making a reliable connection, by scraping the cantilever chip and simultaneously coating it with colloidal liquid silver. The silver on the chip makes contact with the metallic tip holder for the Asylum AFM system. For use with instruments that do not have grounded tip holders, ground wires can be attached with a patch of liquid silver.

Detailed instructions for cantilever preparation.

A small amount of the colloidal liquid silver (Ted Pella Inc. product #16034) is spread on a clean glass slide. The cantilever is held with one pair of tweezers. Another pair of tweezers is dipped in the liquid silver, and these silver-coated tweezers are used to scrape and coat the edges of the silicon chip and the silicon surface of the chip on the side opposite from the cantilever tip. The scraping removes the oxidized silicon (SiO2) layer on the surface and replaces it with a conductive silver layer. This process simultaneously scratches away the oxide layer and covers the silicon with silver, preventing any oxidation and forming a conductive layer that can be easily connected to the external electrical sources. Once the coating is completed, the silver-coated chip is allowed to dry for ~5 minutes, and it can then be loaded into the AFM.

Substrate grounding

In our setup, the bias is applied to the tip and the sample is grounded. To ground the sample, which is deposited on mica, we use liquid silver to connect a thin piece of mica to a glass slide, and we also make a connection to ground using liquid silver. Specifically, after the sample has been deposited on mica, a box cutter is used to cleave a thin layer of mica containing the deposited samples (on the topside). The opposite side of the mica (the downside), which does not contain the sample, is coated with liquid silver and held in the air until the liquid silver is dried. This sample is then attached to a glass slide with liquid silver.

To prepare the glass slide, the center of a glass slide is coated with a patch of liquid silver at least as large as the mica. A streak of silver leading from this central patch to one of the furthest sides is painted, and the streak is continued for a short distance on the other side of the glass slide to ensure that it makes proper contact with the metal on the AFM base for grounding.

The silver-coated mica is placed, silver side down, on the wet silver patch, and the slide is allowed to dry for ~30 minutes. It is important not to press down too hard when placing the mica on the silver patch to avoid causing patches where there is no silver.

Selection of the imaging conditions

AFM topographic images are collected in standard repulsive intermittent contact mode at the fundamental resonance frequency (ω1) (MFP-3D AFM, Asylum Research). With the cantilevers used in this study (PPP-FMR from NANOSENSORS; 2.8 N/m), we found that the highest quality topographic images were obtained with a vibration amplitude of 30 to 50 nm and a set point such that the force on the sample is minimized while maintaining a repulsive interaction with the surface. Not surprisingly, we found that the quality of the DREEM images is highly dependent on the quality of the topographic images. To determine the optimum AC and DC biases for DREEM imaging, we measured A_ω2 and collected images as a function of V_AC and V_DC (from 0 to 20 V and -2.5 to 2.5 V, respectively) using the instrumental setup shown in Figure 2.1. Images were collected on two different custom-modified MFP-3D Asylum Research AFMs in two different labs (DAE and HW).

The DREEM images of the nucleosomal arrays were taken at V_AC ~ 20 V and V_DC between -2.5 and 2.5 V, depending on the tip. The magnitude of the applied DC voltage was adjusted based on the resolution and contrast of the DREEM images to achieve the highest signal-to-noise ratios. When the tip is in contact with mica in either repulsive or attractive mode and tuned near the optimum DC voltage, A_ω2 increases linearly upon varying V_AC from 0 to 20 V, as expected [39]. The time constant for collection of the DREEM signal at ω2 is 1 ms. Images were collected at a scan speed of 2 Hz; the scan speed is limited by collection of the topographic signal, not the DREEM signal. The largest amplitudes that we employed for DREEM imaging at the first overtone are very small (~1 nm) compared to the mechanical vibration (30 to 50 nm) at the fundamental frequency, which prevents crosstalk of the electrical signal into the topographic signal. As expected, we also did not detect any crosstalk from the topography in the DREEM images (Figure A.1), and no distinct signals are observed without applied biases. We also found that larger protein-DNA complexes gave better contrast between the DNA and the proteins than smaller ones, probably because the greater amount of protein provides higher contrast.

Similar to conventional AFM imaging techniques, DREEM imaging can also experience tip artifacts, which are primarily due to asymmetry in the electric field between the cantilever and the sample surface. For example, in some cases, half-moon-like asymmetries, with one side of the signal consistently higher than the other side, are seen in the same orientation for all complexes and proteins in a single image. Such images are discarded and not included in analyses. As with artifacts in conventional AFM images, these artifacts can be identified by their repetitive nature, by scanning at various angles, speeds, and size ranges, and by rotating the sample. As in conventional AFM imaging, preparing clean samples and conductive tips and driving the tip at the minimum possible drive amplitude minimize the artifacts.

Sample preparation, deposition and analysis

The BaTiO3 (BTO) thin film was fabricated by atomic layer controlled growth with in-situ monitoring using high pressure reflection high energy electron diffraction (RHEED) [58, 254]. External electrical fields (DC bias) applied through a conductive AFM cantilever during scanning were used to locally polarize the BTO thin film and to generate a surface pattern with different polarization states.

Reconstitution of nucleosomes was done using a linear 2743 bp DNA substrate that was generated through XbaI restriction digestion of a plasmid containing 601 (pGEM-3z/601, Addgene) nucleosomal positioning sequences [255]. The reconstitution was done using histones (EpiCypher) and the salt dialysis method [256]. In some cases, the nucleosomes were crosslinked with glutaraldehyde (Sigma Aldrich) for 30 min at room temperature. The crosslinked or uncrosslinked nucleosomal arrays were deposited on freshly prepared aminopropyl triethoxy silane (APTES)-treated mica and incubated for 10-15 minutes before rinsing [241].

Taq MutS, human MutS and MutL were purified using the protocols published previously [257, 258]. For MutS-DNA complexes, the proteins and DNA were incubated together at room temperature for two minutes and then crosslinked with 0.8% glutaraldehyde for 1 min. The DNA is a linearized 2030 base pair plasmid containing a single GT-mismatch (375 base pairs from one end) [257], which serves as a recognition site for MutS and hMutS. Some protein-DNA complexes were purified using an approximately two-centimeter agarose bead gel filtration column prior to deposition to remove excess free proteins. The complexes were deposited on APTES-treated mica [241], immediately rinsed with water, and dried with nitrogen before imaging. The mica was exposed to APTES for only 15 minutes so that the mica surface contains a low density of amine groups. The DNA lengths were measured using the Asylum Research software. The volume analysis was done as described previously [17, 84].

APPENDIX B. AUTHOR’S CONTRIBUTION TO DREEM

As a member carrying on the pioneering work of DREEM since its invention, I participated extensively in the technique’s continued experimentation, theory development, writing, and optimization.

Specifically, in experimentation, I continued to explore the application of DREEM in various biological contexts, including the nucleosomes that went into the publication. The work on nucleosomes was performed in collaboration with one of the authors (P. Kaur), but most experiments were performed independently with a similar instrument set-up from two different labs. To minimize differences in the two lab set-ups, the two authors worked closely together in deposition methods, experiment set-ups, and imaging parameters on both instruments. Notably, I made and provided the APS chemical for mica treatment used for sample deposition [241]. The consensus in results from two experimenters and instruments greatly strengthens our argument for the reproducibility and usability of the technique. The best images we took went into the publication. Specifically, on imaging nucleosomes (Figure 2.2), although I was not involved in taking that specific image, I did take a lot of images on our instrument supporting our analysis.

My measurement results on the nucleosome spacing are also reflected in the results (Section 2.3) in the published paper. Overall, my role in the collaboration gravitated toward the technical side (working on optimizing the instrument and experiment design) while serving a supporting role in taking the images and performing the analysis.

I also worked closely with other authors in the development of the theory underlying the technique. For example, I added clarity to the force compositions in DREEM. I also contributed to the interpretation of image contrast. To make the theory more readable to readers with a physical science background, I revised and added clarity to the mathematical derivation of the theory.

Although I was not involved in writing the first draft of the paper, I was heavily involved in every revision. I made numerous edits to the writing, to the organization of sentences and flow, and to the integrity of citations. Specifically, I contributed substantially to writing the introduction (Section 2.1) and design (Section 2.2), as well as to addressing the limitations in the discussion (Section 2.3) and developing the theory in the supplement (Appendix A.II).

To further optimize the technique, I continued to explore various aspects of the theory, the underlying parameters, and experimental designs. See APPENDIX C for a summary of these efforts.

APPENDIX C. DREEM OPTIMIZATION

We are still developing our knowledge of the mechanism underlying DREEM as we try to optimize the technique for biomolecular studies and understand its limitations. A key distinction of DREEM from similar techniques is its improved resolution and sensitivity, which could result from the effective utilization of electrostatic interactions, higher harmonics, and intermittent contact mode. In this section, I explore, through experiments and calculations, a series of factors that could potentially be used to improve the resolution and sensitivity of this technique.

Minimum Detectable Force Gradient

One metric to describe the sensitivity of DREEM is the minimum detectable force gradient k_min, as described previously [49, 54, 247]:

$$k_{min} = \frac{1}{a_0}\sqrt{\frac{4 k_0 k_B T B}{\pi f_n Q_n}} \qquad \text{(C.1)}$$

The terms are listed in Table C.1.

Term    Representation
a_0     Vibration amplitude at the first flexural mode in free air (footnote 113)
n       nth flexural mode (footnote 114)
k_0     Spring constant (footnote 115)
T       Temperature
Q_n     Quality factor of the nth mode
f_n     Frequency of the nth mode
B       Scan speed or imaging bandwidth (footnote 116)

Table C.1 Terms in the minimum detectable force gradient

From Eq. C.1, we could potentially improve sensitivity by optimizing the scan speed (B) and the choice of flexural mode.
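As a rough numerical illustration of Eq. C.1, the following Python sketch evaluates the minimum detectable force gradient for a few parameter choices. The amplitude (0.5 V x 67 nm/V) and the 2.8 N/m spring constant are taken from the text; the temperature, bandwidths, mode frequencies, and quality factors are assumed values chosen only to show the scaling, not measurements from this work.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def min_detectable_force_gradient(a0, k0, temperature, bandwidth, f_n, q_n):
    """Evaluate Eq. C.1 in SI units (result in N/m)."""
    thermal = 4.0 * k0 * K_B * temperature * bandwidth
    return math.sqrt(thermal / (math.pi * f_n * q_n)) / a0


# From the text: 500 mV free-air amplitude x 67 nm/V InvOLS, and k0 = 2.8 N/m.
A0 = 0.5 * 67e-9
K0 = 2.8

# Assumed for illustration only: temperature, bandwidth, f_n, and Q_n.
first_mode = min_detectable_force_gradient(A0, K0, 298.0, 1000.0, 75e3, 200.0)
second_mode = min_detectable_force_gradient(A0, K0, 298.0, 1000.0, 470e3, 500.0)
second_mode_slow = min_detectable_force_gradient(A0, K0, 298.0, 250.0, 470e3, 500.0)

print(f"k_min, first mode:              {first_mode:.3e} N/m")
print(f"k_min, second mode:             {second_mode:.3e} N/m")
print(f"k_min, second mode, B reduced:  {second_mode_slow:.3e} N/m")
```

Under these assumptions, k_min scales with the square root of B, so reducing the bandwidth fourfold halves k_min, and a mode with a larger product f_n Q_n gives a smaller k_min.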

Scan Speed

Operating DREEM on the first overtone often results in noisy images due to weak signals. Two methods have been explored to improve sensitivity by improving the signal-to-noise ratio (SNR). The first method involves decreasing the scan speed, which could enhance SNR by increasing sampling (footnote 117). The downside is that some older instruments suffer from significant image distortions due to platform drift (Section 3.4.1), and samples could be damaged by extended exposure to forces. The second method uses oversampling and correlation averaging [111, 261] (Section 3.4, Figure 3.9), a common technique in photography. By sequentially scanning multiple images of the same area, these images can be averaged. Despite the improvement in SNR, the inherent irreproducibility of signals on certain features, such as the DNA path inside a protein, could prevent them from being further resolved (Figure 3.9G-H).

113 Amplitude is calculated using the 500 mV measured free-air amplitude and 67 nm/V InvOLS (inverse optical lever sensitivity) based on a published method [259].
114 Also called the nth eigenmode, or nth harmonic. The first mode represents the fundamental frequency; the second mode represents the first overtone, etc.
115 Spring constant is measured using Sader’s method [260] on a NANOSENSORS PPP-FMR cantilever. In this method, the spring constant of a rectangular cantilever can be estimated as k = Et^3w/(4L^3), where E is the elastic modulus, and t, w, and L are the thickness, width, and length of the cantilever, respectively.
116 Also called measurement bandwidth.
117 The integration time constant in the external lock-in amplifier needs to be adjusted to match the scan speed.
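The SNR gain from averaging repeated scans, as discussed under Scan Speed, can be illustrated with a small Python sketch. It averages frames that are assumed to be already aligned; the correlation-based alignment step of correlation averaging is omitted, and the synthetic signal, noise level, and frame count are arbitrary illustrative choices rather than data from this work.

```python
import math
import random


def average_frames(frames):
    """Pixel-wise mean of equally sized, already-aligned frames."""
    n = len(frames)
    return [sum(pixels) / n for pixels in zip(*frames)]


def rms_error(signal, reference):
    """Root-mean-square difference between a signal and a reference."""
    total = sum((s - r) ** 2 for s, r in zip(signal, reference))
    return math.sqrt(total / len(reference))


random.seed(0)  # fixed seed so the illustration is repeatable

# Synthetic noise-free scan line (hypothetical signal).
truth = [math.sin(2 * math.pi * i / 50) for i in range(200)]

# Sixteen repeated scans of the same line with independent Gaussian noise.
frames = [[t + random.gauss(0.0, 1.0) for t in truth] for _ in range(16)]

single_err = rms_error(frames[0], truth)
avg_err = rms_error(average_frames(frames), truth)
print(f"RMS error, single frame:     {single_err:.3f}")
print(f"RMS error, 16-frame average: {avg_err:.3f}")
```

For independent noise, the error of the average is expected to fall roughly as the square root of the number of frames. Signal that is not reproducible from scan to scan is averaged away together with the noise, which matches the limitation noted for the DNA path inside a protein.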

Flexural mode

Eq. C.1 suggests that a higher flexural mode, which impacts k_min through f_n and Q_n, could improve sensitivity by having a lower minimum detectable force gradient [49]. I experimented with DREEM on the first and the third mode to compare with the second mode. In the first mode, the AFM failed to track the surface. The third mode has a much weaker signal output, but features can still be distinguished, albeit with much more noise. In comparison, the second mode is still the optimal mode for DREEM imaging [50].

Image Contrast

Although k_min describes the minimum force gradient a cantilever can detect on a flexural mode, it does not describe the cantilever’s sensitivity to changes in force gradient, a more useful metric for image contrast. The contrast in amplitude is proportional to the change in force gradient (Eq. A.7) and was found in experiment to be dominated by the static electrostatic force F_DC (footnote 118). The contrast in phase is related to the change in inelastic dissipative forces at the tip-surface interface [43, 246], which include the electrostatic forces [38]. Since k_min of the first mode is larger than that of the second mode, any force gradient with a magnitude in between will be neither detected nor nullified by amplitude modulation on the first mode, and will therefore be picked up by the second mode [262, 263]. These ‘leaked’ forces include, but are not necessarily limited to, the electrostatic forces. From Eq. 2.1, the change of force and force gradient in the electrostatic forces comes from the change in capacitance gradient and the change in the tip-surface contact potential difference (φ_TS). Interestingly, I obtained identical measurements of Δφ_TS among different sample features (footnote 119). This result leaves the capacitance gradient difference as the only source of image contrast for forces of electrostatic origin [38, 264]. Although capacitance cannot be directly manipulated, other user parameters can be explored to maximize image contrast.

118 This result can be inferred from the dominance of the quadratic feature in the amplitude vs. V_DC curve.

One user parameter is the DC bias. Usually, the DC bias needs to be determined empirically for each experiment to optimize the image contrast. Interestingly, for some treated micas, mode transition [246] (also known as mode hopping) can be observed at V_DC = Δφ_TS, where the tip-surface interaction changes from repulsive mode (V_DC < Δφ_TS) to attractive mode (V_DC > Δφ_TS). This result is consistent with switching mode when the DC bias overcomes the initial repulsive engage (Figure C.1). During the transition to attractive mode, the phase drops significantly and the amplitude increases twofold (Figure C.1 red vs. black). In attractive mode, the DNA stands out much better from the surface while sensitivity to the DNA path inside the protein worsens, whereas the contrary occurs in repulsive mode (Figure C.2 right vs. middle). The SNR is also much better in attractive mode (Figure C.2 right). This result suggests that while attractive mode significantly improves SNR with its increased amplitude, it is also much less sensitive in detecting small changes in force gradient within the protein-DNA complex.

119 φ_TS can be obtained by identifying the minimum point of the amplitude vs. V_DC curve. The values are found to be identical among the mica surface, the DNA, the protein, and the DNA intersection at the protein, although φ_TS can be different on different micas depending on their surface charges.
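Locating the minimum of the amplitude vs. V_DC curve (footnote 119) can be sketched as a least-squares parabola fit followed by reading off the vertex. This is only an illustration of the idea, assuming the curve is well described by a quadratic as footnote 118 suggests; the function names and the synthetic sweep below are hypothetical and are not the procedure or data used in this work.

```python
def fit_parabola(xs, ys):
    """Least-squares fit of y = c2*x^2 + c1*x + c0; returns (c2, c1, c0)."""
    # Normal equations built from power sums of x and moments of y.
    s = [sum(x ** p for x in xs) for p in range(5)]
    t = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]
    m = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [a - factor * b for a, b in zip(m[row], m[col])]
    c0, c1, c2 = (m[i][3] / m[i][i] for i in range(3))
    return c2, c1, c0


def parabola_minimum(xs, ys):
    """Abscissa of the fitted parabola's vertex (requires upward curvature)."""
    c2, c1, _ = fit_parabola(xs, ys)
    if c2 <= 0:
        raise ValueError("fitted curve has no minimum")
    return -c1 / (2.0 * c2)


# Hypothetical sweep: V_DC from -2.5 V to 2.5 V with a minimum placed at 0.4 V.
v_dc = [-2.5 + 0.25 * i for i in range(21)]
amplitude = [0.8 * (v - 0.4) ** 2 + 0.2 for v in v_dc]

print(f"estimated minimum at V_DC = {parabola_minimum(v_dc, amplitude):.3f} V")
```

Fitting the whole sweep rather than picking the lowest sample makes the estimate less sensitive to noise in any single point, at the cost of assuming the quadratic form holds over the fitted range.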


Figure C.1 Frequency shift on the first overtone with different DC bias

Engaging the tip (black) shifts the frequency positively, resulting in a drop in both the amplitude (footnote 120) and the phase (phase curve not shown) compared to the state before engaging (grey). When the static electrostatic force is small (V_DC < Δφ_TS), the frequency shifts slightly negatively (attractive interaction, red) from the repulsive engage, but the overall interaction remains repulsive. When the static electrostatic force is large (V_DC > Δφ_TS), the frequency continues to decrease until it drops below the original ω2 (orange), and the overall interaction becomes attractive.

Figure C.2 DREEM imaging in attractive mode vs. repulsive mode

Left. Topographic image. Middle. DREEM imaging in repulsive mode (V_DC < Δφ_TS). Right. DREEM imaging in attractive mode (V_DC > Δφ_TS).

120 It should be noted that the peak of the amplitude increases when the tip is engaged compared to when the tip is in free air. The reason is that the capacitance gradient increases when approaching the surface, resulting in a higher F_ω (Eq. 2.2).

Another user parameter is the operating frequency at the first overtone. Although we could try to drive the tip at the maximum slope off resonance, it rarely helped. The reason is that the mechanical bandwidth of the first overtone is so small that any small perturbation in force gradient could send the cantilever completely off resonance (footnote 121) (Figure C.3 orange). In other words, we could not detect all the force gradients with equal sensitivity, and the operating frequency has to be adjusted such that frequency shifts on the feature of interest can be squeezed into the narrow sensitivity window (Figure C.3 grey box). Notably, operating on different sides of the resonance peak results in a flipped amplitude response, while operating on the resonance peak could enhance edges (footnote 122) when moving across a force gradient cliff between repulsive and attractive forces, such as the protein-surface interface or the protein-DNA interface (Figure C.4 top-right or bottom-left). This technique is very helpful in reducing, if not completely removing, the half-moon shaped image artifact (footnote 123) that often plagued DREEM images (Figure C.4 top-left or bottom-right).
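The flipped amplitude response on either side of the resonance, and the narrowness of the sensitivity window, can be illustrated with a driven damped harmonic oscillator model. The Python sketch below is a simplified model and not a description of the instrument; the resonance frequency, quality factor, and frequency shift are assumed values chosen only for illustration.

```python
import math


def amplitude(f_drive, f_res, q, a_peak=1.0):
    """Steady-state amplitude of a driven damped harmonic oscillator.

    Scaled so that the amplitude is a_peak when f_drive equals f_res.
    """
    r = f_drive / f_res
    return (a_peak / q) / math.sqrt((1.0 - r * r) ** 2 + (r / q) ** 2)


def amplitude_change(f_drive, f_res, shift, q):
    """Change in amplitude at a fixed drive frequency when the resonance shifts."""
    return amplitude(f_drive, f_res + shift, q) - amplitude(f_drive, f_res, q)


# Assumed values for illustration only.
F_RES = 470e3                     # first-overtone resonance in Hz
Q = 500.0                         # quality factor of that mode
HALF_WIDTH = F_RES / (2.0 * Q)    # approximate half-width of the peak in Hz
SHIFT = -100.0                    # negative shift, as for an attractive interaction

below = amplitude_change(F_RES - HALF_WIDTH, F_RES, SHIFT, Q)
above = amplitude_change(F_RES + HALF_WIDTH, F_RES, SHIFT, Q)
far = amplitude_change(F_RES - 20.0 * HALF_WIDTH, F_RES, SHIFT, Q)

print(f"drive below resonance:     {below:+.4f}")
print(f"drive above resonance:     {above:+.4f}")
print(f"drive far below resonance: {far:+.4f}")
```

In this model, the same negative frequency shift raises the amplitude when driving below resonance and lowers it when driving above resonance, and the response nearly vanishes once the drive frequency sits many half-widths away from the peak.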

121 This result is verified in experiment.
122 This edge-enhancing technique is conceptually similar to that of phase contrast microscopy.
123 The artifact is caused by electric field asymmetry when the tip scans over a feature. When the tip approaches the feature, only the tip is interacting with the feature. When the tip scans over the feature, both the tip and the cantilever are interacting with the feature. The asymmetry in tip-feature interactions leads to asymmetry in amplitude response.


Figure C.3 Frequency shift on image contrast

The new operating frequency is set at the peak of the amplitude (ω2′) while the tip is engaged (black). When the tip scans over the protein, the amplitude drops while the phase increases (phase curve not shown) as a result of the attractive interaction (orange). When the tip scans over the DNA inside the protein, the interaction becomes less attractive due to charge neutralization or a decrease in capacitance gradient, resulting in a slight increase in frequency compared to when the tip scans over just the protein (light orange). Notably, because the sensitivity window (e.g. the part of the amplitude vs. frequency curve with notable slope, grey filled box) is narrow, the amplitude from the protein has better contrast (bigger change) while the amplitude change from the DNA inside the protein has lower contrast (smaller change).

Figure C.4 Operative Frequency on Image Contrast

A series of DREEM amplitude images of a MutSβ-DNA complex is shown in the left panel, and the model derived from them is shown on the right. (Left) Top left to top right: increasing operating frequency from below resonance to resonance. Notably, the amplitude increases on the protein, indicating attractive interaction (Figure C.1 red vs. black). When the operating frequency is on the resonance peak, edges of features are enhanced, resulting in ‘concentric circles’ on the protein. Bottom left to bottom right: increasing operating frequency from resonance to above resonance. Notably, the amplitude flips (decreases) on the protein for the same attractive interaction, consistent with operating on the other side of the resonance. Taken together, the different operating frequencies present different slices of the feature sensed by the sensitivity window (Figure C.3 grey box), with the DNA path inside the protein best resolved while the tip is operating on resonance. The half-moon shaped image artifact (the top-left image and the bottom-right image) on the protein can be observed as amplitude asymmetry viewed from the bottom-up angle (scan direction is 90°). (Right) Hand-drawn model derived from information overlays across all the DREEM images on the left. The DNA is modeled as a ladder (blue) and MutSβ is modeled as a hand-prayer shape with its two subunits labeled in red and orange. Considering that the DNA is bent at the intersection between the N-terminus domains, MutSβ is oriented such that the incoming DNA strand (top strand) enters the groove between the two N-terminus domains and exits from behind the protein (bottom strand). The DNA bend angles (the exterior angle of the angle formed by the two DNA strands) are labeled and show that the external bend angle (orange) can be very different from the internal bend angle (green). In this case, the measured internal bend angle (122°) is much larger than the external bend angle (71°).

It should be noted that although the optimal DC bias and sensitivity window for phase and amplitude are different, a sweet spot can sometimes be found to obtain good contrast in both the amplitude and the phase image. Ideally, we could sweep the DC bias and the operating frequency for every feature so that all areas of interest can be sampled and detected in a series of scans. Some modern instruments may already allow us to do that programmatically.

I also attempted experiments on the choice of cantilevers and grounding substrates, the treatment of tip and mica, imaging in aqueous solution [265-267], in lift mode [268], or in attractive mode, and better designs for sample preparation and preservation. The results are not conclusive and more studies are needed.

APPENDIX D. SUPPLEMENT FIGURES FOR IMAGE METRICS

Figure D.1 Welcome Screen of a Typical Module

Shown in the figure is the Image Processor module. In the welcome screen, recent files are listed on the left pane. Help topics, videos, and demos are on the right pane.


Figure D.2 Image Filter Module

In the Image Filter Module, users can interactively adjust filter parameters (top-left panel) while previewing changes (right panel), and save user parameters as custom filtering profiles. For example, when using the median filter as shown in the figure, users can choose the kernel size of the filter (height and width), and the outlier level (in the form of the standard deviation of the kernel, or SD) at which the filter applies. Pixels that fall within the specified SD are not modified. Further tuning of outlier removal can be adjusted in the Outlier Removal panel (middle-left panel). The outlier threshold can be used to control which part of the outlier spectrum (bottom-left panel) users want to apply the filter to, and the threshold size of the outlier region at which the filter would apply. In addition, users can crop regions of interest where the filter would apply. These features are useful in fine-tuning the image correction process, such as only removing the scan line noise in AFM images (Section 3.4.1).
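The idea of a median filter that only touches outlier pixels can be sketched in a few lines of Python. This is a minimal illustration of the concept described in the caption, not the Image Metrics implementation (which is built in MATLAB); the function name, parameter names, default values, and the choice of the kernel mean as the reference for the SD test are all assumptions.

```python
import statistics


def outlier_median_filter(image, kernel_h=3, kernel_w=3, sd_level=2.0):
    """Replace a pixel with the kernel median only if it is an outlier.

    A pixel counts as an outlier when it deviates from the kernel mean by
    more than sd_level standard deviations of the kernel. Pixels within
    that range are left unmodified. Kernels are truncated at the borders.
    """
    rows, cols = len(image), len(image[0])
    half_h, half_w = kernel_h // 2, kernel_w // 2
    out = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - half_h), min(rows, r + half_h + 1))
                for cc in range(max(0, c - half_w), min(cols, c + half_w + 1))
            ]
            mean = statistics.fmean(window)
            sd = statistics.pstdev(window)
            if sd > 0 and abs(image[r][c] - mean) > sd_level * sd:
                out[r][c] = statistics.median(window)
    return out


# Hypothetical 5x5 height map: a gentle ramp with one spike at the center.
ramp = [[float(r + c) for c in range(5)] for r in range(5)]
noisy = [row[:] for row in ramp]
noisy[2][2] = 100.0

filtered = outlier_median_filter(noisy)
print("center pixel before:", noisy[2][2], "after:", filtered[2][2])
```

One design consequence worth noting: a single spike in a 3x3 kernel can deviate from the kernel mean by at most about 2.8 kernel standard deviations, so the SD level has to be set below that for an isolated spike to be caught with this kernel size.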


Figure D.3 Action History

The Action History module lists all past user actions. Users can re-run those actions on the current image or on multiple images. Actions can be exported as scripts or macros.


Figure D.4 Macro Builder

The Macro Builder module is the central place to create, edit, and manage user macros. Users can build macros by choosing from a list of existing functions and controls. Users can adjust the value and action of each control (blue box) and adjust the order of the controls (green box). A preview and description of each control is shown in the control gallery (red box). The script underlying each macro action can also be edited (purple box).


Figure D.5 Batch Processing through Macros

Images can be batch processed with macros. A progress bar for the current processing is shown at the bottom. Users can choose to pause, stop, or continue the current macro. While the macro is paused, users can inspect the changes and make modifications to the image without stopping the macro completely.


Figure D.6 Script Editor

The Script Editor module provides a full featured editor that can make, open, and edit user scripts. It integrates with the built-in command console and debugger so that users can test their script as they write. Scripts can then further be integrated into macros and as functions inside Image Metrics.


Figure D.7 Image Acquisition

This module allows users to capture images or videos from external hardware connected to the computer. Currently, this module supports data acquisition from hardware supported by the MathWorks hardware support packages and has full control over the camera operations as supported by the camera driver. This addition extends the use of this program to field studies and studies involving optical microscopes. Support for more hardware can be implemented as long as the hardware is supported in MATLAB. Images in the clipboard are also supported. The images can then be imported into other modules for processing.


Figure D.8 Image Browser

The Image Browser module allows users to preview images in a gallery format. Users can drop images or image folders into this module directly or use the navigation pane to manually locate the images in folders. Double-clicking an image will peek into the different data channels within that image. Operations connecting to all other modules can be accessed by right-clicking an image.


Figure D.9 Command Console

The Command Console is the place to run short scripts and commands in real time (blue box). Users can also view the history of previous commands that were run, and details of errors that were generated during the usage of the program (green box). Users can create variables using commands and the variables will be stored in the workspace (red box), which also allows them to inspect & modify custom variables and built-in program variables directly.


Figure D.10 Shape Matching Analysis of MutSα-DNA Complexes

A. Overview of MutSα-DNA complexes. B. Alignment of MutSα-DNA complexes to the template (red-titled). A conformation (‘medium long bend’) is named and assigned to selected particles (tagged by a green disc in their top left corner). C. Correlation averaging. Compare the averaged image (middle) to the template (left). Since MutSα bends the DNA at multiple angles, increasing the contrast of the averaged image reveals several DNA paths (right, with green line overlay showing the DNA paths). D. Distributions of all categorized conformations after repeating the previous procedure and cycling through all possible conformations.


Figure D.11 App Manager

Left side shows the list of user apps that are currently loaded. Right side shows the contents (script or macro) of the App.

APPENDIX E. SUPPLEMENT FIGURES FOR MMR AND TNR

Figure E.1 Restriction map of DNA substrates

Numbers in brackets indicate the position of digestion sites (green, single cut; blue, double cut) or the slip-out location (red).

Figure E.2 Position Distribution of Homoduplex from the Double-cut DNA

No preference of binding is seen on this homoduplex DNA, indicating non-specific binding does not show preference towards any binding sites across the DNA.

[Plot: bar chart of Relative Frequency (%) versus Number of MutSβ complexes per DNA (1, 2, 3), with series "ADP" and "ATP"; y-axis from 0 to 100.]

Figure E.3 Stoichiometry of Protein Complexes per DNA on Homoduplex from the Double-cut DNA

The majority of binding events are single binding events, suggesting that multiple loading is prohibited due to the low binding affinity to non-specific sites.

APPENDIX F. IMAGE ARTIFACTS

It should always be remembered that an AFM image is the result of tip-sample interaction, and anything that derails that interaction will trigger the AFM feedback system to adjust for or against proper tracking of the surface, depending on the nature of the interactions. In fact, the interactions between tip and sample are so complex that it is often easier to visualize an AFM image as an interaction map rather than a topographic map in order to understand the image features and image artifacts and to distinguish them from one another. For example, chemical treatment of the surface and residual salt or water from buffer can lead to a patchy surface (appearing as shallow swamps on the image background that cannot be flattened) due to the additional attractive and/or capillary forces acting on the tip when scanning over areas with salt or water (Figure F.1A blue box). The image will also become noisy if the interaction is weak (low signal), not tracked properly (e.g. scanning too fast, Figure F.1C), or unstable (e.g. contaminated tip or sticky surface, more below).

An AFM image is also the convolution of tip and sample geometry, and because no tip is infinitely sharp, imaged sample features are always dilated. Any degradation or contamination of the tip and/or the sample may change their geometries and therefore alter the final appearance of image features. For example, a contaminated tip that picks up debris from its surroundings can significantly change its mass, geometry, and electrostatic and/or mechanical properties. This contamination can worsen the tip dilation effect and broaden surface features (i.e. fat tip, Figure F.1B). It could also alter the tip-sample interactions, derailing proper tracking of the surface and resulting in a noisy surface (e.g. spiky or streaky surface). In the worst cases, the contamination can form new ‘tips’ on the AFM cantilever that simultaneously scan over the sample and lead to recurrence of the same sample features (double or triple tip, Figure F.1B red boxes); or a tip can grow so large that, instead of the tip scanning over the sample, the sample is scanning over the tip, resulting in features that resemble the tip more than the sample. Sample features can also be blown up if too little imaging force is used, resulting in poor tracking of the features, or be flattened and stripped of their fine details if too much imaging force is applied. Biomolecules are particularly vulnerable to excessive imaging force, and they often lose their fine structural details after being repeatedly scanned.
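The tip dilation effect described above is commonly modeled as a grayscale (morphological) dilation of the sample profile by the tip shape. The Python sketch below applies that model to a one-dimensional profile. It is a simplified illustration only: the parabolic tip, its 10 nm radius, the 1 nm pixel size, and the 2 nm tall feature are assumed values, not parameters from this work.

```python
def dilate_profile(sample, tip):
    """Grayscale dilation of a 1D height profile by a symmetric tip.

    image[i] = max over tip offsets j of (sample[i + j] + tip[j]), where
    tip heights are <= 0 and equal 0 at the apex (the middle of the list).
    """
    half = len(tip) // 2
    n = len(sample)
    image = []
    for i in range(n):
        best = float("-inf")
        for j, t in enumerate(tip):
            k = i + j - half
            if 0 <= k < n:
                best = max(best, sample[k] + t)
        image.append(best)
    return image


def parabolic_tip(radius, half_width):
    """Tip surface relative to its apex, negated, sampled at 1 nm steps."""
    return [-(j * j) / (2.0 * radius) for j in range(-half_width, half_width + 1)]


def width_above(profile, level):
    """Number of pixels whose height exceeds the given level."""
    return sum(1 for z in profile if z > level)


# Assumed sample: flat surface with a 3-pixel-wide, 2 nm tall feature.
sample = [0.0] * 101
for i in (49, 50, 51):
    sample[i] = 2.0

image = dilate_profile(sample, parabolic_tip(radius=10.0, half_width=10))
print("feature width at half height, sample:", width_above(sample, 1.0))
print("feature width at half height, image: ", width_above(image, 1.0))
```

In this model the imaged feature keeps its height but becomes wider, and the broadening grows with the tip radius, which is the "fat tip" behavior described above.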

AFM images can also suffer from environmental or equipment noise that mostly impacts vertical measurement (Figure F.1D, top left), and mechanical or thermal drift that mostly impacts lateral measurement (Figure F.1B red box). Low-frequency noise often mingles deep into the frequency space of surface features and cannot be easily corrected, while high-frequency noise can be remedied through image filters (Gaussian, median, etc.) or FFT (footnote 124) analysis (Figure F.1D). Noise can also be reduced if more images of the same area are collected (Figure F.1C). Image drift can be partially remedied by modeling and calculating the drift parameters and applying the reverse transformation to the image and/or surface features.
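Removing periodic noise in the frequency domain can be illustrated on a single scan line. The Python sketch below uses a naive discrete Fourier transform so that it stays self-contained; a real analysis would use an FFT routine on the full two-dimensional image. The scan line, the noise frequency, and the masked bin are assumed values for illustration.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def remove_frequency(signal, bin_index):
    """Zero one frequency bin and its mirror bin, then transform back."""
    n = len(signal)
    spectrum = dft(signal)
    spectrum[bin_index] = 0
    spectrum[(n - bin_index) % n] = 0
    return idft(spectrum)


# Assumed scan line: a smooth bump (the feature) plus periodic noise in bin 16.
N = 64
feature = [math.exp(-((t - 32) / 6.0) ** 2) for t in range(N)]
noisy = [f + 0.5 * math.sin(2 * math.pi * 16 * t / N) for t, f in enumerate(feature)]
cleaned = remove_frequency(noisy, 16)

residual = max(abs(c - f) for c, f in zip(cleaned, feature))
print(f"largest deviation from the noise-free line after masking: {residual:.2e}")
```

The masking works well here because the noise sits in a bin where the feature has almost no spectral content. When noise and feature overlap in frequency, as described above for low-frequency noise, masking also removes part of the feature.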

124 FFT – Fast Fourier Transform


Figure F.1 Other Image Artifacts

A. ‘Patchy surface’ artifact (e.g. blue box) caused by chemical treatment: dips appear on a surface that should be flat. B. ‘Fat tip’ and ‘double tip’ artifacts: DNAs and proteins are much thicker and larger compared to A. Taller proteins (e.g. red boxes) always appear as the same two-lobed globes that mirror the shape of the double tip. C. Improving signal to noise using correlation averaging. Left – template image of a fast-scanned image; Right – 20 images averaged. D. Removing periodic noise using FFT. Top left – original image with a certain high-frequency noise; Bottom left – FFT of the original image, with dark bands in the white boxes masked; Top right – inverse of the FFT image (corrected image) with masks excluded; notice that the high-frequency noise in the original image is gone; Bottom right – difference image between the original and the corrected image, which shows the filtered high-frequency noise.

Collectively, these factors inevitably add inaccuracies to height measurement of sample features due to complications in tip-sample interactions and/or environmental noise, and they can transform features laterally due to changes in tip-sample geometries and/or equipment drifting.

Although some of these artifacts can arguably be mitigated (heights can be normalized, noise can be filtered, tip dilation can be deconvoluted, etc.), it is debatable whether it is worthwhile to do so, given that the correction can involve very laborious calculations, inadvertently remove useful information, or even introduce artificial information into the data. After all, it is much easier to collect more quality data than to analyze dubious data, and most of these artifacts can be fixed through better experimentation and instrumentation. In our experience with protein-DNA AFM studies, we always treat images with these types of artifacts with great caution, either discarding them or selectively analyzing them, and we always resort to taking more quality images instead if conditions permit.

Interested readers can also refer to the following references on general AFM image artifacts: [110]; [90], Chapter 6; and [109], Section 3.5.

APPENDIX G. PARTICLE CLASSIFICATION ALGORITHMS

A lot of the pioneering work in particle classification has been done in EM [139]. In EM, particle images are often re-projected into their factorial space (eigenvector space) through multivariate statistical analysis (MSA) such as principal component analysis (PCA) and correspondence analysis (CA) [139, 146]. This technique effectively reduces the particle images into their constituent components, or eigenimages. Particles with similar structures cluster near each other in the factorial space, thereby removing ambiguities for manual classification. In Image Metrics, a similar eigenanalysis is developed (Appendix G.I), but instead of reducing particles into eigenvectors of individual components, the particles are grouped into eigenvectors of their pairwise correlations (footnote 125). In other words, in EM particles are grouped through similarities in their constituent components, whereas in Image Metrics particles are grouped through similarities in their pairwise correlations.

Particles can also be automatically classified through clustering analysis such as hierarchical ascendant classification (HAC) and K-means classification. In EM, clustering is performed through minimizing distances from particles to class averages (such as those used in multi-reference alignment) so that variations within a class can be minimized. In Image Metrics, clustering is performed through maximizing distances from class to class so that variations in between classes can be maximized. See Appendix G.II for more details.

Taken together, the difference in approaches to the same problem means that EM focuses on classifying particles by minimizing component variations, whereas Image Metrics focuses on classifying particles by maximizing their conformational differences.

125 The correlation matrix is similar to the covariance matrix defined in the principal component analysis.

G.I. Classification through Eigenanalysis

Particle classification can be performed through eigenanalysis of the particles' pairwise correlations, an analysis rooted in linear algebra [269]. Suppose we have three particles with a complete pairwise correlation map (footnote 126; Eq. G. 1, see also Table G.1) and we want to arrange them into groups based on their similarities. The easiest way is to represent each group as an ensemble of the particles, in the form of a linear summation of the particles (Eq. G. 2), also known as a representation. This forms the initial basis for grouping.

Table G.1 Particle Alignment and Correlation

Particles (xi) of different shapes are shown in the left column. They are aligned as described in Section 3.5.2B (middle column). The maxima of their pair-wise cross-correlation after alignment are shown in the right column. These correlation scores are used to construct the correlation map/matrix (e.g. Eq. G. 1).

126 See Section 3.5.2B

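$$
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \tag{G.1}
$$

$$
\begin{cases} y_1 = a_{11}x_1 + a_{12}x_2 + a_{13}x_3 \\ y_2 = a_{21}x_1 + a_{22}x_2 + a_{23}x_3 \\ y_3 = a_{31}x_1 + a_{32}x_2 + a_{33}x_3 \end{cases} \tag{G.2}
$$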

The physical meaning of the coefficients $a_{mn}$ becomes clear when the linear equations are converted to their matrix form (Eq. G. 3), where vector $\mathbf{y}$ represents the particle groups and vector $\mathbf{x}$ represents the particles. Although $a_{mn}$ is the correlation score in the correlation map, it also indicates how likely particle $n$ is to belong to group $m$.

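$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \tag{G.3}
$$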

Since all $x_n$ have overlapping similarities (i.e. non-zero correlations for any particle pair), the $y_n$ are not independent. In other words, one group may contain particles from another group (known as a dependent group). Independent grouping requires the unambiguous, unique mapping of each individual particle to a group. The problem of grouping, therefore, lies in seeking the independent solutions of the linear system in Eq. G. 3: more specifically, solving for the independent $y_n$ (groups) and removing the dependent and redundant $y_n$, thereby reducing the number of groups and regrouping the particles.

For example, for a given correlation map (Eq. G. 4), initial grouping results in three dependent groups (푦1, 푦2, 푦3) with overlapping member particles (푥1, 푥2, 푥3) (Eq. G. 6).

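$$
\begin{bmatrix} 1 & 0.3 & 0.7 \\ 0.3 & 1 & 0.6 \\ 0.7 & 0.6 & 1 \end{bmatrix} \tag{G.4}
$$

$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 1 & 0.3 & 0.7 \\ 0.3 & 1 & 0.6 \\ 0.7 & 0.6 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \tag{G.5}
$$

$$
\begin{cases} y_1 = x_1 + 0.3x_2 + 0.7x_3 \\ y_2 = 0.3x_1 + x_2 + 0.6x_3 \\ y_3 = 0.7x_1 + 0.6x_2 + x_3 \end{cases} \tag{G.6}
$$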

Our goal is to transform the correlation matrix such that the dependencies among the groups are removed, i.e. the groups will have non-overlapping member particles. Essentially, we aim to change the basis of the vector space underlying the matrix (i.e. $x_m$, the particles) and replace it with a new basis representing the independent particle groups ($y'_m$), such that the new groups ($y''_m$) are devoid of dependencies (footnote 127; Eq. G. 7, G. 8). In linear algebra, this goal can be achieved through diagonalization of the matrix (footnote 128) by solving for its eigenvalues and eigenvectors. Each eigenvalue $\lambda_n$ equals the size of the corresponding group (i.e. the number of particles inside the group), the number of eigenvalues $n$ represents the number of groups, and the eigenvectors ($y'_m$) form the new, independent basis of the new matrix, in which the dependent, off-diagonal terms are removed (Eq. G. 7).

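$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \xrightarrow{\text{Diagonalization}} \begin{bmatrix} y''_1 \\ y''_2 \\ y''_3 \end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} \begin{bmatrix} y'_1 \\ y'_2 \\ y'_3 \end{bmatrix} \tag{G.7}
$$

$$
\begin{cases} y''_1 = \lambda_1 y'_1 \\ y''_2 = \lambda_2 y'_2 \\ y''_3 = \lambda_3 y'_3 \end{cases} \tag{G.8}
$$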

The problem, however, is that although diagonalization guarantees the independence of the eigenvectors ($y'_n$), it does not guarantee the independence (i.e. non-overlap) of the particles ($x_m$) comprising these vectors. The reason is that, mathematically, overlapping usage of basis vectors is permitted (footnote 129), whereas in our case it is not.

127 The mechanism of the change of basis representation is the same as that of a coordinate transformation between Cartesian coordinates and spherical coordinates.

128 Also called a similarity transformation to the diagonal form.

129 E.g. in 2D geometry, vector $y = x$ and vector $y = -x$ are independent, but they share the same coordinate basis vectors ($x$, $y$). Vector $y = 1$ and vector $x = 1$ are also independent, and they do not share the same basis vectors. Vector $y = 1$ is independent of basis vector $x$, whereas vector $x = 1$ is independent of basis vector $y$.

To remove the overlapping dependencies of particles among groups, I formulated a similarity scoring system that simultaneously boosts the overlapping scores (more on that later) of similar particles and attenuates those of dissimilar particles. This scoring system calculates the degree of similarity of a particle pair based on the overlap of the particles that they are each similar to, scores the pair accordingly (the overlapping score), and replaces the old score (e.g. the correlation score) with this new score. Whether two particles are considered similar is decided by a user-specified threshold (called the overlapping threshold): if a particle pair scores higher than the threshold, the two particles are considered similar. Essentially, if a particle pair has enough overlapping matching particles, it will be scored higher; otherwise, it will be scored lower.

Collectively, the correlation score and the overlapping score are called the matching score (푀).

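This process is repeated iteratively until the scores no longer change, that is, until all the scores are either 0 or 1. The correlation matrix then becomes a binary matrix in which particle overlap among groups is completely removed (Eq. G. 9, Figure G.1) and redundant groups are merged (Eq. G. 10). This method is conceptually similar to the maximum likelihood methods in image classification [270-272].

$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \xrightarrow{\text{Similarity Transformation Example}} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \tag{G.9}
$$

$$
\begin{cases} y_1 = y_2 = x_1 + x_2 \\ y_3 = x_3 \end{cases} \tag{G.10}
$$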

Image Metrics provides multiple similarity scoring approaches to estimate the overlapping score. One approach is to calculate the Jaccard index [144], which is defined as the intersection of the matching over the union of the matching, i.e.

$$
M_{ij} = M_{ji} = \frac{\#\,\text{Intersection of matching particles from particle pair } (i, j)}{\#\,\text{Union of matching particles from particle pair } (i, j)} \tag{G.11}
$$

where $M_{ij}$, the matching score, is any element of the to-be-updated correlation matrix, and # is interpreted as the number of particles inside the intersection or union specified in the equation.

Using the previous example, supposing the correlation threshold and the overlapping threshold are 0.6 (they do not have to be the same), we have particle 1 and 3 match up while leaving particle 2 alone after applying the threshold (Eq. G. 12).

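$$
\begin{bmatrix} 1 & 0.3 & 0.7 \\ 0.3 & 1 & 0.6 \\ 0.7 & 0.6 & 1 \end{bmatrix} \xrightarrow{\text{Applying threshold}} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \tag{G.12}
$$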

Using Eq. G. 11, we can update the correlation matrix – for example, the new score for particle pair (1,3) will be

$$
M_{13} = M_{31} = \frac{2}{3} \approx 0.67 \tag{G.13}
$$

This result is worked out as follows. In the binary matrix (Eq. G. 12) particle 1 matches up with particle 1 and 3 while particle 3 matches up with particle 1, 2, and 3. The intersect of the matches is 1 and 3, whereas the union of the matches is 1, 2 and 3. Therefore two particles are in the intersection whereas three particles are in the union, and hence 2 in the numerator and 3 in the denominator.
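The counting argument above can be checked with a few lines of code. The following is a minimal sketch in Python rather than the Image Metrics (MATLAB) implementation; the function names `threshold` and `jaccard_score` are hypothetical, and the inclusive comparison (`>=`) is an assumption inferred from Eq. G. 12, where a score of 0.6 maps to 1 at a threshold of 0.6.

```python
def threshold(scores, cutoff):
    """Binarize a score matrix: 1 where score >= cutoff (assumed inclusive), else 0."""
    return [[1 if s >= cutoff else 0 for s in row] for row in scores]


def jaccard_score(binary, i, j):
    """Jaccard index of the match sets of particles i and j (0-based indices)."""
    matches_i = {k for k, hit in enumerate(binary[i]) if hit}
    matches_j = {k for k, hit in enumerate(binary[j]) if hit}
    union = matches_i | matches_j
    return len(matches_i & matches_j) / len(union) if union else 0.0


# Correlation map of the three-particle example (Eq. G. 4).
corr = [[1.0, 0.3, 0.7],
        [0.3, 1.0, 0.6],
        [0.7, 0.6, 1.0]]

binary = threshold(corr, 0.6)
m13 = jaccard_score(binary, 0, 2)  # particles 1 and 3: 2 shared matches out of 3
m12 = jaccard_score(binary, 0, 1)  # particles 1 and 2: 1 shared match out of 3
```

Representing each row of the binary matrix as a set of matching particles keeps the intersection and union counts explicit, which mirrors how the score is worked out by hand.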

Similarly, $M_{12} = M_{21} = \frac{1}{3} \approx 0.33$ and $M_{23} = M_{32} = \frac{2}{3} \approx 0.67$, and the correlation matrix is updated (Eq. G. 14). However, when repeating the process (by applying the same threshold), one finds that the binary matrix remains the same (Eq. G. 15), which means the matrix will not change during the next iteration. Since the grouping is incomplete, in that overlap among particles 1, 2, and 3 still exists, the solution is to put all three particles in the same group (Eq. G. 15). This result is an anomaly of the transformation in which the matrix cannot be fully diagonalized through iterations.

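$$
\begin{bmatrix} 1 & 0.3 & 0.7 \\ 0.3 & 1 & 0.6 \\ 0.7 & 0.6 & 1 \end{bmatrix} \xrightarrow{\text{Similarity Transformation}} \begin{bmatrix} 1 & 0.33 & 0.67 \\ 0.33 & 1 & 0.67 \\ 0.67 & 0.67 & 1 \end{bmatrix} \tag{G.14}
$$

$$
\begin{bmatrix} 1 & 0.33 & 0.67 \\ 0.33 & 1 & 0.67 \\ 0.67 & 0.67 & 1 \end{bmatrix} \xrightarrow{\text{Applying threshold}} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \xrightarrow{\text{No longer changing, force merging}} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \tag{G.15}
$$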

Perhaps a better example is transforming the correlation map of four particles. Suppose we have the following correlation map extended from the previous example while using the same threshold –

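$$
\begin{bmatrix} 1 & 0.3 & 0.7 & 0.4 \\ 0.3 & 1 & 0.6 & 0.8 \\ 0.7 & 0.6 & 1 & 0.2 \\ 0.4 & 0.8 & 0.2 & 1 \end{bmatrix} \xrightarrow{\text{Applying threshold}} \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \tag{G.16}
$$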

Performing similarity scoring, we have, for example

$$
M_{12} = M_{21} = \frac{1}{4} = 0.25 \tag{G.17}
$$

Completing the calculation for the other particle pairs, we obtain the updated correlation matrix:

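$$
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \xrightarrow{\text{Similarity transformation}} \begin{bmatrix} 1 & 0.25 & 0.67 & 0 \\ 0.25 & 1 & 0.5 & 0.67 \\ 0.67 & 0.5 & 1 & 0.25 \\ 0 & 0.67 & 0.25 & 1 \end{bmatrix} \xrightarrow{\text{Threshold}} \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \tag{G.18}
$$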

The matrix stops changing because all overlapping terms are removed, and grouping is complete.
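The iteration for the four-particle example can be reproduced with a short script. This is a sketch under stated assumptions, written in Python rather than taken from Image Metrics: the helper names are hypothetical, the threshold comparison is assumed to be inclusive, and the forced merging used for the three-particle anomaly is deliberately not implemented.

```python
def binarize(scores, cutoff):
    """1 where score >= cutoff (assumed inclusive), else 0."""
    return [[1 if s >= cutoff else 0 for s in row] for row in scores]


def jaccard_update(binary):
    """Replace every score by the Jaccard index of the two particles' match sets."""
    n = len(binary)
    sets = [{k for k in range(n) if binary[i][k]} for i in range(n)]
    return [[len(sets[i] & sets[j]) / len(sets[i] | sets[j]) if sets[i] | sets[j] else 0.0
             for j in range(n)] for i in range(n)]


def similarity_transform(corr, cutoff, max_iter=100):
    """Alternate scoring and thresholding until the binary matrix stops changing."""
    binary = binarize(corr, cutoff)
    for _ in range(max_iter):
        updated = binarize(jaccard_update(binary), cutoff)
        if updated == binary:
            break
        binary = updated
    return binary


def groups_from_binary(binary):
    """List the distinct rows as groups, using 1-based particle labels."""
    groups = []
    for row in binary:
        members = [k + 1 for k, hit in enumerate(row) if hit]
        if members not in groups:
            groups.append(members)
    return groups


# Four-particle correlation map from Eq. G. 16, threshold 0.6.
corr4 = [[1.0, 0.3, 0.7, 0.4],
         [0.3, 1.0, 0.6, 0.8],
         [0.7, 0.6, 1.0, 0.2],
         [0.4, 0.8, 0.2, 1.0]]

final_binary = similarity_transform(corr4, 0.6)
groups = groups_from_binary(final_binary)
```

If the loop stops with rows that still overlap, as in the three-particle example, `groups_from_binary` returns overlapping groups, and a separate merging step would be needed.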

To see the independent groups and their member particles, it is probably easier to rearrange the binary matrix (Eq. G. 18) into its block diagonal form (footnote 130) by shuffling its basis vector $\mathbf{x}$ (i.e. shuffling the particle order).

130 Not to be confused with the conventional diagonal form. Blocks in the block form indicate subspaces of the linear space.

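$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \xrightarrow{\text{Rearrange}} \begin{bmatrix} y_1 \\ y_3 \\ y_2 \\ y_4 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_3 \\ x_2 \\ x_4 \end{bmatrix} \tag{G.19}
$$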

Here it becomes obvious that there are two independent groups, one composed of particles 1 and 3, and the other composed of particles 2 and 4.

$$
\begin{cases} y_1 = y_3 = x_1 + x_3 \\ y_2 = y_4 = x_2 + x_4 \end{cases} \tag{G.20}
$$

The independent groups in Eq. G. 19 (shuffle not required) can also be directly diagonalized by solving for the eigenvalues and eigenvectors, as discussed earlier:

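$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \xrightarrow{\text{Diagonalize}} \begin{bmatrix} y''_1 \\ y''_2 \\ y''_3 \\ y''_4 \end{bmatrix} = \begin{bmatrix} 2 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} y'_1 \\ y'_2 \\ y'_3 \\ y'_4 \end{bmatrix} \tag{G.21}
$$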

Here the number of non-zero eigenvalues represents the number of independent groups. Each eigenvalue gives the size of its group (footnote 131), e.g. 2 means a group with two particles. The eigenvectors $y'_n$ (i.e. groups) with non-zero eigenvalues contain the constituent, non-overlapping $x_n$ (i.e. particles), which are

$$
\begin{cases} y'_1 = x_1 + x_3 \\ y'_2 = x_2 + x_4 \end{cases} \tag{G.22}
$$
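The eigenvalues quoted for the binary four-particle matrix can be verified directly from the definition of an eigenvector, without a numerical library. The sketch below is in Python for illustration only; the candidate vectors are written in the particle basis ($x_1$ to $x_4$), are not normalized, and the helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def eigenvalue_of(matrix, vector, tol=1e-12):
    """Return lam if matrix @ vector == lam * vector, otherwise None."""
    image = matvec(matrix, vector)
    pivot = next(k for k, v in enumerate(vector) if v != 0)
    lam = image[pivot] / vector[pivot]
    if all(abs(im - lam * v) < tol for im, v in zip(image, vector)):
        return lam
    return None


# Converged binary matrix of the four-particle example (Eq. G. 18).
B = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]

candidates = {
    "x1 + x3": [1, 0, 1, 0],   # group of particles 1 and 3
    "x2 + x4": [0, 1, 0, 1],   # group of particles 2 and 4
    "x1 - x3": [1, 0, -1, 0],  # null direction
    "x2 - x4": [0, 1, 0, -1],  # null direction
}

eigenvalues = {name: eigenvalue_of(B, vec) for name, vec in candidates.items()}
trace = sum(B[i][i] for i in range(len(B)))  # equals the sum of all eigenvalues
```

The two non-zero eigenvalues correspond to the two groups, and their sum equals the trace of the matrix, i.e. the total number of particles.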

The independent groups can be further divided, a process which I call sub-grouping (Figure G.1). The initial grouping attempt uses the whole correlation matrix and is therefore often contaminated by bad matches that can degrade the quality of the similarity scoring. Sub-grouping re-scores the similarity of the current group by using only particles within the group, thereby improving the matches. The groups can be sub-grouped repeatedly until a minimum group size is reached, or until sub-grouping is no longer possible under the designated threshold.

131 This result is similar to EM eigenanalysis where the order of the factor (eigenvalue) determines the dominance of that factor [115].

The independent groups can also be merged, a process which I call super-grouping (Figure G.2). This method is useful for merging similar groups into super-groups if similar particles were divided into subgroups during the sub-grouping process. In this process, groups are used as seeds for the same transformation procedure described earlier, except that the particle correlation map is replaced by the group correlation map. The correlation between two groups is calculated as the average correlation between the particles of one group and those of the other group. Because the groups are being merged instead of divided, users end up with super-groups instead of sub-groups. The merging process is repeated until the groups reach a user-specified size, or until super-grouping is no longer possible under the designated threshold.
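The group correlation map used for super-grouping can be sketched as follows. This is an illustrative Python sketch, not the Image Metrics code: the function names are hypothetical, groups are given as lists of 0-based particle indices, and setting the diagonal of the group map to 1 is an assumption made so that the map has the same form as a particle correlation map.

```python
def group_correlation(corr, group_a, group_b):
    """Average correlation over all particle pairs drawn from the two groups."""
    pairs = [(i, j) for i in group_a for j in group_b]
    return sum(corr[i][j] for i, j in pairs) / len(pairs)


def group_correlation_map(corr, groups):
    """Square map of group-to-group correlations (diagonal assumed to be 1)."""
    n = len(groups)
    return [[1.0 if a == b else group_correlation(corr, groups[a], groups[b])
             for b in range(n)] for a in range(n)]


# Four-particle correlation map (Eq. G. 16) and the two groups found earlier:
# particles 1 and 3, and particles 2 and 4 (0-based indices below).
corr4 = [[1.0, 0.3, 0.7, 0.4],
         [0.3, 1.0, 0.6, 0.8],
         [0.7, 0.6, 1.0, 0.2],
         [0.4, 0.8, 0.2, 1.0]]
groups = [[0, 2], [1, 3]]

group_map = group_correlation_map(corr4, groups)
```

For this example the between-group correlation is (0.3 + 0.4 + 0.6 + 0.2)/4 = 0.375, so with a threshold of 0.6 the two groups would remain separate.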


Figure G.1 Sub-grouping

Top left. Sub-grouping classes into sub-classes. The colors represent different levels of the subgroups. Bottom left. The same tree as top left, but with the leaves re-colored and re-labeled. Top right. Sub-grouping as represented in the diagonalized correlation matrix. Block cells of 1 represent the individual groups and are colored the same as the tree in top left. Bottom right. Sub-grouping matrix corresponding to the tree in bottom left.


Figure G.2 Super-grouping

The leaves in Figure G.1 are merged into super-groups. Each leaf $y'_n$, if similar enough, is merged into a node $y_n$. Final groups are colored green to distinguish them from the initial leaves (red). The matrix illustrates the super-grouping process of merging smaller blocks or unit cells of 1 into bigger block cells.

Taken together, the eigenanalysis developed here can be scaled linearly to groups of any level, thereby generating a tree structure in which the leaves are individual particles or particle groups and the nodes are groups (Figure G.1, Figure G.2). This process is conceptually similar to the hierarchical clustering tree (see Section G.II). The key to perfecting particle classification lies in the fine tuning of the sub-grouping and super-grouping processes or, using the analogy of the hierarchical tree, in the fine branching and pruning of the tree. Sub-grouping prevents ambiguous particles from contaminating a class, while super-grouping merges highly similar groups into the same class. By choosing the right grouping parameters, Image Metrics is able to achieve high accuracy for unaided particle classification using this iterative eigenanalysis (see Section G.III).

G.II. Classification through Clustering Analysis

Clustering analysis relies on calculating the distance between particle attributes and subjecting the distance to one of many clustering algorithms, including hierarchical ascendant clustering (HAC) and K-means clustering [139, 145, 273]. In Image Metrics, the attributes can be the pairwise correlations of particles or other particle shape metrics. For instance, using the pairwise correlation matrix in Eq. G. 1, we have the attributes of particle 1 as (푎11, 푎12, 푎13), and the attributes of particle 2 as (푎21, 푎22, 푎23). The distance can be calculated as the Euclidean distance (or using other distance metrics) between these two attributes –

$$
d_{12} = \sqrt{(a_{11} - a_{21})^2 + (a_{12} - a_{22})^2 + (a_{13} - a_{23})^2} \tag{G.23}
$$
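As a numerical illustration of the distance above, the sketch below treats each row of the correlation matrix as the attribute vector of one particle and computes the Euclidean distance between rows. It is written in Python for illustration; the function names are hypothetical, and the values are those of the three-particle example in Eq. G. 4.

```python
import math


def attribute_distance(corr, i, j):
    """Euclidean distance between the correlation rows of particles i and j (0-based)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(corr[i], corr[j])))


def distance_matrix(corr):
    """Full pairwise distance matrix built from the rows of the correlation matrix."""
    n = len(corr)
    return [[attribute_distance(corr, i, j) for j in range(n)] for i in range(n)]


corr = [[1.0, 0.3, 0.7],
        [0.3, 1.0, 0.6],
        [0.7, 0.6, 1.0]]

dist = distance_matrix(corr)
d12 = dist[0][1]  # sqrt(0.49 + 0.49 + 0.01)
d13 = dist[0][2]  # sqrt(0.09 + 0.09 + 0.09)
```

In this example particles 1 and 3, which have the highest correlation (0.7), also end up with the smallest attribute distance.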

The attributes can also be the individual pixels of the aligned particles, in which case the distance is the correlation of the particle pair itself, i.e. $d_{12} = a_{12} = a_{21}$. After the distance matrix $d_{ij}$ is obtained, it is subjected to one of the many clustering algorithms available through MATLAB’s Statistical Toolbox (footnote 132). At the time of this writing, HAC and K-means are incorporated in Image Metrics. For example, Figure G.3 shows the HAC implementation in the software. In Image Metrics, since class averages are not used in the distance calculation, clustering relies on the inherent correlations between particle pairs instead of the correlations between particles and class averages. This distinction allows clusters in Image Metrics to include particles that have minor conformational differences but are still inherently correlated.

As discussed later in Section G.III, the major parameters involved in HAC are the distance cutoff or maximum number of clusters (for the flat cutoff scheme) and the inconsistency coefficient cutoff and depth (for the inconsistency scheme). The major parameter involved in K-means clustering is the cluster number k. The silhouette values are often used as indicators of the quality of the clusters generated. For detailed concepts and best practices on these parameters and their optimization, users can refer to Section G.III for synthetic examples that serve both as a guide and as verification. Interested readers can also consult the MATLAB documentation for best practices on choosing the optimal parameters for each clustering scheme.

132 https://www.mathworks.com/help/stats/introduction-to-cluster-analysis.html

Figure G.3 Hierarchical Ascendant Clustering (HAC)

Left. HAC tree structure. The tree can be pruned at a user-designated distance threshold (~0.7, red arrow). Pruned clusters are labeled in assorted colors. The number of clusters, 8, can be seen from their colors or from the number of non-pruned branches at the lowest level (blue). Right. Zoomed-in area between the blue arrows in the left figure. The clustering process can be seen from the deepest level.

G.III. Verifications and Best Practices

To demonstrate the usage and accuracy of different classification schemes, artificial data are generated in the Image Simulator module (APPENDIX H) with preconfigured classes. Two artificial data sets are used. The first data set involves smiley faces (Figure G.4 upper) employed in the EM example [139] as comparisons to techniques used in EM; the second data set involves classified particle images taken from real AFM data in Figure 3.16, which is then used to synthesize artificial images.

Figure G.4 Image Simulation using Smiley Faces

Filtered individual particles (faces) with no overlaps are colored in red.

In the first example, the smiley faces were input into Image Simulator to generate ten AFM images with randomized distributions and orientations of the faces (Figure G.4 lower). The simulated images are then subjected to particle filtering, alignment (Section 3.5.2B), and finally, classification through their correlations (Figure G.5). For example, when using hierarchical ascendant clustering (HAC) for classification, a tree (also known as a dendrogram) can be generated (Figure G.3). In HAC, the leaves are individual particles, which are joined together in pairs into nodes (called links) according to their similarities (using any of the distance metrics in Section G.II). The nodes are then further joined together to form bigger nodes, reducing the number of branches at the same time, until they finally merge into one at the top. It is called an ascending tree (or an inverted tree) because it is upside down, with the leaves at the bottom. The tree is pruned at ~0.7 because the branches are most easily separated at that distance (Figure G.3 red arrow): distances between branches below 0.7 are notably smaller than distances between branches above 0.7.
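The bottom-up joining and flat-cutoff pruning described above can be illustrated with a naive agglomerative routine. This Python sketch is not the MATLAB implementation used in Image Metrics: the single-linkage rule (cluster distance taken as the closest pair of members), the function name `hac_flat`, and the one-dimensional toy data are all assumptions made for illustration.

```python
def hac_flat(dist, cutoff):
    """Join the closest clusters (single linkage) until the next link exceeds cutoff."""
    clusters = [[i] for i in range(len(dist))]  # every leaf starts as its own cluster
    links = []                                  # (height, members) for each merge
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        height, a, b = best
        if height > cutoff:
            break                               # prune: remaining links are too long
        merged = sorted(clusters[a] + clusters[b])
        links.append((height, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(clusters), links


# Toy data: six "particles" on a line; the distance is the absolute difference.
points = [0.0, 0.2, 0.5, 2.0, 2.1, 5.0]
dist = [[abs(p - q) for q in points] for p in points]

clusters, links = hac_flat(dist, cutoff=0.7)
```

With a cutoff of 0.7 the three short links are accepted and the longer ones are pruned, leaving three clusters; raising the cutoff would keep merging until a single root remains.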

This process generates eight groups (classes) that, upon user inspection, correctly correspond to the eight faces that the artificial data encompasses (Figure G.5). All the groups are verified to contain the same faces in their respective class with no outliers. This result verifies that the HAC implementation in Image Metrics is able to correctly divide the particles into classes that agree with the user’s prior knowledge. Other classification schemes (K-means and Eigenanalysis) are also verified to obtain the same result.


Figure G.5 Particle Classification using Artificial Data Set

Left. Particles (aligned) from one of the obtained groups (classes). Right. Eight groups (classes) are obtained through one of the computer-assisted classification schemes. The eight groups are identical to the ones used to generate the synthetic data.

In the second example, particle images from a real AFM image (Figure 3.16, from MutLα deposition) are used to simulate an AFM image. The particle images are pre-classified by hand into a number of conformational classes (Figure G.6 left). Two images from each class are chosen as the input source for Image Simulator to generate a synthetic image of 1 μm × 1 μm with ~100 particles (Figure G.6 right). To make the class distribution closer to a realistic scenario, population ratios are applied when generating the image.


Figure G.6 Simulated AFM Image from Classified Particle Images

(Left) Particle images are pre-classified by hand (‘conformation’ column) and two images from each class are chosen as the input source (‘image used’ column) for our synthetic image, which contains 6 manual classes and 12 image classes in total. To make the class distribution in the synthetic image more realistic, the population ratio (obtained from real data) is applied to each class when synthesizing our image, and population counts are displayed in brackets. (Right) A 1 μm × 1 μm AFM image is synthesized from the source with ~100 particles in total. Particles are synthesized using the source images (left) with a preset population ratio, random rotations, and random translations applied. The image is synthesized at the same resolution as the source images. The synthesized image is then subjected to shape analysis as described in Section 3.5.

Particles are filtered by volume to remove overlapping particles. The 74 remaining particles are then subjected to different automated classification methods (k-means, HAC, and eigenanalysis) in the Particle Categorization module. First, we examine k-means. The principle of k-means is to segment the particles into k random initial clusters. Each particle is then re-binned to the cluster to which it has the minimum average distance; in other words, a cluster can gain or lose members in this process. This process is repeated until the clusters no longer change, at which point all intra-class distances are assumed to be minimized. Since the initial clusters are randomly divided by the seed, which affects the clustering outcome despite the iterations, it usually takes a number of replicates (each using a different random seed) to obtain the clusters with the truly smallest intra-class distances. As an initial guess, a five-class seed ($k = 5$) is used, which results in five classes (Figure G.7) that capture four of the six conformations from the source (Figure G.6). Notably, the conformation ‘round with tail’ (Figure G.7) is missing, and the two images from the triplet class are resolved into two classes (not surprising, given that the two triplet source images are in very different bending states even though they share the same overall conformation). The elongated dumbbell conformation is also assimilated into other classes.
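The re-binning loop described above can be sketched on a precomputed distance matrix. The following Python sketch follows the description in the text and is not MATLAB's k-means routine: the function names are hypothetical, the initial labels are passed in explicitly instead of being drawn from a random seed so that the outcome is reproducible, and replicates are left out.

```python
def mean_distance(dist, i, members):
    """Average distance from particle i to the members of one cluster."""
    return sum(dist[i][j] for j in members) / len(members)


def rebin_until_stable(dist, labels, k, max_iter=100):
    """Move every particle to the cluster with the smallest average distance,
    repeating until the cluster assignment no longer changes."""
    labels = list(labels)
    for _ in range(max_iter):
        clusters = {c: [i for i, lab in enumerate(labels) if lab == c] for c in range(k)}
        new_labels = []
        for i in range(len(labels)):
            options = [(mean_distance(dist, i, m), c) for c, m in clusters.items() if m]
            new_labels.append(min(options)[1])  # ties go to the lower cluster index
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def intra_class_sum(dist, labels):
    """Sum of pairwise distances between particles that share a cluster."""
    n = len(labels)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n)
               if labels[i] == labels[j])


# Toy data: four "particles" on a line, with a deliberately poor initial split.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

initial = [0, 0, 0, 1]
final = rebin_until_stable(dist, initial, k=2)
```

In practice the loop would be wrapped in several replicates with different random initial labels, keeping the replicate with the lowest `intra_class_sum`.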

Figure G.7 K-means Clustering

A five-class seed (k=5) and the Euclidean distance from the pairwise correlation matrix (Eq. G. 23) are used for k-means clustering, which generates five initial classes that resemble those from the source (Figure G.6), with the exception of the ‘round with tail’ class, which is missing. This result is not surprising given that the source has six classes, so one of the classes has to be assimilated into other classes when a five-class seed is used for k-means clustering. The numbers in the figure indicate the number of particles in each class (from a total of 74 particles).

To resolve the missing classes, one can inspect each class and subject an individual class to further classification. For example, upon inspection of ‘Group#1’ in Figure G.7, I found that the conformations ‘round with tail’ and ‘asymmetric dumbbell’ are assimilated into the same group. A further classification of this group using one of the classification methods, for example k-means clustering with a two-class seed, is able to completely separate these two conformations. A better approach (than manual inspection) is to use the cluster silhouette plot (footnote 133; Figure G.8) [274], which pits the average inter-class distances of individual particles against their intra-class distances. The mean of all the silhouette values therefore reflects how well the classes are separated, with higher silhouette means corresponding to better separation of classes. The sum of intra-class distances will continue to decrease as the number of classes increases (Figure G.8), reflecting a progressive refinement of the classes. By re-running the clustering scheme using a different seed number, we may obtain a best-case scenario by choosing the seed with the highest silhouette mean (Figure G.9), and make a more objective and quantitative claim about the clustering quality. In Image Metrics, this optimization can be done by scanning the silhouette mean against chosen parameters (e.g. cluster number) and automatically choosing the parameter(s) with the highest silhouette value (Figure G.9). The result shows that a local maximum of the silhouette mean is reached at $k = 11$, and the resulting classes are displayed in Figure G.14.

133 The silhouette value is a measure of how close a node (particle) is to other nodes in its own cluster as compared to nodes in other clusters. The silhouette value $S_i$ is defined as $S_i = (b_i - a_i)/\max(a_i, b_i)$, where $a_i$ is the average distance of the node to other nodes in its own cluster and $b_i$ is the minimum average distance of the node to other clusters. Specifically, a single-node cluster would have a silhouette value of 1 (since $a_i = 0$).
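The definition in this footnote translates directly into code. The Python sketch below is illustrative only and is not the MATLAB routine used by Image Metrics: the function name is hypothetical, $a_i$ is set to 0 for a single-node cluster as stated in the footnote, and returning 0 when both $a_i$ and $b_i$ are zero is an added assumption to avoid division by zero.

```python
def silhouette_values(dist, labels):
    """Silhouette value S_i = (b_i - a_i) / max(a_i, b_i) for every particle.
    Requires at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        # a_i: average distance to the other members of the same cluster (0 if alone)
        a = sum(dist[i][j] for j in own) / len(own) if own else 0.0
        # b_i: smallest average distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data: four "particles" on a line.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

two_classes = silhouette_values(dist, [0, 0, 1, 1])
mean_two = sum(two_classes) / len(two_classes)

# Splitting off particle 3 into its own class gives it a silhouette value of 1.
three_classes = silhouette_values(dist, [0, 0, 1, 2])
```

The second call illustrates why the silhouette mean can rise again once particles start forming single-member classes, even though such classes carry no additional structural information.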


Figure G.8 Cluster Silhouette Plot

The silhouette plot displays how close an individual node is to its own class versus its neighboring classes. Its value ranges from +1 (very distant to neighboring classes), to 0 (the node does not distinctively belong to its current class), to -1 (the node is probably assigned to the wrong class). The mean of all the silhouette values, therefore, reflects the separation status of current clusters, with higher mean values representing a better separation of clusters, and smaller inter-class distance sums (until the classes are saturated, see also Figure G.9). The figure shows the comparison between a five-classes seed (Figure G.7) and an eleven-classes seed in k-means clustering. Through comparing the silhouette means and distance sums, the eleven-classes seed resolves more correct classes than the five-classes seed, which is confirmed in manual inspection. The eleven-classes seed over-resolves the conformations from the source classes (Figure G.6) into their individual images, with only the ‘round’ class still intact (due to the much smaller separation from their two individual images).


Figure G.9 Comparisons of K-means Clustering using Different Seeds

The silhouette means and intra-distance sums are compared among different seeding configurations. For each configuration, the replicate with the lowest distance sum among 100 replicates is used. (left) Plot of silhouette means and intra-distance sums versus cluster number. The peak of the silhouette value is marked with a circle. (right) Table listing the values in the plot. Not surprisingly, as the seed number approaches the number of source images (Figure G.6), the silhouette mean increases while the distance sum decreases (until all classes are resolved). Interestingly, the silhouette mean at seed 12 (the number of source images used) is lower than that at seed 11. This result could happen if the variation between the two images in the ‘round’ class (Figure G.6, see also the caption of Figure G.8) is smaller than the errors from image rotations and/or other interpolations, suggesting that the two images cannot be further resolved within errors. In other words, the increase in intra-class distance is canceled by the decrease in inter-class distance, resulting in a flat or even smaller silhouette mean; to put it simply, the two image classes are better clustered together than separated. The intra-class distance will continue to decrease as the number of classes increases beyond the known classes, resolving further differences that result from rotational errors (i.e. the rotational errors account for a minimum intra-class distance of ~2). However, the silhouette mean will increase again when more individual particles form their own classes (i.e. classes with only one particle), since by definition (see previous footnote) such classes have a silhouette value of 1 regardless of their similarity to other classes.

We then examined hierarchical ascendant clustering (HAC).
As with the previous smiley-face example, the dendrogram of the hierarchical tree can be previewed (Figure G.12 middle row), and users can prune the tree with a user-specified flat cutoff threshold using either a distance or a desired cluster number (called the Maximum Cluster Number; footnote 134). The problem with cutting the tree with a flat distance cutoff is that the cutoff only reflects the maximum allowable intra-class distances. As long as the intra-class distances remain within the same envelope, the flat cutoff does not discriminate a tight cluster from a loose cluster (e.g. Figure G.10 cluster 2), nor does it adjust for scaling where a cluster proportionally expands its radius (e.g. Figure G.10 clusters 3 and 4). A more natural way to cut the tree is to use the inherent inconsistencies between hierarchical tree nodes, which reflect the relative changes in distance instead of the absolute distance values (Figure G.11). The inconsistency coefficient (footnote 135) in HAC compares the height of a node to the average height of the nodes within a specified number of levels below it in the hierarchy; this number of levels is known as the depth. By specifying both the inconsistency coefficient cutoff and the depth, it is possible to obtain a more natural division of the clusters in the hierarchical tree based on relative similarities instead of a flat distance cutoff. As with k-means clustering, the clustering silhouette values can be plotted and scanned against selected parameters for optimization (Figure G.12). Interestingly, the flat cutoff scheme, when optimized, results in exactly the same twelve classes as in the source templates (Figure G.6 left), but with a lower silhouette mean than that generated by the inconsistency scheme, which yields fewer classes (Figure G.12 dendrogram and silhouette plot). For comparison, the optimized classes are displayed in Figure G.14.

134 The tree may not always be cut into an odd number of clusters if two clusters have equal distances. Therefore, the actual number of clusters may be lower if an odd number is specified; hence the name maximum cluster number.

135 For a full mathematical description and actual calculation, see MATLAB documentation on hierarchical clustering.
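The difference between the two cutoff schemes can be illustrated on a toy data set. The following is a minimal Python sketch, not the Image Metrics implementation: the 1D values are made up (two tight groups and one loose group), single linkage is assumed for simplicity, and the helper names (`single_linkage`, `inconsistency`, `count_clusters`) are hypothetical. The coefficient is computed in the z-score form, (height - mean) / standard deviation over the link and the links below it within the given depth, which is our reading of the documented definition rather than the simplified ratio used in Figure G.11.

```python
from statistics import mean, stdev

def single_linkage(values):
    """Agglomerate 1D values bottom-up with single linkage.
    Leaves are nodes 0..n-1; the i-th merge creates node n+i.
    Returns a list of (child_a, child_b, height)."""
    n = len(values)
    members = {i: [v] for i, v in enumerate(values)}
    merges = []
    while len(members) > 1:
        ids = sorted(members)
        best = None
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                d = min(abs(x - y) for x in members[a] for y in members[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        members[n + len(merges)] = members.pop(a) + members.pop(b)
        merges.append((a, b, d))
    return merges

def inconsistency(merges, n, depth=2):
    """(height - mean) / std over each link and the links up to depth-1 levels below it."""
    def heights(node, levels):
        if node < n or levels == 0:
            return []
        a, b, h = merges[node - n]
        return [h] + heights(a, levels - 1) + heights(b, levels - 1)
    coeffs = []
    for i, (_, _, h) in enumerate(merges):
        hs = heights(n + i, depth)
        spread = stdev(hs) if len(hs) > 1 else 0.0
        coeffs.append((h - mean(hs)) / spread if spread > 0 else 0.0)
    return coeffs

def count_clusters(merges, n, accept):
    """Count clusters when a link is applied only if it and every link below it is accepted."""
    parent = list(range(n + len(merges)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    ok = []
    for i, (a, b, _) in enumerate(merges):
        ok.append(accept[i] and all(ok[c - n] for c in (a, b) if c >= n))
        if ok[i]:
            parent[a] = parent[b] = n + i
    return len({find(leaf) for leaf in range(n)})

# Made-up values: two tight groups (spacing 1) and one loose group (spacing 10).
values = [0, 1, 2, 10, 11, 12, 100, 110, 120]
n = len(values)
merges = single_linkage(values)
heights = [h for _, _, h in merges]
coeffs = inconsistency(merges, n, depth=2)

flat_counts = {t: count_clusters(merges, n, [h <= t for h in heights]) for t in (5, 9, 10)}
incons_count = count_clusters(merges, n, [c <= 1.0 for c in coeffs])
print(flat_counts, incons_count)
```

With these values, a flat cutoff small enough to separate the two tight groups also shatters the loose group into single points, while a cutoff large enough to hold the loose group together merges the two tight groups. In this toy case, an inconsistency cutoff of 1.0 recovers the three groups.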


Figure G.10 Hierarchical Ascendant Clustering – Flat Cutoff Scheme

As an illustration, the particles are plotted as filled blue circles in a 2D space with the distances represented by their Euclidean distances (the actual representation would be a plot in N-dimensional space, with each dimension represented by a particle attribute). In the HAC flat cutoff scheme, the clusters are separated by a flat distance. As illustrated in the figure, a flat distance of R (represented as cluster ‘radius’ for simplicity) results in five clusters. Because the cutoff is flat, it does not account for the inherent tightness of a cluster (e.g. cluster 2 can arguably be separated into two clusters), nor does it address the proportional expansion of a cluster (e.g. cluster 4 may be a natural extension of cluster 3, and they may be clustered together).

Figure G.11 Hierarchical Ascendant Clustering – Inconsistency Scheme

In the HAC inconsistency scheme, the clusters are separated by their inconsistency coefficients. For simplicity, the inconsistency coefficient may be visualized as the relative distance from one cluster to another (depth = 1), i.e. $d_i / d_{i-1}$, where $i$ stands for level $i$ in the hierarchical tree (from the bottom up). For example, an inconsistency of 2 means that the relative distance must not be greater than 2, i.e. the distance between the clusters being joined must not be more than twice their respective intra-class distances. As illustrated in the figure using the same example as in Figure G.10, an inconsistency of 2 would result in four clusters. Because the inconsistency coefficient compares relative similarities, it will separate clusters based on their inherent tightness (e.g. cluster 2 and cluster 3, where $d_2 / d_1 > 2$), and properly address the proportional expansion of a cluster (e.g. cluster 4, where $d_2 / d_1 < 2$ and $d_3 / d_2 < 2$).


Figure G.12 HAC Parameter Optimization

As with k-means clustering (Figure G.9), the silhouette values can be used to optimize HAC cluster parameters (inconsistency, distance, etc.). The Euclidean distance (Eq. G.23) is used for either clustering scheme. (Silhouette Optimization Plot) The silhouette mean is scanned by sweeping the parameter space in two HAC clustering schemes (flat cutoff vs. inconsistency). In the flat cutoff scheme, the silhouette mean is plotted against the maximum number of clusters or the distance (only the distance plot is shown in the figure). In the inconsistency scheme, the silhouette mean is mapped over two parameters – inconsistency cutoff and depth. The position where the silhouette mean peaks is marked by a circle. (Dendrogram) The flat cutoff scheme prunes the tree at a flat distance or a user-specified maximum cluster number, whereas the inconsistency scheme prunes the tree based on relative distances between hierarchies. Note the difference between the two dendrograms – the inconsistency scheme does not have a flat cut, and results in fewer clusters than the flat cut scheme, which obtains the same number of image classes as the source templates. (Silhouette Plot) The silhouette values from all nodes are plotted and averaged. Interestingly, the clustering generated by the inconsistency scheme results in a higher silhouette mean while yielding fewer classes than the source templates.

Finally, we examine the eigenanalysis classification method. The two parameters that determine the eigenanalysis outcome are the correlation threshold (initial matching threshold from the pairwise correlation matrix) and the overlapping threshold (subsequent matching threshold from iterative Jaccard index operations). The silhouette means are mapped over these two parameters for optimization (Figure G.13 left), from which the configuration with the highest silhouette mean is used to generate the silhouette plot (Figure G.13 right). It appears that all twelve image classes from the source (Figure G.6) are captured, and the silhouette plot and mean are exactly the same as those generated from the HAC flat cutoff scheme (Figure G.12 bottom left), other than a few permutations in class/cluster number. This result is not surprising given that the eigenanalysis of segmenting factorial/principal components (classes/groups) can be seen as an analog to the reverse operation of hierarchical ascendant clustering. Instead of building the dendrogram from the ground up (leaves to root), the eigenanalysis essentially branches the dendrogram (an inverted tree) from the top down (root to leaves, or sub-grouping; see also Figure G.1). Using a flat threshold-based cutoff also means a flat cutoff of the tree, and therefore results in the same classes as those generated with the HAC flat cutoff scheme. Note that the eigenanalysis also allows for super-grouping, the bottom-up approach of grouping smaller classes into super-classes (Figure G.2), similar to the typical ascending tree in hierarchical clustering. The resulting classes are summarized in Figure G.14.


Figure G.13 Eigenanalysis Parameter Optimization

(Left) The silhouette mean is mapped over two eigenanalysis parameters – correlation threshold and overlapping threshold (see definition in Appendix G.I). The locations of peak silhouette mean are marked by blue circles. (Right) Silhouette plot using one of the peak locations in the left figure. Note that all three peaks marked in the left figure generate the same silhouette plot with the exact same cluster outcome.

In our application, creating classes with the highest silhouette means (Figure G.14) is advantageous for obtaining more refined class averages and improving the resolution of particle structures through oversampling. However, as discussed in Section 3.7.3, the benefit of averaging may be limited for AFM images. Instead, we may want to include conformational variations within classes for a broader classification, rather than one based on exact shapes. These ‘fuzzier’ classes are advantageous in identifying major conformational shifts while allowing for minor conformational changes, and are more analogous to how humans perceive shape differences.

One approach to solve for clusters under this broader class definition is to use local maxima during parameter optimization (e.g. coordinate (7.5,7.5) in Figure G.13; 푘 = 8 in Figure G.9). Instead of aiming for the most refined classes with the highest silhouette mean, a lower but locally maximized silhouette mean may be used to obtain a set of more broadly defined classes (with a reduced number of classes). Because multiple local maxima may exist, the result could be drastically different if a different local maximum is used, and a trial-and-error process may be needed to obtain the best-case scenario. Here we tested one local maximum from each cluster scheme, and the results are summarized in Figure G.15. For the local maximum being tested, it appears that the HAC flat cutoff scheme has the best resemblance to the six manual classes (Figure G.15 HAC flat cutoff vs. Figure G.6). Not surprisingly, the class ‘round’ and the class ‘round with tail’ are assimilated into one class due to their small differences, and the two triplet image templates continue to constitute two classes due to their distinctively different bending states.

To further resolve the ‘round with tail’ class from the ‘round’ class, the sub-grouping technique can be used (Figure G.1) by sampling particles from within the target class. The correlation-based measure may not be a good shape descriptor for large deformations such as branching, bending, and extending, three commonly seen conformational variations. Skeleton-based shape descriptors [275] such as branch points, fiber width, and fiber length (Figure 3.14) may be better measures to classify these conformational deformations.
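The selection of a local maximum described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the silhouette means are invented numbers (arranged so that a local peak falls at k = 8 and the global peak at a larger k), and `local_maxima` is a hypothetical helper, not part of Image Metrics.

```python
def local_maxima(xs, ys):
    """Return the x positions where y is strictly higher than both neighbours."""
    return [xs[i] for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1]]

# Hypothetical silhouette means for k = 2..14 (illustrative numbers only,
# not values read from Figure G.9).
ks = list(range(2, 15))
sil_mean = [0.30, 0.35, 0.41, 0.44, 0.52, 0.55, 0.61,
            0.57, 0.60, 0.66, 0.78, 0.70, 0.64]

peaks = local_maxima(ks, sil_mean)               # every local peak, global one included
best_k = ks[sil_mean.index(max(sil_mean))]       # most refined classes
broader = [k for k in peaks if k < best_k]       # candidates with fewer, broader classes
print(peaks, best_k, broader)
```

Each entry in `broader` is only a candidate. As noted above, different local maxima can give very different classes, so each one would still need to be inspected against the manual classes.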

In Image Metrics, users can easily hybridize different classification schemes at different levels of the clustering process, a technique frequently used in EM [139, 276]. For example, users can apply HAC or eigenanalysis to individual groups even though the initial groups are generated through k-means clustering. Since different schemes have their own advantages and disadvantages, using different strategies at different levels of the grouping process can often improve grouping efficacy and give users more control over grouping outcomes.


Figure G.14 Summary of Classes (w/ Highest Silhouette Means)

Particles from each class (‘Group#N’, where N is the group number) are aligned and the first four are displayed. The number of particles in each class is indicated in the block on the right. These classes are obtained from parameter optimization with the highest silhouette values, and are therefore the most refined classes, which should replicate the twelve image templates that are used (instead of the five manual classes) in the source (Figure G.6). Indeed, both the eigenanalysis and the HAC (flat cutoff scheme) capture the designated classes in full. In the HAC (inconsistency scheme), Group#11 and Group#12, and Group#3 and Group#4 (first column) are merged into Group#10 and Group#3 (second column) respectively, probably due to insufficient parameter sampling during parameter optimization. The k-means method results in only eleven classes, which could be a result of interpolation errors (as discussed in the Figure G.9 description).


Figure G.15 Summary of Classes (w/ Local Silhouette Maxima)

The local silhouette maxima may be used to obtain a set of more broadly defined classes that resemble the six classes obtained from manual inspection (Figure G.6). Because multiple local maxima may exist, a trial-and-error process may be needed to obtain the best-case scenario. Here only one local maximum from each cluster scheme is tested, so the result could be drastically different if a different local maximum is used. The local maxima locations being used are: Eigenanalysis – coordinate (7.5,7.5) in Figure G.13 (left); K-means – 푘 = 8 in Figure G.9; HAC flat cutoff scheme – Distance Threshold = 0.3 in Figure G.12 (top left); HAC inconsistency scheme – coordinate (4,2.4) in Figure G.12 (top right). The resulting classes, when compared to the manual classes, show that HAC (flat cutoff scheme) and K-means have the best and the second-best resemblance for the local maxima being tested.

APPENDIX H. IMAGE SIMULATION

In Image Metrics, images can be synthesized using 3D models in the module Image Simulator (Figure H.1). The 3D models can either be loaded from external files such as those containing protein crystal coordinates (e.g. PDB file and DX file), or synthesized through parameterization of shapes of simple geometry (e.g. disc, polygons, sphere). The models can then be visualized through 3D plots such as wireframe mesh plot, surface plot, and scatter plot (Figure H.1 3D View). The AFM image of a 3D model is synthesized through its projection on the XY plane (Figure H.1 2D Projection – top), where users can specify the height of the image as either the topographic height (re-adjusted relative to the surface or XY plane) or the Z-density (i.e. density map). The simulation is performed in real time as users re-orient the 3D model through its Euler angles to mimic a likely sample-surface interaction. Ideally, the Euler angles can be determined in a separate simulation that minimizes the potential energies between the model and the surface (which could involve the weight distribution or center of mass of the model, the charge and magnetic properties of the sample and surface, etc.), thereby producing a best-fit orientation of the model for the landing scenario. The projection map (topographic or density) is then digitized into a user-specified resolution to simulate an actual AFM image of the 3D model (Figure H.1 2D Projection – bottom). Finally, the AFM images of 3D models (i.e. particles) can be used to generate AFM images of larger sizes that contain a user-specified number of particles by using random generators for particle positions and rotations. The particles can also include heterogeneous models to simulate a mixed population.
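The projection step described above can be sketched in a few lines. The following Python example is a simplified illustration, not the Image Simulator code: the rotation order (about x, then y, then z) is an assumed Euler convention, the model is a made-up three-point rod, and the function names `rotate` and `project` are hypothetical. Heights are measured from the lowest point of the model, standing in for the surface.

```python
import math

def rotate(points, alpha, beta, gamma):
    """Rotate (x, y, z) points about x by alpha, then y by beta, then z by gamma (radians)."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    out = []
    for x, y, z in points:
        y, z = y * ca - z * sa, y * sa + z * ca      # rotation about the x axis
        x, z = x * cb + z * sb, -x * sb + z * cb     # rotation about the y axis
        x, y = x * cg - y * sg, x * sg + y * cg      # rotation about the z axis
        out.append((x, y, z))
    return out

def project(points, pixel, mode="height"):
    """Digitize points onto an XY grid of square pixels.
    mode="height": tallest point per pixel, measured from the lowest point.
    mode="density": number of points falling into each pixel."""
    xs, ys, zs = zip(*points)
    x0, y0, z0 = min(xs), min(ys), min(zs)
    ncols = int((max(xs) - x0) // pixel) + 1
    nrows = int((max(ys) - y0) // pixel) + 1
    grid = [[0.0] * ncols for _ in range(nrows)]
    for x, y, z in points:
        r, c = int((y - y0) // pixel), int((x - x0) // pixel)
        if mode == "height":
            grid[r][c] = max(grid[r][c], z - z0)
        else:
            grid[r][c] += 1.0
    return grid

# Made-up model: a short rod standing along z.
rod = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]
upright = project(rod, pixel=1.0)                                  # one tall pixel
lying = project(rotate(rod, 0.0, math.pi / 2, 0.0), pixel=1.0,     # rod laid along x
                mode="density")
print(upright, lying)
```

The same rod yields a single tall pixel when upright and a row of three pixels when laid flat, which is the orientation dependence the real-time re-orientation is meant to expose.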


Figure H.1 Image Simulator

The Image Simulator module allows users to load 3D objects and simulate what they would look like if they were deposited as an AFM sample. Once the 3D object (blue box; an N-terminal 40 kD fragment of E. coli MutL is shown) is loaded, users can manipulate its orientation and the program can simulate a 2D projection AFM image of the 3D object (purple box). The program can further simulate a random deposition of the 2D projection in an area of specified size (red box). This simulated image can then be loaded into Image Processor for further processing and/or overlaying on top of existing images. The Image Simulator module also allows users to perform tip modeling and apply tip dilation to images. Modeling of surface interactions and their impact on orientation, and docking of 3D models into existing AFM images, are also on the roadmap.

Tip dilation can also be applied to simulated images. Currently, tip models can be imported or synthesized (Figure H.2) through parameterization of three geometries (pyramid, cone, and sphere). Supported parameters include tip apex radius, tip slope, tip height, tip rotation, and number of tip sides. The geometries are generated through stacking of right polygons, while using tip slope and tip height to dictate the size of each stacking layer (Figure H.3). The tip apex is simulated as described in [106] through ‘3D’ grayscale dilation of the tip geometry with a sphere of the same radius as the specified tip apex radius (Figure H.4). The simulated tip can then be used to dilate an image with the same image dilation technique as described below.
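The layer-stacking construction can be sketched as follows. This Python example is only a sketch under stated assumptions: the tip slope is interpreted as the half-angle between the tip axis and its side, each layer is taken to be a regular polygon described by its circumradius, and the function name `tip_layers` and its parameters are hypothetical. The apex is left sharp; rounding it with a sphere would be a separate dilation step.

```python
import math

def tip_layers(height, half_angle_deg, n_sides, n_layers, rotation_deg=0.0):
    """Return [(z, vertices)] for an inverted (apex-up) tip.
    Layers run from the apex (z = height) down to the base (z = 0).
    Each layer is a regular polygon whose circumradius grows linearly with depth.
    Requires n_layers >= 2."""
    tan_a = math.tan(math.radians(half_angle_deg))
    layers = []
    for i in range(n_layers):
        depth = height * i / (n_layers - 1)       # distance below the apex
        radius = depth * tan_a                    # layer size set by slope and depth
        verts = []
        for k in range(n_sides):
            phi = math.radians(rotation_deg) + 2.0 * math.pi * k / n_sides
            verts.append((radius * math.cos(phi), radius * math.sin(phi)))
        layers.append((height - depth, verts))
    return layers

# A four-sided pyramid with a 15 degree slope, loosely echoing Figure H.2
# (height in arbitrary units; layer count chosen for illustration).
pyramid = tip_layers(height=100.0, half_angle_deg=15.0, n_sides=4, n_layers=5)
for z, verts in pyramid:
    print(round(z, 1), round(math.hypot(*verts[0]), 2))
```

Increasing `n_sides` moves the cross-section from a triangle toward a circle, so a cone can be approximated with a large number of sides.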


Figure H.2 Tip Modeling

The tip can be modeled through parameterization of shapes of simple geometries (cone, pyramid, and sphere). Parameters are chosen from the specifications listed for AFM probes so that users can conveniently generate a tip that matches the specifications of their probe, including tip apex radius, tip slope, tip height, tip rotation, and number of tip sides. Shown in the figure above is a four-sided pyramid tip with a tip apex radius of 10 nm and a tip slope of 15 degrees.


Figure H.3 Tip Geometry Parameterization

The tip’s geometry is generated by stacking layers of right polygons. The size of each polygon layer is calculated from the tip height and tip slope. The number of sides in the right polygons can vary to simulate any shape between an equilateral triangle and a full circle. Polygons can also be rotated to simulate tip rotations.

Figure H.4 Tip Apex Simulation

The cross-section of a cone-shaped tip is shown. The tip apex is simulated by dilating the cone with a sphere that has the same radius as the specified tip apex radius, also known as sweeping the cone with a sphere, or a sphere-swept cone. (Figure adapted from [106])

The technique for tip dilation, or ‘3D’ grayscale dilation, was described previously in [106, 107]. Essentially, the grayscale dilation is a local-maximum operator performed on the overlaps of two image sets: the tip and the actual image. In order to perform a dilation that mimics the realistic scenario, an inverted tip, or negative tip, is used (Figure H.5). Note that the tip models in Figure H.2, Figure H.3, and Figure H.4 are all inverted tips, since the normal tip in experiments points downwards to the surface, not upwards. The negative tip is an inverted tip with its peak shifted to zero (such that the rest of the tip has negative heights). The negative tip is introduced to avoid inflating the actual image with the height of the tip (so that the local maximum will not exceed the height of the actual image); in other words, the dilation is restricted to the XY plane only, mimicking the actual AFM imaging scenario (Figure H.5).
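The local-maximum operation with a negative tip can be illustrated in one dimension. The following Python example is a minimal sketch, not the Image Metrics or Gwyddion implementation: the surface and tip profiles are made-up integers, the tip is symmetric with its apex at the center of the array, and the function name `dilate` is hypothetical.

```python
def dilate(surface, tip):
    """1D grayscale dilation: out[i] = max over k of surface[i + k - half] + tip[k].
    `tip` is a negative tip: apex (value 0) at the center, all other heights below 0.
    Offsets that fall outside the surface are skipped."""
    half = len(tip) // 2
    out = []
    for i in range(len(surface)):
        best = float("-inf")
        for k, t in enumerate(tip):
            j = i + k - half
            if 0 <= j < len(surface):
                best = max(best, surface[j] + t)
        out.append(best)
    return out

# Made-up profiles: a single 5-unit-high spike imaged with a wedge-shaped tip.
surface = [0, 0, 0, 5, 0, 0, 0]
negative_tip = [-2, -1, 0, -1, -2]     # apex shifted to zero, sides negative

image = dilate(surface, negative_tip)
print(image)
```

In this example the spike is broadened laterally by the sides of the tip, while its peak height is unchanged because the apex of the negative tip sits at zero.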

Figure H.5 Tip Dilation

(Top) Realistically, the tip is pointing down to the surface, and dilation occurs when the tip touches the sample on its side, resulting in a dilated image (dashed line) that is broader than the actual sample topography (thick solid line). (Bottom) Computationally, the structural element used for image dilation requires an inverted tip (compare the tip direction in the top figure vs. the bottom figure) to mimic the same dilation effect as the real-life scenario. The inverted tip, or negative tip, is a tip with its apex pointing upwards and its height shifted to zero (so that the height of the image does not exceed that of the original image when performing local maximum operations). Figure adapted from [107].

The result of tip dilation is compared with that of Gwyddion, which also features a tip dilation function. Both Image Metrics and Gwyddion support import of external images as tip models, so the dilation can be performed using the same tip model and the same image. The comparison of tip dilation results between Gwyddion and Image Metrics is shown below. There is no visible difference between the two programs in this respect.

Figure H.6 Image Tip-Dilation Comparison

An image simulated from the model used in Figure H.1 is used for tip dilation. The tip is modeled as shown in Figure H.2. The tip model is imported into Gwyddion and calibrated. Image dilation is performed in both Gwyddion and Image Metrics using the same tip model on the same image. The results obtained in both programs are visibly identical.

REFERENCES

1. Alberts, B., Bray, D., Hopkin, K., Johnson, A.D., Lewis, J., Raff, M., Roberts, K., and Walter, P. (2013). Essential Cell Biology (New York: Garland Science).

2. Kunkel, T.A., and Erie, D.A. (2005). DNA Mismatch Repair. Annu Rev Biochem 74, 681-710.

3. Wolf, A.I., Buchanan, A.H., and Farkas, L.M. (2013). Historical Review of Lynch Syndrome. Journal of Coloproctology (Rio de Janeiro) 33, 95-110.

4. Schmidt, M.H., and Pearson, C.E. (2016). Disease-Associated Repeat Instability and Mismatch Repair. DNA repair 38, 117-126.

5. Tessmer, I., Yang, Y., Zhai, J., Du, C., Hsieh, P., Hingorani, M.M., and Erie, D.A. (2008). Mechanism of Muts Searching for DNA Mismatches and Signaling Repair. J Biol Chem 283, 36646-36654.

6. Wang, H., Yang, Y., Schofield, M.J., Du, C., Fridman, Y., Lee, S.D., Larson, E.D., Drummond, J.T., Alani, E., Hsieh, P., and Erie, D.A. (2003). DNA Bending and Unbending by Muts Govern Mismatch Recognition and Specificity. Proc Natl Acad Sci U S A 100, 14822-14827.

7. Luijsterburg, M.S., von Bornstaedt, G., Gourdin, A.M., Politi, A.Z., Mone, M.J., Warmerdam, D.O., Goedhart, J., Vermeulen, W., van Driel, R., and Hofer, T. (2010). Stochastic and Reversible Assembly of a Multiprotein DNA Repair Complex Ensures Accurate Target Site Recognition and Efficient Repair. The Journal of cell biology 189, 445-463.

8. Hura, G.L., Budworth, H., Dyer, K.N., Rambo, R.P., Hammel, M., McMurray, C.T., and Tainer, J.A. (2013). Comprehensive Macromolecular Conformations Mapped by Quantitative Saxs Analyses. Nat Methods 10, 453-454.

9. Hura, G.L., Tsai, C.L., Claridge, S.A., Mendillo, M.L., Smith, J.M., Williams, G.J., Mastroianni, A.J., Alivisatos, A.P., Putnam, C.D., Kolodner, R.D., and Tainer, J.A. (2013). DNA Conformations in Mismatch Repair Probed in Solution by X-Ray Scattering from Gold Nanocrystals. Proc Natl Acad Sci U S A 110, 17308-17313.

10. Hennig, J., and Sattler, M. (2014). The Dynamic Duo: Combining Nmr and Small Angle Scattering in Structural Biology. Protein science : a publication of the Protein Society 23, 669-682.

11. Williams, G.J., Hammel, M., Radhakrishnan, S.K., Ramsden, D., Lees-Miller, S.P., and Tainer, J.A. (2014). Structural Insights into Nhej: Building up an Integrated Picture of the Dynamic Dsb Repair Super Complex, One Component and Interaction at a Time. DNA repair 17, 110-120.

12. Griffith, J.D., and Christiansen, G. (1978). Electron Microscope Visualization of Chromatin and Other DNA-Protein Complexes. Annu Rev Biophys Bioeng 7, 19-35.

13. Griffith, J.D. (2013). Many Ways to Loop DNA. J Biol Chem 288, 29724-29735.

14. Bustamante, C., Erie, D.A., and Keller, D. (1994). Biochemical and Structural Applications of Scanning Force Microscopy. Current Opinion in Structural Biology 4, 750-760.

15. Janicijevic, A., Ristic, D., and Wyman, C. (2003). The Molecular Machines of DNA Repair: Scanning Force Microscopy Analysis of Their Architecture. Journal of microscopy 212, 264-272.

16. Lyubchenko, Y.L., Gall, A.A., and Shlyakhtenko, L.S. (2001). Atomic Force Microscopy of DNA and Protein-DNA Complexes Using Functionalized Mica Substrates. Methods Mol Biol 148, 569-578.

17. Yang, Y., Wang, H., and Erie, D.A. (2003). Quantitative Characterization of Biomolecular Assemblies and Interactions Using Atomic Force Microscopy. Methods 29, 175-187.

18. Lohr, D., Bash, R., Wang, H., Yodh, J., and Lindsay, S. (2007). Using Atomic Force Microscopy to Study Chromatin Structure and Nucleosome Remodeling. Methods 41, 333-341.

19. Wanner, G., and Schroeder-Reiter, E. (2008). Scanning Electron Microscopy of Chromosomes. Methods in cell biology 88, 451-474.

20. Sanchez, H., Kertokalio, A., van Rossum-Fikkert, S., Kanaar, R., and Wyman, C. (2013). Combined Optical and Topographic Imaging Reveals Different Arrangements of Human Rad54 with Presynaptic and Postsynaptic Rad51-DNA Filaments. Proc Natl Acad Sci U S A 110, 11385-11390.

21. Erie, D.A., Yang, G., Schultz, H.C., and Bustamante, C. (1994). DNA Bending by Cro Protein in Specific and Nonspecific Complexes: Implications for Protein Site Recognition and Specificity. Science 266, 1562-1566.

22. Villarreal, S.A., and Stewart, P.L. (2014). Cryoem and Image Sorting for Flexible Protein/DNA Complexes. J Struct Biol 187, 76-83.

23. Yeh, J.I., Levine, A.S., Du, S., Chinte, U., Ghodke, H., Wang, H., Shi, H., Hsieh, C.L., Conway, J.F., Van Houten, B., and Rapic-Otrin, V. (2012). Damaged DNA Induced Uv-Damaged DNA-Binding Protein (Uv-Ddb) Dimerization and Its Roles in Chromatinized DNA Repair. Proc Natl Acad Sci U S A 109, E2737-E2746.

24. Trinh, M.H., Odorico, M., Pique, M.E., Teulon, J.M., Roberts, V.A., Ten Eyck, L.F., Getzoff, E.D., Parot, P., Chen, S.W., and Pellequer, J.L. (2012). Computational Reconstruction of Multidomain Proteins Using Atomic Force Microscopy Data. Structure 20, 113-120.

25. Moreno-Herrero, F., de Jager, M., Dekker, N.H., Kanaar, R., Wyman, C., and Dekker, C. (2005). Mesoscale Conformational Changes in the DNA-Repair Complex Rad50/Mre11/Nbs1 Upon Binding DNA. Nature 437, 440-443.

26. Maletta, M., Orlov, I., Roblin, P., Beck, Y., Moras, D., Billas, I.M., and Klaholz, B.P. (2014). The Palindromic DNA-Bound Usp/Ecr Nuclear Receptor Adopts an Asymmetric Organization with Allosteric Domain Positioning. Nature communications 5, 4139.

27. Bazett-Jones, D.P., Mendez, E., Czarnota, G.J., Ottensmeyer, F.P., and Allfrey, V.G. (1996). Visualization and Analysis of Unfolded Nucleosomes Associated with Transcribing Chromatin. Nucleic Acids Res 24, 321-329.

28. Orlova, E.V., and Saibil, H.R. (2010). Methods for Three-Dimensional Reconstruction of Heterogeneous Assemblies. Methods Enzymol 482, 321-341.

29. Miyata, T., Suzuki, H., Oyama, T., Mayanagi, K., Ishino, Y., and Morikawa, K. (2005). Open Clamp Structure in the Clamp-Loading Complex Visualized by Electron Microscopic Image Analysis. Proc Natl Acad Sci U S A 102, 13795-13800.

30. He, Y., Fang, J., Taatjes, D.J., and Nogales, E. (2013). Structural Visualization of Key Steps in Human Transcription Initiation. Nature 495, 481-486.

31. Brown, A., Amunts, A., Bai, X.C., Sugimoto, Y., Edwards, P.C., Murshudov, G., Scheres, S.H., and Ramakrishnan, V. (2014). Structure of the Large Ribosomal Subunit from Human Mitochondria. Science 346, 718-722.

32. Fernandez, I.S., Bai, X.C., Hussain, T., Kelley, A.C., Lorsch, J.R., Ramakrishnan, V., and Scheres, S.H. (2013). Molecular Architecture of a Eukaryotic Translational Initiation Complex. Science 342, 1240585.

33. Barth, C., Foster, A.S., Henry, C.R., and Shluger, A.L. (2011). Recent Trends in Surface Characterization and Chemistry with High-Resolution Scanning Force Methods. Adv Mater 23, 477-501.

34. Melitz, W., Shen, J., Kummel, A.C., and Lee, S. (2011). Kelvin Probe Force Microscopy and Its Application. Surface Science Reports 66, 1-27.

35. Nonnenmacher, M., O’Boyle, M.P., and Wickramasinghe, H.K. (1991). Kelvin Probe Force Microscopy. Applied Physics Letters 58, 2921.

36. Glatzel, T. (2003). Amplitude or Frequency Modulation-Detection in Kelvin Probe Force Microscopy. Appl Surf Sci 210, 84-89.

37. Leung, C., Maradan, D., Kramer, A., Howorka, S., Mesquida, P., and Hoogenboom, B.W. (2010). Improved Kelvin Probe Force Microscopy for Imaging Individual DNA Molecules on Insulating Surfaces. Applied Physics Letters 97, 203703.

38. Thompson, H.T., Barroso-Bujans, F., Herrero, J.G., Reifenberger, R., and Raman, A. (2013). Subsurface Imaging of Carbon Nanotube Networks in Polymers with Dc-Biased Multifrequency Dynamic Atomic Force Microscopy. Nanotechnology 24, 135701.

39. Mikamo-Satoh, E., Yamada, F., Takagi, A., Matsumoto, T., and Kawai, T. (2009). Electrostatic Force Microscopy: Imaging DNA and Protein Polarizations One by One. Nanotechnology 20, 145102.

40. Kikukawa, A., Hosaka, S., and Imura, R. (1996). Vacuum Compatible High-Sensitive Kelvin Probe Force Microscopy. Review of Scientific Instruments 67, 1463-1467.

41. Stark, R.W., Naujoks, N., and Stemmer, A. (2007). Multifrequency Electrostatic Force Microscopy in the Repulsive Regime. Nanotechnology 18, 065502.

42. Ziegler, D., Rychen, J., Naujoks, N., and Stemmer, A. (2007). Compensating Electrostatic Forces by Single-Scan Kelvin Probe Force Microscopy. Nanotechnology 18, 225505.

43. Cleveland, J.P., Anczykowski, B., Schmid, A.E., and Elings, V.B. (1998). Energy Dissipation in Tapping-Mode Atomic Force Microscopy. Applied Physics Letters 72, 2613-2615.

44. Tamayo, J. (2005). Study of the Noise of Micromechanical Oscillators under Quality Factor Enhancement Via Driving Force Control. Journal of Applied Physics 97, 044903.

45. Rodríguez, T.R., and García, R. (2004). Compositional Mapping of Surfaces in Atomic Force Microscopy by Excitation of the Second Normal Mode of the Microcantilever. Applied Physics Letters 84, 449.

46. Martinez, N.F., and Garcia, R. (2006). Measuring Phase Shifts and Energy Dissipation with Amplitude Modulation Atomic Force Microscopy. Nanotechnology 17, S167-S172.

47. Martinez, N.F., Lozano, J.R., Herruzo, E.T., Garcia, F., Richter, C., Sulzbach, T., and Garcia, R. (2008). Bimodal Atomic Force Microscopy Imaging of Isolated Antibodies in Air and Liquids. Nanotechnology 19, 384011.

48. Kokavecz, J., and Mechler, A. (2008). Spring Constant of Microcantilevers in Fundamental and Higher Eigenmodes. Physical Review B 78, 172101.

49. Hoummady, M., and Farnault, E. (1998). Enhanced Sensitivity to Force Gradients by Using Higher Flexural Modes of the Atomic Force Microscope Cantilever. Applied Physics A: Materials Science & Processing 66, S361.

50. Ding, X.D., An, J., Xu, J.B., Li, C., and Zeng, R.Y. (2009). Improving Lateral Resolution of Electrostatic Force Microscopy by Multifrequency Method under Ambient Conditions. Applied Physics Letters 94, 223109.

51. Stark, R.W., Drobek, T., and Heckl, W.M. (1999). Tapping-Mode Atomic Force Microscopy and Phase-Imaging in Higher Eigenmodes. Applied Physics Letters 74, 3296-3298.

52. Rezek, B. (2005). Atomic and Kelvin Force Microscopy Applied on Hydrogenated Diamond Surfaces. New Diam Front C Tech 15, 275-295.

53. Albrecht, T.R., Grütter, P., Horne, D., and Rugar, D. (1991). Frequency Modulation Detection Using High-Q Cantilevers for Enhanced Force Microscope Sensitivity. Journal of Applied Physics 69, 668-673.

54. Martin, Y., Williams, C.C., and Wickramasinghe, H.K. (1987). Atomic Force Microscope–Force Mapping and Profiling on a Sub 100-Å Scale. Journal of Applied Physics 61, 4723.

55. Giessibl, F.J. (1995). Atomic Resolution of the Silicon (111)-(7x7) Surface by Atomic Force Microscopy. Science 267, 68-71.

56. Colchero, J., Gil, A., and Baró, A. (2001). Resolution Enhancement and Improved Data Interpretation in Electrostatic Force Microscopy. Physical Review B 64.

57. Lei, C., Das, A., Elliott, M., and Macdonald, J.E. (2004). Quantitative Electrostatic Force Microscopy-Phase Measurements. Nanotechnology 15, 627.

58. Choi, K.J., Biegalski, M., Li, Y.L., Sharan, A., Schubert, J., Uecker, R., Reiche, P., Chen, Y.B., Pan, X.Q., Gopalan, V., Chen, L.Q., Schlom, D.G., and Eom, C.B. (2004). Enhancement of Ferroelectricity in Strained Batio3 Thin Films. Science 306, 1005-1009.

59. Trithaveesak, O., Schubert, J., and Buchal, C. (2005). Ferroelectric Properties of Epitaxial Batio3 Thin Films and Heterostructures on Different Substrates. Journal of Applied Physics 98.

60. Gruverman, A., Wu, D., Lu, H., Wang, Y., Jang, H.W., Folkman, C.M., Zhuravlev, M.Y., Felker, D., Rzchowski, M., Eom, C.B., and Tsymbal, E.Y. (2009). Tunneling Electroresistance Effect in Ferroelectric Tunnel Junctions at the Nanoscale. Nano Letters 9, 3539-3543.

61. Bonnell, D.A., and Kalinin, S.V. (2001). Local Polarization, Charge Compensation, and Chemical Interactions on Ferroelectric Surfaces: A Route toward New Nanostructures. MRS Proceedings 688.

62. Luger, K., Mader, A.W., Richmond, R.K., Sargent, D.F., and Richmond, T.J. (1997). Crystal Structure of the Nucleosome Core Particle at 2.8 Å Resolution. Nature 389, 251-260.

63. Luger, K., Dechassa, M.L., and Tremethick, D.J. (2012). New Insights into Nucleosome and Chromatin Structure: An Ordered State or a Disordered Affair? Nature reviews. Molecular cell biology 13, 436-447.

64. Obmolova, G., Ban, C., Hsieh, P., and Yang, W. (2000). Crystal Structures of Mismatch Repair Protein Muts and Its Complex with a Substrate DNA. Nature 407, 703-710.

65. Lamers, M.H., Perrakis, A., Enzlin, J.H., Winterwerp, H.H., de Wind, N., and Sixma, T.K. (2000). The Crystal Structure of DNA Mismatch Repair Protein Muts Binding to a G X T Mismatch. Nature 407, 711-717.

66. Warren, J.J., Pohlhaus, T.J., Changela, A., Iyer, R.R., Modrich, P.L., and Beese, L.S. (2007). Structure of the Human Mutsalpha DNA Lesion Recognition Complex. Mol Cell 26, 579-592.

67. Grilley, M., Welsh, K.M., Su, S.S., and Modrich, P. (1989). Isolation and Characterization of the Escherichia Coli Mutl Gene Product. J Biol Chem 264, 1000-1004.

68. Schofield, M.J., Nayak, S., Scott, T.H., Du, C., and Hsieh, P. (2001). Interaction of Escherichia Coli Muts and Mutl at a DNA Mismatch. J Biol Chem 276, 28291-28299.

69. Hombauer, H., Campbell, C.S., Smith, C.E., Desai, A., and Kolodner, R.D. (2011). Visualization of Eukaryotic DNA Mismatch Repair Reveals Distinct Recognition and Repair Intermediates. Cell 147, 1040-1053.

70. Elez, M., Radman, M., and Matic, I. (2012). Stoichiometry of Muts and Mutl at Unrepaired Mismatches in Vivo Suggests a Mechanism of Repair. Nucleic Acids Res 40, 3929-3938.

71. Wang, H., Bash, R., Yodh, J.G., Hager, G.L., Lohr, D., and Lindsay, S.M. (2002). Glutaraldehyde Modified Mica: A New Surface for Atomic Force Microscopy of Chromatin. Biophys J 83, 3619-3625.

72. Lyubchenko, Y.L. (2014). Nanoscale Nucleosome Dynamics Assessed with Time-Lapse Afm. Biophysical reviews 6, 181-190.

73. Zlatanova, J., and Leuba, S.H. (2003). Chromatin Fibers, One-at-a-Time. J Mol Biol 331, 1-19.

74. Yang, G., Leuba, S.H., Bustamante, C., Zlatanova, J., and van Holde, K. (1994). Role of Linker Histones in Extended Chromatin Fibre Structure. Nat Struct Biol 1, 761-763.

75. Zlatanova, J., Leuba, S.H., Yang, G., Bustamante, C., and van Holde, K. (1994). Linker DNA Accessibility in Chromatin Fibers of Different Conformations: A Reevaluation. Proc Natl Acad Sci U S A 91, 5277-5280.

76. Bustamante, C., Zuccheri, G., Leuba, S.H., Yang, G., and Samori, B. (1997). Visualization and Analysis of Chromatin by Scanning Force Microscopy. Methods 12, 73-83.

77. Swygert, S.G., Manning, B.J., Senapati, S., Kaur, P., Lindsay, S., Demeler, B., and Peterson, C.L. (2014). Solution-State Conformation and Stoichiometry of Yeast Sir3 Heterochromatin Fibres. Nature communications 5, 4751.

78. Kunkel, T.A., and Erie, D.A. (2015). Eukaryotic Mismatch Repair in Relation to DNA Replication. Annual Review of Genetics 49, 291-313.

79. Allen, D.J., Makhov, A., Grilley, M., Taylor, J., Thresher, R., Modrich, P., and Griffith, J.D. (1997). Muts Mediates Heteroduplex Loop Formation by a Translocation Mechanism. The EMBO journal 16, 4467-4476.

80. Jiang, Y., and Marszalek, P.E. (2011). Atomic Force Microscopy Captures Muts Tetramers Initiating DNA Mismatch Repair. The EMBO journal 30, 2881-2893.

81. Gradia, S., Subramanian, D., Wilson, T., Acharya, S., Makhov, A., Griffith, J., and Fishel, R. (1999). Hmsh2-Hmsh6 Forms a Hydrolysis-Independent Sliding Clamp on Mismatched DNA. Mol Cell 3, 255-261.

82. Cho, W.-K., Jeong, C., Kim, D., Chang, M., Song, K.-M., Hanne, J., Ban, C., Fishel, R., and Lee, J.-B. (2012). Atp Alters the Diffusion Mechanics of Muts on Mismatched DNA. Structure 20, 1264-1274.

83. Qiu, R., Derocco, V.C., Harris, C., Sharma, A., Hingorani, M.M., Erie, D.A., and Weninger, K.R. (2012). Large Conformational Changes in Muts During DNA Scanning, Mismatch Recognition and Repair Signalling. The EMBO journal 31, 2528-2540.

84. Ratcliff, G.C., and Erie, D.A. (2001). A Novel Single-Molecule Study to Determine Protein-Protein Association Constants. J Am Chem Soc 123, 5632-5635.

85. Mendillo, M.L., Putnam, C.D., and Kolodner, R.D. (2007). Escherichia Coli Muts Tetramerization Domain Structure Reveals That Stable Dimers but Not Tetramers Are Essential for DNA Mismatch Repair in Vivo. J Biol Chem 282, 16345-16354.

86. Groothuizen, F.S., Fish, A., Petoukhov, M.V., Reumer, A., Manelyte, L., Winterwerp, H.H., Marinus, M.G., Lebbink, J.H., Svergun, D.I., Friedhoff, P., and Sixma, T.K. (2013). Using Stable Muts Dimers and Tetramers to Quantitatively Analyze DNA Mismatch Recognition and Sliding Clamp Formation. Nucleic Acids Res 41, 8166-8181.

87. Fumagalli, L., Edwards, M.A., and Gomila, G. (2014). Quantitative Electrostatic Force Microscopy with Sharp Silicon Tips. Nanotechnology 25, 495701.

88. Johnson, A.S., Nehl, C.L., Mason, M.G., and Hafner, J.H. (2003). Fluid Electric Force Microscopy for Charge Density Mapping in Biological Systems. Langmuir 19, 10007-10010.

89. Gramse, G., Gomila, G., and Fumagalli, L. (2012). Quantifying the Dielectric Constant of Thick Insulators by Electrostatic Force Microscopy: Effects of the Microscopic Parts of the Probe. Nanotechnology 23, 205703.

90. Eaton, P., and West, P. (2010). Atomic Force Microscopy (New York: Oxford University Press).

91. Bickmore, B.R., Rufe, E., Barrett, S., and Hochella Jr, M.F. (1999). Measuring Discrete Feature Dimensions in Afm Images with Image Sxm. Geological Materials Research 1, 1.

92. Nečas, D., and Klapetek, P. (2012). Gwyddion: An Open-Source Software for Spm Data Analysis. Central European Journal of Physics 10, 181-188.

93. Schneider, C.A., Rasband, W.S., and Eliceiri, K.W. (2012). Nih Image to Imagej: 25 Years of Image Analysis. Nature Methods 9, 671.

94. Meijering, E., Jacob, M., Sarria, J.-C., Steiner, P., Hirling, H., and Unser, M. (2004). Design and Validation of a Tool for Neurite Tracing and Analysis in Fluorescence Microscopy Images. Cytometry 58, 167-176.

95. Barrett, S. (2002). Software for Scanning Microscopy. In Proceedings of the Royal Microscopical Society (London: Royal Microscopical Society), pp. 167-174.

96. Image Metrology (2018). The Scanning Probe Image Processor (Spip) User’s and Reference Guide (Copenhagen: Image Metrology).

97. Horcas, I., Fernández, R., Gomez-Rodriguez, J.M., Colchero, J., Gómez-Herrero, J., and Baro, A.M. (2007). Wsxm: A Software for Scanning Probe Microscopy and a Tool for Nanotechnology. Review of Scientific Instruments 78, 13705.

98. Sánchez, H., and Wyman, C. (2015). Sfmetrics: An Analysis Tool for Scanning Force Microscopy Images of Biomolecules. BMC Bioinformatics 16, 27.

99. Zahl, P., Bierkandt, M., Schröder, S., and Klust, A. (2003). The Flexible and Modern Open Source Scanning Probe Microscopy Software Package Gxsm. Review of Scientific Instruments 74, 1222-1227.

100. Chaves, R.C., and Pellequer, J.-L. (2013). Dockafm: Benchmarking Protein Structures by Docking under Afm Topographs. Bioinformatics 29, 3230-3231.

101. Roduit, C., Saha, B., Alonso-Sarduy, L., Volterra, A., Dietler, G., and Kasas, S. (2012). Openfovea: Open-Source Afm Data Processing Software. Nature Methods 9, 774.

102. Partola, K.R., and Lykotrafitis, G. (2016). Frame (Force Review Automation Environment): Matlab-Based Afm Data Processor. J Biomech 49, 1221-1224.

103. Shu-wen, W.C., and Pellequer, J.-L. (2011). Destripe: Frequency-Based Algorithm for Removing Stripe Noises from Afm Images. Bmc Struct Biol 11, 7.

104. Mikhaylov, A., Sekatskii, S.K., and Dietler, G. (2013). DNA Trace: A Comprehensive Software for Polymer Image Processing. Journal of Advanced Microscopy Research 8, 241-245.

105. Usov, I., and Mezzenga, R. (2015). Fiberapp: An Open-Source Software for Tracking and Analyzing Polymers, Filaments, Biomacromolecules, and Fibrous Objects. Macromolecules 48, 1269-1280.

106. Varadhan, G., Robinett, W., Erie, D., and Taylor, R.M. (2002). Fast Simulation of Atomic-Force-Microscope Imaging of Atomic and Polygonal Surfaces Using Graphics Hardware. In Visualization and Data Analysis 2002 (Bellingham: International Society for Optics and Photonics), pp. 116-125.

107. Villarrubia, J.S. (1997). Algorithms for Scanned Probe Microscope Image Simulation, Surface Reconstruction, and Tip Estimation. Journal of Research of the National Institute of Standards and Technology 102, 425.

108. Pan, X. (2014). Processing and Feature Analysis of Atomic Force Microscopy Images. (Rolla: Missouri University of Science and Technology).

109. Morris, V.J., Kirby, A.R., and Gunning, A.P. (2009). Atomic Force Microscopy for Biologists (London: Imperial College Press).

110. Canale, C., Torre, B., Ricci, D., and Braga, P.C. (2011). Recognizing and Avoiding Artifacts in Atomic Force Microscopy Imaging. In Atomic Force Microscopy in Biomedical Research (Berlin: Springer), pp. 31-43.

111. Vuori, T., Alakarhu, J., Salmelin, E., and Partinen, A. (2013). Nokia Pureview Oversampling Technology. In Multimedia Content and Mobile Devices (Bellingham: International Society for Optics and Photonics), p. 86671C.

112. Pandžić, E., Rossy, J., and Gaus, K. (2015). Tracking Molecular Dynamics without Tracking: Image Correlation of Photo-Activation Microscopy. Methods and Applications in Fluorescence 3, 14006.

113. Sun, Y., and Pang, J.H. (2006). Afm Image Reconstruction for Deformation Measurements by Digital Image Correlation. Nanotechnology 17, 933.

114. Bariani, P. (2005). Dimensional Metrology for Microtechnology. (Kgs. Lyngby: Technical University of Denmark (DTU)).

115. Frank, J. (2006). Two-Dimensional Averaging Techniques. In Three-Dimensional Electron Microscopy of Macromolecular Assemblies: Visualization of Biological Molecules in Their Native State (New York: Oxford University Press), pp. 71-144.

116. Eggeling, C., Ringemann, C., Medda, R., Schwarzmann, G., Sandhoff, K., Polyakova, S., Belov, V.N., Hein, B., von Middendorff, C., Schönle, A., and Hell, S.W. (2009). Direct Observation of the Nanoscale Dynamics of Membrane Lipids in a Living Cell. Nature 457, 1159.

117. Berglund, A., and Mabuchi, H. (2005). Tracking-Fcs: Fluorescence Correlation Spectroscopy of Individual Particles. Optics Express 13, 8069-8082.

118. Jiang, L., Georgieva, D., and Abrahams, J.P. (2011). Ediff: A Program for Automated Unit-Cell Determination and Indexing of Electron Diffraction Data. Journal of Applied Crystallography 44, 1132-1136.

119. Klapetek, P., Necas, D., and Anderson, C. Gwyddion User Guide. (Czech Metrology Institute). Available from: http://gwyddion.net/.

120. Otsu, N. (1979). A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics 9, 62-66.

121. Nagel, W. (1988). Image Analysis and Mathematical Morphology. Volume 2: Theoretical Advances. Edited by Jean Serra. Journal of microscopy 152, 597-597.

122. Reed, S.K. (1972). Pattern Recognition and Categorization. Cognitive Psychology 3, 382-407.

123. Brown, L.G. (1992). A Survey of Image Registration Techniques. ACM Computing Surveys (CSUR) 24, 325-376.

124. da Fontoura Costa, L., and Cesar Jr, R.M. (2010). Shape Analysis and Classification: Theory and Practice (Boca Raton: CRC Press).

125. Graf, M. (2010). Categorization and Object Shape. In Towards a Theory of Thinking (Berlin: Springer), pp. 73-101.

126. Salve, S.G., and Jondhale, K.C. (2010). Shape Matching and Object Recognition Using Shape Contexts. In 2010 3rd International Conference on Computer Science and Information Technology (New York: IEEE), pp. 471-474.

127. Ullman, S. (1986). An Approach to Object Recognition: Aligning Pictorial Descriptions. (Massachusetts Inst of Tech Cambridge Artificial Intelligence Lab).

128. Wang, J., Bai, X., You, X., Liu, W., and Latecki, L. (2012). Shape Matching and Classification Using Height Functions. Pattern Recognition Letters 33, 134-143.

129. Freeman, H. (1978). Shape Description Via the Use of Critical Points. Pattern Recognition 10, 159-166.

130. Hamsici, O.C., and Martinez, A.M. (2009). Rotation Invariant Kernels and Their Application to Shape Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 1985-1999.

131. Liu, Y.-S., Fang, Y., and Ramani, K. (2009). Idss: Deformation Invariant Signatures for Molecular Shape Comparison. BMC Bioinformatics 10, 157.

132. Berg, A.C., and Malik, J. (2006). Shape Matching and Object Recognition. In Toward Category-Level Object Recognition (Berlin: Springer), pp. 483-507.

133. Hodgetts, C.J., Hahn, U., and Chater, N. (2009). Transformation and Alignment in Similarity. Cognition 113, 62-79.

134. Milne, J.L., Borgnia, M.J., Bartesaghi, A., Tran, E.E., Earl, L.A., Schauder, D.M., Lengyel, J., Pierson, J., Patwardhan, A., and Subramaniam, S. (2013). Cryo-Electron Microscopy--a Primer for the Non-Microscopist. The FEBS journal 280, 28-45.

135. Ohi, M., Li, Y., Cheng, Y., and Walz, T. (2004). Negative Staining and Image Classification – Powerful Tools in Modern Electron Microscopy. Biological Procedures Online 6, 23-34.

136. Cheng, Y., Grigorieff, N., Penczek, P.A., and Walz, T. (2015). A Primer to Single-Particle Cryo-Electron Microscopy. Cell 161, 438-449.

137. Roseman, A.M. (2003). Particle Finding in Electron Micrographs Using a Fast Local Correlation Algorithm. Ultramicroscopy 94, 225-236.

138. van Heel, M., Gowen, B., Matadeen, R., Orlova, E.V., Finn, R., Pape, T., Cohen, D., Stark, H., Schmidt, R., and Schatz, M. (2000). Single-Particle Electron Cryo-Microscopy: Towards Atomic Resolution. Quarterly Reviews of Biophysics 33, 307-369.

139. Frank, J. (2006). Multivariate Data Analysis and Classification of Images. In Three-Dimensional Electron Microscopy of Macromolecular Assemblies: Visualization of Biological Molecules in Their Native State (New York: Oxford University Press), pp. 145-192.

140. Lewis, J.P. (1995). Fast Template Matching. In Vision Interface (Quebec City: Canadian Image Processing and Pattern Recognition Society), pp. 15-19.

141. Essannouni, F., Thami, O.R., Aboutajdine, D., and Salam, A. (2007). Simple Noncircular Correlation Method for Exhaustive Sum Square Difference Matching. Optical Engineering 46, 107004.

142. Viola, P., and Wells III, W.M. (1997). Alignment by Maximization of Mutual Information. International Journal of Computer Vision 24, 137-154.

143. Harrison, J.S., Cornett, E.M., Goldfarb, D., DaRosa, P.A., Li, Z.M., Yan, F., Dickson, B.M., Guo, A.H., Cantu, D.V., and Kaustov, L. (2016). Hemi-Methylated DNA Regulates DNA Methylation Inheritance through Allosteric Activation of H3 Ubiquitylation by Uhrf1. Elife 5, e17101.

144. Finch, H. (2005). Comparison of Distance Measures in Cluster Analysis with Dichotomous Data. Journal of Data Science 3, 85-100.

145. Sarstedt, M., and Mooi, E. (2014). Cluster Analysis. In A Concise Guide to Market Research (Berlin: Springer Texts in Business and Economics).

146. Förster, F., Pruggnaller, S., Seybert, A., and Frangakis, A.S. (2008). Classification of Cryo-Electron Sub-Tomograms Using Constrained Correlation. J Struct Biol 161, 276-286.

147. Wang, L., and Sigworth, F.J. (2006). Cryo-Em and Single Particles. Physiology 21, 13-18.

148. Zhang, X., Settembre, E., Xu, C., Dormitzer, P.R., Bellamy, R., Harrison, S.C., and Grigorieff, N. (2008). Near-Atomic Resolution Using Electron Cryomicroscopy and Single-Particle Reconstruction. Proc Natl Acad Sci U S A 105, 1867-1872.

149. Cressey, D., and Callaway, E. (2017). Cryo-Electron Microscopy Wins Chemistry Nobel. Nature News 550, 167.

150. Yang, Y., Sass, L.E., Du, C., Hsieh, P., and Erie, D.A. (2005). Determination of Protein-DNA Binding Constants and Specificities from Statistical Analyses of Single Molecules: Muts-DNA Interactions. Nucleic Acids Res 33, 4322-4334.

151. Wang, H., and Hays, J.B. (2003). Mismatch Repair in Human Nuclear Extracts: Effects of Internal DNA-Hairpin Structures between Mismatches and Excision-Initiation Nicks on Mismatch Correction and Mismatch-Provoked Excision. J Biol Chem 278, 28686-28693.

152. Rivetti, C. (2011). DNA Contour Length Measurements as a Tool for the Structural Analysis of DNA and Nucleoprotein Complexes. In DNA Nanotechnology (Berlin: Springer), pp. 235-254.

153. Rivetti, C., and Codeluppi, S. (2001). Accurate Length Determination of DNA Molecules Visualized by Atomic Force Microscopy: Evidence for a Partial B- to a-Form Transition on Mica. Ultramicroscopy 87, 55-66.

154. Ficarra, E., Benini, L., Macii, E., and Zuccheri, G. (2005). Automated DNA Fragments Recognition and Sizing through Afm Image Processing. IEEE Transactions on Information Technology in Biomedicine 9, 508-517.

155. Sundstrom, A., Cirrone, S., Paxia, S., Hsueh, C., Kjolby, R., Gimzewski, J.K., Reed, J., and Mishra, B. (2012). Image Analysis and Length Estimation of Biomolecules Using Afm. IEEE Transactions on Information Technology in Biomedicine 16, 1200-1207.

156. Fuentes-Perez, M., Dillingham, M.S., and Moreno-Herrero, F. (2013). Afm Volumetric Methods for the Characterization of Proteins and Nucleic Acids. Methods 60, 113-121.

157. Mazur, A.K., and Maaloum, M. (2014). Atomic Force Microscopy Study of DNA Flexibility on Short Length Scales: Smooth Bending Versus Kinking. Nucleic Acids Res 42, 14006-14012.

158. Gupta, S., Gellert, M., and Yang, W. (2012). Mechanism of Mismatch Recognition Revealed by Human Mutsbeta Bound to Unpaired DNA Loops. Nat Struct Mol Biol 19, 72-78.

159. Wang, Y., Yu, D., Ouyang, Q., and Liu, H. (2017). The Determinant Factors for Model Resolutions Obtained Using Cryoem Method. arXiv e-prints arXiv:1712.09254.

160. Henderson, R. (2013). Avoiding the Pitfalls of Single Particle Cryo-Electron Microscopy: Einstein from Noise. Proc Natl Acad Sci U S A 110, 18037-18041.

161. Samsó, M., Palumbo, M.J., Radermacher, M., Liu, J.S., and Lawrence, C.E. (2002). A Bayesian Method for Classification of Images from Electron Micrographs. J Struct Biol 138, 157-170.

162. Leite, F.L., Bueno, C.C., Róz, A.L., Ziemath, E.C., and Oliveira, O.N. (2012). Theoretical Models for Surface Forces and Adhesion and Their Measurement Using Atomic Force Microscopy. International Journal of Molecular Sciences 13, 12773-12856.

163. Harkins, W.D. (1928). Surface Energy and the Orientation of Molecules in Surfaces as Revealed by Surface Energy Relations. Zeitschrift für Physikalische Chemie 139A, 647.

164. Jiang, Q.-X., Chester, D.W., and Sigworth, F.J. (2001). Spherical Reconstruction: A Method for Structure Determination of Membrane Proteins from Cryo-Em Images. J Struct Biol 133, 119-131.

165. Ogura, T., and Sato, C. (2006). A Fully Automatic 3d Reconstruction Method Using Simulated Annealing Enables Accurate Posterioric Angular Assignment of Protein Projections. J Struct Biol 156, 371-386.

166. Scheres, S., Melero, R., Valle, M., and Carazo, J.-M. (2009). Averaging of Electron Subtomograms and Random Conical Tilt Reconstructions through Likelihood Optimization. Structure 17, 1563-1572.

167. Mallick, S.P., Zhu, Y., and Kriegman, D. (2004). Detecting Particles in Cryo-Em Micrographs Using Learned Features. J Struct Biol 145, 52-62.

168. Ogura, T., Iwasaki, K., and Sato, C. (2003). Topology Representing Network Enables Highly Accurate Classification of Protein Images Taken by Cryo Electron-Microscope without Masking. J Struct Biol 143, 185-200.

169. Ficarra, E., Masotti, D., Macii, E., Benini, L., Zuccheri, G., and Samori, B. (2005). Automatic Intrinsic DNA Curvature Computation from Afm Images. IEEE Transactions on Biomedical Engineering 52, 2074-2086.

170. Sampson, D. (2019). Gui Layout Toolbox. (MATLAB Central File Exchange). Available from: https://www.mathworks.com/matlabcentral/fileexchange/47982-gui-layout-toolbox.

171. Jackey, R. (2019). Widgets Toolbox. (MATLAB Central File Exchange). Available from: https://www.mathworks.com/matlabcentral/fileexchange/66235-widgets-toolbox.

172. Lodish, H., Berk, A., Matsudaira, P., Kaiser, C.A., Krieger, M., Scott, M.P., Zipursky, S.L., and Darnell, J. (2004). Molecular Cell Biology, 5th edn (New York: WH Freeman).

173. Friedberg, E.C., McDaniel, L.D., and Schultz, R.A. (2004). The Role of Endogenous and Exogenous DNA Damage and Mutagenesis. Current opinion in genetics & development 14, 5-10.

174. Bernstein, C., Prasad, A.R., Nfonsam, V., and Bernstein, H. (2013). DNA Damage, DNA Repair and Cancer. In New Research Directions in DNA Repair (London: IntechOpen).

175. Modrich, P. (2016). Mechanisms in E. Coli and Human Mismatch Repair (Nobel Lecture). Angewandte Chemie International Edition 55, 8490-8501.

176. Sancar, A. (2016). Mechanisms of DNA Repair by Photolyase and Excision Nuclease (Nobel Lecture). Angewandte Chemie International Edition 55, 8502-8527.

177. Lindahl, T. (2016). The Intrinsic Fragility of DNA (Nobel Lecture). Angewandte Chemie International Edition 55, 8528-8534.

178. Gallagher, J. (2017). Huntington’s Breakthrough May Stop Disease. (BBC). Available from: http://www.bbc.com/news/health-42308341.

179. Myers, R.H. (2004). Huntington's Disease Genetics. NeuroRx : the Journal of the American Society for Experimental NeuroTherapeutics 1, 255-262.

180. Brouwer, J.R., Willemsen, R., and Oostra, B.A. (2009). Microsatellite Repeat Instability and Neurological Disease. BioEssays : news and reviews in molecular, cellular and developmental biology 31, 71-83.

181. McColgan, P., and Tabrizi, S. (2018). Huntington's Disease: A Clinical Review. European Journal of Neurology 25, 24-34.

182. Ionis Pharmaceuticals (2015). Safety, Tolerability, Pharmacokinetics, and Pharmacodynamics of Ionis-Httrx in Patients with Early Manifest Huntington’s Disease. (ClinicalTrials.gov). Available from: https://clinicaltrials.gov/ct2/show/NCT02519036.

183. Zheng, M., Huang, X., Smith, G.K., Yang, X., and Gao, X. (1996). Genetically Unstable Cxg Repeats Are Structurally Dynamic and Have a High Propensity for Folding. An Nmr and Uv Spectroscopic Study. J Mol Biol 264, 323-336.

184. Tome, S., Manley, K., Simard, J.P., Clark, G.W., Slean, M.M., Swami, M., Shelbourne, P.F., Tillier, E.R., Monckton, D.G., Messer, A., and Pearson, C.E. (2013). Msh3 Polymorphisms and Protein Levels Affect Cag Repeat Instability in Huntington's Disease Mice. PLoS Genet 9, e1003280.

185. Gomes-Pereira, M., Fortune, M.T., Ingram, L., McAbney, J.P., and Monckton, D.G. (2004). Pms2 Is a Genetic Enhancer of Trinucleotide Cag.Ctg Repeat Somatic Mosaicism: Implications for the Mechanism of Triplet Repeat Expansion. Hum Mol Genet 13, 1815-1825.

186. Gannon, A.M., Frizzell, A., Healy, E., and Lahue, R.S. (2012). Mutsbeta and Histone Deacetylase Complexes Promote Expansions of Trinucleotide Repeats in Human Cells. Nucleic Acids Res 40, 10324-10333.

187. Pinto, R.M., Dragileva, E., Kirby, A., Lloret, A., Lopez, E., Claire, J.S., Panigrahi, G.B., Hou, C., Holloway, K., and Gillis, T. (2013). Mismatch Repair Genes Mlh1 and Mlh3 Modify Cag Instability in Huntington's Disease Mice: Genome-Wide and Candidate Approaches. PLoS Genet 9, e1003930.

188. Surtees, J.A., and Alani, E. (2006). Mismatch Repair Factor Msh2-Msh3 Binds and Alters the Conformation of Branched DNA Structures Predicted to Form During Genetic Recombination. J Mol Biol 360, 523-536.

189. Panigrahi, G.B., Slean, M.M., Simard, J.P., Gileadi, O., and Pearson, C.E. (2010). Isolated Short Ctg/Cag DNA Slip-Outs Are Repaired Efficiently by Hmutsbeta, but Clustered Slip-Outs Are Poorly Repaired. Proc Natl Acad Sci U S A 107, 12593-12598.

190. Lang, W.H., Coats, J.E., Majka, J., Hura, G.L., Lin, Y., Rasnik, I., and McMurray, C.T. (2011). Conformational Trapping of Mismatch Recognition Complex Msh2/Msh3 on Repair-Resistant DNA Loops. Proc Natl Acad Sci U S A 108, E837-E844.

191. Tian, L., Gu, L., and Li, G.M. (2009). Distinct Nucleotide Binding/Hydrolysis Properties and Molar Ratio of Mutsalpha and Mutsbeta Determine Their Differential Mismatch Binding Activities. J Biol Chem 284, 11557-11562.

192. Guo, J., Gu, L., Leffak, M., and Li, G.-M. (2016). Mutsβ Promotes Trinucleotide Repeat Expansion by Recruiting DNA Polymerase β to Nascent (Cag)N or (Ctg)N Hairpins for Error-Prone DNA Synthesis. Cell Research 26, 775.

193. Panigrahi, G.B., Lau, R., Montgomery, S.E., Leonard, M.R., and Pearson, C.E. (2005). Slipped (Ctg)*(Cag) Repeats Can Be Correctly Repaired, Escape Repair or Undergo Error-Prone Repair. Nat Struct Mol Biol 12, 654-662.

194. Pluciennik, A., Burdett, V., Baitinger, C., Iyer, R.R., Shi, K., and Modrich, P. (2013). Extrahelical (Cag)/(Ctg) Triplet Repeat Elements Support Proliferating Cell Nuclear Antigen Loading and Mutlalpha Endonuclease Activation. Proc Natl Acad Sci U S A 110, 12277-12282.

195. Kovtun, I.V., Liu, Y., Bjoras, M., Klungland, A., Wilson, S.H., and McMurray, C.T. (2007). Ogg1 Initiates Age-Dependent Cag Trinucleotide Expansion in Somatic Cells. Nature 447, 447-452.

196. Antony, E., and Hingorani, M.M. (2003). Mismatch Recognition-Coupled Stabilization of Msh2-Msh6 in an Atp-Bound State at the Initiation of DNA Repair. Biochemistry 42, 7682-7693.

197. Wilson, T., Guerrette, S., and Fishel, R. (1999). Dissociation of Mismatch Recognition and Atpase Activity by Hmsh2-Hmsh3. J Biol Chem 274, 21659-21664.

198. Gorman, J., Chowdhury, A., Surtees, J.A., Shimada, J., Reichman, D.R., Alani, E., and Greene, E.C. (2007). Dynamic Basis for One-Dimensional DNA Scanning by the Mismatch Repair Complex Msh2-Msh6. Mol Cell 28, 359-370.

199. Dowen, J.M., Putnam, C.D., and Kolodner, R.D. (2010). Functional Studies and Homology Modeling of Msh2-Msh3 Predict That Mispair Recognition Involves DNA Bending and Strand Separation. Mol Cell Biol 30, 3321-3328.

200. Liu, J., Hanne, J., Britton, B.M., Bennett, J., Kim, D., Lee, J.-B.B., and Fishel, R. (2016). Cascading Muts and Mutl Sliding Clamps Control DNA Diffusion to Activate Mismatch Repair. Nature 539, 583-587.

201. Srivatsan, A., Bowen, N., and Kolodner, R.D. (2014). Mispair-Specific Recruitment of the Mlh1-Pms1 Complex Identifies Repair Substrates of the Saccharomyces Cerevisiae Msh2-Msh3 Complex. J Biol Chem 289, 9352-9364.

202. Groothuizen, F.S., Winkler, I., Cristovao, M., Fish, A., Winterwerp, H.H., Reumer, A., Marx, A.D., Hermans, N., Nicholls, R.A., Murshudov, G.N., Lebbink, J.H., Friedhoff, P., and Sixma, T.K. (2015). Muts/Mutl Crystal Structure Reveals That the Muts Sliding Clamp Loads Mutl onto DNA. Elife 4, e06744.

203. Kadyrov, F.A., Dzantiev, L., Constantin, N., and Modrich, P. (2006). Endonucleolytic Function of Mutlalpha in Human Mismatch Repair. Cell 126, 297-308.

204. Pluciennik, A., Dzantiev, L., Iyer, R.R., Constantin, N., Kadyrov, F.A., and Modrich, P. (2010). Pcna Function in the Activation and Strand Direction of Mutlalpha Endonuclease in Mismatch Repair. Proc Natl Acad Sci U S A 107, 16066-16071.

205. Kantartzis, A., Williams, G.M., Balakrishnan, L., Roberts, R.L., Surtees, J.A., and Bambara, R.A. (2012). Msh2-Msh3 Interferes with Okazaki Fragment Processing to Promote Trinucleotide Repeat Expansions. Cell reports 2, 216-222.

206. Lujan, S.A., Williams, J.S., Pursell, Z.F., Abdulovic-Cui, A.A., Clark, A.B., Nick McElhinny, S.A., and Kunkel, T.A. (2012). Mismatch Repair Balances Leading and Lagging Strand DNA Replication Fidelity. PLoS Genet 8, e1003016.

207. Pena-Diaz, J., and Jiricny, J. (2012). Mammalian Mismatch Repair: Error-Free or Error-Prone? Trends in biochemical sciences 37, 206-214.

208. Iyer, R.R., Pluciennik, A., Burdett, V., and Modrich, P.L. (2006). DNA Mismatch Repair: Functions and Mechanisms. Chem Rev 106, 302-323.

209. Iyer, R.R., Pluciennik, A., Genschel, J., Tsai, M.S., Beese, L.S., and Modrich, P. (2010). Mutlalpha and Proliferating Cell Nuclear Antigen Share Binding Sites on Mutsbeta. J Biol Chem 285, 11730-11739.

210. Iyer, R.R., Pohlhaus, T.J., Chen, S., Hura, G.L., Dzantiev, L., Beese, L.S., and Modrich, P. (2008). The Mutsalpha-Proliferating Cell Nuclear Antigen Interaction in Human DNA Mismatch Repair. J Biol Chem 283, 13310-13319.

211. Rogacheva, M.V., Manhart, C.M., Chen, C., Guarne, A., Surtees, J., and Alani, E. (2014). Mlh1-Mlh3, a Meiotic Crossover and DNA Mismatch Repair Factor, Is a Msh2-Msh3-Stimulated Endonuclease. J Biol Chem 289, 5664-5673.

212. Flores-Rozas, H., and Kolodner, R.D. (1998). The Saccharomyces Cerevisiae Mlh3 Gene Functions in Msh3-Dependent Suppression of Frameshift Mutations. Proc Natl Acad Sci U S A 95, 12404-12409.

213. Pearson, C.E., Nichol Edamura, K., and Cleary, J.D. (2005). Repeat Instability: Mechanisms of Dynamic Mutations. Nature reviews. Genetics 6, 729-742.

214. Mirkin, S.M. (2007). Expandable DNA Repeats and Human Disease. Nature 447, 932-940.

215. Landles, C., and Bates, G.P. (2004). Huntingtin and the Molecular Pathogenesis of Huntington's Disease: Fourth in Molecular Medicine Review Series. EMBO reports 5, 958-963.

216. Walker, F.O. (2007). Huntington's Disease. The Lancet 369, 218-228.

217. Slean, M.M., Panigrahi, G.B., Ranum, L.P., and Pearson, C.E. (2008). Mutagenic Roles of DNA "Repair" Proteins in Antibody Diversity and Disease-Associated Trinucleotide Repeat Instability. DNA repair 7, 1135-1154.

218. McMurray, C.T. (2010). Mechanisms of Trinucleotide Repeat Instability During Human Development. Nature reviews. Genetics 11, 786-799.

219. Iyer, R.R., Pluciennik, A., Napierala, M., and Wells, R.D. (2015). DNA Triplet Repeat Expansion and Mismatch Repair. Annu Rev Biochem 84, 199-226.

220. Sinden, R.R., and Wells, R.D. (1992). DNA Structure, Mutations, and Human Genetic Disease. Curr Opin Biotechnol 3, 612-622.

221. Trinh, T.Q., and Sinden, R.R. (1991). Preferential DNA Secondary Structure Mutagenesis in the Lagging Strand of Replication in E. Coli. Nature 352, 544-547.

222. Ohshima, K., and Wells, R.D. (1997). Hairpin Formation During DNA Synthesis Primer Realignment in Vitro in Triplet Repeat Sequences from Human Hereditary Disease Genes. J Biol Chem 272, 16798-16806.

223. Budworth, H., and McMurray, C.T. (2013). Bidirectional Transcription of Trinucleotide Repeats: Roles for Excision Repair. DNA repair 12, 672-684.

224. Rolfsmeier, M.L., Dixon, M.J., and Lahue, R.S. (2000). Mismatch Repair Blocks Expansions of Interrupted Trinucleotide Repeats in Yeast. Mol Cell 6, 1501-1507.

225. Gacy, A.M., Goellner, G., Juranic, N., Macura, S., and McMurray, C.T. (1995). Trinucleotide Repeats That Expand in Human Disease Form Hairpin Structures in Vitro. Cell 81, 533-540.

226. Pearson, C.E., and Sinden, R.R. (1996). Alternative Structures in Duplex DNA Formed within the Trinucleotide Repeats of the Myotonic Dystrophy and Fragile X Loci. Biochemistry 35, 5041-5053.

227. Pearson, C.E., Wang, Y.H., Griffith, J.D., and Sinden, R.R. (1998). Structural Analysis of Slipped-Strand DNA (S-DNA) Formed in (Ctg)N. (Cag)N Repeats from the Myotonic Dystrophy Locus. Nucleic Acids Res 26, 816-823.

228. Pearson, C.E., Tam, M., Wang, Y.H., Montgomery, S.E., Dar, A.C., Cleary, J.D., and Nichol, K. (2002). Slipped-Strand Dnas Formed by Long (Cag)*(Ctg) Repeats: Slipped-out Repeats and Slip-out Junctions. Nucleic Acids Res 30, 4534-4547.

229. Potaman, V., Oussatcheva, E., Lyubchenko, Y.L., Shlyakhtenko, L., Bidichandani, S., Ashizawa, T., and Sinden, R. (2004). Length-Dependent Structure Formation in Friedreich Ataxia (Gaa)N·(Ttc)N Repeats at Neutral Ph. Nucleic Acids Res 32, 1224-1231.

230. Duzdevich, D., Li, J., Whang, J., Takahashi, H., Takeyasu, K., Dryden, D.T., Morton, A.J., and Edwardson, J.M. (2011). Unusual Structures Are Present in DNA Fragments Containing Super-Long Huntingtin Cag Repeats. PloS one 6, e17119.

231. Panigrahi, G.B., Slean, M.M., Simard, J.P., and Pearson, C.E. (2012). Human Mismatch Repair Protein Hmutlalpha Is Required to Repair Short Slipped-Dnas of Trinucleotide Repeats. J Biol Chem 287, 41844-41850.

232. Owen, B.A.L., Yang, Z., Lai, M., Gajek, M., Badger, J.D., Hayes, J.J., Edelmann, W., Kucherlapati, R., Wilson, T.M., and McMurray, C.T. (2005). (Cag)N-Hairpin DNA Binds to Msh2-Msh3 and Changes Properties of Mismatch Recognition. Nat Struct Mol Biol 12, 663-670.

233. Kovtun, I.V., and McMurray, C.T. (2001). Trinucleotide Expansion in Haploid Germ Cells by Gap Repair. Nature genetics 27, 407-411.

234. Savouret, C., Garcia-Cordier, C., Megret, J., te Riele, H., Junien, C., and Gourdon, G. (2004). Msh2-Dependent Germinal Ctg Repeat Expansions Are Produced Continuously in Spermatogonia from Dm1 Transgenic Mice. Mol Cell Biol 24, 629-637.

235. Yoon, S.R., Dubeau, L., de Young, M., Wexler, N.S., and Arnheim, N. (2003). Huntington Disease Expansion Mutations in Humans Can Occur before Meiosis Is Completed. Proc Natl Acad Sci U S A 100, 8834-8838.

236. Pearson, C.E. (2003). Slipping While Sleeping? Trinucleotide Repeat Expansions in Germ Cells. Trends in molecular medicine 9, 490-495.

237. Goula, A.-V., Berquist, B.R., Wilson III, D.M., Wheeler, V.C., Trottier, Y., and Merienne, K. (2009). Stoichiometry of Base Excision Repair Proteins Correlates with Increased Somatic Cag Instability in Striatum over Cerebellum in Huntington's Disease Transgenic Mice. PLoS Genet 5, e1000749.

238. Manley, K., Shirley, T.L., Flaherty, L., and Messer, A. (1999). Msh2 Deficiency Prevents in Vivo Somatic Instability of the Cag Repeat in Huntington Disease Transgenic Mice. Nature genetics 23, 471.

239. Dragileva, E., Hendricks, A., Teed, A., Gillis, T., Lopez, E.T., Friedberg, E.C., Kucherlapati, R., Edelmann, W., Lunetta, K.L., MacDonald, M.E., and Wheeler, V.C. (2009). Intergenerational and Striatal Cag Repeat Instability in Huntington's Disease Knock-in Mice Involve Different DNA Repair Genes. Neurobiology of disease 33, 37-47.

240. Tian, L., Hou, C., Tian, K., Holcomb, N.C., Gu, L., and Li, G.M. (2009). Mismatch Recognition Protein Mutsbeta Does Not Hijack (Cag)N Hairpin Repair in Vitro. J Biol Chem 284, 20452-20456.

241. Shlyakhtenko, L.S., Gall, A.A., Filonov, A., Cerovac, Z., Lushnikov, A., and Lyubchenko, Y.L. (2003). Silatrane-Based Surface Chemistry for Immobilization of DNA, Protein-DNA Complexes and Other Biological Materials. Ultramicroscopy 97, 279-287.

242. LeBlanc, S., Wilkins, H., Li, Z., Kaur, P., Wang, H., and Erie, D.A. (2017). Using Atomic Force Microscopy to Characterize the Conformational Properties of Proteins and Protein–DNA Complexes That Carry out DNA Repair. In Methods in Enzymology (Amsterdam: Elsevier), pp. 187-212.

243. Gourdon, G., Radvanyi, F., Lia, A.-S., Duros, C., Blanche, M., Abitbol, M., Junien, C., and Hofmann-Radvanyi, H. (1997). Moderate Intergenerational and Somatic Instability of a 55-Ctg Repeat in Transgenic Mice. Nature genetics 15, 190.

244. Oussatcheva, E.A., Shlyakhtenko, L.S., Glass, R., Sinden, R.R., Lyubchenko, Y.L., and Potaman, V.N. (1999). Structure of Branched DNA Molecules: Gel Retardation and Atomic Force Microscopy Studies. J Mol Biol 292, 75-86.

245. Sacho, E.J., Kadyrov, F.A., Modrich, P., Kunkel, T.A., and Erie, D.A. (2008). Direct Visualization of Asymmetric Adenine-Nucleotide-Induced Conformational Changes in MutLα. Mol Cell 29, 112-121.

246. García, R., and Pérez, R. (2002). Dynamic Atomic Force Microscopy Methods. Surface Science Reports 47, 197-301.

247. Rast, S., Wattinger, C., Gysin, U., and Meyer, E. (2000). The Noise of Cantilevers. Nanotechnology 11, 169-172.

248. Takagi, A., Yamada, F., Matsumoto, T., and Kawai, T. (2009). Electrostatic Force Spectroscopy on Insulating Surfaces: The Effect of Capacitive Interaction. Nanotechnology 20, 365501.

249. Hong, J.W., Kahng, D.S., Shin, J.C., Kim, H.J., and Khim, Z.G. (1998). Detection and Control of Ferroelectric Domains by an Electrostatic Force Microscope. J Vac Sci Technol B 16, 2942-2946.

250. Hong, J.W., Park, S.I., and Khim, Z.G. (1999). Measurement of Hardness, Surface Potential, and Charge Distribution with Dynamic Contact Mode Electrostatic Force Microscope. Review of Scientific Instruments 70, 1735-1739.

251. Gil, A., Colchero, J., Gómez-Herrero, J., and Baró, A. (2003). Electrostatic Force Gradient Signal: Resolution Enhancement in Electrostatic Force Microscopy and Improved Kelvin Probe Microscopy. Nanotechnology 14, 332.

252. Tevaarwerk, E., Keppel, D.G., Rugheimer, P., Lagally, M.G., and Eriksson, M.A. (2005). Quantitative Analysis of Electric Force Microscopy: The Role of Sample Geometry. Review of Scientific Instruments 76, 053707.

253. Leung, C., Kinns, H., Hoogenboom, B.W., Howorka, S., and Mesquida, P. (2009). Imaging Surface Charges of Individual Biomolecules. Nano Letters 9, 2769-2773.

254. Eom, C.-B., Cava, R., Fleming, R., Phillips, J.M., Marshall, J., Hsu, J., Krajewski, J., and Peck, W. (1992). Single-Crystal Epitaxial Thin Films of the Isotropic Metallic Oxides Sr1–xCaxRuO3 (0 ≤ x ≤ 1). Science 258, 1766-1769.

255. Lowary, P.T., and Widom, J. (1998). New DNA Sequence Rules for High Affinity Binding to Histone Octamer and Sequence-Directed Nucleosome Positioning. J Mol Biol 276, 19-42.

256. Carruthers, L.M., Tse, C., Walker, K.P., 3rd, and Hansen, J.C. (1999). Assembly of Defined Nucleosomal and Chromatin Arrays from Pure Components. Methods Enzymol 304, 19-35.

257. Geng, H., Sakato, M., DeRocco, V., Yamane, K., Du, C., Erie, D.A., Hingorani, M., and Hsieh, P. (2012). Biochemical Analysis of the Human Mismatch Repair Proteins hMutSα Msh2(G674A)-Msh6 and Msh2-Msh6(T1219D). J Biol Chem 287, 9777-9791.

258. Sass, L.E., Lanyi, C., Weninger, K., and Erie, D.A. (2010). Single-Molecule FRET TACKLE Reveals Highly Dynamic Mismatched DNA-MutS Complexes. Biochemistry 49, 3174-3190.

259. Higgins, M.J., Proksch, R., Sader, J., Polcik, M., Mc Endoo, S., Cleveland, J.P., and Jarvis, S.P. (2006). Noninvasive Determination of Optical Lever Sensitivity in Atomic Force Microscopy. Review of Scientific Instruments 77, 013701.

260. Sader, J.E., Chon, J.W., and Mulvaney, P. (1999). Calibration of Rectangular Atomic Force Microscope Cantilevers. Review of Scientific Instruments 70, 3967-3969.

261. Itkonen, H.M., Kantelinen, J., Vaara, M., Parkkinen, S., Schlott, B., Grosse, F., Nyström, M., Syväoja, J.E., and Pospiech, H. (2016). Human DNA Polymerase Α Interacts with Mismatch Repair Proteins Msh2 and Msh6. FEBS letters 590, 4233-4241.

262. Elena T. Herruzo, R.G., Roger B. Proksch. Bimodal Dual AC™ AFM Imaging of Collagen Fiber Ultrastructure. In Asylum Research APP NOTE.

263. Umar, A., Boland, C.R., Terdiman, J.P., Syngal, S., de la Chapelle, A., Ruschoff, J., Fishel, R., Lindor, N.M., Burgart, L.J., Hamelin, R., Hamilton, S.R., Hiatt, R.A., Jass, J., Lindblom, A., Lynch, H.T., Peltomaki, P., Ramsey, S.D., Rodriguez-Bigas, M.A., Vasen, H.F., Hawk, E.T., Barrett, J.C., Freedman, A.N., and Srivastava, S. (2004). Revised Bethesda Guidelines for Hereditary Nonpolyposis Colorectal Cancer (Lynch Syndrome) and Microsatellite Instability. J Natl Cancer Inst 96, 261-268.

264. Riedel, C., Alegria, A., Schwartz, G., Arinero, R., Colmenero, J., and Sáenz, J. (2011). On the Use of Electrostatic Force Microscopy as a Quantitative Subsurface Characterization Technique: A Numerical Study. Applied Physics Letters 99, 023101.

265. Umeda, K.-i., Kobayashi, K., Oyabu, N., Hirata, Y., Matsushige, K., and Yamada, H. (2014). Practical Aspects of Kelvin-Probe Force Microscopy at Solid/Liquid Interfaces in Various Liquid Media. Journal of Applied Physics 116, 134307.

266. Gramse, G., Edwards, M., Fumagalli, L., and Gomila, G. (2012). Dynamic Electrostatic Force Microscopy in Liquid Media. Applied Physics Letters 101, 213108.

267. Xu, S., and Arnsdorf, M.F. (1995). Electrostatic Force Microscope for Probing Surface Charges in Aqueous Solutions. Proc Natl Acad Sci U S A 92, 10384-10388.

268. Lilliu, S., Maragliano, C., Hampton, M., Elliott, M., Stefancich, M., Chiesa, M., Dahlem, M.S., and Macdonald, J.E. (2013). EFM Data Mapped into 2D Images of Tip-Sample Contact Potential Difference and Capacitance Second Derivative. Sci Rep 3, 3352.

269. Kuttler, K. (2007). An Introduction to Linear Algebra (Provo: Brigham Young University).

270. Scheres, S.H. (2010). Classification of Structural Heterogeneity by Maximum-Likelihood Methods. In Methods in Enzymology (Amsterdam: Elsevier), pp. 295-320.

271. MacKay, D.J.C. (2003). Information Theory, Inference and Learning Algorithms (Cambridge: Cambridge University Press).

272. Sigworth, F.J., Doerschuk, P.C., Carazo, J.-M., and Scheres, S.H. (2010). An Introduction to Maximum-Likelihood Methods in Cryo-EM. In Methods in Enzymology (Amsterdam: Elsevier), pp. 263-294.

273. Rokach, L., and Maimon, O. (2005). Clustering Methods. In Data Mining and Knowledge Discovery Handbook (Berlin: Springer), pp. 321-352.

274. Rousseeuw, P.J. (1987). Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20, 53-65.

275. Xie, J., Heng, P.-A., and Shah, M. (2008). Shape Matching and Modeling Using Skeletal Context. Pattern Recognition 41, 1756-1767.

276. Frank, J., Bretaudiere, J.P., Carazo, J.M., Verschoor, A., and Wagenknecht, T. (1988). Classification of Images of Biomolecular Assemblies: A Study of Ribosomes and Ribosomal Subunits of Escherichia Coli. Journal of microscopy 150, 99-115.
