Nucleic Acid High-Throughput Sequencing Studies Present Unique Challenges in Analysis and Interpretation

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Kenji Oman, B.S., M.S.

Graduate Program in Physics

The Ohio State University

2015

Dissertation Committee:

Dr. Ralf Bundschuh, Advisor

Dr. Kurt Fredrick

Dr. Richard Furnstahl

Dr. Michael Poirier

© Copyright by

Kenji Oman

2015

Abstract

From the discovery of nucleic acids and their significance as information carriers in the cell to the development of high-throughput sequencing (HTS) techniques, molecular biology has seen ever-increasing advances in our understanding of the mechanisms of life. Here we first present a brief overview of the progression of our understanding of nucleic acids, and of the current HTS techniques used to study them. We then investigate the interaction of methyl-binding-domain (MBD) with methylated DNA, as used in MethylCap-seq (an HTS technique), and present a model for their interaction, as well as a Bayesian model that uses this increased understanding to predict methylation levels in samples with an unknown methylation profile. We next introduce an HTS analysis pipeline we have developed, and examine the use of this pipeline in the analysis of 5′-end seq data, ultimately leading to its abandonment. Finally, we present a further application of the pipeline in our investigation of lepA’s role in translation initiation and elongation in E. coli.

To my parents, who always told me I could.

Acknowledgments

There are many people who have helped me get to the point of a Ph.D. First and foremost, I would like to thank my parents and family for their words of encouragement and support through every step of my education. Their influence continues to be felt. I would also like to thank my many teachers and professors. Through their efforts, I have learned a little something of the world, and my eyes have been opened to the complexities of nature. Finally, I would like to thank those who have had a direct hand in my training as a scientist.

First, I must thank my advisor, Dr. Ralf Bundschuh. Through our many interactions, his guidance, and patience with me, I have grown from knowing next to nothing about biology, programming, and data analysis, to gaining a grasp of each. His example, balancing work and family, is an inspiration to me. I must also thank my fellow graduate students working with Prof. Bundschuh: Cai Chen, Yi-Hsuan Lin, Billy Baez, Blythe Moreland, and Dengke Zhao. Through our interactions, I have gained a broader appreciation of the variety of biophysics questions and techniques. I would also like to thank Catharine Shipps for her diligent effort in helping with our research, as well as Ryan Mangelson, who helped with another one of our projects.

I would also like to thank Dr. Michael Poirier and the students of his group for our weekly group meetings—they have been most informative to me as an introduction to some of the challenges of experimental biophysics, and have been a great means of giving me presentation practice.

There are also our collaborators, without whose work I would have had no data to work with and would have had to do a very different Ph.D. I also appreciate the many conversations we had in our weekly meetings, and their patience with me as I better learned the biology of our projects. We have the PIs, Drs. Pearlly Yan, Kurt Fredrick, and Daniel Schoenberg, as well as the post-docs, Drs. Dan Kiss, Chandrama Mukherjee, and Bappa Roy, and graduate students Rohan Balakrishnan, Jackson Trotman, and David Frankhouser.

Finally, there is my committee, including my advisor Prof. Bundschuh, who read through all parts of my manuscript and provided numerous comments and suggestions for improvement, as well as Drs. Kurt Fredrick, Richard Furnstahl, and Michael Poirier. I thank them for reading through my dissertation and for their patience with me as I have gone through the process of dissertation writing and defense. Despite their considerable help, I am certain errors remain, which are fully my own.

Vita

May, 2009 ...... B.S., Carnegie Mellon University, Pittsburgh, PA

August, 2012 ...... M.S., The Ohio State University, Columbus, OH

Publications

Rohan Balakrishnan, Kenji Oman (co-first author), Shinichiro Shoji, Ralf Bundschuh, Kurt Fredrick. The conserved GTPase LepA contributes mainly to translation initiation in Escherichia coli. Nucl. Acids Res., 42:13370-13383 (2014).

Daniel L. Kiss, Kenji Oman, Ralf Bundschuh, Daniel R. Schoenberg. Uncapped 5′ ends of mRNAs targeted by cytoplasmic capping map to the vicinity of downstream CAGE tags. FEBS Letters 3:279-284 (2015).

Daniel L. Kiss, Kenji Oman, Julie A. Dougherty, Chandrama Mukherjee, Ralf Bundschuh, Daniel R. Schoenberg. Cap homeostasis is independent of poly(A) tail length. Nucl. Acids Res., in review.

Blythe Moreland, Kenji Oman (co-first author), Pearlly Yan, Ralf Bundschuh. Methyl-CpG MBD2 interaction requires minimum separation and exhibits minimal sequence specificity. Biophys. J., in preparation.

Fields of Study

Major Field: Physics

Studies in Nucleic Acids: Dr. Ralf Bundschuh

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Figures
List of Tables
List of Abbreviations

Chapters

1 An Introduction to Nucleic Acids and Modern Sequencing Techniques
  1.1 An overview of nucleic acids
    1.1.1 Discovery of information transfer through DNA
    1.1.2 Structure of DNA and RNA
    1.1.3 DNA replication, the central dogma, and the protein code
  1.2 DNA/RNA Sequencing
    1.2.1 Modern sequencing techniques
    1.2.2 Applications of next-generation sequencing
    1.2.3 Challenges of next-generation sequencing
  1.3 Scientific contributions to the field
  1.4 Conclusions

2 MBD-DNA Interactions as Probed through HTS
  2.1 Introduction
    2.1.1 MBD background
  2.2 Methods
    2.2.1 Pre-Data analysis
    2.2.2 Preliminary priming and questions asked
    2.2.3 Library analysis workflow overview
    2.2.4 Question 1a: Genomic CpG content vs input, examining protocol bias
    2.2.5 Question 1b: Genomic G/C content vs input, examining protocol bias
    2.2.6 Analysis overview for remaining questions
    2.2.7 Model-Building
    2.2.8 Model Predictions
  2.3 Results/Discussion
    2.3.1 Sequencing introduces a G/C content bias
    2.3.2 MBD binding to methylated CpG shows no significant position dependence
    2.3.3 MBD binding to two CpGs simultaneously requires minimum separation, and shows reduced binding at an intermediate level of separation
    2.3.4 MBD binding to 3 CpGs shows similar pairwise separation dependence as for the 2 CpG case
    2.3.5 MBD binding multiple CpGs shows unexpected pulldown behavior
    2.3.6 Model fitting to data
    2.3.7 Utilizing the Bayesian model
  2.4 Conclusion/Future Work

3 An Overview of HTS Analysis Pipeline/Tools Developed with a Case Study in the Analysis of 5′-end Sequencing Data
  3.1 Introduction
  3.2 Pipeline/tools developed
    3.2.1 Read sequencing
    3.2.2 Raw to aligned
    3.2.3 Computational removal of rRNA reads
    3.2.4 BAM file quality controls
    3.2.5 Genomic coverage summary
    3.2.6 Normalizations
    3.2.7 Coverage per position visualization techniques
    3.2.8 Differential expression analysis
    3.2.9 Local expression variability
    3.2.10 Analysis of local expression variability
  3.3 Capping of RNA: a regulator of transcript life cycle
    3.3.1 Cell culture preparation
    3.3.2 5′-end seq workflow overview
    3.3.3 Application of Methods to 5′-end Seq
    3.3.4 Results/Discussion
  3.4 Conclusions

4 An Investigation of LepA’s Function in E. coli
  4.1 Introduction: Background on LepA
    4.1.1 LepA is highly conserved, and yet not well understood
    4.1.2 Synthetic phenotypes exhibited by ∆lepA
    4.1.3 Deletion of the active-site histidine or the unique C-terminal domain (CTD) in LepA fails to complement the synthetic phenotypes
    4.1.4 Examining LepA’s effect on the transcriptome and translatome
  4.2 Methods
    4.2.1 LepA investigations
  4.3 Results
    4.3.1 Without LepA, many mRNA coding regions exhibit reduced average ribosome density (ARD)
    4.3.2 LepA’s effect on ARD is related to the sequence of the translation initiation region (TIR)
    4.3.3 LepA’s effect on ribosome distribution along mRNAs
  4.4 Discussion
    4.4.1 Loss of LepA mainly affects translation initiation
    4.4.2 LepA’s translation elongation effects are codon specific and are comparatively minor
    4.4.3 Perturbations in expression likely explain the observed synthetic phenotypes of ∆lepA
  4.5 Conclusions

Bibliography

List of Figures

1.1 Space-filling atomistic and chemical structure models of DNA: A space-filling model of DNA (a) provides a 3 dimensional view of DNA’s basic structure, indicating the familiar double helix. (b) provides a chemical structure view of DNA’s composition. (a from [7], b from [8]).

1.2 mRNA translation and the protein code: proteins are produced through the translation of mRNA by the ribosome (a), which travels from the 5′ to the 3′ direction on the mRNA. tRNAs matching the open codon of the mRNA arrive, bringing with them their corresponding amino acid, as per the genetic code (b), correlating the mRNA nucleotide sequence of the codon to the corresponding amino acid (a modified from [12]; b from [13]).

1.3 Illumina sequencing overview: (a) Adaptors are attached to DNA fragments. Upon hybridization of adaptors to their complements in the flow cell, a bridge amplification process is employed, producing localized clusters within the flow cell of identical copies of library fragments (b). (c) In sequencing, DNA polymerase incorporates fluorescently labeled nucleotides, modified to prevent continued strand synthesis, allowing for the instrument to detect (through imaging all clusters in the flow cell) which nucleotide was incorporated in each cluster. Post-imaging, the synthesis block and fluorophore are removed, allowing for the next nucleotide incorporation to take place, iteratively determining the sequence of the fragment (from [27]).

1.4 Paired-end sequencing: DNA library fragments are ligated with adaptors, uniquely identifying each end of the fragments. Single-end sequencing only sequences one end of the fragment, but paired-end sequencing sequences both ends (as shown), allowing for better alignment of fragments to the reference genome, de novo assembly of new genomes, as well as providing knowledge of fragment lengths. (adapted from [27])

1.5 An Overview of the Ribosome Profiling Protocol: As shown, unprotected transcript regions undergo nuclease digestion, leaving ribosome protected fragments for capture and sequencing. (Reprinted by permission from Macmillan Publishers Ltd: Nature Protocols [48] copyright 2012)

1.6 MBD capture of methylated CpG oligos, vs an antibody-based approach: MBD (blue) vs an antibody (green)’s capture of synthetic oligos is shown, displaying MBD’s enhanced ability to capture methylated DNAs. Of note is the very non-linear behavior of MBD’s capture rates for varying number of methylated CpGs per oligo (0 vs 1 CpG is essentially unchanged, with significant enhancements for 2, 3, and 4 methylated CpGs). Oligos were 80 bp duplex DNAs containing 0 to 8 methylated CpGs. (From [72])

2.1 Chicken MBD2 binding to methylated DNA: (a) The 3-D structure of MBD2 (cyan) binding to methylated DNA (mCpG bases in yellow in center). (b) Base-specific (solid lines) and phosphate backbone (dashed lines) contact points of MBD2 to the DNA. (From [93], used by permission of Oxford University Press)

2.2 Synthetic oligo targets for mouse MBD2b: A schematic representation of methylated CpGs on the respective fragments. (From [96], used by permission of Oxford University Press)

2.3 Input samples show CpG and G/C content bias compared with genomic levels: When comparing DNA segments of 150 bps from input reads (red) and across the entire genome (green), we see a significant difference in CpG prevalence (a), and G/C content (b).

2.4 G/C bias correction fixes CpG content differences: Using correction factors to account for G/C content sequencing bias (a), we recover genomic levels (red) of CpG content in input samples (green; b).

2.5 MBD binding to methylated CpG shows no location dependence: 1 CpG pulldown efficiency (a) shows weird edge effects on the 5′-end of fragments, but otherwise a relatively constant pulldown efficiency across the fragment length, with a sudden decrease around 100 bps from the 5′-end. Simultaneously fitting our model to 1 and 2 CpG pulldown fractions, we see the model fits well to the 1 CpG data (panel b; model in red, data in green). Fitting produces an average fragment length of 100.6 bps, a standard deviation of 15.51 bps, and a 1 CpG vs no CpG relative pulldown rate, R1, of 1.318.

2.6 MBD binding to methylated CpGs shows separation dependence: For 2 CpG pulldown efficiency (a), we see low pulldown when there are 0 or 1 base pairs between the 2 CpGs, an intermediate level of pulldown for 2 separation, and fully recovered pulldown at higher separations. Fitting our model (red) to the pulldown sequenced fragments (green; b; simultaneous fitting with 1 CpG Location data, seen in Figure 2.5b), we find the 3 separation classes (0–1 bps, 2 bps, and ≥ 3 bps), along with a Gaussian length distribution and G/C content correction, to be sufficient to capture MBD’s binding to 2 CpG fragments, suggesting that for sufficient separations, two MBDs are really binding to the two methylated CpG sites, and insufficient separation between CpGs leads to the expected steric clash between MBDs.

2.7 MBD shows similar pairwise separation dependence as for 2 CpGs: Looking at pairwise separation dependence of MBD binding to 3 CpGs as a whole (a), we see similar behavior as that exhibited in 2 CpGs. Namely, we see that low CpG separations inhibit pulldown efficiency (cuts along either axis), which is recovered for higher separations (points further away from either axis), with pulldown values steadily falling for higher separations. A cut along a separation of 10 bps is also shown (b), confirming this behavior (red is for first separation constrained to be 10 bps, and green is for second separation having the 10 bps constraint).

2.8 3 CpG pairwise separation cut along 2 bps separation shows inhibited pulldown for the intermediate state: Examining a cut along 2 bps of separation, we see the state CGNNCGNNCG is even more inhibited than the 2 CpG separation value for 2 bps of separation, almost as low as 0–1 bp separation.

2.9 MBD binding to multiple CpGs: When examining MBD binding to multiple CpGs per DNA fragment, we see an expected qualitative cooperative binding type behavior, out to about 10 CpGs per fragment. However, we expected the pulldown efficiency to remain plateaued after this point, but we see a second increase in pulldown efficiency for higher CpGs.

2.10 A range of 1 CpG pulldown values: Visually inspecting multiple 1 CpG pulldown values (R1; using the fitted average fragment length and standard deviation of 100.6 bps and 15.51 bps respectively) shows a range of pulldown values that seem to fit well. Fitting pulldown values R~ on all data pushes R1 up, so the highest visually acceptable value of R1 = 1.45 was accepted for further fittings.

2.11 2 CpG pulldown fragments with fit: Constraining the average fragment length to be L = 100.6 bps, standard deviation to be S = 15.51 bps, and 1 CpG relative pulldown R1 = 1.45, and fitting to all data simultaneously, we see our model seems to fit the 2 CpG separation dependence fairly well. However, on closer inspection, we notice the intermediate state of 2 bps separation between the 2 CpGs is not recovered in the fitting.

3.1 Consensus sequence determination for the top 500 reads quality check: once one of the top 500 most highly covered genomic regions is found (reads starting at the line on the left), with a minimum coverage of 10 reads (not shown), a consensus sequence (blue) is determined by counting the frequency of each base (A, C, G, T) at each position, from all reads intersecting with this region. Variability in read sequences (red) exists, but the most common base per position (the consensus sequence) is reported.

3.2 Transcript by transcript view of coverage provides local detailed view of treatment effects: (a - c) show cartoon representations of a transcript view of coverage for 3 example genes, showing the effects of 3 treatments (control = blue, red/green are other treatments) on the local gene abundance. Interpretation is dependent on experiment, but for ribosome profiling experiments, for example, this view can provide insight into translation pause sites. (d) shows metagene view of the coverage, averaging over the 3 genes shown, with no further normalizations (see 1 in Metagene view on page 60).

3.3 5′-end seq workflow: The capture of uncapped RNA 5′ ends. 1. 5′ capped (black circle) and uncapped (no circle) cellular extract transcripts are annealed to complementary sequences (random N’s), ligated to known adaptors. 2. RNA complementary to the adaptor sequence (cRNA) is annealed to the adaptors. 3. Upon the addition of RNA Ligase, uncapped transcripts are ligated to the cRNA sequences, while ligation is blocked by the 5′ caps of capped RNAs. 4. Upon denaturing of the random sequences (Ns) from the transcripts, cellular transcripts are left with either 5′ caps, or cRNA sequences on the 5′ ends. 5. cRNA sequences are selected and 6., cDNA libraries are constructed, providing a library of uncapped RNA transcript 5′ ends.

3.4 Uncapped 5′ end replicate comparisons. The fraction of 5′ end locations which passed a minimum coverage with the complementary replicate also having a minimum of one read expressed, is compared with the total number of 5′ end locations which passed the cutoff indicated (see Equation 3.1). The labeled replicate (“+dox, A”, for example), is the replicate which passed the coverage cutoff.

3.5 Uncapped 5′ end replicates fail to replicate. 5′-end seq coverage for transcript ENST00000229239.5 (GAPDH), a highly expressed transcript as measured by the fraction of reads mapping to it in the 5′-end seq libraries, is shown as an example, normalized by total reads per library. (a-b) show No siRNA treatment, (c-d) show Xrn1 siRNA treatment, with (a, c) showing the full transcript coverage and (b, d) displaying the indicated smaller focus region of the transcript’s coverage. As indicated, the replicates with only the native capping enzyme expressed are displayed with red-tinted circles, while the replicates with the dominant negative capping enzyme expressed are displayed with blue-tinted squares. As clearly seen, replicate values per position fail to correlate in general.

3.6 Global replicate vs replicate coverage comparisons show a subset of 5′ end positions have an average coverage shifted by ∼ 10 times the remaining values: A representative of the different correlation comparisons is shown, where (a) shows No siRNA, −dox, + strand coverages of replicates A vs B, (b) shows the same for Control siRNA, (c) shows Xrn1 siRNA, +dox, − strand coverages for A vs B, and (d) is likewise for replicate A’s + vs − strand coverages, as an example of uncorrelated data. Each data point represents a genomic position where the raw reads coverage is read off of the respective axis.

4.1 Total RNA and ribosome footprint coverage is highly replicable. Shown are biological replicate 1 and 2 compared to each other in each of the strains (wild-type, mutant, complemented) for the two library types (total RNA and ribosome footprints), comparing the raw reads coverage for each genomic location. The Pearson correlation coefficients for all replicate-to-replicate comparisons are as follows: wild-type (WT) total RNA, r = 0.999, 0.997, and 0.998; mutant (M) total RNA, r = 0.998, 0.997, and 0.998; complemented (C) total RNA, r = 0.996, 0.962, and 0.968; WT ribosome footprints, r = 0.992, 0.990, and 0.995; M ribosome footprints, 0.997, 0.976, and 0.981; C ribosome footprints, r = 0.995, 0.997, and 0.995. (Data collected by R. Balakrishnan, correlated by K. Oman, from [54])

4.2 Ribosome footprint fragments end 14 nts downstream of the center of the P site. The coverages of 3′ end positions of ribosome footprint fragments were normalized for each gene to the total RNA coverage and length of the genes, indexed by their distances from the stop codons of their respective genes, and averaged over all genes, all three conditions, and all replicates (0 representing the third codon position of the stop codon). The peak at 10 nts beyond the stop codon, which we interpret as stemming from ribosomes located with the stop codon in their A site, indicates that the center of the P site is 14 nts upstream of the most prevalent 3′ end of the ribosome footprint fragments. (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54])

4.3 Examples of genes that exhibit decreased ARD in the absence of LepA. Total-RNA and ribosome-footprint read counts for WT, M, and C strains are shown mapped back to the genome in the vicinity of ychH (A), raiA (B), lldP (C), dctA (D) and aldA (E). Ribosome-footprint reads are mapped to the genomic position corresponding to the predicted central nucleotide of the P codon, and total-RNA reads are mapped to the center of the read fragment. Read counts are normalized with respect to total number of reads (after quality control and rRNA read removal), making the histograms of analogous data tracks directly comparable in each panel. (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54])

4.4 Gene expression is globally perturbed in the absence of LepA. Normalized gene-by-gene coverages are compared between ribosome footprints and total RNA in WT, M and C strains (as indicated). A measure of spread, d2, was calculated for each of the comparisons, yielding 0.62±0.02, 1.12±0.02 and 0.88±0.03 for the WT, M and C samples, respectively. Student’s t-tests on the values of d2 in all three replicates yield P = 1.6 × 10⁻⁵ for the WT versus M, and P = 1.2 × 10⁻³ for the M versus C comparison, showing significant perturbation of translation efficiencies (i.e. changes in ARD values) due to loss of LepA. (Data collected by R. Balakrishnan, correlated by K. Oman, from [54])

4.5 Effects of LepA on ARD are related to the TIR sequence. Nt frequencies at each position of the TIR were determined for the subset of genes with decreased ARD in the absence of LepA (WT, C > M; 237 genes), the subset of genes with increased ARD in absence of LepA (WT, C < M; 283 genes) and all genes analyzed (1870). Purine and pyrimidine frequencies for each subset, relative to those of the complete set, are plotted as a function of TIR position (as indicated; position zero corresponds to the first nt of the start codon). Binomial tests indicate that pyrimidines are significantly underrepresented in the former subset (WT, C > M) at positions -11 (P = 7.4 × 10⁻⁴), -10 (P = 2.9 × 10⁻³) and -9 (P = 1.6 × 10⁻³) and significantly overrepresented in the latter subset (WT, C < M) at positions -11 (P = 3.3 × 10⁻²), -10 (P = 3.7 × 10⁻⁴), -9 (P = 2.5 × 10⁻²) and -8 (P = 2.6 × 10⁻²), data points marked with asterisks. The differences seen at the first position of the start codon (position zero) are deemed less than statistically significant (only 40 of the 1870 analyzed genes begin with a pyrimidine). Genes infC and pcnB have the rare start codon AUU and hence were omitted from the analysis. The second and third nts of the start codon (UG) are otherwise invariant and assigned the value of 1.0. (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54])

4.6 Metagene analysis reveals generally reduced ribosome density at the 5′ end of coding regions for both the mutant and complemented strains. Ribosome density values were calculated for each gene position and then averaged across all high-coverage genes. Shown are plots of metagene-averaged ribosome density as a function of gene position (codons 1-33) for the WT, M, and C strains (as indicated). (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54])

4.7 LepA prevents ribosomal pausing at certain GGU codons. Aligned are the coding sequences corresponding to the predicted paused ribosomes seen specifically in the mutant strain. The left column identifies the gene and the codon number (of the P codon of the paused complex). Codon GGU (red) is significantly overrepresented as the A codon (P = 7.4 × 10⁻¹³), based on a two-tailed binomial test. GGC (blue), the other codon recognized by Gly-tRNAGly3, is seen to occupy the A site in two cases, which is deemed less-than-significant enrichment. (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54])

List of Tables

1.1 Typical next-generation sequencing platform specifications

2.1 Fractional saturation of mouse MBD2b

2.2 Binding affinity of chicken MBD2

3.1 STAR alignment and rRNA removal statistics of 5′-end seq data

4.1 Strains in which ∆lepA confers a synthetic growth defect

4.2 Coding regions that exhibit reduced ARD in the absence of LepA

4.3 Relative abundance of various types of RNA in cells containing and lacking LepA

List of Abbreviations

ARD average ribosome density. viii, ix, xiv–xvi, 15, 81, 82, 84–89, 92–94

C complemented. xiv, xv, 76, 77, 79–86, 88–90, 94

CGI CpG island. 17

CTD C-terminal domain. viii, 74, 76

HTS high-throughput sequencing. ii, vii, viii, 13, 17, 52, 58, 63, 70, 73, 76

M mutant. xiv, xv, 76, 77, 79–86, 88–90, 94

MBD methyl-binding-domain. ii, vii, viii, xi, xii, xvi, 14, 15, 17–23, 29, 30, 32, 40–51

RD ribosome density. 82, 83, 92

SD Shine-Dalgarno. 15, 88, 92, 94, 96

TIR translation initiation region. ix, xv, 15, 82, 87, 88, 92–96

WT wild-type. xiv, xv, 76, 77, 79–86, 88–90, 94

ZMW zero-mode waveguide. 6, 7

Chapter 1

An Introduction to Nucleic Acids and Modern Sequencing Techniques

1.1 An overview of nucleic acids

Nucleic acids, so named from their discovery within the nucleus, and for their containing phosphate groups (as seen also in phosphoric acid) [1], are fundamental carriers of genetic information in all living organisms. From their discovery to the present, much has been understood of their composition, function, and interaction with other bio-molecules. Here we present an overview of our progressive understanding of nucleic acids, some of what is known about them, and current techniques in the sequencing of nucleic acids.

1.1.1 Discovery of information transfer through DNA

From Charles Darwin’s 1859 text, On the Origin of Species by Means of Natural Selection, Or, The Preservation of Favoured Races in the Struggle for Life [2], which presented the premise that evolution occurs as a transmission of inheritable characteristics through the process of natural selection, to Gregor Mendel’s work in establishing the laws of inheritance [3], we have understood that there is a form of information transfer from parent to offspring in living organisms. Thomas Hunt Morgan showed in 1910 that the inheritable information, or gene, is stored in chromosomes [4], but it was still not understood which of DNA (deoxyribonucleic acid) or protein carries the genetic information. The Avery-MacLeod-McCarty experiment in 1944 then showed that DNA is the hereditary information carrier in bacteria [5], which was further supported by the Hershey-Chase experiments, which showed that DNA, and not protein, contains the genetic information in the T2 bacteriophage [6].

Figure 1.1: Space-filling atomistic and chemical structure models of DNA: A space-filling model of DNA (a) provides a 3 dimensional view of DNA’s basic structure, indicating the familiar double helix. (b) provides a chemical structure view of DNA’s composition. (a from [7], b from [8]).

1.1.2 Structure of DNA and RNA

Despite an increased understanding of the significance of DNA’s role in genetics, it was not until Watson and Crick’s paper in 1953 [9; 10] that the structure of DNA was known (see Figure 1.1). DNA has a phosphate and deoxyribose backbone, which forms the sides of the double-helical “ladder”. The “rungs” of the ladder are formed by the four bases, adenine (A), cytosine (C), guanine (G), and thymine (T), where A and T form complementary bonds with each other, as do C and G. In RNA (ribonucleic acid), deoxyribose in the backbone is replaced by ribose (which has an additional hydroxyl group, or OH, at the 2′ position), and T is replaced by uracil (U), which base pairs with A. Also in contrast to DNA, RNA in general is single-stranded, except when forming intra-strand structures through complementary base pairing.

Figure 1.2: mRNA translation and the protein code: proteins are produced through the translation of mRNA by the ribosome (a), which travels from the 5′ to the 3′ direction on the mRNA. tRNAs matching the open codon of the mRNA arrive, bringing with them their corresponding amino acid, as per the genetic code (b), correlating the mRNA nucleotide sequence of the codon to the corresponding amino acid (a modified from [12]; b from [13]).

When identifying a DNA or RNA strand, a list of letters is given, indicating the sequence of the four bases (e.g. “ACTGTGA”). This sequence is written from the 5′ to 3′ direction along the strand, where 5′ and 3′ positions are based off of the asymmetric structure of the backbone. On one end of the backbone, there is a dangling phosphate on the 5′ position of the (deoxy)ribose, and on the other end of the backbone is a dangling hydroxyl group off of the 3′ position. For DNA, which is double stranded, the sequence is given based off of the reference strand in the published genome of the organism.
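The 5′ to 3′ writing convention and the base-pairing rules described above can be illustrated with a minimal Python sketch. The function names `reverse_complement` and `to_rna` are hypothetical helpers introduced here for illustration only; they are not part of any pipeline described in this dissertation.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the opposite DNA strand, written in its own 5'->3' direction.

    The two strands run antiparallel, so the partner strand is obtained by
    complementing every base and then reading the result in reverse.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def to_rna(seq: str) -> str:
    """Return the RNA with the same 5'->3' sequence as the given DNA strand.

    RNA uses uracil (U) wherever DNA has thymine (T).
    """
    return seq.upper().replace("T", "U")


if __name__ == "__main__":
    dna = "ACTGTGA"  # example sequence from the text, written 5'->3'
    print(reverse_complement(dna))  # opposite strand, also 5'->3'
    print(to_rna(dna))              # same sequence in the RNA alphabet
```

Applying `reverse_complement` twice returns the original sequence, which is one quick sanity check that both strands are being written in the same 5′ to 3′ convention.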

1.1.3 DNA replication, the central dogma, and the protein code

The double-stranded complementary nature of DNA immediately suggests a mechanism for DNA replication, where each strand of the DNA acts as a template for newly synthesized DNA. Indeed, this was first hypothesized by Watson and Crick with their discovery of the structure of DNA [9; 10], and was confirmed in the Meselson-Stahl experiment in 1958 [11].

Shortly after their discovery of the structure of DNA, Crick proposed what he coined “The Central Dogma of Molecular Biology” [14; 15], which states the information flow within a cell cannot transfer from protein to protein, or back from protein to nucleic acid. In general, it is the transfer of information from DNA→DNA (DNA replication), DNA→RNA (transcription) and RNA→protein (translation; see Figure 1.2a), which has indeed been shown to be the case [15–17].

Considering how there are 20 standard amino acids used by living cells (in proteins), with only four different nucleotides, George Gamow postulated that there must be a code of 3 nucleotides to translate RNA into protein [18]. This is because, if there were only two nucleotides in the code, there would only be $4^2 = 16$ different combinations, fewer than the 20 amino acids, while a code three nucleotides in length provides $4^3 = 64$ combinations, which is sufficient. Subsequent experiments proved this to be true (see Figure 1.2b) [17; 19–21], with the resulting three-nucleotide “words” being called codons, and the three possible interpretations for a given sequence being called “reading frames”, which are the sets of consecutive, non-overlapping triplets of nucleotides. An Open Reading Frame (ORF) is a reading frame that has the potential to be translated into a protein: it starts with a start codon and has a downstream stop codon within the same reading frame.
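The counting argument and the notion of reading frames can be sketched in a few lines. The following is a minimal illustration only: it assumes the standard genetic code's start codon (AUG) and stop codons (UAA, UAG, UGA), which are not listed in the text above, and all function names are hypothetical.

```python
# Minimal sketch; assumes the standard start (AUG) and stop (UAA, UAG, UGA)
# codons. All names are hypothetical and chosen for illustration.
BASES = "ACGU"
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}


def n_words(length: int) -> int:
    """Number of distinct nucleotide 'words' of the given length."""
    return len(BASES) ** length


def reading_frames(seq: str) -> list[list[str]]:
    """Split a sequence into its three frames of non-overlapping triplets."""
    return [
        [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        for offset in range(3)
    ]


def find_orfs(seq: str) -> list[tuple[int, list[str]]]:
    """Return (frame, codons) for each start-to-stop stretch in each frame."""
    orfs = []
    for frame, codons in enumerate(reading_frames(seq)):
        start = None
        for idx, codon in enumerate(codons):
            if start is None and codon == START:
                start = idx                      # first start codon opens an ORF
            elif start is not None and codon in STOPS:
                orfs.append((frame, codons[start:idx + 1]))
                start = None                     # in-frame stop closes the ORF
    return orfs
```

Here `n_words(2)` and `n_words(3)` reproduce the 16 and 64 combinations discussed above, and `find_orfs` only scans the given strand, so the complementary strand would need to be passed in separately.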

Thus, within a matter of ∼20 years, we have come to understand the universal storage and transfer of information within living organisms, as well as the lexicon translating from one information storage medium to another.

1.2 DNA/RNA Sequencing

Given that the genomic information is stored in DNA for living organisms, one may desire to determine the sequence of DNA, in an effort to interpret that information. Early efforts to obtain sequences were painstaking, such as the wandering-spot analysis by Gilbert and Maxam [22]. However, Frederick Sanger improved the speed and efficiency of sequencing with his technique, now termed Sanger Sequencing [23; 24], which has been optimized to provide read sequences up to ∼1000 bps in length, with per-base “raw” accuracies as high as 99.999% [25]. With the high levels of accuracy and long sequence length, Sanger Sequencing is still in use today. However, its limited capacity for parallelization makes full genome sequencing a time-consuming and expensive task, and so alternative methods of sequencing have been developed.

Table 1.1: Typical next-generation sequencing platform specifications

Platform                    Illumina MiSeq    Ion Torrent PGM     PacBio RS                         Illumina HiSeq 2000
Instrument cost             $128K             $80K                $695K                             $654K
Sequence yield per run      1.5-2Gb           1Gb on 318 chip     100Mb                             600Gb
Sequencing cost per Gb      $502              $1000 (318 chip)    $2000                             $41
Run time                    27 hours          2 hours             2 hours                           11 days
Reported accuracy           Mostly >Q30       Mostly Q20                                            Q30
Observed raw error rate     0.80%             1.71%               12.86%                            0.26%
Read length                 up to 150 bases   200 bases           Avg. 1500 bases (C1 chemistry)    up to 150 bases
Paired reads                Yes               Yes                 No                                Yes
Insert size                 up to 700 bases   up to 250 bases     up to 10kb                        up to 700 bases
Typical DNA requirements    50-1000ng         100-1000ng          ∼1000 ng                          50-1000ng

Adapted from [26]
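The accuracies in Table 1.1 are quoted as Q scores, while the Sanger figure is quoted as a percentage. Assuming the Q scores follow the usual Phred convention, Q = −10 log10(p) with p the per-base error probability (an assumption, as the convention is not stated above), the two can be put on the same scale with the following sketch. Function names are hypothetical.

```python
import math


def phred_to_error(q: float) -> float:
    """Per-base error probability for a Phred-scaled quality score Q."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred-scaled quality score for a per-base error probability p."""
    return -10 * math.log10(p)


# Under this convention Q20 and Q30 correspond to error rates of 1% and 0.1%,
# and a 99.999% per-base accuracy (p = 1e-5) corresponds to Q50.
q20_error = phred_to_error(20)
q30_error = phred_to_error(30)
sanger_q = error_to_phred(1e-5)
```

The same conversion can be applied to the observed raw error rates in the table to compare them against the reported accuracies.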

1.2.1 Modern sequencing techniques

There are several modern sequencing techniques (termed “Next-Generation Sequencing”) that have been developed, each backed by the respective companies commercializing their platforms. Each of these technologies utilizes DNA polymerase’s remarkable ability to synthesize a complementary strand based on a template single-stranded DNA, with the main difference being in the detection of incorporation of a new nucleotide in the complementary strand. A summary of their respective specifications is given in Table 1.1, with additional summaries of their techniques given below. We will provide additional details for Illumina’s technology, as data collected for future chapters were gathered on an Illumina HiSeq 2500 instrument.

Sequencing by monitoring pH change (Ion Torrent)

Upon the addition of a nucleotide to a DNA strand undergoing synthesis, a hydrogen ion is released. Ion Torrent has developed a novel silicon pH detector to capture the quantized changes in pH as a result of this incorporation event, converting the pH change into a voltage signal. Sequencing preparation is performed with a process called emulsion PCR where DNA fragments are attached to beads, one fragment per bead, with fragments amplified, providing many copies of each fragment on each bead. Selecting for beads with sufficient coverage, the beads are placed in silicon wells for sequencing, where successive nucleotides are flowed through the sequencing chamber and nucleotide incorporation is detected with the corresponding pH changes.

Sources of noise include phasing, where not all fragments on each bead incorporate the corresponding nucleotide at each flow-through, especially at library fragment sites with multiple successive nucleotides of the same type (homopolymers). As a result, the most common errors are insertions or deletions, which are typical for homopolymers and are compounded as the length of the homopolymer increases. Substitution errors may also occur, although at a much lower frequency, primarily due to carry-over effects from the previous incorporation cycle, where the incorporation of the flowed-in nucleotide may not have been registered before the successive nucleotide flow was started [27].

Polymerase active-site monitoring for single-molecule sequencing (Pacific Biosciences)

Pacific Biosciences has developed a single-molecule sequencing technique by observing nucleotide incorporation events of fluorescent nucleotides, utilizing the zero-mode waveguide (ZMW). A ZMW is a light-focusing structure allowing for sub-wavelength resolution measurements [28], and is produced in an array on a silicon wafer with many thousands of ZMWs (∼150 thousand on Pacific Biosciences’ SMRT® Cell), allowing for the parallel measurement of many nucleotide incorporation events simultaneously. Template DNA is bound to a polymerase, and is added to the surface of the SMRT chip. Fluorescent dyes attached to nucleotides of each type are then added, where the dyes are removed by the polymerase upon nucleotide incorporation. The instrument sequences each of the DNA fragments by simultaneously monitoring for fluorescent signals from active ZMW positions. Due to this simultaneous monitoring of many positions, the instrument computer must perform many calculations to condense the data collected into a manageable format for sequence interpretation.

Although single-molecule techniques are able to avoid some of the modes of error associated with other techniques (e.g. polymerase amplification bias or phasing), they present unique challenges as well. For example, a small fraction of nucleotides escape labeling, and so incorporation events in strand synthesis will be missed. In addition, there are nucleotides that dwell long enough in the active region to be detected, but are not incorporated, as well as those that are incorporated too rapidly for detection. Finally, upon successive incorporation of identical nucleotides, if some nucleotide dwell times are longer than average, additional successive incorporations may be recorded erroneously. Together, these sources of error lead primarily to insertion and deletion errors, although a smaller fraction of substitution errors may also occur [27]. Although the raw error rate for individual reads is high, upon reading each template multiple times, the consensus error rate can be brought much lower, even down to < 0.1% [29].
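Why repeated reads of the same template lower the error rate can be illustrated with a deliberately simple toy model. It assumes each pass makes an independent error at a given position with probability p, and that the consensus is wrong only when more than half of the passes are wrong. Real consensus callers, and the insertion and deletion errors that dominate here, are more involved, so this is a sketch of the scaling and not the method of [29]. The function name is hypothetical.

```python
from math import comb


def majority_error(p: float, n_passes: int) -> float:
    """Probability that more than half of n independent passes are wrong.

    Toy model: errors are independent between passes, and the consensus
    base is taken by simple majority (use an odd number of passes to
    avoid ties).
    """
    k_min = n_passes // 2 + 1
    return sum(
        comb(n_passes, k) * p**k * (1 - p) ** (n_passes - k)
        for k in range(k_min, n_passes + 1)
    )


raw = 0.1286  # observed raw error rate for PacBio RS in Table 1.1
by_passes = {n: majority_error(raw, n) for n in (1, 3, 5, 9, 15)}
```

Under these assumptions the consensus error falls rapidly as the number of passes grows, which is the qualitative behaviour described above.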

Sequencing using reversible dye terminators (Illumina)

Similar to other techniques, Illumina’s library preparation begins with fragmentation of the DNA of interest, to which adaptors are ligated on each end (see Figures 1.3 and 1.4). Fragments are then applied to the surface of the flow cell, to which adaptor-complementary sequences are covalently anchored, allowing hybridization of adaptors to their anchors. The now-anchored fragments, at a finely tuned concentration, are then amplified in a process called bridge amplification, producing localized clusters of identical fragments on the surface of the flow cell.

In sequencing, all fragments within each cluster corresponding to the same strand of the original DNA are selected and sequenced. Sequencing itself follows a cyclic procedure of: 1) a nucleotide is added to the synthesis strand by the polymerase, 2) unincorporated nucleotides are washed away, 3) the flow cell is imaged to identify the fluorescent signal reported by each cluster, 4) the fluorescent groups are chemically cleaved, and 5) the 3′-OH is chemically deblocked. As mentioned in 5) and seen in panel (c) of Figure 1.3, reversible dye terminator sequencing uses nucleotides which not only have a fluorophore attached (one unique color per base), but are also chemically blocked from forming any further extensions (a blocking group is added to the 3′-OH position of the deoxyribose sugar). This prevents the polymerase from incorporating more than one nucleotide prior to the imaging step, and allows for the step-wise determination of the fragment sequence. The flow cell is flooded with all four nucleotides simultaneously, allowing for the parallel sequencing of all read clusters in the flow cell per imaging step, based strictly on the color of the fluorophore that was incorporated in each cluster.

Figure 1.3: Illumina sequencing overview: (a) Adaptors are attached to DNA fragments. Upon hybridization of adaptors to their complements in the flow cell, a bridge amplification process is employed, producing localized clusters within the flow cell of identical copies of library fragments (b). (c) In sequencing, DNA polymerase incorporates fluorescently labeled nucleotides, modified to prevent continued strand synthesis, allowing for the instrument to detect (through imaging all clusters in the flow cell) which nucleotide was incorporated in each cluster. Post-imaging, the synthesis block and fluorophore are removed, allowing for the next nucleotide incorporation to take place, iteratively determining the sequence of the fragment (from [27]).

Figure 1.4: Paired-end sequencing: DNA library fragments are ligated with adaptors, uniquely identifying each end of the fragments. Single-end sequencing only sequences one end of the fragment, but paired-end sequencing sequences both ends (as shown), allowing for better alignment of fragments to the reference genome, de novo assembly of new genomes, as well as providing knowledge of fragment lengths. (adapted from [27])

In paired-end sequencing (see Figure 1.4), after these fragments are sequenced, the opposing strand is then selected for and sequenced, providing the corresponding downstream fragment sequence (so, both ends of the fragment are sequenced, instead of only one). This benefits de novo assembly of new genomes, allows a more unique placement of fragments within known genomes for multiply mapped reads, and also allows for the determination of fragment lengths [27]. However, it is more costly due to the additional sequencing required, and so is at times forgone in favor of single-end sequencing, which only provides the sequence of one end of each fragment.

Sources of noise for Illumina’s sequencing technology include phasing and residual fluorescent noise. In phasing, an increasing number of fragments in a cluster fall out of phase of the majority of the fragments in the cluster, due to incomplete deblocking reactions in the prior sequencing cycle, or from imperfect blocking group additions to nucleotides, allowing for more than one nucleotide incorporation to occur for a limited number of strands per cluster. Residual fluorescent noise occurs with incomplete cleavage of fluorophores in prior cycles. As a result of these noise sources, the primary form of error in Illumina’s technology is substitution errors, where the base for a given nucleotide is incorrectly identified [27].

1.2.2 Applications of next-generation sequencing

With the vastly increased throughput of next-generation sequencing techniques, a tremendous amount of growth has been seen in the variety of their applications. Here we present a few of these applications, split by nucleic acid sequenced.

DNA sequencing

The first effort to sequence the human genome took ∼10 years to publish a first draft, and to date has cost the United States $3 billion [30]. Now, the $1000 genome milestone has been achieved by Illumina’s HiSeq X Ten system (a collection of 10 HiSeq X instruments), and takes only a few days to complete. However, it should be noted that the $1000 milestone requires all 10 instruments to be continuously employed at full capacity, producing 18,000 genomes per year, and the initial instrument purchase cost of at least $10 million makes this level of production only suitable for large institutions performing population genetic studies.

With the cost and time for sequencing continually decreasing, however, whole-genome sequencing is increasing in feasibility, including applications in de novo assembly of new genomes and variation studies in known genomes. Since the 1000 Genomes Project [31], the first large-scale human genetic variation study, even larger studies have been launched ([32] for example), which provide unprecedented levels of understanding of the relationship between genomic variation and phenotype [33]. Applications of whole-genome sequencing also extend to translational research such as forensic genetics [34] and clinical diagnostics [35; 36].

For many applications, the breadth of information provided by whole-genome sequencing is neither necessary nor practical to produce, and a variety of techniques have been developed to sample smaller subsets of the genome. For example, in whole-exome sequencing, only coding regions of the genome are sequenced [37], as the exome represents less than 2% of the genome, but contains ∼85% of known disease-causing variants [38]. In amplicon sequencing, specific genome regions are targeted for amplification by PCR before being sequenced, providing better coverage for the regions of interest in a study, such as the genomic variations in disease-causing genes [39], or the 16S microbial rRNA gene in a metagenomics study [40].

RNA sequencing

Whole-transcriptome analysis, like whole-genome analysis, is possible through RNA-seq, which is similar to the process of DNA-seq, but in general has the added step of reverse transcription, where RNA fragments are converted to cDNA fragments. Due to the complexity of the eukaryotic transcriptome, where many genes produce antisense transcripts [41], strand-specific RNA-seq protocols have been developed [42], and have helped with the identification of novel antisense regulatory transcripts [43–45]. In addition, similar to DNA-seq, targeted RNA-seq methods have been developed [46; 47], and have led to the discovery of novel isoforms of well-annotated protein coding transcripts [45].

Another novel method developed for RNA sequencing, termed Ribo-seq, captures 28-30 nt long RNA fragments shielded by ribosomes during nuclease footprinting [48; 49] (see Figure 1.5). This technique provides a translatomic snapshot of the cell, and has been used to study the translational control of gene expression [50], to help in annotating translated regions [51], and to study mechanisms of protein synthesis [52; 53]. We utilized this technique, along with traditional RNA-seq, in the study of LepA’s role in E. coli translation [54] (see also Chapter 4).

Figure 1.5: An Overview of the Ribosome Profiling Protocol: As shown, unprotected transcript regions undergo nuclease digestion, leaving ribosome protected fragments for capture and sequencing. (Reprinted by permission from Macmillan Publishers Ltd: Nature Protocols [48] copyright 2012)

Additional techniques

A large variety of additional methods, with variations on those methods, have been developed. ChIP-seq, which studies protein–DNA interactions [55], has been used to study a breadth of biological processes. RNA–protein interactions can be studied with CLIP-seq [56] or iCLIP [57], RNA–DNA interactions with CHART [58] or ChIRP [59], and DNA–DNA interactions with conformation capture (3C)-based methods [60]. Utilizing these methods, the three-dimensional organization of genomes can be determined at a higher resolution than ever before [61].

In addition to interaction-probing methods, epigenetic methods have been developed [62; 63]. An extensively studied epigenetic modification is DNA methylation, which will be further discussed in Chapter 2.

1.2.3 Challenges of next-generation sequencing

Next-generation sequencing, with large (hundreds of gigabytes) data sets consisting of fragmented regions of DNA or RNA, presents unique challenges for those hoping to utilize the technology [30]. Compared to earlier sequencing efforts, where data generation was the bottleneck, the challenge with next-generation sequencing is in the storage, handling, and analysis of the information obtained [64]. Especially considering the wide range of high-throughput sequencing (HTS) applications described above, one concern is that there is not a “Swiss army knife”-type method in the analysis of next-generation sequencing data that covers all possible applications. As a result, individual users will have to carefully document for the community the analysis for their specific application [65]. In Chapter 3, we present a pipeline we have developed for the analysis of HTS data, and further discuss some of the analysis challenges posed by these techniques. We then utilize the pipeline in the analysis of a novel 5′-end sequencing approach (Chapter 3), as well as in the analysis of Ribo-seq and RNA-seq data (Chapter 4).

1.3 Scientific contributions to the field

There are two main areas of contribution to our scientific understanding. The first is with the development of a predictive model of DNA methylation, utilizing an increased understanding of methyl-binding-domain (MBD)’s interaction with methylated DNA as found in MethylCap-seq, outlined in Chapter 2. The second is in our knowledge of LepA’s role in E. coli, as discussed in Chapter 4.

MBD-DNA interactions

The “gold standard” for measuring DNA methylation is bisulfite sequencing [66], providing nucleotide-level resolution of methylated CpGs. However, to gain a whole-genome picture of DNA methylation (the methylome), the sequencing depth required is equivalent to sequencing the human genome ∼30× per sample [67], which presents a prohibitive cost for large-scale methylation studies. In contrast, MethylCap-seq utilizes the MBD-DNA interaction to select for methylated CpGs, providing a cost-effective and easily automatable system to study DNA methylation [68; 69].

However, existing MethylCap-seq analysis protocols (i) have resolutions on the order of a DNA fragment’s size [70] or larger [68], (ii) do not provide methylation percentages (how likely given CpGs are methylated), and only provide differentially methylated regions [69], or (iii) both [71]. We have developed a model to overcome these challenges, providing the likelihood of methylation with single-CpG resolution.

In addition, DNA recovery in MBD pulldown experiments has been observed to be very non-linear when measured on synthetic oligos containing 0 to 8 methylated CpGs (Figure 1.6) [72]. However, none of the current techniques take this into account, and they assume a linear relationship in their underlying analysis. As far as we know, our model is the first to evaluate relative pulldown efficiencies per accessible CpG, as well as the distribution of CpGs on DNA fragments, incorporating these effects and values in determining the probability of methylation. We believe this added consideration will allow for a more accurate determination of methylation levels than existing MethylCap-seq analysis techniques.

Figure 1.6: MBD capture of methylated CpG oligos, vs an antibody-based approach: MBD (blue) vs an antibody (green)’s capture of synthetic oligos is shown, displaying MBD’s enhanced ability to capture methylated DNAs. Of note is the very non-linear behavior of MBD’s capture rates for varying numbers of methylated CpGs per oligo (0 vs 1 CpG is essentially unchanged, with significant enhancements for 2, 3, and 4 methylated CpGs). Oligos were 80 bp duplex DNAs containing 0 to 8 methylated CpGs. (From [72])

LepA in E. coli

In E. coli, there is a widely conserved protein called LepA. However, despite its widespread conservation across all bacteria, LepA has remained poorly understood. LepA is widely thought to be involved in the elongation phase of protein synthesis; we have shown that, although LepA does contribute to translation elongation, these effects are codon specific and relatively minor (Section 4.4.2). In contrast, we have found that LepA mainly affects translation initiation (Section 4.4.1), with many genes being affected. In particular, we find pyrimidines are significantly underrepresented in the Shine-Dalgarno (SD) region (which is a part of the translation initiation region (TIR) of genes) for genes with significantly lower average ribosome density (ARD) upon the deletion of lepA, and find that LepA promotes translation initiation.

1.4 Conclusions

Molecular biology has seen incredible advancements in our understanding of the mechanisms of life. In particular, with the increased understanding of the information storage and transfer that occurs in a cell, and the development of increasingly efficient techniques to read such information, we have seen explosive growth in questions being asked, and answers determined. In the following chapters, we will present a few of the questions we have asked, the techniques we have developed, and the answers gained, with the common thread across all chapters being that all data obtained for analysis were sequenced by Illumina’s HiSeq 2500, a next-generation sequencing instrument, reading information from nucleic acids.

Chapter 2 MBD-DNA Interactions as Probed through HTS

2.1 Introduction

DNA methylation is an extensively studied chemical modification of DNA (a simple PubMed search on “DNA Methylation” yields 47,583 publications as of 28 February 2015, with 4,670 in 2014 alone) where a methyl group (−CH3) is added to the 5-carbon position of cytosine (5-mC) [73], and is primarily found in the CpG context [73; 74]. It is an important epigenetic modification, with proper control of methylation being linked to embryonic development and tissue differentiation [75], gene imprinting [76], and X-chromosome inactivation [77].

Contrary to expectation, the CpG di-nucleotide content of the human genome is rather low, due to the mutagenic properties of 5-mC [78; 79], with CpGs being primarily clustered together in what are called CpG islands (CGIs), found primarily in the promoter or first exon regions of genes [73]. Nearly 70% of all annotated gene promoters have been found to have high-CpG prevalence [80], among which nearly all housekeeping genes, as well as a portion of tissue-specific and developmental regulator genes, have increased levels of CGIs [81; 82]. These findings suggest CGIs may be tied to transcription initiation [73], and that methylation is related to gene silencing [83]. Associated with the gene-silencing nature of methylation, hypermethylation of tumor suppressor genes and hypomethylation of oncogenes have been shown to play a key role in the development of cancer [84–88]. Thus, an effective means of studying methylation levels is needed to further understand the complex disease of cancer.

Bisulfite sequencing has been called the “gold standard” as a technique to measure methylation in DNA [66], where unmethylated cytosines are converted to uracil (which PCR amplification converts to thymine), and 5-mC is untouched [89]. This allows the methylation state of individual cytosines to be determined based on these modifications to the genomic sequence. However, this method suffers from the inability to provide the depth of sequencing required for full methylome determination in a cost-effective manner [90].
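The logic of bisulfite read-out can be sketched as follows. This is a simplified, single-strand illustration that assumes complete conversion and error-free sequencing, and the function names are hypothetical.

```python
# Simplified single-strand sketch; assumes complete bisulfite conversion and
# error-free sequencing. Names are hypothetical.
def bisulfite_convert(seq: str, methylated_positions: set[int]) -> str:
    """Sequence read after bisulfite treatment and PCR amplification.

    Unmethylated C is read as T (C -> U -> T); methylated C stays C.
    Positions are 0-based indices into seq.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq.upper())
    )


def call_methylation(reference: str, converted: str) -> dict[int, bool]:
    """For each reference cytosine, report whether it reads as methylated."""
    return {
        i: converted[i] == "C"
        for i, base in enumerate(reference.upper())
        if base == "C"
    }


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
calls = call_methylation(reference, read)
```

Comparing the converted read against the reference recovers the methylation state at each cytosine, which is why sufficient depth at every position is required.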

MethylCap-seq has been developed as a method to study DNA methylation on a whole-genome level [70], and utilizes the human methyl-binding-domain (MBD) protein, which preferentially binds to methylated DNA fragments, allowing for their selective enrichment. Sequencing of these fragments allows for a whole-genome snapshot of DNA methylation, providing a depth of coverage in the regions of greatest interest. However, MethylCap-seq suffers from low resolution (on the order of the fragment size), and cannot specify the methylation state of any given CpG [70].

In order to utilize MethylCap-seq as a tool to more effectively study genome-wide methylation levels in such diseases as cancer, we needed to develop a higher-resolution view of methylation. Here we have set out to better characterize the interaction between MBD and DNA as observed in the MethylCap-seq protocol. In particular, we have investigated the effects of such factors as DNA fragment size, methylated CpG positioning on fragments, CpG separation, and the overall CpG counts per fragment on MBD capture and sequencing. We have also developed a Bayesian predictive model that uses the MBD-DNA interaction parameters determined in this study to infer the methylation state of samples in future studies, although implementation of this model has yet to be performed.

2.1.1 MBD background

A key component of MethylCap-seq is its use of MBD, or more specifically, human MBD2 (NCBI RefSeq: NM_003927; [91]). MBD2 is thought to have evolved from an MBD2/3 precursor which exists in invertebrates, and retained the ability to bind to methylated DNA, while MBD3 in mammals did not (but does in other vertebrates) [92]. Here we outline some of what is known about MBD2, particularly about its interaction with methylated DNA.

Figure 2.1: Chicken MBD2 binding to methylated DNA: (a) The 3-D structure of MBD2 (cyan) binding to methylated DNA (mCpG bases in yellow in center). (b) Base-specific (solid lines) and phosphate backbone (dashed lines) contact points of MBD2 to the DNA. (From [93], used by permission of Oxford University Press)

MBD size/structure

Human MBD2 has two main isoforms, with the longer being 411 amino acids long and weighing 43,255 Da [94; 95]. Although the human structure has not been solved, the active region of chicken MBD2, interacting with methylated DNA, has [93] (Figure 2.1). As seen in Figure 2.1a, the active region of the MBD2 structure is formed by a three-strand β-sheet, with a fairly large loop between the first and second strands, and a tighter loop between the second and third strands. Immediately following the β-sheet, the backbone turns into an α-helix. Neither the N- nor C-terminal regions of the MBD domain form any secondary structures, and simply pack closely to the α-helix and β-sheet. As seen, the large loop between strands 1 and 2 of the β-sheet extends down into the major groove of the DNA, making direct contact with the methylated CpGs. Figure 2.1b shows the contact points of MBD to methylated DNA, showing direct contact with the methylated CpG di-nucleotide, as well as with nucleotides immediately surrounding the mCpG (solid lines). In addition, there are contacts made with the DNA backbone (dashed lines).

Figure 2.2: Synthetic oligo targets for mouse MBD2b: A schematic representation of methylated CpGs on the respective fragments. (From [96], used by permission of Oxford University Press)

Table 2.1: Fractional saturation of mouse MBD2b

              GAC             BRCA1           MLH1            GSTP1           P16INK4a
Unmethylated  188.7 ± 46.8    200.4 ± 53.6    257.4 ± 125.4   136.5 ± 37.3    -
Methylated    2.7 ± 0.8       3.5 ± 1.5       7.9 ± 1.9       1.09 ± 0.1      1.6 ± 0.2
U/M           69.8            57.3            32.5            125.2           -

R1/2 values (in nM) of mouse MBD2b for unmethylated and methylated oligos, as well as their ratio (U/M). GAC is a 42 bp long DNA fragment, with the 20th and 21st bps being the (un)methylated CpG. A schematic of the remaining oligos is shown in Figure 2.2. (Values from [96], used by permission of Oxford University Press)

MBD binding affinity

Similarly to the structure, binding characteristics of human MBD2 have not been studied, but have been determined for mouse MBD2b, the shorter isoform of MBD2 [96], and for the active domain of chicken MBD2 [93].

In 42 bp long DNA fragments consisting of ACG repeats, with the 20th and 21st bases being a CpG which is methylated or not, Fraga et al. observed R1/2 values for mouse MBD2b of 2.7 ± 0.8 nM when binding to methylated DNA, and 188.7 ± 46.8 nM for unmethylated DNA (GAC in Table 2.1). This gives a ∼70-fold change in MBD2b binding of methylated vs unmethylated CpGs (indicating its ability to discriminate between the two). 50 bp strands with 3 to 7 mCpGs, varying in position and density (see Figure 2.2), were also used in characterizing MBD2b’s binding affinity, where binding events were interpreted to be of only one MBD2b per fragment. Once again, we observe enhanced affinities for methylated oligos compared to unmethylated oligos, with fold changes ranging from 32.5× to 125.2×.

Table 2.2: Binding affinity of chicken MBD2

MBD2    mCpG          KD (µM) ± SE
WT      WT            2.1 ± 0.1
K32A    WT            291 ± 19
Y36F    WT            109 ± 3
R46C    WT            590 ± 71
R67M    WT            197 ± 17
K19W    WT            135 ± 17
WT      Thy104Gua     2.2 ± 0.1
WT      Gua107Thy     29 ± 2
WT      Inverted      2.3 ± 0.5

(From [93], used by permission of Oxford University Press)
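The U/M fold changes quoted for mouse MBD2b are simply the ratio of the unmethylated to the methylated R1/2 value for each oligo. The sketch below recomputes them from the rounded central values in Table 2.1. The units are taken to be nM, as stated for GAC in the text; P16INK4a is omitted because no unmethylated value is listed; and the variable names are hypothetical. Ratios recomputed from rounded entries may differ from the tabulated ones in the last digit.

```python
# Central R1/2 values (unmethylated, methylated) from Table 2.1, assumed in nM.
R_HALF = {
    "GAC": (188.7, 2.7),
    "BRCA1": (200.4, 3.5),
    "MLH1": (257.4, 7.9),
    "GSTP1": (136.5, 1.09),
}


def fold_change(unmethylated: float, methylated: float) -> float:
    """Ratio of unmethylated to methylated R1/2 (larger = more selective)."""
    return unmethylated / methylated


FOLD = {name: fold_change(u, m) for name, (u, m) in R_HALF.items()}
least_selective = min(FOLD, key=FOLD.get)
most_selective = max(FOLD, key=FOLD.get)
```

The smallest and largest ratios correspond to the ends of the range of fold changes quoted above.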

In measuring the binding affinity of the active domain of chicken MBD2, Scarsdale et al. find similar affinities of MBD2 to methylated DNA [93] (see Table 2.2). They also performed KD measurements on modified MBD2 domains and modified DNA sequences (based on where MBD2 is shown to interact with the DNA, Figure 2.1b), and show that the amino acid sequence is more important than the DNA sequence in maintaining the MBD2-DNA interaction (see [93] for details). From these data, we see that MBD does indeed bind specifically to methylated DNA, up to ∼100 times more strongly than to unmethylated DNA, and so should provide good capture of methylated DNA fragments for analysis.

2.2 Methods

2.2.1 Pre-Data analysis

To facilitate our investigation of MBD-DNA interactions, our collaborator Dr. Pearlly Yan and her students generated artificially methylated human DNA utilizing CpG Methylase SssI [97]. These were then treated with the MethylCap-seq protocol for study. Specifically, the DNA was sonicated (which breaks apart the DNA into smaller fragments) to fragments of unknown size (originally thought to be ∼150 bps), captured utilizing MBD-coated magnetic beads (which preferentially capture methyl-CpG containing fragments, the interaction we are attempting to characterize), amplified with PCR (which provides more copies of the captured fragments, allowing them to be sequenced), and single-end high-throughput sequenced. Single-end sequencing proved to be a challenge as we only know the 5′ end of reads, and thus do not know how long fragments really are. In order to tease out the effects of MBD, another set of samples (called “input”) was prepared in an identical way, leaving out the MBD pulldown step, which allows us to normalize away the systematic experimental bias.

2.2.2 Preliminary priming and questions asked

Dr. Yan supplied to us the sequencer output, which they had processed and aligned to the human genome, hg18 [98], providing us only the uniquely-aligned reads across our samples (we had two replicates of MBD pulldown, and four replicates of input, with a combined total of 29,338,030 and 128,537,430 reads for pulldown and input respectively). We then took each of the reads from these samples, which were all 36 nts in length, and extended the sequence on the 3′ ends (utilizing the genomic location of the reads and the hg18 reference genome) to 250 nts total, as we were fairly confident the vast majority of sonicated fragments would be shorter than 250 nts. We then probed the data based on the following questions:
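The read extension described above can be sketched as an interval calculation. The coordinate convention (0-based, half-open, with `start` the leftmost aligned base), the strand handling, and the clipping at chromosome ends are all assumptions made for this illustration, and the names are hypothetical. The extended sequence itself would then be fetched from the reference genome for the resulting interval.

```python
from typing import NamedTuple

READ_LEN = 36        # read length stated in the text
EXTENDED_LEN = 250   # total length after 3' extension, as stated in the text


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # 0-based, exclusive (assumed convention)
    strand: str


def extend_read(chrom: str, start: int, strand: str, chrom_len: int,
                read_len: int = READ_LEN,
                target_len: int = EXTENDED_LEN) -> Interval:
    """Extend an aligned read toward its 3' end to target_len bases.

    `start` is the leftmost aligned coordinate. For a '+' read the 3' end
    lies to the right; for a '-' read it lies to the left, so the right
    edge of the alignment (the read's 5' end) is held fixed instead.
    """
    if strand == "+":
        new_start = start
        new_end = min(start + target_len, chrom_len)   # clip at chromosome end
    elif strand == "-":
        new_end = start + read_len
        new_start = max(new_end - target_len, 0)       # clip at chromosome start
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return Interval(chrom, new_start, new_end, strand)
```

For example, a '+' read aligned at position 1000 would be extended to cover positions 1000 to 1250, while a '-' read at the same position would cover 786 to 1036.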

1. Are the input samples representative of the genome, in particular, with regards to CpG content? If not, can they be corrected to make them representative?

2. Does a methylated CpG’s location on a fragment affect MBD pulldown efficiency?

3. Is there a separation dependence between 2 or 3 methylated CpGs to MBD pulldown?

4. What is the overall effect of the number of methylated CpGs on MBD pulldown efficiency?
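Questions 2 to 4 all rest on locating CpG di-nucleotides on each extended fragment. A minimal sketch of that bookkeeping is given below; it assumes every CpG on a fragment is methylated (as intended by the SssI treatment), and the function names are hypothetical.

```python
def cpg_positions(fragment: str) -> list[int]:
    """0-based positions of the C in every CpG di-nucleotide on a fragment."""
    seq = fragment.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cpg_separations(fragment: str) -> list[int]:
    """Distances (in bases) between consecutive CpGs on a fragment."""
    positions = cpg_positions(fragment)
    return [b - a for a, b in zip(positions, positions[1:])]


fragment = "ACGTTCGCGA"
positions = cpg_positions(fragment)      # where CpGs sit (question 2)
gaps = cpg_separations(fragment)         # spacing between CpGs (question 3)
n_cpgs = len(positions)                  # overall CpG count (question 4)
```

Fragments can then be grouped by position, separation, or count before comparing pulldown against input.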

2.2.3 Library analysis workflow overview

For further analysis of the libraries (pulldown or input), we followed the same basic workflow, with the difference in each investigation being the underlying question asked above (Section 2.2.2). To estimate the uncertainty in the observables of interest, we first partitioned each replicate into b bins, where b = 20. Following that, in general, the workflow is:

1. Determine the fraction of reads in each bin, per sample, which match the observable of interest, given the constraints imposed by the question being asked (see Equations 2.1 and 2.8), for each replicate of each library type.

2. Find the weighted average across all bins of the observable, and the standard deviation as an estimate of variability.

3. Combine information from across replicates, to determine the overall effect per library type (combined weighted average and standard deviation).

4. Normalize the effects observed in pulldown samples by the effect in input samples to determine MBD-pulldown’s contribution to the effect (pulldown efficiency).

We first apply this workflow to question 1.

2.2.4 Question 1a: Genomic CpG content vs input, examining protocol bias

As this question examines how representative input libraries are of genomic values (ignoring pulldown), it lacks step 4 from the workflow.

Step 1: Fraction of reads in each bin with the respective CpG counts

For input samples, we observe the number of CpGs which appear within the first 150 bps of the start of a read, per read, and determine the fraction of total reads per bin with each number of CpGs. In other words, we calculate:

\[ f_i(x) = \frac{S_i(x)}{N_i} \tag{2.1} \]

where $f_i(x)$ is the fraction of reads in bin i with x number of CpGs within 150 bps of the start of the read, $S_i(x)$ is the total number of reads in bin i with x CpGs, and $N_i$ is the total number of reads in bin i.

Step 2: Average across bins with error estimate

The weighted average across all bins just provides the fraction of reads with each CpG count across the entire sample, so we would have:

\[ f_r(x) = \frac{S_r(x)}{N_r} \tag{2.2} \]

where $f_r(x)$ is the fraction of reads in replicate r with x number of CpGs within 150 bps of the start of the read, $S_r(x)$ is the total number of reads in replicate r with x number of CpGs, and $N_r$ is the total number of reads in replicate r. As $S_i(x)$ and $N_i$ are exact read counts, we simply calculate the variance in $f_r(x)$ from the deviations of all the $f_i(x)$’s from $f_r(x)$ (as opposed to from a raw average of $f_i(x)$ across bins), to estimate the uncertainty in $f_r(x)$:

\[ \sigma_r^2(f_r(x)) = \frac{1}{N_r} \sum_{i=1}^{b} \left( f_i(x) - f_r(x) \right)^2 \tag{2.3} \]
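As an illustration of Steps 1 and 2, the following is a minimal Python sketch rather than the code used for this analysis. It assumes each bin is given as a list of per-read CpG counts; the function names are hypothetical, and the 1/N_r prefactor follows this sketch's reading of equation 2.3.

```python
from collections import Counter


def bin_fractions(bins):
    """Equation 2.1: f_i(x) = S_i(x) / N_i for every bin i.

    `bins` is a list of bins; each bin is a list holding one CpG count per read.
    """
    fractions = []
    for counts in bins:
        n_i = len(counts)                      # N_i: reads in bin i
        s_i = Counter(counts)                  # S_i(x): reads in bin i with x CpGs
        fractions.append({x: s / n_i for x, s in s_i.items()})
    return fractions


def replicate_fraction(bins):
    """Equation 2.2: f_r(x) = S_r(x) / N_r over all reads of the replicate."""
    all_counts = [c for counts in bins for c in counts]
    n_r = len(all_counts)
    return {x: s / n_r for x, s in Counter(all_counts).items()}, n_r


def replicate_variance(bins, x):
    """Equation 2.3 as read here: (1 / N_r) * sum_i (f_i(x) - f_r(x))^2."""
    f_i = bin_fractions(bins)
    f_r, n_r = replicate_fraction(bins)
    target = f_r.get(x, 0.0)
    return sum((f.get(x, 0.0) - target) ** 2 for f in f_i) / n_r
```

For example, `replicate_variance(bins, 2)` would return the variance estimate for reads with two CpGs, given `bins` built from the b = 20 partitions of one replicate.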

Step 3: Overall input CpG count fractions and variance

We combined fractions fr(x)’s across all replicates to get an overall fraction of reads f per CpG count x, for the input libraries. The weighted average across replicates, similarly to across bins, simply gives the overall fraction of reads across all input libraries:

\[ f_l(x) = \frac{S_l(x)}{N_l} \tag{2.4} \]

where $f_l(x)$ is the fraction of reads in the library l = input, with x number of CpGs, $S_l(x)$ is the total number of reads in the input libraries with x CpGs, and $N_l$ is the total number of reads across all input libraries (128,537,430 reads). However, the calculation of the variance is more complicated, with:

\[ \sigma_l^2(f_l(x)) = \frac{1}{N_l^2} \sum_{r=1}^{4} N_r^2 \, \sigma_r^2(f_r(x)) \tag{2.5} \]

where the sum goes from r = 1 to r = 4, as there are 4 input library replicates. We have used the uncertainty propagation formula $\sigma^2(f) = \left( \frac{\partial f}{\partial x} \right)^2 \sigma^2(x) + \left( \frac{\partial f}{\partial y} \right)^2 \sigma^2(y) + \ldots$, but as there is no uncertainty in the total number of reads per replicate $N_r$, the only contributions come from the uncertainties in $f_r(x)$. We take $f_l(x)$ and $\sigma_l^2(f_l(x))$ to compare with genomic values.

Genomic CpG counts per “strand”

Dr. Yan supplied CpG location files, which contain the genomic locations of every single CpG in hg18 across all chromosomes. Utilizing these chromosomal locations, we partitioned every chromosome into segments of 150 nts and determined how many CpGs, x, fall into each segment. We generated a histogram of the fraction of segments at each CpG count across the full genome, and compared this with $f_l(x)$ and $\sigma_l^2(f_l(x))$ (see Figure 2.3a).
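The segment-wise CpG histogram can be sketched as follows. This is a minimal illustration, assuming CpG locations are given as 0-based coordinates of the "C" of each CpG on one chromosome; the helper names are hypothetical, and the toy sequence stands in for the CpG location files.

```python
from collections import Counter


def cpg_locations(sequence):
    """0-based coordinates of the C of every CpG in a sequence (toy stand-in
    for a CpG location file)."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def segment_cpg_histogram(locations, chrom_len, segment_len=150):
    """Fraction of fixed-size segments containing x CpGs, for each x."""
    n_segments = -(-chrom_len // segment_len)   # ceiling division
    per_segment = [0] * n_segments
    for loc in locations:
        per_segment[loc // segment_len] += 1    # assign CpG to its segment
    hist = Counter(per_segment)
    return {x: n / n_segments for x, n in hist.items()}
```

A CpG is assigned to the segment containing its "C", so a CpG straddling a segment boundary is still counted exactly once; whether the original analysis treated boundaries this way is an assumption of the sketch.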

2.2.5 Question 1b: Genomic G/C content vs input, examining protocol bias

Analogous to Section 2.2.4, we examine the G/C content of input reads vs genomic values, and once again exclude step 4, as this analysis does not include any pulldown libraries. All calculations are identical to those in Section 2.2.4, except that instead of x being the number of CpGs per 150 bps from the start of an input read, it is the number of G or C bases. These were compared to genomic G/C values.

To compare with the genomic G/C content, every chromosome in hg18 was scanned, shifting one nucleotide at a time, with windows 150 nts wide (except for the last windows in a chromosome, which extended over the remaining length of the chromosome from the start of the scanning window). Windows were excluded from analysis if they contained any “N”s, which are indeterminate nucleotides in the genomic sequence. The G/C content of each window was determined, and a histogram of the fraction of windows at each G/C count was compared with $f_l(x)$ and $\sigma_l^2(f_l(x))$ (see Figure 2.3b). The G/C content correction g(c) was determined from this comparison by taking the fraction of genomic windows w(c) at each G/C count c, and dividing by $f_l(x)$, per x = c. In other words:

\[ g(c) = \frac{w(c)}{f_l(c)} \tag{2.6} \]

for each value of c observed, and

\[ \sigma^2(g(c)) = \left( \frac{w(c)}{f_l^2(c)} \right)^2 \sigma^2(f_l(c)) \tag{2.7} \]

This correction was used for all further analysis, simulating, per sequenced read, the number of reads that should have been sequenced had the sequencing protocol been bias-free. Once again, c is the number of G or C bases that appear within the first 150 bps of the start of a read.
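Equations 2.6 and 2.7 amount to a per-G/C-count ratio with first-order error propagation. A minimal Python sketch, assuming the genomic and input histograms are available as dictionaries keyed by G/C count (the function and argument names are hypothetical):

```python
def gc_correction(genome_fraction, input_fraction, input_variance):
    """Return (g, var_g) keyed by G/C count c.

    g(c)        = w(c) / f_l(c)                      (equation 2.6)
    sigma^2(g)  = (w(c) / f_l(c)^2)^2 * sigma^2(f_l) (equation 2.7)
    """
    g, var_g = {}, {}
    for c, f in input_fraction.items():
        if f <= 0 or c not in genome_fraction:
            continue                      # no correction defined for this c
        w = genome_fraction[c]
        g[c] = w / f
        var_g[c] = (w / f ** 2) ** 2 * input_variance.get(c, 0.0)
    return g, var_g
```

G/C counts that were never observed in the input libraries are simply skipped here; how such counts were handled in the actual analysis is not stated, so that choice is an assumption.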

2.2.6 Analysis overview for remaining questions

For the remaining questions, due to the inclusion of the G/C content correction and the addition of pulldown samples, a few modifications are made to the original workflow overview presented in Section 2.2.4, in the investigation of Question 1.

Step 1: Fraction of reads in each bin with the respective observables and constraints

For the remaining questions (Questions 2-4), we have the following observables and constraints based on the observables for Step 1:

Common Constraint For 1, 2, and 3 CpG observables described below, a constraint was applied requiring the given number of CpGs to be positioned within the first 150 bps of the start of a read, and no further CpGs to be present out to 200 bps. Since we are unaware of the original fragment sizes, but believe them to be ∼150 bps on average, the constraint of no additional CpGs out to 200 bps helps us gain confidence that there probably were only the given number of CpGs per fragment.

1 CpG position dependence (Question 2) We observe the position of the 1 CpG from the start of a read, for all reads that met the common constraint above, and determine the G/C content corrected fraction of total reads with each 1 CpG position, per bin of each sample. Read position is defined based on the “C” of a CpG, and is defined to be 1 if the “C” is the 5′-most nucleotide in a read.

2 CpG separation dependence (Question 3) We observe the separation between two successive CpGs, where separation is defined as the number of nucleotides in between the “G” of the first CpG and the “C” of the second CpG. So, CGNCG would be defined to have a separation of 1, with separations ranging from 0 (CGCG) to 146 (CG(N×146)CG) for strands of 150 bps. We determine the fraction of G/C content corrected total reads per bin of each sample with each 2 CpG separation.

3 CpG pairwise separation dependence (Question 3) Analogous to the 2 CpG separation case, we use the common constraint for 3 CpGs within the first 150 bps of the start of a read and observe the pairwise separations between the 3 CpGs. For example, CGNCGNNCG has separations of 1 and 2 respectively. We observe the fraction of G/C content corrected reads at each pairwise separation compared to the total number of reads per bin per sample.

Overall CpG count dependence (Question 4) Exactly analogous to the input vs genome case (Question 1) discussed above, we observe the number of CpGs that appear within the first 150 bps of the start of a read, and determine the G/C content corrected fraction of reads with each number of CpGs, out of the total number of reads per bin per sample. This analysis differs from the input vs genome case in that it also utilizes the G/C content correction, and is also performed on pulldown samples, with the eventual normalization process described in step 4 of the workflow.
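The position and separation conventions above, together with the common constraint, can be made concrete with a short Python sketch. It assumes each read has already been extended along the reference and is supplied as a plain string; the function names are hypothetical.

```python
def cpg_positions(read, window=150):
    """1-based positions of the C of each CpG lying within the first
    `window` bases (position 1 is the 5'-most nucleotide)."""
    seq = read.upper()[:window]
    return [i + 1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def separations(positions):
    """Nucleotides between the G of one CpG and the C of the next:
    s = p2 - p1 - 2, so CGCG -> 0 and CGNCG -> 1."""
    return [b - a - 2 for a, b in zip(positions, positions[1:])]


def meets_common_constraint(read, n_cpgs, window=150, guard=200):
    """True if exactly n_cpgs CpGs lie in the first `window` bases and no
    further CpGs appear out to `guard` bases."""
    return (len(cpg_positions(read, window)) == n_cpgs
            and len(cpg_positions(read, guard)) == n_cpgs)
```

In this sketch a CpG that straddles the 150 bp boundary fails the constraint, since it appears in the 200 bp scan but not in the 150 bp one; the treatment of that edge case in the original analysis is not specified.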

Summary: fraction calculation In summary, analogous to equation 2.1, we calculate:

\[ f_i(x) = \frac{S_i(x)}{N_i} \tag{2.8} \]

where x is the observable described above for each question considered, per bin i. Here $S_i(x) = \sum_{\{\text{reads}\}} g(c)$, where “{reads}” is the set of reads meeting the constraints above for observable x, and g(c) is the G/C content correction from equation 2.6, simulating the corrected number of reads with G/C content c that should have been sequenced. Similar to $S_i$, $N_i = \sum_{\{\text{reads}\}} g(c)$, where “{reads}” this time is all reads placed into bin i, corrected for their G/C content, with the sum over g(c). Both $S_i(x)$ and $N_i$ have uncertainties associated with them, based on the uncertainty in the G/C content correction, where $\sigma^2(S_i(x)) = \sum_{\{\text{reads}\}} \sigma^2(g(c))$ and $\sigma^2(N_i) = \sum_{\{\text{reads}\}} \sigma^2(g(c))$, with “{reads}” corresponding to the same set as for $S_i(x)$ and $N_i$ in each case, and $\sigma^2(g(c))$ is from equation 2.7, based on the G/C content, c, per read considered. This leads to the following uncertainty in the fraction of reads with observable x, per bin i:

\[ \sigma^2(f_i(x)) = \frac{\sigma^2(S_i(x))}{N_i^2} + \left( \frac{S_i(x)}{N_i^2} \right)^2 \sigma^2(N_i) \tag{2.9} \]
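Equations 2.8 and 2.9 replace raw read counts with sums of g(c). A minimal Python sketch for a single bin, assuming each read is described by its G/C count and a flag saying whether it matches the observable (names and data layout are hypothetical):

```python
def corrected_bin_fraction(gc_counts, is_match, g, var_g):
    """G/C-corrected fraction f_i(x) and its variance for one bin.

    gc_counts : G/C count c of every read in the bin
    is_match  : True where the read meets the constraints for observable x
    g, var_g  : correction g(c) and its variance, keyed by c
    """
    s = sum(g[c] for c, m in zip(gc_counts, is_match) if m)          # S_i(x)
    n = sum(g[c] for c in gc_counts)                                 # N_i
    var_s = sum(var_g[c] for c, m in zip(gc_counts, is_match) if m)  # sigma^2(S_i)
    var_n = sum(var_g[c] for c in gc_counts)                         # sigma^2(N_i)
    f = s / n                                                        # equation 2.8
    var_f = var_s / n ** 2 + (s / n ** 2) ** 2 * var_n               # equation 2.9
    return f, var_f
```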

Step 2: Average across bins with error estimate

The weighted average, analogous to equations 2.2 and 2.8, simplifies to the same form as equation 2.2:

\[ f_r(x) = \frac{S_r(x)}{N_r} \tag{2.10} \]

with $S_r(x)$ and $N_r$ exactly analogous to equation 2.8, being the G/C content corrected number of reads matching observable x in replicate r for $S_r(x)$, and the total G/C corrected number of reads in replicate r for $N_r$. However, the uncertainty in $f_r(x)$ is now more complicated:

\[ \sigma^2(f_r(x)) = \sum_{i=1}^{b} \left[ \left( \frac{N_i}{N_r} \right)^2 \sigma^2(f_i(x)) + \left( \frac{f_i(x)}{N_r} \right)^2 \sigma^2(N_i) + \left( \frac{f_i(x) N_i}{N_r^2} \right)^2 \sigma^2(N_r) \right] \tag{2.11} \]

where $\sigma^2(N_r) = \sum_{\{\text{reads}\}} \sigma^2(g(c))$ over all the reads in the replicate, analogous to $\sigma^2(N_i)$.

Step 3: Overall library type (pulldown or input) fractions and variance of the observable

Exactly analogous to equations 2.4 and 2.10, we have over the entire library type:

\[ f_l(x) = \frac{S_l(x)}{N_l} \tag{2.12} \]

where $S_l(x)$ is the G/C content corrected number of reads matching observable x in library type l, and $N_l$ is the total G/C corrected number of reads in library type l. For the uncertainty, we have, analogous to equation 2.11:

\[ \sigma^2(f_l(x)) = \sum_{r} \left[ \left( \frac{N_r}{N_l} \right)^2 \sigma^2(f_r(x)) + \left( \frac{f_r(x)}{N_l} \right)^2 \sigma^2(N_r) + \left( \frac{f_r(x) N_r}{N_l^2} \right)^2 \sigma^2(N_l) \right] \tag{2.13} \]

where $\sigma^2(N_l) = \sum_{\{\text{reads}\}} \sigma^2(g(c))$ is the uncertainty in the total number of G/C corrected reads per library type.

Step 4: Pulldown efficiency (MBD-pulldown’s contribution to the observable effect)

Finally, to deconvolve the MBD-DNA interaction effects from library prep, sequencing biases, and genomic background, we normalize the fractions from the pulldown libraries by the corresponding input library values. Pulldown efficiency, then, is:

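\[ e(x) = \frac{f_{\text{pulldown}}(x)}{f_{\text{input}}(x)} \tag{2.14} \]

with uncertainty

\[ \sigma^2(e(x)) = \left( \frac{\sigma(f_{\text{pulldown}}(x))}{f_{\text{input}}(x)} \right)^2 + \left( \frac{f_{\text{pulldown}}(x) \, \sigma(f_{\text{input}}(x))}{f_{\text{input}}^2(x)} \right)^2 \tag{2.15} \]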
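Step 4 reduces to a ratio with standard first-order error propagation. A minimal Python sketch of equations 2.14 and 2.15 for a single observable value x, with hypothetical argument names and standard deviations (not variances) as inputs:

```python
def pulldown_efficiency(f_pull, sd_pull, f_input, sd_input):
    """Return e(x) and sigma^2(e(x)) for one observable value x."""
    e = f_pull / f_input                                   # equation 2.14
    var_e = ((sd_pull / f_input) ** 2
             + (f_pull * sd_input / f_input ** 2) ** 2)    # equation 2.15
    return e, var_e
```

For instance, with pulldown and input fractions of 0.02 ± 0.002 and 0.01 ± 0.001, the two terms contribute equally to the variance of the efficiency.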

2.2.7 Model-Building

After determining the pulldown effectiveness of MBD on fully methylated DNA, we developed a model to describe this interaction, utilizing 18 fit parameters. These included the mean and standard deviation of the length distribution, and relative binding probabilities for the accessible CpGs.

Data-Fitting

The minimization of $\chi^2$ for each model fit was performed utilizing the non-linear minimization function (“nlminb”) of the R statistical computing package [99]. All model fitting was performed not to the pulldown efficiencies e(x) (see equation 2.14), but to:

\[ f_l(x) = \frac{S_l(x)}{N_l} \tag{2.16} \]

where $f_l(x)$ (analogous to equation 2.4) represents the fraction of raw reads (i.e. it does not include a G/C content correction) for each observable x in the pulldown libraries. $S_l(x)$ is the number of raw reads with observable x, and $N_l$ is the total number of reads in the library. And so, for our fitting, we are simulating the actual fraction of reads that would be extracted from the sequencer for each given type of data.

1 CpG Fitting

We developed a model to simulate the 1 CpG position dependence by utilizing a G/C content aware gaussian distribution of lengths. First, to incorporate G/C content of strands, we determined the number of strands, G(l, p), that could theoretically be sequenced and contribute to the final fraction of pulldown reads:

\[ G(l, p) = \sum \frac{1}{g(c)} \tag{2.17} \]

where we summed over all genomic segments with lengths l, where 3 ≤ l ≤ 200, and there was only one CpG at position p from the 5′-end of the segment (which could be past l, which would simulate a fragment that was not captured by MBD, since it did not contain a methylated CpG, but survived the pulldown step and was sequenced nonetheless), with no further CpGs out to 200 bps from the 5′-end. g(c) is the G/C correction in equation 2.6, and utilized the percent G/C of segment lengths to determine the correction values, rounded to the nearest g(c) G/C content value. For example, a genomic strand of “ACTG”, which has 50% G/C content, would utilize c = 0.5 × 150 = 75 for g(c). Note that, currently, G(l, p) assumes all fragment lengths are equally likely to be sequenced/captured, given equal fractions of G/C’s out of all nucleotides in the fragments.

To take into account our model of a gaussian fragment length distribution, and how this relates to pulldown efficiency, we took:

\[ F(p, L, S, \vec{R}) = \begin{cases} \sum_{l=3}^{200} G(l, p) \cdot N(l, L, S) \cdot R_0, & p > l - 1 \\ \sum_{l=3}^{200} G(l, p) \cdot N(l, L, S) \cdot R_1, & p \le l - 1 \end{cases} \tag{2.18} \]

where F simulates the raw number of fragments that would be pulled down for a given 1 CpG position, p, and the condition on p is applied to each fragment length l within the sum. We see that F is a sum over all fragment lengths (l from 3 to 200), for the G/C corrected number of genomic fragments, G(l, p), weighted by how likely such a fragment length will be seen (N(l, L, S) is the value of the gaussian distribution at length l for average length L and standard deviation S), and the relative increase in pulldown efficiency for strands with 1 CpG, $R_1$, as opposed to strands with no CpG, $R_0$. Note that L, S, and $\vec{R}$ are all fit-parameters for the model, although $R_0$ is defined to be equal to one, with all higher R values (based on the number of accessible CpGs) set relative to $R_0$ (with constraint $R_j \ge R_i$, for j > i). The simulated fraction of pulled down reads, then, is:

\[ P(p, L, S, \vec{R}) = \frac{F(p, L, S, \vec{R})}{T(L, S, \vec{R})} \tag{2.19} \]

where $T(L, S, \vec{R}) = \sum_{p=1}^{149} F(p, L, S, \vec{R})$ is the total number of simulated pulldown fragments across all CpG locations.

To fit this to data, we minimized:

\[ \chi^2 = \sum_{p=8}^{149} \left( \frac{f_{\text{pulldown, 1 CpG}}(p) - P(p, L, S, \vec{R})}{\sigma(f_{\text{pulldown, 1 CpG}}(p))} \right)^2 \tag{2.20} \]

where $f_{\text{pulldown, 1 CpG}}(p)$ was calculated as described in Data-Fitting above with no G/C content correction, since G(l, p) is already simulating what the sequencer outputs based on g(c), and we only took into account the pulldown efficiencies from the data for CpG positions 8 ≤ p < 149, since the first few CpG positions show strange edge-effects for pulldown (see Figure 2.5b).
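The structure of the 1 CpG model (equations 2.18 to 2.20) can be sketched in a few lines of Python. This is an illustrative sketch only: the actual fits used R's nlminb, the table `G` here is a hypothetical dictionary mapping (l, p) to the G/C-weighted segment count of equation 2.17, and the pulldown rate is chosen per fragment length inside the sum.

```python
import math


def gaussian(l, mean, sd):
    """Value of the gaussian length distribution N(l, L, S)."""
    return math.exp(-0.5 * ((l - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def simulated_fraction(G, mean, sd, r1, r0=1.0,
                       lengths=range(3, 201), positions=range(1, 150)):
    """P(p) = F(p) / T for every 1 CpG position p (equations 2.18 and 2.19)."""
    F = {}
    for p in positions:
        total = 0.0
        for l in lengths:
            # CpG beyond the fragment end -> no-CpG rate R0, otherwise R1
            rate = r0 if p > l - 1 else r1
            total += G.get((l, p), 0.0) * gaussian(l, mean, sd) * rate
        F[p] = total
    T = sum(F.values())
    return {p: F[p] / T for p in positions}


def chi_square(observed, sigma, predicted, positions=range(8, 150)):
    """Equation 2.20: squared, sigma-scaled residuals over the fitted positions."""
    return sum(((observed[p] - predicted[p]) / sigma[p]) ** 2 for p in positions)
```

A minimizer would then vary `mean`, `sd`, and `r1` to reduce `chi_square`; any bounded optimizer could play the role nlminb played in the original fits.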

Accessible CpG

Based on observations of pulldown efficiency for 2 CpGs (see Section 2.3.3), we developed an “accessible CpG” count per strand, which combines closely separated (separation ≤ 1 nt) CpGs into 1 accessible CpG, and counts CpGs separated by 2 bps as 1.5 accessible CpGs. CpGs separated by ≥ 3 bps are independent.

For strands with over 2 CpGs, we extended the accessible CpG model to determine the number of accessible CpGs for any given actual distribution of CpGs on a strand, maximizing the number of accessible CpGs in order to find the maximum likelihood of MBD binding.

To do this, we flagged every CpG in a strand as either “bound”, “partially bound”, or “not bound”, where “bound” CpGs counted as one accessible CpG, “partially bound” CpGs counted as half an accessible CpG, and “not bound” CpGs did not contribute to the number of accessible CpGs. For every strand, we determined every possible distribution of CpG binding and took the maximal accessible CpG count from all distributions.

In determining the distribution of MBD-CpG binding, we followed the rules introduced above:

1. CpGs ≥ 3 bps apart are independent, and don’t constrain each other in any way.

2. For 2 CpGs separated by ≤ 1 nt, at least one of them must be “not bound”.

3. For 2 CpGs separated by 2 bps, if one CpG is “bound”, then the other can at most be “partially bound”.

4. For 3 consecutive dependent CpGs (separations between 2 consecutive CpGs are ≤ 2 bps), we can have at most 2 “bound” CpGs.

where we introduced item 4 as a result of observations of 3 CpG separations (see Section 2.3.4).
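The rules above can be turned into a brute-force search over binding states. The following Python sketch is one possible reading, not the implementation used here: it assumes rules 2 and 3 constrain neighbouring CpGs only, and it applies rule 4 literally as a cap on the number of "bound" flags in any run of three consecutive dependent CpGs. All names are hypothetical.

```python
from itertools import product

# Contribution of each binding state to the accessible CpG count
STATE_VALUE = {"bound": 1.0, "partial": 0.5, "unbound": 0.0}


def is_allowed(states, seps):
    """Check one assignment of states against rules 1-4."""
    for k, s in enumerate(seps):
        a, b = states[k], states[k + 1]
        if s <= 1 and a != "unbound" and b != "unbound":
            return False                      # rule 2: one must be not bound
        if s == 2 and a == "bound" and b == "bound":
            return False                      # rule 3: bound + at most partial
    for k in range(len(seps) - 1):
        if seps[k] <= 2 and seps[k + 1] <= 2:
            if all(st == "bound" for st in states[k:k + 3]):
                return False                  # rule 4: at most 2 bound of 3
    return True                               # rule 1: s >= 3 adds no constraint


def accessible_cpgs(positions):
    """Maximal accessible CpG count for sorted 1-based CpG positions."""
    if not positions:
        return 0.0
    seps = [b - a - 2 for a, b in zip(positions, positions[1:])]
    best = 0.0
    for states in product(STATE_VALUE, repeat=len(positions)):
        if is_allowed(states, seps):
            best = max(best, sum(STATE_VALUE[st] for st in states))
    return best
```

The search visits 3^n assignments for n CpGs, which is fine for the handful of CpGs on a short fragment but would need pruning or dynamic programming for CpG-dense strands.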

2 CpG Fitting

Analogous to the 1 CpG Fitting case, we first determined a simulated number of sequenced strands, utilizing the G/C content of genomic strands of lengths l from 3 to 200 bps, where strands had to have 2 CpGs within the first 150 bps and no additional CpGs beyond that. In other words, we determined:

\[ G(l, p_1, p_2) = \sum \frac{1}{g(c)} \tag{2.21} \]

where $p_1$ is the position of the first (5′-most) CpG, $p_2$ is the position of the second CpG, and we summed over all strands that met the criterion for length l and the 2 CpG positions. The separation between the 2 CpGs, as defined above in the discussion of 2 CpG separation dependence, is $s = p_2 - p_1 - 2$. c is, identically to what was done in the 1 CpG fitting case, determined as the nearest integer c that corresponded to the same G/C content percent in these genomic strands as would have corresponded to the G/C content of the 150 nt strands used in determining g(c) originally.

G(l, p1, p2) was then utilized to calculate the simulated raw number of fragments, F , that would be pulled down for a given 2 CpG separation, s:

\[ F(s, L, S, \vec{R}) = \begin{cases} \sum_{l=3}^{200} G(l, s) \cdot N(l, L, S) \cdot R_0, & p_1 > l - 1 \\ \sum_{l=3}^{200} G(l, s) \cdot N(l, L, S) \cdot R_1, & (p_1 \le l - 1 < p_2) \text{ or } (p_2 \le l - 1,\ s < 2) \\ \sum_{l=3}^{200} G(l, s) \cdot N(l, L, S) \cdot R_{1.5}, & p_2 \le l - 1,\ s = 2 \\ \sum_{l=3}^{200} G(l, s) \cdot N(l, L, S) \cdot R_2, & p_2 \le l - 1,\ s \ge 3 \end{cases} \tag{2.22} \]

Once again, the G/C corrected number of genomic fragments, G(l, s), is weighted by how likely such a fragment length will be seen based on the gaussian distribution of lengths, N, and the relative increase in pulldown efficiency, $\vec{R}$.

The simulated fraction of pulled down reads is of the same form as for the 1 CpG location case (equation 2.19), but for separations s instead of position p, giving us a P (s, L, S, R~).

We minimized:

\[ \chi^2 = \sum_{s=0}^{146} \left( \frac{f_{\text{pulldown, 2 CpG}}(s) - P(s, L, S, \vec{R})}{\sigma(f_{\text{pulldown, 2 CpG}}(s))} \right)^2 \tag{2.23} \]

calculated over all possible separation values (0 ≤ s ≤ 146), where, once again, $f_{\text{pulldown, 2 CpG}}(s)$ was generated without utilizing a G/C content correction.

3 CpG Fitting

Fitting to all 3 CpG pairwise separation data was not computationally feasible (too time intensive), so only $s_1$, $s_2$ separations of 0–4, 10, and 50 were considered. Analogous to prior fittings, a simulated possible number of pulldown strands, $G(l, p_1, p_2, p_3)$, was calculated from the genome and the G/C correction, utilizing strand lengths, l, and CpG positions $p_1$ through $p_3$. Once again, all 3 CpGs were constrained to lie within the first 150 bps, with no further CpGs appearing out to 200 bps, and strand lengths l going from 3 bps to 200 bps.

Pairwise separations $s_1 = p_2 - p_1 - 2$ and $s_2 = p_3 - p_2 - 2$ were determined, and were used in the calculation of the number of fragments, F, that would be pulled down for a given pair of separations:

\[ F(\vec{s}, L, S, \vec{R}) = \sum_{l=3}^{200} G(l, \vec{s}) \cdot N(l, L, S) \cdot R_i, \quad \text{for } i \text{ accessible CpGs} \tag{2.24} \]

The model was then used to give a fraction of pulled down strands, $P(\vec{s}, L, S, \vec{R})$, analogous to the 1 CpG case (equation 2.19). This was then used to calculate $\chi^2$, which was minimized:

\[ \chi^2 = \sum_{s_1} \sum_{s_2} \left( \frac{f_{\text{pulldown, 3 CpG}}(\vec{s}) - P(\vec{s}, L, S, \vec{R})}{\sigma(f_{\text{pulldown, 3 CpG}}(\vec{s}))} \right)^2 \tag{2.25} \]

where sums over $s_1$ and $s_2$ were taken over separations of 0–4, 10, and 50, as previously described, and $f_{\text{pulldown, 3 CpG}}(\vec{s})$ was calculated without a G/C content correction.

CpG Count Fitting

Finally, a model for all possible CpG counts per strand of length 3 ≤ l ≤ 200 bps was generated, where only CpGs within the first 150 bps were counted in the raw number of CpGs, n, but CpGs had to be within the length l to contribute to the accessible CpG count a. As before, a simulated number of potential pulldown strands, G(l, a, n), was determined, summing over 1/g(c) for strands that met the constraints on n and a. These were then used to generate a simulated number of actually pulled down strands, $F(n, L, S, \vec{R})$, in an analogous manner to equation 2.24, which was used to generate a predicted fraction of pulled down strands, $P(n, L, S, \vec{R})$. $\vec{R}$ was only fitted out to 8 accessible CpGs, where all higher accessible CpG counts were constrained to pull down with the same $R_8$ efficiency. $\chi^2$ came out to be:

\[ \chi^2 = \sum_{n} \left( \frac{f_{\text{pulldown, All CpG}}(n) - P(n, L, S, \vec{R})}{\sigma(f_{\text{pulldown, All CpG}}(n))} \right)^2 \tag{2.26} \]

where we summed over all raw CpG counts observed in the data (0–40 CpGs), and $f_{\text{pulldown, All CpG}}(n)$ was again calculated without a G/C content correction.

Fitting

1 CpG location and 2 CpG separations were simultaneously fitted, combining the $\chi^2$ values to minimize through a simple sum. This was used to determine the mean fragment length, L, and the standard deviation of fragment lengths, S. $\vec{R}$ values were determined by fitting all data together (1 CpG Location, 2 CpG separation, 3 CpG pairwise separation, and CpG counts per strand), summing together the respective $\chi^2$’s, but weighting the 3 CpG $\chi^2$ by $10^8$, since the uncorrected contributions of 3 CpG pairwise separations to $\chi^2$ were 8 orders of magnitude smaller.

However, $R_1$ for the 1 and 2 CpG simultaneous fitting was consistently found to be smaller ($R_1$ = 1.32) than with a full simultaneous fitting of $\vec{R}$, which pushed $R_1$ up to 2.24. Placing upper limits on $R_1$ < 2.24 and performing a full fitting, it was seen that $R_1$ was consistently pushed to its upper limit. Upon inspection of a plot of a range of $R_1$ fitted values on a 1 CpG location plot (keeping L and S at the 1 and 2 CpG simultaneous fitted values), $R_1$ = 1.45 was chosen as a good compromise that still represented the 1 CpG data within reason. A full fitting of all data was then performed, constraining L, S, and $R_1$.

2.2.8 Model Predictions

We developed a predictive model of CpG methylation state based on the fitted parameters of our model. We start with Bayes’ Theorem:

\[ P(\Lambda|D) = \frac{P(D|\Lambda) P(\Lambda)}{P(D)} \]

which gives the conditional probability, P(Λ|D), that we obtain a set of parameters, Λ, given that we observe a set of data, D, and relates P(Λ|D) to the probability that we observe the data given a set of parameters, P(D|Λ), the probability of obtaining said set of parameters, P(Λ), and the probability of obtaining the data, P(D). Or, in Bayesian language, the posterior, P(Λ|D), is equal to the likelihood, P(D|Λ), times the prior, P(Λ), divided by the evidence, P(D).

We know, given proper normalizations of the probabilities, that:

\[ 1 = \int P(\Lambda|D) \, d\Lambda = \frac{1}{P(D)} \int P(D|\Lambda) P(\Lambda) \, d\Lambda \]

and so we have that

\[ P(D) = \int P(D|\Lambda) P(\Lambda) \, d\Lambda. \tag{2.27} \]

Λ in our case is the unknown state of global CpG methylation, µ = {µi}, where µi is the methylation state of a specific CpG, i, in the genome, and is equal to 0 if the CpG is not methylated and is equal to 1 if the CpG is methylated. The prior probability of a given CpG methylation pattern in the genome, P(µ), then, comes out to $\prod_{i=1}^{N} m^{\mu_i} (1 - m)^{1 - \mu_i}$, where m is the a priori probability that a CpG in the genome is methylated in general, and we take a product over the individual methylation states of all N CpGs in the genome. Putting everything together so far, we have:

\[ P(\mu|D) = \frac{P(D|\mu) \prod_{i=1}^{N} m^{\mu_i} (1 - m)^{1 - \mu_i}}{\sum_{\{\mu\}} P(D|\mu) \prod_{i=1}^{N} m^{\mu_i} (1 - m)^{1 - \mu_i}} \tag{2.28} \]

where we discretized the integral in equation 2.27 to go over all possible global CpG methylation states, {µ}.

The remaining portion of equation 2.28 is to determine the probability, P(D|µ), of obtaining the observed data, D, given a certain global methylation state, µ (the likelihood function). Given a specific genomic location (both position and strand on a certain chromosome), x, we have:

\[ P(D|\mu) = \prod_{x} P(n_x|\mu) \]

where $n_x$ is the number of observed reads at the location x, and we multiply over all possible genomic locations to obtain P(D|µ). We will take $P(n_x|\mu)$ to be a Poisson distribution for each x:

\[ P(n_x|\mu) = \frac{(f \lambda_x)^{n_x}}{n_x!} e^{-f \lambda_x} \]

with f being a fitting parameter for the expected number of reads observed at x, $f \lambda_x$, which is related to the number of sequenced reads for the experiment that generated the given data, D.

λx is generated from the parameters fitted before to artificially methylated data, operated on the current global methylation pattern predicted:

\[ \lambda_x(L, S, \vec{R}) = \sum_{\substack{\text{for fragments at } x; \\ l = 3}}^{200} G(l, x) \, N(l, L, S) \, R_a. \]

We are summing over all fragment lengths (in practice, up to l = 200 bps is reasonable, given the fragment lengths discovered) for genomic fragments that start at x, and are summing over the G/C correction factor for the strands, G(l, x) (analogous to equation 2.21), multiplied by the probability of seeing such a strand length based on the gaussian distribution, N(l, L, S), and the respective pulldown rate, $R_a$, based on the accessible CpGs on the strand, a. As f is another fitting parameter, we will explicitly pull it out of P(D|µ) in equation 2.28:

\[ P(\mu, f|D) = \frac{P(D|\mu, f) \left[ \prod_{i=1}^{N} m^{\mu_i} (1 - m)^{1 - \mu_i} \right] P(f)}{\int \sum_{\{\mu\}} P(D|\mu, f) \left[ \prod_{i=1}^{N} m^{\mu_i} (1 - m)^{1 - \mu_i} \right] P(f) \, df} \]

and remove the f dependence:

\[ P(\mu|D) = \frac{\int P(D|\mu, f) \left[ \prod_{i=1}^{N} m^{\mu_i} (1 - m)^{1 - \mu_i} \right] P(f) \, df}{\int \sum_{\{\mu\}} P(D|\mu, f) \left[ \prod_{i=1}^{N} m^{\mu_i} (1 - m)^{1 - \mu_i} \right] P(f) \, df} \]

A priori, we do not know P(f), and so we choose it to be constant for all f. However, as f ranges over 0 < f < ∞ (it is related to the number of sequenced reads in the experiment), and requiring the integral over f to be finite, we choose the form $P(f) = B e^{-Bf}$, with the limit B → 0 to be taken later, giving all f’s equal probability of occurrence as desired.

Putting everything together so far, we have:

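\[
\begin{aligned}
P(\mu|D) &= \frac{\int_0^\infty \prod_x \left[ \frac{(f \lambda_x)^{n_x}}{n_x!} e^{-f \lambda_x} \right] \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] B e^{-Bf} \, df}{\sum_{\{\mu\}} \left( \int_0^\infty \prod_x \left[ \frac{(f \lambda_x)^{n_x}}{n_x!} e^{-f \lambda_x} \right] \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] B e^{-Bf} \, df \right)} \\
&= \frac{\prod_x \left[ \frac{\lambda_x^{n_x}}{n_x!} \right] \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] \int_0^\infty f^{(\sum_x n_x)} e^{-f (B + \sum_x \lambda_x)} \, df}{\sum_{\{\mu\}} \left( \prod_x \left[ \frac{\lambda_x^{n_x}}{n_x!} \right] \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] \int_0^\infty f^{(\sum_x n_x)} e^{-f (B + \sum_x \lambda_x)} \, df \right)} \\
&= \frac{\prod_x \left( \lambda_x^{n_x} \right) \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] \left( B + \sum_x \lambda_x \right)^{(-1 - \sum_x n_x)} \Gamma\left( 1 + \sum_x n_x \right)}{\sum_{\{\mu\}} \left( \prod_x \left( \lambda_x^{n_x} \right) \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] \left( B + \sum_x \lambda_x \right)^{(-1 - \sum_x n_x)} \Gamma\left( 1 + \sum_x n_x \right) \right)} \\
&= \frac{\prod_x \left( \lambda_x^{n_x} \right) \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] \left( B + \sum_x \lambda_x \right)^{(-1 - \sum_x n_x)}}{\sum_{\{\mu\}} \left( \prod_x \left( \lambda_x^{n_x} \right) \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] \left( B + \sum_x \lambda_x \right)^{(-1 - \sum_x n_x)} \right)}
\end{aligned}
\]

where we have simplified by dropping all common terms in the numerator and denominator.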
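For reference, the step that introduces the Γ function relies on the standard Gamma integral, stated here with the shorthand $n = \sum_x n_x$ and $a = B + \sum_x \lambda_x$ (this shorthand is introduced only for this note):

```latex
\[
\int_0^\infty f^{\,n} e^{-a f} \, df = \frac{\Gamma(n + 1)}{a^{\,n + 1}}, \qquad a > 0,\ n > -1 .
\]
```

Since n is the total read count and does not depend on µ, the factor Γ(n + 1) is common to the numerator and to every term of the denominator, which is why it cancels; the factor $a^{-(n+1)}$ depends on µ through the $\lambda_x$ and therefore remains.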

We take now the limit that B → 0, discussed above, to get our final form:

\[ P(\mu|D) = \frac{\prod_x \left( \lambda_x^{n_x} \right) \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] \left( \sum_x \lambda_x \right)^{(-1 - \sum_x n_x)}}{\sum_{\{\mu\}} \left( \prod_x \left( \lambda_x^{n_x} \right) \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] \left( \sum_x \lambda_x \right)^{(-1 - \sum_x n_x)} \right)} \tag{2.29} \]

To determine the probability that a particular CpG is methylated, we sum over all global methylation states, µ, where µi = 1:

\[ P(\mu_i|D) = \frac{\sum_{\{\mu\}|\mu_i = 1} \left( \prod_x \left( \lambda_x^{n_x} \right) \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] \left( \sum_x \lambda_x \right)^{(-1 - \sum_x n_x)} \right)}{\sum_{\{\mu\}} \left( \prod_x \left( \lambda_x^{n_x} \right) \prod_{i=1}^{N} \left[ m^{\mu_i} (1 - m)^{1 - \mu_i} \right] \left( \sum_x \lambda_x \right)^{(-1 - \sum_x n_x)} \right)} \tag{2.30} \]

In practice, doing the sums and products in equation 2.30 over the methylation state of the full genome, µ, over all positions, x, is not feasible, so we will only select genomic regions of interest, going over all methylation possibilities and locations within the region, where the region is selected to be sufficiently large to be representative of the global methylation state.

[Figure 2.3, panels (a) and (b). Panel (b): “GC Content Comparison”, log-scale y axis; axes: Fraction vs. GC Count; series: Sequenced, Human Genome.]

Figure 2.3: Input samples show CpG and G/C content bias compared with genomic levels: When comparing DNA segments of 150 bps from input reads (red) and across the entire genome (green), we see a significant difference in CpG prevalence (a), and G/C content (b).
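For a small region, equation 2.30 can be evaluated by direct enumeration of all 2^N methylation states. The Python sketch below illustrates this under stated assumptions: `lam_fn` is a hypothetical callable returning the list of λ_x for a given state µ, the toy `toy_lambda` model (one location per CpG, with rate 3 if methylated and 1 otherwise) is invented purely for the example, and weights are handled in log space to avoid underflow.

```python
import math
from itertools import product


def log_weight(mu, counts, lam_fn, m):
    """Log of one term of equation 2.29: prod_x lam_x^n_x * prior(mu)
    * (sum_x lam_x)^(-1 - sum_x n_x)."""
    lam = lam_fn(mu)
    lw = sum(n * math.log(l) for n, l in zip(counts, lam))
    lw += sum(math.log(m) if state else math.log(1 - m) for state in mu)
    lw -= (1 + sum(counts)) * math.log(sum(lam))
    return lw


def methylation_posterior(counts, lam_fn, n_cpgs, m=0.5):
    """P(mu_i = 1 | D) for each CpG i, by summing over all states (eq. 2.30)."""
    states = list(product((0, 1), repeat=n_cpgs))
    logs = [log_weight(mu, counts, lam_fn, m) for mu in states]
    top = max(logs)
    weights = [math.exp(lw - top) for lw in logs]   # rescaled to avoid underflow
    total = sum(weights)
    return [sum(w for mu, w in zip(states, weights) if mu[i] == 1) / total
            for i in range(n_cpgs)]


def toy_lambda(mu):
    """Hypothetical rate model: location i covers CpG i only."""
    return [3.0 if state else 1.0 for state in mu]
```

With read counts of 30 and 2 at the two toy locations, the first CpG comes out as almost certainly methylated and the second as almost certainly unmethylated. The enumeration cost grows as 2^N, which is the practical reason for restricting the calculation to regions of interest.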

2.3 Results/ Discussion

2.3.1 Sequencing introduces a G/C content bias

Comparing the fraction of sequenced fragments per CpG count in input samples and genomic values, we see a divergence at higher CpG counts (see Figure 2.3a). Knowing that sequencing technologies are not perfect and can introduce biases [100], and as GG/CC binds ∼150× more tightly than AA/TT [101], we examined the G/C content of input samples compared to genomic values (see Figure 2.3b).

[Figure 2.4, panels (a) and (b). Panel (a): “GC Seq. Bias Correction: HG/ Input”, log-scale y axis; axes: Ratio vs. GC Count.]

Figure 2.4: G/C bias correction fixes CpG content differences: Using correction factors to account for G/C content sequencing bias (a), we recover genomic levels (red) of CpG content in input samples (green; b).

Here, we see a difference between genomic and input sample G/C content. Taking the ratio of genomic G/C content and input reads (Figure 2.4a), we construct a G/C sequencing correction, g(c) (equation 2.6). Applying this correction to our fractions of CpG counts in input samples (instead of counting the raw number of reads per CpG count, we add up g(c) per read, based on the G/C content, c), we recover the genomic CpG values (Figure 2.4b). This is used as a correction to pulldown and input sequenced reads for all further investigations.

2.3.2 MBD binding to methylated CpG shows no significant position dependence

We then investigated MBD binding and pulldown dependence on a methylated CpG’s position within a DNA fragment. Examining segments with only one CpG, we determined pulldown coverage at each strand position divided by input coverage at the corresponding position to determine a CpG-position-based pulldown efficiency (Figure 2.5a).

Examining Figure 2.5a, we see obvious edge effects on the 5′-end of fragments, which we suspect to be due to effects generated by the sequencing technology. These effects subside by the 8th base pair in the strand (position 1 is the 5′-most base in the strand). We then observe a relatively constant pulldown efficiency across fragments, until we get to roughly the 90th base pair, at which there is a significant decrease in pulldown efficiency. This decrease levels off by about the 110th base pair, at which point the pulldown efficiency again remains relatively constant.

[Figure 2.5, panels (a) and (b). Panel (a): “MBD Pulldown / Non-Pulldown Ratio vs. CpG Location for 1 CpG, w/ Seq. Correction”; axes: Ratio vs. Location Index. Panel (b): “1/2 CpG Fit Param Model vs Data for CpG Location Plot”; axes: Frequency vs. CpG Location; legend: avg_len=100.6; std_len=15.51; pull_prob=1.318.]

Figure 2.5: MBD binding to methylated CpG shows no location dependence: 1 CpG pulldown efficiency (a) shows unusual edge effects on the 5′-end of fragments, but otherwise a relatively constant pulldown efficiency across the fragment length, with a sudden decrease around 100 bps from the 5′-end. Simultaneously fitting our model to the 1 and 2 CpG pulldown fractions, we see that the model fits the 1 CpG data well (panel b; model in red, data in green). Fitting produces an average fragment length of 100.6 bps, a standard deviation of 15.51 bps, and a 1 CpG vs no CpG relative pulldown rate, $R_1$, of 1.318.

Believing that, barring edge effects, there should be no CpG position dependence to MBD binding, we developed a model which took into account only a gaussian distribution of fragment lengths, N(L, S) (with average fragment length L, and standard deviation, S), genomic CpG positions/densities and G/C content, G(l, p), and 1 CpG vs 0 CpG increased pulldown rate, $R_1$ (see 1 CpG Fitting in Section 2.2.7). This model simulates the sequenced fraction of pulled down fragments, $f_l(x)$ (equation 2.16, for 1 CpG positions), which does not include a G/C correction as G/C sequencing bias is accounted for in the model.

Fitting our model to the 1 CpG position and 2 CpG separation data (to follow; see Figure 2.6b) simultaneously, we found that the model described the data very well (red line in Figure 2.5b is model, green is data). And so we see that, indeed, MBD binding of methylated DNA does not have a CpG position preference, and that a gaussian distribution of fragment lengths, in conjunction with CpG distributions found in the genome, is sufficient to fully describe the nature of the pulldown efficiency, including the sudden decrease centered around 100 bps.

[Figure 2.6 plots. (a) MBD Pulldown / Non-Pulldown Ratio of Frac. of Strands vs. CpG Separation, w/ Seq. Correction; x-axis: CpG Separation, y-axis: Ratio. (b) 1/2 CpG Fit Param Model vs Data for 2 CpG Separation Plot; x-axis: CpG Separation, y-axis: Frequency; legend fit values: avg_len=100.6, std_len=15.51, pull_prob=1.318, pp_2CpG_2=2.241, pp_2CpG_3=3.907.]

Figure 2.6: MBD binding to methylated CpGs shows separation dependence: For 2 CpG pulldown efficiency (a), we see low pulldown when there are 0 or 1 base pairs between the 2 CpGs, an intermediate level of pulldown for 2 base pair separation, and fully recovered pulldown at higher separations. Fitting our model (red) to the pulled-down sequenced fragments (green; b; simultaneous fitting with 1 CpG Location data, seen in Figure 2.5b), we find the 3 separation classes (0–1 bps, 2 bps, and ≥ 3 bps), along with a gaussian length distribution and G/C content correction, to be sufficient to capture MBD’s binding to 2 CpG fragments. This suggests that for sufficient separations, two MBDs are really binding to the two methylated CpG sites, and that insufficient separation between CpGs leads to the expected steric clash between MBDs.

2.3.3 MBD binding to two CpGs simultaneously requires minimum separation, and shows reduced binding at an intermediate level of separation

We investigated whether there was any effect of MBD binding to two methylated CpGs simultaneously. If anything, we believed there would be some separation dependence at low separations, if only due to steric clashes between two MBDs if the two CpGs are too close together. Examining segments with only 2 CpGs, we determined the pulldown efficiencies per separation between the CpGs (counting the number of base pairs between the 1st CpG’s “G” and the second CpG’s “C” as the separation; see Figure 2.6a).

Examining the pulldown efficiency data, we observe that separations of 0 and 1 base pairs have similar levels of pulldown, lower than higher separation values. 2 base pairs of separation shows an intermediate pulldown efficiency, and ≥ 3 base pairs shows recovered efficiencies. Roughly 3 to 40 base pairs of separation show fairly level pulldown values (within error), with higher separation values sloping downward, flattening back out at about 100 bps of separation.

Recognizing the length distribution discovered previously, we hypothesized that the decreasing tail of pulldown efficiencies might be an effect of the length distribution. Some of the second CpGs constrained to be within 150 bps of the 5′-ends of reads might actually not be on the fragment, and so 1 CpG pulldown efficiencies might be decreasing the pulldown efficiencies of higher separations. With increasing separations, the likelihood of the second CpG not actually being on the fragment increases, and once separations exceed the average fragment length of 100 bps, almost all the pulldown contribution actually comes from 1 CpG values, as suggested by the leveling of the pulldown efficiency in Figure 2.6a.
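The length-distribution argument above can be illustrated with a short calculation. The following is a minimal sketch, not the fitting model itself: it assumes fragment lengths follow a continuous gaussian with the fitted mean (100.6 bps) and standard deviation (15.51 bps), and it treats a CpG at 1-based position `pos` from the 5′-end as lying on the fragment whenever the fragment is at least `pos` bps long. The function name and the continuous approximation are illustrative assumptions.

```python
import math


def prob_position_on_fragment(pos, mean_len=100.6, sd_len=15.51):
    """Probability that a gaussian-distributed fragment is at least `pos` bps long.

    Assumption: lengths are treated as continuous and untruncated, so this is
    only an approximation of the discrete length distribution.
    """
    z = (pos - mean_len) / sd_len
    # Survival function of the standard normal: 1 - Phi(z)
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    for pos in (50, 90, 100, 110, 150):
        print(pos, round(prob_position_on_fragment(pos), 4))
```

Under these assumptions the probability is close to 1 well below the mean fragment length, equals 0.5 at the mean, and is close to 0 well beyond it, which is the qualitative shape of the drop seen around 100 bps.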

To determine if this really was the case, and to test whether the 3 groups of separations (0–1 bps, 2 bps, and ≥ 3 bps) were really sufficient, we developed a model to simulate the pulled down fragments, and fitted it to the data (see 2 CpG Fitting on page 32; Figure 2.6b). We see from this that a gaussian distribution of lengths, G/C content correction and genomic CpG separation distributions, with relative pulldown rates for 0 through 2 CpGs, is sufficient to describe MBD’s interaction with 2 CpGs.

2.3.4 MBD binding to 3 CpGs shows similar pairwise separation dependence as for the 2 CpG case

[Figure 2.7 plots. Panel (b): 3 CpG Separation Comparison for 1 Sep=10; x-axis: CpG Separation, y-axis: Fraction.]

Figure 2.7: MBD shows similar pairwise separation dependence as for 2 CpGs: Looking at pairwise separation dependence of MBD binding to 3 CpGs as a whole (a), we see similar behavior as that exhibited in 2 CpGs. Namely, we see that low CpG separations inhibit pulldown efficiency (cuts along either axis), which is recovered for higher separations (points further away from either axis), with pulldown values steadily falling for higher separations. A cut along a separation of 10 bps is also shown (b), confirming this behavior (red is for first separation constrained to be 10 bps, and green is for second separation having the 10 bps constraint).

An overall look at 3 CpG separations (Figure 2.7a) shows similar separation behavior between 2 CpGs (holding the other separation constant) as for the 2 CpG case. Specifically, if we examine a representative cut along 10 bps of separation (CG(N×10)CG(all separations)CG in red, CG(all separations)CG(N×10)CG in green, Figure 2.7b), we observe that 0–1 bps of separation has inhibited pulldown efficiency, 2 bps of separation is at an intermediate level, and ≥ 3 bps of separation is fully recovered. This suggests that only when 2 consecutive CpGs are too close together do we see clash effects between multiple MBDs.

[Figure 2.8 plot. 3CpG Separations; x-axis: Location, y-axis: Fraction.]

Figure 2.8: 3 CpG pairwise separation cut along 2 bps separation shows inhibited pulldown for the intermediate state: Examining a cut along 2 bps of separation, we see the state CGNNCGNNCG is even more inhibited than the 2 CpG separation value for 2 bps of separation, almost as low as 0–1 bp separation.

However, when we examine a cut along 2 bps of separation (Figure 2.8), we see the intermediate level for 2 bps of separation (CGNNCGNNCG) is further suppressed than previously, almost down to the 0–1 bp separation level. This suggests that the center CpG, normally partially accessible if positioned 2 bps away from only one other CpG, is now almost completely inaccessible. This observation leads us to form rule 4 for our determination of accessible CpGs (see Accessible CpG on page 32), where we have brought together all our observations of MBD binding on CpG separation dependence in the formulation of all the rules. In brief, CpGs separated by ≥ 3 bps are independent of each other, and both may be freely bound by MBD. Once two CpGs approach 2 bps of separation, steric interactions between two MBDs prevent both from binding to the two CpGs as effectively as they would if the two CpGs were further apart. Finally, two CpGs that are too close together (0–1 bps) are not accessible by more than one MBD, and so only count as one binding event, and only contribute to pulldown efficiency in that manner.
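The three pairwise separation classes summarized above can be sketched in code. This is a minimal illustration for a single pair of CpGs only; it does not implement the full set of accessible CpG rules (in particular, rule 4 for the CGNNCGNNCG case is not covered). The function names are hypothetical, and the values 1, 1.5 and 2 are used only as class labels matching the R1, R1.5 and R2 pulldown parameters of the model fitting.

```python
def cpg_starts(seq):
    """Return the 0-based start index of every 'CG' dinucleotide in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def separation(first_start, second_start):
    """Number of bases between the first CpG's G and the second CpG's C."""
    return second_start - first_start - 2


def accessible_count_for_pair(sep):
    """Label a pair of CpGs by separation class (illustrative labels only)."""
    if sep < 0:
        raise ValueError("separation must be non-negative")
    if sep <= 1:
        return 1.0   # too close: counts as a single binding event
    if sep == 2:
        return 1.5   # intermediate: partially inhibited binding
    return 2.0       # >= 3 bps: both CpGs independently accessible


if __name__ == "__main__":
    for seq in ("CGCG", "CGACG", "CGAACG", "CGAAACG"):
        first, second = cpg_starts(seq)
        sep = separation(first, second)
        print(seq, sep, accessible_count_for_pair(sep))
```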

2.3.5 MBD binding multiple CpGs shows unexpected pulldown behavior

[Figure 2.9 plot. MBD Pulldown / Non-Pulldown Ratio vs. CpG Count, w/ Seq. Correction; x-axis: CpG Count, y-axis: Ratio (logarithmic scale).]

Figure 2.9: MBD binding to multiple CpGs: When examining MBD binding to multiple CpGs per DNA fragment, we see an expected qualitative cooperative binding type behavior, out to about 10 CpGs per fragment. We expected the pulldown efficiency to remain plateaued after this point, but we instead see a second increase in pulldown efficiency for higher CpG counts.

When examining MBD binding to methylated DNA as a whole, only accounting for the number of methyl-CpGs there are per fragment, we discover the expected qualitative cooperative binding behavior (cooperative behavior is expected because, if a fragment has more than one possible binding site, MBD capture of one of the CpGs brings all other methylated CpGs in closer proximity to the other MBDs that coat the magnetic beads, enhancing their capture of CpGs on the fragment). With cooperative binding, we expect an eventual saturation, perhaps related to the density of MBDs on a magnetic bead, or based on a relationship between the flexibility of the tethers connecting MBDs to the bead and the persistence length of DNA. However, we observed further enhancements to the pulldown efficiency for even higher CpG counts per fragment.

We believe this may be an artifact of not applying constraints on the number of CpGs per fragment as stringent as those applied to the other MBD binding characteristics investigated (where CpGs for a given count were constrained to lie within the first 150 bps of the 5′-end of a read, with no further CpGs out to 200 bps). With the average fragment length of 100 bps discovered, it is plausible that one or more of the CpGs “found” on a fragment are not actually on the fragment, but were simply CpGs further downstream in the genome, included as a result of our not knowing the length distribution a priori. As higher CpG counts are encountered, more of them are likely to actually be on the fragments, further enhancing the pulldown efficiency. The de-convolution of these effects cannot be determined without further analysis.

2.3.6 Model fitting to data

We developed a model simulating the pulldown and sequencing of DNA fragments captured by MBD binding to methylated CpGs, $F(x, L, S, \vec{R})$. We incorporated in our model the probability of finding a fragment length of any size, l, based on a gaussian distribution of fragment lengths, N(l, L, S), the genomic distribution of CpGs and sequencing-bias-inducing G/C content, G(x, l, c), and the relative increase in pulldown based on accessible CpG content, $\vec{R}$ (see Section 2.2.7).

We first fit our model to the 1 and 2 CpG pulldown data simultaneously, as these are more sensitive to the fragment length distribution and CpG separations (see Figures 2.5b and 2.6b). Doing so, we discover an average fragment length, L, of 100 bps, a standard deviation to the fragment lengths, S, of 15.5 bps, and a 1.32× greater pulldown of fragments with 1 accessible CpG compared to no CpGs. However, when trying to include the 3 CpG pairwise separation and full CpG counts data (only constraining the average fragment length and standard deviation found with the first fitting), we find the 1 CpG pulldown parameter is increased to higher values (whatever limit is set, up to R1 = 2.24), causing us to wonder what such relative pulldown rates might look like when used to simulate the data.

Examining the effects of varying R1, the relative pulldown of 1 accessible CpG compared to no CpGs, on 1 CpG Location data (Figure 2.10), we see a range of possible 1 CpG pulldown parameters that seem to describe the data fairly well. We find that varying the relative 1 accessible CpG pulldown rate affects how closely the model fits regions where the fraction of pulldown fragments is gently sloping or remains relatively constant (7–80 bps or ≥ 110 bps). Examining the range of fits, we choose R1 = 1.45 to be the highest value that reasonably matches the data (we want the highest value, again, because full fitting across all data consistently pulls R1 up to the highest values allowable).

Figure 2.10: A range of 1 CpG pulldown values: Visually inspecting multiple 1 CpG pulldown values (R1; using the fitted average fragment length and standard deviation of 100.6 bps and 15.51 bps respectively) shows a range of pulldown values that seem to fit well. Fitting pulldown values $\vec{R}$ on all data pushes R1 up, so the highest visually acceptable value of R1 = 1.45 was accepted for further fittings.

Now constraining the average fragment length, standard deviation of the fragment lengths, and 1 accessible CpG pulldown rate, we again performed a full fitting of our model across all data. Doing so, we observe our model fits our data relatively well, but there is one main point of concern. Specifically, in the 2 CpG pulldown fit (Figure 2.11), we observe that the 1.5 accessible CpG pulldown (for separation between CpGs = 2 bps) is not recovered, and the corresponding fitted pulldown rate comes out to R1.5 = 7.007, the same as for R2, the pulldown for fully independent 2 CpGs.

To solve the issue of 3 CpG pairwise separation and full CpG pulldown pulling our 1.5 accessible CpG rate up, we believe it would be better to perform a successive fitting of the data, instead of a simultaneous fitting. That is, we would first perform a 1 CpG fitting independent of any other data. Then, constraining parameters from the 1 CpG model, we would fit to the 2 CpG data, and finally constrain the resulting R1.5 and R2 parameters for a 3 CpG fitting. We believe building up the fit parameters in this way will better capture the MBD-DNA interaction parameters within the constraints of the model.

[Figure 2.11 plot. 2CpG Separation vs. Model for fit_all_1_constraint_1.45max_wCpG_cut_a8; x-axis: CpG Separation, y-axis: Frequency; legend fit values: avg_len=100.6, std_len=15.51, pull_prob=1.45, pp_2CpG_2=7.007, pp_2CpG_3=7.007.]

Figure 2.11: 2 CpG pulldown fragments with fit: Constraining the average fragment length to be L = 100.6 bps, standard deviation to be S = 15.51 bps, and 1 CpG relative pulldown R1 = 1.45, and fitting to all data simultaneously, we see our model seems to fit the 2 CpG separation dependence fairly well. However, on closer inspection, we notice the intermediate state of 2 bps separation between the 2 CpGs is not recovered in the fitting.

Model R1 vs binding affinity

As noted above, 1 accessible CpG pulldown rates were found to fit well with pulldown data when in the range of 1.3–1.4× the background pulldown (0 accessible CpGs). In binding affinity measurements (Section 2.1.1), however, it was found that MBD binds up to ∼ 100× better with DNA fragments containing a methylated CpG compared to fragments without a methylated CpG. We note, though, that this study does not probe the MBD-DNA interaction directly, but investigates effects of the entire MethylCap-seq protocol, which includes other library preparation effects. For example, MBDs in the protocol are bound to beads by tethers, which could restrict their mobility, preventing optimal binding. Also, the elution of bound DNA from MBD is likely non-trivial with respect to methylated CpG counts, which would also contribute to a discrepancy between pulldown rates and binding affinities. Finally, although we presume the difference in effect to be small, the MethylCap-seq protocol utilizes human MBD2, while the binding affinity measurements were performed with 1) mouse MBD2b, and 2) the active domain of chicken MBD2.

Finally, we note that methylated CpG capture was measured on synthetic oligos directly by the manufacturer of the MethylCap-seq kit (Figure 1.6), which qualitatively matches our measured values (Figure 2.9). Both these measurements show negligible difference in MBD pulldown of 1 vs 0 methylated CpGs per fragment, providing confidence in the 1.3–1.4× enhanced pulldown rate.

2.3.7 Utilizing the Bayesian model

Once all MBD pulldown parameters have been determined by fitting the descriptive model to the artificially methylated data described above, we then wish to utilize the parameters, with the Bayesian model (Section 2.2.8), to make methylation-state predictions for CpGs from real patient data (where the methylation state is not known). In this application, we foresee many patient MethylCap-seq libraries being generated by the same lab, with existing biases (e.g. fragment length selection) specific to the individual or lab producing the libraries. Thus, in order to implement the Bayesian model, the experimental biases inherent in the individual constructing the libraries must be taken into account.

Assuming library preparation will be fairly consistent per batch of libraries prepared, we foresee one library per batch (or, with good reproducibility, per lab) being prepared with an additional SssI treatment, artificially methylating all CpGs. Then, after performing MBD pulldown on this library, the descriptive model from above can be fit to it, determining model parameters. This takes into account the biases inherent in different individuals’ library preparations, and will allow for the use of the model parameters in the Bayesian predictions of actual patient samples in the batch.

2.4 Conclusion/ Future Work

We have seen that MBD interactions with methylated CpGs on fragmented DNA exhibit straightforward characteristics. Specifically, MBD binding does not show a CpG fragment position dependence, requires minimum separation between two consecutively positioned methylated CpGs for full binding to occur, and shows cooperative binding type characteristics for low (≤ 10) CpG count per fragment. We have seen that the model developed, utilizing a gaussian distribution of fragment lengths, G/C content correction to genomic CpG distributions, and relative pulldown rates for accessible CpG counts, is successfully able to describe the data so far.

However, this is still a work in progress, requiring further efforts to prune pulldown data to more accurately reflect MBD’s true interactions with methylated DNA (the > 10 CpG portion of the full CpG count data; Figure 2.9), and to further refine how our model is fit to the data to tease out interaction parameters. These preliminary results, however, are promising, and give hope that this model can be used to help inform future MBD pulldown experiments and make quantitative predictions for real patient data, utilizing the Bayesian predictive component developed (a new graduate student in the group, Blythe Moreland, is working on implementing the Bayesian model at the moment).

Chapter 3 An Overview of HTS Analysis Pipeline/Tools Developed with a Case Study in the Analysis of 5′-end Sequencing Data

3.1 Introduction

As seen in Chapter 1, HTS presents a unique set of challenges, and opportunities, in the “omics” investigation of DNA and RNA. Here we will present a set of tools developed in the analysis of HTS data, as well as a specific application of these methods in the study of a HTS data set, leading up to the determination that the data was fundamentally flawed on an experimental level. Finally, these tools will be applied in the understanding of lepA’s function in E. coli in Chapter 4.

3.2 Pipeline/tools developed

3.2.1 Read sequencing

All analyses we performed were on sequences generated by Illumina’s HiSeq 2500 sequencer, which utilizes a sequencing-by-synthesis approach [102]. For our case, reads were exclusively 50 nt long single-end reads, where library preparations varied from total RNA samples [103], ribosome profiling [49], and a novel 5′-end sequencing technique developed by collaborators (Section 3.3.2).

3.2.2 Raw to aligned

Common to all library types investigated is a need to align reads to a reference genome (since the genomes are known for the organisms studied in our case, namely human and E. coli). However, prior to alignment, 3′ adapter sequences need to be trimmed, as sequenced fragments shorter than 50 nts will include the 3′ adapter sequence (as commonly happens for ribosome profiling samples, where fragments average ∼ 30 nts in length [49]).

3′ adapter removal and read quality control

The 3′ adapter sequences were obtained either from the experimentalists who constructed the libraries, or were determined based on inspection of sequenced reads. To determine the sequence by inspection, the 3′-ends of reads were visually examined for a sequence motif.

Fragments matching the motif on the 5′ end were then extracted from all sequenced reads, from which a full 50 nt fragment motif (the 3′ adapter sequence) was determined by visual inspection.

The 3′ adapter sequence was utilized by a program developed by Prof. Ralf Bundschuh to remove the adapter sequence from all sequenced reads. By default, we used parameters which convert any nucleotide with a phred quality score ≤ 20 (99% base call accuracy) into an “N”, with any trailing “N”s removed. 3′ adapters were identified and removed if at least 5 nts of the 3′ adapter sequence matched the sequenced read from a given nucleotide to the end of the read and there were fewer than 20% mismatches between the candidate read region and the adapter sequence (doubly penalizing mismatches for any nucleotide if the read did not show an “N” at that location). Post-adapter-removal, any reads shorter than 20 nts or containing more than 5 “N”s were discarded.
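The trimming rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program actually used: the function names are hypothetical, the leftmost qualifying adapter start is taken, and the "doubly penalizing" rule is interpreted as a penalty of 2 for a mismatch at a called base and 1 for a mismatch at an “N”, with the summed penalty required to stay below 20% of the compared length.

```python
def mask_low_quality(seq, quals, max_bad_phred=20):
    """Replace bases with phred score <= max_bad_phred by 'N', strip trailing Ns."""
    masked = "".join("N" if q <= max_bad_phred else b for b, q in zip(seq, quals))
    return masked.rstrip("N")


def find_adapter_start(read, adapter, min_overlap=5, max_penalty_frac=0.2):
    """Return the leftmost index where read[index:] looks like an adapter prefix."""
    for start in range(len(read) - min_overlap + 1):
        tail = read[start:]
        if len(tail) > len(adapter):
            continue  # the adapter must reach the end of the read
        penalty = 0
        for read_base, adapter_base in zip(tail, adapter):
            if read_base != adapter_base:
                # Assumed interpretation: called-base mismatches count double.
                penalty += 1 if read_base == "N" else 2
        if penalty < max_penalty_frac * len(tail):
            return start
    return None


def process_read(seq, quals, adapter, min_len=20, max_n=5):
    """Mask, trim and filter one read; return the kept sequence or None."""
    read = mask_low_quality(seq, quals)
    start = find_adapter_start(read, adapter)
    if start is not None:
        read = read[:start].rstrip("N")
    if len(read) < min_len or read.count("N") > max_n:
        return None
    return read
```

A caller would pass the read sequence, its per-base phred scores as integers, and the 3′ adapter sequence, keeping only reads for which `process_read` does not return `None`.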

Quality control: length distribution

We determined length distribution histograms of fragments surviving 3′ adapter removal, which are used as a quality control test, particularly for ribosome profiling samples. These histograms provide feedback, at the fragment length level, on ribosome profiling library preparation quality (ribosome footprints have a characteristic length distribution based on RNA fragment origin [104], and inspection of the length distribution can provide confidence that sequencing results truly come from footprints).

Alignment

Reads were then aligned to the pertinent genome (human genome hg19 for human samples [98], or NCBI reference sequence NC_000913.2 for E. coli samples [105]). For E. coli samples, we utilized Bowtie2 version 2.0.0-beta5 [106] using default parameters, while for human samples, we utilized STAR [107], which has the capability of aligning reads over exon splice junctions, where we used splice junction annotations from GENCODE v17 [108]. For multiply mapped reads in both cases, default parameters select a random position from the possible mapping sites for downstream analysis, although we kept up to 9 additional map sites for human samples, as per STAR’s default, for future comparisons. SAM files from alignment were then converted to the BAM format, sorted, and indexed for easy read access using SAMtools version 0.1.18 [109].

3.2.3 Computational removal of rRNA reads

Due to the nature of ribosome profiling experiments (sequencing of ribosome protected fragments), many of the sequenced reads turn out to be rRNAs, which in part make up ribosomes [49]. Even with rRNA depletion oligos, human ribosome profiling samples ranged from 50% rRNA and up, and even in RNA-seq samples for E. coli, we observe over 60% rRNA reads. To make further computation more efficient and eliminate rRNA contamination, we developed a method to remove sequenced reads based on annotated genomic regions.

rRNA genomic coordinates

For E. coli, we downloaded genomic locations of 5S rRNA from the UCSC Genome Browser’s Rfam track [110]. We also downloaded the sequences of the 5S, 16S and 23S rRNAs from http://ecoliwiki.net/colipedia/index.php/Category:rRNA and used the Basic Local Alignment Search Tool (BLAST) [111] to identify their genomic locations.

For human rRNA, genomic locations were compiled from UCSC Table Browser’s RepeatMasker, RefSeq, GENCODE v14 rRNA and pseudogene tracks [112]. Finally, two additional rRNA locations were included from http://www.ncbi.nlm.nih.gov/nucleotide/486293152.

rRNA removal

First, overlapping rRNA regions were expanded to their maximum ranges. Then, all unmapped reads and any reads overlapping with the rRNA genomic regions were removed, preserving strandedness (only reads appearing on the same strand as the rRNA region were removed). For human samples, a multiply mapping read was also removed from further analysis if any of the up to 9 other alternative mapping sites were found to be in an rRNA region.
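The removal step can be sketched as an interval-overlap filter. The following is a minimal illustration rather than the actual implementation: the function names and the tuple layout `(chrom, strand, start, end)` with half-open coordinates are assumptions, regions are merged per strand, and a simple linear scan is used instead of an indexed lookup. Passing every reported mapping site of a read corresponds to the human case described above; for a read with a single reported site the list has one entry.

```python
def merge_regions(regions):
    """Merge overlapping (chrom, strand, start, end) regions into maximal ranges."""
    merged = []
    for chrom, strand, start, end in sorted(regions):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and last[1] == strand and start < last[3]:
            last[3] = max(last[3], end)  # extend the current region
        else:
            merged.append([chrom, strand, start, end])
    return [tuple(region) for region in merged]


def overlaps_any(alignment, merged):
    """True if the alignment overlaps a merged region on the same strand."""
    chrom, strand, start, end = alignment
    return any(
        c == chrom and s == strand and start < e and b < end
        for c, s, b, e in merged
    )


def keep_read(alignments, merged):
    """Keep a read only if it is mapped and no mapping site overlaps rRNA.

    alignments: list of all reported mapping sites; an empty list means unmapped.
    """
    if not alignments:
        return False
    return not any(overlaps_any(a, merged) for a in alignments)
```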

3.2.4 BAM file quality controls

Missing uniquely mapping reads

As a check on the rRNA removal, we examine which uniquely mapping reads were removed, comparing pre- and post-rRNA removal reads. With each rRNA having multiple loci, rRNA reads should in general be flagged as multiply mapping, so very few uniquely mapping reads should be removed in the rRNA removal process, and if any are, we are able to examine them in greater detail.

Additional Mapping

There are times when we wish to align reads to the genome utilizing less stringent parameters that allow for less-optimal mappings (for example, when processing lower quality reads). In such instances, we have developed a quality control check to examine which previously unmappable reads are now considered mappable.

[Figure 3.1 graphic: a stack of overlapping example reads aligned to one highly covered genomic region, with the consensus sequence shown in blue and variant bases shown in red.]

Figure 3.1: Consensus sequence determination for the top 500 reads quality check: once one of the top 500 most highly covered genomic regions is found (reads starting at the line on the left), with a minimum coverage of 10 reads (not shown), a consensus sequence (blue) is determined by counting the frequency of each base (A, C, G, T) at each position, from all reads intersecting with this region. Variability in read sequences (red) exists, but the most common base per position (the consensus sequence) is reported.

Top 500 reads

At times, we wish to determine the most frequently occurring fragment sequence, for example, when designing rRNA depletion oligos for Ribosome Profiling experiments, or to determine if a source of contamination made it into the cDNA libraries. We developed a method to determine the top 500 most highly mapped fragment sequences.

First, we scan through all reads mapped to the genome and track the 500 most highly mapped genomic coordinates. Then, for each of these genomic coordinates which pass a minimum coverage of 10 reads, we examine all reads mapped to the coordinate on the more highly represented strand, and determine the consensus sequence of this genomic region based on the highest frequency nucleotide (A, C, G, or T) at each position. This consensus sequence is what we report (see Figure 3.1). It is important to determine the consensus sequence, due to the frequent occurrence of sequencing errors.
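The per-position majority vote can be sketched as follows. This is a minimal illustration with hypothetical names: reads are assumed to be given as `(offset, sequence)` pairs relative to the start of the region and already restricted to the more highly represented strand, and positions with no called base are reported as “N”, which is our own convention for the sketch.

```python
from collections import Counter


def consensus_sequence(reads, min_coverage=10):
    """Most common base (A, C, G, T) per position across the given reads.

    reads: list of (offset, sequence) pairs, offset >= 0 relative to the region.
    Returns None when fewer than min_coverage reads are supplied.
    """
    if len(reads) < min_coverage:
        return None
    columns = {}
    for offset, seq in reads:
        for i, base in enumerate(seq.upper()):
            if base in "ACGT":  # ignore N and any other symbol
                columns.setdefault(offset + i, Counter())[base] += 1
    if not columns:
        return ""
    return "".join(
        columns[pos].most_common(1)[0][0] if pos in columns else "N"
        for pos in range(min(columns), max(columns) + 1)
    )
```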

Mappable Read Alignment Statistics

Particularly when comparing different alignment technologies, there are times when one desires to have a means to compare the mapping “quality”, beyond the fraction of reads mapped uniquely, multiply, or not at all, as provided by the alignment programs themselves.

To do this, we developed a technique of determining the overall percentage of the amount of each read that is actually matched with the genome (as opposed to insertion/ deletions from the genome, for example).
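One way such a statistic could be computed is from the CIGAR string of each alignment in the SAM/BAM file. The sketch below is an assumption about the approach, not the method actually implemented: it defines the matched fraction as the number of read bases in alignment-match operations (M, =, X) divided by the total read length implied by the query-consuming operations (M, I, S, =, X). Note that the M operation covers both sequence matches and mismatches.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def matched_fraction(cigar):
    """Fraction of read bases placed in alignment-match columns of the CIGAR."""
    aligned = 0
    query_len = 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            aligned += n      # bases aligned against the genome
        if op in "MIS=X":
            query_len += n    # operations that consume read bases
    return aligned / query_len if query_len else 0.0
```

Averaging this value over all mapped reads, weighted by read length, would give an overall matched percentage per sample.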

3.2.5 Genomic coverage summary

It was found that, in all downstream analyses required for our studies, the only information of interest was how many reads mapped to which genomic positions (including strand), for which treatment groups. We developed a technique to extract only this information from all samples of interest to facilitate further processing. One nuance in this process is read position assignment—what will be considered the position of a finite-length read?

We developed techniques to assign the 5′-end, 3′-end, midpoint, and a specified number of nucleotides upstream of the 3′-end as the read position, properly handling reads mapped over splice junctions. Finally, we also allowed for every position from each of a read’s nucleotides to also be counted, incorporating the full information of read length in coverage overlap.
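The position assignment options can be sketched by listing a read's genomic positions in 5′ to 3′ order, which handles spliced reads naturally. The function name, the mode labels, the half-open block representation and the rounding of the midpoint toward the 5′-end are illustrative assumptions.

```python
def read_position(blocks, strand, mode="5p", offset=0):
    """Assign a single genomic position to an aligned read.

    blocks: aligned genomic blocks as (start, end), half-open, in ascending
            order; a read spanning a splice junction has more than one block.
    strand: "+" or "-".
    mode:   "5p", "3p", "mid", or "3p_offset" (offset nts upstream of the 3' end).
    """
    positions = [p for start, end in blocks for p in range(start, end)]
    if strand == "-":
        positions.reverse()  # order positions from the read's 5' end to its 3' end
    if mode == "5p":
        return positions[0]
    if mode == "3p":
        return positions[-1]
    if mode == "mid":
        return positions[(len(positions) - 1) // 2]
    if mode == "3p_offset":
        return positions[-1 - offset]
    raise ValueError("unknown mode: " + mode)
```

Counting every nucleotide of the read instead would simply use the full `positions` list.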

3.2.6 Normalizations

Considering the variability in sequencing depth [113], normalization techniques are required to allow for the comparison of expression across sample types [114]. Normalization by both the total number of reads per sample and the total number of coding reads has been developed (where the number of reads at any given genomic coordinate is divided by either total reads or total coding reads).

To normalize by coding reads, the coverage of the respective coding transcripts was determined based on the annotated genomic positions of the transcripts (or exons, for human). For E. coli, the K12 Genbank Ref-Seq gene annotation file was used (downloaded from the UCSC Microbial Genome Browser on 15 April 2013; [110]). For human samples, GENCODE v17 was utilized [108].

In addition, some downstream techniques apply further normalizations specific to the technique.

3.2.7 Coverage per position visualization techniques

Once genomic positions have been assigned read coverage, one frequently desires to visualize the data, allowing for manual inspection of the coverage landscape. We developed several methods to allow for data visualization.

UCSC Genome Browser

The UCSC Genome Browser [115] and Microbial Browser [110] are powerful tools to visualize HTS data, with particular value added in the correlation of sample data with publicly available data sources. Utilizing this tool, users can view trends in their own data, scanning over the full genome, as well as compare their data with conservation of genomic regions across species, over specific annotated genes, as well as more exotic data sets such as common SNP locations or poly-A tail sites. Considering the usefulness of this tool, we developed a method to easily convert our coverage data into a format for easy upload, normalizing coverage values to reads per million.

Transcript by transcript view

There are times when one desires to examine the coverage of each transcript independently for a subset of transcripts, displaying the coverage across the mature transcript (introns removed), particularly with samples from techniques such as ribosome profiling, where fragment position encodes biologically relevant information. There are two main ways we can do this:

1. As a heat map, comparing the overall coverage distribution for a large number of transcripts simultaneously.

2. Transcript by transcript, comparing the effect of different treatments on a particular transcript (see Figure 3.2a-c).

[Figure 3.2 plots. Panels (a) to (c): Gene 1, Gene 2 and Gene 3; panel (d): Metagene. Each panel shows Control, Treatment 1 and Treatment 2; x-axis: Nucleotides from Start Codon, y-axis: Coverage (AU).]

Figure 3.2: Transcript by transcript view of coverage provides local detailed view of treatment effects: (a - c) show cartoon representations of a transcript view of coverage for 3 example genes, showing the effects of 3 treatments (control = blue, red/ green are other treatments) on the local gene abundance. Interpretation is dependent on experiment, but for ribosome profiling experiments, for example, this view can provide insight into translation pause sites. (d) shows metagene view of the coverage, averaging over the 3 genes shown, with no further normalizations (see 1 in Metagene view on page 60).

For 2, we view transcript coverage with nucleotide resolution, but for 1, we must bin the coverage across each transcript into an equal number of bins (as a consequence, bin size varies based on transcript length). For nucleotides falling on a bin boundary, the coverage is split between the bins proportionally.

Metagene view

At times, one desires to observe if there is a transcript-aligned, position-specific global effect of a treatment to samples, once again, where fragment position encodes biologically relevant information. A metagene view of read coverage (displaying average read coverage per position across all transcripts of interest; see Figure 3.2d) provides this view, where reads can be aligned to the 5′ or 3′ ends of transcripts, utilizing two main normalizations:

1. No additional normalizations (only dividing by the total reads per sample, or total coding reads per sample, as in section 3.2.6).

2. In addition to 1, we may also wish to further normalize by dividing by the normalized read density per transcript (normalized total reads in the transcript divided by the length of the transcript).

The first method examines coverage closer to the “raw” levels, where higher density transcripts (those with higher coverage per length) will tend to dominate. The second normalization scales coverage per transcript by its coverage density, giving all transcripts equal footing.
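The two normalizations can be illustrated with a minimal sketch. It assumes per-nucleotide coverage is already available as plain lists aligned at the 5′ end; the function name `metagene` and its parameters are hypothetical and are not the pipeline's actual interface.

```python
def metagene(transcripts, sample_total, n_positions, density_normalize=False):
    """Average per-position coverage over transcripts aligned at their 5' ends.

    transcripts: dict mapping transcript id -> list of raw per-nucleotide coverage
    sample_total: total reads in the sample (normalization 1)
    n_positions: number of leading positions to include in the profile
    density_normalize: if True, also divide by each transcript's normalized
        read density (normalization 2)
    """
    profiles = []
    for cov in transcripts.values():
        if len(cov) < n_positions:
            continue  # transcript too short to span the requested window
        window = [c / sample_total for c in cov[:n_positions]]
        if density_normalize:
            # normalized total reads in the transcript divided by its length
            density = (sum(cov) / sample_total) / len(cov)
            if density == 0:
                continue  # no reads: density normalization is undefined
            window = [c / density for c in window]
        profiles.append(window)
    if not profiles:
        return []
    return [sum(p[i] for p in profiles) / len(profiles) for i in range(n_positions)]


# Toy data: two transcripts with the same shape but 100-fold different abundance.
toy = {"high": [100, 200, 100, 0], "low": [1, 2, 1, 0]}
raw_profile = metagene(toy, sample_total=1000, n_positions=2)
equal_profile = metagene(toy, sample_total=1000, n_positions=2, density_normalize=True)
```

With the first normalization the highly covered transcript dominates the average, whereas with the second both transcripts contribute the same profile.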

3.2.8 Differential expression analysis

Once transcript coverages have been determined (as in Section 3.2.6) across sample types, it is frequently desirable to find which transcripts significantly differ in expression from treatment to treatment. First, we normalize the coverage of the respective transcripts, as described in 3.2.6, then take the base 2 logarithm of the coverage for each transcript, and perform a Student’s t-test across replicates of the respective sample types, corrected for multiple testing utilizing the Benjamini-Hochberg correction [116]. The base 2 logarithm is used so that the test asks whether there is a significant fold change between treatments, as opposed to a difference in coverage alone.
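The multiple-testing step can be sketched as follows. This is a minimal illustration of the standard Benjamini-Hochberg step-up adjustment, assuming the per-transcript p-values have already been obtained from the t-tests on log2 coverage; the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Transcripts whose q-value falls below the chosen false discovery rate would then be reported as differentially expressed.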

3.2.9 Local expression variability

In addition to overall transcript expression differences, treatments may have more local effects on expression, for example, as seen in Figure 3.2c at 13 nucleotides past the start codon. In order to capture these local expression variations, we developed a technique to scan through the transcriptome in equal sized windows, shifting by half a window width at a time, comparing window coverages. Window coverages are the sum of coverage per nucleotide across all nucleotides in the window, and can be normalized with the same normalizations as done for the metagene view described above. Transcript and window coverage constraints can additionally be enforced, where a specified number of reads must appear in at least one of the samples, either for the transcript as a whole or for the specific window being compared. Differences between window coverages are tested for significance (if tested) using a Benjamini-Hochberg [116] corrected Student’s t-test.
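The window scan can be sketched as below. This is a minimal illustration, assuming per-nucleotide coverage for one transcript is held in a list; the function name `window_coverages` is hypothetical, and dropping a trailing partial window is a choice made here for simplicity, not something the text specifies.

```python
def window_coverages(coverage, window):
    """Sum coverage in equal-sized windows, shifting by half a window each step.

    coverage: list of per-nucleotide coverage along one transcript
    window: window width in nucleotides
    Returns a list of (start_index, summed_coverage) tuples.
    """
    step = max(1, window // 2)  # half a window width (floored for odd widths)
    results = []
    # Only full-width windows are reported; a trailing partial window is dropped.
    for start in range(0, len(coverage) - window + 1, step):
        results.append((start, sum(coverage[start:start + window])))
    return results


windows = window_coverages([1, 2, 3, 4, 5, 6, 7, 8], window=4)
```

Each nucleotide away from the transcript ends therefore falls in two overlapping windows, so a local change that straddles one window boundary is still captured whole by the neighboring window.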

3.2.10 Analysis of local expression variability

Once significantly different local expression regions have been found, as in Section 3.2.9, it follows that one would be interested in determining if there are any cis-acting factors correlated with the expression variation. To this end, we have developed techniques to examine the sequence and base pairing probabilities in the region of the local expression differences (base pairing probabilities, in particular, are for transcriptome-related studies such as RNA-seq or Ribo-seq, where single stranded RNA is capable of forming secondary structures that might correlate with the coverage differences).

Sequence extraction and motif discovery

Utilizing the transcriptome coordinates of the significantly different windows found, we first combine consecutive or overlapping windows, generating a list of transcriptome regions with significantly different coverage across experimental conditions. With this list of regions, transcript annotation files (as discussed in 3.2.6), and the genome of the organism studied, we extract the transcript nucleotide sequence surrounding and incorporating the regions discovered. These sequences can be sent to MEME [117] for motif discovery, with a quality control test examining the translation reading frames of the motif per region, in each motif found.
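Combining consecutive or overlapping windows into regions amounts to a standard interval merge. The sketch below assumes windows are given as half-open (transcript, start, end) tuples in transcript coordinates; the function name `merge_windows` and this representation are illustrative assumptions.

```python
def merge_windows(windows):
    """Merge consecutive or overlapping windows on the same transcript.

    windows: iterable of (transcript_id, start, end) with half-open [start, end)
    Returns a sorted list of merged (transcript_id, start, end) regions.
    """
    merged = []
    for tx, start, end in sorted(windows):
        # A window that starts at or before the previous region's end
        # (on the same transcript) is overlapping or directly adjacent.
        if merged and merged[-1][0] == tx and start <= merged[-1][2]:
            prev_tx, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_tx, prev_start, max(prev_end, end))
        else:
            merged.append((tx, start, end))
    return merged


regions = merge_windows([
    ("t1", 10, 30), ("t1", 0, 20), ("t1", 30, 50),  # overlap, then adjacency
    ("t1", 70, 90),                                  # separate region
    ("t2", 0, 20),                                   # different transcript
])
```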

Region by region or meta-region coverage views

Analogous to full transcript by transcript or metagene views, we can examine the actual coverage of the list of regions obtained, either with each region independent, showing the full expression differences across treatments for each region, or in a meta-region fashion, examining if there is an overall coverage profile across the significant regions extracted. Based on the coverages across regions, we are able to align further analysis to any identifying characteristic of the coverage profile, in a region-by-region manner (for example, if regions show a clear peak or trough in coverage, shifted slightly from region to region, downstream analysis can be aligned to the peak or trough across each region).

Base pairing probabilities

The nucleotide sequence of regions, aligned to any coverage characteristic mentioned above, is sent to the ViennaRNA Package’s RNAfold program [118] to predict base pairing probability of each nucleotide. The base pairing probabilities can be examined in a metagene fashion, averaging across all regions per nucleotide, or independently of one another. The base pairing probabilities per position in each region can also be correlated with overall expression levels of the transcripts (see Section 3.2.8), testing for significance using Spearman’s Rank Correlation test [119].

Significant sequence features

Aligning the nucleotide sequence to the coverage characteristic mentioned above, we can determine the frequency of each nucleotide, codon, and amino acid per position within the region. For any frequency that appears to be significantly enriched or suppressed, we test for significance utilizing a binomial test, comparing to the frequency of nucleotide, codon, or amino acid from background (based on the genome, transcriptome, or coding regions).
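As an illustration of the enrichment test, the sketch below computes an exact one-sided binomial tail probability. The choice of a one-sided test and the function name `binomial_tail` are assumptions made for this example; the text above specifies only that a binomial test against a background frequency is used.

```python
from math import comb


def binomial_tail(k, n, p, enriched=True):
    """Exact one-sided binomial probability for X ~ Binomial(n, p).

    k: number of regions showing the feature at this position
    n: total number of regions
    p: background frequency of the feature
    enriched: if True return P(X >= k), otherwise return P(X <= k)
    """
    ks = range(k, n + 1) if enriched else range(0, k + 1)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in ks)


# A feature seen in 8 of 10 regions against a background frequency of 0.5.
p_enriched = binomial_tail(8, 10, 0.5)
# A feature seen in only 2 of 10 regions against the same background.
p_suppressed = binomial_tail(2, 10, 0.5, enriched=False)
```

Because one such test is performed per position and per feature, the resulting p-values would normally also be corrected for multiple testing.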

3.3 Capping of RNA: a regulator of transcript life cycle

We now discuss mRNA capping and the application of these tools to the analysis of 5′-end seq data, which probes the 5′ cap of mRNAs. Many post-transcriptional modifications are made to a eukaryotic mRNA throughout its life cycle. The first of these is the addition of a 5′ cap [120; 121], which plays key roles in later steps of pre-mRNA processing [122], export [123; 124], translation [125–127], and decay [128–130].

The exonucleolytic decay pathway begins with deadenylation of an mRNA’s poly-A tail, followed by removal of the 5′ cap and progressive exoribonucleolytic decay in the 5′ to 3′ direction by Xrn1, and/or in the 3′ to 5′ direction by the cytoplasmic RNA exosome complex [131; 132]. The 5′ cap, then, plays a critical role as a gatekeeper in the 5′ to 3′ degradation pathway, with the presence of the cap preventing degradation.

Traditionally, it has been thought that capping only happens in the nucleus, co-transcriptionally [133], and that the removal of a cap from a mature mRNA is an irreversible process, followed by rapid degradation by Xrn1. However, our collaborators recently found the capping enzyme to be present in the cytoplasm, and have shown a pool of stable uncapped cytoplasmic RNAs, identifying a cyclic process of RNA decapping and recapping which they have termed cap homeostasis [134]. In an effort to comprehensively study the process of cytoplasmic capping on a transcriptome-wide level, our collaborators developed a novel 5′-end seq protocol, intended to capture the 5′ ends of the pool of uncapped cytoplasmic RNAs for HTS. We participated in the collaboration to help with the analysis of these 5′-end seq libraries.

3.3.1 Cell culture preparation

Collaborators stably transfected U2OS cells with a tetracycline-inducible dominant negative form of the capping enzyme, which had a deletion of the nuclear localization signal and an addition of the HIV Rev nuclear export signal, restricting the dominant negative form of the capping enzyme to the cytoplasm. Cells were then transfected with No siRNA, Control (scrambled) siRNA, or Xrn1 siRNA (to knock down Xrn1), and were also treated with ±doxycycline to induce the dominant negative capping enzyme (+dox induces the dominant negative form; see [134; 135] for details). Utilizing the novel 5′-end seq method to capture uncapped 5′ ends of transcripts, collaborators prepared cDNA libraries for sequencing. All treatments were done in duplicate, giving a total of 12 cDNA libraries that were sequenced (see Table 3.1).

3.3.2 5′-end seq workflow overview

[Figure 3.3 diagram: six numbered steps showing transcripts annealed to Adaptor-NNNNN oligos, annealing of cRNA, addition of RNA Ligase, denaturation, selection for cRNA, and construction of cDNA libraries.]

Figure 3.3: 5′-end seq workflow: The capture of uncapped RNA 5′ ends. 1. 5′ capped (black circle) and uncapped (no circle) cellular extract transcripts are annealed to complementary sequences (random N’s), ligated to known adaptors. 2. RNA complementary to the adaptor sequence (cRNA) is annealed to the adaptors. 3. Upon the addition of RNA Ligase, uncapped transcripts are ligated to the cRNA sequences, while ligation is blocked by the 5′ caps of capped RNAs. 4. Upon denaturing of the random sequences (Ns) from the transcripts, cellular transcripts are left with either 5′ caps, or cRNA sequences on the 5′ ends. 5. cRNA sequences are selected and 6., cDNA libraries are constructed, providing a library of uncapped RNA transcript 5′ ends.

The novel 5′-end seq method involves the annealing of random sequences to their complementary sequences in cell extract RNAs, with a sufficient variety of random sequences to capture a good sample of cellular RNA (see Figure 3.3). The random sequences have a known adaptor sequence pre-ligated to them, which, upon annealing of the random sequences to the 5′ ends of transcripts, extends past the 5′ ends. RNA complementary to the adaptors is then annealed to the adaptors, after which RNA Ligase is added to ligate the complementary RNA to the 5′ end of the transcripts. Ligation only occurs with uncapped 5′ ends, whereas capped RNAs retain a single-stranded break between the adaptor’s complementary RNA and the RNA from the cell. Upon melting, where the adaptor and random sequences are removed from the cellular transcripts, all uncapped cellular RNAs retain the adaptor’s complementary sequence on their 5′ ends, while capped RNAs do not, allowing the complementary adaptor sequences to be selected, from which the uncapped 5′ end libraries are constructed.

3.3.3 Application of Methods to 5′-end Seq

Read alignment and position assignment

Reads were prepared and aligned according to Section 3.2.2, with the 5′ ends of reads used to determine read coverage per genomic location (as per Section 3.2.5).

Top 500 unmapped reads

In contrast to our previous discussion on finding the top 500 mapped read consensus sequences (see Top 500 reads in Section 3.2.4), we extracted the top 500 raw unmapped read sequences for the Control, −dox, replicate A sample, which showed a large number of unmapped reads. The most frequent ∼10% of these were then examined using BLAST [111].

5′-end seq quality control

To probe shared vs unique transcript 5′ end positions, we determined:

\[
c_{i,j} = \frac{\text{Number of 5' end locations in } i \text{ which passed the cutoff, with} \geq 1 \text{ read in } j}{\text{Total number of 5' end locations in } i \text{ which passed the cutoff, irrespective of } j} \tag{3.1}
\]

across treatments i, j (see Table 3.1) and cutoffs ranging from 1 to 10 reads (see Figure 3.4).
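Equation 3.1 can be sketched in a few lines. The example assumes each library is stored as a dictionary from a genomic 5′ end location to its read count, and that "passing the cutoff" means having at least that many reads; the function name `shared_fraction` and the data layout are hypothetical.

```python
def shared_fraction(cov_i, cov_j, cutoff):
    """Fraction of 5' end locations passing the cutoff in library i that
    also have at least one read in library j (Equation 3.1).

    cov_i, cov_j: dict mapping (chromosome, position, strand) -> read count
    cutoff: minimum number of reads required in library i
    Returns None when no location in library i passes the cutoff.
    """
    passed = [loc for loc, reads in cov_i.items() if reads >= cutoff]
    if not passed:
        return None
    shared = sum(1 for loc in passed if cov_j.get(loc, 0) >= 1)
    return shared / len(passed)


lib_a = {("chr1", 100, "+"): 5, ("chr1", 200, "+"): 1, ("chr1", 300, "-"): 12}
lib_b = {("chr1", 100, "+"): 2, ("chr1", 300, "-"): 3}

c_cut1 = shared_fraction(lib_a, lib_b, cutoff=1)
c_cut5 = shared_fraction(lib_a, lib_b, cutoff=5)
```

Note that the quantity is not symmetric: swapping the two libraries changes which one supplies the denominator.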

Local 5′ end expression variability

Transcriptome coverages were prepared for previously published recapping target transcripts (see [134]), as would be done for item 2 in Transcript by transcript view on page 59, including 10 nts upstream of the annotated 5′ ends of transcripts, in an effort to capture all 5′-end seq positions per transcript. Coverages were normalized to the total number of reads per sample (totals reported in Table 3.1). Local expression differences were compared as described in Section 3.2.9, scanning with both 10 and 20 nt windows, enforcing a minimum of 10 reads per transcript for both window sizes, and additionally a minimum of 20 reads per transcript for a further 20 nt window comparison.

Replicate vs replicate comparisons

Transcript by transcript replicate coverage comparisons were performed as described on page 58, comparing each transcript independent of others. The 22 highest expressed transcripts were compared, along with a selection of recapping target transcripts, ENST00000368681 (ILF2-002), ENST00000398822 (MAPK1-002), and ENST00000409784 (RAB1A-001). The highest expressed transcripts were determined by summing the total 5′-end seq raw reads mapped to each transcript, across all treatments and replicates. Coverages were normalized by the total number of reads per library (see Figure 3.5 for a representative example).

Global replicate vs replicate comparisons were also performed, comparing the raw read coverage of every genomic position in one replicate, mapped to the ± strands of the genome, with the corresponding coverage at the same genomic position and strand in the other replicate. As a comparison, the coverage of the + strand per genomic position was also compared with the coverage of the − strand within the same replicate.

Table 3.1: STAR alignment and rRNA removal statistics of 5′-end seq data

| Treatment | Total | % Unique | % Multiple | % Unmapped | % Too Short | % rRNA | % Remaining |
|---|---|---|---|---|---|---|---|
| None, −dox, A | 25,403,561 | 76.20 | 21.73 | 2.07 | 1.31 | 1.94 | 95.99 |
| None, −dox, B | 22,251,755 | 73.82 | 23.59 | 2.59 | 1.95 | 2.19 | 95.22 |
| None, +dox, A | 22,636,996 | 74.03 | 24.33 | 1.64 | 1.02 | 0.93 | 97.43 |
| None, +dox, B | 23,498,813 | 75.66 | 22.81 | 1.53 | 0.91 | 1.13 | 97.34 |
| Control, −dox, A | 27,263,868 | 37.45 | 10.92 | 51.63 | 51.14 | 1.55 | 46.82 |
| Control, −dox, B | 39,296,593 | 68.29 | 24.03 | 7.68 | 7.00 | 2.56 | 89.76 |
| Control, +dox, A | 36,488,201 | 67.14 | 24.50 | 8.37 | 7.55 | 4.82 | 86.81 |
| Control, +dox, B | 42,793,144 | 68.48 | 24.05 | 7.47 | 6.59 | 2.86 | 89.66 |
| Xrn1, −dox, A | 34,803,075 | 69.95 | 23.48 | 6.56 | 5.58 | 2.68 | 90.76 |
| Xrn1, −dox, B | 31,258,505 | 67.79 | 22.27 | 9.93 | 8.94 | 2.89 | 87.17 |
| Xrn1, +dox, A | 40,527,326 | 66.89 | 24.61 | 8.50 | 7.27 | 5.4 | 86.1 |
| Xrn1, +dox, B | 36,080,844 | 70.54 | 22.73 | 6.73 | 5.72 | 1.89 | 91.38 |

Treatments are No siRNA, Control siRNA (scrambled sequence), and Xrn1 siRNA (for Xrn1 knockdown). ±dox is the presence/absence of doxycycline, used to promote the translation of the dominant negative capping enzyme. A/B are the two replicates per treatment. % Unique, Multiple, and Unmapped are STAR [107] alignment statistics, with the majority of unmapped reads being unaligned due to their length being too short. Post-rRNA removal, majority of reads are seen to be aligned to the genome, with the exception of Control, −dox, replicate A.

3.3.4 Results/ Discussion

STAR reveals good alignment rate for libraries, with low rRNA contamination, with the exception of one library

Reads were mapped to the genome utilizing STAR [107], which provided alignment statistics (see Table 3.1). We observe that, aside from the Control, −dox, replicate A sample, alignment rates are good across the board, with over 66% of reads uniquely mapping in all cases and over 90% of all reads mapped, out of a range of 22 to 43 million reads per sample. We also see that the majority of unmapped reads are due to the reads being too short for STAR, with 99% of Control, −dox, replicate A’s unmapped reads being too short.

Short read investigation of Control, −dox, replicate A

Observing the large number of unmappable reads considered “too short” by STAR, we investigated possible causes. We learned that STAR will flag a large number of reads as “too short” if there is any of the following [136]:

1. Poor sequencing quality

2. Contamination

(a) with exogenous sequences

(b) with ribosomal RNA

(c) with primer-dimers (rare)

3. Inserts between paired-ends that are too short

Item 1 is unlikely, as we had sequenced the other samples in the same lane with good quality, and item 3 is not possible, as we only had single-end reads. Taking the ∼10% most common unmapped sequences, we determined that there was a source of Pseudomonas contamination in this sample (item 2a), explaining the high level of unmapped reads.

An attempt at 5′-end seq quality control: unique vs shared position comparisons yield some expected trends

In an effort to determine the quality of the 5′-end seq libraries, with preliminary effects of dominant negative capping enzyme induction potentially observable, 5′ end positions shared across two libraries were compared with those found in only one library, for a given comparison (see Equation 3.1). Comparisons attempted were: a) replicate A vs B for a given treatment, b) No vs Control or Control vs Xrn1 siRNA, per ±dox, per replicate, and c) +dox vs −dox, per siRNA treatment, per replicate. Only the replicate vs replicate comparisons (item a) are shown in Figure 3.4.

As expected, we observe more 5′ end positions shared between the two libraries as the coverage cutoff is increased (as seen by the positive slopes), since more highly covered 5′ end positions are less likely to be due to measurement noise and more likely to be truly uncapped transcript 5′ ends. However, with all the different comparisons made, we were unable to establish any other significant trends by observation of these comparisons alone, and so we performed a more detailed comparison of the libraries, scanning across all transcripts, looking for local coverage differences across library types.

[Figure 3.4 panels: No siRNA, Control siRNA, Xrn1 siRNA. Each panel plots Fraction of 5′ ends (0 to 0.7) against Coverage cutoff (1 to 10) for the +dox A, +dox B, −dox A, and −dox B libraries.]

Figure 3.4: Uncapped 5′ end replicate comparisons. The fraction of 5′ end locations which passed a minimum coverage with the complementary replicate also having a minimum of one read expressed, is compared with the total number of 5′ end locations which passed the cutoff indicated (see Equation 3.1). The labeled replicate (“+dox, A”, for example), is the replicate which passed the coverage cutoff.

Local 5′ end coverage comparisons yield no significant differences across treatments

Upon scanning across all recapping target transcripts and correcting for multiple testing utilizing the Benjamini-Hochberg correction [116], we found no significant local coverage differences. This remained the case after increasing the comparison window size, and after doubling the minimum coverage constraint per transcript from 10 to 20 reads.

One possibility is that too many windows were tested in these coverage comparisons: even the comparison with the fewest tests (20 nt windows scanned, with a minimum of 20 reads required per transcript in at least one of the treatments used in the comparison) involved 232,999 windows. Thus, with the multiple testing correction enforced, even the smallest p-value of 2.3 × 10⁻⁶ is adjusted to a q-value of 0.53. However, in an effort to gain confidence in the local coverage difference technique employed, we examined a few target transcript coverages directly.
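The size of this adjustment can be checked by hand. As a worked illustration, assuming the standard Benjamini-Hochberg step-up adjustment (the symbols m, p_(j), and q_(1) are introduced here only for this check), the adjusted value for the smallest of m p-values is bounded above by that p-value multiplied by m:

```latex
q_{(1)} = \min_{j \ge 1} \frac{m \, p_{(j)}}{j} \;\le\; m \, p_{(1)}
        = 232{,}999 \times 2.3 \times 10^{-6} \approx 0.54
```

This agrees with the reported q-value of 0.53 to within the rounding of the quoted p-value, and shows that a raw p-value of roughly 2 × 10⁻⁷ or smaller would have been needed to reach a q-value of 0.05 with this many windows.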

Replicates fail to replicate

A transcript-by-transcript view of replicate coverage exhibited the expected pattern of localized coverage peaks, with the majority of transcript positions having zero coverage. However, upon comparison of the coverage peaks across replicates, it was quickly evident that treatment replicates failed to replicate (see Figure 3.5), with coverage from one replicate frequently being half or less of the other, and coverage from one treatment sandwiching the other treatment at times. Having observed the failure of replicates to replicate for a few transcripts, we decided to examine replicate reproducibility on a more global level.

Full genome coverage correlations comparing raw read coverages per genomic position between replicates show highly unexpected behavior (see Figure 3.6). In general, HTS experiments show a funnel shape for read coverage comparisons between replicates (see, for example, Figure 4.1), and indeed, 5′-end seq replicate correlations do show this as a general trend. However, for the replicates analyzed in this study, we also observe a subset of genomic positions that are consistently shifted in coverage in one replicate by up to a few orders of magnitude (a second funnel). The Control, −dox sample, which contained the Pseudomonas contamination, is shown, and exhibits even greater variability, with at least two sub-populations shifted in opposing directions. As a comparison, the correlation between + and − strand coverages within one sample is shown as an example of completely uncorrelated coverage data.

[Figure 3.5 panels: (a) No siRNA; Full Transcript, (b) No siRNA; Focus, (c) Xrn1 siRNA; Full Transcript, (d) Xrn1 siRNA; Focus. Each panel plots normalized frequency (×10⁻⁴) against nts from start for the −dox A, −dox B, +dox A, and +dox B libraries.]

Figure 3.5: Uncapped 5′ end replicates fail to replicate. 5′-end seq coverage for transcript ENST00000229239.5 (GAPDH), a highly expressed transcript as measured by the fraction of reads mapping to it in the 5′-end seq libraries, is shown as an example, normalized by total reads per library. (a-b) show No siRNA treatment, (c-d) show Xrn1 siRNA treatment, with (a, c) showing the full transcript coverage and (b, d) displaying the indicated smaller focus region of the transcript’s coverage. As indicated, the replicates with only the native capping enzyme expressed are displayed with red-tinted circles, while the replicates with the dominant negative capping enzyme expressed are displayed with blue-tinted squares. As clearly seen, replicate values per position fail to correlate in general.

Figure 3.6: Global replicate vs replicate coverage comparisons show a subset of 5′ end positions have an average coverage shifted by ∼10 times the remaining values: A representative of the different correlation comparisons is shown, where (a) shows No siRNA, −dox, + strand coverages of replicates A vs B, (b) shows the same for Control siRNA, (c) shows Xrn1 siRNA, +dox, − strand coverages for A vs B, and (d) is likewise for replicate A’s + vs − strand coverages, as an example of uncorrelated data. Each data point represents a genomic position where the raw reads coverage is read off of the respective axis.

Decision to abandon 5′-end seq method and data

Upon examination of both the local and global coverage pictures, with the lack of reproducibility between replicates, it was decided to abandon the 5′-end seq data and method as a technique to probe uncapped transcript 5′ ends, despite over a year’s worth of effort and finances having been invested in the technique and data collection. Indeed, Machida and Lin found that a ligation-based technique, similar to the one developed by our collaborators, is less reproducible than other techniques to probe transcript 5′ ends [137]. Thus, for future work, our collaborators plan on utilizing one of the SMART techniques discussed by Machida and Lin.

3.4 Conclusions

Here, we have outlined the variety of different techniques we have developed in the analysis of HTS data. These range from quality control methods, in an attempt to assess the validity of experimental methods, to data visualization techniques and differential expression quantitations, probing the effects of experimental conditions on a transcriptome-wide level.

We have also shared a case study of examining a novel 5′-end seq method, developed by collaborators to study the global picture of uncapped transcript 5′ ends in a cell. This case study shows the specific nature of analysis techniques that need to be developed on a study-by-study basis, as well as the importance of quality control methods, in particular in early stages of data analysis.

Chapter 4 An Investigation of LepA’s Function in E. coli

4.1 Introduction: Background on LepA

This chapter is based on our published paper, “The conserved GTPase LepA contributes mainly to translation initiation in Escherichia coli” [54].

4.1.1 LepA is highly conserved, and yet not well understood

Eleven GTPases are found in all bacteria, including translation factors, components of signal recognition particles, and ribosome assembly factors [138; 139]. Among the conserved GTPases is the poorly understood translation factor LepA. Despite its universal conservation across all bacteria [140], the lepA gene can be deleted from the genome with no obvious effect on growth [141].

LepA is a paralog of EF-G, which itself is a translation factor that catalyzes the translocation of the two tRNAs after peptidyl transfer [142]. LepA has protein domains homologous to domains 1 (G domain), 2, 3, and 5 of EF-G [143], but lacks domain 4 and a subdomain (G′), and instead has a unique C-terminal domain (CTD). LepA has been shown to largely localize to the membrane [144; 145]. Although it was suggested that LepA catalyzes reverse translocation [146], further studies [147; 148] and this one [54] have shown this to not be the case.

Table 4.1: Strains in which ∆lepA confers a synthetic growth defect

The last five columns give the doubling time (min) for each lepA background.

| Genetic background^a | Genome position^b | Gene function | lepA+ | ∆lepA | ∆lepA (pLEPA)^c | ∆lepA (pRB35)^d | ∆lepA (pRB34)^e |
|---|---|---|---|---|---|---|---|
| Wild-type | NA | NA | 24 ± 2 | 26 ± 1 | 26 ± 1 | ND | ND |
| dksA | 3.5 | Stringent response, transcriptional regulator | 33 ± 1 | 52 ± 3 | 34 ± 2 | ND | ND |
| molR | 47.3 | Molybdenum utilization, regulatory protein | 22 ± 1 | 33 ± 2 | 24 ± 2 | ND | ND |
| rsgA | 94.6 | Ribosome biogenesis, GTPase | 26 ± 1 | 31 ± 1 | 26 ± 1 | ND | ND |
| tatB | 86.7 | Component of twin-arginine translocase (TAT) | 29 ± 1 | 44 ± 2 | 35 ± 2 | ND | ND |
| tonB | 28.2 | Component of TonB-Exb system, PMF-dependent transporter | 25 ± 1 | 35 ± 2 | 30 ± 1 | ND | ND |
| tolR | 16.7 | Component of Tol-Pal system, PMF-dependent transporter | 27 ± 1 | 31 ± 2 | 28 ± 2 | ND | ND |
| ubiF | 15.0 | Ubiquinone biosynthesis | 37 ± 2 | 58 ± 2 | 43 ± 2 | 55 ± 3 | 57 ± 2 |
| ubiG | 50.4 | Ubiquinone biosynthesis | 39 ± 1 | 64 ± 2 | 43 ± 2 | 60 ± 3 | 61 ± 4 |
| ubiH | 65.8 | Ubiquinone biosynthesis | 34 ± 1 | 42 ± 1 | 31 ± 1 | ND | ND |

Reported values represent the mean ± SEM for ≥ 3 independent experiments. NA, not applicable; ND, not determined. (Data collected by R. Balakrishnan, from [54])
a Single-gene deletions of the Keio collection, each marked with a Kn-resistance cassette. Wild-type, the parental strain BW25113 [F−, λ−, ∆(araD–araB)567, ∆lacZ4787::rrnB-3, ∆(rhaD–rhaB)568, rph-1, hsdR514].
b In units of minutes (or centisomes).
c Plasmid pLEPA contains the wild-type lepA gene downstream from its native promoter.
d Plasmid pRB35 expresses a variant of LepA lacking the CTD (∆487–594).
e Plasmid pRB34 expresses a variant of LepA with H81A.

4.1.2 Synthetic phenotypes exhibited by ∆lepA

In pursuit of LepA’s ever elusive function, our collaborators performed a systematic analysis in which deletion of every non-essential gene in E. coli was combined with ∆lepA, and the effects on growth were assessed. They also re-inserted a plasmid containing the LepA gene, ∆lepA (pLEPA), to observe recovery from the double deletion.

First, they re-confirmed that LepA deletion alone fails to exhibit a significant growth effect (24 ± 2 minutes vs 26 ± 1 minutes, Table 4.1). In addition, 9 mutants containing the double mutations of both ∆lepA and one of ∆dksA, ∆molR, ∆rsgA, ∆tatB, ∆tonB, ∆tolR, ∆ubiF, ∆ubiG, or ∆ubiH were found to have a reduced growth rate compared to the single deletion of the genes alone, lepA+. Upon insertion of the plasmid pLEPA, at least partial recovery was observed in all cases. This suggests that LepA contributes to the cellular functions of these genes, either directly or indirectly.

4.1.3 Deletion of the active-site histidine or the unique CTD in LepA fails to complement the synthetic phenotypes

In an effort to observe the significance of the unique CTD and GTPase activity in LepA, derivatives of pLEPA were designed lacking the CTD (∆487–594) in one case (pRB35), and containing the mutation H81A in the other (pRB34). Histidine 81 is believed to be critical for GTPase activity, as similar substitutions in EF-Tu and EF-G have been shown to cause loss of GTPase activity [149; 150].

When transformed back into the ∆ubiF ∆lepA and ∆ubiG ∆lepA strains (those that showed the largest growth defect), both plasmids failed to rescue growth. Having confirmed that the levels of modified LepA in cells containing pRB34 or pRB35 are comparable to LepA levels in cells with pLEPA, they showed the importance of both the CTD and the GTPase activity of LepA to its in vivo function.

4.1.4 Examining LepA’s effect on the transcriptome and translatome

Having discovered a complementable LepA effect on cell growth through the double knockout study, we decided to investigate whether the depth of coverage offered by high-throughput sequencing (HTS) could be used to further investigate LepA’s role in the cell. Our collaborators performed RNA-seq (a snapshot of the transcriptome, [151; 152]) and Ribo-seq (a snapshot of the translatome, [48; 49]) on wild-type (WT), ∆lepA mutant (M), and ∆lepA pLEPA complemented (C) strains. Three independent experiments were performed per strain, leading to 18 total cDNA libraries. Per strain, there were 45–69 million ribosome-footprint reads and 40–65 million total-RNA reads aligned to the E. coli K12 MG1655 genome, allowing us to determine the effects of LepA on global gene expression. The data obtained were highly reproducible among the biological replicates (Figure 4.1), and we proceeded with our investigation.

Figure 4.1: Total RNA and ribosome footprint coverage is highly replicable. Shown are biological replicates 1 and 2 compared to each other in each of the strains (wild-type, mutant, complemented) for the two library types (total RNA and ribosome footprints), comparing the raw reads coverage for each genomic location. The Pearson correlation coefficients for all replicate-to-replicate comparisons are as follows: wild-type (WT) total RNA, r = 0.999, 0.997, and 0.998; mutant (M) total RNA, r = 0.998, 0.997, and 0.998; complemented (C) total RNA, r = 0.996, 0.962, and 0.968; WT ribosome footprints, r = 0.992, 0.990, and 0.995; M ribosome footprints, r = 0.997, 0.976, and 0.981; C ribosome footprints, r = 0.995, 0.997, and 0.995. (Data collected by R. Balakrishnan, correlated by K. Oman, from [54])

4.2 Methods

4.2.1 LepA investigations

Biological experiments and library preparation

Details on the methods for the biological experiments and library preparation can be found in our NAR paper [54].

Adapter trimming and read quality control

Adapter trimming and read quality control were performed as described in Section 3.2.2.

Read processing and alignment

As described in Section 3.2.2, the remaining reads were aligned to the E. coli genome with default Bowtie 2 parameters, and the Bowtie 2 output files were converted to the BAM format, sorted, and indexed for easy read access utilizing SAMtools. Alignment provided an average of 6.9% unaligned, 17.5% uniquely mapped and 75.6% multiply-mapped reads per total-RNA library, while ribosome-footprint libraries had an average of 18.5%, 36.8% and 44.7%, respectively.

Computational removal of rRNA reads

Prior to further analyses (unless otherwise stated), the rRNA reads were computationally identified and removed, as described in Section 3.2.3. After this removal, 29.8% of the original reads were left in the total RNA samples (55.8% of which were, on average, annotated as uniquely mapping), and 38.7% of ribosome footprint reads remained, with 95% of them being uniquely mapped.

Assignment of Read Location

For downstream analysis, each read needed to be associated with a single genomic location.

The midpoint of reads was used for total RNA libraries. Ribosome footprints were mapped to the predicted center of the P site of the ribosome. To identify this P site, we created a histogram of 3′-end positions of ribosome-footprint reads (shown to vary in distance from the P site less than the 5′ ends [52; 153]) as a function of the distance from the 3′ end of their corresponding gene, normalizing each read by the total RNA coverage per gene length of that gene in the corresponding sample type (WT, M, C), ensuring all genes contribute equally (see Figure 4.2). This histogram shows a peak 10 nts downstream from the stop codon, which we associate with ribosomes undergoing termination, with the stop codon in the A site ([52; 154]). Thus, we identified the center of the P site to be 14 nts prior to the 3′ end of the footprint for ribosome location placement.

Figure 4.2: Ribosome footprint fragments end 14 nts downstream of the center of the P site. Histograms of genomic locations of 3′ ends of ribosome-footprint reads. The coverages of 3′ end positions of ribosome footprint fragments were normalized for each gene to the total RNA coverage and length of the genes, indexed by their distances from the stop codons of their respective genes (0 representing the third position of the stop codon), and averaged over all genes, all three conditions, and all replicates. The peak at 10 nts beyond the stop codon, which we interpret as stemming from ribosomes located with the stop codon in their A site, indicates that the center of the P site is 14 nts upstream of the most prevalent 3′ end of ribosome footprint fragments. (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54]) [Plot: Normalized Frequency versus Nucleotides from Stop Codon.]

Coverages (+/− strand coverages combined) of all reads were normalized by the total number of reads after quality control and ribosomal RNA removal to provide coverage in reads per million, combined across replicates for each condition and stored as bed files for import into the UCSC genome browser [110]. Strandedness of reads was preserved for all other downstream analysis, unless otherwise stated.
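The read placement and reads-per-million scaling described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline code used in this work: the function names, the 0-based inclusive coordinate convention, and the strand handling are all assumptions made for the example.

```python
from collections import Counter

# Offset from the footprint 3' end back to the P-site center (value from the text).
P_SITE_OFFSET = 14


def p_site_position(start, end, strand):
    """Predicted P-site center for one footprint.

    Assumes 0-based, inclusive genomic coordinates with start <= end
    (a hypothetical convention chosen for this sketch).
    """
    if strand == "+":
        return end - P_SITE_OFFSET    # 3' end is the larger coordinate
    if strand == "-":
        return start + P_SITE_OFFSET  # 3' end is the smaller coordinate
    raise ValueError("strand must be '+' or '-'")


def midpoint_position(start, end):
    """Midpoint placement, as used for total RNA reads."""
    return (start + end) // 2


def coverage_rpm(positions, total_reads):
    """Count reads per position and scale to reads per million.

    total_reads is the number of reads remaining after quality control
    and rRNA removal.
    """
    counts = Counter(positions)
    scale = 1_000_000 / total_reads
    return {pos: n * scale for pos, n in counts.items()}
```

Keeping the offset in a single named constant makes it easy to revisit should a different library preparation shift the peak seen in the stop-codon histogram.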

Statistical tests

All statistical tests were performed using the R statistical programming package version 3.1.0 [99].

Read reproducibility (replicate versus replicate)

To validate reproducibility across replicates, raw read coverages across the genome (with +/− strand coverages combined) were compared from replicate to replicate across WT, M and C samples for both total RNA and ribosome footprints. The Pearson’s correlation coefficient was determined for each of the replicate comparisons (Figure 4.1).
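For reference, the Pearson correlation between two replicate coverage vectors can be computed as in the sketch below. The function name and the list-based input are illustrative assumptions; the analysis itself was run in R.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length coverage vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of squares and cross-products about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Here `x` and `y` would hold the raw coverage at each genomic position (strands combined) for the two replicates being compared.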

High coverage genes and overall normalization

To ensure adequate signal-to-noise ratios and minimize the effect of multiple testing, we determined a list of high coverage genes by partitioning reads into genes based on genomic location, as annotated by the E. coli K12’s Genbank Ref-Seq gene annotation file (downloaded from the UCSC Microbial Genome Browser on 15 April 2013; [110]). For each of the two types of libraries (total RNA, ribosome footprint), total coverage (all WT, M, C samples) for each gene was determined and genes were then sorted from highest to lowest coverage level. Genes in the top 1/2 of the two lists were compared, and those in the intersection were designated high coverage genes (1872 genes). To account for overall sequencing efficiency differences between libraries, all read coverages for further analyses at any given location were normalized by the total number of coding-region reads for each sample (utilizing all gene annotations).
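The selection of high coverage genes amounts to intersecting the top halves of two ranked lists. A minimal sketch, assuming per-gene total coverages are held in dictionaries (the names and the handling of ties and odd gene counts are assumptions of this example):

```python
def top_half(coverage_by_gene):
    """Genes in the top half when sorted from highest to lowest coverage."""
    ranked = sorted(coverage_by_gene, key=coverage_by_gene.get, reverse=True)
    return set(ranked[: len(ranked) // 2])


def high_coverage_genes(total_rna_cov, footprint_cov):
    """Genes in the top half of both the total RNA and footprint lists."""
    return top_half(total_rna_cov) & top_half(footprint_cov)
```

Requiring membership in both lists avoids genes whose density estimates would be dominated by a small denominator or a small numerator.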

Gene by gene overall coverages

Normalized reads were assigned to the high coverage genes for each library. Average ribosome density (ARD) was determined by taking ribosome footprint coverages per gene and dividing by the coverage of the corresponding replicate of total RNA. To test for significance in ARD fold changes across experimental conditions (WT versus M and M versus C), we took the base 2 logarithm of all gene coverages, and performed a Student’s t-test, corrected for multiple testing utilizing the Benjamini-Hochberg correction [116].
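The ARD calculation and the multiple-testing correction can be illustrated as below. This is a sketch under stated assumptions: the function names are hypothetical, and the P-values passed to the correction are assumed to come from a separate t-test on the log2 values (in R, `t.test` and `p.adjust(p, method = "BH")` provide these steps, although the text does not state which functions were used).

```python
import math


def log2_ard(footprint_cov, total_rna_cov):
    """log2 of the average ribosome density for one gene in one replicate."""
    return math.log2(footprint_cov / total_rna_cov)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so q-values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

A gene would then be called LepA-affected only if its q-value is at most 0.05 in both the WT versus M and the C versus M comparison.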

Ribosome footprint versus total RNA correlations

Gene-by-gene correlations were performed comparing ribosome footprint coverages $x_i$ with total RNA coverages $y_i$ for the high coverage genes. A least-squares line of best fit was fitted in fold change (log-log) space, shifting all values up by the minimum non-zero gene coverage in each sample, constraining the y-intercept to be zero in the non-log, normalized coverage space (only allowing the slope to be fit). A measure of spread, $d^2$, was calculated in this log-log space, where:

\[ d^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \tilde{y}_i - \tilde{x}_i - \tilde{A} \right)^2 \tag{4.1} \]

and $n = 1872$ (the number of high coverage genes being used in the comparison), $\tilde{y}_i = \ln(y_i + y_{\min})$ (the logarithm of the shifted total RNA coverage for one gene), $\tilde{x}_i = \ln(x_i + x_{\min})$ (the logarithm of the shifted ribosome footprint coverage for that gene) and $\tilde{A} = \ln(A)$ ($A$ = the fitted slope). Student’s t-test was performed, comparing the differences in spread between WT versus M and M versus C samples.
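The spread measure of Equation 4.1 can be sketched as follows. The sketch assumes the slope is obtained by minimizing the same squared residuals with respect to ln(A), in which case ln(A) is simply the mean of the log differences; the function names are hypothetical and this is not the analysis code itself.

```python
import math


def min_nonzero(values):
    """Smallest strictly positive value, used as the shift before taking logs."""
    return min(v for v in values if v > 0)


def spread_d2(x, y):
    """Return (A, d2) for footprint coverages x and total RNA coverages y.

    Both inputs are per-gene normalized coverages in the same gene order.
    """
    x_min, y_min = min_nonzero(x), min_nonzero(y)
    # Log difference per gene after shifting by the minimum non-zero coverage.
    diffs = [math.log(yi + y_min) - math.log(xi + x_min) for xi, yi in zip(x, y)]
    n = len(diffs)
    ln_a = sum(diffs) / n  # least-squares ln(A) for a unit-slope line in log-log space
    d2 = sum((d - ln_a) ** 2 for d in diffs) / n
    return math.exp(ln_a), d2
```

Under this assumption, the spread is the population variance of the per-gene log ratios, so it is unaffected by any overall scale difference between the two library types.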

Read count comparisons

Read counts from the respective samples were compared for total RNA libraries, comparing total mRNA, rRNA, tRNA and other stable small RNAs to total aligned reads. Significance of LepA effects was tested using Student’s t-test.

Translation initiation region (TIR) sequence analysis of LepA-affected genes

We determined purine and pyrimidine frequencies for each position of the TIR for genes that exhibit altered ARD due to loss of LepA (i.e. significant change in both the WT versus M and C versus M comparisons; 283 genes where WT, C < M and 237 genes where WT, C > M). The nucleotide sequence, aligned with respect to the start codon, was obtained for these genes, as well as all high coverage genes (to be used as a background), utilizing the gene annotations and the E. coli genome mentioned previously. Genes infC and pcnB were left out of the analysis, as they have the non-standard start codon AUU. The percentage of purines and pyrimidines was then determined for each list, in a location-dependent manner, and finally, the respective percentages from the LepA-affected genes were divided by the percentages for all high coverage genes to yield an enrichment/suppression of purine and pyrimidine prevalence compared to background levels. Any locations where the LepA-affected and the high coverage genes both had a purine or pyrimidine percentage of zero were re-labeled to have an enrichment of 1. A two-tailed binomial test was performed for the pyrimidine frequencies for 8–11 nts prior to the start codon.
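The position-wise enrichment and the binomial test can be sketched as below. All names are hypothetical, sequences are assumed to be upper-case strings already aligned to the start codon, and the two-tailed convention shown (summing all outcomes no more likely than the observed one) is an assumption, since the text does not specify which convention was used.

```python
from math import comb

PYRIMIDINES = set("CTU")


def pyrimidine_fraction(seqs, pos):
    """Fraction of aligned sequences carrying a pyrimidine at index pos."""
    return sum(s[pos] in PYRIMIDINES for s in seqs) / len(seqs)


def enrichment(subset, background, pos):
    """Pyrimidine frequency in the subset relative to the background set."""
    fg = pyrimidine_fraction(subset, pos)
    bg = pyrimidine_fraction(background, pos)
    if fg == 0 and bg == 0:
        return 1.0  # both zero: re-labeled as an enrichment of 1, as in the text
    return fg / bg


def binom_two_tailed(k, n, p):
    """Exact two-tailed binomial P-value for k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(v for v in pmf if v <= cutoff))
```

For a given TIR position, `k` would be the number of LepA-affected genes with a pyrimidine there, `n` the number of genes in that subset, and `p` the background pyrimidine fraction at the same position.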

Metagene analysis of 5′ ends of genes

Ribosome density (RD) was determined for each gene at every position in the gene by determining:

\[ \mathrm{RD} = \frac{\text{Normalized Ribosome Footprint Coverage}}{(\text{Normalized Total RNA Full Gene Coverage}) / (\text{Gene Length})} \]

These RDs were then aligned to the 5′ end of each gene and averaged over all high coverage genes.
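A minimal sketch of the RD calculation and the 5′-aligned metagene average follows. The function names are illustrative, and skipping genes that are shorter than a given position is an assumption of this example rather than a documented choice.

```python
def ribosome_density(footprint_cov, total_rna_gene_cov):
    """Per-position RD for one gene.

    footprint_cov: normalized footprint coverage at each position (5' to 3').
    total_rna_gene_cov: normalized total RNA coverage summed over the gene.
    """
    rna_per_nt = total_rna_gene_cov / len(footprint_cov)
    return [c / rna_per_nt for c in footprint_cov]


def metagene_average(rd_by_gene, n_positions):
    """Average RD over genes at each of the first n_positions, 5'-aligned."""
    profile = []
    for pos in range(n_positions):
        # Only genes long enough to reach this position contribute.
        vals = [rd[pos] for rd in rd_by_gene if len(rd) > pos]
        profile.append(sum(vals) / len(vals) if vals else float("nan"))
    return profile
```

Dividing by the per-nucleotide total RNA coverage puts genes of different length and expression level on a common scale before averaging.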

Gene windows analysis

Relative ribosome occupancies of high coverage genes were determined for ribosome footprints, analogous to RD, by:

\[ c_i = \frac{\text{Normalized Ribosome Footprint Coverage}}{(\text{Total Normalized Ribosome Footprint Gene Coverage}) / (\text{Length of Gene})} \]

where $c_i$ is the relative ribosome occupancy at a given nucleotide, $i$, in the gene, normalized for gene-by-gene coverages (to treat all windows across all genes with equal weight, for downstream analysis). A local coverage comparison across sample types was performed by taking 10 nt windows, shifted by 5 nts at a time, scanning over each gene, comparing the window coverages. Window coverage, per window, is the sum of the $c_i$ over every nucleotide in the window. Only windows which showed a complemented average window coverage (averaged across replicates per WT, M and C strains) were tested for the significance of the coverage differences (WT versus M and M versus C) utilizing a Benjamini-Hochberg [116] corrected Student’s t-test. For the 56 windows deemed significant (q-value ≤ 0.05) for both WT versus M and M versus C comparisons, consecutive and overlapping windows were combined, then the local coverages were visually examined. Of the 38 regions examined in genes with enhanced coverages in the mutant (WT, C < M), we identified 25 (treating the largely identical genes tufA and tufB as a single instance) to contain a single, clear, complemented peak, and examined the peak-aligned codon frequencies across the windows, normalized to background levels across the prevalence of codons in all high coverage genes. A two-tailed binomial test was performed for GGU, which appeared in greater abundance compared to all other codons at the A site of the peaks. Although the 12 regions with reduced coverage in the mutant (WT, C > M) were also examined in a similar manner, no clear complemented peaks were observed.
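The window scan can be sketched as follows. This is an illustration only: the function names are hypothetical, trailing partial windows are dropped, and windows whose start lies at or before the end of the previous region are merged, all of which are assumptions of this example.

```python
def relative_occupancy(footprint_cov):
    """c_i: footprint coverage divided by the gene's mean footprint coverage."""
    mean_cov = sum(footprint_cov) / len(footprint_cov)
    return [c / mean_cov for c in footprint_cov]


def window_coverages(c, width=10, step=5):
    """(start, summed c_i) for windows of `width` nts shifted by `step` nts."""
    return [
        (start, sum(c[start:start + width]))
        for start in range(0, len(c) - width + 1, step)
    ]


def merge_windows(starts, width=10):
    """Combine consecutive or overlapping windows into (start, end) regions.

    Regions are half-open intervals in gene coordinates.
    """
    regions = []
    for s in sorted(starts):
        if regions and s <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], s + width)
        else:
            regions.append([s, s + width])
    return [tuple(r) for r in regions]
```

Because each gene's $c_i$ values average to one, a window sum near 10 indicates occupancy typical for that gene, and values well above 10 flag candidate pause sites.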

Table 4.2: Coding regions that exhibit reduced ARD in the absence of LepA

          Fold decrease in ARD (a)
Gene      WT / M    q-value (b)    C / M    q-value (b)
yjfN      128       3.3 × 10⁻²     4        3.0 × 10⁻²
tdcA      126       5.0 × 10⁻³     5        4.2 × 10⁻²
galS      46        6.1 × 10⁻⁴     14       4.6 × 10⁻³
ygeV      30        4.2 × 10⁻³     5        1.2 × 10⁻²
lldP      25        1.3 × 10⁻³     58       4.8 × 10⁻³
dsdX      24        5.5 × 10⁻³     97       1.9 × 10⁻³
ychH      21        9.3 × 10⁻³     6        6.1 × 10⁻³
yqeF      19        4.2 × 10⁻³     5        2.3 × 10⁻²
cstA      19        5.3 × 10⁻³     9        1.3 × 10⁻³
raiA      18        6.3 × 10⁻⁴     17       1.9 × 10⁻³
dctA      16        6.1 × 10⁻⁴     11       6.4 × 10⁻⁴
cspD      15        2.6 × 10⁻³     6        7.9 × 10⁻³
dppA      15        3.0 × 10⁻³     9        1.2 × 10⁻²
tnaC      14        5.3 × 10⁻³     26       3.2 × 10⁻³
malT      14        6.1 × 10⁻⁴     4        1.4 × 10⁻³
tnaB      12        5.9 × 10⁻³     90       3.4 × 10⁻³
aldA      12        2.0 × 10⁻³     42       6.4 × 10⁻⁴
yniA      11        3.2 × 10⁻²     5        2.4 × 10⁻²
yejG      11        1.9 × 10⁻²     2        3.3 × 10⁻²
fucR      11        5.2 × 10⁻³     4        2.4 × 10⁻²
gntP      11        1.2 × 10⁻²     4        1.7 × 10⁻²
yeaT      11        5.8 × 10⁻³     3        1.3 × 10⁻²
yfiQ      10        1.0 × 10⁻³     2        6.4 × 10⁻³
malK      10        4.0 × 10⁻³     49       3.0 × 10⁻⁴
uspF      10        5.0 × 10⁻³     4        5.5 × 10⁻³
araC      10        2.6 × 10⁻³     5        1.7 × 10⁻²
cdaR      10        3.6 × 10⁻³     4        1.2 × 10⁻²

(Data collected by R. Balakrishnan, analyzed by K. Oman, from [54])

(a) ARD was calculated as the percentage of ribosome-footprint reads divided by the percentage of total-mRNA reads for each gene (protein-coding region) of the genome. Shown are quotients representing fold decrease in ARD due to loss of LepA. WT/M, wild-type versus mutant ∆lepA strain; C/M, complemented ∆lepA(pLEPA) versus mutant ∆lepA strain.

(b) Statistical significance was assessed using the Benjamini-Hochberg corrected Student’s t-test method, comparing differences in log2(ARD) values. In all cases shown, q < 0.05, indicating statistical significance at the > 95% confidence level.

4.3 Results

4.3.1 Without LepA, many mRNA coding regions exhibit reduced ARD

We find that the loss of LepA substantially reduced the ARD of many coding regions (see Supplementary Tables S1–S3 from our paper [54]). Out of the most highly expressed half of genes analyzed (1872 genes), we found 520 had significantly altered ARD as a result of loss of LepA (both WT versus M and C versus M comparisons yield Benjamini-Hochberg corrected q-values of ≤ 0.05 [116]). Twenty-six genes show 10-fold or greater reduced ribosome densities in the mutant strain (Table 4.2), but only one gene (flu, encoding Antigen 43 autotransporter) showed similar levels of increase in ARD. Figure 4.3 shows ribosome profiling data for several genes with substantially reduced ARD in the absence of LepA. In almost all such cases, we see a decrease in protein production, as seen with a decrease in ribosome-protected fragments and an increase in mRNA levels. We believe the increased mRNA levels are a transcriptional response reacting to the translational defect caused by the loss of LepA. The global perturbation of gene expression (transcription and translation) caused by deletion of LepA can be seen in Figure 4.4, where we compare ribosome footprints to total RNA levels per gene. These show spread coefficients of 0.62 ± 0.02, 1.12 ± 0.02 and 0.88 ± 0.03, in the WT, M and C strains, respectively.

Figure 4.3: Examples of genes that exhibit decreased ARD in the absence of LepA. Total-RNA and ribosome-footprint read counts for WT, M, and C strains are shown mapped back to the genome in the vicinity of ychH (A), raiA (B), lldP (C), dctA (D) and aldA (E). Ribosome-footprint reads are mapped to the genomic position corresponding to the predicted central nucleotide of the P codon, and total-RNA reads are mapped to the center of the read fragment. Read counts are normalized with respect to total number of reads (after quality control and rRNA read removal), making the histograms of analogous data tracks directly comparable in each panel. (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54])

Figure 4.4: Gene expression is globally perturbed in the absence of LepA. Normalized gene-by-gene coverages are compared between ribosome footprints and total RNA in WT, M and C strains (as indicated). A measure of spread, $d^2$, was calculated for each of the comparisons, yielding 0.62 ± 0.02, 1.12 ± 0.02 and 0.88 ± 0.03 for the WT, M and C samples, respectively. Student’s t-tests on the values of $d^2$ in all three replicates yield P = 1.6 × 10⁻⁵ for the WT versus M, and P = 1.2 × 10⁻³ for the M versus C comparison, showing significant perturbation of translation efficiencies (i.e. changes in ARD values) due to loss of LepA. (Data collected by R. Balakrishnan, correlated by K. Oman, from [54]) [Plot: three log-log panels (Wild-type, Mutant, Complemented) of Ribosome Footprints versus total RNA.]

The observation that many mRNAs are overexpressed in the mutant with corresponding underproduction of proteins caused us to investigate the total amount of mRNA in the 3 strains. Our cDNA library construction protocol (see [54]) did not include the rRNA removal step that is common in other studies, and so we reasoned that read counts could give us an idea of the levels of the different classes of RNA molecules in a cell. We observed that the LepA mutation, ∆lepA, caused an increase of 1.2 fold in the fraction of coding mRNAs (Table 4.3). This effect is reversed with the introduction of plasmid pLEPA, which contains lepA. Proportionally, there is a much smaller increase in rRNA, indicating a higher concentration of mRNAs with respect to ribosomes in the mutant, further supporting the earlier observation of lower ribosome densities for the mutant. In general, we believe the actual ARD values for the mutant are smaller across the board (by ∼16%) than reported in Supplementary Table S3 (in [54]). The overall level of small RNAs in the cell also increased, albeit only slightly. The level of tRNA in the mutant strain was decreased (by ∼15%), although this effect was not complementable, and so cannot be ascribed to the loss of LepA.

Table 4.3: Relative abundance of various types of RNA in cells containing and lacking LepA

Strain           mRNA (a)        tRNA             rRNA             sRNA (b)
Wild-type        4.43 ± 0.02**   14.13 ± 0.05*    65.27 ± 0.10**   5.68 ± 0.02**
∆lepA            5.32 ± 0.10     11.97 ± 0.41     67.60 ± 0.45     5.87 ± 0.04
∆lepA (pLEPA)    3.64 ± 0.39*    11.28 ± 0.79     70.99 ± 1.40*    5.53 ± 0.02**

Data represent percentages of all genome-aligned reads, mean ± SD (n = 3). Asterisks denote significant differences from the ∆lepA case (**P ≤ 0.01; *P ≤ 0.05), based on Student’s t-test. (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54])

(a) Protein coding regions.

(b) Includes 6S RNA, RNase P RNA, 4.5S RNA, tmRNA and annotated regulatory small RNAs.
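The fold changes quoted above can be checked against the mean values in Table 4.3 with a few lines of arithmetic. The sketch below is one plausible reading of how the ∼16% figure arises (the mRNA fold change divided by the rRNA fold change, taking the rRNA fraction as a proxy for ribosome content); that interpretation and the variable names are assumptions of this example.

```python
# Mean percentages of genome-aligned reads, taken from Table 4.3.
wild_type = {"mRNA": 4.43, "tRNA": 14.13, "rRNA": 65.27}
mutant = {"mRNA": 5.32, "tRNA": 11.97, "rRNA": 67.60}

# Fold change of the coding mRNA fraction in the mutant (about 1.2).
mrna_fold = mutant["mRNA"] / wild_type["mRNA"]

# Fold change of the rRNA fraction, a much smaller increase (about 1.04).
rrna_fold = mutant["rRNA"] / wild_type["rRNA"]

# mRNA per ribosome in the mutant relative to wild type (about 1.16).
mrna_per_ribosome = mrna_fold / rrna_fold

# Fractional change in tRNA level in the mutant (about -0.15).
trna_change = mutant["tRNA"] / wild_type["tRNA"] - 1
```

Only the means are used here, so the sketch says nothing about the significance of these differences, which is reported in the table itself.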

4.3.2 LepA’s effect on ARD is related to the sequence of the TIR

Reduced ARD could in principle be a result of faster elongation or slower initiation rates. However, the ≥ 10-fold changes of ARD seem unlikely to be an elongation effect, especially since it has been shown that elongation rates remain relatively constant across all genes [155; 156]. Thus, the most direct explanation of reduced ARD is that LepA influences the rates of translation initiation in a significant manner for many coding regions. If this is indeed the case, it would seem plausible that there exists some kind of a relationship between LepA’s effect on ARD and the sequence or structure of the TIR. To investigate this, we aligned the start codons of LepA-affected genes and determined the frequencies of nucleotides in this region.

Interestingly, we found pyrimidines are significantly underrepresented in the Shine-Dalgarno (SD) region for genes whose ARD depends on LepA (WT, C > M; Figure 4.5), and the opposite is seen for genes that have enhanced ARD in the absence of LepA (WT, C < M). The pyrimidine frequency is a more sensitive indicator, since the SD sequence is purine rich. These data suggest that, without LepA, mRNAs with stronger SD sequences have a more problematic translation initiation than mRNAs with weaker SD sequences (based on the corresponding changes in ARD). Importantly, this provides further evidence of LepA’s role in translation initiation, either directly or indirectly.

Figure 4.5: Effects of LepA on ARD are related to the TIR sequence. Nt frequencies at each position of the TIR were determined for the subset of genes with decreased ARD in the absence of LepA (WT, C > M; 237 genes), the subset of genes with increased ARD in absence of LepA (WT, C < M; 283 genes) and all genes analyzed (1870). Purine and pyrimidine frequencies for each subset, relative to those of the complete set, are plotted as a function of TIR position (as indicated; position zero corresponds to the first nt of the start codon). Binomial tests indicate that pyrimidines are significantly underrepresented in the former subset (WT, C > M) at positions -11 (P = 7.4 × 10⁻⁴), -10 (P = 2.9 × 10⁻³) and -9 (P = 1.6 × 10⁻³) and significantly overrepresented in the latter subset (WT, C < M) at positions -11 (P = 3.3 × 10⁻²), -10 (P = 3.7 × 10⁻⁴), -9 (P = 2.5 × 10⁻²) and -8 (P = 2.6 × 10⁻²), data points marked with asterisks. The differences seen at the first position of the start codon (position zero) are deemed less than statistically significant (only 40 of the 1870 analyzed genes begin with a pyrimidine). Genes infC and pcnB have the rare start codon AUU and hence were omitted from the analysis. The second and third nts of the start codon (UG) are otherwise invariant and assigned the value of 1.0. (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54]) [Plot: Enrichment vs Background as a function of Nucleotides from Start; series WT > M Purine, WT > M Pyrimidine, WT < M Purine, WT < M Pyrimidine.]

Figure 4.6: Metagene analysis reveals generally reduced ribosome density at the 5′ end of coding regions for both the mutant and complemented strains. Ribosome density values were calculated for each gene position and then averaged across all high-coverage genes. Shown are plots of metagene-averaged ribosome density as a function of gene position (codons 1-33) for the WT, M, and C strains (as indicated). (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54])

4.3.3 LepA’s effect on ribosome distribution along mRNAs

Ribosome occupancy can vary locally along an mRNA due to a variety of factors, including, for example, ribosome pausing [52]. To investigate LepA’s potential role in translation elongation, we first performed a metagene analysis, which gives a general overview of the effect of ∆lepA on ribosome distribution along mRNAs. Plots showing average ribosome distribution along the 1872 coding regions, aligned to the start codons, showed reduced ribosomal occupancy in the first ∼20 codons (Figure 4.6). However, this trend was also exhibited in the complemented strain, and so cannot be attributed to a LepA effect.

We next searched for complementable changes in local ribosome occupancy along mRNAs on a gene-by-gene basis, first visually by scanning the ribosome profiling tracks on the genome browser, and then computationally. Visual inspection of ribosome profiling tracks showed minimal differences with ∆lepA, with the most notable being potential pause sites in ftsW and dppA. While by eye these effects seemed complementable, subsequent computational analysis across genes utilizing 10 nt windows failed to show significant differences in the mutant vs complemented comparison.

Finally, we computationally scanned the 1872 genes utilizing 10 nt windows, searching for significant differences in local ribosome occupancy (after normalizing each gene with respect to its total footprint reads coverage). In the WT vs M comparison, 3185 windows were found to be significantly different, while in the C vs M comparison, only 187 windows exhibited the same significant difference. This is comparable to the results obtained in the metagene analysis, where we discovered significant effects upon the deletion of lepA, but that were not complementable (Figure 4.6).

Constraining the mutant strain to be significantly different compared to either other strain in a complementable manner (WT and C > M, or WT and C < M), we found 43 windows with significantly higher ribosome occupancy in the mutant strain, with only 13 windows showing significantly lower ribosome occupancy. Upon combining consecutive windows, the 43 windows were reduced to 36 regions, in which we identified 25 pause sites that are highly position specific (generally involving a single ribosome footprint), and are clearly a result of a loss of LepA. Intriguingly, ribosome density was not reduced after these pause sites, implying that the overall rate of protein synthesis is not affected.

We then aligned the respective sequences to the pause sites, and found that a glycine codon (GGU or GGC) was frequently in the A-site of the paused ribosome (Figure 4.7), or was one codon away, possibly due to slightly incorrect codon placement (as a result of variability in footprint length). A two-tailed binomial test showed that the occurrence of the GGUs at the A-site is much higher than expected by chance (P = 7.4 × 10⁻¹³), although the same is not true of GGC. This shows that LepA either prevents or reduces ribosome pausing at certain GGU codons in the cell.

GGU and GGC are only read by the Gly-tRNAGly3 in E. coli [157], so our collaborators tested for Gly-tRNAGly3 levels in the three strains. They found no difference in the tRNAGly3 levels in the mutant compared with the other two strains (Supplementary Figure S9 in [54]), showing that the ribosomal pausing at GGU codons is not an indirect consequence of reduced tRNAGly3 concentration.

Figure 4.7: LepA prevents ribosomal pausing at certain GGU codons. Aligned are the coding sequences corresponding to the predicted paused ribosomes seen specifically in the mutant strain. The left column identifies the gene and the codon number (of the P codon of the paused complex). Codon GGU (red) is significantly overrepresented as the A codon (P = 7.4 × 10⁻¹³), based on a two-tailed binomial test. GGC (blue), the other codon recognized by Gly-tRNAGly3, is seen to occupy the A site in two cases, which is deemed less-than-significant enrichment. (Data collected by R. Balakrishnan, analyzed by K. Oman, from [54])

4.4 Discussion

4.4.1 Loss of LepA mainly affects translation initiation

There are a number of observations, particularly from Weissman’s group, that suggest that ARD is predominantly influenced by initiation. First, the use of inhibitors harringtonine and cycloheximide allowed elongation rates to be measured globally in eukaryotic cells, and showed similarity in elongation rates between genes, regardless of functional class or codon usage [155]. Second, the ribosome density at the 5′ and 3′ ends of genes in bacteria was found to be nearly identical [52; 156], which provides no support to the idea that ribosome pausing may impact the overall rate of protein synthesis. Finally, the ribosome footprint coverage per gene length was found to be proportional to protein synthesis rates in bacterial and eukaryotic cells [156]. This can only occur if average elongation rates across genes remain relatively constant.

In our study, we found that, across more than 500 genes, ARD was altered upon the loss of LepA. This effect was substantial in many cases, with ARD changes ≥ 5-fold for dozens of genes. We also find evidence for a LepA effect on translation elongation where LepA is found to prevent ribosomal pausing at certain GGU codons. However, these effects are relatively few in number, highly localized (with no impact on RD up or downstream of these sites), and occur in different genes than those that exhibited the greatest ARD changes. In addition, LepA was shown to not impact average elongation rates as measured in vivo in the synthesis of two large polypeptides by our collaborators [54]. Together, these observations lead us to conclude that, either directly or indirectly, LepA primarily influences translation initiation as opposed to elongation. Consistent with this interpretation, we observe a relationship between LepA’s effect on ARD and the sequence of the TIR. For genes with significantly lower ARD in the mutant (which was at least partially recovered in the complement), we find that pyrimidines are significantly underrepresented in the SD region, with the opposite trend holding for genes with a higher ARD upon the deletion of LepA.

Although the exact effect of LepA on translation is gene specific, LepA clearly enhances translation efficiency of many genes, as reflected in their higher ARD when LepA is present. For the mutant strain, genes with ≥ 10-fold lower ARD outnumbered genes with ≥ 10-fold higher ARD by 26 to 1. In addition, total polysome levels are reduced by one half, and the overall amount of mRNA per ribosome in the cell is increased by ∼16%, which implies that the actual ARD levels in the mutant are lower than reported, leading to a greater number of LepA dependent genes. These findings together imply that LepA in general promotes translation initiation.

What is the mechanism through which LepA promotes translation initiation? Continuing research shows that, in all cells, ribosome biogenesis is monitored by quality control mechanisms. For example, in eukaryotes, late-stage maturation of the 40S subunit incorporates a “test drive”: a translation-like cycle of initiation-factor-dependent 80S formation, followed by release-factor-dependent subunit splitting [158; 159]. There are similar mechanisms in bacteria, where later steps are involved with respect to the 70S ribosome [160; 161]. Immature subunits that escape these checkpoints have clear defects in initiation [162]. We propose that LepA plays a role in late-stage ribosome biogenesis, and without LepA, subunits formed are functionally compromised in initiation at certain TIRs (notably, with respect to transcripts in Table 4.2).

In support of this hypothesis, one of the genes identified in the synthetic lethal/sick screen done by our collaborators (Table 4.1) is rsgA, which is a GTPase involved in late-stage ribosome maturation [163]. RsgA (also called YjeQ) has been shown to catalyze the release of RbfA from the 30S subunit after 17S-to-16S rRNA processing [164]. LepA may perform a similar function in late-stage ribosome assembly, explaining the synthetic phenotype we observed. Loss of LepA increases the proportion of free subunits in a cell (Figure 3 from [54]), and increases the cell’s sensitivity to cold temperatures [145], similar to the loss of several other known assembly factors [160]. In addition, several of the known assembly factors are GTPases [165; 166], and Efl1 (an EF-G/EF-2 paralog like LepA) catalyzes the release of tif6 during 60S subunit maturation in yeast [167; 168], providing sufficient precedent for this hypothesis.

Of course, it is also possible that LepA participates directly in the initiation process. LepA may, for example, catalyze a direct conformational change in the ribosome that increases the dynamics of the Shine-Dalgarno-anti-Shine-Dalgarno (SD-ASD) interactions. This could facilitate a ribosome’s engagement to and/or clearance from the TIR. Further experiments will be needed to test these and other possibilities to further determine LepA’s role in translation initiation.

4.4.2 LepA’s translation elongation effects are codon specific and are comparatively minor

It is widely believed that LepA directly influences translation elongation [169]. This is based primarily on LepA’s structural similarity to EF-G, and LepA’s in vitro influence on elongation. Our ribosome profiling data suggest that LepA does indeed prevent pausing at certain GGU codons, consistent with it playing a role in elongation. However, we emphasize that these observed elongation effects are codon specific, and comparatively subtle. For example, the stalling sites seen in the mutant strain do not affect ribosome densities either upstream or downstream of the stall sites, suggesting that the overall rate of protein synthesis remains unchanged. In contrast, the effects observed in ARD were widespread, and in many cases, large, indicating a global and substantial influence of LepA on gene expression. Thus, we propose that LepA primarily influences translation initiation, with a secondary influence on elongation.

This more secondary role for LepA on elongation is supported by various in vivo data. Specifically, (i) LepA has no influence on decoding fidelity or frameshifting frequency regardless of various contexts or conditions [170], (ii) LepA does not influence the average elongation rate (Supplementary Figure S10 in [54]), and (iii) LepA fails to inhibit tmRNA-dependent peptide tagging or A-codon cleavage, unless LepA is overexpressed [170].

The cause of the observed ribosome pausing at GGU codons remains unclear, as well as how LepA prevents said pausing. Gly-tRNAGly3 levels are as high in the M as they are in the WT or C strains (Supplementary Figure S9 in [54]), which argues against the idea that the pausing is due to elongating ribosomes having to wait for lower concentration cognate tRNAs. Both GGU and GGC are recognized by tRNAGly3, but only GGU is enriched in the A-site of paused ribosomes. This implies that the pausing is codon specific, which could be a result of slower decoding of GGU, or inhibited translocation of GGU’s tRNA, tRNAGly3. How LepA interacts with this process is unknown.

4.4.3 Perturbations in gene expression likely explain the observed synthetic phenotypes of ∆lepA

Many of the genes shown to co-interact with LepA are involved in cell respiration and transport (Table 4.1). It is not obvious why cells lacking these genes are particularly more sensitive to the deletion of LepA. We believe this is an indirect consequence of altered gene expression upon LepA deletion, and is supported by the fact that dksA was among the genes identified by the lethal/sick screen. The DksA protein binds to RNA polymerase, and together with the alarmone (p)ppGpp, regulates the transcription of many genes [171]. We believe that with the single deletion, ∆lepA, the transcription regulatory network largely compensates, masking the LepA deletion effect. In the absence of DksA, the regulatory network is compromised, revealing the ∆lepA phenotype. Notably, a growth defect previously attributed to the loss of LepA is a lengthened lag phase [145; 170; 172], consistent with what one would expect from cells where translation initiation is compromised and the gene expression network affected.

4.5 Conclusions

We have found that, in E. coli, LepA predominantly acts in the translation initiation phase of protein synthesis, although the exact mechanism of TIR selection by which LepA acts, whether directly or indirectly, remains an open question. We hypothesize LepA’s role in initiation is indirect, acting primarily in the assembly and maturation of ribosomes, for two main reasons. First, we observed that LepA deletion confers a synthetic phenotype in the absence of RsgA, a GTPase which catalyzes the release of RbfA during late-stage ribosome maturation [163; 164]. This observation can be explained if LepA were to perform a similar, and partially redundant function. Second, LepA is universally conserved in bacteria and bacterial-derived organelles. These have in common the problem of assembling structurally-related ribosomes, but they developed different mechanisms of TIR selection. In many bacteria (Bacteroidetes and certain Cyanobacteria, for example) and mitochondria, initiation does not utilize an SD-ASD interaction [173–175]. We believe that LepA plays a conserved role in ribosome assembly, and subunits formed in its absence contain defects, with variable consequences, depending on the particular organism/organelle [176; 177].

96 Bibliography

[1] Dahm, R. and Miescher, F. “Discovering DNA: Friedrich Miescher and the early years of nucleic acid research”. Hum. Genet., 122(6):565–581 (2008). [DOI:10.1007/s00439- 007-0433-0] [PubMed:17901982].

[2] Darwin, C. On the Origin of Species by Means of Natural Selection, Or, The Preser- vation of Favoured Races in the Struggle for Life. J. Murray, London (1859). URL http://books.google.com/books?id=jTZbAAAAQAAJ.

[3] Mendel, Gregor. “Versuche ¨uber Pflanzen-Hybriden”. Verhandlungen des natur- forschenden Vereines in Br¨unn, 42:3–47 (1866).

[4] Morgan, T. H. “SEX LIMITED INHERITANCE IN DROSOPHILA”. Science, 32(812):120–122 (1910). [DOI:10.1126/science.32.812.120] [PubMed:17759620].

[5] Avery, O. T., Macleod, C. M., and McCarty, M. “STUDIES ON THE CHEM- ICAL NATURE OF THE SUBSTANCE INDUCING TRANSFORMATION OF PNEUMOCOCCAL TYPES : INDUCTION OF TRANSFORMATION BY A DES- OXYRIBONUCLEIC ACID FRACTION ISOLATED FROM PNEUMOCOCCUS TYPE III”. J. Exp. Med., 79(2):137–158 (1944). [PubMed Central:PMC2135445] [PubMed:19871359].

[6] HERSHEY, A. D. and CHASE, M. “Independent functions of viral protein and nucleic acid in growth of bacteriophage”. J. Gen. Physiol., 36(1):39–56 (1952). [PubMed Central:PMC2147348] [PubMed:12981234].

[7] Commons, Wikimedia and Zephyris. “DNA Structure, with key, labelled, and no backbone.” (2011). URL http://en.wikipedia.org/wiki/File:DNA_Structure%2BKey%2BLabelled.pn_NoBB.png, [File:787px-DNA Structure+Key+Labelled.pn NoBB.png].

[8] Commons, Wikimedia and Ball, M. P. “DNA Chemical Structure” (2011). URL http://en.wikipedia.org/wiki/File:DNA_chemical_structure.svg, [File:658px-DNA chemical structure.svg.png].

[9] WATSON, J. D. and CRICK, F. H. “Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid”. Nature, 171(4356):737–738 (1953). [PubMed:13054692].

[10] WATSON, J. D. and CRICK, F. H. “The structure of DNA”. Cold Spring Harb. Symp. Quant. Biol., 18:123–131 (1953). [DOI:10.1101/SQB.1953.018.01.020] [PubMed:13168976].

[11] Meselson, M. and Stahl, F. W. “THE REPLICATION OF DNA IN ESCHERICHIA COLI”. Proc. Natl. Acad. Sci. U.S.A., 44(7):671–682 (1958). [PubMed Central:PMC528642] [PubMed:16590258].

[12] Commons, Wikimedia and Boumphreyfr. “Peptide synthesis” (2009). URL http://en.wikipedia.org/wiki/File:Peptide_syn.png, [File:Peptide syn.png].

[13] Commons, Wikimedia and NIH. “Genetic Code Chart”. URL http://en.wikibooks.org/wiki/File:06_chart_pu3.png, [File:06 chart pu3.png].

[14] CRICK, F. H. “On protein synthesis”. Symp. Soc. Exp. Biol., 12:138–163 (1958). [PubMed:13580867].

[15] Crick, F. “Central dogma of molecular biology”. Nature, 227(5258):561–563 (1970). [PubMed:4913914].

[16] Hurwitz, J. “The discovery of RNA polymerase”. J. Biol. Chem., 280(52):42477–42485 (2005). [DOI:10.1074/jbc.X500006200] [PubMed:16230341].

[17] NIRENBERG, M. W. and MATTHAEI, J. H. “The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides”. Proc. Natl. Acad. Sci. U.S.A., 47:1588–1602 (1961). [PubMed Central:PMC223178] [PubMed:14479932].

[18] Gamow, G. “Possible Relation between Deoxyribonucleic Acid and Protein Structures”. Nature, 173(4398):318 (1954). [DOI:10.1038/173318a0].

[19] GARDNER, R. S., et al. “Synthetic polynucleotides and the amino acid code. VII”. Proc. Natl. Acad. Sci. U.S.A., 48:2087–2094 (1962). [PubMed Central:PMC221128] [PubMed:13946552].

[20] WAHBA, A. J., et al. “Synthetic polynucleotides and the amino acid code. VIII”. Proc. Natl. Acad. Sci. U.S.A., 49:116–122 (1963). [PubMed Central:PMC300638] [PubMed:13998282].

[21] AB, Nobel Media. “The Nobel Prize in Physiology or Medicine” (1968). URL http://www.nobelprize.org/nobel_prizes/medicine/laureates/1968/, [Web:The Nobel Prize in Physiology or Medicine 1968].

[22] Gilbert, W. and Maxam, A. “The nucleotide sequence of the lac operator”. Proc. Natl. Acad. Sci. U.S.A., 70(12):3581–3584 (1973). [PubMed Central:PMC427284] [PubMed:4587255].

[23] Sanger, F., Nicklen, S., and Coulson, A. R. “DNA sequencing with chain-terminating inhibitors”. Proc. Natl. Acad. Sci. U.S.A., 74(12):5463–5467 (1977). [PubMed Central:PMC431765] [PubMed:271968].

[24] Adams, J. “DNA sequencing technologies”. Nature Education, 1(1):193 (2008). [URL:http://www.nature.com/scitable/topicpage/dna-sequencing-technologies-690].

[25] Shendure, J. and Ji, H. “Next-generation DNA sequencing”. Nat. Biotechnol., 26(10):1135–1145 (2008). [DOI:10.1038/nbt1486] [PubMed:18846087].

[26] Quail, M. A., et al. “A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers”. BMC Genomics, 13:341 (2012). [PubMed Central:PMC3431227] [DOI:10.1186/1471-2164-13-341] [PubMed:22827831].

[27] Mardis, E. R. “Next-generation sequencing platforms”. Annu Rev Anal Chem (Palo Alto Calif), 6:287–303 (2013). [DOI:10.1146/annurev-anchem-062012-092628] [PubMed:23560931].

[28] Zhu, P. and Craighead, H. G. “Zero-mode waveguides for single-molecule analysis”. Annu Rev Biophys, 41:269–293 (2012). [DOI:10.1146/annurev-biophys-050511-102338] [PubMed:22577821].

[29] Glenn, T. C. “Field guide to next-generation DNA sequencers”. Mol Ecol Resour, 11(5):759–769 (2011). [DOI:10.1111/j.1755-0998.2011.03024.x] [PubMed:21592312].

[30] van Dijk, E. L., Auger, H., Jaszczyszyn, Y., and Thermes, C. “Ten years of next-generation sequencing technology”. Trends Genet., 30(9):418–426 (2014). [DOI:10.1016/j.tig.2014.07.001] [PubMed:25108476].

[31] Abecasis, G. R., et al. “A map of human genome variation from population-scale sequencing”. Nature, 467(7319):1061–1073 (2010). [PubMed Central:PMC3042601] [DOI:10.1038/nature09534] [PubMed:20981092].

[32] Haussler, D., et al. “Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species”. J. Hered., 100(6):659–674 (2009). [PubMed Central:PMC2877544] [DOI:10.1093/jhered/esp086] [PubMed:19892720].

[33] Kilpinen, H. and Barrett, J. C. “How next-generation sequencing is transforming complex disease genetics”. Trends Genet., 29(1):23–30 (2013). [DOI:10.1016/j.tig.2012.10.001] [PubMed:23103023].

[34] Weber-Lehmann, J., et al. “Finding the needle in the haystack: differentiating ‘identical’ twins in paternity testing and forensics by ultra-deep next generation sequencing”. Forensic Sci Int Genet, 9:42–46 (2014). [DOI:10.1016/j.fsigen.2013.10.015] [PubMed:24528578].

[35] Kingsmore, S. F. and Saunders, C. J. “Deep sequencing of patient genomes for disease diagnosis: when will it become routine?” Sci Transl Med, 3(87):87ps23 (2011). [PubMed Central:PMC4264992] [DOI:10.1126/scitranslmed.3002695] [PubMed:21677196].

[36] Saunders, C. J., et al. “Rapid whole-genome sequencing for genetic disease diagnosis in neonatal intensive care units”. Sci Transl Med, 4(154):154ra135 (2012). [PubMed Central:PMC4283791] [DOI:10.1126/scitranslmed.3004041] [PubMed:23035047].

[37] Hodges, E., et al. “Genome-wide in situ exon capture for selective resequencing”. Nat. Genet., 39(12):1522–1527 (2007). [DOI:10.1038/ng.2007.42] [PubMed:17982454].

[38] Choi, M., et al. “Genetic diagnosis by whole exome capture and massively parallel DNA sequencing”. Proc. Natl. Acad. Sci. U.S.A., 106(45):19096–19101 (2009). [PubMed Central:PMC2768590] [DOI:10.1073/pnas.0910672106] [PubMed:19861545].

[39] Rehm, H. L. “Disease-targeted sequencing: a cornerstone in the clinic”. Nat. Rev. Genet., 14(4):295–300 (2013). [PubMed Central:PMC3786217] [DOI:10.1038/nrg3463] [PubMed:23478348].

[40] Faust, K. and Raes, J. “Microbial interactions: from networks to models”. Nat. Rev. Microbiol., 10(8):538–550 (2012). [DOI:10.1038/nrmicro2832] [PubMed:22796884].

[41] Jensen, T. H., Jacquier, A., and Libri, D. “Dealing with pervasive transcription”. Mol. Cell, 52(4):473–484 (2013). [DOI:10.1016/j.molcel.2013.10.032] [PubMed:24267449].

[42] Wang, Z., Gerstein, M., and Snyder, M. “RNA-Seq: a revolutionary tool for transcriptomics”. Nat. Rev. Genet., 10(1):57–63 (2009). [PubMed Central:PMC2949280] [DOI:10.1038/nrg2484] [PubMed:19015660].

[43] van Dijk, E. L., et al. “XUTs are a class of Xrn1-sensitive antisense regulatory non-coding RNA in yeast”. Nature, 475(7354):114–117 (2011). [DOI:10.1038/nature10118] [PubMed:21697827].

[44] Mills, J. D., Kawahara, Y., and Janitz, M. “Strand-Specific RNA-Seq Provides Greater Resolution of Transcriptome Profiling”. Curr. Genomics, 14(3):173–181 (2013). [PubMed Central:PMC3664467] [DOI:10.2174/1389202911314030003] [PubMed:24179440].

[45] Siegel, T. N., et al. “Strand-specific RNA-Seq reveals widespread and developmentally regulated transcription of natural antisense transcripts in Plasmodium falciparum”. BMC Genomics, 15:150 (2014). [PubMed Central:PMC4007998] [DOI:10.1186/1471-2164-15-150] [PubMed:24559473].

[46] Mercer, T. R., et al. “Targeted RNA sequencing reveals the deep complexity of the human transcriptome”. Nat. Biotechnol., 30(1):99–104 (2012). [PubMed Central:PMC3710462] [DOI:10.1038/nbt.2024] [PubMed:22081020].

[47] Blomquist, T. M., et al. “Targeted RNA-sequencing with competitive multiplex-PCR amplicon libraries”. PLoS ONE, 8(11):e79120 (2013). [PubMed Central:PMC3827295] [DOI:10.1371/journal.pone.0079120] [PubMed:24236095].

[48] Ingolia, N. T., Brar, G. A., Rouskin, S., McGeachy, A. M., and Weissman, J. S. “The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments”. Nat Protoc, 7(8):1534–1550 (2012). [PubMed Central:PMC3535016] [DOI:10.1038/nprot.2012.086] [PubMed:22836135].

[49] Ingolia, N. T., Ghaemmaghami, S., Newman, J. R., and Weissman, J. S. “Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling”. Science, 324(5924):218–223 (2009). [PubMed Central:PMC2746483] [DOI:10.1126/science.1168978] [PubMed:19213877].

[50] Guo, H., Ingolia, N. T., Weissman, J. S., and Bartel, D. P. “Mammalian microRNAs predominantly act to decrease target mRNA levels”. Nature, 466(7308):835–840 (2010). [PubMed Central:PMC2990499] [DOI:10.1038/nature09267] [PubMed:20703300].

[51] Brar, G. A., et al. “High-resolution view of the yeast meiotic program revealed by ribosome profiling”. Science, 335(6068):552–557 (2012). [PubMed Central:PMC3414261] [DOI:10.1126/science.1215110] [PubMed:22194413].

[52] Li, G. W., Oh, E., and Weissman, J. S. “The anti-Shine-Dalgarno sequence drives translational pausing and codon choice in bacteria”. Nature, 484(7395):538–541 (2012). [PubMed Central:PMC3338875] [DOI:10.1038/nature10965] [PubMed:22456704].

[53] Stadler, M. and Fire, A. “Wobble base-pairing slows in vivo translation elongation in metazoans”. RNA, 17(12):2063–2073 (2011). [PubMed Central:PMC3222120] [DOI:10.1261/rna.02890211] [PubMed:22045228].

[54] Balakrishnan, R., Oman, K., Shoji, S., Bundschuh, R., and Fredrick, K. “The conserved GTPase LepA contributes mainly to translation initiation in Escherichia coli”. Nucleic Acids Res., 42(21):13370–13383 (2014). [PubMed Central:PMC4245954] [DOI:10.1093/nar/gku1098] [PubMed:25378333].

[55] Johnson, D. S., Mortazavi, A., Myers, R. M., and Wold, B. “Genome-wide mapping of in vivo protein-DNA interactions”. Science, 316(5830):1497–1502 (2007). [DOI:10.1126/science.1141319] [PubMed:17540862].

[56] Sanford, J. R., et al. “Splicing factor SFRS1 recognizes a functionally diverse landscape of RNA transcripts”. Genome Res., 19(3):381–394 (2009). [PubMed Central:PMC2661799] [DOI:10.1101/gr.082503.108] [PubMed:19116412].

[57] Konig, J., et al. “iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution”. Nat. Struct. Mol. Biol., 17(7):909–915 (2010). [PubMed Central:PMC3000544] [DOI:10.1038/nsmb.1838] [PubMed:20601959].

[58] Simon, M. D., et al. “The genomic binding sites of a noncoding RNA”. Proc. Natl. Acad. Sci. U.S.A., 108(51):20497–20502 (2011). [PubMed Central:PMC3251105] [DOI:10.1073/pnas.1113536108] [PubMed:22143764].

[59] Chu, C., Qu, K., Zhong, F. L., Artandi, S. E., and Chang, H. Y. “Genomic maps of long noncoding RNA occupancy reveal principles of RNA-chromatin interactions”. Mol. Cell, 44(4):667–678 (2011). [PubMed Central:PMC3249421] [DOI:10.1016/j.molcel.2011.08.027] [PubMed:21963238].

[60] Dekker, J., Marti-Renom, M. A., and Mirny, L. A. “Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data”. Nat. Rev. Genet., 14(6):390–403 (2013). [PubMed Central:PMC3874835] [DOI:10.1038/nrg3454] [PubMed:23657480].

[61] Duan, Z., et al. “A three-dimensional model of the yeast genome”. Nature, 465(7296):363–367 (2010). [PubMed Central:PMC2874121] [DOI:10.1038/nature08973] [PubMed:20436457].

[62] Liang, S., et al. “Analysis of epigenetic modifications by next generation sequencing”. Conf Proc IEEE Eng Med Biol Soc, 2009:6730 (2009). [DOI:10.1109/IEMBS.2009.5332853] [PubMed:19963934].

[63] Meaburn, E. and Schulz, R. “Next generation sequencing in epigenetics: insights and challenges”. Semin. Cell Dev. Biol., 23(2):192–199 (2012). [DOI:10.1016/j.semcdb.2011.10.010] [PubMed:22027613].

[64] Dolled-Filhart, M. P., Lee, M., Ou-Yang, C. W., Haraksingh, R. R., and Lin, J. C. “Computational and bioinformatics frameworks for next-generation whole exome and genome sequencing”. ScientificWorldJournal, 2013:730210 (2013). [PubMed Central:PMC3556895] [DOI:10.1155/2013/730210] [PubMed:23365548].

[65] Rodríguez-Ezpeleta, N. and Aransay, A. M. “Introduction”. In N. Rodríguez-Ezpeleta, M. Hackenberg, and A. M. Aransay (eds.), “Bioinformatics for High Throughput Sequencing”, pp. 1–9. Springer New York (2012). ISBN 978-1-4614-0781-2. URL http://dx.doi.org/10.1007/978-1-4614-0782-9_1.

[66] Reed, K., Poulin, M. L., Yan, L., and Parissenti, A. M. “Comparison of bisulfite sequencing PCR with pyrosequencing for measuring differences in DNA methylation”. Anal. Biochem., 397(1):96–106 (2010). [DOI:10.1016/j.ab.2009.10.021] [PubMed:19835834].

[67] Stevens, M., et al. “Estimating absolute methylation levels at single-CpG resolution from methylation enrichment and restriction enzyme sequencing methods”. Genome Res., 23(9):1541–1553 (2013). [PubMed Central:PMC3759729] [DOI:10.1101/gr.152231.112] [PubMed:23804401].

[68] Riebler, A., et al. “BayMeth: improved DNA methylation quantification for affinity capture sequencing data using a flexible Bayesian approach”. Genome Biol., 15(2):R35 (2014). [PubMed Central:PMC4053803] [DOI:10.1186/gb-2014-15-2-r35] [PubMed:24517713].

[69] Frankhouser, D. E., et al. “PrEMeR-CG: inferring nucleotide level DNA methylation values from MethylCap-seq data”. Bioinformatics, 30(24):3567–3574 (2014). [PubMed Central:PMC4253832] [DOI:10.1093/bioinformatics/btu583] [PubMed:25178460].

[70] Brinkman, A. B., et al. “Whole-genome DNA methylation profiling using MethylCap-seq”. Methods, 52(3):232–236 (2010). [DOI:10.1016/j.ymeth.2010.06.012] [PubMed:20542119].

[71] Rodriguez, B. A., et al. “Methods for high-throughput MethylCap-Seq data analysis”. BMC Genomics, 13 Suppl 6:S14 (2012). [PubMed Central:PMC3481483] [DOI:10.1186/1471-2164-13-S6-S14] [PubMed:23134780].

[72] Meredith, G., et al. “DNA methylome sequencing: Methylated DNA enrichment with high-throughput sequencing by ligation (SOLiD System)”. Invitrogen poster on Invitrogen’s site. (2010).

[73] Kulis, M. and Esteller, M. “DNA methylation and cancer”. Adv. Genet., 70:27–56 (2010). [DOI:10.1016/B978-0-12-380866-0.60002-2] [PubMed:20920744].

[74] Nair, S. S., et al. “Comparison of methyl-DNA immunoprecipitation (MeDIP) and methyl-CpG binding domain (MBD) protein capture for genome-wide DNA methylation analysis reveal CpG sequence coverage bias”. Epigenetics, 6(1):34–44 (2011). [DOI:10.4161/epi.6.1.13313] [PubMed:20818161].

[75] Morgan, H. D., Santos, F., Green, K., Dean, W., and Reik, W. “Epigenetic reprogramming in mammals”. Hum. Mol. Genet., 14 Spec No 1:47–58 (2005). [DOI:10.1093/hmg/ddi114] [PubMed:15809273].

[76] Feinberg, A. P. “Phenotypic plasticity and the epigenetics of human disease”. Nature, 447(7143):433–440 (2007). [DOI:10.1038/nature05919] [PubMed:17522677].

[77] Reik, W. and Lewis, A. “Co-evolution of X-chromosome inactivation and imprinting in mammals”. Nat. Rev. Genet., 6(5):403–410 (2005). [DOI:10.1038/nrg1602] [PubMed:15818385].

[78] Coulondre, C., Miller, J. H., Farabaugh, P. J., and Gilbert, W. “Molecular basis of base substitution hotspots in Escherichia coli”. Nature, 274(5673):775–780 (1978). [PubMed:355893].

[79] Bird, A. P. “DNA methylation and the frequency of CpG in animal DNA”. Nucleic Acids Res., 8(7):1499–1504 (1980). [PubMed Central:PMC324012] [PubMed:6253938].

[80] Saxonov, S., Berg, P., and Brutlag, D. L. “A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters”. Proc. Natl. Acad. Sci. U.S.A., 103(5):1412–1417 (2006). [PubMed Central:PMC1345710] [DOI:10.1073/pnas.0510310103] [PubMed:16432200].

[81] Larsen, F., Gundersen, G., Lopez, R., and Prydz, H. “CpG islands as gene markers in the human genome”. Genomics, 13(4):1095–1107 (1992). [PubMed:1505946].

[82] Zhu, J., He, F., Hu, S., and Yu, J. “On the nature of human housekeeping genes”. Trends Genet., 24(10):481–484 (2008). [DOI:10.1016/j.tig.2008.08.004] [PubMed:18786740].

[83] Bird, A. “DNA methylation patterns and epigenetic memory”. Genes Dev., 16(1):6–21 (2002). [DOI:10.1101/gad.947102] [PubMed:11782440].

[84] Esteller, M. “Epigenetics in cancer”. N. Engl. J. Med., 358(11):1148–1159 (2008). [DOI:10.1056/NEJMra072067] [PubMed:18337604].

[85] Baylin, S. B. and Ohm, J. E. “Epigenetic gene silencing in cancer - a mechanism for early oncogenic pathway addiction?” Nat. Rev. Cancer, 6(2):107–116 (2006). [DOI:10.1038/nrc1799] [PubMed:16491070].

[86] Jones, P. A. and Baylin, S. B. “The fundamental role of epigenetic events in cancer”. Nat. Rev. Genet., 3(6):415–428 (2002). [DOI:10.1038/nrg816] [PubMed:12042769].

[87] Esteller, M. “CpG island hypermethylation and tumor suppressor genes: a booming present, a brighter future”. Oncogene, 21(35):5427–5440 (2002). [DOI:10.1038/sj.onc.1205600] [PubMed:12154405].

[88] Baylin, S. B., et al. “Aberrant patterns of DNA methylation, chromatin formation and gene expression in cancer”. Hum. Mol. Genet., 10(7):687–692 (2001). [PubMed:11257100].

[89] Patterson, K., Molloy, L., Qu, W., and Clark, S. “DNA methylation: bisulphite modification and analysis”. J Vis Exp, (56) (2011). [PubMed Central:PMC3227193] [DOI:10.3791/3170] [PubMed:22042230].

[90] Trimarchi, M. P., et al. “Enrichment-based DNA methylation analysis using next-generation sequencing: sample exclusion, estimating changes in global methylation, and the contribution of replicate lanes”. BMC Genomics, 13 Suppl 8:S6 (2012). [PubMed Central:PMC3535705] [DOI:10.1186/1471-2164-13-S8-S6] [PubMed:23281662].

[91] Invitrogen. MethylMiner Methylated DNA Enrichment Kit. Life Technologies Corporation, Waltham, MA (2009). URL http://www.invitrogen.com.

[92] Hendrich, B. and Tweedie, S. “The methyl-CpG binding domain and the evolving role of DNA methylation in animals”. Trends Genet., 19(5):269–277 (2003). [DOI:10.1016/S0168-9525(03)00080-5] [PubMed:12711219].

[93] Scarsdale, J. N., Webb, H. D., Ginder, G. D., and Williams, D. C. “Solution structure and dynamic analysis of chicken MBD2 methyl binding domain bound to a target-methylated DNA sequence”. Nucleic Acids Res., 39(15):6741–6752 (2011). [PubMed Central:PMC3159451] [DOI:10.1093/nar/gkr262] [PubMed:21531701].

[94] Hendrich, B., et al. “Genomic structure and chromosomal mapping of the murine and human Mbd1, Mbd2, Mbd3, and Mbd4 genes”. Mamm. Genome, 10(9):906–912 (1999). [PubMed:10441743].

[95] Bateman, A., et al. “UniProt: a hub for protein information”. Nucleic Acids Res., 43(Database issue):D204–212 (2015). [PubMed Central:PMC4384041] [DOI:10.1093/nar/gku989] [PubMed:25348405].

[96] Fraga, M. F., et al. “The affinity of different MBD proteins for a specific methylated locus depends on their intrinsic binding properties”. Nucleic Acids Res., 31(6):1765–1774 (2003). [PubMed Central:PMC152853] [PubMed:12626718].

[97] Rodriguez, B. A., et al. “Epigenetic repression of the estrogen-regulated Homeobox B13 gene in breast cancer”. Carcinogenesis, 29(7):1459–1465 (2008). [PubMed Central:PMC2899848] [DOI:10.1093/carcin/bgn115] [PubMed:18499701].

[98] Lander, E. S., et al. “Initial sequencing and analysis of the human genome”. Nature, 409(6822):860–921 (2001). [DOI:10.1038/35057062] [PubMed:11237011].

[99] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2014). URL http://www.R-project.org/.

[100] Benjamini, Y. and Speed, T. P. “Summarizing and correcting the GC content bias in high-throughput sequencing”. Nucleic Acids Res., 40(10):e72 (2012). [PubMed Central:PMC3378858] [DOI:10.1093/nar/gks001] [PubMed:22323520].

[101] SantaLucia, J. “A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics”. Proc. Natl. Acad. Sci. U.S.A., 95(4):1460–1465 (1998). [PubMed Central:PMC19045] [PubMed:9465037].

[102] Bentley, D. R., et al. “Accurate whole human genome sequencing using reversible terminator chemistry”. Nature, 456(7218):53–59 (2008). [PubMed Central:PMC2581791] [DOI:10.1038/nature07517] [PubMed:18987734].

[103] Morin, R., et al. “Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing”. BioTechniques, 45(1):81–94 (2008). [DOI:10.2144/000112900] [PubMed:18611170].

[104] Ingolia, N. T., et al. “Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes”. Cell Rep, 8(5):1365–1379 (2014). [PubMed Central:PMC4216110] [DOI:10.1016/j.celrep.2014.07.045] [PubMed:25159147].

[105] Blattner, F. R., et al. “The complete genome sequence of Escherichia coli K-12”. Science, 277(5331):1453–1462 (1997). [PubMed:9278503].

[106] Langmead, B. and Salzberg, S. L. “Fast gapped-read alignment with Bowtie 2”. Nat. Methods, 9(4):357–359 (2012). [PubMed Central:PMC3322381] [DOI:10.1038/nmeth.1923] [PubMed:22388286].

[107] Dobin, A., et al. “STAR: ultrafast universal RNA-seq aligner”. Bioinformatics, 29(1):15–21 (2013). [PubMed Central:PMC3530905] [DOI:10.1093/bioinformatics/bts635] [PubMed:23104886].

[108] Harrow, J., et al. “GENCODE: the reference human genome annotation for The ENCODE Project”. Genome Res., 22(9):1760–1774 (2012). [PubMed Central:PMC3431492] [DOI:10.1101/gr.135350.111] [PubMed:22955987].

[109] Li, H., et al. “The Sequence Alignment/Map format and SAMtools”. Bioinformatics, 25(16):2078–2079 (2009). [PubMed Central:PMC2723002] [DOI:10.1093/bioinformatics/btp352] [PubMed:19505943].

[110] Schneider, K. L., Pollard, K. S., Baertsch, R., Pohl, A., and Lowe, T. M. “The UCSC Archaeal Genome Browser”. Nucleic Acids Res., 34(Database issue):D407–410 (2006). [PubMed Central:PMC1347496] [DOI:10.1093/nar/gkj134] [PubMed:16381898].

[111] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. “Basic local alignment search tool”. J. Mol. Biol., 215(3):403–410 (1990). [DOI:10.1016/S0022-2836(05)80360-2] [PubMed:2231712].

[112] Karolchik, D., et al. “The UCSC Table Browser data retrieval tool”. Nucleic Acids Res., 32(Database issue):D493–496 (2004). [PubMed Central:PMC308837] [DOI:10.1093/nar/gkh103] [PubMed:14681465].

[113] Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. “Mapping and quantifying mammalian transcriptomes by RNA-Seq”. Nat. Methods, 5(7):621–628 (2008). [DOI:10.1038/nmeth.1226] [PubMed:18516045].

[114] Dillies, M. A., et al. “A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis”. Brief. Bioinformatics, 14(6):671–683 (2013). [DOI:10.1093/bib/bbs046] [PubMed:22988256].

[115] Kent, W. J., et al. “The human genome browser at UCSC”. Genome Res., 12(6):996–1006 (2002). [PubMed Central:PMC186604] [DOI:10.1101/gr.229102] [PubMed:12045153].

[116] Benjamini, Y. and Hochberg, Y. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing”. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300 (1995). ISSN 00359246. URL http://dx.doi.org/10.2307/2346101.

[117] Bailey, T. L. and Elkan, C. “Fitting a mixture model by expectation maximization to discover motifs in biopolymers”. Proc Int Conf Intell Syst Mol Biol, 2:28–36 (1994). [PubMed:7584402].

[118] Lorenz, R., et al. “ViennaRNA Package 2.0”. Algorithms Mol Biol, 6:26 (2011). [PubMed Central:PMC3319429] [DOI:10.1186/1748-7188-6-26] [PubMed:22115189].

[119] Spearman, C. “The proof and measurement of association between two things. By C. Spearman, 1904”. Am J Psychol, 100(3-4):441–471 (1987). [PubMed:3322052].

[120] Ho, C. K., et al. “The guanylyltransferase domain of mammalian mRNA capping enzyme binds to the phosphorylated carboxyl-terminal domain of RNA polymerase II”. J. Biol. Chem., 273(16):9577–9585 (1998). [PubMed:9545288].

[121] Shatkin, A. J. and Manley, J. L. “The ends of the affair: capping and polyadenylation”. Nat. Struct. Biol., 7(10):838–842 (2000). [DOI:10.1038/79583] [PubMed:11017188].

[122] Konarska, M. M., Padgett, R. A., and Sharp, P. A. “Recognition of cap structure in splicing in vitro of mRNA precursors”. Cell, 38(3):731–736 (1984). [PubMed:6567484].

[123] Visa, N., Izaurralde, E., Ferreira, J., Daneholt, B., and Mattaj, I. W. “A nuclear cap-binding complex binds Balbiani ring pre-mRNA cotranscriptionally and accompanies the ribonucleoprotein particle during nuclear export”. J. Cell Biol., 133(1):5–14 (1996). [PubMed Central:PMC2120770] [PubMed:8601613].

[124] Lewis, J. D. and Izaurralde, E. “The role of the cap structure in RNA processing and nuclear export”. Eur. J. Biochem., 247(2):461–469 (1997). [PubMed:9266685].

[125] Shatkin, A. J. “Capping of eucaryotic mRNAs”. Cell, 9(4 PT 2):645–653 (1976). [PubMed:1017010].

[126] Banerjee, A. K. “5’-terminal cap structure in eucaryotic messenger ribonucleic acids”. Microbiol. Rev., 44(2):175–205 (1980). [PubMed Central:PMC373176] [PubMed:6247631].

[127] Sonenberg, N. and Gingras, A. C. “The mRNA 5’ cap-binding protein eIF4E and control of cell growth”. Curr. Opin. Cell Biol., 10(2):268–275 (1998). [PubMed:9561852].

[128] Evdokimova, V., et al. “The major mRNA-associated protein YB-1 is a potent 5’ cap-dependent mRNA stabilizer”. EMBO J., 20(19):5491–5502 (2001). [PubMed Central:PMC125650] [DOI:10.1093/emboj/20.19.5491] [PubMed:11574481].

[129] Gao, M., Fritz, D. T., Ford, L. P., and Wilusz, J. “Interaction between a poly(A)-specific ribonuclease and the 5’ cap influences mRNA deadenylation rates in vitro”. Mol. Cell, 5(3):479–488 (2000). [PubMed Central:PMC2811581] [PubMed:10882133].

[130] Burkard, K. T. and Butler, J. S. “A nuclear 3’-5’ exonuclease involved in mRNA degradation interacts with Poly(A) polymerase and the hnRNA protein Npl3p”. Mol. Cell. Biol., 20(2):604–616 (2000). [PubMed Central:PMC85144] [PubMed:10611239].

[131] Schoenberg, D. R. and Maquat, L. E. “Regulation of cytoplasmic mRNA decay”. Nat. Rev. Genet., 13(4):246–259 (2012). [PubMed Central:PMC3351101] [DOI:10.1038/nrg3160] [PubMed:22392217].

[132] Li, Y. and Kiledjian, M. “Regulation of mRNA decapping”. Wiley Interdiscip Rev RNA, 1(2):253–265 (2010). [DOI:10.1002/wrna.15] [PubMed:21935889].

[133] Gu, M. and Lima, C. D. “Processing the message: structural insights into capping and decapping mRNA”. Curr. Opin. Struct. Biol., 15(1):99–106 (2005). [DOI:10.1016/j.sbi.2005.01.009] [PubMed:15718140].

[134] Mukherjee, C., et al. “Identification of cytoplasmic capping targets reveals a role for cap homeostasis in translation and mRNA stability”. Cell Rep, 2(3):674–684 (2012). [PubMed Central:PMC3462258] [DOI:10.1016/j.celrep.2012.07.011] [PubMed:22921400].

[135] Otsuka, Y., Kedersha, N. L., and Schoenberg, D. R. “Identification of a cytoplasmic complex that adds a cap onto 5’-monophosphate RNA”. Mol. Cell. Biol., 29(8):2155–2167 (2009). [PubMed Central:PMC2663312] [DOI:10.1128/MCB.01325-08] [PubMed:19223470].

[136] Dobin, A. “Re: Many unmapped reads in STAR, classified as ‘too short’?” (2013). [URL:https://groups.google.com/d/msg/rna-star/7RwKkvNLmI4/REpWc1B4KDkJ].

[137] Machida, R. J. and Lin, Y. Y. “Four methods of preparing mRNA 5’ end libraries using the Illumina sequencing platform”. PLoS ONE, 9(7):e101812 (2014). [PubMed Central:PMC4086933] [DOI:10.1371/journal.pone.0101812] [PubMed:25003736].

[138] Caldon, C. E. and March, P. E. “Function of the universally conserved bacterial GTPases”. Curr. Opin. Microbiol., 6(2):135–139 (2003). [PubMed:12732302].

[139] Caldon, C. E., Yoong, P., and March, P. E. “Evolution of a molecular switch: universal bacterial GTPases regulate ribosome function”. Mol. Microbiol., 41(2):289–297 (2001). [PubMed:11489118].

[140] Margus, T., Remm, M., and Tenson, T. “Phylogenetic distribution of translational GTPases in bacteria”. BMC Genomics, 8:15 (2007). [PubMed Central:PMC1780047] [DOI:10.1186/1471-2164-8-15] [PubMed:17214893].

[141] Dibb, N. J. and Wolfe, P. B. “lep operon proximal gene is not required for growth or secretion by Escherichia coli”. J. Bacteriol., 166(1):83–87 (1986). [PubMed Central:PMC214560] [PubMed:3514582].

[142] Jørgensen, R., Merrill, A. R., and Andersen, G. R. “The life and death of translation elongation factor 2”. Biochem. Soc. Trans., 34(Pt 1):1–6 (2006). [DOI:10.1042/BST20060001] [PubMed:16246167].

[143] Evans, R. N., Blaha, G., Bailey, S., and Steitz, T. A. “The structure of LepA, the ribosomal back translocase”. Proc. Natl. Acad. Sci. U.S.A., 105(12):4673–4678 (2008). [PubMed Central:PMC2290774] [DOI:10.1073/pnas.0801308105] [PubMed:18362332].

[144] March, P. E. and Inouye, M. “GTP-binding membrane protein of Escherichia coli with sequence homology to initiation factor 2 and elongation factors Tu and G”. Proc. Natl. Acad. Sci. U.S.A., 82(22):7500–7504 (1985). [PubMed Central:PMC390844] [PubMed:2999765].

[145] Pech, M., et al. “Elongation factor 4 (EF4/LepA) accelerates protein synthesis at increased Mg2+ concentrations”. Proc. Natl. Acad. Sci. U.S.A., 108(8):3199–3203 (2011). [PubMed Central:PMC3044372] [DOI:10.1073/pnas.1012994108] [PubMed:21300907].

[146] Qin, Y., et al. “The highly conserved LepA is a ribosomal elongation factor that back-translocates the ribosome”. Cell, 127(4):721–733 (2006). [DOI:10.1016/j.cell.2006.09.037] [PubMed:17110332].

[147] Liu, H., et al. “The conserved protein EF4 (LepA) modulates the elongation cycle of protein synthesis”. Proc. Natl. Acad. Sci. U.S.A., 108(39):16223–16228 (2011). [PubMed Central:PMC3182707] [DOI:10.1073/pnas.1103820108] [PubMed:21930951].

[148] Liu, H., Pan, D., Pech, M., and Cooperman, B. S. “Interrupted catalysis: the EF4 (LepA) effect on back-translocation”. J. Mol. Biol., 396(4):1043–1052 (2010). [PubMed Central:PMC3138200] [DOI:10.1016/j.jmb.2009.12.043] [PubMed:20045415].

[149] Cunha, C. E., et al. “Dual use of GTP hydrolysis by elongation factor G on the ribosome”. Translation, 1(1):e24315 (2013). [DOI:10.4161/trla.24315].

[150] Daviter, T., Wieden, H. J., and Rodnina, M. V. “Essential role of histidine 84 in elongation factor Tu for the chemical step of GTP hydrolysis on the ribosome”. J. Mol. Biol., 332(3):689–699 (2003). [PubMed:12963376].

[151] Nagalakshmi, U., et al. “The transcriptional landscape of the yeast genome defined by RNA sequencing”. Science, 320(5881):1344–1349 (2008). [PubMed Central:PMC2951732] [DOI:10.1126/science.1158441] [PubMed:18451266].

[152] Wilhelm, B. T., et al. “Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution”. Nature, 453(7199):1239–1243 (2008). [DOI:10.1038/nature07002] [PubMed:18488015].

[153] O’Connor, P. B., Li, G. W., Weissman, J. S., Atkins, J. F., and Baranov, P. V. “rRNA:mRNA pairing alters the length and the symmetry of mRNA-protected fragments in ribosome profiling experiments”. Bioinformatics, 29(12):1488–1491 (2013). [PubMed Central:PMC3673220] [DOI:10.1093/bioinformatics/btt184] [PubMed:23603333].

[154] Oh, E., et al. “Selective ribosome profiling reveals the cotranslational chaperone action of trigger factor in vivo”. Cell, 147(6):1295–1308 (2011). [PubMed Central:PMC3277850] [DOI:10.1016/j.cell.2011.10.044] [PubMed:22153074].

[155] Ingolia, N. T., Lareau, L. F., and Weissman, J. S. “Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes”. Cell, 147(4):789–802 (2011). [PubMed Central:PMC3225288] [DOI:10.1016/j.cell.2011.10.002] [PubMed:22056041].

[156] Li, G. W., Burkhardt, D., Gross, C., and Weissman, J. S. “Quantifying absolute protein synthesis rates reveals principles underlying allocation of cellular resources”. Cell, 157(3):624–635 (2014). [PubMed Central:PMC4006352] [DOI:10.1016/j.cell.2014.02.033] [PubMed:24766808].

[157] Dong, H., Nilsson, L., and Kurland, C. G. “Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates”. J. Mol. Biol., 260(5):649–663 (1996). [DOI:10.1006/jmbi.1996.0428] [PubMed:8709146].

[158] Strunk, B. S., Novak, M. N., Young, C. L., and Karbstein, K. “A translation-like cycle is a quality control checkpoint for maturing 40S ribosome subunits”. Cell, 150(1):111–121 (2012). [PubMed Central:PMC3615461] [DOI:10.1016/j.cell.2012.04.044] [PubMed:22770215].

[159] Karbstein, K. “Quality control mechanisms during ribosome maturation”. Trends Cell Biol., 23(5):242–250 (2013). [PubMed Central:PMC3640646] [DOI:10.1016/j.tcb.2013.01.004] [PubMed:23375955].

[160] Shajani, Z., Sykes, M. T., and Williamson, J. R. “Assembly of bacterial ribosomes”. Annu. Rev. Biochem., 80:501–526 (2011). [DOI:10.1146/annurev-biochem-062608-160432] [PubMed:21529161].

[161] Connolly, K. and Culver, G. “Deconstructing ribosome construction”. Trends Biochem. Sci., 34(5):256–263 (2009). [PubMed Central:PMC3711711] [DOI:10.1016/j.tibs.2009.01.011] [PubMed:19376708].

[162] Connolly, K. and Culver, G. “Overexpression of RbfA in the absence of the KsgA checkpoint results in impaired translation initiation”. Mol. Microbiol., 87(5):968–981 (2013). [PubMed Central:PMC3583373] [DOI:10.1111/mmi.12145] [PubMed:23387871].

[163] Campbell, T. L. and Brown, E. D. “Genetic interaction screens with ordered overexpression and deletion clone sets implicate the Escherichia coli GTPase YjeQ in late ribosome biogenesis”. J. Bacteriol., 190(7):2537–2545 (2008). [PubMed Central:PMC2293177] [DOI:10.1128/JB.01744-07] [PubMed:18223068].

[164] Goto, S., Kato, S., Kimura, T., Muto, A., and Himeno, H. “RsgA releases RbfA from 30S ribosome during a late stage of ribosome biosynthesis”. EMBO J., 30(1):104–114 (2011). [PubMed Central:PMC3020115] [DOI:10.1038/emboj.2010.291] [PubMed:21102555].

[165] Karbstein, K. “Role of GTPases in ribosome assembly”. Biopolymers, 87(1):1–11 (2007). [DOI:10.1002/bip.20762] [PubMed:17514744].

[166] Britton, R. A. “Role of GTPases in bacterial ribosome assembly”. Annu. Rev. Microbiol., 63:155–176 (2009). [DOI:10.1146/annurev.micro.091208.073225] [PubMed:19575570].

[167] Bussiere, C., Hashem, Y., Arora, S., Frank, J., and Johnson, A. W. “Integrity of the P-site is probed during maturation of the 60S ribosomal subunit”. J. Cell Biol., 197(6):747–759 (2012). [PubMed Central:PMC3373404] [DOI:10.1083/jcb.201112131] [PubMed:22689654].

[168] Lo, K. Y., et al. “Defining the pathway of cytoplasmic maturation of the 60S ribosomal subunit”. Mol. Cell, 39(2):196–208 (2010). [PubMed Central:PMC2925414] [DOI:10.1016/j.molcel.2010.06.018] [PubMed:20670889].

[169] Yamamoto, H., et al. “EF-G and EF4: translocation and back-translocation on the bacterial ribosome”. Nat. Rev. Microbiol., 12(2):89–100 (2014). [DOI:10.1038/nrmicro3176] [PubMed:24362468].

[170] Shoji, S., Janssen, B. D., Hayes, C. S., and Fredrick, K. “Translation factor LepA contributes to tellurite resistance in Escherichia coli but plays no apparent role in the fidelity of protein synthesis”. Biochimie, 92(2):157–163 (2010). [PubMed Central:PMC2815024] [DOI:10.1016/j.biochi.2009.11.002] [PubMed:19925844].

[171] Srivatsan, A. and Wang, J. D. “Control of bacterial transcription, translation and replication by (p)ppGpp”. Curr. Opin. Microbiol., 11(2):100–105 (2008). [DOI:10.1016/j.mib.2008.02.001] [PubMed:18359660].

[172] Yang, F., Li, Z., Hao, J., and Qin, Y. “EF4 knockout E. coli cells exhibit lower levels of cellular biosynthesis under acidic stress”. Protein Cell, 5(7):563–567 (2014). [PubMed Central:PMC4085283] [DOI:10.1007/s13238-014-0050-3] [PubMed:24706296].

[173] Accetto, T. and Avguštin, G. “Inability of Prevotella bryantii to form a functional Shine-Dalgarno interaction reflects unique evolution of ribosome binding sites in Bacteroidetes”. PLoS ONE, 6(8):e22914 (2011). [PubMed Central:PMC3155529] [DOI:10.1371/journal.pone.0022914] [PubMed:21857964].

[174] Nakagawa, S., Niimura, Y., Miura, K., and Gojobori, T. “Dynamic evolution of translation initiation mechanisms in prokaryotes”. Proc. Natl. Acad. Sci. U.S.A., 107(14):6382–6387 (2010). [PubMed Central:PMC2851962] [DOI:10.1073/pnas.1002036107] [PubMed:20308567].

[175] Kuzmenko, A. V., et al. “Protein biosynthesis in mitochondria”. Biochemistry Mosc., 78(8):855–866 (2013). [PubMed Central:PMC3837291] [DOI:10.1134/S0006297913080014] [PubMed:24228873].

[176] Bauerschmitt, H., Funes, S., and Herrmann, J. M. “The membrane-bound GTPase Guf1 promotes mitochondrial protein synthesis under suboptimal conditions”. J. Biol. Chem., 283(25):17139–17146 (2008). [DOI:10.1074/jbc.M710037200] [PubMed:18442968].

[177] Prestele, M., Vogel, F., Reichert, A. S., Herrmann, J. M., and Ott, M. “Mrpl36 is important for generation of assembly competent proteins during mitochondrial translation”. Mol. Biol. Cell, 20(10):2615–2625 (2009). [PubMed Central:PMC2682602] [DOI:10.1091/mbc.E08-12-1162] [PubMed:19339279].
