RNA Secondary Structures: from Biophysics to Bioinformatics

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

William D. Baez, M.S.

Graduate Program in Physics

The Ohio State University 2018

Dissertation Committee: Dr. Ralf Bundschuh, Advisor

Dr. Kurt Fredrick

Dr. Amy Connelly

Dr. Comert Kural

© Copyright by

William D. Baez

2018

Abstract

We investigate aspects of RNA secondary structure from the viewpoint of theoretical biophysics and from the viewpoint of bioinformatics. From the existence of a novel thermodynamic phase transition to the fundamental mechanisms of life, RNA continues to act as a wellspring of new discoveries.

RNA forms elaborate secondary structures through intramolecular base pairing. These structures perform critical biological functions within each cell. Due to the availability of a polynomial-time algorithm to calculate the partition function, they are also a suitable model system for the statistical physics of disordered systems. In this model, below the denaturation temperature, random RNA secondary structures can exist in one of two phases: a strongly disordered, low-temperature glass phase and a weakly disordered, high-temperature molten phase. The probability of two bases pairing in these phases has been shown to decay with the distance between the two bases with exponents of 3/2 and 4/3 in the molten and glass phases, respectively. Drawing on previous results from a renormalized field theory of the glass transition, we numerically study this transition, introduce two order parameters that determine the location of the critical point, and explore the driving mechanism behind the transition.

Within a cell’s genome, regulatory elements can often be found in the vicinity of the genes they regulate. In prokaryotes, a common translational regulatory element, the Shine-Dalgarno sequence, has been found to be largely absent from entire phyla of bacteria. This sequence element is part of the textbook model of translation initiation. To understand how Shine-Dalgarno-independent bacteria, such as F. johnsoniae, a member of the phylum Bacteroidetes, initiate translation, we used high-throughput RNA sequencing and ribosome profiling data to investigate the impact of mRNA secondary structure near a gene’s initiation site. We found evidence that strongly implicates unstructured or unstable mRNA structures in translation initiation in these understudied organisms.

Finally, we again use high-throughput RNA sequencing and ribosome profiling data to study the impact of Fhit loss on human cells. Our findings show that Fhit expression impacts the translation of a number of cancer-associated genes, and they support the hypothesis that Fhit’s genome-protective/tumor-suppressor function is associated with post-transcriptional changes in expression of genes whose dysregulation contributes to malignancy.

To Mrs. Jacobs, my 10th grade teacher. This is all your fault.

Acknowledgments

There are too many people I should thank who have helped me along my journey to this PhD. Fortunately, I have forgotten most of them. So I will instead thank those that I do remember. First and foremost, I would like to thank my advisor, Dr. Ralf A. Bundschuh, for his guidance, support, and Buddha-like patience. Little did we know back in 2012 that my introductory research problem, the phase transition of RNA secondary structures, would become a six-year academic adventure. Along with Ralf, I must thank my fellow graduate students with whom I have had the pleasure of working in the Bundschuh group: Blythe Moreland, Robert Patton, Kenji Oman, and Dengke Zhao.

I would also like to thank my collaborators here at Ohio State and abroad. In particular, I must extend my gratitude to Dr. Kurt Fredrick, Dept. of Microbiology, and Dr. Daniel Schoenberg, Dept. of Biological Chemistry and Pharmacology, both of whom opened my mind to the complexities of biology beyond what one might learn in a textbook. Their patience and guidance during my initial months collaborating with their respective groups were instrumental to our success. Many thanks to Dr. Kay Wiese, of the École normale supérieure in Paris, France, whose contributions were instrumental in the RNA phase transition project.

It would be remiss of me to not acknowledge the support, guidance, and mentorship of those from my academic past: Dr. J. Andrew Hauger and Dr. Trinanjan Datta of Augusta University, Augusta, GA, and Mrs. Jean Jacobs of Lindenhurst Senior High School, Lindenhurst, NY. As chair of the Dept. of Chemistry and Physics, Dr. Andy Hauger taught me more than just the art of electronics and the wonders of quantum mechanics. From high atop a mountain, he taught me the fundamental lesson of being a leader: take care of your people. It is a lesson that I continue to benefit from to this day. It was from Dr. Trinanjan Datta that I learned the skill necessary to survive graduate school and physics research: grit. For without grit, it is all too easy to give up after countless dead ends, numerous soul-crushing failures, and endless nights of little to no sleep. I am forever indebted to these two academic heroes.

But perhaps the most valuable lesson I was ever taught, but did not appreciate until the final years of graduate school, came from my high school biology teacher, Mrs. Jean Jacobs. As an ambitious high school student, I attempted to take as many courses as one could fit into nine forty-minute periods. But it was she who would often remind me, “There are not enough hours in the day, Mr. Baez.” More than 20 years later, I finally learned that lesson.

Vita

September 3, 1980 ...... Born—Lindenhurst, NY

May, 2010 ...... B.S., Augusta State University, Augusta, GA

December, 2015 ...... M.S., Ohio State University, Columbus, OH

Publications

Daniel L Kiss, William Baez (co-first author), Kay Huebner, Ralf Bundschuh, Daniel R. Schoenberg. Impact of FHIT loss on the translation of cancer-associated mRNAs. Mol. Cancer, 16:179 (2017).

Daniel L Kiss, William Baez (co-first author), Kay Huebner, Ralf Bundschuh, Daniel R. Schoenberg. Loss of fragile histidine triad (Fhit) expression alters the translation of cancer-associated mRNAs. BMC Res. Notes, 11:178 (2018).

Fields of Study

Major Field: Physics

Studies in Biophysics Theory, Bioinformatics: Ralf Bundschuh

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Figures
List of Tables

Chapters

1 Introduction
    1.1 Nucleic Acids and The Central Dogma
        1.1.1 DNA and RNA
        1.1.2 The Universality of the Central Dogma
    1.2 Biophysics and Statistical Mechanics
    1.3 Bioinformatics
        1.3.1 High-Throughput RNA Sequencing and Ribosome Profiling
    1.4 Outline

2 Phase Transition of RNA Secondary Structures
    2.1 Introduction
    2.2 RNA Secondary-Structure Model
        2.2.1 RNA Secondary Structures
        2.2.2 Energy Model
        2.2.3 Partition Function
        2.2.4 Observables
        2.2.5 Numerical Approach
    2.3 Scaling of the contact and overlap observables
        2.3.1 Initial Estimate of Phase Transition Temperature
        2.3.2 Order Parameters for the Transition, and a More Precise Estimation of the Phase Transition Temperature
    2.4 Transition Mechanism
    2.5 Discussion & Conclusions

3 The role of RNA secondary structure in translation initiation in Flavobacterium johnsoniae
    3.1 Introduction
    3.2 F. johnsoniae as the model organism for Bacteroidetes
    3.3 Materials and Methods
        3.3.1 Cell culture and library preparations
        3.3.2 RNA-seq, Ribo-seq, and the selection of representative set of genes
    3.4 Results
        3.4.1 Ribosome footprints from F. johnsoniae are shorter and more uniform in length than those of other bacteria
        3.4.2 Start codon usage and AUG trinucleotide representation in the F. johnsoniae TIR
        3.4.3 RNA-seq data identifies promoters with well-conserved -7 elements
        3.4.4 Rate of translation is tuned by mRNA secondary structure near the start codon
    3.5 Discussion/Conclusion

4 The role of FHIT on the translation of cancer-associated mRNAs
    4.1 Introduction: Background on Fhit
        4.1.1 Fhit as 5’ cap scavenger
    4.2 Materials and Methods
        4.2.1 Cell culture and library preparations
        4.2.2 RNA-seq, Ribo-seq, and informatics
    4.3 Results
        4.3.1 Identifying mRNAs whose translation is controlled by Fhit
        4.3.2 Fhit loss associated with changes in ribosome distribution across coding region of some genes
        4.3.3 Validation of ARD targets
        4.3.4 Effects of Fhit on 5’ UTR ribosome occupancy
    4.4 Discussion/Conclusions

Bibliography

List of Figures

Figure

1.1 Cartoon depiction of deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). The two nucleic acids are similar in basic structure but differ in specific ways. DNA is typically found as a double-stranded molecule, while RNA is usually seen as a single-stranded molecule. Three of the four bases making up either nucleic acid are common to both: adenine (A), cytosine (C), and guanine (G). Thymine (T) is found only in DNA and is replaced by uracil (U) in RNA. Image adapted from [6].

1.2 Codon table for bacteria and archaea according to NCBI [10]. Start codons are indicated by a superscript star (*). The canonical start codon, AUG, is highlighted in green, while near-cognates are highlighted in blue. Stop, or termination, codons are highlighted in red.

2.1 Diagrammatic and abstract representations of an RNA secondary structure. (a) The backbone of the molecule is represented by the solid gray line while the solid black lines stand for the hydrogen-bonded base pairs where each base is depicted as a circle. The shape of the backbone is such that stems of stacked base pairs and the loops connecting or terminating them can be clearly seen. Stems form double-helical structures similar to that of DNA. (b) Each base pair is connected by a solid black line while the red dashed lines create a pseudoknot. We exclude such pairings in our definition of a secondary structure. Pseudoknots, such as these, do not contribute much to the total energy and are often deemed part of an RNA molecule’s tertiary structure.

2.2 Secondary structures of a random RNA molecule at distant times. Base pairings can be nested, such as (s, t) and (s′, t′), or independent, such as (s, t) and (s″, t″). The pairing overlap is defined by the common base pairings between the left and right configuration (corresponding bases are shown in black). (a) Above Tc, the molecule contains conserved subfolds on scales up to the correlation length ξ (indicated by shading) and is molten on large scales. (b) Below Tc, the molecule is locked into its minimum-energy structure on all scales, up to rare fluctuations (unshaded). Image adapted from [70].

2.3 The exponents α1 and α2 as a function of temperature. The continuously changing nature of these quantities makes them poor indicators of the location of the phase transition.

2.4 Pinch free energy prefactor a(T) vs temperature. The inset shows the disorder-averaged pinch free energy vs ln(N) in both the glass phase (kBT = 0.1√D) and the molten phase (kBT = 1.0√D). This allows us to extract the prefactors of these logarithmic behaviors, which are shown as a function of temperature in the main graph. The temperature at which the prefactor changes slope is an estimate of the phase transition temperature Tc.

2.5 Average pairing probability versus substrand length. The upper figure shows how the pairing probability scales as Eq. (2.7) in the glass phase at the lowest temperature considered, kBT/√D = 0.1. As mentioned in the text, within this phase the critical exponent, α, is insensitive to the value of n. The lower figure shows how the pairing probability scales as Eq. (2.7) in the molten phase at the highest temperature considered, kBT/√D = 1.0, but demonstrates how the value of α directly depends on the value of n.

2.6 High-temperature pairing ratio vs substrand length. The curves above demonstrate the behavior of this new observable at several temperatures considered above and below the location of the transition temperature and for temperatures deep within the glass phase (kBT/√D = 0.1) and molten phase (kBT/√D = 1.0).

2.7 Low-temperature pairing ratios vs substrand length. The curves above demonstrate the behavior of this new observable at several temperatures considered above and below the location of the transition temperature and for temperatures deep within the glass phase (kBT/√D = 0.1) and molten phase (kBT/√D = 1.0).

2.8 Order parameters, ∆g and ∆m, versus temperature. (a) As Eq. (2.17a) approaches kBT/√D ≈ 0.53 from below it drops below zero. (b) Similarly, as Eq. (2.17b) approaches kBT/√D ≈ 0.53 from above it becomes negative. Note that finding a negative value for ∆g and ∆m is unphysical, and for N → ∞ these negative regions converge to zero. Plotting them nevertheless allows us to more precisely estimate the transition temperature Tc. Estimates for the values of ωgν and ωmν are also shown in their respective temperature regimes.

2.9 Distributions of the logarithm of base-pairing probabilities x = − ln(p) for system size N = 4000 averaged over the final third of all ℓ’s. (a) Deep in the molten phase (kBT/√D = 1.0) the distribution of base-pairing probabilities has a distinct Gaussian-like form, which would go to a delta peak as T → ∞ (D → 0). The inset shows that no weight of the distribution is present close to x ≈ 0 (p ≈ 1). (b) Deep in the glass phase (kBT/√D = 0.1) the base-pairing distribution still has a Gaussian-like form, albeit broader than the high temperature distribution. However, the inset shows there exists an exponential-like peak close to x ≈ 0.

2.10 Separation of the glass phase ground and non-ground state distributions. (a) The total base-pairing distribution within the glass phase (kBT/√D = 0.1). The black arrow points to the prominent peak at x ≈ 0. (b) The ground state base-pairing distribution. The black arrow points to the prominent peak at x ≈ 0. This peak represents the frequency of those pairs “locked” in the ground-state structure and accounts for the weight seen in the inset of Fig. 2.9(b). (c) As mentioned in the text, the broad Gaussian-like peak seen in the low temperature regime is a result of non-ground state pairs that have a small, non-zero pairing probability at finite temperatures. These probabilities would be zero at T = 0.

2.11 Separation of the molten phase ground and non-ground state distributions. (a) The total base-pairing distribution within the molten phase (kBT/√D = 1.0). (b) The lack of weight close to x ≈ 0 is due to the absence of ground state pairs at high temperatures, as seen in the inset of Fig. 2.9(a). We see instead that the peak has shifted to the right to a position in x closer to the non-ground state distribution. Note that the y-axis is 4 orders of magnitude smaller than either one of the other distributions. (c) The non-ground state pairs dominate the total pairing probability distribution.

2.12 “Locked” regions of the ground state distributions in the vicinity of the critical point. (a) At kBT/√D = 0.5 we see that several of the bins in the range x ≤ 0.1 have no weight. (b)-(c) As the temperature approaches Tc (kBT/√D = 0.51, 0.52), this region of zero-weight bins begins to slowly expand further. (d) At the critical point many of the bins within the x ≤ 0.1 range have zero count. (e) Above Tc the number of zero-count bins dominates the x ≤ 0.1 range with more occurring further to the right.

2.13 Average pairing probability of the “locked” region versus temperature. The black vertical line indicates the location of the critical point.

2.14 Fraction of zero-weight bins versus temperature. The black horizontal line shows the point at which the number of zero-weight bins within the “locked” region is at 10%. Below Tc the number of zero-weight bins is very low. At the critical point ∼10% of the bins within the “locked” region are zero-weight. This fraction of zero-weight bins sharply rises above Tc until kBT/√D = 0.7, where all bins within the “locked” region are zero-weight.

2.15 Average ground state base-pairing probabilities versus x for system size N = 1000. (a) At kBT/√D = 0.3, and all temperatures below, the region close to x ≈ 0 is the dominant peak. (b) As the temperature is increased, the height of this peak drops to roughly the same height as the neighboring peaks. (c) Above a particular temperature, this peak is then lower than its neighbors.

2.16 Logarithm of the average ground state base-pairing distributions versus ln(x) for system size N = 1000. At low temperatures the logarithms of the ground state pairing distributions close to x ≈ 0 demonstrate linear behavior whose slopes (approximated by the black dashed fit lines) appear to turn from negative to positive as the temperature increases.

2.17 γ vs temperature. When plotted against temperature, the slopes of the fit lines from Fig. 2.16 cross zero at a point between kBT/√D = 0.34 and kBT/√D = 0.35, which one could interpret as the secondary freezing event mentioned in [70].

3.1 Pairing of the Shine-Dalgarno (SD) sequence with the anti-Shine-Dalgarno (aSD) sequence. The SD sequence lies ∼5-9 nt upstream of the start codon of an mRNA. Image adapted from [83].

3.2 Comparison of RNA-seq libraries from F. johnsoniae. Shown are scatter plots of duplicate RNA-seq libraries with the Spearman coefficient for each shown in the green box.

3.3 Comparison of Ribo-seq libraries from F. johnsoniae. Shown are scatter plots of duplicate Ribo-seq libraries with the Spearman coefficient for each shown in the green box.

3.4 Comparison of average ribosome densities from F. johnsoniae. Shown are scatter plots of average ribosome densities from the Ribo-seq libraries with the Spearman coefficient for each shown in the green box.

3.5 Length histograms of Ribo-seq footprints for F. johnsoniae, E. coli, and B. subtilis. The average length of Ribo-seq footprints in F. johnsoniae was found to be 23 nt, 4-5 nt shorter than in E. coli and B. subtilis. E. coli and B. subtilis showed a broader size distribution due to possible 5’ length variability mentioned in [49–51]. We interpret the shorter footprint lengths of F. johnsoniae as being a result of the lack of SD-mRNA pairing.

3.6 Metagene analysis of the 3’ and 5’ ends of Ribo-seq footprints about the 3’ end of coding sequences in F. johnsoniae. (a) The peak at 9 nts to the right of the stop codon, which we interpret as stemming from ribosomes located with the stop codon in their A-site, indicates that the center of the P-site is 13 nts upstream of the most prevalent 3’ end of the Ribo-seq footprints. (b) The peak at 13 nts to the left of the stop codon is also interpreted as stemming from ribosomes located with the stop codon in their A-site.

3.7 Metagene analysis of the 3’ and 5’ ends of Ribo-seq footprints about the 3’ end of coding sequences in E. coli. (a) The peak at 10 nts to the right of the stop codon, which we interpret as stemming from ribosomes located with the stop codon in their A-site, indicates that the center of the P-site is 14 nts upstream of the most prevalent 3’ end of the Ribo-seq footprints. (b) A broad peak can be seen starting 13 nts and extending to 19 nts to the left of the stop codon.

3.8 Metagene analysis of the 3’ and 5’ ends of Ribo-seq footprints about the 3’ end of coding sequences in B. subtilis. (a) The peaks at 9 and 10 nts to the right of the stop codon, which we interpret as stemming from ribosomes located with the stop codon in their A-site, indicate that the center of the P-site is ∼14 nts upstream of the most prevalent 3’ end of the Ribo-seq footprints. (b) Due to the broad Ribo-seq length histogram seen in Fig. 3.5, the precise location of a peak is difficult to determine based on the 5’ ends of the footprints.

3.9 Representation of AUG as compared to AGU in Bacteroidetes, Proteobacteria, and Firmicutes. Boxes shaded blue signify that the individual phyla were found to be statistically significantly different from one and less than one (padj ≤ 0.05). Brackets connecting two boxes highlight comparison between the two phyla found to be statistically significant (one star signifies a padj ≤ 0.05).

3.10 Representation of AUG as compared to GAU in Bacteroidetes, Proteobacteria, and Firmicutes. Boxes shaded blue signify that the individual phyla were found to be statistically significantly different from one and less than one (padj ≤ 0.05). Brackets connecting two boxes highlight comparison between the two phyla found to be statistically significant (one star signifies a padj ≤ 0.05).

3.11 Representation of AUG as compared to GUA in Bacteroidetes, Proteobacteria, and Firmicutes. Boxes shaded blue signify that the individual phyla were found to be statistically significantly different from one and less than one (padj ≤ 0.05). Brackets connecting two boxes highlight comparison between the two phyla found to be statistically significant (one star signifies a padj ≤ 0.05; two stars signify a padj ≤ 0.01).

3.12 Distance between the putative -7 element and the predicted TSS based on the 434 putative elements identified. We deemed promoters found in the sequences comprising the peaks at -4 and -5 to be bona fide.

3.13 Nucleotide frequencies as a function of position from the upstream regions of those sequences comprising the peaks 4 and 5 away from the TSS in Fig. 3.12. Evidence of the -7 element (TANNTTTG) is clear. However, the -33 element (TTTG) is much less pronounced.

3.14 Sequence logo from the upstream regions from those sequences comprising the peaks at positions -4 and -5 from Fig. 3.12. Evidence of the -7 element (TANNTTTG) is clear. However, the -33 element (TTTG) is much less pronounced.

3.15 Comparison of folded regions of all genes in the representative set 30, 50, and 100 nt upstream and 100 nt downstream of annotated translation initiation sites. Without explicit knowledge of start sites in F. johnsoniae, these data demonstrated to us that the pairing probability per position was not especially altered by increasing the length over which we performed the in silico fold. This led us to the conclusion that a 100 nt stretch upstream of the start site would capture the majority of 5’ UTRs in our representative gene set for F. johnsoniae.

3.16 Comparison of folded regions of octile 1 genes 30, 50, and 100 nt upstream and 100 nt downstream of annotated translation initiation sites. These data demonstrate that the pairing probability per position within the subset of the highly translated genes is not affected by increasing the length over which we performed the in silico fold.

3.17 Comparison of folded regions of octile 8 genes 30, 50, and 100 nt upstream and 100 nt downstream of annotated translation initiation sites. These data demonstrated to us that the pairing probability per position within the subset of the least translated genes is not especially altered by increasing the length over which we performed the in silico fold.

3.18 Average pairing probability per position for each octile in F. johnsoniae. There is a clear trend that a decrease in pairing probability correlates with an increase in ARD.

3.19 (a) Comparison of average pairing probability per position between octiles 1 and 8 in F. johnsoniae, where octile 1 represents high-ARD genes and octile 8 low-ARD genes. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (padj ≤ 2.5e-4).

3.20 Average pairing probability per position for each octile in E. coli. There is a clear trend that a decrease in pairing probability correlates with an increase in ARD.

3.21 (a) Comparison of average pairing probability per position between octiles 1 and 8 in E. coli, where octile 1 represents high-ARD genes and octile 8 low-ARD genes. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (padj ≤ 2.5e-4).

3.22 Average pairing probability per position for each octile in B. subtilis. There is a clear trend that a decrease in pairing probability correlates with an increase in ARD.

3.23 (a) Comparison of average pairing probability per position between octiles 1 and 8 in B. subtilis, where octile 1 represents high-ARD genes and octile 8 low-ARD genes. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (padj ≤ 2.5e-4).

3.24 Nucleotide frequencies in the vicinity of the start codon, which sits at positions 0-2, in B. subtilis.

3.25 (a) Comparison of the frequency of G between octile 1 and octile 8 in B. subtilis. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (padj ≤ 8.2e-4).

3.26 (a) Comparison of the frequency of A between octile 1 and octile 8 in B. subtilis. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (padj ≤ 8.2e-4).

3.27 Nucleotide frequencies in the vicinity of the start codon, which sits at positions 0-2, in E. coli.

3.28 (a) Comparison of the frequency of G between octile 1 and octile 8 in E. coli. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (padj ≤ 8.2e-4).

3.29 (a) Comparison of the frequency of A between octile 1 and octile 8 in E. coli. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (padj ≤ 8.2e-4).

3.30 (a) Comparison of the frequency of T between octile 1 and octile 8 in E. coli. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (padj ≤ 8.2e-4).

3.31 Nucleotide frequencies in the vicinity of the start codon, which sits at positions 0-2, in F. johnsoniae.

3.32 (a) Comparison of the frequency of A between octile 1 and octile 8 in F. johnsoniae. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (padj ≤ 8.2e-4).

3.33 Comparison of the frequency of A between all octiles in F. johnsoniae.

3.34 (a) Comparison of the frequency of T between octile 1 and octile 8 in F. johnsoniae. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (padj ≤ 8.2e-4).

3.35 Comparison of the frequency of T between all octiles in F. johnsoniae.

4.1 Comparison of RNA-seq libraries from Fhit-negative (E1) and Fhit-positive (D1) H1299 cell lines. Shown are scatterplots of duplicate RNA-seq libraries from Ponasterone A-treaded H1299 cells with the Spearman coefficient for each shown in the green box. The E1 cell line is stably transfected with empty vector and the D1 cell line carries an inducible Fhit transgene...... 80 4.2 The log2 change in steady state levels of the mRNA is shown as a function of the RNA-seq read counts. All mRNAs that underwent a statistically significant change in association with Fhit expression (p < 0.05) are indicated with red circles, and we arbitrarily selected a 1.5-fold change as a cutoff for further study. These transcripts and their fold changes can be found in [53]...... 81 4.3 Comparison of ribosome profiling libraries from Fhit-negative (E1) and Fhit- positive (D1) H1299 cell lines. Shown are scatterplots of duplicate ribosome profiling libraries from Ponasterone A-treaded H1299 cells with the Spear- man coefficient for each shown in the green box. The E1 cell line is stably transfected with empty vector and the D1 cell line carries an inducible Fhit transgene...... 82 4.4 Metagene analysis shows periodicity of bound ribosomes. The 5’ ends of reads of each of the ribosome protected fragments were mapped with respect to the start (A) and stop (B) codons. This confirmed a 3 nucleotide periodicity consistent with the triplicate nature of codons and the peak 13 nucleotides upstream of start codons is consistent with ribosomes bound at these locations. 83 4.5 The log2 change in the coding region for each mRNA is shown as a function of Ribo-seq read counts. Statistically significant changes are indicated by red circles. These transcripts and their fold changes can also be found in [53]. 85 4.6 Average ribosome density of Fhit-negative and Fhit-expressing H1299 cells. Shown are scatterplots of average ribosome densities from ribosome profiling libraries of H1299 cells. 
The Spearman coefficient for each plot is shown in the green box. The E1 cell line (A) is stably transfected with empty vector and the D1 cell line (B) carries an inducible Fhit transgene. Both cell lines were treated with Ponastrerone A...... 86

xvi 4.7 RiboDiff was used to normalize the coding region Ribo-seq data in Fig. 4.5 to the RNA-seq data in Fig. 4.2. Fhit-mediated changes in the resulting average ribosome density (ARD) are plotted as a function of RNA-seq read counts. A cutoff of 1.5-fold was selected. mRNAs that underwent a statistically significant change with Fhit (p < 0.05) are labeled and indicated with red dots. 87 4.8 Fhit-mediated changes in ribosome distribution across mRNAs idenfied by coding region ARD. (a) Ribosome distribution is shown across the mRNA, identified in the upper right corner of each box, as having increased coding region ARD in Fhit-positive (blue) vs. Fhit-negative (red) cells. (b) Ribosome distribution is shown across the mRNAs, identified in the upper right corner of each box, as having decreased coding region ARD in Fhit-positive (blue) vs. Fhit-negative (red) cells. The coding region of each mRNA is shaded. . 88 4.9 Fhit-mediated changes in targets protein expression are independent of mRNA level. (a) D1 and E1 cells treated for 24 hr with Ponasterone A were an- alyzed for changes in TP53I3, IFIT1, EEF2, and Vinculin. Shown are the representative Western blots for each of triplicate determinations and the fold change normalized to Vinculin is presented beneath the panels for TP53I3, IFIT1, and EEF2. (b) Cytoplasmic extracts from these cells were spiked with luciferase RNA as a control for sample recovery and analyzed by RT-qPCR for changes in steady-state levels of TP53I3, EEF2, IFIT1, and Vinculin mRNA. The mean fold change ± SEM (n = 3) determined for each mRNA by RT-qPCR (black bars) is shown alongside the average difference from the duplicate RNA-seq datasets. (c) The analysis in (a) was repeated for ADAM9 and FASN, with U2AF35 used as a normalization control...... 89 4.10 Fhit-mediated changes in ribosome distribution across mRNAs identified by changes in 5’ UTR ARD. 
Ribosome distribution is shown across the mRNAs identified as having increased 5’ UTR ARD in Fhit-positive (blue) vs Fhit-negative (red) cells. The 5’ UTR of each mRNA is shaded...... 91 4.11 Fhit-mediated changes in ribosome distribution across mRNAs identified by changes in 5’ UTR ARD. Ribosome distribution is shown across the mRNAs identified as having decreased 5’ UTR ARD in Fhit-positive (blue) vs Fhit-negative (red) cells. The 5’ UTR of each mRNA is shaded...... 92 4.12 Identification of mRNAs with Fhit-mediated changes in ribosome occupancy of 5’ UTR versus coding region. RiboDiff was used to determine ribosome occupancy of 5’ UTRs vs. the coding region. The results of which were used to identify mRNAs for which Fhit expression altered the ratio between these features. The data were plotted as a function of Ribo-seq read counts, with those mRNAs showing statistically significant changes identified by name and as red dots. The full list of mRNAs tested can be found in [53] ...... 93

4.13 Fhit changes the ribosome occupancy ratio of 5’ UTR versus coding region. Ribosome distribution is shown across the mRNAs identified in Tab. 4.2 as having changes to 5’ UTR/coding region ribosome occupancy in Fhit-positive (blue) vs. Fhit-negative (red) cells. Omitted are mRNAs from Tab. 4.2 whose ribosome distribution is presented in Figs. 4.10 and 4.11. The 5’ UTR of each mRNA is shaded. (a) Five mRNAs showed increased ribosome loading in both the 5’ UTR and the coding region, but the increases observed in the 5’ UTR were larger than those observed in the coding region. (b) Four mRNAs exhibited an increase in 5’ UTR-bound ribosomes while their coding regions remained essentially unchanged. (c) Two mRNAs showed decreased 5’ UTR ribosome occupancy in Fhit-positive versus Fhit-negative cells, but increased coding region ribosome occupancy. (d) Two mRNAs showed decreased 5’ UTR ribosome occupancy in Fhit-positive versus Fhit-negative cells but little change in coding region ribosome occupancy ...... 95

4.14 ADAM9 (NM_003816) with uORF in 5’ UTR. Cells expressing Fhit show high ribosome occupancy within a uORF found in the 5’ UTR of ADAM9. uORFs have mostly non-canonical (CUG, GUG, UUG) start codons, and some extend into the coding region, raising the possibility that the protein products translated from these mRNAs could be affected by the loss of Fhit ...... 97

List of Tables

Table Page

3.1 Start codon usage in protein coding genes in F. johnsoniae, E. coli, and B. subtilis ...... 49

3.2 AUG representation with respect to “control” trinucleotides – AGU, GUA, GAU – in the vicinity of the start codon (also known as the translation initiation site, TIS) and coding sequence midpoint in F. johnsoniae ...... 51

3.3 AUG representation with respect to “control” trinucleotides – AGU, GUA, GAU – in the vicinity of the start codon (also known as the translation initiation site, TIS) and coding sequence midpoint in E. coli ...... 52

3.4 AUG representation with respect to “control” trinucleotides – AGU, GUA, GAU – in the vicinity of the start codon (also known as the translation initiation site, TIS) and coding sequence midpoint in B. subtilis ...... 52

4.1 mRNAs showing Fhit-dependent changes in ribosome loading. The average ribosome density (ARD) for Ponasterone A-treated E1 and D1 cells was determined using RiboDiff and only those transcripts with an adjusted p value < 0.05 were considered further. Column 3 shows the log2 induction ratio +Fhit(D1)/-Fhit(E1). The upper panel lists mRNAs with changes in ribosomes bound to coding regions. The lower panel lists mRNAs with changes in ribosomes bound to the 5’ UTR using a cutoff of 25 nucleotides upstream of the annotated start codon ...... 84

4.2 mRNAs showing Fhit-dependent changes in ribosomes bound to the 5’ UTR versus coding sequence. RiboDiff was used to determine relative changes in ribosome occupancy of 5’ UTR versus coding region as a function of Fhit expression, using ARD data from Ponasterone A-treated E1 and D1 cells. Only those transcripts with padj < 0.05 were considered further. The complete list of genes, their relative changes and statistical analysis are available in [53] ...... 94

Chapter 1 Introduction

During the last century, the celebrated Austrian physicist and co-founder of quantum theory Erwin Schrödinger, in his book What is Life? [1], considered life through the lens of physics and outlined an idea of how genetic information might be stored within a cell. In it he drew explicit comparisons to Morse code. In 1953, James D. Watson and Francis Crick published their findings on the molecular structure of deoxyribonucleic acid (DNA), the now famous double helix. Crick explicitly acknowledged his intellectual debt to Schrödinger’s ideas [2].

Six years later, in his lecture There’s Plenty of Room at the Bottom! [3], Richard P. Feynman mentioned that physicists often disparage biology for not using enough mathematics, but he confessed that technological advancements were needed to answer some of the fundamental biological questions, such as “How is the base order in the DNA connected to the order of amino acids in the protein?” and “What is the structure of the ribonucleic acid; is it single-chain or double-chain?”

Nucleic Acids and The Central Dogma

DNA and RNA

All cells require an enormous number of different proteins to function, survive, and proliferate. These proteins are needed for a variety of purposes, such as maintaining or changing structure, regulating cellular processes, organizing cell replication, driving transport, and creating new proteins themselves. The information required to create all these proteins is stored and transmitted by two nucleic acids found in each cell. Both deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are linear polymers composed of nucleotides. Each nucleotide is made up of a nitrogen-containing base, a five-carbon sugar, and a phosphate linker. Phosphodiester bonds link these nucleotides together by connecting the 3’ carbon atom in one sugar to the 5’ carbon atom in the five-carbon sugar of the adjacent nucleotide. Therefore all nucleic acids consist of a backbone of repeating sugar-phosphate units with bases extending off the side. The way in which nucleotides line up gives nucleic acids a directionality: at one end of the sugar-phosphate backbone there is a dangling phosphate group at the 5’ position of the five-carbon sugar, while at the other end there is a hydroxyl group at the 3’ position. The synthesis of these polynucleotides proceeds only in the 5’ to 3’ direction. This directionality and the sequence of nucleotides encode the genetic information necessary for the creation of proteins [4].

As depicted in Fig. 1.1, both nucleic acids are similar in basic structure. However, they differ from each other in specific ways. For DNA the five-carbon sugar is deoxyribose, while for RNA it is ribose. This difference makes DNA more chemically stable than RNA, reflecting its function as the long-term storage for genetic information, while RNA, being much less stable, acts as a temporary messenger. Three of the four bases making up either nucleic acid are common to both: adenine (A), cytosine (C), and guanine (G). Thymine (T) is found only in DNA and is replaced by uracil (U) in RNA. These bases can pair with one another by forming hydrogen bonds in what are known as Watson-Crick pairs: A-T (or A-U), with 2 hydrogen bonds, and G-C, with 3 hydrogen bonds. DNA is typically found as a double-stranded molecule composed of two anti-parallel DNA strands held together by the hydrogen bonds of complementary bases. RNA, on the other hand, is usually found as a single-stranded molecule [4; 5].

The Universality of the Central Dogma

Cells on Earth are, in general, one of two types: prokaryotic or eukaryotic. Prokaryotes (e.g. bacteria and archaea) are typically smaller and less complex than eukaryotes (e.g. mammals, plants, etc.). Unlike prokaryotes, which do not contain a separate compartment to store their genetic material, eukaryotic cells house their DNA within a nucleus. Stretches of DNA, called genes, store the genetic information necessary for the synthesis of functional proteins. The multitude of reactions controlling the abundance of these gene products is called gene expression. Proteins are the final product of gene expression and the genes responsible for their creation are referred to as protein coding genes.

Figure 1.1: Cartoon depiction of deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). Both nucleic acids show similarities in basic structure, but differ in specific ways. DNA is typically found as a double-stranded molecule, while RNA is usually seen as a single-stranded molecule. Three of the four bases making up either nucleic acid are common to both: adenine (A), cytosine (C), and guanine (G). Thymine (T) is found only in DNA and is replaced by uracil (U) in RNA. Image adapted from [6].

However, there are many non-protein coding genes, which are also expressed through functional RNA molecules, such as ribosomal RNA (rRNA) and transfer RNA (tRNA). While these genes do not directly form proteins, they are instrumental in the process of protein synthesis. Those RNA molecules that do directly encode proteins are called messenger RNA (mRNA).

In general, the pathway from DNA to protein involves two complex processes: transcription (DNA → RNA) and translation (RNA → protein). These processes are a fundamental aspect of gene expression. In general, the mechanism of translation is conserved across prokaryotes and eukaryotes. While there are important variations in the mechanistic way that prokaryotic and eukaryotic cells create proteins, the information flow from DNA to protein is essentially the same. This principle is so fundamental to life that it is termed the “central dogma” of molecular biology.

Transcription of functional RNA or mRNA begins with the opening and unwinding of a small portion of the DNA double helix, exposing the bases on each strand. The non-coding strand (3’-5’) acts as a template for the synthesis of a complementary RNA molecule. The enzymes that perform this synthesis are RNA polymerases (RNAPs), large molecular protein complexes that consist of several subunits. These molecular machines catalyze the formation of the phosphodiester linkage between the nucleotides of elongating RNA chains. As the RNA polymerase moves stepwise along the DNA, it unwinds the helix just ahead of the polymerization site, exposing the non-coding strand for complementary base pairing. The growing RNA chain extends one nucleotide at a time in the 5’-3’ direction through an exit channel at the back of the RNAP. Many copies of an RNA can be made from the same gene since the RNA strand is almost immediately released from the DNA as it is synthesized. The synthesis of additional RNA molecules can start before the creation of the first RNA is finished [4; 7].

Through translation, mRNA is decoded to produce proteins. At one level these polynucleotide chains are a sequence of A, C, G, and U. This “alphabet” spells out “code words” known as codons. Every codon is a combination of three nucleotides that is interpreted as a single amino acid in a polypeptide chain, as seen in Fig. 1.2. Any of the four letters can occupy each of the three positions within a codon, so there are $4^3 = 64$ possible nucleotide triplets. 61 of these stand for one or another of the 20 amino acids [8], while the remaining three codons (UGA, UAG, and UAA) signal the termination of translation of an mRNA. Most amino acids are encoded by more than one codon, with the exception of methionine (AUG) and tryptophan (UGG). Codons that correspond to the same amino acid are referred to as synonymous. This characteristic makes the genetic code degenerate. However, codons are read as a non-overlapping set. Thus each mRNA nucleotide is part of one codon and there are no additional nucleotides between two adjacent codons. This means that there are three ways to group the nucleotides of a given sequence into codons, yielding three different codon sets, each of which could be translated into a completely different amino acid sequence [9]. These three codon sequences are called reading frames. Each mRNA has three possible reading frames, depending on where the decoding process begins, and usually only one encodes a functional protein. This requires that the proper reading frame be set at the initiation of translation. The part of a reading frame that does not contain any stop codons is called an open reading frame (ORF). The correct ORF typically begins with an AUG as the first codon, although near-cognate start codons have been identified (see Fig. 1.2). Proper reading frame alignment is critical, thus initiating codons aid in the positioning of the cell’s translational machinery into the correct reading frame. Commonly, the other two reading frames will contain stop codons soon after the site of translation initiation, thereby preventing the synthesis of non-functional proteins [4; 5; 7].
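The grouping of a sequence into three reading frames can be illustrated with a short Python sketch. The function names `reading_frames` and `first_orf` and the example sequence are hypothetical and chosen only for illustration. As a simplifying assumption, only the canonical AUG is treated as a start codon, so near-cognate start codons are ignored.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}  # the three termination codons
START_CODON = "AUG"                  # canonical start codon only (assumption)


def reading_frames(mrna):
    """Split an mRNA string into its three non-overlapping codon sets."""
    frames = []
    for offset in range(3):
        # Only complete triplets are kept; trailing nucleotides are dropped.
        codons = [mrna[i:i + 3] for i in range(offset, len(mrna) - 2, 3)]
        frames.append(codons)
    return frames


def first_orf(codons):
    """Codons from the first AUG up to the next stop codon (exclusive) or the end."""
    if START_CODON not in codons:
        return []
    start = codons.index(START_CODON)
    orf = []
    for codon in codons[start:]:
        if codon in STOP_CODONS:
            break
        orf.append(codon)
    return orf


# Hypothetical example sequence, not taken from any genome.
example = "GGAUGGCUUGGUAAGC"
for offset, codons in enumerate(reading_frames(example)):
    print(offset, codons, first_orf(codons))
```

In this made-up example only the frame with offset 2 contains an AUG, which illustrates why the start codon is needed to set the reading frame.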

Biophysics and Statistical Mechanics

Biophysics is a scientific discipline that seeks to answer questions concerning biological systems using problem-solving methods, approaches, and techniques originally developed to study non-living systems [11]. In other words, biophysics is the study of the physical processes governing living cells. Everything from the arrangement of atoms in the DNA molecule and the way networks of nerve cells communicate to the study of how environmental factors impact populations is within the scope of biophysics.

Figure 1.2: Codon table for bacteria and archaea according to NCBI [10]. Start codons are indicated by a superscript star (*). The canonical start codon, AUG, is highlighted in green, while near-cognates are highlighted in blue. Stop, or termination, codons are highlighted in red.

The subject of nucleic acid folding has interested researchers for many decades due to its importance in many biological processes, such as translation regulation and its impact on protein folding [12–14]. RNA folding has been well studied [15], but is continuously refined [16]. One tool that physicists typically use when attempting to describe the behavior of a large system, such as an RNA molecule, is statistical mechanics. The goal of statistical mechanics is to calculate the average values of the physical properties of a system comprised of a large number of constituent particles (e.g. nucleotides). These average values are computed over a statistical ensemble, which consists of states of a system chosen to best represent physical situations. Statistical mechanics gives one the tools to write down the probability of the many different configurations that are available to a system.

One fundamental result of statistical mechanics is the Boltzmann distribution, which tells us how the probability of a given state or configuration is determined by the energy of the system. This distribution plays the central role in statistical mechanics:

$$P(\varepsilon_i) = \frac{e^{-\varepsilon_i / k_B T}}{Z} \tag{1.1}$$

where $P(\varepsilon_i)$ is the probability of observing the state or configuration $i$ at energy $\varepsilon_i$, $T$ is the fixed temperature of the surrounding environment, $k_B$ is the Boltzmann constant, and the normalization factor $Z$ is the partition function. The partition function is defined as

$$Z = \sum_i e^{-\varepsilon_i / k_B T} \tag{1.2}$$

where the sum over $i$ includes all possible states or configurations of the system.

Average values of thermodynamic quantities, such as a system’s energy $U_i$ in state $i$, can be directly calculated as

$$\langle U \rangle = \sum_i U_i \, \frac{e^{-\varepsilon_i / k_B T}}{Z}. \tag{1.3}$$

Direct calculations of expectation values of physical quantities, such as the one above or other thermodynamic functions, depend on knowing the partition function of the system under investigation. However, biological systems are often far too complex for an exact calculation of the partition function. Simplifying assumptions are often needed (e.g. “spherical cows” [17]) to develop analytical models that capture the essential behavior of interest to a study.
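For a system with only a handful of states, the Boltzmann probabilities, the partition function, and a thermal average can be evaluated directly. The following Python sketch does this for a hypothetical three-state system; the function names, the energy values, and the choice of reduced units with $k_B = 1$ are assumptions made for illustration only.

```python
import math

K_B = 1.0  # Boltzmann constant in reduced units (assumption of this sketch)


def boltzmann_probabilities(energies, temperature):
    """Return (probabilities, Z) for a list of state energies."""
    weights = [math.exp(-e / (K_B * temperature)) for e in energies]
    z = sum(weights)  # partition function: sum of all Boltzmann weights
    return [w / z for w in weights], z


def average_energy(energies, temperature):
    """Thermal average of the energy, taking U_i equal to the state energy."""
    probs, _ = boltzmann_probabilities(energies, temperature)
    return sum(p * e for p, e in zip(probs, energies))


# Hypothetical three-state system.
energies = [0.0, 1.0, 2.0]
probs, z = boltzmann_probabilities(energies, temperature=1.0)
print(probs, z, average_energy(energies, temperature=1.0))
```

Such a brute-force sum is only feasible when the states can be enumerated, which is why an efficient way of computing the partition function matters for large systems.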

For many thermodynamic systems, the thermodynamic functions have been shown to display analytic discontinuities at particular temperatures, corresponding to the occurrence of a phase transition. Examples such as the melting of solids, the condensation of gases, and the spontaneous loss of magnetization are routinely cited [18; 19]. Numerous studies have shown evidence of the occurrence of phase transitions in biological systems as well [20–23]. One example of this is the study of RNA as a disordered system. In general, single-stranded RNA exists as a three-dimensional molecule. But a two-dimensional representation of the molecule’s secondary structure, along with further simplifying assumptions, gives an approximation that allows for the calculation of thermodynamic quantities. Due to its sequence heterogeneity, an RNA secondary structure can be shown to exist in one of two phases, the details of which will be addressed in the next chapter.

Bioinformatics

Bioinformatics lies at the intersection between computer science, statistics, and biology. Fundamental to its view is that one of life’s defining properties is information processing in various forms (e.g. information transmitted from DNA to both intra- and intercellular processes) [24]. Indeed, a more succinct definition of bioinformatics could be “the application of information science to biology” [25]. One could argue that bioinformatics is a direct extension of mathematical and computational biology into the emerging world of new, massive data sets [26]. Computational methods have been used since the early 1960s to handle expanding collections of amino-acid sequences [27]. In 1975, Frederick Sanger and Alan Coulson published a method of sequencing DNA [28], which they refined in 1977 [29]. The “Sanger Method” soon became, and for many years remained, the most common method used to sequence DNA [30]. By the end of the twentieth century, biological data were being produced at enormous rates [31; 32]. In 2001, 11.5 million raw DNA sequences were available for analysis [33]. Since the successful completion of the Human Genome Project in 2003 [34], rapid technological advances have dramatically decreased the costs of sequencing [35], allowing for the sequencing of diverse genomes. These technological advancements have also impacted the sequencing of RNA. Recent developments of novel high-throughput RNA sequencing and ribosome profiling methods have revolutionized genomics.

At its simplest level, the genome is a collection of genes. However, the genome is a highly complex, dynamic system that relies on the proper regulation of protein-coding genes and various types of regulatory elements [36–38] (i.e. elements, enhancers, non-coding RNAs, etc.). Isolating and studying regions of the genome that encode functional elements (such as mRNA) is one of the key aims of mapping the transcriptome. The transcriptome is the complete set of RNA transcripts in a cell. High-throughput RNA sequencing allows for cataloging the abundance of mRNAs (transcriptome) at a specific developmental stage of a cell or under specific physiological conditions [39]. Ribosome profiling methods have recently been developed to investigate the level of mRNA translation (translatome) and provide a critical step toward understanding the control of gene expression in a variety of organisms [40–42].

High-Throughput RNA Sequencing and Ribosome Profiling

High-throughput sequencing of RNA (RNA-seq) involves the isolation and fragmentation of the RNA molecules currently in a cell. Each fragment is then converted into a complementary DNA (cDNA) strand, and these strands are sequenced on a high-throughput sequencing platform that generates hundreds of millions of short reads (∼30-300 bases in length). Steps are then taken to map or align these reads against a reference genome. This method allows one to profile gene expression levels by measuring mRNA abundance [39; 43]. Analysis of these data serves goals such as gene annotation, discovery of new transcripts or gene isoforms, the study of mechanisms of gene regulation, differential gene expression analysis, and detection of RNA editing, to name a few [43]. This technique is not without its limitations. Size selection standards during the construction of RNA-seq libraries can make the quantification of small transcripts difficult; overlapping transcripts of two different genes can cause difficulties in read assignment; and the presence of reads that align to multiple places in the genome can impact quantitative results [44]. Additionally, a single standard data analysis method or pipeline does not currently exist. Scientists adopt different analysis strategies that depend on the organism they are studying and their research goals [45]. However, one could reasonably expect these limitations to be mitigated as sequencing technology and the tools involved in data analysis continue to evolve.

Translation is the final stage of gene expression involving nucleic acids. Ribosome profiling [41; 42; 46; 47] (Ribo-seq) is a recently developed method that allows for mapping the position of translating ribosomes over the entire transcriptome [48]. This ribosome-centric view can provide high-resolution, quantitative “global snapshots” of actively translating ribosomes across genes. The basic idea behind this technique is simple: during read library preparation the chemical cycloheximide is used to stall actively translating ribosomes; nuclease is then added to digest portions of mRNA that are not protected by ribosomes. These ribosome protected fragments are then recovered and sequenced. A defining feature of high-quality Ribo-seq libraries is a distinct read-length distribution typically peaked at ∼30 nt, reflecting the size of a ribosome actively translating an mRNA [46; 48]. This method has been used successfully in many studies focusing on both prokaryotic and eukaryotic cells [46; 47; 49–55]. When coupled with RNA-seq data, one can calculate a gene’s translation efficiency (TE), which is the ratio of the abundances of translated mRNA (Ribo-seq) and available mRNA (RNA-seq). This quantity is designed to remove transcriptional activity as a confounder of Ribo-seq abundance. TEs in experiments that compare a treatment to a control can then be used to identify genes whose translation might be altered [46; 56]. Like RNA-seq, this method is not without its limitations, such as experimentally introduced distortions caused by the need to rapidly inhibit active translation; contaminating footprint-sized fragments caused by rRNA and tRNA; and reads mapping to multiple genomic locations [57]. Despite these limitations, Ribo-seq has provided important and, at times, surprising insights into which proteins are being translated, previously unidentified proteins or protein isoforms, active translation within upstream ORFs (uORFs), how much of a protein is synthesized, and where the site of synthesis begins [42; 57]. In chapters 3 and 4, we use this technique, along with RNA-seq, to investigate RNA secondary structure’s role in translational initiation in Flavobacterium johnsoniae and how the loss of the fragile histidine triad protein (FHIT) impacts translation in human cells.
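The translation efficiency defined above is a simple ratio, which the following Python sketch makes explicit. It is a naive illustration under stated assumptions, not the statistical procedure used by dedicated tools: the function names, the pseudocount, and the read counts are hypothetical, and the counts are assumed to be already normalized for library size.

```python
import math


def translation_efficiency(ribo_count, rna_count, pseudocount=1.0):
    """Ratio of Ribo-seq to RNA-seq abundance for one gene.

    The pseudocount is an assumption of this sketch; it avoids division by zero.
    """
    return (ribo_count + pseudocount) / (rna_count + pseudocount)


def log2_te_change(ribo_treat, rna_treat, ribo_ctrl, rna_ctrl):
    """log2 ratio of the TE in a treatment relative to the TE in a control."""
    te_treat = translation_efficiency(ribo_treat, rna_treat)
    te_ctrl = translation_efficiency(ribo_ctrl, rna_ctrl)
    return math.log2(te_treat / te_ctrl)


# Hypothetical normalized read counts for a single gene.
print(log2_te_change(ribo_treat=399, rna_treat=99, ribo_ctrl=199, rna_ctrl=99))
```

In this made-up example the mRNA abundance is the same in both conditions, so the entire change in ribosome footprints is attributed to translation.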

Outline

This thesis is structured as follows. In the next chapter we will consider a model of RNA secondary structures from the viewpoint of statistical physics to investigate a phase transition that occurs below RNA’s denaturation temperature, in particular the transition between a weakly disordered state, where sequence heterogeneity does not play a large role in base pairing, and a strongly disordered state, where the sequence heterogeneity can no longer be overlooked. We will identify the transition’s critical temperature, introduce two new order parameters used to precisely measure the location of the transition, and study the possible mechanism behind this transition. In Chapter 3 we will return to a biological standpoint to investigate the role of RNA secondary structure in the initiation of translation in Flavobacterium johnsoniae with the aid of RNA-seq and Ribo-seq data. This prokaryotic organism, like other members of its phylum, has been shown to lack what is thought to be a critical component of translation initiation, namely, the Shine-Dalgarno sequence. In the final chapter, we will again use RNA-seq and Ribo-seq data to investigate how the loss of fragile histidine triad protein (FHIT) affects the translation of cancer-associated genes in human cells.

Chapter 2 Phase Transition of RNA Secondary Structures

Introduction

Heteropolymer folding is of biophysical interest due to its medical relevance. Misfolded proteins and nucleic acids are strongly implicated in the development of several neurological disorders, such as Alzheimer’s, Parkinson’s, and Lou Gehrig’s disease [58–63]. RNA, like DNA, is a polymer of four nucleotides (or monomers); for RNA these are adenine (A), cytosine (C), guanine (G), and uracil (U). A primary sequence made up of these monomers can form elaborate structures consisting of stacked Watson-Crick base pairs (A-U, C-G). However, unlike DNA, RNA exists mostly as a single-stranded molecule. This allows the strand to fold back onto itself, giving complementary monomers along its sugar-phosphate backbone the chance to pair with one another. This bending and pairing process leads to the creation of an RNA secondary structure. From a statistical physics standpoint, heteropolymer folding presents a challenging task for the physics of disordered systems. In particular, RNA secondary structures are one of the few disordered systems for which one can calculate the partition function in polynomial time [64].

Previous studies show that these structures exist within one of two well-identified phases. Above a critical temperature $T_c$, the system is in a phase where sequence disorder does not play a significant role. This simplification allows one to model any random base sequence, or a sequence with random pairing energies $\varepsilon_{i,j}$, as a homopolymer. The defining trait of this phase is that the partition function of long RNA molecules is dominated by an exponentially large number of secondary structures with comparable energies on the order of $O(k_B T)$. Beginning in 1968, de Gennes [65], while studying the folded homopolymer, laid some of the theoretical foundation of the folded polymer by showing that the probability of two bases pairing scales as $p(\ell) \sim \ell^{-\alpha}$, where $\ell$ is the distance between any two bases positioned at $i$ and $j = i + \ell$ for $1 \le i < j \le N$, and found that $\alpha = 3/2$. Later studies [66; 67] demonstrated that folded heteropolymers can be modeled as folded homopolymers within a certain temperature regime and solved analytically. These studies showed that the pairing probability of the four bases (A, U, G, C) matched the scaling behavior of de Gennes’ earlier study. The temperature regime in which this behavior is observed was termed the molten phase.

Figure 2.1: Diagrammatic and abstract representations of an RNA secondary structure. (a) The backbone of the molecule is represented by the solid gray line while the solid black lines stand for the hydrogen-bonded base pairs, where each base is depicted as a circle. The shape of the backbone is such that stems of stacked base pairs and the loops connecting or terminating them can be clearly seen. Stems form double-helical structures similar to that of DNA. (b) Each base pair is connected by a solid black line, while the red dashed lines create a pseudoknot. We exclude such pairings in our definition of a secondary structure. Pseudoknots such as these do not contribute much to the total energy and are often deemed part of an RNA molecule’s tertiary structure.

At low temperatures sequence disorder can no longer be ignored. This glass (or “frozen”) phase is characterized by the existence of a small number of low-energy secondary structures. In the glass phase the scaling exponent of the pairing probabilities was found numerically to be α ≈ 4/3 [68]. Many studies have attempted to characterize this system’s glass phase [66; 67; 69–76].

These two phases can be distinguished by disorder-induced replica correlations. Replicas are two secondary structures of the same RNA molecule at distant times (i.e. drawn independently from a thermal ensemble), as shown in Fig. 2.2(a) and (b). Correlations between replicas are defined by subsequent averaging over the disorder distribution. The arguments in [66; 67] concerning the probability of two bases being paired in both replicas suggest that replicas become independent at large $\ell$ in the molten phase, as depicted in Fig. 2.2(a). On the other hand, the probability for bases to pair is strongly correlated in the glass phase, as shown in Fig. 2.2(b).

Through the use of a renormalized field theory [70], Lässig and Wiese (LW) showed analytically that this freezing transition is of second order. They proved that their model is renormalizable at 1-loop order. They further argued that the value of the exponent α at the transition is the same as in the glass phase. The field theory was later refined by David and Wiese [71; 72], who showed that it is renormalizable to all orders in perturbation theory. At 2-loop order [71; 72] it predicts α = 3/2 in the molten phase and α = 1.34 ± 0.02 at the transition and in the glass phase.

However, questions concerning the location of this critical point, the existence and behavior of a suitable order parameter, and the transition’s driving mechanism remained open. In the present study, through the use of numerical simulations and analytical tools inspired by the field theory, we study two order parameters which both vanish at the transition. One of them is non-zero in the molten phase, the other in the frozen phase. This allows us to precisely locate the critical point. Measuring pairing probability distributions, we explore the mechanism driving this transition.

Figure 2.2: Secondary structures of a random RNA molecule at distant times. Base pairings can be nested, such as $(s, t)$ and $(s', t')$, or independent, such as $(s, t)$ and $(s'', t'')$. The pairing overlap is defined by the common base pairings between the left and right configuration (corresponding bases are shown in black). (a) Above $T_c$, the molecule contains conserved subfolds on scales up to the correlation length ξ (indicated by shading) and is molten on large scales. (b) Below $T_c$, the molecule is locked into its minimum-energy structure on all scales, up to rare fluctuations (unshaded). Image adapted from [70].

This chapter is organized as follows: In Sec. II we define the model. Sections III-IV detail our efforts to locate the critical temperature, establish the existence of an order parameter, and probe the transition mechanism. In Sec. V we discuss our results.

RNA Secondary-Structure Model

RNA Secondary Structures

Each RNA molecule is described by its sequence of $N$ bases $b_1 \ldots b_N$. Given a molecule, an RNA secondary structure can then be described by a list of the ordered pairs $(i, j)$ with $i < j$ representing the base pairs formed by the molecule. Following methodologies developed in previous studies in the field of RNA secondary structure formation [66; 67], we require each base to pair with no more than one other base and we only consider structures that exclude pseudo-knots. The latter is accomplished by requiring that any two base pairs $(i, j)$ and $(k, l)$ with $i < k$ satisfy either $i < j < k < l$ or $i < k < l < j$. While pseudo-knots are biologically relevant [77; 78], the no pseudo-knot constraint is necessary to make both analytical and numerical calculations feasible and is justified by their infrequent occurrence in real folded RNA [15]. We consider any pseudo-knots and base triples as parts of the tertiary structure of the molecule.
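The two constraints on a valid secondary structure (each base pairs at most once, and no pseudo-knots) can be checked mechanically. The Python sketch below is one possible way to do so; the function name `is_valid_secondary_structure` and the example structures are hypothetical and serve only as an illustration of the definition.

```python
def is_valid_secondary_structure(pairs, n):
    """Check the constraints of the model for base pairs on a sequence of n bases.

    pairs: iterable of (i, j) tuples with 1 <= i < j <= n (1-based positions).
    """
    pairs = sorted(pairs)
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= n):
            return False
        if i in used or j in used:  # each base pairs with at most one other base
            return False
        used.update((i, j))
    # No pseudo-knots: two pairs (i, j) and (k, l) with i < k must be either
    # sequential (i < j < k < l) or nested (i < k < l < j).
    for a, (i, j) in enumerate(pairs):
        for k, l in pairs[a + 1:]:
            if not (j < k or l < j):
                return False
    return True


print(is_valid_secondary_structure([(1, 10), (2, 5), (6, 9)], 10))  # nested and sequential pairs
print(is_valid_secondary_structure([(1, 5), (3, 8)], 10))           # crossing pairs (pseudo-knot)
```

The pairwise comparison costs a number of steps quadratic in the number of base pairs, which is sufficient for an illustration.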

Energy Model

In order to fully describe the statistical physics of RNA secondary structures we have to complement the definition of valid secondary structures by an energy function that assigns an energy to every structure. While very sophisticated energy models are available [79], which allow detailed quantitative descriptions of the folding of actual RNA molecules, in this chapter we will follow previous studies [66] that focus on more universal aspects of RNA secondary structure formation and adopt a simplified RNA energy model. Specifically, we only consider contributions from each base pair of the structure and associate with the pair of bases at positions $i$ and $j$ an interaction energy $\varepsilon_{i,j}$. Thus, the total energy of a structure $S$ is calculated as

$$E[S] = \sum_{(i,j) \in S} \varepsilon_{i,j}. \tag{2.1}$$

In principle, the interaction energies i,j should depend on the identities of the bases bi and bj rendering the i,j random variables with discrete values and a complicated correlation structure if the RNA sequences are chosen randomly. However, since we are here interested in universal properties of phase transitions, we further simplify our model [66] by choosing these interaction energies as independent Gaussian random variables taken from the distribution

\[ \rho(\epsilon) = \frac{1}{\sqrt{2\pi D}}\, e^{-\epsilon^2/(2D)} \tag{2.2} \]
with mean energy zero and variance D = 1. This kind of uncorrelated Gaussian disorder model is expected to have less severe finite-size effects [68] than other choices of disorder models. While we expect our model to exhibit a phase transition at some finite temperature, we also want to capture information concerning the ground-state structure. To do so we use the following equation [80] to calculate the ground-state energies

\[ E_{i,j} = \min\Big[ E_{i,j-1},\ \min_{i \le k < j} \big( E_{i,k-1} + \epsilon_{k,j} + E_{k+1,j-1} \big) \Big] \tag{2.3} \]
where we set $E_{i,j} = 0$ for all $i \ge j$.
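A ground-state recursion of this type can be sketched in a few lines. The sketch below assumes that the minimum is taken between leaving base j unpaired and pairing it with some base k in the substrand, with empty and single-base substrands contributing zero energy. The function name and the `energy(k, j)` callable, which stands in for the pairing energy of bases k < j, are hypothetical.

```python
def ground_state_energy(n, energy):
    """Ground-state energy of a strand of n bases (1-based indexing).

    `energy(k, j)` is a hypothetical callable returning the pairing energy
    of bases k < j. E[i][j] = 0 whenever i >= j.
    """
    E = [[0.0] * (n + 2) for _ in range(n + 2)]
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            best = E[i][j - 1]  # base j stays unpaired
            for k in range(i, j):  # base j pairs with base k
                candidate = E[i][k - 1] + energy(k, j) + E[k + 1][j - 1]
                best = min(best, candidate)
            E[i][j] = best
    return E[1][n]
```

Because the pair (k, j) splits the substrand into an outer piece i..k-1 and an inner piece k+1..j-1, crossing pairs never appear, so the result respects the no pseudo-knot constraint by construction.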

Partition Function

Once an energy has been assigned to each secondary structure S, the partition function is calculated as
\[ Z(N) = \sum_{S \in \Omega(N)} e^{-\beta E[S]} \tag{2.4} \]
where Ω(N) is the set of allowed secondary structures of N bases, and $\beta = 1/k_BT$. In the absence of pseudo-knots, the partition function can be studied by considering substrands of the total sequence from base i to base j. The restricted partition function $Z_{i,j}$ for these substrands obeys the recursive equation [64]

\[ Z_{i,j} = Z_{i,j-1} + \sum_{k=i}^{j-1} Z_{i,k-1}\, e^{-\beta \epsilon_{k,j}}\, Z_{k+1,j-1}, \tag{2.5} \]
in which by convention $Z_{i,i-1} = 1$. In this recursion, the right hand side involves only restricted partition functions for shorter substrands than the one on the left hand side. Thus, the total partition function, $Z(N) = Z_{1,N}$, can be calculated by progressing from the shortest substrands to longer ones in $O(N^3)$ time.

Figure 2.3: The exponent α1 and α2 as a function of temperature. The continuously changing nature of this quantity makes it a poor indicator of the location of the phase transition.
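The recursion of Eq. (2.5) translates directly into three nested loops. The following is a minimal sketch rather than the code used for the results in this chapter; the function name and the `energy(k, j)` callable, which stands in for the random pairing energies, are hypothetical.

```python
import math
import random


def restricted_partition_functions(n, energy, beta):
    """Table Z[i][j] of restricted partition functions (1-based indexing).

    `energy(k, j)` is a hypothetical callable returning the pairing energy
    of bases k < j. Entries with j < i keep the conventional value 1.
    """
    Z = [[1.0] * (n + 2) for _ in range(n + 2)]
    for length in range(1, n + 1):          # shortest substrands first
        for i in range(1, n - length + 2):
            j = i + length - 1
            total = Z[i][j - 1]             # base j unpaired
            for k in range(i, j):           # base j paired with base k
                total += Z[i][k - 1] * math.exp(-beta * energy(k, j)) * Z[k + 1][j - 1]
            Z[i][j] = total
    return Z


# Illustrative usage with Gaussian energies of unit variance (D = 1).
rng = random.Random(0)
n_bases = 30
table = {(k, j): rng.gauss(0.0, 1.0)
         for k in range(1, n_bases + 1) for j in range(k + 1, n_bases + 1)}
Z = restricted_partition_functions(n_bases, lambda k, j: table[(k, j)], beta=2.0)
total_Z = Z[1][n_bases]
```

The three loops make the cubic cost explicit. For the sequence lengths and low temperatures considered here, a plain floating-point table like this one would likely overflow, so a production version would need some form of rescaling or log-space arithmetic, which this sketch omits.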

Observables

From the partition function of RNA secondary structures, we can calculate a variety of physical observables that characterize the structural ensemble. Primarily, we study the pairing probability, $p_{i,j}$, for a given base pair (i, j). This probability can be calculated as
\[ p_{i,j} \equiv \frac{e^{-\beta \epsilon_{i,j}}\, Z_{i+1,j-1}\, Z_{j+1,i-1}}{Z_{1,N}} \tag{2.6} \]
where $Z_{i+1,j-1}$ can be calculated from Eq. (2.5) and $Z_{j+1,i-1}$ is the partition function of the sequence $b_{j+1} b_{j+2} \ldots b_N b_1 \ldots b_{i-2} b_{i-1}$. This last quantity can be obtained when the recursion Eq. (2.5) is applied to a duplicated sequence $b_1 \ldots b_N b_1 \ldots b_N$ and calculated as $Z_{j+1,N+i-1}$.
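The duplicated-sequence construction can be illustrated as follows. This is a sketch under our own naming assumptions (`restricted_Z`, `pairing_probabilities`, and the `energy(k, j)` callable are hypothetical): the recursion is run on a strand of length 2N whose energies repeat periodically, and only substrands of length at most N are filled in, since longer ones are never needed.

```python
import math


def restricted_Z(m, energy, beta, max_len):
    """Restricted partition functions on m bases for substrands up to max_len."""
    Z = [[1.0] * (m + 2) for _ in range(m + 2)]
    for length in range(1, max_len + 1):
        for i in range(1, m - length + 2):
            j = i + length - 1
            total = Z[i][j - 1]
            for k in range(i, j):
                total += Z[i][k - 1] * math.exp(-beta * energy(k, j)) * Z[k + 1][j - 1]
            Z[i][j] = total
    return Z


def pairing_probabilities(n, energy, beta):
    """Pairing probabilities p[(i, j)] for all 1 <= i < j <= n."""
    def doubled(k, j):
        # Map positions on the duplicated strand back to the original bases.
        a, b = sorted(((k - 1) % n + 1, (j - 1) % n + 1))
        # a == b would pair a base with its own copy; this cannot occur in
        # substrands of length <= n, so the fallback value is never used.
        return energy(a, b) if a != b else 0.0

    Z = restricted_Z(2 * n, doubled, beta, n)
    p = {}
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            inside = Z[i + 1][j - 1]        # bases enclosed by the pair
            outside = Z[j + 1][n + i - 1]   # bases j+1..N followed by 1..i-1
            p[(i, j)] = math.exp(-beta * energy(i, j)) * inside * outside / Z[1][n]
    return p
```

With all energies set to zero and N = 4, there are nine allowed structures, two of which contain the pair (1, 4) and one of which contains the pair (1, 3), so the sketch returns 2/9 and 1/9 for these pairs.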

We are specifically interested in the dependence of the ensemble-averaged base-pairing probability on the distance $|i - j| = \ell$ between the two bases. Within the molten phase, and for large $\ell$, this quantity has a power-law dependence
\[ \langle p(\ell)^n \rangle \equiv \langle p^n_{i,i+\ell} \rangle \sim \left[ \frac{\ell\,(N - \ell)}{N} \right]^{-\alpha_n}, \tag{2.7} \]
where the brackets stand for the ensemble average over the random base-pairing energies $\epsilon_{i,j}$ and $\alpha_n$ is a critical exponent. The additional parameter, n, labels the moment and highlights the behavior of the critical exponent unique to each phase. For n = 1, 2, $\langle p(\ell)^n \rangle$ is numerically analogous to the base pair contact field, $\langle \Phi(i,j) \rangle$, and the replica overlap field, $\langle \Psi(i,j) \rangle$ (defined as $\Psi_{a,b}(i,j) = \Phi_a(i,j)\Phi_b(i,j)$), in the LW field theory [70], respectively. The notation used there is

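\[ \alpha_1 = \rho, \qquad \alpha_2 = \theta. \tag{2.8} \]

These exponents are not independent. In the molten phase [65]
\[ \langle p(\ell)^n \rangle \simeq \langle p(\ell) \rangle^n \;\Longrightarrow\; \alpha_n = n\alpha_1 = \frac{3}{2}\, n. \tag{2.9} \]

In the glass phase one expects [68; 70; 73]
\[ \langle p(\ell)^n \rangle \simeq \langle p(\ell) \rangle \;\Longrightarrow\; \alpha_n = \alpha_1 \approx \frac{4}{3}. \tag{2.10} \]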

In principle, this difference alone should allow for the determination of $T_c$. However, as one can see in Fig. 2.3, the behavior of $\alpha_1$ across the transition, which we will show later is at $T_c \approx 0.53\,\sqrt{D}/k_B$, is rather smeared out. The logarithm of each pairing probability, in the form
\[ \Delta F_{i,j} = -k_B T \ln(p_{i,j}), \tag{2.11} \]
has been interpreted as the “pinching free energy” [66], which is the free energy difference between a pinch between the i and j positions and the unperturbed, or unpinched, state.

In Sec. 2.3, as in previous studies [66], we will show how we can use the free energy of the largest possible pinch,
\[ \Delta F(N) \equiv \Delta F_{1,\frac{N}{2}+1}, \tag{2.12} \]
to obtain an estimate of the critical temperature. Here the pinch between positions 1 and N/2 + 1 is treated as representative of the pinching energies at all N different positions, and it splits the sequence of N bases into two equal pieces of length N/2 − 1.

Numerical Approach

To investigate the RNA secondary structure glass transition, we use RNA molecules of lengths within the range of 500 ≤ N ≤ 4000. For each length, many independent realizations of the random base-pairing energies $\epsilon_{i,j}$ are chosen, and the partition function of all allowable secondary structures is calculated for each realization of the random variables using Eq. (2.5). Then, the observables described above are calculated for each realization of the random variables and averaged over those realizations (as well as over the starting base i in the case of the base-pairing probability). To keep the numerical effort manageable, we averaged over 20,000 samples for N = 500, over 10,000 samples for N ∈ {750, 1000, 1500}, over 5,000 samples for N = 2000, and over 1,000 samples for N = 4000. We also varied the temperature in a range of $0.1 \le k_BT/\sqrt{D} \le 1.0$ to capture behavior deep within each phase and close to the phase transition.

Scaling of the contact and overlap observables

To understand the model’s behavior at the transition, we must know its precise location. Previous studies [66; 74] using different disorder models found that thermodynamic signatures at the phase transition are quite weak, making an exact localization of the critical temperature $T_c$ challenging.

Initial Estimate of Phase Transition Temperature

To obtain a first estimate of the transition temperature, we follow [66] and consider the disorder-averaged pinch free energy $\langle \Delta F \rangle$. As seen in the inset of Fig. 2.4, this quantity has a logarithmic dependence on the sequence length N at both low and high temperatures [66]. We thus fit it to the linear form
\[ \langle \Delta F(N) \rangle = a(T) \ln(N) + c(T) \tag{2.13} \]
for sequence lengths in the range 500 ≤ N ≤ 2000. The resulting prefactor a(T) is shown in the main part of Fig. 2.4. As expected [65], in the molten phase $a(T) \approx \frac{3}{2} k_B T$, resulting in a linear dependence of the prefactor on temperature. However, for lower temperatures, the prefactor is nonmonotonic in temperature and thus lends itself as an estimator of the critical temperature. The intersection of linear fits on either side of the minimum yields $T_c \approx 0.51\,\sqrt{D}/k_B$.

Figure 2.4: Pinch free energy prefactor a(T) vs temperature. The inset shows the disorder-averaged pinch free energy vs ln(N) in both the glass phase ($k_BT = 0.1\sqrt{D}$) and the molten phase ($k_BT = 1.0\sqrt{D}$). This allows us to extract the prefactors of these logarithmic behaviors, which are shown as a function of temperature in the main graph. The temperature at which the prefactor changes slope is an estimate of the phase transition temperature $T_c$.
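Extracting the prefactor a(T) at a single temperature amounts to an ordinary least-squares fit of the disorder-averaged pinch free energy against ln(N). The sketch below shows one way this could be done; the function name is hypothetical, and the input values in the usage lines are synthetic numbers generated from the expected molten-phase form, not data from this study.

```python
import math


def fit_log_prefactor(lengths, mean_pinch_free_energy):
    """Least-squares fit of <dF(N)> = a * ln(N) + c; returns (a, c)."""
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    x_bar = sum(xs) / count
    y_bar = sum(mean_pinch_free_energy) / count
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, mean_pinch_free_energy))
    a = sxy / sxx
    return a, y_bar - a * x_bar


# Synthetic illustration only: values built from a = 1.5 * kT with kT = 1.
lengths = [500, 750, 1000, 1500, 2000]
synthetic = [1.5 * math.log(n) + 0.3 for n in lengths]
a_fit, c_fit = fit_log_prefactor(lengths, synthetic)
```

Repeating such a fit at each temperature would give the curve a(T), whose change of slope serves as the estimator described above.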

Figure 2.5: Average pairing probability versus substrand length. The upper figure shows how the pairing probability scales as Eq. (2.7) in the glass phase at the lowest temperature considered, $k_BT/\sqrt{D} = 0.1$. As mentioned in the text, within this phase the critical exponent, α, is insensitive to the value of n. The lower figure shows how the pairing probability scales as Eq. (2.7) in the molten phase at the highest temperature considered, $k_BT/\sqrt{D} = 1.0$, but demonstrates how the value of α directly depends on the value of n.

Order Parameters for the Transition, and a More Precise Estimation of the Phase Transition Temperature

Using this value of $T_c$ as a guide, we limited the range over which we performed our calculations to $0.4 \le k_BT/\sqrt{D} \le 0.7$ for a sequence length of N = 4000. We then calculated the disorder-averaged base-pairing probabilities $\langle p(\ell)^n \rangle$ for each temperature. As mentioned in the previous section, this quantity scales like the power law shown in Eq. (2.7), which is expected to hold in the limit of very large system size N and large $\ell$. Figure 2.5 shows that the power law holds, for values of n = 1, 2, in both the low-temperature and high-temperature regimes.

By itself, the base-pairing probability $\langle p(\ell)^n \rangle$ does not give a clear indication of when a finite system of length N approaches $T_c$ from either side. This leads us to define two additional observables that do. The first is the ratio $\langle p(\ell)^2 \rangle / \langle p(\ell) \rangle$, which we would expect to be constant for large $\ell$ at low temperatures due to $\alpha_2 = \alpha_1$ and to revert to the power law of Eq. (2.7) for n = 1 at high temperatures. The second new observable, $\langle p(\ell) \rangle^2 / \langle p(\ell)^2 \rangle$, we would expect, conversely, to be constant in the high-temperature regime for large $\ell$ and to decay with the power law of Eq. (2.7) at low temperatures. Figures 2.6 and 2.7 plot these observables for several temperatures above, below, and close to the critical temperature. We fit these curves (the fits are shown in those figures as dashed yellow lines) to the forms
\[ \frac{\langle p(\ell)^2 \rangle}{\langle p(\ell) \rangle} = A_g(T) \left\{ \left[ \frac{\ell\,(N - \ell)}{N} \right]^{-\omega_g} + \Delta_g \right\} \tag{2.14} \]
\[ \frac{\langle p(\ell) \rangle^2}{\langle p(\ell)^2 \rangle} = A_m(T) \left\{ \left[ \frac{\ell\,(N - \ell)}{N} \right]^{-\omega_m} + \Delta_m \right\}, \tag{2.15} \]
where $A_g(T)$ and $A_m(T)$ are global prefactors.

We identify the constants $\Delta_g$ and $\Delta_m$ as the order parameters of the system. They scale as $\xi^{-\omega}$, where ξ is a characteristic length scale, diverging at the transition as
\[ \xi \sim |T - T_c|^{-\nu}. \tag{2.16} \]

Supposing that ν is the same on both sides of the transition, this implies that $\Delta_g$ and $\Delta_m$ vanish at the transition and depend on T as
\[ \Delta_g \sim |T - T_c|^{\omega_g \nu} \tag{2.17a} \]
\[ \Delta_m \sim |T - T_c|^{\omega_m \nu}. \tag{2.17b} \]

Figure 2.6: High-temperature pairing ratio vs substrand length. The curves demonstrate the behavior of this new observable at several temperatures above and below the transition temperature and at temperatures deep within the glass phase ($k_BT/\sqrt{D} = 0.1$) and molten phase ($k_BT/\sqrt{D} = 1.0$).

The behavior of these two quantities is shown in Fig. 2.8 for all system sizes considered. In order to estimate values of $\omega_g\nu$ and $\omega_m\nu$ for each system size we perform linear fits over small ranges of temperature immediately below or above $T_c$, respectively, while varying $\omega\nu$. Once we obtained the best-fit value of $\omega\nu$ for each system size, we averaged these values together. Below the critical point we obtained $\omega_g\nu \approx 2.12 \pm 0.08$, while above $T_c$ we obtained $\omega_m\nu \approx 2.43 \pm 0.06$. Assuming the theoretical values of $\omega_g$ and $\omega_m$, 4/3 and 3/2, respectively, one then finds that on both sides of the transition

νg ≈ 1.59 ± 0.06 (2.18a)

νm ≈ 1.62 ± 0.04. (2.18b)

The value of ν was also determined analytically via hyperscaling and found to be ν = 1/(2 − θ∗) ≈ 8/5 [70].
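The arithmetic behind Eqs. (2.18a) and (2.18b) can be written out explicitly. The lines below simply divide the fitted products by the assumed theoretical exponents and scale the quoted uncertainties by the same factor, which is our reading of how the error bars were obtained.

```latex
\[
\nu_g = \frac{\omega_g \nu}{\omega_g} = \frac{2.12 \pm 0.08}{4/3} \approx 1.59 \pm 0.06,
\qquad
\nu_m = \frac{\omega_m \nu}{\omega_m} = \frac{2.43 \pm 0.06}{3/2} \approx 1.62 \pm 0.04 .
\]
```

Both estimates agree with the analytical value 8/5 = 1.6 within their quoted uncertainties.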

Transition Mechanism

Next, we aim to understand the mechanism of the phase transition itself. Deep in the molten phase, base-pairing energetics are irrelevant and thus any base can pair with any other base. Thus, the thermally averaged base-pairing probability $p(\ell)$ is the same for any pair of bases with the same distance $\ell$ and decays with the distance $\ell$ as the power law introduced in Eq. (2.7). This probability is independent of the disorder realization. In contrast, at zero temperature the RNA molecule folds into the ground-state structure, which is typically non-degenerate within the Gaussian-disorder model. Thus, for a given realization of the disorder the thermally averaged pairing probability of a given base pair is either zero (if the base pair is not in the ground-state structure) or one (if the base pair is in the ground-state structure). Which base pairs have a zero probability and which have a probability of one changes with the disorder realization. Thus, the scaling behavior Eq. (2.7) at zero temperature arises only after ensemble averaging. The picture of the phase transition proposed in [70] is that as the phase transition is approached from below, the fraction of base pairs “locked” into a ground-state structure decreases by the appearance of molten regions.

Figure 2.7: Low-temperature pairing ratios vs substrand length. The curves demonstrate the behavior of this new observable at several temperatures above and below the transition temperature and at temperatures deep within the glass phase ($k_BT/\sqrt{D} = 0.1$) and molten phase ($k_BT/\sqrt{D} = 1.0$).

At the transition, the situation inverts and the “locked” base pairs become localized regions of decreasing size within an overall molten structure on the high-temperature side of the phase transition. These ideas are reflected in the order parameters introduced in Eqs. (2.14) and (2.15). Let us stress that while the expression “locked” used in Ref. [70] suggests that bases are paired with probability 1, this condition can be relaxed to mean “are paired with probability p larger than 0.5”, or even p ≥ 0.1. This suffices to render the expectations (2.14) and (2.15) non-trivial. We will understand “locked” in this sense from now on.

Figure 2.8: Order parameters, $\Delta_g$ and $\Delta_m$, versus temperature. (a) As the temperature approaches $k_BT/\sqrt{D} \approx 0.53$ from below, $\Delta_g$ [Eq. (2.17a)] drops below zero. (b) Similarly, as the temperature approaches $k_BT/\sqrt{D} \approx 0.53$ from above, $\Delta_m$ [Eq. (2.17b)] becomes negative. Note that a negative value for $\Delta_g$ or $\Delta_m$ is unphysical, and for N → ∞ these negative regions converge to zero. Plotting them nevertheless allows us to more precisely estimate the transition temperature $T_c$. Estimates for the values of $\omega_g\nu$ and $\omega_m\nu$ are also shown in their respective temperature regimes.

In order to directly probe this transition mechanism, we numerically calculate the entire distribution of base-pairing probabilities $P(p; \ell)$. We note that all moments of the disorder-averaged base-pairing probabilities can be reconstructed from these distributions as
\[ \langle p(\ell)^n \rangle = \int_0^1 p^n P(p; \ell)\, dp. \tag{2.19} \]

We obtain these distributions by explicitly calculating the base-pairing probabilities $p_{i,i+\ell}$ using Eq. (2.6) and then tabulating their frequencies averaged over all i and many realizations of the disorder. Moreover, due to the large range of base-pairing probabilities, instead of taking histograms of the $p_{i,i+\ell}$ themselves, we sample the distribution $Q(x; \ell)$, where $x = -\ln(p(\ell))$. The two distributions are connected to each other by $Q(x; \ell)\,dx = P(p; \ell)\,dp$ and thus moments of the disorder-averaged base-pairing probability can be obtained from $Q(x; \ell)$ as
\[ \langle p(\ell)^n \rangle = \int_0^\infty e^{-nx}\, Q(x; \ell)\, dx. \tag{2.20} \]
To reduce the statistical noise, we also use the large-$\ell$ expression
\[ Q(x) := \int_{N/3}^{N/2} d\ell\; Q(x, \ell). \tag{2.21} \]

This is motivated by the fact that according to Eq. (2.7), the pairing probability $p(\ell)$ is a function of $\ell(N - \ell)/N$, and in the chosen range the latter varies by only about 11% (from 2N/9 to N/4).
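The relation between a histogram of x = −ln(p) and the moments of p, Eq. (2.20), can be illustrated with a short sketch. The function name, the fixed bin width, and the uniformly distributed test probabilities are all our own illustrative assumptions and are unrelated to the actual RNA data.

```python
import math
import random


def moment_from_x_histogram(x_samples, n, bin_width=0.01):
    """Estimate <p^n> from samples of x = -ln(p) via a histogram of x.

    Each bin contributes exp(-n * x_center) times the fraction of samples
    in that bin, which is a discretized version of Eq. (2.20).
    """
    counts = {}
    for x in x_samples:
        index = int(x // bin_width)
        counts[index] = counts.get(index, 0) + 1
    total = len(x_samples)
    return sum(math.exp(-n * (index + 0.5) * bin_width) * count / total
               for index, count in counts.items())


# Illustration with made-up probabilities drawn uniformly from (0, 1].
rng = random.Random(0)
probabilities = [1.0 - rng.random() for _ in range(20000)]
x_values = [-math.log(p) for p in probabilities]
second_moment = moment_from_x_histogram(x_values, n=2)
direct = sum(p * p for p in probabilities) / len(probabilities)
```

For sufficiently narrow bins, the histogram-based estimate and the direct sample average of p² should agree closely; working in x is useful mainly because it resolves the very small probabilities that a linear histogram in p would lump into a single bin.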

Figures 2.9(a) and (b) show that in both temperature regimes the base-pairing probability distribution has a Gaussian-like shape. Figure 2.9(a) is the distribution Q(x) at the highest temperature considered, $k_BT/\sqrt{D} = 1$, for a system of size N = 4000. The distribution is centered around x ≈ 13 and we interpret it as a broadened version of what would be a δ-peak around the value of $\langle p(\ell) \rangle$ from Eq. (2.7) at infinite temperature or D = 0.

Figure 2.9(b) shows the same distribution at the lowest temperature numerically accessible to us, $k_BT/\sqrt{D} = 0.1$. Here we see a broad Gaussian-like distribution again. However, it is by far wider than the high-temperature distribution and is located at x ≈ 95, i.e., at much smaller base-pairing probabilities than its high-temperature counterpart. We interpret this Gaussian-like feature as the contributions from all the base pairs that are not part of the ground-state structures and thus would have exactly zero probability at zero temperature, but acquire a finite, albeit very small ($x \gg 0$, $p \ll 1$), pairing probability at low finite temperatures. Importantly, as the inset of Fig. 2.9(b) shows, close to the origin (x ≈ 0 or p ≈ 1) there exists another peak. We interpret this prominent peak at x ≈ 0 as the contributions from those base pairs that are “locked” in the ground-state structure and define the glass phase of this model. As the inset in Fig. 2.9(a) shows, no corresponding peak at x ≈ 0 occurs deep in the molten phase. From this we conclude that the total base-pairing probability distribution is just a sum of two sub-distributions: the distribution of ground state pairs and the distribution of non-ground state pairs.

In order to separate these two sub-distributions we used Eq. (2.3) to calculate the ground-state energies and tabulated the base-pairing probability frequencies of the ground state pairs across all $\ell$, as was done for the entire base-pairing probability distributions. Obtaining the non-ground state base-pairing probability distribution is then just a matter of subtraction. Figures 2.10 and 2.11 compare the ground state and non-ground state distributions, averaged over the final third of all values of $\ell$, to the total base pair distributions at the lowest and highest temperatures considered, respectively. In Fig. 2.10(b) one can see that the ground state pairs directly contribute to the distinct peak close to x ≈ 0, which we interpreted as only those base pairs that are “locked” in the ground-state structures. This further confirms that it is the non-ground state pairs that make up the bulk of the total distribution seen in Fig. 2.9(b), each of which would otherwise have zero probability at T = 0.

Figure 2.11 then explains the absence of any weight in the inset of Fig. 2.9(a). In this region the lack of ground state pair contributions indicates that the system is no longer in the glass phase. Using the refined location of the critical point from the previous section, one can then track the loss of these ground state pairs near x ≈ 0.

While we limited the bulk of our investigation to the temperature range $0.4 \le k_BT/\sqrt{D} \le 0.7$ for a system size of N = 4000, we collected data for temperatures in the range of $0.1 \le k_BT/\sqrt{D} \le 1.0$ for smaller system sizes for our initial estimation of the critical point in Sec. 2.3. The data collected from these smaller systems allowed us to explore the glass phase and the transition in more detail.

Figure 2.9: Distributions of the logarithm of base-pairing probabilities x = − ln(p) for system size N = 4000 averaged over the final third of all $\ell$’s. (a) Deep in the molten phase ($k_BT/\sqrt{D} = 1.0$) the distribution of base-pairing probabilities has a distinct Gaussian-like form, which would go to a delta peak as T → ∞ (D → 0). The inset shows that no weight of the distribution is present close to x ≈ 0 (p ≈ 1). (b) Deep in the glass phase ($k_BT/\sqrt{D} = 0.1$) the base-pairing distribution still has a Gaussian-like form, albeit broader than the high-temperature distribution. However, the inset shows there exists an exponential-like peak close to x ≈ 0.

Figure 2.10: Separation of the glass phase ground and non-ground state distributions. (a) The total base-pairing distribution within the glass phase ($k_BT/\sqrt{D} = 0.1$). The black arrow points to the prominent peak at x ≈ 0. (b) The ground state base-pairing distribution. The black arrow points to the prominent peak at x ≈ 0. This peak represents the frequency of those pairs “locked” in the ground-state structure, which contribute the weight seen in the inset of Fig. 2.9(b). (c) As mentioned in the text, the broad Gaussian-like peak seen in the low-temperature regime is a result of non-ground state pairs that have a small, non-zero pairing probability at finite temperatures. These probabilities would be zero at T = 0.

Figure 2.11: Separation of the molten phase ground and non-ground state distributions. (a) The total base-pairing distribution within the molten phase ($k_BT/\sqrt{D} = 1.0$). (b) The lack of weight close to x ≈ 0 is due to the absence of ground state pairs at high temperatures, as seen in the inset of Fig. 2.9(a). We see instead that the peak has shifted to the right to a position in x closer to the non-ground state distribution. Note that the y-axis is 4 orders of magnitude smaller than either one of the other distributions. (c) The non-ground state pairs dominate the total pairing probability distribution.

With our definition of “locked” now including all ground state pairs with x < 0.7 (p ≳ 0.5), we examined this region of the ground state distributions as one approaches $T_c$ from below. Figure 2.12(a)-(e) shows that as the temperature increases the region x ≤ 0.1 (p ≥ 0.9) becomes more sparse. Calculating the average pairing probability over this region for all temperatures, as shown in Fig. 2.13, we find that the average pairing probability decreases rapidly towards zero close to the critical point.

We then tracked the total number of zero-weight bins within the “locked” region for all temperatures. As shown in Fig. 2.14, we find that through most of the glass phase the fraction of zero-weight bins within the “locked” region is zero. As the system approaches the critical point the number of zero-weight bins begins to rise. At $T_c$, ∼10% of bins are empty within the “locked” region. Above $T_c$ the number of zero-weight bins within the “locked” region rises sharply until $k_BT/\sqrt{D} = 0.7$, where all the weight has disappeared.

As seen in the inset of Fig. 2.9(b), the prominent peak at x ≈ 0 appears to have a power-law-like form. Thus one could model the ground state pairing distribution as $Q_{GS}(x) \sim x^{-\gamma}$ for small values of x. We found that at a temperature below $T_c$ this distribution experiences a sudden drop in weight at x ≈ 0, and that this weight continues to decrease as the temperature increases, as shown in Fig. 2.15(a), (b), and (c). In (a) we see that the left-most peak is the highest peak at a temperature of $k_BT/\sqrt{D} = 0.3$. This is also true for all lower temperatures in this same region, as indicated in previous figures. As the temperature increases, we see in Fig. 2.15(b) that all neighboring peaks have roughly the same height. As the temperature continues to increase, the peak at x ≈ 0 in Fig. 2.15(c) drops below neighboring peaks. In Fig. 2.16 we measure the overall decrease in weight in this region by fitting $Q_{GS}(x) \sim x^{-\gamma}$ to a line in log-space to extract the values of the exponent γ. Figure 2.16 shows that the slopes of these fits (shown as black dashed lines) turn from negative to positive as the temperature crosses a particular point. We find that this point lies in the range $0.34 \le k_BT/\sqrt{D} \le 0.35$, as seen in Fig. 2.17.

Figure 2.12: “Locked” regions of the ground state distributions in the vicinity of the critical point. (a) At $k_BT/\sqrt{D} = 0.5$ we see that several of the bins in the range x ≤ 0.1 have no weight. (b)-(c) As the temperature approaches $T_c$ ($k_BT/\sqrt{D} = 0.51, 0.52$), this region of zero-weight bins begins to slowly expand further. (d) At the critical point many of the bins within the x ≤ 0.1 range have zero count. (e) Above $T_c$ zero-count bins dominate the x ≤ 0.1 range, with more occurring further to the right.

Figure 2.13: Average pairing probability of the “locked” region versus temperature. The black vertical line indicates the location of the critical point.

Figure 2.14: Fraction of zero-weight bins versus temperature. The black horizontal line shows the point at which the number of zero-weight bins within the “locked” region is at 10%. Below $T_c$ the number of zero-weight bins is very low. At the critical point ∼10% of the bins within the “locked” region are zero-weight. This fraction of zero-weight bins rises sharply above $T_c$ until $k_BT/\sqrt{D} = 0.7$, where all bins within the “locked” region are zero.

Figure 2.15: Average ground state base-pairing probabilities versus x for system size N = 1000. (a) At $k_BT/\sqrt{D} = 0.3$, and all temperatures below, the peak close to x ≈ 0 is the dominant peak. (b) As the temperature is increased, the height of this peak drops to roughly the same height as the neighboring peaks. (c) Above a particular temperature, this peak is then lower than its neighbors.

Figure 2.16: Logarithm of the average ground state base-pairing distributions versus ln(x) for system size N = 1000. At low temperatures the logarithm of the ground state pairing distributions close to x ≈ 0 demonstrates linear behavior whose slopes (approximated by the black dashed fit lines) appear to turn from negative to positive as the temperature increases.

Figure 2.17: γ vs temperature. When plotted against temperature, the slopes of the fit lines from Fig. 2.16 cross zero at a point between $k_BT/\sqrt{D} = 0.34$ and $k_BT/\sqrt{D} = 0.35$, which one could interpret as the secondary freezing event mentioned in [70].

Discussion & Conclusions

In this work, we numerically studied the glass-molten phase transition of RNA secondary structures using a Gaussian disorder model. With the guidance of previous studies of this system using a renormalized field theory [70] we were able to determine a precise location of the transition by the introduction of two new observables. We found that the critical temperature is $k_BT/\sqrt{D} \approx 0.53$ when approached from either above or below for system sizes up to N = 4000. While establishing this location, we also found two quantities that we identify as order parameters of this system, $\Delta_g$ and $\Delta_m$. It is through these numerically measured quantities that we are able to establish further connections to a renormalized field theory [70].

We also provided some insight as to how this transition occurs by studying the behavior of the base-pairing probability distributions. In particular we showed that the total base-pairing probability distribution can be broken into two sub-distributions composed of base pairs that are “locked” in the ground-state structures and non-ground state base pairs that develop a small non-zero pairing probability at low finite temperatures. A redefinition of what it means for a base pair to be “locked” in the ground state allowed us to highlight the behavior near the transition. We found that as the system approaches the critical point, the “locked” region begins to diminish and the number of zero-weight bins reaches a value of ∼10%. Immediately above the critical point, the number of zero-weight bins increases rapidly.

Further below $T_c$ we also found that the “locked” region can be modeled as a power law with a sign change of the exponent occurring at $k_BT/\sqrt{D} \approx 0.35$. We interpret this as potentially a “secondary freezing” event, first mentioned in [70].

Chapter 3 The role of RNA secondary structure in translation initiation in Flavobacterium johnsoniae

Introduction

In the previous chapter, we looked at the behavior of RNA secondary structures through the lens of statistical physics. In doing so, we overlooked any biological role that these structures have within a cell. In this chapter we will investigate some aspects of their biological impact on the initiation of translation in bacterial cells. Aided by the use of high-throughput RNA sequencing (RNA-seq) and ribosome profiling (Ribo-seq) we will study how RNA secondary structure affects translation initiation in Flavobacterium johnsoniae, a member of the phylum Bacteroidetes.

F. johnsoniae as the model organism for Bacteroidetes

The initiation of translation is a fundamental and highly regulated process in gene expression. This process requires the assembly of a ribosome complex with an initiator tRNA bound to its peptidyl (P)-site and paired to the correct start codon of an mRNA. Selection of the correct start codon among all other AUG (or similar) trinucleotides represents a critical challenge for the translation machinery [5; 81]. In bacteria and archaea, translation initiation is often facilitated by the Shine-Dalgarno (SD) sequence, a purine-rich element (e.g. AGGAGG) that lies ∼5-9 nt upstream from the start codon of an mRNA. The SD basepairs with a stretch of pyrimidines at the 3’ end of the 16S rRNA called the anti-Shine-Dalgarno (aSD) sequence, positioning the start codon of the mRNA in the 30S P-site [4; 82], as depicted in Fig. 3.1.

Figure 3.1: Pairing of the Shine-Dalgarno (SD) sequence with the anti-Shine-Dalgarno (aSD) sequence. The SD sequence lies ∼5-9 nt upstream of the start codon of an mRNA. Image adapted from [83].

Several genetic studies have shown that mutations to the SD sequence or any alteration to the spacing between the SD sequence and the start codon substantially reduce translation, demonstrating the importance of the SD sequence for efficient mRNA translation [84–86].

However, not all prokaryotic mRNAs contain an SD sequence. One study found that across completely sequenced prokaryotic genomes the percentage of SD-led genes varied from 11.6% to 90.8% [87]. Yet the translational apparatus of these organisms can still translate mRNAs that lack the SD sequence faithfully and efficiently. This implies that other features of an mRNA’s translation initiation region (TIR) can facilitate start codon selection and aid the initiation of translation.

Genomic studies examining the universality of the SD sequence as an effective initiator of translation among prokaryotes have revealed that certain phyla of bacteria lack this translational regulatory element [88; 89]. These phyla include the Bacteroidetes and a subset of the Cyanobacteria. How the mechanism of translation initiation in these organisms differs from that of model organisms, such as Escherichia coli and Bacillus subtilis, remains unclear.

Two possible alternate mechanisms have been identified in prokaryotes. One such mechanism is leaderless translation [90–94], in which a 70S ribosome carrying fMet-tRNA binds a leaderless mRNA directly at the initiating codon [88; 90]. The second mechanism is mediation by ribosomal protein S1 [88]. S1 interacts with A/U-rich stretches of the 5’ UTR of an mRNA due to its high affinity for these regions [95–98]. Intriguingly, despite the absence of the SD sequence, the 30S ribosomal subunits of most of these SD-independent organisms have retained the conserved aSD sequence at the 3’ end of the 16S rRNA [88; 98]. This suggests that the aSD may have a critical function beyond initiation that is yet to be understood.

Bacteroidetes represent a widespread and metabolically diverse group of bacteria. Members of the Bacteroidetes are well known for their ability to digest and utilize various polysaccharides [99; 100]. In the mammalian gut, two phyla, Firmicutes and Bacteroidetes, account for the majority of bacteria present, ∼65.7% and ∼16.3%, respectively [101]. Bacteroidetes within the gut break down many diverse glycans, altering the pools of nutrients that can be absorbed and utilized by the host. Indeed, human microbiome studies suggest that Bacteroidetes make important and complex contributions to nutrient acquisition, weight control, and metabolic disease [102–105]. While these observations have spurred considerable interest in Bacteroidetes, they remain an understudied bacterial group.

The Bacteroidetes exhibit unique aspects of not only translation but also transcription. Their primary housekeeping sigma factor, termed σABfr (after Bacteroides fragilis), has functionally diverged from the stereotypical σ70 (rpoD) of other bacteria [106]. σABfr lacks region 1.1 and contains numerous conserved substitutions in the DNA recognition motifs of regions 2.3-2.4 and 4.2. A consensus promoter, with -33/-7 elements TTTG/TANNTTTG, has been deduced from various molecular studies [107–111]. This promoter sequence differs dramatically from the -35/-10 elements recognized by the Eσ70 holoenzyme (TTGACA/TATAAT), explaining why genes from E. coli are not expressed in the Bacteroidetes, and vice versa [112].

F. johnsoniae, a member of the Bacteroidetes, is a common aerobic soil bacterium that can degrade chitin and other insoluble polymers. F. johnsoniae also exhibits gliding motility. McBride and colleagues have developed F. johnsoniae as a model organism to elucidate the cellular machinery responsible for gliding motility [113]. In this work, we use RNA-seq fragments and Ribo-seq footprints to obtain a global snapshot of transcription and translation, respectively, in F. johnsoniae. We find that within the vicinity of the start codon the occurrence of additional AUG codons is suppressed and that translation initiation is influenced by mRNA secondary structure. In addition, our data allow for the refinement of the consensus promoter of the primary RNAP holoenzyme.

Materials and Methods

Cell culture and library preparations

All F. johnsoniae cultures used in this study were grown and processed by Bappaditya Roy, PhD, in the Fredrick lab within the Department of Microbiology at the Ohio State University. F. johnsoniae strain UW101 was grown at 30 °C to mid-logarithmic phase in a rich charcoal yeast extract medium. The cells were then rapidly chilled to halt translation by mixing the culture with crushed ice. Cells were collected by centrifugation and lysed via a freeze-thaw method. Total RNA and ribosome-protected mRNA fragments were isolated in parallel from the lysate, and corresponding complementary DNA libraries were prepared and subjected to high-throughput sequencing. Sequencing was performed using a HiSeq2500 in the Ohio State Comprehensive Cancer Center Genomics Shared Resource in 50 base-pair single-end mode. The three RNA-seq replicates initially contained 31,621,290, 30,635,429, and 29,764,896 50 base-pair single-end reads. The three Ribo-seq replicates initially contained 28,364,922, 17,598,576, and 27,203,601 50 base-pair single-end reads.

RNA-seq, Ribo-seq, and the selection of a representative set of genes

3’ adapter sequences were trimmed from the single-end reads using Skewer (ver. 0.2.2) [114] with its default settings, with the exception of the minimum allowed read length (“-l”), which was set to 15. Sequences were then aligned to the genome using the Bowtie 2 aligner (ver. 2.2.9) [115] with its default settings. Both RNA-seq fragments and Ribo-seq footprints were aligned to the F. johnsoniae genome (NCBI Reference Sequence: NC_009441.1). Alignments were then converted to the BAM format using SAMtools (ver. 0.1.18) [116]. RNA-seq fragments and Ribo-seq footprints associated with ribosomal RNA (rRNA) and transfer RNA (tRNA) were removed from the alignment files using the BEDtools (ver. 2.26.0) utility ‘intersect’ [117].

Total alignment rates for the RNA-seq replicates were 88.4% (65.2% multimapped), 88.1% (62.2% multimapped), and 88.5% (63.3% multimapped), with 79.6%, 78.0%, and 77.9% of the initial single-end reads, respectively, aligning to regions of r/tRNA. Total alignment rates for the Ribo-seq replicates, as compared to the initial number of sequenced reads, were 81.6% (77.2% multimapped), 72.4% (68.7% multimapped), and 62.6% (60.1% multimapped), with 77.0%, 68.5%, and 59.9% of the initial single-end reads, respectively, aligning to regions of r/tRNA. Gene expression was measured by counting aligned RNA-seq fragments and Ribo-seq footprints that overlapped coding sequence annotations using HTSeq (ver. 0.6.0) [118]. Replicate-to-replicate comparisons of fragment count per gene coverage showed that the RNA-seq data sets were reproducible, with Spearman’s correlation coefficients, r, of 0.982, 0.981, and 0.980, as shown in Fig. 3.2. Similarly, replicate-to-replicate comparisons of footprint count per gene coverage showed that the Ribo-seq data sets were also reproducible, with Spearman’s correlation coefficients, r, of 0.958, 0.963, and 0.963, as shown in Fig. 3.3.

For this study, we selected a subset of highly expressed genes to act as a representative set for all genes. To do this, we summed the raw RNA-seq fragment and Ribo-seq footprint counts for each gene across the three libraries. We then rank ordered the entire set of 5,138 protein coding genes of F. johnsoniae by the number of RNA-seq fragments per gene length, from highest to lowest. From this list we selected the top third (1,712 genes) as our representative set. Gene counts for both RNA-seq and Ribo-seq were then normalized by the total number of counts per million from this representative set. We then re-ranked this subset by the ratio of normalized Ribo-seq footprint counts to normalized RNA-seq fragment counts (i.e. average ribosomal density, ARD) and further divided it into octiles of 214 genes. Studies [41; 46; 119] indicate that ARD (also termed “translational efficiency” or “TE”) generally reflects the rate of translation. Features of the TIR were compared from octile to octile in an effort to identify those that impact initiation. Figure 3.4 shows that scatter plots of ARD between replicates were also in good agreement (Spearman’s r = 0.900, 0.902, 0.873).
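The selection and ranking steps described above can be sketched in Python as follows. This is a minimal illustration, not the code used in this work: the function name `select_octiles`, the dictionary keys, and the reading of "counts per million" as a normalization by the summed counts of the representative set are all assumptions.

```python
def select_octiles(genes, top_divisor=3, n_groups=8):
    """Rank genes by RNA-seq density, keep the top third, split by ARD.

    `genes` is a list of dicts with hypothetical keys:
      'name', 'length' (nt), 'rna' and 'ribo' (raw counts summed over replicates).
    Returns a list of n_groups lists of (ARD, name), highest ARD first.
    """
    # Rank by RNA-seq fragments per unit gene length, highest first.
    ranked = sorted(genes, key=lambda g: g["rna"] / g["length"], reverse=True)
    subset = ranked[: len(ranked) // top_divisor]  # 5,138 // 3 = 1,712

    # Assumed normalization: counts per million within the representative set.
    rna_total = sum(g["rna"] for g in subset)
    ribo_total = sum(g["ribo"] for g in subset)

    scored = []
    for g in subset:
        rna_cpm = g["rna"] / rna_total * 1e6
        ribo_cpm = g["ribo"] / ribo_total * 1e6
        scored.append((ribo_cpm / rna_cpm, g["name"]))  # ARD per gene

    scored.sort(reverse=True)  # octile 1 = highest ARD
    size = len(scored) // n_groups  # 1,712 // 8 = 214
    return [scored[i * size:(i + 1) * size] for i in range(n_groups)]
```

The sketch assumes every gene in the top third has a nonzero RNA-seq count, which holds by construction when the ranking is by RNA-seq density.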

Additionally, a metagene analysis of the 3’ end of all coding sequences was performed. For each annotated coding sequence, we counted the raw number of 3’ ends of Ribo-seq footprints that occurred within 50 nts upstream and downstream of the 3’ end of each gene. For every gene we then normalized this window to the total number of Ribo-seq footprint 3’ ends within that window. Finally, we averaged these normalized windows across all genes. Figure 3.6(a) is the result of this procedure and shows the expected triplet phasing near the stop codon, with a peak 9 nts downstream of the stop codon. This is indicative of ribosomes located at the end of a gene with the stop codon in their A-site. This method allows one to identify the location of the center of the P-site. On average, this center of the P-site in F. johnsoniae is 13 nt upstream from the 3’ end of a Ribo-seq footprint. Previous studies using E. coli have reported finding the peak associated with the center of the P-site 14 nt upstream from the 3’ ends of Ribo-seq data [54].
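The metagene procedure above (count, normalize per gene, average over genes) can be sketched as follows. The function name, the input layout, and the choice to skip genes with no footprint ends in the window are assumptions made for illustration; coordinates are taken to increase in the 5’ to 3’ direction of each gene.

```python
def metagene_profile(end_positions_by_gene, stop_positions, flank=50):
    """Average per-gene normalized counts of footprint ends around gene 3' ends.

    end_positions_by_gene: dict gene -> list of footprint-end coordinates
    stop_positions:        dict gene -> coordinate of the gene's 3' end
    Returns a list of length 2*flank + 1; index `flank` is offset 0.
    """
    width = 2 * flank + 1
    total = [0.0] * width
    n_used = 0
    for gene, stop in stop_positions.items():
        window = [0] * width
        for pos in end_positions_by_gene.get(gene, []):
            offset = pos - stop
            if -flank <= offset <= flank:
                window[offset + flank] += 1
        n = sum(window)
        if n == 0:
            continue  # assumption: genes without any ends in the window are skipped
        for i, count in enumerate(window):
            total[i] += count / n  # normalize to the gene's own window total
        n_used += 1
    if n_used == 0:
        return total
    return [t / n_used for t in total]  # average over contributing genes
```

Because each gene's window is normalized before averaging, highly expressed genes do not dominate the profile.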

To understand the impact of mRNA secondary structure on translation initiation and its possible connection to ARD, we folded the regions around the annotated start codon of each gene in our representative set in silico and calculated the average pairing probability for each nucleotide position. Our measurement of the pairing probability at each position was calculated as follows. For each gene, we extracted the regions 30, 50, and 100 nt upstream and 100 nt downstream of the first base of a gene’s annotated start codon. The ViennaRNA Package program “RNAfold” [120] was then used to calculate the pairing probabilities for each position using its default settings. The pairing probabilities at each position could then be averaged over the entire set or subsets (i.e. octiles) of these genes.
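The averaging step can be illustrated with a short sketch. It assumes that base-pair probabilities have already been obtained for each folded region as (i, j, p) triples with 1-based positions; how these are read from the folding program's output is outside the sketch, and both function names are hypothetical.

```python
def position_pairing_probs(length, pair_probs):
    """Probability that each position is paired with any partner.

    pair_probs: iterable of (i, j, p) with 1-based positions i < j and
    p the probability that i pairs with j.
    """
    probs = [0.0] * length
    for i, j, p in pair_probs:
        probs[i - 1] += p  # a base pairs with at most one partner, so
        probs[j - 1] += p  # summing over partners gives P(position is paired)
    return probs


def average_profiles(profiles):
    """Average equal-length per-position profiles over a set of genes."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]
```

Calling `average_profiles` on all genes, or on the genes of one octile, gives the per-position curves compared in the Results.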

To study how the absence of SD impacts translation initiation, we compared our representative subset of genes from F. johnsoniae to similarly selected subsets of genes from E. coli (GSE88725) [121] and B. subtilis (GSE50870) [122]. These two organisms serve as ambassadors of phyla - Proteobacteria and Firmicutes, respectively - that have demonstrated SD-dependent translation initiation [88]. These additional RNA-seq and Ribo-seq data sets were processed in the same manner as described above. For E. coli, we aligned and enumerated gene counts using the NCBI NC_000913.3 reference genome and annotations. For B. subtilis, we aligned and enumerated gene counts using the NCBI NC_000964.3 reference genome and annotations. Our representative set for E. coli contained 1,440 genes (180 genes per octile) and that for B. subtilis contained 1,392 genes (174 genes per octile).

Figure 3.2: Comparison of RNA-seq libraries from F. johnsoniae. Shown are scatter plots of duplicate RNA-seq libraries with the Spearman coefficient for each shown in the green box.

Figure 3.3: Comparison of Ribo-seq libraries from F. johnsoniae. Shown are scatter plots of duplicate Ribo-seq libraries with the Spearman coefficient for each shown in the green box.

Figure 3.4: Comparison of average ribosome densities from F. johnsoniae. Shown are scatter plots of average ribosome densities from the Ribo-seq libraries with the Spearman coefficient for each shown in the green box.

Results

Ribosome footprints from F. johnsoniae are shorter and more uniform in length than those of other bacteria

Mindful of reports that aSD pairing to the mRNA can increase ribosome footprint length [49–51], we compared Ribo-seq footprints from analogous F. johnsoniae, E. coli, and B. subtilis libraries. Ribosome footprints from F. johnsoniae were found to be ∼23 nt in length, 4-5 nucleotides shorter than those from E. coli or B. subtilis, as shown in Fig. 3.5. RNA fragments of a broad size range (20-35 nt) were used to prepare the F. johnsoniae Ribo-seq libraries, arguing against any trivial procedural explanation for shorter footprints [50].

Figure 3.5: Length histograms of Ribo-seq footprints for F. johnsoniae, E. coli, and B. subtilis. The average length of Ribo-seq footprints in F. johnsoniae was found to be 23 nt, 4-5 nt shorter than in E. coli and B. subtilis. E. coli and B. subtilis showed a broader size distribution due to possible 5’ length variability mentioned in [49–51]. We interpret the shorter footprint lengths of F. johnsoniae as a result of the lack of SD-mRNA pairing.

To gain further insight, we took advantage of the relatively large dwell time of termination and performed a metagene analysis with respect to the stop codon. Coverage of either 3’ or 5’ ends of Ribo-seq footprints was normalized per gene and averaged over genes. For F. johnsoniae, a sharp peak was seen in each plot (Fig. 3.6(a), (b)), indicating that protected mRNA upstream and downstream from the P-site exhibits considerable length uniformity. These peaks at positions -13 and 9 (where position 0 is defined as the third nucleotide of the stop codon) correspond to the 5’-most and 3’-most protected nucleotides in the termination complex, yielding (with the intervening 21 nt) a footprint length of 23 nt. This matches the mode value (23 nt) when all footprints are considered (Fig. 3.5).
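As a check on the arithmetic, the footprint length and the P-site offset follow directly from the two peak positions. The symbols below are introduced here for illustration only, and the second line assumes that the P-site codon lies immediately 5’ of the stop codon occupying the A-site (positions -2 to 0), so that the P-site codon spans -5 to -3 with its center at -4.

```latex
\begin{align*}
L_{\text{footprint}} &= p_{3'} - p_{5'} + 1 = 9 - (-13) + 1 = 23~\text{nt},\\
p_{3'} - p_{\text{P-site center}} &= 9 - (-4) = 13~\text{nt}.
\end{align*}
```

The second line agrees with the 13 nt distance between the P-site center and the footprint 3’ end quoted in Materials and Methods.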

For E. coli, a sharp peak at position 10 was seen in the plots mapping the 3’ footprint ends (Fig. 3.7(a)), while a broad peak (-19 to -13, with a substantial upstream shoulder) was seen in plots mapping the 5’ footprint ends (Fig. 3.7(b)). These data indicate protected mRNA of variable length on the 5’ side of the termination complex, consistent with earlier reports [49; 51]. The range of footprint sizes predicted from these plots is in line with the histogram of Fig. 3.5. For B. subtilis, the data (Fig. 3.8(a),(b)) resembled those of E. coli, indicating considerable length heterogeneity on the 5’ side of the footprint. Earlier studies suggest that aSD pairing can extend the 5’ end of the footprint, leading to the observed length variability [49–51]. The fact that footprints from F. johnsoniae are short and exhibit little-to-no 5’ length variability implies that aSD-mRNA pairing rarely occurs in this organism.

Figure 3.6: Metagene analysis of the 3’ and 5’ end of Ribo-seq footprints about the 3’ end of coding sequences in F. johnsoniae. (a) The peak at 9 nts to the right of the stop codon, which we interpret as stemming from ribosomes located with the stop codon in their A-site, indicates that the center of the P-site is 13 nts upstream of the most prevalent 3’ end of the Ribo-seq footprints. (b) The peak at 13 nts to the left of the stop codon is also interpreted as stemming from ribosomes located with the stop codon in their A-site.

Figure 3.7: Metagene analysis of the 3’ and 5’ end of Ribo-seq footprints about the 3’ end of coding sequences in E. coli. (a) The peak at 10 nts to the right of the stop codon, which we interpret as stemming from ribosomes located with the stop codon in their A-site, indicates that the center of the P-site is 14 nts upstream of the most prevalent 3’ end of the Ribo-seq footprints. (b) A broad peak can be seen starting 13 nts and extending to 19 nts to the left of the stop codon.

Figure 3.8: Metagene analysis of the 3’ and 5’ end of Ribo-seq footprints about the 3’ end of coding sequences in B. subtilis. (a) The peaks at 9 and 10 nts to the right of the stop codon, which we interpret as stemming from ribosomes located with the stop codon in their A-site, indicate that the center of the P-site is ∼14 nts upstream of the most prevalent 3’ end of the Ribo-seq footprints. (b) Due to the broad Ribo-seq length histogram seen in Fig. 3.5, the precise location of a peak is difficult to determine based on the 5’ ends of the footprints.

Table 3.1: Start codon usage in protein coding genes in F. johnsoniae, E. coli, and B. subtilis

Start Codon    F. johnsoniae    E. coli    B. subtilis
AUG            94.49%           87.61%     77.69%
GUG            1.60%            7.50%      9.19%
UUG            3.06%            1.78%      12.64%

Start codon usage and AUG trinucleotide representation in the F. johnsoniae TIR

In bacteria, natural initiation codons include AUG, GUG, and UUG. Each of these start codons can promote efficient initiation in E. coli, although AUG is optimal. In the absence of SD-aSD pairing, one might suspect that faithful initiation depends more heavily on codon-anticodon pairing and hence the identity of the start codon. To explore this possibility we compared start codon usage in F. johnsoniae, E. coli, and B. subtilis. Table 3.1 compares how often each of these three initiation codons is used across all annotated protein coding genes in our three representative organisms; the values closely match previously published percentages for E. coli and B. subtilis [123]. In F. johnsoniae, 94.5% of genes begin with AUG, while only 87.6% and 77.7% of genes use AUG in E. coli and B. subtilis, respectively. While the potential for misannotated start codons among published genomes persists as a potentially confounding factor [89; 124; 125], these data suggest that organisms that lack the SD sequence are more reliant on AUG as an initiating codon.

SD-aSD pairing is believed to facilitate start codon selection. In organisms such as F. johnsoniae, and possibly other members of the Bacteroidetes, an alternative mechanism(s) must be in place to ensure recognition of the correct start codon. One simple way F. johnsoniae could reduce spurious initiation events would be to eliminate accessible AUG trinucleotides other than the start codon in each TIR. To investigate this idea, we compared the number of AUG trinucleotides in the vicinity of the start codon to the number of “control” trinucleotides of the same base composition – AGU, GUA, GAU. As shown in Table 3.2, in F. johnsoniae the trinucleotide AUG is clearly underrepresented as compared to the control trinucleotides immediately upstream (-21 to -1) and downstream (+4 to +24) of the start codon. The frequency of guanines is generally low in the upstream, non-coding region, explaining the higher prevalence of all four in the downstream window. As a control, we tallied the same trinucleotides at the midpoint of the gene and found that AUG was well represented. These data suggest that AUG trinucleotides in the vicinity of the start codon are selected against in F. johnsoniae. Comparatively, in E. coli and B. subtilis we do not see the same underrepresentation of AUG with respect to the control trinucleotides in the vicinity of the start codon. Within the vicinity of the midpoint of these organisms’ genes, we see an overrepresentation of AUG as compared to the trinucleotides AGU and GUA.
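The comparison of AUG with its control trinucleotides can be sketched as a ratio of counts. The text does not specify whether counts are pooled over genes or averaged per gene; the sketch below pools them, which is an assumption, as are the function names. Overlapping occurrences are counted.

```python
def count_trinucleotide(seq, tri):
    """Count occurrences of `tri` in `seq`, including overlapping ones."""
    return sum(1 for i in range(len(seq) - 2) if seq[i:i + 3] == tri)


def aug_representation(windows, controls=("AGU", "GUA", "GAU")):
    """Ratio of pooled AUG counts to pooled counts of each control.

    windows: RNA sequences, one per gene, covering the same region
    (e.g. -21 to -1 relative to the start codon).
    A ratio below one means AUG is underrepresented relative to the control.
    """
    aug = sum(count_trinucleotide(w, "AUG") for w in windows)
    ratios = {}
    for control in controls:
        n = sum(count_trinucleotide(w, control) for w in windows)
        ratios[control] = aug / n if n else float("nan")
    return ratios
```

Using controls with the same base composition as AUG keeps the ratio from simply reflecting local nucleotide content.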

We then asked how the representation of AUG relative to our control trinucleotides compared in other Bacteroidetes, Proteobacteria, and Firmicutes. To investigate this effect across our three representative phyla, we generated sets of bacteria for each based on those used in previous studies [88; 89]. For this we used sets of 12 Bacteroidetes and 10 each of Proteobacteria and Firmicutes. We then conducted the same analysis of regions within the vicinity of the start codon and the midpoint of each gene for each organism. For each phylum we obtained p-values using a one-sample t-test to measure their difference from one. To compare the inter-phyla difference, we also performed a two-sample t-test for each combination of the three phyla. Furthermore, we adjusted the calculated p-values with a Bonferroni correction for multiple test comparisons. Figures 3.9, 3.10, and 3.11 show the results.
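For reference, the one-sample t statistic against a null value of one and the Bonferroni adjustment can be written in a few lines. This is a generic sketch using only the standard library; the function names are hypothetical, and converting the statistic to a p-value (from the t distribution with n - 1 degrees of freedom) is left to a statistics package.

```python
import math
import statistics


def one_sample_t(values, mu0=1.0):
    """t statistic for H0: mean(values) == mu0, with n - 1 degrees of freedom."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return (mean - mu0) / (sd / math.sqrt(n))


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

A negative t statistic here corresponds to a phylum whose mean AUG-to-control ratio lies below one.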

Using AGU as our control (Fig. 3.9), we found that the Bacteroidetes (highlighted in blue) were significantly different from one, with an average less than one, both upstream and downstream of the start codon ($p_{\mathrm{adj}} = 1.4 \times 10^{-3}$ and $p_{\mathrm{adj}} = 6.2 \times 10^{-3}$, respectively). In particular, AUG was underrepresented with respect to AGU in the regions flanking the start codon of the Bacteroidetes, while it was overrepresented in the midgene regions. Furthermore, a comparison of Bacteroidetes and Proteobacteria showed that the two were significantly different from one another in the regions upstream and downstream of the start codon (connected by an overhead bracket; $p_{\mathrm{adj}} = 4.9 \times 10^{-2}$ and $p_{\mathrm{adj}} = 1.9 \times 10^{-2}$, respectively).

For the control trinucleotide GAU (Fig. 3.10), all three phyla displayed significant (highlighted in blue) underrepresentation of AUG within the region upstream of the start codon (Bacteroidetes: $p_{\mathrm{adj}} = 4.6 \times 10^{-8}$, Proteobacteria: $p_{\mathrm{adj}} = 6.0 \times 10^{-6}$, Firmicutes: $p_{\mathrm{adj}} = 2.9 \times 10^{-3}$). Only two phyla, the Bacteroidetes and Proteobacteria, were found to be significantly different from one, and less than one, in the region downstream of the start codon ($p_{\mathrm{adj}} = 1.4 \times 10^{-7}$ and $p_{\mathrm{adj}} = 4.3 \times 10^{-4}$, respectively). In the vicinity of the start codon, we found the difference between the Bacteroidetes and Firmicutes to be statistically significant (connected by an overhead bracket; upstream of start: $p_{\mathrm{adj}} = 1.8 \times 10^{-2}$, downstream of start: $p_{\mathrm{adj}} = 3.1 \times 10^{-3}$).

Finally, with GUA as the control (Fig. 3.11), we found that the Bacteroidetes (in blue) displayed a significant underrepresentation of AUG in one region – upstream of the start codon ($p_{\mathrm{adj}} = 1.7 \times 10^{-3}$). Among the three phyla within the four regions considered, we found that the comparison between the Bacteroidetes and Proteobacteria showed a significant difference in the regions upstream and downstream of the start codon (connected by an overhead bracket; $p_{\mathrm{adj}} = 8.4 \times 10^{-3}$ and $p_{\mathrm{adj}} = 2.1 \times 10^{-2}$, respectively).

These data lead one to conclude that F. johnsoniae, in particular, and the Bacteroidetes, in general, have an overall suppression of AUG within the vicinity of the start codon. This strategy could have evolved following the loss of the SD element, which typically aligns the 30S subunit to the correct start codon, and would decrease the chance of accidentally initiating translation on the wrong codon.

Table 3.2: AUG representation with respect to “control” trinucleotides – AGU, GUA, GAU – in the vicinity of the start codon (also known as the translation initiation site, TIS) and coding sequence midpoint in F. johnsoniae

Controls    TIS Upstream    TIS Downstream    Midpoint Upstream    Midpoint Downstream
AGU         0.568           0.587             1.590                1.642
GUA         0.569           0.736             1.633                1.673
GAU         0.616           0.605             0.905                0.956

Table 3.3: AUG representation with respect to “control” trinucleotides – AGU, GUA, GAU – in the vicinity of the start codon (also known as the translation initiation site, TIS) and coding sequence midpoint in E. coli

Controls    TIS Upstream    TIS Downstream    Midpoint Upstream    Midpoint Downstream
AGU         1.010           0.948             1.919                1.899
GUA         1.135           1.082             1.792                1.842
GAU         0.778           0.757             0.905                0.901

Table 3.4: AUG representation with respect to “control” trinucleotides – AGU, GUA, GAU – in the vicinity of the start codon (also known as the translation initiation site, TIS) and coding sequence midpoint in B. subtilis

Controls    TIS Upstream    TIS Downstream    Midpoint Upstream    Midpoint Downstream
AGU         1.470           1.310             2.235                2.321
GUA         1.740           1.562             2.322                2.337
GAU         0.851           0.898             0.941                0.957

Figure 3.9: Representation of AUG as compared to AGU in Bacteroidetes, Proteobacteria, and Firmicutes. Boxes shaded blue signify that the individual phyla were found to be statistically significantly different from one and less than one ($p_{\mathrm{adj}} \le 0.05$). Brackets connecting two boxes highlight comparisons between two phyla found to be statistically significant (one star signifies $p_{\mathrm{adj}} \le 0.05$).

Figure 3.10: Representation of AUG as compared to GAU in Bacteroidetes, Proteobacteria, and Firmicutes. Boxes shaded blue signify that the individual phyla were found to be statistically significantly different from one and less than one ($p_{\mathrm{adj}} \le 0.05$). Brackets connecting two boxes highlight comparisons between two phyla found to be statistically significant (one star signifies $p_{\mathrm{adj}} \le 0.05$).

Figure 3.11: Representation of AUG as compared to GUA in Bacteroidetes, Proteobacteria, and Firmicutes. Boxes shaded blue signify that the individual phyla were found to be statistically significantly different from one and less than one ($p_{\mathrm{adj}} \le 0.05$). Brackets connecting two boxes highlight comparisons between two phyla found to be statistically significant (one star signifies $p_{\mathrm{adj}} \le 0.05$; two stars signify $p_{\mathrm{adj}} \le 0.01$).

RNA-seq data identify promoters with well-conserved -7 elements

Currently, genome-wide annotations of transcription start sites (TSSs) in F. johnsoniae are unavailable. However, several studies have focused on aspects unique to transcription in Bacteroidetes [106–111]. The Bacteroidetes promoter, with -33/-7 elements TTTG/TANNTTTG, differs dramatically from the -35/-10 elements (TTGACA/TATAAT) of E. coli. In order to study this unique promoter more closely, we used Rockhopper (ver. 2.03) [126], an open source computational algorithm capable of analyzing bacterial RNA-seq data, to map transcriptional boundaries and determine operon structures in F. johnsoniae. Since the reliability of a transcriptome map depends on RNA-seq coverage levels, we restricted our search to genes that were deemed to be the lead genes in an operon or genes that did not belong to any operon structure, as predicted by Rockhopper. Of the 1,712 genes in our representative set, TSSs were predicted for 1,235. In many of these representative cases, the TSS mapped upstream of the start codon, with an average mRNA leader length of ∼33 nts. A sequence resembling the -7 promoter element characteristic of the Bacteroidetes was obvious just upstream from many of the predicted TSSs. To further refine this list, we extracted the 50 nts immediately upstream of the predicted TSSs and used BioProspector [127] to examine these regions for sequence motifs and adjust the position of the initial TSS as predicted by Rockhopper. 434 putative -7 elements were identified. For each, the distance to the TSS was calculated, yielding the histogram in Fig. 3.12.

Many of the sequences had 4 or 5 nucleotides between the -7 element and the start site, consistent with bona fide promoters. Furthermore, this distance is also consistent with those seen in mutational analysis studies of the ompA promoter region in F. johnsoniae [108]. Those sequences with spacer lengths of 4-5 nt (deemed probable promoters) were then aligned with respect to the -7 element, and the nucleotide frequencies at each position were plotted in Figs. 3.13 and 3.14. Six nucleotides of the -7 element (T-12, A-11, T-8, T-7, T-6, G-5) were nearly invariant in this set of sequences, consistent with previous work that defined the consensus TANNTTTG. Guanine is clearly underrepresented at positions -10 and -9; thus the -7 element consensus should be revised to TAHHTTTG (where H represents any nucleotide but G). A motif corresponding to the previously described -33 element was much less evident. Previous studies have also remarked that the -7 region is more conserved, but that the -33 region is far less pronounced [106; 109]. These findings thus rule out leaderless mRNA translation, which is predominant in certain Bacteria and Archaea [91; 92; 94], as the exclusive initiation mechanism; leaderless translation therefore does not explain the absence of SD elements in F. johnsoniae.

Figure 3.12: Distance between the putative -7 element and the predicted TSS based on the 434 putative elements identified. We deemed promoters found in the sequences comprising peaks at -4 and -5 as bona fide.

Figure 3.13: Nucleotide frequencies as a function of position from the upstream regions of those sequences comprising the peaks 4 and 5 nucleotides away from the TSS in Fig. 3.12. Evidence of the -7 element (TANNTTTG) is clear. However, the -33 element (TTTG) is much less pronounced.

Figure 3.14: Sequence logo from the upstream regions of those sequences comprising the peaks at positions -4 and -5 from Fig. 3.12. Evidence of the -7 element (TANNTTTG) is clear. However, the -33 element (TTTG) is much less pronounced.
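The difference between the earlier consensus TANNTTTG and the revised consensus TAHHTTTG can be made concrete with a small pattern-matching sketch. It uses the standard IUPAC meanings of N (any base) and H (A, C, or T); the function names are hypothetical, and this is not the motif-finding procedure used above, which relied on BioProspector.

```python
import re

# Subset of IUPAC nucleotide codes needed for the two consensus strings.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]", "H": "[ACT]"}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a regex; lookahead allows overlapping hits."""
    body = "".join(IUPAC[base] for base in consensus)
    return re.compile("(?=(" + body + "))")


def find_elements(seq, consensus="TAHHTTTG"):
    """Return 0-based start positions of all matches of `consensus` in `seq`."""
    return [m.start() for m in consensus_to_regex(consensus).finditer(seq)]
```

For example, the 8-mer TAGCTTTG matches TANNTTTG but not TAHHTTTG, because it carries a G at the first variable position.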

Rate of translation is tuned by mRNA secondary structure near the start codon

A growing body of evidence [96; 128–132] indicates that secondary structure in the vicinity of a gene’s start codon plays a key role in initiation of translation, with structured mRNA elements inhibiting the process. To assess the impact of structure, we folded regions of nucleotides flanking each gene’s annotated start codon in silico in F. johnsoniae, E. coli, and B. subtilis. We then calculated the average pairing probability for each nucleotide position throughout the selected region (see Materials and Methods). Given that most leadered regions are shorter than 100 nt, but that only ∼25% of our genes in F. johnsoniae were assigned a predicted TSS, we performed this folding for stretches of 30, 50, and 100 nt upstream and 100 nt downstream of annotated start codons. A comparison of the three folding regions, as depicted in Fig. 3.15, demonstrates that the average pairing probability per position is not especially altered by increasing the length over which we performed the in silico fold. Figs. 3.16 and 3.17 further examine this comparison for the most active TIRs of octile 1 and the least active TIRs of octile 8 and find that, again, folding over these three upstream regions does not greatly alter the average pairing probability on a per position basis. Thus, in order to ensure that we captured most, if not all, of a gene’s TIR, we used the 100 nt upstream region for our analysis.

A comparison of the average pairing probability per position for each octile of our representative gene set for F. johnsoniae, seen in Figure 3.18, suggests that the average pairing propensity around the start codon increases (i.e. increasing degrees of structure) as the ARD decreases. The most active TIRs of octile 1 showed significantly lower secondary structure around the start codon than the least active TIRs of octile 8, as seen in Fig. 3.19(a). Nucleotides in the vicinity of the start codon showed a reduced propensity for pairing, and this region of lowered structure (-20 to +20) surrounded the start codon in a fairly symmetrical fashion. A two-sample t-test of octile 1 and octile 8, the results of which can be seen in Fig. 3.19(b), further confirms this. The dashed horizontal line represents the cutoff for statistical significance ($p \le 2.5 \times 10^{-4}$). There is indeed a significant difference in structure between these two octiles, both upstream of the initiation site and downstream within the genes themselves. These data suggest that tuning of translation initiation in F. johnsoniae is controlled, at least in part, by secondary structure within the TIR.

For comparison, we analyzed the relationship between folding in the vicinity of the start codon and ARD in E. coli and B. subtilis in the same fashion as described above. In E. coli, again, as shown in Fig. 3.20, we observed an inverse relationship between secondary structure and ARD across the octiles. Indeed, nucleotides near the start codon had a lower tendency for pairing than those more distal, and TIRs of octile 1 exhibited generally lower secondary structure than those of octile 8. Interestingly, as shown in Fig. 3.21, structural differences in E. coli between octile 1 and 8 TIRs were most apparent downstream from the start codon (+1 to +75). This is consistent with earlier studies of synthetic TIR-reporter libraries, which provided evidence that reduced RNA structure downstream from the start codon enhanced translation rates in E. coli [133; 134].

In the case of B. subtilis, as with F. johnsoniae and E. coli, we again observe in Fig. 3.22 the same inverse structure versus ARD behavior. However, in contrast to E. coli, we find that structural differences between octile 1 and 8 TIRs were most apparent upstream from the start codon (-26 to -10), in a region that would also be located immediately upstream of and including the SD sequence. Figure 3.23(a) highlights this in its comparison of octile 1 and octile 8. Furthermore, we can see in Fig. 3.23(b) that the majority of positions that show a significant difference between these two octiles lie in this upstream region where one would expect to find the SD sequence.

Figure 3.15: Comparison of folded regions of all genes in the representative set 30, 50, and 100 nt upstream and 100 nt downstream of annotated translation initiation sites. Without explicit knowledge of transcription start sites in F. johnsoniae, these data demonstrated to us that the pairing probability per position was not especially altered by increasing the length over which we performed the in silico fold. This led us to the conclusion that a 100 nt stretch upstream of the start site would capture the majority of 5’ UTRs in our representative gene set for F. johnsoniae.

Figure 3.16: Comparison of folded regions of octile 1 genes 30, 50, and 100 nt upstream and 100 nt downstream of annotated translation initiation sites. These data demonstrate that the pairing probability per position within the subset of the highly translated genes is not affected by increasing the length over which we performed the in silico fold.

Figure 3.17: Comparison of folded regions of octile 8 genes 30, 50, and 100 nt upstream and 100 nt downstream of annotated translation initiation sites. These data demonstrated to us that the pairing probability per position within the subset of the least translated genes is not especially altered by increasing the length over which we performed the in silico fold.

Figure 3.18: Average pairing probability per position for each octile in F. johnsoniae. There is a clear trend that a decrease in pairing probability correlates with an increase in ARD.

Figure 3.19: (a) Comparison of average pairing probability per position between octiles 1 and 8 in F. johnsoniae, where octile 1 represents high-ARD genes and octile 8 low-ARD genes. (b) p-values for each position calculated via a two-sample t-test. The dashed magenta line represents the cutoff for statistical significance ($p_{\mathrm{adj}} \le 2.5 \times 10^{-4}$).

Figure 3.20: Average pairing probability per position for each octile in E. coli. There is a clear trend that a decrease in pairing probability correlates with an increase in ARD.

Figure 3.21: (a) Comparison of average pairing probability per position between octiles 1 and 8 in E. coli, where octile 1 represents high-ARD genes and octile 8 low-ARD genes. (b) p-values for each position calculated via a two-sample t-test. The dashed magenta line represents the cutoff for statistical significance ($p_{\mathrm{adj}} \le 2.5 \times 10^{-4}$).

Numerous studies [128; 130; 135] have pointed out that bacterial mRNAs form structure within their 5’ UTRs in an effort to modulate accessibility of the mRNA by the 30S ribosomal subunit. Thus, the formation of these structures directly contributes to the regulation of translation. So it comes as no surprise that we, too, see an inverse relationship between secondary structure in the vicinity of the start codon and ARD. However, it is thought that structure alone does not regulate translation. The presence of the SD element, too, directly contributes to the initiation of translation in most bacteria. To further investigate the degree to which structure is used to modulate translation, we considered the nucleotide composition within the vicinity of the start codon (-30 to +30 of the start codon) in F. johnsoniae and compared it with the nucleotide composition within the same region in E. coli and B. subtilis, both SD-dependent organisms.

Figure 3.22: Average pairing probability per position for each octile in B. subtilis. There is a clear trend that a decrease in pairing probability correlates with an increase in ARD.

Figure 3.23: (a) Comparison of average pairing probability per position between octiles 1 and 8 in B. subtilis, where octile 1 represents high-ARD genes and octile 8 low-ARD genes. (b) p-values for each position calculated via a two-sample t-test. The dashed magenta line represents the cutoff for statistical significance ($p_{\mathrm{adj}} \le 2.5 \times 10^{-4}$).

In Fig. 3.24 we first considered B. subtilis. As has been previously reported [123], nucleotide composition prior to the start codon reveals peaks of high frequency A and G, and low frequency of C, at corresponding positions where one would find the SD sequence.

Immediately downstream of the start codon we see that A is particularly overrepresented at the beginning of genes, but approaches a frequency level similar to that of the coding region close to position +30. Figures 3.25 and 3.26 further consider the frequency of G and A and compare their levels between the high-ARD genes of octile 1 and the low-ARD genes of octile 8. This comparison, in Fig. 3.25(a), reveals that in the region where one would find the SD element the more highly translated genes of octile 1 have a slight increase in the frequency of G, possibly due to a stronger SD sequence (i.e. closer to the consensus sequence) being present in more of these genes. This difference is most prominent at position -12. This is reflected in that position’s two-sample t-test p-value, as seen in Fig. 3.25(b), where the dashed horizontal line represents the cutoff for statistical significance (p ≤ 8.2 × 10⁻⁴).
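The text does not state how the significance cutoff was chosen. One reading that is consistent with the quoted value is a Bonferroni correction of a family-wise error rate of 0.05 over the 61 positions from -30 to +30. The sketch below only illustrates that arithmetic; the value of 0.05 and the choice of Bonferroni are assumptions, not something confirmed by the source.

```python
# Assumed family-wise error rate; not stated in the text.
ALPHA = 0.05


def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cutoff under a Bonferroni correction."""
    return alpha / n_tests


# Positions -30 .. +30 around the start codon, one test per position.
n_positions = len(range(-30, 31))
cutoff = bonferroni_cutoff(ALPHA, n_positions)
print(n_positions, f"{cutoff:.1e}")  # prints: 61 8.2e-04
```

Under the same assumption, the cutoff of 2.5 × 10⁻⁴ quoted for the pairing-probability comparisons would correspond to 200 tested positions.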

In Fig. 3.26(a), we see little to no difference in the frequency of A immediately upstream of the start codon between the two octiles, leading one to conclude that both octiles rely on unstable mRNA structure due to its overrepresentation. However, octile 1 does appear to contain more genes with AUG as the initiating codon, as indicated by position 0 in Figs. 3.26(a) and 3.26(b). It was previously noted in [123] that genes with near-cognates (i.e. UUG, GUG) as the initiating codon were more structured than those starting with AUG.

Slight variations between the two octiles do occur immediately downstream of the start codon, but the results of a two-sample t-test do not clearly indicate that either octile is more reliant on the decreased structure due to the overrepresentation of A within this region.

Thus, one could conclude that translation is generally regulated by a balance between the strength of the SD element and mRNA structure within the vicinity of the start codon in B. subtilis.
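The per-position nucleotide frequencies compared between octiles can be computed from start-codon-aligned windows as in the following minimal sketch. The function name position_frequencies and the toy sequences are hypothetical and serve only to illustrate the bookkeeping; they are not the sequences or the code used in this work.

```python
from collections import Counter


def position_frequencies(seqs, alphabet="ACGT"):
    """Return one dict per alignment column mapping nucleotide -> frequency.

    All sequences must be windows of equal length aligned on the start codon.
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts[n] for n in alphabet)
        freqs.append({n: (counts[n] / total if total else 0.0) for n in alphabet})
    return freqs


# Hypothetical toy windows (10 nt upstream followed by the start codon).
octile_1 = ["AGGAGGTAAAATG", "AGGAGATTTAATG", "AAGAGGTAATATG"]
octile_8 = ["ACGTCGTACGGTG", "AGCTGGTCCAATG", "TCGAGCTACGTTG"]

f1 = position_frequencies(octile_1)
f8 = position_frequencies(octile_8)

# Difference in G frequency between the two groups at each position.
diff_G = [a["G"] - b["G"] for a, b in zip(f1, f8)]
```

A per-position significance test, such as the two-sample t-test used in the text, would then be applied to the underlying per-gene indicator values at each column.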

In Fig. 3.27 we see that, similar to B. subtilis, prior to the start codon the nucleotide composition of E. coli reveals peaks of high frequency A and G at positions that correspond to where one would find the SD sequence. Immediately downstream of the start codon we see that the representation of A is not as dominant as it was in B. subtilis (Fig. 3.26(a)), and that its frequency approaches that of the average coding region after the first five codons downstream of the start codon. Octile 1, as seen in B. subtilis, also contains significantly more genes with AUG as the initiating codon, as indicated by position 0 in Figs. 3.29(a) and 3.29(b), potentially decreasing the mRNA structure at the start codon.

Figure 3.24: Nucleotide frequencies in the vicinity of the start codon, which sits at positions 0-2, in B. subtilis.

Figure 3.25: (a) Comparison of the frequency of G between octile 1 and octile 8 in B. subtilis. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (p_adj ≤ 8.2 × 10⁻⁴).

Figure 3.26: (a) Comparison of the frequency of A between octile 1 and octile 8 in B. subtilis. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (p_adj ≤ 8.2 × 10⁻⁴).

Comparing the frequency of G within the region of the SD sequence between octile 1 and octile 8 in Fig. 3.28(a) reveals that both octiles appear to rely similarly on the SD element. Within the spacer region (i.e. the region between the SD element and the start codon) octile 8 shows an increased occurrence of G, which would increase the mRNA structure of the spacer region and decrease accessibility of the start codon. However, all but one of the positions in this region are above the cutoff for statistical significance (p ≤ 8.2 × 10⁻⁴). The situation is reversed when one compares the frequency of A in Fig. 3.29(a) within the spacer region. Figure 3.29(b) highlights that the differences at positions -3 and -6 are particularly significant (p < 8.2 × 10⁻⁴). We can again attribute this to an increase in unstable mRNA structure close to the start codon in more highly translated genes.

Unlike B. subtilis, E. coli contains the gene rpsA that codes for the ribosomal protein S1. Ribosomal protein S1 interacts with a pyrimidine-rich region 5’ to the SD element on an mRNA, which acts as a ribosome recognition site [81]. S1 has been shown to be essential in the initiation of translation within this organism [135; 136]. Komarova et al. found that A/U-rich elements inserted upstream of SD elements stabilized mRNA and enhanced translation of the lacZ mRNA in E. coli. In Fig. 3.27 we can see that the frequency of T reaches levels similar to that of A at positions upstream of -19 and -20. These positions correspond to regions 5’ of the SD sequence. Comparison of octiles 1 and 8 in Fig. 3.30(a) also shows a slight increase in T levels in octile 1 at and upstream of position -15. While none of the positions within this upstream region, with the exception of position -26 (Fig. 3.30(b)), are above the cutoff of statistical significance (p ≤ 8.2 × 10⁻⁴), several of them suggest that S1 more strongly interacts with the mRNAs of octile 1, enhancing their translation. Thus, one could conclude that an increased reliance on mRNA structure within the vicinity of the start codon and the mediation of translation by the ribosomal protein S1 bolster the reduced role of the SD element, as compared to B. subtilis, in the regulation of translation in E. coli.

Turning our attention to F. johnsoniae, we see in Fig. 3.31 that, unlike B. subtilis or E. coli, the upstream region of the start codon is dominated by high frequency peaks of A. This includes the region that corresponds to the SD sequence, reinforcing studies that have indicated its absence in Bacteroidetes [88; 89]. Similar to B. subtilis, however, immediately downstream of the start codon we see that A is also overrepresented. Comparison of octile 1 and octile 8 in Fig. 3.32(a) reveals that a large portion of this high signal in A at positions -14 to -11 can be attributed to octile 1. Indeed, it is within that short stretch of nucleotides, shown in Fig. 3.32(b), that we see the greatest statistical significance between these two octiles. In fact we find that the enrichment of As in this region does appear to correlate with ARD, based on the octile-to-octile comparisons seen in Fig. 3.33.

These data suggest that the SD-aSD interaction, lost during the evolution of Bacteroidetes, has been replaced by a stretch of adenines that creates a mechanism where highly translated genes contain little to no structure within the region where one would find the SD element. This would allow for easier binding of the 30S subunit to the mRNA and pairing with the correct start codon. Furthermore, decreasing degrees of translational activity would indicate an increase in structure within this same region. Similar arguments were made in previous studies [137; 138] that looked at the initiation of translation in Prevotella bryantii TC1-1, also a member of the Bacteroidetes.

Figure 3.27: Nucleotide frequencies in the vicinity of the start codon, which sits at positions 0-2, in E. coli.

Figure 3.28: (a) Comparison of the frequency of G between octile 1 and octile 8 in E. coli. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (p_adj ≤ 8.2 × 10⁻⁴).

Figure 3.29: (a) Comparison of the frequency of A between octile 1 and octile 8 in E. coli. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (p_adj ≤ 8.2 × 10⁻⁴).

Figure 3.30: (a) Comparison of the frequency of T between octile 1 and octile 8 in E. coli. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (p_adj ≤ 8.2 × 10⁻⁴).

To date, there are no known studies that explicitly consider the role that S1 plays in the initiation of translation in Bacteroides. However, based on previously mentioned results from E. coli studies, S1 has been found to be an important determinant in the initiation of translation. Due to S1’s affinity for A/U-rich regions and positive effects on translation [95], we further considered the overrepresentation of T within the upstream region of the start codon, as seen in Fig. 3.31. We find that upstream of the stretch of As that has replaced the SD element there is a slight increase in the levels of T within octile 1, shown in Fig. 3.34(a). In comparing octiles 1 and 8 we find that the frequency levels of T appear consistent with one another and that no positions within this region stand out in terms of statistical significance (Fig. 3.34(b)). Additionally, we do not see the same correlation between the occurrence of T within each octile and ARD in Fig. 3.35 as we found in Fig. 3.33. This suggests that all genes are similarly dependent on the interaction between S1 and their TIR in binding the 30S to the mRNA, and that structure plays the predominant role in regulating translation in Bacteroidetes.

Discussion/Conclusion

The initiation of translation in prokaryotes requires the assembly of a ribosome complex with an initiator tRNA bound to its peptidyl (P)-site and paired to the correct start codon of an mRNA. Recognition of the correct start codon is facilitated by the SD sequence, which lies ∼5-9 nt upstream from the start codon of an mRNA [5]. However, various studies have found that this translational regulatory element is absent from entire phyla, such as the Bacteroidetes. F. johnsoniae is a member of the Bacteroidetes [106; 109]. Through a combination of RNA-seq and Ribo-seq we studied the role of mRNA secondary structure to determine its impact on the initiation of translation within this organism. As a comparison, we also used RNA-seq and Ribo-seq data from two model organisms, E. coli and B. subtilis, to serve as representatives of phyla known to largely depend on the SD sequence to initiate translation. From these data we found that the average Ribo-seq footprint length in F. johnsoniae is ∼4-6 nt shorter than those of E. coli and B. subtilis. This finding led us to conclude that aSD-mRNA pairing rarely, if ever, occurs in F. johnsoniae.

Figure 3.31: Nucleotide frequencies in the vicinity of the start codon, which sits at positions 0-2, in F. johnsoniae.

Figure 3.32: (a) Comparison of the frequency of A between octile 1 and octile 8 in F. johnsoniae. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (p_adj ≤ 8.2 × 10⁻⁴).

Figure 3.33: Comparison of the frequency of A between all octiles in F. johnsoniae.

Figure 3.34: (a) Comparison of the frequency of T between octile 1 and octile 8 in F. johnsoniae. (b) p-values for each position calculated via a two-sample test. The dashed magenta line represents the cutoff for statistical significance (p_adj ≤ 8.2 × 10⁻⁴).

Figure 3.35: Comparison of the frequency of T between all octiles in F. johnsoniae.

Textbook models of translation initiation state that selection of the correct start codon is facilitated by the SD sequence. In organisms without the SD sequence, an alternative mechanism must be in place to ensure recognition of the correct start codon. We compared the number of AUG trinucleotides in the vicinity of the start codon to the number of “control” trinucleotides of the same base composition: AGU, GUA, and GAU. We found that suppression of AUG trinucleotides was not observed in E. coli and B. subtilis. However, we did discover that AUG trinucleotides in the vicinity of the start codon are suppressed in F. johnsoniae, thus reducing spurious initiation events at sites other than the start codon in each TIR.

We then extended this investigation to find out if this underrepresentation of AUG existed within the three phyla: Bacteroidetes, Proteobacteria, and Firmicutes. Overall, we found that the Bacteroidetes do show a suppression of AUG trinucleotides within the vicinity of the start codon as compared to the other two.
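The comparison of AUG counts against control trinucleotides of the same base composition can be illustrated with a short counting routine. This is a minimal sketch under stated assumptions: the function name aug_vs_controls, the convention of skipping the annotated start codon itself, and the toy windows are hypothetical and not taken from the analysis reported here.

```python
def aug_vs_controls(windows, start_index, controls=("AGU", "GUA", "GAU")):
    """Count AUG and control trinucleotides across TIR windows.

    windows     : sequences of equal layout, DNA or RNA alphabet
    start_index : 0-based index of the annotated start codon in each window;
                  the trinucleotide at this index is excluded from the counts
    Returns (aug_count, {control: count}).
    """
    aug = 0
    ctrl = {c: 0 for c in controls}
    for w in windows:
        w = w.upper().replace("T", "U")  # work in the RNA alphabet
        for i in range(len(w) - 2):
            if i == start_index:
                continue  # do not count the annotated start codon
            tri = w[i:i + 3]
            if tri == "AUG":
                aug += 1
            elif tri in ctrl:
                ctrl[tri] += 1
    return aug, ctrl


# Hypothetical toy windows with the annotated start codon at index 3.
aug, ctrl = aug_vs_controls(["AGTATGGAT", "ATGATGAGT"], start_index=3)
mean_control = sum(ctrl.values()) / len(ctrl)
```

Suppression of AUG would show up as an AUG count that is consistently lower than the control counts across a large set of windows.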

Two alternative mechanisms of translation initiation have been identified in prokaryotes: leaderless mRNAs and ribosomal protein S1 mediated translation of unstructured mRNA [88; 139]. While genome-wide annotations of TSSs do not yet exist for F. johnsoniae, we leveraged RNA-seq data to study a transcriptional promoter sequence unique to Bacteroidetes. We found that, among our representative set of 1,712 highly expressed genes, 434 genes contained a putative promoter element exclusive to the Bacteroidetes, and we assigned a TSS to each of those. This led us to conclude that the proposed translational mechanism of leaderless mRNAs is not the dominant method used by F. johnsoniae. However, this does not rule out that such a mechanism occurs within F. johnsoniae. Furthermore, we found that across ARD-ranked subsets of our representative genes for all three of the organisms, there exists an inverse relationship between mRNA structure and ARD in the vicinity of the start codon.

However, the difference between the three organisms lies in the location least likely to form a stable structure. In B. subtilis we found that lower ARD TIRs tended to be more structured upstream of the start codon than their higher ARD counterparts in the region where one would find the SD sequence, as similarly reported in [123]. In E. coli we found that lower ARD TIRs tended to be more structured downstream of the start codon, corroborated by previous studies providing evidence that reduced RNA structure downstream from the start codon enhanced translation rates [133; 134]. In F. johnsoniae, while we also found that lower ARD TIRs tended to be more structured than their higher ARD counterparts, this structure formed more symmetrically around the start codon than in E. coli and B. subtilis. Additionally, we noted that ∼11-14 nt upstream of the start codon there exists a high occurrence of A nucleotides in the region where one would find the SD sequence, suggesting that the SD-aSD interaction had been replaced by a structure-driven mechanism.

Furthermore, our data indicate an explicit ARD dependence within this region. We found statistically significant differences in positions in this region between low and high ARD genes, leading us to conclude that mRNA sequence is an important, if not the predominant, translational determinant in this region. Additionally, we drew similar conclusions concerning the frequency of T nucleotides immediately upstream of this region. This region is a known ribosome recognition site due to its interaction with the ribosomal protein S1. Interestingly, the gene that codes for S1, rpsA, is found among the highly translated genes of octile 1 in our data set for F. johnsoniae. In E. coli, this gene was not included in our representative set of genes due to low expression and it is entirely absent in B. subtilis.

The results presented here lead us to the conclusion that the initiation of translation in F. johnsoniae relies on a combination of low mRNA structure in the vicinity of the start codon and at the site where one would find the SD sequence, and mediation by ribosomal protein S1.

Further research could establish F. johnsoniae as the model organism for SD-independent translation initiation. In particular, studying the effects of deleting or knocking out rpsA from F. johnsoniae’s genome and then investigating how its genes differentially express and translate could shed more light on this unique translational process.

Chapter 4 The role of FHIT on the translation of cancer-associated mRNAs

Introduction: Background on Fhit

This chapter is based on the published paper “Impact of FHIT loss on the translation of cancer-associated mRNAs” [53].

Fhit as 5’ cap scavenger

Fragile histidine triad (Fhit) is a small cytoplasmic protein that does not interact with known tumor suppressors or oncogenes. The protein’s name derives, in part, from the location of its gene at a fragile site on chromosome 3. Chromosomal fragile sites are unstable genomic regions that tend to break during DNA replication and are involved in structural variation [140]. Its name also reflects a His-X-His-X-His-XX motif characteristic of nucleoside hydrolases [141]. Fhit has been shown in vitro to cleave diadenosine triphosphate (ApppA) to yield ADP and AMP [142], and ApppA accumulates in Fhit-deficient cells [143].

All eukaryotic mRNAs receive a 5’ cap consisting of 7-methylguanosine linked to the first transcribed nucleotide through a 5’,5’ triphosphate linkage (m7GpppX). Fhit has recently been identified as a scavenger decapping enzyme [144]. Scavenger decapping enzymes are responsible for degrading m7GpppX cap dinucleotides that are the result of 3’-5’ mRNA decay. DcpS is the major scavenger decapping enzyme [145], and the identification of Fhit as another of this type of enzyme is consistent with previous work describing Fhit’s ability to cleave an mRNA cap-like molecule [146]. Because they can be bound by eIF4E, mRNA decay remnants, such as the cap, can inhibit translation initiation if they accumulate to a level that can compete with mRNA 5’ ends.

Translation plays a critical role in cancer [147], and Fhit, as a genome caretaker and tumor suppressor, is silenced in >50% of cancers [148]. A recent study showed that, for a number of cancer-associated mRNAs, changes in the ribosome occupancy of upstream open reading frames (uORFs) within a gene’s 5’ UTR precede the appearance of detectable tumors [55]. Considering its role as a cap scavenger, this raises the possibility that Fhit loss could alter the translation of cancer-associated mRNAs, possibly as a consequence of increased concentrations of m7GpppX caps due to 3’-5’ mRNA decay.

Materials and Methods

Cell culture and library preparations

To examine the impact of Fhit protein and its loss on the scope of translating mRNAs, our study used Fhit expression-negative H1299 cancer cells carrying an inducible Fhit transgene. These engineered carcinoma cells, which either inducibly express Fhit (D1) or do not (E1), have been described in [149]. Further details of methods and materials for the biological experiments can be found in [53].

RNA-seq, Ribo-seq, and informatics

RNA-seq was performed by the Genomic Services Lab, HudsonAlpha Institute of Biotechnology, on cytoplasmic RNA harvested from input material from sample-matched H1299 D1 and E1 cultures that were used for ribosome profiling (Ribo-seq). Libraries were prepared using their automated pipeline and 50 base-pair paired-end reads were obtained with an Illumina HiSeq 4000. The raw data are available online under BioProject accession number PRJNA390535. The two RNA-seq replicates of D1 cell data initially contained 84,883,380 and 81,745,526 read pairs, respectively, while the two E1 control cell replicate files included 83,998,806 and 63,296,495 read pairs.

Sequencing of the Ribo-seq samples was performed using a HiSeq2500 in the Ohio State Comprehensive Cancer Center Genomics Shared Resource in 50 base-pair single-end mode. The two Ribo-seq replicates of D1 cell data initially contained 68,761,204 and 47,906,118 single-end reads, respectively, while the two E1 control cell replicates contained 40,835,106 and 50,341,628 reads, respectively. 3’ adapter sequences were trimmed using Skewer [114] and sequence alignment was completed using the STAR aligner (version 2.4.0j) [150], which aligned reads to the hg19 genome. Alignments were then converted to the BAM format using SAMtools (ver. 0.1.18) [116]. Reads associated with ribosomal RNA (rRNA) and transfer RNA (tRNA) were removed from alignment files using the BEDtools utility ‘intersect’ [117].

Gene expression was measured by counting aligned reads that overlapped with RefSeq gene annotations using HTSeq (ver. 0.6.0) [118], which preprocesses RNA-seq and Ribo-seq data for differential expression analysis. Unnormalized read counts were then used to detect differential expression with the DESeq2 analysis toolset (version 1.0.17) [151]. DESeq2 corrects for library size by calculating the geometric mean of each gene’s counts across all samples, dividing the counts for each gene by this mean, and taking the median of these ratios within each sample as that sample’s size factor. DESeq2 then fits negative binomial generalized linear models for each gene and tests significance using the Wald test. Additionally, unnormalized read counts were used with RiboDiff (ver. 0.2.1) [56] to detect genes with changes in translation across experimental conditions. The application fits generalized linear models to estimate over-dispersion of RNA-seq and Ribo-seq measurements separately, and uses mRNA abundance and ribosome occupancy to perform a statistical test for differential translation.
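The library-size correction based on per-gene geometric means is commonly known as median-of-ratios normalization. The following minimal sketch illustrates the idea in plain Python. It is not the DESeq2 implementation; the function name size_factors, the handling of zero counts, and the toy count table are assumptions made for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts, one entry per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero and the ratio is undefined.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c <= 0 for c in row):
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor of a sample is the median ratio over all usable genes.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy table: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = {"geneA": [10, 20], "geneB": [100, 200], "geneC": [5, 10], "geneD": [0, 3]}
sf = size_factors(toy_counts)
normalized = {g: [c / s for c, s in zip(row, sf)] for g, row in toy_counts.items()}
```

Dividing each sample’s counts by its size factor places the samples on a common scale before dispersion estimation and testing.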

Alignment Rates

Total alignment rates for the Fhit-positive (D1) RNA-seq replicates were 57% (13% multimapped) and 48% (11% multimapped), with 21% and 25% aligning to regions of r/tRNA. Similarly, for the Fhit-negative (E1) RNA-seq replicates 44% (9% multimapped) and 54% (12% multimapped) of the reads were aligned, with 27% and 24% aligning to regions of r/tRNA. Total alignment rates for Fhit-positive Ribo-seq replicates were 16% (5% multimapped) and 16% (6% multimapped), with 77% and 78% aligning to regions of r/tRNA. Similarly, for the Fhit-negative Ribo-seq replicates 9% (3% multimapped) and 12% (6% multimapped) of the reads were aligned, with 79% and 81% aligning to regions of r/tRNA.

5’ UTR vs Coding regions

To better understand how Fhit’s presence affects all genes, we focused on both a gene’s 5’ UTR and its coding region. This allowed us to localize any statistically significant effects within either region that might have otherwise been overlooked. In order to mitigate the influence either region may have on the other, we removed, in silico, the first 25 bases downstream from the first base of the annotated start codon in the coding regions. Then, using HTSeq and RiboDiff, we measured the differential translation of the 5’ UTR and coding regions separately. A single RefSeq mRNA isoform was chosen for each of 18,283 genes from the HUGO Gene Nomenclature Committee database [152], and annotations were downloaded from the UCSC Table Browser [153].

uORF Predictions

The sequences of the entire 5’ UTR and coding region were analyzed using the seqshoworfs command within MATLAB’s (Ver: 8.1.0.60) Bioinformatics Toolbox (Ver: 4.3), with the 'AlternativeStartCodon' search parameter set to true, to look for the start codons ATG, GTG, CTG, and TTG in reading frames 1, 2, and 3 on the direct strand only. We searched for uORFs with a minimum length of 10 codons using the standard genetic code.
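The ORF search itself was run with MATLAB’s seqshoworfs. As an illustration of the same kind of search, a minimal Python sketch follows. It is not a reimplementation of seqshoworfs: the function name find_orfs, the requirement of an in-frame stop codon, and the convention that the minimum length counts the start codon but not the stop codon are all assumptions made for this example.

```python
START_CODONS = {"ATG", "GTG", "CTG", "TTG"}  # canonical and alternative starts
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code


def find_orfs(seq, min_codons=10):
    """Find ORFs in reading frames 1-3 on the direct strand.

    Returns a list of (frame, start, end) tuples, where frame is 1-based,
    start is the 0-based index of the first base of the start codon, and
    end is the 0-based exclusive index just past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # open an ORF at the first start codon seen
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None  # close the ORF and keep scanning
    return orfs


# Hypothetical toy sequence: a 10-codon ORF beginning at index 2 (frame 3).
example = "CC" + "ATG" + "GCT" * 9 + "TAA" + "C"
example_orfs = find_orfs(example)
```

Applied to a 5’ UTR sequence, each returned interval would be a candidate uORF.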

Results

Identifying mRNAs whose translation is controlled by Fhit

Effects of Fhit on transcriptome (RNA-seq)

Replicate comparison plots in Fig. 4.1 show good agreement between duplicate RNA-seq libraries of E1 and D1 cell lines (Spearman’s r > 0.98). Most mRNAs showed no statistically significant change in their steady-state levels with the expression of Fhit. Setting a cutoff fold-change of 1.5, many of the mRNAs with a statistically significant change upon expression of Fhit were above this threshold, as indicated by the red circles in Fig. 4.2. In total, there were 209 genes that increased expression in Fhit-positive versus Fhit-negative cells and 377 genes that decreased. These mRNAs, along with their changes in abundance, are listed in appendix A.
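Replicate agreement is summarized here by Spearman’s rank correlation. For reference, a minimal pure-Python sketch of that statistic follows, computed as the Pearson correlation of average ranks. The function names and the toy values are hypothetical; in practice a library routine would normally be used.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical read counts for four genes in two replicate libraries.
rep1 = [12, 250, 31, 4000]
rep2 = [15, 230, 40, 3800]
rho = spearman(rep1, rep2)
```

Because only ranks enter the calculation, the statistic is insensitive to the large dynamic range of read counts.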

Effects of Fhit on translatome (Ribo-seq)

Replicate comparison plots of Ribo-seq data in Fig. 4.3 again show good agreement between duplicate D1 and E1 cell lines (Spearman’s r >0.95). Library quality is further supported by the metagene analysis of Ribo-seq reads showing the expected triplet phasing near the start and stop codons with a peak 13 nucleotides upstream of the start codon in Fig. 4.4, which is characteristic of ribosomes bound to this site. The expression of Fhit also impacted the quantity of ribosome-bound transcripts. Again using a cutoff of 1.5-fold change and a p-adj≤0.05, Fig. 4.5 identifies 67 transcripts with an increase of bound ribosomes and 103 transcripts with a decrease in bound ribosomes. These mRNAs are listed along with their changes in bound ribosomes in the supplementary files found in [53].

Effects of Fhit on average ribosome density

As seen above, the induction of Fhit changes the number of Ribo-seq reads for many transcripts. These changes could be the result of changes in translational efficiency, steady-state mRNA levels, or some combination of those factors. To investigate this, the change in the ratio of Ribo-seq reads to RNA-seq reads was measured using RiboDiff [56]. These data were used to determine the average ribosome density (ARD), which tracks translational efficiency for each mRNA [54]. Figure 4.6 shows that scatter plots of ARD between duplicates are in good agreement (Spearman’s r > 0.8). These data were then used to compare coding region ARD between treatment groups. From Fig. 4.7, it is clear that the expression of Fhit had little effect on the translational efficiency across the vast majority of the transcriptome. However, our analysis identified six mRNAs with a statistically significant increase in coding region ARD in Fhit-positive versus Fhit-negative cells and four mRNAs with a decrease in coding region ARD. These genes, and their respective coding region fold changes, are listed in Tab. 4.1.

Figure 4.1: Comparison of RNA-seq libraries from Fhit-negative (E1) and Fhit-positive (D1) H1299 cell lines. Shown are scatterplots of duplicate RNA-seq libraries from Ponasterone A-treated H1299 cells with the Spearman coefficient for each shown in the green box. The E1 cell line is stably transfected with empty vector and the D1 cell line carries an inducible Fhit transgene.

Figure 4.2: The log2 change in steady state levels of the mRNA transcriptome is shown as a function of the RNA-seq read counts. All mRNAs that underwent a statistically significant change in association with Fhit expression (p < 0.05) are indicated with red circles, and we arbitrarily selected a 1.5-fold change as a cutoff for further study. These transcripts and their fold changes can be found in [53].
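The idea behind ARD, the ratio of Ribo-seq reads to RNA-seq reads for a transcript, and its change between conditions can be sketched as follows. RiboDiff estimates this change with generalized linear models; the naive ratio below is only an illustration. The function names, the pseudocount, and the toy counts are assumptions, and the counts are taken to be already normalized for library size.

```python
import math


def ard(ribo_count, rna_count, pseudocount=0.5):
    """Average ribosome density as a ratio of normalized read counts.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return (ribo_count + pseudocount) / (rna_count + pseudocount)


def log2_ard_change(ribo_d1, rna_d1, ribo_e1, rna_e1):
    """log2 ratio of ARD in Fhit-positive (D1) over Fhit-negative (E1) cells."""
    return math.log2(ard(ribo_d1, rna_d1) / ard(ribo_e1, rna_e1))


# Hypothetical transcript whose Ribo-seq signal rises while mRNA is unchanged:
# translational efficiency increases roughly fourfold.
up = log2_ard_change(ribo_d1=400, rna_d1=100, ribo_e1=100, rna_e1=100)

# Hypothetical transcript whose Ribo-seq and mRNA both double:
# more ribosome-bound reads, but no change in translational efficiency.
flat = log2_ard_change(ribo_d1=200, rna_d1=200, ribo_e1=100, rna_e1=100)
```

The second case shows why Ribo-seq changes alone cannot be read as changes in translational efficiency without normalizing to mRNA abundance.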

Fhit loss associated with changes in ribosome distribution across coding region of some genes

We obtained a more detailed view of Fhit’s impact on translation by mapping the distribution of ribosomes across the entire length of each mRNA highlighted in Fig. 4.8. CDKN2C, shown in Fig. 4.8(a), underwent the greatest change in coding region ARD. This gene’s short coding region (506 nt) shows little evidence for ribosome binding in the absence of Fhit and ribosomes bound across this region in the presence of Fhit. CDKN2C also has a long 5’ UTR (1216 nt) that includes several putative upstream open reading frames (uORFs), and bound ribosomes were detected here regardless of Fhit expression. Interestingly, the appearance of ribosomes across the coding region coincided with a reduction in ribosome

Figure 4.3: Comparison of ribosome profiling libraries from Fhit-negative (E1) and Fhit-positive (D1) H1299 cell lines. Shown are scatterplots of duplicate ribosome profiling libraries from Ponasterone A-treated H1299 cells with the Spearman coefficient for each shown in the green box. The E1 cell line is stably transfected with empty vector and the D1 cell line carries an inducible Fhit transgene.

Figure 4.4: Metagene analysis shows periodicity of bound ribosomes. The 5’ ends of reads of each of the ribosome protected fragments were mapped with respect to the start (A) and stop (B) codons. This confirmed a 3 nucleotide periodicity consistent with the triplet nature of codons, and the peak 13 nucleotides upstream of start codons is consistent with ribosomes bound at these locations.

Table 4.1: mRNAs showing Fhit-dependent changes in ribosome loading. The average ribosome density (ARD) for Ponasterone A-treated E1 and D1 cells was determined using RiboDiff and only those transcripts with an adjusted p value <0.05 were considered further. Column 3 shows the log2 induction ratio +Fhit(D1)/-Fhit(E1). The upper panel lists mRNAs with changes in ribosomes bound to coding regions. The lower panel lists mRNAs with changes in ribosomes bound to the 5’ UTR using a cutoff of 25 nucleotides upstream of the annotated start codon.

Symbol       Gene ID         Log2(D1/E1)

Coding Region
CDKN2C       NM_001262       3.47
IGSF9        NM_020789       3.42
ATG16L2      NM_033388       3.39
TP53I3       NM_004881       2.18
MECP2        NM_004992       1.48
CSRP2        NM_001321       1.31
EEF2         NM_001961       0.81
GDA          NM_001242505    1.31
IFIT1        NM_001548       2.37
LRRC73       NM_001012974    4.39

5’ UTR
ACSL1        NM_001995       3.69
TROVE2       NM_004600       3.47
ADAM9        NM_003816       1.51
HIST1H2AD    NM_021065       1.30
H2AFZ        NM_002106       0.72
RACK1        NM_006098       0.88
RPL37A       NM_000998       2.01

Figure 4.5: The log2 change in the coding region for each mRNA is shown as a function of Ribo-seq read counts. Statistically significant changes are indicated by red circles. These transcripts and their fold changes can also be found in [53].

occupancy in the 5’ UTR. This suggests that one or more features of this transcript’s 5’ UTR could be responsible for the Fhit-mediated increase in translation. This can also be seen in ATG16L2 and MECP2, also shown in Fig. 4.8(a).

In Fhit-negative cells, IGSF9 and TP53I3 both showed little evidence for translating ribosomes in their respective coding regions. However, both do show bound ribosomes in the presence of Fhit. TP53I3 displayed stalled ribosomes at the initiation codon that disappear and are replaced by ribosomes distributed across the coding region when Fhit is expressed.

Taken together, these data support a role for Fhit in maintaining the translation of a limited number of mRNAs.

Through our analysis of differential translation with RiboDiff, we also identified a number of mRNAs whose ARD decreased in the presence of Fhit, as shown in Fig. 4.8(b). The most abundant of these was EEF2, which showed a generalized decrease in translating ribosomes across its entire transcript.

Figure 4.6: Average ribosome density of Fhit-negative and Fhit-expressing H1299 cells. Shown are scatterplots of average ribosome densities from ribosome profiling libraries of H1299 cells. The Spearman coefficient for each plot is shown in the green box. The E1 cell line (A) is stably transfected with empty vector and the D1 cell line (B) carries an inducible Fhit transgene. Both cell lines were treated with Ponasterone A.

Figure 4.7: RiboDiff was used to normalize the coding region Ribo-seq data in Fig. 4.5 to the RNA-seq data in Fig. 4.2. Fhit-mediated changes in the resulting average ribosome density (ARD) are plotted as a function of RNA-seq read counts. A cutoff of 1.5-fold was selected. mRNAs that underwent a statistically significant change with Fhit (p < 0.05) are labeled and indicated with red dots.

Validation of ARD targets

In the Schoenberg Lab, within the Department of Biological Chemistry and Pharmacology, Daniel Kiss, PhD, attempted to confirm the preceding results using 3 targets (TP53I3, EEF2, and IFIT1) selected from Tab. 4.1 for evaluation by Western blotting and quantitative reverse transcription polymerase chain reaction (RT-qPCR). As shown in Fig. 4.9(a), Western blotting showed that TP53I3 was 4.2-fold higher in Fhit-positive vs Fhit-negative cells. This agrees with the 4.5-fold change in ARD as measured by RiboDiff. EEF2 was found to be 1.5-fold lower in Fhit-expressing cells by Western blotting and 1.7-fold lower by RiboDiff. However, Western blotting showed a 1.6-fold lowering in IFIT1, which is in contrast to the 5.2-fold lowering seen by RiboDiff. The reason for this difference is unknown at this time. Vinculin expression was unchanged in both RNA-seq and Ribo-seq (Supplementary files in [53]), and quantitative changes for each of these proteins were determined by normalizing to Vinculin.

Figure 4.8: Fhit-mediated changes in ribosome distribution across mRNAs identified by coding region ARD. (a) Ribosome distribution is shown across the mRNAs, identified in the upper right corner of each box, as having increased coding region ARD in Fhit-positive (blue) vs. Fhit-negative (red) cells. (b) Ribosome distribution is shown across the mRNAs, identified in the upper right corner of each box, as having decreased coding region ARD in Fhit-positive (blue) vs. Fhit-negative (red) cells. The coding region of each mRNA is shaded.

Figure 4.9: Fhit-mediated changes in target protein expression are independent of mRNA level. (a) D1 and E1 cells treated for 24 hr with Ponasterone A were analyzed for changes in TP53I3, IFIT1, EEF2, and Vinculin. Shown are the representative Western blots for each of triplicate determinations and the fold change normalized to Vinculin is presented beneath the panels for TP53I3, IFIT1, and EEF2. (b) Cytoplasmic extracts from these cells were spiked with luciferase RNA as a control for sample recovery and analyzed by RT-qPCR for changes in steady-state levels of TP53I3, EEF2, IFIT1, and Vinculin mRNA. The mean fold change ± SEM (n = 3) determined for each mRNA by RT-qPCR (black bars) is shown alongside the average difference from the duplicate RNA-seq datasets. (c) The analysis in (a) was repeated for ADAM9 and FASN, with U2AF35 used as a normalization control.

Fhit-mediated differences in the corresponding mRNAs were measured by RT-qPCR by Daniel Kiss, who then compared these results with those obtained by RNA-Seq (Fig. 4.9(b)). Although TP53I3 mRNA was somewhat higher in Fhit-positive versus Fhit-negative cells (∼1.5-fold higher by RT-qPCR and ∼1.2-fold higher by RNA-Seq), this was significantly less than the increase in TP53I3 protein indicated by Western blotting, by ARD, and by the changes in ribosome distribution seen in Fig. 4.8(a). Although IFIT1 protein was lower in Fhit-positive cells, IFIT1 mRNA was higher (1.8-fold higher by RT-qPCR, 1.6-fold higher by RNA-Seq), and EEF2 protein was ∼1.5-fold lower but EEF2 mRNA was either unchanged by RNA-Seq or slightly lower by RT-qPCR. Taken together, the preceding data confirm that Fhit affects the expression of a limited number of proteins through changes in the translation of their corresponding mRNAs.
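The arithmetic behind this comparison can be made explicit with a short sketch that divides each protein-level fold change by the corresponding mRNA-level fold change. The fold changes are the Western blot and RT-qPCR values quoted in the text; encoding "x-fold lower" as 1/x and setting the EEF2 mRNA change to 1.0 (taking the "unchanged by RNA-Seq" reading) are assumptions of this illustration, as is the function name. It is a rough consistency check, not a substitute for the RiboDiff analysis.

```python
import math

# Fold changes (Fhit-positive / Fhit-negative) quoted in the text.
# "x-fold lower" is encoded as 1/x (assumption of this sketch).
# EEF2 mRNA is set to 1.0, i.e. "unchanged by RNA-Seq" (assumption).
measurements = {
    "TP53I3": {"protein": 4.2, "mrna": 1.5},
    "IFIT1": {"protein": 1 / 1.6, "mrna": 1.8},
    "EEF2": {"protein": 1 / 1.5, "mrna": 1.0},
}


def log2_protein_per_mrna(protein_fc, mrna_fc):
    """log2 of the protein fold change relative to the mRNA fold change.

    Positive values mean protein rose more than mRNA can explain;
    negative values mean protein fell relative to mRNA.
    """
    return math.log2(protein_fc / mrna_fc)


for gene, fc in measurements.items():
    print(gene, round(log2_protein_per_mrna(fc["protein"], fc["mrna"]), 2))
```

A value far from zero indicates that the change in protein is not accounted for by the change in mRNA, which is the pattern the text attributes to altered translation.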

Effects of Fhit on 5’ UTR ribosome occupancy

The identification of 5’ UTR-bound ribosomes on Fhit-regulated mRNAs was unanticipated and suggested a possible link between Fhit-mediated changes in gene expression and cancer-associated changes in 5’ UTR ribosome binding [55]. RiboDiff analysis of the 5’ UTRs identified five mRNAs (Fig. 4.10) with increased 5’ UTR ribosome occupancy in Fhit-expressing versus Fhit-negative cells, and two mRNAs (Fig. 4.11) with decreased 5’ UTR ribosome occupancy. Changes ranged from a 13.6-fold increase (ACSL1) to a 4-fold decrease (RPL37A). In ACSL1 and TROVE2, differences in 5’ UTR ribosome occupancy were also associated with differences in the number of ribosomes distributed across their respective coding regions.

Fhit expression affects the relative representation of 5’ UTR vs coding region-bound ribosomes

For several of the genes in Fig. 4.7, it appeared that additional translation events were occurring within the 5’ UTR, presumably at uORFs. As in [55], translation of the 5’ UTR serves well as a proxy for uORF translation. The change in translational efficiency of the 5’ UTR relative to the coding region was then calculated for all genes using RiboDiff. For this targeted RiboDiff analysis, the Ribo-seq reads in the coding region replaced the RNA-seq data and were compared to the Ribo-seq reads in the 5’ UTR. Using a cutoff of padj ≤ 0.05, this method identified 19 genes having statistically significant changes in relative translational efficiency as a function of Fhit expression. Data demonstrating how these 19 transcripts relate to the rest of the translatome are shown in Fig. 4.13. Several of these, including CDKN2C, ACSL1, TROVE2, MECP2, and RPL37A, were previously identified in Fig. 4.7.
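The quantity compared in this targeted analysis, the ratio of 5’ UTR to coding region footprints in Fhit-positive versus Fhit-negative cells, can be sketched as follows. This is an illustration only and does not reproduce RiboDiff's statistical model. The function names and the pseudocount are hypothetical, and the Benjamini-Hochberg procedure is shown on the assumption that padj denotes a p-value adjusted for multiple testing in this way, which the text does not specify.

```python
import math


def log2_utr_cds_ratio_change(utr_neg, cds_neg, utr_pos, cds_pos, pseudocount=1.0):
    """log2 change in the 5' UTR / CDS footprint ratio with Fhit expression.

    The coding-region counts take the place that RNA-seq counts hold in the
    standard analysis. The pseudocount is an assumption of this sketch.
    """
    ratio_neg = (utr_neg + pseudocount) / (cds_neg + pseudocount)
    ratio_pos = (utr_pos + pseudocount) / (cds_pos + pseudocount)
    return math.log2(ratio_pos / ratio_neg)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four transcripts (illustration only).
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p <= 0.05 for p in padj]
print(padj, significant)
```

A positive log2 value corresponds to ribosomes shifting toward the 5’ UTR when Fhit is expressed, a negative value to a shift toward the coding region.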

The remaining genes fall into four groups. mRNAs whose 5’ UTR and coding region occupancy increase in Fhit-positive cells are shown in Fig. 4.13(a). For these five mRNAs, the relative increase in ribosomes bound to the 5’ UTR exceeded the increase seen in the coding region. The next four genes (Fig. 4.13(b)) saw their 5’ UTR occupancy increase while the coding region occupancy remained virtually unchanged. The last four mRNAs showed decreased occupancy in the 5’ UTR (Fig. 4.13(c),(d)). Two of these were associated with increased coding region ribosome occupancy, and two had little or no change in coding region occupancy.

Figure 4.10: Fhit-mediated changes in ribosome distribution across mRNAs identified by changes in 5’ UTR ARD. Ribosome distribution is shown across the mRNAs identified as having increased 5’ UTR ARD in Fhit-positive (blue) vs Fhit-negative (red) cells. The 5’ UTR of each mRNA is shaded.

Figure 4.11: Fhit-mediated changes in ribosome distribution across mRNAs identified by changes in 5’ UTR ARD. Ribosome distribution is shown across the mRNAs identified as having decreased 5’ UTR ARD in Fhit-positive (blue) vs Fhit-negative (red) cells. The 5’ UTR of each mRNA is shaded.

Figure 4.12: Identification of mRNAs with Fhit-mediated changes in ribosome occupancy of 5’ UTR versus coding region. RiboDiff was used to determine ribosome occupancy of 5’ UTRs vs. the coding region, and the results were used to identify mRNAs for which Fhit expression altered the ratio between these features. The data were plotted as a function of Ribo-seq read counts, with those mRNAs showing statistically significant changes identified by name and as red dots. The full list of mRNAs tested can be found in [53].

Discussion/Conclusions

Fhit is a small cytoplasmic protein with few interacting partners that functions in genome stability and as a tumor suppressor. We approached this study from the perspective that, based on its role as a dinucleotide scavenger, the loss of Fhit increases the levels of m7G caps generated by mRNA 3’-5’ decay, which could act as a global translational inhibitor. We addressed this by performing ribosome profiling of Fhit-deficient H1299 lung cancer cells carrying an inducible FHIT transgene and a matching cell line without this gene. We found that Fhit expression resulted in a statistically significant increase in the steady-state levels of 209 mRNAs and a decrease in 377 mRNAs. The presence of Fhit also affected ribosome occupancy: 67 mRNAs showed a statistically significant increase in bound ribosomes and 103 showed a decrease. Most of these changes in occupancy could be accounted for by corresponding changes in mRNA levels. Once this was taken into account, six mRNAs were identified that showed an increase in ARD in the presence of Fhit, and four mRNAs that showed a decrease in Fhit’s presence.

Table 4.2: mRNAs showing Fhit-dependent changes in ribosomes bound to the 5’ UTR versus coding sequence. RiboDiff was used to determine relative changes in ribosome occupancy of 5’ UTR versus coding region as a function of Fhit expression, using ARD data from Ponasterone A-treated E1 and D1 cells. Only those transcripts with padj < 0.05 were considered further. The complete list of genes, their relative changes and statistical analysis are available in [53].

Symbol     Gene ID      Ratio of 5’ UTR/CDS (Log2 fold change)
TBC1D7     NM_016495    4.65
TIAM1      NM_003253    3.61
ZCCHC3     NM_033089    3.33
DCAF5      NM_003861    3.09
ACSL1      NM_001995    3.09
SERTAD3    NM_013368    3.06
TROVE2     NM_004600    2.89
CLPTM1     NM_001294    2.70
HINT2      NM_032593    2.63
FASN       NM_004104    1.19
KPNB1      NM_002265    0.92
CALR       NM_004343    1.05
RPL37A     NM_000998    1.71
FBLN1      NM_006486    1.92
CREBL2     NM_001310    2.51
MECP2      NM_004992    2.82
CDKN2C     NM_001262    3.41
ZNF552     NM_024762    5.89

Figure 4.13: Fhit changes the ribosome occupancy ratio of 5’ UTR versus coding region. Ribosome distribution is shown across the mRNAs identified in Tab. 4.2 as having changes to 5’ UTR/coding region ribosome occupancy in Fhit-positive (blue) vs Fhit-negative (red) cells. Omitted are mRNAs from Tab. 4.2 whose ribosome distribution is presented in Figs. 4.10 and 4.11. The 5’ UTR of each mRNA is shaded. (a) Five mRNAs showed increased ribosome loading in both the 5’ UTR and the coding region, but the increases observed in the 5’ UTR were larger than those observed in the coding region. (b) Four mRNAs exhibited an increase in 5’ UTR-bound ribosomes while their coding regions remained essentially unchanged. (c) Two mRNAs showed decreased 5’ UTR ribosome occupancy in Fhit-positive versus Fhit-negative cells, but increased coding region ribosome occupancy. (d) Two mRNAs showed decreased 5’ UTR ribosome occupancy in Fhit-positive versus Fhit-negative cells but little change in coding region ribosome occupancy.

Finding only these 10 mRNAs was unexpected. Because we were interested in the relationship between Fhit loss and protein expression, our initial efforts focused only on ribosome occupancy within the coding region. However, this changed with the report by Sendoel et al. [55] showing that changes in 5’ UTR ribosome occupancy in a number of cancer-associated mRNAs precede the onset of malignancy. The first evidence of Fhit loss affecting the 5’ UTR was seen in CDKN2C, ATG16L2, and MECP2, where occupancy was shown to increase in Fhit-negative versus Fhit-positive cells. With Fhit expression restored, the pattern shifted, resulting in higher occupancy in the coding region and lower occupancy in the 5’ UTR. A similar scenario was seen in TP53I3, but in Fhit-negative cells there appears to be a single stalled ribosome bound to the start codon that is released with the expression of Fhit. These findings are consistent with results from previous experiments [154; 155] showing that a uORF regulates downstream translation, which in our case is determined by the presence or absence of Fhit.

The preceding results led us to evaluate our data specifically for Fhit-mediated changes in 5’ UTR ribosome occupancy and for relative changes in ribosome occupancy of 5’ UTRs vs coding regions. This approach yielded five mRNAs whose 5’ UTR ribosome occupancy increased with Fhit, two mRNAs for which occupancy decreased in the 5’ UTR, and 13 mRNAs for which Fhit expression changed the ratio of 5’ UTR versus coding region occupancy. Each of these mRNAs has a potential or verified uORF, the sequences of which are listed in [53]. An example of one such uORF can be seen in Fig. 4.14. These uORFs have mostly non-canonical (CUG, GUG, UUG) start codons, and some extend into the coding region, raising the possibility that the protein products translated from these mRNAs could be affected by the loss of Fhit.
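To make the notion of a potential uORF concrete, the following minimal sketch scans a 5’ UTR for open reading frames that begin at AUG or at one of the non-canonical start codons named above and reports whether each one extends past the main start codon. It is not the procedure used to compile the uORF list in [53]; the function name, the coordinate convention, and the example sequence are hypothetical.

```python
START_CODONS = {"AUG", "CUG", "GUG", "UUG"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_uorfs(mrna, cds_start):
    """Find candidate uORFs whose start codon lies entirely in the 5' UTR.

    mrna:      transcript sequence (A/C/G/U; T is converted to U)
    cds_start: 0-based index of the first nucleotide of the main start codon
    Returns a list of (start, end, start_codon, overlaps_cds) tuples, where
    end is exclusive and includes the stop codon.
    """
    mrna = mrna.upper().replace("T", "U")
    uorfs = []
    for start in range(0, cds_start - 2):
        codon = mrna[start:start + 3]
        if codon not in START_CODONS:
            continue
        # Read in frame until the first stop codon; skip frames with no stop.
        for pos in range(start + 3, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                uorfs.append((start, end, codon, end > cds_start))
                break
    return uorfs


# Hypothetical 16-nt transcript: 6-nt 5' UTR followed by the main AUG.
for uorf in find_uorfs("AGUGCCAUGGCUGUAA", cds_start=6):
    print(uorf)
```

The overlaps_cds flag marks the case highlighted in the text, a uORF that extends into the coding region and could therefore influence the protein product.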

In summary, results presented here show that Fhit loss is associated with changes in ribosome occupancy of the 5’ UTR and/or coding region of 30 different mRNAs, many of which are associated with cancer. This is consistent with recent findings showing that changes in 5’ UTR occupancy of a number of cancer-associated genes precede the appearance of detectable malignancy. The protein products of Fhit-regulated mRNAs function in a number of different pathways, the diversity of which may help explain the challenges encountered in identifying how Fhit loss leads to genome instability and cancer.

Figure 4.14: ADAM9 (NM_003816) with uORF in 5’ UTR. Cells expressing Fhit show high ribosome occupancy within a uORF found in the 5’ UTR of ADAM9. uORFs have mostly non-canonical (CUG, GUG, UUG) start codons, and some extend into the coding region, raising the possibility that the protein products translated from these mRNAs could be affected by the loss of Fhit.

Bibliography

[1] Schrödinger, E. What is life? 12 ed. (1992).

[2] Floridi, L. Information: A Very Short Introduction (Very Short Introductions). Oxford University Press (2010).

[3] Feynman, R. P. “There’s plenty of room at the bottom: An invitation to enter a new field of physics”. http://www.zyvex.com/nanotech/feynman.html. Accessed: 2017-11-10.

[4] Alberts, Bruce. Molecular biology of the cell. Garland (2002).

[5] Slonczewski, Joan, Foster, John Watkins, and Gillen, Kathy M. Microbiology: an evolving science. W.W. Norton (2011).

[6] https://sciencesamhita.com/dna-as-the-genetic-material/. Accessed: 2018-4-13.

[7] Avison, Matthew B. Measuring gene expression. Taylor & Francis (2007).

[8] Crick, F. H. “The origin of the genetic code”. J. Mol. Biol., 38(3):367–379 (1968).

[9] Pierce, B.A. Genetics: A Conceptual Approach. W.H. Freeman (2007).

[10] NCBI. https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes#SG11. Accessed: 2018-4-13.

[11] Biophysical Society. http://www.biophysics.org. Accessed: 2017-11-13.

[12] Faure, G., Ogurtsov, A. Y., Shabalina, S. A., and Koonin, E. V. “Role of mRNA structure in the control of protein folding”. Nucleic Acids Res., 44(22):10898–10911 (2016).

[13] Espah Borujeni, A. and Salis, H. M. “Translation Initiation is Controlled by RNA Folding Kinetics via a Ribosome Drafting Mechanism”. J. Am. Chem. Soc., 138(22):7016–7023 (2016).

[14] Błaszczyk, L. and Ciesiołka, J. “Secondary structure and the role in translation initiation of the 5’-terminal region of p53 mRNA”. Biochemistry, 50(33):7080–7092 (2011).

[15] Tinoco, I. and Bustamante, C. “How RNA folds”. J. Mol. Biol., 293(2):271–281 (1999).

[16] Zuber, J., Sun, H., Zhang, X., McFadyen, I., and Mathews, D. H. “A sensitivity analysis of RNA folding nearest neighbor parameters identifies a subset of free energy parameters with the greatest impact on RNA secondary structure prediction”. Nucleic Acids Res., 45(10):6168–6176 (2017).

[17] Kaiser, D. “The sacred, spherical cows of physics”. http://nautil.us/issue/13/symmetry/the-sacred-spherical-cows-of-physics. Accessed: 2017-11-13.

[18] Pathria, R.K. Statistical mechanics. Butterworth-Heinemann (2005).

[19] Huang, K. Lectures on Statistical Mechanics and Protein Folding. World Scientific (2006).

[20] Tkačik, Gašper, et al. “Thermodynamics and signatures of criticality in a network of neurons”. Proceedings of the National Academy of Sciences, 112(37):11508–11513 (2015). URL http://www.pnas.org/content/112/37/11508.

[21] Jacobs, W. M. and Frenkel, D. “Phase Transitions in Biological Systems with Many Components”. Biophys. J., 112(4):683–691 (2017).

[22] Hesse, J. and Gross, T. “Self-organized criticality as a fundamental property of neural systems”. Front Syst Neurosci, 8:166 (2014).

[23] Mora, Thierry and Bialek, William. “Are biological systems poised at criticality?” Journal of Statistical Physics, 144(2):268–302 (2011). URL https://doi.org/10.1007/s10955-011-0229-4.

[24] Hogeweg, P. “The roots of bioinformatics in theoretical biology”. PLoS Comput. Biol., 7(3):e1002021 (2011).

[25] Ramsden, J. Bioinformatics: An Introduction. Springer-Verlag (2015).

[26] Polanski, Andrzej and Kimmel, Marek. Bioinformatics. Springer (2011).

[27] Hagen, J. B. “The origins of bioinformatics”. Nat. Rev. Genet., 1(3):231–236 (2000).

[28] Sanger, F. and Coulson, A. R. “A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase”. J. Mol. Biol., 94(3):441–448 (1975).

[29] Sanger, F., Nicklen, S., and Coulson, A. R. “DNA sequencing with chain-terminating inhibitors”. Proc. Natl. Acad. Sci. U.S.A., 74(12):5463–5467 (1977).

[30] Heather, J. M. and Chain, B. “The sequence of sequencers: The history of sequencing DNA”. Genomics, 107(1):1–8 (2016).

[31] Bains, W. “Sequence–so what?” Biotechnology (N.Y.), 10(7):751–754 (1992).

[32] Reichhardt, T. “It’s sink or swim as a tidal wave of data approaches”. Nature, 399(6736):517–520 (1999).

[33] Luscombe, N. M., Greenbaum, D., and Gerstein, M. “What is bioinformatics? An introduction and overview”. Yearb Med Inform, (1):83–99 (2001).

[34] National Human Genome Research Institute. https://www.genome.gov/11006943/human-genome-project-completion-frequently-asked-questions/.

[35] Goodwin, S., McPherson, J. D., and McCombie, W. R. “Coming of age: ten years of next-generation sequencing technologies”. Nat. Rev. Genet., 17(6):333–351 (2016).

[36] Maston, G. A., Evans, S. K., and Green, M. R. “Transcriptional regulatory elements in the human genome”. Annu Rev Genomics Hum Genet, 7:29–59 (2006).

[37] Ambros, V. “The functions of animal microRNAs”. Nature, 431(7006):350–355 (2004).

[38] Ambros, V. and Chen, X. “The regulation of genes and genomes by small RNAs”. Development, 134(9):1635–1641 (2007).

[39] Wang, Z., Gerstein, M., and Snyder, M. “RNA-Seq: a revolutionary tool for transcriptomics”. Nat. Rev. Genet., 10(1):57–63 (2009).

[40] Smircich, P., et al. “Ribosome profiling reveals translation control as a key mechanism generating differential gene expression in Trypanosoma cruzi”. BMC Genomics, 16:443 (2015).

[41] Ingolia, N. T. “Ribosome profiling: new views of translation, from single codons to genome scale”. Nat. Rev. Genet., 15(3):205–213 (2014).

[42] Ingolia, N. T. “Ribosome Footprint Profiling of Translation throughout the Genome”. Cell, 165(1):22–33 (2016).

[43] Griffith, M., Walker, J. R., Spies, N. C., Ainscough, B. J., and Griffith, O. L. “Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud”. PLoS Comput. Biol., 11(8):e1004393 (2015).

[44] Hirsch, C. D., Springer, N. M., and Hirsch, C. N. “Genomic limitations to RNA sequencing expression profiling”. Plant J., 84(3):491–503 (2015).

[45] Conesa, A., et al. “A survey of best practices for RNA-seq data analysis”. Genome Biol., 17:13 (2016).

[46] Ingolia, N. T., Ghaemmaghami, S., Newman, J. R., and Weissman, J. S. “Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling”. Science, 324(5924):218–223 (2009).

[47] Ingolia, N. T., Lareau, L. F., and Weissman, J. S. “Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes”. Cell, 147(4):789–802 (2011).

[48] Calviello, L. and Ohler, U. “Beyond Read-Counts: Ribo-seq Data Analysis to Understand the Functions of the Transcriptome”. Trends Genet., 33(10):728–744 (2017).

[49] O’Connor, P. B., Li, G. W., Weissman, J. S., Atkins, J. F., and Baranov, P. V. “rRNA:mRNA pairing alters the length and the symmetry of mRNA-protected fragments in ribosome profiling experiments”. Bioinformatics, 29(12):1488–1491 (2013).

[50] Mohammad, F., Woolstenhulme, C. J., Green, R., and Buskirk, A. R. “Clarifying the Translational Pausing Landscape in Bacteria by Ribosome Profiling”. Cell Rep, 14(4):686–694 (2016).

[51] Woolstenhulme, C. J., Guydosh, N. R., Green, R., and Buskirk, A. R. “High-precision analysis of translational pausing by ribosome profiling in bacteria lacking EFP”. Cell Rep, 11(1):13–21 (2015).

[52] Latif, H., et al. “A streamlined ribosome profiling protocol for the characterization of microorganisms”. BioTechniques, 58(6):329–332 (2015).

[53] Kiss, D. L., Baez, W., Huebner, K., Bundschuh, R., and Schoenberg, D. R. “Impact of FHIT loss on the translation of cancer-associated mRNAs”. Mol. Cancer, 16(1):179 (2017).

[54] Balakrishnan, R., Oman, K., Shoji, S., Bundschuh, R., and Fredrick, K. “The conserved GTPase LepA contributes mainly to translation initiation in Escherichia coli”. Nucleic Acids Res., 42(21):13370–13383 (2014).

[55] Sendoel, A., et al. “Translation from unconventional 5’ start sites drives tumour initiation”. Nature, 541(7638):494–499 (2017).

[56] Zhong, Y., et al. “RiboDiff: detecting changes of mRNA translation efficiency from ribosome footprints”. Bioinformatics, 33(1):139–141 (2017).

[57] Brar, G. A. and Weissman, J. S. “Ribosome profiling reveals the what, when, where and how of protein synthesis”. Nat. Rev. Mol. Cell Biol., 16(11):651–664 (2015).

[58] Bowman, G. R., Voelz, V. A., and Pande, V. S. “Taming the complexity of protein folding”. Curr. Opin. Struct. Biol., 21(1):4–11 (2011).

[59] Wu, P., et al. “Roles of long noncoding RNAs in brain development, functional diversification and neurodegenerative diseases”. Brain Res. Bull., 97:69–80 (2013).

[60] Taylor, J. Paul, Hardy, John, and Fischbeck, Kenneth H. “Toxic proteins in neurodegenerative disease”. Science, 296(5575):1991–1995 (2002).

[61] Conlon, E. G. and Manley, J. L. “RNA-binding proteins in neurodegeneration: mechanisms in aggregate”. Genes Dev., 31(15):1509–1528 (2017).

[62] Hartl, F. U. “Protein Misfolding Diseases”. Annu. Rev. Biochem., 86:21–26 (2017).

[63] Ishiguro, Taro, et al. “Regulatory role of RNA chaperone TDP-43 for RNA misfolding and repeat-associated translation in SCA31”. Neuron, 94(1):108–124.e7 (2017).

[64] McCaskill, J. S. “The equilibrium partition function and base pair binding probabilities for RNA secondary structure”. Biopolymers, 29(6-7):1105–1119 (1990).

[65] de Gennes, P. G. “Statistics of branching and hairpin helices for the dAT copolymer”. Biopolymers, 6(5):715–729 (1968).

[66] Bundschuh, R. and Hwa, T. “Statistical mechanics of secondary structures formed by random RNA sequences”. Phys. Rev. E, 65:031903 (2002).

[67] Bundschuh, R. and Hwa, T. “Phases of the secondary structures of RNA sequences”. EPL (Europhysics Letters), 59(6):903 (2002).

[68] Müller, M., Krzakala, F., and Mézard, M. “The secondary structure of RNA under tension”. Eur. Phys. J. E, 9(1):67–77 (2002).

[69] Liu, Tsunglin and Bundschuh, Ralf. “Analytical description of finite size effects for RNA secondary structures”. Phys. Rev. E, 69:061912 (2004).

[70] Lässig, Michael and Wiese, Kay Jörg. “Freezing of random RNA”. Phys. Rev. Lett., 96:228101 (2006).

[71] David, François and Wiese, Kay Jörg. “Systematic field theory of the RNA glass transition”. Phys. Rev. Lett., 98:128102 (2007).

[72] David, François and Wiese, Kay Jörg. “Field theory of the RNA freezing transition”. Journal of Statistical Mechanics: Theory and Experiment, 2009(10):P10019 (2009).

[73] Hui, S. and Tang, L.-H. “Ground state and glass transition of the RNA secondary structure”. Eur. Phys. J. B, 53(1):77–84 (2006).

[74] Pagnani, A., Parisi, G., and Ricci-Tersenghi, F. “Glassy transition in a disordered model for the RNA secondary structure”. Phys. Rev. Lett., 84:2026–2029 (2000).

[75] Marinari, Enzo, Pagnani, Andrea, and Ricci-Tersenghi, Federico. “Zero-temperature properties of RNA secondary structures”. Phys. Rev. E, 65:041919 (2002).

[76] Krzakala, F., Mézard, M., and Müller, M. “Nature of the glassy phase of RNA secondary structure”. EPL (Europhysics Letters), 57(5):752 (2002).

[77] Chen, Jiunn-Liang and Greider, Carol W. “Functional analysis of the pseudoknot structure in human telomerase RNA”. Proceedings of the National Academy of Sciences, 102(23):8080–8085 (2005). ISSN 0027-8424. URL http://www.pnas.org/content/102/23/8080.

[78] Staple, David W and Butcher, Samuel E. “Pseudoknots: RNA structures with diverse functions”. PLOS Biology, 3(6) (2005). URL https://doi.org/10.1371/journal.pbio.0030213.

[79] Lorenz, R., et al. “ViennaRNA Package 2.0”. Algorithms Mol Biol, 6:26 (2011).

[80] Higgs, P. G. “RNA secondary structure: physical and computational aspects”. Q. Rev. Biophys., 33(3):199–253 (2000).

[81] Laursen, B. S., Sørensen, H. P., Mortensen, K. K., and Sperling-Petersen, H. U. “Initiation of protein synthesis in bacteria”. Microbiol. Mol. Biol. Rev., 69(1):101–123 (2005).

[82] Shine, J. and Dalgarno, L. “Determinant of cistron specificity in bacterial ribosomes”. Nature, 254(5495):34–38 (1975).

[83] http://dnaofbioscience.blogspot.com/2016/05/what-is-shine-dalgarno-sequence.html. Accessed: 2017-11-17.

[84] Vimberg, V., Tats, A., Remm, M., and Tenson, T. “Translation initiation region sequence preferences in Escherichia coli”. BMC Mol. Biol., 8:100 (2007).

[85] Salis, H. M., Mirsky, E. A., and Voigt, C. A. “Automated design of synthetic ribosome binding sites to control protein expression”. Nat. Biotechnol., 27(10):946–950 (2009).

[86] Devaraj, A. and Fredrick, K. “Short spacing between the Shine-Dalgarno sequence and P codon destabilizes codon-anticodon pairing in the P site to promote +1 programmed frameshifting”. Mol. Microbiol., 78(6):1500–1509 (2010).

[87] Chang, B., Halgamuge, S., and Tang, S. L. “Analysis of SD sequences in completed microbial genomes: non-SD-led genes are as common as SD-led genes”. Gene, 373:90–99 (2006).

[88] Nakagawa, S., Niimura, Y., Miura, K., and Gojobori, T. “Dynamic evolution of translation initiation mechanisms in prokaryotes”. Proc. Natl. Acad. Sci. U.S.A., 107(14):6382–6387 (2010).

[89] Nakagawa, S., Niimura, Y., and Gojobori, T. “Comparative genomic analysis of translation initiation mechanisms for genes lacking the Shine-Dalgarno sequence in prokaryotes”. Nucleic Acids Res., 45(7):3922–3931 (2017).

[90] Moll, I., Grill, S., Gualerzi, C. O., and Blasi, U. “Leaderless mRNAs in bacteria: surprises in ribosomal recruitment and translational control”. Mol. Microbiol., 43(1):239–246 (2002).

[91] Zheng, X., Hu, G. Q., She, Z. S., and Zhu, H. “Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes”. BMC Genomics, 12:361 (2011).

[92] Cortes, T., et al. “Genome-wide mapping of transcriptional start sites defines an extensive leaderless transcriptome in Mycobacterium tuberculosis”. Cell Rep, 5(4):1121–1131 (2013).

[93] Kramer, P., Gabel, K., Pfeiffer, F., and Soppa, J. “Haloferax volcanii, a prokaryotic species that does not use the Shine Dalgarno mechanism for translation initiation at 5’-UTRs”. PLoS ONE, 9(4):e94979 (2014).

[94] Shell, S. S., et al. “Leaderless Transcripts and Small Proteins Are Common Features of the Mycobacterial Translational Landscape”. PLoS Genet., 11(11):e1005641 (2015).

[95] Komarova, A. V., Tchufistova, L. S., Dreyfus, M., and Boni, I. V. “AU-rich sequences within 5’ untranslated leaders enhance translation and stabilize mRNA in Escherichia coli”. J. Bacteriol., 187(4):1344–1349 (2005).

[96] Skorski, P., Leroy, P., Fayet, O., Dreyfus, M., and Hermann-Le Denmat, S. “The highly efficient translation initiation region from the Escherichia coli rpsA gene lacks a Shine-Dalgarno element”. J. Bacteriol., 188(17):6277–6285 (2006).

[97] Salah, P., et al. “Probing the relationship between Gram-negative and Gram-positive S1 proteins by sequence analysis”. Nucleic Acids Res., 37(16):5578–5588 (2009).

[98] Lim, K., Furuta, Y., and Kobayashi, I. “Large variations in bacterial ribosomal RNA genes”. Mol. Biol. Evol., 29(10):2937–2948 (2012).

[99] Martens, E. C., Koropatkin, N. M., Smith, T. J., and Gordon, J. I. “Complex glycan catabolism by the human gut microbiota: the Bacteroidetes Sus-like paradigm”. J. Biol. Chem., 284(37):24673–24677 (2009).

[100] Martens, E. C., Roth, R., Heuser, J. E., and Gordon, J. I. “Coordinate regulation of glycan degradation and polysaccharide capsule biosynthesis by a prominent human gut symbiont”. J. Biol. Chem., 284(27):18445–18457 (2009).

[101] Ley, R. E., et al. “Evolution of mammals and their gut microbes”. Science, 320(5883):1647–1651 (2008).

[102] Ley, R. E., et al. “Obesity alters gut microbial ecology”. Proc. Natl. Acad. Sci. U.S.A., 102(31):11070–11075 (2005).

[103] Turnbaugh, P. J., et al. “An obesity-associated gut microbiome with increased capacity for energy harvest”. Nature, 444(7122):1027–1031 (2006).

[104] Ley, R. E., Turnbaugh, P. J., Klein, S., and Gordon, J. I. “Microbial ecology: human gut microbes associated with obesity”. Nature, 444(7122):1022–1023 (2006).

[105] Johnson, E. L., Heaver, S. L., Walters, W. A., and Ley, R. E. “Microbiome and metabolic disease: revisiting the bacterial phylum Bacteroidetes”. J. Mol. Med., 95(1):1–8 (2017).

[106] Vingadassalom, D., et al. “An unusual primary sigma factor in the Bacteroidetes phylum”. Mol. Microbiol., 56(4):888–902 (2005).

[107] Chen, S., Bagdasarian, M., Kaufman, M. G., and Walker, E. D. “Characterization of strong promoters from an environmental Flavobacterium hibernum strain by using a green fluorescent protein-based reporter system”. Appl. Environ. Microbiol., 73(4):1089–1100 (2007).

[108] Chen, S., Bagdasarian, M., Kaufman, M. G., Bates, A. K., and Walker, E. D. “Mutational analysis of the ompA promoter from Flavobacterium johnsoniae”. J. Bacteriol., 189(14):5108–5118 (2007).

[109] Chen, S., Kaufman, M. G., Bagdasarian, M., Bates, A. K., and Walker, E. D. “Development of an efficient expression system for Flavobacterium strains”. Gene, 458(1-2):1–10 (2010).

[110] Bayley, D. P., Rocha, E. R., and Smith, C. J. “Analysis of cepA and other Bacteroides fragilis genes reveals a unique promoter structure”. FEMS Microbiol. Lett., 193(1):149–154 (2000).

[111] Mastropaolo, M. D., Thorson, M. L., and Stevens, A. M. “Comparison of Bacteroides thetaiotaomicron and Escherichia coli 16S rRNA gene expression signals”. Microbiology (Reading, Engl.), 155(Pt 8):2683–2693 (2009).

[112] McBride, M. J. and Kempf, M. J. “Development of techniques for the genetic manipulation of the gliding bacterium Cytophaga johnsonae”. J. Bacteriol., 178(3):583–590 (1996).

[113] McBride, M. J. and Zhu, Y. “Gliding motility and Por secretion system genes are widespread among members of the phylum Bacteroidetes”. J. Bacteriol., 195(2):270–278 (2013).

[114] Jiang, H., Lei, R., Ding, S. W., and Zhu, S. “Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads”. BMC Bioinformatics, 15:182 (2014).

[115] Langmead, B. and Salzberg, S. L. “Fast gapped-read alignment with Bowtie 2”. Nat. Methods, 9(4):357–359 (2012).

[116] Li, H., et al. “The Sequence Alignment/Map format and SAMtools”. Bioinformatics, 25(16):2078–2079 (2009).

[117] Quinlan, A. R. “BEDTools: The Swiss-Army Tool for Genome Feature Analysis”. Curr Protoc Bioinformatics, 47:1–34 (2014).

[118] Anders, S., Pyl, P. T., and Huber, W. “HTSeq–a Python framework to work with high-throughput sequencing data”. Bioinformatics, 31(2):166–169 (2015).

[119] Li, G. W., Burkhardt, D., Gross, C., and Weissman, J. S. “Quantifying absolute protein synthesis rates reveals principles underlying allocation of cellular resources”. Cell, 157(3):624–635 (2014).

[120] Lorenz, R., et al. “ViennaRNA Package 2.0”. Algorithms Mol Biol, 6:26 (2011).

[121] Baggett, N. E., Zhang, Y., and Gross, C. A. “Global analysis of translation termination in E. coli”. PLoS Genet., 13(3):e1006676 (2017).

[122] Subramaniam, A. R., et al. “A serine sensor for multicellularity in a bacterium”. Elife, 2:e01501 (2013).

[123] Rocha, E. P., Danchin, A., and Viari, A. “Translation in Bacillus subtilis: roles and trends of initiation and termination, insights from a genome analysis”. Nucleic Acids Res., 27(17):3567–3576 (1999).

[124] Pallejà, Albert, Harrington, Eoghan D., and Bork, Peer. “Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?” BMC Genomics, 9(1):335 (2008). URL https://doi.org/10.1186/1471-2164-9-335.

[125] Richardson, E. J. and Watson, M. “The automatic annotation of bacterial genomes”. Brief. Bioinformatics, 14(1):1–12 (2013).

[126] McClure, R., et al. “Computational analysis of bacterial RNA-Seq data”. Nucleic Acids Res., 41(14):e140 (2013).

[127] Liu, X., Brutlag, D. L., and Liu, J. S. “BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes”. Pac Symp Biocomput, pp. 127–138 (2001).

[128] Gu, W., Zhou, T., and Wilke, C. O. “A universal trend of reduced mRNA stability near the translation-initiation site in prokaryotes and eukaryotes”. PLoS Comput. Biol., 6(2):e1000664 (2010).

[129] Scharff, L. B., Childs, L., Walther, D., and Bock, R. “Local absence of secondary

structure permits translation of mRNAs that lack ribosome-binding sites”. PLoS

Genet., 7(6):e1002155 (2011).

[130] Keller, T. E., Mis, S. D., Jia, K. E., and Wilke, C. O. “Reduced mRNA secondary-

structure stability near the start codon indicates functional genes in prokaryotes”.

Genome Biol Evol, 4(2):80–88 (2012).

[131] Bentele, K., Saffert, P., Rauscher, R., Ignatova, Z., and Blüthgen, N. “Efficient

translation initiation dictates codon usage at gene start”. Mol. Syst. Biol., 9:675

(2013).

[132] Espah Borujeni, A., et al. “Precise quantification of translation inhibition by mRNA

structures that overlap with the ribosomal footprint in N-terminal coding sequences”.

Nucleic Acids Res., 45(9):5437–5448 (2017).

[133] Goodman, D. B., Church, G. M., and Kosuri, S. “Causes and effects of N-terminal

codon bias in bacterial genes”. Science, 342(6157):475–479 (2013).

[134] Kudla, G., Murray, A. W., Tollervey, D., and Plotkin, J. B. “Coding-sequence

determinants of gene expression in Escherichia coli”. Science, 324(5924):255–258

(2009).

[135] Duval, M., et al. “Escherichia coli ribosomal protein S1 unfolds structured mRNAs

onto the ribosome for active translation initiation”. PLoS Biol., 11(12):e1001731

(2013).

[136] Sørensen, M. A., Fricke, J., and Pedersen, S. “Ribosomal protein S1 is required for translation of

most, if not all, natural mRNAs in Escherichia coli in vivo”. J. Mol. Biol., 280(4):561–

569 (1998).

[137] Accetto, T. and Avguštin, G. “Inability of Prevotella bryantii to form a functional

Shine-Dalgarno interaction reflects unique evolution of ribosome binding sites in

Bacteroidetes”. PLoS ONE, 6(8):e22914 (2011).

[138] Seničar, L. and Accetto, T. “The 5’ untranslated mRNA region base content can

greatly affect translation initiation in the absence of secondary structures in Prevotella

bryantii TC1-1”. FEMS Microbiol. Lett., 362(1):1–4 (2015).

[139] Hockenberry, A. J., Stern, A. J., Amaral, L. A. N., and Jewett, M. C. “Diversity of

translation initiation mechanisms across bacterial species is driven by environmental

conditions and growth demands”. Mol. Biol. Evol. (2017).

[140] Fungtammasan, A., Walsh, E., Chiaromonte, F., Eckert, K. A., and Makova, K. D.

“Corrigendum: A genome-wide analysis of common fragile sites: What features determine

chromosomal instability in the human genome?” Genome Res., 26(10):1451

(2016).

[141] Brenner, C., Bieganowski, P., Pace, H. C., and Huebner, K. “The histidine triad

superfamily of nucleotide-binding proteins”. J. Cell. Physiol., 181(2):179–187 (1999).

[142] Barnes, L. D., et al. “Fhit, a putative tumor suppressor in humans, is a dinucleoside

5′,5‴-P1,P3-triphosphate hydrolase”. Biochemistry, 35(36):11529–11535 (1996).

[143] Murphy, G. A., Halliday, D., and McLennan, A. G. “The Fhit tumor suppressor

protein regulates the intracellular concentration of diadenosine triphosphate but not

diadenosine tetraphosphate”. Cancer Res., 60(9):2342–2344 (2000).

[144] Taverniti, V. and Séraphin, B. “Elimination of cap structures generated by mRNA

decay involves the new scavenger mRNA decapping enzyme Aph1/FHIT together with

DcpS”. Nucleic Acids Res., 43(1):482–492 (2015).

[145] Li, Y. and Kiledjian, M. “Regulation of mRNA decapping”. Wiley Interdiscip Rev

RNA, 1(2):253–265 (2010).

[146] Draganescu, A., Hodawadekar, S. C., Gee, K. R., and Brenner, C. “Fhit-nucleotide

specificity probed with novel fluorescent and fluorogenic substrates”. J. Biol. Chem.,

275(7):4555–4560 (2000).

[147] Pelletier, J., Graff, J., Ruggero, D., and Sonenberg, N. “Targeting the eIF4F translation

initiation complex: a critical nexus for cancer development”. Cancer Res., 75(2):250–

263 (2015).

[148] Karras, J. R., Schrock, M. S., Batar, B., and Huebner, K. “Fragile Genes That Are

Frequently Altered in Cancer: Players Not Passengers”. Cytogenet. Genome Res.,

150(3-4):208–216 (2016).

[149] Saldivar, J. C., et al. “Initiation of genome instability and preneoplastic processes

through loss of Fhit expression”. PLoS Genet., 8(11):e1003077 (2012).

[150] Dobin, A., et al. “STAR: ultrafast universal RNA-seq aligner”. Bioinformatics,

29(1):15–21 (2013).

[151] Anders, S. and Huber, W. “Differential expression analysis for sequence count data”.

Genome Biol., 11(10):R106 (2010).

[152] Gray, K. A., et al. “Genenames.org: the HGNC resources in 2013”. Nucleic Acids

Res., 41(Database issue):D545–552 (2013).

[153] Karolchik, D., et al. “The UCSC Table Browser data retrieval tool”. Nucleic Acids

Res., 32(Database issue):D493–496 (2004).

[154] Palam, L. R., Baird, T. D., and Wek, R. C. “Phosphorylation of eIF2 facilitates

ribosomal bypass of an inhibitory upstream ORF to enhance CHOP translation”. J.

Biol. Chem., 286(13):10939–10949 (2011).

[155] Barbosa, C. and Romão, L. “Translation of the human erythropoietin transcript

is regulated by an upstream open reading frame in response to hypoxia”. RNA,

20(5):594–608 (2014).
