The Dynamic Fate of the Junction Complex

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

in the Graduate School of The Ohio State University

By

Robert Dennison Patton, B.S.

Graduate Program in Physics

The Ohio State University

2020

Dissertation Committee

Dr. Ralf Bundschuh, Advisor

Dr. Guramrit Singh, Co-Advisor

Dr. Michael Poirier

Dr. Enam Chowdhury


© Copyrighted by

Robert Dennison Patton

2020


Abstract

The Exon Junction Complex, or EJC, is a group of proteins deposited on mRNA upstream of exon-exon junctions during splicing, which stays with the mRNA up until translation. It consists of a trimeric core made up of EIF4A3, Y14, and MAGOH, and serves as a binding platform for a multitude of peripheral proteins. As a lifelong partner of the mRNA, the EJC influences almost every step of post-transcriptional mRNA regulation, including splicing, packaging, transport, translation, and Nonsense-Mediated Decay (NMD).

In Chapter 2 I show that the EJC exists in two distinct complexes, one containing CASC3, and the other RNPS1. These complexes are localized to the cytoplasm and nucleus, respectively, and a new model is proposed wherein the EJC begins its life post-splicing bound by RNPS1, which at some point before translation in the cytoplasm is exchanged for CASC3. These alternate complexes also take on distinct roles; RNPS1-EJCs help form a compact mRNA structure for easier transport and make the mRNA more susceptible to NMD. CASC3-EJCs, on the other hand, cause a more open mRNA configuration and stabilize it against NMD.

Following the work with the two alternate EJCs, in Chapter 3 I examine why previous research only found the CASC3-EJC variant. Using the EJC as a case study, since it contains both a directly binding peripheral protein (CASC3) and an indirectly binding peripheral protein (RNPS1), I show that CLIP-Seq, a photo-crosslinking method widely used to study RNA Binding Proteins (RBPs), is ill-suited to researching proteins which do not make direct contact with RNA or otherwise photo-crosslink poorly (RBP-Associated Factors, or RAFs). I propose that chemical crosslinking methods such as RIPiT-Seq be used instead when aiming to study RAFs.

Finally, in Chapter 4 I consider another EJC peripheral protein, PYM, thought to aid in disassembly of the EJC co-translationally. By inhibiting translation as well as PYM-EJC binding through a mutation on MAGOH, I find that PYM is unnecessary for EJC removal but does inhibit non-canonically located EJCs. I lay the groundwork for future work regarding PYM and NMD by creating a new NMD target transcript list which gives promising results when examining data from a previous study on NMD.


Dedication

I stand at the seashore, alone, and start to think.

There are the rushing waves, mountains of molecules
Each stupidly minding its own business
Trillions apart, yet forming white surf in unison

Ages on ages, before any eyes could see
Year after year, thunderously pounding the shore as now
For whom, for what?
On a dead planet, with no life to entertain

Never at rest, tortured by energy
Wasted prodigiously by the sun, poured into space
A mite makes the sea roar

Deep in the sea, all molecules repeat the patterns
Of one another till complex new ones are formed
They make others like themselves
And a new dance starts

Growing in size and complexity
Living things, masses of atoms, DNA, protein
Dancing a pattern ever more intricate

Out of the cradle onto the dry land
Here it is standing
Atoms with consciousness, matter with curiosity
Stands at the sea, wonders at wondering

I, a universe of atoms
An atom in the universe

- Richard P. Feynman


Acknowledgments

First and foremost, I would like to thank my advisor Dr. Ralf Bundschuh and co-advisor Dr. Guramrit Singh. Ralf has been an incredible mentor and his guidance has helped me to grow as a scientist more than I ever could have imagined. And though Amrit works in another department, he has always made me feel like one of his own and a true member of his group. Most of all I am thankful to both of them for their incredible patience. As someone from a pure physics background coming abruptly into the world of molecular genetics there are many mistakes to be made, and I have made them all at least twice, but my advisors have been kind and understanding teachers through the entire process.

I would also like to thank my committee members Dr. Michael Poirier and Dr. Enam Chowdhury. Michael, whose group we share group meetings with, has played a prominent role throughout my graduate research career. Enam similarly played a large role as one of my undergraduate research collaborators and mentors, back when I still did AMO. And I would like to thank my group members who preceded me, Dr. Billy Baez and Dr. Blythe Moreland, who were incredibly helpful in every conceivable way when I was a fresh graduate student. My current group members Elan Shatoff and Kyle Crocker have also been indispensable companions. Of course, I must also thank my collaborators in the Singh lab, especially Dr. Lauren Woodward, Justin Mabin, and Zhongxia Yi, who are responsible for nearly all of the experimental work done in bringing this thesis to fruition.


Unfortunately, I cannot remember the name of every teacher I ever had, but I feel indebted to every person who has poured some knowledge into this brain. I am especially thankful for some of my undergraduate physics professors who played a large part in shaping my goals and interests, especially Dr. Robert Perry and Dr. Thomas Lemberger.

And of course, my undergraduate advisor at OSU, Dr. Gregory Lafyatis, who was not only a gracious advisor but gave me my first experiences traversing a more biological realm.

Finally, I would like to thank my mother, Jennifer, and father, Bob, for instilling in me the importance of curiosity from a young age, and always encouraging me towards lofty aspirations. And I would not be who I am today without my elder siblings Sara, Eric, and Courtney, who only complained a little bit when I made them watch NOVA or took their books about science and physics without asking.


Vita

The Ohio State University Columbus, Ohio

Bachelor of Science in Physics; Minor in Japanese May 2015

Thesis: “Optimization of Direct-Write 3D Photolithography in PMMA”

Publications

1. Patton, R., Sanjeev, M., Woodward, L., Mabin, J., Bundschuh, R., & Singh, G. (2020). Chemical crosslinking enhances RNA immunoprecipitation for efficient identification of binding sites of proteins that photo-crosslink poorly with RNA. RNA, https://doi.org/10.1261/rna.074856.120

2. Gangras, P., Gallagher, T., Patton, R., Yi, Z., Parthun, M., Tietz, K., Deans, N., Bundschuh, R., Amacher, S., & Singh, G. (2020). Zebrafish rbm8a and magoh mutants reveal EJC developmental functions and suggest new rules for 3′UTR intron-dependent NMD. PLOS Genetics, https://doi.org/10.1101/677666

3. Mabin, J.*, Woodward, L.*, Patton, R.*, Yi, Z., Jia, M., Wysocki, V., Bundschuh, R., & Singh, G. (2018). The exon junction complex undergoes a compositional switch that alters mRNP structure and nonsense-mediated mRNA decay activity. Cell Rep., 25, 2431-2446. [* co-first authors]

Fields of Study

Major Field: Physics


Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Table of Contents
List of Figures
Chapter 1 Introduction
1.1 The Central Dogma
1.1.1 Physical structure of nucleic acids
1.1.2 Physical structure of proteins
1.1.3 The physics of RNA:Protein interactions
1.2 mRNA Binding Proteins and the Exon Junction Complex
1.2.1 Splicing and the birth of mRNA
1.2.2 The Exon Junction Complex
1.2.3 EJC peripheral proteins
1.2.4 Canonical EJCs
1.2.5 Non-Canonical EJCs
1.2.6 EJCs influence mRNP structure
1.2.7 EJCs and Nonsense-Mediated Decay
1.3 Biophysics and Bioinformatics
1.3.1 Biophysics and statistical mechanics
1.3.2 Bioinformatics and high-throughput RNA sequencing
1.3.3 Footprinting of RNA:protein complexes
1.4 Outline
Chapter 2 The EJC Undergoes a Compositional Switch
2.1 Abstract
2.2 Introduction
2.3 Materials and Methods
2.3.1 Experimental methods
2.3.2 High-throughput sequencing library preparation
2.3.3 Adaptor trimming and PCR duplicate removal
2.3.4 Alignment and removal of multimapping reads
2.3.5 Removal of stable RNA mapping reads
2.3.6 Human reference transcriptome
2.3.7 K-mer analysis
2.3.8 Motif enrichment analysis
2.3.9 Differential enrichment analysis
2.3.10 Estimation of nuclear versus cytoplasmic levels
2.3.11 Comparison to genes with detained introns
2.3.12 occupancy and mRNA half-life estimates
2.3.13 Gene ontology analysis
2.4 Results
2.4.1 RNPS1 and CASC3 associate with the EJC core in a mutually exclusive manner
2.4.2 Proteomic analysis of alternate EJCs
2.4.3 Alternate EJCs are structurally distinct
2.4.4 RNPS1 and CASC3 bind RNA via the EJC core with key distinctions
2.4.6 Kinetics of translation and mRNA decay impact the pool of alternate EJC-bound mRNAs
2.4.7 RNPS1 is required for efficient NMD of all transcripts; CASC3 is required only for a subset
2.4.8 Increased CASC3 levels can slow down NMD
2.5 Discussion
2.5.1 EJC composition and mRNP structure
2.5.2 CASC3-EJC and pre-translation mRNPs
2.5.3 EJC composition and NMD
Chapter 3 Comparison of High-Throughput Sequencing Methods for Quantification of RNA:Protein Binding
3.1 Abstract
3.2 Introduction
3.3 Materials and Methods
3.3.1 Source data for classification of RBPs and RAFs
3.3.2 analysis
3.3.3 High-Throughput DNA Sequencing
3.3.4 Human reference transcriptome and annotations
3.3.5 Read distribution assignment
3.3.6 Read density calculation
3.3.7 Quantification of EJC binding in canonical and non-canonical regions, and on intronless genes
3.3.8 Quantification of individual EJC binding sites
3.3.9 Quantification of EJC binding on Recursively Spliced (RS) exons and their neighboring exons
3.4 Results
3.4.1 Proteins that poorly UV crosslink to RNA are widespread in RNA metabolism
3.4.2 Chemical crosslinking stabilizes RBPs and RAFs within RNPs
3.4.3 Formaldehyde crosslinking enhances RIPiT-Seq signal over background
3.4.4 xRIPiT-Seq is comparable to CLIP-Seq in revealing binding sites of CASC3
3.4.5 xRIPiT-Seq is superior to nRIPiT-Seq and CLIP-Seq for identifying RNPS1-EJC binding sites
3.4.6 xRIPiT-Seq is more efficient than CLIP-Seq to detect increased RNPS1 occupancy on exons preceding recursively spliced exons
3.5 Discussion
3.5.1 xRIPiT-Seq reveals binding profile and functions of RNPS1, an RAF within EJC
3.5.2 Factors that impact UV-crosslinking ability of RNA-associated proteins
3.5.3 Formaldehyde as an alternative to UV for in vivo RBP/RAF crosslinking
Chapter 4 Deciphering the Relationship between PYM and the EJC
4.1 Abstract
4.2 Introduction
4.3 Materials and Methods
4.3.1 Experimental Methods
4.3.2 High-Throughput DNA Sequencing
4.3.3 Human reference transcriptome and annotations
4.3.4 Genic read distribution assignment
4.3.5 Isoform analysis and read distribution
4.3.6 Prediction of transcripts with Premature Termination Codons (PTCs)
4.4 Results
4.4.1 PYM is not necessary for removal of canonical EJCs
4.4.2 EJCs lacking PYM interaction bind more at non-canonical positions
4.4.3 MAGOH-E117R-EJCs are enriched on single exon genes
4.4.4 Reproducing NMD upregulation results to test target lists
4.4.5 3’UTR length correlates with NMD targeting as expected
4.4.6 Comparing Baird et al.’s de novo list to Ensembl’s NMD biotype
4.4.7 Creating our own lists and understanding ISAR’s dependencies
4.4.8 Construction of a final set of NMD targets and revisiting UPF1
4.5 Discussion
4.5.1 PYM is unnecessary for canonical EJC removal, but does reduce levels of non-canonical EJCs
4.5.2 ISAR is unreliable for PTC prediction, but a combination Ensembl and PTC-based NMD list performs admirably
Chapter 5 Conclusions
5.1 A New Understanding of the EJC Makeup and its Implications
5.2 The Importance of Choosing the Right Crosslinking Method for an Experiment
5.3 Examining the Role of PYM and Finding a Trustworthy NMD Annotation
5.4 Future Directions
Bibliography


List of Figures

Figure 1.1 The DNA to RNA to Protein Pathway
Figure 1.2 Chemical Structure of RNA
Figure 1.3 Typical Elements Found in RNA Secondary Structure
Figure 1.4 Protein Secondary Structures
Figure 1.5 Intron Excision and Alternative Splicing
Figure 1.6 The EJC Lifecycle
Figure 1.7 EJC Structure
Figure 1.8 Canonical NMD Pathway
Figure 2.1 RNPS1 and CASC3 Exist in Mutually Exclusive EJCs in Mammalian Cells
Figure 2.2 CASC3- and RNPS1-Containing Complexes Have Distinct Protein Composition and Hydrodynamic Size
Figure 2.3 Proteins Enriched During FLAG-EJC Pulldown
Figure 2.4 A Schematic Illustrating the Main Steps in RIPiT-seq
Figure 2.5 Example Read Coverage
Figure 2.6 CASC3 Associates More Strongly with Canonical EJCs than RNPS1
Figure 2.7 Comparison of Samples in Meta-Exon Plot
Figure 2.8 K-Mer Analysis of Alternate Peripheral Proteins
Figure 2.9 Venn Diagram of Genes with Preferential RNPS1 or CASC3 Binding
Figure 2.10 MA-Plot of RNPS1-EJC vs CASC3-EJC Footprint Expression
Figure 2.11 Comparison of Cytoplasmic to Nuclear Fraction by Enriched Transcripts
Figure 2.12 Analysis of Nuclear-Enriched
Figure 2.13 Characteristics of Transcripts Localized to the Nucleus
Figure 2.14 Analysis of Detained Intron Containing Genes
Figure 2.15 CASC3/RNPS1-EJC Occupancy with Translation Inhibition
Figure 2.16 Boxplots Showing Translation Efficiency and Alternate EJC Occupancy by Gene Groups
Figure 2.17 CASC3-EJC GO Term Analysis
Figure 2.18 Translation Inhibition’s Impact on CASC3/RNPS1-EJC Occupancy
Figure 2.19 Boxplots Showing Various Conditions Impacted by Translation Inhibition
Figure 2.20 Effect of Core and Peripheral EJC Component Knockdown on NMD Targets
Figure 2.21 Real-Time PCR Analysis of NMD Targets
Figure 2.22 β39 mRNA is Upregulated upon EJC Factor Knockdown
Figure 2.23 Promotion of Switch to Late-Acting CASC3-EJC Dampens NMD Activity
Figure 2.24 Model of the Proposed EJC Compositional Switch
Figure 3.1 Overlap of Predicted RBPs and UV-Crosslinkable RBPs
Figure 3.2 RAF Functions
Figure 3.3 RAFs Involved in the mRNA Lifecycle
Figure 3.4 Formaldehyde crosslinking stabilizes an RAF-EJC interaction
Figure 3.5 Replicate Comparisons
Figure 3.6 Formaldehyde Crosslinking Enhances RIPiT-Seq Signal for CASC3 (1)
Figure 3.7 Linear Fits and their Intercepts
Figure 3.8 Formaldehyde Crosslinking Enhances RIPiT-Seq Signal for CASC3 (2)
Figure 3.9 CASC3-EJC Densities in Highly Expressed Genes – nRIPiT versus xRIPiT
Figure 3.10 CASC3-EJC Canonical Site Detection by Expression
Figure 3.11 CHX Treatment Enhances CASC3 Signal
Figure 3.12 Comparison Between HEK293 and HeLa Cells
Figure 3.13 xRIPiT-Seq and CLIP-Seq are Robust and Comparable Approaches to Identify CASC3 Binding Sites (1)
Figure 3.14 xRIPiT-Seq and CLIP-Seq are Robust and Comparable Approaches to Identify CASC3 Binding Sites (2)
Figure 3.15 CASC3-EJC Densities in Highly Expressed Genes – CLIP versus xRIPiT
Figure 3.16 xRIPiT-Seq Outperforms nRIPiT-Seq to Detect RNPS1 Binding Sites (1)
Figure 3.17 xRIPiT-Seq Outperforms nRIPiT-Seq to Detect RNPS1 Binding Sites (2)
Figure 3.18 RNPS1-EJC Densities in Highly Expressed Genes – nRIPiT versus xRIPiT
Figure 3.19 RNPS1-EJC Canonical Site Detection by Expression
Figure 3.20 xRIPiT-Seq Outperforms CLIP-Seq to Detect RNPS1 Binding Sites (1)
Figure 3.21 xRIPiT-Seq Outperforms CLIP-Seq to Detect RNPS1 Binding Sites (2)
Figure 3.22 RNPS1-EJC Densities in Highly Expressed Genes – CLIP versus xRIPiT
Figure 3.23 CHX Treatment Has Little Impact on RNPS1 Signal
Figure 3.24 (Preceding Page) Comparison of xRIPiT-Seq and CLIP-Seq Signal for RNPS1 and CASC3 Occupancy on High- and Low-Scoring RS Exons and their Neighboring Exons
Figure 3.25 Example Gene Containing RS Exon and Showing xRIPiT Sensitivity
Figure 3.26 Transcript Level RS Exon Analysis
Figure 3.27 A Schematic Summarizing Suitable Approaches for Identification of Binding Sites of RBPs versus RAFs
Figure 4.1 Translation Inhibition but not PYM-EJC Interaction Inhibition Impacts Non-Canonical EJC Occupancy
Figure 4.2 Disrupting PYM Interaction is Largely Inconsequential at Canonical Positions
Figure 4.3 Disrupting PYM Interaction has a Larger Impact when Considering Entire Transcripts
Figure 4.4 Meta-Exon Analysis Shows Lower Mutant-EJC Occupancy at the Canonical Binding Position
Figure 4.5 Non-Canonical/Canonical Ratios Shows Greater Ratios when PYM Interaction is Inhibited
Figure 4.6 Single Exon Genes Show Higher EJC Levels with PYM Interaction Inhibited
Figure 4.7 3’UTR Analysis is Consistent in UPF1-KD/Control
Figure 4.8 3’UTR Analysis is Consistent in ICE1-KD/Control
Figure 4.9 NMD Analysis of ICE1-KD/Control with Baird Processing is Consistent with Expectations
Figure 4.10 NMD Analysis of ICE1-KD/Control with Our Processing is Consistent with Expectations
Figure 4.11 NMD Analysis of UPF1-KD/Control with Baird Processing is Consistent only with Ensembl’s List
Figure 4.12 NMD Analysis of UPF1-KD/Control with Our Processing is Consistent only with Ensembl’s List
Figure 4.13 Comparison of ISAR Methods
Figure 4.14 Comparison of ISAR Methods with Baird et al.’s List
Figure 4.15 Comparison of Ensembl, Python, and ISAR Stringent Lists
Figure 4.16 Visual of Unique NMD Targets
Figure 4.17 Area of the Finalized List
Figure 4.18 NMD Analysis of ICE1-KD/Control and UPF1-KD/Control with Our Processing and New NMD List Behaves as Expected
Figure 4.19 NMD Analysis of ICE1-KD/Control and UPF1-KD/Control with Our Processing and New, Strict, NMD List Behaves as Expected


Chapter 1

Introduction

1.1 The Central Dogma

“The Central Dogma. This states that once ‘information’ has passed into protein it cannot get out again. In more detail, the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of residues in the protein.” – Francis Crick, 1958

To this day the above statement represents our core understanding of genetics. Indeed, the flow of sequence-based information is ultimately responsible for the creation of all proteins, complex molecular machines which are involved in every facet of a cell’s lifecycle, from maintaining the cell’s structure and guiding replication to transporting materials within the cell, the creation of more proteins, and the replication of the nucleic acids which encode them. Those nucleic acids take the form of the very stable deoxyribonucleic acid (DNA), which encodes, generally in single copies, the genes which each correspond to a separate protein, and ribonucleic acid (RNA), which may take on a host of functions, even structural, but primarily acts as a template for direct translation into proteins in the form of messenger RNA (mRNA).

DNA may be duplicated into DNA, as during cell replication, and RNA into RNA, as in the case of RNA viruses. However, of key interest in this work are the two transitions into new molecular information carriers: the transcription of DNA into RNA followed by the translation of RNA into proteins (Figure 1.1). A group of proteins known as the Exon Junction Complex (EJC) is deposited on mRNA around the time of transcription and remains with the mRNA right up until the translation step, playing a vital role in a multitude of the processes which join these fundamental information transfers. Better understanding the role of the EJC and the cohort of proteins it mediates and associates with is the primary goal of this work.


Figure 1.1 The DNA to RNA to Protein Pathway. An illustration of the flow of information in genetics: sequences in DNA are transcribed into nucleotide sequences in RNA, which are translated into amino acid sequences in proteins (yourgenome, 2020).


1.1.1 Physical structure of nucleic acids

DNA and RNA are long, sequential polymers of nucleotides, each carrying one of the nucleobases cytosine, guanine, adenine, and either uracil (in the case of RNA) or thymine (in the case of DNA), abbreviated C, G, A, U, and T. The nucleotides are joined by a backbone of ribose (a pentose monosaccharide; deoxyribose in DNA) and phosphate. Each nucleobase is a nitrogenous base in the form of either a pyrimidine, for C and U/T, or a purine, for A and G. The bases are connected to the 1’ carbon of the ribose via an N-glycosidic (covalent) bond. The riboses themselves are connected by phosphates linking the 5’ hydroxyl of one sugar to the 3’ hydroxyl of the next, and this asymmetric linkage produces a clear directionality used by the cell (e.g., translation takes place in the 5’ to 3’ direction of mRNAs). This work will almost exclusively deal with RNA, the chemical structure of which can be seen in Figure 1.2. Since the phosphates each retain a single negatively charged oxygen atom while the ribose and bases are neutral, RNA also possesses an overall negative charge along its “backbone.”


Figure 1.2 Chemical Structure of RNA. View is from 5’ (top) to 3’ (bottom). Riboses are connected to nucleobases on their 1’ carbon and feature a hydroxyl group on their 2’ carbon. The 3’ carbon of one ribose is connected to the 5’ carbon of the next by a negatively charged phosphate group. The four RNA nucleobases are shown. Note that carbon is implicit at each intersection and hydrogen is implicit on each carbon. The hydrogen, nitrogen, and oxygen atoms found on each base readily participate in hydrogen bonding, both with other bases and with amino acid residues (Brown & Brown, 2020).

Structurally, RNA is naturally single-stranded (ssRNA) and less stable than DNA. This instability arises because the additional oxygen attached to the 2’ carbon produces a hydroxyl group which leaves the backbone susceptible to hydrolysis. There is also evidence that RNA predates DNA (and proteins) evolutionarily, which may account for its relative flexibility both physically and in function. ssRNA will bind to itself in a sequence-dependent manner via hydrogen bonding of its bases: A with U and G with C, which form 2 and 3 hydrogen bonds, respectively. This leads to segments along the RNA that are double-stranded (dsRNA). dsRNA generally adopts an A-form helical structure, which is similar to the B-form of double-stranded DNA but features a relatively narrow major groove and wide minor groove. Segments of ssRNA are ordinarily intermixed with dsRNA, and this forms a series of hairpin or internal loops as seen in Figure 1.3. The combination of helical dsRNA and joining ssRNA stretches constitutes the secondary structure of RNA. The RNA may then take on tertiary structures by folding into energetically favorable arrangements, often conformally with proteins (Xu & Chen, 2015). Both the primary and secondary structures adopted by RNA may play roles in the RNA:protein binding pathway.
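As a rough illustration of the pairing rules described above, the following Python sketch counts the Watson-Crick hydrogen bonds holding together a short stretch of dsRNA, using 2 bonds per A-U pair and 3 per G-C pair. The function name and the example sequences are hypothetical and chosen only for illustration; wobble pairs such as G-U and any base stacking contributions are deliberately ignored.

```python
# Hydrogen bonds per Watson-Crick pair: A-U forms 2, G-C forms 3.
HBONDS = {("A", "U"): 2, ("U", "A"): 2, ("G", "C"): 3, ("C", "G"): 3}


def duplex_hbonds(strand_5to3, partner_5to3):
    """Count Watson-Crick hydrogen bonds between two antiparallel RNA strands.

    Both strands are written 5' to 3'. The partner is reversed so that
    position i of the first strand faces its antiparallel counterpart.
    Positions that are not Watson-Crick pairs contribute 0 in this sketch.
    """
    if len(strand_5to3) != len(partner_5to3):
        raise ValueError("strands must have equal length")
    facing = partner_5to3.upper()[::-1]
    return sum(HBONDS.get(pair, 0) for pair in zip(strand_5to3.upper(), facing))


# Hypothetical 5-nt helix: GGAUC pairs with the antiparallel strand GAUCC.
print(duplex_hbonds("GGAUC", "GAUCC"))
```

Because G-C pairs contribute more bonds than A-U pairs, GC-rich helices of the same length are held together by more hydrogen bonds, which is one simple way to see why sequence composition affects secondary structure stability.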

Figure 1.3 Typical Elements Found in RNA Secondary Structure. As RNA binds to itself via hydrogen bonds between bases it forms loops and bulges. Not shown is the A-helical structure typically adopted by segments of dsRNA (Steffen & Giegerich, 2005).

1.1.2 Physical structure of proteins

Like RNA, proteins are also polymers made up of sequences of molecules, specifically amino acids. The amino acids are linked to their neighbors by peptide bonds wherein the carboxyl group of one amino acid reacts with the amino group of another, producing a single molecule of water and leaving behind a covalent bond. The side chain of each amino acid, which lends its chemical properties to the completed protein, may be polar or nonpolar and possess neutral, positive, or negative charge. The primary structure of a protein is determined solely by its amino acid sequence. The secondary structure is a series of more condensed motifs, primarily α-helices and β-sheets (Figure 1.4). α-helices are the most common motif and involve the amino acids coiling in a helical manner held together by hydrogen bonds which skip over some amino acids. β-sheets are made up of β-strands (approximately straight chains of amino acids with side chains in alternating directions) which are closely bound in parallel with hydrogen bonds connecting each amino acid between the two strands. Finally, proteins take on a tertiary folded structure driven primarily by hydrophobic interactions which is then solidified with various kinds of chemical bonds.

Figure 1.4 Protein Secondary Structures. On the left is a diagram of an α-helix with exposed residues. On the right is an antiparallel β-sheet (a) with β-strands running in opposite directions, and a parallel β-sheet (b) with β-strands running in the same direction (Thandapani, O'Connor, Bailey, & Richard, 2013).


1.1.3 The physics of RNA:Protein interactions

From an electrostatic perspective, RNAs and proteins in cells are charged molecules which would clump together in non-meaningful ways if not for electrostatic screening. The first layer of defense against such processes is water, a solvent which is highly dielectric (high relative permittivity) by virtue of its polarity (Doerr & Yu, 2010). This natural dielectric occurring throughout the cell and between charged molecules greatly reduces the effective field strength felt by each molecule, both attractive and repulsive. Further screening is provided by the salts in solution, i.e., ions which can move freely within the cell. In the case of RNA, which is negatively charged, cations like Na+ and K+ will surround the molecule and partially neutralize the local charge, minimizing free energy (Roh, et al., 2015). This cloud of ions largely hides the RNA from electrostatic interactions unless the molecules in question come into very close range; for typical mammalian cells the range of the Coulomb interactions is of order 10 Å (a few atomic diameters) or less. The exact length scales depend on several factors and are characterized by the Debye length (Lipfert, Sim, Herschlag, & Doniach, 2010). Such screening also takes place across proteins, which feature patches of positive, negative, and neutral charge. The end effect is that RNA and proteins must come into close proximity before electrostatic interactions can take place, which puts more emphasis on relative concentrations of RNA and RNA Binding Proteins (RBPs).
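To give a sense of scale for this screening length, the sketch below evaluates the standard Debye length expression for a dilute electrolyte, λ_D = sqrt(ε_r ε_0 k_B T / (2 N_A e² I)), where I is the ionic strength. The ionic strength of 0.15 M, the temperature of 310 K, and the relative permittivity of 74 for water are assumed, textbook-style values rather than figures taken from the cited references, and the crowded interior of a real cell will deviate from this idealized estimate.

```python
import math

# Physical constants in SI units.
EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
K_B = 1.380649e-23           # Boltzmann constant, J/K
E_CHARGE = 1.602176634e-19   # elementary charge, C
N_A = 6.02214076e23          # Avogadro constant, 1/mol


def debye_length_angstrom(ionic_strength_molar, temp_kelvin=310.0, eps_r=74.0):
    """Debye screening length in angstroms for a dilute electrolyte.

    ionic_strength_molar: ionic strength in mol/L (assumed value, e.g. 0.15).
    temp_kelvin, eps_r: assumed temperature and relative permittivity of water.
    """
    ionic_strength_si = ionic_strength_molar * 1000.0  # mol/L -> mol/m^3
    lam_m = math.sqrt(
        eps_r * EPS0 * K_B * temp_kelvin
        / (2.0 * N_A * E_CHARGE**2 * ionic_strength_si)
    )
    return lam_m * 1e10  # metres -> angstroms


print(f"Debye length at 0.15 M: {debye_length_angstrom(0.15):.1f} angstroms")
```

Under these assumptions the estimate comes out a little under 8 Å, consistent with the order of 10 Å quoted above, and it scales as the inverse square root of the ionic strength, so lower salt concentrations lengthen the range of electrostatic interactions.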

Once in range, RBPs are initially drawn to (typically, but not necessarily, double-stranded) RNA in the nucleus or cytoplasm because of the electrostatic interaction between the negatively charged RNA backbone and patches of positive charge on the RBP (Chen & Varani, 2005). In contrast to DNA-binding proteins, where on average 80% of a positive patch becomes part of the DNA interface, the fraction of an RBP’s positive patches which forms the RNA interface varies considerably, from 0% to 100% (Stawiski, Gregoret, & Mandel-Gutfreund, 2003), which illustrates that the electrostatic attraction is primarily a device for bringing the molecules close together and often plays a lesser role in binding. This variability in positive patch coverage may also be attributed to the diverse physical structures of RNA molecules and the fact that binding often takes place over several patches, not just one (Chen & Varani, 2005). Less frequently, ssRNA is drawn into protein pockets due to hydrophobic patches on the RBP, though ssRNA may also attract an RBP by electrostatic interactions (Shulman-Peleg, Shatsky, Nussinov, & Wolfson, 2008). As compared to non-nucleic-acid-binding proteins (NNBPs), RBPs have similar positive patch sizes but differ in several key ways. RBPs typically have a much higher dipole moment, which allows for stronger general attraction to negatively charged RNA molecules, and are characterized by a large overlap between positive patch location and deep structural clefts (Chen & Varani, 2005). This implies a greater dependence on RNA secondary structure for the final binding configuration. RBPs also feature lower molecular weights on average and have lower surface accessibility as compared to NNBPs, which translates to molecules which can more easily move around RNAs and have more complicated site binding dynamics.

Once in close range, the RNA and RBP typically go through conformational changes (Gopinath, 2009) to maximize binding via intermolecular hydrogen bonds and van der Waals forces (Jones, Daley, Luscombe, Berman, & Thornton, 2001). This means protein-induced RNA folding, RNA-induced protein folding, or co-induced folding are often required for a functional RNP (Gopinath, 2009). The exact method of binding depends largely on the RNA and RBP structures, with RBPs featuring common RNA binding domains (RBDs) or motifs which dictate the type of interaction. These RBDs recognize sites with as few as 3 – 8 nucleotides and may induce binding even with some sequence variation (Jankowsky & Harris, 2015). Because these domains are relatively short, multiple RBDs are often linked together, which increases RBP affinity for a more specific RNA (Ban, Zhu, Melcher, & Xu, 2015). This modular way of combining RBDs also makes disassembly and regulation easier as it incorporates multiple weakly bound segments rather than a single strong connection (Lunde, Moore, & Varani, 2017). Since the domains may also be separated by any number of amino acids, it is possible for RBPs to recognize RNA with motifs separated by variable numbers of bases, further increasing specificity. The five most common RBDs in human genes are the RNA-Recognition Motif (RRM, also RBD or RNP motif), regions rich in arginine and glycine (RGG/RG), the DEAD motif, zinc finger motifs (zf-CCH, zf-C2H2_jaz, etc.), and the K Homology domain (KH domain) (Gerstberger, Hafner, & Tuschl, 2014). Also of special interest is the double-stranded RNA-Binding Motif (dsRBM/dsRBD), which is common in mRNA binding proteins (mRBPs). RBPs may also lack sequence or secondary structure specificity and instead bind by targeting features at the 5’ or 3’ ends of RNA molecules. For example, IFIT5 binds by recognizing the 5’ triphosphate end of an RNA molecule (Ban, Zhu, Melcher, & Xu, 2015). Many such special cases exist.
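The idea of two linked domains recognizing short motifs separated by a variable number of bases can be sketched computationally as a bipartite sequence search. The Python example below is only a toy model: the motifs, the gap limits, and the function name are hypothetical and do not correspond to any particular RBP, and real binding depends on structure and affinity rather than exact string matches.

```python
import re


def find_bipartite_sites(rna, motif_a, motif_b, min_gap=0, max_gap=10):
    """Find motif_a followed by motif_b with min_gap..max_gap bases between them.

    Returns a list of (start, end, gap_length) tuples with 0-based, end-exclusive
    coordinates. A lookahead is used so that overlapping sites are all reported,
    and the lazy gap picks the nearest downstream motif_b for each motif_a.
    """
    pattern = re.compile(
        "(?=(" + re.escape(motif_a) + ")"
        "([ACGU]{" + str(min_gap) + "," + str(max_gap) + "}?)"
        "(" + re.escape(motif_b) + "))"
    )
    sites = []
    for match in pattern.finditer(rna.upper()):
        sites.append((match.start(1), match.end(3), len(match.group(2))))
    return sites


# Hypothetical transcript fragment and motifs, for illustration only.
print(find_bipartite_sites("AAUGUAGGCAUGCC", "UGUA", "GCAUG"))
```

Requiring both motifs within a bounded gap is far more restrictive than matching either short motif alone, which mirrors the argument above that combining several weak, short recognition elements increases specificity.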


Of special interest are the so-called DEAD box proteins, characterized by the DEAD motif, which play key roles in all levels of RNA processing from transcription and chaperoning to degradation (Linder & Jankowsky, 2011). All DEAD box proteins are enzymes that use ATP to bind to or remodel RNA (Shazman & Mandel-Gutfreund, 2008). The core of these proteins binds to RNA exclusively on the backbone over a five-nucleobase region, meaning there is no inherent sequence dependence (Linder & Jankowsky, 2011). This unique ability to bind anywhere along RNA and even slide along it is what makes RBPs containing the DEAD motif so central to mRNA metabolism.

In order to study these interactions, it is first necessary to understand the physical interactions which take place in close quarters after the RNA and RBP make contact. Covalent bonds, although common within RNA and RBPs, are largely absent from their interactions with each other. This is understandable, as covalent bonds are quite strong and would make dissociation very energy intensive. On the other hand, hydrogen bonds occur in all known RNP interactions and typically in high numbers (2-60 per interface) (Barik, Nithin, Pilla, & Bahadur, 2015). A hydrogen bond occurs when a hydrogen atom is covalently bonded to an electronegative atom, like nitrogen or oxygen, which produces a polar group. This polar group is then attracted to other electronegative atoms and binding occurs electrostatically. Hydrogen bonds are much weaker than covalent bonds (of order 10-100 kJ/mol compared to 100-1000 kJ/mol) with some variation depending on the atoms involved, but still represent the stronger end of individual RNA-RBP interactions.

Weaker still are van der Waals interactions, a catch-all term for interactions that occur between atoms at very close range and are of order 1-10 kJ/mol. This may include dipole-dipole (and higher order) interactions, induction forces, dispersion forces, and, at very close range, a repulsive component from the Pauli exclusion principle. Van der Waals forces tend to hold RNA and RBPs which have come in close conformal contact in that state.

In addition to direct interactions between atoms there are more complex and specific interactions. One example is the salt bridge, in which a hydrogen bond is reinforced by an electrostatic interaction between two ionic groups. In the case of RNPs this typically means the side chain nitrogen atom of a positively charged residue interacting with the negative phosphate group of a nucleotide at a short distance (less than 4 Å) (Barik, Nithin, Pilla, & Bahadur, 2015). Another common, more complex interaction is “stacking,” or π–π interactions, in which closely interleaved aromatic rings – flat ring-shaped molecules like nucleobases which are also found in Histidine, Phenylalanine, Tyrosine, and Tryptophan residues – undergo proximity-related electrostatic interactions arising from quadrupole moments. Nucleobases may also stack with Arginine through π–cation stacking, in which bases are interleaved with the positive guanidinium moiety of Arginine through a quadrupole-monopole interaction (Barik, Nithin, Pilla, & Bahadur, 2015). It is all of these interactions, functioning together and often occurring non-concurrently, that guide an RNP into its final configuration.

1.2 mRNA Binding Proteins and the Exon Junction Complex

As the fundamental transporters of information within the cell, protein-encoding mRNAs play an indispensable part in the central dogma of molecular biology. However, mRNAs do not act alone, but are cloaked in peripheral proteins which are just as necessary in helping them fulfill their role. Concurrent with its conception and up until its eventual degradation every messenger RNA is joined by a host of mRNA binding proteins (mRBPs), the sum of which forms an mRNA:protein particle, or mRNP (Singh, Pratt, Yeo, & Moore, 2015). The constituents of this mRNP change throughout the life of the molecule, with the associated proteins performing a multitude of diverse functions (Hentze, Castello, Schwarzl, & Preiss, 2018). Of particular interest are those proteins involved in mRNA metabolism, which make up nearly 50% of all mRBPs (Gerstberger, Hafner, & Tuschl, 2014). These mRBPs play a role in processing of the mRNA itself, including splicing, polyadenylation, and capping, sub-cellular localization, transport within the cell, translation, and degradation (Müller-McNicoll & Neugebauer, 2013).

1.2.1 Splicing and the birth of mRNA

Genes within DNA do not encode proteins alone, but have many sections around and within coding regions which are used for gene expression regulation and a host of other purposes. These non-coding regions found within genes, known as introns, must be removed during RNA biogenesis and the regions which will become the final mRNA, or exons, must be spliced together in the correct sequence in order to produce a viable mRNA (Figure 1.5). This process is accomplished in the nucleus by the spliceosome, a protein and RNA complex which cuts introns out of pre-mRNA and re-attaches the exons. During RNA biogenesis 5’ capping and polyadenylation also take place. The addition of a 7-methylguanosine cap on the 5’ end of the pre-mRNA will serve as a signal for mRNA export from the nucleus and also direct translation initiation. Likewise, a poly-A tail (around 200 repeats of adenine) is attached to the 3’ end of the pre-mRNA which signals the end of the mRNA.

A single region, or gene, found on DNA may also code for several variations of that gene, or transcripts, based on alternative splicing. Some exons may be selectively cut out in addition to the introns, or overlapping exons on the same stretch of DNA may be alternatively selected, which leads to structurally similar but functionally distinct final proteins (Figure 1.5). The relative locations of start and stop codons, three-nucleotide sequences which signal the beginning and end of a given coding region, can also be impacted by this process. Alternative splicing allows information to be more densely packed into DNA and may take advantage of processes like Nonsense-Mediated Decay (NMD) in order to express isoforms at different levels.


Figure 1.5 Intron Excision and Alternative Splicing

Illustration of how alternative splicing of exons will produce different mRNAs which code for unique proteins. Non-colored portions in DNA and RNA represent introns. Image Credit: “DNA, alternative splicing” by the National Research Institute (public domain).

1.2.2 The Exon Junction Complex

The primary focus of this work is on a group of RBPs known collectively as the exon junction complex (EJC). The EJC is assembled during pre-mRNA splicing and is deposited 24 nucleotides upstream of exon-exon junctions, and stays with the mRNA up until translation (Boehm & Gehring, 2016; Le Hir, Saulière, & Wang, 2016; Woodward, Mabin, Gangras, & Singh, 2017). As an active player throughout the mRNA’s life the EJC enhances gene expression at several post-transcriptional steps and is involved in splicing, export, transport, and localization (Figure 1.6). It also plays a key function in NMD, which occurs when an EJC remains bound to an mRNA after the first round of translation. The EJC performs these functions through a large complement of peripheral proteins, including splicing factors, mRNA export proteins, translation factors, and NMD factors.

The core of the EJC is trimeric, featuring the clamp-like DEAD box protein

EIF4A3 as its base. EIF4A3 contains two connected domains which close around a 6 nt stretch of the mRNA’s backbone, regardless of sequence, with the help of ATP. This attachment is then solidified by the other two core proteins found in all EJCs, MAGOH and Y14, which form a heterodimer and lock EIF4A3 into place. These three core proteins are found together in activated, C complex spliceosomes (Bessonov, Anokhina,

Will, Urlaub, & Lührmann, 2008; Reichert, Le Hir, Jurica, & Moore, 2002; Zhou,

Licklider, Gygi, & Reed, 2002) – see Figure 1.7, and there is no evidence of their assembly on mRNA outside of the splicing reaction (Gehring, Lamprinaki, Hentze, &

Kulozik, 2009). The MAGOH:Y14 heterodimer is also highly conserved between species but does not have a high affinity for RNA by itself, separate from the EJC (Lau, Diem,

Dreyfuss, & Van Duyne, 2003).


Figure 1.6 The EJC Lifecycle

(i) During splicing, EIF4A3 and the MAGOH:Y14 heterodimer are assembled, forming the EJC; (ii) peripheral proteins and complexes like UPF3B, TREX, SKAR, and ASAP associate with the EJC; (iii - iv) at some point connected to export, MLN51 (CASC3) joins the EJC; (v) as the mRNA undergoes translation, a reaction occurs with the ribosome, PYM, and EJC, disassembling the latter; (vi) IMP13 brings MAGOH:Y14 back into the nucleus for recycling (Woodward, Mabin, Gangras, & Singh, 2017).


Figure 1.7 EJC Structure

Ribbon diagram showing the structure of the (CASC3) EJC bound to RNA. EIF4A3 (yellow) clamps around the RNA (gray) with the help of ATP in a sequence-independent manner and is solidified by the MAGOH:Y14 heterodimer (blue and purple). Here the EJC is shown bound by Btz, also known as CASC3, in one of its two alternate compositions (Bono, Ebert, Lorentzen, & Conti, 2006).

1.2.3 EJC peripheral proteins

The EJC is not a static complex but interacts with a variety of peripheral proteins throughout its life. The composition of these associated proteins changes with the EJC’s position on mRNA, with the mRNA’s subcellular location, and with the post-transcriptional process the mRNA is undergoing. The EJC is also observed to exist in both low and high molecular weight complexes (Singh, et al., 2012), which implies that EJCs exist in mRNPs with different levels of compaction or with significantly different overall structure.

One of the most common EJC peripheral proteins, previously thought to be part of its core, is CASC3 (MLN51). CASC3 is less conserved than the three proteins making up the trimeric core, with ~52% sequence identity between human and zebrafish, whereas the three core proteins have greater than 94% sequence identity between species. CASC3 also interacts with the RNA directly, in addition to the EJC core, via a SELOR (speckle localizer and RNA binding) motif (Degot, et al., 2004). When the crystal structure for the EJC was first obtained, CASC3 was included in the core categorization due to its prevalence (Bono, Ebert, Lorentzen, & Conti, 2006), which caused it to be considered part of a tetrameric core for some time. However, CASC3 is not required for core assembly, and may join the complex after the splicing process is complete (Gehring, Lamprinaki, Hentze, & Kulozik, 2009).

An abundance of SR and SR-like proteins also interact with the EJC, especially in the nucleus (Singh, et al., 2012). SR proteins are so named because they feature serine-arginine (SR) rich domains and motifs. This diverse group of proteins is characterized by the ability to bind to both RNA and other proteins simultaneously. SR proteins play an important role in pre-mRNA constitutive and alternative splicing, post-splicing activities, mRNA export, genome stabilization, translation, and NMD (Shepard & Hertel, 2009).

An especially noteworthy SR-like protein which binds the EJC through the associated ASAP and PSAP complexes is RNPS1. EJC bound RNPS1 has been shown to suppress cryptic and reconstituted 5’ splice sites, preventing the mis-splicing of many mRNAs. Recruitment of RNPS1 during splicing inhibits cryptic 5’ splice site use while the associated EJC core assembly masks reconstituted 3’ splice sites (Boehm, et al.,

2018). Although it is known to shuttle in and out of the cytoplasm, in contrast to CASC3,

RNPS1 is primarily localized to the nucleus.


1.2.4 Canonical EJCs

The positioning of EJCs ~24 nt upstream of splice sites was first established in

2000 (Le Hir, Izaurralde, Maquat, & Moore, 2000), at which time it was assumed that a group of proteins were deposited 20-24 nt upstream of exon-exon junctions by the spliceosome after excision of each intron. In 2012 the EJC deposition process was further confirmed in two separate protocols using both crosslinking immunoprecipitation (CLIP) and RNP immunoprecipitation in tandem (RIPiT) with high-throughput sequencing (see

1.3.3) across the entirety of the human transcriptome (Saulière, et al., 2012; Singh, et al.,

2012). These two studies set the deposition site of “canonical” EJCs (cEJCs) at 24 nt upstream of exon-exon junctions and demonstrated their sequence-independent binding.
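The canonical placement described above can be illustrated with a short Python sketch that computes where cEJCs would be expected on a spliced transcript given its exon lengths. This is only a toy calculation under stated assumptions: the function name, the 0-based transcript coordinates, and the rule of skipping exons shorter than the offset are hypothetical choices made for the example and are not taken from the cited studies.

```python
from itertools import accumulate

CANONICAL_OFFSET = 24  # nt upstream of each exon-exon junction


def canonical_ejc_positions(exon_lengths, offset=CANONICAL_OFFSET):
    """Return 0-based transcript coordinates of expected canonical EJC sites.

    exon_lengths: exon lengths (nt) in 5'-to-3' order for one spliced mRNA.
    A junction lies at the boundary between consecutive exons, so the last
    exon has no downstream junction and therefore no canonical site.
    """
    # Cumulative exon ends give the junction coordinates in the spliced mRNA.
    junctions = list(accumulate(exon_lengths))[:-1]
    positions = []
    for exon_len, junction in zip(exon_lengths, junctions):
        # Hypothetical rule: skip exons too short to hold a site 24 nt upstream.
        if exon_len >= offset:
            positions.append(junction - offset)
    return positions


# Hypothetical four-exon transcript; the 20 nt exon is skipped.
print(canonical_ejc_positions([150, 20, 300, 500]))  # [126, 446]
```

Real occupancy deviates from this idealized picture, since the text notes that some junctions lack EJCs and that occupancy varies from junction to junction.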

The EJC’s deposition placement is likely mechanically influenced by the 3D structure of the spliceosome. Placement may also be shifted or disrupted by competing factors and/or the RNA’s secondary structure (Mishler, Christ, & Steitz, 2008).

Consistent with this idea, mRNAs with secondary structure around the canonical 24 nt position are found to have reduced EJC occupancy (Saulière, et al., 2012; Singh, et al.,

2012). In addition ~20% of exon-exon junctions lack EJC binding altogether, although the cause of this is still unknown (Saulière, et al., 2012; Singh, et al., 2012). EJC occupancy at the canonical site is also variable from junction to junction, not just between transcripts but between junctions within the same transcript.

1.2.5 Non-Canonical EJCs

In addition to establishing the canonical position of EJCs, the two sequencing experiments conducted in 2012 (Saulière, et al., 2012; Singh, et al., 2012) found that 40-50% of EJC-bound RNA fragments mapped to specific non-canonical sites. These non-canonical EJCs (ncEJCs) have no clear positioning relative to the 3’ end of their exons, or any universal genetic feature, but show consistent deposition within their transcript. Neither the method of assembly nor the function of these ncEJCs is well understood. CLIP-seq based investigation of the two alternatively EJC-bound peripheral proteins CASC3 and RNPS1 has found that while CASC3 is typically associated with cEJCs, RNPS1 is often found in non-canonical assemblies (Hauer, et al., 2016). This suggests that EJC location may play a part in influencing peripheral protein binding efficiencies.

Currently two theories exist as to the placement of ncEJCs on internal exons. The first is that these are structurally identical EJCs (containing the conserved trimeric core) which are assembled with the help of SR proteins. SRSF1-binding sites in particular, as well as SRSF3 and Tra2a binding sites, are enriched near non-canonical binding sites, which may be indicative of SR proteins or other RBPs influencing the spliceosome towards assembly shifted away from the canonical site (Mabin, et al., 2018; Saulière, et al., 2012; Singh, et al., 2012). However this model offers no explanation for how ncEJCs deposited very far (>50 nt) away from the canonical site might be formed (Mishler,

Christ, & Steitz, 2008). An alternative view is that the full EJC core is not deposited at these sites, but rather that EJC-associating RBPs bind directly to the RNA and engage with ncEJCs through protein-protein interactions.

ncEJCs have also been found in the final exons and 3’ untranslated regions (UTRs) of mRNA. This goes against the generally accepted theory of deposition by the spliceosome since there are no downstream junctions to be found in these regions. ncEJCs in these regions should also trigger NMD since they are found downstream of stop codons, further complicating the reason for their existence. The methods by which these EJCs are assembled and removed remain unknown.

1.2.6 EJCs influence mRNP structure

An important process which takes place co-transcriptionally is the packaging of mRNA into more compact mRNPs. This occurs with the help of a multitude of RBPs which are recruited during transcription, including the TREX complex and several SR proteins. Endogenous EJCs exist as multimers inside of high-molecular weight complexes, often bound to SR proteins, which also package mRNA into higher order mRNPs of unknown structure (Singh, et al., 2012). While the purpose of these structures is not entirely known, compacted RNA can undergo transport with higher efficiency, making this an important process for mRNA which will travel into the cytoplasm for translation.

1.2.7 EJCs and Nonsense-Mediated Decay

NMD is both a surveillance pathway, which degrades mRNAs that have been incorrectly spliced or contain nonsense mutations, and a gene regulation mechanism. While NMD may occur without EJCs, EJC-dependent NMD involving premature termination codons (PTCs) accounts for a major fraction of NMD activity. PTCs may occur through the mis-splicing of exons, in which case a final coding exon containing a stop codon is inserted in a non-final position of the transcript, through a point mutation which produces a stop codon, or as an evolved adaptation to limit translation events of a specific transcript. The latter EJC and PTC dependent NMD process, along with EJC-independent NMD dictated by long 3’UTRs, serve to downregulate 10-25% of the mammalian transcriptome (Chan, et al., 2007; Lykke-Andersen & Jensen, 2015). These processes occur when an EJC remains downstream (>50 nt) of a ribosome which has already terminated on a stop codon (Karousis, Nasif, & Mühlemann, 2016; Lykke-Andersen & Jensen, 2015).
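The distance criterion above can be written as a simple check. The following Python sketch is a deliberately minimal reading of that rule, not a model of the NMD pathway: the function name, the argument names, and the use of 0-based spliced-transcript coordinates are assumptions introduced for illustration, and the 50 nt threshold is taken directly from the text.

```python
def has_nmd_triggering_ejc(stop_codon_end, ejc_positions, min_distance=50):
    """Return True if any EJC lies more than min_distance nt downstream of the stop.

    stop_codon_end: transcript coordinate of the first nt after the stop codon.
    ejc_positions:  transcript coordinates of EJCs still bound to the mRNA.
    """
    return any(pos - stop_codon_end > min_distance for pos in ejc_positions)


# Hypothetical transcript with a stop codon ending at position 300.
print(has_nmd_triggering_ejc(300, [126, 446]))  # EJC 146 nt downstream -> True
print(has_nmd_triggering_ejc(300, [126, 340]))  # EJC only 40 nt downstream -> False
```

EJC-independent NMD dictated by long 3’UTRs, which the text also mentions, is outside the scope of this check.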

Mechanistically NMD occurs when UPF1 is recruited to a terminated (stalled) ribosome via release factors eRF3 and eRF1 along with SMG1 kinase, a UPF1 regulator.

Together these form the SURF complex, which in the presence of a UPF2 and UPF3 bound EJC assembles into the decay-inducing (DECID) complex. The interaction between UPF1 and UPF2 induces the activation of the helicase function of UPF1 which is phosphorylated by SMG1 (Karousis, Nasif, & Mühlemann, 2016; Yamashita, 2013).

Finally, the activated UPF1 in turn recruits additional factors which degrade the mRNA through decapping and deadenylation, or by endonucleolytic cleavage.


Figure 1.8 Canonical NMD Pathway

A stalled ribosome is first assembled into the SURF complex. After reacting with the UPF2 and UPF3 bound EJC the ensemble becomes the DECID complex and begins the degradation process. The ribosome is released, and decay begins through additional factors recruited by UPF1 (Hug, Longman, & Cáceres, 2016)


1.3 Biophysics and Bioinformatics

Biophysics is the scientific discipline in which physics problem-solving approaches and techniques are used to study living systems. Indeed, even simple physical models can very accurately predict many biological processes, and in doing so provide a backbone for quantitative research. From the atomic level interactions that take place between molecules in cells to the statistical nature in which large groups of macromolecules move around each other, biophysics provides insight into the deeper connections between biological entities. The burgeoning field of bioinformatics similarly seeks to explain and understand physical processes through the computationally based statistical analysis of large data sets garnered from experiments that probe how the levels of molecules in the cell, especially nucleotides, rise and fall in different conditions.

1.3.1 Biophysics and statistical mechanics

Physics plays a very direct role in many aspects of biology. One need only consider the role of electricity and magnetism in ion transfer, kinetic theory in osmosis and diffusion, nuclear physics in radioisotope labeling and radiation therapy, and direct applications like X-rays, optics, imaging, and crystallography. In genetics the tool used most to help describe and predict biological systems is statistical mechanics, which aims to find average values of the properties of large ensembles of particles (e.g. nucleotides or amino acids).

A key concept in statistical mechanics is that of entropy maximization, or the second law of thermodynamics, which tells us that in an isolated system equilibrium is reached when entropy is maximized. When considering certain biophysical processes in which both temperature and pressure are held constant, this simplifies to the idea of equilibrium achieved through free energy minimization. This principle can be used to accurately predict the secondary and tertiary structures of RNA and proteins as well as their binding probabilities (as in the case of some position weight matrices).

Central to this idea is the Boltzmann distribution (Equation 1.1):

$$P(\epsilon_i) = \frac{e^{-\epsilon_i / k_B T}}{Z} \qquad \text{(Equation 1.1)}$$

Where P(ϵi) is the probability of observing a given state i with energy ϵi at temperature T, and kB is the Boltzmann constant. Z is the partition function (Equation 1.2) which includes information about all possible states a system can be in.

$$Z = \sum_i e^{-\epsilon_i / k_B T} \qquad \text{(Equation 1.2)}$$

The exact calculation of a given partition function can be computationally infeasible, but finding simplifying cases in order to apply this concept is still one of the most powerful tools available to biophysicists.
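For a system with only a handful of states, Equations 1.1 and 1.2 can be evaluated directly. The Python sketch below does this for a hypothetical two-state system; the function name, the example energies, and the default temperature of 310 K are assumptions made for illustration. Because the energies are given per mole, the molar gas constant R (Avogadro's number times the Boltzmann constant) is used in place of the Boltzmann constant.

```python
import math

# Molar gas constant R = N_A * k_B in kJ/(mol*K); it stands in for k_B
# because the energies below are expressed per mole (an assumption here).
R_KJ_PER_MOL_K = 0.0083144626


def boltzmann_probabilities(energies_kj_per_mol, temperature_k=310.0):
    """Return P(eps_i) = exp(-eps_i / kT) / Z for each state energy."""
    kt = R_KJ_PER_MOL_K * temperature_k
    # Shifting by the minimum energy leaves the probabilities unchanged
    # (the factor cancels between numerator and Z) but avoids overflow.
    e_min = min(energies_kj_per_mol)
    weights = [math.exp(-(e - e_min) / kt) for e in energies_kj_per_mol]
    z = sum(weights)  # partition function of the shifted energies
    return [w / z for w in weights]


# Hypothetical two-state system: a bound state 10 kJ/mol below an unbound state.
p_bound, p_unbound = boltzmann_probabilities([-10.0, 0.0])
print(f"P(bound) = {p_bound:.3f}, P(unbound) = {p_unbound:.3f}")
```

Enumerating states like this only works when the state space is small, which is the practical limitation noted above for exact partition functions.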

1.3.2 Bioinformatics and high-throughput RNA sequencing

Bioinformatics combines computer science, statistics, and biology in order to study the sequence-based information inherent in living systems (Hogeweg, 2011). In practice it is the extension and application of the mathematical and computational methods used to process and study increasingly large biological data sets, especially those produced by next generation sequencing technologies and similar big data producing experimental procedures. Since the completion of the Human Genome Project in 2003 sequencing technology has only become more robust and cost-effective. In fact, the cost to sequence a single human genome has decreased faster than Moore’s Law: what cost around $2,700,000,000 then can be done for around $1000 today.
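The comparison with Moore’s Law can be checked with a few lines of arithmetic. The Python sketch below uses the two cost figures quoted above; the 17-year span (2003 to 2020, taking “today” to mean the year of this dissertation) and the two-year doubling period for Moore’s Law are assumptions made for the comparison.

```python
import math

start_cost = 2.7e9      # USD, approximate cost at Human Genome Project completion
end_cost = 1000.0       # USD, approximate cost per genome "today"
years = 2020 - 2003     # assumption: "today" taken as 2020
doubling_period = 2.0   # assumption: Moore's Law doubling time in years

observed_fold = start_cost / end_cost              # actual fold reduction in cost
moore_fold = 2 ** (years / doubling_period)        # fold change Moore's Law would give
observed_halving_time = years / math.log2(observed_fold)  # years per halving of cost

print(f"Observed fold reduction:  {observed_fold:.2e}")
print(f"Moore's Law fold change:  {moore_fold:.0f}")
print(f"Observed cost halving time: {observed_halving_time:.2f} years")
```

Under these assumptions the observed reduction exceeds the Moore’s Law figure by several orders of magnitude, with the cost halving in under a year on average.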

Those same advancements have greatly impacted RNA sequencing (RNA-Seq) as well, with high-throughput RNA sequencing becoming an affordable way to quantify gene expression levels in living cells. This technology works by fragmenting and isolating RNA molecules from within a cell. These fragments are converted into complementary DNA (cDNA) and are then sequenced on a high-throughput sequencing platform which can measure hundreds of millions of individual (generally 30-300 base long) reads. After aligning these reads to a reference genome this gives an accurate picture of the mRNA profile of the cell under any given condition, and likewise a view of relative gene expression levels. Quantification of these reads followed by a multitude of statistical analyses makes up the backbone of modern bioinformatics as well as the bulk of the scientific inquiry presented in this work.

1.3.3 Footprinting of RNA:protein complexes

A more recent development which has expanded the potential of high-throughput sequencing is its combination with cross-linking and immunoprecipitation in order to map RBP binding sites. In these experiments a cross-linking agent is first used to stabilize the bonds within RNA:protein complexes. In the case of CLIP-Seq (Cross-Linking Immunoprecipitation followed by high-throughput Sequencing) and similar methods this is accomplished by exposure to UV light, which produces covalent bonds between RNA and proteins in physical contact. In chemical cross-linking methods such as RIP-Seq (RNA Immunoprecipitation) this may be accomplished by a chemical agent such as formaldehyde, which will crosslink any combination of proteins and nucleotides with a strong dependence on formaldehyde concentration.

Following cross-linking cells are lysed and the protein of interest is isolated using immunoprecipitation, a technique which uses antibodies to separate and target specific proteins. In the case of RIPiT-Seq (RNA Immunoprecipitation in Tandem) (Singh, Ricci,

& Moore, 2014) two proteins may be pulled down, isolating RNA bound by multiple proteins simultaneously. Next RNase digestion is used to remove RNA not bound, and therefore protected, by any proteins. Finally, the bound proteins are also digested, leaving behind only the previously protected RNA fragments. Once these fragments undergo high-throughput sequencing and alignment their “footprints” impart detailed information about which genes, and where on those genes, the protein was bound.

1.4 Outline

In the chapters which follow I will first take a closer look at the Exon Junction

Complex and discuss the finding that it exists in two alternative configurations. In

Chapter 2 I present evidence for a new model in which the EJC has an important peripheral protein, RNPS1, at the time of its initial deposition by the spliceosome, but by the time the EJC is removed during translation this protein has been replaced by CASC3.

The switch likely converts the compact mRNP with higher-order EJC structure found within the nucleus to a less compact form featuring monomeric EJCs. This conversion also changes how susceptible the mRNA is to NMD as it matures.


Chapter 3 delves into the reason why this switch was previously unknown. I find that depending on the type of RNA:protein experimental crosslinking protocol there is a large group of proteins which bind RNA indirectly (e.g. through intermediate proteins) which may be overlooked. The EJC, which contains both directly and indirectly binding proteins depending on its configuration, makes for a perfect case study to examine this dependency. I apply our new knowledge concerning the importance of experimental crosslinking method to show how a previous study looking at how the EJC interacts with recursively spliced (RS) exons was unable to see the full extent of the EJC’s involvement.

Finally, in Chapter 4 we will turn our attention to another EJC associating protein,

PYM, which until now was thought to remove the EJC co-translationally as its primary function. I find that PYM is unnecessary for removal of canonical EJCs but does inhibit binding of non-canonical EJCs. I then lay the groundwork for future study of PYM as it relates to NMD by reproducing an NMD study and using that data as a reference for the creation of a universal list of NMD target transcripts.


Chapter 2

The EJC Undergoes a Compositional Switch¹

2.1 Abstract

The exon junction complex (EJC), deposited upstream of mRNA exon junctions, shapes the structure, composition, and fate of spliced mRNA ribonucleoprotein particles (mRNPs). To achieve this, the EJC core nucleates assembly of a dynamic shell of peripheral proteins that function in diverse post-transcriptional processes. To illuminate the consequences of the EJC compositional change, we purified EJCs from human cells via peripheral proteins RNPS1 and CASC3. We show that the EJC originates as an SR-rich mega-dalton-sized RNP that contains RNPS1 but lacks CASC3. Sometime before or during translation, the EJC undergoes compositional and structural remodeling into an SR-devoid monomeric complex that contains CASC3. RNPS1 is important for nonsense-mediated mRNA decay (NMD) in general, whereas CASC3 is needed for NMD of only select mRNAs. The switch to CASC3-EJC also slows down NMD. Overall, the EJC compositional switch dramatically alters mRNP structure and specifies two distinct phases of EJC-dependent NMD.

2.2 Introduction

From the time of their birth until their eventual demise, messenger RNAs (mRNAs) exist decorated with proteins as mRNA:protein particles, or mRNPs (Singh, Pratt, Yeo, & Moore, 2015). The vast protein complement of mRNPs has been illuminated (Hentze, Castello, Schwarzl, & Preiss, 2018) and is presumed to change as mRNPs progress through their life. However, the understanding of mechanisms and consequences of mRNP composition change remains confined to only a handful of its components. For example, mRNA export adapters are removed upon mRNP export to provide directionality to mRNP metabolic pathways, and the nuclear cap and poly(A)-tail binding proteins are exchanged for their cytoplasmic counterparts after mRNP export to promote translation (Singh, Pratt, Yeo, & Moore, 2015). When, where, and how the multitude of mRNP components change during its lifetime and how such changes impact mRNP function remain largely unknown.

¹ This chapter is based on a published article (Mabin, et al., 2018). Wet lab work was conducted by the co-authors cited in the figures and text, while I conducted all bioinformatics and computational analysis unless otherwise noted.

A key component of all spliced mRNPs is the exon junction complex (EJC), which assembles during pre-mRNA splicing 24 nucleotides (nt) upstream of exon-exon junctions (Boehm & Gehring, 2016; Le Hir, Saulière, & Wang, 2016; Woodward, Mabin,

Gangras, & Singh, 2017). Once deposited, the EJC enhances gene expression at several post-transcriptional steps, including pre-mRNA splicing, mRNA export, mRNA transport and localization, and translation. If an EJC remains bound to an mRNA downstream of a ribosome terminating translation, it stimulates nonsense-mediated mRNA decay (NMD).

The stable trimeric EJC core forms when RNA-bound EIF4A3 is locked in place by

RBM8A (also known as Y14) and MAGOH. This trimeric core is thought to be joined by a fourth protein CASC3 (also known as MLN51 or Barentsz) to form a stable tetrameric core (Boehm & Gehring, 2016; Hauer, et al., 2016; Le Hir, Saulière, & Wang, 2016).

However, more recent evidence suggests that CASC3 may not be present in all EJCs and may not be necessary for all EJC functions (Mao, Brown, & Silver, 2017; Singh, et al., 2012). Nonetheless, the stable EJC core interacts with a dynamic shell of peripheral EJC proteins such as pre-mRNA splicing factors (e.g., SRm160, RNPS1), mRNA export proteins (e.g., the TREX complex), translation factors (e.g., SKAR), and NMD factors

(e.g., UPF3B) (Boehm & Gehring, 2016; Le Hir, Saulière, & Wang, 2016; Woodward,

Mabin, Gangras, & Singh, 2017). Some peripheral EJC proteins share similar functions and yet may act on different mRNAs; e.g., RNPS1 and CASC3 can both enhance NMD but may have distinct mRNA targets (Gehring, et al., 2005). Thus, the peripheral EJC shell may vary between mRNPs leading to compositionally distinct mRNPs, an idea that has largely remained untested.

Within spliced mRNPs, EJCs interact with one another as well as with several SR and SR-like proteins to assemble into mega-dalton-sized RNPs (Singh, et al., 2012).

These stable mega-RNPs ensheath RNA well beyond the canonical EJC deposition site, leading to 150- to 200-nt-long RNA footprints, suggesting that the RNA polymer within these complexes is packaged into an overall compact mRNP structure. Such a compact structure may facilitate mRNP navigation of the intranuclear environment, export through the nuclear pore, and transport within the cytoplasm to arrive at its site of translation.

Eventually, the mRNA within mRNPs must be unpacked to allow access to the translation machinery. How long mRNPs exist in their compact states and when, where, and how they are unfurled remains yet to be understood.

Previous work showing that, in human embryonic kidney (HEK293) cells, CASC3 and many peripheral EJC factors are substoichiometric to the EJC core (Singh, et al., 2012) spurred us to investigate variability in EJC composition. Here, we use EJC purification via substoichiometric factors to reveal that EJCs first assemble into SR-rich mega-dalton-sized RNPs and then undergo a compositional switch into SR-devoid monomeric CASC3-containing EJCs. These EJC forms differ in their NMD activity;

RNPS1, a component of the SR-rich EJCs, is crucial for all tested NMD substrates, whereas CASC3, a constituent of SR-devoid EJCs, is dispensable for NMD of some transcripts. Our findings reveal a step in the mRNP life cycle wherein EJCs, and by extension mRNPs, undergo a remarkable compositional switch that alters the mRNP structure and specifies two distinct phases of EJC-dependent NMD.

2.3 Materials and Methods

2.3.1 Experimental methods

Non-computational work (i.e. wet lab experimentation) including cell preparation, immunoprecipitation, and preparation for high-throughput sequencing was conducted primarily by Justin Mabin and Lauren Woodward. A detailed explanation of these methods can be found in (Mabin, et al., 2018).

2.3.2 High-throughput sequencing library preparation

For RIPiT-seq, RNA extracted from ∼80% of RIPiT elution was used to generate strand-specific libraries. For RNA-seq libraries, 5 μg of total cellular RNA was depleted of ribosomal RNA (RiboZero kit, Illumina), and subjected to base hydrolysis. RNA fragments were then used to generate strand-specific libraries using a custom library preparation method (Gangras, Dayeh, Mabin, Nakanishi, & Singh, 2018). Briefly, a pre-adenylated miR-Cat33 DNA adaptor was ligated to RNA 3′ ends and used as a primer binding site for reverse-transcription (RT) using a special RT primer. This RT primer contains two sequences linked via a flexible PEG spacer. The DNA with free 3′ end contains sequence complementary to the DNA adaptor as well as Illumina PE2.0 primer sequences. The DNA with free 5′ end contains Illumina PE1.0 primer sequences followed by a random pentamer, a 5 nt barcode sequence, and ends in GG at the 5′ end. Following RT, the extended RT primer is gel purified, circularized using CircLigase (Illumina), and used for PCR amplification using Illumina PE1.0 and PE2.0 primers. All DNA libraries were quantified using Bioanalyzer (DNA lengths) and Qubit (DNA amounts). Libraries were sequenced on Illumina HiSeq 2500 in single-end format (50 and 100 nt read lengths). All high-throughput sequencing library preparation was conducted by Lauren Woodward and Justin Mabin.

2.3.3 Adaptor trimming and PCR duplicate removal

After demultiplexing, fastq files containing unmapped reads were first trimmed using Cutadapt. A 12 nt sequence on read 5′ ends consisting of a 5 nt random sequence, a 5 nt identifying barcode, and a CC was removed, with the random sequence saved for each read to identify PCR duplicates later. Next, as much of the 3′-adaptor (miR-Cat33) sequence TGGAATTCTCGGGTGCCAAGG as possible was removed from the 3′ end. Any reads less than 20 nt in length after trimming were discarded.
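The trimming steps described above can be sketched in plain Python. This is an illustrative stand-in for the Cutadapt run, not the command actually used; the function names, the minimum adaptor overlap of 3 nt, and the duplicate key (mapped position plus random tag) are assumptions made for this example.

```python
ADAPTOR = "TGGAATTCTCGGGTGCCAAGG"  # 3' adaptor sequence quoted in the text
PREFIX_LEN = 12                    # 5 nt random tag + 5 nt barcode + CC
MIN_LEN = 20                       # reads shorter than this are discarded


def trim_read(seq, adaptor=ADAPTOR, min_overlap=3):
    """Return (random_tag, insert), or None if the trimmed read is too short.

    min_overlap is an assumed cutoff for partial adaptor matches at the 3' end.
    """
    if len(seq) < PREFIX_LEN:
        return None
    random_tag = seq[:5]       # saved for PCR duplicate detection
    body = seq[PREFIX_LEN:]    # drop random tag, barcode, and CC
    cut = body.find(adaptor)   # full adaptor anywhere in the read
    if cut == -1:
        # Otherwise look for the longest adaptor prefix that ends the read.
        for k in range(min(len(adaptor), len(body)), min_overlap - 1, -1):
            if body.endswith(adaptor[:k]):
                cut = len(body) - k
                break
    if cut != -1:
        body = body[:cut]
    return (random_tag, body) if len(body) >= MIN_LEN else None


def drop_pcr_duplicates(alignments):
    """Keep the first read for each (chrom, start, strand, random_tag) key.

    Treating reads with identical position and random tag as PCR duplicates
    is an assumption of this sketch.
    """
    seen, kept = set(), []
    for aln in alignments:
        if aln not in seen:
            seen.add(aln)
            kept.append(aln)
    return kept
```

A read whose insert is followed by only the first few adaptor bases is still trimmed, because the loop checks progressively shorter adaptor prefixes against the read's 3′ end.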

2.3.4 Alignment and removal of multimapping reads

Following trimming, reads were aligned with TopHat v2.1.1 (Trapnell, Pachter, & Salzberg, 2009) using 12 threads to NCBI GRCh38 with the corresponding Bowtie2 index. After alignment, reads with a mapping score less than 50 were removed; because a score of 50 marks a uniquely mapped read, all multimapped reads were discarded.
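As a minimal sketch of this filter, the snippet below keeps only SAM records whose mapping quality (the fifth tab-separated column of a SAM line) is at least 50. The function names are hypothetical, and reading the alignments as SAM text is an assumption made for illustration; it is not the tool used in this work.

```python
def is_unique(sam_line, min_mapq=50):
    """True if the alignment's MAPQ (SAM column 5) meets the cutoff."""
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[4]) >= min_mapq


def filter_unique(sam_lines, min_mapq=50):
    """Yield header lines unchanged and alignments that pass the cutoff."""
    for line in sam_lines:
        if line.startswith("@") or is_unique(line, min_mapq):
            yield line
```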

2.3.5 Removal of stable RNA mapping reads

Next, reads which came from stable RNAs were counted and removed as follows. All reads were checked for overlap against hg38 annotations for miRNA, rRNA, tRNA, scaRNA, snoRNA, and snRNA using bedtools intersect (Quinlan & Hall, 2010), and any reads overlapping by more than 50% were removed. Reads aligned to chrM (mitochondrial) were also counted and removed.
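The 50% overlap criterion can be illustrated with a small interval calculation. This sketch assumes half-open, BED-style coordinates and measures overlap as a fraction of the read length; both choices, and the function names, are assumptions of the example rather than details stated in the text.

```python
def overlap_fraction(read_start, read_end, feat_start, feat_end):
    """Fraction of the read covered by the feature (half-open intervals)."""
    overlap = min(read_end, feat_end) - max(read_start, feat_start)
    return max(0, overlap) / (read_end - read_start)


def overlaps_stable_rna(read, features, min_fraction=0.5):
    """True if the read overlaps any feature by more than min_fraction.

    read and features are (start, end) tuples on the same chromosome.
    """
    return any(
        overlap_fraction(read[0], read[1], f[0], f[1]) > min_fraction
        for f in features
    )
```

The same test applies to the exon and intron assignments described in the read distribution step, with the stable RNA annotations replaced by exon or intron intervals.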

2.3.6 Human reference transcriptome

The primary reference transcriptome used in all post-alignment analysis was obtained from the UCSC Table Browser. CDS, exon, and intron boundaries were obtained for canonical genes by selecting Track: Gencode v24, Table: knownGene, Filter: knownCanonical (describes the canonical splice variant of a gene).

Read distribution assignment:

Fractions of reads corresponding to exonic, intronic, intergenic, and canonical EJC and non-canonical EJC regions were then computed. Exonic regions were defined by the canonical hg38 genes, with intronic regions defined as the regions between exons in said genes. Bedtools intersect was used to compare reads against these exon and intron annotations and reads which overlapped the annotation by more than 50% were counted. Any reads which did not overlap either the exon or intron annotations sufficiently were counted as intergenic. For classification of reads as canonical versus non-canonical EJC footprints, the canonical region for each library was defined using the meta-exon distribution at exon 3′ ends (Figures 2.6 and 2.7). All reads with their 5′ ends falling within the window starting at the −24 position up until the 25% max height on the 5′ side of the canonical EJC peak were counted as canonical reads. Similarly, any read whose 5′ end was found anywhere between the start of the exon and 10 bp upstream of that 25% max point was considered non-canonical.
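One possible reading of this classification rule is sketched below. It assumes the meta-exon profile is a dictionary mapping positions relative to the exon 3′ end (negative integers) to read depth, that the 25% point is found by walking upstream from the peak, and that reads falling in the 10 nt gap between the two windows are left unclassified. All names and these interpretive choices are assumptions of the sketch, not a record of the code used here.

```python
def quarter_max_position(profile):
    """Walk upstream (5') from the peak until depth falls to <= 25% of max."""
    peak_pos = max(profile, key=profile.get)
    threshold = 0.25 * profile[peak_pos]
    pos = peak_pos
    while profile[pos] > threshold and (pos - 1) in profile:
        pos -= 1
    return pos


def classify_read(five_prime_pos, exon_length, quarter_pos,
                  canonical_edge=-24, gap=10):
    """Label a read by the position of its 5' end relative to the exon 3' end."""
    if quarter_pos <= five_prime_pos <= canonical_edge:
        return "canonical"
    if -exon_length <= five_prime_pos <= quarter_pos - gap:
        return "non-canonical"
    return None  # falls in the gap between windows or outside the exon
```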

2.3.7 K-mer analysis

Lists of all 6-mers and 3-mers present in reads mapping to exonic regions (as described above) were produced for each RIPiT-seq sample. The ratio of total 3-mer frequency in RIPiT-seq samples to RNA-seq samples was then used to identify 3-mers enriched in alternate EJCs.
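A minimal sketch of the k-mer frequency ratio follows. The function names are hypothetical, and normalizing each k-mer count by the total number of k-mers in the sample is an assumption about how "frequency" is defined.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a collection of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_frequencies(reads, k):
    """Convert k-mer counts to fractions of all k-mers in the sample."""
    counts = kmer_counts(reads, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_enrichment(sample_reads, background_reads, k=3):
    """Ratio of k-mer frequency in the sample (e.g., RIPiT-seq) to the
    background (e.g., RNA-seq), for k-mers present in both."""
    sample = kmer_frequencies(sample_reads, k)
    background = kmer_frequencies(background_reads, k)
    return {kmer: sample[kmer] / background[kmer]
            for kmer in sample if kmer in background}
```

Ratios above 1 mark k-mers that are more frequent in the pull-down than in the background.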

2.3.8 Motif enrichment analysis

Motif enrichment analysis was performed by first selecting RNA binding proteins of interest (namely SR proteins) from the position weight matrices (PWMs) available on http://rbpdb.ccbr.utoronto.ca/. For all reads mapped to exonic regions, a score was then generated representing the highest possible binding probability for each protein on that read. For visualization, the cumulative distribution of these score frequencies was plotted for both pull down and RNA-seq replicates, with a relatively higher score frequency at a positive score implying greater binding affinity. The p values between RIPiT-seq and RNA-seq replicates were also computed for every score using a negative binomial based model, with significant values primarily in positive score regions implying a binding preference for that protein.
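The per-read score can be illustrated as the best log-odds match of a PWM over all windows of a read. This sketch assumes a uniform background of 0.25 per base and a PWM given as a list of per-position probability dictionaries with no zero entries; the exact scoring scheme used in this work is not specified in the text, so this is one common choice rather than the method itself.

```python
import math


def best_pwm_score(read, pwm, background=0.25):
    """Highest log2-odds score of the PWM over every window of the read.

    Returns None if the read is shorter than the motif. Positive scores
    mean the best window matches the motif better than background.
    """
    width = len(pwm)
    best = None
    for i in range(len(read) - width + 1):
        score = sum(
            math.log2(pwm[j][read[i + j]] / background) for j in range(width)
        )
        if best is None or score > best:
            best = score
    return best
```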

2.3.9 Differential enrichment analysis

Differential analysis of exons and transcripts between CASC3 and RNPS1 pull down was conducted with the DESeq2 (Love, Huber, & Anders, 2014) package in R. Exons and transcripts with significant differential expression (p < 0.05) were selected. All the following analysis (Chapter 2) was conducted using only the lists of significantly differentially expressed transcripts, unless otherwise noted.

2.3.10 Estimation of nuclear versus cytoplasmic levels

Nuclear and cytoplasmic RNA levels were estimated by first obtaining nuclear and cytoplasmic reads from (Neve, et al., 2016). Reads were aligned and mapped to our exonic annotation as described above, and a ratio of nuclear to cytoplasmic reads was then calculated for all transcripts.

2.3.11 Comparison to genes with detained introns

A list of detained and non-detained introns was obtained from (Boutz, Bhutkar, & Sharp, 2015). To identify detained intron containing and lacking genes in our RNA-seq data, we carried out analysis using a DESeq2-based pipeline as described by Boutz et al. Briefly, using two RNA-seq replicates we first created artificial datasets containing the same numbers of total reads per replicate, but with counts originating from introns in a given gene spread evenly among those introns in the artificial set. By comparing the intron distributions of these artificial replicates to the experimental replicates using DESeq2 we were able to produce lists of detained introns (introns with significantly (p < 0.05) higher expression levels compared to the artificially spread data) and non-detained introns (introns with non-significant (p > 0.1) expression levels compared to the artificially spread data). The transcripts containing introns found to be detained/non-detained in both our analysis and the analysis done by Boutz et al. make up the stringent lists, which were used for analysis in Figure 2.14A. Of the 693 canonical genes containing detained introns reported by Boutz et al., we found 555 in our own analysis (80%), and of the 5294 canonical genes lacking detained introns we found 812 (15%).
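The construction of the artificial "evenly spread" dataset can be sketched as follows. The data layout (a dictionary from gene to per-intron counts) and the function name are assumptions; spreading is done equally per intron as the text states, and any rounding to the integer counts that DESeq2 expects is left out.

```python
def spread_intron_counts(intron_counts):
    """Replace each gene's per-intron counts with their within-gene mean.

    intron_counts: dict mapping gene ID -> list of read counts, one per intron.
    The per-gene (and hence per-replicate) total is preserved.
    """
    artificial = {}
    for gene, counts in intron_counts.items():
        n = len(counts)
        mean = sum(counts) / n if n else 0
        artificial[gene] = [mean] * n
    return artificial
```

An intron whose observed count is well above its gene's evenly spread value is a candidate detained intron, which is the comparison the differential test formalizes.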

2.3.12 Ribosome occupancy and mRNA half-life estimates

Ribosome occupancy data for knownCanonical transcripts was obtained from (Kiss, Baez, Huebner, Bundschuh, & Schoenberg, 2017) with no further processing. mRNA half-life data was similarly obtained from (Tani, et al., 2012).

2.3.13 Gene ontology analysis

DAVID gene ontology tool (Huang, Sherman, & Lempicki, 2009) was used to compare the set of genes (canonical Ensembl transcript IDs) predicted by DESeq2 analysis to be significantly enriched in CASC3 or RNPS1 EJCs against a background list containing only those human genes that were reliably detected by DESeq2 (all genes for which DESeq2 calculated adjusted p values). Only non-redundant categories with lowest p value (with Benjamini-Hochberg correction) are reported.

2.4 Results

2.4.1 RNPS1 and CASC3 associate with the EJC core in a mutually exclusive manner

We reasoned that substoichiometric EJC proteins may not interact with all EJC cores, and therefore, some of them may not interact with each other. To test this prediction, Lauren Woodward and Justin Mabin performed immunoprecipitation (IP) for either endogenous core factor EIF4A3 or the substoichiometric EJC proteins RNPS1 and CASC3 from RNase-A-treated HEK293 total cell extracts. As expected, EIF4A3 IP enriches EJC core, as well as all peripheral proteins tested (Figure 2.1A, lane 3). In contrast, the IPs of substoichiometric factors enrich distinct sets of proteins. CASC3 immunopurified EIF4A3, RBM8A, and MAGOH but not the peripheral proteins ACIN1 and SAP18 (Figure 2.1A, lane 4, and data not shown; RNPS1 could not be detected as it co-migrates with antibody heavy chain). Conversely, RNPS1 IP enriches the EJC core proteins and its binding partner SAP18 (Murachelli, Ebert, Basquin, Le Hir, & Conti, 2012; Tange, Shibuya, Jurica, & Moore, 2005), but no CASC3 is detected (Figure 2.1A, lane 5). A similar lack of co-IP between RNPS1 and CASC3 even after formaldehyde crosslinking of cells prior to lysis (data not shown; see below) suggests that the lack of interaction is not due to their dissociation in extracts. In similar IPs from RNase-A-treated total extracts of mouse brain cortical slices, mouse embryonal carcinoma (P19) cells, and HeLa cells, RNPS1 and CASC3 efficiently co-IP with EIF4A3 but not with each other (Figure 2.1B). Thus, in mammalian cells, the EJC core forms mutually exclusive complexes with RNPS1 and CASC3, which we refer to as alternate EJCs.

Figure 2.1 RNPS1 and CASC3 Exist in Mutually Exclusive EJCs in Mammalian Cells

(A) Western blots showing proteins on the right in RNase-A-treated total HEK293 cell extract (T) (lane 1) or in the immunoprecipitants (IP) (lanes 2–5) of the antibodies listed on the top. Asterisk (∗) indicates IgG heavy chain.

(B) Western blots as in (A) from RNase-A-treated mouse cortical total extracts and IPs. Arrow: RNPS1 signal sandwiched between IgG heavy chain and EIF4A3.

(C) A heatmap of signal for normalized-weighted spectra observed for the proteins on the right in the bottom-up proteomics of the indicated FLAG IPs (top). Bottom: heatmap color scale.

Analysis in (A) and (C) conducted by Justin Mabin; Analysis in (B) conducted by Lauren Woodward

2.4.2 Proteomic analysis of alternate EJCs

To gain further insights into alternate EJCs, we characterized RNPS1 and CASC3-containing complexes using bottom-up proteomics. Justin Mabin generated stable HEK293 cell lines to achieve tetracycline-inducible expression of FLAG-tagged RNPS1 and CASC3 at near-endogenous levels. FLAG IPs from RNase-treated total extracts of these cells confirmed that the tagged proteins also exist in mutually exclusive EJCs. Importantly, proteomic analysis of FLAG-affinity-purified alternate EJCs shows almost complete lack of spectra corresponding to RNPS1 and CASC3 in the IP of the other alternate EJC factor (Figure 2.1C). In comparison, a FLAG-MAGOH IP enriches both CASC3 and RNPS1 as expected (Figure 2.1C).

We next analyzed the complement of proteins more than 2-fold enriched in FLAG-RNPS1, FLAG-CASC3, or FLAG-MAGOH IPs as compared to the FLAG-only control. Of the 59 proteins enriched in the two FLAG-RNPS1 biological replicates, 38 are common with the 45 proteins identified in the two FLAG-MAGOH replicates, suggesting that RNPS1-containing complexes are compositionally similar to those purified via the EJC core. In comparison, among the FLAG-CASC3 replicates, EIF4A3 and MAGOH are the only common proteins that are also shared with FLAG-MAGOH. Known EJC interactors such as UPF3B and PYM1 are identified in only one of the two FLAG-CASC3 replicates, indicating their weaker CASC3 association.

A direct comparison of FLAG-RNPS1 and FLAG-CASC3 proteomes reveals their stark differences beyond the EJC core (Figures 2.2A and 2.3). Several SR and SR-like proteins are enriched in the RNPS1 and MAGOH proteomes but are absent from the CASC3 IPs. Among the SR protein family, SRSF1, 6, 7, and 10 are enriched in both RNPS1 and MAGOH samples, while all other canonical SR proteins, with the exception of non-shuttling SRSF2, are detected in at least one of the two replicates (Figures 2.2A-B, 2.3). Several SR-like proteins such as ACIN1, PNN, SRRM1, and SRRM2 are also enriched in MAGOH and RNPS1 IPs but are absent in CASC3 IPs (Figure 2.2). A weak but erratic signal for SRSF1 and SAP18 is seen in FLAG-CASC3 IPs (Figure 2.3). However, their enrichment with CASC3 is much weaker as compared to MAGOH or RNPS1. Thus, an interaction network exists between SR and SR-like proteins and the RNPS1-EJC but not the CASC3-EJC.

Other proteins specifically associated with RNPS1 (and MAGOH), but not with CASC3, include transcription machinery components (e.g., RPB1, RPB2), transcriptional regulators (e.g., CD11A, CDK12), and RNA processing factors (e.g., NCBP1, FIP1) (Figure 2.3). Consistent with EJC core assembly during pre-mRNA splicing, MAGOH and RNPS1 interactors also include U2, U4, and U6 snRNP components and the nineteen complex subunits. None of the splicing components interact with CASC3. Considering that SR proteins also assemble onto nascent RNAs coincident with transcription and splicing (Zhong, Ding, Adams, Ghosh, & Fu, 2009), the EJC-RNPS1-SR interaction network likely originates during co-transcriptional mRNP biogenesis. In contrast, CASC3 most likely engages with the EJC only post-splicing, as previously suggested (Gehring, Lamprinaki, Hentze, & Kulozik, 2009).

Figure 2.2 CASC3- and RNPS1-Containing Complexes Have Distinct Protein Composition and Hydrodynamic Size

(A) A scatterplot comparing fold-enrichment of proteins in FLAG-RNPS1 IP over FLAG-only control (x axis) to fold-enrichment of the same proteins in FLAG-CASC3 IP over FLAG-only control (y axis). Each dot represents a protein identified in FLAG-EJC or FLAG-only control samples by Scaffold. Black dots: proteins enriched > 2-fold over control in the FLAG-MAGOH IP.

(B) Western blots showing proteins (right) in total extract (T) or FLAG-IPs as on the top from HEK293 cells stably expressing FLAG-tagged proteins indicated on the top left. Note that RNPS1 migrates just above EIF4A3. Asterisk: CASC3 signal from an earlier probing.

(C) Western blots showing proteins on the right in glycerol gradient fractions of FLAG-IPs from HEK293 cells stably expressing FLAG-MAGOH, FLAG-RNPS1, or FLAG-CASC3 (far left). Top: gradient fraction. Bottom: molecular weight standards.

Analysis by Justin Mabin

Figure 2.3 Proteins Enriched During FLAG-EJC Pulldown

A heatmap showing proteins >10-fold enriched in any of the FLAG-EJC IPs (indicated on the top) over FLAG-only control. Right: Protein groups by functions. Bottom: heatmap color scale.

Analysis by Guramrit Singh

2.4.3 Alternate EJCs are structurally distinct

The enrichment of SR and SR-like proteins exclusively with RNPS1 suggests that RNPS1-EJCs are likely to resemble the previously described higher-order EJCs (Singh, et al., 2012). Indeed, glycerol gradient fractionation of RNase-treated FLAG-RNPS1 complexes shows that, like FLAG-MAGOH EJCs, FLAG-RNPS1 EJCs contain both lower- and higher-molecular-weight complexes (Figure 2.2C). On the other hand, CASC3 is mainly detected in lower-molecular-weight complexes purified via FLAG-MAGOH (Figure 2.2C, compare CASC3 in fractions 2–10 and 22–24). Further, FLAG-CASC3 complexes are exclusively comprised of lower-molecular-weight complexes, likely to be EJC monomers (Figure 2.2C). Thus, compositional distinctions between the two alternate EJCs give rise to two structurally distinct complexes.

2.4.4 RNPS1 and CASC3 bind RNA via the EJC core with key distinctions

We next identified the RNA binding sites for the two alternate EJC factors using RNA:protein immunoprecipitation in tandem (RIPiT) combined with high-throughput sequencing, or RIPiT-seq (Singh, et al., 2012; Singh, Ricci, & Moore, 2014). RIPiT-seq entails tandem purification of two subunits of an RNP and is well suited to study EJC composition via sequential IP of its constant (e.g., EIF4A3, MAGOH) and variable (e.g., RNPS1, CASC3) components (Figure 2.4). Lauren Woodward and Justin Mabin carried out RIPiTs from HEK293 cells by either pulling first on FLAG-tagged alternate EJC factor followed by IP of an endogenous core factor or vice versa. EJC binding studies thus far have used translation elongation inhibitor cycloheximide (CHX) treatment to limit EJC disassembly by translating ribosomes (Hauer, et al., 2016; Saulière, et al., 2012; Singh, et al., 2012). However, to capture unperturbed, steady-state populations of RNPS1- and CASC3-EJCs, Lauren Woodward and Justin Mabin performed RIPiT-seq without translation inhibition.

Figure 2.4 A Schematic Illustrating the Main Steps in RIPiT-seq.

As expected, RIPiTs for each of the alternate EJCs specifically purified the targeted complex along with the EJC core factors and yielded abundant RNA footprints. Strand-specific RIPiT-seq libraries from ∼35–60 nucleotide footprints yielded ∼2.5–27 million reads, of which > 80% mapped uniquely to the human genome. Genic read counts are highly correlated between RIPiTs, where the order of IP of EJC core and alternate factors was reversed. Unlike CASC3-EJC, RNPS1-EJC interaction was susceptible to NaCl concentration > 250 mM (data not shown). Therefore, to preserve labile interactions, Lauren Woodward and Justin Mabin performed alternate EJC RIPiT-seq from cells cross-linked with formaldehyde before cell lysis. A strong correlation is seen between crosslinked and uncrosslinked samples. All analysis presented below is from two well-correlated biological replicates of formaldehyde crosslinked RIPiT-seq datasets of RNPS1- and CASC3-EJC.

Consistent with EJC deposition on spliced RNAs, alternate EJC footprints are enriched in exonic sequences from multi-exon genes. Along exons, footprints of both alternate EJCs accumulate mainly at the canonical EJC binding site 24 nt upstream of exon junctions (Figures 2.5 and 2.6) (Hauer, et al., 2016; Saulière, et al., 2012; Singh, et al., 2012). Notably, the enrichment at the canonical EJC site is irrespective of whether the alternate EJC factors are IPed first or second during RIPiT (Figure 2.7). Thus, both RNPS1 and CASC3 bind to RNA via EJCs at the canonical site. The location of the 5′- and 3′-ends of the alternate EJC footprint reads shows that both alternate EJCs block positions −26 to −19 nt from RNase I cleavage (Figure 2.6B). Our finding that CASC3 mainly binds to the canonical EJC site agrees with previous work (Hauer, et al., 2016). However, in contrast to that study, RNPS1 binding to canonical EJC sites is readily apparent in our data (Figures 2.5, 2.6), which could reflect differences in the UV-crosslinking-dependent CLIP-seq approach used by Hauer et al. (2016) and the photo-crosslinking independent RIPiT-seq (see also Chapter 3). Altogether, both RNPS1 and CASC3 exist in complex with EJC at its canonical site.

Figure 2.5 Example Read Coverage

Genome browser screenshots comparing read coverage along the ISYNA1 gene in RNA-seq or RIPiT-seq libraries (indicated on the right). Blue rectangles: exons; thinner rectangles: untranslated regions; lines with arrows: introns.

Figure 2.6 CASC3 Associates More Strongly with Canonical EJCs than RNPS1

(A) Meta-exon plots showing read depth in different RIPiT-seq or RNA-seq libraries (indicated in the middle) in the 150 nt from the exon 5′ (left) or 3′ end (right). Vertical black dotted line: canonical EJC position at −24 nt.

(B) A composite plot of RNPS1 and CASC3 RIPiT-seq footprint read 5′ (solid lines) and 3′ (dotted lines) ends. Vertical dotted line: canonical EJC site (−24 nt). Vertical gray line: EJC 3′ boundary (−19 nt).

Figure 2.7 Comparison of Samples in Meta-Exon Plot

Meta-exon plot showing read depth in different RIPiT-seq or RNA-seq libraries (top left corner) in the 150 nucleotides (nt) from the exon 3′ end.

While read densities for both alternate EJC factors are highest at the canonical EJC site, 47%–62% of reads map outside of the canonical EJC site, similar to previous estimates (Saulière, et al., 2012; Singh, et al., 2012). Non-canonical footprints are more prevalent in RNPS1-EJC as compared to CASC3-EJC. As RNPS1-EJC is intimately associated with SR and SR-like proteins (Figures 2.2A and 2.3), we performed a k-mer enrichment analysis, which revealed a modest but significant enrichment of GA-rich 6-mers in RNPS1 over CASC3 footprints (Figure 2.8A). Such purine-rich sequences occur in binding sites of several SR proteins (SRSF1, SRSF4, Tra2a and b) (Änkö, et al., 2012; Pandit, et al., 2013; Tacke, Tohyama, Ogawa, & Manley, 1998). A small but significant enrichment of SRSF1 and SRSF9 sequence motifs is seen in RNPS1 footprints as compared to CASC3 footprints or RNA-seq reads (Figures 2.8B, 2.8C). Thus, within spliced RNPs, RNPS1-EJC is engaged with SR and SR-like proteins and other RNA binding proteins, which leads to co-enrichment of RNA binding sites of these proteins during RNPS1 RIPiT.

Figure 2.8 K-Mer Analysis of Alternate Peripheral Proteins

(A) Boxplots showing frequencies of 6-mers that contained CG 3-mers (CGG, GCG, CCG, CGC) or GA 3-mers (GGA, GAA, AGG, GAG) in CASC3 or RNPS1 RIPiT-seq reads. Top: Wilcoxon rank-sum test p values.

(B) Cumulative distribution function plots showing frequency of reads in the indicated samples with the highest score for match to SRSF1 motif (inset) position weight matrix (PWM). Bottom left: sample identity legend. The SRSF1 motif in the inset is from (Tacke & Manley, 1995)

(C) A negative binomial based assessment of significance of differences in SRSF1 motif PWM scores in (B) between the two alternate EJC footprint reads or between alternate EJC and RNA-seq reads (legend on the bottom left). Horizontal dotted red line: p = 0.05. Horizontal solid red line: Bonferroni adjusted p value.

2.4.5 Subcellular mRNP localization and nuclear retention mechanisms affect relative RNPS1 and CASC3 occupancy

Surprisingly, despite the mutually exclusive association of RNPS1 and CASC3 with the EJC core, the two proteins are often detected on the same sites on RNA, leading to their similar apparent occupancy on individual exons as well as entire transcripts. These results suggest that the two alternate EJC factors bind to two distinct pools of the same RNAs. To further reveal RNA binding patterns of the alternate EJC factors, we identified exons differentially enriched in one or the other factor. If two or more exons from the same gene are differentially enriched in RNPS1 or CASC3 footprints, these exons are almost always enriched in the same alternate EJC factor (Figure 2.9). This tight linkage between EJC compositions of different exons of the same gene suggests that RNPS1 and CASC3 binding to the EJC is likely to be determined at the level of the entire transcript rather than at the level of the individual exon. Additionally, no RNA-dependent RNPS1 co-IP is observed with FLAG-CASC3, and only a weak RNA-dependent CASC3 co-IP is seen with FLAG-RNPS1. Thus, while some RNPs with a heterogeneous mix of alternate EJCs can be captured, at any given time, a transcript with multiple EJCs is likely to be more homogeneously associated with one or the other alternate factors.

Figure 2.9 Venn Diagram of Genes with Preferential RNPS1 or CASC3 Binding

A Venn diagram showing genes with at least two differentially enriched exons, for which all differentially enriched exons are preferentially bound to CASC3-EJC (blue region), RNPS1-EJC (orange region), or genes with at least one exon enriched in CASC3-EJC and one in RNPS1-EJC (overlap region). The probability of observing an overlap of 5 or fewer genes using a binomial distribution approximating each gene to contain two exons is shown.
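The binomial approximation mentioned in this caption can be sketched as follows. If each gene is treated as having two differentially enriched exons, and each exon is independently CASC3-enriched with probability p, a gene is "mixed" with probability 2p(1 − p); the chance of seeing k or fewer mixed genes among n is then a binomial tail. The function names are hypothetical, and no values of n or p from this study are assumed here.

```python
from math import comb


def mixed_gene_probability(p_casc3):
    """Chance that a two-exon gene has one exon of each type, if exons
    were assigned to CASC3 or RNPS1 enrichment independently."""
    return 2 * p_casc3 * (1 - p_casc3)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def overlap_p_value(k_observed, n_genes, p_casc3):
    """Probability of observing k_observed or fewer mixed genes by chance."""
    return binom_cdf(k_observed, n_genes, mixed_gene_probability(p_casc3))
```

A very small value indicates that exons of the same gene share their alternate EJC factor far more often than independent assignment would predict.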

At steady state, RNPS1 is mainly nuclear, whereas CASC3 is predominantly cytoplasmic, although both proteins shuttle between the two compartments (Daguenet, et al., 2012; Degot, et al., 2002; Lykke-Andersen, Shu, & Steitz, 2001). We reasoned that different concentrations of alternate EJC factors in the two compartments might mirror their EJC and RNA association. To test this possibility, we identified subsets of transcripts preferentially enriched in RNPS1- (242 transcripts) or CASC3-EJC (625 transcripts; Figure 2.10) and compared their cytoplasm/nucleus ratios based on subcellular RNA distribution estimates in HEK293 cells (Neve, et al., 2016). Indeed, the transcripts enriched in CASC3-EJC show higher cytoplasmic levels (median cytoplasm/nucleus ratio = 0.76), whereas those preferentially bound to RNPS1 show a higher nuclear localization (median cytoplasm/nucleus ratio = 0.48; Figure 2.11). Thus, steady-state subcellular RNA localization is a key determinant of EJC composition.

Figure 2.10 MA-Plot of RNPS1-EJC vs CASC3-EJC Footprint Expression

An MA-plot showing fold-change in RNPS1-EJC versus CASC3-EJC footprint reads (y axis) against expression levels (x axis). Each dot represents a canonical transcript for each known gene in GRCh38 from the UCSC “knownCanonical” splice variant table. Transcripts differentially enriched (p-adjusted < 0.05) in RNPS1-EJC (orange) and CASC3-EJC (blue) are indicated.

Although CASC3-enriched RNAs show higher cytoplasmic levels as a group, a quarter of them are more nuclear (Figure 2.11). Consistent with CASC3 shuttling into the nucleus (Daguenet, et al., 2012; Degot, et al., 2002), its footprints are abundantly detected on XIST RNA and several other spliced non-coding RNAs restricted to or enriched in the nucleus (Figure 2.12). Interestingly, CASC3-EJC-enriched RNAs that fall within the top 25% nuclear localized transcripts are characterized by significantly longer transcript, internal exon, and gene lengths (Figure 2.13). In contrast, neither the RNPS1-enriched transcripts nor the CASC3-enriched transcripts with higher cytoplasmic levels display these properties. We noted a direct correlation between transcript length and relative nuclear localization (Figure 2.13 and data not shown). Therefore, RNPs transcribed from long genes and containing long exons may persist for extended duration in the nucleus after their biogenesis and may eventually undergo the switch to CASC3-EJC within the nucleus itself.

Figure 2.11 Comparison of Cytoplasmic to Nuclear Fraction by Enriched Transcripts

Boxplots showing distribution of cytoplasmic/nuclear fraction (y axis) for all (gray), CASC3-EJC-enriched (blue), and RNPS1-EJC-enriched (orange) transcripts. The median values are to the right of each boxplot. Top: p values (Wilcoxon rank sum test). Bottom: number of transcripts in each group. Nuclear/cytoplasmic transcript level data are from (Neve, et al., 2016).

Figure 2.12 Analysis of Nuclear-Enriched RNAs

(A) Genome browser screenshots showing read coverage along the XIST gene in RNA-seq (top) or RIPiT-seq libraries (labeled on right).

(B) Relative alternate EJC levels versus nuclear abundance of long non-coding RNAs (lncRNAs). Each dot on the scatter plot represents a knownCanonical transcript, with spliced lncRNAs labeled in black. CASC3- or RNPS1-enriched transcripts are colored as in the legend (bottom right). y axis: negative fold change = CASC3 enrichment; positive fold change = RNPS1 enrichment.

Figure 2.13 Characteristics of Transcripts Localized to the Nucleus

(A) Boxplots as in Figure 2.11 showing transcript lengths (y axis) of the groups indicated on the x axis. Top: p values (Wilcoxon rank sum test). Bottom: number of transcripts in each group.

(B) Boxplots as in (A) showing gene length.

(C) Boxplots as in (A) showing the length of the longest internal exon.

We next compared alternate EJC occupancy of another class of transcripts retained in the nucleus due to presence of slow splicing introns (Boutz, Bhutkar, & Sharp, 2015). Transcripts containing these detained introns (DIs) are restricted to the nucleus in a mostly pre-processed, poly(A)-tailed state until DIs are spliced. Of the genes that Boutz et al. (2015) found to contain DIs in four human cell lines, 542 also contained DIs in HEK293 cells, whereas a set of 389 transcripts do not contain DIs in any human cell lines, including HEK293 cells (data not shown). In contrast to the nuclear localized long transcripts, the DI-containing transcripts are significantly more enriched in RNPS1-EJC (Figure 2.14), while DI-lacking transcripts show enrichment in CASC3-EJC. Therefore, transcripts that persist longer in the pre-processed state may remain preferentially associated with RNPS1-EJCs and switch to CASC3-EJC at a slower rate. Overall, distinct nuclear retention mechanisms can have opposing effects on EJC composition.

Figure 2.14 Analysis of Detained Intron Containing Genes

(A) Boxplots showing distribution of fold-change values (y axis) for RNPS1-EJC versus CASC3-EJC footprint read counts for all (gray), detained intron-containing (DI+; yellow) or detained intron-lacking (DI−; green) transcripts. (+ values: RNPS1-EJC enriched; − values: CASC3-EJC enriched). Top: p values (Wilcoxon rank sum test). Bottom: number of transcripts in each group. DI+/DI− genes were defined based on (Boutz, Bhutkar, & Sharp, 2015).

(B) Genome browser screenshot as in Figure 2.12A for the SRSF5 locus. Note increased RNA-seq reads in introns 4 and 5.

2.4.6 Kinetics of translation and mRNA decay impact the pool of alternate EJC-bound mRNAs

As EJCs are disassembled during translation (Dostie & Dreyfuss, 2002; Gehring, Lamprinaki, Hentze, & Kulozik, 2009), abundant RNPS1 and CASC3 footprints at EJC deposition sites suggest that the bulk of mRNPs undergo the compositional switch before translation. Consistently, we detected RNA-dependent interactions between alternate EJC factors and nuclear cap binding protein CBP80. To test how translation impacts alternate EJC occupancy, and if this occupancy is influenced by the rate at which mRNAs enter the translation pool, we obtained RNPS1- and CASC3-EJC footprints from cells treated with CHX and compared them to alternate EJC footprints from untreated cells. When mRNAs bound to each alternate EJC are compared across the two conditions, CASC3-EJC occupancy shows a dramatic change (Figure 2.15). The change in RNPS1-EJC occupancy trends in the same direction (R² = 0.27) but is much more modest. Thus, the RNPS1-EJC likely precedes the CASC3-EJC. Also, either the RNPS1 to CASC3 switch is mildly affected by translation inhibition or RNPS1-EJCs are also subject to some translation-dependent disassembly. Nonetheless, translation inhibition leads to accumulation of mRNPs mainly with CASC3-EJC.
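For reference, the coefficient of determination quoted above is the squared Pearson correlation of an ordinary least-squares line fit. The sketch below computes the slope, intercept, and R² for paired fold-change values; the function name and input layout are assumptions made for illustration.

```python
def linear_fit_r2(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, R^2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared
```

Here x would hold the per-transcript CASC3-EJC fold-changes with and without CHX and y the corresponding RNPS1-EJC fold-changes.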

Figure 2.15 CASC3/RNPS1-EJC Occupancy with Translation Inhibition

A scatterplot of fold-change in CASC3-EJC occupancy with and without cycloheximide (CHX; x axis) and fold-change in RNPS1-EJC with and without CHX (y axis). Each dot represents a canonical transcript for each GRCh38 known gene and is colored as indicated in the legend (bottom right). Green outlined dots: genes. Dotted red line: linear regression. Top left corner: coefficient of determination (R²).

We predicted that poorly translated mRNAs will be enriched in CASC3-EJC under normal conditions, and more efficiently translated mRNAs will be differentially enriched upon translation inhibition. To test this idea, we inferred a measure of translation efficiency of human mRNAs based on their abundance-normalized ribosome footprint counts in a human colorectal cancer cell line (Kiss, Baez, Huebner, Bundschuh, & Schoenberg, 2017). Presumably, this “ribosome occupancy” measure is comparable at least for ubiquitously expressed mRNAs across different human cell types and can be used as an indirect measure of HEK293 mRNA translation efficiencies. A comparison of the CASC3-EJC-enriched transcripts from untreated versus CHX-treated conditions, however, showed only a minor difference in their median ribosome occupancy (Figure 2.16A). A search for functionally related genes in the two sets revealed that each contains diverse groups (Figure 2.17). Under normal conditions, the largest and most significant CASC3-EJC-enriched group encodes signal-peptide bearing secretory/membrane proteins, which has significantly higher ribosome occupancy as compared to all transcripts (Figure 2.16B). We reason that, despite their higher ribosome occupancy, the “secretome” (Jan, Williams, & Weissman, 2014) transcripts may be enriched in CASC3-EJC because binding of the signal peptide to the signal recognition particle (SRP) halts translation until the ribosome engages with the endoplasmic reticulum (ER) (Walter, Ibrahimi, & Blobel, 1981). Presumably, the time before translation resumption on the ER allows capture of mRNPs where EJC composition has switched but the EJC has not yet been disassembled. Consistently, a weak RNA-dependent interaction is seen between CASC3 and SRP68, an SRP component. When we considered only the cytosol-translated transcripts (Chen, Jagannathan, Reid, Zheng, & Nicchitta, 2011), a comparison of ribosome occupancy of CASC3-EJC-enriched transcripts from untreated versus CHX-treated conditions confirmed our initial hypothesis. The median ribosome occupancy of transcripts bound to CASC3-EJC in the absence of CHX is significantly lower (−2.56) as compared to transcripts bound to CASC3-EJC in the presence of CHX (−2.08, p = 2.7 × 10⁻⁴, Figure 2.19A).
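The two quantities used in this comparison can be sketched in a few lines of Python. The example below assumes that ribosome occupancy is the log2 ratio of ribosome footprint counts to mRNA abundance (the log2 scale is an assumption) and implements a two-sided Wilcoxon rank sum test using the normal approximation with a tie correction. The function names are hypothetical, and an established statistics package would normally be preferred over this hand-written version.

```python
import math


def ribosome_occupancy(footprint_count, mrna_abundance):
    """Abundance-normalized ribosome footprint signal on a log2 scale (assumed)."""
    return math.log2(footprint_count / mrna_abundance)


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum (Mann-Whitney U) test.

    Uses the normal approximation with tie correction and no continuity
    correction. Returns (U statistic for x, two-sided p value).
    """
    pooled = sorted((value, group) for group, values in enumerate((x, y)) for value in values)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        tie_size = j - i
        tie_term += tie_size ** 3 - tie_size
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value
```

Here x and y would be the ribosome occupancy values of the two transcript groups being compared, for example CASC3-EJC-enriched transcripts from untreated and CHX-treated cells.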

Figure 2.16 Boxplots Showing Translation Efficiency and Alternate EJC Occupancy by Gene Groups

(A) Box plots showing distribution of translation efficiency estimates (y-axis) of transcript groups on x-axis. The median values are given to the right of each box plot. Top: p-values (Wilcoxon rank sum test). Bottom: the number of transcripts in each group.

(B) Comparison of ribosome occupancy (y-axis) of secretome (red) and ribosomal protein (RP) transcripts (green) to all transcripts (grey). The number of genes in each group is given at the bottom. Ribosome occupancy data from (Kiss, Baez, Huebner, Bundschuh, & Schoenberg, 2017).

(C) Ribosome occupancy of RP genes in mouse embryonic stem cells (data from (Ingolia, Lareau, & Weissman, 2011)).

(D) Box plots showing fold-change (y-axis) in alternate EJC occupancy (x-axis, at the bottom) for transcript groups shown at the bottom. x-axis: −fold-change = enrichment in −CHX condition, +fold-change = enrichment in +CHX condition. The median fold-change values are given to the right of each box.

Figure 2.17 CASC3-EJC GO Term Analysis

Top 5 GO term keywords and their enrichment p values in CASC3-EJC enriched transcripts in the absence (left) or presence of CHX (right).

Analysis by Guramrit Singh

Another functional group enriched in CASC3-EJC under normal conditions comprises the ribosomal protein (RP)-coding mRNAs (Figure 2.17). Strikingly, transcripts encoding ∼50% of all cytosolic ribosome proteins, as well as 13 subunits, are among this group (Figure 2.15). Although RP mRNAs are among the most well translated in the cell, a sizeable fraction of RP mRNAs exist in a dormant untranslated state (Geyer, Meyuhas, Perry, & Johnson, 1982; Meyuhas & Kahan, 2015; Patursky-Polischuk, et al., 2009). Consistently, RP mRNAs have significantly lower ribosome occupancy in human and mouse cells (Figures 2.16B and 2.16C; (Ingolia, Lareau, & Weissman, 2011)). When transcripts differentially bound to RNPS1- versus CASC3-EJC are directly compared in normally translating cells, RP mRNAs are specifically enriched among CASC3-EJC-bound transcripts (Figure 2.18). Therefore, RP mRNPs, and perhaps other translationally repressed mRNPs, switch to and persist in the CASC3-bound form of the EJC. Consistently, under normal conditions, CASC3-EJC-enriched RNAs have significantly lower ribosome occupancy as compared to RNPS1-EJC-bound transcripts (Figure 2.19C). When mRNPs are forced to persist in an untranslated state upon CHX treatment, cytosol-translated mRNAs show increased CASC3 occupancy, whereas their RNPS1 occupancy is not affected (Figure 2.16D).

Figure 2.18 Translation Inhibition’s Impact on CASC3/RNPS1-EJC Occupancy

Scatterplot as in Figure 2.15 comparing fold-change in CASC3-EJC versus RNPS1-EJC footprint read counts in −CHX (x axis) and +CHX (y axis) conditions.

Figure 2.19 Boxplots Showing Various Conditions Impacted by Translation Inhibition

(A) Boxplot showing distribution of ribosome occupancy (y axis) of transcript groups on x axis. Median values are shown to the right of each boxplot. Top: p values (Wilcoxon rank sum test). Bottom: number of transcripts in each group. Ribosome occupancy estimates were based on (Kiss, Baez, Huebner, Bundschuh, & Schoenberg, 2017).

(B) Comparison of fold-change in CASC3-EJC or RNPS1-EJC footprint reads at canonical EJC sites from ribosomal protein (RP)-coding mRNAs or non-ribosomal (non-RP)-protein coding mRNAs. Top: p values (Wilcoxon rank sum test).

(C) Boxplots as in (A) showing distribution of ribosome occupancy estimates of transcript groups from Figure 2.18 as indicated on the bottom.

(D) Boxplots as in (A) and (C) above comparing mRNA half-life of transcript groups from Figure 2.18 as indicated on the bottom. mRNA half-life data are from (Tani, et al., 2012).

As reported by (Hauer, et al., 2016), RP mRNAs are depleted of CASC3-EJC upon translation inhibition (Figures 2.15 and 2.18). Upon CHX treatment, CASC3 occupancy is significantly reduced at canonical EJC sites of RP mRNAs as compared to non-RP mRNAs, which show an increase in CASC3 occupancy (Figure 2.19B). In comparison, RNPS1 occupancy on all transcripts modestly increases upon CHX treatment (Figure 2.19B). The decrease in CASC3-EJC occupancy on RP mRNAs upon CHX treatment suggests a paradoxical possibility that the untranslated reserves of RP mRNAs enter translation when the pool of free ribosomes is dramatically reduced upon CHX-mediated arrest of translating ribosomes. Intriguingly, a similar contradictory increase in ribosome footprint densities on RP mRNAs upon CHX treatment was recently reported in fission yeast (Duncan & Mata, 2017).

The alternate EJC occupancy landscape is also impacted by mRNA decay kinetics. CASC3-EJC-enriched RNAs have longer half-lives as compared to RNPS1-EJC-enriched transcripts under both translation-conducive (median t1/2 = 5.9 hr versus 4.6 hr, p = 3.1 × 10⁻³) and inhibitory conditions (median t1/2 = 4.8 hr versus 3.4 hr, p = 5.3 × 10⁻⁵, Figure 2.19D). Notably, RNAs enriched in both alternate EJCs upon CHX treatment have shorter half-lives as compared to the corresponding cohorts enriched from normal conditions. Thus, EJC detection is enhanced on transcripts that are stabilized after CHX treatment. Consistently, functionally related groups of genes encoding unstable transcripts (e.g., cell cycle, mRNA processing, and DNA damage; (Schwanhäusser, et al., 2011)) are enriched in CASC3-EJC upon translation inhibition (Figure 2.17).

2.4.7 RNPS1 is required for efficient NMD of all transcripts; CASC3 is required only for a subset

Justin Mabin next tested if alternate EJC factors were equally important for NMD and/or if they have distinct targets as previously suggested (Gehring, et al., 2005). In HEK293 cells with ∼80% of RNPS1 mRNA and protein depleted, a majority of endogenous NMD targets tested are significantly upregulated (Figure 2.20A). Surprisingly, however, in cells with ∼85% of CASC3 depleted, some NMD targets are largely unaffected, whereas others are only modestly upregulated. Notably, the RNAs that were upregulated upon CASC3 depletion are even more upregulated upon RNPS1 depletion. Most tested RNAs are significantly upregulated upon depletion of EIF4A3 and the central NMD factor UPF1 (Figure 2.20A and B). These data suggest that normal RNPS1 levels are required for efficient NMD of all endogenous NMD substrates tested, whereas even a large reduction in CASC3 levels leads to only small effects on NMD of this set of transcripts.

Figure 2.20 Effect of Core and Peripheral EJC Component Knockdown on NMD Targets

(A) Fold change in levels of endogenous NMD-targeted transcripts (bottom) in HEK293 Flp-In TRex cells depleted of core or alternate EJC factors. (EIF4A3 knockdown: 48 hr, alternate EJC factor knockdown: 96 hr). Shown are the average values normalized to TBP levels from three biological replicates ± standard error of means (SEM). ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001 (Student’s t test). Percent knockdown of each protein in a representative experiment is in the legend.

(B) Fold change as in (A) but with UPF1 knockdown (96 hr)

Analysis by Justin Mabin
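For readers less familiar with how values such as those in Figure 2.20 are typically derived, the following Python sketch computes a reference-normalized fold change and its standard error of the mean from real-time PCR threshold cycles. It assumes the common 2^−ΔΔCt approach with TBP as the reference transcript and perfect amplification efficiency; the exact calculation used for the figure is not described here, and all function names are hypothetical.

```python
import math
from statistics import mean, stdev


def relative_level(ct_target, ct_reference):
    """Target level relative to the reference transcript (e.g. TBP).

    Assumes a doubling of product per cycle (100% amplification efficiency).
    """
    return 2 ** -(ct_target - ct_reference)


def fold_change_with_sem(knockdown_cts, control_cts):
    """Fold change of knockdown over control, with SEM across replicates.

    Each argument is a list of (ct_target, ct_reference) pairs, one pair per
    biological replicate. Returns (mean fold change, SEM).
    """
    control_mean = mean(relative_level(t, r) for t, r in control_cts)
    folds = [relative_level(t, r) / control_mean for t, r in knockdown_cts]
    sem = stdev(folds) / math.sqrt(len(folds))
    return mean(folds), sem
```

With three biological replicates per condition, the returned pair corresponds to the "average ± SEM" format used in the figure legend.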

It is possible that either the residual amount of protein after CASC3 knockdown is sufficient to support NMD of the tested transcripts or CASC3 is dispensable for their NMD. To evaluate these possibilities further, we supplemented EIF4A3 knockdown cells with either a wild-type EIF4A3 (WT) or a mutant EIF4A3 (YRAA: Y205A, R206A) with much reduced CASC3 binding (Andersen, et al., 2006; Ballut, et al., 2005; Bono, Ebert, Lorentzen, & Conti, 2006). Upon complementation of EIF4A3 knockdown cells with exogenous FLAG-EIF4A3 or FLAG-EIF4A3 YRAA proteins, the NMD targets tested can be divided into three groups based on the degree of rescue (Figure 2.21). (i) SF3B1 and C1orf37 show almost complete rescue of their NMD upon expression of either wild-type or mutant EIF4A3. ARC and SRSF4 also show a similar rescue, as their transcript levels in the two rescue conditions are not significantly different from the control. Therefore, this group of transcripts can undergo NMD largely independently of CASC3. (ii) EIF4A2 shows partial, but not complete, rescue under both conditions. (iii) DNAJB2 and TMEM33 levels were rescued by wild-type EIF4A3 but not by EIF4A3 YRAA. Therefore, at least these two transcripts depend on both RNPS1- and CASC3-EJC for their efficient NMD.

Figure 2.21 Real-Time PCR Analysis of NMD Targets

Real-time PCR analysis from HEK293 cells depleted of EIF4A3. The EIF4A3 knockdown data are the same as in Figure 2.20B and are plotted on the left y axis. Either wild-type EIF4A3 or a mutant with reduced CASC3 interaction was exogenously expressed in EIF4A3 knockdown cells as indicated in the legend on the top right, and these data are plotted on the right y axis.

Analysis by Justin Mabin

Zhongxia Yi next tested the effect of knockdown of RNPS1 and CASC3 on NMD of a well-known β-globin mRNA with a premature stop codon at codon 39 (β39). We observed that, like DNAJB2 and TMEM33, β39 mRNA is upregulated upon knockdown of either alternate EJC factor in HeLa cells (Figure 2.22). These results are also consistent with recent genome-wide screens that have identified CASC3 as an effector of NMD of exogenous reporters (Alexandrov, Colognori, Shu, & Steitz, 2012; Baird, et al., 2018). Overall, our results show that, in human cells, while all tested transcripts depend on RNPS1 for their efficient NMD, some NMD targets can undergo CASC3-independent NMD, and some endogenous transcripts and exogenously (over)expressed NMD targets may depend on both EJC compositions for their efficient NMD.

Figure 2.22 β39 mRNA is Upregulated upon EJC Factor Knockdown

Northern blots showing levels of wild-type (βwt, lane 1) or PTC-containing (codon 39; β39; lanes 2-5) β-globin mRNA and a longer internal control β-globin (βUAC-GAP) mRNA from HeLa Tet-off cells treated with siRNAs indicated on the top. Tables below indicate percentage of normalized β39 mRNA as compared to normalized βwt mRNA (top) or fold-increase in β39 mRNA upon knockdown as compared to control.

Analysis by Zhongxia Yi

2.4.8 Increased CASC3 levels can slow down NMD

Justin Mabin next tested if overexpression of RNPS1 or CASC3 can tilt EJC composition toward one of the two alternate EJCs, and if such a change can impact NMD. While CASC3 overexpression in HEK293 cells leads to a several-fold increase in CASC3 co-IP with EIF4A3 (Figures 2.23A and B) and RBM8A (not shown), no concomitant decrease is seen in RNPS1 co-precipitation with EJC core proteins. Consistent with this, manifold overexpression of RNPS1 did not cause any detectable change in levels of the two alternate factors in the EJC core IPs. These results rule out a simple, direct competition between the two proteins for EJC core interaction. Surprisingly, upon CASC3 overexpression, all the tested endogenous NMD targets show small but significant upregulation (apart from EIF4A2 and SF3B1, Figure 2.23C). Similarly, the β39 mRNA exogenously expressed in HeLa Tet-off cells is also modestly stabilized upon CASC3 overexpression (Figure 2.23D). As previously reported (Viegas, Gehring, Breit, Hentze, & Kulozik, 2007), RNPS1 overexpression further downregulates β39 mRNA in HeLa cells (Figure 2.22), indicating that NMD of this RNA occurs more efficiently when it is associated with the early-acting SR-rich EJC.

Figure 2.23 Promotion of Switch to Late-Acting CASC3-EJC Dampens NMD Activity

(A) Western blots showing proteins on the right in total extract (T, lanes 1–3) or EIF4A3 IPs (lanes 4–6) from HEK293 cells overexpressing FLAG-fusion proteins at top left.

(B) A histogram showing the fold enrichment of alternate EJC factors in EIF4A3 IPs from (A). Overexpressed (OE) alternate EJC factors are at the bottom and CASC3 (blue) and RNPS1 (orange) levels in each OE sample (lane 5 or 6) were compared to the control IP in lane 4 (gray, set to 1).

(C) Fold change in levels of NMD-targeted endogenous transcripts (bottom) in HEK293 cells exogenously overexpressing FLAG or FLAG-CASC3. Average values from three biological replicates ± SEM are shown. ∗p < 0.05 and ∗∗p < 0.01 (Student’s t test).

(D) Northern blots showing decay of Tetracycline-inducible β39 mRNA in HeLa Tet-off cells overexpressing FLAG-tagged proteins indicated on the left. Time after Tet-mediated transcriptional shut-off of β39 mRNA is indicated on the top. Right: β39 mRNA half-life (t1/2, average of three biological replicates ± standard deviation).

Analysis by Justin Mabin
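A half-life such as the one reported in panel (D) can be estimated from a transcriptional shut-off time course by assuming first-order decay, so that the log of the remaining mRNA fraction falls linearly with time. The Python sketch below fits that line by least squares and converts the decay constant to t1/2 = ln 2 / k. The single-exponential model and the function name are assumptions for illustration, not a description of how the blots were quantified.

```python
import math


def half_life_from_time_course(times_hr, fraction_remaining):
    """Estimate mRNA half-life (in hours) assuming first-order decay.

    Fits ln(fraction remaining) against time by ordinary least squares;
    the negative slope is the decay constant k, and t1/2 = ln(2) / k.
    """
    logs = [math.log(f) for f in fraction_remaining]
    n = len(times_hr)
    mean_t = sum(times_hr) / n
    mean_log = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times_hr)
    sxy = sum((t - mean_t) * (v - mean_log) for t, v in zip(times_hr, logs))
    decay_constant = -sxy / sxx
    return math.log(2) / decay_constant
```

The inputs would be the time points after Tet-mediated shut-off and the reporter signal normalized to the internal control and to the zero time point.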

2.5 Discussion

The EJC is a cornerstone of all spliced mRNPs and interacts with >50 proteins to connect the bound RNA to a wide variety of post-transcriptional events. The EJC is thus widely presumed to be “dynamic.” By purifying EJC via key peripheral proteins, we demonstrate that a remarkable binary switch occurs in EJC’s complement of bound proteins. Such an EJC composition change has important implications for mRNP structure and function, including mRNA regulation via NMD (Figure 2.24).

Figure 2.24 Model of the Proposed EJC Compositional Switch

A model depicting a switch in EJC composition and its effect on mRNP structure and NMD activity. Key EJC proteins are indicated. Oval with radial orange fill in the high molecular weight (HMW) EJCs: unknown interactions that mediate EJC multimerization. Grey and black lines: exons.

2.5.1 EJC composition and mRNP structure

Our findings suggest that, when EJCs first assemble during co-transcriptional splicing, the core complex consisting of EIF4A3, RBM8A, and MAGOH engages with SR proteins and SR-like factors, including RNPS1 (Figures 2.2 and 2.3). Within these complexes, RNPS1 is likely bound both to the EJC core and to the SR and SR-like proteins bound to their cognate binding sites on the RNA (Figures 2.6 – 2.8). This network of interactions bridges adjacent and distant stretches of mRNA, winding the mRNA up into a higher-order structure, which is characteristic of pre-translation RNPs purified from human cells via the EJC core or RNPS1 (Figures 2.2 and 2.3, (Singh, et al., 2012; Metkar, et al., 2018)). Such higher-order interactions are likely to be key in packaging spliced RNA into a compact RNP particle (Adivarahan, et al., 2018). Presumably, the higher-order EJCs assemble via multiple weak interactions among low-complexity sequences (LCS) within EJC-bound SR and SR-like proteins (Haynes & Iakoucheva, 2006; Kwon, et al., 2013). RNPS1, which possesses an SR-rich LCS, could possibly act as a bridge between the EJC core and more distantly bound SR proteins. Our data also indicate that, at least on average-sized mRNPs, these SR-rich and RNPS1-containing higher-order EJCs persist through much of their nuclear lifetime (Figure 2.11).

At some point before or during translation, the SR and SR-like proteins are evicted from all EJCs of an mRNP and the EJC is joined by CASC3 (Figures 2.15 and 2.18). It remains to be seen if CASC3’s EJC incorporation drives EJC remodeling. Alternatively, active process(es) such as RNP modification via SR protein phosphorylation by cytoplasmic SR protein kinases (Zhou & Fu, 2013) or RNP remodeling by ATPases may precede CASC3 binding to EJC (Lee & Lykke-Andersen, 2013). What is clear is that CASC3-bound EJCs lose their higher-order structure and exist as monomeric complexes at the sites where EJC cores were co-transcriptionally deposited. Thus, the switch in EJC composition from RNPS1 and SR-rich complexes to CASC3-bound complexes changes the higher-order EJC, and possibly, mRNP structure. The CASC3-bound form of the EJC is likely the main target of translation-dependent disassembly, although RNPS1-EJC may also undergo similar disassembly.

2.5.2 CASC3-EJC and pre-translation mRNPs

Our findings support the emerging view that CASC3 is not an obligate component of all EJC cores. A population of assembled EJCs, especially those early in their lifetime, may completely lack CASC3 (Figures 2.1, 2.2, and 2.3). Such a view of partial CASC3 dispensability for EJC structure and function is in agreement with findings from Drosophila, where the assembled trimeric EJC core as well as RNPS1 and its partner ACIN1 are required for splicing of long or sub-optimal introns, whereas CASC3 is not (Hayashi, Handler, Ish-Horowicz, & Brennecke, 2014; Malone, et al., 2014; Roignant & Treisman, 2010). Recent findings regarding EJC core protein functions during mouse embryonic brain development also support non-overlapping functions of CASC3 and the other core factors. While haploinsufficiency of EIF4A3, RBM8A, and MAGOH leads to premature neuronal differentiation and apoptosis in the mutant brains (Mao, McMahon, Tsai, Wang, & Silver, 2016; McMahon, Miller, & Silver, 2016), similar reduction in CASC3 levels (or even its near complete depletion) does not cause the same defects but leads to a more general developmental delay (Mao, Brown, & Silver, 2017). We note that a view contrary to our findings is presented by the recently reported human spliceosome C∗ structure, where CASC3 is seen bound to the trimeric EJC core (Zhang, et al., 2017). As the spliceosomes described in these structural studies were assembled in vitro in nuclear extracts, it is possible that CASC3 present in extracts can enter pre-assembled spliceosomes and interact with EJC. Consistently, in the human spliceosome C∗ structure, one of the two CASC3 binding surfaces on EIF4A3 is exposed and available for CASC3 interaction. Still, it is possible that, at least on some RNAs or exon junctions, CASC3 assembly may occur soon after splicing within perispeckles (Daguenet, et al., 2012).

What are the cellular functions of CASC3? Our data suggest that CASC3 is a prominent component of cytoplasmic EJCs within mRNPs that have not yet been translated or are undergoing their first round of translation. Previously described functions of CASC3 within translationally repressed neuronal transport granules (Macchi, et al., 2003) and posterior-pole localized oskar mRNPs in Drosophila oocytes (van Eeden, Palacios, Petronczki, Weston, & St Johnston, 2001) further support CASC3 being a component of cytoplasmic pre-translation mRNPs. CASC3 plays an active role in oskar mRNA localization and translation repression (van Eeden, Palacios, Petronczki, Weston, & St Johnston, 2001), and a similar function is presumed in neuronal transport granules. These observations suggest that CASC3-EJCs may have a much more prominent role within longer-lived mRNPs that are transported to distant cytoplasmic locations. CASC3 is also known to activate translation of the bound RNA via EIF3 recruitment (Chazal, et al., 2013). Although such a function appears to contradict its association with translationally repressed mRNPs as discussed above, it is possible that, once localized mRNPs are relieved of repressive activity, EJC and CASC3 can promote translation activation.

2.5.3 EJC composition and NMD

Based on the binary EJC composition switch, EJC-dependent steps in the NMD pathway can be divided into at least two phases wherein the two alternate factors may perform distinct functions. The susceptibility of all tested NMD targets to RNPS1 levels suggests that this protein, and perhaps other components of SR-rich EJCs, serve a critical function in an early phase in the pathway. Such a function could be to recruit and/or activate other EJC/UPF factors either in the nucleus or even during premature translation termination as part of the downstream EJC. In a later phase of the EJC life cycle following the compositional switch, the EJC core maintains the ability to activate NMD as it can directly communicate with the NMD machinery via UPF3B (Buchwald, Schüssler, Basquin, Le Hir, & Conti, 2013). Still, following CASC3 incorporation into the EJC, its ability to stimulate NMD is reduced, perhaps due to the loss of RNPS1 or SR proteins, which are known to enhance NMD (Aznarez, et al., 2018; Sato, Hosoda, & Maquat, 2008; Viegas, Gehring, Breit, Hentze, & Kulozik, 2007; Zhang & Krainer, 2004). CASC3 overexpression can cause the compositional switch to occur at a faster rate or on a greater proportion of mRNAs, or both. Nevertheless, overexpressed NMD reporters (e.g., β39 mRNA) and some endogenous mRNAs depend on both early and late EJC compositions for their NMD. Notably, the β-globin NMD reporter was previously shown to undergo biphasic decay with faster turnover around the nuclear periphery and slower decay in more distant cytoplasmic regions (Trcek, Sato, Singer, & Maquat, 2013). More recently, single-molecule imaging of reporter RNPs showed that a fraction of their population diffuses for several minutes and micrometers away from the nucleus before undergoing the first round of translation (Halstead, et al., 2015). It remains to be seen if mRNPs that are first translated in distant cytoplasmic locales, including those localized to specialized compartments such as neuronal dendrites and growth cones, may rely more on a CASC3-dependent slower phase of NMD. The EJC compositional switch may also underlie the distinct NMD branches identified earlier via tethering of RNPS1 and CASC3 to reporter mRNAs (Gehring, et al., 2005). Consistent with these observations, we find that RNPS1 co-purifies more strongly with UPF2, while CASC3 appears to interact more with UPF3B. The nature of the relationship between EJC composition and UPF2- and UPF3B-independent NMD branches is an important avenue for future work.

Chapter 3

Comparison of High-Throughput Sequencing Methods for Quantification of RNA:Protein Binding²

3.1 Abstract

In eukaryotic cells, proteins that associate with RNA regulate its activity to control cellular function. To fully illuminate the basis of RNA function, it is essential to identify such RNA-associated proteins, their mode of action on RNA, and their preferred RNA targets and binding sites. By analyzing catalogs of human RNA-associated proteins defined by ultraviolet light (UV)-dependent and independent approaches, we classify these proteins into two major groups: (1) the widely-recognized RNA binding proteins (RBPs), which bind RNA directly and UV-crosslink efficiently to RNA, and (2) a new group of RBP-associated factors (RAFs), which bind RNA indirectly via RBPs and UV-crosslink poorly to RNA. As the UV cross-linking and immunoprecipitation followed by sequencing (CLIP-Seq) approach will be unsuitable to identify binding sites of RAFs, we show that formaldehyde crosslinking stabilizes RAFs within ribonucleoproteins to allow for their immunoprecipitation under stringent conditions. Using an RBP (CASC3) and an RAF (RNPS1) within the exon junction complex (EJC) as examples, we show that formaldehyde crosslinking combined with RNA immunoprecipitation in tandem followed by sequencing (xRIPiT-Seq) far outperforms CLIP-Seq in identifying binding sites of RNPS1. xRIPiT-Seq reveals that RNPS1 occupancy is increased on exons immediately upstream of strong recursively spliced exons, which depend on EJC for their inclusion.

² This chapter is based on a published article (Patton, et al., 2020). Wet lab work was conducted by the co-authors cited in the figures and text, while I conducted all bioinformatics and computational analysis unless otherwise noted.

3.2 Introduction

As cells grow, divide, and respond to their environment, they critically depend on RNA-associated proteins to regulate RNA biogenesis and function. RNA-associated proteins participate in all aspects of RNA biology – they assemble RNA into ribonucleoprotein (RNP) machines (e.g. spliceosome, ribosome), membraneless RNP organelles (e.g. nuclear speckles, stress granules), or gene/ regulatory complexes (e.g. Polycomb repressive complex 2). In the case of messenger RNA (mRNA), RNA-associated proteins control its processing, sub-cellular location, intracellular transport, translation into proteins, and its eventual degradation (Müller-McNicoll & Neugebauer, 2013; Singh, Pratt, Yeo, & Moore, 2015). Therefore, to fully comprehend the intricate workings of cellular processes, it is important to identify RNA-associated proteins and elucidate their functions.

The post-genomic era has witnessed a revolution in our ability to catalog RNA-associated proteins encoded in the human and other genomes. Early efforts to build catalogs of RNA-associated proteins mainly relied on sequence similarity to well-known RNA binding domains (Anantharaman, Koonin, & Aravind, 2002). More recent efforts have taken advantage of the ability of RNA and proteins in direct physical contact to form “zero-length” covalent bonds when exposed to shortwave ultraviolet (UV)-light (Hockensmith, Kubasek, Vorachek, & von Hippel, 1993). Such covalent crosslinking “freezes” dynamic intermolecular RNA:protein interactions as they occur in situ to enable their biochemical analysis. This property of direct RNA:protein contacts has been exploited to identify the protein interactome of polyadenylated RNA (Baltz, et al., 2012; Hentze, Castello, Schwarzl, & Preiss, 2018), and more recently to unveil the proteins that come in contact with all cellular RNA (Queiroz, et al., 2019; Trendel, et al., 2019; Urdaneta, et al., 2019). These studies conservatively estimate that the human genome encodes more than 1200 proteins that directly contact RNA, and hence function as RNA binding proteins (RBPs). The UV reactivity of the RNA:protein contacts has also transformed our understanding of global-scale RNA cargoes of individual RBPs. UV-crosslinking-immunoprecipitation (CLIP) based methods allow purification of an RBP of interest and its crosslinked RNAs, which can then be identified via high-throughput sequencing (Lee & Ule, 2018). Advantageously, the protein adducts on crosslinked RNA can be leveraged to map RNA positions directly in contact with RBPs to obtain a single-nucleotide resolution view of in vivo RNA:protein interactions.

While UV-crosslinking based approaches have illuminated many areas of RNA biology, like any other method, they too come with limitations. For example, the UV-crosslinking ability of an RBP is likely to be influenced by sequence composition of RNA and protein at binding interfaces (Hockensmith, Kubasek, Vorachek, & von Hippel, 1986; Smith, 1969) and also by the strength and duration of interactions. Importantly, UV-crosslinking is limited in applicability to proteins that are in direct contact with RNA (i.e. RBPs) and will be ill-suited to study proteins that interact with RNA indirectly via RBPs (RBP-Associated Factors, RAFs). Although we currently lack full understanding of the prevalence of RAFs, a closer examination of cellular RNPs reveals many proteins that play a critical role in RNA biology without directly contacting RNA (e.g. nuclear export factors GLE1 and NXT1; EIF4E binding proteins). RAFs can function along with RBPs either as their regulators or as sub-units of multi-protein complexes that act on RNA, or both. It is therefore important to gain insights into prevalence, properties, and functions of RAFs. Thus, UV-crosslinking independent methods are critical to investigate interactions of RAFs with RNA inside the cells.

Our perspective on RBPs and RAFs is shaped by our investigation of the exon junction complex (EJC), a multi-protein complex that mainly assembles ~24 nt upstream of exon-exon junctions during pre-mRNA splicing (Boehm & Gehring, 2016; Le Hir, Saulière, & Wang, 2016; Woodward, Mabin, Gangras, & Singh, 2017). The EJC contains both RBPs (EIF4A3) and RAFs (MAGOH, RBM8A) within its core. The EJC core also interacts with several peripheral proteins that link it to various steps in post-transcriptional mRNA regulation. In Chapter 2 we showed that two peripheral proteins, RNPS1 and CASC3, interact with the EJC in a mutually exclusive and sequential manner (Mabin, et al., 2018). To identify RNAs bound to EJC inside human cells, Singh et al. have devised a UV-crosslinking independent tandem purification approach termed RNA IP in tandem (RIPiT) (Singh, et al., 2012; Singh, Ricci, & Moore, 2014). To investigate binding sites of more transient and/or labile complexes, such as the EJCs containing alternate factor RNPS1, RIPiT has also been combined with formaldehyde based chemical crosslinking (Singh, Ricci, & Moore, 2014). RIPiT-Seq has also been applied with and without formaldehyde crosslinking to investigate binding profiles of RNA-associated proteins beyond the EJC, i.e. Staufen1 (Ricci, et al., 2014) and WDR5 (Yang, et al., 2014).

Formaldehyde is a small, cell permeable, rapid and reversible crosslinker that can covalently link proteins to nucleic acids and other proteins when they exist in close proximity (Hoffman, Frey, Smith, & Auble, 2015). Thus, formaldehyde is an attractive alternative to UV to crosslink both RBPs and RAFs within cellular RNPs. The utility of formaldehyde crosslinking prior to RIP to enrich RBP-bound RNAs was first shown nearly two decades ago (Niranjanakumari, Lasda, Brazas, & Garcia-Blanco, 2002). Ever since, formaldehyde crosslinking has been utilized to capture RNA cargoes of RNA-associated proteins from diverse eukaryotic systems (e.g. (Chatterjee, et al., 2017; Hendrickson, Kelley, Tenen, Bernstein, & Rinn, 2016; Huang & Hopper, 2015)). More recently, formaldehyde crosslinking has been combined with a CLIP-Seq workflow to identify binding sites of DROSHA, a double-stranded RBP that poorly UV crosslinks with RNA (Kim & Kim, 2019). Despite such general use, several fundamental issues regarding formaldehyde crosslinking remain to be tested: its degree of influence on specificity of RIP signal, its performance in comparison to UV crosslinking, and its applicability to RBPs versus RAFs, to name a few.

Here we describe a comparative analysis of published catalogs of human RNA-associated proteins that were defined based on UV-crosslinking ability of proteins to RNA or protein:protein interaction networks of annotated RBPs (a UV-crosslinking-independent approach). This analysis enables us to categorize human RNA-associated proteins into RBPs and RAFs. We find that RAFs are prevalent in all steps of RNA metabolism and display a wide array of molecular functions and biochemical activities.

This analysis also confirms the classification of EJC factors into RBPs (EIF4A3, CASC3) and RAFs (MAGOH, RNPS1). To investigate binding sites of EJC proteins, we have previously employed RIPiT-Seq either from uncrosslinked human cells (nRIPiT, (Mabin, et al., 2018; Singh, et al., 2012)) or from formaldehyde-crosslinked cells (xRIPiT-Seq,

(Mabin, et al., 2018)). Other groups have instead employed CLIP-Seq to map binding sites of RBPs and RAFs within EJCs (Hauer, et al., 2016; Saulière, et al., 2012). Here we use these existing nRIPiT-Seq, xRIPiT-Seq and CLIP-Seq datasets for CASC3 and

RNPS1 as a case study to systematically evaluate and compare the efficacy of UV and formaldehyde crosslinking methods for enriching RBP/RAF binding sites. We show that formaldehyde crosslinking significantly improves RIPiT-Seq signal for both CASC3 and

RNPS1. Further, xRIPiT-Seq is comparable to CLIP-Seq for identifying CASC3 binding sites but is far superior to CLIP-Seq for finding RNPS1 occupancy sites. Finally, RNPS1 binding to RNA measured via xRIPiT-Seq led us to uncover increased RNPS1 occupancy on exons preceding recursively spliced exons, which were previously shown to depend on the EJC and RNPS1 for their inclusion in mRNA.

3.3 Materials and Methods

3.3.1 Source data for classification of RBPs and RAFs

Sets of UV-crosslinkable human RBPs were from two sources: (1) Proteins detected in RNA interactome capture (RIC) experiments were obtained from (Hentze, Castello, Schwarzl, & Preiss, 2018). For inclusion in our analysis, we required the proteins to be detectable in at least 2 of the 7 RIC datasets. (2) Proteins that were defined as the “integrated human RBPome” or “ihRBPome” based on protein-crosslinked RNA extraction (XRNAX) were obtained from (Trendel, et al., 2019). For a comprehensive set of UV-crosslinking independent human RBPs, we relied on the analysis of (Brannan, et al., 2016). A list of 1786 annotated RBPs was combined with 1923 proteins that achieved the RBP classification score of greater than 0.79 as predicted by SONAR, resulting in a set of 2784 unique proteins that we defined as “annotated and predicted RNA-associated proteins.”
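The set operations described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy protein sets are hypothetical placeholders and do not reproduce the published catalogs or the original analysis code.

```python
from collections import Counter


def detected_in_at_least(datasets, min_datasets=2):
    """Return proteins present in at least `min_datasets` of the given sets."""
    counts = Counter(protein for ds in datasets for protein in set(ds))
    return {protein for protein, n in counts.items() if n >= min_datasets}


def classify(uv_crosslinkable, rna_associated):
    """Split RNA-associated proteins into RBPs (UV-crosslinkable) and RAFs."""
    rbps = rna_associated & uv_crosslinkable
    rafs = rna_associated - uv_crosslinkable
    return rbps, rafs


# Toy inputs (placeholders, not the real RIC or SONAR lists).
ric = [{"EIF4A3", "CASC3"}, {"EIF4A3", "CASC3", "X1"}, {"EIF4A3"}]
annotated = {"EIF4A3", "MAGOH", "RNPS1"}
predicted = {"CASC3", "RNPS1"}

uv = detected_in_at_least(ric, min_datasets=2)
rna_assoc = annotated | predicted  # union removes proteins present in both lists
rbps, rafs = classify(uv, rna_assoc)
```

Taking the union is what reduces 1786 + 1923 entries to a smaller number of unique proteins, since proteins found in both lists are counted once.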

3.3.2 Gene ontology analysis

The DAVID gene ontology tool (Huang, Sherman, & Lempicki, 2009) was used to determine terms enriched in RAFs with all human genes as a background. Only non-redundant terms with the lowest p-value (with Benjamini-Hochberg correction) are reported.

3.3.3 High-Throughput DNA Sequencing

All RNA-Seq and RIPiT-Seq datasets were previously described in Chapter 2

(Mabin, et al., 2018) (GEO: GSE115977), and all data processing was carried out as described therein. CLIP-Seq data from (Hauer, et al., 2016) (downloaded from ArrayExpress: E-MTAB-4215) were similarly processed using the same pipeline, with the exception of adaptor trimming and PCR duplicate removal, which had already been completed by Hauer et al.


3.3.4 Human reference transcriptome and annotations

Reference annotations, e.g. for protein coding transcripts, were retrieved from the

Ensembl BioMart (GRCh38.p12). All analyses were done using the APPRIS Principal 1 transcript variant for each gene (Rodriguez, et al., 2013).

3.3.5 Read distribution assignment

Fractions of reads corresponding to exonic and canonical EJC and non-canonical

EJC regions were computed as follows. All analyses were limited to exons 100 nt or longer in order to have sufficiently long non-canonical EJC regions. Furthermore, for canonical and non-canonical EJC annotations we excluded the last exon, which lacks a canonical EJC deposition site. For the canonical EJC regions the 30 nt surrounding the center of the primary EJC binding site at -24 (-39 to -9) were considered, while for the non-canonical EJC region all nucleotides from the start of each exon to -50 were used.

The single exon gene annotation was similarly limited to exons (genes) at least 100 nt in length. Bedtools function “intersect” was used to compare reads against these annotations and reads which overlapped the annotation by more than 50% were counted.
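The region definitions and the overlap rule above can be illustrated with a small Python sketch. It assumes transcript-oriented, 0-based half-open coordinates, and it assumes that the 50% criterion refers to the fraction of the read that lies inside the annotation; both are assumptions for illustration, and the function names are hypothetical.

```python
def ejc_regions(exon_start, exon_end, min_exon_len=100):
    """Return (canonical, non_canonical) half-open intervals for one exon.

    Coordinates run 5' to 3' along the transcript. Returns None for exons
    shorter than `min_exon_len`, which are excluded from the analysis.
    """
    if exon_end - exon_start < min_exon_len:
        return None
    canonical = (exon_end - 39, exon_end - 9)    # 30 nt centered on -24
    non_canonical = (exon_start, exon_end - 50)  # exon start to -50
    return canonical, non_canonical


def overlaps_majority(read_start, read_end, region, min_fraction=0.5):
    """True if more than `min_fraction` of the read lies within `region`."""
    region_start, region_end = region
    overlap = max(0, min(read_end, region_end) - max(read_start, region_start))
    return overlap > min_fraction * (read_end - read_start)
```

For a 200 nt exon spanning 1000 to 1200, this yields a canonical region of 1161 to 1191 and a non-canonical region of 1000 to 1150. The last exon of each transcript would be dropped before calling `ejc_regions`, as described in the text.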

3.3.6 Read density calculation

After read assignments, reads from replicates of each experiment were combined to compute read densities (RPKM_RIPiT-Seq or RPKM_IP-Seq) per protein-coding gene as follows. For each calculation, the total aligned reads were used as the scaling factor and the length (in kb) of the canonical or non-canonical regions of a gene, or the length of the intronless genes, was used for the base adjustment.
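A minimal Python sketch of this read density calculation follows. The function names are hypothetical, and pooling replicates by summing both their read counts and their total aligned reads is an assumption about how the replicates were combined.

```python
def rpkm(read_count, region_length_nt, total_aligned_reads):
    """Reads per kilobase of region per million aligned reads."""
    if region_length_nt <= 0 or total_aligned_reads <= 0:
        raise ValueError("region length and total aligned reads must be positive")
    length_kb = region_length_nt / 1_000
    reads_in_millions = total_aligned_reads / 1_000_000
    return read_count / length_kb / reads_in_millions


def pooled_rpkm(replicate_counts, replicate_totals, region_length_nt):
    """RPKM after pooling replicates (assumed: sum counts and library sizes)."""
    return rpkm(sum(replicate_counts), region_length_nt, sum(replicate_totals))
```

For example, 50 reads on 500 nt of canonical region in a library of 10 million aligned reads correspond to an RPKM of 10.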


3.3.7 Quantification of EJC binding in canonical and non-canonical regions, and on intronless genes

Only protein-coding genes with RNA-Seq RPKM greater than zero were included in the analysis. For comparing EJC protein binding at different expression levels, the full range of RNA-Seq gene log2 RPKM values was covered by 20 reference points equidistant in logarithmic space. Each point was associated with a bin containing all genes within 2-fold of the central RPKM value. This was done to ensure the same number of bins, with the same expression range (in terms of fold change) between experiments; bins may therefore overlap to different degrees depending on the comparison. Log2 of mean experimental RPKM values in each bin (RPKM_RIPiT-Seq or RPKM_IP-Seq), which provide estimates of EJC protein binding on all canonical or non-canonical regions within a gene, or on intronless genes, were plotted against RNA-Seq RPKM values. In tests comparing RIPiT-Seq and CLIP-Seq, analyses were limited to the set of genes with RPKM values within 1.5-fold in RNA-Seq libraries from each cell line.

The average between these two libraries was then used as an estimate of gene expression (RPKM_RNA-Seq) to compare xRIPiT and CLIP-Seq datasets. Linear fits were then produced for each experiment in log2-space with a set slope of 1. That is, a linear agreement between experimental and RNA-Seq RPKMs was assumed. The calculated intercepts therefore correspond to the exponent of the slope in linear space: $\log_2 F(x) = \log_2 x + b$ in log2-space becomes $F(x) = 2^{b} x$ in linear space. In the bar plots that compare the intercepts calculated from linear fits to the scatter plots, the Y axis is 2 to the power of each intercept, which is exactly the slope in linear space and therefore the total fold-change of each experiment versus RNA-Seq.
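The binning and fixed-slope fit can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, NumPy is used for convenience, and the intercept is taken as the least-squares solution for a line of slope 1 (the mean of y minus x), which may differ from the fitting routine used in the original analysis.

```python
import numpy as np


def bin_centers(log2_expr, n_bins=20):
    """Reference points equidistant in log2 space across the expression range."""
    return np.linspace(log2_expr.min(), log2_expr.max(), n_bins)


def binned_means(log2_expr, signal, n_bins=20, fold=2.0):
    """For each reference point, log2 of the mean signal of genes within `fold`."""
    half_width = np.log2(fold)
    xs, ys = [], []
    for center in bin_centers(log2_expr, n_bins):
        in_bin = np.abs(log2_expr - center) <= half_width
        if not in_bin.any():
            continue
        mean_signal = signal[in_bin].mean()
        if mean_signal > 0:  # log2 is undefined for empty signal
            xs.append(center)
            ys.append(np.log2(mean_signal))
    return np.array(xs), np.array(ys)


def unit_slope_intercept(x, y):
    """Least-squares intercept b of y = x + b, with the slope fixed at 1."""
    return float(np.mean(y - x))
```

Given the intercept `b` returned by `unit_slope_intercept`, the quantity `2 ** b` is the fold-change plotted on the Y axis of the bar plots. Because every bin spans the same fold range around its center, neighboring bins can share genes, as noted above.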

3.3.8 Quantification of individual EJC binding sites

For analysis of individual binding sites, we first limited our search to genes in the top 20% expression range in RNA-Seq, corresponding to an RPKM > 5.9, and normalized canonical region counts by intronless gene levels as detailed above.

Canonical regions between experiment types were then compared to observe overall trends. Individual canonical sites were considered to be observed binding sites if the expression in those regions was at least 2-fold higher than the extrapolated intronless gene expression on the same genes. Venn diagrams showing the overlap between individual binding sites were then constructed to show overlap between experimental methods and similarly all binding sites were plotted in deciles as a function of their RPKM range to show discovered binding sites by experiment, based on gene expression levels.
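The binding site criterion can be written as a short Python sketch. It assumes that the "extrapolated intronless gene expression" is obtained from the slope-1 fit to intronless genes, so that the expected background at a given RNA-Seq level is that level multiplied by 2 to the power of the intronless intercept. That reading, and the function names, are assumptions for illustration.

```python
def expected_background(gene_rnaseq_rpkm, intronless_intercept):
    """Background RPKM extrapolated from the slope-1 fit to intronless genes."""
    return gene_rnaseq_rpkm * 2.0 ** intronless_intercept


def is_bound(site_rpkm, gene_rnaseq_rpkm, intronless_intercept, min_fold=2.0):
    """True if the canonical site signal is at least `min_fold` over background."""
    background = expected_background(gene_rnaseq_rpkm, intronless_intercept)
    return site_rpkm >= min_fold * background


def shared_fraction(sites_a, sites_b):
    """Fraction of all called sites that are called by both methods."""
    union = sites_a | sites_b
    return len(sites_a & sites_b) / len(union) if union else 0.0
```

The sets of sites passed to `shared_fraction` correspond to the regions of the Venn diagrams: sites called only by one method, only by the other, or by both.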

3.3.9 Quantification of EJC binding on Recursively Spliced (RS) exons and their neighboring exons

MaxEntScan splice site scoring software (Yeo & Burge, 2004) was first used to find 5' splice site (5'ss) scores of all exon-exon junctions using sequences obtained from

Ensembl BioMart (GRCh38.p12). For all exon-exon junctions with the 5'ss score above the threshold of 5.52, which was defined previously by (Blazquez, et al., 2018), the downstream exons were classified as RS exons. The same number of exons that showed the lowest 5'ss score at their start were used as a non-RS exon control. CASC3 or RNPS1 occupancy at each exon was estimated by determining total exon coverage (CLIP or RIPiT RPKM for a particular exon) normalized to gene coverage, with gene coverage as the average of the two RNA-Seq data sets within 1.5-fold of each other, as detailed above. Only exons with an RPKM > 0 were considered in the analysis. Such normalized exon coverage was determined for the RS exon itself, as well as the three preceding exons. For gene-level analysis, genes were classified by the highest exon-exon 5'ss score within a given gene, with RS genes defined by the same 5.52 threshold. Similarly, an equal number of non-RS genes were selected to have the lowest maximum 5'ss score within a gene.
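The exon classification and normalization steps can be sketched in Python as below. The sketch starts from precomputed 5'ss scores and does not call MaxEntScan itself; the function names and the dictionary input format are hypothetical and chosen only for illustration.

```python
def classify_rs_exons(scores, threshold=5.52):
    """Split exons into RS exons and an equally sized low-scoring control set.

    `scores` maps an exon identifier to the 5'ss score at the exon start.
    """
    rs = sorted(exon for exon, score in scores.items() if score > threshold)
    below = sorted(
        (exon for exon, score in scores.items() if score <= threshold),
        key=lambda exon: scores[exon],
    )
    return rs, below[: len(rs)]  # control: lowest scores, same number as RS


def normalized_occupancy(exon_rpkm, rnaseq_a, rnaseq_b, max_fold=1.5):
    """Exon RPKM divided by mean RNA-Seq RPKM, or None if the exon is excluded."""
    if exon_rpkm <= 0 or rnaseq_a <= 0 or rnaseq_b <= 0:
        return None
    if max(rnaseq_a, rnaseq_b) / min(rnaseq_a, rnaseq_b) > max_fold:
        return None  # expression differs too much between the two cell lines
    return exon_rpkm / ((rnaseq_a + rnaseq_b) / 2)
```

The same `classify_rs_exons` logic applies at the gene level if each gene is assigned its highest exon-exon 5'ss score before classification.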

3.4 Results

3.4.1 Proteins that poorly UV crosslink to RNA are widespread in RNA metabolism

Within the EJC core, RBM8A and MAGOH bind RNA indirectly via EIF4A3 and do not UV crosslink to RNA, likely due to their lack of direct contact with RNA

(Andersen, et al., 2006; Bono, Ebert, Lorentzen, & Conti, 2006). We hypothesized that many more RAFs like RBM8A and MAGOH must exist that interact indirectly with

RNA via RBPs to control RNA function. We therefore sought to systematically categorize RNA-associated proteins encoded in the human genome into RBPs and RAFs.

Such a classification can be made by comparing a set of proteins that efficiently UV crosslink to RNA to a set where RNA-associated proteins are defined without a requirement for UV crosslinking. For a set of proteins that efficiently UV-crosslink to

RNA, we identified 830 proteins that are present in at least two of the seven poly(A)

RNA interactome capture (RIC) datasets compiled by (Hentze, Castello, Schwarzl, &

Preiss, 2018). For a set of proteins defined as RNA-associated proteins independent of

their UV-crosslinking ability, we combined two datasets reported by (Brannan, et al., 2016). The first subset comprises 1786 annotated human RBPs that were used as a training set for predicting RBPs via SONAR, a computational approach that analyzes large-scale affinity purification-mass spectrometry protein-protein interactomes of annotated RBPs to predict RNA binding activity. The second subset consisted of 1923 proteins that were predicted as RBPs by SONAR based on RBP classification score >0.79. These two subsets were combined to obtain a set of 2784 unique proteins that we refer to as the human “annotated and predicted RNA-associated proteins.” The UV-crosslinkable poly(A) RIC RBP set shows almost complete (~94 %) overlap with the annotated and predicted RNA-associated proteins (Figure 3.1A). However, only ~30 % of proteins among the annotated and predicted RNA-associated proteins represent the poly(A) UV-crosslinkable proteins. The remaining two-thirds of the RNA-associated proteins likely include proteins that do not efficiently UV-crosslink to RNA. Consistently, RBM8A and

MAGOH are among this group. Further, the alternate EJC factor RNPS1 is also among the non-UV-crosslinkable RBPs. In contrast, EIF4A3, the RNA anchor of the EJC, and

CASC3, the alternate EJC factor that directly contacts RNA, are detected in three out of seven RIC datasets. These observations validate our classification approach.


Figure 3.1 Overlap of Predicted RBPs and UV-Crosslinkable RBPs

A. Venn-diagram showing the overlap between RBPs defined based on UV-crosslinking-dependent RNA interactome capture (RIC, from (Hentze, Castello, Schwarzl, & Preiss, 2018)) and those defined by protein:protein interaction network of annotated RBPs (Annotated and predicted RNA-associated proteins, from (Brannan, et al., 2016)).

B. Venn-diagram as in A showing overlap between integrated human RBPome defined based on UV-crosslinkability (ihRBPome, from (Trendel, et al., 2019)) and Annotated and predicted RNA-associated proteins (red, from (Brannan, et al., 2016)) based on protein:protein interaction network of annotated RBPs.

Analysis by Guramrit Singh


Among the annotated and predicted RNA-associated proteins that do not overlap with RIC RBPs, it is likely that a subset may interact with non-poly(A) RNA and are thus not represented in the RIC dataset. To test this idea, we compared the human annotated and predicted RNA-associated proteins to the integrated human RBPome (ihRBPome) defined by (Trendel, et al., 2019) based on proteins that UV-crosslink to all cellular RNA.

In this comparison, the fraction of UV-crosslinkable RBPs among the annotated and predicted RNA-associated proteins indeed goes up to ~40 % (Figure 3.1B). Still, a large group of the annotated and predicted RNA-associated proteins remains undetected among the UV-crosslinkable proteins, and this group is likely to contain RAFs. MAGOH and RNPS1 are again within this RAF category. Unexpectedly, RBM8A is detected in the ihRBPome along with EIF4A3 and CASC3, suggesting that it can bind RNA directly, perhaps in an EJC-independent fashion. Overall, this analysis suggests that a sizable fraction of the RNA-associated proteins encoded in the human genome are RAFs. Notably, the ihRBPome contains 638 proteins that are absent from the annotated and predicted RNA-associated proteins set (Figure 3.1B), suggesting that no one approach is sufficient to define all

RNA-associated proteins encoded in a genome.

The comparison of the annotated and predicted RNA-associated proteins to the ihRBPome provides a refined list of RAFs. A search for functionally related groups of genes among the RAFs revealed that these proteins function in diverse biological processes involving coding as well as non-coding RNAs (Figure 3.2A). They are constituents of discrete RNP complexes (e.g. the spliceosome) and of the various nuclear and cytoplasmic phase-separated membraneless RNP organelles (e.g. nuclear speckles,


Cajal body, chromatoid body, processing bodies) (Figure 3.2B). RAFs within these processes and cellular compartments function as RNA modifying and degrading enzymes

(Figure 3.2C). In the case of mRNA metabolism, RAFs are key factors within all major nuclear mRNA processing steps (capping, pre-mRNA splicing, cleavage and polyadenylation), mRNA export into the cytoplasm as well as translation and mRNA degradation in the cytoplasm (some examples are listed in Figure 3.3). Among other notable RAFs are three of the four protein subunits of the polycomb repressive complex

(EED, EZH1 and EZH2) whose chromatin modification function is guided by binding to several long non-coding RNAs (Margueron & Reinberg, 2011). Conspicuously underrepresented are ribosomal proteins suggesting that most protein subunits of this

RNP machine are RBPs. These observations indicate that the RAFs function within all major steps of RNA metabolism.


Figure 3.2 RAF Functions

(A) Top sixteen gene ontology terms (biological process) enriched among RAFs. Legend on the right indicates the number of genes in each term. Some redundant GO terms were removed.

(B) GO terms (cellular compartment) enriched among RAFs as in (A).

(C) GO terms (molecular function) enriched among RAFs as in (A).

Analysis by Guramrit Singh


Figure 3.3 RAFs Involved in the mRNA Lifecycle

Major processes in the life cycle of eukaryotic messenger RNAs (left) along with some examples of RAFs involved in each of the steps. Also shown are some examples of RAFs that are components of key membraneless RNP compartments or large RNP complexes (right).

Analysis by Guramrit Singh

3.4.2 Chemical crosslinking stabilizes RBPs and RAFs within RNPs

The above analysis suggests that CLIP-Seq based approaches are ill-suited to identify RNA cargoes and binding sites of RAFs. Instead, UV-crosslinking-independent

RNA immunoprecipitation (RIP) based approaches such as digestion-optimized RIP

(DO-RIP-Seq (Nicholson, Friedersdorf, & Keene, 2017)) or RIPiT-Seq (Singh, Ricci, &

Moore, 2014) may be more suitable for this purpose. However, these RIP-based strategies lack the covalent protein attachment to RNA, which is the main advantage of

CLIP. The absence of direct attachment between protein and RNA leads to two challenges in applying RIP to RAFs. First, due to the dynamic nature of molecular assemblies, RAFs may dissociate from their RNA:protein complexes and/or reassort during purification under native conditions (Mili & Steitz, 2004). Second, native conditions during RIP, as compared to stringent conditions used for CLIP, can lead to co-purification of non-specific interactors. The former issue is highlighted in our attempts to increase EJC purification stringency during immunoprecipitation (IP) of stably expressed

FLAG-tagged MAGOH from human embryonic kidney (HEK293) cells. As seen in

Figure 3.4A, RNPS1 (and its associated factors) dissociate from the EJC core as the ionic strength of the IP reaction is increased (compare lanes 4-6 to lane 3). Thus, to use RIP to faithfully capture binding sites of RAFs such as RNPS1 with high specificity, it will be important to stabilize their in vivo interactions prior to cell lysis. To achieve this, Singh et al. have systematically evaluated chemical crosslinking of RNPs using formaldehyde

(Singh, Ricci, & Moore, 2014).


Figure 3.4 Formaldehyde crosslinking stabilizes an RAF-EJC interaction.

(A) Western blots showing proteins on the right in the total extract (TE) or FLAG immunoprecipitation (IP) fractions of HEK293 Flp-In cells expressing FLAG-MAGOH. The presence or absence of RNase A and the amount of NaCl present in the extracts during the IP are indicated above each lane. The dotted line indicates where gel images were spliced together. Data are representative of two biological replicates.

(B) Western blots showing proteins (labeled on the right) in the insoluble pellet, soluble extract, and anti-FLAG immunoprecipitate fractions (indicated on the top) of HEK293 Flp-In cells. The lysis and IP conditions are also indicated on the top. Also indicated above each lane is the expression of FLAG-RNPS1 (+) or FLAG epitope only (-) in the cells, and the formaldehyde concentration used for in vivo crosslinking. Data are representative of three biological replicates.

Analysis by Manu Sanjeev

A wide range of formaldehyde concentrations (0.1% to 3%) has been used in previous studies to crosslink macromolecular complexes (Chu, et al., 2015; Fabre, et al., 2013; Ricci, et al., 2014). Excessive formaldehyde treatment can crosslink non-specific interactions and can also crosslink complexes to cellular structures thereby rendering them insoluble. Singh et al. found that a 10-minute treatment of HEK293 cells with 1% or more formaldehyde followed by sonication-mediated cell disruption under stringent lysis conditions (in the presence of 0.1 % sodium dodecyl sulfate and 0.1 % sodium deoxycholate) leads to extremely poor solubility of EJC proteins (data not shown).

Therefore, three formaldehyde concentrations were evaluated below this threshold

(0.03%, 0.1% and 0.3%) to identify an optimal balance between RNP crosslinking and solubility. Singh et al. found that crosslinking cells with even 0.3% formaldehyde traps a significant fraction of the tested proteins in the insoluble fraction, which pellets along with cellular debris at 15,000 x g (Figure 3.4B, lane 5). In comparison, cells treated with

0.1% formaldehyde show much better protein solubility (Figure 3.4B, lane 4), which is comparable to protein solubility under native lysis conditions (Figure 3.4B, lane 6). Singh et al. also find that crosslinking cells with 0.1% formaldehyde sufficiently stabilizes interaction of FLAG-RNPS1 with the EJC core such that EIF4A3 and MAGOH co-IP with FLAG-RNPS1 even under stringent conditions (Figure 3.4B, compare lanes 14 and

16). Thus, 0.1% formaldehyde sufficiently crosslinks FLAG-RNPS1 containing RNPs while maintaining their solubility.

3.4.3 Formaldehyde crosslinking enhances RIPiT-Seq signal over background

While formaldehyde has long been utilized for RNP crosslinking during RIP, the quantitative influence of formaldehyde crosslinking on the efficiency and specificity of

RIP signal has yet to be evaluated. We therefore decided to test the effect of formaldehyde crosslinking on enrichment of RBP and RAF binding sites via RIPiT-Seq.

This two-step purification approach is an extension of RIP and involves sequential IP of a pair of proteins to enrich an RNP of a particular composition, whose RNA footprints can then be identified by high-throughput sequencing. Previously (see Chapter 2), we used

RIPiT-Seq to obtain footprints of EJCs containing CASC3 and RNPS1 to describe the

mutually exclusive nature of the two complexes (Mabin, et al., 2018). These CASC3 and

RNPS1 RIPiTs were performed from formaldehyde-crosslinked HEK293 cells under stringent conditions (xRIPiT-Seq) as well as from non-crosslinked cells under native conditions (nRIPiT-Seq), although in the previous work we exclusively focused on the xRIPiT-Seq data to study the two mutually exclusive complexes. We re-mapped these datasets to the human genome (see 3.3 Materials and Methods) to compare nRIPiT and xRIPiT side-by-side to assess the effect of formaldehyde crosslinking on enrichment of

CASC3 and RNPS1 footprints. As previously reported, RIPiT-Seq yields highly reproducible EJC binding across protein coding genes (Figure 3.5). Therefore, for all further analyses, we combined the biological replicates (where available, i.e. all xRIPiT-

Seq experiments) for each unique RIPiT-Seq experiment.


Figure 3.5 Replicate Comparisons

(A) Scatterplots showing comparison between CASC3 nRIPiT and xRIPiT replicates (leftmost plot), and between replicates of CASC3 xRIPiTs (other three plots). The specific RIPiT being compared is indicated on top of each plot. For comparisons between xRIPiTs from -CHX and +CHX conditions, individual replicates for each RIPiT were combined. Correlation coefficients are in the top left corner.

(B) CASC3 CLIP replicate comparisons. Correlation coefficients are in the top left corner.

(C) Scatterplots as in (A) showing comparison between RNPS1 nRIPiT and xRIPiT, or between replicates of RNPS1 xRIPiTs.

(D) RNPS1 CLIP replicate comparisons as in (B).

First, we compared the ability of nRIPiT and xRIPiT to reveal the major binding site of CASC3, which was purified by sequential IP of FLAG-CASC3 and EIF4A3 under

the two conditions. A meta-exon plot of CASC3 footprint densities at exon ends shows that while both approaches yield highest read densities ~24 nt upstream from the exon junction, xRIPiT shows a much more striking enrichment of reads at this position (Figure

3.6A). Next, we quantified the ability of the two RIPiT formats to obtain CASC3-EJC footprints on two different “regions” of spliced protein-coding transcripts: canonical EJC sites (cEJC; -39 to -9 nt from exon ends) and non-canonical regions (ncEJC; exon start to

-50 nt from exon ends). To estimate background binding of the EJC proteins to RNA, we also quantified CASC3 footprints on intronless protein-coding transcripts as the EJC is not expected to bind to intronless transcripts. To compare CASC3 footprint densities

(RIPiT-Seq reads per kilobase per million or RPKM_RIPiT-Seq) across the entire gene expression range, we divided spliced transcripts or intronless transcripts into twenty bins such that each bin contained transcripts that fall within a two-fold expression range based on RNA-Seq. Figure 3.6B shows that CASC3 footprint detection by either RIPiT-Seq format is directly proportional to RNA expression levels. Further, as expected, at all expression levels, each RIPiT format shows highest signal on canonical sites, followed by non-canonical regions and lowest signal on intronless transcripts. Importantly, across the entire gene expression range, xRIPiT shows higher signal as compared to nRIPiT at canonical sites except the last expression bin where detection by the two RIPiT formats is comparable. In the case of non-canonical regions, differences in CASC3 detection via xRIPiT and nRIPiT are much smaller, and insignificant in the higher expression bins. We conclude that formaldehyde crosslinking boosts CASC3 footprint enrichment via RIPiT, possibly by freezing dynamic RNPs and preventing their dissociation after lysis.


Figure 3.6 Formaldehyde Crosslinking Enhances RIPiT-Seq Signal for CASC3 (1)

(A) A meta-exon plot showing nRIPiT and xRIPiT read densities in the 150 nt window from the end of exons of protein-coding genes (excluding final exons).

(B) Comparison of gene-level CASC3 read density (RPKM_RIPiT-Seq) in native RIPiT (nRIPiT, squares) and formaldehyde-crosslinked RIPiT (xRIPiT; circles) for canonical (darker-shaded shapes) and non-canonical regions (lighter-shaded shapes), and for intronless genes (empty shapes). Along the x-axis, genes are binned into twenty bins where each bin contains exons from genes within a 2-fold expression level range based on RNA-Seq. Error bars represent the standard error of the mean signal in each bin.

To compare CASC3 binding site enrichment via native and crosslinked RIPiTs, we obtained a single summary statistic that represents overall CASC3 binding in each of the canonical regions, non-canonical regions and intronless transcripts. Assuming a direct relationship between RIPiT enrichment (RIPiT-Seq) and RNA expression level (RNA-

Seq), we fitted a linear trendline with slope equal to one to each binned-data distribution in Figure 3.6B (as demonstrated in Figure 3.7). The y-intercept of each fit in the log space corresponds to its slope in the untransformed space. That is, the y-intercept of each trendline reflects in a relative sense how much a given RNA region is detected in CASC3 footprints relative to RNA-Seq. To normalize the signal within native and crosslinked

RIPiTs to CASC3 binding to intronless transcripts, the average signal for intronless transcripts in the two RIPiT conditions was set to one, and all intercepts were shifted accordingly. The fit coefficients thus computed when compared within the nRIPiT-Seq dataset show that, as compared to intronless transcripts, CASC3 footprints are enriched

~5-fold in canonical and ~2.5-fold in non-canonical regions. CASC3 enrichment is much more pronounced within xRIPiT-Seq datasets, which show 14-fold and ~3.6-fold enrichment of CASC3 footprints in canonical and non-canonical regions, respectively

(Figure 3.8A). Thus, in comparison to uncrosslinked nRIPiT, formaldehyde crosslinking during xRIPiT leads to a nearly three-fold enhancement of CASC3 footprint signal at cEJC sites (Figure 3.8A). This overall increase in CASC3 enrichment via xRIPiT over nRIPiT is likely due to greater detection at individual binding sites. To examine this idea, we limited our analysis to highly expressed genes (top 20% expression corresponding to

>5.9 RPKM). We find that a greater fraction of individual canonical regions (Figure


3.9A) as well as genes (Figure 3.9B) show higher CASC3-EJC read densities in xRIPiT as compared to nRIPiT, as shown by the rightward shift of the scatter plots.

Figure 3.7 Linear Fits and their Intercepts

Linear fit lines for each of the six classes where CASC3 binding is estimated (Figure 3.6). Fit lines for xRIPiT-Seq (dotted lines) and nRIPiT-Seq (solid lines) are color-coded to match the corresponding class.


Figure 3.8 Formaldehyde Crosslinking Enhances RIPiT-Seq Signal for CASC3 (2)

(A) A comparison of linear fit coefficients (or intercepts, in log space) of the six classes in Figure 3.6B. Classes are labeled on the bottom. The coefficient for the average of the two intronless classes was set to 1 and all intercepts were adjusted accordingly. The fold-change as compared to the average of the two intronless classes is shown above each bar.

(B) Percentage of all canonical EJC regions where read count is ≥2-fold as compared to read counts on intronless genes of similar expression level.


Figure 3.9 CASC3-EJC Densities in Highly Expressed Genes – nRIPiT versus xRIPiT

(A) Scatterplot of CASC3 nRIPiT versus xRIPiT read densities (RPKM) at individual canonical sites normalized to intronless genes. Heatmap colors indicate plot density (red = most dense, blue = least dense). The diagonal represents canonical sites where nRIPiT and xRIPiT yield equal RPKM counts.

(B) Scatterplot as in (A) comparing gene level normalized RPKM (sum of RPKM for all canonical sites) between CASC3 nRIPiT and xRIPiT.

We predict that enhanced CASC3 RNA binding site enrichment observed in xRIPiT over nRIPiT also leads to significant enrichment of CASC3 footprints on a greater number of cEJC sites. To test this idea, we identified the canonical regions on which CASC3 footprint signal is ≥2-fold as compared to the signal on intronless transcripts at similar expression level. As seen in Figure 3.8B, we observe an increase in significantly enriched cEJC sites from 38 % in nRIPiT to 82 % in xRIPiT. The increase in cEJC sites significantly bound by CASC3 in xRIPiT over nRIPiT is observed over almost the entire expression range sampled (Figure 3.10), suggesting that xRIPiT enhances CASC3 detection irrespective of gene expression level. Therefore, we conclude

that formaldehyde crosslinking of cells before performing RIPiT-Seq enhances the capture of in situ CASC3-EJCs to yield higher signal over background at canonical and non-canonical sites. Such an effect consequently also leads to more robust detection of

CASC3 at a greater proportion of the expected binding sites.

Figure 3.10 CASC3-EJC Canonical Site Detection by Expression

Bar-plots showing percent of canonical regions where CASC3 RIPiT or CLIP read counts are ≥2-fold over counts in intronless genes in the same expression range bin (921-922 canonical sites/bin).

In Chapter 2 we showed that translation inhibition impacts CASC3-EJC occupancy (Mabin, et al., 2018). We argued that translation inhibition, when combined with formaldehyde crosslinking, will lead to further preservation of CASC3-containing

EJCs on their in vivo binding sites. To test this prediction, we compared CASC3 enrichment via xRIPiT-Seq with and without pre-treatment with translation elongation inhibitor, cycloheximide (CHX). As expected, CHX treatment leads to ~2.7-fold increase in CASC3 detection on cEJC sites (Figure 3.11A, B). This CHX-dependent enhanced


CASC3 enrichment leads to detection of significant CASC3 footprints at 90 % of all possible cEJC binding sites as compared to detection on 82 % sites under normal translation conditions (Figure 3.11C). These data show that translation impacts CASC3-

EJC occupancy, and further highlight the quantitative ability of our approach to compare protein binding site enrichment.


Figure 3.11 CHX Treatment Enhances CASC3 Signal

(A) Comparison of gene-level CASC3 read density (RPKM_RIPiT-Seq) in cycloheximide (CHX) treated xRIPiT (+CHX, diamonds) and untreated xRIPiT (circles) for canonical (darker-shaded shapes) and non-canonical regions (lighter-shaded shapes), and for intronless genes (empty shapes).

(B) Comparison of the linear fit coefficients (or intercepts, in log space) of the six classes in (A). Classes are labeled on the bottom.

(C) Percent of all canonical EJC regions where read depth is ≥2-fold as compared to intronless gene read counts in the indicated datasets.


3.4.4 xRIPiT-Seq is comparable to CLIP-Seq in revealing binding sites of CASC3

CASC3 is an RBP that can be photo-crosslinked to RNA. To compare the ability of chemical and photo-crosslinking methods to enrich CASC3 binding sites, we compared CASC3 xRIPiT-Seq with the CASC3 CLIP-Seq dataset of (Hauer, et al., 2016). This CASC3 binding profile was obtained using the individual nucleotide resolution CLIP-Seq variation (Huppertz, et al., 2014), which will be referred to simply as CLIP-Seq in this work. The raw CLIP-Seq and the corresponding RNA-Seq datasets were aligned to the human reference genome using the same parameters as the xRIPiT-Seq datasets. The CLIP-Seq data are from HeLa cells and the xRIPiT-Seq data from HEK293 cells. To minimize the effect of gene expression differences between the two cell lines on our analysis, we limited the analysis to the subset of protein-coding genes whose expression levels are within 1.5-fold (based on RNA-Seq RPKM) in the two cell lines (Figure 3.12). Importantly, both datasets compared were obtained from cycloheximide-treated cells. Similar to the findings in Figures 3.6-3.8, both CLIP-Seq and xRIPiT-Seq for CASC3 strongly enrich canonical EJC binding sites (Figure 3.13A). Further, across the gene expression bins, both approaches detect the highest CASC3 binding at canonical sites followed by non-canonical sites and then by intronless transcripts (Figure 3.13B). Some deviation from this trend is observed in the four highest expression bins, where both CLIP and xRIPiT signals are more variable in the three regions. Still, even in these bins, the highest CASC3 signal is detected at the canonical sites. When xRIPiT-Seq is compared to CLIP-Seq, both methods detect similar CASC3 binding throughout the gene expression range (Figure 3.13B). The few exceptions are in the medium expression range, where xRIPiT-Seq signal is significantly higher than CLIP-Seq signal in both canonical and non-canonical regions. When intercepts from the linear fits for the six classes are compared, the two methods show robust detection of CASC3 binding at canonical regions (30-fold enrichment in CLIP-Seq and 56-fold enrichment in xRIPiT-Seq over intronless mRNA signal; Figure 3.14A). Lower but appreciable CASC3 binding is detected in the non-canonical regions. Notably, xRIPiT shows nearly two-fold higher signal as compared to CLIP-Seq in both canonical and non-canonical regions. This increased detection of CASC3 binding by xRIPiT over CLIP is also evident at a larger fraction of individual canonical sites (Figure 3.15A) and genes (Figure 3.15B). With both xRIPiT and CLIP, CASC3 signal is enriched ≥2-fold over background detection on intronless transcripts at most canonical sites, with a slightly larger number of sites enriched by xRIPiT (84 % of all cEJC sites are enriched in CLIP as compared to 90 % with xRIPiT; Figure 3.14B; also see Figure 3.10). Finally, most of the sites enriched by xRIPiT and CLIP are shared between the two approaches, and also with nRIPiT (Figure 3.14C). Overall, we conclude that xRIPiT-Seq is comparable to, or even slightly more efficient than, CLIP-Seq at uncovering binding sites of an RBP such as CASC3.
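The expression-matching step used to compare HEK293 and HeLa datasets (Figure 3.12) can be illustrated as below. This is a sketch under stated assumptions rather than the code used here: the function name, the gene identifiers, the RPKM values, and the decision to skip genes with zero RPKM in either cell line are hypothetical choices made for the example.

```python
def expression_matched_genes(rpkm_a, rpkm_b, max_fold=1.5):
    """Return genes whose RPKM differs by at most max_fold between two cell lines.

    rpkm_a, rpkm_b: dicts mapping gene ID -> RNA-Seq RPKM.
    Genes missing from either dict, or with zero RPKM in either, are skipped
    (an assumption of this sketch; the fold change is undefined at zero).
    """
    matched = []
    for gene in sorted(rpkm_a.keys() & rpkm_b.keys()):
        a, b = rpkm_a[gene], rpkm_b[gene]
        if a <= 0 or b <= 0:
            continue
        # Symmetric fold change: larger value over smaller value.
        if max(a, b) / min(a, b) <= max_fold:
            matched.append(gene)
    return matched

# Hypothetical toy values, for illustration only.
hek293 = {"GENE_A": 10.0, "GENE_B": 4.0, "GENE_C": 0.0, "GENE_D": 30.0}
hela = {"GENE_A": 14.0, "GENE_B": 8.0, "GENE_C": 5.0, "GENE_D": 21.0}

print(expression_matched_genes(hek293, hela))  # ['GENE_A', 'GENE_D']
```

Using the symmetric ratio treats a gene that is 1.5-fold higher in HeLa the same as one that is 1.5-fold higher in HEK293, which matches the "within 1.5-fold" wording.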


Figure 3.12 Gene Expression Comparison Between HEK293 and HeLa Cells

A scatter plot comparing RPKM values in HEK293 cells (x-axis) and HeLa cells (y-axis) for APPRIS principal 1 isoforms representing each GRCh38 human gene. The points representing genes that fall within 1.5-fold change in the two datasets are in red. These were used in the analyses comparing CLIP-Seq to xRIPiT-Seq.


Figure 3.13 xRIPiT-Seq and CLIP-Seq are Robust and Comparable Approaches to Identify CASC3 Binding Sites (1)

(A) A meta-exon plot showing xRIPiT and CLIP read counts in the 150 nt window at exon ends. Read normalization was carried out as in Figure 3.6A.

(B) Comparison of gene-level CASC3 read density (IP-Seq RPKM) in xRIPiT (diamonds) and CLIP (triangles) for canonical (darker-shaded shapes) and non-canonical regions (lighter-shaded shapes), and for intronless genes (empty shapes). Gene binning and error bars are as in Figure 3.6B.

Figure 3.14 xRIPiT-Seq and CLIP-Seq are Robust and Comparable Approaches to Identify CASC3 Binding Sites (2)

(A) Comparison of the linear fit coefficients (or intercepts, in log space) of the six classes in Figure 3.13B. Classes are labeled on the bottom.

(B) Percentage of all canonical EJC regions where read depth is ≥2-fold as compared to intronless gene read counts in the indicated datasets.

(C) Venn diagram showing counts of canonical regions from the top 20 % expressed genes where CASC3 footprint read depth in nRIPiT, xRIPiT and CLIP is ≥2-fold as compared to intronless gene read counts.


Figure 3.15 CASC3-EJC Densities in Highly Expressed Genes – CLIP versus xRIPiT

(A) Scatterplot as in Figure 3.9A comparing normalized RPKM at each canonical site detected in CASC3 xRIPiT and CLIP.

(B) Scatterplot as in Figure 3.9B comparing gene level normalized RPKM between CASC3 xRIPiT and CLIP.

3.4.5 xRIPiT-Seq is superior to nRIPiT-Seq and CLIP-Seq for identifying RNPS1-EJC binding sites

We next compared the three different approaches (nRIPiT-Seq, xRIPiT-Seq and CLIP-Seq) for their ability to enrich binding sites of RNPS1, an RAF within the EJC. A direct comparison of nRIPiT and xRIPiT shows that, as in the case of CASC3, formaldehyde crosslinking dramatically improves identification of the major RNPS1 binding site, which corresponds to the canonical EJC position (Figure 3.16A). Further, at low and medium expression levels xRIPiT yields better enrichment of RNPS1 binding sites than nRIPiT in both canonical and non-canonical regions (Figures 3.16B, 3.17A), although among the five highest expression bins xRIPiT efficacy drops to the level of nRIPiT (Figure 3.16B). These trends are observed at both canonical and non-canonical positions. Still, xRIPiT leads to increased detection of RNPS1 over nRIPiT across individual canonical sites (Figure 3.18A) and also at the individual gene level (Figure 3.18B). Furthermore, within the top 20 % most expressed genes, xRIPiT also boosts RNPS1 detection at a greater percentage of canonical positions (92 %) as compared to nRIPiT (47 %) (Figure 3.17B). In this group of genes, xRIPiT detects increased RNPS1 binding sites in each of the ten expression bins (Figure 3.19). Thus, xRIPiT leads to an overall enhanced capture of RNPS1-EJC bound to its target sites.
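A meta-exon profile of the kind shown in Figure 3.16A aggregates reads by their distance from exon 3' ends. The sketch below shows the counting step only; it is not the pipeline used in this work. The function name, the coordinate convention (transcript coordinates, one representative position per read), and the toy data are assumptions, and no library-size or expression normalization is applied.

```python
from collections import Counter

def meta_exon_profile(exon_ends, read_positions, window=150):
    """Count reads at each distance upstream of exon 3' ends.

    exon_ends: dict mapping transcript ID -> list of exon 3' end coordinates
        (transcript coordinates, so strand is already accounted for).
    read_positions: dict mapping transcript ID -> list of read positions
        (one representative coordinate per read; which coordinate to use
        is a choice this sketch leaves open).
    Returns a list where index d holds the number of reads lying d nt
    upstream of an exon end, for d in 0..window-1.
    """
    counts = Counter()
    for tx, ends in exon_ends.items():
        for pos in read_positions.get(tx, []):
            for end in ends:
                d = end - pos
                if 0 <= d < window:
                    counts[d] += 1
    return [counts[d] for d in range(window)]

# Hypothetical toy data: most reads sit 24 nt upstream of an exon end.
exon_ends = {"tx1": [200, 500]}
reads = {"tx1": [176, 176, 476, 150, 490]}

profile = meta_exon_profile(exon_ends, reads)
print(profile.index(max(profile)), max(profile))  # 24 3
```

For real data the nested loop over exon ends would be replaced by a sorted lookup, since each read only needs its nearest downstream exon end.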


Figure 3.16 xRIPiT-Seq Outperforms nRIPiT-Seq to Detect RNPS1 Binding Sites (1)

(A) A meta-exon plot showing RNPS1 nRIPiT and xRIPiT footprint read counts at each position in the 150 nt window from exon ends.

(B) Comparison of gene-level RNPS1 read density (RPKM) in nRIPiT (squares) and xRIPiT (circles) for canonical (darker-shaded shapes) and non-canonical regions (lighter-shaded shapes), and for intronless genes (empty shapes). Gene bins along x-axis and error bars are as in Figure 3.6B.

Figure 3.17 xRIPiT-Seq Outperforms nRIPiT-Seq to Detect RNPS1 Binding Sites (2)

(A) Comparison of linear fit coefficients (or intercepts, in log space) of the six classes in Figure 3.16B, which are labeled on the bottom.

(B) Percentage of all canonical EJC regions where read depth is ≥2-fold as compared to intronless read counts in the indicated datasets.


Figure 3.18 RNPS1-EJC Densities in Highly Expressed Genes – nRIPiT versus xRIPiT

(A) Scatterplot of RNPS1 nRIPiT versus xRIPiT read densities (RPKM) at individual canonical sites normalized to intronless genes. Heatmap colors indicate plot density (red = most dense, blue = least dense). The diagonal represents canonical sites where nRIPiT and xRIPiT yield equal RPKM counts.

(B) Scatterplot as in (A) comparing gene level normalized RPKM (sum of RPKM for all canonical sites) between RNPS1 nRIPiT and xRIPiT.


Figure 3.19 RNPS1-EJC Canonical Site Detection by Expression

Bar-plots showing percent of canonical regions where RNPS1 RIPiT or CLIP read counts are ≥2-fold over counts in intronless genes in the same expression range bin (921-922 canonical sites/bin).

We next compared the detection of RNPS1 binding on RNA by CLIP and xRIPiT. The RNPS1 CLIP-Seq data from (Hauer, et al., 2016) were mapped and compared to xRIPiT-Seq data as in the case of CASC3 above. Importantly, xRIPiT reveals a clear preference for RNPS1 binding to canonical EJC sites near exon 3' ends whereas RNPS1 CLIP-Seq reads show only a modest enrichment at this site (Figure 3.20A). RNPS1 xRIPiT signal in canonical and non-canonical regions on spliced transcripts is consistently, and in many expression bins significantly, higher than RNPS1 CLIP signal (Figure 3.20B). Further, in xRIPiT, the RNPS1 binding observed on canonical and non-canonical regions is significantly higher than its binding to intronless mRNAs, with the exception of the top four bins (Figure 3.20B). In comparison, RNPS1 CLIP-Seq signal on canonical and non-canonical regions of spliced transcripts is indistinguishable from signal on intronless mRNAs, except for some of the medium expression bins (Figure 3.20B). Comparisons of coefficients of linear fits show that detection of RNPS1 binding is much higher in xRIPiT as compared to CLIP at both canonical sites (~7-fold higher) and non-canonical positions (~4.5-fold higher, Figure 3.21A). The superiority of xRIPiT over CLIP for RNPS1 binding detection is evident at individual canonical sites (Figure 3.22A) and at individual genes (Figure 3.22B). Consequently, xRIPiT detects RNPS1 binding at more than ten times the number of canonical EJC positions as compared to CLIP (cEJC sites detected among the top 20 % of all expressed genes: xRIPiT, 82 %; CLIP, 7 %; Figure 3.21B). It is noteworthy that even nRIPiT detects RNPS1 binding on a greater number of canonical positions when compared to CLIP (Figure 3.21C, also compare Figures 3.17B and 3.21B). As expected based on our previous work, and unlike CASC3 occupancy, RNPS1 occupancy does not increase when formaldehyde crosslinking is combined with cycloheximide treatment (Figures 3.23A-C). In fact, RNPS1 binding slightly decreases at both canonical and non-canonical sites upon translation inhibition as compared to normal conditions. Overall, we conclude that RIPiT in general, and xRIPiT in particular, is a much more suitable and sensitive approach than CLIP to identify RNAs and specific sites bound by RAFs such as RNPS1.
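The fold differences quoted from linear fit coefficients can be understood with a small worked example. How the fits were performed is not specified in this section, so the sketch below assumes a proportional model, IP density = c × expression, which becomes a line of slope 1 with intercept log10(c) in log space. The function names and toy values are hypothetical.

```python
import math

def log_space_intercept(expression, ip_density):
    """Least-squares intercept of log10(ip) = log10(expr) + b, slope fixed at 1.

    With the slope fixed at 1, the least-squares estimate of b is the mean of
    log10(ip) - log10(expr); 10**b is the coefficient c in ip = c * expr.
    """
    diffs = [math.log10(y) - math.log10(x) for x, y in zip(expression, ip_density)]
    return sum(diffs) / len(diffs)

def fold_difference(intercept_a, intercept_b):
    """Ratio of linear-space coefficients for two classes, c_a / c_b."""
    return 10 ** (intercept_a - intercept_b)

# Hypothetical toy values: expression per bin and IP density per class.
expr = [1.0, 10.0, 100.0]
canonical = [8.0, 80.0, 800.0]    # coefficient c = 8
intronless = [2.0, 20.0, 200.0]   # coefficient c = 2

b_can = log_space_intercept(expr, canonical)
b_int = log_space_intercept(expr, intronless)
print(round(fold_difference(b_can, b_int), 6))  # 4.0
```

Under this assumption, a difference between two intercepts in log space corresponds directly to a fold enrichment in linear space, which is how a statement such as "~7-fold higher" can be read off a plot of intercepts.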


Figure 3.20 xRIPiT-Seq Outperforms CLIP-Seq to Detect RNPS1 Binding Sites (1)

(A) A meta-exon plot showing RNPS1 xRIPiT and CLIP footprint read counts at each position in the 150 nt window from exon ends.

(B) Comparison of normalized read density (RPKM) in xRIPiT (diamonds) and CLIP (triangles) for canonical (darker-shaded shapes) and non-canonical regions (lighter-shaded shapes), and for intronless genes (empty shapes). The bins of genes along the x-axis and the error bars are as in Figure 3.6B.


Figure 3.21 xRIPiT-Seq Outperforms CLIP-Seq to Detect RNPS1 Binding Sites (2)

(A) Comparison of linear fit coefficients (or intercepts, in log space) of the six datasets in Figure 3.20B.

(B) Percentage of all canonical EJC regions where read depth is ≥2-fold as compared to intronless gene read counts.

(C) Venn diagram showing counts of canonical regions from the top 20 % expressed genes where RNPS1 footprint read depth in nRIPiT, xRIPiT and CLIP is ≥2-fold as compared to intronless gene read counts.


Figure 3.22 RNPS1-EJC Densities in Highly Expressed Genes – CLIP versus xRIPiT

(A) Scatterplot as in Figure 3.9A comparing normalized RPKM at each canonical site detected in RNPS1 xRIPiT and CLIP.

(B) Scatterplot as in Figure 3.9B comparing gene level normalized RPKM between RNPS1 xRIPiT and CLIP.


Figure 3.23 CHX Treatment Has Little Impact on RNPS1 Signal

(A) Comparison of gene-level RNPS1 read density (RIPiT-Seq RPKM) in cycloheximide (CHX) treated xRIPiT (+CHX, diamonds) and untreated xRIPiT (circles) for canonical (darker-shaded shapes) and non-canonical regions (lighter-shaded shapes), and for intronless genes (empty shapes).

(B) Comparison of the linear fit coefficients (or intercepts, in log space) of the six classes in (A). Classes are labeled on the bottom.

(C) Percent of all canonical EJC regions where read depth is ≥2-fold as compared to intronless gene read counts in the indicated datasets.


3.4.6 xRIPiT-Seq is more efficient than CLIP-Seq to detect increased RNPS1 occupancy on exons preceding recursively spliced exons

The end goal of approaches that map RBP/RAF binding sites is to obtain insights into the functions of these proteins. We wanted to determine if increased detection of RNPS1 binding to EJC sites via xRIPiT-Seq can shed light on the biological roles of RNPS1 and the EJC. Blazquez et al. (2018) recently showed that the presence of an EJC on an exon-exon junction inhibits recursive splicing (RS) of the downstream exon when this exon begins with a sequence resembling a 5’-splice-site (Figure 3.24A). Although RNPS1 and EJC core proteins were shown to be critical for repression of 5’-splice-site usage at RS exon junctions, it remains unknown if such splicing-regulatory activity of RNPS1 is specifically dependent on its increased binding on exons preceding RS exons. To test this idea, we identified 1912 exons in the human transcriptome with RS scores higher than the threshold defined by Blazquez et al., and compared RNPS1 and CASC3 binding signal from xRIPiT-Seq and CLIP-Seq on these RS exons and up to three preceding exons (Figure 3.24A). As a control, we similarly compared RNPS1 and CASC3 binding around an equal number of exons with the lowest RS scores (non-RS exons). Regardless of the experimental method, we detected increased binding of CASC3 on the upstream exon of RS junctions as well as one exon further upstream as compared to similar exons upstream of non-RS junctions (Figure 3.24B). This increased CASC3 occupancy, which likely signifies increased EJC core deposition, is less prominent on the exon further upstream and on the RS exon itself. Consistent with the results in Figures 3.16-3.23, we detected greater RNPS1-EJC signal using xRIPiT-Seq as compared to CLIP-Seq on all exons regardless of their position relative to the RS exon (Figure 3.24C). Importantly, xRIPiT-Seq finds significantly higher RNPS1-EJC binding on all exons preceding RS junctions compared to those preceding non-RS junctions, with the most significantly increased RNPS1 binding observed on the exon upstream of the RS junction. In comparison, CLIP-Seq finds a smaller and less significant increase in RNPS1 binding on this exon when it precedes RS versus non-RS junctions, and no difference in RNPS1 binding on other exons. This increase in RNPS1 binding on exons upstream of RS junctions is also evident at individual genes containing RS exons where EJC deposition was previously shown by Blazquez et al. to suppress RS (Figure 3.25). Finally, we also observed that entire transcripts that contain high-scoring RS junctions show increased RNPS1 and CASC3 binding as compared to transcripts that contain only low-scoring RS junctions (Figure 3.26). Again, as compared to CLIP-Seq, xRIPiT-Seq detected much higher and more significantly increased RNPS1 binding to transcripts containing RS junctions, whereas both CLIP-Seq and xRIPiT-Seq performed similarly in the case of CASC3. These results further highlight xRIPiT-Seq’s ability to uncover biologically relevant features of RAF binding as compared to CLIP-Seq. We also conclude that EJC and RNPS1 deposition on upstream junctions may play a role in suppressing downstream recursive exon splicing, possibly by stabilizing EJC and/or RNPS1 binding on downstream RS junctions (Figure 3.24A).
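The comparisons of read densities on exons near RS versus non-RS junctions rely on the Wilcoxon rank-sum test (Figures 3.24 and 3.26). The following is a minimal, self-contained sketch of that test using the normal approximation; it is not the implementation used for the figures, and the function name and toy read densities are hypothetical. It needs at least two observations in total.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Tied values receive the average of the ranks they span, and the variance
    includes the standard tie correction. Returns (z, p_value).
    """
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                        # size of this group of tied values
        mid_rank = (i + 1 + j) / 2.0     # average of ranks i+1 .. j
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return 0.0, 1.0                  # all values identical
    z = (rank_sum_x - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p value
    return z, p

# Hypothetical read densities on exons upstream of RS and non-RS junctions.
rs_exons = [11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]
non_rs_exons = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

z, p = rank_sum_test(rs_exons, non_rs_exons)
print(z > 0, p < 0.01)  # True True
```

Because the test uses only ranks, it makes no assumption about the distribution of read densities, which is convenient for data spanning several orders of magnitude.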


Figure 3.24 Comparison of xRIPiT-Seq and CLIP-Seq Signal for RNPS1 and CASC3 Occupancy on High- and Low-Scoring RS Exons and their Neighboring Exons

(A) A schematic of a recursively spliced (RS) exon and its neighboring exons. Empty rectangles: constitutive exons; shaded rectangle: RS exon; black line: intron; dotted lines: possible exon splicing patterns; shaded ovals: RNPS1-EJC; the RNPS1-EJC upstream of RS junction suppresses RS whereas the complex on one exon further upstream (shown on shaded background) stabilizes the downstream complex. Number below each exon represents the number of exons for which data is presented in panels (B) and (C).


(B) Box plots showing CASC3 xRIPiT-Seq and CLIP-Seq read densities on high-scoring versus low-scoring RS exons and their three preceding exons. Each set of four boxplots is arranged directly below the RS exon or one of the three preceding exons in (A) to which they correspond. Top: Wilcoxon rank-sum test p values.

(C) Box plots as in (B) showing RNPS1 xRIPiT-Seq and CLIP-Seq exonic read densities.

Figure 3.25 Example Gene Containing RS Exon and Showing xRIPiT Sensitivity

Integrated genome viewer tracks showing read coverage (normalized for library size) on the TMA16 RS exon and its three preceding exons.


Figure 3.26 Transcript Level RS Exon Analysis

Box plots showing RNPS1 and CASC3 xRIPiT-Seq and CLIP-Seq genic read densities on genes that contain a high-scoring RS exon (n=5001) and those containing only low-scoring RS exons (n=5001). Top: Wilcoxon rank-sum test p values.

3.5 Discussion

In the quest to define functions of RNA-associated proteins, their different modes of interactions with RNA necessitate orthogonal approaches to “freeze” RNA-associated proteins in action on cellular RNAs. Here we use the EJC as a test case to show that while UV-crosslinking works well to capture RNA interaction sites of proteins that bind RNA directly (CASC3, an RBP), chemical crosslinking using formaldehyde is a far superior option for identifying binding sites of proteins that act on RNA via other proteins (RNPS1, an RAF). Our results suggest that xRIPiT-Seq, or similar RIP-Seq approaches from formaldehyde-crosslinked cells, can be generally applicable to all RNA-associated proteins, in particular to RAFs, to interrogate their cellular sites of action (Figure 3.27).

Figure 3.27 A Schematic Summarizing Suitable Approaches for Identification of Binding Sites of RBPs versus RAFs

RNA (dark line) is shown bound by RBPs (lower ovals) and RAFs (upper ovals). Methods suitable for binding site enrichment of the two classes of RNA-associated proteins are on the right: while CLIP-Seq is best suited to RBPs which bind the RNA directly, xRIPiT-Seq can capture all types of RBPs and RAFs.

3.5.1 xRIPiT-Seq reveals the binding profile and functions of RNPS1, an RAF within the EJC

We have previously developed and applied RIPiT-Seq to investigate functions of EJC factors in post-transcriptional gene regulation. A key advantage of RIPiT is the two-step purification, which enables enrichment of an RNP containing a pair of proteins. The tandem purification strategy boosts signal specificity by reducing enrichment of non-specific RNA interactors. Combination of formaldehyde crosslinking with RIPiT-Seq, i.e. xRIPiT-Seq (Chapter 2), further enhances several-fold the specificity of capturing binding sites of RNA-associated proteins. Essentially, xRIPiT is an extension of well-tested formaldehyde-crosslinked RIP approaches described in the past (Niranjanakumari, Lasda, Brazas, & Garcia-Blanco, 2002) and is analogous to the fCLIP-Seq approach described recently (Kim & Kim, 2019), with the added advantage of tandem purification. Formaldehyde crosslinking, being agnostic to direct versus indirect binding of RNA-associated proteins to RNAs, makes xRIPiT-Seq more broadly applicable than CLIP-Seq approaches. Consistent with this, as compared to CLIP-Seq, xRIPiT-Seq yields robust signal for both CASC3 and RNPS1 binding to EJC sites.

The enhanced detection of RNA binding sites via xRIPiT-Seq is particularly striking in the case of RNPS1, providing new insights into its binding and functions within the EJC. The poor UV-crosslinking of RNPS1 to canonical EJC sites, together with its strong enrichment at these positions during xRIPiT, suggests that RNPS1 binds to the EJC mainly via protein:protein interactions. Such a view is consistent with the current understanding of RNPS1 interaction with the EJC core, which likely occurs via other peripheral EJC factors such as ACIN1 and/or PNN (Boehm, et al., 2018; Wang, Ballut, Barbosa, & Le Hir, 2018). Importantly, as compared to the CASC3-EJC binding sites, the RNPS1-EJC occupancy sites revealed by xRIPiT are enriched in degenerate sequences that resemble SR-protein binding motifs, e.g. GA-rich sequences (see k-mer analysis in Chapter 2). Notably, numerous SR and SR-like proteins co-purify with RNPS1-EJC but not with CASC3-EJC. Thus, while RNPS1 does not contact RNA on its own, xRIPiT-Seq faithfully captures RNA sites that are associated with RNPS1-EJCs. xRIPiT-Seq also reveals that RNPS1-EJC has an increased abundance upstream of exons that depend on RNPS1 for their splicing. This increased RNPS1 occupancy on neighboring exons could be important for its splicing-regulatory function on RS exons (Blazquez, et al., 2018). Possibly, akin to its function in promoting splicing of neighboring introns in Drosophila piwi transcripts (Malone, et al., 2014), RNPS1 binding on neighboring exon junctions can promote increased EJC deposition on RS exon junctions, thereby suppressing RS exon splicing.

3.5.2 Factors that impact UV-crosslinking ability of RNA-associated proteins

In addition to illuminating EJC functions, our work has broader implications for investigating RNA:protein interactions using chemical versus photo-crosslinking approaches. Therefore, it is important to consider factors that can influence UV- and formaldehyde crosslinking abilities of RNA-associated proteins.

Several factors can negatively impact a protein’s ability to efficiently crosslink to RNA with UV light, resulting in its classification as an RAF (Figure 3.1). The most obvious factor is the indirect RNA binding mode of proteins that act on RNA from a distance through RBPs. Such RAFs can function as regulators of RBPs, as in the case of RBM8A and MAGOH, which interact exclusively with the RNA-clamped form of EIF4A3 (Andersen, et al., 2006; Bono, Ebert, Lorentzen, & Conti, 2006). Other RAFs can serve as adapters to connect an RBP and its bound RNA to other cellular machineries (e.g. mRNA export factors NXT1, GLE1). RAFs can also serve as components of multi-subunit assemblies where they may take a structural or regulatory role (e.g. EIF3 subunits - 3F, 3I, 3K; nuclear exosome non-catalytic subunits - EXOSC5, 7, 8). For such proteins, which are physically away from RNA while in action, only chemical crosslinking methods can immobilize them in their native in vivo complexes.

The chemistry at the RNA:protein interaction interface also influences UV-crosslinking of proteins to RNA. Among the nucleobases in single-stranded DNA polymers, polypurine oligomers are much less reactive than polypyrimidine oligomers for protein photo-crosslinking (Hockensmith, Kubasek, Vorachek, & von Hippel, 1986). Among the amino acids, the most robust UV-induced crosslinking to uracil is observed for amino acids with aromatic side-chains that can engage in stacking interactions with nucleobases (F and Y), and amino acids with positively-charged side-chains that can form electrostatic interactions with the negatively charged phosphate backbone (K, R and H) (Smith, 1969). Confirming these biases, in UV crosslinked RBP:RNA complexes identified by XRNAX, uracil is the most frequently crosslinked base, and phenylalanine, lysine and glycine are the top three amino acids in the uracil-crosslinked peptides (Trendel, et al., 2019). EIF4A3 is an example of an RBP that inefficiently UV-crosslinks to RNA, possibly due to the chemical nature of the RNA:protein interface. EIF4A3 binds RNA by exclusively contacting the ribose-phosphate backbone and lacks specific interactions with bases (Andersen, et al., 2006; Bono, Ebert, Lorentzen, & Conti, 2006). We have previously shown in human cells that this protein UV-crosslinks to RNA very inefficiently as compared to the sequence-specific RBP HNRNPA1 (Singh, Ricci, & Moore, 2014). Recently, this poor in vivo UV-crosslinking ability of EIF4A3 was also reported in Drosophila adult animals (Obrdlik, Lin, Haberman, Ule, & Ephrussi, 2019). Instead, chemical crosslinking using dithiobis(succinimidyl propionate) was found to stabilize EIF4A3 within EJCs for identification of Drosophila EJC binding sites. Many other DEAD-box proteins bind RNA in a sequence-independent manner similar to EIF4A3, and some of these RBPs may lack readily UV-crosslinkable functional groups at RNA:RBP interfaces. Despite being “true” RBPs, such proteins are more likely to be classified as RAFs and can benefit from xRIPiT-Seq over CLIP-Seq.

Surprisingly, many proteins that interact with the RNA 7-methyl-guanosine cap are classified among RAFs, e.g. nuclear cap binding protein NCBP2, all three cytoplasmic cap binding proteins EIF4E1, EIF4E2 and EIF4E3, and decapping proteins DCP2, NUDT16 and DCPS. Notably, these proteins interact with the cap via stacking interactions between aromatic amino acid side chains and the methylated guanosine base of the cap (Marcotrigiano, Gingras, Sonenberg, & Burley, 1997). Despite such direct and specific contacts between these proteins and the RNA cap structure, they apparently UV-crosslink poorly to RNA in vivo. A previous report that chemical crosslinking is superior to UV crosslinking to detect EIF4E-cap interactions (Kahvejian, Svitkin, Sukarieh, M’Boutchou, & Sonenberg, 2005) further suggests that the arrangement of functional groups at the RBP:RNA interface of these proteins could be suboptimal for UV crosslinking. RAFs also include many RNA decay enzymes: 19 different RNases (e.g. RNASE1, RNASE2), cytoplasmic RNA exosome catalytic subunit DIS3L, polyuridylated RNA specific 3’-5’ exonuclease DIS3L2, and major cytoplasmic deadenylases PAN2 and PAN3. While these proteins likely act on RNA directly, their engagement with RNA in vivo is likely not amenable to efficient UV crosslinking, possibly due to their mode of interaction with RNA and also due to the transient nature of their interactions. The latter view is supported by the observation that a much stronger PAR-CLIP signal is obtained when nucleolytic activity of DIS3 is mutated (Szczepińska, et al., 2015), which possibly traps the protein on RNA for more efficient photo-crosslinking. Overall, the above examples highlight several conditions where UV crosslinking can prove ineffective, and chemical crosslinking with formaldehyde (or other crosslinkers) is a more viable approach to trap RBPs/RAFs in action.

3.5.3 Formaldehyde as an alternative to UV for in vivo RBP/RAF crosslinking

Formaldehyde is a bifunctional, electrophilic molecule that reacts with two nucleophilic groups in sequential steps to link the two groups via a methylene bridge (Hoffman, Frey, Smith, & Auble, 2015). In the first step, a nucleophilic group, such as an amino or imino group from a protein or nucleic acid, reacts with formaldehyde to form a Schiff’s base. In the second step, the Schiff’s base reacts with a second nucleophilic group resulting in a methylene bridge. If the two attacking nucleophilic groups are on two different molecules, their reaction with formaldehyde forms an inter-molecular crosslink. Due to the small size of formaldehyde, the two groups to be crosslinked must be no more than ~2 Å apart, thus making it suitable to study molecules that are in close proximity.

Formaldehyde-mediated protein:protein crosslinking stabilizes macromolecular complexes in vivo (reviewed in (Sutherland, Toews, & Kast, 2008)). For example, mild formaldehyde crosslinking stabilizes dynamic interactions between the core and regulatory particles of the proteasome, enabling purification of catalytically active proteasome (Fabre, et al., 2013). It is this protein:protein crosslinking ability of formaldehyde that makes it suitable for stabilizing RAFs such as RNPS1 within their functional RNPs, allowing enrichment of sites where RNPS1 is bound to the EJC core in vivo. Formaldehyde crosslinking can also stabilize RBPs such as CASC3 within RNPs. This can occur due to protein:protein crosslinks between an RBP and other proteins within an RNP, which can stabilize an RBP:RNA interaction. Consistent with this, it was shown that formaldehyde crosslinking but not UV crosslinking enhances capture of AGO:siRNA interactions (Au, Helliwell, & Wang, 2014). Similarly, formaldehyde crosslinking stabilized the double-stranded RBP DROSHA within pri-microRNPs to allow mapping of DROSHA cleavage sites with a CLIP-Seq workflow (Kim & Kim, 2019). As with UV light, the stabilizing effect of formaldehyde on RBPs can also result from its ability to form protein:nucleic acid crosslinks, which makes it a popular crosslinker of choice in chromatin IP studies (Collas, 2010; Hoffman, Frey, Smith, & Auble, 2015). Future studies should evaluate formaldehyde’s ability to immobilize protein:protein and protein:RNA interactions, and the relative contributions of these two types of linkages to RNP stabilization.

For formaldehyde crosslinking of in vivo RNPs, crosslinker concentration is an important consideration. We find that a relatively low formaldehyde concentration (0.1%) provides a good balance between protein crosslinking and solubility. Possibly, at higher concentrations, formaldehyde can crosslink non-specific interactions, trapping macromolecular complexes in their local cellular environments. Indeed, a previous study showed that formaldehyde concentrations of 0.2% or higher lead to inappropriate detection of cytoplasmic and mitochondrial proteins in the nuclear fraction (Fabre, et al., 2013). Interestingly, the xRIPiT signal shows a dip in the top quarter of expression bins while such a drop is not seen in nRIPiT samples. Possibly, this decrease in signal results from non-specific over-crosslinking of RNPs to cellular structures, which may particularly affect more abundant RNPs. The formaldehyde concentrations we employ here are an order of magnitude lower than those in ChIRP-MS approaches used to illuminate lncRNA proteomes (Chu, et al., 2015). Perhaps high formaldehyde concentrations can induce crosslinks within a larger fraction of an individual RNP to improve signal, as in the case of single RNP interactomes. However, low concentrations of formaldehyde such as those we employ here are likely to be more critical to preserve the relative abundance of mRNPs, which range over several orders of magnitude in expression.

While chemical crosslinking with formaldehyde provides a robust alternative to UV crosslinking to study in vivo RNA:protein interactions, unlike UV crosslinking, it is unable to discriminate between direct and indirect interactions. Also, unlike CLIP, so far formaldehyde crosslinking has not been exploited to identify the direct sites of contacts between an RNA and a protein. It is noteworthy that formaldehyde crosslinking has been used to map points of protein contact on DNA (Solomon & Varshavsky, 1985), and can be similarly applied to RNA:protein complexes as well. Nevertheless, our current and previous work shows that formaldehyde crosslinking combined with RIPiT-Seq is a powerful approach to isolate and analyze RNA footprints of multisubunit RNP assemblies, which are often heterogeneous in nature.


Chapter 4

Deciphering the Relationship between PYM and the EJC

4.1 Abstract

The EJC performs a multitude of functions through a large complement of peripheral proteins, including splicing factors, mRNA export proteins, translation factors, and NMD factors. One such protein is PYM (Partner of Y14 and MAGOH), which binds to the core components of the EJC. Current models suggest that PYM's principal function is to disassemble the EJC co-translationally, a step which allows for the recycling of EJC factors. Here we examine the role of PYM in translation and EJC-binding by conducting EJC core pull-downs. In these experiments, translation was inhibited using cycloheximide (CHX) treatment and PYM-EJC binding was inhibited via a mutation in MAGOH. We observe that while CHX treatment in the wild-type cells greatly impacts relative transcript-level EJC occupancy, loss of PYM-EJC binding is largely insignificant for EJC occupancy. These data suggest that PYM-EJC interaction is not necessary for EJC disassembly per se. We further find that inhibition of PYM-EJC binding decreases the relative ratio of observed EJC canonical (24 nt upstream of an exon-exon junction) to non-canonical (more than 50 nt upstream of an exon-exon junction) occupancy. We conclude that PYM is not required for EJC removal at canonical deposition sites but does prevent EJC deposition at non-canonical sites, which is confirmed by an increase in occupancy of PYM-interaction mutant EJCs on intronless transcripts relative to the wild-type EJCs. Our work thus sheds new light on the role of PYM in maintaining normal EJC binding and function.

While the EJC itself plays a crucial role in NMD, and the EJC composition affects the efficiency of this process (see Chapter 2), there is still much to be understood. As PYM is associated with EJC disassembly it is reasonable to hypothesize that PYM may be an antagonist to NMD, which depends on interactions with the EJC. PYM overexpression has indeed been shown to inhibit NMD (Gehring, Lamprinaki, Hentze, & Kulozik, 2009); however, the link may be less straightforward. Tethering experiments have shown that PYM plays a direct role as a component of the NMD pathway, although the exact mechanisms are still unknown. To elucidate the function of PYM in NMD we find ourselves in need of a trustworthy list of NMD target transcripts. To this end we compare several methods for defining and obtaining such a list, including using Ensembl’s NMD biotype, creating a list de novo using exon and UTR coordinate information (also from Ensembl), and creating lists using the program IsoformSwitchAnalyzeR (which predicts PTCs). Finally, we produce a high-fidelity combination list which shows the expected trends when observing knockdown experiments of two proteins with well-established roles in NMD.

4.2 Introduction

As the ribosome cannot translate an mRNA without removing or displacing the EJCs in its path, and because the components of the EJCs are themselves recycled, it is assumed that EJCs are removed upon the first round of translation (Gehring, Lamprinaki, Hentze, & Kulozik, 2009). PYM, which interacts with mature EJCs on already spliced mRNAs, is believed to act in this regard as an EJC disassembly factor (Bono, et al., 2004). The N-terminus of PYM interacts with a composite surface of the Y14:MAGOH heterodimer, which makes up two of the three proteins in the EJC’s trimeric core. However, PYM does not interact with either Y14 or MAGOH in isolation, and steric clashing ought to prevent the binding of PYM to the Y14:MAGOH heterodimer within an assembled EJC, lending credence to the idea that PYM is involved in disassembly (Bono, et al., 2004). Exogenous overexpression of PYM also results in a reduction in EJC binding to reporter RNA (Gehring, Lamprinaki, Hentze, & Kulozik, 2009).

The current model suggests that EJC-bound PYM interacts with the pioneer ribosome and causes EJC disassembly at the canonical position (24 nt upstream of exon-exon junctions), while also aiding in recruitment of said translation machinery and therefore acting as a translation enhancer (Diem, Chan, Younis, & Dreyfuss, 2007). However, PYM levels are substoichiometric to both assembled ribosomes and EJCs, suggesting that it does not play a role in a majority of EJC removal events (Gehring, Lamprinaki, Hentze, & Kulozik, 2009). PYM is also non-essential for viability both in Drosophila and mammalian cells (Ghosh, Obrdlik, Marchand, & Ephrussi, 2014; Paix, et al., 2017). This suggests that the pioneering round of translation by the ribosomes is enough to displace or remove EJCs. Therefore PYM’s exact role and the importance of its association with the EJC are still subjects for further research.

In the hopes of uncovering insights into the part played by PYM we now investigate how EJC binding, both canonical and non-canonical, is impacted by a mutation on MAGOH which inhibits interaction with PYM. To isolate the role which ribosome translocation plays we also divide our mutant/wild-type assays into two additional conditions marked by treatment (or lack of treatment) with cycloheximide (CHX), a translation inhibitor. We find that while the loss of PYM-EJC interaction is largely inconsequential for canonically located EJCs, it does cause a greater abundance of non-canonically located EJCs, especially on single exon genes where no EJC deposition is expected. PYM’s primary function therefore may not be to aid in canonical EJC disassembly as previously thought, but rather to prevent the association of EJCs in non-canonical positions.

The next step in our study of PYM is to gain a better understanding of its role in NMD. PYM overexpression has been shown to inhibit NMD (Gehring, Lamprinaki, Hentze, & Kulozik, 2009), which is likely a direct consequence of its ability to either disassemble or prevent the binding of EJCs. However, PYM has also been shown to play a direct role in the NMD machinery (Bono, et al., 2004). In order to uncover its exact part in the story we have constructed cell lines in which PYM has either been knocked down via siRNA or overexpressed. We would like to observe what happens to known NMD targets in either case, but before doing so it is necessary to have a reliable list of such targets. We compare a variety of methods for arriving at such a list, including use of a previous set (Ensembl), creation de novo, and creation using a PTC predicting program, by reproducing aspects of a study on NMD featuring the knockdown of two proteins important for the NMD pathway: UPF1 and ICE1 (Baird, et al., 2018). Finally, we arrive at a combination list which is universal and easily reproducible, and clearly shows the same trends observed by Baird et al.

4.3 Materials and Methods

4.3.1 Experimental Methods

Non-computational work (i.e. wet lab experimentation) including cell preparation, immunoprecipitation, and preparation for high-throughput sequencing was conducted by Lauren Woodward. A detailed explanation of these methods can be found in (Woodward, Gangras, & Singh, 2019) and in her doctoral thesis: “Examining the Effects of Translation on the Exon Junction Complex.”

4.3.2 High-Throughput DNA Sequencing

Alignment and pre-processing were conducted as described in Chapter 2. RNA-Seq data from (Baird, et al., 2018), downloaded from the NCBI SRA (SRX3677669), were processed using the same pipeline, with the exception of adaptor trimming and PCR duplicate removal, which had already been completed by Baird et al.

4.3.3 Human reference transcriptome and annotations

Reference annotations, e.g. for protein coding transcripts, were retrieved from the Ensembl BioMart (GRCh38.p12). Reference transcript information including the list of NMD biotype transcripts was similarly downloaded, specifically for Ensembl v100.

4.3.4 Genic read distribution assignment

Fractions of reads corresponding to exonic, canonical EJC, and non-canonical EJC regions were computed at the gene level as detailed in 3.3.5. Briefly, all analyses were limited to exons 100 nt or longer and we excluded the last exon, which lacks a canonical EJC deposition site. For the canonical EJC regions the 30 nt surrounding the center of the primary EJC binding site at -24 (-39 to -9) were considered, while for the non-canonical EJC region all nucleotides from the start of each exon to -50 were used. The single exon gene annotation was similarly limited to exons (genes) at least 100 nt in length. The Bedtools function “intersect” was used to compare reads against these annotations, and reads which overlapped the annotation by more than 50% were counted.
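The region definitions above can be illustrated with a minimal sketch. It is not the pipeline used in this work (which relied on Bedtools), and it assumes plus-strand exons given as half-open intervals and that the 50% overlap threshold is measured as a fraction of the read length. The function names (ejc_regions, overlap_fraction, count_reads) are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), plus-strand coordinates (assumption)

MIN_EXON_LEN = 100  # exons shorter than this are skipped


def ejc_regions(exons: List[Interval]) -> Tuple[List[Interval], List[Interval]]:
    """Return (canonical, non_canonical) windows for one plus-strand transcript.

    `exons` must be sorted 5' to 3'. The last exon is excluded because it has
    no downstream junction and hence no canonical deposition site.
    """
    canonical, non_canonical = [], []
    for start, end in exons[:-1]:
        if end - start < MIN_EXON_LEN:
            continue
        canonical.append((end - 39, end - 9))    # 30 nt window around -24
        non_canonical.append((start, end - 50))  # exon start up to -50
    return canonical, non_canonical


def overlap_fraction(read: Interval, region: Interval) -> float:
    """Fraction of the read's length that lies inside the region."""
    shared = min(read[1], region[1]) - max(read[0], region[0])
    return max(shared, 0) / (read[1] - read[0])


def count_reads(reads: List[Interval], regions: List[Interval],
                min_fraction: float = 0.5) -> int:
    """Count reads whose overlap with any region exceeds min_fraction."""
    return sum(
        1 for read in reads
        if any(overlap_fraction(read, region) > min_fraction for region in regions)
    )
```

Minus-strand transcripts would need the windows mirrored around the exon start, which this sketch omits for brevity.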

4.3.5 Isoform analysis and read distribution

After alignment and cleaning, reads in BAM format were converted into FASTQ using Samtools before being run through Kallisto (Bray, Pimentel, Melsted, & Pachter, 2016) for isoform quantification.

4.3.6 Prediction of transcripts with Premature Termination Codons (PTCs)

The R package IsoformSwitchAnalyzeR (Vitting-Seerup & Sandelin, 2019) was used in the same manner as by Baird et al. (Baird, et al., 2018) in order to identify potential NMD targets originating from a PTC. An in-house Python script was also written to find transcripts with a PTC located more than 50 nt upstream of their final exon-exon junction (therefore qualifying as NMD targets) using exon and 3’UTR start and end locations retrieved from Ensembl v100.
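A minimal sketch of the 50 nt rule follows. It is an illustration rather than the in-house script itself, and it assumes plus-strand transcripts with half-open exon intervals and a 3’UTR start given as the genomic position of the first UTR base. The names transcript_offset and is_ptc_target are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), plus strand (assumption)


def transcript_offset(exons: List[Interval], genomic_pos: int) -> int:
    """Convert a genomic position to a 0-based offset in the spliced transcript."""
    offset = 0
    for start, end in exons:
        if start <= genomic_pos < end:
            return offset + (genomic_pos - start)
        offset += end - start
    raise ValueError("position is not inside any exon")


def is_ptc_target(exons: List[Interval], utr3_start: int,
                  min_distance: int = 50) -> bool:
    """True if the 3'UTR begins more than min_distance nt upstream of the
    final exon-exon junction, measured along the spliced transcript."""
    if len(exons) < 2:
        return False  # single exon transcripts have no junction
    last_junction = sum(end - start for start, end in exons[:-1])
    distance = last_junction - transcript_offset(exons, utr3_start)
    return distance > min_distance
```

Measuring the distance in spliced coordinates means a stop codon in any upstream exon is handled the same way as one in the penultimate exon.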

4.4 Results

4.4.1 PYM is not necessary for removal of canonical EJCs

In Chapter 2 we learned that translation inhibition via the translation elongation inhibitor cycloheximide (CHX) has a substantial effect on canonical CASC3-EJC occupancy (as opposed to RNPS1-EJCs) (Mabin, et al., 2018). This together with the knowledge that CASC3-EJCs are primarily localized to the cytoplasm implies that ribosome translocation may play a significant role in the removal of EJCs. Since PYM is also purportedly an EJC disassembly factor we conducted experiments in which either ribosome translocation or the PYM-EJC interaction (or both or neither) was disrupted. To test the impact of ribosome translocation, cells were treated with CHX as described in Chapter 2. As for the PYM-EJC interaction, HEK293 cell lines stably expressing FLAG-tagged MAGOH-E117R, which lacks the PYM binding affinity of wild-type MAGOH but retains EJC assembly and NMD functionality, were generated. RIPiT experiments as described in Chapter 2, targeting MAGOH followed by CASC3, were then conducted for each of the four conditions (with/without CHX, with/without mutation). This was done in order to specifically target cytoplasmic EJCs which would be impacted by the translation machinery and therefore be targeted for disassembly. Differential analysis of genes with canonical EJC occupancy up or down regulated across each condition, PYM-interaction inhibition or translation inhibition, showed a significant change when exposed to CHX but very little between the mutant and wild-type cell lines (translating normally), with only a single gene showing significant change (Figure 4.1). Similarly, comparing occupancy at canonical positions between mutant and wild-type cells, both with and without translation inhibition, shows hardly any significantly altered genes (Figure 4.2), which tells us that PYM interaction does not play a role in canonical EJC binding levels regardless of ribosome translocation. From this it is clear that the translation process itself plays a substantial role in the abundance of canonical EJCs but PYM’s interaction with the EJC has little to no effect.

Figure 4.1 Translation Inhibition but not PYM-EJC Interaction Inhibition Impacts Canonical EJC Occupancy Scatterplot showing differential expression in terms of log2 fold change (L2FC) of EJC footprint reads on canonical sites. On the X-axis is the MAGOH-CASC3 pulldown from wild-type cells with and without CHX treatment. Positive is with treatment (translation inhibited) and negative is without treatment (translation as normal). On the Y-axis is wild-type or mutant (MAGOH-E117R) conditions, with increased occupancy in the mutant line as positive and wild-type as negative. Blue dots represent genes significantly enriched between the +/-CHX conditions, red dots represent genes significantly enriched between mutant/wild-type conditions.

Figure 4.2 Disrupting PYM Interaction is Largely Inconsequential at Canonical Positions Scatterplot as in Figure 4.1, but now comparing mutant (positive values) versus wild-type (negative values) on both axes, with no CHX treatment on the X-axis and CHX treatment/translation inhibited on the Y-axis.

4.4.2 EJCs lacking PYM interaction bind more at non-canonical positions

When considering total EJC occupancy at the gene level a similar picture emerges (Figure 4.3), although there is now a minor increase in the range of expression with the mutant condition, which must necessarily arise from the addition of non-canonical EJC occupancy. Perhaps then any disassembly function of PYM is more important for non-canonical, potentially aberrantly placed EJCs? Meta-exon analysis at the end of exons, focused on the canonical EJC binding site, confirms this idea (Figure 4.4). Regardless of CHX treatment, wild-type MAGOH pulldown experiments show clear coverage of the canonical position. Meanwhile mutant EJCs with PYM binding inhibited have clearly lowered footprints in the same area, again regardless of CHX treatment. This trend may mean either that genes in the mutant cell lines have lower canonical EJC binding or higher non-canonical EJC binding, but due to the lack of expression change in canonical regions in particular (Figures 4.1 and 4.2) we can infer the latter. To further investigate this trend, we looked at the ratio or relative abundance of non-canonical to canonical EJC footprints for each of the four conditions (Figure 4.5). As expected, the two MAGOH-E117R pulldowns have higher average ratios, and thus contain higher levels of non-canonical relative to canonical EJC binding across genes. These data suggest that a loss of PYM-EJC interactions causes an increase in non-canonical EJC binding events.
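The per-gene ratio plotted in Figure 4.5 can be sketched as follows. This is an illustration under stated assumptions: RPKM is computed with the standard definition (reads per kilobase of region per million mapped reads), and genes with zero reads in either region are skipped, which may differ from how such genes were handled in the actual analysis. The function names are hypothetical.

```python
import math
from typing import Optional


def rpkm(read_count: float, region_length_nt: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (region_length_nt * total_mapped_reads)


def log10_nc_to_c_ratio(nc_count: float, nc_length_nt: int,
                        c_count: float, c_length_nt: int,
                        total_mapped_reads: int) -> Optional[float]:
    """log10 of non-canonical RPKM over canonical RPKM for one gene.

    Returns None when either region has no reads (an assumption made here
    so that the logarithm is always defined).
    """
    if nc_count == 0 or c_count == 0:
        return None
    nc = rpkm(nc_count, nc_length_nt, total_mapped_reads)
    c = rpkm(c_count, c_length_nt, total_mapped_reads)
    return math.log10(nc / c)
```

Because both regions come from the same sample, the total mapped read count cancels in the ratio; it is kept only so that the intermediate RPKM values are meaningful on their own.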

Figure 4.3 Disrupting PYM Interaction has a Larger Impact when Considering Entire Transcripts Scatterplot identical to Figure 4.1, but each point represents an entire gene and all footprints contained therein instead of individual canonical sites.

Figure 4.4 Meta-Exon Analysis Shows Lower Mutant-EJC Occupancy at the Canonical Binding Position End of exon read distribution with total read counts normalized to 1 in each sample.

Figure 4.5 Non-Canonical/Canonical Ratios are Greater when PYM Interaction is Inhibited Frequency distribution of (log10) non-canonical/canonical RPKM ratios for each of the four samples at the gene level.

4.4.3 MAGOH-E117R-EJCs are enriched on single exon genes

Genes containing only a single exon never undergo intron excision and are therefore expected to not feature any EJC binding. As such these genes are an easy way to test for sensitivity to non-canonical EJCs, since every EJC on a single exon gene is by definition non-canonical. Therefore, on single exon genes we would expect MAGOH-E117R EJC footprints to show increased occupancy as there is no PYM interaction to aid in disassembly. We tested this directly by looking at log2 fold change of the mutant versus wild-type conditions on single exon genes compared to all genes (Figure 4.6). This test also showed the expected outcome, solidifying the idea that PYM’s primary disassembly function is focused on non-canonical EJCs.
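The comparison described above rests on empirical cumulative distribution functions of log2 fold changes for two gene sets. A minimal sketch, assuming the fold changes are already available as plain lists of floats (the function names are hypothetical):

```python
from typing import List, Tuple


def ecdf(values: List[float]) -> List[Tuple[float, float]]:
    """Empirical CDF: sorted values paired with the fraction of values <= each."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def fraction_above(values: List[float], threshold: float = 0.0) -> float:
    """Fraction of values strictly greater than the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)
```

A gene set whose curve is shifted to the right of the curve for all genes, or equivalently one with a larger fraction_above at zero, has higher occupancy in the mutant than in the wild-type condition.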

Figure 4.6 Single Exon Genes Show Higher EJC Levels with PYM Interaction Inhibited Cumulative distribution function of log2 fold changes in the mutant versus wild-type EJC footprints for the set of all genes and single-exon genes.

4.4.4 Reproducing NMD upregulation results to test target lists

In order to continue our study of PYM, especially as it pertains to NMD, it became necessary to have a trustworthy list of NMD targets to work from. In a recent work uncovering the role of ICE1 in NMD Baird et al. (Baird, et al., 2018) accomplished this by creating their own list of NMD targets, using a combination of de novo transcript discovery and NMD target prediction. Baird et al. examined three separate experiments: RNA-Seq (as control), UPF1 knockdown, and ICE1 knockdown. As a well-established component in NMD UPF1 knockdown should cause upregulation of NMD targets, compared to control, and thus be a good benchmark of the viability of an NMD target list. Likewise, Baird et al. were able to show that ICE1 also plays an important role in NMD, with ICE1 knockdown similarly causing strong upregulation of NMD targets.

We sought to reproduce the results seen by Baird et al., while simultaneously excluding the de novo transcript discovery step in order to have a well-established, universally usable NMD list. That is, a list of NMD targets which could be easily applied to existing data sets and which would not require the recreation of the list with each new experiment. In order to ensure reproducibility of Baird et al.’s results and also have an established data set to work from we downloaded the RNA-Seq experimental data from (Baird, et al., 2018), with each of the three samples in triplicate, and aligned it using the protocol described in 4.3. Like Baird et al. we used Kallisto for isoform abundance quantification. Following quantification DESeq2 (Love, Huber, & Anders, 2014) was used for differential expression analysis of the two conditions, UPF1 knockdown and ICE1 knockdown, versus the control. In addition to the RNA-Seq data Baird et al. also made their NMD target list, as well as differential expression fold changes, available and we likewise used those for comparison.

4.4.5 3’UTR length correlates with NMD targeting as expected

As a first check to ensure the reliability of our own data processing as well as confirm expected trends with Baird et al.’s supplied results we examined fold changes of each knockdown experiment versus control across quartiles of longest 3’UTR length (Figures 4.7 and 4.8). Long 3’UTRs are known to induce NMD (Toma, Rebbapragada, Durand, & Lykke-Andersen, 2015), and so we constructed a list by selecting the transcript from each gene with the longest 3’UTR, and then subdivided this list based on quartiles. As average 3’UTR length increases we expect counts in the knockdown conditions to increase relative to control as transcripts previously degraded by NMD accumulate. Indeed, in all four cases (both Baird et al. and our processing of the data, UPF1-KD/control and ICE1-KD/control) we observe exactly that trend, confirming that the experiments are behaving as we would expect.
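The construction of the 3’UTR quartile classes can be sketched as below. This is an illustrative reconstruction, not the code used here: the input format (tuples of gene ID, transcript ID, and 3’UTR length) and the rank-based quartile assignment are assumptions, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def longest_utr_per_gene(
    transcripts: Iterable[Tuple[str, str, int]]
) -> Dict[str, Tuple[str, int]]:
    """Map each gene to the (transcript_id, utr3_length) with the longest 3'UTR."""
    best: Dict[str, Tuple[str, int]] = {}
    for gene, transcript, utr3_length in transcripts:
        if gene not in best or utr3_length > best[gene][1]:
            best[gene] = (transcript, utr3_length)
    return best


def quartile_bins(best: Dict[str, Tuple[str, int]]) -> Dict[int, List[str]]:
    """Split the selected transcripts into four rank-based classes.

    Class 1 holds the shortest 3'UTRs and class 4 the longest.
    """
    ranked = sorted(best.values(), key=lambda item: item[1])
    n = len(ranked)
    bins: Dict[int, List[str]] = {1: [], 2: [], 3: [], 4: []}
    for rank, (transcript, _length) in enumerate(ranked):
        bins[rank * 4 // n + 1].append(transcript)
    return bins
```

The log2 fold changes of the transcripts in each class can then be plotted as separate cumulative distributions.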

Figure 4.7 3’UTR Analysis is Consistent in UPF1-KD/Control (A) Cumulative distribution function of our processing of UPF1-KD/Control L2FC in four classes of longest genic 3’UTR length: purple is the top (longest) quartile, followed by red, blue, and finally green as the bottom (shortest) quartile.

(B) Cumulative distribution function as in (A) but using Baird et al.’s supplied, already processed data.

Figure 4.8 3’UTR Analysis is Consistent in ICE1-KD/Control (A) Cumulative distribution function as in Figure 4.7 but featuring ICE1 knockdown.

(B) Cumulative distribution function as in (A) but using Baird et al.’s supplied, already processed data.

4.4.6 Comparing Baird et al.’s de novo list to Ensembl’s NMD biotype

We next wanted to compare the list of NMD target transcripts produced by Baird et al. using IsoformSwitchAnalyzeR (ISAR) to a well-established list of NMD targets, namely Ensembl’s Transcript type: nonsense_mediated_decay as available on Ensembl BioMart. According to Ensembl this flag is assigned to transcripts wherein the coding sequence ends >50bp from a downstream splice site or if the variant does not cover the full coding sequence but NMD is unavoidable (regardless of exon structure). Some discrepancy is expected between these lists because Baird et al. included the closest matching Ensembl transcript IDs of de novo predicted transcripts in their list. Those transcripts are definitively different from the actual Ensembl transcripts they are matched to, and a disparity of even a single nucleotide would be enough to potentially change whether a transcript is flagged as containing a PTC (i.e., a de novo transcript which is flagged as an NMD target may most resemble an Ensembl transcript which does not meet the same criteria).

Initial tests comparing ICE1 knockdown to control were promising, showing the expected trends except in the case of our processing with Baird et al.’s list, likely due to the mismatch between transcript coordinates for the same Ensembl ID, as described above (Figures 4.9 and 4.10). The data produced by Baird et al. showed a stronger trend with their own ISAR NMD (PTC) list (Figure 4.9A), whereas our processing showed much clearer results with the Ensembl NMD list (Figure 4.10B). This discrepancy is likely due to the better matching of list to data. That is, for the Baird et al. list each transcript ID points to exactly the coordinates and expression level that were used to produce the list, and likewise for our processing and the Ensembl list, whereas the cross-tests are susceptible to the mismatch disparity described above.

Figure 4.9 NMD Analysis of ICE1-KD/Control with Baird Processing is Consistent with Expectations (A) Cumulative distribution function as in Figure 4.8B using Baird et al.’s processing and NMD (PTC) list.

(B) Cumulative distribution function as in (A) but using Ensembl’s NMD list.

Figure 4.10 NMD Analysis of ICE1-KD/Control with Our Processing is Consistent with Expectations (A) Cumulative distribution function as in Figure 4.9A using our processing and Baird et al.’s NMD (PTC) list.

(B) Cumulative distribution function as in (A) but using Ensembl’s NMD list.

Upon conducting the same tests with the UPF1 knockdown we encountered unexpected results. Baird et al.’s processing with their own list shows no effect with UPF1 knockdown (Figure 4.11A). This runs counter to the idea that UPF1 is a necessary component of the NMD mechanism, which is well established. Also, while Baird et al. reported an identical figure with ICE1 in their analysis which matched our reproduction, they reported no such figure for UPF1 and so a complete check is impossible. Meanwhile Baird et al.’s processing of their own data but using the Ensembl NMD target list does show the expected results (Figure 4.11B), drawing the reliability of the Baird et al. list into question. Our processing of Baird et al.’s data shows similar trends (Figure 4.12), allowing us to rule out a difference in processing procedure as the cause of this discrepancy.

Figure 4.11 NMD Analysis of UPF1-KD/Control with Baird Processing is Consistent only with Ensembl’s List (A) Cumulative distribution function of UPF1 knockdown over control using both Baird et al.’s list and data

(B) Cumulative distribution function as in A but using Ensembl’s list

Figure 4.12 NMD Analysis of UPF1-KD/Control with Our Processing is Consistent only with Ensembl’s List (A) Cumulative distribution function as in Figure 4.11A but with our data processing

(B) Cumulative distribution function as in Figure 4.11B but with our data processing

4.4.7 Creating our own lists and understanding ISAR’s dependencies

At this point, to better investigate the clear discrepancies in the results produced when using either the NMD list made by Baird et al. with ISAR or the list provided by Ensembl, we created several more lists de novo. First, we made a Python script to create a list of transcripts that exactly follows the definition of a PTC-based NMD target. That is, this script takes a list of transcripts with exon and UTR coordinates provided by Ensembl and flags them as PTC containing or lacking, and thus an NMD target or not, based on whether the 3’UTR begins more than 50 nt upstream of the final exon junction (including in any preceding exons). For clarity this will be referred to as the Python list. We then attempted to replicate the ISAR list created by Baird et al. ISAR requires RNA-Seq data in order to analyze isoform switches, but also uses that data to predict PTCs. Therefore, the type, depth, and number of replicates used will have a direct impact on which transcripts are considered for analysis. This already makes any list produced by ISAR experiment-specific in nature, and liable to lack any true NMD targets which are poorly represented. Since no information in (Baird, et al., 2018) was given regarding ISAR parameters, for the following analysis we used all 9 (3 experiments in triplicate) samples provided by Baird et al., processed in our own pipeline, as the input.

In addition to being input-data dependent ISAR has several options regarding how open reading frames (ORFs) are predicted. Default settings attempt to predict the ORF based on supplied sequence information, and label targets based on the same 50 nt rule described above. In the following figures the list produced this way is labeled “ENST_PTC.” Another method is “longest” which attempts to find the longest ORF, again based on provided sequence information (“ENST_PTC-longest”). Finally, the “longestAnnotated” method makes use of an additional annotation file with CDS coordinates and uses the annotated CDS start while still attempting to find the longest ORF (“ENST_PTC-longestAnnotated”).

Figure 4.13 shows how the choice of method for ISAR can make a large difference in what PTC targets are identified. Default, “longest”, and “longestAnnotated” show increasing stringency for selection, but also considerable discrepancies. In the case of ENST_PTC and ENST_PTC-longest the former contains nearly twice as many targets. Especially noteworthy are the NMD targets which appear in only one method and cast considerable doubt on the reproducibility of an ISAR-based NMD list.
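The overlap counts behind a three-way Venn diagram such as Figure 4.13 reduce to set operations on transcript IDs. A minimal sketch, assuming each list is available as a collection of Ensembl transcript IDs (the function name and region labels are hypothetical):

```python
from typing import Dict, Iterable


def venn3_counts(a: Iterable[str], b: Iterable[str], c: Iterable[str]) -> Dict[str, int]:
    """Sizes of the seven disjoint regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "only_a": len(a - b - c),
        "only_b": len(b - a - c),
        "only_c": len(c - a - b),
        "a_and_b_only": len((a & b) - c),
        "a_and_c_only": len((a & c) - b),
        "b_and_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }
```

The seven counts sum to the size of the union, which provides a quick sanity check on the input lists.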

Figure 4.13 Comparison of ISAR Methods Venn diagram showing the overlap of NMD targets identified by ISAR. Red: ENST_PTC (default settings), Blue: ENST_PTC-longest (“longest” method), Green: ENST_PTC-longestAnnotated (“longestAnnotated” method).

Next, we compared the default and most stringent ISAR methods to the list produced by Baird et al., also using ISAR and with the same RNA-Seq data (Figure 4.14). Since Baird et al. did de novo transcript discovery and we are also unsure if they used all 9 samples in their processing, this time we limited the transcript pool to those transcripts which ISAR gave either a TRUE or FALSE PTC label in all three lists (i.e. transcripts which were not considered in one method for any reason were excluded). Once again there are large discrepancies, especially with the Baird et al. list which has fewer transcripts in common with either of our lists than it has unique.

Figure 4.14 Comparison of ISAR Methods with Baird et al.’s List Venn diagram showing the overlap of NMD targets identified by ISAR, both in our processing and Baird et al.’s. Red: ENST_PTC (default settings), Blue: Baird’s list, Green: ENST_PTC-longestAnnotated (“longestAnnotated” method).

At this point we move away from Baird et al.’s list in order to focus on a more universal definition (independent of novel transcripts) and to uncover why these discrepancies arise. A comparison of Ensembl, Python, and ISAR most stringent lists once again exhibits marked differences, especially in the case of the ISAR method (Figure 4.15). We next analyzed transcripts unique to each list (Figure 4.16) and found somewhat unexpected results. In the Ensembl list we see a transcript with 3’UTR beginning in the final exon, which does not meet the PTC rule. This may be identified as an NMD target for a different, unknown reason. In the Python list there is a transcript which meets the PTC criterion but also has a very long 3’UTR covering almost the entire transcript, which may have disqualified it in the other two lists. Finally, the ISAR list features a transcript that does have a PTC in the penultimate exon, but it is only 46 nt away from the final junction which should disqualify it as an NMD target, yet it is still reported.

Figure 4.15 Comparison of Ensembl, Python, and ISAR Stringent Lists Venn diagram showing the overlap of NMD targets identified by three methods. Red: Ensembl NMD biotype, Blue: In-house generated based on PTC definition, Green: ENST_PTC-longestAnnotated (“longestAnnotated” ISAR method).

Figure 4.16 Visual of Unique NMD Targets Illustration (not to scale) of the exon and intron structure of NMD targets unique to each list in Figure 4.15 (color-coded) aligned by their PTC. Red: Ensembl NMD biotype, which contains the PTC in the final exon. Blue: In-house generated based on PTC definition, which finds a PTC in the 2nd exon (of 10). Green: ENST_PTC-longestAnnotated (“longestAnnotated” ISAR method), which features the PTC in the penultimate exon but only 46 nt away from the final junction.

4.4.8 Construction of a final set of NMD targets and revisiting UPF1

At this point we feel confident that any list produced by ISAR is unreliable for use as a universal NMD annotation and is instead better suited for studies of specific experiments, especially when investigating novel transcripts. In particular ISAR’s sample dependence as well as disagreement with more strictly defined PTC-based NMD definitions make it a poor choice for general use. Meanwhile Ensembl has a well-established list of transcripts that undergo NMD, but not all of them meet the strict EJC-PTC-based NMD rule. Likewise, the targets unique to our in-house generated list may not meet unknown NMD criteria despite having a PTC. Thus, to create a stringent list of NMD targets which are both flagged by Ensembl and also meet the strict 50 nt upstream PTC definition, we combine those transcripts found in both the Ensembl and Python lists (the purple overlap in Figure 4.17). We also exclude any transcript which is labeled as an NMD target either by Ensembl or the Python list, but not both, ensuring our non-NMD control list is also stringently constructed.
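The construction of the combination list can be written compactly with set operations. The sketch below is illustrative only; it assumes the annotated transcript universe, the Ensembl NMD biotype list, and the Python list are each available as collections of transcript IDs, and the function name build_nmd_lists is hypothetical.

```python
from typing import Iterable, Set, Tuple


def build_nmd_lists(
    all_transcripts: Iterable[str],
    ensembl_nmd: Iterable[str],
    python_ptc: Iterable[str],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (nmd_targets, controls, excluded) transcript ID sets.

    nmd_targets: flagged by both Ensembl and the PTC rule.
    excluded:    flagged by exactly one of the two lists.
    controls:    flagged by neither list.
    """
    universe = set(all_transcripts)
    ensembl = set(ensembl_nmd) & universe
    ptc = set(python_ptc) & universe
    nmd_targets = ensembl & ptc
    excluded = ensembl ^ ptc
    controls = universe - (ensembl | ptc)
    return nmd_targets, controls, excluded
```

Dropping the transcripts flagged by only one list keeps ambiguous cases out of both the target set and the control set.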

Figure 4.17 Area of the Finalized List Venn diagram showing the overlap of NMD targets predicted in-house (blue) and the NMD biotype from Ensembl (red) which make up our new definitive list of NMD targets (purple overlap).

To test the viability of this new list we now return to the ICE1 and UPF1 knockdown data, which in the case of UPF1 previously lacked the clear and expected trends. We also create a stricter NMD list containing only transcripts from genes which contain both a PTC and non-PTC transcript, ensuring that we analyze only genes which are targeted for NMD depending on which isoform is expressed. Now both for the less strict (Figure 4.18) and more strict (Figure 4.19) lists, and critically in both the case of ICE1 and UPF1 knockdown, we see the expected results clearly defined, solidifying these lists as accurate representations of the NMD-targeted transcriptome.
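The stricter list can be sketched as a filter on genes that carry at least one NMD target and at least one non-target transcript. This is an illustration under assumptions: the transcript-to-gene mapping is a plain dictionary, and the control set is restricted to the same genes as the target set, which the text above does not state explicitly. The function name strict_lists is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def strict_lists(
    transcript_to_gene: Dict[str, str],
    nmd_targets: Set[str],
    controls: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Keep only transcripts from genes with both a target and a control isoform."""
    has_target = defaultdict(bool)
    has_control = defaultdict(bool)
    for transcript, gene in transcript_to_gene.items():
        if transcript in nmd_targets:
            has_target[gene] = True
        elif transcript in controls:
            has_control[gene] = True
    mixed_genes = {gene for gene in has_target if has_control[gene]}
    strict_targets = {t for t in nmd_targets if transcript_to_gene.get(t) in mixed_genes}
    strict_controls = {t for t in controls if transcript_to_gene.get(t) in mixed_genes}
    return strict_targets, strict_controls
```

Comparing isoforms within the same gene removes gene-level expression differences as a possible explanation for any shift between the two distributions.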

Figure 4.18 NMD Analysis of ICE1-KD/Control and UPF1-KD/Control with Our Processing and New NMD List Behaves as Expected (A) Cumulative distribution function of ICE1-KD/Control for PTC-True and PTC-False transcripts (NMD target and non-target, respectively).

(B) Cumulative distribution function as in (A) but with UPF1-KD/Control.

Figure 4.19 NMD Analysis of ICE1-KD/Control and UPF1-KD/Control with Our Processing and New, Strict, NMD List Behaves as Expected (A) Cumulative distribution function of ICE1-KD/Control for PTC-True and PTC-False transcripts (NMD target and non-target, respectively).

(B) Cumulative distribution function as in (A) but with UPF1-KD/Control.

4.5 Discussion

4.5.1 PYM is unnecessary for canonical EJC removal, but does reduce levels of non-canonical EJCs

The current model of PYM suggests that its primary function is the disassembly of canonically located EJCs (Bono, et al., 2004; Gehring, Lamprinaki, Hentze, & Kulozik, 2009). However, that work is based on exogenous overexpression of PYM, and PYM is not even a vital protein (Ghosh, Obrdlik, Marchand, & Ephrussi, 2014), which tells us that in the absence of PYM translation still takes place, and some other mechanism must be capable of removing EJCs. Indeed, our work with mutant MAGOH containing EJCs incapable of binding to PYM clearly shows that PYM-EJC interaction is unnecessary for the removal of canonical EJCs. The question of how EJCs are displaced or potentially removed co-translationally is still not well understood.

However, loss of PYM does result in an increase of non-canonical EJC binding events. This is especially clear when examining single exon genes, on which any EJC deposition is by definition non-canonical. Therefore, PYM’s disassembly function may be more important in the process of removing non-canonical EJCs, or perhaps it acts to prevent spontaneous EJC assembly in non-canonical positions within the cytoplasm by binding to cytosolic MAGOH:Y14 heterodimers. Just as the exact mechanism by which non-canonical EJCs are deposited in the first place remains unknown, how PYM selectively targets these non-canonical EJCs is a mystery.


4.5.2 ISAR is unreliable for PTC prediction, but a combined Ensembl and PTC-based NMD list performs admirably

In order to continue our study of PYM it is necessary to have a reliable and universal list of NMD target transcripts. We successfully reproduced the evidence for ICE1’s role in NMD reported by Baird, et al. (2018), but were surprised to find that their NMD list and data showed little upregulation of NMD targets when UPF1, a known NMD factor, was knocked down. To uncover why this might be, we created and compared several plausible NMD lists, including de novo lists created by ISAR in the manner of Baird et al., but found only more discrepancies.

By analyzing specific transcripts from the various NMD target lists we have shown that ISAR is heavily dependent on the RNA-Seq samples required for isoform switch analysis, and can also produce very different results depending on the ORF prediction method. To avoid this sample dependence and any conflict between methods, we created a finalized list using only the Ensembl NMD biotype and an in-house script to check for a transcript’s strict adherence to the 50 nt PTC rule. Using this list, as well as a stricter version, we obtained the expected results with both the UPF1 and ICE1 knockdown experiments, supporting its viability as a universal NMD target reference list. This list is still based purely on the concept of premature termination codons, however, and cannot perfectly capture the multitude of variables which influence whether a transcript will be targeted for NMD. Further exploration of how these individual transcripts behave when NMD is impaired, for instance with a knockdown or knockout of UPF1 and other proteins involved in the NMD pathway, could experimentally establish the true nature of each transcript.
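To make the 50 nt PTC rule concrete, a minimal sketch follows. It is not the in-house script described above: the function name, the input layout (exon lengths in 5' to 3' transcript order plus a 1-based transcript coordinate for the last nucleotide of the stop codon), and the choice of a strictly-greater-than threshold are all illustrative assumptions.

```python
def is_ptc(exon_lengths, stop_end, min_distance=50):
    """Apply a simple form of the 50 nt rule to one transcript.

    exon_lengths : exon lengths in 5'->3' transcript order (assumed input)
    stop_end     : 1-based transcript coordinate of the last nucleotide
                   of the stop codon (assumed convention)
    min_distance : the stop codon is called premature when it lies more
                   than this many nt upstream of the last exon-exon junction
    """
    if len(exon_lengths) < 2:
        # A single-exon transcript has no exon-exon junction.
        return False
    # Transcript coordinate of the last nucleotide before the final junction.
    last_junction = sum(exon_lengths[:-1])
    return (last_junction - stop_end) > min_distance


# Hypothetical three-exon transcript; the last junction is at position 350.
exons = [200, 150, 300]
print(is_ptc(exons, stop_end=250))  # stop 100 nt upstream of the junction
print(is_ptc(exons, stop_end=320))  # stop 30 nt upstream of the junction
print(is_ptc(exons, stop_end=500))  # stop in the last exon
```

A stop codon in the last exon gives a negative distance and is therefore never flagged, which matches the intent of the rule.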


Chapter 5

Conclusions

5.1 A New Understanding of the EJC Makeup and its Implications

The Exon Junction Complex is a group of proteins deposited on pre-mRNA during splicing, by the same machinery which dictates the final sequence that will eventually be translated into protein. As a physical marker of splice sites, it fulfills a unique role among proteins. Namely, it does not bind any particular motif but is placed on all multi-exon mRNAs at specific locations, which allows the cell to later “remember” where splicing events took place. This function makes the EJC vital in the Nonsense-Mediated Decay pathway, which degrades mRNAs which have been improperly spliced or mutated, or which have evolved to use NMD as a regulatory mechanism.

Whereas the EJC was previously thought to be made up of a four-protein core including CASC3, we have shown that at initial deposition it is in fact a trimeric core bound by one of two alternate proteins, RNPS1, which is exchanged for CASC3 only later, after export into the cytoplasm. While bound by RNPS1 the EJCs work together with a host of peripheral proteins to turn the mRNA into a compact package, aiding in transport across the nuclear membrane. Likewise, while still bound by RNPS1-EJCs, mRNAs are more sensitive to NMD, implying that mRNAs which go on to reach maturity in the cytoplasm become more stable against this process. Together these findings paint a picture of an EJC which is far more dynamic than previously thought, and which influences the mRNAs to which it binds in an increasingly complex manner.

5.2 The Importance of Choosing the Right Crosslinking Method for an Experiment

Our study of the EJC in Chapter 2 made us question why previous research had not yet uncovered the major RNPS1-bound composition of the EJC. Working with both CLIP-Seq and RIPiT-Seq data sets we found in Chapter 3 that while CLIP-Seq may allow for unrivaled precision when analyzing RNA Binding Proteins, it fails to accurately capture proteins which bind indirectly to, or photo crosslink poorly with, RNA. We name such proteins RAFs, or RBP Associating Factors. Indeed, our work shows that a majority of proteins which interact with RNA are likely RAFs, and so cannot be well studied using CLIP-Seq approaches. We were able to demonstrate this by using the alternate EJC compositions: CASC3 is an RBP, while RNPS1 is an RAF. The same difference which previously left one composition of the EJC undiscovered has now been used to demonstrate the importance of wisely selecting a crosslinking method.

To further demonstrate the need for chemical crosslinking when examining RAFs, we looked at a previous study on EJCs which considered their impact on Recursively Spliced (RS) exons. Using the RIPiT-Seq protocol described in Chapter 2 we were able to identify RNPS1-EJCs not found by CLIP-Seq on exons upstream of these RS exon junctions, demonstrating that the RNPS1-EJC also plays a role in inhibiting recursive splicing. The distinction in the applicable range of experimental methods shown in Chapter 3 represents a new understanding of how RNA:protein binding experiments should be crafted and will ideally lead to more accurate profiles of RBP and RAF binding in the future.

5.3 Examining the Role of PYM and Finding a Trustworthy NMD Annotation

Continuing our work with the EJC, but now focusing on the end of its lifecycle, in Chapter 4 we tested the potential role of the EJC associating protein PYM in EJC disassembly. Previous work had suggested that PYM acts to remove canonically placed EJCs co-translationally, but by impeding translation we found that this function, if it exists, is not necessary for EJC removal. On the other hand, while we saw no significant influence on canonical EJCs when using a mutated cell line which prevented PYM-EJC binding, we did find evidence for PYM as an antagonist of specifically non-canonical EJCs. We confirmed this trend by observing how single exon genes, on which all EJCs are by definition non-canonical, show increased EJC abundance when PYM cannot bind to EJCs. This work begins to paint a picture of PYM not so much as a general EJC disassembly factor, but as a protein which works to inhibit or disassemble non-canonical EJCs.

There is evidence that PYM may play a direct role in NMD. To follow this line of inquiry we would like to test PYM overexpression and knockdown or knockout data against a trusted list of NMD targets. To create such a list we first conducted an exercise in reproducibility, re-creating figures and trends showing the importance of ICE1 to the NMD pathway (Baird, et al., 2018) using our own pipeline. We then went on to examine several ways of defining an NMD list, including making such a list de novo with the program IsoformSwitchAnalyzeR, as was done by Baird et al. At length we found that the best way to ensure we had a universal list (i.e. independent of experiment or sample), while also having certainty regarding the placement of PTCs, was to combine two lists: one reported by the Ensembl BioMart and one created in-house using a strict PTC definition. This new list showed promising results with the reproduced data, a good omen for future research with PYM.
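As a rough illustration of combining two transcript lists, the sketch below treats each list as a set of transcript identifiers. The exact combination rule is not restated in this chapter, so both a union and an intersection are shown; the function name, the `require_both` parameter, and the placeholder identifiers are hypothetical.

```python
def combine_nmd_lists(ensembl_nmd_ids, ptc_rule_ids, require_both=False):
    """Combine two collections of transcript IDs into one reference set.

    require_both=False keeps transcripts found in either list (union);
    require_both=True keeps only transcripts found in both (intersection).
    Which rule suits a given analysis is an assumption left to the user.
    """
    ensembl = set(ensembl_nmd_ids)
    ptc = set(ptc_rule_ids)
    return (ensembl & ptc) if require_both else (ensembl | ptc)


# Placeholder identifiers, not real Ensembl transcript IDs.
ensembl_list = ["TX_A", "TX_B", "TX_C"]
in_house_list = ["TX_B", "TX_C", "TX_D"]

either = combine_nmd_lists(ensembl_list, in_house_list)
both = combine_nmd_lists(ensembl_list, in_house_list, require_both=True)
print(sorted(either))
print(sorted(both))
```

Because neither input depends on a particular RNA-Seq sample, a list built this way stays fixed from one experiment to the next.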

5.4 Future Directions

The vast cohort of proteins which associate with the EJC and their implications for mRNA processing have only begun to be uncovered. In this work we showcase two proteins closely associated with the core, CASC3 and RNPS1, but have only scratched the surface in understanding the differences between these alternate compositions. PYM represents another example of an EJC associating protein on which we have only now begun to shed more light. That advancement is thanks in part to the choice of methods, namely the use of chemical crosslinking to study the EJC. We have already shown the large differences in RBP/RAF bound fragment capture brought about by either CLIP-Seq or RIP-Seq. What other factors and methodologies might better illuminate the large body of RBPs and RAFs which are yet to be studied? Our finding that PYM is unnecessary for EJC removal raises the question of what, then, is the exact process by which EJCs are in fact removed. Likewise, while we have shown that PYM impacts the abundance of non-canonical EJCs (ncEJCs), that mechanism also remains unknown: does PYM bind to already deposited ncEJCs and spur their disassembly, or does it bind to free MAGOH:Y14 complexes in the cytoplasm, preventing spontaneous ncEJC deposition?


This work has revealed many novel insights into both the structure and function of the EJC, as well as some of its peripheral proteins. It has also demonstrated how the method used to research these questions can be instrumental in seeing the whole picture. However, each question answered leads to new lines of inquiry; there is still much to be learned regarding the dynamic fate of the Exon Junction Complex.


Bibliography

Adivarahan, S., Livingston, N., Nicholson, B., Rahman, S., Wu, B., Rissland, O., & Zenklusen, D. (2018). Spatial Organization of Single mRNPs at Different Stages of the Gene Expression Pathway. Mol. Cell, 72, 727-738.e5. Alexandrov, A., Colognori, D., Shu, M.-D., & Steitz, J. (2012). Human spliceosomal protein CWC22 plays a role in coupling splicing to exon junction complex deposition and nonsense-mediated decay. Proc. Natl. Acad. Sci., 109, 21313- 21318. Anantharaman, V., Koonin, E., & Aravind, L. (2002). Comparative genomics and evolution of proteins involved in RNA metabolism. Nucleic Acids Res., 1427- 1464. Andersen, C., Ballut, L., Johansen, J., Chamieh, H., Nielsen, K., Oliveira, C., . . . Andersen, G. (2006). Structure of the Exon Junction Core Complex with a Trapped DEAD-Box ATPase Bound to RNA. Science, 313, 1968-1972. Änkö, M.-L., Müller-McNicoll, M., Brandl, H., Curk, T., Gorup, C., Henry, I., . . . Neugebauer, K. (2012). The RNA-binding landscapes of two SR proteins reveal unique functions and binding to diverse RNA classes. Genome Biol., 13, R17. Au, P., Helliwell, C., & Wang, M.-B. (2014). Characterizing RNA-protein interaction using cross-linking and metabolite supplemented nuclear RNA- immunoprecipitation. Mol. Biol. Rep., 41, 2971-2977. Aznarez, I., Nomakuchi, T., Tetenbaum-Novatt, J., Rahman, M., Fregoso, O., Rees, H., & Krainer, A. (2018). Mechanism of Nonsense-Mediated mRNA Decat Stimulation by Splicing Factor SRSF1. Cell Rep., 23, 2186-2198. Baird, T., Cheng, K.-C., Chen, Y.-C., Buehler, E., Martin, S., Inglese, J., & Hogg, J. (2018). ICE1 promotes the link between splicing and nonsense-mediated mRNA decay. eLife, 7, e33178. Ballut, L., Marchadier, B., Baguet, A., Tomasetto, C., Séraphin, B., & Le Hir, H. (2005). The exon junction core complex is locked onto RNA by inhibition of eIF4AIII ATPase activity. Nat. Struct. Mol. Biol., 12, 861-869. Baltz, A., Munschauer, M., Schwanhäusser, B., Vasile, A., Murakawa, Y., Schueler, M., . . . Milek, M. 
(2012). The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol. Cell, 46, 674-690. Ban, T., Zhu, J., Melcher, K., & Xu, H. (2015). Structural mechanisms of RNA recognition: sequence-specific and non-specific RNA-binding proteins and the Cas9-RNA-DNA complex. Cell. Mol. Life Sci., 72, 1045-1058. Barik, A., Nithin, C., Pilla, P. S., & Bahadur, R. P. (2015). Molecular architecture of protein-RNA recognition sites. Journal of Biomolecular Structure & Dynamics, 33, 2738-2751. Bessonov, S., Anokhina, M., Will, C., Urlaub, H., & Lührmann, R. (2008). Isolation of an active step I spliceosome and composition of its RNP core. Nature, 452, 846- 850.


Blazquez, L., Emmett, W., Faraway, R., Pineda, J., Bajew, S., Gohr, A., . . . Irimia, M. (2018). Exon Junction Complex Shapes the Transcriptome by Repressing Recursive Splicing. Mol. Cell, 72, 496-509.e9. Boehm, V., & Gehring, N. (2016). Exon Junction Complexes: Supervising the Gene Expression Assembly Line. Trends Genet., 32, 724-735. Boehm, V., Britto-Borges, T., Steckelberg, A.-L., Singh, K., Gerbracht, J., Gueney, E., . . . Gehring, N. (2018). Exon Junction Complexes Suppress Spurious Splice Sites to Safeguard Transcriptome Integrity. Mol. Cell, 72, 482-495. e7. Bono, F., Ebert, J., Lorentzen, E., & and Conti, E. (2006). The Crystal Structure of the Exon Junction Complex Reveals How It Maintains a Stable Grip on mRNA. Cell, 126, 713-725. Bono, F., Ebert, J., Unterholzner, L., Güttler, T., Izaurralde, E., & Conti, E. (2004). Molecular insights into the interaction of PYM with Mago-Y14 core of the exon junction complex. EMBO reports, 5(3), 304-310. Boutz, P., Bhutkar, A., & Sharp, P. (2015). Detained introns are a novel, widespread class of post-transcriptional spliced introns. Genes Dev., 29, 63-80. Brannan, K., Jin, W., Huelga, S., Banks, C., Gilmore, J., Florens, L., . . . Schwinn, M. (2016). SONAR Discovers RNA-Binding Proteins from Analysis of Large-Scale Protein-Protein Interactomes. Mol. Cell, 64, 282-293. Bray, N., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol, 34, 525-527. Brown, T., & Brown, T. J. (2020, 5 29). Nucleic Acid Structure. Retrieved from Atdbio: http://www.atdbio.com/content/5/Nucleic-acid-structure Buchwald, G., Schüssler, S., Basquin, C., Le Hir, H., & Conti, E. (2013). Crystal structure of the human eIF4AIII-CWC22 complex shows how a DEAD-box protein is inhibited by a MIF4G domain. Proc. Natl. Acad. Sci. U. S. A., 110, E4611-4618. Chan, W.-K., Huang, L., Gudikote, J., Chang, Y.-F., Imam, J., MacLean, J., & Wilkinson, M. (2007). 
An alternative branch of the nonsense-mediated decay pathway. EMBO J., 26, 1820-1830. Chatterjee, K., Majumder, S., Wan, Y., Shah, V., Wu, J., Huang, H.-Y., & Hopper, A. (2017). Sharing the load: Mex67-Mtr2 cofunctions with Los1 in primary tRNA nuclear export. Genes Dev., 31, 2186-2198. Chazal, P.-E., Daguenet, E., Wendling, C., Ulryck, N., Tomasetto, C., Sargueil, B., & Hir, H. (2013). EJC core componenet MLN51 interacts with eIF3 and activates translation. Proc. Natl. Acad. Sci., 110, 5903-5908. Chen, Q., Jagannathan, S., Reid, D., Zheng, T., & and Nicchitta, C. (2011). Hierarchical regulation of mRNA partitioning between the cytoplasm and the endoplasmic reticulum of mammalian cells. Mol. Biol. Cell, 22, 2646-2658. Chen, Y., & Varani, G. (2005). Protein families and RNA recognition. FEBS Journal, 272, 2088-2097. Chu, C., Zhang, Q., da Rocha, S., Flynn, R., Bharadwaj, M., Calabrese, J., . . . Chang, H. (2015). Systematic discovery of Xist RNA binding proteins. Cell, 161, 404-416.


Collas, P. (2010). The current state of chromatin immunoprecipitation. Mol. Biotechnol., 45, 87-100. Daguenet, E., Baguet, A., Degot, S., Schmidt, U., Alpy, F., Wendling, C., . . . Le Hir, H. (2012). Perispeckles are major assembly sites for the exon junction core complex. Mol. Biol. Cell, 23, 1765-1782. Degot, S., Hir, H. L., Alpy, F., Kedinger, V., Stoll, I., Wendling, C., . . . Tomasetto, C. (2004). Association of the Breast Cancer Protein MLN51 with the Exon Junction Complex via Its Speckle Localizer and RNA Binding Module. J. Biol. Chem., 279, 33702-33715. Degot, S., Régnier, C., Wendling, C., Chenard, M.-P., Rio, M.-C., & Tomasetto, C. (2002). Metastatic Lymph Node 51, a novel nucleo-cytoplasmic protein overexpressed in breast cancer. Oncogene, 21, 4422-4434. Diem, M., Chan, C., Younis, I., & Dreyfuss, G. (2007). PYM binds the cytoplasmic exon-junction complex and ribosomes to enhance translation of spliced mRNAs. Nat. Struct. Mol. Biol., 14, 1173-1179. Doerr, T., & Yu, Y. (2010). A simple electrostatic model applicable to biomolecular recognition. Phys. Rev. E. Stat. Nonlin. Soft Matter Phys., 81(1). Dostie, J., & Dreyfuss, G. (2002). Translation is Required to Remove Y14 from mRNAs in the Cytoplasm. Curr. Biol., 12, 1060-1067. Duncan, C., & Mata, J. (2017). Effects of cycloheximide on the interpretation of ribosome profiling experiments in Schizosaccharomyces pombe. Sci. Rep., 7. Fabre, B., Lambour, T., Delobel, J., Amalric, F., Monsarrat, B., Burlet-Schiltz, O., & Bousquet-Dubouch, M.-P. (2013). Subcellular distirbution and dynamics of active proteasome complexes unraveled by a workflow combining in vivo complex crosslinking and quantitative proteomics. Mol. Cell. Proteomics, 12, 687-699. G Hendrickson, D., Kelley, D., Tenen, D., Bernstein, B., & Rinn, J. (2016). Widespread RNA binding by chromatin-associated proteins. Genome Biol., 17, 28. Gangras, P., Dayeh, D., Mabin, J., Nakanishi, K., & Singh, G. (2018). 
Cloning and Identification of Recombinant Argonaute-Bound Small RNAs Using Next- Generation Sequencing Methods. Methods Mol. Biol., 1680, 1-28. Gehring, N., Kunz, J., Neu-Yilik, G., Breit, S., Viegas, M., Hentze, M., & Kulozik, A. (2005). Exon-Junction Complex Components Specify Distinct Routes of Nonsense-Mediated mRNA Decay with Differential Cofactor Requirements. Mol. Cell, 20, 65-75. Gehring, N., Lamprinaki, S., Hentze, M., & Kulozik, A. (2009). The Hierarchy of Exon- Junction Complex Assembly by the Spliceosome Explains Key Features of Mammalian nonsense-Mediated mRNA Decay. PLoS Biol., 7. Gerstberger, S., Hafner, M., & Tuschl, T. (2014). census of human RNA-binding proteins. Nature Reviews Genetics, 15, 829-845. Geyer, P., Meyuhas, O., Perry, R., & Johnson, L. (1982). Regulation of Ribosomal Protein mRNA Content and Translation in Growth-Stimulated Mouse Fibroblasts. Mol. Cell. Biol., 2, 685-693.


Ghosh, S., Obrdlik, A., Marchand, V., & Ephrussi, A. (2014). The EJC binding and dissociating activity of PYM is regulated in Drosophila. PLoS genetics, 10(6), e1004455. Gopinath, S. (2009). Mapping of RNA-protein interactions. Analytica Chimica Acta, 636, 117-128. Halstead, J., Lionnet, T., Wilbertz, J., Wippich, F., Ephrussi, A., Singer, R., & Chao, J. (2015). Translation. An RNA biosensor for imaging the first round of translation from single cells to livings animals. Science, 347, 1367-1671. Hauer, C., Sieber, J., Schwarzl, T., Hollerer, I., Curk, T., Alleaume, A.-M., . . . Kulozik, A. (2016). Exon Junction Complexes Show a Distributional Bias toward Alternatively Spliced mRNAs and against mRNAs Coding for Ribosomal Proteins. Cell Rep., 16, 1588-1603. Hayashi, R., Handler, D., Ish-Horowicz, D., & Brennecke, J. (2014). The exon junction complex is required for definition and excision of neighboring introns in Drosophila. Genes Dev., 28, 1722-1785. Haynes, C., & Iakoucheva, L. (2006). Serine/arginine-rich splicing factors belong to a class of intrinsically disordered proteins. Nucleic Acids Res., 34, 305-312. Hentze, M., Castello, A., Schwarzl, T., & Preiss, T. (2018). A brave new world of RNA- binding proteins. Nat. Rev. Mol. Cell Biol., 19, 327-341. Hockensmith, J., Kubasek, W., Vorachek, W., & von Hippel, P. (1986). Laser cross- linking of nucleic acids to proteins. Methodology and first applications to the phage T4 DNA replication system. J. Biol. Chem., 261, 3512-3518. Hockensmith, J., Kubasek, W., Vorachek, W., & von Hippel, P. (1993). Laser cross- linking of proteins to nucleic acids. I. Examining physical parameters of protein- nucleic acid complexes. J. Biol. Chem., 268, 15712-15720. Hoffman, E., Frey, B., Smith, L., & Auble, D. (2015). Formaldehyde crosslinking: a tool for the study of chromatin complexes. J. Biol. Chem., 290, 26404-26411. Hogeweg, P. (2011). The roots of bioinformatics in theoretical biology. PLoS COmput. Biol., 7(3):e1002021. 
Huang, H.-Y., & Hopper, A. (2015). In vivo biochemical analyses reveal distinct roles of β-importins and eEF1A in tRNA subcellular traffic. Genes Dev., 29, 772-783. Huang, W., Sherman, B., & Lempicki, R. (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4, 44-57. Hug, N., Longman, D., & Cáceres, J. F. (2016). Mechanism and regulation of the nonsense-mediated decay pathway. Nucleic Acids Res., 44(4): 1483-1495. Huppertz, I., Attig, J., D’Ambrogio, A., Easton, L., Sibley, C., Sugimoto, Y., . . . Ule, J. (2014). iCLIP: protein-RNA interactions at nucleotide resolution. Methods, 65, 274-287. Ingolia, N., Lareau, L., & Weissman, J. (2011). Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell, 147, 789-802. Jan, C., Williams, C., & Weissman, J. (2014). Principles of ER cotranslational translocation revealed by proximity-specific ribosome profiling. Science, 346, 1257521.

Jankowsky, E., & Harris, M. E. (2015). Specificity and nonspecifcity in RNA-protein interactions. Nature Reviews Molecular Cell Biology, 16, 533-544. Jones, S., Daley, D., Luscombe, N., Berman, H., & Thornton, J. (2001). Protein-RNA interactions: a structural analysis. Nucleic Acids Research, 29, 943-954. Joon, H. R., Madhu, T., Pulakesh, A., Kimoon, K., Briber, R., & Woodson, S. A. (2011). Charge screening in RNA: an integral route for dynamical enhancements. Soft Matter, 11, 8741-8745. Kahvejian, A., Svitkin, Y., Sukarieh, R., M’Boutchou, M.-N., & Sonenberg, N. (2005). Mammalian poly(A)-binding protein is a , which acts via multiple mechanisms. Genes Dev., 19, 104-113. Karousis, E., Nasif, S., & Mühlemann, O. (2016). Nonsense-mediated mRNA decay: novel mechanistic insights and biological impact. Wiley Interdiscip. Rev. RNA. Kim, B., & Kim, V. (2019). fCLIP-seq for transcriptomic footprinting of dsRNA-binding proteins: Lesson from DROSHA. Methods, 152, 3-11. Kiss, D., Baez, W., Huebner, K., Bundschuh, R., & Schoenberg, D. (2017). Impact of FHIT loss on the translation of cancer-associated mRNAs. Cancer, 16, 179. Kwon, I., Kato, M., Xiang, S., Wu, L., Theodoropoulos, P., Mirzaei, H., . . . McKnight, S. (2013). Phosphorylation-Regulated Binding of RNA Polymerase II to Fibrous Polymers of Low-Complexity Domains. Cell, 155, 1049-1060. Lau, C. K., Diem, M., Dreyfuss, G., & Van Duyne, G. (2003). Structure of the Y14- Magoh Core of the Exon Junction Complex. Curr. Biol., 13, 933-941. Le Hir, H., Izaurralde, E., Maquat, L. E., & Moore, M. J. (2000). The spliceosome deposits multiple proteins 20–24 nucleotides upstream of mRNA exon–exon junctions. EMBO J, 19: 6860-6869. Le Hir, H., Saulie`re, J., & Wang, Z. (2016). The exon junction complex as a node of post-transcriptional networks. Nat. Rev. Mol. Cell Biol., 17, 41-54. Lee, F., & Ule, J. (2018). Advances in CLIP Technologies for Studies of Protein-RNA Interactions. Mol. Cell, 69, 354-369. Lee, S., & Lykke-Andersen, J. 
(2013). Emerging roles for ribonucleoprotein modification and remodeling in controlling RNA fate. Trends Cell Biol., 23, 504-510. Linder, P., & Jankowsky, E. (2011). From unwinding to clamping – the DEAD box RNA helicase family. Nature Reviews Molecular Cell Biology, 12, 505-516. Lipfert, J., Sim, A., Herschlag, D., & Doniach, S. (2010). Dissecting electrostatic screening, specific ion binding, and ligand binding in an energetic model for glycine riboswitch folding. RNA, 16, 708-719. Love, M., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol., 15, 550. Lunde, B., Moore, C., & Varani, G. (2017). RNA-binding proteins: modular design for efficient function. Nature Reviews Molecular Cell Biology, 8, 479-490. Lykke-Andersen, J., Shu, M., & Steitz, J. (2001). Communication of the position of exon-exon junctions to the mRNA surveillance machinery by the protein RNPS1. Science, 293, 1836-1839. Lykke-Andersen, S., & Jensen, T. (2015). Nonsense-mediated mRNA decay: an intricate machinery that shapes transcriptomes. Nat. Rev. Mol. Cell Biol., 16, 665-677.

Mabin, J., Woodward, L., Patton, R., Yi, Z., Jia, M., Wysocki, V., . . . Singh, G. (2018). The Exon junction Complex Undergoes a Compositional Switch that Alters mRNP Structure and Nonsense-Mediated mRNA Decay Activity. Cell Rep., 25, 2431-2446. e7. Macchi, P., Kroening, S., Palacios, I., Baldassa, S., Grunewald, B., Ambrosino, C., . . . Kiebler, M. (2003). Barentsz, a New Component of the Staufen-Containing RibonucleoproteinParticles in Mammalian Cells, Interacts with Staufen in an RNA-Dependent Manner. J. Neurosci., 23, 5778-5788. Malone, C., Mestdagh, C., Akhtar, J., Kreim, N., Deinhard, P., Sachidanandam, R., . . . Roignant, J.-Y. (2014). The exon junction complex controls transposable element activity by ensuring faithful splicing of the piwi transcript. Genes Dev., 28, 1786- 1799. Mao, H., Brown, H., & Silver, D. (2017). Mouse models of Casc3 reveal developmental functions distinct from other components of the exon junction complex. RNA N. Y. N., 23, 23-31. Mao, H., McMahon, J., Tsai, Y.-H., Wang, Z., & Silver, D. (2016). Haploinsufficiency for Core Exon Junction Complex Components Disrupts Embryonic Neurogenesis and Causes p53-Mediated Microcephaly. PLoS Genet., 12, e1006282. Marcotrigiano, J., Gingras, A., Sonenberg, N., & Burley, S. (1997). Cocrystal structure of the messenger RNA 5' cap-binding protein (eIF4E) bound to 7-methyl-GDP. Cell, 89, 951-961. Margueron, R., & Reinberg, D. (2011). The Polycomb complex PRC2 and its mark in life. Nature, 469, 343-349. McMahon, J., Miller, E., & Silver, D. (2016). The exon junction complex in neural development and neurodevelopmental disease. Int. J. Dev. Neurosci. Off. J. Int. Soc. Dev. Neurosci., 55, 117-123. Metkar, M., Ozadam, H., Lajoie, B., Imakaev, M., Mirny, L., Dekker, J., & Moore, M. (2018). Higher-Order Organization Principles of Pre-translational mRNPs. Mol. Cell, 72, 715-726.e3. Meyuhas, O., & Kahan, T. (2015). The race to decipher the top secrets of TOP mRNAs. Biochim. Biophys. Acta, 1849, 801-811. 
Mili, S., & Steitz, J. (2004). Evidence for reassociation of RNA-binding proteins after cell lysis: implications for the interpretation of immunoprecipitation analyses. RNA, 10, 1692-1694. Mishler, D., Christ, A., & Steitz, J. (2008). Flexibility in the site of exon junction complex deposition revealed by functional group and RNA secondary structure alterations in the splicing substrate. RNA N. Y. N., 14, 2657-2670. Müller-McNicoll, M., & Neugebauer, K. (2013). How cells get the message: dynamic assembly and function of mRNA–protein complexes. Nat. Rev. Genet., 14, 275- 287. Murachelli, A., Ebert, J., Basquin, C., Le Hir, H., & Conti, E. (2012). The structure of the ASAP core complex reveals the existence of a -containing PSAP complex. Nat. Struct. Mol. Biol., 19, 378-386.


Neve, J., Burger, K., Li, W., Hoque, M., Patel, R., Tian, B., . . . Furger, A. (2016). Subcellular RNA profiling links splicing and nuclear DICER1 to alternative cleavage and polyadenylation. Genome Res., 26, 24-35. Nicholson, C., Friedersdorf, M., & Keene, J. (2017). Quantifying RNA binding sites transcriptome-wide using DO-RIP-seq. RNA, 23, 32-46. Niranjanakumari, S., Lasda, E., Brazas, R., & Garcia-Blanco, M. (2002). Reversible cross-linking combined with immunoprecipitation to study RNA-protein interactions in vivo. Methods, 26, 182-190. Obrdlik, A., Lin, G., Haberman, N., Ule, J., & Ephrussi, A. (2019). The Transcriptome- wide. Cell Rep., 28, 1219-1236.e11. Paix, A., Folkmann, A., Goldman, D., Kulaga, H., Grzelak, M., Rasoloson, D., . . . Seydoux, G. (2017). Precision genome editing using synthesis-dependent repair of Cas9-induced RNS breaks. Proc. Natl. Acad. Sci., 114, E10745-E10754. Pandit, S. Z., Shiue, L., Coutinho-Mansfield, G., Li, H., Qiu, J., Huang, J., . . . Fu, X.-D. (2013). Genome-wide analysis reveals SR protein cooperation and competition in regulated splicing. Mol. Cell, 50, 223-235. Patton, R., Sanjeev, M., Woodward, L., Mabin, J., Bundschuh, R., & Singh, G. (2020). Chemical crosslinking enhances RNA immunoprecipitation for efficient identification of binding sites of proteins that photo-crosslink poorly with RNA. RNA. doi:https://doi.org/10.1261/rna.074856.120 Patursky-Polischuk, I., Stolovich-Rain, M., Hausner-Hanochi, M., Kasir, J., Cybulski, N., Avruch, J., . . . Meyuhas, O. (2009). The TSC-mTOR Pathway Mediates Translational Activation of TOP mRNAs by Insulin Largely in a Raptor- or Rictor-Independent Manner. Mol. Cell. Biol., 29, 640-649. Queiroz, R., Smith, T., Villanueva, E., Marti-Solano, M., Monti, M., Pizzinga, M., . . . Dezi, V. (2019). Comprehensive identification of RNA-protein interactions in any organism using orthogonal organic phase separation (OOPS). Nat. Biotechnol, 37, 169-178. Quinlan, A., & Hall, I. (2010). 
BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841-842. Reichert, V., Le Hir, H., Jurica, M., & Moore, M. (2002). 5′ exon interactions within the human spliceosome establish a framework for exon junction complex structure and assembly. Genes Dev., 16, 2778-2791. Ricci, E., Kucukural, A., Cenik, C., Mercier, B., Singh, G., Heyer, E., . . . Moore, M. (2014). Staufen1 senses overall transcript secondary structure to regulate translation. Nat. Struct. Mol. Biol., 21, 26-35. Rodriguez, J., Maietta, P., Ezkurdia, I., Pietrelli, A., Wesselink, J.-J., Lopez, G., . . . Tress, M. (2013). APPRIS: annotation of principal and alternative splice isoforms. Nucleic Acids Res., 41, D110-D117. Roh, J., Tyagi, M., Aich, P., Kim, K., Briber, R., & Woodson, S. (2015). Charge screening in RNA: an integral route for dynamical enhancements. Soft Matter, 11, 8741-8745.


Roignant, J.-Y., & Treisman, J. (2010). Exon Junction Complex Subunits Are Required to Splice Drosophila MAP Kinase, a Large Heterochromatic Gene. Cell, 143, 238-250. Sato, H., Hosoda, N., & Maquat, L. (2008). Efficiency of the pioneer round of translation affects the cellular site of nonsense-mediated mRNA decay. Mol. Cell, 29, 255- 262. Saulière, J., Murigneux, V., Wang, Z., Marquenet, E., Barbosa, I., Le Tonquèze, O., . . . Le Hir, H. (2012). CLIP-seq of eIF4AIII reveals transcriptome-wide mapping of the human exon junction complex. Nat. Struct. Mol. Biol., 19, 1124-1131. Schwanhäusser, B., Busse, D., Li, N., Dittmar, G., Schuchhardt, J., Wolf, J., . . . Selbach, M. (2011). Global quantification of mammalian gene expression control. Nature, 473, 337-342. Shazman, S., & Mandel-Gutfreund, Y. (2008). Classifying RNA-Binding Proteins Based on Electrostatic Properties. PLoS Comput. Biol., 4(8): e1000146. Shepard, P. J., & Hertel, K. J. (2009). The SR protein family. Genome Biol., 10: 242. Shulman-Peleg, A., Shatsky, M., Nussinov, R., & Wolfson, H. J. (2008). Prediction of Interacting Single-Stranded RNA Bases by Protein-Binding Patterns. J. Mol. Biol., 379, 299-316. Singh, G., Kucukural, A., Cenik, C., Leszyk, J., Shaffer, S., Weng, Z., & and Moore. (2012). The Cellular EJC Interactome Reveals Higher-Order mRNP Structure and an EJC-SR Protein Nexus. Cell, 151, 750-764. Singh, G., Pratt, G., Yeo, G. W., & Moore, M. J. (2015). The Clothes Make the mRNA: Past and Present Trends in mRNP Fashion. Annual Review of Biochemistry, 84:1, 325-354. Singh, G., Ricci, E., & Moore, M. (2014). RIPiT-Seq: a high-throughput approach for footprinting RNA:protein complexes. Methods, 65, 320-332. Smith, K. (1969). Photochemical addition of amino acids to 14C-uracil. Biochem. Biophys. Res. Commun., 34, 354-357. Solomon, M., & Varshavsky, A. (1985). Formaldehyde-mediated DNA-protein crosslinking: a probe for in vivo chromatin structures. Proc. Natl. Acad. Sci. U. S. A., 82, 6470-6474. 
Stawiski, E., Gregoret, L., & Mandel-Gutfreund, Y. (2003). Annotating nucleic acid- binding function base don protein structure. J. Mol. Biol., 326, 1065-1079. Steffen, P., & Giegerich, R. (2005). Versatile and declarative dynamic programming using pair algebras. BMC Bioinformatics, 6:224. Sutherland, B., Toews, J., & Kast, J. (2008). Utility of formaldehyde cross-linking and mass spectrometry in the study of protein-protein interactions. J. Mass Spectrom., 43, 699-715. Szczepińska, T., Kalisiak, K., Tomecki, R., Labno, A., Borowski, L., Kulinski, T., . . . Dziembowski, A. (2015). DIS3 shapes the RNA polymerase II transcriptome in humans by degrading a variety of unwanted transcripts. Genome Res., 25, 1622- 1633.


Tacke, R., & Manley, J. (1995). The human splicing factors ASF/SF2 and SC35 possess distinct, functionally significant RNA binding specificities. EMBO J., 14, 3540- 3551. Tacke, R., Tohyama, M., Ogawa, S., & Manley, J. (1998). Human Tra2 proteins are sequence-specific activators of pre-mRNA splicing. Cell, 93, 139-148. Tange, T., Shibuya, T., Jurica, M., & Moore, M. (2005). Biochemical analysis of the EJC reveals two new factors and a stable tetrameric protein core. RNA, 11, 1869-1883. Tani, H., Mizutani, R., Salam, K., Tano, K., Ijiri, K., Wakamatsu, A., . . . Akimitsu, N. (2012). Genome-wide determination of RNA stability reveals hundres of short- lived noncoding transcripts in mammals. Genome Res., 22, 947-956. Thandapani, P., O'Connor, T., Bailey, T., & Richard, S. (2013). Defining the RGG/RG Motif. Molecular Cell Review, 50, 613-623. Toma, K. G., Rebbapragada, I., Durand, S., & Lykke-Andersen, J. (2015). Identification of elements in human long 3' UTRs that inhibit nonsense-mediated decay. RNA, 21(5), 887-897. Trapnell, C., Pachter, L., & Salzberg, S. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25, 1105-1111. Trcek, T., Sato, H., Singer, R., & Maquat, L. (2013). Temporal and spatial characterization of nonsense-mediated mRNA decay. Genes Dev., 27, 541-551. Trendel, J., Schwarzl, T., Horos, R., Prakash, A., Bateman, A., Hentze, M., & Krijgsveld, J. (2019). The Human RNA-Binding Proteome and Its Dynamics during Translational Arrest. Cell, 176, 391-403.e19. Urdaneta, E., Vieira-Vieira, C., Hick, T., Wessels, H.-H., Figini, D., Moschall, R., . . . Selbach, M. (2019). Purification of cross-linked RNA-protein complexes by phenol-toluol extraction. Nat. Commun., 10, 990. van Eeden, F., Palacios, I., Petronczki, M., Weston, M., & St Johnston, D. (2001). Barentsz is essential for the posterior localization of oskar mRNA and colocalizes with it to the posterior pole. J. Cell Biol., 154, 511-524. 
Viegas, M., Gehring, N., Breit, S., Hentze, M., & Kulozik, A. (2007). The abundance of RNPS1, a protein component of the exon junction complex, can determine the variability in efficiency of the Nonsense Mediated Decay pathway. Nucleic Acids Res., 35, 4542-4551. Vitting-Seerup, K., & Sandelin, A. (2019). IsoformSwitchAnalyzeR: analysis of changes in genome-wide patterns of alternative splicing and its functional consequences. Bioinformatics, 35(21), 4469-4471. Walter, P., Ibrahimi, I., & Blobel, G. (1981). Translocation of proteins across the endoplasmic reticulum. I. Signal recognition protein (SRP) binds to in-vitro-assembled polysomes synthesizing secretory protein. J. Cell Biol., 91, 545-550.


Wang, Z., Ballut, L., Barbosa, I., & Le Hir, H. (2018). Exon Junction Complexes can have distinct functional flavours to regulate specific splicing events. Sci. Rep., 8, 9509. Woodward, L., Gangras, P., & Singh, G. (2019). Identification of Footprints of RNA:Protein Complexes via RNA Immunoprecipitation in Tandem Followed by Sequencing (RIPiT-Seq). J. Vis. Exp. JoVE. Woodward, L., Mabin, J., Gangras, P., & Singh, G. (2017). The exon junction complex: a lifelong guardian of mRNA fate. Wiley Interdiscip. Rev. RNA, 8, e1411. Xu, X., & Chen, S. (2015). Physics-based RNA structure prediction. Biophysics Reports, 1, 2-13. Yamashita, A. (2013). Role of SMG-1-mediated Upf1 phosphorylation in mammalian nonsense-mediated mRNA decay. Genes Cells Devoted Mol. Cell. Mech., 18, 161-175. Yang, Y., Flynn, R., Chen, Y., Qu, K., Wan, B., Wang, K., . . . Chang, H. (2014). Essential role of lncRNA binding for WDR5 maintenance of active chromatin and embryonic stem cell pluripotency. Elife, 3, e02046. Yeo, G., & Burge, C. (2004). Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comput. Biol., 11, 377-394. yourgenome. (2020). Retrieved from yourgenome: https://www.yourgenome.org/facts/what-is-the-central-dogma Zhang, X., Yan, C., Hang, J., Finci, L., Lei, J., & Shi, Y. (2017). An Atomic Structure of the Human Spliceosome. Cell, 169, 918-929.e14. Zhang, Y., Wen, Z., Washburn, M., & Florens, L. (2010). Refinements to label free proteome quantitation: how to deal with peptides shared by multiple proteins. Anal. Chem., 82, 2272-2281. Zhang, Z., & Krainer, A. (2004). Involvement of the SR proteins in mRNA surveillance. Mol. Cell, 16, 597-607. Zhong, X.-Y., Ding, J.-H., Adams, J., Ghosh, G., & Fu, X.-D. (2009). Regulation of SR protein phosphorylation and alternative splicing by modulating kinetic interactions of SRPK1 with molecular chaperones. Genes Dev., 23, 482-495.
Zhou, Z., & Fu, X.-D. (2013). Regulation of splicing by SR proteins and SR protein-specific kinases. Chromosoma, 122, 191-207. Zhou, Z., Licklider, L., Gygi, S., & Reed, R. (2002). Comprehensive proteomic analysis of the human spliceosome. Nature, 419, 182-185.
